feat(dpo): implement Adaptive Beta-DPO (arXiv:2407.08639) by mukund1985 · Pull Request #6123 · huggingface/trl

mukund1985 · 2026-06-19T20:01:11Z

Problem

A fixed β in DPO doesn't adapt to how well-separated chosen/rejected responses are in the current batch. The β-DPO paper (arXiv:2407.08639) shows that per-batch adaptive β improves alignment stability and final policy quality.

Closes #5211.

Solution

Implements the β-DPO algorithm independently of loss_type: the adaptive β replaces the fixed β in every loss branch (sigmoid, IPO, SPPO, robust, …), so no per-loss changes are needed.

Algorithm:

M_batch = mean(chosen_logratios − rejected_logratios)   # current batch margin
M₀      = 0.9·M₀ + 0.1·M_batch                         # EMA reference (updated each step)
effective_β = max([1 + α(M_batch − M₀)] · β₀, 1e-6)    # clipped to stay positive
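The three lines above can be sketched as a single pure-Python step. This is a minimal illustration and not the PR's code: the function name `adaptive_beta_step` and its parameter names are hypothetical, and the EMA is updated before the comparison because that is the order in which the steps are listed.

```python
def adaptive_beta_step(batch_margin, running_margin, beta0, alpha,
                       momentum=0.9, floor=1e-6):
    """Return (effective_beta, new_running_margin) for one batch.

    Hypothetical helper: names and signature are illustrative only.
    """
    # EMA update of the reference margin M0 (0.9 * old + 0.1 * batch).
    new_running = momentum * running_margin + (1.0 - momentum) * batch_margin
    # Scale the base beta by how far the batch margin sits from the reference.
    scale = 1.0 + alpha * (batch_margin - new_running)
    # Clip so that beta never becomes zero or negative.
    effective_beta = max(scale * beta0, floor)
    return effective_beta, new_running
```

For example, with M₀ = 0, M_batch = 1, α = 0.5 and β₀ = 0.1, the EMA moves to 0.1 and the effective β is (1 + 0.5 · 0.9) · 0.1 = 0.145. A strongly negative batch margin drives the scale below zero, and the result is clipped to 1e-6.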

New config fields

| Field | Type | Default | Description |
|---|---|---|---|
| `adaptive_beta` | `str \| None` | `None` | Set to `"beta-dpo"` to enable |
| `beta_alpha` | `float \| None` | `None` | Scaling factor α (required when enabled) |
| `beta_reference_margin` | `float \| None` | `None` | Fixed M₀; `None` = use EMA |

Usage

from trl import DPOConfig, DPOTrainer

config = DPOConfig(
    adaptive_beta="beta-dpo",
    beta_alpha=0.5,
    beta=0.1,  # β₀ base value
)

A beta/effective metric is logged at each training step when adaptive_beta is set.

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline, Pull Request section?
Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case.
Did you make sure to update the documentation with your changes?
Did you write any new necessary tests?

AI writing disclosure

No AI usage: the PR was written entirely by a human.
AI-assisted: some parts were suggested or improved by AI, but the PR was written and reviewed by a human.
AI-generated: the PR was mostly or fully generated by an AI tool.

Note

Medium Risk
When enabled, it changes how strongly the policy is regularized each step across every DPO loss variant, which can materially affect training dynamics; default behavior is unchanged and the Liger kernel path does not use adaptive β.

Overview
Adds optional β-DPO adaptive scaling to DPOTrainer via new DPOConfig fields: adaptive_beta ("beta-dpo"), required beta_alpha, and optional fixed beta_reference_margin (otherwise M₀ is a 0.9-momentum EMA of batch margins).

During training only, each batch computes effective_beta from the mean chosen−rejected log-ratio margin vs. M₀, then uses that value everywhere beta previously scaled the loss (all supported loss_type variants) and reward metrics. Logs beta/effective when adaptive β is enabled.

Startup validation rejects unknown adaptive_beta values and missing beta_alpha. The Liger fused loss path is unchanged (still fixed args.beta). Also adds .contiguous() on shift_completion_mask in the Liger loss helper.
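The startup validation described above might look roughly like the following. This is a sketch under assumptions: `validate_adaptive_beta` and the `SUPPORTED_ADAPTIVE_BETA` set are hypothetical names, and the error messages are not taken from the PR.

```python
# Hypothetical set of accepted values; the PR describes only "beta-dpo".
SUPPORTED_ADAPTIVE_BETA = {"beta-dpo"}


def validate_adaptive_beta(adaptive_beta, beta_alpha):
    """Raise ValueError for an unknown mode or a missing alpha."""
    if adaptive_beta is None:
        return  # feature disabled, nothing to check
    if adaptive_beta not in SUPPORTED_ADAPTIVE_BETA:
        raise ValueError(
            f"Unknown adaptive_beta={adaptive_beta!r}; "
            f"expected one of {sorted(SUPPORTED_ADAPTIVE_BETA)} or None."
        )
    if beta_alpha is None:
        raise ValueError("beta_alpha is required when adaptive_beta is set.")
```

Failing at construction time rather than at the first training step keeps a misconfigured run from starting at all.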

Reviewed by Cursor Bugbot for commit dca3573. Bugbot is set up for automated code reviews on this repo.

Adds per-batch adaptive β scaling to DPOTrainer.

New DPOConfig fields:
- adaptive_beta: str | None (set to 'beta-dpo' to enable)
- beta_alpha: float | None (α scaling factor)
- beta_reference_margin: float | None (fixed M₀; None → EMA)

Algorithm (β-DPO, arXiv:2407.08639):
M_batch = mean(chosen_logratios − rejected_logratios)
M₀ = 0.9 * M₀ + 0.1 * M_batch (EMA, updated each batch)
effective_β = max([1 + α(M_batch − M₀)] × β₀, 1e-6)

effective_β replaces self.beta in all loss branches, so every loss_type (sigmoid, IPO, SPPO, robust, …) benefits automatically. A 'beta/effective' metric is logged when adaptive_beta is set.

Closes huggingface#5211

cursor

Cursor Bugbot has reviewed your changes using default effort and found 3 potential issues.


cursor · 2026-06-19T20:02:36Z

+            else:
+                self._running_margin = 0.9 * self._running_margin + 0.1 * batch_margin
+            effective_beta = max(
+                (1.0 + self.args.beta_alpha * (batch_margin - self._running_margin)) * self.beta,


Fixed reference margin overwritten

Medium Severity

When beta_reference_margin is set, docs say M₀ stays fixed, but _running_margin is initialized from that value and still updated each step with the 0.9/0.1 EMA. Adaptive β then compares against a drifting reference instead of the configured fixed margin.
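One way to address this finding is to skip the EMA update whenever a fixed margin is configured. The sketch below is an assumption about how that could look, not the PR's fix: `ReferenceMargin` is a hypothetical class, and seeding the EMA from the first batch margin is also an assumption, since the initialization branch is not shown in the diff.

```python
class ReferenceMargin:
    """Tracks M0 either as a fixed constant or as an EMA of batch margins."""

    def __init__(self, fixed_margin=None, momentum=0.9):
        self.fixed_margin = fixed_margin
        self.momentum = momentum
        self.running = None  # EMA state, unused when fixed_margin is set

    def update(self, batch_margin):
        # A configured fixed margin is returned as-is and never overwritten.
        if self.fixed_margin is not None:
            return self.fixed_margin
        if self.running is None:
            # Assumption: seed the EMA with the first observed margin.
            self.running = batch_margin
        else:
            self.running = (
                self.momentum * self.running
                + (1.0 - self.momentum) * batch_margin
            )
        return self.running
```

With `ReferenceMargin(fixed_margin=2.0)`, every call to `update` returns 2.0 regardless of the batch margin, which matches the documented meaning of `beta_reference_margin`.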


cursor · 2026-06-19T20:02:36Z

        self._metrics[mode]["logps/chosen"].append(self.accelerator.gather(chosen_logps).mean().item())
        self._metrics[mode]["logps/rejected"].append(self.accelerator.gather(rejected_logps).mean().item())
+        if self.args.adaptive_beta is not None:
+            self._metrics[mode]["beta/effective"].append(effective_beta)


Liger path ignores adaptive beta

High Severity

Adaptive β is computed only in _compute_loss, but compute_loss routes use_liger_kernel=True to _compute_loss_liger, which uses LigerFusedLinearDPOLoss with static args.beta. Enabling adaptive_beta with the Liger kernel has no effect on the actual loss.

Additional Locations (1)

trl/trainer/dpo_trainer.py#L750-L751
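Until the Liger path supports a per-batch β, one conservative option is to reject the combination up front so the setting cannot be silently ignored. This is only a sketch of that idea: `check_liger_compatibility` is a hypothetical helper, and the PR as reviewed leaves the Liger path unchanged rather than adding such a check.

```python
def check_liger_compatibility(adaptive_beta, use_liger_kernel):
    """Refuse a configuration in which adaptive beta would have no effect."""
    if adaptive_beta is not None and use_liger_kernel:
        raise ValueError(
            "adaptive_beta is not applied on the Liger kernel path, which "
            "uses the static beta; disable one of the two options."
        )
```

Emitting a warning instead of raising would be a softer alternative; either way the user learns that the logged configuration and the actual loss disagree.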


cursor · 2026-06-19T20:02:36Z

+            effective_beta = max(
+                (1.0 + self.args.beta_alpha * (batch_margin - self._running_margin)) * self.beta,
+                1e-6,
+            )


Per-rank margin breaks multi-GPU

High Severity

batch_margin and _running_margin are updated from each process’s local micro-batch only, with no accelerator gather or sync. Under DDP, ranks diverge on effective_beta and apply different β scalings to the same global step.
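The divergence can be illustrated without any distributed setup by treating each list entry as one rank's local margin. This is a simulation under assumptions: both function names are hypothetical, equal per-rank batch sizes are assumed so that the mean of local means equals the global mean, and the plain average stands in for a gather-then-mean such as the `self.accelerator.gather(...).mean()` pattern the trainer already uses for its metrics.

```python
def per_rank_betas(rank_margins, reference, beta0, alpha, floor=1e-6):
    """Effective beta on each rank when every rank uses only its own margin."""
    return [
        max((1.0 + alpha * (m - reference)) * beta0, floor)
        for m in rank_margins
    ]


def synced_beta(rank_margins, reference, beta0, alpha, floor=1e-6):
    """Effective beta when ranks first average their margins."""
    # Stand-in for gathering the margins across processes and taking the mean.
    global_margin = sum(rank_margins) / len(rank_margins)
    return max((1.0 + alpha * (global_margin - reference)) * beta0, floor)
```

With local margins 0.2 and 1.0, a reference of 0.5, α = 0.5 and β₀ = 0.1, the unsynchronized ranks end up with β = 0.085 and β = 0.125, whereas the synchronized value is 0.105 on every rank.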


cursor Bot reviewed Jun 19, 2026

mukund1985 wants to merge 1 commit into huggingface:main from mukund1985:feat/adaptive-beta-dpo
