
feat(dpo): implement Adaptive Beta-DPO (arXiv:2407.08639)#6123

Open
mukund1985 wants to merge 1 commit into huggingface:main from mukund1985:feat/adaptive-beta-dpo

Conversation

mukund1985 commented Jun 19, 2026


Problem

A fixed β in DPO does not adapt to how well separated the chosen and rejected responses are in the current batch. The β-DPO paper (arXiv:2407.08639) reports that a per-batch adaptive β improves alignment stability and final policy quality.

Closes #5211.

Solution

Implements the β-DPO algorithm orthogonally to loss_type: effective_β replaces self.beta in every loss branch, so it applies to all loss types (sigmoid, IPO, SPPO, robust, …).

Algorithm:

M_batch = mean(chosen_logratios − rejected_logratios)   # current batch margin
M₀      = 0.9·M₀ + 0.1·M_batch                         # EMA reference (updated each step)
effective_β = max([1 + α(M_batch − M₀)] · β₀, 1e-6)    # clipped to stay positive
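The three lines above can be sketched in plain Python as follows. This is a minimal illustration, not the code in this PR: the function name `adaptive_beta_step`, its arguments, and the use of Python lists in place of tensors are all assumptions made for the example.

```python
def adaptive_beta_step(chosen_logratios, rejected_logratios, running_margin,
                       beta0, alpha, momentum=0.9, floor=1e-6):
    """Return (effective_beta, new_running_margin) for one batch.

    Hypothetical helper: lists of floats stand in for per-example tensors.
    """
    margins = [c - r for c, r in zip(chosen_logratios, rejected_logratios)]
    batch_margin = sum(margins) / len(margins)
    # The EMA is updated first, then used as the reference, in the order listed above.
    new_running = momentum * running_margin + (1.0 - momentum) * batch_margin
    # Scale the base beta and clip so the result stays strictly positive.
    effective_beta = max((1.0 + alpha * (batch_margin - new_running)) * beta0, floor)
    return effective_beta, new_running


beta, m0 = adaptive_beta_step([1.0, 3.0], [0.0, 0.0],
                              running_margin=0.0, beta0=0.1, alpha=0.5)
```

Because the EMA is updated before the comparison, M_batch − M₀ equals 0.9·(M_batch − previous M₀), so the adjustment is slightly damped relative to comparing against the previous reference.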

New config fields

Field                  Type          Default  Description
adaptive_beta          str | None    None     Set to "beta-dpo" to enable
beta_alpha             float | None  None     Scaling factor α (required when enabled)
beta_reference_margin  float | None  None     Fixed M₀; None = use EMA
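A rough sketch of how the fields above might be validated at startup. `AdaptiveBetaArgs` is a hypothetical stand-in dataclass used only for illustration; it is not the real DPOConfig, and the error messages are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AdaptiveBetaArgs:  # hypothetical stand-in, not the real DPOConfig
    beta: float = 0.1
    adaptive_beta: Optional[str] = None
    beta_alpha: Optional[float] = None
    beta_reference_margin: Optional[float] = None

    def __post_init__(self):
        if self.adaptive_beta is None:
            return  # feature disabled: nothing to validate
        if self.adaptive_beta != "beta-dpo":
            raise ValueError(f"Unknown adaptive_beta: {self.adaptive_beta!r}")
        if self.beta_alpha is None:
            raise ValueError("beta_alpha is required when adaptive_beta is set")


args = AdaptiveBetaArgs(adaptive_beta="beta-dpo", beta_alpha=0.5)
```

With the defaults (adaptive_beta=None) no check runs, so existing configurations are unaffected.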

Usage

from trl import DPOConfig, DPOTrainer

config = DPOConfig(
    adaptive_beta="beta-dpo",
    beta_alpha=0.5,
    beta=0.1,  # β₀ base value
)

A beta/effective metric is logged at each training step when adaptive_beta is set.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

AI writing disclosure

  • No AI usage: the PR was written entirely by a human.
  • AI-assisted: some parts were suggested or improved by AI, but the PR was written and reviewed by a human.
  • AI-generated: the PR was mostly or fully generated by an AI tool.

Note

Medium Risk
When enabled, it changes how strongly the policy is regularized each step across every DPO loss variant, which can materially affect training dynamics; default behavior is unchanged and the Liger kernel path does not use adaptive β.

Overview
Adds optional β-DPO adaptive scaling to DPOTrainer via new DPOConfig fields: adaptive_beta ("beta-dpo"), required beta_alpha, and optional fixed beta_reference_margin (otherwise M₀ is a 0.9-momentum EMA of batch margins).

During training only, each batch computes effective_beta from the mean chosen−rejected log-ratio margin vs. M₀, then uses that value everywhere beta previously scaled the loss (all supported loss_type variants) and reward metrics. Logs beta/effective when adaptive β is enabled.

Startup validation rejects unknown adaptive_beta values and missing beta_alpha. The Liger fused loss path is unchanged (still fixed args.beta). Also adds .contiguous() on shift_completion_mask in the Liger loss helper.

Reviewed by Cursor Bugbot for commit dca3573. Bugbot is set up for automated code reviews on this repo. Configure here.

Adds per-batch adaptive β scaling to DPOTrainer.

New DPOConfig fields:
  adaptive_beta: str | None  — set to 'beta-dpo' to enable
  beta_alpha: float | None   — α scaling factor
  beta_reference_margin: float | None — fixed M₀ (None → EMA)

Algorithm (β-DPO, arXiv:2407.08639):
  M_batch = mean(chosen_logratios − rejected_logratios)
  M₀ = 0.9 * M₀ + 0.1 * M_batch  (EMA, updated each batch)
  effective_β = max([1 + α(M_batch − M₀)] × β₀, 1e-6)

effective_β replaces self.beta in all loss branches so every
loss_type (sigmoid, IPO, SPPO, robust, …) benefits automatically.
A 'beta/effective' metric is logged when adaptive_beta is set.

Closes huggingface#5211

cursor (bot) left a comment


Cursor Bugbot has reviewed your changes using default effort and found 3 potential issues.


Reviewed by Cursor Bugbot for commit dca3573. Configure here.

else:
    self._running_margin = 0.9 * self._running_margin + 0.1 * batch_margin
effective_beta = max(
    (1.0 + self.args.beta_alpha * (batch_margin - self._running_margin)) * self.beta,


Fixed reference margin overwritten

Medium Severity

When beta_reference_margin is set, docs say M₀ stays fixed, but _running_margin is initialized from that value and still updated each step with the 0.9/0.1 EMA. Adaptive β then compares against a drifting reference instead of the configured fixed margin.
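One way to keep a configured margin fixed is to skip the EMA update whenever a fixed value was supplied. The sketch below is illustrative only: `MarginReference` is a hypothetical helper, and starting the EMA at 0.0 is an assumption, not something taken from this PR.

```python
class MarginReference:
    """Hypothetical holder for M0: fixed if configured, otherwise an EMA."""

    def __init__(self, fixed_margin=None, momentum=0.9):
        self.fixed = fixed_margin is not None
        # Assumption: the EMA starts at 0.0 when no fixed margin is given.
        self.value = fixed_margin if self.fixed else 0.0
        self.momentum = momentum

    def update(self, batch_margin):
        # Only the EMA variant moves; a configured margin is never overwritten.
        if not self.fixed:
            self.value = self.momentum * self.value + (1.0 - self.momentum) * batch_margin
        return self.value


fixed_ref = MarginReference(fixed_margin=1.5)
ema_ref = MarginReference()
for margin in (2.0, 4.0):
    fixed_ref.update(margin)
    ema_ref.update(margin)
```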


self._metrics[mode]["logps/chosen"].append(self.accelerator.gather(chosen_logps).mean().item())
self._metrics[mode]["logps/rejected"].append(self.accelerator.gather(rejected_logps).mean().item())
if self.args.adaptive_beta is not None:
    self._metrics[mode]["beta/effective"].append(effective_beta)


Liger path ignores adaptive beta

High Severity

Adaptive β is computed only in _compute_loss, but compute_loss routes use_liger_kernel=True to _compute_loss_liger, which uses LigerFusedLinearDPOLoss with static args.beta. Enabling adaptive_beta with the Liger kernel has no effect on the actual loss.
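If threading the adaptive value into the fused loss is out of scope, one option is to fail fast when both settings are enabled, so the mismatch is not silent. A minimal sketch, where `check_liger_compat` is a hypothetical helper and the message text is an assumption:

```python
def check_liger_compat(adaptive_beta, use_liger_kernel):
    """Hypothetical startup guard: reject a combination that would be silently ignored."""
    if adaptive_beta is not None and use_liger_kernel:
        raise ValueError(
            "adaptive_beta is not applied on the Liger kernel path; "
            "disable use_liger_kernel or unset adaptive_beta."
        )


check_liger_compat(adaptive_beta=None, use_liger_kernel=True)         # fine: feature off
check_liger_compat(adaptive_beta="beta-dpo", use_liger_kernel=False)  # fine: standard path
```

A warning instead of an error would be a softer alternative; which is preferable depends on how the maintainers want unsupported combinations handled.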


effective_beta = max(
    (1.0 + self.args.beta_alpha * (batch_margin - self._running_margin)) * self.beta,
    1e-6,
)


Per-rank margin breaks multi-GPU

High Severity

batch_margin and _running_margin are updated from each process’s local micro-batch only, with no accelerator gather or sync. Under DDP, ranks diverge on effective_beta and apply different β scalings to the same global step.
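The divergence can be illustrated with a small single-process simulation, where nested lists stand in for the per-rank margin tensors. The helper names are hypothetical and the reference margin is held at 0.0 for simplicity; in the trainer, the averaging step would presumably be a gather followed by a mean, as the existing logps metrics already do with self.accelerator.gather(...).mean().

```python
def effective_beta(batch_margin, reference, beta0, alpha, floor=1e-6):
    # Same scaling rule as in the PR description, written as a plain function.
    return max((1.0 + alpha * (batch_margin - reference)) * beta0, floor)


def global_margin(per_rank_margins):
    # Stand-in for gathering per-example margins from all ranks and averaging them.
    flat = [m for rank in per_rank_margins for m in rank]
    return sum(flat) / len(flat)


per_rank = [[0.5, 1.5], [3.0, 5.0]]  # two simulated ranks, two examples each

# Without sync: each rank derives beta from its own local mean and they disagree.
local_betas = [effective_beta(sum(r) / len(r), 0.0, beta0=0.1, alpha=0.5) for r in per_rank]

# With sync: every rank uses the same global mean and therefore the same beta.
shared_beta = effective_beta(global_margin(per_rank), 0.0, beta0=0.1, alpha=0.5)
```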


