feat(dpo): implement Adaptive Beta-DPO (arXiv:2407.08639)#6123
mukund1985 wants to merge 1 commit into
Adds per-batch adaptive β scaling to DPOTrainer.

New DPOConfig fields:
- adaptive_beta (str | None): set to 'beta-dpo' to enable
- beta_alpha (float | None): α scaling factor
- beta_reference_margin (float | None): fixed M₀ (None → EMA)

Algorithm (β-DPO, arXiv:2407.08639):

    M_batch = mean(chosen_logratios − rejected_logratios)
    M₀ = 0.9 * M₀ + 0.1 * M_batch    (EMA, updated each batch)
    effective_β = max([1 + α(M_batch − M₀)] × β₀, 1e-6)

effective_β replaces self.beta in all loss branches, so every loss_type (sigmoid, IPO, SPPO, robust, …) benefits automatically. A 'beta/effective' metric is logged when adaptive_beta is set.

Closes huggingface#5211
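The update described above can be sketched in plain Python. This is a minimal illustration, not the PR's code: the function name `adaptive_beta_step` and the choice to seed M₀ with the first batch margin are assumptions, while the 0.9/0.1 EMA weights, the scaling formula, and the 1e-6 floor come from the description.

```python
from typing import Optional, Tuple


def adaptive_beta_step(
    batch_margin: float,
    running_margin: Optional[float],
    beta0: float,
    alpha: float,
    floor: float = 1e-6,
) -> Tuple[float, float]:
    """Return (effective_beta, updated_running_margin) for one batch."""
    if running_margin is None:
        # Assumption: seed M0 with the first observed batch margin.
        running_margin = batch_margin
    else:
        # EMA update with the 0.9 / 0.1 weights from the description.
        running_margin = 0.9 * running_margin + 0.1 * batch_margin
    # Scale beta0 by the deviation of the batch margin from M0, then clamp.
    effective_beta = max((1.0 + alpha * (batch_margin - running_margin)) * beta0, floor)
    return effective_beta, running_margin


# Two consecutive batches with hypothetical margins 2.0 and 4.0.
beta, m0 = adaptive_beta_step(batch_margin=2.0, running_margin=None, beta0=0.1, alpha=0.5)
beta2, m0 = adaptive_beta_step(batch_margin=4.0, running_margin=m0, beta0=0.1, alpha=0.5)
```

Note that the effective β is computed against the already-updated M₀, matching the order given in the description.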
Cursor Bugbot has reviewed your changes using default effort and found 3 potential issues.
Reviewed by Cursor Bugbot for commit dca3573. Configure here.
    else:
        self._running_margin = 0.9 * self._running_margin + 0.1 * batch_margin
    effective_beta = max(
        (1.0 + self.args.beta_alpha * (batch_margin - self._running_margin)) * self.beta,
Fixed reference margin overwritten
Medium Severity
When beta_reference_margin is set, docs say M₀ stays fixed, but _running_margin is initialized from that value and still updated each step with the 0.9/0.1 EMA. Adaptive β then compares against a drifting reference instead of the configured fixed margin.
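One way to honor a fixed M₀, sketched under assumptions, is to skip the EMA whenever a fixed reference margin is configured. The helper name `next_reference_margin` and its arguments are hypothetical and not part of the PR.

```python
from typing import Optional


def next_reference_margin(
    batch_margin: float,
    running_margin: Optional[float],
    fixed_margin: Optional[float],
    momentum: float = 0.9,
) -> float:
    """Return the reference margin M0 to use for the current batch."""
    if fixed_margin is not None:
        # A configured beta_reference_margin is never updated.
        return fixed_margin
    if running_margin is None:
        # Assumption: seed the EMA with the first observed batch margin.
        return batch_margin
    # EMA path: only taken when no fixed margin was configured.
    return momentum * running_margin + (1.0 - momentum) * batch_margin
```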
    self._metrics[mode]["logps/chosen"].append(self.accelerator.gather(chosen_logps).mean().item())
    self._metrics[mode]["logps/rejected"].append(self.accelerator.gather(rejected_logps).mean().item())
    if self.args.adaptive_beta is not None:
        self._metrics[mode]["beta/effective"].append(effective_beta)
Liger path ignores adaptive beta
High Severity
Adaptive β is computed only in _compute_loss, but compute_loss routes use_liger_kernel=True to _compute_loss_liger, which uses LigerFusedLinearDPOLoss with static args.beta. Enabling adaptive_beta with the Liger kernel has no effect on the actual loss.
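Until the fused path supports a per-batch β, one possible stopgap is to reject the combination at startup. This is a hypothetical guard, not something the PR contains: the function name and message are made up, and only `adaptive_beta` and `use_liger_kernel` are taken from the discussion.

```python
from typing import Optional


def check_liger_compat(adaptive_beta: Optional[str], use_liger_kernel: bool) -> None:
    """Raise if adaptive beta is requested together with the Liger loss path."""
    if adaptive_beta is not None and use_liger_kernel:
        # Fail loudly instead of silently training with a static beta.
        raise ValueError(
            "adaptive_beta is not applied on the Liger kernel path; "
            "disable use_liger_kernel or unset adaptive_beta."
        )
```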
    effective_beta = max(
        (1.0 + self.args.beta_alpha * (batch_margin - self._running_margin)) * self.beta,
        1e-6,
    )
Per-rank margin breaks multi-GPU
High Severity
batch_margin and _running_margin are updated from each process’s local micro-batch only, with no accelerator gather or sync. Under DDP, ranks diverge on effective_beta and apply different β scalings to the same global step.
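The divergence can be illustrated by simulating two ranks in plain Python. Each rank's local mean yields a different β, while a mean over the gathered margins gives every rank the same value. The numbers and helper name are illustrative; in the trainer the gather would presumably go through `self.accelerator.gather`, as the existing metrics code does.

```python
from statistics import mean
from typing import List


def effective_beta(batch_margin: float, reference_margin: float, beta0: float, alpha: float) -> float:
    """Scaling rule from the PR description, with the 1e-6 floor."""
    return max((1.0 + alpha * (batch_margin - reference_margin)) * beta0, 1e-6)


# Hypothetical per-rank margins (chosen minus rejected log-ratios).
rank_margins: List[List[float]] = [[1.0, 3.0], [5.0, 7.0]]
reference_margin, beta0, alpha = 3.0, 0.1, 0.5

# Unsynchronized: each rank uses only its local micro-batch.
local_betas = [effective_beta(mean(m), reference_margin, beta0, alpha) for m in rank_margins]

# Synchronized: every rank averages over the gathered margins of all ranks.
gathered = [x for m in rank_margins for x in m]
global_beta = effective_beta(mean(gathered), reference_margin, beta0, alpha)
```

Here `local_betas` holds two different values, one per simulated rank, whereas `global_beta` is a single value that all ranks would share.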


Problem
A fixed `β` in DPO doesn't adapt to how well-separated chosen/rejected responses are in the current batch. The β-DPO paper (arXiv:2407.08639) shows that per-batch adaptive β improves alignment stability and final policy quality.

Closes #5211.
Solution
Implements the β-DPO algorithm orthogonally to `loss_type`: every loss type (sigmoid, IPO, SPPO, robust, …) benefits automatically.

Algorithm:
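Restated from the commit message, with $M_b$ for the batch margin, $M_0$ for the reference margin, and $\beta_0$ for the configured β. The symbols $h_w$ and $h_l$ for the chosen and rejected log-ratios are shorthand introduced here, not notation from the PR.

```latex
\begin{aligned}
M_b &= \operatorname{mean}(h_w - h_l) \\
M_0 &\leftarrow 0.9\,M_0 + 0.1\,M_b \\
\beta_{\text{eff}} &= \max\bigl([1 + \alpha\,(M_b - M_0)]\,\beta_0,\; 10^{-6}\bigr)
\end{aligned}
```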
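New config fields

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `adaptive_beta` | `str \| None` | `None` | `"beta-dpo"` to enable |
| `beta_alpha` | `float \| None` | `None` | |
| `beta_reference_margin` | `float \| None` | `None` | `None` = use EMA |

Usage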
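A hedged example of enabling the feature on this PR's branch. The values `beta=0.1` and `beta_alpha=0.5` are placeholders rather than recommendations from the PR, and the `DPOConfig` call is left as a comment because the new fields exist only on this branch.

```python
# Keyword arguments for DPOConfig; the three adaptive fields are the ones this PR adds.
adaptive_kwargs = {
    "beta": 0.1,                    # base beta (placeholder value)
    "adaptive_beta": "beta-dpo",    # enables the adaptive scaling
    "beta_alpha": 0.5,              # alpha scaling factor (placeholder value)
    "beta_reference_margin": None,  # None means M0 is tracked with an EMA
}

# On this branch, the config would then be built roughly as:
#   from trl import DPOConfig
#   config = DPOConfig(output_dir="dpo-adaptive", **adaptive_kwargs)
```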
A `beta/effective` metric is logged at each training step when `adaptive_beta` is set.

Before submitting
AI writing disclosure
Note
Medium Risk
When enabled, it changes how strongly the policy is regularized each step across every DPO loss variant, which can materially affect training dynamics; default behavior is unchanged and the Liger kernel path does not use adaptive β.
Overview
Adds optional β-DPO adaptive scaling to `DPOTrainer` via new `DPOConfig` fields: `adaptive_beta` (`"beta-dpo"`), required `beta_alpha`, and optional fixed `beta_reference_margin` (otherwise M₀ is a 0.9-momentum EMA of batch margins).

During training only, each batch computes `effective_beta` from the mean chosen−rejected log-ratio margin vs. M₀, then uses that value everywhere `beta` previously scaled the loss (all supported `loss_type` variants) and reward metrics. Logs `beta/effective` when adaptive β is enabled.

Startup validation rejects unknown `adaptive_beta` values and missing `beta_alpha`. The Liger fused loss path is unchanged (still fixed `args.beta`). Also adds `.contiguous()` on `shift_completion_mask` in the Liger loss helper.
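The startup checks described in this summary might look roughly like the following. It is a sketch rather than the PR's code: the function name, the constant, and the error messages are assumptions.

```python
from typing import Optional

# Only one mode is mentioned in the PR description.
SUPPORTED_ADAPTIVE_BETA = {"beta-dpo"}


def validate_adaptive_beta(adaptive_beta: Optional[str], beta_alpha: Optional[float]) -> None:
    """Reject unknown adaptive_beta values and a missing beta_alpha."""
    if adaptive_beta is None:
        return  # feature disabled, nothing to validate
    if adaptive_beta not in SUPPORTED_ADAPTIVE_BETA:
        raise ValueError(f"Unknown adaptive_beta value: {adaptive_beta!r}")
    if beta_alpha is None:
        raise ValueError("beta_alpha must be set when adaptive_beta is enabled")
```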