Warn when sequence-level importance sampling is combined with a token-summed loss type by discobot · Pull Request #6042 · huggingface/trl

discobot · 2026-06-13T11:15:48Z

What does this PR do?

Combining importance_sampling_level="sequence" (the GSPO setup) with a loss type that
sums per-token contributions (bnpo, dr_grpo, cispo, or the default dapo) silently
length-weights each sequence's loss instead of applying the per-sequence objective GSPO
prescribes.

This PR adds the warning suggested in
#3823 (comment), placed right
after the existing luspo/vespo mismatch checks in GRPOTrainer.__init__ and
mirroring their style. It points users to loss_type="grpo", matching the GSPO recipe in
the paper index. The experimental PAPOTrainer reuses GRPOTrainer.__init__, so it
inherits the check.

One thing that turned up while testing: the existing
test_train_sequence_importance_sampling itself uses the footgun combo (sequence-level
importance sampling with the default dapo), so the new warning now fires during its
setup. I left it unchanged to keep the diff minimal — happy to switch it to
loss_type="grpo" if preferred.

Tests: a parametrized test asserting the warning fires for all four affected loss types
(fails on main without the fix), and a negative test asserting the recommended GSPO
combo stays silent. Also ran the other warning tests and test_train_loss_types locally
(CPU).

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline, Pull Request section?
Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case. (Loss not calculated correctly when using GSPO with default loss type bnpo (importance_sampling_level == "sequence") #3823)
Did you make sure to update the documentation with your changes?
Did you write any new necessary tests?

AI writing disclosure

We welcome the use of AI tools to help with contributions. For transparency and to help us improve our review process, please indicate the level of AI involvement in this PR.

No AI usage: the PR was written entirely by a human.
AI-assisted: some parts were suggested or improved by AI, but the PR was written and reviewed by a human.
AI-generated: the PR was mostly or fully generated by an AI tool.

The fix and tests were developed with Claude Code and verified by running the test suite locally.

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.

Note

Low Risk
Init-time logging only; training math and defaults are unchanged aside from surfacing a previously silent config pitfall.

Overview
Adds a startup warning in GRPOTrainer.__init__ when importance_sampling_level="sequence" is used with loss_type in bnpo, dr_grpo, dapo, or cispo. Those losses sum per-token terms, so with sequence-level importance sampling the objective effectively length-weights completions instead of matching the GSPO per-sequence setup. The message recommends loss_type="grpo" and links to the GSPO paper index, consistent with existing luspo / vespo config mismatch warnings.

Tests cover the four loss types (warning present) and sequence + grpo (no warning). PAPOTrainer inherits the check via GRPOTrainer.__init__.

^{Reviewed by Cursor Bugbot for commit 6e1a500. Bugbot is set up for automated code reviews on this repo. Configure here.}

…-summed loss type With importance_sampling_level='sequence' the per-token loss is (B, 1), and loss types that sum per-token contributions (bnpo, dr_grpo, dapo, cispo) broadcast it against the (B, T) completion mask, effectively weighting each sequence by its completion length instead of optimizing the per-sequence objective GSPO prescribes. The recommended combo (loss_type='grpo') was only documented in the paper index, so the default config triggered this silently. Add a warning in GRPOTrainer.__init__ mirroring the existing luspo/vespo mismatch checks, plus regression tests for the warning and the recommended setup.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6e1a5005f1

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-13T11:20:22Z

                "set to `'token'` (the default)."
            )

+        if args.importance_sampling_level == "sequence" and args.loss_type in ["bnpo", "dr_grpo", "dapo", "cispo"]:


Propagate the sequence-level warning to SDPO

This only adds the warning to GRPOTrainer, but I checked the copied SDPO policy-loss path in trl/experimental/sdpo/sdpo_trainer.py:1160-1176: it also supports importance_sampling_level='sequence' and then token-summed bnpo, dr_grpo, and dapo reductions. With SDPOConfig(importance_sampling_level='sequence', loss_type='dapo'), users still silently get length-weighted sequence losses, so the duplicated trainer logic remains inconsistent with the repository requirement that these patterns stay aligned.

Useful? React with 👍 / 👎.

discobot mentioned this pull request Jun 13, 2026

Loss not calculated correctly when using GSPO with default loss type bnpo (importance_sampling_level == "sequence") #3823

Open

5 tasks

chatgpt-codex-connector Bot reviewed Jun 13, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Warn when sequence-level importance sampling is combined with a token-summed loss type#6042

Warn when sequence-level importance sampling is combined with a token-summed loss type#6042
discobot wants to merge 1 commit into
huggingface:mainfrom
discobot:fix/3823-gspo-loss-type-warning

discobot commented Jun 13, 2026 •

edited by cursor Bot

Loading

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot Jun 13, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

discobot commented Jun 13, 2026 • edited by cursor Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What does this PR do?

Before submitting

AI writing disclosure

Who can review?

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jun 13, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

discobot commented Jun 13, 2026 •

edited by cursor Bot

Loading