fix: preserve OnlineDPO vLLM completion ids by he-yufeng · Pull Request #6038 · huggingface/trl

he-yufeng · 2026-06-13T04:38:37Z

What does this PR do?

trl vllm-serve already returns completion_ids as one token-id list per generated completion. OnlineDPOTrainer._generate_vllm_server() was treating that flat list as if it were grouped by prompt, so it split each token sequence into single-token completions.

This PR keeps each returned completion token sequence intact, then reorders the server output into the generation-major order that Online DPO expects for reward splitting. It also keeps the local prompt_ids in the same order.

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline, Pull Request section?
Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case. OnlineDPOTrainer._generate_vllm_server() flattens vllm-serve completion_ids twice #5514
Did you make sure to update the documentation with your changes? Not needed; this is a behavior fix covered by a regression test.
Did you write any new necessary tests?

AI writing disclosure

We welcome the use of AI tools to help with contributions. For transparency and to help us improve our review process, please indicate the level of AI involvement in this PR.

No AI usage: the PR was written entirely by a human.
AI-assisted: some parts were suggested or improved by AI, but the PR was written and reviewed by a human.
AI-generated: the PR was mostly or fully generated by an AI tool.

Test plan

python -m py_compile trl\experimental\online_dpo\online_dpo_trainer.py tests\experimental\test_online_dpo_trainer.py
python -c "import sys, types, importlib.machinery; m=types.ModuleType('kernels'); m.__spec__=importlib.machinery.ModuleSpec('kernels', loader=None); sys.modules['kernels']=m; import pytest; raise SystemExit(pytest.main(['tests/experimental/test_online_dpo_trainer.py', '-k', 'vllm_server_preserves_completion_token_sequences', '-q']))"
python -m ruff check trl\experimental\online_dpo\online_dpo_trainer.py tests\experimental\test_online_dpo_trainer.py
python -m ruff format --check trl\experimental\online_dpo\online_dpo_trainer.py tests\experimental\test_online_dpo_trainer.py
git diff --check

The targeted pytest command temporarily stubs the installed kernels package in the test process because this local environment has an import-time transformers/kernels version mismatch (ValueError: Either a revision or a version must be specified.) before test collection. The regression itself is a pure mock test and does not require vLLM or kernels.

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.

Note

Medium Risk
Changes Online DPO vLLM-server generation and distributed indexing; wrong ordering would corrupt training, but scope is limited to that code path and is covered by a regression test.

Overview
Fixes incorrect handling of trl vllm-serve output in OnlineDPOTrainer._generate_vllm_server() (#5514). The server already returns one full token-id list per completion; the trainer was flattening those lists into single-token “completions” and deduplicating prompts before calling generate.

The path now passes all gathered prompts to the vLLM client with n=num_generations, keeps each returned completion sequence intact, and reorders results into the generation-major layout Online DPO uses for chosen/rejected reward pairing. Multi-process slicing and prompt_ids duplication use num_generations instead of hardcoded 2.

A mocked regression test test_vllm_server_preserves_completion_token_sequences locks in the expected reordering and prompt alignment.

^{Reviewed by Cursor Bugbot for commit 7db9a15. Bugbot is set up for automated code reviews on this repo. Configure here.}

fix: preserve OnlineDPO vLLM completion ids

7db9a15

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: preserve OnlineDPO vLLM completion ids#6038

fix: preserve OnlineDPO vLLM completion ids#6038
he-yufeng wants to merge 1 commit into
huggingface:mainfrom
he-yufeng:fix/online-dpo-vllm-completion-ids

he-yufeng commented Jun 13, 2026 •

edited by cursor Bot

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

he-yufeng commented Jun 13, 2026 • edited by cursor Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What does this PR do?

Before submitting

AI writing disclosure

Test plan

Who can review?

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

he-yufeng commented Jun 13, 2026 •

edited by cursor Bot

Loading