Skip to content

fix: preserve OnlineDPO vLLM completion ids#6038

Open
he-yufeng wants to merge 1 commit into
huggingface:mainfrom
he-yufeng:fix/online-dpo-vllm-completion-ids
Open

fix: preserve OnlineDPO vLLM completion ids#6038
he-yufeng wants to merge 1 commit into
huggingface:mainfrom
he-yufeng:fix/online-dpo-vllm-completion-ids

Conversation

@he-yufeng

@he-yufeng he-yufeng commented Jun 13, 2026

Copy link
Copy Markdown

What does this PR do?

Fixes #5514.

trl vllm-serve already returns completion_ids as one token-id list per generated completion. OnlineDPOTrainer._generate_vllm_server() was treating that flat list as if it were grouped by prompt, so it split each token sequence into single-token completions.

This PR keeps each returned completion token sequence intact, then reorders the server output into the generation-major order that Online DPO expects for reward splitting. It also keeps the local prompt_ids in the same order.

Before submitting

AI writing disclosure

We welcome the use of AI tools to help with contributions. For transparency and to help us improve our review process, please indicate the level of AI involvement in this PR.

  • No AI usage: the PR was written entirely by a human.
  • AI-assisted: some parts were suggested or improved by AI, but the PR was written and reviewed by a human.
  • AI-generated: the PR was mostly or fully generated by an AI tool.

Test plan

  • python -m py_compile trl\experimental\online_dpo\online_dpo_trainer.py tests\experimental\test_online_dpo_trainer.py
  • python -c "import sys, types, importlib.machinery; m=types.ModuleType('kernels'); m.__spec__=importlib.machinery.ModuleSpec('kernels', loader=None); sys.modules['kernels']=m; import pytest; raise SystemExit(pytest.main(['tests/experimental/test_online_dpo_trainer.py', '-k', 'vllm_server_preserves_completion_token_sequences', '-q']))"
  • python -m ruff check trl\experimental\online_dpo\online_dpo_trainer.py tests\experimental\test_online_dpo_trainer.py
  • python -m ruff format --check trl\experimental\online_dpo\online_dpo_trainer.py tests\experimental\test_online_dpo_trainer.py
  • git diff --check

The targeted pytest command temporarily stubs the installed kernels package in the test process because this local environment has an import-time transformers/kernels version mismatch (ValueError: Either a revision or a version must be specified.) before test collection. The regression itself is a pure mock test and does not require vLLM or kernels.

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.


Note

Medium Risk
Changes Online DPO vLLM-server generation and distributed indexing; wrong ordering would corrupt training, but scope is limited to that code path and is covered by a regression test.

Overview
Fixes incorrect handling of trl vllm-serve output in OnlineDPOTrainer._generate_vllm_server() (#5514). The server already returns one full token-id list per completion; the trainer was flattening those lists into single-token “completions” and deduplicating prompts before calling generate.

The path now passes all gathered prompts to the vLLM client with n=num_generations, keeps each returned completion sequence intact, and reorders results into the generation-major layout Online DPO uses for chosen/rejected reward pairing. Multi-process slicing and prompt_ids duplication use num_generations instead of hardcoded 2.

A mocked regression test test_vllm_server_preserves_completion_token_sequences locks in the expected reordering and prompt alignment.

Reviewed by Cursor Bugbot for commit 7db9a15. Bugbot is set up for automated code reviews on this repo. Configure here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

OnlineDPOTrainer._generate_vllm_server() flattens vllm-serve completion_ids twice

1 participant