Skip to content

fix: pass AsyncGRPO environment rewards#6031

Open
he-yufeng wants to merge 4 commits into
huggingface:mainfrom
he-yufeng:fix/async-grpo-env-rewards
Open

fix: pass AsyncGRPO environment rewards#6031
he-yufeng wants to merge 4 commits into
huggingface:mainfrom
he-yufeng:fix/async-grpo-env-rewards

Conversation

@he-yufeng

@he-yufeng he-yufeng commented Jun 12, 2026

Copy link
Copy Markdown

What does this PR do?

Fixes #6027.

AsyncGRPOTrainer can build environments through environment_factory, but the rollout worker only passed dataset-derived kwargs to reward_funcs. As a result, rewards computed by an environment were never visible to reward functions.

This adds a per-group env_rewards list, captures each completed environment's reward before the slot is reused, and forwards it to reward functions as environment_reward. The change is additive: runs without environments keep the existing kwargs.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case. Issue: AsyncGRPOTrainer: environment_factory rewards are never passed to reward_funcs #6027.
  • Did you make sure to update the documentation with your changes? Not needed; this preserves the existing experimental API and fixes missing reward plumbing.
  • Did you write any new necessary tests?

AI writing disclosure

  • No AI usage: the PR was written entirely by a human.
  • AI-assisted: some parts were suggested or improved by AI, but the PR was written and reviewed by a human.
  • AI-generated: the PR was mostly or fully generated by an AI tool.

Tests

  • python -m py_compile trl\\experimental\\async_grpo\\async_rollout_worker.py tests\\experimental\\test_async_grpo_trainer.py tests\\experimental\\test_async_rollout_worker.py
  • python -m ruff check trl\\experimental\\async_grpo\\async_rollout_worker.py tests\\experimental\\test_async_grpo_trainer.py tests\\experimental\\test_async_rollout_worker.py
  • python -m ruff format --check trl\\experimental\\async_grpo\\async_rollout_worker.py tests\\experimental\\test_async_grpo_trainer.py tests\\experimental\\test_async_rollout_worker.py
  • git diff --check
  • .\\.venv\\Scripts\\python.exe -m pytest tests\\experimental\\test_async_rollout_worker.py -q

Who can review?

Anyone familiar with the experimental AsyncGRPO rollout worker.


Note

Medium Risk
Touches experimental rollout scoring and reward kwargs; behavior change is limited to the environment_factory path and is covered by a new unit test.

Overview
Fixes environment-backed rewards in the async GRPO rollout worker: when environment_factory is used, reward functions now receive a per-completion environments list instead of only dataset kwargs.

Each finished generation appends a copy.copy snapshot of the slot’s environment to the rollout group (before the slot is reset for the next rollout), and _score_group passes that list into reward funcs when non-empty. Runs without environments are unchanged.

A new test asserts that reused slots and later resets do not corrupt rewards for earlier completions in the same group.

Reviewed by Cursor Bugbot for commit 6143e0f. Bugbot is set up for automated code reviews on this repo. Configure here.

@adithya-s-k adithya-s-k requested a review from AmineDiro June 15, 2026 13:23
@adithya-s-k adithya-s-k self-requested a review June 15, 2026 13:28
@bot-ci-comment

Copy link
Copy Markdown

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

group.tool_failure_counts.append(tool_failure_count)
if self.environments is not None:
env = self.environments[slot]
group.env_rewards.append(await asyncio.to_thread(lambda env=env: env.reward))

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see the issue but we actually can't guarattee that every env has an env.reward.

This is a structural issue. But what we can do is this :

look at this PR.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, that makes sense. I changed this to pass per-completion environment snapshots instead of reading env.reward directly from the worker.

The current head now does copy.copy(self.environments[slot]) when a completion finishes, stores those snapshots on the rollout group, and passes them to reward functions as environments. The test now mirrors the intended contract by reading env.reward inside the reward function.

Validated locally:

python -m py_compile trl\experimental\async_grpo\async_rollout_worker.py tests\experimental\test_async_rollout_worker.py
python -m pytest tests\experimental\test_async_rollout_worker.py -q
python -m ruff check trl\experimental\async_grpo\async_rollout_worker.py tests\experimental\test_async_rollout_worker.py
git diff --check origin/main..HEAD

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, would probably prefer a full test on a real example to validate. Not just type checking

@he-yufeng

Copy link
Copy Markdown
Author

@AmineDiro reworked the test per your note so it's not just checking types. It now uses a stateful env (reset plus an increment tool that updates env.reward) and asserts the behavior the snapshotting exists for: each completion is scored against its copy.copy snapshot taken when it finishes, and resetting/reusing the same instance for the next rollout leaves the earlier completions' rewards intact. Drop the copy.copy and the test fails. A full async rollout needs a live vLLM server, so I kept this at the worker scoring boundary rather than a model run. Happy to wire it through environment_factory end to end if you'd prefer that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

AsyncGRPOTrainer: environment_factory rewards are never passed to reward_funcs

3 participants