fix: pass AsyncGRPO environment rewards by he-yufeng · Pull Request #6031 · huggingface/trl

he-yufeng · 2026-06-12T09:12:42Z

What does this PR do?

AsyncGRPOTrainer can build environments through environment_factory, but the rollout worker only passed dataset-derived kwargs to reward_funcs. As a result, rewards computed by an environment were never visible to reward functions.

This adds a per-group env_rewards list, captures each completed environment's reward before the slot is reused, and forwards it to reward functions as environment_reward. The change is additive: runs without environments keep the existing kwargs.

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline, Pull Request section?
Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case. Issue: AsyncGRPOTrainer: environment_factory rewards are never passed to reward_funcs #6027.
Did you make sure to update the documentation with your changes? Not needed; this preserves the existing experimental API and fixes missing reward plumbing.
Did you write any new necessary tests?

AI writing disclosure

No AI usage: the PR was written entirely by a human.
AI-assisted: some parts were suggested or improved by AI, but the PR was written and reviewed by a human.
AI-generated: the PR was mostly or fully generated by an AI tool.

Tests

python -m py_compile trl\\experimental\\async_grpo\\async_rollout_worker.py tests\\experimental\\test_async_grpo_trainer.py tests\\experimental\\test_async_rollout_worker.py
python -m ruff check trl\\experimental\\async_grpo\\async_rollout_worker.py tests\\experimental\\test_async_grpo_trainer.py tests\\experimental\\test_async_rollout_worker.py
python -m ruff format --check trl\\experimental\\async_grpo\\async_rollout_worker.py tests\\experimental\\test_async_grpo_trainer.py tests\\experimental\\test_async_rollout_worker.py
git diff --check
.\\.venv\\Scripts\\python.exe -m pytest tests\\experimental\\test_async_rollout_worker.py -q

Who can review?

Anyone familiar with the experimental AsyncGRPO rollout worker.

Note

Medium Risk
Touches experimental rollout scoring and reward kwargs; behavior change is limited to the environment_factory path and is covered by a new unit test.

Overview
Fixes environment-backed rewards in the async GRPO rollout worker: when environment_factory is used, reward functions now receive a per-completion environments list instead of only dataset kwargs.

Each finished generation appends a copy.copy snapshot of the slot’s environment to the rollout group (before the slot is reset for the next rollout), and _score_group passes that list into reward funcs when non-empty. Runs without environments are unchanged.

A new test asserts that reused slots and later resets do not corrupt rewards for earlier completions in the same group.

^{Reviewed by Cursor Bugbot for commit 6143e0f. Bugbot is set up for automated code reviews on this repo. Configure here.}

bot-ci-comment · 2026-06-15T13:31:59Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

AmineDiro · 2026-06-15T13:41:48Z

                    group.tool_failure_counts.append(tool_failure_count)
+                    if self.environments is not None:
+                        env = self.environments[slot]
+                        group.env_rewards.append(await asyncio.to_thread(lambda env=env: env.reward))


I see the issue but we actually can't guarattee that every env has an env.reward.

This is a structural issue. But what we can do is this :

look at this PR.

Thanks, that makes sense. I changed this to pass per-completion environment snapshots instead of reading env.reward directly from the worker.

The current head now does copy.copy(self.environments[slot]) when a completion finishes, stores those snapshots on the rollout group, and passes them to reward functions as environments. The test now mirrors the intended contract by reading env.reward inside the reward function.

Validated locally:

python -m py_compile trl\experimental\async_grpo\async_rollout_worker.py tests\experimental\test_async_rollout_worker.py python -m pytest tests\experimental\test_async_rollout_worker.py -q python -m ruff check trl\experimental\async_grpo\async_rollout_worker.py tests\experimental\test_async_rollout_worker.py git diff --check origin/main..HEAD

Thanks, would probably prefer a full test on a real example to validate. Not just type checking

he-yufeng · 2026-06-16T09:33:06Z

@AmineDiro reworked the test per your note so it's not just checking types. It now uses a stateful env (reset plus an increment tool that updates env.reward) and asserts the behavior the snapshotting exists for: each completion is scored against its copy.copy snapshot taken when it finishes, and resetting/reusing the same instance for the next rollout leaves the earlier completions' rewards intact. Drop the copy.copy and the test fails. A full async rollout needs a live vLLM server, so I kept this at the worker scoring boundary rather than a model run. Happy to wire it through environment_factory end to end if you'd prefer that.

fix: pass AsyncGRPO environment rewards

056ca0a

adithya-s-k requested a review from AmineDiro June 15, 2026 13:23

Merge branch 'main' into fix/async-grpo-env-rewards

103eb03

adithya-s-k self-requested a review June 15, 2026 13:28

adithya-s-k approved these changes Jun 15, 2026

View reviewed changes

AmineDiro requested changes Jun 15, 2026

View reviewed changes

adithya-s-k mentioned this pull request Jun 15, 2026

Add experimental Harbor integration for GRPO environment training #6018

Merged

Yufeng He added 2 commits June 15, 2026 22:12

fix: pass AsyncGRPO environment snapshots

f8c4f79

test: exercise per-completion env snapshot isolation on reuse

6143e0f

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: pass AsyncGRPO environment rewards#6031

fix: pass AsyncGRPO environment rewards#6031
he-yufeng wants to merge 4 commits into
huggingface:mainfrom
he-yufeng:fix/async-grpo-env-rewards

he-yufeng commented Jun 12, 2026 •

edited by cursor Bot

Loading

Uh oh!

bot-ci-comment Bot commented Jun 15, 2026

Uh oh!

AmineDiro Jun 15, 2026

Uh oh!

he-yufeng Jun 15, 2026

Uh oh!

AmineDiro Jun 15, 2026

Uh oh!

he-yufeng commented Jun 16, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

he-yufeng commented Jun 12, 2026 • edited by cursor Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What does this PR do?

Before submitting

AI writing disclosure

Tests

Who can review?

Uh oh!

bot-ci-comment Bot commented Jun 15, 2026

Uh oh!

AmineDiro Jun 15, 2026

Choose a reason for hiding this comment

Uh oh!

he-yufeng Jun 15, 2026

Choose a reason for hiding this comment

Uh oh!

AmineDiro Jun 15, 2026

Choose a reason for hiding this comment

Uh oh!

he-yufeng commented Jun 16, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

he-yufeng commented Jun 12, 2026 •

edited by cursor Bot

Loading