fix: pass AsyncGRPO environment rewards#6031
Conversation
|
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
| group.tool_failure_counts.append(tool_failure_count) | ||
| if self.environments is not None: | ||
| env = self.environments[slot] | ||
| group.env_rewards.append(await asyncio.to_thread(lambda env=env: env.reward)) |
There was a problem hiding this comment.
I see the issue but we actually can't guarattee that every env has an env.reward.
This is a structural issue. But what we can do is this :
look at this PR.
There was a problem hiding this comment.
Thanks, that makes sense. I changed this to pass per-completion environment snapshots instead of reading env.reward directly from the worker.
The current head now does copy.copy(self.environments[slot]) when a completion finishes, stores those snapshots on the rollout group, and passes them to reward functions as environments. The test now mirrors the intended contract by reading env.reward inside the reward function.
Validated locally:
python -m py_compile trl\experimental\async_grpo\async_rollout_worker.py tests\experimental\test_async_rollout_worker.py
python -m pytest tests\experimental\test_async_rollout_worker.py -q
python -m ruff check trl\experimental\async_grpo\async_rollout_worker.py tests\experimental\test_async_rollout_worker.py
git diff --check origin/main..HEAD
There was a problem hiding this comment.
Thanks, would probably prefer a full test on a real example to validate. Not just type checking
|
@AmineDiro reworked the test per your note so it's not just checking types. It now uses a stateful env (reset plus an increment tool that updates env.reward) and asserts the behavior the snapshotting exists for: each completion is scored against its copy.copy snapshot taken when it finishes, and resetting/reusing the same instance for the next rollout leaves the earlier completions' rewards intact. Drop the copy.copy and the test fails. A full async rollout needs a live vLLM server, so I kept this at the worker scoring boundary rather than a model run. Happy to wire it through environment_factory end to end if you'd prefer that. |
What does this PR do?
Fixes #6027.
AsyncGRPOTrainercan build environments throughenvironment_factory, but the rollout worker only passed dataset-derived kwargs toreward_funcs. As a result, rewards computed by an environment were never visible to reward functions.This adds a per-group
env_rewardslist, captures each completed environment'srewardbefore the slot is reused, and forwards it to reward functions asenvironment_reward. The change is additive: runs without environments keep the existing kwargs.Before submitting
AI writing disclosure
Tests
python -m py_compile trl\\experimental\\async_grpo\\async_rollout_worker.py tests\\experimental\\test_async_grpo_trainer.py tests\\experimental\\test_async_rollout_worker.pypython -m ruff check trl\\experimental\\async_grpo\\async_rollout_worker.py tests\\experimental\\test_async_grpo_trainer.py tests\\experimental\\test_async_rollout_worker.pypython -m ruff format --check trl\\experimental\\async_grpo\\async_rollout_worker.py tests\\experimental\\test_async_grpo_trainer.py tests\\experimental\\test_async_rollout_worker.pygit diff --check.\\.venv\\Scripts\\python.exe -m pytest tests\\experimental\\test_async_rollout_worker.py -qWho can review?
Anyone familiar with the experimental AsyncGRPO rollout worker.
Note
Medium Risk
Touches experimental rollout scoring and reward kwargs; behavior change is limited to the environment_factory path and is covered by a new unit test.
Overview
Fixes environment-backed rewards in the async GRPO rollout worker: when
environment_factoryis used, reward functions now receive a per-completionenvironmentslist instead of only dataset kwargs.Each finished generation appends a
copy.copysnapshot of the slot’s environment to the rollout group (before the slot is reset for the next rollout), and_score_grouppasses that list into reward funcs when non-empty. Runs without environments are unchanged.A new test asserts that reused slots and later resets do not corrupt rewards for earlier completions in the same group.
Reviewed by Cursor Bugbot for commit 6143e0f. Bugbot is set up for automated code reviews on this repo. Configure here.