[fully_async] fix DAPO postprocess_generator_output to accept metrics kwargs#1812
Conversation
… kwargs The base postprocess_generator_output gained metrics_generator_output / metrics_uids params (NovaSky-AI#1802, sample_full_batch), and the fully-async caller now forwards them. FullyAsyncDAPOTrainer's override still had the old 2-arg signature, raising: TypeError: FullyAsyncDAPOTrainer.postprocess_generator_output() got an unexpected keyword argument 'metrics_generator_output' Update the override to accept and forward the new kwargs. Also extract the soft-overlong penalty into a helper and apply it to metrics_generator_output (a separate kept+dropped concatenation) so reward metrics reflect the penalized rewards. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Code Review
This pull request updates the FullyAsyncDAPOTrainer.postprocess_generator_output method to accept and process optional metrics outputs, refactoring the soft overlong punishment logic into a helper method. However, a critical issue was identified where applying the penalty in-place to both the main generator output and the metrics generator output can result in a double penalty due to shared inner list references in memory. It is recommended to shallow-copy the inner lists to prevent this side effect.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| overlong_buffer_penalty_factor = self.cfg.trainer.algorithm.overlong_buffer_penalty_factor | ||
| # modify rewards here | ||
| response_ids = generator_output["response_ids"] | ||
| rewards = generator_output["rewards"] |
There was a problem hiding this comment.
When metrics_generator_output is provided, both generator_output and metrics_generator_output are passed to _apply_soft_overlong_punishment. Since metrics_generator_output is created by concatenating the same generator outputs, the inner lists in generator_output["rewards"] and metrics_generator_output["rewards"] share the same references in memory.\n\nIf is_per_token is True, _apply_soft_overlong_punishment modifies the inner lists in-place (rewards[i][-1] -= penalty). This means that applying the penalty to generator_output also implicitly modifies the rewards in metrics_generator_output. When _apply_soft_overlong_punishment is subsequently called on metrics_generator_output, it applies the penalty again, resulting in a double penalty for the kept groups in the metrics view.\n\nTo prevent this, we should shallow-copy the inner lists before modifying them.
| rewards = generator_output["rewards"] | |
| rewards = [list(r) if isinstance(r, list) else r for r in generator_output["rewards"]] |
concatenate_generator_outputs only copies the outer list (extend), so generator_output and metrics_generator_output share the same per-token inner reward-list objects for the kept groups. Mutating rewards[i][-1] in place penalized the shared list once per _apply_soft_overlong_punishment call, double-penalizing both the trained rewards and the metrics view. Reassign the slot with a fresh list instead of mutating in place so each view's penalty is independent. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Summary
FullyAsyncDAPOTrainer.postprocess_generator_output()crashed with:The base
postprocess_generator_output(intrainer.py) gainedmetrics_generator_output/metrics_uidsparams in #1802 (sample_full_batch), and the fully-async caller (fully_async_trainer.py) now forwards them. TheFullyAsyncDAPOTraineroverride still had the old 2-arg signature.Changes
metrics_generator_output/metrics_uidsto the override signature and forward them tosuper().postprocess_generator_output(...)._apply_soft_overlong_punishment(...)and apply it to bothgenerator_outputandmetrics_generator_output. The metrics view is a separate kept+dropped concatenation, so without this the reward metrics would be computed over un-penalized rewards.Scope check
This was the only override that needed updating — the new kwargs are only passed by
FullyAsyncRayPPOTrainer.run(). The other fully-async subclasses (thunder_agent,harbor,fully_async/main_fully_async.py) don't override the method, and the other DAPO overrides (main_dapo,main_dapo_flashrl,main_tis_dapo) only run through the synchronous trainer path, which doesn't pass the new kwargs.🤖 Generated with Claude Code