[fully_async] fix DAPO postprocess_generator_output to accept metrics kwargs by erictang000 · Pull Request #1812 · NovaSky-AI/SkyRL

erictang000 · 2026-06-19T01:45:16Z

Summary

FullyAsyncDAPOTrainer.postprocess_generator_output() crashed with:

TypeError: FullyAsyncDAPOTrainer.postprocess_generator_output() got an unexpected keyword argument 'metrics_generator_output'

The base postprocess_generator_output (in trainer.py) gained metrics_generator_output / metrics_uids params in #1802 (sample_full_batch), and the fully-async caller (fully_async_trainer.py) now forwards them. The FullyAsyncDAPOTrainer override still had the old 2-arg signature.

Changes

Add metrics_generator_output / metrics_uids to the override signature and forward them to super().postprocess_generator_output(...).
Extract the soft-overlong penalty into _apply_soft_overlong_punishment(...) and apply it to both generator_output and metrics_generator_output. The metrics view is a separate kept+dropped concatenation, so without this the reward metrics would be computed over un-penalized rewards.

Scope check

This was the only override that needed updating — the new kwargs are only passed by FullyAsyncRayPPOTrainer.run(). The other fully-async subclasses (thunder_agent, harbor, fully_async/main_fully_async.py) don't override the method, and the other DAPO overrides (main_dapo, main_dapo_flashrl, main_tis_dapo) only run through the synchronous trainer path, which doesn't pass the new kwargs.

🤖 Generated with Claude Code

… kwargs The base postprocess_generator_output gained metrics_generator_output / metrics_uids params (NovaSky-AI#1802, sample_full_batch), and the fully-async caller now forwards them. FullyAsyncDAPOTrainer's override still had the old 2-arg signature, raising: TypeError: FullyAsyncDAPOTrainer.postprocess_generator_output() got an unexpected keyword argument 'metrics_generator_output' Update the override to accept and forward the new kwargs. Also extract the soft-overlong penalty into a helper and apply it to metrics_generator_output (a separate kept+dropped concatenation) so reward metrics reflect the penalized rewards. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

gemini-code-assist

Code Review

This pull request updates the FullyAsyncDAPOTrainer.postprocess_generator_output method to accept and process optional metrics outputs, refactoring the soft overlong punishment logic into a helper method. However, a critical issue was identified where applying the penalty in-place to both the main generator output and the metrics generator output can result in a double penalty due to shared inner list references in memory. It is recommended to shallow-copy the inner lists to prevent this side effect.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

gemini-code-assist · 2026-06-19T01:46:14Z

        overlong_buffer_penalty_factor = self.cfg.trainer.algorithm.overlong_buffer_penalty_factor
-        # modify rewards here
        response_ids = generator_output["response_ids"]
        rewards = generator_output["rewards"]


When metrics_generator_output is provided, both generator_output and metrics_generator_output are passed to _apply_soft_overlong_punishment. Since metrics_generator_output is created by concatenating the same generator outputs, the inner lists in generator_output["rewards"] and metrics_generator_output["rewards"] share the same references in memory.\n\nIf is_per_token is True, _apply_soft_overlong_punishment modifies the inner lists in-place (rewards[i][-1] -= penalty). This means that applying the penalty to generator_output also implicitly modifies the rewards in metrics_generator_output. When _apply_soft_overlong_punishment is subsequently called on metrics_generator_output, it applies the penalty again, resulting in a double penalty for the kept groups in the metrics view.\n\nTo prevent this, we should shallow-copy the inner lists before modifying them.

Suggested change

rewards = generator_output["rewards"]

rewards = [list(r) if isinstance(r, list) else r for r in generator_output["rewards"]]

concatenate_generator_outputs only copies the outer list (extend), so generator_output and metrics_generator_output share the same per-token inner reward-list objects for the kept groups. Mutating rewards[i][-1] in place penalized the shared list once per _apply_soft_overlong_punishment call, double-penalizing both the trained rewards and the metrics view. Reassign the slot with a fresh list instead of mutating in place so each view's penalty is independent. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

gemini-code-assist Bot reviewed Jun 19, 2026

View reviewed changes

SumanthRH approved these changes Jun 23, 2026

View reviewed changes

SumanthRH merged commit 165653d into NovaSky-AI:main Jun 23, 2026
1 check failed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[fully_async] fix DAPO postprocess_generator_output to accept metrics kwargs#1812

[fully_async] fix DAPO postprocess_generator_output to accept metrics kwargs#1812
SumanthRH merged 2 commits into
NovaSky-AI:mainfrom
erictang000:fix/fully-async-dapo-postprocess-kwargs

erictang000 commented Jun 19, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot Jun 19, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

	rewards = generator_output["rewards"]
	rewards = [list(r) if isinstance(r, list) else r for r in generator_output["rewards"]]

Conversation

erictang000 commented Jun 19, 2026

Summary

Changes

Scope check

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Jun 19, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants