Skip to content

[fully_async] fix DAPO postprocess_generator_output to accept metrics kwargs#1812

Merged
SumanthRH merged 2 commits into
NovaSky-AI:mainfrom
erictang000:fix/fully-async-dapo-postprocess-kwargs
Jun 23, 2026
Merged

[fully_async] fix DAPO postprocess_generator_output to accept metrics kwargs#1812
SumanthRH merged 2 commits into
NovaSky-AI:mainfrom
erictang000:fix/fully-async-dapo-postprocess-kwargs

Conversation

@erictang000

Copy link
Copy Markdown
Collaborator

Summary

FullyAsyncDAPOTrainer.postprocess_generator_output() crashed with:

TypeError: FullyAsyncDAPOTrainer.postprocess_generator_output() got an unexpected keyword argument 'metrics_generator_output'

The base postprocess_generator_output (in trainer.py) gained metrics_generator_output / metrics_uids params in #1802 (sample_full_batch), and the fully-async caller (fully_async_trainer.py) now forwards them. The FullyAsyncDAPOTrainer override still had the old 2-arg signature.

Changes

  • Add metrics_generator_output / metrics_uids to the override signature and forward them to super().postprocess_generator_output(...).
  • Extract the soft-overlong penalty into _apply_soft_overlong_punishment(...) and apply it to both generator_output and metrics_generator_output. The metrics view is a separate kept+dropped concatenation, so without this the reward metrics would be computed over un-penalized rewards.

Scope check

This was the only override that needed updating — the new kwargs are only passed by FullyAsyncRayPPOTrainer.run(). The other fully-async subclasses (thunder_agent, harbor, fully_async/main_fully_async.py) don't override the method, and the other DAPO overrides (main_dapo, main_dapo_flashrl, main_tis_dapo) only run through the synchronous trainer path, which doesn't pass the new kwargs.

🤖 Generated with Claude Code

… kwargs

The base postprocess_generator_output gained metrics_generator_output /
metrics_uids params (NovaSky-AI#1802, sample_full_batch), and the fully-async caller
now forwards them. FullyAsyncDAPOTrainer's override still had the old 2-arg
signature, raising:

  TypeError: FullyAsyncDAPOTrainer.postprocess_generator_output() got an
  unexpected keyword argument 'metrics_generator_output'

Update the override to accept and forward the new kwargs. Also extract the
soft-overlong penalty into a helper and apply it to metrics_generator_output
(a separate kept+dropped concatenation) so reward metrics reflect the
penalized rewards.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request updates the FullyAsyncDAPOTrainer.postprocess_generator_output method to accept and process optional metrics outputs, refactoring the soft overlong punishment logic into a helper method. However, a critical issue was identified where applying the penalty in-place to both the main generator output and the metrics generator output can result in a double penalty due to shared inner list references in memory. It is recommended to shallow-copy the inner lists to prevent this side effect.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

overlong_buffer_penalty_factor = self.cfg.trainer.algorithm.overlong_buffer_penalty_factor
# modify rewards here
response_ids = generator_output["response_ids"]
rewards = generator_output["rewards"]

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

When metrics_generator_output is provided, both generator_output and metrics_generator_output are passed to _apply_soft_overlong_punishment. Since metrics_generator_output is created by concatenating the same generator outputs, the inner lists in generator_output["rewards"] and metrics_generator_output["rewards"] share the same references in memory.\n\nIf is_per_token is True, _apply_soft_overlong_punishment modifies the inner lists in-place (rewards[i][-1] -= penalty). This means that applying the penalty to generator_output also implicitly modifies the rewards in metrics_generator_output. When _apply_soft_overlong_punishment is subsequently called on metrics_generator_output, it applies the penalty again, resulting in a double penalty for the kept groups in the metrics view.\n\nTo prevent this, we should shallow-copy the inner lists before modifying them.

Suggested change
rewards = generator_output["rewards"]
rewards = [list(r) if isinstance(r, list) else r for r in generator_output["rewards"]]

concatenate_generator_outputs only copies the outer list (extend), so
generator_output and metrics_generator_output share the same per-token inner
reward-list objects for the kept groups. Mutating rewards[i][-1] in place
penalized the shared list once per _apply_soft_overlong_punishment call,
double-penalizing both the trained rewards and the metrics view.

Reassign the slot with a fresh list instead of mutating in place so each
view's penalty is independent.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@SumanthRH SumanthRH merged commit 165653d into NovaSky-AI:main Jun 23, 2026
1 check failed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants