Skip to content

feat(ppo): add save_value_model flag to PPOConfig#6120

Open
mukund1985 wants to merge 1 commit into
huggingface:mainfrom
mukund1985:fix/save-value-model
Open

feat(ppo): add save_value_model flag to PPOConfig#6120
mukund1985 wants to merge 1 commit into
huggingface:mainfrom
mukund1985:fix/save-value-model

Conversation

@mukund1985

@mukund1985 mukund1985 commented Jun 19, 2026

Copy link
Copy Markdown

Adds a new save_value_model: bool = False field to PPOConfig. When enabled, PPOTrainer.save_model() writes the value model (critic) to a value_model/ sub-directory alongside the policy checkpoint, making critic checkpointing and faithful RLHF run resumption possible.

Fixes #3293

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

AI writing disclosure

  • No AI usage: the PR was written entirely by a human.
  • AI-assisted: some parts were suggested or improved by AI, but the PR was written and reviewed by a human.
  • AI-generated: the PR was mostly or fully generated by an AI tool.

Note

Low Risk
Behavior is off by default and only extends checkpoint I/O on an explicit flag; no training-loop or inference path changes.

Overview
Adds an opt-in save_value_model flag on PPOConfig (default False) so PPO checkpoints can include the critic, not just the policy.

When enabled, PPOTrainer.save_model() still writes the policy as today, then saves backup_model.value_model under a value_model/ subfolder in the same output/checkpoint directory, including the same DeepSpeed swap pattern used for the policy save.

Reviewed by Cursor Bugbot for commit 482546f. Bugbot is set up for automated code reviews on this repo. Configure here.

Adds a new `save_value_model: bool = False` field to `PPOConfig`.
When enabled, `PPOTrainer.save_model()` writes the value model
(critic) to a `value_model/` sub-directory alongside the policy
checkpoint, making reward-model checkpointing and resume possible.

Closes huggingface#3293
@mukund1985 mukund1985 closed this Jun 20, 2026
@mukund1985 mukund1985 reopened this Jun 20, 2026
@mukund1985

Copy link
Copy Markdown
Author

Hi @qgallouedec @albertvillanova — tagging you both as the active maintainers in case this is useful.

This PR addresses #3293. The value model (critic) is currently dropped at every checkpoint — only the policy is saved — so any interrupted PPO run has to re-initialise the critic from scratch on resume.

The fix is a single new flag save_value_model: bool = False in PPOConfig. When enabled, save_model() writes the critic to a value_model/ sub-directory using the existing super().save_model() path, so DeepSpeed and FSDP are handled transparently. Default behaviour is completely unchanged.

Happy to add tests or adjust to match whatever pattern you prefer. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Keep PPOTrainer Value Model (Critic Model)

1 participant