feat(ppo): add save_value_model flag to PPOConfig by mukund1985 · Pull Request #6120 · huggingface/trl

mukund1985 · 2026-06-19T19:58:30Z

Adds a new save_value_model: bool = False field to PPOConfig. When enabled, PPOTrainer.save_model() writes the value model (critic) to a value_model/ sub-directory alongside the policy checkpoint, making critic checkpointing and faithful RLHF run resumption possible.

Fixes #3293

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline, Pull Request section?
Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case.
Did you make sure to update the documentation with your changes?
Did you write any new necessary tests?

AI writing disclosure

No AI usage: the PR was written entirely by a human.
AI-assisted: some parts were suggested or improved by AI, but the PR was written and reviewed by a human.
AI-generated: the PR was mostly or fully generated by an AI tool.

Note

Low Risk
Behavior is off by default and only extends checkpoint I/O on an explicit flag; no training-loop or inference path changes.

Overview
Adds an opt-in save_value_model flag on PPOConfig (default False) so PPO checkpoints can include the critic, not just the policy.

When enabled, PPOTrainer.save_model() still writes the policy as today, then saves backup_model.value_model under a value_model/ subfolder in the same output/checkpoint directory, including the same DeepSpeed swap pattern used for the policy save.

^{Reviewed by Cursor Bugbot for commit 482546f. Bugbot is set up for automated code reviews on this repo. Configure here.}

Adds a new `save_value_model: bool = False` field to `PPOConfig`. When enabled, `PPOTrainer.save_model()` writes the value model (critic) to a `value_model/` sub-directory alongside the policy checkpoint, making reward-model checkpointing and resume possible. Closes huggingface#3293

mukund1985 · 2026-06-20T23:20:25Z

Hi @qgallouedec @albertvillanova — tagging you both as the active maintainers in case this is useful.

This PR addresses #3293. The value model (critic) is currently dropped at every checkpoint — only the policy is saved — so any interrupted PPO run has to re-initialise the critic from scratch on resume.

The fix is a single new flag save_value_model: bool = False in PPOConfig. When enabled, save_model() writes the critic to a value_model/ sub-directory using the existing super().save_model() path, so DeepSpeed and FSDP are handled transparently. Default behaviour is completely unchanged.

Happy to add tests or adjust to match whatever pattern you prefer. Thanks!

mukund1985 closed this Jun 20, 2026

mukund1985 reopened this Jun 20, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(ppo): add save_value_model flag to PPOConfig#6120

feat(ppo): add save_value_model flag to PPOConfig#6120
mukund1985 wants to merge 1 commit into
huggingface:mainfrom
mukund1985:fix/save-value-model

mukund1985 commented Jun 19, 2026 •

edited by cursor Bot

Loading

Uh oh!

mukund1985 commented Jun 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

mukund1985 commented Jun 19, 2026 • edited by cursor Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Before submitting

AI writing disclosure

Uh oh!

mukund1985 commented Jun 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

mukund1985 commented Jun 19, 2026 •

edited by cursor Bot

Loading