fix(ppo): exclude padding tokens from entropy calculation by mukund1985 · Pull Request #6121 · huggingface/trl

mukund1985 · 2026-06-19T19:59:59Z

Problem

objective/entropy goes negative during PPO training, which is mathematically impossible for a discrete distribution — Shannon entropy H(p) = −Σ p·log(p) ≥ 0 always.

Root cause: INVALID_LOGPROB = 1.0 is used as a sentinel for padding positions. The current calculation:

mean_entropy = (-logprobs).sum(1).mean()

contributes −1.0 per padding position, driving the sum negative.
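
The arithmetic can be checked by hand with a small library-free sketch. The log-probability values and the batch shape below are made up for illustration; only the sentinel value 1.0 and the negate-sum-mean formula come from the description above.

```python
# Sentinel written into padding positions (value taken from the PR description).
INVALID_LOGPROB = 1.0

# Hypothetical per-token log-probabilities for two responses of length 4.
# Real tokens have logprob <= 0; the second response ends in three pad slots.
logprobs = [
    [-0.2, -0.1, -0.3, -0.2],
    [-0.5, INVALID_LOGPROB, INVALID_LOGPROB, INVALID_LOGPROB],
]

# Unmasked metric: negate, sum over the sequence, then average over the batch.
per_sequence = [sum(-lp for lp in row) for row in logprobs]
mean_entropy = sum(per_sequence) / len(per_sequence)

print(per_sequence)  # the second entry is 0.5 - 3.0, so it is negative
print(mean_entropy)  # the batch mean is pulled below zero as well
```

The more padding a batch contains, the further the unmasked value drops, so the metric depends on response length rather than on the policy alone.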

Fixes #2022.

Solution

Mask padding tokens before summing:

mean_entropy = ((-logprobs) * (~padding_mask).float()).sum(1).mean()

padding_mask is already computed in the same scope. One-line change.
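
A minimal sketch comparing the two expressions on a toy batch, assuming padding_mask is a boolean tensor where True marks a padding position (the convention implied by ~padding_mask in the fix). The tensor values are hypothetical and are not taken from the trainer.

```python
import torch

INVALID_LOGPROB = 1.0  # sentinel value from the PR description

# Hypothetical rollout: batch of 2 responses, 4 positions each.
logprobs = torch.tensor([
    [-0.2, -0.1, -0.3, -0.2],
    [-0.5, -0.4, -0.1, -0.3],
])
# Assumed convention: True marks a padding position.
padding_mask = torch.tensor([
    [False, False, False, False],
    [False, True, True, True],
])
# Overwrite padded positions with the sentinel, as described in the root cause.
logprobs = logprobs.masked_fill(padding_mask, INVALID_LOGPROB)

# Old expression: every pad slot adds -1.0 to its row sum.
unmasked = (-logprobs).sum(1).mean()
# New expression: pad slots are multiplied by 0 before the sum.
masked = ((-logprobs) * (~padding_mask).float()).sum(1).mean()

print(unmasked.item(), masked.item())
```

Because real-token log-probabilities are at most 0, every term that survives the mask is non-negative, so the masked metric cannot go below zero.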

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline, Pull Request section?
Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case.
Did you make sure to update the documentation with your changes?
Did you write any new necessary tests?

AI writing disclosure

No AI usage: the PR was written entirely by a human.
AI-assisted: some parts were suggested or improved by AI, but the PR was written and reviewed by a human.
AI-generated: the PR was mostly or fully generated by an AI tool.

Note

Low Risk
Logging-only metric fix; no change to PPO loss, gradients, or reward computation.

Overview
Fixes objective/entropy reporting in PPO when padded response positions were counted in the rollout logprob sum.

Rollout logprobs are masked with INVALID_LOGPROB = 1.0 on padding, so (-logprobs).sum(1) added −1.0 per pad token and could make the logged entropy negative. mean_entropy now multiplies by (~padding_mask).float() before summing, using the same padding_mask already built for KL/rewards.

Training policy loss still uses per-step entropy from logits inside micro-batches; only this end-of-step objective/entropy metric changes. Fixes #2022.
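
For contrast, entropy computed directly from logits is non-negative by construction. The sketch below uses the standard identity H = logsumexp(z) − Σ softmax(z)·z in plain Python; it is an illustration of that identity only, and the helper name entropy_from_logits is hypothetical rather than a function from the trainer.

```python
import math

def entropy_from_logits(logits):
    """Shannon entropy (in nats) of softmax(logits) for a single token position."""
    m = max(logits)  # subtract the max for numerical stability
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    probs = [math.exp(z - lse) for z in logits]
    # H = -sum(p * log p) with log p = z - lse, which simplifies to lse - sum(p * z).
    return lse - sum(p * z for p, z in zip(probs, logits))

uniform = entropy_from_logits([0.0, 0.0, 0.0, 0.0])   # maximal: log(4)
peaked = entropy_from_logits([10.0, 0.0, 0.0, 0.0])   # close to zero

print(uniform, peaked)
```

A full-distribution entropy like this needs no sentinel handling per token, whereas the logged metric is built from sampled-token log-probabilities and therefore needs the mask.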

Reviewed by Cursor Bugbot for commit b9622c0. Bugbot is set up for automated code reviews on this repo.

INVALID_LOGPROB = 1.0 is used as a sentinel for padding positions. Without masking, each padding position contributes -1.0 to the `(-logprobs).sum()` entropy calculation, driving `objective/entropy` negative — which is mathematically impossible for a discrete distribution. Fix: apply `(~padding_mask).float()` before summing so only real tokens contribute to the entropy metric. Closes huggingface#2022

mukund1985 wants to merge 1 commit into huggingface:main from mukund1985:fix/ppo-entropy
