fix(ppo): exclude padding tokens from entropy calculation #6121

Open
mukund1985 wants to merge 1 commit into huggingface:main from mukund1985:fix/ppo-entropy

@mukund1985 mukund1985 commented Jun 19, 2026

Problem

objective/entropy goes negative during PPO training, which is mathematically impossible for a discrete distribution: Shannon entropy H(p) = −Σ p·log(p) ≥ 0 always, because every term −p·log(p) is non-negative for 0 ≤ p ≤ 1.

Root cause: INVALID_LOGPROB = 1.0 is used as a sentinel for padding positions. The current calculation:

mean_entropy = (-logprobs).sum(1).mean()

adds −1.0 for every padding position, which can drive the per-sequence sum, and therefore the logged mean, below zero once enough positions are padded.

Fixes #2022.

Solution

Mask padding tokens before summing:

mean_entropy = ((-logprobs) * (~padding_mask).float()).sum(1).mean()

padding_mask (True at padded positions, as the ~ negation implies) is already computed in the same scope, so the fix is a one-line change.
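To make the arithmetic concrete, here is a minimal sketch of the before and after calculations. It uses plain Python lists in place of tensors so it runs without PyTorch; the helper names and the sample values are hypothetical, and only INVALID_LOGPROB = 1.0 and the two formulas are taken from this PR.

```python
# Illustrative sketch: plain lists stand in for the (batch, seq_len) tensors.
INVALID_LOGPROB = 1.0  # sentinel value quoted in this PR


def mean_entropy_unmasked(logprobs):
    """Mirrors (-logprobs).sum(1).mean(): padding positions are counted."""
    per_seq = [sum(-lp for lp in row) for row in logprobs]
    return sum(per_seq) / len(per_seq)


def mean_entropy_masked(logprobs, padding_mask):
    """Mirrors ((-logprobs) * (~padding_mask).float()).sum(1).mean()."""
    per_seq = [
        sum(-lp for lp, is_pad in zip(row, mask_row) if not is_pad)
        for row, mask_row in zip(logprobs, padding_mask)
    ]
    return sum(per_seq) / len(per_seq)


# Two hypothetical responses padded to length 4; True marks a padding position.
logprobs = [
    [-0.5, -0.25, INVALID_LOGPROB, INVALID_LOGPROB],
    [-0.25, INVALID_LOGPROB, INVALID_LOGPROB, INVALID_LOGPROB],
]
padding_mask = [
    [False, False, True, True],
    [False, True, True, True],
]

unmasked = mean_entropy_unmasked(logprobs)          # pads drag the metric down
masked = mean_entropy_masked(logprobs, padding_mask)  # only real tokens count

print(unmasked)  # -2.0
print(masked)    # 0.5
```

With these sample values the unmasked metric is negative even though every real token has a non-positive log-probability, while the masked version stays non-negative.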

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

AI writing disclosure

  • No AI usage: the PR was written entirely by a human.
  • AI-assisted: some parts were suggested or improved by AI, but the PR was written and reviewed by a human.
  • AI-generated: the PR was mostly or fully generated by an AI tool.

Note

Low Risk
Logging-only metric fix; no change to PPO loss, gradients, or reward computation.

Overview
Fixes objective/entropy reporting in PPO when padded response positions were counted in the rollout logprob sum.

Rollout logprobs are masked with INVALID_LOGPROB = 1.0 on padding, so (-logprobs).sum(1) added −1.0 per pad token and could make the logged entropy negative. mean_entropy now multiplies by (~padding_mask).float() before summing, using the same padding_mask already built for KL/rewards.

Training policy loss still uses per-step entropy from logits inside micro-batches; only this end-of-step objective/entropy metric changes. Fixes #2022.
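For contrast with the logprob-based metric, the sketch below computes per-token entropy directly from logits using the standard identity H = logsumexp(logits) − Σ softmax(logits)·logits. This is an assumption-laden illustration in pure Python: the function name is hypothetical, and the exact expression used inside the trainer's micro-batch loop is not shown in this PR.

```python
import math


def entropy_from_logits(logits):
    """Shannon entropy (in nats) of softmax(logits) for one token position."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    log_sum_exp = m + math.log(z)
    # H = logsumexp(logits) - sum_i p_i * logit_i
    return log_sum_exp - sum(p * x for p, x in zip(probs, logits))


uniform = entropy_from_logits([0.0, 0.0, 0.0, 0.0])  # equals log(4)
peaked = entropy_from_logits([10.0, 0.0, 0.0, 0.0])  # close to, but above, zero
```

Because this quantity is computed from a full distribution rather than from sentinel-filled log-probabilities, it cannot go negative, which is consistent with the note that only the end-of-step metric was affected.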

Reviewed by Cursor Bugbot for commit b9622c0.

INVALID_LOGPROB = 1.0 is used as a sentinel for padding positions.
Without masking, each padding position contributes -1.0 to the
`(-logprobs).sum()` entropy calculation, driving `objective/entropy`
negative — which is mathematically impossible for a discrete
distribution.

Fix: apply `(~padding_mask).float()` before summing so only real
tokens contribute to the entropy metric.

Closes huggingface#2022
Development

Successfully merging this pull request may close these issues.

Negative Entropy in TRL PPOv2Trainer TLDR Example

1 participant