fix(sft): drop examples with no trainable tokens after max_length truncation by 0xadvait · Pull Request #6025 · huggingface/trl

0xadvait · 2026-06-12T01:44:46Z

What does this PR do?

assistant_only_loss=True (or completion-only training) combined with max_length truncation can silently train on nothing.

The all-zero assistant_masks check in tokenize_fn runs before truncation, but truncation happens later, in DataCollatorForLanguageModeling. An example whose assistant/completion tokens all lie beyond max_length (e.g. the prompt alone reaches max_length) passes the check, then gets every label set to -100 by the collator. Since fully-masked micro-batches return a graph-connected zero loss (for DDP/FSDP safety), a batch made of such examples trains with no error and no warning, on current main:

{'loss': 0.0, 'grad_norm': nan, 'mean_token_accuracy': 0.0, 'num_tokens': 256.0, ...}
{'loss': 0.0, 'grad_norm': nan, 'mean_token_accuracy': 0.0, 'num_tokens': 512.0, ...}

(repro: 8 conversational examples with a ~200-token user turn and a short assistant reply, max_length=64, assistant_only_loss=True; num_tokens keeps counting while nothing is learned)

When only some examples are affected, they silently dilute training; the failure surfaces, if ever, as a confusing downstream error.

The fix

This implements the approach discussed in #3930 -- drop the affected examples at dataset preparation time and log how many were dropped -- without the dataset-wide trainable-token counting that stalled that PR:

after tokenization (and also for already-tokenized datasets), when max_length is set and packing is disabled, a batched filter keeps only examples with at least one trainable token within the truncation boundary
a token counts as trainable when every applicable mask is 1 at its position: assistant_masks is always applied by the collator when present, while completion_mask is only applied when completion_only_loss resolves to True -- the filter mirrors exactly that, so examples are never dropped based on a mask the collator won't use
both truncation_mode="keep_start" and "keep_end" are handled with the same slices the collator uses
if some examples are dropped: a warning with the dropped count and the relevant settings
if all examples are dropped: an actionable ValueError (previously: silent zero-loss training, or a cryptic downstream failure)
for IterableDataset, the same filter applies lazily; the counting/warning applies to map-style datasets where len() is available

No new dependencies, and a single extra batched filter pass (num_proc-parallel) that only runs when a mask column is present and truncation is active.

Tests

test_assistant_only_loss_drops_examples_with_no_trainable_tokens_after_truncation: 2 of 3 examples dropped, warning emitted, training proceeds on the survivor
test_assistant_only_loss_raises_if_no_example_has_trainable_tokens: clear ValueError when nothing survives
test_completion_only_loss_drops_examples_with_no_trainable_tokens_after_truncation: same protection for prompt-completion datasets via completion_mask
test_truncation_keep_end_keeps_examples_with_trailing_trainable_tokens: guards against over-dropping -- with keep_end, trailing completions survive and nothing is dropped

The first three fail without the fix; the last passes before and after.

Fixes #3927

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline, Pull Request section?
Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case. -> Using assistant_only_loss=True with sequence length > max_length fails silently #3927, approach discussed in [SFTTrainer]: Check for assistant mask up to max_length #3930
Did you make sure to update the documentation with your changes? (behavior is self-documenting via the warning/error messages; no doc page describes the truncation interaction)
Did you write any new necessary tests?

AI writing disclosure

We welcome the use of AI tools to help with contributions. For transparency and to help us improve our review process, please indicate the level of AI involvement in this PR.

No AI usage: the PR was written entirely by a human.
AI-assisted: some parts were suggested or improved by AI, but the PR was written and reviewed by a human.
AI-generated: the PR was mostly or fully generated by an AI tool.

Who can review?

@qgallouedec (proposed the drop-and-warn design in #3930)

Note

Medium Risk
Changes dataset preparation for a common SFT configuration (max_length + assistant/completion-only loss), which can shrink training data and fail fast when all examples are invalid; logic must stay aligned with collator truncation to avoid incorrect drops.

Overview
Fixes silent zero-loss SFT when max_length truncation cuts off every assistant/completion token after tokenization checks already passed (#3927).

SFTTrainer now runs a post-tokenization filter (only when max_length is set, packing is off, and assistant_masks / applicable completion_mask columns exist). It keeps examples that still have at least one trainable token inside the same truncation window the collator uses (keep_start vs keep_end), matching which masks actually affect labels. Partial drops log a warning with counts; if everything would be dropped, it raises a clear ValueError instead of training on all -100 labels.

Tests cover assistant-only and completion-only drops, the all-dropped error path, and that truncation_mode="keep_end" does not over-drop trailing completions.

^{Reviewed by Cursor Bugbot for commit ea5dd34. Bugbot is set up for automated code reviews on this repo. Configure here.}

…runcation When all completion or assistant tokens of an example lie beyond the max_length truncation boundary (e.g. the prompt alone reaches max_length), every label is set to -100 after truncation in the data collator. Such examples contribute nothing to the loss, and a batch made entirely of them silently trains with loss 0.0 and grad_norm nan. Drop these examples at dataset preparation time, warn with the dropped count, and raise an actionable error when no example survives. Fixes huggingface#3927

qgallouedec · 2026-06-12T09:53:56Z

Thanks @0xadvait, nice diagnosis; the silent zero-loss from post-tokenization truncation is a real footgun.

My hesitation: has_trainable_tokens re-derives what the collator does (the keep_start/keep_end slice and the "which masks apply" rule). That logic is duplicated from DataCollatorForLanguageModeling, so if the collator ever changes, the filter silently drifts.

I'd rather fix it from the other end. The -100 labels only exist after the collator runs, which is why you have to reconstruct them. If we instead build the labels column in dataset preparation (in tokenize_fn, labels = input_ids with -100 where the applicable masks are 0):

the collator just truncates/pads labels: it already reads a labels column (torch_call) and its mask-handling block goes away;
"what's trainable" lives in one place;
dropping these examples becomes a trivial filter on the labels column instead of an AND over masks.

Truncation can stay in the collator (moving it fights padding_free/packing), so the filter still slices with the same window, but reads one column and carries none of the mask-precedence logic.

Since that's bigger than your diff, I'd split it: a first PR moving label construction into prep (and slimming the collator), then the drop-and-warn filter on top; almost trivial at that point. Happy to help with the first part. WDYT?

0xadvait · 2026-06-12T10:06:15Z

Agreed on the drift risk; the filter already had to mirror the completion_only_loss resolution to avoid dropping rows the collator wouldn't mask, which is exactly the duplication you're describing.

I took a pass through what part 1 touches, and it's more drop-in than I expected: torch_call already reads example.get("labels", example["input_ids"]), truncates labels with the same slice, and pads them with -100, so prebuilt labels flow through as-is and only the mask-application block at the end goes away. The plumbing beyond that:

packing (columns = ["input_ids"] + masks) and the Liger collator_expected_keys need "labels" added; _signature_columns already includes it
pre-tokenized datasets (is_processed=True) skip tokenize_fn, so prep needs a small "build labels from masks if absent" step for them. Related: DataCollatorForLanguageModeling's mask handling is documented public behavior -- remove it outright, or keep it as a fallback when no labels column is present (with a FutureWarning per the deprecation policy)? That's the one real decision I see in part 1.
if tokenize_fn emits labels instead of the mask columns, storage stays roughly neutral and packing carries labels instead of masks
the vision collator keeps building labels itself, since vision examples aren't tokenized at prep time

For the residual duplication in part 2 (the truncation window), a tiny shared helper for the keep_start/keep_end slice used by both the collator and the filter would close that too.

Happy to draft part 1 along these lines as a separate PR and then strip this one down to the trivial labels filter on top -- or if you'd rather take part 1, I'll rebase once it's in. Converting this to draft meanwhile.

qgallouedec · 2026-06-12T17:44:20Z

Yes happy to receive a separate pr for 1.
I suggest that you ignore the vlm path, no easy and clean solution exist at this point.

0xadvait · 2026-06-12T18:08:29Z

Part 1 is up: #6037. Once it lands I'll strip this PR down to the labels-column filter using the shared truncation window.

0xadvait marked this pull request as draft June 12, 2026 10:06

0xadvait mentioned this pull request Jun 12, 2026

refactor(sft): build labels during dataset preparation instead of collation #6037

Open

8 tasks

qgallouedec mentioned this pull request Jun 18, 2026

Using assistant_only_loss=True with sequence length > max_length fails silently #3927

Open

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(sft): drop examples with no trainable tokens after max_length truncation#6025

fix(sft): drop examples with no trainable tokens after max_length truncation#6025
0xadvait wants to merge 1 commit into
huggingface:mainfrom
0xadvait:fix-sft-truncation-untrainable

0xadvait commented Jun 12, 2026 •

edited by cursor Bot

Loading

Uh oh!

qgallouedec commented Jun 12, 2026

Uh oh!

0xadvait commented Jun 12, 2026

Uh oh!

qgallouedec commented Jun 12, 2026

Uh oh!

0xadvait commented Jun 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

0xadvait commented Jun 12, 2026 • edited by cursor Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What does this PR do?

The fix

Tests

Before submitting

AI writing disclosure

Who can review?

Uh oh!

qgallouedec commented Jun 12, 2026

Uh oh!

0xadvait commented Jun 12, 2026

Uh oh!

qgallouedec commented Jun 12, 2026

Uh oh!

0xadvait commented Jun 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

0xadvait commented Jun 12, 2026 •

edited by cursor Bot

Loading