Skip to content

fix(sft): drop examples with no trainable tokens after max_length truncation#6025

Draft
0xadvait wants to merge 1 commit into
huggingface:mainfrom
0xadvait:fix-sft-truncation-untrainable
Draft

fix(sft): drop examples with no trainable tokens after max_length truncation#6025
0xadvait wants to merge 1 commit into
huggingface:mainfrom
0xadvait:fix-sft-truncation-untrainable

Conversation

@0xadvait

@0xadvait 0xadvait commented Jun 12, 2026

Copy link
Copy Markdown

What does this PR do?

assistant_only_loss=True (or completion-only training) combined with max_length truncation can silently train on nothing.

The all-zero assistant_masks check in tokenize_fn runs before truncation, but truncation happens later, in DataCollatorForLanguageModeling. An example whose assistant/completion tokens all lie beyond max_length (e.g. the prompt alone reaches max_length) passes the check, then gets every label set to -100 by the collator. Since fully-masked micro-batches return a graph-connected zero loss (for DDP/FSDP safety), a batch made of such examples trains with no error and no warning, on current main:

{'loss': 0.0, 'grad_norm': nan, 'mean_token_accuracy': 0.0, 'num_tokens': 256.0, ...}
{'loss': 0.0, 'grad_norm': nan, 'mean_token_accuracy': 0.0, 'num_tokens': 512.0, ...}

(repro: 8 conversational examples with a ~200-token user turn and a short assistant reply, max_length=64, assistant_only_loss=True; num_tokens keeps counting while nothing is learned)

When only some examples are affected, they silently dilute training; the failure surfaces, if ever, as a confusing downstream error.

The fix

This implements the approach discussed in #3930 -- drop the affected examples at dataset preparation time and log how many were dropped -- without the dataset-wide trainable-token counting that stalled that PR:

  • after tokenization (and also for already-tokenized datasets), when max_length is set and packing is disabled, a batched filter keeps only examples with at least one trainable token within the truncation boundary
  • a token counts as trainable when every applicable mask is 1 at its position: assistant_masks is always applied by the collator when present, while completion_mask is only applied when completion_only_loss resolves to True -- the filter mirrors exactly that, so examples are never dropped based on a mask the collator won't use
  • both truncation_mode="keep_start" and "keep_end" are handled with the same slices the collator uses
  • if some examples are dropped: a warning with the dropped count and the relevant settings
  • if all examples are dropped: an actionable ValueError (previously: silent zero-loss training, or a cryptic downstream failure)
  • for IterableDataset, the same filter applies lazily; the counting/warning applies to map-style datasets where len() is available

No new dependencies, and a single extra batched filter pass (num_proc-parallel) that only runs when a mask column is present and truncation is active.

Tests

  • test_assistant_only_loss_drops_examples_with_no_trainable_tokens_after_truncation: 2 of 3 examples dropped, warning emitted, training proceeds on the survivor
  • test_assistant_only_loss_raises_if_no_example_has_trainable_tokens: clear ValueError when nothing survives
  • test_completion_only_loss_drops_examples_with_no_trainable_tokens_after_truncation: same protection for prompt-completion datasets via completion_mask
  • test_truncation_keep_end_keeps_examples_with_trailing_trainable_tokens: guards against over-dropping -- with keep_end, trailing completions survive and nothing is dropped

The first three fail without the fix; the last passes before and after.

Fixes #3927

Before submitting

AI writing disclosure

We welcome the use of AI tools to help with contributions. For transparency and to help us improve our review process, please indicate the level of AI involvement in this PR.

  • No AI usage: the PR was written entirely by a human.
  • AI-assisted: some parts were suggested or improved by AI, but the PR was written and reviewed by a human.
  • AI-generated: the PR was mostly or fully generated by an AI tool.

Who can review?

@qgallouedec (proposed the drop-and-warn design in #3930)


Note

Medium Risk
Changes dataset preparation for a common SFT configuration (max_length + assistant/completion-only loss), which can shrink training data and fail fast when all examples are invalid; logic must stay aligned with collator truncation to avoid incorrect drops.

Overview
Fixes silent zero-loss SFT when max_length truncation cuts off every assistant/completion token after tokenization checks already passed (#3927).

SFTTrainer now runs a post-tokenization filter (only when max_length is set, packing is off, and assistant_masks / applicable completion_mask columns exist). It keeps examples that still have at least one trainable token inside the same truncation window the collator uses (keep_start vs keep_end), matching which masks actually affect labels. Partial drops log a warning with counts; if everything would be dropped, it raises a clear ValueError instead of training on all -100 labels.

Tests cover assistant-only and completion-only drops, the all-dropped error path, and that truncation_mode="keep_end" does not over-drop trailing completions.

Reviewed by Cursor Bugbot for commit ea5dd34. Bugbot is set up for automated code reviews on this repo. Configure here.

…runcation

When all completion or assistant tokens of an example lie beyond the
max_length truncation boundary (e.g. the prompt alone reaches
max_length), every label is set to -100 after truncation in the data
collator. Such examples contribute nothing to the loss, and a batch
made entirely of them silently trains with loss 0.0 and grad_norm nan.

Drop these examples at dataset preparation time, warn with the dropped
count, and raise an actionable error when no example survives.

Fixes huggingface#3927
@qgallouedec

Copy link
Copy Markdown
Member

Thanks @0xadvait, nice diagnosis; the silent zero-loss from post-tokenization truncation is a real footgun.

My hesitation: has_trainable_tokens re-derives what the collator does (the keep_start/keep_end slice and the "which masks apply" rule). That logic is duplicated from DataCollatorForLanguageModeling, so if the collator ever changes, the filter silently drifts.

I'd rather fix it from the other end. The -100 labels only exist after the collator runs, which is why you have to reconstruct them. If we instead build the labels column in dataset preparation (in tokenize_fn, labels = input_ids with -100 where the applicable masks are 0):

  • the collator just truncates/pads labels: it already reads a labels column (torch_call) and its mask-handling block goes away;
  • "what's trainable" lives in one place;
  • dropping these examples becomes a trivial filter on the labels column instead of an AND over masks.

Truncation can stay in the collator (moving it fights padding_free/packing), so the filter still slices with the same window, but reads one column and carries none of the mask-precedence logic.

Since that's bigger than your diff, I'd split it: a first PR moving label construction into prep (and slimming the collator), then the drop-and-warn filter on top; almost trivial at that point. Happy to help with the first part. WDYT?

@0xadvait

Copy link
Copy Markdown
Author

Agreed on the drift risk; the filter already had to mirror the completion_only_loss resolution to avoid dropping rows the collator wouldn't mask, which is exactly the duplication you're describing.

I took a pass through what part 1 touches, and it's more drop-in than I expected: torch_call already reads example.get("labels", example["input_ids"]), truncates labels with the same slice, and pads them with -100, so prebuilt labels flow through as-is and only the mask-application block at the end goes away. The plumbing beyond that:

  • packing (columns = ["input_ids"] + masks) and the Liger collator_expected_keys need "labels" added; _signature_columns already includes it
  • pre-tokenized datasets (is_processed=True) skip tokenize_fn, so prep needs a small "build labels from masks if absent" step for them. Related: DataCollatorForLanguageModeling's mask handling is documented public behavior -- remove it outright, or keep it as a fallback when no labels column is present (with a FutureWarning per the deprecation policy)? That's the one real decision I see in part 1.
  • if tokenize_fn emits labels instead of the mask columns, storage stays roughly neutral and packing carries labels instead of masks
  • the vision collator keeps building labels itself, since vision examples aren't tokenized at prep time

For the residual duplication in part 2 (the truncation window), a tiny shared helper for the keep_start/keep_end slice used by both the collator and the filter would close that too.

Happy to draft part 1 along these lines as a separate PR and then strip this one down to the trivial labels filter on top -- or if you'd rather take part 1, I'll rebase once it's in. Converting this to draft meanwhile.

@0xadvait 0xadvait marked this pull request as draft June 12, 2026 10:06
@qgallouedec

Copy link
Copy Markdown
Member

Yes happy to receive a separate pr for 1.
I suggest that you ignore the vlm path, no easy and clean solution exist at this point.

@0xadvait

Copy link
Copy Markdown
Author

Part 1 is up: #6037. Once it lands I'll strip this PR down to the labels-column filter using the shared truncation window.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Using assistant_only_loss=True with sequence length > max_length fails silently

2 participants