Apply `TrainingArguments.max_length` during embedding training by robbiebusinessacc · Pull Request #642 · huggingface/setfit

robbiebusinessacc · 2026-06-16T15:05:36Z

Fixes #561.

`TrainingArguments.max_length` is documented as "the maximum token length a
tokenizer can generate," and it is applied when fitting the classifier head
(`Trainer.train_classifier` passes it to `SetFitModel.fit`, which builds the
dataloader with that length). It was never applied to the embedding phase,
though: `Trainer.train_embeddings` finetunes the `SentenceTransformer` body via
the underlying trainer without ever setting the body's truncation length, so
`max_length` had no effect there. This matches the report that batches were
padded to the longest example instead of being truncated.
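To make the reported symptom concrete, here is a minimal pure-Python sketch of the difference between padding a batch to its longest example and truncating it first. `pad_batch` and `pad_id` are hypothetical names for illustration only; this is not SetFit's or sentence-transformers' actual collation code.

```python
from typing import List, Optional


def pad_batch(
    token_ids: List[List[int]],
    max_length: Optional[int] = None,
    pad_id: int = 0,
) -> List[List[int]]:
    """Pad every sequence in the batch to a common width.

    If max_length is given, sequences are truncated to it first, so the
    batch width can never exceed max_length. If it is None, the width is
    simply the length of the longest example in the batch.
    """
    if max_length is not None:
        token_ids = [ids[:max_length] for ids in token_ids]
    width = max(len(ids) for ids in token_ids)
    return [ids + [pad_id] * (width - len(ids)) for ids in token_ids]


# One short example and one long example in the same batch.
batch = [[1, 2, 3], [4] * 10]

untruncated = pad_batch(batch)               # width follows the longest example
truncated = pad_batch(batch, max_length=4)   # width is capped by max_length
```

Without a truncation length, a single long input sets the width for the whole batch, which is the behavior the issue describes for the embedding phase.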

This sets the body's `max_seq_length` to `max_length` for the embedding phase
(clamped to the model's maximum, mirroring the existing clamp in
`_prepare_dataloader`) and restores the original value afterwards, so
inference-time encoding is unaffected. This keeps the same non-persistent
behavior the classifier phase already has.
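The set, clamp, and restore pattern described above can be sketched as a context manager. This is an illustration under stated assumptions, not the code in this PR: `temporary_max_seq_length` and its `model_max` parameter are hypothetical names, and a `SimpleNamespace` stands in for the `SentenceTransformer` body, of which only the `max_seq_length` attribute is used.

```python
from contextlib import contextmanager
from types import SimpleNamespace
from typing import Any, Iterator, Optional


@contextmanager
def temporary_max_seq_length(
    body: Any, max_length: Optional[int], model_max: int
) -> Iterator[Any]:
    """Temporarily set body.max_seq_length, clamped to model_max.

    The original value is restored on exit, including when the wrapped
    block raises, so later encoding sees the unchanged setting.
    """
    original = body.max_seq_length
    if max_length is not None:
        # Clamp so the requested length never exceeds what the model supports.
        body.max_seq_length = min(max_length, model_max)
    try:
        yield body
    finally:
        body.max_seq_length = original


# Stand-in for the model body (assumption: only this attribute matters here).
body = SimpleNamespace(max_seq_length=100)

with temporary_max_seq_length(body, max_length=32, model_max=512):
    length_during_training = body.max_seq_length  # embedding phase would run here

length_after_training = body.max_seq_length
```

Restoring in a `finally` block is one way to keep the change non-persistent even if training is interrupted; whether the PR does it this way is not stated in the description.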

Test: `test_trainer_max_length_applied_to_embedding_phase` asserts the body is
truncated to `max_length` during the embedding phase and restored afterwards.
It fails on main (the body stays at its default of 100) and passes with this
change.

`max_length` was honored when fitting the classifier head (via `SetFitModel.fit`) but ignored while finetuning the `SentenceTransformer` body, so the configured value had no effect on the embedding phase. Set the body's `max_seq_length` for that phase, clamped to the model's maximum to mirror `_prepare_dataloader`, and restore it afterwards so encoding at inference time is unaffected. Fixes huggingface#561

robbiebusinessacc wants to merge 1 commit into huggingface:main from robbiebusinessacc:contrib/setfit-embedding-max-length
