
Apply TrainingArguments.max_length during embedding training #642

Open
robbiebusinessacc wants to merge 1 commit into huggingface:main from robbiebusinessacc:contrib/setfit-embedding-max-length


@robbiebusinessacc


Fixes #561.

TrainingArguments.max_length is documented as "the maximum token length a
tokenizer can generate," and it is applied when fitting the classifier head
(Trainer.train_classifier passes it to SetFitModel.fit, which builds the
dataloader with that length). It was never applied to the embedding phase,
though: Trainer.train_embeddings finetunes the SentenceTransformer body via
the underlying trainer without ever setting the body's truncation length, so
max_length had no effect there — matching the report that batches were padded
to the longest example instead of being truncated.
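To illustrate the symptom described above, here is a minimal sketch of how the padded length of a batch differs with and without truncation. It uses whitespace splitting as a stand-in for a real tokenizer, and `padded_batch_length` is a hypothetical helper written for this illustration, not part of SetFit or sentence-transformers.

```python
def padded_batch_length(token_lists, max_length=None):
    """Return the sequence length a batch would be padded to.

    Without max_length, the batch is padded to its longest example.
    With max_length, each example is truncated first.
    """
    lengths = [len(tokens) for tokens in token_lists]
    if max_length is not None:
        lengths = [min(n, max_length) for n in lengths]
    return max(lengths)


# Whitespace splitting is an assumption made for illustration only.
texts = ["short text", "a " * 300]
batch = [text.split() for text in texts]

untruncated_length = padded_batch_length(batch)
truncated_length = padded_batch_length(batch, max_length=64)
```

In this toy batch the single long example drives the padded length when no truncation is applied, which is the behavior reported for the embedding phase.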

This sets the body's max_seq_length to max_length for the embedding phase
(clamped to the model's maximum, mirroring the existing clamp in
_prepare_dataloader) and restores the original value afterwards, so
inference-time encoding is unaffected — keeping the same non-persistent
behavior the classifier phase already has.
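The set, clamp, and restore pattern described above could be sketched roughly as follows. This is not the code from this PR: `StandInBody` and `temporary_max_seq_length` are hypothetical names, and the stand-in only assumes that the body exposes a `max_seq_length` attribute and a `get_max_seq_length()` method.

```python
from contextlib import contextmanager


class StandInBody:
    """Hypothetical stand-in for the model body (not the real class)."""

    def __init__(self, max_seq_length):
        self.max_seq_length = max_seq_length

    def get_max_seq_length(self):
        return self.max_seq_length


@contextmanager
def temporary_max_seq_length(body, max_length):
    """Truncate `body` to `max_length` inside the block, then restore it."""
    original = body.max_seq_length
    if max_length is not None:
        limit = body.get_max_seq_length()
        # Clamp so the requested length never exceeds the body's limit.
        body.max_seq_length = max_length if limit is None else min(max_length, limit)
    try:
        yield body
    finally:
        # Restore even if training raises, so later encoding is unaffected.
        body.max_seq_length = original


body = StandInBody(max_seq_length=512)
with temporary_max_seq_length(body, max_length=64):
    length_during_training = body.max_seq_length  # embedding training would run here
length_after_training = body.max_seq_length
```

Restoring in a `finally` block is one way to keep the change non-persistent when training fails partway through; whether the PR takes the same approach is not stated in the description.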

Test: test_trainer_max_length_applied_to_embedding_phase asserts the body is
truncated to max_length during the embedding phase and restored afterwards.
It fails on main (the body stays at its default 100) and passes with this
change.

`max_length` was honored when fitting the classifier head (via
`SetFitModel.fit`) but ignored while finetuning the `SentenceTransformer`
body, so the configured value had no effect on the embedding phase. Set
the body's `max_seq_length` for that phase, clamped to the model's
maximum to mirror `_prepare_dataloader`, and restore it afterwards so
encoding at inference time is unaffected.

Fixes huggingface#561

Development

Successfully merging this pull request may close these issues.

max_length parameter of TrainingArguments not applied
