Apply TrainingArguments.max_length during embedding training#642
Open
robbiebusinessacc wants to merge 1 commit into
Open
Apply TrainingArguments.max_length during embedding training#642robbiebusinessacc wants to merge 1 commit into
TrainingArguments.max_length during embedding training#642robbiebusinessacc wants to merge 1 commit into
Conversation
`max_length` was honored when fitting the classifier head (via `SetFitModel.fit`) but ignored while finetuning the `SentenceTransformer` body, so the configured value had no effect on the embedding phase. Set the body's `max_seq_length` for that phase, clamped to the model's maximum to mirror `_prepare_dataloader`, and restore it afterwards so encoding at inference time is unaffected. Fixes huggingface#561
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fixes #561.
TrainingArguments.max_lengthis documented as "the maximum token length atokenizer can generate," and it is applied when fitting the classifier head
(
Trainer.train_classifierpasses it toSetFitModel.fit, which builds thedataloader with that length). It was never applied to the embedding phase,
though:
Trainer.train_embeddingsfinetunes theSentenceTransformerbody viathe underlying trainer without ever setting the body's truncation length, so
max_lengthhad no effect there — matching the report that batches were paddedto the longest example instead of being truncated.
This sets the body's
max_seq_lengthtomax_lengthfor the embedding phase(clamped to the model's maximum, mirroring the existing clamp in
_prepare_dataloader) and restores the original value afterwards, soinference-time encoding is unaffected — keeping the same non-persistent
behavior the classifier phase already has.
Test:
test_trainer_max_length_applied_to_embedding_phaseasserts the body istruncated to
max_lengthduring the embedding phase and restored afterwards.It fails on
main(the body stays at its default100) and passes with thischange.