Make evaluate() accept the same dataset types as the trainer#6116
Merged
Conversation
|
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit cb02741. Configure here.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.

Fixes #6115.
SFTTrainer,DPOTrainer, andRewardTrainerpreprocess datasets in__init__, butevaluate()(inherited fromtransformers.Trainer) does not. So a dataset passed directly toevaluate(eval_dataset=...)(e.g. a held-out test set) had to be manually preprocessed, while the exact same raw format works at init.This adds an
evaluate()override to each of these trainers that runs the same_prepare_datasetstep as__init__. It's idempotent (already-tokenized datasets pass through untouched) and leavesstrkeys (datasets prepared at init) alone.Each override mirrors its own
__init__block (SFT respectsskip_prepare_dataset/packing/formatting_func; DPO skips for vision datasets; Reward always prepares).Online trainers (GRPO, RLOO) don't need this: their eval datasets are prompt-only and processed during generation.
Note
Low Risk
Targeted API fix with mirrored init logic and new tests; Reward’s
pad_token_idconfig sync is a small behavioral fix for eval/scoring correctness.Overview
evaluate(eval_dataset=...)now accepts the same raw dataset formats as trainer construction forSFTTrainer,DPOTrainer, andRewardTrainer(fixes #6115). Each trainer overridesevaluateto run the same_prepare_datasetpath as__init__before delegating totransformers.Trainer, with idempotent handling for already-tokenized data and no re-processing forstrdataset keys.DPO skips preparation for vision datasets; when
precompute_ref_log_probsis enabled it also precomputes reference log-probs on ad-hoc eval data. SFT stores_formatting_funcand_skip_prepare_dataseton the instance so eval uses the same packing/formatting rules as training. Reward additionally setsmodel.config.pad_token_idfrom the tokenizer so sequence-classification scoring can find the last non-pad token during eval.Regression tests cover raw (and dict) eval datasets for all three trainers, including DPO with/without precomputed ref log-probs.
Reviewed by Cursor Bugbot for commit 06305ff. Bugbot is set up for automated code reviews on this repo. Configure here.