Make `evaluate()` accept the same dataset types as the trainer by qgallouedec · Pull Request #6116 · huggingface/trl

qgallouedec · 2026-06-19T13:26:06Z

SFTTrainer, DPOTrainer, and RewardTrainer preprocess datasets in __init__, but evaluate() (inherited from transformers.Trainer) does not. As a result, a dataset passed directly to evaluate(eval_dataset=...) (e.g. a held-out test set) had to be preprocessed manually, even though the exact same raw format works at init.

This adds an evaluate() override to each of these trainers that runs the same _prepare_dataset step as __init__. It's idempotent (already-tokenized datasets pass through untouched) and leaves str keys alone, since they refer to datasets already prepared at init.
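The override pattern described here can be sketched with toy classes. This is only an illustration under stated assumptions, not the code from the PR: `BaseTrainer`, `ToyTrainer`, the `_prepare` helper, and the character-level "tokenizer" are all hypothetical stand-ins, and datasets are modeled as plain lists of dicts.

```python
class BaseTrainer:
    """Hypothetical stand-in for transformers.Trainer: evaluate() needs prepared rows."""

    def __init__(self, eval_dataset=None):
        self.eval_dataset = eval_dataset

    def evaluate(self, eval_dataset=None):
        if isinstance(eval_dataset, str):
            # A str key selects one of the datasets registered at init.
            eval_dataset = self.eval_dataset[eval_dataset]
        elif eval_dataset is None:
            eval_dataset = self.eval_dataset
        if isinstance(eval_dataset, dict):
            # A dict of datasets is evaluated one entry at a time.
            return {name: self.evaluate(rows) for name, rows in eval_dataset.items()}
        return {"eval_tokens": sum(len(row["input_ids"]) for row in eval_dataset)}


class ToyTrainer(BaseTrainer):
    """Hypothetical trainer that prepares datasets at init and again in evaluate()."""

    def __init__(self, eval_dataset=None):
        super().__init__(self._prepare(eval_dataset))

    def _prepare_dataset(self, rows):
        # Idempotent: rows that already carry "input_ids" are returned unchanged.
        # The "tokenizer" here is a toy that maps each character to its code point.
        return [
            row if "input_ids" in row else {"input_ids": [ord(c) for c in row["text"]]}
            for row in rows
        ]

    def _prepare(self, eval_dataset):
        # None and str keys refer to data already prepared at init: leave them alone.
        if eval_dataset is None or isinstance(eval_dataset, str):
            return eval_dataset
        if isinstance(eval_dataset, dict):
            return {name: self._prepare_dataset(rows) for name, rows in eval_dataset.items()}
        return self._prepare_dataset(eval_dataset)

    def evaluate(self, eval_dataset=None):
        # Same preparation path as __init__, then delegate to the parent.
        return super().evaluate(self._prepare(eval_dataset))
```

In this sketch a raw list, an already-tokenized list, a dict of raw lists, and a str key all go through the same `evaluate()` call; only the first and third trigger any tokenization.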

Each override mirrors its own __init__ block (SFT respects skip_prepare_dataset/packing/formatting_func; DPO skips for vision datasets; Reward always prepares).
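As a rough summary of the per-trainer conditions listed above, the decision of whether an eval dataset gets prepared could be written as a small predicate. `PrepareContext` and `should_prepare` are hypothetical names introduced for illustration. The sketch only encodes what the description states and does not model how packing or formatting_func affect the preparation itself.

```python
from dataclasses import dataclass


@dataclass
class PrepareContext:
    """Hypothetical bundle of the flags the description mentions."""

    trainer: str                        # "sft", "dpo", or "reward"
    skip_prepare_dataset: bool = False  # only consulted for SFT
    is_vision_dataset: bool = False     # only consulted for DPO


def should_prepare(ctx: PrepareContext) -> bool:
    """Return True when evaluate() would run the preparation step."""
    if ctx.trainer == "sft":
        return not ctx.skip_prepare_dataset
    if ctx.trainer == "dpo":
        return not ctx.is_vision_dataset
    if ctx.trainer == "reward":
        return True
    raise ValueError(f"unknown trainer kind: {ctx.trainer!r}")
```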

Online trainers (GRPO, RLOO) don't need this: their eval datasets are prompt-only and processed during generation.

Note

Low Risk
Targeted API fix with mirrored init logic and new tests; Reward’s pad_token_id config sync is a small behavioral fix for eval/scoring correctness.

Overview
evaluate(eval_dataset=...) now accepts the same raw dataset formats as trainer construction for SFTTrainer, DPOTrainer, and RewardTrainer (fixes #6115). Each trainer overrides evaluate to run the same _prepare_dataset path as __init__ before delegating to transformers.Trainer, with idempotent handling for already-tokenized data and no re-processing for str dataset keys.
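From the caller's side, the change might look like the following sketch. The model id, dataset name, and output directory are assumptions chosen for illustration (the dataset is assumed to expose raw "train" and "test" splits); none of them come from the PR. The imports sit inside the function so that defining it needs neither `trl` nor `datasets` installed.

```python
def evaluate_on_raw_test_split(model_id="Qwen/Qwen2.5-0.5B"):
    """Evaluate on a raw split that was never passed to the trainer's __init__."""
    # Local imports: calling this function requires `datasets` and `trl`.
    from datasets import load_dataset
    from trl import SFTConfig, SFTTrainer

    # Assumed dataset with raw (untokenized) "train" and "test" splits.
    dataset = load_dataset("trl-lib/Capybara")

    trainer = SFTTrainer(
        model=model_id,
        args=SFTConfig(output_dir="sft-eval-demo", report_to="none"),
        train_dataset=dataset["train"].select(range(100)),  # small subset for speed
    )

    # Before this PR the test split had to be preprocessed by hand first;
    # with the override it is expected to be accepted in its raw format.
    return trainer.evaluate(eval_dataset=dataset["test"])
```

The same shape of call should apply to DPOTrainer and RewardTrainer with their respective raw formats.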

DPO skips preparation for vision datasets; when precompute_ref_log_probs is enabled it also precomputes reference log-probs on ad-hoc eval data. SFT stores _formatting_func and _skip_prepare_dataset on the instance so eval uses the same packing/formatting rules as training. Reward additionally sets model.config.pad_token_id from the tokenizer so sequence-classification scoring can find the last non-pad token during eval.
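To see why the pad_token_id sync matters for Reward, here is a minimal pure-Python sketch of how a sequence-classification head can pick the position it scores. `last_non_pad_index` is a hypothetical helper, right padding is assumed, and the actual transformers implementation is vectorized and differs in detail. To my understanding it also rejects batches larger than one when no pad token is configured.

```python
def last_non_pad_index(input_ids, pad_token_id):
    """Return the position a sequence-classification head would score.

    Assumes right padding. Without a pad token id there is no way to tell
    padding from content, so the last position is used.
    """
    if pad_token_id is None:
        return len(input_ids) - 1
    # Walk backwards to the rightmost token that is not padding.
    for index in range(len(input_ids) - 1, -1, -1):
        if input_ids[index] != pad_token_id:
            return index
    return 0  # all-padding row: fall back to the first position


padded_row = [11, 12, 13, 0, 0]  # 0 plays the role of the pad token
with_pad = last_non_pad_index(padded_row, pad_token_id=0)        # index of token 13
without_pad = last_non_pad_index(padded_row, pad_token_id=None)  # a padding position
```

With the pad id known, the score is read at the last real token. Without it, the padded row is scored at a padding position, which is the eval correctness issue the summary refers to.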

Regression tests cover raw (and dict) eval datasets for all three trainers, including DPO with/without precomputed ref log-probs.

Reviewed by Cursor Bugbot for commit 06305ff. Bugbot is set up for automated code reviews on this repo.

bot-ci-comment · 2026-06-19T13:28:42Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

cursor

Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.

Reviewed by Cursor Bugbot for commit cb02741.

albertvillanova

Thanks.

Make evaluate() accept the same dataset types as the trainer

012afbb

qgallouedec requested review from AmineDiro, albertvillanova and kashif June 19, 2026 13:26

cursor Bot reviewed Jun 19, 2026

Comment thread trl/trainer/dpo_trainer.py

Comment thread trl/trainer/dpo_trainer.py

qgallouedec and others added 2 commits June 19, 2026 13:40

fix reward

0b72143

Merge branch 'main' into fix-evaluate-raw-dataset

cb02741

cursor Bot reviewed Jun 19, 2026

Comment thread trl/trainer/dpo_trainer.py

Merge branch 'main' into fix-evaluate-raw-dataset

06305ff

albertvillanova approved these changes Jun 20, 2026

qgallouedec merged commit 7b4c3b3 into main Jun 20, 2026
13 checks passed

qgallouedec deleted the fix-evaluate-raw-dataset branch June 20, 2026 13:22

qgallouedec merged 4 commits into main from fix-evaluate-raw-dataset
