feat: reproducible train/eval split for post-training exports (#103) by dcfocus · Pull Request #112 · lance-format/lance-context

dcfocus · 2026-06-25T06:56:11Z

Summary

Follow-up to #96: adds a group-disjoint, reproducible train/eval split to the export path, so eval scores aren't inflated by conversation/session leakage across the boundary. Closes #103.

Stacked on #111 (#96). Review/merge #111 first; until then this PR shows both commits (the split commit is last). The diff reduces to just #103 once #111 lands.

What it does

Adds SplitConfig { eval_fraction, by, seed } to ExportConfig. Each record's side is decided by a stable hash (FNV-1a + MurmurHash3 fmix64 finalizer) of its grouping key (the field selected by by) plus seed, so:
- no group (default session_id; also run_id / tenant / source / bot_id / external-id prefix) spans both sides, and
- the same seed reproduces the identical partition across runs/platforms (the hash is fully specified, unlike Rust's DefaultHasher, whose algorithm is not guaranteed to stay the same across releases).
With split set, export_training writes <path>.train.<ext> and <path>.eval.<ext>, each with its own manifest recording the split params (eval_fraction, by, seed, side) and the complementary output path. The returned manifest is the train side.
Curation (dedup/decontam/lifecycle, from #96) runs once before the split, so the two compose: decontamination drops train rows near eval; the split keeps groups disjoint.
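
For reviewers who want to see the idea in miniature, here is a rough Python sketch of a hash-based, group-disjoint assignment. It is not the code in this PR: the byte layout used to combine seed and key, the threshold comparison, and the names fnv1a64, fmix64, and side_for are all assumptions made for illustration. Only the two hash building blocks (64-bit FNV-1a and the MurmurHash3 fmix64 finalizer) are taken from the description above.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF
FNV_OFFSET_BASIS = 0xCBF29CE484222325  # standard 64-bit FNV offset basis
FNV_PRIME = 0x100000001B3              # standard 64-bit FNV prime


def fnv1a64(data: bytes) -> int:
    """64-bit FNV-1a: XOR each byte in, then multiply by the FNV prime."""
    h = FNV_OFFSET_BASIS
    for byte in data:
        h ^= byte
        h = (h * FNV_PRIME) & MASK64
    return h


def fmix64(h: int) -> int:
    """MurmurHash3 64-bit finalizer, used to spread out the FNV-1a bits."""
    h ^= h >> 33
    h = (h * 0xFF51AFD7ED558CCD) & MASK64
    h ^= h >> 33
    h = (h * 0xC4CEB9FE1A85EC53) & MASK64
    h ^= h >> 33
    return h


def side_for(group_key: str, seed: int, eval_fraction: float) -> str:
    """Assign a whole group to 'train' or 'eval'.

    Assumption: the seed is prepended as 8 little-endian bytes and the hash
    is mapped to [0, 1] and compared against eval_fraction. The real
    implementation may combine and threshold these differently.
    """
    payload = seed.to_bytes(8, "little") + group_key.encode("utf-8")
    unit = fmix64(fnv1a64(payload)) / 2**64
    return "eval" if unit < eval_fraction else "train"


if __name__ == "__main__":
    # Every record sharing a session_id gets the same side, on every run.
    for session in ("session-a", "session-b", "session-c"):
        print(session, side_for(session, seed=42, eval_fraction=0.1))
```

Because the side depends only on the group key and the seed, records from one group cannot land on both sides, and no state has to be carried between runs.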

Python

ctx.export_training(
    "training/sft.jsonl",
    task="sft",
    split={"eval_fraction": 0.1, "by": "session_id", "seed": 42},
)
# -> training/sft.train.jsonl + training/sft.eval.jsonl (+ manifests)
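
A small sketch of how the output names and per-side split metadata relate, assuming the <path>.train.<ext> / <path>.eval.<ext> naming described above. The helper split_outputs and the key names "path" and "complement_path" are hypothetical; the actual manifest schema is whatever the PR's export code writes.

```python
from pathlib import PurePosixPath


def split_outputs(path: str, split: dict) -> dict:
    """Derive per-side output paths and the split params each manifest records.

    Hypothetical helper: only the naming pattern and the recorded params
    (eval_fraction, by, seed, side, complementary path) come from the PR text.
    """
    p = PurePosixPath(path)
    paths = {
        side: str(p.with_name(f"{p.stem}.{side}{p.suffix}"))
        for side in ("train", "eval")
    }
    manifests = {}
    for side, other in (("train", "eval"), ("eval", "train")):
        manifests[side] = {
            "path": paths[side],
            "split": {**split, "side": side},
            "complement_path": paths[other],  # hypothetical key name
        }
    return manifests


if __name__ == "__main__":
    out = split_outputs(
        "training/sft.jsonl",
        {"eval_fraction": 0.1, "by": "session_id", "seed": 42},
    )
    for side, manifest in out.items():
        print(side, manifest)
```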

Tests

core (3): determinism (same seed → identical files), group-disjointness, fraction-within-tolerance, and split manifests with complement path.
python (2): split disjointness + determinism.
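
The properties these tests cover can be checked on any pair of exports with a few lines of Python. The sketch below is not the test code from this PR. It assumes each exported row is a dict that still carries the grouping key, and it measures the eval fraction over groups rather than records; both are assumptions, and check_split is a hypothetical name.

```python
def check_split(train_rows, eval_rows, by="session_id",
                eval_fraction=0.1, tolerance=0.05):
    """Report group overlap and how close the eval share is to the target.

    Assumptions: rows are dicts containing the `by` key, and the fraction is
    measured as eval groups / all groups.
    """
    train_groups = {row[by] for row in train_rows}
    eval_groups = {row[by] for row in eval_rows}
    overlap = train_groups & eval_groups
    total = len(train_groups | eval_groups)
    observed = len(eval_groups) / total if total else 0.0
    return {
        "overlap": overlap,
        "observed_eval_fraction": observed,
        "within_tolerance": abs(observed - eval_fraction) <= tolerance,
    }


if __name__ == "__main__":
    train = [{"session_id": f"s{i}", "text": "..."} for i in range(9)]
    evals = [{"session_id": "s9", "text": "..."}]
    report = check_split(train, evals)
    # An empty overlap set means no group spans both sides.
    print(report)
```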

README updated.

Checks

cargo test -p lance-context-core --lib — 62 passed
cargo fmt --all -- --check, cargo clippy --workspace --all-targets -- -D warnings — clean
ruff format --check, ruff check, pyright — clean
test_export_training.py — 10 passed

Acceptance criteria (#103)

Split into train/eval by a grouping key with a fixed seed; re-running yields the identical partition
No group (e.g. session_id) appears in both train and eval
Each output carries a manifest recording split params
Tests cover determinism, group-disjointness, and fraction accuracy within tolerance

Optional follow-up (noted in the issue): also emit the eval side in #98's labeled query-set format so retrieval-eval reuses the same artifact — deferred to keep this focused and avoid coupling to the (separate) #98 PR.

Closes #103

…format#103)

Follow-up to lance-format#96. Adds a group-disjoint, reproducible train/eval split to the export path so eval scores aren't inflated by conversation leakage.

- `SplitConfig { eval_fraction, by, seed }` on `ExportConfig`. Each record is assigned to train or eval by a stable hash (FNV-1a + MurmurHash3 fmix64 finalizer) of its `by` grouping key plus `seed`, so:
  - no group (default `session_id`, configurable to run_id / tenant / source / bot_id / external-id prefix) spans both sides, and
  - the same seed reproduces the identical partition across runs and platforms (the hash is fully defined, not `DefaultHasher`).
- When `split` is set, `export_training` writes `<path>.train.<ext>` and `<path>.eval.<ext>`, each with its own manifest recording the split params (`eval_fraction`, `by`, `seed`, `side`) and the complementary output path. The returned manifest is the train side.
- Curation (dedup/decontam/lifecycle) runs once before the split, so the two compose: decontamination drops train rows near eval, the split keeps groups disjoint.

Python: `ctx.export_training(..., split={"eval_fraction": 0.1, "by": "session_id", "seed": 42})`.

Tests: core (determinism, group-disjointness, fraction-within-tolerance, split manifests + complement path) and python (split disjointness + determinism).

README updated.

Stacked on lance-format#96 (PR lance-format#111). Closes lance-format#103

)

Follow-up to lance-format#96. Adds an optional `<output_path>.stats.json` report so a training cut is auditable at a glance without loading the JSONL.

- `ExportConfig.emit_stats` gates a stats artifact computed during the existing export pass (no extra dataset re-materialization).
- `ExportStats` covers:
  - counts: examples, by `role`, by `source`, by `tenant`;
  - token stats (min/median/p95/max/mean) from `state_metadata.tokens_used`, falling back to a whitespace word-count proxy, with a `source` field (`tokens_used` / `length_proxy` / `mixed`);
  - grouping: `num_groups` + records-per-group distribution;
  - curation accounting: records excluded by reason (lifecycle, reward threshold, dedup, decontamination), derived from the curation counts;
  - reward distribution and `reward_source` breakdown (rollouts), plus the preference form.
- With a train/eval split, each side gets its own `.stats.json`.

Python: `ctx.export_training(..., emit_stats=True)` writes the sibling report; read it from `<output_path>.stats.json`.

Tests: core (role/source/tenant counts, token stats + length-proxy fallback, lifecycle exclusions, rollout reward distribution, flag gating) and python (stats contents + flag gating).

README updated.

Stacked on lance-format#96 (PR lance-format#111) and lance-format#103 (PR lance-format#112). Closes lance-format#104

## Summary

Follow-up to #96: emits an optional **dataset statistics report** (`<output_path>.stats.json`) alongside each export so a training cut is auditable at a glance, without loading the JSONL. Closes #104.

> **Stacked on #111 (#96) and #112 (#103).** Review/merge those first; until then this PR shows their commits too (the stats commit is last).

## What it does

- `ExportConfig.emit_stats` gates a stats artifact computed **during the existing export pass** (no extra dataset re-materialization).
- `ExportStats` covers everything the issue asks for:
  - **counts**: examples, by `role`, by `source`, by `tenant`;
  - **token stats** (min/median/p95/max/mean) from `state_metadata.tokens_used`, falling back to a whitespace word-count proxy, with a `source` field (`tokens_used` / `length_proxy` / `mixed`);
  - **grouping**: `num_groups` + records-per-group distribution;
  - **curation accounting**: records excluded by reason (lifecycle, reward threshold, dedup, decontamination), derived from the curation counts;
  - **reward distribution** + `reward_source` breakdown (rollouts), plus the preference form.
- With a train/eval split (#103), each side gets its own `.stats.json`.

## Python

```python
ctx.export_training("training/sft.jsonl", task="sft", emit_stats=True)
# -> training/sft.jsonl.stats.json
```

## Tests

- **core** (4): role/source/tenant counts + token stats + lifecycle exclusions, length-proxy fallback, rollout reward distribution + reward_source, flag gating.
- **python** (2): stats contents + flag gating.

README updated.

## Checks

- `cargo test -p lance-context-core --lib`: 66 passed
- `cargo fmt --all -- --check`, `cargo clippy --workspace --all-targets -- -D warnings`: clean
- `ruff format --check`, `ruff check`, `pyright`: clean
- `test_export_training.py`: 12 passed

## Acceptance criteria (#104)

- [x] Export optionally emits a stats JSON with the counts/distributions above
- [x] Token stats use `state_metadata.tokens_used` when available, with a documented fallback (whitespace word count)
- [x] Curation accounting reports excluded counts by reason
- [x] Stats computed in the export pass (no full re-materialization)
- [x] Tests verify counts/distributions against known fixtures

Closes #104

dcfocus mentioned this pull request Jun 25, 2026

feat: emit dataset statistics report alongside exports (#104) #113

Merged

5 tasks

dcfocus force-pushed the feat/issue-103-train-eval-split branch from 6f30288 to 5124edc June 27, 2026 06:03

dcfocus merged commit 7e0a0fb into lance-format:main Jun 27, 2026
9 checks passed

dcfocus deleted the feat/issue-103-train-eval-split branch June 27, 2026 19:04

dcfocus merged 1 commit into lance-format:main from dcfocus:feat/issue-103-train-eval-split
