feat: reproducible train/eval split for post-training exports (#103) #112
Merged
dcfocus merged 1 commit on Jun 27, 2026
…format#103)

Follow-up to lance-format#96. Adds a group-disjoint, reproducible train/eval split to the export path so eval scores aren't inflated by conversation leakage.

- `SplitConfig { eval_fraction, by, seed }` on `ExportConfig`. Each record is assigned to train or eval by a stable hash (FNV-1a + MurmurHash3 fmix64 finalizer) of its `by` grouping key plus `seed`, so:
  - no group (default `session_id`, configurable to run_id / tenant / source / bot_id / external-id prefix) spans both sides, and
  - the same seed reproduces the identical partition across runs and platforms (the hash is fully defined, not `DefaultHasher`).
- When `split` is set, `export_training` writes `<path>.train.<ext>` and `<path>.eval.<ext>`, each with its own manifest recording the split params (`eval_fraction`, `by`, `seed`, `side`) and the complementary output path. The returned manifest is the train side.
- Curation (dedup/decontam/lifecycle) runs once before the split, so the two compose: decontamination drops train rows near eval, the split keeps groups disjoint.

Python: `ctx.export_training(..., split={"eval_fraction": 0.1, "by": "session_id", "seed": 42})`.

Tests: core (determinism, group-disjointness, fraction-within-tolerance, split manifests + complement path) and python (split disjointness + determinism). README updated.

Stacked on lance-format#96 (PR lance-format#111). Closes lance-format#103
6f30288 to 5124edc
dcfocus added a commit to dcfocus/lance-context that referenced this pull request on Jun 27, 2026
)

Follow-up to lance-format#96. Adds an optional `<output_path>.stats.json` report so a training cut is auditable at a glance without loading the JSONL.

- `ExportConfig.emit_stats` gates a stats artifact computed during the existing export pass (no extra dataset re-materialization).
- `ExportStats` covers:
  - counts: examples, by `role`, by `source`, by `tenant`;
  - token stats (min/median/p95/max/mean) from `state_metadata.tokens_used`, falling back to a whitespace word-count proxy, with a `source` field (`tokens_used` / `length_proxy` / `mixed`);
  - grouping: `num_groups` + records-per-group distribution;
  - curation accounting: records excluded by reason (lifecycle, reward threshold, dedup, decontamination), derived from the curation counts;
  - reward distribution and `reward_source` breakdown (rollouts), plus the preference form.
- With a train/eval split, each side gets its own `.stats.json`.

Python: `ctx.export_training(..., emit_stats=True)` writes the sibling report; read it from `<output_path>.stats.json`.

Tests: core (role/source/tenant counts, token stats + length-proxy fallback, lifecycle exclusions, rollout reward distribution, flag gating) and python (stats contents + flag gating). README updated.

Stacked on lance-format#96 (PR lance-format#111) and lance-format#103 (PR lance-format#112). Closes lance-format#104
dcfocus added a commit that referenced this pull request on Jun 27, 2026
## Summary

Follow-up to #96: emits an optional **dataset statistics report** (`<output_path>.stats.json`) alongside each export so a training cut is auditable at a glance, without loading the JSONL. Closes #104.

> **Stacked on #111 (#96) and #112 (#103).** Review/merge those first; until then this PR shows their commits too (the stats commit is last).

## What it does

- `ExportConfig.emit_stats` gates a stats artifact computed **during the existing export pass** (no extra dataset re-materialization).
- `ExportStats` covers everything the issue asks for:
  - **counts**: examples, by `role`, by `source`, by `tenant`;
  - **token stats** (min/median/p95/max/mean) from `state_metadata.tokens_used`, falling back to a whitespace word-count proxy, with a `source` field (`tokens_used` / `length_proxy` / `mixed`);
  - **grouping**: `num_groups` + records-per-group distribution;
  - **curation accounting**: records excluded by reason (lifecycle, reward threshold, dedup, decontamination), derived from the curation counts;
  - **reward distribution** + `reward_source` breakdown (rollouts), plus the preference form.
- With a train/eval split (#103), each side gets its own `.stats.json`.

## Python

```python
ctx.export_training("training/sft.jsonl", task="sft", emit_stats=True)
# -> training/sft.jsonl.stats.json
```

## Tests

- **core** (4): role/source/tenant counts + token stats + lifecycle exclusions, length-proxy fallback, rollout reward distribution + reward_source, flag gating.
- **python** (2): stats contents + flag gating.

README updated.

## Checks

- `cargo test -p lance-context-core --lib`: 66 passed
- `cargo fmt --all -- --check`, `cargo clippy --workspace --all-targets -- -D warnings`: clean
- `ruff format --check`, `ruff check`, `pyright`: clean
- `test_export_training.py`: 12 passed

## Acceptance criteria (#104)

- [x] Export optionally emits a stats JSON with the counts/distributions above
- [x] Token stats use `state_metadata.tokens_used` when available, with a documented fallback (whitespace word count)
- [x] Curation accounting reports excluded counts by reason
- [x] Stats computed in the export pass (no full re-materialization)
- [x] Tests verify counts/distributions against known fixtures

Closes #104
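For readers who want a feel for the token-stats fallback described above, here is a minimal Python sketch, not the Rust implementation. It assumes each record is a dict with an optional `state_metadata.tokens_used` and a hypothetical `text` field used for the whitespace word-count proxy; the nearest-rank definition of p95 is also an assumption, since the PR does not say which percentile method is used.

```python
import math
import statistics


def token_stats(records):
    """Summarise token counts, falling back to a whitespace word count."""
    counts = []
    used_real = used_proxy = False
    for rec in records:
        tokens = (rec.get("state_metadata") or {}).get("tokens_used")
        if tokens is not None:
            counts.append(int(tokens))
            used_real = True
        else:
            # Hypothetical `text` field: the proxy is a whitespace word count.
            counts.append(len(rec.get("text", "").split()))
            used_proxy = True
    if not counts:
        return None
    counts.sort()
    # Assumed percentile method: nearest rank on the sorted counts.
    rank = max(1, math.ceil(0.95 * len(counts)))
    if used_real and used_proxy:
        source = "mixed"
    else:
        source = "tokens_used" if used_real else "length_proxy"
    return {
        "min": counts[0],
        "median": statistics.median(counts),
        "p95": counts[rank - 1],
        "max": counts[-1],
        "mean": statistics.fmean(counts),
        "source": source,
    }
```

The `source` value mirrors the three labels listed above (`tokens_used` / `length_proxy` / `mixed`), so a consumer can tell whether the numbers are real token counts or a proxy.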
## Summary

Follow-up to #96: adds a group-disjoint, reproducible train/eval split to the export path, so eval scores aren't inflated by conversation/session leakage across the boundary. Closes #103.
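## What it does

- `SplitConfig { eval_fraction, by, seed }` on `ExportConfig`. Each record's side is decided by a stable hash (FNV-1a + MurmurHash3 `fmix64` finalizer) of its `by` grouping key + `seed`, so:
  - no group (default `session_id`; also `run_id` / `tenant` / `source` / `bot_id` / external-id prefix) spans both sides, and
  - the same seed reproduces the identical partition across runs and platforms (the hash is fully defined, not `DefaultHasher`).
- With `split` set, `export_training` writes `<path>.train.<ext>` and `<path>.eval.<ext>`, each with its own manifest recording the split params (`eval_fraction`, `by`, `seed`, `side`) and the complementary output path. The returned manifest is the train side.
- Curation (dedup/decontam/lifecycle) runs once before the split, so the two compose: decontamination drops train rows near eval, the split keeps groups disjoint.

## Python

`ctx.export_training(..., split={"eval_fraction": 0.1, "by": "session_id", "seed": 42})`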
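The hash-based assignment described under "What it does" can be illustrated with a small pure-Python sketch. It uses the standard 64-bit FNV-1a constants and the MurmurHash3 `fmix64` finalizer, but how the seed is mixed with the grouping key and how the hash is mapped to a fraction are assumptions made here for illustration; the Rust code in this PR may differ, so this sketch is not expected to reproduce its exact partition. The helper names (`assign_side`, `split_records`) are hypothetical.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF


def fnv1a_64(data: bytes) -> int:
    """64-bit FNV-1a over raw bytes."""
    h = 0xCBF29CE484222325  # FNV offset basis
    for byte in data:
        h ^= byte
        h = (h * 0x100000001B3) & MASK64  # FNV prime, wrapped to 64 bits
    return h


def fmix64(h: int) -> int:
    """MurmurHash3 64-bit finalizer (avalanche step)."""
    h ^= h >> 33
    h = (h * 0xFF51AFD7ED558CCD) & MASK64
    h ^= h >> 33
    h = (h * 0xC4CEB9FE1A85EC53) & MASK64
    h ^= h >> 33
    return h


def assign_side(group_key: str, seed: int, eval_fraction: float) -> str:
    """Map a grouping key to 'train' or 'eval', deterministically."""
    # Assumption: seed is prepended as 8 little-endian bytes before hashing.
    data = seed.to_bytes(8, "little") + group_key.encode("utf-8")
    # Assumption: the hash is scaled to [0, 1) and compared to eval_fraction.
    unit = fmix64(fnv1a_64(data)) / 2**64
    return "eval" if unit < eval_fraction else "train"


def split_records(records, by="session_id", seed=42, eval_fraction=0.1):
    """Partition records so every value of `by` lands on exactly one side."""
    train, held_out = [], []
    for rec in records:
        side = assign_side(str(rec[by]), seed, eval_fraction)
        (held_out if side == "eval" else train).append(rec)
    return train, held_out
```

Because the side depends only on the grouping key and the seed, all records sharing a `session_id` land on the same side, and rerunning with the same seed yields the same partition. The eval share is approximate: it converges on `eval_fraction` as the number of groups grows, which matches the "fraction-within-tolerance" wording in the tests.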
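## Tests

- **core**: determinism, group-disjointness, fraction-within-tolerance, split manifests + complement path.
- **python**: split disjointness + determinism.

README updated.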
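One reading of the output naming described above (`<path>.train.<ext>` / `<path>.eval.<ext>`, each manifest recording the split params and pointing at its complement) is sketched below. The split of `output_path` into stem and extension and the manifest key names `path` and `complement_path` are assumptions for illustration; only `eval_fraction`, `by`, `seed`, and `side` are named in this PR.

```python
from pathlib import PurePosixPath


def split_paths(output_path: str) -> dict:
    """Derive per-side output paths by inserting the side before the extension."""
    p = PurePosixPath(output_path)
    return {
        side: str(p.with_name(f"{p.stem}.{side}{p.suffix}"))
        for side in ("train", "eval")
    }


def split_manifests(output_path: str, eval_fraction: float, by: str, seed: int) -> dict:
    """Build one manifest-like dict per side, each naming the other side's file."""
    paths = split_paths(output_path)
    manifests = {}
    for side, other in (("train", "eval"), ("eval", "train")):
        manifests[side] = {
            "eval_fraction": eval_fraction,
            "by": by,
            "seed": seed,
            "side": side,
            "path": paths[side],  # hypothetical key name
            "complement_path": paths[other],  # hypothetical key name
        }
    return manifests
```

Under these assumptions, `training/sft.jsonl` yields `training/sft.train.jsonl` and `training/sft.eval.jsonl`, and either manifest is enough to locate the other side of the split.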
## Checks

- `cargo test -p lance-context-core --lib`: 62 passed
- `cargo fmt --all -- --check`, `cargo clippy --workspace --all-targets -- -D warnings`: clean
- `ruff format --check`, `ruff check`, `pyright`: clean
- `test_export_training.py`: 10 passed

## Acceptance criteria (#103)

- … `session_id`) appears in both train and eval

Optional follow-up (noted in the issue): also emit the eval side in #98's labeled query-set format so retrieval-eval reuses the same artifact. Deferred to keep this focused and avoid coupling to the (separate) #98 PR.
Closes #103