feat: reproducible train/eval split for post-training exports (#103) #112

Merged
dcfocus merged 1 commit into lance-format:main from dcfocus:feat/issue-103-train-eval-split
Jun 27, 2026

dcfocus commented Jun 25, 2026

Collaborator

Summary

Follow-up to #96: adds a group-disjoint, reproducible train/eval split to the export path, so eval scores aren't inflated by conversation/session leakage across the boundary. Closes #103.

Stacked on #111 (#96). Review/merge #111 first; until then this PR shows both commits (the split commit is last). The diff reduces to just #103 once #111 lands.

What it does

  • SplitConfig { eval_fraction, by, seed } on ExportConfig. Each record's side is decided by a stable hash (FNV-1a + MurmurHash3 fmix64 finalizer) of its by grouping key + seed, so:
    • no group (default session_id; also run_id / tenant / source / bot_id / external-id prefix) spans both sides, and
    • the same seed reproduces the identical partition across runs/platforms (the hash is fully defined — not DefaultHasher).
  • With split set, export_training writes <path>.train.<ext> and <path>.eval.<ext>, each with its own manifest recording the split params (eval_fraction, by, seed, side) and the complementary output path. The returned manifest is the train side.
  • Curation (dedup/decontam/lifecycle from #96) runs once before the split, so the two compose: decontamination drops train rows near eval; the split keeps groups disjoint.
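
To make the assignment rule above concrete, here is a minimal Python sketch of a hash-based, group-level split. The FNV-1a 64-bit constants and the MurmurHash3 `fmix64` finalizer are the standard published ones. How the seed is mixed in (XOR before the finalizer) and how the hash is compared against `eval_fraction` are assumptions made for illustration, as is the helper name `side_for`; the Rust implementation in this PR may differ in both respects.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF
FNV_OFFSET_BASIS = 0xCBF29CE484222325  # standard FNV-1a 64-bit offset basis
FNV_PRIME = 0x100000001B3              # standard FNV 64-bit prime


def fnv1a64(data: bytes) -> int:
    """FNV-1a over bytes: XOR each byte in, then multiply by the prime."""
    h = FNV_OFFSET_BASIS
    for byte in data:
        h ^= byte
        h = (h * FNV_PRIME) & MASK64
    return h


def fmix64(k: int) -> int:
    """MurmurHash3 64-bit finalizer; spreads the input bits across the word."""
    k ^= k >> 33
    k = (k * 0xFF51AFD7ED558CCD) & MASK64
    k ^= k >> 33
    k = (k * 0xC4CEB9FE1A85EC53) & MASK64
    k ^= k >> 33
    return k


def side_for(group_key: str, seed: int, eval_fraction: float) -> str:
    """Return "train" or "eval" for a grouping key (hypothetical helper).

    Assumption: the seed is XORed into the FNV-1a hash before finalizing.
    """
    h = fmix64(fnv1a64(group_key.encode("utf-8")) ^ (seed & MASK64))
    # Use the top 53 bits so the value maps exactly into [0, 1).
    unit = (h >> 11) / float(1 << 53)
    return "eval" if unit < eval_fraction else "train"


if __name__ == "__main__":
    for session in ("session-a", "session-b", "session-c"):
        print(session, side_for(session, seed=42, eval_fraction=0.1))
```

Because the side depends only on the grouping key and the seed, every record that shares a key lands on the same side, and nothing depends on process-local hasher state.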

Python

ctx.export_training(
    "training/sft.jsonl",
    task="sft",
    split={"eval_fraction": 0.1, "by": "session_id", "seed": 42},
)
# -> training/sft.train.jsonl + training/sft.eval.jsonl (+ manifests)
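
As a rough way to check the outputs by hand, the sketch below derives the two output paths from the requested path and looks for grouping keys present on both sides. It assumes each exported JSONL row carries the grouping key as a top-level field (for example `session_id`); that row layout is not confirmed by this PR, and `split_paths`, `group_keys`, and `leaked_groups` are hypothetical helper names, not part of the library.

```python
import json
from pathlib import Path


def split_paths(path: str) -> tuple[Path, Path]:
    """Map <path>.<ext> to (<path>.train.<ext>, <path>.eval.<ext>)."""
    p = Path(path)
    return (
        p.with_name(f"{p.stem}.train{p.suffix}"),
        p.with_name(f"{p.stem}.eval{p.suffix}"),
    )


def group_keys(jsonl_path: Path, by: str = "session_id") -> set:
    """Collect the distinct grouping keys found in one JSONL file.

    Assumption: the key is a top-level field on every row.
    """
    keys = set()
    with open(jsonl_path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                keys.add(json.loads(line)[by])
    return keys


def leaked_groups(path: str, by: str = "session_id") -> set:
    """Return grouping keys that appear in both the train and eval files."""
    train_path, eval_path = split_paths(path)
    return group_keys(train_path, by) & group_keys(eval_path, by)
```

Calling `leaked_groups("training/sft.jsonl")` after the export above should return an empty set if the split is group-disjoint.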

Tests

  • core (3): determinism (same seed → identical files), group-disjointness, fraction-within-tolerance, and split manifests with complement path.
  • python (2): split disjointness + determinism.

README updated.

Checks

  • cargo test -p lance-context-core --lib — 62 passed
  • cargo fmt --all -- --check, cargo clippy --workspace --all-targets -- -D warnings — clean
  • ruff format --check, ruff check, pyright — clean
  • test_export_training.py — 10 passed

Acceptance criteria (#103)

  • Split into train/eval by a grouping key with a fixed seed; re-running yields the identical partition
  • No group (e.g. session_id) appears in both train and eval
  • Each output carries a manifest recording split params
  • Tests cover determinism, group-disjointness, and fraction accuracy within tolerance

Optional follow-up (noted in the issue): also emit the eval side in #98's labeled query-set format so retrieval-eval reuses the same artifact — deferred to keep this focused and avoid coupling to the (separate) #98 PR.

Closes #103

…format#103)

Follow-up to lance-format#96. Adds a group-disjoint, reproducible train/eval split to
the export path so eval scores aren't inflated by conversation leakage.

- `SplitConfig { eval_fraction, by, seed }` on `ExportConfig`. Each record
  is assigned to train or eval by a stable hash (FNV-1a + MurmurHash3
  fmix64 finalizer) of its `by` grouping key plus `seed`, so:
  - no group (default `session_id`, configurable to run_id / tenant /
    source / bot_id / external-id prefix) spans both sides, and
  - the same seed reproduces the identical partition across runs and
    platforms (the hash is fully defined, not `DefaultHasher`).
- When `split` is set, `export_training` writes `<path>.train.<ext>` and
  `<path>.eval.<ext>`, each with its own manifest recording the split
  params (`eval_fraction`, `by`, `seed`, `side`) and the complementary
  output path. The returned manifest is the train side.
- Curation (dedup/decontam/lifecycle) runs once before the split, so the
  two compose: decontamination drops train rows near eval, the split keeps
  groups disjoint.

Python: `ctx.export_training(..., split={"eval_fraction": 0.1,
"by": "session_id", "seed": 42})`.

Tests: core (determinism, group-disjointness, fraction-within-tolerance,
split manifests + complement path) and python (split disjointness +
determinism). README updated.

Stacked on lance-format#96 (PR lance-format#111).

Closes lance-format#103
dcfocus force-pushed the feat/issue-103-train-eval-split branch from 6f30288 to 5124edc on June 27, 2026 06:03
dcfocus merged commit 7e0a0fb into lance-format:main Jun 27, 2026
9 checks passed
dcfocus added a commit to dcfocus/lance-context that referenced this pull request Jun 27, 2026
)

Follow-up to lance-format#96. Adds an optional `<output_path>.stats.json` report so a
training cut is auditable at a glance without loading the JSONL.

- `ExportConfig.emit_stats` gates a stats artifact computed during the
  existing export pass (no extra dataset re-materialization).
- `ExportStats` covers:
  - counts: examples, by `role`, by `source`, by `tenant`;
  - token stats (min/median/p95/max/mean) from
    `state_metadata.tokens_used`, falling back to a whitespace word-count
    proxy, with a `source` field (`tokens_used` / `length_proxy` / `mixed`);
  - grouping: `num_groups` + records-per-group distribution;
  - curation accounting: records excluded by reason (lifecycle, reward
    threshold, dedup, decontamination), derived from the curation counts;
  - reward distribution and `reward_source` breakdown (rollouts), plus the
    preference form.
- With a train/eval split, each side gets its own `.stats.json`.

Python: `ctx.export_training(..., emit_stats=True)` writes the sibling
report; read it from `<output_path>.stats.json`.

Tests: core (role/source/tenant counts, token stats + length-proxy
fallback, lifecycle exclusions, rollout reward distribution, flag gating)
and python (stats contents + flag gating). README updated.

Stacked on lance-format#96 (PR lance-format#111) and lance-format#103 (PR lance-format#112).

Closes lance-format#104
dcfocus added a commit that referenced this pull request Jun 27, 2026
## Summary

Follow-up to #96: emits an optional **dataset statistics report**
(`<output_path>.stats.json`) alongside each export so a training cut is
auditable at a glance — without loading the JSONL. Closes #104.

> **Stacked on #111 (#96) and #112 (#103).** Review/merge those first;
until then this PR shows their commits too (the stats commit is last).

## What it does

- `ExportConfig.emit_stats` gates a stats artifact computed **during the
existing export pass** (no extra dataset re-materialization).
- `ExportStats` covers everything the issue asks for:
  - **counts**: examples, by `role`, by `source`, by `tenant`;
  - **token stats** (min/median/p95/max/mean) from
    `state_metadata.tokens_used`, falling back to a whitespace word-count
    proxy, with a `source` field (`tokens_used` / `length_proxy` / `mixed`);
  - **grouping**: `num_groups` + records-per-group distribution;
  - **curation accounting**: records excluded by reason (lifecycle, reward
    threshold, dedup, decontamination), derived from the curation counts;
  - **reward distribution** + `reward_source` breakdown (rollouts), plus
    the preference form.
- With a train/eval split (#103), each side gets its own `.stats.json`.
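
The token-stats fallback described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the `ExportStats` implementation: the record shape (a dict with an optional `state_metadata.tokens_used` and a `text` field), the nearest-rank definition of p95, and the function name `token_stats` are all hypothetical.

```python
import statistics


def token_stats(records):
    """Summarize per-record lengths, preferring state_metadata.tokens_used.

    Assumption: records are dicts; "text" is a hypothetical field used
    only for the whitespace word-count fallback.
    """
    lengths = []
    used_tokens = used_proxy = False
    for record in records:
        tokens = (record.get("state_metadata") or {}).get("tokens_used")
        if tokens is not None:
            lengths.append(int(tokens))
            used_tokens = True
        else:
            # Fallback: whitespace word count as a length proxy.
            lengths.append(len(record.get("text", "").split()))
            used_proxy = True
    if not lengths:
        return None
    lengths.sort()
    # Nearest-rank p95: the ceil(0.95 * n)-th smallest value (1-indexed),
    # computed with integers to avoid float rounding.
    rank = (95 * len(lengths) + 99) // 100
    if used_tokens and used_proxy:
        source = "mixed"
    else:
        source = "tokens_used" if used_tokens else "length_proxy"
    return {
        "min": lengths[0],
        "median": statistics.median(lengths),
        "p95": lengths[rank - 1],
        "max": lengths[-1],
        "mean": statistics.mean(lengths),
        "source": source,
    }
```

Recording `source` alongside the numbers lets a reader tell real token counts from the word-count proxy before comparing two exports.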

## Python

```python
ctx.export_training("training/sft.jsonl", task="sft", emit_stats=True)
# -> training/sft.jsonl.stats.json
```

## Tests

- **core** (4): role/source/tenant counts + token stats + lifecycle
exclusions, length-proxy fallback, rollout reward distribution +
reward_source, flag gating.
- **python** (2): stats contents + flag gating.

README updated.

## Checks

- `cargo test -p lance-context-core --lib` — 66 passed
- `cargo fmt --all -- --check`, `cargo clippy --workspace --all-targets
-- -D warnings` — clean
- `ruff format --check`, `ruff check`, `pyright` — clean
- `test_export_training.py` — 12 passed

## Acceptance criteria (#104)

- [x] Export optionally emits a stats JSON with the counts/distributions
above
- [x] Token stats use `state_metadata.tokens_used` when available, with
a documented fallback (whitespace word count)
- [x] Curation accounting reports excluded counts by reason
- [x] Stats computed in the export pass (no full re-materialization)
- [x] Tests verify counts/distributions against known fixtures

Closes #104
dcfocus deleted the feat/issue-103-train-eval-split branch June 27, 2026 19:04

Development

Successfully merging this pull request may close these issues.

Reproducible train/eval split helper for post-training exports

1 participant