Skip to content

fix(kvflash): pooled prefix snapshot at chunk boundary (≥128K restore)#445

Open
dusterbloom wants to merge 7 commits into
Luce-Org:mainfrom
dusterbloom:pr/kvflash-cache-persistence
Open

fix(kvflash): pooled prefix snapshot at chunk boundary (≥128K restore)#445
dusterbloom wants to merge 7 commits into
Luce-Org:mainfrom
dusterbloom:pr/kvflash-cache-persistence

Conversation

@dusterbloom

@dusterbloom dusterbloom commented Jun 24, 2026

Copy link
Copy Markdown
Collaborator

Wave-1 KVFlash deep-context. New commit in this PR: feee0a1d — save the pooled prefix snapshot at a chunk-aligned boundary so ≥128K restore works (was a crash).

Stacking / merge order: depends on #428 (pool reservation+sizing), #429 (pager serde), and #430 (pooled chunked prefill core). Those appear as lower commits in this branch's diff; review/merge them first — this branch rebases to drop them once they land, leaving only the snapshot-boundary fix.

Validation: rebased clean onto current main; the composed deep-context stack passed 64K NIAH (needle exact, restore engaged) on the integration branch. No standalone agentic gate yet.

Review in cubic

@cubic-dev-ai cubic-dev-ai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 issues found across 16 files

Reply with feedback, questions, or to request a fix.

Re-trigger cubic

Comment thread server/src/common/moe_hybrid_ffn_eval.cpp Outdated
Comment thread server/src/common/kvflash_pager.h Outdated
@dusterbloom dusterbloom reopened this Jun 24, 2026
Add serialize()/deserialize() to KvFlashPager (snapshot the full resident+paged
KV in logical chunk order; header-validated against layout) and a factored
for_each_segment() helper. serde uses synchronous get/set and adapts to the
pinned void* host_data of the async-DMA path (Luce-Org#408). Add critical-chunk pinning
(pin_range/is_pinned/unpin_all + a best-effort deadlock floor) OR-ed into the
ensure_free_block + reselect protections; empty by default (byte-identical
non-pin path). CPU unit test (no GPU) covers serde round-trip, header-guard
reject, pinning, deadlock guard, reset.
…r KVFlash

Drive the MoE cold-expert hybrid path through KVFlash's resident pool: prompts
larger than the pool prefill via a chunk loop over hybrid_forward_batch (eviction
automatic in alloc_span); the restore residual delta routes through the same
chunked path. Pooled snapshot save/restore serializes the pager into the prefix
snapshot (PrefixSnapshot += is_pooled + blob; snapshot_target_cache/restore gain
skip_kv; the blob rides the disk prefix-cache via a named tensor so cross-turn
128K restore composes). Drafter-scorer residency + DFLASH_KVFLASH_PIN_SPANS
critical-chunk pinning wired in. Composes with the landed KVFlash (Luce-Org#373/Luce-Org#408/Luce-Org#385)
and MoE restore (Luce-Org#362); serde adapts to the async pinned host_data.

GPU gate (RTX 3090): pooled prefill preserves sink context + stable across pool
sizes; cross-turn disk restore round-trips losslessly.
…gment

Three complexity cuts, no behavior change (GPU sink-recall gate + serde/
placement unit tests green):
- merge restore residual's identical snap_pooled/else chunk loops into one
  (the else ct ternary already subsumes the pooled case)
- extract chunked_prefill() shared by generate_impl kvf_paged + restore residual
- inline single-caller for_each_segment template into serialize

net -25 lines (54 ins / 79 del).
…restore works

At ≥128K with the KVFlash pool active, turn 1 never saved a prefix snapshot —
the pooled-prefill branch was stubbed to a diagnostic ("boundary snapshot
skipped: pooled prefill relocates chunks") and returned without saving. So turn
2 found nothing to restore (prefix_len=0), fell back to a full cold re-prefill
(0.8s→77.6s), decode regressed 80→20 tok/s, and turn 3 crashed. The all-hot
35B-A3B runs the dense Qwen35Backend path (moe_hybrid==nullptr), so this was the
live bug for the user's deep-context (>128K = 39% of real prompts) workload.

- add KvFlashPager::serialize(max_chunks) to capture only chunks [0, max_chunks)
  — the chunk-aligned turn boundary, not the whole prompt.
- add Qwen35Backend::snapshot_save_pooled_at(slot, boundary): floor the requested
  snap_pos to a chunk multiple, set cur_pos to that boundary, serialize the
  partial pager blob, and save it (the restore/deserialize path already existed
  and was correct — only the save was missing).
- replace the pooled-prefill skip stub at the chunk-aligned boundary with the
  real save; mirror the same save on the qwen35moe hybrid path.
- unit tests: floor_to_chunk + serialize(max_chunks) partial round-trip
  (bit-identical first k chunks).

131K 3-turn smoke: turn-2 restore=true prefix_len=34077 (97.5% hit), turn-3
restore=true, no crash, tool_call_valid=1.0, decode recovered 20→56-59 tok/s.
Known follow-up: warm-prefill at 131K is still ~44s (deserialize re-pages the
whole pool) — correctness/crash/decode are fixed; restoring only resident chunks
is the next optimization.
…uard removal

core 71371d8 removed the !layout_known_ short-circuit; cold_prefix_boundary now returns the last eligible boundary. Updates the stale ==0 expectation. CI: test_server_unit.cpp.
…OB, UAF, atoi-UB, re-prefill fallback, pin_range, gguf type guards)
@dusterbloom dusterbloom force-pushed the pr/kvflash-cache-persistence branch from 750e090 to ef3b4cc Compare June 24, 2026 20:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant