fix(kvflash): pooled prefix snapshot at chunk boundary (≥128K restore)#445
Open
dusterbloom wants to merge 7 commits into
Open
fix(kvflash): pooled prefix snapshot at chunk boundary (≥128K restore)#445dusterbloom wants to merge 7 commits into
dusterbloom wants to merge 7 commits into
Conversation
Contributor
There was a problem hiding this comment.
2 issues found across 16 files
Reply with feedback, questions, or to request a fix.
Re-trigger cubic
Add serialize()/deserialize() to KvFlashPager (snapshot the full resident+paged KV in logical chunk order; header-validated against layout) and a factored for_each_segment() helper. serde uses synchronous get/set and adapts to the pinned void* host_data of the async-DMA path (Luce-Org#408). Add critical-chunk pinning (pin_range/is_pinned/unpin_all + a best-effort deadlock floor) OR-ed into the ensure_free_block + reselect protections; empty by default (byte-identical non-pin path). CPU unit test (no GPU) covers serde round-trip, header-guard reject, pinning, deadlock guard, reset.
…r KVFlash Drive the MoE cold-expert hybrid path through KVFlash's resident pool: prompts larger than the pool prefill via a chunk loop over hybrid_forward_batch (eviction automatic in alloc_span); the restore residual delta routes through the same chunked path. Pooled snapshot save/restore serializes the pager into the prefix snapshot (PrefixSnapshot += is_pooled + blob; snapshot_target_cache/restore gain skip_kv; the blob rides the disk prefix-cache via a named tensor so cross-turn 128K restore composes). Drafter-scorer residency + DFLASH_KVFLASH_PIN_SPANS critical-chunk pinning wired in. Composes with the landed KVFlash (Luce-Org#373/Luce-Org#408/Luce-Org#385) and MoE restore (Luce-Org#362); serde adapts to the async pinned host_data. GPU gate (RTX 3090): pooled prefill preserves sink context + stable across pool sizes; cross-turn disk restore round-trips losslessly.
…gment Three complexity cuts, no behavior change (GPU sink-recall gate + serde/ placement unit tests green): - merge restore residual's identical snap_pooled/else chunk loops into one (the else ct ternary already subsumes the pooled case) - extract chunked_prefill() shared by generate_impl kvf_paged + restore residual - inline single-caller for_each_segment template into serialize net -25 lines (54 ins / 79 del).
…restore works
At ≥128K with the KVFlash pool active, turn 1 never saved a prefix snapshot —
the pooled-prefill branch was stubbed to a diagnostic ("boundary snapshot
skipped: pooled prefill relocates chunks") and returned without saving. So turn
2 found nothing to restore (prefix_len=0), fell back to a full cold re-prefill
(0.8s→77.6s), decode regressed 80→20 tok/s, and turn 3 crashed. The all-hot
35B-A3B runs the dense Qwen35Backend path (moe_hybrid==nullptr), so this was the
live bug for the user's deep-context (>128K = 39% of real prompts) workload.
- add KvFlashPager::serialize(max_chunks) to capture only chunks [0, max_chunks)
— the chunk-aligned turn boundary, not the whole prompt.
- add Qwen35Backend::snapshot_save_pooled_at(slot, boundary): floor the requested
snap_pos to a chunk multiple, set cur_pos to that boundary, serialize the
partial pager blob, and save it (the restore/deserialize path already existed
and was correct — only the save was missing).
- replace the pooled-prefill skip stub at the chunk-aligned boundary with the
real save; mirror the same save on the qwen35moe hybrid path.
- unit tests: floor_to_chunk + serialize(max_chunks) partial round-trip
(bit-identical first k chunks).
131K 3-turn smoke: turn-2 restore=true prefix_len=34077 (97.5% hit), turn-3
restore=true, no crash, tool_call_valid=1.0, decode recovered 20→56-59 tok/s.
Known follow-up: warm-prefill at 131K is still ~44s (deserialize re-pages the
whole pool) — correctness/crash/decode are fixed; restoring only resident chunks
is the next optimization.
…uard removal core 71371d8 removed the !layout_known_ short-circuit; cold_prefix_boundary now returns the last eligible boundary. Updates the stale ==0 expectation. CI: test_server_unit.cpp.
…OB, UAF, atoi-UB, re-prefill fallback, pin_range, gguf type guards)
750e090 to
ef3b4cc
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Wave-1 KVFlash deep-context. New commit in this PR:
feee0a1d— save the pooled prefix snapshot at a chunk-aligned boundary so ≥128K restore works (was a crash).Stacking / merge order: depends on #428 (pool reservation+sizing), #429 (pager serde), and #430 (pooled chunked prefill core). Those appear as lower commits in this branch's diff; review/merge them first — this branch rebases to drop them once they land, leaving only the snapshot-boundary fix.
Validation: rebased clean onto current main; the composed deep-context stack passed 64K NIAH (needle exact, restore engaged) on the integration branch. No standalone agentic gate yet.