feat(dflash): entropy + self-calibrating spec-decode gate (composes with #436) #3
Open
dusterbloom wants to merge 4 commits into
…ectness fixes

DRAFTER CONVERTER (config-driven):
- convert_dflash_to_gguf.py reads all architecture params from config.json (hidden_size, n_layer, mask_token_id, target_layer_ids, layer_types for SWA, sliding_window). No hardcoded constants.
- quantize_draft_q8.py shares load_arch with the converter.
- GGUF metadata: dflash.mask_token_id, dflash.target_layer_ids[], dflash.block_size, attention.sliding_window + pattern.
- draft_gguf_loader.cpp: read_draft_capture_config(), mask from GGUF metadata, block_size override, SWA pattern from metadata.
- draft_safetensors_loader.cpp: dynamic layer count, SWA+mask from config.json.
- gguf_target_loader.cpp: respect drafter-specified capture layers instead of overwriting with evenly-spaced heuristic.
- qwen35_backend.cpp: early-read capture sync + mask token propagation.
- internal.h: capture_layer_ids[16], DFLASH_MAX_CAPTURE_LAYERS=16.
- dflash27b.h: DFLASH_MAX_CAPTURE_LAYERS=16.

SPEC-DECODE PERFORMANCE:
- graph_builders.cpp: build_lm_head_projection_step skips rebuild when ctx alive + n_tokens matches (centralized guard; was per-call-site).
- qwen35_backend.cpp: do_spec_decode uses member draft_sg_ (not local) for graph persistence; kFastRollbackThreshold env-tunable (DFLASH_FAST_ROLLBACK_MIN, default 5).
- dflash_draft_graph.cpp: exact-ctx_len non-view reuse guard (DFLASH_DRAFT_GRAPH_REUSE, default ON). 4MB ctx alloc (was 256MB).
- graph_builders.cpp: 4MB ctx alloc (was 64MB).
- step_graph.h: graph_ctx_len + graph_used_view tracking fields.

SPEC-DECODE CORRECTNESS:
- qwen35_target_graph.cpp: DFLASH_FEAT_RING_CAP env overrides the hardcoded 4096 feature ring cap. Default 4096 causes acceptance collapse from 85% to 7.7% EXACTLY at 4096 prompt tokens (ring wrap corrupts features).
- qwen35_backend.cpp: mirror init honors DFLASH_FEAT_RING_CAP.
- qwen35_dflash_target.cpp: guard against invalid token IDs from GPU argmax at long context (NaN/Inf → clamp to 0, verify rejects gracefully).

MOE EXPERIMENTAL (behind flags):
- qwen35moe_backend.cpp: DFLASH_MOE_ALLHOT_HYBRID=1 builds moe_hybrid storage even with 0 cold experts to enable pipelined spec-decode verify.
- Persistent moe_hybrid_logits_sg_ graph (was 64MB per-token alloc in hybrid_forward_one_token). GPU argmax (4 bytes vs 1MB vocab readback).
- Batched verify/replay via hybrid_forward_batch (was 8 sequential forwards).

VALIDATED:
- 27B dense + reconverted drafter: 57% accept on code gen, 85% on short prompts. block=16 gives 252 tok/s (2.2x AR) on code generation.
- 35B-A3B MoE + reconverted new drafter: 86% accept, 245 tok/s (2.1x AR).
- Feature ring cap=16384: 85% holds to 5K tokens, 58% to 10K.
- Full pFlash + dFlash stack: goldgate agentic trace passes (100% tool calls valid), pFlash cuts 34K prefill from 475s to 208s (2.3x).
- repo_inspection prompt: correct answers, spec at 33.8% accept, 34 tok/s.
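
To make the ring-cap and invalid-token items above concrete, here is a minimal sketch, not the code from this PR. The helper names (env_size_or, ring_slot, sanitize_token) are hypothetical, and the modulo slot indexing is an assumption about how the feature ring wraps. Under that assumption, position 4096 lands on slot 0 when the cap is 4096, overwriting the oldest feature.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Hypothetical helper: read a positive integer from an environment variable
// (e.g. DFLASH_FEAT_RING_CAP), falling back when it is unset or malformed.
inline std::size_t env_size_or(const char* name, std::size_t fallback) {
    const char* raw = std::getenv(name);
    if (raw == nullptr || raw[0] < '0' || raw[0] > '9') return fallback;
    char* end = nullptr;
    const unsigned long long v = std::strtoull(raw, &end, 10);
    if (*end != '\0' || v == 0) return fallback;  // trailing junk or zero cap
    return static_cast<std::size_t>(v);
}

// ASSUMPTION: the feature for prompt position `pos` is stored in slot
// pos % cap, so the first wrap happens exactly at pos == cap.
inline std::size_t ring_slot(std::size_t pos, std::size_t cap) {
    return pos % cap;
}

// Hypothetical guard in the spirit of the invalid-token fix: an argmax result
// outside [0, vocab_size) is clamped to token 0 so verify can reject it.
inline std::int32_t sanitize_token(std::int64_t id, std::int32_t vocab_size) {
    return (id < 0 || id >= vocab_size) ? 0 : static_cast<std::int32_t>(id);
}
```

With a cap of 16384, position 4096 keeps its own slot, which is consistent with the acceptance figures reported for the larger cap.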
dFlash spec-decode is content-dependent: it wins big on verbatim/copyable
output (drafter accept ~80%, ~235 tok/s) but is 2-4x SLOWER than plain AR on
novel/high-entropy output (accept ~6-16%) — and on this MoE the rejected tokens
still pay full expert-routing verify cost. Gate it on target entropy so the
decoder automatically picks the faster path, transparently, no knobs.
- per decision point compute target top-1 prob p1 (cheap entropy proxy = expected
acceptance) from the logits we already have.
- keep spec at the trained full block (16) when confidence is high; floor the
remainder of the turn to the efficient do_ar_decode (real AR ~100+ tok/s) when
the drafter is losing.
- hysteresis: 1-step probe + sustained-low streak (DFLASH_ENTROPY_SUSTAIN, def 2)
holds full blocks through transient dips ("big blocks on uncertain transitions");
near-tie immediate floor (DFLASH_ENTROPY_TIE_P1, def 0.45) turns verify off when
the argmax is ambiguous.
- threshold DFLASH_ENTROPY_AR_P1 (def 0.90) swept for the Pareto point; gate
default-on, DFLASH_ENTROPY_GATE=0 disables, DFLASH_ENTROPY_DEBUG traces p1.
- measured: verbatim 236 / code-gen->AR 117 / novel->AR 83 tok/s, always >= AR.
- temp 0: semantically equivalent to AR (spec verifies vs target argmax; both take
the argmax). Not bit-identical — near-tie argmax flips via verify-batch FP
reduction order, the established spec-decode bar.
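
The gate described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the inline code in qwen35_backend.cpp: the EntropyGate struct and top1_prob function are hypothetical names, and the exact interaction between the 1-step probe and the sustain streak is simplified to "floor after `sustain` consecutive low-confidence decisions, or immediately on a near-tie".

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Top-1 softmax probability, computed stably: p1 = 1 / sum_i exp(l_i - max_l).
inline double top1_prob(const std::vector<float>& logits) {
    if (logits.empty()) return 0.0;
    const float mx = *std::max_element(logits.begin(), logits.end());
    double denom = 0.0;
    for (float l : logits) denom += std::exp(static_cast<double>(l) - mx);
    return 1.0 / denom;
}

// Hypothetical gate state for one turn. Defaults mirror the env defaults
// named in the commit message.
struct EntropyGate {
    double ar_p1 = 0.90;   // DFLASH_ENTROPY_AR_P1
    double tie_p1 = 0.45;  // DFLASH_ENTROPY_TIE_P1
    int sustain = 2;       // DFLASH_ENTROPY_SUSTAIN
    int low_streak = 0;    // consecutive decisions with p1 < ar_p1
    bool floored = false;  // once set, the rest of the turn runs plain AR

    // Returns true to run a full spec block, false to use the AR path.
    bool keep_spec(double p1) {
        if (floored) return false;
        if (p1 < tie_p1) {  // near-tie argmax: floor immediately
            floored = true;
            return false;
        }
        if (p1 < ar_p1) {   // low confidence: floor only if sustained
            if (++low_streak >= sustain) {
                floored = true;
                return false;
            }
        } else {
            low_streak = 0; // transient dip recovered
        }
        return true;
    }
};
```

A single dip below 0.90 keeps the full block, while two in a row floor the turn; flooring is sticky for the remainder of the turn in this sketch.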
…after cliff

Two changes that make dFlash spec-decode safe and useful across content and context length without per-model tuning.

1. Long-context drafter cliff fix. The block-diffusion drafter's prediction collapses when it self-attends more than ~2048 tokens (measured: 93% accept at draft_ctx<=2048 vs 6% at 4096, independent of total prompt context). The old default ran it at max(2048, draft_ctx_max=4096)=4096 (past the drafter's effective limit), so spec-decode died above ~2K context. Cap the drafter's self-attention at 2048 by default; spec now holds 77-93% accept / 110-200 tok/s out to 35K context for recent-derived output. DFLASH_DRAFT_CTX_MAX overrides for drafters with a larger usable window.

2. Self-calibrating commit-EMA gate (replaces the p1-entropy gate). dFlash wins only when its realized throughput beats AR; that break-even is model- and context-dependent (a fixed entropy threshold over-floored dense, under-floored MoE). Measure t_ar once per process (cached on the backend, no per-turn warmup tax), then floor the remainder of a turn to the efficient AR path when the EMA of commit_n*t_ar/step_wall stays below 1.0 (spec slower than AR) for a few steps. Knob-free, never slower than AR; floors novel/high-entropy turns, keeps spec on code/structured.

Env: DFLASH_SPEC_GATE(=1), _MARGIN, _SUSTAIN, _WARMUP, _DEBUG. Applies to both base (do_spec_decode) and MoE hybrid (do_hybrid_spec_decode) paths.

Temp 0: semantically equivalent to AR.
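
As a rough illustration of the commit-EMA gate in item 2, the sketch below tracks the EMA of commit_n*t_ar/step_wall and floors the turn once it stays below the margin. The struct name CommitEmaGate is hypothetical, and the smoothing factor, sustain count, and warmup count are assumed values: the commit message names the _MARGIN, _SUSTAIN, and _WARMUP knobs but does not state their defaults.

```cpp
#include <cassert>

// Hypothetical per-turn gate. t_ar is the measured wall time of one AR token
// in seconds (measured once per process, per the commit message).
struct CommitEmaGate {
    double t_ar;
    double margin = 1.0;  // floor when the EMA stays below this (break-even)
    int sustain = 3;      // ASSUMED value for "a few steps"
    int warmup = 2;       // ASSUMED: steps observed before the gate may fire
    double alpha = 0.3;   // ASSUMED EMA smoothing factor
    double ema = 0.0;
    int steps = 0;
    int below = 0;
    bool floored = false;

    explicit CommitEmaGate(double t_ar_seconds) : t_ar(t_ar_seconds) {}

    // commit_n: tokens committed by this spec step; step_wall: its wall time
    // in seconds. Returns true to keep spec-decode, false to floor to AR.
    bool observe(int commit_n, double step_wall) {
        if (floored) return false;
        if (step_wall <= 0.0) return true;  // no usable timing, no decision
        // Realized speedup over AR: > 1 means this step beat AR.
        const double speedup = commit_n * t_ar / step_wall;
        ema = (steps == 0) ? speedup : alpha * speedup + (1.0 - alpha) * ema;
        ++steps;
        if (steps <= warmup) return true;
        below = (ema < margin) ? below + 1 : 0;
        if (below >= sustain) floored = true;
        return !floored;
    }
};
```

For example, with t_ar = 0.01 s (100 tok/s AR), a step that commits 8 tokens in 0.04 s has a speedup of 2.0, while a step that commits 1 token in the same time has 0.25; a run of the latter pulls the EMA under 1.0 and floors the turn.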
…uard removal

core 71371d8 removed the !layout_known_ short-circuit; cold_prefix_boundary now returns the last eligible boundary. Updates the stale ==0 expectation. CI: test_server_unit.cpp.
Wave-2 dFlash gate. Stacked on integration/kvflash-complete (origin/main + full KVFlash stack), so the diff here is just the converter + the two gate commits.
New commits: entropy-gated spec-decode (never slower than AR) + self-calibrating commit-EMA gate + long-ctx drafter-cliff fix.
Composes with origin/main Luce-Org#436 (DFlash tree-verify under KVFlash) via a verified UNION in qwen35_backend.cpp: keeps Luce-Org#436's kvflash_tree_ok guard + alloc_span prefix registration AND the gate's EMA-timing rewrite (logically independent). Gate logic is inline (pre-abstraction; the common/ dedup is the separate Wave-3 abstraction).
VALIDATED — 64K NIAH on the full-KVFlash base: NO SIGSEGV at ~70K pooled prefill, needle 847291356 exact at turn3, restore=true (prefill 0.100s), ddtree tree-verify engaged, no invalid-seed, gate self-calibrating. (A partial substrate+converter base SIGSEGV'd here — the full KVFlash fix set is required.)
Base is a fork branch because Wave-2 sits on the unmerged KVFlash wave; retarget to Luce-Org:main once Wave-1 (Luce-Org#428/Luce-Org#429/Luce-Org#430/Luce-Org#445/Luce-Org#446) lands.