feat(dflash): entropy + self-calibrating spec-decode gate (composes with #436) by dusterbloom · Pull Request #3 · dusterbloom/lucebox-hub

dusterbloom · 2026-06-24T15:37:32Z

Wave-2 dFlash gate. Stacked on integration/kvflash-complete (origin/main + full KVFlash stack), so the diff here is just the converter + the two gate commits.

New commits: entropy-gated spec-decode (never slower than AR) + self-calibrating commit-EMA gate + long-ctx drafter-cliff fix.

Composes with origin/main Luce-Org#436 (DFlash tree-verify under KVFlash) via a verified UNION in qwen35_backend.cpp: keeps Luce-Org#436's kvflash_tree_ok guard + alloc_span prefix registration AND the gate's EMA-timing rewrite (logically independent). Gate logic is inline (pre-abstraction; the common/ dedup is the separate Wave-3 abstraction).

VALIDATED (64K NIAH on the full-KVFlash base): no SIGSEGV at ~70K pooled prefill, needle 847291356 exact at turn 3, restore=true (prefill 0.100s), ddtree tree-verify engaged, no invalid-seed, gate self-calibrating. A partial substrate+converter base SIGSEGV'd here, so the full KVFlash fix set is required.

Base is a fork branch because Wave-2 sits on the unmerged KVFlash wave; retarget to Luce-Org:main once Wave-1 (Luce-Org#428/Luce-Org#429/Luce-Org#430/Luce-Org#445/Luce-Org#446) lands.

…ectness fixes

DRAFTER CONVERTER (config-driven):
- convert_dflash_to_gguf.py reads all architecture params from config.json (hidden_size, n_layer, mask_token_id, target_layer_ids, layer_types for SWA, sliding_window). No hardcoded constants.
- quantize_draft_q8.py shares load_arch with the converter.
- GGUF metadata: dflash.mask_token_id, dflash.target_layer_ids[], dflash.block_size, attention.sliding_window + pattern.
- draft_gguf_loader.cpp: read_draft_capture_config(), mask from GGUF metadata, block_size override, SWA pattern from metadata.
- draft_safetensors_loader.cpp: dynamic layer count, SWA+mask from config.json.
- gguf_target_loader.cpp: respect drafter-specified capture layers instead of overwriting with evenly-spaced heuristic.
- qwen35_backend.cpp: early-read capture sync + mask token propagation.
- internal.h: capture_layer_ids[16], DFLASH_MAX_CAPTURE_LAYERS=16.
- dflash27b.h: DFLASH_MAX_CAPTURE_LAYERS=16.

SPEC-DECODE PERFORMANCE:
- graph_builders.cpp: build_lm_head_projection_step skips rebuild when ctx alive + n_tokens matches (centralized guard; was per-call-site).
- qwen35_backend.cpp: do_spec_decode uses member draft_sg_ (not local) for graph persistence; kFastRollbackThreshold env-tunable (DFLASH_FAST_ROLLBACK_MIN, default 5).
- dflash_draft_graph.cpp: exact-ctx_len non-view reuse guard (DFLASH_DRAFT_GRAPH_REUSE, default ON). 4MB ctx alloc (was 256MB).
- graph_builders.cpp: 4MB ctx alloc (was 64MB).
- step_graph.h: graph_ctx_len + graph_used_view tracking fields.

SPEC-DECODE CORRECTNESS:
- qwen35_target_graph.cpp: DFLASH_FEAT_RING_CAP env overrides the hardcoded 4096 feature ring cap. Default 4096 causes acceptance collapse from 85% to 7.7% EXACTLY at 4096 prompt tokens (ring wrap corrupts features).
- qwen35_backend.cpp: mirror init honors DFLASH_FEAT_RING_CAP.
- qwen35_dflash_target.cpp: guard against invalid token IDs from GPU argmax at long context (NaN/Inf → clamp to 0, verify rejects gracefully).

MOE EXPERIMENTAL (behind flags):
- qwen35moe_backend.cpp: DFLASH_MOE_ALLHOT_HYBRID=1 builds moe_hybrid storage even with 0 cold experts to enable pipelined spec-decode verify.
- Persistent moe_hybrid_logits_sg_ graph (was 64MB per-token alloc in hybrid_forward_one_token). GPU argmax (4 bytes vs 1MB vocab readback).
- Batched verify/replay via hybrid_forward_batch (was 8 sequential forwards).

VALIDATED:
- 27B dense + reconverted drafter: 57% accept on code gen, 85% on short prompts. block=16 gives 252 tok/s (2.2x AR) on code generation.
- 35B-A3B MoE + reconverted new drafter: 86% accept, 245 tok/s (2.1x AR).
- Feature ring cap=16384: 85% holds to 5K tokens, 58% to 10K.
- Full pFlash + dFlash stack: goldgate agentic trace passes (100% tool calls valid), pFlash cuts 34K prefill from 475s to 208s (2.3x).
- repo_inspection prompt: correct answers, spec at 33.8% accept, 34 tok/s.
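A minimal sketch of why acceptance can collapse exactly at the ring cap, and how an env override such as DFLASH_FEAT_RING_CAP could be parsed. It assumes the feature ring is indexed as `pos % cap`; the helper names (`parse_size_or`, `env_size_or`, `ring_slot`, `ring_overwrites_live_row`) are hypothetical and are not taken from qwen35_target_graph.cpp.

```cpp
#include <cstddef>
#include <cstdlib>

// Hypothetical helper: parse a positive decimal integer, falling back to
// `fallback` when the text is missing, empty, non-numeric, or zero.
inline std::size_t parse_size_or(const char* raw, std::size_t fallback) {
    if (raw == nullptr || raw[0] < '0' || raw[0] > '9') return fallback;
    char* end = nullptr;
    const unsigned long long v = std::strtoull(raw, &end, 10);
    if (*end != '\0' || v == 0) return fallback;
    return static_cast<std::size_t>(v);
}

// Hypothetical helper: read the override from the environment.
inline std::size_t env_size_or(const char* name, std::size_t fallback) {
    return parse_size_or(std::getenv(name), fallback);
}

// Assumption: a token position maps to a ring row by simple modulo.
inline std::size_t ring_slot(std::size_t pos, std::size_t cap) {
    return pos % cap;
}

// Under that assumption, the first position that reuses a row is pos == cap:
// it lands on row 0 and replaces the features captured for position 0.
inline bool ring_overwrites_live_row(std::size_t pos, std::size_t cap) {
    return pos >= cap;
}
```

With the default cap of 4096, position 4095 still owns its own row, while position 4096 maps back to row 0. Calling `env_size_or("DFLASH_FEAT_RING_CAP", 4096)` with the variable set to 16384 would move that first collision out to position 16384.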

dFlash spec-decode is content-dependent: it wins big on verbatim/copyable output (drafter accept ~80%, ~235 tok/s) but is 2-4x SLOWER than plain AR on novel/high-entropy output (accept ~6-16%), and on this MoE the rejected tokens still pay full expert-routing verify cost. Gate it on target entropy so the decoder automatically picks the faster path, transparently, no knobs.

- per decision point, compute target top-1 prob p1 (cheap entropy proxy = expected acceptance) from the logits we already have.
- keep spec at the trained full block (16) when confidence is high; floor the remainder of the turn to the efficient do_ar_decode (real AR ~100+ tok/s) when the drafter is losing.
- hysteresis: 1-step probe + sustained-low streak (DFLASH_ENTROPY_SUSTAIN, def 2) holds full blocks through transient dips ("big blocks on uncertain transitions"); near-tie immediate floor (DFLASH_ENTROPY_TIE_P1, def 0.45) turns verify off when the argmax is ambiguous.
- threshold DFLASH_ENTROPY_AR_P1 (def 0.90) swept for the Pareto point; gate default-on, DFLASH_ENTROPY_GATE=0 disables, DFLASH_ENTROPY_DEBUG traces p1.
- measured: verbatim 236 / code-gen->AR 117 / novel->AR 83 tok/s, always >= AR.
- temp 0: semantically equivalent to AR (spec verifies vs target argmax; both take the argmax). Not bit-identical: near-tie argmax flips via verify-batch FP reduction order, the established spec-decode bar.
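The gating rule above can be sketched roughly as follows. This is an illustration under assumptions, not the code in qwen35_backend.cpp: `top1_prob`, `EntropyGate`, and `Path` are hypothetical names, the defaults mirror the documented env defaults (0.90, 0.45, 2), and the 1-step probe is not modeled.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Top-1 softmax probability p1, computed with the usual max-subtraction
// trick so large logits do not overflow exp().
inline double top1_prob(const std::vector<float>& logits) {
    if (logits.empty()) return 0.0;
    const float mx = *std::max_element(logits.begin(), logits.end());
    double denom = 0.0;
    for (float l : logits) denom += std::exp(static_cast<double>(l) - mx);
    return 1.0 / denom;  // numerator is exp(mx - mx) == 1
}

enum class Path { Spec, AR };

// Hypothetical per-turn gate state.
struct EntropyGate {
    double ar_p1 = 0.90;   // mirrors DFLASH_ENTROPY_AR_P1
    double tie_p1 = 0.45;  // mirrors DFLASH_ENTROPY_TIE_P1
    int sustain = 2;       // mirrors DFLASH_ENTROPY_SUSTAIN
    int low_streak = 0;    // consecutive decision points with p1 < ar_p1
    bool floored = false;  // once set, the rest of the turn stays on AR

    void begin_turn() { low_streak = 0; floored = false; }

    Path decide(double p1) {
        if (floored) return Path::AR;
        if (p1 < tie_p1) {             // near-tie: floor immediately
            floored = true;
            return Path::AR;
        }
        if (p1 < ar_p1) {              // low confidence: count the streak
            if (++low_streak >= sustain) {
                floored = true;
                return Path::AR;
            }
        } else {
            low_streak = 0;            // a confident step clears the streak
        }
        return Path::Spec;             // keep the full block through dips
    }
};
```

A single dip (for example p1 = 0.80 between two steps at 0.95) leaves the gate on the spec path; two consecutive dips, or any step below the tie threshold, floor the rest of the turn to AR until `begin_turn()` resets the state.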

…after cliff

Two changes that make dFlash spec-decode safe and useful across content and context length without per-model tuning.

1. Long-context drafter cliff fix. The block-diffusion drafter's prediction collapses when it self-attends more than ~2048 tokens (measured: 93% accept at draft_ctx<=2048 vs 6% at 4096, independent of total prompt context). The old default ran it at max(2048, draft_ctx_max=4096)=4096, past the drafter's effective limit, so spec-decode died above ~2K context. Cap the drafter's self-attention at 2048 by default; spec now holds 77-93% accept / 110-200 tok/s out to 35K context for recent-derived output. DFLASH_DRAFT_CTX_MAX overrides for drafters with a larger usable window.

2. Self-calibrating commit-EMA gate (replaces the p1-entropy gate). dFlash wins only when its realized throughput beats AR; that break-even is model- and context-dependent (a fixed entropy threshold over-floored dense, under-floored MoE). Measure t_ar once per process (cached on the backend, no per-turn warmup tax), then floor the remainder of a turn to the efficient AR path when the EMA of commit_n*t_ar/step_wall stays below 1.0 (spec slower than AR) for a few steps. Knob-free, never slower than AR; floors novel/high-entropy turns, keeps spec on code/structured. Env: DFLASH_SPEC_GATE(=1), _MARGIN, _SUSTAIN, _WARMUP, _DEBUG. Applies to both base (do_spec_decode) and MoE hybrid (do_hybrid_spec_decode) paths. Temp 0: semantically equivalent to AR.
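A rough sketch of the commit-EMA rule in item 2, to make the break-even arithmetic concrete. The struct name `CommitEmaGate`, the smoothing factor, and the warmup/sustain values are assumptions chosen for illustration; the commit message does not state the real defaults behind _MARGIN, _SUSTAIN, and _WARMUP.

```cpp
// Hypothetical gate: tracks an EMA of (tokens committed per spec step) *
// (seconds per AR token) / (wall seconds of the spec step). A value of 1.0
// means the spec step produced tokens exactly as fast as plain AR would.
struct CommitEmaGate {
    double t_ar_s;          // seconds per AR token, measured once per process
    double alpha = 0.3;     // assumption: EMA smoothing factor
    double margin = 1.0;    // break-even threshold (cf. _MARGIN)
    int warmup = 2;         // assumption: steps ignored at turn start (cf. _WARMUP)
    int sustain = 3;        // assumption: below-margin steps before flooring (cf. _SUSTAIN)

    double ema = 0.0;
    int steps = 0;
    int below = 0;
    bool floored = false;

    explicit CommitEmaGate(double t_ar_seconds) : t_ar_s(t_ar_seconds) {}

    void begin_turn() { ema = 0.0; steps = 0; below = 0; floored = false; }

    // Feed one finished spec step; returns true once the rest of the turn
    // should fall back to the AR path.
    bool observe(int commit_n, double step_wall_s) {
        if (floored) return true;
        if (step_wall_s <= 0.0) return false;  // ignore unusable timings
        ++steps;
        const double speedup = commit_n * t_ar_s / step_wall_s;
        ema = (steps == 1) ? speedup : alpha * speedup + (1.0 - alpha) * ema;
        if (steps <= warmup) return false;
        if (ema < margin) {
            if (++below >= sustain) floored = true;
        } else {
            below = 0;
        }
        return floored;
    }
};
```

For example, with t_ar = 0.01 s (100 tok/s AR), a spec step that commits 8 tokens in 0.04 s scores 8 * 0.01 / 0.04 = 2.0 and keeps spec on, while a step that commits 1 token in the same time scores 0.25 and, if that persists past warmup for `sustain` steps, floors the turn to AR.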

…uard removal

core 71371d8 removed the !layout_known_ short-circuit; cold_prefix_boundary now returns the last eligible boundary. Updates the stale ==0 expectation.

CI: test_server_unit.cpp.

dusterbloom added 4 commits June 24, 2026 16:19

test(server): cold_prefix_boundary returns 4000 after layout_known_ g…

10376f9


dusterbloom wants to merge 4 commits into integration/kvflash-complete from pr/dflash-spec-gate
