
feat(dflash): entropy + self-calibrating spec-decode gate (composes with #436)#3

Open
dusterbloom wants to merge 4 commits into integration/kvflash-complete from pr/dflash-spec-gate

@dusterbloom

Wave-2 dFlash gate. Stacked on integration/kvflash-complete (origin/main + full KVFlash stack), so the diff here is just the converter + the two gate commits.

New commits: entropy-gated spec-decode (never slower than AR) + self-calibrating commit-EMA gate + long-ctx drafter-cliff fix.

Composes with origin/main Luce-Org#436 (DFlash tree-verify under KVFlash) via a verified UNION in qwen35_backend.cpp: keeps Luce-Org#436's kvflash_tree_ok guard + alloc_span prefix registration AND the gate's EMA-timing rewrite (logically independent). Gate logic is inline (pre-abstraction; the common/ dedup is the separate Wave-3 abstraction).

VALIDATED — 64K NIAH on the full-KVFlash base: NO SIGSEGV at ~70K pooled prefill, needle 847291356 exact at turn3, restore=true (prefill 0.100s), ddtree tree-verify engaged, no invalid-seed, gate self-calibrating. (A partial substrate+converter base SIGSEGV'd here — the full KVFlash fix set is required.)

Base is a fork branch because Wave-2 sits on the unmerged KVFlash wave; retarget to Luce-Org:main once Wave-1 (Luce-Org#428/Luce-Org#429/Luce-Org#430/Luce-Org#445/Luce-Org#446) lands.

…ectness fixes

DRAFTER CONVERTER (config-driven):
- convert_dflash_to_gguf.py reads all architecture params from config.json
  (hidden_size, n_layer, mask_token_id, target_layer_ids, layer_types for
  SWA, sliding_window). No hardcoded constants.
- quantize_draft_q8.py shares load_arch with the converter.
- GGUF metadata: dflash.mask_token_id, dflash.target_layer_ids[],
  dflash.block_size, attention.sliding_window + pattern.
- draft_gguf_loader.cpp: read_draft_capture_config(), mask from GGUF
  metadata, block_size override, SWA pattern from metadata.
- draft_safetensors_loader.cpp: dynamic layer count, SWA+mask from
  config.json.
- gguf_target_loader.cpp: respect drafter-specified capture layers instead
  of overwriting with evenly-spaced heuristic.
- qwen35_backend.cpp: early-read capture sync + mask token propagation.
- internal.h: capture_layer_ids[16], DFLASH_MAX_CAPTURE_LAYERS=16.
- dflash27b.h: DFLASH_MAX_CAPTURE_LAYERS=16.
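A minimal sketch of what "config-driven, no hardcoded constants" can look like for the shared `load_arch` helper. The real signature and the exact `config.json` key layout are not shown in this PR, so the flat Hugging Face style keys (`num_hidden_layers` for n_layer, top-level `mask_token_id` and `target_layer_ids`), the `DraftArch` container, and the length check on `layer_types` are all assumptions made for illustration.

```python
import json
from dataclasses import dataclass
from pathlib import Path


@dataclass
class DraftArch:
    """Hypothetical container for the drafter architecture parameters."""
    hidden_size: int
    n_layer: int
    mask_token_id: int
    target_layer_ids: list
    layer_types: list
    sliding_window: int


def load_arch(config_path):
    """Read every architecture parameter from config.json, with no fallbacks."""
    cfg = json.loads(Path(config_path).read_text())
    # Assumed key names. A missing key raises KeyError instead of silently
    # falling back to a hardcoded constant.
    arch = DraftArch(
        hidden_size=int(cfg["hidden_size"]),
        n_layer=int(cfg["num_hidden_layers"]),
        mask_token_id=int(cfg["mask_token_id"]),
        target_layer_ids=[int(i) for i in cfg["target_layer_ids"]],
        layer_types=list(cfg["layer_types"]),
        sliding_window=int(cfg["sliding_window"]),
    )
    # Assumption: layer_types carries one entry per drafter layer.
    if len(arch.layer_types) != arch.n_layer:
        raise ValueError("layer_types length does not match layer count")
    return arch
```

Sharing one loader between the converter and the quantizer means both scripts fail in the same way on an incomplete config, rather than one of them quietly using a stale default.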

SPEC-DECODE PERFORMANCE:
- graph_builders.cpp: build_lm_head_projection_step skips rebuild when ctx
  alive + n_tokens matches (centralized guard; was per-call-site).
- qwen35_backend.cpp: do_spec_decode uses member draft_sg_ (not local) for
  graph persistence; kFastRollbackThreshold env-tunable
  (DFLASH_FAST_ROLLBACK_MIN, default 5).
- dflash_draft_graph.cpp: exact-ctx_len non-view reuse guard
  (DFLASH_DRAFT_GRAPH_REUSE, default ON). 4MB ctx alloc (was 256MB).
- graph_builders.cpp: 4MB ctx alloc (was 64MB).
- step_graph.h: graph_ctx_len + graph_used_view tracking fields.
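The exact-ctx_len, non-view reuse guard can be modelled roughly as below. This is a Python sketch of the decision only, not the C++ in dflash_draft_graph.cpp: the `StepGraph` fields mirror the `graph_ctx_len` and `graph_used_view` names from step_graph.h, while the `alive` flag, the `needs_view` argument, and the "0 disables" parsing of `DFLASH_DRAFT_GRAPH_REUSE` are assumptions.

```python
import os
from dataclasses import dataclass


@dataclass
class StepGraph:
    """Hypothetical mirror of the tracking fields added to step_graph.h."""
    alive: bool = False            # a built graph/context currently exists
    graph_ctx_len: int = -1        # ctx_len the cached graph was built for
    graph_used_view: bool = False  # cached graph was built over a tensor view


def reuse_enabled(env=None):
    """Default ON; assume only the literal value "0" turns reuse off."""
    env = os.environ if env is None else env
    return env.get("DFLASH_DRAFT_GRAPH_REUSE", "1") != "0"


def can_reuse_draft_graph(sg, ctx_len, needs_view, env=None):
    """Reuse only on an exact ctx_len match with no view on either side."""
    return (
        reuse_enabled(env)
        and sg.alive
        and not sg.graph_used_view
        and not needs_view
        and sg.graph_ctx_len == ctx_len
    )
```

Requiring an exact match keeps the guard conservative: any change in context length or any view-based build forces a rebuild, which costs time but cannot reuse a graph with stale shapes.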

SPEC-DECODE CORRECTNESS:
- qwen35_target_graph.cpp: DFLASH_FEAT_RING_CAP env overrides the hardcoded
  4096 feature ring cap. Default 4096 causes acceptance collapse from 85%
  to 7.7% EXACTLY at 4096 prompt tokens (ring wrap corrupts features).
- qwen35_backend.cpp: mirror init honors DFLASH_FEAT_RING_CAP.
- qwen35_dflash_target.cpp: guard against invalid token IDs from GPU argmax
  at long context (NaN/Inf → clamp to 0, verify rejects gracefully).
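The invalid-token guard amounts to a range check on whatever the GPU argmax returns. The sketch below is an illustrative Python model, not the code in qwen35_dflash_target.cpp; `sanitize_token_id` and its `vocab_size` parameter are hypothetical names.

```python
def sanitize_token_id(token_id, vocab_size):
    """Clamp an out-of-range argmax result to token 0.

    An argmax over NaN/Inf logits can yield an index outside the vocabulary.
    Token 0 is almost certainly not what the target would pick, so the
    verify step rejects it instead of indexing out of bounds.
    """
    if 0 <= token_id < vocab_size:
        return token_id
    return 0
```

For example, `sanitize_token_id(-1, 32000)` and `sanitize_token_id(32000, 32000)` both return 0, while any in-range ID passes through unchanged.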

MOE EXPERIMENTAL (behind flags):
- qwen35moe_backend.cpp: DFLASH_MOE_ALLHOT_HYBRID=1 builds moe_hybrid
  storage even with 0 cold experts to enable pipelined spec-decode verify.
- Persistent moe_hybrid_logits_sg_ graph (was 64MB per-token alloc in
  hybrid_forward_one_token). GPU argmax (4 bytes vs 1MB vocab readback).
- Batched verify/replay via hybrid_forward_batch (was 8 sequential forwards).

VALIDATED:
- 27B dense + reconverted drafter: 57% accept on code gen, 85% on short
  prompts. block=16 gives 252 tok/s (2.2x AR) on code generation.
- 35B-A3B MoE + reconverted new drafter: 86% accept, 245 tok/s (2.1x AR).
- Feature ring cap=16384: 85% holds to 5K tokens, 58% to 10K.
- Full pFlash + dFlash stack: goldgate agentic trace passes (100% tool calls
  valid), pFlash cuts 34K prefill from 475s to 208s (2.3x).
- repo_inspection prompt: correct answers, spec at 33.8% accept, 34 tok/s.

dFlash spec-decode is content-dependent: it wins big on verbatim/copyable
output (drafter accept ~80%, ~235 tok/s) but is 2-4x SLOWER than plain AR on
novel/high-entropy output (accept ~6-16%). On this MoE the rejected tokens
still pay the full expert-routing verify cost. Gate it on target entropy so the
decoder automatically and transparently picks the faster path, with no tuning required.

- per decision point compute target top-1 prob p1 (cheap entropy proxy = expected
  acceptance) from the logits we already have.
- keep spec at the trained full block (16) when confidence is high; floor the
  remainder of the turn to the efficient do_ar_decode (real AR ~100+ tok/s) when
  the drafter is losing.
- hysteresis: 1-step probe + sustained-low streak (DFLASH_ENTROPY_SUSTAIN, def 2)
  holds full blocks through transient dips ("big blocks on uncertain transitions");
  near-tie immediate floor (DFLASH_ENTROPY_TIE_P1, def 0.45) turns verify off when
  the argmax is ambiguous.
- threshold DFLASH_ENTROPY_AR_P1 (def 0.90) swept for the Pareto point; gate
  default-on, DFLASH_ENTROPY_GATE=0 disables, DFLASH_ENTROPY_DEBUG traces p1.
- measured: verbatim 236 / code-gen->AR 117 / novel->AR 83 tok/s, always >= AR.
- temp 0: semantically equivalent to AR (spec verifies against the target argmax,
  and both paths take the argmax). Not bit-identical: a near-tie argmax can flip
  because the batched verify uses a different FP reduction order. This is the
  established spec-decode bar.
…after cliff

Two changes that make dFlash spec-decode safe and useful across content and
context length without per-model tuning.

1. Long-context drafter cliff fix. The block-diffusion drafter's prediction
   collapses when it self-attends more than ~2048 tokens (measured: 93% accept
   at draft_ctx<=2048 vs 6% at 4096, independent of total prompt context). The
   old default ran it at max(2048, draft_ctx_max=4096)=4096 — past the drafter's
   effective limit — so spec-decode died above ~2K context. Cap the drafter's
   self-attention at 2048 by default; spec now holds 77-93% accept / 110-200
   tok/s out to 35K context for recent-derived output. DFLASH_DRAFT_CTX_MAX
   overrides for drafters with a larger usable window.

2. Self-calibrating commit-EMA gate (replaces the p1-entropy gate). dFlash wins
   only when its realized throughput beats AR; that break-even is model- and
   context-dependent (a fixed entropy threshold over-floored dense, under-floored
   MoE). Measure t_ar once per process (cached on the backend, no per-turn warmup
   tax), then floor the remainder of a turn to the efficient AR path when the EMA
   of commit_n*t_ar/step_wall stays below 1.0 (spec slower than AR) for a few
   steps. Knob-free, never slower than AR; floors novel/high-entropy turns,
   keeps spec on code/structured. Env: DFLASH_SPEC_GATE(=1), _MARGIN, _SUSTAIN,
   _WARMUP, _DEBUG. Applies to both base (do_spec_decode) and MoE hybrid
   (do_hybrid_spec_decode) paths. Temp 0: semantically equivalent to AR.
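Both changes can be sketched in a few lines of Python. This is a model under stated assumptions rather than the C++ in do_spec_decode: `CommitEmaGate` and `effective_draft_ctx` are hypothetical names, the smoothing factor `alpha` and the numeric defaults for margin, sustain, and warmup are illustrative (the commit message names the env vars but not their values), and `DFLASH_DRAFT_CTX_MAX` is assumed to parse as a plain integer.

```python
import os


def effective_draft_ctx(ctx_len, env=None):
    """Cap the drafter's self-attention window (2048 unless overridden)."""
    env = os.environ if env is None else env
    cap = int(env.get("DFLASH_DRAFT_CTX_MAX", 2048))
    return min(ctx_len, cap)


class CommitEmaGate:
    """Floor a turn to AR when the EMA speedup over AR stays below margin."""

    def __init__(self, t_ar, alpha=0.3, margin=1.0, sustain=3, warmup=2):
        self.t_ar = t_ar        # seconds per AR token, measured once per process
        self.alpha = alpha      # EMA smoothing factor (assumed)
        self.margin = margin    # break-even speedup; 1.0 means "as fast as AR"
        self.sustain = sustain  # consecutive slow steps before flooring
        self.warmup = warmup    # steps observed before the gate may act
        self.ema = None
        self.steps = 0
        self.slow_streak = 0
        self.floored = False

    def update(self, commit_n, step_wall):
        """Record one spec step; return "spec" or "ar" for the next step."""
        if self.floored:
            return "ar"
        if step_wall <= 0:
            raise ValueError("step_wall must be positive")
        # Tokens committed, priced in AR time, per wall-clock time spent.
        ratio = commit_n * self.t_ar / step_wall
        self.steps += 1
        if self.ema is None:
            self.ema = ratio
        else:
            self.ema = self.alpha * ratio + (1 - self.alpha) * self.ema
        if self.steps <= self.warmup:
            return "spec"
        if self.ema < self.margin:
            self.slow_streak += 1
            if self.slow_streak >= self.sustain:
                self.floored = True
                return "ar"
        else:
            self.slow_streak = 0
        return "spec"
```

With `t_ar = 0.01` (100 tok/s AR), a step that commits 8 tokens in 0.04 s scores 2.0 and keeps spec on, while a step that commits 1 token in the same time scores 0.25 and counts toward the floor.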
…uard removal

core 71371d8 removed the !layout_known_ short-circuit; cold_prefix_boundary now returns the last eligible boundary. Updates the stale ==0 expectation. CI: test_server_unit.cpp.