feat(dflash): add DeepSeek V4 Flash backend #353
Implement full DS4 Flash model backend for AR-only decode:
- deepseek4_internal.h: data structures (layer, weights, cache, config)
- deepseek4_loader.cpp: GGUF loader with all DS4 metadata/tensor binding
- deepseek4_graph.cpp: ggml compute graph (MLA attention, KV compression with ratio-4/ratio-128, indexer selective attention, MoE with sqrt(softplus) routing, hash routing, HC residual streams)
- deepseek4_backend.cpp: ModelBackend subclass with hybrid hot/cold expert placement (DFLASH_DS4_HYBRID=1)
- deepseek4_daemon.cpp: daemon entry point
Integration:
- Register 'deepseek4' arch in backend_factory.cpp
- Add to CMakeLists.txt (include path + sources)
Tests:
- test_deepseek4_unit.cpp: CPU-only unit tests with synthetic weights (compressor pooling, MoE routing, RMSNorm, grouped output shape, hash routing lookup)
- deepseek4-vectors/: official API test vectors ported from ds4 project (greedy decode logprob fixtures for integration testing)
The DS4 Flash GGUF stores rope.scaling.original_context_length as u32 and compress_ratios as i32 array. Handle both type widths gracefully.
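A minimal sketch of width-tolerant metadata reading. `MetaValue`, `meta_as_i64`, and `meta_array_as_i64` are hypothetical names that stand in for the real GGUF key-value accessors; the idea is only to widen every supported integer type to one common type before use.

```cpp
#include <cstdint>
#include <optional>
#include <type_traits>
#include <variant>
#include <vector>

// Hypothetical stand-in for a GGUF integer scalar of unknown width/signedness.
using MetaValue = std::variant<uint32_t, int32_t, uint64_t, int64_t>;

// Widen any supported integer type to int64_t; reject values that do not fit.
inline std::optional<int64_t> meta_as_i64(const MetaValue & v) {
    return std::visit([](auto x) -> std::optional<int64_t> {
        using T = decltype(x);
        if constexpr (std::is_same_v<T, uint64_t>) {
            if (x > static_cast<uint64_t>(INT64_MAX)) return std::nullopt;
        }
        return static_cast<int64_t>(x);
    }, v);
}

// Convert an array (e.g. compress_ratios) regardless of the stored element type.
inline std::optional<std::vector<int64_t>> meta_array_as_i64(const std::vector<MetaValue> & arr) {
    std::vector<int64_t> out;
    out.reserve(arr.size());
    for (const auto & v : arr) {
        const auto x = meta_as_i64(v);
        if (!x) return std::nullopt;
        out.push_back(*x);
    }
    return out;
}
```

With this shape, a u32 context length and an i32 ratio array go through the same code path, and an out-of-range u64 is reported rather than silently truncated.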
The previous approach set dst->data directly but didn't associate the tensor with its backend buffer, causing 'tensor buffer not set' assert. Now uses ggml_backend_tensor_alloc (matching qwen35 loader pattern). Also keeps token_embd on CPU for embedding lookup.
TargetLoadPlan.layer_end defaults to -1 (not 0), so check for < 0.
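A small sketch of the sentinel check, assuming that an unset `layer_end` means "load through the last layer". `LoadPlanSketch` and `resolve_layer_end` are illustrative names, not the actual `TargetLoadPlan` definition.

```cpp
// Hypothetical mirror of the plan fields; -1 marks layer_end as "not set".
struct LoadPlanSketch {
    int layer_begin = 0;
    int layer_end   = -1;
};

// Treat any negative layer_end as unset and fall back to n_layer.
// Comparing against 0 would miss the default and also misread an explicit 0.
inline int resolve_layer_end(const LoadPlanSketch & plan, int n_layer) {
    return plan.layer_end < 0 ? n_layer : plan.layer_end;
}
```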
When full model load fails (e.g., 81GB model on 24GB GPU), automatically fall back to hybrid mode (experts on CPU, core on GPU).
…er shapes
- Output projection now correctly uses batched 3D matmul for grouped low-rank: reshape out_a [4096,8192] to [4096,1024,8], reshape q to [4096,8,n_tok], batched matmul → [1024,8,n_tok] → out_b [8192,4096]
- Attention placeholder: use reshaped q (correct shape [32768,n_tok]) instead of broken kv×q matmul
- Disable compressed context block (shapes incompatible with placeholder)
HC build_hc_pre returns [n_embd] (1D) but the graph expects [n_embd, n_tokens]. Bypass HC entirely until proper multi-token HC state management is implemented.
The 3D matmul batch dimension (ne[2]) must match between weight and input. Use permute to put n_out_group in ne[2] for both tensors so ggml can broadcast correctly across the group dimension.
Ratio-4 layers use comp_width = 2*head_dim (1024) with 2*ratio state rows. Ratio-128 layers use comp_width = head_dim (512). Indexer uses n_indexer_head_dim (128) as output, not full multi-head width. Pooling placeholder just takes first head_dim elements for now.
sum_rows operates on ne[0] (heads) producing [1, n_comp]. Don't transpose first or elements won't match reshape.
Without ggml_set_input, the graph allocator doesn't allocate buffers for the position tensors, causing 'tensor buffer not set' when we try to set their values before compute.
The I32 position tensors for RoPE in side-effect subgraphs (cpy to external cache buffers) don't get their buffers allocated by gallocr. Skip RoPE for now - output is placeholder anyway. Will fix properly when implementing full compressor pooling logic.
Keep only meaningful error/info prints in the backend.
…ooling
- Implement proper tail RoPE: split last n_rot=64 dims, apply rope, concat back. Per-layer freq_base (compressed vs non-compressed layers) with YaRN scaling for compressed layers.
- Replace attention placeholder with full SWA dot-product attention: Q@KV^T scaled softmax over ring buffer, weighted sum, inverse tail RoPE on output.
- Implement per-dim softmax-weighted pooling for compressor state, replacing the first-row placeholder.
- Add I32 array bindings for multi-element position tensors.
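The per-dim softmax-weighted pooling mentioned above can be sketched on plain arrays. This is an assumption about the shape of the computation, not the PR's ggml graph: it supposes each state row carries a value and a score per dimension, and that every dimension is pooled independently with a softmax over rows.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// values and scores are row-major [n_rows][width]; returns one pooled row.
// For each dimension d the rows are weighted by softmax(scores[:, d]).
inline std::vector<float> softmax_pool_per_dim(const std::vector<float> & values,
                                               const std::vector<float> & scores,
                                               std::size_t n_rows, std::size_t width) {
    std::vector<float> pooled(width, 0.0f);
    if (n_rows == 0) return pooled;
    for (std::size_t d = 0; d < width; ++d) {
        // Subtract the column maximum for numerical stability.
        float max_s = scores[d];
        for (std::size_t r = 1; r < n_rows; ++r) max_s = std::max(max_s, scores[r * width + d]);
        float denom = 0.0f;
        float acc   = 0.0f;
        for (std::size_t r = 0; r < n_rows; ++r) {
            const float w = std::exp(scores[r * width + d] - max_s);
            denom += w;
            acc   += w * values[r * width + d];
        }
        pooled[d] = acc / denom;
    }
    return pooled;
}
```

With equal scores this reduces to a per-dimension mean, which is a convenient sanity check against the earlier first-row placeholder.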
Implement the full HC mechanism on CPU for the hybrid path:
- HC pre: RMSNorm → matmul with fn tensor → Sinkhorn normalization (20 iters on 4×4 combine matrix) → weighted sum of 4 residual streams
- HC post: update all 4 streams using post gates + combine matrix
- Output HC pre: sigmoid-weighted stream merge before final norm/logits
- Lazy-load HC weight tensors from GPU to CPU on first use (~65MB total)
- Restructure hybrid loop: separate attention and FFN into independent graphs with HC pre/post between them (eliminates incorrect residual additions)
Previously only the last token's KV was written to the ring buffer during prefill, causing decode to attend to a nearly empty cache. Now all tokens' KV entries are written to their correct ring buffer positions.
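A sketch of the corrected prefill write, assuming the ring slot for absolute position `p` is `p % window`. `KvRingSketch` and `write_prefill_kv` are hypothetical; the real cache lives in backend buffers rather than a `std::vector`.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical sliding-window KV cache stored as [window][row_width].
struct KvRingSketch {
    std::size_t window;
    std::size_t row_width;
    std::vector<float> rows;
    KvRingSketch(std::size_t w, std::size_t rw) : window(w), row_width(rw), rows(w * rw, 0.0f) {}
};

// Write the KV row of every prefill token to its own slot, not only the last one.
// kv is row-major [n_tokens][row_width]; first_pos is the position of token 0.
inline void write_prefill_kv(KvRingSketch & ring, const std::vector<float> & kv,
                             std::size_t first_pos, std::size_t n_tokens) {
    for (std::size_t t = 0; t < n_tokens; ++t) {
        const std::size_t slot = (first_pos + t) % ring.window;
        std::copy_n(kv.begin() + static_cast<std::ptrdiff_t>(t * ring.row_width),
                    ring.row_width,
                    ring.rows.begin() + static_cast<std::ptrdiff_t>(slot * ring.row_width));
    }
}
```

When the prompt is longer than the window, later tokens overwrite earlier slots, so decode starts with the most recent `window` entries populated.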
DS4's rope_tail_ext_inplace rotates consecutive pairs (i, i+1), which is GGML_ROPE_TYPE_DEFAULT. NEOX mode, which instead pairs dimension i with dimension i + n_rot/2, was incorrect and caused completely wrong position encodings.
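To make the pairing concrete, here is a minimal CPU sketch of tail RoPE with consecutive pairs. It rotates only the last `n_rot` dimensions and leaves the rest untouched. `rope_tail_pairs` is a hypothetical helper, the frequency formula is the standard `freq_base^(-i/n_rot)`, and YaRN scaling is deliberately left out.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Rotate the last n_rot dims of one head vector in place, pairing (i, i+1).
// inverse = true applies the opposite rotation (as used on the attention output).
inline void rope_tail_pairs(std::vector<float> & x, std::size_t n_rot, int pos,
                            float freq_base, bool inverse = false) {
    assert(n_rot % 2 == 0 && n_rot <= x.size());
    const std::size_t start = x.size() - n_rot;
    for (std::size_t i = 0; i < n_rot; i += 2) {
        const float theta = static_cast<float>(pos) *
            std::pow(freq_base, -static_cast<float>(i) / static_cast<float>(n_rot));
        const float c = std::cos(theta);
        const float s = inverse ? -std::sin(theta) : std::sin(theta);
        const float a = x[start + i];
        const float b = x[start + i + 1];
        x[start + i]     = a * c - b * s;
        x[start + i + 1] = a * s + b * c;
    }
}
```

A NEOX-style variant would read `x[start + i/2]` and `x[start + i/2 + n_rot/2]` instead, so the two modes produce different vectors for any nonzero position.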
- Add eval_begin/eval_end async IPC pattern for pipelined expert offload
- Add graph timing reporting (hot/cold cache hit/miss stats)
- Extend MoeHybridFFNEval with async dispatch and fixed-slot graph caching
- Wire async expert eval into DeepSeek4 graph forward pass
- Add DeepSeek4Expert IPC mode to backend_ipc_main dispatch
…otocol
- Rename deepseek4_expert_ipc.{h,cpp} → expert_ipc.{h,cpp} (model-agnostic
client and wire protocol)
- Move deepseek4_expert_ipc_daemon.cpp → src/deepseek4/ (DS4-specific worker
implementation belongs with its model code, not under common/)
- Rename types: DeepSeek4ExpertIpc* → ExpertIpc*, DS4_EXPERT_IPC_FLAG_* →
EXPERT_IPC_FLAG_*
- Rename BackendIpcMode::DeepSeek4Expert → MoeExpert; accept both
'moe-expert' and 'deepseek4-expert' on the CLI for compat
Rewrites the DeepSeek V4 Flash integration doc with clearer structure:
- Architecture summary (MLA, HC, MoE, KV compression)
- Forward pass walkthrough (hybrid path)
- Hot/cold expert partitioning logic and placement strategies
- IPC protocol and async overlap
- Full environment variable reference
Moves from docs/specs/deepseek4-experts.md to server/docs/DS4.md to colocate with the server source it documents.
- Add DFLASH_DS4_ADAPTIVE_HOT=1 with configurable target ratio (DFLASH_DS4_ADAPTIVE_HOT_TARGET_RATIO, default 0.5). Places fewer hot experts to balance HIP worker compute with CPU cold overlap. With 256 experts and ratio=0.5, ~128 hot/layer means ~3 hot + 3 cold per token (optimal for async overlap hiding).
- Add DFLASH_MOE_FIXED_SLOT_GRAPHS=adaptive mode: pads graph slots to max(actual_count, 3) instead of always n_expert_used=6. Reduces wasted expert compute while still caching most graph shapes.
- Add DFLASH_MOE_FIXED_SLOT_MAX=N to cap fixed-slot padding.
- Refactor budget computation into reusable Ds4HybridBudgetInfo and fill_prefix_hot_placement() helpers.
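The arithmetic behind the adaptive settings can be sketched as follows. All three helpers are hypothetical, the per-token estimate assumes uniform routing, and the way the cap interacts with the padding floor is a guess rather than the PR's actual rule.

```cpp
#include <algorithm>
#include <cmath>

// Experts to keep hot per layer for a target ratio, limited by what the budget allows.
inline int adaptive_hot_count(int n_expert, float target_ratio, int budget_max_hot) {
    const int want = static_cast<int>(std::lround(static_cast<float>(n_expert) * target_ratio));
    return std::clamp(want, 0, std::max(0, std::min(n_expert, budget_max_hot)));
}

// Expected hot experts per token if routing were uniform over all experts.
inline float expected_hot_per_token(int n_expert_used, int n_hot, int n_expert) {
    return n_expert > 0 ? static_cast<float>(n_expert_used) * n_hot / n_expert : 0.0f;
}

// Slots for a cached graph in "adaptive" mode: pad up to a floor of 3, never past
// n_expert_used, optionally capped by slot_max (<= 0 means no cap), and never
// below the number of experts that actually need a slot.
inline int adaptive_slot_count(int actual_count, int n_expert_used, int slot_max) {
    int slots = std::max(actual_count, 3);
    slots = std::min(slots, n_expert_used);
    if (slot_max > 0) slots = std::min(slots, slot_max);
    return std::max(slots, actual_count);
}
```

For 256 experts at ratio 0.5 this gives 128 hot experts and an expected 3 hot experts per token out of 6, matching the figures in the commit message.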
Three call sites were missing the swiglu_clamp parameter after it was added to the function signature, causing build failures.
- Revert .gitignore copilot-instructions entry (personal dev artifact)
- Revert bench_server.py refactor (unrelated to DS4)
- Remove ungated expert cache miss fprintf in IPC daemon
CPU-only unit tests (pooling, routing, RMSNorm, output proj shape, hash routing) now build and run alongside test_server_unit in CI.
Stream weights directly into the unified (managed) buffer with a parallel pread + posix_fadvise(DONTNEED) at disk bandwidth, instead of mmap page-faults. With the bumped ggml submodule (unified-memory allocation), an ~86GB model loads in ~81s on Strix Halo gfx1151 (was un-loadable). Falls back to the mmap copy path on non-managed buffers or when DFLASH_NO_PREAD=1. Bumps server/deps/llama.cpp to 9cd9e1ed for ggml_backend_cuda_buffer_is_managed and the unified-memory allocator.
@davidmroth please help me check whether the llama.cpp submodule update is needed.
16 issues found across 39 files
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="server/src/deepseek4/deepseek4_hc_cuda.cu">
<violation number="1" location="server/src/deepseek4/deepseek4_hc_cuda.cu:13">
P2: Mix output dimension is hardcoded to 24 instead of being derived from model metadata; inconsistent with dynamically-sized CPU path</violation>
</file>
<file name="server/src/deepseek4/deepseek4_loader.cpp">
<violation number="1" location="server/src/deepseek4/deepseek4_loader.cpp:376">
P2: Failure paths after `out.ctx = meta_ctx` leave `DeepSeek4Weights` in a partially initialized state without resetting `out.ctx`; the `meta_ctx` context leaks if the caller does not invoke `free_deepseek4_weights()`. Either move `out.ctx = meta_ctx` to after all error paths or reset it (and free `meta_ctx`) in each failure branch.</violation>
<violation number="2" location="server/src/deepseek4/deepseek4_loader.cpp:438">
P1: Missing mmap bounds validation before tensor copy from GGUF file data (main load path).</violation>
<violation number="3" location="server/src/deepseek4/deepseek4_loader.cpp:532">
P2: Missing mmap bounds validation in embedder byte copy.</violation>
<violation number="4" location="server/src/deepseek4/deepseek4_loader.cpp:535">
P1: Embedding-table reload failure is silently ignored, allowing model load to succeed with zeroed embeddings that corrupt all token outputs.</violation>
<violation number="5" location="server/src/deepseek4/deepseek4_loader.cpp:539">
P2: Potential divide-by-zero: `n_vocab` from GGUF metadata is not validated to be non-zero before dividing `a.file_size` by it. A malformed file with `vocab_size=0` will crash the loader.</violation>
<violation number="6" location="server/src/deepseek4/deepseek4_loader.cpp:633">
P1: Integer overflow in tensor bounds check allows malformed GGUF metadata to bypass validation</violation>
</file>
<file name="server/CMakeLists.txt">
<violation number="1" location="server/CMakeLists.txt:570">
P2: Duplicate GPU runtime link configuration for dflash_common. An identical conditional block for linking CUDA::cudart / hip::host already exists earlier in this file (around line 369).</violation>
</file>
<file name="server/src/server/chat_template.cpp">
<violation number="1" location="server/src/server/chat_template.cpp:370">
P1: `add_generation_prompt` is gated by `pending_assistant`, so system-only or empty chats fail to emit an assistant generation prefix. All other formats in this file (QWEN3, LAGUNA, GEMMA4) check `add_generation_prompt` unconditionally.</violation>
</file>
<file name="server/src/common/moe_hybrid_ffn_eval.h">
<violation number="1" location="server/src/common/moe_hybrid_ffn_eval.h:202">
P1: New default parameter inserted in middle of `build_cached_hot_graph` signature silently remaps existing positional arguments at call site</violation>
</file>
<file name="server/src/ipc/backend_ipc_main.cpp">
<violation number="1" location="server/src/ipc/backend_ipc_main.cpp:128">
P2: fprintf format/argument mismatch: 6 `%s` placeholders but only 5 `argv[0]` arguments supplied, causing undefined behavior.</violation>
<violation number="2" location="server/src/ipc/backend_ipc_main.cpp:284">
P2: New DS4 expert CLI options use ad-hoc argument parsing without required-value enforcement or numeric validation, allowing silent misconfiguration.</violation>
</file>
<file name="server/src/deepseek4/deepseek4_backend.cpp">
<violation number="1" location="server/src/deepseek4/deepseek4_backend.cpp:690">
P1: park()/unpark() are no-op stubs that do not release or reacquire GPU resources, breaking the backend lifecycle contract</violation>
<violation number="2" location="server/src/deepseek4/deepseek4_backend.cpp:824">
P2: Hard-coded EOS token IDs (151643, 151644) make termination behavior fragile across tokenizer/model variants. Replace with tokenizer metadata lookup.</violation>
</file>
<file name="server/src/deepseek4/deepseek4_graph.cpp">
<violation number="1" location="server/src/deepseek4/deepseek4_graph.cpp:550">
P1: Indexer `build_indexer_score` result is discarded and attention still attends all compressed KV rows instead of selected top-k rows</violation>
<violation number="2" location="server/src/deepseek4/deepseek4_graph.cpp:1767">
P1: Non-hybrid fallback path silently produces incorrect results: HC mixing is bypassed with TODO stubs and hash-routed expert layers are zeroed out instead of using token-based hash routing.</violation>
</file>
+ if (tid < 0) return {};
+ const size_t off = data_start + gguf_get_tensor_offset(gctx, tid);
+ const size_t sz = gguf_get_tensor_size(gctx, tid);
+ if (off + sz > mmap.len) return {};
P1: Integer overflow in tensor bounds check allows malformed GGUF metadata to bypass validation
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/src/deepseek4/deepseek4_loader.cpp, line 633:
<comment>Integer overflow in tensor bounds check allows malformed GGUF metadata to bypass validation</comment>
<file context>
@@ -0,0 +1,677 @@
+ if (tid < 0) return {};
+ const size_t off = data_start + gguf_get_tensor_offset(gctx, tid);
+ const size_t sz = gguf_get_tensor_size(gctx, tid);
+ if (off + sz > mmap.len) return {};
+ return { file_bytes + off, sz };
+ };
</file context>
Suggested change:
- if (off + sz > mmap.len) return {};
+ if (off > mmap.len || sz > mmap.len - off) return {};
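The suggested check can be factored into a small helper. This is a sketch with hypothetical names (`range_in_bounds`, `checked_add`); it uses subtraction so that no intermediate sum can wrap around `size_t`.

```cpp
#include <cstddef>
#include <cstdint>

// True when [off, off + sz) lies entirely inside a mapping of map_len bytes.
// The subtraction form cannot overflow, unlike `off + sz > map_len`.
inline bool range_in_bounds(std::size_t off, std::size_t sz, std::size_t map_len) {
    return off <= map_len && sz <= map_len - off;
}

// Overflow-checked addition, e.g. for data_start + tensor_offset.
inline bool checked_add(std::size_t a, std::size_t b, std::size_t & out) {
    if (a > SIZE_MAX - b) return false;
    out = a + b;
    return true;
}
```

For example, with `off = SIZE_MAX - 1` and `sz = 4`, the naive sum wraps to 2 and passes a `> map_len` test, while `range_in_bounds` rejects it.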
+ } else {
+ for (auto & a : allocs) {
+ if (!a.upload_to_backend) continue;
+ const void * src_data = (const char *)mmap.addr + a.file_offset;
P1: Missing mmap bounds validation before tensor copy from GGUF file data (main load path).
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/src/deepseek4/deepseek4_loader.cpp, line 438:
<comment>Missing mmap bounds validation before tensor copy from GGUF file data (main load path).</comment>
<file context>
@@ -0,0 +1,677 @@
+ } else {
+ for (auto & a : allocs) {
+ if (!a.upload_to_backend) continue;
+ const void * src_data = (const char *)mmap.addr + a.file_offset;
+ ggml_backend_tensor_set(a.tensor, src_data, 0, a.file_size);
+ }
</file context>
+ (const char *)emb_mmap.addr + a.file_offset, a.file_size);
+ emb_mmap.close_map();
+ }
+ out.embedder.tok_embd_bytes = out.embedder.tok_embd_owned.data();
P1: Embedding-table reload failure is silently ignored, allowing model load to succeed with zeroed embeddings that corrupt all token outputs.
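One way to make this failure visible is to have the copy step return a status that the loader must check. The sketch below is illustrative only: `MappedFileSketch` and `copy_embedding_table` are hypothetical, and it also folds in the bounds and zero-vocab checks raised for the neighboring lines.

```cpp
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical stand-in for the loader's read-only mapping.
struct MappedFileSketch {
    const char * addr = nullptr;
    std::size_t  len  = 0;
};

// Copy the embedding table out of the mapping. Every failure is reported through
// the return value and err, so the caller cannot continue with a zero-filled table.
inline bool copy_embedding_table(const MappedFileSketch & map, std::size_t file_offset,
                                 std::size_t file_size, std::size_t n_vocab,
                                 std::vector<char> & out, std::size_t & row_bytes,
                                 std::string & err) {
    if (map.addr == nullptr) { err = "embedding mmap is not open"; return false; }
    if (n_vocab == 0)        { err = "vocab size is zero";         return false; }
    if (file_size == 0)      { err = "embedding tensor is empty";  return false; }
    if (file_offset > map.len || file_size > map.len - file_offset) {
        err = "embedding tensor lies outside the file";
        return false;
    }
    out.resize(file_size);
    std::memcpy(out.data(), map.addr + file_offset, file_size);
    row_bytes = file_size / n_vocab;  // safe: n_vocab checked above
    return true;
}
```

The loader would then fail the whole model load when this returns false, instead of publishing pointers to an unfilled buffer.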
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/src/deepseek4/deepseek4_loader.cpp, line 535:
<comment>Embedding-table reload failure is silently ignored, allowing model load to succeed with zeroed embeddings that corrupt all token outputs.</comment>
<file context>
@@ -0,0 +1,677 @@
+ (const char *)emb_mmap.addr + a.file_offset, a.file_size);
+ emb_mmap.close_map();
+ }
+ out.embedder.tok_embd_bytes = out.embedder.tok_embd_owned.data();
+ out.embedder.tok_embd_type = a.tensor->type;
+ out.embedder.n_embd = n_embd;
</file context>
+ }
+ }
+
+ if (add_generation_prompt && pending_assistant) {
P1: add_generation_prompt is gated by pending_assistant, so system-only or empty chats fail to emit an assistant generation prefix. All other formats in this file (QWEN3, LAGUNA, GEMMA4) check add_generation_prompt unconditionally.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/src/server/chat_template.cpp, line 370:
<comment>`add_generation_prompt` is gated by `pending_assistant`, so system-only or empty chats fail to emit an assistant generation prefix. All other formats in this file (QWEN3, LAGUNA, GEMMA4) check `add_generation_prompt` unconditionally.</comment>
<file context>
@@ -313,6 +314,65 @@ std::string render_chat_template(
+ }
+ }
+
+ if (add_generation_prompt && pending_assistant) {
+ result += "<|Assistant|>";
+ result += enable_thinking ? "<think>" : "</think>";
</file context>
Suggested change:
- if (add_generation_prompt && pending_assistant) {
+ if (add_generation_prompt) {
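In isolation, the ungated behavior looks like the sketch below. `append_generation_prompt` is a hypothetical helper; the two literals are the ones shown in the quoted diff.

```cpp
#include <string>

// Append the assistant prefix whenever the caller asks for it, regardless of
// whether earlier messages left an assistant turn pending.
inline void append_generation_prompt(std::string & result, bool add_generation_prompt,
                                     bool enable_thinking) {
    if (!add_generation_prompt) return;
    result += "<|Assistant|>";
    result += enable_thinking ? "<think>" : "</think>";
}
```

A system-only or empty conversation then still ends with an assistant prefix, so the model has a turn to continue.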
int n_embd,
int n_ff_exp,
int n_hot,
+ float swiglu_clamp = 0.0f,
P1: New default parameter inserted in middle of build_cached_hot_graph signature silently remaps existing positional arguments at call site
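The remapping hazard is easy to reproduce with reduced signatures. Everything below is hypothetical and keeps only the trailing parameters; it shows how a call written against the old parameter order still compiles but binds its arguments differently.

```cpp
// Records which value each trailing parameter actually received.
struct BoundArgs {
    float swiglu_clamp;
    bool  gpu_remap;
    int   n_expert;
};

// New order: the float was inserted before the existing defaulted parameters.
inline BoundArgs new_sig(int /*n_hot*/, float swiglu_clamp = 0.0f,
                         bool gpu_remap = false, int n_expert = 0) {
    return {swiglu_clamp, gpu_remap, n_expert};
}

// Alternative order: the new parameter is appended, so old calls keep their meaning.
inline BoundArgs appended_sig(int /*n_hot*/, bool gpu_remap = false,
                              int n_expert = 0, float swiglu_clamp = 0.0f) {
    return {swiglu_clamp, gpu_remap, n_expert};
}
```

An old-style call `new_sig(4, true, 8)` binds `true` to `swiglu_clamp` (as 1.0f), `8` to `gpu_remap` (as `true`), and leaves `n_expert` at 0, all through implicit conversions. The same call against `appended_sig` keeps `gpu_remap = true` and `n_expert = 8`.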
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/src/common/moe_hybrid_ffn_eval.h, line 202:
<comment>New default parameter inserted in middle of `build_cached_hot_graph` signature silently remaps existing positional arguments at call site</comment>
<file context>
@@ -186,6 +199,7 @@ bool build_cached_hot_graph(
int n_embd,
int n_ff_exp,
int n_hot,
+ float swiglu_clamp = 0.0f,
bool gpu_remap = false,
int n_expert = 0);
</file context>
+ std::string emb_err;
+ if (emb_mmap.open_ro(path, emb_err)) {
+ std::memcpy(out.embedder.tok_embd_owned.data(),
+ (const char *)emb_mmap.addr + a.file_offset, a.file_size);
P2: Missing mmap bounds validation in embedder byte copy.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/src/deepseek4/deepseek4_loader.cpp, line 532:
<comment>Missing mmap bounds validation in embedder byte copy.</comment>
<file context>
@@ -0,0 +1,677 @@
+ std::string emb_err;
+ if (emb_mmap.open_ro(path, emb_err)) {
+ std::memcpy(out.embedder.tok_embd_owned.data(),
+ (const char *)emb_mmap.addr + a.file_offset, a.file_size);
+ emb_mmap.close_map();
+ }
</file context>
endif()
-if(DFLASH27B_GPU_BACKEND STREQUAL "hip")
- target_link_libraries(dflash_common PRIVATE hip::host)
+if(DFLASH27B_GPU_BACKEND STREQUAL "cuda")
P2: Duplicate GPU runtime link configuration for dflash_common. An identical conditional block for linking CUDA::cudart / hip::host already exists earlier in this file (around line 369).
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/CMakeLists.txt, line 570:
<comment>Duplicate GPU runtime link configuration for dflash_common. An identical conditional block for linking CUDA::cudart / hip::host already exists earlier in this file (around line 369).</comment>
<file context>
@@ -557,8 +567,10 @@ find_package(OpenMP)
endif()
-if(DFLASH27B_GPU_BACKEND STREQUAL "hip")
- target_link_libraries(dflash_common PRIVATE hip::host)
+if(DFLASH27B_GPU_BACKEND STREQUAL "cuda")
+ target_link_libraries(dflash_common PUBLIC CUDA::cudart)
+elseif(DFLASH27B_GPU_BACKEND STREQUAL "hip")
</file context>
const char * value = nullptr;
if (!require_value(i, argc, argv, "--kvflash-pool", value)) return 2;
if (!parse_nonnegative_int(value, kvflash_pool_tokens)) return 2;
+ } else if (std::strncmp(argv[i], "--expert-budget-mb=", 19) == 0) {
P2: New DS4 expert CLI options use ad-hoc argument parsing without required-value enforcement or numeric validation, allowing silent misconfiguration.
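For the numeric half of this, a strict parser might look like the sketch below. It is a hypothetical reimplementation (note the `_sketch` suffix), not the `parse_nonnegative_int` helper that already exists in this file, whose body is not shown here.

```cpp
#include <charconv>
#include <cstring>
#include <system_error>

// Parse a base-10 non-negative int. Rejects null or empty input, signs,
// trailing characters, and values that do not fit in int.
inline bool parse_nonnegative_int_sketch(const char * text, int & out) {
    if (text == nullptr || *text == '\0') return false;
    const char * end = text + std::strlen(text);
    int value = 0;
    const auto res = std::from_chars(text, end, value, 10);
    if (res.ec != std::errc() || res.ptr != end || value < 0) return false;
    out = value;
    return true;
}
```

Routing the new expert options through the same require-then-parse path as `--kvflash-pool` would turn inputs such as `--expert-budget-mb=12mb` into a hard error instead of a silently accepted string.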
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/src/ipc/backend_ipc_main.cpp, line 284:
<comment>New DS4 expert CLI options use ad-hoc argument parsing without required-value enforcement or numeric validation, allowing silent misconfiguration.</comment>
<file context>
@@ -274,6 +281,20 @@ int main(int argc, char ** argv) {
const char * value = nullptr;
if (!require_value(i, argc, argv, "--kvflash-pool", value)) return 2;
if (!parse_nonnegative_int(value, kvflash_pool_tokens)) return 2;
+ } else if (std::strncmp(argv[i], "--expert-budget-mb=", 19) == 0) {
+ ds4_expert_budget_mb = argv[i] + 19;
+ } else if (std::strcmp(argv[i], "--expert-budget-mb") == 0) {
</file context>
"--layer-ends=N[,N...] --max-ctx=N "
- "[--hidden=N --vocab=N --max-tokens=N]\n",
+ "[--hidden=N --vocab=N --max-tokens=N]\n"
+ " or: %s --backend-ipc-mode=moe-expert <model.gguf> "
P2: fprintf format/argument mismatch: 6 %s placeholders but only 5 argv[0] arguments supplied, causing undefined behavior.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/src/ipc/backend_ipc_main.cpp, line 128:
<comment>fprintf format/argument mismatch: 6 `%s` placeholders but only 5 `argv[0]` arguments supplied, causing undefined behavior.</comment>
<file context>
@@ -123,7 +124,9 @@ int main(int argc, char ** argv) {
"--layer-ends=N[,N...] --max-ctx=N "
- "[--hidden=N --vocab=N --max-tokens=N]\n",
+ "[--hidden=N --vocab=N --max-tokens=N]\n"
+ " or: %s --backend-ipc-mode=moe-expert <model.gguf> "
+ "--stream-fd=FD [--payload-fd=FD] [--draft-gpu=N]\n",
argv[0],
</file context>
+ // Check EOS
+ // TODO: proper EOS detection from tokenizer metadata
+ if (next_token == 151643 || next_token == 151644) { // common DS EOS/EOT
P2: Hard-coded EOS token IDs (151643, 151644) make termination behavior fragile across tokenizer/model variants. Replace with tokenizer metadata lookup.
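A sketch of the metadata-driven alternative, assuming the stop-token ids are collected once at load time (GGUF files commonly carry a `tokenizer.ggml.eos_token_id` key, but which keys this model provides is not shown here). `EosSetSketch` is a hypothetical class.

```cpp
#include <cstdint>
#include <unordered_set>
#include <vector>

// Holds the stop-token ids read from tokenizer metadata at load time.
class EosSetSketch {
public:
    explicit EosSetSketch(const std::vector<int32_t> & ids) : ids_(ids.begin(), ids.end()) {}

    // True when the sampled token should end generation.
    bool is_eos(int32_t token) const { return ids_.count(token) != 0; }

    // Lets the caller detect a model file that provided no stop tokens at all.
    bool empty() const { return ids_.empty(); }

private:
    std::unordered_set<int32_t> ids_;
};
```

The decode loop would then test `eos.is_eos(next_token)` instead of comparing against literals, and could warn at load time when `empty()` is true.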
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/src/deepseek4/deepseek4_backend.cpp, line 824:
<comment>Hard-coded EOS token IDs (151643, 151644) make termination behavior fragile across tokenizer/model variants. Replace with tokenizer metadata lookup.</comment>
<file context>
@@ -0,0 +1,939 @@
+
+ // Check EOS
+ // TODO: proper EOS detection from tokenizer metadata
+ if (next_token == 151643 || next_token == 151644) { // common DS EOS/EOT
+ break;
+ }
</file context>
I notice a 20% regression when running the 27B dense model.