feat(dflash): add DeepSeek V4 Flash backend by howard0su · Pull Request #353 · Luce-Org/lucebox-hub

howard0su · 2026-06-09T09:50:45Z

Implement full DS4 Flash model backend for AR-only decode:

deepseek4_internal.h: data structures (layer, weights, cache, config)
deepseek4_loader.cpp: GGUF loader with all DS4 metadata/tensor binding
deepseek4_graph.cpp: ggml compute graph (MLA attention, KV compression with ratio-4/ratio-128, indexer selective attention, MoE with sqrt(softplus) routing, hash routing, HC residual streams)
deepseek4_backend.cpp: ModelBackend subclass with hybrid hot/cold expert placement (DFLASH_DS4_HYBRID=1)
deepseek4_daemon.cpp: daemon entry point
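For readers unfamiliar with the routing named in the file list, here is a minimal sketch of sqrt(softplus) top-k gating. Only the score function sqrt(softplus(logit)) comes from the description above. The names `softplus`, `Routed` and `route_sqrt_softplus`, and the normalization of the selected weights to sum to 1, are assumptions made for illustration and may not match the code in deepseek4_graph.cpp.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <numeric>
#include <vector>

// Numerically stable softplus: log(1 + exp(x)).
inline float softplus(float x) {
    return std::max(x, 0.0f) + std::log1p(std::exp(-std::fabs(x)));
}

struct Routed {
    std::vector<int>   expert;  // selected expert indices, best first
    std::vector<float> weight;  // gate weights, normalized to sum to 1 (assumption)
};

// Score every expert with sqrt(softplus(logit)) and keep the top_k highest.
inline Routed route_sqrt_softplus(const std::vector<float> & logits, int top_k) {
    assert(top_k > 0 && top_k <= (int) logits.size());
    std::vector<float> score(logits.size());
    for (size_t i = 0; i < logits.size(); ++i) {
        score[i] = std::sqrt(softplus(logits[i]));
    }

    // Sort only as much of the index list as needed to find the top_k.
    std::vector<int> order(logits.size());
    std::iota(order.begin(), order.end(), 0);
    std::partial_sort(order.begin(), order.begin() + top_k, order.end(),
                      [&](int a, int b) { return score[a] > score[b]; });

    Routed out;
    float sum = 0.0f;
    for (int k = 0; k < top_k; ++k) sum += score[order[k]];
    for (int k = 0; k < top_k; ++k) {
        out.expert.push_back(order[k]);
        out.weight.push_back(score[order[k]] / sum);
    }
    return out;
}
```

Because sqrt(softplus(x)) is monotonic in x, the selected experts are the same ones a plain top-k over the logits would pick; only the gate weights differ.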

Integration:

Register 'deepseek4' arch in backend_factory.cpp
Add to CMakeLists.txt (include path + sources)

Tests:

test_deepseek4_unit.cpp: CPU-only unit tests with synthetic weights (compressor pooling, MoE routing, RMSNorm, grouped output shape, hash routing lookup)
deepseek4-vectors/: official API test vectors ported from ds4 project (greedy decode logprob fixtures for integration testing)


The DS4 Flash GGUF stores rope.scaling.original_context_length as u32 and compress_ratios as i32 array. Handle both type widths gracefully.
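A minimal sketch of the idea, assuming a simplified tagged value in place of the real GGUF key/value API: whichever 32-bit encoding the file used, the loader widens it to one common type before use. `MetaType`, `MetaScalar`, `meta_to_i64` and `meta_array_to_i64` are hypothetical names, not the loader's actual helpers.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for a GGUF scalar: a type tag plus the raw 32-bit payload.
enum class MetaType { U32, I32 };

struct MetaScalar {
    MetaType type;
    uint32_t bits;
};

// Widen either encoding to int64_t so callers never depend on how it was stored.
inline int64_t meta_to_i64(const MetaScalar & v) {
    if (v.type == MetaType::U32) return static_cast<int64_t>(v.bits);
    int32_t s;
    std::memcpy(&s, &v.bits, sizeof(s));  // reinterpret the payload as signed
    return static_cast<int64_t>(s);
}

// Same idea for arrays such as compress_ratios.
inline std::vector<int64_t> meta_array_to_i64(MetaType type,
                                              const std::vector<uint32_t> & raw) {
    std::vector<int64_t> out;
    out.reserve(raw.size());
    for (uint32_t bits : raw) out.push_back(meta_to_i64(MetaScalar{type, bits}));
    return out;
}
```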

The previous approach set dst->data directly but didn't associate the tensor with its backend buffer, causing 'tensor buffer not set' assert. Now uses ggml_backend_tensor_alloc (matching qwen35 loader pattern). Also keeps token_embd on CPU for embedding lookup.

TargetLoadPlan.layer_end defaults to -1 (not 0), so check for < 0.
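The check could look roughly like the sketch below. `LoadPlan` is a hypothetical mirror of the relevant TargetLoadPlan fields, and treating `layer_end` as an exclusive bound is an assumption; only the -1 default and the `< 0` test come from the commit message.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical mirror of the relevant TargetLoadPlan fields.
struct LoadPlan {
    int layer_begin = 0;
    int layer_end   = -1;  // -1 means "not set", per the commit message
};

// Treat any negative layer_end as "load through the last layer".
// Assumption: layer_end is an exclusive bound.
inline int resolve_layer_end(const LoadPlan & plan, int n_layer) {
    assert(n_layer > 0);
    return plan.layer_end < 0 ? n_layer : std::min(plan.layer_end, n_layer);
}
```

Testing `== 0` instead would leave the default plan with a negative end, so a loop bounded by it would visit no layers.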

When full model load fails (e.g., 81GB model on 24GB GPU), automatically fall back to hybrid mode (experts on CPU, core on GPU).

…er shapes

- Output projection now correctly uses batched 3D matmul for grouped low-rank: reshape out_a [4096,8192] to [4096,1024,8], reshape q to [4096,8,n_tok], batched matmul → [1024,8,n_tok] → out_b [8192,4096]
- Attention placeholder: use reshaped q (correct shape [32768,n_tok]) instead of broken kv×q matmul
- Disable compressed context block (shapes incompatible with placeholder)

HC build_hc_pre returns [n_embd] (1D) but the graph expects [n_embd, n_tokens]. Bypass HC entirely until proper multi-token HC state management is implemented.

The 3D matmul batch dimension (ne[2]) must match between weight and input. Use permute to put n_out_group in ne[2] for both tensors so ggml can broadcast correctly across the group dimension.

Ratio-4 layers use comp_width = 2*head_dim (1024) with 2*ratio state rows. Ratio-128 layers use comp_width = head_dim (512). Indexer uses n_indexer_head_dim (128) as output, not full multi-head width. Pooling placeholder just takes first head_dim elements for now.

sum_rows operates on ne[0] (heads) producing [1, n_comp]. Don't transpose first or elements won't match reshape.

Without ggml_set_input, the graph allocator doesn't allocate buffers for the position tensors, causing 'tensor buffer not set' when we try to set their values before compute.

The I32 position tensors for RoPE in side-effect subgraphs (cpy to external cache buffers) don't get their buffers allocated by gallocr. Skip RoPE for now - output is placeholder anyway. Will fix properly when implementing full compressor pooling logic.

Keep only meaningful error/info prints in the backend.

…ooling

- Implement proper tail RoPE: split last n_rot=64 dims, apply rope, concat back. Per-layer freq_base (compressed vs non-compressed layers) with YaRN scaling for compressed layers.
- Replace attention placeholder with full SWA dot-product attention: Q@KV^T scaled softmax over ring buffer, weighted sum, inverse tail RoPE on output.
- Implement per-dim softmax-weighted pooling for compressor state, replacing the first-row placeholder.
- Add I32 array bindings for multi-element position tensors.

Implement the full HC mechanism on CPU for the hybrid path:

- HC pre: RMSNorm → matmul with fn tensor → Sinkhorn normalization (20 iters on 4×4 combine matrix) → weighted sum of 4 residual streams
- HC post: update all 4 streams using post gates + combine matrix
- Output HC pre: sigmoid-weighted stream merge before final norm/logits
- Lazy-load HC weight tensors from GPU to CPU on first use (~65MB total)
- Restructure hybrid loop: separate attention and FFN into independent graphs with HC pre/post between them (eliminates incorrect residual additions)
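The Sinkhorn step mentioned in the HC pre item can be sketched as follows. This is a generic Sinkhorn normalization, assuming a strictly positive 4×4 input and alternating row and column normalization; how the real code builds the positive matrix and in which order it normalizes is not stated in the commit message. `Mat4`, `sinkhorn`, `row_sum` and `col_sum` are hypothetical names.

```cpp
#include <array>
#include <cassert>
#include <cmath>

using Mat4 = std::array<std::array<float, 4>, 4>;

inline float row_sum(const Mat4 & m, int r) {
    float s = 0.0f;
    for (int c = 0; c < 4; ++c) s += m[r][c];
    return s;
}

inline float col_sum(const Mat4 & m, int c) {
    float s = 0.0f;
    for (int r = 0; r < 4; ++r) s += m[r][c];
    return s;
}

// Alternate row and column normalization; for a strictly positive input
// the result approaches a doubly stochastic matrix.
inline Mat4 sinkhorn(Mat4 m, int iters = 20) {
    for (int it = 0; it < iters; ++it) {
        for (int r = 0; r < 4; ++r) {
            const float s = row_sum(m, r);
            for (int c = 0; c < 4; ++c) m[r][c] /= s;
        }
        for (int c = 0; c < 4; ++c) {
            const float s = col_sum(m, c);
            for (int r = 0; r < 4; ++r) m[r][c] /= s;
        }
    }
    return m;
}
```

After the final column pass the column sums equal 1 up to rounding, and the row sums are close to 1 once the iteration has converged.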

Previously only the last token's KV was written to the ring buffer during prefill, causing decode to attend to a nearly empty cache. Now all tokens' KV entries are written to their correct ring buffer positions.
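A simplified sketch of the fixed behavior, assuming the slot for absolute position p is `p % window` (the commit message does not state the exact mapping) and using one float per row in place of a real KV row. `SwaRing`, `write_prefill` and `filled_slots` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical single-layer SWA cache; one float stands in for a KV row.
struct SwaRing {
    int window;
    std::vector<float> rows;  // rows[slot] holds the entry last written there
    std::vector<int>   pos;   // absolute position stored in each slot, -1 if empty
    explicit SwaRing(int w) : window(w), rows(w, 0.0f), pos(w, -1) {}
};

// Write every prefill token, not only the last one.
// Assumption: the token at absolute position p lands in slot p % window.
inline void write_prefill(SwaRing & ring, int start_pos, const std::vector<float> & kv) {
    for (std::size_t i = 0; i < kv.size(); ++i) {
        const int p    = start_pos + (int) i;
        const int slot = p % ring.window;
        ring.rows[slot] = kv[i];
        ring.pos[slot]  = p;
    }
}

// Number of slots that hold a real entry.
inline int filled_slots(const SwaRing & ring) {
    int n = 0;
    for (int p : ring.pos) n += (p >= 0) ? 1 : 0;
    return n;
}
```

With a window of 4 and a 6-token prompt, the ring ends up holding positions 2 through 5, so the first decode step attends to a full window instead of a single row.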

DS4's rope_tail_ext_inplace rotates consecutive pairs (i, i+1), which is GGML_ROPE_TYPE_DEFAULT. NEOX mode (interleaved halves) was incorrect and caused completely wrong position encodings.
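The difference between the two pairings is easy to see in a scalar sketch. This is not the ggml kernel: it applies plain RoPE with no YaRN scaling, and `rope_consecutive_pairs` and `rope_split_halves` are hypothetical names. The first pairs element i with i+1, the second pairs element i with i + n/2 as NEOX-style RoPE does.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Rotate consecutive pairs (x[0],x[1]), (x[2],x[3]), ...
inline void rope_consecutive_pairs(std::vector<float> & x, int pos, float freq_base) {
    const int n = (int) x.size();
    assert(n % 2 == 0);
    for (int i = 0; i < n; i += 2) {
        const float theta = pos * std::pow(freq_base, -(float) i / n);
        const float c = std::cos(theta), s = std::sin(theta);
        const float a = x[i], b = x[i + 1];
        x[i]     = a * c - b * s;
        x[i + 1] = a * s + b * c;
    }
}

// Rotate pairs split across the two halves: (x[i], x[i + n/2]).
inline void rope_split_halves(std::vector<float> & x, int pos, float freq_base) {
    const int n = (int) x.size();
    assert(n % 2 == 0);
    const int half = n / 2;
    for (int i = 0; i < half; ++i) {
        const float theta = pos * std::pow(freq_base, -(float) (2 * i) / n);
        const float c = std::cos(theta), s = std::sin(theta);
        const float a = x[i], b = x[i + half];
        x[i]        = a * c - b * s;
        x[i + half] = a * s + b * c;
    }
}
```

Both functions use the same angles, but they mix different elements, so applying one layout to weights trained with the other scrambles the positional signal.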

- Add eval_begin/eval_end async IPC pattern for pipelined expert offload
- Add graph timing reporting (hot/cold cache hit/miss stats)
- Extend MoeHybridFFNEval with async dispatch and fixed-slot graph caching
- Wire async expert eval into DeepSeek4 graph forward pass
- Add DeepSeek4Expert IPC mode to backend_ipc_main dispatch

…otocol

- Rename deepseek4_expert_ipc.{h,cpp} → expert_ipc.{h,cpp} (model-agnostic client and wire protocol)
- Move deepseek4_expert_ipc_daemon.cpp → src/deepseek4/ (DS4-specific worker implementation belongs with its model code, not under common/)
- Rename types: DeepSeek4ExpertIpc* → ExpertIpc*, DS4_EXPERT_IPC_FLAG_* → EXPERT_IPC_FLAG_*
- Rename BackendIpcMode::DeepSeek4Expert → MoeExpert; accept both 'moe-expert' and 'deepseek4-expert' on the CLI for compat

Rewrites the DeepSeek V4 Flash integration doc with clearer structure:

- Architecture summary (MLA, HC, MoE, KV compression)
- Forward pass walkthrough (hybrid path)
- Hot/cold expert partitioning logic and placement strategies
- IPC protocol and async overlap
- Full environment variable reference

Moves from docs/specs/deepseek4-experts.md to server/docs/DS4.md to colocate with the server source it documents.

- Add DFLASH_DS4_ADAPTIVE_HOT=1 with configurable target ratio (DFLASH_DS4_ADAPTIVE_HOT_TARGET_RATIO, default 0.5). Places fewer hot experts to balance HIP worker compute with CPU cold overlap. With 256 experts and ratio=0.5, ~128 hot/layer means ~3 hot + 3 cold per token, which is optimal for async overlap hiding.
- Add DFLASH_MOE_FIXED_SLOT_GRAPHS=adaptive mode: pads graph slots to max(actual_count, 3) instead of always n_expert_used=6. Reduces wasted expert compute while still caching most graph shapes.
- Add DFLASH_MOE_FIXED_SLOT_MAX=N to cap fixed-slot padding.
- Refactor budget computation into reusable Ds4HybridBudgetInfo and fill_prefix_hot_placement() helpers.
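The arithmetic behind the "~3 hot + 3 cold" figure and the adaptive padding rule can be checked with a small sketch. It assumes routing is spread uniformly over experts, which real traffic need not be, and it does not model the DFLASH_MOE_FIXED_SLOT_MAX cap because its exact interaction is not described here. The function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Hot experts per layer for a given target ratio (rounded to nearest).
inline int hot_experts_per_layer(int n_expert, double ratio) {
    assert(n_expert > 0 && ratio >= 0.0 && ratio <= 1.0);
    return (int) std::lround(n_expert * ratio);
}

// Expected number of routed experts per token that are hot.
// Assumption: each selected expert is equally likely to be any expert.
inline double expected_hot_per_token(int n_expert, int n_hot, int n_expert_used) {
    return n_expert_used * (double) n_hot / n_expert;
}

// Adaptive fixed-slot padding: pad up to a floor instead of always n_expert_used.
inline int adaptive_slots(int actual_count, int min_slots = 3) {
    return std::max(actual_count, min_slots);
}
```

With 256 experts, ratio 0.5 and 6 experts used per token, this gives 128 hot experts per layer and an expected 3 hot experts per token, leaving 3 for the cold path.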

…4 spec

Three call sites were missing the swiglu_clamp parameter after it was added to the function signature, causing build failures.

- Revert .gitignore copilot-instructions entry (personal dev artifact)
- Revert bench_server.py refactor (unrelated to DS4)
- Remove ungated expert cache miss fprintf in IPC daemon

CPU-only unit tests (pooling, routing, RMSNorm, output proj shape, hash routing) now build and run alongside test_server_unit in CI.

Stream weights directly into the unified (managed) buffer with a parallel pread + posix_fadvise(DONTNEED) at disk bandwidth, instead of mmap page-faults. With the bumped ggml submodule (unified-memory allocation), an ~86GB model loads in ~81s on Strix Halo gfx1151 (was un-loadable). Falls back to the mmap copy path on non-managed buffers or when DFLASH_NO_PREAD=1. Bumps server/deps/llama.cpp to 9cd9e1ed for ggml_backend_cuda_buffer_is_managed and the unified-memory allocator.

howard0su · 2026-06-22T00:46:11Z

@davidmroth please help me check whether the llama.cpp submodule update is needed.

cubic-dev-ai

16 issues found across 39 files

Prompt for AI agents (unresolved issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="server/src/deepseek4/deepseek4_hc_cuda.cu">

<violation number="1" location="server/src/deepseek4/deepseek4_hc_cuda.cu:13">
P2: Mix output dimension is hardcoded to 24 instead of being derived from model metadata; inconsistent with dynamically-sized CPU path</violation>
</file>

<file name="server/src/deepseek4/deepseek4_loader.cpp">

<violation number="1" location="server/src/deepseek4/deepseek4_loader.cpp:376">
P2: Failure paths after `out.ctx = meta_ctx` leave `DeepSeek4Weights` in a partially initialized state without resetting `out.ctx`; the `meta_ctx` context leaks if the caller does not invoke `free_deepseek4_weights()`. Either move `out.ctx = meta_ctx` to after all error paths or reset it (and free `meta_ctx`) in each failure branch.</violation>

<violation number="2" location="server/src/deepseek4/deepseek4_loader.cpp:438">
P1: Missing mmap bounds validation before tensor copy from GGUF file data (main load path).</violation>

<violation number="3" location="server/src/deepseek4/deepseek4_loader.cpp:532">
P2: Missing mmap bounds validation in embedder byte copy.</violation>

<violation number="4" location="server/src/deepseek4/deepseek4_loader.cpp:535">
P1: Embedding-table reload failure is silently ignored, allowing model load to succeed with zeroed embeddings that corrupt all token outputs.</violation>

<violation number="5" location="server/src/deepseek4/deepseek4_loader.cpp:539">
P2: Potential divide-by-zero: `n_vocab` from GGUF metadata is not validated to be non-zero before dividing `a.file_size` by it. A malformed file with `vocab_size=0` will crash the loader.</violation>

<violation number="6" location="server/src/deepseek4/deepseek4_loader.cpp:633">
P1: Integer overflow in tensor bounds check allows malformed GGUF metadata to bypass validation</violation>
</file>

<file name="server/CMakeLists.txt">

<violation number="1" location="server/CMakeLists.txt:570">
P2: Duplicate GPU runtime link configuration for dflash_common. An identical conditional block for linking CUDA::cudart / hip::host already exists earlier in this file (around line 369).</violation>
</file>

<file name="server/src/server/chat_template.cpp">

<violation number="1" location="server/src/server/chat_template.cpp:370">
P1: `add_generation_prompt` is gated by `pending_assistant`, so system-only or empty chats fail to emit an assistant generation prefix. All other formats in this file (QWEN3, LAGUNA, GEMMA4) check `add_generation_prompt` unconditionally.</violation>
</file>

<file name="server/src/common/moe_hybrid_ffn_eval.h">

<violation number="1" location="server/src/common/moe_hybrid_ffn_eval.h:202">
P1: New default parameter inserted in middle of `build_cached_hot_graph` signature silently remaps existing positional arguments at call site</violation>
</file>

<file name="server/src/ipc/backend_ipc_main.cpp">

<violation number="1" location="server/src/ipc/backend_ipc_main.cpp:128">
P2: fprintf format/argument mismatch: 6 `%s` placeholders but only 5 `argv[0]` arguments supplied, causing undefined behavior.</violation>

<violation number="2" location="server/src/ipc/backend_ipc_main.cpp:284">
P2: New DS4 expert CLI options use ad-hoc argument parsing without required-value enforcement or numeric validation, allowing silent misconfiguration.</violation>
</file>

<file name="server/src/deepseek4/deepseek4_backend.cpp">

<violation number="1" location="server/src/deepseek4/deepseek4_backend.cpp:690">
P1: park()/unpark() are no-op stubs that do not release or reacquire GPU resources, breaking the backend lifecycle contract</violation>

<violation number="2" location="server/src/deepseek4/deepseek4_backend.cpp:824">
P2: Hard-coded EOS token IDs (151643, 151644) make termination behavior fragile across tokenizer/model variants. Replace with tokenizer metadata lookup.</violation>
</file>

<file name="server/src/deepseek4/deepseek4_graph.cpp">

<violation number="1" location="server/src/deepseek4/deepseek4_graph.cpp:550">
P1: Indexer `build_indexer_score` result is discarded and attention still attends all compressed KV rows instead of selected top-k rows</violation>

<violation number="2" location="server/src/deepseek4/deepseek4_graph.cpp:1767">
P1: Non-hybrid fallback path silently produces incorrect results: HC mixing is bypassed with TODO stubs and hash-routed expert layers are zeroed out instead of using token-based hash routing.</violation>
</file>


cubic-dev-ai · 2026-06-22T00:50:57Z

+            if (tid < 0) return {};
+            const size_t off = data_start + gguf_get_tensor_offset(gctx, tid);
+            const size_t sz = gguf_get_tensor_size(gctx, tid);
+            if (off + sz > mmap.len) return {};


P1: Integer overflow in tensor bounds check allows malformed GGUF metadata to bypass validation

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At server/src/deepseek4/deepseek4_loader.cpp, line 633: <comment>Integer overflow in tensor bounds check allows malformed GGUF metadata to bypass validation</comment> <file context> @@ -0,0 +1,677 @@ + if (tid < 0) return {}; + const size_t off = data_start + gguf_get_tensor_offset(gctx, tid); + const size_t sz = gguf_get_tensor_size(gctx, tid); + if (off + sz > mmap.len) return {}; + return { file_bytes + off, sz }; + }; </file context>

Suggested change

-            if (off + sz > mmap.len) return {};
+            if (off > mmap.len || sz > mmap.len - off) return {};
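To illustrate why the suggested form is safer, here is a standalone sketch comparing the two checks. `range_fits_naive` and `range_fits` are hypothetical helper names, not functions from the loader. Unsigned addition wraps around, so a huge `off` plus a small `sz` can produce a small sum that passes the naive test.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Naive check: off + sz can wrap around and appear to fit.
inline bool range_fits_naive(std::size_t off, std::size_t sz, std::size_t len) {
    return off + sz <= len;
}

// Overflow-safe check: compare against the remaining space instead of summing.
inline bool range_fits(std::size_t off, std::size_t sz, std::size_t len) {
    return off <= len && sz <= len - off;
}
```

For example, with `off = SIZE_MAX - 4`, `sz = 16` and `len = 100`, the naive check accepts the range because the sum wraps to 11, while the safe check rejects it.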

cubic-dev-ai · 2026-06-22T00:50:57Z

+    } else {
+        for (auto & a : allocs) {
+            if (!a.upload_to_backend) continue;
+            const void * src_data = (const char *)mmap.addr + a.file_offset;


P1: Missing mmap bounds validation before tensor copy from GGUF file data (main load path).

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At server/src/deepseek4/deepseek4_loader.cpp, line 438: <comment>Missing mmap bounds validation before tensor copy from GGUF file data (main load path).</comment> <file context> @@ -0,0 +1,677 @@ + } else { + for (auto & a : allocs) { + if (!a.upload_to_backend) continue; + const void * src_data = (const char *)mmap.addr + a.file_offset; + ggml_backend_tensor_set(a.tensor, src_data, 0, a.file_size); + } </file context>

cubic-dev-ai · 2026-06-22T00:50:57Z

+                                (const char *)emb_mmap.addr + a.file_offset, a.file_size);
+                    emb_mmap.close_map();
+                }
+                out.embedder.tok_embd_bytes = out.embedder.tok_embd_owned.data();


P1: Embedding-table reload failure is silently ignored, allowing model load to succeed with zeroed embeddings that corrupt all token outputs.

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At server/src/deepseek4/deepseek4_loader.cpp, line 535: <comment>Embedding-table reload failure is silently ignored, allowing model load to succeed with zeroed embeddings that corrupt all token outputs.</comment> <file context> @@ -0,0 +1,677 @@ + (const char *)emb_mmap.addr + a.file_offset, a.file_size); + emb_mmap.close_map(); + } + out.embedder.tok_embd_bytes = out.embedder.tok_embd_owned.data(); + out.embedder.tok_embd_type = a.tensor->type; + out.embedder.n_embd = n_embd; </file context>

cubic-dev-ai · 2026-06-22T00:50:57Z

+            }
+        }
+
+        if (add_generation_prompt && pending_assistant) {


P1: add_generation_prompt is gated by pending_assistant, so system-only or empty chats fail to emit an assistant generation prefix. All other formats in this file (QWEN3, LAGUNA, GEMMA4) check add_generation_prompt unconditionally.

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At server/src/server/chat_template.cpp, line 370: <comment>`add_generation_prompt` is gated by `pending_assistant`, so system-only or empty chats fail to emit an assistant generation prefix. All other formats in this file (QWEN3, LAGUNA, GEMMA4) check `add_generation_prompt` unconditionally.</comment> <file context> @@ -313,6 +314,65 @@ std::string render_chat_template( + } + } + + if (add_generation_prompt && pending_assistant) { + result += "<｜Assistant｜>"; + result += enable_thinking ? "<think>" : "</think>"; </file context>

Suggested change

-        if (add_generation_prompt && pending_assistant) {
+        if (add_generation_prompt) {

cubic-dev-ai · 2026-06-22T00:50:57Z

    int n_embd,
    int n_ff_exp,
    int n_hot,
+    float swiglu_clamp = 0.0f,


P1: New default parameter inserted in middle of build_cached_hot_graph signature silently remaps existing positional arguments at call site

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At server/src/common/moe_hybrid_ffn_eval.h, line 202: <comment>New default parameter inserted in middle of `build_cached_hot_graph` signature silently remaps existing positional arguments at call site</comment> <file context> @@ -186,6 +199,7 @@ bool build_cached_hot_graph( int n_embd, int n_ff_exp, int n_hot, + float swiglu_clamp = 0.0f, bool gpu_remap = false, int n_expert = 0); </file context>
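The failure mode this comment describes can be reproduced in isolation. The sketch below uses a reduced, hypothetical signature (`build_old` / `build_new` returning a small struct) rather than the real `build_cached_hot_graph`; whether any existing call site actually passes the trailing arguments positionally is not shown in this thread.

```cpp
#include <cassert>

struct HotGraphArgs {
    int   n_hot;
    float swiglu_clamp;
    bool  gpu_remap;
    int   n_expert;
};

// Before: trailing defaults are gpu_remap and n_expert.
inline HotGraphArgs build_old(int n_hot, bool gpu_remap = false, int n_expert = 0) {
    return {n_hot, 0.0f, gpu_remap, n_expert};
}

// After: a defaulted float is inserted ahead of the existing defaults.
inline HotGraphArgs build_new(int n_hot, float swiglu_clamp = 0.0f,
                              bool gpu_remap = false, int n_expert = 0) {
    return {n_hot, swiglu_clamp, gpu_remap, n_expert};
}
```

A call written as `build_old(8, true, 256)` still compiles as `build_new(8, true, 256)`, but `true` now converts to `swiglu_clamp = 1.0f`, `256` converts to `gpu_remap = true`, and `n_expert` falls back to 0. Appending the new parameter at the end, or making it non-defaulted so every caller must be updated, avoids the silent remap.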

cubic-dev-ai · 2026-06-22T00:50:57Z

+                std::string emb_err;
+                if (emb_mmap.open_ro(path, emb_err)) {
+                    std::memcpy(out.embedder.tok_embd_owned.data(),
+                                (const char *)emb_mmap.addr + a.file_offset, a.file_size);


P2: Missing mmap bounds validation in embedder byte copy.

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At server/src/deepseek4/deepseek4_loader.cpp, line 532: <comment>Missing mmap bounds validation in embedder byte copy.</comment> <file context> @@ -0,0 +1,677 @@ + std::string emb_err; + if (emb_mmap.open_ro(path, emb_err)) { + std::memcpy(out.embedder.tok_embd_owned.data(), + (const char *)emb_mmap.addr + a.file_offset, a.file_size); + emb_mmap.close_map(); + } </file context>

cubic-dev-ai · 2026-06-22T00:50:57Z

 endif()
-if(DFLASH27B_GPU_BACKEND STREQUAL "hip")
-    target_link_libraries(dflash_common PRIVATE hip::host)
+if(DFLASH27B_GPU_BACKEND STREQUAL "cuda")


P2: Duplicate GPU runtime link configuration for dflash_common. An identical conditional block for linking CUDA::cudart / hip::host already exists earlier in this file (around line 369).

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At server/CMakeLists.txt, line 570: <comment>Duplicate GPU runtime link configuration for dflash_common. An identical conditional block for linking CUDA::cudart / hip::host already exists earlier in this file (around line 369).</comment> <file context> @@ -557,8 +567,10 @@ find_package(OpenMP) endif() -if(DFLASH27B_GPU_BACKEND STREQUAL "hip") - target_link_libraries(dflash_common PRIVATE hip::host) +if(DFLASH27B_GPU_BACKEND STREQUAL "cuda") + target_link_libraries(dflash_common PUBLIC CUDA::cudart) +elseif(DFLASH27B_GPU_BACKEND STREQUAL "hip") </file context>

cubic-dev-ai · 2026-06-22T00:50:58Z

            const char * value = nullptr;
            if (!require_value(i, argc, argv, "--kvflash-pool", value)) return 2;
            if (!parse_nonnegative_int(value, kvflash_pool_tokens)) return 2;
+        } else if (std::strncmp(argv[i], "--expert-budget-mb=", 19) == 0) {


P2: New DS4 expert CLI options use ad-hoc argument parsing without required-value enforcement or numeric validation, allowing silent misconfiguration.

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At server/src/ipc/backend_ipc_main.cpp, line 284: <comment>New DS4 expert CLI options use ad-hoc argument parsing without required-value enforcement or numeric validation, allowing silent misconfiguration.</comment> <file context> @@ -274,6 +281,20 @@ int main(int argc, char ** argv) { const char * value = nullptr; if (!require_value(i, argc, argv, "--kvflash-pool", value)) return 2; if (!parse_nonnegative_int(value, kvflash_pool_tokens)) return 2; + } else if (std::strncmp(argv[i], "--expert-budget-mb=", 19) == 0) { + ds4_expert_budget_mb = argv[i] + 19; + } else if (std::strcmp(argv[i], "--expert-budget-mb") == 0) { </file context>
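A sketch of the kind of numeric validation the comment asks for, written with `strtol`. The diff context shows that `parse_nonnegative_int` exists in this file, but its body is not shown, so `parse_nonnegative_int_sketch` below is a hypothetical reimplementation and may differ from the real helper.

```cpp
#include <cassert>
#include <cerrno>
#include <climits>
#include <cstdlib>

// Parse a base-10 non-negative int; reject empty input, trailing junk,
// negative values, and values that do not fit in int.
inline bool parse_nonnegative_int_sketch(const char * s, int & out) {
    if (s == nullptr || *s == '\0') return false;
    errno = 0;
    char * end = nullptr;
    const long v = std::strtol(s, &end, 10);
    if (errno != 0 || end == s || *end != '\0') return false;
    if (v < 0 || v > INT_MAX) return false;
    out = (int) v;
    return true;
}
```

Routing `--expert-budget-mb` through a check like this would turn inputs such as `--expert-budget-mb=12mb` or an empty value into an early exit instead of a silently misconfigured run.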

cubic-dev-ai · 2026-06-22T00:50:58Z

            "--layer-ends=N[,N...] --max-ctx=N "
-            "[--hidden=N --vocab=N --max-tokens=N]\n",
+            "[--hidden=N --vocab=N --max-tokens=N]\n"
+            "   or: %s --backend-ipc-mode=moe-expert <model.gguf> "


P2: fprintf format/argument mismatch: 6 %s placeholders but only 5 argv[0] arguments supplied, causing undefined behavior.

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At server/src/ipc/backend_ipc_main.cpp, line 128: <comment>fprintf format/argument mismatch: 6 `%s` placeholders but only 5 `argv[0]` arguments supplied, causing undefined behavior.</comment> <file context> @@ -123,7 +124,9 @@ int main(int argc, char ** argv) { "--layer-ends=N[,N...] --max-ctx=N " - "[--hidden=N --vocab=N --max-tokens=N]\n", + "[--hidden=N --vocab=N --max-tokens=N]\n" + " or: %s --backend-ipc-mode=moe-expert <model.gguf> " + "--stream-fd=FD [--payload-fd=FD] [--draft-gpu=N]\n", argv[0], </file context>

cubic-dev-ai · 2026-06-22T00:50:58Z

+
+        // Check EOS
+        // TODO: proper EOS detection from tokenizer metadata
+        if (next_token == 151643 || next_token == 151644) {  // common DS EOS/EOT


P2: Hard-coded EOS token IDs (151643, 151644) make termination behavior fragile across tokenizer/model variants. Replace with tokenizer metadata lookup.

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At server/src/deepseek4/deepseek4_backend.cpp, line 824: <comment>Hard-coded EOS token IDs (151643, 151644) make termination behavior fragile across tokenizer/model variants. Replace with tokenizer metadata lookup.</comment> <file context> @@ -0,0 +1,939 @@ + + // Check EOS + // TODO: proper EOS detection from tokenizer metadata + if (next_token == 151643 || next_token == 151644) { // common DS EOS/EOT + break; + } </file context>

howard0su · 2026-06-22T01:34:30Z

Noticed a 20% regression when running the 27B dense model.

howard0su force-pushed the ds4 branch from 74fb582 to 4b0d95d Compare June 9, 2026 22:48

howard0su force-pushed the ds4 branch from 4b0d95d to d85de68 Compare June 17, 2026 23:23

davide221 mentioned this pull request Jun 18, 2026

deepseek4: fast unified-memory weight load (on top of #353) howard0su/lucebox-hub#4

Merged

howard0su added 27 commits June 22, 2026 08:00

fix(deepseek4): handle u32/i32 metadata types in GGUF loader

1c72d12

The DS4 Flash GGUF stores rope.scaling.original_context_length as u32 and compress_ratios as i32 array. Handle both type widths gracefully.

fix(deepseek4): load all layers (fix layer_end default check)

fde6634

TargetLoadPlan.layer_end defaults to -1 (not 0), so check for < 0.

fix(deepseek4): auto-fallback to hybrid mode on GPU OOM

b423c9f

When full model load fails (e.g., 81GB model on 24GB GPU), automatically fall back to hybrid mode (experts on CPU, core on GPU).

fix(deepseek4): disable HC pre-mix to fix reshape assertion

1447768

HC build_hc_pre returns [n_embd] (1D) but the graph expects [n_embd, n_tokens]. Bypass HC entirely until proper multi-token HC state management is implemented.

fix(deepseek4): correct batched grouped output projection

664ba36

The 3D matmul batch dimension (ne[2]) must match between weight and input. Use permute to put n_out_group in ne[2] for both tensors so ggml can broadcast correctly across the group dimension.

debug: add layer progress prints for remote debugging

fc1d891

fix(deepseek4): cast APE from F16 to F32 before add

052d4a6

debug: more specific crash location prints

57a5b55

debug: trace MLA vs compressor crash

f92ca38

debug: trace inside MLA attention

a72bf04

fix(deepseek4): indexer score sum_rows axis fix

668e1df

sum_rows operates on ne[0] (heads) producing [1, n_comp]. Don't transpose first or elements won't match reshape.

fix(deepseek4): mark I32 position inputs for gallocr

b795b99

Without ggml_set_input, the graph allocator doesn't allocate buffers for the position tensors, causing 'tensor buffer not set' when we try to set their values before compute.

chore(deepseek4): remove debug layer progress prints

1b9c075

Keep only meaningful error/info prints in the backend.

fix(deepseek4): store all prefill KV rows in SWA ring buffer

11d85d6

Previously only the last token's KV was written to the ring buffer during prefill, causing decode to attend to a nearly empty cache. Now all tokens' KV entries are written to their correct ring buffer positions.

fix(deepseek4): use standard RoPE mode (sequential pairs), not NEOX

c1094b9

DS4's rope_tail_ext_inplace rotates consecutive pairs (i, i+1), which is GGML_ROPE_TYPE_DEFAULT. NEOX mode (interleaved halves) was incorrect and caused completely wrong position encodings.

fix(dflash): stabilize deepseek4 flash backend

139ed3b

perf(dflash): allow HIP cold expert execution

a8dd033

feat(dflash): log deepseek4 expert placement memory

00accc3

feat(dflash): add deepseek4 expert IPC mode

6df7ac3

howard0su and others added 18 commits June 22, 2026 08:01

feat(dflash): run deepseek4 experts in IPC worker

00715e3

feat(dflash): route deepseek4 experts to HIP worker

994ebee

perf(dflash): add deepseek4 timing breakdown

bb66f34

feat(dflash): split deepseek4 experts across hip and cpu

a49fb77

perf(dflash): accelerate deepseek4 HC pre on CUDA

b6f4b1f

perf(dflash): break down deepseek4 expert worker timing

4e3b81b

refactor(dflash): drop deprecated 'deepseek4-expert' IPC mode alias

a16cdc9

Remove instruction file

b510fd4

Add copilot instruciton file to gitignore

9587fe2

docs(dflash): add adaptive-hot and fixed-slot-adaptive env vars to DS…

91001fc

…4 spec

fix(dflash): pass swiglu_clamp to build_batched_routed_graph callers

5a9b5b1

Three call sites were missing the swiglu_clamp parameter after it was added to the function signature, causing build failures.

chore(dflash): remove unintentional changes and ungated debug print

e9d5f88

- Revert .gitignore copilot-instructions entry (personal dev artifact) - Revert bench_server.py refactor (unrelated to DS4) - Remove ungated expert cache miss fprintf in IPC daemon

ci(dflash): add test_deepseek4_unit to CI and check target

4a8a918

CPU-only unit tests (pooling, routing, RMSNorm, output proj shape, hash routing) now build and run alongside test_server_unit in CI.

howard0su force-pushed the ds4 branch from a0b3ae0 to 45a76e4 Compare June 22, 2026 00:14

howard0su marked this pull request as ready for review June 22, 2026 00:45

cubic-dev-ai Bot reviewed Jun 22, 2026


howard0su marked this pull request as draft June 22, 2026 01:34

