feat(dflash): add DeepSeek V4 Flash backend #353

Draft
howard0su wants to merge 45 commits into Luce-Org:main from howard0su:ds4

@howard0su


Implement full DS4 Flash model backend for AR-only decode:

  • deepseek4_internal.h: data structures (layer, weights, cache, config)
  • deepseek4_loader.cpp: GGUF loader with all DS4 metadata/tensor binding
  • deepseek4_graph.cpp: ggml compute graph (MLA attention, KV compression with ratio-4/ratio-128, indexer selective attention, MoE with sqrt(softplus) routing, hash routing, HC residual streams)
  • deepseek4_backend.cpp: ModelBackend subclass with hybrid hot/cold expert placement (DFLASH_DS4_HYBRID=1)
  • deepseek4_daemon.cpp: daemon entry point

Integration:

  • Register 'deepseek4' arch in backend_factory.cpp
  • Add to CMakeLists.txt (include path + sources)

Tests:

  • test_deepseek4_unit.cpp: CPU-only unit tests with synthetic weights (compressor pooling, MoE routing, RMSNorm, grouped output shape, hash routing lookup)
  • deepseek4-vectors/: official API test vectors ported from ds4 project (greedy decode logprob fixtures for integration testing)

howard0su added 27 commits June 22, 2026 08:00
The DS4 Flash GGUF stores rope.scaling.original_context_length as u32
and compress_ratios as i32 array. Handle both type widths gracefully.

The previous approach set dst->data directly but didn't associate the
tensor with its backend buffer, causing 'tensor buffer not set' assert.
Now uses ggml_backend_tensor_alloc (matching qwen35 loader pattern).
Also keeps token_embd on CPU for embedding lookup.

TargetLoadPlan.layer_end defaults to -1 (not 0), so check for < 0.

When full model load fails (e.g., 81GB model on 24GB GPU), automatically
fall back to hybrid mode (experts on CPU, core on GPU).

…er shapes

- Output projection now correctly uses batched 3D matmul for grouped
  low-rank: reshape out_a [4096,8192] to [4096,1024,8], reshape q to
  [4096,8,n_tok], batched matmul → [1024,8,n_tok] → out_b [8192,4096]
- Attention placeholder: use reshaped q (correct shape [32768,n_tok])
  instead of broken kv×q matmul
- Disable compressed context block (shapes incompatible with placeholder)

HC build_hc_pre returns [n_embd] (1D) but the graph expects [n_embd, n_tokens].
Bypass HC entirely until proper multi-token HC state management is implemented.

The 3D matmul batch dimension (ne[2]) must match between weight and input.
Use permute to put n_out_group in ne[2] for both tensors so ggml can
broadcast correctly across the group dimension.

Ratio-4 layers use comp_width = 2*head_dim (1024) with 2*ratio state rows.
Ratio-128 layers use comp_width = head_dim (512).
Indexer uses n_indexer_head_dim (128) as output, not full multi-head width.
Pooling placeholder just takes first head_dim elements for now.

sum_rows operates on ne[0] (heads) producing [1, n_comp].
Don't transpose first or elements won't match reshape.

Without ggml_set_input, the graph allocator doesn't allocate
buffers for the position tensors, causing 'tensor buffer not set'
when we try to set their values before compute.

The I32 position tensors for RoPE in side-effect subgraphs (cpy to
external cache buffers) don't get their buffers allocated by gallocr.
Skip RoPE for now - output is placeholder anyway. Will fix properly
when implementing full compressor pooling logic.

Keep only meaningful error/info prints in the backend.

…ooling

- Implement proper tail RoPE: split last n_rot=64 dims, apply rope, concat
  back. Per-layer freq_base (compressed vs non-compressed layers) with YaRN
  scaling for compressed layers.
- Replace attention placeholder with full SWA dot-product attention: Q@KV^T
  scaled softmax over ring buffer, weighted sum, inverse tail RoPE on output.
- Implement per-dim softmax-weighted pooling for compressor state, replacing
  the first-row placeholder.
- Add I32 array bindings for multi-element position tensors.

Implement the full HC mechanism on CPU for the hybrid path:
- HC pre: RMSNorm → matmul with fn tensor → Sinkhorn normalization (20 iters
  on 4×4 combine matrix) → weighted sum of 4 residual streams
- HC post: update all 4 streams using post gates + combine matrix
- Output HC pre: sigmoid-weighted stream merge before final norm/logits
- Lazy-load HC weight tensors from GPU to CPU on first use (~65MB total)
- Restructure hybrid loop: separate attention and FFN into independent graphs
  with HC pre/post between them (eliminates incorrect residual additions)
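
The Sinkhorn step named in the HC pre bullet can be sketched as alternating row and column normalization of the 4×4 combine matrix. This is a minimal illustration only: `Mat4` and `sinkhorn4` are hypothetical names, and the assumption that the input is already strictly positive (for example after an exponential) and that rows are normalized before columns is not confirmed by this PR.

```cpp
#include <array>
#include <cassert>
#include <cmath>

using Mat4 = std::array<std::array<float, 4>, 4>;

// Hypothetical sketch: alternately rescale rows and columns of a strictly
// positive 4x4 matrix so that every row sum and column sum approaches 1.
// The default of 20 iterations mirrors the count quoted in the commit message.
inline Mat4 sinkhorn4(Mat4 m, int iters = 20) {
    for (int it = 0; it < iters; ++it) {
        // Row pass: divide each row by its sum.
        for (auto & row : m) {
            float s = 0.0f;
            for (float v : row) s += v;
            for (float & v : row) v /= s;
        }
        // Column pass: divide each column by its sum.
        for (int c = 0; c < 4; ++c) {
            float s = 0.0f;
            for (int r = 0; r < 4; ++r) s += m[r][c];
            for (int r = 0; r < 4; ++r) m[r][c] /= s;
        }
    }
    return m;
}
```

Because the last pass is over columns, column sums are exact up to rounding while row sums are only approximately 1; a real implementation may choose a different order or add an epsilon to the divisors.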
Previously only the last token's KV was written to the ring buffer during
prefill, causing decode to attend to a nearly empty cache. Now all tokens'
KV entries are written to their correct ring buffer positions.

DS4's rope_tail_ext_inplace rotates consecutive pairs (i, i+1), which is
GGML_ROPE_TYPE_DEFAULT. NEOX mode (interleaved halves) was incorrect and
caused completely wrong position encodings.
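
To make the difference between the two RoPE conventions concrete, the sketch below lists which element indices are rotated together in each mode. `RopePairing` and `rope_pairs` are hypothetical names for illustration; the only facts taken from the commit message are that DS4 rotates consecutive pairs and that the NEOX-style pairing works on halves of the rotated span.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical illustration of the two pairing conventions.
enum class RopePairing { Consecutive, SplitHalves };

// Returns the index pairs that are rotated together over n_rot dimensions.
//   Consecutive: (0,1), (2,3), ...                   (the DS4 convention)
//   SplitHalves: (0,n_rot/2), (1,n_rot/2 + 1), ...   (NEOX-style)
inline std::vector<std::pair<int, int>> rope_pairs(int n_rot, RopePairing mode) {
    std::vector<std::pair<int, int>> out;
    const int half = n_rot / 2;
    for (int k = 0; k < half; ++k) {
        if (mode == RopePairing::Consecutive) {
            out.emplace_back(2 * k, 2 * k + 1);
        } else {
            out.emplace_back(k, k + half);
        }
    }
    return out;
}
```

With n_rot=64, the second pair is (2, 3) in one mode and (1, 33) in the other, so choosing the wrong mode rotates unrelated elements together, which would explain the wrong position encodings described above.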
howard0su and others added 18 commits June 22, 2026 08:01
- Add eval_begin/eval_end async IPC pattern for pipelined expert offload
- Add graph timing reporting (hot/cold cache hit/miss stats)
- Extend MoeHybridFFNEval with async dispatch and fixed-slot graph caching
- Wire async expert eval into DeepSeek4 graph forward pass
- Add DeepSeek4Expert IPC mode to backend_ipc_main dispatch

…otocol

- Rename deepseek4_expert_ipc.{h,cpp} → expert_ipc.{h,cpp} (model-agnostic
  client and wire protocol)
- Move deepseek4_expert_ipc_daemon.cpp → src/deepseek4/ (DS4-specific worker
  implementation belongs with its model code, not under common/)
- Rename types: DeepSeek4ExpertIpc* → ExpertIpc*, DS4_EXPERT_IPC_FLAG_* →
  EXPERT_IPC_FLAG_*
- Rename BackendIpcMode::DeepSeek4Expert → MoeExpert; accept both
  'moe-expert' and 'deepseek4-expert' on the CLI for compat

Rewrites the DeepSeek V4 Flash integration doc with clearer structure:
- Architecture summary (MLA, HC, MoE, KV compression)
- Forward pass walkthrough (hybrid path)
- Hot/cold expert partitioning logic and placement strategies
- IPC protocol and async overlap
- Full environment variable reference

Moves from docs/specs/deepseek4-experts.md to server/docs/DS4.md
to colocate with the server source it documents.

- Add DFLASH_DS4_ADAPTIVE_HOT=1 with configurable target ratio
  (DFLASH_DS4_ADAPTIVE_HOT_TARGET_RATIO, default 0.5). Places fewer
  hot experts to balance HIP worker compute with CPU cold overlap.
  With 256 experts and ratio=0.5, ~128 hot/layer means ~3 hot + 3 cold
  per token — optimal for async overlap hiding.

- Add DFLASH_MOE_FIXED_SLOT_GRAPHS=adaptive mode: pads graph slots to
  max(actual_count, 3) instead of always n_expert_used=6. Reduces
  wasted expert compute while still caching most graph shapes.

- Add DFLASH_MOE_FIXED_SLOT_MAX=N to cap fixed-slot padding.

- Refactor budget computation into reusable Ds4HybridBudgetInfo and
  fill_prefix_hot_placement() helpers.
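
The arithmetic behind the "~3 hot + 3 cold per token" figure and the adaptive slot padding can be sketched as follows. Both helpers are hypothetical; the expected-count formula assumes routing is uniform over experts, and the interaction between the adaptive minimum and DFLASH_MOE_FIXED_SLOT_MAX (the cap limits padding but never drops below the actual count) is a guess rather than something this PR states.

```cpp
#include <algorithm>
#include <cassert>

// Expected number of hot experts selected per token, ASSUMING each of the
// n_expert_used routed experts is equally likely to be any of n_expert experts.
inline double expected_hot_per_token(int n_expert_used, int n_hot, int n_expert) {
    return static_cast<double>(n_expert_used) * n_hot / n_expert;
}

// Adaptive fixed-slot padding: pad up to at least min_slots. A positive
// max_cap limits how far padding may go; it never truncates actual_count
// (this cap behaviour is an assumption).
inline int padded_slots(int actual_count, int min_slots = 3, int max_cap = 0) {
    int floor_slots = min_slots;
    if (max_cap > 0) floor_slots = std::min(floor_slots, max_cap);
    return std::max(actual_count, floor_slots);
}
```

With 256 experts, 128 hot per layer and n_expert_used=6, the first helper gives 3 hot experts per token on average, leaving 3 cold, which matches the figure quoted in the commit message.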
Three call sites were missing the swiglu_clamp parameter after it was
added to the function signature, causing build failures.

- Revert .gitignore copilot-instructions entry (personal dev artifact)
- Revert bench_server.py refactor (unrelated to DS4)
- Remove ungated expert cache miss fprintf in IPC daemon

CPU-only unit tests (pooling, routing, RMSNorm, output proj shape, hash
routing) now build and run alongside test_server_unit in CI.

Stream weights directly into the unified (managed) buffer with a parallel
pread + posix_fadvise(DONTNEED) at disk bandwidth, instead of mmap page-faults.
With the bumped ggml submodule (unified-memory allocation), an ~86GB model
loads in ~81s on Strix Halo gfx1151 (was un-loadable). Falls back to the mmap
copy path on non-managed buffers or when DFLASH_NO_PREAD=1.

Bumps server/deps/llama.cpp to 9cd9e1ed for ggml_backend_cuda_buffer_is_managed
and the unified-memory allocator.
@howard0su howard0su marked this pull request as ready for review June 22, 2026 00:45
@howard0su


@davidmroth could you help me check whether the llama.cpp submodule update is needed?

@cubic-dev-ai cubic-dev-ai Bot left a comment


16 issues found across 39 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="server/src/deepseek4/deepseek4_hc_cuda.cu">

<violation number="1" location="server/src/deepseek4/deepseek4_hc_cuda.cu:13">
P2: Mix output dimension is hardcoded to 24 instead of being derived from model metadata; inconsistent with dynamically-sized CPU path</violation>
</file>

<file name="server/src/deepseek4/deepseek4_loader.cpp">

<violation number="1" location="server/src/deepseek4/deepseek4_loader.cpp:376">
P2: Failure paths after `out.ctx = meta_ctx` leave `DeepSeek4Weights` in a partially initialized state without resetting `out.ctx`; the `meta_ctx` context leaks if the caller does not invoke `free_deepseek4_weights()`. Either move `out.ctx = meta_ctx` to after all error paths or reset it (and free `meta_ctx`) in each failure branch.</violation>

<violation number="2" location="server/src/deepseek4/deepseek4_loader.cpp:438">
P1: Missing mmap bounds validation before tensor copy from GGUF file data (main load path).</violation>

<violation number="3" location="server/src/deepseek4/deepseek4_loader.cpp:532">
P2: Missing mmap bounds validation in embedder byte copy.</violation>

<violation number="4" location="server/src/deepseek4/deepseek4_loader.cpp:535">
P1: Embedding-table reload failure is silently ignored, allowing model load to succeed with zeroed embeddings that corrupt all token outputs.</violation>

<violation number="5" location="server/src/deepseek4/deepseek4_loader.cpp:539">
P2: Potential divide-by-zero: `n_vocab` from GGUF metadata is not validated to be non-zero before dividing `a.file_size` by it. A malformed file with `vocab_size=0` will crash the loader.</violation>

<violation number="6" location="server/src/deepseek4/deepseek4_loader.cpp:633">
P1: Integer overflow in tensor bounds check allows malformed GGUF metadata to bypass validation</violation>
</file>

<file name="server/CMakeLists.txt">

<violation number="1" location="server/CMakeLists.txt:570">
P2: Duplicate GPU runtime link configuration for dflash_common. An identical conditional block for linking CUDA::cudart / hip::host already exists earlier in this file (around line 369).</violation>
</file>

<file name="server/src/server/chat_template.cpp">

<violation number="1" location="server/src/server/chat_template.cpp:370">
P1: `add_generation_prompt` is gated by `pending_assistant`, so system-only or empty chats fail to emit an assistant generation prefix. All other formats in this file (QWEN3, LAGUNA, GEMMA4) check `add_generation_prompt` unconditionally.</violation>
</file>

<file name="server/src/common/moe_hybrid_ffn_eval.h">

<violation number="1" location="server/src/common/moe_hybrid_ffn_eval.h:202">
P1: New default parameter inserted in middle of `build_cached_hot_graph` signature silently remaps existing positional arguments at call site</violation>
</file>

<file name="server/src/ipc/backend_ipc_main.cpp">

<violation number="1" location="server/src/ipc/backend_ipc_main.cpp:128">
P2: fprintf format/argument mismatch: 6 `%s` placeholders but only 5 `argv[0]` arguments supplied, causing undefined behavior.</violation>

<violation number="2" location="server/src/ipc/backend_ipc_main.cpp:284">
P2: New DS4 expert CLI options use ad-hoc argument parsing without required-value enforcement or numeric validation, allowing silent misconfiguration.</violation>
</file>

<file name="server/src/deepseek4/deepseek4_backend.cpp">

<violation number="1" location="server/src/deepseek4/deepseek4_backend.cpp:690">
P1: park()/unpark() are no-op stubs that do not release or reacquire GPU resources, breaking the backend lifecycle contract</violation>

<violation number="2" location="server/src/deepseek4/deepseek4_backend.cpp:824">
P2: Hard-coded EOS token IDs (151643, 151644) make termination behavior fragile across tokenizer/model variants. Replace with tokenizer metadata lookup.</violation>
</file>

<file name="server/src/deepseek4/deepseek4_graph.cpp">

<violation number="1" location="server/src/deepseek4/deepseek4_graph.cpp:550">
P1: Indexer `build_indexer_score` result is discarded and attention still attends all compressed KV rows instead of selected top-k rows</violation>

<violation number="2" location="server/src/deepseek4/deepseek4_graph.cpp:1767">
P1: Non-hybrid fallback path silently produces incorrect results: HC mixing is bypassed with TODO stubs and hash-routed expert layers are zeroed out instead of using token-based hash routing.</violation>
</file>


if (tid < 0) return {};
const size_t off = data_start + gguf_get_tensor_offset(gctx, tid);
const size_t sz = gguf_get_tensor_size(gctx, tid);
if (off + sz > mmap.len) return {};


P1: Integer overflow in tensor bounds check allows malformed GGUF metadata to bypass validation

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/src/deepseek4/deepseek4_loader.cpp, line 633:

<comment>Integer overflow in tensor bounds check allows malformed GGUF metadata to bypass validation</comment>

<file context>
@@ -0,0 +1,677 @@
+            if (tid < 0) return {};
+            const size_t off = data_start + gguf_get_tensor_offset(gctx, tid);
+            const size_t sz = gguf_get_tensor_size(gctx, tid);
+            if (off + sz > mmap.len) return {};
+            return { file_bytes + off, sz };
+        };
</file context>
Suggested change
- if (off + sz > mmap.len) return {};
+ if (off > mmap.len || sz > mmap.len - off) return {};
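
A standalone sketch of why the rewritten condition is safer: with unsigned `size_t` arithmetic, `off + sz` can wrap around to a small value and pass the naive comparison. The helper names below are hypothetical and are not part of the loader.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Returns true when [off, off + sz) lies entirely inside a mapping of
// map_len bytes. No intermediate sum is formed, so nothing can wrap.
inline bool range_in_bounds(std::size_t off, std::size_t sz, std::size_t map_len) {
    if (off > map_len) return false;   // start is already past the end
    return sz <= map_len - off;        // cannot underflow because off <= map_len
}

// The naive form, kept only to illustrate the failure mode.
inline bool range_in_bounds_naive(std::size_t off, std::size_t sz, std::size_t map_len) {
    return off + sz <= map_len;        // off + sz may wrap to a small value
}
```

For example, with `off = SIZE_MAX - 3` and `sz = 8` the naive sum wraps to 4, so the naive check accepts a range that starts far outside a 1024-byte mapping. Note that `off` is itself computed as `data_start + gguf_get_tensor_offset(...)` in the quoted code, and that addition may deserve the same treatment.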

} else {
for (auto & a : allocs) {
if (!a.upload_to_backend) continue;
const void * src_data = (const char *)mmap.addr + a.file_offset;


P1: Missing mmap bounds validation before tensor copy from GGUF file data (main load path).

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/src/deepseek4/deepseek4_loader.cpp, line 438:

<comment>Missing mmap bounds validation before tensor copy from GGUF file data (main load path).</comment>

<file context>
@@ -0,0 +1,677 @@
+    } else {
+        for (auto & a : allocs) {
+            if (!a.upload_to_backend) continue;
+            const void * src_data = (const char *)mmap.addr + a.file_offset;
+            ggml_backend_tensor_set(a.tensor, src_data, 0, a.file_size);
+        }
</file context>
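
One way to address this, sketched under assumptions: validate every upload entry against the mapping length before any copy starts, so a malformed file fails the load instead of reading past the mapping. `AllocEntry` is a hypothetical stand-in that only borrows the field names visible in the quoted loop; the real struct has more members.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the loader's per-tensor allocation record.
struct AllocEntry {
    std::size_t file_offset;
    std::size_t file_size;
    bool        upload_to_backend;
};

// Returns the index of the first uploaded entry whose byte range falls
// outside a mapping of map_len bytes, or -1 when every range is valid.
inline long first_out_of_bounds(const std::vector<AllocEntry> & allocs, std::size_t map_len) {
    for (std::size_t i = 0; i < allocs.size(); ++i) {
        const AllocEntry & a = allocs[i];
        if (!a.upload_to_backend) continue;   // skipped entries are never read
        if (a.file_offset > map_len || a.file_size > map_len - a.file_offset) {
            return static_cast<long>(i);
        }
    }
    return -1;
}
```

A caller could run this once before the copy loop and return a load error naming the offending tensor when the result is not -1.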

(const char *)emb_mmap.addr + a.file_offset, a.file_size);
emb_mmap.close_map();
}
out.embedder.tok_embd_bytes = out.embedder.tok_embd_owned.data();


P1: Embedding-table reload failure is silently ignored, allowing model load to succeed with zeroed embeddings that corrupt all token outputs.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/src/deepseek4/deepseek4_loader.cpp, line 535:

<comment>Embedding-table reload failure is silently ignored, allowing model load to succeed with zeroed embeddings that corrupt all token outputs.</comment>

<file context>
@@ -0,0 +1,677 @@
+                                (const char *)emb_mmap.addr + a.file_offset, a.file_size);
+                    emb_mmap.close_map();
+                }
+                out.embedder.tok_embd_bytes = out.embedder.tok_embd_owned.data();
+                out.embedder.tok_embd_type  = a.tensor->type;
+                out.embedder.n_embd         = n_embd;
</file context>
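
A possible shape for the fix, as a sketch only: make the embedding reload return a status and an error string so the caller can fail the whole model load, and check the byte range before copying. `MappedFile` and `load_tok_embd` are hypothetical; the real code uses its own mmap wrapper with `open_ro` and `close_map`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical stand-in for the loader's read-only mapping.
struct MappedFile {
    const std::uint8_t * addr = nullptr;
    std::size_t          len  = 0;
};

// Copies the token-embedding bytes into dst. Returns false and fills err
// when the mapping is missing or the tensor range does not fit in the file,
// so the caller can abort the load instead of continuing with zeroed data.
inline bool load_tok_embd(const MappedFile * map, std::size_t file_offset, std::size_t file_size,
                          std::vector<std::uint8_t> & dst, std::string & err) {
    if (map == nullptr || map->addr == nullptr) {
        err = "token embedding reload: could not map model file";
        return false;
    }
    if (file_offset > map->len || file_size > map->len - file_offset) {
        err = "token embedding reload: tensor range exceeds file size";
        return false;
    }
    dst.resize(file_size);
    if (file_size > 0) std::memcpy(dst.data(), map->addr + file_offset, file_size);
    return true;
}
```

The same range check would also cover the separate bounds-validation comment on the embedder byte copy.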

}
}

if (add_generation_prompt && pending_assistant) {


P1: add_generation_prompt is gated by pending_assistant, so system-only or empty chats fail to emit an assistant generation prefix. All other formats in this file (QWEN3, LAGUNA, GEMMA4) check add_generation_prompt unconditionally.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/src/server/chat_template.cpp, line 370:

<comment>`add_generation_prompt` is gated by `pending_assistant`, so system-only or empty chats fail to emit an assistant generation prefix. All other formats in this file (QWEN3, LAGUNA, GEMMA4) check `add_generation_prompt` unconditionally.</comment>

<file context>
@@ -313,6 +314,65 @@ std::string render_chat_template(
+            }
+        }
+
+        if (add_generation_prompt && pending_assistant) {
+            result += "<|Assistant|>";
+            result += enable_thinking ? "<think>" : "</think>";
</file context>
Suggested change
- if (add_generation_prompt && pending_assistant) {
+ if (add_generation_prompt) {
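
The behaviour the suggestion aims for can be isolated in a small helper. `append_generation_prompt` is a hypothetical name; the two literals are copied from the quoted diff. Whether DS4 should really emit the prefix after a trailing assistant message is a question for the template author, so treat this as a sketch of the suggested change rather than the confirmed template semantics.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: append the assistant prefix whenever the caller asks
// for a generation prompt, independent of which role came last.
inline void append_generation_prompt(std::string & result, bool add_generation_prompt,
                                     bool enable_thinking) {
    if (!add_generation_prompt) return;
    result += "<|Assistant|>";
    result += enable_thinking ? "<think>" : "</think>";
}
```

With this shape, a system-only or empty conversation still ends in an assistant prefix, which is the case the review comment says the gated version misses.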

int n_embd,
int n_ff_exp,
int n_hot,
float swiglu_clamp = 0.0f,


P1: New default parameter inserted in middle of build_cached_hot_graph signature silently remaps existing positional arguments at call site

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/src/common/moe_hybrid_ffn_eval.h, line 202:

<comment>New default parameter inserted in middle of `build_cached_hot_graph` signature silently remaps existing positional arguments at call site</comment>

<file context>
@@ -186,6 +199,7 @@ bool build_cached_hot_graph(
     int n_embd,
     int n_ff_exp,
     int n_hot,
+    float swiglu_clamp = 0.0f,
     bool gpu_remap = false,
     int n_expert = 0);
</file context>
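
A reduced illustration of the hazard, using hypothetical functions with only the trailing parameters of the quoted signature: a call that used to pass `(gpu_remap, n_expert)` positionally still compiles after the insertion, but each value lands one slot to the left through implicit conversions.

```cpp
#include <cassert>

struct Args { float swiglu_clamp; bool gpu_remap; int n_expert; };

// Shape of the signature before the change: trailing (gpu_remap, n_expert).
inline Args old_sig(int /*n_hot*/, bool gpu_remap = false, int n_expert = 0) {
    return {0.0f, gpu_remap, n_expert};
}

// Shape after the change: swiglu_clamp inserted ahead of the existing defaults.
inline Args new_sig(int /*n_hot*/, float swiglu_clamp = 0.0f, bool gpu_remap = false,
                    int n_expert = 0) {
    return {swiglu_clamp, gpu_remap, n_expert};
}

// An unchanged call site passing (n_hot, true, n): with new_sig, `true`
// becomes swiglu_clamp = 1.0f, `n` becomes gpu_remap, and n_expert falls
// back to its default of 0.
inline Args unchanged_call_site(int n) { return new_sig(4, true, n); }
```

Appending the new parameter at the end, or removing the defaults so every call site must be updated explicitly, would turn this silent remap into either a no-op or a compile error.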

std::string emb_err;
if (emb_mmap.open_ro(path, emb_err)) {
std::memcpy(out.embedder.tok_embd_owned.data(),
(const char *)emb_mmap.addr + a.file_offset, a.file_size);


P2: Missing mmap bounds validation in embedder byte copy.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/src/deepseek4/deepseek4_loader.cpp, line 532:

<comment>Missing mmap bounds validation in embedder byte copy.</comment>

<file context>
@@ -0,0 +1,677 @@
+                std::string emb_err;
+                if (emb_mmap.open_ro(path, emb_err)) {
+                    std::memcpy(out.embedder.tok_embd_owned.data(),
+                                (const char *)emb_mmap.addr + a.file_offset, a.file_size);
+                    emb_mmap.close_map();
+                }
</file context>

Comment thread server/CMakeLists.txt
endif()
if(DFLASH27B_GPU_BACKEND STREQUAL "hip")
target_link_libraries(dflash_common PRIVATE hip::host)
if(DFLASH27B_GPU_BACKEND STREQUAL "cuda")


P2: Duplicate GPU runtime link configuration for dflash_common. An identical conditional block for linking CUDA::cudart / hip::host already exists earlier in this file (around line 369).

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/CMakeLists.txt, line 570:

<comment>Duplicate GPU runtime link configuration for dflash_common. An identical conditional block for linking CUDA::cudart / hip::host already exists earlier in this file (around line 369).</comment>

<file context>
@@ -557,8 +567,10 @@ find_package(OpenMP)
 endif()
-if(DFLASH27B_GPU_BACKEND STREQUAL "hip")
-    target_link_libraries(dflash_common PRIVATE hip::host)
+if(DFLASH27B_GPU_BACKEND STREQUAL "cuda")
+    target_link_libraries(dflash_common PUBLIC CUDA::cudart)
+elseif(DFLASH27B_GPU_BACKEND STREQUAL "hip")
</file context>

const char * value = nullptr;
if (!require_value(i, argc, argv, "--kvflash-pool", value)) return 2;
if (!parse_nonnegative_int(value, kvflash_pool_tokens)) return 2;
} else if (std::strncmp(argv[i], "--expert-budget-mb=", 19) == 0) {


P2: New DS4 expert CLI options use ad-hoc argument parsing without required-value enforcement or numeric validation, allowing silent misconfiguration.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/src/ipc/backend_ipc_main.cpp, line 284:

<comment>New DS4 expert CLI options use ad-hoc argument parsing without required-value enforcement or numeric validation, allowing silent misconfiguration.</comment>

<file context>
@@ -274,6 +281,20 @@ int main(int argc, char ** argv) {
             const char * value = nullptr;
             if (!require_value(i, argc, argv, "--kvflash-pool", value)) return 2;
             if (!parse_nonnegative_int(value, kvflash_pool_tokens)) return 2;
+        } else if (std::strncmp(argv[i], "--expert-budget-mb=", 19) == 0) {
+            ds4_expert_budget_mb = argv[i] + 19;
+        } else if (std::strcmp(argv[i], "--expert-budget-mb") == 0) {
</file context>
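
For the numeric-validation half of this comment, a strict parser might look like the sketch below. `parse_nonnegative_int_strict` is a hypothetical name chosen to avoid implying anything about the existing `parse_nonnegative_int` helper, whose signature is not shown in this PR.

```cpp
#include <cassert>
#include <cerrno>
#include <climits>
#include <cstdlib>

// Hypothetical strict parser: accepts a base-10 integer in [0, INT_MAX] and
// rejects empty input, trailing characters, negatives, and out-of-range values.
// Like strtol itself, it tolerates leading whitespace and a leading '+'.
inline bool parse_nonnegative_int_strict(const char * s, int & out) {
    if (s == nullptr || *s == '\0') return false;
    errno = 0;
    char * end = nullptr;
    const long v = std::strtol(s, &end, 10);
    if (errno == ERANGE || end == s || *end != '\0') return false;
    if (v < 0 || v > INT_MAX) return false;
    out = static_cast<int>(v);
    return true;
}
```

Applied to `--expert-budget-mb`, an input such as `12mb` or an empty value after `=` would be rejected at startup instead of being stored as a raw string and interpreted later.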

"--layer-ends=N[,N...] --max-ctx=N "
"[--hidden=N --vocab=N --max-tokens=N]\n",
"[--hidden=N --vocab=N --max-tokens=N]\n"
" or: %s --backend-ipc-mode=moe-expert <model.gguf> "


P2: fprintf format/argument mismatch: 6 %s placeholders but only 5 argv[0] arguments supplied, causing undefined behavior.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/src/ipc/backend_ipc_main.cpp, line 128:

<comment>fprintf format/argument mismatch: 6 `%s` placeholders but only 5 `argv[0]` arguments supplied, causing undefined behavior.</comment>

<file context>
@@ -123,7 +124,9 @@ int main(int argc, char ** argv) {
             "--layer-ends=N[,N...] --max-ctx=N "
-            "[--hidden=N --vocab=N --max-tokens=N]\n",
+            "[--hidden=N --vocab=N --max-tokens=N]\n"
+            "   or: %s --backend-ipc-mode=moe-expert <model.gguf> "
+            "--stream-fd=FD [--payload-fd=FD] [--draft-gpu=N]\n",
             argv[0],
</file context>
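
A quick way to sanity-check a usage string like this one is to count its `%s` conversions and compare against the number of `argv[0]` arguments passed. The helper below is a hypothetical sketch that only recognizes bare `%s` and the `%%` escape; it is not a general printf parser. Compilers such as GCC and Clang can also diagnose this class of mismatch for literal format strings when format warnings are enabled.

```cpp
#include <cassert>
#include <cstddef>

// Counts bare "%s" conversions in a printf-style format string, skipping
// the "%%" escape. Flags, widths, and other conversions are not handled.
inline std::size_t count_string_conversions(const char * fmt) {
    std::size_t n = 0;
    for (const char * p = fmt; *p != '\0'; ++p) {
        if (*p != '%') continue;
        ++p;                      // look at the character after '%'
        if (*p == '\0') break;    // dangling '%' at end of string
        if (*p == 's') ++n;
    }
    return n;
}
```

A unit test or debug assertion comparing this count with the argument count would have flagged the added `or: %s ...` line that lacks a matching `argv[0]`.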


// Check EOS
// TODO: proper EOS detection from tokenizer metadata
if (next_token == 151643 || next_token == 151644) { // common DS EOS/EOT


P2: Hard-coded EOS token IDs (151643, 151644) make termination behavior fragile across tokenizer/model variants. Replace with tokenizer metadata lookup.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/src/deepseek4/deepseek4_backend.cpp, line 824:

<comment>Hard-coded EOS token IDs (151643, 151644) make termination behavior fragile across tokenizer/model variants. Replace with tokenizer metadata lookup.</comment>

<file context>
@@ -0,0 +1,939 @@
+
+        // Check EOS
+        // TODO: proper EOS detection from tokenizer metadata
+        if (next_token == 151643 || next_token == 151644) {  // common DS EOS/EOT
+            break;
+        }
</file context>
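
A sketch of the metadata-driven alternative: collect stop-token ids once at load time and test membership in the decode loop. `StopTokens` is a hypothetical type. In GGUF files the EOS id is conventionally stored under the `tokenizer.ggml.eos_token_id` key; how this loader would read that key, and whether the DS4 file carries additional end-of-turn ids, are assumptions not covered here.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_set>

// Hypothetical container for stop tokens discovered in model metadata.
struct StopTokens {
    std::unordered_set<std::int32_t> ids;

    // Metadata lookups commonly report "missing" as a negative value,
    // so negative ids are ignored rather than stored.
    void add_if_valid(std::int64_t id) {
        if (id >= 0) ids.insert(static_cast<std::int32_t>(id));
    }

    bool is_stop(std::int32_t tok) const { return ids.count(tok) != 0; }
};
```

The decode loop would then replace the literal comparison with `if (stop.is_stop(next_token)) break;`, and a model whose metadata yields no stop ids could be rejected at load time instead of generating until the token limit.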

@howard0su


Noticed a 20% regression when running the 27B dense model.

@howard0su howard0su marked this pull request as draft June 22, 2026 01:34