perf(server): reduce MoE expert-compute IPC overhead#388
Conversation
|
Additional remote validation on the lucebox heterogeneous setup: I reran the MoE expert-compute IPC comparison on the remote lucebox host with a Strix Halo HIP/gfx1151 parent and an RTX 3090 CUDA/sm86 remote expert daemon, using
The 4k old-vs-new case was rerun in reverse/repeated order ( |
ce4ec06 to
4996520
Compare
There was a problem hiding this comment.
5 issues found across 25 files
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="server/src/qwen35/gguf_target_loader.cpp">
<violation number="1" location="server/src/qwen35/gguf_target_loader.cpp:628">
P2: Failure path leaks GPU backend buffer when `ggml_backend_tensor_alloc` fails. Repeated load retries can accumulate unreleased VRAM.</violation>
</file>
<file name="server/src/common/moe_expert_compute.cpp">
<violation number="1" location="server/src/common/moe_expert_compute.cpp:137">
P2: Validate IPC bin override points to an executable before accepting it; current path trusts any non-empty env value and defers failure to child launch.
(Based on your team's feedback about validating env-bin executable overrides.) [FEEDBACK_USED]</violation>
</file>
<file name="server/src/qwen35moe/qwen35moe_pipelined_decode.cpp">
<violation number="1" location="server/src/qwen35moe/qwen35moe_pipelined_decode.cpp:361">
P0: Parameter insertion breaks existing positional call sites via implicit conversion. This can force KV writes to slot 1 in hybrid-forward and silently drop routing stats collection.</violation>
</file>
Tip: cubic used a learning from your PR history. Let your coding agent read cubic learnings directly with the cubic MCP.
Re-trigger cubic
| PipelinedDecodeTelemetry * tel, | ||
| int kv_slot) { | ||
| int kv_slot, | ||
| bool capture_layers, |
There was a problem hiding this comment.
P0: Parameter insertion breaks existing positional call sites via implicit conversion. This can force KV writes to slot 1 in hybrid-forward and silently drop routing stats collection.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/src/qwen35moe/qwen35moe_pipelined_decode.cpp, line 361:
<comment>Parameter insertion breaks existing positional call sites via implicit conversion. This can force KV writes to slot 1 in hybrid-forward and silently drop routing stats collection.</comment>
<file context>
@@ -315,7 +357,9 @@ bool pipelined_decode_one_token(
PipelinedDecodeTelemetry * tel,
- int kv_slot) {
+ int kv_slot,
+ bool capture_layers,
+ MoeHybridRoutingStats * routing_stats) {
</file context>
There was a problem hiding this comment.
2 issues found across 7 files (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="server/src/qwen35moe/qwen35moe_pipelined_decode.cpp">
<violation number="1" location="server/src/qwen35moe/qwen35moe_pipelined_decode.cpp:361">
P0: Parameter insertion breaks existing positional call sites via implicit conversion. This can force KV writes to slot 1 in hybrid-forward and silently drop routing stats collection.</violation>
</file>
<file name="server/src/qwen35/gguf_target_loader.cpp">
<violation number="1" location="server/src/qwen35/gguf_target_loader.cpp:637">
P1: Metadata-only early return skips NVFP4 scale loading (and shape validation), changing model semantics for metadata-only users. Keep metadata-only mode from returning before scale extraction/validation.</violation>
</file>
Reply with feedback, questions, or to request a fix.
Re-trigger cubic
| set_last_error("ggml_backend_tensor_alloc failed (target)"); | ||
| const size_t data_start = gguf_get_data_offset(gctx); | ||
|
|
||
| if (plan.metadata_only) { |
There was a problem hiding this comment.
P1: Metadata-only early return skips NVFP4 scale loading (and shape validation), changing model semantics for metadata-only users. Keep metadata-only mode from returning before scale extraction/validation.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/src/qwen35/gguf_target_loader.cpp, line 637:
<comment>Metadata-only early return skips NVFP4 scale loading (and shape validation), changing model semantics for metadata-only users. Keep metadata-only mode from returning before scale extraction/validation.</comment>
<file context>
@@ -632,14 +632,47 @@ bool load_target_gguf_partial(const std::string & path,
+ const size_t data_start = gguf_get_data_offset(gctx);
+
+ if (plan.metadata_only) {
+#if !defined(_WIN32)
+ struct stat st {};
</file context>
7366c3a to
29995d3
Compare
Summary
This PR reduces cross-backend MoE expert-compute IPC overhead by batching prefill remote-expert work instead of inheriting the local hot-stack safety slice granularity. In the 4248 prompt / 128 completion checks below, the batched path cuts IPC calls by about
91-92%and payload by about48-97%.Because the best request shape depends on remote backend compute and memory headroom,
autodefaults to the batched path while explicitstreamkeeps the conservative small-request path available.Changes
f32,f16,bf16).DFLASH_MOE_EXPERT_COMPUTE_IPC_MODE=auto|batched|stream;autouses the batched path and explicitstreamkeeps the conservative path available.DFLASH_MOE_EXPERT_COMPUTE_IPC_TRANSPORT=stream|shared|autoas the payload transport selector, separate from execution granularity.Notes
Mode behavior:
auto/batched: batches remote expert compute over the prefill chunk, capped byDFLASH_MOE_EXPERT_COMPUTE_IPC_BATCH_CAPACITY.stream: follows the small hot-stack safety slices, typically up to 4 prefill tokens per remote expert-compute call on this path.Empirical cases, 4248 prompt / 128 completion prefill check:
streamauto/batched290422657-90.9%612.430 MiB317.346 MiB-48.2%290202280-92.1%612.370 MiB17.943 MiB-97.1%On dual Pro VII, batched mode significantly reduced prefill IPC traffic and reduced prefill time from
40.93sto27.74s(-32.2%), while decode throughput stayed effectively flat (18.6 -> 18.5 tok/s). This is still a policy tradeoff rather than a universal win/loss: if the remote backend is too weak, batched mode can expose the remote-compute ceiling and increase prefill time, as shown by the Pro VII + P4 check from41.64sto81.20s(+95.0%). This is why the PR keeps explicitstreammode available instead of removing the conservative path.