perf(server): reduce MoE expert-compute IPC overhead #388

Open
weicj wants to merge 5 commits into Luce-Org:main from weicj:perf-moe-expert-compute-ipc

@weicj commented Jun 15, 2026 (Collaborator)
Summary

This PR reduces cross-backend MoE expert-compute IPC overhead by batching prefill remote-expert work instead of inheriting the local hot-stack safety slice granularity. In the 4248 prompt / 128 completion checks below, the batched path cuts IPC calls by about 91-92% and payload by about 48-97%.

Because the best request shape depends on remote backend compute and memory headroom, auto defaults to the batched path while explicit stream keeps the conservative small-request path available.

Changes

  • Batch remote MoE expert compute over the prefill chunk instead of inheriting the local hot-stack safety slice size.
  • Send backend-local expert ids for new MoE expert-compute IPC commands while keeping compatibility with the older command shape.
  • Add typed input payload support for MoE expert-compute IPC prefill (f32, f16, bf16).
  • Add DFLASH_MOE_EXPERT_COMPUTE_IPC_MODE=auto|batched|stream; auto uses the batched path and explicit stream keeps the conservative path available.
  • Keep DFLASH_MOE_EXPERT_COMPUTE_IPC_TRANSPORT=stream|shared|auto as the payload transport selector, separate from execution granularity.
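
As a rough illustration of the mode selector described above, the mapping from `DFLASH_MOE_EXPERT_COMPUTE_IPC_MODE` to an execution path could look like the sketch below. The names `IpcMode`, `parse_ipc_mode`, and `ipc_mode_from_env` are hypothetical and not taken from the PR, and the fallback for unrecognized values is an assumption.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical enum; the PR's actual type names are not shown in this thread.
enum class IpcMode { Batched, Stream };

// Map a raw mode string to an execution mode.
// Per the PR description, "auto" selects the batched path and only an
// explicit "stream" selects the conservative small-request path.
// Treating unset, empty, or unrecognized values as batched is an assumption.
inline IpcMode parse_ipc_mode(const char * raw) {
    const std::string value = raw ? raw : "";
    if (value == "stream") {
        return IpcMode::Stream;
    }
    return IpcMode::Batched;  // "", "auto", "batched", or unrecognized
}

// Read the selector from the environment variable named in the PR.
inline IpcMode ipc_mode_from_env() {
    return parse_ipc_mode(std::getenv("DFLASH_MOE_EXPERT_COMPUTE_IPC_MODE"));
}
```

Keeping the parser separate from the `getenv` call makes the mapping testable without touching the process environment.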

Notes

Mode behavior:

  • auto / batched: batches remote expert compute over the prefill chunk, capped by DFLASH_MOE_EXPERT_COMPUTE_IPC_BATCH_CAPACITY.
  • stream: follows the small hot-stack safety slices, typically up to 4 prefill tokens per remote expert-compute call on this path.
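
To make the granularity difference concrete, here is a simplified sketch that splits a prefill chunk into token ranges, one range per remote call. The helper `split_token_ranges` is hypothetical, and the capacity value used in the note below is illustrative rather than the actual default of `DFLASH_MOE_EXPERT_COMPUTE_IPC_BATCH_CAPACITY`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Split n_tokens into half-open [begin, end) ranges of at most max_tokens each.
// In this simplified model, one range corresponds to one remote
// expert-compute call for a single layer.
inline std::vector<std::pair<std::size_t, std::size_t>>
split_token_ranges(std::size_t n_tokens, std::size_t max_tokens) {
    assert(max_tokens > 0);
    std::vector<std::pair<std::size_t, std::size_t>> ranges;
    for (std::size_t begin = 0; begin < n_tokens; begin += max_tokens) {
        ranges.emplace_back(begin, std::min(n_tokens, begin + max_tokens));
    }
    return ranges;
}
```

For a 512-token chunk, slices of 4 tokens give 128 ranges, while an assumed capacity of 256 gives 2. Real call counts also depend on the number of layers and on which experts are remote, so this model is not expected to reproduce the measured totals.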

Empirical cases, 4248 prompt / 128 completion prefill check:

  • Dual Pro VII: HIP -> IPC -> HIP.
  • Pro VII + P4: HIP -> IPC -> CUDA.

| Path | Metric | stream | auto / batched | Change |
|---|---|---|---|---|
| HIP -> IPC -> HIP | IPC calls | 29042 | 2657 | -90.9% |
| HIP -> IPC -> HIP | IPC payload | 612.430 MiB | 317.346 MiB | -48.2% |
| HIP -> IPC -> CUDA | IPC calls | 29020 | 2280 | -92.1% |
| HIP -> IPC -> CUDA | IPC payload | 612.370 MiB | 17.943 MiB | -97.1% |

On dual Pro VII, batched mode significantly reduced prefill IPC traffic and reduced prefill time from 40.93s to 27.74s (-32.2%), while decode throughput stayed effectively flat (18.6 -> 18.5 tok/s). This is still a policy tradeoff rather than a universal win/loss: if the remote backend is too weak, batched mode can expose the remote-compute ceiling and increase prefill time, as shown by the Pro VII + P4 check from 41.64s to 81.20s (+95.0%). This is why the PR keeps explicit stream mode available instead of removing the conservative path.

@weicj commented Jun 24, 2026 (Collaborator, Author)

Additional remote validation on the lucebox heterogeneous setup:

I reran the MoE expert-compute IPC comparison on the remote lucebox host with a Strix Halo HIP/gfx1151 parent and an RTX 3090 CUDA/sm86 remote expert daemon, using Qwopus3.6-35B-A3B. The main effect is on prefill: batched expert-compute IPC significantly reduces IPC call count and IPC wait time, while decode throughput stays effectively unchanged.

| Case | Comparison | Prefill | Decode | IPC calls | IPC wait |
|---|---|---|---|---|---|
| 1k/64 | stream -> auto/batched | 325.8 -> 426.3 tok/s | 45.8 -> 45.9 tok/s | 7929 -> 1291 | 1293.7ms -> 481.5ms |
| 4k/128 | stream -> auto/batched | 342.2 -> 451.9 tok/s | 45.0 -> 44.7 tok/s | 29168 -> 2791 | 4742.3ms -> 1458.4ms |
| 4k/128 reverse repeat | old -> new | 342.9 -> 452.8 tok/s | 45.0 -> 45.0 tok/s | 29168 -> 2791 | 4734.5ms -> 1470.8ms |
| long prompt / 128 | old -> new | 295.9 -> 335.8 tok/s | 37.7 -> 37.6 tok/s | 203434 -> 5939 | 32121.5ms -> 9240.3ms |

The 4k old-vs-new case was rerun in reverse/repeated order (new, old, new, old) to avoid relying on a one-off noisy result. That repeat confirmed the expected direction: about 24% lower prefill time, about 90% fewer IPC calls, and about 69% lower IPC wait, with decode unchanged.

@weicj weicj force-pushed the perf-moe-expert-compute-ipc branch from ce4ec06 to 4996520 Compare June 24, 2026 12:44
@weicj weicj marked this pull request as ready for review June 24, 2026 12:47

@cubic-dev-ai cubic-dev-ai Bot left a comment


5 issues found across 25 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="server/src/qwen35/gguf_target_loader.cpp">

<violation number="1" location="server/src/qwen35/gguf_target_loader.cpp:628">
P2: Failure path leaks GPU backend buffer when `ggml_backend_tensor_alloc` fails. Repeated load retries can accumulate unreleased VRAM.</violation>
</file>
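
One common way to close this kind of failure-path leak is to hold the buffer in an owning guard so every early return releases it. The sketch below uses hypothetical stand-ins (`FakeBuffer`, `fake_buffer_alloc`, `fake_buffer_free`) rather than the real ggml backend API, and is not the loader's actual code.

```cpp
#include <memory>

// Hypothetical stand-ins for a backend buffer API; not ggml's real functions.
struct FakeBuffer { int id; };
static int g_live_buffers = 0;  // counts buffers that have not been freed

inline FakeBuffer * fake_buffer_alloc() {
    ++g_live_buffers;
    return new FakeBuffer{1};
}

inline void fake_buffer_free(FakeBuffer * buffer) {
    if (buffer != nullptr) {
        --g_live_buffers;
        delete buffer;
    }
}

// Deleter so std::unique_ptr releases the buffer through the backend call.
struct BufferDeleter {
    void operator()(FakeBuffer * buffer) const { fake_buffer_free(buffer); }
};
using BufferPtr = std::unique_ptr<FakeBuffer, BufferDeleter>;

// Returns an owning pointer on success, or an empty one on failure.
// tensor_alloc_ok simulates the result of the tensor allocation step.
inline BufferPtr load_with_guard(bool tensor_alloc_ok) {
    BufferPtr buffer(fake_buffer_alloc());
    if (!tensor_alloc_ok) {
        return BufferPtr();  // the guard frees the buffer on this path
    }
    return buffer;  // ownership moves to the caller
}
```

An explicit free on the failure branch would also work; the guard just makes it harder to miss a branch added later.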

<file name="server/src/common/moe_expert_compute.cpp">

<violation number="1" location="server/src/common/moe_expert_compute.cpp:137">
P2: Validate IPC bin override points to an executable before accepting it; current path trusts any non-empty env value and defers failure to child launch.

(Based on your team's feedback about validating env-bin executable overrides.) [FEEDBACK_USED]</violation>
</file>
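
A minimal sketch of the kind of up-front check this suggests, assuming a POSIX host. The helper name `is_executable_file` is hypothetical, and how the override value is read from the environment is left out because the variable name is not shown in this thread.

```cpp
#include <sys/stat.h>  // stat(), S_ISREG (POSIX)
#include <unistd.h>    // access(), X_OK (POSIX)

// Return true only if path names an existing regular file that the current
// process is allowed to execute. Directories are rejected explicitly because
// access(X_OK) also succeeds for searchable directories.
inline bool is_executable_file(const char * path) {
    if (path == nullptr || *path == '\0') {
        return false;
    }
    struct stat st {};
    if (stat(path, &st) != 0 || !S_ISREG(st.st_mode)) {
        return false;
    }
    return access(path, X_OK) == 0;
}
```

The file can still change between this check and the child launch, so the launch failure path has to stay; the check mainly turns a late, indirect failure into an early and clearer one.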

<file name="server/src/qwen35moe/qwen35moe_pipelined_decode.cpp">

<violation number="1" location="server/src/qwen35moe/qwen35moe_pipelined_decode.cpp:361">
P0: Parameter insertion breaks existing positional call sites via implicit conversion. This can force KV writes to slot 1 in hybrid-forward and silently drop routing stats collection.</violation>
</file>


Comment thread server/src/ipc/backend_ipc_main.cpp
     PipelinedDecodeTelemetry * tel,
-    int kv_slot) {
+    int kv_slot,
+    bool capture_layers,


P0: Parameter insertion breaks existing positional call sites via implicit conversion. This can force KV writes to slot 1 in hybrid-forward and silently drop routing stats collection.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/src/qwen35moe/qwen35moe_pipelined_decode.cpp, line 361:

<comment>Parameter insertion breaks existing positional call sites via implicit conversion. This can force KV writes to slot 1 in hybrid-forward and silently drop routing stats collection.</comment>

<file context>
@@ -315,7 +357,9 @@ bool pipelined_decode_one_token(
     PipelinedDecodeTelemetry * tel,
-    int kv_slot) {
+    int kv_slot,
+    bool capture_layers,
+    MoeHybridRoutingStats * routing_stats) {
 
</file context>
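
The hazard described here can be reproduced in isolation. The sketch below is a hypothetical reduction, not the PR's real signatures: `decode_old` and `decode_new` stand in for a function whose trailing parameters gained a new `int` in front of them, and the call uses the argument order written for the old shape.

```cpp
struct RoutingStats { int calls = 0; };  // hypothetical stand-in

// What the callee actually received, recorded for inspection.
struct Seen {
    int  kv_slot;
    bool capture_layers;
    bool has_stats;
};

// "Before": hypothetical older shape with two trailing parameters.
inline Seen decode_old(bool capture_layers = false,
                       RoutingStats * stats = nullptr) {
    return {0, capture_layers, stats != nullptr};
}

// "After": an int parameter inserted ahead of the existing ones.
inline Seen decode_new(int kv_slot = 0,
                       bool capture_layers = false,
                       RoutingStats * stats = nullptr) {
    return {kv_slot, capture_layers, stats != nullptr};
}
```

A call written for the old order, such as `decode_new(true, stats_ptr)`, still compiles: `true` converts to `kv_slot == 1`, the pointer converts to `capture_layers == true`, and `stats` silently defaults to `nullptr`. Appending new parameters at the end, or grouping the options into a struct, are two usual ways to avoid this kind of silent rebinding.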

Comment thread server/src/qwen35/gguf_target_loader.cpp
Comment thread server/src/common/moe_expert_compute.cpp
Comment thread server/src/qwen35moe/qwen35moe_pipelined_decode.h Outdated

@cubic-dev-ai cubic-dev-ai Bot left a comment


2 issues found across 7 files (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="server/src/qwen35moe/qwen35moe_pipelined_decode.cpp">

<violation number="1" location="server/src/qwen35moe/qwen35moe_pipelined_decode.cpp:361">
P0: Parameter insertion breaks existing positional call sites via implicit conversion. This can force KV writes to slot 1 in hybrid-forward and silently drop routing stats collection.</violation>
</file>

<file name="server/src/qwen35/gguf_target_loader.cpp">

<violation number="1" location="server/src/qwen35/gguf_target_loader.cpp:637">
P1: Metadata-only early return skips NVFP4 scale loading (and shape validation), changing model semantics for metadata-only users. Keep metadata-only mode from returning before scale extraction/validation.</violation>
</file>


set_last_error("ggml_backend_tensor_alloc failed (target)");
const size_t data_start = gguf_get_data_offset(gctx);

if (plan.metadata_only) {


P1: Metadata-only early return skips NVFP4 scale loading (and shape validation), changing model semantics for metadata-only users. Keep metadata-only mode from returning before scale extraction/validation.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/src/qwen35/gguf_target_loader.cpp, line 637:

<comment>Metadata-only early return skips NVFP4 scale loading (and shape validation), changing model semantics for metadata-only users. Keep metadata-only mode from returning before scale extraction/validation.</comment>

<file context>
@@ -632,14 +632,47 @@ bool load_target_gguf_partial(const std::string & path,
 
+    const size_t data_start = gguf_get_data_offset(gctx);
+
+    if (plan.metadata_only) {
+#if !defined(_WIN32)
+        struct stat st {};
</file context>

Comment thread server/src/common/moe_expert_compute.cpp
@weicj weicj force-pushed the perf-moe-expert-compute-ipc branch from 7366c3a to 29995d3 Compare June 24, 2026 14:55