perf(server): reduce MoE expert-compute IPC overhead by weicj · Pull Request #388 · Luce-Org/lucebox-hub

weicj · 2026-06-15T05:45:02Z

Summary

This PR reduces cross-backend MoE expert-compute IPC overhead by batching prefill remote-expert work instead of inheriting the local hot-stack safety slice granularity. In the 4248 prompt / 128 completion checks below, the batched path cuts IPC calls by about 91-92% and payload by about 48-97%.

Because the best request shape depends on remote backend compute and memory headroom, auto defaults to the batched path while explicit stream keeps the conservative small-request path available.

Changes

Batch remote MoE expert compute over the prefill chunk instead of inheriting the local hot-stack safety slice size.
Send backend-local expert ids for new MoE expert-compute IPC commands while keeping compatibility with the older command shape.
Add typed input payload support for MoE expert-compute IPC prefill (f32, f16, bf16).
Add DFLASH_MOE_EXPERT_COMPUTE_IPC_MODE=auto|batched|stream; auto uses the batched path and explicit stream keeps the conservative path available.
Keep DFLASH_MOE_EXPERT_COMPUTE_IPC_TRANSPORT=stream|shared|auto as the payload transport selector, separate from execution granularity.
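A minimal sketch of how the mode variable described above could be resolved, under stated assumptions. The enum and function names (`IpcMode`, `resolve_ipc_mode`, `ipc_mode_from_env`) are hypothetical and not taken from the PR. Treating an unset or unrecognized value like `auto` is also an assumption. Only the variable name and the `auto|batched|stream` values come from the description.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical execution-granularity policy; names are illustrative only.
enum class IpcMode { Batched, Stream };

// Pure helper so the policy can be checked without touching the environment.
inline IpcMode resolve_ipc_mode(const char * value) {
    if (value == nullptr) {
        return IpcMode::Batched;  // assumption: unset behaves like "auto"
    }
    const std::string v(value);
    if (v == "stream") {
        return IpcMode::Stream;   // explicit opt-in to the conservative path
    }
    // "auto" and "batched" both select the batched path (per the PR text).
    // Mapping unknown values to batched as well is an assumption of this sketch.
    return IpcMode::Batched;
}

// Reads the variable named in the PR description.
inline IpcMode ipc_mode_from_env() {
    return resolve_ipc_mode(std::getenv("DFLASH_MOE_EXPERT_COMPUTE_IPC_MODE"));
}
```

Keeping the parsing in a pure function separate from `std::getenv` mirrors the PR's split between execution granularity and payload transport: the transport selector would get its own, independent resolver.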

Notes

Mode behavior:

auto / batched: batches remote expert compute over the prefill chunk, capped by DFLASH_MOE_EXPERT_COMPUTE_IPC_BATCH_CAPACITY.
stream: follows the small hot-stack safety slices, typically up to 4 prefill tokens per remote expert-compute call on this path.
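The difference in call count between the two modes can be illustrated with a rough sketch. It assumes the batch capacity is expressed in prefill tokens and counts calls for a single remote expert-compute step over one prefill chunk. Real totals also depend on layer count and routing, which this sketch ignores. The function names are hypothetical.

```cpp
#include <cstddef>

// Number of calls needed to cover chunk_tokens when each call carries at most
// tokens_per_call tokens (ceiling division). Returns 0 for a zero-sized call.
inline std::size_t calls_for_chunk(std::size_t chunk_tokens, std::size_t tokens_per_call) {
    if (tokens_per_call == 0) {
        return 0;
    }
    return (chunk_tokens + tokens_per_call - 1) / tokens_per_call;
}

// stream: follows the small hot-stack safety slices (typically up to 4 tokens).
inline std::size_t stream_calls(std::size_t chunk_tokens, std::size_t slice_tokens = 4) {
    return calls_for_chunk(chunk_tokens, slice_tokens);
}

// auto / batched: one call per batch, capped by the batch capacity
// (assumed here to be a token count).
inline std::size_t batched_calls(std::size_t chunk_tokens, std::size_t batch_capacity) {
    return calls_for_chunk(chunk_tokens, batch_capacity);
}
```

For a 512-token chunk, `stream_calls(512)` gives 128 calls, while `batched_calls(512, 256)` gives 2. That is the kind of gap the measured IPC call counts reflect.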

Empirical cases, 4248 prompt / 128 completion prefill check:

Dual Pro VII: HIP -> IPC -> HIP.
Pro VII + P4: HIP -> IPC -> CUDA.

| Path | Metric | `stream` | `auto` / `batched` | Change |
| --- | --- | --- | --- | --- |
| HIP -> IPC -> HIP | IPC calls | `29042` | `2657` | `-90.9%` |
| HIP -> IPC -> HIP | IPC payload | `612.430 MiB` | `317.346 MiB` | `-48.2%` |
| HIP -> IPC -> CUDA | IPC calls | `29020` | `2280` | `-92.1%` |
| HIP -> IPC -> CUDA | IPC payload | `612.370 MiB` | `17.943 MiB` | `-97.1%` |

On dual Pro VII, batched mode significantly reduced prefill IPC traffic and reduced prefill time from 40.93s to 27.74s (-32.2%), while decode throughput stayed effectively flat (18.6 -> 18.5 tok/s). This is still a policy tradeoff rather than a universal win/loss: if the remote backend is too weak, batched mode can expose the remote-compute ceiling and increase prefill time, as shown by the Pro VII + P4 check from 41.64s to 81.20s (+95.0%). This is why the PR keeps explicit stream mode available instead of removing the conservative path.

weicj · 2026-06-24T12:14:32Z

Additional remote validation on the lucebox heterogeneous setup:

I reran the MoE expert-compute IPC comparison on the remote lucebox host with a Strix Halo HIP/gfx1151 parent and an RTX 3090 CUDA/sm86 remote expert daemon, using Qwopus3.6-35B-A3B. The main effect is on prefill: batched expert-compute IPC significantly reduces IPC call count and IPC wait time, while decode throughput stays effectively unchanged.

| Case | Comparison | Prefill | Decode | IPC calls | IPC wait |
| --- | --- | --- | --- | --- | --- |
| 1k/64 | stream -> auto/batched | 325.8 -> 426.3 tok/s | 45.8 -> 45.9 tok/s | 7929 -> 1291 | 1293.7ms -> 481.5ms |
| 4k/128 | stream -> auto/batched | 342.2 -> 451.9 tok/s | 45.0 -> 44.7 tok/s | 29168 -> 2791 | 4742.3ms -> 1458.4ms |
| 4k/128 reverse repeat | old -> new | 342.9 -> 452.8 tok/s | 45.0 -> 45.0 tok/s | 29168 -> 2791 | 4734.5ms -> 1470.8ms |
| long prompt / 128 | old -> new | 295.9 -> 335.8 tok/s | 37.7 -> 37.6 tok/s | 203434 -> 5939 | 32121.5ms -> 9240.3ms |

The 4k old-vs-new case was rerun in reverse/repeated order (new, old, new, old) to avoid relying on a one-off noisy result. That repeat confirmed the expected direction: about 24% lower prefill time, about 90% fewer IPC calls, and about 69% lower IPC wait, with decode unchanged.
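The table reports prefill as throughput while the summary quotes a reduction in prefill time. The sketch below shows the conversion, assuming the same prompt length on both sides so that time scales as the inverse of tok/s. The helper names are illustrative.

```cpp
// Fractional reduction in time for a fixed amount of work when throughput
// rises from old_tps to new_tps: time is proportional to 1 / throughput.
inline double time_reduction_from_throughput(double old_tps, double new_tps) {
    return 1.0 - old_tps / new_tps;
}

// Fractional reduction of a count or duration going from old_value to new_value.
inline double relative_reduction(double old_value, double new_value) {
    return 1.0 - new_value / old_value;
}
```

Applied to the reverse-repeat row, `time_reduction_from_throughput(342.9, 452.8)` is roughly 0.24, `relative_reduction(29168, 2791)` is roughly 0.90, and `relative_reduction(4734.5, 1470.8)` is roughly 0.69, which lines up with the quoted figures.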

cubic-dev-ai

5 issues found across 25 files

Prompt for AI agents (unresolved issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="server/src/qwen35/gguf_target_loader.cpp">

<violation number="1" location="server/src/qwen35/gguf_target_loader.cpp:628">
P2: Failure path leaks GPU backend buffer when `ggml_backend_tensor_alloc` fails. Repeated load retries can accumulate unreleased VRAM.</violation>
</file>

<file name="server/src/common/moe_expert_compute.cpp">

<violation number="1" location="server/src/common/moe_expert_compute.cpp:137">
P2: Validate IPC bin override points to an executable before accepting it; current path trusts any non-empty env value and defers failure to child launch.

(Based on your team's feedback about validating env-bin executable overrides.)</violation>
</file>

<file name="server/src/qwen35moe/qwen35moe_pipelined_decode.cpp">

<violation number="1" location="server/src/qwen35moe/qwen35moe_pipelined_decode.cpp:361">
P0: Parameter insertion breaks existing positional call sites via implicit conversion. This can force KV writes to slot 1 in hybrid-forward and silently drop routing stats collection.</violation>
</file>


cubic-dev-ai · 2026-06-24T13:07:28Z

    PipelinedDecodeTelemetry * tel,
-    int kv_slot) {
+    int kv_slot,
+    bool capture_layers,


P0: Parameter insertion breaks existing positional call sites via implicit conversion. This can force KV writes to slot 1 in hybrid-forward and silently drop routing stats collection.
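One way this class of break can occur is sketched below with a deliberately simplified, hypothetical signature. It assumes that some call sites were written for a `(capture_layers, routing_stats)` order and that the trailing parameters have default arguments. Neither assumption is confirmed by the diff shown here. `RoutingStats`, `DecodeArgs`, and the function names are stand-ins, not the PR's types.

```cpp
// Stand-in for the routing statistics collector.
struct RoutingStats { int tokens_routed = 0; };

// Records what the callee actually received.
struct DecodeArgs {
    int  kv_slot;
    bool capture_layers;
    bool has_routing_stats;
};

// Hypothetical signature after an int parameter is inserted ahead of the
// bool and pointer parameters, with defaults on all three.
inline DecodeArgs decode_one_token(int kv_slot = 0,
                                   bool capture_layers = false,
                                   RoutingStats * routing_stats = nullptr) {
    return DecodeArgs{kv_slot, capture_layers, routing_stats != nullptr};
}

// Call written for the assumed older order (capture_layers, routing_stats).
// It still compiles: true converts to int 1 and the pointer converts to bool.
inline DecodeArgs stale_call(RoutingStats * stats) {
    return decode_one_token(true, stats);
}

// Call updated for the new order, with each argument named in a comment.
inline DecodeArgs updated_call(RoutingStats * stats) {
    return decode_one_token(/*kv_slot=*/0, /*capture_layers=*/true, stats);
}
```

With a non-null `stats`, `stale_call` ends up with `kv_slot == 1` and no routing stats pointer, while `updated_call` passes what was intended. Grouping such arguments in a struct, or dropping the default arguments, would turn the stale call into a compile error instead.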

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At server/src/qwen35moe/qwen35moe_pipelined_decode.cpp, line 361: <comment>Parameter insertion breaks existing positional call sites via implicit conversion. This can force KV writes to slot 1 in hybrid-forward and silently drop routing stats collection.</comment> <file context> @@ -315,7 +357,9 @@ bool pipelined_decode_one_token( PipelinedDecodeTelemetry * tel, - int kv_slot) { + int kv_slot, + bool capture_layers, + MoeHybridRoutingStats * routing_stats) { </file context>

cubic-dev-ai

2 issues found across 7 files (changes from recent commits).

Prompt for AI agents (unresolved issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="server/src/qwen35moe/qwen35moe_pipelined_decode.cpp">

<violation number="1" location="server/src/qwen35moe/qwen35moe_pipelined_decode.cpp:361">
P0: Parameter insertion breaks existing positional call sites via implicit conversion. This can force KV writes to slot 1 in hybrid-forward and silently drop routing stats collection.</violation>
</file>

<file name="server/src/qwen35/gguf_target_loader.cpp">

<violation number="1" location="server/src/qwen35/gguf_target_loader.cpp:637">
P1: Metadata-only early return skips NVFP4 scale loading (and shape validation), changing model semantics for metadata-only users. Keep metadata-only mode from returning before scale extraction/validation.</violation>
</file>


cubic-dev-ai · 2026-06-24T13:31:23Z

-            set_last_error("ggml_backend_tensor_alloc failed (target)");
+    const size_t data_start = gguf_get_data_offset(gctx);
+
+    if (plan.metadata_only) {


P1: Metadata-only early return skips NVFP4 scale loading (and shape validation), changing model semantics for metadata-only users. Keep metadata-only mode from returning before scale extraction/validation.
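The ordering the reviewer asks for can be sketched generically: run validation and scale extraction for every caller, and only then let metadata-only mode return. Everything below is a hypothetical stand-in (`TensorEntry`, `LoadPlan`, `LoadOutcome`, `load_partial`). It does not model the GGUF or NVFP4 formats, only the control flow.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical per-tensor metadata; not the loader's real types.
struct TensorEntry {
    std::string name;
    std::vector<std::int64_t> shape;
    bool has_scale = false;
};

struct LoadPlan { bool metadata_only = false; };

struct LoadOutcome {
    bool ok = false;
    std::size_t scales_loaded = 0;
    bool tensor_data_loaded = false;
};

inline LoadOutcome load_partial(const std::vector<TensorEntry> & entries,
                                const LoadPlan & plan) {
    LoadOutcome out;
    // Step 1: shape validation and scale extraction run for every caller.
    for (const TensorEntry & e : entries) {
        if (e.shape.empty()) {
            return out;  // invalid: ok stays false
        }
        for (std::int64_t d : e.shape) {
            if (d <= 0) {
                return out;  // invalid dimension
            }
        }
        if (e.has_scale) {
            ++out.scales_loaded;  // stand-in for reading the scale value
        }
    }
    // Step 2: metadata-only callers may return only after step 1 has run.
    if (plan.metadata_only) {
        out.ok = true;
        return out;
    }
    // Step 3: stand-in for allocating and reading tensor data.
    out.tensor_data_loaded = true;
    out.ok = true;
    return out;
}
```

In this arrangement a metadata-only caller still sees validated shapes and extracted scales, and skips only the tensor data read.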

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At server/src/qwen35/gguf_target_loader.cpp, line 637: <comment>Metadata-only early return skips NVFP4 scale loading (and shape validation), changing model semantics for metadata-only users. Keep metadata-only mode from returning before scale extraction/validation.</comment> <file context> @@ -632,14 +632,47 @@ bool load_target_gguf_partial(const std::string & path, + const size_t data_start = gguf_get_data_offset(gctx); + + if (plan.metadata_only) { +#if !defined(_WIN32) + struct stat st {}; </file context>

weicj added 4 commits June 24, 2026 20:37

feat(server): add cross-backend MoE expert compute foundation

c9bde51

fix(server): address MoE expert compute review feedback

c6eb55d

perf(server): reduce MoE expert-compute IPC overhead

092e14b

fix(server): resolve MoE IPC rebase conflicts

4996520

weicj force-pushed the perf-moe-expert-compute-ipc branch from ce4ec06 to 4996520 · June 24, 2026 12:44

weicj marked this pull request as ready for review June 24, 2026 12:47

cubic-dev-ai Bot reviewed Jun 24, 2026


fix(server): address MoE IPC review feedback

29995d3

weicj force-pushed the perf-moe-expert-compute-ipc branch from 7366c3a to 29995d3 · June 24, 2026 14:55

weicj wants to merge 5 commits into Luce-Org:main from weicj:perf-moe-expert-compute-ipc
