RDNA 3.5 MMVQ/fattn optimizations for gfx1151 by jeffli-xilinx · Pull Request #31 · ROCm/llama.cpp

jeffli-xilinx · 2026-06-26T21:18:14Z

Summary

Add RDNA 3.5 (gfx1151) specific MMVQ parameter table with nwarps=1 for optimal wave32 scheduling on 40-CU APUs
Bias defusion: run bias-only fused kernels through non-fused template + separate bias-add kernel (24% per-kernel speedup)
Split-K for low-row-count MMVQ: split K-dimension across multiple waves with atomicAdd when waves_per_cu < 80 (improves LPDDR5X BW utilization from 76% to 89%)
Q8_1 activation cache: skip redundant quantization when the same src1 pointer is reused across fused MMVQ dispatches
RMS norm + mul + MUL_MAT graph fusion: fold rms_norm + weight multiply + Q8_1 quantization into a single fused MMVQ dispatch, eliminating 3 separate kernel launches per fusion site
Fused RMS norm + quantize kernel: single-pass kernel combining RMS normalization, weight multiplication, and Q8_1 quantization
Flash attention tile configs for RDNA 3.5 head dimensions 72 and 80
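As a rough illustration of what the fused RMS norm + quantize step computes, here is a minimal CPU reference sketch. It is not the kernel from this PR: the struct `BlockQ8_1` and the function `rms_norm_mul_quantize_q8_1` are hypothetical names, and the block layout is a simplified stand-in for ggml's Q8_1 format (32 int8 values plus a scale `d` and `s = d * sum(qs)`), with `float` in place of fp16. The point is that the normalized, weight-multiplied values are quantized directly, without writing an intermediate float tensor.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Simplified stand-in for a Q8_1 block (assumption: ggml stores d and s as
// fp16; float is used here to keep the sketch self-contained).
constexpr int QK8_1 = 32;
struct BlockQ8_1 {
    float  d;          // scale: amax / 127
    float  s;          // d * sum of the quantized values
    int8_t qs[QK8_1];  // quantized values
};

// Hypothetical reference: RMS-normalize x, multiply by weights w, and
// quantize the result block by block. n is assumed to be a multiple of 32;
// any tail elements are ignored.
std::vector<BlockQ8_1> rms_norm_mul_quantize_q8_1(const float* x, const float* w,
                                                  size_t n, float eps) {
    // Row-wide reduction: sum of squares (a wave/block reduction on the GPU).
    float sumsq = 0.0f;
    for (size_t i = 0; i < n; ++i) sumsq += x[i] * x[i];
    const float scale = 1.0f / std::sqrt(sumsq / static_cast<float>(n) + eps);

    std::vector<BlockQ8_1> out(n / QK8_1);
    for (size_t b = 0; b < out.size(); ++b) {
        float v[QK8_1];
        float amax = 0.0f;
        for (int j = 0; j < QK8_1; ++j) {
            // Normalize and apply the weight; the value stays in registers.
            v[j] = x[b * QK8_1 + j] * scale * w[b * QK8_1 + j];
            amax = std::fmax(amax, std::fabs(v[j]));
        }
        const float d  = amax / 127.0f;
        const float id = d != 0.0f ? 1.0f / d : 0.0f;
        int sum = 0;
        for (int j = 0; j < QK8_1; ++j) {
            const int q = static_cast<int>(std::lround(v[j] * id));
            out[b].qs[j] = static_cast<int8_t>(q);
            sum += q;
        }
        out[b].d = d;
        out[b].s = d * static_cast<float>(sum);
    }
    return out;
}
```

A reference like this could also serve as the "non-optimized path" when checking the fused kernel's output.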

Benchmark

Device: Radeon 8060S (gfx1151, 40 CUs, LPDDR5X), ROCm nightly 7.13

Decode (tg128, single token generation)

Model	Size	Baseline (t/s)	Patched (t/s)	Change
Qwen2.5-0.5B Q4_K_M	374 MiB	245.15	297.93	+21.5%
Qwen3-1.7B Q4_K_M	1.03 GiB	138.61	134.00	-3.3%
SmolLM2-1.7B Q4_K_M	1005 MiB	142.36	136.48	-4.1%
Qwen2.5-3B Q4_K_M	1.79 GiB	86.40	85.39	-1.2%
Gemma-4-E2B Q4_K_M	2.88 GiB	96.42	96.51	+0.1%
Qwen3-4B Q4_K_M	2.32 GiB	70.61	68.97	-2.3%
Qwen2.5-7B Q4_K_M	4.36 GiB	45.85	44.99	-1.9%
Llama-2-7B Q4_K_M	3.80 GiB	46.56	46.10	-1.0%
Qwen3-8B Q4_K_M	4.68 GiB	41.43	40.69	-1.8%
Llama-3.1-8B Q4_K_M	4.58 GiB	42.72	41.80	-2.2%
Qwen3.5-9B Q4_0	5.00 GiB	38.10	38.91	+2.1%

Prefill (prompt processing)

Model	pp128 base (t/s)	pp128 patched (t/s)	pp512 patched (t/s)	pp2048 patched (t/s)	pp128 change
Qwen2.5-7B Q4_K_M	1162	1290	1614	1606	+11.0%
Llama-3.1-8B Q4_K_M	1204	1224	1361	1325	+1.6%
Qwen3-8B Q4_K_M	994	1199	1328	1271	+20.6%
Llama-2-7B Q4_K_M	1227	1292	1477	1374	+5.3%

Notes

Qwen2.5-0.5B shows large decode improvement (+21.5%) due to very low row counts benefiting from split-K
Qwen3.5-9B Q4_0 (DeltaNet architecture) shows +2.1% decode from bias defusion + split-K
Prefill improvements across all tested models (+1.6% to +20.6% at pp128) from MMQ parameter tuning
Some models show small decode regressions (1-4%). This needs investigation and may be related to the Q8_1 cache or RMS norm fusion adding overhead on models where those fusions do not pay off
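The split-K idea mentioned in the notes can be sketched on the CPU as follows. This is an illustrative emulation only, not the PR's kernel: `matvec_split_k` is a hypothetical name, the weights are plain floats rather than quantized blocks, and the loop over slices stands in for separate waves. When a matrix has few rows, one wave per row may leave too few waves in flight to keep memory busy. Splitting the K dimension multiplies the number of independent work items, at the cost of an atomic accumulation into the output.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical CPU emulation of split-K matrix-vector multiplication.
// W is row-major [nrows x ncols]; the K dimension (ncols) is cut into
// k_splits slices. On the GPU each (row, slice) pair would be handled by its
// own wave, and the final accumulation would be an atomicAdd into y[row].
std::vector<float> matvec_split_k(const std::vector<float>& W,
                                  const std::vector<float>& x,
                                  size_t nrows, size_t ncols, size_t k_splits) {
    std::vector<float> y(nrows, 0.0f);  // output must be zeroed before accumulating
    const size_t slice = (ncols + k_splits - 1) / k_splits;  // ceil division
    for (size_t row = 0; row < nrows; ++row) {
        for (size_t s = 0; s < k_splits; ++s) {
            const size_t k0 = s * slice;
            const size_t k1 = std::min(k0 + slice, ncols);
            float partial = 0.0f;
            for (size_t k = k0; k < k1; ++k) {
                partial += W[row * ncols + k] * x[k];
            }
            y[row] += partial;  // atomicAdd(&y[row], partial) on the GPU
        }
    }
    return y;
}
```

Note that with real atomics the order in which partial sums arrive is not fixed, so results may differ in the last bits from the unsplit path.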

Test plan

Build with ROCm targeting gfx1151
Verify correctness: outputs match non-optimized path
Benchmark decode throughput on RDNA 3.5 APU
Verify no regression on other AMD GPU targets (RDNA2, RDNA3, RDNA4)
Investigate small decode regressions (1-4%) on the 1.7B-8B models

🤖 Generated with Claude Code

Add RDNA 3.5 (gfx1151) specific parameter tuning and kernel optimizations for decode throughput on LPDDR5X-based APUs (Radeon 8060S).

Key changes:
- MMVQ parameter table: add MMVQ_PARAMETERS_RDNA3_5 with nwarps=1 (single wave per block, optimal for wave32 on 40 CU APU)
- Bias defusion: run bias-only fused kernels through non-fused template + separate bias-add kernel (24% per-kernel speedup)
- Split-K for low-row-count MMVQ: split K-dimension across multiple waves with atomicAdd when waves_per_cu < 80 (improves LPDDR5X BW utilization from 76% to 89%)
- Q8_1 cache: avoid redundant activation quantization when the same src1 pointer is reused across fused MMVQ dispatches
- RMS norm + quantize fusion: fused kernel combining RMS normalization, weight multiplication, and Q8_1 quantization in a single pass
- Flash attention tile configs: add RDNA 3.5 specific tile configurations for head dimensions 72 and 80

Benchmarked on Radeon 8060S (gfx1151, 40 CUs, LPDDR5X):
Qwen3.5-9B Q4_0, 128 token decode: 38.1 -> 39.0 t/s (+2.3%)

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

jeffli-xilinx force-pushed the rdna3_5-mmvq-optimizations branch from bc52096 to 7a55d72 Compare June 26, 2026 21:23

jeffli-xilinx marked this pull request as draft June 26, 2026 21:57

jeffli-xilinx wants to merge 1 commit into ROCm:gfx11 from jeffli-xilinx:rdna3_5-mmvq-optimizations
