Skip to content

RDNA 3.5 MMVQ/fattn optimizations for gfx1151#31

Draft
jeffli-xilinx wants to merge 1 commit into
ROCm:gfx11from
jeffli-xilinx:rdna3_5-mmvq-optimizations
Draft

RDNA 3.5 MMVQ/fattn optimizations for gfx1151#31
jeffli-xilinx wants to merge 1 commit into
ROCm:gfx11from
jeffli-xilinx:rdna3_5-mmvq-optimizations

Conversation

@jeffli-xilinx

@jeffli-xilinx jeffli-xilinx commented Jun 26, 2026

Copy link
Copy Markdown

Summary

  • Add RDNA 3.5 (gfx1151) specific MMVQ parameter table with nwarps=1 for optimal wave32 scheduling on 40-CU APUs
  • Bias defusion: run bias-only fused kernels through non-fused template + separate bias-add kernel (24% per-kernel speedup)
  • Split-K for low-row-count MMVQ: split K-dimension across multiple waves with atomicAdd when waves_per_cu < 80 (improves LPDDR5X BW utilization from 76% to 89%)
  • Q8_1 activation cache: skip redundant quantization when the same src1 pointer is reused across fused MMVQ dispatches
  • RMS norm + mul + MUL_MAT graph fusion: fold rms_norm + weight multiply + Q8_1 quantization into a single fused MMVQ dispatch, eliminating 3 separate kernel launches per fusion site
  • Fused RMS norm + quantize kernel: single-pass kernel combining RMS normalization, weight multiplication, and Q8_1 quantization
  • Flash attention tile configs for RDNA 3.5 head dimensions 72 and 80

Benchmark

Device: Radeon 8060S (gfx1151, 40 CUs, LPDDR5X), ROCm nightly 7.13

Decode (tg128, single token generation)

Model Size Baseline (t/s) Patched (t/s) Change
Qwen2.5-0.5B Q4_K_M 374 MiB 245.15 297.93 +21.5%
Qwen3-1.7B Q4_K_M 1.03 GiB 138.61 134.00 -3.3%
SmolLM2-1.7B Q4_K_M 1005 MiB 142.36 136.48 -4.1%
Qwen2.5-3B Q4_K_M 1.79 GiB 86.40 85.39 -1.2%
Gemma-4-E2B Q4_K_M 2.88 GiB 96.42 96.51 +0.1%
Qwen3-4B Q4_K_M 2.32 GiB 70.61 68.97 -2.3%
Qwen2.5-7B Q4_K_M 4.36 GiB 45.85 44.99 -1.9%
Llama-2-7B Q4_K_M 3.80 GiB 46.56 46.10 -1.0%
Qwen3-8B Q4_K_M 4.68 GiB 41.43 40.69 -1.8%
Llama-3.1-8B Q4_K_M 4.58 GiB 42.72 41.80 -2.2%
Qwen3.5-9B Q4_0 5.00 GiB 38.10 38.91 +2.1%

Prefill (prompt processing)

Model pp128 base pp128 patched pp512 patched pp2048 patched pp128 change
Qwen2.5-7B Q4_K_M 1162 1290 1614 1606 +11.0%
Llama-3.1-8B Q4_K_M 1204 1224 1361 1325 +1.6%
Qwen3-8B Q4_K_M 994 1199 1328 1271 +20.6%
Llama-2-7B Q4_K_M 1227 1292 1477 1374 +5.3%

Notes

  • Qwen2.5-0.5B shows large decode improvement (+21.5%) due to very low row counts benefiting from split-K
  • Qwen3.5-9B Q4_0 (DeltaNet architecture) shows +2.1% decode from bias defusion + split-K
  • Prefill improvements across all tested models (+1.6% to +20.6% at pp128) from MMQ parameter tuning
  • Some models show small decode regressions (1-4%) — needs investigation, may be related to q8_1 cache or RMS norm fusion overhead for models where those fusions don't trigger beneficially

Test plan

  • Build with ROCm targeting gfx1151
  • Verify correctness: outputs match non-optimized path
  • Benchmark decode throughput on RDNA 3.5 APU
  • Verify no regression on other AMD GPU targets (RDNA2, RDNA3, RDNA4)
  • Investigate small decode regressions on 1-4B models

🤖 Generated with Claude Code

Add RDNA 3.5 (gfx1151) specific parameter tuning and kernel optimizations
for decode throughput on LPDDR5X-based APUs (Radeon 8060S).

Key changes:
- MMVQ parameter table: add MMVQ_PARAMETERS_RDNA3_5 with nwarps=1
  (single wave per block, optimal for wave32 on 40 CU APU)
- Bias defusion: run bias-only fused kernels through non-fused template
  + separate bias-add kernel (24% per-kernel speedup)
- Split-K for low-row-count MMVQ: split K-dimension across multiple
  waves with atomicAdd when waves_per_cu < 80 (improves LPDDR5X BW
  utilization from 76% to 89%)
- Q8_1 cache: avoid redundant activation quantization when the same
  src1 pointer is reused across fused MMVQ dispatches
- RMS norm + quantize fusion: fused kernel combining RMS normalization,
  weight multiplication, and Q8_1 quantization in a single pass
- Flash attention tile configs: add RDNA 3.5 specific tile
  configurations for head dimensions 72 and 80

Benchmarked on Radeon 8060S (gfx1151, 40 CUs, LPDDR5X):
Qwen3.5-9B Q4_0, 128 token decode: 38.1 -> 39.0 t/s (+2.3%)

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
@jeffli-xilinx jeffli-xilinx force-pushed the rdna3_5-mmvq-optimizations branch from bc52096 to 7a55d72 Compare June 26, 2026 21:23
@jeffli-xilinx jeffli-xilinx marked this pull request as draft June 26, 2026 21:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant