Gfx11 by liangliangchang · Pull Request #29 · ROCm/llama.cpp

liangliangchang · 2026-06-26T18:04:55Z

Overview

Additional information

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure:

Instrumented CUDA kernels to log actual M×N×K dimensions to stderr for performance analysis and TFLOPS/bandwidth calculations. Changes: - quantize.cu: Log M, K, grid for quantize_mmq_q8_1 kernel - mmq.cu: Log M, N, K, ne dimensions for mul_mat_q (GEMM) kernel - mmvq.cu: Log M, K dimensions for mul_mat_vec_q (GEMV) kernel Usage: llama-bench ... 2>&1 | tee profile.log The logged dimensions are matched with rocprofv3 timing data to calculate: - TFLOPS: From actual M×N×K dimensions (2×M×N×K FLOPs) - Bandwidth: From actual data transfer patterns This instrumentation enabled accurate performance analysis showing: - GEMM kernels: 8.83-43.58 TFLOPS (varies by quantization type) - Quantize kernels: 284 GB/s bandwidth (data in L2 cache) - GEMV kernels: 0.05-5.52 TFLOPS (memory-bound decode) Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>

Log GEMM as M=ne11, N=ne01, K=ne00 and GEMV as M=ne01, K=ne00 so instrumented shapes match actual matmul dimensions. Co-authored-by: Cursor <cursoragent@cursor.com>

On gfx115x, mmq_x=128 uses more than half of per-CU LDS and caps occupancy at one workgroup per CU. Select a smaller mmq_x that fits two workgroups without doubling M-tiles, improving P0 prefill TFLOPS ~25% on gfx1151. Co-authored-by: Cursor <cursoragent@cursor.com>

…ning. Restore WMMA granularity at mmq_x>=64, widen tile_y padding to cut LDS bank conflicts, and software-pipeline Q8_1 activation loads into registers during vec_dot to hide global memory latency on gfx115x. Co-authored-by: Cursor <cursoragent@cursor.com>

Enables MMQ_PROFILE_PHASES=1 to report load_tiles, y_act, and vec_dot cycle shares per launch, guiding RDNA3.5 prefill optimization work. Co-authored-by: Cursor <cursoragent@cursor.com>

…head. Add gfx115x load_tiles_q4_K_dm_rdna35 and vec_dot_q4_K_q8_1_mma with per-k dmA hoisting; restore pre-profiling tile loop when MMQ_PROFILE_PHASES is off so extra __syncthreads do not regress P0 TFLOPS. Co-authored-by: Cursor <cursoragent@cursor.com>

Split load_tiles_q4_K qs staging into load_tiles_q4_K_qs_wmma_rdna35 (one warp/row, coalesced qs global load) and mmq_q4_K_qs_lds_k helper that documents the pseudo-q8_1 LDS offsets used by load_ldmatrix. Perf (gfx1151, P0 128×12288×4096): neutral vs prior commit — 26.56 vs 26.51 TFLOPS, 485 vs 486 µs, pp128 ~1135 t/s (within run-to-run noise). Layout math is unchanged; this is groundwork for a Q4_K-native MMA tile. Co-authored-by: Cursor <cursoragent@cursor.com>

fprintf for MUL_MAT_Q, MUL_MAT_VEC_Q, QUANTIZE_MMQ_Q8_1, and MUL_MAT_Q_LAUNCH are off by default to avoid CPU overhead in llama-bench TTFT runs; set GGML_CUDA_MM_LOG=1 for profiling scripts. Co-authored-by: Cursor <cursoragent@cursor.com>

Hoist dmA*dsB into per-lane scl[] before mma on RDNA3.5 vec_dot_q4_K_q8_1_mma to separate scale multiply from C accumulation (ATT showed v_fma_mix hotspots). Prefill P0 Q4_K_M 128x12288x4096 (3x profile_mmq_prefill): median 26.59 TFLOPS / 484.7 us vs 26.56 / 485 us baseline; VGPR 224 (+8), LDS conflict ratio 0.348. Assisted-by: Auto Co-authored-by: Cursor <cursoragent@cursor.com>

clock64() can move backward across __syncthreads, which inflated phase cycle counts and produced garbage load_tiles percentages; discard small backward glitches and skip corrupt host-side totals. Co-authored-by: Cursor <cursoragent@cursor.com>

Use mmq_x=128 for small nrows_x grids instead of dual-WG LDS downsizing, and hoist dA in Q8_0/Q4_0 vec_dot on RDNA3.5 WMMA. Gate 128×32×4096 drops ~209µs to ~105µs with no P0 Q4_K regression. Co-authored-by: Cursor <cursoragent@cursor.com>

Precompute sclA = sc[k01/4]*d per k-slice in vec_dot_q6_K_q8_1_mma instead of reloading scales inside the j-loop. Q6_K FFN down 128×4096×12288 improves ~13.4 to ~15.2 TFLOPS with no P0 regression. Co-authored-by: Cursor <cursoragent@cursor.com>

Prefer ntx=1 tile width when batch ≤128 and LDS allows, improving Q6_K FFN down (~40% faster vs mmq_x=64 on gfx1151). Co-authored-by: Cursor <cursoragent@cursor.com>

liangliangchang and others added 13 commits June 23, 2026 13:12

Fix M/N/K dimension logging for MMQ and MMVQ profiling.

3c63ba2

Log GEMM as M=ne11, N=ne01, K=ne00 and GEMV as M=ne01, K=ne00 so instrumented shapes match actual matmul dimensions. Co-authored-by: Cursor <cursoragent@cursor.com>

Add opt-in HIP MMQ phase profiling to split dequant vs MMA time.

7248ec9

Enables MMQ_PROFILE_PHASES=1 to report load_tiles, y_act, and vec_dot cycle shares per launch, guiding RDNA3.5 prefill optimization work. Co-authored-by: Cursor <cursoragent@cursor.com>

cuda: use mmq_x=128 for Q6_K narrow-N prefill on RDNA3.5

4d9b940

Prefer ntx=1 tile width when batch ≤128 and LDS allows, improving Q6_K FFN down (~40% faster vs mmq_x=64 on gfx1151). Co-authored-by: Cursor <cursoragent@cursor.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Gfx11#29

Gfx11#29
liangliangchang wants to merge 13 commits into
ROCm:gfx11from
liangliangchang:gfx11

liangliangchang commented Jun 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

liangliangchang commented Jun 26, 2026

Overview

Additional information

Requirements

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant