Skip to content

Gfx11#29

Draft
liangliangchang wants to merge 13 commits into
ROCm:gfx11from
liangliangchang:gfx11
Draft

Gfx11#29
liangliangchang wants to merge 13 commits into
ROCm:gfx11from
liangliangchang:gfx11

Conversation

@liangliangchang

Copy link
Copy Markdown

Overview

Additional information

Requirements

liangliangchang and others added 13 commits June 23, 2026 13:12
Instrumented CUDA kernels to log actual M×N×K dimensions to stderr for
performance analysis and TFLOPS/bandwidth calculations.

Changes:
- quantize.cu: Log M, K, grid for quantize_mmq_q8_1 kernel
- mmq.cu: Log M, N, K, ne dimensions for mul_mat_q (GEMM) kernel
- mmvq.cu: Log M, K dimensions for mul_mat_vec_q (GEMV) kernel

Usage:
  llama-bench ... 2>&1 | tee profile.log

The logged dimensions are matched with rocprofv3 timing data to calculate:
- TFLOPS: From actual M×N×K dimensions (2×M×N×K FLOPs)
- Bandwidth: From actual data transfer patterns

This instrumentation enabled accurate performance analysis showing:
- GEMM kernels: 8.83-43.58 TFLOPS (varies by quantization type)
- Quantize kernels: 284 GB/s bandwidth (data in L2 cache)
- GEMV kernels: 0.05-5.52 TFLOPS (memory-bound decode)

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
Log GEMM as M=ne11, N=ne01, K=ne00 and GEMV as M=ne01, K=ne00 so
instrumented shapes match actual matmul dimensions.

Co-authored-by: Cursor <cursoragent@cursor.com>
On gfx115x, mmq_x=128 uses more than half of per-CU LDS and caps occupancy at
one workgroup per CU. Select a smaller mmq_x that fits two workgroups without
doubling M-tiles, improving P0 prefill TFLOPS ~25% on gfx1151.

Co-authored-by: Cursor <cursoragent@cursor.com>
…ning.

Restore WMMA granularity at mmq_x>=64, widen tile_y padding to cut LDS bank
conflicts, and software-pipeline Q8_1 activation loads into registers during
vec_dot to hide global memory latency on gfx115x.

Co-authored-by: Cursor <cursoragent@cursor.com>
Enables MMQ_PROFILE_PHASES=1 to report load_tiles, y_act, and vec_dot
cycle shares per launch, guiding RDNA3.5 prefill optimization work.

Co-authored-by: Cursor <cursoragent@cursor.com>
…head.

Add gfx115x load_tiles_q4_K_dm_rdna35 and vec_dot_q4_K_q8_1_mma with per-k
dmA hoisting; restore pre-profiling tile loop when MMQ_PROFILE_PHASES is off
so extra __syncthreads do not regress P0 TFLOPS.

Co-authored-by: Cursor <cursoragent@cursor.com>
Split load_tiles_q4_K qs staging into load_tiles_q4_K_qs_wmma_rdna35
(one warp/row, coalesced qs global load) and mmq_q4_K_qs_lds_k helper
that documents the pseudo-q8_1 LDS offsets used by load_ldmatrix.

Perf (gfx1151, P0 128×12288×4096): neutral vs prior commit — 26.56 vs
26.51 TFLOPS, 485 vs 486 µs, pp128 ~1135 t/s (within run-to-run noise).
Layout math is unchanged; this is groundwork for a Q4_K-native MMA tile.

Co-authored-by: Cursor <cursoragent@cursor.com>
fprintf for MUL_MAT_Q, MUL_MAT_VEC_Q, QUANTIZE_MMQ_Q8_1, and
MUL_MAT_Q_LAUNCH are off by default to avoid CPU overhead in
llama-bench TTFT runs; set GGML_CUDA_MM_LOG=1 for profiling scripts.

Co-authored-by: Cursor <cursoragent@cursor.com>
Hoist dmA*dsB into per-lane scl[] before mma on RDNA3.5 vec_dot_q4_K_q8_1_mma
to separate scale multiply from C accumulation (ATT showed v_fma_mix hotspots).

Prefill P0 Q4_K_M 128x12288x4096 (3x profile_mmq_prefill): median 26.59 TFLOPS /
484.7 us vs 26.56 / 485 us baseline; VGPR 224 (+8), LDS conflict ratio 0.348.

Assisted-by: Auto
Co-authored-by: Cursor <cursoragent@cursor.com>
clock64() can move backward across __syncthreads, which inflated phase
cycle counts and produced garbage load_tiles percentages; discard small
backward glitches and skip corrupt host-side totals.

Co-authored-by: Cursor <cursoragent@cursor.com>
Use mmq_x=128 for small nrows_x grids instead of dual-WG LDS downsizing,
and hoist dA in Q8_0/Q4_0 vec_dot on RDNA3.5 WMMA. Gate 128×32×4096
drops ~209µs to ~105µs with no P0 Q4_K regression.

Co-authored-by: Cursor <cursoragent@cursor.com>
Precompute sclA = sc[k01/4]*d per k-slice in vec_dot_q6_K_q8_1_mma
instead of reloading scales inside the j-loop. Q6_K FFN down
128×4096×12288 improves ~13.4 to ~15.2 TFLOPS with no P0 regression.

Co-authored-by: Cursor <cursoragent@cursor.com>
Prefer ntx=1 tile width when batch ≤128 and LDS allows, improving
Q6_K FFN down (~40% faster vs mmq_x=64 on gfx1151).

Co-authored-by: Cursor <cursoragent@cursor.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant