Run example 07 GEMM all-scatter benchmark across 10 shapes on MI300X #404

Draft
Copilot wants to merge 1 commit into JoseSantosAMD/copilot_enabled_runner from copilot/sub-pr-399-yet-again

Copilot AI commented Feb 26, 2026

Benchmark results for `examples/07_gemm_all_scatter` across 10 GEMM shapes, run on 2x AMD Instinct MI300X (gfx942, 304 CUs/GPU) in fp16 with BLK_M=256, BLK_N=64, BLK_K=64, and 2 ranks.

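| # | M | N | K | Time (ms) | TFLOPs |
|---|---|---|---|-----------|--------|
| 1 | 1024 | 2048 | 2048 | 0.249 | 34.5 |
| 2 | 2048 | 2048 | 2048 | 0.300 | 57.3 |
| 3 | 4096 | 2048 | 4096 | 0.469 | 146.4 |
| 4 | 4096 | 4096 | 4096 | 0.670 | 205.0 |
| 5 | 4096 | 8192 | 8192 | 1.565 | 351.3 |
| 6 | 8192 | 4608 | 36864 | 4.745 | 586.6 |
| 7 | 8192 | 8192 | 8192 | 2.605 | 422.0 |
| 8 | 16384 | 4096 | 4096 | 2.385 | 230.5 |
| 9 | 4096 | 16384 | 4096 | 2.726 | 201.7 |
| 10 | 8192 | 16384 | 8192 | 7.510 | 292.8 |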
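
The TFLOPs column is consistent with the usual GEMM convention of 2·M·N·K floating-point operations divided by the measured time. The sketch below assumes that convention; the benchmark script itself is not shown in this thread, and `gemm_tflops` is a hypothetical helper name, not a function from the example. Because the reported times are rounded to three decimals, recomputed values can differ from the table in the last digit.

```python
def gemm_tflops(m: int, n: int, k: int, time_ms: float) -> float:
    """Throughput in TFLOPs, assuming 2*M*N*K floating-point operations per GEMM."""
    flops = 2 * m * n * k          # one multiply and one add per inner-product term
    seconds = time_ms * 1e-3       # the table reports time in milliseconds
    return flops / seconds / 1e12


# (M, N, K, time in ms, reported TFLOPs), copied from the table above.
ROWS = [
    (1024, 2048, 2048, 0.249, 34.5),
    (2048, 2048, 2048, 0.300, 57.3),
    (4096, 2048, 4096, 0.469, 146.4),
    (4096, 4096, 4096, 0.670, 205.0),
    (4096, 8192, 8192, 1.565, 351.3),
    (8192, 4608, 36864, 4.745, 586.6),
    (8192, 8192, 8192, 2.605, 422.0),
    (16384, 4096, 4096, 2.385, 230.5),
    (4096, 16384, 4096, 2.726, 201.7),
    (8192, 16384, 8192, 7.510, 292.8),
]

for m, n, k, time_ms, reported in ROWS:
    recomputed = gemm_tflops(m, n, k, time_ms)
    print(f"{m:>6} {n:>6} {k:>6}  reported={reported:7.1f}  recomputed={recomputed:7.1f}")
```

Whether the reported figure is aggregated over both ranks or measured per GPU is not stated in the comment, so the helper makes no assumption about it.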

The peak is 586.6 TFLOPs, on the LLM-representative shape (M=8192, N=4608, K=36864). The largest square shape (M=N=K=8192) reaches 422.0 TFLOPs. Throughput drops for shapes with a strong M:N imbalance: the 4:1 and 1:4 shapes (rows 8 and 9) reach only 230.5 and 201.7 TFLOPs.


Copilot AI changed the title from "[WIP] Add hardware-enabled copilot runner" to "Run example 07 GEMM all-scatter benchmark across 10 shapes on MI300X" on Feb 26, 2026