HRX CI: run on self-hosted ROCm GPU runners for gfx1151/gfx1201 by AaronStGeorge · Pull Request #22 · ROCm/llama.cpp

AaronStGeorge · 2026-06-10T02:16:54Z

Replaces the ubuntu-latest CPU job in build-hrx.yml with a matrix on the self-hosted ROCm runners (linux-gfx1151-gpu-rocm, linux-gfx120X-gpu-rocm), building for the runner's gfx target and running the sample MUL_MAT correctness/perf configs on the HRX0 backend. Per-arch rocwmma dev artifacts are fetched (rocwmma_dev_gfx1151, rocwmma_dev_gfx120X-all; both verified present in pinned TheRock run 25753625030).

🤖 Generated with Claude Code

Replace the ubuntu-latest CPU job with a matrix on linux-gfx1151-gpu-rocm and linux-gfx120X-gpu-rocm, building for the runner's gfx target and running the sample MUL_MAT correctness/perf configs on the HRX0 backend. Fetch per-arch rocwmma dev artifacts. BENCH_REF temporarily points at the p066-multi-arch-ci branch of the bench tooling; restore to main when this merges.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Run the job container with --user 0:0 so checkout can write the runner's _work tree on runners whose agent owns it as root (the plain shark65 runner); the image's default tester user (uid 1001) only works on the prolense fleet where _work is 1001-owned. Add --device kfd/dri for GPU access on runners that don't inject devices. Point HRX_WORK_DIR and LLAMA_BUILD_DIR at a container-local /work path so ROCm downloads and build trees are reclaimed on container removal instead of accumulating in the persistent runner's bind-mounted workspace.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The bench tooling moved runtime env setup (LD_LIBRARY_PATH including the HRX install lib, and HSA_FORCE_FINE_GRAIN_PCIE=1 for gfx12) out of fetch-rocm-assets.sh into runtime-env.sh. The build/validate scripts source it internally, but the op-test and op-perf steps invoke the Python tools directly, so source runtime-env.sh there too. Without it, those steps would run with neither the HRX libs on LD_LIBRARY_PATH nor the gfx12 fine-grained-memory flag, which is what produced the MUL_MAT NaNs.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
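The environment that runtime-env.sh is described as providing can be sketched in Python as follows. This is only an illustration of the two settings named in the commit message: the function name, the `hrx_lib_dir` argument, and the `gfx12` prefix test are assumptions, and the authoritative logic lives in the bench repo's runtime-env.sh.

```python
import os


def runtime_env(base_env, hrx_lib_dir, gfx_target):
    """Return a copy of base_env with the runtime settings from the commit message.

    hrx_lib_dir and the "gfx12" prefix test are illustrative assumptions,
    not the actual contents of runtime-env.sh.
    """
    env = dict(base_env)

    # Put the HRX install lib directory first on LD_LIBRARY_PATH,
    # keeping whatever entries were already there.
    existing = env.get("LD_LIBRARY_PATH", "")
    parts = [hrx_lib_dir] + [p for p in existing.split(os.pathsep) if p]
    env["LD_LIBRARY_PATH"] = os.pathsep.join(parts)

    # The fine-grained PCIe flag is only set for gfx12 targets.
    if gfx_target.startswith("gfx12"):
        env["HSA_FORCE_FINE_GRAIN_PCIE"] = "1"
    return env
```

A step that launches the Python tools directly could pass such a dictionary as `env=` to `subprocess.run`, which is the Python analogue of sourcing the script before invoking the tool.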

Upstream HRX renamed hrx_executable_export_info_t.constant_count to constant_byte_length (it is now the byte length of the constants buffer, not a uint32 count). Update the catalog-kernel ABI check accordingly: constant_count * sizeof(uint32_t) == constants_size becomes constant_byte_length == constants_size. Also fix the corresponding diagnostic log. Needed to build ggml-hrx against the bumped HRX (ROCm/hrx upstream main); verified building against e8275fbb5.
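The arithmetic behind the changed check can be illustrated with a small Python sketch. The real check is C code in ggml-hrx; the function names below are hypothetical and only mirror the two expressions quoted in the commit message.

```python
UINT32_SIZE = 4  # sizeof(uint32_t) in bytes


def constants_match_old(constant_count, constants_size):
    """Pre-rename check: the export reported a count of uint32 constants."""
    return constant_count * UINT32_SIZE == constants_size


def constants_match_new(constant_byte_length, constants_size):
    """Post-rename check: the export reports the buffer length in bytes."""
    return constant_byte_length == constants_size
```

For a kernel with three uint32 constants, the old field would hold 3 and the new field 12, and both forms compare against a 12-byte constants buffer. Feeding the old count into the new comparison fails, which is why the check has to change together with the HRX bump.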

The fork's hrx-integration branch should run only the HRX CI, not the upstream llama.cpp matrix.

- build-hrx.yml: add a push trigger for hrx-integration (was pull_request-only).
- Every other workflow whose push/pull_request/pull_request_target could match hrx-integration now carries branches-ignore: ['hrx-integration'] (build.yml, build-vulkan/cann/apple/android/riscv/self-hosted, server*, check-vendor, hip-quality-check, python-*, labeler, copilot-setup-steps, etc.).
- gguf-publish is tag-only (can't fire on a branch push) and the rest were already scoped to master.

Verified: only build-hrx.yml fires on hrx-integration for push and PR events.

Caveat: these touch upstream workflow files, so upstream merges into hrx-integration may conflict on them (build-hrx.yml itself does not).

- Debug (runner identity): first step, if: always(), logs RUNNER_NAME, hostname, uname, /dev/kfd|dri, lspci, and the RUNNER_*/GITHUB_* env, so that the specific self-hosted runner behind a failure is identifiable from the log alone.
- GPU health check: after 'Fetch ROCm assets' (so the prefix binaries are present), runs the bench repo's runner-health.sh (rocminfo/amd-smi/rocm-smi diagnostics, plus fail-fast on /dev/kfd missing, dead rocminfo, or the expected gfx arch not being enumerated).

Caveat: neither step detects the gfx1151 coherence NaN flake (the GPU reports healthy); the debug block only enables post-hoc identification. runner-health.sh must exist in the bench repo at BENCH_REF (AaronStGeorge/llamacpp_ci) for the health step to resolve.
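The three fail-fast conditions listed for the health check can be sketched in Python as below. This is not the content of runner-health.sh: the function names, the substring match on the rocminfo output, and the 60-second timeout are assumptions made for illustration.

```python
import os
import shutil
import subprocess


def _run_rocminfo():
    """Return rocminfo's stdout, or None if it is missing or exits non-zero."""
    if shutil.which("rocminfo") is None:
        return None
    try:
        result = subprocess.run(
            ["rocminfo"], capture_output=True, text=True, timeout=60
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    return result.stdout if result.returncode == 0 else None


def check_gpu_health(expected_arch, kfd_path="/dev/kfd", run_rocminfo=None):
    """Return a list of problems; an empty list means all checks passed."""
    if run_rocminfo is None:
        run_rocminfo = _run_rocminfo

    # 1. The kernel driver node must exist before anything else is tried.
    if not os.path.exists(kfd_path):
        return [f"{kfd_path} missing"]

    # 2. rocminfo must run successfully.
    output = run_rocminfo()
    if output is None:
        return ["rocminfo failed"]

    # 3. The expected gfx arch must appear in the enumeration
    #    (a plain substring match, which is an assumption).
    if expected_arch not in output:
        return [f"{expected_arch} not enumerated"]
    return []
```

A CI step would exit non-zero when the returned list is non-empty. As the caveat above notes, a check of this shape passes on a runner whose GPU enumerates correctly but returns corrupted data.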

Run the failing mul_mat_f16 config N times (MULMAT_LOOP_N, default 50) in one job and print the pass/fail counts plus RUNNER_NAME, to answer two questions: on a bad machine, does it fail every iteration or only some, and what is the failure rate per machine? The step is advisory (continue-on-error) and placed BEFORE the gating correctness step so that it still runs on a failing machine. Correlate the count with RUNNER_NAME from the Debug step.
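The loop described above amounts to running one command repeatedly and tallying exit codes. A minimal Python sketch, assuming the test command signals failure through a non-zero exit status; the function names and the summary format are hypothetical, and the workflow's real step may be written differently.

```python
import subprocess
import sys


def run_loop(cmd, n):
    """Run cmd n times and return (passes, failures) based on exit codes."""
    passes = failures = 0
    for _ in range(n):
        result = subprocess.run(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        if result.returncode == 0:
            passes += 1
        else:
            failures += 1
    return passes, failures


def summarize(passes, failures, runner_name):
    """Format the counts together with the runner name for the job log."""
    total = passes + failures
    return f"{runner_name}: {failures}/{total} failed, {passes}/{total} passed"
```

In a job, `n` would come from `int(os.environ.get("MULMAT_LOOP_N", "50"))` and `runner_name` from `os.environ.get("RUNNER_NAME", "unknown")`. Because the loop never stops at the first failure, the counts distinguish a machine that fails every iteration from one that fails intermittently.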

Replace the inline "Debug (runner identity)" block with a call to runner-identity.sh (in the bench repo, alongside runner-health.sh). The step now runs right after the bench checkout (earliest the script is available) instead of first; matrix values are passed via MATRIX_* env since a script can't read ${{ matrix.* }}. if: always() retained. Requires runner-identity.sh on the bench BENCH_REF branch (committed on AaronStGeorge/llamacpp_ci runner-health).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

AaronStGeorge · 2026-06-15T14:06:00Z

This basic CI seems to be "good enough" for the first iteration.

There have been several issues with the runners:

- The gfx1201 (9070) runners regularly fail when cleaning up old containers and have also frozen during various steps.
- The gfx1151 (Strix Halo) runners seem to be much more stable, but the linux-strix-halo-gpu-rocm-8-gpu0 runner seems to have some sort of hardware fault. On this runner, and only on this runner, the input buffer contains NaNs after being read back from the GPU. I've run the test locally on a gfx1151 machine and run experiments across runners that loop 50 times over the same simple matmul test; I've only seen the NaN error on rocm-8, where it is very reproducible.

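| machine | runs landed | runs with any NaN | NaN iterations | rate |
| --- | --- | --- | --- | --- |
| rocm-8 | 5 | 5 / 5 | 125 / 250 | 50.0 % |
| rocm-6 | 7 | 0 | 0 / 350 | clean |
| rocm-7 | 3 | 0 | 0 / 150 | clean |
| rocm-9 | 3 | 0 | 0 / 150 | clean |
| rocm-3 | 2 | 0 | 0 / 100 | clean |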
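The denominators and the rate in the table follow from 50 loop iterations per landed run. A short Python sketch of that arithmetic, using only the numbers from the table; the variable and function names are illustrative.

```python
ITERATIONS_PER_RUN = 50  # loop count per run, as stated in the comment above

# (machine, runs landed, NaN iterations), copied from the table.
RESULTS = [
    ("rocm-8", 5, 125),
    ("rocm-6", 7, 0),
    ("rocm-7", 3, 0),
    ("rocm-9", 3, 0),
    ("rocm-3", 2, 0),
]


def nan_rate(runs, nan_iterations, per_run=ITERATIONS_PER_RUN):
    """Return (rate, total_iterations) for one machine."""
    total = runs * per_run
    return nan_iterations / total, total


for machine, runs, nans in RESULTS:
    rate, total = nan_rate(runs, nans)
    print(f"{machine}: {nans} / {total} iterations with NaN ({rate:.1%})")
```

For rocm-8 this gives 125 of 250 iterations, a rate of 0.5, matching the 50.0 % in the table; every other machine comes out at zero.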

AaronStGeorge force-pushed the p066-multi-arch-ci branch 4 times, most recently from 019c88d to 4f2127c Compare June 10, 2026 21:01

AaronStGeorge force-pushed the p066-multi-arch-ci branch from 4f2127c to 8012f41 Compare June 10, 2026 22:04

AaronStGeorge and others added 5 commits June 11, 2026 14:14

Options should already be set up in the container

d8af41d

debug

59a9acd

cleanup comments

28e5c78

AaronStGeorge force-pushed the p066-multi-arch-ci branch 2 times, most recently from fa594ad to 5393cbb Compare June 12, 2026 14:10

AaronStGeorge added 6 commits June 12, 2026 15:07

cleanup

94782ce

ignore

7a40d70

AaronStGeorge force-pushed the p066-multi-arch-ci branch 4 times, most recently from 36619c6 to a31d889 Compare June 13, 2026 20:31

AaronStGeorge force-pushed the p066-multi-arch-ci branch 4 times, most recently from ea36d86 to 53cd7e0 Compare June 15, 2026 06:08

cleanup

bdcd68d

AaronStGeorge force-pushed the p066-multi-arch-ci branch from 53cd7e0 to bdcd68d Compare June 15, 2026 06:14

AaronStGeorge merged commit 0668b9e into ROCm:hrx-integration Jun 15, 2026
1 of 2 checks passed

AaronStGeorge merged 14 commits into ROCm:hrx-integration from AaronStGeorge:p066-multi-arch-ci
