HRX CI: run on self-hosted ROCm GPU runners for gfx1151/gfx1201 #22

Merged
AaronStGeorge merged 14 commits into ROCm:hrx-integration from AaronStGeorge:p066-multi-arch-ci
Jun 15, 2026

AaronStGeorge commented Jun 10, 2026

Collaborator

Replaces the ubuntu-latest CPU job in build-hrx.yml with a matrix on the self-hosted ROCm runners (linux-gfx1151-gpu-rocm, linux-gfx120X-gpu-rocm), building for the runner's gfx target and running the sample MUL_MAT correctness/perf configs on the HRX0 backend. Per-arch rocwmma dev artifacts are fetched (rocwmma_dev_gfx1151, rocwmma_dev_gfx120X-all; both verified present in pinned TheRock run 25753625030).

🤖 Generated with Claude Code

AaronStGeorge force-pushed the p066-multi-arch-ci branch 4 times, most recently from 019c88d to 4f2127c on June 10, 2026 21:01
Replace the ubuntu-latest CPU job with a matrix on
linux-gfx1151-gpu-rocm and linux-gfx120X-gpu-rocm, building for the
runner's gfx target and running the sample MUL_MAT correctness/perf
configs on the HRX0 backend. Fetch per-arch rocwmma dev artifacts.

BENCH_REF temporarily points at the p066-multi-arch-ci branch of the
bench tooling; restore to main when this merges.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
AaronStGeorge and others added 5 commits June 11, 2026 14:14
Run the job container with --user 0:0 so checkout can write the
runner's _work tree on runners whose agent owns it as root (the plain
shark65 runner); the image's default tester user (uid 1001) only works
on the prolense fleet where _work is 1001-owned. Add --device kfd/dri
for GPU access on runners that don't inject devices.

Point HRX_WORK_DIR and LLAMA_BUILD_DIR at a container-local /work path so
ROCm downloads and build trees are reclaimed on container removal instead
of accumulating in the persistent runner's bind-mounted workspace.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The bench tooling moved runtime env setup (LD_LIBRARY_PATH incl. the HRX
install lib, and HSA_FORCE_FINE_GRAIN_PCIE=1 for gfx12) out of
fetch-rocm-assets.sh into runtime-env.sh. Build/validate scripts source
it internally; the op-test and op-perf steps invoke the python tools
directly, so source runtime-env.sh there too. Without it those steps
would now run with neither the HRX libs on LD_LIBRARY_PATH nor the gfx12
fine-grained-memory flag, which is what produced the MUL_MAT NaNs.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
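
As a rough illustration of what the op-test and op-perf steps lose when runtime-env.sh is not sourced, the Python sketch below builds the same two settings for a child process. The function name `runtime_env`, the `/work/hrx/lib` path, and the `gfx12` prefix test are assumptions made for this example; the actual contents of runtime-env.sh are not shown in this PR.

```python
import os


def runtime_env(base_env, hrx_lib_dir, gfx_target):
    """Return a copy of base_env with the HRX runtime settings applied.

    Hypothetical stand-in for runtime-env.sh: prepends the HRX install lib
    directory to LD_LIBRARY_PATH and sets HSA_FORCE_FINE_GRAIN_PCIE=1 on
    gfx12 targets only.
    """
    env = dict(base_env)
    existing = env.get("LD_LIBRARY_PATH", "")
    # Prepend so the HRX libraries win over any system copies.
    env["LD_LIBRARY_PATH"] = (
        hrx_lib_dir + os.pathsep + existing if existing else hrx_lib_dir
    )
    if gfx_target.startswith("gfx12"):
        env["HSA_FORCE_FINE_GRAIN_PCIE"] = "1"
    return env


# Illustrative path only; the real install prefix is set by the workflow.
env = runtime_env({}, "/work/hrx/lib", "gfx1201")
print(env["LD_LIBRARY_PATH"], env.get("HSA_FORCE_FINE_GRAIN_PCIE"))
```

A dictionary built this way could be passed as `env=` to `subprocess.run` when launching the Python tools, which mirrors sourcing the script in the shell step.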
AaronStGeorge force-pushed the p066-multi-arch-ci branch 2 times, most recently from fa594ad to 5393cbb on June 12, 2026 14:10
Upstream HRX renamed hrx_executable_export_info_t.constant_count ->
constant_byte_length (now the byte length of the constants buffer, not a uint32
count). Update the catalog-kernel ABI check accordingly:
  constant_count * sizeof(uint32_t) == constants_size
becomes
  constant_byte_length == constants_size
and fix the corresponding diagnostic log. Needed to build ggml-hrx against the
bumped HRX (ROCm/hrx upstream main); verified building against e8275fbb5.
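
The effect of the rename on the ABI check can be modeled in a few lines of Python. This is only a sketch of the arithmetic, not the ggml-hrx source: the function names are hypothetical, and `UINT32_SIZE` stands in for `sizeof(uint32_t)`.

```python
UINT32_SIZE = 4  # stands in for sizeof(uint32_t) in the C check


def constants_match_old(constant_count, constants_size):
    """Pre-rename check: the field counted uint32 constants."""
    return constant_count * UINT32_SIZE == constants_size


def constants_match_new(constant_byte_length, constants_size):
    """Post-rename check: the field is already a byte length."""
    return constant_byte_length == constants_size


# A kernel with 3 uint32 constants has a 12-byte constants buffer.
assert constants_match_old(3, 12)
assert constants_match_new(12, 12)
# Feeding the new byte length into the old formula would wrongly reject it,
# which is why the check had to change along with the field.
assert not constants_match_old(12, 12)
```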

The fork's hrx-integration branch should run only the HRX CI, not the upstream
llama.cpp matrix.
- build-hrx.yml: add a push trigger for hrx-integration (was pull_request-only).
- Every other workflow whose push/pull_request/pull_request_target could match
  hrx-integration now carries branches-ignore: ['hrx-integration'] (build.yml,
  build-vulkan/cann/apple/android/riscv/self-hosted, server*, check-vendor,
  hip-quality-check, python-*, labeler, copilot-setup-steps, etc.).
- gguf-publish is tag-only (can't fire on a branch push) and the rest were
  already scoped to master.

Verified: only build-hrx.yml fires on hrx-integration for push and PR events.
Caveat: these touch upstream workflow files, so upstream merges into
hrx-integration may conflict on them (build-hrx.yml itself does not).

- Debug (runner identity): first step, if: always(), logs RUNNER_NAME, hostname,
  uname, /dev/kfd|dri, lspci, RUNNER_*/GITHUB_* env -- so the specific
  self-hosted runner behind a failure is identifiable from the log alone.
- GPU health check: after 'Fetch ROCm assets' (prefix binaries present), runs
  the bench runner-health.sh (rocminfo/amd-smi/rocm-smi diagnostics + fail-fast
  on /dev/kfd missing, dead rocminfo, or expected gfx arch not enumerated).

Caveat: neither detects the gfx1151 coherence NaN flake (GPU reports healthy);
the debug block enables post-hoc identification. runner-health.sh must be on the
bench at BENCH_REF (AaronStGeorge/llamacpp_ci) for the health step to resolve.
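
The fail-fast conditions described for the health step can be sketched in Python as follows. This is not the content of runner-health.sh, which lives in the bench repo and is not shown here; the function names are hypothetical, and the sketch assumes the architecture shows up as a `gfxNNNN` token somewhere in rocminfo's text output.

```python
import os
import re


def enumerated_gfx_archs(rocminfo_output):
    """Collect every gfxNNNN token found in rocminfo's text output."""
    return set(re.findall(r"\bgfx[0-9a-f]+\b", rocminfo_output))


def health_problems(rocminfo_output, expected_arch,
                    kfd_path="/dev/kfd", path_exists=os.path.exists):
    """Return a list of fail-fast reasons; an empty list means healthy.

    rocminfo_output is None when rocminfo itself failed to run.
    path_exists is injectable so the check can be exercised off-runner.
    """
    problems = []
    if not path_exists(kfd_path):
        problems.append(f"{kfd_path} missing")
    if rocminfo_output is None:
        problems.append("rocminfo did not run")
    elif expected_arch not in enumerated_gfx_archs(rocminfo_output):
        problems.append(f"{expected_arch} not enumerated")
    return problems


# Illustrative input only, not captured from a real runner.
sample = "  Name:                    gfx1151\n"
print(health_problems(sample, "gfx1151", path_exists=lambda p: True))
```

As the caveat above notes, a check of this kind only covers device presence and enumeration, so it would report a runner with the NaN flake as healthy.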

Run the failing mul_mat_f16 config N times (MULMAT_LOOP_N, default 50) in one
job and print the fail/pass count + RUNNER_NAME, to answer: on a bad machine
does it fail every iteration or only some, and what's the rate per machine?
Advisory (continue-on-error) and placed BEFORE the gating correctness step so it
still runs on a failing machine. Correlate the count with RUNNER_NAME from the
Debug step.
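
A minimal Python sketch of such a loop is shown below. The names `run_flake_loop` and `summarize` and the output format are hypothetical; `run_once` is a placeholder for whatever actually invokes the mul_mat_f16 config and reports pass or fail.

```python
from typing import Callable, Tuple


def run_flake_loop(run_once: Callable[[], bool], n: int) -> Tuple[int, int]:
    """Call run_once n times and return (passes, failures)."""
    passes = failures = 0
    for _ in range(n):
        if run_once():
            passes += 1
        else:
            failures += 1
    return passes, failures


def summarize(passes: int, failures: int, runner_name: str) -> str:
    """Format the counts alongside the runner name for later correlation."""
    total = passes + failures
    rate = 100.0 * failures / total if total else 0.0
    return f"runner={runner_name} fail={failures} pass={passes} rate={rate:.1f}%"


# Stand-in test that alternates pass/fail, in place of the real config.
outcomes = iter([True, False] * 25)
passes, failures = run_flake_loop(lambda: next(outcomes), 50)
print(summarize(passes, failures, "example-runner"))
```

Counting every iteration rather than stopping at the first failure is what distinguishes a machine that fails every time from one that fails intermittently.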
AaronStGeorge force-pushed the p066-multi-arch-ci branch 4 times, most recently from 36619c6 to a31d889 on June 13, 2026 20:31
Replace the inline "Debug (runner identity)" block with a call to
runner-identity.sh (in the bench repo, alongside runner-health.sh). The step
now runs right after the bench checkout (earliest the script is available)
instead of first; matrix values are passed via MATRIX_* env since a script
can't read ${{ matrix.* }}. if: always() retained.

Requires runner-identity.sh on the bench BENCH_REF branch (committed on
AaronStGeorge/llamacpp_ci runner-health).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
AaronStGeorge force-pushed the p066-multi-arch-ci branch 4 times, most recently from ea36d86 to 53cd7e0 on June 15, 2026 06:08

AaronStGeorge commented Jun 15, 2026

Collaborator Author

This basic CI seems to be "good enough" for the first iteration.

There have been several issues with the runners:

  • The gfx1201 (9070) runners regularly fail when cleaning up old containers and have also frozen during various steps.
  • The gfx1151 (Strix Halo) runners seem to be much more stable, but the linux-strix-halo-gpu-rocm-8-gpu0 runner seems to have some sort of hardware fault. On this runner, and only on this runner, the input buffer contains NaNs after being read back from the GPU. Between running locally on a gfx1151 machine and running experiments across runners, each looping 50 times over the same simple matmul test, I've only seen the NaN error on rocm-8, where it is very reproducible.

| machine | runs landed | runs with any NaN | NaN iterations | rate |
|---|---|---|---|---|
| rocm-8 | 5 | 5 / 5 | 125 / 250 | 50.0 % |
| rocm-6 | 7 | 0 | 0 / 350 | clean |
| rocm-7 | 3 | 0 | 0 / 150 | clean |
| rocm-9 | 3 | 0 | 0 / 150 | clean |
| rocm-3 | 2 | 0 | 0 / 100 | clean |
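
The iteration totals and the rate in the table can be reproduced with a short calculation. The sketch assumes each landed run contributes 50 loop iterations (the MULMAT_LOOP_N default), which matches every denominator in the table; the variable and function names are illustrative.

```python
LOOP_N = 50  # assumed iterations per landed run (MULMAT_LOOP_N default)

# (machine, runs landed, NaN iterations) as reported in the table above
rows = [
    ("rocm-8", 5, 125),
    ("rocm-6", 7, 0),
    ("rocm-7", 3, 0),
    ("rocm-9", 3, 0),
    ("rocm-3", 2, 0),
]


def nan_rate(runs, nan_iters, loop_n=LOOP_N):
    """Return (NaN iterations, total iterations, NaN percentage)."""
    total = runs * loop_n
    return nan_iters, total, 100.0 * nan_iters / total


for machine, runs, nan_iters in rows:
    nan, total, pct = nan_rate(runs, nan_iters)
    print(f"{machine}: {nan} / {total} = {pct:.1f} %")
```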

AaronStGeorge merged commit 0668b9e into ROCm:hrx-integration on Jun 15, 2026
1 of 2 checks passed