HRX CI: run on self-hosted ROCm GPU runners for gfx1151/gfx1201 #22
Merged
AaronStGeorge merged 14 commits on Jun 15, 2026
Force-pushed from 019c88d to 4f2127c.
Replace the ubuntu-latest CPU job with a matrix on linux-gfx1151-gpu-rocm and linux-gfx120X-gpu-rocm, building for the runner's gfx target and running the sample MUL_MAT correctness/perf configs on the HRX0 backend. Fetch per-arch rocwmma dev artifacts. BENCH_REF temporarily points at the p066-multi-arch-ci branch of the bench tooling; restore it to main when this merges.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
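A minimal sketch of how such a matrix could look in build-hrx.yml. The runner labels, the gfx1151/gfx1201 targets, the rocwmma artifact names and the BENCH_REF value come from this PR; the job id (hrx-gpu), the matrix keys (runner, gfx_target, rocwmma_artifact) and `fail-fast: false` are illustrative assumptions, not the PR's actual code.

```yaml
# Hypothetical excerpt of build-hrx.yml; job id and matrix key names are illustrative.
jobs:
  hrx-gpu:
    strategy:
      fail-fast: false  # assumption: let one arch finish even if the other fails
      matrix:
        include:
          - runner: linux-gfx1151-gpu-rocm
            gfx_target: gfx1151
            rocwmma_artifact: rocwmma_dev_gfx1151
          - runner: linux-gfx120X-gpu-rocm
            gfx_target: gfx1201
            rocwmma_artifact: rocwmma_dev_gfx120X-all
    # Each matrix leg is scheduled on a self-hosted runner carrying that label.
    runs-on: ${{ matrix.runner }}
    env:
      # Temporary: the bench tooling's PR branch; restore to main after merge.
      BENCH_REF: p066-multi-arch-ci
    steps:
      - uses: actions/checkout@v4
      # Remaining steps elided: fetch ROCm plus the per-arch rocwmma artifact,
      # build for the leg's gfx target, run the sample MUL_MAT configs on HRX0.
```

With an `include`-only matrix, each entry becomes exactly one job, so supporting another arch would mean adding one more entry with its runner label, target and artifact name.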
Force-pushed from 4f2127c to 8012f41.
Run the job container with --user 0:0 so checkout can write the runner's _work tree on runners whose agent owns it as root (the plain shark65 runner); the image's default tester user (uid 1001) only works on the prolense fleet, where _work is owned by uid 1001. Add --device /dev/kfd and --device /dev/dri for GPU access on runners that don't inject the devices. Point HRX_WORK_DIR and LLAMA_BUILD_DIR at a container-local /work path so ROCm downloads and build trees are reclaimed when the container is removed, instead of accumulating in the persistent runner's bind-mounted workspace.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
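A hedged sketch of the job-level container settings this commit describes. The `--user 0:0` and `--device` options follow the text above; the image name is a placeholder (the PR does not name it) and the /work subdirectories are hypothetical.

```yaml
# Hypothetical excerpt of the job definition; image and /work subpaths are placeholders.
container:
  image: example.invalid/hrx-ci:latest  # placeholder, not the PR's image
  options: >-
    --user 0:0
    --device /dev/kfd
    --device /dev/dri
env:
  # Container-local paths: deleted along with the container instead of
  # piling up in the persistent runner's bind-mounted _work tree.
  HRX_WORK_DIR: /work/hrx
  LLAMA_BUILD_DIR: /work/llama-build
```

The folded scalar (`>-`) joins the option lines with spaces into the single string GitHub passes to `docker create`. Only the workspace is bind-mounted from the host, which is why anything written under /work disappears when the container is removed.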
The bench tooling moved runtime env setup (LD_LIBRARY_PATH, including the HRX install lib, and HSA_FORCE_FINE_GRAIN_PCIE=1 for gfx12) out of fetch-rocm-assets.sh into runtime-env.sh. The build/validate scripts source it internally; the op-test and op-perf steps invoke the Python tools directly, so source runtime-env.sh there too. Without it, those steps would now run with neither the HRX libs on LD_LIBRARY_PATH nor the gfx12 fine-grained-memory flag, which is what produced the MUL_MAT NaNs.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
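One way an op-test step could source runtime-env.sh before calling the Python tool, sketched under assumptions: the bench checkout path (bench/), the op_test.py script and its `--backend` flag, and the `matrix.gfx_target` key are all hypothetical; runtime-env.sh is assumed to take no arguments and to export its variables. The gfx12 guard is an optional extra, not something the PR describes.

```yaml
# Hypothetical step excerpt; bench/ path, op_test.py and its flag are placeholders.
steps:
  - name: MUL_MAT correctness (op-test)
    shell: bash
    env:
      EXPECTED_GFX: ${{ matrix.gfx_target }}
    run: |
      # Same env the build/validate scripts set up internally: HRX libs on
      # LD_LIBRARY_PATH, plus HSA_FORCE_FINE_GRAIN_PCIE=1 on gfx12.
      source bench/runtime-env.sh
      # Optional guard (assumption): fail loudly instead of producing NaNs later.
      if [[ "$EXPECTED_GFX" == gfx12* && "${HSA_FORCE_FINE_GRAIN_PCIE:-}" != 1 ]]; then
        echo "::error::HSA_FORCE_FINE_GRAIN_PCIE not set for $EXPECTED_GFX"
        exit 1
      fi
      python3 bench/op_test.py --backend HRX0  # placeholder command
```

Each `run:` step starts a fresh shell, so the `source` has to be repeated in every step that invokes the tools directly (op-test and op-perf alike).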
Force-pushed from fa594ad to 5393cbb.
Upstream HRX renamed hrx_executable_export_info_t.constant_count -> constant_byte_length (now the byte length of the constants buffer, not a uint32 count). Update the catalog-kernel ABI check accordingly (constant_count * sizeof(uint32_t) == constants_size becomes constant_byte_length == constants_size) and fix the corresponding diagnostic log. Needed to build ggml-hrx against the bumped HRX (ROCm/hrx upstream main); verified by building against e8275fbb5.
The fork's hrx-integration branch should run only the HRX CI, not the upstream llama.cpp matrix.
- build-hrx.yml: add a push trigger for hrx-integration (it was pull_request-only).
- Every other workflow whose push/pull_request/pull_request_target triggers could match hrx-integration now carries branches-ignore: ['hrx-integration'] (build.yml, build-vulkan/cann/apple/android/riscv/self-hosted, server*, check-vendor, hip-quality-check, python-*, labeler, copilot-setup-steps, etc.).
- gguf-publish is tag-only (it can't fire on a branch push), and the rest were already scoped to master.
Verified: only build-hrx.yml fires on hrx-integration for push and PR events.
Caveat: these changes touch upstream workflow files, so upstream merges into hrx-integration may conflict on them (build-hrx.yml itself does not).
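A sketch of the two kinds of trigger change, shown as two YAML documents because they live in different files. The branch name comes from the PR; whether build-hrx.yml's existing pull_request trigger carries filters is not stated, so it is left bare here as an assumption.

```yaml
# build-hrx.yml (illustrative): push on hrx-integration added next to the PR trigger.
on:
  push:
    branches: [hrx-integration]
  pull_request:  # existing trigger; any filters it has are unchanged
---
# Another upstream workflow that could match hrx-integration, e.g. build.yml (illustrative).
on:
  push:
    branches-ignore: ['hrx-integration']
  pull_request:
    branches-ignore: ['hrx-integration']
```

For pull_request events the branch filter matches the PR's base branch, so this skips PRs targeting hrx-integration. GitHub does not allow `branches` and `branches-ignore` on the same event, which is consistent with leaving the workflows already scoped to master untouched.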
- Debug (runner identity): the first step, with if: always(), logs RUNNER_NAME, hostname, uname, /dev/kfd|dri, lspci, and the RUNNER_*/GITHUB_* env, so the specific self-hosted runner behind a failure can be identified from the log alone.
- GPU health check: after 'Fetch ROCm assets' (once the prefix binaries are present), runs the bench runner-health.sh (rocminfo/amd-smi/rocm-smi diagnostics, plus fail-fast if /dev/kfd is missing, rocminfo is dead, or the expected gfx arch is not enumerated).
Caveat: neither step detects the gfx1151 coherence NaN flake (the GPU reports healthy); the debug block enables post-hoc identification. runner-health.sh must exist on the bench at BENCH_REF (AaronStGeorge/llamacpp_ci) for the health step to resolve.
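The fail-fast conditions listed for the health check can be illustrated inline. This is not the contents of runner-health.sh (which is not shown here); it is a hypothetical equivalent that assumes rocminfo is on PATH (e.g. from the fetched ROCm prefix) and that a `matrix.gfx_target` key holds the expected arch. The 120-second timeout is an arbitrary choice.

```yaml
# Illustrative inline equivalent of the fail-fast checks; the real script may differ.
steps:
  - name: GPU health check (sketch)
    shell: bash
    env:
      EXPECTED_GFX: ${{ matrix.gfx_target }}
    run: |
      # 1. The KFD device node must exist inside the container.
      [ -e /dev/kfd ] || { echo "::error::/dev/kfd missing on $RUNNER_NAME"; exit 1; }
      # 2. rocminfo must complete; a hang is treated like a failure.
      if ! info="$(timeout 120 rocminfo 2>&1)"; then
        echo "$info"
        echo "::error::rocminfo failed or hung on $RUNNER_NAME"
        exit 1
      fi
      # 3. The expected gfx arch must be enumerated.
      if ! grep -q "$EXPECTED_GFX" <<<"$info"; then
        echo "::error::$EXPECTED_GFX not enumerated on $RUNNER_NAME"
        exit 1
      fi
      echo "GPU OK: $EXPECTED_GFX on $RUNNER_NAME"
```

As the caveat above notes, a check like this passes on a machine with the gfx1151 coherence NaN flake, because the GPU itself reports healthy.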
Run the failing mul_mat_f16 config N times (MULMAT_LOOP_N, default 50) in one job and print the pass/fail count along with RUNNER_NAME, to answer: on a bad machine, does it fail on every iteration or only on some, and what is the failure rate per machine? The step is advisory (continue-on-error) and placed BEFORE the gating correctness step so it still runs on a failing machine. Correlate the count with RUNNER_NAME from the Debug step.
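A hedged sketch of the repeat loop. MULMAT_LOOP_N, its default of 50, continue-on-error and the placement come from the commit message; the bench/ path and the op_test.py invocation are hypothetical placeholders for however the bench tooling runs the mul_mat_f16 config.

```yaml
# Hypothetical step excerpt; bench/ path and op_test.py invocation are placeholders.
steps:
  - name: mul_mat_f16 repeat loop (advisory)
    # Placed before the gating correctness step so it still runs on a bad machine.
    continue-on-error: true
    shell: bash
    run: |
      source bench/runtime-env.sh
      n="${MULMAT_LOOP_N:-50}"
      pass=0
      fail=0
      for _ in $(seq 1 "$n"); do
        if python3 bench/op_test.py --config mul_mat_f16; then  # placeholder command
          pass=$((pass + 1))
        else
          fail=$((fail + 1))
        fi
      done
      summary="runner=$RUNNER_NAME mul_mat_f16 pass=$pass fail=$fail of $n"
      echo "$summary"
      echo "$summary" >> "$GITHUB_STEP_SUMMARY"
      # Non-zero exit marks the step as failed; continue-on-error keeps the job going.
      [ "$fail" -eq 0 ]
```

The counters use `x=$((x + 1))` rather than `((x++))`: under the `-e` flag GitHub passes to bash, `((x++))` returns a non-zero status when x is 0 and would abort the loop.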
Force-pushed from 36619c6 to a31d889.
Replace the inline "Debug (runner identity)" block with a call to
runner-identity.sh (in the bench repo, alongside runner-health.sh). The step
now runs right after the bench checkout (earliest the script is available)
instead of first; matrix values are passed via MATRIX_* env since a script
can't read ${{ matrix.* }}. if: always() retained.
Requires runner-identity.sh on the bench BENCH_REF branch (committed on
AaronStGeorge/llamacpp_ci runner-health).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
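A sketch of the resulting step order, under assumptions: the bench repository (AaronStGeorge/llamacpp_ci) and BENCH_REF come from the PR, while the checkout path (bench), the MATRIX_RUNNER/MATRIX_GFX_TARGET names and the matrix keys are illustrative.

```yaml
# Hypothetical step excerpt; bench/ path and MATRIX_* variable names are illustrative.
steps:
  - name: Checkout bench tooling
    uses: actions/checkout@v4
    with:
      repository: AaronStGeorge/llamacpp_ci
      ref: ${{ env.BENCH_REF }}
      path: bench
  # Earliest point at which runner-identity.sh exists on disk.
  - name: Runner identity
    if: always()
    shell: bash
    env:
      # The script cannot read matrix values itself; the runner expands
      # these expressions before the step starts and exports them as env.
      MATRIX_RUNNER: ${{ matrix.runner }}
      MATRIX_GFX_TARGET: ${{ matrix.gfx_target }}
    run: bash bench/runner-identity.sh
```

`if: always()` keeps the identity step running after earlier failures, but it cannot help if the bench checkout itself failed, since the script would not be on disk.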
Force-pushed from ea36d86 to 53cd7e0.
Force-pushed from 53cd7e0 to bdcd68d.
Comment from the PR author (collaborator):
This basic CI seems to be "good enough" for the first iteration. There have been several issues with the runners:
Replaces the ubuntu-latest CPU job in build-hrx.yml with a matrix on the self-hosted ROCm runners (linux-gfx1151-gpu-rocm, linux-gfx120X-gpu-rocm), building for the runner's gfx target and running the sample MUL_MAT correctness/perf configs on the HRX0 backend. Per-arch rocwmma dev artifacts are fetched (rocwmma_dev_gfx1151, rocwmma_dev_gfx120X-all; both verified present in pinned TheRock run 25753625030).

🤖 Generated with Claude Code