gfx11: sync ROCm CI + RDNA3.5 MMQ device table onto upstream-synced master by jimw567 · Pull Request #30 · ROCm/llama.cpp

jimw567 · 2026-06-26T19:29:01Z

Summary

Rebase the fork's gfx11 branch onto the freshly synced master, which is now level with upstream after #28 ("Sync fork master with upstream llama.cpp (211 commits)").
Brings two gfx11-relevant changes onto current upstream:
1. .github/workflows/build-gfx11-rocm.yml: multiarch ROCm build plus an on-hardware deterministic-generation test for RDNA3/3.5 gfx11 targets (gfx110X desktop RDNA3 and Hawk Point/Phoenix, gfx1150, gfx1151, gfx1153).
2. ggml/src/ggml-cuda/mmq.cuh: RDNA3.5 MMQ device table (from #25, "Add mmq device table for RDNA3.5"), which sets mmq_y=64 and nwarps=4 for RDNA3.5 in both the host and device paths.
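A rough Python model of why one table has to feed both paths: the host derives the launch grid from mmq_y and nwarps, and the kernel assumes the same tile height per block. Only the mmq_y=64 / nwarps=4 values come from this PR; the set of RDNA3.5 arch names, the fallback values, and the wave32 warp size are assumptions for illustration, and the function names are hypothetical rather than llama.cpp's actual helpers.

```python
from dataclasses import dataclass
from typing import Tuple

# Assumption: the table keys on these gfx names (the RDNA3.5 targets named in this PR).
RDNA3_5_TARGETS = frozenset({"gfx1150", "gfx1151", "gfx1153"})


@dataclass(frozen=True)
class MmqConfig:
    mmq_y: int   # tile height (rows of the quantized matrix) per thread block
    nwarps: int  # warps per thread block


RDNA3_5_CONFIG = MmqConfig(mmq_y=64, nwarps=4)  # values from this PR
OTHER_CONFIG = MmqConfig(mmq_y=128, nwarps=8)   # hypothetical placeholder, not llama.cpp's real default


def mmq_config(arch: str) -> MmqConfig:
    """Single source of truth consulted by both the host and the device view."""
    return RDNA3_5_CONFIG if arch in RDNA3_5_TARGETS else OTHER_CONFIG


def host_launch_shape(arch: str, nrows: int, warp_size: int = 32) -> Tuple[int, int]:
    """Host view: number of blocks along the row dimension, and threads per block."""
    cfg = mmq_config(arch)
    blocks = -(-nrows // cfg.mmq_y)  # ceiling division
    return blocks, cfg.nwarps * warp_size  # assumption: wave32 on RDNA


def device_tile_rows(arch: str) -> int:
    """Device view: rows the kernel assumes each block covers."""
    return mmq_config(arch).mmq_y
```

With this model, host_launch_shape("gfx1151", 4096) gives 64 row blocks of 128 threads each. If the device side read a different mmq_y than the host, the blocks would no longer tile the rows exactly, which is the failure mode a shared table avoids.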

Notes

The mmq.cuh patch applied cleanly on upstream; the GGML_CUDA_CC_IS_RDNA3_5 / RDNA3_5 macros were confirmed present in upstream common.cuh and vendors/hip.h.
The branch is 0 commits behind upstream/master.

Test plan

gfx11 build workflow runs and on-hardware test passes on the synced master.
Validate MMQ path on RDNA3.5 hardware (gfx1150/gfx1151).

…nts (ggml-org#24371) * vocab : refactor normalizer flags into options struct, add strip_accents * Update src/llama-vocab.h Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update src/llama-vocab.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* restore SYCL build and release, remove github cache
* modify for test only
* verify the ccache is used
* remove debug code change
* rm duplicate action, update key in ccache
* add action ccache-clear after building in both ubuntu and windows
* set %NUMBER_OF_PROCESSORS% in windows build

* llama : enable layer input extraction
* spec: support eagle3
* eagle3: fix params bug
* eagle3: support Gemma4 eagle3 from RedHatAI
* eagle3: set sync when get features from target
  Co-authored-by: tnhnyzc <115956684+tnhnyzc@users.noreply.github.com>
* eagle3 : fix ubatch handling in embd_layer_inp extraction and encoder
  Co-authored-by: Doğaç Eldenk <dogacel@gmail.com>
* eagle3: adapt to upstream changes
* eagle3: fix rebase issues and adapt to upstream changes
* eagle3: exclude the eagle3 arch from test-llama-archs
* eagle3: fix editorconfig check failures
* eagle3: fix multi-seq issue in d2t vocab mapping
* cont : minor style / clean-up
* spec : remove `common_speculative_setup_draft_model()`
* llama : clean-up unused API
* eagle3: set d2t vocab mapping in decode graph
* cont : assert layer inputs are configured
* hparams : use n_embd_inp instead of n_embd_target_features
* eagle3: make output.weight optional and inherit from target model when needed
* hparams : generic norm-before-residual param
* llama-ext : consistent names
* cont : fix
* hparams : remove target_hidden_size
* cparams : rename output_layer_inp -> embeddings_layer_inp
* arch : reuse ATTN_NORM_2 instead of adding new hidden norm
* llama : clean-up names
* cont : add assert + comment
* Update conversion/llama.py
  Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: tnhnyzc <115956684+tnhnyzc@users.noreply.github.com>
Co-authored-by: Doğaç Eldenk <dogacel@gmail.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* ui: bake jpeg exif orientation into uploaded images
  stb_image in mtmd ignores exif metadata, so rotated smartphone photos reach the model with raw pixel orientation. The webui now reads the exif orientation tag at send time and feeds it into the existing capImageDataURLSize canvas pass: the browser applies the rotation when decoding, so capped images come out upright for free, and images under the cap threshold get a single plain redraw when orientation > 1. At most one re-encode ever happens per image. Upright jpegs with capping disabled pass through untouched, bit perfect. Adds jpeg-orientation.ts with a minimal exif parser working on a bounded base64 prefix (both endianness, returns 1 on any malformed input) and unit tests against handcrafted jpeg byte streams.
* ui: move jpeg exif constants into lib/constants
* ui: add browser test for jpeg orientation and capping
  Covers capImageDataURLSize end to end in chromium with real Pillow generated jpeg fixtures across exif orientations 1/3/5/6/8: upright quadrant colors checked pixel-wise, expected dimensions with and without capping, no orientation tag left in the output, and strict passthrough when nothing needs rewriting.
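The webui's parser lives in jpeg-orientation.ts and is not reproduced here. As a Python sketch of the same idea under stated assumptions: walk the JPEG markers, find the APP1 Exif segment, read the TIFF byte order (II or MM), and look up tag 0x0112 in IFD0, falling back to 1 on anything missing or malformed. The 64 KiB prefix bound and all function names are hypothetical.

```python
import base64
import struct

ORIENTATION_TAG = 0x0112  # EXIF/TIFF Orientation tag
SHORT = 3                 # TIFF field type for 16-bit unsigned values


def _tiff_orientation(tiff: bytes) -> int:
    # Byte order mark: "II" = little-endian, "MM" = big-endian.
    order = {b"II": "<", b"MM": ">"}.get(tiff[:2])
    if order is None or struct.unpack(order + "H", tiff[2:4])[0] != 42:
        return 1
    (ifd0,) = struct.unpack(order + "I", tiff[4:8])
    (count,) = struct.unpack(order + "H", tiff[ifd0:ifd0 + 2])
    for i in range(count):
        entry = ifd0 + 2 + 12 * i  # each IFD entry is 12 bytes
        tag, typ = struct.unpack(order + "HH", tiff[entry:entry + 4])
        if tag == ORIENTATION_TAG and typ == SHORT:
            # A SHORT value sits left-justified in the 4-byte value field.
            (value,) = struct.unpack(order + "H", tiff[entry + 8:entry + 10])
            return value if 1 <= value <= 8 else 1
    return 1


def jpeg_orientation(data: bytes) -> int:
    """Return the EXIF orientation (1-8), or 1 for anything missing or malformed."""
    try:
        if data[:2] != b"\xff\xd8":  # SOI marker
            return 1
        pos = 2
        while pos + 4 <= len(data):
            if data[pos] != 0xFF:
                return 1
            marker = data[pos + 1]
            if marker == 0xDA:  # start of scan: no metadata segments follow
                return 1
            # JPEG segment lengths are big-endian and include the 2 length bytes.
            (seg_len,) = struct.unpack(">H", data[pos + 2:pos + 4])
            if marker == 0xE1 and data[pos + 4:pos + 10] == b"Exif\x00\x00":
                return _tiff_orientation(data[pos + 10:pos + 2 + seg_len])
            pos += 2 + seg_len
        return 1
    except struct.error:  # truncated input yields short slices
        return 1


def orientation_from_data_url(url: str, max_b64_chars: int = 64 * 1024) -> int:
    """Decode only a bounded, 4-aligned base64 prefix of a data: URL."""
    _, _, payload = url.partition(",")
    prefix = payload[: max_b64_chars - max_b64_chars % 4]
    try:
        return jpeg_orientation(base64.b64decode(prefix, validate=True))
    except ValueError:  # binascii.Error subclasses ValueError
        return 1
```

Bounding the decoded prefix works because the Exif APP1 segment normally precedes the image data, so the orientation tag is reachable without decoding the whole upload.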

* feat: Add basic PWA support and service worker for offline caching
* feat: Vite PWA implementation WIP
* feat: Improve PWA icons generation
* feat: Add PWA workbox to server routes
* feat: Include `version.json` in static assets
* feat: Add HTTP cache headers for PWA static assets
* feat: Update app name for `apple-mobile-web-app-title`
* feat: Implement PWA versioning and automatic update detection
* chore: Update `.gitignore` files
* feat: Splash Screens
* feat: Add dark mode favicon support
* refactor: Cleanup
* fix: Use dark logo for dark splash screens
* refactor: Simplify favicons SVG code
* fix: Adjust caching and polling for reliable service worker updates
* fix: Add missing favicon entry
* fix: Align PWA service worker configuration with SvelteKit build structure
* fix: Replace hashed bundle paths with versioned static paths
* test: Add PWA tests
* ci: Add build output for unit tests
* refactor: Cleanup
* fix: Server build & release versioning
* chore: Update package-lock.json
* chore: Increase PWA cache size
* chore: Update packages
* feat: Update favicons
* refactor: Post-merge fix
* feat: support explicit build version for PWA cache busting
* fix: CI
* feat: Improve PWA Refresh Alert UI
* feat: Add toggleable build version display
* refactor: Cleanup
* feat: Add version mismatch detection and manual app reload
* refactor: replace dynamic imports with static
* refactor: Cleanup
* feat: Add safe space for `pwa-<size>.png` rendered icons
* fix: use relative paths for PWA assets to support base path deployment
* feat: add PWA mode detection via URL query parameter
* feat: Use ?cache=true for SW-cached PWA assets
* refactor: Build process cleanup
* refactor: Decouple PWA versioning and remove ?cache=true workaround
* chore: Update README logo
* feat: Include PWA Assets generation in build script
* refactor: `usePwa` hook for core layout
* fix: Relativize base vite plugin
* fix: remove unnecessary backslash escapes in test regexes
* test: update static asset paths for API Key test
* refactor: Move SvelteKit PWA Options config to constants
* ui: fix update notification never appearing
  Keep the PWA hook object intact instead of destructuring needRefreshByStorage, which freezes the reactive getter. Also exclude loading.html from PWA precache to prevent 404 errors and broken SW installation.

…rg#24517) When reasoning-budget is set in model.ini, the per-request thinking_budget_tokens from the WebUI was ignored because the model.ini value took unconditional precedence. Swap the precedence so the WebUI per-request value is checked first, with the model.ini value serving as a fallback default. Assisted-by: pi:llama.cpp/Qwen3.6-27B
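A minimal sketch of the precedence after the swap, assuming both values arrive as optional integers and that -1 means an unrestricted budget (mirroring llama.cpp's --reasoning-budget convention; treat that as an assumption here). The function and constant names are hypothetical.

```python
from typing import Optional

UNRESTRICTED = -1  # assumption: -1 means "no thinking budget"


def effective_reasoning_budget(request_budget: Optional[int], ini_budget: Optional[int]) -> int:
    """The per-request WebUI value wins; model.ini only supplies a fallback default."""
    if request_budget is not None:
        return request_budget
    if ini_budget is not None:
        return ini_budget
    return UNRESTRICTED
```

Checking `is not None` rather than truthiness matters here: a per-request budget of 0 (thinking disabled) must still override the model.ini value.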

* server: clean up static assets handling * nits * simplify file name handling, use static file name everywhere * cmake/ui : bundle UI assets in an archive * ui : run prettier on post-build.js --------- Co-authored-by: Alde Rojas <hello@alde.dev>

* Add failing test-case to test-backend-ops
  Extracted from ggml-org#24072
* Minimize repro with help of AI
  N = 8 * (65535 - 1) + 1 = 524273
* Port and adjust workaround from LostRuins@0ba7983
  Fall-back should share code, also relax y-z constraint to be inclusive
* Add test-case + fallback also for y dim
* Fix x-guards: the x-dimension limit is 2^{31}-1, so the check must be inclusive of INT_MAX
* Fix overflow problems for transposed copy kernel
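The commit does not show the kernel code, so the following is only a hypothetical illustration of the arithmetic: splitting one dimension into launches no larger than CUDA's gridDim.y/z limit of 65535. The repro size N sits just past the point where an exclusive bound (at most 65534 per launch) and an inclusive one (at most 65535) need different numbers of launches. The helper name is invented for this sketch.

```python
from typing import List, Tuple

MAX_GRID_YZ = 65535      # CUDA's limit for gridDim.y and gridDim.z (inclusive)
MAX_GRID_X = 2**31 - 1   # gridDim.x limit, equal to INT_MAX

N = 8 * (65535 - 1) + 1  # size used by the repro test case


def split_dim(n: int, limit: int = MAX_GRID_YZ) -> List[Tuple[int, int]]:
    """Split n rows into (offset, count) launches with count <= limit."""
    if n < 0 or limit <= 0:
        raise ValueError("n must be >= 0 and limit > 0")
    chunks: List[Tuple[int, int]] = []
    offset = 0
    while offset < n:
        count = min(limit, n - offset)
        chunks.append((offset, count))
        offset += count
    return chunks
```

Here split_dim(N) needs 8 launches (8 * 65535 = 524280 >= 524273), while split_dim(N, limit=65534) needs 9, because 8 * 65534 = 524272 leaves one row over.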

Build llama.cpp from this branch's source against the latest TheRock nightly ROCm tarball for gfx1151/gfx1150 (Ubuntu), then smoke-test each artifact on the matching self-hosted Strix Halo GPU runner with llama-cli. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

Self-hosted runners (e.g. the gfx1150 box) can't reach huggingface.co (curl 35, connection reset) but can reach GitHub. Fetch the GGUF from a fixed release asset (jimw567/llamacpp-test-assets) with retries instead. Revert gfx1150 runner to the working label (runs-on matches labels, not machine names, so the prior machine-name value never scheduled). Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
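The workflow's actual download step is not shown in this PR. A Python sketch of the retry-with-backoff pattern, with the fetch injected so the release-asset URL (not reproduced here) stays out of the code; the attempt count, delays, and function name are assumptions.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def with_retries(fetch: Callable[[], T], attempts: int = 5, delay_s: float = 2.0,
                 sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fetch() until it succeeds, doubling the wait after each failure."""
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    last_error: Optional[OSError] = None
    for attempt in range(attempts):
        try:
            return fetch()
        except OSError as exc:  # connection resets, URLError, socket timeouts all subclass OSError
            last_error = exc
            if attempt + 1 < attempts:
                sleep(delay_s * 2 ** attempt)
    assert last_error is not None
    raise last_error
```

In a workflow script, fetch could be something like `lambda: urllib.request.urlopen(asset_url, timeout=60).read()`, where asset_url is a placeholder for the release asset; injecting sleep keeps the backoff testable without real waits.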

…tput The binary failed to start on runners lacking libatomic1 (exit 127); bundle libatomic.so.1 with the artifact so $ORIGIN RPATH resolves it. Replace stale assertions ("offloaded 29/29 layers to GPU", "</think>") with functional checks against current output: ROCm GPU selected, all layers + output offloaded to ROCm0, and tokens actually generated (eval timing). Verified on a Strix Halo gfx1151 box: passes on real GPU output, fails on CPU-only/no-generation output. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

The previous assertion only checked GPU offload + that >=1 token was emitted. Add functional checks that the model produced a non-empty answer and non-empty reasoning trace, and generated >=20 tokens. Verified against real gfx1151 output: passes on a genuine response, fails on empty/one-token. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

Switch the llama-cli smoke test to a single-correct-answer prompt ("What is 2 + 2?") with greedy decoding (--temp 0), and assert the parsed assistant content is 4. This checks the GPU math path end to end rather than only that some text was generated. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
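A sketch of the combined pass/fail logic from the three smoke-test commits above, assuming the workflow has already parsed llama-cli's output into a few fields. The parsing step and the field names are hypothetical, and no llama-cli log format is assumed. Greedy decoding (--temp 0) is what makes an exact match on "4" a reasonable check.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SmokeRun:
    """Fields a CI script might extract from one llama-cli run (hypothetical)."""
    answer: str            # parsed assistant content
    reasoning: str         # parsed reasoning trace, if the model emits one
    tokens_generated: int  # e.g. taken from the eval timing output
    layers_on_gpu: int     # layers (including output) offloaded to ROCm0
    layers_total: int


def smoke_failures(run: SmokeRun, expected_answer: str = "4",
                   min_tokens: int = 1, require_reasoning: bool = False) -> List[str]:
    """Return human-readable failures; an empty list means the run passed."""
    failures: List[str] = []
    if run.layers_total == 0 or run.layers_on_gpu < run.layers_total:
        failures.append(f"only {run.layers_on_gpu}/{run.layers_total} layers offloaded")
    if run.tokens_generated < min_tokens:
        failures.append(f"generated {run.tokens_generated} tokens, need {min_tokens}")
    if require_reasoning and not run.reasoning.strip():
        failures.append("empty reasoning trace")
    if run.answer.strip() != expected_answer:
        failures.append(f"expected {expected_answer!r}, got {run.answer.strip()!r}")
    return failures
```

Returning every failure instead of stopping at the first gives a CPU-only or no-generation run a complete diagnosis in one log.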

- Publish a dated GitHub Release (b<YYYYMMDD>) with gfx1151 + gfx1150 tar.gz binaries, gated to the nightly workflow_dispatch (create_release input) so push/PR/manual runs never publish.
- Wrap the llama-cli smoke test in a 180s timeout that dumps GPU/driver diagnostics on hang, so a stuck runner fails fast with visible logs.
- Temporarily disable the gfx1150 on-hardware test (its runner hangs in GPU inference); gfx1150 binaries are still built and released.
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
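The release gate, dated tag, and smoke-test timeout reduce to small pieces of logic. A Python sketch, assuming the gate sees the triggering event name and the create_release input as a boolean; the function names are hypothetical and the diagnostics callback is a placeholder for whatever GPU/driver dump the workflow actually runs.

```python
import datetime
import subprocess
from typing import Callable, Optional, Sequence

SMOKE_TIMEOUT_S = 180.0  # the 180 s budget from the commit message


def release_tag(day: datetime.date) -> str:
    """Dated release tag of the form b<YYYYMMDD>."""
    return "b" + day.strftime("%Y%m%d")


def should_publish(event_name: str, create_release: bool) -> bool:
    """Push, PR, and manual runs without create_release never publish."""
    return event_name == "workflow_dispatch" and create_release


def run_smoke(cmd: Sequence[str], timeout_s: float = SMOKE_TIMEOUT_S,
              on_hang: Optional[Callable[[], None]] = None) -> subprocess.CompletedProcess:
    """Run the smoke test; on timeout, dump diagnostics and re-raise so the job fails."""
    try:
        # subprocess.run kills the child when the timeout expires, then re-raises.
        return subprocess.run(list(cmd), capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        if on_hang is not None:
            on_hang()  # placeholder for the GPU/driver diagnostics dump
        raise
```

Re-raising after the diagnostics callback keeps the hang visible as a failed step instead of a silently passing one.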

Mirror the lemonade llamacpp-rocm setup: build the grouped gfx110X target (gfx1100;gfx1101;gfx1102;gfx1103) from the gfx110X-all ROCm tarball, covering desktop RDNA3 and the Hawk Point/Phoenix iGPU (Radeon 760M/780M). Decouple the build matrix into gfx_target / s3_target / gpu_targets so per-target tarball suffixes and GPU target lists are explicit. Publish the gfx110X binary in the nightly release. No on-hardware test leg (no Hawk Point runner). Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
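A Python sketch of the decoupled matrix idea, expanding each gfx_target into an explicit GPU target list. The -DGGML_HIP=ON / -DGPU_TARGETS flag names are assumed to match llama.cpp's HIP build instructions (verify against the workflow itself); the s3_target tarball suffixes are not given here, so they are left out, and the helper names are hypothetical.

```python
from typing import Dict, List, Tuple

GFX_GROUPS: Dict[str, Tuple[str, ...]] = {
    # Grouped target: desktop RDNA3 plus the Hawk Point/Phoenix iGPU.
    "gfx110X": ("gfx1100", "gfx1101", "gfx1102", "gfx1103"),
    "gfx1150": ("gfx1150",),
    "gfx1151": ("gfx1151",),
    "gfx1153": ("gfx1153",),
}


def gpu_targets(group: str) -> str:
    """Semicolon-separated list, the form CMake list variables expect."""
    return ";".join(GFX_GROUPS[group])


def cmake_args(group: str) -> List[str]:
    # Assumed flag names, following llama.cpp's HIP build instructions.
    return ["-DGGML_HIP=ON", f"-DGPU_TARGETS={gpu_targets(group)}"]


def multiarch_targets() -> str:
    """Union of every group, in declaration order, for a single fat binary."""
    seen: List[str] = []
    for arches in GFX_GROUPS.values():
        for arch in arches:
            if arch not in seen:
                seen.append(arch)
    return ";".join(seen)
```

Keeping the group-to-targets mapping explicit makes the later collapse into one build a matter of passing the union instead of one group per matrix leg.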

Collapse the 4-leg per-family matrix (gfx1151/gfx1150/gfx1153/gfx110X) into a single build sourced from TheRock's multi-arch tarball. One fat binary covers all current CI arches (gfx1100-1103, gfx1150/1151/1153) and ships as one universal release archive instead of four mostly-duplicate per-family archives. The multi-arch tarball is streamed and pruned at the tar level: drop all .kpack files and the Tensile DBs of every non-target arch. The GEMM path llama.cpp uses works from the per-arch Tensile DB alone (validated on gfx1151 hardware: rocBLAS sgemm succeeds with ROCM_KPACK_DISABLE=1), so no .kpack files are bundled. The gfx1151 hardware test job is the end-to-end safety net. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
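One way to stream-and-prune at the tar level in Python, as a sketch: read the tarball sequentially, drop every .kpack, and drop files under a library/ directory whose name carries a gfx arch outside the target set. The directory marker and the assumption that per-arch Tensile DB files embed their gfx name in the file name are hypothetical; check both against the real tarball layout.

```python
import re
import tarfile
from typing import BinaryIO, FrozenSet, List

# Hypothetical layout: per-arch Tensile DB files embed their target in the
# file name (e.g. "..._gfx1151.dat") and live under a ".../library/" directory.
ARCH_RE = re.compile(r"gfx[0-9a-f]{3,4}")
TENSILE_DIR_MARKER = "/library/"


def keep_member(name: str, targets: FrozenSet[str]) -> bool:
    base = name.rsplit("/", 1)[-1]
    if base.endswith(".kpack"):
        return False  # no .kpack files are bundled at all
    if TENSILE_DIR_MARKER in name:
        match = ARCH_RE.search(base)
        if match and match.group(0) not in targets:
            return False  # Tensile DB for an arch this package does not ship
    return True


def prune_tarball(src: BinaryIO, dst: BinaryIO, targets: FrozenSet[str]) -> List[str]:
    """Stream src (any compression) into a gzip tar on dst, keeping wanted members."""
    kept: List[str] = []
    # "r|*" reads sequentially, so the input never needs to be seekable or on disk.
    with tarfile.open(fileobj=src, mode="r|*") as tin, \
            tarfile.open(fileobj=dst, mode="w|gz") as tout:
        for member in tin:
            if not keep_member(member.name, targets):
                continue  # skipped data is read past, never written
            data = tin.extractfile(member) if member.isfile() else None
            tout.addfile(member, data)
            kept.append(member.name)
    return kept
```

Streaming avoids materializing the full multi-arch tarball: each member is either copied straight into the output archive or skipped as it is read.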

The "universal" vs "multi-arch" wording was redundant — both describe one package covering many arches. Standardize on "multiarch" to match TheRock's upstream vocabulary. Renames the artifact/archive to llama-<TAG>-ubuntu-rocm-multiarch-x64 and updates comments + release body. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

Follow-up to the package rename: drop the hyphenated "multi-arch" prose in comments, echoes, and the release body for one consistent spelling. The upstream nightlies endpoint (tarball-multi-arch) and the therock-dist-linux- multiarch- filenames are external names and left untouched. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

Fork-local: upstream targets [self-hosted, fast] runners that don't exist in this fork, so these two lint checks queue forever. Point them at GitHub-hosted ubuntu-24.04. Marked DO NOT UPSTREAM in each file. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

Fork-local: add gfx11 to the push/pull_request branch filters so the two lint checks actually run on gfx11-targeted PRs (they previously only triggered on master). Marked DO NOT UPSTREAM. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

angt and others added 30 commits June 10, 2026 22:28

vendor : update LibreSSL to 4.3.2 (ggml-org#24397)

ac4cdde

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

server : skip checkpoints beyond pos_next (ggml-org#24411)

db94854

* server : skip checkpoints beyond pos_next * cont : update comment + TODO + ref --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

vocab : adopt leading TemplateProcessing special token as BOS (ggml-org#24428)

1bfbdb1

server: skip unused log lines on router mode (ggml-org#24463)

18ef86e

vulkan: use medium matmul tile on Asahi Linux (ggml-org#24306)

1af154a

* vulkan: use medium matmul tile on Asahi Linux * vulkan: switch Apple detection to Honeykrisp driver id

vulkan: add fast path for contiguous buffer transfers (ggml-org#23973)

fdc3db9

ggml : bump version to 0.15.0 (ggml/1539)

17e59d6

sync : ggml

263cc04

vulkan: ifdef eMesaHoneykrisp (build fix) (ggml-org#24479)

4c65955

Fixes build/CI after ggml-org#24306.

docker : support specifying the GCC version for CUDA (ggml-org#24447)

1593d56

opencl: add q5_0/q5_1 gemm and gemv kernels for Adreno (ggml-org#24319)

ba1df05

* opencl: add q5_0 adreno support * opencl: add q5_1 adreno support * opencl: cosmetic fix --------- Co-authored-by: Li He <lih@qti.qualcomm.com>

ggml: support concat for scalar types at cuda backend (ggml-org#24011)

85f99dc

* cuda: support concat for scalar types * Update concat.cu * fix metal ci issue

vendor : update cpp-httplib to 0.47.0 (ggml-org#24395)

70b54e1

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

ggml : bump version to 0.15.1 (ggml/1541)

e08c226

sync : ggml

f532be8

fit : avoid including llama-ext.h in fit.h (ggml-org#24506)

02182fc

vulkan: add pipeline barriers for memcpy read operations (ggml-org#23770)

3e7bd4f

* vulkan: add pipeline barriers for memcpy read/write operations * remove unnecessary host write pipeline barriers

ci : unbreak release (ggml-org#24544)

cd50446

ci : unbreak release harder (ggml-org#24545)

f58bad4

* unbreak release harder * missed one * remove missing test for now

mtmd: add batching API (ggml-org#24384)

e37abd6

* mtmd: add batching API * wip * first working version (gemma4v) * add arg * nits * wire up support_batch() * fix 0.0 output embd * fix audio * nits * refactor a bit * nits * fix non-batching case * fix comment

fix sycl links in release notes (ggml-org#24527)

c34b922

* fix sycl links in release notes * remove extra line

fit : wrap llama_device_memory_data (ggml-org#24522)

d8a24cc

ui: keep original file name and path (ggml-org#24568)

597b667

* ui: keep original file name and path * fix nocache

ngxson and others added 23 commits June 25, 2026 16:26

misc: update lables (ggml-org#24920)

099bf06

* misc: update lables * bring back examples, add mtmd

server: use status code 403 for disabled features (ggml-org#24970)

e9d1b76

* server: use status code 403 for disabled features * cont * fix test case

misc: fix labeler (ggml-org#25012)

c7cddef

model : Add label for LFM2.5-230M (ggml-org#25008)

9d5d882

xcframework : disable mtmd video on i/tv/visionos (ggml-org#25018)

beac530

Merge remote-tracking branch 'upstream/master' into jimwu.sync-upstream

cd81994

# Conflicts: # README.md

Merge pull request #28 from ROCm/jimwu.sync-upstream

5d83b8b

Sync fork master with upstream llama.cpp (211 commits)

ci: pin ROCm to last known-good nightly 7.14.0a20260608

cb547e3

The 7.14.0a20260609+ nightlies regressed libhsa-runtime64 (build 1b2a555677), segfaulting in GpuAgent::InitDma on gfx115x runners. See ROCm/TheRock#5763. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

ci: route gfx1150 test to linux-strix-gpu-rocm-2 runner

c1709b0

The previous gfx1150 runner could not reach huggingface.co (curl 35, connection reset) during model download. Try a different runner. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

ci: add gfx1153 build + release target

aa90db9

gfx1153 ships as its own standalone TheRock tarball (no bundle suffix), same as gfx1151/gfx1150. Build+release only — no on-hardware test runner. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

ci: retrigger CI

914a8e5

Add mmq device table for RDNA3.5

5afbaec

jimw567 force-pushed the jimwu.gfx11-sync branch from 41013ff to 5692e0b June 26, 2026 19:33

jimw567 changed the title from "ci(gfx11): add gfx11 ROCm build + on-hardware test, rebased on synced master" to "gfx11: sync ROCm CI + RDNA3.5 MMQ device table onto upstream-synced master" Jun 26, 2026

jimw567 changed the base branch from master to gfx11 June 26, 2026 19:36

Jim Wu and others added 3 commits June 26, 2026 19:30

ci(gfx11): run flake8 lint on ubuntu-24.04, trigger on gfx11

78372b6

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

jimw567 requested a review from Annieren June 28, 2026 04:36

jimw567 wants to merge 231 commits into gfx11 from jimwu.gfx11-sync


20 participants
