
gfx11: sync ROCm CI + RDNA3.5 MMQ device table onto upstream-synced master #30

Open

jimw567 wants to merge 231 commits into gfx11 from jimwu.gfx11-sync



jimw567 (Collaborator) commented Jun 26, 2026

Summary

  • Rebase the fork's gfx11 branch onto the freshly synced master (now level with upstream after "Sync fork master with upstream llama.cpp (211 commits)" #28).
  • Brings two gfx11-relevant changes onto current upstream:
    1. .github/workflows/build-gfx11-rocm.yml: multiarch ROCm build for the RDNA3/3.5 gfx11 targets (gfx110X desktop RDNA3 and Hawk Point/Phoenix iGPUs, gfx1150, gfx1151, gfx1153), plus an on-hardware deterministic-generation test on gfx1151.
    2. ggml/src/ggml-cuda/mmq.cuh: RDNA3.5 MMQ device table (from "Add mmq device table for RDNA3.5" #25), which sets mmq_y=64 and nwarps=4 for RDNA3.5 in both the host and device paths.
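A minimal Python sketch of why the table has to be applied in both paths, assuming the host derives the launch grid from mmq_y while the kernel tiles rows with its own copy of the value. Only mmq_y=64 and nwarps=4 come from this PR; the arch set, the `fallback` parameter, and all function names are hypothetical, and mmq.cuh itself keys off GGML_CUDA_CC_IS_RDNA3_5 rather than arch strings.

```python
from typing import NamedTuple


class MmqConfig(NamedTuple):
    mmq_y: int   # rows of the output tile handled per block
    nwarps: int  # warps per block


# RDNA3.5 values from this PR. Which gfx IDs count as RDNA3.5 here is an
# assumption; mmq.cuh decides this with GGML_CUDA_CC_IS_RDNA3_5.
RDNA3_5_ARCHES = frozenset({"gfx1150", "gfx1151", "gfx1153"})
RDNA3_5_CONFIG = MmqConfig(mmq_y=64, nwarps=4)


def mmq_config(arch: str, fallback: MmqConfig) -> MmqConfig:
    """Single lookup shared by the host launch code and the kernel."""
    return RDNA3_5_CONFIG if arch in RDNA3_5_ARCHES else fallback


def host_grid_rows(nrows: int, cfg: MmqConfig) -> int:
    """Host side: number of row tiles, ceil(nrows / mmq_y)."""
    return -(-nrows // cfg.mmq_y)


def device_rows_covered(grid_rows: int, cfg: MmqConfig) -> int:
    """Device side: rows the launched blocks actually process."""
    return grid_rows * cfg.mmq_y
```

If the host sized the grid with a larger mmq_y than the kernel tiles with, the rows covered on the device would fall short of `nrows` and some rows would never be computed; one shared lookup avoids that mismatch.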

Notes

  • mmq.cuh patch applied cleanly on upstream; GGML_CUDA_CC_IS_RDNA3_5 / RDNA3_5 macros confirmed present in upstream common.cuh and vendors/hip.h.
  • Branch is 0 commits behind upstream/master.

Test plan

  • gfx11 build workflow runs and on-hardware test passes on the synced master.
  • Validate MMQ path on RDNA3.5 hardware (gfx1150/gfx1151).

angt and others added 30 commits June 10, 2026 22:28
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* server : skip checkpoints beyond pos_next

* cont : update comment + TODO + ref

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
…nts (ggml-org#24371)

* vocab : refactor normalizer flags into options struct, add strip_accents

* Update src/llama-vocab.h

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-vocab.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* vulkan: use medium matmul tile on Asahi Linux

* vulkan: switch Apple detection to Honeykrisp driver id
* opencl: add q5_0 adreno support

* opencl: add q5_1 adreno support

* opencl: cosmetic fix

---------

Co-authored-by: Li He <lih@qti.qualcomm.com>
* restore SYCL build and release, remove github cache

* modify for test only

* verify the ccache is used

* remove debug code change

* rm duplicate action, update key in ccache

* add action ccache-clear after building in both ubuntu and windows

* set %NUMBER_OF_PROCESSORS% in windows build
* cuda: support concat for scalar types

* Update concat.cu

* fix metal ci issue
* llama : enable layer input extraction

* spec: support eagle3

* eagle3: fix params bug

* eagle3: support Gemma4 eagle3 from RedHatAI

* eagle3: set sync when get features from target

Co-authored-by: tnhnyzc <115956684+tnhnyzc@users.noreply.github.com>

* eagle3 : fix ubatch handling in embd_layer_inp extraction and encoder

Co-authored-by: Doğaç Eldenk <dogacel@gmail.com>

* eagle3: adapt to upstream changes

* eagle3: fix rebase issues and adapt to upstream changes

* eagle3: exclude the eagle3 arch from test-llama-archs

* eagle3: fix editorconfig check failures

* eagle3: fix multi-seq issue in d2t vocab mapping

* cont : minor style / clean-up

* spec : remove `common_speculative_setup_draft_model()`

* llama : clean-up unused API

* eagle3: set d2t vocab mapping in decode graph

* cont : assert layer inputs are configured

* hparams : use n_embd_inp instead of n_embd_target_features

* eagle3: make output.weight optional and inherit from target model when needed

* hparams : generic norm-before-residual param

* llama-ext : consistent names

* cont : fix

* hparams : remove target_hidden_size

* cparams : rename output_layer_inp -> embeddings_layer_inp

* arch : reuse ATTN_NORM_2 instead of adding new hidden norm

* llama : clean-up names

* cont : add assert + comment

* Update conversion/llama.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: tnhnyzc <115956684+tnhnyzc@users.noreply.github.com>
Co-authored-by: Doğaç Eldenk <dogacel@gmail.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* ui: bake jpeg exif orientation into uploaded images

stb_image in mtmd ignores exif metadata, so rotated smartphone photos
reach the model with raw pixel orientation. The webui now reads the
exif orientation tag at send time and feeds it into the existing
capImageDataURLSize canvas pass: the browser applies the rotation when
decoding, so capped images come out upright for free, and images under
the cap threshold get a single plain redraw when orientation > 1.

At most one re-encode ever happens per image. Upright jpegs with
capping disabled pass through untouched, bit perfect.

Adds jpeg-orientation.ts with a minimal exif parser working on a
bounded base64 prefix (both endianness, returns 1 on any malformed
input) and unit tests against handcrafted jpeg byte streams.

* ui: move jpeg exif constants into lib/constants

* ui: add browser test for jpeg orientation and capping

Covers capImageDataURLSize end to end in chromium with real Pillow
generated jpeg fixtures across exif orientations 1/3/5/6/8: upright
quadrant colors checked pixel-wise, expected dimensions with and
without capping, no orientation tag left in the output, and strict
passthrough when nothing needs rewriting.
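The EXIF parser described above ships as TypeScript (jpeg-orientation.ts). Below is a Python sketch of the same idea, assuming the standard JPEG segment layout with a TIFF-structured EXIF block in APP1; the function names and the 98,304-character prefix bound are illustrative, not taken from the webui code. Like the description, it handles both byte orders and returns 1 on any malformed input.

```python
import base64
import struct

ORIENTATION_TAG = 0x0112
# Assumed bound: enough base64 characters to cover SOI plus a maximum-size
# (64 KiB) APP1 segment, which expands by 4/3 when base64-encoded.
MAX_PREFIX_CHARS = 96 * 1024


def _orientation_from_tiff(tiff: bytes) -> int:
    """Read tag 0x0112 from IFD0 of a TIFF-structured EXIF block."""
    if tiff[:2] == b"II":
        endian = "<"  # little endian
    elif tiff[:2] == b"MM":
        endian = ">"  # big endian
    else:
        return 1
    magic, ifd0 = struct.unpack(endian + "HI", tiff[2:8])
    if magic != 42:
        return 1
    (count,) = struct.unpack(endian + "H", tiff[ifd0:ifd0 + 2])
    for i in range(count):
        entry = ifd0 + 2 + 12 * i  # IFD entries are 12 bytes each
        tag, typ, n = struct.unpack(endian + "HHI", tiff[entry:entry + 8])
        if tag == ORIENTATION_TAG:
            if typ != 3 or n != 1:  # expect a single SHORT
                return 1
            (value,) = struct.unpack(endian + "H", tiff[entry + 8:entry + 10])
            return value if 1 <= value <= 8 else 1
    return 1


def jpeg_orientation(data: bytes) -> int:
    """EXIF orientation 1..8, or 1 if absent or if the input is malformed."""
    try:
        if data[:2] != b"\xff\xd8":  # SOI marker
            return 1
        pos = 2
        while pos + 4 <= len(data):
            if data[pos] != 0xFF:
                return 1
            marker = data[pos + 1]
            if marker == 0xDA:  # start of scan: no metadata after this
                return 1
            (seg_len,) = struct.unpack(">H", data[pos + 2:pos + 4])
            if seg_len < 2:  # length field counts its own two bytes
                return 1
            segment = data[pos + 4:pos + 2 + seg_len]
            if marker == 0xE1 and segment[:6] == b"Exif\x00\x00":
                return _orientation_from_tiff(segment[6:])
            pos += 2 + seg_len
        return 1
    except struct.error:  # short reads from truncated data or bogus offsets
        return 1


def orientation_from_data_url(url: str) -> int:
    """Decode only a bounded base64 prefix of a data: URL."""
    _, _, payload = url.partition(",")
    prefix = payload[:MAX_PREFIX_CHARS]
    prefix = prefix[: len(prefix) - len(prefix) % 4]  # whole 4-char groups only
    try:
        return jpeg_orientation(base64.b64decode(prefix, validate=True))
    except ValueError:  # binascii.Error subclasses ValueError
        return 1
```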
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* feat: Add basic PWA support and service worker for offline caching

* feat: Vite PWA implementation WIP

* feat: Improve PWA icons generation

* feat: Add PWA workbox to server routes

* feat: Include `version.json` in static assets

* feat: Add HTTP cache headers for PWA static assets

* feat: Update app name for `apple-mobile-web-app-title`

* feat: Implement PWA versioning and automatic update detection

* chore: Update `.gitignore` files

* feat: Splash Screens

* feat: Add dark mode favicon support

* refactor: Cleanup

* fix: Use dark logo for dark splash screens

* refactor: Simplify favicons SVG code

* fix: Adjust caching and polling for reliable service worker updates

* fix: Add missing favicon entry

* fix: Align PWA service worker configuration with SvelteKit build structure

* fix: Replace hashed bundle paths with versioned static paths

* test: Add PWA tests

* ci: Add build output for unit tests

* refactor: Cleanup

* fix: Server build & release versioning

* chore: Update package-lock.json

* chore: Increase PWA cache size

* chore: Update packages

* feat: Update favicons

* refactor: Post-merge fix

* feat: support explicit build version for PWA cache busting

* fix: CI

* feat: Improve PWA Refresh Alert UI

* feat: Add toggleable build version display

* refactor: Cleanup

* feat: Add version mismatch detection and manual app reload

* refactor: replace dynamic imports with static

* refactor: Cleanup

* feat: Add safe space for `pwa-<size>.png` rendered icons

* fix: use relative paths for PWA assets to support base path deployment

* feat: add PWA mode detection via URL query parameter

* feat: Use ?cache=true for SW-cached PWA assets

* refactor: Build process cleanup

* refactor: Decouple PWA versioning and remove ?cache=true workaround

* chore: Update README logo

* feat: Include PWA Assets generation in build script

* refactor: `usePwa` hook for core layout

* fix: Relativize base vite plugin

* fix: remove unnecessary backslash escapes in test regexes

* test: update static asset paths for API Key test

* refactor: Move SvelteKit PWA Options config to constants

* ui: fix update notification never appearing

Keep the PWA hook object intact instead of destructuring needRefreshByStorage,
which freezes the reactive getter. Also exclude loading.html from PWA
precache to prevent 404 errors and broken SW installation.
)

* vulkan: add pipeline barriers for memcpy read/write operations

* remove unnecessary host write pipeline barriers
…rg#24517)

When reasoning-budget is set in model.ini, the per-request
thinking_budget_tokens from the WebUI was ignored because the
model.ini value took unconditional precedence.

Swap the precedence so the WebUI per-request value is checked
first, with the model.ini value serving as a fallback default.

Assisted-by: pi:llama.cpp/Qwen3.6-27B
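A minimal sketch of the swapped precedence, written in Python for illustration only (the server is C++, and these names are hypothetical):

```python
from typing import Optional


def effective_reasoning_budget(
    request_budget: Optional[int], model_ini_budget: Optional[int]
) -> Optional[int]:
    """Per-request value wins; the model.ini value is only a fallback default."""
    # Compare against None, not truthiness: an explicit 0 sent with the
    # request is still a request value and must not fall through to model.ini.
    if request_budget is not None:
        return request_budget
    return model_ini_budget
```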
* unbreak release harder

* missed one

* remove missing test for now
* mtmd: add batching API

* wip

* first working version (gemma4v)

* add arg

* nits

* wire up support_batch()

* fix 0.0 output embd

* fix audio

* nits

* refactor a bit

* nits

* fix non-batching case

* fix comment
* fix sycl links in release notes

* remove extra line
* server: clean up static assets handling

* nits

* simplify file name handling, use static file name everywhere

* cmake/ui : bundle UI assets in an archive

* ui : run prettier on post-build.js

---------

Co-authored-by: Alde Rojas <hello@alde.dev>
* ui: keep original file name and path

* fix nocache
ngxson and others added 23 commits June 25, 2026 16:26
* misc: update labels

* bring back examples, add mtmd
* server: use status code 403 for disabled features

* cont

* fix test case
* Add failing test-case to test-backend-ops

Extracted from ggml-org#24072

* Minimize repro with help of AI

N = 8 * (65535 - 1) + 1 = 524273

* Port and adjust workaround from LostRuins@0ba7983

Fall-back should share code, also relax y-z constraint to be inclusive

* Add test-case + fallback also for y dim

* Fix x-guards: the limit is 2^{31}-1, so the check must be inclusive of INT_MAX

* Fix overflow problems for transposed copy kernel
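For context on the numbers: CUDA caps gridDim.x at 2^31-1 and gridDim.y/gridDim.z at 65535, all inclusive. The Python sketch below assumes the repro's factor 8 is the number of elements each block covers along y (an assumption; the commit does not say). Under that reading, N lands exactly on the 65535-block limit, so an exclusive `<` check would treat a legal launch as out of range.

```python
MAX_GRID_X = 2**31 - 1  # INT_MAX: inclusive upper bound for gridDim.x
MAX_GRID_YZ = 65535     # inclusive upper bound for gridDim.y and gridDim.z


def ceil_div(a: int, b: int) -> int:
    return -(-a // b)


def fits_grid_y(n: int, elems_per_block: int) -> bool:
    """Inclusive check: exactly MAX_GRID_YZ blocks is still a legal launch."""
    return ceil_div(n, elems_per_block) <= MAX_GRID_YZ


# Repro size from the commit message; treating 8 as elems_per_block is an assumption.
N = 8 * (65535 - 1) + 1
```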
Sync fork master with upstream llama.cpp (211 commits)
Build llama.cpp from this branch's source against the latest TheRock
nightly ROCm tarball for gfx1151/gfx1150 (Ubuntu), then smoke-test each
artifact on the matching self-hosted Strix Halo GPU runner with llama-cli.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
The 7.14.0a20260609+ nightlies regressed libhsa-runtime64 (build
1b2a555677), segfaulting in GpuAgent::InitDma on gfx115x runners.
See ROCm/TheRock#5763.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
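The commit body does not show how the workflow steers around the regressed builds. One way to express it, as a Python sketch assuming nightly versions always have the `X.Y.Za<YYYYMMDD>` shape seen above, is to pick the newest nightly strictly older than the first bad one (function names hypothetical):

```python
import re
from typing import Iterable, Tuple

# Assumed nightly version shape, e.g. "7.14.0a20260608".
NIGHTLY_RE = re.compile(r"^(\d+)\.(\d+)\.(\d+)a(\d{8})$")


def nightly_key(version: str) -> Tuple[int, ...]:
    """(major, minor, patch, date) so versions compare as tuples."""
    m = NIGHTLY_RE.match(version)
    if m is None:
        raise ValueError(f"unrecognized nightly version: {version}")
    return tuple(int(g) for g in m.groups())


def latest_before(versions: Iterable[str], first_bad: str) -> str:
    """Newest nightly strictly older than the first regressed build."""
    cutoff = nightly_key(first_bad)
    good = [v for v in versions if nightly_key(v) < cutoff]
    if not good:
        raise LookupError("no nightly predates the regression")
    return max(good, key=nightly_key)
```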
The previous gfx1150 runner could not reach huggingface.co (curl 35,
connection reset) during model download. Try a different runner.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Self-hosted runners (e.g. the gfx1150 box) can't reach huggingface.co
(curl 35, connection reset) but can reach GitHub. Fetch the GGUF from a
fixed release asset (jimw567/llamacpp-test-assets) with retries instead.
Revert gfx1150 runner to the working label (runs-on matches labels, not
machine names, so the prior machine-name value never scheduled).

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
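A Python sketch of fetch-with-retries, for illustration only: the workflow's actual download step (and the asset name inside jimw567/llamacpp-test-assets) is not shown here, and `with_retries`/`download` are hypothetical names. It retries on OSError, which covers connection resets (the Python analogue of the curl 35 failure above) and urllib's URLError, with exponential backoff.

```python
import shutil
import time
import urllib.request
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(fn: Callable[[], T], attempts: int = 5, base_delay: float = 2.0,
                 sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fn, retrying on OSError with exponential backoff (2s, 4s, 8s, ...)."""
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for i in range(attempts):
        try:
            return fn()
        except OSError:
            if i == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2**i)
    raise AssertionError("unreachable")


def download(url: str, dest: str) -> None:
    def fetch() -> None:
        # Reopening dest with "wb" discards any partial file from a failed attempt.
        with urllib.request.urlopen(url, timeout=60) as resp, open(dest, "wb") as out:
            shutil.copyfileobj(resp, out)
    with_retries(fetch)
```

Injecting `sleep` keeps the backoff testable without real delays.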
…tput

The binary failed to start on runners lacking libatomic1 (exit 127); bundle
libatomic.so.1 with the artifact so $ORIGIN RPATH resolves it.

Replace stale assertions ("offloaded 29/29 layers to GPU", "</think>") with
functional checks against current output: ROCm GPU selected, all layers +
output offloaded to ROCm0, and tokens actually generated (eval timing).
Verified on a Strix Halo gfx1151 box: passes on real GPU output, fails on
CPU-only/no-generation output.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
The previous assertion only checked GPU offload + that >=1 token was
emitted. Add functional checks that the model produced a non-empty answer
and non-empty reasoning trace, and generated >=20 tokens. Verified against
real gfx1151 output: passes on a genuine response, fails on empty/one-token.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Switch the llama-cli smoke test to a single-correct-answer prompt
("What is 2 + 2?") with greedy decoding (--temp 0), and assert the
parsed assistant content is 4. This checks the GPU math path end to
end rather than only that some text was generated.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
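The checks from the last two commits, collected into one Python sketch. `SmokeResult` and its fields are hypothetical stand-ins for whatever the workflow parses out of the llama-cli output, and keeping the 20-token floor alongside the "4" check is an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SmokeResult:
    # Hypothetical parsed fields; how they are extracted is not shown here.
    content: str      # assistant answer
    reasoning: str    # reasoning trace
    n_generated: int  # tokens generated


MIN_TOKENS = 20  # threshold from the earlier commit


def smoke_failures(r: SmokeResult, expected: str = "4") -> List[str]:
    """Return a list of failed checks; an empty list means the smoke test passed."""
    failures = []
    if r.content.strip() != expected:
        failures.append(f"answer {r.content.strip()!r} != {expected!r}")
    if not r.reasoning.strip():
        failures.append("empty reasoning trace")
    if r.n_generated < MIN_TOKENS:
        failures.append(f"only {r.n_generated} tokens generated")
    return failures
```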
- Publish a dated GitHub Release (b<YYYYMMDD>) with gfx1151 + gfx1150
  tar.gz binaries, gated to the nightly workflow_dispatch (create_release
  input) so push/PR/manual runs never publish.
- Wrap the llama-cli smoke test in a 180s timeout that dumps GPU/driver
  diagnostics on hang, so a stuck runner fails fast with visible logs.
- Temporarily disable the gfx1150 on-hardware test (its runner hangs in
  GPU inference); gfx1150 binaries are still built and released.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
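A Python sketch of the release gating, tag format, and hang guard described above. The b<YYYYMMDD> tag and the 180 s budget come from the commit; the function names are hypothetical, and returning 124 on timeout mirrors GNU `timeout` rather than anything confirmed for this workflow.

```python
import datetime
import subprocess
from typing import Callable, Sequence

TIMEOUT_EXIT_CODE = 124  # mirrors GNU timeout's exit status


def release_tag(day: datetime.date) -> str:
    """Dated release tag of the form b<YYYYMMDD>."""
    return "b" + day.strftime("%Y%m%d")


def should_publish(event_name: str, create_release: bool) -> bool:
    # Only a workflow_dispatch run with create_release set (the nightly)
    # publishes; push, pull_request, and plain manual runs do not.
    return event_name == "workflow_dispatch" and create_release


def run_with_timeout(cmd: Sequence[str], timeout_s: float,
                     on_hang: Callable[[], None]) -> int:
    """Run cmd; if it exceeds timeout_s, kill it, dump diagnostics, and fail."""
    try:
        return subprocess.run(cmd, timeout=timeout_s).returncode
    except subprocess.TimeoutExpired:  # run() has already killed the child
        on_hang()  # e.g. collect GPU/driver state into the job log
        return TIMEOUT_EXIT_CODE
```

In the workflow this would wrap the llama-cli invocation with `timeout_s=180` and an `on_hang` callback that dumps GPU/driver state.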
Mirror the lemonade llamacpp-rocm setup: build the grouped gfx110X
target (gfx1100;gfx1101;gfx1102;gfx1103) from the gfx110X-all ROCm
tarball, covering desktop RDNA3 and the Hawk Point/Phoenix iGPU
(Radeon 760M/780M). Decouple the build matrix into gfx_target /
s3_target / gpu_targets so per-target tarball suffixes and GPU target
lists are explicit. Publish the gfx110X binary in the nightly release.
No on-hardware test leg (no Hawk Point runner).

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
gfx1153 ships as its own standalone TheRock tarball (no bundle suffix),
same as gfx1151/gfx1150. Build+release only — no on-hardware test runner.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Collapse the 4-leg per-family matrix (gfx1151/gfx1150/gfx1153/gfx110X)
into a single build sourced from TheRock's multi-arch tarball. One fat
binary covers all current CI arches (gfx1100-1103, gfx1150/1151/1153)
and ships as one universal release archive instead of four mostly-
duplicate per-family archives.

The multi-arch tarball is streamed and pruned at the tar level: drop all
.kpack and the Tensile DBs of every non-target arch. The GEMM path
llama.cpp uses works from the per-arch Tensile DB alone (validated on
gfx1151 hardware: rocBLAS sgemm succeeds with ROCM_KPACK_DISABLE=1), so
no .kpack files are bundled. The gfx1151 hardware test job is the
end-to-end safety net.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
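A Python sketch of tar-level pruning with the standard `tarfile` module in stream mode, so the multiarch tarball never has to be fully extracted. It assumes per-arch Tensile files live under a `rocblas/library/` directory and carry the gfx ID in their file name; both are assumptions about TheRock's layout, and `TENSILE_DIRS`, `keep_member`, and `prune_tarball` are hypothetical names.

```python
import re
import tarfile
from typing import BinaryIO, Iterable

TENSILE_DIRS = ("/rocblas/library/",)  # assumed location of per-arch Tensile DBs
ARCH_RE = re.compile(r"gfx[0-9a-f]{3,4}")  # e.g. gfx1151, gfx942, gfx90a


def keep_member(name: str, targets: frozenset) -> bool:
    if name.endswith(".kpack"):
        return False  # no .kpack files are bundled
    if any(d in "/" + name for d in TENSILE_DIRS):
        arches = set(ARCH_RE.findall(name.rsplit("/", 1)[-1]))
        if arches and not (arches & targets):
            return False  # Tensile DB for a non-target arch
    return True


def prune_tarball(src: BinaryIO, dst: BinaryIO, targets: Iterable[str]) -> int:
    """Stream src (any compression) into dst as tar.gz; return members dropped."""
    wanted = frozenset(targets)
    dropped = 0
    with tarfile.open(fileobj=src, mode="r|*") as tin, \
         tarfile.open(fileobj=dst, mode="w|gz") as tout:
        for member in tin:
            if not keep_member(member.name, wanted):
                dropped += 1
                continue
            # Stream mode is sequential: copy the current member's data now.
            data = tin.extractfile(member) if member.isfile() else None
            tout.addfile(member, data)
    return dropped
```

For the build described above, `targets` would be the CI arch list: gfx1100, gfx1101, gfx1102, gfx1103, gfx1150, gfx1151, gfx1153.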
The "universal" vs "multi-arch" wording was redundant — both describe one
package covering many arches. Standardize on "multiarch" to match TheRock's
upstream vocabulary. Renames the artifact/archive to
llama-<TAG>-ubuntu-rocm-multiarch-x64 and updates comments + release body.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Follow-up to the package rename: drop the hyphenated "multi-arch" prose in
comments, echoes, and the release body for one consistent spelling. The
upstream nightlies endpoint (tarball-multi-arch) and the therock-dist-linux-
multiarch- filenames are external names and left untouched.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
jimw567 force-pushed the jimwu.gfx11-sync branch from 41013ff to 5692e0b June 26, 2026 19:33
jimw567 changed the title from "ci(gfx11): add gfx11 ROCm build + on-hardware test, rebased on synced master" to "gfx11: sync ROCm CI + RDNA3.5 MMQ device table onto upstream-synced master" Jun 26, 2026
jimw567 changed the base branch from master to gfx11 June 26, 2026 19:36
Jim Wu and others added 3 commits June 26, 2026 19:30
Fork-local: upstream targets [self-hosted, fast] runners that don't exist
in this fork, so these two lint checks queue forever. Point them at
GitHub-hosted ubuntu-24.04. Marked DO NOT UPSTREAM in each file.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Fork-local: add gfx11 to the push/pull_request branch filters so the two
lint checks actually run on gfx11-targeted PRs (they previously only
triggered on master). Marked DO NOT UPSTREAM.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
jimw567 requested a review from Annieren June 28, 2026 04:36

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.