gfx11: sync ROCm CI + RDNA3.5 MMQ device table onto upstream-synced master #30
Open
jimw567 wants to merge 231 commits into
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* server : skip checkpoints beyond pos_next
* cont : update comment + TODO + ref

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
…nts (ggml-org#24371)

* vocab : refactor normalizer flags into options struct, add strip_accents
* Update src/llama-vocab.h

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-vocab.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* vulkan: use medium matmul tile on Asahi Linux
* vulkan: switch Apple detection to Honeykrisp driver id
Fixes build/CI after ggml-org#24306.
* opencl: add q5_0 adreno support
* opencl: add q5_1 adreno support
* opencl: cosmetic fix

---------

Co-authored-by: Li He <lih@qti.qualcomm.com>
* restore SYCL build and release, remove github cache
* modify for test only
* verify the ccache is used
* remove debug code change
* rm duplicate action, update key in ccache
* add action ccache-clear after building in both ubuntu and windows
* set %NUMBER_OF_PROCESSORS% in windows build
* cuda: support concat for scalar types
* Update concat.cu
* fix metal ci issue
* llama : enable layer input extraction
* spec: support eagle3
* eagle3: fix params bug
* eagle3: support Gemma4 eagle3 from RedHatAI
* eagle3: set sync when get features from target

Co-authored-by: tnhnyzc <115956684+tnhnyzc@users.noreply.github.com>

* eagle3 : fix ubatch handling in embd_layer_inp extraction and encoder

Co-authored-by: Doğaç Eldenk <dogacel@gmail.com>

* eagle3: adapt to upstream changes
* eagle3: fix rebase issues and adapt to upstream changes
* eagle3: exclude the eagle3 arch from test-llama-archs
* eagle3: fix editorconfig check failures
* eagle3: fix multi-seq issue in d2t vocab mapping
* cont : minor style / clean-up
* spec : remove `common_speculative_setup_draft_model()`
* llama : clean-up unused API
* eagle3: set d2t vocab mapping in decode graph
* cont : assert layer inputs are configured
* hparams : use n_embd_inp instead of n_embd_target_features
* eagle3: make output.weight optional and inherit from target model when needed
* hparams : generic norm-before-residual param
* llama-ext : consistent names
* cont : fix
* hparams : remove target_hidden_size
* cparams : rename output_layer_inp -> embeddings_layer_inp
* arch : reuse ATTN_NORM_2 instead of adding new hidden norm
* llama : clean-up names
* cont : add assert + comment
* Update conversion/llama.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: tnhnyzc <115956684+tnhnyzc@users.noreply.github.com>
Co-authored-by: Doğaç Eldenk <dogacel@gmail.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* ui: bake jpeg exif orientation into uploaded images

stb_image in mtmd ignores exif metadata, so rotated smartphone photos reach the model with raw pixel orientation. The webui now reads the exif orientation tag at send time and feeds it into the existing capImageDataURLSize canvas pass: the browser applies the rotation when decoding, so capped images come out upright for free, and images under the cap threshold get a single plain redraw when orientation > 1. At most one re-encode ever happens per image. Upright jpegs with capping disabled pass through untouched, bit perfect.

Adds jpeg-orientation.ts with a minimal exif parser working on a bounded base64 prefix (both endianness, returns 1 on any malformed input) and unit tests against handcrafted jpeg byte streams.

* ui: move jpeg exif constants into lib/constants
* ui: add browser test for jpeg orientation and capping

Covers capImageDataURLSize end to end in chromium with real Pillow generated jpeg fixtures across exif orientations 1/3/5/6/8: upright quadrant colors checked pixel-wise, expected dimensions with and without capping, no orientation tag left in the output, and strict passthrough when nothing needs rewriting.
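The "expected dimensions" check mentioned above depends on which EXIF orientations swap width and height. The following is a minimal Python sketch of that rule only, not the webui's TypeScript code; `upright_size` is a hypothetical helper, and treating any out-of-range value as orientation 1 mirrors the "returns 1 on any malformed input" behavior described in the commit message.

```python
from typing import Tuple

# EXIF orientations 5, 6, 7 and 8 include a 90-degree rotation or a transpose,
# so the upright image has its width and height swapped.
TRANSPOSING_ORIENTATIONS = (5, 6, 7, 8)


def upright_size(width: int, height: int, orientation: int) -> Tuple[int, int]:
    """Return the (width, height) of the image once drawn upright.

    Hypothetical helper for illustration; values outside 1..8 are treated
    like orientation 1 (no change).
    """
    if orientation in TRANSPOSING_ORIENTATIONS:
        return height, width
    return width, height
```

For a 400x300 stored image, orientations 1 and 3 keep the stored dimensions, while 5, 6 and 8 yield a 300x400 upright image.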
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* feat: Add basic PWA support and service worker for offline caching
* feat: Vite PWA implementation WIP
* feat: Improve PWA icons generation
* feat: Add PWA workbox to server routes
* feat: Include `version.json` in static assets
* feat: Add HTTP cache headers for PWA static assets
* feat: Update app name for `apple-mobile-web-app-title`
* feat: Implement PWA versioning and automatic update detection
* chore: Update `.gitignore` files
* feat: Splash Screens
* feat: Add dark mode favicon support
* refactor: Cleanup
* fix: Use dark logo for dark splash screens
* refactor: Simplify favicons SVG code
* fix: Adjust caching and polling for reliable service worker updates
* fix: Add missing favicon entry
* fix: Align PWA service worker configuration with SvelteKit build structure
* fix: Replace hashed bundle paths with versioned static paths
* test: Add PWA tests
* ci: Add build output for unit tests
* refactor: Cleanup
* fix: Server build & release versioning
* chore: Update package-lock.json
* chore: Increase PWA cache size
* chore: Update packages
* feat: Update favicons
* refactor: Post-merge fix
* feat: support explicit build version for PWA cache busting
* fix: CI
* feat: Improve PWA Refresh Alert UI
* feat: Add toggleable build version display
* refactor: Cleanup
* feat: Add version mismatch detection and manual app reload
* refactor: replace dynamic imports with static
* refactor: Cleanup
* feat: Add safe space for `pwa-<size>.png` rendered icons
* fix: use relative paths for PWA assets to support base path deployment
* feat: add PWA mode detection via URL query parameter
* feat: Use ?cache=true for SW-cached PWA assets
* refactor: Build process cleanup
* refactor: Decouple PWA versioning and remove ?cache=true workaround
* chore: Update README logo
* feat: Include PWA Assets generation in build script
* refactor: `usePwa` hook for core layout
* fix: Relativize base vite plugin
* fix: remove unnecessary backslash escapes in test regexes
* test: update static asset paths for API Key test
* refactor: Move SvelteKit PWA Options config to constants
* ui: fix update notification never appearing

Keep the PWA hook object intact instead of destructuring needRefreshByStorage, which freezes the reactive getter. Also exclude loading.html from PWA precache to prevent 404 errors and broken SW installation.
…rg#24517) When reasoning-budget is set in model.ini, the per-request thinking_budget_tokens from the WebUI was ignored because the model.ini value took unconditional precedence. Swap the precedence so the WebUI per-request value is checked first, with the model.ini value serving as a fallback default. Assisted-by: pi:llama.cpp/Qwen3.6-27B
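The precedence swap described above can be sketched as follows. This is a minimal Python illustration, not the server's actual code; `resolve_thinking_budget` and both parameter names are hypothetical, and `None` is an assumed stand-in for "value not provided".

```python
from typing import Optional


def resolve_thinking_budget(
    request_budget: Optional[int],
    model_ini_budget: Optional[int],
) -> Optional[int]:
    """Pick the reasoning budget for one request (hypothetical helper).

    The per-request value from the WebUI is checked first; the model.ini
    value is only a fallback default. None means "not set" (an assumption).
    """
    if request_budget is not None:
        return request_budget
    return model_ini_budget
```

Before the fix, the equivalent logic would have returned the model.ini value whenever it was set, which is why the per-request value was ignored.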
* unbreak release harder
* missed one
* remove missing test for now
* mtmd: add batching API
* wip
* first working version (gemma4v)
* add arg
* nits
* wire up support_batch()
* fix 0.0 output embd
* fix audio
* nits
* refactor a bit
* nits
* fix non-batching case
* fix comment
* fix sycl links in release notes
* remove extra line
* server: clean up static assets handling
* nits
* simplify file name handling, use static file name everywhere
* cmake/ui : bundle UI assets in an archive
* ui : run prettier on post-build.js

---------

Co-authored-by: Alde Rojas <hello@alde.dev>
* ui: keep original file name and path
* fix nocache
* misc: update labels
* bring back examples, add mtmd
* server: use status code 403 for disabled features
* cont
* fix test case
* Add failing test-case to test-backend-ops

Extracted from ggml-org#24072

* Minimize repro with help of AI

N = 8 * (65535 - 1) + 1 = 524273

* Port and adjust workaround from LostRuins@0ba7983

Fall-back should share code, also relax y-z constraint to be inclusive

* Add test-case + fallback also for y dim
* Fix x-guards which is 2^{31}-1, so inclusive of INT_MAX
* Fix overflow problems for transposed copy kernel
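The repro size in the commit message can be checked with a few lines of arithmetic. This is a Python sketch under one assumption: the factor 8 is read here as the number of elements handled per block along the affected dimension, which the commit message does not state. The limits themselves (65535 for `gridDim.y` and `gridDim.z`, 2^31 - 1 for `gridDim.x`) are the standard CUDA maxima.

```python
MAX_GRID_YZ = 65535        # CUDA limit for gridDim.y and gridDim.z
MAX_GRID_X = 2**31 - 1     # CUDA limit for gridDim.x, equal to INT_MAX


def blocks_needed(n: int, per_block: int) -> int:
    """Ceiling division: number of blocks required to cover n elements."""
    return (n + per_block - 1) // per_block


def fits_in_grid_yz(blocks: int) -> bool:
    """Inclusive check: exactly MAX_GRID_YZ blocks is still a valid launch."""
    return blocks <= MAX_GRID_YZ


# Repro size from the commit message.
n = 8 * (65535 - 1) + 1
```

Under that reading, `n` is the smallest size that needs exactly 65535 blocks, so it exercises the boundary where an exclusive comparison would wrongly take the fallback path.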
# Conflicts: # README.md
Sync fork master with upstream llama.cpp (211 commits)
Build llama.cpp from this branch's source against the latest TheRock nightly ROCm tarball for gfx1151/gfx1150 (Ubuntu), then smoke-test each artifact on the matching self-hosted Strix Halo GPU runner with llama-cli. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
The 7.14.0a20260609+ nightlies regressed libhsa-runtime64 (build 1b2a555677), segfaulting in GpuAgent::InitDma on gfx115x runners. See ROCm/TheRock#5763. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
The previous gfx1150 runner could not reach huggingface.co (curl 35, connection reset) during model download. Try a different runner. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Self-hosted runners (e.g. the gfx1150 box) can't reach huggingface.co (curl 35, connection reset) but can reach GitHub. Fetch the GGUF from a fixed release asset (jimw567/llamacpp-test-assets) with retries instead. Revert gfx1150 runner to the working label (runs-on matches labels, not machine names, so the prior machine-name value never scheduled). Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
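The "fetch with retries" idea above can be sketched in Python as follows. This is an illustration only, not the workflow's actual shell step; `fetch_with_retries` is a hypothetical helper, and the attempt count and delay are assumed values.

```python
import time
from typing import Callable, Optional


def fetch_with_retries(
    fetch: Callable[[], bytes],
    attempts: int = 3,
    delay_s: float = 0.0,
) -> bytes:
    """Call fetch() until it succeeds or the attempts run out.

    Network failures such as a connection reset surface as OSError
    subclasses, so only those are retried; other exceptions propagate.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error: Optional[OSError] = None
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except OSError as err:
            last_error = err
            if attempt < attempts:
                time.sleep(delay_s)
    raise RuntimeError(f"download failed after {attempts} attempts") from last_error
```

In real use, `fetch` could wrap something like `urllib.request.urlopen(url).read()`; `urllib.error.URLError` is an `OSError` subclass, so it would be retried as well.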
…tput
The binary failed to start on runners lacking libatomic1 (exit 127); bundle
libatomic.so.1 with the artifact so $ORIGIN RPATH resolves it.
Replace stale assertions ("offloaded 29/29 layers to GPU", "</think>") with
functional checks against current output: ROCm GPU selected, all layers +
output offloaded to ROCm0, and tokens actually generated (eval timing).
Verified on a Strix Halo gfx1151 box: passes on real GPU output, fails on
CPU-only/no-generation output.
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
The previous assertion only checked GPU offload + that >=1 token was emitted. Add functional checks that the model produced a non-empty answer and non-empty reasoning trace, and generated >=20 tokens. Verified against real gfx1151 output: passes on a genuine response, fails on empty/one-token. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Switch the llama-cli smoke test to a single-correct-answer prompt
("What is 2 + 2?") with greedy decoding (--temp 0), and assert the
parsed assistant content is 4. This checks the GPU math path end to
end rather than only that some text was generated.
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
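A deterministic-answer assertion of this kind could look like the sketch below. It is written in Python for illustration and is not the workflow's actual check; both function names are hypothetical, and the input is assumed to be the assistant content already extracted from the llama-cli output (the extraction step is not shown because its format is not given here).

```python
import re
from typing import Optional


def final_integer(content: str) -> Optional[int]:
    """Return the last run of digits in the text as an int, or None."""
    numbers = re.findall(r"\d+", content)
    return int(numbers[-1]) if numbers else None


def answer_is_four(content: str) -> bool:
    """True when the last number the model wrote is 4."""
    return final_integer(content) == 4
```

Looking at the last number accepts both a bare "4" and a restated "2 + 2 = 4", while an empty response or a wrong sum fails.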
- Publish a dated GitHub Release (b<YYYYMMDD>) with gfx1151 + gfx1150 tar.gz binaries, gated to the nightly workflow_dispatch (create_release input) so push/PR/manual runs never publish.
- Wrap the llama-cli smoke test in a 180s timeout that dumps GPU/driver diagnostics on hang, so a stuck runner fails fast with visible logs.
- Temporarily disable the gfx1150 on-hardware test (its runner hangs in GPU inference); gfx1150 binaries are still built and released.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
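The dated tag and the timeout wrapper described above can be sketched as follows. This is a Python illustration, not the workflow's shell code; `release_tag` and `run_with_timeout` are hypothetical helpers, and returning `None` on timeout is an assumed convention that leaves the diagnostics dump to the caller.

```python
import datetime
import subprocess
from typing import Optional, Sequence


def release_tag(day: datetime.date) -> str:
    """Build a b<YYYYMMDD> tag for the given date."""
    return "b" + day.strftime("%Y%m%d")


def run_with_timeout(cmd: Sequence[str], timeout_s: float = 180.0) -> Optional[int]:
    """Run cmd and return its exit code, or None if it exceeded timeout_s.

    subprocess.run kills the child when the timeout expires, so a hung
    process cannot stall the job; the caller decides what to log.
    """
    try:
        done = subprocess.run(
            list(cmd), capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        return None
    return done.returncode
```

A caller would treat `None` as a failure and print whatever GPU or driver diagnostics are available before exiting non-zero.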
Mirror the lemonade llamacpp-rocm setup: build the grouped gfx110X target (gfx1100;gfx1101;gfx1102;gfx1103) from the gfx110X-all ROCm tarball, covering desktop RDNA3 and the Hawk Point/Phoenix iGPU (Radeon 760M/780M). Decouple the build matrix into gfx_target / s3_target / gpu_targets so per-target tarball suffixes and GPU target lists are explicit. Publish the gfx110X binary in the nightly release. No on-hardware test leg (no Hawk Point runner). Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
gfx1153 ships as its own standalone TheRock tarball (no bundle suffix), same as gfx1151/gfx1150. Build+release only — no on-hardware test runner. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Collapse the 4-leg per-family matrix (gfx1151/gfx1150/gfx1153/gfx110X) into a single build sourced from TheRock's multi-arch tarball. One fat binary covers all current CI arches (gfx1100-1103, gfx1150/1151/1153) and ships as one universal release archive instead of four mostly-duplicate per-family archives.

The multi-arch tarball is streamed and pruned at the tar level: drop all .kpack and the Tensile DBs of every non-target arch. The GEMM path llama.cpp uses works from the per-arch Tensile DB alone (validated on gfx1151 hardware: rocBLAS sgemm succeeds with ROCM_KPACK_DISABLE=1), so no .kpack files are bundled. The gfx1151 hardware test job is the end-to-end safety net.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
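The tar-level pruning described above can be sketched in Python with the standard `tarfile` module. This is an illustration, not the workflow's actual implementation: `keep_member` and `prune_tar` are hypothetical helpers, the `rocblas/library/` directory used to recognize Tensile DB files is an assumed layout, and the rule that an arch-specific file carries a `gfx...` token in its name is also an assumption.

```python
import re
import tarfile
from typing import BinaryIO, Sequence

TARGET_ARCHES = (
    "gfx1100", "gfx1101", "gfx1102", "gfx1103",
    "gfx1150", "gfx1151", "gfx1153",
)
TENSILE_DIR = "rocblas/library/"  # assumed location of the Tensile DBs


def keep_member(name: str, targets: Sequence[str] = TARGET_ARCHES) -> bool:
    """Decide whether one archive entry survives the prune."""
    if name.endswith(".kpack"):
        return False  # no .kpack files are bundled
    if TENSILE_DIR in name:
        arch = re.search(r"gfx[0-9a-z]+", name)
        if arch and arch.group(0) not in targets:
            return False  # Tensile DB of a non-target arch
    return True


def prune_tar(src: BinaryIO, dst: BinaryIO) -> None:
    """Stream src to dst, copying only the entries keep_member accepts."""
    with tarfile.open(fileobj=src, mode="r|*") as tin, \
         tarfile.open(fileobj=dst, mode="w|gz") as tout:
        for member in tin:
            if not keep_member(member.name):
                continue
            data = tin.extractfile(member) if member.isfile() else None
            tout.addfile(member, data)
```

Stream modes (`r|*`, `w|gz`) read and write sequentially, so the full tarball never has to be unpacked to disk before the unwanted entries are dropped.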
The "universal" vs "multi-arch" wording was redundant — both describe one package covering many arches. Standardize on "multiarch" to match TheRock's upstream vocabulary. Renames the artifact/archive to llama-<TAG>-ubuntu-rocm-multiarch-x64 and updates comments + release body. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Follow-up to the package rename: drop the hyphenated "multi-arch" prose in comments, echoes, and the release body for one consistent spelling. The upstream nightlies endpoint (tarball-multi-arch) and the therock-dist-linux-multiarch- filenames are external names and left untouched. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
41013ff to
5692e0b
Fork-local: upstream targets [self-hosted, fast] runners that don't exist in this fork, so these two lint checks queue forever. Point them at GitHub-hosted ubuntu-24.04. Marked DO NOT UPSTREAM in each file. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Fork-local: add gfx11 to the push/pull_request branch filters so the two lint checks actually run on gfx11-targeted PRs (they previously only triggered on master). Marked DO NOT UPSTREAM. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Summary
- `gfx11` branch onto the freshly synced `master` (now level with upstream after "Sync fork master with upstream llama.cpp (211 commits)" #28).
- `.github/workflows/build-gfx11-rocm.yml`: multi-arch ROCm build + on-hardware deterministic-generation test for RDNA3/3.5 gfx11 targets (gfx110X Hawk Point/Phoenix, gfx1150, gfx1153).
- `ggml/src/ggml-cuda/mmq.cuh`: RDNA3.5 MMQ device table (from "Add mmq device table for RDNA3.5" #25): sets mmq_y=64 and nwarps=4 for RDNA3.5 in both host and device paths.

Notes
- `GGML_CUDA_CC_IS_RDNA3_5` / `RDNA3_5` macros confirmed present in upstream `common.cuh` and `vendors/hip.h`.
- upstream/master.

Test plan