Add a HIP/ROCm device backend for KV-transfer and Hamming kernels by jeffdaily · Pull Request #1021 · ModelEngine-Group/unified-cache-management

jeffdaily · 2026-06-11T22:34:38Z

This adds a new RUNTIME_ENVIRONMENT=rocm device backend (PLATFORM=rocm in
setup.py) alongside the existing cuda/ascend/maca/musa/simu backends, so the
KV block-transfer kernels, the H2D/D2H store path, and the sparse Hamming
scoring kernel build and run on AMD GPUs via HIP. The backend is purely
additive: the existing backends are not modified. We have made every effort
to leave the NVIDIA build unchanged -- every source change to a shared file is
behind a __CUDA_ARCH__ or USE_ROCM guard that the CUDA build does not
compile, and the compat shim is only placed on the include path for the rocm
build, never the cuda one. Select it with export PLATFORM=rocm (or
-DRUNTIME_ENVIRONMENT=rocm when invoking CMake directly).

Review order: start with ucm/shared/vendor/hip_compat/ (the compat shim), then
the three new rocm/CMakeLists.txt arms (trans, store, sparse ham_dist), then
the two guarded kernel sources, then the docs and Windows host-build guards.

The compat shim resolves the project's #include <cuda_runtime.h>/<cuda.h>
to <hip/hip_runtime.h> and aliases the small cuda* runtime surface in use
(Malloc/Free/Memcpy[Async]/HostMalloc/HostRegister/Stream*/Event*) to hip*.
Each rocm CMake arm calls enable_language(HIP)/find_package(hip), reuses the
existing cuda .cc/.cu sources marked LANGUAGE HIP, and reads
CMAKE_HIP_ARCHITECTURES (defaulting to gfx90a only when unset) so other AMD
targets need no source edit.

Root cause of the one non-mechanical change: the two grid-stride copy kernels
used inline PTX (ld.global.cs / st.volatile.global vectorized loads/stores)
that does not exist on AMD. The PTX is now guarded by
#if defined(__CUDA_ARCH__) with a HIP branch doing a plain vectorized uint4
load/store (32-byte and 16-byte units). ROCm 7.2.1 does not provide the
__ldcs/__stcg/__stcs cache-streaming builtins, and those PTX qualifiers are
cache-policy hints rather than visibility semantics for this memcpy (each
thread writes a disjoint unit and the only consumer is the host after a stream
sync), so the plain copy is functionally equivalent. FlashInfer's cp_async.cuh
already selects its portable non-PTX fallback under hipcc, so it needed no
change.

The sparse Hamming module links libtorch. operator.h now includes
<ATen/hip/HIPContext.h> under USE_ROCM (the cuda-spelled context header pulls
in NVIDIA-only cuda_runtime_api.h/cusparse.h, while the hipified header exposes
the same c10::cuda::getCurrentCUDAStream backed by HIP). It builds at C++20
(torch 2.x headers use requires-clauses), without -ffast-math (the kernel uses
INFINITY as a masking sentinel that finite-math would drop), and with
pybind11_add_module(NO_EXTRAS) to avoid pybind's default LTO+strip dropping the
module init symbol under -fvisibility=hidden.

A set of WIN32-guarded host-build fixes let the backend also compile with the
clang-cl toolchain on Windows ROCm (Linux-only compiler/linker flags guarded,
three header-only infra sub-libraries changed from OBJECT to INTERFACE so the
linker language is determinable under Ninja+clang-cl, a getpid shim, metrics
symbol export, and excluding a POSIX-only thread test). The GPU device code is
unchanged by those guards.

Docs: the supported-platform matrix gains a ROCm/AMD row, and the vLLM and
SGLang quickstarts document PLATFORM=rocm beside PLATFORM=cuda.

This work was authored with the assistance of Claude, an AI assistant.

Test Plan:
Built and validated on real GPUs: Linux gfx90a (MI250X) and gfx1100 (Radeon
Pro W7800), and Windows gfx1201 (RX 9070 XT), gfx1101 (Radeon PRO V710), and
gfx1151 (Radeon 8060S), all on ROCm. All GPU tests run serially with one GPU
visible (HIP_VISIBLE_DEVICES=0).

Store/trans C++ surface plus unit tests:

cmake -S . -B build_rocm -DRUNTIME_ENVIRONMENT=rocm -DBUILD_UCM_STORE=ON \
  -DBUILD_UNIT_TESTS=ON -DCMAKE_HIP_ARCHITECTURES=gfx90a \
  -DCMAKE_HIP_COMPILER=/opt/rocm/llvm/bin/clang++ -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_CXX_FLAGS="-Wno-error=unused-result"
cmake --build build_rocm -j16
HIP_VISIBLE_DEVICES=0 ctest --test-dir build_rocm -j1

79/80 pass on Linux. The copy-kernel correctness gates all pass:
UCTransUnitTest.{CopyDataWithCE,CopyDataWithSM,CopyDataBatchWithSM} (byte-exact
host<->device round-trip) and the UCPosixTrans*/UCCacheTransBuffer cases
(store H2D/D2H batch copy with readback). The one failure,
UCMetricsUT.ConcurrentUpdateAndCollect, is a pre-existing CPU-only
multi-threaded metrics counter test (untouched ucm/shared/metrics), unrelated
to the GPU backend.

Hamming kernel (against a ROCm PyTorch):

cmake -S . -B build_sparse -DRUNTIME_ENVIRONMENT=rocm -DBUILD_UCM_STORE=OFF \
  -DBUILD_UCM_SPARSE=ON -DCMAKE_HIP_ARCHITECTURES=gfx90a \
  -DCMAKE_HIP_COMPILER=/opt/rocm/llvm/bin/clang++ -DCMAKE_BUILD_TYPE=Release \
  -DPython_EXECUTABLE=<rocm-torch-python> -DCMAKE_CXX_FLAGS="-Wno-error=unused-result"
cmake --build build_sparse -j16 --target hamming
HIP_VISIBLE_DEVICES=0 HAMMING_DIR=<dir with hamming*.so> \
  python ucm/sparse/test/gsa/test_hamming_rocm_ref.py

The new test computes an independent CPU popcount reference for the paged
block-mode score: mla (no kv reduction) matches within fp16 rounding, gqa
(min over kv heads) matches exactly, two-run output is bit-identical.

The NVIDIA path was checked to be unaffected: with PLATFORM=cuda the guards
select the original inline-PTX branch (compile-checked with nvcc 12.8 at
sm_80; the emitted PTX still contains the ld.global.cs / st.volatile.global
streaming ops), so the CUDA build compiles the same device code as before.

This adds a new RUNTIME_ENVIRONMENT=rocm device backend (PLATFORM=rocm in setup.py) alongside the existing cuda/ascend/maca/musa/simu backends, so the KV block-transfer kernels, the H2D/D2H store path, and the sparse Hamming scoring kernel build and run on AMD GPUs via HIP. The backend is purely additive: the existing backends are not modified. We have made every effort to leave the NVIDIA build unchanged -- every source change to a shared file is behind a `__CUDA_ARCH__` or `USE_ROCM` guard that the CUDA build does not compile, and the compat shim is only placed on the include path for the rocm build, never the cuda one. Select it with `export PLATFORM=rocm` (or `-DRUNTIME_ENVIRONMENT=rocm` when invoking CMake directly). Review order: start with ucm/shared/vendor/hip_compat/ (the compat shim), then the three new rocm/CMakeLists.txt arms (trans, store, sparse ham_dist), then the two guarded kernel sources, then the docs and Windows host-build guards. The compat shim resolves the project's `#include <cuda_runtime.h>`/`<cuda.h>` to <hip/hip_runtime.h> and aliases the small cuda* runtime surface in use (Malloc/Free/Memcpy[Async]/HostMalloc/HostRegister/Stream*/Event*) to hip*. Each rocm CMake arm calls enable_language(HIP)/find_package(hip), reuses the existing cuda .cc/.cu sources marked LANGUAGE HIP, and reads CMAKE_HIP_ARCHITECTURES (defaulting to gfx90a only when unset) so other AMD targets need no source edit. Root cause of the one non-mechanical change: the two grid-stride copy kernels used inline PTX (ld.global.cs / st.volatile.global vectorized loads/stores) that does not exist on AMD. The PTX is now guarded by `#if defined(__CUDA_ARCH__)` with a HIP branch doing a plain vectorized uint4 load/store (32-byte and 16-byte units). ROCm 7.2.1 does not provide the __ldcs/__stcg/__stcs cache-streaming builtins, and those PTX qualifiers are cache-policy hints rather than visibility semantics for this memcpy (each thread writes a disjoint unit and the only consumer is the host after a stream sync), so the plain copy is functionally equivalent. FlashInfer's cp_async.cuh already selects its portable non-PTX fallback under hipcc, so it needed no change. The sparse Hamming module links libtorch. operator.h now includes <ATen/hip/HIPContext.h> under USE_ROCM (the cuda-spelled context header pulls in NVIDIA-only cuda_runtime_api.h/cusparse.h, while the hipified header exposes the same c10::cuda::getCurrentCUDAStream backed by HIP). It builds at C++20 (torch 2.x headers use requires-clauses), without -ffast-math (the kernel uses INFINITY as a masking sentinel that finite-math would drop), and with pybind11_add_module(NO_EXTRAS) to avoid pybind's default LTO+strip dropping the module init symbol under -fvisibility=hidden. A set of WIN32-guarded host-build fixes let the backend also compile with the clang-cl toolchain on Windows ROCm (Linux-only compiler/linker flags guarded, three header-only infra sub-libraries changed from OBJECT to INTERFACE so the linker language is determinable under Ninja+clang-cl, a getpid shim, metrics symbol export, and excluding a POSIX-only thread test). The GPU device code is unchanged by those guards. Docs: the supported-platform matrix gains a ROCm/AMD row, and the vLLM and SGLang quickstarts document `PLATFORM=rocm` beside `PLATFORM=cuda`. This work was authored with the assistance of Claude, an AI assistant. Test Plan: Built and validated on real GPUs: Linux gfx90a (MI250X) and gfx1100 (Radeon Pro W7800), and Windows gfx1201 (RX 9070 XT), gfx1101 (Radeon PRO V710), and gfx1151 (Radeon 8060S), all on ROCm. All GPU tests run serially with one GPU visible (HIP_VISIBLE_DEVICES=0). Store/trans C++ surface plus unit tests: ``` cmake -S . -B build_rocm -DRUNTIME_ENVIRONMENT=rocm -DBUILD_UCM_STORE=ON \ -DBUILD_UNIT_TESTS=ON -DCMAKE_HIP_ARCHITECTURES=gfx90a \ -DCMAKE_HIP_COMPILER=/opt/rocm/llvm/bin/clang++ -DCMAKE_BUILD_TYPE=Release \ -DCMAKE_CXX_FLAGS="-Wno-error=unused-result" cmake --build build_rocm -j16 HIP_VISIBLE_DEVICES=0 ctest --test-dir build_rocm -j1 ``` 79/80 pass on Linux. The copy-kernel correctness gates all pass: UCTransUnitTest.{CopyDataWithCE,CopyDataWithSM,CopyDataBatchWithSM} (byte-exact host<->device round-trip) and the UCPosixTrans*/UCCacheTransBuffer cases (store H2D/D2H batch copy with readback). The one failure, UCMetricsUT.ConcurrentUpdateAndCollect, is a pre-existing CPU-only multi-threaded metrics counter test (untouched ucm/shared/metrics), unrelated to the GPU backend. Hamming kernel (against a ROCm PyTorch): ``` cmake -S . -B build_sparse -DRUNTIME_ENVIRONMENT=rocm -DBUILD_UCM_STORE=OFF \ -DBUILD_UCM_SPARSE=ON -DCMAKE_HIP_ARCHITECTURES=gfx90a \ -DCMAKE_HIP_COMPILER=/opt/rocm/llvm/bin/clang++ -DCMAKE_BUILD_TYPE=Release \ -DPython_EXECUTABLE=<rocm-torch-python> -DCMAKE_CXX_FLAGS="-Wno-error=unused-result" cmake --build build_sparse -j16 --target hamming HIP_VISIBLE_DEVICES=0 HAMMING_DIR=<dir with hamming*.so> \ python ucm/sparse/test/gsa/test_hamming_rocm_ref.py ``` The new test computes an independent CPU popcount reference for the paged block-mode score: mla (no kv reduction) matches within fp16 rounding, gqa (min over kv heads) matches exactly, two-run output is bit-identical. The NVIDIA path was checked to be unaffected: with PLATFORM=cuda the guards select the original inline-PTX branch (compile-checked with nvcc 12.8 at sm_80; the emitted PTX still contains the ld.global.cs / st.volatile.global streaming ops), so the CUDA build compiles the same device code as before.

flesher0813 · 2026-06-12T01:36:31Z

        match PLATFORM:
            case "cuda":
                cmake_args += ["-DRUNTIME_ENVIRONMENT=cuda"]
+            case "rocm":


setup.py adds PLATFORM=rocm support, but the warning message still lists only cuda/ascend/ascend-a3/musa/maca. Please include rocm there as well.

ygwpz · 2026-06-13T06:27:02Z

+    // streaming builtins; those are NVIDIA cache-policy hints, not semantics.
+    // A plain vectorized 32-byte copy is the portable equivalent.
+    const uint4* src4 = reinterpret_cast<const uint4*>(src);
+    uint4* dst4 = reinterpret_cast<uint4*>(const_cast<uint8_t*>(dst));


⚠️ Warning: The const_cast<uint8_t*>(dst) removes the volatile qualifier from the destination pointer. The original CUDA implementation uses st.volatile.global PTX which ensures proper memory visibility semantics for device-to-host transfers. While the comment correctly notes that cache-policy hints are NVIDIA-specific, the volatile qualifier itself has semantic meaning - it prevents compiler optimizations that could reorder or eliminate memory operations. On HIP, consider using __builtin_nontemporal_store or ensuring proper memory fence semantics to maintain equivalent visibility guarantees. The current plain copy may be functionally correct for this use case (disjoint writes with stream sync), but the removal of volatile should be explicitly justified in the comment.

ygwpz · 2026-06-13T06:27:04Z

+    // A plain vectorized 32-byte copy is the portable equivalent.
+    const uint4* src4 = reinterpret_cast<const uint4*>(src);
+    uint4* dst4 = reinterpret_cast<uint4*>(const_cast<uint8_t*>(dst));
+    dst4[0] = src4[0];


💡 Suggestion: The reinterpret_cast<uint4*> assumes the source pointer is properly aligned to 16-byte boundary for uint4 access. While this is likely true for the KV cache blocks (which are typically allocated with proper alignment), consider adding an assertion or documentation noting the alignment requirement. Misaligned access could cause undefined behavior or performance degradation on some AMD architectures.

ygwpz · 2026-06-13T06:27:05Z

    asm volatile("ld.global.cs.v2.u64 {%0, %1}, [%2];" : "=l"(a), "=l"(b) : "l"(src));
    asm volatile("st.global.cg.v2.u64 [%0], {%1, %2};" ::"l"(dst), "l"(a), "l"(b));
+#else
+    *reinterpret_cast<uint4*>(dst) =


⚠️ Warning: Similar to cuda_sm_kernel.cu, the reinterpret_cast<uint4*>(dst) removes both const and volatile qualifiers. The volatile qualifier on the destination pointer in D2HUnit was intentional to ensure proper visibility of writes to host memory. While the stream synchronization after the kernel provides the necessary fence, the explicit volatile in the original code served as a compiler barrier. On HIP, this plain copy should work correctly due to the subsequent sync, but the removal of volatile semantics should be documented as a deliberate design decision rather than an incidental consequence of the shim.

ygwpz · 2026-06-13T06:27:07Z

+#endif
 }

 inline __device__ void D2HUnit(volatile uint8_t* __restrict__ dst, const uint8_t* __restrict__ src)


⚠️ Warning: The const_cast<uint8_t*>(dst) removes the volatile qualifier. This mirrors the issue in H2DUnit above. The D2H path writes to host-registered memory, and the volatile qualifier ensured the compiler wouldn't optimize away or reorder these writes. The HIP fallback relies on stream synchronization for correctness, which is valid, but the comment should clarify that volatile semantics are intentionally replaced by stream sync barriers.

ygwpz · 2026-06-13T06:27:08Z

+#endif
 #include <cuda_runtime.h>
 #include <torch/script.h>



💡 Suggestion: The #include <cuda_runtime.h> after the conditional block relies on the hip_compat shim being on the include path for ROCm builds. This works correctly when the CMake configuration is proper, but could cause confusing build failures if the shim directory is not added to include paths. Consider adding a comment here noting this dependency, or adding an #ifdef USE_ROCM guard with an #error directive if the shim is missing (e.g., checking for UNIFIEDCACHE_HIP_COMPAT_CUDA_RUNTIME_H define).

ygwpz · 2026-06-13T06:27:10Z

+static constexpr hipError_t cudaSuccess = hipSuccess;
+static constexpr hipMemcpyKind cudaMemcpyHostToDevice = hipMemcpyHostToDevice;
+static constexpr hipMemcpyKind cudaMemcpyDeviceToHost = hipMemcpyDeviceToHost;
+static constexpr unsigned int cudaStreamNonBlocking = hipStreamNonBlocking;


💡 Suggestion: Using #define macros for function aliases prevents proper function overload resolution and could cause issues if CUDA APIs add new signatures in future versions. For this limited set of APIs, the macro approach is acceptable. However, consider adding inline wrapper functions instead of macros for type safety and better debugging support. Example: inline cudaError_t cudaMalloc(void** ptr, size_t size) { return hipMalloc(ptr, size); } would preserve the cudaError_t return type explicitly.

ygwpz · 2026-06-13T06:27:11Z

@@ -0,0 +1,30 @@
+if(NOT DEFINED CMAKE_HIP_ARCHITECTURES OR CMAKE_HIP_ARCHITECTURES STREQUAL "")
+    set(CMAKE_HIP_ARCHITECTURES "gfx90a")


💡 Suggestion: The architecture defaults to gfx90a (MI250X/MI210) which is appropriate for datacenter AMD GPUs, but may cause confusion for users with consumer Radeon GPUs (gfx1100, gfx1201, etc.). Consider adding a comment or CMake warning message when defaulting, suggesting users set CMAKE_HIP_ARCHITECTURES explicitly. The PR description mentions gfx1100 validation but the default doesn't reflect this broader support.

ygwpz · 2026-06-13T06:27:13Z

+string(REPLACE "-s " "" CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS}")
+string(REGEX REPLACE "(^| )-s$" "" CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS}")
+
+if(NOT DEFINED CMAKE_HIP_ARCHITECTURES OR CMAKE_HIP_ARCHITECTURES STREQUAL "")


💡 Suggestion: The string(REPLACE "-s " "" ...) and string(REGEX REPLACE ...) pattern for removing the -s flag is somewhat fragile. It assumes -s appears with a trailing space or at the end of the string. If -s appears without a space (e.g., -s -O2), the first REPLACE handles it, but other edge cases might slip through. Consider using a more robust approach like string(REGEX REPLACE "-s( |$)" "" ...) to handle all positions uniformly.

ygwpz · 2026-06-13T06:27:15Z

+            # element (kv*block_size + offset)*num_chunk + chunk. Index a flat
+            # view of the key tensor the same way so the reference matches the
+            # kernel's layout regardless of the host tensor's nominal shape.
+            block = key[phys].reshape(-1)


💡 Suggestion: The reference implementation uses a Python loop to compute popcount32, which is correct but slow. For test correctness verification this is acceptable. However, consider using torch.bincount or numpy's built-in popcount for better performance if this test needs to run on larger tensors. The current implementation is fine for the test sizes used (hd=576, block_size=64).

ygwpz · 2026-06-13T06:27:16Z

@@ -0,0 +1,18 @@
+if(NOT DEFINED CMAKE_HIP_ARCHITECTURES OR CMAKE_HIP_ARCHITECTURES STREQUAL "")
+    set(CMAKE_HIP_ARCHITECTURES "gfx90a")


💡 Suggestion: Same as the trans/rocm CMakeLists.txt - the gfx90a default should be documented or warned about. Users building on consumer AMD GPUs will need to override this.

jeffdaily requested review from FangRun2, Infinite666, Tarrei, flesher0813, mag1c-h, wangwenxin0312, wuhuxiao and ygwpz as code owners June 11, 2026 22:34

jeffdaily added a commit to jeffdaily/moat that referenced this pull request Jun 11, 2026

UCM: lead -> pr-open (ModelEngine-Group/unified-cache-management#1021)

8eead7c

flesher0813 reviewed Jun 12, 2026

View reviewed changes

ygwpz reviewed Jun 13, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add a HIP/ROCm device backend for KV-transfer and Hamming kernels#1021

Add a HIP/ROCm device backend for KV-transfer and Hamming kernels#1021
jeffdaily wants to merge 1 commit into
ModelEngine-Group:developfrom
jeffdaily:moat-port

jeffdaily commented Jun 11, 2026

Uh oh!

flesher0813 Jun 12, 2026

Uh oh!

ygwpz Jun 13, 2026

Uh oh!

ygwpz Jun 13, 2026

Uh oh!

ygwpz Jun 13, 2026

Uh oh!

ygwpz Jun 13, 2026

Uh oh!

ygwpz Jun 13, 2026

Uh oh!

ygwpz Jun 13, 2026

Uh oh!

ygwpz Jun 13, 2026

Uh oh!

ygwpz Jun 13, 2026

Uh oh!

ygwpz Jun 13, 2026

Uh oh!

ygwpz Jun 13, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

		@@ -0,0 +1,30 @@
		if(NOT DEFINED CMAKE_HIP_ARCHITECTURES OR CMAKE_HIP_ARCHITECTURES STREQUAL "")
		set(CMAKE_HIP_ARCHITECTURES "gfx90a")

		@@ -0,0 +1,18 @@
		if(NOT DEFINED CMAKE_HIP_ARCHITECTURES OR CMAKE_HIP_ARCHITECTURES STREQUAL "")
		set(CMAKE_HIP_ARCHITECTURES "gfx90a")

Conversation

jeffdaily commented Jun 11, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants