Add a HIP/ROCm device backend for KV-transfer and Hamming kernels#1021
Add a HIP/ROCm device backend for KV-transfer and Hamming kernels#1021jeffdaily wants to merge 1 commit into
Conversation
This adds a new RUNTIME_ENVIRONMENT=rocm device backend (PLATFORM=rocm in
setup.py) alongside the existing cuda/ascend/maca/musa/simu backends, so the
KV block-transfer kernels, the H2D/D2H store path, and the sparse Hamming
scoring kernel build and run on AMD GPUs via HIP. The backend is purely
additive: the existing backends are not modified. We have made every effort
to leave the NVIDIA build unchanged -- every source change to a shared file is
behind a `__CUDA_ARCH__` or `USE_ROCM` guard that the CUDA build does not
compile, and the compat shim is only placed on the include path for the rocm
build, never the cuda one. Select it with `export PLATFORM=rocm` (or
`-DRUNTIME_ENVIRONMENT=rocm` when invoking CMake directly).
Review order: start with ucm/shared/vendor/hip_compat/ (the compat shim), then
the three new rocm/CMakeLists.txt arms (trans, store, sparse ham_dist), then
the two guarded kernel sources, then the docs and Windows host-build guards.
The compat shim resolves the project's `#include <cuda_runtime.h>`/`<cuda.h>`
to <hip/hip_runtime.h> and aliases the small cuda* runtime surface in use
(Malloc/Free/Memcpy[Async]/HostMalloc/HostRegister/Stream*/Event*) to hip*.
Each rocm CMake arm calls enable_language(HIP)/find_package(hip), reuses the
existing cuda .cc/.cu sources marked LANGUAGE HIP, and reads
CMAKE_HIP_ARCHITECTURES (defaulting to gfx90a only when unset) so other AMD
targets need no source edit.
Root cause of the one non-mechanical change: the two grid-stride copy kernels
used inline PTX (ld.global.cs / st.volatile.global vectorized loads/stores)
that does not exist on AMD. The PTX is now guarded by
`#if defined(__CUDA_ARCH__)` with a HIP branch doing a plain vectorized uint4
load/store (32-byte and 16-byte units). ROCm 7.2.1 does not provide the
__ldcs/__stcg/__stcs cache-streaming builtins, and those PTX qualifiers are
cache-policy hints rather than visibility semantics for this memcpy (each
thread writes a disjoint unit and the only consumer is the host after a stream
sync), so the plain copy is functionally equivalent. FlashInfer's cp_async.cuh
already selects its portable non-PTX fallback under hipcc, so it needed no
change.
The sparse Hamming module links libtorch. operator.h now includes
<ATen/hip/HIPContext.h> under USE_ROCM (the cuda-spelled context header pulls
in NVIDIA-only cuda_runtime_api.h/cusparse.h, while the hipified header exposes
the same c10::cuda::getCurrentCUDAStream backed by HIP). It builds at C++20
(torch 2.x headers use requires-clauses), without -ffast-math (the kernel uses
INFINITY as a masking sentinel that finite-math would drop), and with
pybind11_add_module(NO_EXTRAS) to avoid pybind's default LTO+strip dropping the
module init symbol under -fvisibility=hidden.
A set of WIN32-guarded host-build fixes let the backend also compile with the
clang-cl toolchain on Windows ROCm (Linux-only compiler/linker flags guarded,
three header-only infra sub-libraries changed from OBJECT to INTERFACE so the
linker language is determinable under Ninja+clang-cl, a getpid shim, metrics
symbol export, and excluding a POSIX-only thread test). The GPU device code is
unchanged by those guards.
Docs: the supported-platform matrix gains a ROCm/AMD row, and the vLLM and
SGLang quickstarts document `PLATFORM=rocm` beside `PLATFORM=cuda`.
This work was authored with the assistance of Claude, an AI assistant.
Test Plan:
Built and validated on real GPUs: Linux gfx90a (MI250X) and gfx1100 (Radeon
Pro W7800), and Windows gfx1201 (RX 9070 XT), gfx1101 (Radeon PRO V710), and
gfx1151 (Radeon 8060S), all on ROCm. All GPU tests run serially with one GPU
visible (HIP_VISIBLE_DEVICES=0).
Store/trans C++ surface plus unit tests:
```
cmake -S . -B build_rocm -DRUNTIME_ENVIRONMENT=rocm -DBUILD_UCM_STORE=ON \
-DBUILD_UNIT_TESTS=ON -DCMAKE_HIP_ARCHITECTURES=gfx90a \
-DCMAKE_HIP_COMPILER=/opt/rocm/llvm/bin/clang++ -DCMAKE_BUILD_TYPE=Release \
-DCMAKE_CXX_FLAGS="-Wno-error=unused-result"
cmake --build build_rocm -j16
HIP_VISIBLE_DEVICES=0 ctest --test-dir build_rocm -j1
```
79/80 pass on Linux. The copy-kernel correctness gates all pass:
UCTransUnitTest.{CopyDataWithCE,CopyDataWithSM,CopyDataBatchWithSM} (byte-exact
host<->device round-trip) and the UCPosixTrans*/UCCacheTransBuffer cases
(store H2D/D2H batch copy with readback). The one failure,
UCMetricsUT.ConcurrentUpdateAndCollect, is a pre-existing CPU-only
multi-threaded metrics counter test (untouched ucm/shared/metrics), unrelated
to the GPU backend.
Hamming kernel (against a ROCm PyTorch):
```
cmake -S . -B build_sparse -DRUNTIME_ENVIRONMENT=rocm -DBUILD_UCM_STORE=OFF \
-DBUILD_UCM_SPARSE=ON -DCMAKE_HIP_ARCHITECTURES=gfx90a \
-DCMAKE_HIP_COMPILER=/opt/rocm/llvm/bin/clang++ -DCMAKE_BUILD_TYPE=Release \
-DPython_EXECUTABLE=<rocm-torch-python> -DCMAKE_CXX_FLAGS="-Wno-error=unused-result"
cmake --build build_sparse -j16 --target hamming
HIP_VISIBLE_DEVICES=0 HAMMING_DIR=<dir with hamming*.so> \
python ucm/sparse/test/gsa/test_hamming_rocm_ref.py
```
The new test computes an independent CPU popcount reference for the paged
block-mode score: mla (no kv reduction) matches within fp16 rounding, gqa
(min over kv heads) matches exactly, two-run output is bit-identical.
The NVIDIA path was checked to be unaffected: with PLATFORM=cuda the guards
select the original inline-PTX branch (compile-checked with nvcc 12.8 at
sm_80; the emitted PTX still contains the ld.global.cs / st.volatile.global
streaming ops), so the CUDA build compiles the same device code as before.
| match PLATFORM: | ||
| case "cuda": | ||
| cmake_args += ["-DRUNTIME_ENVIRONMENT=cuda"] | ||
| case "rocm": |
There was a problem hiding this comment.
setup.py adds PLATFORM=rocm support, but the warning message still lists only cuda/ascend/ascend-a3/musa/maca. Please include rocm there as well.
| // streaming builtins; those are NVIDIA cache-policy hints, not semantics. | ||
| // A plain vectorized 32-byte copy is the portable equivalent. | ||
| const uint4* src4 = reinterpret_cast<const uint4*>(src); | ||
| uint4* dst4 = reinterpret_cast<uint4*>(const_cast<uint8_t*>(dst)); |
There was a problem hiding this comment.
const_cast<uint8_t*>(dst) removes the volatile qualifier from the destination pointer. The original CUDA implementation uses st.volatile.global PTX which ensures proper memory visibility semantics for device-to-host transfers. While the comment correctly notes that cache-policy hints are NVIDIA-specific, the volatile qualifier itself has semantic meaning - it prevents compiler optimizations that could reorder or eliminate memory operations. On HIP, consider using __builtin_nontemporal_store or ensuring proper memory fence semantics to maintain equivalent visibility guarantees. The current plain copy may be functionally correct for this use case (disjoint writes with stream sync), but the removal of volatile should be explicitly justified in the comment.
| // A plain vectorized 32-byte copy is the portable equivalent. | ||
| const uint4* src4 = reinterpret_cast<const uint4*>(src); | ||
| uint4* dst4 = reinterpret_cast<uint4*>(const_cast<uint8_t*>(dst)); | ||
| dst4[0] = src4[0]; |
There was a problem hiding this comment.
💡 Suggestion: The reinterpret_cast<uint4*> assumes the source pointer is properly aligned to 16-byte boundary for uint4 access. While this is likely true for the KV cache blocks (which are typically allocated with proper alignment), consider adding an assertion or documentation noting the alignment requirement. Misaligned access could cause undefined behavior or performance degradation on some AMD architectures.
| asm volatile("ld.global.cs.v2.u64 {%0, %1}, [%2];" : "=l"(a), "=l"(b) : "l"(src)); | ||
| asm volatile("st.global.cg.v2.u64 [%0], {%1, %2};" ::"l"(dst), "l"(a), "l"(b)); | ||
| #else | ||
| *reinterpret_cast<uint4*>(dst) = |
There was a problem hiding this comment.
reinterpret_cast<uint4*>(dst) removes both const and volatile qualifiers. The volatile qualifier on the destination pointer in D2HUnit was intentional to ensure proper visibility of writes to host memory. While the stream synchronization after the kernel provides the necessary fence, the explicit volatile in the original code served as a compiler barrier. On HIP, this plain copy should work correctly due to the subsequent sync, but the removal of volatile semantics should be documented as a deliberate design decision rather than an incidental consequence of the shim.
| #endif | ||
| } | ||
|
|
||
| inline __device__ void D2HUnit(volatile uint8_t* __restrict__ dst, const uint8_t* __restrict__ src) |
There was a problem hiding this comment.
const_cast<uint8_t*>(dst) removes the volatile qualifier. This mirrors the issue in H2DUnit above. The D2H path writes to host-registered memory, and the volatile qualifier ensured the compiler wouldn't optimize away or reorder these writes. The HIP fallback relies on stream synchronization for correctness, which is valid, but the comment should clarify that volatile semantics are intentionally replaced by stream sync barriers.
| #endif | ||
| #include <cuda_runtime.h> | ||
| #include <torch/script.h> | ||
|
|
There was a problem hiding this comment.
💡 Suggestion: The #include <cuda_runtime.h> after the conditional block relies on the hip_compat shim being on the include path for ROCm builds. This works correctly when the CMake configuration is proper, but could cause confusing build failures if the shim directory is not added to include paths. Consider adding a comment here noting this dependency, or adding an #ifdef USE_ROCM guard with an #error directive if the shim is missing (e.g., checking for UNIFIEDCACHE_HIP_COMPAT_CUDA_RUNTIME_H define).
| static constexpr hipError_t cudaSuccess = hipSuccess; | ||
| static constexpr hipMemcpyKind cudaMemcpyHostToDevice = hipMemcpyHostToDevice; | ||
| static constexpr hipMemcpyKind cudaMemcpyDeviceToHost = hipMemcpyDeviceToHost; | ||
| static constexpr unsigned int cudaStreamNonBlocking = hipStreamNonBlocking; |
There was a problem hiding this comment.
💡 Suggestion: Using #define macros for function aliases prevents proper function overload resolution and could cause issues if CUDA APIs add new signatures in future versions. For this limited set of APIs, the macro approach is acceptable. However, consider adding inline wrapper functions instead of macros for type safety and better debugging support. Example: inline cudaError_t cudaMalloc(void** ptr, size_t size) { return hipMalloc(ptr, size); } would preserve the cudaError_t return type explicitly.
| @@ -0,0 +1,30 @@ | |||
| if(NOT DEFINED CMAKE_HIP_ARCHITECTURES OR CMAKE_HIP_ARCHITECTURES STREQUAL "") | |||
| set(CMAKE_HIP_ARCHITECTURES "gfx90a") | |||
There was a problem hiding this comment.
💡 Suggestion: The architecture defaults to gfx90a (MI250X/MI210) which is appropriate for datacenter AMD GPUs, but may cause confusion for users with consumer Radeon GPUs (gfx1100, gfx1201, etc.). Consider adding a comment or CMake warning message when defaulting, suggesting users set CMAKE_HIP_ARCHITECTURES explicitly. The PR description mentions gfx1100 validation but the default doesn't reflect this broader support.
| string(REPLACE "-s " "" CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS}") | ||
| string(REGEX REPLACE "(^| )-s$" "" CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS}") | ||
|
|
||
| if(NOT DEFINED CMAKE_HIP_ARCHITECTURES OR CMAKE_HIP_ARCHITECTURES STREQUAL "") |
There was a problem hiding this comment.
💡 Suggestion: The string(REPLACE "-s " "" ...) and string(REGEX REPLACE ...) pattern for removing the -s flag is somewhat fragile. It assumes -s appears with a trailing space or at the end of the string. If -s appears without a space (e.g., -s -O2), the first REPLACE handles it, but other edge cases might slip through. Consider using a more robust approach like string(REGEX REPLACE "-s( |$)" "" ...) to handle all positions uniformly.
| # element (kv*block_size + offset)*num_chunk + chunk. Index a flat | ||
| # view of the key tensor the same way so the reference matches the | ||
| # kernel's layout regardless of the host tensor's nominal shape. | ||
| block = key[phys].reshape(-1) |
There was a problem hiding this comment.
💡 Suggestion: The reference implementation uses a Python loop to compute popcount32, which is correct but slow. For test correctness verification this is acceptable. However, consider using torch.bincount or numpy's built-in popcount for better performance if this test needs to run on larger tensors. The current implementation is fine for the test sizes used (hd=576, block_size=64).
| @@ -0,0 +1,18 @@ | |||
| if(NOT DEFINED CMAKE_HIP_ARCHITECTURES OR CMAKE_HIP_ARCHITECTURES STREQUAL "") | |||
| set(CMAKE_HIP_ARCHITECTURES "gfx90a") | |||
There was a problem hiding this comment.
💡 Suggestion: Same as the trans/rocm CMakeLists.txt - the gfx90a default should be documented or warned about. Users building on consumer AMD GPUs will need to override this.
This adds a new RUNTIME_ENVIRONMENT=rocm device backend (PLATFORM=rocm in
setup.py) alongside the existing cuda/ascend/maca/musa/simu backends, so the
KV block-transfer kernels, the H2D/D2H store path, and the sparse Hamming
scoring kernel build and run on AMD GPUs via HIP. The backend is purely
additive: the existing backends are not modified. We have made every effort
to leave the NVIDIA build unchanged -- every source change to a shared file is
behind a
__CUDA_ARCH__orUSE_ROCMguard that the CUDA build does notcompile, and the compat shim is only placed on the include path for the rocm
build, never the cuda one. Select it with
export PLATFORM=rocm(or-DRUNTIME_ENVIRONMENT=rocmwhen invoking CMake directly).Review order: start with ucm/shared/vendor/hip_compat/ (the compat shim), then
the three new rocm/CMakeLists.txt arms (trans, store, sparse ham_dist), then
the two guarded kernel sources, then the docs and Windows host-build guards.
The compat shim resolves the project's
#include <cuda_runtime.h>/<cuda.h>to <hip/hip_runtime.h> and aliases the small cuda* runtime surface in use
(Malloc/Free/Memcpy[Async]/HostMalloc/HostRegister/Stream*/Event*) to hip*.
Each rocm CMake arm calls enable_language(HIP)/find_package(hip), reuses the
existing cuda .cc/.cu sources marked LANGUAGE HIP, and reads
CMAKE_HIP_ARCHITECTURES (defaulting to gfx90a only when unset) so other AMD
targets need no source edit.
Root cause of the one non-mechanical change: the two grid-stride copy kernels
used inline PTX (ld.global.cs / st.volatile.global vectorized loads/stores)
that does not exist on AMD. The PTX is now guarded by
#if defined(__CUDA_ARCH__)with a HIP branch doing a plain vectorized uint4load/store (32-byte and 16-byte units). ROCm 7.2.1 does not provide the
__ldcs/__stcg/__stcs cache-streaming builtins, and those PTX qualifiers are
cache-policy hints rather than visibility semantics for this memcpy (each
thread writes a disjoint unit and the only consumer is the host after a stream
sync), so the plain copy is functionally equivalent. FlashInfer's cp_async.cuh
already selects its portable non-PTX fallback under hipcc, so it needed no
change.
The sparse Hamming module links libtorch. operator.h now includes
<ATen/hip/HIPContext.h> under USE_ROCM (the cuda-spelled context header pulls
in NVIDIA-only cuda_runtime_api.h/cusparse.h, while the hipified header exposes
the same c10::cuda::getCurrentCUDAStream backed by HIP). It builds at C++20
(torch 2.x headers use requires-clauses), without -ffast-math (the kernel uses
INFINITY as a masking sentinel that finite-math would drop), and with
pybind11_add_module(NO_EXTRAS) to avoid pybind's default LTO+strip dropping the
module init symbol under -fvisibility=hidden.
A set of WIN32-guarded host-build fixes let the backend also compile with the
clang-cl toolchain on Windows ROCm (Linux-only compiler/linker flags guarded,
three header-only infra sub-libraries changed from OBJECT to INTERFACE so the
linker language is determinable under Ninja+clang-cl, a getpid shim, metrics
symbol export, and excluding a POSIX-only thread test). The GPU device code is
unchanged by those guards.
Docs: the supported-platform matrix gains a ROCm/AMD row, and the vLLM and
SGLang quickstarts document
PLATFORM=rocmbesidePLATFORM=cuda.This work was authored with the assistance of Claude, an AI assistant.
Test Plan:
Built and validated on real GPUs: Linux gfx90a (MI250X) and gfx1100 (Radeon
Pro W7800), and Windows gfx1201 (RX 9070 XT), gfx1101 (Radeon PRO V710), and
gfx1151 (Radeon 8060S), all on ROCm. All GPU tests run serially with one GPU
visible (HIP_VISIBLE_DEVICES=0).
Store/trans C++ surface plus unit tests:
79/80 pass on Linux. The copy-kernel correctness gates all pass:
UCTransUnitTest.{CopyDataWithCE,CopyDataWithSM,CopyDataBatchWithSM} (byte-exact
host<->device round-trip) and the UCPosixTrans*/UCCacheTransBuffer cases
(store H2D/D2H batch copy with readback). The one failure,
UCMetricsUT.ConcurrentUpdateAndCollect, is a pre-existing CPU-only
multi-threaded metrics counter test (untouched ucm/shared/metrics), unrelated
to the GPU backend.
Hamming kernel (against a ROCm PyTorch):
The new test computes an independent CPU popcount reference for the paged
block-mode score: mla (no kv reduction) matches within fp16 rounding, gqa
(min over kv heads) matches exactly, two-run output is bit-identical.
The NVIDIA path was checked to be unaffected: with PLATFORM=cuda the guards
select the original inline-PTX branch (compile-checked with nvcc 12.8 at
sm_80; the emitted PTX still contains the ld.global.cs / st.volatile.global
streaming ops), so the CUDA build compiles the same device code as before.