
Device Barrier #443

Draft

micmelesse wants to merge 2 commits into ROCm:main from micmelesse:micmelesse/fusion


Conversation


@micmelesse micmelesse commented Mar 10, 2026

Motivation

This PR adds device_barrier() to Iris. device_barrier() is a GPU-side barrier that synchronizes ranks using atomic operations on the symmetric heap. It is needed to avoid crashes during graph capture in vLLM workloads; see ROCm/hip#3876. The existing barrier() uses dist.barrier(), which goes through RCCL. The RCCL watchdog thread polls outstanding work items and fails with hipErrorStreamCaptureUnsupported when another stream is in capture mode. device_barrier() avoids RCCL entirely by synchronizing on the GPU via an Iris kernel.
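The flag-based protocol described above can be illustrated with a minimal host-side sketch. This is a hypothetical simulation, not Iris's actual Triton kernel: Python threads stand in for ranks, a plain list stands in for the per-rank flag slots on the symmetric heap, and a lock stands in for GPU atomics.

```python
import threading

WORLD_SIZE = 4

# Hypothetical stand-ins (not Iris's real data structures):
flags = [0] * WORLD_SIZE   # one arrival counter per rank ("symmetric heap" slots)
lock = threading.Lock()    # stands in for GPU atomic operations
order = []                 # ranks that have passed the barrier

def device_barrier_sim(rank):
    # Announce arrival: atomically bump every rank's counter.
    with lock:
        for peer in range(WORLD_SIZE):
            flags[peer] += 1
    # Spin until our own counter shows arrivals from all ranks.
    while True:
        with lock:
            if flags[rank] >= WORLD_SIZE:
                break

def worker(rank):
    device_barrier_sim(rank)
    order.append(rank)

threads = [threading.Thread(target=worker, args=(r,)) for r in range(WORLD_SIZE)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(order))  # → [0, 1, 2, 3]: every rank made it past the barrier
```

Because each rank records its arrival on every peer's slot before spinning, no rank can exit until all ranks have announced themselves; that is the property that lets the real kernel run entirely on the GPU without touching RCCL.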

Technical Details

Test Plan

Test Result

Submission Checklist

@micmelesse micmelesse changed the title add device barrier Device Barrier Mar 10, 2026
@mawad-amd
Collaborator

Thanks, @micmelesse for the fix! I actually prefer this triton-based barrier to a working host-side one. Seems like CI tests are failing though. Could you please take a look?

@micmelesse
Author

> Thanks, @micmelesse for the fix! I actually prefer this triton-based barrier to a working host-side one. Seems like CI tests are failing though. Could you please take a look?

I will get the PR green and ping you. Thank you.

@micmelesse micmelesse marked this pull request as ready for review March 10, 2026 20:50
Copilot AI review requested due to automatic review settings March 10, 2026 20:50
@micmelesse micmelesse marked this pull request as draft March 10, 2026 20:50
Contributor

Copilot AI left a comment


Pull request overview

This PR adds a GPU-side, CUDA-graph-capturable device_barrier() to Iris that avoids RCCL/NCCL host barriers by synchronizing ranks via a Triton kernel using atomics on the symmetric heap.

Changes:

  • Added Iris.device_barrier() with per-process-group flag storage and a new distributed_device_barrier() implementation.
  • Introduced a Triton-based device barrier kernel and centralized extract_group_info for group rank/stride handling.
  • Added unit tests covering basic usage, cross-rank visibility (eager + graph), state reuse, and timeout behavior.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.

  • iris/iris.py: Adds the device_barrier() API and caches per-group flags tensor state.
  • iris/_distributed_helpers.py: Implements extract_group_info(), the Triton device barrier kernel, and distributed_device_barrier().
  • iris/ccl/utils.py: Refactors group info extraction to delegate to the centralized helper.
  • tests/unittests/test_barriers.py: Adds unit tests for host/device barriers, including CUDA graph capture scenarios.
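The timeout behavior the unit tests cover can be sketched host-side. This is a hypothetical illustration only: the name spin_wait, its parameters, and the RuntimeError are inventions for this sketch, not Iris's actual API. The idea is a spin loop that gives up once a deadline passes, so a peer that never arrives cannot hang the barrier forever.

```python
import time

def spin_wait(flag_reached, timeout_s=2.0, poll_s=0.001):
    # Hypothetical sketch: spin until flag_reached() is True, or raise
    # once timeout_s elapses (illustrating barrier timeout behavior).
    deadline = time.monotonic() + timeout_s
    while not flag_reached():
        if time.monotonic() > deadline:
            raise RuntimeError("device barrier timed out waiting for peers")
        time.sleep(poll_s)

# A condition that becomes true after a short delay completes in time...
start = time.monotonic()
spin_wait(lambda: time.monotonic() - start > 0.05)

# ...while a condition that never becomes true raises on the deadline.
try:
    spin_wait(lambda: False, timeout_s=0.05)
    timed_out = False
except RuntimeError:
    timed_out = True

print(timed_out)  # → True
```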


