
Conversation

@alextmagro
Contributor

This is the userbuffer_epic branch, to be merged only once all epic tasks have been completed. PRs for epic tasks will be opened against this branch.

@alextmagro marked this pull request as ready for review January 27, 2026 15:38
if version < (12, 0):
    raise RuntimeError("Transformer Engine requires CUDA 12.0 or newer")

if bool(int(os.getenv("NVTE_UB_WITH_MPI", "0"))):
Collaborator

Guard this via ROCm-specific guards?
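
One possible shape for that guard, as a hedged sketch only (reusing the IS_HIP_EXTENSION flag from torch.utils.cpp_extension that is suggested further down; the placement here is hypothetical):

import os

from torch.utils.cpp_extension import IS_HIP_EXTENSION

# Hedged sketch: keep the MPI bootstrap path, but route any ROCm-specific
# handling through an explicit guard rather than the shared code path.
if bool(int(os.getenv("NVTE_UB_WITH_MPI", "0"))):
    if IS_HIP_EXTENSION:
        pass  # ROCm-specific setup/validation would go here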

parser.add_argument("--seed", type=int, default=1234, help="RNG seed.")
parser.add_argument(
"--fp8", action="store_true", default=False, help="Enables the te.fp8_autocast() context."
"--fp8", action="store_true", default=False, help="Enables the te.autocast() context."
Collaborator

Up to TE v2.8, I think it's still fp8_autocast. Were you targeting a higher version?

Contributor Author

I think you had a few comments on this, so I will address it here quickly. I moved the UB code up to release 2.10, as there were a few bugs and inefficiencies that NV fixed. Most of the changes that aren't guarded in the files are NV upstream changes.

I am fixing up the te_layer_with_overlap differences, and working on integrating the benchmark script into the file directly.
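
As an aside, a small compatibility shim is one way to keep the example running on both older and newer releases (a hedged sketch, assuming the rename to te.autocast is in place for the targeted 2.10 release):

import transformer_engine.pytorch as te

# Hedged sketch: prefer te.autocast where available, fall back to
# te.fp8_autocast on older Transformer Engine releases.
autocast_ctx = getattr(te, "autocast", None) or te.fp8_autocast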


# This file was modified for portability to AMDGPU
# Copyright (c) 2025-2026, Advanced Micro Devices, Inc. All rights reserved.
# Copyright (c) 2022-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Collaborator

Was this file sharing a lot of code with examples/pytorch/comm_gemm_overlap/te_layer_with_overlap.py? Is it possible to consolidate those two files?

@@ -0,0 +1,15 @@
{
Collaborator

Why do we put this file here? Should it be under /transformer_engine/common or pytorch?

import transformer_engine.pytorch.cpp_extensions as tex
from transformer_engine.pytorch.fp8 import FP8GlobalStateManager

from transformer_engine.jax.cpp_extensions.misc import is_hip_extension
Collaborator

Let's not import JAX-specific code into the PyTorch side. Use this instead:

from torch.utils.cpp_extension import IS_HIP_EXTENSION
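
At the call sites this could look like the following (a hypothetical sketch; IS_HIP_EXTENSION is a plain boolean set at import time):

from torch.utils.cpp_extension import IS_HIP_EXTENSION

# Hedged sketch: gate ROCm-specific behaviour on the torch-provided flag
# instead of importing the JAX helper into the PyTorch code.
if IS_HIP_EXTENSION:
    ...  # ROCm/HIP-specific path
else:
    ...  # CUDA path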

initialize(buffer_shape, buffer_dtype, rs_overlap_first_gemm);
}

void CommOverlapBase::initialize(const std::vector<size_t> &buffer_shape, DType buffer_dtype,
Collaborator

Is this initialize function used somewhere else? Or is it just to make the NV upstream code look cleaner?

if (_ub_comm->myrank == 0) printf("!!! [UB] Register UBuf %d\n", _ub_reg);
if (_ub_comm->myrank == 0) {
  printf("!!! [UB] Register UBuf %d\n", _ub_reg);
}
Collaborator

I would prefer aligning the coding style with NV upstream so it's easier for us to maintain/IFU later

allgather_handle, barrier_handle, tp_size, num_max_streams, comm_cga_size,
gemm_priority, comm_priority, num_comm_sm, set_sm_margin, use_ce,
atomic_gemm) {
initialize(buffer_shape, buffer_dtype, comm_type, aggregate);
Collaborator

Same question here about the motivation for this initialize function in the constructor

size_t buffer_bytes = get_buffer_size_bytes(buffer_shape[0], buffer_shape[1], buffer_dtype);
int buffer_chunk_bytes = buffer_bytes / tp_size;
_num_ubuf_chunks = tp_size;
int buffer_chunk_bytes = buffer_bytes / _tp_size;
Collaborator

Does NV's original code compile successfully? I mean tp_size -> _tp_size sounds like a typo in their original code :-)

NVTE_CHECK_CUDA(cudaStreamCreateWithPriority(&stream, cudaStreamNonBlocking, _comm_priority));
_stream_send.push_back(std::move(stream));
}
for (int i = 0; i < 7; i++) {
Collaborator

Why do we need more streams than NV upstream, and where does the constant 7 come from?
