
Conversation

@alextmagro
Contributor

This is the userbuffer_epic branch, to be merged only once all epic tasks have been completed. PRs for epic tasks will be opened against this branch.

@alextmagro marked this pull request as ready for review January 27, 2026 15:38
if version < (12, 0):
    raise RuntimeError("Transformer Engine requires CUDA 12.0 or newer")

if bool(int(os.getenv("NVTE_UB_WITH_MPI", "0"))):
Collaborator

Guard this via ROCm-specific guards?
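
One possible shape for that guard, as a hedged sketch only (reusing the IS_HIP_EXTENSION flag from torch.utils.cpp_extension that is suggested further down; the placement here is hypothetical):

import os

from torch.utils.cpp_extension import IS_HIP_EXTENSION

# Hedged sketch: keep the MPI bootstrap path, but route any ROCm-specific
# handling through an explicit guard rather than the shared code path.
if bool(int(os.getenv("NVTE_UB_WITH_MPI", "0"))):
    if IS_HIP_EXTENSION:
        pass  # ROCm-specific setup/validation would go here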

parser.add_argument("--seed", type=int, default=1234, help="RNG seed.")
parser.add_argument(
"--fp8", action="store_true", default=False, help="Enables the te.fp8_autocast() context."
"--fp8", action="store_true", default=False, help="Enables the te.autocast() context."
Collaborator

Up to TE v2.8, I think it's still fp8_autocast. Were you targeting a higher version?

Contributor Author

I think you had a few comments on this, so I will address it here quickly. I moved the UB code up to release 2.10, as there were a few bugs and inefficiencies that NV fixed. Most of the changes that aren't guarded in the files are NV upstream changes.

I am fixing up the te_layer_with_overlap differences, and working on integrating the benchmark script into the file directly.
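
As an aside, a small compatibility shim is one way to keep the example running on both older and newer releases (a hedged sketch, assuming the rename to te.autocast is in place for the targeted 2.10 release):

import transformer_engine.pytorch as te

# Hedged sketch: prefer te.autocast where available, fall back to
# te.fp8_autocast on older Transformer Engine releases.
autocast_ctx = getattr(te, "autocast", None) or te.fp8_autocast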


# This file was modified for portability to AMDGPU
# Copyright (c) 2025-2026, Advanced Micro Devices, Inc. All rights reserved.
# Copyright (c) 2022-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Collaborator

Was this file sharing a lot of code with examples/pytorch/comm_gemm_overlap/te_layer_with_overlap.py? Is it possible to consolidate those two files?

@@ -0,0 +1,15 @@
{
Collaborator

Why do we put this file here? Should it be under /transformer_engine/common or pytorch?

import transformer_engine.pytorch.cpp_extensions as tex
from transformer_engine.pytorch.fp8 import FP8GlobalStateManager

from transformer_engine.jax.cpp_extensions.misc import is_hip_extension
Collaborator

Let's not import JAX-specific code into the PyTorch side. Use this instead:

from torch.utils.cpp_extension import IS_HIP_EXTENSION
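
At the call sites this could look like the following (a hypothetical sketch; IS_HIP_EXTENSION is a plain boolean set at import time):

from torch.utils.cpp_extension import IS_HIP_EXTENSION

# Hedged sketch: gate ROCm-specific behaviour on the torch-provided flag
# instead of importing the JAX helper into the PyTorch code.
if IS_HIP_EXTENSION:
    ...  # ROCm/HIP-specific path
else:
    ...  # CUDA path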

initialize(buffer_shape, buffer_dtype, rs_overlap_first_gemm);
}

void CommOverlapBase::initialize(const std::vector<size_t> &buffer_shape, DType buffer_dtype,
Collaborator

Is this initialize function used somewhere else? Or is it just to make the NV upstream code look cleaner?

if (_ub_comm->myrank == 0) printf("!!! [UB] Register UBuf %d\n", _ub_reg);
if (_ub_comm->myrank == 0) {
  printf("!!! [UB] Register UBuf %d\n", _ub_reg);
}
Collaborator

I would prefer aligning the coding style with NV upstream so it's easier for us to maintain/IFU later

allgather_handle, barrier_handle, tp_size, num_max_streams, comm_cga_size,
gemm_priority, comm_priority, num_comm_sm, set_sm_margin, use_ce,
atomic_gemm) {
initialize(buffer_shape, buffer_dtype, comm_type, aggregate);
Collaborator

Same question here about the motivation for this initialize function in the constructor

size_t buffer_bytes = get_buffer_size_bytes(buffer_shape[0], buffer_shape[1], buffer_dtype);
int buffer_chunk_bytes = buffer_bytes / tp_size;
_num_ubuf_chunks = tp_size;
int buffer_chunk_bytes = buffer_bytes / _tp_size;
Collaborator

Does NV's original code compile successfully? I mean tp_size -> _tp_size sounds like a typo in their original code :-)

NVTE_CHECK_CUDA(cudaStreamCreateWithPriority(&stream, cudaStreamNonBlocking, _comm_priority));
_stream_send.push_back(std::move(stream));
}
for (int i = 0; i < 7; i++) {
Collaborator

Why do we need more streams than NV upstream, and where does the constant 7 come from?
