Conversation
Greptile Overview

Greptile Summary

This PR adds 2D block scaling support for MXFP8 quantization, where each 32x32 block of elements shares a single scaling factor (compared to 1D scaling, where each row/column has its own scale).

Key Changes:

Confidence Score: 4/5
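To make the scheme concrete, here is a minimal host-side sketch of 2D block scaling, assuming row-major fp32 input with dimensions divisible by 32; the names (`compute_block_scales_2d`, `e8m0_from_amax`) and the exact E8M0 rounding are illustrative, not this PR's implementation:

```cpp
// Hedged sketch, not the PR's code: one E8M0 scale per 32x32 block.
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

constexpr int kBlock = 32;
constexpr float kFP8E4M3Max = 448.0f;  // largest representable E4M3 value

// Biased power-of-two exponent chosen so block_amax maps near the FP8 max.
uint8_t e8m0_from_amax(float amax) {
  if (amax == 0.0f) return 0;  // degenerate all-zero block
  int exp = static_cast<int>(std::floor(std::log2(amax / kFP8E4M3Max)));
  return static_cast<uint8_t>(std::clamp(exp + 127, 0, 254));
}

// One scale per 32x32 tile; rows and cols assumed multiples of 32 here.
std::vector<uint8_t> compute_block_scales_2d(const float* x, int rows, int cols) {
  std::vector<uint8_t> scales((rows / kBlock) * (cols / kBlock));
  for (int bi = 0; bi < rows / kBlock; ++bi) {
    for (int bj = 0; bj < cols / kBlock; ++bj) {
      float amax = 0.0f;  // shared by all 32*32 = 1024 elements of the tile
      for (int i = 0; i < kBlock; ++i)
        for (int j = 0; j < kBlock; ++j)
          amax = std::max(amax,
                          std::fabs(x[(bi * kBlock + i) * cols + bj * kBlock + j]));
      scales[bi * (cols / kBlock) + bj] = e8m0_from_amax(amax);
    }
  }
  return scales;
}
```

Under 2D scaling the rowwise and colwise outputs can share the same per-block scale, which is what the kernel's two-pass structure in the diagram below exploits.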
Sequence Diagram

```mermaid
sequenceDiagram
    participant User as User Code
    participant Recipe as MXFP8BlockScaling
    participant RecipeState as MXFP8BlockScalingRecipeState
    participant Quantizer as MXFP8Quantizer (Python)
    participant CPPQuantizer as MXFP8Quantizer (C++)
    participant Config as QuantizationConfig
    participant Dispatch as quantize_fwd_helper
    participant Kernel as quantize_mxfp8_kernel

    User->>Recipe: MXFP8BlockScaling(enable_2d_quantization=True)
    Recipe->>Recipe: __post_init__(): set QParams.mxfp8_2d_quantization
    Note over Recipe: fp8_quant_fwd_weight.mxfp8_2d_quantization = True<br/>fp8_quant_fwd_inp.mxfp8_2d_quantization = False<br/>fp8_quant_bwd_grad.mxfp8_2d_quantization = False

    User->>RecipeState: make_quantizers()
    RecipeState->>RecipeState: Check QParams for each tensor type
    RecipeState->>Quantizer: MXFP8Quantizer(with_2d_quantization=qparams.mxfp8_2d_quantization)
    RecipeState-->>User: Return list of quantizers

    User->>Quantizer: quantizer(input_tensor)
    Quantizer->>CPPQuantizer: quantize(input, out)
    CPPQuantizer->>Config: set_mxfp8_2d_quantization(with_2d_quantization)
    CPPQuantizer->>Dispatch: nvte_quantize_v2(input, output, config)
    Dispatch->>Kernel: Call quantize<...>(use_2d_quantization=config.mxfp8_2d_quantization)

    alt 2D Quantization Enabled (kIs2DBlockScaling=true)
        Kernel->>Kernel: Colwise pass: Compute 32x32 block amax via warp shuffle
        Note over Kernel: Each warp reduces across 32 threads<br/>to get single scale per 32x32 block
        Kernel->>Kernel: Store block scale to shared memory (block_scales_2d)
        Kernel->>Kernel: Quantize colwise data with block scale
        Kernel->>Kernel: __syncthreads() before rowwise pass
        Kernel->>Kernel: Rowwise pass: Load scale from shared memory
        Note over Kernel: Use __shfl_sync to broadcast scale<br/>from shared memory to all threads in warp
        Kernel->>Kernel: Quantize rowwise data with same block scale
    else 1D Quantization (kIs2DBlockScaling=false)
        Kernel->>Kernel: Colwise pass: Compute per-column scale
        Kernel->>Kernel: Rowwise pass: Compute per-row scale
    end

    Kernel-->>CPPQuantizer: Quantized tensor with scales
    CPPQuantizer-->>User: MXFP8Tensor with rowwise/colwise data and scales
```
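To ground the 2D branch above, here is a hedged, self-contained CUDA sketch, not the PR's `quantize_mxfp8_kernel`: it assumes a (32, 32) thread block, one fp32 element per thread, dimensions divisible by 32, and a plain float output standing in for the real FP8 conversion, but it follows the same steps the diagram names: warp-shuffle amax reduction, a block scale staged through shared memory, `__syncthreads()`, then a second use of the same scale.

```cuda
#include <cuda_runtime.h>

__global__ void quantize_2d_block_sketch(const float* in, float* out,
                                         unsigned char* scales, int cols) {
  __shared__ float warp_amax[32];  // one partial amax per warp (tile row)
  __shared__ float block_scale;    // single scale for the whole 32x32 tile

  const int row = blockIdx.y * 32 + threadIdx.y;
  const int col = blockIdx.x * 32 + threadIdx.x;
  const float v = in[row * cols + col];
  float amax = fabsf(v);

  // Stage 1: butterfly shuffle reduces |x| across the 32 lanes of each warp.
  for (int offset = 16; offset > 0; offset >>= 1)
    amax = fmaxf(amax, __shfl_xor_sync(0xffffffffu, amax, offset));
  if (threadIdx.x == 0) warp_amax[threadIdx.y] = amax;
  __syncthreads();

  // Stage 2: the first warp reduces the 32 per-warp partials to one amax and
  // derives a power-of-two (E8M0-style) scale, as MXFP8 requires.
  if (threadIdx.y == 0) {
    float a = warp_amax[threadIdx.x];
    for (int offset = 16; offset > 0; offset >>= 1)
      a = fmaxf(a, __shfl_xor_sync(0xffffffffu, a, offset));
    if (threadIdx.x == 0) {
      const int e = (int)floorf(log2f(fmaxf(a, 1e-30f) / 448.0f));  // 448: E4M3 max
      block_scale = exp2f((float)e);
      scales[blockIdx.y * gridDim.x + blockIdx.x] = (unsigned char)(e + 127);
    }
  }
  __syncthreads();  // all threads wait for block_scale, as in the diagram

  // Every element of the 32x32 tile is quantized with the same block scale.
  out[row * cols + col] = v / block_scale;
}
```

In the actual kernel the scale is computed in the colwise pass and re-read in the rowwise pass from `block_scales_2d` (broadcast with `__shfl_sync`, as in the excerpt reviewed below); the reduction-and-barrier structure is the same idea.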
```cuda
e8m0_t scale_from_shmem;
if (thread_lane < THREADS_X) {
  scale_from_shmem = block_scales_2d[thread_lane];
}
// Broadcast: each thread gets scale from lane matching its tid_X_rowwise
biased_exponent = __shfl_sync(0xffffffff, scale_from_shmem, tid_X_rowwise);
```
`scale_from_shmem` is potentially uninitialized for threads where `thread_lane >= THREADS_X`. While `__shfl_sync` only reads from the lanes specified by `tid_X_rowwise` (which should be < `THREADS_X`), it's safer to initialize this variable.
Suggested change:

```diff
-e8m0_t scale_from_shmem;
+e8m0_t scale_from_shmem = 0;
 if (thread_lane < THREADS_X) {
   scale_from_shmem = block_scales_2d[thread_lane];
 }
 // Broadcast: each thread gets scale from lane matching its tid_X_rowwise
 biased_exponent = __shfl_sync(0xffffffff, scale_from_shmem, tid_X_rowwise);
```
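For reference, a minimal standalone illustration of the broadcast pattern under discussion (a hypothetical demo kernel, not PR code): `__shfl_sync` only returns the value held by the source lane, so the other lanes' contents never propagate, but initializing every lane keeps all of the shuffle's inputs well-defined.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical demo: lane 3 holds the "scale"; every lane reads it by shuffle.
__global__ void broadcast_demo() {
  int v = 0;                               // initialized on every lane
  if (threadIdx.x == 3) v = 42;            // only the source lane sets data
  int b = __shfl_sync(0xffffffffu, v, 3);  // every lane receives 42
  if (threadIdx.x == 0) printf("lane 0 got %d\n", b);
}

int main() {
  broadcast_demo<<<1, 32>>>();
  cudaDeviceSynchronize();
  return 0;
}
```

The zero-initialization is essentially free, since the compiler can fold it into the predicated load, and it removes any dependence on an indeterminate register value.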
Description
Please include a brief summary of the changes, relevant motivation and context.
Fixes # (issue)
Type of change
Changes
Please list the changes introduced in this PR:
Checklist: