[Common] Fuse pre-swizzling into grouped MXFP8 quantization kernel #2630
Open
Oleg-Goncharov wants to merge 21 commits into NVIDIA:main from
Conversation
Greptile Summary

This PR extends the grouped MXFP8 quantization kernel to support pre-swizzled scaling factors.

Implementation Details
The implementation maintains feature parity with the base kernel, supporting activations (GeLU, SiLU, ReLU), activation derivatives, and dbias computation.

Confidence Score: 4/5
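To make the quantization step concrete, here is a host-side reference sketch of quantizing one 32-element MXFP8 block: compute the block amax, convert it to an E8M0 scale exponent, then scale the values into the FP8 E4M3 range. This is an illustrative model, not the kernel's code; the function names and the round-up policy for the exponent are assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

constexpr float kE4M3Max = 448.0f;  // max normal value of FP8 E4M3

// Map a block amax to an E8M0 biased exponent (bias 127), rounding the
// exponent up so the scaled values fit in the E4M3 range (assumed policy).
uint8_t amax_to_e8m0(float amax) {
  if (amax <= 0.0f) return 0;  // degenerate block: smallest scale (assumption)
  int e = static_cast<int>(std::ceil(std::log2(amax / kE4M3Max)));
  e = std::max(-127, std::min(127, e));
  return static_cast<uint8_t>(e + 127);
}

// Scale one block by 2^-e; a real kernel would then encode each value as E4M3.
std::vector<float> quantize_block(const std::vector<float>& block,
                                  uint8_t* scale_out) {
  float amax = 0.0f;
  for (float v : block) amax = std::max(amax, std::fabs(v));
  const uint8_t biased = amax_to_e8m0(amax);
  *scale_out = biased;
  const float scale = std::ldexp(1.0f, static_cast<int>(biased) - 127);  // 2^e
  std::vector<float> out;
  for (float v : block) out.push_back(std::clamp(v / scale, -kE4M3Max, kE4M3Max));
  return out;
}
```

For example, a block whose amax is 896 gets exponent 1 (scale 2), so every scaled value lands within ±448.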
Important Files Changed
Sequence Diagram

```mermaid
sequenceDiagram
    participant User
    participant API as nvte_group_quantize
    participant Dispatch as group_quantize_fwd_helper
    participant Kernel as group_quantize_mxfp8_kernel
    participant Swizzle as gemm_swizzled_scale_idx
    User->>API: Call nvte_group_quantize(input, output, stream)
    API->>Dispatch: group_quantize_fwd_helper<IS_ACT, OP>()
    Dispatch->>Dispatch: Check scaling_mode (MXFP8_1D_SCALING)
    Dispatch->>Kernel: mxfp8::group_quantize(input, output, ...)
    Kernel->>Kernel: Read with_gemm_swizzled_scales from output->with_gemm_swizzled_scales
    Kernel->>Kernel: Instantiate kernel with WITH_GEMM_SWIZZLED_SCALES template parameter
    alt Multiple tensors (not single tensor)
        Kernel->>Kernel: Launch update_tma_descriptors kernel
        Kernel->>Kernel: Update tensor map descriptors per tensor
    end
    Kernel->>Kernel: Launch group_quantize_mxfp8_kernel<<<grid, block>>>
    loop For each tile in tensor
        Kernel->>Kernel: Load data via TMA
        Kernel->>Kernel: Compute activations (if IS_ACT or IS_DACT)
        alt Colwise Scaling
            Kernel->>Kernel: Compute column-wise amax
            Kernel->>Kernel: Convert to E8M0 scaling factor
            alt WITH_GEMM_SWIZZLED_SCALES
                Kernel->>Swizzle: gemm_swizzled_scale_idx(x, y, num_tiles)
                Swizzle-->>Kernel: Return swizzled index
            else No swizzling
                Kernel->>Kernel: Use compact index (y * stride + x)
            end
            Kernel->>Kernel: Store scale at computed index
            Kernel->>Kernel: Apply scale and quantize to MXFP8
        end
        alt Rowwise Scaling
            Kernel->>Kernel: Compute row-wise amax
            Kernel->>Kernel: Convert to E8M0 scaling factor
            alt WITH_GEMM_SWIZZLED_SCALES
                Kernel->>Swizzle: gemm_swizzled_scale_idx(y, x, num_tiles)
                Swizzle-->>Kernel: Return swizzled index
            else No swizzling
                Kernel->>Kernel: Use compact index (y * stride + x)
            end
            Kernel->>Kernel: Store scale at computed index
            Kernel->>Kernel: Apply scale and quantize to MXFP8
        end
        Kernel->>Kernel: Store quantized data via TMA
    end
    alt IS_DBIAS
        Kernel->>Kernel: Reduce dbias along columns
    end
    Kernel-->>User: Return quantized output with swizzled scales
```
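The swizzled-versus-compact branch in the diagram can be sketched on the host as follows. This is an illustrative sketch, not the kernel's implementation: the compact index is plain row-major, while the swizzled index here assumes the 128x4 scale-factor tiling documented for cuBLAS block-scaled GEMM; the exact layout used by the kernel may differ.

```cpp
#include <cassert>
#include <cstddef>

// Compact layout: plain row-major, "y * stride + x" in the diagram.
size_t compact_scale_idx(size_t r, size_t c, size_t stride) {
  return r * stride + c;
}

// Assumed GEMM-swizzled layout: scale factors grouped into 128-row x 4-column
// tiles (512 scales per tile), with a fixed permutation inside each tile.
size_t gemm_swizzled_scale_idx(size_t r, size_t c, size_t num_scale_cols) {
  const size_t kTileRows = 128, kTileCols = 4;
  const size_t tile = (r / kTileRows) * (num_scale_cols / kTileCols) + (c / kTileCols);
  const size_t within = (r % 32) * 16 + ((r % kTileRows) / 32) * 4 + (c % kTileCols);
  return tile * (kTileRows * kTileCols) + within;
}
```

The point of the fusion is that the kernel computes this index directly when storing each scale, so no separate permutation pass over the scale buffer is needed before the GEMM.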
Description
This PR fuses pre-swizzling into the grouped MXFP8 quantization kernel so that scaling factors are stored in the layout expected by GEMM. It builds on PR #2586 ([Common] MXFP8 kernel for grouped tensors) and can be merged after that PR lands.
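Why fusing helps can be modeled with a small host-side sketch: writing each scale at its swizzled position during quantization produces exactly the same buffer as a compact write followed by a separate swizzle kernel, but in one pass instead of two. Everything here is hypothetical illustration; `swizzle_idx` is a stand-in for the kernel's `gemm_swizzled_scale_idx`, and the `(r + c) % 251` values merely stand in for real E8M0 scales.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-in for the kernel's swizzled index (assumed 128x4 tiling).
size_t swizzle_idx(size_t r, size_t c, size_t cols) {
  const size_t tile = (r / 128) * (cols / 4) + (c / 4);
  return tile * 512 + (r % 32) * 16 + ((r % 128) / 32) * 4 + (c % 4);
}

// Fused: each scale is stored directly at its swizzled position.
std::vector<uint8_t> fused(size_t rows, size_t cols) {
  std::vector<uint8_t> out(rows * cols);
  for (size_t r = 0; r < rows; ++r)
    for (size_t c = 0; c < cols; ++c)
      out[swizzle_idx(r, c, cols)] = static_cast<uint8_t>((r + c) % 251);
  return out;
}

// Unfused: pass 1 stores scales compactly, pass 2 permutes them.
std::vector<uint8_t> two_pass(size_t rows, size_t cols) {
  std::vector<uint8_t> compact(rows * cols), out(rows * cols);
  for (size_t r = 0; r < rows; ++r)
    for (size_t c = 0; c < cols; ++c)
      compact[r * cols + c] = static_cast<uint8_t>((r + c) % 251);
  for (size_t r = 0; r < rows; ++r)  // the extra pass the fusion eliminates
    for (size_t c = 0; c < cols; ++c)
      out[swizzle_idx(r, c, cols)] = compact[r * cols + c];
  return out;
}
```

Both paths yield identical buffers; the fused path simply skips the second read-write pass over the scale tensor.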
Type of change
Changes

GroupedTensor

Checklist: