
Add NVTE_KEEP_BACKWARD_UNQUANTIZED#2644

Open
zianglih wants to merge 23 commits into NVIDIA:main from zianglih:keep-bwd

Conversation


@zianglih zianglih commented Feb 3, 2026

Description


Add an NVTE_KEEP_BACKWARD_UNQUANTIZED env var for quantized fprop + high precision wgrad & dgrad.
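As a rough usage sketch (module, sizes, and recipe below are illustrative; note that, per the review discussion, the flag is ignored or should error out for delayed scaling recipes):

import os
os.environ["NVTE_KEEP_BACKWARD_UNQUANTIZED"] = "1"  # quantized fprop, high-precision dgrad/wgrad

import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import Float8CurrentScaling  # any non-delayed-scaling recipe

layer = te.Linear(1024, 1024, params_dtype=torch.bfloat16)
inp = torch.randn(512, 1024, dtype=torch.bfloat16, device="cuda", requires_grad=True)

with te.fp8_autocast(enabled=True, fp8_recipe=Float8CurrentScaling()):
    out = layer(inp)   # forward GEMM runs quantized
out.sum().backward()   # backward uses the saved high-precision tensors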

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Change A
  • Change B

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes


greptile-apps bot commented Feb 3, 2026

Greptile Overview

Greptile Summary

This PR adds support for quantized forward pass with high-precision backward pass via the NVTE_KEEP_BACKWARD_UNQUANTIZED environment variable. When enabled, the feature performs FP8-quantized forward computation but uses high-precision (unquantized) tensors for weight gradient and data gradient computation in the backward pass.

Key Changes:

  • Added FP8GlobalStateManager.keep_backward_unquantized() method that reads the env var and automatically returns False for delayed scaling recipes
  • Modified Linear, GroupedLinear, and LayerNormLinear modules to save high-precision tensors instead of quantized ones when the flag is enabled
  • Updated backward pass logic to disable FP8 quantization and use high-precision saved tensors for dgrad and wgrad computation
  • Disabled Userbuffers communication optimizations in backward pass when this feature is active
  • Added assertion in LayerNormMLP to prevent usage (not yet implemented for this module)
  • Propagated the flag through all fused operations

Trade-offs:

  • Increased memory usage (stores high-precision tensors instead of FP8)
  • Potentially improved gradient accuracy
  • Feature is automatically disabled for delayed scaling recipes to avoid conflicts

Confidence Score: 4/5

  • This PR is generally safe to merge with minor concerns about memory usage and documentation
  • The implementation is systematic and correctly propagates the flag through all affected modules. The guard against delayed scaling recipes prevents assertion failures. However, there's increased memory usage that users should be aware of, and LayerNormMLP is not yet supported
  • Pay attention to transformer_engine/pytorch/module/layernorm_linear.py for memory usage implications

Important Files Changed

  • transformer_engine/pytorch/quantization.py: Added keep_backward_unquantized() method to check the NVTE_KEEP_BACKWARD_UNQUANTIZED env var; correctly returns False for delayed scaling recipes
  • transformer_engine/pytorch/module/layernorm_mlp.py: Added assertion to block usage of NVTE_KEEP_BACKWARD_UNQUANTIZED in LayerNormMLP (not yet implemented)
  • transformer_engine/pytorch/module/linear.py: Implements high-precision backward by saving original tensors, disabling FP8 quantizers in backward, and using unquantized weights for dgrad/wgrad
  • transformer_engine/pytorch/module/layernorm_linear.py: Stores both quantized and high-precision layernorm output when the flag is set, disables FP8 quantization in the backward pass
  • transformer_engine/pytorch/ops/basic/basic_linear.py: Adds keep_backward_unquantized parameter to the forward function, conditionally saves high-precision tensors instead of quantized ones

Sequence Diagram

sequenceDiagram
    participant User
    participant FP8GlobalStateManager
    participant Linear
    participant Forward
    participant Backward
    
    User->>FP8GlobalStateManager: Set NVTE_KEEP_BACKWARD_UNQUANTIZED=1
    User->>Linear: forward(input, weight)
    Linear->>FP8GlobalStateManager: keep_backward_unquantized()
    FP8GlobalStateManager->>FP8GlobalStateManager: Check recipe.delayed()
    alt Delayed Scaling Recipe
        FP8GlobalStateManager-->>Linear: False (ignore flag)
    else Other Recipe
        FP8GlobalStateManager-->>Linear: True (use high precision)
    end
    
    alt keep_backward_unquantized=True
        Linear->>Forward: Quantize input for FP8 forward
        Forward->>Forward: Compute: y = x_fp8 @ w_fp8
        Forward->>Forward: Save x_high_precision, w_high_precision
        Forward->>Forward: Set ctx.with_quantized_compute=False
        Forward-->>Linear: y, saved tensors
    else keep_backward_unquantized=False
        Linear->>Forward: Quantize input for FP8 forward
        Forward->>Forward: Compute: y = x_fp8 @ w_fp8
        Forward->>Forward: Save x_fp8, w_fp8
        Forward->>Forward: Set ctx.with_quantized_compute=True
        Forward-->>Linear: y, saved tensors
    end
    
    User->>Linear: backward(grad_output)
    Linear->>Backward: grad_output
    
    alt keep_backward_unquantized=True
        Backward->>Backward: Load x_high_precision, w_high_precision
        Backward->>Backward: Disable quantizers (set to None)
        Backward->>Backward: dgrad = grad_out @ w_high_precision (high precision)
        Backward->>Backward: wgrad = x_high_precision.T @ grad_out (high precision)
    else keep_backward_unquantized=False
        Backward->>Backward: Load x_fp8, w_fp8
        Backward->>Backward: Quantize grad_output to FP8
        Backward->>Backward: dgrad = grad_out_fp8 @ w_fp8 (FP8)
        Backward->>Backward: wgrad = x_fp8.T @ grad_out_fp8 (FP8)
    end
    
    Backward-->>Linear: grad_input, grad_weight
    Linear-->>User: gradients
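Outside the diagram, the same dispatch can be summarized as a small torch.autograd.Function sketch; the quantize helper below only emulates FP8 rounding and is not the PR's actual quantization path:

import torch

def quantize(t):
    # Stand-in for FP8 quantization: round-trip through float8 to emulate the precision loss.
    return t.to(torch.float8_e4m3fn).to(t.dtype)

class _LinearSketch(torch.autograd.Function):
    @staticmethod
    def forward(ctx, inp, weight, keep_backward_unquantized):
        out = quantize(inp) @ quantize(weight)  # "quantized" forward GEMM
        if keep_backward_unquantized:
            ctx.save_for_backward(inp, weight)  # keep high-precision copies for backward
        else:
            ctx.save_for_backward(quantize(inp), quantize(weight))  # keep low-precision copies
        ctx.with_quantized_compute = not keep_backward_unquantized
        return out

    @staticmethod
    def backward(ctx, grad_out):
        saved_inp, saved_weight = ctx.saved_tensors
        if ctx.with_quantized_compute:
            grad_out = quantize(grad_out)
        dgrad = grad_out @ saved_weight.t()  # input gradient
        wgrad = saved_inp.t() @ grad_out     # weight gradient
        return dgrad, wgrad, None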

@greptile-apps greptile-apps bot left a comment

6 files reviewed, no comments

@greptile-apps greptile-apps bot left a comment

6 files reviewed, no comments


zianglih commented Feb 3, 2026

I'll work on potential unit test breakage.

@greptile-apps greptile-apps bot left a comment

5 files reviewed, no comments

@greptile-apps greptile-apps bot left a comment

5 files reviewed, 4 comments

@greptile-apps greptile-apps bot left a comment

4 files reviewed, 1 comment

@greptile-apps greptile-apps bot left a comment

5 files reviewed, 1 comment

@greptile-apps greptile-apps bot left a comment

5 files reviewed, 2 comments

ln_out_return = None
if return_layernorm_output or return_layernorm_output_gathered:
    ln_out_return = ln_out
ln_out_hp = ln_out if keep_backward_unquantized else None
Contributor

storing both ln_out (quantized) and ln_out_hp (high precision) doubles the memory footprint for this activation

verify this memory overhead is acceptable for your target models, especially during training with large batch sizes or long sequences

@greptile-apps greptile-apps bot left a comment

4 files reviewed, no comments

@greptile-apps greptile-apps bot left a comment

5 files reviewed, no comments

@greptile-apps greptile-apps bot left a comment

5 files reviewed, no comments

not ctx.use_bias
and not ctx.requires_wgrad
and ctx.grad_output_quantizer is not None
and use_fp8_bwd
Collaborator

same comment as above

recipe = cls.get_fp8_recipe()
if recipe is not None and recipe.delayed():
    # Ignore NVTE_KEEP_BACKWARD_UNQUANTIZED when delayed scaling is used
    return False
Collaborator

Maybe it's better to assert an error for delayed scaling? Okay with both.

Collaborator

I agree. If the user specifies an unsupported combination, I think it's better to fail loudly than to secretly disobey their instructions.
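For reference, a minimal sketch of the fail-loudly variant being suggested, mirroring the snippet quoted above (the exact exception type and message are illustrative):

recipe = cls.get_fp8_recipe()
if recipe is not None and recipe.delayed():
    raise RuntimeError(
        "NVTE_KEEP_BACKWARD_UNQUANTIZED is not supported with delayed scaling recipes"
    )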

- if ctx.fp8 and requires_grad(inp, ln_weight, ln_bias, weight, bias):
+ if (
+     ctx.fp8
+     and not ctx.keep_backward_unquantized
Collaborator

same comment

# Note: dgrad GEMM requires row-wise usage, wgrad GEMM
# requires column-wise usage
- if ctx.grad_output_quantizer is not None:
+ if ctx.grad_output_quantizer is not None and use_fp8_bwd:
Collaborator

this seems redundant too if we skip quant in grad_output_preprocess

@greptile-apps greptile-apps bot left a comment

5 files reviewed, no comments

@greptile-apps greptile-apps bot left a comment

5 files reviewed, 1 comment

Comment on lines 423 to 424

# Prepare GEMM input
Contributor

recomputing activation_func(fc1_out, None, **act_params) adds compute overhead for activations like GELU

consider storing high-precision act_out during forward pass when this feature is enabled to avoid redundant computation (trade memory for compute)

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
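A rough sketch of the memory-for-compute alternative described above; the names (act_out, fc1_out, act_params) follow the comment, and the surrounding forward/backward structure is an assumption:

# Forward (sketch): stash the high-precision activation when the flag is set.
ctx.act_out_hp = act_out if keep_backward_unquantized else None

# Backward (sketch): reuse it if available, otherwise recompute as today.
act_out = ctx.act_out_hp
if act_out is None:
    act_out = activation_func(fc1_out, None, **act_params)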

@greptile-apps greptile-apps bot left a comment

1 file reviewed, 1 comment


greptile-apps bot commented Feb 4, 2026

Additional Comments (1)

transformer_engine/pytorch/module/layernorm_mlp.py
Incorrect instance check

In _LayerNormMLP.backward, this block checks isinstance(ctx.fc1_weight_quantizer, QuantizedTensorStorage) and then calls ctx.fc1_weight.update_usage(...).

QuantizedTensorStorage is a tensor storage type, not a quantizer; this condition will never be true, so usage for ctx.fc1_weight won’t be updated when it should be (FP8 backward + quantized weight path). This looks like a typo for checking the weight (or QuantizedTensorStorage on ctx.fc1_weight) and can break backward that relies on correct usage flags.

Also appears in: none found in this diff.
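A sketch of the check the comment says was likely intended; the isinstance target changes from the quantizer to the weight tensor itself, and the update_usage arguments below are assumptions:

# Check the weight tensor, not its quantizer, before updating usage flags.
if isinstance(ctx.fc1_weight, QuantizedTensorStorage):
    ctx.fc1_weight.update_usage(rowwise_usage=True, columnwise_usage=True)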

@greptile-apps greptile-apps bot left a comment

5 files reviewed, 1 comment


zianglih commented Feb 4, 2026

Some nvfuser tests are failing:

=================================================================== short test summary info ===================================================================
FAILED tests/pytorch/test_sanity.py::test_sanity_amp_and_nvfuser[True-small-None-dtype1] - RuntimeError: /root/TransformerEngine/transformer_engine/common/gemm/cublaslt_gemm.cu:764 in function cublas_gemm: Assertion failed: status != CUBLAS_STAT...
FAILED tests/pytorch/test_sanity.py::test_sanity_amp_and_nvfuser[True-small-None-dtype2] - RuntimeError: /root/TransformerEngine/transformer_engine/common/gemm/cublaslt_gemm.cu:764 in function cublas_gemm: Assertion failed: status != CUBLAS_STAT...
FAILED tests/pytorch/test_sanity.py::test_sanity_amp_and_nvfuser[False-small-None-dtype1] - RuntimeError: /root/TransformerEngine/transformer_engine/common/gemm/cublaslt_gemm.cu:764 in function cublas_gemm: Assertion failed: status != CUBLAS_STAT...
FAILED tests/pytorch/test_sanity.py::test_sanity_amp_and_nvfuser[False-small-None-dtype2] - RuntimeError: /root/TransformerEngine/transformer_engine/common/gemm/cublaslt_gemm.cu:764 in function cublas_gemm: Assertion failed: status != CUBLAS_STAT...
================================================ 4 failed, 12918 passed, 16523 skipped, 20 warnings in 40.71s =================================================

@greptile-apps greptile-apps bot left a comment

5 files reviewed, no comments

@timmoon10 timmoon10 left a comment

This feature is reasonably straightforward, although I have some design suggestions to make it more general. Also, we should add some unit tests to make sure this works as expected.


return cls.HIGH_PRECISION_INIT_VAL

@classmethod
def keep_backward_unquantized(cls) -> bool:
Collaborator

I would prefer this option to live in Recipe rather than FP8GlobalStateManager. FP8GlobalStateManager is for state that changes very frequently (e.g. when entering or exiting a te.autocast), while Recipe has configs that persist throughout training. Exposing the option in Recipe also makes it easier to configure programmatically rather than with an obscure envvar.
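A sketch of what the recipe-based configuration could look like (the field name is hypothetical and the recipe class is just an example; layer and inp are as in the usage sketch in the PR description above):

from transformer_engine.common.recipe import Float8CurrentScaling
import transformer_engine.pytorch as te

recipe = Float8CurrentScaling()
recipe.keep_backward_unquantized = True  # hypothetical recipe field replacing the env var

with te.fp8_autocast(enabled=True, fp8_recipe=recipe):
    out = layer(inp)

# Modules would then read the flag from the active recipe, e.g.:
# keep_backward_unquantized = getattr(fp8_recipe, "keep_backward_unquantized", False)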

return cls.HIGH_PRECISION_INIT_VAL

@classmethod
def keep_backward_unquantized(cls) -> bool:
Collaborator

This option name is specific to this workflow and doesn't generalize well. How about we break this up into two options, quantize_forward and quantize_backward (sketched after the list below)? We have the following cases:

  • quantize_forward=True, quantize_backward=True: Equivalent to quantized case. In the future we might be able to replace FP8GlobalStateManager.FP8_ENABLED with FP8GlobalStateManager.QUANTIZE_FORWARD or FP8GlobalStateManager.QUANTIZE_BACKWARD.
  • quantize_forward=False, quantize_backward=False: Equivalent to unquantized case.
  • quantize_forward=True, quantize_backward=False: Your desired workflow.
  • quantize_forward=False, quantize_backward=True: We can error out in this case, but who knows whether someone in the future might want this.
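A minimal sketch of the two-flag design described in this list; quantize_forward and quantize_backward are the proposed (not yet existing) options:

def resolve_quantization(quantize_forward: bool, quantize_backward: bool) -> tuple[bool, bool]:
    # quantize_forward=True,  quantize_backward=True  -> today's fully quantized path
    # quantize_forward=False, quantize_backward=False -> fully unquantized path
    # quantize_forward=True,  quantize_backward=False -> this PR's workflow
    if quantize_backward and not quantize_forward:
        raise ValueError("quantize_backward=True with quantize_forward=False is not supported yet")
    return quantize_forward, quantize_backward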

Comment on lines 448 to +450
ctx.fp8 = fp8
ctx.fp8_recipe = FP8GlobalStateManager.get_fp8_recipe() if fp8 else None
ctx.keep_backward_unquantized = keep_backward_unquantized
Collaborator

If the backward pass has unquantized compute, does it need to know that the forward pass was quantized? If possible, it would be nice to keep all the changes confined here, where we configure the autograd context.

Suggested change
- ctx.fp8 = fp8
- ctx.fp8_recipe = FP8GlobalStateManager.get_fp8_recipe() if fp8 else None
- ctx.keep_backward_unquantized = keep_backward_unquantized
+ ctx.fp8 = fp8 and not keep_backward_unquantized
+ ctx.fp8_recipe = FP8GlobalStateManager.get_fp8_recipe() if fp8 else None

