
[PyTorch][Fused Attn] Add support for cuDNN to return Softmax Stats always and Max when return_max_logit=True#2677

Open
sudhakarsingh27 wants to merge 22 commits into NVIDIA:main from sudhakarsingh27:fix_return_stats_max_cudnn

Conversation

@sudhakarsingh27
Collaborator

Description

cuDNN recently added support for returning any subset of {Stats, SumExp, Max}. This PR adapts TE to always request Stats from cuDNN, and additionally the Max tensor when return_max_logit=True. (Note that Stats = log(SumExp) + Max.)
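The identity in the note above can be sanity-checked numerically: with Max = max(x) and SumExp = sum(e^(x - Max)), Stats = log(SumExp) + Max is exactly the numerically stable log-sum-exp of the logits. A minimal pure-Python sketch (illustrative only, not TE or cuDNN code):

```python
import math

def softmax_stats(logits):
    """Compute Max, SumExp and Stats as defined in the PR description.

    Stats = log(SumExp) + Max, where SumExp = sum(e^(x - Max)).
    This is the numerically stable log-sum-exp of the logits.
    """
    mx = max(logits)                                  # Max
    sum_exp = sum(math.exp(x - mx) for x in logits)   # SumExp (max-shifted)
    stats = math.log(sum_exp) + mx                    # Stats
    return mx, sum_exp, stats

logits = [2.0, -1.0, 0.5, 3.0]
mx, sum_exp, stats = softmax_stats(logits)

# For well-scaled inputs, Stats matches the naive log-sum-exp
naive = math.log(sum(math.exp(x) for x in logits))
assert abs(stats - naive) < 1e-9
```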

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • fused_attn_f16_arbitrary_seqlen.cu
    • Removed references to the SumExp tensor, since cuDNN now returns Stats by default and SumExp is no longer needed.
    • Set generate_stats=true, which forces cuDNN to always return the Stats tensor (needed in the backward pass)
  • transformer_engine/pytorch/cpp_extensions/fused_attn.py
    • Removed the code that manually computed Stats = log(SumExp) + Max, since cuDNN now returns Stats directly and TE no longer needs SumExp from cuDNN
  • Corresponding documentation
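The change to the Python wrapper boils down to how the forward output pack is split. A hedged sketch of the new unpacking logic (placeholder names, not the actual TE source; the real code operates on torch tensors):

```python
# Illustrative sketch of the new output unpacking (not the actual TE code).
# Assumed layout of output_tensors after this PR:
#   [O, Stats, rng_state, ...]        when return_max_logit is False
#   [O, Stats, Max, rng_state, ...]   when return_max_logit is True

def unpack_fwd_outputs(output_tensors, return_max_logit):
    """Split forward outputs into (out, aux_ctx_tensors, max_tensor)."""
    out = output_tensors[0]
    stats = output_tensors[1]           # Stats is always present now
    if return_max_logit:
        max_tensor = output_tensors[2]  # Max, only when requested
        aux_ctx_tensors = [stats] + output_tensors[3:]
        return out, aux_ctx_tensors, max_tensor
    return out, output_tensors[1:], None

# Toy check with strings standing in for tensors
out, aux, mx = unpack_fwd_outputs(["O", "Stats", "Max", "rng_state"],
                                  return_max_logit=True)
out2, aux2, mx2 = unpack_fwd_outputs(["O", "Stats", "rng_state"],
                                     return_max_logit=False)
```

Either way, the backward pass sees aux_ctx_tensors with Stats first, so no manual log(SumExp) + Max reconstruction is needed.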

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

sudhakarsingh27 and others added 5 commits February 12, 2026 13:12
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
@greptile-apps
Contributor

greptile-apps bot commented Feb 12, 2026

Greptile Summary

This PR adapts TransformerEngine's fused attention implementation to always request the Stats tensor (= log(SumExp) + Max) from cuDNN, and optionally also request the Max tensor when return_max_logit=True. Previously, cuDNN was either asked for Stats (training path) or {Max, SumExp} (return_max_logit path), and TE computed Stats manually from the latter pair. The new cuDNN frontend allows returning any subset, making the manual computation unnecessary.

Key changes:

  • generate_stats is now always true; set_generate_stats(true) is called unconditionally on the cuDNN SDPA graph.
  • The Aux_CTX_Tensors pack is restructured: Stats is always tensor [0], Max occupies tensor [1] only when return_max_logit=True, followed by rng_state and optional Bias/SoftmaxOffset.
  • The Python fused_attn_fwd wrapper no longer manually computes Stats = log(SumExp) + Max; it reads Stats directly from output_tensors[1] and constructs max_logit from output_tensors[2] (Max) when return_max_logit=True.
  • FADescriptor_v1::generate_max_sum_exp is renamed to return_max_logit for clarity, correctly remaining part of the graph cache key.
  • API documentation in fused_attn.h is updated at both return_max_logit parameter occurrences.
  • One minor cleanup opportunity: the generate_stats local variable (line 107) is now always true and could be inlined directly into the .set_generate_stats() call.

Confidence Score: 4/5

  • This PR is safe to merge; the tensor ordering changes are internally consistent across all C++ and Python layers.
  • The change is well-scoped: the Aux_CTX_Tensors pack layout is updated consistently in both the forward allocation pass (size==0) and the data pass (size>=2), the Python layer correctly routes Stats to aux_ctx_tensors and Max to max_logit, and the backward pass always reads Stats from index 0. No regression risk for backward compatibility since the new tensor ordering matches what both C++ callers and Python callers now expect. Score is 4 rather than 5 only because the submodule bump (cuDNN frontend) is not reviewable here, and no new tests were added.
  • No files require special attention; all layers are consistent with the new Stats-first tensor ordering.

Important Files Changed

  • transformer_engine/common/fused_attn/fused_attn_f16_arbitrary_seqlen.cu — Core change: generate_stats is now always true, the Stats tensor is always returned first in Aux_CTX_Tensors, and the Max tensor is included at index 1 only when return_max_logit=True. The tensor pack ordering is consistent between the size==0 allocation pass and the size>=2 data pass. The generate_stats variable could be inlined as true, but this is cosmetic.
  • transformer_engine/common/fused_attn/utils.h — Renames generate_max_sum_exp to return_max_logit in FADescriptor_v1. The field is correctly used in the operator< comparison (cache key), so graph caching correctly differentiates runs with vs. without the Max output.
  • transformer_engine/common/include/transformer_engine/fused_attn.h — Documentation updated at both return_max_logit parameter occurrences (lines 209 and 272) to reflect the new semantics: "Whether to produce Max along with Stats."
  • transformer_engine/pytorch/cpp_extensions/fused_attn.py — Removes the manual Stats = log(SumExp) + Max computation; now constructs aux_ctx_tensors = [Stats, rng_state, ...] and max_logit from output_tensors[2] (Max). Tensor order matches the updated C++ output pack ordering.
  • transformer_engine/pytorch/csrc/extensions/attention.cpp — Updates the allocation comments and the comment for the second auxiliary tensor. Allocation logic is unchanged: the first tensor is always Stats (S/M), and Max is allocated as the second tensor when return_max_logit=True.
  • 3rdparty/cudnn-frontend — Submodule bump from 8d19d31 to a5ca04f to pick up cuDNN frontend support for returning Stats always and Max when requested.

Sequence Diagram

sequenceDiagram
    participant Py as fused_attn_fwd (Python)
    participant Cpp as attention.cpp (C++)
    participant CUDA as fused_attn_f16_arbitrary_seqlen.cu
    participant cuDNN as cuDNN Frontend

    Py->>Cpp: tex.fused_attn_fwd(..., return_max_logit)
    Cpp->>Cpp: Allocate aux_tensor_pack<br/>[0]=Stats, [1]=Max(if rml), [n]=rng_state
    Cpp->>CUDA: nvte_fused_attn_fwd(... Aux_CTX_Tensors)
    CUDA->>CUDA: generate_stats=true (always)
    CUDA->>cuDNN: sdpa with set_generate_stats(true)<br/>+ set_logit_max(Max) if return_max_logit
    cuDNN-->>CUDA: O, Stats, [Max if return_max_logit]
    CUDA-->>Cpp: Aux_CTX_Tensors filled:<br/>[Stats, [Max], rng_state, ...]
    Cpp-->>Py: output_tensors=[O, Stats, [Max], rng_state, ...]

    Note over Py: if return_max_logit:<br/>aux=[Stats, rng_state,...]<br/>max_logit=amax(Max)<br/>else:<br/>aux=output_tensors[1:]

    Py->>Cpp: tex.fused_attn_bwd(..., aux_ctx_tensors=[Stats, rng_state,...])
    Cpp->>CUDA: Aux_CTX_Tensors[0]=Stats, [1]=rng_state
    CUDA->>cuDNN: sdpaBwd(Stats as softmax_stats)
    cuDNN-->>CUDA: dQ, dK, dV

Last reviewed commit: ef0d7ec

@greptile-apps greptile-apps bot left a comment

5 files reviewed, 1 comment

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
…27/TransformerEngine into fix_return_stats_max_cudnn
@greptile-apps greptile-apps bot left a comment

3 files reviewed, no comments

@greptile-apps greptile-apps bot left a comment

3 files reviewed, no comments

@greptile-apps greptile-apps bot left a comment

3 files reviewed, 1 comment

@greptile-apps
Contributor

greptile-apps bot commented Feb 17, 2026

Additional Comments (1)

transformer_engine/pytorch/cpp_extensions/fused_attn.py
Stale docstring: wrong formula for softmaxStats

The public docstring still describes softmaxStats as log(sum(e^(x - max(x)))), which is log(SumExp). However, with this PR, the returned tensor is cuDNN's Stats = log(SumExp) + Max, not just log(SumExp). This formula was already incorrect before this PR (the old code computed Max + log(SumExp) and stored it as stats), but the PR is an opportunity to correct it.

                       softmaxStats: torch.Tensor
                           log(sum(e^(x - max(x)))) + max(x), where x=Q*K.T (i.e. Stats = log(SumExp) + Max)
                           shape [batch_size, num_heads, max_seqlen_q, 1], dtype float32

@greptile-apps greptile-apps bot left a comment

3 files reviewed, 1 comment

stats = output_tensors[1] + torch.log(output_tensors[2])
# thd: output_tensors: out [tq, h, d], Stats [tq, h, 1], Max [tq, h, 1]
# bshd: output_tensors: out [b, sq, h, d], Stats [b, h, sq, 1], Max [b, h, sq, 1]
# sbhd: output_tensors: out [sq, b, h, d], Stats [b, h, sq, 1], Max [b, h, sq, 1] (there's no typo here)
Collaborator
Do we need the "there's no typo here" :)

Collaborator Author
I deliberately added it because I didn't believe it and checked the shapes myself :P

size_t i = 0;
if (Aux_CTX_Tensors->size == 0) {
const auto cudnn_runtime_version = cudnnGetVersion();

Collaborator

@cyanguwa cyanguwa Feb 18, 2026

You might need to make these changes in the "Aux_CTX_Tensors->size == 0" sections in _fwd/bwd_qkvpacked/kvpacked APIs as well. Please check. Thanks!

Collaborator Author

Looks like I don't need to, because the nvte_fused...qvpacked APIs live in fused_attn.cpp, which calls fused_attn_f16_arbitrary... just like the regular nvte_fused_fwd/bwd.

@greptile-apps greptile-apps bot left a comment

3 files reviewed, 1 comment

# Max -> max_logit [h]
max_logit = torch.amax(output_tensors[1], dim=amax_dims).to(dtype=output_tensors[0].dtype)
aux_ctx_tensors = [stats]
max_logit = torch.amax(output_tensors[2], dim=amax_dims).to(dtype=output_tensors[0].dtype)
Collaborator

@KshitijLakhani KshitijLakhani Feb 19, 2026
Maybe I understood this incorrectly, but isn't TE now also supposed to receive Max from cuDNN directly (like Stats, except that Stats is always returned while Max can be toggled), rather than calling amax() in TE?

(Sudhakar: Why am I able to update your comment? )

Collaborator Author

@sudhakarsingh27 sudhakarsingh27 Feb 20, 2026

cuDNN returns Max ([b, h, sq, 1]) so it's an additional softmax statistic (apparently, the subset (Stats, Max) is enough for cuDNN bwd rather than the full set (Stats, SumExp, Max)).

Further, for muon, we need to do amax on it to get a tensor of shape [h]. return_max_logit in TE controls whether to fetch Max from cuDNN.

Perf wise, it'd be nice for cuDNN to do additional reduction to return the [h] shaped tensor for muon as well but that's not the scope of this PR.

(Kshitij: looks like I can as well)
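The reduction described in the reply above — collapsing cuDNN's per-row Max of shape [b, h, sq, 1] to a per-head [h] tensor, as torch.amax with dims (0, 2, 3) would — can be sketched in pure Python (illustrative only; the actual TE code uses torch.amax on tensors):

```python
# Pure-Python sketch of the per-head amax reduction discussed above.
# max_tensor is nested lists with shape [b][h][sq][1]; result has length h.

def amax_per_head(max_tensor):
    """Reduce a [b, h, sq, 1]-shaped nested list to a per-head max list."""
    b = len(max_tensor)
    h = len(max_tensor[0])
    return [
        max(max_tensor[bi][hi][si][0]
            for bi in range(b)
            for si in range(len(max_tensor[bi][hi])))
        for hi in range(h)
    ]

# Toy input: b=2, h=2, sq=2
mx_in = [
    [[[1.0], [2.0]], [[0.5], [0.1]]],   # batch 0
    [[[3.0], [0.0]], [[4.0], [0.2]]],   # batch 1
]
max_logit = amax_per_head(mx_in)  # one value per head
```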

@greptile-apps greptile-apps bot left a comment

3 files reviewed, no comments

@greptile-apps greptile-apps bot left a comment

5 files reviewed, 1 comment

@sudhakarsingh27 sudhakarsingh27 force-pushed the fix_return_stats_max_cudnn branch from 21ca43a to becc3ad on February 20, 2026 19:41
@greptile-apps greptile-apps bot left a comment

5 files reviewed, 1 comment

@greptile-apps
Contributor

greptile-apps bot commented Feb 20, 2026

Additional Comments (1)

transformer_engine/common/include/transformer_engine/fused_attn.h
Entire file has been reformatted with unintentional 3-space indentation changes. This creates a large diff unrelated to the actual feature changes. Revert the formatting to match the original file structure.

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
@sudhakarsingh27 sudhakarsingh27 force-pushed the fix_return_stats_max_cudnn branch from d4568db to 8f40cab on February 20, 2026 20:00
@greptile-apps greptile-apps bot left a comment

5 files reviewed, no comments

@greptile-apps greptile-apps bot left a comment

5 files reviewed, no comments

@sudhakarsingh27 sudhakarsingh27 force-pushed the fix_return_stats_max_cudnn branch from 2b64738 to e005455 on March 10, 2026 19:01
@sudhakarsingh27
Collaborator Author

/te-ci L2

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>