
Pytorch binding for cublas grouped gemm + Grouped Bias Support + Grouped Tensor Swizzling#2669

Merged
vthumbe1503 merged 93 commits into NVIDIA:main from vthumbe1503:users/vthumbe/pytorch_binding_for_cublas_gemm
Mar 16, 2026
Conversation

Collaborator

@vthumbe1503 vthumbe1503 commented Feb 10, 2026

Description

PyTorch binding for cuBLAS grouped GEMM.

Fixes # (issue)

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

  • Add three PyTorch bindings for cuBLAS grouped GEMM: one where both inputs and outputs are grouped tensors, one where input A is a list of tensors and the others are grouped tensors (needed for weights in the forward pass), and one where the output is a list of tensors (needed for the Wgrad/main_grad update).
    • The missing nvte APIs are also added for them. Code common to the three APIs is refactored into cublaslt_grouped_gemm.cu.
    • From Python there is a single API for using cuBLAS grouped GEMM. It dispatches to the appropriate tex API based on whether input A or the output is a list of tensors.
    • A grouped bias-add kernel is also added, since cuBLAS grouped GEMM does not support bias/dbias.
    • Grad support is currently missing in the API.
  • Fixes a bug in type_converters.cpp where GroupedTensor was accessed via the data attribute instead of rowwise_data.
  • Adds GroupedTensor swizzling for the case where every split has a uniform shape (weights).
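The single-entry-point dispatch described above can be sketched as follows. This is an illustrative sketch only: `dispatch_grouped_gemm` is a hypothetical helper, and the returned strings are the C++ binding names quoted later in this review; the guard against combining a discrete input A with a discrete output mirrors the conflict check noted for gemm.py.

```python
def dispatch_grouped_gemm(A, out) -> str:
    """Pick the tex API variant based on which operands are lists of tensors.

    Hypothetical sketch of the dispatch logic; not the actual wrapper code.
    """
    a_is_list = isinstance(A, (list, tuple))
    out_is_list = isinstance(out, (list, tuple))
    if a_is_list and out_is_list:
        # discrete_in and discrete_out cannot be combined
        raise ValueError("input A and output cannot both be lists of tensors")
    if a_is_list:
        return "te_general_grouped_gemm_for_discrete_in"
    if out_is_list:
        return "te_general_grouped_gemm_for_discrete_out"
    return "te_general_grouped_gemm_for_grouped_tensor"
```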

The following will be done in a follow-up PR:

  • Workspace caching for alpha and beta in the general grouped GEMM
  • NVFP4 grouped swizzle support
  • Improve performance of group_bias_add

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

ksivaman and others added 2 commits February 6, 2026 06:10
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
@vthumbe1503 vthumbe1503 changed the title Users/vthumbe/pytorch binding for cublas gemm Pytorch binding for cublas gemm + Grouped Linear integration Feb 10, 2026
vthumbe1503 and others added 4 commits February 11, 2026 03:11
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
…b.com:vthumbe1503/TransformerEngine into users/vthumbe/pytorch_binding_for_cublas_gemm
Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>
@vthumbe1503 vthumbe1503 requested a review from ptrendx February 11, 2026 17:15
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
@vthumbe1503 vthumbe1503 marked this pull request as ready for review March 6, 2026 17:46
@greptile-apps
Contributor

greptile-apps bot commented Mar 6, 2026

Greptile Summary

This PR adds PyTorch bindings for cuBLASLt grouped GEMM, supporting three dispatch variants (all-GroupedTensor, discrete input-A list, discrete output list), a grouped bias-add kernel, and grouped tensor swizzling for MXFP8 weights. It also fixes a bug in type_converters.cpp where rowwise_data was incorrectly accessed via the old data attribute name.

Key changes:

  • Three new NVTE C APIs: nvte_grouped_gemm_with_discrete_inputA, nvte_grouped_gemm_with_discrete_out, nvte_grouped_bias_add — each with validation, workspace setup, and cuBLASLt dispatch.
  • Refactored validation helpers: validate_grouped_gemm_inputs now accepts an initializer_list for both A and B, restoring FP16 support and cross-operand (FP8/MXFP8/swizzle) consistency checks.
  • grouped_bias_add_kernel: vectorized BF16/FP16 bias addition for grouped output; requires uniform last dim and last_dim % 4 == 0.
  • Grouped tensor swizzling: swizzle_grouped_scaling_factors / maybe_swizzle_grouped_tensor_for_gemm handle in-place swizzle of MXFP8 scale factors for uniform-shape grouped weights.
  • _with_gemm_swizzled_scales propagation: added through Python GroupedTensorStorage, C++ quantizer constructors, and type_converters.cpp.
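The grouped bias-add constraints listed above (uniform last dimension across splits, and last_dim % 4 == 0 for vectorized loads) could be checked up front roughly like this. The function name is hypothetical; the two constraints come from the kernel description above.

```python
def check_grouped_bias_add_shapes(out_shapes):
    """Return the shared last dim, or raise if the kernel's constraints fail.

    Hypothetical pre-flight check mirroring grouped_bias_add_kernel's
    requirements: all splits share one last dim, divisible by 4.
    """
    last_dims = {shape[-1] for shape in out_shapes}
    if len(last_dims) != 1:
        raise ValueError("grouped bias add requires a uniform last dim")
    (last_dim,) = last_dims
    if last_dim % 4 != 0:
        raise ValueError("grouped bias add requires last_dim % 4 == 0")
    return last_dim
```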

Remaining concerns:

  • build_grouped_gemm_multi_out_args validates dtype but does not verify that each D-list tensor's dimensions are compatible with the expected GEMM output shape — mismatched shapes will silently corrupt memory.
  • validate_grouped_gemm_outputs checks each output for a valid dtype but does not enforce that C and D share the same dtype when both are present.
  • The test comment "Bias add in grouped kernel accumulates in FP32 for BF16/FP16" is inaccurate; the kernel performs native-type (BF16/FP16) addition with no FP32 intermediate.
  • B_fp8 = grouped_B.split_into_quantized_tensors() in test_grouped_gemm_grouped_tensor_mxfp8 is dead code in the new-API call path.

Confidence Score: 3/5

  • This PR is a large new feature with several addressed issues; the two remaining validation gaps (D-list shape, C/D dtype compatibility) and a misleading test comment reduce confidence, but no critical compile-breaking or data-loss regressions were identified in the current HEAD.
  • The PR resolves a large backlog of prior review comments (merge conflicts, swizzle corruption, MXFP8 validation, FP16 restoration, duplicate assignments, etc.). The remaining concerns are: (1) missing D-list shape validation that could lead to silent wrong results in the discrete_out path; (2) validate_grouped_gemm_outputs not cross-checking C/D dtype; (3) a misleading test comment about FP32 accumulation that may hide precision regressions; and (4) dead code (B_fp8 split) in the MXFP8 test. None are compile-blocking, but (1) and (2) are logic-correctness concerns.
  • Pay close attention to transformer_engine/common/gemm/cublaslt_grouped_gemm.cu — specifically build_grouped_gemm_multi_out_args (missing D-shape validation) and validate_grouped_gemm_outputs (no C/D dtype compatibility check). Also review tests/pytorch/test_numerics.py for the inaccurate FP32 accumulation comment.

Important Files Changed

Filename Overview
transformer_engine/common/gemm/cublaslt_grouped_gemm.cu Core change: adds three new NVTE APIs (nvte_grouped_gemm_with_discrete_inputA, nvte_grouped_gemm_with_discrete_out, nvte_grouped_bias_add), refactors validation helpers, adds grouped_bias_add_kernel, and integrates MultiTensorGroupGemmInputArgs/MultiTensorGroupGemmOutputArgs into the setup kernel. Several previously-reported issues were fixed (FP16 restored, MXFP8 cross-operand checks, duplicate assignments). Two remaining concerns: D_list output shape is never validated against GEMM-expected dimensions, and validate_grouped_gemm_outputs doesn't enforce C/D dtype compatibility.
transformer_engine/pytorch/cpp_extensions/gemm.py Adds general_grouped_gemm_for_grouped_tensor Python wrapper with dispatch to three C++ bindings (grouped-tensor, discrete-in, discrete-out). Guards for discrete_in+discrete_out conflict and bias+discrete_out conflict are correctly added. Workspace size formula updated to 8 pointer arrays (matching C++). Device lookup uses rowwise_data consistently. Minor: return type annotation is now Union[torch.Tensor, List[torch.Tensor]] which is correct.
transformer_engine/pytorch/csrc/extensions/gemm.cpp Adds three new pybind-facing functions (te_general_grouped_gemm_for_grouped_tensor, te_general_grouped_gemm_for_discrete_in, te_general_grouped_gemm_for_discrete_out). All call maybe_swizzle_grouped_tensor_for_gemm appropriately, hold swizzled scales alive, and delegate to the corresponding nvte_* API. The bias-add path reuses nvte_grouped_bias_add. The discrete_out path passes D as both C_list and D_list which is intentional for the accumulate=True case.
transformer_engine/pytorch/csrc/extensions/swizzle.cpp Adds maybe_swizzle_grouped_tensor_for_gemm which safely handles MXFP8 grouped tensor swizzling: allocates separate output buffers, delegates to nvte_swizzle_grouped_scaling_factors, then updates the input's scale pointers. The previous in-place corruption bug (overwriting input scales before the kernel read) was fixed. Shape uniformity check correctly rejects non-uniform tensors using data_ptr != nullptr.
transformer_engine/common/swizzle/swizzle.cu Adds grouped swizzle kernels (grouped_swizzle_row/col_scaling_uniform_shape_kernel) that reuse per-tensor blockIdx.z offset into the contiguous scale buffer. The swizzle_grouped_scaling_factors implementation validates uniformity, computes padding/stride, and dispatches the appropriate vec_load_size variant. Logic mirrors the existing single-tensor swizzle kernels, adapted for uniform grouped shapes.
transformer_engine/pytorch/csrc/type_converters.cpp Bug fix: replaces tensor.attr("data") with tensor.attr("rowwise_data") for GroupedTensor rowwise data access, correcting the attribute name after a prior rename. Also adds _with_gemm_swizzled_scales propagation using py::hasattr for safe optional access.
tests/pytorch/test_numerics.py Adds two new test functions: test_grouped_gemm_grouped_tensor (BF16, all three cases: no_discrete/discrete_in/discrete_out) and test_grouped_gemm_grouped_tensor_mxfp8 (FP8 with uniform expert sizes). Both have correct SM100/cuBLAS 13.2 skip guards. Minor: test comment about "FP32 accumulation" in the bias-add kernel is inaccurate (kernel uses native-dtype addition), and B_fp8 is unused in the new-API call path.
transformer_engine/common/include/transformer_engine/gemm.h Declares three new experimental APIs with Doxygen comments. The \note for nvte_grouped_gemm_with_discrete_out correctly documents the C/D dtype constraint. The new kNVTEGroupedMatmulConfigUseSplitAccumulator=3 enum value is inserted before kNVTEGroupedMatmulConfigSMCount, which shifts kNVTEGroupedMatmulConfigSMCount from 3 to 4 — a potential ABI break noted in a prior review (developer indicated it will be addressed later).
transformer_engine/pytorch/tensor/storage/grouped_tensor_storage.py Propagates with_gemm_swizzled_scales through _initialize_storage_fields, __new__, and make_grouped_tensor. The MXFP8 quantizer path correctly sets this field from quantizer.optimize_for_gemm; other quantizers default to False.
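The enum-shift concern flagged for gemm.h can be illustrated in miniature: inserting a new enumerator before an existing one silently renumbers everything after it, which breaks ABI for callers compiled against the old header. The enumerator names below are the ones quoted in the table; the before/after values (3 and 4) are as stated there.

```python
from enum import IntEnum

class OldConfig(IntEnum):
    # Before the PR: SMCount was the enumerator with value 3.
    kNVTEGroupedMatmulConfigSMCount = 3

class NewConfig(IntEnum):
    # After the PR: a new enumerator takes value 3, shifting SMCount to 4.
    kNVTEGroupedMatmulConfigUseSplitAccumulator = 3
    kNVTEGroupedMatmulConfigSMCount = 4
```

Code compiled against the old header still passes 3 for SMCount, which the new library now interprets as UseSplitAccumulator; this is the potential ABI break the review notes.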

Sequence Diagram

sequenceDiagram
    participant PY as Python (gemm.py)
    participant CPP as C++ Extension (gemm.cpp)
    participant SW as Swizzle (swizzle.cpp)
    participant NVTE as NVTE API (cublaslt_grouped_gemm.cu)
    participant CB as cuBLASLt

    PY->>PY: general_grouped_gemm_for_grouped_tensor(A, B, out)
    PY->>PY: Dispatch: grouped_tensor / discrete_in / discrete_out

    alt discrete_in (A is list of weights)
        PY->>CPP: te_general_grouped_gemm_for_discrete_in(A_list, B, D)
        CPP->>SW: multi_tensor_swizzle_scales_for_gemm(A_wrappers)
        CPP->>SW: maybe_swizzle_grouped_tensor_for_gemm(grouped_B)
        CPP->>NVTE: nvte_grouped_gemm_with_discrete_inputA(A_list, B, C, D)
    else discrete_out (out is list of wgrads)
        PY->>CPP: te_general_grouped_gemm_for_discrete_out(A, B, D_list)
        CPP->>SW: maybe_swizzle_grouped_tensor_for_gemm(grouped_A)
        CPP->>SW: maybe_swizzle_grouped_tensor_for_gemm(grouped_B)
        CPP->>NVTE: nvte_grouped_gemm_with_discrete_out(A, B, C_list, D_list)
    else no_discrete (all GroupedTensors)
        PY->>CPP: te_general_grouped_gemm_for_grouped_tensor(A, B, D)
        CPP->>SW: maybe_swizzle_grouped_tensor_for_gemm(grouped_A)
        CPP->>SW: maybe_swizzle_grouped_tensor_for_gemm(grouped_B)
        CPP->>NVTE: nvte_grouped_gemm(A, B, C, D)
    end

    NVTE->>NVTE: validate_grouped_gemm_inputs / validate_grouped_gemm_outputs
    NVTE->>NVTE: select_grouped_operand (row vs col-wise)
    NVTE->>NVTE: setup_grouped_gemm_workspace
    NVTE->>NVTE: launch_grouped_gemm_setup (kernel: fill ptr arrays)
    NVTE->>CB: cublasLtMatmul (grouped GEMM)
    CB-->>NVTE: GEMM result in D

    opt bias provided (not discrete_out)
        CPP->>NVTE: nvte_grouped_bias_add(D, bias)
        NVTE->>NVTE: grouped_bias_add_kernel<<<grid, block>>>
    end

Comments Outside Diff (3)

  1. transformer_engine/common/gemm/cublaslt_grouped_gemm.cu, line 708-736 (link)

    Missing output-shape validation for D_list tensors

    build_grouped_gemm_multi_out_args checks that each tensor in D_list is 2D and that its dtype matches expected_dtype, but it never verifies that the dimensions are compatible with the GEMM output (M × N). A caller that accidentally passes tensors of wrong shape will silently produce incorrect results or corrupt memory — the kernel will write to the wrong addresses computed by setup_grouped_gemm_kernel.

    The same gap applies to C_list entries. Consider adding a shape-compatibility check once avg_m / avg_n (or operand shapes) are known, or at minimum add an assertion that the tensor has the expected number of elements:

    // Example guard inside the per-tensor loop:
    const size_t expected_elements = static_cast<size_t>(args.rows[i]) *
                                     static_cast<size_t>(args.cols[i]);
    NVTE_CHECK(t->data.numel() == expected_elements,
               "Grouped GEMM: D_list tensor ", i,
               " element count mismatch (expected ", expected_elements,
               ", got ", t->data.numel(), ")");
  2. transformer_engine/common/gemm/cublaslt_grouped_gemm.cu, line 558-596 (link)

    validate_grouped_gemm_outputs does not enforce C / D dtype compatibility

    Each output tensor in the initialiser list is independently checked for being a valid output dtype (BF16 / FP16 / FP32), but there is no cross-check ensuring that C and D share the same dtype. cuBLASLt grouped GEMM requires C and D to have matching types in standard configurations. If a caller passes C as FP32 and D as BF16, validation passes but the GEMM may silently produce wrong results.

    Consider adding:

    // After the per-tensor loop, verify C/D dtype match when both are present.
    const transformer_engine::GroupedTensor *c_out = nullptr, *d_out = nullptr;
    for (const auto *tensor : outputs) {
      if (tensor == nullptr) continue;
      if (!c_out) c_out = tensor;
      else d_out = tensor;
    }
    if (c_out && d_out) {
      NVTE_CHECK(c_out->dtype() == d_out->dtype(),
                 "Grouped GEMM: C and D outputs must have the same dtype.");
    }
  3. tests/pytorch/test_numerics.py, line 339-344 (link)

    B_fp8 is allocated and split but never used in the new-API call

    B_fp8 = grouped_B.split_into_quantized_tensors() is used in the reference general_grouped_gemm call, which is correct. However, the new-API call general_grouped_gemm_for_grouped_tensor always receives grouped_B directly (not B_fp8) regardless of case. B_fp8 is therefore dead code in the new-API path. Remove it or document why the split is created (e.g., required to force scale initialisation on the GroupedTensor side).

Last reviewed commit: 18f479d

vthumbe1503 and others added 2 commits March 6, 2026 18:15
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
@vthumbe1503 vthumbe1503 marked this pull request as draft March 6, 2026 18:42
@vthumbe1503 vthumbe1503 changed the title Pytorch binding for cublas gemm + Grouped Linear integration Pytorch binding for cublas gemm Mar 9, 2026
vthumbe1503 and others added 4 commits March 9, 2026 16:07
@vthumbe1503 vthumbe1503 marked this pull request as ready for review March 9, 2026 16:14
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
…b.com:vthumbe1503/TransformerEngine into users/vthumbe/pytorch_binding_for_cublas_gemm
vthumbe1503 and others added 3 commits March 13, 2026 19:44
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
@vthumbe1503 vthumbe1503 changed the title Pytorch binding for cublas grouped gemm Pytorch binding for cublas grouped gemm + Grouped Bias Support + Grouped Tensor Swizzling Mar 13, 2026
@zhongbozhu
Collaborator

In a followup PR, let's build a rigorous enough unit test covering paged stashing and empty tokens for this rank like this one: https://github.com/NVIDIA/TransformerEngine/blob/main/tests/pytorch/mxfp8/test_mxfp8_group_quantize_graph_safe.py @vthumbe1503

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
vthumbe1503 and others added 3 commits March 16, 2026 01:24
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
…b.com:vthumbe1503/TransformerEngine into users/vthumbe/pytorch_binding_for_cublas_gemm
@vthumbe1503
Collaborator Author

/te-ci L1 pytorch

@zhongbozhu zhongbozhu self-requested a review March 16, 2026 03:00
Collaborator

@zhongbozhu zhongbozhu left a comment


LGTM, need a follow up to add unit tests as well.

@vthumbe1503 vthumbe1503 merged commit 708d7c1 into NVIDIA:main Mar 16, 2026
29 of 32 checks passed