
Fix incorrect MNNVL fabric check#2626

Merged
ptrendx merged 3 commits into NVIDIA:main from nvcastet:fix_mnnvl_check
Feb 24, 2026

@nvcastet
Contributor

Description

Fix incorrect MNNVL fabric check

Fixes # (issue)

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Change A
  • Change B

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Signed-off-by: Nicolas Castet <ncastet@nvidia.com>
@nvcastet
Contributor Author

@ptrendx Can you trigger CI?

@greptile-apps
Contributor

greptile-apps bot commented Jan 27, 2026

Greptile Overview

Greptile Summary

Fixed the MNNVL fabric detection logic in has_mnnvl_fabric(). The previous implementation treated the clusterUuid byte array as a C-string and checked only its first byte (clusterUuid[0]). Because a UUID is raw binary data, a valid non-zero UUID whose first byte happens to be zero would be misread as empty, so the check could wrongly report that no fabric is present. The fix compares the entire UUID array against a zero-initialized array using memcmp(). The fabric state check also changed from >= to ==, so that only the exact NVML_GPU_FABRIC_STATE_COMPLETED state is accepted.
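
For illustration only, the two styles of UUID check described above can be sketched as follows. This is not the code from the PR: the helper names and the `kUuidLen` constant are hypothetical stand-ins (the length is assumed to be 16 bytes), and no NVML calls are made.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical stand-in for the NVML UUID length; 16 bytes is an assumption here.
constexpr std::size_t kUuidLen = 16;

// First-byte style check: treats the UUID like a C-string and
// looks only at uuid[0].
bool uuid_set_first_byte(const unsigned char (&uuid)[kUuidLen]) {
  return uuid[0] != 0;
}

// Full-array style check: compares every byte against an all-zero array.
bool uuid_set_full(const unsigned char (&uuid)[kUuidLen]) {
  static const unsigned char zeros[kUuidLen] = {};
  return std::memcmp(uuid, zeros, kUuidLen) != 0;
}
```

The two helpers agree on an all-zero UUID and on any UUID with a non-zero first byte. They differ only when the first byte is zero and some later byte is not, in which case only the full-array comparison reports the UUID as set.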

Confidence Score: 5/5

  • This PR is safe to merge with no risks
  • The changes fix a clear bug in UUID comparison logic by properly comparing the entire byte array instead of incorrectly treating it as a null-terminated string. The implementation is correct and the change is well-scoped to just the fabric check function.
  • No files require special attention

Important Files Changed

Filename: transformer_engine/common/comm_gemm_overlap/userbuffers/userbuffers-host.cpp
Overview: Fixed MNNVL fabric detection by checking the entire UUID array instead of treating it as a C-string, and changed the state comparison from >= to == for an exact match.

Contributor

@greptile-apps greptile-apps bot left a comment

No files reviewed, no comments

Collaborator

@timmoon10 timmoon10 left a comment

LGTM, pending CI

@timmoon10
Collaborator

/te-ci L1

@ptrendx
Member

ptrendx commented Feb 11, 2026

/te-ci L1

Contributor

@greptile-apps greptile-apps bot left a comment

1 file reviewed, no comments

@ptrendx
Member

ptrendx commented Feb 12, 2026

/te-ci L1

Contributor

@greptile-apps greptile-apps bot left a comment

1 file reviewed, no comments

@ptrendx ptrendx merged commit 9eb982e into NVIDIA:main Feb 24, 2026
48 of 54 checks passed
Oleg-Goncharov pushed a commit to Oleg-Goncharov/TransformerEngine that referenced this pull request Feb 27, 2026
Signed-off-by: Nicolas Castet <ncastet@nvidia.com>
Co-authored-by: Przemyslaw Tredak <ptredak@nvidia.com>
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

3 participants