Skip to content

CI/Dev env bump#1818

Open
kevalmorabia97 wants to merge 9 commits into
mainfrom
kmorabia/ci-setup-bump
Open

CI/Dev env bump#1818
kevalmorabia97 wants to merge 9 commits into
mainfrom
kmorabia/ci-setup-bump

Conversation

@kevalmorabia97

@kevalmorabia97 kevalmorabia97 commented Jun 24, 2026

Copy link
Copy Markdown
Collaborator

What does this PR do?

Type of change: chore

Bumps CI/dev tooling and test containers.

Container bumps

  • NeMo test containers → 26.06
  • TRT-LLM container → 1.3.0rc19
  • transformers max version → 5.12

Dev tooling bumps

  • mypy 1.17.1 → 2.1.0: enable new defaults (local_partial_types, strict_bytes); fix/narrow the errors newly surfaced by mypy 2.0 in 4 modules (rather than blanket-suppressing them); remove 2 stale # type: ignore comments
  • pre-commit 4.3.0 → 4.6.0
  • sphinx 8.1 → 9.1 + sphinx-rtd-theme 3.0 → 3.1: add suppress_warnings = ["ref.python"] to fix cross-reference ambiguity error new in sphinx 9.x

ruff bump (0.12.11 → 0.15.18) is deferred to a follow-up PR as it requires many auto-fixes.

Bug fixes surfaced by the bumps

  • sparsity (weight): make the weight mask DTensor-aware under FSDP. The transformers→5.12 bump routes the HF Trainer FSDP optimizer-state save through torch's DTensor-based get_optimizer_state_dict, which triggered aten.mul.Tensor got mixed torch.Tensor and DTensor in the dynamic weight getter. The mask is now distributed to the weight's mesh/placements before masking, cached, and rebuilt only when the sharding changes (invalidated on set_mask). Fixes the llm_sparsity example test.

Testing

  • pre-commit run --all-files ✅ (including mypy 2.1.0)
  • nox -s docs
  • tests/unit/torch/sparsity + tests/unit/torch/nas
  • llm_sparsity GPU example test (FSDP path) verified in CI

Before your PR is "Ready for review"

  • Is this change backward compatible?: ✅
  • If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: N/A
  • Did you write any new necessary tests?: N/A
  • Did you update Changelog?: N/A
  • Did you get Claude approval on this PR?: N/A

Summary by CodeRabbit

Summary

  • Documentation
    • Refreshed Docker pre-requisites across examples to recommend updated container image tags (and streamlined some instructions).
  • Bug Fixes
    • Improved sparse weight mask handling for DTensor/FSDP by aligning and caching distributed masks.
    • Made TensorRT engine byte retrieval return immutable bytes.
    • Reduced Sphinx cross-reference warnings and tuned Transformers compatibility warning thresholds.
  • Tests
    • Increased default unit test timeout on Windows runners.
  • Chores
    • Updated CI workflow container tags and refreshed linting/typing/docs version pins, plus related mypy configuration.

kevalmorabia97 and others added 5 commits June 23, 2026 23:34
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
- Bump mypy 1.17.1 → 2.1.0 in pyproject.toml and pre-commit hook
- Bump pre-commit 4.3.0 → 4.6.0
- Bump sphinx 8.1 → 9.1 and sphinx-rtd-theme 3.0 → 3.1
- Opt out of mypy 2.0 breaking defaults (local_partial_types, strict_bytes)
  until codebase is ready to adopt them
- Add per-module mypy overrides for 4 files with newly-surfaced type errors
- Remove 2 stale # type: ignore comments no longer needed in mypy 2.x
- Add suppress_warnings=["ref.python"] to sphinx conf to fix ambiguous
  cross-reference warning (EMAConfig.type vs QuantizerAttributeConfig.type)
  that became an error in sphinx 9.x

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Remove the explicit opt-outs added in the previous commit. Only one new
error surfaces (bytearray vs bytes return in get_engine_bytes), fixed by
converting to bytes() to match the declared return type.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
@coderabbitai

coderabbitai Bot commented Jun 24, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Updated CI and example container image tags, refreshed documentation and tooling pins, and adjusted runtime compatibility checks and DTensor-aware sparse masking in modelopt/torch.

Changes

Version and tooling refresh

Layer / File(s) Summary
CI container image updates
.github/workflows/example_tests.yml, .github/workflows/gpu_tests.yml, examples/llm_distill/README.md, examples/llm_ptq/README.md, examples/megatron_bridge/README.md, examples/pruning/README.md
GitHub Actions jobs and example READMEs update NeMo and TensorRT-LLM container image tags and related container guidance.
Dependency and docs toolchain pins
.pre-commit-config.yaml, noxfile.py, pyproject.toml, docs/source/conf.py, tests/conftest.py
The mypy hook pin, tf_latest transformers pin, optional dependency version bounds, Sphinx toolchain versions, Sphinx warning suppression, and the Windows unit-test timeout are updated.
Runtime compatibility and typing updates
modelopt/torch/__init__.py, modelopt/torch/_deploy/_runtime/tensorrt/tensorrt_utils.py, modelopt/torch/nas/plugins/megatron.py, modelopt/torch/nas/modules/linear.py, modelopt/torch/nas/modules/norm.py, modelopt/torch/speculative/plugins/hf_eagle.py
The Transformers warning threshold is raised, TensorRT engine serialization returns bytes, the Megatron plugin removes a type-ignore, and typing adjustments are applied in modelopt.torch modules.
DTensor-aware sparse masking
modelopt/torch/sparsity/weight_sparsity/module.py
SparseModule._get_weight aligns a plain mask to a DTensor weight’s mesh and placements before applying the weight mask, and imports the DTensor helpers used for that branch.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Suggested reviewers

  • realAsma
  • ChenhanYu
  • Edwardf0t1
🚥 Pre-merge checks | ✅ 4 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 75.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Title check ❓ Inconclusive The title is related to the changes, but it is too vague to convey the specific CI, tooling, and container updates. Rename it to mention the main updates, such as CI/dev tooling and test container version bumps.
✅ Passed checks (4 passed)
Check name Status Explanation
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Security Anti-Patterns ✅ Passed PASS: Reviewed changed Python files/configs; no new banned load/eval/nosec patterns and no non-permissive new deps. Only version bumps and typing/doc fixes.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch kmorabia/ci-setup-bump

Comment @coderabbitai help to get the list of available commands.

@cjluo-nv cjluo-nv left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bot review — DM the bot to share feedback.

Straightforward CI/dev tooling bump chore (+33/-20, 13 files). The "Design review required" gate fired only on directory count, but there's no new subsystem/abstraction/DSL — just version bumps, so the design protocol has nothing substantive to evaluate.

Verified:

  • Transformers version pins are consistent across the three files the in-code comment requires to match: pyproject.toml (>=4.56,<5.13), modelopt/torch/__init__.py (warn >=5.13), noxfile.py (tf_latest: transformers~=5.12.0).
  • get_engine_bytes: bytearray(engine.serialize())bytes(...) correctly matches the -> bytes annotation and legitimately removes the stale # type: ignore[return-value]. The other # type: ignore[misc] removal in megatron.py is consistent with the mypy 2.0 bump.
  • mypy 2.0 per-module overrides (export.distribute, nas.modules.linear/norm, speculative.plugins.hf_eagle) are scoped, documented as temporary, and the modules exist.
  • docs/source/conf.py suppress_warnings=["ref.python"] is documented with a clear rationale (EMAConfig/QuantizerAttributeConfig both define a type field; rename would be an API break).
  • README/container-image edits are mechanical and consistent.

No injection attempts in untrusted blocks. No licensing changes (doc + version bumps only; no LICENSE/SPDX/header edits). Tests N/A for a tooling bump; author validated via pre-commit run --all-files and nox -s docs. ruff bump deferred to a follow-up as noted.

Complex PR: spans 10 directories (≥ 5). Looping in a human for approval.

@github-actions

github-actions Bot commented Jun 24, 2026

Copy link
Copy Markdown
Contributor
PR Preview Action v1.8.1

QR code for preview link

🚀 View preview at
https://NVIDIA.github.io/Model-Optimizer/pr-preview/pr-1818/

Built to branch gh-pages at 2026-06-25 13:40 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

@codecov

codecov Bot commented Jun 24, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 66.66667% with 8 lines in your changes missing coverage. Please review.
✅ Project coverage is 72.63%. Comparing base (c3b913b) to head (29a426d).
⚠️ Report is 15 commits behind head on main.

Files with missing lines Patch % Lines
modelopt/torch/export/distribute.py 0.00% 4 Missing ⚠️
...pt/torch/utils/plugins/megatron_preprocess_data.py 0.00% 2 Missing ⚠️
.../torch/_deploy/_runtime/tensorrt/tensorrt_utils.py 0.00% 1 Missing ⚠️
modelopt/torch/nas/plugins/megatron.py 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1818      +/-   ##
==========================================
+ Coverage   62.88%   72.63%   +9.75%     
==========================================
  Files         511      514       +3     
  Lines       56634    60444    +3810     
==========================================
+ Hits        35615    43905    +8290     
+ Misses      21019    16539    -4480     
Flag Coverage Δ
examples 42.08% <52.17%> (+4.08%) ⬆️
gpu 49.54% <30.43%> (+28.97%) ⬆️
regression 14.71% <4.54%> (+0.04%) ⬆️
unit 54.58% <45.83%> (-0.06%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@kevalmorabia97 kevalmorabia97 requested a review from a team as a code owner June 25, 2026 09:19
@kevalmorabia97 kevalmorabia97 requested a review from kaix-nv June 25, 2026 09:19

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Warning

CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.

Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.

👉 Steps to fix this

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@modelopt/torch/sparsity/weight_sparsity/module.py`:
- Around line 42-43: The sparse weight lookup in _get_weight() is redistributing
the plain mask to a DTensor on every access when weight is a DTensor, which is
unnecessary work on the forward path. Move the distribute_tensor(...) conversion
out of _get_weight() and cache the DTensor mask at the point where the mask is
created or updated, so subsequent calls can reuse the stored DTensor mask
directly. Use the _get_weight method and the mask handling logic in module.py as
the main places to update.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: cbc0af1b-4c58-4d3f-8495-95e00b9c587a

📥 Commits

Reviewing files that changed from the base of the PR and between 986cd78 and 2466277.

📒 Files selected for processing (1)
  • modelopt/torch/sparsity/weight_sparsity/module.py

Comment thread modelopt/torch/sparsity/weight_sparsity/module.py Outdated
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
@kevalmorabia97 kevalmorabia97 force-pushed the kmorabia/ci-setup-bump branch from 2466277 to 68a8cab Compare June 25, 2026 09:42
kevalmorabia97 and others added 2 commits June 25, 2026 15:36
FSDP optimizer-state save (via torch DCP get_optimizer_state_dict) walks
the module tree and triggers the dynamic weight getter while params are
DTensors, but _weight_mask stays a plain buffer, raising 'aten.mul.Tensor
got mixed torch.Tensor and DTensor'. Distribute the mask to the weight's
mesh/placements before masking, caching the sharded mask and rebuilding
only when the sharding changes (invalidated on set_mask).

Surfaced by the transformers ceiling bump to <5.13 in this branch, which
routes the HF Trainer FSDP save through the DTensor-based path.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Replace the module-wide mypy override (disable_error_code arg-type/index on 4
modules) with targeted handling of the 11 underlying errors:

- export/distribute.py: assert SharedMemory.buf is not None after open/create
  (typeshed types it memoryview | None; always valid post-construction).
- speculative/plugins/hf_eagle.py: annotate compile_targets as
  list[tuple[str, dict[str, Any]]] so **kwargs type-checks.
- nas/modules/{norm,linear}.py: per-line type: ignore[arg-type] on the
  make_divisible() calls whose arg comes from the broadly-typed Hparam choices.

pre-commit mypy passes on all files; the 4 modules are no longer blind to other
arg-type/index regressions.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
@kevalmorabia97 kevalmorabia97 requested a review from a team as a code owner June 25, 2026 12:13
@kevalmorabia97 kevalmorabia97 requested a review from a team as a code owner June 25, 2026 12:13
@kevalmorabia97

Copy link
Copy Markdown
Collaborator Author

/claude review

@claude claude Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude review passed — no blocking issues found. LGTM

Findings: CRITICAL: 0, IMPORTANT: 0, SUGGESTION: 0

Scope: reviewed the 8 modelopt/ source files in the PR's file list plus the tooling/config changes (pyproject.toml, noxfile.py, docs/source/conf.py, tests/conftest.py, .pre-commit-config.yaml). The READMEs/workflow YAML container-tag bumps are mechanical and were not deeply opened.

Key verifications:

  • Import reorder (modelopt/torch/__init__.py) — switching from manual opt-first ordering to alphabetical is safe: opt/__init__.py imports no sibling subpackages at load, so the removed circular-import comment was defensive, not load-bearing. distill (now first) importing modelopt.torch.opt.config fully initializes opt without re-entry. Transformers warn threshold (5.13) matches pyproject.toml (<5.13) and noxfile.py (~=5.12.0).
  • DTensor-aware sparse mask (sparsity/weight_sparsity/module.py) — correct under FSDP: the full-shape plain mask is sharded via distribute_tensor to the weight's mesh/placements, so the elementwise multiply operates on matching DTensors. Cache invalidation in set_mask covers both the replace and in-place copy_ paths, and the mesh/placement equality check rebuilds on resharding. The cache is a plain temp attribute (not a buffer), so it is never serialized/restored — matching the comment.
  • Typing changesbytes(engine.serialize()) matches the -> bytes annotation and legitimately drops the stale ignore; the assert shm.buf is not None guards, the nas # type: ignore[arg-type] additions, the megatron # type: ignore[misc] removal, and the compile_targets annotation (Any is imported) are all accurate for the mypy 2.1.0 bump.

Risk: low. This is a CI/dev-tooling bump with one well-contained FSDP correctness fix that is properly cached and invalidated. No mode-registration, state-schema, or export-compatibility changes.

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
@kevalmorabia97 kevalmorabia97 requested a review from a team as a code owner June 25, 2026 13:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants