[OMNIML-5084] cell_t0_d7 by ChenhanYu · Pull Request #1788 · NVIDIA/Model-Optimizer

ChenhanYu · 2026-06-22T13:31:44Z

Draft PR opened by pensieve-intern for OMNIML-5084.

Stage cell_t0_d7 of Epic OMNIML-5081. The agent ran from the SPEC on the ticket description; review every change before marking ready.

Always-draft is enforced — the bot never auto-merges.

…nfra fixes

modelopt/torch/quantization/plugins/transformer_engine.py: MODELOPT_TEGROUPED_PER_EXPERT_QUANTIZER=1 opts into per-gemm weight_quantizer_0..N-1 inside _QuantTEGroupedLinear (deepcopied from the shared weight_quantizer). Lets TEGroupedMLP recover per-expert amax granularity, matching SequentialMLP's default behavior.

modelopt/torch/distill/plugins/megatron.py: LogitsKLLoss.forward prints student/teacher logit stats (mean/std/min/max/shape) on rank 0 each call. Diagnostic for the QAD loss-spike investigation; confirms which spec produces which logits without changing the KL math.

tests/gpu_megatron/torch/quantization/plugins/test_megatron.py: New test_te_grouped_vs_sequential_default_amax + ..._default_loss cover the structural amax asymmetry between TEGroupedMLP and SequentialMLP (TEGrouped per-linear amax = max-over-Sequential-experts amax) and a finiteness sanity check on the resulting quant error.

tools/launcher/common/service_utils.sh:
- Fall back to SLURM_PROCID / SLURM_LOCALID when PMIX_*/OMPI_* are unset, so `[[ "$mpi_local_rank" -eq 0 ]]` doesn't silently pass on every rank under plain srun.
- util_install_extra_dep: per-node marker so concurrent ranks wait for rank 0 to finish installing (concurrent pip on a shared FS leaves a broken state); also installs nvidia-resiliency-ext.

Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>

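The amax asymmetry described in the commit message above (TEGrouped per-linear amax = max-over-Sequential-experts amax) can be illustrated with a minimal sketch. This is not the ModelOpt implementation: the weights are made-up values, `amax` is a hypothetical helper, and per-tensor calibration is assumed for simplicity.

```python
# Hypothetical per-expert weights; the values are invented for illustration only.
expert_weights = [
    [0.5, -1.25, 0.75],  # expert 0
    [2.0, -0.1, 0.3],    # expert 1
    [-0.9, 0.4, -3.5],   # expert 2
]


def amax(values):
    """Per-tensor amax: the largest absolute value in the tensor."""
    return max(abs(v) for v in values)


# SequentialMLP-style calibration: one amax per expert.
per_expert_amax = [amax(w) for w in expert_weights]

# Shared-quantizer calibration: a single amax observed over all experts' weights.
shared_amax = amax([v for w in expert_weights for v in w])

# The shared amax is the maximum of the per-expert values.
assert shared_amax == max(per_expert_amax)
```

Under these assumptions, experts 0 and 1 would be quantized against a range sized for expert 2 when a single quantizer is shared, which is the granularity that the per-expert opt-in is described as recovering.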

- transformer_engine.py: dedup `import copy`/`import os` left over from the rebase, sort the four imports alphabetically.
- transformer_engine.py: comment near the per-expert weight_quantizer setup explaining that base modelopt_post_restore won't re-calibrate the weight_quantizer_{i} modules, so save/restore is only safe when TP/EP is unchanged. Per-channel _amax shape depends on the TP-sliced output dim.
- service_utils.sh: drop the duplicated mpi_rank / mpi_local_rank re-assignments; main already carries the SLURM fallback, and the extra two lines were leftover rebase noise.

Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>

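The note above, that per-channel _amax shape depends on the TP-sliced output dim, can be sketched in plain Python. This is an illustration under stated assumptions, not ModelOpt code: `tp_slice_rows` and `per_channel_amax` are hypothetical helpers, the weight is stored as a list of output-channel rows, and TP is assumed to split the output dimension evenly.

```python
def per_channel_amax(weight_rows):
    """One amax per output channel (one entry per row)."""
    return [max(abs(v) for v in row) for row in weight_rows]


def tp_slice_rows(weight_rows, tp_size, tp_rank):
    """Return the shard of output channels owned by tp_rank (even split assumed)."""
    assert len(weight_rows) % tp_size == 0
    shard = len(weight_rows) // tp_size
    return weight_rows[tp_rank * shard:(tp_rank + 1) * shard]


# Made-up weight with 4 output channels and 2 input features.
full_weight = [[0.1, -0.4], [1.5, 0.2], [-0.3, 0.9], [2.0, -2.5]]

amax_tp2 = per_channel_amax(tp_slice_rows(full_weight, tp_size=2, tp_rank=0))
amax_tp4 = per_channel_amax(tp_slice_rows(full_weight, tp_size=4, tp_rank=0))

# The per-channel amax saved at TP=2 has a different length than the one
# a TP=4 module would expect, so it cannot be loaded as-is.
assert len(amax_tp2) != len(amax_tp4)
```

In this sketch, a checkpoint written at one TP size carries an amax whose length no longer matches the local shard at another TP size, which is consistent with the comment's warning that save/restore is only safe when TP/EP is unchanged.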

copy-pr-bot · 2026-06-22T13:31:47Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

coderabbitai · 2026-06-22T13:31:53Z

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: ef1e43cc-f8bc-44fa-aca7-d6d07132087f

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


Comment `@coderabbitai help` to get the list of available commands and usage tips.

jenchen13 added 4 commits May 27, 2026 15:41

revert logging

cc7953e

Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>

sync weight_quantizer_{i} during post restore

b1e32d9

Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>

ChenhanYu wants to merge 4 commits into main from pensieve-intern/OMNIML-5081/cell-t0-d7

2 participants