[OMNIML-5084] cell_t0_d7 #1788

Draft
ChenhanYu wants to merge 4 commits into main from pensieve-intern/OMNIML-5081/cell-t0-d7

Conversation

@ChenhanYu

Collaborator

Draft PR opened by pensieve-intern for OMNIML-5084.

Stage cell_t0_d7 of Epic OMNIML-5081. The agent ran from the SPEC on the ticket description; review every change before marking ready.

Always-draft is enforced — the bot never auto-merges.

jenchen13 added 4 commits May 27, 2026 15:41
…nfra fixes

modelopt/torch/quantization/plugins/transformer_engine.py:
  MODELOPT_TEGROUPED_PER_EXPERT_QUANTIZER=1 opts into per-gemm
  weight_quantizer_0..N-1 inside _QuantTEGroupedLinear (deepcopied from
  the shared weight_quantizer). Lets TEGroupedMLP recover per-expert
  amax granularity, matching SequentialMLP's default behavior.
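
A minimal sketch of the opt-in described above, not the actual `_QuantTEGroupedLinear` code. `ToyQuantizer` and `build_weight_quantizers` are hypothetical names introduced for illustration; only the environment variable name and the deepcopy-from-shared idea come from the commit message.

```python
import copy
import os


class ToyQuantizer:
    """Hypothetical stand-in for a weight quantizer that tracks a running amax."""

    def __init__(self):
        self.amax = None

    def observe(self, values):
        # Calibration step: remember the largest absolute value seen so far.
        peak = max(abs(v) for v in values)
        self.amax = peak if self.amax is None else max(self.amax, peak)


def build_weight_quantizers(shared, num_gemms, env=None):
    """Return one quantizer per gemm; independent copies only when opted in."""
    env = os.environ if env is None else env
    per_expert = env.get("MODELOPT_TEGROUPED_PER_EXPERT_QUANTIZER", "0") == "1"
    if not per_expert:
        # Default: every gemm reuses the same shared quantizer object.
        return [shared] * num_gemms
    # Opt-in: each gemm gets its own deep copy, so amax is tracked per expert.
    return [copy.deepcopy(shared) for _ in range(num_gemms)]


expert_weights = [[0.5, -1.0], [3.0, -0.25], [0.1, 0.2]]

shared_mode = build_weight_quantizers(ToyQuantizer(), 3, env={})
for quantizer, weights in zip(shared_mode, expert_weights):
    quantizer.observe(weights)

per_expert_mode = build_weight_quantizers(
    ToyQuantizer(), 3, env={"MODELOPT_TEGROUPED_PER_EXPERT_QUANTIZER": "1"}
)
for quantizer, weights in zip(per_expert_mode, expert_weights):
    quantizer.observe(weights)

shared_amax = [q.amax for q in shared_mode]
per_expert_amax = [q.amax for q in per_expert_mode]
```

With the flag unset, all three entries are the same object and end up with one amax; with the flag set, each copy keeps the amax of its own expert.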

modelopt/torch/distill/plugins/megatron.py:
  LogitsKLLoss.forward prints student/teacher logit stats (mean/std/
  min/max/shape) on rank 0 each call. Diagnostic for the QAD loss-spike
  investigation — confirms which spec produces which logits without
  changing the KL math.
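
The diagnostic described above could look roughly like the following. This is a pure-Python sketch using nested lists instead of tensors; `logit_stats` and `format_on_rank0` are hypothetical helpers, and the use of the sample standard deviation is an assumption. The point it illustrates is that the statistics are read-only and emitted on rank 0 only.

```python
import statistics


def logit_stats(rows):
    """Summarize a 2-D list of logits without modifying it."""
    flat = [value for row in rows for value in row]
    return {
        "mean": statistics.fmean(flat),
        "std": statistics.stdev(flat),  # sample std; an assumption of this sketch
        "min": min(flat),
        "max": max(flat),
        "shape": (len(rows), len(rows[0])),
    }


def format_on_rank0(rank, name, rows):
    """Build the log line on rank 0 only; other ranks return None."""
    if rank != 0:
        return None
    s = logit_stats(rows)
    return (
        f"[{name}] mean={s['mean']:.4f} std={s['std']:.4f} "
        f"min={s['min']:.4f} max={s['max']:.4f} shape={s['shape']}"
    )


student_logits = [[1.0, 2.0], [3.0, 4.0]]
stats = logit_stats(student_logits)
rank0_line = format_on_rank0(0, "student", student_logits)
rank1_line = format_on_rank0(1, "student", student_logits)
```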

tests/gpu_megatron/torch/quantization/plugins/test_megatron.py:
  New test_te_grouped_vs_sequential_default_amax + ..._default_loss
  cover the structural amax asymmetry between TEGroupedMLP and
  SequentialMLP (TEGrouped per-linear amax = max-over-Sequential-experts
  amax) and a finiteness sanity check on the resulting quant error.
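
To make the asymmetry concrete, here is a small numeric sketch. It is not the test from this PR; `fake_quant` is a hypothetical symmetric int8 fake-quantizer written for illustration, and the expert weights are made-up values. It shows the grouped amax equaling the max over per-expert amax, followed by a finiteness check on the quantization error.

```python
import math


def fake_quant(values, amax, bits=8):
    """Symmetric fake quantization: scale, round, clamp, and rescale."""
    qmax = 2 ** (bits - 1) - 1
    scale = amax / qmax
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def mse(reference, approx):
    return sum((r - a) ** 2 for r, a in zip(reference, approx)) / len(reference)


# Made-up weights for three experts.
experts = [[0.5, -1.0, 0.25], [3.0, -0.25, 1.5], [0.1, 0.2, -0.05]]

# Sequential-style: one amax per expert.
sequential_amax = [max(abs(v) for v in w) for w in experts]
# Grouped-style with a shared quantizer: one amax covering all experts.
grouped_amax = max(sequential_amax)

err_sequential = [mse(w, fake_quant(w, a)) for w, a in zip(experts, sequential_amax)]
err_grouped = [mse(w, fake_quant(w, grouped_amax)) for w in experts]

all_finite = all(math.isfinite(e) for e in err_sequential + err_grouped)
```

The expert whose own amax equals the grouped amax sees the same error in both modes; the other experts are quantized on a coarser grid under the shared amax.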

tools/launcher/common/service_utils.sh:
  - Fall back to SLURM_PROCID / SLURM_LOCALID when PMIX_*/OMPI_* are
    unset, so `[[ "$mpi_local_rank" -eq 0 ]]` doesn't silently pass on
    every rank under plain srun.
  - util_install_extra_dep: per-node marker so concurrent ranks wait
    for rank 0 to finish installing (concurrent pip on a shared FS
    leaves a broken state); also installs nvidia-resiliency-ext.
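
For context on the first bullet: in bash, an empty string in an arithmetic comparison evaluates to 0, so `[[ "" -eq 0 ]]` is true and every rank would take the "local rank 0" branch. The Python sketch below mirrors both fixes under stated assumptions. It is not the shell code in `service_utils.sh`; the specific `PMIX_RANK` / `OMPI_COMM_WORLD_*` variable names, the marker filename, and both function names are choices made for this sketch.

```python
import os
import tempfile
import time
from pathlib import Path


def first_set(env, names):
    """Return the first non-empty variable among names as an int, else None."""
    for name in names:
        value = env.get(name)
        if value not in (None, ""):
            return int(value)
    return None


def resolve_ranks(env=None):
    """Prefer MPI launcher variables, then fall back to the SLURM ones."""
    env = os.environ if env is None else env
    rank = first_set(env, ["PMIX_RANK", "OMPI_COMM_WORLD_RANK", "SLURM_PROCID"])
    local_rank = first_set(env, ["OMPI_COMM_WORLD_LOCAL_RANK", "SLURM_LOCALID"])
    if rank is None or local_rank is None:
        # Fail loudly instead of letting an empty value behave like rank 0.
        raise RuntimeError("could not determine rank from the environment")
    return rank, local_rank


def install_once_per_node(marker, local_rank, install, timeout_s=600.0, poll_s=0.05):
    """Local rank 0 installs and writes the marker; other ranks wait for it."""
    if local_rank == 0:
        install()
        marker.touch()
        return True
    deadline = time.monotonic() + timeout_s
    while not marker.exists():
        if time.monotonic() > deadline:
            raise TimeoutError(f"marker {marker} did not appear")
        time.sleep(poll_s)
    return False


# Sequential demo; real ranks would run these calls concurrently.
with tempfile.TemporaryDirectory() as node_dir:
    marker = Path(node_dir) / ".extra_deps_installed"  # hypothetical filename
    calls = []
    installed_by_rank0 = install_once_per_node(marker, 0, lambda: calls.append("pip"))
    installed_by_rank1 = install_once_per_node(marker, 1, lambda: calls.append("pip"))
```

The marker should live on node-local storage if the install target is node-local, which is what "per-node" suggests in the commit message.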

Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>
- transformer_engine.py: dedup `import copy`/`import os` left over from the
  rebase, sort the four imports alphabetically.
- transformer_engine.py: comment near the per-expert weight_quantizer setup
  explaining that base modelopt_post_restore won't re-calibrate the
  weight_quantizer_{i} modules, so save/restore is only safe when TP/EP is
  unchanged. Per-channel _amax shape depends on the TP-sliced output dim.
- service_utils.sh: drop the duplicated mpi_rank / mpi_local_rank
  re-assignments — main already carries the SLURM fallback, the extra two
  lines were leftover rebase noise.
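
The save/restore caveat in the second bullet can be illustrated with a shape check. This is a sketch under assumptions: it models the per-channel `_amax` as one entry per output channel of a weight whose output dimension is split evenly across TP ranks, and both function names are hypothetical.

```python
def per_channel_amax_shape(out_features, tp_size):
    """Assumed shape: one amax per output channel of the TP-local weight slice."""
    if out_features % tp_size != 0:
        raise ValueError("out_features must be divisible by tp_size")
    return (out_features // tp_size, 1)


def can_restore(saved_shape, out_features, tp_size):
    """A saved _amax only fits if the current TP slicing yields the same shape."""
    return saved_shape == per_channel_amax_shape(out_features, tp_size)


saved = per_channel_amax_shape(4096, tp_size=2)
restore_same_tp = can_restore(saved, 4096, tp_size=2)
restore_new_tp = can_restore(saved, 4096, tp_size=4)
```

Because the per-expert quantizers are not re-calibrated on restore, a mismatch like the second case cannot be repaired automatically, which is why the comment limits safe restore to an unchanged TP/EP layout.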

Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>
@copy-pr-bot

copy-pr-bot Bot commented Jun 22, 2026


Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.


@coderabbitai

coderabbitai Bot commented Jun 22, 2026

Contributor

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: ef1e43cc-f8bc-44fa-aca7-d6d07132087f



2 participants