Fix ModelOpt MCP Slurm launcher submit by ChenhanYu · Pull Request #1799 · NVIDIA/Model-Optimizer

ChenhanYu · 2026-06-23T00:35:49Z

Summary

fix launcher Slurm task annotation patching so nemo-run CLI resolves slurm_factory correctly for task slots
harden modelopt-mcp submit parsing/status resolution and add regression coverage for launcher false-positive success cases
add a minimal nvidia-smi smoke YAML/script and fix launcher packaging so source-backed Slurm jobs package required files recursively

Validation

uv run pytest tests/test_core.py -q
uv run pytest tests/test_bridge.py -q
dry-run and live-submit validated through the patched local MCP server on cw_dfw
interactive smoke job succeeded end-to-end (nvidia-smi ran successfully in-container)

Summary by CodeRabbit

New Features
- Added an NVIDIA SMI GPU smoke test (script and minimal Slurm YAML example) for launcher integration.
Bug Fixes
- Improved detection of launcher-reported fatal errors, including when the launcher exits successfully.
- Enhanced Slurm submission parsing for experiment/job identifiers and added clearer “unparsed” failure handling; updated dry-run validation accordingly.
- Expanded experiment directory discovery to support nested layouts.
- Fixed Slurm config wiring so CLI parsing resolves the correct SlurmConfig type across all task variants.
Tests
- Expanded unit tests for Slurm parsing/validation, dry-run error handling, and nested filesystem job status.

coderabbitai · 2026-06-23T00:36:03Z

📝 Walkthrough

Walkthrough

Adds a GPU smoke test script and example YAML for Slurm/CUDA validation. Expands set_slurm_config_type() to patch all SandboxTask0–SandboxTask4 slots. Makes launcher packaging paths filesystem-conditional via new helper functions. Hardens MCP bridge Slurm submission parsing with fatal-error detection, multi-pattern experiment/job-id extraction, launch_result_unparsed fallback, and nested experiment directory resolution.

Changes

Launcher and MCP Bridge

Layer / File(s)	Summary
GPU smoke test script and example YAML `tools/launcher/common/smoke/nvidia_smi.sh`, `tools/launcher/examples/smoke/nvidia_smi.yaml`	Adds a bash smoke script that runs `hostname` and `nvidia-smi` with strict execution and start/end markers, and a YAML configuring a 1-node/1-GPU Slurm job running that script inside a CUDA 12.4 container.
SandboxTask slurm_config annotation patching for all task slots `tools/launcher/core.py`, `tools/launcher/tests/test_core.py`	`set_slurm_config_type()` now iterates over `SandboxTask` and `SandboxTask0`–`SandboxTask4`, patching `__dataclass_fields__`, `__annotations__`, and `__init__` for each. Tests are extended to assert patching for `SandboxTask2`–`SandboxTask4`.
Dynamic launcher package path construction `tools/launcher/launch.py`	Imports `glob`. `_include_pattern` and `_relative_path` are initialized empty and populated by new `_add_package_path()` and `_add_package_glob()` helpers that only append paths present on disk, replacing previously hard-coded glob patterns. Base `examples` and `common` directories are always added; Megatron-LM and Model-Optimizer paths are added only when modelopt source is detected.
MCP bridge: fatal error detection, output parsing, and experiment dir resolution `tools/mcp/modelopt_mcp/bridge.py`	Introduces `_LAUNCHER_ERROR_RE` and `_launcher_reported_error()` to detect fatal launcher output on exit code 0. Extends `submit_job_impl` and `_submit_job_dry_run` to fail on detected fatal errors. Reworks experiment-id extraction to try multiple regex patterns in sequence; adds `"Job id: ..."` job-id variant; returns `launch_result_unparsed` when neither id is found. `_resolve_experiment_dir` searches direct and nested paths via `glob(f"*/{experiment_id}")` across multiple candidate roots.
MCP bridge tests for new parsing and validation paths `tools/mcp/tests/test_bridge.py`	Adds five tests covering: zero-exit with no ids, Nemo "Experiment Status" id extraction, fatal stderr text as failure, dry-run fatal text validation, and nested `experiments/<title>/<id>` filesystem layout resolution.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

NVIDIA/Model-Optimizer#1766: Both PRs modify tools/launcher/launch.py's PatternPackager configuration to conditionally include ModelOpt-related paths based on local source checkout presence.

Suggested reviewers

kevalmorabia97
jenchen13
hychiang-git

Caution

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

Ignore

❌ Failed checks (1 error, 1 warning)

Check name	Status	Explanation	Resolution
Security Anti-Patterns	❌ Error	# nosec comments used to bypass Bandit checks without proper inline justification or codeowner approval, violating SECURITY.md policy requiring both.	Remove or replace # nosec comments with proper inline justifications explaining why the pattern is safe, or obtain explicit approval from `@NVIDIA/modelopt-setup-codeowners` with justification in PR description.
Docstring Coverage	⚠️ Warning	Docstring coverage is 73.91% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The PR title 'Fix ModelOpt MCP Slurm launcher submit' is directly related to the main objectives, which focus on fixing Slurm launcher submit parsing, validation, and related components in the ModelOpt MCP system.

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch chenhany/fix-modelopt-mcp-slurm-submit

_{Comment @coderabbitai help to get the list of available commands.}

coderabbitai

Warning

CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.

Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.

👉 Steps to fix this

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

tools/mcp/modelopt_mcp/bridge.py (1)

712-733: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Do not return Slurm success without an experiment_id.

Line 712 only fails when both IDs are absent, so stdout containing only Submitted batch job 123 returns ok=True with experiment_id=None. The MCP submit_job contract tells clients to poll job_status(experiment_id), so this becomes an unmonitorable success response. Treat missing experiment_id as a distinct partial/unparsed failure, or add a client-facing status path keyed by slurm_job_id.

🐛 Possible shape-preserving fix

-    if not experiment_id and not slurm_job_id:
+    if not experiment_id:
         return {
             "ok": False,
             "executor": "slurm",
             "reason": "launch_result_unparsed",
             "exit_code": 0,
             "stdout_tail": stdout_tail,
             "stderr_tail": stderr_tail,
+            "slurm_job_id": slurm_job_id,
             "diagnostic": (
-                "launch.py exited 0 but did not report an experiment_id "
-                "or Slurm job id. Treating this as failed so callers do "
-                "not assume work was submitted."
+                "launch.py exited 0 but did not report an experiment_id. "
+                "Treating this as failed so callers do not receive a "
+                "Slurm success response that cannot be monitored via "
+                "job_status."
             ),
             "argv": argv,
         }

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/mcp/modelopt_mcp/bridge.py` around lines 712 - 733, The condition at
line 712 that checks for missing IDs only fails when both experiment_id and
slurm_job_id are absent, but this allows a response with ok=True and
experiment_id=None to be returned when only slurm_job_id is present. Since the
MCP contract requires experiment_id for job status polling, modify the condition
to treat a missing experiment_id as a distinct failure case. Change the
condition from checking "not experiment_id and not slurm_job_id" to also fail
when experiment_id is missing (regardless of whether slurm_job_id exists), and
update the failure response diagnostic message to clearly indicate that
experiment_id was not found or parsed from the launch output.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tools/mcp/modelopt_mcp/bridge.py`:
- Around line 905-910: The roots list in the launcher experiment discovery logic
is missing the launcher's own experiments directory that is documented in the
docstring. Add back the root directory for the launcher's experiments folder to
the roots list. This should be included alongside the existing roots for
NEMORUN_HOME/experiments, ./experiments, and ./local_experiments to ensure jobs
created under the launcher's working directory can be properly resolved.
- Around line 905-917: The experiment_id parameter is used unsafely in path
operations without validation, allowing potential path traversal attacks (via
sequences like ..) and glob pattern injection. Before using experiment_id in the
path joins at line 912 (root / experiment_id) and the glob operation at line 915
(root.glob(f"*/{experiment_id}")), add validation to ensure experiment_id is a
single safe path token containing only alphanumeric characters, hyphens, and
underscores. Reject any values containing path separators, dots for traversal,
or glob metacharacters. Perform this validation at the entry point where
experiment_id is received from the MCP API interface (in job_status and job_logs
functions) rather than within this utility function.

---

Outside diff comments:
In `@tools/mcp/modelopt_mcp/bridge.py`:
- Around line 712-733: The condition at line 712 that checks for missing IDs
only fails when both experiment_id and slurm_job_id are absent, but this allows
a response with ok=True and experiment_id=None to be returned when only
slurm_job_id is present. Since the MCP contract requires experiment_id for job
status polling, modify the condition to treat a missing experiment_id as a
distinct failure case. Change the condition from checking "not experiment_id and
not slurm_job_id" to also fail when experiment_id is missing (regardless of
whether slurm_job_id exists), and update the failure response diagnostic message
to clearly indicate that experiment_id was not found or parsed from the launch
output.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 8e8aa525-4735-4ed1-b0bc-8d5aebf3abf8

📥 Commits

Reviewing files that changed from the base of the PR and between 090b1c5 and 69043a6.

📒 Files selected for processing (7)

tools/launcher/common/smoke/nvidia_smi.sh
tools/launcher/core.py
tools/launcher/examples/smoke/nvidia_smi.yaml
tools/launcher/launch.py
tools/launcher/tests/test_core.py
tools/mcp/modelopt_mcp/bridge.py
tools/mcp/tests/test_bridge.py

coderabbitai · 2026-06-23T00:45:34Z

+    roots = []
    nemorun_home = os.environ.get("NEMORUN_HOME")
    if nemorun_home:
-        candidates.append(Path(nemorun_home) / "experiments" / experiment_id)
-    candidates.append(Path.cwd() / "experiments" / experiment_id)
-    candidates.append(Path.cwd() / "local_experiments" / experiment_id)
-    for c in candidates:
-        if c.exists():
-            return c
+        roots.append(Path(nemorun_home) / "experiments")
+    roots.append(Path.cwd() / "experiments")
+    roots.append(Path.cwd() / "local_experiments")
+    for root in roots:
+        direct = root / experiment_id
+        if direct.exists():
+            return direct
+        for nested in root.glob(f"*/{experiment_id}"):
+            if nested.exists():
+                return nested


🔒 Security & Privacy | 🟠 Major | ⚡ Quick win

Validate experiment_id before path joins and globbing.

Line 912 accepts path traversal like ../other-dir, and Line 915 treats glob metacharacters in a user-supplied experiment_id as patterns. Since job_status/job_logs expose this value through the MCP API, validate it as a single path-safe token before resolving directories. As per coding guidelines, "Validate external input once at the interface boundary."

🛡️ Proposed guard

+_EXPERIMENT_ID_RE = re.compile(r"^[A-Za-z0-9][A-Za-z0-9_.-]*$") + + def _resolve_experiment_dir(experiment_id: str) -> Path | None: """Map an experiment_id to its on-disk directory. @@ """ + if not _EXPERIMENT_ID_RE.fullmatch(experiment_id): + return None + roots = []

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tools/mcp/modelopt_mcp/bridge.py` around lines 905 - 917, The experiment_id parameter is used unsafely in path operations without validation, allowing potential path traversal attacks (via sequences like ..) and glob pattern injection. Before using experiment_id in the path joins at line 912 (root / experiment_id) and the glob operation at line 915 (root.glob(f"*/{experiment_id}")), add validation to ensure experiment_id is a single safe path token containing only alphanumeric characters, hyphens, and underscores. Reject any values containing path separators, dots for traversal, or glob metacharacters. Perform this validation at the entry point where experiment_id is received from the MCP API interface (in job_status and job_logs functions) rather than within this utility function.

Source: Coding guidelines

codecov · 2026-06-23T00:55:16Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 57.69%. Comparing base (090b1c5) to head (350fdab).
⚠️ Report is 1 commits behind head on main.

❗ There is a different number of reports uploaded between BASE (090b1c5) and HEAD (350fdab). Click for more details.

HEAD has 1 upload less than BASE

Flag BASE (090b1c5) HEAD (350fdab)

unit 2 1

Additional details and impacted files

@@             Coverage Diff             @@
##             main    #1799       +/-   ##
===========================================
- Coverage   72.47%   57.69%   -14.78%     
===========================================
  Files         511      510        -1     
  Lines       56513    56445       -68     
===========================================
- Hits        40956    32565     -8391     
- Misses      15557    23880     +8323

Flag	Coverage Δ
regression	`14.73% <ø> (+0.06%)`	⬆️
unit	`54.53% <ø> (ø)`

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Signed-off-by: Chenhan Yu <chenhany@nvidia.com>

ChenhanYu requested a review from a team as a code owner June 23, 2026 00:35

coderabbitai Bot reviewed Jun 23, 2026

View reviewed changes

ChenhanYu force-pushed the chenhany/fix-modelopt-mcp-slurm-submit branch from 69043a6 to 98b4113 Compare June 23, 2026 01:17

Fix ModelOpt MCP Slurm launcher submit

350fdab

Signed-off-by: Chenhan Yu <chenhany@nvidia.com>

ChenhanYu force-pushed the chenhany/fix-modelopt-mcp-slurm-submit branch from 98b4113 to 350fdab Compare June 23, 2026 01:19

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix ModelOpt MCP Slurm launcher submit#1799

Fix ModelOpt MCP Slurm launcher submit#1799
ChenhanYu wants to merge 1 commit into
mainfrom
chenhany/fix-modelopt-mcp-slurm-submit

ChenhanYu commented Jun 23, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Jun 23, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested reviewers

Pre-merge checks failed

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

coderabbitai Bot Jun 23, 2026

Uh oh!

codecov Bot commented Jun 23, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

ChenhanYu commented Jun 23, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Validation

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Jun 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested reviewers

Pre-merge checks failed

❌ Failed checks (1 error, 1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot Jun 23, 2026

Choose a reason for hiding this comment

Uh oh!

codecov Bot commented Jun 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

ChenhanYu commented Jun 23, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 23, 2026 •

edited

Loading

codecov Bot commented Jun 23, 2026 •

edited

Loading