Skip to content

Fix ModelOpt MCP Slurm launcher submit#1799

Open
ChenhanYu wants to merge 1 commit into
mainfrom
chenhany/fix-modelopt-mcp-slurm-submit
Open

Fix ModelOpt MCP Slurm launcher submit#1799
ChenhanYu wants to merge 1 commit into
mainfrom
chenhany/fix-modelopt-mcp-slurm-submit

Conversation

@ChenhanYu

@ChenhanYu ChenhanYu commented Jun 23, 2026

Copy link
Copy Markdown
Collaborator

Summary

  • fix launcher Slurm task annotation patching so nemo-run CLI resolves slurm_factory correctly for task slots
  • harden modelopt-mcp submit parsing/status resolution and add regression coverage for launcher false-positive success cases
  • add a minimal nvidia-smi smoke YAML/script and fix launcher packaging so source-backed Slurm jobs package required files recursively

Validation

  • uv run pytest tests/test_core.py -q
  • uv run pytest tests/test_bridge.py -q
  • dry-run and live-submit validated through the patched local MCP server on cw_dfw
  • interactive smoke job succeeded end-to-end (nvidia-smi ran successfully in-container)

Summary by CodeRabbit

  • New Features

    • Added an NVIDIA SMI GPU smoke test (script and minimal Slurm YAML example) for launcher integration.
  • Bug Fixes

    • Improved detection of launcher-reported fatal errors, including when the launcher exits successfully.
    • Enhanced Slurm submission parsing for experiment/job identifiers and added clearer “unparsed” failure handling; updated dry-run validation accordingly.
    • Expanded experiment directory discovery to support nested layouts.
    • Fixed Slurm config wiring so CLI parsing resolves the correct SlurmConfig type across all task variants.
  • Tests

    • Expanded unit tests for Slurm parsing/validation, dry-run error handling, and nested filesystem job status.

@ChenhanYu ChenhanYu requested a review from a team as a code owner June 23, 2026 00:35
@coderabbitai

coderabbitai Bot commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

📝 Walkthrough

Walkthrough

Adds a GPU smoke test script and example YAML for Slurm/CUDA validation. Expands set_slurm_config_type() to patch all SandboxTask0SandboxTask4 slots. Makes launcher packaging paths filesystem-conditional via new helper functions. Hardens MCP bridge Slurm submission parsing with fatal-error detection, multi-pattern experiment/job-id extraction, launch_result_unparsed fallback, and nested experiment directory resolution.

Changes

Launcher and MCP Bridge

Layer / File(s) Summary
GPU smoke test script and example YAML
tools/launcher/common/smoke/nvidia_smi.sh, tools/launcher/examples/smoke/nvidia_smi.yaml
Adds a bash smoke script that runs hostname and nvidia-smi with strict execution and start/end markers, and a YAML configuring a 1-node/1-GPU Slurm job running that script inside a CUDA 12.4 container.
SandboxTask slurm_config annotation patching for all task slots
tools/launcher/core.py, tools/launcher/tests/test_core.py
set_slurm_config_type() now iterates over SandboxTask and SandboxTask0SandboxTask4, patching __dataclass_fields__, __annotations__, and __init__ for each. Tests are extended to assert patching for SandboxTask2SandboxTask4.
Dynamic launcher package path construction
tools/launcher/launch.py
Imports glob. _include_pattern and _relative_path are initialized empty and populated by new _add_package_path() and _add_package_glob() helpers that only append paths present on disk, replacing previously hard-coded glob patterns. Base examples and common directories are always added; Megatron-LM and Model-Optimizer paths are added only when modelopt source is detected.
MCP bridge: fatal error detection, output parsing, and experiment dir resolution
tools/mcp/modelopt_mcp/bridge.py
Introduces _LAUNCHER_ERROR_RE and _launcher_reported_error() to detect fatal launcher output on exit code 0. Extends submit_job_impl and _submit_job_dry_run to fail on detected fatal errors. Reworks experiment-id extraction to try multiple regex patterns in sequence; adds "Job id: ..." job-id variant; returns launch_result_unparsed when neither id is found. _resolve_experiment_dir searches direct and nested paths via glob(f"*/{experiment_id}") across multiple candidate roots.
MCP bridge tests for new parsing and validation paths
tools/mcp/tests/test_bridge.py
Adds five tests covering: zero-exit with no ids, Nemo "Experiment Status" id extraction, fatal stderr text as failure, dry-run fatal text validation, and nested experiments/<title>/<id> filesystem layout resolution.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • NVIDIA/Model-Optimizer#1766: Both PRs modify tools/launcher/launch.py's PatternPackager configuration to conditionally include ModelOpt-related paths based on local source checkout presence.

Suggested reviewers

  • kevalmorabia97
  • jenchen13
  • hychiang-git

Caution

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

  • Ignore

❌ Failed checks (1 error, 1 warning)

Check name Status Explanation Resolution
Security Anti-Patterns ❌ Error # nosec comments used to bypass Bandit checks without proper inline justification or codeowner approval, violating SECURITY.md policy requiring both. Remove or replace # nosec comments with proper inline justifications explaining why the pattern is safe, or obtain explicit approval from @NVIDIA/modelopt-setup-codeowners with justification in PR description.
Docstring Coverage ⚠️ Warning Docstring coverage is 73.91% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title 'Fix ModelOpt MCP Slurm launcher submit' is directly related to the main objectives, which focus on fixing Slurm launcher submit parsing, validation, and related components in the ModelOpt MCP system.
✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch chenhany/fix-modelopt-mcp-slurm-submit

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Warning

CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.

Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.

👉 Steps to fix this

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tools/mcp/modelopt_mcp/bridge.py (1)

712-733: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Do not return Slurm success without an experiment_id.

Line 712 only fails when both IDs are absent, so stdout containing only Submitted batch job 123 returns ok=True with experiment_id=None. The MCP submit_job contract tells clients to poll job_status(experiment_id), so this becomes an unmonitorable success response. Treat missing experiment_id as a distinct partial/unparsed failure, or add a client-facing status path keyed by slurm_job_id.

🐛 Possible shape-preserving fix
-    if not experiment_id and not slurm_job_id:
+    if not experiment_id:
         return {
             "ok": False,
             "executor": "slurm",
             "reason": "launch_result_unparsed",
             "exit_code": 0,
             "stdout_tail": stdout_tail,
             "stderr_tail": stderr_tail,
+            "slurm_job_id": slurm_job_id,
             "diagnostic": (
-                "launch.py exited 0 but did not report an experiment_id "
-                "or Slurm job id. Treating this as failed so callers do "
-                "not assume work was submitted."
+                "launch.py exited 0 but did not report an experiment_id. "
+                "Treating this as failed so callers do not receive a "
+                "Slurm success response that cannot be monitored via "
+                "job_status."
             ),
             "argv": argv,
         }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/mcp/modelopt_mcp/bridge.py` around lines 712 - 733, The condition at
line 712 that checks for missing IDs only fails when both experiment_id and
slurm_job_id are absent, but this allows a response with ok=True and
experiment_id=None to be returned when only slurm_job_id is present. Since the
MCP contract requires experiment_id for job status polling, modify the condition
to treat a missing experiment_id as a distinct failure case. Change the
condition from checking "not experiment_id and not slurm_job_id" to also fail
when experiment_id is missing (regardless of whether slurm_job_id exists), and
update the failure response diagnostic message to clearly indicate that
experiment_id was not found or parsed from the launch output.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tools/mcp/modelopt_mcp/bridge.py`:
- Around line 905-910: The roots list in the launcher experiment discovery logic
is missing the launcher's own experiments directory that is documented in the
docstring. Add back the root directory for the launcher's experiments folder to
the roots list. This should be included alongside the existing roots for
NEMORUN_HOME/experiments, ./experiments, and ./local_experiments to ensure jobs
created under the launcher's working directory can be properly resolved.
- Around line 905-917: The experiment_id parameter is used unsafely in path
operations without validation, allowing potential path traversal attacks (via
sequences like ..) and glob pattern injection. Before using experiment_id in the
path joins at line 912 (root / experiment_id) and the glob operation at line 915
(root.glob(f"*/{experiment_id}")), add validation to ensure experiment_id is a
single safe path token containing only alphanumeric characters, hyphens, and
underscores. Reject any values containing path separators, dots for traversal,
or glob metacharacters. Perform this validation at the entry point where
experiment_id is received from the MCP API interface (in job_status and job_logs
functions) rather than within this utility function.

---

Outside diff comments:
In `@tools/mcp/modelopt_mcp/bridge.py`:
- Around line 712-733: The condition at line 712 that checks for missing IDs
only fails when both experiment_id and slurm_job_id are absent, but this allows
a response with ok=True and experiment_id=None to be returned when only
slurm_job_id is present. Since the MCP contract requires experiment_id for job
status polling, modify the condition to treat a missing experiment_id as a
distinct failure case. Change the condition from checking "not experiment_id and
not slurm_job_id" to also fail when experiment_id is missing (regardless of
whether slurm_job_id exists), and update the failure response diagnostic message
to clearly indicate that experiment_id was not found or parsed from the launch
output.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 8e8aa525-4735-4ed1-b0bc-8d5aebf3abf8

📥 Commits

Reviewing files that changed from the base of the PR and between 090b1c5 and 69043a6.

📒 Files selected for processing (7)
  • tools/launcher/common/smoke/nvidia_smi.sh
  • tools/launcher/core.py
  • tools/launcher/examples/smoke/nvidia_smi.yaml
  • tools/launcher/launch.py
  • tools/launcher/tests/test_core.py
  • tools/mcp/modelopt_mcp/bridge.py
  • tools/mcp/tests/test_bridge.py

Comment thread tools/mcp/modelopt_mcp/bridge.py
Comment on lines +905 to +917
roots = []
nemorun_home = os.environ.get("NEMORUN_HOME")
if nemorun_home:
candidates.append(Path(nemorun_home) / "experiments" / experiment_id)
candidates.append(Path.cwd() / "experiments" / experiment_id)
candidates.append(Path.cwd() / "local_experiments" / experiment_id)
for c in candidates:
if c.exists():
return c
roots.append(Path(nemorun_home) / "experiments")
roots.append(Path.cwd() / "experiments")
roots.append(Path.cwd() / "local_experiments")
for root in roots:
direct = root / experiment_id
if direct.exists():
return direct
for nested in root.glob(f"*/{experiment_id}"):
if nested.exists():
return nested

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔒 Security & Privacy | 🟠 Major | ⚡ Quick win

Validate experiment_id before path joins and globbing.

Line 912 accepts path traversal like ../other-dir, and Line 915 treats glob metacharacters in a user-supplied experiment_id as patterns. Since job_status/job_logs expose this value through the MCP API, validate it as a single path-safe token before resolving directories. As per coding guidelines, "Validate external input once at the interface boundary."

🛡️ Proposed guard
+_EXPERIMENT_ID_RE = re.compile(r"^[A-Za-z0-9][A-Za-z0-9_.-]*$")
+
+
 def _resolve_experiment_dir(experiment_id: str) -> Path | None:
     """Map an experiment_id to its on-disk directory.
@@
     """
+    if not _EXPERIMENT_ID_RE.fullmatch(experiment_id):
+        return None
+
     roots = []
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/mcp/modelopt_mcp/bridge.py` around lines 905 - 917, The experiment_id
parameter is used unsafely in path operations without validation, allowing
potential path traversal attacks (via sequences like ..) and glob pattern
injection. Before using experiment_id in the path joins at line 912 (root /
experiment_id) and the glob operation at line 915
(root.glob(f"*/{experiment_id}")), add validation to ensure experiment_id is a
single safe path token containing only alphanumeric characters, hyphens, and
underscores. Reject any values containing path separators, dots for traversal,
or glob metacharacters. Perform this validation at the entry point where
experiment_id is received from the MCP API interface (in job_status and job_logs
functions) rather than within this utility function.

Source: Coding guidelines

@codecov

codecov Bot commented Jun 23, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 57.69%. Comparing base (090b1c5) to head (350fdab).
⚠️ Report is 1 commits behind head on main.

❗ There is a different number of reports uploaded between BASE (090b1c5) and HEAD (350fdab). Click for more details.

HEAD has 1 upload less than BASE
Flag BASE (090b1c5) HEAD (350fdab)
unit 2 1
Additional details and impacted files
@@             Coverage Diff             @@
##             main    #1799       +/-   ##
===========================================
- Coverage   72.47%   57.69%   -14.78%     
===========================================
  Files         511      510        -1     
  Lines       56513    56445       -68     
===========================================
- Hits        40956    32565     -8391     
- Misses      15557    23880     +8323     
Flag Coverage Δ
regression 14.73% <ø> (+0.06%) ⬆️
unit 54.53% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@ChenhanYu ChenhanYu force-pushed the chenhany/fix-modelopt-mcp-slurm-submit branch from 69043a6 to 98b4113 Compare June 23, 2026 01:17
Signed-off-by: Chenhan Yu <chenhany@nvidia.com>
@ChenhanYu ChenhanYu force-pushed the chenhany/fix-modelopt-mcp-slurm-submit branch from 98b4113 to 350fdab Compare June 23, 2026 01:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant