Fix ModelOpt MCP Slurm launcher submit#1799
Conversation
📝 WalkthroughWalkthroughAdds a GPU smoke test script and example YAML for Slurm/CUDA validation. Expands ChangesLauncher and MCP Bridge
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Suggested reviewers
Caution Pre-merge checks failedPlease resolve all errors before merging. Addressing warnings is optional.
❌ Failed checks (1 error, 1 warning)
✅ Passed checks (4 passed)
✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Warning
CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.
Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tools/mcp/modelopt_mcp/bridge.py (1)
712-733: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick winDo not return Slurm success without an
experiment_id.Line 712 only fails when both IDs are absent, so stdout containing only
Submitted batch job 123returnsok=Truewithexperiment_id=None. The MCPsubmit_jobcontract tells clients to polljob_status(experiment_id), so this becomes an unmonitorable success response. Treat missingexperiment_idas a distinct partial/unparsed failure, or add a client-facing status path keyed byslurm_job_id.🐛 Possible shape-preserving fix
- if not experiment_id and not slurm_job_id: + if not experiment_id: return { "ok": False, "executor": "slurm", "reason": "launch_result_unparsed", "exit_code": 0, "stdout_tail": stdout_tail, "stderr_tail": stderr_tail, + "slurm_job_id": slurm_job_id, "diagnostic": ( - "launch.py exited 0 but did not report an experiment_id " - "or Slurm job id. Treating this as failed so callers do " - "not assume work was submitted." + "launch.py exited 0 but did not report an experiment_id. " + "Treating this as failed so callers do not receive a " + "Slurm success response that cannot be monitored via " + "job_status." ), "argv": argv, }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tools/mcp/modelopt_mcp/bridge.py` around lines 712 - 733, The condition at line 712 that checks for missing IDs only fails when both experiment_id and slurm_job_id are absent, but this allows a response with ok=True and experiment_id=None to be returned when only slurm_job_id is present. Since the MCP contract requires experiment_id for job status polling, modify the condition to treat a missing experiment_id as a distinct failure case. Change the condition from checking "not experiment_id and not slurm_job_id" to also fail when experiment_id is missing (regardless of whether slurm_job_id exists), and update the failure response diagnostic message to clearly indicate that experiment_id was not found or parsed from the launch output.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tools/mcp/modelopt_mcp/bridge.py`:
- Around line 905-910: The roots list in the launcher experiment discovery logic
is missing the launcher's own experiments directory that is documented in the
docstring. Add back the root directory for the launcher's experiments folder to
the roots list. This should be included alongside the existing roots for
NEMORUN_HOME/experiments, ./experiments, and ./local_experiments to ensure jobs
created under the launcher's working directory can be properly resolved.
- Around line 905-917: The experiment_id parameter is used unsafely in path
operations without validation, allowing potential path traversal attacks (via
sequences like ..) and glob pattern injection. Before using experiment_id in the
path joins at line 912 (root / experiment_id) and the glob operation at line 915
(root.glob(f"*/{experiment_id}")), add validation to ensure experiment_id is a
single safe path token containing only alphanumeric characters, hyphens, and
underscores. Reject any values containing path separators, dots for traversal,
or glob metacharacters. Perform this validation at the entry point where
experiment_id is received from the MCP API interface (in job_status and job_logs
functions) rather than within this utility function.
---
Outside diff comments:
In `@tools/mcp/modelopt_mcp/bridge.py`:
- Around line 712-733: The condition at line 712 that checks for missing IDs
only fails when both experiment_id and slurm_job_id are absent, but this allows
a response with ok=True and experiment_id=None to be returned when only
slurm_job_id is present. Since the MCP contract requires experiment_id for job
status polling, modify the condition to treat a missing experiment_id as a
distinct failure case. Change the condition from checking "not experiment_id and
not slurm_job_id" to also fail when experiment_id is missing (regardless of
whether slurm_job_id exists), and update the failure response diagnostic message
to clearly indicate that experiment_id was not found or parsed from the launch
output.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 8e8aa525-4735-4ed1-b0bc-8d5aebf3abf8
📒 Files selected for processing (7)
tools/launcher/common/smoke/nvidia_smi.shtools/launcher/core.pytools/launcher/examples/smoke/nvidia_smi.yamltools/launcher/launch.pytools/launcher/tests/test_core.pytools/mcp/modelopt_mcp/bridge.pytools/mcp/tests/test_bridge.py
| roots = [] | ||
| nemorun_home = os.environ.get("NEMORUN_HOME") | ||
| if nemorun_home: | ||
| candidates.append(Path(nemorun_home) / "experiments" / experiment_id) | ||
| candidates.append(Path.cwd() / "experiments" / experiment_id) | ||
| candidates.append(Path.cwd() / "local_experiments" / experiment_id) | ||
| for c in candidates: | ||
| if c.exists(): | ||
| return c | ||
| roots.append(Path(nemorun_home) / "experiments") | ||
| roots.append(Path.cwd() / "experiments") | ||
| roots.append(Path.cwd() / "local_experiments") | ||
| for root in roots: | ||
| direct = root / experiment_id | ||
| if direct.exists(): | ||
| return direct | ||
| for nested in root.glob(f"*/{experiment_id}"): | ||
| if nested.exists(): | ||
| return nested |
There was a problem hiding this comment.
🔒 Security & Privacy | 🟠 Major | ⚡ Quick win
Validate experiment_id before path joins and globbing.
Line 912 accepts path traversal like ../other-dir, and Line 915 treats glob metacharacters in a user-supplied experiment_id as patterns. Since job_status/job_logs expose this value through the MCP API, validate it as a single path-safe token before resolving directories. As per coding guidelines, "Validate external input once at the interface boundary."
🛡️ Proposed guard
+_EXPERIMENT_ID_RE = re.compile(r"^[A-Za-z0-9][A-Za-z0-9_.-]*$")
+
+
def _resolve_experiment_dir(experiment_id: str) -> Path | None:
"""Map an experiment_id to its on-disk directory.
@@
"""
+ if not _EXPERIMENT_ID_RE.fullmatch(experiment_id):
+ return None
+
roots = []🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tools/mcp/modelopt_mcp/bridge.py` around lines 905 - 917, The experiment_id
parameter is used unsafely in path operations without validation, allowing
potential path traversal attacks (via sequences like ..) and glob pattern
injection. Before using experiment_id in the path joins at line 912 (root /
experiment_id) and the glob operation at line 915
(root.glob(f"*/{experiment_id}")), add validation to ensure experiment_id is a
single safe path token containing only alphanumeric characters, hyphens, and
underscores. Reject any values containing path separators, dots for traversal,
or glob metacharacters. Perform this validation at the entry point where
experiment_id is received from the MCP API interface (in job_status and job_logs
functions) rather than within this utility function.
Source: Coding guidelines
Codecov Report✅ All modified and coverable lines are covered by tests.
Additional details and impacted files@@ Coverage Diff @@
## main #1799 +/- ##
===========================================
- Coverage 72.47% 57.69% -14.78%
===========================================
Files 511 510 -1
Lines 56513 56445 -68
===========================================
- Hits 40956 32565 -8391
- Misses 15557 23880 +8323
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
69043a6 to
98b4113
Compare
Signed-off-by: Chenhan Yu <chenhany@nvidia.com>
98b4113 to
350fdab
Compare
Summary
slurm_factorycorrectly for task slotsmodelopt-mcpsubmit parsing/status resolution and add regression coverage for launcher false-positive success casesnvidia-smismoke YAML/script and fix launcher packaging so source-backed Slurm jobs package required files recursivelyValidation
uv run pytest tests/test_core.py -quv run pytest tests/test_bridge.py -qcw_dfwnvidia-smiran successfully in-container)Summary by CodeRabbit
New Features
Bug Fixes
Tests