feat(cli): add 'gym env validate' pre-flight config check (#1205 friction #12) #1599

Draft
wprazuch wants to merge 1 commit into wprazuch/ng-test-concurrency from wprazuch/ng-validate

@wprazuch wprazuch commented Jun 15, 2026

What

Adds gym env validate (+ ng_validate/nemo_gym_validate deprecated shims) — runs the full config parse with no Ray and no server subprocesses, exits 0 (valid) / 1 (invalid) with a clean, rich-escaped message (no traceback).

gym env validate --config resources_servers/<env>/configs/<env>.yaml --config responses_api_models/<model>/configs/<model>.yaml
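
The exit-code contract described above (0 when the config parses, 1 with a one-line message and no traceback otherwise) can be sketched roughly as follows. This is an illustration, not the code in cli/env.py: `parse_config` is a hypothetical stand-in for whatever calls `get_global_config_dict()`, and `escape_markup` is a simplified stand-in for Rich's `rich.markup.escape`.

```python
import sys
from typing import Callable


def escape_markup(text: str) -> str:
    # Simplified stand-in (assumption): backslash-escape every "[" so that
    # a console markup renderer does not treat part of the message as a tag.
    return text.replace("[", "\\[")


def validate(parse_config: Callable[[], object]) -> int:
    """Run the parse only; return 0 if it succeeds, 1 if it raises."""
    try:
        # Hypothetical hook: the real command reuses the shared parse path.
        parse_config()
    except Exception as exc:
        # One-line message on stderr instead of a traceback.
        print(f"config invalid: {escape_markup(str(exc))}", file=sys.stderr)
        return 1
    print("config valid")
    return 0
```

A caller would typically end with `sys.exit(validate(parse_config))`, so shell scripts and CI jobs can branch on the exit status.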

Targets martas/1434

Built on top of the unified-CLI epic (#1434) rather than main: validate() lives in cli/env.py and is registered as env validate in the gym router (cli/main.py COMMANDS, with the shared --config flag). It reuses the same get_global_config_dict() parse path the other commands use, so the validation checks stay in sync.

Why

Epic #1205 friction #12 (no config validation tooling) — the M1 'fast failure triage' deliverable. Config errors otherwise only surface after Ray starts (~30–60 s).

Tests

  • test_cli_main.py: gym env validate --config X routes to nemo_gym.cli.env:validate with +config_paths=[X] (added to the parametrized config-command matrix).
  • test_cli.py: validate() passes on a valid config; exits 1 on a raised error.
  • 100/100 test_cli+test_cli_main+test_global_config pass; ruff + pre-commit clean. Smoke-tested the router end-to-end (clean ✗ message, no traceback).
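
The routing behaviour covered by the first test (a `--config X` flag becoming the override `+config_paths=[X]`) might look something like the sketch below. The function name `build_overrides` is hypothetical, and joining several `--config` values with commas inside one list is an assumption; the PR text only shows the single-path form.

```python
import argparse
from typing import List


def build_overrides(argv: List[str]) -> List[str]:
    """Translate repeated --config flags into one Hydra-style list override."""
    parser = argparse.ArgumentParser(prog="gym env validate")
    # action="append" collects every --config occurrence in order.
    parser.add_argument("--config", action="append", default=[])
    args = parser.parse_args(argv)

    overrides: List[str] = []
    if args.config:
        # Assumption: multiple paths are comma-joined inside a single list.
        overrides.append("+config_paths=[" + ",".join(args.config) + "]")
    return overrides
```

With this sketch, `build_overrides(["--config", "a.yaml"])` yields `["+config_paths=[a.yaml]"]`, which matches the shape the parametrized test asserts.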

copy-pr-bot Bot commented Jun 15, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

wprazuch added a commit that referenced this pull request Jun 16, 2026
Supersedes #1510; covers epic #1205 friction #8 + #12 and issues #1488,
#1489, #1490 in one place:

- #1488: missing config_paths entry -> ConfigPathNotFoundError (names entry +
  searched locations).
- #1490: malformed (non-list) config_paths -> MalformedConfigPathsError with
  the expected Hydra list syntax.
- #1489: zero configured servers -> NoServerInstancesError, raised in
  RunHelper.start() before Ray (covers ng_run AND e2e_rollout_collection).

All three subclass a new ConfigError base. A CLI decorator
(exit_cleanly_on_config_error) on run()/e2e_rollout_collection() turns any
ConfigError into a clean, rich-escaped message + exit 1 with NO traceback
(the explicit ask in #1488/#1489), while keeping them ordinary exceptions so
ng_validate (#1599) can still catch and format them. Zero-server check uses
validated server instances, not a raw key count.

Tests: deterministic tmp_path-based path-error tests (both-locations, dedup,
absolute), malformed-config_paths, zero-server, and the decorator
(ConfigError -> clean exit; non-ConfigError propagates).

Signed-off-by: Wojciech Prazuch <wprazuch@nvidia.com>
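
For readers unfamiliar with the pattern, the error hierarchy and decorator named in the commit message above could be sketched as below. The class and decorator names come from the commit message; the bodies are assumptions, and the rich-escaping step is omitted for brevity.

```python
import functools
import sys


class ConfigError(Exception):
    """Base class for user-facing configuration problems."""


class ConfigPathNotFoundError(ConfigError):
    pass


class MalformedConfigPathsError(ConfigError):
    pass


class NoServerInstancesError(ConfigError):
    pass


def exit_cleanly_on_config_error(func):
    """Turn any ConfigError into a short message plus exit status 1."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except ConfigError as exc:
            # Only ConfigError is intercepted; other exceptions propagate
            # with their normal traceback.
            print(f"config error: {exc}", file=sys.stderr)
            raise SystemExit(1) from None

    return wrapper
```

Because the errors stay ordinary exceptions and only the decorated entry points convert them to an exit, an undecorated caller such as a validate command can still catch `ConfigError` and format it itself.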
@wprazuch wprazuch force-pushed the wprazuch/ng-validate branch from 621d0f0 to dc297fa Compare June 17, 2026 08:30
@wprazuch wprazuch changed the base branch from main to martas/1434 June 17, 2026 08:30
@wprazuch wprazuch changed the title from "feat(cli): add ng_validate pre-flight config check (#1205 friction #12)" to "feat(cli): add 'gym env validate' pre-flight config check (#1205 friction #12)" Jun 17, 2026
wprazuch added a commit that referenced this pull request Jun 17, 2026
Shared CI fixes for the martas/1434-stacked CLI work: pin uv (0.11.20 drops
pinned deps -> 7 servers fail; = #1576) and pull main's graphwalks
example_rollouts.jsonl (fixes its data validation). This branch is the base
for the ng_validate (#1599) and config-error (#1609) PRs so the fixes live in
one place. Drop when martas/1434 rebases on main.

Signed-off-by: Wojciech Prazuch <wprazuch@nvidia.com>
Ports ng_validate into the unified gym CLI (#1434): a validate() command in
cli/env.py registered as 'gym env validate' (+ ng_validate/nemo_gym_validate
deprecated shims). Runs the full parse with no Ray, exits 0/1 with a clean,
rich-escaped message. Targets martas/1434.

Epic #1205 friction #12 (no config validation tooling).

Signed-off-by: Wojciech Prazuch <wprazuch@nvidia.com>
@wprazuch wprazuch force-pushed the wprazuch/ng-validate branch from f44535d to 624d4cf Compare June 17, 2026 09:31
@wprazuch wprazuch changed the base branch from martas/1434 to wprazuch/ng-test-concurrency June 17, 2026 09:31