feat(cli): add 'gym env validate' pre-flight config check (#1205 friction #12)#1599
Draft
wprazuch wants to merge 1 commit into
wprazuch added a commit that referenced this pull request on Jun 16, 2026
Supersedes #1510; covers epic #1205 friction #8 + #12 and issues #1488, #1489, #1490 in one place:

- #1488: missing config_paths entry -> ConfigPathNotFoundError (names entry + searched locations).
- #1490: malformed (non-list) config_paths -> MalformedConfigPathsError with the expected Hydra list syntax.
- #1489: zero configured servers -> NoServerInstancesError, raised in RunHelper.start() before Ray (covers ng_run AND e2e_rollout_collection).

All three subclass a new ConfigError base. A CLI decorator (exit_cleanly_on_config_error) on run()/e2e_rollout_collection() turns any ConfigError into a clean, rich-escaped message + exit 1 with NO traceback (the explicit ask in #1488/#1489), while keeping them ordinary exceptions so ng_validate (#1599) can still catch and format them. Zero-server check uses validated server instances, not a raw key count.

Tests: deterministic tmp_path-based path-error tests (both-locations, dedup, absolute), malformed-config_paths, zero-server, and the decorator (ConfigError -> clean exit; non-ConfigError propagates).

Signed-off-by: Wojciech Prazuch <wprazuch@nvidia.com>
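The error hierarchy and decorator described in this commit message can be sketched roughly as follows. This is an illustrative reconstruction, not the code in the PR: the constructor arguments, message wording, and the `escape_markup` helper (a simplified stand-in for rich-style escaping) are assumptions; only the class names and the decorator name come from the commit message.

```python
import functools
import sys


class ConfigError(Exception):
    """Base class for user-facing configuration errors."""


class ConfigPathNotFoundError(ConfigError):
    # Hypothetical signature: the entry that was missing and where we looked.
    def __init__(self, entry, searched):
        self.entry = entry
        self.searched = list(searched)
        locations = ", ".join(str(p) for p in self.searched)
        super().__init__(f"config_paths entry '{entry}' not found; searched: {locations}")


class MalformedConfigPathsError(ConfigError):
    # Hypothetical signature: the offending non-list value.
    def __init__(self, value):
        super().__init__(
            f"config_paths must be a list, got {value!r}; "
            "expected Hydra list syntax such as +config_paths=[a.yaml,b.yaml]"
        )


class NoServerInstancesError(ConfigError):
    def __init__(self):
        super().__init__("no server instances are configured")


def escape_markup(text):
    # Simplified stand-in (assumption): backslash-escape '[' so that list
    # syntax in a message is not interpreted as console markup.
    return text.replace("[", "\\[")


def exit_cleanly_on_config_error(func):
    """Turn any ConfigError into a one-line message and exit code 1."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except ConfigError as exc:
            print(escape_markup(str(exc)), file=sys.stderr)
            # 'from None' suppresses the chained traceback.
            raise SystemExit(1) from None

    return wrapper
```

Because the errors stay ordinary exceptions, a caller that is not wrapped by the decorator can still catch `ConfigError` and format it itself; any other exception type passes through the decorator unchanged.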
621d0f0 to dc297fa
wprazuch added a commit that referenced this pull request on Jun 17, 2026
Shared CI fixes for the martas/1434-stacked CLI work: pin uv (0.11.20 drops pinned deps -> 7 servers fail; = #1576) and pull main's graphwalks example_rollouts.jsonl (fixes its data validation). This branch is the base for the ng_validate (#1599) and config-error (#1609) PRs so the fixes live in one place. Drop when martas/1434 rebases on main.

Signed-off-by: Wojciech Prazuch <wprazuch@nvidia.com>
Ports ng_validate into the unified gym CLI (#1434): a validate() command in cli/env.py registered as 'gym env validate' (+ ng_validate/nemo_gym_validate deprecated shims). Runs the full parse with no Ray, exits 0/1 with a clean, rich-escaped message. Targets martas/1434. Epic #1205 friction #12 (no config validation tooling).

Signed-off-by: Wojciech Prazuch <wprazuch@nvidia.com>
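As a rough illustration of the behaviour this commit describes (parse only, exit 0/1, clean message), a validate command might look like the sketch below. It is not the PR's implementation: the `parse_config` parameter is a hypothetical injection point standing in for the real parse path, `escape_markup` is a simplified stand-in for rich-style escaping, the message wording is invented, and catching every `Exception` is an assumption based on "exits 1 on a raised error".

```python
import sys


def escape_markup(text):
    # Simplified stand-in (assumption) for rich-style markup escaping.
    return text.replace("[", "\\[")


def validate(parse_config, out=None, err=None):
    """Run the config parse only and return a process exit code.

    parse_config: hypothetical zero-argument callable that performs the
    full config parse and raises on an invalid config.
    """
    out = sys.stdout if out is None else out
    err = sys.stderr if err is None else err
    try:
        parse_config()
    except Exception as exc:  # assumption: any raised error means "invalid"
        # One clean line, no traceback.
        print(f"✗ config invalid: {escape_markup(str(exc))}", file=err)
        return 1
    print("✓ config valid", file=out)
    return 0
```

A console entry point would then end with something like `sys.exit(validate(parse))`, where `parse` is whatever callable performs the real config parse, so that nothing heavier than the parse runs before the verdict.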
f44535d to 624d4cf
What

Adds 'gym env validate' (+ ng_validate/nemo_gym_validate deprecated shims): runs the full config parse with no Ray and no server subprocesses, exits 0 (valid) / 1 (invalid) with a clean, rich-escaped message (no traceback).

Targets martas/1434

Built on top of the unified-CLI epic (#1434) rather than main: validate() lives in cli/env.py and is registered as 'env validate' in the gym router (cli/main.py COMMANDS, with the shared --config flag). It reuses the same get_global_config_dict() parse path the other commands use, so the validation checks stay in sync.

Why
Epic #1205 friction #12 (no config validation tooling) — the M1 'fast failure triage' deliverable. Config errors otherwise only surface after Ray starts (~30–60 s).
Tests
- test_cli_main.py: gym env validate --config X routes to nemo_gym.cli.env:validate with +config_paths=[X] (added to the parametrized config-command matrix).
- test_cli.py: validate() passes on a valid config; exits 1 on a raised error.
- test_cli + test_cli_main + test_global_config pass; ruff + pre-commit clean. Smoke-tested the router end-to-end (clean ✗ message, no traceback).
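The routing expectation in the first test bullet can be pictured with a small stand-alone sketch. The `route` function and the shape of the `COMMANDS` table here are hypothetical and do not reflect the actual cli/main.py; only the target string `nemo_gym.cli.env:validate` and the `--config X` to `+config_paths=[X]` translation come from the description above.

```python
# Hypothetical command table: (group, command) -> "module:function" target.
COMMANDS = {
    ("env", "validate"): "nemo_gym.cli.env:validate",
}


def route(argv):
    """Resolve a command and translate the shared --config flag.

    Returns the target string and the list of overrides that would be
    forwarded to the command.
    """
    if len(argv) < 2:
        raise ValueError("expected: <group> <command> [args...]")
    group, name, *rest = argv
    target = COMMANDS[(group, name)]

    overrides = []
    args = iter(rest)
    for arg in args:
        if arg == "--config":
            path = next(args, None)
            if path is None:
                raise ValueError("--config requires a value")
            # --config X becomes a Hydra-style list override.
            overrides.append(f"+config_paths=[{path}]")
        else:
            # Anything else is passed through untouched.
            overrides.append(arg)
    return target, overrides


print(route(["env", "validate", "--config", "X"]))
```

A parametrized test over a table like this only needs one extra row per new command, which is presumably why the new case could be added to an existing config-command matrix.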