fix(config): aggregated error for unset '???' config values#1575
fix(config): aggregated error for unset '???' config values#1575wprazuch wants to merge 4 commits into
Conversation
…ite) (#1576) ## Problem The **Full test suite** (`Test`) job has been red on every PR, failing 7 servers (`math_with_code`, `newton_bench`, `arena_judge`, `reasoning_gym`, `ether0`, `aviary`, `stirrup_agent`) with `ModuleNotFoundError` at test collection — `scipy`, `scikit-learn`, `matplotlib`, `PIL`. These deps **are** declared (and pinned) in each server's `requirements.txt`. The real cause: **`uv 0.11.20` (released 2026-06-10) has a resolver regression** — it silently drops pinned direct dependencies from `uv pip install -r requirements.txt` when the requirements also include an editable `-e` install (as every server's `-e nemo-gym[dev] @ ../../` does). No error, no conflict — the package is just omitted. CI installs uv **unpinned** (`curl -LsSf https://astral.sh/uv/install.sh | sh`), so it picked up 0.11.20 the day it released — which is exactly when the suite started failing. The first (passing) run used an earlier uv. ## Evidence (reproduced locally with the exact CI install command) For `resources_servers/reasoning_gym` (`source .venv/bin/activate && uv pip install -r requirements.txt openai==2.7.2`): | uv version | Resolved | `matplotlib==3.10.6` | |---|---|---| | 0.11.19 | 154 packages | ✅ installed → `import matplotlib` OK | | **0.11.20** | 150 packages | ❌ dropped → `ModuleNotFoundError` | Bisected 0.10.2 → 0.11.20: every version **through 0.11.19 works**; only **0.11.20** is broken. ## Fix Pin the uv installer to **0.11.19** (latest known-good) in `full-test-suite.yml` (Test + wheel jobs) and `unit-tests.yml`, with a comment explaining why. Once uv ships a fix, the pin can be bumped. Not caused by — and unblocks — any PR that triggers the full matrix (e.g. #1561, #1575). Worth also reporting the regression upstream to astral-sh/uv. Signed-off-by: Wojciech Prazuch <wprazuch@nvidia.com>
0ccb3cf to
edc9b02
Compare
|
/ok to test edc9b02 |
edc9b02 to
713dffe
Compare
|
/ok to test 713dffe |
| # Fail fast with one actionable error if any required value is still '???' after merging | ||
| # all sources (CLI + env.yaml + config_paths). Otherwise the first unset value surfaces as | ||
| # an opaque MissingMandatoryValue deep in the pipeline (e.g. while resolving inheritance). | ||
| self.raise_on_missing_values(global_config_dict) |
There was a problem hiding this comment.
Potential regression flagged by Codex with an example being in benchmarks/nemotron_3_ultra/ultra_local_endpoint.yaml, what do you think?
parse() now calls raise_on_missing_values() before _recursively_swap_keys(). But
_recursively_swap_keys() is where _delete_key is applied. So a config that previously
deleted an obsolete branch containing ??? now fails before the deletion happens.
Example:
policy_model:
responses_api_models:
_delete_key: old_model
old_model:
entrypoint: app.py
model: ???
new_model:
entrypoint: app.py
model: actual-model
Before this PR: old_model gets deleted, config parses.
After this PR: parser raises ConfigMissingValuesError for
policy_model.responses_api_models.old_model.model, even though that value is supposed to
be removed.
Suggestion: move the missing-value check until after _recursively_swap_keys() has
applied _delete_key and inheritance/copy rewrites, or teach the scanner to ignore paths
scheduled for deletion. I’d prefer moving the check later if possible, then adding a
regression test with _delete_key deleting a branch that contains ???.
There was a problem hiding this comment.
Seems the suggestion is to just move the raise below swap keys, which I think makes sense
There was a problem hiding this comment.
Good catch — fixed in 1168c9f. One subtlety worth flagging: moving the check below _recursively_swap_keys() alone doesn't work, because swap_keys iterates with .items(), which resolves MISSING and raises MissingMandatoryValue mid-swap on any genuine ??? — i.e. it re-introduces the exact opaque error this PR removes. So the fix is two parts:
_recursively_swap_keys_helpernow iterates withitems_ex(resolve=False)— directive strings (${inherit_from:...}) come back raw so swap detection still matches, and a???leaf is returned as-is instead of crashing. (Nested values are stillDictConfig, so the delete/recursion logic is untouched.)raise_on_missing_valuesnow runs after_recursively_swap_keys, so_delete_key/_inherit_from/_copyare applied first and a???in a deleted/overwritten branch is not reported.
Added two regression tests as you suggested: a unit test (swap + collect ignores a _delete_key'd ??? branch while still reporting a genuine ???) and an e2e test through the full get_global_config_dict() parse with _delete_key removing the only ???. Full suite green (241 unit tests).
There was a problem hiding this comment.
Done in 1168c9f — moved the check below swap_keys as you suggested, plus made swap_keys itself missing-tolerant (items_ex(resolve=False)) so it doesn't crash on a genuine ??? before the check runs. Details + regression tests in the thread above.
|
/ok to test 1168c9f |
A required value left as OmegaConf MISSING ('???') currently surfaces as an
opaque MissingMandatoryValue deep in parsing — one field at a time, with no
guidance on how to set it. Add a fast-fail check that runs after all sources
are merged (CLI + env.yaml + config_paths) and before inheritance resolution,
raising a single ConfigMissingValuesError listing every unset value with a
++path=<value> override example.
The scan uses OmegaConf.to_container(resolve=False, throw_on_missing=False) so
it never trips on MISSING values or unresolved interpolations.
Part of #1205 (friction point #10).
Signed-off-by: Wojciech Prazuch <wprazuch@nvidia.com>
…anches Codex flagged a regression: raise_on_missing_values ran before _recursively_swap_keys, so a '???' inside a branch removed by _delete_key (or overwritten by _inherit_from/_copy) was wrongly reported as missing. Moving the check after swap is not sufficient on its own: swap_keys iterated with .items(), which resolves MISSING and raises MissingMandatoryValue mid-swap on any genuine '???'. Fix is two parts: - _recursively_swap_keys_helper: iterate with items_ex(resolve=False) so directive strings stay raw (swap detection still matches) and '???' leaves are returned as-is instead of crashing. - parse(): run raise_on_missing_values after _recursively_swap_keys, so deleted/overwritten branches are excluded from the missing-value report. Tests: unit (swap + collect ignores deleted branch) and e2e (full get_global_config_dict parse with _delete_key removing the only '???'). Signed-off-by: Wojciech Prazuch <wprazuch@nvidia.com>
1168c9f to
f49c6c2
Compare
| assert "Did you mean" in message | ||
| assert "'resources'" in message | ||
|
|
||
| def test_collect_missing_value_paths(self) -> None: |
There was a problem hiding this comment.
we should tests for _inherit_from/copy too
| # come back unresolved (so the swap detection below still matches), and a missing ('???') | ||
| # leaf is returned as-is instead of raising MissingMandatoryValue mid-swap. Any genuinely | ||
| # unset values are reported together by raise_on_missing_values, which runs after this pass. | ||
| for k, v in list(dict_config.items_ex(resolve=False)): |
There was a problem hiding this comment.
if this is not already tested, please add coverage that a plain ${a.b.c} reference continues to flow through the full parse so we dont have any regressions on this path
| assert "old_model" not in models | ||
| assert "new_model" in models | ||
|
|
||
| def test_get_first_server_config_dict(self) -> None: |
There was a problem hiding this comment.
if not already covered, could you add a full end to end test that drives the get_global_config_dict() → parse() pipeline?
test_get_global_config_dict_raises_on_missing_values only checks against 1 missing value, and test_collect_missing_value_paths doesn't go through parse
| self, dict_config: DictConfig, original_dict_config: DictConfig, frozen_dict_config: DictConfig | ||
| ) -> None: | ||
| for k, v in list(dict_config.items()): | ||
| # items_ex(resolve=False) yields raw values: directive strings like "${inherit_from:...}" |
There was a problem hiding this comment.
i think these configs are still an issue:
target: ${copy:source.model}
source:
model: ???
where this returns ValueError: Path specified does not exist in config: ['source', 'model']
which may not be easily understandable by users
though this config is reported correctly:
target: ${copy:source} # fine — resolves, then the ??? inside is reported normally
source:
model: ???
ananthsub
left a comment
There was a problem hiding this comment.
mostly LGTM, but requesting changes to add more test coverage to ensure we dont have regressions from these changes
…erage Address review on #1575: - _recursive_index_dict_using_path now navigates with _get_node and propagates MISSING for an unset referenced value, so '${copy:source.model}' where source.model is '???' reports the missing value instead of an opaque 'path does not exist' error. - Tests: _copy/_inherit_from carry '???' into the target (reported); copy of a missing leaf is reported not opaque; plain '${a.b.c}' interpolation still resolves through full parse (no regression); e2e get_global_config_dict() aggregates multiple missing values. Signed-off-by: Wojciech Prazuch <wprazuch@nvidia.com>
Swap iteration over ListConfig now skips missing ('???') elements without
resolving, so a '???' in a list (or in a dict nested in a list) is reported by
raise_on_missing_values instead of crashing mid-swap with MissingMandatoryValue.
Completes the missing-tolerance started for dict leaves. Found by review.
Signed-off-by: Wojciech Prazuch <wprazuch@nvidia.com>
What
A required config value left unset (OmegaConf MISSING,
???) currently surfaces as an opaqueomegaconf.errors.MissingMandatoryValue— one field at a time, raised deep in parsing(while resolving inheritance), with no guidance on how to set it.
This adds a fast-fail check in
parse()that raises a singleConfigMissingValuesErrorlistingevery unset value with a ready-to-use override example.
Before / after
How
ConfigMissingValuesError(ValueError)inconfig_types.py.collect_missing_value_paths()/raise_on_missing_values()inglobal_config.py. The scanruns after all sources are merged (CLI + env.yaml + config_paths) and before
_recursively_swap_keys— which is the first step that iterates and would otherwise trip theopaque
MissingMandatoryValue. By that point every CLI/env override has been applied, so anyremaining
???is genuinely unset.OmegaConf.to_container(resolve=False, throw_on_missing=False), so it neverraises on MISSING values or unresolved
${...}interpolations.Scope / safety
Base configs that intentionally ship
???(api keys, container paths, model names — ~33 files)are unaffected: they're filled at run time via CLI/env, and no test parses a bare
???config.This only changes the error a user sees when they forget to supply one.
Testing
test_collect_missing_value_paths— nested dict + list, asserts["a", "b.d", "e[1]"].test_get_global_config_dict_raises_on_missing_values— asserts the dotted path and the++...=<value>hint appear in the message.pytest tests/unit_tests/test_global_config.py28/28; relatedtest_train_data_utils/test_config_types_help/test_rollout_collection54/54. ruff clean. No new dependencies.Part of #1205 (friction point 10). Companion to #1561 (friction 3).