Skip to content

fix(config): aggregated error for unset '???' config values#1575

Open
wprazuch wants to merge 4 commits into
mainfrom
wprazuch/config-missing-values
Open

fix(config): aggregated error for unset '???' config values#1575
wprazuch wants to merge 4 commits into
mainfrom
wprazuch/config-missing-values

Conversation

@wprazuch

@wprazuch wprazuch commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

What

A required config value left unset (OmegaConf MISSING, ???) currently surfaces as an opaque
omegaconf.errors.MissingMandatoryValueone field at a time, raised deep in parsing
(while resolving inheritance), with no guidance on how to set it.

This adds a fast-fail check in parse() that raises a single ConfigMissingValuesError listing
every unset value with a ready-to-use override example.

Before / after

# before
omegaconf.errors.MissingMandatoryValue: Missing mandatory value:
swe_agents.responses_api_agents.swe_agents.container_formatter

# after
2 required config value(s) are unset (still '???') after merging:
  - swe_agents.responses_api_agents.swe_agents.container_formatter
  - swe_agents.responses_api_agents.swe_agents.dataset_path

Provide each value via a CLI override, in env.yaml, or in a config you pass via config_paths.
For example, on the command line:
  ++swe_agents.responses_api_agents.swe_agents.container_formatter=<value>
  ++swe_agents.responses_api_agents.swe_agents.dataset_path=<value>

How

  • New ConfigMissingValuesError(ValueError) in config_types.py.
  • collect_missing_value_paths() / raise_on_missing_values() in global_config.py. The scan
    runs after all sources are merged (CLI + env.yaml + config_paths) and before
    _recursively_swap_keys — which is the first step that iterates and would otherwise trip the
    opaque MissingMandatoryValue. By that point every CLI/env override has been applied, so any
    remaining ??? is genuinely unset.
  • The walk uses OmegaConf.to_container(resolve=False, throw_on_missing=False), so it never
    raises on MISSING values or unresolved ${...} interpolations.

Scope / safety

Base configs that intentionally ship ??? (api keys, container paths, model names — ~33 files)
are unaffected: they're filled at run time via CLI/env, and no test parses a bare ??? config.
This only changes the error a user sees when they forget to supply one.

Testing

  • test_collect_missing_value_paths — nested dict + list, asserts ["a", "b.d", "e[1]"].
  • test_get_global_config_dict_raises_on_missing_values — asserts the dotted path and the
    ++...=<value> hint appear in the message.
  • pytest tests/unit_tests/test_global_config.py 28/28; related test_train_data_utils /
    test_config_types_help / test_rollout_collection 54/54. ruff clean. No new dependencies.

Part of #1205 (friction point 10). Companion to #1561 (friction 3).

@copy-pr-bot

copy-pr-bot Bot commented Jun 11, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

ko3n1g pushed a commit that referenced this pull request Jun 12, 2026
…ite) (#1576)

## Problem

The **Full test suite** (`Test`) job has been red on every PR, failing 7
servers
(`math_with_code`, `newton_bench`, `arena_judge`, `reasoning_gym`,
`ether0`, `aviary`,
`stirrup_agent`) with `ModuleNotFoundError` at test collection —
`scipy`, `scikit-learn`,
`matplotlib`, `PIL`.

These deps **are** declared (and pinned) in each server's
`requirements.txt`. The real cause:

**`uv 0.11.20` (released 2026-06-10) has a resolver regression** — it
silently drops pinned
direct dependencies from `uv pip install -r requirements.txt` when the
requirements also
include an editable `-e` install (as every server's `-e nemo-gym[dev] @
../../` does). No error,
no conflict — the package is just omitted.

CI installs uv **unpinned** (`curl -LsSf https://astral.sh/uv/install.sh
| sh`), so it picked up
0.11.20 the day it released — which is exactly when the suite started
failing. The first
(passing) run used an earlier uv.

## Evidence (reproduced locally with the exact CI install command)

For `resources_servers/reasoning_gym` (`source .venv/bin/activate && uv
pip install -r requirements.txt openai==2.7.2`):

| uv version | Resolved | `matplotlib==3.10.6` |
|---|---|---|
| 0.11.19 | 154 packages | ✅ installed → `import matplotlib` OK |
| **0.11.20** | 150 packages | ❌ dropped → `ModuleNotFoundError` |

Bisected 0.10.2 → 0.11.20: every version **through 0.11.19 works**; only
**0.11.20** is broken.

## Fix

Pin the uv installer to **0.11.19** (latest known-good) in
`full-test-suite.yml` (Test + wheel
jobs) and `unit-tests.yml`, with a comment explaining why. Once uv ships
a fix, the pin can be
bumped.

Not caused by — and unblocks — any PR that triggers the full matrix
(e.g. #1561, #1575). Worth
also reporting the regression upstream to astral-sh/uv.

Signed-off-by: Wojciech Prazuch <wprazuch@nvidia.com>
@wprazuch wprazuch force-pushed the wprazuch/config-missing-values branch from 0ccb3cf to edc9b02 Compare June 15, 2026 13:07
@wprazuch

Copy link
Copy Markdown
Contributor Author

/ok to test edc9b02

@wprazuch

Copy link
Copy Markdown
Contributor Author

/ok to test 713dffe

Comment thread nemo_gym/global_config.py Outdated
# Fail fast with one actionable error if any required value is still '???' after merging
# all sources (CLI + env.yaml + config_paths). Otherwise the first unset value surfaces as
# an opaque MissingMandatoryValue deep in the pipeline (e.g. while resolving inheritance).
self.raise_on_missing_values(global_config_dict)

@cmunley1 cmunley1 Jun 15, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potential regression flagged by Codex with an example being in benchmarks/nemotron_3_ultra/ultra_local_endpoint.yaml, what do you think?

 parse() now calls raise_on_missing_values() before _recursively_swap_keys(). But
  _recursively_swap_keys() is where _delete_key is applied. So a config that previously
  deleted an obsolete branch containing ??? now fails before the deletion happens.

  Example:

  policy_model:
    responses_api_models:
      _delete_key: old_model
      old_model:
        entrypoint: app.py
        model: ???
      new_model:
        entrypoint: app.py
        model: actual-model

  Before this PR: old_model gets deleted, config parses.

  After this PR: parser raises ConfigMissingValuesError for
  policy_model.responses_api_models.old_model.model, even though that value is supposed to
  be removed.

  Suggestion: move the missing-value check until after _recursively_swap_keys() has
  applied _delete_key and inheritance/copy rewrites, or teach the scanner to ignore paths
  scheduled for deletion. I’d prefer moving the check later if possible, then adding a
  regression test with _delete_key deleting a branch that contains ???.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems the suggestion is to just move the raise below swap keys, which I think makes sense

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — fixed in 1168c9f. One subtlety worth flagging: moving the check below _recursively_swap_keys() alone doesn't work, because swap_keys iterates with .items(), which resolves MISSING and raises MissingMandatoryValue mid-swap on any genuine ??? — i.e. it re-introduces the exact opaque error this PR removes. So the fix is two parts:

  1. _recursively_swap_keys_helper now iterates with items_ex(resolve=False) — directive strings (${inherit_from:...}) come back raw so swap detection still matches, and a ??? leaf is returned as-is instead of crashing. (Nested values are still DictConfig, so the delete/recursion logic is untouched.)
  2. raise_on_missing_values now runs after _recursively_swap_keys, so _delete_key/_inherit_from/_copy are applied first and a ??? in a deleted/overwritten branch is not reported.

Added two regression tests as you suggested: a unit test (swap + collect ignores a _delete_key'd ??? branch while still reporting a genuine ???) and an e2e test through the full get_global_config_dict() parse with _delete_key removing the only ???. Full suite green (241 unit tests).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 1168c9f — moved the check below swap_keys as you suggested, plus made swap_keys itself missing-tolerant (items_ex(resolve=False)) so it doesn't crash on a genuine ??? before the check runs. Details + regression tests in the thread above.

@wprazuch

Copy link
Copy Markdown
Contributor Author

/ok to test 1168c9f

wprazuch added 2 commits June 16, 2026 13:32
A required value left as OmegaConf MISSING ('???') currently surfaces as an
opaque MissingMandatoryValue deep in parsing — one field at a time, with no
guidance on how to set it. Add a fast-fail check that runs after all sources
are merged (CLI + env.yaml + config_paths) and before inheritance resolution,
raising a single ConfigMissingValuesError listing every unset value with a
++path=<value> override example.

The scan uses OmegaConf.to_container(resolve=False, throw_on_missing=False) so
it never trips on MISSING values or unresolved interpolations.

Part of #1205 (friction point #10).

Signed-off-by: Wojciech Prazuch <wprazuch@nvidia.com>
…anches

Codex flagged a regression: raise_on_missing_values ran before
_recursively_swap_keys, so a '???' inside a branch removed by _delete_key
(or overwritten by _inherit_from/_copy) was wrongly reported as missing.

Moving the check after swap is not sufficient on its own: swap_keys
iterated with .items(), which resolves MISSING and raises
MissingMandatoryValue mid-swap on any genuine '???'. Fix is two parts:

- _recursively_swap_keys_helper: iterate with items_ex(resolve=False) so
  directive strings stay raw (swap detection still matches) and '???'
  leaves are returned as-is instead of crashing.
- parse(): run raise_on_missing_values after _recursively_swap_keys, so
  deleted/overwritten branches are excluded from the missing-value report.

Tests: unit (swap + collect ignores deleted branch) and e2e (full
get_global_config_dict parse with _delete_key removing the only '???').

Signed-off-by: Wojciech Prazuch <wprazuch@nvidia.com>
@wprazuch wprazuch force-pushed the wprazuch/config-missing-values branch from 1168c9f to f49c6c2 Compare June 16, 2026 11:32
assert "Did you mean" in message
assert "'resources'" in message

def test_collect_missing_value_paths(self) -> None:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should tests for _inherit_from/copy too

Comment thread nemo_gym/global_config.py
# come back unresolved (so the swap detection below still matches), and a missing ('???')
# leaf is returned as-is instead of raising MissingMandatoryValue mid-swap. Any genuinely
# unset values are reported together by raise_on_missing_values, which runs after this pass.
for k, v in list(dict_config.items_ex(resolve=False)):

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if this is not already tested, please add coverage that a plain ${a.b.c} reference continues to flow through the full parse so we dont have any regressions on this path

assert "old_model" not in models
assert "new_model" in models

def test_get_first_server_config_dict(self) -> None:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if not already covered, could you add a full end to end test that drives the get_global_config_dict()parse() pipeline?

test_get_global_config_dict_raises_on_missing_values only checks against 1 missing value, and test_collect_missing_value_paths doesn't go through parse

Comment thread nemo_gym/global_config.py
self, dict_config: DictConfig, original_dict_config: DictConfig, frozen_dict_config: DictConfig
) -> None:
for k, v in list(dict_config.items()):
# items_ex(resolve=False) yields raw values: directive strings like "${inherit_from:...}"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think these configs are still an issue:

target: ${copy:source.model}
source:
  model: ???

where this returns ValueError: Path specified does not exist in config: ['source', 'model']

which may not be easily understandable by users

though this config is reported correctly:

target: ${copy:source}   # fine — resolves, then the ??? inside is reported normally
source:
  model: ???

@ananthsub ananthsub left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mostly LGTM, but requesting changes to add more test coverage to ensure we dont have regressions from these changes

wprazuch added 2 commits June 17, 2026 11:44
…erage

Address review on #1575:
- _recursive_index_dict_using_path now navigates with _get_node and propagates
  MISSING for an unset referenced value, so '${copy:source.model}' where
  source.model is '???' reports the missing value instead of an opaque
  'path does not exist' error.
- Tests: _copy/_inherit_from carry '???' into the target (reported); copy of a
  missing leaf is reported not opaque; plain '${a.b.c}' interpolation still
  resolves through full parse (no regression); e2e get_global_config_dict()
  aggregates multiple missing values.

Signed-off-by: Wojciech Prazuch <wprazuch@nvidia.com>
Swap iteration over ListConfig now skips missing ('???') elements without
resolving, so a '???' in a list (or in a dict nested in a list) is reported by
raise_on_missing_values instead of crashing mid-swap with MissingMandatoryValue.
Completes the missing-tolerance started for dict leaves. Found by review.

Signed-off-by: Wojciech Prazuch <wprazuch@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants