
fix(config): actionable error for unknown server cross-references #1561

Merged
wprazuch merged 2 commits into main from wprazuch/config-friction-quickwins
Jun 15, 2026

@wprazuch wprazuch commented Jun 10, 2026

Contributor

What

When a server config references another server by {type, name} and the name doesn't exist in the merged config, replace the bare assert in validate_and_populate_defaults with a dedicated ServerRefNotFoundError that:

  • names the offending instance and the field that holds the bad reference,
  • fuzzy-matches the missing name against same-type instances (via difflib) and suggests the closest correct name,
  • falls back to listing the available same-type instances when there's no close match.
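A minimal sketch of the lookup described in the list above, not the code from this PR. Only `ServerRefNotFoundError`, `get_close_matches` from `difflib`, and the message wording shown under Before / after come from this thread. The helper name `check_ref`, its parameters, the `ValueError` base class, the wording of the fallback line, and the shape of `available` are assumptions made for illustration.

```python
from difflib import get_close_matches


class ServerRefNotFoundError(ValueError):
    """Raised when a {type, name} cross-reference has no matching server instance."""


def check_ref(instance, field, ref_type, ref_name, available):
    """Hypothetical helper: `available` maps server type -> list of instance names."""
    same_type = available.get(ref_type, [])
    if ref_name in same_type:
        return
    message = (
        f"In server instance {instance!r}, field {field!r} references "
        f"{ref_type}/{ref_name!r}, which is not defined in the merged config."
    )
    # Only same-type instances are candidates, so any suggestion is a valid fix.
    close = get_close_matches(ref_name, same_type, n=1)
    if close:
        message += f"\nDid you mean: {close[0]!r}?"
    else:
        # Assumed fallback wording: list what is available for this type.
        message += f"\nAvailable {ref_type}: {sorted(same_type)}"
    raise ServerRefNotFoundError(message)


available = {"resources_servers": ["workplace_assistant", "math_with_code"]}
try:
    check_ref(
        "workplace_assistant_simple_agent",
        "resources_server",
        "resources_servers",
        "workplace_assitant",  # typo taken from the Before / after example
        available,
    )
except ServerRefNotFoundError as err:
    print(err)
```

Restricting candidates to the referenced type keeps the suggestion from pointing at a similarly named instance that would still fail validation.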

It also iterates .items() (to recover the field name) and raises instead of asserting, so validation is not stripped under python -O.
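To illustrate why raising matters, the sketch below compiles the same function at two optimization levels using the built-in `compile`. Level 0 keeps `assert` statements and level 1 removes them, which is what `python -O` does. The `validate` function and its arguments are hypothetical and are not taken from `validate_and_populate_defaults`.

```python
SOURCE = '''
def validate(ref, known):
    assert ref in known, f"Could not find {ref}"
    return "validated"
'''


def load_validate(optimize):
    """Compile SOURCE at the given optimization level and return its validate()."""
    namespace = {}
    exec(compile(SOURCE, "<demo>", "exec", optimize=optimize), namespace)
    return namespace["validate"]


plain = load_validate(0)      # like `python`: asserts are kept
optimized = load_validate(1)  # like `python -O`: asserts are removed

try:
    plain("typo", ["real"])
    outcome = "no error"
except AssertionError:
    outcome = "AssertionError"

print(outcome)                      # AssertionError
print(optimized("typo", ["real"]))  # validated: the bad reference slips through
```

An explicit `raise` survives both levels, so the check behaves the same regardless of how the interpreter is started.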

Why

Today a typo in a cross-reference produces an opaque AssertionError that dumps the full list of every server ref, only after Ray has initialized (~30–60s). This is friction point 3 in #1205.

Before / after

# before
AssertionError: Could not find type='resources_servers' name='workplace_assitant' in the list
of available servers: [type='resources_servers' name='workplace_assistant', ...]

# after
ServerRefNotFoundError: In server instance 'workplace_assistant_simple_agent', field
'resources_server' references resources_servers/'workplace_assitant', which is not defined in
the merged config.
Did you mean: 'workplace_assistant'?

Scope

This is a message-quality fix only — behavior is unchanged for valid configs, and invalid configs still fail at the same point. The deeper recursive scan for refs nested inside sub-dicts is intentionally deferred (it would surface previously-silent misses, a behavior change).
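For context on the deferred work, a recursive scan could look roughly like the sketch below. It is not part of this PR. It assumes a reference is any dict whose keys are exactly `type` and `name`; the function name `iter_refs` and the sample keys (`judge`, `model_server`, `model_servers`, `judge_model`) are hypothetical.

```python
def iter_refs(node, path=()):
    """Yield (path, ref) for every dict that looks like a {type, name} reference."""
    if isinstance(node, dict):
        if set(node) == {"type", "name"}:
            yield path, node
            return
        for key, value in node.items():
            yield from iter_refs(value, path + (key,))
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from iter_refs(value, path + (index,))


config = {
    # Top-level reference: the kind the current validation already sees.
    "resources_server": {"type": "resources_servers", "name": "workplace_assistant"},
    # Nested reference: the kind a recursive scan would newly surface.
    "judge": {"model_server": {"type": "model_servers", "name": "judge_model"}},
}

for path, ref in iter_refs(config):
    print(".".join(map(str, path)), "->", f"{ref['type']}/{ref['name']}")
```

Because the nested entry would start being validated, a config that loads today could begin to fail, which is the behavior change the Scope note refers to.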

Testing

  • Updated the two existing error tests (errors_on_missing, errors_on_wrong_type) to expect ServerRefNotFoundError; the first now asserts the message names the instance, field, and ref.
  • Added test_get_global_config_dict_server_refs_suggests_close_match covering the fuzzy "Did you mean?" branch (typo resource → suggests resources, scoped to same type).
  • pytest tests/unit_tests/test_global_config.py — 27/27 pass. ruff clean.
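The "scoped to same type" behavior in the second bullet can be sketched as a pair of standalone tests. These are not the tests in `tests/unit_tests/test_global_config.py`; `suggest` is a hypothetical stand-in for the suggestion step, and `other_servers` is an invented type name.

```python
from difflib import get_close_matches


def suggest(ref_type, ref_name, available):
    """Hypothetical stand-in for the suggestion step: same-type candidates only."""
    matches = get_close_matches(ref_name, available.get(ref_type, []), n=1)
    return matches[0] if matches else None


def test_suggestion_is_scoped_to_same_type():
    available = {
        "resources_servers": ["resources"],
        # An exact name match under a different type must not be suggested.
        "other_servers": ["resource"],
    }
    assert suggest("resources_servers", "resource", available) == "resources"


def test_no_suggestion_without_close_match():
    available = {"resources_servers": ["resources"]}
    assert suggest("resources_servers", "zzz", available) is None


# Run directly when pytest is not available.
test_suggestion_is_scoped_to_same_type()
test_no_suggestion_without_close_match()
```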

Part of #1205 (friction point 3). No new dependencies (difflib is stdlib).

@copy-pr-bot

copy-pr-bot Bot commented Jun 10, 2026


This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Comment thread nemo_gym/global_config.py Outdated
Comment thread nemo_gym/global_config.py Outdated
wprazuch added a commit that referenced this pull request Jun 10, 2026
…uote error

Per review on #1561: import get_close_matches directly instead of the
difflib module, and use a triple-quoted block for the ServerRefNotFoundError
message instead of concatenated single-quote f-strings. Output is unchanged.

Signed-off-by: Wojciech Prazuch <wprazuch@nvidia.com>
@wprazuch wprazuch force-pushed the wprazuch/config-friction-quickwins branch from 5a85a53 to 186a762 Compare June 10, 2026 18:18
ko3n1g pushed a commit that referenced this pull request Jun 12, 2026
…ite) (#1576)

## Problem

The **Full test suite** (`Test`) job has been red on every PR, failing 7
servers
(`math_with_code`, `newton_bench`, `arena_judge`, `reasoning_gym`,
`ether0`, `aviary`,
`stirrup_agent`) with `ModuleNotFoundError` at test collection —
`scipy`, `scikit-learn`,
`matplotlib`, `PIL`.

These deps **are** declared (and pinned) in each server's
`requirements.txt`. The real cause:

**`uv 0.11.20` (released 2026-06-10) has a resolver regression** — it
silently drops pinned
direct dependencies from `uv pip install -r requirements.txt` when the
requirements also
include an editable `-e` install (as every server's `-e nemo-gym[dev] @
../../` does). No error,
no conflict — the package is just omitted.

CI installs uv **unpinned** (`curl -LsSf https://astral.sh/uv/install.sh
| sh`), so it picked up
0.11.20 the day it released — which is exactly when the suite started
failing. The first
(passing) run used an earlier uv.

## Evidence (reproduced locally with the exact CI install command)

For `resources_servers/reasoning_gym` (`source .venv/bin/activate && uv
pip install -r requirements.txt openai==2.7.2`):

| uv version | Resolved | `matplotlib==3.10.6` |
|---|---|---|
| 0.11.19 | 154 packages | ✅ installed → `import matplotlib` OK |
| **0.11.20** | 150 packages | ❌ dropped → `ModuleNotFoundError` |

Bisected 0.10.2 → 0.11.20: every version **through 0.11.19 works**; only
**0.11.20** is broken.

## Fix

Pin the uv installer to **0.11.19** (latest known-good) in
`full-test-suite.yml` (Test + wheel
jobs) and `unit-tests.yml`, with a comment explaining why. Once uv ships
a fix, the pin can be
bumped.

Not caused by — and unblocks — any PR that triggers the full matrix
(e.g. #1561, #1575). Worth
also reporting the regression upstream to astral-sh/uv.

Signed-off-by: Wojciech Prazuch <wprazuch@nvidia.com>
wprazuch added 2 commits June 15, 2026 15:06
Replace the bare assert in validate_and_populate_defaults with a
ServerRefNotFoundError that names the offending instance and field, and
fuzzy-matches the missing name against same-type instances to suggest a
correction. Iterate items() to recover the field name; raise instead of
assert so validation is not stripped under python -O.

Closes part of #1205 (friction point #3).

Signed-off-by: Wojciech Prazuch <wprazuch@nvidia.com>
…uote error

Per review on #1561: import get_close_matches directly instead of the
difflib module, and use a triple-quoted block for the ServerRefNotFoundError
message instead of concatenated single-quote f-strings. Output is unchanged.

Signed-off-by: Wojciech Prazuch <wprazuch@nvidia.com>
@wprazuch wprazuch force-pushed the wprazuch/config-friction-quickwins branch from 186a762 to 84bce25 Compare June 15, 2026 13:06

@marta-sd marta-sd left a comment

Contributor

Approving since Brian's suggestions were addressed and he's out of the office.

@wprazuch wprazuch enabled auto-merge (squash) June 15, 2026 13:29
@wprazuch

Contributor Author

/ok to test 84bce25

@wprazuch wprazuch dismissed bxyu-nvidia’s stale review June 15, 2026 13:51

Applied the requested changes; Brian is currently unavailable.

@wprazuch wprazuch merged commit 9f3192b into main Jun 15, 2026
30 checks passed
@wprazuch wprazuch deleted the wprazuch/config-friction-quickwins branch June 15, 2026 13:51

3 participants