fix(config): actionable error for unknown server cross-references#1561
Merged
bxyu-nvidia
previously requested changes
Jun 10, 2026
wprazuch
added a commit
that referenced
this pull request
Jun 10, 2026
…uote error Per review on #1561: import get_close_matches directly instead of the difflib module, and use a triple-quoted block for the ServerRefNotFoundError message instead of concatenated single-quote f-strings. Output is unchanged. Signed-off-by: Wojciech Prazuch <wprazuch@nvidia.com>
5a85a53 to
186a762
Compare
This was referenced Jun 11, 2026
ko3n1g
pushed a commit
that referenced
this pull request
Jun 12, 2026
…ite) (#1576)

## Problem

The **Full test suite** (`Test`) job has been red on every PR, failing 7 servers (`math_with_code`, `newton_bench`, `arena_judge`, `reasoning_gym`, `ether0`, `aviary`, `stirrup_agent`) with `ModuleNotFoundError` at test collection: `scipy`, `scikit-learn`, `matplotlib`, `PIL`. These deps **are** declared (and pinned) in each server's `requirements.txt`.

The real cause: **`uv 0.11.20` (released 2026-06-10) has a resolver regression**. It silently drops pinned direct dependencies from `uv pip install -r requirements.txt` when the requirements also include an editable `-e` install (as every server's `-e nemo-gym[dev] @ ../../` does). No error, no conflict; the package is just omitted.

CI installs uv **unpinned** (`curl -LsSf https://astral.sh/uv/install.sh | sh`), so it picked up 0.11.20 the day it released, which is exactly when the suite started failing. The first (passing) run used an earlier uv.

## Evidence (reproduced locally with the exact CI install command)

For `resources_servers/reasoning_gym` (`source .venv/bin/activate && uv pip install -r requirements.txt openai==2.7.2`):

| uv version | Resolved | `matplotlib==3.10.6` |
|---|---|---|
| 0.11.19 | 154 packages | ✅ installed → `import matplotlib` OK |
| **0.11.20** | 150 packages | ❌ dropped → `ModuleNotFoundError` |

Bisected 0.10.2 → 0.11.20: every version **through 0.11.19 works**; only **0.11.20** is broken.

## Fix

Pin the uv installer to **0.11.19** (latest known-good) in `full-test-suite.yml` (Test + wheel jobs) and `unit-tests.yml`, with a comment explaining why. Once uv ships a fix, the pin can be bumped.

Not caused by, and unblocks, any PR that triggers the full matrix (e.g. #1561, #1575). Worth also reporting the regression upstream to astral-sh/uv.

Signed-off-by: Wojciech Prazuch <wprazuch@nvidia.com>
Replace the bare assert in validate_and_populate_defaults with a ServerRefNotFoundError that names the offending instance and field, and fuzzy-matches the missing name against same-type instances to suggest a correction. Iterate items() to recover the field name; raise instead of assert so validation is not stripped under python -O. Closes part of #1205 (friction point #3). Signed-off-by: Wojciech Prazuch <wprazuch@nvidia.com>
…uote error Per review on #1561: import get_close_matches directly instead of the difflib module, and use a triple-quoted block for the ServerRefNotFoundError message instead of concatenated single-quote f-strings. Output is unchanged. Signed-off-by: Wojciech Prazuch <wprazuch@nvidia.com>
186a762 to
84bce25
Compare
marta-sd
approved these changes
Jun 15, 2026
marta-sd
left a comment
Contributor
Approving since Brian's suggestions were addressed and he's ooto
Contributor
Author
/ok to test 84bce25
Applied changes, and brian is currently unavailable
What
When a server config references another server by `{type, name}` and the name doesn't exist in the merged config, replace the bare `assert` in `validate_and_populate_defaults` with a dedicated `ServerRefNotFoundError` that:
- names the offending instance and field,
- fuzzy-matches the missing name against same-type instances (`difflib`) and suggests the closest correct name.

It also iterates `.items()` (to recover the field name) and `raise`s instead of `assert`ing, so validation is not stripped under `python -O`.

Why
Today a typo in a cross-reference produces an opaque `AssertionError` that dumps the full list of every server ref, and only after Ray has initialized (~30–60s). This is friction point 3 in #1205.

Before / after

Scope
This is a message-quality fix only: behavior is unchanged for valid configs, and invalid configs still fail at the same point. The deeper recursive scan for refs nested inside sub-dicts is intentionally deferred, because it would surface previously silent misses, which is a behavior change.

Testing
- Existing tests (`errors_on_missing`, `errors_on_wrong_type`) now expect `ServerRefNotFoundError`; the first now asserts the message names the instance, field, and ref.
- `test_get_global_config_dict_server_refs_suggests_close_match` covers the fuzzy "Did you mean?" branch (typo `resource` → suggests `resources`, scoped to same type).
- `pytest tests/unit_tests/test_global_config.py`: 27/27 pass. `ruff` clean.

Part of #1205 (friction point 3). No new dependencies (`difflib` is stdlib).
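
For readers without the diff at hand, the approach described above can be sketched roughly as follows. This is a minimal illustration, not the merged code: the helper name `check_server_ref`, the `servers_by_type` mapping, the `ValueError` base class, and the exact message wording are all assumptions made for this example. Only `ServerRefNotFoundError`, the direct `get_close_matches` import, the triple-quoted message, and the same-type scoping come from the PR discussion.

```python
from difflib import get_close_matches


class ServerRefNotFoundError(ValueError):
    """A {type, name} cross-reference names a server that does not exist.

    The base class is an assumption; the PR does not state it here.
    """


def check_server_ref(instance_name, field_name, ref, servers_by_type):
    """Hypothetical helper: validate one cross-reference.

    servers_by_type maps a server type to the instance names of that type.
    """
    ref_type, ref_name = ref["type"], ref["name"]
    candidates = servers_by_type.get(ref_type, [])
    if ref_name in candidates:
        return

    # Fuzzy-match only against instances of the same type.
    close = get_close_matches(ref_name, candidates, n=1)
    hint = f"Did you mean '{close[0]}'?" if close else "No close match found."

    # raise rather than assert, so the check is not stripped under `python -O`.
    raise ServerRefNotFoundError(
        f"""Server instance '{instance_name}' field '{field_name}' references
unknown {ref_type} server '{ref_name}'. {hint}"""
    )
```

With a mapping such as `{"resources_servers": ["resources", "math_with_code"]}`, a reference to `resource` raises an error whose message names the instance and field and ends with a suggestion for `resources`, while a valid reference returns without raising.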