Skip to content

cp: [benchmarking] Adds support for fine-grained config overrides and global ray config (1840) into r1.2.0#1867

Open
svcnvidia-nemo-ci wants to merge 1 commit into
r1.2.0from
cherry-pick-1840-r1.2.0
Open

cp: [benchmarking] Adds support for fine-grained config overrides and global ray config (1840) into r1.2.0#1867
svcnvidia-nemo-ci wants to merge 1 commit into
r1.2.0from
cherry-pick-1840-r1.2.0

Conversation

@svcnvidia-nemo-ci
Copy link
Copy Markdown
Contributor

beep boop [🤖]: Hi @rlratzel 👋,

we've cherry picked #1840 into  for you! 🚀

Please review and approve this cherry pick by your convenience!

…bal ray config (#1840)

Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
@svcnvidia-nemo-ci
Copy link
Copy Markdown
Contributor Author

/ok to test f905cc7

@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented Apr 23, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Apr 23, 2026

Greptile Summary

This PR cherry-picks support for fine-grained config overrides and a global Ray config into r1.2.0. It replaces the flat dict.update config merge with a deep recursive update_config function, adds a top-level ray section to Session that acts as a global default inherited by all entries, and refactors nightly-benchmark.yaml to use that global default instead of repeating identical ray blocks in every entry.

Confidence Score: 5/5

Safe to merge; all findings are P2 style/fragility concerns that do not affect the primary use case.

Both open comments are P2: the first-key matching fragility only matters if a YAML author reorders keys in override files (not a current scenario), and the non-dict append duplication only triggers if an override file repeats a value already in the base list. The core merge and global-ray logic is correct and the nightly-benchmark.yaml refactor preserves all effective per-entry Ray configurations.

benchmarking/run.py — the update_config list-merge logic warrants a second look if override files with non-dict lists or reordered keys are introduced.

Important Files Changed

Filename Overview
benchmarking/run.py Adds deep-merge config logic (update_config, merge_config_files) replacing a flat dict.update call; matching list items by first key is fragile and non-dict list items are always appended without deduplication.
benchmarking/runner/session.py Adds a top-level ray: dict field to Session and applies it as a base for each entry's ray config via {**self.ray, **entry.ray} in __post_init__; logic is correct.
benchmarking/nightly-benchmark.yaml Extracts repeated ray settings to a single global block and leaves only per-entry overrides (e.g. num_gpus: 0); effective ray config for each entry is unchanged.
benchmarking/README.md Documentation updated to explain deep-merge semantics, global ray section, and list-matching rules; examples are accurate and match implementation.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["CLI: --config file1.yaml --config file2.yaml"] --> B["merge_config_files()"]
    B --> C["For each YAML file\nyaml.full_load_all()"]
    C --> D["update_config(config_dict, new_dict)"]
    D --> E{Key type in\nexisting dict?}
    E -->|"Both dicts"| F["Recurse:\nupdate_config()"]
    E -->|"Both lists"| G["Match by first key\nof each dict item"]
    G -->|"Match found"| H["Recurse:\nupdate_config()"]
    G -->|"No match"| I["Append item"]
    G -->|"Non-dict item"| I
    E -->|"Otherwise"| J["Replace value"]
    E -->|"Key missing"| K["Add new key"]
    D --> L["Merged config_dict"]
    L --> M["Session.assert_valid_config_dict()"]
    M --> N["Session.from_dict()"]
    N --> O["Session.__post_init__()"]
    O --> P["For each entry:\nentry.ray = {**session.ray, **entry.ray}"]
    P --> Q["Run entries"]
Loading

Reviews (1): Last reviewed commit: "[benchmarking] Adds support for fine-gra..." | Re-trigger Greptile

Comment thread benchmarking/run.py
Comment on lines +88 to +105
elif isinstance(config_dict[key], list) and isinstance(value, list):
for sub_val in value:
# Handle dicts in the list by matching based on the first key
if isinstance(sub_val, dict) and sub_val:
first_key = next(iter(sub_val.keys()))
for config_sub_val in config_dict[key]:
if (
isinstance(config_sub_val, dict)
and config_sub_val
and next(iter(config_sub_val.keys())) == first_key
and config_sub_val[first_key] == sub_val[first_key]
):
# If matching dict found (based on first key), recursively update it
update_config(config_sub_val, sub_val)
break
else:
# If no matching dict, append the new dict to the list
config_dict[key].append(sub_val)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 First-key matching is fragile and order-dependent

List items are matched by next(iter(sub_val.keys())), which depends on dict insertion order. If an override YAML has a requirement dict with keys in a different order (e.g., min_value before metric), the lookup key changes, no existing item matches, and the item is appended rather than merged — silently duplicating the requirement in the entry. The README documents the "first key" semantics, but this makes the merge logic sensitive to YAML key ordering in a non-obvious way.

A more robust approach would be to match entries by a known stable key (name for entries, metric for requirements) rather than relying on insertion order.

Comment thread benchmarking/run.py
Comment on lines +106 to +108
else:
# If not a dict, append the new value to the list
config_dict[key].append(sub_val)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Non-dict list items are always appended, never deduplicated

When a list contains non-dict values (e.g., Slack user IDs in ping_on_failure), the code unconditionally appends the override value even if it already exists in the base list. If an override file repeats a ping_on_failure ID that is already in the base config, it will be added twice, causing duplicate Slack notifications on failure. The README's merge-behavior table only describes the dict-matching case and doesn't cover this edge.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants