Add BLADE analysis skill #1591
Open
jmabry wants to merge 13 commits into
Signed-off-by: jmabry <jmabry@nvidia.com>
Why
Issue #1585 asks for a public BLADE analysis skill so agents can turn NeMo Gym rollout artifacts into evidence-backed benchmark reports and improvement recommendations.
This PR adds that skill without making benchmark-specific examples part of the default agent context.
What Changed
- nemo-gym-blade-analysis for both Claude and Codex agents.
- blade-judge blade_toolkit.py helper for package-shape validation, draft anchor-fact extraction, shallow baseline generation, scoring, and local calibration checks when external BLADE tooling is not available.
- resources_servers/cvdp/data/ and includes only the Nemotron 3 Super golden report, metrics, and anchor facts as optional reference material.
CI And Tooling
The repo lint job previously depended on cloning hook repositories during pre-commit setup. This PR replaces those hook-repo clones with uv-managed local hooks, keeps Ruff pinned to the existing hook version, and updates the lint workflow to run pre-commit through uv run --no-project.
Validation
- git diff --check
- uv run --no-project --with pre-commit==3.6.0 pre-commit run --all-files --show-diff-on-failure --color=always
- uv run --extra dev pytest tests/unit_tests/test_blade_toolkit.py -q
- uv run --extra dev ng_dev_test: 239 passed, coverage 96.15%
- uv run python -m json.tool
- blade_toolkit.py make-shallow and calibrate on the bundled CVDP golden artifacts; the golden report scored high against itself and the shallow baseline stayed below the target threshold.
Closes #1585.
Supersedes #1584.
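For reviewers unfamiliar with the calibration check mentioned under Validation, the idea can be sketched roughly as follows. This is an illustrative stand-in, not the code in blade_toolkit.py: the function names, the substring-match scoring rule, and the 0.8 threshold are all assumptions made for the example. It only shows the shape of the check, where a golden report should score at or above a target threshold against its own anchor facts while a shallow baseline stays below it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CalibrationResult:
    golden_score: float
    shallow_score: float
    passed: bool


def score_report(report_text: str, anchor_facts: list[str]) -> float:
    """Hypothetical scorer: fraction of anchor facts found in the report.

    Uses a case-insensitive substring match as a simple stand-in; the
    real toolkit's scoring rule is not described in this PR text.
    """
    if not anchor_facts:
        return 0.0
    text = report_text.lower()
    hits = sum(1 for fact in anchor_facts if fact.lower() in text)
    return hits / len(anchor_facts)


def calibrate(
    golden: str,
    shallow: str,
    anchor_facts: list[str],
    threshold: float = 0.8,  # assumed value, chosen only for illustration
) -> CalibrationResult:
    """Pass only if the golden report clears the threshold and the shallow baseline does not."""
    golden_score = score_report(golden, anchor_facts)
    shallow_score = score_report(shallow, anchor_facts)
    passed = golden_score >= threshold and shallow_score < threshold
    return CalibrationResult(golden_score, shallow_score, passed)


if __name__ == "__main__":
    # Toy data, not taken from the bundled CVDP artifacts.
    facts = ["pass rate dropped", "timeout on task 7"]
    golden = "The pass rate dropped after step 3. We saw a timeout on task 7."
    shallow = "The run finished."
    print(calibrate(golden, shallow, facts))
```

A check of this shape fails in two distinct ways: the scorer may be too strict (golden below threshold) or too lenient (shallow at or above threshold), which is why both conditions are tested together.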