Add BLADE analysis skill #1591

Open
jmabry wants to merge 13 commits into main from add-blade-analysis-skill-signed

@jmabry jmabry commented Jun 13, 2026

Why

Issue #1585 asks for a public BLADE analysis skill so agents can turn NeMo Gym rollout artifacts into evidence-backed benchmark reports and improvement recommendations.

This PR adds that skill without making benchmark-specific examples part of the default agent context.

What Changed

  • Add nemo-gym-blade-analysis for both Claude and Codex agents.
  • Document the BLADE-ready package shape:
    • D1: analysis skill
    • D2: rollout data
    • D3: golden report package with metrics sidecars and anchor facts for the universal blade-judge
  • Add a generic BLADE benchmark build guide covering artifact inventory, pass@k analysis, task bucketing, workflow funnels, root-cause taxonomy, evidence rules, anchor facts, shallow baselines, and calibration expectations.
  • Add a public blade_toolkit.py helper for package-shape validation, draft anchor-fact extraction, shallow baseline generation, scoring, and local calibration checks when external BLADE tooling is not available.
  • Add an opt-in original CVDP reference. It points to the existing public CVDP example rollout files already in resources_servers/cvdp/data/ and includes only the Nemotron 3 Super golden report, metrics, and anchor facts as optional reference material.
  • Keep CVDP artifacts clearly labeled and non-default: agents should load them only when the user asks for a concrete CVDP example or needs help understanding a completed report shape.
  • Add unit coverage for toolkit behavior, including shallow-baseline leakage checks and Claude/Codex toolkit copy synchronization.
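
For readers unfamiliar with the pass@k analysis named in the build guide, here is a minimal sketch of the commonly used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), for n rollouts with c passes per task. The function names and the `results` layout are hypothetical and are not taken from blade_toolkit.py; the toolkit may compute this differently.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one task: 1 - C(n - c, k) / C(n, k).

    n: rollouts sampled for the task, c: rollouts that passed, k: budget.
    Illustrative helper only; not part of the PR's toolkit.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so every size-k sample contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: dict[str, list[bool]], k: int) -> float:
    """Average pass@k over tasks; `results` maps a task id to rollout outcomes."""
    if not results:
        raise ValueError("results must contain at least one task")
    scores = [pass_at_k(len(r), sum(r), k) for r in results.values()]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Hypothetical rollout outcomes, four rollouts per task.
    demo = {
        "task_a": [True, False, False, False],
        "task_b": [False, False, False, False],
    }
    print(mean_pass_at_k(demo, k=1))
    print(mean_pass_at_k(demo, k=2))
```

Averaging per-task estimates, rather than pooling all rollouts, keeps tasks with many rollouts from dominating the benchmark-level number.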

CI And Tooling

The repo lint job previously depended on cloning hook repositories during pre-commit setup. This PR replaces those hook-repo clones with uv-managed local hooks, keeps Ruff pinned to the existing hook version, and updates the lint workflow to run pre-commit through uv run --no-project.

Validation

  • git diff --check
  • uv run --no-project --with pre-commit==3.6.0 pre-commit run --all-files --show-diff-on-failure --color=always
  • uv run --extra dev pytest tests/unit_tests/test_blade_toolkit.py -q
  • uv run --extra dev ng_dev_test — 239 passed, coverage 96.15%
  • Parsed bundled JSON artifacts with uv run python -m json.tool.
  • Ran blade_toolkit.py make-shallow and calibrate on the bundled CVDP golden artifacts; the golden report scored high against itself, and the shallow baseline stayed below the target threshold.
  • GitHub Actions are passing for DCO, lint, copyright, secrets, unit tests, and all server shards.
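
The calibration expectation in the validation list (golden report scores high, shallow baseline stays below a threshold) can be sketched as follows. This is an assumption-laden illustration: the scoring rule (case-insensitive substring recall of anchor facts), the function names, and both thresholds are hypothetical stand-ins, not the actual logic of blade_toolkit.py or blade-judge.

```python
def anchor_fact_recall(report_text: str, anchor_facts: list[str]) -> float:
    """Fraction of anchor facts that appear in the report text.

    Stand-in scorer (assumption): case-insensitive substring match.
    """
    if not anchor_facts:
        raise ValueError("anchor_facts must not be empty")
    text = report_text.lower()
    hits = sum(1 for fact in anchor_facts if fact.lower() in text)
    return hits / len(anchor_facts)


def calibration_check(
    golden_report: str,
    shallow_report: str,
    anchor_facts: list[str],
    golden_min: float = 0.9,   # hypothetical threshold
    shallow_max: float = 0.5,  # hypothetical threshold
) -> dict:
    """Pass only if the golden report scores high and the shallow one stays low."""
    golden = anchor_fact_recall(golden_report, anchor_facts)
    shallow = anchor_fact_recall(shallow_report, anchor_facts)
    return {
        "golden_score": golden,
        "shallow_score": shallow,
        "passed": golden >= golden_min and shallow < shallow_max,
    }


if __name__ == "__main__":
    # Made-up example data for illustration only.
    facts = ["pass@1 is 12%", "timeout failures dominate"]
    golden = "Overall pass@1 is 12%. Timeout failures dominate the error budget."
    shallow = "The model solved some tasks and failed others."
    print(calibration_check(golden, shallow, facts))
```

A check of this shape guards against anchor facts that a generic, evidence-free report could satisfy by accident: if the shallow baseline scores near the golden report, the facts are too easy.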

Closes #1585.
Supersedes #1584.

jmabry added 10 commits June 12, 2026 18:01
Signed-off-by: jmabry <jmabry@nvidia.com>
Signed-off-by: jmabry <jmabry@nvidia.com>
Signed-off-by: jmabry <jmabry@nvidia.com>
Signed-off-by: jmabry <jmabry@nvidia.com>
Signed-off-by: jmabry <jmabry@nvidia.com>
Signed-off-by: jmabry <jmabry@nvidia.com>
Signed-off-by: jmabry <jmabry@nvidia.com>
Signed-off-by: jmabry <jmabry@nvidia.com>
Signed-off-by: jmabry <jmabry@nvidia.com>
Signed-off-by: jmabry <jmabry@nvidia.com>

copy-pr-bot Bot commented Jun 13, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@jmabry jmabry mentioned this pull request Jun 13, 2026
jmabry and others added 3 commits June 12, 2026 18:10
Signed-off-by: jmabry <jmabry@nvidia.com>
Signed-off-by: jmabry <jmabry@nvidia.com>
@jmabry jmabry marked this pull request as ready for review June 15, 2026 18:11
@jmabry jmabry requested a review from a team as a code owner June 15, 2026 18:11
@jmabry jmabry enabled auto-merge (squash) June 16, 2026 22:05

Development

Successfully merging this pull request may close these issues.

Add BLADE analysis skill for benchmark report methodology