Add CritPt Benchmark #1588

Merged: gwarmstrong merged 31 commits into main from fsiino/critpt-benchmark-fix-dco on Jun 16, 2026
@fsiino-nvidia fsiino-nvidia commented Jun 12, 2026


Fixes DCO issue in #1537

--

Migrates the CritPt benchmark from nemo-skills into nemo-gym. This keeps the workflow gym-shaped (per-rollout reward flowing through the standard /verify contract).

How it works in gym:

  • Architecture - Single phase: agent runs per-rollout /run -> /verify.

  • Batching - The resources server buffers concurrent verify() calls until 70 unique problem_ids accumulate, then fires once and resolves all 70 awaiting futures with the same aggregate accuracy.

  • num_repeats > 1 - Each repeat of a problem_id opens a new pending batch. N repeats -> N independent AA API calls.

  • Reward - As in nemo-skills, the batch's aggregate accuracy is distributed to every batch member as its reward.

  • Headline metric - pass@k/accuracy via compute_metrics override. Numerically equal to nemo-skills' accuracy at num_repeats=1.
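The batching behaviour described in the list above can be sketched roughly as follows. This is a minimal illustration and not the code from this PR: `BatchingVerifier`, `PendingBatch`, `score_batch`, and `fake_score` are hypothetical names, the batch size is shrunk from 70 to 4, and the external scoring call is replaced by a local stub.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class PendingBatch:
    # problem_id -> (submitted answer, future awaiting the aggregate reward)
    entries: dict = field(default_factory=dict)


class BatchingVerifier:
    """Hypothetical stand-in for the resources server's verify() buffering."""

    def __init__(self, batch_size, score_batch):
        self.batch_size = batch_size
        self.score_batch = score_batch  # async callable: {problem_id: answer} -> accuracy
        self.batches = []
        self.calls = 0  # number of scoring calls fired so far

    async def verify(self, problem_id, answer):
        # Join the first open batch that does not yet hold this problem_id,
        # so a repeat of a problem_id lands in a different batch.
        batch = next((b for b in self.batches if problem_id not in b.entries), None)
        if batch is None:
            batch = PendingBatch()
            self.batches.append(batch)
        future = asyncio.get_running_loop().create_future()
        batch.entries[problem_id] = (answer, future)

        if len(batch.entries) == self.batch_size:
            # Batch is full: close it, score once, resolve every waiter.
            self.batches.remove(batch)
            self.calls += 1
            answers = {pid: ans for pid, (ans, _) in batch.entries.items()}
            try:
                accuracy = await self.score_batch(answers)
            except Exception as exc:
                for _, fut in batch.entries.values():
                    fut.set_exception(exc)
            else:
                for _, fut in batch.entries.values():
                    fut.set_result(accuracy)
        return await future


async def fake_score(answers):
    # Stub for the external scorer: fraction of answers equal to "ok".
    return sum(a == "ok" for a in answers.values()) / len(answers)


async def demo():
    verifier = BatchingVerifier(batch_size=4, score_batch=fake_score)
    # 4 problems x 2 repeats -> two batches, hence two scoring calls.
    jobs = [
        verifier.verify(pid, "ok" if pid == 0 else "no")
        for _ in range(2)
        for pid in range(4)
    ]
    rewards = await asyncio.gather(*jobs)
    return rewards, verifier.calls


rewards, calls = asyncio.run(demo())
print(rewards, calls)
```

In this sketch every rollout in a batch receives the same aggregate accuracy as its reward, and each repeat of the problem set triggers its own scoring call, which mirrors the reward and num_repeats bullets above.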

Test results:
With ultra mopd/step36 checkpoint:

Key metrics for critpt_benchmark_agent:
{
    "mean/reward": 0.05714285714285713,
    "mean/accuracy": 0.05714285714285713,
    "mean/timeout_rate": 0.0,
    "mean/input_tokens": 6107.028571428571,
    "mean/output_tokens": 2514.1857142857143,
    "mean/total_tokens": 8621.214285714286
}
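As a quick sanity check on these numbers, the sketch below assumes the run covered a single batch of 70 problems at num_repeats=1 (the batch size comes from the description; the run size itself is not stated in this PR). Under that assumption the mean reward is consistent with 4 of 70 problems scored correct, and the token means add up as expected.

```python
# Values copied from the reported metrics above.
metrics = {
    "mean/reward": 0.05714285714285713,
    "mean/input_tokens": 6107.028571428571,
    "mean/output_tokens": 2514.1857142857143,
    "mean/total_tokens": 8621.214285714286,
}

BATCH_SIZE = 70  # assumption: one batch of 70 unique problem_ids, num_repeats=1

# Mean reward times the number of problems gives the implied solved count.
solved = metrics["mean/reward"] * BATCH_SIZE

# Total tokens should equal input tokens plus output tokens.
total = metrics["mean/input_tokens"] + metrics["mean/output_tokens"]

print(round(solved, 6))
print(round(total, 6))
```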

wandb: https://wandb.ai/nvidia/fsiino-gym-dev/runs/lle631hq

Signed-off-by: Frankie Siino <fsiino@nvidia.com>
…data

Signed-off-by: Frankie Siino <fsiino@nvidia.com>
Signed-off-by: Frankie Siino <fsiino@nvidia.com>
…prompt_fpath, fix requirements path, add tests

Signed-off-by: Frankie Siino <fsiino@nvidia.com>
…for pydantic serialization warnings

Signed-off-by: Frankie Siino <fsiino@nvidia.com>
- Log per-verify buffer fill in verify()

- Add /status route for curling

Signed-off-by: Frankie Siino <fsiino@nvidia.com>
This reverts commit 65c860b.

Signed-off-by: Frankie Siino <fsiino@nvidia.com>
- Move custom agent + example dataset into resources_server config (matches the convention for custom-agent paired benchmarks)
- Trim benchmarks/critpt/config.yaml; delete unused agent config
- Move example_metrics.json to resources_server data dir; regenerate against the curated example.jsonl
- Simplify benchmarks/critpt/data/.gitignore to *.jsonl

Signed-off-by: Frankie Siino <fsiino@nvidia.com>
Signed-off-by: Frankie Siino <fsiino@nvidia.com>
… update docs

Signed-off-by: Frankie Siino <fsiino@nvidia.com>
- Correct reward in benchmark readme

- Correct log prefix in /status

- Distinguish flat-field vs pre-materialized jsonls

- Fix output_jsonl_fpath collision risk

Signed-off-by: Frankie Siino <fsiino@nvidia.com>
@fsiino-nvidia fsiino-nvidia changed the title from "Fsiino/critpt benchmark fix dco" to "Add CritPt Benchmark" on Jun 12, 2026
cmunley1 previously approved these changes Jun 12, 2026

@cmunley1 cmunley1 left a comment


would be good to reproduce some other models eg kimi, minimax, but lgtm

@cmunley1 cmunley1 requested a review from jiacheng-xu June 12, 2026 21:14
…rk-fix-dco

Signed-off-by: Frankie Siino <fsiino@nvidia.com>

@gwarmstrong gwarmstrong left a comment


small request for additional fields

type: example
jsonl_fpath: resources_servers/critpt/data/example.jsonl
prompt_config: benchmarks/critpt/prompts/turn1.yaml
num_repeats: 1


can you add the license?


also description and value?

Contributor Author

Added all 3 in 719cd07. Thanks!

Signed-off-by: Frankie Siino <fsiino@nvidia.com>
…rk-fix-dco

Signed-off-by: Frankie Siino <fsiino@nvidia.com>
@gwarmstrong gwarmstrong merged commit 74ebce9 into main Jun 16, 2026
16 checks passed
@gwarmstrong gwarmstrong deleted the fsiino/critpt-benchmark-fix-dco branch June 16, 2026 21:25

3 participants