Add CritPt Benchmark #1588
Merged
Signed-off-by: Frankie Siino <fsiino@nvidia.com>
…data Signed-off-by: Frankie Siino <fsiino@nvidia.com>
…prompt_fpath, fix requirements path, add tests Signed-off-by: Frankie Siino <fsiino@nvidia.com>
…for pydantic serialization warnings Signed-off-by: Frankie Siino <fsiino@nvidia.com>
- Log per-verify buffer fill in verify()
- Add /status route for curling

Signed-off-by: Frankie Siino <fsiino@nvidia.com>
This reverts commit 65c860b. Signed-off-by: Frankie Siino <fsiino@nvidia.com>
- Move custom agent + example dataset into resources_server config (matches the convention for custom-agent paired benchmarks)
- Trim benchmarks/critpt/config.yaml; delete unused agent config
- Move example_metrics.json to resources_server data dir; regenerate against the curated example.jsonl
- Simplify benchmarks/critpt/data/.gitignore to *.jsonl

Signed-off-by: Frankie Siino <fsiino@nvidia.com>
… update docs Signed-off-by: Frankie Siino <fsiino@nvidia.com>
- Correct reward in benchmark readme
- Correct log prefix in /status
- Distinguish flat-field vs pre-materialized jsonls
- Fix output_jsonl_fpath collision risk

Signed-off-by: Frankie Siino <fsiino@nvidia.com>
cmunley1
previously approved these changes
Jun 12, 2026
cmunley1
left a comment
Contributor
Would be good to reproduce results with some other models, e.g. Kimi, MiniMax, but LGTM.
…rk-fix-dco Signed-off-by: Frankie Siino <fsiino@nvidia.com>
gwarmstrong
requested changes
Jun 16, 2026
gwarmstrong
left a comment
Contributor
small request for additional fields
    type: example
    jsonl_fpath: resources_servers/critpt/data/example.jsonl
    prompt_config: benchmarks/critpt/prompts/turn1.yaml
    num_repeats: 1
Contributor
can you add the license?
Contributor
also description and value?
…rk-fix-dco Signed-off-by: Frankie Siino <fsiino@nvidia.com>
gwarmstrong
approved these changes
Jun 16, 2026
Fixes DCO issue in #1537
--
Migrates the CritPt benchmark from nemo-skills into nemo-gym. This keeps the workflow gym-shaped (per-rollout `reward` flowing through the standard `/verify` contract).

How it works in gym:

- Architecture - Single phase: agent runs per-rollout `/run` -> `/verify`.
- Batching - Resources server buffers concurrent `verify()` calls until 70 unique `problem_id`s accumulate, then fires once and resolves all 70 awaiting futures with the same aggregate accuracy.
- `num_repeats > 1` - Each repeat of a `problem_id` opens a new pending batch. N repeats -> N independent AA API calls.
- Reward - Same as skills in that the aggregate is distributed to all batch members as their `reward`.
- Headline metric - `pass@k/accuracy` via `compute_metrics` override. Numerically equal to nemo-skills' `accuracy` at `num_repeats=1`.

Test results:
With ultra mopd/step36 checkpoint:
wandb: https://wandb.ai/nvidia/fsiino-gym-dev/runs/lle631hq
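
For readers unfamiliar with the buffering pattern in the "Batching" and `num_repeats > 1` notes above, here is a minimal sketch of how concurrent `verify()` calls could be held on futures and resolved together. It is not the code from this PR: `BatchedVerifier`, `score_batch`, and the demo values are hypothetical names chosen for illustration, the batch size is 3 rather than 70 to keep the demo small, and the rule "a repeat goes into the first pending batch that does not yet hold its `problem_id`" is an assumed reading of the description.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Tuple

# Hypothetical scorer type: takes {problem_id: answer}, returns aggregate accuracy.
ScoreFn = Callable[[Dict[str, str]], Awaitable[float]]


class BatchedVerifier:
    """Illustrative sketch only; not the actual nemo-gym resources server."""

    def __init__(self, batch_size: int, score_batch: ScoreFn) -> None:
        self.batch_size = batch_size
        self.score_batch = score_batch
        # Each pending batch maps problem_id -> (answer, future awaited by the caller).
        self.pending: List[Dict[str, Tuple[str, asyncio.Future]]] = []

    async def verify(self, problem_id: str, answer: str) -> float:
        future: asyncio.Future = asyncio.get_running_loop().create_future()
        # Assumption: a repeated problem_id cannot join a batch that already
        # holds it, so it opens (or joins) a later pending batch.
        batch = next((b for b in self.pending if problem_id not in b), None)
        if batch is None:
            batch = {}
            self.pending.append(batch)
        batch[problem_id] = (answer, future)

        if len(batch) == self.batch_size:
            # The call that completes the batch fires the single scoring request.
            self.pending = [b for b in self.pending if b is not batch]
            try:
                accuracy = await self.score_batch(
                    {pid: ans for pid, (ans, _) in batch.items()}
                )
            except Exception as exc:
                # Propagate the failure to every waiter instead of hanging them.
                for _, fut in batch.values():
                    fut.set_exception(exc)
            else:
                # Every member of the batch receives the same aggregate value.
                for _, fut in batch.values():
                    fut.set_result(accuracy)
        return await future


async def demo() -> Tuple[List[float], int]:
    calls: List[List[str]] = []

    async def fake_score(answers: Dict[str, str]) -> float:
        # Stand-in for the external scoring call; counts answers equal to "ok".
        calls.append(sorted(answers))
        return sum(a == "ok" for a in answers.values()) / len(answers)

    verifier = BatchedVerifier(batch_size=3, score_batch=fake_score)
    repeats = (["ok", "ok", "bad"], ["ok", "bad", "bad"])
    jobs = [
        verifier.verify(pid, ans)
        for rep in repeats
        for pid, ans in zip(["p1", "p2", "p3"], rep)
    ]
    rewards = await asyncio.gather(*jobs)
    return list(rewards), len(calls)


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In the demo, two repeats of three problems produce two scoring calls, and each rollout's reward is the aggregate accuracy of the batch it landed in rather than a per-problem score.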