Add SciCode Benchmark by fsiino-nvidia · Pull Request #1592 · NVIDIA-NeMo/Gym

fsiino-nvidia · 2026-06-13T01:46:11Z

Migrates the SciCode benchmark from nemo-skills into nemo-gym.

How it works in gym:

Architecture - Multi-step: the agent makes one model call per sub-step, carrying its own prior-step code forward as context, then submits the accumulated solutions to /verify. Each problem takes N model calls, where N varies by problem.
Execution - The resources server runs each sub-step's accumulated code in a Ray subprocess and checks it against the ground-truth targets in test_data.h5.
Test data - test_data.h5 must be staged manually from the official SciCode Google Drive.
Reward - Binary per problem (1.0 iff every sub-step passes), the same as nemo-skills' problem-level accuracy.
Headline metric - subtask_accuracy (total sub-steps passed / total sub-steps), computed via a compute_metrics/get_key_metrics override on the agent. Numerically equal to nemo-skills' pass@1[avg-of-3]/subtask_accuracy.
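
The flow described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `run_problem`, `problem_reward`, `subtask_accuracy`, the `step_N` keys, and the prompt layout are all hypothetical names chosen for the example, and the model call is passed in as a plain callable.

```python
from typing import Callable, Dict, List


def run_problem(
    sub_steps: List[str],
    call_model: Callable[[str], str],
) -> Dict[str, str]:
    """Make one model call per sub-step, feeding prior-step code back as context."""
    solutions: Dict[str, str] = {}
    prior_code: List[str] = []
    for index, description in enumerate(sub_steps, start=1):
        # Hypothetical prompt layout: earlier solutions first, then the new step.
        context = "\n\n".join(prior_code)
        prompt = f"Previous code:\n{context}\n\nNext step:\n{description}"
        code = call_model(prompt)
        solutions[f"step_{index}"] = code  # hypothetical key format
        prior_code.append(code)
    return solutions


def problem_reward(step_passed: List[bool]) -> float:
    """Binary reward: 1.0 only if every sub-step passed.

    Returning 0.0 for an empty list is an assumption made for this sketch.
    """
    return 1.0 if step_passed and all(step_passed) else 0.0


def subtask_accuracy(per_rollout_steps: List[List[bool]]) -> float:
    """Total sub-steps passed divided by total sub-steps, pooled over rollouts."""
    passed = sum(sum(steps) for steps in per_rollout_steps)
    total = sum(len(steps) for steps in per_rollout_steps)
    return passed / total if total else 0.0
```

Under these assumptions, a problem with three sub-steps triggers three calls to `call_model`, and a rollout that passes two of three sub-steps contributes 0.0 to the reward but 2/3 of its sub-steps to `subtask_accuracy`.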

Test results:
With ultra mopd/step36 checkpoint:

  Key metrics for scicode_benchmark_agent:
  {
      "mean/reward": 0.19583333333333333,
      "mean/num_steps_passed": 1.8875,
      "mean/num_steps_total": 4.225,
      "mean/problem_accuracy": 0.19583333333333333,
      "mean/input_tokens": 3554.2,
      "mean/output_tokens": 5608.7125,
      "mean/total_tokens": 9162.9125,
      "subtask_accuracy": 0.4467455621301775
  }

wandb: https://wandb.ai/nvidia/fsiino-gym-dev/runs/3jom33yh
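
As a quick consistency check on the numbers above: because mean/num_steps_passed and mean/num_steps_total are averaged over the same set of rollouts, their ratio should equal the pooled subtask_accuracy. The snippet below only redoes that arithmetic on the reported values; the variable names are illustrative.

```python
import math

# Values copied from the key metrics reported above.
mean_steps_passed = 1.8875
mean_steps_total = 4.225
reported_subtask_accuracy = 0.4467455621301775
mean_reward = 0.19583333333333333
mean_problem_accuracy = 0.19583333333333333

# Ratio of means equals (sum passed) / (sum total) when both means share one denominator.
ratio = mean_steps_passed / mean_steps_total
ratio_matches = math.isclose(ratio, reported_subtask_accuracy, rel_tol=1e-9)

# The binary reward is defined as problem-level accuracy, so the two means coincide.
reward_matches = mean_reward == mean_problem_accuracy

print(ratio_matches, reward_matches)
```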

Signed-off-by: Frankie Siino <fsiino@nvidia.com>

- agent: run loop + step_utils (prompt context, code extraction, prefilled-steps data, context-window detection) + tests
- prompts: default.yaml + background.yaml (agent-loaded)
- scicode verify(): skips sub-steps absent from solutions

Signed-off-by: Frankie Siino <fsiino@nvidia.com>
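
For readers unfamiliar with the "code extraction" step mentioned in this commit, one common approach is to pull the last fenced code block out of the model's response. The sketch below is an assumption about how such a helper might look, not the step_utils implementation from this PR; `extract_code` and its fallback behaviour are hypothetical.

```python
import re

# Build the fence marker programmatically so this example can itself sit inside a fence.
FENCE = "`" * 3
CODE_BLOCK = re.compile(FENCE + r"(?:python)?[ \t]*\n(.*?)" + FENCE, re.DOTALL)


def extract_code(response: str) -> str:
    """Return the last fenced code block in a response.

    Falling back to the whole stripped response when no fence is found
    is an assumption made for this sketch.
    """
    blocks = CODE_BLOCK.findall(response)
    return blocks[-1].strip() if blocks else response.strip()


example = "Here is the solution:\n" + FENCE + "python\nx = 1\n" + FENCE + "\nDone."
print(extract_code(example))
```

Taking the last block rather than the first is a design choice: models often restate or revise code later in a response, so the final block is more likely to be the intended answer.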

- SciCodeVerifyResponse carries problem_id into rollout output so rows are identifiable
- agent config: example dataset
- .gitignore adjustments
- Add resources server example with metrics
- Add benchmark metrics

Signed-off-by: Frankie Siino <fsiino@nvidia.com>

copy-pr-bot · 2026-06-13T01:46:14Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

- resources_servers/scicode: replace scaffold
- benchmarks/scicode: fix stale bits (metric, example path), add real-world benchmark run steps
- scicode_agent: add prompt_fpath, note subtask_accuracy hook, fix prompt-default wording

Signed-off-by: Frankie Siino <fsiino@nvidia.com>

gwarmstrong

small request to add dataset license

gwarmstrong · 2026-06-16T18:38:20Z

+        prepare_script: benchmarks/scicode/prepare.py
+        # null: the agent builds per-sub-step prompts (prompt_fpath); no row-level materialization.
+        prompt_config: null
+        num_repeats: 3


Can you add the dataset's license here?

Added in 096101a

fsiino-nvidia added 11 commits June 8, 2026 16:38

Init - scaffold agent and resources server

e1c918b

Signed-off-by: Frankie Siino <fsiino@nvidia.com>

Write data preparation, declare deps, add example rollouts, add bench…

eacbe3f

…mark readme Signed-off-by: Frankie Siino <fsiino@nvidia.com>

Implement scicode resources server

568ca52

- runner/rayexecutor + hdf5 target/compare helpers
- app.py
- config, schemas, verify()
- fix configs
- add requirements

Signed-off-by: Frankie Siino <fsiino@nvidia.com>

Add tests for resource server

3adde5a

Signed-off-by: Frankie Siino <fsiino@nvidia.com>

Wire scicode benchmark config, fix e2e smoke test

e59fd36

Signed-off-by: Frankie Siino <fsiino@nvidia.com>

Update num_repeats, add .h5 file staging instructions for cluster

d57df26

Signed-off-by: Frankie Siino <fsiino@nvidia.com>

Wire subtask_accuracy metric to resource server

c546e94

Signed-off-by: Frankie Siino <fsiino@nvidia.com>

Consolidate scicode agent config into resources config

92fae73

- move agent instance + example dataset into resources server config
- regenerate example_rollouts with example split

Signed-off-by: Frankie Siino <fsiino@nvidia.com>

Move subtask_accuracy metric hooks to agent

92de11e

Signed-off-by: Frankie Siino <fsiino@nvidia.com>

fsiino-nvidia marked this pull request as ready for review June 15, 2026 23:45

copy-pr-bot Bot temporarily deployed to public June 15, 2026 23:45 Inactive

copy-pr-bot Bot temporarily deployed to public June 15, 2026 23:46 Inactive

copy-pr-bot Bot temporarily deployed to public June 15, 2026 23:47 Inactive

copy-pr-bot Bot temporarily deployed to public June 16, 2026 00:38 Inactive

copy-pr-bot Bot temporarily deployed to public June 16, 2026 00:39 Inactive

copy-pr-bot Bot temporarily deployed to public June 16, 2026 00:40 Inactive

fsiino-nvidia requested a review from gwarmstrong June 16, 2026 00:47

gwarmstrong requested changes Jun 16, 2026

fsiino-nvidia added 2 commits June 16, 2026 13:54

Add license

096101a

Signed-off-by: Frankie Siino <fsiino@nvidia.com>

Merge remote-tracking branch 'github/main' into fsiino/scicode-benchmark

b7f297c

Signed-off-by: Frankie Siino <fsiino@nvidia.com>

copy-pr-bot Bot temporarily deployed to public June 16, 2026 20:57 Inactive

copy-pr-bot Bot temporarily deployed to public June 16, 2026 20:58 Inactive

gwarmstrong approved these changes Jun 16, 2026

gwarmstrong merged commit eddd316 into main Jun 16, 2026
24 checks passed

gwarmstrong deleted the fsiino/scicode-benchmark branch June 16, 2026 21:11

Add SciCode Benchmark #1592: gwarmstrong merged 14 commits into main from fsiino/scicode-benchmark
