Add SciCode Benchmark #1592

Merged
gwarmstrong merged 14 commits into main from fsiino/scicode-benchmark
Jun 16, 2026

@fsiino-nvidia fsiino-nvidia commented Jun 13, 2026

Migrates the SciCode benchmark from nemo-skills into nemo-gym.

How it works in gym:

  • Architecture - Multi-step: the agent makes one model call per sub-step, carrying its own code from prior steps forward as context, then submits the accumulated solutions to /verify. Each problem takes N model calls, where N varies by problem.

  • Execution - The resources server runs each sub-step's accumulated code in a Ray subprocess and checks it against the ground-truth targets in test_data.h5.

  • Test data - test_data.h5 must be staged manually from the official SciCode Google Drive.

  • Reward - Binary per problem (1.0 if and only if every sub-step passes), the same as nemo-skills' problem-level accuracy.

  • Headline metric - subtask_accuracy (total sub-steps passed / total sub-steps), computed via a compute_metrics/get_key_metrics override on the agent. Numerically equal to nemo-skills' pass@1[avg-of-3]/subtask_accuracy.
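
For readers unfamiliar with the multi-step flow, the loop described above can be sketched roughly as follows. This is an illustration only, not the code in this PR: `extract_code`, `run_problem`, the `step_id`/`description` keys, and the prompt layout are all hypothetical names chosen for the example, and the real agent also handles prefilled steps and context-window limits, which are omitted here.

```python
import re
from typing import Callable, Dict, List

# Hypothetical helper: matches a fenced code block in a model reply.
# The fence marker is built with "`" * 3 so this snippet can itself sit in a fence.
_FENCE = "`" * 3
_CODE_BLOCK = re.compile(_FENCE + r"(?:python)?\n(.*?)" + _FENCE, re.DOTALL)


def extract_code(reply: str) -> str:
    """Return the last fenced code block in the reply, or the whole reply if none."""
    blocks = _CODE_BLOCK.findall(reply)
    return blocks[-1].strip() if blocks else reply.strip()


def run_problem(
    sub_steps: List[Dict[str, str]],
    call_model: Callable[[str], str],
) -> Dict[str, str]:
    """One model call per sub-step; earlier generated code is fed back as context."""
    solutions: Dict[str, str] = {}
    prior_code: List[str] = []
    for step in sub_steps:
        # Assumed prompt layout: prior code first, then the new sub-step description.
        prompt = "\n\n".join(
            ["# Code written for earlier sub-steps:", *prior_code,
             "# Next sub-step:", step["description"]]
        )
        code = extract_code(call_model(prompt))
        solutions[step["step_id"]] = code
        prior_code.append(code)
    # In the real flow, the accumulated solutions would then be posted to /verify.
    return solutions
```

With a stub `call_model`, a problem with two sub-steps produces exactly two model calls, and the second prompt contains the code generated for the first.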

Test results:
With ultra mopd/step36 checkpoint:

  Key metrics for scicode_benchmark_agent:
  {
      "mean/reward": 0.19583333333333333,
      "mean/num_steps_passed": 1.8875,
      "mean/num_steps_total": 4.225,
      "mean/problem_accuracy": 0.19583333333333333,
      "mean/input_tokens": 3554.2,
      "mean/output_tokens": 5608.7125,
      "mean/total_tokens": 9162.9125,
      "subtask_accuracy": 0.4467455621301775
  }
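
As a sanity check on the numbers above, the sketch below shows how a binary per-problem reward and a pooled subtask_accuracy relate. The function names and row layout are hypothetical and not taken from the PR; in particular, returning a reward of 0.0 for a rollout with no sub-steps is an assumption. Because every rollout contributes one row, the ratio of the two reported means (`mean/num_steps_passed` over `mean/num_steps_total`) should reproduce the reported `subtask_accuracy`, which the last lines verify.

```python
from typing import Dict, List


def score_rollout(step_passed: List[bool]) -> Dict[str, float]:
    """Binary reward: 1.0 only if every sub-step passed (assumed 0.0 when empty)."""
    passed = sum(step_passed)
    total = len(step_passed)
    reward = 1.0 if total > 0 and passed == total else 0.0
    return {"reward": reward, "num_steps_passed": passed, "num_steps_total": total}


def subtask_accuracy(rows: List[Dict[str, float]]) -> float:
    """Pooled accuracy: total sub-steps passed divided by total sub-steps."""
    total = sum(r["num_steps_total"] for r in rows)
    return sum(r["num_steps_passed"] for r in rows) / total if total else 0.0


# Toy rollouts: only the first one passes every sub-step.
rows = [
    score_rollout([True, True, True]),
    score_rollout([True, False]),
    score_rollout([False, False, False, True]),
]
mean_reward = sum(r["reward"] for r in rows) / len(rows)  # 1 of 3 problems solved
pooled = subtask_accuracy(rows)                           # 5 of 9 sub-steps passed

# Consistency check against the metrics reported in this PR description.
reported_ratio = 1.8875 / 4.225
assert abs(reported_ratio - 0.4467455621301775) < 1e-9
```

The pooled metric weights problems by their number of sub-steps, so it is generally not the same as averaging each problem's own pass fraction.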

wandb: https://wandb.ai/nvidia/fsiino-gym-dev/runs/3jom33yh

Signed-off-by: Frankie Siino <fsiino@nvidia.com>
…mark readme

Signed-off-by: Frankie Siino <fsiino@nvidia.com>
- runner/rayexecutor + hdf5 target/compare helpers

- app.py - config, schemas, verify()

- fix configs

- add requirements

Signed-off-by: Frankie Siino <fsiino@nvidia.com>
Signed-off-by: Frankie Siino <fsiino@nvidia.com>
- agent: run loop + step_utils (prompt context, code extraction, prefilled-steps data, context-window detection) + tests
- prompts: default.yaml + background.yaml (agent-loaded)
- scicode verify(): skips sub-steps absent from solutions

Signed-off-by: Frankie Siino <fsiino@nvidia.com>
Signed-off-by: Frankie Siino <fsiino@nvidia.com>
Signed-off-by: Frankie Siino <fsiino@nvidia.com>
Signed-off-by: Frankie Siino <fsiino@nvidia.com>
- SciCodeVerifyResponse carries problem_id into rollout output so rows are identifiable
- agent config: example dataset
- .gitignore adjustments
- Add resources server example with metrics
- Add benchmark metrics

Signed-off-by: Frankie Siino <fsiino@nvidia.com>
- move agent instance + example dataset into resources server config
- regenerate example_rollouts with example split

Signed-off-by: Frankie Siino <fsiino@nvidia.com>
Signed-off-by: Frankie Siino <fsiino@nvidia.com>

copy-pr-bot Bot commented Jun 13, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

- resources_servers/scicode: replace scaffold
- benchmarks/scicode: fix stale bits (metric, example path), add real-world benchmark run steps
- scicode_agent: add prompt_fpath, note subtask_accuracy hook, fix prompt-default wording

Signed-off-by: Frankie Siino <fsiino@nvidia.com>

@gwarmstrong gwarmstrong left a comment

small request to add dataset license

prepare_script: benchmarks/scicode/prepare.py
# null: the agent builds per-sub-step prompts (prompt_fpath); no row-level materialization.
prompt_config: null
num_repeats: 3

Can you add the dataset's license here?

Contributor Author

Added in 096101a

Signed-off-by: Frankie Siino <fsiino@nvidia.com>
Signed-off-by: Frankie Siino <fsiino@nvidia.com>
@gwarmstrong gwarmstrong merged commit eddd316 into main Jun 16, 2026
24 checks passed
@gwarmstrong gwarmstrong deleted the fsiino/scicode-benchmark branch June 16, 2026 21:11