[rl] swe_r2e: pluggable coding-agent (Claude Code) harness on Daytona by yichuan-w · Pull Request #3734 · pytorch/torchtitan

yichuan-w · 2026-06-22T05:17:56Z

What

A TorchTitan RL example (swe_r2e) that post-trains a Qwen model on R2E-Gym SWE
tasks where the rollout is driven by an unmodified agentic CLI harness (Claude
Code) running inside a Daytona cloud sandbox. An on-box Anthropic-Messages
adapter serves the trained policy to the agent and captures every model turn as
on-policy training tokens; R2E hidden tests grade the agent's patch for the reward.

This is the TorchTitan analogue of THUDM/slime's coding_agent_rl and Meta
msl/rl's "virtual actor + reverse-proxy" pattern.

How it works

RLTrainer (controller)
  SWER2ERollouter.run_group_rollouts(generate_fn, sample, group_size=K)
    AnthropicAdapter  <- one HTTP endpoint backed by the controller's generate_fn
    per sibling:
      boot Daytona sandbox (R2E image) + install Claude Code (self-contained CDN binary)
      claude -p  ANTHROPIC_BASE_URL -> adapter (via the Daytona fs-relay bridge)
         adapter: render_ids / bridge_to_next_turn (TITO) -> generate_fn -> Completion
                  records (prompt_ids, completion_ids, logprobs) per turn
      git diff -> evaluate_r2e (fresh sandbox: apply diff, run hidden tests) -> reward
  rubric -> GRPO advantage -> rollout_to_episodes -> Batcher -> backward

The adapter reuses prior turns' exact sampled tokens via the renderer's
bridge_to_next_turn (Token-In-Token-Out), so each turn's prompt is an exact
extension of prev_prompt + prev_completion. A whole multi-turn trajectory
therefore packs into one training episode (assistant tokens trained, prompt /
tool-result tokens masked). A Claude Code auto-compaction breaks the prefix and
opens a new episode branch, exactly as rollout_to_episodes expects.
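The prefix-extension rule above can be sketched as follows. This is an illustration only, not the PR's rollout_to_episodes: the Turn dataclass and pack_turns function are hypothetical names, and the real code also carries logprobs and other metadata.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    """One captured model turn (hypothetical container for this sketch)."""
    prompt_ids: list
    completion_ids: list


def pack_turns(turns):
    """Pack turns into episodes of (token_ids, loss_mask).

    A turn joins the current episode only if its prompt starts with the
    episode's tokens so far; otherwise a new episode is opened.
    """
    episodes = []
    for turn in turns:
        if episodes:
            ids, mask = episodes[-1]
            if turn.prompt_ids[: len(ids)] == ids:
                # Newly appended prompt tokens (e.g. tool results) are masked out.
                extra = turn.prompt_ids[len(ids):]
                ids.extend(extra)
                mask.extend([0] * len(extra))
                # Sampled assistant tokens are the only trained positions.
                ids.extend(turn.completion_ids)
                mask.extend([1] * len(turn.completion_ids))
                continue
        # First turn, or the prefix was broken (e.g. by a compaction).
        ids = list(turn.prompt_ids) + list(turn.completion_ids)
        mask = [0] * len(turn.prompt_ids) + [1] * len(turn.completion_ids)
        episodes.append((ids, mask))
    return episodes
```

With turns whose prompts are [1, 2], then [1, 2, 3, 4] (after a completion of [3]), the second turn extends the first and both land in one episode; a third turn with an unrelated prompt starts a new one.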

Because an inbound-firewalled trainer box cannot accept a dial-back from the public
Daytona cloud (Daytona refuses ssh -R), bridge.py relays the agent's HTTP over
Daytona's fs API: the in-sandbox proxy writes a request file, the host polls and
replays it to the adapter, then uploads a response file. One-shot delivery is
equivalent to a dial-back since token capture happens host-side regardless.
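A minimal sketch of the request-file / response-file relay described above, assuming a local directory stands in for Daytona's fs API. The file naming scheme and the write_request, host_poll_once and read_response helpers are hypothetical and are not the contents of bridge.py.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Optional


def write_request(relay_dir: Path, req_id: str, body: dict) -> None:
    """Sandbox side: drop a request file for the host to pick up."""
    # Write under a temp name, then rename, so the poller never sees a partial file.
    tmp = relay_dir / f"{req_id}.req.tmp"
    tmp.write_text(json.dumps(body))
    tmp.rename(relay_dir / f"{req_id}.req.json")


def host_poll_once(relay_dir: Path, handler: Callable[[dict], dict]) -> int:
    """Host side: replay each pending request through `handler`, write a response."""
    served = 0
    for req_path in sorted(relay_dir.glob("*.req.json")):
        req_id = req_path.name[: -len(".req.json")]
        response = handler(json.loads(req_path.read_text()))
        tmp = relay_dir / f"{req_id}.resp.tmp"
        tmp.write_text(json.dumps(response))
        tmp.rename(relay_dir / f"{req_id}.resp.json")
        req_path.unlink()  # one-shot delivery: each request is consumed once
        served += 1
    return served


def read_response(relay_dir: Path, req_id: str) -> Optional[dict]:
    """Sandbox side: return the response if the host has written it yet."""
    resp_path = relay_dir / f"{req_id}.resp.json"
    if not resp_path.exists():
        return None
    return json.loads(resp_path.read_text())


relay = Path(tempfile.mkdtemp())
write_request(relay, "turn-1", {"prompt": "hi"})
host_poll_once(relay, lambda body: {"echo": body["prompt"]})  # handler stands in for the adapter
print(read_response(relay, "turn-1"))  # {'echo': 'hi'}
```

Because the handler runs on the host, whatever it records (here nothing; in the PR, the sampled tokens) never has to cross the relay.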

Layout (three orthogonal axes)

harness/sandbox/ -- WHERE code runs: Sandbox contract + make_sandbox, the
Daytona backend, and the Daytona fs-relay bridge.
harness/adapters/ -- HOW the model is served: a token-capturing wire-format
endpoint (anthropic; add openai for Codex/OpenCode -- capture is shared).
harness/agents/ -- WHICH CLI agent + how to launch it (claude_code).
examples/swe_r2e/ -- the R2E task: data, grading, rubric, rollouter,
config_registry (1.7B / 8B / 14B / 32B dense + 30B-A3B MoE), launcher, isolated
smoke harness.

Adding a new CLI agent = a new agents/ runner (+ reuse/extend an adapters/ wire
module); a new sandbox provider = a new sandbox/ backend.

Sandbox cleanup (no zombies, incl. SIGKILL)

Two layers: (1) per-rollout delete on context exit; (2) a cloud-side auto-delete
TTL on every sandbox (TT_DAYTONA_AUTO_STOP_MIN / AUTO_DELETE_MIN) so an orphan
self-reaps even if the process is SIGKILL'd (e.g. preemption) and never runs its
exit path.
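The two layers can be pictured with a small sketch. FakeCloud is a hypothetical in-memory stand-in for a sandbox provider (it is not the Daytona SDK), and the 30-minute default is an arbitrary illustrative value.

```python
from contextlib import contextmanager
from dataclasses import dataclass, field


@dataclass
class FakeCloud:
    """Hypothetical in-memory provider: maps sandbox id -> auto-delete TTL (minutes)."""
    live: dict = field(default_factory=dict)
    next_id: int = 0

    def create(self, auto_delete_min: int) -> int:
        self.next_id += 1
        self.live[self.next_id] = auto_delete_min
        return self.next_id

    def delete(self, sandbox_id: int) -> None:
        self.live.pop(sandbox_id, None)


@contextmanager
def rollout_sandbox(cloud: FakeCloud, auto_delete_min: int = 30):
    # Layer 2: the TTL is attached at creation time, so the provider can reap
    # the sandbox even if this process is killed before `finally` runs.
    sandbox_id = cloud.create(auto_delete_min=auto_delete_min)
    try:
        yield sandbox_id
    finally:
        # Layer 1: the normal path, which also covers exceptions in the rollout.
        cloud.delete(sandbox_id)
```

The point of layer 2 is that it needs no cooperation from the client process: a SIGKILL skips `finally`, but the TTL is already recorded cloud-side.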

Verification

End to end on Daytona: Claude Code rollouts -> R2E grading -> GRPO backward.

Single 8xH100 host: the first training step (Train | Step 1) completes its
backward pass for Qwen3-1.7B (~83s) and Qwen3-8B (~82s,
bit_wise/logprob_diff/max 0.39).
Multi-host (4x H200): Qwen3-32B, 8 prompts x 8 samples, binary R2E reward,
24 training steps. Pass rate trends up across the run -- by thirds
1.66% -> 2.73% -> 3.71% (highest single step 8.6%, 11/128 solved), solving
tasks across scrapy / pillow / numpy / orange3 / datalad. The signal is real but
sparse (binary reward at ~2-3% pass leaves only ~1-2 of 16 GRPO groups per step
with within-group variance) -- grading.py has an opt-in dense per-test-fraction
reward (SWE_REWARD_DENSE) to densify it.

pre-commit clean. 30B-A3B fits one host (FSDP-4 bf16 trainer + TP4/EP4 generator)
but MoE breaks vLLM cudagraph capture, so the generator runs eager -- practical 30B
needs the MoE-cudagraph fix or multi-host; the config is included for that follow-up.

Run

DAYTONA_API_KEY=dtn_... CONFIG=rl_grpo_qwen3_1_7b_swe_r2e \
  PROMPT_DATA=/path/to/r2e.jsonl HF_ASSETS_PATH=/path/to/Qwen3-1.7B \
  bash torchtitan/experiments/rl/examples/swe_r2e/run_swe_r2e_daytona.sh

The Claude Code binary is downloaded inside the sandbox from its CDN (override via
SWE_CLAUDE_CDN), so no host toolchain is needed. See examples/swe_r2e/README.md
for prereqs and knobs.

Status

Draft / experiment. Proves the pipeline end to end and shows a real (if sparse)
upward reward trend at 32B; small models score ~0 reward on one-shot R2E (so zero
advantage). Meaningful reward needs a bigger model + larger context + many steps
(and/or the dense reward).

yichuan-w · 2026-06-24T21:28:36Z

RL training result: Qwen3-32B, 24 steps (binary R2E reward)

Setup: Qwen3-32B (dense), GRPO. Trainer FSDP-8 + 2 generators (TP-8) on 4x H200.
seq_len 32768, binary reward (R2E hidden-test pass/fail). Data: R2E-Gym-Subset
(~4.5K tasks).

On the batch sizes (64 vs ~128): GRPO uses 8 prompts x 8 samples = 64 rollouts
per collection round. The trainer's packed batch is global_batch_size=64 rows x
seq_len=32768. Because each coding episode (~16k trainable tokens) is far shorter
than seq_len, the collection loop keeps pulling rounds until it fills the token
target (64 x 32768 = 2.1M trainable tokens) -- about 2 rounds, i.e. ~128 rollouts
(16 prompts x 8 samples) per step, then packs ~2 episodes per row into the
[64, 32768] batch.
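The arithmetic above, as a short sketch. The 16,000 tokens-per-episode figure is the approximate value quoted in the text, and estimate_step is a hypothetical helper, not part of the trainer.

```python
def estimate_step(global_batch_size, seq_len, tokens_per_episode,
                  prompts_per_round, samples_per_prompt):
    """Rough estimate of how many collection rounds fill one packed batch."""
    token_target = global_batch_size * seq_len
    rollouts_per_round = prompts_per_round * samples_per_prompt
    rounds = token_target / (rollouts_per_round * tokens_per_episode)
    return {
        "token_target": token_target,
        "rounds": rounds,
        "rollouts_per_step": round(rounds) * rollouts_per_round,
        "episodes_per_row": seq_len / tokens_per_episode,
    }


est = estimate_step(
    global_batch_size=64,
    seq_len=32768,
    tokens_per_episode=16_000,  # approximate, from the text above
    prompts_per_round=8,
    samples_per_prompt=8,
)
print(est["token_target"])       # 2097152
print(est["rollouts_per_step"])  # 128
```

The estimated round count and episodes per row both come out slightly above 2, matching the "about 2 rounds" and "~2 episodes per row" figures.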

Reward trend (solved rollouts out of ~128 per step; run ongoing, 24 of 30 steps):

steps	solved per step	mean pass%
1-8	1, 0, 8, 0, 0, 2, 4, 2	1.66%
9-16	8, 2, 3, 0, 4, 2, 0, 9	2.73%
17-24	0, 4, 9, 2, 11, 3, 3, 6	3.71%

Pass rate trends up ~2.2x over 24 steps (overall 83/3072 = 2.70%); highest single
step 8.6% (11/128). Solved tasks span many repos (scrapy, pillow, numpy, orange3,
datalad, ...), confirming 32B can fix real R2E issues.
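The percentages in the table can be reproduced from the per-step counts, assuming exactly 128 rollouts per step (the text gives ~128, so the real denominators may differ slightly).

```python
ROLLOUTS_PER_STEP = 128  # approximate, per the text above

solved = {
    "1-8": [1, 0, 8, 0, 0, 2, 4, 2],
    "9-16": [8, 2, 3, 0, 4, 2, 0, 9],
    "17-24": [0, 4, 9, 2, 11, 3, 3, 6],
}


def pass_rate(counts):
    """Solved rollouts divided by total rollouts over the given steps."""
    return sum(counts) / (len(counts) * ROLLOUTS_PER_STEP)


by_third = {span: round(100 * pass_rate(c), 2) for span, c in solved.items()}
all_counts = [n for counts in solved.values() for n in counts]
overall = round(100 * pass_rate(all_counts), 2)
trend = pass_rate(solved["17-24"]) / pass_rate(solved["1-8"])

print(by_third)         # {'1-8': 1.66, '9-16': 2.73, '17-24': 3.71}
print(overall)          # 2.7
print(round(trend, 1))  # 2.2
```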

Caveat -- binary sparsity: at ~2-3% pass, most GRPO groups are all-fail (or
occasionally all-pass), so only ~1-2 of 16 groups per step have within-group reward
variance / a non-zero advantage. The signal is real but weak; the opt-in dense
per-test-fraction reward (SWE_REWARD_DENSE) would turn the many applied-but-unsolved
patches into gradient and should speed this up considerably.
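To see why all-fail groups contribute nothing, here is a sketch using the common mean-centred, std-normalised form of the GRPO group advantage. The rubric in this PR may normalise differently, and dense_reward is a hypothetical reading of "per-test fraction", not the code in grading.py.

```python
import statistics


def group_advantages(rewards, eps=1e-6):
    """Advantage of each sample relative to its own group (one prompt, K samples)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def dense_reward(tests_passed, tests_total):
    """Fraction of hidden tests passed (assumed form of the dense reward)."""
    return tests_passed / tests_total if tests_total else 0.0


# Binary reward, all eight siblings fail: every advantage is zero, so no gradient.
all_fail = group_advantages([0.0] * 8)

# Same group under a dense reward: partially passing patches now differ.
dense = group_advantages([dense_reward(p, 10) for p in [0, 3, 0, 7, 0, 0, 1, 0]])
```

In the second group no patch passes all ten tests, so the binary reward would still be all zeros, while the dense reward ranks the siblings against each other.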

Step time / robustness: ~50 min/step, ~95% of which is the agentic rollout
(inference re-prefill of the growing context + in-sandbox tool execution); the
trainer backward + weight sync is ~1.5 min. Sandbox error rate is ~0.9% (15+ of the
24 steps fully clean) after adding jittered-backoff retry for transient Daytona
control-plane 401s and the empty-exit-code poll race; the run is stable (0 restarts).
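A generic sketch of jittered-backoff retry of the kind mentioned above. TransientError, retry_with_jitter and the delay values are hypothetical, and this uses the "full jitter" variant, which may not be the one the PR implements.

```python
import random
import time


class TransientError(Exception):
    """Stand-in for a retryable failure (e.g. a transient control-plane 401)."""


def retry_with_jitter(fn, attempts=5, base_delay=0.5, max_delay=8.0,
                      sleep=time.sleep, rng=random.random):
    """Call fn(); on TransientError, wait a random, exponentially capped delay."""
    for attempt in range(attempts):
        try:
            return fn()
        except TransientError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error to the caller
            cap = min(max_delay, base_delay * 2 ** attempt)
            # Full jitter: a uniform delay in [0, cap) spreads out retries
            # from many sandboxes that failed at the same moment.
            sleep(rng() * cap)
```

The sleep and rng parameters are injectable so the schedule can be checked without actually waiting.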

Train a policy with an unmodified agentic CLI (Claude Code) as the environment: the agent runs headless inside a Daytona cloud sandbox and is pointed at an on-box Anthropic /v1/messages adapter that serves the trained policy and captures every turn as on-policy training tokens (Token-In-Token-Out, so a multi-turn trajectory packs into one episode). A SWER2ERollouter drives Claude Code per rollout, grades the git diff against R2E-Gym hidden tests in a fresh sandbox, and feeds the standard rubric/advantage/GRPO path.

Layout (torchtitan/experiments/rl/):
- harness/sandbox: Sandbox contract + make_sandbox factory + the Daytona backend + bridge.py (relays the agent's HTTP over Daytona's fs API, since an inbound-firewalled box cannot accept a dial-back).
- harness/adapters/anthropic.py: the token-capturing Anthropic Messages endpoint.
- harness/agents/claude_code.py: boot sandbox + install the CDN claude binary + run.
- examples/swe_r2e: R2E dataset, grading, rubric, rollouter, env placeholder, config recipes (1.7B smoke / 8B target / 30B-A3B + 14B/32B scale), run script.

pytorch-bot Bot added ciflow/8gpu ciflow/rl labels Jun 22, 2026

meta-cla Bot added the CLA Signed label Jun 22, 2026

yichuan-w mentioned this pull request Jun 22, 2026

[rl] swe_r2e: pluggable coding-agent (Claude Code) harness on Daytona yichuan-w/torchtitan#2

Closed

yichuan-w force-pushed the yichuan/swe-r2e-upstream-pr branch 2 times, most recently from fe28eab to 8d44748, June 25, 2026 21:38

yichuan-w marked this pull request as ready for review June 25, 2026 21:45

yichuan-w force-pushed the yichuan/swe-r2e-upstream-pr branch from 8d44748 to b7413a2, June 25, 2026 21:45

yichuan-w requested review from felipemello1 and tianyu-l June 25, 2026 21:51

yichuan-w wants to merge 1 commit into pytorch:main from yichuan-w:yichuan/swe-r2e-upstream-pr
