
[rl] swe_r2e: pluggable coding-agent (Claude Code) harness on Daytona #3734

Open
yichuan-w wants to merge 1 commit into pytorch:main from yichuan-w:yichuan/swe-r2e-upstream-pr


@yichuan-w yichuan-w commented Jun 22, 2026

Member

What

A TorchTitan RL example (swe_r2e) that post-trains a Qwen model on R2E-Gym SWE
tasks where the rollout is driven by an unmodified agentic CLI harness (Claude
Code) running inside a Daytona cloud sandbox. An on-box Anthropic-Messages
adapter serves the trained policy to the agent and captures every model turn as
on-policy training tokens; R2E hidden tests grade the agent's patch for the reward.

This is the TorchTitan analogue of THUDM/slime's coding_agent_rl and Meta
msl/rl's "virtual actor + reverse-proxy" pattern.

How it works

RLTrainer (controller)
  SWER2ERollouter.run_group_rollouts(generate_fn, sample, group_size=K)
    AnthropicAdapter  <- one HTTP endpoint backed by the controller's generate_fn
    per sibling:
      boot Daytona sandbox (R2E image) + install Claude Code (self-contained CDN binary)
      claude -p  ANTHROPIC_BASE_URL -> adapter (via the Daytona fs-relay bridge)
         adapter: render_ids / bridge_to_next_turn (TITO) -> generate_fn -> Completion
                  records (prompt_ids, completion_ids, logprobs) per turn
      git diff -> evaluate_r2e (fresh sandbox: apply diff, run hidden tests) -> reward
  rubric -> GRPO advantage -> rollout_to_episodes -> Batcher -> backward

The adapter reuses prior turns' exact sampled tokens via the renderer's
bridge_to_next_turn (Token-In-Token-Out), so each turn's prompt exactly extends
prev_prompt + prev_completion and a whole multi-turn trajectory packs into one
training episode (assistant tokens trained, prompt / tool-result tokens masked); a
Claude Code auto-compaction breaks the prefix and opens a new episode branch,
exactly as rollout_to_episodes expects.
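
To make the prefix rule concrete, here is a minimal Python sketch of how captured turns could be packed into episodes. It is an illustration only, not the code in this PR: `Turn`, `Episode`, and `pack_turns` are hypothetical names, and the real rollout_to_episodes may differ in its details.

```python
from dataclasses import dataclass, field


@dataclass
class Turn:
    """One captured model turn (hypothetical record shape)."""
    prompt_ids: list
    completion_ids: list


@dataclass
class Episode:
    token_ids: list = field(default_factory=list)
    loss_mask: list = field(default_factory=list)  # 1 = trained assistant token


def pack_turns(turns):
    """Pack turns into episodes; open a new branch when the prefix breaks."""
    episodes = []
    for turn in turns:
        current = episodes[-1] if episodes else None
        # The turn extends the episode only if its prompt starts with every
        # token already recorded (prev_prompt + prev_completion).
        extends = (
            current is not None
            and turn.prompt_ids[: len(current.token_ids)] == current.token_ids
        )
        if not extends:
            # e.g. an auto-compaction rewrote the history: start a new branch.
            current = Episode()
            episodes.append(current)
        # Prompt tokens not yet in the episode (user / tool-result text): masked.
        new_prompt = turn.prompt_ids[len(current.token_ids):]
        current.token_ids += new_prompt
        current.loss_mask += [0] * len(new_prompt)
        # Sampled assistant tokens: trained.
        current.token_ids += turn.completion_ids
        current.loss_mask += [1] * len(turn.completion_ids)
    return episodes
```

With three turns where the second extends the first and the third does not, this yields two episodes, the first holding both turns with only the completion tokens unmasked.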

Because an inbound-firewalled trainer box cannot accept a dial-back from the public
Daytona cloud (Daytona refuses ssh -R), bridge.py relays the agent's HTTP over
Daytona's fs API: the in-sandbox proxy writes a request file, the host polls and
replays it to the adapter, then uploads a response file. One-shot delivery is
equivalent to a dial-back since token capture happens host-side regardless.

Layout (three orthogonal axes, plus the task itself)

  • harness/sandbox/ -- WHERE code runs: Sandbox contract + make_sandbox, the
    Daytona backend, and the Daytona fs-relay bridge.
  • harness/adapters/ -- HOW the model is served: a token-capturing wire-format
    endpoint (anthropic; add openai for Codex/OpenCode -- capture is shared).
  • harness/agents/ -- WHICH CLI agent + how to launch it (claude_code).
  • examples/swe_r2e/ -- the R2E task: data, grading, rubric, rollouter,
    config_registry (1.7B / 8B / 14B / 32B dense + 30B-A3B MoE), launcher, isolated
    smoke harness.

Adding a new CLI agent = a new agents/ runner (+ reuse/extend an adapters/ wire
module); a new sandbox provider = a new sandbox/ backend.

Sandbox cleanup (no zombies, incl. SIGKILL)

Two layers: (1) per-rollout delete on context exit; (2) a cloud-side auto-delete
TTL on every sandbox (TT_DAYTONA_AUTO_STOP_MIN / AUTO_DELETE_MIN) so an orphan
self-reaps even if the process is SIGKILL'd (e.g. preemption) and never runs its
exit path.
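
A rough Python sketch of the two layers, using an in-memory `FakeCloud` as a stand-in for the provider. All names here are hypothetical and none of this is the Daytona SDK; it only shows why the TTL covers the case the context manager cannot.

```python
from contextlib import contextmanager


class FakeCloud:
    """In-memory stand-in for a sandbox provider (hypothetical)."""

    def __init__(self):
        self.live = {}  # sandbox id -> auto-delete TTL in minutes
        self._next = 0

    def create(self, auto_delete_min):
        self._next += 1
        sandbox_id = f"sbx-{self._next}"
        self.live[sandbox_id] = auto_delete_min
        return sandbox_id

    def delete(self, sandbox_id):
        self.live.pop(sandbox_id, None)

    def reap_expired(self, elapsed_min):
        """Layer 2: the cloud reaps on its own, with no help from the client."""
        for sandbox_id in [s for s, ttl in self.live.items() if ttl <= elapsed_min]:
            del self.live[sandbox_id]


@contextmanager
def sandbox(cloud, auto_delete_min=60):
    """Layer 1: delete on context exit (normal return or exception)."""
    sandbox_id = cloud.create(auto_delete_min)
    try:
        yield sandbox_id
    finally:
        # Never reached if the process is SIGKILL'd; the TTL covers that case.
        cloud.delete(sandbox_id)
```

A sandbox created without ever reaching the `finally` block (as after a SIGKILL) stays in `cloud.live` until `reap_expired` is called with an elapsed time at or past its TTL.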

Verification

End to end on Daytona: Claude Code rollouts -> R2E grading -> GRPO backward.

  • Single 8xH100 host: Train | Step 1 backward for Qwen3-1.7B (~83s) and
    Qwen3-8B (~82s, bit_wise/logprob_diff/max 0.39).
  • Multi-host (4x H200): Qwen3-32B, 8 prompts x 8 samples, binary R2E reward,
    24 training steps. Pass rate trends up across the run -- by thirds
    1.66% -> 2.73% -> 3.71% (highest single step 8.6%, 11/128 solved), solving
    tasks across scrapy / pillow / numpy / orange3 / datalad. The signal is real but
    sparse (binary reward at ~2-3% pass leaves only ~1-2 of 16 GRPO groups per step
    with within-group variance) -- grading.py has an opt-in dense per-test-fraction
    reward (SWE_REWARD_DENSE) to densify it.

pre-commit clean. 30B-A3B fits one host (FSDP-4 bf16 trainer + TP4/EP4 generator)
but MoE breaks vLLM cudagraph capture, so the generator runs eager -- practical 30B
needs the MoE-cudagraph fix or multi-host; the config is included for that follow-up.

Run

DAYTONA_API_KEY=dtn_... CONFIG=rl_grpo_qwen3_1_7b_swe_r2e \
  PROMPT_DATA=/path/to/r2e.jsonl HF_ASSETS_PATH=/path/to/Qwen3-1.7B \
  bash torchtitan/experiments/rl/examples/swe_r2e/run_swe_r2e_daytona.sh

The Claude Code binary is downloaded inside the sandbox from its CDN (override via
SWE_CLAUDE_CDN), so no host toolchain is needed. See examples/swe_r2e/README.md
for prereqs and knobs.

Status

Draft / experiment. Proves the pipeline end to end and shows a real (if sparse)
upward reward trend at 32B; small models score ~0 reward on one-shot R2E (hence
zero advantage). Meaningful reward needs a bigger model + larger context + many
steps (and/or the dense reward).

yichuan-w commented Jun 24, 2026

Member Author

RL training result: Qwen3-32B, 24 steps (binary R2E reward)

Setup: Qwen3-32B (dense), GRPO. Trainer FSDP-8 + 2 generators (TP-8) on 4x H200.
seq_len 32768, binary reward (R2E hidden-test pass/fail). Data: R2E-Gym-Subset
(~4.5K tasks).

On the batch sizes (64 vs ~128): GRPO uses 8 prompts x 8 samples = 64 rollouts
per collection round. The trainer's packed batch is global_batch_size=64 rows x
seq_len=32768. Because each coding episode (~16k trainable tokens) is far shorter
than seq_len, the collection loop keeps pulling rounds until it fills the token
target (64 x 32768 = 2.1M trainable tokens) -- about 2 rounds, i.e. ~128 rollouts
(16 prompts x 8 samples) per step, then packs ~2 episodes per row into the
[64, 32768] batch.
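
The arithmetic above can be checked with a short Python sketch. It takes "~16k trainable tokens" as exactly 16,000, which is an assumption; `batch_plan` is a hypothetical helper, not part of the trainer, and the real collection loop's stopping rule may round differently.

```python
def batch_plan(global_batch_size=64, seq_len=32768, prompts=8, samples=8,
               tokens_per_episode=16_000):
    """Estimate collection rounds needed to fill one packed training batch."""
    token_target = global_batch_size * seq_len        # tokens in the packed batch
    rollouts_per_round = prompts * samples            # one GRPO collection round
    tokens_per_round = rollouts_per_round * tokens_per_episode
    rounds = token_target / tokens_per_round          # fractional estimate
    return {
        "token_target": token_target,
        "rollouts_per_round": rollouts_per_round,
        "rounds": rounds,
        "rollouts_per_step": rounds * rollouts_per_round,
        "episodes_per_row": seq_len / tokens_per_episode,
    }


plan = batch_plan()
print(plan["token_target"])                # 2097152, i.e. about 2.1M tokens
print(round(plan["rounds"], 2))            # 2.05 rounds
print(round(plan["rollouts_per_step"]))    # 131 rollouts, close to the ~128 quoted
print(round(plan["episodes_per_row"], 2))  # 2.05 episodes per row
```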

Reward trend (solved rollouts out of ~128 per step; run ongoing, 24 of 30 steps):

steps    solved per step             mean pass%
1-8      1, 0, 8, 0, 0, 2, 4, 2      1.66%
9-16     8, 2, 3, 0, 4, 2, 0, 9      2.73%
17-24    0, 4, 9, 2, 11, 3, 3, 6     3.71%

Pass rate trends up ~2.2x over 24 steps (overall 83/3072 = 2.70%); highest single
step 8.6% (11/128). Solved tasks span many repos (scrapy, pillow, numpy, orange3,
datalad, ...), confirming 32B can fix real R2E issues.
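
For anyone who wants to re-derive these numbers, a small Python check follows. It treats the "~128 rollouts per step" as exactly 128, which is an assumption; the per-step solved counts are copied from the table above.

```python
ROLLOUTS_PER_STEP = 128  # "~128" in the comment; treated as exact here

solved = {
    "1-8": [1, 0, 8, 0, 0, 2, 4, 2],
    "9-16": [8, 2, 3, 0, 4, 2, 0, 9],
    "17-24": [0, 4, 9, 2, 11, 3, 3, 6],
}


def pass_rate(counts):
    """Mean pass rate in percent over the given steps."""
    return 100 * sum(counts) / (len(counts) * ROLLOUTS_PER_STEP)


by_third = {steps: pass_rate(counts) for steps, counts in solved.items()}
all_counts = [c for counts in solved.values() for c in counts]
overall = pass_rate(all_counts)
best_step = 100 * max(all_counts) / ROLLOUTS_PER_STEP
trend = by_third["17-24"] / by_third["1-8"]

for steps, rate in by_third.items():
    print(f"{steps}: {rate:.2f}%")          # 1.66%, 2.73%, 3.71%
print(f"overall: {sum(all_counts)}/{len(all_counts) * ROLLOUTS_PER_STEP} = {overall:.2f}%")
print(f"best step: {best_step:.1f}%, last/first third: {trend:.1f}x")
```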

Caveat -- binary sparsity: at ~2-3% pass, most GRPO groups are all-fail (or
occasionally all-pass), so only ~1-2 of 16 groups per step have within-group
reward variance and therefore a non-zero advantage. The signal is real but weak;
the opt-in dense per-test-fraction reward (SWE_REWARD_DENSE) would turn the many
applied-but-unsolved patches into gradient and should speed this up considerably.
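
The sparsity effect is easy to see in a toy Python sketch. It uses the common mean/std-normalized form of the GRPO advantage and assumes the dense reward is the fraction of hidden tests passed; both are assumptions for illustration, and the rubric and grading.py in this PR may compute them differently.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantage: (r - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def binary_reward(passed, total):
    """1.0 only if every hidden test passes."""
    return 1.0 if total > 0 and passed == total else 0.0


def dense_reward(passed, total):
    """Fraction of hidden tests passed (assumed meaning of the dense reward)."""
    return passed / total if total else 0.0


# One group of 8 siblings: (tests passed, tests total). Nobody fully solves it.
group = [(0, 10), (4, 10), (0, 10), (7, 10), (0, 10), (0, 10), (2, 10), (0, 10)]

binary_adv = grpo_advantages([binary_reward(p, t) for p, t in group])
dense_adv = grpo_advantages([dense_reward(p, t) for p, t in group])

print(binary_adv)  # all zeros: an all-fail group contributes no gradient
print([round(a, 2) for a in dense_adv])  # partial passes now get signal
```

In the all-fail group every binary advantage is zero, while the dense variant ranks the sibling that passed 7 of 10 tests highest.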

Step time / robustness: ~50 min/step, ~95% of which is the agentic rollout
(inference re-prefill of the growing context + in-sandbox tool execution); the
trainer backward + weight sync is ~1.5 min. Sandbox error rate ~0.9% (15+ of the
steps fully clean) after jittered-backoff retry for transient Daytona control-plane
401s and the empty-exit-code poll race; the run is stable (0 restarts).
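
For reference, a minimal Python sketch of a jittered-backoff retry. The PR text does not say which jitter scheme is used, so this assumes exponential backoff with "full jitter"; `TransientError` and `retry_with_jitter` are hypothetical names, not the helpers in this PR.

```python
import random
import time


class TransientError(Exception):
    """Stand-in for a retryable failure, e.g. a transient control-plane 401."""


def retry_with_jitter(fn, *, attempts=5, base_delay=0.5, max_delay=30.0,
                      sleep=time.sleep, rng=random.random):
    """Call fn(); on TransientError, wait a random fraction of a growing cap."""
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for attempt in range(attempts):
        try:
            return fn()
        except TransientError:
            if attempt == attempts - 1:
                raise  # out of retries: surface the error to the caller
            cap = min(max_delay, base_delay * 2 ** attempt)
            # Full jitter: sleep uniformly in [0, cap) so concurrent rollouts
            # do not all retry at the same instant.
            sleep(rng() * cap)
```

`sleep` and `rng` are injectable so the behaviour can be tested deterministically without real waiting.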

@yichuan-w yichuan-w force-pushed the yichuan/swe-r2e-upstream-pr branch 2 times, most recently from fe28eab to 8d44748 June 25, 2026 21:38
Train a policy with an unmodified agentic CLI (Claude Code) as the environment:
the agent runs headless inside a Daytona cloud sandbox and is pointed at an on-box
Anthropic /v1/messages adapter that serves the trained policy and captures every
turn as on-policy training tokens (Token-In-Token-Out, so a multi-turn trajectory
packs into one episode). A SWER2ERollouter drives Claude Code per rollout, grades
the git diff against R2E-Gym hidden tests in a fresh sandbox, and feeds the
standard rubric/advantage/GRPO path.

Layout (torchtitan/experiments/rl/):
- harness/sandbox: Sandbox contract + make_sandbox factory + the Daytona backend +
  bridge.py (relays the agent's HTTP over Daytona's fs API, since an inbound-
  firewalled box cannot accept a dial-back).
- harness/adapters/anthropic.py: the token-capturing Anthropic Messages endpoint.
- harness/agents/claude_code.py: boot sandbox + install the CDN claude binary + run.
- examples/swe_r2e: R2E dataset, grading, rubric, rollouter, env placeholder,
  config recipes (1.7B smoke / 8B target / 30B-A3B + 14B/32B scale), run script.
@yichuan-w yichuan-w marked this pull request as ready for review June 25, 2026 21:45
@yichuan-w yichuan-w force-pushed the yichuan/swe-r2e-upstream-pr branch from 8d44748 to b7413a2 June 25, 2026 21:45

Labels

ciflow/rl ciflow/8gpu CLA Signed
