[rl] swe_r2e: pluggable coding-agent (Claude Code) harness on Daytona #3734
yichuan-w wants to merge 1 commit into
RL training result: Qwen3-32B, 24 steps (binary R2E reward)

Setup: Qwen3-32B (dense), GRPO. Trainer FSDP-8 + 2 generators (TP-8) on 4x H200.

On the batch sizes (64 vs ~128): GRPO uses 8 prompts x 8 samples = 64 rollouts ...

Reward trend (solved rollouts out of ~128 per step; run ongoing, 24 of 30 steps):
Pass rate trends up ~2.2x over 24 steps (overall 83/3072 = 2.70%); highest single ...

Caveat -- binary sparsity: at ~2-3% pass, most GRPO groups are all-fail (or ...

Step time / robustness: ~50 min/step, ~95% of which is the agentic rollout ...
Train a policy with an unmodified agentic CLI (Claude Code) as the environment: the agent runs headless inside a Daytona cloud sandbox and is pointed at an on-box Anthropic /v1/messages adapter that serves the trained policy and captures every turn as on-policy training tokens (Token-In-Token-Out, so a multi-turn trajectory packs into one episode). A SWER2ERollouter drives Claude Code per rollout, grades the git diff against R2E-Gym hidden tests in a fresh sandbox, and feeds the standard rubric/advantage/GRPO path.

Layout (torchtitan/experiments/rl/):
- harness/sandbox: Sandbox contract + make_sandbox factory + the Daytona backend + bridge.py (relays the agent's HTTP over Daytona's fs API, since an inbound-firewalled box cannot accept a dial-back).
- harness/adapters/anthropic.py: the token-capturing Anthropic Messages endpoint.
- harness/agents/claude_code.py: boot sandbox + install the CDN claude binary + run.
- examples/swe_r2e: R2E dataset, grading, rubric, rollouter, env placeholder, config recipes (1.7B smoke / 8B target / 30B-A3B + 14B/32B scale), run script.
What
A TorchTitan RL example (swe_r2e) that post-trains a Qwen model on R2E-Gym SWE tasks where the rollout is driven by an unmodified agentic CLI harness (Claude Code) running inside a Daytona cloud sandbox. An on-box Anthropic-Messages adapter serves the trained policy to the agent and captures every model turn as on-policy training tokens; R2E hidden tests grade the agent's patch for the reward.
This is the TorchTitan analogue of THUDM/slime's coding_agent_rl and Meta msl/rl's "virtual actor + reverse-proxy" pattern.
How it works
The adapter reuses prior turns' exact sampled tokens via the renderer's bridge_to_next_turn (Token-In-Token-Out), so each turn's prompt exactly extends prev_prompt + prev_completion and a whole multi-turn trajectory packs into one training episode (assistant tokens trained, prompt / tool-result tokens masked); a Claude Code auto-compaction breaks the prefix and opens a new episode branch, exactly as rollout_to_episodes expects.

Because an inbound-firewalled trainer box cannot accept a dial-back from the public Daytona cloud (Daytona refuses ssh -R), bridge.py relays the agent's HTTP over Daytona's fs API: the in-sandbox proxy writes a request file, the host polls and replays it to the adapter, then uploads a response file. One-shot delivery is equivalent to a dial-back since token capture happens host-side regardless.
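
The Token-In-Token-Out packing described above can be illustrated with a minimal sketch. It is not the PR's rollout_to_episodes or bridge_to_next_turn; the Episode class and pack_turns helper are hypothetical names, and token ids are plain integers. It only shows the prefix rule: a turn whose prompt extends the tokens seen so far is appended to the current episode, and a broken prefix (for example after an auto-compaction) opens a new one.

```python
from dataclasses import dataclass, field


@dataclass
class Episode:
    """Hypothetical container: one packed multi-turn training episode."""
    tokens: list = field(default_factory=list)
    loss_mask: list = field(default_factory=list)  # 1 = trained, 0 = masked


def pack_turns(turns):
    """Pack (prompt_tokens, completion_tokens) turns into episodes.

    A turn joins the current episode only if its prompt starts with every
    token already in that episode (prev_prompt + prev_completion).
    Otherwise the prefix is broken and a new episode branch is opened.
    """
    episodes = []
    current = None
    for prompt, completion in turns:
        seen = len(current.tokens) if current is not None else 0
        if current is not None and list(prompt[:seen]) == current.tokens:
            # Only the new suffix (e.g. a tool result) is added as context.
            context = list(prompt[seen:])
        else:
            current = Episode()
            episodes.append(current)
            context = list(prompt)
        # Prompt / tool-result tokens are masked out of the loss.
        current.tokens.extend(context)
        current.loss_mask.extend([0] * len(context))
        # Sampled assistant tokens are the ones that get trained.
        current.tokens.extend(completion)
        current.loss_mask.extend([1] * len(completion))
    return episodes


turns = [
    ([1, 2], [3, 4]),            # first turn
    ([1, 2, 3, 4, 5], [6]),      # extends the prefix: same episode
    ([9, 9], [7]),               # prefix broken: new episode branch
]
episodes = pack_turns(turns)
```

Because the completion tokens are reused verbatim rather than re-tokenized from text, the prefix check is an exact list comparison; any re-tokenization drift would show up as a spurious new episode.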
Layout (three orthogonal axes)
- harness/sandbox/ -- WHERE code runs: Sandbox contract + make_sandbox, the Daytona backend, and the Daytona fs-relay bridge.
- harness/adapters/ -- HOW the model is served: a token-capturing wire-format endpoint (anthropic; add openai for Codex/OpenCode -- capture is shared).
- harness/agents/ -- WHICH CLI agent + how to launch it (claude_code).
- examples/swe_r2e/ -- the R2E task: data, grading, rubric, rollouter, config_registry (1.7B / 8B / 14B / 32B dense + 30B-A3B MoE), launcher, isolated smoke harness.

Adding a new CLI agent = a new agents/ runner (+ reuse/extend an adapters/ wire module); a new sandbox provider = a new sandbox/ backend.

Sandbox cleanup (no zombies, incl. SIGKILL)
Two layers: (1) per-rollout delete on context exit; (2) a cloud-side auto-delete TTL on every sandbox (TT_DAYTONA_AUTO_STOP_MIN/AUTO_DELETE_MIN) so an orphan self-reaps even if the process is SIGKILL'd (e.g. preemption) and never runs its exit path.
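
A minimal sketch of the two layers, under stated assumptions: FakeSandbox and rollout_sandbox are hypothetical stand-ins (not the Daytona SDK or the PR's make_sandbox), and the 30/60-minute defaults are illustrative values only. Layer 1 is the context manager's finally block; layer 2 is the TTL handed to the provider at creation time, which is the only layer that still applies when the process is SIGKILL'd.

```python
from contextlib import contextmanager


class FakeSandbox:
    """Hypothetical stand-in for a cloud sandbox handle (not the Daytona SDK)."""

    def __init__(self, auto_stop_min, auto_delete_min):
        # Layer 2: TTLs are recorded provider-side at creation, so the
        # provider can reap the sandbox without any help from this process.
        self.auto_stop_min = auto_stop_min
        self.auto_delete_min = auto_delete_min
        self.deleted = False

    def delete(self):
        self.deleted = True


@contextmanager
def rollout_sandbox(create, auto_stop_min=30, auto_delete_min=60):
    """Yield a sandbox for one rollout and delete it on context exit.

    `create` is any callable accepting the two TTL keyword arguments.
    The default TTL values are assumptions chosen for illustration.
    """
    sandbox = create(auto_stop_min=auto_stop_min, auto_delete_min=auto_delete_min)
    try:
        yield sandbox
    finally:
        # Layer 1: runs on normal exit and on exceptions, but never on SIGKILL.
        sandbox.delete()


with rollout_sandbox(FakeSandbox) as box:
    pass  # the agent rollout would run here
```

The finally block covers ordinary failures such as a grading exception; the TTL is what bounds the cost of a preempted trainer, since no in-process handler can run in that case.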
Verification
- End to end on Daytona: Claude Code rollouts -> R2E grading -> GRPO backward. Train | Step 1 backward for Qwen3-1.7B (~83s) and Qwen3-8B (~82s, bit_wise/logprob_diff/max 0.39).
- 24 training steps. Pass rate trends up across the run -- by thirds 1.66% -> 2.73% -> 3.71% (highest single step 8.6%, 11/128 solved), solving tasks across scrapy / pillow / numpy / orange3 / datalad. The signal is real but sparse (binary reward at ~2-3% pass leaves only ~1-2 of 16 GRPO groups per step with within-group variance) -- grading.py has an opt-in dense per-test-fraction reward (SWE_REWARD_DENSE) to densify it.
- pre-commit clean.
- 30B-A3B fits one host (FSDP-4 bf16 trainer + TP4/EP4 generator) but MoE breaks vLLM cudagraph capture, so the generator runs eager -- practical 30B needs the MoE-cudagraph fix or multi-host; the config is included for that follow-up.
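
The sparsity point above can be made concrete with a small sketch. It assumes the common group-normalized GRPO advantage, (reward - group mean) / group std, which may differ in detail from the advantage code in this PR. It also assumes that the binary reward is 1 only when every hidden test passes and that the dense reward is the fraction of tests passed. The function names are hypothetical and are not taken from grading.py.

```python
from statistics import mean, pstdev


def binary_reward(passed, total):
    """Assumed binary R2E reward: 1.0 only if every hidden test passes."""
    return 1.0 if total > 0 and passed == total else 0.0


def dense_reward(passed, total):
    """Assumed dense reward: fraction of hidden tests that pass."""
    return passed / total if total > 0 else 0.0


def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantage over one prompt's samples."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One GRPO group: 8 samples for the same task, none fully solving it.
# Each pair is (tests passed, tests total); the numbers are made up.
results = [(0, 10), (3, 10), (7, 10), (0, 10), (9, 10), (2, 10), (0, 10), (5, 10)]

# Binary reward: every sample scores 0, so every advantage is 0 and the
# group contributes no gradient.
binary_adv = group_advantages([binary_reward(p, t) for p, t in results])

# Dense reward: partial progress separates the samples within the group.
dense_adv = group_advantages([dense_reward(p, t) for p, t in results])
```

Under these assumptions an all-fail group is indistinguishable from an all-pass group: both have zero variance and so zero advantage, which is why only groups containing at least one solved and one unsolved rollout carry signal with the binary reward.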
Run
The Claude Code binary is downloaded inside the sandbox from its CDN (override via SWE_CLAUDE_CDN), so no host toolchain is needed. See examples/swe_r2e/README.md for prereqs and knobs.
Status
Draft / experiment. Proves the pipeline end to end and shows a real (if sparse) upward reward trend at 32B; small models score ~0 reward on one-shot R2E (zero advantage). Meaningful reward needs a bigger model + larger context + many steps (and/or the dense reward).