exp: r2e-gym ablation (vm sandboxes) by S1ro1 · Pull Request #2888 · PrimeIntellect-ai/prime-rl

S1ro1 · 2026-06-26T21:28:56Z

What

configs/debug/v1/glm45air_r2e.toml — a v1 RL ablation for GLM-4.5-Air (100B MoE) on the r2e-gym-v1 SWE taskset (train) + swebench-verified (eval), bash_edit harness on Prime VM sandboxes. Identical to the scaleswe baseline (glm45air_scaleswe.toml) except the train taskset is swapped scaleswe-v1 → r2e-gym-v1 and run on VM sandboxes.

Config highlights

2 train + 4 infer nodes (num_infer_replicas = 4). Trainer: cp=8 (ulysses), muon, + the GLM-5.1 prod-run memory improvements (LM-head token chunking, activation checkpointing + offload max_inflight=1, optimizer CPU offload, skip-gather/skip-optimizer ckpt).
Inference: 4× tp=8 replicas, no expert parallelism; prefix caching on.
Router replay on (trainer.enable_router_replay + inference.enable_return_routed_experts).
r2e-gym-v1 train env on Prime VM sandboxes: taskset.vm = true resolves images to the PI Research team registry as micro-VM (rootfs) artifacts (the only ref a Prime VM sandbox boots; takes precedence over use_prime_registry), paired with harness.runtime.vm = true. Requires research-environments feat/r2e-gym-vm-images (submodule → ee2e86032). Eval (swebench-verified) stays non-VM — that taskset has no VM images.
65536 context, eval at step 0 + every 20 steps; job_name / wandb / sandbox labels = glm45air-r2e.
Branched off exp/glm45air-scaleswe, so the diff vs main also carries the scaleswe ablation lineage.
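
The image-resolution precedence described above (taskset.vm wins over use_prime_registry) can be pictured with a minimal sketch. The function name and the returned labels are hypothetical and are not taken from the r2e-gym-v1 code; only the ordering of the checks reflects what the description states.

```python
def resolve_image_source(vm: bool, use_prime_registry: bool) -> str:
    """Pick an image source for a taskset (hypothetical labels, illustration only)."""
    if vm:
        # vm takes precedence: a Prime VM sandbox only boots micro-VM (rootfs) artifacts.
        return "micro-vm-registry"
    if use_prime_registry:
        # Registry mirror, used to avoid Docker Hub pull rate limits.
        return "prime-registry-mirror"
    return "docker-hub"


# Train env in this PR: vm = true, so the mirror flag is irrelevant.
print(resolve_image_source(vm=True, use_prime_registry=True))
# Eval env (swebench-verified) has no VM images, so vm stays false.
print(resolve_image_source(vm=False, use_prime_registry=True))
```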

Changelog

2026-06-26

Created the r2e-gym VM ablation: copy of the scaleswe baseline with the train taskset swapped scaleswe-v1 → r2e-gym-v1, run on Prime VM sandboxes (taskset.vm + harness.runtime.vm); router replay on. Submodule → feat/r2e-gym-vm-images (ee2e86032).
Merged main (drops removed envs; bumps verifiers → dev400 + research-environments).
Scaled inference to 4 nodes / 4 replicas (num_infer_replicas 2 → 4).
Dropped glm45air_scaleswe.toml from this branch (belongs to exp/glm45air-scaleswe).
Bumped max_inflight_rollouts 384 → 512 (saturate the 4 inference replicas).
Fixed FlashInfer JIT cache deadlock: pin FLASHINFER_WORKSPACE_BASE to node-local /tmp in the sbatch templates (was defaulting to shared weka $HOME/.cache/flashinfer, where concurrent GLM/MoE runs deadlock on fused_moe_*.lock → inference never serves).

v1 RL config: GLM-4.5-Air (zai-org/GLM-4.5-Air, 100B MoE) on scaleswe-v1 (train) + swebench-verified (eval), bash harness on prime sandboxes. Router replay on (trainer.enable_router_replay + inference.enable_return_routed_experts). 2 train + 2 infer nodes. Trainer: cp=8 ulysses, muon, + GLM-5.1 prod-run memory improvements (LM-head chunking, AC + activation offload, optimizer CPU offload, skip-gather/skip-optimizer ckpt). Inference: 2x tp=8 replicas, NO expert parallelism (inference EP + router-replay capture deadlocked the engine via cross-node EP all-to-all). Renderer/parsers auto-resolve from the official slug. Relies on fixes already in main: the glm4_moe routed_experts .contiguous() slice (torch.compile stride assert) and the verifiers always-install-uv bootstrap. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

vLLM's compile cache defaulted to NFS (~/.cache/vllm), which hung inference startup on slow shared FS. Point it at node-local /tmp (matching inference.sbatch.j2). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…45air Inference runs vLLM online fp8 quant (vllm_extra={quantization="fp8"}) over the bf16 policy for faster generation; trainer/inference/orchestrator all use the bf16 zai-org/GLM-4.5-Air. The per-channel GLM-4.5-Air-FP8 checkpoint is incompatible with prime-rl's block-wise fp8 path (use_deep_gemm / quantize_in_weight_transfer), so we use online quant instead. max_inflight_rollouts=384. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Sibling of glm45air_scaleswe.toml with orchestrator.advantage.length_penalty enabled at defaults (coef=0.25, gate_by_correctness=false). Distinct slurm job_name + sandbox labels (glm45air-swe-lp) so it runs alongside the no-penalty run without sharing prime sandboxes. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Cold-cache nodes read the 206GB bf16 model from NFS at ~50s/shard (~46min total). The weight-broadcast store rendezvous default (1200s/20min) times out before inference finishes loading (DistStoreError: 1/17 clients joined), killing the trainer. 3600s covers the cold-load worst case with margin. Applied to both the base and -lp ablation configs. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
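
The timeout reasoning above is simple arithmetic, sketched here using only the approximate figures quoted in the commit message. The variable names are illustrative, and the shard count is derived from those figures rather than read from the checkpoint.

```python
SECONDS_PER_SHARD = 50        # "~50s/shard" (approximate, from the commit message)
COLD_LOAD_SECONDS = 46 * 60   # "~46min total" (approximate, from the commit message)
DEFAULT_TIMEOUT = 1200        # old rendezvous default, 20 min
NEW_TIMEOUT = 3600            # value chosen in this commit, 60 min

# Shard count implied by the two quoted figures (an estimate, not a measured value).
implied_shards = COLD_LOAD_SECONDS / SECONDS_PER_SHARD


def covers_cold_load(timeout_seconds: int) -> bool:
    """True if the rendezvous timeout outlasts the cold model load."""
    return timeout_seconds >= COLD_LOAD_SECONDS


margin_seconds = NEW_TIMEOUT - COLD_LOAD_SECONDS

print(f"implied shards: ~{implied_shards:.0f}")
print(f"default covers cold load: {covers_cold_load(DEFAULT_TIMEOUT)}")
print(f"new timeout covers cold load: {covers_cold_load(NEW_TIMEOUT)}")
print(f"margin: {margin_seconds} s")
```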

- Drop router replay (trainer.enable_router_replay + inference.enable_return_routed_experts): mutually exclusive with inference.kv_cache_offload (rl.py validator; external KV cache hits don't carry routed-expert decisions).
- Enable native KV-cache offloading with a 128GB CPU tier (extends the prefix cache).
- FP8 trainer (DeepGEMM blockwise linear/MoE); impl=custom is already set.
- Use the bash_edit harness (bash + local edit tool) for train + eval, replacing pure bash.
- Bump max_inflight_rollouts 384 -> 512.
- Keep LM-head token chunking + activation checkpointing explicit at the values that become trainer defaults in #2867 (not merged yet, so removing them would disable the features).
- Drop the -lp length-penalty variant.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

- Drop router replay (trainer.enable_router_replay + inference.enable_return_routed_experts): mutually exclusive with inference.kv_cache_offload (rl.py validator; external KV cache hits don't carry routed-expert decisions).
- Enable native KV-cache offloading with a 128GB CPU tier (extends the prefix cache).
- FP8 trainer (DeepGEMM blockwise linear/MoE); impl=custom is already set.
- Use the bash_edit harness (bash + local edit tool) for train + eval, replacing pure bash.
- Bump max_inflight_rollouts 384 -> 512.
- Keep LM-head token chunking + activation checkpointing explicit at the values that become trainer defaults in #2867 (not merged yet, so removing them would disable the features).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
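
The mutual-exclusion rule between router replay and KV-cache offload can be sketched as a small validator. This is an illustration under stated assumptions, not the actual rl.py validator: the `RolloutFeatures` class and `validate` function are hypothetical, and only the three flag names come from the commit message.

```python
from dataclasses import dataclass


@dataclass
class RolloutFeatures:
    """Hypothetical flat view of the three flags named in the commit message."""
    enable_router_replay: bool = False
    enable_return_routed_experts: bool = False
    kv_cache_offload: bool = False


def validate(cfg: RolloutFeatures) -> None:
    """Reject router replay together with KV-cache offload."""
    replay = cfg.enable_router_replay or cfg.enable_return_routed_experts
    if replay and cfg.kv_cache_offload:
        # External KV cache hits don't carry routed-expert decisions,
        # so replay would have nothing to replay for those tokens.
        raise ValueError("router replay and kv_cache_offload are mutually exclusive")


validate(RolloutFeatures(kv_cache_offload=True))  # accepted: offload only
try:
    validate(RolloutFeatures(enable_router_replay=True, kv_cache_offload=True))
except ValueError as err:
    print(f"rejected: {err}")
```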

Run bf16 inference (remove vllm_extra quantization=fp8). fp8 inference added ~10x mismatch KL (~0.002 vs ~0.0002); bf16 inference lowers it. Trainer fp8 + native KV-cache offload (and the now-disabled router replay) unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
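
For readers unfamiliar with the metric, a mismatch KL between the inference engine and the trainer can be estimated from per-token log-probabilities of the sampled tokens. The sketch below uses the plain Monte Carlo estimator of KL(inference || trainer); this is an assumption for illustration, since the commit message does not say which estimator prime-rl logs, and the function name is hypothetical.

```python
def mismatch_kl(inference_logprobs: list[float], trainer_logprobs: list[float]) -> float:
    """Estimate KL(inference || trainer) from tokens sampled by the inference engine.

    Each entry is the log-probability one side assigned to the same sampled token.
    """
    if len(inference_logprobs) != len(trainer_logprobs):
        raise ValueError("log-prob sequences must have the same length")
    if not inference_logprobs:
        raise ValueError("need at least one token")
    diffs = [i - t for i, t in zip(inference_logprobs, trainer_logprobs)]
    return sum(diffs) / len(diffs)


# The two figures quoted in the commit message differ by roughly a factor of ten.
fp8_kl, bf16_kl = 0.002, 0.0002
print(f"ratio: ~{fp8_kl / bf16_kl:.0f}x")
```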

seq_len + inference.model.max_model_len 65536 -> 131072. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

- seq_len + inference.model.max_model_len 131072 -> 65536
- Remove inference.kv_cache_offload (native CPU tier)
- max_inflight_rollouts 512 -> 384

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Make the startup eval explicit so SWE-Bench Verified runs at step 0 before any train rollouts (already the default; pinned for clarity). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
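
The resulting eval schedule (step 0, then every 20 steps, per the config highlights) can be written as a small predicate. The function is a hypothetical sketch, not the orchestrator's code; `skip_first_step` mirrors the flag named in the commit title, and the 20-step interval is the value quoted for this config.

```python
def should_eval(step: int, interval: int = 20, skip_first_step: bool = False) -> bool:
    """Return True on steps where an eval would run under the described schedule."""
    if step == 0:
        # Startup eval runs before any train rollouts unless explicitly skipped.
        return not skip_first_step
    return step % interval == 0


print([s for s in range(0, 61) if should_eval(s)])
```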

Remove trainer.model.fp8 (back to bf16 training). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Fork of glm45air_scaleswe.toml (PR #2862) with the training env swapped scaleswe-v1 -> r2e-gym-v1; eval stays SWE-Bench Verified. r2e train images resolve through Prime's registry mirror (use_prime_registry) to avoid Docker Hub pull rate limits at rollout concurrency. Model/trainer/inference knobs unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

- Match the scaleswe baseline: enable router replay (trainer.enable_router_replay + inference.enable_return_routed_experts).
- Run the r2e-gym-v1 train env on Prime VM sandboxes: taskset.vm=true (resolves images to the PI Research team micro-VM registry; takes precedence over use_prime_registry) + harness.runtime.vm=true. Eval (swebench-verified) stays non-VM (no VM images).
- Bump research-environments submodule to feat/r2e-gym-vm-images (ee2e86032), which adds vm-capable image resolution to r2e-gym-v1.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

num_infer_replicas 2 -> 4 (num_infer_nodes is per-replica, so total inference nodes = num_infer_nodes * num_infer_replicas = 4). Total job = 2 train + 4 infer. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
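
The node accounting in this commit, as a worked example. `num_infer_nodes` and `num_infer_replicas` are the names from the commit message; `num_train_nodes` is a hypothetical name for the trainer node count, and `num_infer_nodes = 1` is inferred from the stated total of 4 rather than quoted from the config.

```python
def node_counts(num_train_nodes: int, num_infer_nodes: int, num_infer_replicas: int) -> tuple[int, int]:
    """Return (total inference nodes, total job nodes).

    num_infer_nodes is per replica, so inference nodes scale with the replica count.
    """
    infer_total = num_infer_nodes * num_infer_replicas
    return infer_total, num_train_nodes + infer_total


infer_total, job_total = node_counts(num_train_nodes=2, num_infer_nodes=1, num_infer_replicas=4)
print(f"inference nodes: {infer_total}, job nodes: {job_total}")
```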

glm45air_scaleswe.toml belongs to exp/glm45air-scaleswe; the r2e branch only ships glm45air_r2e.toml. Inherited here from branching off scaleswe. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Increase orchestrator rollout concurrency to better saturate the 4 inference replicas. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

FLASHINFER_WORKSPACE_BASE defaulted to $HOME/.cache/flashinfer on shared weka. With concurrent GLM-4.5-Air/MoE runs every TP worker contends on the same fused_moe_*.lock there and deadlocks in uninterruptible (D-state) filesystem I/O during the CUTLASS fused-MoE JIT build, so inference never serves. Pin it to node-local /tmp like the Triton/vLLM caches. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
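
The fix itself lives in the sbatch templates, which are not shown here. As an illustration only, the same pinning can be expressed as building a process environment before launching inference. The helper name is hypothetical, and the exact value the templates use under /tmp is an assumption.

```python
import os


def with_node_local_flashinfer_cache(base_env: dict[str, str], cache_root: str = "/tmp") -> dict[str, str]:
    """Return a copy of base_env with FlashInfer's workspace pinned to node-local disk.

    cache_root="/tmp" is an assumed value; the commit only says "node-local /tmp".
    """
    env = dict(base_env)
    # Keeps the fused-MoE JIT build and its lock files off the shared filesystem,
    # so concurrent runs on other nodes no longer contend on the same lock.
    env["FLASHINFER_WORKSPACE_BASE"] = cache_root
    return env


env = with_node_local_flashinfer_cache(dict(os.environ))
print(env["FLASHINFER_WORKSPACE_BASE"])
```

The returned dictionary could then be passed as `env=` to `subprocess.run` when starting the inference server.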

S1ro1 and others added 17 commits June 24, 2026 01:13

Merge remote-tracking branch 'origin/main' into exp/glm45air-scaleswe

5f3319e

exp(glm45air-scaleswe): bump context to 131072

acdacb0

exp(glm45air-scaleswe): eval at step 0 (explicit skip_first_step=false)

69a868f

exp(glm45air-scaleswe): drop fp8 trainer

cf0c496

Merge remote-tracking branch 'origin/main' into exp/glm45air-r2e

daa0ca7

mikasenghaas changed the title ~~exp: GLM-4.5-Air r2e-gym ablation (VM sandboxes)~~ exp: r2e-gym ablation (VM sandboxes) Jun 26, 2026

mikasenghaas changed the title ~~exp: r2e-gym ablation (VM sandboxes)~~ exp: r2e-gym ablation (vm sandboxes) Jun 26, 2026

S1ro1 and others added 2 commits June 26, 2026 21:35

exp(glm45air-r2e): drop scaleswe config

42d05cd

exp(glm45air-r2e): bump max_inflight_rollouts 384 -> 512

949d35c

mikasenghaas force-pushed the exp/glm45air-r2e branch from ef7dc1c to 949d35c on June 26, 2026 21:43

mikasenghaas force-pushed the exp/glm45air-r2e branch from e8a4dfe to 2cb76c1 on June 26, 2026 23:10

S1ro1 wants to merge 20 commits into main from exp/glm45air-r2e