exp: r2e-gym ablation (vm sandboxes) #2888

Draft
S1ro1 wants to merge 20 commits into main from exp/glm45air-r2e

Conversation

S1ro1 commented Jun 26, 2026

Collaborator

What

configs/debug/v1/glm45air_r2e.toml: a v1 RL ablation for GLM-4.5-Air (100B MoE) with r2e-gym-v1 as the SWE train taskset and swebench-verified as the eval taskset, using the bash_edit harness on Prime VM sandboxes. It is identical to the scaleswe baseline (glm45air_scaleswe.toml) except that the train taskset is swapped scaleswe-v1 → r2e-gym-v1 and the train env runs on VM sandboxes.

Config highlights

  • 2 train + 4 infer nodes (num_infer_replicas = 4). Trainer: cp=8 (ulysses), muon, + the GLM-5.1 prod-run memory improvements (LM-head token chunking, activation checkpointing + offload max_inflight=1, optimizer CPU offload, skip-gather/skip-optimizer ckpt).
  • Inference: 4× tp=8 replicas, no expert parallelism; prefix caching on.
  • Router replay on (trainer.enable_router_replay + inference.enable_return_routed_experts).
  • r2e-gym-v1 train env on Prime VM sandboxes: taskset.vm = true resolves images to the PI Research team registry as micro-VM (rootfs) artifacts (the only ref a Prime VM sandbox boots; takes precedence over use_prime_registry), paired with harness.runtime.vm = true. Requires research-environments feat/r2e-gym-vm-images (submodule → ee2e86032). Eval (swebench-verified) stays non-VM — that taskset has no VM images.
  • 65536 context, eval at step 0 + every 20 steps; job_name / wandb / sandbox labels = glm45air-r2e.
  • Branched off exp/glm45air-scaleswe, so the diff vs main also carries the scaleswe ablation lineage.
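
The image-resolution precedence and the pairing of the two VM flags described above can be sketched as follows. This is a minimal illustration, not the research-environments implementation: `resolve_image_source`, `check_vm_pairing`, and the returned labels are hypothetical names, and treating an unpaired flag as an error is an assumption.

```python
def resolve_image_source(vm: bool, use_prime_registry: bool) -> str:
    """Hypothetical helper: pick where a task image is resolved from."""
    if vm:
        # taskset.vm wins over use_prime_registry: a Prime VM sandbox
        # only boots micro-VM (rootfs) artifacts.
        return "vm-registry"
    if use_prime_registry:
        # Registry mirror, used to avoid upstream pull rate limits.
        return "prime-registry-mirror"
    return "upstream"


def check_vm_pairing(taskset_vm: bool, harness_runtime_vm: bool) -> None:
    """Assumed rule: taskset.vm and harness.runtime.vm are set together."""
    if taskset_vm != harness_runtime_vm:
        raise ValueError("taskset.vm and harness.runtime.vm must match")


# Train env (r2e-gym-v1): both VM flags on.
check_vm_pairing(True, True)
train_source = resolve_image_source(vm=True, use_prime_registry=True)

# Eval env (swebench-verified): no VM images, so both flags stay off.
check_vm_pairing(False, False)
```

With both VM flags on, the train env resolves to the VM registry regardless of `use_prime_registry`.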

Changelog

2026-06-26

  • Created the r2e-gym VM ablation: copy of the scaleswe baseline with the train taskset swapped scaleswe-v1 → r2e-gym-v1, run on Prime VM sandboxes (taskset.vm + harness.runtime.vm); router replay on. Submodule → feat/r2e-gym-vm-images (ee2e86032).
  • Merged main (drops removed envs; bumps verifiers → dev400 + research-environments).
  • Scaled inference to 4 nodes / 4 replicas (num_infer_replicas 2 → 4).
  • Dropped glm45air_scaleswe.toml from this branch (belongs to exp/glm45air-scaleswe).
  • Bumped max_inflight_rollouts 384 → 512 (saturate the 4 inference replicas).
  • Fixed FlashInfer JIT cache deadlock: pin FLASHINFER_WORKSPACE_BASE to node-local /tmp in the sbatch templates (was defaulting to shared weka $HOME/.cache/flashinfer, where concurrent GLM/MoE runs deadlock on fused_moe_*.lock → inference never serves).
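
The FlashInfer fix in this PR lives in the sbatch templates. The same idea for a process launched from Python can be sketched as below, assuming a hypothetical helper `pin_flashinfer_workspace`; only the variable name and the `/tmp` target come from the changelog entry.

```python
def pin_flashinfer_workspace(env: dict, local_root: str = "/tmp") -> dict:
    """Return a copy of env with the FlashInfer workspace on node-local disk.

    Hypothetical helper for illustration. Keeping the JIT cache off the
    shared filesystem means concurrent runs do not contend on the same
    fused_moe_*.lock files.
    """
    pinned = dict(env)  # copy so the caller's environment is left untouched
    pinned["FLASHINFER_WORKSPACE_BASE"] = local_root
    return pinned


base_env = {"HOME": "/home/user"}
child_env = pin_flashinfer_workspace(base_env)
```

The resulting mapping can be passed as `env=` to `subprocess.Popen` when starting the inference server.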

S1ro1 and others added 17 commits June 24, 2026 01:13
v1 RL config: GLM-4.5-Air (zai-org/GLM-4.5-Air, 100B MoE) on scaleswe-v1 (train)
+ swebench-verified (eval), bash harness on prime sandboxes. Router replay on
(trainer.enable_router_replay + inference.enable_return_routed_experts).

2 train + 2 infer nodes. Trainer: cp=8 ulysses, muon, + GLM-5.1 prod-run memory
improvements (LM-head chunking, AC + activation offload, optimizer CPU offload,
skip-gather/skip-optimizer ckpt). Inference: 2x tp=8 replicas, NO expert
parallelism (inference EP + router-replay capture deadlocked the engine via
cross-node EP all-to-all). Renderer/parsers auto-resolve from the official slug.

Relies on fixes already in main: the glm4_moe routed_experts .contiguous() slice
(torch.compile stride assert) and the verifiers always-install-uv bootstrap.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
vLLM's compile cache defaulted to NFS (~/.cache/vllm), which hung inference
startup on slow shared FS. Point it at node-local /tmp (matching inference.sbatch.j2).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…45air

Inference runs vLLM online fp8 quant (vllm_extra={quantization="fp8"}) over the
bf16 policy for faster generation; trainer/inference/orchestrator all use the
bf16 zai-org/GLM-4.5-Air. The per-channel GLM-4.5-Air-FP8 checkpoint is
incompatible with prime-rl's block-wise fp8 path (use_deep_gemm /
quantize_in_weight_transfer), so we use online quant instead. max_inflight_rollouts=384.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sibling of glm45air_scaleswe.toml with orchestrator.advantage.length_penalty
enabled at defaults (coef=0.25, gate_by_correctness=false). Distinct slurm
job_name + sandbox labels (glm45air-swe-lp) so it runs alongside the no-penalty
run without sharing prime sandboxes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Cold-cache nodes read the 206GB bf16 model from NFS at ~50s/shard (~46min
total). The weight-broadcast store rendezvous default (1200s/20min) times out
before inference finishes loading (DistStoreError: 1/17 clients joined), killing
the trainer. 3600s covers the cold-load worst case with margin. Applied to both
the base and -lp ablation configs.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
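
The timeout reasoning in the commit above can be checked with a few lines of arithmetic. The constants are the approximate figures quoted in the message; the function name is hypothetical.

```python
COLD_LOAD_MINUTES = 46   # approximate cold-cache load time from the message
OLD_TIMEOUT_S = 1200     # previous rendezvous default (20 min)
NEW_TIMEOUT_S = 3600     # value set by this commit (60 min)


def rendezvous_margin_s(load_minutes: float, timeout_s: int) -> float:
    """Seconds left after the model load; negative means a timeout."""
    return timeout_s - load_minutes * 60


old_margin = rendezvous_margin_s(COLD_LOAD_MINUTES, OLD_TIMEOUT_S)
new_margin = rendezvous_margin_s(COLD_LOAD_MINUTES, NEW_TIMEOUT_S)
print(old_margin, new_margin)
```

The old default falls about 26 minutes short of the cold load, while 3600s leaves about 14 minutes of headroom.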
- Drop router replay (trainer.enable_router_replay + inference.enable_return_routed_experts):
  mutually exclusive with inference.kv_cache_offload (rl.py validator — external KV cache hits
  don't carry routed-expert decisions).
- Enable native KV-cache offloading with a 128GB CPU tier (extends the prefix cache).
- FP8 trainer (DeepGEMM blockwise linear/MoE) — impl=custom is already set.
- Use the bash_edit harness (bash + local edit tool) for train + eval, replacing pure bash.
- Bump max_inflight_rollouts 384 -> 512.
- Keep LM-head token chunking + activation checkpointing explicit at the values that become
  trainer defaults in #2867 (not merged yet, so removing them would disable the features).
- Drop the -lp length-penalty variant.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
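
The mutual exclusion mentioned in the first bullet can be sketched as a small check. The actual validator in rl.py is not shown in this PR, so the function name, signature, and error message below are assumptions.

```python
def validate_router_replay(enable_router_replay: bool,
                           kv_cache_offload: bool) -> None:
    """Hypothetical check: router replay cannot be combined with KV offload.

    Rationale from the commit message: external KV cache hits do not carry
    routed-expert decisions, so there would be nothing to replay for them.
    """
    if enable_router_replay and kv_cache_offload:
        raise ValueError(
            "router replay is mutually exclusive with inference.kv_cache_offload"
        )


# Either feature alone passes the check.
validate_router_replay(enable_router_replay=True, kv_cache_offload=False)
validate_router_replay(enable_router_replay=False, kv_cache_offload=True)
```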
- Drop router replay (trainer.enable_router_replay + inference.enable_return_routed_experts):
  mutually exclusive with inference.kv_cache_offload (rl.py validator — external KV cache hits
  don't carry routed-expert decisions).
- Enable native KV-cache offloading with a 128GB CPU tier (extends the prefix cache).
- FP8 trainer (DeepGEMM blockwise linear/MoE) — impl=custom is already set.
- Use the bash_edit harness (bash + local edit tool) for train + eval, replacing pure bash.
- Bump max_inflight_rollouts 384 -> 512.
- Keep LM-head token chunking + activation checkpointing explicit at the values that become
  trainer defaults in #2867 (not merged yet, so removing them would disable the features).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Run bf16 inference (remove vllm_extra quantization=fp8). fp8 inference added
~10x mismatch KL (~0.002 vs ~0.0002); bf16 inference lowers it. Trainer fp8 +
native KV-cache offload (and the now-disabled router replay) unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
seq_len + inference.model.max_model_len 65536 -> 131072.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- seq_len + inference.model.max_model_len 131072 -> 65536
- Remove inference.kv_cache_offload (native CPU tier)
- max_inflight_rollouts 512 -> 384

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Make the startup eval explicit so SWE-Bench Verified runs at step 0 before
any train rollouts (already the default; pinned for clarity).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Remove trainer.model.fp8 (back to bf16 training).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Fork of glm45air_scaleswe.toml (PR #2862) with the training env swapped
scaleswe-v1 -> r2e-gym-v1; eval stays SWE-Bench Verified. r2e train images
resolve through Prime's registry mirror (use_prime_registry) to avoid Docker
Hub pull rate limits at rollout concurrency. Model/trainer/inference knobs
unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- Match the scaleswe baseline: enable router replay (trainer.enable_router_replay
  + inference.enable_return_routed_experts).
- Run the r2e-gym-v1 train env on Prime VM sandboxes: taskset.vm=true (resolves
  images to the PI Research team micro-VM registry; takes precedence over
  use_prime_registry) + harness.runtime.vm=true. Eval (swebench-verified) stays
  non-VM (no VM images).
- Bump research-environments submodule to feat/r2e-gym-vm-images (ee2e86032),
  which adds vm-capable image resolution to r2e-gym-v1.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
num_infer_replicas 2 -> 4 (num_infer_nodes is per-replica, so total inference
nodes = num_infer_nodes * num_infer_replicas = 4). Total job = 2 train + 4 infer.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
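
The node accounting in the commit above works out as follows. `num_train_nodes` is a hypothetical parameter name used for illustration, and `num_infer_nodes = 1` is inferred from the stated product of 4.

```python
def total_job_nodes(num_train_nodes: int,
                    num_infer_nodes: int,
                    num_infer_replicas: int) -> int:
    """num_infer_nodes is per replica, so inference nodes scale with replicas."""
    return num_train_nodes + num_infer_nodes * num_infer_replicas


before = total_job_nodes(num_train_nodes=2, num_infer_nodes=1, num_infer_replicas=2)
after = total_job_nodes(num_train_nodes=2, num_infer_nodes=1, num_infer_replicas=4)
print(before, after)
```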
mikasenghaas changed the title from "exp: GLM-4.5-Air r2e-gym ablation (VM sandboxes)" to "exp: r2e-gym ablation (VM sandboxes)" on Jun 26, 2026
mikasenghaas changed the title from "exp: r2e-gym ablation (VM sandboxes)" to "exp: r2e-gym ablation (vm sandboxes)" on Jun 26, 2026
S1ro1 and others added 2 commits June 26, 2026 21:35
glm45air_scaleswe.toml belongs to exp/glm45air-scaleswe; the r2e branch only
ships glm45air_r2e.toml. Inherited here from branching off scaleswe.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Increase orchestrator rollout concurrency to better saturate the 4 inference replicas.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
FLASHINFER_WORKSPACE_BASE defaulted to $HOME/.cache/flashinfer on shared weka.
With concurrent GLM-4.5-Air/MoE runs every TP worker contends on the same
fused_moe_*.lock there and deadlocks in uninterruptible (D-state) filesystem
I/O during the CUTLASS fused-MoE JIT build, so inference never serves. Pin it
to node-local /tmp like the Triton/vLLM caches.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>