exp: r2e-gym ablation (vm sandboxes) #2888
Draft
S1ro1 wants to merge 20 commits into
v1 RL config: GLM-4.5-Air (zai-org/GLM-4.5-Air, 100B MoE) on scaleswe-v1 (train) + swebench-verified (eval), bash harness on prime sandboxes. Router replay on (trainer.enable_router_replay + inference.enable_return_routed_experts). 2 train + 2 infer nodes. Trainer: cp=8 ulysses, muon, + GLM-5.1 prod-run memory improvements (LM-head chunking, AC + activation offload, optimizer CPU offload, skip-gather/skip-optimizer ckpt). Inference: 2x tp=8 replicas, NO expert parallelism (inference EP + router-replay capture deadlocked the engine via cross-node EP all-to-all). Renderer/parsers auto-resolve from the official slug. Relies on fixes already in main: the glm4_moe routed_experts .contiguous() slice (torch.compile stride assert) and the verifiers always-install-uv bootstrap. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
vLLM's compile cache defaulted to NFS (~/.cache/vllm), which hung inference startup on slow shared FS. Point it at node-local /tmp (matching inference.sbatch.j2). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…45air
Inference runs vLLM online fp8 quant (vllm_extra={quantization="fp8"}) over the
bf16 policy for faster generation; trainer/inference/orchestrator all use the
bf16 zai-org/GLM-4.5-Air. The per-channel GLM-4.5-Air-FP8 checkpoint is
incompatible with prime-rl's block-wise fp8 path (use_deep_gemm /
quantize_in_weight_transfer), so we use online quant instead. max_inflight_rollouts=384.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sibling of glm45air_scaleswe.toml with orchestrator.advantage.length_penalty enabled at defaults (coef=0.25, gate_by_correctness=false). Distinct slurm job_name + sandbox labels (glm45air-swe-lp) so it runs alongside the no-penalty run without sharing prime sandboxes. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Cold-cache nodes read the 206GB bf16 model from NFS at ~50s/shard (~46min total). The weight-broadcast store rendezvous default (1200s/20min) times out before inference finishes loading (DistStoreError: 1/17 clients joined), killing the trainer. 3600s covers the cold-load worst case with margin. Applied to both the base and -lp ablation configs. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
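The timeout arithmetic in this commit can be checked with a short sketch. It uses only the approximate figures quoted above; the shard count is not stated in the commit and is derived here from ~50 s/shard and ~46 min total, so treat it as an estimate.

```python
# Back-of-the-envelope check of the rendezvous timeout against the cold-load time.
# All inputs are the approximate figures quoted in the commit message.
SECONDS_PER_SHARD = 50           # ~50 s per shard when reading from cold NFS
COLD_LOAD_SECONDS = 46 * 60      # ~46 min total cold load

OLD_TIMEOUT_S = 1200             # previous rendezvous default (20 min)
NEW_TIMEOUT_S = 3600             # value set by this commit

# Derived, not stated in the source: implied number of weight shards.
implied_shards = round(COLD_LOAD_SECONDS / SECONDS_PER_SHARD)


def covers_cold_load(timeout_s: int, load_s: int = COLD_LOAD_SECONDS) -> bool:
    """True if the rendezvous timeout outlasts the cold model load."""
    return timeout_s > load_s


# Headroom left by the new timeout over the quoted cold-load time.
margin_s = NEW_TIMEOUT_S - COLD_LOAD_SECONDS

if __name__ == "__main__":
    print(f"implied shards: ~{implied_shards}")
    print(f"old timeout covers cold load: {covers_cold_load(OLD_TIMEOUT_S)}")
    print(f"new timeout covers cold load: {covers_cold_load(NEW_TIMEOUT_S)}")
    print(f"margin with new timeout: {margin_s} s")
```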
- Drop router replay (trainer.enable_router_replay + inference.enable_return_routed_experts): mutually exclusive with inference.kv_cache_offload (rl.py validator: external KV cache hits don't carry routed-expert decisions).
- Enable native KV-cache offloading with a 128GB CPU tier (extends the prefix cache).
- FP8 trainer (DeepGEMM blockwise linear/MoE); impl=custom is already set.
- Use the bash_edit harness (bash + local edit tool) for train + eval, replacing pure bash.
- Bump max_inflight_rollouts 384 -> 512.
- Keep LM-head token chunking + activation checkpointing explicit at the values that become trainer defaults in #2867 (not merged yet, so removing them would disable the features).
- Drop the -lp length-penalty variant.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- Drop router replay (trainer.enable_router_replay + inference.enable_return_routed_experts): mutually exclusive with inference.kv_cache_offload (rl.py validator: external KV cache hits don't carry routed-expert decisions).
- Enable native KV-cache offloading with a 128GB CPU tier (extends the prefix cache).
- FP8 trainer (DeepGEMM blockwise linear/MoE); impl=custom is already set.
- Use the bash_edit harness (bash + local edit tool) for train + eval, replacing pure bash.
- Bump max_inflight_rollouts 384 -> 512.
- Keep LM-head token chunking + activation checkpointing explicit at the values that become trainer defaults in #2867 (not merged yet, so removing them would disable the features).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
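The mutual-exclusion rule described in this commit can be illustrated with a minimal sketch. This is not the actual rl.py validator: the `RLFlags` class and `validate` function are hypothetical, and `kv_cache_offload` is modelled as a plain boolean here even though the real option carries a CPU tier size.

```python
from dataclasses import dataclass


@dataclass
class RLFlags:
    """Hypothetical stand-in for the relevant config switches."""
    enable_router_replay: bool = False           # trainer side
    enable_return_routed_experts: bool = False   # inference side
    kv_cache_offload: bool = False               # inference side (simplified to a bool)


def validate(flags: RLFlags) -> None:
    """Reject configs that combine router replay with KV-cache offload.

    Rationale from the commit message: external KV cache hits don't carry
    routed-expert decisions, so replay would have nothing to replay for them.
    """
    uses_router_replay = (
        flags.enable_router_replay or flags.enable_return_routed_experts
    )
    if uses_router_replay and flags.kv_cache_offload:
        raise ValueError(
            "router replay and kv_cache_offload are mutually exclusive"
        )


if __name__ == "__main__":
    validate(RLFlags(kv_cache_offload=True))  # offload only: accepted
    try:
        validate(RLFlags(enable_router_replay=True, kv_cache_offload=True))
    except ValueError as err:
        print(f"rejected: {err}")
```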
Run bf16 inference (remove vllm_extra quantization=fp8). fp8 inference added ~10x mismatch KL (~0.002 vs ~0.0002); bf16 inference lowers it. Trainer fp8 + native KV-cache offload (and the now-disabled router replay) unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
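For readers unfamiliar with the metric, the sketch below shows one common way a trainer/inference mismatch KL can be estimated from per-token log-probabilities. This is an assumption for illustration only: the commit does not say which estimator prime-rl uses, and the function name `mismatch_kl` is hypothetical.

```python
import math
from typing import Sequence


def mismatch_kl(
    infer_logprobs: Sequence[float], trainer_logprobs: Sequence[float]
) -> float:
    """Monte Carlo estimate of KL(pi_infer || pi_train).

    Both sequences hold log-probabilities of the same sampled tokens, where the
    tokens were sampled by the inference engine. The estimate is the mean of
    log pi_infer(token) - log pi_train(token); with few samples it can be
    slightly negative even though the true KL is non-negative.
    """
    if len(infer_logprobs) != len(trainer_logprobs):
        raise ValueError("log-prob sequences must have equal length")
    if not infer_logprobs:
        raise ValueError("need at least one token")
    diffs = [a - b for a, b in zip(infer_logprobs, trainer_logprobs)]
    return sum(diffs) / len(diffs)


if __name__ == "__main__":
    # Identical log-probs give zero mismatch.
    print(mismatch_kl([-1.0, -2.0], [-1.0, -2.0]))
    # The two values quoted in the commit differ by roughly a factor of ten.
    print(math.isclose(0.002 / 0.0002, 10.0))
```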
seq_len + inference.model.max_model_len 65536 -> 131072. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- seq_len + inference.model.max_model_len 131072 -> 65536
- Remove inference.kv_cache_offload (native CPU tier)
- max_inflight_rollouts 512 -> 384
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Make the startup eval explicit so SWE-Bench Verified runs at step 0 before any train rollouts (already the default; pinned for clarity). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Remove trainer.model.fp8 (back to bf16 training). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Fork of glm45air_scaleswe.toml (PR #2862) with the training env swapped scaleswe-v1 -> r2e-gym-v1; eval stays SWE-Bench Verified. r2e train images resolve through Prime's registry mirror (use_prime_registry) to avoid Docker Hub pull rate limits at rollout concurrency. Model/trainer/inference knobs unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- Match the scaleswe baseline: enable router replay (trainer.enable_router_replay + inference.enable_return_routed_experts).
- Run the r2e-gym-v1 train env on Prime VM sandboxes: taskset.vm=true (resolves images to the PI Research team micro-VM registry; takes precedence over use_prime_registry) + harness.runtime.vm=true. Eval (swebench-verified) stays non-VM (no VM images).
- Bump research-environments submodule to feat/r2e-gym-vm-images (ee2e86032), which adds vm-capable image resolution to r2e-gym-v1.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
num_infer_replicas 2 -> 4 (num_infer_nodes is per-replica, so total inference nodes = num_infer_nodes * num_infer_replicas = 4). Total job = 2 train + 4 infer. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
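The node accounting in this commit can be written out as a small sketch. The helper names are hypothetical; `num_infer_nodes = 1` is not stated directly and is inferred from the product equalling 4 with 4 replicas.

```python
def total_infer_nodes(num_infer_nodes: int, num_infer_replicas: int) -> int:
    """num_infer_nodes is per replica, so the total is the product."""
    return num_infer_nodes * num_infer_replicas


def total_job_nodes(
    num_train_nodes: int, num_infer_nodes: int, num_infer_replicas: int
) -> int:
    """Train nodes plus all inference nodes across replicas."""
    return num_train_nodes + total_infer_nodes(num_infer_nodes, num_infer_replicas)


if __name__ == "__main__":
    # Assumed: 1 node per replica (inferred), 4 replicas, 2 train nodes.
    print(total_infer_nodes(num_infer_nodes=1, num_infer_replicas=4))
    print(total_job_nodes(num_train_nodes=2, num_infer_nodes=1, num_infer_replicas=4))
```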
glm45air_scaleswe.toml belongs to exp/glm45air-scaleswe; the r2e branch only ships glm45air_r2e.toml. Inherited here from branching off scaleswe. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Increase orchestrator rollout concurrency to better saturate the 4 inference replicas. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
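A rough sketch of why the concurrency bump follows from the replica count, using the 384 → 512 values recorded in this PR's changelog. It assumes rollouts are spread evenly across replicas, which the source does not state; the helper name is hypothetical.

```python
def inflight_per_replica(max_inflight_rollouts: int, num_infer_replicas: int) -> float:
    """Average in-flight rollouts per replica under an even-split assumption."""
    if num_infer_replicas <= 0:
        raise ValueError("num_infer_replicas must be positive")
    return max_inflight_rollouts / num_infer_replicas


if __name__ == "__main__":
    # Doubling replicas at a fixed cap halves the per-replica load;
    # raising the cap restores part of it.
    for cap, replicas in [(384, 2), (384, 4), (512, 4)]:
        print(cap, replicas, inflight_per_replica(cap, replicas))
```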
ef7dc1c to 949d35c
FLASHINFER_WORKSPACE_BASE defaulted to $HOME/.cache/flashinfer on shared weka. With concurrent GLM-4.5-Air/MoE runs every TP worker contends on the same fused_moe_*.lock there and deadlocks in uninterruptible (D-state) filesystem I/O during the CUTLASS fused-MoE JIT build, so inference never serves. Pin it to node-local /tmp like the Triton/vLLM caches. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
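The actual fix lives in the sbatch templates, which are not shown here. As an illustration of the same idea in a launcher script, the sketch below overrides FLASHINFER_WORKSPACE_BASE in an environment mapping and checks that the result no longer sits under the shared home directory. The helper names and the example home path are hypothetical.

```python
from pathlib import PurePosixPath
from typing import Mapping


def pin_flashinfer_workspace(env: Mapping[str, str], base: str = "/tmp") -> dict:
    """Return a copy of env with the FlashInfer workspace on node-local storage."""
    pinned = dict(env)
    pinned["FLASHINFER_WORKSPACE_BASE"] = base
    return pinned


def is_under_home(path: str, home: str) -> bool:
    """True if path lies inside home (pure path comparison, no filesystem access)."""
    try:
        PurePosixPath(path).relative_to(PurePosixPath(home))
        return True
    except ValueError:
        return False


if __name__ == "__main__":
    home = "/home/example"  # hypothetical shared home directory
    env = pin_flashinfer_workspace({"HOME": home})
    # The old default location is under the shared home; the pinned one is not.
    print(is_under_home(f"{home}/.cache/flashinfer", home))
    print(is_under_home(env["FLASHINFER_WORKSPACE_BASE"], home))
```

The returned mapping can be passed as `env=` to `subprocess.run` when launching the inference process, so each node JIT-builds into its own local directory instead of contending on a shared lock file.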
e8a4dfe to 2cb76c1
What
configs/debug/v1/glm45air_r2e.toml: a v1 RL ablation for GLM-4.5-Air (100B MoE) on the r2e-gym-v1 SWE taskset (train) + swebench-verified (eval), bash_edit harness on Prime VM sandboxes. Identical to the scaleswe baseline (glm45air_scaleswe.toml) except that the train taskset is swapped scaleswe-v1 → r2e-gym-v1 and run on VM sandboxes.
Config highlights
- 2 train + 4 infer nodes (num_infer_replicas = 4). Trainer: cp=8 (ulysses), muon, + the GLM-5.1 prod-run memory improvements (LM-head token chunking, activation checkpointing + offload max_inflight=1, optimizer CPU offload, skip-gather/skip-optimizer ckpt).
- Router replay on (trainer.enable_router_replay + inference.enable_return_routed_experts).
- taskset.vm = true resolves images to the PI Research team registry as micro-VM (rootfs) artifacts (the only ref a Prime VM sandbox boots; takes precedence over use_prime_registry), paired with harness.runtime.vm = true. Requires research-environments feat/r2e-gym-vm-images (submodule → ee2e86032). Eval (swebench-verified) stays non-VM because that taskset has no VM images.
- 65536 context, eval at step 0 + every 20 steps; job_name / wandb / sandbox labels = glm45air-r2e.
- Branched off exp/glm45air-scaleswe, so the diff vs main also carries the scaleswe ablation lineage.
Changelog
2026-06-26
- Train env swapped scaleswe-v1 → r2e-gym-v1, run on Prime VM sandboxes (taskset.vm + harness.runtime.vm); router replay on. Submodule → feat/r2e-gym-vm-images (ee2e86032).
- … main (drops removed envs; bumps verifiers → dev400 + research-environments).
- Inference nodes 2 → 4 (num_infer_replicas 2 → 4).
- Removed glm45air_scaleswe.toml from this branch (belongs to exp/glm45air-scaleswe).
- max_inflight_rollouts 384 → 512 (saturate the 4 inference replicas).
- Pinned FLASHINFER_WORKSPACE_BASE to node-local /tmp in the sbatch templates (was defaulting to shared weka $HOME/.cache/flashinfer, where concurrent GLM/MoE runs deadlock on fused_moe_*.lock, so inference never serves).