exp: scaleswe ablation #2862
Merged
v1 RL config: GLM-4.5-Air (zai-org/GLM-4.5-Air, 100B MoE) on scaleswe-v1 (train) + swebench-verified (eval), bash harness on prime sandboxes. Router replay on (trainer.enable_router_replay + inference.enable_return_routed_experts). 2 train + 2 infer nodes. Trainer: cp=8 ulysses, muon, + GLM-5.1 prod-run memory improvements (LM-head chunking, AC + activation offload, optimizer CPU offload, skip-gather/skip-optimizer ckpt). Inference: 2x tp=8 replicas, NO expert parallelism (inference EP + router-replay capture deadlocked the engine via cross-node EP all-to-all). Renderer/parsers auto-resolve from the official slug. Relies on fixes already in main: the glm4_moe routed_experts .contiguous() slice (torch.compile stride assert) and the verifiers always-install-uv bootstrap. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
vLLM's compile cache defaulted to NFS (~/.cache/vllm), which hung inference startup on slow shared FS. Point it at node-local /tmp (matching inference.sbatch.j2). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…45air
Inference runs vLLM online fp8 quant (vllm_extra={quantization="fp8"}) over the
bf16 policy for faster generation; trainer/inference/orchestrator all use the
bf16 zai-org/GLM-4.5-Air. The per-channel GLM-4.5-Air-FP8 checkpoint is
incompatible with prime-rl's block-wise fp8 path (use_deep_gemm /
quantize_in_weight_transfer), so we use online quant instead. max_inflight_rollouts=384.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
bda8452 to d55678e
Sibling of glm45air_scaleswe.toml with orchestrator.advantage.length_penalty enabled at defaults (coef=0.25, gate_by_correctness=false). Distinct slurm job_name + sandbox labels (glm45air-swe-lp) so it runs alongside the no-penalty run without sharing prime sandboxes. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Cold-cache nodes read the 206GB bf16 model from NFS at ~50s/shard (~46min total). The weight-broadcast store rendezvous default (1200s/20min) times out before inference finishes loading (DistStoreError: 1/17 clients joined), killing the trainer. 3600s covers the cold-load worst case with margin. Applied to both the base and -lp ablation configs. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
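The timing argument in this commit can be checked with a few lines of arithmetic. This is only a sketch using the figures quoted in the message (about 46 minutes of cold load, a 1200s default, a 3600s override); the function and constant names are illustrative and are not prime-rl identifiers.

```python
# Back-of-the-envelope check of the rendezvous timeout against the cold-load time.
# All names are hypothetical; the numbers come from the commit message.

COLD_LOAD_MINUTES = 46      # ~46 min to read the bf16 model from cold NFS
DEFAULT_TIMEOUT_S = 1200    # default store rendezvous timeout (20 min)
NEW_TIMEOUT_S = 3600        # value set in this commit


def margin_seconds(timeout_s: int, load_minutes: float) -> float:
    """Seconds left after the load finishes; negative means the timeout fires first."""
    return timeout_s - load_minutes * 60


if __name__ == "__main__":
    for label, timeout in (("default", DEFAULT_TIMEOUT_S), ("new", NEW_TIMEOUT_S)):
        print(f"{label}: margin {margin_seconds(timeout, COLD_LOAD_MINUTES):+.0f}s")
```

Under these figures the default falls short by roughly 26 minutes, while 3600s leaves about 14 minutes of slack.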
- Drop router replay (trainer.enable_router_replay + inference.enable_return_routed_experts): mutually exclusive with inference.kv_cache_offload (rl.py validator: external KV cache hits don't carry routed-expert decisions).
- Enable native KV-cache offloading with a 128GB CPU tier (extends the prefix cache).
- FP8 trainer (DeepGEMM blockwise linear/MoE); impl=custom is already set.
- Use the bash_edit harness (bash + local edit tool) for train + eval, replacing pure bash.
- Bump max_inflight_rollouts 384 -> 512.
- Keep LM-head token chunking + activation checkpointing explicit at the values that become trainer defaults in #2867 (not merged yet, so removing them would disable the features).
- Drop the -lp length-penalty variant.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- Drop router replay (trainer.enable_router_replay + inference.enable_return_routed_experts): mutually exclusive with inference.kv_cache_offload (rl.py validator: external KV cache hits don't carry routed-expert decisions).
- Enable native KV-cache offloading with a 128GB CPU tier (extends the prefix cache).
- FP8 trainer (DeepGEMM blockwise linear/MoE); impl=custom is already set.
- Use the bash_edit harness (bash + local edit tool) for train + eval, replacing pure bash.
- Bump max_inflight_rollouts 384 -> 512.
- Keep LM-head token chunking + activation checkpointing explicit at the values that become trainer defaults in #2867 (not merged yet, so removing them would disable the features).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Run bf16 inference (remove vllm_extra quantization=fp8). fp8 inference added ~10x mismatch KL (~0.002 vs ~0.0002); bf16 inference lowers it. Trainer fp8 + native KV-cache offload (and the now-disabled router replay) unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
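The commit does not say how the mismatch KL is computed. As a rough illustration only, one common way to estimate a divergence between the inference policy and the trainer is to compare their per-token logprobs on the tokens inference actually sampled. The function name and the choice of estimator below are assumptions, not the prime-rl metric.

```python
import math
from typing import Sequence


def mismatch_kl(inference_logprobs: Sequence[float],
                trainer_logprobs: Sequence[float]) -> float:
    """Estimate KL(inference || trainer) from tokens sampled by inference.

    Uses the non-negative estimator (r - 1) - log r, with
    r = p_trainer / p_inference, averaged over tokens.
    """
    if len(inference_logprobs) != len(trainer_logprobs):
        raise ValueError("logprob sequences must have the same length")
    if not inference_logprobs:
        raise ValueError("need at least one token")
    total = 0.0
    for lp_inf, lp_train in zip(inference_logprobs, trainer_logprobs):
        log_ratio = lp_train - lp_inf
        # expm1(x) - x == e^x - 1 - x, which is >= 0 for every x.
        total += math.expm1(log_ratio) - log_ratio
    return total / len(inference_logprobs)
```

With an estimator of this kind, identical logprobs give exactly zero and any disagreement pushes the value up, which is the sense in which a drop from ~0.002 to ~0.0002 indicates closer agreement.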
seq_len + inference.model.max_model_len 65536 -> 131072. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- seq_len + inference.model.max_model_len 131072 -> 65536
- Remove inference.kv_cache_offload (native CPU tier)
- max_inflight_rollouts 512 -> 384
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Make the startup eval explicit so SWE-Bench Verified runs at step 0 before any train rollouts (already the default; pinned for clarity). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Remove trainer.model.fp8 (back to bf16 training). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Re-enable trainer.enable_router_replay + inference.enable_return_routed_experts. Compatible again now that kv-cache offload and fp8 (both mutually exclusive with router replay) have been dropped; replaying inference's routed-expert decisions in the trainer cuts the train/inference mismatch by ~an order of magnitude. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
num_infer_replicas 2 -> 4 (num_infer_nodes is per-replica, so total inference nodes = num_infer_nodes * num_infer_replicas = 4). Total job = 2 train + 4 infer. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
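The node accounting in this commit can be written out as a one-line formula. The sketch below assumes one inference node per replica, which is what the stated total of 4 inference nodes across 4 replicas implies; the function name is illustrative.

```python
def total_nodes(num_train_nodes: int, num_infer_nodes: int, num_infer_replicas: int) -> int:
    """Total job size when num_infer_nodes is counted per replica."""
    total_infer_nodes = num_infer_nodes * num_infer_replicas
    return num_train_nodes + total_infer_nodes


if __name__ == "__main__":
    # After this commit: 2 train nodes, 4 replicas of 1 node each (assumed).
    print(total_nodes(num_train_nodes=2, num_infer_nodes=1, num_infer_replicas=4))
```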
Increase orchestrator rollout concurrency to better saturate the 4 inference replicas. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
4a5872b to 9f8688e
FLASHINFER_WORKSPACE_BASE defaulted to $HOME/.cache/flashinfer on shared weka. With concurrent GLM-4.5-Air/MoE runs every TP worker contends on the same fused_moe_*.lock there and deadlocks in uninterruptible (D-state) filesystem I/O during the CUTLASS fused-MoE JIT build, so inference never serves. Pin it to node-local /tmp like the Triton/vLLM caches. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
# Conflicts:
#   src/prime_rl/templates/inference.sbatch.j2
#   src/prime_rl/templates/multi_node_rl.sbatch.j2
Move the cache-dir env vars out of the SLURM templates and into the config so the PR only adds the TOML (the merged env-var feature, #2863, makes this possible):
- [env_vars]: TRITON_CACHE_DIR (trainer + inference)
- [inference.env_vars]: VLLM_CACHE_ROOT, FLASHINFER_WORKSPACE_BASE
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
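The PR description notes that the multi-node launcher renders env_vars as export KEY="VALUE", so shell expansion still happens at runtime. The following is a minimal sketch of that rendering step, not the launcher's actual code; the function name and the example path are hypothetical.

```python
def render_exports(env_vars: dict[str, str]) -> list[str]:
    """Render a mapping of env vars as shell export lines."""
    lines = []
    for key, value in env_vars.items():
        # Double quotes leave $VAR and $(cmd) to be expanded by the shell
        # when the job script runs, not when the config is loaded.
        lines.append(f'export {key}="{value}"')
    return lines


if __name__ == "__main__":
    # Hypothetical node-local path; the real values live in the TOML.
    example = {"TRITON_CACHE_DIR": "/tmp/$USER/triton_$SLURM_JOB_ID"}
    print("\n".join(render_exports(example)))
```

Because the value is emitted verbatim inside double quotes, per-user and per-job path components resolve on the node that executes the script.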
faresobeid
previously approved these changes
Jun 29, 2026
…nking_retention
Main #2900 replaced the renderer `preserve_all_thinking` bool with `thinking_retention`; the merge left the config on the removed field, so it failed the config-load unit test ("No config class could be parsed"). Switch to `thinking_retention = "all"` (the documented equivalent).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
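The field migration described in this commit can be sketched as a small dictionary rewrite. This is an illustration, not code from the repository: the helper name is hypothetical, and only the `true` to `"all"` mapping is taken from the commit message.

```python
def migrate_renderer_config(renderer: dict) -> dict:
    """Return a copy with the removed bool replaced by thinking_retention."""
    migrated = dict(renderer)
    if "preserve_all_thinking" in migrated:
        old_value = migrated.pop("preserve_all_thinking")
        if old_value is True:
            # Only this mapping is documented in the commit message.
            migrated.setdefault("thinking_retention", "all")
        # Assumption: other values are simply dropped, leaving the new
        # field at whatever default the renderer config defines.
    return migrated


if __name__ == "__main__":
    print(migrate_renderer_config({"preserve_all_thinking": True}))
```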
faresobeid
approved these changes
Jun 29, 2026
What
`configs/debug/v1/glm45air_scaleswe.toml`: a v1 RL ablation for GLM-4.5-Air (100B MoE) on the scaleswe-v1 SWE taskset (train) + swebench-verified (eval), bash_edit harness on prime sandboxes.

This PR only adds the config TOML; no source/template changes.
Config highlights
- Trainer: `cp=8` (ulysses), muon, + the GLM-5.1 prod-run memory improvements (LM-head token chunking, activation checkpointing + offload `max_inflight=1`, optimizer CPU offload, skip-gather/skip-optimizer ckpt).
- Router replay (`trainer.enable_router_replay` + `inference.enable_return_routed_experts`): replays inference's routed-expert decisions in the trainer to cut train/inference mismatch.
- `65536` context (`seq_len` + `max_model_len`), `max_inflight_rollouts = 512`.
- Node-local cache dirs set via `[env_vars]` (see 2026-06-29 below); no SLURM template edits.
- Startup eval at step 0 (`skip_first_step = false`).
- Renderer (`glm-4.5`) and tool/reasoning parsers auto-resolve from the official `zai-org/GLM-4.5-Air` slug.

Changelog
2026-06-29
- Merged latest `main` (now includes the configurable env-vars feature, "feat: configurable env vars" #2863).
- Moved the node-local cache dirs out of the SLURM templates and into the config, so the PR only adds the TOML:
  - `[env_vars]`: `TRITON_CACHE_DIR` (trainer + inference)
  - `[inference.env_vars]`: `VLLM_CACHE_ROOT`, `FLASHINFER_WORKSPACE_BASE`

  The SLURM templates are now untouched vs `main`. Values keep the shell-expanded per-user/per-job paths (`$USER` / `$SLURM_JOB_ID` / `$(whoami)`); the multi-node launcher renders `env_vars` as `export KEY="VALUE"`, so the expansion still happens at runtime.
- Migrated the renderer field `preserve_all_thinking = true` → `thinking_retention = "all"` (main "Update configs for renderer thinking retention" #2900 removed the old bool; the config otherwise failed the config-load unit test).

2026-06-26
- Re-enabled router replay (`trainer.enable_router_replay` + `inference.enable_return_routed_experts`): compatible again now that kv-cache offload and trainer fp8 (both mutually exclusive with router replay) have been dropped; replaying inference's routed-expert decisions cuts the train/inference mismatch.
- Synced with `main` (drops removed envs; bumps verifiers + research-environments).
- Scaled inference to 4 replicas (`num_infer_replicas` 2 → 4).
- `max_inflight_rollouts` 384 → 512 (saturate the 4 inference replicas).
- Pinned `FLASHINFER_WORKSPACE_BASE` to node-local `/tmp` (was defaulting to shared weka `$HOME/.cache/flashinfer`, where concurrent GLM/MoE runs deadlock on `fused_moe_*.lock` → inference never serves). (2026-06-29: relocated from the sbatch templates into the config.)

2026-06-25
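- Synced with `main`.
- Dropped router replay (`trainer.enable_router_replay` + `inference.enable_return_routed_experts`).
- Harness `bash` → `bash_edit`.
- bf16 inference (removed `vllm_extra.quantization`) to lower train/inference mismatch.
- Startup eval at step 0 (`skip_first_step = false`).
- Dropped the `-lp` length-penalty variant (`glm45air_scaleswe_lp.toml`).

2026-06-24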
- `scaleswe-v1` (train) + `swebench-verified-v1` (eval), `bash` harness on the prime-sandbox runtime, router replay enabled, `cp=8` ulysses + muon + GLM-5.1 prod-run memory knobs.
- Online fp8 inference (`vllm_extra.quantization = "fp8"`) + 384 rollout concurrency.
- Raised `weight_broadcast` NCCL timeout to 3600s for cold NFS loads.

Note
Low Risk
Config-only addition with no application or template changes; affects experiment launch parameters only.
Overview
Adds `configs/debug/v1/glm45air_scaleswe.toml` only: a v1 RL experiment wiring GLM-4.5-Air to `scaleswe-v1` training and SWE-Bench Verified eval on prime sandboxes with the `bash_edit` harness (not the base `scaleswe` rlm setup).

Deployment & scale: 2 train nodes, 4 inference replicas at `tp=8` (no expert parallelism), `max_inflight_rollouts = 512`, `max_steps = 1000`, 65k context. Weight broadcast NCCL timeout 3600s.

Trainer: `cp=8` ulysses, muon, router replay on, plus GLM-5.1-style memory knobs (LM-head token chunking, activation checkpointing/offload, optimizer CPU offload, skip-gather / skip-optimizer checkpoints).

Inference: prefix caching, `enable_return_routed_experts` for replay; no fp8 quant. Eval runs at step 0 and every 20 steps.

Ops: top-level `[env_vars]` and `[inference.env_vars]` point Triton, vLLM, and FlashInfer caches at node-local `/tmp` to avoid shared-FS contention and FlashInfer lock deadlocks; SLURM `pre_run_command` deletes orphaned prime sandboxes by job label.

Reviewed by Cursor Bugbot for commit 2eea033.