exp: scaleswe ablation by S1ro1 · Pull Request #2862 · PrimeIntellect-ai/prime-rl

S1ro1 · 2026-06-24T01:13:45Z

What

configs/debug/v1/glm45air_scaleswe.toml — a v1 RL ablation for GLM-4.5-Air (100B MoE) on the scaleswe-v1 SWE taskset (train) + swebench-verified (eval), bash_edit harness on prime sandboxes.

This PR only adds the config TOML — no source/template changes.

Config highlights

- 2 train + 4 infer nodes. Trainer: cp=8 (ulysses), muon, plus the GLM-5.1 prod-run memory improvements (LM-head token chunking, activation checkpointing + offload with max_inflight=1, optimizer CPU offload, skip-gather/skip-optimizer ckpt).
- Inference: 4× tp=8 replicas, no expert parallelism. No fp8 quant on the inference side (dropped to lower train/inference mismatch); prefix caching on.
- Router replay on (trainer.enable_router_replay + inference.enable_return_routed_experts): replays inference's routed-expert decisions in the trainer to cut train/inference mismatch.
- 65536 context (seq_len + max_model_len), max_inflight_rollouts = 512.
- Node-local cache dirs via [env_vars] (see 2026-06-29 below); no SLURM template edits.
- Eval: SWE-Bench Verified at step 0 + every 20 steps (skip_first_step = false).
- Renderer (glm-4.5) and tool/reasoning parsers auto-resolve from the official zai-org/GLM-4.5-Air slug.
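
The eval cadence in the highlights (step 0, then every 20 steps, over the 1000 max steps this PR sets) can be sketched as below. This is an illustration only, not prime-rl's scheduler: `eval_steps` and its parameters are hypothetical names, and it assumes that `skip_first_step = false` simply means the step-0 eval is kept and that an eval also lands on the final step.

```python
def eval_steps(max_steps: int, interval: int, skip_first_step: bool = False) -> list[int]:
    """Hypothetical helper: list the steps at which an eval would run."""
    # Multiples of `interval` from 0 through max_steps inclusive.
    steps = list(range(0, max_steps + 1, interval))
    # Assumed semantics: skip_first_step=True drops only the step-0 eval.
    if skip_first_step:
        steps = steps[1:]
    return steps


schedule = eval_steps(max_steps=1000, interval=20)
print(schedule[:4], len(schedule))  # [0, 20, 40, 60] 51
```

Under these assumptions the run would evaluate SWE-Bench Verified 51 times, including the baseline at step 0.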

Changelog

2026-06-29

- Merged latest main (now includes the configurable env-vars feature, "feat: configurable env vars" #2863).
- Moved the node-local cache dirs out of the SLURM templates and into the config, so the PR only adds the TOML:
  - top-level [env_vars]: TRITON_CACHE_DIR (trainer + inference)
  - [inference.env_vars]: VLLM_CACHE_ROOT, FLASHINFER_WORKSPACE_BASE
- The SLURM templates are now untouched vs main. Values keep the shell-expanded per-user/per-job paths ($USER/$SLURM_JOB_ID/$(whoami)); the multi-node launcher renders env_vars as export KEY="VALUE", so the expansion still happens at runtime.
- Migrated the renderer field preserve_all_thinking = true → thinking_retention = "all" (#2900 on main, "Update configs for renderer thinking retention", removed the old bool; without the migration the config failed the config-load unit test).
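
The reason shell expansion survives the move into the config is that the values end up inside double quotes in an export line, and the shell still expands `$USER` and `$(whoami)` inside double quotes. A minimal sketch of that rendering step follows. `render_exports` is a hypothetical stand-in for the launcher's template logic, and the path shown is invented for illustration; the PR does not list the actual values.

```python
def render_exports(env_vars: dict[str, str]) -> str:
    """Hypothetical stand-in: render env vars as shell `export KEY="VALUE"` lines."""
    # Double quotes keep $VAR and $(cmd) expandable when the script runs.
    # Values containing a literal double quote would need escaping (not handled here).
    return "\n".join(f'export {key}="{value}"' for key, value in env_vars.items())


# Invented example path; only the variable name comes from the PR.
example = {"TRITON_CACHE_DIR": "/tmp/$USER/triton-$SLURM_JOB_ID"}
print(render_exports(example))
```

Because the expansion happens when the job script runs, each user and each SLURM job gets its own cache directory even though the TOML holds a single static string.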

2026-06-26

- Brought back router replay (trainer.enable_router_replay + inference.enable_return_routed_experts). It is compatible again now that kv-cache offload and trainer fp8 (both mutually exclusive with router replay) have been dropped; replaying inference's routed-expert decisions cuts the train/inference mismatch.
- Merged main (drops removed envs; bumps verifiers + research-environments).
- Scaled inference to 4 nodes / 4 replicas (num_infer_replicas 2 → 4).
- Bumped max_inflight_rollouts 384 → 512 (saturate the 4 inference replicas).
- Fixed FlashInfer JIT cache deadlock: pin FLASHINFER_WORKSPACE_BASE to node-local /tmp (was defaulting to shared weka $HOME/.cache/flashinfer, where concurrent GLM/MoE runs deadlock on fused_moe_*.lock → inference never serves). (2026-06-29: relocated from the sbatch templates into the config.)

2026-06-25

- Merged main.
- Dropped router replay (trainer.enable_router_replay + inference.enable_return_routed_experts).
- Switched train + eval harness bash → bash_edit.
- Dropped inference fp8 quant (vllm_extra.quantization) to lower train/inference mismatch.
- Eval at step 0 + every 20 steps (skip_first_step = false).
- Kept LM-head chunking + activation checkpointing explicit, at the values that become trainer defaults in the upcoming #2867 ("feat: add orchestrator debug mode (no-inference, no-trainer)").
- Dropped the -lp length-penalty variant (glm45air_scaleswe_lp.toml).

2026-06-24

- Initial config: GLM-4.5-Air on scaleswe-v1 (train) + swebench-verified-v1 (eval), bash harness on the prime-sandbox runtime, router replay enabled, cp=8 ulysses + muon + GLM-5.1 prod-run memory knobs.
- Online fp8 inference (vllm_extra.quantization = "fp8") + 384 rollout concurrency.
- Bumped weight_broadcast NCCL timeout to 3600s for cold NFS loads.

Note

Low Risk
Config-only addition with no application or template changes; affects experiment launch parameters only.

Overview
Adds configs/debug/v1/glm45air_scaleswe.toml only — a v1 RL experiment wiring GLM-4.5-Air to scaleswe-v1 training and SWE-Bench Verified eval on prime sandboxes with the bash_edit harness (not the base scaleswe rlm setup).

Deployment & scale: 2 train nodes, 4 inference replicas at tp=8 (no expert parallelism), max_inflight_rollouts = 512, max_steps = 1000, 65k context. Weight broadcast NCCL timeout 3600s.

Trainer: cp=8 ulysses, muon, router replay on, plus GLM-5.1-style memory knobs (LM-head token chunking, activation checkpointing/offload, optimizer CPU offload, skip-gather / skip-optimizer checkpoints).

Inference: prefix caching, enable_return_routed_experts for replay; no fp8 quant. Eval runs at step 0 and every 20 steps.

Ops: top-level [env_vars] and [inference.env_vars] point Triton, vLLM, and FlashInfer caches at node-local /tmp to avoid shared-FS contention and FlashInfer lock deadlocks; SLURM pre_run_command deletes orphaned prime sandboxes by job label.

Reviewed by Cursor Bugbot for commit 2eea033. Bugbot is set up for automated code reviews on this repo.

v1 RL config: GLM-4.5-Air (zai-org/GLM-4.5-Air, 100B MoE) on scaleswe-v1 (train) + swebench-verified (eval), bash harness on prime sandboxes. Router replay on (trainer.enable_router_replay + inference.enable_return_routed_experts). 2 train + 2 infer nodes. Trainer: cp=8 ulysses, muon, + GLM-5.1 prod-run memory improvements (LM-head chunking, AC + activation offload, optimizer CPU offload, skip-gather/skip-optimizer ckpt). Inference: 2x tp=8 replicas, NO expert parallelism (inference EP + router-replay capture deadlocked the engine via cross-node EP all-to-all). Renderer/parsers auto-resolve from the official slug. Relies on fixes already in main: the glm4_moe routed_experts .contiguous() slice (torch.compile stride assert) and the verifiers always-install-uv bootstrap. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

vLLM's compile cache defaulted to NFS (~/.cache/vllm), which hung inference startup on slow shared FS. Point it at node-local /tmp (matching inference.sbatch.j2). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…45air Inference runs vLLM online fp8 quant (vllm_extra={quantization="fp8"}) over the bf16 policy for faster generation; trainer/inference/orchestrator all use the bf16 zai-org/GLM-4.5-Air. The per-channel GLM-4.5-Air-FP8 checkpoint is incompatible with prime-rl's block-wise fp8 path (use_deep_gemm / quantize_in_weight_transfer), so we use online quant instead. max_inflight_rollouts=384. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Sibling of glm45air_scaleswe.toml with orchestrator.advantage.length_penalty enabled at defaults (coef=0.25, gate_by_correctness=false). Distinct slurm job_name + sandbox labels (glm45air-swe-lp) so it runs alongside the no-penalty run without sharing prime sandboxes. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Cold-cache nodes read the 206GB bf16 model from NFS at ~50s/shard (~46min total). The weight-broadcast store rendezvous default (1200s/20min) times out before inference finishes loading (DistStoreError: 1/17 clients joined), killing the trainer. 3600s covers the cold-load worst case with margin. Applied to both the base and -lp ablation configs. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
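
The timeout choice in this commit message comes down to a short comparison, sketched here using only the figures quoted above. The shard count is not stated in the PR; the value computed below is merely what the two quoted numbers imply.

```python
COLD_LOAD_MINUTES = 46     # "~46min total", from the commit message
SECONDS_PER_SHARD = 50     # "~50s/shard", from the commit message
DEFAULT_TIMEOUT_S = 1200   # default store rendezvous timeout (20 min)
NEW_TIMEOUT_S = 3600       # value set by this commit

cold_load_s = COLD_LOAD_MINUTES * 60
# Derived from the two approximate figures above, not stated in the PR.
implied_shards = cold_load_s / SECONDS_PER_SHARD

print(cold_load_s > DEFAULT_TIMEOUT_S)  # True: a cold load outlasts the default
print(NEW_TIMEOUT_S - cold_load_s)      # 840 seconds of margin at 3600s
```

With roughly 14 minutes of headroom, 3600s covers the quoted worst case, whereas the 1200s default expires less than halfway through the load.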

- Drop router replay (trainer.enable_router_replay + inference.enable_return_routed_experts): mutually exclusive with inference.kv_cache_offload (rl.py validator: external KV cache hits don't carry routed-expert decisions).
- Enable native KV-cache offloading with a 128GB CPU tier (extends the prefix cache).
- FP8 trainer (DeepGEMM blockwise linear/MoE); impl=custom is already set.
- Use the bash_edit harness (bash + local edit tool) for train + eval, replacing pure bash.
- Bump max_inflight_rollouts 384 -> 512.
- Keep LM-head token chunking + activation checkpointing explicit at the values that become trainer defaults in #2867 (not merged yet, so removing them would disable the features).
- Drop the -lp length-penalty variant.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
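
The mutual-exclusion rule mentioned in this commit can be pictured as a small config check. The sketch below is not the rl.py validator: `AblationFlags` is a hypothetical flat dataclass, and the fp8 rule is taken from the changelog's statement that trainer fp8 is also mutually exclusive with router replay.

```python
from dataclasses import dataclass


@dataclass
class AblationFlags:
    """Hypothetical flat view of the relevant switches; not prime-rl's config classes."""

    enable_router_replay: bool = False
    kv_cache_offload: bool = False
    trainer_fp8: bool = False

    def validate(self) -> None:
        # External KV-cache hits carry no routed-expert decisions to replay.
        if self.enable_router_replay and self.kv_cache_offload:
            raise ValueError("router replay is mutually exclusive with KV-cache offload")
        # Per the changelog, trainer fp8 is also incompatible with router replay.
        if self.enable_router_replay and self.trainer_fp8:
            raise ValueError("router replay is mutually exclusive with trainer fp8")


# The final configuration of this PR: router replay on, both other features off.
AblationFlags(enable_router_replay=True).validate()
```

This mirrors the history of the PR: router replay had to be dropped while KV-cache offload and trainer fp8 were enabled, and could return once both were removed.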

Run bf16 inference (remove vllm_extra quantization=fp8). fp8 inference added ~10x mismatch KL (~0.002 vs ~0.0002); bf16 inference lowers it. Trainer fp8 + native KV-cache offload (and the now-disabled router replay) unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
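
The PR does not say how the mismatch KL is computed. One common estimator, shown here purely as an assumption, averages the per-token difference between the log-probability the inference engine assigned to each sampled token and the log-probability the trainer assigns to the same token. `mismatch_kl` and the toy inputs are hypothetical.

```python
def mismatch_kl(inference_logprobs: list[float], trainer_logprobs: list[float]) -> float:
    """Monte Carlo estimate of KL(inference || trainer) over tokens sampled by inference."""
    if not inference_logprobs or len(inference_logprobs) != len(trainer_logprobs):
        raise ValueError("need two non-empty logprob lists of equal length")
    # Mean of log p_inference(token) - log p_trainer(token) over the sampled tokens.
    diffs = [p - q for p, q in zip(inference_logprobs, trainer_logprobs)]
    return sum(diffs) / len(diffs)


# Toy values only, chosen to show the calculation; not measurements from this run.
print(mismatch_kl([-1.0, -2.0], [-1.1, -2.1]))
```

On whatever estimator the run actually uses, the quoted figures (~0.002 with fp8 inference vs ~0.0002 with bf16) differ by about a factor of ten, which is the gap this commit removes.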

seq_len + inference.model.max_model_len 65536 -> 131072. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

- seq_len + inference.model.max_model_len 131072 -> 65536
- Remove inference.kv_cache_offload (native CPU tier)
- max_inflight_rollouts 512 -> 384

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Make the startup eval explicit so SWE-Bench Verified runs at step 0 before any train rollouts (already the default; pinned for clarity). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Remove trainer.model.fp8 (back to bf16 training). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Re-enable trainer.enable_router_replay + inference.enable_return_routed_experts. Compatible again now that kv-cache offload and fp8 (both mutually exclusive with router replay) have been dropped; replaying inference's routed-expert decisions in the trainer cuts the train/inference mismatch by ~an order of magnitude. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

num_infer_replicas 2 -> 4 (num_infer_nodes is per-replica, so total inference nodes = num_infer_nodes * num_infer_replicas = 4). Total job = 2 train + 4 infer. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Increase orchestrator rollout concurrency to better saturate the 4 inference replicas. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
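
The sizing in the two commits above reduces to a few lines of arithmetic. In this sketch the per-replica node count of 1 is inferred from the stated total of 4 inference nodes across 4 replicas, and the even split of rollouts across replicas is an assumption the PR does not state.

```python
NUM_TRAIN_NODES = 2
NUM_INFER_REPLICAS = 4
NUM_INFER_NODES = 1  # per replica; inferred from "total inference nodes = ... = 4"
MAX_INFLIGHT_ROLLOUTS = 512

total_infer_nodes = NUM_INFER_NODES * NUM_INFER_REPLICAS
total_nodes = NUM_TRAIN_NODES + total_infer_nodes
# Assumes rollouts spread evenly over replicas, which the PR does not state.
rollouts_per_replica = MAX_INFLIGHT_ROLLOUTS // NUM_INFER_REPLICAS

print(total_infer_nodes, total_nodes, rollouts_per_replica)  # 4 6 128
```

Under the same even-split assumption, leaving the limit at 384 after doubling the replicas would have given each replica only 96 concurrent rollouts, which is consistent with the stated motivation for the bump.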

FLASHINFER_WORKSPACE_BASE defaulted to $HOME/.cache/flashinfer on shared weka. With concurrent GLM-4.5-Air/MoE runs every TP worker contends on the same fused_moe_*.lock there and deadlocks in uninterruptible (D-state) filesystem I/O during the CUTLASS fused-MoE JIT build, so inference never serves. Pin it to node-local /tmp like the Triton/vLLM caches. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

# Conflicts: # src/prime_rl/templates/inference.sbatch.j2 # src/prime_rl/templates/multi_node_rl.sbatch.j2

Move the cache-dir env vars out of the SLURM templates and into the config so the PR only adds the TOML (the merged env-var feature, #2863, makes this possible):
- [env_vars] TRITON_CACHE_DIR (trainer + inference)
- [inference.env_vars] VLLM_CACHE_ROOT, FLASHINFER_WORKSPACE_BASE

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…nking_retention Main #2900 replaced the renderer `preserve_all_thinking` bool with `thinking_retention`; the merge left the config on the removed field, so it failed the config-load unit test ("No config class could be parsed"). Switch to `thinking_retention = "all"` (the documented equivalent). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

S1ro1 and others added 3 commits June 24, 2026 01:13

S1ro1 force-pushed the exp/glm45air-scaleswe branch from bda8452 to d55678e Compare June 24, 2026 01:41

S1ro1 and others added 5 commits June 24, 2026 03:55

Merge remote-tracking branch 'origin/main' into exp/glm45air-scaleswe

5f3319e

mikasenghaas changed the title ~~feat(v1): GLM-4.5-Air scaleswe SWE ablation config (router replay)~~ exp: GLM-4.5-Air scaleswe SWE ablation (kv-offload + fp8, bash_edit) Jun 25, 2026

S1ro1 and others added 2 commits June 25, 2026 05:13

exp(glm45air-scaleswe): bump context to 131072

acdacb0

mikasenghaas changed the title ~~exp: GLM-4.5-Air scaleswe SWE ablation (kv-offload + fp8, bash_edit)~~ exp: GLM-4.5-Air scaleswe SWE ablation Jun 25, 2026

mikasenghaas and others added 3 commits June 25, 2026 06:19

exp(glm45air-scaleswe): eval at step 0 (explicit skip_first_step=false)

69a868f

exp(glm45air-scaleswe): drop fp8 trainer

cf0c496

mikasenghaas changed the title ~~exp: GLM-4.5-Air scaleswe SWE ablation~~ exp: GLM-4.5-Air scaleswe SWE ablation (bash_edit) Jun 25, 2026

mikasenghaas changed the title ~~exp: GLM-4.5-Air scaleswe SWE ablation (bash_edit)~~ exp: GLM-4.5-Air scaleswe ablation Jun 26, 2026

S1ro1 and others added 2 commits June 26, 2026 21:13

Merge remote-tracking branch 'origin/main' into exp/glm45air-scaleswe

4e70908

mikasenghaas changed the title ~~exp: GLM-4.5-Air scaleswe ablation~~ exp: scaleswe ablation Jun 26, 2026

S1ro1 and others added 2 commits June 26, 2026 21:35

exp(glm45air-scaleswe): bump max_inflight_rollouts 384 -> 512

9f8688e

mikasenghaas force-pushed the exp/glm45air-scaleswe branch from 4a5872b to 9f8688e Compare June 26, 2026 21:43

S1ro1 and others added 4 commits June 26, 2026 23:10

glm45air-scaleswe: bump max_steps 400 -> 1000

35df140

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Merge remote-tracking branch 'origin/main' into exp/glm45air-scaleswe

aef1c7c

# Conflicts: # src/prime_rl/templates/inference.sbatch.j2 # src/prime_rl/templates/multi_node_rl.sbatch.j2

mikasenghaas marked this pull request as ready for review June 29, 2026 21:10

mikasenghaas requested a review from samsja June 29, 2026 21:10

mikasenghaas requested a review from faresobeid June 29, 2026 21:10

faresobeid previously approved these changes Jun 29, 2026

mikasenghaas dismissed faresobeid’s stale review via 2eea033 June 29, 2026 21:19

faresobeid approved these changes Jun 29, 2026

mikasenghaas merged commit 7d5f2d7 into main Jun 29, 2026
18 checks passed

mikasenghaas deleted the exp/glm45air-scaleswe branch June 29, 2026 21:35

exp: scaleswe ablation #2862
mikasenghaas merged 22 commits into main from exp/glm45air-scaleswe
