exp: scaleswe ablation #2862

Merged
mikasenghaas merged 22 commits into main from exp/glm45air-scaleswe
Jun 29, 2026

@S1ro1 S1ro1 commented Jun 24, 2026

Collaborator

What

configs/debug/v1/glm45air_scaleswe.toml — a v1 RL ablation for GLM-4.5-Air (100B MoE) on the scaleswe-v1 SWE taskset (train) + swebench-verified (eval), bash_edit harness on prime sandboxes.

This PR only adds the config TOML — no source/template changes.

Config highlights

  • 2 train + 4 infer nodes. Trainer: cp=8 (ulysses), muon, + the GLM-5.1 prod-run memory improvements (LM-head token chunking, activation checkpointing + offload max_inflight=1, optimizer CPU offload, skip-gather/skip-optimizer ckpt).
  • Inference: 4× tp=8 replicas, no expert parallelism. No fp8 quant on the inference side (dropped to lower train/inference mismatch); prefix caching on.
  • Router replay on (trainer.enable_router_replay + inference.enable_return_routed_experts): replays inference's routed-expert decisions in the trainer to cut train/inference mismatch.
  • 65536 context (seq_len + max_model_len), max_inflight_rollouts = 512.
  • Node-local cache dirs via [env_vars] (see 2026-06-29 below) — no SLURM template edits.
  • Eval: SWE-Bench Verified at step 0 + every 20 steps (skip_first_step = false).
  • Renderer (glm-4.5) and tool/reasoning parsers auto-resolve from the official zai-org/GLM-4.5-Air slug.

Changelog

2026-06-29

  • Merged latest main (now includes the configurable env-vars feature, feat: configurable env vars #2863).

  • Moved the node-local cache dirs out of the SLURM templates and into the config, so the PR only adds the TOML:

    • top-level [env_vars]: TRITON_CACHE_DIR (trainer + inference)
    • [inference.env_vars]: VLLM_CACHE_ROOT, FLASHINFER_WORKSPACE_BASE

    The SLURM templates are now untouched vs main. Values keep the shell-expanded per-user/per-job paths ($USER/$SLURM_JOB_ID/$(whoami)); the multi-node launcher renders env_vars as export KEY="VALUE", so the expansion still happens at runtime.

  • Migrated the renderer field preserve_all_thinking = true → thinking_retention = "all" (#2900 on main, "Update configs for renderer thinking retention", removed the old bool; with the old field the config failed the config-load unit test).
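
The `export KEY="VALUE"` rendering mentioned in the 2026-06-29 entry can be sketched as follows. This is a minimal illustration, not the launcher's actual code: `render_exports` is a hypothetical helper, and the cache paths are placeholders (the PR only states that the values sit under node-local /tmp and use `$USER`, `$SLURM_JOB_ID`, and `$(whoami)`). It shows why double-quoted values still expand when the job script runs.

```python
def render_exports(env_vars: dict[str, str]) -> str:
    """Render env vars as `export KEY="VALUE"` lines (hypothetical helper).

    Double quotes let the shell expand $VAR and $(cmd) when the script runs;
    single quotes would pass the text through literally.
    """
    return "\n".join(f'export {key}="{value}"' for key, value in env_vars.items())


# Placeholder paths for illustration only; the real values are in the TOML.
env_vars = {
    "TRITON_CACHE_DIR": "/tmp/$USER/triton/$SLURM_JOB_ID",
    "FLASHINFER_WORKSPACE_BASE": "/tmp/$(whoami)/flashinfer",
}

script = render_exports(env_vars)
print(script)
```

Because the values are stored unexpanded in the config, each node resolves its own user and job ID at launch, which is what keeps the cache directories per-job and node-local.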

2026-06-26

  • Brought back router replay (trainer.enable_router_replay + inference.enable_return_routed_experts) — compatible again now that kv-cache offload and trainer fp8 (both mutually exclusive with router replay) have been dropped; replaying inference's routed-expert decisions cuts the train/inference mismatch.
  • Merged main (drops removed envs; bumps verifiers + research-environments).
  • Scaled inference to 4 nodes / 4 replicas (num_infer_replicas 2 → 4).
  • Bumped max_inflight_rollouts 384 → 512 (saturate the 4 inference replicas).
  • Fixed FlashInfer JIT cache deadlock: pin FLASHINFER_WORKSPACE_BASE to node-local /tmp (was defaulting to shared weka $HOME/.cache/flashinfer, where concurrent GLM/MoE runs deadlock on fused_moe_*.lock → inference never serves). (2026-06-29: relocated from the sbatch templates into the config.)

2026-06-25

  • Merged main.
  • Dropped router replay (trainer.enable_router_replay + inference.enable_return_routed_experts).
  • Switched train + eval harness bash → bash_edit.
  • Dropped inference fp8 quant (vllm_extra.quantization) to lower train/inference mismatch.
  • Eval at step 0 + every 20 steps (skip_first_step = false).
  • Kept LM-head chunking + activation checkpointing explicit, at the values that become trainer defaults in the upcoming #2867 (feat: add orchestrator debug mode (no-inference, no-trainer)).
  • Dropped the -lp length-penalty variant (glm45air_scaleswe_lp.toml).

2026-06-24

  • Initial config: GLM-4.5-Air on scaleswe-v1 (train) + swebench-verified-v1 (eval), bash harness on the prime-sandbox runtime, router replay enabled, cp=8 ulysses + muon + GLM-5.1 prod-run memory knobs.
  • Online fp8 inference (vllm_extra.quantization = "fp8") + 384 rollout concurrency.
  • Bumped weight_broadcast NCCL timeout to 3600s for cold NFS loads.

Note

Low Risk
Config-only addition with no application or template changes; affects experiment launch parameters only.

Overview
Adds configs/debug/v1/glm45air_scaleswe.toml only — a v1 RL experiment wiring GLM-4.5-Air to scaleswe-v1 training and SWE-Bench Verified eval on prime sandboxes with the bash_edit harness (not the base scaleswe rlm setup).

Deployment & scale: 2 train nodes, 4 inference replicas at tp=8 (no expert parallelism), max_inflight_rollouts = 512, max_steps = 1000, 65k context. Weight broadcast NCCL timeout 3600s.

Trainer: cp=8 ulysses, muon, router replay on, plus GLM-5.1-style memory knobs (LM-head token chunking, activation checkpointing/offload, optimizer CPU offload, skip-gather / skip-optimizer checkpoints).

Inference: prefix caching, enable_return_routed_experts for replay; no fp8 quant. Eval runs at step 0 and every 20 steps.

Ops: top-level [env_vars] and [inference.env_vars] point Triton, vLLM, and FlashInfer caches at node-local /tmp to avoid shared-FS contention and FlashInfer lock deadlocks; SLURM pre_run_command deletes orphaned prime sandboxes by job label.

Reviewed by Cursor Bugbot for commit 2eea033.

S1ro1 and others added 3 commits June 24, 2026 01:13
v1 RL config: GLM-4.5-Air (zai-org/GLM-4.5-Air, 100B MoE) on scaleswe-v1 (train)
+ swebench-verified (eval), bash harness on prime sandboxes. Router replay on
(trainer.enable_router_replay + inference.enable_return_routed_experts).

2 train + 2 infer nodes. Trainer: cp=8 ulysses, muon, + GLM-5.1 prod-run memory
improvements (LM-head chunking, AC + activation offload, optimizer CPU offload,
skip-gather/skip-optimizer ckpt). Inference: 2x tp=8 replicas, NO expert
parallelism (inference EP + router-replay capture deadlocked the engine via
cross-node EP all-to-all). Renderer/parsers auto-resolve from the official slug.

Relies on fixes already in main: the glm4_moe routed_experts .contiguous() slice
(torch.compile stride assert) and the verifiers always-install-uv bootstrap.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
vLLM's compile cache defaulted to NFS (~/.cache/vllm), which hung inference
startup on slow shared FS. Point it at node-local /tmp (matching inference.sbatch.j2).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…45air

Inference runs vLLM online fp8 quant (vllm_extra={quantization="fp8"}) over the
bf16 policy for faster generation; trainer/inference/orchestrator all use the
bf16 zai-org/GLM-4.5-Air. The per-channel GLM-4.5-Air-FP8 checkpoint is
incompatible with prime-rl's block-wise fp8 path (use_deep_gemm /
quantize_in_weight_transfer), so we use online quant instead. max_inflight_rollouts=384.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@S1ro1 S1ro1 force-pushed the exp/glm45air-scaleswe branch from bda8452 to d55678e June 24, 2026 01:41
S1ro1 and others added 5 commits June 24, 2026 03:55
Sibling of glm45air_scaleswe.toml with orchestrator.advantage.length_penalty
enabled at defaults (coef=0.25, gate_by_correctness=false). Distinct slurm
job_name + sandbox labels (glm45air-swe-lp) so it runs alongside the no-penalty
run without sharing prime sandboxes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Cold-cache nodes read the 206GB bf16 model from NFS at ~50s/shard (~46min
total). The weight-broadcast store rendezvous default (1200s/20min) times out
before inference finishes loading (DistStoreError: 1/17 clients joined), killing
the trainer. 3600s covers the cold-load worst case with margin. Applied to both
the base and -lp ablation configs.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
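
The timeout reasoning in the commit message above can be checked with a few lines of arithmetic. This is only a sanity check on the rounded figures quoted there (~50s/shard, ~46min total, 1200s default, 3600s new value); the constant names are illustrative and the shard count is inferred, not stated in the PR.

```python
SECONDS_PER_SHARD = 50        # "~50s/shard" from the commit message
COLD_LOAD_SECONDS = 46 * 60   # "~46min total"
DEFAULT_TIMEOUT = 1200        # default rendezvous timeout (20 min)
NEW_TIMEOUT = 3600            # value set in this PR

# Implied shard count (approximate, since both inputs are rounded).
approx_shards = round(COLD_LOAD_SECONDS / SECONDS_PER_SHARD)

# The default expires before a cold load finishes; the new value does not.
times_out_by_default = COLD_LOAD_SECONDS > DEFAULT_TIMEOUT
margin_seconds = NEW_TIMEOUT - COLD_LOAD_SECONDS

print(approx_shards, times_out_by_default, margin_seconds)
```

With these figures the cold load overshoots the default by roughly 26 minutes, while 3600s leaves about 14 minutes of headroom.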
- Drop router replay (trainer.enable_router_replay + inference.enable_return_routed_experts):
  mutually exclusive with inference.kv_cache_offload (rl.py validator — external KV cache hits
  don't carry routed-expert decisions).
- Enable native KV-cache offloading with a 128GB CPU tier (extends the prefix cache).
- FP8 trainer (DeepGEMM blockwise linear/MoE) — impl=custom is already set.
- Use the bash_edit harness (bash + local edit tool) for train + eval, replacing pure bash.
- Bump max_inflight_rollouts 384 -> 512.
- Keep LM-head token chunking + activation checkpointing explicit at the values that become
  trainer defaults in #2867 (not merged yet, so removing them would disable the features).
- Drop the -lp length-penalty variant.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- Drop router replay (trainer.enable_router_replay + inference.enable_return_routed_experts):
  mutually exclusive with inference.kv_cache_offload (rl.py validator — external KV cache hits
  don't carry routed-expert decisions).
- Enable native KV-cache offloading with a 128GB CPU tier (extends the prefix cache).
- FP8 trainer (DeepGEMM blockwise linear/MoE) — impl=custom is already set.
- Use the bash_edit harness (bash + local edit tool) for train + eval, replacing pure bash.
- Bump max_inflight_rollouts 384 -> 512.
- Keep LM-head token chunking + activation checkpointing explicit at the values that become
  trainer defaults in #2867 (not merged yet, so removing them would disable the features).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@mikasenghaas mikasenghaas changed the title "feat(v1): GLM-4.5-Air scaleswe SWE ablation config (router replay)" → "exp: GLM-4.5-Air scaleswe SWE ablation (kv-offload + fp8, bash_edit)" Jun 25, 2026
S1ro1 and others added 2 commits June 25, 2026 05:13
Run bf16 inference (remove vllm_extra quantization=fp8). fp8 inference added
~10x mismatch KL (~0.002 vs ~0.0002); bf16 inference lowers it. Trainer fp8 +
native KV-cache offload (and the now-disabled router replay) unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
seq_len + inference.model.max_model_len 65536 -> 131072.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@mikasenghaas mikasenghaas changed the title "exp: GLM-4.5-Air scaleswe SWE ablation (kv-offload + fp8, bash_edit)" → "exp: GLM-4.5-Air scaleswe SWE ablation" Jun 25, 2026
mikasenghaas and others added 3 commits June 25, 2026 06:19
- seq_len + inference.model.max_model_len 131072 -> 65536
- Remove inference.kv_cache_offload (native CPU tier)
- max_inflight_rollouts 512 -> 384

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Make the startup eval explicit so SWE-Bench Verified runs at step 0 before
any train rollouts (already the default; pinned for clarity).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Remove trainer.model.fp8 (back to bf16 training).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@mikasenghaas mikasenghaas changed the title "exp: GLM-4.5-Air scaleswe SWE ablation" → "exp: GLM-4.5-Air scaleswe SWE ablation (bash_edit)" Jun 25, 2026
@mikasenghaas mikasenghaas changed the title "exp: GLM-4.5-Air scaleswe SWE ablation (bash_edit)" → "exp: GLM-4.5-Air scaleswe ablation" Jun 26, 2026
S1ro1 and others added 2 commits June 26, 2026 21:13
Re-enable trainer.enable_router_replay + inference.enable_return_routed_experts.
Compatible again now that kv-cache offload and fp8 (both mutually exclusive with
router replay) have been dropped; replaying inference's routed-expert decisions in
the trainer cuts the train/inference mismatch by ~an order of magnitude.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
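
The compatibility rule behind this re-enable (router replay cannot be combined with kv-cache offload or fp8) can be sketched as a small check. This is an assumption-laden illustration, not the validator in rl.py: `AblationFlags` and `check_router_replay` are hypothetical names, and the real config classes are nested rather than flat.

```python
from dataclasses import dataclass


@dataclass
class AblationFlags:
    """Hypothetical flat view of the relevant flags (not the real config classes)."""

    enable_router_replay: bool = False
    kv_cache_offload: bool = False
    trainer_fp8: bool = False


def check_router_replay(flags: AblationFlags) -> None:
    """Raise if router replay is combined with an option the PR lists as exclusive."""
    if not flags.enable_router_replay:
        return
    conflicts = [
        name
        for name, enabled in (
            ("kv_cache_offload", flags.kv_cache_offload),
            ("trainer_fp8", flags.trainer_fp8),
        )
        if enabled
    ]
    if conflicts:
        raise ValueError(f"router replay is incompatible with: {', '.join(conflicts)}")


# Final state of this PR: replay on, offload and fp8 dropped, so this passes.
check_router_replay(AblationFlags(enable_router_replay=True))
```

The 2026-06-25 configuration (replay requested alongside kv-cache offload) would raise under this sketch, which mirrors why replay had to be dropped at that point.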
@mikasenghaas mikasenghaas changed the title "exp: GLM-4.5-Air scaleswe ablation" → "exp: scaleswe ablation" Jun 26, 2026
S1ro1 and others added 2 commits June 26, 2026 21:35
num_infer_replicas 2 -> 4 (num_infer_nodes is per-replica, so total inference
nodes = num_infer_nodes * num_infer_replicas = 4). Total job = 2 train + 4 infer.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
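
The node arithmetic in the commit message above, written out as a short sketch. `total_nodes` and `num_train_nodes` are illustrative names, and `num_infer_nodes = 1` is inferred from the stated total (4 = num_infer_nodes * 4) rather than quoted from the config.

```python
def total_nodes(num_train_nodes: int, num_infer_nodes: int, num_infer_replicas: int) -> int:
    """Total job size when num_infer_nodes is counted per replica."""
    return num_train_nodes + num_infer_nodes * num_infer_replicas


# Before and after the replica bump described in the commit message.
before = total_nodes(num_train_nodes=2, num_infer_nodes=1, num_infer_replicas=2)
after = total_nodes(num_train_nodes=2, num_infer_nodes=1, num_infer_replicas=4)
print(before, after)
```

Since the per-replica node count is multiplied by the replica count, doubling the replicas doubles the inference nodes without touching `num_infer_nodes`.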
Increase orchestrator rollout concurrency to better saturate the 4 inference replicas.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@mikasenghaas mikasenghaas force-pushed the exp/glm45air-scaleswe branch from 4a5872b to 9f8688e June 26, 2026 21:43
S1ro1 and others added 4 commits June 26, 2026 23:10
FLASHINFER_WORKSPACE_BASE defaulted to $HOME/.cache/flashinfer on shared weka.
With concurrent GLM-4.5-Air/MoE runs every TP worker contends on the same
fused_moe_*.lock there and deadlocks in uninterruptible (D-state) filesystem
I/O during the CUTLASS fused-MoE JIT build, so inference never serves. Pin it
to node-local /tmp like the Triton/vLLM caches.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
# Conflicts:
#	src/prime_rl/templates/inference.sbatch.j2
#	src/prime_rl/templates/multi_node_rl.sbatch.j2
Move the cache-dir env vars out of the SLURM templates and into the config
so the PR only adds the TOML (the merged env-var feature, #2863, makes this
possible):
- [env_vars] TRITON_CACHE_DIR (trainer + inference)
- [inference.env_vars] VLLM_CACHE_ROOT, FLASHINFER_WORKSPACE_BASE

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@mikasenghaas mikasenghaas marked this pull request as ready for review June 29, 2026 21:10
@mikasenghaas mikasenghaas requested a review from samsja June 29, 2026 21:10
@mikasenghaas mikasenghaas requested a review from faresobeid June 29, 2026 21:10
faresobeid previously approved these changes Jun 29, 2026
…nking_retention

Main #2900 replaced the renderer `preserve_all_thinking` bool with
`thinking_retention`; the merge left the config on the removed field, so it
failed the config-load unit test ("No config class could be parsed"). Switch
to `thinking_retention = "all"` (the documented equivalent).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@mikasenghaas mikasenghaas merged commit 7d5f2d7 into main Jun 29, 2026
18 checks passed
@mikasenghaas mikasenghaas deleted the exp/glm45air-scaleswe branch June 29, 2026 21:35