feat: checkpoint resume + HP sweep for pretraining #172

Open

CalebisGross wants to merge 3 commits into main from feat/pretrain-hp-sweep

Conversation

@CalebisGross
Collaborator

Summary

  • Adds --resume last.pt crash recovery for multi-day pretraining runs (saves full model + optimizer + step + RNG state)
  • Adds --run-name, --wandb-group, --wandb-tags CLI args for sweep orchestration
  • Adds sweep_hp.sh: phased hyperparameter sweep script (Phase 0: batch size, Phase 1: LR+WD, Phase 2: beta2, Phase 3: warmup)
  • Results tracked in sweep_results.tsv (gitignored), all runs grouped in wandb

Prerequisite for the full 100M pretraining run; autoresearch was moved ahead of pretrain in epic #149 Phase 2.
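The full-state checkpoint format described above can be sketched roughly as follows. This is a minimal pure-Python stand-in (pickle + the `random` module) rather than the actual `torch.save`-based implementation; the saved fields (model, optimizer, step, losses, RNG state) and the legacy model-only fallback mirror the PR description, while the function names and file layout are illustrative:

```python
import pickle
import random

def save_checkpoint(path, model_state, optim_state, step, losses):
    """Save everything needed to resume: weights, optimizer, step, losses, RNG."""
    ckpt = {
        "model": model_state,
        "optimizer": optim_state,
        "step": step,
        "losses": losses,
        # the torch version would also capture torch / CUDA RNG state here
        "rng_state": random.getstate(),
    }
    with open(path, "wb") as f:
        pickle.dump(ckpt, f)

def load_checkpoint(path):
    """Restore full state; legacy model-only checkpoints fall back to defaults."""
    with open(path, "rb") as f:
        ckpt = pickle.load(f)
    if "optimizer" not in ckpt:
        # legacy model-only checkpoint: optimizer resets, training restarts at step 0
        return ckpt.get("model", ckpt), None, 0, []
    random.setstate(ckpt["rng_state"])  # resume the RNG stream exactly where it left off
    return ckpt["model"], ckpt["optimizer"], ckpt["step"], ckpt["losses"]
```

Restoring the RNG state is what makes the resumed loss curve continue smoothly instead of resetting, since data shuffling and dropout draws replay identically.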

Test plan

  • Resume round-trip verified: 20 steps → save → resume → 20 more steps, loss continued from ~8.7 (not reset to ~11)
  • Legacy model-only checkpoints load correctly (optimizer resets)
  • sweep_hp.sh --dry-run --phase 1 previews correct commands
  • Phase 0 batch size test
  • Phase 1-3 sweep runs

🤖 Generated with Claude Code

@CalebisGross CalebisGross force-pushed the feat/pretrain-hp-sweep branch 2 times, most recently from bd5144d to 254d004 on March 17, 2026 at 22:24
CalebisGross and others added 2 commits March 17, 2026 20:02
- Add --resume flag to train_mnemonic_lm.py for crash recovery during
  multi-day runs. Full checkpoints save model + optimizer + step + RNG
  state. Legacy model-only checkpoints also supported.
- Add --run-name, --wandb-group, --wandb-tags for sweep orchestration.
  Extend wandb config with weight_decay, beta2, spoke_lr_mult, warmup.
- Add sweep_hp.sh: phased hyperparameter sweep script (batch size test,
  LR+WD sweep, beta2 interaction, warmup validation). Results tracked
  in sweep_results.tsv, all runs grouped in wandb.
- Add sweep_results.tsv to training/.gitignore.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Training script (train_mnemonic_lm.py):
- Full-state checkpoints: model + optimizer + step + losses + RNG
- --resume flag restores all state and continues from saved step
- --ckpt-dir for per-run checkpoint directories
- Dataloader fast-forwards past already-seen batches on resume
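The dataloader fast-forward can be sketched like this, assuming the loader yields batches in a deterministic order (which the restored RNG state guarantees for any shuffling). The helper name is illustrative, not taken from the PR:

```python
from itertools import islice

def resume_batches(dataloader, start_step):
    """Skip batches already consumed before the crash.

    With the RNG state restored from the checkpoint, the shuffle order is
    reproduced exactly, so skipping the first `start_step` batches puts the
    iterator back where training left off.
    """
    return islice(iter(dataloader), start_step, None)
```

A fast-forward like this re-reads (and discards) the skipped batches, which is simple but O(steps); samplers that can seek directly are faster for very long runs.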

Sweep script (sweep_hp.sh):
- Each run gets its own checkpoint dir (checkpoints/<run_name>/)
- Completed runs skipped on re-run (checks TSV)
- Failed runs retried automatically
- Crashed runs auto-resume from checkpoint
- Add --accum flag

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
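The completed-run skip in sweep_hp.sh might look roughly like this. The exact TSV layout is an assumption (run name in the first column); only the file name sweep_results.tsv comes from the PR:

```shell
#!/usr/bin/env bash
# Skip a sweep run if its name already appears in column 1 of the results TSV.
RESULTS=sweep_results.tsv

run_done() {
    local name="$1"
    [ -f "$RESULTS" ] && cut -f1 "$RESULTS" | grep -qx "$name"
}

if run_done "lr3e-4_wd0.1"; then
    echo "skipping lr3e-4_wd0.1 (already in $RESULTS)"
fi
```

Checking the TSV before launching makes the sweep idempotent: re-running the script after a crash only executes the runs that have not yet recorded a result.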
@CalebisGross CalebisGross force-pushed the feat/pretrain-hp-sweep branch from 254d004 to d8a493d on March 18, 2026 at 00:37
Cap PyTorch VRAM allocation to 90% so out-of-memory becomes a
catchable exception instead of triggering the Linux OOM killer.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
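PyTorch exposes this cap as `torch.cuda.set_per_process_memory_fraction`; allocations past the fraction then raise a catchable OOM error rather than exhausting the device and inviting the kernel OOM killer. A guarded sketch (the helper name is illustrative, and the import guard is only so the sketch degrades gracefully where torch or CUDA is absent):

```python
def cap_vram(fraction=0.9):
    """Cap per-process CUDA allocation so over-allocation raises instead of
    letting the process grow until the Linux OOM killer terminates it.

    Returns True if the cap was applied, False if torch/CUDA is unavailable.
    """
    try:
        import torch
    except ImportError:
        return False
    if torch.cuda.is_available():
        torch.cuda.set_per_process_memory_fraction(fraction)
        return True
    return False
```

With the cap in place, the training loop can wrap its step in a try/except for the CUDA out-of-memory error and checkpoint or shrink the batch instead of dying.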