feat: checkpoint resume + HP sweep for pretraining #172

Open

CalebisGross wants to merge 3 commits into main from feat/pretrain-hp-sweep

Conversation

@CalebisGross
Collaborator

Summary

  • Adds --resume last.pt crash recovery for multi-day pretraining runs (saves full model + optimizer + step + RNG state)
  • Adds --run-name, --wandb-group, --wandb-tags CLI args for sweep orchestration
  • Adds sweep_hp.sh: phased hyperparameter sweep script (Phase 0: batch size, Phase 1: LR+WD, Phase 2: beta2, Phase 3: warmup)
  • Results tracked in sweep_results.tsv (gitignored), all runs grouped in wandb

Prerequisite for the full 100M pretraining run; autoresearch was moved ahead of pretrain in epic #149 Phase 2.
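The full-state checkpoint format described above can be sketched roughly as follows. This is a minimal pure-Python stand-in (pickle + the `random` module) rather than the actual `torch.save`-based implementation; the saved fields (model, optimizer, step, losses, RNG state) and the legacy model-only fallback mirror the PR description, while the function names and file layout are illustrative:

```python
import pickle
import random

def save_checkpoint(path, model_state, optim_state, step, losses):
    """Save everything needed to resume: weights, optimizer, step, losses, RNG."""
    ckpt = {
        "model": model_state,
        "optimizer": optim_state,
        "step": step,
        "losses": losses,
        # the torch version would also capture torch / CUDA RNG state here
        "rng_state": random.getstate(),
    }
    with open(path, "wb") as f:
        pickle.dump(ckpt, f)

def load_checkpoint(path):
    """Restore full state; legacy model-only checkpoints fall back to defaults."""
    with open(path, "rb") as f:
        ckpt = pickle.load(f)
    if "optimizer" not in ckpt:
        # legacy model-only checkpoint: optimizer resets, training restarts at step 0
        return ckpt.get("model", ckpt), None, 0, []
    random.setstate(ckpt["rng_state"])  # resume the RNG stream exactly where it left off
    return ckpt["model"], ckpt["optimizer"], ckpt["step"], ckpt["losses"]
```

Restoring the RNG state is what makes the resumed loss curve continue smoothly instead of resetting, since data shuffling and dropout draws replay identically.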

Test plan

  • Resume round-trip verified: 20 steps → save → resume → 20 more steps, loss continued from ~8.7 (not reset to ~11)
  • Legacy model-only checkpoints load correctly (optimizer resets)
  • sweep_hp.sh --dry-run --phase 1 previews correct commands
  • Phase 0 batch size test
  • Phase 1-3 sweep runs

🤖 Generated with Claude Code

@CalebisGross CalebisGross force-pushed the feat/pretrain-hp-sweep branch 2 times, most recently from bd5144d to 254d004 on March 17, 2026 at 22:24
CalebisGross and others added 2 commits March 17, 2026 20:02
- Add --resume flag to train_mnemonic_lm.py for crash recovery during
  multi-day runs. Full checkpoints save model + optimizer + step + RNG
  state. Legacy model-only checkpoints also supported.
- Add --run-name, --wandb-group, --wandb-tags for sweep orchestration.
  Extend wandb config with weight_decay, beta2, spoke_lr_mult, warmup.
- Add sweep_hp.sh: phased hyperparameter sweep script (batch size test,
  LR+WD sweep, beta2 interaction, warmup validation). Results tracked
  in sweep_results.tsv, all runs grouped in wandb.
- Add sweep_results.tsv to training/.gitignore.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Training script (train_mnemonic_lm.py):
- Full-state checkpoints: model + optimizer + step + losses + RNG
- --resume flag restores all state and continues from saved step
- --ckpt-dir for per-run checkpoint directories
- Dataloader fast-forwards past already-seen batches on resume
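The dataloader fast-forward can be sketched like this, assuming the loader yields batches in a deterministic order (which the restored RNG state guarantees for any shuffling). The helper name is illustrative, not taken from the PR:

```python
from itertools import islice

def resume_batches(dataloader, start_step):
    """Skip batches already consumed before the crash.

    With the RNG state restored from the checkpoint, the shuffle order is
    reproduced exactly, so skipping the first `start_step` batches puts the
    iterator back where training left off.
    """
    return islice(iter(dataloader), start_step, None)
```

A fast-forward like this re-reads (and discards) the skipped batches, which is simple but O(steps); samplers that can seek directly are faster for very long runs.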

Sweep script (sweep_hp.sh):
- Each run gets its own checkpoint dir (checkpoints/<run_name>/)
- Completed runs skipped on re-run (checks TSV)
- Failed runs retried automatically
- Crashed runs auto-resume from checkpoint
- Add --accum flag

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
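The completed-run skip in sweep_hp.sh might look roughly like this. The exact TSV layout is an assumption (run name in the first column); only the file name sweep_results.tsv comes from the PR:

```shell
#!/usr/bin/env bash
# Skip a sweep run if its name already appears in column 1 of the results TSV.
RESULTS=sweep_results.tsv

run_done() {
    local name="$1"
    [ -f "$RESULTS" ] && cut -f1 "$RESULTS" | grep -qx "$name"
}

if run_done "lr3e-4_wd0.1"; then
    echo "skipping lr3e-4_wd0.1 (already in $RESULTS)"
fi
```

Checking the TSV before launching makes the sweep idempotent: re-running the script after a crash only executes the runs that have not yet recorded a result.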
@CalebisGross CalebisGross force-pushed the feat/pretrain-hp-sweep branch from 254d004 to d8a493d on March 18, 2026 at 00:37
Cap PyTorch VRAM allocation to 90% so out-of-memory becomes a
catchable exception instead of triggering the Linux OOM killer.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
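PyTorch exposes this cap as `torch.cuda.set_per_process_memory_fraction`; allocations past the fraction then raise a catchable OOM error rather than exhausting the device and inviting the kernel OOM killer. A guarded sketch (the helper name is illustrative, and the import guard is only so the sketch degrades gracefully where torch or CUDA is absent):

```python
def cap_vram(fraction=0.9):
    """Cap per-process CUDA allocation so over-allocation raises instead of
    letting the process grow until the Linux OOM killer terminates it.

    Returns True if the cap was applied, False if torch/CUDA is unavailable.
    """
    try:
        import torch
    except ImportError:
        return False
    if torch.cuda.is_available():
        torch.cuda.set_per_process_memory_fraction(fraction)
        return True
    return False
```

With the cap in place, the training loop can wrap its step in a try/except for the CUDA out-of-memory error and checkpoint or shrink the batch instead of dying.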