feat: checkpoint resume + HP sweep for pretraining #172
Open

CalebisGross wants to merge 3 commits into main
Conversation
Force-pushed from bd5144d to 254d004
- Add --resume flag to train_mnemonic_lm.py for crash recovery during multi-day runs. Full checkpoints save model + optimizer + step + RNG state. Legacy model-only checkpoints are also supported.
- Add --run-name, --wandb-group, --wandb-tags for sweep orchestration. Extend wandb config with weight_decay, beta2, spoke_lr_mult, warmup.
- Add sweep_hp.sh: phased hyperparameter sweep script (batch-size test, LR+WD sweep, beta2 interaction, warmup validation). Results are tracked in sweep_results.tsv; all runs are grouped in wandb.
- Add sweep_results.tsv to training/.gitignore.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
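The "results tracked in sweep_results.tsv" bookkeeping could be sketched as follows. This is an illustrative Python sketch, not the PR's shell implementation, and the column names (run_name, status) are assumptions about the TSV schema:

```python
import csv
import os


def completed_runs(tsv_path):
    # Parse the sweep results TSV and return the names of runs that
    # already finished, so a re-run of the sweep can skip them instead
    # of retraining from scratch.
    # Assumed columns: run_name, status (values like "done"/"failed").
    done = set()
    if os.path.exists(tsv_path):
        with open(tsv_path, newline="") as f:
            for row in csv.DictReader(f, delimiter="\t"):
                if row.get("status") == "done":
                    done.add(row["run_name"])
    return done
```

A driver loop would call this once at startup and `continue` past any run name already in the returned set.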
Training script (train_mnemonic_lm.py):
- Full-state checkpoints: model + optimizer + step + losses + RNG
- --resume flag restores all state and continues from the saved step
- --ckpt-dir for per-run checkpoint directories
- Dataloader fast-forwards past already-seen batches on resume

Sweep script (sweep_hp.sh):
- Each run gets its own checkpoint dir (checkpoints/<run_name>/)
- Completed runs are skipped on re-run (checks the TSV)
- Failed runs are retried automatically
- Crashed runs auto-resume from their checkpoint
- Add --accum flag

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
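The full-state checkpoint and resume flow described above can be sketched in PyTorch. This is a minimal illustration under assumed names (save_checkpoint, load_checkpoint, fast_forward are hypothetical helpers, not the script's actual functions):

```python
import torch
import torch.nn as nn


def save_checkpoint(path, model, optimizer, step, losses):
    # Full training state: model + optimizer + step + losses + RNG,
    # so --resume can restore the run exactly where it stopped.
    torch.save({
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "step": step,
        "losses": losses,
        "rng": torch.get_rng_state(),
    }, path)


def load_checkpoint(path, model, optimizer):
    ckpt = torch.load(path, weights_only=False)  # trusted local file
    if isinstance(ckpt, dict) and "optimizer" in ckpt:
        # Full checkpoint: restore everything.
        model.load_state_dict(ckpt["model"])
        optimizer.load_state_dict(ckpt["optimizer"])
        torch.set_rng_state(ckpt["rng"])
        return ckpt["step"], ckpt["losses"]
    # Legacy model-only checkpoint: weights only, counters start fresh.
    model.load_state_dict(ckpt)
    return 0, []


def fast_forward(dataloader, start_step):
    # Skip batches the interrupted run already consumed, so resumed
    # training continues on unseen data within the epoch.
    it = iter(dataloader)
    for _ in range(start_step):
        next(it, None)
    return it
```

Restoring optimizer state and RNG alongside the weights is what makes the resumed run a continuation rather than a warm restart: Adam-style moment estimates and the data-shuffling stream both pick up where they left off.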
Force-pushed from 254d004 to d8a493d
Cap PyTorch VRAM allocation to 90% so out-of-memory becomes a catchable exception instead of triggering the Linux OOM killer.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
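The cap and the now-catchable failure mode can be sketched with PyTorch's per-process memory fraction. The recovery policy shown (empty the cache and skip the step) is an illustration, not necessarily the PR's actual handler:

```python
import torch


def cap_vram(fraction=0.9):
    # Limit this process to 90% of GPU memory. Allocations beyond the
    # cap then raise torch.cuda.OutOfMemoryError inside Python, rather
    # than exhausting memory and inviting the Linux OOM killer.
    if torch.cuda.is_available():
        torch.cuda.set_per_process_memory_fraction(fraction)


def try_step(step_fn, batch):
    # Example recovery: treat OOM as a skippable event, not a crash.
    try:
        return step_fn(batch)
    except torch.cuda.OutOfMemoryError:
        torch.cuda.empty_cache()  # release cached blocks before retrying
        return None
```

Because the exception surfaces in Python, the training loop can log the event, shrink the batch, or checkpoint and exit cleanly instead of being killed mid-write.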
Summary
- --resume last.pt: crash recovery for multi-day pretraining runs (saves full model + optimizer + step + RNG state)
- --run-name, --wandb-group, --wandb-tags CLI args for sweep orchestration
- sweep_hp.sh: phased hyperparameter sweep script (Phase 0: batch size, Phase 1: LR+WD, Phase 2: beta2, Phase 3: warmup)
- Results in sweep_results.tsv (gitignored); all runs grouped in wandb
- Prerequisite for full 100M pretraining: autoresearch moved before pretrain in epic #149 Phase 2.
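A minimal sketch of the new CLI surface, assuming argparse. The flag names come from this PR; the defaults, help strings, and the decision to make --resume take a checkpoint path are assumptions:

```python
import argparse


def build_parser():
    # Sketch of the pretraining CLI flags added in this PR.
    p = argparse.ArgumentParser()
    p.add_argument("--resume", default=None,
                   help="checkpoint to resume from (e.g. last.pt); restores "
                        "model/optimizer/step/RNG state")
    p.add_argument("--ckpt-dir", default="checkpoints",
                   help="per-run checkpoint directory")
    p.add_argument("--run-name", default=None,
                   help="wandb run name, used by the sweep script")
    p.add_argument("--wandb-group", default=None,
                   help="group all sweep runs together in wandb")
    p.add_argument("--wandb-tags", default="",
                   help="comma-separated tags (parsing is an assumption)")
    p.add_argument("--accum", type=int, default=1,
                   help="gradient accumulation steps")
    return p
```

The sweep script would then vary these flags per run, passing a distinct --run-name and --ckpt-dir so each configuration checkpoints and resumes independently.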
Test plan
- sweep_hp.sh --dry-run --phase 1 previews the correct commands

🤖 Generated with Claude Code