Add set_epoch hook to TrainStepperABC by mcgibbon · Pull Request #1233 · ai2cm/ace

mcgibbon · 2026-06-06T11:53:12Z

Adds a TrainStepperABC.set_epoch(epoch) hook with a no-op default and
wires it from the trainer at fresh-epoch boundaries (mid-epoch resume
preserves in-module state so partial-epoch accumulators continue from
where they left off).

Stepper and CoupledStepper implement set_epoch by walking submodules
and invoking request_latent_global_mean_envelope_reset where present,
giving model components a way to reset per-epoch in-module statistics
without coupling the stepper to model internals.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
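
A minimal sketch of how the hook described above could fit together. It is not the code in this PR: `Module` is a pure-Python stand-in for `torch.nn.Module`, and `EnvelopeTracker`, `train`, and the `resumed_mid_epoch` flag are hypothetical names used only for illustration. Only `TrainStepperABC.set_epoch` and `request_latent_global_mean_envelope_reset` come from the description.

```python
from abc import ABC
from typing import Iterator, List


class Module:
    """Stand-in for torch.nn.Module (assumption): only tracks child modules."""

    def __init__(self) -> None:
        self.children: List["Module"] = []

    def modules(self) -> Iterator["Module"]:
        # Yield this module and every descendant, like nn.Module.modules().
        yield self
        for child in self.children:
            yield from child.modules()


class TrainStepperABC(ABC):
    def set_epoch(self, epoch: int) -> None:
        """No-op default, so existing steppers need no changes."""
        return None


class EnvelopeTracker(Module):
    """Hypothetical model component holding per-epoch statistics."""

    def __init__(self) -> None:
        super().__init__()
        self.reset_requested = False

    def request_latent_global_mean_envelope_reset(self) -> None:
        self.reset_requested = True


class Stepper(TrainStepperABC):
    def __init__(self, module: Module) -> None:
        self.module = module

    def set_epoch(self, epoch: int) -> None:
        # Duck-typed walk: the stepper never imports model internals.
        for submodule in self.module.modules():
            reset = getattr(
                submodule, "request_latent_global_mean_envelope_reset", None
            )
            if callable(reset):
                reset()


def train(
    stepper: TrainStepperABC,
    start_epoch: int,
    max_epochs: int,
    resumed_mid_epoch: bool = False,
) -> List[int]:
    """Hypothetical trainer loop; returns the epochs that triggered the hook."""
    notified = []
    for epoch in range(start_epoch, max_epochs):
        # Skip the hook when resuming part-way through an epoch so that
        # partial-epoch accumulators keep their state.
        if not (resumed_mid_epoch and epoch == start_epoch):
            stepper.set_epoch(epoch)
            notified.append(epoch)
        # ... run the training batches for this epoch here ...
    return notified
```

Looking the method up with `getattr` keeps the stepper decoupled: any submodule that defines the method is notified, and everything else is skipped.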

When enabled, the per-channel spatial mean of the post-encoder latent is tracked during training. In eval, the latent is shifted so that its mean falls within the observed envelope (a no-op when the mean is already inside it). This bounds the global mean of the latent that the transformer blocks see at inference to the range observed in training. The envelope is reset at the start of each training epoch (lazily, on the next training-mode forward) via request_latent_global_mean_envelope_reset, which the stepper invokes through the TrainStepperABC.set_epoch hook. Exposed as a single clip_latent_global_means: bool option on SFNONetConfig and NoiseConditionedSFNOBuilder; defaults to False so existing models are unaffected. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
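
A rough pure-Python sketch of the envelope behaviour described above, for illustration only. The class name `LatentGlobalMeanEnvelope`, the nested-list latent layout (`[channel][spatial point]`), and the unweighted spatial mean are all assumptions; the actual module presumably works on tensors and may weight the mean differently.

```python
from typing import List, Optional

Latent = List[List[float]]  # assumed layout: [channel][spatial point]


class LatentGlobalMeanEnvelope:
    """Hypothetical tracker for the per-channel min/max of the spatial mean."""

    def __init__(self) -> None:
        self.lower: Optional[List[float]] = None
        self.upper: Optional[List[float]] = None
        self.training = True
        self._reset_pending = False

    def request_latent_global_mean_envelope_reset(self) -> None:
        # Lazy reset: only takes effect on the next training-mode forward.
        self._reset_pending = True

    def forward(self, latent: Latent) -> Latent:
        # Unweighted spatial mean per channel (an assumption of this sketch).
        means = [sum(channel) / len(channel) for channel in latent]
        if self.training:
            if self._reset_pending or self.lower is None or self.upper is None:
                # Start a fresh envelope from the current batch.
                self.lower, self.upper = list(means), list(means)
                self._reset_pending = False
            else:
                self.lower = [min(lo, m) for lo, m in zip(self.lower, means)]
                self.upper = [max(hi, m) for hi, m in zip(self.upper, means)]
            return latent
        if self.lower is None or self.upper is None:
            return latent  # nothing observed yet, so nothing to clip to
        shifted = []
        for channel, mean, lo, hi in zip(latent, means, self.lower, self.upper):
            # Shift the whole channel so its mean lands inside [lo, hi];
            # the shift is zero when the mean is already inside.
            shift = min(max(mean, lo), hi) - mean
            shifted.append([value + shift for value in channel])
        return shifted
```

Shifting the whole channel by a constant changes only its mean, so the spatial pattern of the latent is preserved.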

Adds CRPS, SSR bias, and ensemble mean RMSE to the training aggregator when ensemble_metrics=True. This lets us compare train vs val SSR bias to diagnose whether ensemble overconfidence is caused by overfitting. Disabled by default so existing configs are unaffected. Enable via train_aggregator.ensemble_metrics: true in the training YAML.
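
For reference, a small sketch of the three metrics using common textbook definitions. These are assumptions and may differ from what the aggregator actually computes: CRPS uses the standard ensemble estimator, and SSR bias is taken here as the spread-skill ratio minus one, so negative values indicate an overconfident (underdispersive) ensemble. The function name `ensemble_metrics` is hypothetical.

```python
import math
from typing import Dict, Sequence


def ensemble_metrics(
    members: Sequence[Sequence[float]], target: Sequence[float]
) -> Dict[str, float]:
    """Compute CRPS, ensemble-mean RMSE and SSR bias over a set of points.

    `members[k][i]` is member k's prediction at point i. Needs at least
    two members and a non-zero ensemble-mean RMSE.
    """
    n_members = len(members)
    n_points = len(target)
    crps_sum = 0.0
    squared_error_sum = 0.0
    variance_sum = 0.0
    for i in range(n_points):
        values = [member[i] for member in members]
        mean = sum(values) / n_members
        # Ensemble CRPS estimator: E|X - y| - 0.5 * E|X - X'|.
        abs_error = sum(abs(v - target[i]) for v in values) / n_members
        pairwise = sum(abs(a - b) for a in values for b in values) / n_members**2
        crps_sum += abs_error - 0.5 * pairwise
        squared_error_sum += (mean - target[i]) ** 2
        # Unbiased sample variance across members.
        variance_sum += sum((v - mean) ** 2 for v in values) / (n_members - 1)
    rmse = math.sqrt(squared_error_sum / n_points)
    spread = math.sqrt(variance_sum / n_points)
    return {
        "crps": crps_sum / n_points,
        "ensemble_mean_rmse": rmse,
        "ssr_bias": spread / rmse - 1.0,  # assumed definition: SSR - 1
    }
```

Computing the same quantities on training and validation batches is what allows the train-vs-val comparison mentioned above.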

mcgibbon and others added 8 commits June 5, 2026 16:57

add initial 4deg-daily-v1 configs

19b1352

make run-train executable

1ca5339

add other configs to run-train.sh

059b687

Add combined labels+residual+lr-tuning config

ca802e8

Combines all three perturbations: c96-shield training data with conditional model (labels), residual prediction, and lr-tuning schedule starting at lr=0.001. Also enables train_aggregator.ensemble_metrics for SSR bias diagnostics.

Add labels+lr-tuning config (no residual prediction)

e2bde95

Combines conditional SFNO with c96-shield training data and lr-tuning, without residual prediction. Replaces the demoted residual-only run.

mcgibbon closed this Jun 6, 2026

mcgibbon wants to merge 8 commits into main from experiment/2026-06-05-aimip-like
