Add set_epoch hook to TrainStepperABC #1233

Closed
mcgibbon wants to merge 8 commits into main from experiment/2026-06-05-aimip-like

mcgibbon commented Jun 6, 2026

Adds a TrainStepperABC.set_epoch(epoch) hook with a no-op default and
calls it from the trainer at fresh-epoch boundaries. A mid-epoch resume
does not trigger the hook, so in-module state is preserved and
partial-epoch accumulators continue from where they left off.

Stepper and CoupledStepper implement set_epoch by walking submodules
and invoking request_latent_global_mean_envelope_reset where present,
giving model components a way to reset per-epoch in-module statistics
without coupling the stepper to model internals.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
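The hook wiring described above can be sketched roughly as follows. This is a minimal, framework-free illustration, not the code from this PR: `Module`, `EnvelopeLayer`, and `train` are hypothetical stand-ins, the real `TrainStepperABC` has other members that are omitted here, and the exact resume condition used by the trainer is an assumption. With PyTorch, the submodule walk would typically use `nn.Module.modules()`.

```python
from abc import ABC


class Module:
    """Hypothetical stand-in for a framework module that holds child modules."""

    def __init__(self, *children):
        self.children = list(children)

    def modules(self):
        # Yield this module and every descendant, depth-first.
        yield self
        for child in self.children:
            yield from child.modules()


class EnvelopeLayer(Module):
    """Hypothetical component that keeps per-epoch in-module statistics."""

    def __init__(self):
        super().__init__()
        self.reset_requested = False

    def request_latent_global_mean_envelope_reset(self):
        self.reset_requested = True


class TrainStepperABC(ABC):
    def set_epoch(self, epoch: int) -> None:
        """No-op by default; steppers with per-epoch state override this."""


class Stepper(TrainStepperABC):
    def __init__(self, module: Module):
        self.module = module

    def set_epoch(self, epoch: int) -> None:
        # Duck-typed walk: call the reset hook only where a submodule defines
        # it, so the stepper needs no knowledge of model internals.
        for sub in self.module.modules():
            reset = getattr(sub, "request_latent_global_mean_envelope_reset", None)
            if callable(reset):
                reset()


def train(stepper, start_epoch, n_epochs, resumed_mid_epoch=False):
    """Return the epochs for which the hook fired (assumed trainer logic)."""
    notified = []
    for epoch in range(start_epoch, n_epochs):
        # Skip the hook for the first epoch after a mid-epoch resume so that
        # partially accumulated statistics are kept.
        fresh = not (resumed_mid_epoch and epoch == start_epoch)
        if fresh:
            stepper.set_epoch(epoch)
            notified.append(epoch)
        # ... the training batches for this epoch would run here ...
    return notified
```

Because the base-class default is a no-op, a trainer written this way can call `set_epoch` on any stepper without checking its type.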

mcgibbon and others added 8 commits June 5, 2026 16:57
When enabled, the per-channel spatial mean of the post-encoder latent
is tracked during training. In eval, the latent is shifted so that its
mean falls within the observed envelope (a no-op when the mean is
already inside it). This bounds the global mean of the latent that the
transformer blocks see at inference to the range observed in training.
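As an illustration of the envelope idea, here is a minimal sketch under stated assumptions. The class name `LatentMeanEnvelope` and its methods are hypothetical, and the sketch takes precomputed per-channel means as input; how the spatial mean is computed (for example, whether it is area-weighted) is not stated in this PR.

```python
class LatentMeanEnvelope:
    """Tracks the per-channel [min, max] of the latent's spatial mean."""

    def __init__(self, n_channels: int):
        # An empty envelope has low > high until a mean has been observed.
        self.low = [float("inf")] * n_channels
        self.high = [float("-inf")] * n_channels

    def update(self, channel_means):
        """Training mode: widen the envelope to include the observed means."""
        for c, m in enumerate(channel_means):
            self.low[c] = min(self.low[c], m)
            self.high[c] = max(self.high[c], m)

    def shift(self, channel_means):
        """Eval mode: per-channel offsets that move each mean into the envelope."""
        offsets = []
        for c, m in enumerate(channel_means):
            if self.low[c] > self.high[c]:
                offsets.append(0.0)  # nothing observed yet, leave unchanged
            else:
                clipped = min(max(m, self.low[c]), self.high[c])
                offsets.append(clipped - m)  # 0.0 when already inside
        return offsets


env = LatentMeanEnvelope(n_channels=2)
env.update([0.25, -1.0])
env.update([0.5, -0.5])
print(env.shift([1.0, -0.75]))  # [-0.5, 0.0]
```

Adding a channel's offset uniformly to every spatial point of that channel moves its mean by exactly the offset, so the spatial pattern of the latent is left unchanged and only its global mean is clipped.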

The envelope is reset at the start of each training epoch (lazily, on
the next training-mode forward) via
request_latent_global_mean_envelope_reset, which the stepper invokes
through the TrainStepperABC.set_epoch hook.
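The lazy reset can be pictured as a pending flag that is only acted on during the next training-mode forward. The sketch below is a single-channel illustration with a hypothetical `EnvelopeState` class. Only the method name `request_latent_global_mean_envelope_reset` comes from this PR, and the observation that eval keeps using the previous envelope until training resumes is a property of this sketch, not a confirmed design goal.

```python
class EnvelopeState:
    """Single-channel envelope with a deferred (lazy) reset."""

    def __init__(self):
        self.low = float("inf")
        self.high = float("-inf")
        self._reset_pending = False

    def request_latent_global_mean_envelope_reset(self):
        # Only record the request; the envelope itself is left untouched.
        self._reset_pending = True

    def forward(self, latent_mean: float, training: bool) -> float:
        if training:
            if self._reset_pending:
                # Apply the deferred reset on the first training forward.
                self.low, self.high = float("inf"), float("-inf")
                self._reset_pending = False
            self.low = min(self.low, latent_mean)
            self.high = max(self.high, latent_mean)
            return latent_mean
        if self.low > self.high:
            return latent_mean  # empty envelope: no-op
        return min(max(latent_mean, self.low), self.high)
```

In this sketch, an eval pass that runs after the reset request but before the next training step still clips against the envelope from the previous epoch.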

Exposed as a single clip_latent_global_means: bool option on
SFNONetConfig and NoiseConditionedSFNOBuilder; defaults to False so
existing models are unaffected.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds CRPS, SSR bias, and ensemble mean RMSE to the training aggregator
when ensemble_metrics=True. This lets us compare train vs val SSR bias to
diagnose whether ensemble overconfidence is caused by overfitting.
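For readers unfamiliar with these metrics, the following is a rough sketch of one common way to compute them for scalar targets. It is not the aggregator code from this PR. SSR is read here as spread-skill ratio, and the estimators chosen (the plain empirical CRPS, the sample variance for spread, no finite-ensemble correction) are assumptions; the function names are hypothetical.

```python
import math
from statistics import fmean, variance


def crps_ensemble(members, target):
    """Empirical CRPS for one scalar target: E|X - y| - 0.5 * E|X - X'|."""
    m = len(members)
    accuracy = fmean([abs(x - target) for x in members])
    pairwise = sum(abs(a - b) for a in members for b in members) / (m * m)
    return accuracy - 0.5 * pairwise


def ensemble_mean_rmse(ensembles, targets):
    """RMSE of the ensemble mean over all samples."""
    sq_errors = [(fmean(e) - y) ** 2 for e, y in zip(ensembles, targets)]
    return math.sqrt(fmean(sq_errors))


def ssr_bias(ensembles, targets):
    """Spread-skill ratio minus one; negative values mean under-dispersion."""
    # Spread: square root of the mean per-sample ensemble variance.
    spread = math.sqrt(fmean([variance(e) for e in ensembles]))
    return spread / ensemble_mean_rmse(ensembles, targets) - 1.0


ensembles = [[1.0, 3.0], [2.0, 6.0]]  # two samples, two members each
targets = [2.0, 2.0]
print(crps_ensemble(ensembles[0], targets[0]))  # 0.5
print(ensemble_mean_rmse(ensembles, targets))
print(ssr_bias(ensembles, targets))
```

Under this definition an overconfident ensemble has a negative SSR bias, so a value that is close to zero on training data but clearly negative on validation data would point toward overfitting as the cause.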

Disabled by default so existing configs are unaffected. Enable via
train_aggregator.ensemble_metrics: true in the training YAML.

Combines all three perturbations: c96-shield training data with
conditional model (labels), residual prediction, and lr-tuning schedule
starting at lr=0.001. Also enables train_aggregator.ensemble_metrics
for SSR bias diagnostics.

Combines conditional SFNO with c96-shield training data and lr-tuning,
without residual prediction. Replaces the demoted residual-only run.

mcgibbon closed this Jun 6, 2026