Add set_epoch hook to TrainStepperABC#1233
Closed
mcgibbon wants to merge 8 commits into
Adds a TrainStepperABC.set_epoch(epoch) hook with a no-op default and wires it from the trainer at fresh-epoch boundaries (mid-epoch resume preserves in-module state so partial-epoch accumulators continue from where they left off).

Stepper and CoupledStepper implement set_epoch by walking submodules and invoking request_latent_global_mean_envelope_reset where present, giving model components a way to reset per-epoch in-module statistics without coupling the stepper to model internals.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
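The hook pattern described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `Module` is a tiny stand-in for `torch.nn.Module` (only its `modules()` walk is imitated), `TrainStepperABC` and `Stepper` are pared down to the one method under discussion, and `EnvelopeTracker` and `start_epochs` are hypothetical names introduced only for the example.

```python
from abc import ABC
from typing import Iterator, List


class Module:
    """Stand-in for torch.nn.Module: a tree with a modules() walker."""

    def __init__(self, *children: "Module") -> None:
        self.children = list(children)

    def modules(self) -> Iterator["Module"]:
        # Yield this module first, then every descendant.
        yield self
        for child in self.children:
            yield from child.modules()


class EnvelopeTracker(Module):
    """Hypothetical component that keeps per-epoch in-module statistics."""

    def __init__(self) -> None:
        super().__init__()
        self.reset_requested = False

    def request_latent_global_mean_envelope_reset(self) -> None:
        # Only records the request; the reset itself would happen later.
        self.reset_requested = True


class TrainStepperABC(ABC):
    """Pared-down stand-in showing only the new hook."""

    def set_epoch(self, epoch: int) -> None:
        # No-op default, so steppers that do not care need no changes.
        return None


class Stepper(TrainStepperABC):
    def __init__(self, module: Module) -> None:
        self.module = module

    def set_epoch(self, epoch: int) -> None:
        # Duck-typed walk: the stepper never names a concrete model class.
        for submodule in self.module.modules():
            reset = getattr(
                submodule, "request_latent_global_mean_envelope_reset", None
            )
            if callable(reset):
                reset()


def start_epochs(
    stepper: TrainStepperABC,
    start_epoch: int,
    max_epochs: int,
    resuming_mid_epoch: bool,
) -> List[int]:
    """Hypothetical trainer loop: call the hook only at fresh-epoch boundaries."""
    called = []
    for epoch in range(start_epoch, max_epochs):
        if epoch == start_epoch and resuming_mid_epoch:
            continue  # keep partial-epoch accumulators untouched
        stepper.set_epoch(epoch)
        called.append(epoch)
    return called
```

Looking the method up with `getattr` is what keeps the stepper decoupled: any submodule that defines the method opts in, and all others are skipped.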
When enabled, the per-channel spatial mean of the post-encoder latent is tracked during training. In eval, the latent is shifted so that this mean falls within the observed envelope; the shift is a no-op when the mean is already inside it. This bounds the global mean of the latent that the transformer blocks see at inference to the range observed in training. The envelope is reset at the start of each training epoch (lazily, on the next training-mode forward) via request_latent_global_mean_envelope_reset, which the stepper invokes through the TrainStepperABC.set_epoch hook. The feature is exposed as a single clip_latent_global_means: bool option on SFNONetConfig and NoiseConditionedSFNOBuilder and defaults to False, so existing models are unaffected.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
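To make the envelope behaviour concrete, here is a rough pure-Python sketch under stated assumptions. It is not the SFNO implementation: the latent is modelled as a nested list indexed `[channel][spatial point]` rather than a tensor, the envelope is assumed to be a running per-channel min/max of the spatial mean, and the class name `LatentMeanEnvelope` and its `training` flag are hypothetical. Only `request_latent_global_mean_envelope_reset` is a name taken from the description.

```python
from typing import List, Optional

Latent = List[List[float]]  # [channel][spatial point], stand-in for a tensor


class LatentMeanEnvelope:
    def __init__(self, n_channels: int) -> None:
        self.lo: List[Optional[float]] = [None] * n_channels
        self.hi: List[Optional[float]] = [None] * n_channels
        self.training = True
        self._reset_requested = False

    def request_latent_global_mean_envelope_reset(self) -> None:
        # Lazy: nothing is cleared until the next training-mode forward.
        self._reset_requested = True

    def forward(self, latent: Latent) -> Latent:
        means = [sum(channel) / len(channel) for channel in latent]
        if self.training:
            if self._reset_requested:
                self.lo = [None] * len(self.lo)
                self.hi = [None] * len(self.hi)
                self._reset_requested = False
            # Widen the envelope to include this batch's per-channel means.
            for c, m in enumerate(means):
                self.lo[c] = m if self.lo[c] is None else min(self.lo[c], m)
                self.hi[c] = m if self.hi[c] is None else max(self.hi[c], m)
            return latent
        shifted = []
        for c, (channel, m) in enumerate(zip(latent, means)):
            lo, hi = self.lo[c], self.hi[c]
            if lo is None or hi is None:
                shifted.append(list(channel))  # nothing observed yet
                continue
            # Clamp the mean into [lo, hi]; the shift is 0 if already inside.
            shift = min(max(m, lo), hi) - m
            shifted.append([value + shift for value in channel])
        return shifted
```

For example, after training-mode forwards whose channel means were 2 and 4, an eval-mode latent `[[5.0, 7.0]]` (mean 6) is shifted by -2 to `[[3.0, 5.0]]`, while `[[2.0, 4.0]]` (mean 3) passes through unchanged. The shift is uniform across the channel, so spatial structure is preserved.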
Adds CRPS, SSR bias, and ensemble mean RMSE to the training aggregator when ensemble_metrics=True. This lets us compare train vs val SSR bias to diagnose whether ensemble overconfidence is caused by overfitting. Disabled by default so existing configs are unaffected. Enable via train_aggregator.ensemble_metrics: true in the training YAML.
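For readers unfamiliar with these metrics, the sketch below shows one common way to compute them for a small ensemble. The definitions here are assumptions and may differ from what the aggregator actually computes: CRPS uses the standard ensemble estimator (mean absolute error minus half the mean pairwise member distance), spread uses the unbiased ensemble variance with no finite-ensemble correction, and SSR bias is taken to be spread divided by ensemble-mean RMSE, minus one. All function names are illustrative.

```python
import math
from typing import List, Sequence

Members = Sequence[Sequence[float]]  # [member][point]


def _ensemble_mean(members: Members) -> List[float]:
    n = len(members)
    return [sum(m[i] for m in members) / n for i in range(len(members[0]))]


def ensemble_mean_rmse(members: Members, target: Sequence[float]) -> float:
    mean = _ensemble_mean(members)
    squared_error = sum((a - b) ** 2 for a, b in zip(mean, target))
    return math.sqrt(squared_error / len(target))


def crps(members: Members, target: Sequence[float]) -> float:
    # Per point: mean |x_i - y| - 0.5 * mean |x_i - x_j|, then average.
    n = len(members)
    total = 0.0
    for i, y in enumerate(target):
        xs = [m[i] for m in members]
        accuracy = sum(abs(x - y) for x in xs) / n
        sharpness = sum(abs(a - b) for a in xs for b in xs) / (2 * n * n)
        total += accuracy - sharpness
    return total / len(target)


def ssr_bias(members: Members, target: Sequence[float]) -> float:
    # Spread: root of the point-averaged unbiased ensemble variance.
    n = len(members)
    mean = _ensemble_mean(members)
    variance = 0.0
    for i, mu in enumerate(mean):
        variance += sum((m[i] - mu) ** 2 for m in members) / (n - 1)
    spread = math.sqrt(variance / len(target))
    return spread / ensemble_mean_rmse(members, target) - 1.0
```

Under these definitions a negative SSR bias means the spread is smaller than the error of the ensemble mean, i.e. the ensemble is overconfident. If that gap is much larger on validation data than on training data, it points toward overfitting, which is the comparison this change enables.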
Combines all three perturbations: c96-shield training data with conditional model (labels), residual prediction, and lr-tuning schedule starting at lr=0.001. Also enables train_aggregator.ensemble_metrics for SSR bias diagnostics.
Combines conditional SFNO with c96-shield training data and lr-tuning, without residual prediction. Replaces the demoted residual-only run.