ideal_mhd_model: make computeMHDForces allocation-free#12
Draft
krystophny wants to merge 21 commits into
Draft
Conversation
The force kernel allocated 17 dynamic Eigen vectors per radial surface (the _o half-grid quantities and the avg/wavg surface averages). Move them to preallocated per-thread ThreadLocalStorage scratch and assign in place, so the radial loop allocates nothing. Two benefits: it removes per-surface heap churn from the hot force loop, and it makes the kernel differentiable by Enzyme, which cannot trace dynamic Eigen temporaries (forward and reverse mode both abort on them). This is the allocation-free prerequisite for an exact autodiff Hessian. Pure refactor, identical arithmetic. Verified bit-for-bit: vmec_standalone MHD energy unchanged on solovev (2.548352e+00) and cth_like_fixed_bdy (5.057191e-02).
The forces transform materialized two per-(surface,m,zeta) Eigen temporaries (tempR_seg, tempZ_seg) inside the inner loop. Reuse per-thread scratch instead, so the whole FFTX-off force path (geometryFromFourier, computeJacobian/Metric/BContra/BCo, pressureAndEnergies, computeMHDForces, forcesToFourier) is now allocation-free end to end. Same arithmetic as the previous .eval(); verified bit-for-bit: solovev 2.548352e+00, cth_like_fixed_bdy 5.057191e-02.
This was referenced Jun 14, 2026
The 'Compare benchmark result' step uses github-action-benchmark with comment-on-alert and the GITHUB_TOKEN, which is read-only for pull requests from forks -> 'Resource not accessible by integration'. Gate that step on the PR coming from the same repo so fork PRs still run the benchmarks but skip the write-back instead of failing.
The pinned vmec-0.0.6 cp310 wheel was f90wrapped against numpy 1.x. Under the numpy 2.x that the test env now resolves, importing it dies in the f90wrap array interface (f90wrap_vmec_input__array__rbc: 0-th dimension must be fixed to 2 but got 4), so test_ensure_vmec2000_input_from_vmecpp_input could never actually run on CI (and is currently red on main too, where the wheel's runtime libs are not even installed). Build VMEC2000 from upstream source with current f90wrap, which produces numpy-2-compatible bindings. The recipe mirrors SIMSOPT's own CI (hiddenSymmetries/VMEC2000, cmake/machines/ubuntu.json). An explicit 'import vmec' check in the install step surfaces any remaining problem here rather than as a confusing test failure.
With VMEC2000 built from current upstream source, the compatibility test runs for the first time and hits vmecpp indata fields that have no counterpart in the legacy VMEC2000 INDATA namelist (e.g. free_boundary_method), which raised AttributeError. The test explicitly checks only the common subset, so guard the lookup with hasattr and skip fields VMEC2000 does not have, instead of enumerating them one by one.
…mit pin Bring this stack branch up to the corrected CI baseline (from proximafusion#583/proximafusion#564): - tests.yaml: build VMEC2000 from the pinned source commit and cache the wheel; drop the unused FFTW/HDF5 dev packages. - benchmarks.yaml: skip the result upload on fork PRs (read-only token). - test_simsopt_compat.py: skip vmecpp-only INDATA fields. - CMakeLists: pin abseil to the 20260107.1 commit hash for Clang >= 21.
…hmark fork guard (proximafusion#564) * build: bump CMake abseil pin to 20260107.1 for Clang >= 21 The CMake FetchContent abseil pin (2024-08) fails to compile under Clang >= 21: absl::Nonnull SFINAE in absl/strings/ascii.cc and the numbers.cc nullability annotations are rejected by the newer frontend. Bump to the 20260107.1 LTS, which compiles cleanly under Clang 21.1.8 and GCC. Clang is the compiler required for the Enzyme autodiff build. The Bazel build keeps its own (BCR) abseil pin and is unaffected. * ci: skip benchmark result upload on fork PRs (token is read-only) The 'Compare benchmark result' step uses github-action-benchmark with comment-on-alert and the GITHUB_TOKEN, which is read-only for pull requests from forks -> 'Resource not accessible by integration'. Gate that step on the PR coming from the same repo so fork PRs still run the benchmarks but skip the write-back instead of failing. * ci: build VMEC2000 from source so the compat test runs on numpy 2 The pinned vmec-0.0.6 cp310 wheel was f90wrapped against numpy 1.x. Under the numpy 2.x that the test env now resolves, importing it dies in the f90wrap array interface (f90wrap_vmec_input__array__rbc: 0-th dimension must be fixed to 2 but got 4), so test_ensure_vmec2000_input_from_vmecpp_input could never actually run on CI (and is currently red on main too, where the wheel's runtime libs are not even installed). Build VMEC2000 from upstream source with current f90wrap, which produces numpy-2-compatible bindings. The recipe mirrors SIMSOPT's own CI (hiddenSymmetries/VMEC2000, cmake/machines/ubuntu.json). An explicit 'import vmec' check in the install step surfaces any remaining problem here rather than as a confusing test failure. * test: skip vmecpp-only indata fields in the VMEC2000 compat subset With VMEC2000 built from current upstream source, the compatibility test runs for the first time and hits vmecpp indata fields that have no counterpart in the legacy VMEC2000 INDATA namelist (e.g. free_boundary_method), which raised AttributeError. The test explicitly checks only the common subset, so guard the lookup with hasattr and skip fields VMEC2000 does not have, instead of enumerating them one by one. * build: pin abseil to the 20260107.1 commit hash Pin the FetchContent abseil dependency to commit 255c84d (the exact commit behind the 20260107.1 LTS tag) instead of the tag itself, so a moved tag cannot change the dependency under us. * ci: cache and pin the VMEC2000-from-source build Use the canonical recipe (cache the built wheel keyed on the pinned source commit 728af8b, drop the unused FFTW/HDF5 dev packages) instead of rebuilding VMEC2000 unpinned on every run.
…roximafusion#562) `VmecWOut` NetCDF serialization assumed `jaxtyping._array_types._AnonymousDim` exists. With newer jaxtyping releases, that symbol may be absent (replaced by `_anonymous_dim`), causing runtime `AttributeError` in `test_init` wout IO/reference paths. - **Compatibility fix: dim marker resolution** - Resolve jaxtyping dimension marker types via guarded lookup on `jt._array_types`: - named: `_NamedDim` - anonymous: `_AnonymousDim` **or** fallback to `type(_anonymous_dim)` when only the instance-style API is present. - **Runtime behavior: robust dimension-name inference** - Keep existing inference behavior, but branch on resolved types instead of hard-coding a single private symbol. - If a marker type is unavailable, logic degrades safely to existing default dimension naming. - **Code organization** - Compute marker-type resolution once before field iteration, then reuse in per-field shape inference. ```python array_types = getattr(jt, "_array_types", None) named_dim_type = getattr(array_types, "_NamedDim", None) if array_types is not None else None anonymous_dim_type = getattr(array_types, "_AnonymousDim", None) if array_types is not None else None if anonymous_dim_type is None and array_types is not None: anonymous_dim_instance = getattr(array_types, "_anonymous_dim", None) if anonymous_dim_instance is not None: anonymous_dim_type = type(anonymous_dim_instance) ```
Older abseil fails to compile under Clang >= 21 (used in the Enzyme build) due to `absl::Nonnull` SFINAE issues in `absl/strings/ascii.cc` (upstream fix: `255c84dadd029fd8ad25c5efb5933e47beaa00c7`). ## Changes - **`src/vmecpp/cpp/MODULE.bazel`**: bump `abseil-cpp` from `20230802.0.bcr.1` to `20260107.1` (LTS)
The allocation-free rewrite placed tempR_seg/tempZ_seg in a block-scope thread_local inside the (jF, m, zeta) inner loop, which emits a __tls_get_addr call and an init-guard branch every iteration. Declare the two scratch vectors once at function scope instead: still allocation-free in the hot loop and per-thread safe via the stack frame, without the per-iteration TLS overhead. Same arithmetic; cma and w7x wout are bit-for-bit unchanged.
…#566) * ideal_mhd_model: make computeMHDForces allocation-free The force kernel allocated 17 dynamic Eigen vectors per radial surface (the _o half-grid quantities and the avg/wavg surface averages). Move them to preallocated per-thread ThreadLocalStorage scratch and assign in place, so the radial loop allocates nothing. Two benefits: it removes per-surface heap churn from the hot force loop, and it makes the kernel differentiable by Enzyme, which cannot trace dynamic Eigen temporaries (forward and reverse mode both abort on them). This is the allocation-free prerequisite for an exact autodiff Hessian. Pure refactor, identical arithmetic. Verified bit-for-bit: vmec_standalone MHD energy unchanged on solovev (2.548352e+00) and cth_like_fixed_bdy (5.057191e-02). * dft_toroidal: make ForcesToFourier allocation-free The forces transform materialized two per-(surface,m,zeta) Eigen temporaries (tempR_seg, tempZ_seg) inside the inner loop. Reuse per-thread scratch instead, so the whole FFTX-off force path (geometryFromFourier, computeJacobian/Metric/BContra/BCo, pressureAndEnergies, computeMHDForces, forcesToFourier) is now allocation-free end to end. Same arithmetic as the previous .eval(); verified bit-for-bit: solovev 2.548352e+00, cth_like_fixed_bdy 5.057191e-02. * apply pre-commit formatting (ruff, docformatter, clang-format) * ci: skip benchmark result upload on fork PRs (token is read-only) The 'Compare benchmark result' step uses github-action-benchmark with comment-on-alert and the GITHUB_TOKEN, which is read-only for pull requests from forks -> 'Resource not accessible by integration'. Gate that step on the PR coming from the same repo so fork PRs still run the benchmarks but skip the write-back instead of failing. * ci: build VMEC2000 from source so the compat test runs on numpy 2 The pinned vmec-0.0.6 cp310 wheel was f90wrapped against numpy 1.x. Under the numpy 2.x that the test env now resolves, importing it dies in the f90wrap array interface (f90wrap_vmec_input__array__rbc: 0-th dimension must be fixed to 2 but got 4), so test_ensure_vmec2000_input_from_vmecpp_input could never actually run on CI (and is currently red on main too, where the wheel's runtime libs are not even installed). Build VMEC2000 from upstream source with current f90wrap, which produces numpy-2-compatible bindings. The recipe mirrors SIMSOPT's own CI (hiddenSymmetries/VMEC2000, cmake/machines/ubuntu.json). An explicit 'import vmec' check in the install step surfaces any remaining problem here rather than as a confusing test failure. * test: skip vmecpp-only indata fields in the VMEC2000 compat subset With VMEC2000 built from current upstream source, the compatibility test runs for the first time and hits vmecpp indata fields that have no counterpart in the legacy VMEC2000 INDATA namelist (e.g. free_boundary_method), which raised AttributeError. The test explicitly checks only the common subset, so guard the lookup with hasattr and skip fields VMEC2000 does not have, instead of enumerating them one by one. * ci: sync VMEC2000-from-source build, benchmark fork guard, abseil commit pin Bring this stack branch up to the corrected CI baseline (from proximafusion#583/proximafusion#564): - tests.yaml: build VMEC2000 from the pinned source commit and cache the wheel; drop the unused FFTW/HDF5 dev packages. - benchmarks.yaml: skip the result upload on fork PRs (read-only token). - test_simsopt_compat.py: skip vmecpp-only INDATA fields. - CMakeLists: pin abseil to the 20260107.1 commit hash for Clang >= 21. * ideal_mhd_model: hoist ForcesToFourier scratch out of the inner loop The allocation-free rewrite placed tempR_seg/tempZ_seg in a block-scope thread_local inside the (jF, m, zeta) inner loop, which emits a __tls_get_addr call and an init-guard branch every iteration. Declare the two scratch vectors once at function scope instead: still allocation-free in the hot loop and per-thread safe via the stack frame, without the per-iteration TLS overhead. Same arithmetic; cma and w7x wout are bit-for-bit unchanged. * Update thread_local_storage.h * Update thread_local_storage.h --------- Co-authored-by: Philipp Jurašić <166746189+jurasic-pf@users.noreply.github.com>
* ideal_mhd_model: make the iteration hot loop allocation-free (proximafusion#594) The per-iteration force evaluation allocated heap temporaries that scaled with problem size (~8.5k allocations per iteration on a 3D fixed-boundary case). Reuse pre-sized ThreadLocalStorage scratch in computeMHDForces, the update() residual accumulators, and the 3D force DFT (ForcesToFourier3DSymmFastPoloidal, which was the bulk); use stack arrays in applyRZPreconditioner; shift the 1/tau averaging history in place; and reserve the convergence-history vectors once per multigrid stage. The steady-state loop now performs zero heap allocations. This is storage reuse only: the converged wout is bit-identical before and after on 2D and 3D fixed-boundary cases, verified single-threaded (VMEC++ is only run-to-run deterministic single-threaded). Add VmecHotLoop.IsAllocationFree, which interposes the malloc family and asserts zero allocations across steady-state iterations (skipped under the sanitizers, whose allocators conflict with the interposition). * vmec_allocation_test: add a positive control for the allocation counter AllocationCounterWorks asserts the interposed malloc counter observes a known set of heap allocations while counting is enabled and stays at zero while disabled, so IsAllocationFree cannot pass on a counter that never fires. * ideal_mhd_model: stack-allocate the fixed-size force-residual vectors Per review: the size-3 (R, Z, lambda) force-residual accumulators were dynamic Eigen::VectorXd kept in ThreadLocalStorage. They are fixed size, so use stack Eigen::Vector3d locals scoped to where they are used and drop the thread-local members. residuals(), evalFResInvar() and evalFResPrecd() now take Eigen::Vector3d, so no temporary VectorXd is materialized and the hot loop stays allocation-free. No functional change: cma and solovev converge to the same values. * Re-trigger CI * Re-trigger CI
…tigrid driver (proximafusion#560) * iteration: reset accumulated constants on reinitialize/refine_to, add Python multigrid driver * tests: satisfy docformatter and ruff-format hooks --------- Co-authored-by: Philipp Jurašić <166746189+jurasic-pf@users.noreply.github.com>
…on#565) * build: bump CMake abseil pin to 20260107.1 for Clang >= 21 The CMake FetchContent abseil pin (2024-08) fails to compile under Clang >= 21: absl::Nonnull SFINAE in absl/strings/ascii.cc and the numbers.cc nullability annotations are rejected by the newer frontend. Bump to the 20260107.1 LTS, which compiles cleanly under Clang 21.1.8 and GCC. Clang is the compiler required for the Enzyme autodiff build. The Bazel build keeps its own (BCR) abseil pin and is unaffected. * enzyme: opt-in Clang/Enzyme build option and AD smoke test Add VMECPP_ENABLE_ENZYME (OFF by default), which requires a Clang compiler and a ClangEnzyme plugin path and builds a self-contained autodiff smoke test. The test differentiates a scalar objective written over Eigen::Map'd caller buffers and checks reverse- and forward-mode Enzyme gradients against the closed form and central finite differences. enzyme.h documents the intrinsic ABI and the allocation constraint that shapes the differentiable kernels: Enzyme cannot track Eigen's aligned allocator, so differentiable paths use Eigen::Map over caller-owned buffers and avoid heap expression temporaries. With the option off the build is unchanged. * ci: skip benchmark result upload on fork PRs (token is read-only) The 'Compare benchmark result' step uses github-action-benchmark with comment-on-alert and the GITHUB_TOKEN, which is read-only for pull requests from forks -> 'Resource not accessible by integration'. Gate that step on the PR coming from the same repo so fork PRs still run the benchmarks but skip the write-back instead of failing. * ci: build VMEC2000 from source so the compat test runs on numpy 2 The pinned vmec-0.0.6 cp310 wheel was f90wrapped against numpy 1.x. Under the numpy 2.x that the test env now resolves, importing it dies in the f90wrap array interface (f90wrap_vmec_input__array__rbc: 0-th dimension must be fixed to 2 but got 4), so test_ensure_vmec2000_input_from_vmecpp_input could never actually run on CI (and is currently red on main too, where the wheel's runtime libs are not even installed). Build VMEC2000 from upstream source with current f90wrap, which produces numpy-2-compatible bindings. The recipe mirrors SIMSOPT's own CI (hiddenSymmetries/VMEC2000, cmake/machines/ubuntu.json). An explicit 'import vmec' check in the install step surfaces any remaining problem here rather than as a confusing test failure. * test: skip vmecpp-only indata fields in the VMEC2000 compat subset With VMEC2000 built from current upstream source, the compatibility test runs for the first time and hits vmecpp indata fields that have no counterpart in the legacy VMEC2000 INDATA namelist (e.g. free_boundary_method), which raised AttributeError. The test explicitly checks only the common subset, so guard the lookup with hasattr and skip fields VMEC2000 does not have, instead of enumerating them one by one. * ci: sync VMEC2000-from-source build, benchmark fork guard, abseil commit pin Bring this stack branch up to the corrected CI baseline (from proximafusion#583/proximafusion#564): - tests.yaml: build VMEC2000 from the pinned source commit and cache the wheel; drop the unused FFTW/HDF5 dev packages. - benchmarks.yaml: skip the result upload on fork PRs (read-only token). - test_simsopt_compat.py: skip vmecpp-only INDATA fields. - CMakeLists: pin abseil to the 20260107.1 commit hash, not the tag. * enzyme: run the AD smoke test through bazel instead of ctest Move the Enzyme autodiff smoke test into the bazel test framework, which owns every other C++ test in this repository, and drop the separate CMake ctest path that nothing in CI exercised. - vmecpp/common/enzyme/BUILD.bazel: an `enzyme` header library plus an `enzyme_smoke_test` cc_test. The test is tagged `manual` so the default GCC `bazel test //...` skips it (the Enzyme intrinsics only resolve under Clang with the plugin attached) and never tries to compile it with GCC. - .bazelrc: a `--config=enzyme` that sets -O2 so the Enzyme optimization pass fires. Select Clang with CC/CXX and pass the plugin path the way -DVMECPP_ENZYME_PLUGIN did under CMake: CC=clang CXX=clang++ bazel test --config=enzyme \ --copt=-fplugin=/path/to/ClangEnzyme-NN.so \ //vmecpp/common/enzyme:enzyme_smoke_test - CMakeLists.txt: remove the VMECPP_ENABLE_ENZYME option and the ctest registration it only existed to drive. * ci: build ClangEnzyme and run the enzyme smoke test in CI Add a GitHub Actions job that gives the Enzyme autodiff smoke test actual CI coverage. It mirrors the EnzymeAD upstream recipe: install Clang/LLVM 21 from apt.llvm.org, build a pinned ClangEnzyme-21 plugin (v0.0.264, the version this stack is developed against) against the installed LLVM and Clang, then run the bazel target under --config=enzyme with the plugin attached. The plugin build is cached on the pinned ref so only the first run pays for it. This is what the enzyme test needed beyond the bazel move: the default GCC test_bazel job skips the manual-tagged target, so without a Clang/Enzyme job nothing exercised it. --------- Co-authored-by: Philipp Jurašić <166746189+jurasic-pf@users.noreply.github.com>
# Conflicts: # src/vmecpp/cpp/vmecpp/vmec/ideal_mhd_model/dft_toroidal.cc
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What
Make
IdealMhdModel::computeMHDForcesallocation-free. The kernel allocated 17dynamic
Eigen::VectorXdper radial surface: the_ohalf-grid quantities(
P_o,rup_o, ...,gbvbv_o) and the surface averages (P_avg,P_wavg,gbubu_avg, ...). These move to preallocated per-threadThreadLocalStoragescratch and are assigned in place; the radial loop now allocates nothing.
Why
Two reasons:
per surface per iteration is avoidable heap churn.
forward and reverse mode abort on a dynamic-size Eigen expression that
allocates (the "freeing without malloc" failure). Writing into preallocated
storage via
Eigen::Map/in-place assignment is the form Enzyme candifferentiate. This is the allocation-free prerequisite for an exact autodiff
Hessian-vector product of the force.
This is a pure refactor: the arithmetic is identical, only the storage of the
intermediates changes.
Verification
computeMHDForcesnow contains zeroEigen::VectorXdlocals orEigen::VectorXd::Zeroconstructions (allocation-free).Bit-for-bit identical results,
vmec_standalonefinal MHD energy:Builds clean under GCC 16 and Clang 21; clang-format (file style) clean.