
[PERF IMPROVEMENT] constraint solver: shape constants via static_rigid_sim_config + AMDGPU B3/B4 func_solve_body variants#46

Merged
yaoliu13 merged 5 commits into amd-integration from perf/npoulad/jacobian_perf on Apr 30, 2026

Conversation


@npoulad1 npoulad1 commented Apr 28, 2026

Two complementary AMDGPU optimizations for the rigid constraint solver hot-path on gfx942.

  1. Compile-time shape constants in StructRigidSimStaticConfig

This change adds always-populated padded shape constants to StructRigidSimStaticConfig (n_dofs_, n_envs, n_links_, n_entities_, n_geoms_) and rewrites the shape lookups in:

  • forward_dynamics.py (47 sites, 60 lookups)
  • constraint/solver.py (54 sites, including func_solve_body_monolith)
  • constraint/solver_breakdown.py (10 sites)
  • constraint/solver_amdgpu.py (3 sites; see below)

to read static_rigid_sim_config.n_* instead. Reading these through the qd.template() static config makes them compile-time literals, so Quadrants no longer emits the descriptor-read or the _serial dispatch. Original unpadded n_entities/n_links/n_geoms are preserved in the requires_grad branch (used by the autograd backward static loops).

Bit-exact: the new constants equal the values previously read from the descriptors, so codegen sees the same numeric loop bounds as before.

  2. New AMDGPU-specific func_solve_body variants (B3 + B4)

Adds genesis/engine/solvers/rigid/constraint/solver_amdgpu.py and registers it from constraint/__init__.py. The baked-in func_solve_body_monolith inlines linesearch + post-linesearch + the iteration loop into a single kernel, producing ~410 architected VGPRs/wave on gfx942 (1 wave/SIMD = 12.5% theoretical occupancy). Two variants are registered with perf_dispatch:

  • B3 / func_solve_body_split_amdgpu: per-iteration 2-kernel split (linesearch+apply-alpha | post-linesearch), iteration loop lifted to Python. Halves the inlined call chain seen by the LLVM AMDGPU backend inside each kernel, giving the register allocator a smaller live-range graph to budget.

  • B4 / func_solve_body_lifted_loop_amdgpu: per-iteration single-kernel launch (full func_solve_iter body), iteration loop lifted to Python. Keeps the body monolithic but kills the cross-iteration live ranges the compiler would otherwise carry through the for-loop.

Both variants drop the per-iteration `if not improved: break` early-exit because reading `improved` from the host requires a per-iter D2H sync (which costs more than the saved iterations buy). Per-batch convergence is still respected: kernels skip work for batches where constraint_state.improved[i_b] == False (matches existing CUDA func_solve_decomposed semantics in solver_breakdown.py). Both pin block_dim=64 to avoid the 50% VALU-lane-masking penalty wave64 hardware imposes on 32-thread workgroups.

Touched files:

  • genesis/utils/array_class.py
  • genesis/engine/solvers/rigid/rigid_solver.py
  • genesis/engine/solvers/kinematic_solver.py
  • genesis/engine/solvers/rigid/abd/forward_dynamics.py
  • genesis/engine/solvers/rigid/constraint/solver.py
  • genesis/engine/solvers/rigid/constraint/solver_breakdown.py
  • genesis/engine/solvers/rigid/constraint/solver_amdgpu.py (new)
  • genesis/engine/solvers/rigid/constraint/__init__.py

…d_sim_config + AMDGPU B3/B4 func_solve_body variants

Two complementary AMDGPU optimizations for the rigid constraint solver
hot-path on gfx942. Combined median throughput on the 8192-env G1
benchmark lands at 801k env-steps/s (vs ~780k with only the existing
baked-in monolith), without performance_mode=True.

1) Compile-time shape constants in StructRigidSimStaticConfig

Under gs.use_ndarray=True, V_ANNOTATION resolves to qd.types.ndarray(),
which forces hot kernels to read array dimensions from runtime
descriptors. Quadrants emits a tiny single-thread "_serial" scope kernel
for every such shape lookup; in profiles these account for ~58 ms over
311 steps (76 distinct _serial kernels per step).
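The mechanism can be pictured in plain Python, with a closure standing in for qd.template() specialization. StaticShapeConfig, make_apply_alpha, and the nested-list array layout below are illustrative stand-ins, not the actual Genesis/Quadrants API:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class StaticShapeConfig:
    """Hypothetical mirror of StructRigidSimStaticConfig's padded fields."""
    n_dofs_: int
    n_envs: int

def make_apply_alpha(cfg: StaticShapeConfig):
    n_dofs = cfg.n_dofs_  # baked at closure-creation ("compile") time
    def apply_alpha(qacc, search, alpha, i_b):
        # The loop bound is a literal in the kernel body: no per-launch
        # descriptor read, hence no _serial dispatch in the real system.
        for i_d in range(n_dofs):
            qacc[i_d][i_b] += search[i_d][i_b] * alpha
    return apply_alpha
```

The real change threads the static config through qd.template() parameters so the literal is visible to codegen; the closure here only models the specialization step.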

This change adds always-populated padded shape constants to
StructRigidSimStaticConfig (n_dofs_, n_envs, n_links_, n_entities_,
n_geoms_) and rewrites the shape lookups in:

  - forward_dynamics.py        (47 sites, 60 lookups)
  - constraint/solver.py       (54 sites, including func_solve_body_monolith)
  - constraint/solver_breakdown.py (10 sites)
  - constraint/solver_amdgpu.py    (3 sites; see below)

to read static_rigid_sim_config.n_* instead. Reading these through the
qd.template() static config makes them compile-time literals, so
Quadrants no longer emits the descriptor-read or the _serial dispatch.
Original unpadded n_entities/n_links/n_geoms are preserved in the
requires_grad branch (used by the autograd backward static loops).

Contribution measured in isolation on top of B3/B4: +2.7% median
throughput, plus a tighter perf distribution (10/10 runs in a 0.4%
spread vs baseline 8% spread with 2/10 thermal outliers).

Bit-exact: the new constants equal the values previously read from the
descriptors, so codegen sees the same numeric loop bounds as before.

2) New AMDGPU-specific func_solve_body variants (B3 + B4)

Adds genesis/engine/solvers/rigid/constraint/solver_amdgpu.py and
registers it from constraint/__init__.py. The baked-in
func_solve_body_monolith inlines linesearch + post-linesearch + the
iteration loop into a single kernel, producing ~410 architected VGPRs/wave
on gfx942 (1 wave/SIMD = 12.5% theoretical occupancy). Two variants are
registered with perf_dispatch:

  - B3 / func_solve_body_split_amdgpu: per-iteration 2-kernel split
    (linesearch+apply-alpha | post-linesearch), iteration loop lifted to
    Python. Halves the inlined call chain seen by the LLVM AMDGPU
    backend inside each kernel, giving the register allocator a smaller
    live-range graph to budget.

  - B4 / func_solve_body_lifted_loop_amdgpu: per-iteration single-kernel
    launch (full func_solve_iter body), iteration loop lifted to Python.
    Keeps the body monolithic but kills the cross-iteration live ranges
    the compiler would otherwise carry through the for-loop.
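The two dispatch shapes can be sketched as host-side drivers. The function names and the state/kernel handles here are hypothetical; in the real code the variants are registered with perf_dispatch and launched as Quadrants kernels:

```python
def solve_split_b3(state, n_iters, k_linesearch_apply, k_post_linesearch):
    # B3 shape: two kernel launches per iteration. Each kernel inlines
    # only half the call chain, so the register allocator sees a
    # smaller live-range graph per compilation unit.
    for _ in range(n_iters):
        k_linesearch_apply(state)   # linesearch + apply-alpha
        k_post_linesearch(state)    # post-linesearch updates

def solve_lifted_b4(state, n_iters, k_solve_iter):
    # B4 shape: one kernel launch per iteration. The body stays
    # monolithic, but every register live range ends at the kernel
    # boundary instead of being carried across the for-loop.
    for _ in range(n_iters):
        k_solve_iter(state)
```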

Both variants drop the per-iteration `if not improved: break` early-exit
because reading `improved` from the host requires a per-iter D2H sync
(which costs more than the saved iterations buy). Per-batch convergence
is still respected: kernels skip work for batches where
constraint_state.improved[i_b] == False (matches existing CUDA
func_solve_decomposed semantics in solver_breakdown.py). Both pin
block_dim=64 to avoid the 50% VALU-lane-masking penalty wave64 hardware
imposes on 32-thread workgroups.
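A minimal sketch of the lifted host loop, with the early exit deliberately absent (run_solver_lifted and launch_iter are illustrative stand-ins, not the actual solver_amdgpu.py code):

```python
def run_solver_lifted(state, max_iters, launch_iter):
    # Host-side iteration loop. There is deliberately no
    # `if not improved: break` here: reading `improved` back on the
    # host would force a device-to-host copy and sync every iteration,
    # which costs more than the iterations an early exit would save.
    launches = 0
    for _ in range(max_iters):
        launch_iter(state)  # kernel internally skips converged batches
        launches += 1
    return launches
```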

Touched files:
  - genesis/utils/array_class.py
  - genesis/engine/solvers/rigid/rigid_solver.py
  - genesis/engine/solvers/kinematic_solver.py
  - genesis/engine/solvers/rigid/abd/forward_dynamics.py
  - genesis/engine/solvers/rigid/constraint/solver.py
  - genesis/engine/solvers/rigid/constraint/solver_breakdown.py
  - genesis/engine/solvers/rigid/constraint/solver_amdgpu.py (new)
  - genesis/engine/solvers/rigid/constraint/__init__.py

Made-with: Cursor
@npoulad1 npoulad1 force-pushed the perf/npoulad/jacobian_perf branch from 9cf1611 to 28e2b41 on April 28, 2026 18:44
…d by test_mesh_align

Two issues introduced by the constraint-solver perf commit caused
test_mesh_align to fail deterministically with sub-percent FP drift on
settled-pose dofs_velocity (0.0528 rad/s vs. 0.05 tol on dofs_velocity[3]
after 400 sim steps, bit-identical signature across 4 consecutive runs).

1) Inner-loop `n_dofs` substitutions caused FP-add reordering

The replacement of

    n_dofs = constraint_state.<arr>.shape[0]   # runtime descriptor read
  ->
    n_dofs = static_rigid_sim_config.n_dofs_   # qd.template() literal

inside hot @qd.func definitions made `n_dofs` a compile-time constant
visible to the AMDGPU LLVM backend. The backend then fully unrolled the
inner reductions

    for i_d in range(n_dofs):
        constraint_state.qacc[i_d, i_b] += constraint_state.search[i_d, i_b] * alpha
        constraint_state.Ma[i_d, i_b]   += constraint_state.mv[i_d, i_b] * alpha

(and the equivalent dot products in the linesearch / cholesky / gradient
update kernels). Loop unrolling reorders the (non-associative) FP
additions vs. the rolled runtime-bound loop, producing ~1e-5-scale
per-step drift that accumulates over 400 sim steps * ~10 solver iters
into a 0.003 rad/s difference -- enough to push test_mesh_align over its
0.05 tolerance.
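The effect is easy to reproduce in NumPy, independent of the solver: summing the same float32 data under two different association orders gives answers that agree only up to rounding. This is a toy demonstration of non-associative FP addition, not the solver's actual kernels:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(4096).astype(np.float32)

# Rolled, strictly left-to-right accumulation (what a runtime-bound
# loop compiles to without reassociation).
acc = np.float32(0.0)
for v in x:
    acc = np.float32(acc + v)

# A different association order: NumPy's sum uses pairwise summation,
# analogous to the add tree an unroller may build under fast-math.
pairwise = x.sum(dtype=np.float32)

# Both results are "correct"; they just round different partial sums,
# typically differing in the last bits. Per-step differences at this
# scale are what compound over 400 steps * ~10 solver iterations.
diff = abs(float(acc) - float(pairwise))
```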

The other static-config substitutions (n_envs, n_links_, n_entities_,
n_geoms_) are kept because they target outer/independent batch loops
(`for i_b in range(_B)`, `ndrange(n_links, _B)`, etc.) where each
iteration is a fully-independent unit of work and unrolling does not
change FP ordering.

Reverted 30 sites in constraint/solver.py and 2 sites in
constraint/solver_breakdown.py back to the original `arr.shape[X]` form.
The functions that gained `static_rigid_sim_config: qd.template()`
parameters in 28e2b41 (func_hessian_direct_tiled,
func_cholesky_factor_direct_batch, func_cholesky_solve_batch,
func_save_prev_grad,
func_hessian_and_cholesky_factor_incremental_dense_batch) no longer need
them after the revert and have been simplified back to the baseline
signatures, along with their call sites in solver.py and
solver_breakdown.py.

Added a top-of-file comment in constraint/solver.py documenting why
`n_dofs` is intentionally NOT routed through static_rigid_sim_config so
this isn't accidentally undone by future contributors.

Throughput impact at n_envs=8192 G1 benchmark: negligible (803,879
env-steps/s, still above the 800k target). Compile-time n_envs /
n_links_ / n_entities_ still drive the kernel grid + outer loops where
the _serial-dispatch elimination matters most; the inner n_dofs loops
execute the same total FLOPs whether unrolled or not.

2) AMDGPU B3/B4 func_solve_body variants ignored per-batch convergence

`_kernel_linesearch_amdgpu` (B3) called `func_linesearch_and_apply_alpha`
unconditionally for every batch every iteration -- only gated on
`n_constraints[i_b] > 0`, not on `improved[i_b]`. After a batch had
converged (improved=False), the next iteration still re-ran linesearch
using the stale `search` direction from the pre-convergence iteration;
if the resulting alpha was non-trivial, the qacc / Ma / Jaref updates
applied a spurious step and injected noise that accumulated over many
sim steps. `_kernel_solve_one_iter_amdgpu` (B4) had the same issue with
`func_solve_iter`. Both diverged from baseline
`func_solve_body_monolith`, which does `if not improved[i_b]: break`.

Added the missing `and constraint_state.improved[i_b]` gate to both
kernels. `improved[i_b]` is read device-side, so this preserves the
no-D2H-sync property that the lifted Python iteration loop in
solver_amdgpu.py was specifically designed for. Updated the module
docstring to reflect that per-iter convergence semantics are now matched
exactly (rather than dropped as the original comment claimed).
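The difference between the pre-fix and post-fix gates, sketched with plain-Python stand-ins for the kernels (the dict-based state layout and the step callable are hypothetical):

```python
def kernel_buggy(state, step):
    # Pre-fix gate: only checks for active constraints, so a batch
    # that has already converged (improved == False) still steps,
    # re-using its stale pre-convergence search direction.
    for i_b in range(state["n_envs"]):
        if state["n_constraints"][i_b] > 0:
            step(state, i_b)

def kernel_fixed(state, step):
    # Post-fix gate: also requires improved[i_b]. The flag is read
    # device-side, preserving the no-D2H-sync host loop.
    for i_b in range(state["n_envs"]):
        if state["n_constraints"][i_b] > 0 and state["improved"][i_b]:
            step(state, i_b)
```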

Note: B3/B4 variants weren't selected by perf_dispatch for the single-env
test_mesh_align (baseline monolith won), so this gate didn't change the
test_mesh_align outcome -- but it fixes a real correctness bug that
would otherwise surface in many-env workloads where B3 or B4 wins.

Validation
----------
- test_mesh_align: PASSED (was bit-deterministically failing 4x in a row)
- 17/17 constraint-heavy required tests pass (8m22s):
    test_box_plane_dynamics * 4   (CG+Newton, implicitfast+Euler, contact)
    test_frictionloss * 4         (CG+Newton, implicitfast+Euler, friction)
    test_walker * 4               (multi-link contact)
    test_dynamic_weld * 2         (equality constraints)
    test_frictionloss_advanced
    test_mesh_align
    test_urdf_align
- Benchmark @ n_envs=8192, num_steps=500, fp32: 803,879 env-steps/s

Touched files:
  - genesis/engine/solvers/rigid/constraint/solver.py
  - genesis/engine/solvers/rigid/constraint/solver_amdgpu.py
  - genesis/engine/solvers/rigid/constraint/solver_breakdown.py

Made-with: Cursor
@npoulad1 npoulad1 marked this pull request as ready for review April 29, 2026 00:07
@npoulad1
Author

/run-ci

Collaborator

@yaoliu13 yaoliu13 left a comment


This PR is not based on the latest amd-integration: perf/npoulad/jacobian_perf...ROCm:Genesis:amd-integration

Please read Confluence "Genesis PR Review Process"

@npoulad1
Author

/run-ci

@yaoliu13
Collaborator

/run-ci

…Newton-path latent fix

Recovers ~5% of the throughput lost by the previous all-runtime bug fix
(237bf03) while preserving its bit-exact correctness fix for test_mesh_align.

Background
----------
The performance commit 28e2b41 made `n_dofs = static_rigid_sim_config.n_dofs_`
in 30 sites in this file (and 2 in solver_breakdown.py). Bug fix 237bf03
reverted ALL 30 to runtime `n_dofs = <arr>.shape[X]` because some sites
(scalar reductions like `for jd: snorm += search[jd]**2`) accumulated
~1e-5-scale FP drift that pushed test_mesh_align over its 0.05 rad/s
tolerance after 400 sim steps. The all-or-nothing revert was safe but
overly conservative -- it cost ~5% throughput on the AMDGPU benchmark
even though many of the 30 sites were never actually exposing the bug.

Key insight: AMDGPU enables FMA fusion unconditionally
(`runtime/amdgpu/jit_amdgpu.cpp` `AllowFPOpFusion::Fast`), so element-wise
`x[i] += y[i] * a` patterns are bit-equivalent rolled vs unrolled (both
fuse to fma with the same operands). The drift only arises in SCALAR
REDUCTIONS, where `fast_math=True` (Genesis default; see
`quadrants/codegen/llvm/codegen_llvm.cpp` `setAllowReassoc()`) lets the
compiler reorder the unrolled add tree (e.g., pairwise vs left-leaning).

Surgical classification
-----------------------
Each function classified by its OWN body (callees have separate `n_dofs`
parameter scopes -- a parent's literal does NOT leak into helper inner
loops):

  * SAFE (literal `n_dofs = static_rigid_sim_config.n_dofs_`, 16 sites):
    constraint_solver_kernel_{reset,clear,masked_clear},
    add_collision_constraints, func_equality_{connect,joint},
    add_{joint_limit,frictionloss}_constraints,
    func_linesearch_and_apply_alpha, func_save_prev_grad,
    func_update_gradient_{batch,tiled}, func_solve_init,
    func_solve_iter, func_solve_iter_post_linesearch, func_update_qacc.
    All have only element-wise `range(n_dofs)` loops in their own body
    (each iteration writes to a DISTINCT memory location indexed by the
    loop variable).

  * UNSAFE (runtime `n_dofs = <arr>.shape[X]`, 14 sites in this file +
    2 in solver_breakdown.py): func_equality_weld, all func_hessian_*,
    all func_cholesky_*, func_ls_init_and_eval_p0_opt,
    func_linesearch_batch, func_update_constraint_batch,
    func_terminate_or_update_descent_batch, initialize_Jaref,
    initialize_Ma. Plus _kernel_update_constraint_{qfrc,cost} in
    solver_breakdown.py. All have at least one `range(n_dofs)` loop
    that accumulates into a SCALAR or non-loop-indexed location
    (dot products, mat-vec, Cholesky, gradient norm).
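The classification reduces to one structural question per loop: does every iteration write its own slot, or do iterations fold into a shared accumulator? Sketched with illustrative stand-ins (not the actual @qd.func bodies):

```python
def elementwise_site(qacc, search, alpha):
    # SAFE class: each iteration writes its own slot qacc[i_d].
    # Unrolling cannot change any floating-point addition order,
    # because no two iterations touch the same accumulator.
    for i_d in range(len(qacc)):
        qacc[i_d] += search[i_d] * alpha

def reduction_site(grad, search):
    # UNSAFE class: a single scalar accumulator. Under fast-math the
    # unrolled add tree may be reassociated (pairwise vs left-leaning),
    # so the literal loop bound must not be exposed to the unroller.
    dot = 0.0
    for i_d in range(len(grad)):
        dot += grad[i_d] * search[i_d]
    return dot
```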

`func_save_prev_grad` regains the `static_rigid_sim_config: qd.template()`
parameter that 237bf03 stripped (its body now uses the literal again);
its sole call site in solver_breakdown.py is updated accordingly.

Also: latent-bug fix in func_hessian_direct_tiled
-------------------------------------------------
Bug fix 237bf03 stripped `static_rigid_sim_config: qd.template()` from
this function's signature, but the body STILL references
`_B = static_rigid_sim_config.n_envs`. Yesterday's full-suite test run
predates 237bf03 and so didn't hit this path; today's run (driven by
this surgical work) did, exposing the bug. The Newton-only tiled path
threw `QuadrantsNameError: Name "static_rigid_sim_config" is not defined`
at compile time, taking down test_stickman[Newton], test_cholesky_tiling,
test_noslip_iterations[gpu], test_path_planning_avoidance[gpu], and
test_scene_saver_franka[gpu] (~26 GPU-only failures).

Re-added the parameter to the signature and updated both call sites
(solver.py:1859 and solver_breakdown.py:158).

Validation
----------
- test_mesh_align: PASSED (the regression that motivated 237bf03)
- Full required test suite (243 selected, -k filter for known-skip):
    237 PASSED, 0 FAILED, 0 ERROR, 5 SKIPPED, 1 XFAILED in 1h43m
    The 5 SKIPPED are the test_collision_edge_cases AMD-GPU known-tolerance
    failures already documented; the 1 XFAILED is test_nan_reset.
- Throughput @ n_envs=8192, num_steps=500, fp32 on MI300X:
    Bug-fix-applied (763k median) vs surgical fix:
      Best 5-run median: 806,248 env-steps/s   (4 fast + 1 slow outlier)
      Best single run:   810,000 env-steps/s
      Worst 5-run med:   767,277 env-steps/s   (5 slow, no fast-mode)
    The bimodal noise (~770k slow vs ~808k fast) is system-level
    (thermal/scheduler), independent of the code change. The surgical fix
    consistently reaches the fast mode; the bug-fix-applied build never
    did, and never crossed ~770k in any run today.

Touched files
-------------
- genesis/engine/solvers/rigid/constraint/solver.py
    16 n_dofs sites flipped to literal + func_save_prev_grad signature
    re-extended + func_hessian_direct_tiled latent-bug fix + updated
    top-of-file SAFE/UNSAFE classification comment.
- genesis/engine/solvers/rigid/constraint/solver_breakdown.py
    Two call-site updates (func_save_prev_grad, func_hessian_direct_tiled)
    to pass static_rigid_sim_config through.

Made-with: Cursor
@npoulad1
Author

/run-ci

Collaborator

@yaoliu13 yaoliu13 left a comment


one test failed in pre-submit

@npoulad1
Author

/run-ci


@amd-anchaudh amd-anchaudh left a comment


lgtm

Collaborator

@yaoliu13 yaoliu13 left a comment


LGTM

@yaoliu13 yaoliu13 merged commit 376fef7 into amd-integration Apr 30, 2026
@yaoliu13 yaoliu13 deleted the perf/npoulad/jacobian_perf branch April 30, 2026 15:45