[PERF IMPROVEMENT] constraint solver: shape constants via static_rigid_sim_config + AMDGPU B3/B4 func_solve_body variants by npoulad1 · Pull Request #46 · ROCm/Genesis

npoulad1 · 2026-04-28T18:12:17Z

Two complementary AMDGPU optimizations for the rigid constraint solver hot-path on gfx942.

Compile-time shape constants in StructRigidSimStaticConfig

This change adds always-populated padded shape constants to StructRigidSimStaticConfig (n_dofs_, n_envs, n_links_, n_entities_, n_geoms_) and rewrites the shape lookups in:

forward_dynamics.py (47 sites, 60 lookups)
constraint/solver.py (54 sites, including func_solve_body_monolith)
constraint/solver_breakdown.py (10 sites)
constraint/solver_amdgpu.py (3 sites; see below)

to read static_rigid_sim_config.n_* instead. Reading these through the qd.template() static config makes them compile-time literals, so Quadrants no longer emits the descriptor-read or the _serial dispatch. Original unpadded n_entities/n_links/n_geoms are preserved in the requires_grad branch (used by the autograd backward static loops).

Bit-exact: the new constants equal the values previously read from the descriptors, so codegen sees the same numeric loop bounds as before.

New AMDGPU-specific func_solve_body variants (B3 + B4)

Adds genesis/engine/solvers/rigid/constraint/solver_amdgpu.py and registers it from constraint/init.py. The baked-in func_solve_body_monolith inlines linesearch + post-linesearch + the iteration loop into a single kernel, producing ~410 archived VGPRs/wave on gfx942 (1 wave/SIMD = 12.5% theoretical occupancy). Two variants are registered with perf_dispatch:

B3 / func_solve_body_split_amdgpu: per-iteration 2-kernel split (linesearch+apply-alpha | post-linesearch), iteration loop lifted to Python. Halves the inlined call chain seen by the LLVM AMDGPU backend inside each kernel, giving the register allocator a smaller live-range graph to budget.
B4 / func_solve_body_lifted_loop_amdgpu: per-iteration single-kernel launch (full func_solve_iter body), iteration loop lifted to Python. Keeps the body monolithic but kills the cross-iteration live ranges the compiler would otherwise carry through the for-loop.

Both variants drop the per-iteration `if not improved: break` early-exit because reading `improved` from the host requires a per-iter D2H sync (which costs more than the saved iterations buy). Per-batch convergence is still respected: kernels skip work for batches where constraint_state.improved[i_b] == False (matches existing CUDA func_solve_decomposed semantics in solver_breakdown.py). Both pin block_dim=64 to avoid the 50% VALU-lane-masking penalty wave64 hardware imposes on 32-thread workgroups.

Touched files:

genesis/utils/array_class.py
genesis/engine/solvers/rigid/rigid_solver.py
genesis/engine/solvers/kinematic_solver.py
genesis/engine/solvers/rigid/abd/forward_dynamics.py
genesis/engine/solvers/rigid/constraint/solver.py
genesis/engine/solvers/rigid/constraint/solver_breakdown.py
genesis/engine/solvers/rigid/constraint/solver_amdgpu.py (new)
genesis/engine/solvers/rigid/constraint/init.py

…d_sim_config + AMDGPU B3/B4 func_solve_body variants Two complementary AMDGPU optimizations for the rigid constraint solver hot-path on gfx942. Combined median throughput on the 8192-env G1 benchmark lands at 801k env-steps/s (vs ~780k with only the existing baked-in monolith), without performance_mode=True. 1) Compile-time shape constants in StructRigidSimStaticConfig Under gs.use_ndarray=True, V_ANNOTATION resolves to qd.types.ndarray(), which forces hot kernels to read array dimensions from runtime descriptors. Quadrants emits a tiny single-thread "_serial" scope kernel for every such shape lookup; in profiles these account for ~58 ms over 311 steps (76 distinct _serial kernels per step). This change adds always-populated padded shape constants to StructRigidSimStaticConfig (n_dofs_, n_envs, n_links_, n_entities_, n_geoms_) and rewrites the shape lookups in: - forward_dynamics.py (47 sites, 60 lookups) - constraint/solver.py (54 sites, including func_solve_body_monolith) - constraint/solver_breakdown.py (10 sites) - constraint/solver_amdgpu.py (3 sites; see below) to read static_rigid_sim_config.n_* instead. Reading these through the qd.template() static config makes them compile-time literals, so Quadrants no longer emits the descriptor-read or the _serial dispatch. Original unpadded n_entities/n_links/n_geoms are preserved in the requires_grad branch (used by the autograd backward static loops). Contribution measured in isolation on top of B3/B4: +2.7% median throughput, plus a tighter perf distribution (10/10 runs in a 0.4% spread vs baseline 8% spread with 2/10 thermal outliers). Bit-exact: the new constants equal the values previously read from the descriptors, so codegen sees the same numeric loop bounds as before. 2) New AMDGPU-specific func_solve_body variants (B3 + B4) Adds genesis/engine/solvers/rigid/constraint/solver_amdgpu.py and registers it from constraint/__init__.py. The baked-in func_solve_body_monolith inlines linesearch + post-linesearch + the iteration loop into a single kernel, producing ~410 archived VGPRs/wave on gfx942 (1 wave/SIMD = 12.5% theoretical occupancy). Two variants are registered with perf_dispatch: - B3 / func_solve_body_split_amdgpu: per-iteration 2-kernel split (linesearch+apply-alpha | post-linesearch), iteration loop lifted to Python. Halves the inlined call chain seen by the LLVM AMDGPU backend inside each kernel, giving the register allocator a smaller live-range graph to budget. - B4 / func_solve_body_lifted_loop_amdgpu: per-iteration single-kernel launch (full func_solve_iter body), iteration loop lifted to Python. Keeps the body monolithic but kills the cross-iteration live ranges the compiler would otherwise carry through the for-loop. Both variants drop the per-iteration \`if not improved: break\` early-exit because reading \`improved\` from the host requires a per-iter D2H sync (which costs more than the saved iterations buy). Per-batch convergence is still respected: kernels skip work for batches where constraint_state.improved[i_b] == False (matches existing CUDA func_solve_decomposed semantics in solver_breakdown.py). Both pin block_dim=64 to avoid the 50% VALU-lane-masking penalty wave64 hardware imposes on 32-thread workgroups. Touched files: - genesis/utils/array_class.py - genesis/engine/solvers/rigid/rigid_solver.py - genesis/engine/solvers/kinematic_solver.py - genesis/engine/solvers/rigid/abd/forward_dynamics.py - genesis/engine/solvers/rigid/constraint/solver.py - genesis/engine/solvers/rigid/constraint/solver_breakdown.py - genesis/engine/solvers/rigid/constraint/solver_amdgpu.py (new) - genesis/engine/solvers/rigid/constraint/__init__.py Made-with: Cursor

…d by test_mesh_align Two issues introduced by the constraint-solver perf commit caused test_mesh_align to fail deterministically with sub-percent FP drift on settled-pose dofs_velocity (0.0528 rad/s vs. 0.05 tol on dofs_velocity[3] after 400 sim steps, bit-identical signature across 4 consecutive runs). 1) Inner-loop `n_dofs` substitutions caused FP-add reordering The replacement of n_dofs = constraint_state.<arr>.shape[0] # runtime descriptor read -> n_dofs = static_rigid_sim_config.n_dofs_ # qd.template() literal inside hot @qd.func definitions made `n_dofs` a compile-time constant visible to the AMDGPU LLVM backend. The backend then fully unrolled the inner reductions for i_d in range(n_dofs): constraint_state.qacc[i_d, i_b] += constraint_state.search[i_d, i_b] * alpha constraint_state.Ma[i_d, i_b] += constraint_state.mv[i_d, i_b] * alpha (and the equivalent dot products in the linesearch / cholesky / gradient update kernels). Loop unrolling reorders the (non-associative) FP additions vs. the rolled runtime-bound loop, producing ~1e-5-scale per-step drift that accumulates over 400 sim steps * ~10 solver iters into a 0.003 rad/s difference -- enough to push test_mesh_align over its 0.05 tolerance. The other static-config substitutions (n_envs, n_links_, n_entities_, n_geoms_) are kept because they target outer/independent batch loops (`for i_b in range(_B)`, `ndrange(n_links, _B)`, etc.) where each iteration is a fully-independent unit of work and unrolling does not change FP ordering. Reverted 30 sites in constraint/solver.py and 2 sites in constraint/solver_breakdown.py back to the original `arr.shape[X]` form. The functions that gained `static_rigid_sim_config: qd.template()` parameters in 28e2b41 (func_hessian_direct_tiled, func_cholesky_factor_direct_batch, func_cholesky_solve_batch, func_save_prev_grad, func_hessian_and_cholesky_factor_incremental_dense_batch) no longer need them after the revert and have been simplified back to the baseline signatures, along with their call sites in solver.py and solver_breakdown.py. Added a top-of-file comment in constraint/solver.py documenting why `n_dofs` is intentionally NOT routed through static_rigid_sim_config so this isn't accidentally undone by future contributors. Throughput impact at n_envs=8192 G1 benchmark: negligible (803,879 env-steps/s, still above the 800k target). Compile-time n_envs / n_links_ / n_entities_ still drive the kernel grid + outer loops where the _serial-dispatch elimination matters most; the inner n_dofs loops execute the same total FLOPs whether unrolled or not. 2) AMDGPU B3/B4 func_solve_body variants ignored per-batch convergence `_kernel_linesearch_amdgpu` (B3) called `func_linesearch_and_apply_alpha` unconditionally for every batch every iteration -- only gated on `n_constraints[i_b] > 0`, not on `improved[i_b]`. After a batch had converged (improved=False), the next iteration still re-ran linesearch using the stale `search` direction from the pre-convergence iteration; if the resulting alpha was non-trivial, the qacc / Ma / Jaref updates applied a spurious step and injected noise that accumulated over many sim steps. `_kernel_solve_one_iter_amdgpu` (B4) had the same issue with `func_solve_iter`. Both diverged from baseline `func_solve_body_monolith`, which does `if not improved[i_b]: break`. Added the missing `and constraint_state.improved[i_b]` gate to both kernels. `improved[i_b]` is read device-side, so this preserves the no-D2H-sync property that the lifted Python iteration loop in solver_amdgpu.py was specifically designed for. Updated the module docstring to reflect that per-iter convergence semantics are now matched exactly (rather than dropped as the original comment claimed). Note: B3/B4 variants weren't selected by perf_dispatch for the single-env test_mesh_align (baseline monolith won), so this gate didn't change the test_mesh_align outcome -- but it fixes a real correctness bug that would otherwise surface in many-env workloads where B3 or B4 wins. Validation ---------- - test_mesh_align: PASSED (was bit-deterministically failing 4x in a row) - 17/17 constraint-heavy required tests pass (8m22s): test_box_plane_dynamics * 4 (CG+Newton, implicitfast+Euler, contact) test_frictionloss * 4 (CG+Newton, implicitfast+Euler, friction) test_walker * 4 (multi-link contact) test_dynamic_weld * 2 (equality constraints) test_frictionloss_advanced test_mesh_align test_urdf_align - Benchmark @ n_envs=8192, num_steps=500, fp32: 803,879 env-steps/s Touched files: - genesis/engine/solvers/rigid/constraint/solver.py - genesis/engine/solvers/rigid/constraint/solver_amdgpu.py - genesis/engine/solvers/rigid/constraint/solver_breakdown.py Made-with: Cursor

npoulad1 · 2026-04-29T00:17:32Z

/run-ci

yaoliu13

This PR is not based on the latest amd-integration: perf/npoulad/jacobian_perf...ROCm:Genesis:amd-integration

Please read Confluence "Genesis PR Review Process"

npoulad1 · 2026-04-29T05:23:32Z

/run-ci

yaoliu13 · 2026-04-29T06:42:06Z

/run-ci

…Newton-path latent fix Recovers ~5% of the throughput lost by the previous all-runtime bug fix (237bf03) while preserving its bit-exact correctness fix for test_mesh_align. Background ---------- The performance commit 28e2b41 made `n_dofs = static_rigid_sim_config.n_dofs_` in 30 sites in this file (and 2 in solver_breakdown.py). Bug fix 237bf03 reverted ALL 30 to runtime `n_dofs = <arr>.shape[X]` because some sites (scalar reductions like `for jd: snorm += search[jd]**2`) accumulated ~1e-5-scale FP drift that pushed test_mesh_align over its 0.05 rad/s tolerance after 400 sim steps. The all-or-nothing revert was safe but overly conservative -- it cost ~5% throughput on the AMDGPU benchmark even though many of the 30 sites were never actually exposing the bug. Key insight: AMDGPU enables FMA fusion unconditionally (`runtime/amdgpu/jit_amdgpu.cpp` `AllowFPOpFusion::Fast`), so element-wise `x[i] += y[i] * a` patterns are bit-equivalent rolled vs unrolled (both fuse to fma with the same operands). The drift only arises in SCALAR REDUCTIONS, where `fast_math=True` (Genesis default; see `quadrants/codegen/llvm/codegen_llvm.cpp` `setAllowReassoc()`) lets the compiler reorder the unrolled add tree (e.g., pairwise vs left-leaning). Surgical classification ----------------------- Each function classified by its OWN body (callees have separate `n_dofs` parameter scopes -- a parent's literal does NOT leak into helper inner loops): * SAFE (literal `n_dofs = static_rigid_sim_config.n_dofs_`, 16 sites): constraint_solver_kernel_{reset,clear,masked_clear}, add_collision_constraints, func_equality_{connect,joint}, add_{joint_limit,frictionloss}_constraints, func_linesearch_and_apply_alpha, func_save_prev_grad, func_update_gradient_{batch,tiled}, func_solve_init, func_solve_iter, func_solve_iter_post_linesearch, func_update_qacc. All have only element-wise `range(n_dofs)` loops in their own body (each iteration writes to a DISTINCT memory location indexed by the loop variable). * UNSAFE (runtime `n_dofs = <arr>.shape[X]`, 14 sites in this file + 2 in solver_breakdown.py): func_equality_weld, all func_hessian_*, all func_cholesky_*, func_ls_init_and_eval_p0_opt, func_linesearch_batch, func_update_constraint_batch, func_terminate_or_update_descent_batch, initialize_Jaref, initialize_Ma. Plus _kernel_update_constraint_{qfrc,cost} in solver_breakdown.py. All have at least one `range(n_dofs)` loop that accumulates into a SCALAR or non-loop-indexed location (dot products, mat-vec, Cholesky, gradient norm). `func_save_prev_grad` regains the `static_rigid_sim_config: qd.template()` parameter that 237bf03 stripped (its body now uses the literal again); its sole call site in solver_breakdown.py is updated accordingly. Also: latent-bug fix in func_hessian_direct_tiled ------------------------------------------------- Bug fix 237bf03 stripped `static_rigid_sim_config: qd.template()` from this function's signature, but the body STILL references `_B = static_rigid_sim_config.n_envs`. Yesterday's full-suite test run predates 237bf03 and so didn't hit this path; today's run (driven by this surgical work) did, exposing the bug. The Newton-only tiled path threw `QuadrantsNameError: Name "static_rigid_sim_config" is not defined` at compile time, taking down test_stickman[Newton], test_cholesky_tiling, test_noslip_iterations[gpu], test_path_planning_avoidance[gpu], and test_scene_saver_franka[gpu] (~26 GPU-only failures). Re-added the parameter to the signature and updated both call sites (solver.py:1859 and solver_breakdown.py:158). Validation ---------- - test_mesh_align: PASSED (the regression that motivated 237bf03) - Full required test suite (243 selected, -k filter for known-skip): 237 PASSED, 0 FAILED, 0 ERROR, 5 SKIPPED, 1 XFAILED in 1h43m The 5 SKIPPED are the test_collision_edge_cases AMD-GPU known-tolerance failures already documented; the 1 XFAILED is test_nan_reset. - Throughput @ n_envs=8192, num_steps=500, fp32 on MI300X: Bug-fix-applied (763k median) vs surgical fix: Best 5-run median: 806,248 env-steps/s (4 fast + 1 slow outlier) Best single run: 810,000 env-steps/s Worst 5-run med: 767,277 env-steps/s (5 slow, no fast-mode) The bimodal noise (~770k slow vs ~808k fast) is system-level (thermal/scheduler), independent of the code change. Surgical fix consistently *enables* the fast mode that bug-fix-applied never reached (which never crossed ~770k in any run today). Touched files ------------- - genesis/engine/solvers/rigid/constraint/solver.py 16 n_dofs sites flipped to literal + func_save_prev_grad signature re-extended + func_hessian_direct_tiled latent-bug fix + updated top-of-file SAFE/UNSAFE classification comment. - genesis/engine/solvers/rigid/constraint/solver_breakdown.py Two call-site updates (func_save_prev_grad, func_hessian_direct_tiled) to pass static_rigid_sim_config through. Made-with: Cursor

npoulad1 · 2026-04-29T20:39:11Z

/run-ci

yaoliu13

one test failed in pre-submit

npoulad1 · 2026-04-30T06:00:54Z

/run-ci

amd-anchaudh

lgtm
1272030.3

yaoliu13

LGTM

npoulad1 force-pushed the perf/npoulad/jacobian_perf branch from 9cf1611 to 28e2b41 Compare April 28, 2026 18:44

npoulad1 marked this pull request as ready for review April 29, 2026 00:07

yaoliu13 requested changes Apr 29, 2026

View reviewed changes

Merge branch 'amd-integration' into perf/npoulad/jacobian_perf

71f52c7

npoulad1 added 2 commits April 29, 2026 09:34

Merge branch 'amd-integration' into perf/npoulad/jacobian_perf

c053ae3

yaoliu13 requested changes Apr 29, 2026

View reviewed changes

amd-anchaudh approved these changes Apr 30, 2026

View reviewed changes

yaoliu13 approved these changes Apr 30, 2026

View reviewed changes

yaoliu13 merged commit 376fef7 into amd-integration Apr 30, 2026

yaoliu13 deleted the perf/npoulad/jacobian_perf branch April 30, 2026 15:45

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[PERF IMPROVEMENT] constraint solver: shape constants via static_rigid_sim_config + AMDGPU B3/B4 func_solve_body variants#46

[PERF IMPROVEMENT] constraint solver: shape constants via static_rigid_sim_config + AMDGPU B3/B4 func_solve_body variants#46
yaoliu13 merged 5 commits intoamd-integrationfrom
perf/npoulad/jacobian_perf

npoulad1 commented Apr 28, 2026 •

edited

Loading

Uh oh!

npoulad1 commented Apr 29, 2026

Uh oh!

yaoliu13 left a comment •

edited

Loading

Uh oh!

npoulad1 commented Apr 29, 2026

Uh oh!

yaoliu13 commented Apr 29, 2026

Uh oh!

npoulad1 commented Apr 29, 2026

Uh oh!

yaoliu13 left a comment

Uh oh!

npoulad1 commented Apr 30, 2026

Uh oh!

amd-anchaudh left a comment •

edited by yaoliu13

Loading

Uh oh!

yaoliu13 left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

npoulad1 commented Apr 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

npoulad1 commented Apr 29, 2026

Uh oh!

yaoliu13 left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

npoulad1 commented Apr 29, 2026

Uh oh!

yaoliu13 commented Apr 29, 2026

Uh oh!

npoulad1 commented Apr 29, 2026

Uh oh!

yaoliu13 left a comment

Choose a reason for hiding this comment

Uh oh!

npoulad1 commented Apr 30, 2026

Uh oh!

amd-anchaudh left a comment • edited by yaoliu13 Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

yaoliu13 left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

npoulad1 commented Apr 28, 2026 •

edited

Loading

yaoliu13 left a comment •

edited

Loading

amd-anchaudh left a comment •

edited by yaoliu13

Loading