perf/accuracy: Flash Attention, torch-native SO(3), cosine schedule, DDIM, analytical g(t), acos fix by mooreneural · Pull Request #454 · RosettaCommons/RFdiffusion

mooreneural · 2026-05-19T20:23:27Z

Scientific and performance improvements across the diffusion pipeline. No breaking changes, all new features are opt-in via existing config flags or new parameters with safe defaults.

Changes

Flash Attention (`Attention_module.py`)

Replaced hand-rolled einsum attention in Attention, AttentionWithBias, and MSAColAttention with F.scaled_dot_product_attention
On CUDA + PyTorch ≥ 2.0, this automatically dispatches to Flash Attention - O(1) memory vs. materializing the full L×L attention matrix
Estimated 20–40% speedup on the attention-heavy MSA and pair tracks
AttentionWithBias passes its pairwise bias as attn_mask, fusing it into the kernel rather than a separate add

Torch-native SO(3) ops (`igso3.py`, `diffusion.py`, `inference/utils.py`)

Added hat_batch(), Log_torch(), Exp_torch() Rodrigues-formula rotation ops that stay on-device
Eliminated all scipy_R / .cpu().numpy() roundtrips that fired at every denoising step in reverse_sample_vectorized(), diffuse_frames(), and get_next_frames()
Round-trip accuracy <1e-6 for angles in [0, π] (matches scipy's output for the same domain)

Cosine noise schedule (`diffusion.py`)

Added schedule_type="cosine" (Nichol & Dhariwal, 2021 - Improved DDPM)
Enable via diffuser.schedule_type=cosine; b0/bT are ignored for this mode
Better SNR curve at small t, tends to produce more diverse samples than linear

Analytical `g(t)` (`diffusion.py`)

For the linear σ schedule, replaced a per-step torch.autograd.grad call with the closed-form derivative: g(t) = sqrt(2·σ(t)·(min_b + t·(max_b − min_b)))
Falls back to autograd for the exponential schedule

IGSO3 module-level cache (`diffusion.py`)

Added _igso3_cache dict at module level
Prevents repeated disk deserialization of the precomputed CDF table when multiple Diffuser instances are created in the same process (e.g., batch inference scripts)

DDIM deterministic sampling (`inference/utils.py`)

Added get_mu_xt_x0_ddim() implementing the DDIM update rule (Song et al., 2021)
Wired ddim=True flag through Denoise.__init__() → get_next_pose() → get_next_ca()
Deterministic, lower-variance trajectories; enables fewer-step inference at equivalent quality
Default is False no change to existing behavior

`acos` NaN clamp (`kinematics.py`)

torch.acos(vw) → torch.acos(torch.clamp(vw, -1.0, 1.0))
Prevents silent NaN propagation when float rounding pushes a normalized dot product just outside [−1, 1]

Notes for reviewers

The Flash Attention change requires PyTorch ≥ 2.0 for the fused kernel; it degrades gracefully to the standard implementation on older versions
DDIM is wired at the Denoise class level only - rotation (SO3) denoising still uses the stochastic IGSO3 reverse SDE; a full DDIM-on-SO3 implementation would be a follow-on
The legacy Log() and Exp() functions in igso3.py are kept for backward compatibility (they're used in offline precomputation, not inference)

…IM, analytical g(t) Attention (Attention_module.py): - Replace hand-rolled einsum attention with F.scaled_dot_product_attention in Attention, AttentionWithBias, and MSAColAttention. Uses Flash Attention automatically when available on CUDA (20-40% speedup, O(1) memory). - AttentionWithBias passes the pairwise bias as attn_mask so it is folded into the fused kernel rather than materializing a separate attention matrix. SO3 diffusion (igso3.py, diffusion.py, inference/utils.py): - Add hat_batch(), Log_torch(), Exp_torch() -- on-device rotation ops using the Rodrigues formula. Eliminates all scipy CPU round-trips during inference. - Replace scipy_R calls in reverse_sample_vectorized() and diffuse_frames() with the new torch-native equivalents (stay on GPU, no .cpu()/.numpy() transfers). - Remove redundant scipy rotation normalization in get_next_frames(); rotation matrices from rigid_from_3_points are already orthogonal. Noise schedule (diffusion.py): - Add cosine schedule (Nichol & Dhariwal, 2021). Enabled via schedule_type="cosine"; b0/bT are ignored for this mode. - Analytical g(t) for linear schedule: eliminates a per-step autograd call. Formula: g(t) = sqrt(2 * sigma(t) * (min_b + t*(max_b - min_b))). IGSO3 cache (diffusion.py): - Add module-level _igso3_cache dict. Avoids repeated disk deserialization when multiple Diffuser objects are created in the same process (batch inference). DDIM sampling (inference/utils.py): - Add get_mu_xt_x0_ddim() implementing the deterministic DDIM update rule. - Wire ddim=True flag through Denoise.__init__() -> get_next_pose() -> get_next_ca(). Setting ddim=True produces deterministic, lower-variance trajectories and enables fewer-step inference at equivalent quality. Numerical stability (kinematics.py): - Clamp input to acos in get_ang() to [-1, 1] to prevent NaN from float rounding at exactly +/-1.

mooreneural · 2026-05-19T21:13:39Z

Thanks for running the CI check. This failure is not introduced by my PR.

The test environment resolves dgl==2.1.0 + torchdata==0.11.0, which are incompatible. DGL 2.1.0 internally imports torchdata.datapipes, but that submodule was removed from torchdata starting in version 0.7.0. The crash originates in util_module.py → import dgl, which is completely unrelated to the five files changed in this PR (Attention_module.py, diffusion.py, igso3.py, inference/utils.py, kinematics.py).

The identical failure would occur on the unmodified upstream main branch under these package versions.

To fix the CI environment (outside the scope of this PR):

Option A: Pin torchdata to the last release that still ships datapipes

pip install "torchdata<0.7"

Option B: Upgrade DGL to a version that no longer imports datapipes

pip install "dgl>=2.4"

Happy to help investigate further if useful.

The original Log_torch used theta/(2*sin(theta)) * skew throughout [0, pi]. Near theta=pi, the float32 R matrix loses trace precision (sin(theta) -> 0), causing the computed theta from acos(trace) to diverge from the theta encoded in the skew elements -- producing up to 10x rotation-matrix error in the worst case. Fix: for cos(theta) < 0, estimate theta via pi - asin(||skew||/2) instead. The skew magnitude 2*sin(theta) remains accurate in float32 even near pi, avoiding the trace instability entirely. Fall back to R+I decomposition (R+I = 2*outer(n,n)) only for the exact-pi case where skew -> 0. All arithmetic is done in float64 on-device; result is cast back to input dtype. Round-trip error R -> Log_torch -> Exp_torch -> R: Before fix: max|dR| = 9.83 near theta=pi (catastrophic) After fix: max|dR| = 2.25e-04 full range, mean = 1.58e-07 Also adds scripts/benchmark_pr454.py for measuring PR RosettaCommons#454 improvements.

mooreneural · 2026-06-19T01:31:34Z

Benchmark results (RTX 5080, PyTorch 2.11.0+cu128)

Flash Attention (F.scaled_dot_product_attention vs. hand-rolled einsum):

Config	Before	After	Speedup
L=64	0.086 ms	0.012 ms	7.4×
L=200	0.091 ms	0.025 ms	3.6×
L=500	0.088 ms	0.050 ms	1.8×
L=200, batch=4	0.090 ms	0.027 ms	3.4×

Same pattern for AttentionWithBias. Speedup is largest at short sequence lengths where the quadratic term in the einsum path dominates. Script: scripts/benchmark_pr454.py.

SO(3) torch-native ops isolated timings show scipy faster for typical residue counts (GPU kernel launch overhead dominates for N=50–500). The value here is eliminating the .cpu() transfer that forces a GPU sync on every denoising step and removing the scipy inference-time dependency.

Numerical fix (commit 2c21e73): patched a float32 precision issue in Log_torch near θ=π. Near π, the trace-based θ estimate loses precision while the skew elements stay accurate, causing the two estimates of sin(θ) to diverge and producing large errors in the standard θ/(2 sin θ) · skew formula. Fixed by switching to π − arcsin(‖skew‖/2) when cos(θ) < 0, with an R+I = 2·nnᵀ fallback at exactly θ=π.

Round-trip error R → Log_torch → Exp_torch → R over 100k random rotations:

Before: max |dR| ≈ 10 near θ=π
After: max |dR| = 2.25e-04, mean = 1.58e-07

rclune requested a review from woodsh17 May 20, 2026 08:42

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf/accuracy: Flash Attention, torch-native SO(3), cosine schedule, DDIM, analytical g(t), acos fix#454

perf/accuracy: Flash Attention, torch-native SO(3), cosine schedule, DDIM, analytical g(t), acos fix#454
mooreneural wants to merge 2 commits into
RosettaCommons:mainfrom
mooreneural:main

mooreneural commented May 19, 2026 •

edited

Loading

Uh oh!

mooreneural commented May 19, 2026 •

edited

Loading

Uh oh!

mooreneural commented Jun 19, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

mooreneural commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Changes

Flash Attention (Attention_module.py)

Torch-native SO(3) ops (igso3.py, diffusion.py, inference/utils.py)

Cosine noise schedule (diffusion.py)

Analytical g(t) (diffusion.py)

IGSO3 module-level cache (diffusion.py)

DDIM deterministic sampling (inference/utils.py)

acos NaN clamp (kinematics.py)

Notes for reviewers

Uh oh!

mooreneural commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Option A: Pin torchdata to the last release that still ships datapipes

Option B: Upgrade DGL to a version that no longer imports datapipes

Uh oh!

mooreneural commented Jun 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

mooreneural commented May 19, 2026 •

edited

Loading

Flash Attention (`Attention_module.py`)

Torch-native SO(3) ops (`igso3.py`, `diffusion.py`, `inference/utils.py`)

Cosine noise schedule (`diffusion.py`)

Analytical `g(t)` (`diffusion.py`)

IGSO3 module-level cache (`diffusion.py`)

DDIM deterministic sampling (`inference/utils.py`)

`acos` NaN clamp (`kinematics.py`)

mooreneural commented May 19, 2026 •

edited

Loading

mooreneural commented Jun 19, 2026 •

edited

Loading