Fix STG crash/index miscount with guide-mask self-attention #503

Open
Jean J. de Jong (jjdejong) wants to merge 1 commit into Lightricks:master from jjdejong:stg-guide-mask-fix

@jjdejong

Summary

Fixes a crash (and a related attention-index miscount) in the STG / multimodal
guider whenever cond-image / keyframe guides with strength != 1.0 are combined
with STG perturbation.

Symptom

During the STG-perturbed denoise step:

RuntimeError: The expanded size of the tensor (N) must match the existing
size (M) at non-singleton dimension 1 ... in _attention_with_guide_mask

Root cause

ComfyUI core (CORE-166, "Reduce LTX2.3 peak VRAM when guide_mask is in use")
now splits one video self-attention into up to three optimized_attention
calls over sliced queries against the full key/value. The plugin's STG
PatchAttention skipped a layer by returning v (the full-length value) and
counted each sub-call as a separate attention index. So:

  • return v is the wrong length for the sliced-query output slot → crash; and
  • the extra sub-calls shift audio_attn_idx in calc_stg_indexes, so audio STG
    would skip the wrong attention even when it doesn't crash.
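
The two failure modes above can be reproduced in a toy model. This is only a sketch under stated assumptions: sequences are plain Python lists rather than tensors, and `split_self_attention`, `NaiveSkip`, and the slice bounds are hypothetical names invented for illustration, not code from ComfyUI core or this plugin.

```python
# Toy model: a "tensor" is a plain list of tokens, so only lengths matter.

def split_self_attention(q, k, v, attention_fn, bounds):
    """Hypothetical stand-in for a guide-mask split: one logical
    self-attention issued as several calls over query slices, each
    against the full key/value."""
    out = [None] * len(q)
    for start, end in bounds:
        chunk = attention_fn(q[start:end], k, v)
        if len(chunk) != end - start:
            # Toy analogue of the size mismatch reported in the traceback.
            raise RuntimeError(
                f"slot holds {end - start} tokens, got {len(chunk)}"
            )
        out[start:end] = chunk
    return out


class NaiveSkip:
    """Counts every call as its own attention index and skips with v."""

    def __init__(self, skip_index):
        self.skip_index = skip_index
        self.calls = 0

    def __call__(self, q, k, v):
        idx = self.calls
        self.calls += 1
        if idx == self.skip_index:
            return v        # full length: wrong for a sliced query
        return list(q)      # stand-in for a real attention output


tokens = list(range(10))
bounds = [(0, 4), (4, 7), (7, 10)]

# Index miscount: one logical attention advances the counter by three.
counter = NaiveSkip(skip_index=99)
split_self_attention(tokens, tokens, tokens, counter, bounds)
calls_for_one_attention = counter.calls

# Crash: skipping the first sub-call returns 10 tokens for a 4-token slot.
try:
    split_self_attention(tokens, tokens, tokens, NaiveSkip(0), bounds)
    crashed = False
except RuntimeError:
    crashed = True
```

In this toy, any index computed on the assumption of one call per attention lands on the wrong call once a split has occurred earlier in the sequence of calls.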

Fix

PatchAttention now recognises a guide-split sub-call via the
low_precision_attention=False kwarg (the only signal core's guide path passes —
this avoids false-positiving the v2a cross-attention, which also has
q_len < v_len), collapses the split into a single logical STG index, and
returns the matching v[:, off:off+q_len] slice when skipping. No core changes.
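
A minimal sketch of the idea, not the PR's actual code: `GuideAwareSkip` is a hypothetical class, sequences are plain lists with no batch dimension (so the slice is `v[off:off+q_len]` rather than `v[:, off:off+q_len]`), and it assumes the sub-calls arrive in order with query slices that tile the full sequence.

```python
class GuideAwareSkip:
    """Hypothetical detector that collapses guide-split sub-calls into
    one logical STG index and skips with the matching slice of v."""

    def __init__(self, skip_index):
        self.skip_index = skip_index
        self.logical_index = -1   # index of the logical attention in progress
        self.offset = 0           # query tokens already covered by this split

    def __call__(self, q, k, v, **kwargs):
        # Per the PR text, the guide path is the only caller that passes
        # low_precision_attention=False, so it serves as the split marker.
        is_split = kwargs.get("low_precision_attention") is False
        if is_split:
            if self.offset == 0:
                self.logical_index += 1   # first sub-call opens a new index
            start = self.offset
            self.offset += len(q)
            if self.offset >= len(v):
                self.offset = 0           # assumed: slices tile the sequence
        else:
            self.logical_index += 1
            self.offset = 0
            start = 0

        if self.logical_index != self.skip_index:
            return list(q)                # stand-in for a real attention output
        # Skipping: a split sub-call gets only its own slice of v.
        return v[start:start + len(q)] if is_split else v


values = [100 + t for t in range(10)]
queries = list(range(10))
patch = GuideAwareSkip(skip_index=0)

skipped = []
for start, end in [(0, 4), (4, 7), (7, 10)]:
    skipped += patch(queries[start:end], values, values,
                     low_precision_attention=False)
index_after_split = patch.logical_index

# A later ordinary call gets the next logical index.
patch(queries, values, values)
index_after_plain_call = patch.logical_index
```

Here the three sub-calls together return exactly `values`, and the counter advances once for the whole split rather than three times.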

Scope

Not AV-specific: any video-only workflow combining optional_cond_images
(strength != 1.0) with the STG / multimodal guider triggers this.

Repro

A video-only LTX-2 workflow with optional_cond_images at strength != 1.0 and
the STG / multimodal guider with perturbation enabled.

Caveat

The detector keys on low_precision_attention=False as the guide-split marker,
which is a core implementation detail. If core changes that, the detector would
silently stop collapsing the split — maintainers may prefer a more explicit
contract or a core-side fix.

When cond-image/keyframe guides with strength != 1.0 are combined with STG
perturbation, the perturbed denoise step crashed with:

  RuntimeError: The expanded size of the tensor (N) must match the existing
  size (M) ... in _attention_with_guide_mask

Root cause: comfy core (CORE-166, "Reduce LTX2.3 peak VRAM when guide_mask
is in use") splits one video self-attention into up to three
optimized_attention calls over sliced queries against the full key/value.
STG's PatchAttention skipped a layer with `return v` (the full sequence)
and counted each sub-call as a separate attention index. So the returned
value was the wrong length (crash on the query-slice assignment), and the
extra sub-calls shifted audio_attn_idx in calc_stg_indexes.

Fix: detect a guide-split sub-call via the low_precision_attention=False
kwarg (the only signal core's guide path passes; this avoids
false-positiving the v2a cross-attention, which also has q_len < v_len),
collapse the split into a single logical STG index, and return the matching
v[:, off:off+q_len] slice when skipping. No core changes.

Not AV-specific: any video-only workflow combining cond_images
(strength != 1.0) with the STG/multimodal guider triggers it.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>