Fix STG crash/index miscount with guide-mask self-attention #503
Open
Jean J. de Jong (jjdejong) wants to merge 1 commit into
When cond-image/keyframe guides with strength != 1.0 are combined with STG perturbation, the perturbed denoise step crashed with:

RuntimeError: The expanded size of the tensor (N) must match the existing size (M) ... in _attention_with_guide_mask

Root cause: comfy core (CORE-166, "Reduce LTX2.3 peak VRAM when guide_mask is in use") splits one video self-attention into up to three `optimized_attention` calls over sliced queries against the full key/value. STG's `PatchAttention` skipped a layer with `return v` (the full sequence) and counted each sub-call as a separate attention index. So the returned value was the wrong length (crash on the query-slice assignment), and the extra sub-calls shifted `audio_attn_idx` in `calc_stg_indexes`.

Fix: detect a guide-split sub-call via the `low_precision_attention=False` kwarg (the only signal core's guide path passes; this avoids false-positiving the v2a cross-attention, which also has q_len < v_len), collapse the split into a single logical STG index, and return the matching `v[:, off:off+q_len]` slice when skipping. No core changes.

Not AV-specific: any video-only workflow combining cond_images (strength != 1.0) with the STG/multimodal guider triggers it.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Summary
Fixes a crash (and a related attention-index miscount) in the STG / multimodal
guider whenever cond-image / keyframe guides with `strength != 1.0` are combined
with STG perturbation.
Symptom
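During the STG-perturbed denoise step, the run fails with `RuntimeError: The expanded size of the tensor (N) must match the existing size (M) ...`, raised in `_attention_with_guide_mask`.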
Root cause
ComfyUI core (CORE-166, "Reduce LTX2.3 peak VRAM when guide_mask is in use")
now splits one video self-attention into up to three `optimized_attention`
calls over sliced queries against the full key/value. The plugin's STG
`PatchAttention` skipped a layer by `return v` (the full-length value) and
counted each sub-call as a separate attention index. So:
- `return v` is the wrong length for the sliced-query output slot → crash; and
- the extra sub-calls shift `audio_attn_idx` in `calc_stg_indexes`, so audio STG
  would skip the wrong attention even when it doesn't crash.
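Both failure modes can be reproduced in a toy model. The sketch below is an illustration only, not the plugin or core code: a "tensor" is a plain Python list along the sequence axis, and `guide_split_self_attention`, `NaivePatch`, and `real_attention` are hypothetical names invented for the example. The explicit length check stands in for the tensor slice assignment that raises in the real code path.

```python
# Toy model of the pre-fix behaviour. All names are hypothetical stand-ins,
# not the real ComfyUI core or plugin implementation.

def real_attention(q, k, v):
    # Stand-in for optimized_attention: one output entry per query token.
    return [f"attn({tok})" for tok in q]

def guide_split_self_attention(attn_fn, q, k, v, boundaries):
    """Call attn_fn once per query slice, always against the full k/v."""
    out = [None] * len(q)
    for start, end in boundaries:
        result = attn_fn(q[start:end], k, v)
        # Python lists would silently resize on slice assignment, so the
        # shape check that a tensor assignment performs is made explicit here.
        if len(result) != end - start:
            raise RuntimeError(
                f"size mismatch: slot holds {end - start} tokens, got {len(result)}"
            )
        out[start:end] = result
    return out

class NaivePatch:
    """One attention index per call; skipping a layer returns the full v."""

    def __init__(self, skip_indexes):
        self.skip_indexes = set(skip_indexes)
        self.calls = 0

    def __call__(self, q, k, v):
        idx = self.calls
        self.calls += 1
        if idx in self.skip_indexes:
            return v  # full-length value, even though q is only a slice
        return real_attention(q, k, v)

tokens = list(range(6))
splits = [(0, 2), (2, 4), (4, 6)]

counting = NaivePatch(skip_indexes=[])
guide_split_self_attention(counting, tokens, tokens, tokens, splits)
print(counting.calls)  # 3: one logical attention consumed three indexes

skipping = NaivePatch(skip_indexes=[0])
try:
    guide_split_self_attention(skipping, tokens, tokens, tokens, splits)
except RuntimeError as err:
    print(err)  # size mismatch: slot holds 2 tokens, got 6
```

In this model, any index computed after the split layer is off by two, which mirrors the `audio_attn_idx` shift described above.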
Fix
`PatchAttention` now recognises a guide-split sub-call via the
`low_precision_attention=False` kwarg (the only signal core's guide path passes;
this avoids false-positiving the `v2a` cross-attention, which also has
`q_len < v_len`), collapses the split into a single logical STG index, and
returns the matching `v[:, off:off+q_len]` slice when skipping. No core changes.
Scope
Not AV-specific: any video-only workflow combining `optional_cond_images`
(strength != 1.0) with the STG / multimodal guider triggers this.
Repro
A video-only LTX-2 workflow with `optional_cond_images` at strength != 1.0 and
the STG / multimodal guider with perturbation enabled.
Caveat
The detector keys on `low_precision_attention=False` as the guide-split marker,
which is a core implementation detail. If core changes that, the detector would
silently stop collapsing the split; maintainers may prefer a more explicit
contract or a core-side fix.
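To make the detector and its fragility concrete, here is a minimal sketch of the approach described in the Fix section, again on a toy model where a plain list stands in for the sequence axis. It is not the code in this PR: `GuideAwarePatch` and `run_split` are hypothetical names, and tracking `off` as a running offset that resets once the sub-calls have covered the full value length is an assumption made for the example.

```python
# Toy sketch of the detector. Names and offset bookkeeping are assumptions,
# not the plugin's actual implementation.

class GuideAwarePatch:
    def __init__(self, skip_indexes):
        self.skip_indexes = set(skip_indexes)
        self.logical_idx = 0  # one index per logical attention
        self.offset = 0       # query tokens already served in the current split

    def __call__(self, q, k, v, **kwargs):
        q_len, v_len = len(q), len(v)
        # Guide-split marker: the kwarg is present and explicitly False.
        # Cross-attention also has q_len < v_len but does not carry it.
        is_split = kwargs.get("low_precision_attention") is False

        idx = self.logical_idx
        off = self.offset if is_split else 0
        if is_split and off + q_len < v_len:
            self.offset = off + q_len  # more sub-calls of this attention follow
        else:
            self.offset = 0
            self.logical_idx += 1      # the logical attention is complete

        if idx in self.skip_indexes:
            # 1-D analogue of v[:, off:off+q_len]; unsplit calls keep `return v`.
            return v[off:off + q_len] if is_split else v
        return [f"attn({tok})" for tok in q]

def run_split(patch, tokens, boundaries, **kwargs):
    """Drive the patch the way a split self-attention would (toy version)."""
    out = []
    for start, end in boundaries:
        out.extend(patch(tokens[start:end], tokens, tokens, **kwargs))
    return out

tokens = list(range(6))
splits = [(0, 2), (2, 4), (4, 6)]

patch = GuideAwarePatch(skip_indexes=[0])
skipped = run_split(patch, tokens, splits, low_precision_attention=False)
print(skipped == tokens, patch.logical_idx)  # True 1

# If the marker disappears, the split is no longer collapsed and nothing warns.
unmarked = GuideAwarePatch(skip_indexes=[])
run_split(unmarked, tokens, splits)
print(unmarked.logical_idx)  # 3
```

The second run shows the caveat in miniature: without the kwarg, each sub-call is counted as its own attention again, with no error to signal the regression.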