[model, feature] qwen3-omni: add packed sequence support and shared sequence utilities#4304
[model, feature] qwen3-omni: add packed sequence support and shared sequence utilities#4304hbhflw2000 wants to merge 6 commits into
Conversation
f0f95d8 to
042eed1
Compare
|
/claude review |
|
Light Code Review - Clean implementation that follows the existing Qwen3-VL packed-padding pattern well. One inline comment posted about using _parallel_size() consistently. Suggested test cases: No perf tests impacted. |
|
Details: pack_or_pad_batch_sequences accesses pg_collection.tp.size() / .cp.size() directly instead of the null-safe _parallel_size() helper introduced in the same PR. The sibling function pad_batch_sequences_for_context_parallel was already refactored to use it. See inline comment for a suggested fix. |
|
Test coverage notes: (1) test_forward_step_passes_packed_sequence_params_to_model only exercises CP=1 -- a CP>1 variant would cover _get_dense_batch_on_this_cp_rank + packed path interaction but may require more complex mocking and could be deferred. (2) No direct unit test for pack_or_pad_batch_sequences (tested indirectly via forward_step) -- a focused test would make FP8 padding alignment (math.lcm) and force_to_seq_length easier to verify in isolation. Suggested test cases: No perf tests impacted. |
| cp_size=cp_size, | ||
| cp_rank=cp_rank, | ||
| sequence_parallel=self.config.sequence_parallel, | ||
| if packed_seq_params is None: |
There was a problem hiding this comment.
these 2 here seems not need condition check, can just be one path?
There was a problem hiding this comment.
Updated, thanks. I merged this into one split_deepstack_embs path. Packed THD tensors are already CP-aware after preprocess_packed_seqs, so only the non-packed path keeps the actual cp_size/cp_rank.
| deepstack_visual_embeds, | ||
| tp_size=tp_size, | ||
| tp_rank=tp_rank, | ||
| cp_size=1, |
There was a problem hiding this comment.
This follows the existing Qwen3-VL packed THD handling: packed tensors are already CP-partitioned by preprocess_packed_seqs, so this split should only apply SP and avoid a second dense CP split. I made the intent explicit with local variables/comments in the latest commit.
yaoyu-33
left a comment
There was a problem hiding this comment.
Overall this is a well-structured PR — the shared utilities are clean, the design note clearly explains the trade-offs vs slice_batch_for_context_parallel, and the unit test coverage is solid. A few things worth addressing before merge:
Bug: is_thd_format = True is permanent state mutation
thinker_model.py line 165:
self.language_model.rotary_pos_emb.is_thd_format = TrueThis mutates the language model's rotary embedding state without ever resetting it. If the model is called again with packed_seq_params=None (e.g. inference or a non-packed eval step), is_thd_format stays True from the previous call. The guarding condition one line above already checks if packed_seq_params is not None, so this is:
if packed_seq_params is not None and position_ids is not None:
...
self.language_model.rotary_pos_emb.is_thd_format = True
# but never reset to FalseShould be set/reset in a symmetric pattern, e.g.:
self.language_model.rotary_pos_emb.is_thd_format = packed_seq_params is not Noneplaced unconditionally just before return self.language_model(...).
Magic constants in _get_qwen3_omni_audio_output_lengths
input_lengths_leave = input_lengths % 100
feat_lengths = (input_lengths_leave - 1) // 2 + 1
return ((feat_lengths - 1) // 2 + 1 - 1) // 2 + 1 + (input_lengths // 100) * 13100 and 13 are unexplained. Given the docstring says "Match HF Qwen3-Omni audio encoder forward output lengths", these likely represent the audio encoder chunk size (100 frames) and output tokens per full chunk (13). A comment explaining what these constants represent would make this much easier to audit when HF changes the audio encoder:
# Audio encoder processes chunks of 100 input frames,
# each producing 13 output tokens; remainder frames go through
# two stride-2 convolutions then a final stride-2 pool.Also worth noting: the existing _get_feat_extract_output_lengths is a private API (_ prefix) so avoiding it is correct — but a reference to the HF source or audio encoder config that grounds these constants would prevent future drift.
Redundant attention mask construction in step
qwen3_omni_step.py lines 326–330:
forward_args["attention_mask"] = torch.ones_like(
original_tokens, dtype=torch.bool, device=original_tokens.device,
)thinker_model.forward already reconstructs attention_mask = torch.ones_like(input_ids, ...) when packed_seq_params is not None and attention_mask is None. Passing ones from the step means the model's guard is never triggered, but the result is the same. Minor, but adds an allocation on the critical path that the model discards immediately (it sets attention_mask = None before passing to language_model()). Could pass None here and let the model handle it, consistent with the existing guard.
Readability: nested ternary in language_model call
thinker_model.py lines 169–175:
input_ids=(
lm_input_ids
if packed_seq_params is not None
else None
if combined_embeddings is not None
else input_ids
),This is a nested conditional expression. A simple if/elif/else assignment before the call would be easier to follow.
Minor: no unit test for CP + packed combined path in forward_step
The test test_forward_step_passes_packed_sequence_params_to_model uses cp.size()=1. The E2E tested CP=2 + packing combined, but there's no unit-level coverage of the _get_dense_batch_on_this_cp_rank branch inside the packed path (if pack_sequences_in_batch and cp > 1). Not blocking given the E2E coverage, but worth a follow-up.
use_fp8_padding=True is hardcoded in the packed path regardless of whether FP8 training is actually enabled (adds lcm-16 alignment unconditionally). This is a conservative safe choice but worth a comment explaining why.
|
@yaoyu-33 Thanks for the detailed review. Fixed in the latest commits:
I kept the explicit attention_mask for now to avoid changing the validated packed+CP path. The new unit test verifies labels/loss_mask are CP-local while full raw input_ids and packed_seq_params are still passed to the model. |
Signed-off-by: hbhflw2000 <417911774@qq.com>
Signed-off-by: hbhflw2000 <417911774@qq.com>
Signed-off-by: hbhflw2000 <417911774@qq.com>
Signed-off-by: hbhflw2000 <417911774@qq.com>
Signed-off-by: hbhflw2000 <417911774@qq.com>
Signed-off-by: hbhflw2000 <417911774@qq.com>
f1aeec0 to
6c31ed3
Compare
What does this PR do?
Add Qwen3-Omni packed sequence training support and introduce shared raw sequence padding / packed-sequence metadata utilities for the Qwen3-Omni training path.
Changelog
pack_sequences_in_batch=Trueforward-step support.input_idsavailable for model-internal mRoPE while slicing train tensors on CP ranks.training/utils/padding_utils.py.PackedSeqParamsconstruction intraining/utils/packed_seq_utils.py.Design note / RFC
This implementation follows the existing Qwen3-VL packed-padding pattern: pad raw batch sequence tensors to an aligned dense length, build uniform THD
PackedSeqParams, and keep model-specific multimodal / mRoPE handling inside the Qwen3-Omni step and model code.This PR intentionally does not reuse
slice_batch_for_context_parallelfor Qwen3-Omni raw-batch padding. That utility operates after embedding preparation and slicesinputs_embeds, while Qwen3-Omni needs pre-forward raw sequence normalization so the fullinput_idstensor remains available for multimodal placeholder handling and mRoPE.The shared abstraction here is intentionally narrow: compute the padded target sequence length, pad/truncate common raw batch tensors, and construct uniform THD
PackedSeqParams. Model-specific logic such as multimodal merge, CP rank slicing, and mRoPE handling remains in Qwen3-Omni code.ATTENTION: Qwen3-VL code is intentionally left unchanged in this PR. Applying these helpers back to Qwen3-VL can be considered separately with Qwen3-VL-specific regression coverage.
Validation
Unit tests: