feat(diffusion): add LongLive WAN training path by AndysonYs · Pull Request #4272 · NVIDIA-NeMo/Megatron-Bridge

AndysonYs · 2026-06-10T12:29:31Z

What does this PR do?

Adds the initial offline-latents LongLive WAN MVP requested in #4215, covering clean-history plus noisy-target temporal chunks with windowed attention defaults and SP/TP validation.
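To make the clean-history plus noisy-target layout concrete, here is a minimal pure-Python sketch. It is not the code from this PR: `build_longlive_chunk`, `target_only_mse`, their arguments, and the linear flow-matching interpolation `x_t = (1 - sigma) * x0 + sigma * noise` with velocity target `noise - x0` are assumptions chosen for illustration, and frames are plain lists of floats standing in for latent tensors.

```python
import random


def build_longlive_chunk(latents, chunk_frames, target_chunk, sigma, rng):
    """Return (model_input, velocity_target, loss_mask) for one training sample.

    Frames before the target chunk are kept clean as history; frames inside the
    target chunk are noised. Frames after the target chunk are dropped.
    """
    start = target_chunk * chunk_frames
    end = start + chunk_frames
    if chunk_frames <= 0 or start < 0 or end > len(latents):
        raise ValueError("target chunk is out of range")

    model_input, velocity_target, loss_mask = [], [], []
    for idx in range(end):
        frame = latents[idx]
        if idx < start:
            # Clean history: passed through unchanged and excluded from the loss.
            model_input.append(list(frame))
            velocity_target.append([0.0] * len(frame))
            loss_mask.append(0.0)
        else:
            # Noisy target: interpolate toward Gaussian noise (assumed convention).
            noise = [rng.gauss(0.0, 1.0) for _ in frame]
            model_input.append([(1.0 - sigma) * x + sigma * n for x, n in zip(frame, noise)])
            velocity_target.append([n - x for x, n in zip(frame, noise)])
            loss_mask.append(1.0)
    return model_input, velocity_target, loss_mask


def target_only_mse(prediction, velocity_target, loss_mask):
    """Mean squared error averaged over target frames only."""
    total, count = 0.0, 0.0
    for pred, tgt, weight in zip(prediction, velocity_target, loss_mask):
        frame_mse = sum((p - t) ** 2 for p, t in zip(pred, tgt)) / len(tgt)
        total += weight * frame_mse
        count += weight
    return total / count if count else 0.0


# Six frames of two values each, chunks of two frames, train on the second chunk.
latents = [[float(i), float(i) + 0.5] for i in range(6)]
model_input, velocity_target, loss_mask = build_longlive_chunk(
    latents, chunk_frames=2, target_chunk=1, sigma=0.5, rng=random.Random(0)
)
print(loss_mask)  # [0.0, 0.0, 1.0, 1.0]
```

In this sketch the history frames still reach the model as context, but any prediction error on them contributes nothing to the loss because their mask weight is zero.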

Changelog

Add longlive_wan_step registration so WAN recipes can select the LongLive forward step from scripts/training/run_recipe.py.
Add LongLive WAN chunk selection, target-only loss masking, teacher-forcing mask helpers, and CP/SP partition utilities.
Add LongLiveWanForwardStep and LongLiveWanFlowMatchingPipeline for clean-history plus noisy-target temporal chunk training.
Add LongLive WAN 1.3B and 5B SP long-video recipes with explicit sliding-window attention settings to avoid dense [S, S] masks for long sequences.
Update WAN mock/dataset config paths to support LongLive latent shape overrides used by the long-video recipe.
Keep the Megatron-Core submodule clean by making explicit dense self-attention masks opt-in only when the decoder supports self_attention_mask.
Add focused unit tests for LongLive chunking, noising, masking, recipe wiring, and dense-mask/windowed-attention selection.
Add scripts/validation/wan_sp_tp_tiny_parity.py to verify tiny WAN TP/SP inference parity with exact tensor equality.
Document the LongLive WAN MVP commands and the 5B SP long-video smoke path in the WAN example README.
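As a rough illustration of why the recipes prefer windowed attention over dense [S, S] masks, the sketch below compares the number of mask entries with the number of query/key pairs a window actually covers. The helper names and the (left, right) window convention are assumptions for this example, not the settings or utilities added by the PR.

```python
def dense_mask_elements(seq_len):
    """Number of entries in a dense [S, S] attention mask."""
    return seq_len * seq_len


def window_key_range(query_idx, seq_len, left, right):
    """Inclusive range of key positions visible to one query under a (left, right) window."""
    return max(0, query_idx - left), min(seq_len - 1, query_idx + right)


def windowed_pairs(seq_len, left, right):
    """Total number of (query, key) pairs covered by the window."""
    total = 0
    for q in range(seq_len):
        lo, hi = window_key_range(q, seq_len, left, right)
        total += hi - lo + 1
    return total


def dense_window_mask(seq_len, left, right):
    """Materialize the window as a dense boolean mask (True = may attend). Small S only."""
    mask = []
    for q in range(seq_len):
        lo, hi = window_key_range(q, seq_len, left, right)
        mask.append([lo <= k <= hi for k in range(seq_len)])
    return mask


for row in dense_window_mask(seq_len=4, left=1, right=0):
    print(row)
print(dense_mask_elements(4), windowed_pairs(4, left=1, right=0))  # 16 7
```

The dense mask grows quadratically with S, while the windowed pair count grows roughly as S times the window width, which is the reason a long-video sequence benefits from never materializing the full mask.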

GitHub Actions CI

See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.

Before your PR is "Ready for review"

Pre checks:

Make sure you have read and followed the Contributor guidelines
Did you write any new necessary tests?
Did you add or update any necessary documentation?
Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
- Reviewer: Does the PR have correct import guards for all optional libraries?

Additional Information

Related to [feature] LongLiveWan Long-Video Training Recipe #4215.
This PR implements the initial offline WAN latents/text embeddings target from [feature] LongLiveWan Long-Video Training Recipe #4215; online raw-video VAE encoding with temporal halo exchange remains a future extension.
Does not add new required or optional dependencies.
pre-commit run --all-files passed.
python -m compileall -q scripts/validation/wan_sp_tp_tiny_parity.py src/megatron/bridge/diffusion/models/wan/longlive_wan_step.py src/megatron/bridge/diffusion/models/wan/longlive_wan_utils.py src/megatron/bridge/diffusion/models/wan/wan_model.py src/megatron/bridge/diffusion/recipes/wan/wan.py tests/unit_tests/diffusion/model/wan/test_longlive_wan_step.py tests/unit_tests/diffusion/recipes/wan/test_wan_recipe.py passed.
Slurm unit job 3244385: 35 passed, 26 warnings in 3.56s on 4x GB200.
Slurm tiny SP/TP parity job 3244425: strict_equal=True, max_abs=0.00000000e+00.
Slurm long-video smoke job 3244426: completed 1/1 iteration on 4x GB200 with 0 skipped and 0 NaN iterations.
Local uv was unavailable in the conda environment, so pre-commit was run directly after installing pre-commit into the existing mb-longlive-wan environment.
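For readers unfamiliar with the parity line above, the following sketch shows one way a strict-equality report of that shape could be produced. It is an illustration only: `parity_report` and `format_report` are hypothetical helpers operating on flattened Python lists, and they are not taken from scripts/validation/wan_sp_tp_tiny_parity.py.

```python
def parity_report(reference, candidate):
    """Compare two flattened outputs element by element."""
    if len(reference) != len(candidate):
        raise ValueError("outputs have different sizes")
    max_abs = max((abs(a - b) for a, b in zip(reference, candidate)), default=0.0)
    strict_equal = all(a == b for a, b in zip(reference, candidate))
    return strict_equal, max_abs


def format_report(strict_equal, max_abs):
    """Render the result in the same style as the job summary."""
    return f"strict_equal={strict_equal}, max_abs={max_abs:.8e}"


baseline = [0.25, -1.5, 3.0]
print(format_report(*parity_report(baseline, list(baseline))))
# strict_equal=True, max_abs=0.00000000e+00

# A tiny perturbation is enough to fail the strict check.
print(format_report(*parity_report(baseline, [0.25, -1.5, 3.0 + 1e-6])))
```

Strict equality is a stronger claim than a tolerance-based comparison: it only holds when the parallel layout reproduces the baseline result bit for bit.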

Signed-off-by: Shuai Yang <shyang@nvidia.com>

copy-pr-bot · 2026-06-10T12:29:35Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.


yaoyu-33 · 2026-06-17T17:17:13Z

Could you please check the default LongLive 1.3B recipe? It looks internally inconsistent: longlive_wan_1_3b_pretrain_config() inherits the text-to-video config with context_parallel_size=4 and the WAN provider default qkv_format="thd", but LongLiveWanFlowMatchingPipeline.validate_qkv_format() rejects anything except qkv_format="sbhd".

Please make the default recipe runnable, or mark it as non-runnable and update the README/tests accordingly.
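A small guard test along these lines could catch the mismatch before a recipe ships. This is a sketch under stated assumptions: `RecipeStub`, `validate_qkv_format`, and `is_runnable` are hypothetical stand-ins that mirror only the two fields and the rejection rule described above, not the real recipe or pipeline classes.

```python
from dataclasses import dataclass, replace

# Per the comment above, the LongLive pipeline accepts only this layout.
SUPPORTED_QKV_FORMATS = ("sbhd",)


@dataclass(frozen=True)
class RecipeStub:
    """Hypothetical stand-in holding the inherited defaults named in the review."""

    context_parallel_size: int = 4
    qkv_format: str = "thd"


def validate_qkv_format(qkv_format):
    """Raise if the attention layout is not one the LongLive path supports."""
    if qkv_format not in SUPPORTED_QKV_FORMATS:
        raise ValueError(
            f"LongLive path expects qkv_format in {SUPPORTED_QKV_FORMATS}, got {qkv_format!r}"
        )


def is_runnable(recipe):
    """Return True when the recipe passes the layout validation."""
    try:
        validate_qkv_format(recipe.qkv_format)
    except ValueError:
        return False
    return True


inherited = RecipeStub()
fixed = replace(inherited, qkv_format="sbhd")
print(is_runnable(inherited), is_runnable(fixed))  # False True
```

The sketch checks only the layout string; whether context_parallel_size=4 is also valid for the LongLive path is a separate question that it does not answer.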

feat(diffusion): add LongLive WAN training path

a4a6cca

Signed-off-by: Shuai Yang <shyang@nvidia.com>

github-actions Bot added the community-request label Jun 10, 2026

yaoyu-33 added the area:diffusion (DFM module), feature (New capabilities, enhancements, or enablement work), and needs-review (PR is ready for code review and waiting on a reviewer) labels Jun 10, 2026

fix(diffusion): harden LongLive WAN parallel masks

171a488

Signed-off-by: Shuai Yang <shyang@nvidia.com>

AndysonYs wants to merge 2 commits into NVIDIA-NeMo:main from AndysonYs:longlive-wan-mvp
