LTX2 distilled checkpoint support #12934
Conversation
Running results: output.mp4
Converted ckpt: ltx2_sample.mp4
sayakpaul
left a comment
Thanks for shipping this so quickly! Left some comments, LMK if they make sense.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Thanks for the PR! Left some comments about the distilled sigmas schedule.
If I print out the timesteps for the Stage 1 distilled pipeline, I get (for commit faeccc5):
Distilled timesteps: tensor([1000.0000, 999.6502, 999.2961, 998.9380, 998.5754, 994.4882,
979.3755, 929.6974, 100.0000], device='cuda:0')
Here the sigmas (and thus the timesteps) are shifted toward a terminal value of 0.1, and use_dynamic_shifting is applied as well. However, I believe the distilled sigmas are used as-is in the original LTX 2.0 code:
So I think when creating the distilled scheduler we need to disable use_dynamic_shifting and shift_terminal so that the distilled sigmas are used without changes.
Can you check whether the final distilled sigmas match up with those of the original implementation?
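For reference, here is a minimal sketch of that scheduler change, mirroring what the scripts later in this thread do (it assumes pipe is an already-loaded LTX2Pipeline):
# Recreate the scheduler from the pipeline's config with dynamic shifting and the
# terminal shift disabled, so the distilled sigmas are passed through unchanged.
from diffusers import FlowMatchEulerDiscreteScheduler

new_scheduler = FlowMatchEulerDiscreteScheduler.from_config(
    pipe.scheduler.config, use_dynamic_shifting=False, shift_terminal=None
)
pipe.scheduler = new_scheduler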
The original test script didn't work for me, but I was able to get a working version as follows:
import torch
from diffusers.pipelines.ltx2 import LTX2Pipeline, LTX2LatentUpsamplePipeline
from diffusers.pipelines.ltx2.latent_upsampler import LTX2LatentUpsamplerModel
from diffusers.pipelines.ltx2.utils import DISTILLED_SIGMA_VALUES, STAGE_2_DISTILLED_SIGMA_VALUES
from diffusers.pipelines.ltx2.export_utils import encode_video
device = "cuda:0"
width = 768
height = 512
pipe = LTX2Pipeline.from_pretrained(
"rootonchair/LTX-2-19b-distilled", torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload(device=device)
prompt = "A beautiful sunset over the ocean"
negative_prompt = "shaky, glitchy, low quality, worst quality, deformed, distorted, disfigured, motion smear, motion artifacts, fused fingers, bad anatomy, weird hand, ugly, transition, static."
frame_rate = 24.0
video_latent, audio_latent = pipe(
prompt=prompt,
negative_prompt=negative_prompt,
width=width,
height=height,
num_frames=121,
frame_rate=frame_rate,
num_inference_steps=8,
sigmas=DISTILLED_SIGMA_VALUES,
guidance_scale=1.0,
output_type="latent",
return_dict=False,
)
latent_upsampler = LTX2LatentUpsamplerModel.from_pretrained(
"rootonchair/LTX-2-19b-distilled",
subfolder="upsample_pipeline/latent_upsampler",
torch_dtype=torch.bfloat16,
)
upsample_pipe = LTX2LatentUpsamplePipeline(vae=pipe.vae, latent_upsampler=latent_upsampler)
upsample_pipe.enable_model_cpu_offload(device=device)
upscaled_video_latent = upsample_pipe(
latents=video_latent,
output_type="latent",
return_dict=False,
)[0]
video, audio = pipe(
latents=upscaled_video_latent,
audio_latents=audio_latent,
prompt=prompt,
negative_prompt=negative_prompt,
width=width * 2,
height=height * 2,
num_inference_steps=3,
sigmas=STAGE_2_DISTILLED_SIGMA_VALUES,
guidance_scale=1.0,
output_type="np",
return_dict=False,
)
video = (video * 255).round().astype("uint8")
video = torch.from_numpy(video)
encode_video(
video[0],
fps=frame_rate,
audio=audio[0].float().cpu(),
audio_sample_rate=pipe.vocoder.config.output_sampling_rate,
output_path="ltx2_distilled_sample.mp4",
)
The necessary changes were to create
Not jeopardizing this PR at all, but while we're at the two-stage pipeline stuff, it could also be cool to verify it with the distilled LoRA that we have in place (PR already merged: #12933). So, what we would do is:
Once we're close to merging the PR, we could document all of these to inform the community.
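As a rough sketch, loading and activating that distilled LoRA could look like the following (the adapter name and weight filename here are taken from the scripts further down this thread, so treat them as assumptions until verified):
# Load the Stage 2 distilled LoRA from the Lightricks/LTX-2 repo and make it the active adapter.
pipe.load_lora_weights(
    "Lightricks/LTX-2",
    adapter_name="stage_2_distilled",
    weight_name="ltx-2-19b-distilled-lora-384.safetensors",
)
pipe.set_adapters("stage_2_distilled", 1.0)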
@dg845 thank you for your detailed reviews. Let me take a closer look at that.
That sounds interesting. Let's have a quick test of the two-stage distilled LoRA too.
For two-stage inference with the Stage 2 distilled LoRA, I think this script should work:
import torch
from diffusers import FlowMatchEulerDiscreteScheduler
from diffusers.pipelines.ltx2 import LTX2Pipeline, LTX2LatentUpsamplePipeline
from diffusers.pipelines.ltx2.latent_upsampler import LTX2LatentUpsamplerModel
from diffusers.pipelines.ltx2.utils import DISTILLED_SIGMA_VALUES, STAGE_2_DISTILLED_SIGMA_VALUES
from diffusers.pipelines.ltx2.export_utils import encode_video
device = "cuda:0"
width = 768
height = 512
pipe = LTX2Pipeline.from_pretrained(
"rootonchair/LTX-2-19b-distilled", torch_dtype=torch.bfloat16
)
# This scheduler should use distilled sigmas without any changes
new_scheduler = FlowMatchEulerDiscreteScheduler.from_config(
pipe.scheduler.config, use_dynamic_shifting=False, shift_terminal=None
)
pipe.scheduler = new_scheduler
pipe.enable_model_cpu_offload(device=device)
prompt = "A beautiful sunset over the ocean"
negative_prompt = "shaky, glitchy, low quality, worst quality, deformed, distorted, disfigured, motion smear, motion artifacts, fused fingers, bad anatomy, weird hand, ugly, transition, static."
frame_rate = 24.0
video_latent, audio_latent = pipe(
prompt=prompt,
negative_prompt=negative_prompt,
width=width,
height=height,
num_frames=121,
frame_rate=frame_rate,
num_inference_steps=8,
sigmas=DISTILLED_SIGMA_VALUES,
guidance_scale=1.0,
output_type="latent",
return_dict=False,
)
latent_upsampler = LTX2LatentUpsamplerModel.from_pretrained(
"rootonchair/LTX-2-19b-distilled",
subfolder="upsample_pipeline/latent_upsampler",
torch_dtype=torch.bfloat16,
)
upsample_pipe = LTX2LatentUpsamplePipeline(vae=pipe.vae, latent_upsampler=latent_upsampler)
upsample_pipe.enable_model_cpu_offload(device=device)
upscaled_video_latent = upsample_pipe(
latents=video_latent,
output_type="latent",
return_dict=False,
)[0]
# Load Stage 2 distilled LoRA
pipe.load_lora_weights(
"Lightricks/LTX-2", adapter_name="stage_2_distilled", weight_name="ltx-2-19b-distilled-lora-384.safetensors"
)
pipe.set_adapters("stage_2_distilled", 1.0)
# VAE tiling seems necessary to avoid OOM error when VAE decoding
pipe.vae.enable_tiling()
video, audio = pipe(
latents=upscaled_video_latent,
audio_latents=audio_latent,
prompt=prompt,
negative_prompt=negative_prompt,
width=width * 2,
height=height * 2,
num_inference_steps=3,
sigmas=STAGE_2_DISTILLED_SIGMA_VALUES,
guidance_scale=1.0,
output_type="np",
return_dict=False,
)
video = (video * 255).round().astype("uint8")
video = torch.from_numpy(video)
encode_video(
video[0],
fps=frame_rate,
audio=audio[0].float().cpu(),
audio_sample_rate=pipe.vocoder.config.output_sampling_rate,
output_path="ltx2_distilled_sample.mp4",
)
Sample with LoRA and scheduler fix: ltx2_distilled_sample_lora_fix.mp4
I think we should run with the original LTX2 weight and not the distilled checkpoint. WDYT? |
Yes, the first stage, in this case, should use the non-distilled ckpt. |
Fixed script (I think):
import torch
from diffusers import FlowMatchEulerDiscreteScheduler
from diffusers.pipelines.ltx2 import LTX2Pipeline, LTX2LatentUpsamplePipeline
from diffusers.pipelines.ltx2.latent_upsampler import LTX2LatentUpsamplerModel
from diffusers.pipelines.ltx2.utils import DISTILLED_SIGMA_VALUES, STAGE_2_DISTILLED_SIGMA_VALUES
from diffusers.pipelines.ltx2.export_utils import encode_video
device = "cuda:0"
width = 768
height = 512
pipe = LTX2Pipeline.from_pretrained(
"Lightricks/LTX-2", torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload(device=device)
prompt = "A beautiful sunset over the ocean"
negative_prompt = "shaky, glitchy, low quality, worst quality, deformed, distorted, disfigured, motion smear, motion artifacts, fused fingers, bad anatomy, weird hand, ugly, transition, static."
# Stage 1 default (non-distilled) inference
frame_rate = 24.0
video_latent, audio_latent = pipe(
prompt=prompt,
negative_prompt=negative_prompt,
width=width,
height=height,
num_frames=121,
frame_rate=frame_rate,
num_inference_steps=40,
sigmas=None,
guidance_scale=4.0,
output_type="latent",
return_dict=False,
)
latent_upsampler = LTX2LatentUpsamplerModel.from_pretrained(
"Lightricks/LTX-2",
subfolder="latent_upsampler",
torch_dtype=torch.bfloat16,
)
upsample_pipe = LTX2LatentUpsamplePipeline(vae=pipe.vae, latent_upsampler=latent_upsampler)
upsample_pipe.enable_model_cpu_offload(device=device)
upscaled_video_latent = upsample_pipe(
latents=video_latent,
output_type="latent",
return_dict=False,
)[0]
# Load Stage 2 distilled LoRA
pipe.load_lora_weights(
"Lightricks/LTX-2", adapter_name="stage_2_distilled", weight_name="ltx-2-19b-distilled-lora-384.safetensors"
)
pipe.set_adapters("stage_2_distilled", 1.0)
# VAE tiling seems necessary to avoid OOM error when VAE decoding
pipe.vae.enable_tiling()
# Change scheduler to use Stage 2 distilled sigmas as is
new_scheduler = FlowMatchEulerDiscreteScheduler.from_config(
pipe.scheduler.config, use_dynamic_shifting=False, shift_terminal=None
)
pipe.scheduler = new_scheduler
# Stage 2 inference with distilled LoRA and sigmas
video, audio = pipe(
latents=upscaled_video_latent,
audio_latents=audio_latent,
prompt=prompt,
negative_prompt=negative_prompt,
width=width * 2,
height=height * 2,
num_inference_steps=3,
sigmas=STAGE_2_DISTILLED_SIGMA_VALUES,
guidance_scale=1.0,
output_type="np",
return_dict=False,
)
video = (video * 255).round().astype("uint8")
video = torch.from_numpy(video)
encode_video(
video[0],
fps=frame_rate,
audio=audio[0].float().cpu(),
audio_sample_rate=pipe.vocoder.config.output_sampling_rate,
output_path="ltx2_distilled_sample.mp4",
)
If I test the distilled pipeline with the prompt
ltx2_distilled_sample_dog_edm.mp4
I would expect the audio to be music for this prompt, but instead the audio is only noise, so I think there might be something wrong with the way audio is currently being handled in the distilled pipeline. (The video also doesn't follow the prompt closely; I'm not sure if this is a symptom of the audio being messed up or if there are also bugs in the video processing.)
@dg845 should the second stage inference with LoRA be run with 4?
width=width * 2,
height=height * 2,
I believe it should be run with 3, as
It is currently necessary as otherwise we'd get a shape error, but we could modify the code to infer the
diffusers/src/diffusers/pipelines/ltx2/pipeline_ltx2.py Lines 922 to 925 in 3c70440
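A hypothetical sketch of that inference (attribute names such as vae_spatial_compression_ratio are illustrative assumptions, not quoted from pipeline_ltx2.py):
# Hypothetical: derive height/width from the provided latents so the caller doesn't have to
# pass width * 2 / height * 2 explicitly. Names below are illustrative only.
if latents is not None:
    latent_height, latent_width = latents.shape[-2], latents.shape[-1]
    height = height or latent_height * self.vae_spatial_compression_ratio
    width = width or latent_width * self.vae_spatial_compression_ratio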
@tin2tin you can use it with any LTX2 weights; it's not limited to the distilled weights.
I have updated the document and the test. I'm not sure if the test is enough or whether we need to add more for all the new params. If all good, let's check the two-stage LoRA generation result (still downloading the original repo) before merging.
sayakpaul
left a comment
Thanks a lot for working on this! Only a few nits to go before merging.
pipe = LTX2Pipeline.from_pretrained(
    "Lightricks/LTX-2", torch_dtype=torch.bfloat16
)
pipe.enable_sequential_cpu_offload(device=device)
Suggested change:
- pipe.enable_sequential_cpu_offload(device=device)
+ pipe.enable_model_cpu_offload(device=device)
Sorry to interject, but I've tested this quite a lot (RTX 4090), and using enable_sequential_cpu_offload is actually a very good default considering how big this model is, since it makes it possible to run locally.
pipe = LTX2Pipeline.from_pretrained(
    model_path, torch_dtype=torch.bfloat16
)
pipe.enable_sequential_cpu_offload(device=device)
Suggested change:
- pipe.enable_sequential_cpu_offload(device=device)
+ pipe.enable_model_cpu_offload(device=device)
height = 512
random_seed = 42
generator = torch.Generator(device).manual_seed(random_seed)
model_path = "rootonchair/LTX-2-19b-distilled"
@dg845 let's get these transferred to the Lightricks org. Would you be able to check internally?
Removed one-stage generation example code and added comments for noise scale in two-stage generation.
Merging as the CI failures are unrelated to the PR.
@rootonchair thank you so much for your contributions! You are now officially an MVP. Please use this link to create your certificate. Please also let us know your HF profile ID so that we can provide some credits to it. Looking forward to collaborating more with you!
Thank you @sayakpaul. I really appreciate having the chance to become a Diffusers MVP. I'm eager to contribute more, so I hope we will have more collaborations soon 😄
@rootonchair you should have access to HF Pro membership for 6 months and some credits within your account :)
@sayakpaul awesome ❤️ |
What does this PR do?
Fixes #12925
Test script t2i
Who can review?
@sayakpaul
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.