Video-to-Video model

LatentSync

Diffusion‑based sync with strong temporal consistency.

Audio‑conditioned latent diffusion model for lip sync, designed for high‑fidelity results and strong temporal consistency.

Best for: Longer clips

Inputs: Video + Audio

Outputs: Video

What this model is best at

Short answer: lip sync on longer clips, where mouth motion has to stay stable from frame to frame.

Use this workspace to preview the model, compare example outputs, and start creating with the recommended workflow.

Highlight 1

End‑to‑end audio‑conditioned latent diffusion.

Highlight 2

Temporal consistency enhancements with TREPA.

Highlight 3

Language‑agnostic lip sync.


LatentSync workspace

Start from the built-in workflow below, then adjust the settings in the standard LipsyncX creation surface.

Talking Photo | Video Dubbing | Long Video | Pet & Anime

1. Choose a face

Choose a template, or drag and drop a video or photo (or click to upload).

2. Choose Model

3. Add Script

Instant script templates

One-click copy for greetings, celebrations, and announcements.


Billing unit: 10 credits / 5s


The cost estimate uses the real audio duration when available.
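As a rough illustration of the billing unit, the sketch below estimates credits from a clip's duration. The 10 credits and 5 s figures come from this page; the function name `estimate_credits` and the choice to round partial units up are assumptions for illustration, not confirmed LipsyncX behavior.

```python
import math

CREDITS_PER_UNIT = 10  # from this page: 10 credits per billing unit
SECONDS_PER_UNIT = 5   # from this page: one billing unit covers 5 s


def estimate_credits(duration_s: float) -> int:
    """Estimate the credit cost for a clip of the given duration.

    ASSUMPTION: a partially used billing unit is charged as a full unit.
    The page does not state how partial units are rounded.
    """
    if duration_s <= 0:
        return 0
    units = math.ceil(duration_s / SECONDS_PER_UNIT)
    return units * CREDITS_PER_UNIT
```

Under that rounding assumption, a 12-second clip would count as three billing units.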



Avg render time

7 min

Languages supported

50+

Creators onboarded

3,200+

Trusted by teams

StudioBlend, AudioNova, CourseWave, Mintly, VisionSpark

Long‑form segment

Stable mouth motion across a longer scene.


Popular use cases

Use case 1

Podcast videos

Maintain sync over time.

Use case 2

Training lessons

Consistency across segments.

Use case 3

Series content

Keep identity stable.

Quick specs

Primary use

High‑fidelity video‑to‑video lip sync

Inputs

Source video + target audio

Output

Synced video

Best strength

Temporal consistency on longer clips

Best practices

Use steady, well‑lit footage for the cleanest temporal consistency.

Keep the face centered to minimize occlusion artifacts.

Match the cadence of the new audio to the pacing of the original video.

FAQ

How does it keep frames consistent?

It uses temporal representation alignment (TREPA) to stabilize results across frames.

Is it language‑specific?

No. LatentSync is designed to be language‑agnostic.

What resolution is it optimized for?

The model targets 512×512 output resolution.

Ready to try LatentSync?

Use the built-in workspace to test your own clips, compare outputs, and see how this model fits your content workflow.