Audio-to-Video model

LongCat Multi‑Avatar

Stable talking heads for longer content and multi‑speaker scenes.

Audio‑driven multi‑character avatar model built for realistic group conversations with synchronized lip sync and natural turn‑taking.

Best for: Podcasts

Inputs: Image + Audio

Outputs: Video

What this model is best at

Short answer: Audio‑driven multi‑character avatar model built for realistic group conversations with synchronized lip sync and natural turn‑taking.

Use this workspace to preview the model, compare example output, and start creating with the recommended workflow for this model.

Highlight 1

Multi‑character conversations with synchronized lip sync.

Highlight 2

Multi‑stream audio support for multi‑speaker dialogue.

Highlight 3

Natural group dynamics and turn‑taking.

LongCat Multi‑Avatar workspace

Start from the built-in workflow below, then tune the model inside the standard LipsyncX creation surface.

1. Choose a face

Choose a template, or upload a video or photo (drag and drop, or click to upload).

2. Model

3. Add your audio

Supports MP3, WAV, M4A. Max 30 MB / 10 min. For best lip sync quality, upload audio under 1 min.
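The audio limits above can be checked locally before uploading. The following is a minimal sketch, not part of any LipsyncX API: the function name `check_audio` is hypothetical, the duration is passed in by the caller rather than read from the file, and treating "30 MB" as 30 × 1024 × 1024 bytes is an assumption.

```python
from pathlib import Path

# Limits quoted on this page. The exact byte value of "30 MB" is an assumption.
ALLOWED_EXTENSIONS = {".mp3", ".wav", ".m4a"}
MAX_SIZE_BYTES = 30 * 1024 * 1024   # assumed: 30 MB read as 30 MiB
MAX_DURATION_S = 10 * 60            # hard limit: 10 min
RECOMMENDED_DURATION_S = 60         # recommendation: under 1 min


def check_audio(filename: str, size_bytes: int, duration_s: float) -> dict:
    """Hypothetical pre-upload check against the stated audio limits."""
    errors, warnings = [], []
    # Compare the extension case-insensitively.
    if Path(filename).suffix.lower() not in ALLOWED_EXTENSIONS:
        errors.append("unsupported format")
    if size_bytes > MAX_SIZE_BYTES:
        errors.append("file larger than 30 MB")
    if duration_s > MAX_DURATION_S:
        errors.append("audio longer than 10 min")
    elif duration_s >= RECOMMENDED_DURATION_S:
        # Allowed, but outside the range recommended for best lip sync.
        warnings.append("audio is 1 min or longer; lip sync quality may drop")
    return {"ok": not errors, "errors": errors, "warnings": warnings}
```

A file that passes with warnings is still within the stated limits; the warning only reflects the under-1-minute recommendation.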

Avg render time

7 min

Languages supported

50+

Creators onboarded

3,200+

Trusted by teams

StudioBlend, AudioNova, CourseWave, Mintly, VisionSpark

Two‑speaker panel

Drive multiple avatars from one audio track.

Example: input portraits and the generated video.

Popular use cases

Use case 1

Podcast panels

Multi‑guest episodes.

Use case 2

Roundtables

Two‑speaker summaries.

Use case 3

Debates

Split‑speaker scripts.

Quick specs

Primary use

Multi‑speaker avatar video

Inputs

Multiple portraits + multiple audio tracks

Output

Multi‑avatar talking‑head video

Best strength

Group conversations with turn‑taking

Best practices

Provide clean, separated audio per speaker.

Use distinct portraits to avoid identity confusion.

Keep background motion minimal for clarity.
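The first two practices above can be sanity-checked before a job is submitted. This is a rough sketch under stated assumptions: the `Speaker` record and `check_speakers` helper are hypothetical and not part of the product, and comparing file paths is only a proxy, since two different files can still show similar faces or contain mixed audio.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Speaker:
    """Hypothetical record pairing one portrait with one audio track."""
    name: str
    portrait_path: str  # one portrait image per speaker
    audio_path: str     # one separated audio track per speaker


def check_speakers(speakers: List[Speaker]) -> List[str]:
    """Return a list of problems; an empty list means the inputs look consistent."""
    problems = []
    portraits = [s.portrait_path for s in speakers]
    if len(set(portraits)) != len(portraits):
        problems.append("portraits must be distinct per speaker")
    tracks = [s.audio_path for s in speakers]
    if len(set(tracks)) != len(tracks):
        problems.append("each speaker needs a separate audio track")
    return problems
```

For example, two speakers that share the same portrait file and the same audio file would produce both messages, while a host and guest with their own files would produce none.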

FAQ

Can it handle multiple speakers?

Yes. It is designed for multi‑character lip‑sync conversations.

What inputs are required?

Provide portraits plus a separate audio stream for each speaker.

Is it suitable for longer scenes?

Yes. It targets long‑form stability and keeps each avatar's identity consistent.

Ready to try LongCat Multi‑Avatar?

Use the built-in workspace to test prompts, compare outputs, and see how this model fits your content workflow.