Open Source 🚀 Apache 2.0 Licensed - Free for Commercial Use!

IndexTTS2: Emotional & Duration-Controlled Zero-Shot TTS

IndexTTS2 is an autoregressive zero-shot text-to-speech system that gives fine-grained control over speech generation. It combines emotion control, precise duration control, and zero-shot voice cloning to produce natural, expressive speech in Chinese and English. Released under the Apache 2.0 license, it is open source and available for commercial use.

Let the Bullets Fly - Duration Control Demo

Demonstrating precise speech duration control with emotional expression preservation

Duration Control

Precise timing adjustment

Emotion Control

Natural emotional expression

Zero-Shot

No training required

Try IndexTTS2 Live Demo

Experience IndexTTS2's voice cloning and emotion control capabilities in real time. Generate natural, expressive speech with precise duration control and multi-language support. Clone any voice instantly without training.

Reviews

What People Are Saying About IndexTTS2

Hear what researchers, developers, and AI enthusiasts are saying about IndexTTS2's breakthrough voice cloning and emotion control capabilities

Index TTS2 – A VERY Emotive TTS With Voice Cloning!

These open source models are getting crazy good. Genuinely impressed.

Fantastic New AI Text to Speech Model Released! Index TTS 2 Initial Impressions

Higgs Audio still remains undefeated. Hopefully their v3 trained model will feature controllable emotions.

New top AI text to speech is here! Free & uncensored. IndexTTS2 tutorial

"Hi this Joe's mother, he's not feeling well today and will need to stay home from school until he gets well."

Performance Comparison with Leading TTS Models

See how IndexTTS2 compares with state-of-the-art text-to-speech models in emotion expression, duration accuracy, voice cloning quality, and multi-language support.

Metric                        | IndexTTS2 | OpenAI TTS | ElevenLabs | Azure TTS | F5-TTS | CosyVoice
WER (Word Error Rate), %      | 1.01      | N/A        | N/A        | N/A       | 1.56   | 1.45
Speaker Similarity            | 0.87      | N/A        | N/A        | N/A       | 0.82   | 0.85
MOS (Naturalness), out of 5.0 | 4.54      | 4.2        | 4.3        | 4.3       | 4.19   | 4.12
Supported Languages           | 2+        | 57         | 29         | 119       | 2      | Multi
RTF (Real-Time Factor)        | N/A       | 0.20       | 0.15       | N/A       | 0.15   | N/A

Emotion Control: Limited, Limited
Duration Control: Limited, Limited
Zero-Shot Cloning

Comparative performance across key TTS quality metrics based on academic benchmarks

Data Sources: IndexTTS2 (arXiv 2506.21619), F5-TTS (arXiv 2410.06885), CosyVoice2 (arXiv 2412.10117)

Note: N/A indicates data not publicly available. Commercial models evaluated through third-party benchmarks.

✓ = Supported | ✗ = Not Supported | Limited = Partial Support
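
For readers unfamiliar with the metrics in the table, here is a minimal sketch of how three of them are commonly computed. It is not the evaluation code behind the numbers above: the function names are illustrative, the text normalization is deliberately simple, and speaker similarity is assumed here to be the cosine similarity between two speaker embeddings produced by some separate speaker-verification model.

```python
from math import sqrt


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # consumed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


def cosine_similarity(a, b) -> float:
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    if norm == 0:
        raise ValueError("embeddings must be non-zero")
    return dot / norm


def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent synthesizing / duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return synthesis_seconds / audio_seconds


# One substitution ("sat" -> "sit") and one deletion ("the"):
# 2 errors over 6 reference words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
# 2 seconds of compute for 10 seconds of audio.
print(real_time_factor(2.0, 10.0))
```

Lower is better for WER and RTF (an RTF below 1.0 means synthesis runs faster than playback), while higher is better for speaker similarity and MOS.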

Local Deployment

Quick Start Guide

Deploy IndexTTS2 locally in minutes with our comprehensive step-by-step guide. Start generating natural, emotional speech with zero-shot voice cloning capabilities.

Python API Example

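# Names below follow the project's README; check them against your installed version
from indextts.infer_v2 import IndexTTS2

# Initialize the model (assumes the checkpoints were downloaded to ./checkpoints)
tts = IndexTTS2(cfg_path="checkpoints/config.yaml", model_dir="checkpoints")

# Generate speech from text and write it to a WAV file
tts.infer(
    spk_audio_prompt="path/to/reference.wav",  # Reference clip of the voice to clone
    text="Hello world! Welcome to IndexTTS2.",  # English, Chinese, or mixed text
    output_path="output.wav",  # Where the synthesized audio is saved
    use_emo_text=True,  # Guide emotion with a text description
    emo_text="calm and neutral",
)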

Documentation

Complete guides and API reference

GitHub Repository

Source code and examples

Community

Get help and share ideas

Key Features of IndexTTS2

Discover the powerful capabilities that make IndexTTS2 the ideal choice for expressive, controllable text-to-speech generation.

Zero-Shot Voice Cloning

Instantly clone any voice from just a few seconds of audio without training. Achieves high-fidelity voice reproduction with speaker consistency across diverse content and emotions.

Emotion Expression Control

Decouple timbre from emotion for independent control. Use text descriptions to guide emotional expression (happy, sad, excited, angry) while maintaining voice identity and naturalness.
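
Besides text descriptions, the project's README also describes steering emotion with a numeric vector. The sketch below builds such a vector from named intensities. It rests on several assumptions: the eight-slot order shown is the one listed in the README at the time of writing, the 0.0 to 1.0 range check is a choice made for this sketch, and `build_emotion_vector` is a hypothetical helper rather than part of the library.

```python
# Assumed slot order, taken from the project README at the time of writing.
# Verify it against the version you have installed.
EMOTION_SLOTS = (
    "happy", "angry", "sad", "afraid",
    "disgusted", "melancholic", "surprised", "calm",
)


def build_emotion_vector(intensities: dict) -> list:
    """Turn {"emotion": intensity} into a fixed-order list of floats.

    Hypothetical helper: unnamed emotions default to 0.0, and each
    intensity is required to lie in [0.0, 1.0] (a choice made for this sketch).
    """
    unknown = set(intensities) - set(EMOTION_SLOTS)
    if unknown:
        raise ValueError(f"unknown emotions: {sorted(unknown)}")
    vector = []
    for name in EMOTION_SLOTS:
        value = float(intensities.get(name, 0.0))
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"intensity for {name!r} must be in [0.0, 1.0]")
        vector.append(value)
    return vector


vector = build_emotion_vector({"surprised": 0.45})
print(vector)
```

Because timbre comes from the speaker reference clip and emotion comes from this vector, the same voice can be rendered with different emotions by changing only the vector. In the README's examples the vector is passed to inference as an `emo_vector` argument; treat that parameter name as something to confirm for your version.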

Precise Duration Control

First autoregressive TTS combining accurate duration control with natural generation. Achieve precise speech timing without sacrificing expressiveness or prosody quality.

Multi-Language Support

Native support for Chinese (Mandarin), English, and mixed Chinese-English synthesis. Maintains natural pronunciation and intonation across language boundaries.

Pinyin Pronunciation Control

Advanced pronunciation control through pinyin notation for Chinese text. Resolve ambiguous pronunciations and ensure accurate character reading in complex contexts.

High Naturalness & Clarity

Lower word error rate (WER) and better emotion preservation than existing models. Achieves human-like naturalness ratings with high clarity and intelligibility.

What People Are Saying About IndexTTS2 on X

Join the conversation about IndexTTS2 and share your experience with the research community

FAQ