Open Source 🚀 Apache 2.0 Licensed - Free for Commercial Use!

IndexTTS2: Emotional & Duration-Controlled Zero-Shot TTS

IndexTTS2 is an autoregressive zero-shot text-to-speech system that offers fine-grained control over speech generation. It combines emotion control, precise duration control, and zero-shot voice cloning to produce natural, expressive speech in Chinese and English. Released under the Apache 2.0 license, it is open source and available for commercial use.

Let the Bullets Fly - Duration Control Demo

Demonstrates precise speech duration control while preserving emotional expression

Duration Control

Precise timing adjustment

Emotion Control

Natural emotional expression

Zero-Shot

No training required

Try IndexTTS2 Live Demo

Experience IndexTTS2's voice cloning and emotion control in real time. Generate natural, expressive speech with precise duration control in Chinese and English, and clone a voice from a few seconds of reference audio without any training.

Reviews

What People Are Saying About IndexTTS2

Hear what researchers, developers, and AI enthusiasts are saying about IndexTTS2's voice cloning and emotion control capabilities.

Index TTS2 – A VERY Emotive TTS With Voice Cloning!

These open source models are getting crazy good. Genuinely impressed.

Fantastic New AI Text to Speech Model Released! Index TTS 2 Initial Impressions

Higgs Audio still remains undefeated. Hopefully their v3 trained model will feature controllable emotions.

New top AI text to speech is here! Free & uncensored. IndexTTS2 tutorial

"Hi this Joe's mother, he's not feeling well today and will need to stay home from school until he gets well."

Performance Comparison with Leading TTS Models

See how IndexTTS2 compares with state-of-the-art text-to-speech models on emotion expression, duration control, voice cloning quality, and language support.

Metric	IndexTTS2	OpenAI TTS	ElevenLabs	Azure TTS	F5-TTS	CosyVoice
WER (Word Error Rate, %)	1.01	N/A	N/A	N/A	1.56	1.45
Speaker Similarity	0.87	N/A	N/A	N/A	0.82	0.85
MOS (Naturalness, out of 5.0)	4.54	4.2	4.3	4.3	4.19	4.12
Emotion Control	✓	✗	Limited	Limited	✗	✓
Duration Control	✓	✗	✗	✗	Limited	Limited
Zero-Shot Cloning	✓	✓	✓	✗	✓	✓
Supported Languages	2+	57	29	119	2	Multi
RTF (Real-Time Factor)	N/A	0.20	0.15	N/A	0.15	N/A

Comparative performance across key TTS quality metrics based on academic benchmarks

Data Sources: IndexTTS2 (arXiv 2506.21619), F5-TTS (arXiv 2410.06885), CosyVoice2 (arXiv 2412.10117)

Note: N/A indicates data not publicly available. Commercial models evaluated through third-party benchmarks.

✓ = Supported | ✗ = Not Supported | Limited = Partial Support
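Two of the metrics in the table have simple standard definitions. The sketch below is illustrative only and is not the evaluation code behind the numbers above: word error rate is the word-level edit distance between a reference transcript and a transcript of the synthesized audio, divided by the number of reference words, and real-time factor is synthesis time divided by the duration of the generated audio. The function names are hypothetical, and the lowercase, whitespace-based tokenization is a simplifying assumption (published benchmarks typically transcribe the audio with an ASR model and apply their own text normalization).

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Synthesis time divided by audio duration; below 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


if __name__ == "__main__":
    wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
    print(f"WER: {wer * 100:.2f}%")
    print(f"RTF: {real_time_factor(1.5, 10.0):.2f}")
```

Lower is better for both metrics. The table reports WER as a percentage, so multiply the returned fraction by 100 before comparing.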

Local Deployment

Quick Start Guide

Deploy IndexTTS2 locally in minutes with our comprehensive step-by-step guide. Start generating natural, emotional speech with zero-shot voice cloning capabilities.

Python API Example

from indextts import IndexTTS

# Initialize the model
tts = IndexTTS()

# Generate speech from text
audio = tts.synthesize(
    text="Hello world! Welcome to IndexTTS2.",
    voice_reference="path/to/reference.wav",  # Optional: clone a voice
    emotion="neutral",  # Control emotion: happy, sad, angry, neutral
    speed=1.0,  # Adjust speaking speed
    language="en"  # Supported: en, zh
)

# Save the output
audio.save("output.wav")

Documentation

Complete guides and API reference

GitHub Repository

Source code and examples

Community

Get help and share ideas

Key Features of IndexTTS2

Discover the powerful capabilities that make IndexTTS2 the ideal choice for expressive, controllable text-to-speech generation.

Zero-Shot Voice Cloning

Instantly clone any voice from just a few seconds of audio without training. Achieves high-fidelity voice reproduction with speaker consistency across diverse content and emotions.

Emotion Expression Control

Decouple timbre from emotion for independent control. Use text descriptions to guide emotional expression (happy, sad, excited, angry) while maintaining voice identity and naturalness.

Precise Duration Control

The first autoregressive TTS model to combine accurate duration control with natural generation. Achieve precise speech timing without sacrificing expressiveness or prosody quality.

Multi-Language Support

Native support for Chinese (Mandarin), English, and mixed Chinese-English synthesis. Maintains natural pronunciation and intonation across language boundaries.

Pinyin Pronunciation Control

Advanced pronunciation control through pinyin notation for Chinese text. Resolve ambiguous pronunciations and ensure accurate character reading in complex contexts.

High Naturalness & Clarity

Lower word error rate (WER) and better emotion preservation than existing models. Achieves a naturalness MOS of 4.54 out of 5.0 in the comparison above, with high clarity and intelligibility.
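As a rough illustration of the duration control feature, one way to think about it is as a token budget: an announcement quoted later on this page describes a mode in which the caller specifies a token count to get an exact speech length. The sketch below only shows that arithmetic. It is not part of the IndexTTS2 API; the function names are hypothetical, and the tokens-per-second rate is a placeholder assumption that would have to come from the model's actual configuration.

```python
# Hypothetical placeholder: the real rate depends on the model's speech tokenizer.
ASSUMED_TOKENS_PER_SECOND = 25.0


def token_budget(target_seconds: float,
                 tokens_per_second: float = ASSUMED_TOKENS_PER_SECOND) -> int:
    """Number of speech tokens to request for a clip lasting target_seconds."""
    if target_seconds <= 0:
        raise ValueError("target_seconds must be positive")
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    # Round to the nearest whole token and never request fewer than one.
    return max(1, round(target_seconds * tokens_per_second))


def scaled_budget(natural_tokens: int, speed: float) -> int:
    """Token budget for a line spoken at `speed` times its natural pace.

    `natural_tokens` is the token count produced by unconstrained generation;
    speed > 1.0 shortens the clip and speed < 1.0 lengthens it.
    """
    if natural_tokens <= 0:
        raise ValueError("natural_tokens must be positive")
    if speed <= 0:
        raise ValueError("speed must be positive")
    return max(1, round(natural_tokens / speed))


if __name__ == "__main__":
    # Fit a dubbed line into a 4-second slot at the assumed token rate.
    print(token_budget(4.0))
    # Speed up a line that naturally took 100 tokens by 25%.
    print(scaled_budget(100, 1.25))
```

For dubbing or lip-sync work, the target duration usually comes from the original clip, so the budget can be computed per line before synthesis.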

What People Are Saying About IndexTTS2 on X

Join the conversation about IndexTTS2 and share your experience with the research community.

How do you make TTS both natural and precisely timed for dubbing or sync? 🎙️⏱️

Meet IndexTTS2—an autoregressive model with novel duration control:

- Mode 1: specify token count → exact speech length
- Mode 2: free AR generation → natural prosody preserved

✨ Extra features:… pic.twitter.com/nvmq05xU5Z
— 机器之心 JIQIZHIXIN (@jiqizhixin) September 19, 2025

🗣️✨Bilibili just dropped IndexTTS2, and it might be the most expressive and controllable zero-shot TTS model yet!

It's a breakthrough for autoregressive models, bringing precise timing and rich emotion to synthesized speech.

Basically, it can produce a voice that sounds like… pic.twitter.com/w625jhRwkq
— 机器之心 JIQIZHIXIN (@jiqizhixin) August 1, 2025

IndexTTS2: New AI text to speech with full emotion control

Free & open-source!

Here's the full tutorial: https://t.co/dzMMT5JvcR pic.twitter.com/LCELPyMwtj
— ⚡AI Search⚡ (@aisearchio) September 18, 2025

IndexTTS2, one of the most realistic and expressive text-to-speech model so far.

Fully local with open weights.

Zero-shot voice cloning. You just provide one audio file (in any language) and it will extremely accurately clone the voice style and rhythm. It sounds much more… pic.twitter.com/6ixAtbCrnn
— Rohan Paul (@rohanpaul_ai) July 14, 2025

⭐ Today’s China AI Native Industry Insights include:
1. MoonshotAI releases Kimi K2: Open-Source Agentic Intelligence at Scale

2. Exciting Upgrade: Alibaba's Qwen Chat Launches Enhanced Features!

3. Bilibili launches IndexTTS2: Revolutionizing Voice Synthesis with Emotion… pic.twitter.com/ov85VOyVjy
— AI Native Foundation (@AINativeF) July 15, 2025

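Let's see which sounds better: Index TTS2 or VibeVoice-7B?

Note that both models become jittery when generating long audio, so one workaround is to generate an extra segment, regenerate as needed, and trim out the flawed parts.

Both workflows are open source: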
VibeVoice: https://t.co/RV3qYA9UkP
IndexTTS2: https://t.co/ptagLxcZLQ… pic.twitter.com/o0CSUCdPhw
— karminski-牙医 (@karminski3) September 19, 2025

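Bilibili scores big! IndexTTS2 truly lives up to its name!
It not only clones timbre but also reproduces emotion and intonation, and in this respect it is far stronger than 11Labs. pic.twitter.com/aT03Yk0dac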
— Gorden Sun (@Gorden_Sun) September 11, 2025