IndexTTS2: Emotional & Duration-Controlled Zero-Shot TTS
IndexTTS2 is an autoregressive zero-shot text-to-speech system that offers fine-grained control over speech generation. It combines emotion expression, precise duration control, and instant voice cloning to deliver natural, expressive speech in multiple languages. It is released under the Apache 2.0 license, is fully open source, and is ready for commercial use.
Let the Bullets Fly - Duration Control Demo
Demonstrating precise speech duration control with emotional expression preservation
Duration Control
Precise timing adjustment
Emotion Control
Natural emotional expression
Zero-Shot
No training required
Try IndexTTS2 Live Demo
Experience IndexTTS2's voice cloning and emotion control capabilities in real time. Generate natural, expressive speech with precise duration control and multi-language support. Clone any voice instantly without training.
What People Are Saying About IndexTTS2
Hear what researchers, developers, and AI enthusiasts are saying about IndexTTS2's breakthrough voice cloning and emotion control capabilities

Index TTS2 – A VERY Emotive TTS With Voice Cloning!
These open source models are getting crazy good. Genuinely impressed.

Fantastic New AI Text to Speech Model Released! Index TTS 2 Initial Impressions
Higgs Audio still remains undefeated. Hopefully their v3 trained model will feature controllable emotions.

New top AI text to speech is here! Free & uncensored. IndexTTS2 tutorial
"Hi this Joe's mother, he's not feeling well today and will need to stay home from school until he gets well."
Performance Comparison with Leading TTS Models
See how IndexTTS2 compares with state-of-the-art text-to-speech models in emotion expression, duration accuracy, voice cloning quality, and multi-language support.
| Metric | IndexTTS2 | OpenAI TTS | ElevenLabs | Azure TTS | F5-TTS | CosyVoice |
|---|---|---|---|---|---|---|
| WER (Word Error Rate, %) | 1.01 | N/A | N/A | N/A | 1.56 | 1.45 |
| Speaker Similarity | 0.87 | N/A | N/A | N/A | 0.82 | 0.85 |
| MOS (Naturalness, out of 5.0) | 4.54 | 4.2 | 4.3 | 4.3 | 4.19 | 4.12 |
| Emotion Control | ✓ | ✗ | Limited | Limited | ✗ | ✓ |
| Duration Control | ✓ | ✗ | ✗ | ✗ | Limited | Limited |
| Zero-Shot Cloning | ✓ | ✓ | ✓ | ✗ | ✓ | ✓ |
| Supported Languages | 2+ | 57 | 29 | 119 | 2 | Multi |
| RTF (Real-Time Factor) | N/A | 0.20 | 0.15 | N/A | 0.15 | N/A |
Comparative performance across key TTS quality metrics based on academic benchmarks
Data Sources: IndexTTS2 (arXiv 2506.21619), F5-TTS (arXiv 2410.06885), CosyVoice2 (arXiv 2412.10117)
Note: N/A indicates data not publicly available. Commercial models evaluated through third-party benchmarks.
✓ = Supported | ✗ = Not Supported | Limited = Partial Support
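For readers unfamiliar with the WER and RTF rows, the sketch below shows how these two metrics are commonly computed. It is only an illustration: the lowercase, whitespace-based tokenization and the function names `word_error_rate` and `real_time_factor` are assumptions made here, and the cited papers may use a different ASR system and text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # At the start of row i, prev[j] is the distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # a reference word was dropped
                curr[j - 1] + 1,     # an extra word was inserted
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Time spent synthesizing divided by the duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


if __name__ == "__main__":
    # One substitution ("the" -> "a") out of four reference words.
    wer = word_error_rate("welcome to the demo", "welcome to a demo")
    print(f"WER: {wer:.2%}")
    print(f"RTF: {real_time_factor(1.5, 10.0):.2f}")
```

Lower is better for both metrics; an RTF below 1.0 means audio is generated faster than it plays back.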
Quick Start Guide
Deploy IndexTTS2 locally in minutes with our comprehensive step-by-step guide. Start generating natural, emotional speech with zero-shot voice cloning capabilities.
Python API Example
```python
from indextts import IndexTTS

# Initialize the model
tts = IndexTTS()

# Generate speech from text
audio = tts.synthesize(
    text="Hello world! Welcome to IndexTTS2.",
    voice_reference="path/to/reference.wav",  # Optional: clone a voice
    emotion="neutral",  # Control emotion: happy, sad, angry, neutral
    speed=1.0,  # Adjust speaking speed
    language="en"  # Supported: en, zh
)

# Save the output
audio.save("output.wav")
```
Documentation
Complete guides and API reference
GitHub Repository
Source code and examples
Community
Get help and share ideas
Key Features of IndexTTS2
Discover the powerful capabilities that make IndexTTS2 the ideal choice for expressive, controllable text-to-speech generation.
Zero-Shot Voice Cloning
Instantly clone any voice from just a few seconds of audio without training. Achieves high-fidelity voice reproduction with speaker consistency across diverse content and emotions.
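Speaker consistency is often scored as the cosine similarity between speaker embeddings of the reference clip and the generated audio. The sketch below assumes that convention for the Speaker Similarity row in the comparison table; the embedding values are toy numbers, and in practice they would come from a separate speaker-verification model that is not shown here.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors, in [-1, 1]."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Toy 4-dimensional embeddings; real speaker embeddings are much larger.
    reference_embedding = [0.12, -0.40, 0.88, 0.05]
    generated_embedding = [0.10, -0.35, 0.90, 0.10]
    print(f"{cosine_similarity(reference_embedding, generated_embedding):.3f}")
```

A score closer to 1.0 means the generated voice sits closer to the reference speaker in embedding space.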
Emotion Expression Control
Decouple timbre from emotion for independent control. Use text descriptions to guide emotional expression (happy, sad, excited, angry) while maintaining voice identity and naturalness.
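One way to picture decoupled control is a request in which the voice reference and the emotion setting are separate fields, so one can change without touching the other. The sketch below is purely illustrative: `EMOTIONS`, `emotion_weights`, and `SynthesisRequest` are hypothetical names invented for this example and are not part of the IndexTTS2 API.

```python
from dataclasses import dataclass, replace
from typing import Mapping, Tuple

# Emotion labels taken from the feature description; the ordering is arbitrary.
EMOTIONS: Tuple[str, ...] = ("happy", "sad", "excited", "angry")


def emotion_weights(requested: Mapping[str, float]) -> Tuple[float, ...]:
    """Turn {label: strength} into weights over EMOTIONS that sum to 1.

    An empty request yields all zeros, standing in for a neutral delivery.
    """
    unknown = set(requested) - set(EMOTIONS)
    if unknown:
        raise ValueError(f"unknown emotion labels: {sorted(unknown)}")
    if any(value < 0 for value in requested.values()):
        raise ValueError("emotion strengths must be non-negative")
    raw = [float(requested.get(name, 0.0)) for name in EMOTIONS]
    total = sum(raw)
    if total == 0:
        return tuple(raw)
    return tuple(value / total for value in raw)


@dataclass(frozen=True)
class SynthesisRequest:
    """Hypothetical request object: timbre and emotion are separate fields."""
    text: str
    speaker_reference: str      # audio clip that defines the voice identity
    emotion: Tuple[float, ...]  # weights over EMOTIONS


if __name__ == "__main__":
    calm = SynthesisRequest("See you tomorrow.", "speaker.wav", emotion_weights({}))
    # Change only the emotion; the speaker reference is carried over unchanged.
    upbeat = replace(calm, emotion=emotion_weights({"happy": 3, "excited": 1}))
    print(calm.speaker_reference == upbeat.speaker_reference)
    print(upbeat.emotion)
```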
Precise Duration Control
The first autoregressive TTS model to combine accurate duration control with natural generation. Achieve precise speech timing without sacrificing expressiveness or prosody quality.
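If duration is controlled by fixing the number of generated tokens, a target length in seconds has to be converted into a token budget. The sketch below assumes a constant number of tokens per second of audio. The rate of 25.0 is a placeholder chosen for illustration, not a value taken from IndexTTS2, and both helper names are hypothetical.

```python
def token_budget(target_seconds: float, tokens_per_second: float) -> int:
    """Number of tokens whose total duration is closest to the target."""
    if target_seconds <= 0 or tokens_per_second <= 0:
        raise ValueError("target_seconds and tokens_per_second must be positive")
    return max(1, round(target_seconds * tokens_per_second))


def realized_seconds(n_tokens: int, tokens_per_second: float) -> float:
    """Duration actually produced by a given token count."""
    return n_tokens / tokens_per_second


if __name__ == "__main__":
    RATE = 25.0  # placeholder token rate, an assumption for this example
    for target in (1.0, 2.4, 3.33):
        n = token_budget(target, RATE)
        print(f"target={target}s tokens={n} realized={realized_seconds(n, RATE):.2f}s")
```

Under this assumption the timing error is bounded by half a token, which is why a token-count target can line speech up with video for dubbing.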
Multi-Language Support
Native support for Chinese (Mandarin), English, and mixed Chinese-English synthesis. Maintains natural pronunciation and intonation across language boundaries.
Pinyin Pronunciation Control
Advanced pronunciation control through pinyin notation for Chinese text. Resolve ambiguous pronunciations and ensure accurate character reading in complex contexts.
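As an illustration of pronunciation overrides, the sketch below replaces an ambiguous character with an inline pinyin token before the text is sent for synthesis. The token shape used here (uppercase pinyin followed by a tone digit from 1 to 5) is an assumption, and `annotate_pinyin` is a hypothetical helper; check the project documentation for the notation the model actually accepts.

```python
import re
from typing import Mapping

# Assumed token shape: uppercase pinyin letters followed by a tone digit 1-5.
PINYIN_TOKEN = re.compile(r"[A-Z]+[1-5]")


def annotate_pinyin(text: str, overrides: Mapping[int, str]) -> str:
    """Replace the character at each given index with a pinyin token."""
    for index, token in overrides.items():
        if not 0 <= index < len(text):
            raise IndexError(f"index {index} is outside the text")
        if not PINYIN_TOKEN.fullmatch(token):
            raise ValueError(f"not a pinyin token: {token!r}")
    return "".join(overrides.get(i, ch) for i, ch in enumerate(text))


if __name__ == "__main__":
    # The character at index 1 can be read "chang2" or "zhang3"; pin it to "zhang3".
    print(annotate_pinyin("他长得很高", {1: "ZHANG3"}))
```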
High Naturalness & Clarity
Lower word error rate (WER) and stronger emotion preservation than the compared models. Achieves near-human naturalness ratings with high clarity and intelligibility.
What People Are Saying About IndexTTS2 on X
Join the conversation about IndexTTS2 and share your experience with the research community
How do you make TTS both natural and precisely timed for dubbing or sync? 🎙️⏱️
— 机器之心 JIQIZHIXIN (@jiqizhixin) September 19, 2025
Meet IndexTTS2—an autoregressive model with novel duration control:
- Mode 1: specify token count → exact speech length
- Mode 2: free AR generation → natural prosody preserved
✨ Extra features:… pic.twitter.com/nvmq05xU5Z
🗣️✨Bilibili just dropped IndexTTS2, and it might be the most expressive and controllable zero-shot TTS model yet!
— 机器之心 JIQIZHIXIN (@jiqizhixin) August 1, 2025
It's a breakthrough for autoregressive models, bringing precise timing and rich emotion to synthesized speech.
Basically, it can produce a voice that sounds like… pic.twitter.com/w625jhRwkq
IndexTTS2: New AI text to speech with full emotion control
— ⚡AI Search⚡ (@aisearchio) September 18, 2025
Free & open-source!
Here's the full tutorial: https://t.co/dzMMT5JvcR pic.twitter.com/LCELPyMwtj
IndexTTS2, one of the most realistic and expressive text-to-speech model so far.
— Rohan Paul (@rohanpaul_ai) July 14, 2025
Fully local with open weights.
Zero-shot voice cloning. You just provide one audio file (in any language) and it will extremely accurately clone the voice style and rhythm. It sounds much more… pic.twitter.com/6ixAtbCrnn
⭐ Today’s China AI Native Industry Insights include:
— AI Native Foundation (@AINativeF) July 15, 2025
1. MoonshotAI releases Kimi K2: Open-Source Agentic Intelligence at Scale
2. Exciting Upgrade: Alibaba's Qwen Chat Launches Enhanced Features!
3. Bilibili launches IndexTTS2: Revolutionizing Voice Synthesis with Emotion… pic.twitter.com/ov85VOyVjy
Let's see which performs better: Index TTS2 or VibeVoice-7B?
— karminski-牙医 (@karminski3) September 19, 2025
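Note that both models become unstable when generating long audio, so the workaround is to generate some extra material, regenerate as needed, and cut out the flawed parts.
Both workflows are open source: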
VibeVoice: https://t.co/RV3qYA9UkP
IndexTTS2: https://t.co/ptagLxcZLQ… pic.twitter.com/o0CSUCdPhw
A big win for Bilibili! IndexTTS2 truly lives up to its name!
— Gorden Sun (@Gorden_Sun) September 11, 2025
It not only clones timbre but also reproduces emotion and intonation, and in this respect it is far stronger than 11Labs. pic.twitter.com/aT03Yk0dac
