awesome-generative-ai

Text-to-Speech (TTS) Models

Curated collection of high-quality open-source TTS models and toolkits for research, production, and multi-language synthesis.

Table of Contents

High-Fidelity Models
Fast and Efficient Models
Vocoders
Notable TTS Projects
Additional TTS Models
New Additions (Curated)
Selection Guide
Voice Apps & Utilities
Additional Resources

High-Fidelity Models

Tortoise-TTS

Type: High-fidelity voice cloning and generation
Features: Prompt tuning, long-form speech
Requirements: GPU, ~6GB memory recommended
Best for: High-quality voice synthesis

VoiceCraft

Type: Zero-shot speech editing and text-to-speech
Features: Speech editing in the wild, speaker conditioning
Best for: High-quality speech editing and synthesis

VITS (Variational Inference TTS)

Type: End-to-end model combining a conditional VAE, normalizing flows, and a HiFi-GAN-style adversarial decoder
Features: High-quality synthesis, fewer artifacts
Performance: Fast training and inference
Best for: Research and production

Coqui TTS

Type: Modular, multilingual toolkit
Features: Easy training, fine-tuning, pretrained models
Languages: English, German, French, and more
Best for: Production-ready applications

Bark

Type: Multilingual text-to-audio model
Features: Speech with nonverbal sounds (laughter, sighs), plus music and simple sound effects
Languages: Multiple languages
Best for: Creative audio generation

Maha TTS

Type: Multilingual TTS system
Features: Support for Indian languages
Languages: Hindi, English, and regional languages
Best for: Indian language applications

MMS (Massively Multilingual Speech)

Type: Multilingual speech recognition and synthesis
Features: Support for 1,000+ languages
Languages: Extensive language coverage
Best for: Global applications

Vall-E X

Type: Multilingual TTS with voice cloning
Features: Cross-lingual voice cloning
Languages: Multiple languages
Best for: Multilingual voice synthesis

StyleTTS2

Type: Style-aware TTS system
Features: Style control and transfer
Performance: High-quality synthesis
Best for: Expressive speech generation

SeamlessM4T

Type: Multimodal multilingual model
Features: Speech-to-speech translation
Languages: 100+ languages
Best for: Real-time translation

Fast and Efficient Models

Tacotron 2

Type: Two-stage pipeline (spectrogram + vocoder)
Features: Realistic prosody and intonation
Framework: TensorFlow
Best for: Research and learning

FastSpeech 2

Type: Non-autoregressive TTS
Features: High speed and stability
Vocoders: HiFi-GAN, WaveGlow compatible
Best for: Real-time applications

Glow-TTS

Type: Flow-based TTS architecture
Features: High-performance, parallelizable
Best for: Fast inference scenarios

KittenTTS

Type: Ultra-lightweight neural TTS
Features: 15M parameters, <25MB model size
Performance: CPU-optimized, no GPU required
Voices: 8 built-in voices (male and female variants)
Best for: Edge deployment, mobile applications
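
As a rough sanity check on the size figures above, the sketch below estimates how large 15M parameters would be at a few common precisions. It assumes the file is dominated by raw weights, and the helper name `weights_size_mb` is hypothetical; the entry does not state how KittenTTS actually stores its weights.

```python
def weights_size_mb(n_params: int, bits_per_param: int) -> float:
    """Approximate size of raw weights in megabytes (1 MB = 10**6 bytes)."""
    return n_params * bits_per_param / 8 / 1_000_000


N_PARAMS = 15_000_000  # "15M parameters" from the entry above

# Illustrative precisions only; not a statement about KittenTTS internals.
sizes = {bits: weights_size_mb(N_PARAMS, bits) for bits in (32, 16, 8)}

# Largest average bits per parameter that still fits a 25 MB budget.
max_bits_for_25mb = 25 * 1_000_000 * 8 / N_PARAMS

if __name__ == "__main__":
    for bits, mb in sizes.items():
        print(f"{bits}-bit weights: {mb:.0f} MB")
    print(f"25 MB budget allows about {max_bits_for_25mb:.1f} bits per parameter")
```

Under these assumptions, 16-bit weights alone would come to 30 MB, so a file under 25 MB suggests some form of quantization or compression.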

Piper

Type: Fast, local neural text-to-speech system
Features: Lightweight deployment, high-quality synthesis
Performance: Optimized for real-time applications
Languages: Multiple language support
Note: Development moved to OHF-Voice/piper1-gpl
Best for: Local deployment, privacy-focused applications

Vocoders

Vocoders convert spectrograms to audio waveforms.
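
To make the spectrogram-to-waveform step concrete, here is a deliberately tiny, non-neural sketch. It is not how HiFi-GAN, WaveGlow, or MelGAN work: it assumes linear-frequency magnitude frames (not mel), invents the phase from the sample position, and overlap-adds Hann-windowed sinusoids. The function name `toy_vocoder` and its parameters are hypothetical.

```python
import math


def toy_vocoder(frames, n_fft=64, hop=16):
    """Overlap-add sinusoids whose amplitudes come from magnitude frames.

    frames: list of frames, each a list of magnitudes for bins 0..n_fft//2.
    Returns a list of float samples.
    """
    if not frames:
        return []
    # Periodic Hann window; with hop = n_fft // 4 the overlapping windows sum to 2.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / n_fft) for n in range(n_fft)]
    out = [0.0] * (hop * (len(frames) - 1) + n_fft)
    for t, mags in enumerate(frames):
        start = t * hop
        for n in range(n_fft):
            pos = start + n
            # Phase is derived from the absolute sample position (an assumption),
            # which keeps each sinusoid continuous across neighbouring frames.
            sample = sum(
                mag * math.cos(2 * math.pi * k * pos / n_fft)
                for k, mag in enumerate(mags)
            )
            out[pos] += window[n] * sample
    return out


# Four frames with all energy in frequency bin 4 (a single steady tone).
frames = [[1.0 if k == 4 else 0.0 for k in range(33)] for _ in range(4)]
wave = toy_vocoder(frames)
```

Neural vocoders replace the invented phase and fixed sinusoids with a learned generator, which is where most of the quality difference comes from.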

HiFi-GAN

Type: Fast, high-quality vocoder
Features: Efficient waveform generation
Best for: Production TTS systems

WaveGlow

Type: Real-time waveform generator
Features: NVIDIA-optimized
Best for: GPU-accelerated synthesis

MelGAN

Type: GAN-based vocoder
Features: Adversarial training
Best for: High-quality audio generation

Notable TTS Projects

Fish-Speech

Type: Real-time streaming TTS system
Features: High-fidelity, low-latency synthesis
Demo: https://speech.fish.audio/samples/
Website: https://speech.fish.audio/

Kokoro

Type: Multilingual expressive TTS
Features: Multi-speaker, style control
Languages: Japanese and English
PyPI: https://pypi.org/project/kokoro/
Model: https://huggingface.co/hexgrad/Kokoro
Demo: https://huggingface.co/spaces/hexgrad/kokoro-tts

Llasa-TTS

Type: Ultra-fast multilingual TTS engine
Features: Streaming, natural prosody
Demo: https://huggingface.co/spaces/srinivasanbalasubramani/llasa-tts

Spark-TTS

Type: Modular, real-time neural TTS
Features: Latency-optimized
Demo: https://sparkaudio.github.io/spark-tts/

Qwen3-TTS

Type: Large-scale multilingual TTS model
Features: High-quality synthesis, controllable voices
Best for: Research and production TTS

ComfyUI-Qwen-TTS

Type: ComfyUI node / workflow integration for Qwen TTS
Features: Generate TTS audio directly inside ComfyUI graphs
Best for: TTS pipelines built in ComfyUI

VITS2

Type: Single-stage end-to-end neural TTS (successor to VITS)
Features: Real-time performance, lightweight
Architecture: Fast and modular
Synthesis: Phoneme-based via espeak-ng

Index-TTS

Type: Neural text-to-speech system
Features: Training and inference pipeline
Best for: Research and experimentation

Chatterbox

Type: Neural text-to-speech model
Features: Open-source TTS model and demos
Best for: Prototyping expressive speech

ChatTTS

Type: Conversational TTS model
Features: Dialogue-focused prosody and expressiveness
Best for: Chat-style TTS and assistant voices

FireRedTTS2

Type: Neural text-to-speech system
Features: Training and inference codebase
Best for: Model exploration and benchmarks

Genie-TTS

Type: Text-to-speech toolkit
Features: Inference tooling and model assets
Best for: Quick demos and experiments

Supertonic

Type: Text-to-speech model and toolkit
Features: Repository with model and utilities
Best for: Research and evaluation

Granite Speech Models

Type: Enterprise-focused speech model collection
Features: Granite family speech model assets and usage examples
Best for: IBM Granite speech workflows and enterprise prototyping

TADA-TTS

Type: Neural text-to-speech project
Features: Open-source codebase for training and inference
Best for: TTS experimentation and custom speech pipelines

Additional TTS Models

Extension Models

XTTSv2 - Enhanced XTTS implementation
MARS5 - Multilingual TTS system
F5-TTS - High-quality diffusion/flow TTS (see curated entry below)
Parler TTS - Hugging Face TTS model steered by natural-language voice descriptions
OpenVoice - Open-source voice cloning
OpenVoice V2 - Enhanced OpenVoice
MeloTTS - Multilingual text-to-speech toolkit by MyShell
Irodori-TTS - Neural text-to-speech project and tooling
kani-tts-2 - TTS model and inference toolkit
FlashLabs-Chroma - Speech synthesis project and tooling
DIA - Neural TTS framework
Higgs Audio - Open-source TTS model and toolkit
Soprano - TTS model and training toolkit
CosyVoice - Conversational TTS
VoxCPM - Open-source TTS model
GLM-TTS - Zero-shot controllable TTS with emotion and style tokens
GPT-SoVITS - GPT-based voice synthesis
Piper TTS - Lightweight TTS engine
Kimi Audio 7B Instruct - Large-scale audio model
ACE-Step - Music generation foundation model (audio generation rather than TTS)

New Additions (Curated)

E2-TTS (e2-tts-pytorch)

Type: Fully non-autoregressive end-to-end TTS (PyTorch)
Features: Zero-shot style-capable synthesis, multistream transformer conditioning, simple training and inference API
Requirements: PyTorch (see repo for CUDA/CPU options)
Best for: Research experiments and fast end-to-end TTS prototyping

F5-TTS

Type: Flow-matching TTS built on a Diffusion Transformer with ConvNeXt V2
Features: High-quality, faster training/inference, Gradio/CLI apps, Triton/TensorRT runtime guides
Requirements: GPU recommended for best performance
Best for: High-quality synthesis and production benchmarks

Delayed Streams Modeling / Kyutai

Type: Collection for streaming STT and TTS (Kyutai models)
Features: Streaming/low-latency TTS examples, PyTorch + Rust + MLX implementations, Colab demos
Requirements: Varies by implementation (PyTorch, Rust, or Apple MLX)
Best for: Real-time/streaming TTS systems and production servers

Chatterbox Fine-Tuning Kit

Type: Fine-tuning and inference kit for Chatterbox TTS (Standard and Turbo modes)
Features: Automated tokenizer merging, preprocessing pipeline, Turbo mode for faster adaptation, inference scripts with VAD
Requirements: Python 3.8+, GPU recommended; run setup.py to download base models
Best for: Fine-tuning voice cloning and multi-language adaptation

Emotional VITS

Type: VITS variant with emotion conditioning
Features: Emotion embedding extraction from reference audio, emotion-controllable synthesis
Requirements: Python 3.6+, preprocessing and monotonic alignment build steps
Best for: Emotion-aware TTS and expressive voice synthesis research

Matcha-TTS

Type: Fast non-autoregressive TTS using conditional flow matching
Features: Official implementation of the ICASSP 2024 paper, fast synthesis, ONNX export and runtime support, CLI and Gradio app
Requirements: PyTorch 2.0+, optional ONNX runtime for export/inference
Best for: Low-latency, exportable TTS pipelines and production-friendly deployment

Selection Guide

Use Case	Recommended Model	Why
High-quality synthesis	Tortoise-TTS	Best audio quality
Production deployment	Coqui TTS	Modular and well-documented
Real-time applications	FastSpeech 2	Fast inference
Research projects	VITS	End-to-end and efficient
Multilingual support	MMS, Vall-E X	Extensive language coverage
Streaming applications	Llasa-TTS	Ultra-fast, streaming
Lightweight deployment	VITS2	Small footprint
Ultra-lightweight/Edge	KittenTTS	<25MB, CPU-only
Local/Privacy-focused	Piper	Fast local synthesis
Voice cloning	OpenVoice, Bark	High-fidelity cloning
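
The table above can also be expressed as a small lookup, which may be handy in a script that picks a default model per use case. This is only a sketch: the names `SELECTION_GUIDE` and `recommend` are hypothetical, and the entries simply mirror the rows of the table.

```python
# Rows copied from the Selection Guide table: use case -> (models, why).
SELECTION_GUIDE = {
    "High-quality synthesis": (["Tortoise-TTS"], "Best audio quality"),
    "Production deployment": (["Coqui TTS"], "Modular and well-documented"),
    "Real-time applications": (["FastSpeech 2"], "Fast inference"),
    "Research projects": (["VITS"], "End-to-end and efficient"),
    "Multilingual support": (["MMS", "Vall-E X"], "Extensive language coverage"),
    "Streaming applications": (["Llasa-TTS"], "Ultra-fast, streaming"),
    "Lightweight deployment": (["VITS2"], "Small footprint"),
    "Ultra-lightweight/Edge": (["KittenTTS"], "<25MB, CPU-only"),
    "Local/Privacy-focused": (["Piper"], "Fast local synthesis"),
    "Voice cloning": (["OpenVoice", "Bark"], "High-fidelity cloning"),
}

# Case-insensitive index so callers need not match the table's capitalization.
_INDEX = {key.lower(): value for key, value in SELECTION_GUIDE.items()}


def recommend(use_case):
    """Return {'models': [...], 'why': str} for a known use case, else None."""
    entry = _INDEX.get(use_case.strip().lower())
    if entry is None:
        return None
    models, why = entry
    return {"models": list(models), "why": why}


if __name__ == "__main__":
    print(recommend("Ultra-lightweight/Edge"))
```

Returning None for unknown use cases leaves the fallback choice to the caller instead of guessing a model.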

Voice Apps & Utilities

Voice

Type: Minimalistic audiobook player (Android)
Focus: Local playback and simple library management
Best for: Lightweight audiobook listening apps

Additional Resources

Voice Cloning - Voice synthesis and cloning techniques
STT Models - Speech-to-text recognition
Emotion Recognition - Audio emotion analysis
Talking Head - Visual speech synthesis

Tip: Consider your use case (quality vs speed) and target platform when choosing a TTS model.
