Text-to-Speech (TTS) Models
A curated collection of high-quality open-source TTS models and toolkits for research, production, and multilingual synthesis.
High-Fidelity Models
- Type: High-fidelity voice cloning and generation
- Features: Prompt tuning, long-form speech
- Requirements: GPU, ~6GB memory recommended
- Best for: High-quality voice synthesis
- Type: Zero-shot speech editing and text-to-speech
- Features: Speech editing in the wild, speaker conditioning
- Best for: High-quality speech editing and synthesis
- Type: End-to-end model combining Tacotron2 + HiFi-GAN
- Features: High-quality synthesis, fewer artifacts
- Performance: Fast training and inference
- Best for: Research and production
- Type: Modular, multilingual toolkit
- Features: Easy training, fine-tuning, pretrained models
- Languages: English, German, French, and more
- Best for: Production-ready applications
- Type: Multilingual text-to-audio model
- Features: High-quality speech synthesis
- Languages: Multiple languages
- Best for: Creative audio generation
- Type: Multilingual TTS system
- Features: Support for Indian languages
- Languages: Hindi, English, and regional languages
- Best for: Indian language applications
- Type: Multilingual speech recognition and synthesis
- Features: Support for 1000+ languages
- Languages: Extensive language coverage
- Best for: Global applications
- Type: Multilingual TTS with voice cloning
- Features: Cross-lingual voice cloning
- Languages: Multiple languages
- Best for: Multilingual voice synthesis
- Type: Style-aware TTS system
- Features: Style control and transfer
- Performance: High-quality synthesis
- Best for: Expressive speech generation
- Type: Multimodal multilingual model
- Features: Speech-to-speech translation
- Languages: 100+ languages
- Best for: Real-time translation
Fast and Efficient Models
- Type: Two-stage pipeline (spectrogram + vocoder)
- Features: Realistic prosody and intonation
- Framework: TensorFlow
- Best for: Research and learning
- Type: Non-autoregressive TTS
- Features: High speed and stability
- Vocoders: HiFi-GAN, WaveGlow compatible
- Best for: Real-time applications
- Type: Flow-based TTS architecture
- Features: High-performance, parallelizable
- Best for: Fast inference scenarios
- Type: Ultra-lightweight neural TTS
- Features: 15M parameters, <25MB model size
- Performance: CPU-optimized, no GPU required
- Voices: 8 premium voice options (male/female variants)
- Best for: Edge deployment, mobile applications
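As a back-of-the-envelope check on the figures in this entry, parameter count times bytes per parameter gives a rough estimate of weight storage. The sketch below is plain arithmetic. The precisions listed are assumptions for illustration, since this list does not say how the weights are stored, and `estimate_size_mb` is a hypothetical helper name.

```python
# Rough weight-storage estimate: parameters x bytes per parameter.
# The precisions below are illustrative assumptions, not published
# details of any specific model.

BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def estimate_size_mb(num_params: int, precision: str) -> float:
    """Return approximate weight size in megabytes (1 MB = 10**6 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1_000_000


if __name__ == "__main__":
    params = 15_000_000  # "15M parameters" from the entry above
    for precision in BYTES_PER_PARAM:
        size = estimate_size_mb(params, precision)
        verdict = "fits" if size < 25 else "exceeds"
        print(f"{precision}: {size:.0f} MB ({verdict} a 25 MB budget)")
```

Under these assumptions, 32-bit and 16-bit weights would exceed 25 MB, so a file under 25 MB implies fewer than about 13 bits per parameter on average. That suggests, but does not confirm, quantized or otherwise compressed weights.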
- Type: Fast, local neural text-to-speech system
- Features: Lightweight deployment, high-quality synthesis
- Performance: Optimized for real-time applications
- Languages: Multiple languages supported
- Note: Development moved to OHF-Voice/piper1-gpl
- Best for: Local deployment, privacy-focused applications
Vocoders
Vocoders convert spectrograms to audio waveforms.
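To make the spectrogram-to-waveform step concrete, here is a toy sketch that uses additive synthesis (a bank of sine oscillators) as a stand-in for a vocoder. It is not how neural vocoders work; they learn this mapping from data. The function name `toy_vocoder`, the bin frequencies, and the default sample rate and hop size are all assumptions chosen for illustration.

```python
import math


def toy_vocoder(spectrogram, bin_freqs_hz, sample_rate=16000, hop=160):
    """Turn a magnitude spectrogram into a waveform with sine oscillators.

    spectrogram  -- list of frames; each frame holds one magnitude per bin
    bin_freqs_hz -- centre frequency of each bin (assumed, not learned)
    hop          -- number of output samples generated per frame
    """
    # One running phase per bin keeps each oscillator continuous across frames.
    phases = [0.0] * len(bin_freqs_hz)
    waveform = []
    for frame in spectrogram:
        for _ in range(hop):
            sample = 0.0
            for k, (freq, mag) in enumerate(zip(bin_freqs_hz, frame)):
                sample += mag * math.sin(phases[k])
                phases[k] += 2.0 * math.pi * freq / sample_rate
            waveform.append(sample)
    return waveform


if __name__ == "__main__":
    # Two frames, two bins: a 440 Hz tone followed by a quieter 880 Hz tone.
    frames = [[1.0, 0.0], [0.0, 0.5]]
    audio = toy_vocoder(frames, [440.0, 880.0])
    print(len(audio), "samples generated")
```

The input here carries magnitudes only, so the phase has to be invented by the synthesizer. Recovering plausible phase and fine detail from a compact spectrogram is the part that the vocoders listed in this section are trained to do well.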
- Type: Fast, high-quality vocoder
- Features: Efficient waveform generation
- Best for: Production TTS systems
- Type: Real-time waveform generator
- Features: NVIDIA-optimized
- Best for: GPU-accelerated synthesis
- Type: GAN-based vocoder
- Features: Adversarial training
- Best for: High-quality audio generation
Notable TTS Projects
- Type: Real-time streaming TTS system
- Features: High-fidelity, low-latency synthesis
- Demo: https://speech.fish.audio/samples/
- Website: https://speech.fish.audio/
- Type: Multilingual expressive TTS
- Features: Multi-speaker, style control
- Languages: Japanese and English
- PyPI: https://pypi.org/project/kokoro/
- Model: https://huggingface.co/hexgrad/Kokoro
- Demo: https://huggingface.co/spaces/hexgrad/kokoro-tts
- Type: Ultra-fast multilingual TTS engine
- Features: Streaming, natural prosody
- Demo: https://huggingface.co/spaces/srinivasanbalasubramani/llasa-tts
- Type: Modular, real-time neural TTS
- Features: Latency-optimized
- Demo: https://sparkaudio.github.io/spark-tts/
- Type: Large-scale multilingual TTS model
- Features: High-quality synthesis, controllable voices
- Best for: Research and production TTS
- Type: ComfyUI node / workflow integration for Qwen TTS
- Features: Generate TTS audio directly inside ComfyUI graphs
- Best for: TTS pipelines built in ComfyUI
- Type: Tacotron-style neural TTS
- Features: Real-time performance, lightweight
- Architecture: Fast and modular
- Synthesis: Phoneme-based via espeak-ng
- Type: Neural text-to-speech system
- Features: Training and inference pipeline
- Best for: Research and experimentation
- Type: Neural text-to-speech model
- Features: Open-source TTS model and demos
- Best for: Prototyping expressive speech
- Type: Conversational TTS model
- Features: Dialogue-focused prosody and expressiveness
- Best for: Chat-style TTS and assistant voices
- Type: Neural text-to-speech system
- Features: Training and inference codebase
- Best for: Model exploration and benchmarks
- Type: Text-to-speech toolkit
- Features: Inference tooling and model assets
- Best for: Quick demos and experiments
- Type: Text-to-speech model and toolkit
- Features: Repository with model and utilities
- Best for: Research and evaluation
- Type: Enterprise-focused speech model collection
- Features: Granite family speech model assets and usage examples
- Best for: IBM Granite speech workflows and enterprise prototyping
- Type: Neural text-to-speech project
- Features: Open-source codebase for training and inference
- Best for: TTS experimentation and custom speech pipelines
Additional TTS Models
Extension Models
New Additions (Curated)
- Type: Fully non-autoregressive end-to-end TTS (PyTorch)
- Features: Zero-shot style-capable synthesis, multistream transformer conditioning, simple training and inference API
- Requirements: PyTorch (see repo for CUDA/CPU options)
- Best for: Research experiments and fast end-to-end TTS prototyping
- Type: Diffusion/flow-based TTS combining ConvNeXt V2 and flow matching
- Features: High-quality, faster training/inference, Gradio/CLI apps, Triton/TensorRT runtime guides
- Requirements: GPU recommended for best performance
- Best for: High-quality synthesis and production benchmarks
- Type: Collection for streaming STT and TTS (Kyutai models)
- Features: Streaming/low-latency TTS examples, PyTorch + Rust + MLX implementations, Colab demos
- Requirements: Varies by implementation (PyTorch, Rust, or Apple MLX)
- Best for: Real-time/streaming TTS systems and production servers
- Type: Fine-tuning and inference kit for Chatterbox TTS (Standard and Turbo modes)
- Features: Automated tokenizer merging, preprocessing pipeline, Turbo mode for faster adaptation, inference scripts with VAD
- Requirements: Python 3.8+, GPU recommended; run setup.py to download base models
- Best for: Fine-tuning voice cloning and multi-language adaptation
- Type: VITS variant with emotion conditioning
- Features: Emotion embedding extraction from reference audio, emotion-controllable synthesis
- Requirements: Python 3.6+, preprocessing and monotonic alignment build steps
- Best for: Emotion-aware TTS and expressive voice synthesis research
- Type: Fast non-autoregressive TTS using conditional flow matching
- Features: Implementation accompanying an ICASSP paper, fast synthesis, ONNX export and runtime support, CLI and Gradio app
- Requirements: PyTorch 2.0+, optional ONNX runtime for export/inference
- Best for: Low-latency, exportable TTS pipelines and production-friendly deployment
Selection Guide
| Use Case | Recommended Model | Why |
|---|---|---|
| High-quality synthesis | Tortoise-TTS | Best audio quality |
| Production deployment | Coqui TTS | Modular and well-documented |
| Real-time applications | FastSpeech 2 | Fast inference |
| Research projects | VITS | End-to-end and efficient |
| Multilingual support | MMS, Vall-E X | Extensive language coverage |
| Streaming applications | Llasa-TTS | Ultra-fast, streaming |
| Lightweight deployment | VITS2 | Small footprint |
| Ultra-lightweight/Edge | KittenTTS | <25MB, CPU-only |
| Local/Privacy-focused | Piper | Fast local synthesis |
| Voice cloning | OpenVoice, Bark | High-fidelity cloning |
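
For scripts that need to pick a model programmatically, the selection guide above can be encoded as a small lookup table. This is a minimal sketch: the entries are copied from the guide, while the `SELECTION_GUIDE` name and the `recommend` helper are hypothetical and not part of any listed project.

```python
# Selection guide encoded as a lookup table.
# Keys are lowercased use cases; values are (models, reason) pairs.

SELECTION_GUIDE = {
    "high-quality synthesis": (["Tortoise-TTS"], "Best audio quality"),
    "production deployment": (["Coqui TTS"], "Modular and well-documented"),
    "real-time applications": (["FastSpeech 2"], "Fast inference"),
    "research projects": (["VITS"], "End-to-end and efficient"),
    "multilingual support": (["MMS", "Vall-E X"], "Extensive language coverage"),
    "streaming applications": (["Llasa-TTS"], "Ultra-fast, streaming"),
    "lightweight deployment": (["VITS2"], "Small footprint"),
    "ultra-lightweight/edge": (["KittenTTS"], "<25MB, CPU-only"),
    "local/privacy-focused": (["Piper"], "Fast local synthesis"),
    "voice cloning": (["OpenVoice", "Bark"], "High-fidelity cloning"),
}


def recommend(use_case: str):
    """Return (models, reason) for a use case, ignoring case and padding."""
    key = use_case.strip().lower()
    if key not in SELECTION_GUIDE:
        known = ", ".join(sorted(SELECTION_GUIDE))
        raise KeyError(f"Unknown use case {use_case!r}; known: {known}")
    return SELECTION_GUIDE[key]


if __name__ == "__main__":
    models, reason = recommend("Voice cloning")
    print(f"{', '.join(models)}: {reason}")
```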
Voice Apps & Utilities
- Type: Minimalistic audiobook player (Android)
- Focus: Local playback and simple library management
- Best for: Lightweight audiobook listening apps
Additional Resources
Tip: Consider your use case (quality vs speed) and target platform when choosing a TTS model.
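
One common way to put a number on the speed side of that trade-off is the real-time factor (RTF): synthesis time divided by the duration of the generated audio, where values below 1 mean faster than real time. The sketch below assumes a `synthesize(text)` callable that returns a sequence of audio samples. That callable, `fake_synthesize`, and the 16 kHz sample rate are placeholders, not the API of any model listed here.

```python
import time


def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent synthesizing / duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def measure_rtf(synthesize, text: str, sample_rate: int) -> float:
    """Time one call to `synthesize` and return its real-time factor."""
    start = time.perf_counter()
    samples = synthesize(text)
    elapsed = time.perf_counter() - start
    return real_time_factor(elapsed, len(samples) / sample_rate)


def fake_synthesize(text: str):
    """Placeholder model: returns 0.1 s of silence per character at 16 kHz."""
    return [0.0] * (1600 * max(len(text), 1))


if __name__ == "__main__":
    rtf = measure_rtf(fake_synthesize, "hello world", sample_rate=16000)
    print(f"RTF: {rtf:.4f}")
```

Measuring RTF on the target hardware, with text of realistic length, gives a fairer comparison than relying on figures reported for a different device.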