Speech-to-Text (STT) Datasets
Comprehensive collection of high-quality speech datasets for training and evaluating speech recognition models across multiple languages and domains.
Multilingual Datasets
Common Voice (Mozilla)
- Languages: 100+ languages
- Size: >15,000 hours (validated), >20,000 hours (total)
- Speakers: Multi-speaker
- License: CC-0
- Download: voice.mozilla.org
VoxForge
- Languages: Multiple languages
- Size: Variable by language
- Speakers: Community-contributed
- License: Various open licenses
- Download: voxforge.org
English Datasets
LibriSpeech
- Size: ~1,000 hours
- Speakers: 2,484 speakers (1,201 female / 1,283 male)
- Content: Audiobooks
- License: CC-BY 4.0
- Download: openslr.org/12
LibriTTS
- Size: 586 hours
- Speakers: 2,456 speakers (1,185 female / 1,271 male)
- Content: TTS-optimized audiobooks
- License: CC-BY 4.0
- Download: openslr.org/60
VCTK
Speech Commands
Other Languages
German Datasets
- Thorsten-21.02-neutral: ~24 hours, 1 male speaker, CC-0
- Thorsten-21.06-emotional: 2,400 utterances, 8 emotions, CC-0
- Telecooperation German: ~35 hours, ~180 speakers, CC-BY 2.0
Nordic Languages
- NST Danish ASR: 229,992 utterances, 616 speakers, CC-0
- NST Swedish ASR: 366,000 utterances, 1,000 speakers, CC-0
- NST Norwegian ASR: 359,760 utterances, 980 speakers, CC-0
African Languages
- NCHLT Afrikaans: 56 hours, 210 speakers, CC-BY 3.0
- NCHLT English: 56 hours, 210 speakers, CC-BY 3.0
- NCHLT isiZulu: 56 hours, 210 speakers, CC-BY 3.0
Asian Languages
Dataset Statistics
License Distribution
| License | Count | Description |
|:--|:--|:--|
| CC-0 | 15+ | Public domain, no restrictions |
| CC-BY | 20+ | Attribution required |
| CC-BY-SA | 10+ | Share-alike required |
| CC-BY-NC | 5+ | Non-commercial use only |
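
The license classes above decide whether a dataset fits a given project. As a minimal sketch, assuming license strings follow the `CC-BY-NC` naming style used in this document, a catalog can be filtered for commercial use. The `DatasetEntry` type, the `allows_commercial_use` helper, and the `ExampleCorpus` entry are hypothetical names introduced for illustration; the other entries are copied from the lists in this document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetEntry:
    name: str
    license: str


# Entries taken from this document, plus one hypothetical NC-licensed corpus
# that exists only to exercise the filter.
CATALOG = [
    DatasetEntry("Common Voice", "CC-0"),
    DatasetEntry("LibriSpeech", "CC-BY 4.0"),
    DatasetEntry("NST Danish ASR", "CC-0"),
    DatasetEntry("NCHLT isiZulu", "CC-BY 3.0"),
    DatasetEntry("ExampleCorpus (hypothetical)", "CC-BY-NC 4.0"),
]


def allows_commercial_use(license_name: str) -> bool:
    """Return False when the license identifier carries an NC clause."""
    parts = license_name.upper().replace(" ", "-").split("-")
    return "NC" not in parts


commercial_ok = [e.name for e in CATALOG if allows_commercial_use(e.license)]
print(commercial_ok)
```

This only inspects the license name. It is not legal advice, and attribution or share-alike obligations still apply to the datasets that pass the filter.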
Language Coverage
| Language Family | Languages | Total Hours |
|:--|:--|:--|
| Indo-European | 50+ | 10,000+ |
| Sino-Tibetan | 10+ | 1,000+ |
| Afro-Asiatic | 15+ | 2,000+ |
| Niger-Congo | 20+ | 3,000+ |
| Other | 30+ | 4,000+ |
Dataset Selection Guide
| Use Case | Recommended Datasets | Why |
|:--|:--|:--|
| General STT | LibriSpeech, Common Voice | Large, diverse, well-annotated |
| Multi-speaker | VCTK, LibriTTS | Multiple speakers, high quality |
| Command Recognition | Speech Commands | Short phrases, commands |
| Multilingual | Common Voice | 100+ languages |
| Research | NST datasets | Academic quality |
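
The selection guide above can also be kept as a small lookup in a project's setup code. The following is a sketch under stated assumptions: the `RECOMMENDATIONS` mapping mirrors the guide, and the `recommend` helper is a hypothetical name, not part of any library.

```python
from typing import Dict, List

# Use case -> recommended datasets, copied from the selection guide.
RECOMMENDATIONS: Dict[str, List[str]] = {
    "general stt": ["LibriSpeech", "Common Voice"],
    "multi-speaker": ["VCTK", "LibriTTS"],
    "command recognition": ["Speech Commands"],
    "multilingual": ["Common Voice"],
    "research": ["NST datasets"],
}


def recommend(use_case: str) -> List[str]:
    """Return the recommended datasets for a use case, or an empty list."""
    key = use_case.strip().lower()
    # Return a copy so callers cannot mutate the shared mapping.
    return list(RECOMMENDATIONS.get(key, []))


print(recommend("General STT"))
```

Returning an empty list for unknown use cases keeps the caller's code simple. Raising a `KeyError` instead would be an equally reasonable choice.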
Usage Examples
Python - Loading LibriSpeech
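import os
from torchaudio.datasets import LIBRISPEECH
# Make sure the download directory exists (harmless if it already does)
os.makedirs("./data", exist_ok=True)
# Download (if needed) and load the 100-hour clean training subset
dataset = LIBRISPEECH(
    root="./data",
    url="train-clean-100",
    download=True,
)
# Each item is a 6-tuple; chapter_id sits between speaker_id and utterance_id
waveform, sample_rate, transcript, speaker_id, chapter_id, utterance_id = dataset[0]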
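
Items returned by torchaudio's `LIBRISPEECH` carry a speaker ID, a chapter ID, and an utterance ID. As a sketch, assuming the standard LibriSpeech layout (`<speaker>/<chapter>/<speaker>-<chapter>-<utterance>.flac`, with the utterance number zero-padded to four digits), those IDs can be mapped back to the audio file inside a subset directory. The helper name `librispeech_relative_path` is hypothetical.

```python
from pathlib import PurePosixPath


def librispeech_relative_path(
    speaker_id: int, chapter_id: int, utterance_id: int
) -> PurePosixPath:
    """Build the path of an utterance relative to a subset directory.

    Assumes the standard LibriSpeech directory layout described above.
    """
    stem = f"{speaker_id}-{chapter_id}-{utterance_id:04d}"
    return PurePosixPath(str(speaker_id), str(chapter_id), f"{stem}.flac")


print(librispeech_relative_path(103, 1240, 0))  # 103/1240/103-1240-0000.flac
```

This is handy for spot-checking a transcript against its audio file without going back through the dataset object.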
Python - Loading Common Voice
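from datasets import load_dataset
# Load the English train split of Common Voice 11.0.
# The dataset is gated on the Hugging Face Hub: you may need to accept its
# terms and authenticate (e.g. `huggingface-cli login`) before this works.
dataset = load_dataset("mozilla-foundation/common_voice_11_0", "en", split="train")
# Read one example; "audio" is a dict with "array", "sampling_rate", and "path"
sample = dataset[0]
audio = sample["audio"]
text = sample["sentence"]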
Tip: Start with smaller datasets for prototyping, then scale up to larger datasets for production models.
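
One way to follow this tip without switching corpora is to carve a fixed-duration subset out of a larger dataset. The sketch below assumes each utterance is described by a `(utterance_id, num_samples, sample_rate)` tuple. That tuple shape and the `select_subset` helper are hypothetical and not tied to any particular loader.

```python
from typing import Iterable, List, Tuple

# (utterance_id, num_samples, sample_rate): an assumed, illustrative shape
Utterance = Tuple[str, int, int]


def select_subset(utterances: Iterable[Utterance], budget_hours: float) -> List[str]:
    """Keep utterances, in order, while their total duration fits the budget."""
    budget_seconds = budget_hours * 3600.0
    selected: List[str] = []
    total_seconds = 0.0
    for utt_id, num_samples, sample_rate in utterances:
        duration = num_samples / sample_rate  # clip length in seconds
        if total_seconds + duration > budget_seconds:
            continue  # too long for the remaining budget; a later clip may fit
        selected.append(utt_id)
        total_seconds += duration
    return selected


# Three clips of 30, 40, and 20 minutes at 16 kHz against a 1-hour budget.
clips = [
    ("a", 16000 * 1800, 16000),
    ("b", 16000 * 2400, 16000),
    ("c", 16000 * 1200, 16000),
]
print(select_subset(clips, budget_hours=1.0))  # ['a', 'c']
```

The selection is greedy and order-dependent. Shuffling the input first would give a more representative sample of speakers and recording conditions.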