awesome-generative-ai

🎙️ Speech-to-Text (STT) Datasets

Comprehensive collection of high-quality speech datasets for training and evaluating speech recognition models across multiple languages and domains.


📋 Table of Contents


🌍 Multilingual Datasets

🔷 Common Voice (Mozilla)

🔷 VoxForge


🇺🇸 English Datasets

🔷 LibriSpeech

🔷 LibriTTS

🔷 VCTK

🔷 Speech Commands


🌐 Other Languages

🔷 German Datasets

🔷 Nordic Languages

🔷 African Languages

🔷 Asian Languages


📊 Dataset Statistics

🔷 License Distribution

| License | Count | Description |
|:---|:---|:---|
| CC-0 | 15+ | Public domain, no restrictions |
| CC-BY | 20+ | Attribution required |
| CC-BY-SA | 10+ | Share-alike required |
| CC-BY-NC | 5+ | Non-commercial use only |

🔷 Language Coverage

| Language Family | Languages | Total Hours |
|:---|:---|:---|
| Indo-European | 50+ | 10,000+ |
| Sino-Tibetan | 10+ | 1,000+ |
| Afro-Asiatic | 15+ | 2,000+ |
| Niger-Congo | 20+ | 3,000+ |
| Other | 30+ | 4,000+ |


💡 Dataset Selection Guide

| Use Case | Recommended Datasets | Why |
|:---|:---|:---|
| General STT | LibriSpeech, Common Voice | Large, diverse, well-annotated |
| Multi-speaker | VCTK, LibriTTS | Multiple speakers, high quality |
| Command Recognition | Speech Commands | Short phrases, commands |
| Multilingual | Common Voice | 100+ languages |
| Research | NST datasets | Academic quality |
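
The guide above can also be kept as a small lookup table in code, so that scripts pick datasets by use case. This is a minimal sketch: the key names and the `recommend` helper are illustrative assumptions, and the dataset names are copied from the guide.

```python
from __future__ import annotations

# Mapping copied from the selection guide; the key spellings are hypothetical.
RECOMMENDED = {
    "general_stt": ["LibriSpeech", "Common Voice"],
    "multi_speaker": ["VCTK", "LibriTTS"],
    "command_recognition": ["Speech Commands"],
    "multilingual": ["Common Voice"],
    "research": ["NST datasets"],
}


def recommend(use_case: str) -> list[str]:
    """Return the datasets the guide lists for a use case."""
    # Normalize "Multi-speaker" / "General STT" style labels to dictionary keys.
    key = use_case.strip().lower().replace("-", "_").replace(" ", "_")
    if key not in RECOMMENDED:
        raise KeyError(f"unknown use case {use_case!r}; choose from {sorted(RECOMMENDED)}")
    return list(RECOMMENDED[key])
```

For example, `recommend("Multi-speaker")` looks up the `multi_speaker` entry.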


🚀 Usage Examples

Python - Loading LibriSpeech

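import os

from torchaudio.datasets import LIBRISPEECH

# Make sure the target directory exists before downloading into it
os.makedirs("./data", exist_ok=True)

# Load dataset (train-clean-100 is a multi-gigabyte download)
dataset = LIBRISPEECH(
    root="./data",
    url="train-clean-100",
    download=True
)

# Access audio and transcript; each item is a 6-tuple that includes chapter_id
waveform, sample_rate, transcript, speaker_id, chapter_id, utterance_id = dataset[0]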
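
Once items can be read, a quick sanity check is to total the audio per speaker. The sketch below assumes each item follows the six-field layout torchaudio documents for `LIBRISPEECH` (waveform, sample rate, transcript, speaker ID, chapter ID, utterance ID) and that the waveform is shaped `(channels, frames)`. The `hours_per_speaker` helper is a hypothetical name, not part of torchaudio.

```python
from collections import defaultdict


def hours_per_speaker(items):
    """Sum audio duration in hours for each speaker ID.

    Assumes items shaped like torchaudio's LIBRISPEECH output:
    (waveform, sample_rate, transcript, speaker_id, chapter_id, utterance_id).
    """
    totals = defaultdict(float)
    for waveform, sample_rate, _transcript, speaker_id, *_rest in items:
        num_frames = waveform.shape[-1]  # last dimension holds the samples
        totals[speaker_id] += num_frames / sample_rate / 3600.0
    return dict(totals)
```

Calling `hours_per_speaker(dataset[i] for i in range(100))` covers the first 100 utterances. Iterating over the full split decodes every file, so it can take a while.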

Python - Loading Common Voice

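from datasets import load_dataset

# Load Common Voice dataset (English)
# Note: this dataset is gated on the Hugging Face Hub; you may need to accept
# its terms and log in (e.g. `huggingface-cli login`) before it will download
dataset = load_dataset("mozilla-foundation/common_voice_11_0", "en")

# Access data (fetch the example once so the audio is decoded only once)
sample = dataset["train"][0]
audio = sample["audio"]
text = sample["sentence"]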
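
Transcripts usually need light normalization before training. The sketch below turns one example into an (audio, rate, text) triple. It assumes the `audio` field is a dict with `array` and `sampling_rate` keys, a layout `datasets` has used for decoded audio and that newer releases may change. The `to_training_pair` helper and its normalization rules are illustrative choices, not part of any library.

```python
import re

# Drop everything except word characters, whitespace, and apostrophes.
_PUNCT = re.compile(r"[^\w\s']")


def to_training_pair(sample):
    """Return (samples, sampling_rate, normalized_text) for one example.

    Assumes sample["audio"] is a dict with "array" and "sampling_rate"
    keys and sample["sentence"] holds the transcript.
    """
    audio = sample["audio"]
    text = _PUNCT.sub("", sample["sentence"]).lower()
    text = " ".join(text.split())  # collapse repeated whitespace
    return audio["array"], audio["sampling_rate"], text
```

If a model expects a different sampling rate, `dataset.cast_column("audio", Audio(sampling_rate=16_000))` from `datasets` resamples on access.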

💡 Tip: Start with smaller datasets for prototyping, then scale up to larger datasets for production models.
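
One way to follow this tip without switching datasets is to prototype on a fixed random subset of a larger one. This is a minimal sketch. The `prototype_indices` helper, its default fraction, and the seed are assumptions chosen for illustration.

```python
import random


def prototype_indices(dataset_size, fraction=0.05, seed=0):
    """Pick a reproducible random subset of example indices."""
    if dataset_size <= 0:
        raise ValueError("dataset_size must be positive")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    k = max(1, int(dataset_size * fraction))
    # A seeded generator returns the same subset on every run.
    return sorted(random.Random(seed).sample(range(dataset_size), k))
```

The indices can be passed to `torch.utils.data.Subset(dataset, indices)` for a torchaudio dataset, or to `dataset["train"].select(indices)` for a Hugging Face dataset.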