awesome-generative-ai

Widely Used Transformer Models

Comprehensive collection of transformer and foundation models for audio, vision, multimodal, and NLP use cases.


Table of Contents


Audio Processing

Speech Recognition and Classification

Audio Generation and Synthesis


Computer Vision

Image Understanding

Object Detection and Recognition

Pose and Segmentation


Multimodal

Audio-Text Integration

Image-Text Processing

Advanced Multimodal


Natural Language Processing

Text Understanding

Text Generation and Processing


Model Selection Guide

| Task Type           | Recommended Models        | Typical Use Case           |
|---------------------|---------------------------|----------------------------|
| Speech Recognition  | Whisper, Moonshine        | Multilingual transcription |
| Image Understanding | SAM, DINOv2               | Visual analysis            |
| Multimodal Tasks    | Qwen-VL, LLaVA, MiniCPM-o | Cross-modal reasoning      |
| Text Processing     | BART, T5, Qwen            | Language tasks             |
| Audio Generation    | MusicGen, Bark            | Audio synthesis            |
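
For scripted use, the guide above can be kept as a small lookup table. The following is a minimal sketch, not part of any library: `MODEL_GUIDE` and `recommend` are hypothetical names introduced here for illustration, and the entries simply mirror the rows of the guide.

```python
# Hypothetical lookup mirroring the Model Selection Guide rows.
# Keys are lower-cased task types; values are (models, typical use case).
MODEL_GUIDE = {
    "speech recognition": (["Whisper", "Moonshine"], "Multilingual transcription"),
    "image understanding": (["SAM", "DINOv2"], "Visual analysis"),
    "multimodal tasks": (["Qwen-VL", "LLaVA", "MiniCPM-o"], "Cross-modal reasoning"),
    "text processing": (["BART", "T5", "Qwen"], "Language tasks"),
    "audio generation": (["MusicGen", "Bark"], "Audio synthesis"),
}


def recommend(task: str) -> list[str]:
    """Return the recommended model names for a task type (case-insensitive)."""
    key = task.strip().lower()
    if key not in MODEL_GUIDE:
        known = ", ".join(sorted(MODEL_GUIDE))
        raise KeyError(f"Unknown task type {task!r}; expected one of: {known}")
    models, _use_case = MODEL_GUIDE[key]
    return list(models)  # copy so callers cannot mutate the guide


if __name__ == "__main__":
    for task in ("Speech Recognition", "Audio Generation"):
        print(task, "->", ", ".join(recommend(task)))
```

The lookup only returns names; loading a model and checking its license or hardware requirements is left to the caller.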


Best Practices

Model Selection

Performance Optimization