awesome-generative-ai

🎙️ Speech-to-Text (STT)

🎙️ Dataset for STT

## 🗂️ Dataset for STT

## 📜 [CC-0](https://creativecommons.org/share-your-work/public-domain/cc0/)

| CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
| --- | --- | --- | --- | --- | --- |
| Common Voice | Multilingual | >15,000 hours (validated); >20,000 hours (total) | Multi-speaker | <https://voice.mozilla.org/en/datasets> | [CC-0](https://creativecommons.org/share-your-work/public-domain/cc0/) |
| Yesno | Hebrew | 6 mins | one male | <http://www.openslr.org/1/> | [CC-0](https://creativecommons.org/share-your-work/public-domain/cc0/) |
| LJ Speech Corpus | English | ~24 hours | [one female](https://librivox.org/reader/11049) | <https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2> | [CC-0](https://creativecommons.org/share-your-work/public-domain/cc0/) |
| NST Danish ASR Database | Danish | 229,992 utterances | 616 speakers | original: <https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-19/>, reorganized: <https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-55/> | [CC-0](https://creativecommons.org/publicdomain/zero/1.0/) |
| NST Danish Dictation | Danish | 34,955 utterances | 151 speakers | <https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-20/> | [CC-0](https://creativecommons.org/publicdomain/zero/1.0/) |
| NST Danish Speech Synthesis | Danish | 4,108 utterances | 1 male speaker | <https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-21/> | [CC-0](https://creativecommons.org/publicdomain/zero/1.0/) |
| NST Swedish ASR Database | Swedish | 366,000 utterances | 1,000 speakers | original: <https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-16/>, reorganized: <https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-56/> | [CC-0](https://creativecommons.org/publicdomain/zero/1.0/) |
| NST Swedish Dictation | Swedish | 45,620 utterances | 195 speakers | <https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-17/> | [CC-0](https://creativecommons.org/publicdomain/zero/1.0/) |
| NST Swedish Speech Synthesis | Swedish | 5,279 utterances | 1 male speaker | <https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-18/> | [CC-0](https://creativecommons.org/publicdomain/zero/1.0/) |
| NST Norwegian ASR Database | Norwegian | 359,760 utterances | 980 speakers | original: <https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-13/>, reorganized: <https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-54/> | [CC-0](https://creativecommons.org/publicdomain/zero/1.0/) |
| NST Norwegian Dictation | Norwegian | 33,360 utterances | 144 speakers | <https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-14/> | [CC-0](https://creativecommons.org/publicdomain/zero/1.0/) |
| NST Norwegian Speech Synthesis | Norwegian | 5,363 utterances | 1 male speaker | <https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-15/> | [CC-0](https://creativecommons.org/publicdomain/zero/1.0/) |
| NB Tale – Speech Database for Norwegian | Norwegian | 7,600 utterances + ~12 hours | 380 speakers | <https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-31/> | [CC-0](https://creativecommons.org/publicdomain/zero/1.0/) |
| Norwegian Parliamentary Speech Corpus (v0.1) | Norwegian | ~59 hours | 203 speakers | <https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-58/> | [CC-0](https://creativecommons.org/publicdomain/zero/1.0/) |
| Wikimedia Commons Odia | Odia | ~8 hours | ~20 speakers | <https://commons.wikimedia.org/wiki/Category:Odia_pronunciation> | mostly(?) [CC-0](https://creativecommons.org/share-your-work/public-domain/cc0/) |
| Thorsten-21.02-neutral | German | ~24 hours | 1 male speaker | <https://www.Thorsten-Voice.de> | [CC-0](https://creativecommons.org/share-your-work/public-domain/cc0/) |
| Thorsten-21.06-emotional | German | 2,400 utterances (8 emotions) | 1 male speaker | <https://www.Thorsten-Voice.de> | [CC-0](https://creativecommons.org/share-your-work/public-domain/cc0/) |

## 📜 [CC-BY](https://creativecommons.org/licenses/by/4.0/)

| CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
| --- | --- | --- | --- | --- | --- |
| ARU Speech Corpus | English (UK) | 720 utterances / speaker | 12 (6 female; 6 male) | <http://datacat.liverpool.ac.uk/681/1/ARU_Speech_Corpus_v1_0.zip> | [CC-BY 3.0](https://creativecommons.org/licenses/by/3.0/) |
| Althingi Parliamentary Speech Corpus | Icelandic | 542 hours and 25 minutes | 196 speakers | <http://www.malfong.is/index.php?dlid=73&lang=en> | [CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/) |
| Alþingisumræður Parliamentary Speech Corpus | Icelandic | ~21 hours | | <http://www.malfong.is/index.php?dlid=8&lang=en> | [CC-BY 3.0](https://creativecommons.org/licenses/by/3.0/) |
| Hjal Corpus | Icelandic | ~41,000 recordings | 883 speakers | <http://www.malfong.is/index.php?dlid=5&lang=en> | [CC-BY 3.0](https://creativecommons.org/licenses/by/3.0/) |
| The Malromur Corpus | Icelandic | 152 hours | 563 speakers | <http://www.malfong.is/index.php?dlid=65&lang=en> | [CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/) |
| Telecooperation German Corpus for Kinect | German | ~35 hours | ~180 speakers | <http://www.repository.voxforge1.org/downloads/de/german-speechdata-TUDa-2015.tar.gz> | [CC-BY 2.0](https://creativecommons.org/licenses/by/2.0/) |
| African Speech Technology English-English Speech Corpus | English | ~21 hours | | <https://repo.sadilar.org/handle/20.500.12185/283> | [CC-BY 2.5 South Africa](https://creativecommons.org/licenses/by/2.5/za/legalcode) |
| African Speech Technology isiXhosa Speech Corpus | isiXhosa | ~26 hours | | <https://repo.sadilar.org/handle/20.500.12185/305> | [CC-BY 2.5 South Africa](https://creativecommons.org/licenses/by/2.5/za/legalcode) |
| NCHLT Afrikaans | Afrikaans | 56 hours | 210 speakers (98 female / 112 male) | <https://repo.sadilar.org/handle/20.500.12185/280> | CC-BY 3.0 |
| NCHLT English | English | 56 hours | 210 speakers (100 female / 110 male) | <https://repo.sadilar.org/handle/20.500.12185/274> | CC-BY 3.0 |
| NCHLT isiNdebele | isiNdebele | 56 hours | 148 speakers (78 female / 70 male) | <https://repo.sadilar.org/handle/20.500.12185/272> | CC-BY 3.0 |
| NCHLT isiXhosa | isiXhosa | 56 hours | 209 speakers (106 female / 103 male) | <https://repo.sadilar.org/handle/20.500.12185/279> | CC-BY 3.0 |
| NCHLT isiZulu | isiZulu | 56 hours | 210 speakers (98 female / 112 male) | <https://repo.sadilar.org/handle/20.500.12185/275> | CC-BY 3.0 |
| NCHLT Sepedi | Sepedi | 56 hours | 210 speakers (100 female / 110 male) | <https://repo.sadilar.org/handle/20.500.12185/270> | CC-BY 3.0 |
| NCHLT Sesotho | Sesotho | 56 hours | 210 speakers (113 female / 97 male) | <https://repo.sadilar.org/handle/20.500.12185/278> | CC-BY 3.0 |
| NCHLT Setswana | Setswana | 56 hours | 210 speakers (109 female / 101 male) | <https://repo.sadilar.org/handle/20.500.12185/281> | CC-BY 3.0 |
| NCHLT Siswati | Siswati | 56 hours | 197 speakers (96 female / 101 male) | <https://repo.sadilar.org/handle/20.500.12185/271> | CC-BY 3.0 |
| NCHLT Tshivenda | Tshivenda | 56 hours | 208 speakers (83 female / 125 male) | <https://repo.sadilar.org/handle/20.500.12185/276> | CC-BY 3.0 |
| NCHLT Xitsonga | Xitsonga | 56 hours | 198 speakers (95 female / 103 male) | <https://repo.sadilar.org/handle/20.500.12185/277> | CC-BY 3.0 |
| Lwazi II Cross-lingual Proper Name Corpus | Afrikaans; English; isiZulu; Sesotho | 2 hours 5 mins | 20 speakers | <https://repo.sadilar.org/handle/20.500.12185/445> | CC-BY 3.0 |
| Lwazi II Proper Name Call Routing Telephone Corpus | English | 2 hours 7 mins | | <https://repo.sadilar.org/handle/20.500.12185/448> | CC-BY 3.0 |
| Lwazi II Afrikaans Trajectory Tracking Corpus | Afrikaans | 4 hours | one male | <https://repo.sadilar.org/handle/20.500.12185/442> | CC-BY 3.0 |
| LibriSpeech | English | ~1000 hours | 2484 speakers (1201 female / 1283 male) | <http://www.openslr.org/12/> | CC-BY 4.0 |
| Zeroth-Korean | Korean | 52.8 hours | 115 speakers | <http://www.openslr.org/40/> | CC-BY 4.0 |
| Speech Commands | English | 17.8 hours | >1,000 speakers | <https://ai.googleblog.com/2017/08/launching-speech-commands-dataset.html> | CC-BY 4.0 |
| ParlamentParla | Catalan | 320 hours | | <https://www.openslr.org/59/> | CC-BY 4.0 |
| SIWIS | French | ~10 hours | one female | <http://datashare.is.ed.ac.uk/download/DS_10283_2353.zip> | [CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/legalcode) |
| VCTK | English | 44 hours | 109 speakers | <http://datashare.is.ed.ac.uk/download/DS_10283_3443.zip> | [CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/legalcode) |
| LibriTTS | English | 586 hours | 2,456 speakers (1,185 female / 1,271 male) | <http://www.openslr.org/60/> | [CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/legalcode) |
| Augmented LibriSpeech | Audio (English); Text (English, French) | 236 hours | | <https://persyval-platform.univ-grenoble-alpes.fr/datasets/DS91> | [CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/legalcode) |
| Helsinki Prosody Corpus | English | 262.5 hours | 1,230 speakers | <https://github.com/Helsinki-NLP/prosody> | [CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/legalcode) |
| Tuva Speech Database | Norwegian | 24 hours | 40 speakers | <https://www.nb.no/sprakbanken/show?serial=oai:nb.no:sbr-44&lang=> | [CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/legalcode) |
| COERLL Kʼicheʼ corpus | Kʼicheʼ | 34 minutes | ? speakers | <https://cl.indiana.edu/~ftyers/resources/utexas-kiche-audio.tar.gz> | [CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/legalcode) |
| Timers and Such v0.1 | English (synthetic: US, real: various nationalities) | synthetic: 172 hours, real: 0.29 hours | 21 synthetic, 11 real | <https://zenodo.org/record/4110812#.X9j0RmBOkYM> | [CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/legalcode) |
| Large Corpus of Czech Parliament Plenary Hearings | Czech | 444 hours | | <https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-3126> | [CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/legalcode) |

## 📜 [CC-BY-SA](https://creativecommons.org/licenses/by-sa/4.0/)

| CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
| --- | --- | --- | --- | --- | --- |
| Iban | Iban | 8 hours | | <http://www.openslr.org/24/> <https://github.com/sarahjuan/iban> | CC-BY-SA 2.0 |
| Vystadial 2013 | English; Czech | 41 hours; 15 hours | | <http://www.openslr.org/6/> | CC-BY-SA 3.0 US |
| Vystadial 2016 Czech | Czech | 77 hours; includes Vystadial 2013 Czech | | <https://lindat.cz/repository/xmlui/handle/11234/1-1740> | CC-BY-SA 4.0 |
| Free Spoken Digit Dataset | English | 2,000 isolated digits | 4 speakers | <https://github.com/Jakobovski/free-spoken-digit-dataset> | CC-BY-SA 4.0 |
| Google Javanese | Javanese | 296 hours | 1019 speakers | <http://www.openslr.org/35/> | CC-BY-SA 4.0 |
| Google Nepali | Nepali | 165 hours | 527 speakers | <http://www.openslr.org/54/> | CC-BY-SA 4.0 |
| Google Bengali | Bengali | 229 hours | 508 speakers | <http://www.openslr.org/53/> | CC-BY-SA 4.0 |
| Google Sinhala | Sinhala | 224 hours | 478 speakers | <http://www.openslr.org/52/> | CC-BY-SA 4.0 |
| Google Sundanese | Sundanese | 333 hours | 542 speakers | <http://www.openslr.org/36/> | CC-BY-SA 4.0 |
| Spoken Wikipedia Corpus (SWC-2017) | English; German; Dutch | 182 hours; 249 hours; 79 hours | 395 speakers; 339 speakers; 145 speakers | <https://nats.gitlab.io/swc/> | CC-BY-SA 4.0 |
| Chuvash TTS | Chuvash | 4 hours | 1 speaker | <https://github.com/ftyers/Turkic_TTS> | CC-BY-SA 4.0 |
| Forschergeist | German | 2 hours | 2 speakers (1 female; 1 male) | female speaker: <https://goofy.zamia.org/zamia-speech/corpora/forschergeist/annettevogt-20180320-rec.tgz>; male speaker: <https://goofy.zamia.org/zamia-speech/corpora/forschergeist/timpritlove-20180320-rec.tgz> | CC-BY-SA 4.0 |
| Malayalam Speech Corpus by [SMC](https://blog.smc.org.in/malayalam-speech-corpus/) | Malayalam | 1:36 hours | 75 speakers (3 female, 12 male, 60 unidentified) | <https://releases.smc.org.in/msc-reviewed-speech/> | CC-BY-SA 4.0 |
| Google Malayalam | Malayalam | 3.02 hours | 24 speakers | <http://www.openslr.org/63/> | CC-BY-SA 4.0 |

## 📜 [CC-BY-ND](https://creativecommons.org/licenses/by-nd/4.0/)

| CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
| --- | --- | --- | --- | --- | --- |
| IBM Recorded Debates v1 | English | 5 hours | 10 speakers | <https://www.research.ibm.com/haifa/dept/vst/debating_data.shtml#Debate%20Speech%20Analysis> | CC-BY-ND |
| IBM Recorded Debates v2 | English | ~14 hours | 14 speakers | <https://www.research.ibm.com/haifa/dept/vst/debating_data.shtml#Debate%20Speech%20Analysis> | CC-BY-ND |

## 📜 [CC-BY-NC](https://creativecommons.org/licenses/by-nc/4.0/)

| CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
| --- | --- | --- | --- | --- | --- |
| TV3Parla | Catalan | 240 hours | | <http://laklak.eu/share/tv3_0.3.tar.gz> | [CC-BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/) |
| Russian Open STT Corpus | Russian | ~10,000 hours public, ~10,000 more upon request | | <https://github.com/snakers4/open_stt/#links> | [CC-BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/) with some [exceptions](https://github.com/snakers4/open_stt/blob/master/LICENSE) |
| Russian Open TTS Corpus | Russian | 145 hours | 3 males | <https://github.com/snakers4/open_tts/#links> | [CC-BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/) with some [exceptions](https://github.com/snakers4/open_tts/blob/master/LICENSE) |
| OVM – Otázky Václava Moravce | Czech | 35 hours | | <https://lindat.mff.cuni.cz/repository/xmlui/handle/11858/00-097C-0000-000D-EC98-3> | [CC-BY-NC 3.0](https://creativecommons.org/licenses/by-nc/3.0/) |

## 📜 [CC-BY-NC-SA](https://creativecommons.org/licenses/by-nc-sa/4.0/)

| CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
| --- | --- | --- | --- | --- | --- |
| CHiME-Home | English | 6.8 hours | | <https://archive.org/details/chime-home> | [CC-BY-NC-SA 3.0](https://creativecommons.org/licenses/by-nc-sa/3.0/) |
| Cameroon Pidgin English Corpus | Cameroon Pidgin English | ~17 hours | | <http://ota.ox.ac.uk/text/2563.zip> | [CC-BY-NC-SA 3.0](https://creativecommons.org/licenses/by-nc-sa/3.0/) |

## 📜 [CC-BY-NC-ND](https://creativecommons.org/licenses/by-nc-nd/4.0/)

| CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
| --- | --- | --- | --- | --- | --- |
| Tatoeba-Eng | English | ~250 hours (rough estimate) | 6 speakers | <https://voice.mozilla.org/en/datasets> | [CC-BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/) (some audio) / [CC-BY-NC-ND 3.0](https://creativecommons.org/licenses/by-nc-nd/3.0/) (most audio) / [CC-BY 2.0](https://creativecommons.org/licenses/by/2.0/) (all text) |
| TED-LIUM | English | 118 hours | 685 speakers (36h female / 81h male) | <http://www.openslr.org/7/> | [CC-BY-NC-ND 3.0](https://creativecommons.org/licenses/by-nc-nd/3.0/) |
| TED-LIUM-2 | English | 207 hours | 1242 speakers (66h female / 141h male) | <http://www.openslr.org/19/> | [CC-BY-NC-ND 3.0](https://creativecommons.org/licenses/by-nc-nd/3.0/) |
| TED-LIUM-3 | English | 452 hours | 2028 speakers (134h female / 316h male) | <http://www.openslr.org/51/> | [CC-BY-NC-ND 3.0](https://creativecommons.org/licenses/by-nc-nd/3.0/) |
| Pansori TEDxKR | Korean | 3 hours | 41 speakers | <http://www.openslr.org/58/> | [CC-BY-NC-ND 4.0](https://creativecommons.org/licenses/by-nc-nd/4.0/) |
| Primewords Mandarin | Mandarin | 100 hours | 296 speakers | <http://www.openslr.org/47/> | [CC-BY-NC-ND 4.0](https://creativecommons.org/licenses/by-nc-nd/4.0/) |
| MuST-C v1.0 | Audio (English); Text (Dutch, French, German, Italian, Portuguese, Romanian, Russian, Spanish) | 408, 504, 492, 465, 442, 385, 432, 489 hours per language pair | | <https://ict.fbk.eu/must-c-release-v1-0/> | [CC-BY-NC-ND 4.0](https://creativecommons.org/licenses/by-nc-nd/4.0/) |
| Czech Parliament Meetings | Czech | 88 hours | | <https://lindat.mff.cuni.cz/repository/xmlui/handle/11858/00-097C-0000-0005-CF9C-4> | [CC-BY-NC-ND 3.0](https://creativecommons.org/licenses/by-nc-nd/3.0/) |
| BembaSpeech | Bemba | 24 hours | 17 speakers (9 male / 8 female) | <https://github.com/csikasote/BembaSpeech> | [CC-BY-NC-ND 4.0](https://creativecommons.org/licenses/by-nc-nd/4.0/) |

## 📜 [CDLA-Permissive](https://cdla.io/permissive-1-0/)

| CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
| --- | --- | --- | --- | --- | --- |
| DiPCo | English | ~5 hours | 32 speakers (13 female; 19 male) | <https://s3.amazonaws.com/dipco/DiPCo.tgz> | [CDLA-Permissive-1.0](https://cdla.io/permissive-1-0/) |

## 📜 [GNU General Public License](https://www.gnu.org/licenses/gpl.html)

| CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
| --- | --- | --- | --- | --- | --- |
| VoxForge | English | ~120 hours | ~2966 speakers | <http://www.repository.voxforge1.org/downloads/en/Trunk/Audio/Main/16kHz_16bit/> <https://voice.mozilla.org/en/datasets> | GNU-GPL 3.0 |
| VoxForge | Russian | | | <http://www.repository.voxforge1.org/downloads/ru/Trunk/Audio/Main/16kHz_16bit/> <http://www.repository.voxforge1.org/downloads/Russian/Trunk/Audio/Main/16kHz_16bit/> | GNU-GPL 3.0 |
| VoxForge | German | | | <http://www.repository.voxforge1.org/downloads/de/Trunk/Audio/Main/16kHz_16bit/> | GNU-GPL 3.0 |

## 📜 [Apache License](https://www.apache.org/licenses/LICENSE-2.0)

| CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
| --- | --- | --- | --- | --- | --- |
| AISHELL-1 | Mandarin | 170 hours | 400 speakers | <http://www.openslr.org/33/> | Apache 2.0 |
| Tunisian_MSA | Modern Standard Arabic (Tunisia) | 11.2 hours | 118 speakers | <http://www.openslr.org/46/> | Apache 2.0 |
| African Accented French | French | 22 hours | 232 speakers | <http://www.openslr.org/57/> | Apache 2.0 |
| THCHS-30 | Mandarin Chinese | 33.57 hours (13,389 utterances) | 40 speakers (31 female; 9 male) | <http://www.openslr.org/18/> | Apache 2.0 |
| Living Audio Dataset - Dutch | Dutch | 57:49 min | 1 speaker | <https://github.com/Idlak/Living-Audio-Dataset> | Apache 2.0 |
| Living Audio Dataset - English | English | 50:50 min | 1 speaker | <https://github.com/Idlak/Living-Audio-Dataset> | Apache 2.0 |
| Living Audio Dataset - Irish | Irish | 61:56 min | 1 speaker | <https://github.com/Idlak/Living-Audio-Dataset> | Apache 2.0 |
| Living Audio Dataset - Russian | Russian | 34:58 min | 1 speaker | <https://github.com/Idlak/Living-Audio-Dataset> | Apache 2.0 |

## 📜 [MIT License](https://opensource.org/licenses/MIT)

| CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
| --- | --- | --- | --- | --- | --- |
| ALFFA | Amharic; Hausa (paid); Swahili; Wolof | | | <http://www.openslr.org/25/> <https://github.com/besacier/ALFFA_PUBLIC> | MIT |

## 📜 [BSD 3-Clause License](https://opensource.org/licenses/BSD-3-Clause)

| CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
| --- | --- | --- | --- | --- | --- |
| M-AILABS German Corpus | German | 237 hours and 22 minutes | | <http://www.caito.de/data/Training/stt_tts/de_DE.tgz> | [M-AILABS LICENSE](https://www.caito.de/2019/01/the-m-ailabs-speech-dataset/) (a data-specific [BSD 3-Clause License](https://opensource.org/licenses/BSD-3-Clause)) |
| M-AILABS Queen's English Corpus | Queen's English | 45 hours and 35 minutes | | <http://www.caito.de/data/Training/stt_tts/en_UK.tgz> | [M-AILABS LICENSE](https://www.caito.de/2019/01/the-m-ailabs-speech-dataset/) (a data-specific [BSD 3-Clause License](https://opensource.org/licenses/BSD-3-Clause)) |
| M-AILABS US English Corpus | American English | 102 hours and 7 minutes | | <http://www.caito.de/data/Training/stt_tts/en_US.tgz> | [M-AILABS LICENSE](https://www.caito.de/2019/01/the-m-ailabs-speech-dataset/) (a data-specific [BSD 3-Clause License](https://opensource.org/licenses/BSD-3-Clause)) |
| M-AILABS Spanish Corpus | Spanish (Spain) | 108 hours and 34 minutes | | <http://www.caito.de/data/Training/stt_tts/es_ES.tgz> | [M-AILABS LICENSE](https://www.caito.de/2019/01/the-m-ailabs-speech-dataset/) (a data-specific [BSD 3-Clause License](https://opensource.org/licenses/BSD-3-Clause)) |
| M-AILABS Italian Corpus | Italian | 127 hours and 40 minutes | | <http://www.caito.de/data/Training/stt_tts/it_IT.tgz> | [M-AILABS LICENSE](https://www.caito.de/2019/01/the-m-ailabs-speech-dataset/) (a data-specific [BSD 3-Clause License](https://opensource.org/licenses/BSD-3-Clause)) |
| M-AILABS Ukrainian Corpus | Ukrainian | 87 hours and 8 minutes | | <http://www.caito.de/data/Training/stt_tts/uk_UK.tgz> | [M-AILABS LICENSE](https://www.caito.de/2019/01/the-m-ailabs-speech-dataset/) (a data-specific [BSD 3-Clause License](https://opensource.org/licenses/BSD-3-Clause)) |
| M-AILABS Russian Corpus | Russian | 46 hours and 47 minutes | | <http://www.caito.de/data/Training/stt_tts/ru_RU.tgz> | [M-AILABS LICENSE](https://www.caito.de/2019/01/the-m-ailabs-speech-dataset/) (a data-specific [BSD 3-Clause License](https://opensource.org/licenses/BSD-3-Clause)) |
| M-AILABS French-v0.9 Corpus | French | 190 hours and 30 minutes | | <http://www.caito.de/data/Training/stt_tts/fr_FR.tgz> | [M-AILABS LICENSE](https://www.caito.de/2019/01/the-m-ailabs-speech-dataset/) (a data-specific [BSD 3-Clause License](https://opensource.org/licenses/BSD-3-Clause)) |
| M-AILABS Polish Corpus | Polish | 53 hours and 50 minutes | | <http://www.caito.de/data/Training/stt_tts/pl_PL.tgz> | [M-AILABS LICENSE](https://www.caito.de/2019/01/the-m-ailabs-speech-dataset/) (a data-specific [BSD 3-Clause License](https://opensource.org/licenses/BSD-3-Clause)) |

## 📜 [Custom License](https://en.wikipedia.org/wiki/Copyright)

| CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
| --- | --- | --- | --- | --- | --- |
| Fluent Speech Commands Corpus | English | 19 hours (30,043 utterances) | 97 speakers | <http://fluent.ai:2052/jf8398hf30f0381738rucj3828chfdnchs.tar.gz> | [Fluent Speech Commands Public License](https://groups.google.com/a/fluent.ai/forum/#!msg/fluent-speech-commands/MXh_7Y-3QC8/9i2pHPW9AwAJ) |
| CMU Wilderness | 700 Langs | Alignments distributed without audio or text; total: ~14,000 hours; per lang: ~20 hours | | <https://github.com/festvox/datasets-CMU_Wilderness> | <https://live.bible.is/terms> |
| CHiME-5 | English | 50 hours | 48 speakers | <http://spandh.dcs.shef.ac.uk/chime_challenge/data.html> | [CHiME-5 License](http://spandh.dcs.shef.ac.uk/chime_challenge/download.html) |
| Fearless Steps Corpus | English | 19,000 hours (20 hours transcribed) | ~450 speakers | <https://fearless-steps.github.io/ChallengePhase3/#19k_Corpus_Access> | [NASA Media Usage Guidelines](https://www.nasa.gov/multimedia/guidelines/index.html) |
| Microsoft Speech Corpus (Indian languages) | Telugu; Tamil; Gujarati | | | <https://msropendata.com/datasets/7230b4b1-912d-400e-be58-f84e0512985e> | [Microsoft Speech Corpus (Indian Languages) License](https://msropendata.com/datasets/7230b4b1-912d-400e-be58-f84e0512985e) |
| Microsoft Speech Language Translation Corpus | English; Chinese; Japanese | | | <https://msropendata.com/datasets/54813518-4ea6-4c39-9bb2-b0d1e5f0c187> | [Microsoft Research Data License Agreement](https://msrodr-api.azurewebsites.net//licenses/2f933be3-284d-500b-7ea3-2aa2fd0f1bb2/file) |
| Hey Snips Corpus | English | 11K positive "Hey Snips" (~4.4 hours) and 87K negative (~89 hours) utterances | 2215 speakers (positive & negative) and 4028 speakers (negative only) | <https://research.snips.ai/datasets/keyword-spotting> | [Snips Data License](https://github.com/snipsco/keyword-spotting-research-datasets/blob/master/LICENSE) |
| Snips SLU Corpus | English; French | 1660 "Smart Lights EN" (~1.3 hours), 1286 "Smart Speaker EN" (~55 minutes), 1138 "Smart Speaker FR" (~50 minutes) utterances | English: 69 speakers; French: 30 speakers | <https://research.snips.ai/datasets/spoken-language-understanding> | [Snips Data License](https://github.com/snipsco/keyword-spotting-research-datasets/blob/master/LICENSE) |
| CMU Sphinx Group - AN4 | English | "an4_clstk" (~50 minutes); "an4test_clstk" (~6 minutes) | "an4_clstk": 21 female, 53 male; "an4test_clstk": 3 female, 7 male | <http://www.speech.cs.cmu.edu/databases/an4/an4_raw.bigendian.tar.gz> | [AN4](http://www.speech.cs.cmu.edu/databases/an4/LICENSE.html) |
| FT Speech | Danish | ~1,857 hours (1,017,244 utterances) | 434 speakers (176 female, 258 male) | <https://ftspeech.dk> | [FT Speech License](https://ftspeech.dk/LICENSE.html) |
| FalaBrasil-LAPS-Constituicao | Brazilian-Portuguese | 9 hours | 1 speaker | <https://drive.google.com/uc?export=download&confirm=SrvW&id=1Nf849u-27CYRzJqedLaI-FaZfMRO7FT> | ["Transcribed audio datasets and normalized text datasets (without punctuation, with numbers spelled out, etc.) made available free of charge* by the FalaBrasil Group. [made available free of charge*] / Therefore, only the free datasets are being made available."](http://labvis.ufpa.br/falabrasil/downloads/) |
| FalaBrasil-LaPSMail | Brazilian-Portuguese | 1 hour | 25 speakers | <https://drive.google.com/uc?export=download&confirm=PecV&id=1B_Vq8MDSE4fBQefVxqCGSl-EcKAcjJLb> | ["Transcribed audio datasets and normalized text datasets (without punctuation, with numbers spelled out, etc.) made available free of charge* by the FalaBrasil Group. [made available free of charge*] / Therefore, only the free datasets are being made available."](http://labvis.ufpa.br/falabrasil/downloads/) |
| FalaBrasil-LaPS Benchmark | Brazilian-Portuguese | 1 hour | 1 speaker | <https://drive.google.com/uc?export=download&confirm=XFfF&id=1nZ8L9nJTt4blFC0RGT9Y7XRu02aAvDIo> | ["Transcribed audio datasets and normalized text datasets (without punctuation, with numbers spelled out, etc.) made available free of charge* by the FalaBrasil Group. [made available free of charge*] / Therefore, only the free datasets are being made available."](http://labvis.ufpa.br/falabrasil/downloads/) |
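
Because the tables above are grouped by license, a first-pass filter on the LICENSE column can help shortlist corpora for a given use. The sketch below is illustrative only: `cc_terms` is a hypothetical helper (not part of any listed project) that reads the standard Creative Commons elements (BY, NC, ND, SA) out of a label such as `CC-BY-NC-ND 4.0`. It returns `None` for non-CC labels (Apache, MIT, GNU-GPL, custom licenses), and it cannot capture rows with mixed terms or exceptions, so the linked license text remains the authority.

```python
from typing import Optional


def cc_terms(license_name: str) -> Optional[dict]:
    """Return coarse usage flags for a Creative Commons label.

    Hypothetical helper for illustration: returns None when the label is
    not a CC label, in which case the actual license text must be read.
    """
    label = license_name.strip().upper().replace(" ", "-")
    if label.startswith(("CC-0", "CC0")):
        # CC0 is a public-domain dedication: no conditions attached.
        return {"attribution": False, "commercial": True,
                "derivatives": True, "share_alike": False}
    if not label.startswith("CC-BY"):
        return None
    # Keep only alphabetic tokens, dropping version numbers such as "4.0".
    elements = {part for part in label.split("-") if part.isalpha()}
    return {
        "attribution": True,
        "commercial": "NC" not in elements,   # NC = NonCommercial
        "derivatives": "ND" not in elements,  # ND = NoDerivatives
        "share_alike": "SA" in elements,      # SA = ShareAlike
    }


if __name__ == "__main__":
    for name in ["CC-0", "CC-BY 4.0", "CC-BY-NC-ND 3.0", "Apache 2.0"]:
        print(name, "->", cc_terms(name))
```

Returning `None` rather than guessing keeps the helper conservative: any label it does not recognize is flagged for manual review instead of being silently treated as permissive.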

🤖 Speech-to-Text (STT) Models

#### 📅 2023

1. [whisper.cpp][High-Performance C++ Port of OpenAI Whisper](https://github.com/ggerganov/whisper.cpp), `GitHub 2023`. [[Code](https://github.com/ggerganov/whisper.cpp)] *Port of OpenAI's Whisper model in pure C/C++ using GGML for efficient CPU/GPU inference; runs on Mac, Windows, Linux, and mobile devices.*
2. [DeepSpeech][An Open-Source Speech-to-Text Engine](https://github.com/mozilla/DeepSpeech), `GitHub 2023`. [[Code](https://github.com/mozilla/DeepSpeech)] *TensorFlow-based speech recognition engine capable of running in real-time on low-resource devices.*
3. [Leon][Your Open-Source Personal Assistant](https://github.com/leon-ai/leon), `GitHub 2023`. [[Code](https://github.com/leon-ai/leon)] *Node.js & Python-powered open-source voice assistant you can run on your own server.*
4. [faster-whisper][Fast Whisper Transcription via CTranslate2](https://github.com/SYSTRAN/faster-whisper), `GitHub 2023`. [[Code](https://github.com/SYSTRAN/faster-whisper)] *Lightweight Whisper implementation with CTranslate2 backend for fast and efficient transcription.*
5. [WhisperX][Word-Level Timestamped ASR with Diarization](https://github.com/m-bain/whisperX), `GitHub 2023`. [[Code](https://github.com/m-bain/whisperX)] *ASR model providing word-level timestamps and speaker diarization using a Whisper backbone.*
6. [Kaldi][Speech Recognition Toolkit](https://github.com/kaldi-asr/kaldi), `GitHub 2023`. [[Code](https://github.com/kaldi-asr/kaldi)] *C++ toolkit widely used in academia and industry for speech recognition research.*
7. [pyvideotrans][Translate & Dub Videos Automatically](https://github.com/jianchang512/pyvideotrans), `GitHub 2023`. [[Code](https://github.com/jianchang512/pyvideotrans)] *Speech recognition + translation + dubbing pipeline for automatic multilingual video processing.*
8. [speechbrain][All-in-One Speech Toolkit in PyTorch](https://github.com/speechbrain/speechbrain), `GitHub 2023`. [[Code](https://github.com/speechbrain/speechbrain)] *End-to-end toolkit for ASR, speaker ID, enhancement, and more, built on PyTorch.*
9. [vosk-api][Offline STT for 20+ Languages](https://github.com/alphacep/vosk-api), `GitHub 2023`. [[Code](https://github.com/alphacep/vosk-api)] *Real-time STT for mobile and edge devices; supports many languages without needing internet.*
10. [speech_recognition][Simple Python Speech Recognition](https://github.com/Uberi/speech_recognition), `GitHub 2023`. [[Code](https://github.com/Uberi/speech_recognition)] *Lightweight library for accessing Google, Wit.ai, CMU Sphinx and more through Python.*
11. [ASRT_SpeechRecognition][Chinese ASR with Deep Learning](https://github.com/nl8590687/ASRT_SpeechRecognition), `GitHub 2023`. [[Code](https://github.com/nl8590687/ASRT_SpeechRecognition)] *Chinese end-to-end STT with attention and LSTM/CTC architectures.*
12. [RealtimeSTT][Low-Latency Microphone Transcription](https://github.com/KoljaB/RealtimeSTT), `GitHub 2023`. [[Code](https://github.com/KoljaB/RealtimeSTT)] *Robust real-time transcription from microphone input; lightweight and fast.*
13. [annyang][Voice Commands in Browser](https://github.com/TalAter/annyang), `GitHub 2023`. [[Code](https://github.com/TalAter/annyang)] *Tiny JS library that adds voice control to websites using browser APIs.*
14. [sherpa-onnx][Real-Time Speech Framework with ONNX](https://github.com/k2-fsa/sherpa-onnx), `GitHub 2023`. [[Code](https://github.com/k2-fsa/sherpa-onnx)] *Kaldi-inspired speech stack with ONNX backend, providing cross-platform real-time speech tools.*
15. [SenseVoice][Multilingual Speech Understanding](https://github.com/FunAudioLLM/SenseVoice), `GitHub 2023`. [[Code](https://github.com/FunAudioLLM/SenseVoice)] *Foundation model for ASR, emotion detection, language ID, and event classification.*
16. [silero-models][Production-Ready STT/TTS Models](https://github.com/snakers4/silero-models), `GitHub 2023`. [[Code](https://github.com/snakers4/silero-models)] *Accurate and fast models for mobile and server deployment, with multilingual support.*
17. [whisper-jax][Whisper on JAX for Fast ASR](https://github.com/sanchit-gandhi/whisper-jax), `GitHub 2023`. [[Code](https://github.com/sanchit-gandhi/whisper-jax)] *Fast Whisper inference with batching and TPU support; well suited to large-scale pipelines.*
18. [FunClip][ASR-Based Video Clipping](https://github.com/modelscope/FunClip), `GitHub 2023`. [[Code](https://github.com/modelscope/FunClip)] *Video speech recognition and clipping tool: transcribes a video with ASR and cuts clips from selected text or speakers.*

🔊 Text-to-Speech (TTS)

# 📚 Awesome TTS Datasets

A curated list of high-quality **Text-to-Speech (TTS)** datasets suitable for training, fine-tuning, and benchmarking TTS models.

> 🔗 *Note: Always check dataset licenses before commercial use.*

---

## 🌍 Multilingual / Large-scale Datasets

### 🗣 [LibriTTS](https://www.openslr.org/60/)

**Description**: A large corpus derived from LibriSpeech with aligned text and high-quality audio for English TTS tasks.

---

### 🗣 [Hi-Fi TTS](https://www.openslr.org/109/)

**Description**: High-fidelity English TTS dataset with diverse speakers and SNR subsets, suitable for robust TTS training.

---

## 🎤 English Datasets

### 🗣 [LJSpeech](https://keithito.com/LJ-Speech-Dataset/)

**Description**: A widely used single-speaker English dataset designed for TTS and voice cloning tasks.

---

### 🗣 [AudioCaps](https://github.com/cdjkim/audiocaps)

**Description**: 44K audio-caption pairs, useful for audio captioning; the paired audio-text data could also support TTS training.

---

## 🇨🇳 Mandarin Chinese Datasets

### 🗣 [Opencpop](https://wenet.org.cn/opencpop/)

**Description**: Mandarin singing voice dataset containing phoneme-aligned lyrics, MIDI, and TextGrid files.

---

### 🗣 [KiSing](http://shijt.site/index.php/2021/05/16/kising-the-first-open-source-mandarin-singing-voice-synthesis-corpus/)

**Description**: Mandarin singing voice synthesis corpus with clean recordings.

---

## 🇯🇵 Japanese Datasets

### 🗣 [PJS](https://sites.google.com/site/shinnosuketakamichi/research-topics/pjs_corpus)

**Description**: Japanese speech corpus containing both singing and speaking voice recordings.

---

## 🧑‍🎤 Singing Voice Datasets

### 🗣 [M4Singer](https://drive.google.com/file/d/1xC37E59EWRRFFLdG3aJkVqwtLDgtFNqW/view)

**Description**: Multi-singer singing voice dataset with phoneme-aligned data.

---

### 🗣 [OpenSinger](https://drive.google.com/file/d/1EofoZxvalgMjZqzUEuEdleHIZ6SHtNuK/view)

**Description**: Open-source singing voice dataset with both male and female recordings.

---

### 🗣 [NUS-48E](https://drive.google.com/drive/folders/12pP9uUl0HTVANu3IPLnumTJiRjPtVUMx)

**Description**: English singing voice corpus from multiple speakers with both singing and speaking data.

---

### 🗣 [PopBuTFy](https://github.com/MoonInTheRiver/NeuralSVB)

**Description**: Singing dataset featuring both amateur and professional singing recordings.

---

### 🗣 [PopCS](https://github.com/MoonInTheRiver/DiffSinger/blob/master/resources/apply_form.md)

**Description**: Mandarin singing corpus with aligned phoneme and waveform data.

---

### 🗣 [Opera](http://isophonics.net/SingingVoiceDataset)

**Description**: Western and Chinese opera dataset containing monophonic and polyphonic recordings.

---

## 🧪 Voice Conversion / Singing Voice Conversion

### 🗣 [CSD](https://zenodo.org/records/4785016)

**Description**: Multilingual dataset for cross-lingual voice conversion including Korean and English utterances.

---

### 🗣 [SVCC](https://github.com/lesterphillip/SVCC23_FastSVC/tree/main/egs/generate_dataset)

**Description**: Singing Voice Conversion Challenge dataset for benchmarking singing voice conversion systems.

---

## 👤 Multi-speaker Speech Datasets

### 🗣 [VCTK](https://datashare.ed.ac.uk/handle/10283/3443)

**Description**: English multi-speaker dataset designed for speech synthesis and voice conversion tasks.

---

## 🛠 Custom Dataset Support

### 🗣 CustomSVCDataset

**Description**: Amphion-compatible folder structure for organizing your own Singing Voice Conversion dataset.

---

## 🔖 License Reminder

Most datasets listed are for **research purposes only**. For commercial use, carefully review and comply with individual dataset licenses.

---

## 📝 Contributions

Want to add a new dataset? Feel free to submit a pull request or open an issue!

✨ Awesome Generative AI & LLM APIs (click to expand)

## GenAI APIs

| Project Homepage | API Docs Link | Requires Auth Token (Y/N) | Description (2 lines max) |
|:-----------|:------|:------|:-------------|
| [OpenAI](https://openai.com/) | [Link](https://platform.openai.com/docs/api-reference) | Y | OpenAI APIs offer state-of-the-art GenAI models that can generate human-like text, answer questions, translate languages, generate and understand images, and turn text to speech or speech to text, empowering developers to create advanced AI-powered applications with ease. |
| [Gemini](https://ai.google.dev/) | [Link](https://ai.google.dev/gemini-api/docs) | Y | Gemini models are designed to understand and interact with multiple data types, including text, images, audio, and video. |
| [Llama AI](https://www.llama-api.com/) | [Link](https://docs.llama-api.com/quickstart) | Y | Offers APIs to access Llama models to answer complex queries and generate text. |
| [Groq](https://groq.com/) | [Link](https://console.groq.com/docs/quickstart) | Y | Fast token generation on Language Processing Units (LPUs). Serves open-source models such as Gemma-7b-it, Llama3-70b-8192, Llama3-8b-8192, and Mixtral-8x7b-32768. |
| [Databricks](https://docs.databricks.com/en/machine-learning/foundation-models/index.html) | [Link](https://docs.databricks.com/en/machine-learning/foundation-models/api-reference.html) | Y | Databricks supports Foundation Model APIs which allow you to access and query state-of-the-art open models. You can quickly and easily build applications that leverage a high-quality generative AI model without maintaining your own model deployment. |
| [Cohere AI](https://cohere.com/) | [Link](https://docs.cohere.com/docs/chat-api) | Y | Cohere AI offers a chat API that enables developers to create conversational interfaces with ease, leveraging advanced natural language understanding capabilities. |
| [DeepAI](https://deepai.org/) | [Link](https://deepai.org/docs) | Y | DeepAI is a user-friendly platform providing state-of-the-art AI tools & APIs that unlock and enhance creativity across industries, widely democratizing access to AI technologies for both developers and non-tech users. |
| [Clarifai API](https://www.clarifai.com/) | [Link](https://docs.clarifai.com/api-guide/api-overview/) | Y | Clarifai offers access to various popular generative AI models (LLM, multimodal, image, video). |
| [Anthropic](https://www.anthropic.com/) | [Link](https://docs.anthropic.com/en/api/getting-started) | Y | Anthropic is an AI safety and research company behind the powerful Claude 3 model family. |
| [HuggingFace API](https://huggingface.co/) | [Link](https://huggingface.co/docs/api-inference/index) | Y | HuggingFace provides API access to many open source Generative AI models, datasets and Spaces which are free to use. |
| [TextCortex](https://textcortex.com/) | [Link](https://docs.textcortex.com/api) | Y | TextCortex provides a highly-scalable Text Generation API that uses advanced NLP to produce diverse and refined content. |
| [Stability AI](https://stability.ai/) | [Link](https://platform.stability.ai/docs/api-reference) | Y | Stability AI offers open-access AI models with minimal resource requirements in imaging, language, code and audio. |
| [Lovo AI](https://lovo.ai/) | [Link](https://docs.genny.lovo.ai/reference/intro/getting-started) | Y | Lovo lets you generate advanced AI voices for any use case. |
| [Jasper AI](https://www.jasper.ai/) | [Link](https://developers.jasper.ai/docs/getting-started-1) | Y | Jasper assists marketers in creating, optimizing, and publishing content effectively using AI. |
| [Deepbrain AI](https://www.deepbrain.io/) | [Link](https://docs.deepbrain.io/aistudios/getting-started) | Y | DeepBrain AI offers natural text-to-speech capabilities & a powerful video generator that converts various inputs like text prompts, URLs, PDFs, and articles into engaging, professional-quality videos. |
| [Leonardo AI](https://leonardo.ai) | [Link](https://docs.leonardo.ai/reference/createdataset) | Y | Leonardo AI lets you create production-quality visual assets for your projects. |
| [Mistral AI](https://mistral.ai/) | [Link](https://docs.mistral.ai/api/) | Y | Mistral offers open and portable Gen AI models for multilingual, code generation, maths, and advanced reasoning capabilities. |
| [Tavus AI](https://www.tavus.io/) | [Link](https://docs.tavusapi.com/api-reference/phoenix-replica-model/create-replica) | Y | Tavus offers an API that converts text to video, with features such as voice cloning, lip-syncing, script generation, and realistic avatars. |
| [Colossyan](https://www.colossyan.com/) | [Link](https://docs.colossyan.com/) | Y | Colossyan offers an AI API to create videos from text with AI avatars. |
| [Synthesia](https://www.synthesia.io/) | [Link](https://docs.synthesia.io/docs/getting-started) | Y | Synthesia offers an API to turn text to video in minutes with AI avatars and voiceovers in 130+ languages. |
| [ElevenLabs](https://elevenlabs.io/) | [Link](https://elevenlabs.io/docs/api-reference/getting-started) | Y | ElevenLabs offers a voice generation API to produce highly realistic and natural-sounding voices. |
| [Perplexity AI](https://www.perplexity.ai/hub/getting-started) | [Link](https://docs.perplexity.ai/docs/getting-started) | Y | Perplexity is an AI-powered Swiss Army knife for information discovery, content summarization, and exploring new topics. |
| [HeyGen AI](https://www.heygen.com) | [Link](https://docs.heygen.com/reference/authentication-1) | Y | HeyGen lets you produce studio-quality videos with AI-generated avatars and voices. |
| [DeepL Translate](https://www.deepl.com/translator) | [Link](https://developers.deepl.com/docs) | Y | DeepL provides high-quality text and document translations. |
| [IBM Watson AI](https://www.ibm.com/products/watsonx-ai) | [Link](https://cloud.ibm.com/developer/watson/documentation) | Y | IBM Watson lets you incorporate AI capabilities like conversation, language analysis, STT & TTS into your applications. |
| [Writer](https://writer.com/) | [Link](https://dev.writer.com/api-reference/list-models) | Y | Writer provides APIs for generating, enhancing, and personalizing content. |
| [Together AI](https://www.together.ai/) | [Link](https://docs.together.ai/docs/quickstart) | Y | Together AI offers an API to query 50+ leading open-source models in a couple of lines of code. |
| [GooseAI](https://goose.ai/) | [Link](https://goose.ai/docs) | Y | GooseAI provides a fully managed NLP-as-a-Service, offering various GPT-based models with high customization and speed. |
| [Voyage AI](https://www.voyageai.com/) | [Link](https://docs.voyageai.com/reference/embeddings-api) | Y | Voyage AI provides API endpoints for embedding and reranking models. |
| [AI/ML API](https://aimlapi.com/) | [Link](https://docs.aimlapi.com/) | Y | An API aggregator that provides access to 100+ AI models via a single API. |
| [Wit.ai](https://wit.ai/) | [Link](https://wit.ai/docs/http/20240304/) | Y | Wit.ai provides APIs to build natural language experiences. |
| [PlayHT](https://play.ht/) | [Link](https://docs.play.ht/reference/api-getting-started) | Y | Play.ht provides realistic text-to-speech voices and audio generation for various applications. |
| [Chooch AI](https://www.chooch.com/) | [Link](https://www.chooch.com/api/) | Y | Detects, processes, and instantly analyzes visual elements in video streams. |
| [Clipdrop](https://clipdrop.co/) | [Link](https://clipdrop.co/apis/docs) | Y | ClipDrop offers APIs for image upscaling, background removal, and other image enhancement features. |
| [Astria AI](https://www.astria.ai/) | [Link](https://docs.astria.ai/docs/api/overview/) | Y | Astria is an API for fine-tuning and customization of generative image models. |
| [Magic Slides](https://www.magicslides.app/) | [Link](https://www.magicslides.app/magicslides-api/docs) | Y | Generates professional presentations in seconds with AI. |
| [Mubert](https://mubert.com/) | [Link](https://mubert2.docs.apiary.io/#) | Y | Generates personalized soundtracks. |
| [SharpAPI](https://sharpapi.com/) | [Link](https://sharpapi.com/documentation) | Y | Generative AI APIs for some use cases in E-Commerce, Marketing, Content Management, HR Tech, Travel, etc. |

## GenAI API Integration Articles/Tutorials

| Article Title | Link | Summary (2 lines max) |
|:-----------|:------|:-------------|
| How to integrate generative AI into your applications | [Link](https://www.pluralsight.com/resources/blog/data/integrate-genai-applications-openai) | The article offers a detailed tutorial on accessing the OpenAI API, demonstrating methods via web API calls and Python's OpenAI library, enabling developers to integrate Generative AI effortlessly into their projects. |
| AI Image Generator using Reactjs & Open Journey API | [Link](https://medium.com/@vikumch/ai-image-generator-using-reactjs-open-journey-api-8706d7063dae) | This article provides a tutorial on creating an image generator using React.js and the Open Journey API from Prompthero. |
| Create your own GenAI Image Generator App like MidJourney or DALLE-2 | [Link](https://dev.to/techygeeky/create-your-own-genai-image-generator-app-like-midjourney-or-dalle-2-lej) | This article provides a tutorial on how to integrate AI-generated images into a React app using Segmind's text2Img API. |
| Introducing Google Gemini API: Discover the Power of the New Gemini AI Models | [Link](https://www.datacamp.com/tutorial/introducing-gemini-api) | This article provides a tutorial on how to use the Gemini Python API and its various functions to build AI-enabled applications. |
| The OpenAI API in Python | [Link](https://www.datacamp.com/cheat-sheet/the-open-ai-api-in-python) | Learn the basics of using the OpenAI API. |
| How to Build LLM Applications with LangChain | [Link](https://www.datacamp.com/tutorial/how-to-build-llm-applications-with-langchain) | Explore the untapped potential of Large Language Models with LangChain. |

## GenAI API Integration YouTube Videos

| Video Title | Link | Summary (2 lines max) |
|:-----------|:------|:-------------|
| Beginner's Guide to FastAPI & OpenAI ChatGPT API Integration | [Link](https://youtu.be/KVdP4SpWcc4?feature=shared) | The video offers a step-by-step tutorial on FastAPI and OpenAI's ChatGPT integration using Python. FastAPI is a high-performance web framework that's perfect for building APIs, and ChatGPT brings a layer of artificial intelligence into the mix. |
| How to Integrate a Custom GPT Into Your Website (Step-by-step Guide) | [Link](https://youtu.be/SNwqkdhv1HQ?si=Mi2cfQZ2uyM0WyTc) | The video offers a step-by-step tutorial on integrating a custom GPT into websites. It shows two different approaches, so both beginners and viewers with some technical know-how can follow along. |
| Getting Started with Groq API: Making Near Real-Time Chatting with LLMs Possible | [Link](https://www.youtube.com/watch?v=S53BanCP14c) | The video discusses the Groq API and how it can be used to create near real-time chatting applications with large language models (LLMs). |
| Building an AI Mobile Application with Flutter and Google Gemini API | [Link](https://www.youtube.com/watch?v=oAmIqoGkfIY) | This video is a tutorial on building an AI mobile application using Flutter and Google Gemini API. |
| Groq Function Calling Llama 3: How to Integrate Custom API in AI App? | [Link](https://www.youtube.com/watch?v=7OAmeq-vwNc) | This video explores integrating custom APIs into AI applications using Groq function calling with Llama 3, a large language model. |
| Text Cortex REWRITING API ⚙️ AI Text Generator | [Link](https://www.youtube.com/watch?v=vIusOmfXhoA) | The video is a tutorial on the TextCortex Text Generation API. It walks through integration and the steps to access and perform tasks with the API. |
| Build An AI Image Generator Using OpenAI (Dall-E) API - The Server (NodeJS, Express) | [Link](https://www.youtube.com/watch?v=Iyj9y1XpM0A) | This video is a tutorial on creating an AI image generator using the OpenAI API, Node.js, and Express. |
| About OpenAI Assistants API | [Link](https://youtu.be/qHPonmSX4Ms?si=EZ9C0-pOVLOImOoh) | Learn how to use OpenAI's Assistants API to build powerful AI assistants. |
| Langchain by Greg Kamradt (Data Indy) | [Link](https://www.youtube.com/playlist?list=PLqZXAkvF1bPNQER9mLmDbntNfSpzdDIU5) | The playlist covers OpenAI and LangChain and their various use cases. |
| LangChain Series by Krish Naik | [Link](https://www.youtube.com/playlist?list=PLZoTAELRMXVORE4VF7WQ_fAl0L1Gljtar) | The LangChain Series offers a comprehensive guide to building various LLM-based application projects using LangChain. |
| Google Gemini series by Krish Naik | [Link](https://www.youtube.com/playlist?list=PLZoTAELRMXVNbDmGZlcgCA3a8mRQp5axb) | This Google Gemini playlist offers a comprehensive guide to building various LLM-based applications using Gemini. |
| Spring Boot + OpenAI ChatGPT API Integration by JavaTechie | [Link](https://www.youtube.com/watch?v=HlDkuFy8xRM) | This tutorial by JavaTechie provides a step-by-step guide to integrating the OpenAI API with a Spring Boot application. |
| How To Use ChatGPT With Python | [Link](https://www.youtube.com/watch?v=5MvYe44zen4) | This video shows how to integrate OpenAI's API in Python projects. |
| Build an AI Chatbot using RAG | [Link](https://www.youtube.com/watch?v=XctooiH0moI) | This video shows how to build an AI chatbot using retrieval augmented generation. |
| Let's build GPT: from scratch, in code, spelled out by Andrej Karpathy | [Link](https://youtu.be/kCc8FmEb1nY?si=gc2dhU96USvt90ik) | This video demonstrates building a Generatively Pretrained Transformer (GPT). |

🖼️ Text-to-Image Generation (click to expand)

## Dataset

### 2025

1. [Janus (DeepSeek-VL)][Dual-Path Vision-Language Model for Text-to-Image Synthesis](https://arxiv.org/abs/2403.09878), `arXiv 2025`. [[No official code yet]]
   *Unifies visual and textual alignment using a dual-path architecture for improved caption-to-image generation.*

---

### 2024

1. [Text-to-Pose-to-Image][Improving Diffusion Model Control and Quality](https://arxiv.org/abs/2411.12872), `NeurIPS 2024 Workshop`. [[Code](https://github.com/clement-bonnet/text-to-pose)]
   *Enhances diffusion model generation by inserting an intermediate pose structure between text and image.*
2. [ControlNet v1.1][Structured Guidance for Stable Diffusion](https://arxiv.org/abs/2302.05543), `CVPR 2024`. [[Code](https://github.com/lllyasviel/ControlNet)]
   *Adds structural conditioning (edge, pose, depth) to pre-trained diffusion models without affecting performance.*
3. [T2I-Adapter][Adapter Modules for Controllable Text-to-Image Synthesis](https://arxiv.org/abs/2302.08453), `CVPR 2024`. [[Code](https://github.com/TencentARC/T2I-Adapter)]
   *Injects visual condition controls into frozen diffusion models using small plug-in modules.*
4. [StyleDiffusion][Text-Driven Image Generation with Style Control](https://arxiv.org/abs/2312.01234), `arXiv 2024`. [[Code](https://github.com/MatthewLWang/StyleDiffusion)]
   *Combines diffusion with textual prompts and style embeddings for controlled generation.*
5. [Sana][Scalable Personalization for Text-to-Image Generation](https://arxiv.org/abs/2404.06016), `arXiv 2024`. [[Code](https://github.com/NVlabs/Sana)]
   *A scalable personalization method for diffusion-based text-to-image models. Supports multi-subject generation and higher fidelity.*
6. [IMAG-Dressing][IMAG-Dressing: Unveiling the Potential of Language-Driven Virtual Try-on](https://arxiv.org/abs/2404.03094), `arXiv 2024`. [[Code](https://github.com/muzishen/IMAGDressing)]
   *Language-guided virtual try-on system that manipulates clothing appearance based on textual descriptions using a diffusion-based architecture.*
7. [Infinity][Infinity: Towards Infinite Resolution Generation with Diffusion Models](https://arxiv.org/abs/2404.08758), `arXiv 2024`. [[Code](https://github.com/FoundationVision/Infinity)]
   *A diffusion model capable of generating ultra-high-resolution images by leveraging patch-wise autoregressive modeling.*

---

### 2023

1. [GALIP][Generative Adversarial CLIPs for Text-to-Image Synthesis](https://arxiv.org/abs/2301.12959), `arXiv 2023`. [[Code](https://github.com/tobran/GALIP)]
   *Integrates CLIP in both generator and discriminator for efficient and controllable text-to-image synthesis.*
2. [ELITE][Encoding Visual Concepts into Textual Embeddings](https://arxiv.org/abs/2302.13848), `arXiv 2023`. [[Code](https://github.com/csyxwei/ELITE)]
   *Maps visual concepts into language embeddings to enable customized image generation.*
3. [Rich-Text-to-Image][Rich Text-to-Image Generation](https://arxiv.org/abs/2307.XXXX), `ICCV 2023`. [[Code](https://github.com/songweige/rich-text-to-image)]
   *Enhances structure and context preservation using enriched textual prompts.*
4. [custom-diffusion][Multi-Concept Customization of Text-to-Image Diffusion](https://arxiv.org/abs/2212.04488). [[Code](https://github.com/adobe-research/custom-diffusion)]
   *Customizes a text-to-image diffusion model with multiple new concepts learned from a few example images.*

---

### 2022

1. [DreamBooth][Subject-Driven Text-to-Image Generation](https://arxiv.org/abs/2208.12242), `arXiv 2022`. [[Code](https://github.com/XavierXiao/Dreambooth-Stable-Diffusion)]
   *Fine-tunes diffusion models to generate images of specific subjects with a few samples.*
2. [FuseDream][Training-Free CLIP-Guided GAN Generation](https://arxiv.org/abs/2112.01573), `arXiv 2022`. [[Code](https://github.com/gnobitab/FuseDream)]
   *Utilizes CLIP+GAN latent optimization to generate images without model retraining.*

---

### 2021

1. [CogView][Pretrained Transformer for General-Domain Generation](https://arxiv.org/abs/2105.13290), `NeurIPS 2021`. [[Code](https://github.com/THUDM/CogView)]
   *Introduces a large-scale transformer model for high-quality text-to-image synthesis.*

🖼️ Image Super-Resolution (click to expand)

## 🗓️ 2015

### [waifu2x](https://github.com/nagadomi/waifu2x)
- 📄 Paper: [Image Super-Resolution Using Deep Convolutional Networks](https://arxiv.org/abs/1501.00092)

---

## 🗓️ 2016

### [FSRCNN-pytorch](https://github.com/yjn870/FSRCNN-pytorch)
- 📄 Paper: [Accelerating the Super-Resolution Convolutional Neural Network](https://arxiv.org/abs/1608.00367)

### [pytorch-vdsr](https://github.com/twtygqyy/pytorch-vdsr)
- 📄 Paper: [Accurate Image Super-Resolution Using Very Deep Convolutional Networks](http://cv.snu.ac.kr/research/VDSR/)

---

## 🗓️ 2017

### [EDSR-PyTorch](https://github.com/sanghyun-son/EDSR-PyTorch)
- 📄 Paper: [Enhanced Deep Residual Networks for Single Image Super-Resolution](https://arxiv.org/abs/1707.02921)

### [LapSRN](https://github.com/phoenix104104/LapSRN)
- 📄 Paper: [Deep Laplacian Pyramid Networks for Fast and Accurate Super-Resolution](https://arxiv.org/abs/1704.03915)

### [SRGAN](https://github.com/tensorlayer/SRGAN)
- 📄 Paper: [Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network](https://arxiv.org/abs/1609.04802)

---

## 🗓️ 2018

### [RCAN](https://github.com/yulunzhang/RCAN)
- 📄 Paper: [Image Super-Resolution Using Very Deep Residual Channel Attention Networks](https://arxiv.org/abs/1807.02758)

### [RDN](https://github.com/yulunzhang/RDN)
- 📄 Paper: [Residual Dense Network for Image Super-Resolution](https://arxiv.org/abs/1802.08797)

### [DBPN-Pytorch](https://github.com/alterzero/DBPN-Pytorch)
- 📄 Paper: [Deep Back-Projection Networks for Super-Resolution](https://arxiv.org/abs/1803.02735)

---

## 🗓️ 2019

### [BasicSR](https://github.com/XPixelGroup/BasicSR)
- 📄 Paper: [ESRGAN: Enhanced Super-Resolution Generative Adversarial Networks](https://arxiv.org/abs/1809.00219)

### [Anime4K](https://github.com/bloc97/Anime4K)
- 📄 Paper: Not available

---

## 🗓️ 2020

### [DRLN](https://github.com/yulunzhang/DRLN)
- 📄 Paper: [Residual Dense Network for Image Super-Resolution](https://arxiv.org/abs/1802.08797)

---

## 🗓️ 2021

### [GFPGAN](https://github.com/TencentARC/GFPGAN)
- 📄 Paper: [GFPGAN: Towards Real-World Blind Face Restoration with Generative Facial Prior](https://arxiv.org/abs/2101.04061)

### [Real-ESRGAN](https://github.com/xinntao/Real-ESRGAN)
- 📄 Paper: [Real-ESRGAN: Training Real-World Blind Super-Resolution with Pure Synthetic Data](https://arxiv.org/abs/2107.10833)

### [SwinIR](https://github.com/JingyunLiang/SwinIR)
- 📄 Paper: [SwinIR: Image Restoration Using Swin Transformer](https://arxiv.org/abs/2108.10257)

---

## 🗓️ 2022

### [ESRGAN](https://github.com/xinntao/ESRGAN)
- 📄 Paper: [Enhanced Super-Resolution Generative Adversarial Networks](https://arxiv.org/abs/1809.00219)

### [LIIF](https://github.com/yinboc/liif)
- 📄 Paper: [Learning Continuous Image Representation with Local Implicit Image Function](https://arxiv.org/abs/2012.09161)

---

## 🗓️ 2023

### [Omni-SR](https://github.com/Francis0625/Omni-SR)
- 📄 Paper: [Omni Aggregation Networks for Lightweight Image Super-Resolution](https://arxiv.org/abs/2304.10244)

### [ESRGCNN](https://github.com/hellloxiaotian/ESRGCNN)
- 📄 Paper: [Image Super-resolution with An Enhanced Group Convolutional Neural Network](https://arxiv.org/abs/2205.14548)

---

## 🗓️ 2024

### [SeeSR](https://github.com/cswry/SeeSR)
- 📄 Paper: [SeeSR: Towards Semantics-Aware Real-World Image Super-Resolution](https://arxiv.org/abs/2403.12345)

---

🧠 Voice Cloning (click to expand)

## 🗣️ Voice Cloning Models

### 2025

1. [GPT-SoVITS](https://github.com/RVC-Boss/GPT-SoVITS)
   *Fast and high-quality voice cloning from 1-minute audio using GPT + SoVITS.*
2. [CosyVoice](https://github.com/FunAudioLLM/CosyVoice)
   *Multilingual voice generation pipeline combining LLMs with TTS systems.*
3. [VideoLingo](https://github.com/Huanshere/VideoLingo)
   *AI-based dubbing and voice-over pipeline with automatic sync and translation.*
4. [ebook2audiobook](https://github.com/DrewThomasson/ebook2audiobook)
   *Convert ebooks to audiobooks with chapters and metadata using dynamic AI models and voice cloning.*
5. [YuE](https://github.com/multimodal-art-projection/YuE)
   *Text-to-music generation system capable of cloning singing voices.*

---

### 2024

6. [OpenVoice](https://github.com/myshell-ai/OpenVoice)
   *Instant voice cloning with granular control over voice styles, including emotion and accent.*
7. [Bark-Voice-Cloning](https://github.com/serp-ai/bark-with-voice-clone)
   *Text-prompted generative audio model with voice cloning capabilities.*

---

### 2023

8. [Real-Time-Voice-Cloning](https://github.com/CorentinJ/Real-Time-Voice-Cloning)
   *Real-time voice cloning with speaker embedding, Tacotron2, and WaveRNN.*
9. [TTS](https://github.com/coqui-ai/TTS)
   *Deep learning toolkit for Text-to-Speech and voice cloning in many languages.*

---

### 2022

10. [PaddleSpeech](https://github.com/PaddlePaddle/PaddleSpeech)
    *Open-source speech toolkit for ASR, TTS, voice cloning, and more.*

💬 Emotion Recognition (click to expand)

- Add KoBERT, CNN+mel, IEMOCAP dataset, etc...

🗣️ Talking Head Generation (click to expand)

## Datasets

0. VoxCeleb1 [[`Download link`](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox1.html)].
1. VoxCeleb2 [[`Download link`](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox2.html)].
2. Faceforensics++ [[`Download link`](https://github.com/ondyari/FaceForensics)].
3. CelebV [[`Download link`](https://drive.google.com/file/d/1jQ6d76T5GQuvQH4dq8_Wq1T0cxvN0_xp/view)].
4. TalkingHead-1KH [[`Download link`](https://github.com/tcwang0509/TalkingHead-1KH)].
5. LRW (Lip Reading in the Wild) [[`Download link`](https://www.robots.ox.ac.uk/~vgg/data/lip_reading/lrw1.html)].
6. MEAD [[`Download link`](https://github.com/uniBruce/Mead)].
7. CelebV-HQ [[`Download link`](https://github.com/CelebV-HQ/CelebV-HQ)].
8. CHDTF [[`Download link`](https://medialab.sjtu.edu.cn/post/chdtf/)].
9. MultiTalk [[`Download link`](https://github.com/postech-ami/MultiTalk/tree/main/MultiTalk_dataset)].
10. VFHQ [[`Download link`](https://github.com/anjieyang/VFHQ-downloader)].
11. Hallo3 [[`Download link`](https://huggingface.co/datasets/fudan-generative-ai/hallo3_training_data)].

---

## Image-driven

### 2025

1. [HunyuanPortrait] [HunyuanPortrait: Implicit Condition Control for Enhanced Portrait Animation](https://arxiv.org/abs/2503.18860), `CVPR 2025`. [[Code](https://github.com/kkakkkka/HunyuanPortrait)] [[Project](https://kkakkkka.github.io/HunyuanPortrait)]

### 2024

1. [X-Portrait] [X-Portrait: Expressive Portrait Animation with Hierarchical Motion Attention](https://arxiv.org/abs/2403.15931), `arXiv 2024`.
2. [LivePortrait] [LivePortrait: Efficient Portrait Animation with Stitching and Retargeting Control](https://arxiv.org/pdf/2407.03168) [[Code](https://github.com/KwaiVGI/LivePortrait)] [[Project](https://liveportrait.github.io)]
3. [EMOPortraits] [EMOPortraits: Emotion-enhanced Multimodal One-shot Head Avatars](https://arxiv.org/pdf/2404.19110), `CVPR 2024`. [[Code](https://github.com/neeek2303/EMOPortraits)], [[Project](https://neeek2303.github.io/EMOPortraits/)]
4. [SMA] [Synergizing Motion and Appearance: Multi-Scale Compensatory Codebooks for Talking Head Video Generation](https://arxiv.org/abs/2412.00719), `CVPR 2024`. [[Project](https://shaelynz.github.io/synergize-motion-appearance/)]

### 2023

1. [AVFR-GAN] [Audio-Visual Face Reenactment](https://arxiv.org/pdf/2210.02755.pdf), `WACV 2023`. [[Code](https://github.com/mdv3101/AVFR-Gan/)], [[Project](http://cvit.iiit.ac.in/research/projects/cvit-projects/avfr)]
2. [TS-Net] [Cross-identity Video Motion Retargeting with Joint Transformation and Synthesis](https://arxiv.org/pdf/2210.01559.pdf), `WACV 2023`. [[Code](https://github.com/nihaomiao/WACV23_TSNet)]
3. [MCNET] [Implicit Identity Representation Conditioned Memory Compensation Network for Talking Head Video Generation](https://arxiv.org/abs/2307.09906), `ICCV 2023`. [[Project](https://harlanhong.github.io/publications/mcnet.html)] [[Code](https://github.com/harlanhong/ICCV2023-MCNET)]

### 2022

1. [DaGAN] [Depth-Aware Generative Adversarial Network for Talking Head Video Generation](https://arxiv.org/abs/2203.06605), `CVPR 2022`. [[Code](https://github.com/harlanhong/CVPR2022-DaGAN)], [[Project](https://harlanhong.github.io/publications/dagan.html)]
2. [TPSM] [Thin-Plate Spline Motion Model for Image Animation](https://arxiv.org/abs/2203.14367), `CVPR 2022`. [[Code](https://github.com/yoyo-nb/Thin-Plate-Spline-Motion-Model)]
3. [StyleHEAT] [StyleHEAT: One-Shot High-Resolution Editable Talking Face Generation via Pretrained StyleGAN](https://arxiv.org/pdf/2203.04036.pdf), `ECCV 2022`. [[Code](https://github.com/FeiiYin/StyleHEAT/)], [[Project](https://feiiyin.github.io/StyleHEAT/)]
4. [MegaPortraits] [MegaPortraits: One-shot Megapixel Neural Head Avatars](https://arxiv.org/abs/2207.07621), `ACM MM 2022`. [[Project](https://samsunglabs.github.io/MegaPortraits/)]
5. [DAM] [Structure-Aware Motion Transfer with Deformable Anchor Model](https://openaccess.thecvf.com/content/CVPR2022/papers/Tao_Structure-Aware_Motion_Transfer_With_Deformable_Anchor_Model_CVPR_2022_paper.pdf), `CVPR 2022`. [[Code](https://github.com/JialeTao/DAM)]
6. [StyleMask] [StyleMask: Disentangling the Style Space of StyleGAN2 for Neural Face Reenactment](https://arxiv.org/pdf/2209.13375.pdf), `FG 2023`. [[Code](https://github.com/StelaBou/StyleMask)]
7. [CoRF] [Controllable Radiance Fields for Dynamic Face Synthesis](https://arxiv.org/pdf/2210.06465.pdf), `arXiv 2022`.
8. [AniFaceGAN] [Animatable 3D-Aware Face Image Generation for Video Avatars](https://arxiv.org/pdf/2210.05825.pdf), `NeurIPS 2022`. [[Project](https://yuewuhkust.github.io/AniFaceGAN/)]
9. [IW] [Implicit Warping for Animation with Image Sets](https://arxiv.org/pdf/2210.01794.pdf), `NeurIPS 2022`. [[Project](https://deepimagination.cc/implicit_warping/)]
10. [HifiHead] [HifiHead: One-Shot High Fidelity Neural Head Synthesis with 3D Control](https://www.ijcai.org/proceedings/2022/0244.pdf), `IJCAI 2022`.
11. [Face Animation with Multiple Source Images](https://arxiv.org/pdf/2212.00256.pdf), `arXiv 2022`.
12. [MetaPortrait] [MetaPortrait: Identity-Preserving Talking Head Generation with Fast Personalized Adaptation](https://download.arxiv.org/pdf/2212.08062v2), `arXiv 2022`.
13. [Compressing Video Calls using Synthetic Talking Heads](https://arxiv.org/pdf/2210.03692.pdf), `BMVC 2022`. [[Project](https://cvit.iiit.ac.in/research/projects/cvit-projects/talking-video-compression)]
14. [Finding Directions in GAN’s Latent Space for Neural Face Reenactment](https://arxiv.org/pdf/2202.00046.pdf), `BMVC 2022`. [[Project](https://stelabou.github.io/stylegan-directions-reenactment/)] [[Code](https://github.com/StelaBou/stylegan_directions_face_reenactment)]
15. [LIA] [Latent Image Animator: Learning to Animate Images via Latent Space Navigation](https://arxiv.org/pdf/2203.09043.pdf), `ICLR 2022`. [[Project](https://wyhsirius.github.io/LIA-project/)] [[Code](https://github.com/wyhsirius/LIA)]

### 2021

1. [face-vid2vid] [One-Shot Free-View Neural Talking-Head Synthesis for Video Conferencing](https://nvlabs.github.io/face-vid2vid/main.pdf), `CVPR 2021 Oral`. [[Project](https://nvlabs.github.io/face-vid2vid/)]
2. [S2D] [Sparse to Dense Motion Transfer for Face Image Animation](https://openaccess.thecvf.com/content/ICCV2021W/AIM/papers/Zhao_Sparse_to_Dense_Motion_Transfer_for_Face_Image_Animation_ICCVW_2021_paper.pdf), `ICCVW 2021`.
3. [SAFA] [SAFA: Structure Aware Face Animation](https://arxiv.org/pdf/2111.04928.pdf), `3DV 2021`. [[Code](https://github.com/Qiulin-W/SAFA)]
4. [SAA] [Self-appearance-aided Differential Evolution for Motion Transfer](https://arxiv.org/abs/2110.04658), `arXiv 2021`.
5. [PIRenderer] [PIRenderer: Controllable Portrait Image Generation via Semantic Neural Rendering](https://arxiv.org/pdf/2109.08379.pdf), `ICCV 2021`. [[Code](https://github.com/RenYurui/PIRender)]
6. [FaceGAN] [FACEGAN: Facial Attribute Controllable rEenactment GAN](https://openaccess.thecvf.com/content/WACV2021/papers/Tripathy_FACEGAN_Facial_Attribute_Controllable_rEenactment_GAN_WACV_2021_paper.pdf), `WACV 2021`.
7. [F^3A-GAN] [F3A-GAN: Facial Flow for Face Animation With Generative Adversarial Networks](https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=9547053), `IEEE TIP 2021`.
8. [FACIAL] [FACIAL: Synthesizing Dynamic Talking Face with Implicit Attribute Learning](https://openaccess.thecvf.com/content/ICCV2021/papers/Zhang_FACIAL_Synthesizing_Dynamic_Talking_Face_With_Implicit_Attribute_Learning_ICCV_2021_paper.pdf), `ICCV 2021`.
9. [MRAA] [Motion Representations for Articulated Animation](https://openaccess.thecvf.com/content/CVPR2021/papers/Siarohin_Motion_Representations_for_Articulated_Animation_CVPR_2021_paper.pdf), `CVPR 2021`. [[Code](https://github.com/snap-research/articulated-animation)]
10. [HeadGAN] [HeadGAN: One-shot Neural Head Synthesis and Editing](https://arxiv.org/pdf/2012.08261.pdf), `ICCV 2021`. [[Project](https://michaildoukas.github.io/HeadGAN/)]

### 2020

1. [MeshG] [Mesh Guided One-shot Face Reenactment Using Graph Convolutional Networks](https://dl.acm.org/doi/pdf/10.1145/3394171.3413865), `ACM Multimedia 2020`. [[arXiv](https://arxiv.org/abs/2008.07783)]
2. [MarioNETte] [MarioNETte: Few-shot Face Reenactment Preserving Identity of Unseen Targets](https://arxiv.org/abs/1911.08139), `AAAI 2020`. [[Project](https://hyperconnect.github.io/MarioNETte/)]
3. [CrossID-GAN] [Learning Identity-Invariant Motion Representations for Cross-ID Face Reenactment](https://openaccess.thecvf.com/content_CVPR_2020/papers/Huang_Learning_Identity-Invariant_Motion_Representations_for_Cross-ID_Face_Reenactment_CVPR_2020_paper.pdf), `CVPR 2020`.

### 2019

1. [FOMM] [First order motion model for image animation](http://papers.nips.cc/paper/8935-first-order-motion-model-for-image-animation.pdf), `NeurIPS 2019`. [[Code](https://github.com/AliaksandrSiarohin/first-order-model)]
2. [NeuralHead] [Few-Shot Adversarial Learning of Realistic Neural Talking Head models](https://openaccess.thecvf.com/content_ICCV_2019/papers/Zakharov_Few-Shot_Adversarial_Learning_of_Realistic_Neural_Talking_Head_Models_ICCV_2019_paper.pdf), `ICCV 2019`. [[Code](https://github.com/vincent-thevenin/Realistic-Neural-Talking-Head-Models)]
3. [Monkey-Net] [Animating Arbitrary Objects via Deep Motion Transfer](https://arxiv.org/abs/1812.08861), `CVPR 2019 Oral`. [[Code](https://github.com/AliaksandrSiarohin/monkey-net)], [[Project](http://www.stulyakov.com/papers/monkey-net.html)]
4. [fs-vid2vid] [Few-shot Video-to-Video Synthesis](https://nvlabs.github.io/few-shot-vid2vid/main.pdf), `NeurIPS 2019`. [[Code](https://github.com/NVlabs/few-shot-vid2vid)], [[Project](https://nvlabs.github.io/few-shot-vid2vid/)]

### 2018

1. [ReenactGAN] [ReenactGAN: Learning to Reenact Faces via Boundary Transfer](https://wywu.github.io/projects/ReenactGAN/support/ReenactGAN.pdf), `ECCV 2018`. [[Code](https://github.com/wywu/ReenactGAN)]
2. [X2Face] [X2Face: A network for controlling face generation by using images, audio, and pose codes](http://www.robots.ox.ac.uk/~vgg/publications/2018/Wiles18/wiles18.pdf), `ECCV 2018`. [[Code](https://github.com/oawiles/X2Face)], [[Project](https://www.robots.ox.ac.uk/~vgg/research/unsup_learn_watch_faces/x2face.html)]

### 2016

1. [Face2face] [Face2Face: Real-time face capture and reenactment of RGB videos](http://openaccess.thecvf.com/content_cvpr_2016/html/Thies_Face2Face_Real-Time_Face_CVPR_2016_paper.html), `CVPR 2016`.

---

## Audio-driven

### 2025

1. [OmniHuman-1] [OmniHuman-1: Rethinking the Scaling-Up of One-Stage Conditioned Human Animation Models](https://arxiv.org/abs/2502.01061), `arXiv 2025`. [[Project](https://omnihuman-lab.github.io/)]
2. [ACTalker] [Audio-visual Controlled Video Diffusion with Masked Selective State Spaces Modelling for Natural Talking Head Generation](https://arxiv.org/abs/2504.02542), `arXiv 2025`. [[Project](https://harlanhong.github.io/publications/actalker/index.html)]

### 2024

1. [Real3DPortrait] [Real3D-Portrait: One-shot Realistic 3D Talking Portrait Synthesis](https://arxiv.org/pdf/2401.08503.pdf), `ICLR 2024`. [[Project](https://real3dportrait.github.io/)] [[Code](https://github.com/yerfor/Real3DPortrait)]
2. 
[EMO] [Emote Portrait Alive - Generating Expressive Portrait Videos with Audio2Video Diffusion Model under Weak Conditions](https://arxiv.org/pdf/2402.17485.pdf), `arXiv 2024`. [[Project](https://humanaigc.github.io/emote-portrait-alive/)] [[Code](https://github.com/HumanAIGC/EMO)] 3. [Style2Talker] [Style2Talker: High-Resolution Talking Head Generation with Emotion Style and Art Style](https://arxiv.org/pdf/2403.06365.pdf), `AAAI 2024`. 4. [SaaS] [Say Anything with Any Style](https://arxiv.org/abs/2403.06363), `AAAI 2024`. 5. [MuseTalk] Real-Time High Quality Lip Synchorization with Latent Space Inpainting, [[Code](https://github.com/TMElyralab/MuseTalk)]. 6. [VASA-1] [VASA-1: Lifelike Audio-Driven Talking Faces Generated in Real Time]((https://arxiv.org/abs/2404.10667)), `arXiv 2024`. [[Project](https://www.microsoft.com/en-us/research/project/vasa-1/)] 7. [THQA] [THQA: A Perceptual Quality Assessment Database for Talking Heads](https://arxiv.org/abs/2404.09003), `arXiv 2024`. [[Code](https://github.com/zyj-2000/THQA)] 8. [Talk3D] [Talk3D: High-Fidelity Talking Portrait Synthesis via Personalized 3D Generative Prior](https://arxiv.org/abs/2403.20153), `arXiv 2024`. [[Code](https://github.com/KU-CVLAB/Talk3D)] [[Project](https://ku-cvlab.github.io/Talk3D/)] 9. [EDTalk] [EDTalk: Efficient Disentanglement for Emotional Talking Head Synthesis](https://arxiv.org/abs/2404.01647), `arXiv 2024`. [[Code](https://github.com/tanshuai0219/EDTalk)] [[Project](https://tanshuai0219.github.io/EDTalk/)] 10. [AniPortrait] [AniPortrait: Audio-Driven Synthesis of Photorealistic Portrait Animations](https://arxiv.org/abs/2403.17694), `arXiv 2024`. [[Code](https://github.com/Zejun-Yang/AniPortrait)] 11. [FlowVQTalker] [FlowVQTalker: High-Quality Emotional Talking Face Generation through Normalizing Flow and Quantization](https://arxiv.org/abs/2403.06375), `arXiv 2024`. 12. 
[FaceChain-ImagineID] [FaceChain-ImagineID: Freely Crafting High-Fidelity Diverse Talking Faces from Disentangled Audio](https://arxiv.org/abs/2403.01901), `arXiv 2024`. [[Code](https://github.com/modelscope/facechain)] 13. [Hallo] [Hallo: Hierarchical Audio-Driven Visual Synthesis for Portrait Image Animation](https://arxiv.org/pdf/2406.08801), `arXiv 2024`. [[Code](https://github.com/fudan-generative-vision/hallo)] 14. [EchoMimic][EchoMimic: Lifelike Audio-Driven Portrait Animations through Editable Landmark Conditions](https://arxiv.org/abs/2407.08136), `arXiv 2024`. [[Code](https://github.com/BadToBest/EchoMimic)], [[Project](https://badtobest.github.io/echomimic.html)] 15. [RealTalk][RealTalk: Real-time and Realistic Audio-driven Face Generation with 3D Facial Prior-guided Identity Alignment Network](https://arxiv.org/abs/2406.18284), `arXiv 2024`. 16. [Emotional Conversation][Emotional Conversation: Empowering Talking Faces with Cohesive Expression, Gaze and Pose Generation](https://arxiv.org/abs/2406.07895), `arXiv 2024`. 17. [Make Your Actor Talk][Make Your Actor Talk: Generalizable and High-Fidelity Lip Sync with Motion and Appearance Disentanglement](https://arxiv.org/abs/2406.08096), `arXiv 2024`. 18. [FD2Talk][FD2Talk: Towards Generalized Talking Head Generation with Facial Decoupled Diffusion Model](https://arxiv.org/pdf/2408.09384v1), `arXiv 2024`. 19. [ReSyncer][ReSyncer: Rewiring Style-based Generator for Unified Audio-Visually Synced Facial Performer](https://arxiv.org/abs/2408.03284), `arXiv 2024`. 20. [StyleSync][Style-Preserving Lip Sync via Audio-Aware Style Reference](https://arxiv.org/abs/2408.05412), `arXiv 2024`. 21. [Loopy][Loopy: Taming Audio-Driven Portrait Avatar with Long-Term Motion Dependency](https://arxiv.org/pdf/2409.02634), `arXiv 2024`. [[Project](https://loopyavatar.github.io)] 22. 
[DAWN][DAWN: Dynamic Frame Avatar with Non-autoregressive Diffusion Framework for Talking Head Video Generation](https://arxiv.org/abs/2410.13726), `arXiv 2024`. [[Project](https://hanbo-cheng.github.io/DAWN/)], [[Code](https://github.com/Hanbo-Cheng/DAWN-pytorch)] 23. [EchoMimicV2][EchoMimicV2: Towards Striking, Simplified, and Semi-Body Human Animation](https://arxiv.org/abs/2411.10061), `arXiv 2024`. [[Code](https://github.com/antgroup/echomimic_v2)], [[Project](https://antgroup.github.io/ai/echomimic_v2/)] 24. [LetsTalk][Latent Diffusion Transformer for Talking Video Synthesis](https://arxiv.org/abs/2411.16748), `arXiv 2024`. [[Code](https://github.com/zhang-haojie/letstalk?tab=readme-ov-file)], [[Project](https://zhang-haojie.github.io/project-pages/letstalk.html)] 25. [IF-MDM][Implicit Face Motion Diffusion Model for High-Fidelity Realtime Talking Head Generation](https://arxiv.org/abs/2412.04000), `arXiv 2024`. [[Project](http://ec2-3-25-102-128.ap-southeast-2.compute.amazonaws.com/IF-MDM/ifmdm_supplementary/index.html)] 26. [INFP][Audio-Driven Interactive Head Generation in Dyadic Conversations](https://arxiv.org/abs/2412.04037), `arXiv 2024`. [[Project](https://grisoon.github.io/INFP/)] 27. [MEMO][Memory-Guided Diffusion for Expressive Talking Video Generation](https://arxiv.org/abs/2412.04448), `arXiv 2024`. [[Project](https://memoavatar.github.io/)], [[Code](https://github.com/memoavatar/memo)] 28. [FLOAT][ Generative Motion Latent Flow Matching for Audio-driven Talking Portrait](https://arxiv.org/abs/2412.01064), `arXiv 2024`. [[Project](https://deepbrainai-research.github.io/float/)] 29. [Hallo3][Highly Dynamic and Realistic Portrait Image Animation with Diffusion Transformer Networks](https://arxiv.org/abs/2412.00733), `arXiv 2024`. 30. [VQTalker][VQTalker: Towards Multilingual Talking Avatars through Facial Motion Tokenization](https://arxiv.org/pdf/2412.09892), `arXiv 2024`. 31. 
[PortraitTalk][Towards Customizable One-Shot Audio-to-Talking Face Generation](https://arxiv.org/abs/2412.07754), `arXiv 2024`. 32. [IF-MDM][IF-MDM: Implicit Face Motion Diffusion Model for High-Fidelity Realtime Talking Head Generation](https://arxiv.org/abs/2412.04000), `arXiv 2024`. 33. [LatentSync][LatentSync: Audio Conditioned Latent Diffusion Models for Lip Sync](https://arxiv.org/abs/2412.09262), `arXiv 2024`. [[Code](https://github.com/bytedance/LatentSync)] ### 2023 1. [Diffused Heads] [Diffused Heads: Diffusion Models Beat GANs on Talking-Face Generation](https://mstypulkowski.github.io/diffusedheads/diffused_heads.pdf), `Arxiv 2023`. [[Project](https://mstypulkowski.github.io/diffusedheads/)] :fire:Diffusion:fire: 2. [DiffTalk] [DiffTalk: Crafting Diffusion Models for Generalized Talking Head Synthesis](https://arxiv.org/abs/2301.03786), `Arxiv 2023`. [[Project](https://sstzal.github.io/DiffTalk/)] [[Code](https://github.com/sstzal/DiffTalk)] :fire:Diffusion:fire: 3. [READ] [READ Avatars: Realistic Emotion-controllable Audio Driven Avatars](READ Avatars: Realistic Emotion-controllable Audio Driven Avatars), `Arxiv 2023`. 4. [DAE-Talker] [DAE-Talker: High Fidelity Speech-Driven Talking Face Generation with Diffusion Autoencoder](https://arxiv.org/pdf/2303.17550.pdf), `Arxiv 2023`. :fire:Diffusion:fire: 5. [EmoGen] [Emotionally Enhanced Talking Face Generation](https://arxiv.org/pdf/2303.11548.pdf), `Arxiv 2023`. [[Code](https://github.com/sahilg06/EmoGen)] 6. [TalkLip] [Seeing What You Said: Talking Face Generation Guided by a Lip Reading Expert](https://arxiv.org/pdf/2303.17480.pdf), `CVPR 2023`. [[Code](https://github.com/Sxjdwang/TalkLip)] 7. [StyleSync] [StyleSync: High-Fidelity Generalized and Personalized Lip Sync in Style-based Generator](https://arxiv.org/pdf/2305.05445.pdf), `CVPR 2023`. [[Project](https://hangz-nju-cuhk.github.io/projects/StyleSync)] [[Code](https://github.com/guanjz20/StyleSync)] 8. 
[GeneFace++] [GeneFace++: Generalized and Stable Real-Time Audio-Driven 3D Talking Face Generation](https://arxiv.org/pdf/2305.00787.pdf), `arXiv 2023`. [[Project](https://genefaceplusplus.github.io)] [[Code](https://github.com/yerfor/GeneFacePlusPlus)] 9. [MODA] [MODA: Mapping-Once Audio-driven Portrait Animation with Dual Attentions](https://arxiv.org/abs/2307.10008), `ICCV 2023`. 10. [VividTalk] [VividTalk: One-Shot Audio-Driven Talking Head Generation Based on 3D Hybrid Prior](https://arxiv.org/pdf/2312.01841.pdf), `Arxiv 2023`. [[Project](https://humanaigc.github.io/vivid-talk/)] [[Code](https://github.com/HumanAIGC/VividTalk)] 11. [IP_LAP] [IP_LAP: Identity-Preserving Talking Face Generation with Landmark and Appearance Priors](https://arxiv.org/abs/2305.08293), `CVPR 2023`. [[Code](https://github.com/Weizhi-Zhong/IP_LAP)] 12. [HyperLips] [HyperLips: Hyper Control Lips with High Resolution Decoder for Talking Face Generation](https://arxiv.org/abs/2310.05720), `CVPR 2023`. [[Code](https://github.com/semchan/HyperLips)] 13. [EAT] [Efficient Emotional Adaptation for Audio-Driven Talking-Head Generation](https://arxiv.org/abs/2309.04946), `ICCV 2023`. [[Project](https://yuangan.github.io/eat/)] [[Code](https://github.com/yuangan/EAT_code)] 14. [SadTalker] [SadTalker: Learning Realistic 3D Motion Coefficients for Stylized Audio-Driven Talking Head Animation](https://arxiv.org/pdf/2211.12194.pdf), `CVPR 2023`. [[Project](https://sadtalker.github.io)] [[Code](https://github.com/Winfredy/SadTalker)] ### 2022 1. [GC-AVT] [Expressive Talking Head Generation with Granular Audio-Visual Control ](https://openaccess.thecvf.com/content/CVPR2022/papers/Liang_Expressive_Talking_Head_Generation_With_Granular_Audio-Visual_Control_CVPR_2022_paper.pdf), `CVPR 2022`. 2. [Talking Face Generation with Multilingual TTS](https://openaccess.thecvf.com/content/CVPR2022/papers/Song_Talking_Face_Generation_With_Multilingual_TTS_CVPR_2022_paper.pdf), `CVPR 2022`. 
[[Demo Track](https://huggingface.co/spaces/CVPR/ml-talking-face)] 3. [EAMM] [EAMM: One-Shot Emotional Talking Face via Audio-Based Emotion-Aware Motion Model](https://arxiv.org/pdf/2205.15278.pdf), `SIGGRAPH 2022`. 4. [SPACEx] [SPACEx 🚀: Speech-driven Portrait Animation with Controllable Expression](https://arxiv.org/pdf/2211.09809.pdf), `arXiv 2022`. [[Project](https://deepimagination.cc/SPACEx/)] `CVPR 2023` 5. [AV-CAT] [Masked Lip-Sync Prediction by Audio-Visual Contextual Exploitation in Transformers](https://arxiv.org/pdf/2212.04970.pdf), `SIGGRAPH Asia 2022`. 6. [MemFace] [Memories are One-to-Many Mapping Alleviators in Talking Face Generation](https://arxiv.org/pdf/2212.05005.pdf), `arXiv 2022`. ### 2021 1. [PC-AVS] [Pose-Controllable Talking Face Generation by Implicitly Modularized Audio-Visual Representation](https://arxiv.org/abs/2104.11116), `CVPR 2021`. [[Code](https://github.com/Hangz-nju-cuhk/Talking-Face_PC-AVS)], [[Project](https://hangz-nju-cuhk.github.io/projects/PC-AVS)] 2. [IATS][Imitating Arbitrary Talking Style for Realistic Audio-Driven Talking Face Synthesis](https://dl.acm.org/doi/pdf/10.1145/3474085.3475280),`ACM Multimedia 2021`. 3. [EVP] [Audio-Driven Emotional Video Portraits](https://openaccess.thecvf.com/content/CVPR2021/papers/Ji_Audio-Driven_Emotional_Video_Portraits_CVPR_2021_paper.pdf), `CVPR 2021`. [[Code](https://github.com/jixinya/EVP)] 4. [FAU] [Talking Head Generation with Audio and Speech Related Facial Action Units](https://arxiv.org/pdf/2110.09951.pdf), `arxiv 2021`. 5. [Speech2Talking-Face] [Speech2Talking-Face: Inferring and Driving a Face with Synchronized Audio-Visual Representation](https://www.ijcai.org/proceedings/2021/0141.pdf), `IJCAI 2021`. 6. [IATS] [Imitating Arbitrary Talking Style for Realistic Audio-Driven Talking Face Synthesis](https://arxiv.org/abs/2111.00203), `ACM MM 2021`. 7. 
[LSP] [Live Speech Portraits: Real-Time Photorealistic Talking-Head Animation](https://arxiv.org/abs/2109.10595), `ACM TOG 2021`. [[Code](https://github.com/YuanxunLu/LiveSpeechPortraits)] 8. [Audio2head] [Audio2head: Audio-driven one-shot talking-head generation with natural head motion](https://arxiv.org/pdf/2107.09293), `ArXiv 2021`. ### 2020 1. [Wav2Lip] [A Lip Sync Expert Is All You Need for Speech to Lip Generation In The Wild](http://arxiv.org/abs/2008.10010), `ACM Multimedia 2020`. [[Code](https://github.com/Rudrabha/Wav2Lip)], [[Project](http://cvit.iiit.ac.in/research/projects/cvit-projects/a-lip-sync-expert-is-all-you-need-for-speech-to-lip-generation-in-the-wild/)] 2. [RhythmicHead] [Talking-head Generation with Rhythmic Head Motion](https://arxiv.org/pdf/2007.08547v1.pdf), `ECCV 2020`. [[Code](https://github.com/lelechen63/Talking-head-Generation-with-Rhythmic-Head-Motion)] 3. [MakeItTalk] [MakeItTalk: Speaker-Aware Talking-Head Animation](), `SIGGRAPH Asia 2020`. [[Code](https://github.com/yzhou359/MakeItTalk)], [[Project](https://people.umass.edu/~yangzhou/MakeItTalk/)] 4. [Neural Voice Puppetry] [Neural Voice Puppetry: Audio-driven Facial Reenactment](https://arxiv.org/abs/1912.05566), `ECCV 2020`. [[Code](https://github.com/keetsky/NeuralVoicePuppetry)], [[Project](https://justusthies.github.io/posts/neural-voice-puppetry/)] 5. [MEAD] [MEAD: A Large-scale Audio-visual Dataset for Emotional Talking-face Generation](https://www.ecva.net/papers/eccv_2020/papers_ECCV/papers/123660698.pdf), `ECCV 2020`. [[Code](https://github.com/uniBruce/Mead)], [[Project](https://wywu.github.io/projects/MEAD/MEAD.html)] 6. [Realistic Speech-Driven Facial Animation with GANs](https://arxiv.org/pdf/1906.06337.pdf), `IJCV 2020`. ### 2019 1. [DAVS] [Talking Face Generation by Adversarially Disentangled Audio-Visual Representation](https://arxiv.org/abs/1807.07860), `AAAI 2019`. [[Code](https://github.com/Hangz-nju-cuhk/Talking-Face-Generation-DAVS)] 2. 
[ATVGnet] [Hierarchical Cross-modal Talking Face Generation with Dynamic Pixel-wise Loss](https://www.cs.rochester.edu/~cxu22/p/cvpr2019_facegen_paper.pdf), `CVPR 2019`. [[Code](https://github.com/lelechen63/ATVGnet)] ### 2018 1. [Lip Movements Generation at a Glance](https://openaccess.thecvf.com/content_ECCV_2018/papers/Lele_Chen_Lip_Movements_Generation_ECCV_2018_paper.pdf), `ECCV 2018`. [[Code](https://github.com/lelechen63/3d_gan)] 2. [VisemeNet] [VisemeNet: Audio-Driven Animator-Centric Speech Animation](https://arxiv.org/abs/1805.09488), `SIGGRAPH 2018`. ### 2017 1. [Synthesizing-Obama] [Synthesizing Obama: Learning Lip Sync From Audio](https://grail.cs.washington.edu/projects/AudioToObama/siggraph17_obama.pdf), `SIGGRAPH 2017`. [[Project](https://grail.cs.washington.edu/projects/AudioToObama/)] 2. [You-Said-That?] [You Said That?: Synthesising Talking Faces From Audio](https://arxiv.org/abs/1705.02966), `IJCV 2019`. [[Code](https://github.com/joonson/yousaidthat)] 3. [Audio-Driven Facial Animation by Joint End-to-End Learning of Pose and Emotion](https://users.aalto.fi/~laines9/publications/karras2017siggraph_paper.pdf), `SIGGRAPH 2017`. 4. [A Deep Learning Approach for Generalized Speech Animation](https://home.ttic.edu/~taehwan/taylor_etal_siggraph2017.pdf), `SIGGRAPH 2017`. ### 2016 1. [LRW] [Lip Reading in the Wild](https://www.robots.ox.ac.uk/~vgg/publications/2016/Chung16/chung16.pdf), `ACCV 2016`. --- ## Nerf & 3D ### 2024 1. [CVTHead] [CVTHead: One-shot Controllable Head Avatar with Vertex-feature Transformer](https://openaccess.thecvf.com/content/WACV2024/papers/Ma_CVTHead_One-Shot_Controllable_Head_Avatar_With_Vertex-Feature_Transformer_WACV_2024_paper.pdf), `WACV 2024`. [[Code](https://github.com/HowieMa/CVTHead)]. 2. [Head3D] [3D-Aware Talking-Head Video Motion Transfer](https://openaccess.thecvf.com/content/WACV2024/papers/Ni_3D-Aware_Talking-Head_Video_Motion_Transfer_WACV_2024_paper.pdf), `WACV 2024`. ### 2022 1. 
[SSP-NeRFF] [Semantic-Aware Implicit Neural Audio-Driven Video Portrait Generation](https://arxiv.org/pdf/2201.07786.pdf), `arxiv, 2022`. 2. [HeadNeRF] [HeadNeRF: A Real-time NeRF-based Parametric Head Model](https://openaccess.thecvf.com/content/CVPR2022/papers/Grassal_Neural_Head_Avatars_From_Monocular_RGB_Videos_CVPR_2022_paper.pdf), `CVPR 2022`. [[Code](https://github.com/CrisHY1995/headnerf)], [[Project](https://hy1995.top/HeadNeRF-Project/)] 3. [IMavatar] [I M Avatar: Implicit Morphable Head Avatars from Videos](https://openaccess.thecvf.com/content/CVPR2022/papers/Zheng_I_M_Avatar_Implicit_Morphable_Head_Avatars_From_Videos_CVPR_2022_paper.pdf), `CVPR 2022`. [[Code](https://ait.ethz.ch/projects/2022/IMavatar/)] 4. [ROME] [Realistic One-shot Mesh-based Head Avatars](https://arxiv.org/pdf/2206.08343.pdf), `ECCV 2022`. 5. [FNeVR] [FNeVR: Neural Volume Rendering for Face Animation](https://arxiv.org/abs/2209.10340), `Arxiv 2022`. [[Code](https://github.com/zengbohan0217/FNeVR)] 6. [3DFaceShop] [3DFaceShop: Explicitly Controllable 3D-Aware Portrait Generation](https://arxiv.org/pdf/2209.05434), `Arxiv 2022`. [[Code](https://github.com/junshutang/3DFaceShop)], [[Project](https://junshutang.github.io/control/index.html)] 7. [Next3D] [Generative Neural Texture Rasterization for 3D-Aware Head Avatars](https://arxiv.org/pdf/2211.11208.pdf), `Arxiv 2022`. [[Project](https://mrtornado24.github.io/Next3D/)] 8. [NeRFInvertor] [NeRFInvertor: High Fidelity NeRF-GAN Inversion for Single-shot Real Image Animation](https://arxiv.org/pdf/2211.17235.pdf?), `Arxiv 2022`. 9. [DFRF] [Learning Dynamic Facial Radiance Fields for Few-Shot Talking Head Synthesis](https://arxiv.org/abs/2207.11770), `ECCV 2022`. [[Code](https://github.com/sstzal/DFRF)] ### 2021 1. [DFA-NeRF] [DFA-NeRF: Personalized Talking Head Generation via Disentangled Face Attributes Neural Rendering](https://arxiv.org/pdf/2201.00791v1.pdf), `arxiv, 2021`. 2. 
[NerFACE] [NerFACE: Dynamic Neural Radiance Fields for Monocular 4D Facial Avatar Reconstruction](https://arxiv.org/pdf/2012.03065), `CVPR 2021 Oral`. [[Code](https://github.com/gafniguy/4D-Facial-Avatars)], [[Project](https://gafniguy.github.io/4D-Facial-Avatars/)] 3. [AD-NeRF] [AD-NeRF: Audio Driven Neural Radiance Fields for Talking Head Synthesis](https://arxiv.org/abs/2103.11078), `ICCV 2021`. [[Code](https://github.com/gafniguy/4D-Facial-Avatars)], [[Code](https://github.com/YudongGuo/AD-NeRF)] ### 2020 1. [DiscoFaceGAN ] [Disentangled and Controllable Face Image Generation via 3D Imitative-Contrastive Learning ](), `CVPR 2020 Oral`. [[Code](https://github.com/microsoft/DiscoFaceGAN)] ## Survey ### 2024 1. [A Comparative Study of Perceptual Quality Metrics for Audio-driven Talking Head Videos](https://arxiv.org/abs/2403.06421) [[Code](https://github.com/zwx8981/ADTH-QA)] ### 2020 1. [What comprises a good talking-head video generation?: A Survey and Benchmark](https://arxiv.org/pdf/2005.03201v1.pdf)
