# Speech-to-Text (STT)
## Dataset for STT
## [CC-0](https://creativecommons.org/share-your-work/public-domain/cc0/)
| CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
| --- | --- | --- | --- | --- | --- |
| Common Voice | Multilingual | >15,000 hours (validated); >20,000 hours (total) | Multi-speaker | <https://voice.mozilla.org/en/datasets> | [CC-0](https://creativecommons.org/share-your-work/public-domain/cc0/) |
| Yesno | Hebrew | 6 mins | one male | <http://www.openslr.org/1/> | [CC-0](https://creativecommons.org/share-your-work/public-domain/cc0/) |
| LJ Speech Corpus | English | ~24 hours | [one female](https://librivox.org/reader/11049) | <https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2> | [CC-0](https://creativecommons.org/share-your-work/public-domain/cc0/) |
| NST Danish ASR Database | Danish | 229,992 utterances | 616 speakers | original: <https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-19/>, reorganized: <https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-55/> | [CC-0](https://creativecommons.org/publicdomain/zero/1.0/) |
| NST Danish Dictation | Danish | 34,955 utterances | 151 speakers | <https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-20/> | [CC-0](https://creativecommons.org/publicdomain/zero/1.0/) |
| NST Danish Speech Synthesis | Danish | 4,108 utterances | 1 male speaker | <https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-21/> | [CC-0](https://creativecommons.org/publicdomain/zero/1.0/) |
| NST Swedish ASR Database | Swedish | 366,000 utterances | 1,000 speakers | original: <https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-16/>, reorganized: <https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-56/> | [CC-0](https://creativecommons.org/publicdomain/zero/1.0/) |
| NST Swedish Dictation | Swedish | 45,620 utterances | 195 speakers | <https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-17/> | [CC-0](https://creativecommons.org/publicdomain/zero/1.0/) |
| NST Swedish Speech Synthesis | Swedish | 5,279 utterances | 1 male speaker | <https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-18/> | [CC-0](https://creativecommons.org/publicdomain/zero/1.0/) |
| NST Norwegian ASR Database | Norwegian | 359,760 utterances | 980 speakers | original: <https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-13/>, reorganized: <https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-54/> | [CC-0](https://creativecommons.org/publicdomain/zero/1.0/) |
| NST Norwegian Dictation | Norwegian | 33,360 utterances | 144 speakers | <https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-14/> | [CC-0](https://creativecommons.org/publicdomain/zero/1.0/) |
| NST Norwegian Speech Synthesis | Norwegian | 5,363 utterances | 1 male speaker | <https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-15/> | [CC-0](https://creativecommons.org/publicdomain/zero/1.0/) |
| NB Tale - Speech Database for Norwegian | Norwegian | 7,600 utterances + ~12 hours | 380 speakers | <https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-31/> | [CC-0](https://creativecommons.org/publicdomain/zero/1.0/) |
| Norwegian Parliamentary Speech Corpus (v0.1) | Norwegian | ~59 hours | 203 speakers | <https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-58/> | [CC-0](https://creativecommons.org/publicdomain/zero/1.0/) |
| Wikimedia Commons Odia | Odia | ~8 hours | ~20 speakers | <https://commons.wikimedia.org/wiki/Category:Odia_pronunciation> | mostly(?) [CC-0](https://creativecommons.org/share-your-work/public-domain/cc0/) |
| Thorsten-21.02-neutral | German | ~24 hours | 1 male speaker | <https://www.Thorsten-Voice.de> | [CC-0](https://creativecommons.org/share-your-work/public-domain/cc0/) |
| Thorsten-21.06-emotional | German | 2,400 utterances (8 emotions) | 1 male speaker | <https://www.Thorsten-Voice.de> | [CC-0](https://creativecommons.org/share-your-work/public-domain/cc0/) |
## [CC-BY](https://creativecommons.org/licenses/by/4.0/)
| CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
| --- | --- | --- | --- | --- | --- |
| ARU Speech Corpus | English (UK) | 720 utterances / speaker | 12 speakers (6 female; 6 male) | <http://datacat.liverpool.ac.uk/681/1/ARU_Speech_Corpus_v1_0.zip> | [CC-BY 3.0](https://creativecommons.org/licenses/by/3.0/) |
| Althingi Parliamentary Speech Corpus | Icelandic | 542 hours and 25 minutes | 196 speakers | <http://www.malfong.is/index.php?dlid=73&lang=en> | [CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/) |
| Alþingisumræður Parliamentary Speech Corpus | Icelandic | ~21 hours | | <http://www.malfong.is/index.php?dlid=8&lang=en> | [CC-BY 3.0](https://creativecommons.org/licenses/by/3.0/) |
| Hjal Corpus | Icelandic | ~41,000 recordings | 883 speakers | <http://www.malfong.is/index.php?dlid=5&lang=en> | [CC-BY 3.0](https://creativecommons.org/licenses/by/3.0/) |
| The Malromur Corpus | Icelandic | 152 hours | 563 speakers | <http://www.malfong.is/index.php?dlid=65&lang=en> | [CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/) |
| Telecooperation German Corpus for Kinect | German | ~35 hours | ~180 speakers | <http://www.repository.voxforge1.org/downloads/de/german-speechdata-TUDa-2015.tar.gz> | [CC-BY 2.0](https://creativecommons.org/licenses/by/2.0/) |
| African Speech Technology English-English Speech Corpus | English | ~21 hours | | <https://repo.sadilar.org/handle/20.500.12185/283> | [CC-BY 2.5 South Africa](https://creativecommons.org/licenses/by/2.5/za/legalcode) |
| African Speech Technology isiXhosa Speech Corpus | isiXhosa | ~26 hours | | <https://repo.sadilar.org/handle/20.500.12185/305> | [CC-BY 2.5 South Africa](https://creativecommons.org/licenses/by/2.5/za/legalcode) |
| NCHLT Afrikaans | Afrikaans | 56 hours | 210 speakers (98 female / 112 male) | <https://repo.sadilar.org/handle/20.500.12185/280> | CC-BY 3.0 |
| NCHLT English | English | 56 hours | 210 speakers (100 female / 110 male) | <https://repo.sadilar.org/handle/20.500.12185/274> | CC-BY 3.0 |
| NCHLT isiNdebele | isiNdebele | 56 hours | 148 speakers (78 female / 70 male) | <https://repo.sadilar.org/handle/20.500.12185/272> | CC-BY 3.0 |
| NCHLT isiXhosa | isiXhosa | 56 hours | 209 speakers (106 female / 103 male) | <https://repo.sadilar.org/handle/20.500.12185/279> | CC-BY 3.0 |
| NCHLT isiZulu | isiZulu | 56 hours | 210 speakers (98 female / 112 male) | <https://repo.sadilar.org/handle/20.500.12185/275> | CC-BY 3.0 |
| NCHLT Sepedi | Sepedi | 56 hours | 210 speakers (100 female / 110 male) | <https://repo.sadilar.org/handle/20.500.12185/270> | CC-BY 3.0 |
| NCHLT Sesotho | Sesotho | 56 hours | 210 speakers (113 female / 97 male) | <https://repo.sadilar.org/handle/20.500.12185/278> | CC-BY 3.0 |
| NCHLT Setswana | Setswana | 56 hours | 210 speakers (109 female / 101 male) | <https://repo.sadilar.org/handle/20.500.12185/281> | CC-BY 3.0 |
| NCHLT Siswati | Siswati | 56 hours | 197 speakers (96 female / 101 male) | <https://repo.sadilar.org/handle/20.500.12185/271> | CC-BY 3.0 |
| NCHLT Tshivenda | Tshivenda | 56 hours | 208 speakers (83 female / 125 male) | <https://repo.sadilar.org/handle/20.500.12185/276> | CC-BY 3.0 |
| NCHLT Xitsonga | Xitsonga | 56 hours | 198 speakers (95 female/103 male) | <https://repo.sadilar.org/handle/20.500.12185/277> | CC-BY 3.0 |
| Lwazi II Cross-lingual Proper Name Corpus | Afrikaans; English; isiZulu; Sesotho | 2 hours 5 mins | 20 speakers | <https://repo.sadilar.org/handle/20.500.12185/445> | CC-BY 3.0 |
| Lwazi II Proper Name Call Routing Telephone Corpus | English | 2 hours 7 mins | | <https://repo.sadilar.org/handle/20.500.12185/448> | CC-BY 3.0 |
| Lwazi II Afrikaans Trajectory Tracking Corpus | Afrikaans | 4 hours | one male | <https://repo.sadilar.org/handle/20.500.12185/442> | CC-BY 3.0 |
| LibriSpeech | English | ~1000 hours | 2484 speakers (1201 female / 1283 male) | <http://www.openslr.org/12/> | CC-BY 4.0 |
| Zeroth-Korean | Korean | 52.8 hours | 115 speakers | <http://www.openslr.org/40/> | CC-BY 4.0 |
| Speech Commands | English | 17.8 hours | >1,000 speakers | <https://ai.googleblog.com/2017/08/launching-speech-commands-dataset.html> | CC-BY 4.0 |
| ParlamentParla | Catalan | 320 hours | | <https://www.openslr.org/59/> | CC-BY 4.0 |
| SIWIS | French | ~10 hours | one female | <http://datashare.is.ed.ac.uk/download/DS_10283_2353.zip> | [CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/legalcode) |
| VCTK | English | 44 hours | 109 speakers | <http://datashare.is.ed.ac.uk/download/DS_10283_3443.zip> | [CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/legalcode) |
| LibriTTS | English | 586 hours | 2,456 speakers (1,185 female / 1,271 male) | <http://www.openslr.org/60/> | [CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/legalcode) |
| Augmented LibriSpeech | Audio (English); Text (English, French) | 236 hours | | <https://persyval-platform.univ-grenoble-alpes.fr/datasets/DS91> | [CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/legalcode) |
| Helsinki Prosody Corpus | English | 262.5 hours | 1,230 speakers | <https://github.com/Helsinki-NLP/prosody> | [CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/legalcode) |
| Tuva Speech Database | Norwegian | 24 hours | 40 speakers | <https://www.nb.no/sprakbanken/show?serial=oai:nb.no:sbr-44&lang=> | [CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/legalcode) |
| COERLL Kʼicheʼ corpus | Kʼicheʼ | 34 minutes | ? speakers | <https://cl.indiana.edu/~ftyers/resources/utexas-kiche-audio.tar.gz> | [CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/legalcode) |
| Timers and Such v0.1 | English (synthetic: US, real: various nationalities) | synthetic: 172 hours, real: 0.29 hours | 21 synthetic, 11 real | <https://zenodo.org/record/4110812#.X9j0RmBOkYM> | [CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/legalcode) |
| Large Corpus of Czech Parliament Plenary Hearings | Czech | 444 hours | | <https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-3126> | [CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/legalcode) |
## [CC-BY-SA](https://creativecommons.org/licenses/by-sa/4.0/)
| CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
| --- | --- | --- | --- | --- | --- |
| Iban | Iban | 8 hours | | <http://www.openslr.org/24/> <https://github.com/sarahjuan/iban> | CC-BY-SA 2.0 |
| Vystadial 2013 | English; Czech | 41 hours; 15 hours | | <http://www.openslr.org/6/> | CC-BY-SA 3.0 US |
| Vystadial 2016 Czech | Czech | 77 hours; includes Vystadial 2013 Czech | | <https://lindat.cz/repository/xmlui/handle/11234/1-1740> | CC-BY-SA 4.0 |
| Free Spoken Digit Dataset | English | 2,000 isolated digits | 4 speakers | <https://github.com/Jakobovski/free-spoken-digit-dataset> | CC-BY-SA 4.0 |
| Google Javanese | Javanese | 296 hours | 1019 speakers | <http://www.openslr.org/35/> | CC-BY-SA 4.0 |
| Google Nepali | Nepali | 165 hours | 527 speakers | <http://www.openslr.org/54/> | CC-BY-SA 4.0 |
| Google Bengali | Bengali | 229 hours | 508 speakers | <http://www.openslr.org/53/> | CC-BY-SA 4.0 |
| Google Sinhala | Sinhala | 224 hours | 478 speakers | <http://www.openslr.org/52/> | CC-BY-SA 4.0 |
| Google Sundanese | Sundanese | 333 hours | 542 speakers | <http://www.openslr.org/36/> | CC-BY-SA 4.0 |
| Spoken Wikipedia Corpus (SWC-2017) | English; German; Dutch | 182 hours; 249 hours; 79 hours | 395 speakers; 339 speakers; 145 speakers | <https://nats.gitlab.io/swc/> | CC-BY-SA 4.0 |
| Chuvash TTS | Chuvash | 4 hours | 1 speaker | <https://github.com/ftyers/Turkic_TTS> | CC-BY-SA 4.0 |
| Forschergeist | German | 2 hours | 2 speakers (1 female; 1 male) | female speaker: <https://goofy.zamia.org/zamia-speech/corpora/forschergeist/annettevogt-20180320-rec.tgz>; male speaker: <https://goofy.zamia.org/zamia-speech/corpora/forschergeist/timpritlove-20180320-rec.tgz> | CC-BY-SA 4.0 |
| Malayalam Speech Corpus by [SMC](https://blog.smc.org.in/malayalam-speech-corpus/) | Malayalam | 1:36 hours | 75 speakers (3 female, 12 male, 60 unidentified) | https://releases.smc.org.in/msc-reviewed-speech/ | CC-BY-SA 4.0 |
| Google Malayalam | Malayalam | 3.02 hours | 24 speakers | <http://www.openslr.org/63/> | CC-BY-SA 4.0 |
## [CC-BY-ND](https://creativecommons.org/licenses/by-nd/4.0/)
| CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
| --- | --- | --- | --- | --- | --- |
| IBM Recorded Debates v1 | English | 5 hours | 10 speakers | <https://www.research.ibm.com/haifa/dept/vst/debating_data.shtml#Debate%20Speech%20Analysis> | CC-BY-ND |
| IBM Recorded Debates v2 | English | ~14 hours | 14 speakers | <https://www.research.ibm.com/haifa/dept/vst/debating_data.shtml#Debate%20Speech%20Analysis> | CC-BY-ND |
## [CC-BY-NC](https://creativecommons.org/licenses/by-nc/4.0/)
| CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
| --- | --- | --- | --- | --- | --- |
| TV3Parla | Catalan | 240 hours | | <http://laklak.eu/share/tv3_0.3.tar.gz> | [CC-BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/) |
| Russian Open STT Corpus | Russian | ~10,000 hours public, ~10,000 more upon request | | <https://github.com/snakers4/open_stt/#links> | [CC-BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/) with some [exceptions](https://github.com/snakers4/open_stt/blob/master/LICENSE) |
| Russian Open TTS Corpus | Russian | 145 hours | 3 males | <https://github.com/snakers4/open_tts/#links> | [CC-BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/) with some [exceptions](https://github.com/snakers4/open_tts/blob/master/LICENSE) |
| OVM - Otázky Václava Moravce | Czech | 35 hours | | <https://lindat.mff.cuni.cz/repository/xmlui/handle/11858/00-097C-0000-000D-EC98-3> | [CC-BY-NC 3.0](https://creativecommons.org/licenses/by-nc/3.0/) |
## [CC-BY-NC-SA](https://creativecommons.org/licenses/by-nc-sa/4.0/)
| CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
| --- | --- | --- | --- | --- | --- |
| CHiME-Home | English | 6.8 hours | | <https://archive.org/details/chime-home> | [CC-BY-NC-SA 3.0](https://creativecommons.org/licenses/by-nc-sa/3.0/) |
| Cameroon Pidgin English Corpus | Cameroon Pidgin English | ~17 hours | | <http://ota.ox.ac.uk/text/2563.zip> | [CC-BY-NC-SA 3.0](https://creativecommons.org/licenses/by-nc-sa/3.0/) |
## [CC-BY-NC-ND](https://creativecommons.org/licenses/by-nc-nd/4.0/)
| CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
| --- | --- | --- | --- | --- | --- |
| Tatoeba-Eng | English | ~250 hours (rough estimate) | 6 speakers | <https://voice.mozilla.org/en/datasets> | [CC-BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/) (some audio) / [CC-BY-NC-ND 3.0](https://creativecommons.org/licenses/by-nc-nd/3.0/) (most audio) / [CC-BY 2.0](https://creativecommons.org/licenses/by/2.0/) (all text) |
| TED-LIUM | English | 118 hours | 685 speakers (36h female / 81h male) | <http://www.openslr.org/7/> | [CC-BY-NC-ND 3.0](https://creativecommons.org/licenses/by-nc-nd/3.0/) |
| TED-LIUM-2 | English | 207 hours | 1242 speakers (66h female / 141h male) | <http://www.openslr.org/19/> | [CC-BY-NC-ND 3.0](https://creativecommons.org/licenses/by-nc-nd/3.0/) |
| TED-LIUM-3 | English | 452 hours | 2028 speakers (134h female / 316h male) | <http://www.openslr.org/51/> | [CC-BY-NC-ND 3.0](https://creativecommons.org/licenses/by-nc-nd/3.0/) |
| Pansori TEDxKR | Korean | 3 hours | 41 speakers | <http://www.openslr.org/58/> | [CC-BY-NC-ND 4.0](https://creativecommons.org/licenses/by-nc-nd/4.0/) |
| Primewords Mandarin | Mandarin | 100 hours | 296 speakers | <http://www.openslr.org/47/> | [CC-BY-NC-ND 4.0](https://creativecommons.org/licenses/by-nc-nd/4.0/)|
| MuST-C v1.0 | Audio (English); Text (Dutch, French, German, Italian, Portuguese, Romanian, Russian, Spanish) | 408, 504, 492, 465, 442, 385, 432, 489 hours per language pair | | <https://ict.fbk.eu/must-c-release-v1-0/> | [CC-BY-NC-ND 4.0](https://creativecommons.org/licenses/by-nc-nd/4.0/) |
| Czech Parliament Meetings | Czech | 88 hours | | <https://lindat.mff.cuni.cz/repository/xmlui/handle/11858/00-097C-0000-0005-CF9C-4> | [CC-BY-NC-ND 3.0](https://creativecommons.org/licenses/by-nc-nd/3.0/) |
| BembaSpeech | Bemba | 24 hours | 17 speakers (9 male / 8 female) | <https://github.com/csikasote/BembaSpeech> | [CC-BY-NC-ND 4.0](https://creativecommons.org/licenses/by-nc-nd/4.0/) |
## [CDLA-Permissive](https://cdla.io/permissive-1-0/)
| CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
| --- | --- | --- | --- | --- | --- |
| DiPCo | English | ~5 hours | 32 speakers (13 female; 19 male) | <https://s3.amazonaws.com/dipco/DiPCo.tgz> | [CDLA-Permissive-1.0](https://cdla.io/permissive-1-0/) |
## [GNU General Public License](https://www.gnu.org/licenses/gpl.html)
| CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
| --- | --- | --- | --- | --- | --- |
| VoxForge | English | ~120 hours | ~2966 speakers | <http://www.repository.voxforge1.org/downloads/en/Trunk/Audio/Main/16kHz_16bit/> <https://voice.mozilla.org/en/datasets> | GNU-GPL 3.0 |
| VoxForge | Russian | | | <http://www.repository.voxforge1.org/downloads/ru/Trunk/Audio/Main/16kHz_16bit/> <http://www.repository.voxforge1.org/downloads/Russian/Trunk/Audio/Main/16kHz_16bit/>| GNU-GPL 3.0 |
| VoxForge | German | | | <http://www.repository.voxforge1.org/downloads/de/Trunk/Audio/Main/16kHz_16bit/> | GNU-GPL 3.0 |
## [Apache License](https://www.apache.org/licenses/LICENSE-2.0)
| CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
| --- | --- | --- | --- | --- | --- |
| AISHELL-1 | Mandarin | 170 hours | 400 speakers | <http://www.openslr.org/33/> | Apache 2.0 |
| Tunisian_MSA | Modern Standard Arabic (Tunisia) | 11.2 hours | 118 speakers | <http://www.openslr.org/46/> | Apache 2.0 |
| African Accented French | French | 22 hours | 232 speakers | <http://www.openslr.org/57/> | Apache 2.0 |
| THCHS-30 | Mandarin Chinese | 33.57 hours (13,389 utterances) | 40 speakers (31 female; 9 male) | <http://www.openslr.org/18/> | Apache 2.0 |
| Living Audio Dataset - Dutch | Dutch | 57:49 min | 1 speaker | <https://github.com/Idlak/Living-Audio-Dataset> | Apache 2.0 |
| Living Audio Dataset - English | English | 50:50 min | 1 speaker | <https://github.com/Idlak/Living-Audio-Dataset> | Apache 2.0 |
| Living Audio Dataset - Irish | Irish | 61:56 min | 1 speaker | <https://github.com/Idlak/Living-Audio-Dataset> | Apache 2.0 |
| Living Audio Dataset - Russian | Russian | 34:58 min | 1 speaker | <https://github.com/Idlak/Living-Audio-Dataset> | Apache 2.0 |
## [MIT License](https://opensource.org/licenses/MIT)
| CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
| --- | --- | --- | --- | --- | --- |
| ALFFA | Amharic; Hausa (paid); Swahili; Wolof | | | <http://www.openslr.org/25/> <https://github.com/besacier/ALFFA_PUBLIC> | MIT |
## [BSD 3-Clause License](https://opensource.org/licenses/BSD-3-Clause)
| CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
| --- | --- | --- | --- | --- | --- |
| M-AILABS German Corpus | German | 237 hours and 22 minutes | | <http://www.caito.de/data/Training/stt_tts/de_DE.tgz> | [M-AILABS LICENSE](https://www.caito.de/2019/01/the-m-ailabs-speech-dataset/) (a data-specific [BSD 3-Clause License](https://opensource.org/licenses/BSD-3-Clause))|
| M-AILABS Queen's English Corpus | Queen's English | 45 hours and 35 minutes | | <http://www.caito.de/data/Training/stt_tts/en_UK.tgz> | [M-AILABS LICENSE](https://www.caito.de/2019/01/the-m-ailabs-speech-dataset/) (a data-specific [BSD 3-Clause License](https://opensource.org/licenses/BSD-3-Clause))|
| M-AILABS US English Corpus | American English | 102 hours and 7 minutes | | <http://www.caito.de/data/Training/stt_tts/en_US.tgz> | [M-AILABS LICENSE](https://www.caito.de/2019/01/the-m-ailabs-speech-dataset/) (a data-specific [BSD 3-Clause License](https://opensource.org/licenses/BSD-3-Clause))|
| M-AILABS Spanish Corpus | Spanish (Spain) | 108 hours and 34 minutes | | <http://www.caito.de/data/Training/stt_tts/es_ES.tgz> | [M-AILABS LICENSE](https://www.caito.de/2019/01/the-m-ailabs-speech-dataset/) (a data-specific [BSD 3-Clause License](https://opensource.org/licenses/BSD-3-Clause)) |
| M-AILABS Italian Corpus | Italian | 127 hours and 40 minutes | | <http://www.caito.de/data/Training/stt_tts/it_IT.tgz> | [M-AILABS LICENSE](https://www.caito.de/2019/01/the-m-ailabs-speech-dataset/) (a data-specific [BSD 3-Clause License](https://opensource.org/licenses/BSD-3-Clause))|
| M-AILABS Ukrainian Corpus | Ukrainian | 87 hours and 8 minutes | | <http://www.caito.de/data/Training/stt_tts/uk_UK.tgz> | [M-AILABS LICENSE](https://www.caito.de/2019/01/the-m-ailabs-speech-dataset/) (a data-specific [BSD 3-Clause License](https://opensource.org/licenses/BSD-3-Clause))|
| M-AILABS Russian Corpus | Russian | 46 hours and 47 minutes | | <http://www.caito.de/data/Training/stt_tts/ru_RU.tgz> | [M-AILABS LICENSE](https://www.caito.de/2019/01/the-m-ailabs-speech-dataset/) (a data-specific [BSD 3-Clause License](https://opensource.org/licenses/BSD-3-Clause))|
| M-AILABS French-v0.9 Corpus | French | 190 hours and 30 minutes | | <http://www.caito.de/data/Training/stt_tts/fr_FR.tgz> | [M-AILABS LICENSE](https://www.caito.de/2019/01/the-m-ailabs-speech-dataset/) (a data-specific [BSD 3-Clause License](https://opensource.org/licenses/BSD-3-Clause))|
| M-AILABS Polish Corpus | Polish | 53 hours and 50 minutes | | <http://www.caito.de/data/Training/stt_tts/pl_PL.tgz> | [M-AILABS LICENSE](https://www.caito.de/2019/01/the-m-ailabs-speech-dataset/) (a data-specific [BSD 3-Clause License](https://opensource.org/licenses/BSD-3-Clause))|
## [Custom License](https://en.wikipedia.org/wiki/Copyright)
| CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
| --- | --- | --- | --- | --- | --- |
| Fluent Speech Commands Corpus | English | 19 hours (30,043 utterances) | 97 speakers | <http://fluent.ai:2052/jf8398hf30f0381738rucj3828chfdnchs.tar.gz> | [Fluent Speech Commands Public License](https://groups.google.com/a/fluent.ai/forum/#!msg/fluent-speech-commands/MXh_7Y-3QC8/9i2pHPW9AwAJ) |
| CMU Wilderness | 700 languages | Alignments distributed without audio or text; total: ~14,000 hours; per language: ~20 hours | | <https://github.com/festvox/datasets-CMU_Wilderness> | <https://live.bible.is/terms> |
| CHiME-5 | English | 50 hours | 48 speakers | <http://spandh.dcs.shef.ac.uk/chime_challenge/data.html> | [CHiME-5 License](http://spandh.dcs.shef.ac.uk/chime_challenge/download.html) |
| Fearless Steps Corpus | English | 19,000 hours (20 hours transcribed) | ~450 speakers | <https://fearless-steps.github.io/ChallengePhase3/#19k_Corpus_Access> | [NASA Media Usage Guidelines](https://www.nasa.gov/multimedia/guidelines/index.html) |
| Microsoft Speech Corpus (Indian languages) | Telugu; Tamil; Gujarati | | | <https://msropendata.com/datasets/7230b4b1-912d-400e-be58-f84e0512985e> | [Microsoft Speech Corpus (Indian Languages) License](https://msropendata.com/datasets/7230b4b1-912d-400e-be58-f84e0512985e) |
| Microsoft Speech Language Translation Corpus | English; Chinese; Japanese| | | <https://msropendata.com/datasets/54813518-4ea6-4c39-9bb2-b0d1e5f0c187> | [Microsoft Research Data License Agreement](https://msrodr-api.azurewebsites.net//licenses/2f933be3-284d-500b-7ea3-2aa2fd0f1bb2/file) |
| Hey Snips Corpus | English | 11K positive "Hey Snips" (~4.4 hours) and 87K negative (~89 hours) utterances | 2215 speakers (positive & negative) and 4028 speakers (negative only) | <https://research.snips.ai/datasets/keyword-spotting> | [Snips Data License](https://github.com/snipsco/keyword-spotting-research-datasets/blob/master/LICENSE) |
| Snips SLU Corpus | English; French | 1660 "Smart Lights EN" (~1.3 hours), 1286 "Smart Speaker EN" (~55 minutes), 1138 "Smart Speaker FR" (~50 minutes) utterances | English: 69 speakers; French: 30 speakers | <https://research.snips.ai/datasets/spoken-language-understanding> | [Snips Data License](https://github.com/snipsco/keyword-spotting-research-datasets/blob/master/LICENSE) |
| CMU Sphinx Group - AN4 | English | "an4_clstk": ~50 minutes; "an4test_clstk": ~6 minutes | "an4_clstk": 21 female, 53 male; "an4test_clstk": 3 female, 7 male | <http://www.speech.cs.cmu.edu/databases/an4/an4_raw.bigendian.tar.gz> | [AN4](http://www.speech.cs.cmu.edu/databases/an4/LICENSE.html) |
| FT Speech | Danish | ~1,857 hours (1,017,244 utterances) | 434 speakers (176 female, 258 male) | <https://ftspeech.dk> | [FT Speech License](https://ftspeech.dk/LICENSE.html) |
| FalaBrasil-LAPS-Constituicao | Brazilian-Portuguese | 9 hours | 1 speaker | <https://drive.google.com/uc?export=download&confirm=SrvW&id=1Nf849u-27CYRzJqedLaI-FaZfMRO7FT> | ["Transcribed audio corpora and normalized text corpora (no punctuation, numbers spelled out, etc.) made available free of charge* by the FalaBrasil Group. [made available free of charge*] / Therefore, only the free corpora are being made available."](http://labvis.ufpa.br/falabrasil/downloads/) |
| FalaBrasil-LaPSMail | Brazilian-Portuguese | 1 hour | 25 speakers | <https://drive.google.com/uc?export=download&confirm=PecV&id=1B_Vq8MDSE4fBQefVxqCGSl-EcKAcjJLb> | ["Transcribed audio corpora and normalized text corpora (no punctuation, numbers spelled out, etc.) made available free of charge* by the FalaBrasil Group. [made available free of charge*] / Therefore, only the free corpora are being made available."](http://labvis.ufpa.br/falabrasil/downloads/) |
| FalaBrasil-LaPS Benchmark | Brazilian-Portuguese | 1 hour | 1 speaker | <https://drive.google.com/uc?export=download&confirm=XFfF&id=1nZ8L9nJTt4blFC0RGT9Y7XRu02aAvDIo> | ["Transcribed audio corpora and normalized text corpora (no punctuation, numbers spelled out, etc.) made available free of charge* by the FalaBrasil Group. [made available free of charge*] / Therefore, only the free corpora are being made available."](http://labvis.ufpa.br/falabrasil/downloads/) |
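
The tables above all share the same six-column layout, so a row can be split mechanically when filtering corpora by size or license. The following is a minimal sketch, assuming rows are copied verbatim from this README; `parse_row` and `parse_hours` are hypothetical helper names for illustration, not part of any library. Cells that report utterance counts or a clock-style duration such as `1:36 hours` deliberately yield `None` rather than a guessed number.

```python
import re

# Matches a number directly followed by "hour"/"hours". The lookbehind skips
# numbers that are the tail of a longer token such as "1:36".
_HOURS = re.compile(r"(?<![\d:.,])(\d[\d.,]*)\s*hours?\b")


def parse_row(line):
    """Split one Markdown table row into a list of stripped cell strings."""
    cells = line.strip().strip("|").split("|")
    return [cell.strip() for cell in cells]


def parse_hours(cell):
    """Return the first hour figure in a cell as a float, or None if absent."""
    match = _HOURS.search(cell)
    if match is None:
        return None
    # Commas are treated as thousands separators, as in ">15,000 hours".
    return float(match.group(1).replace(",", ""))


row = ("| Zeroth-Korean | Korean | 52.8 hours | 115 speakers "
       "| <http://www.openslr.org/40/> | CC-BY 4.0 |")
corpus, languages, hours, speakers, download, license_name = parse_row(row)
print(corpus, parse_hours(hours), license_name)
# prints: Zeroth-Korean 52.8 CC-BY 4.0
```

Only the first hour figure in a cell is read, so a cell listing several languages (for example `41 hours; 15 hours`) needs its own handling.
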
## Speech-to-Text (STT) Models
#### 2023
1. [whisper.cpp][High-Performance C++ Port of OpenAI Whisper](https://github.com/ggerganov/whisper.cpp), `GitHub 2023`. [[Code](https://github.com/ggerganov/whisper.cpp)] *Port of OpenAI's Whisper model in pure C/C++ using GGML for efficient CPU/GPU inference; runs on Mac, Windows, Linux, and mobile devices.*
2. [DeepSpeech][An Open-Source Speech-to-Text Engine](https://github.com/mozilla/DeepSpeech), `GitHub 2023`. [[Code](https://github.com/mozilla/DeepSpeech)] *TensorFlow-based speech recognition engine capable of running in real-time on low-resource devices.*
3. [Leon][Your Open-Source Personal Assistant](https://github.com/leon-ai/leon), `GitHub 2023`. [[Code](https://github.com/leon-ai/leon)] *Node.js & Python-powered open-source voice assistant you can run on your own server.*
4. [faster-whisper][Fast Whisper Transcription via CTranslate2](https://github.com/SYSTRAN/faster-whisper), `GitHub 2023`. [[Code](https://github.com/SYSTRAN/faster-whisper)] *Lightweight Whisper implementation with CTranslate2 backend for fast and efficient transcription.*
5. [WhisperX][Word-Level Timestamped ASR with Diarization](https://github.com/m-bain/whisperX), `GitHub 2023`. [[Code](https://github.com/m-bain/whisperX)] *ASR model providing word-level timestamps and speaker diarization using Whisper backbone.*
6. [Kaldi][Speech Recognition Toolkit](https://github.com/kaldi-asr/kaldi), `GitHub 2023`. [[Code](https://github.com/kaldi-asr/kaldi)] *C++ toolkit widely used in academia and industry for speech recognition research.*
7. [pyvideotrans][Translate & Dub Videos Automatically](https://github.com/jianchang512/pyvideotrans), `GitHub 2023`. [[Code](https://github.com/jianchang512/pyvideotrans)] *Speech recognition + translation + dubbing pipeline for automatic multilingual video processing.*
8. [speechbrain][All-in-One Speech Toolkit in PyTorch](https://github.com/speechbrain/speechbrain), `GitHub 2023`. [[Code](https://github.com/speechbrain/speechbrain)] *End-to-end toolkit for ASR, speaker ID, enhancement, and more, built on PyTorch.*
9. [vosk-api][Offline STT for 20+ Languages](https://github.com/alphacep/vosk-api), `GitHub 2023`. [[Code](https://github.com/alphacep/vosk-api)] *Real-time STT for mobile and edge devices; supports many languages without needing an internet connection.*
10. [speech_recognition][Simple Python Speech Recognition](https://github.com/Uberi/speech_recognition), `GitHub 2023`. [[Code](https://github.com/Uberi/speech_recognition)] *Lightweight library for accessing Google, Wit.ai, CMU Sphinx and more through Python.*
11. [ASRT_SpeechRecognition][Chinese ASR with Deep Learning](https://github.com/nl8590687/ASRT_SpeechRecognition), `GitHub 2023`. [[Code](https://github.com/nl8590687/ASRT_SpeechRecognition)] *Chinese end-to-end STT with attention and LSTM/CTC architectures.*
12. [RealtimeSTT][Low-Latency Microphone Transcription](https://github.com/KoljaB/RealtimeSTT), `GitHub 2023`. [[Code](https://github.com/KoljaB/RealtimeSTT)] *Robust real-time transcription from microphone input; lightweight and fast.*
13. [annyang][Voice Commands in Browser](https://github.com/TalAter/annyang), `GitHub 2023`. [[Code](https://github.com/TalAter/annyang)] *Tiny JS library that adds voice control to websites using browser APIs.*
14. [sherpa-onnx][Real-Time Speech Framework with ONNX](https://github.com/k2-fsa/sherpa-onnx), `GitHub 2023`. [[Code](https://github.com/k2-fsa/sherpa-onnx)] *Kaldi-inspired speech stack with an ONNX backend; cross-platform real-time speech tools.*
15. [SenseVoice][Multilingual Speech Understanding](https://github.com/FunAudioLLM/SenseVoice), `GitHub 2023`. [[Code](https://github.com/FunAudioLLM/SenseVoice)] *Foundation model for ASR, emotion detection, language ID, and event classification.*
16. [silero-models][Production-Ready STT/TTS Models](https://github.com/snakers4/silero-models), `GitHub 2023`. [[Code](https://github.com/snakers4/silero-models)] *Accurate and fast models for mobile and server deployment, with multilingual support.*
17. [whisper-jax][Whisper on JAX for Fast ASR](https://github.com/sanchit-gandhi/whisper-jax), `GitHub 2023`. [[Code](https://github.com/sanchit-gandhi/whisper-jax)] *Fast Whisper inference with batching and TPU support; well suited to large-scale pipelines.*
18. [FunClip][Video Speech Recognition and Clipping Tool](https://github.com/modelscope/FunClip), `GitHub 2023`. [[Code](https://github.com/modelscope/FunClip)] *Open-source tool that transcribes the speech in a video and clips segments selected from the recognized text.*
# Text-to-Speech (TTS)
# Awesome TTS Datasets
A curated list of high-quality **Text-to-Speech (TTS)** datasets suitable for training, fine-tuning, and benchmarking TTS models.
> *Note: Always check dataset licenses before commercial use.*
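
As a first-pass check before reading the full license text, the Creative Commons labels used in the license columns of this list can be decoded from their element codes (BY, NC, ND, SA). The following is a minimal sketch, assuming labels are written the way this list writes them (for example `CC-BY-NC-SA 3.0` or `CC-0`); `cc_elements` is a hypothetical helper name. It reads only the label, so exceptions noted beside a license still need to be checked by hand.

```python
def cc_elements(label):
    """Decode the elements of a Creative Commons label such as 'CC-BY-NC 4.0'.

    Returns None for labels that do not start with 'CC-' (Apache, MIT,
    custom licenses), since their terms are not encoded in the name.
    """
    tokens = label.upper().split()
    if not tokens or not tokens[0].startswith("CC-"):
        return None
    # "CC-BY-NC-SA" -> {"BY", "NC", "SA"}; "CC-0" -> {"0"}
    parts = set(tokens[0].split("-")[1:])
    return {
        "attribution_required": "BY" in parts,
        "commercial_use_allowed": "NC" not in parts,
        "derivatives_allowed": "ND" not in parts,
        "share_alike": "SA" in parts,
    }


for name in ["CC-0", "CC-BY 4.0", "CC-BY-NC-ND 3.0", "Apache 2.0"]:
    print(name, cc_elements(name))
```

A `None` result means only that the label is not a Creative Commons one; it says nothing about whether commercial use is permitted.
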
---
## Multilingual / Large-scale Datasets
### [LibriTTS](https://www.openslr.org/60/)
**Description**: A large corpus derived from LibriSpeech with aligned text and high-quality audio for English TTS tasks.
---
### [Hi-Fi TTS](https://www.openslr.org/109/)
**Description**: High-fidelity English TTS dataset with diverse speakers and SNR subsets, suitable for robust TTS training.
---
## English Datasets
### [LJSpeech](https://keithito.com/LJ-Speech-Dataset/)
**Description**: A widely used single-speaker English dataset designed for TTS and voice cloning tasks.
---
### [AudioCaps](https://github.com/cdjkim/audiocaps)
**Description**: 44K audio-caption pairs built for audio captioning; the paired audio-text data could also support TTS training.
---
## Mandarin Chinese Datasets
### [Opencpop](https://wenet.org.cn/opencpop/)
**Description**: Mandarin singing voice dataset containing phoneme-aligned lyrics, MIDI, and TextGrid files.
---
### [KiSing](http://shijt.site/index.php/2021/05/16/kising-the-first-open-source-mandarin-singing-voice-synthesis-corpus/)
**Description**: Mandarin singing voice synthesis corpus with clean recordings.
---
## ๐ฏ๐ต Japanese Datasets
### ๐ฃ [PJS](https://sites.google.com/site/shinnosuketakamichi/research-topics/pjs_corpus)
**Description**: Japanese speech corpus containing both singing and speaking voice recordings.
---
## ๐งโ๐ค Singing Voice Datasets
### ๐ฃ [M4Singer](https://drive.google.com/file/d/1xC37E59EWRRFFLdG3aJkVqwtLDgtFNqW/view)
**Description**: Multi-singer singing voice dataset with phoneme-aligned data.
---
### ๐ฃ [OpenSinger](https://drive.google.com/file/d/1EofoZxvalgMjZqzUEuEdleHIZ6SHtNuK/view)
**Description**: Open-source singing voice dataset with both male and female recordings.
---
### ๐ฃ [NUS-48E](https://drive.google.com/drive/folders/12pP9uUl0HTVANu3IPLnumTJiRjPtVUMx)
**Description**: English singing voice corpus from multiple speakers with both singing and speaking data.
---
### ๐ฃ [PopBuTFy](https://github.com/MoonInTheRiver/NeuralSVB)
**Description**: Singing dataset featuring both amateur and professional singing recordings.
---
### ๐ฃ [PopCS](https://github.com/MoonInTheRiver/DiffSinger/blob/master/resources/apply_form.md)
**Description**: Mandarin singing corpus with aligned phoneme and waveform data.
---
### ๐ฃ [Opera](http://isophonics.net/SingingVoiceDataset)
**Description**: Western and Chinese opera dataset containing monophonic and polyphonic recordings.
---
## ๐งช Voice Conversion / Singing Voice Conversion
### ๐ฃ [CSD](https://zenodo.org/records/4785016)
**Description**: Children's Song Dataset with Korean and English children's songs sung by a professional female singer, commonly used for singing voice research and conversion.
---
### [SVCC](https://github.com/lesterphillip/SVCC23_FastSVC/tree/main/egs/generate_dataset)
**Description**: Singing Voice Conversion Challenge dataset for benchmarking singing voice conversion systems.
---
## Multi-speaker Speech Datasets
### [VCTK](https://datashare.ed.ac.uk/handle/10283/3443)
**Description**: English multi-speaker dataset designed for speech synthesis and voice conversion tasks.
---
## Custom Dataset Support
### CustomSVCDataset
**Description**: Amphion-compatible folder structure for organizing your own Singing Voice Conversion dataset.
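One way to sanity-check a custom dataset folder before training is to enumerate its utterances. This is a minimal sketch that assumes a `<singer>/<song>/<utterance>.wav` layout; that layout, the `list_utterances` helper, and the placeholder file names are assumptions for illustration, so check the Amphion documentation for the exact structure it expects.

```python
import tempfile
from pathlib import Path


def list_utterances(dataset_root):
    """Return (singer, song, wav_path) tuples for root/<singer>/<song>/<utterance>.wav."""
    root = Path(dataset_root)
    entries = []
    # Exactly two directory levels above each .wav file: singer, then song.
    for wav in sorted(root.glob("*/*/*.wav")):
        song_dir = wav.parent
        entries.append((song_dir.parent.name, song_dir.name, wav))
    return entries


# Build a tiny placeholder tree (empty files) to show the assumed shape.
with tempfile.TemporaryDirectory() as tmp:
    for rel in ["singer1/song1/utt1.wav", "singer1/song2/utt1.wav", "singer2/song1/utt1.wav"]:
        path = Path(tmp, rel)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.touch()
    found = [(singer, song, wav.name) for singer, song, wav in list_utterances(tmp)]
    print(found)
```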
---
## License Reminder
Most datasets listed are for **research purposes only**. For commercial use, carefully review and comply with individual dataset licenses.
---
## Contributions
Want to add a new dataset? Feel free to submit a pull request or open an issue!
Awesome Generative AI & LLM APIs (click to expand)
## GenAI APIs
| Project Homepage | API Docs Link | Requires Auth Token (Y/N) | Description (2 lines max) |
|:-----------|:------|:------|:-------------|
| [OpenAI](https://openai.com/)| [Link](https://platform.openai.com/docs/api-reference) | Y | OpenAI APIs offer state-of-the-art GenAI models that generate human-like text, answer questions, translate languages, generate and understand images, and convert between text and speech. |
| [Gemini](https://ai.google.dev/)| [Link](https://ai.google.dev/gemini-api/docs) | Y | Gemini models are designed to understand and interact with multiple data types, including text, images, audio, and video. |
| [Llama AI](https://www.llama-api.com/) | [Link](https://docs.llama-api.com/quickstart) | Y | Offers APIs to access Llama models to answer complex queries and generate text.|
| [Groq](https://groq.com/) | [Link](https://console.groq.com/docs/quickstart) | Y | Fast token generation on Language Processing Units (LPUs). Serves open-source models such as Gemma-7b-it, Llama3-70b-8192, Llama3-8b-8192, and Mixtral-8x7b-32768. |
| [Databricks](https://docs.databricks.com/en/machine-learning/foundation-models/index.html) | [Link](https://docs.databricks.com/en/machine-learning/foundation-models/api-reference.html) | Y | Databricks supports Foundation Model APIs which allow you to access and query state-of-the-art open models. You can quickly and easily build applications that leverage a high-quality generative AI model without maintaining your own model deployment. |
| [Cohere AI](https://cohere.com/) | [Link](https://docs.cohere.com/docs/chat-api) | Y | Cohere AI offers a chat API that enables developers to create conversational interfaces with ease, leveraging advanced natural language understanding capabilities. |
| [DeepAI](https://deepai.org/) | [Link](https://deepai.org/docs) | Y | DeepAI is a user-friendly platform providing state-of-the-art AI tools & APIs that unlock and enhance creativity across industries, widely democratizing access to AI technologies for both developers and non-tech users. |
| [Clarifai API](https://www.clarifai.com/) | [Link](https://docs.clarifai.com/api-guide/api-overview/) | Y | Clarifai offers access to various popular generative AI models (LLM, multimodal, image, video). |
| [Anthropic](https://www.anthropic.com/) | [Link](https://docs.anthropic.com/en/api/getting-started) | Y | Anthropic is an AI safety and research company behind the powerful Claude 3 model family. |
| [HuggingFace API](https://huggingface.co/) | [Link](https://huggingface.co/docs/api-inference/index) | Y | HuggingFace provides API access to many open source Generative AI models, datasets and Spaces which are free to use. |
| [TextCortex](https://textcortex.com/) | [Link](https://docs.textcortex.com/api) | Y | TextCortex provides a highly-scalable Text Generation API that uses advanced NLP to produce diverse and refined content. |
| [Stability AI](https://stability.ai/) | [Link](https://platform.stability.ai/docs/api-reference) | Y | Stability AI offers open-access AI models with minimal resource requirements in imaging, language, code and audio. |
| [Lovo AI](https://lovo.ai/) | [Link](https://docs.genny.lovo.ai/reference/intro/getting-started) | Y | Lovo lets you generate advanced AI voices for any use case. |
| [Jasper AI](https://www.jasper.ai/) | [Link](https://developers.jasper.ai/docs/getting-started-1) | Y | Jasper assists marketers in creating, optimizing, and publishing content effectively using AI. |
| [Deepbrain AI](https://www.deepbrain.io/) | [Link](https://docs.deepbrain.io/aistudios/getting-started) | Y | DeepBrain AI offers natural text-to-speech capabilities & a powerful video generator that converts various inputs like text prompts, URLs, PDFs, and articles into engaging, professional-quality videos. |
| [Leonardo AI](https://leonardo.ai) | [Link](https://docs.leonardo.ai/reference/createdataset) | Y | Leonardo AI lets you create production quality visual assets for your projects. |
| [Mistral AI](https://mistral.ai/) | [Link](https://docs.mistral.ai/api/) | Y | Mistral offers open and portable Gen AI models for multilingual, code generation, maths, and advanced reasoning capabilities. |
| [Tavus AI](https://www.tavus.io/) | [Link](https://docs.tavusapi.com/api-reference/phoenix-replica-model/create-replica) | Y | Tavus offers an API that converts text to video, with features such as voice cloning, lip-syncing, script generation, and realistic avatars. |
| [Colossyan](https://www.colossyan.com/) | [Link](https://docs.colossyan.com/) | Y | Colossyan offers an AI API to create videos from text with AI avatars. |
| [Synthesia](https://www.synthesia.io/) | [Link](https://docs.synthesia.io/docs/getting-started) | Y | Synthesia offers an API to turn text to video in minutes with AI avatars and voiceovers in 130+ languages. |
| [ElevenLabs](https://elevenlabs.io/) | [Link](https://elevenlabs.io/docs/api-reference/getting-started) | Y | ElevenLabs offers a voice generation API to produce highly realistic and natural-sounding voices. |
| [Perplexity AI](https://www.perplexity.ai/hub/getting-started) | [Link](https://docs.perplexity.ai/docs/getting-started) | Y | Perplexity is like an AI-powered Swiss Army knife for information discovery, content summarization, and exploring new topics. |
| [HeyGen AI](https://www.heygen.com) | [Link](https://docs.heygen.com/reference/authentication-1) | Y | HeyGen lets you produce studio-quality videos with AI-generated avatars and voices. |
| [DeepL Translate](https://www.deepl.com/translator) | [Link](https://developers.deepl.com/docs) | Y | DeepL provides high-quality text and document translations. |
| [IBM Watson AI](https://www.ibm.com/products/watsonx-ai) | [Link](https://cloud.ibm.com/developer/watson/documentation) | Y | IBM Watson lets you incorporate AI capabilities like conversation, language analysis, STT & TTS into your applications. |
| [Writer](https://writer.com/) | [Link](https://dev.writer.com/api-reference/list-models) | Y | Writer provides APIs for generating, enhancing, and personalizing content. |
| [Together AI](https://www.together.ai/) | [Link](https://docs.together.ai/docs/quickstart) | Y | Together AI offers an API to query 50+ leading open-source models in a couple lines of code. |
| [GooseAI](https://goose.ai/) | [Link](https://goose.ai/docs) | Y | GooseAI provides a fully managed NLP-as-a-Service, offering various GPT-based models with high customization and speed. |
| [Voyage AI](https://www.voyageai.com/) | [Link](https://docs.voyageai.com/reference/embeddings-api) | Y | Voyage AI provides API endpoints for embedding and reranking models. |
| [AI/ML API](https://aimlapi.com/) | [Link](https://docs.aimlapi.com/) | Y | An API aggregator that provides access to 100+ AI models via a single API. |
| [Wit.ai](https://wit.ai/) | [Link](https://wit.ai/docs/http/20240304/) | Y | Wit.ai provides APIs to build natural language experiences. |
| [PlayHT](https://play.ht/) | [Link](https://docs.play.ht/reference/api-getting-started) | Y | Play.ht provides realistic text-to-speech voices and audio generation for various applications. |
| [Chooch AI](https://www.chooch.com/) | [Link](https://www.chooch.com/api/) | Y | Detects, processes, and instantly analyzes visual elements in video streams. |
| [Clipdrop](https://clipdrop.co/) | [Link](https://clipdrop.co/apis/docs) | Y | ClipDrop offers APIs for image upscaling, background removal, and other image enhancement features. |
| [Astria AI](https://www.astria.ai/) | [Link](https://docs.astria.ai/docs/api/overview/) | Y | Astria is an API for fine-tuning and customization of generative image models. |
| [Magic Slides](https://www.magicslides.app/) | [Link](https://www.magicslides.app/magicslides-api/docs) | Y | Professional Presentations in Seconds with AI. |
| [Mubert](https://mubert.com/) | [Link](https://mubert2.docs.apiary.io/#) | Y | Generates personalized soundtracks. |
| [SharpAPI](https://sharpapi.com/) | [Link](https://sharpapi.com/documentation) | Y | Generative AI APIs for some use cases in E-Commerce, Marketing, Content Management, HR Tech, Travel, etc.|
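
Every API in the table requires an auth token. The sketch below shows one common pattern: read the token from an environment variable and send it as a bearer token in the `Authorization` header. The endpoint URL, the `GENAI_API_TOKEN` variable, and the payload shape are hypothetical, and some providers use a different header name, so consult each provider's API docs for specifics. The request is only built here, not sent.

```python
import json
import os
import urllib.request


def build_request(url, payload, token_env="GENAI_API_TOKEN"):
    """Build (but do not send) a JSON POST request carrying a bearer token."""
    token = os.environ.get(token_env)
    if not token:
        raise RuntimeError(f"Set the {token_env} environment variable first")
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Demo only: fall back to a placeholder so the example runs without a real token.
os.environ.setdefault("GENAI_API_TOKEN", "placeholder-token")
# Hypothetical endpoint and payload; replace with values from the provider's docs.
request = build_request("https://api.example.com/v1/generate", {"prompt": "Hello"})
print(request.get_method(), request.full_url)
```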
## GenAI API Integration Articles/Tutorials
| Article Title | Link | Summary (2 lines max) |
|:-----------|:------|:-------------|
| How to integrate generative AI into your applications | [Link](https://www.pluralsight.com/resources/blog/data/integrate-genai-applications-openai) | The article offers a detailed tutorial on accessing the OpenAI API, demonstrating methods via web API calls and Python's OpenAI library, enabling developers to integrate Generative AI effortlessly into their projects. |
| AI Image Generator using Reactjs & Open Journey API | [Link](https://medium.com/@vikumch/ai-image-generator-using-reactjs-open-journey-api-8706d7063dae) | This article provides a tutorial on creating an image generator using React.js and the Open Journey API from Prompthero. |
| Create your own GenAI Image Generator App like MidJourney or DALLE-2 | [Link](https://dev.to/techygeeky/create-your-own-genai-image-generator-app-like-midjourney-or-dalle-2-lej) | This article provides a tutorial on how to integrate AI-generated images into a React app using Segmind's text2Img API. |
| Introducing Google Gemini API: Discover the Power of the New Gemini AI Models | [Link](https://www.datacamp.com/tutorial/introducing-gemini-api) | This article provides a tutorial on how to use Gemini Python API and its various functions to build AI-enabled applications. |
| The OpenAI API in Python | [Link](https://www.datacamp.com/cheat-sheet/the-open-ai-api-in-python) | Learn the basics of using the OpenAI API. |
| How to Build LLM Applications with LangChain | [Link](https://www.datacamp.com/tutorial/how-to-build-llm-applications-with-langchain) | Explore the untapped potential of Large Language Models with LangChain. |
## GenAI API Integration YouTube Videos
| Video Title | Link | Summary (2 lines max) |
|:-----------|:------|:-------------|
| Beginner's Guide to FastAPI & OpenAI ChatGPT API Integration | [Link](https://youtu.be/KVdP4SpWcc4?feature=shared) | The video offers a step-by-step tutorial on FastAPI and OpenAI's ChatGPT integration using Python. FastAPI is a high-performance web framework that's perfect for building APIs, and ChatGPT brings a layer of artificial intelligence into the mix. |
| How to Integrate a Custom GPT Into Your Website (Step-by-step Guide) | [Link](https://youtu.be/SNwqkdhv1HQ?si=Mi2cfQZ2uyM0WyTc) | The video offers a step-by-step tutorial on a custom GPT integration on websites. Two different approaches have been depicted in the video so that both a beginner as well as those with some technical know-how could find it comfortable. |
| Getting Started with Groq API: Making Near Real-Time Chatting with LLMs Possible | [Link](https://www.youtube.com/watch?v=S53BanCP14c) | The video discusses the Groq API and how it can be used to create near real-time chatting applications with large language models (LLMs). |
| Building an AI Mobile Application with Flutter and Google Gemini API | [Link](https://www.youtube.com/watch?v=oAmIqoGkfIY) | This video is a tutorial on building an AI mobile application using Flutter and Google Gemini API. |
| Groq Function Calling Llama 3: How to Integrate Custom API in AI App? | [Link](https://www.youtube.com/watch?v=7OAmeq-vwNc) | This video explores integrating custom APIs into AI applications using Groq function calling with Llama 3. |
| Text Cortex REWRITING API AI Text Generator | [Link](https://www.youtube.com/watch?v=vIusOmfXhoA) | The video is a tutorial on the TextCortex Text Generation API. It walks through integration and the steps to access the API and perform tasks with it. |
| Build An AI Image Generator Using OpenAI (Dall-E) API - The Server (NodeJS, Express) | [Link](https://www.youtube.com/watch?v=Iyj9y1XpM0A) | This video is a tutorial on creating an AI image generator using the OpenAI API, Node.js, and Express. |
| About OpenAI Assistants API | [Link](https://youtu.be/qHPonmSX4Ms?si=EZ9C0-pOVLOImOoh) | Learn how to use OpenAI's Assistants API to build powerful AI assistants. |
| Langchain by Greg Kamradt (Data Indy) | [Link](https://www.youtube.com/playlist?list=PLqZXAkvF1bPNQER9mLmDbntNfSpzdDIU5) | The playlist covers OpenAI and LangChain and their various use cases. |
| LangChain Series by Krish Naik | [Link](https://www.youtube.com/playlist?list=PLZoTAELRMXVORE4VF7WQ_fAl0L1Gljtar) | The LangChain Series offers a comprehensive guide to building various LLM-based application projects using LangChain. |
| Google Gemini series by Krish Naik | [Link](https://www.youtube.com/playlist?list=PLZoTAELRMXVNbDmGZlcgCA3a8mRQp5axb) | This Google Gemini playlist offers a comprehensive guide to building various LLM-based applications using Gemini. |
| Spring Boot + OpenAI ChatGPT API Integration by JavaTechie | [Link](https://www.youtube.com/watch?v=HlDkuFy8xRM) | This tutorial by JavaTechie provides a step-by-step guide to integrating the OpenAI API with a Spring Boot application. |
| How To Use ChatGPT With Python | [Link](https://www.youtube.com/watch?v=5MvYe44zen4) | This video shows how to integrate OpenAI's API in Python projects. |
| Build an AI Chatbot using RAG | [Link](https://www.youtube.com/watch?v=XctooiH0moI) | This video shows how to build an AI chatbot using retrieval augmented generation. |
| Let's build GPT: from scratch, in code, spelled out by Andrej Karpathy | [Link](https://youtu.be/kCc8FmEb1nY?si=gc2dhU96USvt90ik) | This video demonstrates building a Generatively Pretrained Transformer (GPT). |
Text-to-Image Generation (click to expand)
## Dataset
### 2025
1. [Janus (DeepSeek-VL)][Dual-Path Vision-Language Model for Text-to-Image Synthesis](https://arxiv.org/abs/2403.09878), `arXiv 2025`. [[No official code yet]] *Unifies visual and textual alignment using a dual-path architecture for improved caption-to-image generation.*
---
### 2024
1. [Text-to-Pose-to-Image][Improving Diffusion Model Control and Quality](https://arxiv.org/abs/2411.12872), `NeurIPS 2024 Workshop`. [[Code](https://github.com/clement-bonnet/text-to-pose)] *Enhances diffusion model generation by inserting an intermediate pose structure between text and image.*
2. [ControlNet v1.1][Structured Guidance for Stable Diffusion](https://arxiv.org/abs/2302.05543), `ICCV 2023`. [[Code](https://github.com/lllyasviel/ControlNet)] *Adds structural conditioning (edge, pose, depth) to pre-trained diffusion models without affecting performance.*
3. [T2I-Adapter][Adapter Modules for Controllable Text-to-Image Synthesis](https://arxiv.org/abs/2302.08453), `AAAI 2024`. [[Code](https://github.com/TencentARC/T2I-Adapter)] *Injects visual condition controls into frozen diffusion models using small plug-in modules.*
4. [StyleDiffusion][Text-Driven Image Generation with Style Control](https://arxiv.org/abs/2312.01234), `arXiv 2024`. [[Code](https://github.com/MatthewLWang/StyleDiffusion)] *Combines diffusion with textual prompts and style embeddings for controlled generation.*
5. [Sana][Scalable Personalization for Text-to-Image Generation](https://arxiv.org/abs/2404.06016), `arXiv 2024`. [[Code](https://github.com/NVlabs/Sana)] *A scalable personalization method for diffusion-based text-to-image models. Supports multi-subject generation and higher fidelity.*
6. [IMAG-Dressing][IMAG-Dressing: Unveiling the Potential of Language-Driven Virtual Try-on](https://arxiv.org/abs/2404.03094), `arXiv 2024`. [[Code](https://github.com/muzishen/IMAGDressing)] *Language-guided virtual try-on system that manipulates clothing appearance based on textual descriptions using diffusion-based architecture.*
7. [Infinity][Infinity: Towards Infinite Resolution Generation with Diffusion Models](https://arxiv.org/abs/2404.08758), `arXiv 2024`. [[Code](https://github.com/FoundationVision/Infinity)] *A diffusion model capable of generating ultra-high-resolution images by leveraging patch-wise autoregressive modeling.*
---
### 2023
1. [GALIP][Generative Adversarial CLIPs for Text-to-Image Synthesis](https://arxiv.org/abs/2301.12959), `arXiv 2023`. [[Code](https://github.com/tobran/GALIP)] *Integrates CLIP in both generator and discriminator for efficient and controllable text-to-image synthesis.*
2. [ELITE][Encoding Visual Concepts into Textual Embeddings](https://arxiv.org/abs/2302.13848), `arXiv 2023`. [[Code](https://github.com/csyxwei/ELITE)] *Maps visual concepts into language embeddings to enable customized image generation.*
3. [Rich-Text-to-Image][Rich Text-to-Image Generation](https://arxiv.org/abs/2307.XXXX), `ICCV 2023`. [[Code](https://github.com/songweige/rich-text-to-image)] *Enhances structure and context preservation using enriched textual prompts.*
4. [custom-diffusion][Multi-Concept Customization of Text-to-Image Diffusion](https://arxiv.org/abs/2212.04488), `CVPR 2023`. [[Code](https://github.com/adobe-research/custom-diffusion)] *Fine-tunes a small subset of diffusion model weights to learn multiple new concepts from a few example images.*
---
### 2022
1. [DreamBooth][Subject-Driven Text-to-Image Generation](https://arxiv.org/abs/2208.12242), `arXiv 2022`. [[Code](https://github.com/XavierXiao/Dreambooth-Stable-Diffusion)] *Fine-tunes diffusion models to generate images of specific subjects with a few samples.*
2. [FuseDream][Training-Free CLIP-Guided GAN Generation](https://arxiv.org/abs/2112.01573), `arXiv 2022`. [[Code](https://github.com/gnobitab/FuseDream)] *Utilizes CLIP+GAN latent optimization to generate images without model retraining.*
---
### 2021
1. [CogView][Pretrained Transformer for General-Domain Generation](https://arxiv.org/abs/2105.13290), `NeurIPS 2021`. [[Code](https://github.com/THUDM/CogView)] *Introduces a large-scale transformer model for high-quality text-to-image synthesis.*
Image Super-Resolution (click to expand)
## 2015
### [waifu2x](https://github.com/nagadomi/waifu2x)
- Paper: [Image Super-Resolution Using Deep Convolutional Networks](https://arxiv.org/abs/1501.00092)
---
## 2016
### [FSRCNN-pytorch](https://github.com/yjn870/FSRCNN-pytorch)
- Paper: [Accelerating the Super-Resolution Convolutional Neural Network](https://arxiv.org/abs/1608.00367)
### [pytorch-vdsr](https://github.com/twtygqyy/pytorch-vdsr)
- Paper: [Accurate Image Super-Resolution Using Very Deep Convolutional Networks](http://cv.snu.ac.kr/research/VDSR/)
---
## 2017
### [EDSR-PyTorch](https://github.com/sanghyun-son/EDSR-PyTorch)
- Paper: [Enhanced Deep Residual Networks for Single Image Super-Resolution](https://arxiv.org/abs/1707.02921)
### [LapSRN](https://github.com/phoenix104104/LapSRN)
- Paper: [Deep Laplacian Pyramid Networks for Fast and Accurate Super-Resolution](https://arxiv.org/abs/1704.03915)
### [SRGAN](https://github.com/tensorlayer/SRGAN)
- Paper: [Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network](https://arxiv.org/abs/1609.04802)
---
## 2018
### [RCAN](https://github.com/yulunzhang/RCAN)
- Paper: [Image Super-Resolution Using Very Deep Residual Channel Attention Networks](https://arxiv.org/abs/1807.02758)
### [RDN](https://github.com/yulunzhang/RDN)
- Paper: [Residual Dense Network for Image Super-Resolution](https://arxiv.org/abs/1802.08797)
### [DBPN-Pytorch](https://github.com/alterzero/DBPN-Pytorch)
- Paper: [Deep Back-Projection Networks for Super-Resolution](https://arxiv.org/abs/1803.02735)
---
## 2019
### [BasicSR](https://github.com/XPixelGroup/BasicSR)
- Paper: [ESRGAN: Enhanced Super-Resolution Generative Adversarial Networks](https://arxiv.org/abs/1809.00219)
### [Anime4K](https://github.com/bloc97/Anime4K)
- Paper: Not available
---
## 2020
### [DRLN](https://github.com/yulunzhang/DRLN)
- Paper: [Densely Residual Laplacian Super-Resolution](https://arxiv.org/abs/1906.12021)
---
## 2021
### [GFPGAN](https://github.com/TencentARC/GFPGAN)
- Paper: [GFPGAN: Towards Real-World Blind Face Restoration with Generative Facial Prior](https://arxiv.org/abs/2101.04061)
### [Real-ESRGAN](https://github.com/xinntao/Real-ESRGAN)
- Paper: [Real-ESRGAN: Training Real-World Blind Super-Resolution with Pure Synthetic Data](https://arxiv.org/abs/2107.10833)
### [SwinIR](https://github.com/JingyunLiang/SwinIR)
- Paper: [SwinIR: Image Restoration Using Swin Transformer](https://arxiv.org/abs/2108.10257)
---
## 2022
### [ESRGAN](https://github.com/xinntao/ESRGAN)
- Paper: [Enhanced Super-Resolution Generative Adversarial Networks](https://arxiv.org/abs/1809.00219)
### [LIIF](https://github.com/yinboc/liif)
- Paper: [Learning Continuous Image Representation with Local Implicit Image Function](https://arxiv.org/abs/2012.09161)
---
## 2023
### [Omni-SR](https://github.com/Francis0625/Omni-SR)
- Paper: [Omni Aggregation Networks for Lightweight Image Super-Resolution](https://arxiv.org/abs/2304.10244)
### [ESRGCNN](https://github.com/hellloxiaotian/ESRGCNN)
- Paper: [Image Super-resolution with An Enhanced Group Convolutional Neural Network](https://arxiv.org/abs/2205.14548)
---
## 2024
### [SeeSR](https://github.com/cswry/SeeSR)
- Paper: [SeeSR: Towards Semantics-Aware Real-World Image Super-Resolution](https://arxiv.org/abs/2403.12345)
---
Voice Cloning (click to expand)
## Voice Cloning Models
### 2025
1. [GPT-SoVITS](https://github.com/RVC-Boss/GPT-SoVITS) *Fast and high-quality voice cloning from 1-minute audio using GPT + SoVITS.*
2. [CosyVoice](https://github.com/FunAudioLLM/CosyVoice) *Multilingual voice generation pipeline combining LLMs with TTS systems.*
3. [VideoLingo](https://github.com/Huanshere/VideoLingo) *AI-based dubbing and voice-over pipeline with automatic sync and translation.*
4. [ebook2audiobook](https://github.com/DrewThomasson/ebook2audiobook) *Convert ebooks to audiobooks with chapters and metadata using dynamic AI models and voice cloning.*
5. [YuE](https://github.com/multimodal-art-projection/YuE) *Text-to-music generation system capable of cloning singing voices.*
---
### 2024
6. [OpenVoice](https://github.com/myshell-ai/OpenVoice) *Instant voice cloning with granular control over voice styles, including emotion and accent.*
7. [Bark-Voice-Cloning](https://github.com/serp-ai/bark-with-voice-clone) *Text-prompted generative audio model with voice cloning capabilities.*
---
### 2023
8. [Real-Time-Voice-Cloning](https://github.com/CorentinJ/Real-Time-Voice-Cloning) *Real-time voice cloning with speaker embedding, Tacotron2, and WaveRNN.*
9. [TTS](https://github.com/coqui-ai/TTS) *Deep learning toolkit for Text-to-Speech and voice cloning in many languages.*
---
### 2022
10. [PaddleSpeech](https://github.com/PaddlePaddle/PaddleSpeech) *Open-source speech toolkit for ASR, TTS, voice cloning, and more.*
Emotion Recognition (click to expand)
- Add KoBERT, CNN+mel, IEMOCAP dataset, etc...
Talking Head Generation (click to expand)
## Datasets
0. VoxCeleb1 [[`Download link`](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox1.html)].
1. VoxCeleb2 [[`Download link`](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox2.html)].
2. Faceforensics++ [[`Download link`](https://github.com/ondyari/FaceForensics)].
3. CelebV [[`Download link`](https://drive.google.com/file/d/1jQ6d76T5GQuvQH4dq8_Wq1T0cxvN0_xp/view)].
4. TalkingHead-1KH [[`Download link`](https://github.com/tcwang0509/TalkingHead-1KH)].
5. LRW (Lip Reading in the Wild) [[`Download link`](https://www.robots.ox.ac.uk/~vgg/data/lip_reading/lrw1.html)].
6. MEAD [[`Download link`](https://github.com/uniBruce/Mead)].
7. CelebV-HQ [[`Download link`](https://github.com/CelebV-HQ/CelebV-HQ)].
8. CHDTF [[`Download link`](https://medialab.sjtu.edu.cn/post/chdtf/)].
9. MultiTalk [[`Download link`](https://github.com/postech-ami/MultiTalk/tree/main/MultiTalk_dataset)].
10. VFHQ [[`Download link`](https://github.com/anjieyang/VFHQ-downloader)].
11. Hallo3 [[`Download link`](https://huggingface.co/datasets/fudan-generative-ai/hallo3_training_data)].
---
## Image-driven
### 2025
1. [HunyuanPortrait] [HunyuanPortrait: Implicit Condition Control for Enhanced Portrait Animation](https://arxiv.org/abs/2503.18860), `CVPR 2025`. [[Code](https://github.com/kkakkkka/HunyuanPortrait)] [[Project](https://kkakkkka.github.io/HunyuanPortrait)]
### 2024
1. [X-Portrait] [X-Portrait: Expressive Portrait Animation with Hierarchical Motion Attention](https://arxiv.org/abs/2403.15931), `arXiv 2024`.
2. [LivePortrait] [LivePortrait: Efficient Portrait Animation with Stitching and Retargeting Control](https://arxiv.org/pdf/2407.03168), `arXiv 2024`. [[Code](https://github.com/KwaiVGI/LivePortrait)] [[Project](https://liveportrait.github.io)]
3. [EMOPortraits] [EMOPortraits: Emotion-enhanced Multimodal One-shot Head Avatars](https://arxiv.org/pdf/2404.19110), `CVPR 2024`. [[Code](https://github.com/neeek2303/EMOPortraits)], [[Project](https://neeek2303.github.io/EMOPortraits/)]
4. [SMA] [Synergizing Motion and Appearance: Multi-Scale Compensatory Codebooks for Talking Head Video Generation](https://arxiv.org/abs/2412.00719), `CVPR 2024`. [[Project](https://shaelynz.github.io/synergize-motion-appearance/)]
### 2023
1. [AVFR-GAN][Audio-Visual Face Reenactment](https://arxiv.org/pdf/2210.02755.pdf), `WACV 2023`. [[Code](https://github.com/mdv3101/AVFR-Gan/)], [[Project](http://cvit.iiit.ac.in/research/projects/cvit-projects/avfr)]
2. [TS-Net][Cross-identity Video Motion Retargeting with Joint Transformation and Synthesis](https://arxiv.org/pdf/2210.01559.pdf), `WACV 2023`. [[Code](https://github.com/nihaomiao/WACV23_TSNet)]
3. [MCNET][Implicit Identity Representation Conditioned Memory Compensation Network for Talking Head Video Generation](https://arxiv.org/abs/2307.09906), `ICCV 2023`. [[Project](https://harlanhong.github.io/publications/mcnet.html)] [[Code](https://github.com/harlanhong/ICCV2023-MCNET)]
### 2022
1. [DaGAN][Depth-Aware Generative Adversarial Network for Talking Head Video Generation](https://arxiv.org/abs/2203.06605), `CVPR 2022`. [[Code](https://github.com/harlanhong/CVPR2022-DaGAN)], [[Project](https://harlanhong.github.io/publications/dagan.html)]
2. [TPSM][Thin-Plate Spline Motion Model for Image Animation](https://arxiv.org/abs/2203.14367), `CVPR 2022`. [[Code](https://github.com/yoyo-nb/Thin-Plate-Spline-Motion-Model)]
3. [StyleHEAT][StyleHEAT: One-Shot High-Resolution Editable Talking Face Generation via Pretrained StyleGAN](https://arxiv.org/pdf/2203.04036.pdf), `ECCV 2022`. [[Code](https://github.com/FeiiYin/StyleHEAT/)], [[Project](https://feiiyin.github.io/StyleHEAT/)]
4. [MegaPortraits][MegaPortraits: One-shot Megapixel Neural Head Avatars](https://arxiv.org/abs/2207.07621), `ACM MM 2022`. [[Project](https://samsunglabs.github.io/MegaPortraits/)]
5. [DAM][Structure-Aware Motion Transfer with Deformable Anchor Model](https://openaccess.thecvf.com/content/CVPR2022/papers/Tao_Structure-Aware_Motion_Transfer_With_Deformable_Anchor_Model_CVPR_2022_paper.pdf), `CVPR 2022`. [[Code](https://github.com/JialeTao/DAM)]
6. [StyleMask][StyleMask: Disentangling the Style Space of StyleGAN2 for Neural Face Reenactment](https://arxiv.org/pdf/2209.13375.pdf), `FG 2023`. [[Code](https://github.com/StelaBou/StyleMask)]
7. [CoRF][Controllable Radiance Fields for Dynamic Face Synthesis](https://arxiv.org/pdf/2210.06465.pdf), `arXiv 2022`.
8. [AniFaceGAN][Animatable 3D-Aware Face Image Generation for Video Avatars](https://arxiv.org/pdf/2210.05825.pdf), `NeurIPS 2022`. [[Project](https://yuewuhkust.github.io/AniFaceGAN/)]
9. [IW][Implicit Warping for Animation with Image Sets](https://arxiv.org/pdf/2210.01794.pdf), `NeurIPS 2022`. [[Project](https://deepimagination.cc/implicit_warping/)]
10. [HifiHead][HifiHead: One-Shot High Fidelity Neural Head Synthesis with 3D Control](https://www.ijcai.org/proceedings/2022/0244.pdf), `IJCAI 2022`.
11. [Face Animation with Multiple Source Images](https://arxiv.org/pdf/2212.00256.pdf), `arXiv 2022`.
12. [MetaPortrait][MetaPortrait: Identity-Preserving Talking Head Generation with Fast Personalized Adaptation](https://download.arxiv.org/pdf/2212.08062v2), `arXiv 2022`.
13. [Compressing Video Calls using Synthetic Talking Heads](https://arxiv.org/pdf/2210.03692.pdf), `BMVC 2022`. [[Project](https://cvit.iiit.ac.in/research/projects/cvit-projects/talking-video-compression)]
14. [Finding Directions in GAN's Latent Space for Neural Face Reenactment](https://arxiv.org/pdf/2202.00046.pdf), `BMVC 2022`. [[Project](https://stelabou.github.io/stylegan-directions-reenactment/)] [[Code](https://github.com/StelaBou/stylegan_directions_face_reenactment)]
15. [LIA][Latent Image Animator: Learning to Animate Images via Latent Space Navigation](https://arxiv.org/pdf/2203.09043.pdf), `ICLR 2022`. [[Project](https://wyhsirius.github.io/LIA-project/)] [[Code](https://github.com/wyhsirius/LIA)]
### 2021
1. [face-vid2vid] [One-Shot Free-View Neural Talking-Head Synthesis for Video Conferencing](https://nvlabs.github.io/face-vid2vid/main.pdf), `CVPR 2021 Oral`. [[Project](https://nvlabs.github.io/face-vid2vid/)]
2. [S2D] [Sparse to Dense Motion Transfer for Face Image Animation](https://openaccess.thecvf.com/content/ICCV2021W/AIM/papers/Zhao_Sparse_to_Dense_Motion_Transfer_for_Face_Image_Animation_ICCVW_2021_paper.pdf), `ICCV 2021 Workshop`.
3. [SAFA] [SAFA: Structure Aware Face Animation](https://arxiv.org/pdf/2111.04928.pdf), `3DV 2021`. [[Code](https://github.com/Qiulin-W/SAFA)]
4. [SAA] [Self-appearance-aided Differential Evolution for Motion Transfer](https://arxiv.org/abs/2110.04658), `arXiv 2021`.
5. [PIRenderer][PIRenderer: Controllable Portrait Image Generation via Semantic Neural Rendering](https://arxiv.org/pdf/2109.08379.pdf), `ICCV 2021`. [[Code](https://github.com/RenYurui/PIRender)]
6. [FaceGAN][FACEGAN: Facial Attribute Controllable rEenactment GAN](https://openaccess.thecvf.com/content/WACV2021/papers/Tripathy_FACEGAN_Facial_Attribute_Controllable_rEenactment_GAN_WACV_2021_paper.pdf), `WACV 2021`.
7. [F^3A-GAN][F3A-GAN: Facial Flow for Face Animation With Generative Adversarial Networks](https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=9547053), `IEEE TIP 2021`.
8. [FACIAL][FACIAL: Synthesizing Dynamic Talking Face with Implicit Attribute Learning](https://openaccess.thecvf.com/content/ICCV2021/papers/Zhang_FACIAL_Synthesizing_Dynamic_Talking_Face_With_Implicit_Attribute_Learning_ICCV_2021_paper.pdf), `ICCV 2021`.
9. [MRAA][Motion Representations for Articulated Animation](https://openaccess.thecvf.com/content/CVPR2021/papers/Siarohin_Motion_Representations_for_Articulated_Animation_CVPR_2021_paper.pdf), `CVPR 2021`. [[Code](https://github.com/snap-research/articulated-animation)]
10. [HeadGAN][HeadGAN: One-shot Neural Head Synthesis and Editing](https://arxiv.org/pdf/2012.08261.pdf), `ICCV 2021`. [[Project](https://michaildoukas.github.io/HeadGAN/)]
### 2020
1. [MeshG] [Mesh Guided One-shot Face Reenactment Using Graph Convolutional Networks](https://dl.acm.org/doi/pdf/10.1145/3394171.3413865), `ACM Multimedia 2020`. [[arXiv](https://arxiv.org/abs/2008.07783)]
2. [MarioNETte] [MarioNETte: Few-shot Face Reenactment Preserving Identity of Unseen Targets](https://arxiv.org/abs/1911.08139), `AAAI 2020`. [[Project](https://hyperconnect.github.io/MarioNETte/)]
3. [CrossID-GAN] [Learning Identity-Invariant Motion Representations for Cross-ID Face Reenactment](https://openaccess.thecvf.com/content_CVPR_2020/papers/Huang_Learning_Identity-Invariant_Motion_Representations_for_Cross-ID_Face_Reenactment_CVPR_2020_paper.pdf), `CVPR 2020`.
### 2019
1. [FOMM] [First order motion model for image animation](http://papers.nips.cc/paper/8935-first-order-motion-model-for-image-animation.pdf), `NeurIPS 2019`. [[Code](https://github.com/AliaksandrSiarohin/first-order-model)]
2. [NeuralHead][Few-Shot Adversarial Learning of Realistic Neural Talking Head Models](https://openaccess.thecvf.com/content_ICCV_2019/papers/Zakharov_Few-Shot_Adversarial_Learning_of_Realistic_Neural_Talking_Head_Models_ICCV_2019_paper.pdf), `ICCV 2019`. [[Code](https://github.com/vincent-thevenin/Realistic-Neural-Talking-Head-Models)]
3. [Monkey-Net][Animating Arbitrary Objects via Deep Motion Transfer](https://arxiv.org/abs/1812.08861), `CVPR 2019 Oral`. [[Code](https://github.com/AliaksandrSiarohin/monkey-net)], [[Project](http://www.stulyakov.com/papers/monkey-net.html)]
4. [fs-vid2vid][Few-shot Video-to-Video Synthesis](https://nvlabs.github.io/few-shot-vid2vid/main.pdf), `NeurIPS 2019`. [[Code](https://github.com/NVlabs/few-shot-vid2vid)], [[Project](https://nvlabs.github.io/few-shot-vid2vid/)]
### 2018
1. [ReenactGAN] [ReenactGAN: Learning to Reenact Faces via Boundary Transfer](https://wywu.github.io/projects/ReenactGAN/support/ReenactGAN.pdf), `ECCV 2018`. [[Code](https://github.com/wywu/ReenactGAN)]
2. [X2Face] [X2Face: A network for controlling face generation by using images, audio, and pose codes](http://www.robots.ox.ac.uk/~vgg/publications/2018/Wiles18/wiles18.pdf), `ECCV 2018`. [[Code](https://github.com/oawiles/X2Face)], [[Project](https://www.robots.ox.ac.uk/~vgg/research/unsup_learn_watch_faces/x2face.html)]
### 2016
1. [Face2face] [Face2Face: Real-time face capture and reenactment of RGB videos](http://openaccess.thecvf.com/content_cvpr_2016/html/Thies_Face2Face_Real-Time_Face_CVPR_2016_paper.html), `CVPR 2016`.
---
## Audio-driven
### 2025
1. [OmniHuman-1][OmniHuman-1: Rethinking the Scaling-Up of One-Stage Conditioned Human Animation Models](https://arxiv.org/abs/2502.01061), `arXiv 2025`. [[Project](https://omnihuman-lab.github.io/)]
2. [ACTalker][Audio-visual Controlled Video Diffusion with Masked Selective State Spaces Modelling for Natural Talking Head Generation](https://arxiv.org/abs/2504.02542), `arXiv 2025`. [[Project](https://harlanhong.github.io/publications/actalker/index.html)]
### 2024
1. [Real3DPortrait] [Real3D-Portrait: One-shot Realistic 3D Talking Portrait Synthesis](https://arxiv.org/pdf/2401.08503.pdf), `ICLR 2024`. [[Project](https://real3dportrait.github.io/)] [[Code](https://github.com/yerfor/Real3DPortrait)]
2. [EMO] [Emote Portrait Alive - Generating Expressive Portrait Videos with Audio2Video Diffusion Model under Weak Conditions](https://arxiv.org/pdf/2402.17485.pdf), `arXiv 2024`. [[Project](https://humanaigc.github.io/emote-portrait-alive/)] [[Code](https://github.com/HumanAIGC/EMO)]
3. [Style2Talker] [Style2Talker: High-Resolution Talking Head Generation with Emotion Style and Art Style](https://arxiv.org/pdf/2403.06365.pdf), `AAAI 2024`.
4. [SaaS] [Say Anything with Any Style](https://arxiv.org/abs/2403.06363), `AAAI 2024`.
5. [MuseTalk] Real-Time High Quality Lip Synchronization with Latent Space Inpainting, [[Code](https://github.com/TMElyralab/MuseTalk)].
6. [VASA-1] [VASA-1: Lifelike Audio-Driven Talking Faces Generated in Real Time](https://arxiv.org/abs/2404.10667), `arXiv 2024`. [[Project](https://www.microsoft.com/en-us/research/project/vasa-1/)]
7. [THQA] [THQA: A Perceptual Quality Assessment Database for Talking Heads](https://arxiv.org/abs/2404.09003), `arXiv 2024`. [[Code](https://github.com/zyj-2000/THQA)]
8. [Talk3D] [Talk3D: High-Fidelity Talking Portrait Synthesis via Personalized 3D Generative Prior](https://arxiv.org/abs/2403.20153), `arXiv 2024`. [[Code](https://github.com/KU-CVLAB/Talk3D)] [[Project](https://ku-cvlab.github.io/Talk3D/)]
9. [EDTalk] [EDTalk: Efficient Disentanglement for Emotional Talking Head Synthesis](https://arxiv.org/abs/2404.01647), `arXiv 2024`. [[Code](https://github.com/tanshuai0219/EDTalk)] [[Project](https://tanshuai0219.github.io/EDTalk/)]
10. [AniPortrait] [AniPortrait: Audio-Driven Synthesis of Photorealistic Portrait Animations](https://arxiv.org/abs/2403.17694), `arXiv 2024`. [[Code](https://github.com/Zejun-Yang/AniPortrait)]
11. [FlowVQTalker] [FlowVQTalker: High-Quality Emotional Talking Face Generation through Normalizing Flow and Quantization](https://arxiv.org/abs/2403.06375), `arXiv 2024`.
12. [FaceChain-ImagineID] [FaceChain-ImagineID: Freely Crafting High-Fidelity Diverse Talking Faces from Disentangled Audio](https://arxiv.org/abs/2403.01901), `arXiv 2024`. [[Code](https://github.com/modelscope/facechain)]
13. [Hallo] [Hallo: Hierarchical Audio-Driven Visual Synthesis for Portrait Image Animation](https://arxiv.org/pdf/2406.08801), `arXiv 2024`. [[Code](https://github.com/fudan-generative-vision/hallo)]
14. [EchoMimic][EchoMimic: Lifelike Audio-Driven Portrait Animations through Editable Landmark Conditions](https://arxiv.org/abs/2407.08136), `arXiv 2024`. [[Code](https://github.com/BadToBest/EchoMimic)], [[Project](https://badtobest.github.io/echomimic.html)]
15. [RealTalk][RealTalk: Real-time and Realistic Audio-driven Face Generation with 3D Facial Prior-guided Identity Alignment Network](https://arxiv.org/abs/2406.18284), `arXiv 2024`.
16. [Emotional Conversation][Emotional Conversation: Empowering Talking Faces with Cohesive Expression, Gaze and Pose Generation](https://arxiv.org/abs/2406.07895), `arXiv 2024`.
17. [Make Your Actor Talk][Make Your Actor Talk: Generalizable and High-Fidelity Lip Sync with Motion and Appearance Disentanglement](https://arxiv.org/abs/2406.08096), `arXiv 2024`.
18. [FD2Talk][FD2Talk: Towards Generalized Talking Head Generation with Facial Decoupled Diffusion Model](https://arxiv.org/pdf/2408.09384v1), `arXiv 2024`.
19. [ReSyncer][ReSyncer: Rewiring Style-based Generator for Unified Audio-Visually Synced Facial Performer](https://arxiv.org/abs/2408.03284), `arXiv 2024`.
20. [StyleSync][Style-Preserving Lip Sync via Audio-Aware Style Reference](https://arxiv.org/abs/2408.05412), `arXiv 2024`.
21. [Loopy][Loopy: Taming Audio-Driven Portrait Avatar with Long-Term Motion Dependency](https://arxiv.org/pdf/2409.02634), `arXiv 2024`. [[Project](https://loopyavatar.github.io)]
22. [DAWN][DAWN: Dynamic Frame Avatar with Non-autoregressive Diffusion Framework for Talking Head Video Generation](https://arxiv.org/abs/2410.13726), `arXiv 2024`. [[Project](https://hanbo-cheng.github.io/DAWN/)], [[Code](https://github.com/Hanbo-Cheng/DAWN-pytorch)]
23. [EchoMimicV2][EchoMimicV2: Towards Striking, Simplified, and Semi-Body Human Animation](https://arxiv.org/abs/2411.10061), `arXiv 2024`. [[Code](https://github.com/antgroup/echomimic_v2)], [[Project](https://antgroup.github.io/ai/echomimic_v2/)]
24. [LetsTalk][Latent Diffusion Transformer for Talking Video Synthesis](https://arxiv.org/abs/2411.16748), `arXiv 2024`. [[Code](https://github.com/zhang-haojie/letstalk)], [[Project](https://zhang-haojie.github.io/project-pages/letstalk.html)]
25. [IF-MDM][Implicit Face Motion Diffusion Model for High-Fidelity Realtime Talking Head Generation](https://arxiv.org/abs/2412.04000), `arXiv 2024`. [[Project](http://ec2-3-25-102-128.ap-southeast-2.compute.amazonaws.com/IF-MDM/ifmdm_supplementary/index.html)]
26. [INFP][Audio-Driven Interactive Head Generation in Dyadic Conversations](https://arxiv.org/abs/2412.04037), `arXiv 2024`. [[Project](https://grisoon.github.io/INFP/)]
27. [MEMO][Memory-Guided Diffusion for Expressive Talking Video Generation](https://arxiv.org/abs/2412.04448), `arXiv 2024`. [[Project](https://memoavatar.github.io/)], [[Code](https://github.com/memoavatar/memo)]
28. [FLOAT][Generative Motion Latent Flow Matching for Audio-driven Talking Portrait](https://arxiv.org/abs/2412.01064), `arXiv 2024`. [[Project](https://deepbrainai-research.github.io/float/)]
29. [Hallo3][Highly Dynamic and Realistic Portrait Image Animation with Diffusion Transformer Networks](https://arxiv.org/abs/2412.00733), `arXiv 2024`.
30. [VQTalker][VQTalker: Towards Multilingual Talking Avatars through Facial Motion Tokenization](https://arxiv.org/pdf/2412.09892), `arXiv 2024`.
31. [PortraitTalk][Towards Customizable One-Shot Audio-to-Talking Face Generation](https://arxiv.org/abs/2412.07754), `arXiv 2024`.
32. [LatentSync][LatentSync: Audio Conditioned Latent Diffusion Models for Lip Sync](https://arxiv.org/abs/2412.09262), `arXiv 2024`. [[Code](https://github.com/bytedance/LatentSync)]
### 2023
1. [Diffused Heads] [Diffused Heads: Diffusion Models Beat GANs on Talking-Face Generation](https://mstypulkowski.github.io/diffusedheads/diffused_heads.pdf), `Arxiv 2023`. [[Project](https://mstypulkowski.github.io/diffusedheads/)] :fire:Diffusion:fire:
2. [DiffTalk] [DiffTalk: Crafting Diffusion Models for Generalized Talking Head Synthesis](https://arxiv.org/abs/2301.03786), `Arxiv 2023`. [[Project](https://sstzal.github.io/DiffTalk/)] [[Code](https://github.com/sstzal/DiffTalk)] :fire:Diffusion:fire:
3. [READ] READ Avatars: Realistic Emotion-controllable Audio Driven Avatars, `Arxiv 2023`.
4. [DAE-Talker] [DAE-Talker: High Fidelity Speech-Driven Talking Face Generation with Diffusion Autoencoder](https://arxiv.org/pdf/2303.17550.pdf), `Arxiv 2023`. :fire:Diffusion:fire:
5. [EmoGen] [Emotionally Enhanced Talking Face Generation](https://arxiv.org/pdf/2303.11548.pdf), `Arxiv 2023`. [[Code](https://github.com/sahilg06/EmoGen)]
6. [TalkLip] [Seeing What You Said: Talking Face Generation Guided by a Lip Reading Expert](https://arxiv.org/pdf/2303.17480.pdf), `CVPR 2023`. [[Code](https://github.com/Sxjdwang/TalkLip)]
7. [StyleSync] [StyleSync: High-Fidelity Generalized and Personalized Lip Sync in Style-based Generator](https://arxiv.org/pdf/2305.05445.pdf), `CVPR 2023`. [[Project](https://hangz-nju-cuhk.github.io/projects/StyleSync)] [[Code](https://github.com/guanjz20/StyleSync)]
8. [GeneFace++] [GeneFace++: Generalized and Stable Real-Time Audio-Driven 3D Talking Face Generation](https://arxiv.org/pdf/2305.00787.pdf), `arXiv 2023`. [[Project](https://genefaceplusplus.github.io)] [[Code](https://github.com/yerfor/GeneFacePlusPlus)]
9. [MODA] [MODA: Mapping-Once Audio-driven Portrait Animation with Dual Attentions](https://arxiv.org/abs/2307.10008), `ICCV 2023`.
10. [VividTalk] [VividTalk: One-Shot Audio-Driven Talking Head Generation Based on 3D Hybrid Prior](https://arxiv.org/pdf/2312.01841.pdf), `Arxiv 2023`. [[Project](https://humanaigc.github.io/vivid-talk/)] [[Code](https://github.com/HumanAIGC/VividTalk)]
11. [IP_LAP] [IP_LAP: Identity-Preserving Talking Face Generation with Landmark and Appearance Priors](https://arxiv.org/abs/2305.08293), `CVPR 2023`. [[Code](https://github.com/Weizhi-Zhong/IP_LAP)]
12. [HyperLips] [HyperLips: Hyper Control Lips with High Resolution Decoder for Talking Face Generation](https://arxiv.org/abs/2310.05720), `CVPR 2023`. [[Code](https://github.com/semchan/HyperLips)]
13. [EAT] [Efficient Emotional Adaptation for Audio-Driven Talking-Head Generation](https://arxiv.org/abs/2309.04946), `ICCV 2023`. [[Project](https://yuangan.github.io/eat/)] [[Code](https://github.com/yuangan/EAT_code)]
14. [SadTalker] [SadTalker: Learning Realistic 3D Motion Coefficients for Stylized Audio-Driven Talking Head Animation](https://arxiv.org/pdf/2211.12194.pdf), `CVPR 2023`. [[Project](https://sadtalker.github.io)] [[Code](https://github.com/Winfredy/SadTalker)]
### 2022
1. [GC-AVT] [Expressive Talking Head Generation with Granular Audio-Visual Control](https://openaccess.thecvf.com/content/CVPR2022/papers/Liang_Expressive_Talking_Head_Generation_With_Granular_Audio-Visual_Control_CVPR_2022_paper.pdf), `CVPR 2022`.
2. [Talking Face Generation with Multilingual TTS](https://openaccess.thecvf.com/content/CVPR2022/papers/Song_Talking_Face_Generation_With_Multilingual_TTS_CVPR_2022_paper.pdf), `CVPR 2022`. [[Demo Track](https://huggingface.co/spaces/CVPR/ml-talking-face)]
3. [EAMM] [EAMM: One-Shot Emotional Talking Face via Audio-Based Emotion-Aware Motion Model](https://arxiv.org/pdf/2205.15278.pdf), `SIGGRAPH 2022`.
4. [SPACEx] [SPACEx: Speech-driven Portrait Animation with Controllable Expression](https://arxiv.org/pdf/2211.09809.pdf), `arXiv 2022`. [[Project](https://deepimagination.cc/SPACEx/)] `CVPR 2023`
5. [AV-CAT] [Masked Lip-Sync Prediction by Audio-Visual Contextual Exploitation in Transformers](https://arxiv.org/pdf/2212.04970.pdf), `SIGGRAPH Asia 2022`.
6. [MemFace] [Memories are One-to-Many Mapping Alleviators in Talking Face Generation](https://arxiv.org/pdf/2212.05005.pdf), `arXiv 2022`.
### 2021
1. [PC-AVS] [Pose-Controllable Talking Face Generation by Implicitly Modularized Audio-Visual Representation](https://arxiv.org/abs/2104.11116), `CVPR 2021`. [[Code](https://github.com/Hangz-nju-cuhk/Talking-Face_PC-AVS)], [[Project](https://hangz-nju-cuhk.github.io/projects/PC-AVS)]
2. [IATS][Imitating Arbitrary Talking Style for Realistic Audio-Driven Talking Face Synthesis](https://dl.acm.org/doi/pdf/10.1145/3474085.3475280), `ACM Multimedia 2021`.
3. [EVP] [Audio-Driven Emotional Video Portraits](https://openaccess.thecvf.com/content/CVPR2021/papers/Ji_Audio-Driven_Emotional_Video_Portraits_CVPR_2021_paper.pdf), `CVPR 2021`. [[Code](https://github.com/jixinya/EVP)]
4. [FAU] [Talking Head Generation with Audio and Speech Related Facial Action Units](https://arxiv.org/pdf/2110.09951.pdf), `arxiv 2021`.
5. [Speech2Talking-Face] [Speech2Talking-Face: Inferring and Driving a Face with Synchronized Audio-Visual Representation](https://www.ijcai.org/proceedings/2021/0141.pdf), `IJCAI 2021`.
6. [IATS] [Imitating Arbitrary Talking Style for Realistic Audio-Driven Talking Face Synthesis](https://arxiv.org/abs/2111.00203), `ACM MM 2021`.
7. [LSP] [Live Speech Portraits: Real-Time Photorealistic Talking-Head Animation](https://arxiv.org/abs/2109.10595), `ACM TOG 2021`. [[Code](https://github.com/YuanxunLu/LiveSpeechPortraits)]
8. [Audio2head] [Audio2head: Audio-driven one-shot talking-head generation with natural head motion](https://arxiv.org/pdf/2107.09293), `ArXiv 2021`.
### 2020
1. [Wav2Lip] [A Lip Sync Expert Is All You Need for Speech to Lip Generation In The Wild](http://arxiv.org/abs/2008.10010), `ACM Multimedia 2020`. [[Code](https://github.com/Rudrabha/Wav2Lip)], [[Project](http://cvit.iiit.ac.in/research/projects/cvit-projects/a-lip-sync-expert-is-all-you-need-for-speech-to-lip-generation-in-the-wild/)]
2. [RhythmicHead] [Talking-head Generation with Rhythmic Head Motion](https://arxiv.org/pdf/2007.08547v1.pdf), `ECCV 2020`. [[Code](https://github.com/lelechen63/Talking-head-Generation-with-Rhythmic-Head-Motion)]
3. [MakeItTalk] MakeItTalk: Speaker-Aware Talking-Head Animation, `SIGGRAPH Asia 2020`. [[Code](https://github.com/yzhou359/MakeItTalk)], [[Project](https://people.umass.edu/~yangzhou/MakeItTalk/)]
4. [Neural Voice Puppetry] [Neural Voice Puppetry: Audio-driven Facial Reenactment](https://arxiv.org/abs/1912.05566), `ECCV 2020`. [[Code](https://github.com/keetsky/NeuralVoicePuppetry)], [[Project](https://justusthies.github.io/posts/neural-voice-puppetry/)]
5. [MEAD] [MEAD: A Large-scale Audio-visual Dataset for Emotional Talking-face Generation](https://www.ecva.net/papers/eccv_2020/papers_ECCV/papers/123660698.pdf), `ECCV 2020`. [[Code](https://github.com/uniBruce/Mead)], [[Project](https://wywu.github.io/projects/MEAD/MEAD.html)]
6. [Realistic Speech-Driven Facial Animation with GANs](https://arxiv.org/pdf/1906.06337.pdf), `IJCV 2020`.
### 2019
1. [DAVS] [Talking Face Generation by Adversarially Disentangled Audio-Visual Representation](https://arxiv.org/abs/1807.07860), `AAAI 2019`. [[Code](https://github.com/Hangz-nju-cuhk/Talking-Face-Generation-DAVS)]
2. [ATVGnet] [Hierarchical Cross-modal Talking Face Generation with Dynamic Pixel-wise Loss](https://www.cs.rochester.edu/~cxu22/p/cvpr2019_facegen_paper.pdf), `CVPR 2019`. [[Code](https://github.com/lelechen63/ATVGnet)]
### 2018
1. [Lip Movements Generation at a Glance](https://openaccess.thecvf.com/content_ECCV_2018/papers/Lele_Chen_Lip_Movements_Generation_ECCV_2018_paper.pdf), `ECCV 2018`. [[Code](https://github.com/lelechen63/3d_gan)]
2. [VisemeNet] [VisemeNet: Audio-Driven Animator-Centric Speech Animation](https://arxiv.org/abs/1805.09488), `SIGGRAPH 2018`.
### 2017
1. [Synthesizing-Obama] [Synthesizing Obama: Learning Lip Sync From Audio](https://grail.cs.washington.edu/projects/AudioToObama/siggraph17_obama.pdf), `SIGGRAPH 2017`. [[Project](https://grail.cs.washington.edu/projects/AudioToObama/)]
2. [You-Said-That?] [You Said That?: Synthesising Talking Faces From Audio](https://arxiv.org/abs/1705.02966), `IJCV 2019`. [[Code](https://github.com/joonson/yousaidthat)]
3. [Audio-Driven Facial Animation by Joint End-to-End Learning of Pose and Emotion](https://users.aalto.fi/~laines9/publications/karras2017siggraph_paper.pdf), `SIGGRAPH 2017`.
4. [A Deep Learning Approach for Generalized Speech Animation](https://home.ttic.edu/~taehwan/taylor_etal_siggraph2017.pdf), `SIGGRAPH 2017`.
### 2016
1. [LRW] [Lip Reading in the Wild](https://www.robots.ox.ac.uk/~vgg/publications/2016/Chung16/chung16.pdf), `ACCV 2016`.
---
## Nerf & 3D
### 2024
1. [CVTHead] [CVTHead: One-shot Controllable Head Avatar with Vertex-feature Transformer](https://openaccess.thecvf.com/content/WACV2024/papers/Ma_CVTHead_One-Shot_Controllable_Head_Avatar_With_Vertex-Feature_Transformer_WACV_2024_paper.pdf), `WACV 2024`. [[Code](https://github.com/HowieMa/CVTHead)].
2. [Head3D] [3D-Aware Talking-Head Video Motion Transfer](https://openaccess.thecvf.com/content/WACV2024/papers/Ni_3D-Aware_Talking-Head_Video_Motion_Transfer_WACV_2024_paper.pdf), `WACV 2024`.
### 2022
1. [SSP-NeRF] [Semantic-Aware Implicit Neural Audio-Driven Video Portrait Generation](https://arxiv.org/pdf/2201.07786.pdf), `arxiv, 2022`.
2. [HeadNeRF] [HeadNeRF: A Real-time NeRF-based Parametric Head Model](https://openaccess.thecvf.com/content/CVPR2022/papers/Grassal_Neural_Head_Avatars_From_Monocular_RGB_Videos_CVPR_2022_paper.pdf), `CVPR 2022`. [[Code](https://github.com/CrisHY1995/headnerf)], [[Project](https://hy1995.top/HeadNeRF-Project/)]
3. [IMavatar] [I M Avatar: Implicit Morphable Head Avatars from Videos](https://openaccess.thecvf.com/content/CVPR2022/papers/Zheng_I_M_Avatar_Implicit_Morphable_Head_Avatars_From_Videos_CVPR_2022_paper.pdf), `CVPR 2022`. [[Code](https://ait.ethz.ch/projects/2022/IMavatar/)]
4. [ROME] [Realistic One-shot Mesh-based Head Avatars](https://arxiv.org/pdf/2206.08343.pdf), `ECCV 2022`.
5. [FNeVR] [FNeVR: Neural Volume Rendering for Face Animation](https://arxiv.org/abs/2209.10340), `Arxiv 2022`. [[Code](https://github.com/zengbohan0217/FNeVR)]
6. [3DFaceShop] [3DFaceShop: Explicitly Controllable 3D-Aware Portrait Generation](https://arxiv.org/pdf/2209.05434), `Arxiv 2022`. [[Code](https://github.com/junshutang/3DFaceShop)], [[Project](https://junshutang.github.io/control/index.html)]
7. [Next3D] [Generative Neural Texture Rasterization for 3D-Aware Head Avatars](https://arxiv.org/pdf/2211.11208.pdf), `Arxiv 2022`. [[Project](https://mrtornado24.github.io/Next3D/)]
8. [NeRFInvertor] [NeRFInvertor: High Fidelity NeRF-GAN Inversion for Single-shot Real Image Animation](https://arxiv.org/pdf/2211.17235.pdf?), `Arxiv 2022`.
9. [DFRF] [Learning Dynamic Facial Radiance Fields for Few-Shot Talking Head Synthesis](https://arxiv.org/abs/2207.11770), `ECCV 2022`. [[Code](https://github.com/sstzal/DFRF)]
### 2021
1. [DFA-NeRF] [DFA-NeRF: Personalized Talking Head Generation via Disentangled Face Attributes Neural Rendering](https://arxiv.org/pdf/2201.00791v1.pdf), `arxiv, 2021`.
2. [NerFACE] [NerFACE: Dynamic Neural Radiance Fields for Monocular 4D Facial Avatar Reconstruction](https://arxiv.org/pdf/2012.03065), `CVPR 2021 Oral`. [[Code](https://github.com/gafniguy/4D-Facial-Avatars)], [[Project](https://gafniguy.github.io/4D-Facial-Avatars/)]
3. [AD-NeRF] [AD-NeRF: Audio Driven Neural Radiance Fields for Talking Head Synthesis](https://arxiv.org/abs/2103.11078), `ICCV 2021`. [[Code](https://github.com/YudongGuo/AD-NeRF)]
### 2020
1. [DiscoFaceGAN] Disentangled and Controllable Face Image Generation via 3D Imitative-Contrastive Learning, `CVPR 2020 Oral`. [[Code](https://github.com/microsoft/DiscoFaceGAN)]
## Survey
### 2024
1. [A Comparative Study of Perceptual Quality Metrics for Audio-driven Talking Head Videos](https://arxiv.org/abs/2403.06421) [[Code](https://github.com/zwx8981/ADTH-QA)]
### 2020
1. [What comprises a good talking-head video generation?: A Survey and Benchmark](https://arxiv.org/pdf/2005.03201v1.pdf)