Merge pull request #5410 from siddhu001/Multitask_Whisper_PR

Multitask Whisper PR
espnet · Oct 25, 2023 · 76b318e
2 parents 2caf055 + 1b572eb
commit 76b318e

Showing 253 changed files with 4,776 additions and 31 deletions.
diff --git a/ci/test_configuration_espnet2.sh b/ci/test_configuration_espnet2.sh
@@ -31,6 +31,16 @@ if python3 -c 'import torch as t; from packaging.version import parse as L; asse
                 continue
             fi
         fi
+        if [ "$f" == "egs2/stop/asr1/conf/train_asr_whisper_full_correct.yaml" ]; then
+            if ! python3 -c "import whisper" > /dev/null; then
+                continue
+            fi
+        fi
+        if [ "$f" == "egs2/uslu14/asr1/conf/train_asr_whisper_full_correct_specaug.yaml" ]; then
+            if ! python3 -c "import whisper" > /dev/null; then
+                continue
+            fi
+        fi
         ${python} -m espnet2.bin.asr_train --config "${f}" --iterator_type none --dry_run true --output_dir out --token_list dummy_token_list
     done
 
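The two added guards above repeat the same check for each Whisper-based config. As a minimal sketch only, the skip decision could be expressed once. The helper names `needs_whisper` and `should_skip` and the optional interpreter argument are hypothetical and do not appear in this PR. Unlike the diff, the sketch also silences stderr so a failed import prints no traceback.

```bash
#!/usr/bin/env bash
# Sketch only: needs_whisper and should_skip are hypothetical helpers,
# not functions defined in ci/test_configuration_espnet2.sh.

# Return 0 if the config path is one of the Whisper configs named in the diff.
needs_whisper() {
    case "$1" in
        egs2/stop/asr1/conf/train_asr_whisper_full_correct.yaml) return 0 ;;
        egs2/uslu14/asr1/conf/train_asr_whisper_full_correct_specaug.yaml) return 0 ;;
        *) return 1 ;;
    esac
}

# Return 0 if the config should be skipped: it needs Whisper and
# `import whisper` fails under the given interpreter (default: python3).
should_skip() {
    local f="$1" py="${2:-python3}"
    needs_whisper "$f" && ! "$py" -c "import whisper" > /dev/null 2>&1
}
```

Inside the existing loop, the two added blocks would then reduce to `if should_skip "$f"; then continue; fi`, and a further Whisper config would need only one more `case` pattern.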

diff --git a/egs2/README.md b/egs2/README.md
@@ -8,6 +8,7 @@ See: https://espnet.github.io/espnet/espnet2_tutorial.html#recipes-using-espnet2
 
 | Directory name          | Corpus name                                                                                                                      | Task                    | Language              | URL                                                                                                          | Note         |
 |-------------------------|----------------------------------------------------------------------------------------------------------------------------------|-------------------------| --------------------- | ------------------------------------------------------------------------------------------------------------ | ------------ |
+| accentdb | A Database of Non-Native English Accents                                                                                                | Accent Recognition                     | ENG                   | https://accentdb.org/                                                                                  |              |
 | accented_french_openslr57 | African Accented French Corpus                                                                                                 | ASR                     | FRA                   | https://www.openslr.org/57/                                                                                  |              |
 | acesinger | ACESinger Singing Corpus                                                                                                 | SVS                     | CMN                   | WIP                                                                                  |              |
 | aesrc2020               | Accented English Speech Recognition Challenge 2020                                                                               | ASR                     | ENG                   | https://arxiv.org/abs/2102.10233                                                                             |              |
@@ -21,6 +22,8 @@ See: https://espnet.github.io/espnet/espnet2_tutorial.html#recipes-using-espnet2
 | ami                     | The AMI Meeting Corpus                                                                                                           | ASR                     | ENG                   | http://groups.inf.ed.ac.uk/ami/corpus/                                                                       |              |
 | an4                     | CMU AN4 database                                                                                                                 | ASR/TTS                 | ENG                   | http://www.speech.cs.cmu.edu/databases/an4/                                                                  |              |
 | aphasiabank             | AphasiaBank database (English)                                                                                                   | ASR                     | ENG                   | https://aphasia.talkbank.org/                                                                                |              |
+| arabic_sc          | Database for Arabic Speech Commands Recognition                                                             | SLU                     | ARA                  | https://github.com/ltkbenamer/AR_Speech_Database                                                  |              |
+| asvspoof          | The 3rd Automatic Speaker Verification Spoofing and Countermeasures Challenge database                                                             | Fake Speech Detection                     | ENG                  | https://datashare.ed.ac.uk/handle/10283/3336                                                  |              |
 | babel                   | IARPA Babel corups                                                                                                               | ASR                     | ~20 languages         | https://www.iarpa.gov/index.php/research-programs/babel                                                      |              |
 | bibletts                   | Bible TTS corups                                                                                                               | TTS                     | 6 Sub-Saharan Africa languages         | https://masakhane-io.github.io/bibleTTS/                                          |              |
 | bn_openslr53            | Large bengali ASR training dataset                                                                                               | ASR                     | BEN                   | https://openslr.org/53/                                                                                      |              |
@@ -48,8 +51,10 @@ See: https://espnet.github.io/espnet/espnet2_tutorial.html#recipes-using-espnet2
 | dns_ins20               | Deep Noise Suppression Challenge – INTERSPEECH 2020                                                                              | SE                      | 11 languages + singing| https://www.microsoft.com/en-us/research/academic-program/deep-noise-suppression-challenge-interspeech-2021/ |              |
 | dns_ins21               | Deep Noise Suppression Challenge – INTERSPEECH 2021                                                                              | SE                      | 11 languages + singing| https://www.microsoft.com/en-us/research/academic-program/deep-noise-suppression-challenge-interspeech-2021/ |              |
 | dsing                   | Automatic Lyric Transcription from Karaoke Vocal Tracks (From DAMP Sing300x30x2)                                                 | ASR (ALT)               | ENG singing           | https://github.com/groadabike/Kaldi-Dsing-task                                                               |              |
+| esc50                   | Dataset for Environmental Sound Classification                                                 | Audio Classification               |           | https://github.com/karolpiczak/ESC-50                                                               |              |
 | fisher_callhome_spanish | Fisher and CALLHOME Spanish--English Speech Translation                                                                          | ASR/ST                  | SPA->ENG              | https://catalog.ldc.upenn.edu/LDC2014T23                                                                     |              |
 | fleurs                  | Few-shot Learning Evaluation of Universal Representations of Speech                                                              | ASR/Multilingual        | 102 languages         | https://huggingface.co/datasets/google/fleurs                                                                |              |
+| freesound                  | Speech Command & Freesound for VAD        | English        | https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/asr/speech_classification/datasets.html#speech-command-freesound-for-vad                                                                |              |
 | fsc                     | Fluent Speech Commands Dataset                                                                                                   | SLU                     | ENG                   | https://fluent.ai/fluent-speech-commands-a-dataset-for-spoken-language-understanding-research/               |              |
 | fsc_challenge           | Fluent Speech Commands Dataset MASE Eval Challenge splits                                                                        | SLU                     | ENG                   | https://github.com/maseEval/mase                                                                             |              |
 | fsc_unseen              | Fluent Speech Commands Dataset MASE Eval Unseen splits                                                                           | SLU                     | ENG                   | https://github.com/maseEval/mase                                                                             |              |
@@ -95,6 +100,7 @@ See: https://espnet.github.io/espnet/espnet2_tutorial.html#recipes-using-espnet2
 | lrs2                    | The Oxford-BBC Lip Reading Sentences 2 (LRS2) Dataset                                                                            | Lipreading/ASR          | ENG                  | https://www.robots.ox.ac.uk/~vgg/data/lip_reading/lrs2.html                                                  |              |
 | lrs3                    | The Oxford-BBC Lip Reading Sentences 3 (LRS3) Dataset                                                                            | ASR                     | ENG                  | https://www.robots.ox.ac.uk/~vgg/data/lip_reading/lrs3.html                                                  |              |
 | lt_slurp_spatialized    | Spatialized Libri-Trans and Spatialized SLURP (LT-S and SLURP-S), Enhancement for Translation and Understanding Dataset          | SE/ST/SLU               | ENG                  |                                                                                                              |              |
+| lt_speech_commands    | Lithuanian Speech Commands dataset               | LIT                  |    https://github.com/kolesov93/lt_speech_commands                                                                                                          |              |
 | m4singer                | Multi-Style, Multi-Singer and Musical Score Provided Mandarin Singing Corpus                                                                                | SVS                     | CMN                  | https://drive.google.com/file/d/1xC37E59EWRRFFLdG3aJkVqwtLDgtFNqW/view?usp=share_link                                                                               |              |
 | magicdata               | MAGICDATA Mandarin Chinese Read Speech Corpus                                                                                    | ASR                     | ENG                  | https://www.openslr.org/68/                                                                                  |              |
 | media                   | MEDIA speech database for French                                                                                                 | SLU/Entity Classifi.    | FRA                  | https://catalogue.elra.info/en-us/repository/browse/ELRA-S0272/                                              |              |
@@ -114,6 +120,8 @@ See: https://espnet.github.io/espnet/espnet2_tutorial.html#recipes-using-espnet2
 | musdb18                 | Music source separation corpus                                                                                                   | ENH                     | ENG                  | https://sigsep.github.io/datasets/musdb.htmlmust-c/                                                          |              |
 | must_c                 | https://ict.fbk.eu/must-c/                                                                                                        | ASR/MT/ST               | ENG->14langs         | https://ict.fbk.eu/must-c/                                                                                   |              |
 | must_c_v2              | https://ict.fbk.eu/must-c/                                                                                                        | ASR/MT/ST               | ENG->DEU            | https://ict.fbk.eu/must-c/                                                                                    |              |
+| mustard              | MUStARD: Multimodal Sarcasm Detection Dataset                                                                                                        | SLU               | ENG            | https://github.com/soujanyaporia/MUStARD/                                                                                    |              |
+| mustard_plus_plus              | A Multimodal Corpus for Emotion Recognition in Sarcasm                                                                                                        | SLU               | ENG            | https://github.com/cfiltnlp/MUStARD_Plus_Plus/                                                                                    |              |
 | nit_song070             | The NITech Japanese speech database | SVS                     | JPN                  | http://hts.sp.nitech.ac.jp/archives/2.3/HTS-demo_NIT-SONG070-F001.tar.bz2
 | nsc                     | National Speech Corpus                                                                                                           | ASR                     | ENG-SG               | https://www.imda.gov.sg/programme-listing/digital-services-lab/national-speech-corpus                        |              |
 | ofuton_p_utagoe_db      | Ofuton_p_utagoe Singing voice synthesis corpus                                                                                   | SVS                     | JPN                  | https://sites.google.com/view/oftn-utagoedb/%E3%83%9B%E3%83%BC%E3%83%A0                                      |              |
@@ -144,6 +152,7 @@ See: https://espnet.github.io/espnet/espnet2_tutorial.html#recipes-using-espnet2
 | snips                   | SNIPS: A dataset for spoken language understanding                                                                               | SLU                     | ENG                  | https://github.com/sonos/spoken-language-understanding-research-datasets                                     |              |
 | speechcommands          | Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition                                                             | SLU                     | ENG                  | https://www.tensorflow.org/datasets/catalog/speech_commands                                                  |              |
 | spgispeech              | SPGISpeech 5k corpus                                                                                                             | ASR                     | ENG                  | https://datasets.kensho.com/datasets/scribe                                                                  |              |
+| stop              | STOP: Spoken Task Oriented Parsing                                                                                                             | SLU                     | ENG                  | https://facebookresearch.github.io/spoken_task_oriented_parsing/                                                                  |              |
 | su_openslr36            | Sundanese                                                                                                                        | ASR                     | SUN                  | http://www.openslr.org/36                                                                                    |              |
 | swbd                    | Switchboard Corpus for 2-channel Conversational Telephone Speech (300h)                                                          | ASR                     | ENG                  | https://catalog.ldc.upenn.edu/LDC97S62                                                                       |              |
 | swbd_da                 | NXT Switchboard Annotations                                                                                                      | SLU                     | ENG                  | https://catalog.ldc.upenn.edu/LDC2009T26                                                                     |              |