Merge pull request #5410 from siddhu001/Multitask_Whisper_PR
Multitask Whisper PR
sw005320 committed Oct 25, 2023
2 parents 2caf055 + 1b572eb commit 76b318e
Showing 253 changed files with 4,776 additions and 31 deletions.
10 changes: 10 additions & 0 deletions ci/test_configuration_espnet2.sh
@@ -31,6 +31,16 @@ if python3 -c 'import torch as t; from packaging.version import parse as L; asse
continue
fi
fi
if [ "$f" == "egs2/stop/asr1/conf/train_asr_whisper_full_correct.yaml" ]; then
if ! python3 -c "import whisper" > /dev/null; then
continue
fi
fi
if [ "$f" == "egs2/uslu14/asr1/conf/train_asr_whisper_full_correct_specaug.yaml" ]; then
if ! python3 -c "import whisper" > /dev/null; then
continue
fi
fi
${python} -m espnet2.bin.asr_train --config "${f}" --iterator_type none --dry_run true --output_dir out --token_list dummy_token_list
done
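The two new `if` blocks follow the same pattern: a config that depends on the optional `whisper` module is skipped when `import whisper` fails, so the dry run does not break on CI machines without it. A minimal sketch of how that check could be kept in one place, written to be POSIX sh-compatible so it also runs under bash. The function names `has_module`, `needs_whisper`, and `should_skip` are hypothetical and are not part of the ESPnet CI script.

```shell
# Hypothetical helpers, not part of ci/test_configuration_espnet2.sh.

# Succeed if python3 can import the named module; discard stdout and stderr.
has_module() {
  python3 -c "import $1" > /dev/null 2>&1
}

# Succeed if the config path is one of the whisper-dependent configs.
needs_whisper() {
  case "$1" in
    egs2/stop/asr1/conf/train_asr_whisper_full_correct.yaml) return 0 ;;
    egs2/uslu14/asr1/conf/train_asr_whisper_full_correct_specaug.yaml) return 0 ;;
    *) return 1 ;;
  esac
}

# Skip a config only when it needs whisper and whisper is not importable.
should_skip() {
  needs_whisper "$1" && ! has_module whisper
}
```

Inside the existing loop, the two `if` blocks would then reduce to `if should_skip "$f"; then continue; fi` before the `${python} -m espnet2.bin.asr_train ... --dry_run true` call, and a new whisper-based config only needs one extra `case` arm. Redirecting stderr as well (`2>&1`) keeps the `ModuleNotFoundError` traceback out of the CI log, which `> /dev/null` alone does not.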

9 changes: 9 additions & 0 deletions egs2/README.md
@@ -8,6 +8,7 @@ See: https://espnet.github.io/espnet/espnet2_tutorial.html#recipes-using-espnet2

| Directory name | Corpus name | Task | Language | URL | Note |
|-------------------------|----------------------------------------------------------------------------------------------------------------------------------|-------------------------| --------------------- | ------------------------------------------------------------------------------------------------------------ | ------------ |
| accentdb | A Database of Non-Native English Accents | Accent Recognition | ENG | https://accentdb.org/ | |
| accented_french_openslr57 | African Accented French Corpus | ASR | FRA | https://www.openslr.org/57/ | |
| acesinger | ACESinger Singing Corpus | SVS | CMN | WIP | |
| aesrc2020 | Accented English Speech Recognition Challenge 2020 | ASR | ENG | https://arxiv.org/abs/2102.10233 | |
@@ -21,6 +22,8 @@ See: https://espnet.github.io/espnet/espnet2_tutorial.html#recipes-using-espnet2
| ami | The AMI Meeting Corpus | ASR | ENG | http://groups.inf.ed.ac.uk/ami/corpus/ | |
| an4 | CMU AN4 database | ASR/TTS | ENG | http://www.speech.cs.cmu.edu/databases/an4/ | |
| aphasiabank | AphasiaBank database (English) | ASR | ENG | https://aphasia.talkbank.org/ | |
| arabic_sc | Database for Arabic Speech Commands Recognition | SLU | ARA | https://github.com/ltkbenamer/AR_Speech_Database | |
| asvspoof | The 3rd Automatic Speaker Verification Spoofing and Countermeasures Challenge database | Fake Speech Detection | ENG | https://datashare.ed.ac.uk/handle/10283/3336 | |
| babel | IARPA Babel corpus | ASR | ~20 languages | https://www.iarpa.gov/index.php/research-programs/babel | |
| bibletts | Bible TTS corpus | TTS | 6 Sub-Saharan African languages | https://masakhane-io.github.io/bibleTTS/ | |
| bn_openslr53 | Large Bengali ASR training dataset | ASR | BEN | https://openslr.org/53/ | |
@@ -48,8 +51,10 @@ See: https://espnet.github.io/espnet/espnet2_tutorial.html#recipes-using-espnet2
| dns_ins20 | Deep Noise Suppression Challenge – INTERSPEECH 2020 | SE | 11 languages + singing| https://www.microsoft.com/en-us/research/academic-program/deep-noise-suppression-challenge-interspeech-2021/ | |
| dns_ins21 | Deep Noise Suppression Challenge – INTERSPEECH 2021 | SE | 11 languages + singing| https://www.microsoft.com/en-us/research/academic-program/deep-noise-suppression-challenge-interspeech-2021/ | |
| dsing | Automatic Lyric Transcription from Karaoke Vocal Tracks (From DAMP Sing300x30x2) | ASR (ALT) | ENG singing | https://github.com/groadabike/Kaldi-Dsing-task | |
| esc50 | Dataset for Environmental Sound Classification | Audio Classification | | https://github.com/karolpiczak/ESC-50 | |
| fisher_callhome_spanish | Fisher and CALLHOME Spanish--English Speech Translation | ASR/ST | SPA->ENG | https://catalog.ldc.upenn.edu/LDC2014T23 | |
| fleurs | Few-shot Learning Evaluation of Universal Representations of Speech | ASR/Multilingual | 102 languages | https://huggingface.co/datasets/google/fleurs | |
| freesound | Speech Command & Freesound for VAD | | ENG | https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/asr/speech_classification/datasets.html#speech-command-freesound-for-vad | |
| fsc | Fluent Speech Commands Dataset | SLU | ENG | https://fluent.ai/fluent-speech-commands-a-dataset-for-spoken-language-understanding-research/ | |
| fsc_challenge | Fluent Speech Commands Dataset MASE Eval Challenge splits | SLU | ENG | https://github.com/maseEval/mase | |
| fsc_unseen | Fluent Speech Commands Dataset MASE Eval Unseen splits | SLU | ENG | https://github.com/maseEval/mase | |
@@ -95,6 +100,7 @@ See: https://espnet.github.io/espnet/espnet2_tutorial.html#recipes-using-espnet2
| lrs2 | The Oxford-BBC Lip Reading Sentences 2 (LRS2) Dataset | Lipreading/ASR | ENG | https://www.robots.ox.ac.uk/~vgg/data/lip_reading/lrs2.html | |
| lrs3 | The Oxford-BBC Lip Reading Sentences 3 (LRS3) Dataset | ASR | ENG | https://www.robots.ox.ac.uk/~vgg/data/lip_reading/lrs3.html | |
| lt_slurp_spatialized | Spatialized Libri-Trans and Spatialized SLURP (LT-S and SLURP-S), Enhancement for Translation and Understanding Dataset | SE/ST/SLU | ENG | | |
| lt_speech_commands | Lithuanian Speech Commands dataset | | LIT | https://github.com/kolesov93/lt_speech_commands | |
| m4singer | Multi-Style, Multi-Singer and Musical Score Provided Mandarin Singing Corpus | SVS | CMN | https://drive.google.com/file/d/1xC37E59EWRRFFLdG3aJkVqwtLDgtFNqW/view?usp=share_link | |
| magicdata | MAGICDATA Mandarin Chinese Read Speech Corpus | ASR | CMN | https://www.openslr.org/68/ | |
| media | MEDIA speech database for French | SLU/Entity Classifi. | FRA | https://catalogue.elra.info/en-us/repository/browse/ELRA-S0272/ | |
@@ -114,6 +120,8 @@ See: https://espnet.github.io/espnet/espnet2_tutorial.html#recipes-using-espnet2
| musdb18 | Music source separation corpus | ENH | ENG | https://sigsep.github.io/datasets/musdb.html | |
| must_c | MuST-C: a Multilingual Speech Translation Corpus | ASR/MT/ST | ENG->14langs | https://ict.fbk.eu/must-c/ | |
| must_c_v2 | MuST-C v2 | ASR/MT/ST | ENG->DEU | https://ict.fbk.eu/must-c/ | |
| mustard | MUStARD: Multimodal Sarcasm Detection Dataset | SLU | ENG | https://github.com/soujanyaporia/MUStARD/ | |
| mustard_plus_plus | A Multimodal Corpus for Emotion Recognition in Sarcasm | SLU | ENG | https://github.com/cfiltnlp/MUStARD_Plus_Plus/ | |
| nit_song070 | The NITech Japanese speech database | SVS | JPN | http://hts.sp.nitech.ac.jp/archives/2.3/HTS-demo_NIT-SONG070-F001.tar.bz2 | |
| nsc | National Speech Corpus | ASR | ENG-SG | https://www.imda.gov.sg/programme-listing/digital-services-lab/national-speech-corpus | |
| ofuton_p_utagoe_db | Ofuton_p_utagoe Singing voice synthesis corpus | SVS | JPN | https://sites.google.com/view/oftn-utagoedb/%E3%83%9B%E3%83%BC%E3%83%A0 | |
@@ -144,6 +152,7 @@ See: https://espnet.github.io/espnet/espnet2_tutorial.html#recipes-using-espnet2
| snips | SNIPS: A dataset for spoken language understanding | SLU | ENG | https://github.com/sonos/spoken-language-understanding-research-datasets | |
| speechcommands | Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition | SLU | ENG | https://www.tensorflow.org/datasets/catalog/speech_commands | |
| spgispeech | SPGISpeech 5k corpus | ASR | ENG | https://datasets.kensho.com/datasets/scribe | |
| stop | STOP: Spoken Task Oriented Parsing | SLU | ENG | https://facebookresearch.github.io/spoken_task_oriented_parsing/ | |
| su_openslr36 | Sundanese | ASR | SUN | http://www.openslr.org/36 | |
| swbd | Switchboard Corpus for 2-channel Conversational Telephone Speech (300h) | ASR | ENG | https://catalog.ldc.upenn.edu/LDC97S62 | |
| swbd_da | NXT Switchboard Annotations | SLU | ENG | https://catalog.ldc.upenn.edu/LDC2009T26 | |