Commit

Mms release (#3948) (#5110)
vineelpratap committed May 22, 2023
1 parent bfd9dc6 commit 728b947
Showing 23 changed files with 2,657 additions and 70 deletions.
63 changes: 63 additions & 0 deletions examples/mms/MODEL_CARD.md
@@ -0,0 +1,63 @@
# MMS Model Card

## Model details

**Organization developing the model** The FAIR team of Meta AI.

**Model version** This is version 1 of the model.

**Model type** MMS is a speech model based on the transformer architecture. The pre-trained model comes in two sizes: 300M and 1B parameters. We fine-tune the model for speech recognition and make it available in the 1B variant. We also fine-tune the 1B variant for language identification.

**License** CC BY-NC

**Where to send questions or comments about the model** Questions and comments about MMS can be sent via the [GitHub repository](https://github.com/pytorch/fairseq/tree/master/examples/mms) of the project, by opening an issue and tagging it as MMS.

## Uses

**Primary intended uses** The primary use of MMS is to perform speech processing research for many more languages and to support tasks such as automatic speech recognition, language identification, and speech synthesis.

**Primary intended users** The primary intended users of the model are researchers in speech processing, machine learning and artificial intelligence.

**Out-of-scope use cases** Fine-tuning the pre-trained models on other labeled datasets or downstream tasks requires further risk evaluation and mitigation.

## Bias and Risks

The MMS models were pre-trained on a blend of data from different domains, including readings of the New Testament. In the paper, we describe two studies analyzing gender bias and the use of religious language, which conclude that the models perform equally well for both genders and that, on average, there is little bias for religious language (Section 8 of the paper).

# Training Details

## Training Data

MMS is pre-trained on VoxPopuli (parliamentary speech), MLS (read audiobooks), VoxLingua-107 (YouTube speech), CommonVoice (read Wikipedia text), BABEL (telephone conversations), MMS-lab-U (New Testament readings), and MMS-unlab (various read Christian texts).
Models are fine-tuned on FLEURS, VoxLingua-107, MLS, CommonVoice, and MMS-lab. We obtained the language information for MMS-lab, MMS-lab-U and MMS-unlab from our data sources and did not manually verify it for every language.

## Training Procedure

Please refer to the research paper for details on this.

# Evaluation

## Testing Data, Factors & Metrics

We evaluate the models on different benchmarks for the downstream tasks; the evaluation details are presented in the paper. Model performance is measured using standard metrics such as character error rate, word error rate, and classification accuracy.


# Citation

**BibTeX:**

```
@article{pratap2023mms,
title={Scaling Speech Technology to 1,000+ Languages},
author={Vineel Pratap and Andros Tjandra and Bowen Shi and Paden Tomasello and Arun Babu and Sayani Kundu and Ali Elkahky and Zhaoheng Ni and Apoorv Vyas and Maryam Fazel-Zarandi and Alexei Baevski and Yossi Adi and Xiaohui Zhang and Wei-Ning Hsu and Alexis Conneau and Michael Auli},
journal={arXiv},
year={2023}
}
```

# Model Card Contact

Please reach out to the authors at: [vineelkpratap@meta.com](mailto:vineelkpratap@meta.com) [androstj@meta.com](mailto:androstj@meta.com) [bshi@meta.com](mailto:bshi@meta.com) [michaelauli@meta.com](mailto:michaelauli@gmail.com)


175 changes: 175 additions & 0 deletions examples/mms/README.md
@@ -0,0 +1,175 @@
# MMS: Scaling Speech Technology to 1000+ languages

The Massively Multilingual Speech (MMS) project expands speech technology from about 100 languages to over 1,000 by building a single multilingual speech recognition model supporting over 1,100 languages (more than 10 times as many as before), language identification models able to identify over [4,000 languages](https://dl.fbaipublicfiles.com/mms/misc/language_coverage_mms.html) (40 times more than before), pretrained models supporting over 1,400 languages, and text-to-speech models for over 1,100 languages. Our goal is to make it easier for people to access information and to use devices in their preferred language.

You can find details in the paper [Scaling Speech Technology to 1000+ languages](https://research.facebook.com/publications/scaling-speech-technology-to-1000-languages/) and the [blog post](https://ai.facebook.com/blog/multilingual-speech-recognition-model/).

An overview of the languages covered by MMS can be found [here](https://dl.fbaipublicfiles.com/mms/misc/language_coverage_mms.html).


## Pretrained models

| Model | Link |
|---|---|
| MMS-300M | [download](https://dl.fbaipublicfiles.com/mms/pretraining/base_300m.pt) |
| MMS-1B | [download](https://dl.fbaipublicfiles.com/mms/pretraining/base_1b.pt) |

Example commands to finetune the pretrained models can be found in the [wav2vec README](https://github.com/pytorch/fairseq/tree/master/examples/wav2vec#fine-tune-a-pre-trained-model-with-ctc).

## Finetuned models
### ASR

| Model | # Languages | Dataset | Checkpoint | Supported languages |
|---|---|---|---|---|
MMS-1B:FL102 | 102 | FLEURS | [download](https://dl.fbaipublicfiles.com/mms/asr/mms1b_fl102.pt) | [download](https://dl.fbaipublicfiles.com/mms/asr/mms1b_fl102_langs.html)
MMS-1B:L1107| 1107 | MMS-lab | [download](https://dl.fbaipublicfiles.com/mms/asr/mms1b_l1107.pt) | [download](https://dl.fbaipublicfiles.com/mms/asr/mms1b_l1107_langs.html)
MMS-1B-all| 1162 | MMS-lab + FLEURS <br>+ CV + VP + MLS | [download](https://dl.fbaipublicfiles.com/mms/asr/mms1b_all.pt) | [download](https://dl.fbaipublicfiles.com/mms/asr/mms1b_all_langs.html)

### TTS
1. Download the list of [iso codes](https://dl.fbaipublicfiles.com/mms/tts/all-tts-languages.html) of 1107 languages.
2. Find the ISO code of the target language and download the checkpoint. Each folder contains 3 files: `G_100000.pth`, `config.json`, `vocab.txt`. `G_100000.pth` is the generator trained for 100K updates, `config.json` is the training config, and `vocab.txt` is the vocabulary for the TTS model.
```
# Examples:
wget https://dl.fbaipublicfiles.com/mms/tts/eng.tar.gz # English (eng)
wget https://dl.fbaipublicfiles.com/mms/tts/azj-script_latin.tar.gz # North Azerbaijani (azj-script_latin)
```
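
If you prefer to fetch a checkpoint from Python rather than with `wget`, here is a minimal sketch using only the standard library; the URL pattern follows the examples above and the local `models/` directory is an arbitrary choice, not something mandated by the repo.
```python
# Sketch: download and unpack one TTS checkpoint.
import tarfile
import urllib.request

iso = "eng"  # any ISO code from the list above
url = f"https://dl.fbaipublicfiles.com/mms/tts/{iso}.tar.gz"
archive, _ = urllib.request.urlretrieve(url)
with tarfile.open(archive) as tar:
    tar.extractall(path="models")  # expect G_100000.pth, config.json, vocab.txt inside
```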

### LID

| # Languages | Dataset | Model | Dictionary | Supported languages |
|---|---|---|---|---|
126 | FLEURS + VL + MMS-lab-U + MMS-unlab | [download](https://dl.fbaipublicfiles.com/mms/lid/mms1b_l126.pt) | [download](https://dl.fbaipublicfiles.com/mms/lid/dict/l126/dict.lang.txt) | [download](https://dl.fbaipublicfiles.com/mms/lid/mms1b_l126_langs.html)
256 | FLEURS + VL + MMS-lab-U + MMS-unlab | [download](https://dl.fbaipublicfiles.com/mms/lid/mms1b_l256.pt) | [download](https://dl.fbaipublicfiles.com/mms/lid/dict/l256/dict.lang.txt) | [download](https://dl.fbaipublicfiles.com/mms/lid/mms1b_l256_langs.html)
512 | FLEURS + VL + MMS-lab-U + MMS-unlab | [download](https://dl.fbaipublicfiles.com/mms/lid/mms1b_l512.pt) | [download](https://dl.fbaipublicfiles.com/mms/lid/dict/l512/dict.lang.txt) | [download](https://dl.fbaipublicfiles.com/mms/lid/mms1b_l512_langs.html)
1024 | FLEURS + VL + MMS-lab-U + MMS-unlab | [download](https://dl.fbaipublicfiles.com/mms/lid/mms1b_l1024.pt) | [download](https://dl.fbaipublicfiles.com/mms/lid/dict/l1024/dict.lang.txt) | [download](https://dl.fbaipublicfiles.com/mms/lid/mms1b_l1024_langs.html)
2048 | FLEURS + VL + MMS-lab-U + MMS-unlab | [download](https://dl.fbaipublicfiles.com/mms/lid/mms1b_l2048.pt) | [download](https://dl.fbaipublicfiles.com/mms/lid/dict/l2048/dict.lang.txt) | [download](https://dl.fbaipublicfiles.com/mms/lid/mms1b_l2048_langs.html)
4017 | FLEURS + VL + MMS-lab-U + MMS-unlab | [download](https://dl.fbaipublicfiles.com/mms/lid/mms1b_l4017.pt) | [download](https://dl.fbaipublicfiles.com/mms/lid/dict/l4017/dict.lang.txt) | [download](https://dl.fbaipublicfiles.com/mms/lid/mms1b_l4017_langs.html)

## Commands to run inference

### ASR
Run this command to transcribe one or more audio files:
```shell
cd /path/to/fairseq-py/
python examples/mms/asr/infer/mms_infer.py --model "/path/to/asr/model" --lang lang_code --audio "/path/to/audio_1.wav" "/path/to/audio_2.wav"
```

For more advanced configuration, and to calculate CER/WER, you can prepare a manifest folder with the following format:
```
$ ls /path/to/manifest
dev.tsv
dev.wrd
dev.ltr
dev.uid
# dev.tsv each line contains <audio> <number_of_sample>
$ cat dev.tsv
/
/path/to/audio_1 180000
/path/to/audio_2 200000
$ cat dev.ltr
t h i s | i s | o n e |
t h i s | i s | t w o |
$ cat dev.wrd
this is one
this is two
$ cat dev.uid
audio_1
audio_2
```
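
If you generate these manifest files programmatically, a small helper along the following lines works. This is only a sketch: `write_manifest` and the sample paths are illustrative and not part of the repo.
```python
# Sketch: build dev.tsv / dev.wrd / dev.ltr / dev.uid from (audio_path, transcript) pairs.
from pathlib import Path
import soundfile as sf

def write_manifest(outdir, samples, subset="dev"):
    outdir = Path(outdir)
    outdir.mkdir(parents=True, exist_ok=True)
    with open(outdir / f"{subset}.tsv", "w") as tsv, \
         open(outdir / f"{subset}.wrd", "w") as wrd, \
         open(outdir / f"{subset}.ltr", "w") as ltr, \
         open(outdir / f"{subset}.uid", "w") as uid:
        tsv.write("/\n")  # first line: audio root ("/" when using absolute paths)
        for audio, text in samples:
            nsample = sf.SoundFile(audio).frames                   # <number_of_sample>
            tsv.write(f"{audio}\t{nsample}\n")
            wrd.write(text + "\n")                                 # word-level transcript
            ltr.write(" ".join(text.replace(" ", "|")) + " |\n")   # letter-level transcript
            uid.write(Path(audio).stem + "\n")                     # utterance id

write_manifest("/path/to/manifest", [("/path/to/audio_1.wav", "this is one"),
                                     ("/path/to/audio_2.wav", "this is two")])
```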

Then run the command below:
```
lang_code=<iso_code>
PYTHONPATH=. PREFIX=INFER HYDRA_FULL_ERROR=1 python examples/speech_recognition/new/infer.py -m --config-dir examples/mms/config/ --config-name infer_common decoding.type=viterbi dataset.max_tokens=4000000 distributed_training.distributed_world_size=1 "common_eval.path='/path/to/asr/model'" task.data='/path/to/manifest' dataset.gen_subset="${lang_code}:dev" common_eval.post_process=letter
```
Available options:
* To get raw character-based output, set `common_eval.post_process=none`.

* To maximize GPU efficiency or avoid out-of-memory (OOM) errors, tune the `dataset.max_tokens` value.

* To run language model decoding, install the flashlight python bindings:
```
git clone --recursive git@github.com:flashlight/flashlight.git
cd flashlight;
git checkout 035ead6efefb82b47c8c2e643603e87d38850076
cd bindings/python
python3 setup.py install
```
Train a [KenLM language model](https://github.com/flashlight/wav2letter/tree/main/recipes/rasr#language-model) and prepare a lexicon file in [this](https://dl.fbaipublicfiles.com/wav2letter/rasr/tutorial/lexicon.txt) format.
```
LANG=<iso> # for example - 'eng', 'azj-script_latin'
PYTHONPATH=. PREFIX=INFER HYDRA_FULL_ERROR=1 python examples/speech_recognition/new/infer.py --config-dir=examples/mms/asr/config \
--config-name=infer_common decoding.type=kenlm distributed_training.distributed_world_size=1 \
decoding.unique_wer_file=true decoding.beam=500 decoding.beamsizetoken=50 \
task.data=<MANIFEST_FOLDER_PATH> common_eval.path='<MODEL_PATH.pt>' decoding.lexicon=<LEXICON_FILE> decoding.lmpath=<LM_FILE> \
decoding.results_path=<OUTPUT_DIR> dataset.gen_subset=${LANG}:dev decoding.lmweight=??? decoding.wordscore=???
```
We typically sweep `lmweight` in the range of 0 to 5 and `wordscore` in the range of -3 to 3. The output directory will contain the reference and hypothesis outputs from the decoder.

For decoding with character-based language models, use an empty lexicon (`decoding.lexicon=`), set `decoding.unitlm=True`, and sweep over `decoding.silweight` instead of `wordscore`.
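
The sweep itself is easy to script. Below is a rough sketch that shells out to the decoder for each (`lmweight`, `wordscore`) pair; the placeholders in angle brackets are the same as above and the grid values are only an example.
```python
# Sketch: grid sweep over decoding.lmweight and decoding.wordscore.
import itertools
import subprocess

LANG = "eng"  # example ISO code
for lmweight, wordscore in itertools.product([0, 1, 2, 3, 4, 5], [-3, -2, -1, 0, 1, 2, 3]):
    cmd = (
        "PYTHONPATH=. PREFIX=INFER HYDRA_FULL_ERROR=1 "
        "python examples/speech_recognition/new/infer.py --config-dir=examples/mms/asr/config "
        "--config-name=infer_common decoding.type=kenlm distributed_training.distributed_world_size=1 "
        "decoding.unique_wer_file=true decoding.beam=500 decoding.beamsizetoken=50 "
        "task.data=<MANIFEST_FOLDER_PATH> common_eval.path='<MODEL_PATH.pt>' "
        "decoding.lexicon=<LEXICON_FILE> decoding.lmpath=<LM_FILE> "
        f"decoding.results_path=<OUTPUT_DIR>/lm{lmweight}_ws{wordscore} "
        f"dataset.gen_subset={LANG}:dev decoding.lmweight={lmweight} decoding.wordscore={wordscore}"
    )
    subprocess.run(cmd, shell=True, check=True)
```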

### TTS
Note: clone and install [VITS](https://github.com/jaywalnut310/vits) before running inference.
```shell script
## English TTS
$ PYTHONPATH=$PYTHONPATH:/path/to/vits python examples/mms/tts/infer.py --model-dir /path/to/model/eng \
--wav ./example.wav --txt "Expanding the language coverage of speech technology \
has the potential to improve access to information for many more people"

## Maithili TTS
$ PYTHONPATH=$PYTHONPATH:/path/to/vits python examples/mms/tts/infer.py --model-dir /path/to/model/mai \
--wav ./example.wav --txt "मुदा आइ धरि ई तकनीक सौ सं किछु बेसी भाषा तक सीमित छल जे सात हजार \
सं बेसी ज्ञात भाषाक एकटा अंश अछी"
```
`example.wav` contains synthesized audio for the language.
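
To sanity-check the output programmatically, you can read it back with `soundfile` (already used by the inference scripts in this repo); a small sketch:
```python
# Sketch: inspect the synthesized waveform.
import soundfile as sf

wav, sr = sf.read("./example.wav")
print(f"{len(wav) / sr:.2f} s of audio at {sr} Hz")
```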


### LID


Prepare two files in the following format:
```
#/path/to/manifest.tsv
/
/path/to/audio1.wav
/path/to/audio2.wav
/path/to/audio3.wav
# /path/to/manifest.lang
eng 1
eng 1
eng 1
```
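
These two files can also be written with a few lines of Python. This is only a sketch: the helper name is illustrative, and the fixed second column of `1` simply mirrors the example above.
```python
# Sketch: write manifest.tsv and manifest.lang for LID inference.
from pathlib import Path

def write_lid_manifest(outdir, wavs_with_langs):
    outdir = Path(outdir)
    outdir.mkdir(parents=True, exist_ok=True)
    with open(outdir / "manifest.tsv", "w") as tsv, open(outdir / "manifest.lang", "w") as lang:
        tsv.write("/\n")  # first line: audio root
        for wav, iso in wavs_with_langs:
            tsv.write(wav + "\n")
            lang.write(f"{iso} 1\n")  # reference language code, as in the example above

write_lid_manifest("/path/to", [("/path/to/audio1.wav", "eng"),
                                ("/path/to/audio2.wav", "eng"),
                                ("/path/to/audio3.wav", "eng")])
```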

Download the model and the corresponding dictionary file for the LID model. The command below assumes there is a file named `dict.lang.txt` in `/path/to/dict/l126/`.
Use the following command to run inference:
```shell script
$ PYTHONPATH='.' python3 examples/mms/lid/infer.py /path/to/dict/l126/ --path /path/to/models/mms1b_l126.pt \
--task audio_classification --infer-manifest /path/to/manifest.tsv --output-path <OUTDIR>
```
`<OUTDIR>/predictions.txt` will contain the predictions from the model for the audio files in `manifest.tsv`.
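
If you also have reference labels in `manifest.lang`, a rough scoring sketch could look like the following. It assumes `predictions.txt` holds one predicted language code per line, in the same order as `manifest.tsv`; check the actual output format before relying on it.
```python
# Sketch: compare LID predictions against reference labels (format assumptions noted above).
with open("/path/to/manifest.lang") as f:
    ref = [line.split()[0] for line in f if line.strip()]
with open("predictions.txt") as f:
    hyp = [line.split()[0] for line in f if line.strip()]

correct = sum(r == h for r, h in zip(ref, hyp))
print(f"accuracy: {correct / len(ref):.2%} ({correct}/{len(ref)})")
```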


# License

The MMS code and model weights are released under the CC-BY-NC 4.0 license.

# Citation

**BibTeX:**

```
@article{pratap2023mms,
title={Scaling Speech Technology to 1,000+ Languages},
author={Vineel Pratap and Andros Tjandra and Bowen Shi and Paden Tomasello and Arun Babu and Sayani Kundu and Ali Elkahky and Zhaoheng Ni and Apoorv Vyas and Maryam Fazel-Zarandi and Alexei Baevski and Yossi Adi and Xiaohui Zhang and Wei-Ning Hsu and Alexis Conneau and Michael Auli},
journal={arXiv},
year={2023}
}
```
32 changes: 32 additions & 0 deletions examples/mms/asr/config/infer_common.yaml
@@ -0,0 +1,32 @@
# @package _global_
# defaults:
# - hydra/launcher: submitit_slurm

# @package _group_

task:
  _name: audio_finetuning
  data: null
  labels: ltr
common_eval:
  path: null
  post_process: letter
  # model_overrides: "{'task':{'multi_corpus_keys':None}}"
decoding:
  type: viterbi
  lexicon: null
  unique_wer_file: false
  results_path: null
distributed_training:
  ddp_backend: legacy_ddp
  distributed_world_size: 1
hydra:
  run:
    dir: ${common_eval.results_path}/${dataset.gen_subset}
  sweep:
    dir: /checkpoint/${env:USER}/${env:PREFIX}/${common_eval.results_path}
    subdir: ${dataset.gen_subset}
dataset:
  max_tokens: 2_000_000
  gen_subset: dev
  required_batch_size_multiple: 1
3 changes: 3 additions & 0 deletions examples/mms/asr/infer/example_infer_adapter.sh
@@ -0,0 +1,3 @@
#!/bin/bash
lang="$1"
PYTHONPATH=. PREFIX=INFER HYDRA_FULL_ERROR=1 python examples/speech_recognition/new/infer.py -m --config-dir examples/mms/asr/config/ --config-name infer_common decoding.type=viterbi dataset.max_tokens=4000000 distributed_training.distributed_world_size=1 "common_eval.path='/fsx-wav2vec/androstj/exps/wav2vec/mms/v4/finetune/xl1b_d5_dfls_0_0.3_u300k__ft_on_d5_127_dbeta1/ft_smax_adp_common.seed:1__dataset.max_tokens:2880000__optimization.lr:[0.001]__optimization.max_update:4000__merged_ckpt/checkpoints/checkpoint_last.pt'" task.data=/fsx-wav2vec/androstj/dataset/v4/fl/fseq dataset.gen_subset="${lang}:${lang}/dev" common_eval.post_process=none
52 changes: 52 additions & 0 deletions examples/mms/asr/infer/mms_infer.py
@@ -0,0 +1,52 @@
#!/usr/bin/env python -u
# Copyright (c) Facebook, Inc. and its affiliates.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.

import argparse
import soundfile as sf
import tempfile
from pathlib import Path
import os
import subprocess
import sys
import re

def parser():
    parser = argparse.ArgumentParser(description="ASR inference script for MMS model")
    parser.add_argument("--model", type=str, help="path to ASR model", required=True)
    parser.add_argument("--audio", type=str, help="path to audio file", required=True, nargs='+')
    parser.add_argument("--lang", type=str, help="audio language", required=True)
    parser.add_argument("--format", type=str, choices=["none", "letter"], default="letter")
    return parser.parse_args()

def process(args):
    with tempfile.TemporaryDirectory() as tmpdir:
        print(">>> preparing tmp manifest dir ...", file=sys.stderr)
        tmpdir = Path(tmpdir)
        # dev.tsv: audio root on the first line, then one "<path>\t<num_samples>" line per file
        with open(tmpdir / "dev.tsv", "w") as fw:
            fw.write("/\n")
            for audio in args.audio:
                nsample = sf.SoundFile(audio).frames
                fw.write(f"{audio}\t{nsample}\n")
        # dev.uid / dev.ltr / dev.wrd are placeholders; no reference transcript is needed
        with open(tmpdir / "dev.uid", "w") as fw:
            fw.write(f"{audio}\n" * len(args.audio))
        with open(tmpdir / "dev.ltr", "w") as fw:
            fw.write("d u m m y | d u m m y\n" * len(args.audio))
        with open(tmpdir / "dev.wrd", "w") as fw:
            fw.write("dummy dummy\n" * len(args.audio))
        cmd = f"""
PYTHONPATH=. PREFIX=INFER HYDRA_FULL_ERROR=1 python examples/speech_recognition/new/infer.py -m --config-dir examples/mms/asr/config/ --config-name infer_common decoding.type=viterbi dataset.max_tokens=4000000 distributed_training.distributed_world_size=1 "common_eval.path='{args.model}'" task.data={tmpdir} dataset.gen_subset="{args.lang}:dev" common_eval.post_process={args.format} decoding.results_path={tmpdir}
        """
        print(">>> loading model & running inference ...", file=sys.stderr)
        subprocess.run(cmd, shell=True, stdout=subprocess.DEVNULL)
        # each hypothesis line in hypo.word ends with an "(<id>)" suffix, stripped before printing
        with open(tmpdir / "hypo.word") as fr:
            for ii, hypo in enumerate(fr):
                hypo = re.sub(r"\(\S+\)$", "", hypo).strip()
                print(f"===============\nInput: {args.audio[ii]}\nOutput: {hypo}")


if __name__ == "__main__":
    args = parser()
    process(args)
47 changes: 47 additions & 0 deletions examples/mms/data_prep/README.md
@@ -0,0 +1,47 @@
# Data Preparation

We describe the process of aligning long audio files with their transcripts and generating shorter audio segments below.

- Step 1: Download and install torchaudio using the nightly version. We have open sourced the CTC forced alignment algorithm described in our paper via [torchaudio](https://github.com/pytorch/audio/pull/3348).
```
pip install --pre torchaudio --index-url https://download.pytorch.org/whl/nightly/cu118
```

- Step 2: Download [uroman](https://github.com/isi-nlp/uroman) from Github. It is a universal romanizer which converts text in any script to the Latin alphabet. Use [this link](https://www.isi.edu/~ulf/uroman.html) to try their web interface.
```
git clone git@github.com:isi-nlp/uroman.git
```

- Step 3: Install a few other dependencies
```
pip install sox
pip install dataclasses
```

- Step 4: Create a text file containing the transcript for a (long) audio file. Each line in the text file will correspond to a separate audio segment that will be generated upon alignment.

Example content of the input text file:
```
Text of the desired first segment
Text of the desired second segment
Text of the desired third segment
```

- Step 5: Run forced alignment and segment the audio file into shorter segments.
```
python align_and_segment.py --audio /path/to/audio.wav --textfile /path/to/textfile --lang <iso> --outdir /path/to/output --uroman /path/to/uroman/bin
```

The above command will generate the audio segments under the output directory based on the content of each line in the input text file. The `manifest.json` file lists the segmented audio filepaths and their corresponding transcripts.

```
> head /path/to/output/manifest.json
{"audio_start_sec": 0.0, "audio_filepath": "/path/to/output/segment1.flac", "duration": 6.8, "text": "she wondered afterwards how she could have spoken with that hard serenity how she could have", "normalized_text": "she wondered afterwards how she could have spoken with that hard serenity how she could have", "uroman_tokens": "s h e w o n d e r e d a f t e r w a r d s h o w s h e c o u l d h a v e s p o k e n w i t h t h a t h a r d s e r e n i t y h o w s h e c o u l d h a v e"}
{"audio_start_sec": 6.8, "audio_filepath": "/path/to/output/segment2.flac", "duration": 5.3, "text": "gone steadily on with story after story poem after poem till", "normalized_text": "gone steadily on with story after story poem after poem till", "uroman_tokens": "g o n e s t e a d i l y o n w i t h s t o r y a f t e r s t o r y p o e m a f t e r p o e m t i l l"}
{"audio_start_sec": 12.1, "audio_filepath": "/path/to/output/segment3.flac", "duration": 5.9, "text": "allan's grip on her hands relaxed and he fell into a heavy tired sleep", "normalized_text": "allan's grip on her hands relaxed and he fell into a heavy tired sleep", "uroman_tokens": "a l l a n ' s g r i p o n h e r h a n d s r e l a x e d a n d h e f e l l i n t o a h e a v y t i r e d s l e e p"}
```
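
Since `manifest.json` is written as one JSON object per line (as in the example above), it is straightforward to load for downstream processing; a short sketch:
```python
# Sketch: read the JSON-lines manifest produced by align_and_segment.py.
import json

with open("/path/to/output/manifest.json") as f:
    segments = [json.loads(line) for line in f]

for seg in segments:
    print(f"{seg['audio_filepath']} [{seg['audio_start_sec']:.1f}s, {seg['duration']:.1f}s]: {seg['text']}")
```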

To visualize the segmented audio files, the [Speech Data Explorer](https://github.com/NVIDIA/NeMo/tree/main/tools/speech_data_explorer) tool from the NeMo toolkit can be used.

As our alignment model outputs uroman tokens for input audio in any language, it also works with non-English audio and its corresponding transcripts.
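
For reference, the CTC forced-alignment primitive that this pipeline builds on is exposed through torchaudio (see Step 1). The following is a minimal sketch, assuming the `torchaudio.functional.forced_align` API added in the linked torchaudio PR, and using placeholder emissions and token IDs rather than real MMS model outputs:
```python
# Sketch: frame-level CTC forced alignment with torchaudio (placeholder inputs).
import torch
import torchaudio.functional as F

# (1, T, C) log-probabilities from an acoustic model -- random placeholder here
emission = torch.randn(1, 200, 32).log_softmax(dim=-1)
# (1, L) token IDs of the (romanized) transcript -- placeholder values
targets = torch.tensor([[5, 12, 7, 3, 9]], dtype=torch.int32)

# Returns per-frame token indices and scores, both of shape (1, T); blank=0 by default.
alignments, scores = F.forced_align(emission, targets, blank=0)
print(alignments.shape, scores.shape)
```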
