Wav2vec resources and models for Brazilian Portuguese

lucasgris/wav2vec4bp

Wav2vec 2.0 for Brazilian Portuguese 🇧🇷

This repository aims to develop audio technologies for Brazilian Portuguese, such as Automatic Speech Recognition (ASR), using Wav2vec 2.0.

Description

This repository contains code and fine-tuned Wav2vec checkpoints for Brazilian Portuguese, including some useful scripts to download and preprocess transcribed data.

Wav2vec 2.0 learns speech representations from unlabeled data, as described in wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations (Baevski et al., 2020). For more information about Wav2vec, see the official repository.

Tasks

  • Add CORAA to the BP Dataset (BP Dataset Version 2);
  • Release models fine-tuned on BP Dataset V2;
  • Fine-tune using the XLS-R 300M, XLS-R 1B and XLS-R 2B models.

Checkpoints

ASR checkpoints

We provide several Wav2vec models fine-tuned for ASR. For a more detailed description of how we fine-tuned these models, please check the paper Brazilian Portuguese Speech Recognition Using Wav2vec 2.0.

Our latest model is bp_400. It was fine-tuned on the 400-hour filtered version of the BP Dataset (see Brazilian Portuguese (BP) Dataset Version 1 below). The results on each gathered dataset are shown below.

Checkpoints of BP Dataset V1

| Model name | Pretrained model | Fairseq model | Dict | Hugging Face link |
|---|---|---|---|---|
| bp_400 | XLSR-53 | fairseq | dict | hugging face |
| bp_400_xls-r-300M | XLS-R-300M | fairseq | dict | hugging face |

Checkpoints of non-filtered BP Dataset (early version of the BP dataset)

| Model name | Pretrained model | Fairseq model | Dict | Hugging Face link |
|---|---|---|---|---|
| bp_500 | XLSR-53 | fairseq | dict | hugging face |
| bp_500_10k | VoxPopuli 10k BASE | fairseq | dict | hugging face |
| bp_500_100k | VoxPopuli 100k BASE | fairseq | dict | hugging face |

Checkpoints of each gathered dataset

| Model name | Pretrained model | Fairseq model | Dict | Hugging Face link |
|---|---|---|---|---|
| bp_cetuc_100 | XLSR-53 | fairseq | dict | hugging face |
| bp_commonvoice_100 | XLSR-53 | fairseq | dict | hugging face |
| bp_commonvoice_10 | XLSR-53 | fairseq | dict | hugging face |
| bp_lapsbm_1 | XLSR-53 | fairseq | dict | hugging face |
| bp_mls_100 | XLSR-53 | fairseq | dict | hugging face |
| bp_sid_10 | XLSR-53 | fairseq | dict | hugging face |
| bp_tedx_100 | XLSR-53 | fairseq | dict | hugging face |
| bp_voxforge_1 | XLSR-53 | fairseq | dict | hugging face |

Other checkpoints

We also provide other Wav2vec checkpoints. These models were trained on all the data available at the time, including the dev and test subsets of each dataset. Only the Common Voice dev and test sets were held out, for validation and testing respectively.

| Datasets used for training | Fairseq model | Dict | Hugging Face link |
|---|---|---|---|
| CETUC + CV 6.1 (only train) + LaPS BM + MLS + VoxForge | fairseq | dict | hugging face |
| CETUC + CV 6.1 (all validated) + LaPS BM + MLS + VoxForge | | | hugging face |

ASR Results

Summary (WER)
| Model | CETUC | CV | LaPS | MLS | SID | TEDx | VF | AVG |
|---|---|---|---|---|---|---|---|---|
| bp_400 | 0.052 | 0.140 | 0.074 | 0.117 | 0.121 | 0.245 | 0.118 | 0.124 |
| bp_400_xls-r-300M | 0.048 | 0.123 | 0.068 | 0.111 | 0.084 | 0.207 | 0.095 | 0.105 |
| bp_500 | 0.052 | 0.137 | 0.032 | 0.118 | 0.095 | 0.236 | 0.082* | 0.112 |
| bp_500-base10k_voxpopuli | 0.120 | 0.249 | 0.039 | 0.227 | 0.169 | 0.349 | 0.116* | 0.181 |
| bp_500-base100k_voxpopuli | 0.074 | 0.174 | 0.032 | 0.182 | 0.181 | 0.349 | 0.111* | 0.157 |
| bp_cetuc_100** | 0.446 | 0.856 | 0.089 | 0.967 | 1.172 | 0.929 | 0.902 | 0.765 |
| bp_commonvoice_100 | 0.088 | 0.126 | 0.121 | 0.173 | 0.177 | 0.424 | 0.145 | 0.179 |
| bp_commonvoice_10 | 0.133 | 0.189 | 0.165 | 0.189 | 0.247 | 0.474 | 0.251 | 0.235 |
| bp_lapsbm_1 | 0.111 | 0.418 | 0.145 | 0.299 | 0.562 | 0.580 | 0.469 | 0.369 |
| bp_mls_100 | 0.192 | 0.260 | 0.162 | 0.163 | 0.268 | 0.492 | 0.268 | 0.257 |
| bp_sid_10 | 0.186 | 0.327 | 0.207 | 0.505 | 0.124 | 0.835 | 0.472 | 0.379 |
| bp_tedx_100 | 0.138 | 0.369 | 0.169 | 0.165 | 0.794 | 0.222 | 0.395 | 0.321 |
| bp_voxforge_1 | 0.468 | 0.608 | 0.503 | 0.505 | 0.717 | 0.731 | 0.561 | 0.584 |

* We found a problem with the dataset used in these experiments regarding the VoxForge subset. In this test set, some speakers were also present in the training set (which explains the lower WER). The final version of the dataset does not have such contamination.

** We did not perform validation in the single-dataset experiments. CETUC has little variety in its transcriptions, so this model may have overfitted.

Transcription examples
| Text | Transcription |
|---|---|
| alguém sabe a que horas começa o jantar | alguém sabe a que horas começo jantar |
| lila covas ainda não sabe o que vai fazer no fundo | lilacovas ainda não sabe o que vai fazer no fundo |
| que tal um pouco desse bom spaghetti | quetá um pouco deste bom ispaguete |
| hong kong em cantonês significa porto perfumado | rongkong en cantones significa porto perfumado |
| vamos hackear esse problema | vamos rackar esse problema |
| apenas a poucos metros há uma estação de ônibus | apenas ha poucos metros á uma estação de ônibus |
| relâmpago e trovão sempre andam juntos | relampagotrevão sempre andam juntos |
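
As a reminder of how the metric behaves, here is a minimal sketch of the standard WER definition (word-level edit distance divided by the number of reference words), applied to the first transcription example above. It is an illustration only and not the evaluation code used to produce the results table; the function name `word_error_rate` is hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference prefix seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


wer = word_error_rate(
    "alguém sabe a que horas começa o jantar",
    "alguém sabe a que horas começo jantar",
)
print(f"WER = {wer:.3f}")  # one substitution + one deletion over 8 words
```

Because insertions count as errors, WER is not bounded by 1, which is how a value such as 1.172 can appear in the summary table.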

Datasets

Datasets provided:

  • CETUC: contains approximately 145 hours of Brazilian Portuguese speech from 50 male and 50 female speakers, each pronouncing approximately 1,000 phonetically balanced sentences selected from the CETEN-Folha corpus.
  • Common Voice 7.0: a project proposed by the Mozilla Foundation with the goal of creating a wide-open dataset in different languages. Volunteers donate and validate speech through the official site.
  • Lapsbm: "Falabrasil - UFPA" is a dataset used by the Fala Brasil group to benchmark ASR systems in Brazilian Portuguese. It contains 35 speakers (10 female), each pronouncing 20 unique sentences, for a total of 700 utterances. The audio was recorded at 22.05 kHz without environment control.
  • Multilingual Librispeech (MLS): a massive dataset available in many languages, based on public-domain audiobook recordings such as LibriVox. It contains a total of 6k hours of transcribed data. The Portuguese set used in this work (mostly the Brazilian variant) has approximately 284 hours of speech, obtained from 55 audiobooks read by 62 speakers.
  • Multilingual TEDx: a collection of audio recordings from TEDx talks in 8 source languages. The Portuguese set (mostly the Brazilian variant) contains 164 hours of transcribed speech.
  • Sidney (SID): contains 5,777 utterances recorded by 72 speakers (20 women) aged 17 to 59, with metadata fields such as place of birth, age, gender, education, and occupation.
  • VoxForge: a project with the goal of building open datasets for acoustic models. The corpus contains approximately 100 speakers and 4,130 utterances of Brazilian Portuguese, with sample rates varying from 16 kHz to 44.1 kHz.

These datasets were combined to build a larger Brazilian Portuguese dataset (BP Dataset). All data was used for training except Common Voice dev/test sets, which were used for validation/test respectively. We also made test sets for all the gathered datasets.

| Dataset | Train | Valid | Test |
|---|---|---|---|
| CETUC | 93.9h | -- | 5.4h |
| Common Voice | 37.6h | 8.9h | 9.5h |
| LaPS BM | 0.8h | -- | 0.1h |
| MLS | 161.0h | -- | 3.7h |
| Multilingual TEDx (Portuguese) | 144.2h | -- | 1.8h |
| SID | 5.0h | -- | 1.0h |
| VoxForge | 2.8h | -- | 0.1h |
| Total | 437.2h | 8.9h | 21.6h |

You can download the datasets individually using the scripts in the scripts/ directory. The scripts create the respective dev and test sets automatically.

python scripts/mls.py

If you want to join several datasets, run the join_datasets script in scripts/:

python scripts/join_datasets.py /path/to/dataset1/train /path/to/dataset2/train ... --output-dir data/my_dataset --output-name train

After joining datasets, you might have some degree of transcription contamination. To remove all transcriptions present in a specific subset (for example, test subset), you can use the filter_dataset script:

python scripts/filter_datasets.py /path/to/my_dataset/train /path/to/dataset1/test /path/to/dataset2/test --output-dir data/my_dataset --output-name my_filtered_train
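
The idea behind this filtering step can be sketched as follows. This is not the implementation of the repository script; it is a minimal illustration that assumes each subset is available as a list of `(audio_path, transcription)` pairs, and the function name `filter_contaminated` is hypothetical.

```python
def filter_contaminated(train, held_out):
    """Drop training items whose transcription also occurs in a held-out subset.

    Both arguments are lists of (audio_path, transcription) pairs
    (an assumed in-memory layout, not the repository's on-disk format).
    """
    def normalize(text):
        # Compare case-insensitively and ignore repeated whitespace
        return " ".join(text.lower().split())

    blocked = {normalize(text) for _, text in held_out}
    return [(path, text) for path, text in train if normalize(text) not in blocked]


train = [
    ("a.wav", "vamos hackear esse problema"),
    ("b.wav", "relâmpago e trovão sempre andam juntos"),
]
test = [("t.wav", "Vamos  hackear esse problema")]

filtered = filter_contaminated(train, test)
print(filtered)  # only the item whose transcription is absent from the test set remains
```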

Alternatively, download the raw datasets using the links below:

Brazilian Portuguese (BP) Dataset Version 1

The BP Dataset is an assembled dataset composed of many others in Brazilian Portuguese. We used the original test sets of each gathered dataset to make individual test sets. For the datasets without test sets, we created them by selecting 5% of unique male and female speakers. Additionally, we filtered the final training set to remove all transcriptions that appear in the test sets. We also excluded audio longer than 30 seconds.
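
A speaker-based split of this kind could look like the sketch below. It only illustrates the procedure described above and is not the code used to build the BP Dataset. The utterance schema (`speaker`, `gender`, `duration` keys), the function name, rounding the 5% up, and keeping at least one test speaker per gender are all assumptions.

```python
import math
import random


def split_by_speaker(utterances, test_fraction=0.05, max_seconds=30.0, seed=0):
    """Split utterances into train/test so that no speaker appears in both.

    `utterances` is a list of dicts with 'speaker', 'gender' and 'duration'
    keys (a hypothetical schema chosen for this sketch).
    """
    rng = random.Random(seed)
    # Exclude audio longer than the duration limit
    kept = [u for u in utterances if u["duration"] <= max_seconds]

    test_speakers = set()
    for gender in sorted({u["gender"] for u in kept}):
        speakers = sorted({u["speaker"] for u in kept if u["gender"] == gender})
        # Assumption: round up and keep at least one test speaker per gender
        n_test = max(1, math.ceil(test_fraction * len(speakers)))
        test_speakers.update(rng.sample(speakers, n_test))

    train = [u for u in kept if u["speaker"] not in test_speakers]
    test = [u for u in kept if u["speaker"] in test_speakers]
    return train, test


# Toy data: 20 speakers per gender, each with one short and one overlong utterance
utts = [
    {"speaker": f"{g}{i}", "gender": g, "duration": d}
    for g in "mf" for i in range(20) for d in (4.0, 31.0)
]
train_set, test_set = split_by_speaker(utts)
print(len(train_set), len(test_set))
```

Splitting by speaker rather than by utterance avoids the kind of speaker overlap between training and test sets that lowers WER artificially.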

If you run the provided scripts, you might generate a slightly different version of the BP dataset. If you want to use the same files used to train, validate and test our models, you can download the metadata here.

Other versions

Our first attempt to build a larger dataset for BP produced a 500-hour dataset. However, we found some problems with the VoxForge subset, and some transcriptions of the test sets were also present in the training set. The models trained with this version of the dataset (bp_500) are still available.

Language models

Language models can improve the ASR output. To use them with fairseq, you will need to install the Flashlight Python bindings. You will also need a lexicon containing the possible words.
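
If you need to build a lexicon for your own text, the sketch below shows one way to do it. It assumes the letter-based lexicon layout commonly used with fairseq Wav2vec models and the Flashlight decoder, where each line holds a word, a tab, and the word spelled out letter by letter followed by the word delimiter `|`. Check this against the lexicon provided in this repository before relying on it; the function name `build_lexicon` is hypothetical.

```python
def build_lexicon(sentences):
    """Return lexicon lines of the form 'word<TAB>w o r d |' for every unique word."""
    words = sorted({word for sentence in sentences for word in sentence.split()})
    return [f"{word}\t{' '.join(word)} |" for word in words]


lexicon = build_lexicon(["porto perfumado", "porto"])
print("\n".join(lexicon))
```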

KenLM models

You can download some KenLM models here. They are compatible with the flashlight decoder.

Transformer LM (fairseq) models

| Model name | Fairseq model | Dict |
|---|---|---|
| BP Transformer LM | fairseq model | dict |
| Wikipedia Transformer LM | fairseq model | dict |
| Wikipedia Pruned Transformer LM | fairseq model | dict |

Lexicon

🤗 Hugging Face Transformers + Wav2Vec2_PyCTCDecode

If you want to use Wav2Vec2_PyCTCDecode with Transformers to decode the Hugging Face models, the KenLM models provided above might not work. In this case, you should train your own following the instructions here, or use one of the two models trained with BP Dataset and Wikipedia below:

ASR finetune

  1. To fine-tune the model, first install fairseq and its dependencies:

cd fairseq
pip install -e .

  2. Download a pre-trained model (see Pretrained models).

  3. Create or use a configuration file (see the configs/ directory).

  4. Fine-tune the model by running fairseq-hydra-train:

root=/path/to/wav2vec4bp
fairseq-hydra-train \
   task.data=$root/data/my_dataset \
   checkpoint.save_dir=$root/checkpoints/stt/my_model_name \
   model.w2v_path=$root/xlsr_53_56k.pt \
   common.tensorboard_logdir=$root/logs/stt/my_model_name \
   --config-dir $root/configs \
   --config-name my_configuration_file_name
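
The directory passed as task.data is expected by fairseq to contain, for each split, a manifest file ({split}.tsv) and, for letter-based CTC targets, a label file ({split}.ltr). The scripts in this repository may already generate these for you. If you need to create them for your own data, the following is a minimal sketch, assuming PCM WAV files readable by Python's `wave` module (Wav2vec models expect 16 kHz mono audio) and a list of `(relative_path, transcription)` pairs. The function name `write_manifest` is hypothetical.

```python
import os
import wave


def write_manifest(root, items, out_dir, split="train"):
    """Write {split}.tsv and {split}.ltr for fairseq audio fine-tuning.

    `items` is a list of (relative_wav_path, transcription) pairs
    (an assumed input layout for this sketch).
    """
    os.makedirs(out_dir, exist_ok=True)
    tsv_path = os.path.join(out_dir, f"{split}.tsv")
    ltr_path = os.path.join(out_dir, f"{split}.ltr")
    with open(tsv_path, "w", encoding="utf-8") as tsv, \
            open(ltr_path, "w", encoding="utf-8") as ltr:
        # First manifest line: root directory that the relative paths refer to
        tsv.write(os.path.abspath(root) + "\n")
        for rel_path, text in items:
            with wave.open(os.path.join(root, rel_path), "rb") as wav:
                n_frames = wav.getnframes()
            # Following lines: relative path and number of audio frames
            tsv.write(f"{rel_path}\t{n_frames}\n")
            # Labels: space-separated letters, with "|" marking word boundaries
            words = "|".join(text.split())
            ltr.write(" ".join(words) + " |\n")
    return tsv_path, ltr_path
```

The line order of the label file must match the manifest, since fairseq pairs them by position.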

Pretrained models

To fine-tune Wav2vec, you will need to download a pre-trained model first.

🤗 ASR finetune with HuggingFace

To fine-tune the model easily with Hugging Face, you can use the Wav2vec-wrapper repository.

Language model training

To train a language model, one can use a Transformer LM or KenLM.

KenLM

First, install KenLM.

git clone https://github.com/kpu/kenlm.git
cd kenlm
mkdir -p build
cd build
cmake ..
make -j 4

Then create a text file with one sentence per line and run the following command to train a 5-gram model:

./kenlm/build/bin/lmplz -o 5 <text.txt > path_to_lm.arpa
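
The text file should use the same character set as the transcriptions the acoustic model produces. A minimal normalization sketch is shown below, assuming lowercase text and the set of accented letters listed in `ALLOWED`; that set is an assumption and should be matched to the dict of the model you use. The function name `normalize_line` is hypothetical.

```python
import re

# Assumed character set for Brazilian Portuguese; adjust it to your model's dict
ALLOWED = "a-záàâãéêíóôõúüç"
NOT_ALLOWED = re.compile(f"[^{ALLOWED}\\s]")


def normalize_line(line: str) -> str:
    """Lowercase, replace characters outside ALLOWED with spaces, collapse whitespace."""
    line = NOT_ALLOWED.sub(" ", line.lower())
    return " ".join(line.split())


normalized = normalize_line("Hong Kong, em cantonês, significa: Porto Perfumado!")
print(normalized)
```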

Transformer LM

To train a Transformer LM, first prepare and preprocess train, valid and test text files:

TEXT=path/to/dataset
dataset=my_dataset  # placeholder name for the output directory under data/text/
fairseq-preprocess \
    --only-source \
    --trainpref $TEXT/train.tokens \
    --validpref $TEXT/valid.tokens \
    --testpref $TEXT/test.tokens \
    --destdir data/text/$dataset \
    --workers 20

Then train the model, with $dataset and $name set to your dataset and checkpoint directory names:

fairseq-train --task language_modeling \
  data/text/$dataset \
  --save-dir checkpoints/transformer_lms/$name \
  --arch transformer_lm --share-decoder-input-output-embed \
  --dropout 0.1 \
  --optimizer adam --adam-betas '(0.9, 0.98)' --weight-decay 0.01 --clip-norm 0.0 \
  --lr 0.0005 --lr-scheduler inverse_sqrt --warmup-updates 4000 --warmup-init-lr 1e-07 \
  --tokens-per-sample 512 --sample-break-mode none \
  --max-tokens 1024 --update-freq 32 \
  --fp16 \
  --max-update 50000

Docker

We recommend using a Docker container, such as flml/flashlight, to easily fine-tune and test your models.