Multilingual-PR

Implementation of the project Self-supervised pretraining for phoneme recognition, and generalization on foreign languages

Authors: Apavou Clément & Belkada Younes & Leo Tronchon & Arthur Zucker


This repository is powered by HuggingFace 🤗, PyTorch Lightning and Weights & Biases.

๐Ÿฆ Introduction

The scarcity of annotated data, and the heavy cost of producing it, limit our ability to train deep neural networks for audio processing tasks. Therefore, the speech community has developed feature-learning methods that require minimal annotated data, most of which fall under unsupervised and self-supervised techniques.

Recently, self-supervised learning methods have outperformed state-of-the-art approaches on textual downstream tasks by fine-tuning pretrained models on relatively small amounts of data. These approaches have since been tested on other modalities, such as images and audio.

Phoneme recognition is an exciting challenge that involves processing a raw audio recording and predicting the sequence of phonemes pronounced by the speaker. Throughout this project, we compare three self-supervised models, Wav2vec (2019, 2020), HuBERT (2021) and WavLM (2022), pretrained on a corpus of English speech, which we use in various ways to perform phoneme recognition in different languages with a network trained using the Connectionist Temporal Classification (CTC) algorithm. Several questions are addressed:

  • What is the impact of choosing English as the pretraining language, especially for languages that are very different from English? Which method(s) transfer knowledge best from English to other languages?
  • Which method extracts the best features for phoneme recognition?
  • How does the amount of training data influence model performance?

In this project, we address these questions by drawing conclusions from our experiments.

✨ Main features

  • Modularity between SOTA self-supervised speech models
  • Freedom to select any language available in CommonVoice hosted on HuggingFace
  • Nice visualization through wandb

✍️ Network Architecture for phoneme recognition

Diagram of the models used for the experiments. N=22 and h=1024 for HuBERT Large and WavLM Large, and N=11 and h=768 for Wav2vec2 Base and WavLM Base. Made by us.
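As a rough illustration of this architecture, here is a minimal sketch of a pretrained SSL encoder topped with a linear layer projecting onto the phoneme vocabulary (the checkpoint name and `vocab_size` are illustrative assumptions, not the repository's exact module):

```python
import torch
import torch.nn as nn
from transformers import WavLMModel  # Wav2Vec2Model / HubertModel follow the same pattern


class PhonemeRecognizer(nn.Module):
    """Pretrained SSL encoder + linear projection onto the phoneme vocabulary."""

    def __init__(self, vocab_size: int, pretrained_name: str = "microsoft/wavlm-base"):
        super().__init__()
        self.encoder = WavLMModel.from_pretrained(pretrained_name)
        self.head = nn.Linear(self.encoder.config.hidden_size, vocab_size)  # h -> phonemes

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, samples) of raw 16 kHz audio
        hidden = self.encoder(waveform).last_hidden_state  # (batch, frames, h)
        return self.head(hidden).log_softmax(dim=-1)       # log-probabilities for CTC
```

The returned log-probabilities are the per-frame input expected by `torch.nn.CTCLoss`.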

📚 Languages for which phoneme dictionaries are available

Dutch (du), Spanish (es), French (fr), Italian (it), Kyrgyz (ky), Russian (ru), Swedish (sv), Turkish (tr), Tatar (tt) and Mandarin (zh). From https://github.com/facebookresearch/CPC_audio.

🌟 Usage

Please refer to our example notebook if you want to train or test a model. To list the command line arguments that you can use, run the main script with its `--help` flag:
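```
python main.py --help
```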

Hparams ['parameters.hparams']:
  Hyperparameters for the run

  --wandb_entity str    wandb (default: asr-project)
  --debug bool          (default: False)
  --test bool           test code before running, if testing, no checkpoints are written (default: True)
  --wandb_project str   (default: test-asr)
  --root_dir str        root_dir (default: /home/arthur/Work/MVA-S2/Speech/Multilingual-PR)
  --seed_everything [int]
                        basic params (default: None)
  --gpu int             number of gpus (default: 1)
  --hparams.max_epochs int
                        maximum number of epochs (default: 100)
  --weights_path str    (default: /home/arthur/Work/MVA-S2/Speech/Multilingual-PR/weights)
  --tune_lr bool        modes (default: False)
  --dev_run bool        (default: False)
  --train bool          (default: True)
  --best_model str      (default: )
  --log_freq_audio int  (default: 10)
  --log_nb_audio int    (default: 2)
  --val_check_interval float
                        trainer params (default: 1.0)
  --limit_train_batches float
                        1.0 (default: 1.0)
  --limit_val_batches float
                        1.0 (default: 1.0)
  --enable_progress_bar bool
                        (default: True)
  --best_model_run str  testing params (default: WavLM_sv)
  --early_stopping bool
                        Early Stopping (default: True)
  --early_stopping_params typing.Dict[str, typing.Any]
                        (default: {'monitor': 'val/per', 'patience': 10, 'mode': 'min', 'verbose': True})

DatasetParams ['parameters.data_param']:
  Dataset Parameters
      ! The batch_size and number of crops should be defined here
      

  --dataset_name str    Hugging Face datasets parameters (default: common_voice)
  --use_auth_token bool
                        True if use mozilla-foundation datasets (default: False)
  --subset str          (default: sv-SE)
  --download_mode str   chosen language (see https://huggingface.co/datasets/common_voice) (default: reuse_dataset_if_exists)
  --cache_dir str       (default: /home/arthur/Work/MVA-S2/Speech/Multilingual-PR/assets)
  --language str        to create vocabulary of phonemes (default: sv)
  --root_path_annotation str
                        (default: /home/arthur/Work/MVA-S2/Speech/Multilingual-PR/assets/common_voices_splits)
  --phoible_csv_path str
                        (default: /home/arthur/Work/MVA-S2/Speech/Multilingual-PR/assets)
  --num_workers int     Dataloader parameters (default: 20)
  --batch_size int      (default: 2)
  --max_input_length_in_sec float
                        Dataset processing parameters (default: 5)
  --num_proc int        (default: 4)
  --create_dataset bool
                        (default: False)

NetworkParams ['parameters.network_param']:
  NetworkParams(network_name: str = 'WavLM', pretrained_name: Union[str, NoneType] = '', freeze: bool = True, freeze_transformer: bool = True, eos_token: str = '</s>', bos_token: str = '<s>', unk_token: str = '<unk>', pad_token: str = '<pad>', word_delimiter_token: str = '|')

  --network_name str    Hubert, Wav2Vec2, WavLM (default: WavLM)
  --pretrained_name [str]
                        (default: )
  --freeze bool         (default: True)
  --freeze_transformer bool
                        (default: True)
  --eos_token str       Phoneme Tokenizer (default: </s>)
  --bos_token str       (default: <s>)
  --unk_token str       (default: <unk>)
  --pad_token str       (default: <pad>)
  --word_delimiter_token str
                        (default: |)

OptimizerParams ['parameters.optim_param']:
  Optimization parameters

  --optimizer str       (default: AdamW)
  --lr float            (default: 0.02)
  --weight_decay float  (default: 1e-08)
  --accumulate_grad_batches int
                        1 for no accumulation (default: 16)
  --scheduler [str]     Scheduler parameters (default: None)
  --optim_param.max_epochs int
                        Cosine, ReduceLROnPlateau, MultiStepLR, StepLR or None Cosine scheduler (default: 10)
  --warmup_epochs int   (default: 1)
  --warmup_start_lr float
                        (default: 0.0006)
  --eta_min float       (default: 5e-06)
  --step_size int       Step LR scheduler (default: 2)
  --gamma float         also for multi step lr (default: 0.1)
  --milestones str      MultiStepLR scheduler (default: [8, 10, 15])
  --min_lr float        ReduceLROnPlateau scheduler (default: 5e-09)
  --patience int        (default: 10)
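A training run then combines these flags. For instance, to train HuBERT on Swedish (an illustrative command assembled from the flags listed above, not one taken from the repository):

```
python main.py --network_name Hubert --subset sv-SE --language sv --hparams.max_epochs 100
```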

🔉 Dataset

The project is based on the Mozilla CommonVoice dataset available on HuggingFace. When the script is launched, the program automatically downloads the correct dataset and converts the ground-truth sentences to phonemes using phonemizer. You are free to choose any dataset available on HuggingFace that is covered by the phoneme dictionaries cited previously. For our experiments we use:

it, nl, tr, ru, sv-SE

Feel free to try any other languages and submit a Pull Request 🔌.
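Under the hood, the ground-truth transcripts are converted to phoneme sequences with phonemizer; a minimal sketch of that preprocessing step (the backend and language code are illustrative):

```python
from phonemizer import phonemize

# Turn a ground-truth transcript into its phoneme sequence (espeak backend, Swedish)
sentence = "hej världen"
phonemes = phonemize(sentence, language="sv", backend="espeak", strip=True)
print(phonemes)  # IPA phoneme string used as the CTC target
```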

📎 Pre-trained models

Schema of Wav2vec2, HuBERT and WavLM.

For our experiments, we used models hosted on the Hugging Face Hub that were pre-trained on 960 hours of 16 kHz English speech from the LibriSpeech dataset. The following pre-trained models were used: Wav2Vec2 Base, HuBERT Large, WavLM Base and WavLM Large.
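These checkpoints can be pulled directly from the Hub; a minimal sketch (the public checkpoint identifiers below are our assumption of the ones used):

```python
from transformers import Wav2Vec2Model, HubertModel, WavLMModel

wav2vec2_base = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
hubert_large = HubertModel.from_pretrained("facebook/hubert-large-ll60k")
wavlm_base = WavLMModel.from_pretrained("microsoft/wavlm-base")
wavlm_large = WavLMModel.from_pretrained("microsoft/wavlm-large")
```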

👪 Language Family

The language family tree can be found in the following figure. It gives insight into the genetic proximity of each language to English.

| Language | Family | Proximity with English |
| --- | --- | --- |
| Italian 🇮🇹 | Romance | 47.8 |
| Russian 🇷🇺 | East Slavic | 60.3 |
| Dutch 🇳🇱 | West Germanic | 27.2 |
| Swedish 🇸🇪 | North Germanic | 26.7 |
| Turkish 🇹🇷 | Turkic | 92.0 |

Genetic proximity between the studied languages and English, computed [here](http://www.elinguistics.net/Compare_Languages.aspx). [1, 30]: highly related languages; [30, 50]: related languages; [50, 70]: remotely related languages; [70, 78]: very remotely related languages; [78, 100]: no recognizable relationship.

English is a part of the West Germanic family.
Source: https://github.com/espeak-ng/espeak-ng/blob/master/docs/languages.md and http://www.elinguistics.net/Compare_Languages.aspx

📈 Main results

Dataset: Common Voice Corpus 6.1: https://commonvoice.mozilla.org/fr/datasets

Transferring pretrained English models to other languages.
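All tables report the phoneme error rate (PER): the Levenshtein distance between the predicted and reference phoneme sequences, divided by the reference length. The repository ships a torchmetrics implementation in utils/per.py; a minimal pure-Python sketch of the metric:

```python
def per(reference: list[str], prediction: list[str]) -> float:
    """Phoneme error rate: edit distance / reference length."""
    m, n = len(reference), len(prediction)
    # dp[i][j] = edit distance between reference[:i] and prediction[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == prediction[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[m][n] / max(m, 1)


# e.g. one substitution over three reference phonemes -> PER = 1/3
assert abs(per(["h", "ɛ", "j"], ["h", "e", "j"]) - 1 / 3) < 1e-9
```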

🚀 Fine-tuning

| Language | Training data (in hours) | Model | PER validation | PER test |
| --- | --- | --- | --- | --- |
| Italian 🇮🇹 | 62.34 | Wav2Vec2 Base | 19.05 | 17.95 |
| | | HuBERT Large | 14.05 | 12.67 |
| | | WavLM Base | 19.83 | 25.60 |
| Russian 🇷🇺 | 15.55 | Wav2Vec2 Base | 32.16 | 31.66 |
| | | HuBERT Large | 25.10 | 24.09 |
| | | WavLM Base | 20.25 | 18.88 |
| Dutch 🇳🇱 | 12.78 | Wav2Vec2 Base | 16.18 | 20.83 |
| | | HuBERT Large | 12.77 | 16.49 |
| | | WavLM Base | 15.96 | 19.91 |
| Swedish 🇸🇪 | 3.22 | Wav2Vec2 Base | 26.50 | 24.16 |
| | | HuBERT Large | 21.77 | 19.38 |
| | | WavLM Base | 26.86 | 24.61 |
| Turkish 🇹🇷 | 2.52 | Wav2Vec2 Base | 19.62 | 19.03 |
| | | HuBERT Large | 15.51 | 14.19 |
| | | WavLM Base | 19.85 | 18.95 |
| Average | - | Wav2Vec2 Base | 22.70 | 22.73 |
| | | HuBERT Large | 17.84 | 17.36 |
| | | WavLM Base | 20.55 | 21.59 |

Table of experiments where the models are **fine-tuned**. Here, we compare 3 different pretrained models, fine-tuned on the phoneme recognition task for different languages and with varying amounts of training data.

🧊 Frozen Features

| Language | Training data (in hours) | Model | PER validation | PER test |
| --- | --- | --- | --- | --- |
| Italian 🇮🇹 | 62.34 | Wav2Vec2 Base | 38.94 | 36.84 |
| | | WavLM Base | 27.29 | 25.98 |
| | | HuBERT Large | 23.85 | 21.15 |
| | | WavLM Large | 21.02 | 18.80 |
| Russian 🇷🇺 | 15.55 | Wav2Vec2 Base | 50.11 | 48.69 |
| | | WavLM Base | 40.66 | 38.76 |
| | | HuBERT Large | 38.36 | 36.18 |
| | | WavLM Large | 34.48 | 32.26 |
| Dutch 🇳🇱 | 12.78 | Wav2Vec2 Base | 40.15 | 39.23 |
| | | WavLM Base | 34.94 | 35.67 |
| | | HuBERT Large | 27.62 | 26.68 |
| | | WavLM Large | 27.71 | 27.19 |
| Swedish 🇸🇪 | 3.22 | Wav2Vec2 Base | 50.30 | 45.23 |
| | | WavLM Base | 43.65 | 40.55 |
| | | HuBERT Large | 37.34 | 32.68 |
| | | WavLM Large | 37.25 | 33.14 |
| Turkish 🇹🇷 | 2.52 | Wav2Vec2 Base | 53.92 | 52.08 |
| | | WavLM Base | 47.18 | 45.53 |
| | | HuBERT Large | 39.55 | 37.08 |
| | | WavLM Large | 30.66 | 30.14 |
| Average | - | Wav2Vec2 Base | 46.68 | 44.41 |
| | | WavLM Base | 38.74 | 37.30 |
| | | HuBERT Large | 33.34 | 30.75 |
| | | WavLM Large | 30.22 | 28.31 |

Table of experiments using **frozen features**. Here, we compare 4 different pretrained models. The objective was to train a linear layer, on top of the pretrained models' frozen features, for the phoneme recognition task with different languages and varying amounts of training data.
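In this setting the encoder weights stay fixed and only the linear layer receives gradients. A minimal sketch of the freezing step, reusing the `PhonemeRecognizer` sketch from the architecture section (the `vocab_size` is illustrative; lr and weight decay follow the OptimizerParams defaults above):

```python
import torch

model = PhonemeRecognizer(vocab_size=50)  # sketch class from the architecture section

# Freeze the pretrained encoder; only the linear head stays trainable
for param in model.encoder.parameters():
    param.requires_grad = False

# AdamW with the defaults listed under OptimizerParams
optimizer = torch.optim.AdamW(model.head.parameters(), lr=0.02, weight_decay=1e-08)
```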

⌚ Training data

| Training set | Training data | Model | PER validation | PER test |
| --- | --- | --- | --- | --- |
| 5% | ~ 10 min | Wav2Vec2 Base | 55.35 | 50.91 |
| | | HuBERT Large | 44.96 | 39.38 |
| | | WavLM Base | 56.22 | 51.25 |
| 10% | ~ 20 min | Wav2Vec2 Base | 52.97 | 49.01 |
| | | HuBERT Large | 42.61 | 37.50 |
| | | WavLM Base | 46.54 | 43.64 |
| 50% | ~ 2 h | Wav2Vec2 Base | 51.23 | 46.24 |
| | | HuBERT Large | 39.91 | 35.27 |
| | | WavLM Base | 44.57 | 42.33 |
| 100% | ~ 3 h | Wav2Vec2 Base | 50.30 | 45.23 |
| | | HuBERT Large | 37.34 | 32.68 |
| | | WavLM Base | 43.65 | 40.55 |

Variation in the amount of training data with frozen features of models pre-trained with the 3 different methods. Language: Swedish 🇸🇪.

PER on the test and validation sets vs Training data for the Swedish language with frozen features.
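The training fractions above can be reproduced by subsampling the Common Voice training split. A minimal sketch with 🤗 datasets (the dataset name and subset match the DatasetParams defaults; whether the project shuffles before slicing is an assumption):

```python
from datasets import load_dataset

# Swedish Common Voice training split, as in the DatasetParams defaults
train = load_dataset("common_voice", "sv-SE", split="train")

# Keep a 5% subset (~10 min of speech), shuffling first so the slice is representative
fraction = 0.05
subset = train.shuffle(seed=0).select(range(int(fraction * len(train))))
```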

📌 Project structure

├── agents
|   ├── BaseTrainer.py
|
├── assets                      # database and phoneme vocab are put here
|
├── config
|   ├── hparams.py              # configuration file
|
├── Datasets
|   ├── datamodule.py           # PyTorch Lightning datamodule for the CommonVoice dataset
|
├── models
|   ├── BaseModule.py           # Lightning module
|   ├── models.py               # Wav2vec2, WavLM and HuBERT using the Hugging Face library
|
├── utils                       # utility functions
|   ├── agent_utils.py
|   ├── callbacks.py
|   ├── dataset_utils.py
|   ├── logger.py
|   ├── metrics.py
|   ├── per.py                  # torchmetrics implementation of the phoneme error rate
|
├── hparams.py                  # configuration file
|
├── main.py                     # main script to launch for training or inference
|
└── README.md

⚡ Powered by

Hugging Face · Weights & Biases · PyTorch Lightning
