GitHub - Rumeysakeskin/ASR-fine-tuning-for-low-resource-languages: Transfer learning for ASR with subword encoding CTC model (NVIDIA NeMo Citrinet) on low-resource languages

Finetuning ASR Model on Low Resource Languages (Turkish)

For this project, we fine-tune an ASR model on a Turkish speech dataset. The repository also discusses in detail how to fine-tune a pre-trained subword-based (character n-gram) CTC model on a new low-resource language with a small dataset.

Table of Contents 🎉

Download and Prepare Free Audio Data for ASR
Custom ASR Data Preparation
Text Pre-processing (Normalization, Clean-up)
Speech Data Augmentation
Sub-word Encoding CTC Model
The necessity of subword tokenization
Build Custom Subword Tokenizer
Specifying Model with YAML Config File
Citrinet Model Parameters
Specifying the Tokenizer to The Model and Update Custom Vocabulary
Training with PyTorch Lightning

Download and Prepare Free Audio Data for ASR

You can download several common publicly available speech datasets (English, Turkish, and some other languages) and create a manifest.jsonl for them using my repository speech-datasets-for-ASR.

Custom ASR Data Preparation

The nemo_asr collection expects each dataset to consist of a set of utterances in individual audio files plus a manifest that describes the dataset, with one JSON object per line, each describing one utterance. Each line of the manifest (data/train_manifest.jsonl and data/val_manifest.jsonl) should be in the following format:

{"audio_filepath": "/data/train_wav/audio_1.wav", "duration": 2.836326530612245, "text": "bugün hava durumu nasıl"}

The audio_filepath field should provide an absolute path to the .wav file corresponding to the utterance. The text field should contain the full transcript for the utterance, and the duration field should reflect the duration of the utterance in seconds.
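As a rough illustration of the format described above, the following sketch writes and reads such a manifest using only the Python standard library. The helper names (`wav_duration_seconds`, `write_manifest`, `read_manifest`) are hypothetical and not part of NeMo, and the duration is read from the .wav header, which assumes uncompressed PCM files.

```python
import json
import wave
from pathlib import Path


def wav_duration_seconds(path):
    """Return the duration of a PCM .wav file in seconds."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def write_manifest(entries, manifest_path):
    """Write one JSON object per line for (wav_path, transcript) pairs."""
    with open(manifest_path, "w", encoding="utf-8") as out:
        for wav_path, transcript in entries:
            record = {
                # The manifest expects an absolute path to the audio file.
                "audio_filepath": str(Path(wav_path).resolve()),
                "duration": wav_duration_seconds(wav_path),
                "text": transcript,
            }
            # ensure_ascii=False keeps Turkish characters readable in the file.
            out.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_manifest(manifest_path):
    """Load a manifest back into a list of dicts, one per utterance."""
    with open(manifest_path, encoding="utf-8") as src:
        return [json.loads(line) for line in src if line.strip()]
```

Reading the manifest back with `read_manifest` is a cheap way to confirm that every line parses and that the durations look plausible before training.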

Text Pre-processing (Normalization, Clean-up)

Text cleaning and normalization prepare raw text for downstream processing. Open the following notebook for Turkish text normalization.

preprocess_manifest_file.ipynb
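The notebook's exact steps are not reproduced here. As a minimal sketch of what Turkish transcript normalization can involve, the code below lowercases with Turkish casing rules, strips punctuation, and collapses whitespace. The allowed alphabet and the circumflex mapping are assumptions made for illustration. Note that Python's default `"İ".lower()` yields `i` followed by a combining dot (U+0307), which is one way a stray `̇` symbol can end up in a tokenizer vocabulary.

```python
import re
import unicodedata

# Characters kept after clean-up (assumption): Turkish lowercase letters,
# the foreign letters q, w, x, and the space character.
ALLOWED = set("abcçdefgğhıijklmnoöprsştuüvyzqwx ")

# Map circumflex vowels to their base letters (assumption; adjust if needed).
CIRCUMFLEX = str.maketrans("âîû", "aiu")


def turkish_lower(text):
    """Lowercase with Turkish casing rules: 'I' -> 'ı' and 'İ' -> 'i'."""
    return text.replace("I", "ı").replace("İ", "i").lower()


def normalize_transcript(text):
    """Return a lowercase transcript with punctuation and extra spaces removed."""
    text = unicodedata.normalize("NFC", text)
    text = turkish_lower(text).translate(CIRCUMFLEX)
    # Replace anything outside the allowed alphabet with a space.
    text = "".join(ch if ch in ALLOWED else " " for ch in text)
    # Collapse runs of whitespace into single spaces.
    return re.sub(r"\s+", " ", text).strip()


print(normalize_transcript("Bugün HAVA, durumu nasıl?"))
```

Applying the same function to every `text` field of the training and validation manifests keeps the transcripts consistent with the tokenizer that is built from them later.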

Speech Data Augmentation

You can also use my repository speech-data-augmentation to increase the diversity of your dataset by artificially augmenting the data used for ASR model training.

Sub-word Encoding CTC Model

A sub-word encoding model accepts a sub-word tokenized text corpus and emits sub-word tokens in its decoding step. This repository details how to prepare a CTC model that uses a sub-word encoding scheme. We use a pre-trained Citrinet model, trained on roughly 7,000 hours of English speech, as the base model, and replace its decoder layer (thereby changing the model's vocabulary) before training.
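To make the decoding step concrete, here is a minimal sketch of greedy CTC decoding over sub-word tokens: consecutive repeats are collapsed, blank symbols are dropped, and the remaining pieces are joined. The toy vocabulary and frame sequence are hypothetical, and placing the blank id right after the last vocabulary entry is an assumption of this sketch rather than something the README states.

```python
def ctc_greedy_decode(frame_ids, vocabulary, blank_id):
    """Collapse repeated ids, drop blanks, and join sub-word pieces into text."""
    pieces = []
    previous = None
    for idx in frame_ids:
        # A token is emitted only when it differs from the previous frame
        # and is not the blank symbol.
        if idx != previous and idx != blank_id:
            pieces.append(vocabulary[idx])
        previous = idx
    # SentencePiece marks word starts with "▁"; turn it back into spaces.
    return "".join(pieces).replace("▁", " ").strip()


# Toy vocabulary (hypothetical); the blank id is placed after the last token.
vocab = ["<unk>", "▁", "a", "b", "h", "m", "e", "r", "▁mer", "ha", "ba"]
blank = len(vocab)

# One id per acoustic frame, as an argmax over the decoder output would give.
frames = [blank, 8, 8, blank, 9, 9, 9, blank, blank, 10, blank]
print(ctc_greedy_decode(frames, vocab, blank))
```

The blank symbol is what allows a genuinely doubled token to survive: `[2, blank, 2]` decodes to "aa", whereas `[2, 2]` collapses to "a".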

The necessity of subword tokenization

Subword tokenization is a middle ground between word-based and character-based tokenization. The main idea is to solve the issues faced by word-based tokenization (very large vocabulary size, a large number of OOV tokens, and very similar words being treated as unrelated tokens) and by character-based tokenization (very long sequences and less meaningful individual tokens).

As the corpus size increases, the number of unique words increases too, which leads to a larger vocabulary and causes memory and performance problems during processing.

Compared with character-level tokenization, subword tokenization reduces the length of the tokenized representation (making sequences shorter and more manageable for models to learn) and can also improve the accuracy with which the correct tokens are predicted.
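The trade-off can be seen on a toy example. The sketch below compares word-level, character-level, and sub-word representations for a handful of inflected forms of the Turkish word "ev" (house). The sub-word inventory is hand-picked for illustration and the greedy longest-match `segment` helper is hypothetical; a trained tokenizer would learn its own pieces.

```python
# Inflected forms of one stem; an agglutinative language produces many of these.
corpus = [
    "ev", "evler", "evlerde", "evlerden", "evimiz", "evimizde",
    "evimizden", "evlerimiz", "evlerimizde", "evlerimizden", "evde", "evden",
]

# Word-level: every inflected form is its own vocabulary entry.
word_vocab = set(corpus)

# Character-level: a small vocabulary, but one token per character.
char_vocab = set("".join(corpus))

# Sub-word level: a hand-picked (hypothetical) stem and suffix inventory.
subword_vocab = ["ev", "ler", "imiz", "de", "den"]


def segment(word, pieces):
    """Greedy longest-match segmentation; falls back to single characters."""
    out, i = [], 0
    ordered = sorted(pieces, key=len, reverse=True)
    while i < len(word):
        match = next((p for p in ordered if word.startswith(p, i)), word[i])
        out.append(match)
        i += len(match)
    return out


word = "evlerimizden"
print("word vocabulary size:", len(word_vocab))
print("sub-word vocabulary size:", len(subword_vocab))
print("character tokens:", len(word), "sub-word tokens:", len(segment(word, subword_vocab)))
```

Every new inflected form adds an entry to the word vocabulary, while the five sub-word pieces already cover all twelve forms with far fewer tokens per word than the character-level split.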

Some of the popular subword tokenization algorithms are WordPiece, Byte-Pair Encoding (BPE), Unigram, and SentencePiece.

BPE is used in language models like GPT-2, RoBERTa, XLM, FlauBERT, etc.
SentencePiece implements two sub-word segmentation algorithms, byte-pair encoding and the unigram language model (ULM). Unlike conventional BPE and ULM implementations, SentencePiece does not need pre-tokenized word sequences and can be trained directly on raw sentences.
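For intuition about how BPE arrives at its vocabulary, the following is a toy sketch of the merge-learning loop: start from characters, repeatedly count adjacent symbol pairs, and merge the most frequent pair into a new symbol. It is a simplified illustration of the general technique, not SentencePiece's actual implementation, and the function name `learn_bpe_merges` is hypothetical.

```python
from collections import Counter


def learn_bpe_merges(words, num_merges):
    """Learn up to num_merges BPE merges from a list of words (toy version)."""
    # Each word starts as a tuple of characters, prefixed with the "▁" marker.
    corpus = Counter(tuple("▁" + w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Most frequent pair; ties are broken by first occurrence.
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite the corpus with the chosen pair fused into one symbol.
        merged_corpus = Counter()
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged_corpus[tuple(out)] += freq
        corpus = merged_corpus
    return merges, corpus


merges, segmented = learn_bpe_merges(["evler", "evlerde", "evde", "evden"], 3)
print(merges)
```

On this tiny corpus the first merges build the shared stem "▁ev" and then the suffix "de", which is the behaviour that makes sub-word units a good fit for agglutinative languages such as Turkish.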

Build Custom Subword Tokenizer

We use the SentencePiece tokenizer in this study. The following NeMo script builds a tokenizer for the Turkish speech dataset.

!python scripts/process_asr_text_tokenizer.py \
  --manifest=$train_manifest \
  --vocab_size=$VOCAB_SIZE \
  --data_root=$tokenizer_dir \
  --tokenizer="spe" \
  --spe_type=$TOKENIZER_TYPE \
  --spe_character_coverage=1.0 \
  --no_lower_case \
  --log

Open the tokenizer_for_sub_word_encoding_CTC_model.ipynb notebook in Colab and create a custom tokenizer for your dataset.

Note: You can find more information about subword tokenization for your own language in Finetuning CTC models on other languages.

Our tokenizer is now built and stored inside the data_root directory that we provided to the script.
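The command above relies on several variables that the README does not define. The sketch below shows one way they might be set and how the same invocation can be assembled as an argument list. All values are hypothetical, and the output directory name (`tokenizer_spe_<type>_v<vocab_size>` under the data root) reflects my understanding of the script's naming convention, so it is worth confirming on disk.

```python
import os

# Hypothetical values; the README does not state the ones actually used.
train_manifest = "data/train_manifest.jsonl"
tokenizer_dir = "tokenizers"
VOCAB_SIZE = 128
TOKENIZER_TYPE = "bpe"  # SentencePiece model type, e.g. "bpe" or "unigram"

# Same flags as the notebook command, expressed as an argument list.
command = [
    "python", "scripts/process_asr_text_tokenizer.py",
    f"--manifest={train_manifest}",
    f"--vocab_size={VOCAB_SIZE}",
    f"--data_root={tokenizer_dir}",
    "--tokenizer=spe",
    f"--spe_type={TOKENIZER_TYPE}",
    "--spe_character_coverage=1.0",
    "--no_lower_case",
    "--log",
]

# Assumed location of the generated tokenizer files (verify after running).
TOKENIZER_DIR = os.path.join(
    tokenizer_dir, f"tokenizer_spe_{TOKENIZER_TYPE}_v{VOCAB_SIZE}"
)

print(" ".join(command))
print(TOKENIZER_DIR)
```

Outside a notebook, the list can be passed to `subprocess.run(command, check=True)` from the NeMo repository root, where the script path resolves.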

To check the tokenizer, get the subwords of a transcript, or tokenize a dataset, using the same tokenizer as the ASR model.

Output:

[NeMo I 2023-01-12 06:16:05 ctc_bpe_models:341] Changed tokenizer to ['<unk>', '▁', 'a', 'e', 'i', 'n', 'l', 'ı', 'k', 'r', 'm', 't', 'u', 'd', 'y', 's', 'b', 'o', 'z', 'ü', 'ş', 'ar', 'g', 'ç', 'h', 'v', 'p', 'c', 'f', 'ö', 'j', 'w', 'q', '̇', 'x', 'ğ'] vocabulary.
tokenizer: <nemo.collections.common.tokenizers.sentencepiece_tokenizer.SentencePieceTokenizer object at 0x7fde5605d280>
tokens: ['▁', 'm', 'e', 'r', 'h', 'a', 'b', 'a', '▁', 'n', 'a', 's', 'ı', 'l', 's', 'ı', 'n']
token_ids: [1, 10, 3, 9, 24, 2, 16, 2, 1, 5, 2, 15, 7, 6, 15, 7, 5]
subwords: ['▁', 'm', 'e', 'r', 'h', 'a', 'b', 'a', '▁', 'n', 'a', 's', 'ı', 'l', 's', 'ı', 'n']
text: merhaba nasılsın
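The relationship between the ids and the text in this output can be reproduced by hand. The sketch below copies the vocabulary and `token_ids` from the log above and maps the ids back to text in plain Python. It does not call the NeMo tokenizer, and the helper names are hypothetical.

```python
# Vocabulary copied from the log line above; the list index is the token id.
# "\u0307" is the combining dot that appears as a bare accent in the log.
vocab = ['<unk>', '▁', 'a', 'e', 'i', 'n', 'l', 'ı', 'k', 'r', 'm', 't', 'u',
         'd', 'y', 's', 'b', 'o', 'z', 'ü', 'ş', 'ar', 'g', 'ç', 'h', 'v', 'p',
         'c', 'f', 'ö', 'j', 'w', 'q', '\u0307', 'x', 'ğ']

token_ids = [1, 10, 3, 9, 24, 2, 16, 2, 1, 5, 2, 15, 7, 6, 15, 7, 5]


def ids_to_tokens(ids):
    """Look up each id in the vocabulary list."""
    return [vocab[i] for i in ids]


def tokens_to_text(tokens):
    """Join the pieces and turn the '▁' word-start marker back into spaces."""
    return "".join(tokens).replace("▁", " ").strip()


tokens = ids_to_tokens(token_ids)
print(tokens)
print(tokens_to_text(tokens))

# Pieces longer than one character show how much sub-word structure was learned.
print([t for t in vocab if len(t) > 1])
```

With only 36 entries, the vocabulary is almost entirely single characters ("ar" is the only learned multi-character piece), so this tokenizer behaves close to a character-level one. A larger `--vocab_size` would yield more multi-character pieces.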

Specifying Model with YAML Config File

For this project, we build the Citrinet model using the configuration found in configs/config_bpe.yaml. You can use another config file for your model from NeMo ASR conf.

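import os
from ruamel.yaml import YAML

# BRANCH must be set to the NeMo git branch or tag to download the config from
if not os.path.exists("configs/config_bpe.yaml"):
  !wget -P configs/ https://raw.githubusercontent.com/NVIDIA/NeMo/$BRANCH/examples/asr/conf/citrinet/config_bpe.yaml

config_path = "configs/config_bpe.yaml"

# Load the YAML config into a plain dictionary
yaml = YAML(typ='safe')
with open(config_path) as f:
    params = yaml.load(f)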
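Once loaded, `params` is an ordinary nested dictionary, so dataset-specific values can be filled in before the model is built. The sketch below uses a small stand-in dictionary instead of the real file. The key names (`train_ds`, `validation_ds`, `manifest_filepath`) follow my reading of the NeMo Citrinet config and should be checked against the file you downloaded, and the `missing_values` helper is hypothetical.

```python
# Stand-in for the dictionary returned by yaml.load(f); only a few keys are
# shown, and "???" marks values the config leaves for the user to fill in.
params = {
    "model": {
        "sample_rate": 16000,
        "train_ds": {"manifest_filepath": "???", "batch_size": 32},
        "validation_ds": {"manifest_filepath": "???", "batch_size": 32},
    }
}

# Point the config at the manifests described earlier in this README.
params["model"]["train_ds"]["manifest_filepath"] = "data/train_manifest.jsonl"
params["model"]["validation_ds"]["manifest_filepath"] = "data/val_manifest.jsonl"


def missing_values(cfg, prefix=""):
    """Return dotted keys whose value is still the '???' placeholder."""
    missing = []
    for key, value in cfg.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            missing.extend(missing_values(value, prefix=path + "."))
        elif value == "???":
            missing.append(path)
    return missing


print(missing_values(params))
```

Running such a check on the real config before constructing the model surfaces unset mandatory fields early, instead of partway through training setup.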

Citrinet Model Parameters

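import nemo.collections.asr as nemo_asr
from omegaconf import DictConfig
first_asr_model = nemo_asr.models.EncDecCTCModelBPE(cfg=DictConfig(params['model']))
first_asr_model = first_asr_model.restore_from("stt_en_contextnet_256.nemo")  # must be a CTC-BPE checkpoint for this class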

Specifying the Tokenizer to The Model and Update Custom Vocabulary

Specifying the tokenizer in the model parameters and changing the vocabulary of a sub-word encoding ASR model is as simple as passing the path of the tokenizer directory to change_vocabulary().

params['model']['tokenizer']['dir'] = TOKENIZER_DIR
params['model']['tokenizer']['type'] = 'bpe'

first_asr_model.change_vocabulary(new_tokenizer_dir=TOKENIZER_DIR, new_tokenizer_type="bpe")

Training with PyTorch Lightning

NeMo's models are based on PyTorch Lightning's LightningModule, and we use PyTorch Lightning for training and fine-tuning because it makes mixed precision and distributed training very easy.

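import pytorch_lightning as pl

# enable_checkpointing takes a bool; ModelCheckpoint objects go in callbacks=[...]
trainer = pl.Trainer(devices=1, accelerator='cpu', num_nodes=1,  # accelerator='ddp'
                     max_epochs=EPOCHS,
                     logger=wandb_logger, log_every_n_steps=1,
                     val_check_interval=1.0, enable_checkpointing=checkpoint_callback)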

first_asr_model.set_trainer(trainer)
trainer.fit(first_asr_model)
