LAS-Pytorch

This is my PyTorch implementation of Google's Listen, Attend and Spell (LAS) deep learning model for end-to-end ASR. I used both the Mozilla Common Voice dataset and the LibriSpeech dataset.

Figure: LAS network architecture

The feature transformation is done on the fly while loading the files thanks to torchaudio.
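
As a rough sketch, the per-item feature extraction looks something like this (illustrative code assuming log-mel features and the standard torchaudio / torch Dataset APIs; the class and parameter names are mine, not the repo's actual code):

import torchaudio
from torch.utils.data import Dataset

class AudioDataset(Dataset):
    def __init__(self, file_paths, transcripts, sample_rate=16000, n_mels=40):
        self.file_paths = file_paths
        self.transcripts = transcripts
        # Log-mel filterbanks are a common input feature for LAS-style models.
        self.melspec = torchaudio.transforms.MelSpectrogram(
            sample_rate=sample_rate, n_mels=n_mels
        )

    def __len__(self):
        return len(self.file_paths)

    def __getitem__(self, idx):
        # Load the waveform and compute features on the fly, per item.
        waveform, _ = torchaudio.load(self.file_paths[idx])
        features = self.melspec(waveform)  # shape: (channels, n_mels, time)
        return features.squeeze(0).transpose(0, 1), self.transcripts[idx]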

Results

These are the LER (letter error rate) and loss metrics after 4 epochs of training with a considerably smaller architecture, since my GPU didn't have enough memory: the Listener had 128 neurons and 2 layers, while the Speller had 256 neurons and 2 layers as well.
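
In plain PyTorch terms, those sizes correspond roughly to the following (a minimal sketch using standard nn.LSTM modules; the input sizes are illustrative, and the real LAS Listener is a pyramidal BLSTM, which this sketch does not reproduce):

import torch.nn as nn

# Listener (encoder): 2 layers, 128 hidden units, bidirectional.
listener = nn.LSTM(input_size=40, hidden_size=128, num_layers=2,
                   bidirectional=True, batch_first=True)

# Speller (decoder): 2 layers, 256 hidden units.
speller = nn.LSTM(input_size=256, hidden_size=256, num_layers=2,
                  batch_first=True)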

We can see that the model is able to learn from the data we feed it, but it still needs more training and a properly sized architecture.

Plots: letter error rate (LER) and loss during training.

If we try to predict on a sample of audio, the results currently look like this:

true_y: ['A', 'N', 'D', '', 'S', 'T', 'I', 'L', 'L', '', 'N', 'O', '', 'A', 'T', 'T', 'E', 'M', 'P', 'T', '', 'B', 'Y', '', 'T', 'H', 'E', '', 'P', 'O']

pred_y: ['A', 'N', 'D', '', 'T', 'H', 'E', 'L', 'L', '', 'T', 'O', 'T', 'M', '', 'T', 'E', 'N', 'P', 'T', '', 'O', 'E', '', 'T', 'H', 'E', '', 'S', 'R']

Only the conjunctions are being properly identified, which suggests the model needs longer training to learn more specific words.
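
For reference, the LER above is just a normalized edit distance over characters; here is a minimal sketch of that computation (my own, not necessarily how this repo computes it):

def edit_distance(ref, hyp):
    # Standard dynamic-programming Levenshtein distance over characters.
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (r != h))  # substitution
    return dp[-1]

def ler(ref, hyp):
    return edit_distance(ref, hyp) / len(ref)

print(ler("ATTEMPT", "TENPT"))  # 3 edits / 7 letters ~= 0.43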

I will train more and update the results here; I am still looking for cloud compute credits.

How to run it

Requirements

The code is set up to run with both the Mozilla Common Voice dataset and the LibriSpeech dataset. To run it, download the datasets and extract them under data/, or run the script utils/download_data.py, which will download and extract them in the following format:

Data

data
├── LibriSpeech
│   ├── BOOKS.TXT
│   ├── CHAPTERS.TXT
│   ├── dev-clean/
│   ├── LICENSE.TXT
│   ├── README.TXT
│   ├── SPEAKERS.TXT
│   ├── test-clean/
│   └── train-clean-100/
└── mozilla
    ├── dev.tsv
    ├── invalidated.tsv
    ├── mp3/
    ├── other.tsv
    ├── test.tsv
    ├── train.tsv
    └── validated.tsv

So run:

# Remove a flag if you want to skip downloading that dataset
$ python utils/download_data.py --libri --common

Then run the following commands to process and collect all files:

# From the repository root
$ python utils/prepare_librispeech.py --root $ABSOLUTE_PATH_TO_DATASET
$ python utils/prepare_common-voice.py --root $ABSOLUTE_PATH_TO_DATASET

This will create a processed/ folder inside each of the datasets, containing the CSV files with the data necessary for training, along with vocabulary and word-count files.

Training

Execute the training script with the YAML config file for the desired dataset:

$ python train.py --config_path config/librispeech-config.yaml
# Or
$ python train.py --config_path config/common_voice-config.yaml
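
Inside the script, a config like that can be read with PyYAML; a minimal sketch (the actual keys depend on the repo's config files):

import yaml

with open("config/librispeech-config.yaml") as f:
    config = yaml.safe_load(f)  # nested dict of training/model settings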

Loss and LER will be logged to the runs/ folder; you can check them by running tensorboard in the root directory.
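
For example:

$ tensorboard --logdir runs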
