# Pytorch-End-to-End-ASR-on-TIMIT

A BiGRU encoder with an attention decoder, based on "Listen, Attend and Spell" [1].
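For orientation, here is a minimal PyTorch sketch of that encoder/decoder shape, using Luong-style dot-product attention [3]. All module choices and sizes (layer counts, the 256/512 hidden units, the 240-dimensional stacked input described below) are illustrative assumptions; the actual model in this repository may differ.

```python
import torch
import torch.nn as nn

# Illustrative sketch only; dimensions and structure are assumptions,
# not necessarily the repository's actual model.

class Encoder(nn.Module):
    def __init__(self, input_dim=240, hidden_dim=256):  # 240 = 80 fbanks x 3 stacked frames
        super().__init__()
        self.rnn = nn.GRU(input_dim, hidden_dim, num_layers=2,
                          bidirectional=True, batch_first=True)

    def forward(self, x):            # x: (batch, time, input_dim)
        out, _ = self.rnn(x)         # out: (batch, time, 2 * hidden_dim)
        return out

class AttentionDecoderStep(nn.Module):
    """One decoding step with Luong-style dot-product attention [3]."""
    def __init__(self, vocab_size, enc_dim=512, dec_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dec_dim)
        self.cell = nn.GRUCell(dec_dim + enc_dim, dec_dim)
        self.proj = nn.Linear(dec_dim + enc_dim, vocab_size)

    def forward(self, y_prev, h_prev, context_prev, enc_out):
        e = self.embed(y_prev)                        # (batch, dec_dim)
        h = self.cell(torch.cat([e, context_prev], -1), h_prev)
        scores = torch.bmm(enc_out, h.unsqueeze(2))   # (batch, time, 1)
        attn = torch.softmax(scores, dim=1)           # attention weights
        context = (attn * enc_out).sum(dim=1)         # (batch, enc_dim)
        logits = self.proj(torch.cat([h, context], -1))
        return logits, h, context
```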

The acoustic features are 80-dimensional filter banks. Every 3 consecutive frames are stacked into one, reducing the time resolution by a factor of 3.
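As a concrete illustration, here is a minimal sketch of that stacking step, assuming features of shape (time, 80); the repository's exact implementation may differ.

```python
import numpy as np

def stack_frames(feats, n=3):
    """Stack every n consecutive frames into one (illustrative sketch)."""
    T = feats.shape[0] // n * n              # drop trailing frames
    return feats[:T].reshape(T // n, -1)     # (time/n, 80*n)

fbanks = np.random.randn(300, 80)            # dummy 80-dim filter banks
stacked = stack_frames(fbanks)
print(stacked.shape)                         # (100, 240)
```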

Following the standard recipe, we use the 462-speaker training set with all SA records removed. Outputs are mapped to 39 phonemes when evaluating.
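The 61-to-39 fold is the standard mapping from Lee and Hon (1989). A partial, illustrative sketch follows; it shows only a few representative merges, not the full table, and is not necessarily the exact code used here.

```python
# Partial sketch of the standard 61 -> 39 phoneme fold (Lee & Hon, 1989).
FOLD = {
    'pcl': 'sil', 'tcl': 'sil', 'kcl': 'sil',   # closures fold to silence
    'bcl': 'sil', 'dcl': 'sil', 'gcl': 'sil',
    'h#': 'sil', 'pau': 'sil', 'epi': 'sil',
    'ix': 'ih', 'ax': 'ah', 'ux': 'uw',          # vowel merges
    'axr': 'er', 'zh': 'sh', 'hv': 'hh',
    'el': 'l', 'em': 'm', 'en': 'n',
    # the glottal stop 'q' is discarded entirely
}

def map_39(phones):
    return [FOLD.get(p, p) for p in phones if p != 'q']
```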

With this code you can achieve a phoneme error rate (PER) of about 22% on the core test set.
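PER is the phone-level edit (Levenshtein) distance between the hypothesis and the reference, divided by the reference length. A minimal sketch of that computation (not necessarily the scoring code used by eval.py):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance via dynamic programming."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[-1][-1]

def per(ref, hyp):
    return edit_distance(ref, hyp) / len(ref)
```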

## Usage

### Install requirements

```bash
$ pip install -r requirements.txt
```

### Prepare data

This will create lists (*.csv) of audio file paths along with their transcripts:

```bash
$ python prepare_data.py --root ${DIRECTORY_OF_TIMIT}
```
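A hedged sketch of consuming such a list; the file name and the (path, transcript) column layout are assumptions here, since the actual format is defined by prepare_data.py:

```python
import csv

# Assumes two columns per row: audio path, transcript. Check the files
# generated by prepare_data.py for the real layout.
with open('train.csv') as f:
    for path, transcript in csv.reader(f):
        print(path, transcript)
        break
```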

### Train

Check the available options:

```bash
$ python train.py -h
```

Use the default configuration for training:

```bash
$ python train.py exp/default.yaml
```

You can also write your own configuration file based on exp/default.yaml:

```bash
$ python train.py ${PATH_TO_YOUR_CONFIG}
```

### Show loss curve

With the default configuration, the training logs are stored in exp/default/history.csv. If you trained with a different configuration, point the script at the corresponding file:

```bash
$ python show_history.py exp/default/history.csv
```
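A minimal sketch of what such a plotting step can look like; it only assumes that loss columns contain "loss" in their header, so adjust to the actual columns in your history.csv:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Plot every column whose name mentions "loss" (assumption about the
# log format; inspect the CSV header to confirm).
hist = pd.read_csv('exp/default/history.csv')
for col in hist.columns:
    if 'loss' in col.lower():
        plt.plot(hist[col], label=col)
plt.xlabel('epoch')
plt.ylabel('loss')
plt.legend()
plt.show()
```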

### Test

During training, the program keeps monitoring the error rate on the development set. The checkpoint with the lowest error rate is saved in the logging directory (by default exp/default/best.pth).
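This "keep the best checkpoint" pattern looks roughly like the following sketch, with a dummy model and stand-in dev error rates; train.py's actual bookkeeping and checkpoint contents are assumptions:

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 2)                          # dummy stand-in model
dev_per_per_epoch = [0.40, 0.31, 0.28, 0.30]     # stand-in dev error rates

best_per = float('inf')
for epoch, dev_per in enumerate(dev_per_per_epoch):
    if dev_per < best_per:                       # new best on the dev set
        best_per = dev_per
        torch.save({'epoch': epoch, 'model': model.state_dict()},
                   'best.pth')
print(f'best dev PER: {best_per:.2f}')
```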

To evaluate the checkpoint on the test set, run:

```bash
$ python eval.py exp/default/best.pth
```

Or you can run inference on random audio from the test set and visualize the attention weights:

```bash
$ python inference.py exp/default/best.pth
```

Sample output:

```
Predict:
h# hh ih l pcl p gcl g r ey tcl d ix pcl p ih kcl k ix pcl p eh kcl k ix v dcl d ix tcl t ey dx ah v z h#
Ground-truth:
h# hh eh l pcl p gcl g r ey gcl t ix pcl p ih kcl k ix pcl p eh kcl k ix v pcl p ix tcl t ey dx ow z h#
```

## References

[1] W. Chan et al., "Listen, Attend and Spell", https://arxiv.org/pdf/1508.01211.pdf

[2] J. Chorowski et al., "Attention-Based Models for Speech Recognition", https://arxiv.org/pdf/1506.07503.pdf

[3] M. Luong et al., "Effective Approaches to Attention-based Neural Machine Translation", https://arxiv.org/pdf/1508.04025.pdf