
A TensorFlow Implementation of Listen, Attend and Spell

This is a TensorFlow implementation of end-to-end ASR. Although there are several fantastic GitHub repos in TensorFlow, this implementation does not rely on the tf.contrib.seq2seq API. In addition, performance on the LibriSpeech dev/test sets is evaluated, and the results are reasonable.

Overview

  • Components:
    • Char/subword text encoding.
    • MFCC/fbank acoustic features with per-utterance CMVN (see the sketch after this list).
    • LAS training (visualized with TensorBoard: loss, sample text outputs, features, alignments).
    • TFRecord dataset pipeline.
    • Batch testing with a greedy decoder.
    • Beam search decoder.
    • RNNLM.
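
Per-utterance CMVN normalizes each feature dimension with statistics computed from that single utterance. A minimal NumPy sketch of the idea (the function name and interface are illustrative, not the repo's actual API):

import numpy as np

def cmvn(feats, eps=1e-8):
    # feats: (num_frames, feat_dim) array for a single utterance,
    # e.g. MFCC-39. Normalize each dimension to zero mean and
    # unit variance using this utterance's own statistics.
    mean = feats.mean(axis=0, keepdims=True)
    std = feats.std(axis=0, keepdims=True)
    return (feats - mean) / (std + eps)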

Remarks

Note that this project is still in progress.

  • Notes

    • Currently, the model has only been tested on 39-dimensional MFCC features (13 static + delta + delta-delta).
    • CTC-related parts are not yet fully tested.
    • Volume augmentation is currently commented out because it showed little improvement.
  • Improvements

    • CNN-based Listener.
    • Location-aware attention.
    • Speed-perturbation augmentation.
    • Label smoothing. (IMPORTANT; see the sketch after this list.)
    • Scheduled learning rate.
    • Scheduled sampling.
    • Bucketing.
  • Some advice

    • Generally, LibriSpeech-100 is not large enough to train a well-performing LAS model.
    • In my experience, adding more data is the best policy.
    • Rather than watching the loss alone, check whether the model is learning correctly by monitoring the speech-text alignments in TensorBoard or by setting verbosity=1.
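
As promised above, here is a minimal sketch of label smoothing, assuming a TensorFlow 1.x-style graph; this shows the standard technique, not the repo's exact code:

import tensorflow as tf

def smoothed_cross_entropy(logits, labels, vocab_size, smoothing=0.1):
    # Spread `smoothing` probability mass uniformly over the off-classes,
    # keeping 1 - smoothing on the true label. Softened targets make the
    # decoder less over-confident, which tends to help beam search.
    soft_targets = tf.one_hot(labels, depth=vocab_size,
                              on_value=1.0 - smoothing,
                              off_value=smoothing / (vocab_size - 1))
    return tf.nn.softmax_cross_entropy_with_logits_v2(labels=soft_targets,
                                                      logits=logits)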

Requirements

pip3 install virtualenv
virtualenv --python=python3 venv
source venv/bin/activate
pip3 install -r requirements.txt

Arguments

The definitions of the arguments are described in las/arguments.py. You can modify all of them there before preprocessing, training, testing, and decoding.
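
For orientation, an arguments file of this kind is typically a thin argparse wrapper. The flags below match the ones used in the commands later in this README, but the defaults are illustrative assumptions, not the actual contents of las/arguments.py:

import argparse

def parse_args():
    # Illustrative subset only; see las/arguments.py for the real definitions.
    parser = argparse.ArgumentParser()
    parser.add_argument('--unit', type=str, default='char',
                        help='text unit: char or subword')
    parser.add_argument('--feat_dim', type=int, default=39,
                        help='acoustic feature dimension, e.g. MFCC-39')
    parser.add_argument('--feat_dir', type=str, default='./feats')
    parser.add_argument('--save_dir', type=str, default='./checkpoints')
    return parser.parse_args()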

Usage

0) Prepare data

  • LibriSpeech train/dev/test data
bash prepare_libri_data.sh 

1) Preprocess audio & text / training pipeline

Both the dataset pipeline and the training pipeline are included in run.sh.

bash run.sh

Train RNNLM: (NOT READY)

2) Testing

Testing with the greedy decoder.

# SPLIT: test or dev
python3 test.py --split SPLIT \
                --unit UNIT \
                --feat_dim FEAT_DIM \
                --feat_dir FEAT_DIR \
                --save_dir SAVE_DIR
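
The greedy decoder picks the argmax token at each decoder step until the end-of-sequence symbol is produced. A minimal sketch of the idea (the step interface and token IDs are illustrative, not the repo's actual API):

import numpy as np

def greedy_decode(step, init_state, sos_id, eos_id, max_len=200):
    # step(prev_token, state) -> (logits over vocab, next_state).
    tokens, prev, state = [], sos_id, init_state
    for _ in range(max_len):
        logits, state = step(prev, state)
        prev = int(np.argmax(logits))  # greedy: take the single best token
        if prev == eos_id:
            break
        tokens.append(prev)
    return tokens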

3) Decode

Beam search decoder.

# SPLIT: test or dev; a convert_rate of 0.24 is large enough
python3 decode.py --split SPLIT \
                  --unit UNIT \
                  --beam_size BEAM_SIZE \
                  --convert_rate 0.24 \
                  --apply_lm APPLY_LM \
                  --lm_weight LM_WEIGHT \
                  --feat_dim FEAT_DIM \
                  --feat_dir FEAT_DIR \
                  --save_dir SAVE_DIR
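
Beam search keeps the beam_size best partial hypotheses at each step instead of committing to the argmax; with --apply_lm, an RNNLM score can be added by shallow fusion with weight --lm_weight. A simplified sketch (the step and lm_score interfaces are illustrative assumptions):

import numpy as np

def beam_search(step, init_state, sos_id, eos_id, beam_size=5,
                max_len=200, lm_score=None, lm_weight=0.0):
    # Each hypothesis is (log_prob, token_list, decoder_state).
    beams, finished = [(0.0, [sos_id], init_state)], []
    for _ in range(max_len):
        candidates = []
        for score, toks, state in beams:
            logits, new_state = step(toks[-1], state)
            log_probs = logits - np.logaddexp.reduce(logits)  # log softmax
            for tok in np.argsort(log_probs)[-beam_size:]:    # top extensions
                s = score + log_probs[tok]
                if lm_score is not None:  # shallow fusion with an RNNLM
                    s += lm_weight * lm_score(toks + [int(tok)])
                candidates.append((s, toks + [int(tok)], new_state))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for cand in candidates[:beam_size]:
            (finished if cand[1][-1] == eos_id else beams).append(cand)
        if not beams:  # every surviving hypothesis has ended
            break
    best = max(finished + beams, key=lambda c: c[0])
    return best[1][1:]  # drop the start token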

TensorBoard

tensorboard --logdir ./summary

Results

Results for a model trained on LibriSpeech-360 (WER):

Model     dev-clean  test-clean
Char LAS  0.249      0.262
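
For reference, WER is the word-level Levenshtein distance between hypothesis and reference, divided by the reference length:

def wer(ref, hyp):
    # Standard dynamic-programming edit distance over words.
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            d[i][j] = min(d[i-1][j-1] + (r[i-1] != h[j-1]),  # substitution
                          d[i-1][j] + 1,                      # deletion
                          d[i][j-1] + 1)                      # insertion
    return d[len(r)][len(h)] / len(r)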

TODO

  • Add scheduled sampling.
  • Add location-aware attention.
  • Evaluate the performance with subword units: subword LAS training.
  • Decoding with a subword-based RNNLM.
  • Evaluate the performance of joint CTC training and decoding.