A speech recognition LAS model using various features. (Korean support is under development.)

Speech recognition using Melspectrogram and Spikegram

Description

This is a PyTorch implementation of Listen, Attend and Spell (LAS), published at ICASSP 2016 (Student Paper Award), trained on TIMIT. Feel free to use or modify it; any bug report or improvement suggestion will be appreciated. If you have any questions, please contact b03902034[AT]ntu.edu.tw.

TIMIT

The input feature is MFCC 39 (13 MFCCs + delta + accelerate), and the 61 output phoneme classes are mapped down to 39 classes during evaluation. This implementation achieves about 26% phoneme error rate on TIMIT's testing set (using the original setting in the paper without hyperparameter tuning; models are stored in checkpoint/). It is not a remarkable score, but note that a deep end-to-end ASR model without a specially designed loss function, such as LAS, requires a larger corpus to achieve outstanding performance.
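
For reference, this 39-dimensional feature can be computed with the python_speech_features package (listed under Packages below); the following is a minimal sketch, with the file name as a placeholder:

    import numpy as np
    import scipy.io.wavfile as wav
    from python_speech_features import mfcc, delta

    # 13 static MFCCs + delta + delta-delta ("accelerate") = 39 dims per frame
    rate, signal = wav.read("sample.wav")  # placeholder RIFF wave file
    feat = mfcc(signal, samplerate=rate, numcep=13)
    d1 = delta(feat, 2)                    # first-order differences
    d2 = delta(d1, 2)                      # second-order differences
    feat_39 = np.concatenate([feat, d1, d2], axis=1)  # shape: (num_frames, 39)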

  • Learning Curve

  • Attention Visualization & Recognition Result

Result of the first sample in the TIMIT testing set. The training log is available here; use tensorboard --logdir=las_example/ to access it.

Remarks

Differences from the paper

Be aware of some differences between this implementation and the originally proposed model:

  • Smaller Dataset

    Originally, LAS was trained on Google's private voice search dataset containing 2000 hours of data plus additional data augmentation. Here the model was trained on TIMIT, a MUCH smaller dataset, without any data augmentation. Even LibriSpeech is a relatively small corpus for LAS.

  • Different Metric

    On TIMIT, the evaluation criterion we chose is the Word Error Rate (WER) over the output phoneme sequence (i.e. the phoneme error rate, PER) instead of over real sentences composed of real words. A small example of this metric is sketched below.
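
    For concreteness, this metric can be computed with the editdistance package (see Packages below); a minimal sketch with made-up phoneme sequences:

        import editdistance

        # PER = Levenshtein distance between phoneme sequences / reference length
        ref = ["sil", "dh", "ah", "k", "ae", "t", "sil"]  # made-up reference
        hyp = ["sil", "d", "ah", "k", "ae", "sil"]        # made-up hypothesis
        per = editdistance.eval(ref, hyp) / len(ref)
        print(f"PER = {per:.2%}")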

  • Simplified Speller

    The Speller contains a single-layer LSTM instead of the 2-layer LSTM proposed in the paper. According to the response I got from a letter I wrote to the author, a single layer can achieve similar results.

  • Features for character prediction

    According to Equation (8) in the paper, the last layer of the Speller takes both the RNN output and the attention-based context as input and outputs a character distribution. However, the exact operation in this equation is unclear. In this implementation, the RNN output and the attention-based context are simply concatenated, as sketched below.
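
    A minimal sketch of this concatenation (batch size, hidden dimension, and the 63-symbol output vocabulary are made-up values):

        import torch
        import torch.nn as nn

        rnn_out = torch.randn(8, 512)          # speller RNN output (made-up batch 8, dim 512)
        context = torch.randn(8, 512)          # attention-based context vector
        char_layer = nn.Linear(512 + 512, 63)  # made-up 63-symbol output vocabulary
        # concatenate both inputs, then project to a distribution over output symbols
        char_dist = torch.softmax(char_layer(torch.cat([rnn_out, context], dim=-1)), dim=-1)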

Improvement

  • Multi-head Attention (MHA)

    Google released another paper introducing state-of-the-art end-to-end ASR based on LAS. According to that paper, changing the attention mechanism to MHA gives a remarkable performance improvement. We've implemented MHA as described in section 2.2.2 of the paper and enabled it when training on LibriSpeech. It is worth mentioning that MHA increases the training time of LAS (which was already slow), so consider disabling MHA by setting multi_head=1 in the config file on slower GPUs.
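
    The following is a rough sketch of the mechanism (one dot-product attention per head over the listener features, head contexts concatenated and merged); the projections and dimensions are assumptions, not the repo's exact code:

        import torch
        import torch.nn as nn

        class MultiHeadAttention(nn.Module):
            def __init__(self, dim, num_heads=4):
                super().__init__()
                self.num_heads = num_heads
                self.query_proj = nn.Linear(dim, dim * num_heads)
                self.key_proj = nn.Linear(dim, dim * num_heads)
                self.merge = nn.Linear(dim * num_heads, dim)

            def forward(self, decoder_state, listener_feats):
                # decoder_state: (B, dim), listener_feats: (B, T, dim)
                B, T, d = listener_feats.shape
                q = self.query_proj(decoder_state).view(B, self.num_heads, 1, d)
                k = self.key_proj(listener_feats).view(B, T, self.num_heads, d).permute(0, 2, 1, 3)
                attn = torch.softmax(q @ k.transpose(-1, -2) / d ** 0.5, dim=-1)  # (B, H, 1, T)
                v = listener_feats.unsqueeze(1).expand(B, self.num_heads, T, d)
                context = (attn @ v).reshape(B, self.num_heads * d)  # concat head contexts
                return self.merge(context)  # fused context vector, shape (B, dim)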

  • Label Smoothing

    Like MHA, label smoothing was mentioned in the same paper and shows a significant improvement on LAS. However, PyTorch's loss function design makes label smoothing hard to express with the built-in losses, so in this implementation it is achieved by a self-defined loss function (found in functions.py). That implementation may be numerically unstable compared to the native loss functions provided by PyTorch; you may disable label smoothing by setting it to 0 in the config file. Bug reports or suggestions on the label smoothing implementation will be very much appreciated.
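
    The underlying idea is roughly the following (cross-entropy against smoothed one-hot targets); this sketch is not necessarily identical to the code in functions.py:

        import torch

        def label_smoothing_loss(log_probs, target, smoothing=0.1):
            # log_probs: (batch, time, vocab) log-probabilities; target: (batch, time) indices
            vocab = log_probs.size(-1)
            with torch.no_grad():
                # put 1 - eps on the true class, spread eps over the remaining classes
                true_dist = torch.full_like(log_probs, smoothing / (vocab - 1))
                true_dist.scatter_(-1, target.unsqueeze(-1), 1.0 - smoothing)
            return (-true_dist * log_probs).sum(dim=-1).mean()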

Requirements

Execution Environment

  • Python 3
  • GPU computing is recommended for training efficiency
  • Computing power and memory space (both RAM and GPU RAM) are extremely important if you'd like to train your own model, especially on LibriSpeech.

Packages

  • SoX

    Command-line tool for converting the raw wave files in TIMIT from NIST to RIFF format

  • python_speech_features

    A Python package for extracting MFCC and other acoustic features during preprocessing

  • pydub

    High-level API for audio file format conversion

  • joblib

    Parallelization tool to speed up feature extraction and dataset loading.

  • tqdm

    Progress bar for visualization.

  • PyTorch (0.4.0)

    Please use PyTorch 0.4.0, in which loss computation over 2D targets is available and the softmax bug on 3D input is fixed.
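
    For example, a single cross-entropy call over a 2D (batch x time) target works directly in 0.4.0; the sizes below are made up:

        import torch
        import torch.nn.functional as F

        logits = torch.randn(8, 63, 100)         # (batch, classes, time) - made-up sizes
        target = torch.randint(0, 63, (8, 100))  # (batch, time) symbol indices
        loss = F.cross_entropy(logits, target)   # 2D targets supported since 0.4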

  • editdistance

    Package for calculating edit distance (Levenshtein distance).

  • tensorboardX

    TensorBoard interface for PyTorch; we use it to visualize the training process.

  • pandas

    For TIMIT dataset loading.

Setup

  • TIMIT
    • Dataset Preprocess

      Please prepare the TIMIT dataset without modifying its file structure, and run the following command to preprocess the wave files into MFCC 39 before training.

        cd util
        ./timit_preprocess.sh <TIMIT folder>       
      

      After the preprocessing step, timit_mfcc_39.pkl should be in your TIMIT folder. Add your data path to the config file.

    • Train LAS

      Run the following commands to train LAS on TIMIT:

        mkdir -p checkpoint
        mkdir -p log
        python3 train_timit.py <config file path>
      

      Training logs will be stored in log/ and model checkpoints in checkpoint/.

      For a customized experiment, please read and modify config/las_example_config.yaml. For more information and a simple demonstration, please refer to las_demo.ipynb.
