A speech recognition LAS model using various features. (Korean support is under development.)

Speech recognition using Melspectrogram and Spikegram

Description

This is a PyTorch implementation of Listen, Attend and Spell (LAS), published at ICASSP 2016 (Student Paper Award), trained on TIMIT. Feel free to use or modify it; any bug report or improvement suggestion will be appreciated. If you have any questions, please contact b03902034[AT]ntu.edu.tw.

TIMIT

The input feature is MFCC 39 (13 MFCCs + delta + accelerate), and the 61 output phoneme classes are mapped down to 39 classes during evaluation. This implementation achieves about 26% phoneme error rate on TIMIT's testing set (using the original setting in the paper without hyperparameter tuning; models are stored in checkpoint/). It is not a remarkable score, but note that a deep end-to-end ASR model without a specially designed loss function, such as LAS, requires a larger corpus to achieve outstanding performance.
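
For reference, this 39-dimensional feature can be computed with the python_speech_features package (listed under Packages below); the following is a minimal sketch, with the file name as a placeholder:

    import numpy as np
    import scipy.io.wavfile as wav
    from python_speech_features import mfcc, delta

    # 13 static MFCCs + delta + delta-delta ("accelerate") = 39 dims per frame
    rate, signal = wav.read("sample.wav")  # placeholder RIFF wave file
    feat = mfcc(signal, samplerate=rate, numcep=13)
    d1 = delta(feat, 2)                    # first-order differences
    d2 = delta(d1, 2)                      # second-order differences
    feat_39 = np.concatenate([feat, d1, d2], axis=1)  # shape: (num_frames, 39)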

  • Learning Curve

  • Attention Visualization & Recognition Result

Result of the first sample in the TIMIT testing set. The training log is available here; use tensorboard --logdir=las_example/ to access it.

Remarks

Differences from the paper

Be aware of some differences between this implementation and the originally proposed model:

  • Smaller Dataset

    Originally, LAS was trained on Google's private voice search dataset containing 2000 hours of data plus additional data augmentation. Here the model was trained on TIMIT, a MUCH smaller dataset, without any data augmentation. Even LibriSpeech is a relatively small corpus for LAS.

  • Different Metric

    On TIMIT, the evaluation criterion we chose is the Word Error Rate (WER) over the output phoneme sequence (i.e. the phoneme error rate, PER) instead of over real sentences composed of real words. A small example of this metric is sketched below.
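
    For concreteness, this metric can be computed with the editdistance package (see Packages below); a minimal sketch with made-up phoneme sequences:

        import editdistance

        # PER = Levenshtein distance between phoneme sequences / reference length
        ref = ["sil", "dh", "ah", "k", "ae", "t", "sil"]  # made-up reference
        hyp = ["sil", "d", "ah", "k", "ae", "sil"]        # made-up hypothesis
        per = editdistance.eval(ref, hyp) / len(ref)
        print(f"PER = {per:.2%}")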

  • Simplified Speller

    The Speller contains a single-layer LSTM instead of the 2-layer LSTM proposed in the paper. According to the response I got from a letter I wrote to the author, a single layer can achieve similar results.

  • Features for character prediction

    According to Equation (8) in the paper, the last layer of the Speller takes both the RNN output and the attention-based context as input and outputs a character distribution. However, the exact operation in this equation is unclear. In this implementation, the RNN output and the attention-based context are simply concatenated, as sketched below.
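
    A minimal sketch of this concatenation (batch size, hidden dimension, and the 63-symbol output vocabulary are made-up values):

        import torch
        import torch.nn as nn

        rnn_out = torch.randn(8, 512)          # speller RNN output (made-up batch 8, dim 512)
        context = torch.randn(8, 512)          # attention-based context vector
        char_layer = nn.Linear(512 + 512, 63)  # made-up 63-symbol output vocabulary
        # concatenate both inputs, then project to a distribution over output symbols
        char_dist = torch.softmax(char_layer(torch.cat([rnn_out, context], dim=-1)), dim=-1)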

Improvement

  • Multi-head Attention (MHA)

    Google released another paper introducing state-of-the-art end-to-end ASR based on LAS. According to that paper, changing the attention mechanism to MHA gives a remarkable performance improvement. We've implemented MHA as described in section 2.2.2 of the paper and enabled it when training on LibriSpeech. It is worth mentioning that MHA increases the training time of LAS (which was already slow), so consider disabling MHA by setting multi_head=1 in the config file on slower GPUs.
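
    The following is a rough sketch of the mechanism (one dot-product attention per head over the listener features, head contexts concatenated and merged); the projections and dimensions are assumptions, not the repo's exact code:

        import torch
        import torch.nn as nn

        class MultiHeadAttention(nn.Module):
            def __init__(self, dim, num_heads=4):
                super().__init__()
                self.num_heads = num_heads
                self.query_proj = nn.Linear(dim, dim * num_heads)
                self.key_proj = nn.Linear(dim, dim * num_heads)
                self.merge = nn.Linear(dim * num_heads, dim)

            def forward(self, decoder_state, listener_feats):
                # decoder_state: (B, dim), listener_feats: (B, T, dim)
                B, T, d = listener_feats.shape
                q = self.query_proj(decoder_state).view(B, self.num_heads, 1, d)
                k = self.key_proj(listener_feats).view(B, T, self.num_heads, d).permute(0, 2, 1, 3)
                attn = torch.softmax(q @ k.transpose(-1, -2) / d ** 0.5, dim=-1)  # (B, H, 1, T)
                v = listener_feats.unsqueeze(1).expand(B, self.num_heads, T, d)
                context = (attn @ v).reshape(B, self.num_heads * d)  # concat head contexts
                return self.merge(context)  # fused context vector, shape (B, dim)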

  • Label Smoothing

    Like MHA, label smoothing was mentioned in the same paper and shows a significant improvement on LAS. However, PyTorch's loss function design makes label smoothing hard to express with the built-in losses, so in this implementation it is achieved by a self-defined loss function (found in functions.py). That implementation may be numerically unstable compared to the native loss functions provided by PyTorch; you may disable label smoothing by setting it to 0 in the config file. Bug reports or suggestions on the label smoothing implementation will be very much appreciated.
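
    The underlying idea is roughly the following (cross-entropy against smoothed one-hot targets); this sketch is not necessarily identical to the code in functions.py:

        import torch

        def label_smoothing_loss(log_probs, target, smoothing=0.1):
            # log_probs: (batch, time, vocab) log-probabilities; target: (batch, time) indices
            vocab = log_probs.size(-1)
            with torch.no_grad():
                # put 1 - eps on the true class, spread eps over the remaining classes
                true_dist = torch.full_like(log_probs, smoothing / (vocab - 1))
                true_dist.scatter_(-1, target.unsqueeze(-1), 1.0 - smoothing)
            return (-true_dist * log_probs).sum(dim=-1).mean()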

Requirements

Execution Environment

  • Python 3
  • GPU computing is recommended for training efficiency
  • Computing power and memory space (both RAM and GPU RAM) are extremely important if you'd like to train your own model, especially on LibriSpeech.

Packages

  • SoX

    Command-line tool for converting the raw wave files in TIMIT from NIST to RIFF format

  • python_speech_features

    A Python package for extracting MFCC and other acoustic features during preprocessing

  • pydub

    High-level API for audio file format conversion

  • joblib

    Parallelization tool to speed up feature extraction and dataset loading.

  • tqdm

    Progress bar for visualization.

  • PyTorch (0.4.0)

    Please use PyTorch 0.4.0, in which loss computation over 2D targets is available and the softmax bug on 3D input is fixed.
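
    For example, a single cross-entropy call over a 2D (batch x time) target works directly in 0.4.0; the sizes below are made up:

        import torch
        import torch.nn.functional as F

        logits = torch.randn(8, 63, 100)         # (batch, classes, time) - made-up sizes
        target = torch.randint(0, 63, (8, 100))  # (batch, time) symbol indices
        loss = F.cross_entropy(logits, target)   # 2D targets supported since 0.4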

  • editdistance

    Package for calculating edit distance (Levenshtein distance).

  • tensorboardX

    TensorBoard interface for PyTorch; we use it to visualize the training process.

  • pandas

    For TIMIT dataset loading.

Setup

  • TIMIT
    • Dataset Preprocess

      Please prepare the TIMIT dataset without modifying its file structure, and run the following command to preprocess the wave files into MFCC 39 before training.

        cd util
        ./timit_preprocess.sh <TIMIT folder>       
      

      After the preprocessing step, timit_mfcc_39.pkl should be in your TIMIT folder. Add your data path to the config file.

    • Train LAS

      Run the following commands to train LAS on TIMIT:

        mkdir -p checkpoint
        mkdir -p log
        python3 train_timit.py <config file path>
      

      Training logs will be stored in log/ and model checkpoints in checkpoint/.

      For a customized experiment, please read and modify config/las_example_config.yaml. For more information and a simple demonstration, please refer to las_demo.ipynb.
