Speech-to-Text-WaveNet2 : End-to-end sentence level English speech recognition using DeepMind's WaveNet

A TensorFlow implementation of speech recognition based on DeepMind's WaveNet: A Generative Model for Raw Audio (hereafter "the paper").

The architecture is shown in the following figure.

(Some images are cropped from [WaveNet: A Generative Model for Raw Audio](https://arxiv.org/abs/1609.03499) and [Neural Machine Translation in Linear Time](https://arxiv.org/abs/1610.10099))

Version

Current Version : 2.1.0.0

  • demo
  • test
  • train
  • train model

Dependencies

  1. tensorflow >= 1.12.0
  2. librosa
  3. glog
  4. nltk
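
To confirm the dependencies are installed, a quick check like the following can be run (a convenience snippet, not a script from this repository; the glog import assumes the 'glog' package on PyPI):

# Sanity check: the dependencies import and the TensorFlow version is recent enough.
import tensorflow as tf
import librosa
import glog   # assumed to be the 'glog' PyPI package
import nltk

print('tensorflow', tf.__version__)   # should be >= 1.12.0
print('librosa', librosa.__version__)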

If you have problems with the librosa library, try installing ffmpeg with the following commands (Ubuntu 14.04):


sudo add-apt-repository ppa:mc3man/trusty-media
sudo apt-get update
sudo apt-get dist-upgrade -y
sudo apt-get -y install ffmpeg

Dataset

Audio was augmented following the scheme in Tom Ko et al.'s paper. (Thanks @migvel for the kind information.)
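
Ko et al.'s scheme is speed perturbation, which resamples each utterance at a few speed factors (0.9, 1.0 and 1.1 in the paper). The snippet below is only a minimal sketch of that idea using librosa; it is not the augmentation code used by this repository, and the 16 kHz rate and file name are assumptions.

import librosa

def speed_perturb(path, factor, sr=16000):
    # Load at the assumed sample rate, then treat the clip as if it had been
    # recorded at sr * factor and resample back to sr. This shortens
    # (factor > 1) or lengthens (factor < 1) the audio and shifts pitch,
    # as in speed perturbation.
    y, _ = librosa.load(path, sr=sr)
    return librosa.resample(y, orig_sr=int(sr * factor), target_sr=sr)

# One original plus two perturbed copies, as in the paper.
versions = [speed_perturb('data/demo.wav', f) for f in (0.9, 1.0, 1.1)]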

Usage

Execute

python ***.py --help

to get help for any script (replace ***.py with the script name).

Create dataset

  1. Download and extract the dataset (only VCTK is supported for now; others are coming soon).
  2. Assuming the VCTK dataset is located at f:/speech, execute

python tools/create_tf_record.py -input_dir='f:/speech'

to create the records for training or testing.
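
To check that the records were written, something like the following can count the examples (a convenience snippet, not part of the repository; the file name 'train.tfrecord' is a placeholder for whatever create_tf_record.py actually writes):

import tensorflow as tf

# Count the serialized examples in one generated TFRecord file (TF 1.x API).
count = sum(1 for _ in tf.python_io.tf_record_iterator('train.tfrecord'))
print('examples:', count)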

Train

  1. Rename config/config.json.example to config/english-28.json.
  2. Execute

python train.py

to train the model.

Test

Execute

python test.py

to evaluate the model.

Demo

1. Download the pretrained model (buriburisuri model) and extract it to the 'release' directory.

2. Execute


python demo.py -input_path 

to transcribe a speech wave file into an English sentence. The result will be printed on the console.

For example, try the following command.


python demo.py -input_path=data/demo.wav -ckpt_dir=release/buriburisuri

The result will be as follows:


please scool stella

The ground truth is as follows:


PLEASE SCOOL STELLA

As mentioned earlier, there is no language model, so capitalization, punctuation, and some spellings can be wrong.
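
If demo.py complains about an input file, a quick way to check that the wave file loads at all is to open it with librosa (the same library the project depends on). This is only an illustration; the 16 kHz mono format is an assumption, not a value taken from this repository:

import librosa

# Load the clip, downmixing to mono and resampling to an assumed 16 kHz.
audio, sr = librosa.load('data/demo.wav', sr=16000, mono=True)
print('%.2f seconds at %d Hz' % (len(audio) / sr, sr))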

Pretrained models

  1. buriburisuri model : converted from https://github.com/buriburisuri/speech-to-text-wavenet.

Future works

  1. try to tokenize the English labels with nltk (see the sketch after this list)
  2. train with all punctuation
  3. add an attention layer
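
For item 1, nltk's word tokenizer would split a transcript like this (an illustration of the planned step, not code from this repository):

import nltk

nltk.download('punkt')  # tokenizer models, needed once
print(nltk.word_tokenize('please call stella'))
# ['please', 'call', 'stella']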

Other resources

  1. buriburisuri's speech-to-text-wavenet
  2. ibab's WaveNet(speech synthesis) tensorflow implementation
  3. tomlepaine's Fast WaveNet(speech synthesis) tensorflow implementation

Namju's other repositories

  1. SugarTensor
  2. EBGAN tensorflow implementation
  3. Timeseries gan tensorflow implementation
  4. Supervised InfoGAN tensorflow implementation
  5. AC-GAN tensorflow implementation
  6. SRGAN tensorflow implementation
  7. ByteNet-Fast Neural Machine Translation

Citation

If you find this code useful please cite us in your work:


Kim and Park. Speech-to-Text-WaveNet. 2016. GitHub repository. https://github.com/buriburisuri/.

Authors

Namju Kim (namju.kim@kakaocorp.com) at KakaoBrain Corp.

Kyubyong Park (kbpark@jamonglab.com) at KakaoBrain Corp.
