Another PyTorch implementation of Tacotron2 MMI (with waveglow)

This is another PyTorch implementation of Tacotron2 MMI, heavily based on bfs18's code.

Notes

I implemented this to address the robustness issues and slow training of NVIDIA/tacotron2. While searching for issues that deal with these problems, I found bfs18's Tacotron2 MMI via issue #280, which discusses the effectiveness of reduction windows in Tacotron frameworks.
bfs18's Tacotron2 MMI makes two main contributions: drop frame rate and a CTC-loss-based MMI term to maximize "the dependency between the autoregressive module and the condition module". However, as reported in the follow-up issue, it seems somewhat unstable, so I did not use the MMI term in training (use_mmi=False).
Instead, I applied two things to get robust alignments:
- n_frames_per_step>1 mode (not supported in NVIDIA/tacotron2)
  - I only tried n_frames_per_step=2, but larger values should work as well.
- espnet's implementation of diagonal guided attention loss
As a result, alignments are learned more than 3 times faster than with NVIDIA/tacotron2 on the Blizzard Challenge 2013 dataset.
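For readers unfamiliar with diagonal guided attention loss, the idea can be sketched in plain Python as below. This is a minimal illustration, not the code used in this repository or in espnet: the function names are hypothetical, and the defaults sigma=0.4 and alpha=1.0 are assumptions. The penalty grows as attention mass moves away from the diagonal between text position and mel frame.

```python
import math


def guided_attention_mask(text_len, mel_len, sigma=0.4):
    """Penalty matrix W[t][n] = 1 - exp(-((n/N) - (t/T))^2 / (2 * sigma^2)).

    t indexes decoder (mel) steps, n indexes encoder (text) positions.
    Entries near the diagonal are close to 0, entries far from it approach 1.
    """
    return [
        [
            1.0 - math.exp(-((n / text_len - t / mel_len) ** 2) / (2.0 * sigma ** 2))
            for n in range(text_len)
        ]
        for t in range(mel_len)
    ]


def guided_attention_loss(attention, sigma=0.4, alpha=1.0):
    """Mean of attention weights multiplied element-wise by the penalty matrix.

    attention: list of mel_len rows, each holding text_len attention weights.
    """
    mel_len, text_len = len(attention), len(attention[0])
    mask = guided_attention_mask(text_len, mel_len, sigma)
    total = sum(
        mask[t][n] * attention[t][n]
        for t in range(mel_len)
        for n in range(text_len)
    )
    return alpha * total / (mel_len * text_len)


# A perfectly diagonal alignment is not penalized; a reversed one is.
diagonal = [[1.0 if n == t else 0.0 for n in range(4)] for t in range(4)]
reversed_alignment = [[1.0 if n == 3 - t else 0.0 for n in range(4)] for t in range(4)]
print(guided_attention_loss(diagonal), guided_attention_loss(reversed_alignment))
```

In a real training loop this term would be computed on batched tensors with padding masked out and added to the spectrogram and gate losses; the sketch leaves both out for brevity.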
However, the overall quality of the synthesized speech is poor even with excellent alignments, due to the regularization effects of both the drop frame rate and the reduction windows. I trained for ~130k steps, but the val loss only reached 0.3621. This is significantly slower than NVIDIA/tacotron2 with a warm-start model. It may converge with more training, but I am not taking my current implementation any further since I don't want to spend too much time on training.
Feel free to use my code; I hope to see an exceptional improvement from you. Any suggestions are appreciated.

Pre-requisites

NVIDIA GPU + CUDA cuDNN

Setup

Download and extract the Blizzard Challenge 2013 dataset
Follow the remaining steps as in NVIDIA/tacotron2

Training

python train.py --output_directory=outdir --log_directory=logdir
(OPTIONAL) tensorboard --logdir=outdir/logdir

Inference

Single sample: python inference.py -c checkpoint/path -r reference_audio/wav/path -t "synthesize text"
Multiple samples: python inference_all.py -c checkpoint/path -r reference_audios/dir/path

N.b. When performing Mel-Spectrogram to Audio synthesis, make sure Tacotron 2 and the Mel decoder were trained on the same mel-spectrogram representation.
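One way to act on the note above is to compare the mel-spectrogram settings of the two models before synthesis. The sketch below is only an illustration: the helper name is hypothetical, the key names are assumed to match the usual Tacotron 2 and WaveGlow configuration fields, and the values are made up for the example.

```python
# Settings that define the mel-spectrogram representation (assumed key names).
MEL_KEYS = (
    "sampling_rate",
    "filter_length",
    "hop_length",
    "win_length",
    "n_mel_channels",
    "mel_fmin",
    "mel_fmax",
)


def mel_config_mismatches(acoustic_cfg, vocoder_cfg, keys=MEL_KEYS):
    """Return {key: (acoustic_value, vocoder_value)} for every differing key."""
    return {
        key: (acoustic_cfg.get(key), vocoder_cfg.get(key))
        for key in keys
        if acoustic_cfg.get(key) != vocoder_cfg.get(key)
    }


# Illustrative values only; read the real ones from your own configs.
tacotron_cfg = {"sampling_rate": 22050, "hop_length": 256, "n_mel_channels": 80}
vocoder_cfg = {"sampling_rate": 22050, "hop_length": 200, "n_mel_channels": 80}
print(mel_config_mismatches(tacotron_cfg, vocoder_cfg))
```

An empty result means the listed settings agree; any reported key is a likely cause of distorted audio.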

Multi-GPU (distributed) and Automatic Mixed Precision Training

Not supported in the current implementation.

Suggestions and Tips

You may remove mel_layer in the decoder to lower the training loss. It does not exist in NVIDIA/tacotron2, only in bfs18's code.
In my experiments, there was no big difference between using drop frame rate and reduction windows as described in issue #280, especially in terms of learning alignments. However, the training and validation loss curves differ. Specifically, using reduction windows gives a larger val loss at the same training step compared to drop frame rate. Also, training time is reduced by almost half when using reduction windows.
- val_loss_r1_d2: val loss using a reduction window of size 1 (no reduction in frames per decoder step) and drop frame rate 0.2.
- val_loss_r2_d0: val loss using a reduction window of size 2 and drop frame rate 0 (no drop frame rate).
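To make the two settings compared above concrete, here is a minimal plain-Python sketch of what a reduction window and a drop frame rate do to the teacher-forced mel frames. It is not the code in model.py: the function names are hypothetical, zero padding of the last group is an assumption, and replacing dropped frames with a global mean frame is assumed here as the dropping strategy.

```python
import random


def group_frames(mel, n_frames_per_step):
    """Concatenate every n_frames_per_step consecutive frames into one decoder step.

    mel: list of T frames, each a list of n_mel floats. The sequence is padded
    with zero frames (an assumption) so that T is a multiple of the window size.
    """
    n_mel = len(mel[0])
    pad = (-len(mel)) % n_frames_per_step
    padded = mel + [[0.0] * n_mel for _ in range(pad)]
    return [
        [value for frame in padded[i:i + n_frames_per_step] for value in frame]
        for i in range(0, len(padded), n_frames_per_step)
    ]


def drop_frames(mel, global_mean, drop_frame_rate, rng=random):
    """Replace each frame with global_mean with probability drop_frame_rate."""
    return [
        list(global_mean) if rng.random() < drop_frame_rate else list(frame)
        for frame in mel
    ]


mel = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(group_frames(mel, 2))           # r2: half as many decoder steps
print(drop_frames(mel, [0.0, 0.0], 0.2))  # d2: some frames replaced at random
```

With a window of size 2 the decoder runs roughly half as many steps, which is consistent with the shorter training time noted above.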
I found another implementation from BogiHsu that also has an n_frames_per_step>1 mode. The main difference is how it handles the length of the gate mask. You may try it too.

Citation

@misc{lee2021tacotron2_mmi,
  author = {Lee, Keon},
  title = {tacotron2_MMI},
  year = {2021},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/keonlee9420/tacotron2_MMI}}
}

Related repos

WaveGlow: Faster than real time Flow-based Generative Network for Speech Synthesis
