Robust and fine-grained prosody control of end-to-end speech synthesis (with WaveGlow)

Unofficial PyTorch implementation of "Robust and fine-grained prosody control of end-to-end speech synthesis".

This implementation uses the LibriTTS dataset.

Notes

dev branch: Tacotron2 with multi-speaker support (speaker embedding). Speaker information is consumed only by the Decoder module; the Attention module does not see any of it, as the paper's authors intended.
text_side branch: Text-side prosody control model implementation.
Speech-side prosody control and prosody normalization are not implemented in the current version, but they can be added on top of the branches above.
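
The dev-branch note about speaker conditioning can be illustrated with a minimal sketch. It assumes the speaker embedding is concatenated to the attention context just before it enters the decoder, so attention itself is computed from the text encoder outputs alone. The class name `SpeakerConditionedDecoderInput` and all dimensions are hypothetical and are not taken from this repo's model.py.

```python
import torch
import torch.nn as nn


class SpeakerConditionedDecoderInput(nn.Module):
    """Hypothetical helper: appends a speaker embedding to the decoder input only."""

    def __init__(self, n_speakers: int, speaker_dim: int):
        super().__init__()
        # One learned vector per speaker ID.
        self.speaker_table = nn.Embedding(n_speakers, speaker_dim)

    def forward(self, attention_context: torch.Tensor, speaker_ids: torch.Tensor) -> torch.Tensor:
        # attention_context: (batch, context_dim). It is assumed to be computed
        # from text encoder outputs only, so attention never sees the speaker.
        speaker_vec = self.speaker_table(speaker_ids)  # (batch, speaker_dim)
        # Only the decoder input is widened with the speaker vector.
        return torch.cat([attention_context, speaker_vec], dim=-1)


# Illustrative sizes: 4 speakers, 8-dim speaker embedding, 16-dim context.
cond = SpeakerConditionedDecoderInput(n_speakers=4, speaker_dim=8)
context = torch.zeros(2, 16)
decoder_input = cond(context, torch.tensor([0, 3]))
print(tuple(decoder_input.shape))
```

Keeping the speaker vector out of the attention inputs means the text-to-frame alignment is learned independently of speaker identity, which is the behavior the note describes.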

Pre-requisites

NVIDIA GPU + CUDA cuDNN

Setup

Download and extract the LibriTTS dataset
Clone this repo: git clone https://github.com/keonlee9420/Robust_Fine_Grained_Prosody_Control.git
Change into the repo directory: cd Robust_Fine_Grained_Prosody_Control
Initialize submodule: git submodule init; git submodule update
Update .wav paths: sed -i -- 's,/home/keon/speech-datasets/LibriTTS_preprocessed/train-clean-100/,your_libritts_dataset_folder/,g' filelists/*.txt
- Alternatively, set load_mel_from_disk=True in hparams.py and update mel-spectrogram paths
Install PyTorch 1.0
Install Apex
Install Python requirements or build the Docker image
- Install Python requirements: pip install -r requirements.txt
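
On systems where `sed -i` is unavailable or behaves differently, the path update step can be sketched in Python. This is a minimal sketch that mirrors the sed substitution shown above; the function names `rewrite_filelist_text` and `rewrite_filelists` are hypothetical, and the sample filelist line at the end is an assumed format used only for illustration.

```python
from pathlib import Path

# The prefix baked into the shipped filelists, copied from the sed command above.
OLD_PREFIX = "/home/keon/speech-datasets/LibriTTS_preprocessed/train-clean-100/"


def rewrite_filelist_text(text: str, new_prefix: str, old_prefix: str = OLD_PREFIX) -> str:
    """Replace every occurrence of old_prefix, like sed's s,old,new,g."""
    return text.replace(old_prefix, new_prefix)


def rewrite_filelists(filelist_dir: str, new_prefix: str) -> int:
    """Rewrite all .txt filelists in place and return how many files changed."""
    changed = 0
    for path in sorted(Path(filelist_dir).glob("*.txt")):
        original = path.read_text(encoding="utf-8")
        updated = rewrite_filelist_text(original, new_prefix)
        if updated != original:
            path.write_text(updated, encoding="utf-8")
            changed += 1
    return changed


# Assumed line format (path followed by other fields), for illustration only.
sample_line = OLD_PREFIX + "some_speaker/some_utterance.wav|some transcript"
print(rewrite_filelist_text(sample_line, "your_libritts_dataset_folder/"))
```

To apply it to the repo, call `rewrite_filelists("filelists", "your_libritts_dataset_folder/")` from the repo root. As with the sed command, the new prefix should end with a slash.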

Training

python train.py --output_directory=outdir --log_directory=logdir
(OPTIONAL) tensorboard --logdir=outdir/logdir

Training using a pre-trained model

(TBD)

Multi-GPU (distributed) and Automatic Mixed Precision Training

Not supported in the current implementation.

Inference

Single sample: python inference.py -c checkpoint/path -r reference_audio/wav/path -t "synthesize text"
Multiple samples: python inference_all.py -c checkpoint/path -r reference_audios/dir/path

N.B. When performing mel-spectrogram-to-audio synthesis, make sure Tacotron 2 and the mel decoder (the vocoder, here WaveGlow) were trained on the same mel-spectrogram representation.
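
One way to check the requirement above before synthesis is to compare the audio settings of the two models. The sketch below is a hedged example: the key names in `MEL_KEYS` follow settings commonly found in a Tacotron 2 style hparams.py and are assumptions, the function `mel_config_mismatches` is hypothetical, and the numeric values are illustrative rather than read from this repo.

```python
# Assumed names for the settings that define a mel-spectrogram representation.
MEL_KEYS = (
    "sampling_rate",
    "filter_length",
    "hop_length",
    "win_length",
    "n_mel_channels",
    "mel_fmin",
    "mel_fmax",
)


def mel_config_mismatches(tacotron_cfg: dict, vocoder_cfg: dict, keys=MEL_KEYS) -> dict:
    """Return {key: (tacotron_value, vocoder_value)} for every setting that differs."""
    mismatches = {}
    for key in keys:
        tacotron_value = tacotron_cfg.get(key)
        vocoder_value = vocoder_cfg.get(key)
        if tacotron_value != vocoder_value:
            mismatches[key] = (tacotron_value, vocoder_value)
    return mismatches


# Illustrative values only: the two configs disagree on the sampling rate.
tacotron_cfg = {"sampling_rate": 22050, "hop_length": 256, "n_mel_channels": 80}
vocoder_cfg = {"sampling_rate": 16000, "hop_length": 256, "n_mel_channels": 80}
print(mel_config_mismatches(tacotron_cfg, vocoder_cfg))
```

An empty result means the compared settings agree; any entry in the result points at a setting to reconcile before feeding Tacotron 2 output to the vocoder.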

Citation

@misc{lee2021robust_fine_grained_prosody_control,
  author = {Lee, Keon},
  title = {Robust_Fine_Grained_Prosody_Control},
  year = {2021},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/keonlee9420/Robust_Fine_Grained_Prosody_Control}}
}

Related repos

WaveGlow: faster-than-real-time flow-based generative network for speech synthesis.

nv-wavenet: faster-than-real-time WaveNet.

Acknowledgements

This implementation uses code from the following repos: NVIDIA/Tacotron-2, KinglittleQ/GST-Tacotron

We are thankful to the paper's authors, especially Younggun Lee and Taesu Kim.
