
MB-iSTFT-VITS with AutoVocoder

Motivation for implementation

Starting from VITS, MB-iSTFT-VITS improves synthesis speed using the following techniques:

  1. A multi-band parallel generation strategy that decomposes the speech signal into sub-band signals
  2. An iSTFT-based waveform generation process (sketched below)
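
As a quick illustration of technique 2, the sketch below turns predicted magnitude and phase into a waveform with a fixed iSTFT. The shapes and FFT sizes are illustrative only, not this repo's actual decoder or configuration:

import torch

n_fft, hop, win = 16, 4, 16                      # small per-sub-band iSTFT sizes (illustrative)
window = torch.hann_window(win)

mag = torch.rand(1, n_fft // 2 + 1, 100)         # predicted magnitude, (batch, freq, frames)
phase = torch.rand(1, n_fft // 2 + 1, 100) * 2 * torch.pi  # predicted phase

spec = mag * torch.exp(1j * phase)               # complex spectrogram from mag/phase
wav = torch.istft(spec, n_fft=n_fft, hop_length=hop, win_length=win, window=window)
print(wav.shape)                                 # (1, num_samples): one sub-band waveform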

Building on this well-designed framework, this repository aims to further improve sound quality and inference speed with AutoVocoder.
This repo is based on MB-iSTFT-VITS; the planned modifications and enhancements are listed below:

  1. Replace the iSTFTNet-based decoder with an AutoVocoder-based decoder.

  2. In the iSTFT operation, use real/imaginary components instead of phase/magnitude to construct the complex spectrogram, and add a time-domain reconstruction loss (see the sketch after this list).

  3. Revise the posterior encoder to accept four complex spectrogram components instead of the linear spectrogram.

  • Owing to the nature of VITS, which models powerful latents, AutoVocoder is a fitting choice thanks to its autoencoder architecture. It also offers fast inference by generating the waveform directly with a (1024, 256, 1024) fft/hop/win configuration and no upsampling modules. (The multi-band strategy is maintained.)
  • In conventional TTS models, including VITS, modeling phase information has been entirely the role of the decoder (vocoder). In Mod 3, by providing phase information to the latents, we test whether the prior can reliably approximate these latents.
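
A minimal sketch of Mod 2, assuming the decoder outputs real and imaginary parts directly. This is one reading of the modification, not the repo's exact code; the time-domain loss here is a plain L1 on waveforms:

import torch
import torch.nn.functional as F

n_fft, hop, win = 1024, 256, 1024                # fft/hop/win sizes from the AutoVocoder setup above
window = torch.hann_window(win)

real = torch.randn(1, n_fft // 2 + 1, 50)        # predicted real part, (batch, freq, frames)
imag = torch.randn(1, n_fft // 2 + 1, 50)        # predicted imaginary part

spec = torch.complex(real, imag)                 # complex spectrogram built from Re/Im, no phase/mag
wav_hat = torch.istft(spec, n_fft=n_fft, hop_length=hop, win_length=win, window=window)

wav_gt = torch.randn_like(wav_hat)               # placeholder for the ground-truth waveform
loss_recon = F.l1_loss(wav_hat, wav_gt)          # time-domain reconstruction loss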

Disclaimer: This repo is built for testing purposes. Performance is not guaranteed. Contributions are welcome.

Note

  • For easy comparison, we did not change the overall architecture of the posterior encoder. Instead, we only use a group convolution in the front part to process the revised inputs (four complex components); a sketch follows this note.
  • Currently, this repo implements the MB-iSTFT-VITS-based model. Applying it to the mini, MS, and w/o-MB variants may be future work.
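
A minimal sketch of that front part, assuming the four components are stacked along the channel axis; exactly which four components the repo uses, and these layer sizes, are assumptions for illustration:

import torch
import torch.nn as nn

n_bins, hidden, frames = 513, 192, 50            # assumed spectrogram bins / hidden size / length

# groups=4 keeps the four component streams separate in this first layer;
# in/out channels must both be divisible by the number of groups
front = nn.Conv1d(4 * n_bins, hidden, kernel_size=5, padding=2, groups=4)

x = torch.randn(1, 4 * n_bins, frames)           # e.g. [real; imag; mag; phase] stacked (assumed)
h = front(x)                                     # (1, 192, 50), then fed to the otherwise
                                                 # unchanged posterior encoder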

Explanation (from MB-iSTFT-VITS)

0. Baseline: MB-iSTFT-VITS

1. Pre-requisites

  1. Python >= 3.6
  2. Clone this repository
  3. Install python requirements. Please refer to requirements.txt
    1. You may need to install espeak first: apt-get install espeak
  4. Download datasets
    1. Download and extract the LJ Speech dataset, then rename or create a link to the dataset folder: ln -s /path/to/LJSpeech-1.1/wavs DUMMY1
  5. Build Monotonic Alignment Search and run preprocessing if you use your own datasets.
# Cython-version Monotonic Alignment Search
cd monotonic_align
mkdir monotonic_align
python setup.py build_ext --inplace

2. Training

In the case of MB-iSTFT-VITS training, run the following script:

python train_latest.py -c configs/ljs_mb_istft_vits.json -m ljs_mb_istft_vits

After training, you can check the inference audio using inference.ipynb.
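
For reference, here is a minimal sketch of what such a notebook typically does in VITS-family repos; the module and function names (utils, SynthesizerTrn, the text front end) follow upstream VITS/MB-iSTFT-VITS and may differ here, and the checkpoint path is illustrative:

import torch
import utils
from models import SynthesizerTrn
from text import text_to_sequence
from text.symbols import symbols

hps = utils.get_hparams_from_file("configs/ljs_mb_istft_vits.json")
net_g = SynthesizerTrn(
    len(symbols),
    hps.data.filter_length // 2 + 1,
    hps.train.segment_size // hps.data.hop_length,
    **hps.model).eval()
utils.load_checkpoint("logs/ljs_mb_istft_vits/G_latest.pth", net_g, None)  # path is illustrative

seq = text_to_sequence("Hello world.", hps.data.text_cleaners)
x = torch.LongTensor(seq).unsqueeze(0)           # (1, text_len)
x_lengths = torch.LongTensor([x.size(1)])
with torch.no_grad():
    audio = net_g.infer(x, x_lengths, noise_scale=0.667,
                        noise_scale_w=0.8, length_scale=1.0)[0][0, 0]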

