Vector-Quantized PPG-Based Voice Conversion

Code for the paper "Decoupling segmental and prosodic cues of non-native speech through vector quantization".

Waris Quamer, Anurag Das, Ricardo Gutierrez-Osuna

Block Diagram

See details and audio samples here (link).

Installation

Install ffmpeg.
Install Kaldi.
Install PyKaldi.
Install the required packages using the environment.yml file.
Download pretrained TDNN-F model, extract it, and set PRETRAIN_ROOT in kaldi_scripts/extract_features_kaldi.sh to the pretrained model directory.
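
Before running the Kaldi step, it can help to confirm that both root variables are set. The following is a minimal sketch, assuming the script assigns each variable on its own line as NAME=value (optionally with `export`); the helper names `read_roots` and `missing_roots` are hypothetical and not part of this repository.

```python
import re
from pathlib import Path

# Assumption: each root is assigned on its own line, e.g. KALDI_ROOT=/path/to/kaldi
ASSIGNMENT = re.compile(r'^\s*(?:export\s+)?(KALDI_ROOT|PRETRAIN_ROOT)=(.*)$')


def read_roots(script_text):
    """Return {variable: value} for the root variables assigned in the script text."""
    roots = {}
    for line in script_text.splitlines():
        match = ASSIGNMENT.match(line)
        if match:
            name, value = match.groups()
            # Drop a trailing comment, then surrounding whitespace and quotes.
            roots[name] = value.split('#', 1)[0].strip().strip('"\'')
    return roots


def missing_roots(roots):
    """List the variables that are unset or do not point at an existing directory."""
    return [name for name in ("KALDI_ROOT", "PRETRAIN_ROOT")
            if not roots.get(name) or not Path(roots[name]).is_dir()]


if __name__ == "__main__":
    script = Path("kaldi_scripts/extract_features_kaldi.sh")
    if script.is_file():
        problems = missing_roots(read_roots(script.read_text()))
        print("OK" if not problems else f"Check these variables: {problems}")
    else:
        print(f"{script} not found; run this from the repository root.")
```

If the script sets the variables in some other way (for example by sourcing path.sh), the pattern above would need to be adapted.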

Dataset

Acoustic Model: LibriSpeech. Download the pretrained TDNN-F acoustic model here.
- You also need to set KALDI_ROOT and PRETRAIN_ROOT in kaldi_scripts/extract_features_kaldi.sh accordingly.
Speaker Encoder: LibriSpeech; see here for the detailed training process.
Vector Quantization: ARCTIC and L2-ARCTIC; see here for the detailed training process.
Synthesizer (i.e., Seq2seq model): ARCTIC and L2-ARCTIC. Please see here for a merged version.
Vocoder (HiFiGAN): LibriSpeech (training code to be updated).

All the pretrained models are available here (to be updated).

Directory layout (format your dataset to match the layout below)

dataset_root
├── speaker 1
├── speaker 2
│   ├── wav          # contains all the wav files from speaker 2
│   └── kaldi        # Kaldi files (auto-generated after running kaldi_scripts)
.
.
└── speaker N
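
The layout above can be checked before any processing starts. This is a rough sketch, assuming every folder directly under the dataset root is a speaker and that audio files use the `.wav` extension; `check_layout` and the speaker names in the demo are hypothetical.

```python
import tempfile
from pathlib import Path


def check_layout(dataset_root):
    """Map each speaker folder to its number of wav files; raise if a wav folder is missing."""
    root = Path(dataset_root)
    counts = {}
    for speaker_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        wav_dir = speaker_dir / "wav"
        if not wav_dir.is_dir():
            raise FileNotFoundError(f"{speaker_dir.name}: missing 'wav' folder")
        counts[speaker_dir.name] = len(list(wav_dir.glob("*.wav")))
    return counts


if __name__ == "__main__":
    # Build a throwaway dataset with two speakers to show the expected shape.
    with tempfile.TemporaryDirectory() as tmp:
        for speaker in ("speaker_1", "speaker_2"):
            (Path(tmp) / speaker / "wav").mkdir(parents=True)
            (Path(tmp) / speaker / "wav" / "utt_0001.wav").touch()
        print(check_layout(tmp))
```

The kaldi folder is not checked here because it is generated later by the Kaldi scripts.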

Quick Start

See the inference notebook, inference_script.ipynb.

Training

Use Kaldi to extract bottleneck features (BNFs) for each speaker (repeat for every speaker)

./kaldi_scripts/extract_features_kaldi.sh /path/to/speaker
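
Since the extraction script takes one speaker folder at a time, running it over the whole dataset can be scripted. Below is a small sketch, assuming every folder directly under the dataset root is a speaker; `extraction_commands` and `run_all` are hypothetical helpers, and the default is a dry run that only prints the commands.

```python
import subprocess
from pathlib import Path

SCRIPT = "./kaldi_scripts/extract_features_kaldi.sh"


def extraction_commands(dataset_root):
    """Build one extraction command per speaker folder, in sorted order."""
    root = Path(dataset_root)
    return [[SCRIPT, str(speaker_dir)]
            for speaker_dir in sorted(root.iterdir()) if speaker_dir.is_dir()]


def run_all(dataset_root, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for command in extraction_commands(dataset_root):
        print(" ".join(command))
        if not dry_run:
            # check=True stops at the first speaker whose extraction fails.
            subprocess.run(command, check=True)
```

Calling `run_all("/path/to/dataset", dry_run=False)` from the repository root would then process the speakers one after another.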

Preprocessing

python preprocess_bnfs.py path/to/dataset
python generate_speaker_embeds.py path/to/dataset
python make_data_all.py  # Edit the file to specify the dataset path

Vector-quantize the BNFs; see here.
Set the training parameters; see conf/.
Training Model 1

./train_vc128_all.sh

Training Model 2

./train_vc128_all_prosody_ecapa.sh
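
For readers new to the vector quantization step in the recipe above: in general, vector quantization replaces each feature frame with its nearest entry in a codebook. The sketch below illustrates only that nearest-neighbour lookup in plain Python. It is not the repository's implementation, the codebook and frames are made-up toy values, and in practice the codebook is learned during training.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(frames, codebook):
    """Replace each frame with the index of its nearest codebook vector."""
    indices = []
    for frame in frames:
        distances = [squared_distance(frame, code) for code in codebook]
        indices.append(distances.index(min(distances)))
    return indices


def dequantize(indices, codebook):
    """Look the codebook vectors back up from their indices."""
    return [codebook[i] for i in indices]


# Toy 2-dimensional example; real BNF frames have many more dimensions.
codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]]
frames = [[0.1, -0.2], [0.9, 1.2], [3.5, 0.4], [0.6, 0.6]]
indices = quantize(frames, codebook)
print(indices)  # [0, 1, 2, 1]
```

Frames that are close to the same codebook entry collapse to the same index, which is what makes the quantized sequence a coarser representation than the original features.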
