
ConsistencyVC-voive-conversion

Using a jointly trained speaker encoder with consistency loss to achieve cross-lingual voice conversion and expressive voice conversion.

Demo page: https://consistencyvc.github.io/ConsistencyVC-demo-page

The Whisper medium model can be downloaded here: https://drive.google.com/file/d/1PZsfQg3PUZuu1k6nHvavd6OcOB_8m1Aa/view?usp=drive_link

The pre-trained models are available here: https://drive.google.com/drive/folders/1KvMN1V8BWCzJd-N8hfyP283rLQBKIbig?usp=sharing

Note: the audio must be sampled at 16 kHz for both training and inference.
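If your data is at a different sample rate, resample it first. A minimal sketch using librosa and soundfile (these two packages are an assumption, not a requirement pinned by this repo):

import librosa
import soundfile as sf

# Load at the original rate and resample to the 16 kHz the models expect
wav, sr = librosa.load("input.wav", sr=16000)
sf.write("input_16k.wav", wav, 16000)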


Inference with the pre-trained models (using WEO as an example)

Generate the WEO of the source speech listed in src with preprocess_ppg.py.

Copy the path of the reference speech to tgt.

Use whisperconvert_exp.py to perform voice conversion with WEO as the content feature (see the sketch after this list).

For ConsistencyEVC, use ppgemoconvert_exp.py instead, with PPG as the content feature.
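Put together, the WEO inference flow might look like this (a sketch: as described above, the scripts appear to read their source and reference paths from src and tgt inside the file rather than from command-line flags, so edit those first):

python preprocess_ppg.py      # extract the WEO features for the source speech in src
python whisperconvert_exp.py  # convert the source speech to the voice of the reference in tgt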

Inference for long audio

A new script is provided for long-audio inference. You don't need to run Whisper from a separate file; just change this part and run the script.
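For context, Whisper's encoder works on 30-second windows, so long recordings are usually handled by splitting them into chunks, converting each chunk, and concatenating the results. The long-audio script has its own logic; the sketch below only illustrates the chunking idea:

import librosa

SR = 16000
CHUNK = 30 * SR  # Whisper processes 30-second windows

wav, _ = librosa.load("long_input.wav", sr=SR)
# Split the waveform into 30-second chunks; each chunk can be converted
# independently and the converted outputs concatenated afterwards.
chunks = [wav[i:i + CHUNK] for i in range(0, len(wav), CHUNK)]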

Train models on your own dataset

Use ppg.py to generate the PPG.

Use preprocess_ppg.py to generate the WEO.
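WEO stands for the Whisper encoder outputs. preprocess_ppg.py implements this repo's own extraction with the downloaded medium checkpoint; purely to illustrate what a WEO feature is, here is a sketch using the openai-whisper package (the output file name is hypothetical):

import numpy as np
import torch
import whisper

model = whisper.load_model("medium", device="cpu")
audio = whisper.load_audio("input_16k.wav")                   # loads and resamples to 16 kHz
mel = whisper.log_mel_spectrogram(whisper.pad_or_trim(audio)) # (80, 3000) log-mel frames
with torch.no_grad():
    weo = model.encoder(mel.unsqueeze(0))                     # (1, 1500, 1024) for the medium model
np.save("input_16k.weo.npy", weo.squeeze(0).numpy())          # hypothetical output name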

If you want to use WEO to train a cross-lingual voice conversion model:

First, train the model without the speaker consistency loss for 100k steps:

Change this line (the generator loss) to:

loss_gen_all = loss_gen + loss_fm + loss_mel + loss_kl  # + loss_emo

Then run the training script:

python train_whisper_emo.py -c configs/cvc-whispers-multi.json -m cvc-whispers-three

Then change the line back (re-enable loss_emo) and fine-tune the model with the speaker consistency loss:

python train_whisper_emo.py -c configs/cvc-whispers-three-emo.json -m cvc-whispers-three
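For reference, the speaker consistency loss (loss_emo above) pushes the speaker embedding of the converted speech towards that of the reference speech, using the jointly trained speaker encoder. The exact formulation lives in the training script; the snippet below is only a minimal sketch of such a term, with spk_encoder as a stand-in:

import torch.nn.functional as F

def speaker_consistency_loss(spk_encoder, converted, reference):
    # Embed the converted and the reference utterances with the speaker encoder
    emb_c = spk_encoder(converted)
    emb_r = spk_encoder(reference)
    # 1 - cosine similarity: zero when the speaker embeddings match
    return 1.0 - F.cosine_similarity(emb_c, emb_r, dim=-1).mean()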

If you want to use PPG to train an expressive voice conversion model:

First, train the model without the speaker consistency loss for 100k steps:

Change this line (the generator loss) to:

loss_gen_all = loss_gen + loss_fm + loss_mel + loss_kl  # + loss_emo

Then run the training script:

python train_eng_ppg_emo_loss.py -c configs/cvc-eng-ppgs-three-emo.json -m cvc-eng-ppgs-three-emo

Then change the line back (re-enable loss_emo) and fine-tune the model with the speaker consistency loss:

python train_eng_ppg_emo_loss.py -c configs/cvc-eng-ppgs-three-emo-cycleloss.json -m cvc-eng-ppgs-three-emo

Reference

The code structure is based on FreeVC-s. We suggest following the FreeVC instructions to install the Python requirements.

The WEO content feature is based on LoraSVC.

The PPGs are extracted with the phoneme recognition model.
