seq2seq-vc: sequence-to-sequence voice conversion toolkit

Paper (INTERSPEECH 2020): arXiv
Paper (IEEE/ACM TASLP): arXiv
Original codebase (ESPNet): GitHub

Introduction and motivation

Sequence-to-sequence (seq2seq) modeling is especially attractive for voice conversion (VC) owing to its ability to convert prosody. In particular, this repository aims to reproduce the results of the following papers/models.

Voice Transformer Network (VTN) (paper)

This is the first paper to apply the Transformer model to VC. Beyond the model architecture itself, the true novelty of this paper is a pre-training technique based on text-to-speech (TTS). This repository provides recipes for (1) TTS pre-training and (2) fine-tuning on a VC dataset; that is to say, TTS is also available in this repository.

Originally I open-sourced the code on ESPNet, but as it grew bigger and bigger, it became harder to conduct scientific research within ESPNet. This repository therefore aims to isolate the seq2seq VC part from ESPNet into an independently maintained toolkit (hopefully).

Installation

Editable installation with virtualenv

git clone https://github.com/unilight/seq2seq-vc.git
cd seq2seq-vc/tools
make
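The Makefile sets up the toolkit inside a virtual environment. Assuming it lands in tools/venv (the layout used by similar toolkits; the path here is an assumption, so check tools/Makefile), activate it before running any recipe:

source tools/venv/bin/activate   # path is an assumption; verify against tools/Makefile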

Complete training, decoding and benchmarking

As in many speech processing repositories (ESPNet, ParallelWaveGAN, etc.), our recipes are formulated in Kaldi style. They can be found in the egs folder; please check the detailed usage in each recipe.
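Kaldi-style recipes are typically driven by a run.sh script whose numbered stages can be run selectively. The flags below follow that convention but are assumptions here; each recipe's readme documents the actual interface:

cd egs/ljspeech/tts1
./run.sh --stage 0 --stop_stage 0   # hypothetical flags: run only the first stage (e.g., data preparation)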

Reproducing VTN experiments

  1. The TTS pre-training is conducted on LJSpeech. Please refer to the readme file in egs/ljspeech/tts1.
  2. Afterwards, the VC fine-tuning is conducted on CMU ARCTIC. Please refer to the readme file in egs/arctic/vc1. A combined sketch of the two steps follows below.
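Putting the two steps together, a minimal sketch of the workflow might look like this. The run.sh entry points follow Kaldi-recipe convention, and the checkpoint flag is hypothetical; see each recipe's readme for the actual interface:

# Step 1: TTS pre-training on LJSpeech
cd egs/ljspeech/tts1
./run.sh

# Step 2: VC fine-tuning on CMU ARCTIC, initialized from the TTS model
cd ../../arctic/vc1
./run.sh --pretrained_model_checkpoint /path/to/tts/checkpoint   # flag name is hypothetical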

Citation

@inproceedings{huang20i_interspeech,
  author={Wen-Chin Huang and Tomoki Hayashi and Yi-Chiao Wu and Hirokazu Kameoka and Tomoki Toda},
  title={{Voice Transformer Network: Sequence-to-Sequence Voice Conversion Using Transformer with Text-to-Speech Pretraining}},
  year={2020},
  booktitle={Proc. Interspeech},
  pages={4676--4680},
}
@ARTICLE{vtn_journal,
  author={Huang, Wen-Chin and Hayashi, Tomoki and Wu, Yi-Chiao and Kameoka, Hirokazu and Toda, Tomoki},
  journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing}, 
  title={Pretraining Techniques for Sequence-to-Sequence Voice Conversion}, 
  year={2021},
  volume={29},
  pages={745--755},
}

Acknowledgements

This repo is greatly inspired by the following repos; in fact, many code snippets are taken directly from them.

  • ESPNet
  • ParallelWaveGAN

Author

Wen-Chin Huang
Toda Laboratory, Nagoya University
E-mail: wen.chinhuang@g.sp.m.is.nagoya-u.ac.jp
