dc-comix-tts

Implementation of DCComix TTS: An End-to-End Expressive TTS with Discrete Code Collaborated with Mixer, accepted to Interspeech 2023. Audio samples and a demo for this system are available here

Abstract: Despite the huge successes made in neural TTS, content leakage remains a challenge. In this paper, we propose a new input representation and a simple architecture to achieve improved prosody modeling. Inspired by the recent success in the use of discrete code in TTS, we introduce discrete code to the input of the reference encoder. Specifically, we leverage the vector quantizer from the audio compression model to exploit the diverse acoustic information it has already been trained on. In addition, we apply a modified MLP-Mixer to the reference encoder, making the architecture lighter. As a result, we train the prosody-transfer TTS in an end-to-end manner. We prove the effectiveness of our method through both subjective and objective evaluations. We demonstrate in experiments that the reference encoder learns better speaker-independent prosody when discrete code is utilized as input. In addition, we obtain comparable results even with fewer parameters.
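The abstract's key architectural idea is to feed embedded discrete codec codes into a lightweight MLP-Mixer-style reference encoder. The repo's actual encoder lives in the model code; as an illustration only, a single Mixer block over embedded codes might look like the sketch below (all names, shapes, and hyperparameters here are hypothetical, not the repo's API):

```python
import torch
import torch.nn as nn

class MixerBlock(nn.Module):
    """One MLP-Mixer block: token mixing across code frames, then channel mixing."""
    def __init__(self, num_tokens: int, dim: int, hidden: int = 256):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.token_mlp = nn.Sequential(
            nn.Linear(num_tokens, hidden), nn.GELU(), nn.Linear(hidden, num_tokens)
        )
        self.norm2 = nn.LayerNorm(dim)
        self.channel_mlp = nn.Sequential(
            nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, num_tokens, dim)
        # Token mixing: transpose so the MLP acts across the frame (token) axis.
        x = x + self.token_mlp(self.norm1(x).transpose(1, 2)).transpose(1, 2)
        # Channel mixing: MLP acts across the embedding axis.
        return x + self.channel_mlp(self.norm2(x))

# Hypothetical reference-encoder input: embedded discrete codes from a neural codec,
# e.g. 75 code frames with a 128-dim embedding each.
codes = torch.randn(2, 75, 128)
out = MixerBlock(num_tokens=75, dim=128)(codes)
print(out.shape)  # torch.Size([2, 75, 128])
```

Because both sub-MLPs are plain linear layers, the block stays far lighter than an attention-based reference encoder, which is the point the abstract makes.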

Installation

  • python ≥ 3.8
  • pytorch 1.11.0+cu113
  • nemo_toolkit 1.18.0

See requirements.txt for the remaining dependencies.

Training

  • prepare data (VCTK)
    python preprocess/make_manifest.py
    
    • Note that VCTK audio is resampled to 24 kHz to match EnCodec's sampling rate
  • preprocessing
    • text normalization
    python torchdata/text_preprocess.py
    
  • run train.py
    • for dc-comix-tts: use ref_mixer_codec_vits.yaml

References

@software{Harper_NeMo_a_toolkit,
  author = {Harper, Eric and Majumdar, Somshubra and Kuchaiev, Oleksii and Jason, Li and Zhang, Yang and Bakhturina, Evelina and Noroozi, Vahid and Subramanian, Sandeep and Nithin, Koluguri and Jocelyn, Huang and Jia, Fei and Balam, Jagadeesh and Yang, Xuesong and Livne, Micha and Dong, Yi and Naren, Sean and Ginsburg, Boris},
  title = {{NeMo: a toolkit for Conversational AI and Large Language Models}},
  url = {https://github.com/NVIDIA/NeMo}
}
@article{defossez2022highfi,
  title={High Fidelity Neural Audio Compression},
  author={Défossez, Alexandre and Copet, Jade and Synnaeve, Gabriel and Adi, Yossi},
  journal={arXiv preprint arXiv:2210.13438},
  year={2022}
}

About

A VITS variant with clean code. Implementation of DCComix TTS: An End-to-End Expressive TTS with Discrete Code Collaborated with Mixer.
