Dict-TTS: Learning to Pronounce with Prior Dictionary Knowledge for Text-to-Speech

Ziyue Jiang, Zhe Su, Zhou Zhao, Qian Yang, Yi Ren, Jinglin Liu, Zhenhui Ye
Paper: https://arxiv.org/pdf/2206.02147

Conference: NeurIPS 2022

We provide our implementation and pretrained models as open source in this repository.

Visit our demo page for audio samples.

Dependencies

Requirements

# Install Python 3 first. (Anaconda recommended)
export PYTHONPATH=.

# build a virtual env
conda create -n dict_tts python=3.9 
conda activate dict_tts

# install pytorch requirements
# We use RTX 3080 with CUDA 11.3
conda install pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 cudatoolkit=11.3 -c pytorch

# Newer version of MFA has different output format
conda install montreal-forced-aligner==2.0.0rc3 -c conda-forge

pip install -r requirements.txt
sudo apt install -y sox libsox-fmt-mp3

Install the aligner (MFA 2.0)

# with conda (recommended, and is included in the script above)
conda install montreal-forced-aligner==2.0.0rc3 -c conda-forge

# with pip
bash scripts/install_mfa2.sh

Download the datasets (for example, Biaobei)

Download Biaobei from https://www.data-baker.com/open_source.html to data/raw/biaobei

Download the pre-trained vocoder

mkdir pretrained
mkdir pretrained/hifigan_hifitts

Download model_ckpt_steps_2168000.ckpt and config.yaml from https://drive.google.com/drive/folders/1n_0tROauyiAYGUDbmoQ__eqyT_G4RvjN?usp=sharing to pretrained/hifigan_hifitts

Download the pre-trained language model

Download roformer-chinese-base from https://huggingface.co/junnyu/roformer_chinese_base to pretrained/roformer-chinese-base

Obtain the dictionary

You can use the dictionary in ./data/zh-dict.json or crawl the dictionary from the dictionary website mentioned in our paper.
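Before preprocessing, it can help to confirm that the downloads landed where the steps above expect them. The following is a minimal sketch, not part of the repository: the helper name find_missing and the list EXPECTED_PATHS are hypothetical, and the paths are copied from the instructions above.

```python
from pathlib import Path

# Paths copied from the download steps above; adjust if your layout differs.
EXPECTED_PATHS = [
    "data/raw/biaobei",
    "pretrained/hifigan_hifitts/model_ckpt_steps_2168000.ckpt",
    "pretrained/hifigan_hifitts/config.yaml",
    "pretrained/roformer-chinese-base",
    "data/zh-dict.json",
]


def find_missing(root=".", expected=EXPECTED_PATHS):
    """Return the expected paths that do not exist under `root`."""
    root = Path(root)
    return [p for p in expected if not (root / p).exists()]


# Run from the repository root.
missing = find_missing()
if missing:
    print("Missing files or folders:")
    for path in missing:
        print("  " + path)
else:
    print("All expected files and folders are present.")
```

The check only tests for existence; it does not validate the contents of the checkpoint or the dictionary.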

Quick Start

Choose the config file (for example, DictTTS's config)

export CONFIG=egs/datasets/audio/biaobei/dict_tts.yaml

Preprocess

Pre-align

python data_gen/tts/bin/pre_align.py --config $CONFIG

MFA-align

python data_gen/tts/bin/mfa_train.py --config $CONFIG
python data_gen/tts/bin/mfa_align.py --config $CONFIG

Binarize

CUDA_VISIBLE_DEVICES=0 python data_gen/tts/bin/binarize.py --config $CONFIG
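The four preprocessing commands above run in a fixed order and share the same config. As a rough sketch (the helper build_preprocess_commands is hypothetical and not part of the repository), they can be assembled in Python and printed for review before being executed:

```python
import shlex

# Same config path as the CONFIG variable exported in Quick Start.
CONFIG = "egs/datasets/audio/biaobei/dict_tts.yaml"

# Script order follows the Preprocess steps above.
PREPROCESS_SCRIPTS = [
    "data_gen/tts/bin/pre_align.py",
    "data_gen/tts/bin/mfa_train.py",
    "data_gen/tts/bin/mfa_align.py",
    "data_gen/tts/bin/binarize.py",
]


def build_preprocess_commands(config=CONFIG):
    """Return one argv list per preprocessing step, in execution order."""
    return [["python", script, "--config", config] for script in PREPROCESS_SCRIPTS]


# Dry run: print the commands instead of executing them.
for cmd in build_preprocess_commands():
    print(shlex.join(cmd))
```

To actually run the steps, each list can be passed to subprocess.run(cmd, check=True) so that a failing step stops the pipeline. The binarize step above is launched with CUDA_VISIBLE_DEVICES=0, which would need to be set in the environment passed to that call.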

Pre-trained models

You can download the pre-trained models from https://drive.google.com/drive/folders/1oAaXlbGo03RIymwDthKEjOGmi-QcfWhm?usp=sharing, put them in checkpoints/dicttts_biaobei_wo_gumbel, and follow the inference steps below.

Train, Infer and Eval

Train

CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config $CONFIG --exp_name dicttts_biaobei_wo_gumbel --reset --hparams="ds_workers=4,max_updates=300000,num_valid_plots=10,use_word_input=True,vocoder_ckpt=pretrained/hifigan_hifitts,max_sentences=60,val_check_interval=2000,valid_infer_interval=2000,binary_data_dir=data/binary/biaobei,word_size=8000,use_dict=True"

Infer (GPU)

CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config $CONFIG --exp_name dicttts_biaobei_wo_gumbel --infer --hparams="ds_workers=4,max_updates=300000,num_valid_plots=10,use_word_input=True,vocoder_ckpt=pretrained/hifigan_hifitts,max_sentences=60,val_check_interval=2000,valid_infer_interval=2000,binary_data_dir=data/binary/biaobei,word_size=8000,use_dict=True"
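The train and infer commands above differ only in one flag (--reset versus --infer) and share a long --hparams string of comma-separated key=value pairs. A minimal sketch of keeping those overrides in one place follows; format_hparams and build_run_args are hypothetical helpers, not functions from the repository, and the values are copied from the commands above.

```python
# Overrides copied from the --hparams string in the commands above.
HPARAMS = {
    "ds_workers": 4,
    "max_updates": 300000,
    "num_valid_plots": 10,
    "use_word_input": True,
    "vocoder_ckpt": "pretrained/hifigan_hifitts",
    "max_sentences": 60,
    "val_check_interval": 2000,
    "valid_infer_interval": 2000,
    "binary_data_dir": "data/binary/biaobei",
    "word_size": 8000,
    "use_dict": True,
}


def format_hparams(overrides):
    """Join overrides into the comma-separated key=value form used above."""
    return ",".join(f"{key}={value}" for key, value in overrides.items())


def build_run_args(config, exp_name, mode, overrides=HPARAMS):
    """Build the argv list for tasks/run.py; mode is 'train' or 'infer'."""
    flag = {"train": "--reset", "infer": "--infer"}[mode]
    return [
        "python", "tasks/run.py",
        "--config", config,
        "--exp_name", exp_name,
        flag,
        f"--hparams={format_hparams(overrides)}",
    ]


args = build_run_args(
    "egs/datasets/audio/biaobei/dict_tts.yaml",
    "dicttts_biaobei_wo_gumbel",
    "infer",
)
print(" ".join(args))
```

Keeping the overrides in a single dictionary makes it harder for the training and inference settings to drift apart by accident.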

Eval the pronunciation error rate (PER)

# The PER of the current version is about 1.93%.
python scripts/get_pron_error.py
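For intuition about the metric, the sketch below computes an error rate as the fraction of positions where a predicted pronunciation differs from the reference. This is an assumption made for illustration only: it is not the logic of scripts/get_pron_error.py, and the function name and the toy pinyin sequences are hypothetical.

```python
def pronunciation_error_rate(reference, predicted):
    """Fraction of positions where the predicted pronunciation is wrong.

    Assumes both sequences are already aligned one-to-one (for example,
    one pinyin syllable per character); this is an illustrative
    definition, not the repository's evaluation code.
    """
    if len(reference) != len(predicted):
        raise ValueError("reference and predicted must have the same length")
    if not reference:
        raise ValueError("sequences must not be empty")
    errors = sum(1 for ref, pred in zip(reference, predicted) if ref != pred)
    return errors / len(reference)


# Toy example: one of four syllables is mispronounced.
reference = ["yin2", "hang2", "hang2", "zhang3"]
predicted = ["yin2", "xing2", "hang2", "zhang3"]
print(f"{pronunciation_error_rate(reference, predicted):.2%}")  # prints 25.00%
```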

Overall Repository Structure

egs: the config files for the experiments, which are read by utils/hparams.py
data_gen: preprocess and binarize the dataset
modules: model
scripts: some scripts used in the experiments
tasks: dataloader, training and inference
utils: utils
data: data folder
- raw: raw files
- processed: preprocessed files
- binary: binary files
checkpoints: checkpoints, TensorBoard logs, and inference results

Todo

The pretrained models
The Gumbel softmax version

Citation

If you find this useful for your research, please cite the following paper:

Dict-TTS

@article{jiang2022dict,
  title={Dict-TTS: Learning to Pronounce with Prior Dictionary Knowledge for Text-to-Speech},
  author={Jiang, Ziyue and Su, Zhe and Zhao, Zhou and Yang, Qian and Ren, Yi and Liu, Jinglin and Ye, Zhenhui},
  journal={arXiv preprint arXiv:2206.02147},
  year={2022}
}

Acknowledgments

Our codes are influenced by the following repos:

License and Agreement

Any organization or individual is prohibited from using any technology mentioned in this paper to generate someone's speech without his/her consent, including but not limited to government leaders, political figures, and celebrities. If you do not comply with this item, you could be in violation of copyright laws.
