Zain-Jiang/Dict-TTS

Dict-TTS: Learning to Pronounce with Prior Dictionary Knowledge for Text-to-Speech



Ziyue Jiang, Zhe Su, Zhou Zhao, Qian Yang, Yi Ren, Jinglin Liu, Zhenhui Ye
Paper: https://arxiv.org/pdf/2206.02147

Conference: NeurIPS 2022


We provide our implementation and pretrained models as open source in this repository.

Visit our demo page for audio samples.

Dependencies

Requirements

# Install Python 3 first. (Anaconda recommended)
export PYTHONPATH=.

# build a virtual env
conda create -n dict_tts python=3.9 
conda activate dict_tts

# install pytorch requirements
# We use RTX 3080 with CUDA 11.3
conda install pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 cudatoolkit=11.3 -c pytorch

# Newer versions of MFA produce a different output format, so pin this version
conda install montreal-forced-aligner==2.0.0rc3 -c conda-forge

pip install -r requirements.txt
sudo apt install -y sox libsox-fmt-mp3

Install the aligner (MFA 2.0)

# with conda (recommended; already included in the commands above)
conda install montreal-forced-aligner==2.0.0rc3 -c conda-forge

# with pip
bash scripts/install_mfa2.sh

Download the datasets (for example, Biaobei)

Download Biaobei from https://www.data-baker.com/open_source.html to data/raw/biaobei.

Download the pre-trained vocoder

mkdir pretrained
mkdir pretrained/hifigan_hifitts

Download model_ckpt_steps_2168000.ckpt and config.yaml from https://drive.google.com/drive/folders/1n_0tROauyiAYGUDbmoQ__eqyT_G4RvjN?usp=sharing to pretrained/hifigan_hifitts.

Download the pre-trained language model

Download roformer-chinese-base from https://huggingface.co/junnyu/roformer_chinese_base to pretrained/roformer-chinese-base.

Obtain the dictionary

You can use the dictionary in ./data/zh-dict.json or crawl the dictionary from the dictionary website mentioned in our paper.
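
Before preprocessing, it can help to confirm that the downloads landed where the steps above expect them. The following is a minimal sketch, assuming it is run from the repository root; the helper name `missing_assets` is hypothetical and not part of this repository, and the paths are the ones listed in the steps above.

```python
from pathlib import Path

# Paths taken from the setup steps above.
EXPECTED_ASSETS = [
    "data/raw/biaobei",
    "pretrained/hifigan_hifitts/model_ckpt_steps_2168000.ckpt",
    "pretrained/hifigan_hifitts/config.yaml",
    "pretrained/roformer-chinese-base",
    "data/zh-dict.json",
]


def missing_assets(repo_root="."):
    """Return the expected paths that do not exist under repo_root.

    Hypothetical helper for illustration; it only checks existence,
    not file contents or checksums.
    """
    root = Path(repo_root)
    return [rel for rel in EXPECTED_ASSETS if not (root / rel).exists()]


if __name__ == "__main__":
    missing = missing_assets()
    if missing:
        print("Missing assets:")
        for rel in missing:
            print("  " + rel)
    else:
        print("All expected assets are in place.")
```

An empty result only means the paths exist; it does not verify that the downloads are complete.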

Quick Start

Choose the config file (for example, DictTTS's config)

export CONFIG=egs/datasets/audio/biaobei/dict_tts.yaml 

Preprocess

Pre-align

python data_gen/tts/bin/pre_align.py --config $CONFIG

MFA-align

python data_gen/tts/bin/mfa_train.py --config $CONFIG
python data_gen/tts/bin/mfa_align.py --config $CONFIG

Binarize

CUDA_VISIBLE_DEVICES=0 python data_gen/tts/bin/binarize.py --config $CONFIG
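
The four preprocessing steps must run in order, so a small driver can be convenient. The sketch below only wraps the commands shown above; the function names `preprocess_commands` and `run_preprocess` are hypothetical. As a simplification, it sets `CUDA_VISIBLE_DEVICES` for every step, whereas the commands above set it only for binarization.

```python
import os
import subprocess

# Script names in the order given above: pre-align, MFA train, MFA align, binarize.
STEPS = ["pre_align", "mfa_train", "mfa_align", "binarize"]


def preprocess_commands(config):
    """Build the argv list for each preprocessing step (nothing is executed)."""
    return [
        ["python", f"data_gen/tts/bin/{step}.py", "--config", config]
        for step in STEPS
    ]


def run_preprocess(config, gpu="0"):
    """Run the steps in order from the repository root, stopping at the first failure."""
    # PYTHONPATH=. mirrors the export in the Requirements section.
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=gpu, PYTHONPATH=".")
    for cmd in preprocess_commands(config):
        subprocess.run(cmd, check=True, env=env)


if __name__ == "__main__":
    # Print the commands instead of running them.
    for cmd in preprocess_commands("egs/datasets/audio/biaobei/dict_tts.yaml"):
        print(" ".join(cmd))
```

Calling `run_preprocess("egs/datasets/audio/biaobei/dict_tts.yaml")` executes the steps; `check=True` makes a failing step raise instead of letting later steps run on incomplete output.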

Pre-trained models

You can download the pre-trained models from https://drive.google.com/drive/folders/1oAaXlbGo03RIymwDthKEjOGmi-QcfWhm?usp=sharing, put them in checkpoints/dicttts_biaobei_wo_gumbel, and follow the inference steps below.

Train, Infer and Eval

Train

CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config $CONFIG --exp_name dicttts_biaobei_wo_gumbel --reset --hparams="ds_workers=4,max_updates=300000,num_valid_plots=10,use_word_input=True,vocoder_ckpt=pretrained/hifigan_hifitts,max_sentences=60,val_check_interval=2000,valid_infer_interval=2000,binary_data_dir=data/binary/biaobei,word_size=8000,use_dict=True"

Infer (GPU)

CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config $CONFIG --exp_name dicttts_biaobei_wo_gumbel --infer --hparams="ds_workers=4,max_updates=300000,num_valid_plots=10,use_word_input=True,vocoder_ckpt=pretrained/hifigan_hifitts,max_sentences=60,val_check_interval=2000,valid_infer_interval=2000,binary_data_dir=data/binary/biaobei,word_size=8000,use_dict=True"
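
The train and infer commands above share the same long `--hparams` string and differ only in the `--reset` / `--infer` flag. One way to keep the two in sync is to build the string from a single dictionary. This is a sketch: the helper names `format_hparams` and `run_command` are hypothetical, and how `tasks/run.py` parses the string is not shown here.

```python
# Overrides copied from the train/infer commands above.
HPARAMS = {
    "ds_workers": 4,
    "max_updates": 300000,
    "num_valid_plots": 10,
    "use_word_input": True,
    "vocoder_ckpt": "pretrained/hifigan_hifitts",
    "max_sentences": 60,
    "val_check_interval": 2000,
    "valid_infer_interval": 2000,
    "binary_data_dir": "data/binary/biaobei",
    "word_size": 8000,
    "use_dict": True,
}


def format_hparams(overrides):
    """Join key=value pairs with commas, in insertion order."""
    return ",".join(f"{key}={value}" for key, value in overrides.items())


def run_command(config, exp_name, mode_flag, overrides=HPARAMS):
    """Build the argv for tasks/run.py; mode_flag is '--reset' (train) or '--infer'."""
    return [
        "python", "tasks/run.py",
        "--config", config,
        "--exp_name", exp_name,
        mode_flag,
        "--hparams=" + format_hparams(overrides),
    ]


if __name__ == "__main__":
    cfg = "egs/datasets/audio/biaobei/dict_tts.yaml"
    print(" ".join(run_command(cfg, "dicttts_biaobei_wo_gumbel", "--reset")))
    print(" ".join(run_command(cfg, "dicttts_biaobei_wo_gumbel", "--infer")))
```

The resulting lists can be passed to `subprocess.run` with `CUDA_VISIBLE_DEVICES` set in the environment, as in the commands above.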

Eval the pronunciation error rate (PER)

# The PER of the current version is about 1.93%.
python scripts/get_pron_error.py
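
To make the reported number concrete, an error rate of this kind can be read as the share of units pronounced incorrectly. The sketch below is illustrative only and is not the logic of scripts/get_pron_error.py, which is not shown here. It assumes the reference and predicted pronunciations are already aligned one-to-one; the function name and the pinyin strings are hypothetical examples.

```python
def pronunciation_error_rate(reference, predicted):
    """Percentage of positions where the predicted pronunciation differs.

    Assumes the two sequences are aligned one-to-one (for example, one
    pinyin syllable per character). This is an illustrative definition,
    not the repository's evaluation script.
    """
    if len(reference) != len(predicted):
        raise ValueError("reference and predicted must have the same length")
    if not reference:
        raise ValueError("reference must not be empty")
    errors = sum(ref != pred for ref, pred in zip(reference, predicted))
    return 100.0 * errors / len(reference)


if __name__ == "__main__":
    # Hypothetical example: one of four syllables is mispronounced.
    reference = ["chong2", "xin1", "kai1", "shi3"]
    predicted = ["zhong4", "xin1", "kai1", "shi3"]
    print(pronunciation_error_rate(reference, predicted))  # 25.0
```

Under this reading, a PER of 1.93% corresponds to roughly 193 errors per 10,000 units.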

Overall Repository Structure

  • egs: the config files for the experiments, which are read by utils/hparams.py
  • data_gen: preprocess and binarize the dataset
  • modules: model
  • scripts: some scripts used in the experiments
  • tasks: dataloader, training and inference
  • utils: utils
  • data: data folder
    • raw: raw files
    • processed: preprocessed files
    • binary: binary files
  • checkpoints: checkpoints, tensorboard logs, and inference results

Todo

  • The pretrained models
  • The Gumbel softmax version

Citation

If you find this useful for your research, please cite the following paper:

  • Dict-TTS
@article{jiang2022dict,
  title={Dict-TTS: Learning to Pronounce with Prior Dictionary Knowledge for Text-to-Speech},
  author={Jiang, Ziyue and Su, Zhe and Zhao, Zhou and Yang, Qian and Ren, Yi and Liu, Jinglin and Ye, Zhenhui},
  journal={arXiv preprint arXiv:2206.02147},
  year={2022}
}

Acknowledgments

Our code is influenced by the following repos:

License and Agreement

Any organization or individual is prohibited from using any technology mentioned in this paper to generate someone's speech without his/her consent, including but not limited to government leaders, political figures, and celebrities. If you do not comply with this item, you could be in violation of copyright laws.
