
An expanded version of the previously released Kazakh text-to-speech (KazakhTTS) synthesis corpus. In KazakhTTS2, the overall size has increased from 93 hours to 271 hours, the number of speakers has risen from two to five (three females and two males), and the topic coverage has been diversified.


IS2AI/Kazakh_TTS


KazakhTTS RECIPE

This is the recipe for a Kazakh text-to-speech model based on the KazakhTTS and KazakhTTS2 corpora.

Setup and Requirements

Our code builds upon ESPnet and requires the framework to be installed first. Please follow the ESPnet installation guide, then clone this repository (the Kazakh_TTS folder) into the espnet/egs2/ directory:

cd espnet/egs2
git clone https://github.com/IS2AI/Kazakh_TTS.git

Go to the Kazakh_TTS/tts1 folder and create links to the dependencies:

ln -s ../../TEMPLATE/tts1/path.sh .
ln -s ../../TEMPLATE/asr1/pyscripts .
ln -s ../../TEMPLATE/asr1/scripts .
ln -s ../../../tools/kaldi/egs/wsj/s5/steps .
ln -s ../../TEMPLATE/tts1/tts.sh .
ln -s ../../../tools/kaldi/egs/wsj/s5/utils .
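
A dangling link (for example, when Kaldi has not been installed under espnet/tools) only shows up later as a confusing error during training. The following is a minimal sketch, not part of the recipe, that checks whether the six links created above resolve. The function name `find_broken_links` is hypothetical; the link names are taken from the `ln -s` commands above.

```python
from pathlib import Path

# Link names taken from the `ln -s` commands above.
EXPECTED_LINKS = ["path.sh", "pyscripts", "scripts", "steps", "tts.sh", "utils"]


def find_broken_links(recipe_dir):
    """Return the expected link names that are missing or dangling.

    `recipe_dir` is assumed to be the Kazakh_TTS/tts1 folder.
    """
    recipe_dir = Path(recipe_dir)
    # Path.exists() follows symlinks, so a link whose target is
    # missing is reported as broken too.
    return [name for name in EXPECTED_LINKS if not (recipe_dir / name).exists()]
```

Calling `find_broken_links(".")` from inside the tts1 folder should return an empty list once all dependencies are in place.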

Downloading the dataset

Download the KazakhTTS dataset and untar it in a directory of your choice. Specify the path to the speaker's dataset directory (the one containing the Audio and Transcripts directories) inside the Kazakh_TTS/tts1/local/data.sh script:

db_root=/path-to-speaker-folder

For example: db_root=/home/datasets/ISSAI_KazakhTTS/M1/Books
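
Pointing db_root one level too high or too low is an easy mistake. As a rough sketch, the check below confirms that a candidate directory contains both subdirectories mentioned above. The function name `missing_dataset_dirs` is hypothetical, and the sketch assumes the subdirectories are named exactly `Audio` and `Transcripts`.

```python
from pathlib import Path

# Assumption: the subdirectory names match the description above exactly.
REQUIRED_DIRS = ("Audio", "Transcripts")


def missing_dataset_dirs(db_root):
    """Return the required subdirectories that are absent under db_root."""
    root = Path(db_root)
    return [name for name in REQUIRED_DIRS if not (root / name).is_dir()]
```

An empty return value suggests the path is a plausible value for db_root; it does not validate the contents of the directories.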

Training

To train the models, run the ./run.sh script inside the Kazakh_TTS/tts1/ folder. GPU and RAM specifications can be found in the configuration (conf/) folder.

./run.sh --stage 1 --stop_stage 6 --train_config conf/train.yaml 

To train FastSpeech or Transformer models, point train_config at the corresponding configuration file instead of conf/train.yaml. Each stage is described in detail in the ESPnet repository.
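
When switching configurations or resuming from a later stage, it can help to assemble the command programmatically. The sketch below only builds the argument list from the three flags shown above (`--stage`, `--stop_stage`, `--train_config`); the helper name `build_run_command` is hypothetical, and any configuration file other than conf/train.yaml is an assumption about what exists in your conf/ folder.

```python
import shlex


def build_run_command(train_config="conf/train.yaml", stage=1, stop_stage=6):
    """Build the ./run.sh argument list using the flags shown above."""
    if stage > stop_stage:
        raise ValueError("stage must not be greater than stop_stage")
    return [
        "./run.sh",
        "--stage", str(stage),
        "--stop_stage", str(stop_stage),
        "--train_config", train_config,
    ]


# Print the default command in copy-pasteable form.
print(shlex.join(build_run_command()))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from inside the tts1 folder.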

Pretrained models

The models were developed by the Institute of Smart Systems and Artificial Intelligence, Nazarbayev University, Kazakhstan (henceforth ISSAI).

Please use the models only for a good cause and in a wise manner. You must not use the models to generate data that are obscene, offensive, or discriminatory with regard to religion, sex, race, language, or territory of origin.

ISSAI appreciates and requires attribution. An attribution should include the title of the original paper, the author, and the name of the organization under which the development of the model took place. For example:

Mussakhojayeva, S., Janaliyeva, A., Mirzakhmetov, A., Khassanov, Y., Varol, H.A. (2021) KazakhTTS: An Open-Source Kazakh Text-to-Speech Synthesis Dataset. Proc. Interspeech 2021, 2786-2790, doi: 10.21437/Interspeech.2021-2124. The Institute of Smart Systems and Artificial Intelligence (issai.nu.edu.kz), Nazarbayev University, Kazakhstan

kaztts_female1_tacotron2_train.loss.ave

kaztts_female2_tacotron2_train.loss.ave

kaztts_female3_tacotron2_train.loss.ave

kaztts_male1_tacotron2_train.loss.ave

kaztts_male2_tacotron2_train.loss.ave

Pretrained vocoders

parallelwavegan_female1_checkpoint

parallelwavegan_female2_checkpoint

parallelwavegan_female3_checkpoint

parallelwavegan_male1_checkpoint

parallelwavegan_male2_checkpoint

Speech synthesis

You can synthesize arbitrary text using the synthesize.py script. Modify the following lines in the script:

## specify the path to the vocoder's checkpoint, e.g.
vocoder_checkpoint = "exp/vocoder/checkpoint-400000steps.pkl"

## specify the path to the main model (transformer/tacotron2/fastspeech) and its config file
config_file = "exp/tts_train_raw_char/config.yaml"
model_path = "exp/tts_train_raw_char/train.loss.ave_5best.pth"

Now you can run the script using an arbitrary text, for example:

python synthesize.py --text "бүгінде өңірде тағы бес жобаның құрылысы жүргізілуде."

The generated file will be saved in the tts1/synthesized_wavs folder.
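
To synthesize several sentences, the script can be invoked once per sentence. The sketch below is an illustration only: it relies solely on the `--text` flag shown above, and the helper names are hypothetical. How synthesize.py names its output file is not stated here, so successive runs may overwrite each other; check the script before relying on this.

```python
import subprocess
import sys


def build_synthesis_commands(sentences, script="synthesize.py"):
    """Build one `python synthesize.py --text ...` command per non-empty sentence."""
    return [
        [sys.executable, script, "--text", sentence.strip()]
        for sentence in sentences
        if sentence.strip()  # skip blank lines
    ]


def synthesize_all(sentences):
    """Run the commands one after another; stop at the first failure."""
    for cmd in build_synthesis_commands(sentences):
        subprocess.run(cmd, check=True)
```

Passing the text as a separate list element, rather than through a shell string, avoids quoting problems with punctuation in the input sentences.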

Citation

@inproceedings{mussakhojayeva21_interspeech,
  author={Saida Mussakhojayeva and Aigerim Janaliyeva and Almas Mirzakhmetov and Yerbolat Khassanov and Huseyin Atakan Varol},
  title={{KazakhTTS: An Open-Source Kazakh Text-to-Speech Synthesis Dataset}},
  year=2021,
  booktitle={Proc. Interspeech 2021},
  pages={2786--2790},
  doi={10.21437/Interspeech.2021-2124}
}
