Robo-Shaul

Welcome to the Robo-Shaul repository! Here, you'll find everything you need to train your own Robo-Shaul or use pre-trained models. Robo-Shaul is a text-to-speech system that converts diacritized Hebrew text into speech using the Coqui AI TTS framework.

For a website with audio examples, look here

For a quick start, look at roboshaul_usage.ipynb or Open In Colab

For the חיות כיס (Hayot Kis) podcast episode documenting the project, listen here

The project consists of the SASPEECH dataset, a collection of Shaul Amsterdamski's unedited recordings for the podcast 'Hayot Kis', and a text-to-speech system trained on that dataset, implemented in the Coqui AI TTS framework.

The text-to-speech system consists of two parts:

  1. A text-to-mel spectrogram model called OverFlow
  2. A mel spectrogram-to-wav model called HiFi-GAN

To download the dataset for training, go to Open SLR

To download the trained models, use this link for OverFlow and this link for HiFi-GAN.

The model expects diacritized Hebrew (עברית מנוקדת); for automatic diacritization we recommend Nakdimon by Elazar Gershuni and Yuval Pinter. The link points to a free online tool; the code and model are also available on GitHub at https://github.com/elazarg/nakdimon

Installation

These installation instructions cover using our trained models as well as training your own. They have been tested on Ubuntu 22.04 and should work as-is on macOS, apart from the lack of CUDA. On Windows, running pip install numpy==1.23.5 numba==0.56.4 after installation has been reported to make things work on Python 3.10, but this has not been tested thoroughly.

It is recommended to do these steps inside a virtual environment, conda env, or similar.

Steps:

  1. Clone our fork of coqui-tts: git clone https://github.com/shenberg/TTS
  2. Install it as an editable install: pip install -e TTS
  3. Download our trained models: OverFlow TTS, HiFi-GAN
  4. Test that it works: CUDA_VISIBLE_DEVICES=0 tts --text "עַכְשָׁיו, לְאַט לְאַט, נָסוּ לְדַמְיֵין סוּפֶּרְמַרְקֶט." --model_path /path/to/saspeech_overflow.pth --config_path /path/to/config_saspeech_overflow.json --vocoder_path /path/to/saspeech_hifigan.pth --vocoder_config_path /path/to/config_saspeech_hifigan.json --out_path test.wav

You should now have a file named test.wav which has the model's TTS output.

NOTE: Our modifications to Coqui-TTS have since been upstreamed, so a regular installation should also work.
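
If you prefer to call the models from Python rather than through the tts CLI, the same synthesis can be done with Coqui's Synthesizer class. This is a minimal sketch assuming the model paths from step 4 above; exact argument names may vary slightly between Coqui TTS versions.

from TTS.utils.synthesizer import Synthesizer

# Paths to the downloaded models (same files as in the CLI example above)
synthesizer = Synthesizer(
    tts_checkpoint="/path/to/saspeech_overflow.pth",
    tts_config_path="/path/to/config_saspeech_overflow.json",
    vocoder_checkpoint="/path/to/saspeech_hifigan.pth",
    vocoder_config="/path/to/config_saspeech_hifigan.json",
    use_cuda=True,  # set to False if no GPU is available
)

# OverFlow predicts a mel spectrogram, HiFi-GAN turns it into a waveform
wav = synthesizer.tts("עַכְשָׁיו, לְאַט לְאַט, נָסוּ לְדַמְיֵין סוּפֶּרְמַרְקֶט.")
synthesizer.save_wav(wav, "test_python.wav")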

Example Windows installation using conda

conda create --name "roboshaul" python=3.10
conda activate roboshaul
conda install -c anaconda cython
conda install -c conda-forge jupyterlab
git clone https://github.com/shenberg/TTS
pip install -e TTS
pip install numpy==1.23.5 numba==0.56.4

Training

Once you have successfully created test.wav using our trained models, you can move on to training your own models.

NOTE: We assume you are in a Python environment where TTS is installed and you have successfully generated audio with it.

Data Preparation

The sequence of commands needed to extract the dataset (replace /path/to with the real path, of course):

$ unzip Roboshaul.zip
$ mkdir /path/to/TTS/recipes/saspeech/data
$ mv saspeech_*tar.gz /path/to/TTS/recipes/saspeech/data
$ cd /path/to/TTS/recipes/saspeech/data
$ tar zxvf saspeech_automatic_data_v1.0.tar.gz
$ tar zxvf saspeech_gold_standard_v1.0.tar.gz
$ rm saspeech_automatic_data_v1.0.tar.gz saspeech_gold_standard_v1.0.tar.gz

Now data/ should contain two sub-directories: data/saspeech_automatic_data/ and data/saspeech_gold_standard/.

Resampling

The files are provided at a sampling rate of 44100 Hz, but we train models on audio at 22050 Hz to reduce computational load while still providing decent quality.

$ python -m TTS.bin.resample --input_dir data/saspeech_gold_standard/ --output_dir data/saspeech_gold_standard_resampled --output_sr 22050
$ python -m TTS.bin.resample --input_dir data/saspeech_automatic_data/ --output_dir data/saspeech_automatic_data_resampled --output_sr 22050

Advanced users may want to resample using sox's VHQ algorithm with intermediate phase, which is higher quality than what these scripts provide (they use librosa 0.8, which relies on resampy behind the scenes and is of acceptable quality).
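
For example, assuming sox is installed and on the PATH, a batch resample of the gold-standard wavs with sox's VHQ rate algorithm and intermediate phase could look roughly like the sketch below (not part of the repo; adjust the paths to your layout):

import subprocess
from pathlib import Path

in_dir = Path("data/saspeech_gold_standard/wavs")
out_dir = Path("data/saspeech_gold_standard_resampled/wavs")
out_dir.mkdir(parents=True, exist_ok=True)

for wav in sorted(in_dir.glob("*.wav")):
    # "rate -v -I 22050": very-high-quality resampling with intermediate phase
    subprocess.run(
        ["sox", str(wav), str(out_dir / wav.name), "rate", "-v", "-I", "22050"],
        check=True,
    )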

Windowing for HiFi-GAN

Some of the audio files are long, especially in the automatically tagged portion of the data, where some exceed a minute in length.

HiFi-GAN training expects short files from which it can take random windows, so we provide a script that breaks long files into shorter segments. We also use this script to gather the gold-standard and automatic subsets of the data into one directory.

$ cd hifigan
$ python prepare_dataset_for_hifigan.py --input_dir ../data/saspeech_gold_standard_resampled/wavs/ ../data/saspeech_automatic_data_resampled/wavs/ --output_dir ../data/saspeech_all_windowed
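
Conceptually, the windowing boils down to cutting each long wav into consecutive fixed-length chunks. The sketch below illustrates the idea only; it is not the repository's prepare_dataset_for_hifigan.py, and the 10-second segment length is an arbitrary choice for the example:

import soundfile as sf
from pathlib import Path

def window_file(path: Path, out_dir: Path, segment_seconds: float = 10.0):
    # Split one long recording into consecutive fixed-length segments
    audio, sr = sf.read(path)
    step = int(segment_seconds * sr)
    for n, start in enumerate(range(0, len(audio), step)):
        sf.write(out_dir / f"{path.stem}_{n:04d}.wav", audio[start:start + step], sr)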

Training OverFlow

Run training script from TTS/recipes/saspeech/overflow/:

$ CUDA_VISIBLE_DEVICES=0 python train_overflow.py

Metrics and Saved Models

When you run the training script, it will print a lot of logs, and one of the lines will say something like:

> Start Tensorboard: tensorboard --logdir=/path/to/TTS/recipes/saspeech/overflow/overflow_saspeech_gold-March-14-2023_08+46AM-91fd5654

To view the metrics, open another terminal in the same Python virtual environment and run the command from the logs, e.g.

$ tensorboard --logdir=/path/to/TTS/recipes/saspeech/overflow/overflow_saspeech_gold-March-14-2023_08+46AM-91fd5654

Now that tensorboard is running, you can go to http://localhost:6006 in your browser to view metrics as training evolves.

Model checkpoints are saved to the same directory, so in the example above, /path/to/TTS/recipes/saspeech/overflow/overflow_saspeech_gold-March-14-2023_08+46AM-91fd5654 will also contain the model checkpoints.

Training HiFi-GAN

Run training script from TTS/recipes/saspeech/hifigan/:

$ CUDA_VISIBLE_DEVICES=0 OMP_NUM_THREADS=1 python train_hifigan.py

Fine-Tuning existing checkpoint

Fine-tuning an existing run is simple: add the argument --restore_path /path/to/saspeech_hifigan. For example, if you downloaded saspeech_hifigan.tar.gz and extracted it to ~/saspeech_hifigan_trained, then to continue training from that point the full command would be CUDA_VISIBLE_DEVICES=0 OMP_NUM_THREADS=1 python train_hifigan.py --restore_path ~/saspeech_hifigan_trained.

Metrics and Saved Models

When you run the training script, it will print a lot of logs, and one of the lines will say something like:

> Start Tensorboard: tensorboard --logdir=/path/to/TTS/recipes/saspeech/hifigan/run-March-14-2023_07+07AM-91fd5654

To view the metrics, open another terminal in the same Python virtual environment and run the command from the logs, e.g.

$ tensorboard --logdir=/path/to/TTS/recipes/saspeech/hifigan/run-March-14-2023_07+07AM-91fd5654

Now that tensorboard is running, you can go to http://localhost:6006 in your browser to view metrics as training evolves.

Model checkpoints are saved to the same directory, so in the example above, /path/to/TTS/recipes/saspeech/hifigan/run-March-14-2023_07+07AM-91fd5654 will also contain the model checkpoints.

Contact Us

We are Roee Shenberg and Orian Sharoni, or in other words Up·AI. If you have any questions or comments, please feel free to contact us using the information below.

Orian Sharoni: Connect on LinkedIn | Follow on Twitter | orian.sharoni@upai.dev
Roee Shenberg: Connect on LinkedIn | Follow on Twitter | roee.shenberg@upai.dev

The project's Discord is also available for communication and collaboration.

Citation

If you use our work, cite us as: Sharoni, O., Shenberg, R., Cooper, E. (2023) SASPEECH: A Hebrew Single Speaker Dataset for Text To Speech and Voice Conversion. Proc. INTERSPEECH 2023, 5566-5570, doi: 10.21437/Interspeech.2023-430

@inproceedings{sharoni23_interspeech,
  author={Orian Sharoni and Roee Shenberg and Erica Cooper},
  title={{SASPEECH: A Hebrew Single Speaker Dataset for Text To Speech and Voice Conversion}},
  year=2023,
  booktitle={Proc. INTERSPEECH 2023},
  pages={5566--5570},
  doi={10.21437/Interspeech.2023-430}
}
