Home

🐸 TTS is a deep-learning-based text-to-speech solution. It favors simplicity over large, complex models, yet it aims to achieve state-of-the-art results.

Based on a user study, 🐸 TTS achieves results on par with or better than other commercial and open-source text-to-speech solutions. It supports multiple languages and has already been applied to more than 13 of them.

The general architecture we use comprises two separate deep neural networks. The first network computes acoustic features from given text input. The second network produces the voice from the computed acoustic features. We call the first model "text2feat" and the second "vocoder".
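The two-stage design described above can be sketched as a toy pipeline. This is only an illustration of the data flow, not the 🐸 TTS implementation: the functions `text2feat`, `vocoder`, and `synthesize`, and the constants `N_FEATS` and `HOP_LENGTH`, are hypothetical stand-ins, and the "networks" are replaced by simple arithmetic.

```python
import math
from typing import List

# Hypothetical sizes, chosen only to keep the example small.
N_FEATS = 4      # values per acoustic feature frame
HOP_LENGTH = 8   # audio samples produced per feature frame


def text2feat(text: str) -> List[List[float]]:
    """Toy stand-in for the first network: text -> acoustic feature frames.

    Emits one frame per character; a trained model would predict
    spectrogram-like frames instead.
    """
    frames = []
    for ch in text:
        code = ord(ch) % 64
        frames.append([(code + k) / (64.0 + N_FEATS) for k in range(N_FEATS)])
    return frames


def vocoder(frames: List[List[float]]) -> List[float]:
    """Toy stand-in for the second network: feature frames -> waveform samples.

    Expands every frame into HOP_LENGTH samples of a sine cycle whose
    amplitude is the mean of the frame.
    """
    wav = []
    for frame in frames:
        level = sum(frame) / len(frame)
        for i in range(HOP_LENGTH):
            wav.append(level * math.sin(2 * math.pi * i / HOP_LENGTH))
    return wav


def synthesize(text: str) -> List[float]:
    """Chain the two stages: text -> features -> waveform."""
    return vocoder(text2feat(text))
```

Because the two stages only communicate through the feature frames, either one can be swapped out independently, which is why the text2feat models and the vocoders are listed separately below.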

🐸 TTS also provides a Speaker Encoder model that computes speaker embedding vectors for various purposes, including speaker verification, speaker identification, and multi-speaker text-to-speech models.
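As an illustration of how speaker embeddings are commonly used for verification, the sketch below compares two embedding vectors by cosine similarity. It is a minimal example under stated assumptions: `cosine_similarity`, `same_speaker`, and the `0.75` threshold are hypothetical and not part of the 🐸 TTS API, and the embeddings are assumed to be plain lists of floats already computed by some encoder.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def same_speaker(
    emb_a: Sequence[float],
    emb_b: Sequence[float],
    threshold: float = 0.75,  # hypothetical value; tune on held-out data
) -> bool:
    """Accept the pair as the same speaker if the similarity is high enough."""
    return cosine_similarity(emb_a, emb_b) >= threshold
```

In practice the threshold would be chosen on a validation set to balance false accepts against false rejects.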

We have currently implemented the following methods and models.

Text-to-Feat Models

Tacotron: paper
Tacotron2: paper
Glow-TTS: paper
Speedy-Speech: paper

Tricks for more efficient Tacotron learning

Gradual Training: blog post
Global Style Tokens: paper

Attention methods for Tacotron Models

Guided Attention: paper
Forward Backward Decoding: paper
Graves Attention: paper
Double Decoder Consistency: blog

Speaker Encoder

GE2E: paper
Angular Loss: paper

Vocoders

MelGAN: paper
MultiBandMelGAN: paper
ParallelWaveGAN: paper
GAN-TTS discriminators: paper
WaveRNN: origin
WaveGrad: paper
