Speaker Encoder

Eren Gölge edited this page Nov 2, 2022 · 2 revisions

🐸 TTS includes a subproject called Speaker Encoder. It is an implementation of "Generalized End-to-End Loss for Speaker Verification" (https://arxiv.org/abs/1710.10467). A released model, trained on the LibriTTS dataset with ~1000 speakers, is available on the Released Models page.

You can use this model for various purposes:

  • Training a multi-speaker model using voice embeddings as speaker features.
    • Compute embedding vectors with compute_embedding.py and feed them to your TTS network. (The TTS side still needs to be implemented, but it should be straightforward.)
  • Pruning bad examples from your TTS dataset.
    • Compute embedding vectors and plot them using the notebook provided. Thx @nmstoker for this!
  • Use as a speaker classification or verification system.
  • Speaker diarization for ASR systems.
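
As a rough illustration of the pruning and verification use cases above, here is a minimal sketch that works on embedding vectors once they have been computed. It assumes the embeddings are already loaded as NumPy arrays (one row per utterance). The function names and the threshold values are hypothetical and not part of 🐸 TTS; tune the thresholds on your own data.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each vector (last axis) to unit length."""
    x = np.asarray(x, dtype=np.float64)
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)


def cosine_similarity(a, b):
    """Cosine similarity between two 1-D embedding vectors."""
    return float(np.dot(l2_normalize(a), l2_normalize(b)))


def same_speaker(a, b, threshold=0.75):
    """Verification: accept the pair if similarity reaches the threshold.

    The default threshold is a placeholder, not a recommended value.
    """
    return cosine_similarity(a, b) >= threshold


def find_outliers(embeddings, min_similarity=0.5):
    """Pruning: indices of utterances far from their speaker's centroid.

    `embeddings` holds all utterances of ONE speaker, shape (n, dim).
    `min_similarity` is a placeholder cutoff.
    """
    emb = l2_normalize(embeddings)
    centroid = l2_normalize(emb.mean(axis=0))
    sims = emb @ centroid  # cosine similarity of each utterance to the centroid
    return np.where(sims < min_similarity)[0]


if __name__ == "__main__":
    # Synthetic stand-in for real embeddings: 9 similar vectors plus 1 unrelated one.
    rng = np.random.default_rng(0)
    base = rng.normal(size=256)
    clean = base + 0.05 * rng.normal(size=(9, 256))
    bad = rng.normal(size=(1, 256))
    print(find_outliers(np.vstack([clean, bad])))
```

Comparing each utterance to the speaker centroid is only one simple heuristic; plotting the embeddings, as the notebook does, remains the better way to inspect a dataset visually.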

The model provided here is half the size of the baseline model. I found that it is easier to train, and its final performance does not differ much from that of the larger version.