Max Hawkins edited this page Jan 29, 2016 · 2 revisions

Gentle uses Kaldi to recognize speech in your audio and align it with text.

The speech recognition model that is packaged with Gentle is based on the Kaldi fisher_english_v8 model (built by Dan Povey).

The acoustic model is a multi-splice deep neural network. It was trained on over 4000 hours of 8 kHz (telephone-bandwidth) conversational speech from the Fisher English corpus.

A new bigram language model is built every time you run Gentle to fit the words contained in your transcript.
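To illustrate what a bigram language model built from a transcript looks like, here is a minimal sketch. It is not Gentle's actual code: the function name `bigram_model`, the tokenization rule, and the `<s>`/`</s>` boundary markers are assumptions made for this example, and the probabilities are plain unsmoothed relative frequencies.

```python
import re
from collections import Counter, defaultdict


def bigram_model(transcript):
    """Estimate P(next word | previous word) from a single transcript.

    Hypothetical helper for illustration only; Gentle's own model
    construction may differ in tokenization, smoothing, and format.
    """
    # Lowercase and keep runs of letters and apostrophes as words.
    words = re.findall(r"[a-z']+", transcript.lower())
    # Add sentence-boundary markers around the whole transcript.
    tokens = ["<s>"] + words + ["</s>"]

    pair_counts = Counter(zip(tokens, tokens[1:]))
    prev_counts = Counter(tokens[:-1])

    model = defaultdict(dict)
    for (prev, nxt), count in pair_counts.items():
        model[prev][nxt] = count / prev_counts[prev]
    return dict(model)


model = bigram_model("the cat sat on the mat")
print(model["the"])
```

Because the model only contains word pairs that occur in the transcript, the recognizer is strongly biased toward the words and word orders you supplied, which is what makes alignment tractable.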

After recognition, the speech is split into phonemes using a version of The CMU Pronouncing Dictionary. The phoneme set is based on ARPAbet.
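As a rough illustration of the dictionary lookup, the sketch below maps words to ARPAbet phonemes using a tiny hand-copied excerpt in the CMU Pronouncing Dictionary style. The names `PRONUNCIATIONS` and `phonemes` are hypothetical, and whether Gentle keeps or strips the stress digits is an assumption here, not something this page states.

```python
import re

# Two entries in CMU Pronouncing Dictionary style (ARPAbet symbols, with
# stress digits on vowels). A real run would load the full dictionary file.
PRONUNCIATIONS = {
    "hello": ["HH", "AH0", "L", "OW1"],
    "world": ["W", "ER1", "L", "D"],
}


def phonemes(word, strip_stress=True):
    """Return the ARPAbet phonemes for a word, or None if it is unknown.

    Hypothetical helper for illustration; stripping stress digits is an
    assumption about how the phonemes might be reported.
    """
    phones = PRONUNCIATIONS.get(word.lower())
    if phones is None:
        return None
    if strip_stress:
        # Drop the trailing stress digit (0, 1, or 2) from vowel symbols.
        return [re.sub(r"\d$", "", p) for p in phones]
    return list(phones)


print(phonemes("Hello"))
print(phonemes("world", strip_stress=False))
```

Words missing from the dictionary return `None` in this sketch; a real aligner needs some fallback for out-of-vocabulary words.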

Building your own model

We do not yet support alignment using other acoustic models or alignment in languages other than English. However, we would like to! In future versions, it may be possible to swap out the model for one better suited to your domain or trained on another language.