
Possible to get timing info? #14

Open
benjismith opened this issue Feb 8, 2024 · 5 comments
Labels
feature request New feature or request

Comments

@benjismith

Is it possible to have this model also generate millisecond-level timestamps for the words (or phonemes) in the prompt?

I currently use speech marks from AWS Polly, and if this model could generate the same format, that would be very helpful!
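
For reference, Polly's word-level speech marks are newline-delimited JSON objects, one per word, giving the offset in milliseconds from the start of the audio plus character offsets into the source text. Roughly like this (illustrative values, not from a real synthesis run):

```json
{"time": 6,   "type": "word", "start": 0, "end": 5,  "value": "Hello"}
{"time": 373, "type": "word", "start": 6, "end": 11, "value": "world"}
```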

@sidroopdaska
Contributor

That's interesting, can you share more details about your use-case?

@benjismith
Author

I make a cloud writing platform for fiction authors (https://shaxpir.com). I'm working on a new "read aloud" feature where authors can highlight a few paragraphs of text and hear it read aloud to them by an AI voice.

Each word in the selection is highlighted as the voice reads the text, so that the author can follow along with their eyes. That's why I need the timestamps. But you can imagine a similar use-case with ebook readers, news readers, etc. Any application where the user might want to follow along with their eyes as an AI voice reads a block of text.

I've heard similar kinds of requests from creators of animated avatars. But in those cases, the developers usually need timestamps for each phoneme, so that they can synchronize mouth movements and other facial animations with the AI voices.

@Shiro836

+1. I need word-level timestamps to make TTS output look more interactive on screen.

@vatsalaggarwal
Contributor

That makes sense. I'm not sure when we'll have the time to get to it; in the meantime, this is probably something you can do with a forced-alignment pipeline after generation?

@danablend

@Shiro836 @benjismith
For getting word-level timestamps I had great success using Kalpy (https://github.com/mmcauliffe/kalpy), which is a low-level wrapper around Kaldi (a C++ library for speech processing).

The author of Kalpy also wrote the widely used Montreal Forced Aligner (MFA), which is a higher-level wrapper around Kaldi.

MFA is nice, but it loads and unloads models every time you do alignment, which causes a lot of overhead for small jobs like a quick "get the timestamps of this paragraph's audio" feature. It's built mostly for batch processing, like preparing massive datasets for training AI models for ASR and TTS.

Kalpy gives you more control because it sits closer to the Kaldi C++ library, and you can keep the aligner models loaded in memory on your server, reducing latency for your users. In my experience, Kalpy takes ~300ms to align ~3s of audio; MFA takes about 3-5s for the same job because of the model loading/unloading overhead.
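
If it helps anyone, here is a minimal sketch (not tested against this repo, just the general idea) of turning a forced-alignment TextGrid, the format MFA writes out and which you can also produce from Kalpy's alignments, into word-level millisecond timestamps. It assumes the textgrid package (pip install textgrid), a file called paragraph.TextGrid, and a tier named "words"; the file and tier names are just placeholders.

```python
# Minimal sketch: convert a forced-alignment TextGrid into word-level
# millisecond timestamps. "paragraph.TextGrid" and the "words" tier name
# are placeholders; adjust to whatever your aligner produced.
import textgrid  # pip install textgrid

tg = textgrid.TextGrid.fromFile("paragraph.TextGrid")
words = tg.getFirst("words")  # interval tier: one interval per word (or silence)

marks = []
for interval in words:
    if not interval.mark:  # skip empty / silence intervals
        continue
    marks.append({
        "time": int(interval.minTime * 1000),  # word onset in ms
        "end": int(interval.maxTime * 1000),   # word offset in ms
        "type": "word",
        "value": interval.mark,
    })

print(marks)
```

From there it's straightforward to emit the same JSON-lines shape that Polly's speech marks use.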

@vatsalaggarwal vatsalaggarwal added the feature request New feature or request label Mar 12, 2024