
Possible to get timing info? #14

Open
benjismith opened this issue Feb 8, 2024 · 5 comments
Labels
feature request New feature or request

Comments

@benjismith

Is it possible to have this model also generate millisecond-level timestamps for the words (or phonemes) in the prompt?

I currently use speech marks from AWS Polly, and if this model could generate the same format, that would be very helpful!
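
For reference, Polly's word-level speech marks are newline-delimited JSON objects, one per word, giving the offset in milliseconds from the start of the audio plus character offsets into the source text. Roughly like this (illustrative values, not from a real synthesis run):

```json
{"time": 6,   "type": "word", "start": 0, "end": 5,  "value": "Hello"}
{"time": 373, "type": "word", "start": 6, "end": 11, "value": "world"}
```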

@sidroopdaska
Contributor

That's interesting, can you share more details about your use-case?

@benjismith
Author

I make a cloud writing platform for fiction authors (https://shaxpir.com). I'm working on a new "read aloud" feature where authors can highlight a few paragraphs of text and hear it read aloud to them by an AI voice.

Each word in the selection is highlighted as the voice reads the text, so that the author can follow along with their eyes. That's why I need the timestamps. But you can imagine a similar use-case with ebook readers, news readers, etc. Any application where the user might want to follow along with their eyes as an AI voice reads a block of text.

I've heard similar kinds of requests from creators of animated avatars. But in those cases, the developers usually need timestamps for each phoneme, so that they can synchronize mouth movements and other facial animations with the AI voices.

@Shiro836

+1. I need word-level timestamps to make TTS output look more interactive on screen.

@vatsalaggarwal
Contributor

That makes sense. I'm not sure when we'll have the time to get to it; in the meantime, this is probably something you can do with a forced-alignment pipeline after generation?

@danablend

@Shiro836 @benjismith
For getting word-level timestamps I had great success using Kalpy (https://github.com/mmcauliffe/kalpy), which is a low-level wrapper around Kaldi (a C++ library for speech processing).

The author of Kalpy also wrote the widely used Montreal Forced Aligner (MFA), which is a higher-level wrapper around Kaldi.

MFA is nice, but it loads and unloads models every time you do alignment, which causes a lot of overhead for small jobs like a quick "get the timestamps of this paragraph's audio" feature. It's built mostly for batch processing, like preparing massive datasets for training AI models for ASR and TTS.

Kalpy gives you more control because it sits closer to the Kaldi C++ library, and you can keep the aligner models loaded in memory on your server, reducing latency for your users. In my experience, Kalpy takes ~300ms to align ~3s of audio; MFA takes about 3-5s for the same job because of the model loading/unloading overhead.
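
If it helps anyone, here is a minimal sketch (not tested against this repo, just the general idea) of turning a forced-alignment TextGrid, the format MFA writes out and which you can also produce from Kalpy's alignments, into word-level millisecond timestamps. It assumes the textgrid package (pip install textgrid), a file called paragraph.TextGrid, and a tier named "words"; the file and tier names are just placeholders.

```python
# Minimal sketch: convert a forced-alignment TextGrid into word-level
# millisecond timestamps. "paragraph.TextGrid" and the "words" tier name
# are placeholders; adjust to whatever your aligner produced.
import textgrid  # pip install textgrid

tg = textgrid.TextGrid.fromFile("paragraph.TextGrid")
words = tg.getFirst("words")  # interval tier: one interval per word (or silence)

marks = []
for interval in words:
    if not interval.mark:  # skip empty / silence intervals
        continue
    marks.append({
        "time": int(interval.minTime * 1000),  # word onset in ms
        "end": int(interval.maxTime * 1000),   # word offset in ms
        "type": "word",
        "value": interval.mark,
    })

print(marks)
```

From there it's straightforward to emit the same JSON-lines shape that Polly's speech marks use.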

@vatsalaggarwal vatsalaggarwal added the feature request New feature or request label Mar 12, 2024