Possible to get timing info? #14
Comments
That's interesting, can you share more details about your use-case?
I make a cloud writing platform for fiction authors (https://shaxpir.com). I'm working on a new "read aloud" feature where authors can highlight a few paragraphs of text and hear it read aloud to them by an AI voice. Each word in the selection is highlighted as the voice reads the text, so that the author can follow along with their eyes. That's why I need the timestamps. But you can imagine a similar use-case with ebook readers, news readers, etc. Any application where the user might want to follow along with their eyes as an AI voice reads a block of text. I've heard similar kinds of requests from creators of animated avatars. But in those cases, the developers usually need timestamps for each phoneme, so that they can synchronize mouth movements and other facial animations with the AI voices.
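Once word-level timestamps exist, the highlighting side of this use-case is straightforward: on each playback tick, look up which word is active. A minimal sketch, assuming you already have per-word start times in milliseconds (the timestamps and helper name here are illustrative, not from any particular TTS API):

```python
import bisect

def active_word_index(start_times, playback_ms):
    """Return the index of the word being spoken at playback_ms,
    given a sorted list of per-word start times (in milliseconds),
    or -1 if playback hasn't reached the first word yet."""
    return bisect.bisect_right(start_times, playback_ms) - 1

# Hypothetical timestamps for the four words "The quick brown fox"
starts = [0, 180, 420, 700]
print(active_word_index(starts, 450))  # -> 2 ("brown")
```

The binary search keeps the lookup cheap even for long selections, so it can run on every animation frame of the reader UI.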
+1. I need word-level timestamps to make TTS output look more interactive on the screen.
That makes sense. Not sure when we'll have the time to get to it; in the meanwhile, this is probably something one can do with a forced-alignment pipeline post-generation?
@Shiro836 @benjismith The author of Kalpy is also the author of the widely used Montreal Forced Aligner (MFA), which is a higher-level wrapper around Kaldi. MFA is nice, but it loads and unloads models every time you do alignment, which causes a lot of overhead for small jobs like a quick "get the timestamps of this paragraph's audio" feature. It's built mostly for batch processing, like preparing massive datasets for training ASR and TTS models. Kalpy gives you more control since it sits closer to the Kaldi C++ library, and you can keep the aligner models loaded in memory on your server, reducing latency for your users. In my experience, Kalpy takes ~300ms to align ~3s of audio; MFA takes about 3-5s for the same job due to the model loading/unloading overhead.
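The latency gap described above comes down to keeping the acoustic model resident in memory rather than reloading it per request. A minimal sketch of that pattern; the loader and model here are hypothetical stand-ins, not the actual Kalpy or MFA API:

```python
import time

class PersistentAligner:
    """Load the (expensive) aligner model once at startup and reuse
    it for every request, instead of paying the load cost per job."""

    def __init__(self, loader):
        self._model = loader()  # one-time cost, paid at server start

    def align(self, audio, transcript):
        # Per-request work only; the model stays warm in memory.
        return self._model(audio, transcript)

# Stand-in loader/model so the sketch is runnable: a real one would
# load acoustic models and return word/phoneme timestamps.
def fake_loader():
    time.sleep(0.1)  # simulate a slow model load
    return lambda audio, transcript: list(zip(transcript.split(), audio))

aligner = PersistentAligner(fake_loader)           # slow, happens once
result = aligner.align([0.0, 0.5], "hello world")  # fast, every call
```

The same shape works with any aligner library: construct the heavy object once in your server process and route all alignment requests through it.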
Is it possible to have this model also generate millisecond-level timestamps for the words (or phonemes) in the prompt?
I currently use speech marks from AWS Polly, and if this model could generate the same format, that would be very helpful!
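For reference, Polly's word speech marks are newline-delimited JSON objects carrying a millisecond offset and byte positions into the input text. A small sketch that emits word timestamps in that shape, assuming per-word times already came out of some aligner (the input pairs here are made up for illustration):

```python
import json

def to_speech_marks(text, word_times):
    """Format (word, time_ms) pairs as Polly-style word speech marks:
    one JSON object per line, with start/end byte offsets into text."""
    encoded = text.encode("utf-8")
    lines = []
    cursor = 0
    for word, time_ms in word_times:
        start = encoded.index(word.encode("utf-8"), cursor)
        end = start + len(word.encode("utf-8"))
        cursor = end
        lines.append(json.dumps({
            "time": time_ms, "type": "word",
            "start": start, "end": end, "value": word,
        }))
    return "\n".join(lines)

marks = to_speech_marks("Hello there", [("Hello", 6), ("there", 420)])
```

Each output line can then be consumed by the same client code that already handles Polly's `word` speech-mark stream.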