Is it possible to get the timing of phonemes, instead of full words? #687

Open · tscizzlebg opened this issue Sep 15, 2021 · 6 comments · May be fixed by #1377

Comments


tscizzlebg commented Sep 15, 2021

I searched the docs and the docstrings in the source code, but couldn't find a nice summary of the available output options, so I figured I'd ask here; it might be a super quick answer.

(Apologies if this is not the right place for questions. I posted on StackOverflow as well, but the vosk tag doesn't have that many total questions so I wasn't sure what y'all prefer.)

@nshmyrev (Collaborator)

We do not support phones yet. There is a pull request, though: #528
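For reference, word-level timings are already supported; here is a minimal sketch using the Python bindings, assuming an unpacked model directory named "model" and a 16 kHz mono 16-bit PCM WAV file:

```python
import json
import wave

from vosk import Model, KaldiRecognizer

wf = wave.open("speech.wav", "rb")            # 16 kHz, mono, 16-bit PCM
model = Model("model")                        # path to an unpacked Vosk model
rec = KaldiRecognizer(model, wf.getframerate())
rec.SetWords(True)                            # include per-word start/end times

while True:
    data = wf.readframes(4000)
    if len(data) == 0:
        break
    if rec.AcceptWaveform(data):
        for w in json.loads(rec.Result()).get("result", []):
            print(w["word"], w["start"], w["end"])

for w in json.loads(rec.FinalResult()).get("result", []):
    print(w["word"], w["start"], w["end"])
```

Each entry in the `result` array carries the word plus its start and end time in seconds.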

> I posted on StackOverflow as well, but the vosk tag doesn't have that many total questions so I wasn't sure what y'all prefer.

Some time ago Stack Overflow blocked me from answering Vosk questions there, so I left it altogether.

@tscizzlebg (Author)

Cool, thanks @nshmyrev! I'm definitely looking forward to that PR getting in.

For getting more into the nitty-gritty of speech, and trying to create training sets for speech decoding models (as opposed to what I'm guessing are the more mainstream use cases of subtitling videos and stuff like that), output by phone is key.

Re StackOverflow, that's too bad. Good to know.

@nshmyrev (Collaborator)

> trying to create training sets for speech decoding models (as opposed to what I'm guessing are the more mainstream use cases of subtitling videos and stuff like that), output by phone is key.

What are "speech decoding models" exactly? Could you please clarify?

@tscizzlebg (Author)

Ah. For decoding intended speech from neural activity.

Here's an example of research toward restoring the communication ability of people with severe paralysis: http://changlab.ucsf.edu/s/anumanchipalli_chartier_2019.pdf

@Shallowmallow

Shouldn't it be possible to make a model that recognizes all phones? Like this one, for example: https://github.com/xinjli/allosaurus
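For reference, a minimal sketch following the allosaurus README; it assumes `pip install allosaurus` and a 16 kHz mono WAV file, and the timestamp option is as documented there:

```python
# Minimal sketch following the allosaurus README; assumes
# `pip install allosaurus` and a 16 kHz mono WAV file.
from allosaurus.app import read_recognizer

model = read_recognizer()                    # load the default universal phone model
print(model.recognize("sample.wav"))         # space-separated IPA phones

# The README also documents per-phone timestamps (start time, duration, phone):
print(model.recognize("sample.wav", timestamp=True))
```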


madhephaestus commented Jun 1, 2023

For anyone looking for Java lip-sync software based on Vosk, I have a small stand-alone example for you: https://github.com/madhephaestus/TextToSpeechASDRTest.git I was able to use the partial results with the word timing to calculate the timing of the phonemes (after looking up the phonemes in a phoneme dictionary). I then down-mapped the phonemes to visemes and stored the visemes in a list with timestamps. The timestamped visemes process in a static 200 ms, and then the audio can begin playing, with the mouth movements synchronized precisely to the phoneme start times precomputed ahead of time. This is in contrast to Rhubarb, which takes as long to run as the audio file is long.
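The core idea, splitting each Vosk word timing across its dictionary phonemes and then mapping those to visemes, can be sketched as follows. The linked project is Java; this Python sketch, the tiny dictionary, and the viseme table are illustrative assumptions, not the project's actual code:

```python
# Hedged sketch of the approach described above: split a word's time span
# evenly across its phonemes, then map each phoneme to a mouth shape.

# A CMUdict-style pronunciation dictionary: word -> list of ARPAbet phonemes.
PHONEME_DICT = {"hello": ["HH", "AH", "L", "OW"]}

# Hypothetical many-to-one mapping from phonemes to visemes.
VISEME_MAP = {"HH": "rest", "AH": "open", "L": "tongue", "OW": "round"}

def word_to_timed_visemes(word, start, end):
    """Split a word timing evenly across its phonemes and map to visemes."""
    phonemes = PHONEME_DICT[word.lower()]
    step = (end - start) / len(phonemes)
    return [(round(start + i * step, 3), VISEME_MAP.get(p, "rest"))
            for i, p in enumerate(phonemes)]

# Example with a word timing as produced by Vosk with SetWords(True):
print(word_to_timed_visemes("hello", 0.42, 0.80))
# [(0.42, 'rest'), (0.515, 'open'), (0.61, 'tongue'), (0.705, 'round')]
```

The even split is the simplifying assumption here; per-phoneme durations from a forced aligner would be more accurate, but for lip sync the viseme start times only need to be roughly right.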
