Is there a way to make an additional float[] output showing individual phoneme times/lengths from the native DLL? #425

JasonBlain · 2024-03-05T02:12:32Z

I'm working in Unity, trying to lip-sync visemes to the audio output. I can see where the function that generates audio from the phonemized text appends the phoneme segments together, but the total length of the audio clip alone doesn't tell me how the individual phonemes/visemes are paced.

Is there a way to add an output array of floats to go with the phoneme sequence of a TTS output? No need to edit the tensor; I just need an additional output with the array after the audio has been rendered. I imagine it would have to be done at the .cpp level and then rebuilt into the DLL?

synesthesiam (Maintainer) · 2024-03-05T04:03:27Z

Yes, this is fully possible but will require either (1) using the original PyTorch models or (2) re-exporting the voice models and changing the C++ code.

An intermediate output of the model is the length of every phoneme. It can be returned alongside the audio, but doing so requires one of the changes above.

JasonBlain (Author) · 2024-03-05T04:06:39Z

Hey thanks for the reply.

How could I re-export with the additional outputs? I've got some Python experience; I'm kind of a generalist. I've got most of a modified overload header written in the cpp files of a fork of the piper.unity repo I made, so I have that part worked out, I think.

4 replies

synesthesiam Mar 5, 2024
Maintainer

The w_ceil variable has the phoneme lengths:

piper/src/python/piper_train/vits/models.py

Line 703 in e5cb84c

w_ceil = torch.ceil(w)

Multiplying this tensor by 256 will get you the number of audio samples per phoneme.
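A minimal sketch of that arithmetic in plain Python, turning per-phoneme frame counts (the values of w_ceil) into start times and durations in seconds. The function name `phoneme_timings` is made up for illustration, and the 22050 Hz default sample rate is an assumption; the real value should come from the voice's config.

```python
HOP_LENGTH = 256  # audio samples per frame, as stated in the answer above


def phoneme_timings(frame_counts, sample_rate=22050, hop_length=HOP_LENGTH):
    """Return a list of (start_seconds, duration_seconds), one per entry.

    frame_counts: per-phoneme frame counts, i.e. the values of w_ceil
    sample_rate:  ASSUMED default; read the real rate from the voice config
    """
    timings = []
    start_sample = 0
    for frames in frame_counts:
        n_samples = int(frames) * hop_length  # frames -> audio samples
        timings.append((start_sample / sample_rate, n_samples / sample_rate))
        start_sample += n_samples  # the next phoneme starts where this one ends
    return timings


if __name__ == "__main__":
    for start, duration in phoneme_timings([2.0, 5.0, 3.0], sample_rate=16000):
        print(f"start={start:.3f}s duration={duration:.3f}s")
```

Keeping the running total in samples and dividing only at the end avoids accumulating floating-point error over long utterances.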

That w_ceil tensor needs to be returned from the infer function and then also returned with the audio here:

piper/src/python/piper_train/export_onnx.py

Line 60 in e5cb84c

audio = model_g.infer(
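The shape of that change can be sketched without PyTorch. This is only an illustration of the pattern, not Piper's actual code: `StubGenerator`, `make_infer_forward`, and the output name `phoneme_frames` are hypothetical, and the sketch assumes `infer` has already been edited to return `(audio, w_ceil)`.

```python
# Hypothetical output names. In the real export script, a list like this
# would be passed as the `output_names` argument of torch.onnx.export.
OUTPUT_NAMES = ["output", "phoneme_frames"]


class StubGenerator:
    """Stand-in for the generator model; NOT Piper code."""

    def infer(self, phoneme_ids):
        w_ceil = [2.0 for _ in phoneme_ids]      # pretend every id lasts 2 frames
        audio = [0.0] * int(sum(w_ceil) * 256)   # silent waveform of matching length
        return audio, w_ceil                     # ASSUMED edited return value


def make_infer_forward(model_g):
    """Build a forward function that returns audio AND phoneme frame counts."""

    def infer_forward(*args, **kwargs):
        audio, w_ceil = model_g.infer(*args, **kwargs)
        return audio, w_ceil  # two values -> two outputs in the exported graph

    return infer_forward


if __name__ == "__main__":
    forward = make_infer_forward(StubGenerator())
    audio, w_ceil = forward([1, 2, 3])
    print(len(audio), w_ceil)
```

With real tensors the wrapper looks the same; the point is that the exported function returns a tuple and the export call lists one name per returned value.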

On the C++ side, you then need to pick apart the multiple output tensors (one audio, one phoneme samples):

piper/src/cpp/piper.cpp

Line 386 in e5cb84c

auto outputTensors = session.onnx.Run(
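The unpacking logic is the same in any language, so here is a rough sketch in Python rather than C++ to keep it short. Plain lists stand in for the output tensors, `split_outputs` is a made-up helper, and the output order (audio first, phoneme frame counts second) is an assumption that depends on how the model was exported.

```python
def split_outputs(output_tensors, hop_length=256):
    """Separate the two model outputs and sanity-check them.

    output_tensors: [audio_samples, phoneme_frame_counts] (ASSUMED order)
    Returns (audio, phoneme_samples, mismatch), where mismatch is the audio
    length minus the sum of the per-phoneme sample counts.
    """
    if len(output_tensors) != 2:
        raise ValueError("expected two outputs: audio and phoneme frame counts")
    audio, phoneme_frames = output_tensors
    phoneme_samples = [int(frames) * hop_length for frames in phoneme_frames]
    # ASSUMPTION: the counts should add up to the audio length; a non-zero
    # mismatch suggests trimming, padding, or a wrong hop length.
    mismatch = len(audio) - sum(phoneme_samples)
    return audio, phoneme_samples, mismatch


if __name__ == "__main__":
    fake_audio = [0.0] * 768
    _, samples, mismatch = split_outputs([fake_audio, [1.0, 2.0]])
    print(samples, mismatch)
```

Returning the mismatch instead of raising lets the caller decide how strict to be while debugging a fresh export.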

Answer selected by JasonBlain

JasonBlain Mar 5, 2024
Author

Lovely, will go through that as best I can and see what I can come up with. Good cites. Thank you.

JasonBlain Mar 6, 2024
Author

EDIT: Found the checkpoints repo on Hugging Face and will try the re-export.

JasonBlain Mar 7, 2024
Author

All of those citations were great. I did have to do a couple more things for my system config before Python could push out the new ONNX export.

I'm on Windows, so I had to edit export_onnx.py to make some of the POSIX path handling Windows-explicit. Monotonic align was also acting up with a circular reference, so I had to change its setup and init a bit; these were mostly path-related issues.

JarbasAl · 2024-03-21T17:48:51Z

Is it possible to expose this in the Piper Python API? It would help a lot in OVOS to generate mouth movements for the Mark1 device.

Together with the generated audio file, OVOS needs a list of phonemes plus the duration of each phoneme. For the most part we just use the original mimic1 TTS to generate these, but this is far from perfect, as they often don't match the actual audio. If we could get Piper to output this info natively, it would be awesome!
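To make the request concrete, here is a rough sketch of how a phoneme list plus per-phoneme durations could drive mouth movements: build a timeline once, then look up the active phoneme for any playback time. Nothing here is an existing Piper or OVOS API; `build_timeline` and `phoneme_at` are hypothetical helpers, and durations are assumed to be in seconds.

```python
from bisect import bisect_right
from itertools import accumulate


def build_timeline(phonemes, durations):
    """Return a list of (phoneme, start_seconds, duration_seconds)."""
    if len(phonemes) != len(durations):
        raise ValueError("need exactly one duration per phoneme")
    # Each phoneme starts where the previous ones end.
    starts = [0.0] + list(accumulate(durations))[:-1]
    return list(zip(phonemes, starts, durations))


def phoneme_at(timeline, t):
    """Return the phoneme active at playback time t, or None outside the clip."""
    if not timeline or t < 0:
        return None
    starts = [start for _, start, _ in timeline]
    index = bisect_right(starts, t) - 1  # last entry starting at or before t
    symbol, start, duration = timeline[index]
    return symbol if t < start + duration else None


if __name__ == "__main__":
    timeline = build_timeline(["h", "e", "l"], [0.5, 0.25, 0.25])
    print(phoneme_at(timeline, 0.6))
```

An animation loop would call `phoneme_at` with the current audio playback position and map the returned phoneme to a mouth shape.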

3 replies

synesthesiam Apr 22, 2024
Maintainer

Yes, I will expose this to both the C and Python APIs in the future.

goldyfruit May 1, 2024

That would help a lot indeed; for now we have to install Mimic TTS, which is around 400 MB 👍

goldyfruit May 1, 2024

Here is an example of using Piper with the ryan-low voice on a Mark 1 device.

https://youtube.com/shorts/9m3501afUEw
