
Inference audio generated at higher speed than training files #79

Open
firelex opened this issue Jan 23, 2023 · 3 comments


firelex commented Jan 23, 2023

Finally, after a lot of labor, I got a decent English singer out of the model, which is great. But the audio generated during inference consistently plays back about 1.3 times faster than the training data fed in. The pitch and the phonemes are correct, but everything's sped up. Any idea why that would be the case? Thank you.


firelex commented Jan 23, 2023

Actually, I think I know where the issue comes from. In the training data, you require note durations and phoneme durations, but during inference, you only require note durations. How does the system know where one note ends and the next one starts? For example, if you have:

Phonemes:  Phoneme 1 | Phoneme 2 | Phoneme 3 | Phoneme 4
Notes:     C3        | C3        | C3        | C3
Durations: 1         | 1         | 1         | 1

It's clear that all four phonemes are sung over the note C3, but it's not clear whether we're talking about one C3 with duration 1 (for a total duration of 1), two C3s with duration 1 each (for a total duration of 2), or four C3s with duration 1 each (for a total duration of 4).
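To make the ambiguity concrete, here's a small Python sketch. The row layout and the per-phoneme note-index grouping are my own illustration, not the repo's actual input format:

```python
# Four phonemes, all on C3, each row carrying a note duration of 1.
rows = [("ph1", "C3", 1), ("ph2", "C3", 1), ("ph3", "C3", 1), ("ph4", "C3", 1)]

def total_duration(rows, note_index_per_phoneme):
    # Count each note's duration once, no matter how many phonemes share it.
    note_durations = {}
    for (_, _, dur), note_idx in zip(rows, note_index_per_phoneme):
        note_durations[note_idx] = dur
    return sum(note_durations.values())

print(total_duration(rows, [0, 0, 0, 0]))  # one C3   -> total duration 1
print(total_duration(rows, [0, 0, 1, 1]))  # two C3s  -> total duration 2
print(total_duration(rows, [0, 1, 2, 3]))  # four C3s -> total duration 4
```

Same flat rows, three different groupings, three different total durations, which is exactly the information inference doesn't have.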

I think this explains why my playback speeds during inference are off.

Am I missing something?

@MrDiplodocus

Great question above.
I have the same problem.
I can't sync the singing with the MIDI file, and I would really like to know how to solve it.


firelex commented Jan 25, 2023

Are you using this for English or Chinese? I'm making some progress here (in English) but haven't solved it yet. Part of it, I think, is that the model was developed for Chinese, which I believe has simpler syllable formation rules. But that doesn't explain everything. I've made a few changes to the data and will do a 500K-step training run tomorrow, just to make sure it isn't simply an undertraining problem. Interestingly, the model also struggles with durations on seen data at first, but manages to learn them. It's on unseen data that it's wildly off.

If worst comes to worst, I might infer in small chunks and time-stretch the results. But I'm still hoping to tweak the model so it produces the right durations. Let me know where you get to.
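For the time-stretch fallback, here's a minimal sketch, assuming librosa and soundfile are installed. The 1.3x factor is just the rough mismatch I observed, and the filenames are placeholders:

```python
import librosa
import soundfile as sf

# Load one inferred chunk at its native sample rate (filename is a placeholder).
y, sr = librosa.load("inferred_chunk.wav", sr=None)

speedup = 1.3  # observed: inference output plays roughly 1.3x too fast
# rate < 1.0 slows the audio down, restoring the intended duration
# without changing pitch (librosa uses a phase vocoder under the hood).
y_fixed = librosa.effects.time_stretch(y, rate=1.0 / speedup)

sf.write("inferred_chunk_fixed.wav", y_fixed, sr)
```

This only papers over the symptom, of course; the real fix is getting the note grouping right at inference time.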
