
Inference audio generated at higher speed than training files #79

Open
firelex opened this issue Jan 23, 2023 · 3 comments


firelex commented Jan 23, 2023

Finally, after a lot of labor, I got a decent English singer out of the model, which is great. But the audio generated during inference consistently plays back about 1.3 times faster than the training data fed in. The pitch and the phonemes are correct, but everything's sped up. Any idea why that would be the case? Thank you.


firelex commented Jan 23, 2023

Actually, I think I know where the issue comes from. In the training data, you require note durations and phoneme durations, but during inference, you only require note durations. How does the system know where one note ends and the next one starts? For example, if you have:

Phonemes:  Phoneme 1 | Phoneme 2 | Phoneme 3 | Phoneme 4
Notes:     C3        | C3        | C3        | C3
Durations: 1         | 1         | 1         | 1

It's clear that all four phonemes are sung over the note C3, but it's not clear whether we're talking about one C3 with duration 1 (for a total duration of 1), two C3s with duration 1 each (for a total duration of 2), or four C3s with duration 1 each (for a total duration of 4).
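To make the ambiguity concrete, here's a small Python sketch. The row layout and the per-phoneme note-index grouping are my own illustration, not the repo's actual input format:

```python
# Four phonemes, all on C3, each row carrying a note duration of 1.
rows = [("ph1", "C3", 1), ("ph2", "C3", 1), ("ph3", "C3", 1), ("ph4", "C3", 1)]

def total_duration(rows, note_index_per_phoneme):
    # Count each note's duration once, no matter how many phonemes share it.
    note_durations = {}
    for (_, _, dur), note_idx in zip(rows, note_index_per_phoneme):
        note_durations[note_idx] = dur
    return sum(note_durations.values())

print(total_duration(rows, [0, 0, 0, 0]))  # one C3   -> total duration 1
print(total_duration(rows, [0, 0, 1, 1]))  # two C3s  -> total duration 2
print(total_duration(rows, [0, 1, 2, 3]))  # four C3s -> total duration 4
```

Same flat rows, three different groupings, three different total durations, which is exactly the information inference doesn't have.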

I think this explains why my playback speeds during inference are off.

Am I missing something?

@MrDiplodocus

Great question above.
I have the same problem.
I can't sync the singing with the MIDI file, and I would really like to know how to solve it.


firelex commented Jan 25, 2023

Are you using this for English or Chinese? I'm making some progress here (in English) but haven't solved it yet. Part of it, I think, is that the model was developed for Chinese, which I believe has simpler syllable formation rules. But that doesn't explain everything. I've made a few changes to the data and will do a 500K-step training run tomorrow, just to make sure it isn't simply an undertraining problem. Interestingly, the model also struggles with durations on seen data at first, but manages to learn them. It's on unseen data that it's wildly off.

If worst comes to worst, I might infer in small chunks and time-stretch the results. But I'm still hoping to tweak the model so it produces the right durations. Let me know where you get to.
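For the time-stretch fallback, here's a minimal sketch, assuming librosa and soundfile are installed. The 1.3x factor is just the rough mismatch I observed, and the filenames are placeholders:

```python
import librosa
import soundfile as sf

# Load one inferred chunk at its native sample rate (filename is a placeholder).
y, sr = librosa.load("inferred_chunk.wav", sr=None)

speedup = 1.3  # observed: inference output plays roughly 1.3x too fast
# rate < 1.0 slows the audio down, restoring the intended duration
# without changing pitch (librosa uses a phase vocoder under the hood).
y_fixed = librosa.effects.time_stretch(y, rate=1.0 / speedup)

sf.write("inferred_chunk_fixed.wav", y_fixed, sr)
```

This only papers over the symptom, of course; the real fix is getting the note grouping right at inference time.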
