
Fix D4C waveform decomposition threshold (improves sound quality of variance models) #187

Open · wants to merge 1 commit into main

Conversation

lottev1991

Hello all,

Recently, users in the DiffSinger community have been experimenting with lowering the threshold of the D4C waveform decomposition step found in binarizer_utils.py. The default setting is quite high, which can cause the following issues in models using variance parameters (tension and voicing in particular):

  • Decreased audio quality. I trained one model with the tension parameter enabled and one without. The tension model showed a noticeable reduction in audio quality that was not present in the model without it. I trained both the acoustic and variance models to their maximum steps, but the sound quality never improved.
  • Devoicing of vowel sounds, especially when tension and voicing are trained together. With the current default settings, vowels are often incorrectly recognized as unvoiced sounds, which causes very noticeable gaps on long notes and reduces the quality of the model even further.

I've set the threshold to 0.25 in this PR; there have been suggestions from the community to use an even lower value, though I have not tested that myself. This value has already significantly improved the quality of my latest model, which does support the tension parameter. The improvement so far seems consistent across the board, with multiple positive reports from users. This is why I think a lower threshold should become the new default during waveform decomposition.

Initial findings were done by @UtaUtaUtau, who had this to say about it:

The D4C step in the waveform decomposition class could be prone to devoicing vowels because the default threshold is pretty high. I would know from experience with developing a WORLD-based UTAU resampler, and a few voicebanks get this issue because of that high threshold. I'd recommend passing threshold=0.25 in it as I found that value pretty decent at avoiding accidental vowel devoicing, although I didn't do any rigorous testing for that threshold. I'm just pointing it out because WORLD might react differently from actual singing samples versus UTAU recording samples...
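For context, the change under discussion amounts to passing a lower `threshold` to pyworld's D4C call, e.g. `ap = pyworld.d4c(waveform, f0, t, fs, threshold=0.25)` (the library default is 0.85). The sketch below is a toy model of the effect being described, not the real D4C internals: frames whose per-frame periodicity score falls at or below the threshold are treated as fully unvoiced, so a lower threshold devoices fewer frames. The function name and the score values are illustrative.

```python
def simulate_d4c_voicing(periodicity_scores, threshold):
    """Toy model (illustrative only) of D4C's voiced/unvoiced gate:
    frames whose score falls at or below the threshold are marked
    fully aperiodic (aperiodicity forced to 1.0)."""
    return [1.0 if s <= threshold else 1.0 - s for s in periodicity_scores]

# Made-up per-frame periodicity scores for a sustained vowel.
scores = [0.9, 0.6, 0.4, 0.3, 0.1]

high = simulate_d4c_voicing(scores, threshold=0.85)  # library default
low = simulate_d4c_voicing(scores, threshold=0.25)   # proposed value

# With the default threshold, four of the five frames are devoiced;
# with 0.25, only the weakest frame is.
print(high.count(1.0), low.count(1.0))  # → 4 1
```

This illustrates why a sustained vowel with moderate periodicity can develop gaps under a high threshold: borderline frames flip to unvoiced even though they are audibly voiced.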

Regards,

Lotte V

@yqzhishen
Member

We have done some experiments on the parameter, but no observable difference was found between the default threshold and your proposed value.

Perhaps we should collect more information on this issue. For example, which PE (pitch extractor) are you using, or which PEs have you tried? Does the choice of PE matter here? Currently, most people in our Chinese community, including us, use RMVPE, and there is as yet no evidence that the threshold (or tension itself) affects quality. I hope you (and others as well) can provide more experimental results before we determine whether to modify the settings, and how.

Changing a parameter is not an easy decision. If there are not many cases to support the change, we would rather make it a user-defined configuration than hard-code it; if the influence is wide and significant, then we can consider changing it directly in the code; otherwise, the default value tuned by the library author should still be preferred.
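The configuration route could look like the following minimal sketch. The key name `d4c_threshold` and the `config` dict are hypothetical, not existing DiffSinger settings; the fallback value matches the WORLD/pyworld default.

```python
# Hypothetical user-configurable D4C threshold; the key "d4c_threshold"
# is illustrative and not an existing DiffSinger config option.
DEFAULT_D4C_THRESHOLD = 0.85  # library author's default

def get_d4c_threshold(config):
    # Fall back to the library default when the user does not override it.
    return config.get("d4c_threshold", DEFAULT_D4C_THRESHOLD)

print(get_d4c_threshold({}))                       # → 0.85
print(get_d4c_threshold({"d4c_threshold": 0.25}))  # → 0.25
```

This keeps the library author's tuning as the default while letting users who hit devoicing issues opt into a lower value without patching the code.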
