
Clipping distortion of the generated waveform #7

Open
WelkinYang opened this issue Sep 7, 2021 · 3 comments
@WelkinYang

Hi, thanks for sharing the code. I have tried it on several datasets, both Chinese and English, but some of the generated waveforms show clipping (as if the generated mel spectrogram is too energetic in some positions?). I first tried different vocoders, including HiFi-GAN and Griffin-Lim, and the clipping appeared with both. Then I tried different value ranges for the mel spectrogram, including the log domain and normalization to [-1, 1], but that did not avoid it either. Finally, I tried different temperature values (1.0, 1.3, 1.5), and the problem still occurred. I would like to know the possible causes of this phenomenon and how to fix it. If anyone has encountered this situation, please feel free to discuss it.
[image: generated waveform]
As shown above, the waveform values at some positions are out of range, which causes clipping.
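For reference, a quick way to quantify the problem is to measure the fraction of out-of-range samples before the waveform is saved (a minimal sketch; `wav` is assumed to be a float NumPy array scaled to [-1, 1]):

```python
import numpy as np

def clipping_ratio(wav: np.ndarray, limit: float = 1.0) -> float:
    """Fraction of samples at or beyond the representable range."""
    return float(np.mean(np.abs(wav) >= limit))

# clipping_ratio(wav) > 0 means the output will be clipped when written to disk
```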

@ivanvovk (Contributor) commented Sep 7, 2021

@WelkinYang Hey! This is strange, we haven't experienced this on the standard LJSpeech and Libri-TTS datasets. Probably your datasets are a bit noisy. However, I can suggest two things to try:

  • Try to renormalize the audio to a lower peak, for example [-0.7, 0.7], to leave some headroom for errors in the energy predictions.
  • Or try applying a forward pre-emphasis filter to the waveforms to balance the energy across frequencies. In that case, though, I think you will need to re-train HiFi-GAN on the filtered mels, and then apply the inverse filter during inference, of course. (A sketch of both options follows below.)
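Here is a minimal sketch of both suggestions (assuming SciPy is available; the 0.97 pre-emphasis coefficient is a common default, not something this repo prescribes):

```python
import numpy as np
from scipy.signal import lfilter

def peak_normalize(wav: np.ndarray, peak: float = 0.7) -> np.ndarray:
    """Rescale so the loudest sample sits at `peak`, leaving headroom."""
    return wav * (peak / max(np.abs(wav).max(), 1e-8))

def preemphasis(wav: np.ndarray, coef: float = 0.97) -> np.ndarray:
    """Forward pre-emphasis before mel extraction: y[n] = x[n] - coef * x[n-1]."""
    return lfilter([1.0, -coef], [1.0], wav)

def deemphasis(wav: np.ndarray, coef: float = 0.97) -> np.ndarray:
    """Inverse filter, applied to the vocoder output at inference time."""
    return lfilter([1.0], [1.0, -coef], wav)
```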

@WelkinYang (Author)

> @WelkinYang Hey! This is strange, we haven't experienced this on the standard LJSpeech and Libri-TTS datasets. Probably your datasets are a bit noisy. However, I can suggest two things to try:
>
>   • Try to renormalize the audio to a lower peak, for example [-0.7, 0.7], to leave some headroom for errors in the energy predictions.
>   • Or try applying a forward pre-emphasis filter to the waveforms to balance the energy across frequencies. In that case, though, I think you will need to re-train HiFi-GAN on the filtered mels, and then apply the inverse filter during inference, of course.

Thank you for your advice. I just tried raising the temperature to 2, and most of the clipping disappeared, but the overall audio energy also decreased. I'm not sure of the principle behind this (maybe because the encoder output is tightly constrained by the L2 loss, sampling at inference time can push values outside a reasonable range, so the temperature needs to be increased?).
[image: generated waveform at temperature 1.5]
[image: generated waveform at temperature 2.0]

By the way, I tried passing the encoder outputs through several convolutional layers before computing the L2 loss (to make the frames differ from each other, so that the start of the diffusion process is closer to its end; in fact, the L2 loss drops a lot), but in the end this leads to severe detuning of the generated audio. Do you have any suggestions about this attempt?
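Roughly, the modification I describe looks like this (a minimal sketch with hypothetical layer sizes; the real model's dimensions may differ):

```python
import torch
import torch.nn as nn

# Placeholder tensors standing in for the real model's outputs/targets.
enc_out = torch.randn(1, 80, 100)  # encoder output, [batch, n_mels, frames]
mel = torch.randn(1, 80, 100)      # ground-truth mel spectrogram

# Hypothetical convolutional stack applied before the L2 (prior) loss.
post_net = nn.Sequential(
    nn.Conv1d(80, 80, kernel_size=5, padding=2),
    nn.ReLU(),
    nn.Conv1d(80, 80, kernel_size=5, padding=2),
)

mu = post_net(enc_out)                 # refined prior mean, frames differ more
prior_loss = ((mu - mel) ** 2).mean()  # L2 loss computed after the convolutions
```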

@ivanvovk (Contributor) commented Sep 7, 2021

@WelkinYang the temperature issue seems reasonable. A lower temperature adds higher stochasticity to the encoder outputs during inference, and vice versa. The temperature at which quality becomes acceptable likely depends strongly on the dataset, and in your case the high stochasticity at temp=1.5 added too much uncertainty to the decoder predictions. Do not be afraid to set higher temperatures: in my experiments I often set it to 5.0 (mostly in the single-speaker setting). We even ran a separate ablation study, where a higher temperature always gave better perceptual quality in the single-speaker case.
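For context, this behavior is consistent with the added noise being scaled down by the temperature when the terminal latent is sampled, so a higher value means less stochasticity (a sketch of that step under this assumption; `mu_y` stands in for the aligned encoder output):

```python
import torch

mu_y = torch.randn(1, 80, 100)  # placeholder aligned encoder output
temperature = 5.0

# Higher temperature shrinks the added noise, hence less stochasticity.
z = mu_y + torch.randn_like(mu_y) / temperature
```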

As for adding the convolutional layers: could you provide some audio examples to listen to, please?
