
Clipping distortion of the generated waveform #7

Open
WelkinYang opened this issue Sep 7, 2021 · 3 comments
@WelkinYang

Hi, thanks for sharing the code. I have tried it on several datasets, both Chinese and English, but some of the generated waveforms show clipping (as if the generated mel spectrogram is too energetic in some positions?). I first tried different vocoders, including HiFi-GAN and Griffin-Lim, and the clipping appeared with both. Then I tried different value ranges for the mel spectrogram, including the log domain and normalization to [-1, 1], but that did not avoid it either. Finally, I tried different temperature values (1.0, 1.3, 1.5), and the problem still occurred. I would like to know the possible causes of this phenomenon and how to fix it. If anyone has encountered this situation, please feel free to discuss it.
[image: generated waveform]
As shown above, the waveform values at some positions are out of range, which causes clipping.
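For reference, a quick way to quantify the problem is to measure the fraction of out-of-range samples before the waveform is saved (a minimal sketch; `wav` is assumed to be a float NumPy array scaled to [-1, 1]):

```python
import numpy as np

def clipping_ratio(wav: np.ndarray, limit: float = 1.0) -> float:
    """Fraction of samples at or beyond the representable range."""
    return float(np.mean(np.abs(wav) >= limit))

# clipping_ratio(wav) > 0 means the output will be clipped when written to disk
```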

@ivanvovk (Contributor) commented Sep 7, 2021

@WelkinYang Hey! This is strange, we haven't experienced this on the standard LJSpeech and Libri-TTS datasets. Probably your datasets are a bit noisy. However, I can suggest two things to try:

  • Try to renormalize the audio to a lower peak, for example [-0.7, 0.7], to leave some headroom for errors in the energy predictions.
  • Or try applying a forward pre-emphasis filter to the waveforms to balance the energy across frequencies. In that case, though, I think you will need to re-train HiFi-GAN on the filtered mels, and then apply the inverse filter during inference, of course. (A sketch of both options follows below.)
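Here is a minimal sketch of both suggestions (assuming SciPy is available; the 0.97 pre-emphasis coefficient is a common default, not something this repo prescribes):

```python
import numpy as np
from scipy.signal import lfilter

def peak_normalize(wav: np.ndarray, peak: float = 0.7) -> np.ndarray:
    """Rescale so the loudest sample sits at `peak`, leaving headroom."""
    return wav * (peak / max(np.abs(wav).max(), 1e-8))

def preemphasis(wav: np.ndarray, coef: float = 0.97) -> np.ndarray:
    """Forward pre-emphasis before mel extraction: y[n] = x[n] - coef * x[n-1]."""
    return lfilter([1.0, -coef], [1.0], wav)

def deemphasis(wav: np.ndarray, coef: float = 0.97) -> np.ndarray:
    """Inverse filter, applied to the vocoder output at inference time."""
    return lfilter([1.0], [1.0, -coef], wav)
```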

@WelkinYang (Author)

> @WelkinYang Hey! This is strange, we haven't experienced this on the standard LJSpeech and Libri-TTS datasets. Probably your datasets are a bit noisy. However, I can suggest two things to try:
>
>   • Try to renormalize the audio to a lower peak, for example [-0.7, 0.7], to leave some headroom for errors in the energy predictions.
>   • Or try applying a forward pre-emphasis filter to the waveforms to balance the energy across frequencies. In that case, though, I think you will need to re-train HiFi-GAN on the filtered mels, and then apply the inverse filter during inference, of course.

Thank you for your advice. I just tried raising the temperature to 2, and most of the clipping disappeared, but the overall audio energy also decreased. I'm not sure of the principle behind this (maybe because the encoder output is tightly constrained by the L2 loss, sampling at inference time can push values outside a reasonable range, so the temperature needs to be increased?).
[image: generated waveform at temperature 1.5]
[image: generated waveform at temperature 2.0]

By the way, I tried passing the encoder outputs through several convolutional layers before computing the L2 loss (to make the frames differ from each other, so that the start of the diffusion process is closer to its end; in fact, the L2 loss drops a lot), but in the end this leads to severe detuning of the generated audio. Do you have any suggestions about this attempt?
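Roughly, the modification I describe looks like this (a minimal sketch with hypothetical layer sizes; the real model's dimensions may differ):

```python
import torch
import torch.nn as nn

# Placeholder tensors standing in for the real model's outputs/targets.
enc_out = torch.randn(1, 80, 100)  # encoder output, [batch, n_mels, frames]
mel = torch.randn(1, 80, 100)      # ground-truth mel spectrogram

# Hypothetical convolutional stack applied before the L2 (prior) loss.
post_net = nn.Sequential(
    nn.Conv1d(80, 80, kernel_size=5, padding=2),
    nn.ReLU(),
    nn.Conv1d(80, 80, kernel_size=5, padding=2),
)

mu = post_net(enc_out)                 # refined prior mean, frames differ more
prior_loss = ((mu - mel) ** 2).mean()  # L2 loss computed after the convolutions
```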

@ivanvovk (Contributor) commented Sep 7, 2021

@WelkinYang the temperature issue seems reasonable. A lower temperature adds higher stochasticity to the encoder outputs during inference, and vice versa. The temperature at which quality becomes acceptable likely depends strongly on the dataset, and in your case the high stochasticity at temp=1.5 added too much uncertainty to the decoder predictions. Do not be afraid to set higher temperatures: in my experiments I often set it to 5.0 (mostly in the single-speaker setting). We even ran a separate ablation study, where a higher temperature always gave better perceptual quality in the single-speaker case.
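For context, this behavior is consistent with the added noise being scaled down by the temperature when the terminal latent is sampled, so a higher value means less stochasticity (a sketch of that step under this assumption; `mu_y` stands in for the aligned encoder output):

```python
import torch

mu_y = torch.randn(1, 80, 100)  # placeholder aligned encoder output
temperature = 5.0

# Higher temperature shrinks the added noise, hence less stochasticity.
z = mu_y + torch.randn_like(mu_y) / temperature
```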

As for adding the convolutional layers: could you provide some audio examples to listen to, please?
