Diffusion loss not decreasing #13

Open
aniketp02 opened this issue Jun 3, 2022 · 1 comment
@aniketp02

Hi,
I have trained the GradTTS model on the Indian accent English dataset, and the results are pretty awesome.

Looking at the logs, I was startled to see that the diffusion loss was not decreasing throughout training, unlike the other losses, and was also fluctuating a lot. Can anyone explain why this is the case, and if the diffusion loss fluctuates so much, why is it used in the total loss calculation?

I have attached my tensorboard outputs.

[TensorBoard plots: Training Diffusion Loss, Training Prior Loss, Training Duration Loss]
@ivanvovk (Contributor) commented Jun 3, 2022

@aniketp02 Hi! All three losses are a must to train the model properly. What you have observed about the diffusion loss is normal behaviour, which we discuss in Section 4 of our Grad-TTS paper.

The denoising score matching objective we minimize to train a diffusion model is the integral $\int_0^T \lambda(t) \|\nabla_{x_t} \log p(x_t|x_0) - s_\theta(x_t, t)\|^2 dt$ over the time interval $t \in [0, T]$. The only practical way to estimate it is the Monte-Carlo method: at each training step, sample uniformly distributed $t$ and compute the average loss at those points. Random sampling of $t$ induces high variance (especially for small batch sizes), which is why the loss seems to oscillate randomly after some training stage; on average, however, it does decrease. At the same time, to generate high-quality samples at inference, the gradient predictions must be very accurate (down to very small changes of loss) over the whole continuous interval $[0, T]$, and it takes the score network $s_\theta(x_t, t)$ a long time to achieve this, so the loss converges slowly on average.
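For intuition, here is a minimal PyTorch sketch of that Monte-Carlo estimate. This is not the actual Grad-TTS code: the toy perturbation kernel $p(x_t|x_0) = \mathcal{N}(x_0, t I)$, the weighting $\lambda(t) = t$, and the `score_net` interface are all simplifying assumptions made for illustration.

```python
import torch

def dsm_loss(score_net, x0, T=1.0, eps=1e-5):
    """One Monte-Carlo estimate of the denoising score matching objective.

    Assumes a toy perturbation kernel p(x_t | x_0) = N(x_0, t * I), so the
    target score is -(x_t - x_0) / t = -z / sqrt(t). Grad-TTS uses a
    different kernel, but the sampling logic is the same.
    """
    b = x0.shape[0]
    # The Monte-Carlo draw: sample t ~ Uniform(eps, T), one per batch element.
    t = torch.rand(b, device=x0.device) * (T - eps) + eps
    t_ = t.view(b, *([1] * (x0.dim() - 1)))  # broadcastable over feature dims
    # Perturb x_0 through the kernel: x_t = x_0 + sqrt(t) * z, z ~ N(0, I).
    z = torch.randn_like(x0)
    xt = x0 + t_.sqrt() * z
    target = -z / t_.sqrt()   # score of p(x_t | x_0)
    pred = score_net(xt, t)   # network estimate s_theta(x_t, t)
    # lambda(t) = t keeps the weighted error on a comparable scale across t.
    return (t_ * (pred - target) ** 2).mean()

# Hypothetical usage with a placeholder network (a real model conditions on t):
score_net = lambda x, t: -x
x0 = torch.randn(16, 80, 172)  # e.g. a batch of mel-spectrograms
print(dsm_loss(score_net, x0))
```

Because a fresh $t$ is drawn at every step, consecutive steps evaluate the integrand at different points, so the logged loss jumps around even while its expectation decreases.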

Combining these facts, we get exactly this diffusion loss behavior. Nonetheless, that doesn't mean the diffusion loss is unnecessary to optimize; on the contrary, it is crucial. Otherwise, the Grad-TTS diffusion decoder would produce nothing but noise. Finally, the diffusion loss does its job well if we check the energy function it corresponds to; see issue #9.
