About the prior loss and MAS algorithm #18

Open
cantabile-kwok opened this issue Jul 16, 2022 · 2 comments

Comments

@cantabile-kwok

cantabile-kwok commented Jul 16, 2022

Great work! I've been studying the paper and the code recently, and there's something that confuses me quite a bit.

In my understanding, the encoder outputs a Gaussian distribution with a different mu for each phoneme, and the DPM decoder recovers the mel-spectrogram y from these Gaussians, so y itself is no longer Gaussian. But I gather from Eq. (14) and the code that when calculating the prior loss, you are actually computing the log-likelihood of y under the Gaussian distribution with mean mu. Likewise, when applying MAS for duration modeling, you perform the same kind of likelihood computation to get the soft alignment (denoted log_prior in the code). So I wonder why this is reasonable. I also compared the code of Glow-TTS: it evaluates the Gaussian likelihood with mean mu on z, where z is the latent variable obtained from the mel-spectrogram by the normalizing flow. That seems more reasonable to me for now, since z is modeled as Gaussian itself.
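
For concreteness, here is a minimal sketch of the computation I am referring to (the names and shapes are illustrative, not taken from the repo, and I assume identity covariance):

```python
import math
import torch

# Minimal sketch: the per-(frame, phoneme) Gaussian log-likelihood that
# serves as the soft alignment matrix ("log_prior") for MAS.
# y:  mel-spectrogram frames,       shape [T_y, n_mel]
# mu: per-phoneme encoder outputs,  shape [T_x, n_mel]
def gaussian_log_prior(y: torch.Tensor, mu: torch.Tensor) -> torch.Tensor:
    n_mel = y.shape[-1]
    const = -0.5 * n_mel * math.log(2 * math.pi)
    # broadcast to [T_y, T_x, n_mel] and sum squared differences over mel bins
    sq = (y.unsqueeze(1) - mu.unsqueeze(0)) ** 2
    return const - 0.5 * sq.sum(-1)  # shape [T_y, T_x]

log_prior = gaussian_log_prior(torch.randn(100, 80), torch.randn(25, 80))
# MAS then searches for the monotonic path through this matrix with maximal total score.
```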

@cantabile-kwok
Author

cantabile-kwok commented Jul 26, 2022

I think this loss is useful for two reasons:

  1. It is necessary for MAS, since Grad-TTS uses the same likelihood as the soft alignment matrix for the MAS algorithm. My experiments show that if ground-truth durations are used, removing the prior loss does not decrease quality.
  2. It helps convergence, as it pushes mu to be close to y in the first place (see the short note below).
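
To spell out point 2 (assuming the identity covariance used in the code): for an aligned frame/phoneme pair, -log N(y_t; mu_t, I) = 0.5 * ||y_t - mu_t||^2 + (n/2) * log(2*pi), where n is the number of mel channels. So, up to a constant, the prior loss is just an L2 regression of the aligned mu onto y, which is what pulls mu toward y.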

@li1jkdaw

@cantabile-kwok
Thank you very much for this question! It is indeed a very subtle point.
Actually, we need the encoder output mu to have the following properties:

  1. mu should be a reasonable speech representation, since we condition the score matching network s_theta(x_t, mu, t) on it, so we want mu to carry the important information about the target speech (e.g. mu should be well aligned with the input text; this corresponds to point 1 in your previous comment).
  2. mu should be close to the target mel-spectrogram y, because the reverse diffusion starts generation from N(mu, I) (this is exactly point 2 in your previous comment).
    Note that this second property is not strictly necessary, but it is beneficial in terms of the number of reverse diffusion steps sufficient for good quality (see Table 1 in our paper).

So, in contrast with Glow-TTS, where the analogue of our encoder loss L_enc has a clear probabilistic interpretation (it is one of the terms in the log-likelihood optimized during training), in Grad-TTS the encoder should just output mu with the two properties mentioned above. You can consider the encoder output to be a Gaussian distribution (leading to a weighted L_2 loss between mu and y), or you can optimize any other distance between mu and y, and it may work just as well. This is one of the differences between Glow-TTS and Grad-TTS: in our model the choice of the encoder loss L_enc does not affect the diffusion loss L_diff (they are sort of "independent"), while in Glow-TTS there is a single NLL loss, with the analogue of our encoder loss being one of its terms with a clear probabilistic interpretation (i.e. the log of the prior).
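
To illustrate this "independence" concretely, here is a rough sketch (the names are hypothetical, not the actual code of this repository): the total training loss is just a sum of terms, so swapping the Gaussian-NLL encoder term for any other distance between mu and y leaves the diffusion term untouched.

```python
import math
import torch

# Rough sketch with hypothetical names (not the actual Grad-TTS code):
# y and mu_aligned are the target mel frames and the aligned encoder outputs,
# both of shape [T_y, n_mel].
def encoder_loss_gaussian_nll(y, mu_aligned):
    # negative log-likelihood under N(mu, I): a weighted L2 plus a constant
    return 0.5 * ((y - mu_aligned) ** 2 + math.log(2 * math.pi)).mean()

def encoder_loss_l1(y, mu_aligned):
    # any other distance between mu and y could be used here instead
    return (y - mu_aligned).abs().mean()

# The diffusion (score matching) loss is computed separately and does not
# depend on which encoder distance was chosen:
#   total_loss = encoder_loss(y, mu_aligned) + duration_loss + diffusion_loss
```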
