Confused about how to specify max mel_frames in the output spectrogram and training audio sample length in hparams.py #335

Open
jjoe1 opened this issue May 20, 2020 · 0 comments

jjoe1 commented May 20, 2020

First, thanks for this detailed implementation of the original Tacotron model and for the wiki.

I've been reading the wiki, this code, and the Tacotron paper (https://arxiv.org/pdf/1703.10135.pdf) for the last several days, but I'm still confused about something basic. As someone trying to learn text-to-speech models, I'm unclear about how a fixed-length spectrogram is generated for an input text during training.

  1. The maximum ground-truth clip length in the LJSpeech dataset is about 14 seconds. With a 12.5 ms frame shift, wouldn't that indirectly cap the output at 14 * (1/0.0125) = 14 * 80 = 1120 mel frames? (See the sketch after this list for the arithmetic.) Also, what is the maximum sentence length of the input text after padding? I assume all input sentences used during training and inference are padded to some max_len, is that correct?

  2. A related question, which may be a beginner one: after the encoder produces 256 hidden states (from the 256 bidirectional LSTM units), isn't the decoder output limited to 256 frames (for an output-layer reduction factor r = 1)? If I understand encoder-decoder models correctly and the decoder produces 1 frame per encoder state at r = 1, how can it produce more frames than there are encoder states? (See the second sketch below for my current mental model.)
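To make point 1 concrete, here is the back-of-the-envelope arithmetic as a runnable sketch. The sample_rate, frame_shift_ms, and clip length below are my assumptions about typical hparams.py-style settings, not necessarily this repo's defaults:

```python
# Rough frame-count arithmetic (assumed hparams-style values, not necessarily
# the values shipped in this repo's hparams.py).
sample_rate = 22050                 # Hz, typical for LJSpeech
frame_shift_ms = 12.5               # hop between successive mel frames
hop_length = int(sample_rate * frame_shift_ms / 1000)    # 275 samples per hop

clip_seconds = 14                   # roughly the longest LJSpeech clip
frames_per_second = 1000 / frame_shift_ms                 # 80 frames per second
n_mel_frames = int(clip_seconds * frames_per_second)      # 14 * 80 = 1120

print(hop_length, n_mel_frames)     # -> 275 1120
```

If there is a max_mel_frames-style cap in hparams.py, I'm guessing clips longer than that cap have to be dropped or truncated, but I'm not sure that's how it works here.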
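For point 2, here is a toy numpy sketch of the attention-based decoding loop as I understand it from the paper (my own illustration, not this repo's code). My confusion is exactly whether the loop below is allowed to run for more steps than there are encoder states:

```python
# Toy sketch of attention-based decoding: at every decoder step the model
# attends over ALL encoder states, so the number of decoder steps is governed
# by a stop condition / max-iterations limit, not by the encoder length.
# All sizes here are illustrative assumptions.
import numpy as np

T_enc, enc_dim = 40, 256            # e.g. 40 encoder time steps (one per input character)
encoder_outputs = np.random.randn(T_enc, enc_dim)

max_decoder_steps = 1000            # analogous to a max_iters / max_mel_frames cap
query = np.zeros(enc_dim)           # stand-in for the decoder's recurrent state
outputs = []

for step in range(max_decoder_steps):
    # Attention: score every encoder state against the current decoder state,
    # then take a weighted sum (the context vector).
    scores = encoder_outputs @ query            # shape (T_enc,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    context = weights @ encoder_outputs         # shape (enc_dim,)

    # A real decoder would run an RNN cell here and project to r mel frames;
    # this sketch just emits the context vector to stay short.
    outputs.append(context)
    query = context                              # stand-in for the state update

    if step == 120:                              # stand-in for a learned <stop> decision
        break

print(len(outputs), "decoded frames from", T_enc, "encoder states")  # 121 > 40
```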
