
Training steps #30

Open
yiwei0730 opened this issue Oct 27, 2023 · 0 comments
Comments

@yiwei0730

yiwei0730 commented Oct 27, 2023

We first train the audio codec using 8 NVIDIA TESLA V100 16GB GPUs with a batch size of 200 audios per GPU for 440K steps. We follow the implementation and experimental setting of SoundStream [19] and adopt Adam optimizer with 2e-4 learning rate. Then we use the trained codec to extract the quantized latent vectors for each audio to train the diffusion model in NaturalSpeech 2.
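For context, here is a minimal PyTorch-style sketch of the two-stage pipeline that paragraph describes (train the codec first, then freeze it and extract latents for the diffusion model). `ToyCodec` and its `encode` method are illustrative stand-ins, not the paper's or this repository's actual codec.

```python
import torch

class ToyCodec(torch.nn.Module):
    """Stand-in codec: a single conv encoder/decoder instead of SoundStream + RVQ."""
    def __init__(self):
        super().__init__()
        self.encoder = torch.nn.Conv1d(1, 8, kernel_size=4, stride=2)
        self.decoder = torch.nn.ConvTranspose1d(8, 1, kernel_size=4, stride=2)

    def encode(self, wav):          # wav: (batch, 1, samples)
        return self.encoder(wav)    # stands in for the quantized latent vectors

    def forward(self, wav):
        return self.decoder(self.encode(wav))

codec = ToyCodec()
opt = torch.optim.Adam(codec.parameters(), lr=2e-4)   # 2e-4 as quoted above

# Stage 1: one codec training step (reconstruction loss only, schematic)
wav = torch.randn(4, 1, 16000)
loss = torch.nn.functional.l1_loss(codec(wav), wav)
opt.zero_grad(); loss.backward(); opt.step()

# Stage 2: freeze the codec and extract latents to train the diffusion model on
codec.eval()
with torch.no_grad():
    latents = codec.encode(wav)    # these latents become the diffusion targets
```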

The diffusion model in NaturalSpeech 2 is trained using 16 NVIDIA TESLA V100 32GB GPUs with a batch size of 6K frames of latent vectors per GPU for 300K steps (our model is still underfitting and longer training will result in better performance). We optimize the models with the AdamW optimizer with 5e-4 learning rate, 32k warmup steps following the inverse square root learning schedule.
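And a minimal sketch, assuming PyTorch, of the linear-warmup plus inverse-square-root learning-rate schedule quoted above (5e-4 peak LR, 32k warmup steps); the model here is a placeholder, not this repository's actual training code.

```python
import torch

peak_lr = 5e-4
warmup_steps = 32_000

def inv_sqrt_lr(step: int) -> float:
    """Multiplier on peak_lr: linear warmup, then 1/sqrt(step) decay after warmup."""
    step = max(step, 1)
    if step < warmup_steps:
        return step / warmup_steps
    return (warmup_steps / step) ** 0.5

model = torch.nn.Linear(16, 16)  # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=peak_lr)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=inv_sqrt_lr)

# inside the training loop: optimizer.step(); scheduler.step()
```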

According to the original paper, it seems the audio codec and the diffusion model are trained separately.
I'd like to ask whether you have tried training these two parts separately. I noticed that in the NS2-ttsv2 training code, everything related to the codec appears to be commented out. Is that because the codec's results were unsatisfactory?
