Describe the bug
[Bug]
Hello,
In XTTSv2, the dataloader loads the condition indices (condition start and end index) in audio-sample units; the audio itself is later compressed by a factor of 256 through mel-spectrogram extraction. These condition start and end indices are then used to mask the ground-truth audio codes. Since the audio codes are compressed 1024x relative to the raw samples, the indices should also be divided by 1024; instead they are divided by perceiver_cond_length_compression, which is set to 256 by the default GPT args. As a result, the condition indices in the DVAE code domain point 4x further than they should, so the wrong part of the target audio tokens gets masked. I couldn't find any place where they are correctly set to 1024, and I don't understand how training and fine-tuning can work with this bug.
I'd appreciate it if anyone could shed light on this topic.
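To make the scaling mismatch concrete, here is a minimal sketch of the arithmetic described above. The constant names (`MEL_HOP_LENGTH`, `DVAE_COMPRESSION`) and the helper function are illustrative assumptions, not identifiers from the XTTS codebase; only the factors 256 and 1024 come from the issue.

```python
# Illustrative constants (assumed names, values taken from the issue text):
MEL_HOP_LENGTH = 256      # mel-spectrogram extraction: 256 audio samples per frame
DVAE_COMPRESSION = 1024   # DVAE codes: 1024 audio samples per audio code token

def cond_idx_in_code_domain(sample_idx: int, divisor: int) -> int:
    """Map a condition index given in raw audio samples into a token index."""
    return sample_idx // divisor

# Hypothetical conditioning-window start, in raw audio samples
# (49152 samples is roughly 3 s of 16 kHz audio).
cond_start_samples = 49_152

# What the masking logic should use (audio codes are compressed 1024x):
correct = cond_idx_in_code_domain(cond_start_samples, DVAE_COMPRESSION)  # 48

# What it actually uses (perceiver_cond_length_compression defaults to 256):
actual = cond_idx_in_code_domain(cond_start_samples, MEL_HOP_LENGTH)     # 192

# The resulting index lands 4x too far into the code sequence, so the
# mask covers the wrong region of the target audio tokens.
print(correct, actual, actual // correct)  # 48 192 4
```

Under these assumptions, any start index divided by 256 instead of 1024 overshoots its intended position in the code sequence by a factor of 1024/256 = 4.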
Error lines:
- TTS/TTS/tts/layers/xtts/gpt.py, line 110 in dbf1a08
- TTS/TTS/tts/layers/xtts/gpt.py, line 414 in dbf1a08
To Reproduce
Expected behavior
...
Logs
Environment
Additional context
No response