Does training work on the v4 branch? #33

Open
lpscr opened this issue Dec 15, 2023 · 17 comments
@lpscr

lpscr commented Dec 15, 2023

Hi! Thank you very much for your work and this amazing repo.

I'm trying to train the v4 branch and something is going very wrong: after about 3 hours of training nothing changes, I get only noise at every step. These are the steps I use:

1. python preprocess.py
2. python model1.py

[image: output at 29000 steps, v4 branch]

On v3 or the main branch, after some steps I get this:

[image: output at 5000 steps, v3 or main branch]

As you can see, in v4 I get only noise. Am I doing something wrong?

Can you please tell me whether training works in v4, or what I am doing wrong?

Thank you for your time.

@adelacvg
Owner

You haven't done anything wrong. Because the v4 model has over 200 million parameters, training is very slow. I am currently experimenting with features such as offset noise, normalization, and CFG to make training more stable. Your results look quite normal; theoretically, the convergence time of the v4 model is close to that of SD 1.5. The previous three versions used smaller noise and predicted x0, resulting in faster training, whereas v4 uses the classic approach of predicting the noise as the target.
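
For anyone reading along, the difference between the two objectives can be sketched in a few lines of PyTorch. This is a generic DDPM-style illustration of x0-prediction vs. noise-prediction (with optional offset noise), not the repo's actual training code; `model` and `alphas_cumprod` are placeholders:

```python
import torch
import torch.nn.functional as F

def diffusion_loss(model, x0, alphas_cumprod, predict_noise=True, offset_noise=0.0):
    """Generic DDPM training step: sample t, noise x0, regress the target.

    `model(x_t, t)` and `alphas_cumprod` are placeholders, not the repo's names.
    """
    b = x0.shape[0]
    broadcast = [1] * (x0.dim() - 1)
    t = torch.randint(0, len(alphas_cumprod), (b,), device=x0.device)
    noise = torch.randn_like(x0)
    if offset_noise > 0:
        # Offset noise: add a per-sample constant shift so the model can also
        # learn very low-frequency content, which plain noise under-covers.
        noise = noise + offset_noise * torch.randn(b, *broadcast, device=x0.device)
    a = alphas_cumprod[t].view(b, *broadcast)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * noise  # forward process q(x_t | x_0)
    pred = model(x_t, t)
    # v1-v3 regress x0 directly (fast to train at small noise levels);
    # v4 uses the classic objective of regressing the added noise.
    target = noise if predict_noise else x0
    return F.mse_loss(pred, target)
```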

@lpscr
Author

lpscr commented Dec 16, 2023

This is so cool! I understand now. I'm going to retrain and see.
Thank you very much for the explanation and the quick reply.

@rishikksh20

@lpscr were you able to get the model to converge?

@rishikksh20

@adelacvg I see you updated the model architecture on v4. Is the implementation complete, and does the new model converge faster?
I have collected a lot of audio data and am now waiting for GPU availability to start training.

@adelacvg
Owner

adelacvg commented Jan 9, 2024

Yes, the previous training process was slow to converge due to issues with the UNet. Additionally, there were semantic problems caused by a bug in the diffusion training architecture from ControlNet. The current diffusion training framework is now based on Tortoise, eliminating the semantic faults. Furthermore, the architecture uses transformer blocks without up/down-sampling, leading to much faster convergence.
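
As a rough picture of what "transformer blocks without up/down-sampling" can look like: a time-conditioned stack of plain transformer layers that keeps the sequence at full resolution throughout. This is a generic sketch, not the actual v4 module; every name here is a placeholder:

```python
import torch
import torch.nn as nn

class TransformerDenoiser(nn.Module):
    """Sketch of a diffusion denoiser built from plain transformer blocks.

    Unlike a UNet, the sequence stays at full resolution throughout,
    so nothing is lost to down/up-sampling.
    """
    def __init__(self, dim=512, depth=8, heads=8):
        super().__init__()
        self.time_mlp = nn.Sequential(nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)
        self.out = nn.Linear(dim, dim)

    def forward(self, x, t):
        # x: (batch, seq_len, dim) noisy latents; t: (batch,) diffusion timesteps
        temb = self.time_mlp(t.float().unsqueeze(-1)).unsqueeze(1)  # (batch, 1, dim)
        return self.out(self.blocks(x + temb))  # condition by adding time embedding
```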

@rishikksh20

Thanks :)
Are you using HuBERT only for the content vector?
My use case is a non-English language, so I thought I'd use Whisper layer-24 features rather than HuBERT.
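
In case it's useful to others: pulling an intermediate encoder layer out of Whisper is straightforward with Hugging Face `transformers`. A minimal sketch, assuming the large-v2 checkpoint and the layer index mentioned above; adjust both for your setup:

```python
import torch
from transformers import WhisperModel, WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-large-v2")
model = WhisperModel.from_pretrained("openai/whisper-large-v2").eval()

def whisper_layer_features(audio, layer=24, sr=16000):
    """Return hidden states of one encoder layer for a mono waveform."""
    inputs = processor(audio, sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        enc = model.get_encoder()(inputs.input_features, output_hidden_states=True)
    # hidden_states[0] is the embedding output; [layer] is after that layer.
    return enc.hidden_states[layer]  # (batch, frames, hidden_dim)
```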

@adelacvg
Owner

Regarding ContentVec, I chose it primarily to prevent timbre leakage. HuBERT and Whisper have noticeable timbre-leakage issues when trained with self-supervision. I have trained a model, and although there is some loss in audio quality in zero-shot scenarios, it performs better than the previous model at the same data scale.

@rishikksh20

rishikksh20 commented Feb 28, 2024

Hi @adelacvg, is it possible to also transfer a bit of prosody and style with the NS2VC architecture, not just voice?
For simple voice conversion it is working well; the voice doesn't match exactly, but it's still fine.

@adelacvg
Owner

Certainly, but I believe that prosody and speed are better suited for GPT or an acoustic model. The diffusion part, working as a good decoder, should suffice.

@rishikksh20

Just one more question: do semantic tokens like HuBERT, wav2vec, and ContentVec carry prosody information?

@adelacvg
Owner

Of course, prosody encompasses fundamental frequency, pause duration, intonation, and other essential information. Semantic tokens inherently carry duration information and intonation.

@rishikksh20

Yes, I have the same intuition because pronunciation is an integral part of linguistics.

@rishikksh20

Hi @adelacvg, have you checked out YODAS (https://huggingface.co/datasets/espnet/yodas), a 370k-hour dataset? The data quality is uneven, some samples contain music or are empty, but it's still good data for VC pretraining.
If you are not GPU-poor 😢 you could pretrain on YODAS 😅.

@adelacvg
Owner

adelacvg commented Mar 1, 2024

@rishikksh20 Thank you very much for the suggestion. However, I'm currently short on GPU resources, and all GPUs are being used for experiments with the GPT-based AR TTS model. A pretrained model may be trained once GPUs become available.

@rishikksh20

@adelacvg Everyone is GPU-poor; I am also waiting for my GPUs to free up. By the way, how is TTTS training progressing? Do you have any samples to share?
I have tested HierSpeech++'s non-autoregressive text-to-vector module together with NS2VC, which acts as an end-to-end TTS, and it performs well. The GPT-based text-to-vector approach I tested before shows a lot of hallucination.

@adelacvg
Owner

adelacvg commented Mar 1, 2024

@rishikksh20 The model in the master branch of TTTS is based on Tortoise, and the results are comparable to Tortoise. I have provided a Colab link for testing the pre-trained model. For the v2 version, I would like to use a training method similar to VALL-E's, while still using diffusion as the decoder, in the hope of achieving better zero-shot results.

@rishikksh20

For v4 I am planning to train on EnCodec features for better speaker generalization, as commented in #16 (comment).
Has anyone tried this before, or does anyone have any thoughts or heads-ups?
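
For anyone attempting the same, extracting EnCodec features with Meta's `encodec` package looks roughly like this. The file path is a placeholder, and whether a model like v4 would consume the discrete codes or the continuous encoder latents is an open design choice:

```python
import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio

# 24 kHz EnCodec model; the target bandwidth controls how many codebooks are used.
model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(6.0)

wav, sr = torchaudio.load("sample.wav")  # placeholder path
wav = convert_audio(wav, sr, model.sample_rate, model.channels).unsqueeze(0)

with torch.no_grad():
    # Discrete RVQ codes: (batch, n_codebooks, frames)
    frames = model.encode(wav)
    codes = torch.cat([code for code, _ in frames], dim=-1)
    # Continuous pre-quantization latents: (batch, 128, frames)
    latents = model.encoder(wav)
```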
