Training bug for 13b, 30b, and 65b #285
Comments
My finetuning arguments:
Same here. Could you share how you eventually solved this? Thanks
Are you able to train with batch size 4 as in the readme?
Haven't solved it yet, but switching from the Hugging Face Trainer to PyTorch Lightning might solve the issue; if I can get it to work I'll post a link to a repo with everything set up. I also switched to a different machine with V100s instead of A100s, and 13b trains there. It could also be a version difference, because I can use Docker containers on the V100 machine but only venvs on the A100 machine (the admins are stingy about root access). And yes, I'm able to train with a batch size of 4, but that makes no difference.
This looks like a related issue: huggingface/transformers#14531. I turned off bf16 and that fixed my problem with 13b and 30b. Without bf16 I can't fit 65b onto my GPUs, so I haven't tested that one yet. Any idea why bf16 causes this? I think it's preventing the optimizer from stepping, but I have no idea why.
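One plausible mechanism (an assumption, not confirmed in this thread): bf16 keeps only 8 mantissa bits, so any weight update smaller than a weight's bf16 ulp is rounded away entirely. The sketch below simulates bf16 in pure Python by truncating a float32 to its top 16 bits (real conversion rounds to nearest, but the effect on a too-small update is the same):

```python
import struct

def to_bf16(x: float) -> float:
    # Simulate bfloat16 by keeping only the top 16 bits of the
    # float32 representation (sign, 8 exponent bits, 7 mantissa bits).
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]

w = 1.0
update = 1e-3  # smaller than bf16's ~2**-8 relative precision near 1.0
print(to_bf16(w + update))  # → 1.0, the update vanishes entirely
```

If the optimizer state or master weights are held in bf16 rather than fp32, updates like this can silently disappear, which is consistent with training that appears to stop stepping.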
Has anyone been able to finetune any of the models larger than 7b successfully? I'm training on 8 A100s with 80 GB of memory each, which is more than enough space.
The problem I'm running into is that the first loss is massive (~1e5) and every loss after the first step is 0. I'm not sure what causes this or how to fix it, since the 7b model trains fine. I'm launching training with DeepSpeed.
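That failure pattern (one huge loss, then exact zeros) can be flagged automatically instead of eyeballed in the logs. `looks_like_dead_optimizer` below is a hypothetical helper, not part of any library, sketching the heuristic:

```python
def looks_like_dead_optimizer(losses, spike=1e4):
    """Heuristic for the pattern above: a huge first loss followed by
    exact zeros usually means the optimizer overflowed on step one and
    stopped updating, rather than the model actually converging."""
    return (
        len(losses) >= 2
        and losses[0] > spike
        and all(l == 0.0 for l in losses[1:])
    )

print(looks_like_dead_optimizer([1.2e5, 0.0, 0.0, 0.0]))  # → True
print(looks_like_dead_optimizer([2.3, 2.1, 1.9]))         # → False
```

A check like this could be run over the logged loss history after the first few steps to fail fast instead of burning GPU hours.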
Here's an example of the output when training the 65b model.