DPA-2 does not support multi-card invocation. #3691
Comments
@wangyi01 To help us locate your problem, please provide the following information if possible:
DPA-2 indeed supports multi-card operation; see here. When doing multi-GPU training, you should launch dp through the torchrun command.
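A representative invocation, assuming a single node with one training process per GPU (adjust `--nproc_per_node` to your GPU count and replace `input.json` with your own input file):

```
# Launch one dp training process per GPU on a single node.
# --no_python is needed because dp is a console entry point, not a .py file.
torchrun --no_python --nnodes=1 --nproc_per_node=4 dp --pt train input.json
```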
However, without the additional information we requested, it will be difficult for us to identify the specific issue you're facing. Please share the details, and we'll be happy to assist you further.
I did use torchrun for multi-GPU invocation, but the `-m` argument I passed conflicts with the `-m` argument of `dp --pt`. Specifically, I ran: `torchrun --no_python --nproc_per_node=1 --nnode=4 dp --pt train input.json --finetune ./pretrained_model.pt -m Domains_OC2M --skip-neighbor-stat` and got the following error: `error: argument -m/--mpi-log: invalid choice: 'Domains_OC2M' (choose from 'master', 'collect', 'workers')`.
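What the error suggests, as a sketch: torchrun stops consuming options at `dp`, so everything after it is parsed by dp's own argument parser, where `-m` is the short form of `--mpi-log` and only accepts master, collect, or workers. The branch name therefore has to be passed through whatever long-form flag the installed deepmd-kit version defines for branch selection; listing the parser's options avoids guessing:

```
# Show the flags that "dp --pt train" actually accepts in this installation;
# -m is already taken by --mpi-log, so the branch-selection flag must be
# given by its long-form name (its exact spelling can differ across versions).
dp --pt train --help
```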
See discussion #3689.
Summary
When fine-tuning the first-step model of DPA-2, I keep encountering out-of-memory errors. Even reducing the batch size and switching to GPUs with larger memory doesn't seem to help.
Details
A single card does not have enough memory, so I tried to run on multiple cards. I have four GPUs with 16 GB each, but in practice only one card was being invoked, leading to insufficient memory: the error reports that the memory of the first GPU is insufficient, which suggests only the first GPU is being utilized. So why doesn't DPA-2 support multi-card operation?
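For reference, if the four 16 GB cards sit in one machine, a launch that uses all of them might look like the sketch below (paths are the ones from the command earlier in the thread; the branch flag is left out, see the note about `-m` above):

```
# Single node, four GPUs: torchrun spawns one dp training process per GPU.
# Running "dp --pt train" directly starts a single process and therefore
# fills up only the first card, matching the observed out-of-memory error.
torchrun --no_python --nnodes=1 --nproc_per_node=4 \
    dp --pt train input.json --finetune ./pretrained_model.pt --skip-neighbor-stat
```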