GPU-Util is low When use multi-GPUs #26

Open
LTlitong opened this issue May 13, 2020 · 3 comments

@LTlitong

Hello,

I want to train on multiple GPUs and have tried 8, 4, and 2 GPUs. However, GPU-Util on some of the GPUs is low, almost 0%. An epoch on 8 GPUs takes almost 20 minutes longer than on a single GPU.

Your code sets the default number of GPUs to 4. But when I use 4 cards, one card's GPU-Util is always 0%. With 2 cards, neither card sits at 0%, but the GPU-Util of one of them is still only 20%.
This is the GPU usage when training on 4 cards:
[image: gpu-util]

I am not very clear about shard. Do I need to modify the code to train on multiple GPUs and speed up training?

Looking forward to your reply!

@ehsk
Collaborator

ehsk commented May 14, 2020

You mentioned you ran the code with 1 or 2 GPUs. Did you have this problem in those runs too? I suggest turning on log_device in the config file and comparing the single-GPU run with the 4/8-GPU runs.

I haven't had this problem before, although GPU-util was around 50-60% for all GPUs.
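
For comparing runs, per-GPU utilization can also be sampled outside the training code. Below is a minimal sketch, assuming `nvidia-smi` is on the PATH; the helper names (`parse_gpu_util`, `idle_gpus`, `query_gpu_util`) and the 5% idle threshold are hypothetical and not part of this repository.

```python
import subprocess


def parse_gpu_util(csv_text):
    """Parse 'index, utilization' CSV lines into {gpu_index: percent}."""
    util = {}
    for line in csv_text.strip().splitlines():
        fields = [field.strip() for field in line.split(",")]
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        try:
            util[int(fields[0])] = int(fields[1])
        except ValueError:
            continue  # skip non-numeric values such as "[N/A]"
    return util


def idle_gpus(util, threshold=5):
    """Return sorted indices of GPUs whose utilization is below threshold (%).

    The default threshold is an arbitrary illustrative choice.
    """
    return sorted(index for index, percent in util.items() if percent < threshold)


def query_gpu_util():
    """Sample current utilization of every GPU through nvidia-smi."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=index,utilization.gpu",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_gpu_util(result.stdout)
```

Calling `idle_gpus(query_gpu_util())` a few times during an epoch gives a rough list of cards that are doing no work, which can then be checked against the device mapping printed by log_device.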

@LTlitong
Author

Thanks for your reply!

  1. GPU-Util was 70-80% when running with 1 GPU, and 50% and 20% respectively when running with 2 GPUs. But there is always one GPU whose GPU-Util stays at 0% the whole time. I turned on log_device to get the device mapping and have sent it to you by email.

  2. I would also like to ask whether the experiment results in your paper are averaged over the 3 datasets (3/4/5-turn Reddit). I ran all epochs, but my results differ from those in the paper. Could you please provide your results on each dataset?

@ehsk
Collaborator

ehsk commented May 21, 2020

Sorry for the late reply.

  1. Have you set CUDA_VISIBLE_DEVICES? Based on the log you sent, no tensor was assigned to one of the GPUs.

  2. All the results in the paper are reported based on the 3-turn dataset.
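
On the CUDA_VISIBLE_DEVICES question in point 1, a quick check can be run before training starts. This is only a sketch under stated assumptions: `visible_devices`, `enough_devices`, and the `num_gpus` parameter are hypothetical names, not part of this repository's config. It relies on the standard CUDA behavior that an unset variable exposes all GPUs, while a set variable exposes only the listed ones.

```python
import os


def visible_devices(env=None):
    """Return the device entries listed in CUDA_VISIBLE_DEVICES.

    Returns None when the variable is unset (CUDA then exposes every GPU),
    and an empty list when it is set but empty (no GPU is exposed).
    """
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        return None
    return [token.strip() for token in raw.split(",") if token.strip()]


def enough_devices(num_gpus, env=None):
    """Check that the environment exposes at least num_gpus devices.

    num_gpus is a hypothetical stand-in for the GPU count set in the config.
    When the variable is unset this returns True, because the number of
    physical GPUs cannot be determined from the environment alone.
    """
    devices = visible_devices(env)
    if devices is None:
        return True
    return len(devices) >= num_gpus


if __name__ == "__main__":
    print("CUDA_VISIBLE_DEVICES entries:", visible_devices())
    print("Enough devices for 4 GPUs:", enough_devices(4))
```

If the variable lists fewer devices than the configured GPU count, the mismatch is one possible reason for a card receiving no tensors. The variable has to be set before the framework initializes CUDA for it to take effect.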
