
multi-gpu training #6

Open
slinghe0321 opened this issue Mar 22, 2021 · 5 comments
@slinghe0321

Hi, thanks for your great work!
I trained the GroundAwareYolo3D model and got the results below:
Car AP(Average Precision)@0.70, 0.70, 0.70
bbox AP: 97.29, 84.55, 64.65
bev AP: 29.53, 20.15, 15.53
3d AP: 22.90, 15.26, 11.33
aos AP: 96.52, 82.52, 63.05

This seems comparable to the paper's reported result (23.63 / 16.16 / 12.06) for Car AP@0.70 on the validation set.

However, when training with multiple GPUs (e.g. 4 GPUs), I get noticeably worse results:
Car AP(Average Precision)@0.70, 0.70, 0.70
bbox AP: 97.08, 86.41, 66.67
bev AP: 20.56, 15.16, 11.22
3d AP: 15.17, 10.81, 8.22
aos AP: 95.50, 83.36, 64.24

Training commands:
bash ./launchers/train.sh config/$CONFIG_FILE.py 0,1,2,3 multi-gpu-train
bash ./launchers/train.sh config/$CONFIG_FILE.py 0 single-gpu-train

I trained twice with multi-GPU, and both runs gave similar results, lower than the single-GPU run. Do you have any suggestions about this case? What performance do you get with multi-GPU training?

@Owen-Liuyuxuan
Owner

I have also noticed this; I consider it a bug.

I suspect the problem is that multi-GPU training changes the relative weights between batches: batches on different GPUs are simply averaged, while batches on the same GPU are weighted according to num_gt (and some batches are skipped).

I have not debugged this yet, because I am not very familiar with the multi-GPU training APIs.
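The mismatch can be illustrated without any GPU: if each rank normalizes its weighted loss by its own local weight sum and the distributed backend then averages the ranks uniformly, the result differs from the single-GPU global weighted average whenever the per-rank weight sums differ. A minimal single-process sketch with hypothetical per-rank numbers (not the repo's actual values):

```python
# Simulate 2 "ranks", each with per-sample losses and num_gt-based weights.
rank_losses  = [[1.0, 2.0], [4.0]]
rank_weights = [[3.0, 1.0], [0.5]]

# Single-GPU reference: one global weighted average over all samples.
all_l = [l for ls in rank_losses for l in ls]
all_w = [w for ws in rank_weights for w in ws]
single_gpu = sum(w * l for w, l in zip(all_w, all_l)) / sum(all_w)

# Multi-GPU behaviour: each rank normalizes by its *local* weight sum,
# then the ranks are averaged uniformly (as DDP gradient averaging does).
per_rank = [
    sum(w * l for w, l in zip(ws, ls)) / sum(ws)
    for ws, ls in zip(rank_weights, rank_losses)
]
multi_gpu = sum(per_rank) / len(per_rank)

print(single_gpu, multi_gpu)  # the two values differ
```

A rank with few ground-truth boxes (small local weight sum) gets its samples amplified relative to single-GPU training, which is consistent with the degraded multi-GPU results reported above.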

@Owen-Liuyuxuan
Owner

I changed

weighted_regression_losses = torch.sum(weights * reg_loss / (torch.sum(weights) + 1e-6), dim=0)

into

weight_sum = torch.sum(weights)
if torch.distributed.is_initialized():
    # Scale the loss by the world size and normalize by the globally
    # reduced weight sum, so that after DDP averages gradients across
    # ranks, the effective per-sample weighting matches single-GPU training.
    N = torch.distributed.get_world_size()
    torch.distributed.all_reduce(weight_sum)
    reg_loss = reg_loss * N
weighted_regression_losses = torch.sum(weights * reg_loss / (weight_sum + 1e-6), dim=0)

and halved the per-GPU batch size. Empirically, the gap gets smaller, but it still exists.
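The intuition behind this fix can be checked in a single process: multiplying each rank's loss by the world size N and dividing by the globally reduced weight sum makes the uniform average over ranks equal the global weighted average. A small sketch with hypothetical per-rank numbers, simulating all_reduce with a plain sum:

```python
rank_losses  = [[1.0, 2.0], [4.0]]
rank_weights = [[3.0, 1.0], [0.5]]
N = len(rank_losses)  # world size

# "all_reduce" of the weight sum: every rank sees the global total.
global_weight_sum = sum(w for ws in rank_weights for w in ws)

# Each rank scales its loss by N and normalizes by the global weight sum.
per_rank = [
    sum(w * l * N for w, l in zip(ws, ls)) / global_weight_sum
    for ws, ls in zip(rank_weights, rank_losses)
]
multi_gpu_fixed = sum(per_rank) / N  # DDP-style uniform average over ranks

# Single-GPU reference: one global weighted average over all samples.
single_gpu = sum(w * l for ws, ls in zip(rank_weights, rank_losses)
                 for w, l in zip(ws, ls)) / global_weight_sum

print(multi_gpu_fixed, single_gpu)  # equal
```

The factor N cancels the 1/N from uniform rank averaging, so the two quantities coincide exactly in this sketch; the residual gap reported above likely comes from the skipped batches and other training dynamics rather than from this normalization.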

@cnexah

cnexah commented May 17, 2021

Does multi-GPU training also affect the training of mono_depth?

@Owen-Liuyuxuan
Owner

Does multi-GPU training also affect the training of mono_depth?

In my tests, depth prediction works fine with multi-GPU training.

@Owen-Liuyuxuan
Owner

Owen-Liuyuxuan commented Jul 19, 2022

In the new update, using the distributed sampler from detectron2, we are able to train with multiple GPUs and obtain reasonable performance.

Without tuning the learning rate or batch size, the results are as follows:

Car AP(Average Precision)@0.70, 0.70, 0.70:
bbox AP: 97.24, 86.90, 67.03
bev AP: 29.68, 20.48, 15.73
3d AP: 21.56, 15.00, 11.16
aos AP: 96.23, 84.25, 64.92
Car AP(Average Precision)@0.70, 0.50, 0.50:
bbox AP: 97.24, 86.90, 67.03
bev AP: 65.20, 46.35, 35.98
3d AP: 58.84, 41.06, 32.49
aos AP: 96.23, 84.25, 64.92
