
multi-gpu training #6

Open
slinghe0321 opened this issue Mar 22, 2021 · 5 comments
@slinghe0321

Hi, thanks for your great work!
I trained the GroundAwareYolo3D model and got the results below:
Car AP(Average Precision)@0.70, 0.70, 0.70
bbox AP: 97.29, 84.55, 64.65
bev AP: 29.53, 20.15, 15.53
3d AP: 22.90, 15.26, 11.33
aos AP: 96.52, 82.52, 63.05

This seems comparable to the paper's reported result (23.63 / 16.16 / 12.06) for Car AP@0.70 on the validation set.

However, when training with multiple GPUs (e.g. 4 GPUs), I get noticeably worse results:
Car AP(Average Precision)@0.70, 0.70, 0.70
bbox AP: 97.08, 86.41, 66.67
bev AP: 20.56, 15.16, 11.22
3d AP: 15.17, 10.81, 8.22
aos AP: 95.50, 83.36, 64.24

Training commands:
bash ./launchers/train.sh config/$CONFIG_FILE.py 0,1,2,3 multi-gpu-train
bash ./launchers/train.sh config/$CONFIG_FILE.py 0 single-gpu-train

I trained twice with multi-GPU, and both runs gave similar results, lower than the single-GPU run. Do you have any suggestions about this case? What performance do you get with multi-GPU training?

@Owen-Liuyuxuan
Owner

I have also noticed this; I consider it a bug.

I suspect the problem is that multi-GPU training changes the relative weights between batches: batches on different GPUs are simply averaged, while batches on the same GPU are weighted according to num_gt (and some batches are skipped).

I have not debugged this yet, because I am not very familiar with the multi-GPU training APIs.
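The mismatch can be illustrated without any GPU: if each rank normalizes its weighted loss by its own local weight sum and the distributed backend then averages the ranks uniformly, the result differs from the single-GPU global weighted average whenever the per-rank weight sums differ. A minimal single-process sketch with hypothetical per-rank numbers (not the repo's actual values):

```python
# Simulate 2 "ranks", each with per-sample losses and num_gt-based weights.
rank_losses  = [[1.0, 2.0], [4.0]]
rank_weights = [[3.0, 1.0], [0.5]]

# Single-GPU reference: one global weighted average over all samples.
all_l = [l for ls in rank_losses for l in ls]
all_w = [w for ws in rank_weights for w in ws]
single_gpu = sum(w * l for w, l in zip(all_w, all_l)) / sum(all_w)

# Multi-GPU behaviour: each rank normalizes by its *local* weight sum,
# then the ranks are averaged uniformly (as DDP gradient averaging does).
per_rank = [
    sum(w * l for w, l in zip(ws, ls)) / sum(ws)
    for ws, ls in zip(rank_weights, rank_losses)
]
multi_gpu = sum(per_rank) / len(per_rank)

print(single_gpu, multi_gpu)  # the two values differ
```

A rank with few ground-truth boxes (small local weight sum) gets its samples amplified relative to single-GPU training, which is consistent with the degraded multi-GPU results reported above.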

@Owen-Liuyuxuan
Owner

I changed

weighted_regression_losses = torch.sum(weights * reg_loss / (torch.sum(weights) + 1e-6), dim=0)

into

weight_sum = torch.sum(weights)
if torch.distributed.is_initialized():
    # Scale the loss by the world size and normalize by the globally
    # reduced weight sum, so that after DDP averages gradients across
    # ranks, the effective per-sample weighting matches single-GPU training.
    N = torch.distributed.get_world_size()
    torch.distributed.all_reduce(weight_sum)
    reg_loss = reg_loss * N
weighted_regression_losses = torch.sum(weights * reg_loss / (weight_sum + 1e-6), dim=0)

and halved the per-GPU batch size. Empirically, the gap gets smaller, but it still exists.
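The intuition behind this fix can be checked in a single process: multiplying each rank's loss by the world size N and dividing by the globally reduced weight sum makes the uniform average over ranks equal the global weighted average. A small sketch with hypothetical per-rank numbers, simulating all_reduce with a plain sum:

```python
rank_losses  = [[1.0, 2.0], [4.0]]
rank_weights = [[3.0, 1.0], [0.5]]
N = len(rank_losses)  # world size

# "all_reduce" of the weight sum: every rank sees the global total.
global_weight_sum = sum(w for ws in rank_weights for w in ws)

# Each rank scales its loss by N and normalizes by the global weight sum.
per_rank = [
    sum(w * l * N for w, l in zip(ws, ls)) / global_weight_sum
    for ws, ls in zip(rank_weights, rank_losses)
]
multi_gpu_fixed = sum(per_rank) / N  # DDP-style uniform average over ranks

# Single-GPU reference: one global weighted average over all samples.
single_gpu = sum(w * l for ws, ls in zip(rank_weights, rank_losses)
                 for w, l in zip(ws, ls)) / global_weight_sum

print(multi_gpu_fixed, single_gpu)  # equal
```

The factor N cancels the 1/N from uniform rank averaging, so the two quantities coincide exactly in this sketch; the residual gap reported above likely comes from the skipped batches and other training dynamics rather than from this normalization.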

@cnexah

cnexah commented May 17, 2021

Does multi-GPU training also affect the training of mono_depth?

@Owen-Liuyuxuan
Owner

Does multi-GPU training also affect the training of mono_depth?

In my tests, depth prediction works fine with multi-GPU training.

@Owen-Liuyuxuan
Owner

Owen-Liuyuxuan commented Jul 19, 2022

In the new update, using the distributed sampler from detectron2, we are able to train with multiple GPUs and obtain reasonable performance.

Without tuning the learning rate or batch size, the results are as follows:

Car AP(Average Precision)@0.70, 0.70, 0.70:
bbox AP: 97.24, 86.90, 67.03
bev AP: 29.68, 20.48, 15.73
3d AP: 21.56, 15.00, 11.16
aos AP: 96.23, 84.25, 64.92
Car AP(Average Precision)@0.70, 0.50, 0.50:
bbox AP: 97.24, 86.90, 67.03
bev AP: 65.20, 46.35, 35.98
3d AP: 58.84, 41.06, 32.49
aos AP: 96.23, 84.25, 64.92
