torch.multinomial on GPU #144

Open
wants to merge 1 commit into master

Conversation

rubencart

Results from training an FC model with self-critical RL on a single GPU with batch_size 32.
Output from the training script, before the change:

iter 50 (epoch 0), avg_reward = 0.001, time/batch = 1.137
iter 100 (epoch 0), avg_reward = 0.002, time/batch = 1.145
iter 150 (epoch 0), avg_reward = 0.001, time/batch = 1.143
iter 200 (epoch 0), avg_reward = 0.002, time/batch = 1.130
iter 250 (epoch 0), avg_reward = 0.000, time/batch = 1.143
iter 300 (epoch 0), avg_reward = 0.000, time/batch = 1.123
iter 350 (epoch 0), avg_reward = 0.001, time/batch = 1.123
iter 400 (epoch 0), avg_reward = -0.000, time/batch = 1.131
iter 450 (epoch 0), avg_reward = 0.001, time/batch = 1.115
iter 500 (epoch 0), avg_reward = 0.000, time/batch = 1.155
total time: 589.1319808959961

And cProfile output:

         173950435 function calls (173760536 primitive calls) in 591.966 seconds
   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     8500  315.550    0.037  315.550    0.037 {built-in method multinomial}
   960610   90.125    0.000  117.889    0.000 cider/pyciderevalcap/ciderD/ciderD_scorer.py:128(counts2vec)
   800610   35.399    0.000   44.638    0.000 cider/pyciderevalcap/ciderD/ciderD_scorer.py:154(sim)
   960610   30.668    0.000   31.411    0.000 cider/pyciderevalcap/ciderD/ciderD_scorer.py:17(precook)
 37222323   11.875    0.000   11.875    0.000 {built-in method builtins.pow}
      511   11.348    0.022   11.348    0.022 {method 'item' of 'torch._C._TensorBase' objects}
     9500   11.165    0.001   11.165    0.001 {method 'cpu' of 'torch._C._TensorBase' objects}
     1000    9.129    0.009  344.628    0.345 /export/home1/NoCsBack/hci/rubenc/selfcritical/models/FCModel.py:150(_sample)
      500    9.106    0.018    9.106    0.018 {method 'run_backward' of 'torch._C._EngineBase' objects}
 32826245    6.941    0.000    6.941    0.000 {built-in method builtins.min}
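
For reference, a profile like the one above can be produced with the standard cProfile and pstats modules, sorted by internal time ("tottime"). The exact invocation is not part of this PR, so the entry point name below is only an assumption:

import cProfile
import pstats

# Hypothetical profiling harness: 'train(opt)' stands in for the actual training
# entry point, which is not shown in this PR.
cProfile.run('train(opt)', 'train.prof')
stats = pstats.Stats('train.prof')
stats.sort_stats('tottime').print_stats(10)   # 'tottime' is the "internal time" column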

After the change, with the exact same options and the same number of iterations:

iter 50 (epoch 0), avg_reward = 0.000, time/batch = 0.519
iter 100 (epoch 0), avg_reward = 0.000, time/batch = 0.523
iter 150 (epoch 0), avg_reward = 0.001, time/batch = 0.534
iter 200 (epoch 0), avg_reward = 0.000, time/batch = 0.522
iter 250 (epoch 0), avg_reward = 0.001, time/batch = 0.529
iter 300 (epoch 0), avg_reward = 0.002, time/batch = 0.532
iter 350 (epoch 0), avg_reward = 0.001, time/batch = 0.711
iter 400 (epoch 0), avg_reward = -0.000, time/batch = 0.528
iter 450 (epoch 0), avg_reward = 0.001, time/batch = 0.517
iter 500 (epoch 0), avg_reward = 0.001, time/batch = 0.512
total time: 283.7362642288208

And cProfile output:

         184722279 function calls (184532377 primitive calls) in 296.112 seconds
   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   960610   99.424    0.000  131.812    0.000 cider/pyciderevalcap/ciderD/ciderD_scorer.py:128(counts2vec)
   800610   42.364    0.000   53.293    0.000 cider/pyciderevalcap/ciderD/ciderD_scorer.py:154(sim)
   960610   31.212    0.000   32.016    0.000 cider/pyciderevalcap/ciderD/ciderD_scorer.py:17(precook)
 38569360   15.590    0.000   15.590    0.000 {built-in method builtins.pow}
     1000   14.660    0.015   23.485    0.023 /export/home1/NoCsBack/hci/rubenc/selfcritical/models/FCModel.py:150(_sample)
      511   10.595    0.021   10.595    0.021 {method 'item' of 'torch._C._TensorBase' objects}
      500    9.524    0.019    9.524    0.019 {method 'run_backward' of 'torch._C._EngineBase' objects}
 39566383    8.326    0.000    8.326    0.000 {built-in method builtins.min}
 38570206    6.785    0.000    6.785    0.000 {built-in method builtins.max}
      500    6.722    0.013  195.449    0.391 cider/pyciderevalcap/ciderD/ciderD_scorer.py:127(compute_cider)

So basically, a big improvement in speed: total training time drops from ~589 s to ~284 s 🙂.
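
The actual change lives in FCModel._sample; the snippet below is only a sketch of the idea with made-up tensor names: sample from the GPU tensor directly instead of round-tripping through the CPU.

import torch

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

# Stand-in for the decoder's per-step output: a (batch, vocab) tensor on the GPU.
logprobs = torch.randn(32, 20000, device=device).log_softmax(dim=1)

# Before (sketch): moving the probabilities to the CPU forces a device sync and
# then uses the much slower CPU multinomial kernel.
# it = torch.multinomial(logprobs.exp().cpu(), 1).to(device)

# After (sketch): torch.multinomial runs directly on the CUDA tensor.
it = torch.multinomial(logprobs.exp(), 1)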

A comparison in IPython:

In [28]: device = torch.device('cuda:0')

In [29]: weights = torch.randn((32, 20000), dtype=torch.float32).clamp(0.01, 1)

In [30]: cweights = weights.clone().detach().to(device)

In [31]: avg_timeit(lambda: torch.multinomial(cweights, 1), 100)
Out[31]: 7.232666015625e-05

In [32]: avg_timeit(lambda: torch.multinomial(weights, 1), 100)
Out[32]: 0.015503778457641601
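
avg_timeit is not a built-in; a minimal sketch of such a helper, assuming it simply averages wall-clock time over n calls (with a CUDA synchronize so the GPU number measures the kernel itself rather than just the asynchronous launch):

import time
import torch

def avg_timeit(fn, n):
    # Average wall-clock seconds per call of fn over n calls.
    if torch.cuda.is_available():
        torch.cuda.synchronize()   # flush any pending GPU work before timing
    start = time.time()
    for _ in range(n):
        fn()
    if torch.cuda.is_available():
        torch.cuda.synchronize()   # wait for the launched kernels to finish
    return (time.time() - start) / n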
