[POC] Adds TH_TENSOR_APPLY2_PARALLEL #395

Open · wants to merge 1 commit into master

Conversation

@fmassa (Contributor) commented Oct 4, 2015

Following the discussion in #323, I've tried to add a macro TH_TENSOR_APPLY2_PARALLEL, which uses OpenMP if both tensors are contiguous. For the moment, I haven't set a threshold for deciding whether or not to use OpenMP.

As a proof of concept, I added it to the unary operations implemented by LAB_IMPLEMENT_BASIC_FUNCTION (like abs, tan, etc.).
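
To make the proof of concept concrete, the call site for one of those unary functions could look roughly like this (a sketch in the spirit of LAB_IMPLEMENT_BASIC_FUNCTION's expansion, not the actual diff; it assumes the usual generic TH file context where real and THTensor_() are defined):

```c
/* Sketch, not the PR's actual diff: a generated unary op (here abs for double
 * tensors, using fabs) going through the parallel apply macro. The argument
 * list and the per-element code are the same as for TH_TENSOR_APPLY2; only
 * the traversal strategy changes when both tensors are contiguous. */
void THTensor_(abs)(THTensor *r_, THTensor *t)
{
  THTensor_(resizeAs)(r_, t);
  TH_TENSOR_APPLY2_PARALLEL(real, t, real, r_, *r__data = fabs(*t_data););
}
```

Keeping the interface identical to TH_TENSOR_APPLY2 means a function like this only has to swap the macro name, nothing else.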

Any thoughts?
cc @dominikgrewe

@fmassa (Contributor, Author) commented Oct 5, 2015

Just a quick heads-up on this PR.
A simple benchmark of the log function on a 2-core/4-thread machine gave a speed-up of about 2.5x over the non-threaded version, for a contiguous double vector of 2e8 elements.
This could be especially useful for pointwise operations in nn, since, for now, the parallelism in some of the transfer functions has been removed (torch/nn#403).
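
For reference, the comparison can be reproduced with a harness along these lines (a sketch against the TH C API, not the exact script behind the numbers above; the fill value and the use of omp_get_wtime are arbitrary choices), built once with the OpenMP path enabled and once without:

```c
/* Rough timing harness for a pointwise op on a large contiguous double
 * vector; N matches the 2e8 elements mentioned above. */
#include <stdio.h>
#include <omp.h>
#include "TH.h"

int main(void)
{
  long N = 200000000;
  THDoubleTensor *src = THDoubleTensor_newWithSize1d(N);
  THDoubleTensor *dst = THDoubleTensor_newWithSize1d(N);
  THDoubleTensor_fill(src, 2.0);

  double t0 = omp_get_wtime();
  THDoubleTensor_log(dst, src);   /* pointwise op dispatching through the apply macro */
  double t1 = omp_get_wtime();
  printf("log over %ld doubles: %.3f s\n", N, t1 - t0);

  THDoubleTensor_free(src);
  THDoubleTensor_free(dst);
  return 0;
}
```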

break; \
} \
} \
} \
@dominikgrewe (Member) commented on the diff fragment quoted above:

Feels like there's a lot of code duplication here. Can't we just call THTensor_isContiguous? Same for counting the number of elements. We'd need the tensor type as parameter to the macro (in addition to the value type), but that can be easily added.
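
Concretely, that suggestion could look something like the sketch below (names hypothetical, not a proposed final diff): with the tensor type passed next to the value type, the macro can reuse the existing per-type isContiguous / data / nElement helpers for each tensor and keep the OpenMP fast path short.

```c
/* Hypothetical sketch of the suggested variant: TENSORTYPE1/TENSORTYPE2 are
 * the concrete tensor types (e.g. THFloatTensor), so the existing per-type
 * helpers can be reused instead of re-deriving contiguity and element counts
 * inside the macro. Assumes both tensors hold the same number of elements,
 * as TH_TENSOR_APPLY2 already requires. */
#define TH_TENSOR_APPLY2_PARALLEL(TYPE1, TENSORTYPE1, TENSOR1, \
                                  TYPE2, TENSORTYPE2, TENSOR2, CODE) \
{ \
  if (TENSORTYPE1##_isContiguous(TENSOR1) && TENSORTYPE2##_isContiguous(TENSOR2)) \
  { \
    TYPE1 *TENSOR1##_base = TENSORTYPE1##_data(TENSOR1); \
    TYPE2 *TENSOR2##_base = TENSORTYPE2##_data(TENSOR2); \
    long TH_omp_n = TENSORTYPE1##_nElement(TENSOR1); \
    long TH_omp_i; \
    _Pragma("omp parallel for") \
    for (TH_omp_i = 0; TH_omp_i < TH_omp_n; TH_omp_i++) \
    { \
      TYPE1 *TENSOR1##_data = TENSOR1##_base + TH_omp_i; \
      TYPE2 *TENSOR2##_data = TENSOR2##_base + TH_omp_i; \
      CODE \
    } \
  } \
  else \
  { \
    TH_TENSOR_APPLY2(TYPE1, TENSOR1, TYPE2, TENSOR2, CODE) \
  } \
}
```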

@dominikgrewe (Member) commented:

Sorry for the delay in getting back. It looks good to me in principle, but we should try to reduce the amount of code duplication.

@fmassa (Contributor, Author) commented Oct 23, 2015

Hi Dominik,
Thanks for the comments.
I agree that there is a lot of code duplication; it actually comes from the original TH_TENSOR_APPLY2 macro.
Since I wanted to keep this apply function as similar as possible to the non-OpenMP one, I copied the TH_TENSOR_APPLY2 macro and added a few lines to make it work with OpenMP; the main difference is here.
When I found I needed an isContiguous check, I realised it was already computed implicitly in the original macro (but not exposed as a variable, see here for example), so I just added an isContiguous variable in this part.

I'll replace those repetitive parts with their respective function calls.

@fmassa (Contributor, Author) commented Nov 16, 2015

Hi @dominikgrewe,

I was thinking about this again, and I'm not sure we would want a different interface from the one in TH_TENSOR_APPLY2.
My initial goal with this PR was to modify TH_TENSOR_APPLY2 itself, but then I realised that there are some use cases where this parallelisation wouldn't work (because of reductions or breaks in the middle of the for loop). I therefore decided to create a new macro (following your idea in #323), but keeping the interface the same, so that TH_TENSOR_APPLY2 can easily be replaced by this new one where applicable (including in nn).
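
For instance, a reduction written against TH_TENSOR_APPLY2 along the lines of the sketch below (not an excerpt from THTensorMath.c) cannot simply have its generated loop wrapped in an omp parallel for, because every iteration updates the same accumulator:

```c
/* Sketch only (not an existing TH function): a reduction expressed with
 * TH_TENSOR_APPLY2. The shared `sum` accumulator (and, in other cases, a
 * `break` inside CODE) is why TH_TENSOR_APPLY2 itself cannot be blindly
 * parallelised, and why the OpenMP path is a separate opt-in macro. */
double THTensor_(dotLikeExample)(THTensor *ta, THTensor *tb)
{
  double sum = 0;
  TH_TENSOR_APPLY2(real, ta, real, tb,
                   sum += (double)(*ta_data) * (double)(*tb_data););
  return sum;
}
```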

I agree, though, that there is a lot of code duplication, which makes compilation times quite long.

What do you think? @soumith @andresy @koraykv @dominikgrewe
