
tanh doesn't use all cores #2136

Closed · hughperkins opened this issue Jul 18, 2017 · 9 comments

@hughperkins (Contributor)

When I run this script:

import torch

a = torch.rand(1000, 10000)
while True:
    print('.')
    a.tanh_()

and then open htop, I expect to see all 8 cores running at 100%, but only 4 seem to be running:

[screenshot: htop, 2017-07-17 9:01 am]

In addition, the cores that are running are only at ~30-40%.

(Note that I'm not submitting a fix for this, just flagging it)

@jekbradbury (Contributor)

What happens when you fiddle around with the pertinent environment variables (some subset of OPENBLAS_NUM_THREADS, MKL_NUM_THREADS, and OMP_NUM_THREADS)?
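
For example (a minimal sketch; torch.get_num_threads / torch.set_num_threads are the Python-level knobs for the number of CPU threads PyTorch uses, while the environment variables above are typically read at process startup):

import torch

# Report how many CPU threads PyTorch currently uses for intra-op work,
# then pin it to the number of cores you expect to be busy.
print(torch.get_num_threads())
torch.set_num_threads(8)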

@hughperkins (Contributor, Author)

I think this should really be handled automatically. At least, I don't see anything in the docs I've read suggesting one needs to do this. So either the docs need to be updated, or this should be handled automatically, I reckon.

@hughperkins (Contributor, Author)

That said, with MKL_NUM_THREADS=8, no change:
[screenshot: htop, 2017-07-17 10:24 pm]

OPENBLAS_NUM_THREADS=8 MKL_NUM_THREADS=8 OMP_NUM_THREADS=8 python test_tanh.py: no change:

[screenshot: htop, 2017-07-17 10:25 pm]

@hughperkins (Contributor, Author)

(tanh shouldn't be using BLAS, should it? I would think it uses some combination of SSE and OpenMP?)
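
A quick way to see the distinction in htop (a rough sketch, assuming torch.mm is backed by OpenBLAS/MKL on this build): the matrix multiply should respond to the BLAS thread variables, while the elementwise tanh_ goes through TH and should not.

import torch

a = torch.rand(2000, 2000)
b = torch.rand(2000, 2000)

# BLAS-backed op: OPENBLAS_NUM_THREADS / MKL_NUM_THREADS should change
# how many cores this keeps busy.
for _ in range(100):
    torch.mm(a, b)

# Elementwise TH kernel: the BLAS thread settings should make no difference here.
for _ in range(100):
    a.tanh_()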

@jekbradbury (Contributor)

Oh of course, sorry. Yeah, this is a good question then!

@apaszke (Contributor) commented Jul 18, 2017

It seems that tanh doesn't use OpenMP at all. It calls into TH, which uses TH_TENSOR_APPLY2, and that macro is single-threaded. TH could use some multi-core optimizations, but we don't have enough hands to do it right now.

We could provide some guidance if anyone wants to take a stab. We already have macros that extend TH_TENSOR_APPLY to make use of all cores in the contiguous case, so at least this improvement should be easy to add.
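
In the meantime, a minimal sketch for checking whether the op scales with the thread count at all (if tanh_ really is single-threaded, the timings should be roughly the same for every setting):

import time
import torch

a = torch.rand(1000, 10000)

# Time tanh_ at several thread counts; a flat curve means the kernel
# is effectively single-threaded.
for n in (1, 2, 4, 8):
    torch.set_num_threads(n)
    start = time.time()
    for _ in range(20):
        a.tanh_()
    print('%d threads: %.3fs' % (n, time.time() - start))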

@ruotianluo (Contributor)

I submitted a PR; check if it looks like what's expected. @apaszke

@fmassa (Member) commented Jul 18, 2017

If one wants to use OMP on TH_TENSOR_APPLY2, one could look into improving on top of torch/torch7#395. But I think that having better support for AVX/SSE would be best.

@soumith soumith added this to Uncategorized in Issue Status Aug 23, 2017
@soumith soumith moved this from Uncategorized to High Priority in Issue Status Aug 23, 2017
@soumith soumith added this to Performance in Issue Categories Aug 30, 2017
@weiyangfb (Contributor)

Closing since this has been fixed via #2792.
