
possible performance regression from 1.6.2 --> 1.7.0: np.any() and np.all() are unexpectedly slow over large arrays #3446

Closed
alimuldal opened this issue Jun 16, 2013 · 20 comments


@alimuldal
Contributor

When np.all encounters a zero element it should return False immediately without testing any further elements. Therefore, the time taken by np.all should not increase with increasing array size, provided that the first element in the array is always zero. The same should be true for np.any if the first element in the array is nonzero.

Test script:

import timeit
import numpy as np

print('Numpy v%s' % np.version.full_version)
stmt = "np.all(x)"
for ii in range(10):
    # All-zero boolean array: np.all should short-circuit on the first element.
    setup = "import numpy as np; x = np.zeros(%d, dtype=bool)" % (10**ii)
    timer = timeit.Timer(stmt, setup)
    n, r = 1, 3
    t = np.min(timer.repeat(r, n))
    # Increase the loop count until a single repeat takes at least 0.2 s.
    while t < 0.2:
        n *= 10
        t = np.min(timer.repeat(r, n))
    t /= n
    if t < 1E-3:
        timestr = "%1.3f us" % (t * 1E6)
    elif t < 1:
        timestr = "%1.3f ms" % (t * 1E3)
    else:
        timestr = "%1.3f s" % t
    print("Array size: 1E%i, %i loops, best of %i: %s/loop" % (ii, n, r, timestr))

Results:

Numpy v1.6.2
Array size: 1E0, 1000000 loops, best of 3: 1.738 us/loop
Array size: 1E1, 1000000 loops, best of 3: 1.845 us/loop
Array size: 1E2, 1000000 loops, best of 3: 1.862 us/loop
Array size: 1E3, 1000000 loops, best of 3: 1.858 us/loop
Array size: 1E4, 1000000 loops, best of 3: 1.864 us/loop
Array size: 1E5, 1000000 loops, best of 3: 1.882 us/loop
Array size: 1E6, 1000000 loops, best of 3: 1.866 us/loop
Array size: 1E7, 1000000 loops, best of 3: 1.853 us/loop
Array size: 1E8, 1000000 loops, best of 3: 1.860 us/loop
Array size: 1E9, 1000000 loops, best of 3: 1.854 us/loop

Numpy v1.7.0
Array size: 1E0, 100000 loops, best of 3: 5.881 us/loop
Array size: 1E1, 100000 loops, best of 3: 5.831 us/loop
Array size: 1E2, 100000 loops, best of 3: 5.924 us/loop
Array size: 1E3, 100000 loops, best of 3: 5.864 us/loop
Array size: 1E4, 100000 loops, best of 3: 5.997 us/loop
Array size: 1E5, 100000 loops, best of 3: 6.979 us/loop
Array size: 1E6, 100000 loops, best of 3: 17.196 us/loop
Array size: 1E7, 10000 loops, best of 3: 116.162 us/loop
Array size: 1E8, 1000 loops, best of 3: 1.112 ms/loop
Array size: 1E9, 100 loops, best of 3: 11.061 ms/loop

Numpy v1.7.1
Array size: 1E0, 100000 loops, best of 3: 6.216 us/loop
Array size: 1E1, 100000 loops, best of 3: 6.257 us/loop
Array size: 1E2, 100000 loops, best of 3: 6.318 us/loop
Array size: 1E3, 100000 loops, best of 3: 6.247 us/loop
Array size: 1E4, 100000 loops, best of 3: 6.492 us/loop
Array size: 1E5, 100000 loops, best of 3: 7.406 us/loop
Array size: 1E6, 100000 loops, best of 3: 17.426 us/loop
Array size: 1E7, 10000 loops, best of 3: 115.946 us/loop
Array size: 1E8, 1000 loops, best of 3: 1.102 ms/loop
Array size: 1E9, 100 loops, best of 3: 10.987 ms/loop

Numpy v1.8.0.dev-e11cd9b
Array size: 1E0, 100000 loops, best of 3: 6.357 us/loop
Array size: 1E1, 100000 loops, best of 3: 6.399 us/loop
Array size: 1E2, 100000 loops, best of 3: 6.425 us/loop
Array size: 1E3, 100000 loops, best of 3: 6.397 us/loop
Array size: 1E4, 100000 loops, best of 3: 6.596 us/loop
Array size: 1E5, 100000 loops, best of 3: 7.569 us/loop
Array size: 1E6, 100000 loops, best of 3: 17.445 us/loop
Array size: 1E7, 10000 loops, best of 3: 115.109 us/loop
Array size: 1E8, 1000 loops, best of 3: 1.094 ms/loop
Array size: 1E9, 100 loops, best of 3: 10.840 ms/loop

The bug was initially reported in relation to this SO question.

@charris
Member

charris commented Jun 16, 2013

Hmm, I'd guess a change in the handling of logical_and.reduce. IIRC, that has been refactored. The boolean loop should detect the reduce form and bail on False, but I speculate it isn't getting called any more. The integer cases could use fixing also but are more complicated because of the mixed types, bool and int.

@charris
Member

charris commented Jun 16, 2013

Looks like the reduce detection should still work for logical_and and logical_or, so those loops could be fixed up. The detection won't work for the other relational operators, but maybe we could fix that too by detecting when the strides of the first input and the output arguments are zero. That looks like it could be special-cased whether or not it is a reduce.
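
For context (an illustration added here, not part of the original thread): np.all and np.any dispatch to np.logical_and.reduce and np.logical_or.reduce, so the reduce paths under discussion can be exercised directly:

import numpy as np

x = np.zeros(10**6, dtype=bool)

# Both spellings go through the same reduction machinery:
print(np.all(x), np.logical_and.reduce(x))   # False False
print(np.any(x), np.logical_or.reduce(x))    # False False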

@juliantaylor
Contributor

I think the problem is that the reduce always uses a buffered iterator, which may be fine when called from Python, but within C it just wastes time on unnecessary copies.

As a workaround in 1.7 you can increase the (very small by default) buffer to reduce the overhead a bit:

np.setbufsize(10E6)

This will of course only help if the data actually fits in the buffer and the memcpy is much faster than the logical_or (which it is). It still scales as O(N), but at least it does not spend 99% of its time in iteration overhead.
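
A minimal sketch of that workaround (my example; the sizes are arbitrary):

import numpy as np

# Grow the reduction buffer so fewer buffered-iterator passes are needed.
old = np.getbufsize()       # the default is 8192 elements
np.setbufsize(int(1e6))     # any larger size within numpy's allowed range
try:
    x = np.zeros(10**7, dtype=bool)
    print(np.all(x))        # still O(N), but with less per-chunk overhead
finally:
    np.setbufsize(old)      # restore the previous buffer size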

@charris
Member

charris commented Jun 16, 2013

Ah, that could be. Increasing the buffer size helps a bit, but I suspect the deeper problem is that it doesn't know to short-circuit the logical_and.reduce, so everything gets copied, and that accounts for the time: the loop overhead itself is constant and small, since the loop does short-circuit. I don't recall exactly why the buffered iterator is there, but I suspect it may have been part of the NA work.

@alimuldal
Contributor Author

One further point - I would also expect to see constant loop times w.r.t. array size for all-zero float arrays as well as boolean ones, since np.all should still only have to check the first element before returning. However, this isn't even the case for Numpy 1.6.2:

# x = np.zeros(10**ii,dtype=np.float32)
Numpy v1.6.2
Array size: 1E0, 100000 loops, best of 3: 3.503 us/loop
Array size: 1E1, 100000 loops, best of 3: 3.597 us/loop
Array size: 1E2, 100000 loops, best of 3: 3.742 us/loop
Array size: 1E3, 100000 loops, best of 3: 4.745 us/loop
Array size: 1E4, 100000 loops, best of 3: 14.533 us/loop
Array size: 1E5, 10000 loops, best of 3: 112.463 us/loop
Array size: 1E6, 1000 loops, best of 3: 1.101 ms/loop
Array size: 1E7, 100 loops, best of 3: 11.724 ms/loop
Array size: 1E8, 10 loops, best of 3: 116.924 ms/loop
Array size: 1E9, 1 loops, best of 3: 1.168 s/loop

@charris
Member

charris commented Jun 16, 2013

Yes, only the boolean loop has a short circuit.

@alimuldal
Contributor Author

Sorry for my ignorance - is there some particular reason why?

@charris
Member

charris commented Jun 16, 2013

Probably the mixed types: boolean output, float inputs. The usual reduce idea has trouble with that.
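
The mixed types are visible in the registered loop signatures: the logical ufuncs take same-typed inputs but always produce a boolean output, which a plain reduce (output dtype equal to input dtype) cannot use without casting. A quick check:

import numpy as np

# Each entry reads 'input types -> output type'; the float loops such as
# 'ff->?' and 'dd->?' all return bool ('?'), hence the casting problem.
print(np.logical_and.types)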

@seberg
Member

seberg commented Jun 16, 2013

Yeah, I know I didn't want to hang out here ;). Someone might want to check the nditer construction, because for an unbuffered loop (and this should be unbuffered as far as I can tell), I think I remember seeing code that should expand the innermost dimension to the maximum possible size (i.e. the whole array here). So this might be failing.

Other than that, I would say this is merely an observation. Something like gh-2269 is more appropriate, because usually we want to find any/first occurrence after a calculation (though I would rather implement it using nditer; I believe I saw a snippet by Mark that shows how to do such things with it). I am not even sure the ufunc machinery currently supports mixed-type reductions, which would be necessary to optimize the non-bool case.
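
For concreteness, here is a minimal sketch (mine, not from the thread) of the chunked early-exit idea behind gh-2269, written with plain slicing rather than nditer:

import numpy as np

def any_chunked(x, chunk=8192):
    # Scan a flat view in fixed-size chunks and stop at the first chunk
    # containing a nonzero element, so the cost is proportional to the
    # position of the first hit rather than to the full array size.
    flat = x.ravel()
    for start in range(0, flat.size, chunk):
        if flat[start:start + chunk].any():
            return True
    return False

x = np.zeros(10**7)
x[5] = 1.0
print(any_chunked(x))   # True, after examining only the first chunk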

@seberg
Member

seberg commented Jun 16, 2013

Haha, here is your explanation: "TODO: Could grow REDUCE loop too with some more logic above." in nditer_api.c :).

@charris
Member

charris commented Jun 16, 2013

Resistance is futile ;) But thanks for the pointers....

@njsmith
Member

njsmith commented Jun 16, 2013

Why do we even have a float loop for this? It's just doing cast-to-bool then boolean operations, right? So if we just deleted the float loop, then we'd... get exactly the same behaviour with less code and fewer bugs?


@njsmith
Member

njsmith commented Jun 16, 2013

Or wait, the ufunc machinery thinks that you can't cast floats to bools no matter what, doesn't it :-( Ugh.


@seberg
Member

seberg commented Jun 16, 2013

It does cast for the reduction (I think it only uses the out dtype to decide which dtype to use). But if it casts, it must buffer. If it buffers (and in this case also otherwise), it uses chunks of 8192 elements (the default buffer size). And while the ufunc (innermost loop) terminates immediately (due to the reduce optimization, though only if the memory layout is right), the inner loop cannot tell the outer one that it can already stop. So the inner loop is called just as often when it terminates immediately as when it doesn't, and terminating early saves only a small amount of execution time. What you see in these timings is just the ufunc machinery (plus possibly casting) overhead.
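
A back-of-envelope check of that explanation (my arithmetic, assuming the 8192-element default buffer):

n = 10**9
bufsize = 8192          # numpy's default buffer size, in elements
print(n // bufsize)     # ~122,000 buffered inner-loop calls; each pays copy
                        # and setup overhead, so the total stays O(N) even
                        # though the boolean inner loop exits immediately

This is consistent with the roughly 11 ms measured above for 1E9 elements: on the order of 100 ns of overhead per chunk.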

@charris
Member

charris commented Jun 16, 2013

Apart from the lack of a simple optimization, it might not matter too much in practice. Many of the use cases of any/all are likely run in expectation of a False/True result, in which case the entire data set needs to be traversed anyway.

@juliantaylor
Contributor

In numpy 1.9 you can do d[d.argmin()] == True instead of all() as a workaround if you expect an early False element. But unless you are using numpy 1.10 and the array is contiguous, it will be significantly slower if the condition is true.
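
A quick illustration of the workaround (my example): on a boolean array, argmin returns the index of the first False (the minimum), so indexing with it reproduces all():

import numpy as np

x = np.ones(10**7, dtype=bool)
x[10] = False                  # an early False element

# If there is no False, argmin returns 0 and x[0] is True, so the
# comparison also matches all() for an all-True array.
print(x[x.argmin()] == True)   # False
print(x.all())                 # False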

@gdementen
Contributor

@juliantaylor How did you expect numpy 1.10 to change your statement? I just tried on numpy 1.10, and your workaround (appreciated, btw) is still much slower than all() if the condition is true.

@juliantaylor
Contributor

Are you on Windows, or on a system with an old C library? The change I was referring to is c12c31f, which requires a good memchr implementation like the one in glibc newer than roughly 2.12.
With that, the argmin statement should have the same speed as all() for a contiguous array.

@gdementen
Contributor

I am indeed on Windows...

@seberg
Member

seberg commented Jul 1, 2021

Closing this, since the core of the issue is identical to gh-17471.
