
possible performance regression from 1.6.2 --> 1.7.0: np.any() and np.all() are unexpectedly slow over large arrays #3446

Closed
alimuldal opened this issue Jun 16, 2013 · 20 comments


@alimuldal
Contributor

When np.all encounters a zero element it should return False immediately without testing any further elements. Therefore, the time taken by np.all should not increase with increasing array size, provided that the first element in the array is always zero. The same should be true for np.any if the first element in the array is nonzero.

Test script:

import timeit
import numpy as np

print('Numpy v%s' % np.version.full_version)
stmt = "np.all(x)"
for ii in range(10):
    # All-zero boolean array: np.all should short-circuit on the first element.
    setup = "import numpy as np; x = np.zeros(%d, dtype=bool)" % (10**ii)
    timer = timeit.Timer(stmt, setup)
    n, r = 1, 3
    t = np.min(timer.repeat(r, n))
    # Increase the loop count until a single repeat takes at least 0.2 s.
    while t < 0.2:
        n *= 10
        t = np.min(timer.repeat(r, n))
    t /= n
    if t < 1E-3:
        timestr = "%1.3f us" % (t * 1E6)
    elif t < 1:
        timestr = "%1.3f ms" % (t * 1E3)
    else:
        timestr = "%1.3f s" % t
    print("Array size: 1E%i, %i loops, best of %i: %s/loop" % (ii, n, r, timestr))

Results:

Numpy v1.6.2
Array size: 1E0, 1000000 loops, best of 3: 1.738 us/loop
Array size: 1E1, 1000000 loops, best of 3: 1.845 us/loop
Array size: 1E2, 1000000 loops, best of 3: 1.862 us/loop
Array size: 1E3, 1000000 loops, best of 3: 1.858 us/loop
Array size: 1E4, 1000000 loops, best of 3: 1.864 us/loop
Array size: 1E5, 1000000 loops, best of 3: 1.882 us/loop
Array size: 1E6, 1000000 loops, best of 3: 1.866 us/loop
Array size: 1E7, 1000000 loops, best of 3: 1.853 us/loop
Array size: 1E8, 1000000 loops, best of 3: 1.860 us/loop
Array size: 1E9, 1000000 loops, best of 3: 1.854 us/loop

Numpy v1.7.0
Array size: 1E0, 100000 loops, best of 3: 5.881 us/loop
Array size: 1E1, 100000 loops, best of 3: 5.831 us/loop
Array size: 1E2, 100000 loops, best of 3: 5.924 us/loop
Array size: 1E3, 100000 loops, best of 3: 5.864 us/loop
Array size: 1E4, 100000 loops, best of 3: 5.997 us/loop
Array size: 1E5, 100000 loops, best of 3: 6.979 us/loop
Array size: 1E6, 100000 loops, best of 3: 17.196 us/loop
Array size: 1E7, 10000 loops, best of 3: 116.162 us/loop
Array size: 1E8, 1000 loops, best of 3: 1.112 ms/loop
Array size: 1E9, 100 loops, best of 3: 11.061 ms/loop

Numpy v1.7.1
Array size: 1E0, 100000 loops, best of 3: 6.216 us/loop
Array size: 1E1, 100000 loops, best of 3: 6.257 us/loop
Array size: 1E2, 100000 loops, best of 3: 6.318 us/loop
Array size: 1E3, 100000 loops, best of 3: 6.247 us/loop
Array size: 1E4, 100000 loops, best of 3: 6.492 us/loop
Array size: 1E5, 100000 loops, best of 3: 7.406 us/loop
Array size: 1E6, 100000 loops, best of 3: 17.426 us/loop
Array size: 1E7, 10000 loops, best of 3: 115.946 us/loop
Array size: 1E8, 1000 loops, best of 3: 1.102 ms/loop
Array size: 1E9, 100 loops, best of 3: 10.987 ms/loop

Numpy v1.8.0.dev-e11cd9b
Array size: 1E0, 100000 loops, best of 3: 6.357 us/loop
Array size: 1E1, 100000 loops, best of 3: 6.399 us/loop
Array size: 1E2, 100000 loops, best of 3: 6.425 us/loop
Array size: 1E3, 100000 loops, best of 3: 6.397 us/loop
Array size: 1E4, 100000 loops, best of 3: 6.596 us/loop
Array size: 1E5, 100000 loops, best of 3: 7.569 us/loop
Array size: 1E6, 100000 loops, best of 3: 17.445 us/loop
Array size: 1E7, 10000 loops, best of 3: 115.109 us/loop
Array size: 1E8, 1000 loops, best of 3: 1.094 ms/loop
Array size: 1E9, 100 loops, best of 3: 10.840 ms/loop

The bug was initially reported in relation to this SO question.

@charris
Member

charris commented Jun 16, 2013

Hmm, I'd guess a change in the handling of logical_and.reduce. IIRC, that has been refactored. The boolean loop should detect the reduce form and bail on False, but I speculate it isn't getting called any more. The integer cases could use fixing also but are more complicated because of the mixed types, bool and int.

@charris
Member

charris commented Jun 16, 2013

Looks like the reduce detection should still work for logical_and and logical_or, so those loops could be fixed up. The detection won't work for the other relational operators, but maybe we could fix that too by detecting when the strides of the first input and the output arguments are zero. That looks like it could be special-cased whether or not it is a reduce.
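
For context (an illustration added here, not part of the original thread): np.all and np.any dispatch to np.logical_and.reduce and np.logical_or.reduce, so the reduce paths under discussion can be exercised directly:

import numpy as np

x = np.zeros(10**6, dtype=bool)

# Both spellings go through the same reduction machinery:
print(np.all(x), np.logical_and.reduce(x))   # False False
print(np.any(x), np.logical_or.reduce(x))    # False False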

@juliantaylor
Contributor

I think the problem is that the reduce always uses a buffered iterator, which may be fine when called from Python, but within C it just wastes time on unnecessary copies.

As a workaround in 1.7 you can increase the (very small by default) buffer to reduce the overhead a bit:

np.setbufsize(10E6)

This will of course only help if the data actually fits in the buffer and the memcpy is much faster than the logical_or (which it is). It still scales as O(N), but at least it does not spend 99% of its time in iteration overhead.
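
A minimal sketch of that workaround (my example; the sizes are arbitrary):

import numpy as np

# Grow the reduction buffer so fewer buffered-iterator passes are needed.
old = np.getbufsize()       # the default is 8192 elements
np.setbufsize(int(1e6))     # any larger size within numpy's allowed range
try:
    x = np.zeros(10**7, dtype=bool)
    print(np.all(x))        # still O(N), but with less per-chunk overhead
finally:
    np.setbufsize(old)      # restore the previous buffer size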

@charris
Member

charris commented Jun 16, 2013

Ah, that could be. Increasing the buffer size helps a bit, but I suspect the deeper problem is that it doesn't know to short-circuit the logical_and.reduce, so everything gets copied, and that accounts for the time: the loop overhead itself is constant and small, since the loop does short-circuit. I don't recall exactly why the buffered iterator is there, but I suspect it may have been part of the NA work.

@alimuldal
Contributor Author

One further point - I would also expect to see constant loop times w.r.t. array size for all-zero float arrays as well as boolean ones, since np.all should still only have to check the first element before returning. However, this isn't even the case for Numpy 1.6.2:

# x = np.zeros(10**ii,dtype=np.float32)
Numpy v1.6.2
Array size: 1E0, 100000 loops, best of 3: 3.503 us/loop
Array size: 1E1, 100000 loops, best of 3: 3.597 us/loop
Array size: 1E2, 100000 loops, best of 3: 3.742 us/loop
Array size: 1E3, 100000 loops, best of 3: 4.745 us/loop
Array size: 1E4, 100000 loops, best of 3: 14.533 us/loop
Array size: 1E5, 10000 loops, best of 3: 112.463 us/loop
Array size: 1E6, 1000 loops, best of 3: 1.101 ms/loop
Array size: 1E7, 100 loops, best of 3: 11.724 ms/loop
Array size: 1E8, 10 loops, best of 3: 116.924 ms/loop
Array size: 1E9, 1 loops, best of 3: 1.168 s/loop

@charris
Member

charris commented Jun 16, 2013

Yes, only the boolean loop has a short circuit.

@alimuldal
Contributor Author

Sorry for my ignorance - is there some particular reason why?

@charris
Member

charris commented Jun 16, 2013

Probably the mixed types: boolean output, float inputs. The usual reduce idea has trouble with that.
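
The mixed types are visible in the registered loop signatures: the logical ufuncs take same-typed inputs but always produce a boolean output, which a plain reduce (output dtype equal to input dtype) cannot use without casting. A quick check:

import numpy as np

# Each entry reads 'input types -> output type'; the float loops such as
# 'ff->?' and 'dd->?' all return bool ('?'), hence the casting problem.
print(np.logical_and.types)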

@seberg
Member

seberg commented Jun 16, 2013

Yeah, I know I didn't want to hang out here ;). Someone might want to check the nditer construction, because for an unbuffered loop (and this should be unbuffered as far as I can tell), I think I remember seeing code that should expand the innermost dimension to the maximum possible size (i.e. the whole array here). So this might be failing.

Other than that, I would say this is merely an observation. Something like gh-2269 is more appropriate, because usually we want to find any/first occurrence after a calculation (though I would rather implement it using nditer; I believe I saw a snippet by Mark that shows how to do such things with it). I am not even sure the ufunc machinery currently supports mixed-type reductions, which would be necessary to optimize the non-bool case.
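
For concreteness, here is a minimal sketch (mine, not from the thread) of the chunked early-exit idea behind gh-2269, written with plain slicing rather than nditer:

import numpy as np

def any_chunked(x, chunk=8192):
    # Scan a flat view in fixed-size chunks and stop at the first chunk
    # containing a nonzero element, so the cost is proportional to the
    # position of the first hit rather than to the full array size.
    flat = x.ravel()
    for start in range(0, flat.size, chunk):
        if flat[start:start + chunk].any():
            return True
    return False

x = np.zeros(10**7)
x[5] = 1.0
print(any_chunked(x))   # True, after examining only the first chunk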

@seberg
Member

seberg commented Jun 16, 2013

Haha, here is your explanation: "TODO: Could grow REDUCE loop too with some more logic above." in nditer_api.c :).

@charris
Member

charris commented Jun 16, 2013

Resistance is futile ;) But thanks for the pointers....

@njsmith
Member

njsmith commented Jun 16, 2013

Why do we even have a float loop for this? It's just doing cast-to-bool then boolean operations, right? So if we just deleted the float loop, then we'd... get exactly the same behaviour with less code and fewer bugs?


@njsmith
Member

njsmith commented Jun 16, 2013

Or wait, the ufunc machinery thinks that you can't cast floats to bools no matter what, doesn't it :-( Ugh.


@seberg
Member

seberg commented Jun 16, 2013

It does cast for the reduction (I think it only uses the out dtype to decide which dtype to use). But if it casts, it must buffer. If it buffers (and in this case also otherwise), it uses chunks of 8192 elements (the default buffer size). And while the ufunc (innermost loop) terminates immediately (due to the reduce optimization, though only if the memory layout is right), the inner loop cannot tell the outer one that it can already stop. So the inner loop is called just as often when it terminates immediately as when it doesn't, and terminating early saves only a small amount of execution time. What you see in these timings is just the ufunc machinery (plus possibly casting) overhead.
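
A back-of-envelope check of that explanation (my arithmetic, assuming the 8192-element default buffer):

n = 10**9
bufsize = 8192          # numpy's default buffer size, in elements
print(n // bufsize)     # ~122,000 buffered inner-loop calls; each pays copy
                        # and setup overhead, so the total stays O(N) even
                        # though the boolean inner loop exits immediately

This is consistent with the roughly 11 ms measured above for 1E9 elements: on the order of 100 ns of overhead per chunk.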

@charris
Member

charris commented Jun 16, 2013

Apart from the lack of a simple optimization, it might not matter too much in practice. Many of the use cases of any/all are likely run in expectation of a False/True result, in which case the entire data set needs to be traversed anyway.

@juliantaylor
Contributor

In numpy 1.9 you can do d[d.argmin()] == True instead of all() as a workaround if you expect an early False element. But unless you are using numpy 1.10 and the array is contiguous, it will be significantly slower if the condition is true.
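
A quick illustration of the workaround (my example): on a boolean array, argmin returns the index of the first False (the minimum), so indexing with it reproduces all():

import numpy as np

x = np.ones(10**7, dtype=bool)
x[10] = False                  # an early False element

# If there is no False, argmin returns 0 and x[0] is True, so the
# comparison also matches all() for an all-True array.
print(x[x.argmin()] == True)   # False
print(x.all())                 # False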

@gdementen
Contributor

@juliantaylor How did you expect numpy 1.10 to change your statement? I just tried on numpy 1.10, and your workaround (appreciated, btw) is still much slower than all() if the condition is true.

@juliantaylor
Contributor

Are you on Windows, or on a system with an old C library? The change I was referring to is c12c31f, which requires a good memchr implementation like the one in glibc newer than roughly 2.12.
With that, the argmin statement should have the same speed as all() for a contiguous array.

@gdementen
Contributor

I am indeed on Windows...

@seberg
Member

seberg commented Jul 1, 2021

Closing this, since the core of the issue is identical to gh-17471.
