
ENH: Optimize np.power(x, 2) for double and float type #26055

Merged
merged 15 commits into numpy:main on May 14, 2024

Conversation

eendebakpt
Contributor

@eendebakpt eendebakpt commented Mar 16, 2024

We optimize np.power(x, e) by adding a fast path for the case where e=2 and x is a double or float.

Updated benchmark (linux):

Main

x**1 54.387 [us]
x**2 50.409 [us]
x**3 1999.226 [us]
x**1.123 1982.337 [us]
x**0.5 206.666 [us]
power(x, 1) 1980.208 [us]
power(x, 2) 1987.717 [us]
power(x, 3) 1982.347 [us]
power(x, 1.123) 1979.270 [us]
power(x, 0.5) 1981.093 [us]
power(x_float32, 1) 917.876 [us]
power(x_float32, 2) 937.346 [us]
power(x_float32, 3) 1006.258 [us]
power(x_float32, 1.123) 916.753 [us]
power(x_float32, 0.5) 918.313 [us]
power(double_size10, 1) 1.116 [us]
power(double_size10, 2) 1.127 [us]
power(double_size10, 3) 1.172 [us]
power(double_size10, 1.123) 1.010 [us]
power(double_size10, 0.5) 0.941 [us]

PR

x**1 55.415 [us]
x**2 51.346 [us]
x**3 1953.280 [us]
x**1.123 1953.009 [us]
x**0.5 206.721 [us]
power(x, 1) 1947.996 [us]
power(x, 2) 58.777 [us]
power(x, 3) 1951.243 [us]
power(x, 1.123) 1949.528 [us]
power(x, 0.5) 406.441 [us]
power(x_float32, 1) 900.470 [us]
power(x_float32, 2) 53.788 [us]
power(x_float32, 3) 900.063 [us]
power(x_float32, 1.123) 903.382 [us]
power(x_float32, 0.5) 205.104 [us]
power(double_size10, 1) 1.087 [us]
power(double_size10, 2) 0.912 [us]
power(double_size10, 3) 1.218 [us]
power(double_size10, 1.123) 0.918 [us]
power(double_size10, 0.5) 0.759 [us]
Benchmark script
import sys
del(sys.path[0])
import numpy as np
print(np, np.__version__)
np.show_config()

import timeit

exponents = [1, 2, 3, 1.123, .5]

xf=np.arange(100_000.).astype(np.float64)
nf = 1_000
for e in exponents:
    dt=timeit.timeit(f'xf**{e}', globals=globals(), number=nf)
    print(f'x**{e} {1e6*dt/nf:.3f} [us]')

for e in exponents:
    dt=timeit.timeit(f'np.power(xf, {e})', globals=globals(), number=nf)
    print(f'power(x, {e}) {1e6*dt/nf:.3f} [us]')

f=np.arange(100_000.).astype(np.float32)
for e in exponents:
    dt=timeit.timeit(f'np.power(f, {e})', globals=globals(), number=nf)
    print(f'power(x_float32, {e}) {1e6*dt/nf:.3f} [us]')


nf = 1_000_000
y=np.arange(10.)
for e in exponents:
    dt=timeit.timeit(f'np.power(y, {e})', globals=globals(), number=nf)
    print(f'power(double_size10, {e}) {1e6*dt/nf:.3f} [us]')

Old benchmark (windows)

import numpy as np
import timeit

xf=np.arange(100_000+.0)
nf = 2_000

for e in [1, 2, 3]:
    dt=timeit.timeit(f'xf**{e}', globals=globals(), number=nf)
    print(f'x**{e} {1e6*dt/nf:.3f} [us]')

for e in [1, 2, 3]:
    dt=timeit.timeit(f'np.power(xf, {e})', globals=globals(), number=nf)
    print(f'power(x, {e}) {1e6*dt/nf:.3f} [us]')
dt=timeit.timeit('np.power(xf, 2.)', globals=globals(), number=nf)
print(f'power(x, 2.) {1e6*dt/nf:.3f} [us]')

Results on main:

x**1 131.168 [us]
x**2 75.814 [us]
x**3 1509.223 [us]
power(x, 1) 306.960 [us]
power(x, 2) 1620.303 [us]
power(x, 3) 1571.338 [us]
power(x, 2.) 1493.646 [us]

Results on this PR:

x**1 119.589 [us]
x**2 77.959 [us]
x**3 1445.394 [us]
power(x, 1) 257.166 [us]
power(x, 2) 58.722 [us]
power(x, 3) 1406.567 [us]
power(x, 2.) 57.718 [us]

Notes:

  • Maybe the fast path for exponent 2 should be outside the SIMD check. On my system SIMD is not selected/available, so I cannot benchmark this.
  • Benchmark results are a bit unstable on my machine, so it would be good to reproduce them independently.
  • We could also special-case other frequently used exponents, for example -1, 0, .5, 1, 3 and 4. But we should check whether those give exactly the same results as the current implementation.
  • Depending on the compiler, the -ffast-math option can make pow much faster for special cases such as e=2. See for example https://stackoverflow.com/a/2940800

Related: #26045, #9363, #6399, #22651, #7786
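The special-casing idea from the notes can be sketched in pure Python (the helper name and structure here are hypothetical, not the PR's actual C implementation, which lives in the ufunc inner loops):

```python
import numpy as np

def power_with_fast_paths(x, e):
    """Hypothetical dispatch mirroring the fast-path idea: route selected
    exponents to cheaper element-wise kernels, fall back to np.power."""
    if e == 2:
        return x * x       # one multiplication per element
    if e == 0.5:
        return np.sqrt(x)  # caution: np.sqrt(-inf) != np.power(-inf, 0.5)
    return np.power(x, e)  # generic (slow) fallback

x = np.arange(5.0)
print(power_with_fast_paths(x, 2))  # [ 0.  1.  4.  9. 16.]
```

The real patch does the equivalent dispatch inside the C loop, where it can also check that the exponent operand has zero stride (a scalar exponent).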

@eendebakpt
Contributor Author

All tests pass, but this PR does change behavior. A minimal example:

z=np.frombuffer(b'\xb6\xfe\x10\x16\xe1\xa7\xda?')
print(z, np.power(z, 2) - z*z)

On main np.power(z, 2) and z*z slightly differ, with this PR they are exactly equal.

@ngoldbaum ngoldbaum added the triage review Issue/PR to be discussed at the next triage meeting label Mar 18, 2024
@ngoldbaum ngoldbaum changed the title Draft: ENH: Optimize np.power(x, 2) for double type ENH: Optimize np.power(x, 2) for double type Mar 18, 2024
@ngoldbaum
Member

Given the dramatic performance improvement, I think accepting some small change in output is ok. Is it possible to hint to the compiler to generalize this a bit? Or maybe manually unroll up to e.g. n=5?

It would be good to see some benchmarks to capture the overhead of the extra branching for non-integer powers, if any.

Added the triage review label to make sure this gets looked at during a triage meeting.

Contributor

@mhvk mhvk left a comment

This is very nice! Beyond the suggestion in-line, a more general question (perhaps to @seberg) of whether it might be possible to do this inside the promotion machinery, i.e., for specific powers call different ufunc inner loops. (Of course, this brings back value-based casting, but raising things to powers is always one of the types of routines that does not fit nicely - and we have precedent with the comparisons...) An advantage of that approach would be that then we can remove the special-casing for __pow__, and presumably any vectorization of np.square and np.sqrt, etc., would be used automatically.

#endif
#if @ispower@
if (steps[1]==0) {
BINARY_DEFS_ZERO_STRIDE
Contributor

Would it make sense to call the inner loop for square? Mostly if that helps with the CPU dispatch (I don't understand that well enough to know whether this is possible).

Member

We can update this to use SIMD for square and sqrt in another patch.

Member

@r-devulap r-devulap left a comment

Looks good. This is actually faster than using SIMD pow(x, y) by quite a bit. I would suggest bypassing the SIMD loop as well for this case. We could worry about CPU dispatching by leveraging square function in a separate patch.

@@ -250,6 +251,25 @@ NPY_NO_EXPORT void NPY_CPU_DISPATCH_CURFX(@TYPE@_@func@)
simd_@intrin@_@sfx@(src1, ssrc1, src2, ssrc2, dst, sdst, len);
return;
}
#endif
#if @ispower@
Member

The logic of the nested if/else conditions is getting a little hard to read. I propose splitting the atan2 and pow functions apart to make this more readable.

@r-devulap
Member

We optimize np.power(x, e) by adding a fast path for the case e=2 and x is double.

Your patch seems to do this for float as well. Or am I missing something?

@eendebakpt eendebakpt changed the title ENH: Optimize np.power(x, 2) for double type ENH: Optimize np.power(x, 2) for double and float type Mar 19, 2024
@eendebakpt
Contributor Author

We optimize np.power(x, e) by adding a fast path for the case e=2 and x is double.

Your patch seems to do this for float as well. Or am I missing something?

It indeed handles both double and float.

@eendebakpt
Contributor Author

Given the dramatic performance improvement, I think accepting some small change in output is ok. Is it possible to hint to the compiler to generalize this a bit? Or maybe manually unroll up to e.g. n=5?

It would be good to see some benchmarks to capture the overhead of the extra branching for non-integer powers, if any.

I added some tests with smaller arrays to see whether the overhead impacts the non-integer case (results updated in first comment). It seems the impact is small or none, but it might depend on the compiler/platform and would be good to benchmark on an independent system.

Adding more special cases is possible, but we need to decide which cases are important, as each added case adds a bit more complexity (and overhead). I now have e=2 and e=.5. Other cases to consider are -1 and -2 (but this might require some thought about whether cases like 0 are handled correctly) and indeed e=3, 4, 5, ...
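As a concrete example of why negative exponents need thought: a hypothetical reciprocal fast path for e=-1 would have to reproduce pow's behaviour at zero exactly, including the floating-point error state (this is an illustration, not part of the PR):

```python
import numpy as np

with np.errstate(divide='ignore'):
    # C99 pow(0.0, -1.0) is +inf with a divide-by-zero error; a fast path
    # computing 1.0 / x must match this, including the warning behaviour.
    print(np.power(0.0, -1.0))    # inf
    print(np.float64(1.0) / 0.0)  # inf
```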

@seberg
Member

seberg commented Mar 19, 2024

The impact is largest in some degenerate cases. Small arrays are dominated by other overheads anyway.
Usage of where=[0, 1] * N may be the worst but is rather niche and ridiculous. Things like arr = np.empty((large, 2))[::2, :] might run into it (but I am actually not sure, it may be that it goes into buffered paths making the inner-loop large).

@r-devulap
Member

Adding more special cases is possible, but we need to decide which cases are important

@eendebakpt out of curiosity, why optimize for np.power(x, 2)? one could as well just use x * x,

@eendebakpt
Contributor Author

The following script can be used to generate some cases where z*z differs from np.power(z, e) (results might be platform dependent).

import sympy
import numpy as np
import random
import struct


def bin_to_float(b):
    """Convert a binary string to a float."""
    bf = int(b, 2).to_bytes(8, 'big')  # 8 bytes needed for IEEE 754 binary64.
    return struct.unpack('>d', bf)[0]

def check_float(z: float, e: int = 3):
    symbolic_z = sympy.sympify(z)

    symbolic_product = 1
    z_product = 1
    for _ in range(e):
        symbolic_product *= symbolic_z
        z_product *= z
    delta = symbolic_product - z_product
    delta_pow = symbolic_product - np.power(z, e).item()
    if delta_pow != delta:
        print(f'{z}: delta product {delta}, delta pow(., {e}) {delta_pow}')
    return delta_pow != delta

z=np.frombuffer(b'\xb6\xfe\x10\x16\xe1\xa7\xda?')
z=float(z.item())
check_float(z)

x=np.random.rand(40_000,)
n=0
for z in x:
    n += check_float(z)
print(f'{n}/{x.size} = {n/x.size}')

for ii in range(10_000):
    z = bin_to_float(bin(random.getrandbits(64)))
    check_float(z)

In the cases generated with this script, the values of z*z agree well with np.power(z, 2), but already for e=3 there are large differences, for example for z=-15656763.486516586. In the cases generated, the calculated repeated product is a better approximation than np.power(z, 3).
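A quicker way to bound the e=2 discrepancy, without sympy, is NumPy's own ULP helper (an illustrative sanity check, separate from the script above; the exact outcome can be platform dependent):

```python
import numpy as np
from numpy.testing import assert_array_max_ulp

# For squaring, z*z and np.power(z, 2) should stay within one ULP of each
# other for random doubles; with the fast path they are bit-identical.
rng = np.random.default_rng(42)
z = rng.random(10_000)
assert_array_max_ulp(z * z, np.power(z, 2), maxulp=1)
```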

@eendebakpt
Contributor Author

Adding more special cases is possible, but we need to decide which cases are important

@eendebakpt out of curiosity, why optimize for np.power(x, 2)? one could as well just use x * x,

That is a fair point. There are several issues (links in the first comment) where users experience a slow np.power (also for other exponents), and this would be one way to provide a performance improvement. Replacing np.power(x, 2) with x*x is not always possible or elegant (for example when np.power is used inside a method where the exponent is one of the arguments).

But whether numpy should special case for certain exponents (and if so which ones) is not clear to me. It is on the agenda for the next triage meeting I believe, that also seems like a good place to discuss.

@mhvk
Contributor

mhvk commented Mar 19, 2024

@eendebakpt out of curiosity, why optimize for np.power(x, 2)? one could as well just use x * x,

Numpy already optimizes x ** 2 to call np.square - I think in any higher level language one wants to try to avoid people having to think about performance too much, but rather focus on code correctness.
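This existing `__pow__` special-casing is easy to check directly: on a NumPy with this dispatch, x ** 2 matches np.square(x) bit-for-bit even when the generic pow loop would not:

```python
import numpy as np

x = np.linspace(0.0, 10.0, 1001)
# ndarray.__pow__ special-cases small integer exponents, so x ** 2 is
# computed like np.square(x) rather than through the generic pow loop.
assert np.array_equal(x ** 2, np.square(x))
```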

@ngoldbaum
Member

We talked about this at the triage meeting and there wasn't a ton of appetite for adding special handling for anything besides 0.5 and 2, mostly because of concerns about accuracy.

I did some GitHub code searches for uses with common variable names like x. np.power(..., 2) definitely comes up a lot (3200 results with x), np.power(..., 0.5) less so (a few hundred results), so if there's any marginal cost to supporting square roots, I would drop that. np.power(..., 3) seems to be relatively popular (1000 results), so if it can be supported without precision loss that seems worth doing as well, but nothing more.

@ngoldbaum ngoldbaum added triaged Issue/PR that was discussed in a triage meeting and removed triage review Issue/PR to be discussed at the next triage meeting labels Mar 20, 2024
@eendebakpt
Contributor Author

@ngoldbaum Thanks for the feedback! I implemented e=2 and e=.5.

I tried some experiments to determine the accuracy. Not sure they are representative, but here are some results.

For e=3 the precision loss seems small (the cases found so far are always within one ULP), but the results can be less precise. So unless an expert can confirm that adding e=3 is ok, I will leave it out of this PR. One example for type float32 (the blue data points are a bit further from the exact values than the red ones):
image

For e=0.5, all the differences between sqrt and np.power seem to be equally far off from the exact value (and all within one ULP). An example:
image

numpy/_core/src/highway (outdated, resolved)
Member

@r-devulap r-devulap left a comment

Minor changes but mostly LGTM. Do you know if there are unit tests that cover these special cases for np.power? If not, it might be worth adding them to test these new code paths.

#endif
#if @ispower@
if (steps[1]==0) {
BINARY_DEFS_ZERO_STRIDE
Member

We can update this to use SIMD for square and sqrt in another patch.

numpy/_core/src/umath/loops_umath_fp.dispatch.c.src (outdated, resolved)
@r-devulap
Member

Added benchmarks in #26212. Results for this PR on Intel SKX:

| Change   | Before [45173782] <main>   | After [b5995034] <power>   |   Ratio | Benchmark (Parameter)                                          |
|----------|----------------------------|----------------------------|---------|----------------------------------------------------------------|
| +        | 8.01±0.09ms                | 8.51±0.01ms                |    1.06 | bench_ufunc.BinaryBench.time_pow(<class 'numpy.float32'>)      |
| -        | 8.17±0.01ms                | 928±1μs                    |    0.11 | bench_ufunc.BinaryBench.time_pow_half(<class 'numpy.float32'>) |
| -        | 19.3±0.4ms                 | 1.90±0ms                   |    0.1  | bench_ufunc.BinaryBench.time_pow_half(<class 'numpy.float64'>) |
| -        | 8.23±0.02ms                | 625±2μs                    |    0.08 | bench_ufunc.BinaryBench.time_pow_2(<class 'numpy.float32'>)    |
| -        | 19.3±0.4ms                 | 857±20μs                   |    0.04 | bench_ufunc.BinaryBench.time_pow_2(<class 'numpy.float64'>)    |

@eendebakpt
Contributor Author

Minor changes but mostly LGTM. Do you know if there are unit tests that cover these special cases for np.power? If not, it might be worth adding them to test these new code paths.

To be explicit, I added a test for the two special cases.

@ngoldbaum
Member

Is the 6% performance drop in time_pow statistically significant? What about 64-bit?

@r-devulap
Member

Is the 6% performance drop in time_pow statistically significant? What about 64-bit?

We see this often: when the code placement changes, it sometimes affects performance a little. I am not too worried about 6%. I have tried multiple times to narrow down why that happens, but failed every time :/

for dt in [np.float32, np.float64]:
    expected = np.array([0.0, 1.21, 4., 1.44e+26, np.inf, np.inf])
    a = np.array([0, 1.1, 2, 12e12, np.inf, -np.inf], dt)
    assert_equal(np.power(a, 2.), expected.astype(dt))
Member

assert_equal is too strict a condition. For example, 1.21 cannot be represented in float32 and is actually stored as 1.21000003814697265625, which will differ from the output of 1.1*1.1. I would recommend using assert_array_max_ulp with maxulp = 1.
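The suggested relaxation would look roughly like this (a sketch of the test change, using a * a as the reference instead of a hard-coded expected array):

```python
import numpy as np
from numpy.testing import assert_array_max_ulp

for dt in [np.float32, np.float64]:
    a = np.array([1.1, 2.0, 3.5, 12e12], dtype=dt)
    # Compare within 1 ULP instead of exactly: 1.21 is not representable
    # in float32, so an exact expected array would be too strict.
    assert_array_max_ulp(np.power(a, 2.0), a * a, maxulp=1)
```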

@eendebakpt
Contributor Author

eendebakpt commented Apr 5, 2024

There is an issue with the fast path for sqrt. On current numpy:

math.pow(float('inf'), .5)=inf
math.pow(-float('inf'), .5)=inf
math.pow(-10, .5)= <domain error>
math.sqrt(float('inf'))=inf
math.sqrt(-float('inf'))= <domain error>
math.sqrt(-10)= <domain error>
np.power(float('inf'), .5)=inf
np.power(-float('inf'), .5)=inf
np.power(-10, .5)=nan (with warning)
np.sqrt(float('inf'))=inf
np.sqrt(-float('inf'))=nan (with warning)
np.sqrt(-10)=nan (with warning)

This all agrees with the documentation I checked on the C/C++ implementations of pow and sqrt (e.g. https://en.cppreference.com/w/c/numeric/math/pow).
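These semantics are easy to verify from Python (a quick check of the two cases that matter for the fast path):

```python
import numpy as np

with np.errstate(invalid='ignore'):
    # C99 pow(-inf, 0.5) is +inf, but sqrt(-inf) is a domain error (nan),
    # so a sqrt fast path would silently change np.power's result at -inf.
    assert np.isinf(np.power(-np.inf, 0.5))
    assert np.isnan(np.sqrt(np.float64(-np.inf)))
```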

The issue is that np.power(-float('inf'), .5) and np.sqrt(-float('inf')) have different values. So in order for the behaviour of np.power to remain unchanged, we need to replace

*(@type@ *)op1 = @sqrt@(in1);

with

if (in1 == -((@type@)INFINITY)) {
    // special case for minus infinity
    *(@type@ *)op1 = @pow@(in1, in2);
}
else {
    *(@type@ *)op1 = @sqrt@(in1);
}

That is not a big change and I am sure it will still be faster, but I am not sure it is worth the (potential) trouble.

I am leaning towards removing the special case for sqrt and only leaving x*x in.

Update: I removed the fast path for sqrt. If desired, we can put it back in.

@eendebakpt eendebakpt changed the title ENH: Optimize np.power(x, 2) for double and float type Draft: ENH: Optimize np.power(x, 2) for double and float type Apr 5, 2024
@eendebakpt eendebakpt changed the title Draft: ENH: Optimize np.power(x, 2) for double and float type ENH: Optimize np.power(x, 2) for double and float type May 11, 2024
Member

@ngoldbaum ngoldbaum left a comment

@r-devulap could you give this one final look?

@eendebakpt it occurs to me it might be nice to add a release note for all the performance improvements you've been doing, including this one. That gives you a little bit of credit and also spreads the knowledge that there are fast paths for common cases in some of these functions now. No need to include it as part of this PR to make it mergeable.

Member

@r-devulap r-devulap left a comment

LGTM. Thanks @eendebakpt

@r-devulap r-devulap merged commit 35c4f37 into numpy:main May 14, 2024
65 checks passed
@eendebakpt
Contributor Author

@r-devulap Thanks for merging! A couple of the linked issues can be closed now I think.

@ngoldbaum
Member

@eendebakpt cool! can you clean up the old issues? Happy to give you triage rights if you need them but I think anyone can close issues.

@seberg
Member

seberg commented May 16, 2024

but I think anyone can close issues.

I don't think so; an issue only closes if it is referenced with Closes in the commit message (then the person who merges closes it), or if it is your own.

@ngoldbaum
Member

Ah sorry, and it looks like I don't have the rights to give people triage rights like I thought...

@seberg
Member

seberg commented May 16, 2024

Send an invite, if you like @eendebakpt.

@eendebakpt
Contributor Author

Send an invite, if you like @eendebakpt.

Thanks for the invite, accepted! I am looking in the documentation for any guidelines on triage rights, but can't find anything specific for numpy.
