ENH: Optimize np.power(x, 2) for double and float type #26055
Conversation
All tests pass, but this PR does change behavior. A minimal example:
On main:
Given the dramatic performance improvement, I think accepting some small change in output is ok. Is it possible to hint to the compiler to generalize this a bit? Or maybe manually unroll up to e.g. some small maximum exponent? It would be good to see some benchmarks to capture the overhead of the extra branching for non-integer powers, if any. Added the triage review label to make sure this gets looked at during a triage meeting.
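The minimal example itself did not survive extraction. As an illustration of the kind of difference involved (my own sketch, not the PR author's example): the fast path computes `x*x`, which can differ from the `pow`-based result in the last ulp for some inputs:

```python
import numpy as np

# Illustrative sketch (not the PR's original example): the fast path
# computes x*x, which can differ from the libm pow(x, 2.0) result in
# the last ulp for some inputs.
x = np.linspace(0.1, 10.0, 1000)
via_mul = x * x                  # what the new fast path computes
via_pow = np.power(x, 2.0)       # previously always went through pow()

# any difference is at most on the order of one ulp
assert np.allclose(via_mul, via_pow, rtol=1e-14)
```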
This is very nice! Beyond the in-line suggestion, a more general question (perhaps to @seberg): might it be possible to do this inside the promotion machinery, i.e., for specific powers call different ufunc inner loops? (Of course, this brings back value-based casting, but raising things to powers is always one of the types of routines that does not fit nicely, and we have precedent with the comparisons...) An advantage of that approach would be that we could then remove the special-casing for `__pow__`, and presumably any vectorization of `np.square`, `np.sqrt`, etc., would be used automatically.
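To illustrate the existing special-casing referred to here (a sketch of my own, not from the PR): `ndarray.__pow__` already routes small integer exponents to cheaper ufuncs, so `x ** 2` agrees exactly with `np.square(x)`:

```python
import numpy as np

# x ** 2 goes through ndarray.__pow__, which special-cases the exponent 2
# and effectively calls np.square; both compute a correctly-rounded x*x,
# so the results agree exactly.
x = np.linspace(0.0, 5.0, 100)
assert np.array_equal(x ** 2, np.square(x))
assert np.array_equal(x ** 2, x * x)
```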
```c
#endif
#if @ispower@
        if (steps[1]==0) {
            BINARY_DEFS_ZERO_STRIDE
```
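In Python terms, the dispatch the quoted C loop performs might be sketched like this (names and structure are illustrative, not NumPy's actual API):

```python
import numpy as np

def power_with_fast_path(x, e):
    # Zero stride on the second operand means the exponent is a single
    # scalar value; if it equals 2, square instead of calling pow()
    # elementwise (hypothetical Python rendering of the C inner loop).
    e_arr = np.asarray(e)
    if e_arr.ndim == 0 and e_arr == 2.0:
        return x * x           # fast path
    return np.power(x, e)      # general pow() path

a = np.array([1.5, 2.0, 3.0])
assert np.array_equal(power_with_fast_path(a, 2.0), a * a)
assert np.array_equal(power_with_fast_path(a, 3.0), np.power(a, 3.0))
```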
Would it make sense to call the inner loop for `square`? Mostly if that helps with the CPU dispatch (I don't understand that well enough to know whether this is possible).
We can update this to use SIMD for square and sqrt in another patch.
Looks good. This is actually faster than using SIMD `pow(x, y)` by quite a bit, so I would suggest bypassing the SIMD loop as well for this case. We could handle CPU dispatching by leveraging the `square` function in a separate patch.
```diff
@@ -250,6 +251,25 @@ NPY_NO_EXPORT void NPY_CPU_DISPATCH_CURFX(@TYPE@_@func@)
         simd_@intrin@_@sfx@(src1, ssrc1, src2, ssrc2, dst, sdst, len);
         return;
     }
 #endif
 #if @ispower@
```
The logic of the nested if/else conditions is getting a little hard to read. I propose splitting the `atan2` and `pow` functions to make this more readable.
Your patch seems to do this for float as well. Or am I missing something?
It indeed handles both double and float.
I added some tests with smaller arrays to see whether the overhead impacts the non-integer case (results updated in the first comment). It seems the impact is small or none, but it might depend on the compiler/platform, so it would be good to benchmark on an independent system. Adding more special cases is possible, but we need to decide which cases are important, as each added case adds a bit more complexity (and overhead).
The impact is largest in some degenerate cases. Small arrays are dominated by other overheads anyway.
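A micro-benchmark along these lines (my own sketch, not the script used in the PR) could look like:

```python
import timeit
import numpy as np

# Illustrative micro-benchmark: compare a scalar exponent of 2.0 against
# a non-integer exponent to gauge the cost of the extra branch, for a
# small and a larger array.
for n in (10, 100_000):
    x = np.random.rand(n)
    t2 = timeit.timeit(lambda: np.power(x, 2.0), number=1000)
    t25 = timeit.timeit(lambda: np.power(x, 2.5), number=1000)
    print(f"n={n}: e=2.0 {t2:.4f}s  e=2.5 {t25:.4f}s")
```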
@eendebakpt out of curiosity, why optimize for this case?
The following script can be used to generate some cases where
In the cases generated with this script the values for
That is a fair point. There are several issues (links in the first comment) where users experience a slow `np.power` (also for other exponents), and this would be one way to provide a performance improvement. But whether numpy should special-case certain exponents (and if so, which ones) is not clear to me. It is on the agenda for the next triage meeting I believe; that also seems like a good place to discuss.
Numpy already optimizes `x**2` via the special-casing in `__pow__`.
We talked about this at the triage meeting and there wasn't a ton of appetite for adding special handling for anything besides 0.5 and 2, mostly because of concerns about accuracy. I did some github code searches for uses with common variable names like `x**2` and `x**0.5`.
@ngoldbaum Thanks for the feedback! I implemented the special cases for 0.5 and 2. I tried some experiments to determine the accuracy. Not sure they are representative, but here are some results.
Minor changes but mostly LGTM. Do you know if there are unit tests that cover these special cases for `np.power`? If not, it might be worth adding them to test these new code paths.
Added benchmarks in #26212. Results for this PR on Intel SKX:
Is the 6% performance drop in one of the benchmarks expected?
We see this often: when the code placement changes it sometimes affects perf a little bit, but I am not worried about 6% too much. I have tried multiple times to narrow down why that happens but failed every time :/
numpy/_core/tests/test_umath.py
```python
for dt in [np.float32, np.float64]:
    expected = np.array([0.0, 1.21, 4., 1.44e+26, np.inf, np.inf])
    a = np.array([0, 1.1, 2, 12e12, np.inf, -np.inf], dt)
    assert_equal(np.power(a, 2.), expected.astype(dt))
```
`assert_equal` is too strict a condition. For example, 1.21 cannot be represented in float32 and is actually stored as 1.21000003814697265625, which will differ from the output of `1.1*1.1`. I would recommend using `assert_array_max_ulp` with `maxulp = 1`.
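A sketch of that suggestion (the values below are my own, chosen so the squares are exactly representable in both dtypes):

```python
import numpy as np
from numpy.testing import assert_array_max_ulp

# Instead of assert_equal, allow the result to differ from the expected
# value by at most one unit in the last place.
for dt in [np.float32, np.float64]:
    a = np.array([1.5, 2.0, 3.0, 12.0], dt)
    expected = np.array([2.25, 4.0, 9.0, 144.0], dt)  # exactly representable
    assert_array_max_ulp(np.power(a, 2.0), expected, maxulp=1)
```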
There is an issue with the fast path for sqrt. On current numpy:
This all agrees with the documentation I checked on the C/C++ implementations of pow and sqrt (e.g. https://en.cppreference.com/w/c/numeric/math/pow). The issue is that `pow(x, 0.5)` and `sqrt(x)` are specified to return different results for some special arguments.

That is not a big change and I am sure it will still be faster, but I am not sure it is worth the (potential) trouble. I am leaning towards removing the special case for sqrt and only leaving x*x in. Update: I removed the fast path for sqrt. If desired, we can put it back in.
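For reference, a small check (my own sketch) of the special values where `pow(x, 0.5)` and `sqrt(x)` disagree per the C standard, which is why the sqrt fast path is problematic:

```python
import numpy as np

# Corner cases where pow(x, 0.5) and sqrt(x) disagree per C annex F:
with np.errstate(invalid='ignore'):
    assert np.power(-np.inf, 0.5) == np.inf     # pow(-inf, 0.5) is +inf
    assert np.isnan(np.sqrt(-np.inf))           # sqrt(-inf) is nan
    assert not np.signbit(np.power(-0.0, 0.5))  # pow(-0.0, 0.5) is +0.0
    assert np.signbit(np.sqrt(-0.0))            # sqrt(-0.0) is -0.0
```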
@r-devulap could you give this one final look?
@eendebakpt it occurs to me it might be nice to add a release note for all the performance improvements you've been doing, including this one. That gives you a little bit of credit and also spreads the knowledge that there are fast paths for common cases in some of these functions now. No need to include it as part of this PR to make it mergeable.
LGTM. Thanks @eendebakpt
@r-devulap Thanks for merging! A couple of the linked issues can be closed now, I think.
@eendebakpt cool! Can you clean up the old issues? Happy to give you triage rights if you need them, but I think anyone can close issues.
I don't think so, only if it is included in the commit message (e.g. with `Closes #...`).
Ah sorry, and it looks like I don't have the rights to give people triage rights like I thought...
Sent an invite, if you like, @eendebakpt.
Thanks for the invite, accepted! I am looking in the documentation for any guidelines on triage rights, but can't find anything specific for numpy. |
We optimize `np.power(x, e)` by adding a fast path for the case where `e=2` and `x` is double or float.

Updated benchmark (linux):
Main
PR
Benchmark script
Old benchmark (windows)
Results on main:
Results on this PR:
Notes:

`ffast-math` can make `pow` much faster for special cases such as `e=2`. See for example https://stackoverflow.com/a/2940800

Related: #26045, #9363, #6399, #22651, #7786