Optimize memory allocation for counts #159
Conversation
I can get the benchmarks in a better state so we can test this properly.
I guess it's possible that numba does this already: identifies the allocations in a gufunc and batches them into a single allocation. I unfortunately don't know numba well enough to say. So if we really see no effect, that's a possibility. Are you really testing with numba==0.41? Likely worth upgrading? I'm on 0.56...
No, sorry; I copied that comment from the README. I'm on 0.56 too.
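For readers following along, the idea under discussion (hoisting per-call allocations out of the hot path) can be sketched in plain NumPy. The function names are hypothetical; numbagg's real kernels are numba-compiled gufuncs, so this only illustrates the pattern, not the actual code:

```python
import numpy as np

def grouped_sum_per_call(values, labels, num_labels):
    # Allocates fresh output/scratch arrays on every call -- the
    # per-call overhead being discussed for tiny inputs.
    out = np.zeros(num_labels, dtype=values.dtype)
    counts = np.zeros(num_labels, dtype=np.int64)
    for v, g in zip(values, labels):
        out[g] += v
        counts[g] += 1
    return out, counts

def grouped_sum_preallocated(values, labels, out, counts):
    # The caller owns (and can reuse) the buffers, so repeated calls
    # amortize a single allocation instead of allocating each time.
    out[:] = 0
    counts[:] = 0
    for v, g in zip(values, labels):
        out[g] += v
        counts[g] += 1
    return out, counts
```

Both return identical results; the second just moves the allocation to the caller.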
So the surprising result is that it makes operations on small arrays (which don't parallelize) much faster, and has a small effect on functions on other arrays. Here's the output of
and here's the output for this branch:
The only one that makes a large change is
I'm not sure how that works. Possibly allocating an array in the numbagg function has some small overhead, so we get a speed-up with tiny functions. And then we get a marginal increase with
If that's the case, I think we should merge. It does add a small amount of complication to the code, but the 7% increase in performance of the large functions is sufficient on its own, I'd think...
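The "allocation overhead dominates tiny inputs" theory above can be sanity-checked with a rough micro-benchmark sketch. This is not numbagg code, and absolute numbers will vary by machine; it only compares the cost of merely allocating a tiny array against a whole reduction over an equally tiny existing one:

```python
import timeit

import numpy as np

n = 100_000
x = np.zeros(16)

# Cost of just allocating a tiny array vs. summing an existing one of
# the same size. If the two are comparable, per-call allocation is a
# meaningful fraction of the work for small inputs.
t_alloc = timeit.timeit(lambda: np.empty(16), number=n)
t_sum = timeit.timeit(lambda: x.sum(), number=n)
print(f"alloc 16 floats: {t_alloc:.4f}s, sum 16 floats: {t_sum:.4f}s")
```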
* De-layer the benchmarking code
  - Move more into the benchmark function, away from the script
  - Defer setting parameters to `pytest-benchmark`, don't use `pedantic` mode
  - Allow passing args to the script, rather than editing the files

  From testing numbagg#159
```python
dtypes = [
    (int32, int32, int32),
    (int32, int64, int32),
    (int64, int32, int64),
    (int64, int64, int64),
    (float32, int32, float32),
    (float32, int64, float32),
    (float64, int32, float64),
    (float64, int64, float64),
]
dtypes_counts = [
    (int32, int32, int64, int32),
    (int32, int64, int64, int32),
    (int64, int32, int64, int64),
    (int64, int64, int64, int64),
    (float32, int32, int64, float32),
    (float32, int64, int64, float32),
    (float64, int32, int64, float64),
    (float64, int64, int64, float64),
]
```
We've since changed how we specify the signature to include `int8`, so I'll take a pass at updating this. In particular, we should update the `int8` tests to include a function that uses the counts signature.
(When I apply this patch, the tests fail:

```diff
diff --git a/numbagg/test/test_grouped.py b/numbagg/test/test_grouped.py
index e399ab5..595e0a6 100644
--- a/numbagg/test/test_grouped.py
+++ b/numbagg/test/test_grouped.py
@@ -373,9 +373,9 @@ def test_int8_again(dtype):
     expected = pd.DataFrame(array.T).groupby(by).sum().T.astype(dtype)
     # https://github.com/numbagg/numbagg/issues/213
-    assert_almost_equal(group_nansum(array, by, axis=-1), expected)
+    assert_almost_equal(group_nanmean(array, by, axis=-1), expected)
     # Amazingly it can also be more incorrect with another run!
-    assert_almost_equal(group_nansum(array, by, axis=-1), expected)
+    assert_almost_equal(group_nanmean(array, by, axis=-1), expected)

 def test_dimensionality():
```
)
I pushed a commit to try initializing counts to 0 inside the inner loop. Does that make a difference?
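My reading of that commit, sketched as a guess in plain Python rather than the actual numba kernel: swap the `np.zeros` allocation for `np.empty` plus explicit zeroing in a loop, which a compiler can sometimes fuse with the accumulation pass. Both hypothetical versions below compute the same counts:

```python
import numpy as np

def counts_via_zeros(labels, num_labels):
    # One allocate-and-fill up front.
    counts = np.zeros(num_labels, dtype=np.int64)
    for g in labels:
        counts[g] += 1
    return counts

def counts_zeroed_in_loop(labels, num_labels):
    # Uninitialized allocation; slots are zeroed explicitly inside a
    # loop instead, trading a memset for per-element initialization.
    counts = np.empty(num_labels, dtype=np.int64)
    for i in range(num_labels):
        counts[i] = 0
    for g in labels:
        counts[g] += 1
    return counts
```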
Preparing for numbagg#159
Unfortunately it doesn't budge at all. My guess is that writing from lots of threads to a single small-ish array might be causing the CPU caches on each core to fight over the array, such that it's no faster than writing from a single CPU (but I am a bit over my skis here). FYI you can run the same thing just with
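The cache-contention guess above points at the usual workaround: give each thread its own accumulator and reduce at the end. A sketch of that layout, with a hypothetical function name; pure-Python threads won't show the speedup because of the GIL, so this only illustrates the structure a numba-parallel kernel would want:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def grouped_counts_threaded(labels, num_labels, num_threads=4):
    # Each thread writes only its own row, so threads don't fight over
    # the same elements while accumulating; rows are summed at the end.
    # (A real kernel would also pad rows to a cache line so adjacent
    # rows don't share one.)
    chunks = np.array_split(labels, num_threads)
    per_thread = np.zeros((num_threads, num_labels), dtype=np.int64)

    def worker(i):
        row = per_thread[i]
        for g in chunks[i]:
            row[g] += 1

    with ThreadPoolExecutor(max_workers=num_threads) as ex:
        list(ex.map(worker, range(num_threads)))
    return per_thread.sum(axis=0)
```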
Closes #132
Just tried `nanmean`, but I'm not seeing any difference (!) for