
Optimize memory allocation for counts #159

Draft: wants to merge 7 commits into main

Conversation

@dcherian (Contributor) commented Oct 15, 2023

Closes #132

Just tried nanmean but I'm not seeing any difference (!) for

import numbagg.grouped
import numpy as np
import numba

x = np.random.RandomState(42).randn(1000, 1000)
x[x < -1] = np.nan
labels = np.repeat(np.arange(0,5), 1000//5)

# timings with numba=0.41.0 and bottleneck=1.2.1
%timeit numbagg.grouped.group_nanmean(x, labels, axis=-1)

@max-sixty (Collaborator):

I can get the benchmarks in a better state so we can test this properly.

asv should mostly work already, but I'm not sure it'll pick this up sufficiently; the effect will only show when we're running lots of these in aggregate (i.e. your example).

I guess it's possible that numba already does this, i.e. identifies the allocations in a gufunc and batches them into a single allocation. Unfortunately I don't know numba well enough to say. So if we really see no effect, that's a possibility.
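
For concreteness, here's a rough sketch (not numbagg's actual kernel; the function name, the `dummy` argument, and the signature are illustrative) of a grouped-nanmean gufunc where the `counts` scratch array is allocated inside the kernel, i.e. once per broadcast slice. That per-slice `np.zeros` is the allocation in question:

```python
import numpy as np
from numba import guvectorize


@guvectorize(
    ["void(float64[:], int64[:], float64[:], float64[:])"],
    "(n),(n),(m)->(m)",
)
def group_nanmean_sketch(values, labels, dummy, out):
    # Scratch array allocated on every slice the gufunc is broadcast over;
    # the question above is whether numba batches these allocations itself.
    counts = np.zeros(out.shape[0], dtype=np.int64)
    out[:] = 0.0
    for i in range(values.shape[0]):
        lab = labels[i]
        if lab >= 0 and not np.isnan(values[i]):
            out[lab] += values[i]
            counts[lab] += 1
    for j in range(out.shape[0]):
        out[j] = out[j] / counts[j] if counts[j] > 0 else np.nan


# `dummy` only exists to give the output its length, e.g.:
# group_nanmean_sketch(x, labels, np.empty(num_labels))
```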

Are you really testing with numba==0.41? Likely worth upgrading? I'm on 0.56...

@dcherian (Contributor, Author):

> Are you really testing with numba==0.41? Likely worth upgrading? I'm on 0.56...

No sorry. I copied that comment from the README. I'm on 0.56 too

@max-sixty (Collaborator) commented Dec 18, 2023

So the surprising result is that it makes operations on small arrays (which don't parallelize) much faster, and has only a small effect on the other array shapes.

Here's the output of `python numbagg/test/run_benchmarks.py -k 'test_benchmark_main and group_nanmean and numbagg'` on `main`:


------------------------------------ benchmark 'numbagg.group_nanmean|(10, 10, 10, 10, 1000)': 1 tests -------------------------------------
Name (time in ms)                                        Min     Max    Mean  StdDev  Median     IQR  Outliers       OPS  Rounds  Iterations
--------------------------------------------------------------------------------------------------------------------------------------------
test_benchmark_main[group_nanmean-shape4-numbagg]     3.0955  3.5101  3.3455  0.1449  3.3963  0.2555       4;0  298.9092      16         100
--------------------------------------------------------------------------------------------------------------------------------------------

----------------------------------------- benchmark 'numbagg.group_nanmean|(100, 100000)': 1 tests -----------------------------------------
Name (time in ms)                                        Min     Max    Mean  StdDev  Median     IQR  Outliers       OPS  Rounds  Iterations
--------------------------------------------------------------------------------------------------------------------------------------------
test_benchmark_main[group_nanmean-shape2-numbagg]     3.4369  4.1615  3.6194  0.1719  3.6093  0.1574       2;1  276.2912      15         100
--------------------------------------------------------------------------------------------------------------------------------------------

-------------------------------------------------- benchmark 'numbagg.group_nanmean|(1000,)': 1 tests -------------------------------------------------
Name (time in us)                                         Min       Max     Mean   StdDev   Median      IQR  Outliers  OPS (Kops/s)  Rounds  Iterations
-------------------------------------------------------------------------------------------------------------------------------------------------------
test_benchmark_main[group_nanmean-shape0-numbagg]     72.6448  127.7752  90.4516  15.1171  86.3505  13.0321      11;6       11.0556      49        1191
-------------------------------------------------------------------------------------------------------------------------------------------------------

-------------------------------------------- benchmark 'numbagg.group_nanmean|(10000000,)': 1 tests -------------------------------------------
Name (time in ms)                                         Min      Max     Mean  StdDev   Median     IQR  Outliers      OPS  Rounds  Iterations
-----------------------------------------------------------------------------------------------------------------------------------------------
test_benchmark_main[group_nanmean-shape1-numbagg]     58.7501  61.9673  59.9400  0.9354  59.5886  1.1434      14;0  16.6833      42           2
-----------------------------------------------------------------------------------------------------------------------------------------------

and here's the output for this branch:

------------------------------------ benchmark 'numbagg.group_nanmean|(10, 10, 10, 10, 1000)': 1 tests -------------------------------------
Name (time in ms)                                        Min     Max    Mean  StdDev  Median     IQR  Outliers       OPS  Rounds  Iterations
--------------------------------------------------------------------------------------------------------------------------------------------
test_benchmark_main[group_nanmean-shape4-numbagg]     2.9858  3.2933  3.1193  0.0806  3.1139  0.1145       4;0  320.5875      16         100
--------------------------------------------------------------------------------------------------------------------------------------------

----------------------------------------- benchmark 'numbagg.group_nanmean|(100, 100000)': 1 tests -----------------------------------------
Name (time in ms)                                        Min     Max    Mean  StdDev  Median     IQR  Outliers       OPS  Rounds  Iterations
--------------------------------------------------------------------------------------------------------------------------------------------
test_benchmark_main[group_nanmean-shape2-numbagg]     3.4574  3.9332  3.6569  0.1318  3.6611  0.1551       6;0  273.4565      14         100
--------------------------------------------------------------------------------------------------------------------------------------------

------------------------------------------------ benchmark 'numbagg.group_nanmean|(1000,)': 1 tests ------------------------------------------------
Name (time in us)                                         Min      Max     Mean  StdDev   Median     IQR  Outliers  OPS (Kops/s)  Rounds  Iterations
----------------------------------------------------------------------------------------------------------------------------------------------------
test_benchmark_main[group_nanmean-shape0-numbagg]     64.1670  86.4269  68.8892  5.2843  66.2560  6.3888      10;1       14.5161      50        1562
----------------------------------------------------------------------------------------------------------------------------------------------------

-------------------------------------------- benchmark 'numbagg.group_nanmean|(10000000,)': 1 tests -------------------------------------------
Name (time in ms)                                         Min      Max     Mean  StdDev   Median     IQR  Outliers      OPS  Rounds  Iterations
-----------------------------------------------------------------------------------------------------------------------------------------------
test_benchmark_main[group_nanmean-shape1-numbagg]     57.7563  66.6283  59.9324  1.8338  59.5346  2.2602      11;1  16.6855      43           2
-----------------------------------------------------------------------------------------------------------------------------------------------

The only one that shows a large change is numbagg.group_nanmean|(1000,) (mean of 68µs vs 90µs), though numbagg.group_nanmean|(10, 10, 10, 10, 1000) does also get a bit faster (mean of 3.11ms vs 3.34ms).

I'm not sure exactly how that works. Possibly allocating an array inside the numbagg function has some small fixed overhead, so we get a speed-up on tiny inputs (numbagg.group_nanmean|(1000,)) but not on large ones (numbagg.group_nanmean|(10000000,)).

And then we get a marginal improvement with numbagg.group_nanmean|(10, 10, 10, 10, 1000), the case this change originally targeted, because we can allocate once rather than on every slice.

If that's the case, I think we should merge. It does add a small amount of complication to the code, but the ~7% improvement on the many-slice case is sufficient on its own, I'd think...
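
To make "allocate once rather than on every slice" concrete, here's a minimal sketch of the idea (an illustration only, not the actual code on this branch; the function name, signature, and the counts-as-argument layout are assumptions): the caller allocates `counts` a single time, and the kernel only resets it per slice.

```python
import numpy as np
from numba import guvectorize


@guvectorize(
    ["void(float64[:], int64[:], int64[:], float64[:])"],
    "(n),(n),(m)->(m)",
)
def group_nanmean_counts_sketch(values, labels, counts, out):
    # `counts` arrives through the gufunc signature, so the single buffer the
    # caller allocated is reused for every slice; only a reset happens here.
    counts[:] = 0
    out[:] = 0.0
    for i in range(values.shape[0]):
        lab = labels[i]
        if lab >= 0 and not np.isnan(values[i]):
            out[lab] += values[i]
            counts[lab] += 1
    for j in range(out.shape[0]):
        out[j] = out[j] / counts[j] if counts[j] > 0 else np.nan


# e.g. group_nanmean_counts_sketch(x, labels, np.zeros(num_labels, dtype=np.int64))
```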

max-sixty added a commit to max-sixty/numbagg that referenced this pull request Dec 18, 2023
- Move more into the benchmark function, away from the script
- Defer setting parameters to `pytest-benchmark`, don't use `pedantic` mode
- Allow passing args to the script, rather than editing the files

From testing numbagg#159
max-sixty added a commit that referenced this pull request Dec 18, 2023
* De-layer the benchmarking code

- Move more into the benchmark function, away from the script
- Defer setting parameters to `pytest-benchmark`, don't use `pedantic` mode
- Allow passing args to the script, rather than editing the files

From testing #159
Comment on lines +6 to +25
dtypes = [
(int32, int32, int32),
(int32, int64, int32),
(int64, int32, int64),
(int64, int64, int64),
(float32, int32, float32),
(float32, int64, float32),
(float64, int32, float64),
(float64, int64, float64),
]
dtypes_counts = [
(int32, int32, int64, int32),
(int32, int64, int64, int32),
(int64, int32, int64, int64),
(int64, int64, int64, int64),
(float32, int32, int64, float32),
(float32, int64, int64, float32),
(float64, int32, int64, float64),
(float64, int64, int64, float64),
]
@max-sixty (Collaborator) commented Dec 18, 2023


We've since changed how we specify the signature to include int8, so I'll take a pass at updating this. In particular, we should update the int8 tests to include a function that uses the counts signature.

(When I apply this patch, the tests fail:

diff --git a/numbagg/test/test_grouped.py b/numbagg/test/test_grouped.py
index e399ab5..595e0a6 100644
--- a/numbagg/test/test_grouped.py
+++ b/numbagg/test/test_grouped.py
@@ -373,9 +373,9 @@ def test_int8_again(dtype):
     expected = pd.DataFrame(array.T).groupby(by).sum().T.astype(dtype)
 
     # https://github.com/numbagg/numbagg/issues/213
-    assert_almost_equal(group_nansum(array, by, axis=-1), expected)
+    assert_almost_equal(group_nanmean(array, by, axis=-1), expected)
     # Amazingly it can also be more incorrect with another run!
-    assert_almost_equal(group_nansum(array, by, axis=-1), expected)
+    assert_almost_equal(group_nanmean(array, by, axis=-1), expected)
 
 
 def test_dimensionality():

)
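
As a rough illustration of the kind of test suggested above, something along these lines would exercise a counts-signature function (group_nanmean) on int8 input; the setup, dtypes, and assertions here are assumptions, not the real test_grouped.py code:

```python
import numpy as np
import pandas as pd
from numpy.testing import assert_almost_equal

from numbagg.grouped import group_nanmean


def test_int8_counts_signature():
    rng = np.random.default_rng(0)
    array = rng.integers(-5, 5, size=(3, 100)).astype(np.int8)
    by = rng.integers(0, 4, size=100)

    expected = pd.DataFrame(array.T).groupby(by).mean().T

    assert_almost_equal(group_nanmean(array, by, axis=-1), expected)
    # run twice, mirroring the existing test's check that a second run
    # doesn't give a different (more incorrect) answer
    assert_almost_equal(group_nanmean(array, by, axis=-1), expected)
```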

@dcherian (Contributor, Author):

I pushed a commit to try initializing counts to 0 inside the inner loop. Does that make a difference?

max-sixty added a commit to max-sixty/numbagg that referenced this pull request Dec 18, 2023
max-sixty added a commit that referenced this pull request Dec 18, 2023
@max-sixty (Collaborator):

> I pushed a commit to try initializing counts to 0 inside the inner loop. Does that make a difference?

Unfortunately it doesn't budge at all.

My guess is that writing from lots of threads to a single small-ish array causes the per-core CPU caches to fight over it, such that it's no faster than writing from a single core (but I'm a bit over my skis here).

FYI you can run the same thing just with `pytest --benchmark-enable --run-nightly -k 'test_benchmark_main and group_nanmean and numbagg'`
