Optimize memory allocation for counts #159
Conversation
I can get the benchmarks in a better state so we can test this properly.
I guess it's possible that numba does this already: identifies the allocations in a gufunc and batches them into a single allocation. I unfortunately don't know numba well enough to say. So if we really see no effect, that's a possibility. Are you really testing with numba==0.41? Likely worth upgrading? I'm on 0.56...
No, sorry; I copied that comment from the README. I'm on 0.56 too.
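For readers following along, the idea under discussion (hoisting per-call allocations out of the hot path) can be sketched in plain NumPy. The function names are hypothetical; numbagg's real kernels are numba-compiled gufuncs, so this only illustrates the pattern, not the actual code:

```python
import numpy as np

def grouped_sum_per_call(values, labels, num_labels):
    # Allocates fresh output/scratch arrays on every call -- the
    # per-call overhead being discussed for tiny inputs.
    out = np.zeros(num_labels, dtype=values.dtype)
    counts = np.zeros(num_labels, dtype=np.int64)
    for v, g in zip(values, labels):
        out[g] += v
        counts[g] += 1
    return out, counts

def grouped_sum_preallocated(values, labels, out, counts):
    # The caller owns (and can reuse) the buffers, so repeated calls
    # amortize a single allocation instead of allocating each time.
    out[:] = 0
    counts[:] = 0
    for v, g in zip(values, labels):
        out[g] += v
        counts[g] += 1
    return out, counts
```

Both return identical results; the second just moves the allocation to the caller.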
So the surprising result is that it makes operations on small arrays (which don't parallelize) much faster, and has a small effect on functions on other arrays. Here's the output of
and here's the output for this branch:
The only one that makes a large change is
I'm not sure how that works. Possibly allocating an array in the numbagg function has some small overhead, so we get a speed-up with tiny functions. And then we get a marginal increase with
If that's the case, I think we should merge. It does add a small amount of complication to the code, but the 7% increase in performance of the large functions is sufficient on its own, I'd think...
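The "allocation overhead dominates tiny inputs" theory above can be sanity-checked with a rough micro-benchmark sketch. This is not numbagg code, and absolute numbers will vary by machine; it only compares the cost of merely allocating a tiny array against a whole reduction over an equally tiny existing one:

```python
import timeit

import numpy as np

n = 100_000
x = np.zeros(16)

# Cost of just allocating a tiny array vs. summing an existing one of
# the same size. If the two are comparable, per-call allocation is a
# meaningful fraction of the work for small inputs.
t_alloc = timeit.timeit(lambda: np.empty(16), number=n)
t_sum = timeit.timeit(lambda: x.sum(), number=n)
print(f"alloc 16 floats: {t_alloc:.4f}s, sum 16 floats: {t_sum:.4f}s")
```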
* De-layer the benchmarking code
  - Move more into the benchmark function, away from the script
  - Defer setting parameters to `pytest-benchmark`, don't use `pedantic` mode
  - Allow passing args to the script, rather than editing the files

  From testing numbagg#159
```python
dtypes = [
    (int32, int32, int32),
    (int32, int64, int32),
    (int64, int32, int64),
    (int64, int64, int64),
    (float32, int32, float32),
    (float32, int64, float32),
    (float64, int32, float64),
    (float64, int64, float64),
]
dtypes_counts = [
    (int32, int32, int64, int32),
    (int32, int64, int64, int32),
    (int64, int32, int64, int64),
    (int64, int64, int64, int64),
    (float32, int32, int64, float32),
    (float32, int64, int64, float32),
    (float64, int32, int64, float64),
    (float64, int64, int64, float64),
]
```
We've since changed how we specify the signature to include `int8`, so I'll take a pass at updating this. In particular, we should update the `int8` tests to include a function that uses the counts signature.
(When I apply this patch, the tests fail:

```diff
diff --git a/numbagg/test/test_grouped.py b/numbagg/test/test_grouped.py
index e399ab5..595e0a6 100644
--- a/numbagg/test/test_grouped.py
+++ b/numbagg/test/test_grouped.py
@@ -373,9 +373,9 @@ def test_int8_again(dtype):
     expected = pd.DataFrame(array.T).groupby(by).sum().T.astype(dtype)
     # https://github.com/numbagg/numbagg/issues/213
-    assert_almost_equal(group_nansum(array, by, axis=-1), expected)
+    assert_almost_equal(group_nanmean(array, by, axis=-1), expected)
     # Amazingly it can also be more incorrect with another run!
-    assert_almost_equal(group_nansum(array, by, axis=-1), expected)
+    assert_almost_equal(group_nanmean(array, by, axis=-1), expected)

 def test_dimensionality():
```
)
I pushed a commit to try initializing counts to 0 inside the inner loop. Does that make a difference?
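My reading of that commit, sketched as a guess in plain Python rather than the actual numba kernel: swap the `np.zeros` allocation for `np.empty` plus explicit zeroing in a loop, which a compiler can sometimes fuse with the accumulation pass. Both hypothetical versions below compute the same counts:

```python
import numpy as np

def counts_via_zeros(labels, num_labels):
    # One allocate-and-fill up front.
    counts = np.zeros(num_labels, dtype=np.int64)
    for g in labels:
        counts[g] += 1
    return counts

def counts_zeroed_in_loop(labels, num_labels):
    # Uninitialized allocation; slots are zeroed explicitly inside a
    # loop instead, trading a memset for per-element initialization.
    counts = np.empty(num_labels, dtype=np.int64)
    for i in range(num_labels):
        counts[i] = 0
    for g in labels:
        counts[g] += 1
    return counts
```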
Preparing for numbagg#159
Unfortunately it doesn't budge at all. My guess is that writing from lots of threads to a single small-ish array might be causing the CPU caches on each core to fight over the array, such that it's no faster than writing from a single CPU (but I am a bit over my skis here). FYI you can run the same thing just with
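The cache-contention guess above points at the usual workaround: give each thread its own accumulator and reduce at the end. A sketch of that layout, with a hypothetical function name; pure-Python threads won't show the speedup because of the GIL, so this only illustrates the structure a numba-parallel kernel would want:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def grouped_counts_threaded(labels, num_labels, num_threads=4):
    # Each thread writes only its own row, so threads don't fight over
    # the same elements while accumulating; rows are summed at the end.
    # (A real kernel would also pad rows to a cache line so adjacent
    # rows don't share one.)
    chunks = np.array_split(labels, num_threads)
    per_thread = np.zeros((num_threads, num_labels), dtype=np.int64)

    def worker(i):
        row = per_thread[i]
        for g in chunks[i]:
            row[g] += 1

    with ThreadPoolExecutor(max_workers=num_threads) as ex:
        list(ex.map(worker, range(num_threads)))
    return per_thread.sum(axis=0)
```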
Closes #132
Just tried `nanmean`, but I'm not seeing any difference (!) for