
Unreasonable default fill_values #32

Open
max-sixty opened this issue Dec 30, 2020 · 7 comments

@max-sixty

Firstly thanks for the impressive package. We're considering using it in https://github.com/pydata/xarray to provide faster groupby operations.

It looks like some forms of stacking / unstacking are supported too, if I'm reading "Form 4" in the readme correctly. Is it currently possible to supply a subset of the indices as part of that?

In [7]: import numpy_groupies as npg
In [1]: import numpy as np
In [30]: from numpy_groupies.aggregate_numpy import aggregate


In [26]: flat = np.arange(12).astype(float)
    ...: data = values = flat.reshape(3, -1)

In [4]: import itertools

In [5]: group_idx = np.array(list(itertools.product(*[range(x) for x in values.shape]))).T
   ...: group_idx
Out[5]:
array([[0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2],
       [0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3]])

This works well:

In [32]: aggregate(group_idx, flat, "array", size=(3, 4))
Out[32]:
array([[ 0.,  1.,  2.,  3.],
       [ 4.,  5.,  6.,  7.],
       [ 8.,  9., 10., 11.]])

But this doesn't:

In [33]: aggregate(group_idx[:, :-1], flat[:-1].astype(float), "array", size=(3, 4))
/usr/local/lib/python3.8/site-packages/numpy/core/_asarray.py:136: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
  return array(a, dtype, copy=False, order=order, subok=True)
Out[33]:
array([[array([0.]), array([1.]), array([2.]), array([3.])],
       [array([4.]), array([5.]), array([6.]), array([7.])],
       [array([8.]), array([9.]), array([10.]), 0]], dtype=object)

Notably, supplying "sum" does compute, though the result has 0 rather than NaN:

In [35]: aggregate(group_idx[:, :-1], flat[:-1].astype(float), "sum", size=(3, 4))
Out[35]:
array([[ 0.,  1.,  2.,  3.],
       [ 4.,  5.,  6.,  7.],
       [ 8.,  9., 10.,  0.]])

Ideally the "array" case above would return:

array([[ 0.,  1.,  2.,  3.],
       [ 4.,  5.,  6.,  7.],
       [ 8.,  9., 10., nan]])

Of course, if this library — as per the name — is more focused on groupby than stacking, totally reasonable to close this as wontfix.

Thanks!

@ml31415
Owner

ml31415 commented Jan 29, 2021

Thanks for your kind words about our little project. Please be warned that there is not much maintenance going on for this package. If you need any bigger changes or urgent fixes to numpy_groupies for xarray, you'd be largely on your own to implement them.

The array operation that you mention above is mainly a front-end for np.split. Probably we messed something up there. Creating a ton of array objects is quite a time waster, so we rarely use it ourselves. Maybe you want to have a look at it; patches welcome.

def _array(group_idx, a, size, fill_value, dtype=None):
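For illustration, the split-based mechanism described above can be sketched in plain numpy (an illustration only, not the actual `_array` code, which differs in details):

```python
import numpy as np

# Group values by index and split into per-group arrays, roughly
# mimicking what the "array" operation returns for a 1D case.
group_idx = np.array([0, 0, 1, 2, 2, 2])
a = np.arange(6.0)

order = np.argsort(group_idx, kind="stable")            # sort values by group
splits = np.flatnonzero(np.diff(group_idx[order])) + 1  # group boundaries
groups = np.split(a[order], splits)
print(groups)  # [array([0., 1.]), array([2.]), array([3., 4., 5.])]
```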

About the sum: you can provide nan as fill_value:

aggregate(group_idx[:, :-1], flat[:-1].astype(float), "sum", size=(3, 4), fill_value=np.nan)
>>> array([[ 0.,  1.,  2.,  3.],
           [ 4.,  5.,  6.,  7.],
           [ 8.,  9., 10., nan]])

@ml31415
Owner

ml31415 commented Jan 30, 2021

Btw., fill_value=[] fixes the first issue. Looks like we need to improve the default selection.

@ml31415 changed the title from "Sparse input possible in "Form 4"?" to "Unreasonable default fill_values" on Jan 30, 2021
@max-sixty
Author

Thanks @ml31415

For background, here's the xarray issue discussing using npg: pydata/xarray#4473

FYI re the stacking: I implemented that using normal numpy, which worked out OK (pydata/xarray#4746). My sense is that the result isn't too different from using this library, but let me know if I'm missing something.

I'll have a go at finishing up pydata/xarray#4540 at some point. The main issues there are around our historical groupby implementation (which is mostly based around a Python loop) rather than any issues with npg.

Ack re not much maintenance going on. To the extent you're accepting PRs, I think we'd still be interested in having this as at least an optional path for xarray; it would make our code significantly faster and simpler. I'd be interested to hear whether your sense is that the current features are robust, or whether they'd require more work.

Thank you — Max

@ml31415
Owner

ml31415 commented Feb 1, 2021

Of course PRs are welcome any time. I'd also be happy to add you to the committers, if you want to take over some long term responsibility.

About the robustness of the current features, please have a look at whether the unit tests are sufficient for your use case. The major part of the test suite compares the results of the optimized implementations against a generic implementation using only numpy functions. I have used the library in production for years, and errors there would have been quite costly to me. But as we just saw with the examples you came up with, that only holds true for the more frequently used features of this library, and the tests might need some extensions for your use cases.

About stacking, I'm not sure I got your question right. Internally npg loops naively over two 1D arrays. Everything else is handled with plain numpy functions to prepare everything in 1D shape before the action starts and to restore the original shape afterwards. So there are probably no speed gains hiding; see:

def input_validation(group_idx, a, size=None, order='C', axis=None,

The code itself is a bit "grown", and some parts might need refactoring. The main reason the weave part is still there is to serve as a speed reference for the numba implementation, so it hasn't seen new group functions for a while.

@ml31415
Owner

ml31415 commented Jul 14, 2022

@dcherian @d1manson What do you think: Shall we change the default fill_value from 0 for all operations to something operation-specific?

array, sort -> []
all, any, anynan, allnan -> False (unchanged)
argmin, argmax -> -maxint
mean, var, std -> np.nan
min, max, cummax, cummin -> np.nan
last, first -> np.nan
len, sum, cumsum, sum_of_squares -> 0 (unchanged)
prod, cumprod -> 1
generic -> np.nan

It would change the current API a bit, but might be worth it for "doing the right thing" automatically. Any better suggestions for the values?
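The proposed mapping could be expressed as a simple lookup with a NaN fallback; a hypothetical sketch (the name `default_fill_value` and the structure are assumptions, not numpy_groupies' actual API):

```python
import sys
import numpy as np

# Operation-specific fill_value defaults, following the mapping
# proposed above. Hypothetical sketch only.
_DEFAULT_FILL_VALUES = {
    "array": [], "sort": [],
    "all": False, "any": False, "anynan": False, "allnan": False,
    "argmin": -sys.maxsize, "argmax": -sys.maxsize,
    "len": 0, "sum": 0, "cumsum": 0, "sum_of_squares": 0,
    "prod": 1, "cumprod": 1,
}

def default_fill_value(func_name):
    # Everything not listed (mean, var, std, min, max, cummin,
    # cummax, first, last, generic) falls back to NaN.
    return _DEFAULT_FILL_VALUES.get(func_name, np.nan)
```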

@dcherian
Collaborator

dcherian commented Jul 19, 2022

Any better suggestions for the values?

With flox I think I set it to numpy_method([np.nan, np.nan]), so that's one option.

numpy_groupies treats groups with all-NaN members and groups with 0 members the same (since NaN values are removed before reducing), so this could be an OK choice.

EDIT: the confusing thing will be np.sum([np.nan, np.nan]) = np.nan vs. np.nansum([np.nan, np.nan]) = 0. Which one to pick?
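For the record, that difference is easy to demonstrate with plain numpy:

```python
import numpy as np

# np.sum propagates NaN, while np.nansum treats NaN as 0, so an
# all-NaN group reduces to nan vs. 0.0 depending on the choice.
both_nan = np.array([np.nan, np.nan])
print(np.sum(both_nan))     # nan
print(np.nansum(both_nan))  # 0.0
```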

@ml31415
Owner

ml31415 commented Jul 21, 2022

We might have different default fill_values for sum and nansum as well.
