reduceat cornercase (Trac #236) #834

numpy-gitbot · 2012-10-19T18:15:41Z

Original ticket http://projects.scipy.org/numpy/ticket/236 on 2006-08-07 by trac user martin_wiechert, assigned to unknown.

.reduceat does not handle repeated indices correctly. When an index is repeated the neutral element of the operation should be returned. In the example below [0, 10], not [1, 10], is expected.

In [1]:import numpy

In [2]:numpy.version.version
Out[2]:'1.0b1'

In [3]:a = numpy.arange (5)

In [4]:numpy.add.reduceat (a, (1,1))
Out[4]:array([ 1, 10])

The text was updated successfully, but these errors were encountered:

numpy-gitbot · 2012-10-23T02:36:15Z

@teoliphant wrote on 2006-08-08

Unfortunately, perhaps, the reduceat method of NumPy follows the behavior of the reduceat method of Numeric for this corner case.

There is no facility for returning the "identity" element of the operation in cases of index-equality. The defined behavior is to return the element given by the first index if the slice returns an empty sequence. Therefore, the documented and actual behavior of reduceat in this case is to construct

[a[1], add.reduce(a[1:])]

This is a feature request.

numpy-gitbot · 2012-10-23T02:36:15Z

trac user martin_wiechert wrote on 2006-08-08

also see ticket #835

numpy-gitbot · 2012-10-23T02:36:16Z

Milestone changed to 1.1 by @alberts on 2007-05-12

numpy-gitbot · 2012-10-23T02:36:16Z

Milestone changed to Unscheduled by @cournape on 2009-03-02

jnothman · 2014-01-29T23:20:47Z

I think this is closely connected to #835: If one of the indices is len(a), reduceat cannot output the element at that index, which is needed if the index len(a) appears or is repeated at the end of the indices.

Some solutions:

an option to reduceat to not set any value in the output where end - start == 0
an option to set the output to a given fixed value where end - start == 0
a where parameter, like in ufunc(), which masks which outputs should be calculated at all.

jayvius · 2015-07-02T21:15:27Z

Has there been any more thought on this issue? I would be interested in having the option to set the output to the identity value (if it exists) where end - start == 0.

divenex · 2015-11-25T18:45:18Z

I strongly support the change of the reduceat behaviour as suggested in this long-standing open issue. It looks like a clear bug or obvious design mistake which hinders the usefulness of this great Numpy construct.

reduceat should behave consistently for all indices. Namely, for every index i, ufunc.reduceat(a, indices) should return ufunc.reduce(a[indices[i]:indices[i+1]]).

This should also be true for the case indices[i] == indices[i+1]. I cannot see any sensible reason why, in this case, reduceat should return a[indices[i]] instead of ufunc.reduce(a[indices[i]:indices[i+1]]).

See also HERE a similar comment by Pandas creator Wes McKinney.

njsmith · 2015-11-25T21:10:41Z

Wow, this is indeed terrible and broken.
.
We'd need some discussion on the mailing list, but I at least would be
totally in favor of making that issue a FutureWarning in the next release
and fixing the behavior a few releases later. We'd need someone to take the
lead on starting that discussion and writing the patch. Perhaps that's you?

divenex · 2015-11-26T00:11:53Z

Thanks for the supportive response. I can start a discussion if this helps, but unfortunately am not up to patching the C code.

jnothman · 2015-11-26T00:19:28Z

What do you intend for ufuncs without an identity, such as np.maximum?

njsmith · 2015-11-26T00:21:38Z

For such functions, an empty reduction should be an error, as it already is
when you use .reduce() instead of .reduceat().

divenex · 2015-11-26T00:31:55Z

Indeed, the behaviour should be driven by the consistency with ufunc.reduce(a[indices[i]:indices[i+1]]), which is what every user would expect. So this does not require new design decisions. It really looks just like a long standing bug fix to me. Unless anybody can justify the current inconsistent behaviour.

divenex · 2015-12-07T16:44:52Z

@njsmith I am unable to sign up to the Numpy list. I sent my address here https://mail.scipy.org/mailman/listinfo/numpy-discussion but I never get any "email requesting confirmation". Not sure whether one need special requirements to subscribe...

njsmith · 2015-12-07T18:44:47Z

@divenex: did you check your spam folder? (I always forget to do that...) Otherwise I'm not sure what could be going wrong. There definitely shouldn't be any special requirements to subscribe beyond "has an email address". If you still can't get it to work then speak up and we'll try to track down the relevant sysadmin... We definitely want to know if it's broken.

WarrenWeckesser · 2016-10-04T21:20:23Z

A version of reduceat that is consistent with ufunc.reduce(a[indices[i]:indices[i+1]]) would be really, really nice. It would be so much more useful! Either an argument to select the behavior or a new function (reduce_intervals? reduce_segments? ...?) would avoid breaking backwards incompatibility.

eric-wieser · 2017-04-13T11:47:51Z

I'd perhaps be tempted to deprecate np.ufunc.reduceat alltogether - it seems more useful to be able to specify a set of start and end indices, to avoid cases where indices[i] > indices[i+1]. Also, the name at suggests a much greater similarity to at than atually exists

What I'd propose as a replacement is ~~np.piecewise_reduce~~ np.reducebins, possibly pure-python, which basically does:

def reducebins(func, arr, start=None, stop=None, axis=-1, out=None):
    """
    Compute (in the 1d case) `out[i] = func.reduce(arr[start[i]:stop[i]])`

    If only `start` is specified, this computes the same reduce at `reduceat` did:

        `out[i]  = func.reduce(arr[start[i]:start[i+1]])`
        `out[-1] = func.reduce(arr[start[-1]:])`

    If only `stop` is specified, this computes:

        `out[0] = func.reduce(arr[:stop[0]])`
        `out[i] = func.reduce(arr[stop[i-1]:stop[i]])`

    """
    # convert to 1d arrays
    if start is not None:
        start = np.array(start, copy=False, ndmin=1, dtype=np.intp)
        assert start.ndim == 1
    if stop is not None:
        stop = np.array(stop, copy=False, ndmin=1, dtype=np.intp)
        assert stop.ndim == 1

    # default arguments that do useful things
    if start is None and stop is None:
        raise ValueError('At least one of start and stop must be specified')
    elif stop is None:
        # start only means reduce from one index to the next, and the last to the end
        stop = np.empty_like(start)
        stop[:-1] = start[1:]
        stop[-1] = arr.shape[axis]
    elif start is None:
        # stop only means reduce from the start to the first index, and one index to the next
        start = np.empty_like(stop)
        start[1:] = stop[:-1]
        start[0] = 0
    else:
        # TODO: possibly confusing?
        start, stop = np.broadcast_arrays(start, stop)

    # allocate output - not clear how to do this safely for subclasses
    if not out:
        sh = list(arr.shape)
        sh[axis] = len(stop)
        sh = tuple(sh)
        out = np.empty(shape=sh)

    # below assumes axis=0 for brevity here
    for i, (si, ei) in enumerate(zip(start, stop)):
        func.reduce(arr[si:ei,...], out=out[i, ...], axis=axis)
    return out

Which has the nice properties that:

np.add.reduce(arr) is the same as np.piecewise_reduce(np.add, arr, 0, len(arr))
np.add.reduceat(arr, inds) is the same as np.piecewise_reduce(np.add, arr, inds)
np.add.accumulate(arr) is the same as np.piecewise_reduce(np.add, arr, 0, np.arange(len(arr)))

Now, does this want to go through the__array_ufunc__ machinery? Most of what needs to be handled should be already covered by func.reduce - the only issue is the np.empty line, which is a problem that np.concatenate shares.

jnothman · 2017-04-13T12:06:04Z

That sounds like a nice solution to me from an API perspective. Even just being able to specify two sets of indices to reduceat would suffice. From an implementation perspective? Well it's not very hard to change the current PyUFunc_Reduceat to support having two sets of inds, if that provides benefit. If we really see the advantage in supporting the accumulate-like use-case efficiently, it would not be hard to do that either.

jaimefrio · 2017-04-13T13:14:46Z

Marten proposed something similar to this in a similar discussion from ~1 year ago, but he also mentioned the possibility of adding a 'step` option: http://numpy-discussion.10968.n7.nabble.com/Behavior-of-reduceat-td42667.html Things I like (so +1 if anyone is counting) from your proposal: - Creating a new function, rather than trying to salvage the existing one. - Making the start and end indices arguments specific, rather than magically figuring them out from a multidimensional array. - The defaults for the None indices are very neat. Things I think are important to think hard about for this new function: - Should we make 'step' an option? (I'd say yes) - Does it make sense for the indices arrays to broadcast, or must they be 1D? - Should this be a np function, or a ufunc method? (I think I prefer it as a method) And from the bike shedding department, I like better: - Give it a more memorable name, but I have no proposal. - Use 'start' and 'stop' (and 'step' if we decide to go for it) for consistency with np.arange and Python's slice. - Dropping the _indices from the kwarg names. Jaime

…

On Thu, Apr 13, 2017 at 1:47 PM, Eric Wieser ***@***.***> wrote: I'd perhaps be tempted to deprecate np.ufunc.reduceat alltogether - it seems more useful to be able to specify a set of start and end indices, to avoid cases where indices[i] > indices[i+1]. Also, the nameatsuggests a much greater similarity toat` than atually exists What I'd propose as a replacement is np.piecewise_reduce, which basically does: def piecewise_reduce(func, arr, start_indices=None, end_indices=None, axis=-1, out=None): if start_indices is None and end_indices is None: start_indices = np.array([0], dtype=np.intp) end_indices = np.array(arr.shape[axis], dtype=np.intp) elif end_indices is None: end_indices = np.empty_like(start_indices) end_indices[:-1] = start_indices[1:] end_indices[-1] = arr.shape[axis] elif start_indices is None: start_indices = np.empty_like(end_indices) start_indices[1:] = end_indices end_indices[0] = 0 else: assert len(start_indices) == len(end_indices) if not out: sh = list(arr.shape) sh[axis] = len(end_indices) out = np.empty(shape=sh) # below assumes axis=0 for brevity here for i, (si, ei) in enumerate(zip(start_indices, end_indices)): func.reduce(arr[si:ei,...], out=alloc[i, ...], axis=axis) return out Which has the nice properties that: - np.ufunc.reduce is the same as np.piecewise_reduce(func, arr, 0, len(arr)) - np.ufunc.accumulate is the same as `np.piecewise_reduce(func, arr, np.zeros(len(arr)), np.arange(len(arr))) Now, does this want to go through the__array_ufunc__ machinery? Most of what needs to be handled should be already covered by func.reduce - the only issue is the np.empty line, which is a problem that np.concatenate shares. — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub <#834 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ADMGdtjSCodONyu6gCpwofdBaJMCIKa-ks5rvgtrgaJpZM4ANcqc> .

-- (\__/) ( O.o) ( > <) Este es Conejo. Copia a Conejo en tu firma y ayúdale en sus planes de dominación mundial.

eric-wieser · 2017-04-13T13:34:29Z

Use 'start' and 'stop'

Done

Should we make 'step' an option

Seems like a pretty narrow use case

Does it make sense for the indices arrays to broadcast, or must they be 1D

Updated. > 1d is obviously bad, but I think we should allow 0d and broadcasting, for cases like accumulate.

Should this be a np function, or a ufunc method? (I think I prefer it
as a method)

Every ufunc method is one more thing for __array_ufunc__ to handle.

divenex · 2017-04-13T13:40:05Z

The main motivation for reduceat is to avoid a loop over reduce for maximum speed. So I am not entirely sure a wrapper of a for loop over reduce would be a very useful addition to Numpy. It would go against reduceat main purpose.

Moreover the logic for reduceat existence and API, as a fast vectorized replacement for a loop over reduce, is clean and useful. I would not deprecate it, but rather fix it.

Regarding reduceat speed, let's consider a simple example, but similar to some real-world cases I have in my own code, where I use reduceat:

n = 10000
arr = np.random.random(n)
inds = np.random.randint(0, n, n//10)
inds.sort()

%timeit out = np.add.reduceat(arr, inds)
10000 loops, best of 3: 42.1 µs per loop

%timeit out = piecewise_reduce(np.add, arr, inds)
100 loops, best of 3: 6.03 ms per loop

This is a time difference of more than 100x and illustrates the importance of preserving reduceat efficiency.

In summary, I would prioritize fixing reduceat over introducing new functions.

Having a start_indices and end_indices, altough useful in some cases, is often redundant and I would see it as a possible addition, but not as a fix for the current reduceat inconsistent behaviour.

jnothman · 2017-04-13T13:44:08Z

I don't think allowing start and stop indices to come from different arrays would make a big difference to efficiency if implemented in the C.

…

On 13 April 2017 at 23:40, divenex ***@***.***> wrote: The main motivation for reduceat is to avoid a loop over reduce for maximum speed. So I am not entirely sure a wrapper of a for loop over reduce would be a very useful addition to Numpy. It would go against reduceat main purpose. Moreover the logic for reduceat existence and API, as a fast vectorized replacement for a loop over reduce, is clean and useful. I would not deprecate it, but rather fix it. Regarding reduceat speed, let's consider a simple example, but similar to some real-world cases I have in my own code, where I use reduceat: n = 10000 arr = np.random.random(n) inds = np.random.randint(0, n, n//10) inds.sort() %timeit out = np.add.reduceat(arr, inds)10000 loops, best of 3: 42.1 µs per loop %timeit out = piecewise_reduce(np.add, arr, inds)100 loops, best of 3: 6.03 ms per loop This is a time difference of more than 100x and illustrates the importance of preserving reduceat efficiency. In summary, I would prioritize fixing reduceat over introducing new functions. Having a start_indices and end_indices, altough useful in some cases, is often redundant and I would see it as a possible addition, but not as a fix for the current reduceat inconsistent behaviour. — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#834 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAEz6xPex0fo2y_MqVHbNP5YNkJ0CBJrks5rviW-gaJpZM4ANcqc> .

eric-wieser · 2017-04-13T14:11:31Z

This is a time difference of more than 100x and illustrates the importance of preserving reduceat efficiency.

Thanks for that - I guess I underestimated the overhead associated with the first stage of a reduce call (that only happens once for reduceat).

Not an argument against a free function, but certainly an argument against implementing it in pure python

eric-wieser · 2017-04-13T14:21:47Z

but not as a fix for the current reduceat inconsistent behaviour.

The problem is, that it's tricky to change the behaviour of code that's been around for so long.

Another possible extension: when indices[i] > indices[j], compute the inverse:

    for i, (si, ei) in enumerate(zip(start, stop)):
        if si >= ei:
            func.reduce(arr[si:ei,...], out=out[i, ...], axis=axis)
        else:
            func.reduce(arr[ei:si,...], out=out[i, ...], axis=axis)
            func.inverse(func.identity, out[i, ...], out=out[i, ...])

Where np.add.inverse = np.subtract, np.multiply.inverse = np.true_divide. This results in the nice property that

func.reduce(func.reduceat(x, inds_from_0)) == func.reduce(x))

For example

a = [1, 2, 3, 4]
inds = [0, 3, 1]
result = np.add.reduceat(a, inds) # [6, -5, 9] == [(1 + 2 + 3), -(3 + 2), (2 + 3 + 4)]

mhvk · 2017-04-13T14:54:42Z

The problem is, that it's tricky to change the behaviour of code that's been around for so long.

This is partially why in the e-mail thread I suggested to give special meaning to a 2-D array of indices in which the extra dimension is 2 or 3: it then is (effectively) interpreted as a stack of slices. But I realise this is also somewhat messy and of course one might as well have a reduce_by_slice, slicereduce, or reduceslice method.

p.s. I do think anything that works on many ufuncs should be a method, so that it can be passed through __array_ufunc__ and be overridden.

mhvk · 2017-04-13T14:57:30Z

Actually, a different suggestion that I think is much better: rather than salvaging reduceat, why not add a slice argument (or start, stop, step) to ufunc.reduce!? As @eric-wieser noted, any such implementation means we can just deprecate reduceat altogether, as it would just be

add.reduce(array, slice=slice(indices[:-1], indices[1:])

(where now we are free to make the behaviour match what is expected for an empty slice)

Here, one would broadcast the slice if it were 0-d, and might even consider passing in tuples of slices if a tuple of axes was used.

EDIT: made the above slice(indices[:-1], indices[1:]) to allow for extension to a tuple of slices (slice can hold arbitrary data, so this would work fine).

divenex · 2017-04-13T16:17:18Z

I would still find a fix to reduceat, to make it a proper 100% vectorized version of reduce, the most logical design solution. Alternatively, to avoid breaking code (but see below), an equivalent method named like reducebins could be created, which is simply a corrected version of reduceat. In fact, I agree with @eric-wieser that the naming of reduceat conveys more connection to the at function than there is.

I do understand the need not to break code. But I must say that I find it hard to imagine that much code depended on the old behavior, given that it simply did not make logical sense, and I would simply call it a long standing bug. I would expect that code using reduceat just made sure indices were not duplicated, to avoid a nonsense result from reduceat, or fixed the output as I did using out[:-1] *= np.diff(indices) > 0. Of course I would be interested in a user case where the old behavior/bug was used as intended.

I am not fully convinced about @mhvk slice solution because it introduces a non-standard usage for the slice construct. Moreover it would be inconsistent with the current design idea of reduce, which is to "reduce a‘s dimension by one, by applying ufunc along one axis."

I also do not see a compelling user case for both start and end indices. In fact, I see the nice design logic of the current reduceat method conceptually similar to np.histogram, where bins, which "defines the bin edges," are replaced by indices, which also represent the bins edges, but in index space rather than value. And reduceat applies a function to the elements contained inside each pair of bins edges. The histogram is an extremely popular construct, but it does not need, and in Numpy does not include, an option to pass two vectors of left and right edges. For the same reaason I doubt there is a strong need for both edges in reduceat or its replacement.

shoyer · 2017-04-13T16:47:42Z

The main motivation for reduceat is to avoid a loop over reduce for maximum speed. So I am not entirely sure a wrapper of a for loop over reduce would be a very useful addition to Numpy. It would go against reduceat main purpose.

I agree with @divenex here. The fact that reduceat requires indices to be sorted and overlapping is a reasonable constraint to ensure that the loop can be computed cache efficient manner with a single pass over the data. If you want overlapping bins, there are almost certainly better ways to compute the desired operation (e.g., rolling window aggregations).

I also agree that the cleanest solution is to define a new method such as reducebins with a fixed API (and deprecate reduceat), and to not to try to squeeze it into reduce which already does something different.

jni · 2017-04-14T08:23:35Z

Hi everyone,

I want to nip at the bud the discussion that this is a bug. This is documented behaviour from the docstring:

For i in ``range(len(indices))``, `reduceat` computes
``ufunc.reduce(a[indices[i]:indices[i+1]])``, which becomes the i-th
generalized "row" parallel to `axis` in the final result (i.e., in a
2-D array, for example, if `axis = 0`, it becomes the i-th row, but if
`axis = 1`, it becomes the i-th column).  There are three exceptions to this:

* when ``i = len(indices) - 1`` (so for the last index),
  ``indices[i+1] = a.shape[axis]``.
* if ``indices[i] >= indices[i + 1]``, the i-th generalized "row" is
  simply ``a[indices[i]]``.
* if ``indices[i] >= len(a)`` or ``indices[i] < 0``, an error is raised.

As such, I oppose any attempt to change the behaviour of reduceat.

A quick github search shows many, many uses of the function. Is everyone here certain that they all use only strictly increasing indices?

Regarding the behaviour of a new function, I would argue that without separate start/stop arrays, the functionality is severely hampered. There are many situations where one would want to measure values in overlapping windows that are not regularly arrayed (so rolling windows would not work). For example, regions of interest determined by some independent method. And @divenex has shown that the performance difference over Python iteration can be massive.

shoyer · 2017-04-14T16:45:27Z

There are many situations where one would want to measure values in overlapping windows that are not regularly arrayed (so rolling windows would not work).

Yes, but you wouldn't want to use a naive loop such as the one implemented by reduceat. You'd want to implement your own rolling window calculation storing intermediate results in some way so it can be done in a single linear pass over the data. But now we're talking about an algorithm that is much more complicated than reduceat.

eric-wieser · 2017-04-18T16:39:51Z

So, what would be a natural usage case for starts and stops in reducebins?

Achievable by other means, but a moving average of length k would be reducebins(np,add, arr, arange(n-k), k + arange(n-k)). I suspect that ignoring the cost of allocating the indices, performance would be comparable to a as_strided approach.

Uniquely, reducebins would allow a moving average of varying duration, which is not possible with as_strided

eric-wieser · 2017-04-18T16:48:28Z

Another use case - disambiguating between including the end or the start in the one-argument form.

For instance:

a = np.arange(10)
reducebins(np.add, start=[2, 4, 6]) == [2 + 3, 4 + 5, 6 + 7 + 8 + 9]  # what `reduceat` does
reducebins(np.add, stop=[2, 4, 6])  == [0 + 1, 2 + 3, 4 + 5]          # also useful

shoyer · 2017-04-19T02:52:52Z

Another use case - disambiguating between including the end or the start in the one-argument form.

I don't quite understand this one. Can you include the input tensor here? Also: what would be the default values for start/stop?

Anyways, I'm not strongly against the separate arguments, but it's not as clean of a replacement. I would love to able to say "Don't use reduceat, use reducebins instead" but that's (slightly) harder when the interface looks different.

jni · 2017-04-19T03:24:21Z

Actually, I just realised that even a start/stop option does not cover the use-case of empty slices, which is one that has been useful to me in the past: when my properties/labels correspond to rows in a CSR sparse matrix, and I use the values of indptr to do the reduction. With reduceat, I can ignore the empty rows. Any replacement will require additional bookkeeping. So, whatever replacement you come up with, please leave reduceat around.

In [2]: A = np.random.random((4000, 4000))
In [3]: B = sparse.csr_matrix((A > 0.8) * A)
In [9]: %timeit np.add.reduceat(B.data, B.indptr[:-1]) * (np.diff(B.indptr) > 1)
1000 loops, best of 3: 1.81 ms per loop
In [12]: %timeit B.sum(axis=1).A
100 loops, best of 3: 1.95 ms per loop
In [16]: %timeit np.maximum.reduceat(B.data, B.indptr[:-1]) * (np.diff(B.indptr) > 0)
1000 loops, best of 3: 1.8 ms per loop
In [20]: %timeit B.max(axis=1).A
100 loops, best of 3: 2.12 ms per loop

Incidentally, the empty sequence conundrum can be solved the same way that Python does it: by providing an initial value. This could be a scalar or an array of the same shape as indices.

jnothman · 2017-04-19T03:28:18Z

yes, i agree that the first focus needs to be on solving the empty slices case. In the case of start=end we can either have a way to set the output element to the identity, or to not modify the output element with a specified out array. The problem with the current is that it is overwritten with irrelevant data

divenex · 2017-04-19T12:22:07Z

I am fully with @shoyer about his last comment.

Let's simply define out=ufunc.reducebins(a, inds) as out[i]=ufunc.reduce(a[inds[i]:inds[i+1]]) for all i but the last, and deprecate reduceat.

Current use cases for starts and ends indices seem more naturally and likely more efficiently implemented with alternative functions like either as_strided or convolutions.

eric-wieser · 2017-04-19T12:35:06Z

@shoyer:

I don't quite understand this one. Can you include the input tensor here? Also: what would be the default values for start/stop?

Updated with the input. See the implementation of reduce_bins in the comment that started this for the default values. I've added a docstring there too. That implementation is feature-complete but slow (due to being python).

but that's (slightly) harder when the interface looks different.

When only one the start argument is passed, the interface is identical (ignoring the identity casethat we set out to fix in the first place). These three lines mean the same thing:

np.add.reduce_at(arr, inds)
reduce_bins(np.add, arr, inds)
reduce_bins(np.add, arr, start=inds)

(the method/function distinction is not something I care too much about, and I can't define a new ufunc method as a prototype in python!)

@jni:

Actually, I just realised that even a start/stop option does not cover the use-case of empty slices, which is one that has been useful to me in the past

You're wrong, it does - in the exact same way as ufunc.reduceat already does. It's also possible simply by passing start[i] == end[i].

the empty sequence conundrum can be solved ... by providing an initial value.

Yes, we've already covered this, and ufunc.reduce already does that by filling with ufunc.identity. This is not hard to add to the existing ufunc.reduecat, especially if #8952 is merged. But as you said yourself, the current behaviour is documented, so we should probably not change it.

@divenex

Let's simply define out=ufunc.reducebins(a, inds) as out[i]=ufunc.reduce(a[inds[i]:inds[i+1]]) for all i but the last

So len(out) == len(inds) - 1? This is different to the current behaviour of reduceat, so @shoyer's argument about switching is stronger here

All: I've gone through earlier comments and removed quoted email replies, as they were making this discussion hard to read

divenex · 2017-04-19T15:30:25Z

@eric-wieser good point. In my above sentence I meant that for the last index the behaviour of reducebins would be different as in the current reduceat. However, in that case, I am not sure what the value should be, as the last value formally does not make sense.

Ignoring compatibility concerns, the output of reducebins (in 1D) should have size inds.size-1, for the very same reason that np.diff(a) has size a.size-1 and np.histogram(a, bins) has size bins.size-1 . However this would go against the desire to have a drop-in replacement for reduceat.

eric-wieser · 2017-04-19T15:39:49Z

I don't think there's a convincing argument that a.size-1 is the right answer - including index 0 and/or index n seems like pretty reasonable behaviour as well. All of them seem handy in some circumstances, but I think it is very important to have a drop in replacement.

There's also another argument for stop/start hiding here - it allows you to build the diff-like behaviour if you want it, with very little cost, while still keeping the reduceat behaviour:

a = np.arange(10)
inds = [2, 4, 6]
reduce_bins(a, start=inds[:-1], stop=inds[1:])  #  [2 + 3, 4 + 5]

# or less efficiently:
reduce_at(a, inds)[:-1}
reduce_bins(a, start=inds)[:-1]
reduce_bins(a, stop=inds)[1:]

shoyer · 2017-04-19T16:34:28Z

@eric-wieser I would be OK with required start and stop arguments, but I do not like making one of them optional. It is not obvious that providing only start means out[i] = func.reduce(arr[start[i]:start[i+1]]) rather than out[i] = func.reduce(arr[start[i]:]), which is what I would have guessed.

My preferred API for reducebins is like reduceat but without the confusing "exceptions" noted in the docstring. Namely, just:

For i in range(len(indices)), reduceat computes ufunc.reduce(a[indices[i]:indices[i+1]]), which becomes the i-th generalized “row” parallel to axis in the final result (i.e., in a 2-D array, for example, if axis = 0, it becomes the i-th row, but if axis = 1, it becomes the i-th column).

I could go either way on the third "exception" which requires non-negative indices (0 <= indices[i] <= a.shape[axis]), which I view as more of a sanity check rather than an exception. But possibly that one could go, too -- I can see how negative indices might be useful to someone, and it's not hard to do the math to normalize such indices.

Not automatically adding an index at the end does imply that the result should have length len(a)-1, like the result of np.histogram.

@jni Can you please give an example of what you actually want to calculate from arrays found in sparse matrices? Preferably with a concrete example with non-random numbers, and self contained (without depending on scipy.sparse).

eric-wieser · 2017-04-19T16:47:28Z

It is not obvious that providing only start means out[i] = func.reduce(arr[start[i]:start[i+1]]) rather than out[i] = func.reduce(arr[start[i]:]), which is what I would have guessed.

The reading I was going for is that "Each bin starts at these positions", with the implication that all bins are contiguous unless explicitly specified otherwise. Perhaps I should try and draft a more complete docstring. I think I can see a strong argument for forbidding passing neither argument, so I'll remove that from my propose function.

which requires non-negative indices (0 <= indices[i] < a.shape[axis])

Note that there's also a bug here (#835) - the upper bound should be inclusive, since these are slices.

shoyer · 2017-04-19T17:05:35Z

Note that there's also a bug here - the upper bound should be inclusive, since these are slices.

Fixed, thanks.

eric-wieser · 2017-04-19T17:07:15Z

Not in the reduceat function itself, you haven't ;)

There didn't seem to be any value to a `assign_identity` function - all we actually care about is the value to assign. This also fixes numpy#8860 as a side-effect, and paves the way for: * easily adding more values (numpy#7702) * using the identity in more places (numpy#834)

eric-wieser · 2018-01-02T13:38:06Z

Turns out that :\doc\neps\groupby_additions.rst contains an (IMO inferior) proposal for a reduceby function.

martinling · 2023-12-22T19:13:55Z

Is this something that could be fixed in the upcoming 2.0 release?

mhvk · 2023-12-22T22:53:44Z

Triggered by this issue coming alive again, I wrote #25476 to see how hard it would be to allow passing in a 2-D array with start, stop values. Not too hard, it turns out. But API to be decided -- best discussed at #25476, probably!

numpy-gitbot mentioned this issue Oct 23, 2012

reduceat should handle outlier indices gracefully (Trac #237) #835

Open

shoyer mentioned this issue Apr 22, 2017

Add a dynamic_partial_sum operator to tensorflow? tensorflow/tensorflow#7662

Closed

eric-wieser mentioned this issue Jan 2, 2018

MAINT: deduplicate check_nonreorderable_axes #10309

Merged

eric-wieser mentioned this issue Feb 22, 2018

ENH: Implement initial kwarg for ufunc.add.reduce #10635

Merged

aminnj mentioned this issue Jul 10, 2018

optimized jagged operations scikit-hep/uproot3#99

Closed

mattip added 23 - Wish List and removed priority: lowest labels Oct 21, 2018

eliegoudout mentioned this issue Nov 18, 2023

FDataIrregular personal proposed reviews GAA-UAM/scikit-fda#593

Merged

12 tasks

mhvk linked a pull request Dec 22, 2023 that will close this issue

ENH: allow start-stop array for indices in reduceat #25476

Draft

reduceat cornercase (Trac #236) #834

reduceat cornercase (Trac #236) #834

Comments

numpy-gitbot commented Oct 19, 2012

numpy-gitbot commented Oct 23, 2012

numpy-gitbot commented Oct 23, 2012

numpy-gitbot commented Oct 23, 2012

numpy-gitbot commented Oct 23, 2012

jnothman commented Jan 29, 2014

jayvius commented Jul 2, 2015

divenex commented Nov 25, 2015

njsmith commented Nov 25, 2015 • edited by eric-wieser

divenex commented Nov 26, 2015

jnothman commented Nov 26, 2015 • edited by eric-wieser

njsmith commented Nov 26, 2015 • edited by eric-wieser

divenex commented Nov 26, 2015

divenex commented Dec 7, 2015

njsmith commented Dec 7, 2015

WarrenWeckesser commented Oct 4, 2016

eric-wieser commented Apr 13, 2017 • edited

jnothman commented Apr 13, 2017

jaimefrio commented Apr 13, 2017 via email

eric-wieser commented Apr 13, 2017 • edited

divenex commented Apr 13, 2017

jnothman commented Apr 13, 2017 via email

eric-wieser commented Apr 13, 2017

eric-wieser commented Apr 13, 2017 • edited

mhvk commented Apr 13, 2017

mhvk commented Apr 13, 2017 • edited

divenex commented Apr 13, 2017 • edited

shoyer commented Apr 13, 2017

jni commented Apr 14, 2017 • edited

shoyer commented Apr 14, 2017 • edited

eric-wieser commented Apr 18, 2017 • edited

eric-wieser commented Apr 18, 2017 • edited

shoyer commented Apr 19, 2017

jni commented Apr 19, 2017

jnothman commented Apr 19, 2017 via email • edited by eric-wieser

divenex commented Apr 19, 2017 • edited

eric-wieser commented Apr 19, 2017 • edited

divenex commented Apr 19, 2017 • edited

eric-wieser commented Apr 19, 2017 • edited

shoyer commented Apr 19, 2017 • edited

eric-wieser commented Apr 19, 2017 • edited

shoyer commented Apr 19, 2017

eric-wieser commented Apr 19, 2017

eric-wieser commented Jan 2, 2018

martinling commented Dec 22, 2023

mhvk commented Dec 22, 2023

njsmith commented Nov 25, 2015 •

edited by eric-wieser

jnothman commented Nov 26, 2015 •

edited by eric-wieser

njsmith commented Nov 26, 2015 •

edited by eric-wieser

eric-wieser commented Apr 13, 2017 •

edited

eric-wieser commented Apr 13, 2017 •

edited

eric-wieser commented Apr 13, 2017 •

edited

mhvk commented Apr 13, 2017 •

edited

divenex commented Apr 13, 2017 •

edited

jni commented Apr 14, 2017 •

edited

shoyer commented Apr 14, 2017 •

edited

eric-wieser commented Apr 18, 2017 •

edited

eric-wieser commented Apr 18, 2017 •

edited

jnothman commented Apr 19, 2017 via email •

edited by eric-wieser

divenex commented Apr 19, 2017 •

edited

eric-wieser commented Apr 19, 2017 •

edited

divenex commented Apr 19, 2017 •

edited

eric-wieser commented Apr 19, 2017 •

edited

shoyer commented Apr 19, 2017 •

edited

eric-wieser commented Apr 19, 2017 •

edited