reduceat cornercase (Trac #236) #834
Comments
@teoliphant wrote on 2006-08-08: Unfortunately, perhaps, the reduceat method of NumPy follows the behavior of the reduceat method of Numeric for this corner case. There is no facility for returning the "identity" element of the operation in cases of index equality. The defined behavior is to return the element given by the first index if the slice returns an empty sequence. Therefore, the documented and actual behavior of reduceat in this case is to construct [a[1], add.reduce(a[1:])]. This is a feature request.
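Concretely, with the array from the original ticket (the values follow from the behaviour just described):

```python
import numpy as np

a = np.arange(5)             # [0, 1, 2, 3, 4]
np.add.reduceat(a, [1, 1])   # -> array([ 1, 10])
# the slice a[1:1] is empty, so a[1] is returned verbatim instead of the
# additive identity 0; the second entry is add.reduce(a[1:]) == 10
```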
trac user martin_wiechert wrote on 2006-08-08: also see ticket #835
I think this is closely connected to #835: if one of the indices is `len(a)`, the slice is empty, which raises the same question of what an empty reduction should produce. Some solutions: return the identity element where one exists, or raise an error.
Has there been any more thought on this issue? I would be interested in having the option to set the output to the identity value (if it exists) where end - start == 0.
I strongly support the change of the current behaviour.
This should also be true for the case where the slice is empty. See also HERE a similar comment by Pandas creator Wes McKinney.
Wow, this is indeed terrible and broken.
Thanks for the supportive response. I can start a discussion if this helps, but unfortunately am not up to patching the C code.
What do you intend for ufuncs without an identity, such as np.maximum?
For such functions, an empty reduction should be an error, as it already is.
Indeed, the behaviour should be driven by consistency with `reduce` on an empty array.
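For reference, this is what `reduce` itself already does for empty inputs (standard NumPy behaviour):

```python
>>> import numpy as np
>>> np.add.reduce([])      # an identity exists, so the empty reduce returns it
0.0
>>> np.maximum.reduce([])  # no identity, so it raises
ValueError: zero-size array to reduction operation maximum which has no identity
```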
@njsmith I am unable to sign up to the Numpy list. I sent my address here https://mail.scipy.org/mailman/listinfo/numpy-discussion but I never get any "email requesting confirmation". Not sure whether one needs to meet special requirements to subscribe...
@divenex: did you check your spam folder? (I always forget to do that...) Otherwise I'm not sure what could be going wrong. There definitely shouldn't be any special requirements to subscribe beyond "has an email address". If you still can't get it to work then speak up and we'll try to track down the relevant sysadmin... We definitely want to know if it's broken.
I'd perhaps be tempted to deprecate `np.ufunc.reduceat` altogether - it seems more useful to be able to specify a set of start and stop indices, to avoid cases where `indices[i] > indices[i+1]`. Also, the name `reduceat` suggests a much greater similarity to `at` than actually exists.

What I'd propose as a replacement is `reducebins`, which basically does:

```python
def reducebins(func, arr, start=None, stop=None, axis=-1, out=None):
    """
    Compute (in the 1d case) `out[i] = func.reduce(arr[start[i]:stop[i]])`

    If only `start` is specified, this computes the same reduce as `reduceat` did:

        out[i]  = func.reduce(arr[start[i]:start[i+1]])
        out[-1] = func.reduce(arr[start[-1]:])

    If only `stop` is specified, this computes:

        out[0] = func.reduce(arr[:stop[0]])
        out[i] = func.reduce(arr[stop[i-1]:stop[i]])
    """
    # convert to 1d index arrays
    if start is not None:
        start = np.array(start, copy=False, ndmin=1, dtype=np.intp)
        assert start.ndim == 1
    if stop is not None:
        stop = np.array(stop, copy=False, ndmin=1, dtype=np.intp)
        assert stop.ndim == 1

    # default arguments that do useful things
    if start is None and stop is None:
        raise ValueError('At least one of start and stop must be specified')
    elif stop is None:
        # start only: reduce from one index to the next, and from the last to the end
        stop = np.empty_like(start)
        stop[:-1] = start[1:]
        stop[-1] = arr.shape[axis]
    elif start is None:
        # stop only: reduce from the start to the first index, then from one index to the next
        start = np.empty_like(stop)
        start[1:] = stop[:-1]
        start[0] = 0
    else:
        # TODO: possibly confusing?
        start, stop = np.broadcast_arrays(start, stop)

    # allocate output - not clear how to do this safely for subclasses
    if out is None:
        sh = list(arr.shape)
        sh[axis] = len(stop)
        out = np.empty(shape=tuple(sh))

    # below assumes axis=0 for brevity here
    for i, (si, ei) in enumerate(zip(start, stop)):
        func.reduce(arr[si:ei, ...], out=out[i, ...], axis=axis)
    return out
```

Which has the nice properties that:

- `func.reduce(arr)` is the same as `reducebins(func, arr, 0, len(arr))`
- `func.accumulate(arr)` is the same as `reducebins(func, arr, np.zeros(len(arr)), np.arange(len(arr)) + 1)`

Now, does this want to go through the `__array_ufunc__` machinery? Most of what needs to be handled should already be covered by `func.reduce` - the only issue is the `np.empty` line, which is a problem that `np.concatenate` shares.
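A quick check of the sketch above on a 1-d array (the output dtype is float because of the bare `np.empty`):

```python
a = np.arange(10)
reducebins(np.add, a, start=[2, 4, 6])            # -> [ 5.,  9., 30.]
reducebins(np.add, a, stop=[2, 4, 6])             # -> [ 1.,  5.,  9.]
reducebins(np.add, a, start=[2, 4], stop=[4, 6])  # -> [ 5.,  9.]
```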
That sounds like a nice solution to me from an API perspective. Even just being able to specify two sets of indices to `reduceat` would resolve this issue.
Marten proposed something similar to this in a similar discussion from ~1 year ago, but he also mentioned the possibility of adding a `step` option: http://numpy-discussion.10968.n7.nabble.com/Behavior-of-reduceat-td42667.html

Things I like (so +1 if anyone is counting) from your proposal:

- Creating a new function, rather than trying to salvage the existing one.
- Making the start and end indices arguments specific, rather than magically figuring them out from a multidimensional array.
- The defaults for the None indices are very neat.

Things I think are important to think hard about for this new function:

- Should we make `step` an option? (I'd say yes)
- Does it make sense for the indices arrays to broadcast, or must they be 1D?
- Should this be a np function, or a ufunc method? (I think I prefer it as a method)

And from the bike shedding department, I like better:

- Give it a more memorable name, but I have no proposal.
- Use `start` and `stop` (and `step` if we decide to go for it) for consistency with np.arange and Python's slice.
- Dropping the `_indices` from the kwarg names.

Jaime
Done
Seems like a pretty narrow use case
Updated. Anything more than 1d is obviously bad, but I think we should allow 0d and broadcasting, for cases like `accumulate`.
Every ufunc method is one more thing for `__array_ufunc__` implementations to support.
The main motivation for `reduceat` is to avoid a loop over `reduce` for maximum speed. So I am not entirely sure a wrapper of a for loop over `reduce` would be a very useful addition to Numpy. It would go against `reduceat`'s main purpose.

Moreover the logic for `reduceat`'s existence and API, as a fast vectorized replacement for a loop over `reduce`, is clean and useful. I would not deprecate it, but rather fix it.

Regarding `reduceat` speed, let's consider a simple example, but similar to some real-world cases I have in my own code, where I use `reduceat`:

```python
n = 10000
arr = np.random.random(n)
inds = np.random.randint(0, n, n//10)
inds.sort()

%timeit out = np.add.reduceat(arr, inds)
10000 loops, best of 3: 42.1 µs per loop

%timeit out = piecewise_reduce(np.add, arr, inds)
100 loops, best of 3: 6.03 ms per loop
```

This is a time difference of more than 100x and illustrates the importance of preserving `reduceat` efficiency.

In summary, I would prioritize fixing `reduceat` over introducing new functions.

Having `start_indices` and `end_indices`, although useful in some cases, is often redundant and I would see it as a possible addition, but not as a fix for the current `reduceat` inconsistent behaviour.
I don't think allowing start and stop indices to come from different arrays would make a big difference to efficiency if implemented in the C.
Thanks for that - I guess I underestimated the overhead associated with the first stage of a `reduce` call.

Not an argument against a free function, but certainly an argument against implementing it in pure python.
The problem is that it's tricky to change the behaviour of code that's been around for so long.

Another possible extension: when `start[i] > stop[i]`, reduce over the reversed slice and invert the result:

```python
for i, (si, ei) in enumerate(zip(start, stop)):
    if si <= ei:
        func.reduce(arr[si:ei, ...], out=out[i, ...], axis=axis)
    else:
        func.reduce(arr[ei:si, ...], out=out[i, ...], axis=axis)
        func.inverse(func.identity, out[i, ...], out=out[i, ...])
```

Where `func.inverse` is a hypothetical inverse operation (`subtract` for `add`, `divide` for `multiply`), defined such that

    func.reduce(func.reduceat(x, inds_from_0)) == func.reduce(x)

For example:

```python
a = [1, 2, 3, 4]
inds = [0, 3, 1]
result = np.add.reduceat(a, inds)  # [6, -5, 9] == [(1 + 2 + 3), -(3 + 2), (2 + 3 + 4)]
```
This is partially why in the e-mail thread I suggested giving special meaning to a 2-D array of indices in which the extra dimension is 2 or 3: it then is (effectively) interpreted as a stack of slices. But I realise this is also somewhat messy, and of course one might as well have a separate function for it.

p.s. I do think anything that works on many ufuncs should be a method, so that it can be passed through `__array_ufunc__`.
Actually, a different suggestion that I think is much better: rather than salvaging `reduceat`, let `reduce` accept an array of slices (where now we are free to make the behaviour match what is expected for an empty slice).

Here, one would broadcast the slice if it were 0-d, and might even consider passing in tuples of slices if a tuple of axes was used.
I would still find a fix to `reduceat` itself the ideal solution.

I do understand the need not to break code. But I must say that I find it hard to imagine that much code depended on the old behavior, given that it simply did not make logical sense, and I would simply call it a long standing bug. I would expect that code using `reduceat` with repeated indices either worked around this corner case or was silently wrong.

I am not fully convinced about deprecating `reduceat` in favour of a new function. @mhvk I also do not see a compelling use case for both `start` and `stop` indices, since a single set of contiguous indices covers the common cases.
I agree with @divenex here. The fact that the current behaviour never made logical sense makes it hard to defend as anything but a bug. I also agree that the cleanest solution is to define a new method such as `reducebins` with the corrected behaviour, leaving `reduceat` untouched for compatibility.
Hi everyone, I want to nip in the bud the discussion that this is a bug. This is documented behaviour from the docstring:

> if `indices[i] >= indices[i + 1]`, the i-th generalized "row" is simply `a[indices[i]]`

As such, I oppose any attempt to change the behaviour of `reduceat`. A quick github search shows many, many uses of the function. Is everyone here certain that they all use only strictly increasing indices?

Regarding the behaviour of a new function, I would argue that without separate start/stop arrays, the functionality is severely hampered. There are many situations where one would want to measure values in overlapping windows that are not regularly arrayed (so rolling windows would not work). For example, regions of interest determined by some independent method. And @divenex has shown that the performance difference over Python iteration can be massive.
Yes, but you wouldn't want to use a naive loop such as the one implemented by the pure-python prototype above - a real implementation would do the loop in C, like `reduceat` does.
Achievable by other means, but a moving average of length `k` is a good example of overlapping windows. Uniquely, separate `start`/`stop` arrays let the bins overlap, which contiguous indices cannot express.
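For instance, a length-3 moving sum written against the `reducebins` sketch from earlier in this thread (a hypothetical function, not in NumPy):

```python
a = np.arange(10)
k = 3
start = np.arange(len(a) - k + 1)                   # [0, 1, ..., 7]
reducebins(np.add, a, start=start, stop=start + k)  # overlapping bins
# -> [ 3.,  6.,  9., 12., 15., 18., 21., 24.]
```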
Another use case - disambiguating between including the end or the start in the one-argument form. For instance:

```python
a = np.arange(10)
reducebins(np.add, start=[2, 4, 6]) == [2 + 3, 4 + 5, 6 + 7 + 8 + 9]  # what `reduceat` does
reducebins(np.add, stop=[2, 4, 6]) == [0 + 1, 2 + 3, 4 + 5]  # also useful
```
I don't quite understand this one. Can you include the input tensor here? Also: what would be the default values for `start` and `stop`?

Anyways, I'm not strongly against the separate arguments, but it's not as clean of a replacement. I would love to be able to say "Don't use reduceat, use reducebins instead" but that's (slightly) harder when the interface looks different.
Actually, I just realised that even a start/stop option does not cover the use-case of empty slices, which is one that has been useful to me in the past: when my properties/labels correspond to rows in a CSR sparse matrix, and I use the values of `indptr` as the reduction indices - empty rows then yield empty slices:

```python
In [2]: A = np.random.random((4000, 4000))

In [3]: B = sparse.csr_matrix((A > 0.8) * A)

In [9]: %timeit np.add.reduceat(B.data, B.indptr[:-1]) * (np.diff(B.indptr) > 1)
1000 loops, best of 3: 1.81 ms per loop

In [12]: %timeit B.sum(axis=1).A
100 loops, best of 3: 1.95 ms per loop

In [16]: %timeit np.maximum.reduceat(B.data, B.indptr[:-1]) * (np.diff(B.indptr) > 0)
1000 loops, best of 3: 1.8 ms per loop

In [20]: %timeit B.max(axis=1).A
100 loops, best of 3: 2.12 ms per loop
```

Incidentally, the empty sequence conundrum can be solved the same way that Python does it: by providing an initial value. This could be a scalar or an array of the same shape as the index arrays.
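Python's own reductions illustrate the pattern being suggested - an explicit initial value makes the empty case well defined:

```python
from functools import reduce
import operator

reduce(operator.add, [], 0)  # -> 0; the initializer is returned for an empty sequence
sum([], 10)                  # -> 10
max([], default=0)           # -> 0 (Python 3.4+)
```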
Yes, I agree that the first focus needs to be on solving the empty slices case. In the case of start == end we can either have a way to set the output element to the identity, or to not modify the output element when an `out` array is specified. The problem with the current behaviour is that the output element is overwritten with irrelevant data.
I am fully with @shoyer about his last comment. Let's simply define a new `reducebins` with the fixed behaviour.

Current use cases for `reduceat` would translate directly, so migration should be painless.
Updated with the input. See the implementation of `reducebins` in my comment above for what the defaults do.

When only one of the `start`/`stop` arguments is given, the other is derived so that the bins are contiguous.

(the method/function distinction is not something I care too much about, and I can't define a new ufunc method as a prototype in python!)

@jni:

You're wrong, it does - in the exact same way as `reduceat` does: pass `start[i] == stop[i]` and you get an empty slice, except that here it produces the identity rather than `arr[start[i]]`.

Yes, we've already covered this, and a C implementation of the same loop would not have that overhead.

So far I don't see a use case that the `start`/`stop` API fails to cover.

All: I've gone through earlier comments and removed quoted email replies, as they were making this discussion hard to read
@eric-wieser good point. In my above sentence I meant that for the last index the behaviour of `reduceat` is special-cased to reduce to the end of the array.

Ignoring compatibility concerns, the output of the one-argument form would ideally contain one bin per pair of consecutive indices, without that implicit final bin.
I don't think there's a convincing argument that the one-argument form is strictly better. There's also another argument for separate `start`/`stop` arrays: they let you skip reductions you don't need:

```python
a = np.arange(10)
inds = [2, 4, 6]
reduce_bins(a, start=inds[:-1], stop=inds[1:])  # [2 + 3, 4 + 5]

# or less efficiently:
reduce_at(a, inds)[:-1]
reduce_bins(a, start=inds)[:-1]
reduce_bins(a, stop=inds)[1:]
```
@eric-wieser I would be OK with required `start` and `stop` arguments, too.

My preferred API for `reducebins` keeps both arguments explicit, with only a small number of well-defined exceptions for omitting one of them.

I could go either way on the third "exception", which requires non-negative indices.

Not automatically adding an index at the end does imply that the result should have length equal to the number of bins actually requested, i.e. `len(start)`.

@jni Can you please give an example of what you actually want to calculate from arrays found in sparse matrices? Preferably with a concrete example with non-random numbers, and self contained (without depending on scipy.sparse).
The reading I was going for is that "each bin starts at these positions", with the implication that all bins are contiguous unless explicitly specified otherwise. Perhaps I should try and draft a more complete docstring.

I think I can see a strong argument for forbidding passing neither argument, so I'll remove that from my proposed function.

Note that there's also a bug here (#835) - the upper bound should be inclusive, since these are slices.
Fixed, thanks.
There didn't seem to be any value to an `assign_identity` function - all we actually care about is the value to assign. This also fixes numpy#8860 as a side-effect, and paves the way for:

* easily adding more values (numpy#7702)
* using the identity in more places (numpy#834)
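For reference, the identity in question is already exposed on each ufunc (standard NumPy attributes):

```python
import numpy as np

np.add.identity       # -> 0
np.multiply.identity  # -> 1
np.maximum.identity   # -> None: no identity, so an empty reduce raises
```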
Is this something that could be fixed in the upcoming 2.0 release?
Original ticket http://projects.scipy.org/numpy/ticket/236 on 2006-08-07 by trac user martin_wiechert, assigned to unknown.

`.reduceat` does not handle repeated indices correctly. When an index is repeated, the neutral element of the operation should be returned. In the example below, [0, 10], not [1, 10], is expected.
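The ticket's example (reconstructed here from the expected/actual values it quotes):

```python
>>> import numpy
>>> a = numpy.arange(5)
>>> numpy.add.reduceat(a, (1, 1))
array([ 1, 10])
```

The repeated index 1 yields the empty slice `a[1:1]`, so the first entry should be the additive identity 0 rather than `a[1] == 1`.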