Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Implement initial kwarg for ufunc.add.reduce #10635

Merged
merged 1 commit into from Apr 26, 2018

Conversation

hameerabbasi
Copy link
Contributor

@hameerabbasi hameerabbasi commented Feb 20, 2018

As discussed in #5032: Adds an initial kwarg for ufunc.reduce, plus changes in min, max, sum, prod.

This acts as the first item in a reduction. It's useful for empty reductions e.g. np.min([], initial=np.inf) reduces to inf. It's also useful for starting the reduction with a different value e.g. np.sum([10], initial=5) reduces to 15. (5 used as starting value rather than the identity 0).

Explicitly passing the initial value as None causes empty reductions to err.

Explicitly passing np._NoValue also keeps the default behaviour.

min, max, sum and prod only pass on the initial kwarg.

PyUFunc_Reduce is modified to accept the initial kwarg. It then replaces the identity by initial.

Changes are made to normalize_reduce_args so that it accepts and parses initial correctly.

Documentation has been added for all the changes.

Fixes #5032

@hameerabbasi
Copy link
Contributor Author

@eric-wieser Looking forward to your feedback. 😄

@hameerabbasi hameerabbasi changed the title Implement identity= for reduce + docs + tests. Implement identity kwarg for ufunc.add.reduce Feb 20, 2018
@hameerabbasi
Copy link
Contributor Author

Hmm. I have no idea why the CircleCI build failed, I didn't touch Cython. :/

identity : scalar, optional
The value to reduce to for an empty axis. Raises a ``ValueError``
by default.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs a version added

identity = _get_identity(ufunc, &reorderable);
if (identity == NULL || identity == Py_None) {
identity = _get_identity(ufunc, &reorderable);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You need an else { Py_INCREF(identity); }, so that the decref at the end is safe.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so, it only checks for None and NULL. The object is getting passed through.

identity is gotten from the PyArg_ParseTupleAndKeywords, it is passed through directly from there, does this do the Py_INCREF call automatically or not?

Edit: Looks like you're right, from the docs:

Note that any Python object references which are provided to the caller are borrowed references; do not decrement their reference count!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And the other half of the puzzle is that _get_identity does increment the reference count - so you want to make them equivalent so the cleanup doesn't need special casing.

@eric-wieser
Copy link
Member

It looks like you built this patch on an out-of-date master from before we had CircleCI working. If you rebase, the CircleCI error will go away.

@eric-wieser
Copy link
Member

eric-wieser commented Feb 20, 2018

This is failing because we reparse the arguments for forwarding to __array_ufunc__ - but I don't remember where.

@ahaldane
Copy link
Member

ahaldane commented Feb 20, 2018

Somewhat tangent comment: How would this fit in with the dream of turning reduce into a gufunc (related to discussions in #8528 #8819)?

Currently gufuncs have a fixed set of common keyword arguments, axis, dtype, casting, extobj, where and so on. Should individual gufuncs be allowed to have additional args besides the default ones, as identity would be?

Or does the fact that reduce will have an extra identity argument mean that it cannot be implemented as a gufunc?

Edit: Related: #8773

@ahaldane
Copy link
Member

Maybe the way to think of it is that a reduce_ufunc(a, out=b) gufunc treats its "out" parameter also as an input. So if you supply an out, that initial value gets read and added on to.

So maybe ultimately, we would have both a reduce-gufunc, and a reduce method. The reduce method's job is to properly set up the out parameter for the reduce-gufunc. The method would have an "identity" argument, while the ufunc itself would read that value from the out arg.

Sorry for the distraction, we can think about this later outside this PR.

@charris charris changed the title Implement identity kwarg for ufunc.add.reduce ENH: Implement identity kwarg for ufunc.add.reduce Feb 21, 2018
@charris
Copy link
Member

charris commented Feb 21, 2018

The explanation of these changes could be much more complete. And please explain it for all the functions changed.

@eric-wieser
Copy link
Member

eric-wieser commented Feb 21, 2018

So, there are two things that worry me here:

  • The interface is different to the builtin max(..., default=0), even though it's roughly the same
  • ufunc.reduce(identity=...) only really makes sense for maximum. minimum, fmax, and fmin, where choosing a different domain lets you choose a different identity. For other ufuncs, it produces crazy results (I think np.add.reduce(np.ones(2,2), identity=10) will give a surprising result, but I haven't checked).
    • We could add a marker indicating that the identity is configurable by kwarg, alongside the current allowable identity values - and throw an error when the kwarg is passed for a ufunc without that marker
    • For object arrays, having an identity makes sense for lots of ufuncs - consider np.add.reduce(["hello", "world"], dtype=object, identity="")

@hameerabbasi
Copy link
Contributor Author

The explanation of these changes could be much more complete. And please explain it for all the functions changed.

I'll add more complete docs with examples, if that's what you mean, incorporating @eric-wieser's excellent use cases and insights. :-)

The interface is different to the builtin max(..., default=0), even though it's roughly the same.

Yes, I thought of that as well, however, plenty of kwargs here are different (axis, keepdims, etc.). I'll go ahead and assume that we're not following Python's interface here, however, this can still be changed.

However, if it WAS to be changed, I'd say it's more similar to functools.reduce's initializer argument, which is much more instructive. This is also in line with your second worry:

For other ufuncs, it produces crazy results (I think np.add.reduce(np.ones(2,2), identity=10) will give a surprising result, but I haven't checked).

>>> np.add.reduce(np.ones((2, 2)), identity=10)
array([12., 12.])

The name initializer instead of identity is much more instructive and takes care of this worry.

So initializer probably makes more sense here. We could add a warning in the docs that non-default initializers could lead to unexpected results.

@eric-wieser
Copy link
Member

will give a surprising result, but I haven't checked

There's a more surprising result than that one to be found, I think. Maybe np.add.reduce(np.ones((2, 2, 2)), axis=(0, 1), identity=10)?

@hameerabbasi
Copy link
Contributor Author

It gives a quite reasonable result, actually.

>>> np.add.reduce(np.ones((2, 2, 2)), axis=(0, 1), identity=10)
array([14., 14.])

@eric-wieser
Copy link
Member

eric-wieser commented Feb 21, 2018

The name initializer instead of identity is much more instructive and takes care of this worry.

That's maybe reasonable. Perhaps ufunc.reduce should take an initializer argument, but the np.min shorthand can alias it to default?

We'll definitely need to run this by the list before this gets merged, but may as well brainstorm here first

@eric-wieser
Copy link
Member

eric-wieser commented Feb 21, 2018

Non-adjacent axes? axis=(0, 2)?

@hameerabbasi
Copy link
Contributor Author

hameerabbasi commented Feb 21, 2018

Non-adjacent axes? (0, 2)?

Still works beautifully.

That's maybe reasonable. Perhaps ufunc.reduce should take an initializer argument, but the np.min shorthand can alias it to default?

The slight problem I see with this is something like the following:

>>> np.minimum.reduce([0.0], identity=-np.inf)
-inf

As you can see, it gives a value not in the original set. I guess we could call it default but then there are two issues with this:

  • The one I mentioned above;
  • We're deviating from ufunc.reduce norms, which is a bad thing. It doesn't keep uniformity.
  • Problems when generalizing to other reductions.

@eric-wieser
Copy link
Member

Still works beautifully.

Huh, I think my memory of the code is incorrect. I thought that multi-axis reduce fell back on repeated single-axis reduce, and so was expecting to get 24 as the result. It seems that's not the case, which is great!

As you can see, it gives a value not in the original set.

I'm now on board with the argument being called initializer, which makes that far less confusing.

The issue in that case is that the caller chose to pass in an initializer that was not the minimum of the set of allowed values. That looks perfectly fine to me.

@eric-wieser
Copy link
Member

CI is failing because core/src/umath/override.c@normalize_reduce_args is missing this argument.

@hameerabbasi
Copy link
Contributor Author

The issue in that case is that the caller chose to pass in an initializer that was not the minimum of the set of allowed values. That looks perfectly fine to me.

Exactly. We can document a use-case as being reduction of an empty axis.

On that note, I'm going to work and might not be available for the next hours. If you want to fix that on my branch, feel free.

@hameerabbasi
Copy link
Contributor Author

hameerabbasi commented Feb 21, 2018

On another note, if we're renaming this to initializer, it would make sense to add it to accumulate and reduceat. Is that best done in another PR or this one?

Edit: I'm willing to take on the task of making reduceat use identity.

@eric-wieser
Copy link
Member

reduceat is best untouched - there's an issue about writing a replacement somewhere.

Accumulate would be a better fit for another PR, especially since the use case is less strong.

@eric-wieser
Copy link
Member

Edit: I'm willing to take on the task of making reduceat use identity.

#834 should get you started on that rabbit hole. Lets try and wrap up this PR first though!

@hameerabbasi hameerabbasi changed the title ENH: Implement identity kwarg for ufunc.add.reduce ENH: Implement initializer kwarg for ufunc.add.reduce Feb 22, 2018

You can use the initializer argument to initialize the reduction to a
different value. It can be used to calculate empty reductions for
ufuncs that don't support it as well.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move this second sentence to just above np.minimum.reduce so it's clear what it refers to

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

@@ -1827,6 +1827,11 @@ def sum(a, axis=None, dtype=None, out=None, keepdims=np._NoValue):
`ndarray`, however any non-default value will be. If the
sub-class' method does not implement `keepdims` any
exceptions will be raised.
initializer : scalar, optional
The initializer for this maximum. Can also be used to compute the
minimum of an empty slice.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

max? min? This is sum

I'd argue that this isn't useful enough to include in the public API here, or in any of these functions. It's niche enough that we probably want the user calling ufunc.reduce directly

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. I went to a lot of effort to put it in. :-( Anyway, I'd argue it should stay for the NaN-skipping functions because there's no alternative there.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additionally, it's useful for specifying the value of an All-NaN slice reduction (although I guess you could just do this):

x_max = np.nanmax(x, ...)
x[np.isnan(x)] = value

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's what I propose:

  • np.some_ufunc.reduce(..., initializer=...)
  • np.sum(...) - not useful enough to expose - force people to use the more explicit API above
  • np.prod(...)
  • np.min(..., default=...) - forward to initializer, match the py3 API
  • np.nanmin(..., default=...) as above
  • np.min(..., initializer=...)
  • np.nanmin(..., initializer=...)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I take it back, default and initializer mean different things. max([0], default=10) is different to np.max([0], initializer=10)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exactly, as in the discussion above. But to keep uniformity, we should probably either keep it or remove it everywhere. I prefer users should not have to remember/recall which reductions had this and which didn't, they should just instinctively go for either the reduction or ufunc.reduce.

I guess to cater for the nanmax, etc. case, we could ask users to reshape, add an extra slice, then undo, but that would be excessive.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nanmax absolutely should have an initializer argument, since it needs it to deal with all-nan slices of object dtypes. I'm just not sure they're useful on sum and prod, but maybe leave them in for now, and consult the list.

Either way, my initial remark here is pointing out that this docstring is copied-and-pasted incorrectly

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's fixed. I see you noticed it was a copypasta.

@@ -3275,7 +3284,7 @@ PyUFunc_Reduceat(PyUFuncObject *ufunc, PyArrayObject *arr, PyArrayObject *ind,
op_axes_arrays[2]};
npy_uint32 op_flags[3];
int i, idim, ndim, otype_final;
int need_outer_iterator;
int need_outer_iterator = 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why this change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There was a warning in GCC, I wanted to get rid of it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are the warnings/errors because of this change? Some C++ magic going on? Ideally, adding an intitializer shouldn't change anything.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The warnings are showing up because there's a failure elsewhere - you're probably just not scrolling down far enough to see the error. I'd probably revert this change, and leave it for a future warning cleanup.

@mhvk
Copy link
Contributor

mhvk commented Apr 11, 2018

I think what you, the behaviour for objects is close enough to what the final state should be than what would be the case without those extra commits. So, I would suggest to merge and do a follow-up PR on the details.

@hameerabbasi
Copy link
Contributor Author

+1 for finally merging.

@seberg
Copy link
Member

seberg commented Apr 12, 2018

Well, I don't know and I do not want to think about it. Close enough for me is close enough, if assuming that someone will use the current behaviour, we can still get to a reasonable final state without problems at some point. I hate adding the next kludge that future us gets annoyed about.

Right now I am a bit confused what you are suggesting as final state for the object support....

@seberg
Copy link
Member

seberg commented Apr 12, 2018

That said, None as the "Error out" value, is maybe good even for object. Not ideal in some sense but the other solutions seem not much nicer. Plus, I am not sure it would even manoeuvre into a tricky to resolve issue for the future.

@hameerabbasi
Copy link
Contributor Author

@seberg Right now, the issue is that if you pass initial, it will always include it in the list even for objects. That could break backwards compatibility, as for some functions in _methods.py, we do pass it. So objects will get 0 in the list when they try to do np.sum, for example.

Right now, having np._NoValue or None as an "ignore me" is what we want.

@eric-wieser
Copy link
Member

eric-wieser commented Apr 12, 2018

I think the cause of our problems is how the .identity is defined, and we'd do better to free ourselves from that constraint. What I'd propose then is:

  • Use .reduce(initial=np._NoValue) to indicate "no identity"
  • Make .reduce(obj_array, initial=some_value) use the "initial" logic, not the "default" logic
  • Allow initial=None to be passed like any other value
  • Remove the initial=ufunc.identity kwarg defaults, and just remark in the documentation of initial that "If the ufunc has no identity or the dtype is object, this defaults to None - otherwise it defaults to ufunc.identity"

@hameerabbasi
Copy link
Contributor Author

hameerabbasi commented Apr 12, 2018

@eric-wieser Would you happen to know how I'd access np._NoValue inside the C code for this check? Doesn't seem terribly hard to do. I'd also need to do the same in override.c.

@eric-wieser
Copy link
Member

Something like

#include "npy_import.h"
static PyObject *NoValue = NULL;
npy_cache_import("numpy", "_NoValue", &NoValue);

ought to work

@@ -152,6 +155,9 @@ normalize_reduce_args(PyUFuncObject *ufunc, PyObject *args,
}
obj = PyTuple_GetSlice(args, 3, 4);
}
if (i == 5 && obj == NoValue) {
continue;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A comment saying this is the initial argument would be good.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. Any idea why the Windows build is failing?

@eric-wieser
Copy link
Member

eric-wieser commented Apr 12, 2018

Penultimate commit message is very wrong!

@@ -3036,6 +3036,8 @@ PyUFunc_Reduce(PyUFuncObject *ufunc, PyArrayObject *arr, PyArrayObject *out,
const char *ufunc_name = ufunc_get_name_cstr(ufunc);
/* These parameters come from a TLS global */
int buffersize = 0, errormask = 0;
static PyObject *NoValue = NULL;
npy_cache_import("numpy", "_NoValue", &NoValue);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, this should actually be if (... < 0) { return NULL; } (wrapped as normal), in case the import fails

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

npy_cache_import has a return type of void, so we can't do that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, in that case, I mean you should check that NoValue != NULL

@@ -137,6 +137,9 @@ normalize_reduce_args(PyUFuncObject *ufunc, PyObject *args,
return -1;
}

static PyObject *NoValue = NULL;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

C90 requires this be declared at the top of the function with the other variables (which is why the test fails)

@hameerabbasi
Copy link
Contributor Author

hameerabbasi commented Apr 12, 2018

I think this now has the desired behavior for all cases. The np.NoIdentity thing is a matter for a separate PR, this PR keeps the convention that None means "no initial value" and np._NoValue means "ignore initial completely".

@hameerabbasi
Copy link
Contributor Author

Any more changes required here? If not, is this good to merge? cc @eric-wieser @mhvk @seberg

@seberg
Copy link
Member

seberg commented Apr 18, 2018

Won't look through the code, at least not now. It seems it still needs release notes I guess? So I understand correctly it is now:

  • Initial value for the reduction for all (almost) cases, including object
  • None will cause an error (even for object)
  • default is NoValue and will cause old behaviour.

Seems good to me as a change.

@hameerabbasi
Copy link
Contributor Author

That's correct:

  • initial behaves as an initial value in all cases now, even for dtype=object. Previously, it would behave as the default value for dtype=object and initial value otherwise.
  • None means "no initial value, err if empty, otherwise use the start of the array".
  • np._NoValue causes the old behavior.

I'll add release notes now.

@hameerabbasi hameerabbasi force-pushed the ufunc-reduce-identity branch 2 times, most recently from 7aca91b to 24e185a Compare April 24, 2018 08:54
@hameerabbasi
Copy link
Contributor Author

Rebased and tests added for dtype=object. Tell me when merging is imminent so I can squash all the commits.

@hameerabbasi hameerabbasi force-pushed the ufunc-reduce-identity branch 2 times, most recently from b3b7289 to 06d39e0 Compare April 24, 2018 09:35
@mhvk
Copy link
Contributor

mhvk commented Apr 24, 2018

@hameerabbasi - I'd gladly merge this (maybe after one more day for last comments), so do go ahead and squash!

@hameerabbasi
Copy link
Contributor Author

@mhvk Ping. 😀

@mhvk mhvk merged commit 4ea36ff into numpy:master Apr 26, 2018
@mhvk
Copy link
Contributor

mhvk commented Apr 26, 2018

Merged. Thanks, @hameerabbasi!

@hameerabbasi
Copy link
Contributor Author

Thanks so much to @eric-wieser and @mhvk especially for babying me through my first big PR. I owe you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants