
Supporting duck array coercion #13831

Open
jakirkham opened this issue Jun 25, 2019 · 55 comments

@jakirkham
Contributor

Opening this issue after some discussion with @shoyer, @pentschev, and @mrocklin in issue ( dask/dask#4883 ). AIUI this was discussed in NEP 22 (so I'm mainly parroting other people's ideas here to renew discussion and correct my own misunderstanding ;).

It would be useful for various downstream array libraries to have a function that ensures we have some duck array (i.e. something that behaves like an ndarray). This would be somewhat similar to np.asanyarray, but without the requirement of subclassing. It would allow libraries to return their own (duck) array type. If the object supports no suitable conversion, we could fall back to handling ndarray subclasses, ndarrays, and coercion of other things (e.g. nested lists) to ndarrays.

cc @njsmith (who coauthored NEP 22)

@shoyer
Member

shoyer commented Jun 25, 2019

The proposed implementation would look something like the following:

import numpy as np

# hypothetical np.duckarray() function
def duckarray(array_like):
  if hasattr(array_like, '__duckarray__'):
    # return an object that can be substituted for np.ndarray
    return array_like.__duckarray__()
  return np.asarray(array_like)

Example usage:

class SparseArray:
  def __duckarray__(self):
    return self
  def __array__(self):
    raise TypeError

np.duckarray(SparseArray())  # returns a SparseArray object
np.array(SparseArray())  # raises TypeError

Here I've used np.duckarray and __duckarray__ as placeholders, but we can probably do better for these names. See the Terminology from NEP 22:

“Duck array” works fine as a placeholder for now, but it’s pretty jargony and may confuse new users, so we may want to pick something else for the actual API functions. Unfortunately, “array-like” is already taken for the concept of “anything that can be coerced into an array” (including e.g. list objects), and “anyarray” is already taken for the concept of “something that shares ndarray’s implementation, but has different semantics”, which is the opposite of a duck array (e.g., np.matrix is an “anyarray”, but is not a “duck array”). This is a classic bike-shed so for now we’re just using “duck array”. Some possible options though include: arrayish, pseudoarray, nominalarray, ersatzarray, arraymimic, …

Some other name ideas: np.array_compatible(), np.array_api()....

@rgommers
Member

np.array_compatible could work, although I'm not sure I like it better than duckarray. np.array_api I don't like, gives the wrong idea imho.

Since after a long time we haven't come up with a better name, perhaps we should just bless the "duck-array" name...

@seberg
Member

seberg commented Jun 30, 2019

I like the compatible word; maybe we can think of variations along that line as well, e.g. as_compatible_array (which somewhat implies that all compatible objects are arrays). The as is maybe annoying (partially because all the as* functions have no underscores). "duck" seems nice in libraries, but I think a bit strange for random people seeing it. So I dislike "duck" if and only if we want downstream users to use it a lot (i.e. even when I start writing a small tool for myself/a small lab).

@charris
Member

charris commented Jun 30, 2019

Maybe quack_array :)

@pentschev
Contributor

To extend a bit on the topic, there's one other case that isn't covered with np.duckarray, which is the creation of new arrays with a type based on an existing type, similar to what functions such as np.empty_like do. Currently we can do things like this:

>>> import numpy as np, cupy as cp
>>> a  = cp.array([1, 2])
>>> b = np.ones_like(a)
>>> type(b)
<class 'cupy.core.core.ndarray'>

On the other hand, if we have an array_like that we would like to create a CuPy array from via NumPy's API, that's not possible. I think it would be helpful to have something like:

import numpy as np, cupy as cp
a = cp.array([1, 2])
b = [1, 2]
c = np.asarray(b, like=a)

Any ideas/suggestions on this?
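
A rough sketch of how such a `like=` keyword could behave, using a stand-in `MyDuckArray` class instead of CuPy (the `asarray_with_like` helper and the `from_array_like` hook are hypothetical names for illustration, not NumPy or CuPy API):

```python
import numpy as np

class MyDuckArray:
    """Stand-in for a non-NumPy array type such as cupy.ndarray."""
    def __init__(self, data):
        self.data = np.asarray(data)

    @classmethod
    def from_array_like(cls, array_like):
        # How this library would build one of its own arrays from
        # nested lists and other array_likes.
        return cls(array_like)

def asarray_with_like(array_like, like=None):
    # Hypothetical np.asarray(..., like=...): defer to the library of `like`
    # when it provides a constructor hook; otherwise coerce to np.ndarray.
    if like is not None and hasattr(type(like), 'from_array_like'):
        return type(like).from_array_like(array_like)
    return np.asarray(array_like)

a = MyDuckArray([1, 2])
c = asarray_with_like([1, 2], like=a)  # built by MyDuckArray, not NumPy
d = asarray_with_like([1, 2])          # plain np.ndarray as usual
```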

@shoyer
Member

shoyer commented Jul 1, 2019 via email

@pentschev
Contributor

np.copy_like sounds good too. I agree, we most likely should have ways to control things such as dtype.

Sorry for the beginner's question, but should something like np.copy_like be an amendment to NEP-22, should it be discussed in the mailing list, or what would be the most appropriate approach to that?

@shoyer
Member

shoyer commented Jul 1, 2019

We don't really have strict rules about this, but I would lean towards putting np.copy_like and np.duckarray (or whatever we call it) together into a new NEP on coercing/creating duck arrays, one that is prescriptive like NEP 18 rather than "Informational" like NEP 22. It doesn't need to be long, most of the motivation is already clear from referencing NEP 18/22.

One note about np.copy_like(): it should definitely do dispatching with __array_function__ (or something like it), so operations like np.copy_like(sparse_array, like=dask_array) could be defined on either array type.
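
The "defined on either array type" point can be sketched with a toy dispatcher. This is not the real __array_function__ machinery; copy_like, the __copy_like__ hook, and DaskLikeArray are all made-up names that only mimic the idea of giving both arguments a chance to handle the call:

```python
import numpy as np

def copy_like(src, like):
    # Toy dispatch: try both arguments before falling back to a NumPy copy.
    for candidate in (src, like):
        handler = getattr(type(candidate), '__copy_like__', None)
        if handler is not None:
            result = handler(candidate, src, like)
            if result is not NotImplemented:
                return result
    return np.array(src)  # default: plain NumPy copy

class DaskLikeArray:
    """Stand-in for e.g. a dask array type that knows how to convert."""
    def __init__(self, data):
        self.data = list(data)

    def __copy_like__(self, src, like):
        if isinstance(like, DaskLikeArray):
            return DaskLikeArray(getattr(src, 'data', src))
        return NotImplemented

out = copy_like([1, 2], like=DaskLikeArray([]))  # handled by DaskLikeArray
fallback = copy_like([1, 2], like=np.empty(2))   # falls back to np.ndarray
```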

@pentschev
Contributor

Great, thanks for the info, and I agree with your dispatching proposal. I will work on an NEP for the implementation of both np.duckarray and np.copy_like and submit a draft PR this week for that.

@shoyer
Member

shoyer commented Jul 1, 2019 via email

@pentschev
Contributor

My pleasure, and thanks a lot for the ideas and support with this work!

@rgommers
Member

rgommers commented Jul 1, 2019

The array_like and copy_like functions would be a little odd to have in the main namespace, I think, since we can't have a default implementation (at least not one that would do the right thing for cupy/dask/sparse/etc.), right? They're only useful when overridden. Or am I missing a way to create arbitrary non-numpy array objects here?

@shoyer
Member

shoyer commented Jul 1, 2019

It's true, these would only really be useful if you want to support duck typing. But certainly np.duckarray and np.copy_like would work even if the arguments are only NumPy arrays -- they would just be equivalent to np.array/np.copy.

@rgommers
Member

rgommers commented Jul 1, 2019

All array implementations have a copy method right? Using that instead of copy_like should work, so why add a new function?

array_like I can see the need for, but we may want to discuss where to put it.

np.duckarray does make sense to me.

I would lean towards putting np.copy_like and np.duckarray (or whatever we call it) together into a new NEP on coercing/creating duck arrays, one that is prescriptive like NEP 18 rather than "Informational" like NEP 22.

+1

@pentschev
Contributor

array_like I can see the need for, but we may want to discuss where to put it.

That's actually the case which I would like to have addressed with something like np.copy_like. I haven't tested, but probably np.copy already dispatches correctly if the array is non-NumPy.

Just to be clear, are you referring also to a function np.array_like? I intentionally avoided such a name because I thought it could be confusing given all the existing references to array_like objects. However, I do now realize that np.copy_like may imply a necessary copy, and I think it would be good to have behavior similar to np.asarray, where the copy only happens if the input is not already a NumPy array. In the case discussed here, the best would be to make the copy only if a is not the same type as b in a call such as np.copy_like(a, like=b).
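
The asarray-style semantics suggested here (convert only when the types actually differ) can be sketched as follows; copy_like is a hypothetical function, and for simplicity this sketch only knows how to target np.ndarray:

```python
import numpy as np

def copy_like(a, like):
    # Hypothetical np.copy_like with np.asarray-style semantics: if `a` is
    # already the same array type as `like`, return it unchanged (no copy).
    if type(a) is type(like):
        return a
    # Only the ndarray target is implemented in this sketch; a real version
    # would dispatch to the library of `like`.
    if isinstance(like, np.ndarray):
        return np.asarray(a)
    raise TypeError(f"cannot convert to {type(like).__name__}")

x = np.arange(3)
same = copy_like(x, like=np.empty(1))  # same type: returned as-is, no copy
conv = copy_like([0, 1, 2], like=x)    # list -> ndarray conversion
```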

@rgommers
Member

rgommers commented Jul 1, 2019

I haven't tested, but probably np.copy already dispatches correctly if the array is non-NumPy.

It should, it's decorated to support __array_function__.

Just to be clear, are you referring also to a function np.array_like? I intentionally avoided such a name because I thought it could be confusing to all existing references to array_like-arrays.

Yes. And yes agree it can be confusing.

However, I do now realize that np.copy_like may imply a necessary copy,

Yes that name implies a data copy.

may imply a necessary copy, and I think it would be good to have a behavior similar to np.asarray,

I thought that that was np.duckarray.

@jakirkham
Contributor Author

I think Peter's example above might help clarify this. Copied below and subbed in np.copy_like for simplicity.

import numpy as np, cupy as cp
a = cp.array([1, 2])
b = [1, 2]
c = np.copy_like(b, like=a)

@pentschev
Contributor

I thought that that was np.duckarray.

Actually, np.duckarray will basically do nothing and just return the array itself (if overridden); otherwise it falls back to np.asarray (yielding a NumPy array). We can't get a CuPy array from a Python list with it, for example. We still need a function that can be dispatched to CuPy (or the library of any other like= array) for an array_like.

Thanks @jakirkham for the updated example.

@rgommers
Member

rgommers commented Jul 1, 2019

c = np.copy_like(b, like=a)

So that will dispatch to CuPy via a.__array_function__ and fail if that attribute doesn't exist (e.g. a=<scipy.sparse matrix> wouldn't work)? It feels like we need a new namespace or a new interoperability utilities package for those kinds of things. Either that or leave it to a more full-featured future dispatching mechanism where one could simply do:

with cupy_backend:
   np.array(b)

Introducing new functions in the main namespace that don't make sense for NumPy itself, just to work around a limitation of __array_function__, seems a bit unhealthy...

@pentschev
Contributor

So that will dispatch to CuPy via a.__array_function__ and fail if that attribute doesn't exist (e.g. a=<scipy.sparse matrix> wouldn't work)?

I wouldn't say it has to fail necessarily. We could default to NumPy and raise a warning (or no warning at all), for example.

It feels like we need a new namespace or new interoperability utilities package for those kind of things. Either that or leave it to a more full-featured future dispatching mechanism

Certainly it would be nice to have a full-featured dispatching mechanism, but I imagine this wasn't done before due to its complexity and backwards compatibility issues? I wasn't around when discussions happened, so just guessing.

Introducing new functions in the main namespace that don't make sense for NumPy itself, just to work around a limitation of __array_function__, seems a bit unhealthy...

I certainly see your point, but I also think that if we move too many things away from the main namespace, it could scare users off. Maybe I'm wrong and this is just an impression. Either way, I'm not at all proposing to implement functions that won't work with NumPy, just ones that may not be absolutely necessary when using NumPy by itself.

@pentschev
Contributor

pentschev commented Jul 1, 2019

Introducing new functions in the main namespace that don't make sense for NumPy itself, just to work around a limitation of __array_function__, seems a bit unhealthy...

Actually, in this sense, also np.duckarray wouldn't belong in the main namespace.

@rgommers
Member

rgommers commented Jul 1, 2019

Actually, in this sense, also np.duckarray wouldn't belong in the main namespace.

I think that one is more defensible (analogous to asarray: it would basically check "does this meet our definition of an ndarray-like duck type"), but yes. If we also want to expose array_function_dispatch, and we have things like np.lib.mixins.NDArrayOperatorsMixin and plan on writing more mixins, a sensible new submodule for all things interoperability-related could make sense.

Certainly it would be nice to have a full-featured dispatching mechanism, but I imagine this wasn't done before due to its complexity and backwards compatibility issues? I wasn't around when discussions happened, so just guessing.

I think there are multiple reasons. __array_function__ is similar to things we already had, so it's easier to reason about. It has low overhead. It could be designed and implemented on a ~6 month timescale, and @shoyer made a strong case that we needed that. And we had no concrete alternative.

@pentschev
Contributor

sensible new submodule for all things interoperability related could make sense.

No real objections from me, I think it's better to have functionality somewhere rather than nowhere. :)

I think there are multiple reasons. __array_function__ is similar to things we already had, so it's easier to reason about. It has low overhead. It could be designed and implemented on a ~6 month timescale, and @shoyer made a strong case that we needed that. And we had no concrete alternative.

But if we want to leverage __array_function__ more broadly, do we have other alternatives now to implementing things like np.duckarray and np.copy_like (or whatever else we decide to call them)? I'm open to all alternatives, but right now I don't see any other than going the full-featured dispatching way, which is likely going to take a long time, limit the scope of __array_function__ tremendously, and basically render it impractical for most of the more complex cases I've seen.

@rgommers
Member

rgommers commented Jul 1, 2019

But if we want to leverage __array_function__ more broadly, do we have other alternatives now to implementing things like np.duckarray and np.copy_like (or whatever else we would decide to call it)?

I think you indeed need a set of utility features like that, to go from covering some fraction of use cases to >80% of use cases. I don't think there's a way around that. I just don't like cluttering up the main namespace, so I propose finding a better place for them.

I'm open to all alternatives, but right now I don't see any, of course, rather than going the full-feature dispatching way, which is likely going to take a long time and limit the scope of __array_function__ tremendously (and basically rendering it impractical for most of the more complex cases I've seen).

I mean, we're just plugging a few obvious holes here right? We're never going to cover all of the "more complex cases". Say you want to override np.errstate or np.dtype, that's just not going to happen with the protocol-based approach.

As for alternatives, uarray is not yet there and I'm not convinced yet that the overhead will be pushed down low enough to be used by default in NumPy, but it's getting close and we're about to try it to create the scipy.fft backend system (WIP PR: scipy/scipy#10383). If that does prove itself there, it should be considered as a complete multiple dispatch solution. And it already has a numpy API with Dask/Sparse/CuPy/PyTorch/XND backends, some of which are complete enough to be usable: https://github.com/Quansight-Labs/uarray/tree/master/unumpy

@jakirkham
Contributor Author

The dispatch approach with uarray is certainly interesting. Though I'm still concerned about how we handle meta-arrays (like Dask, xarray, etc.). Please see this comment for details. It's unclear this has been addressed (though please correct me if I've missed something). I'd be interested in working with others at SciPy to try and hash out how we solve this problem.

@rgommers
Member

rgommers commented Jul 1, 2019

Please see this comment for details. It's unclear this has been addressed (though please correct me if I've missed something).

I think the changes of the last week resolve that, but not sure - let's leave that for another thread.

I'd be interested in working with others at SciPy to try and hash out how we solve this problem.

I'll be there, would be great to meet you in person.

@shoyer
Member

shoyer commented Jul 1, 2019

Maybe np.coerce_like() or np.cast_like() would be better names than copy_like, so that it's clear that copies are not necessarily required. The desired functionality is indeed pretty similar to the .astype() method, except we want to convert array types as well as dtypes, and it should be a function rather than a protocol so it can be implemented by either argument.

@hameerabbasi
Contributor

@pentschev This was the case until recently, when we added the ability to "register" a backend, but we recommend that only NumPy (or a reference implementation) do this. Then users using Dask would need just a single set_backend call.

@pentschev
Contributor

Got it, I guess this is what @rgommers mentioned in #13831 (comment), pointing to the backends in https://github.com/Quansight-Labs/uarray/tree/master/unumpy.

Sorry for so many questions, but what if some hypothetical application relies on various backends, for example both NumPy and Sparse, where depending on the user input everything may be NumPy-only, Sparse-only, or a mix of both? @peterbell10 mentioned in #13831 (comment) that multiple backends are supported, but can the selection of backend be made automatic, or would there be a need to handle the three cases separately?

@hameerabbasi
Contributor

So, for this case, you would ideally register NumPy, use a context manager for Sparse, and return NotImplemented from Sparse when appropriate, which would make the call fall back to NumPy.
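
That fall-back pattern can be sketched without uarray itself; the `dispatch` helper, `SparseBackend`, and `NumpyBackend` below are toy stand-ins that only mimic the idea (try each backend in turn, treating NotImplemented as "pass the call to the next one"):

```python
import numpy as np

class SparseBackend:
    # Stand-in for a Sparse backend: only handles its own array type.
    @staticmethod
    def sum(x):
        if not getattr(x, 'is_sparse', False):
            return NotImplemented  # not a sparse array: decline the call
        return sum(x.data)

class NumpyBackend:
    # The "registered" default backend that everything falls back to.
    @staticmethod
    def sum(x):
        return np.sum(x)

def dispatch(name, x, backends):
    # Try each backend in order; NotImplemented means "try the next one".
    for backend in backends:
        result = getattr(backend, name)(x)
        if result is not NotImplemented:
            return result
    raise TypeError("no backend accepted the call")

total = dispatch('sum', np.arange(4), [SparseBackend, NumpyBackend])  # 6
```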

@jakirkham
Contributor Author

At SciPy, @rgommers, @danielballan, and I talked about this issue. We concluded it would be valuable to proceed with adding duckarray (using that name). That said, it sounded like this would be slated for 1.18. Though please correct me if I misunderstood things. Given this, would it be alright to start a PR?

@shoyer
Member

shoyer commented Jul 18, 2019

We concluded it would be valuable to proceed with adding duckarray (using that name). That said, it sounded like this would be slated for 1.18. Though please correct me if I misunderstood things. Given this, would it be alright to start a PR?

This all sounds great to me, but it would be good to start with a short NEP spelling out the exact proposal. See #13831 (comment)

@jakirkham
Contributor Author

Sure that makes sense. 🙂

@jakirkham
Contributor Author

As for the copying point that has been brought up previously, I'm curious if this isn't solved through existing mechanisms. In particular what about these lines?

a2 = np.empty_like(a1)
a2[...] = a1[...]

Admittedly it would be nice to get this down to one line. Just curious whether this already works for that use case or if we are missing things.

@pentschev
Contributor

We concluded it would be valuable to proceed with adding duckarray (using that name).

This all sounds great to me, but it would be good to start with a short NEP spelling out the exact proposal. See #13831 (comment)

I have already started to write that, haven't been able to complete it yet though (sorry for my bad planning #13831 (comment)).

@pentschev
Contributor

As for the copying point that has been brought up previously, I'm curious if this isn't solved through existing mechanisms. In particular what about these lines?

a2 = np.empty_like(a1)
a2[...] = a1[...]

Admittedly it would be nice to get this down to one line. Just curious whether this already works for that use case or if we are missing things.

You can do that, but it may require special copying logic (such as in CuPy cupy/cupy#2079).

That said, a copy function may be best, to keep this sort of additional code from being necessary.

On the other hand, this would be sort of a replacement for asarray. So I was wondering: instead of some new copy_like function, would we want to revisit the idea suggested by NEP-18?

These will need their own protocols:
...
array and asarray, because they are explicitly intended for coercion to actual numpy.ndarray object.

If there's a chance we would like to revisit that, maybe would be better to start a new thread. Any ideas, suggestions, objections?

@pentschev
Contributor

Just to be clear on my comment above, I myself don't know if a new protocol is a great idea (there are probably many cumbersome details involved that I don't foresee); I'm really just wondering if that's an idea we should revisit and discuss.

@rgommers
Member

The consensus from the dev meeting and sprint at SciPy'19 was: let's get 1.17.0 out the door and get some real-world experience with it before taking any next steps.

really just wondering if that's an idea we should revisit and discuss.

probably yes, but in a few months.

@pentschev
Contributor

probably yes, but in a few months.

Ok, thanks for the reply!

@shoyer
Member

shoyer commented Jul 18, 2019

As for the copying point that has been brought up previously, I'm curious if this isn't solved through existing mechanisms. In particular what about these lines?

a2 = np.empty_like(a1)
a2[...] = a1[...]

Admittedly it would be nice to get this down to one line. Just curious whether this already works for that use case or if we are missing things.

My main issue with this is that it wouldn't work for duck arrays that are immutable, which is not terribly uncommon. Also, for NumPy the additional cost of allocating an array and then filling it may be nearly zero, but I'm not sure that's true for all duck arrays.

@shoyer
Member

shoyer commented Jul 18, 2019

As for the copying point that has been brought up previously, I'm curious if this isn't solved through existing mechanisms. In particular what about these lines?

a2 = np.empty_like(a1)
a2[...] = a1[...]

Admittedly it would be nice to get this down to one line. Just curious whether this already works for that use case or if we are missing things.

You can do that, but it may require special copying logic (such as in CuPy cupy/cupy#2079).

That said, a copy function may be best, to keep this sort of additional code from being necessary.

On the other hand, this would be sort of a replacement for asarray. So I was wondering: instead of some new copy_like function, would we want to revisit the idea suggested by NEP-18?

These will need their own protocols:
...
array and asarray, because they are explicitly intended for coercion to actual numpy.ndarray object.

If there's a chance we would like to revisit that, maybe would be better to start a new thread. Any ideas, suggestions, objections?

I don't think it's a good idea to change the behavior of np.array or np.asarray with a new protocol. Their established meaning is to cast to NumPy arrays, which is basically why we need np.duckarray.

That said, we could consider adding a like argument to duckarray. That would require changing the protocol from the simplified proposal above -- maybe to use __array_function__ instead of a dedicated protocol like __duckarray__? I haven't really thought this through.
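
One possible reading of that suggestion, as a sketch: `duckarray` below is a plain function standing in for the hypothetical np.duckarray, and `DuckArray` is a toy type; the routing through __array_function__ here only mimics the real protocol machinery:

```python
import numpy as np

def duckarray(array_like, like=None):
    # Hypothetical np.duckarray(..., like=...): if `like` participates in
    # __array_function__, let its library construct the result. The module
    # check guards out np.ndarray itself, which also defines the attribute.
    if (like is not None and type(like).__module__ != 'numpy'
            and hasattr(like, '__array_function__')):
        return like.__array_function__(duckarray, (type(like),),
                                       (array_like,), {'like': like})
    # Fall back to coercion, as in the simplified proposal above.
    return np.asarray(array_like)

class DuckArray:
    def __init__(self, data):
        self.data = list(data)

    def __array_function__(self, func, types, args, kwargs):
        if func is duckarray:
            return DuckArray(args[0])
        return NotImplemented

b = duckarray([1, 2], like=DuckArray([]))  # built by DuckArray's "library"
c = duckarray([1, 2])                      # plain np.ndarray fallback
```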

@jakirkham
Contributor Author

As for the copying point that has been brought up previously, I'm curious if this isn't solved through existing mechanisms. In particular what about these lines?

a2 = np.empty_like(a1)
a2[...] = a1[...]

Admittedly it would be nice to get this down to one line. Just curious whether this already works for that use case or if we are missing things.

My main issue with this is that it wouldn't work for duck arrays that are immutable, which is not terribly uncommon. Also, for NumPy the additional cost of allocating an array and then filling it may be nearly zero, but I'm not sure that's true for all duck arrays.

That's fair. Actually we can already simplify things. For instance this works with CuPy and Sparse today.

a2 = np.copy(a1)

@shoyer
Member

shoyer commented Jul 18, 2019

That's fair. Actually we can already simplify things. For instance this works with CuPy and Sparse today.

a2 = np.copy(a1)

Yes, but we also want "copy this duck-array into the type of this other duck-array"

@pentschev
Contributor

I don't think it's a good idea to change the behavior of np.array or np.asarray with a new protocol. Their established meaning is to cast to NumPy arrays, which is basically why we need np.duckarray

I'm also unsure about this, and I was reluctant even to raise this question; this is why I hadn't until today.

That said, we could consider adding a like argument to duckarray. That would require changing the protocol from the simplified proposal above -- maybe to use __array_function__ instead of a dedicated protocol like __duckarray__? I haven't really thought this through.

I don't know if there would be any complications with that (we probably need some careful thought), but I tend to like this idea. It may seem redundant on various levels, but maybe, to follow the existing pattern, instead of adding a like parameter we could have duckarray and duckarray_like?

@jakirkham
Contributor Author

Yes, but we also want "copy this duck-array into the type of this other duck-array"

What about basing this around np.copyto?

@pentschev
Contributor

What about basing this around np.copyto?

Feel free to correct me if I'm wrong, but I'm assuming you mean something like:

np.copyto(cupy_array, numpy_array)

That could work, assuming NumPy is willing to change the current behavior. E.g., asarray always implies the destination is a NumPy array; does copyto make the same assumption?

@shoyer
Member

shoyer commented Jul 18, 2019

np.copyto already supports dispatching with __array_function__, but it's roughly equivalent to:

def copyto(dst, src):
    dst[...] = src

We want the equivalent of:

def copylike(src, like):
    dst = np.empty_like(like)
    dst[...] = src
    return dst

@pentschev
Contributor

np.copyto already supports dispatching with __array_function__, but it's roughly equivalent to:

def copyto(dst, src):
    dst[...] = src

We want the equivalent of:

def copylike(src, like):
    dst = np.empty_like(like)
    dst[...] = src
    return dst

Correct, this is what we want. copyto gets dispatched and works if source and destination have the same type; we need something that allows dispatching to the destination array's library.

@jakirkham
Contributor Author

jakirkham commented Jul 18, 2019

Well, copyto could still make sense depending on how we think of it. Take for example the following use case.

np.copyto(cp.ndarray, np.random.random((3,)))

This could translate into something like "allocate and copy over the data", as we have discussed. If we dispatch on dst (cp.ndarray in this case), then libraries with immutable arrays could implement this in a suitable manner as well. It also saves us from adding a new API (one that NumPy merely provides, but doesn't use), which seemed to be a concern.

@jakirkham
Contributor Author

Just to surface another thought that occurred to me recently: it's worth thinking about what these APIs will mean downstream between other libraries (for instance how Dask and Xarray interact).

seberg pushed a commit that referenced this issue Aug 5, 2019
This NEP proposes the introduction of the __duckarray__ protocol, as described at a high level by NEP-22 and further discussed in #13831.

We have another idea by @shoyer on how to handle duck array typing through __array_function__, as mentioned in #13831 (comment):

we could consider adding a like argument to duckarray. That would require changing the protocol from the simplified proposal above -- maybe to use __array_function__ instead of a dedicated protocol like __duckarray__? I haven't really thought this through.
The idea above seems viable, and perhaps more complete as well. That said, I want to either extend this NEP to cover that, or maybe write a separate NEP so we can discuss and judge which one is a better solution. In the meantime, let's start discussing the text here.