Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Please consider adding a very simple abstract base class that promises the array interface #20459

Closed
NeilGirdhar opened this issue Nov 25, 2021 · 18 comments
Labels
23 - Wish List 62 - Python API Changes or additions to the Python API. Mailing list should usually be notified.

Comments

@NeilGirdhar
Copy link
Contributor

NeilGirdhar commented Nov 25, 2021

Proposed new feature or change:

Many libraries (pandas-dev/pandas#44617) are checking for the array interface like so: isinstance(x, np.ndarray), but this fails for jax.numpy.ndarray despite it implementing the array interface (google/jax#8701).

Instead of trying to convince everyone to check for members, could we get an abstract base class that people could inherit from if they want to promise to support the numpy array interface? For example, numpy.AbstractArray? It doesn't need to have any abstract members, but could for example provide a default __array_interface__.

Then, Pandas and Seaborn could replace their isinstance(x, np.ndarray) with isinstance(x, np.AbstractArray). And all Jax has to do is make their various jax.numpy.ndarray classes inherit from AbstractArray. Both of these are easy changes.

Related: #19752. Although this proposal is for an extremely slim (possibly empty) abstract base class.

@NeilGirdhar NeilGirdhar changed the title ENH: Please consider adding a simple protocol to check if something implements the array interface ENH: Please consider adding a very simple abstract base class that promises the array interface Nov 26, 2021
@seberg
Copy link
Member

seberg commented Nov 29, 2021

I am fine with adding an ABC if there is buy-in from those libraries to register/inherit with it. You could also have the ABC check for __array_ufunc__ or __array_priority__, than it doesn't even matter whether JAX implements it.

In fact, you could just write such an ABC yourself and show us how nice it is ;).

You probably need to clearly define what defines an array which can register? Is it one with __array_interface__ because we can provide zero-copy views, is it one that provides __array_ufunc__, because that covers dask?

@seberg seberg added 23 - Wish List 62 - Python API Changes or additions to the Python API. Mailing list should usually be notified. labels Nov 29, 2021
@NeilGirdhar
Copy link
Contributor Author

NeilGirdhar commented Nov 30, 2021

As requested (Along with some default behaviour from the docs):

import numpy as np
import jax.numpy as jnp
from abc import ABC
from typing import Type, Any, Literal, Union, Optional, TypeVar

T = TypeVar('T', bound='AbstractArray')

class AbstractArray(ABC):
    @classmethod
    def __subclasshook__(cls, subclass: Type[Any]) -> Union[bool, Literal[NotImplemented]]:
        if hasattr(subclass, '__array_ufunc__') or hasattr(subclass, '__array_priority__'):
            return True
        return super().__subclasshook__(subclass)
    
    def __array_finalize__(self) -> None:
        pass
    
    def __array_prepare__(self: T, context: Optional[Any] = None) -> T:
        return self
    
    def __array_wrap__(self: T, context: Optional[Any] = None) -> T:
        return self

    __array_priority__ = 0

print(issubclass(np.ndarray, AbstractArray))  # True
print(issubclass(jnp.ndarray, AbstractArray))  # False, but can be fixed with AbstractArray.register(jnp.ndarray)
print(isinstance(jnp.zeros(10), AbstractArray))  # True

I guess this might fit in numpy.lib.mixins.py.

You probably need to clearly define what defines an array which can register? Is it one with __array_interface__ because we can provide zero-copy views, is it one that provides __array_ufunc__, because that covers dask?

It appears that Jax does neither of these things, so does it make sense to block them?

@NeilGirdhar
Copy link
Contributor Author

It just occurred to me that a much simpler solution would be to simply make np.ndarray inherit from ABC. Then, none of the instance and subclass checks in Pandas, Seaborn, etc. would have to change.

It would then be possible to do:

np.ndarray.register(jnp.ndarray)  # Boom!

The only downside would be that np.ndarray would absorb the metaclass ABCMeta.

@seberg
Copy link
Member

seberg commented Dec 1, 2021

The other downside is that it would break everything :)? You can't adhere to an ndarray protocol without proper subclassing...

@BvB93
Copy link
Member

BvB93 commented Dec 1, 2021

So what exactly is the intended purpose of this ABC and/or protocol?

Are we talking about recognizing objects that can be coerced into an array?
Objects with dedicated ufunc support? Objects that duck-type as an array? etc.

@NeilGirdhar
Copy link
Contributor Author

NeilGirdhar commented Dec 1, 2021

@seberg Sorry, could you elaborate so that I can understand better?

@BvB93 The point is to fix the case—in the nicest way possible—whereby Jax arrays that implement the numpy array interface are not recognized or treated as such by libraries like Seaborn and Pandas.

Either we have to convince packages like Pandas to replace their if isinstance(x, np.ndarray) with if hasattr(x, '__array_ufunc__') or hasattr(x, '__array_priority__'), or we provide the above ABC, and we convince them to write if isinstance(x, np.AbstractArray). I thought the latter would be an easier sell. Plus, it lets Numpy update the definition of what an array is.

The other option would be to convince the Jax team to have all of their array classes inherit from np.ndarray, but I don't know enough about their internals to know whether that's an option.

@seberg
Copy link
Member

seberg commented Dec 1, 2021

@NeilGirdhar did you ever see an ABC being a concrete class? The answer is "no" because it doesn't make sense (possibly ABC even rejects this). An ABC signals adherence to some "protocol". For a concrete (non abstract) class that protocol is everything (including C-ABI, which you seem to forget about). If everything is what you need, you might as well inherit (which you obviously can't).

Either we have to convince packages like Pandas to replace their if isinstance(x, np.ndarray) with if hasattr(x, 'array_ufunc') or hasattr(x, 'array_priority'), or we provide the above ABC

OK, but if pandas has an intention to use such hasattr duck-checks and use isinstance to do so conveniently, they could start off with writing such an ABC themselves (sure JAX can't register with it nicely, but it is not really needed).

To be convincing, you either need to show that pandas devs really want this, or that it is obvious useful in some generic way. Since people can also have it (by using hasattr) it would seem possible to show a concrete example.

Right now it feels very much not thought through what exactly you want this ABC to mean, and whether it is enough/useful.

The other option would be to convince the Jax team to have all of their array classes inherit from np.ndarray, but I don't know enough about their internals to know whether that's an option.

Again, this is simply impossible.

@NeilGirdhar
Copy link
Contributor Author

NeilGirdhar commented Dec 1, 2021

@NeilGirdhar did you ever see an ABC being a concrete class? The answer is "no" because it doesn't make sense (possibly ABC even rejects this).

True, I haven't. But all that ABC does is bring in the metaclass ABCMeta. If a class X inherits from, say, collections.abc.Sequence—that class's metaclass also inherits from ABCMeta. It then also brings in the whole ABC machinery: you can say X.register(...), etc.

And I've seen plenty of concrete classes that inherit from the collections ABCs. So, these ordinary concrete classes become usable as ABCs whether you want them to be or not. This surprised me when I realized this, but that's what led me to propose having np.ndarray inherit from ABC. If you were to make it inherit from the mixin, which would be reasonable, you would end up in exactly the same situation.

start off with writing such an ABC themselves (sure JAX can't register with it nicely, but it is not really needed).

That's not realistically going to happen. They would have to make the ABC, and users would have to register the Jax classes with that ABC. They'll never buy into something like that.

To be convincing, you either need to show that pandas devs really want this,

But they don't care about this; I do. Because I happen to use Jax with Pandas and Seaborn. So this my problem—not theirs.

I just want the nicest solution. My current solution is to carefully cast all Jax arrays to Numpy arrays before passing them. This is annoying. It would be awesome if everything just worked.

Right now it feels very much not thought through what exactly you want this ABC to mean, and whether it is enough/useful.

The goal is that by doing the replacement that I suggest in my last comment, passing Jax arrays to Pandas and Seaborn would be possible. Currently, it causes various crashing bugs.

Again, this is simply impossible.

Oh you know Jax well? Great, I didn't realize that 😄

@seberg
Copy link
Member

seberg commented Dec 1, 2021

That's not realistically going to happen. They would have to make the ABC, …

Maybe, but we probably could still have a proof of concept, or a: "yeah, that sounds interesting" from pandas. I simply do not know whether or in what form this would be useful for pandas.

and users would have to register the Jax classes with that ABC.

I thought you can write that ABC based on hasattr? But maybe not, maybe you need registration to nail it down well enough. Then the question is why and what "protocol" the ABC actually guarantees?

Is it hasattr("__array_interface") and "not a dataframe" (meaning it must be registered)? Will other libraries want the version that just uses hasattr("__array_interface__") because they don't care about dataframe or not?

Nailing this down is not trivial, it is a reasonable request, but I am not sure anyone at NumPy can nail it down without downstream input.

Oh you know Jax well? Great, I didn't realize that smile

Well enough to know that even if it was possible (for the record, I am pretty sure JAX will store data on the GPU, which would proof it most definitely is not) it would be too much of a stretch, but I guess you agree.

@NeilGirdhar
Copy link
Contributor Author

NeilGirdhar commented Dec 1, 2021

I thought you can write that ABC based on hasattr? But maybe not, maybe you need registration to nail it down well enough.

Right.

Nailing this down is not trivial, it is a reasonable request, but I am not sure anyone at NumPy can nail it down without downstream input.

Okay, if you help me understand the exact downstream questions, maybe I can find some time to ask the right people for input?

Thanks, by the way, for taking the time helping me to wrap my head around this.

Well enough to know that even if it was possible (for the record, I am pretty sure JAX will store data on the GPU, which would proof it most definitely is not) it would be too much of a stretch, but I guess you agree.

Oh, I don't know the Jax internals well at all. I imagine there's some reason they don't inherit from np.ndarray though.

@jorisvandenbossche
Copy link
Contributor

To be convincing, you either need to show that pandas devs really want this,

But they don't care about this; I do. Because I happen to use Jax with Pandas and Seaborn. So this my problem—not theirs.

It's not because no-one answered yet (pandas-dev/pandas#44617), that "we" don't care. I personally think it is a reasonable feature request that we check for array-likes in more places. For example for checking this in the DataFrame constructor, there was just another issue reported as well (pandas-dev/pandas#44616). We might need some discussion (on the pandas side) to which places in the API surface to limit this to, though.

@NeilGirdhar
Copy link
Contributor Author

It's not because no-one answered yet (pandas-dev/pandas#44617), that "we" don't care.

Sorry! I think this came out wrong. I was just trying to be modest in my proposal. I wasn't sure how big of an issue this would be for pandas 😄

@rgommers
Copy link
Member

rgommers commented Dec 3, 2021

Oh, I don't know the Jax internals well at all. I imagine there's some reason they don't inherit from np.ndarray though.

Because that doesn't make sense - JAX's ndarray has different semantics than NumPy's ndarray. A shortlist that comes to mind:

  • different dtype promotion rules
  • not providing control over memory ordering (C or Fortran), and only implementing contiguous C-order arrays
  • arrays being read-only
  • no views, so different semantics for even very simple code involving slices of arrays
  • implementing only a subset of the dtypes that numpy has

Your original request unfortunately isn't well-defined. Pandas needs either an isinstance check or an np.asarray coercion of inputs to numpy.ndarray because there are exactly zero other array implementations that have the same semantics as NumPy. CuPy comes closest, but doesn't work on CPU so that doesn't help Pandas.

The answer at google/jax#8701 (comment) seems correct to me.

If it would have made sense for other libraries to yield True for isinstance(x, np.ndarray) then we probably would have no need for the array API standard.

@cgarciae
Copy link

cgarciae commented Dec 3, 2021

I am wondering if an is_arraylike function as suggested here google/jax#8701 (comment) could be part of numpy for convenience?

@seberg
Copy link
Member

seberg commented Dec 3, 2021

The problem is, that there are a few subtle possibilities for is_arraylike:

  • Supports the overrides __array_function__ and __array_ufunc__
  • Supports __array_interface__
  • Supports __array__()
  • Supports the new data-api, i.e. has __array_namespace__
  • Maybe pandas would even use: Supports __array__(), but not __dataframe__()?
  • EDIT: And even, "Can be converted to a NumPy array using np.array()"?

What I am unclear about is, that we should add such an ABC/function unless we know that the same version will be useful across various use-cases. Especially, since a simple hasattr("__array_interface__") is not really an untypical practice for duck-typing.
Unless we either need to have registration with ABC.register() or wish to prescribe a certain kind of "duck" for multiple projects, I am unsure we gain much by adding one, or many.

However, I have not done a survey, or specific use-cases in mind. If pandas said: we would use XY all-over, and one other project agrees, that might just be good enough. But at least the distinction override support and "holds NumPy-style memory" seems something you need (so the minimum is 2 versions, 3 if you count the data-api).

@rgommers
Copy link
Member

rgommers commented Dec 3, 2021

Agreed with @seberg. To add to his list: arrays that implement the buffer protocol but not __array__ or __array_interface__. And IIRC the only good way to check for the buffer protocol is to just try the conversion with memoryview(x_in).

@NeilGirdhar
Copy link
Contributor Author

NeilGirdhar commented Dec 3, 2021

@rgommers I just learned about the array API standard today when I was talking with @cgarciae . Looks like a fantastic project that essentially supplants this issue completely. I guess I had the same goal, but I mistakenly thought it could be solved with a simple ABC. I can see now that it really can't.

Feel free to close this suggestion, and thank you everyone for the illuminating discussion. I learned a lot.

@seberg
Copy link
Member

seberg commented Dec 3, 2021

Closing, if you want to read more, see also: https://numpy.org/neps/nep-0022-ndarray-duck-typing-overview.html maybe ;).

@seberg seberg closed this as completed Dec 3, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
23 - Wish List 62 - Python API Changes or additions to the Python API. Mailing list should usually be notified.
Projects
None yet
Development

No branches or pull requests

6 participants