Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Align ExtensionArray with abc.Sequence #28770

Open
WillAyd opened this issue Oct 3, 2019 · 11 comments
Open

Align ExtensionArray with abc.Sequence #28770

WillAyd opened this issue Oct 3, 2019 · 11 comments
Labels
Enhancement ExtensionArray Extending pandas with custom dtypes or arrays. Needs Discussion Requires discussion from core team before further action Typing type annotations, mypy/pyright type checking

Comments

@WillAyd
Copy link
Member

WillAyd commented Oct 3, 2019

ref #28762 (comment)

Right now the ExtensionArray is missing a few methods that would allow it to align with the abc.Sequence interface, namely count, index, __reversed__ and __contains__

I think we might want to add those items to the interface to aid our type checking and make for more intuitive use cases of the ExtensionArray

@WillAyd WillAyd added ExtensionArray Extending pandas with custom dtypes or arrays. Needs Discussion Requires discussion from core team before further action labels Oct 3, 2019
@jorisvandenbossche
Copy link
Member

@WillAyd do you know how numpy arrays do that?
Because the function in question pd.array also accepts numpy arrays, so wondering how that works for typing (ndarray has no index or count methods)

@WillAyd
Copy link
Member Author

WillAyd commented Oct 4, 2019

Numpy arrays resolve to Any right now when it comes to typing because they have no annotations in their codebase or typeshed available (AFAIK). So realistically it's almost impossible to cause a type failure with a ndarray in our code base

@jorisvandenbossche
Copy link
Member

For this specific case, don't we have some kind of ArrayLike in our typing module to handle this? (I suppose also Series is no Sequence, while that is often accepted as an array like as well)

Personally, I am a bit hesitant to add count and index methods to EA just for this

@WillAyd
Copy link
Member Author

WillAyd commented Oct 4, 2019

So you wouldn't consider an ExtensionArray to be a type of Sequence, right? Ignoring typing for a second it seems intuitive to me that an ExtensionArray is inherently a Sequence, and if that is true then it would make sense to adhere to the interface as defined by Python

@jorisvandenbossche
Copy link
Member

Somewhat relevant discussions in numpy about this: numpy/numpy#2776 and numpy/numpy#7315 (although they have other issues that complicate things).

Yes, I would consider an EA as a sequence. I mainly find it a bit unfortunate that Python defines methods (count, index) that are not essential for sequence as part of the abc.

@WillAyd
Copy link
Member Author

WillAyd commented Oct 4, 2019

Nice find - I'll give it a deeper look later. FWIW I don't know if we'd necessarily even need to define count or index - I think inheriting from the ABC will take care of that for us (haven't verified)

@WillAyd WillAyd added good first issue and removed Needs Discussion Requires discussion from core team before further action labels Oct 8, 2019
@WillAyd WillAyd added this to the Contributions Welcome milestone Oct 8, 2019
@WillAyd
Copy link
Member Author

WillAyd commented Oct 8, 2019

Removing Needs Discussion label and replacing with good first issue as I think we are aligned and can probably be done easy by community, though if you disagree @jorisvandenbossche lmk

@jorisvandenbossche
Copy link
Member

cc @TomAugspurger

@jorisvandenbossche
Copy link
Member

One other concern that is that adding count might clash with what "count" already means in pandas. We already have a Series.count(), and while adding a EA.count to dispatch to from Series.count was not yet discussed I think, having the same methods with different behaviour is also confusing.

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Oct 8, 2019

The mental clash with our NDFrame methods is unfortunate. I think .index is especially bad.

@jorisvandenbossche jorisvandenbossche added Needs Discussion Requires discussion from core team before further action and removed good first issue labels Oct 10, 2019
@machow
Copy link

machow commented Nov 11, 2019

Hey, coming to this discussion from the numpy issues, and thinking about how numpy arrays and pandas Series fit in terms of collections.abc.

Wanted to leave a few thoughts in case it's helpful--it seems like a kind of tricky issue, since a Series sits sort of between a couple collections.abc types.

What is an abc.Sequence?

It seems like the core issue here is that in python, abc.Sequence has two defining pieces...

  1. It inherits from Reversible and Collection, along with __getitem__
  2. It implements specific index and count methods

What is a pandas Series?

It sounds like a pandas Series is compatible with piece 1 (it is a reversible collection), but not piece 2.

I wonder if there are two approaches here for deciding the type of a Series...

  1. exact - a pandas Series is Reversible and is a Collection
  2. pragmatic - a pandas Series is a Sequence, except for those two methods

IMO, the trade-offs are that the exact approach would be counterintuitive for developers (I don't usually think of anything as a Reversible Collection that also implements __getitem__...), but a pragmatic approach will require people remembering the exception.

Singledispatch example

For example, suppose a Series (or ExtensionArray) was a subclass of Sequence.
Then, singledispatch for a function that gets the first element of a Sequence might look like this...

from functools import singledispatch
from collections.abc import Sequence, Reversible, Collection

@singledispatch
def first(x): 
    raise TypeError("does not support type %s" %type(x))


# Works if a series is a Sequence, but not if you plan to use .index()

@first.register(Sequence)
def _first_sequence(x):
    return next(iter(x))


# Also works pretty well, but doesn't guarantee __getitem__ or __len__

@first.register(Reversible)
def _first_reversible(x):
    # could add isinstance check for a Collection here, or __getitem__, etc..
    next(iter(x))

Since this issue originally laid out potential benefits as type checking, and intuitive use cases for EA--it seems like a Sequence (vs Reversible Collection) would be simpler to type check, and intuitive, but the intuition wouldn't be exactly right.

(This is me thinking out loud, though, so could be very wrong :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement ExtensionArray Extending pandas with custom dtypes or arrays. Needs Discussion Requires discussion from core team before further action Typing type annotations, mypy/pyright type checking
Projects
None yet
Development

No branches or pull requests

6 participants