Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TYP: ExtensionArray.take accept np.ndarray #43418

Merged
merged 15 commits into from
Sep 26, 2021

Conversation

twoertwein
Copy link
Member

@twoertwein twoertwein commented Sep 5, 2021

@jreback jreback added the Typing type annotations, mypy/pyright type checking label Sep 6, 2021
@twoertwein
Copy link
Member Author

twoertwein commented Sep 10, 2021

Using the SequenceIndexer from #41258. This means that Sequence[int] is not any more allowed (only List[int], slice, np.ndarray). I think any sequence should work for take, as all of them end up in np.asarray. Happy to allow Sequence[int] for take (but it would also be great to be consistent across methods: not allowing any sequence). I favor consistency.

@Dr-Irv

@jreback jreback added this to the 1.4 milestone Sep 10, 2021
@jreback jreback added the ExtensionArray Extending pandas with custom dtypes or arrays. label Sep 10, 2021
@simonjayhawkins
Copy link
Member

in the base class the docstring for take is indices : sequence of int. This is our contract with 3rd party EA developers.

pandas/_typing.py Outdated Show resolved Hide resolved
@jreback
Copy link
Contributor

jreback commented Sep 12, 2021

in the base class the docstring for take is indices : sequence of int. This is our contract with 3rd party EA developers.

does Sequence[int] include a suitable ndarray, e.g. ndarray[intp] (prob no?) if so we should extend .take to do this as its quite natural and IMHO expected.

@simonjayhawkins
Copy link
Member

from https://pandas.pydata.org/pandas-docs/dev/development/contributing_docstring.html#parameter-types

If the exact type is not relevant, but must be compatible with a NumPy array, array-like can be specified. If Any type that can be iterated is accepted, iterable can be used:

array-like
iterable

having different rules for the docstring parameter types and the type annotations is bound to create confusion but IIUC we want to keep the docstring rules for consistency with the ecosystem.

There is no definition of the term "sequence" in the docstring guide.

np.ndarray is not an instance of abc.Sequence or type compatible with the typing protocol.

so if we want to annotate using Sequence we need to Union with np.ndarray (and Series, Index, EA too in most cases)

see #28770 for related issue.

@simonjayhawkins
Copy link
Member

I think the reason "sequence" is used in the docstring type parameters section instead of array-like is because we don't accept zero or multi dimension arrays. see #41807 for discussion.

see also #32773, #32773 (comment), #32773 (comment) and #32033 for discussions relating to avoiding the numpydoc type definitions for internal docstrings to therefore avoid the discrepancies (not relevant here as the base class docstrings are our api definitions for 3rd party EA developers)

if we could sometime in the future use the standard type notation in the public dosctrings that would be great (and redundant)

@twoertwein
Copy link
Member Author

I think the reason "sequence" is used in the docstring type parameters section instead of array-like is because we don't accept zero or multi dimension arrays. see #41807 for discussion.

Think we have two options then:

  1. Use Sequence[int] | np.ndarray and in the future add shape information to the np.ndarray when that becomes standardized in numpy
  2. Keep the current state (close this PR)

@simonjayhawkins
Copy link
Member

no need to close. accepting np.ndarray for take as Jeff says is sort of a given since take is effectively a numpy function.

numpy.take accepts array-like which can be any dimension. It only takes in 1 dimension but will populate a multidimensional array if the indices is mutli-dimensional. EAs a predominantly 1d (causes issues with __array_function__ protocol extending the EA interface to be 2d is another topic)

The goal is to add types such that we don't add code in the main codebase that does not pass mypy silently when we break the EA contract with 3rd party EA developers (we effectively broke the contract in #41258 by adding list[int]: A list of int to the accepted types for __getitem__, 3rd party code could break)

so here okay to add np.ndarray, but not okay to add slice, 3rd party EAs aren't expecting a slice. If we add to the type annotations and then pass a slice from the main code, mypy won't report any issues.

@simonjayhawkins
Copy link
Member

In general, type annotations should be as permissive as possible. we can't do that when typing the EA methods as we can't allow wider types than the 3rd party EA developers are expecting.

@Dr-Irv
Copy link
Contributor

Dr-Irv commented Sep 12, 2021

I think the reason "sequence" is used in the docstring type parameters section instead of array-like is because we don't accept zero or multi dimension arrays. see #41807 for discussion.

I would question whether take should accept Sequence[int] or just List[int], because I believe that Sequence[int] would include a tuple of ints, and if we want to be consistent between what .iloc and .loc do, those don't accept tuples.

So maybe take should only accept in its signature List[int] | np.ndarray[int] ??

@simonjayhawkins
Copy link
Member

So maybe take should only accept in its signature List[int] | np.ndarray[int] ??

narrowing the types would not break 3rd party EAs but could break user code. We're quite constrained here without changing the EA api.

to be fair though EA.take is not part of the public api for users.

Index.take and Series.take could continue to be permissive and the user passed arg could be cast before dispatching to the EA.

@Dr-Irv
Copy link
Contributor

Dr-Irv commented Sep 12, 2021

narrowing the types would not break 3rd party EAs but could break user code. We're quite constrained here without changing the EA api.

I think the question is what defines the EA API? Is it the signature in the code? Is it what is in the docs as the signature? Is it what is in the docs that is in the docstring?

If you go to https://pandas.pydata.org/docs/reference/api/pandas.api.extensions.ExtensionArray.take.html?highlight=take#pandas.api.extensions.ExtensionArray.take

the docs don't define the types for the argument indices, even though the signature in the source says Sequence[int] .

If the signature in the code defines the API, are we expecting people to read those signatures by looking directly at pandas source? Shouldn't the docs be the definition of the API?

Finally, I don't think changing the signature would break user code - it might just break it if they did type checking on their code.

@simonjayhawkins
Copy link
Member

I think the question is what defines the EA API? Is it the signature in the code? Is it what is in the docs as the signature? Is it what is in the docs that is in the docstring?

none of the above. only jesting but probably the EA base tests should be in this list.

If you go to https://pandas.pydata.org/docs/reference/api/pandas.api.extensions.ExtensionArray.take.html?highlight=take#pandas.api.extensions.ExtensionArray.take

the docs don't define the types for the argument indices

it's defined in the parameter types section of the docstring. Type annotations across the codebase are incomplete and are not yet part of the official api. (IIRC we had a discussion in the past about removing the type annotations from the published docs)

Finally, I don't think changing the signature would break user code - it might just break it if they did type checking on their code.

sure. but once we change the type annotations that's what mypy checks the code against for consistency. The point of inline type annotations is to prevent type related errors being introduced and latent bugs to be discovered. Adding type annotations as a sort of second level of documentation of what we would like the types to be renders the typing exercise pointless IMHO.

So if we change the annotations it won't immediately break user code (if no code changes are required to get mypy to green). given. but subsequent code changes could and we won't have mypy to stop us.

@twoertwein
Copy link
Member Author

I think, except for maybe finding a better name for TakeIndexer = Union[Sequence[int], np.ndarray], this PR should be ready. I also checked all the affected take docs strings, they all define the type as sequence of int.

Copy link
Member

@simonjayhawkins simonjayhawkins left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @twoertwein lgtm, suggested change to alias

numpy definitions are currently

    @overload
    def take(  # type: ignore[misc]
        self: ndarray[Any, dtype[_ScalarType]],
        indices: _IntLike_co,
        axis: Optional[SupportsIndex] = ...,
        out: None = ...,
        mode: _ModeKind = ...,
    ) -> _ScalarType: ...
    @overload
    def take(  # type: ignore[misc]
        self,
        indices: _ArrayLikeInt_co,
        axis: Optional[SupportsIndex] = ...,
        out: None = ...,
        mode: _ModeKind = ...,
    ) -> ndarray[Any, _DType_co]: ...
    @overload
    def take(
        self,
        indices: _ArrayLikeInt_co,
        axis: Optional[SupportsIndex] = ...,
        out: _NdArraySubClass = ...,
        mode: _ModeKind = ...,
    ) -> _NdArraySubClass: ...
_ArrayLikeInt_co = _ArrayLike[
    "dtype[Union[bool_, integer[Any]]]",
    Union[bool, int],
]
_ArrayLike = Union[
    _NestedSequence[_SupportsArray[_DType]],
    _NestedSequence[_T],
]
# TODO: Wait for support for recursive types
_NestedSequence = Union[
    _T,
    Sequence[_T],
    Sequence[Sequence[_T]],
    Sequence[Sequence[Sequence[_T]]],
    Sequence[Sequence[Sequence[Sequence[_T]]]],
]
# The `_SupportsArray` protocol only cares about the default dtype
# (i.e. `dtype=None` or no `dtype` parameter at all) of the to-be returned
# array.
# Concrete implementations of the protocol are responsible for adding
# any and all remaining overloads
class _SupportsArray(Protocol[_DType_co]):
    def __array__(self) -> ndarray[Any, _DType_co]: ...
_IntLike_co = Union[_BoolLike_co, int, np.integer]
_BoolLike_co = Union[bool, np.bool_]

except for maybe finding a better name for TakeIndexer = Union[Sequence[int], np.ndarray]

so the numpy take indexer is Union[_IntLike_co, _ArrayLikeInt_co] and i'm not sure if numpy have a specific alias for this.

in time (with an update to the EA api), we may want to make EA.take more aligned with numpy.take, but for now we don't accept _Intlike_co?

so that just leaves the _ArrayLikeInt_co bit and we don't allow 2d (or nested) and don't allow bool in the take indexer either.

we already have a ArrayLike alias that is just np.array and ExtensionArray and an AnyArrayLike alias that also includes Index and Series but not Sequence.

So we probably can't use something like _ArrayLikeInt_co yet without extra confusion. xref #41807

IMO TakeIndexer is okay naming for now. (with the expectation that we will align our interface and naming conventions with NumPy over time)

pandas/_typing.py Outdated Show resolved Hide resolved
pandas/_typing.py Outdated Show resolved Hide resolved
Copy link
Member

@simonjayhawkins simonjayhawkins left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @twoertwein lgtm.

(I think I saw some imports from pandas._typing inside TYPE_CHECKING blocks slip through in some recent EA typing PRs. So maybe some other places where the type definitions are not properly guarded)

@simonjayhawkins simonjayhawkins merged commit dd7158b into pandas-dev:master Sep 26, 2021
@twoertwein twoertwein deleted the take branch June 8, 2022 19:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ExtensionArray Extending pandas with custom dtypes or arrays. Typing type annotations, mypy/pyright type checking
Projects
None yet
Development

Successfully merging this pull request may close these issues.

TYP: ensure_platform_int probably needs a dtype annotation
4 participants