Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TYP: ExtensionArray delete() and searchsorted() #41513

Merged
merged 20 commits into from
Sep 6, 2021

Conversation

Dr-Irv
Copy link
Contributor

@Dr-Irv Dr-Irv commented May 16, 2021

Additional typing for ExtensionArray delete() and searchsorted()

Copy link
Member

@simonjayhawkins simonjayhawkins left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Dr-Irv . now that the EA methods are being annotated individually, it may make sense to add the annotations to all EA classes for that method to ensure consistency. (e.g. SparseArray looks iffy def searchsorted(self, v, side="left", sorter=None):

pandas/core/arrays/base.py Outdated Show resolved Hide resolved
@@ -1308,7 +1313,7 @@ def __hash__(self) -> int:
# ------------------------------------------------------------------------
# Non-Optimized Default Methods

def delete(self: ExtensionArrayT, loc) -> ExtensionArrayT:
def delete(self: ExtensionArrayT, loc: PositionalIndexer) -> ExtensionArrayT:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is in the 'public' base EA class. it was added in #39405. @jbrockmendel is this method public? Is it part of the EA interface? does it need a docstring?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch ill make a note to add a docstring

@simonjayhawkins simonjayhawkins added the Typing type annotations, mypy/pyright type checking label May 17, 2021
@Dr-Irv
Copy link
Contributor Author

Dr-Irv commented May 17, 2021

Thanks @Dr-Irv . now that the EA methods are being annotated individually, it may make sense to add the annotations to all EA classes for that method to ensure consistency. (e.g. SparseArray looks iffy def searchsorted(self, v, side="left", sorter=None):

Next commit does this. Had to create some types to support the bridge from our searchsorted to numpy searchsorted. Hope this works for you!

value: NumpyValueArrayLike,
side: Literal["left", "right"] = "left",
sorter: NumpySorter = None,
) -> np.ndarray:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this can sometimes return an int?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

searchsorted returns an np.ndarray according to the numpy docs and the signature in numpy 1.20.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like that doc/signature is inaccurate then

>>> arr = np.arange(5)
>>> type(arr.searchsorted(3))
<class 'numpy.int64'>

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this seems to be an issue with numpy. This is using numpy 1.20.2:

import numpy as np
a = np.array([1, 3, 5])
# reveal_type(a)
s = a.searchsorted(4)
# reveal_type(s)
print(s, type(s), isinstance(s, np.ndarray))

Result of running this is:

2 <class 'numpy.int64'> False

If I uncomment the reveal_type lines, the types of a and s are both np.ndarray.

So that gives us this question:

Should the pandas searchsorted API match what numpy documents (only array elements allowed), or what numpy implements?

I've submitted a numpy issue here: numpy/numpy#19160

@simonjayhawkins simonjayhawkins added the ExtensionArray Extending pandas with custom dtypes or arrays. label May 22, 2021
pandas/_typing.py Outdated Show resolved Hide resolved
pandas/_typing.py Outdated Show resolved Hide resolved
@Dr-Irv
Copy link
Contributor Author

Dr-Irv commented May 30, 2021

@simonjayhawkins @jbrockmendel pushed changes 5 days ago, so looking forward to a new review.

return self._ndarray.searchsorted(value, side=side, sorter=sorter)
def searchsorted(
self,
value: ArrayLike | object,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

object -> Scalar?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would violate the Liskov substitution principle. Our searchsorted needs to have the EA's reject an incompatible type

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess that would be bc you've typed EA.searchsorted with object, so why not change it there?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because in EA, we have to allow any type to be inside the EA, including non-EA types.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ArrayLike is a subtype of object so not needed? Why does this not use the NumpyValueArrayLike alias?

value = self._validate_searchsorted_value(value).view("M8[ns]")
def searchsorted(
self,
value: ArrayLike | object,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think de-facto the scalar needs to be either Period or Period-castable

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but the ExtensionArray type has to allow any object, so the subtype does as well, and then it should raise if someone passes an incompatible object type.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

has to allow any object

aren't we supposed to use Any for that?

so the subtype does as well, and then it should raise if someone passes an incompatible object type

so if a subclass is more restrictive in what it accepts (without raising), we can't have that be reflected in the annotation?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

has to allow any object

aren't we supposed to use Any for that?

If we use Any, then that tells the type checker to ignore type checking on the result. If we use object, then it will continue type checking.

so the subtype does as well, and then it should raise if someone passes an incompatible object type

so if a subclass is more restrictive in what it accepts (without raising), we can't have that be reflected in the annotation?

Yes and no. The full signature has to accept the same type as the parent, and can widen that argument type, but not narrow it. But, I think we can use an overload signature to indicate that if a specific type is passed, it creates a result of that type. So the master signature could accept object, and then you can have an overload to correspond to the narrower type.

@jbrockmendel
Copy link
Member

all the Literal["left", "right"] changes look good. the NumpySorter is fine AFAIK. the value arg in searchsorted is pretty nasty though; can we live without it for the time being?

@Dr-Irv
Copy link
Contributor Author

Dr-Irv commented May 31, 2021

all the Literal["left", "right"] changes look good. the NumpySorter is fine AFAIK. the value arg in searchsorted is pretty nasty though; can we live without it for the time being?

I think we should go ahead as I proposed it. The issue with value is that ExtensionArray has to support any type for a singleton, so it takes object. Therefore the subtypes have to take any type, and ideally they should raise if the type is incompatible.

@github-actions
Copy link
Contributor

github-actions bot commented Jul 4, 2021

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Jul 4, 2021
@Dr-Irv Dr-Irv removed the Stale label Jul 4, 2021
@Dr-Irv
Copy link
Contributor Author

Dr-Irv commented Jul 9, 2021

@jbrockmendel Awaiting your response for 6 weeks to my last response here: #41513 (comment)

@jbrockmendel
Copy link
Member

The issue with value is that ExtensionArray has to support any type for a singleton, so it takes object.

You explained here why you are using object instead of Any. With object, the checker treats it as if that argument were not annotated (because everything matches), while with Any the checker skips the method completely. Do I have that right?

Therefore the subtypes have to take any type, and ideally they should raise if the type is incompatible.

The full signature has to accept the same type as the parent, and can widen that argument type, but not narrow it

IIUC "has to" here is in order to satisfy Liskov. Is that right? Liskov is nice, but in this context it is conflicting with accuracy. e.g. in the PeriodArray case it is inaccurately suggesting that it will accept a timedelta. If we have to choose between Liskov and accuracy, my inclination is to choose accuracy. And if I'm right above that putting "object" is equivalent to leaving it unannotated, then this is introducing inaccuracy with no upside AFAICT.

That said, this is not a hill I want to die on. I'll defer to @simonjayhawkins

@Dr-Irv
Copy link
Contributor Author

Dr-Irv commented Jul 26, 2021

pinging @simonjayhawkins

@jreback jreback added this to the 1.4 milestone Jul 28, 2021
@jreback
Copy link
Contributor

jreback commented Jul 28, 2021

@Dr-Irv this looks pretty good to me, can you rebase and hopefully @simonjayhawkins can have a look

@Dr-Irv
Copy link
Contributor Author

Dr-Irv commented Aug 3, 2021

weekly ping @simonjayhawkins

Copy link
Member

@simonjayhawkins simonjayhawkins left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you fix conflicts and a mypy error pandas/core/indexes/datetimes.py:772: error: Argument "side" to "get_slice_bound" of "Index" has incompatible type "str"; expected "Union[Literal['left'], Literal['right']]" [arg-type]

Thanks @Dr-Irv for being patient. In future maybe worth starting at the lower level functions and work up to the public api and maybe tackle a subset of the method arguments. Gradual typing facilitates this.

return self._ndarray.searchsorted(value, side=side, sorter=sorter)
def searchsorted(
self,
value: ArrayLike | object,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ArrayLike is a subtype of object so not needed? Why does this not use the NumpyValueArrayLike alias?

return self._ndarray.searchsorted(npvalue, side=side, sorter=sorter)

def _validate_searchsorted_value(
self, value: ArrayLike | object
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

again, ArrayLike is a subtype of object?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed this to now correspond better to the numpy signature. No longer refer to object

else:
# While we also can support DatetimeLikeScalar, numpy.searchsorted
# does not accept that as an argument
return cast(PythonScalar, value)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are code changes here from master, this is a typing PR. Is this is fixing a bug? It looks like this prevents passing an EA to numpy? Does numpy currently raise a TypeError or does it use the __array__ protocol? Does this need a test?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mypy complains when you pass an ExtensionArray to searchsorted. E.g.:

import numpy as np

import pandas as pd
from pandas.core.arrays.base import ExtensionArray

from typing import cast

a = pd.array([1,3,4])

b = cast(ExtensionArray, a)

na = np.array([1,2,3,4,5])

r = na.searchsorted(b)

print(r)

produces from mypy:

irv.py:14: error: No overload variant of "searchsorted" of "ndarray" matches argument type "ExtensionArray"  [call-overload]
irv.py:14: note: Possible overload variants:
irv.py:14: note:     def searchsorted(self, v: Union[int, float, complex, str, bytes, generic], side: Union[Literal['left'], Literal['right']] = ..., sorter: Union[_SupportsArray[dtype[Union[bool_, integer[Any]]]], Sequence[_SupportsArray[dtype[Union[bool_, integer[Any]]]]], Sequence[Sequence[_SupportsArray[dtype[Union[bool_, integer[Any]]]]]], Sequence[Sequence[Sequence[_SupportsArray[dtype[Union[bool_, integer[Any]]]]]]], Sequence[Sequence[Sequence[Sequence[_SupportsArray[dtype[Union[bool_, integer[Any]]]]]]]], int, Sequence[int], Sequence[Sequence[int]], Sequence[Sequence[Sequence[int]]], Sequence[Sequence[Sequence[Sequence[int]]]], None] = ...) -> signedinteger[Any]
irv.py:14: note:     def searchsorted(self, v: Union[Sequence[Sequence[Sequence[Sequence[Sequence[Any]]]]], Union[Union[_SupportsArray[dtype[Any]], Sequence[_SupportsArray[dtype[Any]]], Sequence[Sequence[_SupportsArray[dtype[Any]]]], Sequence[Sequence[Sequence[_SupportsArray[dtype[Any]]]]], Sequence[Sequence[Sequence[Sequence[_SupportsArray[dtype[Any]]]]]]], Union[bool, int, float, complex, str, bytes, Sequence[Union[bool, int, float, complex, str, bytes]], Sequence[Sequence[Union[bool, int, float, complex, str, bytes]]], Sequence[Sequence[Sequence[Union[bool, int, float, complex, str, bytes]]]], Sequence[Sequence[Sequence[Sequence[Union[bool, int, float, complex, str, bytes]]]]]]]], side: Union[Literal['left'], Literal['right']] = ..., sorter: Union[_SupportsArray[dtype[Union[bool_, integer[Any]]]], Sequence[_SupportsArray[dtype[Union[bool_, integer[Any]]]]], Sequence[Sequence[_SupportsArray[dtype[Union[bool_, integer[Any]]]]]], Sequence[Sequence[Sequence[_SupportsArray[dtype[Union[bool_, integer[Any]]]]]]], Sequence[Sequence[Sequence[Sequence[_SupportsArray[dtype[Union[bool_, integer[Any]]]]]]]], int, Sequence[int], Sequence[Sequence[int]], Sequence[Sequence[Sequence[int]]], Sequence[Sequence[Sequence[Sequence[int]]]], None] = ...) -> ndarray[Any, dtype[signedinteger[Any]]]

So while mypy accepts the ExtensionArray argument, the typing that is present within mypy does not.

Comment on lines 184 to 185
# While we also can support DatetimeLikeScalar, numpy.searchsorted
# does not accept that as an argument
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

numpy will raise TypeError for this case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is removed in next commit.

if isinstance(value, ExtensionArray):
return value.to_numpy()
elif isinstance(value, np.ndarray):
return value
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this condition and the following condition both return value, the cast is a noop at runtime. can this be simplified?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, fixed in next commit

value: ArrayLike | object,
side: Literal["left", "right"] = "left",
sorter: NumpySorter = None,
) -> np.ndarray:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also int?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, np.intp in next commit

value: ArrayLike | object,
side: Literal["left", "right"] = "left",
sorter: NumpySorter = None,
) -> np.ndarray:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also int?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, np.intp in next commit.

Comment on lines 658 to 660
npvalue = cast(
np.ndarray, self._validate_searchsorted_value(value).view("M8[ns]")
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we avoid the cast with the typing of _validate_searchsorted_value

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed in next commit.

value: NumpyValueArrayLike,
side: Literal["left", "right"] = "left",
sorter: NumpySorter = None,
) -> npt.NDArray[np.intp]:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for consistency can you update the return types of the other methods to also use the npt.NDArray[np.intp] syntax.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done in next commit

# "Union[Any, ndarray[Any, Any]]", variable has type "int")
result = metadata.searchsorted(
v, side="left"
) # type: ignore[assignment]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can avoid the ignore using a variable annotation for result?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in next commit

@Dr-Irv
Copy link
Contributor Author

Dr-Irv commented Aug 22, 2021

@simonjayhawkins Have just made a bunch of changes to use more numpy types in arguments and return types for searchsorted.

One odd thing was that I put the overloads for searchsorted that numpy uses only in pandas/core/indexes/extension.py. Alternatively, I could put this everywhere, BUT as you'll see in the comments in that file, it requires a type: ignore statement, which is also present in the mypy definition of searchsorted() in numpy/__init__.pyi.

I could put the same overloads throughout the code, but it would require the type: ignore messages in every place. Let me know which you prefer.

@Dr-Irv
Copy link
Contributor Author

Dr-Irv commented Sep 6, 2021

biweekly ping @simonjayhawkins

@jreback
Copy link
Contributor

jreback commented Sep 6, 2021

@Dr-Irv ok if you can merge master and make sure still passing (i think it was last) time, will merge these.

@Dr-Irv
Copy link
Contributor Author

Dr-Irv commented Sep 6, 2021

@jreback all green except for a Windows timeout on one of the tests

@jreback jreback merged commit 11a8c17 into pandas-dev:master Sep 6, 2021
@jreback
Copy link
Contributor

jreback commented Sep 6, 2021

thanks @Dr-Irv followup to consolidate any fixtures created here in other PRs

@Dr-Irv Dr-Irv deleted the extdelete branch September 6, 2021 17:35
feefladder pushed a commit to feefladder/pandas that referenced this pull request Sep 7, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ExtensionArray Extending pandas with custom dtypes or arrays. Typing type annotations, mypy/pyright type checking
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants