Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix typing for extension arrays and extension dtypes without isin and astype #40421

Closed
wants to merge 43 commits into from

Conversation

Dr-Irv
Copy link
Contributor

@Dr-Irv Dr-Irv commented Mar 13, 2021

Same as #39501 by removing changes related to isin and the overloads for astype and return type for astype

Per discussion with @simonjayhawkins #39501 (comment)

@Dr-Irv
Copy link
Contributor Author

Dr-Irv commented Mar 13, 2021

@Dr-Irv I think we still need to reduce the size of this PR, so need to start breaking non controversial bits off so we can start merging some of this great work (i.e. changing typvars, the getitem overloads and maybe privatising registry need further input from others)

OK, next commit will remove those three things, which were changing the TypeVar on DatetimeLikeScalar, removing the __getitem__ overloads and the changing of registry to _registry

# error: Invalid index type "List[Any]" for "ExtensionArray"; expected
# type "Union[int, slice, ndarray]"
return self.values[[loc]] # type: ignore[index]
return self.values[[loc]]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may not be a false positive. The EA interface does not specify (in the docstring) that a list is accepted for __getitem__, although maybe we have tests in the base extension tests that enforce this. will need to check.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe the docs need to get updated, because this works:

>>> s=pd.array([1,2,3,4,5])
>>> s[[2,4]]
<IntegerArray>
[3, 5]
Length: 2, dtype: Int64

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure for our internal EAs. it's 3rd party EAs that may not support this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I guess there are two choices:

  1. Require 3rd party EAs to support it and update the docs
  2. Remove the List[Any] that I put in, and just mark the mypy error on that one place

I will assume you prefer (2), and will remove that

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will look in detail tomorrow.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Last commits do it using (2).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Require 3rd party EAs to support it and update the docs

This one (not necessarily for this PR)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One other thought on this. There was some comment about having the EA's include a 2D signature, but I think that can be done in a subclass that implements it. So right now, the signature for __getitem__() is:

def __getitem__(
        self, item: Union[int, slice, np.ndarray]
    ) -> Union[ExtensionArray, Any]:

The discussion here is (to do in another PR) to make it (and change the docs):

def __getitem__(
        self, item: Union[int, slice, np.ndarray, List[int]]
    ) -> Union[ExtensionArray, Any]:

Note the use of List[int] here, not List[Any] as I had it before.

If there is a 2D EA that wants to allow Tuple[int, ...], it can have it's own signature that widens the EA signature:

def __getitem__(
        self, item: Union[int, slice, np.ndarray, List[int], Tuple[int, ...]]
    ) -> Union[ExtensionArray, Any]:

But we don't have to include that signature with Tuple[int,...] in the base EA class.

@@ -1073,7 +1069,7 @@ def unique(self):
values = self._values

if not isinstance(values, np.ndarray):
result = values.unique()
result: ArrayLike = values.unique()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we've got ExtensionArray in the namespace, can make the isinstance check above for EA

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've changed the isinstance check, but you still need to declare result as ArrayLike, because the code below that line can return the result as an np.ndarray . Here's the excerpt (as now written):

        if isinstance(values, ExtensionArray):
            result: ArrayLike = values.unique()
            if self.dtype.kind in ["m", "M"] and isinstance(self, ABCSeries):
                # GH#31182 Series._values returns EA, unpack for backward-compat
                if getattr(self.dtype, "tz", None) is None:
                    result = np.asarray(result)

@simonjayhawkins simonjayhawkins added the Typing type annotations, mypy/pyright type checking label Mar 14, 2021
@Dr-Irv
Copy link
Contributor Author

Dr-Irv commented Mar 16, 2021

@simonjayhawkins I believe I've addressed all your requests on this. Anything else to do?

@simonjayhawkins
Copy link
Member

Thanks @Dr-Irv for addressing the comments. I want to make another thorough pass before approving.

# error: Argument "dtype" to "array" has incompatible type
# "Union[ExtensionDtype, dtype[Any]]"; expected "Union[dtype[Any], None,
# type, _SupportsDType, str, Union[Tuple[Any, int], Tuple[Any,
# Union[int, Sequence[int]]], List[Any], _DTypeDict, Tuple[Any, Any]]]"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can't you just assert that this is a NpDtype here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We decided to address astype() (where the above code is) in a later PR. That will remove the above.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see previous comment, remove the added type and this ignore will not be needed yet.

@Dr-Irv
Copy link
Contributor Author

Dr-Irv commented Apr 3, 2021

@simonjayhawkins pinging you again on review of this. Thanks!

@simonjayhawkins
Copy link
Member

@Dr-Irv can you resolve conflicts.

@Dr-Irv
Copy link
Contributor Author

Dr-Irv commented Apr 27, 2021

@Dr-Irv can you resolve conflicts.

@simonjayhawkins I've done this. Note that with respect to what was in #41097, I changed the name of the TypeVar to be ExtensionDtypeT rather than E to follow convention that we use elsewhere.

Note that the way that I use pyright didn't pick that up because the example was from CategoricalDtype and I was only verifying the base extension types.

Copy link
Member

@simonjayhawkins simonjayhawkins left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Dr-Irv

@@ -191,7 +191,7 @@ def extract_bool_array(mask: ArrayLike) -> np.ndarray:
# We could have BooleanArray, Sparse[bool], ...
# Except for BooleanArray, this is equivalent to just
# np.asarray(mask, dtype=bool)
mask = mask.to_numpy(dtype=bool, na_value=False)
mask = mask.to_numpy(dtype=np.dtype(bool), na_value=False)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This shouldn't need to be changed. i've opened #41185 as a precursor to fix these false positives.

copy: bool = False,
na_value=lib.no_default,
na_value: Any | None = lib.no_default,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

None seems redundant here.

@@ -510,8 +508,7 @@ def nbytes(self) -> int:
# ------------------------------------------------------------------------
# Additional Methods
# ------------------------------------------------------------------------

def astype(self, dtype, copy=True):
def astype(self, dtype: Dtype, copy: bool = True):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when adding types, we should ensure that the types match the docstrings. From the PR title and previous discussion, IIUC astype was being done separately.

I would revert this for now until this method is sorted properly.

see also #41018 (comment) and response.

The docstring states we return np.ndarray and the one-liner suggests that too. We sometimes also return an ExtensionArray, this is dependent on the type of dtype.

my concern is that if we add the type now, this may get forgotten.

# error: Argument "dtype" to "array" has incompatible type
# "Union[ExtensionDtype, dtype[Any]]"; expected "Union[dtype[Any], None,
# type, _SupportsDType, str, Union[Tuple[Any, int], Tuple[Any,
# Union[int, Sequence[int]]], List[Any], _DTypeDict, Tuple[Any, Any]]]"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see previous comment, remove the added type and this ignore will not be needed yet.

Comment on lines +594 to +595
*args: Any,
**kwargs: Any,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would leave these untyped for now.

These are passed through to numpy.argsort, so we should be able to be more precise.

try to avoid adding Any where we maybe able to more precise with further investigation.

side note: these actually don't seem to be passed in the base implementation.

@@ -696,7 +693,7 @@ def value_counts(self, dropna: bool = True) -> Series:

_str_na_value = ArrowStringDtype.na_value

def _str_map(self, f, na_value=None, dtype: Dtype | None = None):
def _str_map(self, f, na_value=None, dtype: DtypeArg | None = None):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

pandas/core/dtypes/base.py Show resolved Hide resolved
pandas/core/dtypes/base.py Show resolved Hide resolved
pandas/core/dtypes/base.py Show resolved Hide resolved
# type "Union[ExtensionArray, ndarray]"; expected "Sequence[Any]"
locs = self._values[mask].searchsorted(
where._values, side="right" # type: ignore[arg-type]
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see comments regarding using Sequence for ArrayLike

@Dr-Irv Dr-Irv marked this pull request as draft April 28, 2021 18:41
@Dr-Irv
Copy link
Contributor Author

Dr-Irv commented May 2, 2021

Closing this as I've started making even smaller PRs and tracking status here: #39501 (comment) in the comment from the original big PR

@Dr-Irv Dr-Irv closed this May 2, 2021
@Dr-Irv Dr-Irv deleted the limitextensiontyping branch February 13, 2023 20:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Typing type annotations, mypy/pyright type checking
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants