New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
String dtype: implement object-dtype based StringArray variant with NumPy semantics #58451
base: main
Are you sure you want to change the base?
Changes from 3 commits
63a7fc5
0eee625
607b95e
bca157d
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -805,6 +805,16 @@ def assert_extension_array_equal( | |
left_na, right_na, obj=f"{obj} NA mask", index_values=index_values | ||
) | ||
|
||
# Specifically for StringArrayNumpySemantics, validate here we have a valid array | ||
if isinstance(left.dtype, StringDtype) and left.dtype.storage == "python_numpy": | ||
assert np.all( | ||
[np.isnan(val) for val in left._ndarray[left_na]] # type: ignore[attr-defined] | ||
), "wrong missing value sentinels" | ||
Comment on lines
+808
to
+812
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This is a bit a custom check (and we don't do anything similarly for other types), but given I initially overlooked a case where we were creating string arrays with the wrong missing value sentinel because the tests don't actually catch that (two arrays with different missing value sentinels still pass as equal in case of EAs), I would prefer keeping this in at least on the short term. |
||
if isinstance(right.dtype, StringDtype) and right.dtype.storage == "python_numpy": | ||
assert np.all( | ||
[np.isnan(val) for val in right._ndarray[right_na]] # type: ignore[attr-defined] | ||
), "wrong missing value sentinels" | ||
|
||
left_valid = left[~left_na].to_numpy(dtype=object) | ||
right_valid = right[~right_na].to_numpy(dtype=object) | ||
if check_exact: | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are a bunch more of such changes needed to update the code paths behind
if using_pyarrow_string_dtype()
(which is essentiallyfuture.infer_string = True
) to properly handle the case without pyarrow installed. But given that this is not needed to actually get the tests passing (we are not testing with infer_strings enabled), would prefer keeping this for a follow-up PR.