Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Push some as_column arrow logic to ColumnBase.from_arrow #15738

Merged
merged 3 commits into from
May 22, 2024

Conversation

mroeschke
Copy link
Contributor

Description

as_column and ColumnBase.from_arrow have similar checks for handling pa.Array objects so consolidating them to
ColumnBase.from_arrow as as_column calls to that eventually.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@mroeschke mroeschke added Python Affects Python cuDF API. improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels May 13, 2024
@mroeschke mroeschke requested a review from a team as a code owner May 13, 2024 21:56
@mroeschke mroeschke requested review from wence- and isVoid May 13, 2024 21:56
Comment on lines -1842 to -1864
if isinstance(arbitrary, pa.NullArray):
if dtype is not None:
# Cast the column to the `dtype` if specified.
new_dtype = dtype
elif len(arbitrary) == 0:
# If the column is empty, it has to be
# a `str` dtype.
new_dtype = cudf.dtype("str")
else:
# If the null column is not empty, it has to
# be of `object` dtype.
new_dtype = cudf.dtype(arbitrary.type.to_pandas_dtype())

if cudf.get_option(
"mode.pandas_compatible"
) and new_dtype == cudf.dtype("O"):
# We internally raise if we do `astype("object")`, hence
# need to cast to `str` since this is safe to do so because
# it is a null-array.
new_dtype = "str"

col = col.astype(new_dtype)
elif dtype is not None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why don't we need to add handling for this into from_arrow? The other removals here all have a corresponding branch in from_arrow. Is this now being handled by the new three lines you added above that are doing pa.types.is_null?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why don't we need to add handling for this into from_arrow?

I had kept this here since from_arrow doesn't do any explicit casting with dtype. But, are pyarrow.null type arrays ever expected to be represented as is in cudf? Otherwise, it would make sense to convert null to str in from_arrow

Is this now being handled by the new three lines you added above that are doing pa.types.is_null?

Yes, I think this casting was here when the default "empty" type was float pre-pandas 2.0 support and we need this to be string sometimes. Now we can collapse all these conditions to always cast to string

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think pyarrow.null has like one possible code path in all of the cudf test suite where it gets used 😅 so I think you can safely ignore it as long as tests are currently passing. Eventually we'll need to deal with this more thoroughly in #12477

@mroeschke
Copy link
Contributor Author

/merge

@rapids-bot rapids-bot bot merged commit f6cca50 into rapidsai:branch-24.06 May 22, 2024
70 checks passed
@mroeschke mroeschke deleted the cln/arrow/column branch May 22, 2024 20:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
improvement Improvement / enhancement to an existing function non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

None yet

2 participants