Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: support converting empty time Series to pyarrow Array #11

Merged
merged 5 commits into from Sep 29, 2021

Conversation

tswast
Copy link
Collaborator

@tswast tswast commented Sep 27, 2021

Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:

  • Make sure to open an issue as a bug/issue before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea
  • Ensure the tests and linter pass
  • Code coverage does not decrease (if any source code was changed)
  • Appropriate docs were updated (if necessary)

Fixes #10 馃

@google-cla google-cla bot added the cla: yes This human has signed the Contributor License Agreement. label Sep 27, 2021
@product-auto-label product-auto-label bot added the api: bigquery Issues related to the googleapis/python-db-dtypes-pandas API. label Sep 27, 2021
@tswast
Copy link
Collaborator Author

tswast commented Sep 27, 2021

Currently getting pyarrow.lib.ArrowNotImplementedError: Unsupported cast from double to time64 using function cast_time64 in the empty time Series test.

@tswast tswast changed the title fix: support constucting empty time Series fix: support converting empty time Series to pyarrow Array Sep 27, 2021
@tswast
Copy link
Collaborator Author

tswast commented Sep 27, 2021

I'm a bit surprised this doesn't happen for date. I'll check with plain pandas timedelta64 to see what it does with to_numpy() for an empty series.

@tswast
Copy link
Collaborator Author

tswast commented Sep 27, 2021

Seems to work fine on latest pandas. 馃

In [1]: import pandas

In [2]: pandas.__version__
Out[2]: '1.4.0.dev0+502.g019d782633'

In [3]: s = pandas.Series([], dtype='timedelta64[ns]')

In [4]: import pyarrow

In [5]: a = pyarrow.array(s)

In [6]: a
Out[6]:
<pyarrow.lib.DurationArray object at 0x7fb98813bd60>
[]

In [7]: s.to_numpy()
Out[7]: array([], dtype='timedelta64[ns]')

@tswast tswast marked this pull request as ready for review September 27, 2021 20:35
@tswast tswast requested a review from a team as a code owner September 27, 2021 20:35
@@ -98,7 +98,8 @@ def astype(self, dtype, copy=True):

def __arrow_array__(self, type=None):
return pyarrow.array(
self.to_numpy(), type=type if type is not None else pyarrow.time64("ns"),
self.to_numpy(dtype="object"),
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might have performance implications, but it does seem to prevent the cast to float64 for empty arrays. Also, the dtype seems to be object whenever there are any values in the array, anyway.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I approached this by adding the missing to_numpy() for pandas <1, that just uses astype('object')

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tswast tswast mentioned this pull request Sep 27, 2021
4 tasks
Comment on lines +1 to +163
),
),
(
pandas.Series(
[
dt.time(0, 0, 0, 0),
dt.time(12, 30, 15, 125_000),
dt.time(23, 59, 59, 999_999),
],
dtype="time",
),
pyarrow.array(
[
dt.time(0, 0, 0, 0),
dt.time(12, 30, 15, 125_000),
dt.time(23, 59, 59, 999_999),
],
type=pyarrow.time64("ns"),
),
),
),
)
def test_to_arrow(series, expected):
array = pyarrow.array(series)
assert array.equals(expected)


@pytest.mark.parametrize(
("series", "expected"),
(
(pandas.Series([], dtype="date"), pyarrow.array([], type=pyarrow.date64())),
(
pandas.Series([None, None, None], dtype="date"),
pyarrow.array([None, None, None], type=pyarrow.date64()),
),
(
pandas.Series(
[dt.date(2021, 9, 27), None, dt.date(2011, 9, 27)], dtype="date"
),
pyarrow.array(
[dt.date(2021, 9, 27), None, dt.date(2011, 9, 27)],
type=pyarrow.date64(),
),
),
(
pandas.Series(
[dt.date(1677, 9, 22), dt.date(1970, 1, 1), dt.date(2262, 4, 11)],
dtype="date",
),
pyarrow.array(
[dt.date(1677, 9, 22), dt.date(1970, 1, 1), dt.date(2262, 4, 11)],
type=pyarrow.date64(),
),
),
(pandas.Series([], dtype="time"), pyarrow.array([], type=pyarrow.time32("ms"))),
(
pandas.Series([None, None, None], dtype="time"),
pyarrow.array([None, None, None], type=pyarrow.time32("ms")),
),
(
pandas.Series(
[dt.time(0, 0, 0, 0), None, dt.time(23, 59, 59, 999_000)], dtype="time"
),
pyarrow.array(
[dt.time(0, 0, 0, 0), None, dt.time(23, 59, 59, 999_000)],
type=pyarrow.time32("ms"),
),
),
(
pandas.Series(
[dt.time(0, 0, 0, 0), None, dt.time(23, 59, 59, 999_999)], dtype="time"
),
pyarrow.array(
[dt.time(0, 0, 0, 0), None, dt.time(23, 59, 59, 999_999)],
type=pyarrow.time64("us"),
),
),
(
pandas.Series(
[
dt.time(0, 0, 0, 0),
dt.time(12, 30, 15, 125_000),
dt.time(23, 59, 59, 999_999),
],
dtype="time",
),
pyarrow.array(
[
dt.time(0, 0, 0, 0),
dt.time(12, 30, 15, 125_000),
dt.time(23, 59, 59, 999_999),
],
type=pyarrow.time64("us"),
),
),
),
)
def test_to_arrow_w_arrow_type(series, expected):
array = pyarrow.array(series, type=expected.type)
assert array.equals(expected)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your tests are nicer than mine. :)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! They did catch a bug with empty arrays, so I'm glad I wrote them

Copy link

@plamut plamut left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. I just can't say about the possible performance impact of self.to_numpy(dtype="object").

Are we going to use that conditionally for pandas<1 as Jim suggested?

@tswast
Copy link
Collaborator Author

tswast commented Sep 28, 2021

I just can't say about the possible performance impact of self.to_numpy(dtype="object").

Thankfully I think this approach with using an extension dtype leaves us room for optimization. We might even switch to backing everything with pyarrow arrays instead of numpy arrays someday.

Are we going to use that conditionally for pandas<1 as Jim suggested?

Yes, that seems to be the right approach. Just pushed a commit.

@tswast tswast requested a review from plamut September 28, 2021 20:48


@for_date_and_time
def test_date___arrow__array__(dtype):
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed because it's redundant with the new test_arrow.py tests.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api: bigquery Issues related to the googleapis/python-db-dtypes-pandas API. cla: yes This human has signed the Contributor License Agreement.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

unit test coverage missing __arrow_array__ methods
3 participants