
Support for string[pyarrow] dtype #954

Closed
bnaul opened this issue Sep 8, 2021 · 9 comments · Fixed by #1529
Labels
api: bigquery (Issues related to the googleapis/python-bigquery API.)
semver: major (Hint for users that this is an API breaking change.)
type: feature request (‘Nice-to-have’ improvement, new feature or different behavior or design.)

Comments

bnaul commented Sep 8, 2021

Pandas 1.3 added a new string[pyarrow] dtype, which can be considerably more memory-efficient.

I'm not sure what all would be involved, but it would be nice to support this natively, since presumably(?) we already transmit the data in the appropriate format for the pyarrow string type before converting it back to Python string objects. Maybe an option like the one introduced in #848 for geography types could be used to control the behavior?

product-auto-label bot added the api: bigquery label Sep 8, 2021
tswast added the type: feature request label Sep 8, 2021
tswast commented Sep 8, 2021

Thanks for the request, @bnaul! This does look like an improvement.

I wonder if the existing dtypes argument to to_dataframe would be suitable for your use case? I've noticed that it works well for types such as float16 (see the system test where this is used), but pandas seems to ignore it for other types (especially timestamp-related types).
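For illustration (the column names here are made up): passing a dtypes mapping is roughly equivalent to casting the finished DataFrame column by column after the fact, something like:

```python
import pandas as pd

# Sketch of the effect of dtypes={"score": "float16"}: an after-the-fact
# cast applied to the already-materialized DataFrame.
df = pd.DataFrame({"name": ["a", "b"], "score": [1.5, 2.5]})
dtypes = {"score": "float16"}
for column, dtype in dtypes.items():
    df[column] = df[column].astype(dtype)
```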

tswast commented Sep 8, 2021

Note: even if dtypes does work, we may want to see if we can use the types_mapper argument of pyarrow's to_pandas method when we call it here: https://github.com/googleapis/python-bigquery-storage/blob/c7ac6984c34c387f279e6ee0a7024273298f3351/google/cloud/bigquery_storage_v1/reader.py#L703

That could avoid some unnecessary transformations to Python string objects and back again.

bnaul commented Sep 8, 2021

Yep, dtypes works fine, as would just calling .astype(...) afterward. But as you say, it feels very wasteful when the underlying data is already Arrow strings; there's a lot of wasted compute and peak memory in the round-tripping. types_mapper is where I had in mind to do the conversion.

Maybe rather than cluttering the API with another argument, it would make more sense to just set the default string type to pyarrow whenever pd.core.arrays.string_.StringDtype(storage="pyarrow") is available? I'm not aware of any downsides, but maybe there are some...?

tswast commented Sep 8, 2021

I hesitate to do it by default while pandas still considers it "experimental". Then again, so is the Int64 dtype, but I'm planning on using that by default in google-cloud-bigquery v3 (#786) to avoid data loss for large integers.

In this case there isn't a data-loss issue with the pandas default behavior, and string[pyarrow] hasn't been around quite as long as Int64, so I'm not as keen to use it by default. And even if we did, I still think we'd want a parameter to turn the behavior off.

What if we had a string_dtype="string" / "object" / "string[pyarrow]" parameter on to_dataframe? I feel that could map pretty well onto populating types_mapper.

bnaul commented Sep 10, 2021

Definitely fair. As far as I can tell, the "experimental" label was simply copy-pasted from the other array type docstrings (Int64, as you point out, but string, boolean, etc. all have the exact same text), so it seems safe-ish, but it's also reasonable to leave it opt-in. string_dtype seems like a good solution to me!

bnaul commented Sep 10, 2021

Update: definitely don't make it the default; I probably should have taken the warning a little more seriously. This seems like extremely basic behavior that isn't supported:

```
[ins] In [3]: pd.Series(['a'], dtype='string[pyarrow]') + 'b'
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-3-ed8b3d43e316> in <module>
----> 1 pd.Series(['a'], dtype='string[pyarrow]') + 'b'
...
TypeError: unsupported operand type(s) for +: 'ArrowStringArray' and 'str'
```

Digging a bit more, there's also pandas-dev/pandas#42597 etc., so it's definitely not feature-complete.

tswast added the semver: major label Dec 11, 2021
tswast added this to To do in google-cloud-bigquery 3.0.0 via automation Dec 11, 2021
tswast commented Dec 11, 2021

I've been doing some thinking about this issue.

I think for v3, we should add some kind of string dtype support. Possibly: string[pyarrow] if available, then string, then object. That would let us continue to support a wide range of pandas versions.
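The fallback order could look roughly like this (a sketch under the assumption that detection happens at import time; default_string_dtype is a hypothetical name):

```python
import pandas as pd

def default_string_dtype():
    # Sketch of the proposed order: string[pyarrow] when pyarrow is
    # importable, then pandas' nullable "string" dtype, then object.
    try:
        import pyarrow  # noqa: F401
        return pd.StringDtype(storage="pyarrow")
    except (ImportError, TypeError):
        # pyarrow missing, or this pandas predates storage="pyarrow"
        pass
    if hasattr(pd, "StringDtype"):
        return pd.StringDtype()
    return object

dtype = default_string_dtype()
```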

Alternatively, or in addition, we could expose the types_mapper argument. Our default types mapper could call the user-supplied one first and only continue with the default logic if the user-supplied one returned None.
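That chaining logic can be sketched with plain callables (chain_types_mappers is a hypothetical helper name, not an existing API):

```python
def chain_types_mappers(user_mapper, default_mapper):
    # Try the user-supplied mapper first; fall back to the default
    # mapper whenever the user's mapper is absent or returns None.
    def mapper(arrow_type):
        if user_mapper is not None:
            dtype = user_mapper(arrow_type)
            if dtype is not None:
                return dtype
        return default_mapper(arrow_type)
    return mapper
```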

tswast commented Dec 11, 2021

Oh, just looked at the thread. Yeah, let's not make it the default in that case. Exposing types_mapper should still be done for v3.

tswast commented Jan 4, 2023

Update: we do use a types mapper now, but we haven't yet provided an override for string or other dtypes.

`def default_types_mapper(date_as_object: bool = False):`

It might make more sense for this to be string-specific than to expose Arrow types as part of the pandas to_dataframe API.
