
Support for string[pyarrow] dtype #954

Closed
bnaul opened this issue Sep 8, 2021 · 9 comments · Fixed by #1529
Labels
api: bigquery (Issues related to the googleapis/python-bigquery API.)
semver: major (Hint for users that this is an API breaking change.)
type: feature request (‘Nice-to-have’ improvement, new feature or different behavior or design.)

Comments

bnaul commented Sep 8, 2021

Pandas 1.3 added a new string[pyarrow] dtype, which can be considerably more memory-efficient.

I'm not sure what all would be involved, but it would be nice to support this natively, since presumably(?) we already transmit the data in the appropriate format for the pyarrow string type before converting it back to Python string objects. Maybe an option like the one introduced in #848 for geography types could be used to control the behavior?

product-auto-label bot added the api: bigquery label Sep 8, 2021
tswast added the type: feature request label Sep 8, 2021
tswast commented Sep 8, 2021

Thanks for the request, @bnaul! This does look like an improvement.

I wonder if the existing dtypes argument to to_dataframe would be suitable for your use case? I've noticed that it works well for types such as float16 (see the system test where this is used), but pandas seems to ignore it for other types (especially timestamp-related types).
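For illustration (the column names here are made up): passing a dtypes mapping is roughly equivalent to casting the finished DataFrame column by column after the fact, something like:

```python
import pandas as pd

# Sketch of the effect of dtypes={"score": "float16"}: an after-the-fact
# cast applied to the already-materialized DataFrame.
df = pd.DataFrame({"name": ["a", "b"], "score": [1.5, 2.5]})
dtypes = {"score": "float16"}
for column, dtype in dtypes.items():
    df[column] = df[column].astype(dtype)
```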

tswast commented Sep 8, 2021

Note: even if dtypes does work, we may want to see if we can use the types_mapper argument of pyarrow's to_pandas method when we call it here: https://github.com/googleapis/python-bigquery-storage/blob/c7ac6984c34c387f279e6ee0a7024273298f3351/google/cloud/bigquery_storage_v1/reader.py#L703

That could avoid some unnecessary transformations to Python string objects and back again.

bnaul commented Sep 8, 2021

Yep, dtypes works fine, as would just calling .astype(...) afterward. But as you say, it feels very wasteful when the underlying data is already Arrow strings; there's a lot of wasted compute and peak memory in the round-tripping. types_mapper is where I had in mind to do the conversion.

Maybe rather than cluttering the API with another argument, it would make more sense to just set the default string type to pyarrow whenever pd.core.arrays.string_.StringDtype(storage="pyarrow") is available? I'm not aware of any downsides, but maybe there are some...?

tswast commented Sep 8, 2021

I hesitate to do it by default while pandas still considers it "experimental". Then again, so is the Int64 dtype, but I'm planning on using that by default in google-cloud-bigquery v3 (#786) to avoid data loss for large integers.

In this case there isn't a data-loss issue with the pandas default behavior, and string[pyarrow] hasn't been around quite as long as Int64, so I'm not as keen to use it by default. And even if we did, I still think we'd want a parameter to turn the behavior off.

What if we had a string_dtype="string" / "object" / "string[pyarrow]" parameter on to_dataframe? I feel that could map pretty well onto populating types_mapper.

bnaul commented Sep 10, 2021

Definitely fair. As far as I can tell, the "experimental" label was simply copy-pasted from the other array type docstrings (Int64, as you point out, but string, boolean, etc. all have the exact same text), so it seems safe-ish, but it's also reasonable to leave it opt-in. string_dtype seems like a good solution to me!

bnaul commented Sep 10, 2021

Update: definitely don't make it the default; I probably should have taken the warning a little more seriously. This seems like extremely basic behavior that isn't supported:

```
[ins] In [3]: pd.Series(['a'], dtype='string[pyarrow]') + 'b'
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-3-ed8b3d43e316> in <module>
----> 1 pd.Series(['a'], dtype='string[pyarrow]') + 'b'
...
TypeError: unsupported operand type(s) for +: 'ArrowStringArray' and 'str'
```

Digging a bit more, there's also pandas-dev/pandas#42597 etc., so it's definitely not feature-complete.

tswast added the semver: major label Dec 11, 2021
tswast added this to To do in google-cloud-bigquery 3.0.0 via automation Dec 11, 2021
tswast commented Dec 11, 2021

I've been doing some thinking about this issue.

I think for v3, we should add some kind of string dtype support. Possibly: string[pyarrow] if available, then string, then object. That would let us continue to support a wide range of pandas versions.
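The fallback order could look roughly like this (a sketch under the assumption that detection happens at import time; default_string_dtype is a hypothetical name):

```python
import pandas as pd

def default_string_dtype():
    # Sketch of the proposed order: string[pyarrow] when pyarrow is
    # importable, then pandas' nullable "string" dtype, then object.
    try:
        import pyarrow  # noqa: F401
        return pd.StringDtype(storage="pyarrow")
    except (ImportError, TypeError):
        # pyarrow missing, or this pandas predates storage="pyarrow"
        pass
    if hasattr(pd, "StringDtype"):
        return pd.StringDtype()
    return object

dtype = default_string_dtype()
```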

Alternatively, or in addition, we could expose the types_mapper argument. Our default types mapper could call the user-supplied one first and only continue with the default logic if the user-supplied one returned None.
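That chaining logic can be sketched with plain callables (chain_types_mappers is a hypothetical helper name, not an existing API):

```python
def chain_types_mappers(user_mapper, default_mapper):
    # Try the user-supplied mapper first; fall back to the default
    # mapper whenever the user's mapper is absent or returns None.
    def mapper(arrow_type):
        if user_mapper is not None:
            dtype = user_mapper(arrow_type)
            if dtype is not None:
                return dtype
        return default_mapper(arrow_type)
    return mapper
```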

tswast commented Dec 11, 2021

Oh, just looked at the thread. Yeah, let's not make it the default in that case. Exposing types_mapper should still be done for v3.

tswast commented Jan 4, 2023

Update: we do use a types mapper now, but we haven't yet provided an override for string or other dtypes.

`def default_types_mapper(date_as_object: bool = False):`

It might make more sense for this to be string-specific than to expose Arrow types as part of the pandas to_dataframe API.
