
fix: support ARRAY data type when loading from DataFrame with Parquet #980

Merged

merged 22 commits into main on Oct 7, 2021

Conversation

judahrand
Contributor

@judahrand judahrand commented Sep 21, 2021

Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:

  • Make sure to open an issue as a bug/issue before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea
  • Ensure the tests and linter pass
  • Code coverage does not decrease (if any source code was changed)
  • Appropriate docs were updated (if necessary)

Fixes #19 🦕

@judahrand judahrand requested a review from a team September 21, 2021 21:45
@judahrand judahrand requested a review from a team as a code owner September 21, 2021 21:45
@product-auto-label product-auto-label bot added the "api: bigquery" label Sep 21, 2021
@google-cla google-cla bot added the "cla: yes" label Sep 21, 2021
@judahrand
Contributor Author

judahrand commented Sep 21, 2021

I've made this default to True because it is technically the correct thing to do from a BigQuery perspective. BQ expects Parquet, so we should be giving it compliant Parquet. However, an argument could be made that it should default to False for backwards compatibility?

Users who need the old behaviour can set the argument to False manually.

@judahrand judahrand changed the title from "Use compliant Parquet by default loading from dataframe" to "Use compliant Parquet by default when loading from dataframe" Sep 21, 2021
@judahrand judahrand changed the title from "Use compliant Parquet by default when loading from dataframe" to "fix: use compliant Parquet by default when loading from dataframe" Sep 21, 2021
@plamut plamut added the "kokoro:force-run" label Sep 22, 2021
@yoshi-kokoro yoshi-kokoro removed the "kokoro:force-run" label Sep 22, 2021
@judahrand
Contributor Author

I suppose another question is whether or not we should set parquet_options.enable_list_inference = True by default in this method?
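For reference, a minimal sketch of what opting in to that option looks like today, assuming a `client`, a `dataframe` with list columns, and a placeholder table ID (this is illustrative, not code from the PR):

```python
from google.cloud import bigquery

# Explicitly enable list inference for the Parquet load job; this is the
# option discussed above, which the DataFrame path could set by default.
parquet_options = bigquery.ParquetOptions()
parquet_options.enable_list_inference = True

job_config = bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.PARQUET)
job_config.parquet_options = parquet_options

client.load_table_from_dataframe(
    dataframe, "my_dataset.my_table", job_config=job_config
).result()
```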

@judahrand
Contributor Author

judahrand commented Sep 22, 2021

These suggestions might be acceptable if released in 3.0.0?

Would it be acceptable to default use_compliant_nested_type and job_config.parquet_options.enable_list_inference to True in 3.0.0? This is the only way that a DataFrame with lists in the columns can be written to BigQuery and the same DataFrame retrieved with .to_dataframe(). I would have thought this would be the expected behaviour? @plamut

@plamut
Contributor

plamut commented Sep 22, 2021

@judahrand First of all, thanks for the research and the fix!

> These suggestions might be acceptable if released in 3.0.0?

BigQuery version 3.0 is definitely an opportunity to make such breaking changes. But if it turns out that the change fixes the issue without negative consequences, then it can probably be accepted even in one of the upcoming 2.x releases. Since no tests broke, this is already a good sign.

I'm just not sure about bumping pyarrow to >= 4.0.0 - we generally try to keep the version pins wide to reduce the chance of conflicting requirements with other cloud libraries or 3rd party dependencies. @tswast do you know if bumping the minimum pyarrow version would affect some users?

As for the fix itself, can you also add a corresponding test which proves that the issue has indeed been fixed?

Since it has to do with parquet decoding on the server side, a system test would be needed, but that requires access to an active Google Cloud Project. If you do not have time/resources to write such a test, that's OK, we can add one. It's probably just a matter of adjusting one of the code snippets posted in #19.
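To make the suggestion concrete, here is a rough sketch of what such a system test could look like, assuming the existing `bigquery_client` and `dataset_id` system-test fixtures (the table name and column are made up, adapted from the kind of snippet in #19):

```python
import pandas

def test_load_table_from_dataframe_w_array_round_trip(bigquery_client, dataset_id):
    # A list column should be loaded as a REPEATED field and survive a
    # round trip back through to_dataframe().
    table_id = f"{dataset_id}.array_round_trip"
    df = pandas.DataFrame({"tags": [[1, 2, 3], [], [4]]})

    bigquery_client.load_table_from_dataframe(df, table_id).result()

    table = bigquery_client.get_table(table_id)
    assert table.schema[0].name == "tags"
    assert table.schema[0].mode == "REPEATED"

    round_tripped = bigquery_client.list_rows(table_id).to_dataframe()
    assert sorted(map(list, round_tripped["tags"])) == sorted(map(list, df["tags"]))
```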

@tswast
Contributor

tswast commented Sep 22, 2021

> @tswast do you know if bumping the minimum pyarrow version would affect some users?

I would prefer not to bump pyarrow minimum to 4.0, especially since it was only released this year. I learned recently that there are some Apache Beam / Dataflow systems that are stuck on pyarrow 2.0.

@judahrand
Contributor Author

> > @tswast do you know if bumping the minimum pyarrow version would affect some users?
>
> I would prefer not to bump pyarrow minimum to 4.0, especially since it was only released this year. I learned recently that there are some Apache Beam / Dataflow systems that are stuck on pyarrow 2.0.

I agree, but I can't really see another way to address this (and it is quite annoying for some use cases). For example, [Feast](https://github.com/feast-dev/feast) is currently having to implement an ugly hack around this. It's a shame that PyArrow just didn't write proper Parquet for so long! 🤣

Plus, tbh we're already on 3.0.0, which was released in Jan 2021; 4.0.0 was released in April 2021, so it's only a few months' difference.

@plamut
Contributor

plamut commented Sep 22, 2021

We can conditionally apply the fix using a feature flag whose value depends on the detected dependency version (pyarrow in our case). We've done it several times in the past, and it proved successful - it improved the user experience while at the same time not spoiling it for those who were stuck on older versions (for any reason).

We can take the best from both worlds. 🙂

P.S.: And for the record, we've also done a few innocent-looking version bumps only to later realize that the bump caused problems at other places, i.e. outside the Python BigQuery client - hence all the caution.

@judahrand
Contributor Author

> @judahrand First of all, thanks for the research and the fix!
>
> > These suggestions might be acceptable if released in 3.0.0?
>
> BigQuery version 3.0 is definitely an opportunity to make such breaking changes. But if it turns out that the change fixes the issue without negative consequences, then it can probably be accepted even in one of the upcoming 2.x releases. Since no tests broke, this is already a good sign.
>
> I'm just not sure about bumping pyarrow to >= 4.0.0 - we generally try to keep the version pins wide to reduce the chance of conflicting requirements with other cloud libraries or 3rd party dependencies. @tswast do you know if bumping the minimum pyarrow version would affect some users?
>
> As for the fix itself, can you also add a corresponding test which proves that the issue has indeed been fixed?
>
> Since it has to do with parquet decoding on the server side, a system test would be needed, but that requires access to an active Google Cloud Project. If you do not have time/resources to write such a test, that's OK, we can add one. It's probably just a matter of adjusting one of the code snippets posted in #19.

I've extended the basic test_load_table_from_dataframe_w_automatic_schema to check that simple REPEATED versions of all the basic types end up in the correct schema in BQ, which is a decent initial pass.

I suppose I should probably also add tests for all the combinations of use_compliant_nested_type and enable_list_inference?
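For illustration, the expected mapping those unit-test additions assert is roughly the following (column names and the exact type strings here are my assumptions, not copied from the PR):

```python
import pandas
from google.cloud import bigquery

# A DataFrame with list columns...
dataframe = pandas.DataFrame(
    {
        "int_list": pandas.Series([[1, 2], [3]]),
        "str_list": pandas.Series([["a"], ["b", "c"]]),
    }
)

# ...is expected to be described by REPEATED fields of the matching scalar
# type in the schema that gets sent to BigQuery.
expected_schema = (
    bigquery.SchemaField("int_list", "INT64", mode="REPEATED"),
    bigquery.SchemaField("str_list", "STRING", mode="REPEATED"),
)
```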

@judahrand
Contributor Author

> We can conditionally apply the fix using a feature flag whose value depends on the detected dependency version (pyarrow in our case). We've done it several times in the past, and it proved successful - it improved the user experience while at the same time not spoiling it for those who were stuck on older versions (for any reason).
>
> We can take the best from both worlds. 🙂
>
> P.S.: And for the record, we've also done a few innocent-looking version bumps only to later realize that the bump caused problems at other places, i.e. outside the Python BigQuery client - hence all the caution.

Can you point me to an example of where this has been done before so I can replicate the pattern here?

@tswast
Contributor

tswast commented Sep 22, 2021

> Can you point me to an example of where this has been done before so I can replicate the pattern here?

I recommend adding a property like use_compliant_nested_type to PyarrowVersions:

class PyarrowVersions:

See a similar feature flag for the BQ Storage client:

def is_read_session_optional(self) -> bool:
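A minimal sketch of that feature-flag pattern, using the property name suggested above (the packaging-based version check is an assumption, not the exact helper code):

```python
import packaging.version
import pyarrow

class PyarrowVersions:
    """Sketch of a version-gated feature flag for the installed pyarrow."""

    @property
    def installed_version(self) -> packaging.version.Version:
        return packaging.version.parse(pyarrow.__version__)

    @property
    def use_compliant_nested_type(self) -> bool:
        # pyarrow.parquet.write_table() only accepts use_compliant_nested_type
        # from pyarrow 4.0.0 onwards, so gate the behaviour on the major version.
        return self.installed_version.major >= 4
```

The DataFrame serialization path can then consult this flag and only pass use_compliant_nested_type to write_table when the installed pyarrow supports it.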

@@ -2456,6 +2457,7 @@ def load_table_from_dataframe(
project: str = None,
job_config: LoadJobConfig = None,
parquet_compression: str = "snappy",
parquet_use_compliant_nested_type: bool = True,
Contributor

Do we even need this? I would argue that the non-compliant version is just plain wrong, right?

I don't think we officially supported list/struct types for the pandas connector yet.

Contributor Author

I'm more than happy to not expose this!! I agree that it's just plain wrong.

I was just thinking about anyone who needs a workaround for a table they've already created with a 'wrong' format.

Contributor Author

@judahrand judahrand Sep 22, 2021

I think I should clarify this. The reason it might break old users' tables is that the format that pyarrow uses for nested lists by default is list<item: list<item: value>>.

By default (without use_compliant_nested_type) it turns this into list<list<item: value>> when writing to Parquet.

When enable_list_inference is off in BigQuery, this ends up as an ARRAY<STRUCT> in BigQuery where the STRUCT looks like:

{
    "list": {
        "item": value
    }
}

When enable_list_inference is on in BigQuery, this ends up as an ARRAY<STRUCT> in BigQuery where the STRUCT looks like:

{
    "item": value
}

When PyArrow is told to use_compliant_nested_type it outputs list<list<element: value>>.

When enable_list_inference is off in BigQuery, this ends up as an ARRAY<STRUCT> in BigQuery where the STRUCT looks like:

{
    "list": {
        "element": value
    }
}

When enable_list_inference is on in BigQuery, this ends up as an ARRAY<value> in BigQuery - which is what it should be!

My point being that if we switch on use_compliant_nested_type without giving the option to turn it off, item always becomes element. This is guaranteed to be incompatible with schemas created with the legacy version.

Does that make sense?
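To make the two layouts concrete, a small pyarrow sketch of the difference described above (file names are placeholders):

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"col": pa.array([[1, 2], [3]], type=pa.list_(pa.int64()))})

# pyarrow's historical default: the inner Parquet field is named "item", which
# BigQuery loads as ARRAY<STRUCT<...>> whether or not list inference is enabled.
pq.write_table(table, "legacy.parquet", use_compliant_nested_type=False)

# Parquet-compliant three-level list encoding (pyarrow >= 4.0): the inner field
# is named "element", and with enable_list_inference BigQuery loads it as a
# plain ARRAY<INT64>.
pq.write_table(table, "compliant.parquet", use_compliant_nested_type=True)
```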

Contributor Author

If you're happy to have this breaking change and 'officially' support list types for the pandas connector, then I'm happy to not expose it and will make the change?

Contributor

Yes, this change seems to be what's needed to officially support list types. 😁

Contributor

Makes sense. I hadn't really expected anyone to actually use that rather strange schema...

Since #19 is still open, I don't consider lists supported as of today, so I'm okay making this breaking change.

Contributor

That said, if you think it would be disruptive, we can consider adding a field now and then immediately removing it in google-cloud-bigquery 3.0 🤔

Contributor Author

I think it's reasonable to make the breaking change. If someone cares enough to maintain the old behaviour, they can write the DataFrame manually using whatever options they want and then use whatever JobConfig they like with load_table_from_file. It doesn't seem worth it to add the option and immediately remove it, especially since, as you say, the old schema is... difficult to use.
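For completeness, a hedged sketch of that manual workaround for anyone who does need the legacy layout (client, dataframe, and the table ID are placeholders):

```python
import io

import pyarrow
import pyarrow.parquet
from google.cloud import bigquery

# Serialize the DataFrame yourself, with whatever pyarrow options you want...
buffer = io.BytesIO()
arrow_table = pyarrow.Table.from_pandas(dataframe)
pyarrow.parquet.write_table(
    arrow_table, buffer, use_compliant_nested_type=False  # keep the old layout
)
buffer.seek(0)

# ...then load the bytes with whatever JobConfig you like.
job_config = bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.PARQUET)
client.load_table_from_file(
    buffer, "my_dataset.my_table", job_config=job_config
).result()
```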

Contributor

Cool. Let's remove it, then. Especially since folks have a workaround if they really need the unsupported schema.

Contributor Author

Done.

@judahrand
Contributor Author

> > Can you point me to an example of where this has been done before so I can replicate the pattern here?
>
> I recommend adding a property like use_compliant_nested_type to PyarrowVersions:
>
> class PyarrowVersions:
>
> See a similar feature flag for the BQ Storage client:
>
> def is_read_session_optional(self) -> bool:

Cool 😁 Implemented this plus a bit of tidy up.

@google-cla google-cla bot added the "cla: yes" label Sep 27, 2021
@judahrand
Contributor Author

@tswast Are we all good here? 😄

@tswast tswast added the "owlbot:run" label Oct 6, 2021
@gcf-owl-bot gcf-owl-bot bot removed the "owlbot:run" label Oct 6, 2021
@tswast tswast added the "kokoro:force-run" label Oct 6, 2021
@yoshi-kokoro yoshi-kokoro removed the "kokoro:force-run" label Oct 6, 2021
@tswast tswast added the "owlbot:run" label Oct 7, 2021
@gcf-owl-bot gcf-owl-bot bot removed the "owlbot:run" label Oct 7, 2021
@tswast tswast added the "kokoro:force-run" label Oct 7, 2021
@yoshi-kokoro yoshi-kokoro removed the "kokoro:force-run" label Oct 7, 2021
@tswast tswast added the "automerge" label Oct 7, 2021
@tswast
Contributor

tswast commented Oct 7, 2021

@judahrand Thank you very much for the contribution and for addressing our feedback! I hope to release this change soon.

@gcf-merge-on-green gcf-merge-on-green bot merged commit 1e59083 into googleapis:main Oct 7, 2021
@gcf-merge-on-green gcf-merge-on-green bot removed the "automerge" label Oct 7, 2021
@judahrand judahrand deleted the dataframe-arrays branch November 19, 2021 09:34
abdelmegahedgoogle pushed a commit to abdelmegahedgoogle/python-bigquery that referenced this pull request Apr 17, 2023
Labels
api: bigquery, cla: yes

Development
Successfully merging this pull request may close these issues:
BigQuery: Upload pandas DataFrame containing arrays

4 participants