
feat: to_gbq uses Parquet by default, use api_method="load_csv" for old behavior #413

Merged
merged 21 commits into main from issue366-null-strings on Nov 2, 2021

Conversation

tswast
Collaborator

@tswast tswast commented Oct 26, 2021

Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:

  • Make sure to open an issue as a bug/issue before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea
  • Ensure the tests and linter pass
  • Code coverage does not decrease (if any source code was changed)
  • Appropriate docs were updated (if necessary)

Fixes #366 🦕
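
As a usage sketch of the behavior change (dataframe contents, table, and project names are placeholders, not from the PR):

    import pandas as pd
    import pandas_gbq

    df = pd.DataFrame({"name": ["", None, "a"]})

    # New default: upload via a Parquet load job.
    pandas_gbq.to_gbq(df, "my_dataset.my_table", project_id="my-project")

    # Old behavior: upload via a CSV load job.
    pandas_gbq.to_gbq(
        df,
        "my_dataset.my_table",
        project_id="my-project",
        if_exists="replace",  # replace the table created by the first call
        api_method="load_csv",
    )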

@google-cla google-cla bot added the cla: yes This human has signed the Contributor License Agreement. label Oct 26, 2021
@product-auto-label product-auto-label bot added the api: bigquery Issues related to the googleapis/python-bigquery-pandas API. label Oct 26, 2021
@tswast tswast marked this pull request as ready for review October 27, 2021 21:41
@tswast tswast requested a review from a team as a code owner October 27, 2021 21:41
@tswast
Collaborator Author

tswast commented Oct 28, 2021

    pyarrow 3.0.0 depends on numpy>=1.16.6
    The user requested (constraint) numpy==1.14.5

Looks like we need to bump the minimum numpy.

Per https://numpy.org/neps/nep-0029-deprecation_policy.html, we should already be on NumPy 1.18, so requiring >=1.16.6 is justifiable.
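
A sketch of the kind of bump this implies, assuming the floor lives in setup.py's install_requires (hypothetical excerpt, not the project's actual pin list):

    # setup.py (illustrative excerpt)
    install_requires = [
        "numpy >=1.16.6",  # raised from 1.14.5; pyarrow 3.0.0 needs numpy>=1.16.6
        "pyarrow >=3.0.0",
        # ...remaining pins unchanged...
    ]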

@tswast
Collaborator Author

tswast commented Oct 28, 2021

Looks like most of the CircleCI failures are caused by the same issue.

    FAILED tests/system/test_gbq.py::TestToGBQIntegration::test_upload_data - Fil...
    FAILED tests/system/test_gbq.py::TestToGBQIntegration::test_upload_data_if_table_exists_append
    FAILED tests/system/test_gbq.py::TestToGBQIntegration::test_upload_subset_columns_if_table_exists_append
    FAILED tests/system/test_gbq.py::TestToGBQIntegration::test_upload_data_if_table_exists_replace
    FAILED tests/system/test_gbq.py::TestToGBQIntegration::test_upload_data_flexible_column_order
    FAILED tests/system/test_gbq.py::TestToGBQIntegration::test_upload_data_with_timestamp
    FAILED tests/system/test_gbq.py::TestToGBQIntegration::test_upload_data_tokyo
    FAILED tests/system/test_gbq.py::TestToGBQIntegration::test_upload_data_tokyo_non_existing_dataset
    E   pyarrow.lib.ArrowInvalid: Casting from timestamp[ns, tz=US/Arizona] to timestamp[ms] would lose data: 1635431957963003000

The same tests pass with the latest versions of packages in the 3.9 tests.

We don't run the system tests with 3.7 on the Kokoro session, so I can't tell if the same issue happens with the pip versions there. I'll update the noxfile to run them with 3.7.
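
For reference, a minimal sketch of the failing cast (column name and timestamp value are illustrative, not from the test suite):

    import pandas as pd
    import pyarrow as pa

    # A tz-aware timestamp with sub-millisecond precision.
    idx = pd.to_datetime(["2021-10-28 13:39:17.963003"]).tz_localize("US/Arizona")
    table = pa.Table.from_pandas(pd.DataFrame({"ts": idx}), preserve_index=False)

    # Safe casting (the default) refuses to drop the extra precision:
    # pyarrow.lib.ArrowInvalid: Casting from timestamp[ns, tz=US/Arizona]
    # to timestamp[ms, tz=US/Arizona] would lose data.
    table.cast(pa.schema([pa.field("ts", pa.timestamp("ms", tz="US/Arizona"))]))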

@tswast
Collaborator Author

tswast commented Oct 28, 2021

Looks like we're not the only ones encountering this: https://stackoverflow.com/questions/59682833/pyarrow-lib-arrowinvalid-casting-from-timestampns-to-timestampms-would-los

I wonder if bumping the minimum pyarrow to 4.0.0 would fix it?

@tswast
Collaborator Author

tswast commented Oct 28, 2021

I've tried pyarrow 4, 5, and 6, none of which fixed it. Possibly a problem with pandas?

@tswast
Collaborator Author

tswast commented Oct 28, 2021

Tests pass with

    attrs==21.2.0
    cachetools==4.2.4
    certifi==2021.10.8
    charset-normalizer==2.0.7
    click==8.0.3
    google-api-core==1.16.0
    google-auth==1.4.1
    google-auth-oauthlib==0.0.1
    google-cloud-bigquery==1.11.1
    google-cloud-bigquery-storage==1.1.0
    google-cloud-core==0.29.1
    google-cloud-testutils==1.2.0
    google-crc32c==1.3.0
    google-resumable-media==2.1.0
    googleapis-common-protos==1.53.0
    grpcio==1.41.1
    idna==3.3
    importlib-metadata==4.8.1
    iniconfig==1.1.1
    mock==4.0.3
    numpy==1.21.3
    oauthlib==3.1.1
    packaging==21.0
    pandas==1.3.4
    -e git+ssh://git@github.com/tswast/python-bigquery-pandas.git@845ff322a5d7900826c97a4da652aead5518ca73#egg=pandas_gbq
    pluggy==1.0.0
    protobuf==3.19.0
    py==1.10.0
    pyarrow==6.0.0
    pyasn1==0.4.8
    pyasn1-modules==0.2.8
    pydata-google-auth==0.1.2
    pyparsing==3.0.3
    pytest==6.2.5
    python-dateutil==2.8.2
    pytz==2021.3
    requests==2.26.0
    requests-oauthlib==1.3.0
    rsa==4.7.2
    six==1.16.0
    toml==0.10.2
    tqdm==4.23.0
    typing-extensions==3.10.0.2
    urllib3==1.26.7
    zipp==3.6.0

I'll try different versions of pandas.

@tswast
Collaborator Author

tswast commented Oct 28, 2021

This issue is fixed by upgrading to pandas 1.1.0+.

Looking at the pandas 1.1.0 changelog, there have been several bug fixes relating to timestamp data. I'm not sure which one in particular would have helped here, but potentially the fix for mixing and matching different timezones. https://pandas.pydata.org/pandas-docs/dev/whatsnew/v1.1.0.html#parsing-timezone-aware-format-with-different-timezones-in-to-datetime

I don't think we want to require pandas 1.1.0 just yet. Perhaps the tests could be updated not to mix and match timezones, since that wasn't actually supported by pandas until 1.1.0?
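
For context, this is the kind of mixed-offset input that changelog entry covers (values are illustrative; behavior before 1.1.0 was inconsistent):

    import pandas as pd

    mixed = ["2021-10-28 17:00 -0100", "2021-10-28 18:00 -0200"]

    # Only handled consistently from pandas 1.1.0 on:
    pd.to_datetime(mixed, format="%Y-%m-%d %H:%M %z")  # object-dtype result
    pd.to_datetime(mixed, format="%Y-%m-%d %H:%M %z", utc=True)  # tz-aware, in UTC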

@tswast tswast requested a review from plamut October 29, 2021 14:25

@plamut plamut left a comment


I can confirm the fix gets rid of the linked issue.

Overall it looks good, the comments are just some nits and one possible refactoring opportunity.

Review comments on pandas_gbq/load.py and pandas_gbq/exceptions.py (all resolved).
@tswast tswast requested a review from plamut November 1, 2021 19:37

@plamut plamut left a comment


LGTM.

The Python 3.10 check fails because the BigQuery client does not yet support Python 3.10 and cannot be installed as a dependency.

@tswast tswast merged commit 9a65383 into main Nov 2, 2021
@tswast tswast deleted the issue366-null-strings branch November 2, 2021 14:52
    if schema is not None:
        schema = pandas_gbq.schema.remove_policy_tags(schema)
        job_config.schema = pandas_gbq.schema.to_google_cloud_bigquery(schema)
    # If not, let BigQuery determine schema unless we are encoding the CSV files ourselves.
    elif not FEATURES.bigquery_has_from_dataframe_with_csv:
        schema = pandas_gbq.schema.generate_bq_schema(dataframe)


This may introduce a failure if schema is None and generate_bq_schema is left unused.

The Parquet conversion may succeed, but the actual BigQuery table schema types may not match the result of that conversion.

Collaborator Author

Interesting that our tests wouldn't have caught that. Do you have an example of a dataframe that demonstrates this?

Collaborator Author

FWIW, the reason we don't do this here is that the google-cloud-bigquery library does similar dataframe-to-BigQuery-schema conversion logic when the schema is not populated on the job config: https://github.com/googleapis/python-bigquery/blob/66b3dd9f9aec3fda9610a3ceec8d8a477f2ab3b9/google/cloud/bigquery/client.py#L2625
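
A hedged sketch of that fallback from the client library's side (client, table ID, and the pandas DataFrame named dataframe are illustrative):

    from google.cloud import bigquery

    client = bigquery.Client()
    # Leave job_config.schema unset; load_table_from_dataframe then infers a
    # BigQuery schema from the dataframe's dtypes before serializing to Parquet.
    job_config = bigquery.LoadJobConfig()
    job = client.load_table_from_dataframe(
        dataframe, "my-project.my_dataset.my_table", job_config=job_config
    )
    job.result()  # block until the load job completes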

Successfully merging this pull request may close these issues.

Empty strings inconsistently converted to NULL's when using df.to_gbq()