
feat: to_gbq uses Parquet by default, use api_method="load_csv" for old behavior #413

Merged
merged 21 commits into main from issue366-null-strings on Nov 2, 2021

Conversation

tswast
Collaborator

@tswast tswast commented Oct 26, 2021

Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:

  • Make sure to open an issue as a bug/issue before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea
  • Ensure the tests and linter pass
  • Code coverage does not decrease (if any source code was changed)
  • Appropriate docs were updated (if necessary)

Fixes #366 🦕
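
As a usage sketch of the behavior change (dataframe contents, table, and project names are placeholders, not from the PR):

    import pandas as pd
    import pandas_gbq

    df = pd.DataFrame({"name": ["", None, "a"]})

    # New default: upload via a Parquet load job.
    pandas_gbq.to_gbq(df, "my_dataset.my_table", project_id="my-project")

    # Old behavior: upload via a CSV load job.
    pandas_gbq.to_gbq(
        df,
        "my_dataset.my_table",
        project_id="my-project",
        if_exists="replace",  # replace the table created by the first call
        api_method="load_csv",
    )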

@google-cla google-cla bot added the cla: yes This human has signed the Contributor License Agreement. label Oct 26, 2021
@product-auto-label product-auto-label bot added the api: bigquery Issues related to the googleapis/python-bigquery-pandas API. label Oct 26, 2021
@tswast tswast marked this pull request as ready for review October 27, 2021 21:41
@tswast tswast requested a review from a team as a code owner October 27, 2021 21:41
@tswast
Collaborator Author

tswast commented Oct 28, 2021

    pyarrow 3.0.0 depends on numpy>=1.16.6
    The user requested (constraint) numpy==1.14.5

Looks like we need to bump the minimum numpy.

Per https://numpy.org/neps/nep-0029-deprecation_policy.html, we should already be on NumPy 1.18, so requiring >=1.16.6 is justifiable.
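
A sketch of the kind of bump this implies, assuming the floor lives in setup.py's install_requires (hypothetical excerpt, not the project's actual pin list):

    # setup.py (illustrative excerpt)
    install_requires = [
        "numpy >=1.16.6",  # raised from 1.14.5; pyarrow 3.0.0 needs numpy>=1.16.6
        "pyarrow >=3.0.0",
        # ...remaining pins unchanged...
    ]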

@tswast
Collaborator Author

tswast commented Oct 28, 2021

Looks like most of the CircleCI failures are caused by the same issue.

    FAILED tests/system/test_gbq.py::TestToGBQIntegration::test_upload_data - Fil...
    FAILED tests/system/test_gbq.py::TestToGBQIntegration::test_upload_data_if_table_exists_append
    FAILED tests/system/test_gbq.py::TestToGBQIntegration::test_upload_subset_columns_if_table_exists_append
    FAILED tests/system/test_gbq.py::TestToGBQIntegration::test_upload_data_if_table_exists_replace
    FAILED tests/system/test_gbq.py::TestToGBQIntegration::test_upload_data_flexible_column_order
    FAILED tests/system/test_gbq.py::TestToGBQIntegration::test_upload_data_with_timestamp
    FAILED tests/system/test_gbq.py::TestToGBQIntegration::test_upload_data_tokyo
    FAILED tests/system/test_gbq.py::TestToGBQIntegration::test_upload_data_tokyo_non_existing_dataset
    E   pyarrow.lib.ArrowInvalid: Casting from timestamp[ns, tz=US/Arizona] to timestamp[ms] would lose data: 1635431957963003000

The same tests pass with the latest versions of packages in the 3.9 tests.

We don't run the system tests with 3.7 on the Kokoro session, so I can't tell if the same issue happens with the pip versions there. I'll update the noxfile to run them with 3.7.
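
For reference, a minimal sketch of the failing cast (column name and timestamp value are illustrative, not from the test suite):

    import pandas as pd
    import pyarrow as pa

    # A tz-aware timestamp with sub-millisecond precision.
    idx = pd.to_datetime(["2021-10-28 13:39:17.963003"]).tz_localize("US/Arizona")
    table = pa.Table.from_pandas(pd.DataFrame({"ts": idx}), preserve_index=False)

    # Safe casting (the default) refuses to drop the extra precision:
    # pyarrow.lib.ArrowInvalid: Casting from timestamp[ns, tz=US/Arizona]
    # to timestamp[ms, tz=US/Arizona] would lose data.
    table.cast(pa.schema([pa.field("ts", pa.timestamp("ms", tz="US/Arizona"))]))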

@tswast
Collaborator Author

tswast commented Oct 28, 2021

Looks like we're not the only ones encountering this: https://stackoverflow.com/questions/59682833/pyarrow-lib-arrowinvalid-casting-from-timestampns-to-timestampms-would-los

I wonder if bumping the minimum pyarrow to 4.0.0 would fix it?

@tswast
Collaborator Author

tswast commented Oct 28, 2021

I've tried pyarrow 4, 5, and 6, none of which fixed it. Possibly a problem with pandas?

@tswast
Collaborator Author

tswast commented Oct 28, 2021

Tests pass with

    attrs==21.2.0
    cachetools==4.2.4
    certifi==2021.10.8
    charset-normalizer==2.0.7
    click==8.0.3
    google-api-core==1.16.0
    google-auth==1.4.1
    google-auth-oauthlib==0.0.1
    google-cloud-bigquery==1.11.1
    google-cloud-bigquery-storage==1.1.0
    google-cloud-core==0.29.1
    google-cloud-testutils==1.2.0
    google-crc32c==1.3.0
    google-resumable-media==2.1.0
    googleapis-common-protos==1.53.0
    grpcio==1.41.1
    idna==3.3
    importlib-metadata==4.8.1
    iniconfig==1.1.1
    mock==4.0.3
    numpy==1.21.3
    oauthlib==3.1.1
    packaging==21.0
    pandas==1.3.4
    -e git+ssh://git@github.com/tswast/python-bigquery-pandas.git@845ff322a5d7900826c97a4da652aead5518ca73#egg=pandas_gbq
    pluggy==1.0.0
    protobuf==3.19.0
    py==1.10.0
    pyarrow==6.0.0
    pyasn1==0.4.8
    pyasn1-modules==0.2.8
    pydata-google-auth==0.1.2
    pyparsing==3.0.3
    pytest==6.2.5
    python-dateutil==2.8.2
    pytz==2021.3
    requests==2.26.0
    requests-oauthlib==1.3.0
    rsa==4.7.2
    six==1.16.0
    toml==0.10.2
    tqdm==4.23.0
    typing-extensions==3.10.0.2
    urllib3==1.26.7
    zipp==3.6.0

I'll try different versions of pandas.

@tswast
Collaborator Author

tswast commented Oct 28, 2021

This issue is fixed by upgrading to pandas 1.1.0+.

Looking at the pandas 1.1.0 changelog, there have been several bug fixes relating to timestamp data. I'm not sure which one in particular would have helped here, but potentially the fix for mixing and matching different timezones. https://pandas.pydata.org/pandas-docs/dev/whatsnew/v1.1.0.html#parsing-timezone-aware-format-with-different-timezones-in-to-datetime

I don't think we want to require pandas 1.1.0 just yet. Perhaps the tests could be updated not to mix and match timezones, since that wasn't actually supported by pandas until 1.1.0?
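
For context, this is the kind of mixed-offset input that changelog entry covers (values are illustrative; behavior before 1.1.0 was inconsistent):

    import pandas as pd

    mixed = ["2021-10-28 17:00 -0100", "2021-10-28 18:00 -0200"]

    # Only handled consistently from pandas 1.1.0 on:
    pd.to_datetime(mixed, format="%Y-%m-%d %H:%M %z")  # object-dtype result
    pd.to_datetime(mixed, format="%Y-%m-%d %H:%M %z", utc=True)  # tz-aware, in UTC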

@tswast tswast requested a review from plamut October 29, 2021 14:25

@plamut plamut left a comment


I can confirm the fix gets rid of the linked issue.

Overall it looks good, the comments are just some nits and one possible refactoring opportunity.

Review comments on pandas_gbq/load.py and pandas_gbq/exceptions.py (all resolved).
@tswast tswast requested a review from plamut November 1, 2021 19:37

@plamut plamut left a comment


LGTM.

The Python 3.10 check fails because the BigQuery client does not yet support Python 3.10 and cannot be installed as a dependency.

@tswast tswast merged commit 9a65383 into main Nov 2, 2021
@tswast tswast deleted the issue366-null-strings branch November 2, 2021 14:52
    if schema is not None:
        schema = pandas_gbq.schema.remove_policy_tags(schema)
        job_config.schema = pandas_gbq.schema.to_google_cloud_bigquery(schema)
    # If not, let BigQuery determine schema unless we are encoding the CSV files ourselves.
    elif not FEATURES.bigquery_has_from_dataframe_with_csv:
        schema = pandas_gbq.schema.generate_bq_schema(dataframe)


This may introduce a failure if schema is None and generate_bq_schema is left unused.

The Parquet conversion may succeed, but the actual BigQuery table schema types may not match the result of that conversion.

Collaborator Author

Interesting that our tests wouldn't have caught that. Do you have an example of a dataframe that demonstrates this?

Collaborator Author

FWIW, the reason we don't do this here is that the google-cloud-bigquery library does similar dataframe-to-BigQuery-schema conversion logic when the schema is not populated on the job config: https://github.com/googleapis/python-bigquery/blob/66b3dd9f9aec3fda9610a3ceec8d8a477f2ab3b9/google/cloud/bigquery/client.py#L2625
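
A hedged sketch of that fallback from the client library's side (client, table ID, and the pandas DataFrame named dataframe are illustrative):

    from google.cloud import bigquery

    client = bigquery.Client()
    # Leave job_config.schema unset; load_table_from_dataframe then infers a
    # BigQuery schema from the dataframe's dtypes before serializing to Parquet.
    job_config = bigquery.LoadJobConfig()
    job = client.load_table_from_dataframe(
        dataframe, "my-project.my_dataset.my_table", job_config=job_config
    )
    job.result()  # block until the load job completes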

Successfully merging this pull request may close these issues.

Empty strings inconsistently converted to NULL's when using df.to_gbq()