
feat: use pyarrow stream compression, if available #593

Merged
merged 4 commits on Apr 12, 2021

Conversation

plamut
Contributor

@plamut plamut commented Apr 7, 2021

Closes #579.

This PR uses pyarrow compression of BQ Storage streams when available, i.e. when a recent enough version of google-cloud-bigquery-storage is installed.

Currently blocked on the BQ Storage release that will add this support, but it's still possible to test this PR locally by changing the noxfile to use a development version of google-cloud-bigquery-storage.
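To make the feature-gating concrete, here is a minimal sketch of the version check this PR relies on. The helper name and the 2.4.0 minimum are assumptions for illustration (2.4.0 is the google-cloud-bigquery-storage release discussed below); the commented-out lines show, based on the PR discussion, roughly where the compression codec would be requested on the read session.

```python
import pkg_resources


def supports_arrow_compression(installed_version: str, minimum: str = "2.4.0") -> bool:
    """Return True if the given google-cloud-bigquery-storage version string
    is recent enough to support LZ4-compressed Arrow stream buffers.

    The "2.4.0" default mirrors the release discussed in this PR; treat it as
    an illustrative assumption, not an authoritative pin.
    """
    return (
        pkg_resources.parse_version(installed_version)
        >= pkg_resources.parse_version(minimum)
    )


# When supported, the read session can request compressed buffers, roughly:
#
#   from google.cloud.bigquery_storage_v1 import types
#   session.read_options.arrow_serialization_options.buffer_compression = (
#       types.ArrowSerializationOptions.CompressionCodec.LZ4_FRAME
#   )
```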

PR checklist

  • Make sure to open an issue as a bug/issue before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea
  • Ensure the tests and linter pass
  • Code coverage does not decrease (if any source code was changed)
  • Appropriate docs were updated (if necessary)

@plamut plamut requested a review from tswast April 7, 2021 13:39
@google-cla google-cla bot added the cla: yes This human has signed the Contributor License Agreement. label Apr 7, 2021
@product-auto-label product-auto-label bot added the api: bigquery Issues related to the googleapis/python-bigquery API. label Apr 7, 2021
Comment on lines 41 to 46
import pkg_resources

# Having BQ Storage available implies that pyarrow is available, too.
_ARROW_COMPRESSION_SUPPORT = pkg_resources.get_distribution(
"pyarrow"
).parsed_version >= pkg_resources.parse_version("1.0.0")
Contributor Author

Do we actually need the pyarrow version check?

The bqstorage extra already pins the minimum pyarrow version to 1.0.0, so if somebody somehow installs an older version, I suppose it can be considered an error on the user's side?

Contributor

Sounds good to me (removing this check). Since all of our extras require pyarrow >= 1.0.0, I'm okay with failing.

Counterpoint though is: what error will they get when pyarrow is too low? I wonder if we should file an FR to check for minimum versions and throw nicer errors? It might help with some of the issues like #556. This would mean keeping minimum versions in sync with 3 locations: setup.py, constraints.txt, and version_check.py, though.
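The "nicer errors" idea above could be sketched along these lines. This helper is hypothetical (not part of this PR or the library); `check_min_version` and its error messages are invented for illustration, and any real implementation would need its pins kept in sync with setup.py and constraints.txt as noted.

```python
import pkg_resources


def check_min_version(distribution: str, minimum: str) -> None:
    """Hypothetical helper: raise a readable error when a dependency is
    missing or too old, instead of letting a late AttributeError surface.
    """
    try:
        installed = pkg_resources.get_distribution(distribution).parsed_version
    except pkg_resources.DistributionNotFound:
        raise ImportError(
            f"{distribution} is not installed; install it via the appropriate extra."
        )
    if installed < pkg_resources.parse_version(minimum):
        raise ImportError(
            f"{distribution} {installed} is installed, but >= {minimum} is required."
        )
```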

Contributor Author

The new pip dependency resolver (announcement) is significantly stricter, so I suppose this will become a non-issue once its use is widespread enough and/or it becomes the default. It shouldn't even be possible to install an incompatible pyarrow version with it.

Yes, we can open an FR to discuss whether covering this presumably corner-case scenario during the transitional period is worth doing, considering the problem of keeping the minimum versions in sync.

@plamut plamut added the status: blocked Resolving the issue is dependent on other work. label Apr 7, 2021
@tswast
Copy link
Contributor

tswast commented Apr 7, 2021

Currently blocked on the BQ Storage release that will add this support

Oops! I'll see what open PRs we have and hopefully cut a release today.

Arrow stream compression requires pyarrow>=1.0.0, but that's already
guaranteed by a version pin in setup.py if the bqstorage extra is
installed.
@plamut
Copy link
Contributor Author

plamut commented Apr 8, 2021

I see that a new version of BQ Storage has been released, good. But I now also see that one of the Python 3.6 unit tests fails; will investigate.

Update:
When Python 3.6 is used, google-cloud-bigquery-storage 2.0.0 is installed rather than the latest 2.4.0, hence the AttributeError in the test itself. This happens because we have a non-empty constraints file for the Python 3.6 test dependencies.
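For context on the constraints mechanism: `pip install -c <file>` caps any listed package at the pinned version, which is how the 3.6 test session ends up with the older client. The excerpt below is illustrative, not copied from the repository's actual constraints file:

```
# testing/constraints-3.6.txt (illustrative excerpt)
# Pins test dependencies to their oldest supported versions on Python 3.6,
# so google-cloud-bigquery-storage resolves to 2.0.0 rather than 2.4.0.
google-cloud-bigquery-storage==2.0.0
```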

@plamut plamut removed the status: blocked Resolving the issue is dependent on other work. label Apr 8, 2021
@plamut plamut marked this pull request as ready for review April 8, 2021 12:50
@plamut plamut requested a review from a team April 8, 2021 12:50
@tswast tswast merged commit dde9dc5 into googleapis:master Apr 12, 2021
gcf-merge-on-green bot pushed a commit that referenced this pull request Apr 26, 2021
🤖 I have created a release *beep* *boop*
---
## [2.14.0](https://www.github.com/googleapis/python-bigquery/compare/v2.13.1...v2.14.0) (2021-04-26)


### Features

* accept DatasetListItem where DatasetReference is accepted ([#597](https://www.github.com/googleapis/python-bigquery/issues/597)) ([c8b5581](https://www.github.com/googleapis/python-bigquery/commit/c8b5581ea3c94005d69755c4a3b5a0d8900f3fe2))
* accept job object as argument to `get_job` and `cancel_job` ([#617](https://www.github.com/googleapis/python-bigquery/issues/617)) ([f75dcdf](https://www.github.com/googleapis/python-bigquery/commit/f75dcdf3943b87daba60011c9a3b42e34ff81910))
* add `Client.delete_job_metadata` method to remove job metadata ([#610](https://www.github.com/googleapis/python-bigquery/issues/610)) ([0abb566](https://www.github.com/googleapis/python-bigquery/commit/0abb56669c097c59fbffce007c702e7a55f2d9c1))
* add `max_queue_size` argument to `RowIterator.to_dataframe_iterable` ([#575](https://www.github.com/googleapis/python-bigquery/issues/575)) ([f95f415](https://www.github.com/googleapis/python-bigquery/commit/f95f415d3441b3928f6cc705cb8a75603d790fd6))
* add type hints for public methods ([#613](https://www.github.com/googleapis/python-bigquery/issues/613)) ([f8d4aaa](https://www.github.com/googleapis/python-bigquery/commit/f8d4aaa335a0eef915e73596fc9b43b11d11be9f))
* DB API cursors are now iterable ([#618](https://www.github.com/googleapis/python-bigquery/issues/618)) ([e0b373d](https://www.github.com/googleapis/python-bigquery/commit/e0b373d0e721a70656ed8faceb7f5c70f642d144))
* retry google.auth TransportError by default ([#624](https://www.github.com/googleapis/python-bigquery/issues/624)) ([34ecc3f](https://www.github.com/googleapis/python-bigquery/commit/34ecc3f1ca0ff073330c0c605673d89b43af7ed9))
* use pyarrow stream compression, if available ([#593](https://www.github.com/googleapis/python-bigquery/issues/593)) ([dde9dc5](https://www.github.com/googleapis/python-bigquery/commit/dde9dc5114c2311fb76fafc5b222fff561e8abf1))


### Bug Fixes

* consistent percents handling in DB API query ([#619](https://www.github.com/googleapis/python-bigquery/issues/619)) ([6502a60](https://www.github.com/googleapis/python-bigquery/commit/6502a602337ae562652a20b20270949f2c9d5073))
* missing license headers in new test files ([#604](https://www.github.com/googleapis/python-bigquery/issues/604)) ([df48cc5](https://www.github.com/googleapis/python-bigquery/commit/df48cc5a0be99ad39d5835652d1b7422209afc5d))
* unsetting clustering fields on Table is now possible ([#622](https://www.github.com/googleapis/python-bigquery/issues/622)) ([33a871f](https://www.github.com/googleapis/python-bigquery/commit/33a871f06329f9bf5a6a92fab9ead65bf2bee75d))


### Documentation

* add sample to run DML query ([#591](https://www.github.com/googleapis/python-bigquery/issues/591)) ([ff2ec3a](https://www.github.com/googleapis/python-bigquery/commit/ff2ec3abe418a443cd07751c08e654f94e8b3155))
* update the description of the return value of `_QueryResults.rows()` ([#594](https://www.github.com/googleapis/python-bigquery/issues/594)) ([8f4c0b8](https://www.github.com/googleapis/python-bigquery/commit/8f4c0b84dac3840532d7865247b8ad94b625b897))
---


This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).
Successfully merging this pull request may close these issues.

feat: make use of LZ4 in to_arrow / to_dataframe