
feat: use BigQuery Storage client by default #55

Merged: 16 commits, Jun 10, 2020

Conversation

plamut (Contributor) commented Mar 9, 2020

Closes #86.
Closes #84.

(Originally opened as a POC preview, without a tracking issue.)

The PR is now ready to be reviewed. It uses the BQ Storage API by default to fetch query results, and removes the "beta" label from that feature.

It's probably easier to review this PR commit by commit.

Key points:

  • If the BQ Storage client cannot be used to fetch query results (e.g. because its optional dependencies are missing), the library falls back to the REST API.
  • The stable BQ Storage API v1 client is now used, replacing the v1beta1 client (there is, however, a feature request to support both client versions simultaneously).
  • The Python DB API implementation now follows the spec (PEP 249) more closely: closing a Connection makes it unusable from that point on and also closes every Cursor instance it created.
  • Closing a DB API Connection now also closes any BigQuery / BQ Storage clients the connection itself created, fixing the socket leak that was discovered.
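The last two points can be sketched with a simplified model (illustrative only, not the library's actual implementation; `Cursor` and `Connection` here are bare stand-ins for the DB API classes):

```python
# Simplified model of the DB API close semantics described above: closing a
# Connection makes it unusable, closes every Cursor it created, and closes
# the clients it owns (stand-ins here) so their sockets are released.
class Cursor:
    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


class Connection:
    def __init__(self, client, bqstorage_client=None):
        self._client = client                      # BigQuery client stand-in
        self._bqstorage_client = bqstorage_client  # optional BQ Storage client
        self._cursors = []
        self.closed = False

    def cursor(self):
        if self.closed:
            raise RuntimeError("Operating on a closed connection.")
        cur = Cursor()
        self._cursors.append(cur)
        return cur

    def close(self):
        for cur in self._cursors:
            cur.close()
        # Closing the owned clients is what fixes the socket leak.
        for client in (self._client, self._bqstorage_client):
            if client is not None and hasattr(client, "close"):
                client.close()
        self.closed = True
```

(In the real library, only clients the connection itself created are closed; user-supplied clients remain the caller's responsibility.)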

@plamut plamut added the do not merge Indicates a pull request not ready for merge, due to either quality or timing. label Mar 9, 2020
@googlebot googlebot added the cla: yes This human has signed the Contributor License Agreement. label Mar 9, 2020
plamut (Contributor, author) commented Apr 7, 2020

I ran some timings to compare fetching data with and without the BQ Storage client; results below. Spoiler: with the BQ Storage API, fetching is roughly twice as fast in these tests.

Query

```sql
SELECT * FROM `project.dataset.table`
LIMIT 200000
```

Source tables

  • A synthetic table with two FLOAT columns containing 10M rows of random numbers
  • `bigquery-public-data.cms_medicare.inpatient_charges_2011` - multiple numeric and string columns, ~163k rows

Timings

|                         | REST API (`tabledata.list`) | BQ Storage API |
| ----------------------- | --------------------------- | -------------- |
| Synthetic table         | 10 - 11.5 s                 | 4.5 - 5 s      |
| Inpatient charges table | 23 - 25 s                   | 10.5 - 12.5 s  |
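Taking the midpoints of the ranges above, the BQ Storage API path comes out roughly 2.1-2.3x faster in both cases:

```python
# Rough speedups implied by the timings above (midpoints of each range).
rest = {"synthetic": (10 + 11.5) / 2, "inpatient": (23 + 25) / 2}
storage = {"synthetic": (4.5 + 5) / 2, "inpatient": (10.5 + 12.5) / 2}
speedup = {name: rest[name] / storage[name] for name in rest}
print(speedup)  # synthetic ≈ 2.26x, inpatient ≈ 2.09x
```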

@plamut plamut force-pushed the optimize-to-dataframe branch 4 times, most recently from f080813 to 1aa0c62 Compare April 28, 2020 07:59
@plamut plamut removed the do not merge Indicates a pull request not ready for merge, due to either quality or timing. label Apr 28, 2020
@plamut plamut requested a review from shollyman April 28, 2020 08:18
@plamut plamut marked this pull request as ready for review April 28, 2020 08:29
@shollyman (Contributor) left a comment:

Semantic satiation from seeing all the v1/v1beta references. Sorry it took me so long to get to this. A couple of minor nits, plus a few more interesting bits (Avro vs. Arrow, small result sets, versioning for storage).

Resolved review threads (outdated):
  • google/cloud/bigquery/dbapi/cursor.py (×2)
  • google/cloud/bigquery/job.py
  • setup.py
  • tests/system.py (×2)
@shollyman (Contributor) left a comment:

Thanks for slogging through this.

plamut (Contributor, author) commented May 15, 2020

Just a thought: since this is a pretty significant change, how about releasing all the other smaller changes and fixes that have accumulated since the last release first, and shipping this one on its own? That would make a rollback easier, should it be necessary.

shollyman (Contributor) commented:
Isolating this change to its own release seems prudent.

plamut (Contributor, author) commented May 19, 2020

Putting this on hold until the next release is out; we will ship this change separately.

@plamut plamut added the do not merge Indicates a pull request not ready for merge, due to either quality or timing. label May 19, 2020
@plamut plamut removed the do not merge Indicates a pull request not ready for merge, due to either quality or timing. label Jun 10, 2020
@plamut plamut requested a review from shollyman June 10, 2020 15:02
plamut (Contributor, author) commented Jun 10, 2020

With the new release (1.25.0) out, we can now merge this and release it in the near future (possibly along with miscellaneous related cleanup fixes).

Successfully merging this pull request may close these issues.

  • Use BQ Storage API by default
  • BigQuery Storage API integrations should accept either v1beta1 or v1 client
4 participants