Query performance optimizations #362

Closed
4 of 7 tasks
tswast opened this issue Nov 3, 2020 · 2 comments
Labels: api: bigquery, type: feature request

Comments

tswast (Contributor) commented Nov 3, 2020

This issue tracks the "fast query path" changes for the Python client(s):

  • perf: use jobs.getQueryResults to download result sets #363 -- Update QueryJob to use getQueryResults in RowIterator. Project down with a partial response so RowIterator doesn't fetch the schema and other unneeded job statistics (see the first sketch after this list).
  • perf: cache first page of jobs.getQueryResults rows #374 -- Update QueryJob and RowIterator to cache the first page of results, which is fetched as a side effect of waiting for the job to finish. Discard the cache if maxResults or startIndex is set.
  • perf: use getQueryResults from DB-API #375 -- Update the DB-API to avoid a direct call to list_rows().
  • perf: avoid extra API calls from to_dataframe if all rows are cached #384 -- Update to_dataframe and related RowIterator methods so they don't call the BQ Storage API when the cached results are the only page.
  • Update the DB-API so it doesn't call the BQ Storage API when the cached results are the only page.
  • Update Client.query to call the jobs.query backend API method when the job_config is compatible with that request shape (see the second sketch after this list).
  • (optional?) Avoid the call to jobs.get in certain cases, such as QueryJob.to_dataframe and QueryJob.to_arrow:
    • Add a "reload" argument to QueryJob.result(), defaulting to True.
    • Update RowIterator to call get_job to fetch the destination table ID before attempting to use the BQ Storage API (when the destination table ID isn't already available).
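
For context, here is a minimal sketch of the jobs.getQueryResults interaction behind the first two items, written against the REST endpoint directly rather than the client library's internals. The project and job IDs are placeholders, and the exact field mask RowIterator should request is an assumption:

```python
import google.auth
from google.auth.transport.requests import AuthorizedSession

# Placeholder IDs for illustration.
PROJECT = "my-project"
JOB_ID = "my-job-id"

credentials, _ = google.auth.default(
    scopes=["https://www.googleapis.com/auth/bigquery"]
)
session = AuthorizedSession(credentials)

# jobs.getQueryResults waits for the job to finish (up to timeoutMs) *and*
# returns the first page of rows, so one call can both poll for completion
# and start the download. The standard `fields` parameter requests a
# partial response, skipping the schema and job statistics on every page.
response = session.get(
    f"https://bigquery.googleapis.com/bigquery/v2/projects/{PROJECT}/queries/{JOB_ID}",
    params={
        "timeoutMs": 10_000,
        "maxResults": 1_000,
        "fields": "jobComplete,totalRows,pageToken,rows",
    },
)
response.raise_for_status()
page = response.json()

if page.get("jobComplete"):
    # These rows arrived as a side effect of waiting for the job, so the
    # iterator can cache and yield them without an extra API call.
    cached_first_page = page.get("rows", [])
```

The first-page cache in #374 falls out of this shape: by the time the job is known to be done, the first rows are already in hand, so RowIterator only needs further API calls for pageToken-driven follow-up pages.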
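
Likewise, a sketch of the jobs.query fast path from the sixth item, reusing the session above. The query text and table are placeholders. The point is that the query is issued and the first page of rows comes back in a single round trip, with no jobs.insert / jobs.get pair; not every QueryJobConfig option maps onto a QueryRequest body, which is why the item limits this to acceptable job_configs:

```python
# Reuses PROJECT and `session` from the previous sketch.
query_request = {
    "query": "SELECT name, SUM(number) AS total FROM `dataset.table` GROUP BY name",
    "useLegacySql": False,
    "maxResults": 1_000,
    "timeoutMs": 10_000,
}
response = session.post(
    f"https://bigquery.googleapis.com/bigquery/v2/projects/{PROJECT}/queries",
    json=query_request,
)
response.raise_for_status()
results = response.json()

if results.get("jobComplete"):
    # Small result sets come back inline in this same response; only
    # longer-running queries need a jobs.getQueryResults polling loop
    # as a fallback.
    rows = results.get("rows", [])
```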

tswast (Contributor, Author) commented Nov 24, 2020

When we reintroduce row caching (reverted in #400), it should be opt-in, both to account for the time/memory tradeoff (#394) and because the longer calls to result() caused infinite loops in pandas-gbq (googleapis/python-bigquery-pandas#343) and in the magics.

tswast (Contributor, Author) commented Dec 17, 2020

I've conducted some microbenchmarks, and for a large class of row sizes, the BigQuery Storage API is faster than getQueryResults. Closing this work item out, as I don't think this pattern of long-running API requests is well suited to the typical use of the Python client (especially not with pandas DataFrames).
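
For reference, a rough sketch of that kind of microbenchmark. The public usa_names table and the row count are assumptions (the issue doesn't record the benchmark queries); the create_bqstorage_client flag toggles to_dataframe between the REST pagination path and the BigQuery Storage API path:

```python
import time

from google.cloud import bigquery

client = bigquery.Client()
QUERY = """
    SELECT name, number, state
    FROM `bigquery-public-data.usa_names.usa_1910_2013`
    LIMIT 500000
"""

def time_download(use_bqstorage):
    start = time.perf_counter()
    df = (
        client.query(QUERY)
        .result()
        .to_dataframe(create_bqstorage_client=use_bqstorage)
    )
    return time.perf_counter() - start, len(df)

# REST path: getQueryResults / tabledata.list pagination.
rest_s, _ = time_download(False)
# BigQuery Storage API path: parallel Arrow record-batch streams.
storage_s, _ = time_download(True)
print(f"REST: {rest_s:.1f}s  BQ Storage: {storage_s:.1f}s")
```

Relative timings depend heavily on row width and result size, which is what makes the comparison worth running across several shapes of result set.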
