Query performance optimizations #362

Closed
4 of 7 tasks
tswast opened this issue Nov 3, 2020 · 2 comments
Labels: api: bigquery, type: feature request

Comments

tswast (Contributor) commented Nov 3, 2020

This issue tracks the "fast query path" changes for the Python client(s):

  • perf: use jobs.getQueryResults to download result sets #363 -- Update QueryJob to use getQueryResults in RowIterator. Project down with a partial response so RowIterator doesn't fetch the schema and other unneeded job statistics (see the first sketch after this list).
  • perf: cache first page of jobs.getQueryResults rows #374 -- Update QueryJob and RowIterator to cache the first page of results, which is fetched as a side effect of waiting for the job to finish. Discard the cache if maxResults or startIndex is set.
  • perf: use getQueryResults from DB-API #375 -- Update the DB-API to avoid a direct call to list_rows().
  • perf: avoid extra API calls from to_dataframe if all rows are cached #384 -- Update to_dataframe and related RowIterator methods so they don't call the BQ Storage API when the cached results are the only page.
  • Update the DB-API so it doesn't call the BQ Storage API when the cached results are the only page.
  • Update Client.query to call the jobs.query backend API method when the job_config is compatible with that request shape (see the second sketch after this list).
  • (optional?) Avoid the call to jobs.get in certain cases, such as QueryJob.to_dataframe and QueryJob.to_arrow:
    • Add a "reload" argument to QueryJob.result(), defaulting to True.
    • Update RowIterator to call get_job to fetch the destination table ID before attempting to use the BQ Storage API (when the destination table ID isn't already available).
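
For context, here is a minimal sketch of the jobs.getQueryResults interaction behind the first two items, written against the REST endpoint directly rather than the client library's internals. The project and job IDs are placeholders, and the exact field mask RowIterator should request is an assumption:

```python
import google.auth
from google.auth.transport.requests import AuthorizedSession

# Placeholder IDs for illustration.
PROJECT = "my-project"
JOB_ID = "my-job-id"

credentials, _ = google.auth.default(
    scopes=["https://www.googleapis.com/auth/bigquery"]
)
session = AuthorizedSession(credentials)

# jobs.getQueryResults waits for the job to finish (up to timeoutMs) *and*
# returns the first page of rows, so one call can both poll for completion
# and start the download. The standard `fields` parameter requests a
# partial response, skipping the schema and job statistics on every page.
response = session.get(
    f"https://bigquery.googleapis.com/bigquery/v2/projects/{PROJECT}/queries/{JOB_ID}",
    params={
        "timeoutMs": 10_000,
        "maxResults": 1_000,
        "fields": "jobComplete,totalRows,pageToken,rows",
    },
)
response.raise_for_status()
page = response.json()

if page.get("jobComplete"):
    # These rows arrived as a side effect of waiting for the job, so the
    # iterator can cache and yield them without an extra API call.
    cached_first_page = page.get("rows", [])
```

The first-page cache in #374 falls out of this shape: by the time the job is known to be done, the first rows are already in hand, so RowIterator only needs further API calls for pageToken-driven follow-up pages.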
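
Likewise, a sketch of the jobs.query fast path from the sixth item, reusing the session above. The query text and table are placeholders. The point is that the query is issued and the first page of rows comes back in a single round trip, with no jobs.insert / jobs.get pair; not every QueryJobConfig option maps onto a QueryRequest body, which is why the item limits this to acceptable job_configs:

```python
# Reuses PROJECT and `session` from the previous sketch.
query_request = {
    "query": "SELECT name, SUM(number) AS total FROM `dataset.table` GROUP BY name",
    "useLegacySql": False,
    "maxResults": 1_000,
    "timeoutMs": 10_000,
}
response = session.post(
    f"https://bigquery.googleapis.com/bigquery/v2/projects/{PROJECT}/queries",
    json=query_request,
)
response.raise_for_status()
results = response.json()

if results.get("jobComplete"):
    # Small result sets come back inline in this same response; only
    # longer-running queries need a jobs.getQueryResults polling loop
    # as a fallback.
    rows = results.get("rows", [])
```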

tswast (Contributor, Author) commented Nov 24, 2020

When we reintroduce row caching (reverted in #400), it should be opt-in, both to account for the time/memory tradeoff (#394) and because the longer calls to result() caused infinite loops in pandas-gbq (googleapis/python-bigquery-pandas#343) and in the magics.

tswast (Contributor, Author) commented Dec 17, 2020

I've conducted some microbenchmarks, and for a large class of row sizes, the BigQuery Storage API is faster than getQueryResults. Closing this work item out, as I don't think this pattern of long-running API requests is well suited to the typical use of the Python client (especially not with pandas DataFrames).
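
For reference, a rough sketch of that kind of microbenchmark. The public usa_names table and the row count are assumptions (the issue doesn't record the benchmark queries); the create_bqstorage_client flag toggles to_dataframe between the REST pagination path and the BigQuery Storage API path:

```python
import time

from google.cloud import bigquery

client = bigquery.Client()
QUERY = """
    SELECT name, number, state
    FROM `bigquery-public-data.usa_names.usa_1910_2013`
    LIMIT 500000
"""

def time_download(use_bqstorage):
    start = time.perf_counter()
    df = (
        client.query(QUERY)
        .result()
        .to_dataframe(create_bqstorage_client=use_bqstorage)
    )
    return time.perf_counter() - start, len(df)

# REST path: getQueryResults / tabledata.list pagination.
rest_s, _ = time_download(False)
# BigQuery Storage API path: parallel Arrow record-batch streams.
storage_s, _ = time_download(True)
print(f"REST: {rest_s:.1f}s  BQ Storage: {storage_s:.1f}s")
```

Relative timings depend heavily on row width and result size, which is what makes the comparison worth running across several shapes of result set.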
