perf: use the first page a results when `query(api_method="QUERY")` #1723

tswast · 2023-11-15T20:21:44Z

Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:

Make sure to open an issue as a bug/issue before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea
Ensure the tests and linter pass
Code coverage does not decrease (if any source code was changed)
Appropriate docs were updated (if necessary)

This is a restoration of part of #374

TODO:

Unit tests to verify that no extra jobs.getQueryResults or jobs.get calls happen when iterating over rows that are fully returned from jobs.query.
Ensure to_pandas and to_arrow work with this code path.

Closes #589 🦕

google/cloud/bigquery/table.py

…_is_completely_cached

Linchin · 2023-11-16T21:24:09Z

I'm not sure I understand. Could you explain why doing this will improve query performance, and only applies when query(api_method="QUERY")?

tswast · 2023-11-16T22:24:24Z

Could you explain why doing this will improve query performance, and only applies when query(api_method="QUERY")?

Great question @Linchin.

There are two APIs for issuing a query in BigQuery: jobs.insert and jobs.query. There are some key differences between these APIs.

jobs.insert: creates and returns a query job unless it's considered a "bad request" or insufficient permissions or similar. In general this API returns relatively quickly, but only includes job metadata, no query results.

When running a query this way, we call jobs.getQueryResults with the page size set to 0 results. This is purely to find out when the job finishes. The reason for 0 results is that for large rows, getQueryResults can actually hang well past the expected 10 seconds, causing issues such as simple query hangs in 0.14.1, works in 0.13.3 python-bigquery-pandas#343.

Once the query has finished, we call jobs.get to refresh the job metadata. This ensures we have a destination table, which can be used with tabledata.list or the BQ Storage Read API to fetch the results. Many years ago, it used to be much faster to use tabledata.list instead of jobs.getQueryResults to download data, but that changed in late 2019 when many optimizations were made to getQueryResults on the backend. As of perf: use jobs.getQueryResults to download result sets #363, we use either jobs.getQueryResults or the BQ Storage Read API to download results.

In general, this path works best for large query results, which we expect from folks using pandas, for example.

Edit: To clarify, after this PR, I think it'd be OK to use jobs.query even for pandas, but since the cached first page of results increases memory usage (increased memory usage in 2.4.0 #394), I don't think it makes sense to change the default. But definitely something to consider since users who want less memory usage could explicitly set api_method="INSERT".
jobs.query: starts the query and waits. If it finishes within about 10 seconds, it returns the results in that same API call. If the query has not finished, we use the job ID from the response to call jobs.getQueryResults. The subsequent steps are the same as with jobs.insert.

The key difference here is that jobs.query can actually return results in the same request/response as the one to start the query, whereas jobs.insert always requires a subsequent API call to fetch the results.

This PR does a few things: (1) if the job has finished, don't make the unnecessary call to getQueryResults (2), actually use the rows returned from jobs.query in RowIterator, and (3) avoid using the BQ Storage Read API if we know only 1 or 2 more API requests are needed to download the rest of the data.

…_is_completely_cached

Linchin · 2023-11-20T22:27:03Z

google/cloud/bigquery/_job_helpers.py

-        # This also requires updates to `to_dataframe` and the DB API connector
-        # so that they don't try to read from a destination table if all the
-        # results are present.
+        query_job._query_results = google.cloud.bigquery.query._QueryResults(


Does this change mean we also load the query results when job is complete, when query(api_method="INSERT")?

Good question. I just double-checked that this method is only called from query_jobs_query, so it won't affect when api_method="INSERT"

Linchin

Thank you for helping me understand this PR :)

…oogleapis#1723) * perf: use the first page a results when `query(api_method="QUERY")` * add tests * respect max_results with cached page * respect page_size, also avoid bqstorage if almost fully downloaded * skip true test if bqstorage not installed * coverage

…nto pandas extra (#1726) * feat: Introduce compatibility with native namespace packages * Update copyright year * removed pkg_resources from all test files and moved importlib into pandas extra * feat: removed pkg_resources from all test files and moved importlib into pandas extra * Adding no cover tag to test code * reformatted with black * undo revert * perf: use the first page a results when `query(api_method="QUERY")` (#1723) * perf: use the first page a results when `query(api_method="QUERY")` * add tests * respect max_results with cached page * respect page_size, also avoid bqstorage if almost fully downloaded * skip true test if bqstorage not installed * coverage * fix: ensure query job retry has longer deadline than API request deadline (#1734) In cases where we can't disambiguate API failure from job failure, this ensures we can still retry the job at least once. * fix: `load_table_from_dataframe` now assumes there may be local null values (#1735) Even if the remote schema is REQUIRED Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly: - [ ] Make sure to open an issue as a [bug/issue](https://togithub.com/googleapis/python-bigquery/issues/new/choose) before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea - [ ] Ensure the tests and linter pass - [ ] Code coverage does not decrease (if any source code was changed) - [ ] Appropriate docs were updated (if necessary) Fixes #1692 🦕 * chore: standardize samples directory - delete unneeded dependencies (#1732) * chore: standardize samples directory = delete unneeded dependencies * Removed unused import for linter * fix: move grpc, proto-plus and protobuf packages to extras (#1721) * chore: move grpc, proto-plus and protobuff packages to extras * formatted with black * feat: add `job_timeout_ms` to job configuration classes (#1675) * fix: adds new property and tests * 🦉 Updates from OwlBot post-processor See https://github.com/googleapis/repo-automation-bots/blob/main/packages/owl-bot/README.md * updates docs to correct a sphinx failure * Updates formatting * Update tests/system/test_query.py * 🦉 Updates from OwlBot post-processor See https://github.com/googleapis/repo-automation-bots/blob/main/packages/owl-bot/README.md * Update google/cloud/bigquery/job/base.py * updates one test and uses int_or_none * Update tests/system/test_query.py testing something. * Update tests/system/test_query.py * testing coverage feature * 🦉 Updates from OwlBot post-processor See https://github.com/googleapis/repo-automation-bots/blob/main/packages/owl-bot/README.md * minor edits * tweaks to noxfile for testing purposes * add new test to base as experiment * adds a test, updates import statements * add another test * edit to tests * formatting fixes * update noxfile to correct debug code * removes unneeded comments. --------- Co-authored-by: Owl Bot <gcf-owl-bot[bot]@users.noreply.github.com> --------- Co-authored-by: Chalmer Lowe <chalmerlowe@google.com> Co-authored-by: Owl Bot <gcf-owl-bot[bot]@users.noreply.github.com> Co-authored-by: Tim Swast <swast@google.com> * remove unnecessary version checks * undo bad commit, remove unneeded version checks * Revert "undo bad commit, remove unneeded version checks" This reverts commit 5c82dcf. * Revert "remove unnecessary version checks" This reverts commit 9331a7e. * revert bad changes, remove pkg_resources from file * after clarification, reimplement changes and ignore 3.12 tests * reformatted with black * removed minimum check * updated pandas installed version check --------- Co-authored-by: Anthonios Partheniou <partheniou@google.com> Co-authored-by: Tim Swast <swast@google.com> Co-authored-by: Chalmer Lowe <chalmerlowe@google.com> Co-authored-by: Owl Bot <gcf-owl-bot[bot]@users.noreply.github.com>

perf: use the first page a results when query(api_method="QUERY")

b8c583a

product-auto-label bot added size: s Pull request size is small. api: bigquery Issues related to the googleapis/python-bigquery API. labels Nov 15, 2023

tswast mentioned this pull request Nov 15, 2023

feat: add Client.query_and_wait which directly returns a RowIterator of results #1722

Merged

4 tasks

add tests

6a8059d

product-auto-label bot added size: m Pull request size is medium. and removed size: s Pull request size is small. labels Nov 15, 2023

tswast marked this pull request as ready for review November 15, 2023 22:54

tswast requested review from a team as code owners November 15, 2023 22:54

tswast requested a review from farhan0102 November 15, 2023 22:54

shollyman requested review from chalmerlowe and Linchin and removed request for farhan0102 November 15, 2023 22:56

tswast added 2 commits November 16, 2023 09:28

respect max_results with cached page

1f0e38e

respect page_size, also avoid bqstorage if almost fully downloaded

4401725

product-auto-label bot added size: l Pull request size is large. and removed size: m Pull request size is medium. labels Nov 16, 2023

tswast mentioned this pull request Nov 16, 2023

Provide support for synchronous queries through the v2/projects/{projectId}/queries endpoint #589

Closed

skip true test if bqstorage not installed

d078941

tswast commented Nov 16, 2023

View reviewed changes

google/cloud/bigquery/table.py Show resolved Hide resolved

tswast commented Nov 16, 2023

View reviewed changes

google/cloud/bigquery/table.py Show resolved Hide resolved

tswast added 2 commits November 16, 2023 15:02

Merge remote-tracking branch 'origin/main' into issue589-RowIterator.…

660aa76

…_is_completely_cached

coverage

476bcd7

Merge remote-tracking branch 'origin/main' into issue589-RowIterator.…

05d6a3e

…_is_completely_cached

Linchin reviewed Nov 20, 2023

View reviewed changes

Linchin approved these changes Nov 20, 2023

View reviewed changes

tswast merged commit 6290517 into main Nov 21, 2023
20 checks passed

tswast deleted the issue589-RowIterator._is_completely_cached branch November 21, 2023 15:13

release-please bot mentioned this pull request Nov 18, 2023

chore(main): release 3.14.0 #1709

Merged

This was referenced Dec 5, 2023

December 04, 2023 kitta65/bq-extension-vscode#257

Closed

December 04, 2023 kitta65/prettier-plugin-bq#264

Closed

December 04, 2023 kitta65/bq2cst#274

Closed

r1b mentioned this pull request Dec 6, 2023

dbapi: Skip storage client fetch when results cached #1745

Closed

This was referenced Dec 19, 2023

December 18, 2023 kitta65/bq-extension-vscode#262

Closed

December 18, 2023 kitta65/prettier-plugin-bq#269

Closed

December 18, 2023 kitta65/bq2cst#279

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf: use the first page a results when `query(api_method="QUERY")` #1723

perf: use the first page a results when `query(api_method="QUERY")` #1723

tswast commented Nov 15, 2023 •

edited

Linchin commented Nov 16, 2023

tswast commented Nov 16, 2023 •

edited

Linchin Nov 20, 2023

tswast Nov 21, 2023

Linchin left a comment

perf: use the first page a results when query(api_method="QUERY") #1723

perf: use the first page a results when query(api_method="QUERY") #1723

Conversation

tswast commented Nov 15, 2023 • edited

Linchin commented Nov 16, 2023

tswast commented Nov 16, 2023 • edited

Linchin Nov 20, 2023

Choose a reason for hiding this comment

tswast Nov 21, 2023

Choose a reason for hiding this comment

Linchin left a comment

Choose a reason for hiding this comment

perf: use the first page a results when `query(api_method="QUERY")` #1723

perf: use the first page a results when `query(api_method="QUERY")` #1723

tswast commented Nov 15, 2023 •

edited

tswast commented Nov 16, 2023 •

edited