perf: use `jobs.getQueryResults` to download result sets #347

tswast · 2020-10-27T21:47:37Z

Since `getQueryResults` was already used to wait for the job to finish,
this avoids an additional call to `tabledata.list`. The first page of
results is cached in memory.

Additional changes will come in the future to avoid calling the BQ
Storage API when the cached results contain the full result set.
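
For illustration only, here is a minimal sketch of the idea described above: the page already returned while waiting for the job is reused as the first page of rows, and later pages are fetched by page token. The names `iter_rows` and `fetch_page` are hypothetical and are not part of this PR or of the library; the response shape (`rows`, `pageToken`) is assumed to mirror a `jobs.getQueryResults` reply.

```python
from typing import Any, Callable, Dict, Iterator, Optional

# Assumed response shape: {"rows": [...], "pageToken": "..."}.
Page = Dict[str, Any]


def iter_rows(first_page: Page, fetch_page: Callable[[str], Page]) -> Iterator[Any]:
    """Yield rows starting from an already-fetched (cached) first page.

    `fetch_page` is a hypothetical callable that returns the page for a
    given page token; it is only invoked for pages after the first.
    """
    page: Optional[Page] = first_page
    while page is not None:
        yield from page.get("rows", [])
        token = page.get("pageToken")
        # Stop once the response carries no further page token.
        page = fetch_page(token) if token else None
```

With this shape, a result set that fits in the cached page needs no request beyond the one used to wait for the job.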

Towards #362

tswast · 2020-10-27T21:47:53Z

Based on #341

tswast · 2020-10-28T14:28:54Z

google/cloud/bigquery/job.py

@@ -2646,6 +2649,7 @@ def __init__(self, job_id, query, client, job_config=None):
            )

        self._query_results = None
+        self._get_query_results_kwargs = {}


Does this need to be a thread-local variable?

Actually, the cached query results might need to be thread-local too. Imagine if two threads called `result` with different starting indexes and/or max results.

We'll also need some logic like

https://github.com/googleapis/google-cloud-go/blob/925033712191bce44fa99eb117d6531106042272/bigquery/iterator.go#L314

to see if we can use the cached page if `result` is called more than once.

Done in latest commit.
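
A rough sketch of the concern raised in this thread, assuming a cached page is only safe to reuse when the same thread asks for the same starting index and max results. The class `QueryResultsCache` and its methods are hypothetical; this is neither the code from the commit nor the logic in the linked Go iterator.

```python
import threading

_MISSING = object()  # sentinel: nothing cached in this thread yet


class QueryResultsCache:
    """Hypothetical per-thread cache of a first page of query results."""

    def __init__(self):
        # threading.local gives each thread its own attribute namespace.
        self._local = threading.local()

    def store(self, page, start_index=None, max_results=None):
        self._local.key = (start_index, max_results)
        self._local.page = page

    def lookup(self, start_index=None, max_results=None):
        # Reuse the page only if this thread cached it for the same request.
        key = getattr(self._local, "key", _MISSING)
        if key == (start_index, max_results):
            return self._local.page
        return None
```

A `None` return here would mean the caller has to issue a fresh `getQueryResults` request instead of reusing the cached page.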

Also, move to thread-local variables for values that were intended to track parameters across methods.

`startIndex` is no longer passed to the iterator. It is used in the initial (cached) call to `getQueryResults`.

tswast · 2020-11-02T22:39:30Z

google/cloud/bigquery/client.py

+                Iterator of row data
+                :class:`~google.cloud.bigquery.table.Row`-s.
+        """
+        row_iterator = RowIterator(


Be sure to populate extra args with the field projection. We only need rows and page token.
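
As a sketch of what such a projection could look like, the helper below builds query parameters that ask only for `rows` and `pageToken` through the standard `fields` partial-response parameter. The helper name `build_query_params` is hypothetical, and how the resulting dict would be handed to `RowIterator` is left open; `pageToken` and `maxResults` are the usual `jobs.getQueryResults` parameters.

```python
def build_query_params(page_token=None, max_results=None, fields=("rows", "pageToken")):
    """Build query parameters with a field projection (hypothetical helper).

    Only the listed top-level response fields are requested, so the schema
    and job statistics are left out of each page.
    """
    params = {"fields": ",".join(fields)}
    if page_token is not None:
        params["pageToken"] = page_token
    if max_results is not None:
        params["maxResults"] = max_results
    return params
```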

tswast · 2020-11-03T20:16:00Z

Per our discussion, I'll be splitting this into 2 PRs:

1. Call `getQueryResults` (no cache) from `RowIterator`. Make sure to add a projection to exclude the schema and other irrelevant job stats. See "perf: use jobs.getQueryResults to download result sets" #363.
2. Cache the first page of results.

I'll base them on the refactoring to split up the giant job module here: #361

google-cla bot added the `cla: yes` label Oct 27, 2020

tswast force-pushed the optimized-query-getQueryResults branch from 7364196 to 983c8d2 October 29, 2020 19:15

fix: validate the query results cache before using

f52ed71

tswast marked this pull request as ready for review October 30, 2020 21:40

tswast requested review from a team and shollyman October 30, 2020 21:40

tswast added 4 commits November 2, 2020 09:44

Merge remote-tracking branch 'upstream/master' into optimized-query-getQueryResults

e149360

blacken. update dbapi to use thread local var

9b5920f

fix dbapi tests

07e6043

fix system test

af2e2cc

tswast mentioned this pull request Nov 2, 2020

refactor: split job.py and test_job.py #358

Closed

tswast added 2 commits November 2, 2020 10:50

add unit tests for missing coverage

540d530

blacken

6e83fbf

tswast added the `do not merge` label Nov 2, 2020

tswast closed this Nov 4, 2020

tswast mentioned this pull request Nov 5, 2020

perf: cache first page of jobs.getQueryResults rows #374

Merged
