Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

perf: use jobs.getQueryResults to download result sets #347

Closed

Conversation

tswast
Copy link
Contributor

@tswast tswast commented Oct 27, 2020

Since getQueryResults was already used to wait for the job to finish,
this avoids an additional call to tabledata.list. The first page of
results are cached in-memory.

Additional changes will come in the future to avoid calling the BQ
Storage API when the cached results contain the full result set.

Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:

  • Make sure to open an issue as a bug/issue before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea
  • Ensure the tests and linter pass
  • Code coverage does not decrease (if any source code was changed)
  • Appropriate docs were updated (if necessary)

Towards #362

@google-cla google-cla bot added the cla: yes This human has signed the Contributor License Agreement. label Oct 27, 2020
@tswast
Copy link
Contributor Author

tswast commented Oct 27, 2020

Based on #341

@@ -2646,6 +2649,7 @@ def __init__(self, job_id, query, client, job_config=None):
)

self._query_results = None
self._get_query_results_kwargs = {}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this need to be a thread-local variable?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, the cached query results might need to be thread-local too. Imagine if two threads called result with different starting indexes and/or max results.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'll also need some logic like

https://github.com/googleapis/google-cloud-go/blob/925033712191bce44fa99eb117d6531106042272/bigquery/iterator.go#L314

to see if we can use the cached page if result is called more than once

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in latest commit.

Since `getQueryResults` was already used to wait for the job to finish,
this avoids an additional call to `tabledata.list`. The first page of
results are cached in-memory.

Additional changes will come in the future to avoid calling the BQ
Storage API when the cached results contain the full result set.
@tswast tswast force-pushed the optimized-query-getQueryResults branch from 7364196 to 983c8d2 Compare October 29, 2020 19:15
Also, move to thread-local variables for values that were
intended to track parameters across methods.
@tswast tswast marked this pull request as ready for review October 30, 2020 21:40
@tswast tswast requested review from a team and shollyman October 30, 2020 21:40
startIndex is no longer passed to the iterator
It is used in the initial (cached) call to
getQueryResults
Iterator of row data
:class:`~google.cloud.bigquery.table.Row`-s.
"""
row_iterator = RowIterator(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Be sure to populate extra args with the field projection. We only need rows and page token.

@tswast tswast added the do not merge Indicates a pull request not ready for merge, due to either quality or timing. label Nov 2, 2020
@tswast
Copy link
Contributor Author

tswast commented Nov 3, 2020

Per our discussion, I'll be splitting this into 2 PRs:

  1. Call getQueryResults (no cache) from RowIterator -- make sure to add a projection to exclude the schema and other irrelevant job stats. perf: use jobs.getQueryResults to download result sets #363
  2. Cache the first page of results.

I'll base them on the refactoring to split up the giant job module here: #361

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla: yes This human has signed the Contributor License Agreement. do not merge Indicates a pull request not ready for merge, due to either quality or timing.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

1 participant