
increased memory usage in 2.4.0 #394

Closed
pietrodn opened this issue Nov 19, 2020 · 4 comments · Fixed by #400
Labels: api: bigquery (Issues related to the googleapis/python-bigquery API) · type: feature request (‘Nice-to-have’ improvement, new feature or different behavior or design)

Comments


pietrodn commented Nov 19, 2020

Version 2.4.0 of the library is allocating much more memory than the previous version, 2.3.1, when running multiple queries.
In particular, it seems that the QueryJob object is retaining the results of the query internally, and that memory is not deallocated.

I think that the problem is related to #374.

Environment details

  • macOS 11.0.1 (also observing this on Linux in a production environment)
  • Python version: 3.8.6
  • pip version: 20.1.1
  • google-cloud-bigquery version: 2.4.0

Steps to reproduce

Run the script in the code example with google-cloud-bigquery versions 2.4.0 and 2.3.1.
You will also need to install:

google-cloud-bigquery-storage==2.1.0
pandas==1.1.4
psutil==5.7.3

The outputs on my machine are:

With 2.4.0:

Initial memory used: 77 MB
Memory used: 642 MB
Memory used: 875 MB
Memory used: 1117 MB
Memory used: 1342 MB
Memory used: 1568 MB
Memory used: 1792 MB
Memory used: 2039 MB
Memory used: 2265 MB
Memory used: 2505 MB
Memory used: 2725 MB

With 2.3.1:

Initial memory used: 77 MB
Memory used: 97 MB
Memory used: 98 MB
Memory used: 99 MB
Memory used: 99 MB
Memory used: 99 MB
Memory used: 99 MB
Memory used: 100 MB
Memory used: 101 MB
Memory used: 101 MB
Memory used: 101 MB

Code example

Please note that we are storing a reference to the QueryJob objects, but not to the resulting DataFrames.

import os

import psutil
from google.cloud import bigquery

if __name__ == '__main__':
    client = bigquery.Client()

    process = psutil.Process(os.getpid())
    print(f"Initial memory used: {process.memory_info().rss / 1e6:.0f} MB")

    jobs = []

    for i in range(10):
        job = client.query("SELECT x FROM UNNEST(GENERATE_ARRAY(1, 1000000)) AS x")
        job.result().to_dataframe()  # the resulting DataFrame is discarded
        jobs.append(job)  # only the QueryJob reference is retained
        print(f"Memory used: {process.memory_info().rss / 1e6:.0f} MB")
product-auto-label bot added the api: bigquery label Nov 19, 2020

tswast commented Nov 19, 2020

We are caching the first page of results in the QueryJob class, which is why the memory is still being used in this example. (You're hanging on to the QueryJob class.)

What's the reason you'd want to retain a reference to this job class?
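The retention behavior tswast describes can be illustrated with plain Python, without BigQuery at all. In this sketch, `FakeJob` is a hypothetical stand-in for `QueryJob` that caches a large first page; a weak reference shows the job (and its cached page) staying alive exactly as long as the `jobs` list holds it:

```python
import weakref

class FakeJob:
    """Hypothetical stand-in for QueryJob, caching a large first page."""
    def __init__(self):
        self._query_results = [0] * 1_000_000  # simulated cached page

jobs = []
job = FakeJob()
jobs.append(job)
ref = weakref.ref(job)
del job  # the local name is gone, but the list still holds a reference

assert ref() is not None  # still reachable through the jobs list
jobs.clear()              # drop the last reference, as when not retaining jobs
assert ref() is None      # the job and its cached page are freed
```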


tswast commented Nov 19, 2020

Workaround: you can set job._query_results = None once you have finished fetching results from the job.

We can investigate doing this automatically, though it's a bit tricky, since the first page of results isn't actually used until the RowIterator is iterated over.
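The effect of the workaround can be simulated without BigQuery. Here `FakeJob` and `Page` are hypothetical stand-ins for `QueryJob` and its cached first page; a weak reference shows the cache being freed as soon as the attribute is cleared:

```python
import weakref

class Page:
    """Hypothetical stand-in for a cached first page of results."""
    def __init__(self, n):
        self.rows = list(range(n))

class FakeJob:
    """Hypothetical stand-in for QueryJob."""
    def __init__(self):
        self._query_results = Page(1_000_000)

job = FakeJob()
ref = weakref.ref(job._query_results)

assert ref() is not None   # the page is alive while the job caches it
job._query_results = None  # the workaround: drop the cached page
assert ref() is None       # CPython frees the page immediately
```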

tswast added the type: feature request label Nov 19, 2020
tswast changed the title from "Memory leak in 2.4.0" to "increased memory usage in 2.4.0" Nov 19, 2020
pietrodn (Author) commented

@tswast the reason to keep the QueryJob objects around is to launch multiple queries in parallel, and then call .result() on them as they finish. Thanks for the workaround!
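pietrodn's fan-out pattern, combined with the workaround, might look like the sketch below. `run_query` and `FakeJob` are hypothetical stand-ins (with the real library this would be `client.query(...)` returning a `QueryJob` whose `result()` blocks until the query finishes); `_query_results` is the private attribute named in the workaround:

```python
class FakeJob:
    """Hypothetical stand-in for QueryJob."""
    def __init__(self, sql):
        self.sql = sql
        self._query_results = [0] * 100_000  # simulated cached first page

    def result(self):
        # With the real library this blocks until the query completes.
        return self._query_results

def run_query(sql):
    # With the real library: client.query(sql)
    return FakeJob(sql)

# Start all queries up front; they run concurrently on the server side.
jobs = [run_query(f"SELECT {i}") for i in range(10)]

totals = []
for job in jobs:
    rows = job.result()        # consume this query's results
    totals.append(len(rows))
    job._query_results = None  # workaround: release the cached first page

assert totals == [100_000] * 10
assert all(j._query_results is None for j in jobs)
```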


tswast commented Nov 24, 2020

I'm reverting this change in #400.

I'm also doing more intensive benchmarking with a variety of table and row sizes, as it's clear that there are many combinations for which this caching behavior was a regression.

gcf-merge-on-green bot pushed a commit that referenced this issue Nov 24, 2020
When there are large result sets, fetching rows while waiting for the
query to finish can cause the API to hang indefinitely. (This may be due
to an interaction between connection timeout and API timeout.)

This reverts commit 86f6a51 (#374).


Fixes googleapis/python-bigquery-pandas#343
Fixes #394 🦕