
increased memory usage in 2.4.0 #394

Closed
pietrodn opened this issue Nov 19, 2020 · 4 comments · Fixed by #400
Labels: api: bigquery (Issues related to the googleapis/python-bigquery API) · type: feature request (‘Nice-to-have’ improvement, new feature or different behavior or design)

Comments


pietrodn commented Nov 19, 2020

Version 2.4.0 of the library is allocating much more memory than the previous version, 2.3.1, when running multiple queries.
In particular, it seems that the QueryJob object is retaining the results of the query internally, and that memory is not deallocated.

I think that the problem is related to #374.

Environment details

  • macOS 11.0.1 (also observing this on Linux in a production environment)
  • Python version: 3.8.6
  • pip version: 20.1.1
  • google-cloud-bigquery version: 2.4.0

Steps to reproduce

Run the script in the code example with google-cloud-bigquery versions 2.4.0 and 2.3.1.
You will also need to install:

google-cloud-bigquery-storage==2.1.0
pandas==1.1.4
psutil==5.7.3

The outputs on my machine are:

With 2.4.0:

Initial memory used: 77 MB
Memory used: 642 MB
Memory used: 875 MB
Memory used: 1117 MB
Memory used: 1342 MB
Memory used: 1568 MB
Memory used: 1792 MB
Memory used: 2039 MB
Memory used: 2265 MB
Memory used: 2505 MB
Memory used: 2725 MB

With 2.3.1:

Initial memory used: 77 MB
Memory used: 97 MB
Memory used: 98 MB
Memory used: 99 MB
Memory used: 99 MB
Memory used: 99 MB
Memory used: 99 MB
Memory used: 100 MB
Memory used: 101 MB
Memory used: 101 MB
Memory used: 101 MB

Code example

Please note that we are storing a reference to the QueryJob objects, but not to the resulting DataFrames.

import os

import psutil
from google.cloud import bigquery

if __name__ == '__main__':
    client = bigquery.Client()

    process = psutil.Process(os.getpid())
    print(f"Initial memory used: {process.memory_info().rss / 1e6:.0f} MB")

    jobs = []

    for i in range(10):
        job = client.query("SELECT x FROM UNNEST(GENERATE_ARRAY(1, 1000000)) AS x")
        job.result().to_dataframe()  # the resulting DataFrame is discarded
        jobs.append(job)  # only the QueryJob reference is retained
        print(f"Memory used: {process.memory_info().rss / 1e6:.0f} MB")
product-auto-label bot added the api: bigquery label Nov 19, 2020

tswast commented Nov 19, 2020

We are caching the first page of results in the QueryJob class, which is why the memory is still being used in this example. (You're hanging on to the QueryJob class.)

What's the reason you'd want to retain a reference to this job class?
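The retention behavior tswast describes can be illustrated with plain Python, without BigQuery at all. In this sketch, `FakeJob` is a hypothetical stand-in for `QueryJob` that caches a large first page; a weak reference shows the job (and its cached page) staying alive exactly as long as the `jobs` list holds it:

```python
import weakref

class FakeJob:
    """Hypothetical stand-in for QueryJob, caching a large first page."""
    def __init__(self):
        self._query_results = [0] * 1_000_000  # simulated cached page

jobs = []
job = FakeJob()
jobs.append(job)
ref = weakref.ref(job)
del job  # the local name is gone, but the list still holds a reference

assert ref() is not None  # still reachable through the jobs list
jobs.clear()              # drop the last reference, as when not retaining jobs
assert ref() is None      # the job and its cached page are freed
```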


tswast commented Nov 19, 2020

Workaround: you can set job._query_results = None once you have finished fetching results from the job.

We can investigate doing this automatically, though it's a bit tricky, since the first page of results isn't actually used until the RowIterator is iterated over.
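The effect of the workaround can be simulated without BigQuery. Here `FakeJob` and `Page` are hypothetical stand-ins for `QueryJob` and its cached first page; a weak reference shows the cache being freed as soon as the attribute is cleared:

```python
import weakref

class Page:
    """Hypothetical stand-in for a cached first page of results."""
    def __init__(self, n):
        self.rows = list(range(n))

class FakeJob:
    """Hypothetical stand-in for QueryJob."""
    def __init__(self):
        self._query_results = Page(1_000_000)

job = FakeJob()
ref = weakref.ref(job._query_results)

assert ref() is not None   # the page is alive while the job caches it
job._query_results = None  # the workaround: drop the cached page
assert ref() is None       # CPython frees the page immediately
```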

tswast added the type: feature request label Nov 19, 2020
tswast changed the title from "Memory leak in 2.4.0" to "increased memory usage in 2.4.0" Nov 19, 2020
pietrodn (Author) commented

@tswast the reason to keep the QueryJob objects around is to launch multiple queries in parallel, and then call .result() on them as they finish. Thanks for the workaround!
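pietrodn's fan-out pattern, combined with the workaround, might look like the sketch below. `run_query` and `FakeJob` are hypothetical stand-ins (with the real library this would be `client.query(...)` returning a `QueryJob` whose `result()` blocks until the query finishes); `_query_results` is the private attribute named in the workaround:

```python
class FakeJob:
    """Hypothetical stand-in for QueryJob."""
    def __init__(self, sql):
        self.sql = sql
        self._query_results = [0] * 100_000  # simulated cached first page

    def result(self):
        # With the real library this blocks until the query completes.
        return self._query_results

def run_query(sql):
    # With the real library: client.query(sql)
    return FakeJob(sql)

# Start all queries up front; they run concurrently on the server side.
jobs = [run_query(f"SELECT {i}") for i in range(10)]

totals = []
for job in jobs:
    rows = job.result()        # consume this query's results
    totals.append(len(rows))
    job._query_results = None  # workaround: release the cached first page

assert totals == [100_000] * 10
assert all(j._query_results is None for j in jobs)
```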


tswast commented Nov 24, 2020

I'm reverting this change in #400.

I'm also doing more intensive benchmarking with a variety of table and row sizes, as it's clear that there are many combinations for which this caching behavior was a regression.

gcf-merge-on-green bot pushed a commit that referenced this issue Nov 24, 2020
When there are large result sets, fetching rows while waiting for the
query to finish can cause the API to hang indefinitely. (This may be due
to an interaction between connection timeout and API timeout.)

This reverts commit 86f6a51 (#374).


Fixes googleapis/python-bigquery-pandas#343
Fixes #394 🦕