Wait for Query hangs indefinitely for some queries #403

Closed
willbowditch opened this issue Nov 25, 2020 · 7 comments

Labels
  • api: bigquery (Issues related to the googleapis/python-bigquery API.)
  • type: question (Request for information or clarification. Not an issue.)

Comments

@willbowditch

I think this bug has been introduced by the wait_for_query tqdm helper in #352

Dropping back to the previous version resolves it: pipenv install google-cloud-bigquery==2.3.1

Some queries now hang indefinitely at "query is running" or "complete":

    query_job = client.query(query, job_config=job_config)
    df = query_job.to_dataframe(progress_bar_type='tqdm')

However, running the same query without the progress bar returns the dataframe:

    query_job = client.query(query, job_config=job_config)
    df = query_job.to_dataframe(progress_bar_type=None)

The following also returns the dataframe:

    query_job = client.query(query, job_config=job_config)
    df = query_job.to_dataframe(progress_bar_type=None)
    df = query_job.to_dataframe(progress_bar_type='tqdm')

Environment details

  • OS type and version: OS X 10.15.7
  • Python version: python --version 3.6.10
  • pip version: pip --version pip 20.2.4
  • google-cloud-bigquery version: pip show google-cloud-bigquery Version: 2.4.0
@product-auto-label product-auto-label bot added the api: bigquery Issues related to the googleapis/python-bigquery API. label Nov 25, 2020
@willbowditch
Author

Playing around with the internals of wait_for_query, it looks as though a timeout of 0.5 always returns a timeout error for me, even when the query has completed:

ReadTimeoutError: HTTPSConnectionPool(host='bigquery.googleapis.com', port=443): Read timed out. (read timeout=0.5)

Setting the timeout to 1 works.
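
To illustrate the observation above, here is a minimal sketch of how a polling loop behaves when each poll's timeout is shorter than the time a single response needs. This is not the library's wait_for_query code: FakeJob, response_latency, and wait_with_poll_limit are hypothetical names invented for the illustration, and the built-in TimeoutError stands in for the ReadTimeoutError shown above.

```python
class FakeJob:
    """Hypothetical stand-in for a query job; not a google-cloud-bigquery class."""

    def __init__(self, response_latency):
        # Seconds a single result() round trip is assumed to need.
        self.response_latency = response_latency
        self.calls = 0

    def result(self, timeout):
        self.calls += 1
        if timeout < self.response_latency:
            # Stand-in for the "Read timed out" error from the report.
            raise TimeoutError(f"read timeout={timeout}")
        return "rows"


def wait_with_poll_limit(job, poll_timeout=0.5, max_polls=5):
    """Poll job.result() with a short timeout, giving up after max_polls tries."""
    for _ in range(max_polls):
        try:
            return job.result(timeout=poll_timeout)
        except TimeoutError:
            continue  # a real helper would refresh the progress bar here
    raise RuntimeError(f"no result after {max_polls} polls of {poll_timeout}s")


job = FakeJob(response_latency=0.8)
print(wait_with_poll_limit(job, poll_timeout=1))  # succeeds on the first poll
```

With poll_timeout=0.5 every poll in this sketch times out. Without the max_polls cap, the loop would spin forever even though the result is ready, which resembles the hang described here.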

@yoshi-automation yoshi-automation added the triage me I really want to be triaged. label Nov 26, 2020
@HemangChothani HemangChothani added type: question Request for information or clarification. Not an issue. and removed triage me I really want to be triaged. labels Nov 26, 2020
@HemangChothani
Contributor

HemangChothani commented Nov 26, 2020

@willbowditch Could you please share the query or the amount of data you are retrieving? It might happen with big queries or when downloading a large number of rows. I have tried with around 16,000 rows but was not able to reproduce it.

Screenshot from 2020-11-26 17-41-28

@willbowditch
Author

@HemangChothani Sure, here are some details.

The data is not that large. This is the resulting data frame, using the previous version:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12460 entries, 0 to 12459
Data columns (total 9 columns):
***                  12460 non-null object
***                  12460 non-null object
***                  2460 non-null float64
***                  12460 non-null float64
***                  12460 non-null object
***                  12460 non-null float64
***                  12460 non-null int64
***                  12460 non-null float64
***                  12460 non-null float64
dtypes: float64(5), int64(1), object(3)
memory usage: 876.2+ KB

The query is not super complex. Roughly, it's three WITH statements and an inner join; the query plan has 6 stages.

A couple of observations:

  • If I change the query slightly to prevent the cached version being returned, the progress bar works as intended but never moves past "complete", and the dataframe is never returned.
  • If I rerun, using the cache, it hangs at "query is running".

@HemangChothani
Contributor

* If I rerun, using the cache, it hangs

Screenshot from 2020-11-27 16-39-51

This is the case where the query hits the cache: the progress bar just shows that the query is running, and when the query ends it shows 100% and the time it took. It doesn't hang there. The only difference is that without tqdm nothing is shown in the console, whereas with tqdm it prints that the query is running.

* If I change the query slightly, to prevent the cached version being returned 

If the progress bar shows the last stage as completed and the elapsed time on the progress bar is still incrementing, some operation is still going on after the stages complete. Once the time stops incrementing, you will get the Downloading progress bar.

Yes, right now timeout=0.5 will take more time than timeout=1 in some queries. I will discuss whether to change the default timeout or make it public for users.

@willbowditch
Author

@HemangChothani

If the progress bar shows the last stage as completed and the elapsed time on the progress bar is still incrementing, some operation is still going on after the stages complete. Once the time stops incrementing, you will get the Downloading progress bar.

The downloading progress bar never starts and a data frame is never returned. The same query in the web GUI reports: 1.9 sec elapsed, 18.8 MB processed.

See the screen recording (asciicast) of the following test program, run against the last two versions of python-bigquery:

    from pathlib import Path

    from google.cloud import bigquery

    print(f"BigQuery version: {bigquery.__version__}")

    # The SQL under test is read from a local file.
    query = Path("test.sql").read_text()

    bq = bigquery.Client()
    query_job = bq.query(query)
    # This is the call reported to hang when the tqdm progress bar is enabled.
    df = query_job.to_dataframe(progress_bar_type="tqdm")
    print(df.shape)
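
When chasing a hang like this one, it can help to bound the call so the script reports a stall instead of blocking forever. Below is a minimal sketch using only the standard library. run_with_watchdog is a hypothetical helper written for this illustration and is not part of google-cloud-bigquery.

```python
import threading


def run_with_watchdog(fn, limit_seconds):
    """Run fn() in a daemon thread and return (finished, value).

    finished is False if fn() was still running after limit_seconds.
    """
    outcome = {}

    def target():
        try:
            outcome["value"] = fn()
        except Exception as exc:  # hand errors back to the caller
            outcome["error"] = exc

    worker = threading.Thread(target=target, daemon=True)
    worker.start()
    worker.join(limit_seconds)
    if worker.is_alive():
        return False, None  # still running: treat as a hang
    if "error" in outcome:
        raise outcome["error"]
    return True, outcome["value"]


finished, value = run_with_watchdog(lambda: 6 * 7, limit_seconds=1.0)
print(finished, value)
```

With the test program above, fn would be something like lambda: query_job.to_dataframe(progress_bar_type="tqdm"). The worker thread is abandoned rather than cancelled, so this suits a diagnostic script, not production code.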

@HemangChothani
Contributor

HemangChothani commented Nov 27, 2020

@willbowditch Sorry for the extra noise. I am able to reproduce it, and it might be because of the caching changes done in the PR, which is now reverted. Could you please try with the master branch of google-cloud-bigquery?

@willbowditch
Author

@HemangChothani Can confirm it's now fixed on master, thanks 👍

❯ python test.py
BigQuery version: 2.4.0
Query complete after 0.49s: 100%|███████████| 1/1 [00:00<00:00,  2.04query/s]
Downloading: 100%|█████████████████| 12460/12460 [00:02<00:00, 4819.29rows/s]
(12460, 9)
Time: 0h:00m:05s
