
Pulling large dataframes using client.list_rows and rows.to_dataframe() leads to duplicates when setting max_results and start_index #1569

Open · kdw2126 opened this issue May 18, 2023 · 1 comment
Labels
api: bigquery (Issues related to the googleapis/python-bigquery API)
priority: p3 (Desirable enhancement or fix. May not be included in next release)
type: bug (Error or flaw in code with unintended results or allowing sub-optimal usage patterns)

Comments

kdw2126 commented May 18, 2023

Environment details

  • OS type and version: Mac OS X 13.1
  • Python version: 3.10.9, although I've had the same issue in 3.11 and earlier versions
  • pip version: 23.1.2
  • Google Cloud library versions:

google-api-core 2.11.0
google-auth 2.18.0
google-cloud-bigquery 3.10.0
google-cloud-bigquery-storage 2.19.1
google-cloud-core 2.3.2
google-crc32c 1.5.0
google-resumable-media 2.5.0

Steps to reproduce

  1. Pull a large table (tables with many columns seem to cause the largest problems) using client.list_rows, setting a large value for max_results and a non-zero start_index.
  2. Use rows.to_dataframe to collect the results into a DataFrame. Multiple pages containing the same row data are combined, so the resulting DataFrame contains duplicate rows.

Code example

from google.oauth2.service_account import Credentials
from google.cloud.bigquery import Client as BigQueryAPIClient

credentials = Credentials.from_service_account_file(
    INSERT_CREDENTIAL_FILE_HERE
)

client = BigQueryAPIClient(credentials=credentials, project="bigquery-public-data")

# Pull 50,000 rows starting at offset 50,000 from a wide public table.
table = client.get_table("bigquery-public-data.patents.publications")
rows = client.list_rows(table, max_results=50000, start_index=50000)
data = rows.to_dataframe(progress_bar_type="tqdm")

# Both prints return rows, even though the underlying table contains no
# duplicate (publication_number, application_number) pairs.
print(data[data[["publication_number", "application_number"]].duplicated()])
print(data[data['publication_number'] == 'WO-2011055014-A1'])

These (publication_number, application_number) pairs are not duplicated in the actual publications table, which can be confirmed with this SQL (it returns no rows):

SELECT publication_number, application_number, COUNT(*) FROM `bigquery-public-data.patents.publications` 
GROUP BY publication_number, application_number
HAVING COUNT(*) > 1
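
A minimal workaround sketch (not part of the original report): build the DataFrame directly from the row iterator instead of calling to_dataframe, then deduplicate defensively on the key columns. It reuses the client and table objects from the example above; note that dropping duplicates only masks the bug if the iterator itself yields overlapping pages.

import pandas as pd

rows = client.list_rows(table, max_results=50000, start_index=50000)
# bigquery Row objects support the mapping protocol, so dict(row) works.
records = [dict(row) for row in rows]
data = pd.DataFrame.from_records(records)

# Defensive deduplication on the key columns.
deduped = data.drop_duplicates(subset=["publication_number", "application_number"])
print(f"{len(data) - len(deduped)} duplicate rows dropped")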
product-auto-label bot added the api: bigquery label May 18, 2023
Linchin added the type: bug and priority: p3 labels Oct 31, 2023
Tunneller commented

I have exactly this problem. I had hoped to paginate through a table produced by SQL, but these duplicates make it useless. My specific setup used multiple threads, each with its own offset = K * Page_Size for K = 0, ..., Num_Threads - 1 (sketched after the version list below), but it doesn't work: each thread gets exactly Page_Size rows, yet many are duplicates, so each thread ends up with only about a third of the distinct data. My package versions are all more recent than yours, if that makes a difference:

google-api 0.1.12
google-api-core 2.18.0
google-api-python-client 2.106.0
google-auth 2.23.4
google-auth-httplib2 0.1.0
google-auth-oauthlib 1.0.0
google-cloud 0.34.0
google-cloud-bigquery 3.20.1
google-cloud-bigquery-storage 2.24.0
google-cloud-core 2.4.1
google-cloud-datastore 2.18.0
google-cloud-storage 2.13.0
google-crc32c 1.5.0
google-resumable-media 2.7.0
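
A minimal sketch of the multi-threaded pagination pattern described above, assuming the client and table objects from the first comment; fetch_page, PAGE_SIZE, and NUM_THREADS are hypothetical names. Under this bug each page comes back with exactly PAGE_SIZE rows, but the pages overlap, so the distinct-row count falls well short of NUM_THREADS * PAGE_SIZE.

from concurrent.futures import ThreadPoolExecutor

import pandas as pd

PAGE_SIZE = 50000    # hypothetical page size
NUM_THREADS = 3      # hypothetical thread count

def fetch_page(k):
    # Worker k reads the slice [k * PAGE_SIZE, (k + 1) * PAGE_SIZE).
    rows = client.list_rows(
        table,
        max_results=PAGE_SIZE,
        start_index=k * PAGE_SIZE,
    )
    return rows.to_dataframe()

with ThreadPoolExecutor(max_workers=NUM_THREADS) as pool:
    pages = list(pool.map(fetch_page, range(NUM_THREADS)))

combined = pd.concat(pages, ignore_index=True)
distinct = combined.drop_duplicates(subset=["publication_number", "application_number"])
# Expected: len(distinct) == NUM_THREADS * PAGE_SIZE; observed: far fewer.
print(len(combined), len(distinct))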
