
Pulling large dataframes using client.list_rows and rows.to_dataframe() leads to duplicates when setting max_results and start_index #1569

Open · kdw2126 opened this issue May 18, 2023 · 1 comment
Labels
api: bigquery (Issues related to the googleapis/python-bigquery API)
priority: p3 (Desirable enhancement or fix. May not be included in next release)
type: bug (Error or flaw in code with unintended results or allowing sub-optimal usage patterns)

Comments

kdw2126 commented May 18, 2023

Environment details

  • OS type and version: Mac OS X 13.1
  • Python version: 3.10.9, although I've had the same issue in 3.11 and earlier versions
  • pip version: 23.1.2
  • Google Cloud library versions:

google-api-core 2.11.0
google-auth 2.18.0
google-cloud-bigquery 3.10.0
google-cloud-bigquery-storage 2.19.1
google-cloud-core 2.3.2
google-crc32c 1.5.0
google-resumable-media 2.5.0

Steps to reproduce

  1. Pull a large table (tables with many columns seem to cause the largest problems) using client.list_rows, setting a large value for max_results and a non-zero start_index.
  2. Use rows.to_dataframe to collect the results into a DataFrame. Multiple pages containing the same row data are combined, so the resulting DataFrame contains duplicate rows.

Code example

from google.oauth2.service_account import Credentials
from google.cloud.bigquery import Client as BigQueryAPIClient

credentials = Credentials.from_service_account_file(
    INSERT_CREDENTIAL_FILE_HERE
)

client = BigQueryAPIClient(credentials=credentials, project="bigquery-public-data")

# Pull 50,000 rows starting at offset 50,000 from a wide public table.
table = client.get_table("bigquery-public-data.patents.publications")
rows = client.list_rows(table, max_results=50000, start_index=50000)
data = rows.to_dataframe(progress_bar_type="tqdm")

# Both prints return rows, even though the underlying table contains no
# duplicate (publication_number, application_number) pairs.
print(data[data[["publication_number", "application_number"]].duplicated()])
print(data[data['publication_number'] == 'WO-2011055014-A1'])

These (publication_number, application_number) pairs are not duplicated in the actual publications table, which can be confirmed with this SQL (it returns no rows):

SELECT publication_number, application_number, COUNT(*) FROM `bigquery-public-data.patents.publications` 
GROUP BY publication_number, application_number
HAVING COUNT(*) > 1
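
A minimal workaround sketch (not part of the original report): build the DataFrame directly from the row iterator instead of calling to_dataframe, then deduplicate defensively on the key columns. It reuses the client and table objects from the example above; note that dropping duplicates only masks the bug if the iterator itself yields overlapping pages.

import pandas as pd

rows = client.list_rows(table, max_results=50000, start_index=50000)
# bigquery Row objects support the mapping protocol, so dict(row) works.
records = [dict(row) for row in rows]
data = pd.DataFrame.from_records(records)

# Defensive deduplication on the key columns.
deduped = data.drop_duplicates(subset=["publication_number", "application_number"])
print(f"{len(data) - len(deduped)} duplicate rows dropped")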
product-auto-label bot added the api: bigquery label May 18, 2023
Linchin added the type: bug and priority: p3 labels Oct 31, 2023
Tunneller commented

I have exactly this problem. I had hoped to paginate through a table produced by SQL, but these duplicates make it useless. My specific setup used multiple threads, each with its own offset = K * Page_Size for K = 0, ..., Num_Threads - 1 (sketched after the version list below), but it doesn't work: each thread gets exactly Page_Size rows, yet many are duplicates, so each thread ends up with only about a third of the distinct data. My package versions are all more recent than yours, if that makes a difference:

google-api 0.1.12
google-api-core 2.18.0
google-api-python-client 2.106.0
google-auth 2.23.4
google-auth-httplib2 0.1.0
google-auth-oauthlib 1.0.0
google-cloud 0.34.0
google-cloud-bigquery 3.20.1
google-cloud-bigquery-storage 2.24.0
google-cloud-core 2.4.1
google-cloud-datastore 2.18.0
google-cloud-storage 2.13.0
google-crc32c 1.5.0
google-resumable-media 2.7.0
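
A minimal sketch of the multi-threaded pagination pattern described above, assuming the client and table objects from the first comment; fetch_page, PAGE_SIZE, and NUM_THREADS are hypothetical names. Under this bug each page comes back with exactly PAGE_SIZE rows, but the pages overlap, so the distinct-row count falls well short of NUM_THREADS * PAGE_SIZE.

from concurrent.futures import ThreadPoolExecutor

import pandas as pd

PAGE_SIZE = 50000    # hypothetical page size
NUM_THREADS = 3      # hypothetical thread count

def fetch_page(k):
    # Worker k reads the slice [k * PAGE_SIZE, (k + 1) * PAGE_SIZE).
    rows = client.list_rows(
        table,
        max_results=PAGE_SIZE,
        start_index=k * PAGE_SIZE,
    )
    return rows.to_dataframe()

with ThreadPoolExecutor(max_workers=NUM_THREADS) as pool:
    pages = list(pool.map(fetch_page, range(NUM_THREADS)))

combined = pd.concat(pages, ignore_index=True)
distinct = combined.drop_duplicates(subset=["publication_number", "application_number"])
# Expected: len(distinct) == NUM_THREADS * PAGE_SIZE; observed: far fewer.
print(len(combined), len(distinct))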
