
Apache Arrow streams only ~5 million of ~7 million records in the table #111

Closed · Fixed by #114
petr-ponomarenko opened this issue Jan 11, 2021 · 2 comments
Labels: api: bigquerystorage (Issues related to the googleapis/python-bigquery-storage API.) · type: docs (Improvement to the documentation for an API.)

Comments

petr-ponomarenko commented Jan 11, 2021

When I stream data from a BQ table with 7 million records using Apache Arrow, I get only about 5 million records. Other methods of getting data from that table into pandas work fine. I am following this example with one stream:

# This example reads from only a single stream. Read from multiple streams
# to fetch data faster. Note that the session may not contain any streams
# if there are no rows to read.
stream = read_session.streams[0]
reader = bqstorageclient.read_rows(stream.name)
# Parse all Arrow blocks and create a dataframe. This call requires a
# session, because the session contains the schema for the row blocks.
dataframe = reader.to_dataframe(read_session)
print(dataframe.head())
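
For context, the snippet above assumes a client and read session have already been created. A minimal sketch of that setup, assuming the v2 google-cloud-bigquery-storage client; the project, dataset, and table names below are placeholders:

# Hypothetical setup preceding the quoted snippet; identifiers are placeholders.
from google.cloud import bigquery_storage
from google.cloud.bigquery_storage import types

bqstorageclient = bigquery_storage.BigQueryReadClient()
requested_session = types.ReadSession(
    table="projects/{project}/datasets/{dataset}/tables/{table}",
    data_format=types.DataFormat.ARROW,
)
# Without a cap on stream count, the service may split the table's rows
# across several streams in read_session.streams.
read_session = bqstorageclient.create_read_session(
    parent="projects/{project}",
    read_session=requested_session,
)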

@product-auto-label product-auto-label bot added the api: bigquerystorage Issues related to the googleapis/python-bigquery-storage API. label Jan 11, 2021
@yoshi-automation yoshi-automation added the triage me I really want to be triaged. label Jan 12, 2021
tswast (Contributor) commented Jan 12, 2021

We probably need to update that code sample to explicitly request only one stream.

@tswast tswast added type: docs Improvement to the documentation for an API. and removed triage me I really want to be triaged. labels Jan 12, 2021
@tswast tswast self-assigned this Jan 12, 2021
tswast (Contributor) commented Jan 12, 2021

I've sent #114, which adds max_stream_count=1 to the create_read_session call. I believe the issue you're encountering occurs because additional streams are created, and the rows you are missing are assigned to those streams.
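
A minimal sketch of what that change looks like in the session setup, assuming the same v2 client as above (names are still placeholders):

from google.cloud import bigquery_storage
from google.cloud.bigquery_storage import types

bqstorageclient = bigquery_storage.BigQueryReadClient()
requested_session = types.ReadSession(
    table="projects/{project}/datasets/{dataset}/tables/{table}",
    data_format=types.DataFormat.ARROW,
)
# Capping the session at one stream guarantees all rows are assigned to
# read_session.streams[0], the only stream the sample actually reads.
read_session = bqstorageclient.create_read_session(
    parent="projects/{project}",
    read_session=requested_session,
    max_stream_count=1,
)

Alternatively, iterating over every stream in read_session.streams and concatenating the resulting dataframes would also recover the missing rows.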
