
Apache Arrow streams only ~5 million of ~7 million records in the table #111

Closed · Fixed by #114
petr-ponomarenko opened this issue Jan 11, 2021 · 2 comments
Labels: api: bigquerystorage (Issues related to the googleapis/python-bigquery-storage API.) · type: docs (Improvement to the documentation for an API.)

Comments

petr-ponomarenko commented Jan 11, 2021

When I stream data from a BQ table with 7 million records using Apache Arrow, I get only about 5 million records. Other methods of getting data from that table into pandas work fine. I am following this example with one stream:

# This example reads from only a single stream. Read from multiple streams
# to fetch data faster. Note that the session may not contain any streams
# if there are no rows to read.
stream = read_session.streams[0]
reader = bqstorageclient.read_rows(stream.name)
# Parse all Arrow blocks and create a dataframe. This call requires a
# session, because the session contains the schema for the row blocks.
dataframe = reader.to_dataframe(read_session)
print(dataframe.head())
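
For context, the snippet above assumes a client and read session have already been created. A minimal sketch of that setup, assuming the v2 google-cloud-bigquery-storage client; the project, dataset, and table names below are placeholders:

# Hypothetical setup preceding the quoted snippet; identifiers are placeholders.
from google.cloud import bigquery_storage
from google.cloud.bigquery_storage import types

bqstorageclient = bigquery_storage.BigQueryReadClient()
requested_session = types.ReadSession(
    table="projects/{project}/datasets/{dataset}/tables/{table}",
    data_format=types.DataFormat.ARROW,
)
# Without a cap on stream count, the service may split the table's rows
# across several streams in read_session.streams.
read_session = bqstorageclient.create_read_session(
    parent="projects/{project}",
    read_session=requested_session,
)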

@product-auto-label product-auto-label bot added the api: bigquerystorage Issues related to the googleapis/python-bigquery-storage API. label Jan 11, 2021
@yoshi-automation yoshi-automation added the triage me I really want to be triaged. label Jan 12, 2021
tswast (Contributor) commented Jan 12, 2021

We probably need to update that code sample to explicitly request only one stream.

@tswast tswast added type: docs Improvement to the documentation for an API. and removed triage me I really want to be triaged. labels Jan 12, 2021
@tswast tswast self-assigned this Jan 12, 2021
tswast (Contributor) commented Jan 12, 2021

I've sent #114, which adds max_stream_count=1 to the create_read_session call. I believe the issue you're encountering occurs because additional streams are created, and the rows you are missing are assigned to those streams.
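
A minimal sketch of what that change looks like in the session setup, assuming the same v2 client as above (names are still placeholders):

from google.cloud import bigquery_storage
from google.cloud.bigquery_storage import types

bqstorageclient = bigquery_storage.BigQueryReadClient()
requested_session = types.ReadSession(
    table="projects/{project}/datasets/{dataset}/tables/{table}",
    data_format=types.DataFormat.ARROW,
)
# Capping the session at one stream guarantees all rows are assigned to
# read_session.streams[0], the only stream the sample actually reads.
read_session = bqstorageclient.create_read_session(
    parent="projects/{project}",
    read_session=requested_session,
    max_stream_count=1,
)

Alternatively, iterating over every stream in read_session.streams and concatenating the resulting dataframes would also recover the missing rows.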
