Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: make read_session optional in ReadRowsStream.rows() #168

Closed
tswast opened this issue Mar 30, 2021 · 3 comments · Fixed by #228
Closed

feat: make read_session optional in ReadRowsStream.rows() #168

tswast opened this issue Mar 30, 2021 · 3 comments · Fixed by #228
Assignees
Labels
api: bigquerystorage Issues related to the googleapis/python-bigquery-storage API. type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design.

Comments

@tswast
Copy link
Contributor

tswast commented Mar 30, 2021

Is your feature request related to a problem? Please describe.

ReadRowsStream.rows() and other methods that decode the data currently take a ReadSession as a required argument. This is so that it can use the avro/arrow schema to decode the rows.

This is a bit of a pain, especially when workers are on separate processes.

Describe the solution you'd like

#166 adds arrow_schema and avro_schema properties to ReadRowsResponse. These are populated only in the first message of a read stream, but can be cached locally to decode subsequent messages.

To avoid breaking changes, a ReadSession can still be accepted, but it should be optional. Thankfully Python supports arguments which work either as args or kwargs, so I think this does not have to be a breaking change. We need to make sure we comment to the library devs that the order matters for this argument, though.

Describe alternatives you've considered

New Reader class?

Additional context

N/A

@tswast tswast added the type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design. label Mar 30, 2021
@product-auto-label product-auto-label bot added the api: bigquerystorage Issues related to the googleapis/python-bigquery-storage API. label Mar 30, 2021
@tswast tswast self-assigned this Jul 7, 2021
@tswast
Copy link
Contributor Author

tswast commented Jul 7, 2021

Although it's not documented in the Python reference, the source proto has a schema oneof field for this behavior.

https://github.com/googleapis/googleapis/blob/fe3cc5e650b8e52537f02549fe5fb49fc386f3a9/google/cloud/bigquery/storage/v1/storage.proto#L209-L215

We should be able to use very similar logic as from_read_session

def from_read_session(read_session):
schema_type = read_session._pb.WhichOneof("schema")
if schema_type == "avro_schema":
return _AvroStreamParser(read_session)
elif schema_type == "arrow_schema":
return _ArrowStreamParser(read_session)
else:
raise TypeError(
"Unsupported schema type in read_session: {0}".format(schema_type)
)

@tswast
Copy link
Contributor Author

tswast commented Jul 8, 2021

Behavior change: Can't raise ImportError if necessary libraries aren't installed until iteration time because without read_session, we don't know the serialization format until the first ReadRowsResponse message is parsed.

@tswast
Copy link
Contributor Author

tswast commented Jul 8, 2021

Behavior change: Empty stream returns dataframe / arrow table with no columns because schema is not available.

Edit: Since this is a breaking change, maybe we actually do still parse read_session for this? We can still deprecate read_session but wait to remove all parsing code until next major version?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api: bigquerystorage Issues related to the googleapis/python-bigquery-storage API. type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant