Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(bigquery/storage/managedwriter/adapt): add schema -> proto support #4375

Merged
merged 15 commits into from Jul 9, 2021

Conversation

shollyman
Copy link
Contributor

@shollyman shollyman commented Jul 2, 2021

This PR adds the ability to generate proto message definitions dynamically based on a table's schema. The new BQ write API communicates data solely through protocol buffers , so this helps enable the various use cases where users show up without a proto message predefined/compiled.

With this, a user can use the dynamic definitions to construct and serialize messages suitable for use in the write client. It also enables other features like json -> dynamic proto -> serialized data.
Towards: #4366

More background: The underlying API expects users to communicate proto schema (in the form of a DescriptorProto). Then we can append rows via streaming RPC where the backend acknowledges the writes.

@shollyman shollyman requested a review from a team as a code owner July 2, 2021 22:26
@google-cla google-cla bot added the cla: yes This human has signed the Contributor License Agreement. label Jul 2, 2021
@shollyman shollyman added the do not merge Indicates a pull request not ready for merge, due to either quality or timing. label Jul 2, 2021
@shollyman shollyman requested a review from codyoss July 7, 2021 17:43
@shollyman shollyman removed the do not merge Indicates a pull request not ready for merge, due to either quality or timing. label Jul 7, 2021
storagepb.TableFieldSchema_NUMERIC: ".google.protobuf.BytesValue",
storagepb.TableFieldSchema_STRING: ".google.protobuf.StringValue",
storagepb.TableFieldSchema_TIME: ".google.protobuf.Int64Value",
storagepb.TableFieldSchema_TIMESTAMP: ".google.protobuf.Int64Value",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a Timestamp type in proto3?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://github.com/protocolbuffers/protobuf/tree/master/src/google/protobuf, but this comes down to what the backend accepts for each column type. If the backend will convert timestamp protos, we can use them.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, we haven't supported it yet.

bigquery/storage/managedwriter/adapt/protoconversion.go Outdated Show resolved Hide resolved
bigquery/storage/managedwriter/adapt/protoconversion.go Outdated Show resolved Hide resolved
}, nil
}
// For NULLABLE, we use the wrapper types.
return &descriptorpb.FieldDescriptorProto{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we default nullable to wrapper? Should we annotate the field with use_defaults instead?
https://screenshot.googleplex.com/3kHa8QbkhmgFsuj

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm hesitant to include a bunch of internal type annotations for driving behavior here.

  • We have no stable source of truth for them. They're published within the zetasql project, but that project doesn't make any guarantees about stability and has no GA release, so the route to a GA launch is...probably problematic.

  • The reason we're using proto3 semantics is external users live in a proto3 ecosystem. In that world, the wrapper types are how to properly communicate nulls. I don't know that it matters, but type_annotations.proto is still proto2 based, so it's possible it introduces other issues as a side effect. We could consider revisiting this route down the line.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think frontend supports wrapper to nullable field conversion (we probably should). So currently if you convert these to wrapper, they will try map themselves back as a struct. In order to support such converter, the field needs to be annotated as is_wrapper. I understand that zetasql is not usable, but we can open backdoors for your self-defined annotation. Introducing the default behavior of mapping wrapper to nullable field would be a breaking change now.

Note that our API actually accepts proto2 instead of proto3 as the schema descriptor, inside of the converter, we look at the default_value field, if it is set then we set the default value, if not set, then we set null. You can surely set it on the ProtoSchema explicitly without having to introduce this mapping.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I'm understanding correctly, you're making users choose between being unable to send nulls, or being unable to send default values (empty string, 0, etc) when communicating the ProtoSchema?

Do you want me to create an issue for supporting wrapper types on the internal converter, or is there one already? An advantage of using the well known wrapper types is to avoid the need for special annotations, since both the client and backend have the definitions in place as part of the whole protocol buffer ecosystem.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In order to support the wrapper, we would need the is_wrapper annotation, otherwise it is a breaking change to existing conversions.

As to your first question, we don't have to make the choice if we are using proto2 (yeah, going back to that question). If it is proto 3 then there seems to be no other choices.

you can create an issue to support the wrapper, you are even welcomed to just go ahead and fix it!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Created b/193064992 for the wrapper issue

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the final solution would be for zetasql to be published and everything conform with the current googlesql specification (instead of creating our own wheels). If that is not possible, adding some fields to ProtoSchema would be fine, less ideal since it would be per schema instead of per msg/field option.

Since you are applying this to all nullable fields, it would be better for the backend to first support it before you add the wrapper conversion, otherwise, the library would be unusable...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a toggle to suppress generation of wrapper types, so we can revisit this once we have a path forward.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was also looking at the is_wrapper annotation you mentioned (googlesql MessageOption extension). I don't think that's going to be the right way to do this, as the expectation is that you "own" the message and can annotate it appropriately. These types are provided as part of the standard protocol buffer definitions, so adding annotations seems dodgy. I could see adding a FieldOption extension where you effectively say "the message in this field is a wrapper", but the "this message is a wrapper" annotation seems mismatched for standard types.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, the annotation should belong to the field.

@codyoss codyoss changed the title feat(bigquery managedstorage): add schema -> proto support feat(bigquery/storage/managedwriter/adapt): add schema -> proto support Jul 7, 2021
@shollyman shollyman requested a review from codyoss July 9, 2021 16:28
Copy link
Member

@codyoss codyoss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@shollyman shollyman merged commit 4ff6243 into googleapis:master Jul 9, 2021
@tyang020
Copy link

@shollyman Could you support packed = true annotation in repeated FieldOption? We use StorageSchemaToProto2Descriptor and observed 3 bytes overhead per element on repeated fields in the encoded rows. This would be huge overhead to AppendRows throughput for large arrays.

@shollyman shollyman deleted the fr-managedwriter-protoschema branch July 3, 2022 04:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla: yes This human has signed the Contributor License Agreement.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants