feat(bigquery/storage/managedwriter/adapt): add schema -> proto support #4375

shollyman · 2021-07-02T22:26:42Z

This PR adds the ability to generate proto message definitions dynamically based on a table's schema. The new BQ write API communicates data solely through protocol buffers, so this helps enable the various use cases where users show up without a predefined or compiled proto message.

With this, a user can use the dynamic definitions to construct and serialize messages suitable for use in the write client. It also enables other features like json -> dynamic proto -> serialized data.
Towards: #4366

More background: The underlying API expects users to communicate proto schema (in the form of a DescriptorProto). Then we can append rows via streaming RPC where the backend acknowledges the writes.

TODO: this doesn't deal with nested messages yet.

yirutang · 2021-07-07T18:07:05Z

bigquery/storage/managedwriter/adapt/protoconversion.go

+	storagepb.TableFieldSchema_NUMERIC:    ".google.protobuf.BytesValue",
+	storagepb.TableFieldSchema_STRING:     ".google.protobuf.StringValue",
+	storagepb.TableFieldSchema_TIME:       ".google.protobuf.Int64Value",
+	storagepb.TableFieldSchema_TIMESTAMP:  ".google.protobuf.Int64Value",


Is there a Timestamp type in proto3?

There is (see https://github.com/protocolbuffers/protobuf/tree/master/src/google/protobuf), but this comes down to what the backend accepts for each column type. If the backend will convert timestamp protos, we can use them.

Yeah, we don't support it yet.


yirutang · 2021-07-07T18:17:32Z

bigquery/storage/managedwriter/adapt/protoconversion.go

+		}, nil
+	}
+	// For NULLABLE, we use the wrapper types.
+	return &descriptorpb.FieldDescriptorProto{


Should we default nullable to wrapper? Should we annotate the field with use_defaults instead?
https://screenshot.googleplex.com/3kHa8QbkhmgFsuj

I'm hesitant to include a bunch of internal type annotations for driving behavior here.

We have no stable source of truth for them. They're published within the zetasql project, but that project doesn't make any guarantees about stability and has no GA release, so the route to a GA launch is...probably problematic.

The reason we're using proto3 semantics is that external users live in a proto3 ecosystem. In that world, the wrapper types are how to properly communicate nulls. I don't know that it matters, but type_annotations.proto is still proto2 based, so it's possible it introduces other issues as a side effect. We could consider revisiting this route down the line.

I don't think the frontend supports wrapper to nullable field conversion (we probably should). So currently, if you convert these to wrappers, they will try to map themselves back as a struct. In order to support such a converter, the field needs to be annotated as is_wrapper. I understand that zetasql is not usable, but we can open backdoors for your self-defined annotation. Introducing the default behavior of mapping wrapper to nullable field would be a breaking change now.

Note that our API actually accepts proto2 instead of proto3 as the schema descriptor. Inside the converter, we look at the default_value field: if it is set, we set the default value; if it is not set, we set null. You can surely set it on the ProtoSchema explicitly without having to introduce this mapping.

If I'm understanding correctly, you're making users choose between being unable to send nulls, or being unable to send default values (empty string, 0, etc) when communicating the ProtoSchema?
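The trade-off in the question above can be sketched in plain Go. This is only an illustration of field presence, not a model of the backend converter or its default_value handling; `proto3Row`, `proto2Row`, and `describe` are hypothetical names. It relies on one fact about Go generated code: proto2 optional scalar fields are represented as pointers, so nil can mean "not set".

```go
package main

import "fmt"

// proto3Row models a proto3 scalar field without presence tracking:
// an unset field and an explicit empty string are indistinguishable.
type proto3Row struct {
	Name string
}

// proto2Row models a proto2 optional field, which Go generated code
// represents as a pointer; nil means the field was not set.
type proto2Row struct {
	Name *string
}

// describe reports how a receiver could interpret the pointer-based field.
// Hypothetical helper, for illustration only.
func describe(r proto2Row) string {
	if r.Name == nil {
		return "NULL"
	}
	return fmt.Sprintf("value %q", *r.Name)
}

func main() {
	empty := ""
	fmt.Println(describe(proto2Row{}))              // prints: NULL
	fmt.Println(describe(proto2Row{Name: &empty}))  // prints: value ""
	fmt.Println(proto3Row{} == proto3Row{Name: ""}) // prints: true
}
```

With presence (proto2 optional, or a wrapper message in proto3), both NULL and an explicit empty value can be sent; without it, the two collapse into one.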

Do you want me to create an issue for supporting wrapper types on the internal converter, or is there one already? An advantage of using the well known wrapper types is to avoid the need for special annotations, since both the client and backend have the definitions in place as part of the whole protocol buffer ecosystem.

In order to support the wrapper, we would need the is_wrapper annotation, otherwise it is a breaking change to existing conversions.

As to your first question, we don't have to make the choice if we are using proto2 (yeah, going back to that question). If it is proto3, then there seems to be no other choice.

You can create an issue to support the wrapper; you are even welcome to just go ahead and fix it!

Created b/193064992 for the wrapper issue

I think the final solution would be for zetasql to be published and for everything to conform to the current googlesql specification (instead of reinventing the wheel). If that is not possible, adding some fields to ProtoSchema would be fine, though less ideal, since it would be per schema instead of a per-message or per-field option.

Since you are applying this to all nullable fields, it would be better for the backend to first support it before you add the wrapper conversion, otherwise, the library would be unusable...

Added a toggle to suppress generation of wrapper types, so we can revisit this once we have a path forward.

I was also looking at the is_wrapper annotation you mentioned (googlesql MessageOption extension). I don't think that's going to be the right way to do this, as the expectation is that you "own" the message and can annotate it appropriately. These types are provided as part of the standard protocol buffer definitions, so adding annotations seems dodgy. I could see adding a FieldOption extension where you effectively say "the message in this field is a wrapper", but the "this message is a wrapper" annotation seems mismatched for standard types.

Yeah, the annotation should belong to the field.
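A minimal sketch of the toggle described in this thread, using plain strings instead of the real storagepb and descriptorpb types. The names `fieldSpec`, `protoTypeFor`, and `useWrappers` are hypothetical and are not the identifiers used in the PR; the STRING and TIMESTAMP wrapper entries follow the diff excerpt quoted earlier, and the scalar entries are assumptions for illustration.

```go
package main

import "fmt"

// fieldSpec is a hypothetical stand-in for a BigQuery table field.
type fieldSpec struct {
	Type string // e.g. "STRING", "TIMESTAMP"
	Mode string // "NULLABLE", "REQUIRED", or "REPEATED"
}

// scalarTypes holds assumed plain proto scalar names for each column type.
var scalarTypes = map[string]string{
	"STRING":    "string",
	"TIMESTAMP": "int64",
}

// wrapperTypes holds the well-known wrapper message names, as in the excerpt.
var wrapperTypes = map[string]string{
	"STRING":    ".google.protobuf.StringValue",
	"TIMESTAMP": ".google.protobuf.Int64Value",
}

// protoTypeFor picks a wrapper type only for NULLABLE fields, and only
// when wrapper generation has not been suppressed by the toggle.
func protoTypeFor(f fieldSpec, useWrappers bool) (string, error) {
	table := scalarTypes
	if f.Mode == "NULLABLE" && useWrappers {
		table = wrapperTypes
	}
	name, ok := table[f.Type]
	if !ok {
		return "", fmt.Errorf("no mapping for field type %q", f.Type)
	}
	return name, nil
}

func main() {
	f := fieldSpec{Type: "STRING", Mode: "NULLABLE"}
	for _, useWrappers := range []bool{true, false} {
		name, err := protoTypeFor(f, useWrappers)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("useWrappers=%v -> %s\n", useWrappers, name)
	}
}
```

Keeping the decision in one function means the default can be flipped later, once the backend accepts wrapper types, without touching the per-type tables.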


bigquery/storage/managedwriter/adapt/error.go

codyoss

LGTM

tyang020 · 2022-06-30T23:00:48Z

@shollyman Could you support the packed = true annotation in the FieldOptions of repeated fields? We use StorageSchemaToProto2Descriptor and observed 3 bytes of overhead per element on repeated fields in the encoded rows. This would be a huge overhead for AppendRows throughput with large arrays.
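For context, the size difference between unpacked and packed encodings of a repeated varint field can be estimated from the protobuf wire format alone. The sketch below is a rough calculation with hypothetical helper names (`varintLen`, `unpackedSize`, `packedSize`); it covers plain scalar elements only and does not attempt to reproduce the 3-byte figure reported above, which depends on the actual field numbers and element types.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

const (
	wireVarint = 0 // wire type for varint-encoded scalars
	wireBytes  = 2 // wire type for length-delimited data, used by packed fields
)

// varintLen returns the number of bytes needed to varint-encode v.
func varintLen(v uint64) int {
	var buf [binary.MaxVarintLen64]byte
	return binary.PutUvarint(buf[:], v)
}

// unpackedSize: every element is written with its own tag.
func unpackedSize(fieldNum int, vals []uint64) int {
	tag := varintLen(uint64(fieldNum)<<3 | wireVarint)
	total := 0
	for _, v := range vals {
		total += tag + varintLen(v)
	}
	return total
}

// packedSize: one tag, one length prefix, then the concatenated values.
func packedSize(fieldNum int, vals []uint64) int {
	payload := 0
	for _, v := range vals {
		payload += varintLen(v)
	}
	if payload == 0 {
		return 0 // an empty packed field is not written at all
	}
	tag := varintLen(uint64(fieldNum)<<3 | wireBytes)
	return tag + varintLen(uint64(payload)) + payload
}

func main() {
	vals := make([]uint64, 1000)
	for i := range vals {
		vals[i] = uint64(i % 100) // every value fits in one varint byte
	}
	fmt.Println(unpackedSize(1, vals), packedSize(1, vals)) // prints: 2000 1003
}
```

For small values, the per-element tag is a large share of the unpacked size, which is why packing matters most for long arrays of small scalars.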

shollyman added 2 commits July 1, 2021 21:01

feat(bigquery managedstorage): add schema -> proto support

d802075

TODO: this doesn't deal with nested messages yet.

cleanup proto reflection code

1bf877e

shollyman requested a review from a team as a code owner July 2, 2021 22:26

google-cla bot added the "cla: yes" label Jul 2, 2021

shollyman added the "do not merge" label Jul 2, 2021

shollyman added 3 commits July 2, 2021 23:06

address re-use of submessages, augment testing

df9faa8

add roundtrip json serialization test, ditch the manual reflect test

3cbe9c2

refactor weak map into a custom cache type

03abf14

shollyman requested a review from codyoss July 7, 2021 17:43

Merge branch 'master' into fr-managedwriter-protoschema

b353af0

shollyman removed the "do not merge" label Jul 7, 2021

remove stale TODO

e30b336

yirutang reviewed Jul 7, 2021


correct invalid wrapper mapping, reviewer caught

c82dec6

codyoss changed the title ~~feat(bigquery managedstorage): add schema -> proto support~~ feat(bigquery/storage/managedwriter/adapt): add schema -> proto support Jul 7, 2021

codyoss reviewed Jul 7, 2021


shollyman added 3 commits July 8, 2021 00:09

add internal toggle to disable wrapper types in proto descriptors

de0c252

Merge branch 'master' into fr-managedwriter-protoschema

7ea758f

improve error signalling

af9cb2f

codyoss reviewed Jul 9, 2021


shollyman added 3 commits July 9, 2021 16:24

address reviewer feedback

66faea3

Merge branch 'master' into fr-managedwriter-protoschema

a170306

better error string

b94b29c

shollyman requested a review from codyoss July 9, 2021 16:28

codyoss approved these changes Jul 9, 2021


make test agnostic to whitespace padding differences

c257d01

shollyman merged commit 4ff6243 into googleapis:master Jul 9, 2021

shollyman mentioned this pull request Jul 3, 2022

bigquery/storage/managedwriter/adapt: support packed annotation #6302

Closed

shollyman deleted the fr-managedwriter-protoschema branch July 3, 2022 04:31
