Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable native byte array fields on structs #766

Open
wants to merge 9 commits into
base: master
Choose a base branch
from

Conversation

fungl164
Copy link

@fungl164 fungl164 commented Feb 14, 2019

Auto-converts []byte arrays to strings behind the scenes so the bytes are not stored as individual integers on the graph.

Examples:

// Create some struct with native []byte array field inside
type MyStruct struct {
	ID   string `json:"id" quad:"@id"`
	Name string `json:"name"`  // Sample text field
	Data []byte `json:"data"`  // Sample binary field
}

// Register schema
schema.RegisterType(quad.IRI("mystruct"), MyStruct{})

// Write struct to graph
m := &MyStruct{
	ID:   "1",
	Name: "Sample struct with embedded byte array",
	Data: []byte{'H', 'e', 'l', 'l', 'o', ' ', 'W', 'o', 'r', 'l', 'd', '!'},
}
schema.WriteAsQuads(qw, m)


// Write raw bytes to graph
bytes := quad.ByteArr(m.Data)
schema.WriteAsQuads(qw, bytes)

This change is Reviewable

Copy link
Member

@dennwc dennwc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change looks good in general, and I wanted to add bytes at some point as well, but there are some things we should figure out on RDF encoding first.

Can you please also add the new time to tests like this one? This will ensure that it can be saved to all databases properly.

Reviewed 3 of 3 files at r1.
Reviewable status: all files reviewed, 4 unresolved discussions (waiting on @fungl164)


quad/value.go, line 437 at r1 (raw file):

// ByteArr is representation of []byte as a value
type ByteArr string

Just Bytes


quad/value.go, line 453 at r1 (raw file):

	return TypedString{
		// TODO(dennwc): this is used to compute hash
		Value: String(b),

To be compatible with RDF we will probably need to encode it differently.

This may act as a reference, although I'm not sure it's the right one:
https://www.w3.org/TR/Content-in-RDF10/#bytesProperty

So maybe the "canonical RDF encoding" is the base64 with the type mentioned in that section.


schema/writer.go, line 95 at r1 (raw file):

			}
		case saveRule:
			if f.Type.Kind() == reflect.Slice && f.Type != reflect.TypeOf([]byte(nil)) {

It's better to save the result of reflect.TypeOf([]byte(nil)) to a global and use it here


voc/schema/schema.go, line 29 at r1 (raw file):

	Text = Prefix + `Text`
	// Data type: ByteArr.
	ByteArr = Prefix + "ByteArr"

This file corresponds to the types defined by https://schema.org, and there is no such type as "ByteArr".

@fungl164
Copy link
Author


quad/value.go, line 453 at r1 (raw file):

Previously, dennwc (Denys Smirnov) wrote…

To be compatible with RDF we will probably need to encode it differently.

This may act as a reference, although I'm not sure it's the right one:
https://www.w3.org/TR/Content-in-RDF10/#bytesProperty

So maybe the "canonical RDF encoding" is the base64 with the type mentioned in that section.

I thought about that. I implemented base64 encoding/decoding. there is teh https://www.w3.org/TR/xmlschema-2/#base64Binary definition we can use. How should I define it in the code?

@fungl164
Copy link
Author


schema/writer.go, line 95 at r1 (raw file):

Previously, dennwc (Denys Smirnov) wrote…

It's better to save the result of reflect.TypeOf([]byte(nil)) to a global and use it here

Done.

@fungl164
Copy link
Author


voc/schema/schema.go, line 29 at r1 (raw file):

Previously, dennwc (Denys Smirnov) wrote…

This file corresponds to the types defined by https://schema.org, and there is no such type as "ByteArr".

I just made it up so I could test and run the code. I don't think there is a definition for just plain byte array data. Will it break things if we just take it out?

@fungl164
Copy link
Author

For the tests, I might need your help with the protobuf marshalling/unmarshalling.

With Bolt, I see the marshalling of quad.Bytes to ValueRaw going through, but the return ToNative() returns an empty array so the test fails.

Here's what I have so far, but you may be faster at figuring out where/what's failing... : )

func MakeValue(qv quad.Value) *Value {
	...
	case quad.Bytes:
		return &Value{&Value_Raw{v.Native().([]byte)}}
	...
}


// ToNative converts protobuf Value to quad.Value.
func (m *Value) ToNative() (qv quad.Value) {
	...
	case *Value_Raw:
		return quad.Bytes(v.Raw)
	...
}

@fungl164
Copy link
Author

Inching along...Defined quad.Bytes as proper []byte wrapper and I see the bytes getting stored as typed strings: "Hello World!"^^<schema:Bytes> on both Badger and Bolt.

The LoadTypedQuad Test is still not passing even though it's stored. I suspect resolving the bucket key is part of the issue, but I don't quite understand how the keying and indexing work yet...any comments/pointers/help very much appreciated...

Thnxs!

Copy link
Member

@dennwc dennwc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please rebase on top of the latest master? The PR now includes changes from other recently merged PRs.

Reviewed 6 of 12 files at r4, 13 of 13 files at r5.
Reviewable status: all files reviewed, 10 unresolved discussions (waiting on @fungl164)


graph/graphtest/graphtest.go, line 710 at r5 (raw file):

		quad.Bool(true),
		quad.Time(time.Now()),
		quad.Bytes([]byte{'b', 'y', 't', 'e', 's'}),

Makes sense to add some "raw" bytes like 0x00 and similar.


graph/kv/indexing.go, line 833 at r5 (raw file):

		}
		if bytes, ok := v.(quad.Bytes); ok {
			v = bytes.TypedString()

This will definitely force it to be stored as "xxxx"^^<bytes> string.
I think it might be better to pass it as quad.Bytes further and convert it to something else when writing the value to the store.


internal/mapset/comparator.go, line 52 at r5 (raw file):

		return TimeComparator(a, b)
	}
	if a == b {

This case can be used for every other type except Time and []byte.


internal/mapset/map.go, line 36 at r5 (raw file):

func NewMapWithComparator(cmp func(a, b interface{}) int) Map {
	m := &btreeMap{
		inner: btree.NewWith(10, cmp),

Do we really need a B-Tree for this simple use case? Why Go maps cannot be used?


internal/mapset/set.go, line 7 at r5 (raw file):

)

type Set interface {

Same here, do we really need a generic Set interface? Why not define a specific implementation?


quad/value.go, line 453 at r1 (raw file):

Previously, fungl164 wrote…

I thought about that. I implemented base64 encoding/decoding. there is teh https://www.w3.org/TR/xmlschema-2/#base64Binary definition we can use. How should I define it in the code?

There should be some canonical URI/IRI for this type. I'm not sure https://www.w3.org/TR/xmlschema-2/# is the right namespace, though. It may be somewhere in XML namespaces - RDF inherits some of them.


quad/value.go, line 66 at r5 (raw file):

	if v != nil {
		// if vv, ok := v.(Bytes); ok {
		// 	h.Write([]byte(vv))

Hmm, right, we may want to write it as raw bytes. It should be fine since the String() method returns RDF string, so all string values will be quoted.

But it will lead to hash collisions for quad.String("a") and quad.Bytes("\"a\""). This may or may not be desirable. If we allow collisions, the first value that is written will "bind" it to a specific type. So if String("a") is written first, and the Bytes("\"a\"") is written next, you will always retrieve a String("a") for both cases.

To avoid this we should probably change the hash to include the type ID. But this will be a breaking change.

Let me think about it a bit more. There may be a way to store it efficiently while preserving backward compatibility. Or as an option, we can include it in v0.8 that won't be bound by binary compatibility promise.


quad/pquads/quads.go, line 23 at r5 (raw file):

	case quad.String:
		return &Value{&Value_Str{string(v)}}
	case quad.Bytes:

It may be better to remove this case and add a generic one before default. This case may check for the TypedString() method on the value (define an interface somewhere) and convert it.

So if it recognizes a specific type - it converts directly. If it doesn't, but the values can be converted to TypeString - it will be encoded this way as a fallback.


quad/pquads/quads.go, line 97 at r5 (raw file):

		return quad.BNode(v.Bnode)
	case *Value_TypedStr:
		if v.TypedStr.Type == schema.Bytes {

Instead of asserting for a specific type, there should be references to TypedString.ParseValue()call - it will check for the known value types and will parse the string with a specified function. See quad.RegisterStringConversion.


voc/schema/schema.go, line 29 at r1 (raw file):

Previously, fungl164 wrote…

I just made it up so I could test and run the code. I don't think there is a definition for just plain byte array data. Will it break things if we just take it out?

Yeah, it should be removed from here, but can be added as an unexported constant to quad package.

@fungl164
Copy link
Author


internal/mapset/map.go, line 36 at r5 (raw file):

Previously, dennwc (Denys Smirnov) wrote…

Do we really need a B-Tree for this simple use case? Why Go maps cannot be used?

Go maps cough up when using quad.Bytes as a map key since it is defined by a slice of bytes. Btree is just a choice. It can be replaced with any proper map-like struct. Perhaps we can do some background trickery to convert strings <-> []bytes in the background...

@fungl164
Copy link
Author


internal/mapset/set.go, line 7 at r5 (raw file):

Previously, dennwc (Denys Smirnov) wrote…

Same here, do we really need a generic Set interface? Why not define a specific implementation?

Not used. will take it out.

Copy link
Member

@dennwc dennwc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 2 files at r6.
Reviewable status: all files reviewed, 11 unresolved discussions (waiting on @dennwc and @fungl164)


internal/mapset/map.go, line 36 at r5 (raw file):

Previously, fungl164 wrote…

Go maps cough up when using quad.Bytes as a map key since it is defined by a slice of bytes. Btree is just a choice. It can be replaced with any proper map-like struct. Perhaps we can do some background trickery to convert strings <-> []bytes in the background...

They have a special optimization for bytes if you use it like this:

m := make(map[string]interface{})
// cast to string should be inline for optimization to work
m[string(b)] = x
// this won't be optimized
k := string(b)
m[s] = x

quad/value.go, line 450 at r6 (raw file):

}
func (b Bytes) Native() interface{} {
	// v, err := base64.StdEncoding.DecodeString(string(b))

It should store raw bytes in the value, not base64. It should convert to base64 only in TypedValue.

@fungl164
Copy link
Author


quad/value.go, line 450 at r6 (raw file):

Previously, dennwc (Denys Smirnov) wrote…

It should store raw bytes in the value, not base64. It should convert to base64 only in TypedValue.

Done.

@fungl164
Copy link
Author


quad/pquads/quads.go, line 23 at r5 (raw file):

Previously, dennwc (Denys Smirnov) wrote…

It may be better to remove this case and add a generic one before default. This case may check for the TypedString() method on the value (define an interface somewhere) and convert it.

So if it recognizes a specific type - it converts directly. If it doesn't, but the values can be converted to TypeString - it will be encoded this way as a fallback.

Fixed to store as raw bytes.

@fungl164
Copy link
Author


quad/pquads/quads.go, line 97 at r5 (raw file):

Previously, dennwc (Denys Smirnov) wrote…

Instead of asserting for a specific type, there should be references to TypedString.ParseValue()call - it will check for the known value types and will parse the string with a specified function. See quad.RegisterStringConversion.

Fixed. Removed.

@fungl164
Copy link
Author


voc/schema/schema.go, line 29 at r1 (raw file):

Previously, dennwc (Denys Smirnov) wrote…

Yeah, it should be removed from here, but can be added as an unexported constant to quad package.

If we removed this, what does the TypedString type become?

@fungl164
Copy link
Author


internal/mapset/comparator.go, line 52 at r5 (raw file):

Previously, dennwc (Denys Smirnov) wrote…

This case can be used for every other type except Time and []byte.

Removed.

@fungl164
Copy link
Author


internal/mapset/map.go, line 36 at r5 (raw file):

Previously, dennwc (Denys Smirnov) wrote…

They have a special optimization for bytes if you use it like this:

m := make(map[string]interface{})
// cast to string should be inline for optimization to work
m[string(b)] = x
// this won't be optimized
k := string(b)
m[s] = x

Removed.

@fungl164
Copy link
Author


internal/mapset/set.go, line 7 at r5 (raw file):

Previously, fungl164 wrote…

Not used. will take it out.

Removed.

@fungl164
Copy link
Author


quad/value.go, line 453 at r1 (raw file):

Previously, dennwc (Denys Smirnov) wrote…

There should be some canonical URI/IRI for this type. I'm not sure https://www.w3.org/TR/xmlschema-2/# is the right namespace, though. It may be somewhere in XML namespaces - RDF inherits some of them.

I'm thinking converting raw bytes to b64 and wrapping the result as CDATA would actually be RDF-compliant. This may actually be a workable solution since CDATA makes no guarantees and leaves it up to the reader to interpret the contents inside the tag...Just need to ensure the docs reflect what the interpretation should be...or we could just skip the CDATA encapsulation and just rely on the docs...what do you think?

Copy link
Member

@dennwc dennwc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 7 of 11 files at r7.
Reviewable status: 7 of 8 files reviewed, 5 unresolved discussions (waiting on @dennwc and @fungl164)


graph/graphtest/graphtest.go, line 710 at r7 (raw file):

		quad.Bool(true),
		quad.Time(time.Now()),
		quad.IRI("C"), quad.Raw("<bytes>"),

quad.IRI("bytes") to make the hashes collide


graph/kv/indexing.go, line 145 at r7 (raw file):

				continue
			}
		} else if byt, ok := d.Val.(quad.Bytes); ok {

Please remove this and the same code below. Bytes should not appear as often as IRIs.


quad/value.go, line 453 at r1 (raw file):

Previously, fungl164 wrote…

I'm thinking converting raw bytes to b64 and wrapping the result as CDATA would actually be RDF-compliant. This may actually be a workable solution since CDATA makes no guarantees and leaves it up to the reader to interpret the contents inside the tag...Just need to ensure the docs reflect what the interpretation should be...or we could just skip the CDATA encapsulation and just rely on the docs...what do you think?

CDATA only works for XML form of RDF, which is rarely used. So we will need a namespace anyway for it to work with NQuads/Turtle/JSON-LD formats.


voc/schema/schema.go, line 29 at r1 (raw file):

Previously, fungl164 wrote…

If we removed this, what does the TypedString type become?

We just need to find the right type, and it can be removed from here (since it's not related to schema.org), and added as a constant directly in quad package.

@fungl164
Copy link
Author


quad/value.go, line 453 at r1 (raw file):

Previously, dennwc (Denys Smirnov) wrote…

CDATA only works for XML form of RDF, which is rarely used. So we will need a namespace anyway for it to work with NQuads/Turtle/JSON-LD formats.

Will need you to take the lead on this.

@fungl164
Copy link
Author


voc/schema/schema.go, line 29 at r1 (raw file):

Previously, dennwc (Denys Smirnov) wrote…

We just need to find the right type, and it can be removed from here (since it's not related to schema.org), and added as a constant directly in quad package.

Ok. I'll remove it when we find the right type.

Copy link
Member

@dennwc dennwc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 2 files at r8.
Reviewable status: 7 of 8 files reviewed, 4 unresolved discussions (waiting on @dennwc and @fungl164)


graph/graphtest/graphtest.go, line 710 at r7 (raw file):

Previously, dennwc (Denys Smirnov) wrote…

quad.IRI("bytes") to make the hashes collide

Sorry, I meant quad.IRI("C") -> quad.IRI("bytes"). And quad.Raw("<bytes>") -> quad.Bytes("<bytes>"). This should create the collision that I was talking about.


quad/value.go, line 453 at r1 (raw file):

Previously, fungl164 wrote…

Will need you to take the lead on this.

OK, sure, will push to this branch when I have a bit of the time.

@fungl164
Copy link
Author


graph/graphtest/graphtest.go, line 710 at r7 (raw file):

Previously, dennwc (Denys Smirnov) wrote…

Sorry, I meant quad.IRI("C") -> quad.IRI("bytes"). And quad.Raw("<bytes>") -> quad.Bytes("<bytes>"). This should create the collision that I was talking about.

I think I see what you meant. I faked-forced a different hash for Bytes vs IRI by inserting an extra byte. Dunno if its the best way or if there will be other clashes, but hopefully unlikely if the hash is strong enough... : )

@fungl164
Copy link
Author


graph/graphtest/graphtest.go, line 710 at r7 (raw file):

Previously, fungl164 wrote…

I think I see what you meant. I faked-forced a different hash for Bytes vs IRI by inserting an extra byte. Dunno if its the best way or if there will be other clashes, but hopefully unlikely if the hash is strong enough... : )

If the extra byte is not enough of a differentiator, adding a stronger suffix and/or prefix should make it work... : )

Copy link
Member

@dennwc dennwc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 5 of 8 files reviewed, 3 unresolved discussions (waiting on @dennwc and @fungl164)


quad/value.go, line 66 at r5 (raw file):

Previously, dennwc (Denys Smirnov) wrote…

Hmm, right, we may want to write it as raw bytes. It should be fine since the String() method returns RDF string, so all string values will be quoted.

But it will lead to hash collisions for quad.String("a") and quad.Bytes("\"a\""). This may or may not be desirable. If we allow collisions, the first value that is written will "bind" it to a specific type. So if String("a") is written first, and the Bytes("\"a\"") is written next, you will always retrieve a String("a") for both cases.

To avoid this we should probably change the hash to include the type ID. But this will be a breaking change.

Let me think about it a bit more. There may be a way to store it efficiently while preserving backward compatibility. Or as an option, we can include it in v0.8 that won't be bound by binary compatibility promise.

Let's make it a \x00 prefix instead of a \x01 suffix. If later we decide to change the hash for other value, we will use this extra byte as a value type ID.

@fungl164
Copy link
Author


quad/value.go, line 66 at r5 (raw file):

Previously, dennwc (Denys Smirnov) wrote…

Let's make it a \x00 prefix instead of a \x01 suffix. If later we decide to change the hash for other value, we will use this extra byte as a value type ID.

Done.

Copy link
Member

@dennwc dennwc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, great! We only need a correct RDF type for TypedString now. I'll take care of it.

Reviewed 1 of 2 files at r9.
Reviewable status: 6 of 8 files reviewed, 2 unresolved discussions (waiting on @dennwc)

@iddan
Copy link
Collaborator

iddan commented Sep 19, 2019

What is the status of this PR?

@dennwc
Copy link
Member

dennwc commented Sep 23, 2019

Need to rebase and check it again.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants