Enable native byte array fields on structs #766

fungl164 · 2019-02-14T21:15:26Z

Auto-converts []byte arrays to strings behind the scenes so the bytes are not stored as individual integers on the graph.

Examples:

// Create some struct with native []byte array field inside
type MyStruct struct {
	ID   string `json:"id" quad:"@id"`
	Name string `json:"name"`  // Sample text field
	Data []byte `json:"data"`  // Sample binary field
}

// Register schema
schema.RegisterType(quad.IRI("mystruct"), MyStruct{})

// Write struct to graph
m := &MyStruct{
	ID:   "1",
	Name: "Sample struct with embedded byte array",
	Data: []byte{'H', 'e', 'l', 'l', 'o', ' ', 'W', 'o', 'r', 'l', 'd', '!'},
}
schema.WriteAsQuads(qw, m)


// Write raw bytes to graph
bytes := quad.ByteArr(m.Data)
schema.WriteAsQuads(qw, bytes)

dennwc

The change looks good in general, and I wanted to add bytes at some point as well, but there are some things we should figure out on RDF encoding first.

Can you please also add the new type to tests like this one? This will ensure that it can be saved to all databases properly.

Reviewed 3 of 3 files at r1.
Reviewable status: all files reviewed, 4 unresolved discussions (waiting on @fungl164)

quad/value.go, line 437 at r1 (raw file):

// ByteArr is representation of []byte as a value
type ByteArr string

Just Bytes

quad/value.go, line 453 at r1 (raw file):

	return TypedString{
		// TODO(dennwc): this is used to compute hash
		Value: String(b),

To be compatible with RDF we will probably need to encode it differently.

This may act as a reference, although I'm not sure it's the right one:
https://www.w3.org/TR/Content-in-RDF10/#bytesProperty

So maybe the "canonical RDF encoding" is the base64 with the type mentioned in that section.

schema/writer.go, line 95 at r1 (raw file):

			}
		case saveRule:
			if f.Type.Kind() == reflect.Slice && f.Type != reflect.TypeOf([]byte(nil)) {

It's better to save the result of reflect.TypeOf([]byte(nil)) to a global and use it here

voc/schema/schema.go, line 29 at r1 (raw file):

	Text = Prefix + `Text`
	// Data type: ByteArr.
	ByteArr = Prefix + "ByteArr"

This file corresponds to the types defined by https://schema.org, and there is no such type as "ByteArr".

fungl164 · 2019-02-14T23:49:54Z

quad/value.go, line 453 at r1 (raw file):

Previously, dennwc (Denys Smirnov) wrote…

To be compatible with RDF we will probably need to encode it differently.

This may act as a reference, although I'm not sure it's the right one:
https://www.w3.org/TR/Content-in-RDF10/#bytesProperty

So maybe the "canonical RDF encoding" is the base64 with the type mentioned in that section.

I thought about that. I implemented base64 encoding/decoding. There is the https://www.w3.org/TR/xmlschema-2/#base64Binary definition we can use. How should I define it in the code?

fungl164 · 2019-02-14T23:55:15Z

schema/writer.go, line 95 at r1 (raw file):

Previously, dennwc (Denys Smirnov) wrote…

It's better to save the result of reflect.TypeOf([]byte(nil)) to a global and use it here

Done.

fungl164 · 2019-02-15T00:04:26Z

voc/schema/schema.go, line 29 at r1 (raw file):

Previously, dennwc (Denys Smirnov) wrote…

This file corresponds to the types defined by https://schema.org, and there is no such type as "ByteArr".

I just made it up so I could test and run the code. I don't think there is a definition for just plain byte array data. Will it break things if we just take it out?

fungl164 · 2019-02-15T01:18:02Z

For the tests, I might need your help with the protobuf marshalling/unmarshalling.

With Bolt, I see the marshalling of quad.Bytes to ValueRaw going through, but the return ToNative() returns an empty array so the test fails.

Here's what I have so far, but you may be faster at figuring out where/what's failing... : )

func MakeValue(qv quad.Value) *Value {
	...
	case quad.Bytes:
		return &Value{&Value_Raw{v.Native().([]byte)}}
	...
}


// ToNative converts protobuf Value to quad.Value.
func (m *Value) ToNative() (qv quad.Value) {
	...
	case *Value_Raw:
		return quad.Bytes(v.Raw)
	...
}

fungl164 · 2019-02-16T05:39:49Z

Inching along...Defined quad.Bytes as proper []byte wrapper and I see the bytes getting stored as typed strings: "Hello World!"^^<schema:Bytes> on both Badger and Bolt.

The LoadTypedQuad Test is still not passing even though it's stored. I suspect resolving the bucket key is part of the issue, but I don't quite understand how the keying and indexing work yet...any comments/pointers/help very much appreciated...

Thanks!

dennwc

Can you please rebase on top of the latest master? The PR now includes changes from other recently merged PRs.

Reviewed 6 of 12 files at r4, 13 of 13 files at r5.
Reviewable status: all files reviewed, 10 unresolved discussions (waiting on @fungl164)

graph/graphtest/graphtest.go, line 710 at r5 (raw file):

		quad.Bool(true),
		quad.Time(time.Now()),
		quad.Bytes([]byte{'b', 'y', 't', 'e', 's'}),

Makes sense to add some "raw" bytes like 0x00 and similar.

graph/kv/indexing.go, line 833 at r5 (raw file):

		}
		if bytes, ok := v.(quad.Bytes); ok {
			v = bytes.TypedString()

This will definitely force it to be stored as "xxxx"^^<bytes> string.
I think it might be better to pass it as quad.Bytes further and convert it to something else when writing the value to the store.

internal/mapset/comparator.go, line 52 at r5 (raw file):

		return TimeComparator(a, b)
	}
	if a == b {

This case can be used for every other type except Time and []byte.

internal/mapset/map.go, line 36 at r5 (raw file):

func NewMapWithComparator(cmp func(a, b interface{}) int) Map {
	m := &btreeMap{
		inner: btree.NewWith(10, cmp),

Do we really need a B-Tree for this simple use case? Why can't Go maps be used?

internal/mapset/set.go, line 7 at r5 (raw file):

)

type Set interface {

Same here, do we really need a generic Set interface? Why not define a specific implementation?

quad/value.go, line 453 at r1 (raw file):

Previously, fungl164 wrote…

I thought about that. I implemented base64 encoding/decoding. There is the https://www.w3.org/TR/xmlschema-2/#base64Binary definition we can use. How should I define it in the code?

There should be some canonical URI/IRI for this type. I'm not sure https://www.w3.org/TR/xmlschema-2/# is the right namespace, though. It may be somewhere in XML namespaces - RDF inherits some of them.

quad/value.go, line 66 at r5 (raw file):

	if v != nil {
		// if vv, ok := v.(Bytes); ok {
		// 	h.Write([]byte(vv))

Hmm, right, we may want to write it as raw bytes. It should be fine since the String() method returns RDF string, so all string values will be quoted.

But it will lead to hash collisions for quad.String("a") and quad.Bytes("\"a\""). This may or may not be desirable. If we allow collisions, the first value that is written will "bind" it to a specific type. So if String("a") is written first, and the Bytes("\"a\"") is written next, you will always retrieve a String("a") for both cases.

To avoid this we should probably change the hash to include the type ID. But this will be a breaking change.

Let me think about it a bit more. There may be a way to store it efficiently while preserving backward compatibility. Or as an option, we can include it in v0.8 that won't be bound by binary compatibility promise.

quad/pquads/quads.go, line 23 at r5 (raw file):

	case quad.String:
		return &Value{&Value_Str{string(v)}}
	case quad.Bytes:

It may be better to remove this case and add a generic one before default. This case may check for the TypedString() method on the value (define an interface somewhere) and convert it.

So if it recognizes a specific type, it converts directly. If it doesn't, but the value can be converted to TypedString, it will be encoded this way as a fallback.
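A rough sketch of that fallback shape, using stand-in types rather than the real quad and pquads packages. The typedStringer interface, the Str and Bytes types, and the string output format are all assumptions made for illustration:

```go
package main

import (
	"encoding/base64"
	"fmt"
)

// typedStringer is a hypothetical interface for values that can describe
// themselves as a (lexical value, type name) pair.
type typedStringer interface {
	TypedString() (value, typ string)
}

// Str and Bytes are stand-ins for value types, not the real quad types.
type Str string
type Bytes []byte

// TypedString renders the bytes as base64 with a placeholder type name.
func (b Bytes) TypedString() (string, string) {
	return base64.StdEncoding.EncodeToString(b), "bytes"
}

// encode mimics a MakeValue-style switch: known types get a direct case,
// anything else that implements typedStringer uses the generic fallback.
func encode(v interface{}) string {
	switch v := v.(type) {
	case Str:
		return "str:" + string(v)
	case typedStringer: // generic case, placed just before default
		val, typ := v.TypedString()
		return "typed:" + typ + ":" + val
	default:
		return "unsupported"
	}
}

func main() {
	fmt.Println(encode(Str("a")))
	fmt.Println(encode(Bytes("hi")))
	fmt.Println(encode(42))
}
```

With this layout a new value type only needs a TypedString method to become storable; a dedicated case is an optimization, not a requirement.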

quad/pquads/quads.go, line 97 at r5 (raw file):

		return quad.BNode(v.Bnode)
	case *Value_TypedStr:
		if v.TypedStr.Type == schema.Bytes {

Instead of asserting for a specific type, this should call TypedString.ParseValue(): it will check for the known value types and will parse the string with a specified function. See quad.RegisterStringConversion.
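To illustrate the idea of a conversion registry (not Cayley's actual implementation), here is a self-contained sketch. The names parseFunc, registerConversion and parseTyped are hypothetical, and the type name "bytes" is a placeholder:

```go
package main

import (
	"encoding/base64"
	"fmt"
)

// parseFunc turns the lexical form of a typed string into a native value.
type parseFunc func(s string) (interface{}, error)

// conversions maps a type name to its parser. (Hypothetical registry.)
var conversions = map[string]parseFunc{}

// registerConversion adds or replaces the parser for a type name.
func registerConversion(typ string, fn parseFunc) {
	conversions[typ] = fn
}

// parseTyped uses the registered parser when there is one and otherwise
// returns the lexical form unchanged.
func parseTyped(value, typ string) (interface{}, error) {
	if fn, ok := conversions[typ]; ok {
		return fn(value)
	}
	return value, nil
}

func init() {
	// Register base64 decoding for the placeholder "bytes" type.
	registerConversion("bytes", func(s string) (interface{}, error) {
		return base64.StdEncoding.DecodeString(s)
	})
}

func main() {
	v, err := parseTyped("aGk=", "bytes")
	fmt.Println(v, err)
	v, err = parseTyped("plain", "unknown")
	fmt.Println(v, err)
}
```

The reader side then needs no per-type assertion: it asks the registry and lets unknown types pass through as typed strings.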

voc/schema/schema.go, line 29 at r1 (raw file):

Previously, fungl164 wrote…

I just made it up so I could test and run the code. I don't think there is a definition for just plain byte array data. Will it break things if we just take it out?

Yeah, it should be removed from here, but can be added as an unexported constant to quad package.

fungl164 · 2019-02-19T06:29:56Z

internal/mapset/map.go, line 36 at r5 (raw file):

Previously, dennwc (Denys Smirnov) wrote…

Do we really need a B-Tree for this simple use case? Why can't Go maps be used?

Go maps cough up when using quad.Bytes as a map key since it is defined by a slice of bytes. Btree is just a choice. It can be replaced with any proper map-like struct. Perhaps we can do some background trickery to convert strings <-> []bytes in the background...

fungl164 · 2019-02-19T06:30:22Z

internal/mapset/set.go, line 7 at r5 (raw file):

Previously, dennwc (Denys Smirnov) wrote…

Same here, do we really need a generic Set interface? Why not define a specific implementation?

Not used. will take it out.

dennwc

Reviewed 1 of 2 files at r6.
Reviewable status: all files reviewed, 11 unresolved discussions (waiting on @dennwc and @fungl164)

internal/mapset/map.go, line 36 at r5 (raw file):

Previously, fungl164 wrote…

Go maps cough up when using quad.Bytes as a map key since it is defined by a slice of bytes. Btree is just a choice. It can be replaced with any proper map-like struct. Perhaps we can do some background trickery to convert strings <-> []bytes in the background...

They have a special optimization for bytes if you use it like this:

m := make(map[string]interface{})
// cast to string should be inline for optimization to work
m[string(b)] = x
// this won't be optimized
k := string(b)
m[k] = x
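As a small illustration of keying a Go map by byte content, here is a sketch with a hypothetical byteSet type. To my knowledge the compiler's no-allocation treatment of an inline string(b) conversion applies to lookups; an insert still has to copy the bytes into a new string key.

```go
package main

import "fmt"

// byteSet is a set keyed by byte content. (Hypothetical helper type.)
type byteSet map[string]struct{}

// Add stores a copy of b as a string key, so later changes to b do not
// affect the set.
func (s byteSet) Add(b []byte) {
	s[string(b)] = struct{}{}
}

// Has keeps the conversion inline in the index expression, which is the
// form the compiler can handle without a temporary string allocation.
func (s byteSet) Has(b []byte) bool {
	_, ok := s[string(b)]
	return ok
}

func main() {
	s := byteSet{}
	key := []byte{0x00, 'a'}
	s.Add(key)
	key[1] = 'b' // mutate the slice after it was added

	fmt.Println(s.Has([]byte{0x00, 'a'})) // original content is still present
	fmt.Println(s.Has(key))               // mutated content was never added
}
```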

quad/value.go, line 450 at r6 (raw file):

}
func (b Bytes) Native() interface{} {
	// v, err := base64.StdEncoding.DecodeString(string(b))

It should store raw bytes in the value, not base64. It should convert to base64 only in TypedValue.

fungl164 · 2019-02-21T16:27:47Z

quad/value.go, line 450 at r6 (raw file):

Previously, dennwc (Denys Smirnov) wrote…

It should store raw bytes in the value, not base64. It should convert to base64 only in TypedValue.

Done.

fungl164 · 2019-02-21T16:30:04Z

quad/pquads/quads.go, line 23 at r5 (raw file):

Previously, dennwc (Denys Smirnov) wrote…

It may be better to remove this case and add a generic one before default. This case may check for the TypedString() method on the value (define an interface somewhere) and convert it.

So if it recognizes a specific type, it converts directly. If it doesn't, but the value can be converted to TypedString, it will be encoded this way as a fallback.

Fixed to store as raw bytes.

fungl164 · 2019-02-21T16:30:24Z

quad/pquads/quads.go, line 97 at r5 (raw file):

Previously, dennwc (Denys Smirnov) wrote…

Instead of asserting for a specific type, this should call TypedString.ParseValue(): it will check for the known value types and will parse the string with a specified function. See quad.RegisterStringConversion.

Fixed. Removed.

fungl164 · 2019-02-21T16:33:53Z

voc/schema/schema.go, line 29 at r1 (raw file):

Previously, dennwc (Denys Smirnov) wrote…

Yeah, it should be removed from here, but can be added as an unexported constant to quad package.

If we removed this, what does the TypedString type become?

fungl164 · 2019-02-21T16:34:10Z

internal/mapset/comparator.go, line 52 at r5 (raw file):

Previously, dennwc (Denys Smirnov) wrote…

This case can be used for every other type except Time and []byte.

Removed.

fungl164 · 2019-02-21T16:34:28Z

internal/mapset/map.go, line 36 at r5 (raw file):

Previously, dennwc (Denys Smirnov) wrote…

They have a special optimization for bytes if you use it like this:

m := make(map[string]interface{})
// cast to string should be inline for optimization to work
m[string(b)] = x
// this won't be optimized
k := string(b)
m[k] = x

Removed.

fungl164 · 2019-02-21T16:34:41Z

internal/mapset/set.go, line 7 at r5 (raw file):

Previously, fungl164 wrote…

Not used. will take it out.

Removed.

fungl164 · 2019-02-21T17:26:52Z

quad/value.go, line 453 at r1 (raw file):

Previously, dennwc (Denys Smirnov) wrote…

There should be some canonical URI/IRI for this type. I'm not sure https://www.w3.org/TR/xmlschema-2/# is the right namespace, though. It may be somewhere in XML namespaces - RDF inherits some of them.

I'm thinking converting raw bytes to b64 and wrapping the result as CDATA would actually be RDF-compliant. This may actually be a workable solution since CDATA makes no guarantees and leaves it up to the reader to interpret the contents inside the tag...Just need to ensure the docs reflect what the interpretation should be...or we could just skip the CDATA encapsulation and just rely on the docs...what do you think?

dennwc

Reviewed 7 of 11 files at r7.
Reviewable status: 7 of 8 files reviewed, 5 unresolved discussions (waiting on @dennwc and @fungl164)

graph/graphtest/graphtest.go, line 710 at r7 (raw file):

		quad.Bool(true),
		quad.Time(time.Now()),
		quad.IRI("C"), quad.Raw("<bytes>"),

quad.IRI("bytes") to make the hashes collide

graph/kv/indexing.go, line 145 at r7 (raw file):

				continue
			}
		} else if byt, ok := d.Val.(quad.Bytes); ok {

Please remove this and the same code below. Bytes should not appear as often as IRIs.

quad/value.go, line 453 at r1 (raw file):

Previously, fungl164 wrote…

I'm thinking converting raw bytes to b64 and wrapping the result as CDATA would actually be RDF-compliant. This may actually be a workable solution since CDATA makes no guarantees and leaves it up to the reader to interpret the contents inside the tag...Just need to ensure the docs reflect what the interpretation should be...or we could just skip the CDATA encapsulation and just rely on the docs...what do you think?

CDATA only works for XML form of RDF, which is rarely used. So we will need a namespace anyway for it to work with NQuads/Turtle/JSON-LD formats.
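For reference, one possible shape of a format-independent encoding is a base64 typed literal. The sketch below assumes the XML Schema datatype IRI http://www.w3.org/2001/XMLSchema#base64Binary; the thread had not settled on this type, and the function name bytesLiteral is made up for illustration.

```go
package main

import (
	"encoding/base64"
	"fmt"
)

// xsdBase64Binary is the XML Schema datatype IRI, assumed here as the
// literal type. The discussion above had not yet agreed on a type.
const xsdBase64Binary = "http://www.w3.org/2001/XMLSchema#base64Binary"

// bytesLiteral renders raw bytes as an N-Quads style typed literal. The
// standard base64 alphabet contains no quotes or backslashes, so the
// lexical form needs no extra escaping.
func bytesLiteral(b []byte) string {
	return `"` + base64.StdEncoding.EncodeToString(b) + `"^^<` + xsdBase64Binary + `>`
}

func main() {
	fmt.Println(bytesLiteral([]byte("Hello World!")))
	fmt.Println(bytesLiteral([]byte{0x00, 0xff})) // non-printable bytes survive
}
```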

voc/schema/schema.go, line 29 at r1 (raw file):

Previously, fungl164 wrote…

If we removed this, what does the TypedString type become?

We just need to find the right type, and it can be removed from here (since it's not related to schema.org), and added as a constant directly in quad package.

fungl164 · 2019-02-21T18:21:36Z

quad/value.go, line 453 at r1 (raw file):

Previously, dennwc (Denys Smirnov) wrote…

CDATA only works for XML form of RDF, which is rarely used. So we will need a namespace anyway for it to work with NQuads/Turtle/JSON-LD formats.

Will need you to take the lead on this.

fungl164 · 2019-02-21T18:24:18Z

voc/schema/schema.go, line 29 at r1 (raw file):

Previously, dennwc (Denys Smirnov) wrote…

We just need to find the right type, and it can be removed from here (since it's not related to schema.org), and added as a constant directly in quad package.

Ok. I'll remove it when we find the right type.

dennwc

Reviewed 2 of 2 files at r8.
Reviewable status: 7 of 8 files reviewed, 4 unresolved discussions (waiting on @dennwc and @fungl164)

graph/graphtest/graphtest.go, line 710 at r7 (raw file):

Previously, dennwc (Denys Smirnov) wrote…

quad.IRI("bytes") to make the hashes collide

Sorry, I meant quad.IRI("C") -> quad.IRI("bytes"). And quad.Raw("<bytes>") -> quad.Bytes("<bytes>"). This should create the collision that I was talking about.

quad/value.go, line 453 at r1 (raw file):

Previously, fungl164 wrote…

Will need you to take the lead on this.

OK, sure, will push to this branch when I have a bit of time.

fungl164 · 2019-02-21T19:44:37Z

graph/graphtest/graphtest.go, line 710 at r7 (raw file):

Previously, dennwc (Denys Smirnov) wrote…

Sorry, I meant quad.IRI("C") -> quad.IRI("bytes"). And quad.Raw("<bytes>") -> quad.Bytes("<bytes>"). This should create the collision that I was talking about.

I think I see what you meant. I faked-forced a different hash for Bytes vs IRI by inserting an extra byte. Dunno if it's the best way or if there will be other clashes, but hopefully unlikely if the hash is strong enough... : )

fungl164 · 2019-02-21T19:48:55Z

graph/graphtest/graphtest.go, line 710 at r7 (raw file):

Previously, fungl164 wrote…

I think I see what you meant. I faked-forced a different hash for Bytes vs IRI by inserting an extra byte. Dunno if it's the best way or if there will be other clashes, but hopefully unlikely if the hash is strong enough... : )

If the extra byte is not enough of a differentiator, adding a stronger suffix and/or prefix should make it work... : )

dennwc

Reviewable status: 5 of 8 files reviewed, 3 unresolved discussions (waiting on @dennwc and @fungl164)

quad/value.go, line 66 at r5 (raw file):

Previously, dennwc (Denys Smirnov) wrote…

Hmm, right, we may want to write it as raw bytes. It should be fine since the String() method returns RDF string, so all string values will be quoted.

But it will lead to hash collisions for quad.String("a") and quad.Bytes("\"a\""). This may or may not be desirable. If we allow collisions, the first value that is written will "bind" it to a specific type. So if String("a") is written first, and the Bytes("\"a\"") is written next, you will always retrieve a String("a") for both cases.

To avoid this we should probably change the hash to include the type ID. But this will be a breaking change.

Let me think about it a bit more. There may be a way to store it efficiently while preserving backward compatibility. Or as an option, we can include it in v0.8 that won't be bound by binary compatibility promise.

Let's make it a \x00 prefix instead of a \x01 suffix. If later we decide to change the hash for other values, we will use this extra byte as a value type ID.
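The collision and the prefix fix can be sketched as follows. This is a simplified model, not the PR's code: SHA-1 is assumed as the hash, plain quoting stands in for real RDF string escaping, and the three function names are hypothetical.

```go
package main

import (
	"crypto/sha1"
	"fmt"
)

// hashString hashes the RDF form of a string value, i.e. the quoted text.
// (Plain quoting is a simplification of real RDF escaping.)
func hashString(s string) [sha1.Size]byte {
	return sha1.Sum([]byte(`"` + s + `"`))
}

// hashBytesRaw hashes the raw payload with no type marker.
func hashBytesRaw(b []byte) [sha1.Size]byte {
	return sha1.Sum(b)
}

// hashBytesPrefixed hashes a 0x00 marker byte followed by the payload.
func hashBytesPrefixed(b []byte) [sha1.Size]byte {
	h := sha1.New()
	h.Write([]byte{0x00}) // marker that could later act as a value type ID
	h.Write(b)
	var out [sha1.Size]byte
	copy(out[:], h.Sum(nil))
	return out
}

func main() {
	payload := []byte(`"a"`) // bytes that happen to equal the RDF form of String("a")

	fmt.Println(hashString("a") == hashBytesRaw(payload))      // same input bytes, same hash
	fmt.Println(hashString("a") == hashBytesPrefixed(payload)) // marker byte separates them
}
```

The prefix works in this model because the RDF forms of strings and IRIs start with a quote or an angle bracket, never with 0x00, so the hashed inputs of a bytes value and of those values always differ.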

fungl164 · 2019-02-21T20:18:20Z

quad/value.go, line 66 at r5 (raw file):

Previously, dennwc (Denys Smirnov) wrote…

Let's make it a \x00 prefix instead of a \x01 suffix. If later we decide to change the hash for other values, we will use this extra byte as a value type ID.

Done.

dennwc

OK, great! We only need a correct RDF type for TypedString now. I'll take care of it.

Reviewed 1 of 2 files at r9.
Reviewable status: 6 of 8 files reviewed, 2 unresolved discussions (waiting on @dennwc)

iddan · 2019-09-19T10:03:59Z

What is the status of this PR?

dennwc · 2019-09-23T03:36:56Z

Need to rebase and check it again.

fungl164 requested a review from dennwc as a code owner February 14, 2019 21:15

dennwc requested changes Feb 14, 2019

dennwc requested changes Feb 18, 2019

Luis Fung added 5 commits February 19, 2019 00:21

Enable native byte array fields on structs

099b941

Code cleanup

69a3d1c

Enable native byte array fields on structs

12aca64

Code cleanup

b6b074a

Modifications to support byte array indexing and transactions for KV …

e5da3c5

…stores

fungl164 force-pushed the byte-arrays branch from 9c5a6c6 to e5da3c5 Compare February 19, 2019 05:40

dennwc requested changes Feb 21, 2019

FIX: quad.Bytes cleanup (tests passing)

84b7919

dennwc requested changes Feb 21, 2019

Clean up graphtest.go and indexing.go

3cd32b7

dennwc requested changes Feb 21, 2019

FIX: HashOf IRI clashing wiht Bytes (needs review)

e9fe408

dennwc requested changes Feb 21, 2019

FIX: Change HashOf Bytes 0x01 suffix to a 0x00 prefix

fcdbfe9

dennwc reviewed Feb 21, 2019
