Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

spec: initial WAC spec #226

Open
wants to merge 2 commits into
base: master
Choose a base branch
from
Open

spec: initial WAC spec #226

wants to merge 2 commits into from

Conversation

aschmahmann
Copy link

This is a proposal for a new IPLD Codec called WAC. It is meant to be a complete and fitted IPLD codec, to the best that we can do with the data model.

It's very much a WIP and suggestions are much appreciated. The basic framework was taken from @mikeal's SimpleDAG codec proposal, but made suitable for a complete and fitted codec (i.e. no things like map sorting) and some decisions were made to make encoding/decoding easier at the cost making zero-copy implementations harder.

It is currently in use in https://github.com/aschmahmann/wasm-ipld and I've been doing some exploring on how a WASM IPLD implementation could plug into something like go-ipfs in ipfs/kubo#9016.


### Strings

As with the IPLD Data Model itself Strings are basically just Bytes with something to tag them as "string-like" but where UTF-8 is encouraged.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still consider it a mistake that the IPLD Data Model defines strings as arbitrary bytes. There currently is no valid use of it ("valid" in the sense of that you cannot create valid data with existing specified DAG codecs). To my knowledge, strings as arbitrary bytes causes more harm then good (here are two examples of PL related stacks where things were fixed in retrospective due to this: Fixing an interop issue in libp2p gossipsub, FIP-0027).

Also newer systems based on IPLD like Edelweiss, define Strings as a list of characters.

Systems outside of IPLD that have similar challenges, like the WebAssembly Interface Types, where interoperability between different programming languages is the goal, came to the conclusion that Strings should be a list of Unicode scalar values.

Recently the topic was again brought up by @Stebalien at #203.

Having talked with people about this topic a lot in the last few years, I haven't found people that thought defining strings the same way as bytes is a good idea. Hence it might be time to revisit this decision.

I bring this issue up as without WAC, I didn't really mind what the IPLD spec said about strings, as practically they were Unicode. With having WAC this would change. So I would be in favour if WAC would define Strings as list of Unicode characters.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are two issues with Strings == Unicode that I think we'd have to deal with if we wanted to make a change here.

  1. Is there existing data that would break here, e.g. dag-cbor data where there are non-unicode strings?
  2. Maps are currently defined to have String keys which IIUC it's quite problematic to assume Unicode-only keys here. Would this change to support non-String keys?

For example see #227 which is a spec for a fairly reasonable existing format called Bencode which is used by BitTorrent. A straight up change of Strings are now Unicode would make that format super miserable to make a codec for.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The stand-out example is addresses in Filecoin used as map keys. Thankfully that got changed in a (very) recent upgrade so they're now strings (can't find the issue but this was recognised as a mistake post-launch, but just not super high priority to "fix"). There exists Filecoin chain data that we can't read in JavaScript because it sticks []byte into a CBOR string field. I think maybe Rust may be in the same boat? Maybe FVM forced this fix. I've been considering hacks in JS to get backdoor access to the bytes but it's a real mess.

Go is comfortable with string([]byte(string(...)), but nobody else is.

I'm mostly with @vmx on this. Although we do have the ability to hand-wave a little bit on this, suggest that Unicode is how it is supposed to work, but that if an implementation chooses to use byte arrays without a validation step then it becomes caveat emptor that your data may not be readable by other implementations (though hand-waving is kind of gross if we don't have to do it—like we do with codecs with existing data).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From: #226 (comment)

I think #203 is inevitable at some point, it's only the Go-exclusive folks that are torn on this

Maybe although IMO #203 isn't feasible unless something is done about maps. Restricting map keys to be Strings while also insisting the Strings must be Unicode makes it impossible to have map keys that are non-Unicode.

As I mentioned above, how does a codec like #227 get implemented sanely? My understanding is there are also folks who want bytes or integer map keys that have been hobbling by with just shoving arbitrary stuff into bytes.


For others interested in the "what do we do about Strings" debate. I'd recommend reading https://github.com/ipld/ipld/blob/257834ab78bca17506e5ef0c4a7f00a43a06b0c0/notebook/exploration-reports/2020.10-data-model-maps-keys.md and the links from that issue. @warpfork and @vmx lay out a lot of the space and tradeoffs here for us to figure out what to do.

I have some opinions on this (e.g. the "IMO" section above), but if we're planning on making changes to the data model I'm certainly happy to reflect them in WAC since it's just trying to be a high fidelity representation that can support almost all of what the data model can.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Is there existing data that would break here, e.g. dag-cbor data where there are non-unicode strings?

Even if, to me that's not really a reason not to change it. If it exists, it's invalid CBOR. I don't think we can (nor should) support any invalid CBOR data that is out there. It was produced by buggy implementation, not due to being within the bounds of specs.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't tuple lists work for that quite well? You could have bytes and arbitrary ordering.

That is technically possible but is not desired.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is technically possible but is not desired.

What would be needed to make it desired? Would it help if it would be somehow implemented directly in IPLD libraries? To me (I might be the only one, but I'd like to change that :) the tuple list pattern is the best solution to the "arbitrary IPLD as map keys" and "custom sort order" problems.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you elaborate on what the API would look like for a tuple list?

One of the main things I like the existing Map API is that it's easy to reason about with existing ADLs like HAMT. Having more moving bits and new APIs feels like it'll be harder to integrate with existing things that are willing to work with a regular Map.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you elaborate on what the API would look like for a tuple list?

I sadly can't I know too little about Go, go-ipld-prime and ADLs.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@RangerMauve since you're talking about an ADL you just need to make sure you provide a sane mapping between encoded bytes and the external API, so it really shouldn't matter here whether you encode as Bytes or Strings. If you need arbitrary bytes then you encode as Bytes but maybe those get presented to the user through the ADL as Strings.

We do something like this with the HAMT and I'd expect to see a similar thing when we end up doing sorted maps: https://ipld.io/specs/advanced-data-layouts/hamt/spec/#schema - note how BucketEntry uses key Bytes, but when that gets all translated through the ADL code we end up with Strings (for the most part - we also retain the option of having Bytes keys through the API which isn't entirely precluded but is a touch controversial), see:

So, thanks to ADLs, or "lenses" in general, we get to do the right thing at the codec layer, but mess around with it however we like above it to transform it into what the end user wants.

Same goes with schema "typed" interfaces, using "tuple" representation maps (since we're talking tuples here), which are encoded by the codec as a List but are presented to the user as a Map, or even a Struct (the Filecoin chain makes use of this extensively, not encoding any Map to bytes even though they're presented to users as Structs that kind of look like they should be maps, they're just arrays of values in a certain order).

Comment on lines +76 to +78
Note: Simple-DAG put a VARINT_LENGTH before the CID indicating how long it was.

CIDs are self-delimiting so this didn't seem necessary.
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thoughts?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 on this choice, it may have been left off Simple DAG because at the time we didn't have simple tools to pluck it out without writing an initial-bytes decoder (we've since added one to both Go and JS impls) so it didn't occur as an obvious thing to do

Comment on lines +98 to +99
TODO: Floats (and negative floats) need definition here.
Making implementations actually bidirectional with respect to the IPLD Data Model seems difficult here.
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rvagg @mikeal @warpfork would like some feedback and thoughts here given that the IPLD Data Model is kind of sketchy on what we mean by floats and what we'd like to be representable.

The text here was grabbed from SimpleDAG, but there wasn't much there to go off of.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, maybe I should write an exploration report about this some time.. we've been iterating toward more accuracy with dag-cbor, although we may have further to go in our specification (our implementations go a bit further than our specifications I think, for reasons).

For this, you could say something like: Double-precision (64-bit) IEEE 754 floats representing finite numbers only. This excludes: half, single or any other length representation, infinites, negative zeros, NaNs, or any other byte representation that does not represent a finite floating point number.

We get pretty good mileage out of being strictly 64-bit IEEE 754 in dag-cbor now, it's been a while since I've seen anything surprising between implementations, although we don't get a ton of float usage by anyone so maybe something will come up.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Though, if you're really keen to stick with the mantissa+int thing .. then it might be hard to squish that definition in. If you just stick to 64-bit IEEE 754 then we'd have an easier time with bidirectionality, across languages too since everyone agrees with that binary representation.

I came to the decision that futzing with minimal-length representation thing with floats was a waste of time and too prone to impacting byte format convergence. If you really want floats in your binary data, then 64-bit IEEE 754 gives you what most people are after. If you want anything more then you're going to be doing some weird GPU encoding format optimizing for a strange use-case. And the fact is, people doing content addressing mostly avoid floating point numbers because they suck. (I'd be tempted to even leave it out entirely and let them reconstruct floating point numbers ADL-style).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Though, if you're really keen to stick with the mantissa+int thing

I literally just copied this from https://github.com/mikeal/simple-dag/blob/bd5e533e61c90c392c6b2d05df90c6f1f3843475/README.md?plain=1#L102. No strong opinions here looking for suggestions as to what makes sense.

I'm trying to be close to the "data model" so that the codec is rarely the thing getting in the way of interoperability. However, since the data model is particularly fluffy on float descriptions and comes with a ⚠️ it is recommended that Float values be avoided when developing systems on IPLD I'm looking for suggestions.

So looking for advice here. Part of me doesn't like arbitrary size cutoffs (e.g. 64 bits) since I tend to think of the data model as more abstracted than that, but implementations as caring about it. However, I don't think that's something I'm necessarily that attached to if there are better ideas out there.

(I'd be tempted to even leave it out entirely and let them reconstruct floating point numbers ADL-style).

Something I think would be cool is to basically enable any data that could be represented in an ADL to be representable as a WAC encoded block. Sure, it's unlikely any reasonable implementation would take a 50GB map of maps of maps of bytes and create a single block out of it but being able to do that for 10KB of data sounds useful.

Perhaps it's not reasonable for WAC to reliably round-trip float data through things like ADLs due to the different ways people work with floats though.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

enable any data that could be represented in an ADL to be representable as a WAC encoded block

This seems like a goal that could be forcing some decisions that might be less than ideal for a codec. Do you want to make that an explicit goal? Cause I'll need to put on new glasses if that's the case, and maybe you need to order your goals too because there's conflicts arising out of that.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

enable any data that could be represented in an ADL to be representable as a WAC encoded block

I think it'd be nice if this was the case since then an ADL that returns small enough data could be returned as WAC instead of as some other custom format. If that's not doable though then we could probably live with it.

Chatted with @vmx about this a bit as well. I'm currently thinking that even though the IPLD data model doesn't define an explicit range on floats (same as for integers) the added complexity of defining a varint-like spec for floats isn't currently worthwhile given floats aren't highly used in the IPLD ecosystem in general.

As a result, even though we've been able to represent the rest of the data model in WAC (using things like varints), for floats it's reasonable to compromise and use double precision (64-bit) IEEE 754 floats (https://en.wikipedia.org/wiki/Double-precision_floating-point_format).

If it turns out that people need a version of WAC that supports different types of floats then we can always make a WACv2 codec that supports them.

For this, you could say something like: Double-precision (64-bit) IEEE 754 floats representing finite numbers only. This excludes: half, single or any other length representation, infinites, negative zeros, NaNs, or any other byte representation that does not represent a finite floating point number.

I don't have super strong feelings here it looks like the WASM component model folks have settled on a single canonical NaN value to use but don't say anything about the infinities, negative zero, etc.

Maybe we just say that consumers shouldn't depend on that behavior, but not explicitly cause problems in WAC encoding/decoding since most implementations will probably end up being lazy and use whatever their environment's version of float64ToBytes/float64FromBytes anyhow.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WDYT about removing the existence of a negative float type further up the document, and having something like:

Suggested change
TODO: Floats (and negative floats) need definition here.
Making implementations actually bidirectional with respect to the IPLD Data Model seems difficult here.
Floats are represented as big-endian encoded double-precision (64-bit) IEEE 754 floats. Floats MUST represent finite numbers, other values representable as double-precision IEEE 754 floats MAY also be represented. These other values include: infinites, negative zeros, NaNs, or any other byte representation that does not represent a finite floating point number.

Note: I went with big-endian here mostly because this seems to be what CBOR uses, if it should be something else lmk 😄

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wanted to include a link to the IEEE 754 spec, but everything seems to be behind a paywall. Should I just use https://en.wikipedia.org/wiki/Double-precision_floating-point_format?


Note: This is essentially the same as Binary, but with a different token flag.

As with the IPLD Data Model itself Strings are basically just Bytes with something to tag them as "string-like" but where UTF-8 is encouraged. See [Strings].
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps somewhat contentious, but this is effectively what the data model is and since I'm trying to represent the data model accurately this is what we've got.

@vmx @rvagg I suspect this means that if I add more fixtures to https://github.com/ipld/codec-fixtures that the Rust/WASM code will not be able to use the existing Rust library and something else will need to be coded up.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think #203 is inevitable at some point, it's only the Go-exclusive folks that are torn on this

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Responded in the thread Volker started #226 (comment)

Comment on lines +139 to +140
Note: We could assert that keys are just `VARINT_LENGTH | STRING` and remove the String token since it's always a string.
It's some added complexity and really stops us from putting anything other than Strings in map keys, but that may not be too bad.
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding of the Data Model is we're restricted to "Strings" as map keys even if they're not UTF-8 strings. Keeping the extra token here seems safe, but could be convinced otherwise. It seems like keeping the token allows us room to maneuver in the future if we extend what are valid map keys, although to be fair we could just make a new codec version at that time 🤷.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Either way you should specify - I assumed from first read that this was strictly a string so the extra token was unnecessary - I think SimpleDAG did this?
I'd prefer leaving off the token and making the assumption. I think strictly string keys at the data model layer is going to stay that way, but other key types above that might become appropriate. So for a codec, it seems safe to me to lock in strings.

Comment on lines +135 to +137
Note: Simple-DAG went with `KEYS_VARINT_LENGTH | VALUES_VARINT_LENGTH | KEYS | VALUES |`. Both seem doable,
this approach seemed to make writing encoder/decoders really simple. However, adding in more length prefixes
makes creating faster "zero copy" decoders very nice. It seems to mostly be a tradeoff for which side has to have bigger buffers, the encoder or the decoder.
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thoughts? This change seemed pretty reasonable from the perspective of trying to make a simple decoder, but perhaps the lack of easy zero-copy here is a big deal

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, adapting our existing tokenising parsers to your form would be trivial, whereas buffering keys and values prior to emitting is an entirely different thing and has lots of annoying buffer problems (and potential security risks) on both encode and decode. Maybe if we had a strict block size limit it wouldn't be quite as annoying. Currently we assume streaming decode wherever we can (also streaming encode too except for maps, thanks to sorting).

+1 on the choice here from me.

Comment on lines +150 to +153
Note: Simple-DAG went with `VARINT_LENGTH` (the size of the VALUES binary section) instead of `VARINT_NUM_ELEMENTS`
and in the `VALUES` section had every value proceeded by the length of the value.

Both seem doable, this approach seemed to make writing a decoder really simple. However, adding in more length prefixes makes creating faster "zero copy" decoders very nice. It seems to mostly be a tradeoff for which side has to have bigger buffers, the encoder or the decoder.
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same type of change as with map entries

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to num elements

#### TYPE_INTEGER

```
| 1 | VARINT |
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably all the varints here should be https://en.wikipedia.org/wiki/LEB128#Unsigned_LEB128 unless someone has a better idea.

While all the varint lengths will likely be capped by implementations to reasonable sizes (e.g. no one is expecting 100EiBs of data in a single WAC encoded block 🙃), I don't see why positive and negate integers shouldn't be allowed to be enormous and still called integers.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably worth referencing https://github.com/multiformats/unsigned-varint as the definition for it somewhere in the doc since we like both unsigned and minimally encoded

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. Although I don't think we want this part of that spec Implementations MUST restrict the size of the varint to a max of 9 bytes (63 bits).. If someone wants a super big integer here I don't see why not, a big integer is still an integer.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's fine, but you should probably have language in here about the practical limits - something like: "Implementations that write integers using more than 9 bytes should not expect data to be readable by other implementations due to the commonality of 64-bit integer representations in IPLD implementations. It may be reasonable to implement a hard limit on varint representations by default but this decision is left to the implementer."

navTitle: "WAC"
---

WAC
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Initially named for WebAssembly Codec because I didn't know what this was going to look like when I started but I knew what it was for. It's turned out to be something probably more general than just for transporting data to/from WASM.

If someone else has got a name suggestion for the codec I'm game. Only rules are it cannot contain the prefix dag- 🙃.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to no dag-, I would have tried to veto that anyway!

You could use "SimpleDAG", with this just being an evolutionary step of it? Mikeal might be OK with that since we didn't push SimpleDAG far enough to be used anywhere in practice.

no other suggestions from me, naming is hard

* [Format](#format)
* [Implementations](#implementations)
* [Go](#go)
* [Rust/WASM](#rustwasm)
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the right way to do this? error: bad link specs/codecs/wac/spec#rustwasm at line 14

I could just change the title to Rust and WASM, but also curious what the answer is

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it wants #rust-wasm (confirmed by looking at the built file: <h3 id="rust-wasm" tabindex="-1">)

TODO: Floats (and negative floats) need definition here.
Making implementations actually bidirectional with respect to the IPLD Data Model seems difficult here.

#### TYPE_NEGATIVE_FLOAT
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(continuing the thought from above, I hate this, but it's unavoidable if you want mantissa+varint = vuarint+uvarint, although avoidable if you bite the IEEE 754 bullet)


`PAIRS` contains alternating keys then values concatenated. i.e. ` KEY1 | VALUE1 | KEY2 | VALUE2 ...`

This codec does not have any form of canonical map sorting as that would make it ill-fitted to the IPLD Data Model.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be worth having the discussion about ordering .. it's been discussed nearly to death but it has some very important things worth acknowledging. This definition of fittedness is an Eric construct; we all line up differently on this topic. There's tradeoffs either way you roll with this.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be worth having the discussion about ordering

Yep, I recognize that this PR has sort of kicked the hornets nest in reminding us of all the data model issues we've left unresolved and/or have left some folks unhappy with.

This is happening basically because for a few reasons, like ease of development and performance, in order to enable the WASM work in https://github.com/aschmahmann/wasm-ipld I wanted to be able to work with data model data without creating tons of functions that look like the node building and node iterating functions that exist in go-ipld-prime. This means I've moved some of the pieces of the data model that are "left as an exercise to the implementer" into a spec for a codec we can discuss. It also means that hopefully the wasm-ipld work isn't overly biased towards my perspective coming from the Go libraries. It's definitely biased since it's the only place I've written bindings for and go-ipld-prime is the only library we have ADL support, but I'm trying to minimize the bias and publicly communicate where I can.

This definition of fittedness is an Eric construct

A bunch of the definitions in that linked document are his construct, but the point remains that if the data model doesn't have a global ordering and WAC wants to be bidirectional with the data model (e.g. I can take an abstract data model node and then serialize and deserialize it and get the same data) then it needs to not care about map ordering as well.

There's tradeoffs either way you roll with this.

I don't think we run into any serious tradeoffs in this codec around ordering (at least for it's initially intended purpose), however there are some at the data model layer (which this codec is trying to be bidirectional with).

I think ipld/go-ipld-prime#220 was the most recent time this came up. The best resource I can locate at the moment with some of the tradeoffs and analysis here is https://github.com/ipld/ipld/blob/f059543ae6882d0d475a4d0b10ed4bbe1c421fed/design/tricky-choices/ordering.md, but if you've got more to look at (or comments to add) please post them 🙏.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(This is far from exhaustive, it's just top-of-mind while I'm tired and at the end of my day, this is a bit topic)

ipld/go-ipld-prime#343 is another example of where this comes up, doing this is near top of my list of next things to bite off. With current suite of codecs, the caveats to document there are fewer than when we have a codec that preserves "data model order" (because "data model order" really is a novel concept to us, as ipld/go-ipld-prime#220 demonstrated and ipld/go-ipld-prime#251 is related too). We have similar problems in JS where we're not supposed to depend on object key order (although in practice we can/do). Consider the problem of round-tripping through another codec (dag-json and dag-cbor have different sorting). But the biggest item of concern is convergence/determinism/canonical-forms. When you have a data tightly bound to a schema then you tend toward canonical forms (at the whims of an implementation), when you don't have a schema you're down to initial insertion-order.

Insertion-order preservation as a feature is something that's been argued for before, but that can be done already with List representations, that's what a List is good at.

Maybe it would be interesting to have a non-map-sorting codec in the mix, but you're going to get some hate for it when people bring their ordering expectations to it (maybe it goes the other way today and we don't hear about it? but I see it more in the other way, like when cbor-gen and our other dag-cbor implementations disagree on sorting order and we get different bytes that need explaining—this happens).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And for the record, I'm in the mixed-feelings camp on this. I'd love there to be a middle way that allows order preservation to be dictated down from the data model but a default to some kind of sorting without an opinion being expressed by another part of the stack. e.g. Typed (schema-backed) graphs don't get sorting, but untyped (no schema) graphs get sorting. But an argument could be made that this just inserts a whole new set of problems to the mix!


### Go

Here and adheres to the specification. However, in a practical sense because it implements the go-ipld-prime interface and its limits on integers (and floats once specified) are currently bounded by those interfaces. Similarly any varints are capped around 64 bits.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"capped around" ... squishy language, we're at the int64 vs uint64 challenge here, do you want to be specific? "around" is probably not appropriate for a spec.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ya, I think it's fair to ask for more specifics here. Although, the "around" part isn't related to the spec it's in about a particular implementation of the spec which is going to have limitations (e.g. I suspect reasonable implementations will not load integers of size 100GB).

At the moment go-ipld-prime gives us int64 rather than uint64 in most places so it's that, but as some of your recent work has shown (ipld/go-ipld-prime#413) we can manipulate the boundaries there. The point of the "around 64 bits" was basically to indicate the current lack of support for big integers.

Happy to make this more precise before merging though.

@vmx
Copy link
Member

vmx commented Jun 30, 2022

I forgot to leave a general note. One of the reasons why SimpleDAG wasn't pushed further was, that it wasn't that different from DAG-CBOR. I did a prototype implementation of SimpleDAG in JS, and then did a PR to transform it into a DAG-CBOR one. The difference was so little, that it didn't seem to be worth a new format.

I just wanted to bring it up as it might be nice to contrast it to DAG-CBOR.

@BigLep
Copy link
Contributor

BigLep commented Jul 19, 2022

IPLD triage note: assuming @aschmahmann will update with discussions that happened during IPFS Thing. This could also happen synchronously during an IPLD Community Call.


### Rust/WASM

Adheres to the specification. However, in a practical sense its limits on integers and floats are currently bounded to be of fixed maximum sizes. Similarly, all varints are capped around 64 bits.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To enable complex data passing over the Wasm boundary, MsgPack (or another serialization standard) can be used along with a standardized ABI. This also comes with the benefit of being language agnostic.

At the Polywrap project we've had success with this approach, and now have support for building composable wasm modules in AssemblyScript and Rust (TinyGo OTW). Might be useful to check it out for inspiration. (core standard being defined here

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CBOR could be a good option, and we've been using it in the FVM with some success (though mostly tuple encoded structures)

@RangerMauve
Copy link
Contributor

Should we include fat pointers as part of this spec? (or wait to see whether they will even become a thing)

multiformats/cid#49

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: 🔎 In Review
Development

Successfully merging this pull request may close these issues.

None yet

7 participants