Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CARv2 Specification Issues #292

Open
sstelfox opened this issue Jul 31, 2023 · 5 comments
Open

CARv2 Specification Issues #292

sstelfox opened this issue Jul 31, 2023 · 5 comments

Comments

@sstelfox
Copy link

sstelfox commented Jul 31, 2023

We started implementing the two index types in our CAR handling code initially by hand crafting fixtures according to the published specifications. During this exercise we found a couple of issues of ambiguity and incorrectness that I wanted to bring up here.

PRAGMA is an Invalid CARv1 Header

The first one is that the CARv2 fixed PRAGMA is an invalid CARv1 header. From the CARv2 specification:

The CARv2 version pragma magic bytes was chosen for compatibility with existing CARv1 parsers.

From the CARv1 specification, the IPLD Schema for the header is:

type CarHeader struct {
  version Int
  roots [&Any]
}

The IPLD schema specifically has an type modifier to allow for optional fields which this definition doesn't include. The omission of the "roots" key in the map in the hard coded PRAGMA is a violation of this specification. This prevents any decoder from ever being able to operate in strict decode mode as this would be considered a poorly formed.

I'm guessing this is omitted to save the 7-8 bytes, and the on-going ambiguity of whether this array is allowed to be empty. As far as I can tell there is no discussion on whether this key is allowed to be omitted entirely, and it doesn't seem like there is any question that this must not be empty when any blocks are present. This presents its own challenge as a fixed header by definition can't point at every possible CID the file format can contain. I'm not sure how to resolve this specification violation but at the very least adjusting the PRAGMA to not omit a required field is probably the correct option.

MultihashIndexSorted Ambiguous / Incorrect

The MultihashIndexSorted specifies encoding the multihash-code as a uint64, from the spec:

| multihash-code (uint64) | width (uint32) | count (uint64) | digest1 | digest1 offset (uint64) | digest2 | digest2 offset (uint64) ...

A multihash code is not a uint64 and that is not a sufficient type to either unambiguously define how to encode it, nor is it potentially sufficient to encode the full valid ranges of a multihash code as they are encoded as two 128-bit unsigned varints. In practice the size isn't likely to become an issue, but the ambiguity in encoding two variable length values in a single fixed length field. It is unclear whether the space should be split in half and each of the varints are encoded into 32bits, whether the encoded varint version should be treated as an unsigned value and just encoded directly as a little-endian value (which importantly does not necessarily match the byte codes for the encoding and length in the tables), and while I don't think its likely as big-endian isn't used anywhere that is also an option that is possible with the ambiguity in the spec.

I would recommend adjusting the spec to properly encode the multihash-code as two varints instead of cramming them into one undersized number. This will also likely recover most of the bytes added to fix the invalid PRAGMA issue when at least one index is present.

@willscott
Copy link
Member

for the PRAGMA

i believe the reason roots is fully omitted is because we tested with that pragma empirically against all the known implementations of carv1 parsing that we were aware of and got an appropriate errors (like 'invalid version') out of all of them. The point of the header was to trigger an error of some form. That the error is "invalid car" for a carv1 parser is considered appropriate. That's why it's specified as a fixed magic byte prefix rather than itself as a valid cbor object itself.

@sstelfox
Copy link
Author

sstelfox commented Jul 31, 2023

That explanation goes directly against what the CARv2 spec claims was intended (of which I already quoted) and doesn't really seem to make sense to me. If the PRAGMA was a valid CARv1 header, existing CARv1 parsers would simply see it as a regular CARv1 header with a single block not listed in the roots which would be a mostly valid expectation for a CARv1 only parser and generally parseable by any non-strict decoder.

It just wouldn't know that there is more to do with the inner block which is both fine and a safe failure case considering the intended use of the data exchanged using these formats is not guaranteed to be understood. Rejecting the file should be a higher level consideration of the application consuming it. If the application is then unable to access the data it expects inside the archive, accessing the archive will fail.

@rvagg
Copy link
Member

rvagg commented Aug 7, 2023

against what the CARv2 spec claims was intended (of which I already quoted)

but what you quoted was what @willscott said it was doing:

The CARv2 version pragma magic bytes was chosen for compatibility with existing CARv1 parsers.

I went into some detail on how current parsers work now in #287 (comment). You can see the JavaScript form of this parsing process in the links provided, and there's an updated WIP Go form that's basically doing the same thing here: https://github.com/ipld/go-car/blob/9247204455e7027f49c176b237a6d277c4979568/v3/carv1.go#L33-L45 - you can see IPLD schema trickery going on in both of these - the "v1 header or v2 pragma?" question is resolved with a schema that makes roots optional, but if it's a version:1 then it becomes non-optional. I need to get that process into the spec doc I think.

But back to the original quote about why it was chosen - at the time, both parsers that we had, would both error with a message that roughly said: "unsupported version: 2", which is the minimum that we wanted. We needed it to be a parse error, and ideally we wanted a meaningful message out of existing code. So the choice was the minimum number of bytes to trigger a meaningful message out of existing parsers.

@rvagg
Copy link
Member

rvagg commented Aug 7, 2023

Regarding the multihash "code" issue - this is referring to the multicodec code for the hash function used in the multihash.

A multihash code is not a uint64 and that is not a sufficient type to either unambiguously define how to encode it, nor is it potentially sufficient to encode the full valid ranges of a multihash code as they are encoded as two 128-bit unsigned varints.

The two varints in a multihash are the multicodec code and the digest length. The length is taken care of separately as the "width" in the MultihashIndexSorted format, so it's only the code that we care about; which is just a single integer. The varints that we use aren't specifically 128 bits, but they certainly could be. See https://github.com/multiformats/unsigned-varint for details on the format we use, which is mostly compatible with varint implementations but has some additional constraints.

For practical purposes, we're nowhere near needing the full 64-bit range with multicodec codes: https://github.com/multiformats/multicodec/blob/master/table.csv; I suspect that when we do we're in bigger trouble because of so many implementations that assume the number can fit within a uint64.

I hope that clarifies the format. If you think that some language adjustments to the spec would be helpful then contributions would be more than welcome, it's good to have fresh eyes on these things, especially from people trying to implement them.

@rvagg
Copy link
Member

rvagg commented Sep 5, 2023

@sstelfox would you like to suggest any specific clarifications to the spec after this discussion so that it's more clear for anyone else following up? Otherwise I'd like to close this issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants