Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add JSON form of the table #305

Open
rvagg opened this issue Dec 6, 2022 · 7 comments · May be fixed by #311
Open

Add JSON form of the table #305

rvagg opened this issue Dec 6, 2022 · 7 comments · May be fixed by #311
Assignees

Comments

@rvagg
Copy link
Member

rvagg commented Dec 6, 2022

I can't find the issue(s) where we discussed this but there's been a proposal on the table for a while that we represent the multicodec table as JSON for easier consumption and more flexibility. I'd like to switch it so that the CSV is generated from the JSON and people contribute to the JSON. A JSON table would let us add more items, like #297, and longer descriptions, and overall much more flexibility for entries and ease of downstream consumption.

This needs a PR to propose the format. Does it need to be line-delimited JSON? Can it be fully pretty-printed? What does linting look like? What does CSV generation look like?

Anyone want to give this a go?

@zzo38
Copy link

zzo38 commented Jan 1, 2023

I prefer to use CSV, and think it is simpler for many cases. (I do use the CSV myself.) However, it is also true that in some programming langauges, JSON will be better, so having both can be helpful.

One problem with using JSON for contribution is the lack of a trailing comma, meaning that the records do not all have the same format. Another note about JSON is that JSON does not properly have 64-bit integers; although JavaScript has a integer type now, JSON predates the integer type in JavaScript.

If JSON is implemented, ensure that the CSV is generated in the same format it is now (including spacing, although a column may be made wider if necessary), possibly adding additional columns at the end if such a thing is necessary (which it might or might not be). The CSV should be plain ASCII (do not use any non-ASCII characters), and should not have any quotation marks; if you convert JSON to CSV, ensure that this is the case, even though the JSON will definitely include quotation marks, and possibly non-ASCII characters. (If it includes long descriptions, they will need to somehow be truncated, or maybe it is better to have a separate field for a short description.)

If you do want to switch to JSON like that, you will first need to convert CSV to JSON first, so that it will all be JSON, and then you can convert JSON to CSV. (Then, you can check that the output matches the original, that the converter does not have a bug. Fortunately, you have a version control system, so you can revert it if something goes wrong.)

The program will have to check that it is valid. If the JSON includes the varint code (as suggested in #297), then it will have to check that the varint code matches the numeric code. This should be easily enough to implement.

(Some programs might use the numeric codes directly, such as the hash.c and hash.h files in Free Hero Mesh, where they are used for defining constants with names such as HASH_SHA3_256 (defined as 0x16) and passed as arguments to function calls. Passing numbers will be more efficient than passing strings to identify hash algorithms where they need to specified (in a function call or something else), and using multicodec numbers will allow the same numbers to be used in many programs, in case that is helpful sometimes. So, both numeric codes and varint codes will be in use.)

@rvagg rvagg self-assigned this Jan 3, 2023
@BigLep
Copy link

BigLep commented Jan 3, 2023

2023-01-03 IPLD triage conversation:
@mriise : just curious if this would be something you'd be interested in picking up?

@rvagg
Copy link
Member Author

rvagg commented Jan 4, 2023

Here's a rough outline that in my head from the various evolved conversations about this:

  • table.json is the primary form of the table, new edits should only touch that
  • table.csv is compiled from table.json and CI should be involved in some way - either blocking contributions where the user hasn't run a compile and table.csv doesn't match table.json compiled, or by just compiling and committing back the compiled form into the PR
  • The table.csv format should be fixed in stone, no more columns please, those are all we get there (e.g. Add varint encoding of code to table #297 belongs in table.json only)
  • It would be nice if table.json wasn't too sprawling to read, so I'm thinking that we come up with our own prettified JSON and enforce that with a script (user-runnable and in CI), perhaps one entry per line, but with nice spacing in each line (not sure if just plain spaces or spaced so columns line up so it's readable). So something like this (incl Add varint encoding of code to table #297, and I've made up a new "ref" field for links to specs):
[
  { "name": "identity", "tag": "multihash", "code": "0x00", "varint": "0", "status": "permanent", "description": "raw binary" },
  { "name": "cidv1",    "tag": "cid",       "code": "0x01", "varint": "1", "status": "permanent", "description": "CIDv1",     "ref": [ "https://github.com/multiformats/cid" ] },
  { "name": "cidv2",    "tag": "cid",       "code": "0x02", "varint": "2", "status": "draft",     "description": "CIDv2",     "ref": [ "https://github.com/multiformats/cid" ] },
  { "name": "cidv3",    "tag": "cid",       "code": "0x03", "varint": "3", "status": "draft",     "description": "CIDv3",     "ref": [ "https://github.com/multiformats/cid" ] },
  { "name": "ip4",      "tag": "multiaddr", "code": "0x04", "varint": "4", "status": "permanent" }
]

@zzo38
Copy link

zzo38 commented Jan 4, 2023

That looks like to me that it can work, that it seems good enough (the lack of a trailing comma is a bit messy because now the records do not all have the same format, although I suppose there is not a satisfactory way to avoid that). The reference seems a good idea, too.

However, please ensure that the CSV file contains no quotation marks, no commas within the data of a field (commas are only between fields), no non-ASCII characters. (If the JSON contains strings with commas, quotation marks, and/or non-ASCII characters, which would appear in the CSV file, then the conversion would need to somehow change them, when it is being converted to CSV.)

Also, how are those fields encoded? The varint field just stores a single digit number, even though it seems that varint encoding should be encoded just as a sequence of hex codes, e.g. 808001 for the number 16384 (according to the examples in the varint specification)).

@RangerMauve
Copy link

RangerMauve commented Jan 10, 2023

One thing to note, might be useful to use ndjsonfor streaming parsing.

https://www.json-to-ndjson.app/

@rvagg
Copy link
Member Author

rvagg commented Jan 12, 2023

maybe? ndjson is nice, but makes it pretty inconvenient for just loading the whole lot

I quite like streaming parsing and it might be neat if this gets huge but perhaps consumers would get annoyed they can't just dump it through a standard JSON parser as it is.

One nice thing ndjson would give us is strict enforcement of one-entry-per-line, which is what I'd like to see.

rvagg added a commit that referenced this issue Jan 30, 2023
rvagg added a commit that referenced this issue Jan 30, 2023
@rvagg rvagg linked a pull request Jan 30, 2023 that will close this issue
5 tasks
rvagg added a commit that referenced this issue Jan 30, 2023
@rvagg
Copy link
Member Author

rvagg commented Jan 30, 2023

Initial proposal for some feedback: #311

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants
@BigLep @rvagg @RangerMauve @zzo38 and others