
feat: assign codes for MIME types #159

Open · wants to merge 5 commits into base: master
Conversation

@Stebalien (Member) commented Feb 1, 2020

And add a script to automate this.

fixes #4

Questions:

  • Names: Currently I've left the MIME types as plain names. We might want to use a mime/ prefix, as in feat: adding MIME types as codecs #84.
  • Range: I've allocated quite a large range (0.4% of the 4-byte range). Most of this is reserved.

@vmx (Member) left a comment

The import script is not future-proof. If I understand it correctly, then the script parses the XML file and assigns numbers to the codecs based on the order in the XML file. They are sorted alphabetically in the XML file, so if a new mimetype is added and you re-run the script, you would end up with different codes.

Some <record>s have a date attribute; it looks like it was introduced in 2014. Hence I propose sorting the records by date attribute first (the ones without a date first) and, if they have the same date, alphabetically by <file>. This way we hopefully end up with a future-proof, reproducible assignment of codes.
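The proposed ordering could be sketched as follows (a sketch only; the `date` and `file` keys mirror the registry's XML attributes, and an empty string makes undated records sort first):

```python
def stable_order(records):
    # records: dicts parsed from the registry XML, each with a "file"
    # name and an optional "date" attribute (introduced around 2014).
    # Undated records sort first ("" < any date string); ties break
    # alphabetically on "file", so re-running always gives one order.
    return sorted(records, key=lambda r: (r.get("date", ""), r["file"]))
```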

I'd also like to see more explanatory text about why we include MIME types, and a note that Multicodecs are not MIME types. If I understand things correctly, in an ideal Multicodec-only world all those +json MIME types would just be Multicodec JSON 0x0200.

@vmx vmx requested a review from mikeal February 3, 2020 15:42
@mikeal (Contributor) commented Feb 3, 2020

I think we should factor out consumption of the entire multiformat table in our multiformat clients before we drastically increase the size of the table. The parse time and table size are already noticeable in our bundle size, and this would make that much worse. My older HTTP clients include a full MIME database, and it makes them unsuitable for browser bundles. Luckily, we already have a tentative plan to move to using integer references that won't require the full table in code.

@Stebalien (Member, Author)

The import script is not future-proof. If I understand it correctly, then the script parses the XML file and assigns numbers to the codecs based on the order in the XML file. They are sorted alphabetically in the XML file, so if a new mimetype is added and you re-run the script, you would end up with different codes.

Unless there's a bug, it first loads the already-assigned numbers. Then, for all new mime types, it assigns increasing (unique) codes.
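That "never renumber" behaviour could look roughly like this (a sketch, not the actual script; the names and the starting code are illustrative):

```python
def assign_codes(existing, new_names, start):
    # existing: name -> code pairs already in table.csv; never changed.
    # new_names: MIME types parsed from the registry, in some order.
    assigned = dict(existing)
    next_code = max(assigned.values(), default=start - 1) + 1
    for name in new_names:
        if name not in assigned:
            assigned[name] = next_code  # new entries get the next free code
            next_code += 1
    return assigned
```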

Some <record>s have a date attribute; it looks like it was introduced in 2014. Hence I propose sorting the records by date attribute first (the ones without a date first) and, if they have the same date, alphabetically by <file>. This way we hopefully end up with a future-proof, reproducible assignment of codes.

Sounds like a good starting point. I'll keep the current "don't change things" logic but having a stable conversion would be nice.


@mikeal Hm. Fair point. Even compressed, the table is going to grow from 3K to 22K. Or 26K to 260K uncompressed.


I'm fine leaving this in a PR for now. I submitted it because we kept getting requests to do something like this.

@vmx (Member) commented Feb 4, 2020

Unless there's a bug, it first loads the already-assigned numbers. Then, for all new mime types, it assigns increasing (unique) codes.

Oh, I missed that when I read the code. Though I'm still in favour of a stable conversion that can be run at any time and always produces the same output.

table.csv Outdated
libp2p-peer-record, libp2p, 0x0301, libp2p peer record type
x11, multihash, 0x1100,
sm3-256, multihash, 0x00534d,
blake2b-8, multihash, 0x00b201, Blake2b consists of 64 output lengths that give different hashes
Member:

The hashes are prefixed with an additional 0x00. Is that intentional or a bug (the MIME types as well)?

Member Author:

Yes. I'm trying to make it clear how many bytes are in each code. Every two hex digits is a single byte.

Member:

Sorry, I don't get it. What do you mean by "how many bytes"? Do you mean how many bytes it takes when encoded as a varint?

Member Author:

Yes.

Member:

Shouldn't values over 0x80 then also have a 0x00 prefix?

Member Author:

Hm. Yes.

Member Author:

Fixed.
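The padding convention discussed above can be reproduced with a small sketch (unsigned varints carry 7 payload bits per byte, so e.g. 0x534d needs three varint bytes and is written as 0x00534d):

```python
def varint_len(n: int) -> int:
    # An unsigned varint (LEB128-style) stores 7 payload bits per byte.
    length = 1
    while n >= 0x80:
        n >>= 7
        length += 1
    return length

def padded_hex(code: int) -> str:
    # Left-pad with 0x00 so every two hex digits correspond to one
    # byte of the code's varint encoding, matching the table entries.
    return "0x" + format(code, "0{}x".format(2 * varint_len(code)))
```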

@vmx (Member) commented Feb 12, 2020

I've put making this stable on my todo list, but I don't know when I'll find the time to do it. If merging this becomes urgent, please let me know and I'll prioritize it.

@pierogitus

A downside of a 4-byte range is that it makes a base32 sha256 CID 64 characters, which doesn't fit in a DNS segment (labels are limited to 63 characters). A 3-byte range would work, and with a single range instead of sub-ranges for each major type it wouldn't reserve too much of that space.
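The arithmetic behind that claim can be checked (a sketch assuming a CIDv1 of version varint + codec varint + sha2-256 multihash, encoded as unpadded base32 with a one-character multibase prefix):

```python
import math

def cidv1_base32_len(codec_varint_bytes: int) -> int:
    # CIDv1 bytes: version (1) + codec varint + multihash, where a
    # sha2-256 multihash is hash code (1) + digest length (1) + digest (32).
    raw = 1 + codec_varint_bytes + 1 + 1 + 32
    # Unpadded base32 emits one character per 5 bits; add 1 for the
    # multibase prefix character ("b").
    return 1 + math.ceil(raw * 8 / 5)
```

With a 4-byte codec this gives 64 characters, just over the 63-character DNS label limit; a 3-byte codec gives 62, which fits.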

This was referenced Aug 24, 2020
ntninja added a commit to ntninja/multicodec that referenced this pull request Aug 25, 2020
rvagg pushed a commit that referenced this pull request Aug 28, 2020
See GH/#158. Changes are analogous to those proposed in GH/#159.
@CMCDragonkai

What's the status of this? I'm hoping to use CIDs to refer to image types soon.

@CMCDragonkai

I have a suggestion: perhaps instead of one big table.csv, we could have multiple tables, and use certain codes to descend into sub-address-spaces.

@CMCDragonkai commented Nov 7, 2020

Also, I noticed that image/jpeg is missing from the table, which is a very strange MIME type to omit from the CSV. image/gif is missing as well.

@CMCDragonkai

Another question: adding MIME types overlaps with existing codecs in table.csv. For example, how do we reconcile application/json with the json codec that already exists in table.csv?

@lewisl9029

Looks like this effort has been stalled for a while, mostly due to concerns around the drastic increase in table size?

The readme of the project describes a first-come, first-served policy for adding new codecs, and I wonder if we could apply that here as well with MIME types. I.e., maybe we can start with a small handful of the most commonly used MIME types on the internet today (say, this list), and then add more over time based on demand, instead of dumping in all known MIME types at once?

Is there some particular need for all the mime types to be in a contiguous block that I'm not aware of?


Successfully merging this pull request may close these issues.

Mimetypes as codes
6 participants