Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Why sha256 - base58 multihash is 46 characters length? #18

Open
SimonLab opened this issue Jan 23, 2019 · 3 comments
Open

Why sha256 - base58 multihash is 46 characters length? #18

SimonLab opened this issue Jan 23, 2019 · 3 comments
Labels

Comments

@SimonLab
Copy link
Member

The first step of the cid decode algorithm (https://github.com/multiformats/cid#decoding-algorithm) is:

If it is 46 characters long and starts with Qm..., it's a CIDv0. Decode it as base58btc and continue to step 2

and it was not obvious to me why the character length of the CIDv0 which is a multihash is 46.

Here is my understanding of the format of the multihash:

Multihash Format

Multihashes are bytes where:

  • the first part represent which hash function is used.
  • the second part is represents the length of the hash value (ie the length of the third part)
  • the third and last part is the hash value itself

ex: 00010001 00000100 101101100 11111000 01011100 10110101

  • 00010001 represent the hash function used
  • 00000100 represent the length of the hash in bytes
  • 101101100 11111000 01011100 10110101 represents the hash

The first and second part are defined as varint (VAriable Integer). See https://github.com/multiformats/unsigned-varint and https://developers.google.com/protocol-buffers/docs/encoding#varints to understand how varints are created. In our case because there is only 1 byte for each part we can simply do a conversion from binary to decimal:

  • 00010001 is 1x2^4 + 1x2^0 = 17 which is 11 in hexadecimal (also written 0x11)
  • 00000100 is 1x2^2 = 4 which is also 4 in hexadecimal (0x4)

Then with the following table we can check the representation:
https://github.com/multiformats/multicodec/blob/master/table.csv
image

So we can deduce that the multihash 00010001 00000100 101101100 11111000 01011100 10110101 is a sha1 and the hash value length is 4 bytes or 4x8 = 32 bits
sha1 normally returns a hash of 20 bytes (160 bits) but for this example we reduced the length to 4 bytes

Why CIDv0 is 46 characters length?

Now that we understand the format of a multihash we can calculate the total number of bytes in a CIDv0.
a CIDv0 is a multihash where the hash function is sha2-256 (see https://github.com/multiformats/cid#versions).

From the table above we know that sha2-256 is represented with 0x12 which is 1x16^1 + 2x16^0 = 18. From the definition of varint (see links above) we have 18 < 127 so the hash function used (first part) can be encoded with 1 byte

From the definition of sha2-256 (https://en.wikipedia.org/wiki/SHA-2) we know that the hash value will be 256 bits long, ie 256/8 = 32 bytes. Again 32 < 127 so the length of the hash (part 2) is represented with 1 byte

We now have:

  • 1 byte to represent the sha2-256
  • 1 byte to represent the length of the hash value
  • 32 bytes for the hash value

The the CIDv0 length will always be 34 bytes length.

The default representation of these bytes is done using base58btc. This base is build from base64 (some characters have been removed, see #2). So the multihash is represented in these based with a set of 58 characters (123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz) see https://en.wikipedia.org/wiki/Base58. So we need numbers from 0 to 57 to be able to represent this set in binary and 57 is 111001 which is 6 bits. So each digit in base58 is represented with 6 bits.

To conclude we have 34 bytes for a multihash which is 34*8= 272 bits and 272/6 = 45.33 so we need 46 characters to represent the multihash

@RobStallion @nelsonic I think this is the reason why CIDv0 are 46 length, but if you think have done a mistake especially with the encoding to base58 and spliting the multihash in group of 6bits let me know 👍

@RobStallion
Copy link
Member

@SimonLab That is super interesting and very helpful to understand. Thanks 👍

@nelsonic
Copy link
Member

@SimonLab good write-up on the length of CIDs. 👍
Have you taken a look at how other JS/Ex implementations are doing the Binary/Base16 to Base58BTC encoding/decoding?

@SimonLab
Copy link
Member Author

@nelsonic this is my next step. I had a look at https://github.com/nocursor/b58 (which use multiple type of set characters, we need just one for now) and I'm going to create a simplified version now that we have manage to create a CIDv1 (in base16) matching the one from the js-cid and ex_cid packages

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants