Implement an IPFS compatible CID function in Elixir using Multihash SHA256 #11

Open
13 of 18 tasks
nelsonic opened this issue Dec 13, 2018 · 29 comments
Labels
enhancement New feature or request epic good first issue Good for newcomers help wanted Extra attention is needed priority-2 technical

Comments

@nelsonic
Member

nelsonic commented Dec 13, 2018

This issue/epic is dedicated exclusively to How i.e. implementation

For the reasoning behind Why we are writing this code, please see: #1
If anything is unclear as to Why, please comment on that issue.
(We all need to be 100% clear on Why we are doing this work; if you aren't, ask questions!)

Todo

  • Read the JavaScript Implementation of CID: https://github.com/multiformats/js-cid
    to understand how it works. If you have questions, please ask them in: https://github.com/dwyl/learn-ipfs/issues

    • try out any examples in the docs on localhost and try to see if re-ordering elements in an Object or nested Array produces a different CID.
    • Document the process of trying out the JS cid implementation.
      (i.e. this is for everyone to learn, not just the one person!)
  • Read the "not working" Elixir version: https://github.com/nocursor/ex-cid

    • See if it's "salvageable" or if we need to write our own implementation from scratch.

    (obviously my preference is to use a module someone else has invested time in,
    to avoid "fragmenting" the ecosystem, but if the author does not respond to issues requesting help, then I am not left with much choice ...
    😢)

  • Implement an "offline" version of CID in Elixir that produces the exact same CID as the JS version.

    • cid of a String should always be the same for a given string.
    • cid of a Map should work regardless of the order of content.

    The way I did this was to order the keys of the Map but we need to check the JS implementation to ensure that's how they have done it.

    • What other types of content do we want/need to create CIDs for? (can we add them later?!)

MVP CID v1

For our MVP we only need a sha2-256 hash encoded in Base58BTC, which is URL-safe.
For this we can use the code from https://github.com/multiformats/ex_multihash (which is maintained) and https://github.com/nocursor/b58 (which is unresponsive) respectively.

  • Write comprehensive doctests that demonstrate that the code works as expected.
  • Create beginner-friendly examples. (we can split this out into separate repos later!)
  • Re-org the Readme of this repo so the "theory" & reasoning is lower down
    • and the "get started in 1 minute" is at the top.
  • Publish to Hex.pm :shipit:
  • PR to https://github.com/multiformats/multibase

@RobStallion I apologise for using the expression "deep dive" to describe this quest.
I sometimes forget that certain expressions aren't widely used outside of the tech community. 😕
(and sadly, we don't yet have a comprehensive glossary of common tech expressions/terms ...)
https://en.wikipedia.org/wiki/Deep_diving just means "diving to a depth beyond the norm".
i.e. diving into the code/examples/docs beyond the superficial.
We need to deeply understand everything about IPFS CIDs so that we can create a compatible Elixir-lang implementation that we can use to create cids for all our content.
Deeply understanding the JS/Go-lang CID code requires "Deep Work" youtu.be/ZD7dXfdDPfg
This is a prereq to writing a CMS, a Learning Platform and a Time/Task Management app!
i.e. this is a "foundational piece" for the entire future of @dwyl (#NoPressure...)

Relevant Reading

Help Wanted

We really need help on getting this package built, documented and shipped so we can move forward with our "stack" dwyl/technology-stack#67 and "roadmap" https://github.com/dwyl/product-roadmap
If you have the curiosity, energy and time to help, please comment below! (Thanks!)

@nelsonic nelsonic added enhancement New feature or request help wanted Extra attention is needed good first issue Good for newcomers technical priority-1 epic labels Jan 13, 2019
@iteles iteles added the BLOCKED Core team's HIGHEST priority, blocking critical work label Jan 18, 2019
@RobStallion RobStallion self-assigned this Jan 21, 2019
@RobStallion
Member

RobStallion commented Jan 21, 2019

Have been looking into CIDs and IPFS. See here for all thoughts captured.

CIDs are made up of codecs and multihashes.

Multihashes themselves are self describing hashes (e.g. they contain information about the hashing algorithm that was used to hash the data). See this comment for an example of a multihash in elixir.
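To make "self-describing" concrete, here is a minimal hand-rolled sketch of the sha2-256 multihash layout. It does not use ex_multihash; the module name MultihashSketch and its functions are made up for illustration. The only facts assumed are the multihash table values: 0x12 is the code for sha2-256 and its digest is 32 (0x20) bytes long.

```elixir
defmodule MultihashSketch do
  # Multihash layout: <hash function code><digest length><digest bytes>.
  # 0x12 is the multihash code for sha2-256 (hypothetical module, not ex_multihash).
  @sha2_256_code 0x12

  # Hash the data and prepend the two describing bytes.
  def encode_sha2_256(data) when is_binary(data) do
    digest = :crypto.hash(:sha256, data)
    <<@sha2_256_code, byte_size(digest)>> <> digest
  end

  # Read the prefix back to recover the digest, or reject other layouts.
  def decode_sha2_256(<<@sha2_256_code, len, digest::binary-size(len)>>), do: {:ok, digest}
  def decode_sha2_256(_other), do: {:error, :not_sha2_256}
end

mh = MultihashSketch.encode_sha2_256("Hello World\n")
# The first two bytes describe the hash: 18 (0x12, sha2-256) and 32 (digest length).
IO.inspect(binary_part(mh, 0, 2))
```

Anyone reading the first two bytes knows which hash function produced the remaining 32, which is the whole point of the format.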

In order to create our own CIDs it appears we need to be able to create a multihash (something that we have been able to do with ex_multihash) and a codec.

Looking into codecs now to get a better understanding of what exactly they are. They appear to have a similar role in a CID as the hash_type in a multihash.

@RobStallion
Member

RobStallion commented Jan 22, 2019

I have been able to recreate the steps listed in this article. This shows that I can at least create the same hashes as CIDv0 myself using the command line.

I do not fully understand what additional data IPFS is adding to the data that you store on it...

$ echo "Hello World" | ipfs add -n
added QmWATWQ7fVPP2EFGu71UkfnqhYXDYH566qy47CnJDgvs8u QmWATWQ7fVPP2EFGu71UkfnqhYXDYH566qy47CnJDgvs8u

$ ipfs block get QmWATWQ7fVPP2EFGu71UkfnqhYXDYH566qy47CnJDgvs8u | sed -n l
$
\022\b\002\022\fHello World$
\030\f$

As you can see above, I have just added Hello World text to a file in IPFS but when I log the file from IPFS it shows more than just Hello World now.

However, I do not think that we need to fully understand exactly what extra data IPFS is adding right now. We should (hopefully) be able to use a library that will handle this step for us (otherwise we are just reimplementing the IPFS logic ourselves which doesn't seem logical)

The code above only relates to CIDv0.

CIDv1
<mb><version><mc><mh>

CIDv0
<mh>

mb = multibase prefix
version = CID version
mc = multicodec-packed-content-type
mh = multihash-content-address

CIDv0 is only the last part of CIDv1.
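As a rough sketch of that layout in bytes (not taken from ex_cid or js-cid): the module name CidLayoutSketch is hypothetical, and it assumes the single-byte codes 0x01 for CID version 1, 0x55 for the "raw" multicodec and 0x12 for sha2-256. The <mb> prefix is not part of the binary; it only appears once the bytes are turned into a string.

```elixir
defmodule CidLayoutSketch do
  # Assumed single-byte codes (hypothetical module, for illustration only).
  @cid_v1 0x01      # <version>
  @raw_codec 0x55   # <mc>: multicodec code for "raw"
  @sha2_256 0x12    # multihash code for sha2-256

  # <mh>: hash code, digest length, digest.
  def multihash(block) do
    digest = :crypto.hash(:sha256, block)
    <<@sha2_256, byte_size(digest)>> <> digest
  end

  # CIDv0 is just the multihash of the block that was hashed.
  def v0_bytes(block), do: multihash(block)

  # CIDv1 puts <version><mc> in front of the same multihash.
  def v1_bytes(block), do: <<@cid_v1, @raw_codec>> <> multihash(block)
end

# Prints the CIDv1 bytes as hex; the first four bytes are 01 55 12 20
# (version, raw codec, sha2-256, 32-byte digest length).
"Hello World\n"
|> CidLayoutSketch.v1_bytes()
|> Base.encode16(case: :lower)
|> IO.puts()
```

Dropping the first two bytes of the v1 binary leaves exactly the v0 bytes for the same block, which is what "CIDv0 is only the last part of CIDv1" means at the byte level.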

Next step

Try to recreate the steps to create a CIDv0 hash taken in the command line, but in elixir.

@RobStallion
Member

RobStallion commented Jan 22, 2019

At the moment I want to create the same hash as the one from the command line.

e.g. the result of...

:crypto.hash(:sha256, "Hello World\n") |> Base.encode16(case: :lower)

should be the same as

echo "Hello World" | shasum -a 256
d2a84f4b8b650937ec8f73cd8be2c74add5a911ba64df27458ed8229da804a26

This is not the same hash that IPFS creates but if we can match a more simple sha256 hash that would be a good first step.
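One detail that matters when comparing the two: echo appends a trailing newline, so shasum hashes "Hello World\n" rather than "Hello World", and :crypto.hash/2 returns raw bytes that need hex-encoding before they can be compared with shasum output. A small sketch of the comparison (the variable names are only illustrative):

```elixir
# Hex-encode a sha256 digest so it can be compared with shasum output.
hex = fn data -> :crypto.hash(:sha256, data) |> Base.encode16(case: :lower) end

with_newline = hex.("Hello World\n")   # the bytes `echo "Hello World"` produces
without_newline = hex.("Hello World")  # the same text without the newline

IO.puts(with_newline)
# The two digests differ, so the newline has to be accounted for.
IO.inspect(with_newline == without_newline)
```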

@RobStallion
Member

file = File.read!("hello.txt")

:crypto.hash(:sha256, file)
|> Base.encode16(case: :lower)
|> IO.inspect()

"d2a84f4b8b650937ec8f73cd8be2c74add5a911ba64df27458ed8229da804a26"

This is the same as the output of the sha256 command (shasum -a 256) on the command line. Now I know that I can reliably get the same hash string using elixir as I can in the terminal.

@nelsonic
Member Author

@RobStallion CID v0 is irrelevant to us at this stage. We will not use it.
The only reason it still exists is for backward compatibility reasons for existing CIDs.
Since we are only creating new CIDs in our apps and not decoding any CIDs that have already been put on IPFS, we do not need v0 compatibility for the foreseeable future.

@nelsonic nelsonic mentioned this issue Jan 22, 2019
1 task
@RobStallion
Member

I thought I would focus on CIDv0 as it is contained in CIDv1...

CIDv1
<mb><version><mc><mh>

CIDv0
<mh>

Also, I have installed IPFS locally and the hash string that I am getting back at the moment is still CIDv0

I do want to get v1 working as it is the 'future', but I felt that work on v0 would be needed no matter what.

Have I misunderstood this @nelsonic?

@nelsonic
Member Author

@RobStallion provided the <mh> (multihash) that is used is the one we need then, yes. 🥇
We only need sha2-256 (which I have added above) ... thanks for reminding me. 👍

@RobStallion
Member

iex(1)> codec = "dag-pb" # the codec needed to hash CIDv0
"dag-pb"

iex(2)> file = File.read!("hello.txt") # reading the file with the text of Hello World (same file that was uploaded to IPFS)
"Hello World\n"

iex(3)> digest = :crypto.hash(:sha256, file) # hash the file string with sha256
<<210, 168, 79, 75, 139, 101, 9, 55, 236, 143, 115, 205, 139, 226, 199, 74, 221,
  90, 145, 27, 166, 77, 242, 116, 88, 237, 130, 41, 218, 128, 74, 38>>

iex(4)> {:ok, multihash} = Multihash.encode(:sha2_256, digest) # create multihash from hash (see #8 for more info on this step)
{:ok,
 <<18, 32, 210, 168, 79, 75, 139, 101, 9, 55, 236, 143, 115, 205, 139, 226, 199,
   74, 221, 90, 145, 27, 166, 77, 242, 116, 88, 237, 130, 41, 218, 128, 74,
   38>>}

iex(5)> cid = CID.cid!(multihash, codec, 0) # creates a CID struct (just simple struct creation. Something everyone has done 1000 times)
%CID{
  codec: "dag-pb",
  multihash: <<18, 32, 210, 168, 79, 75, 139, 101, 9, 55, 236, 143, 115, 205,
    139, 226, 199, 74, 221, 90, 145, 27, 166, 77, 242, 116, 88, 237, 130, 41,
    218, 128, 74, 38>>,
  version: 0
}

iex(6)> CID.encode!(cid) # turns the CID struct into a base58 string (this is where the magic is happening)
"QmcWyBPyedDzHFytTX6CAjjpvqQAyhzURziwiBKDKgqx6R"

Using this online tool I have converted the base58 string to base16.

1220d2a84f4b8b650937ec8f73cd8be2c74add5a911ba64df27458ed8229da804a26

As you can see, this matches my digest from IPFS. This means that the above functions are working as expected.

Next steps

Most of the above is pretty straightforward to understand. Need to look into the CID.encode function to get a better understanding of what is happening here and how it works.

@nelsonic
Member Author

@RobStallion #progress 🎉 (keep up the good work!)

If you are able to push some of the code on your branch it would be amaze! 🌶
Thanks! ✨

@RobStallion
Member

1220d2a84f4b8b650937ec8f73cd8be2c74add5a911ba64df27458ed8229da804a26

As you can see, this matches my digest from IPFS. This means that the above functions are working as expected.

This line was a mistake. It is not the same as the one from IPFS. It is the same as the hash of the file that was created in the terminal here and in iex here.

What this means (to me at least) is that all the CID.encode function does (for CIDv0) is take the multihash and turn it into a base58 string.

That is literally it!!!

This can be done with the following lines of code...

defmodule CidTester do
  def read_file(str), do: File.read!(str)

  def hash(file), do: :crypto.hash(:sha256, file)

  def multihash(digest), do: Multihash.encode(:sha2_256, digest)

  def encode({:ok, multihash}), do: Base.encode16(multihash, case: :lower)

  def run(filename) do
    filename
    |> read_file()
    |> hash()
    |> multihash()
    |> encode()
  end
end

Then run iex -S mix and call the CidTester.run/1 function with the filename...

iex(1)> CidTester.run("hello.txt")
"1220d2a84f4b8b650937ec8f73cd8be2c74add5a911ba64df27458ed8229da804a26" 

as you can see this is the same as calling the CID.encode!/1 function with a CID struct...
(following block is a snippet from here)

iex(6)> CID.encode!(cid)
"QmcWyBPyedDzHFytTX6CAjjpvqQAyhzURziwiBKDKgqx6R" 

Same as "1220d2a84f4b8b650937ec8f73cd8be2c74add5a911ba64df27458ed8229da804a26" when converted to base16.
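The base58 step itself is small. Below is a minimal sketch of a Base58BTC encoder using only the standard library; it is not the code from ex_cid or the b58 package, and the module name Base58Sketch is hypothetical. It treats the binary as one big unsigned integer, repeatedly divides by 58, and writes each leading zero byte as "1".

```elixir
defmodule Base58Sketch do
  # Bitcoin base58 alphabet: no 0, O, I or l (hypothetical module, not the b58 package).
  @alphabet "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"

  def encode(bin) when is_binary(bin) do
    # Leading zero bytes carry no numeric value, so each is written as "1".
    zeros = count_leading_zeros(bin, 0)
    body = bin |> :binary.decode_unsigned() |> digits([])
    String.duplicate("1", zeros) <> List.to_string(body)
  end

  defp count_leading_zeros(<<0, rest::binary>>, n), do: count_leading_zeros(rest, n + 1)
  defp count_leading_zeros(_bin, n), do: n

  # Repeated division by 58; remainders come out least significant first,
  # so each new digit is prepended to the accumulator.
  defp digits(0, acc), do: acc
  defp digits(n, acc), do: digits(div(n, 58), [:binary.at(@alphabet, rem(n, 58)) | acc])
end

# sha2-256 multihash (0x12, 0x20, digest) of the same file contents as above.
multihash = <<0x12, 0x20>> <> :crypto.hash(:sha256, "Hello World\n")
IO.puts(Base58Sketch.encode(multihash))
```

If the sketch is right, the printed string should be the same 46-character "Qm..." value that CID.encode!/1 returned for the version 0 struct, since both are just the base58 form of the multihash.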

The ex_cid module is not returning the same CID values as IPFS. It is only returning the multihash as a base58 string (for version 0 CIDs).

This does not mean that the ex_cid module is not working however. I have spoken to @SimonLab and he has shown that js-cid produces the same cid string.

This really confused me as both modules are producing the same string (which is just a base58 string of a multihash) and that string is not the same as the one from IPFS. This seems to be because these modules are not adding the extra data that IPFS adds to a file when it is added to IPFS. For example...

this is the hello text file on my local machine...

$ sed -n l hello.txt
Hello World$

Next, I'll add this file to IPFS...

$ ipfs add -n hello.txt
added QmWATWQ7fVPP2EFGu71UkfnqhYXDYH566qy47CnJDgvs8u hello.txt

Now if we run the same sed command on the IPFS file we see that there is more info than the one on my machine...

$ ipfs block get QmWATWQ7fVPP2EFGu71UkfnqhYXDYH566qy47CnJDgvs8u | sed -n l
$
\022\b\002\022\fHello World$
\030\f$

I think that the difference in the CIDs is coming from this extra data that IPFS is adding to the file.

This means that as it currently stands, the CIDs that these modules are creating can not be used to get data from IPFS as they will not be the correct CID for the data that is on IPFS.

to put it simply "QmcWyBPyedDzHFytTX6CAjjpvqQAyhzURziwiBKDKgqx6R" from the CID module is not the same as "QmWATWQ7fVPP2EFGu71UkfnqhYXDYH566qy47CnJDgvs8u" from IPFS despite the same file being passed in to both.

The CIDs that these modules make can be used in our projects and will always produce the same CID for the same data that is passed in. We just cannot integrate them into IPFS right now as they will not be able to retrieve that same data from IPFS (if my understanding is correct).

After speaking with @SimonLab about this problem, he came across this, https://github.com/ipfs/ipfs#protocol-implementations.

This seems to be the missing step. I haven't had much of a chance to look into this yet, but on a brief look it says to raise an issue if you want to implement this in a specific language. I looked at the issues and the only issue I saw with a mention of elixir is issue83. This issue has a link to the following repo, https://github.com/tensor-programming/Elixir-Ipfs-Api.

I will begin looking into this 'missing step' in more detail.

@nelsonic @SimonLab do either of you have any thoughts on this (sorry for the SUPER long comment. Hopefully it makes sense)

@nelsonic I believe that in order for us to be able to complete this issue (Implement an IPFS compatible CID function in Elixir) we will need to include this step

SimonLab added a commit that referenced this issue Jan 23, 2019
@nelsonic
Member Author

nelsonic commented Jan 23, 2019

@RobStallion this comment makes sense. 👍 (thanks for adding this detail)
Please formulate this question on StackOverflow so that
(a) we confirm our own understanding and
(b) we can seek help from the IPFS/JS community.
Thanks. ✨

@RobStallion
Member

ipfs/go-cid#77. Someone has had this issue in go.

I have confirmed that I can get a matching CID using ex_cid when the cid is v1 and the codec is "raw".

$ ipfs add --cid-version=1 hello.txt
added zb2rhkpbfTBtUV1ESqSScrUre8Hh77fhCKDLmX21rCo5xp8J9 hello.txt

Now in iex

iex(1)> file = File.read!("hello.txt")
"Hello World\n"
iex(2)> digest = :crypto.hash(:sha256, file)
<<210, 168, 79, 75, 139, 101, 9, 55, 236, 143, 115, 205, 139, 226, 199, 74, 221,
  90, 145, 27, 166, 77, 242, 116, 88, 237, 130, 41, 218, 128, 74, 38>>
iex(3)> {:ok, multihash} = Multihash.encode(:sha2_256, digest)
{:ok,
 <<18, 32, 210, 168, 79, 75, 139, 101, 9, 55, 236, 143, 115, 205, 139, 226, 199,
   74, 221, 90, 145, 27, 166, 77, 242, 116, 88, 237, 130, 41, 218, 128, 74,
   38>>}
iex(4)> cid = CID.cid!(multihash, "raw", 1)
%CID{
  codec: "raw",
  multihash: <<18, 32, 210, 168, 79, 75, 139, 101, 9, 55, 236, 143, 115, 205,
    139, 226, 199, 74, 221, 90, 145, 27, 166, 77, 242, 116, 88, 237, 130, 41,
    218, 128, 74, 38>>,
  version: 1
}
iex(5)> CID.encode cid
{:ok, "zb2rhkpbfTBtUV1ESqSScrUre8Hh77fhCKDLmX21rCo5xp8J9"}

As you can see, the two CIDs created match (for sure this time 🙄🤦‍♀️)

zb2rhkpbfTBtUV1ESqSScrUre8Hh77fhCKDLmX21rCo5xp8J9
zb2rhkpbfTBtUV1ESqSScrUre8Hh77fhCKDLmX21rCo5xp8J9

This IS a step in the right direction but is not a solution. This will not work for all files. It will only work for files that fit in a single block, i.e. files no larger than IPFS's default chunk size (256 KiB).

Let's repeat the steps above with a larger file...

$ ipfs add --cid-version=1 elm-slides.pdf
added zdj7We6WnfhRq5zmJZDeMKdKmS2z8fEPrUSneapijtnQYzYpm elm-slides.pdf
 1.11 MiB / 1.11 MiB [===========================================================] 100.00%

As you can see this file is 1.11MiB. When we repeat the steps with ex_cid with this file...

iex(1)> file = File.read!("elm-slides.pdf")
<<37, 80, 68, 70, 45, 49, 46, 55, 13, 10, 37, 161, 179, 197, 215, 13, 10, 49,
  32, 48, 32, 111, 98, 106, 13, 10, 60, 60, 47, 80, 97, 103, 101, 115, 32, 50,
  32, 48, 32, 82, 32, 47, 84, 121, 112, 101, 47, 67, 97, 116, ...>>
iex(2)> digest = :crypto.hash(:sha256, file)
<<80, 53, 122, 165, 21, 149, 132, 189, 86, 141, 57, 245, 185, 240, 119, 254,
  217, 210, 49, 37, 225, 87, 43, 153, 79, 135, 166, 115, 82, 144, 54, 51>>
iex(3)> {:ok, multihash} = Multihash.encode(:sha2_256, digest)
{:ok,
 <<18, 32, 80, 53, 122, 165, 21, 149, 132, 189, 86, 141, 57, 245, 185, 240, 119,
   254, 217, 210, 49, 37, 225, 87, 43, 153, 79, 135, 166, 115, 82, 144, 54,
   51>>}
iex(4)> cid = CID.cid!(multihash, "raw", 1)
%CID{
  codec: "raw",
  multihash: <<18, 32, 80, 53, 122, 165, 21, 149, 132, 189, 86, 141, 57, 245,
    185, 240, 119, 254, 217, 210, 49, 37, 225, 87, 43, 153, 79, 135, 166, 115,
    82, 144, 54, 51>>,
  version: 1
}
iex(5)> CID.encode cid
{:ok, "zb2rhc3P77eryPttouAgYrzwuByVmkDSrLRt1UciwUmWmUzCS"}

You can see that the 2 CIDs do not match...

zdj7We6WnfhRq5zmJZDeMKdKmS2z8fEPrUSneapijtnQYzYpm
zb2rhc3P77eryPttouAgYrzwuByVmkDSrLRt1UciwUmWmUzCS
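The different prefix is a hint at what happened: zb2rh... is what a CIDv1 with the "raw" codec looks like in base58btc, while zdj7W... is the "dag-pb" form, which is what IPFS uses for the root node once a file has been split into several chunks. A rough sketch of a guard for the simple case follows. It assumes the default ipfs add chunker of 262144 bytes (256 KiB); the module SingleBlockSketch and its function names are hypothetical.

```elixir
defmodule SingleBlockSketch do
  # Assumed default `ipfs add` chunk size: 262144 bytes (256 KiB).
  @default_chunk_size 262_144

  # True when the data would fit in one block under the assumed chunk size.
  def single_block?(data) when is_binary(data), do: byte_size(data) <= @default_chunk_size

  # Only hash the raw bytes directly when they fit in one block; anything
  # larger would need IPFS-style chunking, which this sketch does not do.
  def raw_multihash(data) when is_binary(data) do
    if single_block?(data) do
      {:ok, <<0x12, 0x20>> <> :crypto.hash(:sha256, data)}
    else
      {:error, :needs_chunking}
    end
  end
end

IO.inspect(SingleBlockSketch.single_block?("Hello World\n"))
```

Returning an error for larger inputs is a deliberate choice here: it is better to refuse than to hand back a CID that IPFS would never produce for the same file.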

@nelsonic
Member Author

@RobStallion it's good that you are being thorough with your investigation,
but please note that we will not be hashing files (yet) only hashing Ecto Changesets i.e. Elixir Maps
in order to generate the CID for a record before inserting it into the database.

We can return to the "large file" quest later or even write a Node.js/Go microservice on AWS lambda to do our file uploads e.g: uploading images. For now we literally only need the most basic CID such that a map of %User{ name: "Rob", username: "robdabank"} will create a valid CID so we can insert the data.

@RobStallion
Member

It seems that when we upload a small file to IPFS in version1 with the "raw" codec it doesn't manipulate the data. This can be seen with the following...

$ ipfs add --cid-version=1 hello.txt         # add hello.txt to ipfs
added zb2rhkpbfTBtUV1ESqSScrUre8Hh77fhCKDLmX21rCo5xp8J9 hello.txt

$ sed -n l hello.txt               # print contents of hello.txt
Hello World$

$ ipfs block get zb2rhkpbfTBtUV1ESqSScrUre8Hh77fhCKDLmX21rCo5xp8J9 | sed -n l         # print contents of file from ipfs
Hello World$

As you can see, when we retrieve the file from IPFS and log the data, it hasn't added anything new to it like it did when we did this with v0 (see this comment for example).

@RobStallion
Member

This is now the same as IPFS

zb2rhbYzyUJP6euwn89vAstfgG2Au9BSwkFGUJkbujWztZWjZ
{:ok, "zb2rhbYzyUJP6euwn89vAstfgG2Au9BSwkFGUJkbujWztZWjZ"}

The only difference from the first elixir attempt is that the json variable there did not have a new line, "\n", on the end, whereas the file that was added to IPFS did.

first elixir attempt

iex(2)> json = Jason.encode!(map)
"{\"a\":\"a\"}"

second attempt

iex(14)> file = File.read!("json.txt")
"{\"a\":\"a\"}\n"

We should easily be able to fix this by just appending a new line to the end of a JSON object in elixir.
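Because the CID depends on the exact bytes being hashed, key order and the trailing newline both have to be pinned down. Below is a minimal sketch of one way to do that for flat maps, sorting the keys as suggested in the issue description (whether js-cid does the same is still to be checked). It deliberately avoids Jason so that it is self-contained; CanonicalSketch and its functions are hypothetical, and the quoting only handles backslashes and double quotes.

```elixir
defmodule CanonicalSketch do
  # Hypothetical helper: flat maps with atom or string keys and string values only.
  def to_json_line(map) when is_map(map) do
    body =
      map
      |> Enum.map(fn {k, v} -> {to_string(k), v} end)
      # Sort by key so the output does not depend on how the map was built.
      |> Enum.sort()
      |> Enum.map(fn {k, v} -> quote_str(k) <> ":" <> quote_str(v) end)
      |> Enum.join(",")

    # Trailing newline, to match a file created in the terminal.
    "{" <> body <> "}\n"
  end

  # Escapes only backslash and double quote; a real JSON encoder handles far more.
  defp quote_str(s) when is_binary(s) do
    escaped = s |> String.replace("\\", "\\\\") |> String.replace("\"", "\\\"")
    "\"" <> escaped <> "\""
  end

  # Hex digest of the canonical line, ready to be wrapped in a multihash.
  def digest_hex(map) do
    :crypto.hash(:sha256, to_json_line(map)) |> Base.encode16(case: :lower)
  end
end

IO.inspect(CanonicalSketch.to_json_line(%{a: "a"}))
```

With this, %{b: "2", a: "1"} and %{"a" => "1", "b" => "2"} produce the same line and therefore the same digest.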

@RobStallion
Copy link
Member

RobStallion commented Jan 24, 2019

elixir implementation adding new line to end of json...

iex(1)> map = %{a: "a"}
%{a: "a"}
iex(2)> json = Jason.encode!(map)
"{\"a\":\"a\"}"
iex(3)> json = json <> "\n"
"{\"a\":\"a\"}\n"
iex(4)> digest = :crypto.hash(:sha256, json)
<<72, 240, 49, 34, 62, 109, 19, 11, 162, 226, 162, 167, 139, 145, 12, 84, 241,
  135, 103, 97, 197, 136, 212, 17, 101, 6, 242, 208, 82, 81, 176, 200>>
iex(5)> {:ok, multihash} = Multihash.encode(:sha2_256, digest)
{:ok,
 <<18, 32, 72, 240, 49, 34, 62, 109, 19, 11, 162, 226, 162, 167, 139, 145, 12,
   84, 241, 135, 103, 97, 197, 136, 212, 17, 101, 6, 242, 208, 82, 81, 176,
   200>>}
iex(6)> cid = CID.cid!(multihash, "raw", 1)
%CID{
  codec: "raw",
  multihash: <<18, 32, 72, 240, 49, 34, 62, 109, 19, 11, 162, 226, 162, 167,
    139, 145, 12, 84, 241, 135, 103, 97, 197, 136, 212, 17, 101, 6, 242, 208,
    82, 81, 176, 200>>,
  version: 1
}
iex(7)> CID.encode(cid)
{:ok, "zb2rhbYzyUJP6euwn89vAstfgG2Au9BSwkFGUJkbujWztZWjZ"}

Same as IPFS again.

I would say that this is working reliably now.

Will test with #19 when @SimonLab is ready 👍

@RobStallion
Member

@nelsonic

Do you think that the following points have been covered...

Write comprehensive doctests that demonstrate that the code works as expected.

Create beginner-friendly examples. (we can split this out into separate repos later!)

If so can you check them off in the acceptance criteria please?

@RobStallion
Member

RobStallion commented Jan 31, 2019

Going to work on the following points from the acceptance criteria...

  • Re-org the Readme of this repo so the "theory" & reasoning is lower down
  • and the "get started in 1 minute" is at the top.

estimate t25m.

@RobStallion RobStallion added the question Further information is requested label Jan 31, 2019
@RobStallion RobStallion removed the question Further information is requested label Jan 31, 2019
@nelsonic
Member Author

@RobStallion doctests are good. ✅
beginner-friendly example: dwyl/phoenix-ecto-append-only-log-example#22
please proceed.
Thanks!

@nelsonic nelsonic removed BLOCKED Core team's HIGHEST priority, blocking critical work priority-1 labels Jan 31, 2019
nelsonic added a commit that referenced this issue Feb 1, 2019
reorder readme. Updates how section for adding package to app #11