What are CIDs? #13

Open
SimonLab opened this issue Jan 21, 2019 · 5 comments

@SimonLab
Member

Currently the Readme focuses on explaining why CIDs are important and gives a few examples (Google, Instagram, YouTube), but the "What" section is quite short and doesn't explain in detail how CIDs can be implemented or what their format is.

I think we can follow the specification defined at https://github.com/multiformats/cid and explain in the "What" section of the Readme how CIDs work.

See also https://proto.school/#/data-structures/04

@SimonLab SimonLab self-assigned this Jan 21, 2019
@nelsonic
Member

nelsonic commented Jan 22, 2019

@SimonLab yes, the "What?" section is incomplete.
(thank you for opening this issue and starting the PR to fix it!)

We should attempt to summarise our own understanding of what a CID is in a paragraph
but we should resist the temptation to re-explain the whole of https://github.com/multiformats/cid because all they have done is detail the technical side without clarifying the human side.
We should link to it as "more detail" but we should attempt to explain it in terms a "6 year old" can understand.

Explaining a CID in terms of its components (multibase, multihash, multicodec) is like teaching a kid to tie their shoelaces using a mathematical equation:
[image: shoelace length formula]
https://www.fieggen.com/shoelace/lengthformulas.htm
(yes, there is someone who has mathematically described all the different ways to tie laces ... 🙄)

The point is this, our "What?" section needs to answer the question "What's in it for Me?"
What benefit will my application get from using CIDs? Why should I (or anyone else) bother?
Why is CID objectively better than a sequential ID or UUID for the users of my App and consequently for me as the developer of the App?

The whole point of using a CID is that the ID of the record is always the same whether the content is created by a JavaScript/Elm Web App or iOS/Android Mobile App on a Client (offline first) or a C/Go/Haskell/Rust app on a remote server. The CID is the cryptographic hash of the contents of the data. It's always the same and can be verified simply by re-running the hash using the original data.
This is exceptionally useful if you want the ability to create records in an offline-first app and then sync and verify the records when the app re-connects with a backend/API.
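To make the "same content, same ID, wherever it is computed" point concrete, here is a minimal sketch. It uses a plain SHA-256 hex digest rather than a full IPFS CID, and the helper names `content_id` and `verify` are hypothetical, not part of any cid library:

```python
import hashlib


def content_id(data: bytes) -> str:
    """Hypothetical helper: hex SHA-256 digest of the content (not a full CID)."""
    return hashlib.sha256(data).hexdigest()


def verify(data: bytes, claimed_id: str) -> bool:
    """Re-run the hash over the data and compare it with the claimed ID."""
    return content_id(data) == claimed_id


# The "client" and the "server" hash the same bytes independently.
record = "quiz 3: answer B".encode("utf-8")
id_on_client = content_id(record)
id_on_server = content_id(record)

same_everywhere = id_on_client == id_on_server          # IDs match
still_valid = verify(record, id_on_client)              # content is intact
tamper_detected = not verify(b"quiz 3: answer C", id_on_client)
```

Any language with a SHA-256 implementation would produce the same digest for the same bytes, which is what lets an offline client and a backend agree on the ID without talking to each other.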

For example: imagine you have a learning app that tracks the progress of the learner,
and that learner is in an area without network access (think rural Uganda or the London Underground). You want them to be able to continue learning offline. When they regain internet access, they can "sync" the progress they have made offline (including answers to any "quiz" questions, notes they have made on what they learned and follow-up questions they have).

Traditionally, record IDs in a distributed system are random (nondeterministic, pseudorandom), e.g. a UUID.
But that means the same content/record can be stored multiple times and will get a new ID each time. That is wasteful because it creates duplicate data, e.g.:

| inserted | entry_id (UUID) | name | address |
| --- | --- | --- | --- |
| 1541609554 | 8c700337-1e57-430a-8ba1-704bbeae9ecf | Bruce Wane | 1007 Mountain Drive, Gotham |
| 1541618643 | 68fbca94-6b96-458a-8502-b8324d098174 | Bruce Wane | 1007 Mountain Drive, Gotham |

The same data different (random) UUID >> duplicate records! 😞

This table illustrates that it's possible to insert the same record twice and get a different entry_id (i.e. UUID) each time. This is highly undesirable, as duplicate data is one of the reasons there is so much wasted hard drive space in the world! If each person in a company/organisation has a "copy" of the same Word document on their hard drive, it's easy to waste petabytes of space! Using CIDs, where the ID is the hash of the content, means that identical content always gets the same ID, so our database can simply reject an insert/update request that attempts to insert a duplicate row (or, more likely, we have a batch process that cleans out duplicate data based on matching CIDs, but you get the idea). Likewise, if Bruce Wane is editing his address and "autosave" is enabled in the app to avoid losing changes, the client-side app will know that "nothing has changed" because the CID is still the same, so there is no need to send the data to the server.
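A rough sketch of the duplicate-row behaviour described above, using an in-memory SQLite database rather than PostgreSQL. The table names `by_uuid` and `by_cid` and the `content_id` helper are made up for illustration, and the ID is a plain SHA-256 hex digest rather than a real CID:

```python
import hashlib
import sqlite3
import uuid


def content_id(name: str, address: str) -> str:
    # Hypothetical ID: SHA-256 over the record's fields (not a full CID).
    return hashlib.sha256(f"{name}\n{address}".encode("utf-8")).hexdigest()


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE by_uuid (entry_id TEXT PRIMARY KEY, name TEXT, address TEXT)")
db.execute("CREATE TABLE by_cid (entry_id TEXT PRIMARY KEY, name TEXT, address TEXT)")

record = ("Bruce Wane", "1007 Mountain Drive, Gotham")

# Random UUIDs: the same record is happily stored twice.
for _ in range(2):
    db.execute("INSERT INTO by_uuid VALUES (?, ?, ?)", (str(uuid.uuid4()), *record))

# Content-derived IDs: the second insert violates the primary key.
rejected = 0
for _ in range(2):
    try:
        db.execute("INSERT INTO by_cid VALUES (?, ?, ?)", (content_id(*record), *record))
    except sqlite3.IntegrityError:
        rejected += 1

uuid_rows = db.execute("SELECT COUNT(*) FROM by_uuid").fetchone()[0]
cid_rows = db.execute("SELECT COUNT(*) FROM by_cid").fetchone()[0]

# "Autosave" check: an unchanged form hashes to the ID we already stored.
nothing_changed = content_id("Bruce Wane", "1007 Mountain Drive, Gotham") == content_id(*record)
```

Here the UUID table ends up with two rows for the same person, while the content-ID table keeps one and rejects the repeat.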

I believe some of the "Frequently Asked Questions" about CIDs are contained in #10 along with my (attempt) at clear answers #10 (comment) ...

Sadly, (for some reason I cannot comprehend) we don't have the discipline in @dwyl to take the answers given in issues and convert them into Questions/Answers in the README.md so that new people to a repository/project don't have to scratch their heads or go hunting for the same questions/answers. I feel like I'm constantly having to spell-out how to capture knowledge: if it's in an issue it's only partially captured, all knowledge must be in an .md file so that everyone can easily find it. That's how we get Tutorials that are useful to tens of thousands of people like https://github.com/dwyl/learn-json-web-tokens

@nelsonic
Member

@SimonLab please see: #16 where the "What?" section was expanded. 🚢
Feel free to merge before or after your PR. 👍

@SimonLab
Member Author

SimonLab commented Jan 22, 2019

Thanks for the clarifications @nelsonic
I totally agree that the explanation of what a CID is should be easy to understand.

I created the issue with the intention of explaining how the IPFS CID format is created, as @RobStallion and I had difficulty understanding the structure and how to implement it. The https://github.com/multiformats/cid Readme is quite short, and I needed to research some questions and details to understand how the CID is created. So it is indeed more of a technical issue/doc; maybe this type of doc should be in another .md file so as not to "pollute" the Readme?
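For the technical side, this is roughly how the pieces of a CIDv1 fit together, as far as I understand the multiformats spec: a multibase prefix, then the CID version, the multicodec content type and the multihash. The sketch below assumes the simplest combination (the `raw` codec, sha2-256, base32) and the function name `cid_v1_raw` is my own, not from any library:

```python
import base64
import hashlib

CID_VERSION = 0x01    # CIDv1
CODEC_RAW = 0x55      # multicodec code for "raw" bytes
HASH_SHA2_256 = 0x12  # multihash code for sha2-256


def cid_v1_raw(data: bytes) -> str:
    """Sketch of a CIDv1 string for raw bytes (assumed: raw codec, sha2-256, base32)."""
    digest = hashlib.sha256(data).digest()
    # multihash = <hash function code><digest length><digest>
    multihash = bytes([HASH_SHA2_256, len(digest)]) + digest
    # All codes used here are below 128, so each varint is a single byte.
    cid_bytes = bytes([CID_VERSION, CODEC_RAW]) + multihash
    # multibase: "b" prefix means lowercase base32 without padding.
    b32 = base64.b32encode(cid_bytes).decode("ascii").lower().rstrip("=")
    return "b" + b32


cid = cid_v1_raw(b"hello world")
```

Because the first four bytes are always the same for this combination, every CID produced this way starts with `bafkrei`; the rest encodes the digest. Other codecs, hash functions or bases would need proper varint and multibase handling, which this sketch leaves out.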

Concerning your explanation, it is really clear about what features a Content Identifier can provide 👍. I would also specify that we chose the IPFS CID format so that later on (if needed) we can use the different tools provided by IPFS, especially decentralisation; otherwise someone could ask why we don't just use a simple hash function (without multibase, multihash...) to create the ID.

@RobStallion
Member

@nelsonic

With regards to the following point...

> Sadly, (for some reason I cannot comprehend) we don't have the discipline in @dwyl to take the answers given in issues and convert them into Questions/Answers in the README.md so that new people to a repository/project don't have to scratch their heads or go hunting for the same questions/answers.

I should have done this (for issue #10) 😞

I'll make sure I summarise the points and get anything that is not clear at the moment from that issue into the readme. Sorry for the delay.

@nelsonic
Member

@SimonLab indeed. The question of Why the cid should be compatible with IPFS will inevitably come up, and it's wise to answer it proactively.

Why IPFS Compatibility?

IPFS is created by several really smart people who have thought of several aspects of "futureproofing".

By making our Elixir cid function compatible with IPFS, we get several benefits:

  1. We get to use the "official" JS cid function in our client app, thus we only have to do part of the work for getting our system running and can work offline by default. 📱
  2. We get a future-proof algorithm for creating the IDs for our content that already has implementations in various languages. We avoid "re-inventing the wheel" on CIDs and instead support the ideas of a growing community. 🚀
  3. We can pick the brains of the super-friendly IPFS community if/when we get "stuck" simply by asking questions on the JS/Go implementation on GitHub! (There are 4 members of core IPFS team who are @dwyl members ... if you get stuck ask them for help!) 🌍
  4. Other people in the Elixir/IPFS community using our Elixir cid implementation will give us feedback including PRs with improvements! ❤️
  5. We can push public content (for our learning platform, blog, etc) onto IPFS, which will streamline distribution to people at the edge of the internet and is less likely to be censored. 💡

These last two are "bonus" benefits that we aren't really concerned with right now. It's good to "give back" to open source and good to be able to use the IPFS network in the future, but right now we only care about getting the Elixir cid function working so that we can use...

The "difficult" part of getting CIDs to work is removing all the noise and focussing just on what we need for "MVP". We definitely do not need full compatibility for all the features of the JS cid.

We only need a way of creating a CID for a String and a Map so that we can use it to create a CID for a record before inserting it into PostgreSQL.
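As a sketch of what "a CID for a String and a Map" implies: a Map has no fixed key order, so it has to be serialised in a canonical way before hashing, otherwise the same record could get two different IDs. The example below assumes JSON with sorted keys as the canonical form and a plain SHA-256 hex digest as the ID; both are illustrative choices, and `canonical_bytes` and `content_id` are hypothetical names:

```python
import hashlib
import json
from typing import Union


def canonical_bytes(value: Union[str, dict]) -> bytes:
    """Turn a string or a map into a stable byte sequence (assumed canonical form)."""
    if isinstance(value, str):
        return value.encode("utf-8")
    # Sorted keys and fixed separators make the output independent of key order.
    text = json.dumps(value, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return text.encode("utf-8")


def content_id(value: Union[str, dict]) -> str:
    # Plain SHA-256 hex digest stands in for a real CID here.
    return hashlib.sha256(canonical_bytes(value)).hexdigest()


id_a = content_id({"name": "Bruce Wane", "address": "1007 Mountain Drive, Gotham"})
id_b = content_id({"address": "1007 Mountain Drive, Gotham", "name": "Bruce Wane"})
id_str = content_id("Bruce Wane")
```

`id_a` and `id_b` come out equal even though the keys were written in a different order, which is the property we need before inserting a record.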

A bit of Context: in our "append-only evolution of Phoenix" alog I asked the question dwyl/alog#15 "Should we use (random/nondeterministic) UUIDs as the entry_id for records?"
The point being that while UUIDs solve part of the "distributed database" problem (and allowed us to get started with building alog and the Client projects that use it), using a nondeterministic ID for records means we don't get any benefits beyond uniqueness and we get at least two downsides: super-long IDs that can't be used in human-readable URLs and no ability to verify content. I briefly considered using deterministic UUIDs as IDs because that would allow us to verify content, but the length of UUIDs and their distinctly unfriendly format put me right off.
So I went searching for how other projects are solving the distributed ID problem and found that the IPFS crew have a well-considered "future-proof" approach that is deterministic (which means no duplicate records) and they have a reference implementation in JS which we can use in our Client (Elm) apps and code against in getting the Elixir version working.
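To illustrate the random vs deterministic UUID distinction mentioned above, here is a small sketch using Python's standard `uuid` module. The namespace passed to `uuid5` is an arbitrary choice for the example:

```python
import uuid

content = "Bruce Wane, 1007 Mountain Drive, Gotham"

# Version 4 UUIDs are random: two calls give two different IDs.
random_a = uuid.uuid4()
random_b = uuid.uuid4()

# Version 5 UUIDs are derived from a namespace plus a name (SHA-1 based),
# so the same content always maps to the same ID.
NAMESPACE = uuid.NAMESPACE_URL  # arbitrary namespace, assumed for this sketch
deterministic_a = uuid.uuid5(NAMESPACE, content)
deterministic_b = uuid.uuid5(NAMESPACE, content)

# Either way the textual form is 36 characters of hex and dashes.
id_length = len(str(deterministic_a))
```

The deterministic variant fixes the verification problem but keeps the long, unfriendly format, which is the trade-off described above.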

@RobStallion no need to apologise. I know everyone is busy with client work.
Also, it's not you, it's everyone. I've answered many questions on GitHub, in person and on video chat and have almost never seen the person consolidate the answer for others to learn from. It's super frustrating, because by re-formulating the answer the person understands the question/answer much better and helps other people to learn faster. It's like "win-win-and-keep-winning-for-ever!"
