ZIM on IPFS: maximizing deduplication #71

lidel opened this issue Apr 16, 2020 · 13 comments
Labels: help wanted, kind/question

@lidel commented Apr 16, 2020

See parent issues first: #42 / #69

tldr: we are looking at low-hanging fruit for deduplication during ipfs add of files from https://download.kiwix.org/zim/wikipedia/

Context

📦 Each ZIM snapshot is a big file

ZIM is a binary file format in which Wikipedia snapshots are produced by the Kiwix project and published at https://download.kiwix.org/zim/wikipedia/

⬇️ Import to IPFS can be customized

See relevant parameters (ipfs add --help):

  -t, --trickle              bool   - Use trickle-dag format for dag generation.
  -s, --chunker              string - Chunking algorithm, size-[bytes], rabin-[min]-[avg]-[max] or buzhash. Default: size-262144.
  --raw-leaves               bool   - Use raw blocks for leaf nodes. (experimental).
  --nocopy                   bool   - Add the file using filestore. Implies raw-leaves. (experimental).
  --fscache                  bool   - Check the filestore for pre-existing blocks. (experimental).
  --cid-version              int    - CID version. Defaults to 0 unless an option that depends on CIDv1 is passed. (experimental).
  --hash                     string - Hash function to use. Implies CIDv1 if not sha2-256. (experimental). Default: sha2-256.
  --inline                   bool   - Inline small blocks into CIDs. (experimental).
  --inline-limit             int    - Maximum block size to inline. (experimental). Default: 32.

Low hanging fruit: --chunker

When we import a file to IPFS it is chunked into pieces, and this chunking algorithm can be customized.

The default chunker is a fixed block size of 256 KiB (size-262144), but go-ipfs 0.5+ also supports a parameterized Rabin chunker and an experimental Buzhash chunker.

  -s, --chunker              string - Chunking algorithm, size-[bytes], rabin-[min]-[avg]-[max] or buzhash. Default: size-262144.
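
For illustration, the chunker is chosen at ipfs add time, and different chunkers produce different DAGs (and therefore different CIDs). The ZIM filename below is just a placeholder:

  # default fixed-size chunking (256 KiB blocks)
  ipfs add --cid-version=1 wikipedia_xx_all_maxi.zim
  # content-defined chunking with the experimental buzhash chunker (go-ipfs 0.5+)
  ipfs add --cid-version=1 --chunker=buzhash wikipedia_xx_all_maxi.zim
  # content-defined chunking with rabin; min/avg/max chunk sizes in bytes
  ipfs add --cid-version=1 --chunker=rabin-262144-524288-1048576 wikipedia_xx_all_maxi.zim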

💰 Goal: maximizing deduplication with ipfs add

The key problem with keeping ZIM files around is their size: Turkish Wikipedia is tens of gigabytes, English Wikipedia is 650 GB. Keeping multiple snapshots of the same language is expensive. Any level of deduplication that IPFS could produce would be highly beneficial.

The Kiwix project considered supporting something like zsync (kiwix/overview#12); however, I believe ipfs add with a custom rabin or buzhash chunker could produce even better results. We need to run some benchmarks to validate that claim.

Questions

  • How much space (%) can we save on average within a single language?
    • This is important if we want to create a collaborative pinning cluster per language.
  • …or across different languages?
    • This is important for the Kiwix project: if the numbers are significant, they could switch the entire https://download.kiwix.org/zim/ archive to IPFS and save resources, or provide more snapshots.

📈 Benchmarking plan (TBD)

TBD; below is a doc with my initial ideas – it needs refinement before we invest time, as running it will be time-consuming.

I wonder if we could leverage ipfs/DAGger for automating benchmarks – thoughts @ribasushi?


@ribasushi

@lidel sorry for not responding in time here. Yes, this is precisely what stream-dagger was designed to do. However, I really need to finish the padfinder module and the plumbing for it, as the result is mindblowingly good compared to a single chunking algorithm (be it rabin or anything else). I am on track to have all of it ready by the end of the month and will ping you again once https://github.com/ipfs-shipyard/DAGger/milestone/1 is reached.

@Artoria2e5 commented Jul 19, 2020

Using ZIM files directly still has the issue that the built-in clustering and compression likely reduce the dedupe power. Anything we can reap from running a rolling hash over an opaque binary is, to some extent, coincidental. Sure, cutting the bandwidth down by half is nice, but we could do better.

It would be nice to get some sort of guarantee from kiwix that they are packing entries into clusters in some stable manner like article ID. But eh, the guys at libzim don't want to make that possible (openzim/libzim#83). They don't even want to tell us the cluster size for chunk tuning.

@kelson42

It would be nice to get some sort of guarantee from kiwix that they are packing entries into clusters in some stable manner like article ID. But eh, the guys at libzim don't want to make that possible (openzim/libzim#83). They don't even want to tell us the cluster size for chunk tuning.

@Artoria2e5 In an attempt to phrase it for someone who is not familiar with IPFS: you would like to have a lot of clusters which are binary identical between ZIM files (or at least between many ZIM versions over time of the same content)?

@Artoria2e5 commented Jul 19, 2020

you would like to have a lot of clusters which are binary identical between many ZIM versions over time of the same content

Exactly. (Rsync wants the same too.)

IPFS's tree stores a file as a set of references to chunks that are hashed separately. The chunking can be purely size-based, or it can be based on a rolling hash like rsync does. With rolling hashes we can make sure additions or insertions don't mess up the chunking of otherwise identical data.

If we are to apply a rolling hash to ZIM, the target chunk size should approximately match ZIM's compressed cluster size, so that a change in one cluster only affects around two chunks. The stability requirement is mostly to restrict the number of changes.

Having a chunk size different from the cluster size is totally okay. Making it higher means some extra data is required for recording changes, while making it lower only helps when the data is uncompressed or uses "rsyncable" compression with a smaller block size.
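
To make the rabin parameterization concrete – an illustrative sketch only, where the ~1 MiB average is an assumed cluster size, not a value published by libzim:

  # rabin-[min]-[avg]-[max], all in bytes; the 1 MiB average is a hypothetical cluster size
  ipfs add --cid-version=1 --raw-leaves --chunker=rabin-262144-1048576-2097152 wikipedia_xx_all_maxi.zim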

@kelson42

@Artoria2e5 Thank you for the clarification. To me, the discussion about the format, or even just about how we (should) save things in the ZIM files, is a pretty complex one, and in any case an optimisation. Here I would like to do things in the proper order, which is:

  • Confirming that my measurement with zsync is right – to have a kind of baseline reference
  • Making a measurement with IPFS (standard hash algorithm) to have a baseline and be able to make comparisons

Considering that IPFS offers many other advantages over HTTP+zsync, even if the deduplication were a bit lower, this would still be a serious incentive to set up a plan to provide all our ZIM files on IPFS and then have a project to provide incremental updates based on it.

Tomorrow I have a meeting with @lidel; we will talk about that again.

@Artoria2e5 commented Jul 22, 2020

I put two versions of the zh Wikipedia "all maxi" ZIM onto IPFS as:

# --chunker=buzhash --hash=blake2b-256
bafykbzacea5d2c2lxcdbgja2vnswe5aembmwvzqh6an4wzix3wbxkbkkkevdq
bafykbzaceb3mzfxjiuizmh7yoz5sfajx7zil6xhck2onrl5sr7aolq65iwmhm

The two sizes are 17843989064 and 17633450554. My shell script thinks 17628509273 got reused, which is 99.97%, a really huge amount. zsync should give a similar result with its own rolling hash. It does say a lot about how inactive zh.wp is though.

Hashed enwp to bafykbzacec62ryurd3iakx4pwjvayk23wikee5ex6kvzdkpgh4lgh6gwtum54.

@kelson42

@Artoria2e5 Thank you for the benchmark, but I don't think I understand it properly. Please excuse my ignorance, but:

  • Which ZIM files are we talking about exactly?
  • What ratio of chunks do the two files have in common (it cannot be 99.97%!)? I guess this is what matters here.
  • If I have the oldest file and want to obtain the newest file, how much would I be able to save?
  • What would be the command to get the newest file from the oldest file using IPFS?

@Artoria2e5 commented Jul 22, 2020

  • The zh files should be wikipedia_zh_all_maxi_2020-0{7,6}.zim. I think the first one is the newer file since it's larger, but I can't be sure since I effed up my terminal multiple times and didn't keep the output handy.
  • The "reuse size" is calculated by summing the sizes of all chunks in the new file that have appeared in the old file. And yeah... it somehow is 99.97% the size of the old file. Or 98.79% the size of the new file.
  • 98.79% on the download of the actual blocks. You still need the tree metadata.
  • You just run ipfs get bafykbzacea5d2c2lxcdbgja2vnswe5aembmwvzqh6an4wzix3wbxkbkkkevdq when you have already done ipfs get bafykbzaceb3mzfxjiuizmh7yoz5sfajx7zil6xhck2onrl5sr7aolq65iwmhm. IPFS will recurse down the tree and decide that some blocks are available in the "blockstore" and don't need downloading. Some on-disk copying will happen while IPFS concats the blocks into the output file.
    • If the old file is not already in your IPFS blockstore but you have it on disk, you can put it in there with ipfs add --progress --chunker=buzhash --hash=blake2b-256 --pin=false wikipedia_zh_all_maxi_2020-06.zim. The hash it makes should be identical to mine (bafykbzaceb3mzfxjiuizmh7yoz5sfajx7zil6xhck2onrl5sr7aolq65iwmhm).
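
To see in advance how much of the new snapshot still needs downloading, one option (a sketch using the two CIDs above; note that ipfs refs will fetch the intermediate DAG nodes of the new tree while traversing it) is to diff the recursive block lists:

  OLD=bafykbzaceb3mzfxjiuizmh7yoz5sfajx7zil6xhck2onrl5sr7aolq65iwmhm   # 2020-06
  NEW=bafykbzacea5d2c2lxcdbgja2vnswe5aembmwvzqh6an4wzix3wbxkbkkkevdq   # 2020-07
  ipfs refs -r -u "$OLD" | sort > old.refs
  ipfs refs -r -u "$NEW" | sort > new.refs
  # block CIDs referenced by the new snapshot but not by the old one
  comm -13 old.refs new.refs | wc -l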

@kelson42

* The zh files should be `wikipedia_zh_all_maxi_2020-0{7,6}.zim`. I think the first one is the newer file since it's larger, but I can't be sure since I effed up my terminal multiple times and didn't keep the output handy.

OK

* The "reuse size" is calculated by summing the sizes of all chunks in the new file that have appeared in the old file. And yeah... it somehow is 99.97% the size of the old file. Or 98.79% the size of the new file.

Does "99.97% the size of the old file" mean they share 99.97% of the chunks?

* 98.79% on the download of the actual blocks. You still need the tree metadata.

OK, so roughly how much data needs to be downloaded to get the new file if we already have the old one?

* You just run `ipfs get bafykbzacea5d2c2lxcdbgja2vnswe5aembmwvzqh6an4wzix3wbxkbkkkevdq` when you have already done `ipfs get bafykbzaceb3mzfxjiuizmh7yoz5sfajx7zil6xhck2onrl5sr7aolq65iwmhm`. IPFS will recurse down the tree and decide that some blocks are available in the "blockstore" and don't need downloading. Some on-disk copying will happen while IPFS concats the blocks into the output file.
  
  * If the old file is not already in your IPFS blockstore but you have it on disk, you can put it in there with `ipfs add --progress --chunker=buzhash --hash=blake2b-256 --pin=false wikipedia_zh_all_maxi_2020-06.zim`. The hash it makes should be identical to mine (bafykbzaceb3mzfxjiuizmh7yoz5sfajx7zil6xhck2onrl5sr7aolq65iwmhm).

Have you published it?

 ipfs get bafykbzacea5d2c2lxcdbgja2vnswe5aembmwvzqh6an4wzix3wbxkbkkkevdq
Error: merkledag: not found

@Artoria2e5

Looks like my daemon just crashed :/

@ribasushi

Hi folks, seeing how you are trying to move this forward, I pushed an incomplete version of 🗡️ that will do what you need (please be kind, the help texts are not entirely finished):

git clone https://github.com/ribasushi/DAGger
cd DAGger
git checkout zim_reference 
make
bin/stream-repack-multipart * | bin/stream-dagger --multipart --ipfs-add-compatible-command="--cid-version=1 --chunker=buzhash --hash=blake2b-256"

Should get you going. Replace * with a directory/file of your choice, and repeat with as few or as many files as you want. You can also get the results out in a more machine-readable form by adding --emit-stdout=stats-jsonl.

If you want to import the results into ipfs, you need to:
... --ipfs-add-compatible-command="..." --emit-stdout=car-v0-fifos-xargs | xargs -0 ipfs dag import
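
Putting the two snippets together, a full import of the two zh snapshots discussed above would look roughly like this (a sketch composed from the commands in this thread):

  bin/stream-repack-multipart wikipedia_zh_all_maxi_2020-06.zim wikipedia_zh_all_maxi_2020-07.zim \
    | bin/stream-dagger --multipart \
        --ipfs-add-compatible-command="--cid-version=1 --chunker=buzhash --hash=blake2b-256" \
        --emit-stdout=car-v0-fifos-xargs \
    | xargs -0 ipfs dag import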

/cc @lidel

@lidel commented Jul 22, 2020

@kelson42 I believe Error: merkledag: not found means you are working in offline mode: try starting the ipfs daemon first and then retry ipfs get.
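
For example:

  ipfs daemon
  # wait for "Daemon is ready", then in another terminal:
  ipfs get bafykbzacea5d2c2lxcdbgja2vnswe5aembmwvzqh6an4wzix3wbxkbkkkevdq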


@Artoria2e5 interesting! Got some questions:

  1. What was the reason you decided on --chunker=buzhash --hash=blake2b-256?

    If you picked it on a hunch, you might find the DAGger/tree/zim_reference branch useful – it makes it easier to produce stats for benchmarking different chunking/hashing algorithms, useful for validating assumptions.

  2. I am skeptical regarding "shell script thinks 17628509273 got reused, which is 99.97%" – how do you calculate deduplication? Mind sharing the script?

    I was unable to reproduce your numbers: according to DAGger analysis, an import that produces the same two CIDs as the ones you listed dedupes into 35,179,075,919 bytes, which is 99.19% of the original (saving <1%):

$ bin/stream-repack-multipart wikipedia_zh_all_maxi_2020-06.zim wikipedia_zh_all_maxi_2020-07.zim  | bin/stream-dagger --multipart --ipfs-add-compatible-command="--cid-version=1 --chunker=buzhash --hash=blake2b-256"

{"event":   "root", "payload": 17628509273, "stream":      1, "cid":"bafykbzaceb3mzfxjiuizmh7yoz5sfajx7zil6xhck2onrl5sr7aolq65iwmhm", "wiresize": 17633450554 }
{"event":   "root", "payload": 17838985507, "stream":      2, "cid":"bafykbzacea5d2c2lxcdbgja2vnswe5aembmwvzqh6an4wzix3wbxkbkkkevdq", "wiresize": 17843989064 }

Ran on 4-core/8-thread Intel(R) Core(TM) i7-4770S CPU @ 3.10GHz
Processing took 129.80 seconds using 0.89 vCPU and 180.59 MiB peak memory
Performing 879,752 system reads using 0.14 vCPU at about 260.60 MiB/s
Ingesting payload of:   35,467,494,780 bytes from 2 substreams

Forming DAG covering:   35,477,439,618 bytes of 190,994 logical nodes

Dataset deduped into:   35,169,131,081 bytes over 188,654 unique leaf nodes
Linked as streams by:        9,944,838 bytes over 1,103 unique DAG-PB nodes
Taking a grand-total:   35,179,075,919 bytes, 99.19% of original, 1.0x smaller
 Roots\Counts\Sizes:     3%       10%       25%       50%       95% |      Avg
{2}         2 L3:                           234                 234 |      234
            8 L2:               1,144     9,407     9,407     9,407 |    7,388
        1,093 L1:     9,058     9,058     9,058     9,058     9,058 |    9,044
      188,654 DB:   131,613   132,956   136,219   145,597   400,991 |  186,421
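
In other words, importing both snapshots stores 35,179,075,919 bytes instead of the 35,467,494,780 bytes of ingested payload – a saving of roughly 288 MB, i.e. under 1%, nowhere near the ~99% reuse reported by the earlier script.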

@Artoria2e5 commented Jul 23, 2020

Now that I checked my script again, my is-block-seen test is broken. Of course it is! I forgot a bit of ${}, so instead of checking whether the variable is null I am checking whether a string containing its name is.

buzhash + blake2b is mainly due to my poor CPU. I want something that is not awful but also not going to freeze an old Mac mini. This thing's VNC can go down from downloading a file too fast…
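
For what it's worth, the reuse figure can be approximated without any per-block shell bookkeeping – a sketch, assuming both snapshots are already in the local blockstore (it counts raw leaf and DAG-PB blocks alike, so it is only an approximation of payload reuse):

  OLD=bafykbzaceb3mzfxjiuizmh7yoz5sfajx7zil6xhck2onrl5sr7aolq65iwmhm
  NEW=bafykbzacea5d2c2lxcdbgja2vnswe5aembmwvzqh6an4wzix3wbxkbkkkevdq
  comm -12 <(ipfs refs -r -u "$OLD" | sort) <(ipfs refs -r -u "$NEW" | sort) \
    | xargs -n 1 ipfs block stat \
    | awk '/^Size:/ { sum += $2 } END { print sum, "bytes shared" }'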
