Discussion: compression as part of the CARv2 standard #288
Comments
Thanks for raising! How do you imagine compression / access working? Would you imagine compression being applied per-block, or would you imagine a CARv2 index that's able to provide offsets / hints into the zstd-compressed CAR? My understanding of the challenge here is that for most compression algorithms, once the data is compressed, there isn't an efficient way to read specific bytes of the underlying data, because the dictionary/compression table is built up dynamically from the data earlier in the stream. I don't think there's any opposition to describing and standardizing a mechanism for compression of CAR data.
I'm PoC-ing an implementation with the following:
```rust
struct Index {
    // Maps each CID to the location of its block in the file.
    index: HashMap<Cid, Location>,
    // Identifier for the compression algorithm used.
    compression_type: u64,
    roots: Vec<Cid>,
}

struct Location {
    // Byte offset of the compressed group within the file.
    compressed_group_offset_into_file: u64,
    // Position of the block inside the decompressed group.
    block_index_in_group: u32,
}

struct CompressedGroup {
    blocks: Vec<Block>,
}

struct Block {
    cid: Cid,
    data: Vec<u8>,
}
```
I think you're right - an alternative implementation might try to be clever with e.g. zstd frames, making sure that they line up with block boundaries in the CAR, but if there is going to be a new format anyway, we can be more principled about it. What would you like to see from a PoC here?
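For reference, the frame-alignment alternative sketched: write each block as its own independently decodable frame and record byte offsets, so a random read only decompresses one frame. This is hypothetical, and an identity codec again stands in for per-block zstd compression.

```rust
// Each block is written as its own independently decodable "frame";
// with real zstd this is what compressing each block separately gives you.
// Identity codec stands in for zstd here.
fn compress_frame(block: &[u8]) -> Vec<u8> { block.to_vec() }
fn decompress_frame(frame: &[u8]) -> Vec<u8> { frame.to_vec() }

// Write blocks, recording the (offset, len) of each frame for the index.
fn write_aligned(blocks: &[&[u8]]) -> (Vec<u8>, Vec<(u64, u64)>) {
    let mut file = Vec::new();
    let mut index = Vec::new();
    for b in blocks {
        let frame = compress_frame(b);
        index.push((file.len() as u64, frame.len() as u64));
        file.extend(frame);
    }
    (file, index)
}

// Random access: slice one frame out of the file and decompress it alone.
fn read_aligned(file: &[u8], entry: (u64, u64)) -> Vec<u8> {
    let (off, len) = (entry.0 as usize, entry.1 as usize);
    decompress_frame(&file[off..off + len])
}
```

The downside, presumably, is that per-block frames give up the cross-block redundancy that makes whole-file compression effective; hence the grouped design above.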
High level points:
Thanks for the feedback!
I think this warrants some investigation - I'm going to see what compression ratios we get if we e.g. compress each block vs. compress the entire file.
I think this is obviously true.
I didn't know about flatfs - I'll have a look at that too! I think we want to save newer users from e.g. setting up ZFS.
I'll come back with some results - perhaps per-block compression, or none at all.
To close the loop on this, Forest is now using the `.forest.car.zst` format for its snapshots:

```console
$ du forest_snapshot* --human-readable
3.6G    forest_snapshot_calibnet_2023-10-02_height_964087.car
2.0G    forest_snapshot_calibnet_2023-10-02_height_964087.car.zst
2.3G    forest_snapshot_calibnet_2023-10-02_height_964087.forest.car.zst
```

There is more information in the Forest documentation.
I think maybe this should be titled "CARv3", since CARv2 is a thing and is pretty fixed. Unfortunately the characteristics array at the start doesn't really give us much room to manoeuvre wrt the byte layout of data, so this is very unlikely to be something we can fit into the current design.

But, a couple of thoughts that are related to this proposal. First, it's my belief that any CARv3, or similar effort, should consider a divergence between the on-disk blockstore and/or archival-friendly format and the over-the-wire transport format. We effectively have that today, in fact, with CARv2 being the former and CARv1 being the latter; but we're in the unfortunate position of there being significant confusion (even in our code) about what should be used for what purpose, with people who don't want to care about such things just choosing the higher number because that's obviously going to be better! So a proposal for compression might be worth considering for an on-disk format, and we just isolate it from the transport format, where it likely makes more sense to rely on compression at other layers.

Having said that, one of the proposals (by @hannahhoward) for a CARv3 that's particularly transport-oriented introduces the possibility that each of what we call "sections" today would also carry an additional indicator that tells you what type of section it is, which may be a standard cid:block section, or it may be something else relevant to the transport of the data (attestation block, error message, "skipme/missing", BAO block, etc.). In such a format you could even make "header", "trailer" and perhaps even "index" one of these types, to capture current CARv1 & 2 functionality.

Such a format may lend itself to a compression format where you benefit more by grouping many blocks together, and need to make compromise decisions about how much to group (e.g. when streaming you probably don't want to retain too many blocks while calculating, but maybe send an indicator: "next 10 blocks form a compression group").
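If it's useful, here is one way that section-type idea could carry a compression-group hint in a stream. The tag values and framing below are invented for illustration and aren't from any actual CARv3 proposal.

```rust
use std::convert::TryInto;

// Hypothetical section framing: [u8 section_type][u32 payload_len][payload].
// Tag values are invented for illustration only.
const SECTION_BLOCK: u8 = 0x01;       // ordinary cid:block section
const SECTION_GROUP_START: u8 = 0x02; // "next N blocks form a compression group"

fn write_section(out: &mut Vec<u8>, kind: u8, payload: &[u8]) {
    out.push(kind);
    out.extend((payload.len() as u32).to_le_bytes());
    out.extend(payload);
}

// A streaming writer emits a group marker, then the member blocks, so a
// reader knows up front how many blocks to buffer and decompress together.
fn write_group(out: &mut Vec<u8>, blocks: &[&[u8]]) {
    write_section(out, SECTION_GROUP_START, &(blocks.len() as u32).to_le_bytes());
    for b in blocks {
        write_section(out, SECTION_BLOCK, b);
    }
}

// Reader: walk the stream, collecting (type, payload) pairs.
fn read_sections(buf: &[u8]) -> Vec<(u8, Vec<u8>)> {
    let mut out = Vec::new();
    let mut p = 0;
    while p < buf.len() {
        let kind = buf[p];
        p += 1;
        let len = u32::from_le_bytes(buf[p..p + 4].try_into().unwrap()) as usize;
        p += 4;
        out.push((kind, buf[p..p + len].to_vec()));
        p += len;
    }
    out
}
```

A reader that doesn't understand a tag can skip the section by its length, which is what would keep such a stream forward-compatible.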
Hello 👋 I'm coming from Forest, a Filecoin node implementation by ChainSafe, funded by the Filecoin Foundation, and I saw the current draft of CARv2.
I'd like to open a discussion for CARv2 being compression-aware.
We recently added support for randomly accessing CID data in an uncompressed CAR file: ChainSafe/forest#3085.
We do this by creating an index up front, and seeking to offsets in the file, which seems to rhyme with the goals for CARv2:
ipld/specs/transport/car/carv2/index.md, line 26 at 84142ba
See also the docs for this feature.
We publish zstd-compressed CAR "snapshots", as does Lotus, the reference Filecoin implementation.
Compression ratios hover around 2x for typical "calibnet" snapshots today. Our users will care about:
We believe it's possible, and desirable, to integrate the two, and we will be working on this.
However, we think CARv2 will be a perfect opportunity to create a standard here.
Are you open to inclusion of this in CARv2? Happy to work on an implementation and proposal for the above!