Feature request: generate and use preset compression dictionaries #197

CAFxX · 2022-08-12T23:28:11Z

Some compression algorithms (zstd, lz4, lzo, deflate, ...) support using a custom preset dictionary to achieve higher compression ratios especially for "small" source data sizes.

Given a corpus of known source files (or source blocks), it is thus possible to generate one or more preset dictionaries, and then use these as preset dictionaries during compression and decompression. Depending on the number and size of the source blocks/files, as well as the size and number of dictionaries, it is more or less likely that the resulting compressed size (including the size of the resulting preset dictionaries, that must be available during decompression) will be smaller than if no preset dictionaries were used.

If my understanding of squashfs is correct, squashfs could achieve higher compression ratios, especially when using small block sizes, by (optionally) attempting to generate one or more preset dictionaries based on the blocks that the image is composed of, and then attempting to compress each block using one of the generated preset dictionaries. If the resulting compressed size (+ dictionary size) is smaller than the compressed size without dictionaries, and if the extra memory usage required during decompression is acceptable, then the image compressed using the preset dictionaries would be preferable.

Note that while above I mentioned generating "one or more" dictionaries, It would be likely easier to start with a single one. Generating multiple ones would require somehow grouping "similar" files together, generating a dedicated dictionary for each group, and the use different dictionaries for different files/blocks.

An important use case for this is to efficiently compress executables that are statically compiled, and especially those that embed large runtimes, e.g. Go or Java (native image). In these cases different executables are likely to contain large identical or very similar chunks, something that a custom dictionary would be quite effective at compressing.

plougher · 2022-08-13T02:18:07Z

Interesting you should bring that up ... That idea, along with block deduplication (see issue #58) and others has been on the back-burner as improvements since at least 2009 !

2009 was when the Squashfs kernel code was mainlined by myself into the upstream kernel. One of the results of the mainlining was a commitment to keep the filesystem layout stable. This was an understandable request. because between 2002 and 2009 Squashfs had undergone rapid development, producing a whole set of incompatible versions (five layout changes in 24 releases over 6 years).

But never say never, and these improvements are still on my todo list. It all depends on amount of time, priorities and what else people want done. Squashfs is still self-funded and done as unpaid voluntary work, which limits the amount of work I can do.

CAFxX · 2022-08-13T05:32:28Z

Understood, although I am not sure if this (or at least a baseline version of this) would really require changes to the filesystem layout. Maybe it would be a bit hackish, but I can kinda imagine ways to implement this that would be no more disruptive than adding support for a new compression algorithm (e.g. define a new compressor ID for "zstd + preset dictionary", and store the dictionary somewhere known like in the compressor options¹, in a special hidden file, or directly as a dedicated data block²), that as we know is something that has been done since mainlining.

Or are there plans for a future "5.0" version of the layout? (in which case this could be done properly, including support for per-block dictionaries...)

although this would likely limit the dictionary size to <8000 bytes ↩
with its offset/length stored in the compression options ↩

Sonicadvance1 · 2023-05-17T16:17:43Z

ZSTD has posted some pretty big performance number improvements in their readme when using dictionaries.
https://github.com/facebook/zstd#the-case-for-small-data-compression
Huge decompression speed increases would be awesome when using ZSTD. 645MB/s going to 2772MB/s when using a dictionary, but also compression ratio and compression performance improvements.

For myself this would be interesting to see how much the compression ratio improves for compressing a read-only filesystem (Like a Linux rootfs), since the read-only performance delta between LZ4HC and ZSTD has prevented me from using ZSTD today.

CAFxX changed the title ~~Feature request: generate and use compression dictionaries~~ Feature request: generate and use preset compression dictionaries Aug 12, 2022

plougher self-assigned this Aug 13, 2022

plougher added the enhancement label Aug 13, 2022

plougher added this to the Undecided milestone Aug 13, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Feature request: generate and use preset compression dictionaries #197

Feature request: generate and use preset compression dictionaries #197

CAFxX commented Aug 12, 2022 •

edited

plougher commented Aug 13, 2022 •

edited

CAFxX commented Aug 13, 2022 •

edited

Sonicadvance1 commented May 17, 2023

Feature request: generate and use preset compression dictionaries #197

Feature request: generate and use preset compression dictionaries #197

Comments

CAFxX commented Aug 12, 2022 • edited

plougher commented Aug 13, 2022 • edited

CAFxX commented Aug 13, 2022 • edited

Footnotes

Sonicadvance1 commented May 17, 2023

CAFxX commented Aug 12, 2022 •

edited

plougher commented Aug 13, 2022 •

edited

CAFxX commented Aug 13, 2022 •

edited