Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ZSTD_c_targetCBlockSize == ZSTD_TARGETCBLOCKSIZE_MAX leads to improbably bad compression #2096

Closed
dciliske opened this issue Apr 28, 2020 · 3 comments · Fixed by #2100
Closed
Assignees
Labels
bug release-blocking Must be done by the release

Comments

@dciliske
Copy link

dciliske commented Apr 28, 2020

Using ZSTD_CCtxParams_setParameter(cctxParams, ZSTD_c_targetCBlockSize, ZSTD_TARGETCBLOCKSIZE_MAX) leads to improbably bad compression for sources with large minimum sequences (>1kB).

Using a repeated concatenation of a minified CSS:
With setting targetCBlockSize == ZSTD_TARGETBLOCKSIZE_MAX:
bootstrap.min.css : 97.31% (13862462 => 13488900 bytes, bootstrap.min.css.zst)
Without setting targetCBlockSize:
bootstrap.min.css : 0.15% (13862462 => 20476 bytes, bootstrap.min.css.zst)

Further note: I encountered this as a result of trying to reduce the block size on decompress regarding #2093. If instead of setting the TargetCompressedBlockSize, ZSTD_BLOCKSIZEMAX itself is reduced, in turn causing the maximum blocksize to be reduced as well, it has a much smaller impact on the compression.


The source used was obtained via the following sequence:
rm bootstrap.min.css; wget https://stackpath.bootstrapcdn.com/bootstrap/4.3.1/css/bootstrap.min.css && for i in {1..5}; do cat bootstrap.min.css >> bootstrap_2.min.css; cat bootstrap_2.min.css >> bootstrap.min.css; done && rm bootstrap_2.min.css

@dciliske
Copy link
Author

I'm going to ask a question here regarding architecture: does targetCBlockSize cause the compressor to assume that it is going to be fed into a decompressor who will have no knowledge outside of the block it receives? i.e. it is a fully bufferless streaming decompressor?

@terrelln
Copy link
Contributor

targetCBlockSize is "meant" to be used in streaming scenarios where you want to reduce the time to decompress the first byte. So if your packet size is 4KB you could set your target block size to 4KB and attempt to make each packet decompressible, instead of having to wait for a full 128KB before decompressing the first byte.

I'm not sure whats going on in this scenario, but with the input it should be easy to reproduce and fix. I will look into it soon. Thanks again for the report and the detailed repro instructions!

@terrelln terrelln added the bug label Apr 28, 2020
@terrelln
Copy link
Contributor

terrelln commented Apr 28, 2020

@dciliske I've reproduced the issue at level 22. It is not present in zstd-1.4.4, so it never made it into a release. It looks like I introduced it in #1947. I'll fix it soon.

@terrelln terrelln added the release-blocking Must be done by the release label May 1, 2020
terrelln added a commit that referenced this issue May 1, 2020
Fixes:

Enable RLE blocks for superblock mode
Fix the limitation that the literals block must shrink. Instead, when we're within 200 bytes of the next header byte size, we will just use the next one up. That way we should (almost?) always have space for the table.
Remove the limitation that the first sub-block MUST have compressed literals and be compressed. Now one sub-block MUST be compressed (otherwise we fall back to raw block which is okay, since that is streamable). If no block has compressed literals that is okay, we will fix up the next Huffman table.
Handle the case where the last sub-block is uncompressed (maybe it is very small). Before it would skip superblock in this case, now we allow the last sub-block to be uncompressed. To do this we need to regenerate the correct repcodes.
Respect disableLiteralsCompression in superblock mode
Fix superblock mode to handle a block consisting of only compressed literals
Fix a off by 1 error in superblock mode that disabled it whenever there were last literals
Fix superblock mode with long literals/matches (> 0xFFFF)
Allow superblock mode to repeat Huffman tables
Respect ZSTD_minGain().
Tests:

Simple check for the condition in #2096.
When the simple_round_trip fuzzer enables superblock mode, it checks that the compressed size isn't expanded too much.
Remaining limitations:

O(targetCBlockSize^2) because we recompute statistics every sequence
Unable to split literals of length > targetCBlockSize into multiple sequences
Refuses to generate sub-blocks that don't shrink the compressed data, so we could end up with large sub-blocks. We should emit those sections as uncompressed blocks instead.
...
Fixes #2096
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug release-blocking Must be done by the release
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants