Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix up superblock mode #2100

Merged
merged 1 commit into from
May 1, 2020
Merged

Fix up superblock mode #2100

merged 1 commit into from
May 1, 2020

Conversation

terrelln
Copy link
Contributor

@terrelln terrelln commented Apr 30, 2020

Fixes:

  • Enable RLE blocks for superblock mode
  • Fix the limitation that the literals block must shrink. Instead, when we're within 200 bytes of the next header byte size, we will just use the next one up. That way we should (almost?) always have space for the table.
  • Remove the limitation that the first sub-block MUST have compressed literals and be compressed. Now one sub-block MUST be compressed (otherwise we fall back to raw block which is okay, since that is streamable). If no block has compressed literals that is okay, we will fix up the next Huffman table.
  • Handle the case where the last sub-block is uncompressed (maybe it is very small). Before it would skip superblock in this case, now we allow the last sub-block to be uncompressed. To do this we need to regenerate the correct repcodes.
  • Respect disableLiteralsCompression in superblock mode
  • Fix superblock mode to handle a block consisting of only compressed literals
  • Fix a off by 1 error in superblock mode that disabled it whenever there were last literals
  • Fix superblock mode with long literals/matches (> 0xFFFF)
  • Allow superblock mode to repeat Huffman tables
  • Respect ZSTD_minGain().

Tests:

Remaining limitations:

  • O(targetCBlockSize^2) because we recompute statistics every sequence
  • Unable to split literals of length > targetCBlockSize into multiple sequences
  • Refuses to generate sub-blocks that don't shrink the compressed data, so we could end up with large sub-blocks. We should emit those sections as uncompressed blocks instead.
  • ...

Fixes #2096

@terrelln terrelln changed the title working on fixing superblock Fix up superblock mode Apr 30, 2020
@terrelln terrelln force-pushed the superblock-fix branch 5 times, most recently from 7184e79 to f28bc9a Compare April 30, 2020 23:20
@@ -2463,6 +2460,16 @@ static size_t ZSTD_compressBlock_targetCBlockSize_body(ZSTD_CCtx* zc,
{
DEBUGLOG(6, "Attempting ZSTD_compressSuperBlock()");
if (bss == ZSTDbss_compress) {
if (/* We don't want to emit our first block as a RLE even if it qualifies because
* doing so will cause the decoder (cli only) to throw a "should consume all input error."
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

* Returns the ZSTD_sequenceLength for the given sequences. It handles the decoding of long sequences
* indicated by longLengthPos and longLengthID, and adds MINMATCH back to matchLength.
*/
MEM_STATIC ZSTD_sequenceLength ZSTD_getSequenceLength(seqStore_t const* seqStore, seqDef const* seq)
Copy link
Contributor

@Cyan4973 Cyan4973 May 1, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding is that this function, ZSTD_getSequenceLength(), is only used within zstd_compress_superblock.c.

Therefore, why is it implemented in common/zstd_internal.h ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is a generic helper function related to sequences. By putting it near the sequences, hopefully the next person who needs to read a litlen/matchlen will use this function, and see the reasoning behind it. A refactor of existing helper code, like the sequence collector, should switch it to using this function.

@@ -16,6 +16,7 @@
#include "zstd_compress_sequences.h"
#include "zstd_compress_literals.h"
#include "zstd_compress_superblock.h"
#include "zstd_internal.h"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor : document a reason for this dependency (getSequenceLength())

tests/fuzzer.c Outdated
@@ -12,6 +12,7 @@
/*-************************************
* Compiler specific
**************************************/
#include "zstd_internal.h"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor : document a reason for this additional dependency.

Question : why does it have to be here, rather than in the * Includes paragraph below ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ohhh, that is why I ended up with duplicate includes... I think vscode decided to start silently adding missing includes...

tests/fuzzer.c Outdated
DISPLAYLEVEL(3, "OK \n");

DISPLAYLEVEL(3, "test%3d: superblock with no literals : ", testNb++);
// Generate the same data 20 times over
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor nit : avoid // comment style

@Cyan4973
Copy link
Contributor

Cyan4973 commented May 1, 2020

Thanks for this fairly large fix, complete with a new test set

Copy link
Contributor

@Cyan4973 Cyan4973 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only some minor nits suggested, looks good to me

@terrelln terrelln merged commit e103d7b into facebook:dev May 1, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ZSTD_c_targetCBlockSize == ZSTD_TARGETCBLOCKSIZE_MAX leads to improbably bad compression
3 participants