Compression does not respect LZ4 official End of block conditions #12

rlespinet · 2023-02-16T13:53:32Z

I noticed that the library does not respect the end-of-block conditions specified in the official LZ4 repository. More specifically:

End of block conditions

The last match must start at least 12 bytes before the end of block. The last match is part of the penultimate sequence. It is followed by the last sequence, which contains only literals.
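
The quoted rule can be expressed as a one-line check. This is only a minimal sketch: the function and constant names are hypothetical and are not part of lz4js or the official LZ4 code; positions are offsets in the uncompressed block.

```python
# Hypothetical helper illustrating the quoted rule; not an lz4js or LZ4 API.
MIN_LAST_MATCH_DISTANCE = 12  # "at least 12 bytes before the end of block"


def last_match_is_legal(block_len: int, last_match_start: int) -> bool:
    """Return True if a match starting at `last_match_start` (an offset in the
    uncompressed block of `block_len` bytes) satisfies the 12-byte rule."""
    return block_len - last_match_start >= MIN_LAST_MATCH_DISTANCE
```

For instance, a match starting at offset 32 of a 43-byte block is 11 bytes from the end and is rejected, while the same match in a 44-byte block is exactly 12 bytes from the end and is accepted.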

For example, the following ASCII string

Abcdefghijklmnop0000000000000000Abcdefghijk

is encoded as

04 22 4d 18 40 70 df 1e 00 00 00 fb 02 41 62 63 64 65 66 67 68 69 6a 6b 6c 6d 6e 6f 70 30 01 00 02 20 00 50 67 68 69 6a 6b 00 00 00 00
<━ ━ ━  FRAME ━ ━ ━> <━ BLOCK ━> <━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ SEQUENCE 0 ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━> <SEQ  1> <━  SEQUENCE 2 ━> <━ FRAME ━>
                                        A  b  c  d  e  f  g  h  i  j  k  l  m  n  o  p  0    |         |     g  h  i  j  k
                                                                                        ▲    |         |
                                        ▲              ▲                                ┕━━━━┙         |
                                        ┕━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┙

This produces a match starting less than 12 bytes before the end of the block, which is not guaranteed to be decoded correctly.
In contrast, the official LZ4 encoder correctly avoids emitting this match. Here is what it generates for the same input:

04 22 4D 18 60 40 82 21 00 00 00 FB 02 41 62 63 64 65 66 67 68 69 6A 6B 6C 6D 6E 6F 70 30 01 00 B0 41 62 63 64 65 66 67 68 69 6A 6B 00 00 00 00
<━ ━ ━  FRAME ━ ━ ━> <━ BLOCK ━> <━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ SEQUENCE 0 ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━ ━> <━ ━ ━ ━ ━ ━ SEQUENCE 1  ━ ━ ━ ━ ━> <━ FRAME ━>
                                        A  b  c  d  e  f  g  h  i  j  k  l  m  n  o  p  0    |       A  b  c  d  e  f  g  h  i  j  k
                                                                                        ▲    |
                                                                                        ┕━━━━┙

This was obtained with the following command

$ echo -ne "Abcdefghijklmnop0000000000000000Abcdefghijk" | lz4 -c -12 --no-frame-crc | od -t x1 -A n
 04 22 4d 18 60 40 82 21 00 00 00 fb 02 41 62 63
 64 65 66 67 68 69 6a 6b 6c 6d 6e 6f 70 30 01 00
 b0 41 62 63 64 65 66 67 68 69 6a 6b 00 00 00 00

Using

$ lz4 --version
*** LZ4 command line interface 64-bits v1.9.2, by Yann Collet ***

Note that after adding one extra character (from Abcdefghijklmnop0000000000000000Abcdefghijk to Abcdefghijklmnop0000000000000000Abcdefghijkl), the match starts exactly 12 bytes before the end of the block, so producing it is legal. The official LZ4 encoder therefore produces the same output as lz4js:

$ echo -ne "Abcdefghijklmnop0000000000000000Abcdefghijkl" | lz4 -c -12 --no-frame-crc | od -t x1 -A n
 04 22 4d 18 60 40 82 1e 00 00 00 fb 02 41 62 63
 64 65 66 67 68 69 6a 6b 6c 6d 6e 6f 70 30 01 00
 03 20 00 50 68 69 6a 6b 6c 00 00 00 00
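
The byte dumps above can be checked mechanically. The sketch below walks a raw LZ4 block (the bytes between the 4-byte block size and the final 00 00 00 00 end mark), lists its sequences, and reports how far the last match starts from the end of the uncompressed data. The function names are hypothetical, and the parser assumes a well-formed block with no error handling.

```python
def parse_sequences(block: bytes):
    """Split a raw LZ4 block into (literal_len, match_len, offset) tuples.
    The final, literals-only sequence is reported with match_len=0, offset=0."""
    seqs, i = [], 0
    while i < len(block):
        token = block[i]
        i += 1
        lit_len = token >> 4
        if lit_len == 15:  # extended literal length: add bytes until one is < 255
            while True:
                b = block[i]
                i += 1
                lit_len += b
                if b != 255:
                    break
        i += lit_len  # skip over the literal bytes
        if i >= len(block):  # last sequence: literals only, no offset field
            seqs.append((lit_len, 0, 0))
            break
        offset = block[i] | (block[i + 1] << 8)  # 2-byte little-endian offset
        i += 2
        match_len = token & 0x0F
        if match_len == 15:  # extended match length
            while True:
                b = block[i]
                i += 1
                match_len += b
                if b != 255:
                    break
        seqs.append((lit_len, match_len + 4, offset))  # minimum match is 4 bytes
    return seqs


def last_match_distance(block: bytes):
    """Uncompressed bytes between the start of the last match and the end of
    the block, or None if the block contains no match."""
    pos, last_start = 0, None
    for lit_len, match_len, _ in parse_sequences(block):
        pos += lit_len
        if match_len:
            last_start = pos
        pos += match_len
    return None if last_start is None else pos - last_start


PREFIX = "fb 02 41 62 63 64 65 66 67 68 69 6a 6b 6c 6d 6e 6f 70 30 01 00 "
LZ4JS_43 = bytes.fromhex(PREFIX + "02 20 00 50 67 68 69 6a 6b")
OFFICIAL_43 = bytes.fromhex(PREFIX + "b0 41 62 63 64 65 66 67 68 69 6a 6b")
OFFICIAL_44 = bytes.fromhex(PREFIX + "03 20 00 50 68 69 6a 6b 6c")

for name, blk in [("lz4js, 43 bytes", LZ4JS_43),
                  ("official, 43 bytes", OFFICIAL_43),
                  ("official, 44 bytes", OFFICIAL_44)]:
    print(name, parse_sequences(blk), last_match_distance(blk))
```

On these three blocks the reported distances are 11, 26 and 12 bytes, so only the lz4js encoding of the 43-byte input falls below the 12-byte minimum.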

rlespinet linked a pull request Feb 16, 2023 that will close this issue

Respect LZ4 end of block conditions #13

Open
