Does srcSize passed to LZ4_decompress_safe_continue need to equal the compressed block size? #1172

negatratoron opened this issue Sep 26, 2022 · 15 comments

@negatratoron

I'm updating some code for work, and we're using the deprecated LZ4_decompress_fast_continue. This function takes a pointer to a compressed block and returns the number of bytes read from the source buffer.

I have two questions about LZ4_decompress_safe_continue.

(1) Can srcSize be larger than the compressed block?
(2) What is the return value of LZ4_decompress_safe_continue? (It is not documented.)

I peeked at lz4.c, and it seems like the answer to (2) is that the return value is the decompressed size. From this, I reason that the answer to (1) is no, since there would be no way to tell how much data LZ4_decompress_safe_continue read from the source buffer.

The reason I ask is that we happen to be compressing a series of blocks and saving the total compressed size, but not the block boundaries. We are then relying on the behavior of LZ4_decompress_fast_continue to stop decompressing at the end of a block and return the number of compressed bytes it read, which lets us decompress the blocks in a loop. It seems like the safe API is incompatible with our current data format.

To be specific, we are decompressing a memory-contiguous sequence of blocks in a loop. I understand that LZ4_decompress_*_continue must be used in this situation, because LZ4_decompress_* assumes that one block is being decompressed, and will give incorrect results if it is run on a sequence of blocks. Is this true, or is e.g. LZ4_decompress_safe intended to decompress multiple contiguous blocks?
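
For concreteness, our current loop is shaped roughly like this (a sketch with illustrative names, not our actual code):

// Sketch of our current pattern: LZ4_decompress_fast_continue returns the
// number of bytes it consumed from src, which is how we locate the next block.
#include "lz4.h"

void decompress_all(const char* src, char* dst,
                    size_t block_size, size_t block_count) {
  LZ4_streamDecode_t* stream = LZ4_createStreamDecode();
  size_t src_offset = 0;
  for (size_t i = 0; i < block_count; i++) {
    // Deprecated API: trusts the input and returns bytes read from src.
    int consumed = LZ4_decompress_fast_continue(
        stream, src + src_offset, dst + i * block_size,
        static_cast<int>(block_size));
    if (consumed < 0) break; // decoding error
    src_offset += static_cast<size_t>(consumed);
  }
  LZ4_freeStreamDecode(stream);
}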

The one action item I'd suggest from this issue is to document (1) and (2) in the comments to LZ4_decompress_safe_continue.

Thanks!

@Cyan4973 (Member)

Indeed, LZ4_decompress_safe_continue() is modeled after LZ4_decompress_safe(),
and decompresses one full block (and only one) per invocation.
For this it needs to know the exact compressed size of the block.
It will return the exact decompressed size of the block.
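
For example, if the application records each block's compressed size, a decode loop can look like this (a minimal sketch; the blockSizes array is assumed to be stored by the application, which is not the case in your format):

#include "lz4.h"

void decode_blocks(const char* src, const int* blockSizes, size_t blockCount,
                   char* dst, int blockCapacity) {
  LZ4_streamDecode_t* stream = LZ4_createStreamDecode();
  const char* ip = src;
  char* op = dst;
  for (size_t i = 0; i < blockCount; i++) {
    // srcSize must be the exact compressed size of this block;
    // the return value is the exact decompressed size.
    int decSize = LZ4_decompress_safe_continue(stream, ip, op,
                                               blockSizes[i], blockCapacity);
    if (decSize < 0) break; // malformed input
    ip += blockSizes[i];
    op += decSize;
  }
  LZ4_freeStreamDecode(stream);
}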

The scenario you are looking for, replacing LZ4_decompress_fast_continue(), can be achieved using LZ4_decompress_safe_partial_usingDict(), which is currently undocumented.
It's modeled after LZ4_decompress_safe_partial(), and features the same properties and behavior, with the added capability to specify where in memory the history of previously decompressed block(s) resides (up to the last 64 KB).

While LZ4_decompress_safe_partial_usingDict() wasn't created for this purpose initially, it can effectively support the same scenario as LZ4_decompress_fast_continue(), i.e. be given the decompressed size and return the compressed size, although the input size it receives must be at least as large as the compressed size of the current block.
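
For a single block sitting somewhere inside a larger input, the call shape could look like this (a minimal sketch; all names are hypothetical, and see the rest of the thread regarding what the return value actually provides):

// Sketch: decode one full block of known decompressed size from a larger
// input, using previously decompressed data as the dictionary.
#include "lz4.h"

int decode_one_block(const char* inPtr, int remainingInputSize,
                     char* outPtr, int decompressedSize,
                     const char* history, int historySize) {
  return LZ4_decompress_safe_partial_usingDict(
      inPtr, outPtr,
      remainingInputSize,  // may exceed this block's compressed size
      decompressedSize,    // targetOutputSize
      decompressedSize,    // maxOutputSize (capacity of outPtr)
      history, historySize);
}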

@negatratoron (Author)

negatratoron commented Oct 25, 2022

Hey, so I've looked at LZ4_decompress_safe_partial_usingDict, and I'm a little unsure if I'm doing what you intended. It looks like it calls LZ4_decompress_generic through every code path, and that function has just one happy-path return statement, which returns the decompressed size (https://github.com/lz4/lz4/blob/v1.9.4/lib/lz4.c#L2333). It does return the compressed size in the _output_error path just below that, but I'm not sure how best to trigger that error condition, if that's what you're suggesting.

I did try passing 1 + targetOutputSize and 1 + maxOutputSize to it, to try to trigger the error condition intentionally by asking for more bytes to be decompressed than were originally compressed, and it does trigger the error condition. The return value seems to be more or less the negative of the compressed size that was read. It turns out that setting compressedSize = -3 - LZ4_decompress_safe_partial_usingDict(...) passes the tests I've thrown at it so far (which is just compressing and decompressing some random strings). I'm not sure if the -3 is always correct or just happened to be correct for my test cases thus far. Is this the kind of call you were suggesting I make, and is this a correct way to determine the compressed size from the return value?

In other words, what I've found so far comes down to writing this code:

compressedSize = -3 - LZ4_decompress_safe_partial_usingDict(
    inPtr, outPtr,
    inBufferLen,
    // Lie about the decompressed size (add 1) to get lz4 to error out and return us the compressed size
    decompressedSize + 1, decompressedSize + 1,
    dictPtr, dictSize);

Wondering if that's the sort of thing you had in mind, or if you were thinking of something a little more straightforward.

Thanks!

I really don't think I would be comfortable committing the magic number -3 unless I had some idea why it needs to be there.

@negatratoron (Author)

negatratoron commented Oct 25, 2022

Okay, I'm pretty sure the -3 is unreliable. It looks like the origin of the magic number -3 is that ip is incremented by 2 when reading out the offset to copy from (v1.9.4/lib/lz4.c#L2232), and then the offset is found to be before the beginning of the available space to copy from (v1.9.4/lib/lz4.c#L2250). When we try to decode past the end of a block, we read garbage: the "offset" read is actually the token at the start of the next block plus the next byte of data. If the byte at the start of the next block has a low character value, the offset can easily fall within the decoded data. For example, if the literal was just the character \0 and the match length was 4, the offset read on the line given would be 16. That could easily be a valid offset to read from. So that doesn't work.

Still looking for a good way to use the API to decode a known decompressed size and return the compressed size that was used.

Thanks!

Edit: I'm also working with the folks at work to see if we can alter our data format to move off of the deprecated API, but I'd still like to find a way to use the new set of functions with our current data format if possible.

@negatratoron (Author)

@Cyan4973 Sorry to bug you again, but are you sure this use case is still supported with the current API? Thanks!

@Cyan4973 (Member)

Cyan4973 commented Dec 1, 2022

If we are talking about the first scenario presented at the top of the issue,
i.e. multiple compressed blocks stacked back-to-back,
with only the total compressed size of all blocks reported,

then there is another important thing to ask: is the decompressed size of each block known?

@negatratoron (Author)

Correct. Multiple compressed blocks stacked back to back, total compressed size known, and decompressed size for each block known.

@Cyan4973 (Member)

Cyan4973 commented Dec 2, 2022

Ah, and since you are using the _continue() variant of decompress,
it implies that compressed blocks can have matches leading into previous blocks?

@negatratoron (Author)

negatratoron commented Dec 2, 2022

No, during compression we call LZ4_resetStreamHC before compressing each block, so I do not believe they can have matches leading into previous blocks.
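
Per block, the compression side looks roughly like this (a sketch with illustrative names; the compression level is an assumption on my part):

#include "lz4hc.h"

// The reset before each block makes the blocks independent:
// no matches can lead into previously compressed blocks.
// The stream is created elsewhere, e.g. with LZ4_createStreamHC().
int compress_block(LZ4_streamHC_t* stream, const char* src, int srcSize,
                   char* dst, int dstCapacity) {
  LZ4_resetStreamHC(stream, LZ4HC_CLEVEL_DEFAULT);
  return LZ4_compress_HC_continue(stream, src, dst, srcSize, dstCapacity);
}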

@Cyan4973 (Member)

Cyan4973 commented Dec 2, 2022

OK,
then why use the _continue() variant?
Would decompression work with "just" LZ4_decompress_fast()?

@negatratoron (Author)

negatratoron commented Dec 2, 2022

To answer your question, the code was like that when I got here. I'm just reporting what it says; however, I agree that it would make more sense not to use the _continue variant.

My whole intent here is just not to use deprecated functions if possible. It looks like LZ4_decompress_fast would work as well as LZ4_decompress_fast_continue, but that's deprecated too.

@negatratoron (Author)

To be more concrete, here's some example code. It compresses some data into a series of blocks and returns the compressed data (and also the number of blocks). The decompress function discovers the compressed size of each block from the return value of LZ4_decompress_fast so that it can decompress the series of blocks in sequence.

However, LZ4_decompress_fast is deprecated. Do you plan on removing that in the future, or continuing to maintain it? If the answers are yes / no respectively, I would be interested in finding a way to implement decompress below that will be maintained.

#include <iostream>
#include <string>
#include <utility>
#include <vector>

#include "lz4.h"

// Returns number of compressed blocks and compressed data
// Takes decompressed block size and data to compress
std::pair<size_t, std::vector<char>>
compress(size_t decompressed_block_size, std::string str) {
  std::vector<char> out_buf(1024); // big enough
  size_t block_count = 0;

  size_t offset = 0;
  while (block_count * decompressed_block_size < str.size()) {
    offset += LZ4_compress_default(
      str.data() + block_count * decompressed_block_size,
      out_buf.data() + offset,
      decompressed_block_size,
      out_buf.size() - offset
    );
    block_count++;
  }

  out_buf.resize(offset);
  return {block_count, out_buf};
}

// Returns decompressed data
// Takes decompressed block size, number of compressed blocks, and compressed data (which includes the total size of the compressed data)
std::string
decompress(size_t decompressed_block_size, std::pair<size_t, std::vector<char>> compressed) {
  std::string result;

  size_t offset = 0;
  for (size_t i = 0; i < compressed.first; i++) {
    std::string decompressed_block(decompressed_block_size, '\0');
    offset += LZ4_decompress_fast(
      compressed.second.data() + offset,
      decompressed_block.data(),
      decompressed_block_size
    );
    result += decompressed_block;
  }

  return result;
}

int main() {
  constexpr size_t decompressed_block_size = 128;

  std::string str = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Amet volutpat consequat mauris nunc congue nisi vitae. Tortor at auctor urna nunc id. Eu ultrices vitae auctor eu augue ut lectus arcu. Arcu odio ut sem nulla pharetra diam. Dui id ornare arcu odio ut sem nulla pharetra diam. Nulla aliquet enim tortor at auctor urna nunc. Odio ut sem nulla pharetra diam. Viverra tellus in hac habitasse. Purus sit amet volutpat consequat. Viverra vitae congue eu consequat ac felis donec et. Ut placerat orci nulla pellentesque dignissim enim sit amet. Orci a scelerisque purus semper eget duis at tellus. Venenatis a condimentum vitae sapien pellentesque habitant morbi tristique. Elementum facilisis leo vel fringilla est ullamcorper. Posuere ac ut consequat semper viverra.";

  // Round string length up to multiple of decompressed block size
  if (auto mod = str.size() % decompressed_block_size) {
    str.resize(str.size() + (decompressed_block_size - mod), '\0');
  }

  auto compressed = compress(decompressed_block_size, str);
  auto decompressed = decompress(decompressed_block_size, compressed);

  std::cout << str << std::endl;
  std::cout << decompressed << std::endl;
}

@Cyan4973 (Member)

Cyan4973 commented Dec 2, 2022

> To answer your question, the code was like that when I got here. I'm just reporting what it says; however, I agree that it would make more sense not to use the _continue variant.
>
> My whole intent here is just not to use deprecated functions if possible. It looks like LZ4_decompress_fast would work as well as LZ4_decompress_fast_continue, but that's deprecated too.

Sure.
The point is, if you can use LZ4_decompress_fast(), then I've got a good replacement to propose:
LZ4_decompress_safe_partial().

The way it works:

LZ4LIB_API int LZ4_decompress_safe_partial (const char* src, char* dst, int srcSize, int targetOutputSize, int dstCapacity);
@src: ptr to the input (beginning of the compressed block)
@dst: destination ptr to write to
@srcSize: can be your _entire_ multi-block input; the only thing that matters is that it must be >= the compressed block size
@targetOutputSize: the decompressed size of the block
@dstCapacity: the size of the output buffer, must be >= targetOutputSize

LZ4_decompress_safe_partial() is more general than LZ4_decompress_fast(); it's meant to serve more complex scenarios in which only a portion of the block is decompressed.
But it also happens to overlap perfectly with LZ4_decompress_fast(), so you can use it as an equivalent.

Maybe I could also update the documentation to make that clearer.
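
Concretely, decompressing one full block could look like this (a minimal sketch; names are hypothetical):

// Sketch: LZ4_decompress_safe_partial as a one-block stand-in for
// LZ4_decompress_fast. srcSize may be larger than the block's compressed
// size; decoding stops once targetOutputSize bytes have been produced.
#include "lz4.h"

int decode_full_block(const char* src, int remainingSrcSize,
                      char* dst, int blockDecompressedSize) {
  return LZ4_decompress_safe_partial(src, dst, remainingSrcSize,
                                     blockDecompressedSize,
                                     blockDecompressedSize);
}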

edit:
I now realize that it doesn't completely replace LZ4_decompress_fast(), because it returns the number of bytes written into dst (expected to be == targetOutputSize), but it doesn't provide the number of bytes read from src, in contrast with LZ4_decompress_fast(). As a consequence, in the scenario you describe, a piece of information is missing to determine where the next block starts.
TBC

@Cyan4973 (Member)

Cyan4973 commented Dec 2, 2022

Having looked at the API again,
I'm afraid there is no direct replacement available yet.

The main problem is returning the number of bytes read from src.
This was provided by LZ4_decompress_fast() and its variants,
but is not provided by LZ4_decompress_safe() and its variants.

I presume restoring this capability will require creating a new entry point in the library.
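
For illustration only, such an entry point might be shaped like this (purely hypothetical; nothing with this name exists in lz4.h today):

// Hypothetical sketch of a possible new entry point; this declaration does
// NOT exist in the library. It would behave like LZ4_decompress_safe_partial()
// while also reporting how many compressed bytes were consumed from src.
int LZ4_decompress_safe_partial_withConsumed(
    const char* src, char* dst,
    int srcSize, int targetOutputSize, int dstCapacity,
    int* srcConsumed /* out: number of bytes read from src */);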

@Cyan4973 (Member)

Cyan4973 commented Dec 2, 2022

> The return value seems to be more or less the negative of the compressed size that was read. It turns out that setting compressedSize = -3 - LZ4_decompress_safe_partial_usingDict(...) passes the tests I've thrown at it so far (which is just compressing and decompressing some random strings). I'm not sure if the -3 is always correct or just happened to be correct for my test cases thus far.

I see how you have managed to make LZ4_decompress_safe_partial() provide you with the value you want, i.e. the compressed size.
I wouldn't rely on this behavior though. It's undocumented, not guaranteed, and even less stable than relying on LZ4_decompress_fast().

At this point, my recommendation would be to keep using LZ4_decompress_fast() for the time being,
while a new library function able to return the compressed size gets prepared.

@negatratoron (Author)

Okay, thank you so much for your time!
