
Distinguishing between decompression errors #156

Open
nemequ opened this issue Sep 15, 2015 · 21 comments

@nemequ
Contributor

nemequ commented Sep 15, 2015

LZ4_decompress_safe returns

the number of bytes decompressed into destination buffer (necessarily <= maxDecompressedSize)
If destination buffer is not large enough, decoding will stop and output an error code (<0).
If the source stream is detected malformed, the function will stop decoding and return a negative result.
This function is protected against buffer overflow exploits, including malicious data packets.

I would like to be able to distinguish between the different errors. At least I would like to know if the input was malformed, incomplete, or the output buffer was just too small.

AFAICT the error code isn't really documented anywhere, only that <0 is an error. Maybe you could just return -1, -2, or -3 (or whatever) depending on the type of error…?
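A hypothetical sketch of what the caller side could look like if distinct negative codes existed. The macro names and values below are invented for illustration; the real API only documents "<0 means error":

```c
/* Hypothetical distinct error codes -- NOT the real LZ4 API, which only
 * documents "negative on error".  This sketches what a caller could do
 * if the codes were defined. */
#define LZ4X_ERR_DST_TOO_SMALL   (-1)
#define LZ4X_ERR_MALFORMED_INPUT (-2)
#define LZ4X_ERR_INPUT_TRUNCATED (-3)

/* Map a (hypothetical) decompression return value to a recovery action. */
static const char *classify(int ret)
{
    if (ret >= 0)                         return "ok";
    if (ret == LZ4X_ERR_DST_TOO_SMALL)    return "grow output buffer and retry";
    if (ret == LZ4X_ERR_MALFORMED_INPUT)  return "input is corrupt, give up";
    if (ret == LZ4X_ERR_INPUT_TRUNCATED)  return "need more input";
    return "unknown error";
}
```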

nemequ added a commit to quixdb/squash that referenced this issue Sep 15, 2015
This required improving the error handling in a lot of plugins, but in
the end *almost* all plugins can do it (which I found surprising).
The two which can't are:

 * lz4:lz4: lz4/lz4#156
 * ms-compress:lznt1: coderforlife/ms-compress#17

I did have to modify my LZJB fork a bit, and LZO's behaviour is
disconcerting, but adding some checks to the bounds.c tests should
be comforting.

Signed-off-by: Evan Nemerson <evan@nemerson.com>
@Cyan4973
Member

It's a good idea.

We could complete this proposal with a few common functions such as LZ4_isError() and LZ4_getErrorName(), similar to what already exists within lz4frame, FSE and Zstd.

I'll introduce this feature in a future revision.
We'll keep this issue open to track it.
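For reference, lz4frame/FSE/Zstd encode errors by returning `(size_t)-errorCode`, so `isError()` just checks whether the value falls in the top of the `size_t` range. A minimal self-contained imitation of that convention (the `demo_` names are illustrative, not any real API):

```c
#include <stddef.h>

/* zstd/lz4frame-style convention: functions return size_t, and error values
 * are encoded as (size_t)-errorCode, occupying the top of the size_t range. */
typedef enum {
    demo_error_no_error = 0,
    demo_error_generic,
    demo_error_dstSize_tooSmall,
    demo_error_maxCode   /* must stay last */
} demo_ErrorCode;

#define DEMO_ERROR(name) ((size_t)-(size_t)demo_error_##name)

/* A value is an error iff it lies strictly above (size_t)-maxCode. */
static unsigned demo_isError(size_t code)
{
    return code > (size_t)-(size_t)demo_error_maxCode;
}

/* Recover the enum member from an encoded error value. */
static demo_ErrorCode demo_getErrorCode(size_t code)
{
    if (!demo_isError(code)) return demo_error_no_error;
    return (demo_ErrorCode)(0 - code);
}
```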

@nemequ
Contributor Author

nemequ commented Sep 16, 2015

I'm not sure LZ4_isError is really necessary—testing for <0 isn't hard. Zstd's errors are a bit less obvious (not sure about FSE, I haven't used it). LZ4_getErrorName would be nice, though.

@Cyan4973
Member

Indeed, you are right.

I just remembered that, a few years ago, there was a debate about moving length variables from int to unsigned. The project was cancelled after we discovered that it would change the way errors are detected, hence it would not be a transparent change.

LZ4_isError() would have been useful for such transition.

Today, the primary purpose of such a function would be to make the API more "familiar" to users of my other public libraries.

@nemequ
Contributor Author

nemequ commented Nov 29, 2015

I don't want to be a pest, but is there any chance of getting this fixed relatively soon? It's actually causing some issues for me, and I'd like to make sure it gets resolved before Squash 0.8.

The problem is that, for codecs which don't include the uncompressed length in the compressed data, when someone uses a streaming interface Squash has no choice but to basically just try to decompress the data. If we can distinguish between the output being too small and other errors then we can just try again with a larger buffer for the former. Currently we start with the next_power_of_two(decompressed_size) << 3, so it's usually not an issue (all the files in the benchmark work with that value for all codecs), but for highly compressible data LZ4 currently fails during decompression.
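The guess-and-retry strategy described above can be sketched as follows. Here `try_decompress` is a toy stand-in with LZ4_decompress_safe's contract (any failure returns an undifferentiated negative value), which is exactly why the loop cannot tell a too-small buffer from corrupt input:

```c
#include <stdlib.h>

/* Toy stand-in with LZ4_decompress_safe's contract: returns the number of
 * bytes written, or a negative value on ANY failure (dst too small, corrupt
 * src, ...) -- the failures are indistinguishable, as in LZ4 today.  This
 * toy simply needs 100 bytes of output room. */
static int try_decompress(const char *src, int srcSize, char *dst, int dstCap)
{
    (void)src; (void)srcSize;
    if (dstCap < 100) return -1;
    for (int i = 0; i < 100; i++) dst[i] = 'x';
    return 100;
}

/* Grow-and-retry: double the destination buffer until decompression
 * succeeds or a sanity limit is reached.  Returns a malloc'd buffer
 * (caller frees), or NULL. */
static char *decompress_unknown_size(const char *src, int srcSize,
                                     int firstGuess, int maxSize, int *outSize)
{
    for (int cap = firstGuess; cap <= maxSize; cap *= 2) {
        char *dst = malloc((size_t)cap);
        if (dst == NULL) return NULL;
        int n = try_decompress(src, srcSize, dst, cap);
        if (n >= 0) { *outSize = n; return dst; }
        free(dst);  /* failed -- but was dst too small, or is src corrupt? */
    }
    return NULL;    /* corrupt input looks identical to "never big enough" */
}
```

With distinguishable error codes, the loop could bail out immediately on corrupt input instead of growing until the sanity limit.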

@nemequ
Contributor Author

nemequ commented Nov 29, 2015

FWIW, if it's just a matter of adapting the code from zstd I'd be willing to put together a patch…

@Cyan4973
Member

Hi Evan

Yes, sorry, my time is pretty much tied to Zstd currently.

The problem is that, for codecs which don't include the uncompressed length in the compressed data, when someone uses a streaming interface Squash has no choice but to basically just try to decompress the data.

I believe this is the core issue.

"Block compression" layer of LZ4 doesn't provide any header, since it delegates its management to its caller (for example, lz4frame is such a user layer, and generates headers).

This design is on purpose. That's because the lz4 block format is intended to be integrated into other algorithms, and there are a lot of ways to encode header information, each optimal only within application-specific contexts.
For example, some will know in advance the size to regenerate, or at least guarantee the maximum block size.
Some will know that the size can be safely represented using 2 bytes, others will need 4. Others will feature some kind of jump table from which the size of each block can be derived. And so on.

The fact that the LZ4 "block compression" layer doesn't manage headers itself doesn't mean that it's supposed to be used without headers.
In fact, the calling layer is required to design its own headers.

Btw, Squash already does it. It encapsulates a large number of compression algorithms, so it has a header, at least in order to distinguish between them.

So my suggestion would be to add the relevant missing information (size to generate, or an upper bound of it) into the Squash header. It's likely a smaller and safer effort compared to the "guess buffer sizes" mechanism described previously.

@nemequ
Contributor Author

nemequ commented Nov 29, 2015

Yes, sorry, my time is pretty much tied to Zstd currently.

No worries; like I said, I'm willing to write the patch if you just want to adapt the code from zstd (otherwise I'm still willing, just need some guidance about how you want it to look).

Squash already does it, by the way, since it encapsulates a large number of compression algorithms.

Actually, it doesn't. I've been considering adding some wrapper functions which include a small header to indicate the decompressed size (basically, just prepending a variable-length quantity if the codec doesn't already include the information), but it will only ever be optional, for two reasons:

  • The codecs in Squash need to be compatible with other software.
  • People at a higher level (i.e., those calling Squash) may want to implement the header themselves. Just like you don't want to force a header in LZ4, I don't want to force one in Squash.

If there were a bunch of codecs with this problem I might be forced to simply say this isn't supported but, rather surprisingly (at least to me), of all the codecs in Squash LZ4 is actually the only one incapable of distinguishing between the buffer being too small and some other error. It doesn't seem like there is a good reason for this, so I'd rather just fix it in LZ4 than add a workaround in Squash which will cause pain for everyone using it.

@Cyan4973
Member

The codecs in Squash need to be compatible with other software.

Great, so I read it as: when using lz4frame (for example), Squash generates compressed data directly compatible with other software respecting the frame format.

This is a highly desirable property.

However, at least regarding LZ4, only the frame format has been specified for interoperability. Anything else leads to the realm of custom formats, which may sometimes prove compatible, by chance, but most of the time are not.

What I mean is that, if you want to be compatible with other software, and if supporting even a subset of the frame format looks overkill, it's likely necessary to define which software will be compatible with the format read or generated by Squash, or to have reasonable expectations that the newly defined format will be implemented by other software.

For example, for LZ4, a few ports decided to go for a relatively simple file format (this was before the frame format was even defined, but it remains popular today): start with an unsigned little-endian 4-byte value indicating the size of the data to regenerate, then provide the compressed data as a single memory block; the file size provides the size of the compressed data. This convention has obvious limitations (memory storage, size limit, concatenation, etc.), but it's damn simple. I suspect it was selected by several ports such as Java, Python and Perl, although maybe with some minor but incompatible differences. Other ports may follow different conventions though, with different capabilities.
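For concreteness, the 4-byte little-endian size header of that convention could be read and written like this (the helper names are illustrative):

```c
#include <stdint.h>

/* Write the 4-byte unsigned little-endian size header described above,
 * byte by byte so the result is independent of host endianness. */
static void write_size_header(unsigned char *dst, uint32_t originalSize)
{
    dst[0] = (unsigned char)(originalSize         & 0xFF);
    dst[1] = (unsigned char)((originalSize >> 8)  & 0xFF);
    dst[2] = (unsigned char)((originalSize >> 16) & 0xFF);
    dst[3] = (unsigned char)((originalSize >> 24) & 0xFF);
}

/* Read the header back; the result tells the decoder how large a
 * destination buffer to allocate before decompressing the block. */
static uint32_t read_size_header(const unsigned char *src)
{
    return  (uint32_t)src[0]
         | ((uint32_t)src[1] << 8)
         | ((uint32_t)src[2] << 16)
         | ((uint32_t)src[3] << 24);
}
```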

Let's be clear that I'm completely favorable to improving the error codes provided by LZ4, from a generic "there is a problem" right now to a list of well-defined issues, like zstd. Consider this objective granted, and yes, PRs are warmly welcomed.

It's just that I feel that, even when this capability becomes available, it may not be the best answer to your issue.

(as a side-note, I now also wonder how the Squash decompressor can distinguish between multiple different codecs without a "prefix" in the compressed stream to tell which one it is. Filename suffix?)

@nemequ
Contributor Author

nemequ commented Nov 30, 2015

Squash supports both the "lz4" and "lz4f" codecs; "lz4" is raw LZ4 data, "lz4f" uses the framed API. Obviously this is only an issue for the raw format.

If you want, I can switch the codec names around so "lz4" is the framed format, and "lz4-raw" (or whatever you would like) is the raw format; I think people who don't understand the issue are, based solely on the name, more likely to use "lz4" than "lz4-framed" or "lz4-raw". IIRC you changed the lz4 CLI to generate the framed format by default, so I would argue that this would actually be helpful for interoperability.

Let's be clear that I'm completely favorable to improving the error codes provided by LZ4, from a generic "there is a problem" right now to a list of well-defined issues, like zstd. Consider this objective granted, and yes, PRs are warmly welcomed.

Okay, I'll try to start copying some API out of zstd and send you a PR when I'm done.

It's just that I feel that, even when this capability becomes available, it may not be the best answer to your issue.

Well, it's the answer to my issue because I just want the raw lz4 codec, whatever it is called, to work. However, I completely agree that it's sub-optimal and that people should be using a framing format. It's most definitely an ugly hack, but it works surprisingly well (I was shocked that LZ4 is the only codec with an issue).

(as a side-note, I now also wonder how the Squash decompressor can distinguish between multiple different codecs without a "prefix" in the compressed stream to tell which one it is. Filename suffix?)

In general, it doesn't; you have to tell it what format you're using. However, you can also register an extension ("filename suffix") in the squash.ini file distributed with each plugin. There is an API in squash to look up a codec based on the extension (squash_get_codec_from_extension), and the CLI will fall back on that if the codec isn't explicitly specified. You can also register a MIME type; I seem to be missing an API to look up a codec based on MIME type, but that's just an oversight I wasn't aware of until now; I'll fix it shortly.

It's easy to forget that the primary use case for Squash is as a library. The original use case was actually to compress pages in a database (hence the "quixdb" group on GitHub); I couldn't make up my mind about what codec to use (I used to think I was indecisive, but now I'm not so sure…), so I put together a little abstraction layer to switch between them so I could test them both. From that point, well, things got a bit out of hand… In situations like that you don't want an extra header, so Squash doesn't force one on you.

@Cyan4973
Member

If you want, I can switch the codec names around so "lz4" is the framed format, and "lz4-raw" (or whatever you would like) is the raw format; (...) I would argue that this would actually be helpful for interoperability

Totally agree.
If I could rebuild it today, that's the way I would spell it.

Having learned from this experience, this is how Zstd is being developed right now: ZSTD_compress() generates a frame format by default. In the future, an advanced API will allow the creation of raw compressed data blocks and delegate header management to the calling layer. This is basically a change in priority order: the "normal" format is the interoperable one; "raw" is considered advanced, for specific usages.

nemequ added a commit to quixdb/squash that referenced this issue Nov 30, 2015
This is according to Yann Collet's wishes; see
lz4/lz4#156 (comment)
for details.

Hopefully this will encourage people to use the framed format unless
they really understand the implications of not doing so.
@nemequ
Contributor Author

nemequ commented Jan 6, 2016

I'm looking at the error API in ZSTD right now, and IMHO it's quite problematic…

The most obvious issue is namespace pollution. It defines the following macros:

  • ERR_H_MODULE
  • ERR_STATIC
  • PREFIX
  • ERROR
  • ERROR_LIST
  • ERROR_GENERATE_ENUM
  • ERROR_CONVERTTOSTRING
  • ERROR_GENERATE_STRING

As well as several static symbols

  • ERR_strings
  • ERR_isError
  • ERR_getErrorName

This means that, if I just copy the header over to lz4 then using both lz4 and zstd in the same project becomes… complicated.

I think a better approach would be something more like:

typedef enum {
  LZ4_error_none = 0,
  LZ4_error_foo,
  LZ4_error_barBaz,
  LZ4_error_qux
} LZ4_error_t;

/* This could be implemented as a static inline function */
LZ4_error_t LZ4_errorCode (int retval);

const char* LZ4_errorToString (LZ4_error_t error);

I'm not clear on the convention for when to use underscores, camelCase, or StudlyCaps, but that's easy to correct if you can point me in the right direction.

Having the errorCode function/macro return 0 for success and positive values for error codes makes it easy to use the result in an if statement, which could be handy (it can be used like ERR_isError in zstd).

The errorToString function could be implemented similarly to squash_status_to_string. Or it could just be left out altogether if human-readable error strings aren't worth the storage space.
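A minimal sketch of how LZ4_errorToString could be implemented with a switch; the enum members are the placeholder names from the proposal above, not a real LZ4 API. With -Wswitch-enum (or -Werror=switch-enum), the compiler flags any enum member missing a case, keeping the strings in sync:

```c
/* Placeholder names from the proposal above -- not a real LZ4 API. */
typedef enum {
    LZ4_error_none = 0,
    LZ4_error_foo,
    LZ4_error_barBaz,
    LZ4_error_qux
} LZ4_error_t;

/* Compile with -Wswitch-enum so a newly added enum member without a
 * matching case triggers a warning. */
static const char *LZ4_errorToString(LZ4_error_t error)
{
    switch (error) {
    case LZ4_error_none:   return "No error";
    case LZ4_error_foo:    return "Foo failure";
    case LZ4_error_barBaz: return "Bar baz failure";
    case LZ4_error_qux:    return "Qux failure";
    }
    return "Unknown error code";  /* values outside the enum */
}
```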

@Cyan4973
Member

Cyan4973 commented Jan 7, 2016

I'm looking at the error API in ZSTD right now

I guess you refer to error.h. This API is intended to remain internal only; it should not be included in client programs.

I know that zstd_static.h currently includes error.h, but that's just a temporary work-around, to get the enum list. Note that zstd_static.h is reserved for "experimental API" and static linking. Stable API will be zstd.h.

So, the only "stable" error API prototypes right now are:

  • ZSTD_isError()
  • ZSTD_getErrorName()

What's missing here is the enum list.
That's the reason why zstd_static.h includes error.h, but that's overkill, because doing so also pulls in many other API elements which should remain private.

I'll have to find something to fix that, and provide an enum list accessible through zstd_static.h. It's just not so clear yet how...

@nemequ
Contributor Author

nemequ commented Jan 7, 2016

Ah, didn't realize it was supposed to be internal. So the error codes aren't exposed in zstd, either…

I'll have to find something to fix that, and provide an enum list accessible through zstd_static.h. It's just not so clear yet how.

Why not just provide an enum and get rid of all the macros? What good are they doing? The code is much more complicated, and the only benefit I can think of is that you can automatically stringify them to generate the result of ZSTD_getErrorName(), but nobody wants that. They either want an integer error code, or a human-readable string.

Keeping the list of errors in sync shouldn't be an issue; you have CI set up (off-topic: https://github.com/nemequ/icc-travis if you want to test with icc, too), so with the appropriate options (-Wswitch-enum, or -Werror=switch-enum) you should be warned if the list gets out of sync…

@Cyan4973
Member

Cyan4973 commented Jan 8, 2016

What good are they doing?

There are 2 main benefits:

  • It simplifies returning errors.

Before error.h existed, it was necessary to write something like this:
if (condition) return (size_t) - ZSTD_ERROR_srcSize_Wrong;

Now it's just:
if (condition) return ERROR(srcSize_Wrong);

On top of, imho, better readability, the most important gain is the automatic -, a little detail which is easy to forget, and can cost some time to debug when forgotten (compilation stage cannot catch this).

  • Already listed: guaranteed correspondence between an error code, defined as an enum, and its string identity.

This one is also fairly easy to get wrong when a new error code is inserted.
Sure, with enough care, it should not happen, but that's exactly the kind of maintenance cost that bites hardest when changes are made in the future.

With these macros in place, it's enough to add, remove or modify an error enum, and everything falls into place automatically.
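The "falls into place automatically" property comes from the X-macro pattern: one list, expanded twice, once for the enum and once for the string table. A minimal self-contained version (illustrative names, not zstd's actual header):

```c
/* X-macro: a single list expanded twice -- once into the enum, once into
 * the string table -- so the two can never drift apart. */
#define XM_ERROR_LIST(ITEM) \
        ITEM(no_error) ITEM(generic) ITEM(dstSize_tooSmall) ITEM(srcSize_wrong)

#define XM_GENERATE_ENUM(name) xm_error_##name,
typedef enum { XM_ERROR_LIST(XM_GENERATE_ENUM) xm_error_maxCode } xm_error_t;

#define XM_GENERATE_STRING(name) #name,
static const char *const xm_error_strings[] =
        { XM_ERROR_LIST(XM_GENERATE_STRING) };

/* Table lookup: the string at index e always matches enum member e. */
static const char *xm_getErrorName(xm_error_t e)
{
    return (e < xm_error_maxCode) ? xm_error_strings[e] : "unknown";
}
```

Adding a new entry to XM_ERROR_LIST updates the enum and the string table in a single edit, which is the maintenance property being argued for here.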

I see you have selected a switch-case methodology instead. I instinctively avoided this solution because a switch-case construction looks heavier than a trivial fetch from a table of char*.

That being said, the more important reason I avoided it is that I wasn't aware of the -Wswitch-enum gcc compilation flag, which seems the perfect complement to ensure no error code is forgotten.

The benefit you get is that your error strings are not confined to being an image of the enum: they can be more descriptive.

Sidenote: I'm curious to know whether the compiler can be clever enough to optimize the switch-case into a trivial char* table fetch when all enums, hence all values from 0 to N, are implemented...

Edit: I just made a test using godbolt, and it seems both methods get reduced to pretty much the same assembler once the "not an enum" case is factored in.

nobody wants that. They either want an integer error code, or a human-readable string.

Do you mean having access to the enum is not useful? Only the numeric value is enough?

@nemequ
Contributor Author

nemequ commented Jan 8, 2016

It simplifies returning errors.

Makes sense. #define ZSTD_Error(x) ((size_t) -(x)) isn't something I have any objection to. It's the other macros which do very strange things instead of just creating an enum (and the namespace issue, but ZSTD_/LZ4_, or keeping them private, resolves that).

On top of, imho, better readability,

I suppose that's subjective, but I have trouble believing anyone finds

#define ERROR_LIST(ITEM) \
        ITEM(PREFIX(No_Error)) ITEM(PREFIX(GENERIC)) \
        ITEM(PREFIX(prefix_unknown)) ITEM(PREFIX(frameParameter_unsupported)) ITEM(PREFIX(frameParameter_unsupportedBy32bitsImplementation)) \
        ITEM(PREFIX(init_missing)) ITEM(PREFIX(memory_allocation)) \
        ITEM(PREFIX(dstSize_tooSmall)) ITEM(PREFIX(srcSize_wrong)) \
        ITEM(PREFIX(corruption_detected)) \
        ITEM(PREFIX(tableLog_tooLarge)) ITEM(PREFIX(maxSymbolValue_tooLarge)) ITEM(PREFIX(maxSymbolValue_tooSmall)) \
        ITEM(PREFIX(maxCode))

More readable than

enum {
  ZSTD_error_No_Error,
  ZSTD_error_GENERIC,
  ZSTD_error_prefix_unknown,
  ZSTD_error_frameParameter_unsupported,
  ZSTD_error_frameParameter_unsupportedBy32bitsImplementation,
  ZSTD_error_init_missing,
  ZSTD_error_memory_allocation,
  ZSTD_error_dstSize_tooSmall,
  ZSTD_error_srcSize_wrong,
  ZSTD_error_corruption_detected,
  ZSTD_error_tableLog_tooLarge,
  ZSTD_error_maxSymbolValue_tooLarge,
  ZSTD_error_maxSymbolValue_tooSmall,
  ZSTD_error_maxCode
};

Edit: I just made a test using godbolt, and it seems both methods get reduced to pretty much the same assembler once the "not an enum" case is factored in.

Yeah, compilers are good at optimizing switches, especially if there aren't lots of missing values.

Do you mean having access to the enum is not useful? Only the numeric value is enough?

No, having access to the enum is definitely nice. What I meant is that having getErrorString() return something like "ZSTD_error_maxSymbolValue_tooSmall" or "maxSymbolValue_tooSmall" isn't helpful. What you really want is either the enum value itself (ZSTD_error_maxSymbolValue_tooSmall) or a human-readable string like "Max symbol value too small".

@Cyan4973
Member

Cyan4973 commented Jan 9, 2016

These are good points.
I'll have to ponder how to evolve the current error system along these lines.

@nemequ
Contributor Author

nemequ commented Jan 9, 2016

FWIW, I think what I posted earlier works pretty well as an enhancement to what is already in place, at least from the public side.

The only public difference would be exposing the enum and a function to convert a return value to a value of that enum, plus a function to retrieve a string descriptor of the error. AFAICT the only value that should change would be the return value of LZ4_decompress_generic (and its wrappers), but I believe the docs only say the return value will be negative on error. All the functions which return (size_t) -error_code would keep doing so.

Anyways, take your time thinking about it; when you settle on something just let me know and I'll try to put together a patch to implement it.

@Luiji

Luiji commented Jan 12, 2016

Lurker, here!

@nemequ:

The problem is that, for codecs which don't include the uncompressed length in the compressed data, when someone uses a streaming interface Squash has no choice but to basically just try to decompress the data. If we can distinguish between the output being too small and other errors then we can just try again with a larger buffer for the former. Currently we start with the next_power_of_two(decompressed_size) << 3, so it's usually not an issue (all the files in the benchmark work with that value for all codecs), but for highly compressible data LZ4 currently fails during decompression.

For this use case, I'd argue it'd be much better to get a "null" decompressor that doesn't actually write anything to a buffer, but calculates the size of the output so you can allocate that space then run the real decompression.

This is, in fact, the exact way many custom asnprintf implementations do it when the C library doesn't provide one. So basically what you'd have is:

size = lz4Decompress(buffer: NULL, size: 0)
buffer = allocate(size)
lz4Decompress(buffer: buffer, size: size)

The end result would save memory (since it doesn't guess) and may perform much better since it doesn't run as many expensive calls to malloc (though a benchmark would be necessary to confirm).
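The asnprintf trick referenced here relies on C99 vsnprintf semantics: called with a NULL buffer and size 0, it returns the length the output would have required. A self-contained sketch of the two-pass approach (`format_alloc` is an illustrative name):

```c
#include <stdio.h>
#include <stdlib.h>
#include <stdarg.h>

/* Two-pass formatting via C99 vsnprintf: pass (NULL, 0) to learn the
 * required length, then allocate exactly that and format for real. */
static char *format_alloc(const char *fmt, ...)
{
    va_list ap, ap2;
    va_start(ap, fmt);
    va_copy(ap2, ap);                           /* second pass reuses args */
    int needed = vsnprintf(NULL, 0, fmt, ap);   /* pass 1: size only */
    va_end(ap);
    if (needed < 0) { va_end(ap2); return NULL; }
    char *buf = malloc((size_t)needed + 1);
    if (buf != NULL)
        vsnprintf(buf, (size_t)needed + 1, fmt, ap2);  /* pass 2: write */
    va_end(ap2);
    return buf;                                 /* caller frees */
}
```

Note the disanalogy raised in the next comment: for printf the sizing pass is cheap, whereas for LZ4 a sizing pass would be nearly as expensive as the decompression itself.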

@nemequ
Contributor Author

nemequ commented Jan 12, 2016

For this use case, I'd argue it'd be much better to get a "null" decompressor that doesn't actually write anything to a buffer, but calculates the size of the output so you can allocate that space then run the real decompression.

I'm not so sure. It would still have to decompress the buffer; saving the actual write would be a pretty minimal gain. Also (Yann, please correct me if I'm wrong here), I think LZ4 would still need a buffer to write to since it needs to be able to look up data at a given offset, so it would have to create a buffer of at least the maximum match distance.

In general I think it's much better to allocate a buffer which will likely be large enough to hold the result and only have to decompress once. The only real exception is if you're on a 32-bit platform where the problem is address space and you're working with large pieces of data.

This is, in fact, the exact way many custom asnprintf implementations do it when the C library doesn't provide one.

Whenever I've had to do this I always just put a small buffer on the stack; maybe 256 or 512 bytes, and if it succeeds just memcpy the result over to a new buffer of the right size. That provides a decent optimization for the common case (the vast majority of printf calls are for pretty small pieces of data, and memcpy is a lot faster than having to process all the data again). The exception, of course, is Windows where vsnprintf doesn't actually follow the C99 standard and you have to call _vscprintf.
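That stack-buffer-first strategy can be sketched like this (`format_small_first` is an illustrative name, and 256 bytes for the small buffer is an arbitrary choice):

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdarg.h>

/* Try a small stack buffer first; only format a second time when the
 * result did not fit -- the rare case for typical printf-style output. */
static char *format_small_first(const char *fmt, ...)
{
    char stackbuf[256];
    va_list ap, ap2;
    va_start(ap, fmt);
    va_copy(ap2, ap);
    int n = vsnprintf(stackbuf, sizeof stackbuf, fmt, ap);
    va_end(ap);
    if (n < 0) { va_end(ap2); return NULL; }
    char *out = malloc((size_t)n + 1);
    if (out != NULL) {
        if ((size_t)n < sizeof stackbuf)
            memcpy(out, stackbuf, (size_t)n + 1);    /* common: just copy */
        else
            vsnprintf(out, (size_t)n + 1, fmt, ap2); /* rare: format again */
    }
    va_end(ap2);
    return out;                                      /* caller frees */
}
```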

The end result would save memory (since it doesn't guess) and may perform much better since it doesn't run as many expensive calls to malloc (though a benchmark would be necessary to confirm).

  • malloc is very fast on modern platforms, especially when compared to something like decompressing a buffer.
  • You perform just as many calls to malloc (1) for all but the most heavily compressible data (every piece of data in my current benchmark will decompress on the first try). Actually, you're likely to perform more since lz4 would need to allocate an internal buffer for the first run.
  • It's much more likely to perform much worse since you always perform the expensive operation (decompression) twice.
  • For small buffers you wouldn't save much memory, for large buffers it probably wouldn't matter anyways because you never touch the memory. Worst case is that, in the unlikely event the physical memory is ever actually needed, the excess gets swapped out and costs you nothing since you never touch it.
  • If you're writing an application with stressful performance or memory demands, this will probably show up pretty quickly in profiling, in which case you will realize that you're doing something wrong and quickly switch to the lz4 codec instead of lz4-raw, add a wrapper containing the decompressed length, or, if it is user-configurable, tell users to switch to another codec (possibly disallowing codecs which don't natively support streaming and don't provide the decompressed length).

If you still disagree we should talk about it in Squash's issue tracker, since it's pretty off-topic for here.

@Luiji

Luiji commented Jan 12, 2016

@nemequ: When it comes to performance I think debating whether or not it'll work isn't as useful as just running a benchmark, but I see your points nonetheless. :) In the case of lz4 having to create buffers anyway, then I'd definitely see the calculate-buffer-size option as inefficient.

If I continued on with this subject, it'd be in a ticket here, though, since it suggests adding a new API to LZ4 (in particular a calculate-buffer-size procedure). But at minimum it's definitely off-topic for this ticket, so I'll shut up about it here. ;P

@Lveyou

Lveyou commented Oct 23, 2022

I think "If the source stream is detected malformed" should return a minimum int:

#define LZ4_MALFORMED_SOURCE INT_MIN

int ret = LZ4_decompress_safe(...);
if (ret < 0)
{
    if (ret == LZ4_MALFORMED_SOURCE)
        ; /* source stream is malformed */
    else
        ; /* some other error, e.g. destination buffer too small */
}
