New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Distinguishing between decompression errors #156
Comments
This required improving the error handing in a lot of plugins, but in the end *almost* all plugins can do it (which I found surprising). The two which can't are: * lz4:lz4: lz4/lz4#156 * ms-compress:lznt1: coderforlife/ms-compress#17 I did have to modify my LZJB fork a bit, and LZO's behaviour is disconcerting, but adding some checks to the bounds.c tests should be comforting. Signed-off-by: Evan Nemerson <evan@nemerson.com>
It's a good idea. We could complete this proposal with a few common functions such as I'll introduce this feature in a future revision. |
I'm not sure |
Indeed, you are right. I just remember that, a few years ago, there was a debated to move length variables from
Today, the primary purpose of such a function would be to make the API more "familiar" with my other public libraries. |
I don't want to be a pest, but is there any chance of getting this fixed relatively soon? It's actually causing some issues for me, and I'd like to make sure it gets resolved before Squash 0.8. The problem is that, for codecs which don't include the uncompressed length in the compressed data, when someone uses a streaming interface Squash has no choice but to basically just try to decompress the data. If we can distinguish between the output being too small and other errors then we can just try again with a larger buffer for the former. Currently we start with the next_power_of_two(decompressed_size) << 3, so it's usually not an issue (all the files in the benchmark work with that value for all codecs), but for highly compressible data LZ4 currently fails during decompression. |
FWIW, if it's just a matter of adapting the code from zstd I'd be willing to put together a patch… |
Hi Evan Yes, sorry, my time is pretty much tied to Zstd currently.
I believe this is the core issue. "Block compression" layer of LZ4 doesn't provide any header, since it delegates its management to its caller (for example, This design is on purpose. That's because lz4 block format is intended to be integrated into other algorithms, and there are a lot of ways to encode header information, optimal only within application specific contexts. The fact that the LZ4 "block compression" layer doesn't manage headers itself doesn't mean that it's supposed to be used without headers. Btw, Squash already does it. It encapsulates a large number of compression algorithms, so it has a header, at least in order to distinguish between them. So my suggestion would be to add relevant missing information (size to generate, or an upper bound of it) into squash header. It's likely a smaller and safer effort compared to "guess" buffer sizes mechanism as described previously. |
No worries; like I said, I'm willing to write the patch if you just want to adapt the code from zstd (otherwise I'm still willing, just need some guidance about how you want it to look).
Actually, it doesn't. I've been considering adding some wrapper functions which include a small header to indicate the decompressed size (basically, just prepending a variable-length quantity if the codec doesn't already include the information), but it will always only be an option for two reasons:
If there were a bunch of codecs with this problem I might be forced to simply not say this isn't supported but, rather surprisingly (at least to me), of all the codecs in Squash LZ4 is actually the only one incapable of distinguishing between the buffer being too small and some other error. It doesn't seem like there is a good reason for this, so I'd rather just fix it in LZ4 than add a workaround in Squash which will cause pain for everyone using it. |
Great, so I read it as : when using This is a highly desirable property. However, at least regarding LZ4, only the frame format has been specified for interoperability. Anything else leads to the realm of custom formats, which may sometimes prove compatible, by chance, but most of the time are not. What I mean is that, if you want to be compatible with other software, and if supporting even a subset of the frame format looks overkill, it's likely necessary to define which software will be compatible with the format read or generated by Squash, or have reasonable expectations that the newly defined format will be implemented by other softwares. For example, for LZ4, a few ports decided to go for a relatively simple file format (this was before the frame format was even defined, but remain popular today) : start with an unsigned little-endian 4-bytes value indicating the size of data to regenerate, then provide compressed data as a single memory block. Filesize provides the size of compressed data. This convention has obvious limitations (memory storage, size limit, concatenation, etc.), but it's damn simple. I suspect it was selected by several ports such as Java, Python and Perl, although maybe with some minor but incompatible differences. Other ports may follow different conventions though, with different capabilities. Let's be clear that I'm completely favorable to improving the error codes provided by LZ4, from a generic "there is a problem" right now to a list of well-defined issues, like zstd. Consider this objective granted, and yes, PR are warmly welcomed. It's just that I feel that, even when this capability will be available, it may not be the better answer to your issue. (as a side-note, I know also wonder how squash decompressor can distinguish between multiple different codecs without a "prefix" in the compressed stream to tell which one it is. Filename suffix ? ) |
Squash supports both the "lz4" and "lz4f" codecs; "lz4" is raw LZ4 data, "lz4f" uses the framed API. Obviously this is only an issue for the raw format. If you want, I can switch the codec names around so "lz4" is the framed format, and "lz4-raw" (or whatever you would like) is the raw format; I think people who don't understand the issue are, based solely on the name, more likely to use "lz4" than "lz4-framed" or "lz4-raw". IIRC you change the lz4 CLI to generate the framed format by default, so I would argue that this would actually be helpful for interoperability.
Okay, I'll try to start copying some API out of zstd and send you a PR when I'm done.
Well, it's the answer to my issue because I just want the raw lz4 codec, whatever it is called, to work. However, I completely agree that it's sub-optimal and that people should be using a framing format. It's most definitely an ugly hack, but it works surprisingly well (I was shocked that LZ4 is the only codec with an issue).
In general, it doesn't; you have to tell it what format you're using. However, you can also register an extension ("filename suffix") in the squash.ini file distributed with each plugin. There is an API in squash to look up a codec based on the extension ( It's easy to forget that the primary use case for Squash is as a library. The original use case was actually to compress pages in a database (hence the "quixdb" group on GitHub); I couldn't make up my mind about what codec to use (I used to think I was indecisive, but now I'm not so sure…), so I put together a little abstraction layer to switch between them so I could test them both. From that point, well, things got a bit out of hand… In situations like that you don't want an extra header, so Squash doesn't force on one you. |
Totally agree. Having learned from this experience, this is how Zstd is being developed right now : |
This is according to Yann Collet's wishes; see lz4/lz4#156 (comment) for details. Hopefully this will encourage people to use the framed format unless they really understand the implications of not doing so.
I'm looking at the error API in ZSTD right now, and IMHO it's quite problematic… The most obvious issue is namespace pollution. It defines the following macros:
As well as several static symbols
This means that, if I just copy the header over to lz4 then using both lz4 and zstd in the same project becomes… complicated. I think a better approach would be something more like: typedef enum {
LZ4_error_none = 0,
LZ4_error_foo,
LZ4_error_barBaz,
LZ4_error_qux
} LZ4_error_t;
/* This could be implemented as a static inline function */
LZ4_error_t LZ4_errorCode (int retval);
const char* LZ4_errorToString (LZ4_error_t error); I'm not clear on the convention for when to use underscores, camelCase, or StudlyCaps, but that's easy to correct if you can point me in the right direction. Having the errorCode function/macro return 0 for success and positive values for error codes makes it easy to use the result in an if statement, which could be handy (it can be used like ERR_isError in zstd). The errorToString function could be implemented similarly to squash_status_to_string. Or it could just be left out altogether if human-readable error strings aren't worth the storage space. |
I guess you refer to I know that So, the only "stable" error API prototypes right now are :
What's missing here is the enum list. I'll have to find something to fix that, and provide an enum list accessible through |
Ah, didn't realize it was supposed to be internal. So the error codes aren't exposed in zstd, either…
Why not just provide an enum and get rid of all the macros? What good are they doing? The code is much more complicated, and the only benefit I can think of is that you can automatically stringify them to generate the result of Keeping the list of errors in sync shouldn't be an issue; you have CI set up (off-topic: https://github.com/nemequ/icc-travis if you want to test with icc, too), so with the appropriate options ( |
There are 2 main benefits :
Before Now it's just : On top of, imho, better readability, the most important gain is the automatic
This one is also fairly easy to get wrong when a new error code is inserted. With these macros in place, it's enough to add, remove or modify an error enum, and everything falls into place automatically. I see you have selected a That being said, the more important reason I avoided it is that I wasn't aware of the The benefit you get is that your error strings are not confined to being an image of the enum : they can be more descriptive. Sidenote : I'm curious to know if the compiler can be clever enough to optimize the Edit : just made a test using godbolt, and it seems both methods get reduced to pretty much the same assembler once factored in the "not an enum" case
Do you mean having access to the enum is not useful ? only the numeric value is enough ? |
Makes sense.
I suppose that's subjective, but I have trouble believing anyone finds #define ERROR_LIST(ITEM) \
ITEM(PREFIX(No_Error)) ITEM(PREFIX(GENERIC)) \
ITEM(PREFIX(prefix_unknown)) ITEM(PREFIX(frameParameter_unsupported)) ITEM(PREFIX(frameParameter_unsupportedBy32bitsImplementation)) \
ITEM(PREFIX(init_missing)) ITEM(PREFIX(memory_allocation)) \
ITEM(PREFIX(dstSize_tooSmall)) ITEM(PREFIX(srcSize_wrong)) \
ITEM(PREFIX(corruption_detected)) \
ITEM(PREFIX(tableLog_tooLarge)) ITEM(PREFIX(maxSymbolValue_tooLarge)) ITEM(PREFIX(maxSymbolValue_tooSmall)) \
ITEM(PREFIX(maxCode)) More readable than enum {
ZSTD_error_No_Error,
ZSTD_error_GENERIC,
ZSTD_error_prefix_unknown,
ZSTD_error_frameParameter_unsupported,
ZSTD_error_frameParameter_unsupportedBy32bitsImplementation,
ZSTD_error_init_missing,
ZSTD_error_memory_allocation,
ZSTD_error_dstSize_tooSmall,
ZSTD_error_srcSize_wrong,
ZSTD_error_corruption_detected,
ZSTD_error_tableLog_tooLarge,
ZSTD_error_maxSymbolValue_tooLarge,
ZSTD_error_maxSymbolValue_tooSmall,
ZSTD_error_maxCode
}
Yeah, compilers are good at optimizing switches, especially if there aren't lots of missing values.
No, having access to the enum is definitely nice. What I meant is that having |
These are good points. |
FWIW, I think what I posted earlier works pretty well as an enhancement to what is already in place, at least from the public side. The only public difference would be exposing the enum and a function to convert a return value to a value of that enum, plus a function to retrieve a string descriptor of the error. AFAICT the only value that should change would be the return value of Anyways, take your time thinking about it; when you settle on something just let me know and I'll try to put together a patch to implement it. |
Lurker, here!
For this use case, I'd argue it'd be much better to get a "null" decompressor that doesn't actually write anything to a buffer, but calculates the size of the output so you can allocate that space then run the real decompression. This is, in fact, the exact way many custom asnprintf implementations do it when the C library doesn't provide one. So basically what'd you'd have is:
The end result would save memory (since it doesn't guess) and may perform much better since it doesn't run as many expensive calls to |
I'm not so sure. It would still have to decompress the buffer; saving the actual write would be a pretty minimal gain. Also (Yann, please correct me if I'm wrong here), I think LZ4 would still need a buffer to write to since it needs to be able to look up data at a given offset, so it would have to create a buffer of at least the maximum match distance. In general I think it's much better to allocate a buffer which will likely be large enough to hold the result and only have to decompress once. The only real exception is if you're on a 32-bit platform where the problem is address space and you're working with large pieces of data.
Whenever I've had to do this I always just put a small buffer on the stack; maybe 256 or 512 bytes, and if it succeeds just memcpy the result over to a new buffer of the right size. That provides a decent optimization for the common case (the vast majority of printf calls are for pretty small pieces of data, and memcpy is a lot faster than having to process all the data again). The exception, of course, is Windows where vsnprintf doesn't actually follow the C99 standard and you have to call _vscprintf.
If you still disagree we should talk about it in Squash's issue tracker, since it's pretty off-topic for here. |
@nemequ: When it comes to performance I think debating whether or not it'll work isn't as useful as just running a benchmark, but I see your points nonetheless. :) In the case of lz4 having to create buffers anyway, then I'd definitely see the calculate-buffer-size option as inefficient. If I continued on with this subject, it'd be in a ticket here, though, since it suggests adding a new API to LZ4 (in particular a calculate-buffer-size procedure). But at minimum it's definitely off-topic for this ticket, so I'll shut up about it here. ;P |
I think "If the source stream is detected malformed" should return a minimum int:
|
LZ4_decompress_safe
returnsI would like to be able to distinguish between the different errors. At least I would like to know if the input was malformed, incomplete, or the output buffer was just too small.
AFAICT the error code isn't really documented anywhere, only that <0 is an error. Maybe you could just return -1, -2, or -3 (or whatever) depending on the type of error…?
The text was updated successfully, but these errors were encountered: