Ensure buffers are flushed to avoid decompression stream underread #251

markbt · 2023-10-19T08:43:38Z

When reading from a decompressor that was constructed using Decoder::with_buffer, the decoder may not consume all of the input by the time it returns all of the decompressed output. This means when you call .finish() on the decoder to get the underlying stream back, it is not pointing after the end of the compressed data.

This is expected from the way that the decoder makes use of ZSTD_decompressStream. From the zstd documentation for ZSTD_decompressStream:

But if output.pos == output.size, there might be some data left within internal buffers.
In which case, call ZSTD_decompressStream() again to flush whatever remains in the buffer.

At the point the decoder stream has yielded all of the decompressed data, we have not done this additional call. It is only necessary if the caller wants the underlying stream back, so at that point we can force an additional call to ZSTD_decompressStream by reading to a zero-length buffer.

This PR adds a test that demonstrates the issue, then adds the flush to prevent it.

When reading from a decompressor that was constructed using `Decoder::with_buffer`, the decoder may not consume all of the input by the time it returns all of the decompressed output. This means when you call `.finish()` on the decoder to get the underlying stream back, it is not pointing after the end of the compressed data. This commit adds a test that demonstrates the issue.

gyscos · 2023-10-21T14:54:23Z

Hi, and thanks for the PR!

The comment from the zstd lib refers to internal buffers that need to be flushed: when no more input is required, but more output is available.
The situation here is the reverse: all the output has been received already, but some of the input (the frame footer) has not been consumed yet.

I'm not against calling zstd again to make sure we fed it the entire frame, but I have a few concerns with the approach here:

into_inner should be a thin function to just get the inner object in it's current condition. It's mostly aimed for advanced users who know what they're doing, for example for testing or toying with the inner data. I would like to minimize any extra logic we do there. Decoder::finish() might be a better place for it.
If the frame was actually finished already, trying to read again might trigger a read from the underlying stream (could be a tcp socket), potentially blocking until more data comes in - which could be forever. Any method to make sure we consumed all input should probably make sure we don't trigger an extra read here.
Unless you read the decoded stream to the end, it's essentially a partial read, and there is indeed no guarantee regarding how much data was consumed. It looks like "consume the rest of the frame" could be an explicit, separate action users could call. One way to do that today could be to make the Decoded single-frame (to make sure it won't try to start a new frame if the current one is already complete), then call read_to_end() or something similar.

What do you think?

After the decoder stream has yielded all of the uncompressed data, it is possible for the input stream to still not be fully consumed. This means if we extract the inner stream at this point, it will not be pointing to the end of the compressed data. From the [zstd documentation](https://facebook.github.io/zstd/zstd_manual.html#Chapter9) for `ZSTD_decompressStream`: > But if `output.pos == output.size`, there might be some data left within internal buffers. > In which case, call ZSTD_decompressStream() again to flush whatever remains in the buffer. This is only necessary if the caller wants the stream back, so at that point we can force an additional call to `ZSTD_decompressStream` by reading to a zero-length buffer.

markbt · 2023-10-24T19:57:16Z

Moving it to Decoder::finish seems reasonable. I have updated the PR with that change.

I do think this comment in the ZSTD documentation is relevant here. Specifically, in the test I added, while the compressed data is 21 bytes long, at the point where all data is yielded (i.e. output.pos == output.size), we still have data in our input buffer (input.pos = 17, input.size = 21). The documentation isn't exactly clear, but I think "some data left within internal buffers" also means that some of the data from the input stream isn't consumed either. The key part of the additional read is to read to a zero-length buffer, which means we call into ZSTD_decodeStream with output.pos == output.size, which is what the documentation requests, and means aside from consuming input for already-yielded data, zstd will not try to read any more data as it will have nowhere to decompress it to.

This is the only way to successfully read up to the end of a compressed object in a stream with non-compressed data following it. If you continue to read past the end of the decompressed data, the stream doesn't end, but instead it attempts to decompress the non-compressed data, and panics ("Unknown frame descriptor"). For Decoder::finish to be useful at all, you need to know when the decompressed stream ends exactly and stop reading there.

markbt · 2023-12-28T20:34:02Z

Hi @gyscos. Do you have any thoughts on my updated PR? Hopefully I've addressed your concerns, but if not, please let me know what else you'd like to see addressed.

markbt force-pushed the decompress_underread branch from 299cd5b to f2b80a6 Compare October 19, 2023 08:45

markbt force-pushed the decompress_underread branch from f2b80a6 to ea02a36 Compare October 24, 2023 19:44

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Ensure buffers are flushed to avoid decompression stream underread #251

Ensure buffers are flushed to avoid decompression stream underread #251

markbt commented Oct 19, 2023

gyscos commented Oct 21, 2023

markbt commented Oct 24, 2023

markbt commented Dec 28, 2023

Ensure buffers are flushed to avoid decompression stream underread #251

Are you sure you want to change the base?

Ensure buffers are flushed to avoid decompression stream underread #251

Conversation

markbt commented Oct 19, 2023

gyscos commented Oct 21, 2023

markbt commented Oct 24, 2023

markbt commented Dec 28, 2023