
python-zstandard produces incomplete data after decompress #196

Open
grossag opened this issue May 2, 2023 · 17 comments

grossag commented May 2, 2023

I originally filed this against urllib3 in urllib3/urllib3#3008, but the investigation so far shows that urllib3 is passing the correct data to python-zstandard.

Environment info

OS Windows-10-10.0.22621-SP0
Python 3.10.11
urllib3 2.0.1

pip freeze results:

appdirs==1.4.4
certifi==2022.9.24
charset-normalizer==2.1.1
colorama==0.4.6
flake8==5.0.4
idna==3.4
mccabe==0.7.0
numpy==1.23.4
pycodestyle==2.9.1
pydiffx==1.1
pyflakes==2.5.0
python-gitlab==3.12.0
RBTools==4.0
requests==2.28.1
requests-toolbelt==0.10.1
six==1.16.0
texttable==1.6.7
tqdm==4.64.1
typing_extensions==4.4.0
urllib3==2.0.1
zstandard==0.21.0

Sample code

Example source data to decompress

I have attached a zip of text.txt.zstd because GitHub doesn't support attaching .zstd files. The file text.txt.zstd contains the raw data passed to ZstdDecoder.decompress(). Extract text.txt.zstd from the zip, then run:

import zstandard as zstd

print('Using zstandard backend: %s' % zstd.backend)

with open(r'C:\utils\temp\text.txt.zstd', 'rb') as f:
    data = f.read()

print('decompressed %d into %d bytes' %
      (len(data), len(zstd.decompress(data))))

My output is:

Using zstandard backend: cext
decompressed 2554 into 1048576 bytes

Whereas downloading zstd 1.5.5 and running zstd.exe text.txt.zstd -d -o hi.txt produces 8724469 bytes and a valid text file.

indygreg (Owner) commented May 2, 2023

I need to verify this with the supplied sample data, but my knee-jerk reaction is that it has something to do with zstd frames. Various APIs in this library stop reading at frame boundaries by default. In hindsight this was a bad default, and the release notes have called out a desire to change it so that multiple frames are read transparently.

Or you found a legitimate issue where we fail to handle a special case where the number of bytes read falls directly on a power-of-2 boundary. 1048576 bytes decompressed is very suspicious!
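
One quick way to test the frame theory against the sample data: zstandard.get_frame_parameters() parses only the header of the frame at the start of a buffer, so it reports the declared content size of the first frame. A sketch (file path taken from the report above):

import zstandard as zstd

with open(r'C:\utils\temp\text.txt.zstd', 'rb') as f:
    data = f.read()

# Parses only the first frame's header. If the frame-boundary theory is
# right, this should report exactly 1048576.
params = zstd.get_frame_parameters(data)
print('first frame declared content size: %d' % params.content_size)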

sethmlarson commented May 2, 2023

Thanks for taking a look at this @indygreg. I've posted a $300 bounty on this upstream issue, since it affects data integrity for urllib3 users. If you find the root cause and publish a fix in a release, you can submit a claim for that bounty. Thanks again!

grossag (Author) commented May 3, 2023

My current theory is that python-zstandard's DecompressionObj_decompress() function stops after the first frame. Here is sample C++ code that seems to match what that function is doing:

   #include <algorithm>
   #include <fstream>
   #include <iterator>
   #include <vector>

   #include <zstd.h>

   int main() {
      // File read code copied from stackoverflow. Note std::ios::binary:
      // without it, CRLF translation on Windows corrupts the compressed bytes.
      std::ifstream my_file("C:\\utils\\temp\\text.txt.zstd", std::ios::binary);
      my_file.seekg(0, std::ios_base::end); // Seek to end of file.
      const unsigned int file_length = my_file.tellg();
      my_file.seekg(0);
      std::vector<char> file_data(file_length);
      my_file.read(&file_data[0], file_length);

      bool error = false;

      ZSTD_DStream* zds = ZSTD_createDStream();
      if (zds) {
         ZSTD_initDStream(zds);

         ZSTD_inBuffer input = {&file_data[0], file_length, 0};
         size_t outSize = ZSTD_DStreamOutSize();
         std::vector<char> out_data(outSize);
         std::vector<char> full_decompressed;

         bool done = false;
         while (!done) {
            // Unused, but for this file it returns 1048576, the declared
            // content size of the first frame.
            auto fcs = ZSTD_getFrameContentSize(input.src, input.size);
            ZSTD_outBuffer output = {&out_data[0], outSize, 0};
            auto result = ZSTD_decompressStream(zds, &output, &input);
            if (ZSTD_isError(result)) {
               done = true;
               error = true;
            } else {
               std::copy(out_data.begin(), out_data.begin() + output.pos,
                         std::back_inserter(full_decompressed));
               if (result == 0) {
                  // result == 0 means one frame is fully decoded and flushed.
                  // Stopping here, instead of looping again while
                  // input.pos < input.size, is what truncates multi-frame
                  // input to the first frame.
                  done = true;
               }
            }
         }

         ZSTD_freeDStream(zds);
      }
      return error ? 1 : 0;
   }

This results in a buffer of 1048576 bytes because the loop stops as soon as ZSTD_decompressStream() finishes decoding the first frame. I'm still reading up on the zstd APIs to understand what it takes to move on to the next frame.
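
The truncation also reproduces in pure Python without the attachment, which supports the multi-frame theory. A minimal sketch (the payloads are arbitrary; the behavior matches the output I reported above):

import zstandard as zstd

cctx = zstd.ZstdCompressor()
# Two independently compressed frames, concatenated. This is a perfectly
# valid zstd stream.
payload = cctx.compress(b'frame one, ') + cctx.compress(b'frame two')

# The one-shot API stops at the first frame boundary and silently drops
# the rest: this prints b'frame one, '.
print(zstd.decompress(payload))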

grossag (Author) commented May 4, 2023

> I need to verify this with the supplied sample data, but my knee-jerk reaction is that it has something to do with zstd frames. Various APIs in this library stop reading at frame boundaries by default. In hindsight this was a bad default, and the release notes have called out a desire to change it so that multiple frames are read transparently.
>
> Or you found a legitimate issue where we fail to handle a special case where the number of bytes read falls directly on a power-of-2 boundary. 1048576 bytes decompressed is very suspicious!

Yeah, this seems to be the correct explanation. The sample data has multiple frames and python-zstandard stops after the first frame. I have two main follow-up questions:

  1. Is there currently a way to use python-zstandard to decode multiple frames, even if that means one call to decompress() per frame? I can't find any API that tells me where the current frame ended, so I have no way to adjust the source bytestring appropriately for another call into decompress().
  2. Is there any timeframe for when read_across_frames=True will be supported?

grossag (Author) commented May 4, 2023

Ok so reading through documentation and other older bug reports, it appears that I should be using ZstdCompressor().stream_reader() to get support for reading across frames. I will talk to the urllib3 maintainers about switching to that API.

grossag (Author) commented May 4, 2023

> Ok so reading through documentation and other older bug reports, it appears that I should be using ZstdCompressor().stream_reader() to get support for reading across frames. I will talk to the urllib3 maintainers about switching to that API.

Sorry, one typo. I meant to say that I should be using ZstdDecompressor().stream_reader().

indygreg (Owner) commented May 6, 2023

For the ZstdDecompressionObj.decompress() API, I consider the lack of reading across frames a bug. These decompress() APIs are supposed to read as much as possible in a single operation. Its current behavior of stopping at frame boundaries violates that implicit contract.

I plan to change the API to decompress across frames by default. As this is technically an API breaking change, it will require a new 0.x version instead of a 0.x.y point release.

@indygreg indygreg added the bug label May 6, 2023
indygreg (Owner) commented May 6, 2023

Actually, reading the docs for ZstdDecompressionObj, we explicitly call out the single-frame reading behavior:

> Each instance is single use: once an input frame is decoded, decompress() can no longer be called.

As the existing behavior was explicitly documented, and following prior conventions of this library, I'm not comfortable making a breaking change in the next 0.x release. The best we can do is add an off-by-default flag allowing ZstdDecompressionObj instances to read across frames. Then, two releases from now, we can make it the default.

I think the overhead of calling ZstdDecompressor.decompressobj() should be minimal, although there may be a non-negligible cost to calling ZSTD_DCtx_refDDict().

Would urllib3 be comfortable adapting its usage to instantiate multiple ZstdDecompressionObj instances? i.e. how important is it to ship a new zstandard version with support for reading across frames?
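
To make that suggestion concrete, here is a sketch of the consumer side (decompress_frames and frames are hypothetical; frames stands for single-frame payloads that have already been split apart somehow):

import zstandard as zstd

dctx = zstd.ZstdDecompressor()  # shared context, reused across frames

def decompress_frames(frames):
    # Each ZstdDecompressionObj is single use, so create a fresh one per
    # frame from the shared ZstdDecompressor.
    return b''.join(dctx.decompressobj().decompress(frame) for frame in frames)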

pquentin commented May 7, 2023

We would be very comfortable adapting our usage.

By the way, this is ironic: urllib3 2.0 broke users who had not read the 1.x documentation, so we know that nobody reads the documentation, especially the documentation for ZstdDecompressionObj, which is supposed to work like the standard library's. So you likely have other users that are broken without realizing it. But this is your problem to solve, and again, we would be very comfortable adapting our usage.

pquentin commented May 8, 2023

@grossag Do you have everything you need to fix this bug on the urllib3 side?

grossag (Author) commented May 8, 2023

> Actually, reading the docs for ZstdDecompressionObj, we explicitly call out the single-frame reading behavior:
>
> > Each instance is single use: once an input frame is decoded, decompress() can no longer be called.
>
> As the existing behavior was explicitly documented, and following prior conventions of this library, I'm not comfortable making a breaking change in the next 0.x release. The best we can do is add an off-by-default flag allowing ZstdDecompressionObj instances to read across frames. Then, two releases from now, we can make it the default.
>
> I think the overhead of calling ZstdDecompressor.decompressobj() should be minimal, although there may be a non-negligible cost to calling ZSTD_DCtx_refDDict().
>
> Would urllib3 be comfortable adapting its usage to instantiate multiple ZstdDecompressionObj instances? i.e. how important is it to ship a new zstandard version with support for reading across frames?

Thanks for the response! I have been reading through the zstd and python-zstandard code for a couple of days now and have some questions about the usefulness of APIs that only read a single frame.

Overall, I don't understand the use of any API that only supports decoding a single frame, requires the caller to tell it how much to read, and doesn't tell you how much it read. A single frame by itself is not useful unless I can call the same API again to read the next frame. So I think there is a really strong argument for the one-shot API reading across frames, because otherwise I can't see any use for it: it doesn't tell you how much input it consumed, so I can't advance through my source bytearray and call the API again to read the next frame.

I would understand if there were a version of ZstdDecompressor.stream_reader() that used the zstd APIs to query the frame size and read only that much. Such a function would consume just enough of the iterable to read one frame, and could be called repeatedly until the iterable, and thus the data, is exhausted. But instead, ZstdDecompressor.stream_reader() requires the caller to pass a read size, when the caller doesn't have enough information to know how much source data is needed to construct a frame, short of parsing the zstd frame header itself. And if the caller doesn't pass a read size, it defaults to something unrelated to the actual frame size, potentially consuming more data than is needed for that frame.

grossag (Author) commented May 8, 2023

> @grossag Do you have everything you need to fix this bug on the urllib3 side?

It appears that I can use ZstdDecompressionReader to get this working in urllib3. I don't have all the tests passing yet, but I am close.

If ZstdDecompressionObj supported reading across frames, I could call ZstdDecompressionObj.decompress() each time urllib3 receives a chunk of data and then use the unused_data attribute when future chunks arrive. But as it stands, I don't think urllib3 can use this object, because it doesn't support reading across frames.

The only object I can find that supports reading across frames is ZstdDecompressionReader. I think I can use it to decompress multiple frames if I push the response data through an intermediate io.BytesIO() and have zstd read from that.
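
A sketch of that approach (same sample file as above; read_across_frames=True on stream_reader() is what lets it continue past frame boundaries):

import io
import zstandard as zstd

with open(r'C:\utils\temp\text.txt.zstd', 'rb') as f:
    buffer = io.BytesIO(f.read())

dctx = zstd.ZstdDecompressor()
# ZstdDecompressionReader pulls compressed bytes from the file-like
# object and keeps decoding across frame boundaries.
with dctx.stream_reader(buffer, read_across_frames=True) as reader:
    decompressed = reader.read()

print('decompressed %d bytes' % len(decompressed))  # expecting 8724469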

indygreg (Owner) commented

> Overall, I don't understand the use of any API that only supports decoding a single frame, requires the caller to tell it how much to read, and doesn't tell you how much it read.

Part of this is due to bad design in this library. In my defense, when I started this library in 2016, zstandard was relatively young and you rarely saw zstd payloads with multiple frames. In the years since, zstandard has skyrocketed in popularity and built out new features, such as adaptive compression. Its APIs have also evolved, making writing multiple frames a bit simpler.

Part of this is due to bad design of the (de)compression APIs in the Python standard library. The only reason ZstdCompressionObj and ZstdDecompressionObj exist is to be drop-in alternatives to the zlib, xz, etc. modules in the Python stdlib. The stdlib decompress() is especially poorly designed because it resizes the output buffer to fit the data, which is a massive performance footgun.

The io.RawIOBase interface (Zstd[De]CompressionReader + Zstd[De]CompressionWriter), by contrast, is much more reasonable: it allows the caller to control sizes explicitly and to control/account for I/O + CPU [back]pressure. I really, really, really wish I could make the stdlib decompression APIs go away because they are just that bad.

Anyway, you aren't the only ones complaining about the lack of multi-frame support. It is clear we need to evolve this library to support what has become a common requirement in 2023. The goalposts moved and we need to adapt. This will be the main issue I focus on the next time I sit down to hack on this project.

indygreg added a commit that referenced this issue May 15, 2023
All APIs need to gain the ability to transparently read across
frames. This commit teaches the stdlib decompressionobj interface to
transparently read across frames.

Behavior is controlled via a boolean named argument. It defaults to
False to preserve existing behavior.

Related to #196.

I haven't tested this very thoroughly. But the added fuzzing test
not failing is usually a good indicator that this works as intended.
Rogdham commented May 15, 2023

FYI, in urllib3 a PR has been merged that uses ZstdDecompressionObj.unused_data to decompress each frame separately, one after the other.
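
The pattern, roughly (a sketch of the idea, not the exact urllib3 code):

import zstandard as zstd

def decompress_all_frames(data):
    dctx = zstd.ZstdDecompressor()
    obj = dctx.decompressobj()
    parts = [obj.decompress(data)]
    # Once a frame is fully decoded, eof becomes True and unused_data
    # holds whatever bytes followed that frame; feed them to a fresh
    # decompressobj to decode the next frame.
    while obj.eof and obj.unused_data:
        data = obj.unused_data
        obj = dctx.decompressobj()
        parts.append(obj.decompress(data))
    return b''.join(parts)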

Moving forward, I personally believe that implementing read_across_frames in all methods should do the trick for python-zstandard, but you may also want to peek at what pyzstd is doing under the hood in case they had good ideas there.

grossag (Author) commented May 15, 2023

Thanks for the response @indygreg! As @Rogdham mentioned, he was able to put together a solution that works well within the existing urllib3 design. I had put together urllib3/urllib3#3021, but ZstdDecompressionReader wasn't a great fit for urllib3: urllib3 receives the compressed data in chunks, while I needed to construct the ZstdDecompressionReader around an io.RawIOBase object. I worked around that by creating an intermediate io.BytesIO buffer, but that cost an extra copy that @Rogdham's change avoids. As a result, I thought his change was better for urllib3.

So anyway, it seems there are still several ways to use your APIs to decode data with multiple frames. unused_data was the key to making this work in urllib3.

It would be great for the one-shot API to support multiple frames, but in the meantime urllib3 has a solution to its issue. Thanks for your support!

indygreg added a commit that referenced this issue May 16, 2023
All APIs need to gain the ability to transparently read across
frames. This commit teaches the stdlib decompressionobj interface to
transparently read across frames.

Behavior is controlled via a boolean named argument. It defaults to
False to preserve existing behavior.

Related to #196.

I haven't tested this very thoroughly. But the added fuzzing test
not failing is usually a good indicator that this works as intended.
indygreg (Owner) commented

I added support for read_across_frames=True to ZstdDecompressionObj in the main branch. If you have time to test and validate with urllib3, any feedback would be appreciated.

grossag (Author) commented Sep 8, 2023

> I added support for read_across_frames=True to ZstdDecompressionObj in the main branch. If you have time to test and validate with urllib3, any feedback would be appreciated.

I'm really sorry it took me so long to get to this. I tested it today by cloning urllib3 and python-zstandard, reverting urllib3/urllib3@aca0f01, and adding read_across_frames=True to the decompressobj() call.

Overall I didn't get any truncated data, but urllib3 broke in a small way, and I have a question for you about the expected behavior.

Here is the test urllib3 code:

if zstd is not None:

    class ZstdDecoder(ContentDecoder):
        def __init__(self) -> None:
            self._obj = zstd.ZstdDecompressor().decompressobj(read_across_frames=True)

        def decompress(self, data: bytes) -> bytes:
            if not data:
                return b""
            return self._obj.decompress(data)  # type: ignore[no-any-return]

        def flush(self) -> bytes:
            ret = self._obj.flush()  # note: this is a no-op
            if not self._obj.eof:
                raise DecodeError("Zstandard data is incomplete")
            return ret  # type: ignore[no-any-return]

In my tests, flush() raises an exception because self._obj.eof is always False. Is this expected with the new read_across_frames=True behavior?

I confirmed that I had no truncation and everything worked correctly if I commented out the two lines in flush() that check eof and raise a DecodeError. So overall it looks good.
