
Minimizing memory requirements for Decompression? #2093

Closed
dciliske opened this issue Apr 27, 2020 · 12 comments · Fixed by #2094

@dciliske

Main questions:

  • Does decompression require referencing previously used input data?
  • If it does not reference previous input data, is a working buffer (other than the destination) required when writing to a fully allocated output buffer, aka non-streaming mode?
  • If no additional working buffer is required, is there some way to perform a decompression using an input stream with a fixed output block?

Context:
I'm working on a new-generation bootloader for an embedded device and would like to use zstd for image compression. Going through all the documentation and API, I feel like I'm in a bit of a hole. For context, I have two main blocks of memory I can use: main memory (2-32 MB) and internal SRAM (32-512 KB).

The image is read over a non-memory mapped serial stream. The destination, however, is a fixed block buffer.

What is unclear is whether the decompressor needs to reuse previously read input data. If it does not, then it should be possible to stream decompress without the intermediate window buffer.

I think I can solve my issue by doing an in-place decompression akin to how it is done in the Linux kernel. However, there is a non-negligible performance penalty for this, as it potentially doubles the bandwidth used for main memory and introduces yet another copy-then-move operation on startup.

@terrelln
Contributor

Does decompression require referencing previously used input data?

The decompressor needs to buffer 1 block of input data at a time (128KB). Once it decompresses a block, it doesn't need to reference it again.

If it does not reference previous input data, is a working buffer (other than the destination) required when writing to a fully allocated output buffer, aka non-streaming mode?

The standard streaming API doesn't currently support writing directly into the output buffer, but you can use the bufferless API (see below). However, I think we could safely support a simplified scenario where the input needs to be buffered & streamed, but the output buffer is fixed. I will put up a PR.

If no additional working buffer is required, is there some way to perform a decompression using an input stream with a fixed output block?

You can use the bufferless streaming API to support this use case. In that case the decompressor only needs to allocate the ZSTD_DCtx, which is ~160 KB. It isn't a very easy API to use, but it will let you minimize your memory usage.

@terrelln
Contributor

terrelln commented Apr 28, 2020

I have a PR up that handles the simple case where you want buffered streaming for the input but can guarantee that the output buffer is large enough for the entire output and never changes.

That will allow you to use the high level streaming API instead of the buffer-less streaming API, which is much simpler.

@terrelln terrelln self-assigned this Apr 28, 2020
@dciliske
Author

dciliske commented Apr 28, 2020

The current PR addresses the immediate questions I was having, but places the overall memory requirement at the very high end of workable at 290KB.

As for the bufferless streaming API: I'll be honest, I have zero idea whether or not it solves this better.

It is not clear at all whether the 'back-reference distance' is indexing into the decompressed output or the raw compressed input. If it were referencing into the decompressed output, why does there need to be an input buffer allocated, other than block efficiency? If it's referencing the raw, compressed input, how can we only buffer one block and not the full WindowSize? The documentation states that:

ZSTD_decompressContinue() needs previous data blocks during decompression, up to windowSize bytes

This makes it look like it's wanting prior input, unless it's actually referring to the resulting decompressed data encoded by the previous block.

Much of the documentation for the bufferless API is ambiguous as to whether the item being referenced is in a compressed or uncompressed state.

[Edit] To be clear, thank you for implementing what you have done. This provides a concrete path forward in the current setup, even if it does not solve it more generally.

@terrelln
Contributor

terrelln commented Apr 28, 2020

@dciliske the buffer-less streaming API can shave another ~128KB off of that budget if you don’t need to buffer the input blocks. Since zstd requires reading 1 block at a time, we allocate a buffer for this block. If you have external buffering, the buffer-less API can be used to avoid that allocation, which leaves you with only the ~160KB ZSTD_DCtx.

It is not clear at all whether the 'back-reference distance' is indexing into the decompressed output or the raw compressed input.

It is referencing the decompressed output.

This makes it look like it's wanting prior input, unless it's actually referring to the resulting decompressed data encoded by the previous block.

It’s referring to the last windowSize bytes of decompressed data.

The buffer-less API is certainly not well documented. It is rarely used externally, though it is what we build the standard streaming API with. So no one has spent too much time improving the docs over the years. I will attempt to answer your questions tomorrow morning when I’m at my keyboard, and put up a PR to improve the docs. So if you have any more questions ask away.

A zstd decoder could be implemented using less memory, but we haven’t optimized for extremely low memory usage. For instance we have a 128KB buffer for literals to speed up decompression, but the literals could be streamed in an alternate implementation to save that 128KB at the cost of speed. That would get us down to ~32KB of memory, and that could be reduced slightly further. But speed would suffer significantly.

@dciliske
Author

What does "zstd requires to read 1 block at a time" mean? That the decoder requires a block to be fully available before actually decoding it? That it cannot parse it incrementally? Does the decoder need random access into the block it is decoding, or is it always sequential and forward?

I ask because, building with DEBUGLEVEL=5, I see that I have a block in a sample that is 53KB long. By my understanding, you are saying that while the library itself wouldn't be required to buffer that block, it does need to be buffered somewhere.

If blocks must be decoded as contiguous segments, is there a way to reduce the generated block size during compression? Does a reduced block size lead to less compression than a larger block size?

@terrelln
Contributor

What does "zstd requires to read 1 block at a time" mean?

The decoder requires a full block to decompress the data. By default, blocks are up to the maximum block size of 128 KB.

The zstd format breaks the block up into two segments: the literals and the sequences. Each of those segments is read backwards. The zstd decoder implements this by requiring the whole block to be contiguous in memory.

Is there a way to reduce the generated block size during compression? Does a reduced block size lead to less compression than a larger block size?

You can use ZSTD_c_targetCBlockSize to target smaller blocks. You tell it how large you want your compressed blocks to be, and it will aim for about that size. It doesn't make strong guarantees on the block size, but it should get pretty close to the target. The cost is about 3 bytes per block, so it is pretty small.

@dciliske
Author

dciliske commented Apr 28, 2020

Unfortunately, targetCBlockSize seems to completely choke the compressor to the point of being almost non-functional (see #2096).

Playing around with adjusting ZSTD_BLOCKSIZE_MAX (via ZSTD_BLOCKSIZELOG_MAX), it looks like I can accomplish the same net effect, with the caveat that it requires the compressor to adhere to the same build configuration. Other than being non-compliant with the standard, is there a downside to reducing ZSTD_BLOCKSIZE_MAX?

@terrelln
Contributor

Unfortunately, targetCBlockSize seems to completely choke the compressor to the point of being almost non-functional (see #2096).

Thanks for the bug report! We'll have to improve that for the next release.

Playing around with adjusting ZSTD_BLOCKSIZE_MAX (via ZSTD_BLOCKSIZELOG_MAX), it looks like I can accomplish the same net effect, with the caveat that it requires the compressor to adhere to the same build configuration. Other than being non-compliant with the standard, is there a downside to reducing ZSTD_BLOCKSIZE_MAX?

That will work. The upside is that it will shrink decompression memory usage in proportion to the reduction in ZSTD_BLOCKSIZE_MAX. The downside is that your compressor must be built with the same value for ZSTD_BLOCKSIZE_MAX. Additionally, each block contains a header with entropy tables that can be ~100 bytes, so if you shrink blocks too far that overhead starts to become significant.

@dciliske
Author

Gotcha. Gotta say, this is an amazing library y'all have built. I've previously used LZ4 and have been smashing my head against the tradeoff between high decompression performance (LZ4) and high compression ratio with moderate performance (LZMA), trying to get better boot times while still fitting the application. Zstd manages to hit all the checkboxes.

Also, as an FYI (not asking for changes): while ZSTD_malloc/ZSTD_free allow for custom allocator overrides at run time, they don't quite solve the issue for minimal embedded systems or pre-boot environments, as the executable still has to link malloc and free. The way it's architected, however, does allow for fairly trivial splicing to remove that dependency.

The reason the linking of malloc and free is relevant is that they in turn cause a massive cascade of stdlib dependencies to be linked as well, much of which is probably not implemented.

I realize that this is primarily a compression library built by the necessities of moving whole datacenters around, but as a general-purpose compression format that intends to become the de facto standard for the next decade or more, it's something to at least have heard of.

@terrelln
Contributor

The reason the linking of malloc and free is relevant is that they in turn cause a massive cascade of stdlib dependencies to be linked as well, much of which is probably not implemented.

I realize that this is primarily a compression library built by the necessities of moving whole datacenters around, but as a general-purpose compression format that intends to become the de facto standard for the next decade or more, it's something to at least have heard of.

At some point we need to update the version of zstd in the kernel. At that point in time I want to spend some time to separate all of our stdlib dependencies into one place that can be replaced to make zstd easier to compile in freestanding mode. So it is definitely on the todo list.

@dgrunwald

I am also having trouble with the memory consumption during streaming decompression.
Not in an embedded use case, but in a case where we need to read many compressed files at the same time (doing a kind of multi-way mergesort). We wanted fast compression+decompression; after doing some benchmarks we decided on compression_level=2. Both compression and decompression currently use the streaming API (as in examples/streaming_compression.c; some of our files are too big to hold in memory).

I found that ZSTD_sizeof_DCtx reports 1471096 bytes for me. In addition to that, we've been using the recommended sizes for our own buffers (ZSTD_DStreamOutSize + ZSTD_DStreamInSize), which is an additional 262147 bytes.
So that's about 1700 KB per input file. We sometimes have more than a thousand input files, so this adds up to significant memory consumption.
What can we do to reduce the memory usage during decompression?
Are there any examples for the use of the bufferless streaming API?

@Cyan4973
Contributor

Cyan4973 commented Sep 5, 2020

One of the more important budgets for streaming decompression is the history buffer, which is a direct function of windowLog.

Level 2, by default, will target a 1 MB history buffer.

In order to reduce memory usage during streaming, it's possible to set the history size to some smaller value, like 512 KB or even 256 KB. This must be done at compression time, using the ZSTD_c_windowLog advanced parameter. This reduced memory budget generally comes at a loss of compression ratio, although, depending on the content to compress, the ratio decrease may in fact not be that large, and hence worth the trade-off.

It's not uncommon for applications managing thousands of streams in-flight to reduce history window size, for the purpose of keeping memory usage under some manageable level.
