Expose or document a constant that describes a best-case minimal compression length #1166

Open
tristan957 opened this issue Sep 14, 2022 · 4 comments

@tristan957
Contributor

Is your feature request related to a problem? Please describe.
In our storage engine work, we have to determine whether compression will save any space on-media.

Describe the solution you'd like
A constant or documentation that describes a best-case minimal compression length which actually reduces the length of the given data.

@Cyan4973
Member

The question is not clear, or could be interpreted in multiple ways.
Best-case minimal compression length is ~srcSize / 250, as documented in the LZ4 Block format.
But this kind of performance is only reachable for extremely simple cases, hence generally not reached.
It's also completely unrelated to whether the data is compressible or not.
It's unclear to me why this information would be useful.

@tristan957
Contributor Author

tristan957 commented Sep 20, 2022

In our storage engine, we provide the capability of compressing values.

Essentially we have some logic that looks like this:

if (wants_compression && len(value) >= XXX) {
  compressed_value = compress(value);
  value = len(compressed_value) < len(value) ? compressed_value : value;
}

My question is: given a best-case input for lz4, such as a run of "000000000000...", what would that XXX be? Phrased another way, what is the minimum overhead of lz4 in a best-case scenario?

But if there is no good answer to this question, then I can remove that XXX check.

@tristan957
Contributor Author

tristan957 commented Sep 20, 2022

I realize that I may have been naive initially because the overhead of lz4 is probably pretty dependent on the length of the original value to compress.

@Cyan4973
Member

The LZ4 Block format specifies that, to be compressible, an (independent) input must have a length >= 13.
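
A minimal sketch of the check from the pseudocode above, assuming XXX is set to the 13-byte minimum quoted from the LZ4 Block format. The names `MIN_COMPRESSIBLE_LEN`, `should_try_compression`, and `stored_len` are hypothetical and are not part of the LZ4 API. The sketch assumes the compressor reports failure as a length of 0.

```c
#include <stddef.h>

/* Assumption: per the LZ4 Block format, an independent input shorter
 * than 13 bytes cannot be compressed. Hypothetical constant name. */
#define MIN_COMPRESSIBLE_LEN 13

/* Returns 1 if it is worth calling the compressor at all. */
static int should_try_compression(int wants_compression, size_t len)
{
    return wants_compression && len >= MIN_COMPRESSIBLE_LEN;
}

/* Picks the length that ends up on media.
 * compressed_len == 0 is treated as "compression failed" (an assumption
 * of this sketch), and the compressed form is kept only if it is
 * strictly smaller than the original. */
static size_t stored_len(size_t orig_len, size_t compressed_len)
{
    if (compressed_len > 0 && compressed_len < orig_len)
        return compressed_len;
    return orig_len;
}
```

Passing the 13-byte threshold only means compression is possible, not that it will save space, so the length comparison after compressing is still needed.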
