Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

User-defined compression level #661

Closed
wants to merge 6 commits into from
Closed

Conversation

data-man
Copy link
Contributor

Closes #544.

@data-man data-man closed this Jan 14, 2022
@data-man data-man reopened this Jan 14, 2022
@data-man
Copy link
Contributor Author

I don't know why CI is frozen. :(

@data-man data-man marked this pull request as ready for review January 24, 2022 00:20
include/zim/writer/creator.h Outdated Show resolved Hide resolved
src/writer/cluster.cpp Outdated Show resolved Hide resolved
Comment on lines 25 to 35
enum class LZMACompressionLevel: int {
MINIMUM = 0,
DEFAULT = 5,
MAXIMUM = 9
};

enum class ZSTDCompressionLevel: int {
MINIMUM = -21,
DEFAULT = 3,
MAXIMUM = 21
};
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure we need this. User will use the value of the corresponding algorithm. It make more sens as we don't interpret the value (just pass it to the compression library).
If we define something, it should be the default value we will use if the user doesn't set the a compressionLevel explicitly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's for zimtools: displaying and checking for correctness.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the main problem I have with this feature. We should not IMO maintain any kind of "matching" table. Just putting the native compression level is better IMO.

@data-man
Copy link
Contributor Author

Just wondering if other compression methods need support?
Especially lz4 and Brotli.

@kelson42
Copy link
Contributor

Just wondering if other compression methods need support?

No, we want to keep the supported algorithms set as small as possible to ease the support of the format.

@data-man
Copy link
Contributor Author

No, we want to keep the supported algorithms set as small as possible to ease the support of the format.

Ok, got it!
I'll be maintaining my own fork: https://github.com/ZimProjects/libzimext :)

@kelson42
Copy link
Contributor

@data-man Do you plan to fix this PR based on @mgautierfr comments?

@data-man
Copy link
Contributor Author

@kelson42
Yes, a little bit later.

@@ -205,6 +205,7 @@ namespace zim
// configuration
bool m_verbose = false;
Compression m_compression = Compression::Zstd;
int m_compressionLevel = 0;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should use ZSTDCompressionLevel::DEFAULT in case a user never call configCompression

Comment on lines +35 to +37
if (cl == static_cast<int>(zim::LZMACompressionLevel::MAXIMUM)) {
cl |= LZMA_PRESET_EXTREME;
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should not change the compression level passed by the user. It should be a simple pass through.
With this, a user would never be able to set a compression level of 9 without LZMA_PRESET_EXTREME.

@@ -25,6 +25,7 @@
#include "../debug.h"
#include "../compression.h"

#include <zim/zim.h>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why we need this include ?

virtual ~Cluster();

void setCompression(Compression c) { compression = c; }
void setCompressionLevel(int cl) { compressionLevel = cl; }
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We never use setCompression. I'm in favor of removing it instead of adding another setCompressionLevel

Comment on lines +63 to +73
enum class LZMACompressionLevel: int {
MINIMUM = 0,
MAXIMUM = 9,
DEFAULT = MAXIMUM
};

enum class ZSTDCompressionLevel: int {
MINIMUM = -21,
MAXIMUM = 19,
DEFAULT = MAXIMUM
};
Copy link
Collaborator

@mgautierfr mgautierfr Mar 2, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yet again, I dislike those enums.
19 is not the maximum for zstd algorithm. It is 21 (and it may change in the futur). This is the same for 9 and LZMA.
Those values convey very few information and are pretty useless.
We should keep the default value (as we need to store theme somewhere) but I would go with something more like:

enum class DefaultCompressionLevel: int {
     LZMA = 9 | LZMA_PRESET_EXTREME,
     ZSTD = 19
};

But we don't want to expose LZMA "symbol" (LZMA_PRESET_EXTREME) in public zim headers. So we should copy the value and put a explicit comment from where the value come from.
Or we can simply not expose the default value and have two different configCompression (with and without compression level), the version without could simply set the default value (we would not have to publicly define the default value as we use it internally and never define a default value for a argument)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yet again, I dislike those enums.

@mgautierfr what about something like this (in zim.h):

int getDefaultCompressionLevel(Compression comp) {
  switch(comp) {
    case Compression::None:
      {
        return 0;
        break;
      }

    case Compression::Lzma:
      {
        return 9 | LZMA_PRESET_EXTREME;
        break;
      }

    case Compression::Zstd:
      {
        return 19;
        break;
      }
  };
}

And use it in appropriate places and in the zim-tools.
?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why putting it in zim.h ? We should use it internally. Either you pass a compression level (and you know which one) or you don't and libzim picks one internally.

And use it in appropriate places and in the zim-tools.

In fact, I'm not sure we want the same thing.
I "only" want to allow the user to pass a custom compression level.
But it seems you want to expose some compression level and use them somehow (in zim-tools).
In this case, we cannot agree as we are not wanted the same thing 🙂

Can you be more explicit on how you want to use the default level in zim-tools ? Maybe you have a branch somewhere you can share.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

$ zimrecreate

zimrecreate

[ERROR] Not enough Arguments provided

zimrecreate recreates a ZIM file from a existing ZIM.
Usage: zimrecreate ORIGIN_FILE OUTPUT_FILE [Options]

Options:
-v, --version print software version
-j, --withoutFTIndex don't create and add a fulltext index of the content to the ZIM
-J, --threads count of threads to utilize (default: 4)
-cl, --compression_level compression level (default: 19)

The same for zimwriterfs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why putting it in zim.h ?

Because enum class Compression also defined in this file.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want getDefaultCompressionLevel being a public API ?
For me not.
We can have two functions is the creator :

# In the header
Creator& configCompression(Compression compression);
Creator& configCompression(Compression compression, int level);
# In the implementation
Creator& configCompression(Compression compression) {
    return configCompresssion(compression, getDefaultCompressionLevel(compression));
}

As getDefaultCompressionLevel is only used internally, we don't need to expose it in the public API.
It could not work if the user code need to know what is the default level. But I don't have such a use case, maybe you have one ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But I don't have such a use case, maybe you have one ?

As I mentioned above, getDefaultCompressionLevel function will be used in zim-tools for displaying a default value and checking for correctness.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do would you display 9|LZMA_PRESET_EXTREME (2147483657) ? And why we would like to display it ? It is enough to tell that if no value is provided we will take a correct one (without saying with one). And it allow us to change this value in the future without having to change anything as it is a internal thing.
How will you us the default value for checking ? If the user pass a value, we must blindly use it. And there is nothing to check on compression level in zimcheck

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do would you display 9|LZMA_PRESET_EXTREME

Isn't LZMA deprecated?

And why we would like to display it ?

Why not?
$ zstd -h

-1 .. -19 compression level (faster .. better; default: 3)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't LZMA deprecated?

LZMA is obsolete yes. But we still support it, so we must have a default value for it.

Why not?

Because, user has to refer to the specific algorithm documentation to know what are the correct level. I don't want to change our code if zstd add a new 20 compression level.

I would agree with a getDefaultCompressionLevel in the public API. But I have some mandatory constraints:

  • No include of zstd.h or lzma.h in libzim's public header. A user should be able to use a precompiled libzim without needing development headers of compression algorithms available. He may want to still include them if he want to specify a level, but it is his choice.
  • The default level must be dynamic. If we want to change it a libzim level, we should be able to only rebuild the libzim library, without API/ABI break. zimtools (or other) should not have to be recompiled (so no definition of default value in the header)
  • Consequentially of the previous constraints (or this is the cause of the previous constraints ?), this default level must not be part of the api. If we change it, it is not a api break. User must not ("cannot" is better) base his code on this value. If we want to define it to 1, we must be able to do so. It would be less efficient at compression level but we can. (And if user doesn't tell us what value to use, he trusts us to use some value). Any issue like "I was used to do getDefaultCompressionLevel()-1 and now it is broken" will be refused.

I'm still not convinced we need to publicly define getDefaultCompressionLevel but if you really think it is needed, I'm ok with that.

@kelson42
Copy link
Contributor

@data-man @mgautierfr What is the status here. The PR meanwhile is pretty old and meanwhile does not seem to make progresses.

@mgautierfr
Copy link
Collaborator

Sorry, I've missed the @data-man answer in the review comments.

@kelson42
Copy link
Contributor

@data-man Any feedback? This PR is really moving too slowly. I consider closing it if we don’t have an end in view.

@data-man
Copy link
Contributor Author

@kelson42

I'll rework the API after thinking about it, sorry.

@data-man data-man closed this May 22, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

User-defined compression level
3 participants