Why lzma for data compression? #559

Open
Yomguithereal opened this issue Apr 15, 2024 · 5 comments
Labels
question Further information is requested

Comments

@Yomguithereal
Contributor

Hello @adbar,

Sorry to bother you, but may I ask why the library's model data is compressed with lzma? I ask because many people run Python builds that were compiled or installed without lzma support, so trafilatura breaks for them. They often struggle to fix the problem because they don't always know how to reinstall Python after installing the required dependencies (usually through yum or apt). Wouldn't gzip or another compression scheme be more widely available and avoid this issue?

Have a good day,

@adbar adbar added the question Further information is requested label Apr 15, 2024
@adbar
Owner

adbar commented Apr 15, 2024

Hi @Yomguithereal, I didn't know that Python could come without LZMA. I thought it was a standard package, and I used it because it compresses text better.

I could switch to bz2, for example. Do you know of a list of supported platforms, so that we don't run into the same problem again?

@Yomguithereal
Contributor Author

@adbar I think you would have the same problem with bz2. In that case gzip is probably the better choice, because it relies on zlib, which is installed on most systems since it is part of most distros' build essentials. Sometimes it also compresses better and faster than lzma, but your mileage may vary of course.

@adbar
Owner

adbar commented Apr 15, 2024

I checked again: usually all the packages in the stdlib are available. In some cases compression libraries are missing when Python is compiled from source, but it's inconsistent across systems (see the pyenv wiki). On macOS zlib can be missing as well, and on Linux it's sometimes bz2 (also according to the wiki), so I'm not sure how to solve this.
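
For anyone who wants to check their own interpreter, here is a minimal sketch that reports which of the stdlib compression modules can be imported. The function name `available_compression_modules` is made up for illustration and is not part of trafilatura.

```python
import importlib


def available_compression_modules(candidates=("zlib", "gzip", "bz2", "lzma")):
    """Return a dict mapping each module name to True if it imports cleanly.

    Hypothetical helper for diagnosing a Python build; not a trafilatura API.
    """
    status = {}
    for name in candidates:
        try:
            importlib.import_module(name)
            status[name] = True
        except ImportError:
            # Typically means the underlying C library was missing at build time.
            status[name] = False
    return status


if __name__ == "__main__":
    for module, ok in available_compression_modules().items():
        print(f"{module}: {'available' if ok else 'MISSING'}")
```

A module reported as missing usually means the interpreter has to be rebuilt after installing the corresponding development headers.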

@Yomguithereal
Contributor Author

The only option I see here would be to conditionally support a pure-Python implementation of the lzma decompression scheme (using this for instance: https://github.com/Rogdham/python-xz). Your code would import lzma, then fall back on the pure-Python implementation if it is available (it does not have to be in your deps). But this is quite a lot of work and cruft for maybe too tiny a benefit.
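
A rough sketch of that import-with-fallback pattern, under assumptions: the helper `load_first_available` is hypothetical, `xz` is assumed to be the import name of python-xz, and whether that package works without the stdlib `lzma` module or exposes a compatible interface has not been verified here.

```python
import importlib


def load_first_available(module_names):
    """Return the first module in module_names that can be imported, else None.

    Hypothetical helper, not part of trafilatura or python-xz.
    """
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue
    return None


# Prefer the stdlib module; "xz" is an assumed optional fallback.
decompressor = load_first_available(("lzma", "xz"))
if decompressor is None:
    print("No LZMA-capable module found; model data cannot be decompressed.")
```

The fallback stays optional this way: nothing is added to the dependencies, and users with a complete Python build never touch the second module.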

@adbar
Owner

adbar commented Apr 15, 2024

It would also be difficult to test on GitHub Actions (the current CI/CD). We could also explain how to fix the problem in the docs.

Let's leave the issue open for now.


2 participants