
Compressed index? #115

Open
peteruhrig opened this issue May 4, 2023 · 3 comments
Labels
enhancement New feature or request performance Something is slower than it could be

Comments

@peteruhrig
Contributor

I was wondering whether compressed indexes are available; if they are not, please consider this a suggestion for a new feature.

In my application, I have files of around 200 MB, for which ratarmount creates a 40 MB index. So basically, we're adding a 20% overhead. With bzip2, that index file compresses down to 2.9 MB, reducing the overhead to 1.5%. This makes a huge difference, because we have a few hundred thousand archives. Since you already have really fast decompression algorithms in place, the overhead in processing time might be acceptable for many applications.
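For reference, the arithmetic behind those overhead figures (values taken from the paragraph above):

```python
# Numbers reported above: ~200 MB archive, 40 MB index, 2.9 MB after bzip2.
archive_mb, index_mb, compressed_mb = 200, 40, 2.9

print(f"uncompressed index overhead: {index_mb / archive_mb:.0%}")    # 20%
print(f"bzip2-compressed overhead:   {compressed_mb / archive_mb:.2%}")
```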

@mxmlnkn
Owner

mxmlnkn commented May 6, 2023

They are not available, but a feature like this would be welcome. It would have been nice to have compression included directly in SQLite, but it does not seem easily possible with the standard version; there is, however, the proprietary ZIPVFS extension.

It would be kinda cool to compress the SQLite database and then use one of the new backends, such as indexed_bzip2, pragzip, or indexed_zstd, probably the last one. One problem will be write support, but it might be easy to compress the index as a post-processing step after it has been created. Would that work for you? As a first step, I might simply support reading such compressed indexes without yet implementing automatic compression. Users could then compress the index manually if they want. This would also be useful for initial performance benchmarks.
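The post-processing step could look something like this (a sketch only; `compress_index` is a hypothetical helper, not part of ratarmount):

```python
import bz2
import os
import shutil

def compress_index(index_path: str, keep_original: bool = False) -> str:
    """Sketch of the post-processing idea: bzip2-compress a finished
    ratarmount index so only the small .bz2 file needs to be stored.
    The function name and behavior are illustrative, not ratarmount API."""
    compressed_path = index_path + ".bz2"
    # Stream the file through bz2 so even large indexes fit in memory.
    with open(index_path, "rb") as src, bz2.open(compressed_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    if not keep_original:
        os.remove(index_path)
    return compressed_path
```

A user could achieve the same effect manually today by running bzip2 on the index file after creation.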

@mxmlnkn mxmlnkn added enhancement New feature or request performance Something is slower than it could be labels May 6, 2023
@mxmlnkn
Owner

mxmlnkn commented May 10, 2023

I looked a bit into the second solution. On-the-fly access is difficult because Python's sqlite3 module only works with file paths (or :memory:). Therefore, I can't simply open an IndexedGzipFile or a similar file object and hand it to sqlite3. I would have to either FUSE-mount the index via a separate process or extract it into /tmp while the SQLite index is being used.
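The extract-to-/tmp workaround could be sketched as follows (hypothetical helper name and .bz2 format choice; not ratarmount code):

```python
import bz2
import shutil
import sqlite3
import tempfile

def open_compressed_index(compressed_path: str) -> sqlite3.Connection:
    """Sketch of the /tmp approach described above: Python's sqlite3
    module needs a real file path, so decompress the compressed index
    into a temporary file first, then open that by path."""
    tmp = tempfile.NamedTemporaryFile(suffix=".sqlite", delete=False)
    with bz2.open(compressed_path, "rb") as src:
        shutil.copyfileobj(src, tmp)
    tmp.close()
    # The temporary file must outlive the connection; cleanup is omitted
    # in this sketch.
    return sqlite3.connect(tmp.name)
```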

@mxmlnkn
Owner

mxmlnkn commented May 10, 2023

> In my application, I have files of around 200 MB, for which ratarmount creates a 40 MB index.

Are those gzip-compressed TAR archives? The index for gzip files is the largest. The tradeoff between index size and seek latency can be configured with the --gzip-seek-point-spacing option: a higher spacing between seek points increases the seek latency but reduces the index size, because each gzip seek point requires storing 32 KiB of data. For bzip2, each seek point is only 16 B. Zstandard might also work better, but for seeking to work with Zstandard, you have to compress with pzstd instead of zstd or otherwise ensure that the Zstandard file contains multiple frames.
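As a back-of-the-envelope illustration of that tradeoff, the gzip seek-point data alone can be estimated from the spacing (the spacings below are chosen for illustration, and the estimate ignores per-point metadata and SQLite overhead):

```python
def estimated_gzip_index_size(archive_size: int, spacing: int) -> int:
    """Rough estimate: each gzip seek point stores a 32 KiB window,
    so the seek-point data grows linearly with the number of points."""
    window = 32 * 1024
    seek_points = max(1, archive_size // spacing)
    return seek_points * window

MiB = 1024 * 1024
# Hypothetical spacings for a ~200 MB archive:
for spacing in (1 * MiB, 16 * MiB):
    size = estimated_gzip_index_size(200 * MiB, spacing)
    print(f"spacing {spacing // MiB:3d} MiB -> ~{size / MiB:.2f} MiB of seek-point data")
```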
