
Compressed index? #115

Open
peteruhrig opened this issue May 4, 2023 · 3 comments
Labels
enhancement New feature or request performance Something is slower than it could be

Comments

@peteruhrig
Contributor

I was wondering whether compressed indexes are available; if they are not, please consider this a suggestion for a new feature.

In my application, I have files of around 200 MB, for which ratarmount creates a 40 MB index. So basically, we're adding a 20% overhead. With bzip2, that index file compresses down to 2.9 MB, reducing the overhead to 1.5%. This makes a huge difference, because we have a few hundred thousand archives. Since you already have really fast decompression algorithms in place, the overhead in processing time might be acceptable for many applications.
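For reference, the arithmetic behind those overhead figures (values taken from the paragraph above):

```python
# Numbers reported above: ~200 MB archive, 40 MB index, 2.9 MB after bzip2.
archive_mb, index_mb, compressed_mb = 200, 40, 2.9

print(f"uncompressed index overhead: {index_mb / archive_mb:.0%}")    # 20%
print(f"bzip2-compressed overhead:   {compressed_mb / archive_mb:.2%}")
```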

@mxmlnkn
Owner

mxmlnkn commented May 6, 2023

They are not available, but a feature like this would be welcome. It would have been nice to have compression included directly in SQLite, but it does not seem easily possible with the standard version; there is, however, the proprietary ZIPVFS extension.

It would be kinda cool to compress the SQLite database and then use one of the new backends, such as indexed_bzip2, pragzip, or indexed_zstd, probably the last one. One problem will be write support, but it might be easy to compress the index as a post-processing step after it has been created. Would that work for you? As a first step, I might simply support reading such compressed indexes without yet implementing automatic compression. Users could then compress the index manually if they want. This would also be useful for initial performance benchmarks.
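The post-processing step could look something like this (a sketch only; `compress_index` is a hypothetical helper, not part of ratarmount):

```python
import bz2
import os
import shutil

def compress_index(index_path: str, keep_original: bool = False) -> str:
    """Sketch of the post-processing idea: bzip2-compress a finished
    ratarmount index so only the small .bz2 file needs to be stored.
    The function name and behavior are illustrative, not ratarmount API."""
    compressed_path = index_path + ".bz2"
    # Stream the file through bz2 so even large indexes fit in memory.
    with open(index_path, "rb") as src, bz2.open(compressed_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    if not keep_original:
        os.remove(index_path)
    return compressed_path
```

A user could achieve the same effect manually today by running bzip2 on the index file after creation.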

@mxmlnkn mxmlnkn added enhancement New feature or request performance Something is slower than it could be labels May 6, 2023
@mxmlnkn
Owner

mxmlnkn commented May 10, 2023

I looked a bit into the second solution. On-the-fly access is difficult because Python's sqlite3 module only works with file paths (or :memory:). Therefore, I can't simply open an IndexedGzipFile or a similar file object and hand it to sqlite3. I would have to either FUSE-mount the index via a separate process or extract it into /tmp while the SQLite index is being used.
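The extract-to-/tmp workaround could be sketched as follows (hypothetical helper name and .bz2 format choice; not ratarmount code):

```python
import bz2
import shutil
import sqlite3
import tempfile

def open_compressed_index(compressed_path: str) -> sqlite3.Connection:
    """Sketch of the /tmp approach described above: Python's sqlite3
    module needs a real file path, so decompress the compressed index
    into a temporary file first, then open that by path."""
    tmp = tempfile.NamedTemporaryFile(suffix=".sqlite", delete=False)
    with bz2.open(compressed_path, "rb") as src:
        shutil.copyfileobj(src, tmp)
    tmp.close()
    # The temporary file must outlive the connection; cleanup is omitted
    # in this sketch.
    return sqlite3.connect(tmp.name)
```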

@mxmlnkn
Owner

mxmlnkn commented May 10, 2023

> In my application, I have files of around 200 MB, for which ratarmount creates a 40 MB index.

Are those gzip-compressed TAR archives? The index for gzip files is the largest. The tradeoff between index size and seek latency can be configured with the --gzip-seek-point-spacing option: a higher spacing between seek points increases the seek latency but reduces the index size, because each gzip seek point requires storing 32 KiB of data. For bzip2, each seek point is only 16 B. Zstandard might also work better, but for seeking to work with Zstandard, you have to compress with pzstd instead of zstd or otherwise ensure that the Zstandard file contains multiple frames.
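As a back-of-the-envelope illustration of that tradeoff, the gzip seek-point data alone can be estimated from the spacing (the spacings below are chosen for illustration, and the estimate ignores per-point metadata and SQLite overhead):

```python
def estimated_gzip_index_size(archive_size: int, spacing: int) -> int:
    """Rough estimate: each gzip seek point stores a 32 KiB window,
    so the seek-point data grows linearly with the number of points."""
    window = 32 * 1024
    seek_points = max(1, archive_size // spacing)
    return seek_points * window

MiB = 1024 * 1024
# Hypothetical spacings for a ~200 MB archive:
for spacing in (1 * MiB, 16 * MiB):
    size = estimated_gzip_index_size(200 * MiB, spacing)
    print(f"spacing {spacing // MiB:3d} MiB -> ~{size / MiB:.2f} MiB of seek-point data")
```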
