Parallelize indexing of nested archives #80

mxmlnkn · 2022-03-19T09:45:13Z

Use case: mounting a TAR that contains many compressed xz or gzip files, each consisting of a single block and therefore not parallelizable. Such data arises when archiving log files that have been rotated by logrotate and compressed with (single-block) xz or gzip, whose decompression either cannot be parallelized or has not been parallelized yet.

logs-2022-03-19.tar
    dmesg.1.gz
    dmesg.2.gz
    dmesg.3.gz

Because the outer layer is uncompressed, a simple folder, or compressed with bz2, which can be decompressed in parallel, reading the outer layer should be much faster than reading the inner xz or gzip files. It would therefore be helpful if, for recursive mounting, the nested archives could be analyzed in parallel.
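A minimal sketch of the idea, not ratarmount's actual implementation: the outer TAR is read sequentially in one thread while each nested compressed file is handed to a worker for analysis. The names `OPENERS`, `index_nested`, and `index_archive` are hypothetical, and the "index" here is just the decompressed size. It assumes that CPython's gzip and lzma decompression releases the GIL, so that threads can actually overlap, and it reads each nested file fully into memory for simplicity.

```python
import concurrent.futures
import gzip
import io
import lzma
import tarfile

# Hypothetical mapping from file extension to a stream opener (not ratarmount's API).
OPENERS = {".gz": gzip.open, ".xz": lzma.open}


def index_nested(name, raw):
    """Decompress one nested file and record a toy 'index': its decompressed size."""
    opener = next(o for ext, o in OPENERS.items() if name.endswith(ext))
    size = 0
    with opener(io.BytesIO(raw)) as stream:
        # Read in 1 MiB chunks so the decompressed data is never held in full.
        while chunk := stream.read(1 << 20):
            size += len(chunk)
    return name, size


def index_archive(tar_path, max_workers=4):
    """Read the outer TAR sequentially and index its compressed members in parallel."""
    futures = []
    with concurrent.futures.ThreadPoolExecutor(max_workers) as pool, \
            tarfile.open(tar_path) as tar:
        for member in tar:
            if member.isfile() and member.name.endswith(tuple(OPENERS)):
                # The fast outer layer only copies the compressed bytes ...
                raw = tar.extractfile(member).read()
                # ... while the slow single-threaded decompression runs in a worker.
                futures.append(pool.submit(index_nested, member.name, raw))
        return dict(future.result() for future in futures)
```

Calling `index_archive("logs-2022-03-19.tar")` on the example above would return a dictionary that maps each nested file name to its decompressed size. A real implementation would build a proper file index per nested archive and would probably stream the nested data instead of buffering it.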

Note that the performance gains from this would be moot if every backend could be and had been parallelized. In that case, there would always be a bottlenecking layer that hogs all processing cores, and adding parallelization across multiple archives would gain nothing or might even make things worse. However, some formats, such as single-block xz and zstd files, are very hard to parallelize. I started a parallelized gzip decoder and am fairly close to a working prototype, but it might turn out to be more difficult than thought, and its single-core performance is worse than that of other implementations, which is demotivating. There should be enough edge cases for this feature to still make sense even after gzip has been parallelized, and implementing it should also be much easier.

It basically optimizes the same use cases as #79 and might therefore have even smaller benefits after #79 has been implemented, namely only for the first mounting.

mxmlnkn added the "performance" label ("Something is slower than it could be") on Mar 19, 2022
