Parallelize indexing of nested archives #80

Open
mxmlnkn opened this issue Mar 19, 2022 · 0 comments
Labels
performance Something is slower than it could be

Comments

@mxmlnkn
Owner

mxmlnkn commented Mar 19, 2022

Use case: mounting a TAR that contains many compressed files, e.g., single-block xz files, which cannot be decompressed in parallel. Such data arises when archiving log files that have been rotated by logrotate and compressed with single-block xz or gzip, formats that are either not parallelizable or have not been parallelized yet.

logs-2022-03-19.tar
    dmesg.1.gz
    dmesg.2.gz
    dmesg.3.gz

Because the outer layer is an uncompressed TAR, a simple folder, or possibly compressed with bz2, which can be decompressed in parallel, reading the outer layer should be much faster than reading the inner compressed files. Therefore, it would be helpful if, for recursive mounting, the nested archives could be analyzed in parallel.
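A minimal sketch of the idea, using only the Python standard library and not the project's actual code: the outer TAR is scanned once sequentially to collect member offsets, then each nested compressed file is processed by its own worker. The function names (`list_nested`, `index_nested`, `index_all`) are hypothetical, and "indexing" is reduced here to measuring the decompressed size. A thread pool is assumed to be sufficient because CPython's `zlib` and `lzma` are understood to release the GIL while decompressing; a process pool could be substituted if that turns out not to help.

```python
import gzip
import io
import lzma
import tarfile
from concurrent.futures import ThreadPoolExecutor


def list_nested(tar_path):
    """Scan the uncompressed outer TAR once and return (name, data offset, size)
    for every member that looks like a nested compressed file."""
    with tarfile.open(tar_path, "r:") as tar:
        return [
            (member.name, member.offset_data, member.size)
            for member in tar
            if member.isfile() and member.name.endswith((".gz", ".xz"))
        ]


def index_nested(tar_path, name, offset, size):
    """Hypothetical stand-in for indexing: decompress one nested file
    and return its name together with its decompressed size."""
    # Each worker opens its own file handle, so no seek position is shared.
    with open(tar_path, "rb") as file:
        file.seek(offset)
        raw = file.read(size)
    opener = gzip.open if name.endswith(".gz") else lzma.open
    total = 0
    with opener(io.BytesIO(raw)) as stream:
        while chunk := stream.read(1 << 20):
            total += len(chunk)
    return name, total


def index_all(tar_path, workers=4):
    """Process all nested compressed members of the outer TAR concurrently."""
    members = list_nested(tar_path)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(lambda m: index_nested(tar_path, *m), members))
```

Calling `index_all("logs-2022-03-19.tar")` on an archive like the one above would return a dictionary mapping each nested file name to its decompressed size. The sequential scan of the outer layer stays cheap, while the expensive single-threaded decompression of each inner file runs side by side.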

Note that the performance gains from this are moot once every backend can be and has been parallelized. In that case, there would always be a bottlenecking layer that already uses all processing cores, so additionally parallelizing over multiple archives would gain nothing and might even make things worse. However, some formats are very hard to parallelize, such as single-block xz and zstd files. I have started a parallelized gzip decoder and am fairly close to a working prototype, but it might turn out to be more difficult than expected, and its single-core performance is worse than that of other implementations, which is demotivating. There should be enough edge cases for this to still make sense even after gzip has been parallelized, and implementing this should also be much easier.

It essentially optimizes the same use cases as #79 and might therefore yield even smaller benefits once #79 has been implemented, namely only on the first mount.

@mxmlnkn mxmlnkn added the performance Something is slower than it could be label Mar 19, 2022