
Support more formats #109

Open
2 of 3 tasks
mxmlnkn opened this issue Mar 14, 2023 · 14 comments
Labels
enhancement New feature or request

Comments

@mxmlnkn
Owner

mxmlnkn commented Mar 14, 2023

Some archive formats I would particularly like to have access to:

  • .mbox mail archive files
  • .dmg macOS application bundles
  • squashfs [1] [2]
  • asar
  • Not a format per se, but support for files specified as an http[s] URL, which can be accessed using ranged GETs (see the sketch after this list)
  • SQLite Archive
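
As a rough illustration of the ranged-GET idea, a seekable, read-only file object could be sketched as follows (a minimal sketch, not ratarmount's implementation; it assumes the server supports Range requests and reports Content-Length):

    import io
    import requests

    class HttpRangedFile(io.RawIOBase):
        """Read-only, seekable file over HTTP(S) using ranged GETs."""

        def __init__(self, url):
            self.url = url
            self.offset = 0
            # A HEAD request yields the total size, needed for SEEK_END and bounds checks.
            self.size = int(requests.head(url, allow_redirects=True).headers['Content-Length'])

        def readable(self):
            return True

        def seekable(self):
            return True

        def tell(self):
            return self.offset

        def seek(self, offset, whence=io.SEEK_SET):
            anchors = {io.SEEK_SET: 0, io.SEEK_CUR: self.offset, io.SEEK_END: self.size}
            self.offset = anchors[whence] + offset
            return self.offset

        def read(self, size=-1):
            if size < 0:
                size = self.size - self.offset
            if size <= 0 or self.offset >= self.size:
                return b''
            end = min(self.offset + size, self.size) - 1  # Range header bounds are inclusive.
            response = requests.get(self.url, headers={'Range': f'bytes={self.offset}-{end}'})
            response.raise_for_status()
            data = response.content
            self.offset += len(data)
            return data

Wrapping such an object in io.BufferedReader would batch small reads into fewer HTTP requests.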

Probably unneeded or out of scope:

  • .parquet files via fastparquet, as used for ML. It is basically a columnar format, so it might not make sense to support it. I got the idea because some image training datasets are somehow stored as Parquet, maybe with a column "images".
  • ADIOS? HDF5?
  • Dockerfiles?

Done:

Single file compressions:

  • zlib: implemented in rapidgzip 0.12.0.
  • Implement lz4, lzo, ... backends in rapidgzip. Currently, they are usable via libarchive, so it would "only" be a performance improvement.
  • raw deflate?
  • brotli?
  • snappy?

It seems like two other frameworks have very similar goals to ratarmount and might even be further along:

  • FoxIT dissect FUSE
  • fsspec seems to also have FUSE support(?)
  • Just for completeness' sake: these also have the same goal of supporting as many archives as possible, but without built-in FUSE

I need to benchmark those, but I hope that, at least for very big data, ratarmount still has an edge. If not, that's a bug.

@BingoKingo

Will it support git?

@mxmlnkn
Owner Author

mxmlnkn commented Sep 24, 2023

@BingoKingo What part of git should ratarmount support? git-archive creates .tar.gz or zip archives, so that is already supported. Git packfiles? Not sure what the use case for that would be with ratarmount except that it would be nice to simply inspect all kinds of different archives manually for whatever reason. Or do you want to mount a remote git repository without fully cloning it? I'm not sure the protocol allows for that. Using ratarmount to show a checked out view without actually checking it out?

@BingoKingo

Or do you want to mount a remote git repository without fully cloning it? I'm not sure the protocol allows for that. Using ratarmount to show a checked out view without actually checking it out?

I mean something like gitfs.

@mxmlnkn
Owner Author

mxmlnkn commented Sep 25, 2023

Hm, I think this is out of scope for ratarmount, especially as a solution already exists, even if it hasn't seen many commits in the last 3-4 years.

@hasB4K

hasB4K commented Feb 25, 2024

Hello @mxmlnkn,

First, I would like to thank you for this amazing project. I'm actually very interested in it, and I would love to do a PR to add at least the 7z format. However, before doing that, I have four questions:

  1. Do you think ratarmount could use py7zr as a dependency to support 7z files?

Regarding libarchive, you said that there is currently no proper binding in Python that would allow you to use it. I think you were referring to this project: https://github.com/smartfile/python-libarchive (which uses SWIG internally; I have experienced issues with SWIG in past projects, it felt unreliable back then).

I have found this project that had a big update last July: https://github.com/Changaco/python-libarchive-c. It only uses ctypes for the C binding, like fusepy.
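
For reference, reading an archive with libarchive-c looks roughly like this (a minimal sketch based on its ReadMe; format and filter detection are automatic):

    import libarchive  # libarchive-c, the ctypes-based binding

    # Iterate over all entries of an archive; the format is auto-detected.
    with libarchive.file_reader('archive.7z') as archive:
        for entry in archive:
            print(entry.pathname, entry.size)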

  2. Have you considered using this "new" lib?

Last but not least, I'm a little confused by a sentence in the main description of ratarmount. You wrote the following (here):

In contrast to libarchive, on which archivemount is based, random access and true seeking is supported.

  3. Does that mean that the new formats supported by libarchive would not support random access? Or did you find a workaround since then?
  4. Do all the archive formats supported by ratarmount support random access?

Have a great evening,
Kind regards

@mxmlnkn
Owner Author

mxmlnkn commented Feb 25, 2024

First, I would like to thank you for this amazing project. I'm actually very interested in it, and I would love to do a PR to add at least the 7z format. However, before doing that, I have four questions:

PRs would be very welcome! The old libarchive branch contains my earlier attempt, with failing tests because of issues with the libarchive Python bindings. The tests could be reused / cherry-picked for your PR.

  1. Do you think ratarmount could use py7zr as a dependency to support 7z files?

It's probably fine to use it.

In the worst case, the dependency could be made optional at first. I think I didn't consider it much because it only seems to solve one format: 7z, while libarchive would solve many formats. But one more format is better than none. Another issue might be the LGPL license. Then again, there are other dependent projects mentioned in its ReadMe that are MIT-licensed: https://github.com/miurahr/aqtinstall. I think LGPL is definitely better than GPL-3, and the manner in which Python packages are distributed and installed might even make them compatible (it allows "relinking" / switching out the LGPL dependency).

Regarding libarchive, you said that there is currently no proper binding in Python that would allow you to use it. I think you were referring to this project: https://github.com/smartfile/python-libarchive (which uses SWIG internally; I have experienced issues with SWIG in past projects, it felt unreliable back then)

The issue wasn't with the binding itself, as far as I recall. The issue was with the provided file object, which didn't reliably allow arbitrary seeks; especially seeking backwards would fail in some scenarios, meaning random access didn't work at all. It might be possible to fix this upstream with a PR, or to roll out a custom file object abstraction if the lower-level bindings are sufficiently exposed.

I have found this project that had a big update last July: https://github.com/Changaco/python-libarchive-c. It only uses ctypes for the C binding, like fusepy.

  2. Have you considered using this "new" lib?

I was checking libarchive bindings again recently and saw that this project got active again. I never looked deeper into it, though. It could be worth a try.

Last but not least, I'm a little confused by a sentence in the main description of ratarmount. You wrote the following (here):

In contrast to libarchive, on which archivemount is based, random access and true seeking is supported.

3. Does that mean that the new formats supported by libarchive would not support random access? Or did you find a workaround since then?

I'm pretty sure that random access to formats such as gzip and bzip2 is not supported; I would have to check the source code or do some benchmarks to be completely sure. The benchmark in the bottom left, "Time Required to Get Contents of One File", is the most indicative. It shows that alternatives can take minutes to seek to a requested file in compressed archives because they presumably have to start decompression from the very beginning of the archive. Providing random access to these heavily stream-based formats is difficult, but there are solutions like indexed_bzip2 and indexed_gzip / rapidgzip.
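
For illustration, this is the idea behind such a seek index, sketched with indexed_gzip's API (ratarmount itself stores its index in SQLite; the file names here are placeholders):

    import indexed_gzip

    # First mount: build the seek index (one full pass) and export it.
    with indexed_gzip.IndexedGzipFile('archive.tar.gz') as gzfile:
        gzfile.build_full_index()
        gzfile.export_index('archive.tar.gz.index')

    # Later mounts: import the index, then seek without decompressing from the start.
    with indexed_gzip.IndexedGzipFile('archive.tar.gz') as gzfile:
        gzfile.import_index('archive.tar.gz.index')
        gzfile.seek(123_456_789)
        data = gzfile.read(4096)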

Other formats, such as 7z, zip, and rar, will definitely allow seeking to file members, but seeking inside compressed files might still not be performant, i.e., seeking backwards would require reopening that file member and reading from the beginning up to the desired position. For small files, that would not be an issue, only for larger ones. Again, I'm speculating and would have to check the source code. However, random access to bz2 and gzip always requires some kind of index, and such an index is not written by libarchive. So, at the very least, subsequent archive mounts would require analyzing the archive again, while ratarmount can simply load the exported index.

4. Do all the archive formats supported by ratarmount support random access?

In principle, yes. In practice, there are requirements on zstd and xz files, so random access will not work performantly for all such files. The general rule would be: if it was compressed in parallel, then random access is highly likely to be possible.

Lastly, I would mention fsspec again. It seems to provide a filesystem abstraction interface akin to my MountSource interface, only with many more methods. It also comes with a libarchive implementation, which uses the libarchive-c Python bindings that you mentioned in your second question. It might therefore be possible to make fsspec a dependency and provide an FsspecMountSource adaptor, which could bring libarchive support and many other bindings to ratarmount. Or its existence can simply be taken as another vote of confidence for the libarchive-c Python bindings.

There are also some performance results for 7z access in this issue: py7zr seems to be 7x slower than 7z.exe, and libarchive-c seems to be 20x slower. It looks bad for Python bindings. And if Windows support is intended, then python-libarchive would also be disqualified.

@hasB4K

hasB4K commented Feb 29, 2024

First, thank you for your lengthy answer! :)

Benchmark of libarchive-c (I got good results 🤷)

First, I would like to start with the benchmark issue. I decided to give it a go and benchmark on some data that I have (I cannot share the archives, sadly, but if needed, we could create a benchmark archive with open-source data).

Here are my results, on two archives, on Linux with 32 GB of RAM and 8 cores (i7-7820HQ):

  • First 7z: 18 files, 11 GB uncompressed, roughly the same when compressed. Here are the times it took for a full extraction:
    • 7z binary: 3min 30s
    • py7zr: 4min 32s
    • libarchive-c (using this method): 3min 00s
  • Second 7z: ~160K files, 44 GB uncompressed, 3.1 GB compressed. Here are the times it took for a full extraction:
    • 7z binary: 20min 35s
    • py7zr: 20min 52s
    • libarchive-c (using this method): 14min 11s

As you can see, on this setup, libarchive-c was faster for me... This difference could come from the fact that I'm on Linux (the other benchmark was on Windows), or maybe the kind of compression is different, or it could come from the RAM or the number of cores... No idea. But that's why I did the same benchmark on two different kinds of archives (two big archives, but one with a LOT of files in it). Further investigation might be necessary, but it tells me at least that libarchive-c is usable, since I got even better results than the original 7z binary.

Regarding libarchive

I noticed that libarchive was able to open some ZIP archives that zipfile was not able to open (sadly, I cannot share them, but maybe I could create a similar one). So I decided to investigate libarchive more and disregard py7zr for now. (Btw, it would be nice if we created a libarchive binding to have an option to open ZIP files with it instead of using zipfile.)

Libarchive doesn't support random seeks within a file, and the underlying C structs (different for each format) inside libarchive are not accessible. They did, however, implement:

  • a seekable method in their internal backend for some of their archive formats, such as 7z, that need to be able to do random access while decompressing. There is very little documentation on this, but it is explained here. Sadly, the conclusion is that we cannot use this to seek within a file. FYI, most of the formats supported by libarchive don't even implement the 'seek' method; for example, the internal tar backend here doesn't support it and puts a NULL pointer in place of a real function.
  • a skip-archive-entry method: we can basically fetch an entry (i.e., a file within an archive) without reading the entire archive. You can see this function here.

Regarding fsspec

I didn't know that fsspec had a libarchive implementation! This is great! I think a Proof of Concept could be done by using a variant of what fsspec did for libarchive, either by maintaining a special version for ratarmount or by submitting a PR to fsspec. Here is what I thought:

  • fsspec doesn't handle passwords for archives. This support could however easily be added with a PR on fsspec, since it has now been added to libarchive-c here.
  • fsspec manages random access by iterating through all files within an archive (using the skip-entry functionality of libarchive), then extracting the file and putting it into a MemoryFile to have random access on it (see here).

The issue here is that putting a file into a MemoryFile implies that every file accessed through it must fit into memory. On the data that I have used in different projects so far, sometimes you can make that assumption, sometimes you cannot. You can have a 4 GB file that needs to be accessed (and it could even be a nested archive). Some of their other implementations, such as the FTP one, handle files as a custom object that inherits their caching behavior (see here).

There are different kinds of caching in fsspec:

  1. The caching of the file listing: https://filesystem-spec.readthedocs.io/en/latest/features.html#listings-caching
  2. The caching of files, as shown in the FTP implementation
  3. And finally, a CachingFileSystem that can be used on top of other filesystem within fsspec: https://github.com/fsspec/filesystem_spec/blob/master/fsspec/implementations/cached.py

For kinds 2 and 3, you can have different caching strategies (mmap, read-ahead, etc.).
And for kind 3, there is a file-expiration mechanism that can be triggered.

My conclusion

I think that, for ratarmount to be able to support libarchive and fsspec, we should use a caching mechanism with an expiration date (and make it conditional on the files that we would like to cache being small enough; see the sketch below). It would allow random access to files inside an archive, and ratarmount could continue to support recursive mounting of archives without trouble. What do you think?
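
One possible middle ground between a MemoryFile and a user-managed temp directory, sketched with the standard library (the helper name is made up): tempfile.SpooledTemporaryFile keeps small extractions in memory and transparently spills larger ones to disk.

    import tempfile

    # Hypothetical helper: extract one archive entry into a spooled file that
    # stays in memory below `max_size` bytes and spills to disk above it.
    def extract_to_spooled_file(blocks, max_size=16 * 1024 * 1024, tmp_dir=None):
        spooled = tempfile.SpooledTemporaryFile(max_size=max_size, dir=tmp_dir)
        for block in blocks:  # e.g., libarchive-c's entry.get_blocks()
            spooled.write(block)
        spooled.seek(0)
        return spooled  # Seekable; the backing temp file is deleted on close.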

PS: I'm not yet sure that I will have the time to do this PR. I think I could maybe do a PoC. But I'm not sure yet; it would largely depend on your vision of the project and what you think about adding libarchive through an fsspec-style implementation with a (potentially optional) caching mechanism.

@mxmlnkn
Owner Author

mxmlnkn commented Feb 29, 2024

Benchmark of libarchive-c (I got good results 🤷)

but if needed, we could create a benchmark archive with open-source data.

I have a variety of larger files for benchmarking purposes, which I only test with locally because it would be too expensive for the CI:

All of these have different desirable properties and compressibilities.
The synthetic data covers certain edge cases.
If one of the real data sets is too small, then it is simply duplicated and concatenated, which should not affect benchmark behavior for many compression formats because the compression dictionary distance is very limited (32-64 KiB).
7z can be configured with notably large dictionary distances, so care would have to be taken there.
The wikidata example is my go-to benchmark for memory usage because it extracts to 1.2 TiB of data.
With such a size, even the memory required for the index (~10 GB) becomes a problem!
Lately, I have been mostly testing with compressed single files instead of archives because I was mostly working on rapidgzip.

Here are my results, on two archives, on Linux with 32 GB of RAM and 8 cores (i7-7820HQ):

This looks much more promising than the linked issue. Thank you for doing some benchmarks yourself. It is weird, though, how much it differs, but there are so many factors that could play a role, as you already stated.

Regarding libarchive

I noticed that libarchive was able to open some ZIP archives that zipfile was not able to open (sadly, I cannot share them, but maybe I could create a similar one).

ZIP supports many features that are not supported by all decoders. Some of these are:

  • The most common ZIP member compression is DEFLATE. All others might have only limited support. Here is a list:
     0 - The file is stored (no compression) **most common**
     1 - The file is Shrunk
     2 - The file is Reduced with compression factor 1
     3 - The file is Reduced with compression factor 2
     4 - The file is Reduced with compression factor 3
     5 - The file is Reduced with compression factor 4
     6 - The file is Imploded
     7 - Reserved for Tokenizing compression algorithm
     8 - The file is Deflated **most common**
     9 - Enhanced Deflating using Deflate64(tm)
    10 - PKWARE Data Compression Library Imploding (old IBM TERSE)
    11 - Reserved by PKWARE
    12 - File is compressed using BZIP2 algorithm **probably second most common**
    13 - Reserved by PKWARE
    14 - LZMA
    15 - Reserved by PKWARE
    16 - IBM z/OS CMPSC Compression
    17 - Reserved by PKWARE
    18 - File is compressed using IBM TERSE (new)
    19 - IBM LZ77 z Architecture 
    20 - deprecated (use method 93 for zstd)
    93 - Zstandard (zstd) Compression 
    94 - MP3 Compression 
    95 - XZ Compression 
    96 - JPEG variant
    97 - WavPack compressed data
    98 - PPMd version I, Rev 1
    99 - AE-x encryption marker (see APPENDIX E)
    
    Python's zipfile supports: uncompressed, DEFLATE, BZIP2, and LZMA.
  • ZIP64 support. Should also be supported by zipfile.
  • Encryption. Again, there are lots of different encryption methods supported by ZIP. And while zipfile seems to support encryption, I'm not sure whether it supports every incarnation, e.g.,
    • Data encryption
    • Metadata / Header encryption
    • Encryption algorithms supported by zip: traditional PKWARE encryption, DES, 3DES, original RC2 encryption, RC4 encryption, AES encryption, corrected RC2 encryption, corrected RC2-64 encryption, non-OAEP key wrapping, Blowfish, Twofish
    • PKCS#7 Encryption Recipient Certificate List
    • ...
  • ...

It's probably infeasible to support 100% of the ZIP format specification.
To find out why your archive does not work, you could use zipinfo and look for obscure features, e.g., file members that are shown neither as def nor as stor.
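
A Python near-equivalent of that zipinfo check (zipfile can usually still list members even when it cannot decompress them; unsupported methods only fail on read/extract):

    import zipfile

    with zipfile.ZipFile('archive.zip') as archive:
        for info in archive.infolist():
            encrypted = bool(info.flag_bits & 0x1)  # bit 0: PKWARE encryption
            # compress_type uses the method IDs from the list above (8 = deflate).
            print(f'{info.filename}: method={info.compress_type}, encrypted={encrypted}')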

(btw, it would be nice if we create a libarchive binding to have an option to open zip files with it instead of using zipfile).

I think this would be a standalone project, although the backend could be reused from a ratarmount implementation.
Although, after searching for "zipfile" on PyPI and seeing the plethora of attempts to improve upon the standard zipfile module, I'm not so sure anymore that another implementation would remedy the problems. Some of these should probably be merged into CPython instead.

Personally, I want to get around to implementing some kind of indexed_zip package that provides seeking inside single ZIP members / files by using rapidgzip / indexed_bzip2 as backends (again, only supporting a very restricted subset of ZIP seems possible to me). There are still some hurdles for that, though.
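
The first hurdle for such an indexed_zip would be locating a member's raw compressed stream. A sketch of that step (the function name is hypothetical): per the ZIP specification, the data starts after the 30-byte local file header plus the local filename and extra field, whose lengths are stored at header offsets 26 and 28.

    import struct
    import zipfile

    def member_data_offset(archive_path, member_name):
        """Absolute offset of a member's raw (compressed) data in the archive."""
        with zipfile.ZipFile(archive_path) as archive:
            header_offset = archive.getinfo(member_name).header_offset
        with open(archive_path, 'rb') as file:
            # The local header's filename/extra lengths can differ from the
            # central directory's, so read them from the local header itself.
            file.seek(header_offset)
            local_header = file.read(30)
            name_length, extra_length = struct.unpack('<HH', local_header[26:30])
        return header_offset + 30 + name_length + extra_length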

Regarding fsspec

Yes, it would be better to upstream support for passwords.

  • fsspec manages random access by iterating through all files within an archive (using the skip-entry functionality of libarchive), then extracting the file and putting it into a MemoryFile to have random access on it (see here).

This does sound less performant than ratarmount because the runtime adds up if there are millions of files, even with skipping over files.
But again, better to implement slow format support for now and then maybe roll out faster custom solutions in the future.

The issue here is that putting a file into a MemoryFile implies that every file accessed through it must fit into memory. On the data that I have used in different projects so far, sometimes you can make that assumption, sometimes you cannot.

The workaround of using MemoryFile (BytesIO would also work) is not optimal because of possible out-of-memory errors.
Except for small files, I would even prefer slower seek times, decompressing the file from the beginning, over this; but I think this can be implemented as a second step in a performance commit.

I'm not entirely sure how the caching is connected to libarchive support.
Yes, some caching is necessary for the simplest support of random access, but this is an implementation detail.
On the FUSE-Python-layer side, there is also caching of all open file objects, keyed by file handle.
But these file objects cannot easily be expired for now, and I don't see the necessity.
If the user opens millions of files without ever closing them, then the memory waste is not ratarmount's fault.

what you think about adding libarchive through an fsspec-style implementation with a (potentially optional) caching mechanism.

Ideally, I would want to keep the direct and indirect dependencies to a minimum, so a solution directly with libarchive sounds more amenable.

Fun fact: python-libarchive and libarchive-c can be installed at the same time without a conflict warning even though they have the same module name, which did cause errors such as AttributeError: module 'libarchive' has no attribute 'SeekableArchive'. This would probably deserve an issue report in both repositories.

@mxmlnkn
Owner Author

mxmlnkn commented Mar 20, 2024

I have given the libarchive backend another try in the equally named branch. It only uses simple libarchive function calls without any possibly buggy higher-level Python layers, as was the case before. This currently has the mentioned memory usage problem, though, as opened files are wholly extracted into memory. This should be fixed before merging.

There are also some conceptual problems with the index. The index would be incompatible with the TAR backend. But it might be possible, under current circumstances, to create an index with the libarchive backend and then try to use that index from another ratarmount installation that uses the custom TAR backend. I think I had some heuristic metadata verification, but I'll have to check that.

Furthermore, it would be nice if the backend-priority command line option also affected libarchive vs. the custom TAR / ZIP / other backends. Currently, it only affects the stream compression backends (bz2, gz, ...) for the custom TAR backend. This option could then be used to easily test the problem with index backend compatibility. It would also make handling of chimera files (think: a ZIP file appended to a TAR file) more reliable, as it could force the ZIP backend to be tried first. It would be another step to somehow communicate this priority to libarchive, though.

@hasB4K

hasB4K commented Mar 21, 2024

Awesome! Could you create a draft PR, maybe? It would help to see what changes were done, and we would have a place to discuss it.

Regarding the memory usage, I think we should let the user define a temp directory where we could keep extracted files; I could help with that if you want.

I'm not sure I grasp the entire issue regarding the index problem. As I mentioned before, it seems that you cannot get the real offset information of files using libarchive; you can only use the entryCount (as you did in your branch). So I think that this index should only be used by the libarchive backend. I think it's fine if it's not cross-compatible with the tar backend. But again, maybe I have not understood the issue here.

@mxmlnkn
Owner Author

mxmlnkn commented Mar 26, 2024

I think it's fine if it's not cross-compatible with the tar backend.

It is fine, but it should be detected. Currently, older versions would simply load an index created by the libarchive backend and would return input/output errors when trying to access a file. I have now added lots of checks to prevent this in the future, but that does not fix compatibility with older ratarmount versions. For now, I have decided not to write the index out because it doesn't help performance that much anyway. Each file seek currently has to reparse / go over all archive entries and therefore would be almost as slow as index creation. This cannot be solved. That's why the custom tar, bzip2, gzip, ... parsers in ratarmount exist in the first place.

Regarding the memory usage, I think we should let the user define a temp directory where we could keep extracted files; I could help with that if you want.

I think the better approach would be to implement seeking in LibarchiveFile without fully extracting the file. In the worst case, it can mean restarting parsing from the beginning of the archive file to seek backwards, but I think that's fine and how all other competitors work. And it wouldn't fail for very large files.

I have opened a PR. There are some open todos:

Blockers:

  • Fix AppImage again by adding libarchive and all its dependencies.
  • Implement an alternative LibarchiveFile implementation (or extend it) that does not do self.fileobj = io.BytesIO(buffer) but instead extracts on demand, implements forward-seeking by simply decompressing and throwing away the requested amount, and backward-seeking by reopening the archive and then forward-seeking to the requested absolute offset. Each LibarchiveFile opens a completely independent libarchive object and therefore works independently. This wasn't the case in python-libarchive, where one file access might invalidate another opened "LibarchiveFile"-like object. (A minimal sketch of this approach follows after this list.)
  • Fix pylint. For some reason it has dozens of false positives such as E1101: Module 'libarchive.ffi' has no 'read_open_filename_w' member (no-member). Adding libarchive-c to the pylintrc whitelist also doesn't seem to help. I simply suppressed the warning in the whole file.
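
A minimal sketch of the decompress-and-discard approach from the second blocker, using libarchive-c's iteration API (class and method names are made up, not the PR's actual code; every call opens its own reader, so accesses stay independent):

    import libarchive  # libarchive-c

    class StreamedArchiveMember:
        """Read a byte range of one member without extracting it fully."""

        def __init__(self, archive_path, member_path):
            self.archive_path = archive_path
            self.member_path = member_path

        def read_at(self, offset, size):
            # "Backward seeking" is implicit: every call reopens the archive
            # and decompresses up to the requested offset, discarding the rest.
            with libarchive.file_reader(self.archive_path) as archive:
                for entry in archive:
                    if entry.pathname != self.member_path:
                        continue
                    position, result = 0, b''
                    for block in entry.get_blocks():
                        begin = max(offset - position, 0)
                        if begin < len(block):
                            result += block[begin : begin + size - len(result)]
                        position += len(block)
                        if len(result) >= size:
                            break
                    return result
            raise FileNotFoundError(self.member_path)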

Nice to have:

  • Implement _findPassword, which should work similarly to ZipMountSource's, i.e., it should try to open the archive with each of the given passwords and somehow detect success.
  • Somehow make libarchive work with Python file-like objects, which have open, read, seek, ... methods. Currently, it only works with a file path or file descriptor, but for recursively mounted files, this is insufficient. Libarchive has a very generic open method that takes open, read, and close callbacks (note the absence of seek, because libarchive is only meant for non-seekable / streaming access). With that, it should be easily doable. The probably more time-intensive part is writing tests.
  • Implement better file format detection. Currently, the input files are manually tested for magic bytes. This becomes cumbersome because libarchive supports a myriad of formats. It might be possible to extend the existing five or so format tests with the missing rest, or everything regarding the format check would have to be refactored to use some generic test, e.g., a "try to open with libarchive and return true on success" function.
  • Add more tests. I have added some minimal 7z files but more tests would be nice.
  • Wait for Truncated 7-Zip file body (error code: -30) on archive_read_data with 7z archives containing larger (>= 32 MiB) files when skipping entries libarchive/libarchive#2106 to be fixed?
  • Reuse LibarchiveFile objects after they are closed to possibly open the next file faster. This would speed up the use case of iterating all files in the order they appear in the archive.
  • Run pytest tests in parallel.

If you want to give one of these a try, then let me know, else I'll hopefully slowly but steadily fix these open issues.

@mxmlnkn
Owner Author

mxmlnkn commented Apr 7, 2024

Merged

@hasB4K

hasB4K commented Apr 18, 2024

@mxmlnkn Thank you so much for your work! I thought I would have some time to actually help you on this...

Do you think there is a need for a cache system for libarchive? If so, I could do a PR later on.

@mxmlnkn
Owner Author

mxmlnkn commented Apr 18, 2024

@mxmlnkn Thank you so much for your work! I thought I would have some time to actually help you on this...

Do you think there is a need for a cache system for libarchive? If so, I could do a PR later on.

To cache small files? I'm not sure that is necessary, but some benchmarks to (dis)prove it would be needed. The LibarchiveFile implementation is buffered: it reads in 1 MiB chunks and, if possible, avoids full reparsing via libarchive by seeking only inside the buffer. This also means that small files <= 1 MiB are fully cached.

Currently, I'm working on a PySquashfsImage backend. The simple implementation in the squashfs branch already works, but it has some major performance issues that I'd like to try to fix before merging. Maybe some of that could even be merged upstream into PySquashfsImage.

The two other projects I arrived at in the linked issue from 3 days ago would be:

  • Add pyfatfs as backend.
  • Add the fsspec cloud backends, so that you can do stuff like: ratarmount s3://...archive.tar.gz mounted

I have not started with either of these two.

You could also still review the already merged PR/commits for libarchive or simply test them. Maybe there are still (performance) bugs there.
