
read/readall dumps the decompressed files to memory, instead of streaming them #579

Open
jaboja opened this issue Mar 17, 2024 · 2 comments
Labels: enhancement (New feature or request) · for extraction (Issue on extraction, decompression or decryption) · help wanted (Extra attention is needed)

Comments


jaboja commented Mar 17, 2024

There is a problem with reading large files whose decompressed form exceeds the available RAM:

The library (namely the read/readall methods) first decompresses the file into memory using BytesIO, and then returns that BytesIO object. While that may work well for small files, it fails with an out-of-memory error for bigger ones.

It would be better if the library streamed the files, just like standard file I/O does.

To Reproduce

  1. Download a huge 7z file, e.g. this Wikipedia dump:
     wget https://dumps.wikimedia.org/plwiki/20240301/plwiki-20240301-pages-meta-history1.xml-p1p6814.7z
  2. Try to read it:
     import py7zr
     archive = py7zr.SevenZipFile('plwiki-20240301-pages-meta-history1.xml-p1p6814.7z', mode='r')
     for _ignore, content in archive.readall().items():
         print(content.read(10))
  3. If the machine has less than 36 GB of memory, the script will keep allocating memory until it runs out, and then crash.

Expected behavior
The library should allocate only as much memory as is actually needed for the data requested, and allow streaming files even when their decompressed form exceeds the available memory and disk space.
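
For illustration only, a streaming interface might look roughly like this (the open() method below is hypothetical and not part of py7zr's current API; getnames() does exist):

    import py7zr

    with py7zr.SevenZipFile('plwiki-20240301-pages-meta-history1.xml-p1p6814.7z', mode='r') as archive:
        # hypothetical open(): returns a file-like object that decompresses
        # on demand instead of materializing the whole member in a BytesIO
        with archive.open(archive.getnames()[0]) as f:
            while chunk := f.read(1 << 20):  # only ~1 MiB resident at a time
                print(len(chunk))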

Environment:

  • OS: Ubuntu 22.04.3 LTS
  • Python 3.10.12
  • py7zr version: 0.21.0
  • Disk space: 10 GB
  • Memory: 2 GB

(the Wikipedia dump file used as an example is 246.6 MB in compressed form, and 36 GB when decompressed)

@miurahr added the enhancement, help wanted, and for extraction labels Mar 20, 2024
miurahr (Owner) commented Apr 2, 2024

There is a main loop in SevenZipFile#_extract which looks like this:

    for f in self.files:
        # if-else block:
        # if extracting into memory
            _buf = io.BytesIO()
            self.worker.register_filelike(f.id, MemIO(_buf))
        # else, in the default case
            self.worker.register_filelike(f.id, outfilename)

    # now the preparation of the extraction index is finished;
    # then call the worker with the 7z file pointer and the target path
    self.worker.extract(self.fp, path, parallel=...)

With this structure, the Worker class creates threads and extracts solid blocks in multiple threads when possible.
If you want to implement streaming, this needs to be changed significantly. When you get an idea how to improve it, please tell me.

py7zr originally extracted files into the file system; @Zoynels contributed the memory IO feature in #111.
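
One possible direction, as a rough sketch only (the StreamIO class below is hypothetical, not py7zr's MemIO): register a file-like object whose write() blocks on a bounded queue, so the extraction thread produces chunks while the caller consumes them, and at most a fixed number of chunks is resident at once:

    import queue
    import threading

    class StreamIO:
        # hypothetical sink: write() blocks when the queue is full, so the
        # producer (extraction thread) can never run ahead of the consumer
        # by more than maxsize chunks
        _EOF = object()

        def __init__(self, maxsize=16):
            self._q = queue.Queue(maxsize=maxsize)

        def write(self, data):            # called by the extraction worker
            self._q.put(bytes(data))
            return len(data)

        def close(self):                  # worker signals end of the member
            self._q.put(self._EOF)

        def chunks(self):                 # iterated by the caller
            while (item := self._q.get()) is not self._EOF:
                yield item

    # demo: a producer thread stands in for the decompression worker
    def producer(sink):
        for _ in range(1000):
            sink.write(b'x' * 65536)
        sink.close()

    sink = StreamIO()
    threading.Thread(target=producer, args=(sink,), daemon=True).start()
    print(sum(len(c) for c in sink.chunks()))   # memory use stays bounded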

@starkapandan

+1 on this issue. Probably not an easy fix, but on large 7z archives, being able to properly stream the contents is the right approach; otherwise the only alternative is to extract the archive first, which is not the most elegant or quick solution. Python's tarfile and zipfile libraries, for example, achieve this properly, e.g. no memory crash when reading a large file.
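
For comparison, the stdlib pattern referred to above: zipfile.ZipFile.open() returns a file-like object that decompresses incrementally, so reading in chunks keeps memory flat regardless of the member's size ('big.zip' is a placeholder):

    import zipfile

    with zipfile.ZipFile('big.zip') as zf:          # any large archive
        with zf.open(zf.namelist()[0]) as member:   # streams, never fully buffered
            while chunk := member.read(1 << 20):    # 1 MiB at a time
                pass  # process the chunk here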

[screenshot: the part of the code where memory fills up]
I don't know how the change needs to be implemented, so I'm hopefully leaving that to someone with more experience in this library, but the screenshot shows the part where it exhausts the memory. Normally this is not an issue for regular file handles, since a chunk is read and freed after being written. The memory-read design, where a memory stream is handed to the existing decompress function, was probably done for simplicity, but the issue is that the memory object is, as the name states, always in memory. So even if the decompress function does its job correctly, i.e. reads and frees memory, the way the memory object is used causes everything up to the end of the file to be loaded into that object, filling memory up.

The ideal solution for this kind of stream reader is to somehow expose the control that the decompress method has over the buffered reader (i.e. the one that actually reads the 7z source) and create some form of abstraction over it with a regular BytesIO or any other kind of memory stream, so that the end user, i.e. the implementer, gets control over what is loaded into memory, increasing memory efficiency drastically.
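
A minimal sketch of that pull-based shape, using the stdlib lzma module rather than py7zr's internals (7z's LZMA2 framing differs, so this only illustrates the pattern, not a drop-in fix): the caller's read(n) pulls compressed input in small chunks and decompresses no more than n bytes at a time:

    import lzma

    class PullStream:
        # pull-based reader: decompress only as many bytes as the caller
        # asks for, fetching compressed input in small chunks on demand
        def __init__(self, raw, chunk=65536):
            self._raw = raw                       # compressed source (file-like)
            self._dec = lzma.LZMADecompressor()
            self._chunk = chunk

        def read(self, n):
            out = b''
            while len(out) < n and not self._dec.eof:
                if self._dec.needs_input:
                    data = self._raw.read(self._chunk)
                    if not data:                  # truncated input: stop, don't spin
                        break
                else:
                    data = b''                    # drain already-buffered output
                out += self._dec.decompress(data, max_length=n - len(out))
            return out

    # usage: memory use stays around 1 MiB no matter how big data.xz is
    with open('data.xz', 'rb') as f:
        stream = PullStream(f)
        while block := stream.read(1 << 20):
            pass  # process the block here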
