pdf file inside a rar archive cannot be opened by zathura #136

Open
flux242 opened this issue Mar 25, 2020 · 14 comments
Comments

@flux242

flux242 commented Mar 25, 2020

I've cloned the sources today and built rar2fs. If I mount a rar archive containing a pdf file, zathura won't open the pdf, saying 'Document does not contain any pages'. If I copy the pdf file to a regular file system, zathura can open it. I believe this is related to the random access problem.

Additionally, I put the same pdf into a zip archive, fuse-mounted it using archivemount, and zathura opened it normally. So it really is rar2fs that produces this strange problem.

ps: some pdf files inside rar archives mounted by rar2fs can be opened normally. This is probably related to the pdf file structure.
psps: the problematic pdf I have could be subject to copyright claims, so I cannot simply attach it. But I'll try to find a similar file.

@hasse69
Owner

hasse69 commented Mar 25, 2020

Thanks for the issue report (and the archive).

I am no expert in the .pdf format; actually I know basically nothing about it.
But if I had to make a wild guess, size is part of the equation here.

If you mount the archive and then copy the file from the mount point to some location on your regular file system, does zathura open it without problems then?
Another wild guess is that this depends heavily on how the .pdf viewer chooses to interact with the file. If this were a generic problem, any .pdf viewer would experience the same issue; I need to test that.

Please note also that .zip and .rar are two very different compression implementations. Also be aware that even if it works with archivemount, that does not necessarily mean rar2fs is at fault here. Since the RAR compression/decompression algorithm (other than -m0) is built on sequentially decoded blocks, its design prevents any form of real random access. It may very well be that archivemount extracts the whole thing first and then operates on the monolithic result. That is not how rar2fs works, and it has been designed that way deliberately for several reasons.

@flux242
Author

flux242 commented Mar 25, 2020

Yes, as I already wrote - if the pdf file is copied from the rar2fs mount point into my home directory, then zathura works correctly. And archivemount does a FUSE mount as well.

To test, I'd create a 2 MiB file filled with some pattern, say the repeated sequence from 0 to 255. Then I'd try to randomly access a 256-byte chunk (or any other size) and check if the pattern holds. It could be done even with a perl script. Say, if I read 256 bytes beginning at offset 238774, then I should get ((238774 % 256) + i) % 256, where i is in [0..255]. (I hope I put it right)
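
A minimal sketch of such a check, in Python rather than perl (the file name and mount path are placeholders; the patterned file would first be packed into a rar archive, mounted with rar2fs, and then read through the mount point):

import os
import random

def create_pattern_file(path, size=2 * 1024 * 1024):
    # Byte at offset n is n % 256, i.e. the repeated 0..255 sequence.
    with open(path, "wb") as f:
        f.write(bytes(i % 256 for i in range(size)))

def check_random_chunk(path, chunk=256):
    # Read `chunk` bytes at a random offset and verify each byte
    # against the expected (offset + i) % 256 value.
    size = os.path.getsize(path)
    offset = random.randrange(0, size - chunk)
    with open(path, "rb") as f:
        f.seek(offset)
        data = f.read(chunk)
    for i, b in enumerate(data):
        if b != (offset + i) % 256:
            return False, offset, i
    return True, offset, None

# Create pattern.bin on the regular file system, pack it into a rar,
# mount with rar2fs, then check the file as seen through the mount, e.g.:
# create_pattern_file("pattern.bin")
# print(check_random_chunk("/mnt/rar/pattern.bin"))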

@hasse69
Owner

hasse69 commented Mar 25, 2020

What you have here is not really related to random access in general, but to the fact that zathura tries to read 4k at the very end of the file before reading from offset 0. I am not sure how this relates to the .pdf format in general, but I suspect most .pdf files and readers behave the same unless there are variants/versions of the format floating around. This is the exact same reason why indexing of e.g. AVI files inside compressed RAR archives does not work very well (the index table is at the end of the file). The benefit of AVI, though, is that most media players do not seem to care much and start playback of the contents anyway; it is just that major jumps in the stream are not possible unless they are covered by the I/O buffer in rar2fs.

And again, my primary guess here is that archivemount extracts the entire file to some temporary storage before giving control back to the reader, or that libarchive (which is what archivemount uses) has done some true magic here. I would really like to see that magic work on RAR archives. It is not possible using libunrar at least, and if they did not use that they would have to (more or less) implement the entire decompression algorithm and by that indirectly also the compression algorithm, which is prohibited and would be in strict violation of the commercial copyright license protecting it.
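
For context on why a viewer jumps to the end first: a conventional .pdf keeps a trailer at the end of the file whose startxref entry points at the cross-reference table, so readers typically fetch the last few KiB before anything else. A rough illustration of that lookup (not zathura's actual code):

import os
import re

def find_startxref(path, tail_size=4096):
    # The trailer near EOF contains "startxref", the xref offset and "%%EOF".
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        size = f.tell()
        f.seek(max(0, size - tail_size))
        tail = f.read()
    offsets = re.findall(rb"startxref\s+(\d+)", tail)
    return int(offsets[-1]) if offsets else None

# A viewer reads the tail, finds the xref offset, then seeks back to it,
# which is exactly the "read near EOF before offset 0" pattern above.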

Yes, as I already wrote - if the pdf file is copied from the rar2fs mount point into my home directory, then zathura works correctly.

That was not exactly what you said if you look back ;) Hence the question.

I am not too sure what I can do here right now other than explain the rationale behind why it works the way you described. Of course we could also start to extract everything ahead of time, but that would be a real pain for large contents. Possibly the access pattern could trigger such a fallback, but it might also make other things break.

EDIT:

I did a quick comparison, and not even zathura and gimp implement the same access pattern.
They are similar but not identical.

zathura

PID 28446 calling lread_rar(), seq = 1, size=4096, offset=1826816/0
PID 28446 calling lread_rar(), seq = 1, size=16384, offset=0/0
PID 28446 calling lread_rar(), seq = 2, size=32768, offset=16384/16384
...

gimp

PID 28446 calling lread_rar(), seq = 1, size=16384, offset=0/0
PID 28446 calling lread_rar(), seq = 1, size=4096, offset=1826816/0
PID 28446 calling lread_rar(), seq = 1, size=16384, offset=0/0
PID 28446 calling lread_rar(), seq = 2, size=32768, offset=16384/16384
...

What you can see here is that rar2fs performs a sort of replay trap, trying to restart the read so it becomes aligned. But the trap is triggered once for zathura and twice for gimp, which can be seen from the same sequence number appearing multiple times. Other than that they look fairly similar.
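
A quick way to count those replay-trap triggers in a debug log of this shape (assuming the lread_rar() line format shown above; any (PID, seq) pair that shows up more than once was replayed):

import re
from collections import Counter

def count_replays(log_lines):
    counts = Counter()
    for line in log_lines:
        m = re.search(r"PID (\d+) calling lread_rar\(\), seq = (\d+)", line)
        if m:
            counts[(m.group(1), m.group(2))] += 1
    # Every occurrence beyond the first for a given (PID, seq) is a replay.
    return sum(n - 1 for n in counts.values() if n > 1)

# On the excerpts above this yields 1 for zathura and 2 for gimp.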

@flux242
Author

flux242 commented Mar 25, 2020

Can an artificial read at offset 0 be inserted by rar2fs then, so that the first read is always done by rar2fs and discarded? Would that help?

@hasse69
Owner

hasse69 commented Mar 25, 2020

No, that is not really the problem.
The read is fine; it is simply trying to access data at the very end of the file. For that to work, the data has to be available (i.e. extracted), since there is no way (theoretically there is, but that would require changes in the algorithm) for RAR to extract randomly when its entire algorithm depends on a streaming approach.

But I did try to disable the long jump hack we have to specifically handle e.g. AVI files, and then it works as long as the I/O buffer is greater than or equal to the size of the .pdf file.
It works because what rar2fs does in that case is to wait for the I/O buffer to be filled and the data to become available. But to handle e.g. an AVI file using the same approach you would need a huge I/O buffer, which is allocated for each file open. If you were to open multiple files you would quickly run out of memory.

@hasse69
Owner

hasse69 commented Mar 25, 2020

But hang on, there might be a quick fix for this. I would need to carefully run some regression tests on it before it could be released to the public, but if you are willing to test a patch yourself it might be possible with just a few lines of code.

@flux242
Author

flux242 commented Mar 25, 2020

sure I can test it

@hasse69
Owner

hasse69 commented Mar 25, 2020

issue136.patch.txt

Please try this one.
Just stand in the root of the repo folder and do:

patch -p1 < issue136.patch.txt

EDIT: Note that the whole idea behind this is that you control it yourself using the size of the I/O buffer. An access far away from the current file position would otherwise trigger a long jump hack suitable only for some specific file formats. But if the I/O buffer is big enough, it can digest all the information and the hack is avoided. The default size of the I/O buffer is 4MB, which you can control using the --iob-size option.
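
After rebuilding with the patch, the mount could then look something like this (the paths are placeholders; judging by the comments that follow, the --iob-size value is given in MiB, so --iob-size=64 means a 64 MiB buffer):

rar2fs --iob-size=64 /path/to/archive/dir /mnt/rar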

@flux242
Author

flux242 commented Mar 26, 2020

nope, doesn't seem to help. Neither with 2.5 MiB nor with 25 MiB pdfs.

@hasse69
Owner

hasse69 commented Mar 26, 2020 via email

@flux242
Author

flux242 commented Mar 26, 2020

ok, with the patch applied and with --iob-size=64 it seems to work correctly for small files (less than 2 MiB) and for big files (bigger than 100 MiB). This covers all my use cases. Thanks

@hasse69
Owner

hasse69 commented Mar 26, 2020

It is an interesting observation though. It would mean the PDF file format varies quite a bit.
If large files like 100 MiB work, then it cannot be reading data from the end of the file immediately; that would simply not work with an I/O buffer of 64 MiB. Note that the size of the I/O buffer only indirectly affects the result; it is the size of the history window that really matters, and that is by default set to half the I/O buffer size. That is why you would need 64 MiB for e.g. a 25 MiB PDF: 32 MiB of history.
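
As a rough sizing rule following from that (the history window is half the I/O buffer, and the whole file needs to fit into the history window):

def min_iob_size_mib(pdf_size_bytes):
    # History window = half the I/O buffer, so the buffer should be at
    # least twice the file size, rounded up to whole MiB.
    mib = 1024 * 1024
    return -(-2 * pdf_size_bytes // mib)

# A 25 MiB PDF needs at least 50 MiB of I/O buffer, so --iob-size=64 as
# used above is enough and leaves a 32 MiB history window.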

@hasse69 hasse69 added this to In progress in Backlog Mar 29, 2020
@hasse69
Owner

hasse69 commented Mar 29, 2020

I will have to start looking into what can be done here. I have put the issue in the backlog.
I do not have an ETA though since this will require some careful redesign and currently I am a bit short on time.
