Persistent file index table #164

Open
mercora opened this issue Jun 25, 2021 · 20 comments

Comments

mercora commented Jun 25, 2021

Hello, and thanks for this great piece of software!

I wondered if it would be possible to store the file index table alongside the original files. I first had the impression that the --save-eof parameter would do this, because it mentions "r2i" files, which sounded a lot like what I am asking for. However, I am not able to find any of these files in the source folder, and it appears that file indexing recurs after a remount even with this parameter present. So that was probably wishful thinking, and it really stores the EOF "somewhere", probably to speed up readers that need access to the end of the file in order to interpret it...

I think it could be very beneficial to store the results of analyzing archives alongside those archives, much like sfv or quickpar files are often found next to them. This could speed up scanning quite drastically when remounting slower mounts, and in addition it could be done outside of rar2fs too, i.e. I could trigger the creation whenever I think it is an appropriate time, or even share the results.

Is this something you would consider working on, or alternatively, would you accept pull requests attempting to implement it? I am not really proficient in writing C code, but I would give it a try and would likely succeed with some guidance... In either case I would be quite happy to discuss this here :)

hasse69 (Owner) commented Jun 26, 2021

Hello, just to make sure we are on the same page with respect to definitions and the term "index", have you read this?

https://github.com/hasse69/rar2fs/wiki/mkr2i

mercora (Author) commented Jun 26, 2021

Sorry, I could have been clearer on this. I think it is a quite similar idea but a more general approach. I made some assumptions about how this software works: in order to generate a list of all files and their offsets, one needs to read the file headers, which are scattered around inside the archives. That is, the data follows directly after a file header, and to read the next header one has to seek to it, using the length of the data as the offset to the next header. This is repeated until the end of the archive is reached...
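To illustrate, here is a rough sketch of that scan pattern in C; the struct layout and field names are made up for the example and do not match the real RAR format or the rar2fs code:

```c
#include <stdio.h>
#include <stdint.h>

/* Hypothetical fixed-size header, NOT the real RAR layout. */
struct file_header {
    char     name[256];
    uint64_t packed_size;   /* length of the data following the header */
};

/* Walk the archive: read a header, then seek past its data to the
 * next header, until EOF. This is the read->seek->read pattern.
 * (Real code would use a 64-bit-safe seek such as fseeko.) */
static void scan_archive(FILE *fp)
{
    struct file_header hdr;

    while (fread(&hdr, sizeof hdr, 1, fp) == 1) {
        printf("%-20s data at offset %ld\n", hdr.name, ftell(fp));
        if (fseek(fp, (long)hdr.packed_size, SEEK_CUR) != 0)
            break;
    }
}
```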

I further assumed that whenever this is done, the result of the scan is cached in memory so it won't run again the next time the user traverses a tree containing the same archive. So if the result of a previous scan were saved on disk alongside the archives, the code could just pick up the stored results without reading into each archive volume, even when the archive is encountered for the very first time during the current execution.

This is very similar to how the quick open information in RAR5 archives works, but it would be stored out of band. It might even make sense to reuse the structures used there for the on-disk storage.

In case you don't know it: RAR5 introduced a new method to retrieve file headers more quickly, called "quick open information". It works by storing a locator record inside the main archive header's extra area, which indicates the offset inside the archive where the quick open information header is stored. The quick open information records themselves store the offset of the data inside the archive and a cached copy of that same data. This is used to store file headers and their offsets into the archive, and it allows files to be listed quickly and to seek directly to the right offsets for them.

This is more or less what I would propose to do, but instead of depending on the archive already containing this information, I would like to store it alongside the archive. So instead of using a locator record to find the quick open information records, the code could check for the existence of a file containing them, and instead of storing relative offsets, the records would store absolute offsets.
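As a minimal sketch, one record in such a hypothetical side-car file could look like this (nothing like it exists in rar2fs today):

```c
#include <stdint.h>

/* One record in a hypothetical side-car index file. It mirrors the
 * idea of a RAR5 quick open record, but stores an absolute offset so
 * it can live outside the archive. */
struct index_record {
    uint64_t abs_offset;    /* absolute offset of the header in the volume */
    uint32_t header_size;   /* length of the verbatim header copy below */
    /* followed by header_size bytes: the cached copy of the file header */
};
```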

hasse69 (Owner) commented Jun 26, 2021

I am not really sure I understand the purpose of what you wish to do here.

File offsets etc. are already cached today; headers are normally read only once for archive contents. There is a cache for both file metadata and directory entries. The cache is not mount persistent, though, and the reason why I do not wish to make it so has been discussed before.

For uncompressed archives (-m0) rar2fs uses its own file extraction routine, but for compressed and encrypted archives it relies completely on libunrar.

So if I could only understand the true nature of your suggestion and what you think would be gained/improved by it, I might be able to make this discussion a bit more constructive.
Is it access to the archives themselves, or simply the population of the cache at mount time?

As you might know by now, the RAR archive format does not support random access, since it uses a sequential block cipher/decompression algorithm that makes each block of data depend on the previous one. Thus there is currently no support in libunrar for accessing any given offset in a file. That is the real reason why rar2fs deals with uncompressed archives internally, to allow this.

mercora (Author) commented Jun 26, 2021

File offsets etc. are already cached today; headers are normally read only once for archive contents. There is a cache for both file metadata and directory entries. The cache is not mount persistent, though, and the reason why I do not wish to make it so has been discussed before.

I tried to find these discussions without much success; I just found another issue with the same remark. Can you point me to the previous occasions where you discussed these reasons?

So if I could only understand the true nature of your suggestion and what you think would be gained/improved by it, I might be able to make this discussion a bit more constructive.
Is it access to the archives themselves, or simply the population of the cache at mount time?

Some network filesystems can be very slow when they encounter these read->seek->read patterns. I use a separate cache for these filesystems, so the scanning is ideally done right after the archives have been written, while they are very likely still completely cached. I can do this by traversing the tree, but the results are lost on remount. I am not really suggesting to populate the cache from disk, but to change the way the cache is populated when the proposed index file is present. This could result in a dramatic improvement in the time it takes to populate the cache, with hopefully obvious benefits.

As you might know by now, the RAR archive format does not support random access, since it uses a sequential block cipher/decompression algorithm that makes each block of data depend on the previous one. Thus there is currently no support in libunrar for accessing any given offset in a file. That is the real reason why rar2fs deals with uncompressed archives internally, to allow this.

I mainly deal with uncompressed archives, but I am unsure why this would matter here. I am not proposing to store the location of compressed data inside the archives, i.e. a table translating compressed offsets to uncompressed offsets, as that would be impossible to use, like you already stated. I really only want to avoid the need to hit the archives in order to find file names and their offsets. As far as I can tell, this should also work for compressed archives.

hasse69 (Owner) commented Jun 26, 2021

Then I understand. Have you tried the warmup mount option? It was designed to populate the cache in the background, and a lot faster than you would be able to achieve manually. One reason is that the warmup completely bypasses the FUSE layer.

From what I can tell from your description, I do not really see how it is any different from making the current file cache persistent across mounts. The file cache is implicitly a cache of archive contents. One similar discussion can be seen below.

#135

mercora (Author) commented Jun 26, 2021

Have you tried the warmup mount option? It was designed to populate the cache in the background, and a lot faster than you would be able to achieve manually. One reason is that the warmup completely bypasses the FUSE layer.

It definitely helps enormously to prepopulate the cache, but for me it comes with the same cost of having to do this read->seek->read pattern. I actually don't even want to prepopulate the cache, for this reason: the older my archives become, the less likely they are to be accessed, and the lower the chances of an archive being accessed, the more resources I am wasting scanning it for apparently no good reason...

From what I can tell from your description, I do not really see how it is any different from making the current file cache persistent across mounts. The file cache is implicitly a cache of archive contents.

I do understand that having a persistent cache would be a complex undertaking because of all the things that could change between mounts and invalidate entries. For this reason I am actually not proposing to read back the cache from disk. Instead, I want the cache to be populated on access, like it already is, but the file headers to be cached should be read from a separate file if one is present. This is what I mean when I say I am not suggesting to store the cache itself on disk.
This could be a single read operation instead of multiple read->seek->read operations... The actual code that handles caching could probably be left completely unchanged; the only thing that changes is where the data to be cached comes from.
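Roughly like this, where every name is hypothetical and the two helpers stand in for whatever rar2fs actually does internally:

```c
#include <limits.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helpers: one sequential read of the side file vs. the
 * usual read->seek->read scan of the archive volumes. */
int read_headers_from_index(const char *index_file);
int scan_headers_from_archive(const char *archive);

/* Populate the cache from the side file when present, otherwise fall
 * back to scanning the archive. The caching code itself is unchanged;
 * only the source of the header data differs. */
int populate_cache(const char *archive)
{
    char side[PATH_MAX];

    snprintf(side, sizeof side, "%s.r2i", archive); /* hypothetical suffix */
    if (access(side, R_OK) == 0)
        return read_headers_from_index(side);
    return scan_headers_from_archive(archive);
}
```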

hasse69 (Owner) commented Jun 26, 2021

OK, so if I understand you correctly, you wish for some simplified header data to be available "off-line", so to speak.
Thus you would put such entries locally on your file system, avoiding the need to access the slow network?

mercora (Author) commented Jun 26, 2021

Actually, I would like to store the entries alongside the archives. So there would still be a cost associated with accessing them, but this ensures some consistency between the data on disk and the headers stored there. Even so, accessing a single file will improve things dramatically for me.

The filesystem I currently use (s3ql) creates "blocks" of a specified size for files larger than a single "block". When I access a file at an uncached offset, the whole "block" needs to be downloaded. This means that just reading the few bytes containing the headers ends up in multiple large transfers. There is a tradeoff to be made here: either I have very slow writes with a small block size (more network activity), or I have slow reads with a larger block size. The reasons are somewhat similar to why we can't seek into compressed data inside the archive without extracting it from the start.
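To put made-up numbers on it: with a 10 MiB block size, an archive whose headers happen to sit in five different blocks forces roughly 50 MiB of downloads just to read a few KiB of header data, whereas a side file holding those same headers would be one small, sequential read.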

hasse69 (Owner) commented Jun 26, 2021

But why not simply use --seek-length=1?
I am still a bit unsure what you would gain here. At the end of the day you need to populate the cache anyway, and it would not be satisfied with some simplified header data. It needs the real deal.

mercora (Author) commented Jun 26, 2021

But why not simply use --seek-length=1?

Because I definitely do have archives with multiple large files stored inside, which have their headers scattered around multiple offsets in multiple volumes. I also don't like that approach because it is error prone and gives no indication of the failure mode when it happens.

I am still a bit unsure what you would gain here. At the end of the day you need to populate the cache anyway, and it would not be satisfied with some simplified header data. It needs the real deal.

It can be the real deal. I suggest using the same data structure as is used in RAR5 quick open information records; this would be equivalent to reading them from the archive. I never really said I want "simplified" headers; in fact, I really want "the real deal" in order to avoid the access pattern mentioned. Accessing some large archives can take up to multiple minutes for me, and even if just a single file is to be accessed, multiple large transfers for unrelated headers will be triggered...

Look, I am fine with it if you don't feel comfortable having this implemented, and I think I have not much more to say about the benefits of my proposal. So, in order to save both of us some time: are you still with me?

hasse69 (Owner) commented Jun 26, 2021

It has very little to do with whether I am comfortable or not. I am always open to suggestions, but I need to question them as well, since I am fairly sure I know how rar2fs works 😉

So no offense here; I am simply trying to make sure you understand what a change like this might require in effort, and whether it would really yield accordingly.

I am prepared to listen, but since you seem more convinced than I am, feel free to create a pull request. I am happy to assist you in any way you need to accomplish what you believe would benefit your use case. As long as legacy behavior remains intact, of course.

mercora (Author) commented Jun 26, 2021

It has very little to do with whether I am comfortable or not. I am always open to suggestions, but I need to question them as well, since I am fairly sure I know how rar2fs works

I am sorry if I came across as rude. I totally understand that you are the one who knows best how any of this works, and I definitely value your opinion on my proposal accordingly. It just felt like I was trying to convince you of something you knew wouldn't work the way I imagined, and having solutions suggested that won't help in my use case made me feel misunderstood at the same time.

So no offense here; I am simply trying to make sure you understand what a change like this might require in effort, and whether it would really yield accordingly.

No offense taken, really. I can't ask for more.

I am prepared to listen, but since you seem more convinced than I am, feel free to create a pull request. I am happy to assist you in any way you need to accomplish what you believe would benefit your use case. As long as legacy behavior remains intact, of course.

That is nice! I will give it a try, although you should be prepared for me to make mistakes and to need a proper review loop, because I am not really proficient in writing C code. But I will give it a go. I just want to make sure the work actually has a chance to be merged, as I would otherwise just be wasting time...

karibertils commented

I know making the full file cache persistent across mounts has been discussed several times and found to be not feasible. But reading over this issue makes me wonder yet again whether it would be possible to speed up parts of populating the cache with the approach that was discussed here.

When using rar2fs on a remote file system, opening non-cached folders containing archives tends to be slow. If the remote file system has a lot of data and high latency, it can easily take days to populate the full cache, which then has to be partly redone after adding/removing a folder. The warmup, even if faster, can also easily take days on a slower remote.

I guess a good amount of the time that goes into populating the cache is spent reading the archive headers and finding the file offsets. I wonder if it would be possible to have rar2fs save the real header (with file names & offsets) to a local folder (which might mirror the remote folder layout) when the archives are accessed, so that next time it could read this info from faster local storage?
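As a sketch of the path mapping I mean, where the cache root and suffix are just examples:

```c
#include <stdio.h>

/* Map a remote archive path into a local header-cache tree that
 * mirrors the remote layout, e.g.
 *   /remote/movies/foo.rar -> /var/cache/r2i/remote/movies/foo.rar.hdr
 * The cache root and ".hdr" suffix are hypothetical. */
static int header_cache_path(const char *archive, char *out, size_t outlen)
{
    int n = snprintf(out, outlen, "/var/cache/r2i%s.hdr", archive);
    return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}
```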

I imagine this might also help speed up local mounts, since you could have your data on slow spinning disks while these header files are placed on super fast flash storage like an Intel Optane.

hasse69 (Owner) commented May 7, 2022

I understand your concern here, no question about that. I also think I have a pretty good view of what it is you wish to solve. The reason why I am a bit reluctant to throw myself into this is not its complexity; it is not that hard to implement a "cache" like the one you describe, really. The issue I have with it is that a cache is no better than its guarantee to remain consistent with "reality". When/if a cache cannot make such guarantees, it is not a cache anymore; it is something else in my book. I can only assume that is why most network file systems do not cache things very aggressively, if at all, since maintaining cache coherency would basically require the same amount of traffic as not having a cache, right?
So, in your case, you are actually prepared to cache metadata that might in fact be completely wrong compared with where the physical files reside?

hasse69 (Owner) commented May 7, 2022

And I assume here that what you are really after is that the "remote" side would basically create a simplified header (smaller footprint) that the "local" side would try to access first, before trying to access the actual archive, right? In that case I would not really qualify this as a cache but rather as some intermediate fast path. The only issue here is how to accurately detect that this simplified header is in need of a "refresh". Potentially the same mechanism used for detecting external changes to the source folder could be used, but I need to give that some more thought.

Again, your options here are to store only a simplified header on the "remote" to reduce network and processing, or, as you say, to completely mirror the remote locally. The latter is again what might cause problems when it comes to ensuring full consistency.

karibertils commented May 7, 2022

Yes, correct. Headers with a small footprint on the "local" side only, and if they exist, they are read instead of the headers on the remote side. If the headers don't exist locally, they get created when the archives are accessed for the first time.

In my case I never ever update the rar archives, so they might as well be assumed permanently read-only. I assume many have a similar use case, which should simplify keeping consistency.

In other cases, where the archives are being updated, I guess info like byte size + modification timestamp could be used to refresh the cache for individual archives. If both size + timestamp are the same but the content was updated regardless (probably a rare case?), then I would be fine with getting a read error or any unspecified behaviour (as long as it does not lead to a crash or destroyed data) when trying to access that particular content. In that rare case it would be nice to be able to wipe the associated "header cache" and try again, and rar2fs would afterwards fetch the new header and work properly for that particular content.
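A minimal sketch of that check, assuming the size and mtime observed when the header was cached are stored next to it:

```c
#include <stdbool.h>
#include <sys/stat.h>

/* Metadata recorded when the header was cached (hypothetical layout). */
struct index_meta {
    off_t  size;
    time_t mtime;
};

/* The header cache entry is stale if the archive is gone or its size
 * or modification time no longer match what was recorded. */
static bool index_is_stale(const char *archive, const struct index_meta *m)
{
    struct stat st;

    if (stat(archive, &st) != 0)
        return true;
    return st.st_size != m->size || st.st_mtime != m->mtime;
}
```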

rar2fs already lists non-archive folders fast when folders have been added/removed. So no change there.

If the user wipes a folder containing an archive on the mounted path, then the associated header file should never get accessed, since the archive file doesn't exist anymore. I don't see any problem with having stale cached headers that are never used, since they should have a very small footprint. If they bother someone, they can just wipe the entire header cache and start filling it again.

If the user adds a folder containing an archive, rar2fs will create the header when the archive is opened for the first time.

I think it should be consistent enough this way?

hasse69 (Owner) commented May 7, 2022

If you give this some extra thought, I think you will soon realize that what you are after here is basically a complete FUSE project on its own: a RAR-aware bind mount that could keep local copies of any number of archive/volume files with only their headers, while mirroring the rest like any other bind-style mount. Then rar2fs could mount itself on top of that; let's give it the working name 'rarbindfs' 👍

I think rar2fs should remain focused on what it was designed for initially.

karibertils commented

Yeah, I already thought about that. There might already be some projects out there that do similar caching, except they cache everything that goes through instead of just the headers. It might be possible to modify an existing FUSE project to do this.

But then I also thought about how projects like rclone did decide to include similar features in their own project; they have the --vfs-cache-mode full option to enable it. So maybe it makes sense to have it included in rar2fs, since rar2fs is already dealing with these headers, and this kind of functionality only really makes sense to improve the rar2fs experience and probably nothing else.

hasse69 (Owner) commented May 7, 2022

Sure, it is a fair point. I will of course consider this and see if there is something that can be done here.

philamp commented Oct 8, 2023

I would also be interested.
Would it be possible to reconstruct a sparse "quick open" file that would then be used to rebuild the index without seeking in the RAR file?
In other words, the cache would also write index files to disk that would be read when rar2fs is restarted, one file per archive.
I understand the problem of cache integrity and purpose, but for read-only contexts it would really be cool to have this feature. Thanks for all the work already done, btw.
