SQLite database is created in memory #85

Open
Vadiml1024 opened this issue May 25, 2022 · 13 comments

@Vadiml1024

When using the -r option, I'm getting log messages about SQLite databases being created in :memory: even when --index-folder is specified.
Is that normal?

Btw, there is a separate DB for each .tar file; wouldn't it be more efficient to have only one DB?

@mxmlnkn
Owner

mxmlnkn commented May 25, 2022

Yeah, this is normal, at least for now. The problem is deciding under which name to store the index data for recursive archives. I think this issue might be a duplicate of #79.

@Vadiml1024
Author

Vadiml1024 commented May 25, 2022 via email

@mxmlnkn
Owner

mxmlnkn commented May 25, 2022

What about storing all .tar indices in a single DB?

As for the further databases caused by recursive archives, I think I already answered that question above.

Do you mean when using the union mounting feature like so: ratarmount file1.tar file2.tar mountpoint? In this case, I think it is better to have one DB per archive in order to increase reusability when, e.g., trying to mount only file1.tar or when trying to add another archive to the union mount: ratarmount file1.tar file2.tar file3.tar mountpoint.

What is your use case?

@Vadiml1024
Author

Maybe I expressed myself incorrectly...
Actually, for each .tar file ratarmount creates an SQLite database specific to that .tar.
I thought it might be more efficient to have ONE database which contains the data for all archives simultaneously.
Of course, this would require significant modifications to the existing codebase, but nothing too complicated, I think.
The idea would be to assign a virtual_inode_number to each archive and include it as a key field in all tables of this unified DB...
The advantage of this approach is that it could easily be adapted to other SQL-based databases, which is useful when mounting directories with a LOT of really BIG archives. I'm talking about disks with several TB of data and archives of hundreds of GB with more than 100K files inside.
This is actually my use case.
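
A minimal sketch of what such a unified index could look like (the table and column names below, including archive_id, are hypothetical, not ratarmount's actual schema):

```python
import sqlite3

# Hypothetical unified index: one database for all archives, with an
# "archives" table assigning an ID to each archive and a single "files"
# table keyed by (archive_id, path).
connection = sqlite3.connect("unified-index.sqlite")
connection.executescript(
    """
    CREATE TABLE IF NOT EXISTS archives (
        archive_id   INTEGER PRIMARY KEY,
        archive_path TEXT UNIQUE
    );
    CREATE TABLE IF NOT EXISTS files (
        archive_id INTEGER,
        path       TEXT,
        offset     INTEGER,
        size       INTEGER,
        PRIMARY KEY (archive_id, path)
    );
    """
)

# Registering an archive and one of its members:
archive_path = "outer.tar/inner.tar"
connection.execute("INSERT OR IGNORE INTO archives (archive_path) VALUES (?)", (archive_path,))
archive_id = connection.execute(
    "SELECT archive_id FROM archives WHERE archive_path = ?", (archive_path,)
).fetchone()[0]
connection.execute("INSERT OR REPLACE INTO files VALUES (?, ?, ?, ?)", (archive_id, "/some/file", 512, 1024))
connection.commit()
```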
Thanks to your advice, I've implemented a kind of hybrid between guestmount and ratarmount.
I use libguestfs to mount .iso, .img, *.ova, and *.vmdk files,
then I create a temp dir containing mount points (with the help of mount --bind) for the above files,
and then I launch ratarmount -r -l to mount this temp dir.
Given that the disk images contain big archives with archives inside, and that ratarmount uses :memory: to index archives inside archives, the memory consumption is pretty impressive, hence my ideas about reorganizing the DB.

@mxmlnkn
Owner

mxmlnkn commented May 27, 2022

Ah, I see.
I think this really is closely related to #79 then, but it goes one step further and would also combine data from "sibling" archives (those in the same bind-mounted folder), not just "descendants" (recursively nested archives).

But maybe your problem will also disappear when using the new --recursion-depth 1 argument from #84.
Of course, this is only possible if you don't want to mount recursively deeper than that.
If it still creates indexes in memory, then that might be because it can't find a suitable writable location, and --index-folders might help. I noticed that you already tried that out... Now I kinda understand your problem.

When using the -r option, I'm getting log messages about SQLite databases being created in :memory: even when --index-folder is specified. Is that normal?

Could you paste one of those warnings? I'm beginning to doubt that it is normal. Also, what is the compression chain? It should only try to use an in-memory database in circumstances like mounting a compressed tar that is inside another archive.

@Vadiml1024
Author

Vadiml1024 commented May 28, 2022

Could you paste one of those warnings? I'm beginning to doubt that it is normal. Also, what is the compression chain? It should only try to use an in-memory database in circumstances like mounting a compressed tar that is inside another archive.

That is precisely my case...
So it seems to be expected behavior.

@mxmlnkn
Owner

mxmlnkn commented May 28, 2022

Could you paste one of those warnings? I'm beginning to doubt that it is normal. Also, what is the compression chain? It should only try to use an in-memory database in circumstances like mounting a compressed tar that is inside another archive.

That is precisely my case... So it seems to be expected behavior.

Unfortunately, yes. I'll try to fix it, but it might take a while :/. PRs are welcome...

My basic idea to fix this is outlined in #79. The downwards-compatible version would simply add a table for each recursively contained TAR. Simply adding the entries to the existing table won't work because there would not be enough information: the table basically just stores names and offsets. It would additionally need to save something like the path to the recursive archive, but that seems like a waste of space and might also be harder to implement because, as it is implemented now, multiple SQLiteIndexedTar instances are created, one for each recursive TAR, and they basically don't know about each other.
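
As a rough illustration of that downwards-compatible variant, each recursively contained TAR could get its own files table inside the existing index (the naming scheme below is hypothetical, not the actual index layout):

```python
import re
import sqlite3

connection = sqlite3.connect("outer.tar.index.sqlite")

def table_name_for(inner_archive_path):
    # Derive a per-archive table name from the path of the nested TAR,
    # e.g. "/nested/inner.tar" -> "files_nested_inner_tar" (hypothetical scheme).
    return "files_" + re.sub(r"[^A-Za-z0-9]+", "_", inner_archive_path).strip("_")

for inner_archive_path in ["/nested/inner.tar", "/other.tar"]:
    connection.execute(
        f'CREATE TABLE IF NOT EXISTS "{table_name_for(inner_archive_path)}" '
        "(path TEXT PRIMARY KEY, offset INTEGER, size INTEGER)"
    )
connection.commit()
```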

Well, instead of that large amount of work, it might be simpler to support writing indexes of in-memory file objects out to the index folder. The only problem is somehow generating a stable name. E.g., using a hash of the whole archive would be a good name, but it would be too cost-prohibitive to calculate. A hash over the metadata might work though, as that data has to be read anyway and should be orders of magnitude smaller than the file contents. And the file contents don't matter anyway.

But, in order to speed up loading with existing indexes, I wouldn't be able to check all metadata, only, say, the first 1000 entries. I'm already doing something similar to detect TARs which have been appended to. It would still only be a heuristic, nothing 100% stable :/.
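
A sketch of how such a metadata hash could be turned into an index file name (a hypothetical helper, not ratarmount's actual code; it only hashes the headers of the first 1000 members, so it remains the heuristic described above):

```python
import hashlib
import tarfile

def index_name_from_metadata(fileobj, max_members=1000):
    """Derive a stable-ish index name by hashing member metadata
    (names, sizes, mtimes) instead of the file contents."""
    digest = hashlib.sha256()
    # Streaming mode so this also works for non-seekable, in-memory file objects.
    with tarfile.open(fileobj=fileobj, mode="r|*") as archive:
        for i, member in enumerate(archive):
            if i >= max_members:
                break
            digest.update(f"{member.name}\0{member.size}\0{member.mtime}\n".encode())
    return digest.hexdigest() + ".index.sqlite"
```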

Currently, if the index cannot be placed directly beside the archive, it will be placed in the home folder with a kind of cleaned-up path to the archive as its name. I might simply store the indexes of those recursive TARs with their inner path appended to the path of the outer TAR. That should be unique enough. If the path becomes too long to use as a file name, I could simply hash it.
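
That second idea could look roughly like this (again a sketch with a hypothetical naming scheme):

```python
import hashlib
import os

def index_file_name(outer_archive_path, inner_path, max_length=200):
    """Build an index name from the outer archive's path plus the path of the
    nested archive inside it; fall back to a hash if the name gets too long."""
    combined = os.path.abspath(outer_archive_path) + "/" + inner_path.lstrip("/")
    name = combined.replace(os.sep, "_").lstrip("_")
    if len(name) > max_length:
        name = hashlib.sha256(combined.encode()).hexdigest()
    return name + ".index.sqlite"
```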

Hmm, thinking about it, I might be able to implement the second idea soonish.

@Vadiml1024
Author

I'm not really a DB expert, but I suspect that table creation is an expensive operation.
Maybe the approach of adding a table mapping hash(pathname) to virtual_inode_number and
then adding this virtual_inode_number as a key part to the existing tables would be more efficient?

@mxmlnkn
Owner

mxmlnkn commented May 28, 2022

How many archives inside the outer archive are we talking about?

@Vadiml1024
Author

The biggest one I've encountered has more than 300K.

@mxmlnkn
Owner

mxmlnkn commented May 28, 2022

That is quite a lot and indeed might need more brainstorming and benchmarking :/.

This might also trigger performance problems at other locations in the code, for example inside the AutoMountLayer class, which mounts those archives recursively, has to keep a map of all mounted recursive locations, and has to look them up each time FUSE requests a file.

@Vadiml1024
Author

Given that self.mounted is a dict in AutoMountLayer,
I think that the lookup is not too time-consuming.
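
For illustration, a dict-based lookup per FUSE call could look roughly like this (a simplified sketch, not the actual AutoMountLayer code); the cost is one dict lookup per path component, independent of how many archives are mounted:

```python
def find_mount_source(mounted, path):
    """Find the deepest mounted (recursive) archive containing 'path'
    by checking each parent prefix, longest first."""
    parts = path.rstrip("/").split("/")
    for i in range(len(parts), 0, -1):
        prefix = "/".join(parts[:i]) or "/"
        if prefix in mounted:
            return prefix, mounted[prefix]
    return None, None

# Example with two nested archives mounted:
mounted = {"/outer.tar": "archive A", "/outer.tar/inner.tar": "archive B"}
print(find_mount_source(mounted, "/outer.tar/inner.tar/some/file"))
# -> ('/outer.tar/inner.tar', 'archive B')
```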

@Vadiml1024
Author

Vadiml1024 commented Oct 11, 2022 via email
