[Feature Request] Mounting multiple archives to the same path #219

Open
hexahigh opened this issue May 1, 2024 · 12 comments

Comments

@hexahigh

hexahigh commented May 1, 2024

It would be really convenient if you could mount multiple archives to the same folder.
I use dwarfs to compress WARC files, and because they are quite large and take a long time to compress, decompressing and recompressing them is not really an option. So I create a new archive every few weeks, mount each one to its own folder, and then use mergerfs to merge them all into a single folder. This is really impractical, and it would be great to see this feature implemented in the program.

@mhx
Owner

mhx commented May 2, 2024

Hi!

This sounds to me like what you want is very similar, if not identical, to "incremental backup" functionality, i.e. the ability to add a new snapshot of a directory to a DwarFS image, but only storing the changes relative to the previous snapshot.

I'm not entirely sure, though, because I don't really understand how you achieve this by creating multiple archives and then merge-mounting them. It'd be good to have a more detailed example of exactly what you're doing.

As for the "incremental backup" functionality, that's been requested before and it's definitely something I want to add. See #18, #208.

@hexahigh
Author

hexahigh commented May 2, 2024

Here is the unholy shell script I use:

dwarfs -o workers=16 -o allow_root -o readonly comp/collection1.dwarfs ./mount/collection1/
dwarfs -o workers=16 -o allow_root -o readonly comp/collection2.dwarfs ./mount/collection2/
dwarfs -o workers=16 -o allow_root -o readonly comp/blalange1.dwarfs ./mount/blalange1/
dwarfs -o workers=16 -o allow_root -o readonly comp/collection3.dwarfs ./mount/collection3/
dwarfs -o workers=16 -o allow_root -o readonly comp/rantonse1.dwarfs ./mount/rantonse1/
dwarfs -o workers=16 -o allow_root -o readonly comp/collection4.dwarfs ./mount/collection4/
dwarfs -o workers=16 -o allow_root -o readonly comp/collection5.dwarfs ./mount/collection5/
dwarfs -o workers=16 -o allow_root -o readonly comp/blalange2.dwarfs ./mount/blalange2/
dwarfs -o workers=16 -o allow_root -o readonly comp/collection6.dwarfs ./mount/collection6/

sudo mergerfs -o cache.files=partial,dropcacheonclose=true,allow_other \
	./mount/collection1:./mount/rantonse1:./mount/collection2:./mount/blalange1:./mount/collection3:./mount/collection4:./mount/collection5:./mount/blalange2:./mount/collection6 \
	./pywb/collections/main/archive/

I think you understand why it would be great to have this implemented in dwarfs.
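
The script above repeats the same dwarfs invocation for every image. A minimal sketch of a loop-driven version follows, assuming the same layout (comp/<name>.dwarfs mounted at ./mount/<name>/). The ARCHIVES list, the run helper, and the DRY_RUN variable are names made up for this example; the dwarfs and mergerfs options are copied unchanged from the script.

```shell
#!/bin/sh
# Sketch only: same dwarfs/mergerfs options as the script above, driven by a list.
# DRY_RUN=1 prints the commands instead of running them (a helper variable
# introduced for this example, not a dwarfs or mergerfs feature).

# Archive names in the branch order used by the mergerfs call above.
ARCHIVES="collection1 rantonse1 collection2 blalange1 collection3 collection4 collection5 blalange2 collection6"

# Run a command, or just print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Join the per-archive mount points into the colon-separated branch list
# that mergerfs takes as its first positional argument.
branch_list() {
    branches=""
    for name in "$@"; do
        branches="${branches:+$branches:}./mount/$name"
    done
    echo "$branches"
}

# Mount every image to its own directory, then merge them into one view.
mount_all() {
    for name in $ARCHIVES; do
        run dwarfs -o workers=16 -o allow_root -o readonly "comp/$name.dwarfs" "./mount/$name/"
    done
    run sudo mergerfs -o cache.files=partial,dropcacheonclose=true,allow_other \
        "$(branch_list $ARCHIVES)" ./pywb/collections/main/archive/
}
```

After sourcing these functions, setting DRY_RUN=1 and calling mount_all prints the nine dwarfs commands and the final mergerfs command without mounting anything. It still starts one dwarfs process per image, so it only shortens the script.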

@mhx
Owner

mhx commented May 2, 2024

That part was clear from your description. I'm more interested in how you actually build the individual archives. I assume you're creating those from the writable layer in the merged file system?

@hexahigh
Author

hexahigh commented May 2, 2024

Ah, sorry about that. No, I create the archives separately.

@silentnoodlemaster

silentnoodlemaster commented May 2, 2024

In my (irrelevant) opinion, this functionality should be left to specialized union filesystems, like the mergerfs you use in your script, or overlayfs. There is a whole list of special considerations when it comes to having multiple filesystems at one location, one of them being filename clashes.

@mhx
Owner

mhx commented May 2, 2024

> Ah, sorry about that. No, I create the archives separately.

And that means?

Assume I know nothing about your data (or exactly what mergerfs does in your use case).

Is there any overlap between the individual images? Or are they completely separate sets of files?

There's probably a dozen different ways to implement "mounting multiple archives to the same path".

I've looked at the mergerfs README for the last 15 minutes and it's unclear to me what exactly it does. I understand the overlayfs/unionfs approach, but mergerfs is apparently different from that. How does it behave if the same path exists in multiple branches but with different contents?

@silentnoodlemaster

> I've looked at the mergerfs README for the last 15 minutes and it's unclear to me what exactly it does. I understand the overlayfs/unionfs approach, but mergerfs is apparently different from that. How does it behave if the same path exists in multiple branches but with different contents?

In my understanding, the traditional way (overlayfs/unionfs) is to have one bottom filesystem and one or more on top, whereas mergerfs uses a merge policy (similar to git merge) that creates a virtual combination of filesystems.
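
To make the "same path in multiple branches" question concrete, here is a small sketch of a "first found" lookup, where the earliest branch in the list that contains a path wins. To my knowledge this matches the default mergerfs policy for lookups such as open, but treat that as an assumption and check the mergerfs documentation for your version. The first_found function and the demo file names are invented for this example and are not mergerfs code.

```shell
#!/bin/sh
# first_found RELPATH BRANCH...
# Print the location of RELPATH in the first branch that contains it.
# Returns non-zero if no branch has the path.
first_found() {
    relpath=$1
    shift
    for branch in "$@"; do
        if [ -e "$branch/$relpath" ]; then
            echo "$branch/$relpath"
            return 0
        fi
    done
    return 1
}

# Demo: the same path exists in two branches with different contents.
demo=$(mktemp -d)
mkdir -p "$demo/a" "$demo/b"
echo "from a" > "$demo/a/index.cdx"
echo "from b" > "$demo/b/index.cdx"
cat "$(first_found index.cdx "$demo/a" "$demo/b")"   # prints "from a"
rm -rf "$demo"
```

Under this rule the copy in the later branch is simply hidden, so the order of the branch list decides which contents are visible.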

@hexahigh
Author

hexahigh commented May 2, 2024

> Is there any overlap between the individual images? Or are they completely separate sets of files?

Each dwarfs file contains its own set of files; there is only one copy of each file across all the dwarfs files.

> There's probably a dozen different ways to implement "mounting multiple archives to the same path".

One idea I have for how this might be implemented:
For example, if the user mounted two dwarfs images using dwarfs -i file1.dwarfs -i file2.dwarfs -o ./mount, the program would then "combine" the file lists from each of the archives. If a file exists in both archives, it should use the file from the first archive (file1.dwarfs).

For example, the user has two dwarfs files that look like this:

image.png
another_image.png
video.mp4
audio.mp3
documents/project.odt
documents/test.md

And the resulting mount directory would look like this:

audio.mp3
documents/project.odt
documents/test.md
image.png
another_image.png
video.mp4
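
The "first archive wins" rule described above can be sketched on plain file listings. This is only an illustration of the rule, not a proposal for how dwarfs would implement it: union_listing is a hypothetical helper, it assumes each image's contents have been saved as a text file with one path per line, and the way the demo splits files between the two listings (including the duplicated image.png) is invented.

```shell
#!/bin/sh
# union_listing LIST...
# Each LIST is a text file with one path per line. Prints "path<TAB>list"
# for every distinct path, attributing it to the first list that names it.
union_listing() {
    awk '!seen[$0]++ { print $0 "\t" FILENAME }' "$@"
}

# Demo with an invented split and one deliberate clash (image.png).
tmp=$(mktemp -d)
printf '%s\n' image.png another_image.png video.mp4 > "$tmp/file1.list"
printf '%s\n' image.png audio.mp3 documents/test.md > "$tmp/file2.list"
union_listing "$tmp/file1.list" "$tmp/file2.list"
rm -rf "$tmp"
```

In the demo output image.png appears once and is attributed to file1.list, while audio.mp3 and documents/test.md come from file2.list. Paths are printed in the order they are first seen, not sorted.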

I'm horrible at explaining things.

@hexahigh
Author

hexahigh commented May 2, 2024

Kind of like ratarmount's union mounting system

@mhx
Owner

mhx commented May 2, 2024

> I'm horrible at explaining things

Your reply definitely helps, though! :)

The problem I'm having is the open questions this leaves. And I agree with @silentnoodlemaster that special/different cases should be left to special tools.

The one thing that is definitely ugly about your use case is that you have a myriad of dwarfs processes running, each of which has its own config and, much more importantly, its own independent cache. This has been bugging me for a while now, as I have a somewhat similar use case: tens (maybe hundreds in the future) of dwarfs images that I'd like to mount simultaneously, but for which I don't need a merged view (I'm perfectly fine if they live in separate directories).

So what I'd like to implement, and this is likely going to happen sooner than the incremental-backup feature, is a way to add to (and remove from) a running dwarfs process additional mounts that will share the same cache.

I just haven't figured out all the details yet. And then I need to find the time to do it. So don't hold your breath just yet.

@mhx
Owner

mhx commented May 3, 2024

Here's a quick brain dump. Feel free to comment; I'd definitely appreciate feedback.

None of these will implement any kind of "merging", though.

Mounting multiple DwarFS images

Single mount of multiple images

A single mount of multiple file systems (will show up as one FUSE mount) in a single process; shared cache

dwarfs multi [<subdir1>:]<image1> [<subdir2>:]<image2> ... <mountpoint> [options]
dwarfs add [<subdir3>:]<image3> [<subdir4>:]<image4> <mountpoint> ...
dwarfs remove <mountpoint>/<subdir> <mountpoint>/<subdir> ...
dwarfs remove -m <mountpoint> <subdir> <subdir> ...

add and remove only work for multi mounts. Actually, multi might not even be needed; add alone might be good enough.

The contents of each DwarFS image would be accessible at <mountpoint>/<subdir> instead of just <mountpoint>.

dwarfs config <mountpoint>                   # show config?
dwarfs config <mountpoint> cachesize=8g      # change cache size

The config command would also work in the following scenarios.

Multiple mounts sharing the same process/cache

Multiple mounts of multiple file systems (will show up as multiple FUSE mounts) in a single process; shared cache

 dwarfs <image1> <mountpoint1>
 dwarfs <image2> <mountpoint2> -oattach=<mountpoint1>
 dwarfs <image3> <mountpoint3> -oattach=<mountpoint1>

Options that cannot be changed at run-time will report an error.

I'm definitely open to suggestions for a name other than attach. Or maybe even a different syntax for the command.

Multiple mounts with distinct process/cache

Multiple mounts of multiple file systems in multiple processes (current behaviour); exclusive caches

@hexahigh
Author

hexahigh commented May 3, 2024

> Single mount of multiple images

> A single mount of multiple file systems (will show up as one FUSE mount) in a single process; shared cache

That implementation seems like the cleanest and most user-friendly alternative.
I assume specifying the subdir is optional ([<subdir1>:]<image1>), which would be great, as it would allow you to use globs to mount multiple images without creating a ridiculously long command.
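
A rough sketch of how an optional [<subdir>:]<image> argument could be split, to show why an optional subdir works well with globs. The multi/add syntax is only a proposal in this thread, so nothing here reflects an existing dwarfs option. parse_spec is a hypothetical helper, and the fallback of using the image file name without its .dwarfs suffix as the subdir is purely an assumption, since the proposal does not say what the default would be.

```shell
#!/bin/sh
# parse_spec SPEC
# Splits "[<subdir>:]<image>" and prints "<subdir><TAB><image>".
# ASSUMPTION: with no explicit subdir, fall back to the image file name
# without its ".dwarfs" suffix. The proposal does not define a default.
# Limitation: an image path that itself contains ":" would be misparsed.
parse_spec() {
    spec=$1
    case $spec in
        *:*)
            subdir=${spec%%:*}   # text before the first ":"
            image=${spec#*:}     # text after the first ":"
            ;;
        *)
            image=$spec
            subdir=$(basename "$image" .dwarfs)
            ;;
    esac
    printf '%s\t%s\n' "$subdir" "$image"
}
```

With such a default, a loop like `for img in comp/*.dwarfs; do parse_spec "$img"; done` would yield one subdir per image without spelling any of them out, while `parse_spec warcs:comp/collection1.dwarfs` keeps the explicit name.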
