Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Basepath-hash inode algorithm #1323

Open
wants to merge 1 commit into
base: latest-release
Choose a base branch
from

Conversation

PhracturedBlue
Copy link

This is an updated version of the patch from #1002

I ran into a similar issue recently where after replacing the SATA controller, the device-ids all changed, and my backup tool which relies on inodes being consistent got very confused.

This is just the patch from @thrnz cleaned up for 2.40.2, with the addition of README updates for the changes.

README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
src/fs_inode.hpp Outdated
@@ -37,18 +37,23 @@ namespace fs
const uint64_t fusepath_len,
const mode_t mode,
const dev_t dev,
const ino_t ino);
const ino_t ino,
const std::string &basepath = "");
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer the basepath be the first argument.

src/fs_inode.hpp Outdated
@@ -37,18 +37,23 @@ namespace fs
const uint64_t fusepath_len,
const mode_t mode,
const dev_t dev,
const ino_t ino);
const ino_t ino,
const std::string &basepath = "");
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the default value really necessary? This would create a temp object if it was called with the default which isn't great.

src/fs_inode.cpp Outdated Show resolved Hide resolved
@trapexit
Copy link
Owner

trapexit commented Apr 1, 2024

BTW... once we are happy with this please squash all these commits

@PhracturedBlue
Copy link
Author

PhracturedBlue commented Apr 1, 2024

Sorry about so many typos in such a small amount of text. I even checked it with a spell-checker 1st.

As far as getting rid of the default, I started on it, but I ran into a couple of snags. To get the base branch-dir everywhere that inode::calc is called, requires access to the Branch.path variable. In fuse_readdir_cosr.cpp that requires passing it in the DirRV structure which can't be done as a pointer, so it grows the structure. I'm not sure if that is an issue. For fuse_fgetattr I'm not sure how to get the base branch-dir. For now I passed in "" which is what the default is anyway, but it doesn't seem right, and I'm not sure why the code works today (yet it does for practical puroposes, as I've been using it for some time). So I'm open to inputs on this, hough I'll continue working on it myself too.

Edit: Are the instructions on running the tests somewhere? I tried make tests, but they fail even on the main branch.

@trapexit
Copy link
Owner

trapexit commented Apr 1, 2024

but it doesn't seem right

It isn't right :)

Need to ensure that the data required for the computation is available anywhere it is used. Otherwise it will provide invalid / inconsistent results.

As for tests: they wouldn't test this. But you're right. I changed something and didn't fix the tests. Will add it to the todo list.

@PhracturedBlue
Copy link
Author

PhracturedBlue commented Apr 1, 2024

To fix fgetattr I think I need to get the basepath somewhere, but all that we have is the file-descriptopr and the fusepath.
One option is to pass the basepath through the FileInfo object. That seems like it could be a lot of overhead for something not many users will need. Alternatively, I can use the Config policy to look it up from the fusepath. If I do it in fgetattr, then that has a performance cost that everyone would pay even when not using basepath-hash. An alternative may be to do the lookup in the basepath-hash function itself when an empty string is provided to calc(). There is obviously a performance impact to that too, but more localized. A third option would be to compute the inode when the file is opened and store that in the FileInfo (which would then just be a 64bit value stored for each FileInfo which may be palateable, though it comes at the cost of slightly slowing down 'open' to speed up fgetattr.

Do you think one of those options is better than another?

@trapexit
Copy link
Owner

trapexit commented Apr 2, 2024

Storing the basepath in FileInfo is likely the best general solution.

@PhracturedBlue
Copy link
Author

PhracturedBlue commented Apr 2, 2024

After looking at the code, deferring the identification of the basepath to calc() time doesn't seem feasible, as the fusepath is not sufficient to determine the basepath.
So I decided to create the indoe during open() and create() which seemed overall like it might be a better tradeoff of cpu vs memory for each call than passing the basepath around with ever FileInfo object.
But now that I've done it, it seems you prefer the other one, so I'll try that instead :)

@trapexit
Copy link
Owner

trapexit commented Apr 2, 2024

I just figured the basepath is a reasonable value to include in the fileinfo and might be useful elsewhere.

@PhracturedBlue
Copy link
Author

PhracturedBlue commented Apr 2, 2024

It turns out your choice is far superior. I didn't realize aI could pass it as a pointer (since the branch string is a const in Config) which means it takes up the same memory as the inode solution, but is far simpler and without the performance penalty.

I have made the change. I wil squash the commits once we're happy with the change. I just find it easier to keep the history during development.

I also tested on my system to confirm and the original pathc (with the default="") gives the same inode for stat() and fstat(). That was surprising to me, since I assumed fgetattr would be used to generate the fstat() results. I added some debug, and it seems fgetattr is not (always?) called for an fstat(). It seems fuse is doing some caching, and calls 'getattr' during open() and caches that result rather than calling fgetattr when fstat() is called.

I'm not usre if there is a better way to test ths code.

@trapexit
Copy link
Owner

trapexit commented Apr 2, 2024

I didn't realize aI could pass it as a pointer

Actually, that's a bad idea. The config, currently, can be changed at almost any time. That includes the branch list. So the branch could be removed while the file is open and then the memory would be freed and a use after free would then be possible. The performance and memory usage of a single string is minimal. In theory it could be interned but we've not gotten to a point where that should be needed. So just use a std::string.

I also tested on my system to confirm and the original pathc (with the default="") gives the same inode for stat() and fstat().

Thing is... fstat doesn't translate into fgetattr on the mergerfs side. There is some caching depending on the settings but they added fgetattr a while back but didn't really plumb the kernel side fully to use it in all the places it might be used. Honestly I'm not sure if it currently is used at all but I've kept it for that time when they do. So it really isn't a good test. Because a default of empty would naturally have to change the hash output. Otherwise it'd be a garbage hash :)

@PhracturedBlue
Copy link
Author

PhracturedBlue commented Apr 2, 2024

I made the change to dereference the basepath as you suggested.
I found this thread:
libfuse/libfuse#62
which shows that you can trigger fgetattr if you call fstat on a newly created file. I confirmed this does actually trigger fgetattr to be called, though it doesn't reslt in the returned inode being different even in the default="" case, so clearly there are still fuse shenanigans going on.

But regardless, at least I can trigger the code and verify that fgetattr and getattr now produce the same result (though I had to do it via injecting log messages rather than being able to write a test app)

Edit: Also, I appreciate you taking your time to help me get this right. Thanks! I've been using mergerfs for many years, and really appreciate your dedication to it.

README.md Outdated
* basepath-hash: Hashes the branch base path along with
the inode of the underlying entry. This has a similar purpose to
devino-hash, but by using the path instead of the device-id, the inodes
will be gauranteed to be stable across reboots. Useful for backup or
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
will be gauranteed to be stable across reboots. Useful for backup or
will be guaranteed to be stable across reboots. Useful for backup or

A filesystem's device id (st_dev) may change on reboot (eg with zfs). Instead, we can use the files base path (+underlying inode) to generate an inode, which will remain constant across reboots. However, this may have unexpected effects if multiple unique devices appear under a base path.

Like hybrid_hash, basehybrid_hash/32 hashes relative path for dirs and basepath_hash for files

Original patch by thrnz@github
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants