Future ideas for filesystem and multi-process synchronization #584

Open · pwmarcz opened this issue May 13, 2022 · 1 comment
Labels: enhancement (New feature or request), P: 2

pwmarcz (Contributor) commented May 13, 2022

This is a collection of notes on what I've learned while working on Gramine's FS code. I'm leaving active Gramine development, so hopefully this will be useful for others.

Goals

Some plausible scenarios in which we might need synchronization:

  • O_APPEND host file: multiple processes writing to the same file, e.g. a log file. This is not possible now, because all file writes use absolute offsets, so processes overwrite each other's data (see the sketch after this list).

  • Shared encrypted files: writing to an encrypted (protected) file from multiple processes, e.g. an SQLite database. Currently, we assume that only one process opens a file at a time, so if two processes write to it, they're likely going to corrupt the file.

  • Shared tmpfs files: same, but for an in-memory filesystem. Currently, tmpfs contents are separate for each process.

  • File locks (fcntl, flock): We currently implement fcntl locks, but they're non-interruptible, which limits their usefulness.
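
To make the first scenario concrete, here is a minimal C sketch (plain POSIX, not Gramine code) contrasting absolute-offset writes, which race the way Gramine's current file accesses do, with host-managed O_APPEND writes:

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Racy: each process tracks its own offset, so two concurrent writers
 * start from the same position and overwrite each other's data. This
 * mirrors how all file accesses in Gramine currently use absolute offsets. */
static void log_line_racy(int fd, off_t *my_offset, const char *msg) {
    pwrite(fd, msg, strlen(msg), *my_offset);
    *my_offset += strlen(msg);
}

/* Safe: with O_APPEND the host kernel atomically positions every write at
 * the current end of file, so concurrent writers never overwrite each other. */
static void log_line_append(int fd_append, const char *msg) {
    /* fd_append = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644); */
    write(fd_append, msg, strlen(msg));
}
```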

Sync engine

Previously (see gramineproject/graphene#2158), I proposed and started to implement a "sync engine": a module based on the idea of synchronizing arbitrary data between processes. My thinking was that we could optimize the uncontested case (i.e. a single process doing most of the work) by keeping track of which process currently has the latest version.

I no longer think this is a good idea: the implementation ended up extremely over-engineered, with a complicated flow of messages being passed around, even before I got to more advanced features such as exchanging non-trivial data, more complicated wait conditions, or interruptible waits.

I believe that good solutions for Gramine will be:

  • simpler,
  • not as abstract/general, and more tailored to a specific use case,
  • not heavily invested into optimizing the uncontested case (this goal was way too ambitious),
  • maybe opt-in, if the feature is expensive (e.g. "enable synchronization for this directory").

The one idea I think is worth keeping is to rely on the process leader as the "server" that keeps all the data.

Remember less data

We used to have a problem where Gramine did not notice when another process added or removed a (host) file. That was because we kept the files in the dentry cache and trusted the cached data.

The (easy!) solution turned out to be to rely less on the cache and to refresh the data every time. For instance, each listdir operation now asks the host to list the directory again: if a new file appeared, we fill in a dentry; if one disappeared, we clear its dentry.

This might be applicable in other situations as well: when in doubt, load the data from the host.
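
As an illustration, here is a minimal sketch of the "refresh on every listdir" approach; all function and field names are hypothetical, not the actual Gramine API:

```c
/* Hypothetical sketch: treat the dentry cache as a hint and reconcile it
 * against a fresh host listing on every readdir of the directory. */
static int listdir_refresh(struct dentry *dir) {
    struct host_dirent *entries;
    size_t count;

    /* Ask the host for the current directory contents (hypothetical call). */
    int ret = host_list_directory(dir->abs_path, &entries, &count);
    if (ret < 0)
        return ret;

    mark_children_stale(dir); /* assume everything is gone until proven otherwise */
    for (size_t i = 0; i < count; i++) {
        /* File (still) exists on the host: create or revalidate its dentry. */
        struct dentry *child = lookup_or_create_dentry(dir, entries[i].name);
        child->stale = false;
    }
    drop_stale_children(dir); /* file disappeared on the host: clear its dentry */
    return 0;
}
```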

Use Linux sources for inspiration

In fact, the easy solution described above was made possible by the introduction of inodes (#279). Before that, we couldn't simply clear a dentry, because it might have represented an open file.

More generally, I learned a lot by studying the actual filesystem code in Linux: how dentries and inodes work, which mutexes it takes and in what order, how fcntl locks are implemented, and which callbacks the filesystem layer provides (e.g. position-independent read).

(I also looked at older, simpler versions of Linux, and at FreeBSD.)

I'm not saying we should blindly follow Linux: Gramine solves a different problem and can implement many things in a simpler way. But Linux is a good starting point; things are done there the way they are for good reasons.

Support append mode on host?

Is concurrent writing to a (non-encrypted) host file a common use case? For instance, multiple processes logging to the same file, probably opened with O_APPEND.

If so, then I think the best course of action is to implement a real append mode in PAL, i.e. allow opening files in append mode. We haven't done it so far, I think, because stateless operations (write at a given offset) are more "pure" and deterministic. However, this is a good place to compromise on that principle: append mode is a much better and simpler solution than any kind of synchronization between processes.
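
A sketch of what this could look like at the PAL boundary; all names below are illustrative, not the real PAL API:

```c
/* Hypothetical: a new access mode that the Linux host PAL would translate
 * into open(path, O_WRONLY | O_APPEND). Writes through such a handle would
 * ignore the caller-supplied offset and land atomically at end-of-file,
 * so no cross-process offset synchronization is needed. */
enum pal_access {
    PAL_ACCESS_RDONLY,
    PAL_ACCESS_WRONLY,
    PAL_ACCESS_RDWR,
    PAL_ACCESS_APPEND, /* new: stateful, host-managed end-of-file writes */
};

int open_host_log(const char *uri, struct pal_handle **out) {
    /* pal_stream_open() stands in for the real PAL open entry point. */
    return pal_stream_open(uri, PAL_ACCESS_APPEND, /*share_flags=*/0600, out);
}
```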

Serve files from process leader?

For shared encrypted files, or shared tmpfs files, I think it's worth investigating a client-server model: the "server", i.e. the process leader, would make these files available to other processes over IPC.

I admit I haven't thought this through in detail; it's possible that this, too, is too complicated to be worth it. I would probably start by examining prior art: NFS, FUSE, and the 9P protocol, which promises to be simple.
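
For a flavor of how small such a protocol could be, here is a 9P-inspired sketch of the message set; all names are hypothetical:

```c
#include <stdint.h>

/* Hypothetical IPC protocol: the process leader owns the encrypted or
 * tmpfs file, and other processes operate on it through these requests. */
enum fs_ipc_op {
    FS_IPC_OPEN,     /* payload: path; reply: handle_id      */
    FS_IPC_READ,     /* handle_id, offset, size; reply: data */
    FS_IPC_WRITE,    /* handle_id, offset; payload: data     */
    FS_IPC_TRUNCATE, /* handle_id, size                      */
    FS_IPC_CLOSE,    /* handle_id                            */
};

struct fs_ipc_request {
    enum fs_ipc_op op;
    uint64_t handle_id; /* assigned by the leader in the FS_IPC_OPEN reply */
    uint64_t offset;
    uint64_t size;
    char payload[];     /* path for OPEN, data for WRITE */
};
```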

fcntl locks

I implemented fcntl locks (gramineproject/graphene#2481) in this client-server model: the process leader keeps the information about the locks, and other processes use IPC to lock and unlock. I think this might be a good starting point for further work on synchronization, but several problems came up.

  • Identifying a file: how do I tell the process leader which file to operate on? The current implementation uses absolute paths (like /foo/bar) and thus stores information in dentries, not inodes. That's perhaps good enough, but it means corner cases around deleting or renaming a file are not handled correctly.

    A perhaps related problem is keeping inode numbers consistent between processes. Right now, inode numbers are derived from absolute paths using a deterministic function. That mostly works, but it gives no guarantee against collisions, and renaming a file changes its inode number.

  • Interruptible operations: the current implementation uses a "send IPC message and wait for response" primitive, but this is wrong: there is no easy way to interrupt the wait, so taking a lock can actually hang forever in Gramine. The primitive in question was never meant for such cases.

    To support interruptible requests like this, you probably need separate operations: "make a request", "wait for response", and "cancel my request" (see the sketch after this list). See RFC: interruptible POSIX locks #12 for discussion (and [LibOS] Make POSIX locks interruptible graphene#2522 for a failed attempt at a fix).

  • Boilerplate: The implementation is simpler than the "sync engine", but still required a lot of boilerplate code. We can probably do better.
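
Here is a hypothetical shape for such a split primitive; none of these functions exist in Gramine, this is just the API surface the interruptible design seems to require:

```c
/* Splitting "send IPC message and wait for response" into three operations
 * lets a blocked fcntl(F_SETLKW) unwind cleanly when a signal arrives. */
struct ipc_request; /* opaque token identifying an in-flight request */

/* Send the lock request to the process leader; returns without blocking. */
int ipc_request_send(struct posix_lock *lock, struct ipc_request **req_out);

/* Block until the leader responds OR the calling thread is interrupted;
 * returning -EINTR lets the caller behave like a native interrupted fcntl(). */
int ipc_request_wait(struct ipc_request *req);

/* Tell the leader to drop the queued request, so an interrupted waiter is
 * not granted the lock after it has already given up on it. */
int ipc_request_cancel(struct ipc_request *req);
```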

dimakuv (Contributor) commented May 7, 2024

Update from May 2024

I was looking at the possibility of removing the libos_handle::dentry field. Unfortunately, this is still far from possible.

There are two Gramine problems that make removing dentry from the handle object (and using inodes instead) complex:

  1. Legacy: Gramine/Graphene was initially designed with the dentry and inode objects fused into one. This is being solved piece by piece, by moving dentry fields into inode fields and by side-stepping handle->dentry in favor of handle->inode (see the sketch after this list).
  2. Design: Gramine is decentralized and mostly doesn't use/rely on host information. This leads to synchronization problems, e.g. when P1 updates the size/position of a file and P2 doesn't see these updates. It also leads to not-really-correct implementations of IPC mechanisms such as POSIX locks -- locks must be associated with an inode, but since there is no universal inode ID shared by P1 and P2, we had to fall back to dentries (more specifically, to the absolute paths stored in dentries).
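
To illustrate the direction of the legacy cleanup, a simplified sketch (field names are illustrative, not the exact Gramine definitions):

```c
/* State that must survive rename/unlink, such as lock ownership, belongs
 * on the inode; the dentry should only map a path to an inode. */
struct libos_inode {
    unsigned long id;      /* would have to be stable across processes */
    struct fs_lock *locks; /* POSIX locks keyed here, not by absolute path */
    /* ... file type, size, mutex, ... */
};

struct libos_handle {
    struct libos_inode *inode;   /* preferred: survives rename/unlink */
    struct libos_dentry *dentry; /* legacy field this comment hopes to remove */
};
```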

The design problem is hard to fix, as Pawel explained in this issue. Fixing it would also carry a high performance overhead if child processes had to constantly check for updates in the main process (or, vice versa, if the main process had to broadcast updates to its children).

On the good side, I think the only problematic places in Gramine currently are:

  • POSIX locks and flocks: they are tied to absolute paths from dentries rather than to inode numbers.
  • Checkpoint-and-restore copies only the opened dentries into the child; I think either all dentries or none should be copied.
  • vma->file->dentry -- probably easy to fix (move to inodes)
  • g_process.exec->dentry -- probably easy to fix (move to inodes)
