
CA-387861 Introduce fair locking subsystem #683

Merged: 1 commit into xapi-project:master on Apr 22, 2024

Conversation

TimSmithCtx
Contributor

None of the file-based locking systems available are "fair", in the sense that there is no guarantee that the first waiter will acquire the lock when it is released rather than being usurped by a later waiter, so a waiter can suffer lock starvation. This has been something of a problem across various attempts to wrap access to LVM, and indeed it turns out that multipath commands also need to be serialised on some of the same resources.

This adds a small templated systemd service which does nothing but accept one connection at a time on a UNIX Domain socket, with a fairly long accept queue. The service is started when needed (thus every process using this mechanism needs permission to start the service), and the "Lock" is acquired by connecting to the socket. connect() on UNIX Domain sockets does not time out, and waiting connections are held in a queue, providing fairness.
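For illustration, here is a minimal Python-only sketch of what each per-lock service instance does (the PR implements this as a small C program run from a templated systemd unit; the socket path and backlog value below are illustrative, not the PR's exact ones):

import os
import socket

SOCK_PATH = "/run/fairlock/demo"      # illustrative path only

os.makedirs(os.path.dirname(SOCK_PATH), exist_ok=True)
if os.path.exists(SOCK_PATH):
    os.unlink(SOCK_PATH)

srv = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
srv.bind(SOCK_PATH)
srv.listen(64)                        # a fairly long accept queue; waiters queue here in order

while True:
    conn, _ = srv.accept()            # the next waiter in the queue now holds the "lock"
    while conn.recv(128):             # hold it until the client disconnects (EOF)
        pass
    conn.close()                      # released; go round and accept the next waiter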

The python3 context manager is provided via a metaclass which checks the arguments and allows only one of each named Fairlock object to exist, so that attempts to take a lock when it is already held can raise an exception. This is not thread-safe but could probably be made so.
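For concreteness, a minimal sketch of that one-instance-per-name pattern (names and details here are assumptions for illustration, not the PR's exact code):

class _OnePerName(type):
    # Not thread-safe, mirroring the caveat above.
    _instances = {}

    def __call__(cls, name):
        if not isinstance(name, str) or not name:
            raise TypeError("lock name must be a non-empty string")
        key = (cls, name)
        if key not in _OnePerName._instances:
            _OnePerName._instances[key] = super().__call__(name)
        return _OnePerName._instances[key]

class Fairlock(metaclass=_OnePerName):
    def __init__(self, name):
        self.name = name
        self.held = False

    def __enter__(self):
        if self.held:
            raise RuntimeError(f"Fairlock {self.name!r} is already held")
        # ...connect to the per-lock socket here to actually take the lock...
        self.held = True
        return self

    def __exit__(self, *exc):
        # ...close the socket here to release the lock...
        self.held = False
        return False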

@MarkSymsCtx
Contributor

Probably should have an update to https://github.com/xapi-project/sm/blob/master/mk/sm.spec.in ?

@MarkSymsCtx
Contributor

Needs an update to .github/workflows/main.yml to repeat the python path changes

@edwintorok
Contributor

Were you mixing byte range and whole file locks (or attempting overlapping byte range locks) previously? That one is known to have fairness issues: https://rusty.ozlabs.org/2010/08/13/fcntl-lock-starvation-and-tdb.html
Although there is some evidence that flock is not fair https://code.jjb.cc/linux-flock-does-not-provide-fair-locking (pointing to potentially a bug in the kernel?)
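For reference, the two kernel locking flavours being contrasted here, as exposed in Python (illustrative snippet; the lock file path is made up):

import fcntl
import os

fd = os.open("/tmp/demo.lock", os.O_RDWR | os.O_CREAT, 0o600)

# Whole-file lock: flock(2), one lock per open file description.
fcntl.flock(fd, fcntl.LOCK_EX)
fcntl.flock(fd, fcntl.LOCK_UN)

# Byte-range (POSIX record) lock: fcntl(2)/lockf, locks a region of the file.
fcntl.lockf(fd, fcntl.LOCK_EX, 1, 0)   # lock 1 byte starting at offset 0
fcntl.lockf(fd, fcntl.LOCK_UN, 1, 0)

os.close(fd)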

In Python there might be a simpler way to get a process-wide fair lock, though, by using shared pthread mutexes (this is simple if you want to share the lock with child processes and more complicated if you want to share it with other processes). https://docs.python.org/3/library/multiprocessing.html#multiprocessing.Lock
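For example, the simple child-process case would be a sketch along these lines:

import multiprocessing as mp

def worker(lock, i):
    with lock:                        # the Lock object is passed down to each child
        print(f"worker {i} holds the lock")

if __name__ == "__main__":
    lock = mp.Lock()
    procs = [mp.Process(target=worker, args=(lock, i)) for i in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()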

Perhaps a Manager object could be used for sharing Lock objects? https://docs.python.org/3/library/multiprocessing.html#using-a-remote-manager
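Something along the lines of the "remote manager" example in those docs could look like this (sketch only; the address, authkey and registration name are made up):

# Server process: hold the lock and serve it to clients over a socket.
import threading
from multiprocessing.managers import BaseManager

class LockManager(BaseManager):
    pass

if __name__ == "__main__":
    shared = threading.Lock()
    LockManager.register("get_lock", callable=lambda: shared)
    mgr = LockManager(address=("127.0.0.1", 50000), authkey=b"demo")
    mgr.get_server().serve_forever()

# Client processes (run separately) would then do something like:
#
#   LockManager.register("get_lock")
#   mgr = LockManager(address=("127.0.0.1", 50000), authkey=b"demo")
#   mgr.connect()
#   lock = mgr.get_lock()
#   lock.acquire()
#   ...critical section...
#   lock.release()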

@TimSmithCtx
Contributor Author

Probably should have an update to https://github.com/xapi-project/sm/blob/master/mk/sm.spec.in ?

Added. Also fixed a couple of rpmlint issues (Should not specify BuildRoot, or reference RPM_BUILD_ROOT or %buildroot in the %build section)

@TimSmithCtx
Contributor Author

Were you mixing byte range and whole file locks (or attempting overlapping byte range locks) previously? That one is known to have fairness issues: https://rusty.ozlabs.org/2010/08/13/fcntl-lock-starvation-and-tdb.html Although there is some evidence that flock is not fair https://code.jjb.cc/linux-flock-does-not-provide-fair-locking (pointing to potentially a bug in the kernel?)

Neither fcntl() nor flock() is fair. We've run into this issue more than once.

I wouldn't call it a bug unless you can find somewhere any sort of fairness promise which is being violated. I know of no such promise. We might wish they were fair, but wishes are sadly not ponies...

In Python there might be a simpler way to get a process-wide fair lock, though, by using shared pthread mutexes (this is simple if you want to share the lock with child processes and more complicated if you want to share it with other processes). https://docs.python.org/3/library/multiprocessing.html#multiprocessing.Lock

In a single-threaded process, there is no need for locks at all. Since this code is single-threaded, the locking requirement is clearly between processes, so this won't help :-)

@edwintorok
Contributor

In Python there might be a simpler way to get a process-wide fair lock, though, by using shared pthread mutexes (this is simple if you want to share the lock with child processes and more complicated if you want to share it with other processes). https://docs.python.org/3/library/multiprocessing.html#multiprocessing.Lock

In a single-threaded process, there is no need for locks at all. Since this code is single-threaded, the locking requirement is clearly between processes, so this won't help :-)

If you look at the implementation, it is a bit smarter than that: it uses named semaphores (in /dev/shm) https://github.com/python/cpython/blob/main/Modules/_multiprocessing/semaphore.c, so it does work between processes.

MarkSymsCtx requested a review from rdn32 on April 19, 2024, 09:40
@TimSmithCtx
Contributor Author

If you look at the implementation, it is a bit smarter than that: it uses named semaphores (in /dev/shm) https://github.com/python/cpython/blob/main/Modules/_multiprocessing/semaphore.c, so it does work between processes.

POSIX states that "the highest priority thread that has been waiting the longest shall be unblocked" only when either SCHED_FIFO or SCHED_RR applies to the blocked process. Apart from that, I see no promise of fairness in POSIX or SYSV semaphores.

And we've already been round the loop trying to manage a queue with locks. This is simpler and cleaner for our current purposes.

self.sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
try:
    self.sock.connect(self.sockname)
except (FileNotFoundError, ConnectionRefusedError):

There's a comment in fairlock.c saying that we'd also get this error if there were more than 64 waiters. I assume that trying to start the service when it is already running would be harmless, but we are only retrying the connection once. I take it that in reality we wouldn't expect to have that many waiters, so there's no point doing anything more involved. Yes?

Contributor Author

Local experimentation suggests that Python's connect() handles the case of >64 waiters on the client side quite comfortably. While connecting to a Unix domain socket when the listen queue is full is allowed to return ECONNREFUSED, in practice on Linux it is EAGAIN, which Python handles for us.

The check for ECONNREFUSED here is mostly to cover the one-time-only race condition where we are the second client in, and the service has been started by the first client in but has not quite got around to listening yet.
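Putting those pieces together, the client-side acquire path being discussed looks roughly like this (a sketch only; the socket path and unit name are illustrative assumptions, not necessarily the PR's exact ones):

import socket
import subprocess

def acquire(name):
    sockname = f"/run/fairlock/{name}"                     # illustrative path
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        sock.connect(sockname)
    except (FileNotFoundError, ConnectionRefusedError):
        # Service not created yet, or created but not yet listening: start it
        # (harmless if already running) and retry the connection once.
        subprocess.run(["systemctl", "start", f"fairlock@{name}.service"],
                       check=True)
        sock.connect(sockname)                             # blocks until it is our turn
    return sock                                            # keep the socket open while holding the lock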

self.assertEquals(l, n)
# Real code would use another 'with Fairlock("test")' here but we cannot
# do that because it insists on having a code block as a body, which would
# then not be reached, causing a "Test code not fully covered" failure
Much of a muchness, but you can use # pragma: no cover
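That is, the unreachable body can be excluded from coverage explicitly, something like (illustrative; the exact exception type the code raises is not shown here):

with self.assertRaises(Exception):
    with Fairlock("test"):
        pass  # pragma: no cover -- never reached: re-entering raises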

while ((fd = accept(sock, NULL, NULL)) > -1) {
    char buffer[128];

    do {} while (read(fd, buffer, sizeof(buffer)) > 0);
Contributor

If -1 is returned and errno == EINTR for some reason, I expect that it should continue the loop rather than exiting; otherwise the lock might be dropped too early.

Contributor Author

This is indeed important if one is going to handle any signals. Fortunately the only signals coming in here will be ones which terminate the process entirely :-)

TimSmithCtx merged commit a077ecd into xapi-project:master on Apr 22, 2024
2 checks passed