Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add filesystem locking on jetstream fileStore instances. #4407

Open
wants to merge 1 commit into
base: dev
Choose a base branch
from

Conversation

jevolk
Copy link
Contributor

@jevolk jevolk commented Aug 18, 2023

  • Link to issue, e.g. Resolves #NNN
  • Documentation added (if applicable)
  • Tests added
  • Branch rebased on top of current main dev
  • Changes squashed to a single commit (described here)
  • Build is green in Travis CI
  • You have certified that the contribution is your original work and that you license the work to the project

@jevolk jevolk requested a review from a team as a code owner August 18, 2023 21:22
@neilalexander neilalexander added the post-freeze We'll come back to this after the freeze period label Aug 21, 2023
Copy link
Member

@derekcollison derekcollison left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Possible to get a test written that shows correct behavior?

Also where to we clear the lock on successful shutdown of the filestore?

@jevolk jevolk force-pushed the jevolk/flock branch 2 times, most recently from c3a4492 to 34c7624 Compare August 21, 2023 20:57
@jevolk
Copy link
Contributor Author

jevolk commented Aug 21, 2023

Thanks for taking the time to review this.

Also where to we clear the lock on successful shutdown of the filestore?

The lock is cleared when the file descriptor at fileStore.lfd is garbage collected, or when the process exits. I have since added an explicit Close() call in fileStore.Stop(). I have not added matching unlock fcntl's but can do so at your request.

Possible to get a test written that shows correct behavior?

POSIX record locks used here are "process-associated" which means testing within the same process or thread group will likely be insufficient. The flock(2) alternative is superior but not available on all platforms. AFAICT, the testing infrastructure here is limited to instances within the same address-space, and does not fork into sub-processes. I'm not yet sure how to test this here.

Let me know your thoughts, thanks.

@jevolk
Copy link
Contributor Author

jevolk commented Aug 21, 2023

Also, the CI appears limited to Linux only; I have not verified if the windows overload is effective.

@derekcollison
Copy link
Member

If you added the close to Stop() you can do a Stop() and make sure you can recreate the filestore with the same storedir..

Running two instances of the server which share the same directory (e.g.
default configuration `/tmp/nats/jetstream') will corrupt each other.

We mitigate by creating an empty file called LOCK in the directory and then
acquire a flock(2) on it.

Signed-off-by: Jason Volk <jason@zemos.net>
@jevolk
Copy link
Contributor Author

jevolk commented Aug 22, 2023

By switching to flock(2) I have been able to add the proper tests, one affirmative (TestFileStoreLockFileSystem), one negative (TestFileStoreLockFileSystemConflict), which will work with the single address-space testing infrastructure here. This approach is superior because it can catch conflicts from two FileStore instances within the same server process as well as between two separate servers processes.

It appears that syscall.Flock() is made available by Go on all relevant targets. The caveat is that the behavior may be different depending on whether it's wrapping fcntl(2). On systems with a wrapper, the negative test might fail, since the POSIX record lock (via fcntl(2)) does not contend within the same process; two FileStores instantiated within the same process will not exclude each other. So while flock(2) is superior if available, it might not actually be available.

Volumes have been written about how UNIX filesystem locking is a notorious shitshow, so please bear with me here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
post-freeze We'll come back to this after the freeze period
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants