subtle GC finalizer (?) issue in recovery from prolonged out-of-disk condition #282

Open
jgraettinger opened this issue Sep 21, 2020 · 0 comments
The storage bucket(s) backing a large and very busy cluster underwent a configuration change which caused fragment uploads to be refused for an extended period (hours), and the disks of many brokers filled to 100%.

As intended, brokers paused accepting new appends until more disk became available. Also as intended, once uploads to the backing bucket resumed, all but one of the brokers reclaimed disk space and resumed accepting appends.

The specific mechanism by which this works is that, once a broker sees the remote bucket contains a fragment covering the span of a local fragment, the local fragment (and its *os.File) is dropped for the GC to finalize. It's not closed explicitly, because it may still be accessed by a concurrent read.
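
For reference, a minimal sketch of that drop-on-coverage pattern (types and names here are illustrative, not Gazette's actual API):

```go
package fragment

import (
	"os"
	"sync"
)

// Span is a [Begin, End) byte-offset range of a journal. Illustrative only.
type Span struct{ Begin, End int64 }

// localSpool is a locally persisted fragment and its open backing file.
type localSpool struct {
	Span
	File *os.File // Never closed explicitly; concurrent reads may still hold it.
}

// index tracks local spools not yet known to be covered by the remote store.
type index struct {
	mu     sync.Mutex
	spools map[Span]*localSpool
}

// onRemoteRefresh drops local spools whose spans are now covered by a fragment
// in the remote bucket. The *os.File is deliberately not closed here, because a
// concurrent read may still be serving from it. Dropping the last reference is
// enough: the os package registers a GC finalizer on opened files, which closes
// the descriptor once the file is unreachable and lets the kernel reclaim the
// (possibly already-unlinked) file's disk space.
func (idx *index) onRemoteRefresh(covered func(Span) bool) {
	idx.mu.Lock()
	defer idx.mu.Unlock()

	for span := range idx.spools {
		if covered(span) {
			delete(idx.spools, span) // Last reference dropped; GC will finalize the File.
		}
	}
}
```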

One broker, for unknown reasons, was unable to reclaim dangling *os.File references, and never escaped the 100% disk-full condition. Before forcibly killing it, I was able to verify that:

1. GC was still running regularly,
2. goroutine traces showed that refreshes of the fragment index from the bucket -- the mechanism by which *os.File references are dropped -- were proceeding normally, and
3. there weren't other wedged goroutines which could explain a very large number of dangling *os.File references.

Other than that, I'm currently scratching my head.
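
For completeness, this is roughly the kind of in-process check behind observation 1 above, plus a descriptor count that would make a leak like this visible. It's a hedged sketch, not Gazette code, and the /proc/self/fd scan is Linux-specific:

```go
package diag

import (
	"log"
	"os"
	"runtime"
	"time"
)

// WatchGCAndFDs periodically logs the GC cycle count alongside the number of
// open file descriptors. A steadily rising NumGC together with a flat or
// growing descriptor count suggests collections are happening but finalizers
// aren't releasing *os.File references.
func WatchGCAndFDs(interval time.Duration) {
	for range time.Tick(interval) {
		var ms runtime.MemStats
		runtime.ReadMemStats(&ms)

		fds, err := os.ReadDir("/proc/self/fd") // Linux-specific.
		if err != nil {
			log.Printf("reading /proc/self/fd: %v", err)
			continue
		}
		log.Printf("GC cycles: %d, last GC: %s, open FDs: %d",
			ms.NumGC, time.Unix(0, int64(ms.LastGC)), len(fds))
	}
}
```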
