subtle GC finalizer (?) issue in recovery from prolonged out-of-disk condition #282

Open
jgraettinger opened this issue Sep 21, 2020 · 0 comments
The storage bucket(s) backing a large and very busy cluster underwent a configuration change which caused fragment uploads to be refused for an extended period (hours), and the disks of many brokers filled to 100%.

As intended, brokers paused accepting new appends until more disk became available. Also as intended, once uploads to the backing bucket resumed, all but one of the brokers reclaimed disk space and resumed accepting appends.

The specific mechanism by which this works is that, once a broker sees the remote bucket contains a fragment covering the span of a local fragment, the local fragment (and its *os.File) is dropped for the GC to finalize. It's not closed explicitly, because it may still be accessed by a concurrent read.
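
For reference, a minimal sketch of that drop-on-coverage pattern (types and names here are illustrative, not Gazette's actual API):

```go
package fragment

import (
	"os"
	"sync"
)

// Span is a [Begin, End) byte-offset range of a journal. Illustrative only.
type Span struct{ Begin, End int64 }

// localSpool is a locally persisted fragment and its open backing file.
type localSpool struct {
	Span
	File *os.File // Never closed explicitly; concurrent reads may still hold it.
}

// index tracks local spools not yet known to be covered by the remote store.
type index struct {
	mu     sync.Mutex
	spools map[Span]*localSpool
}

// onRemoteRefresh drops local spools whose spans are now covered by a fragment
// in the remote bucket. The *os.File is deliberately not closed here, because a
// concurrent read may still be serving from it. Dropping the last reference is
// enough: the os package registers a GC finalizer on opened files, which closes
// the descriptor once the file is unreachable and lets the kernel reclaim the
// (possibly already-unlinked) file's disk space.
func (idx *index) onRemoteRefresh(covered func(Span) bool) {
	idx.mu.Lock()
	defer idx.mu.Unlock()

	for span := range idx.spools {
		if covered(span) {
			delete(idx.spools, span) // Last reference dropped; GC will finalize the File.
		}
	}
}
```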

One broker, for unknown reasons, was unable to reclaim dangling *os.File references, and never escaped the 100% disk-full condition. Before forcibly killing it, I was able to verify that:

1. GC was still running regularly,
2. goroutine traces showed that refreshes of the fragment index from the bucket -- the mechanism by which *os.File references are dropped -- were proceeding normally, and
3. there weren't other wedged goroutines which could explain a very large number of dangling *os.File references.

Other than that, I'm currently scratching my head.
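
For completeness, this is roughly the kind of in-process check behind observation 1 above, plus a descriptor count that would make a leak like this visible. It's a hedged sketch, not Gazette code, and the /proc/self/fd scan is Linux-specific:

```go
package diag

import (
	"log"
	"os"
	"runtime"
	"time"
)

// WatchGCAndFDs periodically logs the GC cycle count alongside the number of
// open file descriptors. A steadily rising NumGC together with a flat or
// growing descriptor count suggests collections are happening but finalizers
// aren't releasing *os.File references.
func WatchGCAndFDs(interval time.Duration) {
	for range time.Tick(interval) {
		var ms runtime.MemStats
		runtime.ReadMemStats(&ms)

		fds, err := os.ReadDir("/proc/self/fd") // Linux-specific.
		if err != nil {
			log.Printf("reading /proc/self/fd: %v", err)
			continue
		}
		log.Printf("GC cycles: %d, last GC: %s, open FDs: %d",
			ms.NumGC, time.Unix(0, int64(ms.LastGC)), len(fds))
	}
}
```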
