
Found mis-handled disk corruption or other internal bug #47

Open
ryanworl opened this issue Feb 19, 2021 · 3 comments
ryanworl commented Feb 19, 2021

I am a heavy user of this cache in production and have run into a bug. We save our cache to disk and have O(1GB) of cache per node, storing many small and medium-sized objects using the SetBig function.
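
For context, a minimal sketch of that kind of setup using fastcache's public API (the path, sizes, and key are made up; this is not our actual code):

```go
package main

import (
	"log"

	"github.com/VictoriaMetrics/fastcache"
)

func main() {
	// Load a previously saved cache from disk if present, otherwise start fresh.
	c, err := fastcache.LoadFromFile("/var/lib/mycache") // hypothetical path
	if err != nil {
		c = fastcache.New(1 << 30) // ~1GB per node
	}

	// Objects can exceed 64KB, so writes go through SetBig.
	key := []byte("object-key")
	value := make([]byte, 256*1024)
	c.SetBig(key, value)

	// Periodically persist the cache so it survives restarts.
	if err := c.SaveToFileConcurrent("/var/lib/mycache", 4); err != nil {
		log.Printf("cannot save cache: %s", err)
	}
}
```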

Yesterday we ran into a machine in a "zombie" state, with tens of thousands of goroutines piling up doing cache lookups. They were all attempting to acquire the RLock to read a bucket in the cache.

Four goroutines were attempting to write to the cache and had panicked. Due to a quirk in our usage, we have a defer that, under some circumstances, will also write to the cache, and it runs in the same goroutine where the panic happened. This results in deadlock.

The panic happened at fastcache.go:318, the first line in this snippet:

chunk := b.chunks[chunkIdx]
if chunk == nil {
    chunk = getChunk()
    chunk = chunk[:0]
}

I'm not sure how this would come up, but instead of crashing it should be handled by returning an error indicating that corruption was found. This may not be possible given the general lack of checksums in the library.

I don't have a great solution here, but I thought you should know about this. On our end, we are wrapping our usage with a deferred function that will nuke the cache from disk, sync the directory, and call os.Exit.
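
Roughly, that wrapper looks like the sketch below (names and the cache path are hypothetical, and the details are simplified):

```go
package cachewrap

import (
	"log"
	"os"
	"path/filepath"

	"github.com/VictoriaMetrics/fastcache"
)

// setCacheEntry is a hypothetical wrapper around a cache write. If the write
// panics (as it did at fastcache.go:318 above), we assume the on-disk
// snapshot may be corrupted: remove it, sync the parent directory, and exit
// so the process restarts with a fresh cache instead of hanging.
func setCacheEntry(c *fastcache.Cache, cachePath string, k, v []byte) {
	defer func() {
		if r := recover(); r != nil {
			log.Printf("panic during cache write: %v; removing %s and exiting", r, cachePath)
			_ = os.RemoveAll(cachePath) // works whether the snapshot is a single file or a directory
			if dir, err := os.Open(filepath.Dir(cachePath)); err == nil {
				_ = dir.Sync() // make the removal durable before exiting
				_ = dir.Close()
			}
			os.Exit(1)
		}
	}()
	c.SetBig(k, v)
}
```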

@cristaloleg

Hi, can you share the panic message?

@ryanworl (Author)

I don't have it.

All I have is the stack trace from a goroutine dump pointing to that line, fastcache.go:318, where the panic occurred. After the panic, a defer attempted to write to the cache again through our wrapper and deadlocked on one of our mutexes, because the write that caused the panic already held it.
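
To illustrate the pattern (all names here are made up, not our real code): the write panics while our wrapper's mutex is held, its unlock is never reached, and the deferred cleanup in the same goroutine re-enters the wrapper and blocks forever on that mutex.

```go
package deadlockdemo

import "sync"

// wrapper mimics the shape of our cache wrapper; all names are illustrative.
type wrapper struct {
	mu sync.Mutex
}

// set panics inside the critical section, so mu.Unlock is never reached and
// the lock stays held while the panic unwinds the stack.
func (w *wrapper) set(k, v []byte) {
	w.mu.Lock()
	doCacheWrite(k, v) // where the fastcache.go:318 panic surfaced in our case
	w.mu.Unlock()      // never reached after the panic
}

// handle shows the deadlock: the deferred cleanup recovers the panic and then
// calls back into set in the same goroutine, blocking forever on mu.Lock.
func (w *wrapper) handle(k, v []byte) {
	defer func() {
		if recover() != nil {
			w.set(k, []byte("fallback")) // deadlocks: mu is still held from the panicked set
		}
	}()
	w.set(k, v)
}

// doCacheWrite stands in for the fastcache call that panicked.
func doCacheWrite(k, v []byte) { panic("internal cache panic (placeholder)") }
```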

ryanworl commented Feb 19, 2021

A sequence of events that could (in theory; I'm not sure) lead to corruption:

t0: one goroutine begins writing to the cache using SetBig
t0: another goroutine begins writing the cache to disk using SaveToFileConcurrent
t1: first goroutine modifies a bucket before the cache is written to disk
t2: second goroutine writes the new bucket to disk and any other (old) chained buckets from the same key the first goroutine wrote to
t3: first goroutine finishes writing to the rest of the buckets

This sequence does not require any data races, so it wouldn't be caught by the race detector.
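
A minimal sketch of that overlap (sizes and the snapshot path are hypothetical; this only illustrates the hypothesized interleaving, not a confirmed reproduction):

```go
package main

import (
	"log"
	"sync"

	"github.com/VictoriaMetrics/fastcache"
)

func main() {
	c := fastcache.New(1 << 30)
	bigValue := make([]byte, 4<<20) // >64KB, so SetBig chains chunks across several buckets

	var wg sync.WaitGroup
	wg.Add(2)
	go func() {
		defer wg.Done()
		// t0..t3 above: touches multiple buckets, holding one bucket lock at a time.
		c.SetBig([]byte("key"), bigValue)
	}()
	go func() {
		defer wg.Done()
		// t0..t2 above: may serialize some buckets while the SetBig is only partly applied,
		// so the snapshot can mix new and old buckets for the same chained key.
		if err := c.SaveToFileConcurrent("/tmp/cache-snapshot", 4); err != nil {
			log.Printf("save failed: %s", err)
		}
	}()
	wg.Wait()
}
```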
