
Found mis-handled disk corruption or other internal bug #47

Open
ryanworl opened this issue Feb 19, 2021 · 3 comments
ryanworl commented Feb 19, 2021

I am a heavy user of this cache in production and have run into a bug. We save our cache to disk and have O(1GB) of cache per node, storing many small and medium-sized objects using the SetBig function.
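
For context, a minimal sketch of that kind of setup using fastcache's public API (the path, sizes, and key are made up; this is not our actual code):

```go
package main

import (
	"log"

	"github.com/VictoriaMetrics/fastcache"
)

func main() {
	// Load a previously saved cache from disk if present, otherwise start fresh.
	c, err := fastcache.LoadFromFile("/var/lib/mycache") // hypothetical path
	if err != nil {
		c = fastcache.New(1 << 30) // ~1GB per node
	}

	// Objects can exceed 64KB, so writes go through SetBig.
	key := []byte("object-key")
	value := make([]byte, 256*1024)
	c.SetBig(key, value)

	// Periodically persist the cache so it survives restarts.
	if err := c.SaveToFileConcurrent("/var/lib/mycache", 4); err != nil {
		log.Printf("cannot save cache: %s", err)
	}
}
```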

Yesterday we ran into a machine in a "zombie" state, with tens of thousands of goroutines piling up doing cache lookups. They were all attempting to acquire the RLock to read a bucket in the cache.

Four goroutines were attempting to write to the cache and had panicked. Due to a quirk in our usage, we have a defer that, under some circumstances, will also write to the cache, and it runs in the same goroutine where the panic happened. This results in deadlock.

The panic happened at fastcache.go:318, the first line in this snippet:

chunk := b.chunks[chunkIdx]
if chunk == nil {
    chunk = getChunk()
    chunk = chunk[:0]
}

I'm not sure how this would come up, but instead of crashing it should be handled by returning an error indicating that corruption was found. This may not be possible given the general lack of checksums in the library.

I don't have a great solution here, but I thought you should know about this. On our end, we are wrapping our usage with a deferred function that will nuke the cache from disk, sync the directory, and call os.Exit.
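
Roughly, that wrapper looks like the sketch below (names and the cache path are hypothetical, and the details are simplified):

```go
package cachewrap

import (
	"log"
	"os"
	"path/filepath"

	"github.com/VictoriaMetrics/fastcache"
)

// setCacheEntry is a hypothetical wrapper around a cache write. If the write
// panics (as it did at fastcache.go:318 above), we assume the on-disk
// snapshot may be corrupted: remove it, sync the parent directory, and exit
// so the process restarts with a fresh cache instead of hanging.
func setCacheEntry(c *fastcache.Cache, cachePath string, k, v []byte) {
	defer func() {
		if r := recover(); r != nil {
			log.Printf("panic during cache write: %v; removing %s and exiting", r, cachePath)
			_ = os.RemoveAll(cachePath) // works whether the snapshot is a single file or a directory
			if dir, err := os.Open(filepath.Dir(cachePath)); err == nil {
				_ = dir.Sync() // make the removal durable before exiting
				_ = dir.Close()
			}
			os.Exit(1)
		}
	}()
	c.SetBig(k, v)
}
```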

@cristaloleg

Hi, can you share the panic message?

@ryanworl (Author)

I don't have it.

All I have is the stack trace from a goroutine dump pointing to that line, fastcache.go:318, where the panic occurred. After the panic, a defer attempted to write to the cache again through our wrapper and deadlocked on one of our mutexes, because the write that caused the panic already held it.
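
To illustrate the pattern (all names here are made up, not our real code): the write panics while our wrapper's mutex is held, its unlock is never reached, and the deferred cleanup in the same goroutine re-enters the wrapper and blocks forever on that mutex.

```go
package deadlockdemo

import "sync"

// wrapper mimics the shape of our cache wrapper; all names are illustrative.
type wrapper struct {
	mu sync.Mutex
}

// set panics inside the critical section, so mu.Unlock is never reached and
// the lock stays held while the panic unwinds the stack.
func (w *wrapper) set(k, v []byte) {
	w.mu.Lock()
	doCacheWrite(k, v) // where the fastcache.go:318 panic surfaced in our case
	w.mu.Unlock()      // never reached after the panic
}

// handle shows the deadlock: the deferred cleanup recovers the panic and then
// calls back into set in the same goroutine, blocking forever on mu.Lock.
func (w *wrapper) handle(k, v []byte) {
	defer func() {
		if recover() != nil {
			w.set(k, []byte("fallback")) // deadlocks: mu is still held from the panicked set
		}
	}()
	w.set(k, v)
}

// doCacheWrite stands in for the fastcache call that panicked.
func doCacheWrite(k, v []byte) { panic("internal cache panic (placeholder)") }
```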

ryanworl commented Feb 19, 2021

A sequence of events that could (in theory; I'm not sure) lead to corruption:

t0: one goroutine begins writing to the cache using SetBig
t0: another goroutine begins writing the cache to disk using SaveToFileConcurrent
t1: first goroutine modifies a bucket before the cache is written to disk
t2: second goroutine writes the new bucket to disk and any other (old) chained buckets from the same key the first goroutine wrote to
t3: first goroutine finishes writing to the rest of the buckets

This sequence does not require any data races, so it wouldn't be caught by the race detector.
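
A minimal sketch of that overlap (sizes and the snapshot path are hypothetical; this only illustrates the hypothesized interleaving, not a confirmed reproduction):

```go
package main

import (
	"log"
	"sync"

	"github.com/VictoriaMetrics/fastcache"
)

func main() {
	c := fastcache.New(1 << 30)
	bigValue := make([]byte, 4<<20) // >64KB, so SetBig chains chunks across several buckets

	var wg sync.WaitGroup
	wg.Add(2)
	go func() {
		defer wg.Done()
		// t0..t3 above: touches multiple buckets, holding one bucket lock at a time.
		c.SetBig([]byte("key"), bigValue)
	}()
	go func() {
		defer wg.Done()
		// t0..t2 above: may serialize some buckets while the SetBig is only partly applied,
		// so the snapshot can mix new and old buckets for the same chained key.
		if err := c.SaveToFileConcurrent("/tmp/cache-snapshot", 4); err != nil {
			log.Printf("save failed: %s", err)
		}
	}()
	wg.Wait()
}
```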
