storage: nil pointer crash in writer goroutine after CloseWithError #4167
Thanks for the detailed writeup and for your proposed fix! I wasn't able to reproduce the issue locally with a couple of versions of your repro. Your reasoning seems correct, but I have a couple of follow-up questions to try to understand better what's going on:
My reproduction was unfortunately a bit more complicated: I was running several processes across a cluster of machines that were all writing files to GCS while under heavy sustained CPU load. They'd usually crash within about 5 minutes, sometimes as long as 10, after successfully writing 2-10k files. Sorry, I realize that isn't exactly a simple, reliable local reproduction, but it was reliable in the sense that it always crashed before finishing the overall task.
Yes and no: I was calling
No -- after switching to only calling

While it was still happening, I did a little quick-and-dirty println debugging; specifically, I added a package var

My shot-in-the-dark guess is that this is down to extreme CPU contention and whether the

I also ran it a few times with the patch in #4168 to add
EDIT: never mind, I think this more recent one -- since my application changed to avoid calling
Thanks for following up, and sorry for the delay (I've been on vacation and then had a busy week last week). Just wanted to check in: has it continued to work alright for you in the intervening 10 days or so? I think I need to come up with some kind of repro myself (or at least do a little more research) before following up on your PR. I'm reluctant to mess with
Yeah, since removing

(also, just while I was poking around in writer.go, I think it looked like
Thanks for following up! Good to hear that strategy has been working better. I think you're right about the check on
Since w.opened is currently set to true before this validation, subsequent calls to Writer.Close will deadlock. There is no reason not to do this check before opening the pipe. Thanks to @dt for pointing this out. Updates googleapis#4167
Since you haven't seen this issue when using context cancellation and calling
Client: Storage
Code
Extracted/adapted from a more complex reproduction, but more or less:
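(The snippet itself was not preserved in this copy of the thread; the shape described is roughly the following. This is a hypothetical sketch with placeholder bucket/object names, not the reporter's original code, and `data` stands in for the payload being uploaded.)

```go
// Hypothetical sketch only: write an object, abort with CloseWithError on
// failure, then let the deferred client.Close run once the caller finishes.
client, err := storage.NewClient(ctx) // cloud.google.com/go/storage
if err != nil {
	return err
}
defer client.Close() // may run while the writer goroutine is still alive

w := client.Bucket("some-bucket").Object("some-object").NewWriter(ctx)
if _, err := w.Write(data); err != nil {
	w.CloseWithError(err) // abort; returns without waiting for the goroutine
	return err
}
return w.Close()
```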
Expected behavior
The write is aborted.
Actual behavior
The process crashes:
Additional Information
It looks like Writer spawns this goroutine to handle the upload side of the pipe. Inside this goroutine it reads fields of storage.Client, such as Client.raw, on the line mentioned in the crash stack trace above: google-cloud-go/storage/writer.go, line 131 in 825ddd6.
Looking at Client.Close(), it nils out raw when the client is closed: google-cloud-go/storage/storage.go, line 167 in 825ddd6.
So it looks like the caller called Client.Close() while the writer goroutine was still running. However, reviewing the calling code in my application, the Writer is always closed -- via either Close or CloseWithError in a defer -- before the function that created it returns to the function which created, and would later close, the Client. So it appears that the writer goroutine was somehow still running after the Writer was closed, which meant it was poised to crash when the client was closed out from under it?
Looking at Writer.Close(), it blocks on the w.donec channel, which is closed by the spawned goroutine when it exits, thus ensuring that the goroutine has exited before Close returns: google-cloud-go/storage/writer.go, line 230 in 825ddd6.
However, Writer.CloseWithError() does not block on that donec -- it simply calls CloseWithError() on the underlying io.Pipe and returns; it does not actually wait for the writer goroutine to notice the closed pipe and exit: google-cloud-go/storage/writer.go, line 261 in 825ddd6.