New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(storage): try to reopen for failed Reads #4226
Conversation
This error can occur in calls to reader.Read if the CFE closes the connection between when the response headers are received and when the caller reads from the reader. Fixes googleapis#3040
FYI @neild |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This LGTM, per offline discussions on GOAWAY
being retriable safely with idempotent requests. We have not yet understood how to retry these errors with non-idempotent requests.
storage/reader.go
Outdated
@@ -389,7 +389,17 @@ func shouldRetryRead(err error) bool { | |||
if err == nil { | |||
return false | |||
} | |||
return strings.HasSuffix(err.Error(), "INTERNAL_ERROR") && strings.Contains(reflect.TypeOf(err).String(), "http2") | |||
// Certain HTTP/2 errors are retryable, but the types are not exported. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What errors are not retryable here?
This path (shouldRetryRead) is only consulted when a read has returned a successful response header, but something went wrong while reading the body. Is there a reason not to unconditionally retry the read in this case? (Preferably with backoff.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the case of this call, "retry" means to call the reader's reopen
at the updated offset (which involves a new GET request to the GCS API) and then attempt to continue reading bytes from that point.
I'm a little bit hesitant to retry unconditionally:
- If the error from Read is a context timeout/cancellation, we would probably want to return to the caller immediately and not fire off another request.
- If there is some error which could indicate problems with the data received, we would definitely want to surface that to the caller (so that they can retry the whole Read from the start of the object if desired). I'm not sure if this is actually something that can occur, though (do you have thoughts?).
I could see having a retry for any http2 error though, if you think it's safe.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You want to stop retrying when the context expires/is canceled, of course. (I wouldn't test the error for this, though; test the context directly to see if it is done.)
For any other error--once you've read a response header, I don't think there is any error condition other than variations on "the stream broke". The only point where the remote end can provide a useful error is when sending the response header. I wouldn't limit retries to HTTP/2 errors, because the same applies to HTTP/1 (if someone disables HTTP/2 for some reason) or HTTP/3 (someday).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for clarifying, much appreciated!
I tried to implement this but I'm a bit stuck on what to do with the context. What I'm seeing is that we don't have access to the context via the Read
method; it's passed in via NewRangeReader
and then embedded in reopen()
. Any thoughts on what to do with this? I guess I could have reopen()
itself check for context.Done
before doing any calls? What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, I wrote an implementation doing what I suggested above-- does this seem right to you? I think I should write a test that cancels the context as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added an integration test that makes sure cancelation actually works as expected as well (did this because I wanted to make sure there's no possibility of an infinite loop of calls to reopen at this point)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Errors from reading the response body in Reader.Read will now always trigger a reopen() call (unless the context has been canceled). Previously, this was limited to only INTERNAL_ERROR from HTTP/2. Fixes googleapis#3040
Patches v0.45.1 of `cloud.google.com/go` with googleapis/google-cloud-go#4226.
This change switches the release-21.1 branch to point to a fork of v0.45.1 of the google cloud SDK that includes googleapis/google-cloud-go#4226. The above change provides better retry on GOAWAY errors which have become an increasingly common occurrence in our roachtests. Release note (general change): Switches the release-21.1 branch to point to a fork of the google cloud SDK at version 0.45.1 with an upstream bug fix googleapis/google-cloud-go#4226.
Errors from reading the response body in Reader.Read will now always trigger a reopen() call (unless the context has been canceled). Previously, this was limited to only INTERNAL_ERROR from HTTP/2. Fixes googleapis#3040
This error can occur in calls to reader.Read if the CFE closes the
connection between when the response headers are received and when
the caller reads from the reader.
Fixes #3040