Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(storage): try to reopen for failed Reads #4226

Merged
merged 7 commits into from Jun 23, 2021

Conversation

tritone
Copy link
Contributor

@tritone tritone commented Jun 7, 2021

This error can occur in calls to reader.Read if the CFE closes the
connection between when the response headers are received and when
the caller reads from the reader.

Fixes #3040

This error can occur in calls to reader.Read if the CFE closes the
connection between when the response headers are received and when
the caller reads from the reader.

Fixes googleapis#3040
@tritone tritone requested a review from a team June 7, 2021 03:07
@tritone tritone requested a review from a team as a code owner June 7, 2021 03:07
@google-cla google-cla bot added the cla: yes This human has signed the Contributor License Agreement. label Jun 7, 2021
@product-auto-label product-auto-label bot added the api: storage Issues related to the Cloud Storage API. label Jun 7, 2021
@tritone
Copy link
Contributor Author

tritone commented Jun 7, 2021

FYI @neild

Copy link
Member

@frankyn frankyn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This LGTM, per offline discussions on GOAWAY being retriable safely with idempotent requests. We have not yet understood how to retry these errors with non-idempotent requests.

@@ -389,7 +389,17 @@ func shouldRetryRead(err error) bool {
if err == nil {
return false
}
return strings.HasSuffix(err.Error(), "INTERNAL_ERROR") && strings.Contains(reflect.TypeOf(err).String(), "http2")
// Certain HTTP/2 errors are retryable, but the types are not exported.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What errors are not retryable here?

This path (shouldRetryRead) is only consulted when a read has returned a successful response header, but something went wrong while reading the body. Is there a reason not to unconditionally retry the read in this case? (Preferably with backoff.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the case of this call, "retry" means to call the reader's reopen at the updated offset (which involves a new GET request to the GCS API) and then attempt to continue reading bytes from that point.

I'm a little bit hesitant to retry unconditionally:

  1. If the error from Read is a context timeout/cancellation, we would probably want to return to the caller immediately and not fire off another request.
  2. If there is some error which could indicate problems with the data received, we would definitely want to surface that to the caller (so that they can retry the whole Read from the start of the object if desired). I'm not sure if this is actually something that can occur, though (do you have thoughts?).

I could see having a retry for any http2 error though, if you think it's safe.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You want to stop retrying when the context expires/is canceled, of course. (I wouldn't test the error for this, though; test the context directly to see if it is done.)

For any other error--once you've read a response header, I don't think there is any error condition other than variations on "the stream broke". The only point where the remote end can provide a useful error is when sending the response header. I wouldn't limit retries to HTTP/2 errors, because the same applies to HTTP/1 (if someone disables HTTP/2 for some reason) or HTTP/3 (someday).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for clarifying, much appreciated!

I tried to implement this but I'm a bit stuck on what to do with the context. What I'm seeing is that we don't have access to the context via the Read method; it's passed in via NewRangeReader and then embedded in reopen(). Any thoughts on what to do with this? I guess I could have reopen() itself check for context.Done before doing any calls? What do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I wrote an implementation doing what I suggested above-- does this seem right to you? I think I should write a test that cancels the context as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added an integration test that makes sure cancelation actually works as expected as well (did this because I wanted to make sure there's no possibility of an infinite loop of calls to reopen at this point)

Copy link

@neild neild left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@tritone tritone added the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Jun 22, 2021
@kokoro-team kokoro-team removed the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Jun 22, 2021
@tritone tritone changed the title fix(storage): retry GOAWAY errors for Read fix(storage): try to reopen for failed Reads Jun 22, 2021
Copy link
Member

@codyoss codyoss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@tritone tritone merged commit 564102b into googleapis:master Jun 23, 2021
@tritone tritone deleted the retry-goaway branch June 23, 2021 04:02
adityamaru pushed a commit to cockroachdb/google-cloud-go that referenced this pull request Jun 23, 2021
Errors from reading the response body in Reader.Read will now always trigger a reopen() call (unless the context has been canceled). Previously, this was limited to only INTERNAL_ERROR from HTTP/2.

Fixes googleapis#3040
adityamaru added a commit to cockroachdb/vendored that referenced this pull request Jun 23, 2021
adityamaru added a commit to adityamaru/cockroach that referenced this pull request Jun 23, 2021
This change switches the release-21.1 branch to point to a
fork of v0.45.1 of the google cloud SDK that includes
googleapis/google-cloud-go#4226.

The above change provides better retry on GOAWAY errors which
have become an increasingly common occurrence in our roachtests.

Release note (general change): Switches the release-21.1 branch
to point to a fork of the google cloud SDK at version 0.45.1
with an upstream bug fix
googleapis/google-cloud-go#4226.
adityamaru pushed a commit to cockroachdb/google-cloud-go that referenced this pull request Jun 24, 2021
Errors from reading the response body in Reader.Read will now always trigger a reopen() call (unless the context has been canceled). Previously, this was limited to only INTERNAL_ERROR from HTTP/2.

Fixes googleapis#3040
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api: storage Issues related to the Cloud Storage API. cla: yes This human has signed the Contributor License Agreement.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

storage: automatic retries with exponential backoff for download failures caused by load shed
6 participants