
pubsub: non-retryable rpc errors in subscription Receive() method #4257

Closed
maximerety opened this issue Jun 14, 2021 · 9 comments · Fixed by #4313 or #4384
Labels: `api: pubsub` (Issues related to the Pub/Sub API) · `priority: p1` (Important issue which blocks shipping the next release; will be fixed prior to next release) · `🚨` (This issue needs some love) · `type: bug` (Error or flaw in code with unintended results or allowing sub-optimal usage patterns)

Comments

@maximerety

Client

PubSub

Environment

Ubuntu Focal on GKE

Go Environment

$ go version
go version go1.16.5

Code

We're using the standard way to receive messages from a subscription, with the default configuration (similar to the code snippet from the docs):

 err := subscription.Receive(context.Background(), func(ctx context.Context, m *pubsub.Message) {
 	log.Printf("Got message: %s", m.Data)
 	m.Ack()
 })
 if err != nil {
 	// Handle error.
 }

We've used this method for several months in long-lived containers without receiving any (non-retryable) errors.

On June 2, 2021, we started receiving sporadic fatal RPC errors of the following forms:

rpc error: code = Unknown desc = closing transport due to: connection error:
  desc = "error reading from server: EOF", received prior goaway: code: NO_ERROR, debug data: server_shutting_down

rpc error: code = Unknown desc = closing transport due to: connection error:
  desc = "error reading from server: EOF", received prior goaway: code: NO_ERROR, debug data: max_age

rpc error: code = Unknown desc = closing transport due to: connection error:
  desc = "error reading from server: read tcp <ip>:54780-><ip>:443: read: connection reset by peer", received prior goaway: code: NO_ERROR, debug data: max_age

We've seen these kinds of errors several times per day since then.

Expected behavior

Automatic internal retries on these kinds of RPC errors, so that the Receive() method doesn't error.

I've noticed that there is a piece of retry logic in pubsub/service.go that detects server shutdowns associated with the Unavailable code and the "Server shutdownNow invoked" message.

Maybe something recently changed in the way the remote server signals that it is shutting down and this logic needs to be updated?
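
The kind of check described above can be sketched as a predicate. This is an illustration only, not the actual code in pubsub/service.go (which uses the gRPC status API rather than plain strings):

```go
package main

import (
	"fmt"
	"strings"
)

// isRetryableShutdown mimics the retry condition described above: treat an
// RPC error as retryable when its status code is Unavailable and the message
// signals a server shutdown. Illustrative stand-in for the real logic.
func isRetryableShutdown(code, msg string) bool {
	return code == "Unavailable" && strings.Contains(msg, "Server shutdownNow invoked")
}

func main() {
	fmt.Println(isRetryableShutdown("Unavailable", "Server shutdownNow invoked"))
	fmt.Println(isRetryableShutdown("Unknown", "received prior goaway: code: NO_ERROR"))
}
```

Note how the GOAWAY errors reported in this issue carry code Unknown and a different message, so a check of this shape would not retry them, matching the observed behavior.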

Actual behavior

The Receive() method from the subscription errors.

Additional context

Started on June 2, 2021.

This was not long after we upgraded to pubsub v1.11.0, although I don't see how that would be related, so I'm mentioning it mostly for completeness.

@maximerety maximerety added the triage me I really want to be triaged. label Jun 14, 2021
@product-auto-label product-auto-label bot added the api: pubsub Issues related to the Pub/Sub API. label Jun 14, 2021
@hongalex hongalex added type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. priority: p1 Important issue which blocks shipping the next release. Will be fixed prior to next release. and removed triage me I really want to be triaged. labels Jun 14, 2021
@hongalex (Member)

Thanks for reporting. This error is actually returned by a reverse proxy server rather than the Pub/Sub service itself, and the issue you described affects all of our Pub/Sub libraries. Assuming you're not overriding connections, this is likely just an issue with the server hitting a large number of connections. I'll submit a fix later this week to get this retried automatically.

@maximerety (Author)

Hi @hongalex, thank you for this quick update. If I can be of any help, don't hesitate to ask.

@yoshi-automation yoshi-automation added the 🚨 This issue needs some love. label Jun 21, 2021
@noseglid

Any news here? This is hitting us hard recently.

@noseglid

I did some investigating, since this doesn't seem to happen for some services we have running, while it happens continually for others. After some digging, I managed to boil it down to one dependency: google.golang.org/grpc.

I have two Git tags in my service, and when I run `go list -m all` for each tag, the only dependency that differs is google.golang.org/grpc.

One tag uses google.golang.org/grpc v1.37.0 and has no issues.
The other tag uses google.golang.org/grpc v1.38.0 and shows the issue described in this ticket.

Maybe it's an issue with a proxy, but it definitely seems to depend on some behaviour of the grpc lib as well.

@hongalex (Member)

Apologies for the delay; other issues had crept up. I have a PR open to address this in the short term, and will follow up with another PR to allow custom retries of our methods so workarounds will be available earlier. The fix should be merged and released later today.

gcf-merge-on-green bot pushed a commit that referenced this issue Jun 29, 2021
The frontend server for Pub/Sub might occasionally emit GOAWAY errors, which are currently not retried. This is not unique to the Go client, though the categorization as `UNKNOWN` vs `UNAVAILABLE` is a golang-gRPC issue. Although UNKNOWN [should not generally be retried](https://google.aip.dev/194), this will unblock users of `Receive` until the grpc library can be changed. See internal cl/377393940 for a similar fix in another Go library.

Fixes #4257.
@noseglid commented Jul 1, 2021

Is there any ETA on when this will be released? Seems like it didn't make it into the previous (v1.12) release.

@hongalex (Member) commented Jul 1, 2021

This should be released in v1.12.1 via #4334 shortly. Thanks for your patience.

@noseglid commented Jul 7, 2021

Unfortunately, this still happens. Much less frequently, but still. Now with this error:

rpc error: code = Unknown desc = closing transport due to: connection error: desc = "error reading from server: read tcp 10.0.234.233:37960->74.125.133.95:443: read: connection reset by peer", received prior goaway: code: NO_ERROR, debug data: server_shutting_down

The fix doesn't work because the message says `error reading from server: read tcp`, not the `error reading from server: EOF` that the fix expects.

Should it perhaps retry on a gRPC error with status code Unknown instead?

@hongalex (Member) commented Jul 7, 2021

Yeah, that was an oversight on my part. I don't think we want to retry all Unknown errors (based on the AIP), but I'll fix the error-message check to look for just `received prior goaway: code: NO_ERROR` instead. I'll try to get this released later today.
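
The broadened check described in this comment can be sketched as follows. This is illustrative only, not the actual client code (which works with the gRPC status API); it shows why matching on the GOAWAY marker covers both the EOF and the TCP-reset variants seen in this thread:

```go
package main

import (
	"fmt"
	"strings"
)

// isGoawayRetryable sketches the broadened condition: retry any error with
// status code Unknown whose message contains the GOAWAY marker, regardless
// of whether the transport reported EOF or a connection reset.
func isGoawayRetryable(code, msg string) bool {
	return code == "Unknown" && strings.Contains(msg, "received prior goaway: code: NO_ERROR")
}

func main() {
	eof := `error reading from server: EOF", received prior goaway: code: NO_ERROR, debug data: max_age`
	reset := `error reading from server: read tcp 10.0.234.233:37960->74.125.133.95:443: read: connection reset by peer", received prior goaway: code: NO_ERROR, debug data: server_shutting_down`
	fmt.Println(isGoawayRetryable("Unknown", eof))   // true
	fmt.Println(isGoawayRetryable("Unknown", reset)) // true
}
```

Matching on the GOAWAY marker rather than the transport detail is what makes the check robust to the different underlying read failures.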

@hongalex hongalex reopened this Jul 7, 2021
gcf-merge-on-green bot pushed a commit that referenced this issue Jul 7, 2021
The previous iteration of this PR, #4313, only retried GOAWAY in conjunction with an EOF string, which doesn't capture all GOAWAY errors. Now, all error messages with `GOAWAY` and error code `UNKNOWN` are retried.

Fixes #4257