Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(pubsub): retry deadline exceeded errors in Acknowledge #3157

Merged
merged 7 commits into from Nov 9, 2020

Conversation

hongalex
Copy link
Member

@hongalex hongalex commented Nov 6, 2020

Currently, DeadlineExceeded errors in the Acknowledge RPC can cause the iterator to fail, and thus causing the entire errgroup and Receive call to shut down. This is a bad user experience as the better solution is to allow DeadlineExceeded to be transparent and allow messages to be redelivered. This change marks rpc DeadlineExceeded and context deadline exceeded errors from the Acknowledge RPC to fail transparently so Receive does not shut down.

In addition, this adds a 3 minute timeout for DeadlineExceeded errors to be retried with backoff in the case of network issues. This should not impact most users as the default 60 second timeout for each Acknowledge RPC is still intact. We can look into exposing these values in the future.

@google-cla google-cla bot added the cla: yes This human has signed the Contributor License Agreement. label Nov 6, 2020
@product-auto-label product-auto-label bot added the api: pubsub Issues related to the Pub/Sub API. label Nov 6, 2020
Copy link
Contributor

@shollyman shollyman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I fully understand what's going on here.

You're using the first 3 min context for sleeping, but it will unpause early if the context transitions to Done, at which point it's unusable anyways.

The second cctx2 context appears to be used for making the Acknowledge calls.

The check for the "context deadline exceeded" is just a matter of probing context.Done(), isn't it? Should this be rewritten as for loop with a select that let's you wait on context expiry or a response from the rpc?

It's entirely possible I'm missing something obvious here like the interaction between the iterator and the acks, given my lack of familiarity with this.

// Use the outer context with timeout here. Errors from gax, including
// context deadline exceeded should be transparent, as unacked messages
// will be redelivered.
if err := gax.Sleep(cctx, bo.Pause()); err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so if we got a deadline exceeded, this seems to be a sleep and return, but I'm not seeing the retry. I'd expect a continue if a retry of the Acknowledge() call on line 400 is the intent?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the gax.Sleep call is to briefly pause before retrying the Acknowledge call. The err checking on this line, and the other instance of gax.Sleep, is to see if the outer 3 minute context has finished or not. If so, the entire call will terminate since err != nil and the next line will return nil. If err == nil here, Acknowledge will be called again on the next iteration of the loop.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was misreading the logic, thanks for the clarification.

@hongalex
Copy link
Member Author

hongalex commented Nov 7, 2020

You're using the first 3 min context for sleeping, but it will unpause early if the context transitions to Done, at which point it's unusable anyways.

The 3m context is for sleeping yes, but the context should only transition to Done after 3 minutes (I don't cancel elsewhere). The idea is that I only want to retry Acknoweldge calls for up to 3 minutes total, with a 60 second timeout per RPC. After 3 minutes or a non-deadline exceeded error, we'll either return nil or the unexpected non-retryable error.

The check for the "context deadline exceeded" is just a matter of probing context.Done(), isn't it? Should this be rewritten as for loop with a select that let's you wait on context expiry or a response from the rpc?

I believe in strange circumstances, the RPC call can return a context deadline exceeded error even if the original context hasn't timed out. This checks for both. Of course, this is a kind of hacky fix, which is why I included the link to the Github issue.

// Use the outer context with timeout here. Errors from gax, including
// context deadline exceeded should be transparent, as unacked messages
// will be redelivered.
if err := gax.Sleep(cctx, bo.Pause()); err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was misreading the logic, thanks for the clarification.

@hongalex hongalex merged commit ae75b46 into googleapis:master Nov 9, 2020
@hongalex hongalex deleted the pubsub-ack-retry branch November 9, 2020 19:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api: pubsub Issues related to the Pub/Sub API. cla: yes This human has signed the Contributor License Agreement.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants