Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GCP pub/sub subscription stops working after error #1250

Closed
yordan-pavlov opened this issue Oct 29, 2021 · 10 comments · Fixed by #1285 or #1293
Closed

GCP pub/sub subscription stops working after error #1250

yordan-pavlov opened this issue Oct 29, 2021 · 10 comments · Fixed by #1285 or #1293
Assignees
Labels
kind/bug Something isn't working
Milestone

Comments

@yordan-pavlov
Copy link
Contributor

yordan-pavlov commented Oct 29, 2021

Expected Behavior

GCP pub/sub subscription should be resilient to temporary service / network issues.

Actual Behavior

When the Subscription::Receive method returns an error, the subscription is not reconnected automatically and only starts working again after the dapr process is restarted; see https://github.com/dapr/components-contrib/blob/master/pubsub/gcp/pubsub/pubsub.go#L238

An example error can be:

Subscription closed with error:
rpc error: code = Unknown
desc = closing transport due to:
connection error:
desc = "error reading from server: EOF",
received prior goaway:
code: NO_ERROR,
debug data: server_shutting_down"

Steps to Reproduce the Problem

(1) Start dapr sidecar and application and subscribe to GCP pub/sub
(2) After some amount of time, when the subscription is closed due to an error, new messages will not be received any more
(3) Restart dapr to continue receiving messages

NOTE: since errors that have caused the subscription to close are not currently logged, there is no way to know when the issue has occurred; I was only able to get to the error above after adding the following code just before return err in the handleSubscriptionMessages method.

if err != nil {
	g.logger.Warnf("Subscription closed with error: %s", err)
}

I imagine that a reconnect loop could be implemented in the handleSubscriptionMessages method to fix this issue and make the dapr GCP pub/sub component more resilient, similar to the dapr Azure service bus component here: https://github.com/dapr/components-contrib/blob/master/pubsub/azure/servicebus/servicebus.go#L342

Release Note

RELEASE NOTE:

@yordan-pavlov yordan-pavlov added the kind/bug Something isn't working label Oct 29, 2021
@yaron2
Copy link
Member

yaron2 commented Oct 30, 2021

Thanks for reporting this. Would you be interested in submitting a PR to fix this?

@yordan-pavlov
Copy link
Contributor Author

@yaron2 I would be happy to submit a PR for this.
Also, it appears that this is an issue with the GO GCP pub/sub library which has been reported here googleapis/google-cloud-go#4257 and fixed in version 1.12.2.

In that case, I wonder if it would be enough to just upgrade to v 1.12.2 of the GCP pub/sub library? Or do you think it would still be useful to implement a reconnect loop in the dapr GCP pub/sub component as well (in addition to upgrading the GCP pub/sub library)?

I will be testing the fix at work over the next few days and will report here by the end of this week.

@yaron2
Copy link
Member

yaron2 commented Nov 2, 2021

Perhaps upgrading the SDK version is enough. It would be great if you can report back and open a PR for this when possible.

@yordan-pavlov
Copy link
Contributor Author

@yaron2 testing over the past few days is looking good - it appears that upgrading the GCP pub/sub library to version 1.12.2 solves the problem. I plan to submit a PR for this next week.
I did test a reconnect loop as well, and after the library upgrade it didn't get triggered for server_shutting_down errors (because they are now handled internally by the pub/sub subscription), but there are still use cases where a reconnect loop might be useful (and the Azure Service Bus component already has one), so I might create a PR for that as well.

@yaron2
Copy link
Member

yaron2 commented Nov 9, 2021

@yaron2 testing over the past few days is looking good - it appears that upgrading the GCP pub/sub library to version 1.12.2 solves the problem. I plan to submit a PR for this next week. I did test a reconnect loop as well, and after the library upgrade it didn't get triggered for server_shutting_down errors (because they are now handled internally by the pub/sub subscription), but there are still use cases where a reconnect loop might be useful (and the Azure Service Bus component already has one), so I might create a PR for that as well.

Sounds good, please submit the PR when ready.

@yordan-pavlov
Copy link
Contributor Author

@yaron2 please have a look at the PR for upgrading the GCP pub/sub library here #1285.
I will be submitting a separate PR for a reconnect loop + some error logging.

@dapr-bot
Copy link
Collaborator

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged (pinned, good first issue, help wanted or triaged/resolved) or other activity occurs. Thank you for your contributions.

@dapr-bot dapr-bot added the stale label Dec 19, 2021
@dapr-bot
Copy link
Collaborator

This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as pinned, good first issue, help wanted or triaged/resolved. Thank you for your contributions.

@yaron2 yaron2 reopened this Dec 26, 2021
@dapr-bot
Copy link
Collaborator

dapr-bot commented Jan 2, 2022

This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as pinned, good first issue, help wanted or triaged/resolved. Thank you for your contributions.

@dapr-bot dapr-bot closed this as completed Jan 2, 2022
@yaron2 yaron2 reopened this Jan 2, 2022
@dapr-bot
Copy link
Collaborator

dapr-bot commented Jan 9, 2022

This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as pinned, good first issue, help wanted or triaged/resolved. Thank you for your contributions.

@dapr-bot dapr-bot closed this as completed Jan 9, 2022
@yaron2 yaron2 reopened this Jan 13, 2022
yaron2 added a commit that referenced this issue Jan 13, 2022
* upgrade cloud.google.com/go/pubsub to v1.12.2

Signed-off-by: Yordan Pavlov <yordan.pavlov@dunnhumby.com>

* update GCP secret manager test

Signed-off-by: Yordan Pavlov <yordan.pavlov@dunnhumby.com>

* make modtidy-all

Signed-off-by: Yordan Pavlov <yordan.pavlov@dunnhumby.com>

Co-authored-by: Yordan Pavlov <yordan.pavlov@dunnhumby.com>
Co-authored-by: Dapr Bot <56698301+dapr-bot@users.noreply.github.com>
Co-authored-by: Yaron Schneider <schneider.yaron@live.com>
@artursouza artursouza removed the stale label Jan 24, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment