service: retry on transport-level errors from Etcd within keyspace.Watch #294

Open
jgraettinger opened this issue Sep 1, 2021 · 1 comment
jgraettinger commented Sep 1, 2021

During a recent automated GKE upgrade, all brokers and Etcd pods were simultaneously signaled to exit (not ideal, but also not the issue at hand).

As the Etcd pods exited, Gazette brokers observed transport-level errors. These were treated as terminal and caused a controlled but fatal shutdown across all brokers (along with pod restarts):

{"err":"service.Watch: rpc error: code = Unknown desc = closing transport due to: connection error: desc = \"error reading from server: EOF\", received prior goaway: code: NO_ERROR, debug data: ","level":"fatal","msg":"broker task failed","time":"2021-09-01T14:08:13Z"}

The shutdown was controlled -- no data loss is believed or expected to have occurred -- but it did cause cluster consistency to be lost and required operator intervention (gazctl journals reset-head).

What should happen instead

Brokers should have retried the Etcd watch on this transport-level error.

@jgraettinger
Contributor Author

This is arguably a bug within the Etcd client, which generally handles retries itself when a particular Etcd member is unavailable. In the absence of that, Gazette could inspect the error returned by service.Watch and -- if it has a gRPC status code of Unavailable or Unknown -- retry the watch (rebuilding the client if necessary).
