recovery: retry on fetching log spec #307

jgraettinger · 2021-09-24T19:43:51Z

Observed from a consumer application on a recent release, where gazette and the consumer were rolling in tandem (the consumer happened to pick a gazette broker that was exiting):

{"err":"beginRecovery: fetching log spec: rpc error: code = Unavailable desc = error reading from server: read tcp 10.0.0.35:51576-\u003e10.3.247.101:8080: read: connection reset by peer","level":"error","msg":"serveStandby failed","shard":"/gazette/consumers/flow/reactor/items/ ... ","time":"2021-09-24T18:58:07Z"}

jgraettinger · 2021-09-24T21:33:42Z

Digging into this, I'm hesitant to slap a retry on it because we are using FailFast(false), or equivalently WaitForReady(true), on these and other calls. Per the WaitForReady comments, gRPC should be retrying these requests on a transient failure if the server didn't process the request (details).

We do a graceful gRPC stop, and all of the brokers exited cleanly, so I'm unsure why these pods received a TCP RST.

Likely will take no immediate action unless this is a repeated problem.

jgraettinger added the bug label Sep 24, 2021
