
Retry 400s #154

Open · wants to merge 3 commits into base: master
Conversation

joshsammut

A problem I've been noticing is that occasionally a server in our eureka cluster starts returning a 400 status. We do notice the problem and fix it, but because of the way the retry logic is structured, only 5xx statuses get retried. The implication is that once our bad server starts returning a 400, the eureka client doesn't move on and try the healthy ones. This makes our service(s) unavailable until the server is fixed or manually restarted.
I've updated the logic so that all non-200 statuses are considered an error to be retried, to help make our services more resilient without manual intervention.
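The change described above can be sketched like this (a minimal illustration with hypothetical function names, not the library's actual internals):

```javascript
// Before: only 5xx responses were considered retryable,
// so a server stuck returning 400 was never skipped.
function isRetryableBefore(statusCode) {
  return statusCode >= 500 && statusCode <= 599;
}

// After (this PR): any non-200 response triggers a retry
// against the next server in the list.
function isRetryableAfter(statusCode) {
  return statusCode !== 200;
}
```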

@jquatier
Owner

Hey @joshsammut, that's an interesting problem. It feels a little fishy that the servers start tossing back 400 errors (bad request). I haven't seen that, and this library is being used on some pretty large installations. This feels indicative of a configuration problem with the server itself.

That being said, if the desire is to retry these errors too (to work around the server problem), I think it would be better to make this configurable while keeping the current behavior as the default. The problem with the code change as it stands is that a misconfigured client application would also sit and retry with backoff needlessly. That's a case where we really want to fail fast, especially during a deployment. I don't think adding a new configuration option for the retryableStatusCodes is out of reach. It could be an array that defaults to 500-504.
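A sketch of what that option could look like (the `retryableStatusCodes` name comes from the comment above; the `isRetryable` helper and config shape are assumptions for illustration, not the library's API):

```javascript
// Proposed default: retry only the 5xx range 500-504,
// preserving the current behavior unless overridden.
const DEFAULT_RETRYABLE_STATUS_CODES = [500, 501, 502, 503, 504];

// Decide whether a response status should trigger a retry,
// honoring an optional retryableStatusCodes override in config.
function isRetryable(statusCode, config = {}) {
  const codes = config.retryableStatusCodes || DEFAULT_RETRYABLE_STATUS_CODES;
  return codes.includes(statusCode);
}
```

A client hitting the misconfigured-server case could then opt in with `{ retryableStatusCodes: [400, 500, 501, 502, 503, 504] }`, while everyone else keeps fail-fast behavior on 4xx.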

@joshsammut
Author

@jquatier ya you're right, the server returning a 400 is definitely due to a misconfiguration. What we've seen is that in development environments we have clusters where the first url in the list works fine but the second does not. Because the first works 99.9% of the time, this goes unnoticed until a random blip causes the client to fall over to the second in the list, which returns a 400. The client never moves on from it, so requests keep failing until a human intervenes. We don't see this in prod; I was just trying to prevent myself from having to manually intervene.
Since this is a change of behaviour, I can make this a configuration option where we pass a list in.
