
Retry 400s #154

Open · wants to merge 3 commits into base: master
Conversation

joshsammut

A problem I've been noticing is that occasionally a server in our eureka cluster starts returning a 400 status. We do notice the problem and fix it, but because of the way the retry logic is structured, only 5xx statuses get retried. The implication is that once our bad server starts returning a 400, the eureka client doesn't move on and try the healthy ones. This makes our service(s) unavailable until the server is fixed or manually restarted.
I've updated the logic so that all non-200 statuses are considered an error to be retried, to help make our services more resilient without manual intervention.
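The change described above can be sketched like this (a minimal illustration with hypothetical function names, not the library's actual internals):

```javascript
// Before: only 5xx responses were considered retryable,
// so a server stuck returning 400 was never skipped.
function isRetryableBefore(statusCode) {
  return statusCode >= 500 && statusCode <= 599;
}

// After (this PR): any non-200 response triggers a retry
// against the next server in the list.
function isRetryableAfter(statusCode) {
  return statusCode !== 200;
}
```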

@jquatier
Owner

Hey @joshsammut, that's an interesting problem. It feels a little fishy that the servers start tossing back 400 errors (bad request). I haven't seen that, and this library is being used on some pretty large installations. This feels indicative of a configuration problem with the server itself.

That being said, if the desire is to retry these errors too (to work around the server problem), I think it would be better to make this configurable while keeping the current behavior as the default. The problem with the code change as it stands is that a misconfigured client application would also sit and retry with backoff needlessly. That's a case where we really want to fail fast, especially during a deployment. I don't think adding a new configuration option for the retryableStatusCodes is out of reach. It could be an array that defaults to 500-504.
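A sketch of what that option could look like (the `retryableStatusCodes` name comes from the comment above; the `isRetryable` helper and config shape are assumptions for illustration, not the library's API):

```javascript
// Proposed default: retry only the 5xx range 500-504,
// preserving the current behavior unless overridden.
const DEFAULT_RETRYABLE_STATUS_CODES = [500, 501, 502, 503, 504];

// Decide whether a response status should trigger a retry,
// honoring an optional retryableStatusCodes override in config.
function isRetryable(statusCode, config = {}) {
  const codes = config.retryableStatusCodes || DEFAULT_RETRYABLE_STATUS_CODES;
  return codes.includes(statusCode);
}
```

A client hitting the misconfigured-server case could then opt in with `{ retryableStatusCodes: [400, 500, 501, 502, 503, 504] }`, while everyone else keeps fail-fast behavior on 4xx.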

@joshsammut
Author

@jquatier ya you're right, the server returning a 400 is definitely due to a misconfiguration. What we've seen is that in development environments we have clusters where the first url in the list works fine but the second does not. Because the first works 99.9% of the time, this goes unnoticed until a random blip causes the client to fall over to the second in the list, which returns a 400. The client never moves on from it, so requests keep failing until a human intervenes. We don't see this in prod; I was just trying to prevent myself from having to manually intervene.
Since this is a change of behaviour, I can make this a configuration option where we pass a list in.
