Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Retries confusion #15

Open
cespare opened this issue Dec 8, 2013 · 3 comments
Open

Retries confusion #15

cespare opened this issue Dec 8, 2013 · 3 comments

Comments

@cespare
Copy link
Collaborator

cespare commented Dec 8, 2013

I find "max retries" confusing.

"Max" is confusing because it's actually the highest allowed number of attempts, and N+1 attempts cause the check to fail. Intuitively, if I set "max retries" to 3 for a failing service, I would expect it to ping 3 times, and on the third failure declare it failed. But watchman doesn't fail the check until the 4th failure. (This corresponds to < vs <= on line 124 in pinger.clj.) Maybe the reason for this is that the first attempt isn't a "retry"; only subsequent attempts are.

I think "retries" is also not necessarily the right word. To me, that implies that if the ping fails or times out, watchman would immediately try again, and after N tries it would declare the check failed. But watchman doesn't "retry" until the next time the check would normally be performed.

The net result is that if your interval is 1 minute, and "max retries" is 3, both of which seem reasonable, a completely dead check won't fail for 4 minutes, which seems unreasonable.

I think better behavior would be to change the <= to <, and to make watchman immediately retry failed pings up to N times. At that point the check is declared failed. Subsequently, when the ping interval comes around again for the failing check, the ping would only be done once to see if it's back up or not.

Then if you have a 1 minute check interval and max retries = 3, the check will fail after no more than 3 * timeout.

The current behavior is actually more like "max allowed consecutive failures".

@harob
Copy link
Collaborator

harob commented Dec 9, 2013

I agree about the confusing terminology. However, given my past experiences setting up alerts I do think it's valuable to have a "max allowed consecutive failures" setting (which is what max retries currently is); it allows one to say "only bug me if the service is down for more than N minutes", which is handy for systems with not-super-tight SLA's.

@cespare
Copy link
Collaborator Author

cespare commented Dec 9, 2013

Yeah, there are two different kinds of failure parameters one might want to control:

  1. Max consecutive failures -- you only want to know when the service has been down for some amount of time, which is good if you expect it to go down for a minute or two here and there and you don't care about that (as you pointed out).
  2. Max retries -- you want to know as soon as something's wrong, but like any HTTP/network request there can always be transient failures and immediately retrying the request will keep the noise down.

@philc
Copy link
Owner

philc commented Feb 22, 2014

I'm going to rename this to max_consecutive_failures to clear up the confusion.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants