Retries confusion #15
Comments
I agree about the confusing terminology. However, given my past experience setting up alerts, I do think it's valuable to have a "max allowed consecutive failures" setting (which is what max retries currently is); it lets one say "only bug me if the service is down for more than N minutes", which is handy for systems with not-super-tight SLAs.
Yeah, there are two different kinds of failure parameters one might want to control:
I'm going to rename this to |
I find "max retries" confusing.
"Max" is confusing because it's actually the highest allowed number of attempts, and N+1 attempts cause the check to fail. Intuitively, if I set "max retries" to 3 for a failing service, I would expect it to ping 3 times, and on the third failure declare it failed. But watchman doesn't fail the check until the 4th failure. (This corresponds to `<` vs `<=` on line 124 in `pinger.clj`.) Maybe the reason for this is that the first attempt isn't a "retry"; only subsequent attempts are.
I think "retries" is also not necessarily the right word. To me, that implies that if the ping fails or times out, watchman would immediately try again, and after N tries it would declare the check failed. But watchman doesn't "retry" until the next time the check would normally be performed.
The net result is that if your interval is 1 minute, and "max retries" is 3, both of which seem reasonable, a completely dead check won't fail for 4 minutes, which seems unreasonable.
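To make the off-by-one concrete, here is a minimal Python sketch of the counting described above. It is not watchman's code (that lives in `pinger.clj`, which isn't shown here); the function names and the `strict` flag are made up for illustration, and the timing assumes one ping per interval, counted the same way as in the paragraph above.

```python
def attempts_until_failed(max_retries: int, strict: bool) -> int:
    """Count the consecutive failed pings a dead service needs before the
    check is declared failed.

    strict=False models the behavior described in this issue: the check is
    still considered OK while failures <= max_retries.
    strict=True models the `<` comparison: the check is still considered OK
    only while failures < max_retries.
    """
    failures = 0
    while True:
        failures += 1  # one more scheduled ping fails
        if strict:
            still_ok = failures < max_retries
        else:
            still_ok = failures <= max_retries
        if not still_ok:
            return failures


def minutes_until_failed(max_retries: int, interval_minutes: int, strict: bool) -> int:
    # Assumption: one ping per interval, and each failed ping costs one
    # interval, which is the accounting used in the issue text.
    return attempts_until_failed(max_retries, strict) * interval_minutes


print(attempts_until_failed(3, strict=False))    # 4 failures with `<=`
print(attempts_until_failed(3, strict=True))     # 3 failures with `<`
print(minutes_until_failed(3, 1, strict=False))  # 4 minutes at a 1 minute interval
```

With the `<=` comparison, "max retries = 3" really means "3 failures are tolerated, the 4th one trips the check".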
I think better behavior would be to change the `<=` to `<`, and to make watchman immediately retry failed pings up to N times. At that point the check is declared failed. Subsequently, when the ping interval comes around again for the failing check, the ping would only be done once to see if it's back up or not.
Then if you have a 1 minute check interval and max retries = 3, the check will fail after no more than `3 * timeout`.
The current behavior is actually more like "max allowed consecutive failures".
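A rough sketch of the proposed behavior, in Python rather than watchman's Clojure. `ping`, `check_once`, and `worst_case_seconds` are hypothetical names for illustration only, and `ping` is assumed to return `False` on either a failure or a timeout.

```python
from typing import Callable


def check_once(ping: Callable[[], bool], max_retries: int, already_failed: bool) -> bool:
    """Run one scheduled check; return True if it passes, False if it fails.

    A healthy check that starts failing is retried immediately, up to
    max_retries pings in total. A check that is already marked failed gets
    a single ping per interval, just to see whether the service is back.
    """
    attempts = 1 if already_failed else max(1, max_retries)
    for _ in range(attempts):
        if ping():
            return True
    return False


def worst_case_seconds(max_retries: int, timeout_seconds: float) -> float:
    # Every ping may run into the full timeout, so a dead service is
    # detected after at most max_retries * timeout.
    return max(1, max_retries) * timeout_seconds
```

For example, with max retries = 3 and a hypothetical 5 second timeout, `worst_case_seconds(3, 5.0)` gives 15 seconds, independent of the 1 minute check interval.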