How to monitor the AM server properly #1018
Labels
Priority 4 - Strategic: would benefit multiple use cases if resolved
ReliabilityRecovery: improve behaviour in failure situations
Last night, the AM server crashed, and the crash was partially caused by sr3 sanity. One of the socket connections coming from the regional servers crashed, and the AM server forked the connection properly. The orphan socket also got closed properly. Because of this, sr3 sanity picked this up and tried to restart the hung instance. Not only that, but ...

This raises the question: how can we make sanity properly monitor the AM server?
One problem we have is that some processes eventually get marked as "strays" and, from the looks of it, the PID stored in the pid file gets changed. This is why sanity restarted the active processes: their pid files didn't match. This still needs more investigation and might require a separate issue.
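To make the pid file mismatch easier to reason about, here is a minimal sketch of the kind of check involved. It is not how sr3 sanity is actually implemented; the function names, the status strings and the pid file layout (a single integer in a text file) are assumptions made for illustration, and the liveness test relies on POSIX signal 0.

```python
import os
from pathlib import Path
from typing import Optional


def pid_is_running(pid: int) -> bool:
    """Return True if a process with this PID currently exists (POSIX only)."""
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)  # signal 0 only checks for existence, nothing is delivered
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def check_pid_file(pid_file: Path, expected_pid: Optional[int] = None) -> str:
    """Classify a pid file as 'missing', 'unreadable', 'stale', 'mismatch' or 'ok'.

    Hypothetical helper: assumes the file holds a single integer PID.
    """
    try:
        text = pid_file.read_text().strip()
    except FileNotFoundError:
        return "missing"
    try:
        recorded = int(text)
    except ValueError:
        return "unreadable"
    if not pid_is_running(recorded):
        return "stale"  # the file points at a process that no longer exists
    if expected_pid is not None and recorded != expected_pid:
        return "mismatch"  # the process we track is not the one the file names
    return "ok"
```

A monitor that distinguishes "mismatch" from "stale" could log the former for investigation instead of restarting a process that is still alive, which is the behaviour that hurt here.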
Proposed solutions
A sanity off option that would make sanity checks skip over the config file.
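As a rough illustration of the proposal, the sketch below filters out configurations that opt out of sanity checks. The sanity option does not exist yet; the parsing rules (one "option value" pair per line, lines starting with # treated as comments, last occurrence wins) and the function names are assumptions, not sr3's real config parser.

```python
from typing import Dict, List


def sanity_enabled(config_text: str) -> bool:
    """Return False if the config turns sanity checks off; the default is True."""
    enabled = True
    for raw in config_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split()
        if parts[0] == "sanity" and len(parts) >= 2:
            # Hypothetical option: the last 'sanity' line in the file wins.
            enabled = parts[1].lower() not in ("off", "false", "no")
    return enabled


def configs_to_check(configs: Dict[str, str]) -> List[str]:
    """Return the names of configs that sanity should still examine."""
    return [name for name, text in configs.items() if sanity_enabled(text)]
```

With this shape, the AM server config would carry a single sanity off line and every other config would keep the current behaviour by default.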