sr3 sanity not picking up missing instance #927
Comments
Tried a trivial case: ran a dynamic_flow test, did a kill -9 of the poll process, then ran sr3 sanity. It started the instance up as expected, no problem. |
The same thing happened again last night, with the same configuration. The configuration stopped running @19:02Z yesterday. Again, sanity never noticed anything.
|
next time this shows up, please confirm:
it would be good to capture a case, doing a ps with some options... maybe it's only half-dead. |
The auto-restart script had a bug and the data stopped flowing again this morning @7:07Z. However, this gave me the chance to get some info on the process. When I looked for a process, I wasn't able to find anything.
However, it looks like sanity KILLS the poll instance this time, and afterwards it fails.. It's hard to tell as the sanity logs don't include milliseconds 😢
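On the missing milliseconds: a minimal sketch of how a Python `logging` formatter can emit them, assuming the sanity logs are produced through the standard `logging` module. The logger name and the format string below are illustrative assumptions, not what sr3 actually configures.

```python
import io
import logging


def make_logger(stream):
    """Build a logger whose timestamps carry milliseconds."""
    logger = logging.getLogger("sanity_demo")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.propagate = False
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    # A custom datefmt drops the millisecond part of asctime,
    # so %(msecs)03d is appended explicitly.
    handler.setFormatter(logging.Formatter(
        fmt="%(asctime)s.%(msecs)03d [%(levelname)s] %(message)s",
        datefmt="%Y-%m-%d %H:%M:%S",
    ))
    logger.addHandler(handler)
    return logger


buf = io.StringIO()
make_logger(buf).info("instance check")
```

With timestamps at this resolution, the order of a kill and a subsequent failure within the same second becomes readable from the log.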
|
I may have spoken too soon. Looking at yesterday's sanity logs (and today's), it looks like it's finding the configuration in a 'hung' state, despite this not being the case two days ago.
We did a configuration change yesterday (around 13:32Z where we changed |
I haven't been able to reproduce the problem lately. The poll has only failed once more since Sunday and got restarted by the auto-restart script. I have a poll on dev to try and catch it again (without the auto-restart), but it won't bite |
This happened with a sender (v3.00.52). The log showed the instance crashed:
sr3 status reported that it was missing
Sanity was running every 7 minutes and did not detect or restart the missing instance |
This happened again with the same AIRNOW poll this Sunday night.
Sanity never noticed anything
|
"missing" in this context means that the poll process died. In this case, it looks like it raised an exception, not clear that it died. It might not be doing anything good, but can you clarify if the process actually died? If not, I guess that means the log was idle for a long time (should have been caught by sanity.) Was the log idle (nothing being written all night?) |
The pager does not recall having checked for active/inactive processes when we got paged. The poll crashed at Not sure how long the log needs to be dead before the instance is declared missing (5 minutes, a.k.a. every housekeeping interval?) |
I also was able to recreate the problem (with a sender) and I noticed a couple of things I noticed that the
If you look at the code, finding Lines 1979 to 2012 in 9f61e90
|
You keep referring to "missing_instances", but I still don't know whether any instances are missing. Missing means the process died (either by crashing or by something killing it). You say "the poll crashed", but I don't know if that means the process referred to in the state file stopped, or if it just stopped writing messages in the log. Did you do a ps -aux | grep the_pid_in_the_instance_file to find the process? We need to know whether it is missing or hung. You don't kill missing processes; they are already dead. You say you "recreated the problem with sender". If I kill a sender instance and then run sanity, it finds it and starts it up, which does not reproduce the problem. Please describe HOW you "reproduce the problem". What did you do to get sanity not to notice? |
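The missing-versus-hung distinction asked about above can be checked mechanically. This is a minimal sketch, not the sr3 implementation: the function names, the `idle_limit` threshold, and taking the pid and log modification time as plain arguments are all assumptions made for illustration. It relies on the POSIX behaviour that signal 0 tests for a process without delivering anything.

```python
import os
import time


def pid_is_alive(pid):
    """Return True if a process with this pid exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0: existence/permission check, nothing is sent
    except ProcessLookupError:
        return False  # no such process
    except PermissionError:
        return True  # process exists but is owned by another user
    return True


def classify_instance(pid, log_mtime, now=None, idle_limit=300):
    """Classify one instance as 'missing', 'hung' or 'running'.

    pid        -- pid recorded for the instance (hypothetical input)
    log_mtime  -- last modification time of the instance log, epoch seconds
    idle_limit -- assumed seconds of log silence tolerated before 'hung'
    """
    if now is None:
        now = time.time()
    if not pid_is_alive(pid):
        return "missing"  # process died; nothing left to kill
    if now - log_mtime > idle_limit:
        return "hung"  # process exists but its log has gone quiet
    return "running"
```

For example, `classify_instance(os.getpid(), time.time())` describes the calling process itself and yields "running", while a pid that no longer exists yields "missing" regardless of how recent the log is.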
How I recreated the problem
|
collaborative debug session done. PR generated. #1053 |
for posterity's sake: sanity worked if instances > 1. Sanity was checking for stopped configurations, instead of configurations with missing_instances. |
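One plausible reading of that summary, sketched in Python. This is not the code from the PR: the `Config` class, its fields, and both selection functions are hypothetical, and the assumption is that a configuration whose only instance died has no live process left and is therefore reported as "stopped", while a configuration with several instances and one dead is reported as partially running.

```python
from dataclasses import dataclass, field


@dataclass
class Config:
    """Hypothetical stand-in for the state of one configuration."""
    name: str
    expected: set = field(default_factory=set)  # instance numbers that should run
    alive: set = field(default_factory=set)     # instance numbers with a live process

    @property
    def missing_instances(self):
        return sorted(self.expected - self.alive)

    @property
    def status(self):
        if not self.alive:
            return "stopped"  # a single dead instance lands here too
        if self.missing_instances:
            return "partial"
        return "running"


def to_restart_by_status(configs):
    """Assumed flawed selection: configurations seen as stopped are skipped."""
    return [c.name for c in configs if c.status == "partial"]


def to_restart_by_missing(configs):
    """Selection keyed on missing instances, whatever the overall status."""
    return [c.name for c in configs if c.missing_instances]


configs = [
    Config("poll/single", expected={1}, alive=set()),     # one instance, dead
    Config("sender/multi", expected={1, 2}, alive={1}),   # two instances, one dead
    Config("subscribe/ok", expected={1}, alive={1}),      # healthy
]
```

Under these assumptions the status-based selection only returns `sender/multi`, which matches the observation that sanity worked when instances > 1, while the selection on missing instances also returns `poll/single`.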
What happened
A poll failed, and its instance was set to "missing" at 22:47Z
In the sanity logs, we see that it never got restarted after it failed
Questions
poll/airnow.py
flowcb? @junhu3 wonders