How to monitor the AM server properly #1018

andreleblanc11 · 2024-04-16T18:50:00Z

Last night the AM server crashed, and sr3 sanity was partly responsible.

One of the socket connections coming from the regional servers crashed, and the AM server forked the connection properly. The orphaned socket also got closed properly. Because of this, sr3 sanity picked it up and tried to restart the hung instance. On top of that:

sr3 sanity closed and restarted all of the active processes, which didn't need a restart.
sr3 sanity didn't close some of the orphaned processes, which did need a restart.

[2024-04-16 05:07:07] found hung flow/amserver/5 pid: 1699289
[2024-04-16 05:07:07] killing hung processes... (no point in SIGTERM if it is hung)
[2024-04-16 05:07:07] missing: [['flow', 'amserver', 5], ['flow', 'amserver', 61], ['flow', 'amserver', 99], ['flow', 'amserver', 5]]
[2024-04-16 05:07:07] starting them up...
[2024-04-16 05:07:07] killing strays...
[2024-04-16 05:07:07] pid: 954922-['/usr/bin/python3', '/home/sarra/.local/lib/python3.10/site-packages/sarracenia/instance.py', '--no', '1', 'start'] does not match any configured instance, sending it TERM
[2024-04-16 05:07:07] pid: 954930-['/usr/bin/python3', '/home/sarra/.local/lib/python3.10/site-packages/sarracenia/instance.py', '--no', '2', 'start'] does not match any configured instance, sending it TERM
[2024-04-16 05:07:07] pid: 954933-['/usr/bin/python3', '/home/sarra/.local/lib/python3.10/site-packages/sarracenia/instance.py', '--no', '3', 'start'] does not match any configured instance, sending it TERM
[2024-04-16 05:07:07] pid: 954943-['/usr/bin/python3', '/home/sarra/.local/lib/python3.10/site-packages/sarracenia/instance.py', '--no', '4', 'start'] does not match any configured instance, sending it TERM
[2024-04-16 05:07:07] pid: 954949-['/usr/bin/python3', '/home/sarra/.local/lib/python3.10/site-packages/sarracenia/instance.py', '--no', '6', 'start'] does not match any configured instance, sending it TERM
[2024-04-16 05:07:07] pid: 954952-['/usr/bin/python3', '/home/sarra/.local/lib/python3.10/site-packages/sarracenia/instance.py', '--no', '7', 'start'] does not match any configured instance, sending it TERM
[2024-04-16 05:07:07] pid: 954955-['/usr/bin/python3', '/home/sarra/.local/lib/python3.10/site-packages/sarracenia/instance.py', '--no', '8', 'start'] does not match any configured instance, sending it TERM
[2024-04-16 05:07:07] pid: 957499-['/usr/bin/python3', '/home/sarra/.local/lib/python3.10/site-packages/sarracenia/instance.py', '--no', '9', 'start'] does not match any configured instance, sending it TERM
[2024-04-16 05:07:07] pid: 960060-['/usr/bin/python3', '/home/sarra/.local/lib/python3.10/site-packages/sarracenia/instance.py', '--no', '10', 'start'] does not match any configured instance, sending it TERM
[2024-04-16 05:07:07] pid: 965821-['/usr/bin/python3', '/home/sarra/.local/lib/python3.10/site-packages/sarracenia/instance.py', '--no', '11', 'start'] does not match any configured instance, sending it TERM
[2024-04-16 05:07:07] pid: 966501-['/usr/bin/python3', '/home/sarra/.local/lib/python3.10/site-packages/sarracenia/instance.py', '--no', '12', 'start'] does not match any configured instance, sending it TERM
[2024-04-16 05:07:07] pid: 969833-['/usr/bin/python3', '/home/sarra/.local/lib/python3.10/site-packages/sarracenia/instance.py', '--no', '13', 'start'] does not match any configured instance, sending it TERM
[2024-04-16 05:07:07] pid: 970860-['/usr/bin/python3', '/home/sarra/.local/lib/python3.10/site-packages/sarracenia/instance.py', '--no', '14', 'start'] does not match any configured instance, sending it TERM
[2024-04-16 05:07:07] pid: 976915-['/usr/bin/python3', '/home/sarra/.local/lib/python3.10/site-packages/sarracenia/instance.py', '--no', '15', 'start'] does not match any configured instance, sending it TERM
[2024-04-16 05:07:07] pid: 979379-['/usr/bin/python3', '/home/sarra/.local/lib/python3.10/site-packages/sarracenia/instance.py', '--no', '16', 'start'] does not match any configured instance, sending it TERM
[2024-04-16 05:07:07] pid: 982782-['/usr/bin/python3', '/home/sarra/.local/lib/python3.10/site-packages/sarracenia/instance.py', '--no', '17', 'start'] does not match any configured instance, sending it TERM
[2024-04-16 05:07:07] pid: 985242-['/usr/bin/python3', '/home/sarra/.local/lib/python3.10/site-packages/sarracenia/instance.py', '--no', '18', 'start'] does not match any configured instance, sending it TERM
[2024-04-16 05:07:07] pid: 987831-['/usr/bin/python3', '/home/sarra/.local/lib/python3.10/site-packages/sarracenia/instance.py', '--no', '19', 'start'] does not match any configured instance, sending it TERM
[2024-04-16 05:07:07] pid: 992756-['/usr/bin/python3', '/home/sarra/.local/lib/python3.10/site-packages/sarracenia/instance.py', '--no', '20', 'start'] does not match any configured instance, sending it TERM
[2024-04-16 05:07:07] pid: 993592-['/usr/bin/python3', '/home/sarra/.local/lib/python3.10/site-packages/sarracenia/instance.py', '--no', '21', 'start'] does not match any configured instance, sending it TERM
[2024-04-16 05:07:07] pid: 995241-['/usr/bin/python3', '/home/sarra/.local/lib/python3.10/site-packages/sarracenia/instance.py', '--no', '22', 'start'] does not match any configured instance, sending it TERM
[2024-04-16 05:07:07] pid: 998754-['/usr/bin/python3', '/home/sarra/.local/lib/python3.10/site-packages/sarracenia/instance.py', '--no', '23', 'start'] does not match any configured instance, sending it TERM
[2024-04-16 05:07:07] pid: 1003814-['/usr/bin/python3', '/home/sarra/.local/lib/python3.10/site-packages/sarracenia/instance.py', '--no', '24', 'start'] does not match any configured instance, sending it TERM
[2024-04-16 05:07:07] pid: 1005200-['/usr/bin/python3', '/home/sarra/.local/lib/python3.10/site-packages/sarracenia/instance.py', '--no', '25', 'start'] does not match any configured instance, sending it TERM
[2024-04-16 05:07:07] pid: 1007720-['/usr/bin/python3', '/home/sarra/.local/lib/python3.10/site-packages/sarracenia/instance.py', '--no', '26', 'start'] does not match any configured instance, sending it TERM
[2024-04-16 05:07:07] pid: 1015146-['/usr/bin/python3', '/home/sarra/.local/lib/python3.10/site-packages/sarracenia/instance.py', '--no', '28', 'start'] does not match any configured instance, sending it TERM
[2024-04-16 05:07:07] pid: 1015206-['/usr/bin/python3', '/home/sarra/.local/lib/python3.10/site-packages/sarracenia/instance.py', '--no', '29', 'start'] does not match any configured instance, sending it TERM
[2024-04-16 05:07:07] pid: 1025513-['/usr/bin/python3', '/home/sarra/.local/lib/python3.10/site-packages/sarracenia/instance.py', '--no', '30', 'start'] does not match any configured instance, sending it TERM
[2024-04-16 05:07:07] pid: 1036282-['/usr/bin/python3', '/home/sarra/.local/lib/python3.10/site-packages/sarracenia/instance.py', '--no', '31', 'start'] does not match any configured instance, sending it TERM
[2024-04-16 05:07:07] pid: 1040709-['/usr/bin/python3', '/home/sarra/.local/lib/python3.10/site-packages/sarracenia/instance.py', '--no', '32', 'start'] does not match any configured instance, sending it TERM
[2024-04-16 05:07:07] pid: 1044807-['/usr/bin/python3', '/home/sarra/.local/lib/python3.10/site-packages/sarracenia/instance.py', '--no', '33', 'start'] does not match any configured instance, sending it TERM
[2024-04-16 05:07:07] pid: 1837764-['/usr/bin/python3', '/home/sarra/.local/lib/python3.10/site-packages/sarracenia/instance.py', '--no', '127', 'start'] does not match any configured instance, sending it TERM

This raises the question: how can we make sanity monitor the AM server properly?

One problem is that some processes eventually get marked as "strays", and from the looks of it the PID recorded in the pid file gets changed. This is why sanity restarted the active processes: their pid files didn't match. This still needs more investigation and might require a separate issue.
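
To make the failure mode concrete, here is a minimal sketch of how a pid-file comparison can misclassify healthy processes. This is not sr3's actual code: the `classify` function and both of its input dictionaries are hypothetical, and the pid 111111 is made up for the example. It only illustrates why a pid file that no longer matches the running process produces both a "missing" instance and a "stray" at the same time.

```python
def classify(pidfiles, running):
    """Compare recorded pids against live processes (hypothetical helper).

    pidfiles: {instance_no: pid recorded in that instance's pid file}
    running:  {pid: instance_no parsed from the process command line}
    Returns (missing_instances, stray_pids), both sorted.
    """
    # An instance is "missing" when its recorded pid is not a live
    # process that belongs to that same instance number.
    missing = sorted(i for i, pid in pidfiles.items() if running.get(pid) != i)

    # A live process is a "stray" when no pid file points at it.
    recorded = {(pid, i) for i, pid in pidfiles.items()}
    strays = sorted(pid for pid, i in running.items() if (pid, i) not in recorded)
    return missing, strays


# Instances 1 and 2 are alive and healthy (pids taken from the log above),
# but the pid file for instance 1 now holds a different, made-up pid.
running = {954922: 1, 954930: 2}
pidfiles = {1: 111111, 2: 954930}

missing, strays = classify(pidfiles, running)
print("missing:", missing)  # instance 1 looks missing
print("strays:", strays)    # its healthy process looks like a stray
```

Under these assumptions a single stale pid file is enough for a healthy instance to be terminated as a stray and then started again as missing, which matches the pattern in the log.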

Proposed solutions

@petersilva suggests having a systemd unit file control the restarts of the AM server. We would still need to fix the stray process problem before going ahead with this solution.
We could also add an option, something like sanity off, that makes sanity skip the config file during its checks.

andreleblanc11 · 2024-04-16T18:50:46Z

The temporary fix for now is to turn off sanity and have a cron job restart the AM server's master process whenever it goes down.

petersilva · 2024-05-02T14:51:06Z

perhaps we should try setting keepalive...

https://stackoverflow.com/questions/12248132/how-to-change-tcp-keepalive-timer-using-python-script

petersilva · 2024-05-02T14:51:48Z

set it to drop after 90 minutes.
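
A rough sketch of what that could look like with the standard `socket` module, assuming Linux. The `enable_keepalive` helper and its default values are hypothetical and not part of sarracenia: they were picked so that idle time plus all failed probes adds up to 90 minutes (5100 s + 5 × 60 s = 5400 s). The `TCP_KEEP*` constants are platform-dependent, so each one is guarded.

```python
import socket


def enable_keepalive(sock, idle_s=5100, interval_s=60, probes=5):
    """Enable TCP keepalive on sock (hypothetical helper, Linux-oriented).

    Returns the worst-case number of seconds before a dead peer is
    dropped: idle time plus every unanswered probe interval.
    """
    # Portable switch that turns keepalive probing on.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    # Fine-grained timers; these constants are not defined on every
    # platform, so apply each one only when it exists.
    if hasattr(socket, "TCP_KEEPIDLE"):
        # Seconds of silence before the first probe is sent.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_s)
    if hasattr(socket, "TCP_KEEPINTVL"):
        # Seconds between successive probes.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval_s)
    if hasattr(socket, "TCP_KEEPCNT"):
        # Unanswered probes allowed before the connection is dropped.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)

    return idle_s + interval_s * probes


s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
total = enable_keepalive(s)
print(total / 60, "minutes")
s.close()
```

In the AM server this would presumably be applied to each accepted connection, so that a silently dead peer gets closed by the kernel instead of leaving an instance hanging. Where exactly that call belongs is an assumption that would need checking against the code.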

andreleblanc11 added the labels "Priority 4 - Strategic" (would benefit multiple use cases if resolved) and "ReliabilityRecovery" (improve behaviour in failure situations) on Apr 16, 2024
