How to monitor the AM server properly #1018

Open
andreleblanc11 opened this issue Apr 16, 2024 · 3 comments
Labels
Priority 4 - Strategic (would benefit multiple use cases if resolved); ReliabilityRecovery (improve behaviour in failure situations)

Comments

@andreleblanc11
Member

Last night, the AM server crashed, and sr3 sanity was partly responsible.

One of the socket connections coming from the regional servers crashed. The AM server forked the connection properly, and the orphaned socket was also closed properly. sr3 sanity picked this up and tried to restart the hung instance. In addition:

  • sr3 sanity stopped and restarted all of the active processes, none of which needed a restart.
  • sr3 sanity did not stop some of the orphaned processes that did need a restart.

The sanity log from the incident:
[2024-04-16 05:07:07] found hung flow/amserver/5 pid: 1699289
[2024-04-16 05:07:07] killing hung processes... (no point in SIGTERM if it is hung)
[2024-04-16 05:07:07] missing: [['flow', 'amserver', 5], ['flow', 'amserver', 61], ['flow', 'amserver', 99], ['flow', 'amserver', 5]]
[2024-04-16 05:07:07] starting them up...
[2024-04-16 05:07:07] killing strays...
[2024-04-16 05:07:07] pid: 954922-['/usr/bin/python3', '/home/sarra/.local/lib/python3.10/site-packages/sarracenia/instance.py', '--no', '1', 'start'] does not match any configured instance, sending it TERM
[2024-04-16 05:07:07] pid: 954930-['/usr/bin/python3', '/home/sarra/.local/lib/python3.10/site-packages/sarracenia/instance.py', '--no', '2', 'start'] does not match any configured instance, sending it TERM
[2024-04-16 05:07:07] pid: 954933-['/usr/bin/python3', '/home/sarra/.local/lib/python3.10/site-packages/sarracenia/instance.py', '--no', '3', 'start'] does not match any configured instance, sending it TERM
[2024-04-16 05:07:07] pid: 954943-['/usr/bin/python3', '/home/sarra/.local/lib/python3.10/site-packages/sarracenia/instance.py', '--no', '4', 'start'] does not match any configured instance, sending it TERM
[2024-04-16 05:07:07] pid: 954949-['/usr/bin/python3', '/home/sarra/.local/lib/python3.10/site-packages/sarracenia/instance.py', '--no', '6', 'start'] does not match any configured instance, sending it TERM
[2024-04-16 05:07:07] pid: 954952-['/usr/bin/python3', '/home/sarra/.local/lib/python3.10/site-packages/sarracenia/instance.py', '--no', '7', 'start'] does not match any configured instance, sending it TERM
[2024-04-16 05:07:07] pid: 954955-['/usr/bin/python3', '/home/sarra/.local/lib/python3.10/site-packages/sarracenia/instance.py', '--no', '8', 'start'] does not match any configured instance, sending it TERM
[2024-04-16 05:07:07] pid: 957499-['/usr/bin/python3', '/home/sarra/.local/lib/python3.10/site-packages/sarracenia/instance.py', '--no', '9', 'start'] does not match any configured instance, sending it TERM
[2024-04-16 05:07:07] pid: 960060-['/usr/bin/python3', '/home/sarra/.local/lib/python3.10/site-packages/sarracenia/instance.py', '--no', '10', 'start'] does not match any configured instance, sending it TERM
[2024-04-16 05:07:07] pid: 965821-['/usr/bin/python3', '/home/sarra/.local/lib/python3.10/site-packages/sarracenia/instance.py', '--no', '11', 'start'] does not match any configured instance, sending it TERM
[2024-04-16 05:07:07] pid: 966501-['/usr/bin/python3', '/home/sarra/.local/lib/python3.10/site-packages/sarracenia/instance.py', '--no', '12', 'start'] does not match any configured instance, sending it TERM
[2024-04-16 05:07:07] pid: 969833-['/usr/bin/python3', '/home/sarra/.local/lib/python3.10/site-packages/sarracenia/instance.py', '--no', '13', 'start'] does not match any configured instance, sending it TERM
[2024-04-16 05:07:07] pid: 970860-['/usr/bin/python3', '/home/sarra/.local/lib/python3.10/site-packages/sarracenia/instance.py', '--no', '14', 'start'] does not match any configured instance, sending it TERM
[2024-04-16 05:07:07] pid: 976915-['/usr/bin/python3', '/home/sarra/.local/lib/python3.10/site-packages/sarracenia/instance.py', '--no', '15', 'start'] does not match any configured instance, sending it TERM
[2024-04-16 05:07:07] pid: 979379-['/usr/bin/python3', '/home/sarra/.local/lib/python3.10/site-packages/sarracenia/instance.py', '--no', '16', 'start'] does not match any configured instance, sending it TERM
[2024-04-16 05:07:07] pid: 982782-['/usr/bin/python3', '/home/sarra/.local/lib/python3.10/site-packages/sarracenia/instance.py', '--no', '17', 'start'] does not match any configured instance, sending it TERM
[2024-04-16 05:07:07] pid: 985242-['/usr/bin/python3', '/home/sarra/.local/lib/python3.10/site-packages/sarracenia/instance.py', '--no', '18', 'start'] does not match any configured instance, sending it TERM
[2024-04-16 05:07:07] pid: 987831-['/usr/bin/python3', '/home/sarra/.local/lib/python3.10/site-packages/sarracenia/instance.py', '--no', '19', 'start'] does not match any configured instance, sending it TERM
[2024-04-16 05:07:07] pid: 992756-['/usr/bin/python3', '/home/sarra/.local/lib/python3.10/site-packages/sarracenia/instance.py', '--no', '20', 'start'] does not match any configured instance, sending it TERM
[2024-04-16 05:07:07] pid: 993592-['/usr/bin/python3', '/home/sarra/.local/lib/python3.10/site-packages/sarracenia/instance.py', '--no', '21', 'start'] does not match any configured instance, sending it TERM
[2024-04-16 05:07:07] pid: 995241-['/usr/bin/python3', '/home/sarra/.local/lib/python3.10/site-packages/sarracenia/instance.py', '--no', '22', 'start'] does not match any configured instance, sending it TERM
[2024-04-16 05:07:07] pid: 998754-['/usr/bin/python3', '/home/sarra/.local/lib/python3.10/site-packages/sarracenia/instance.py', '--no', '23', 'start'] does not match any configured instance, sending it TERM
[2024-04-16 05:07:07] pid: 1003814-['/usr/bin/python3', '/home/sarra/.local/lib/python3.10/site-packages/sarracenia/instance.py', '--no', '24', 'start'] does not match any configured instance, sending it TERM
[2024-04-16 05:07:07] pid: 1005200-['/usr/bin/python3', '/home/sarra/.local/lib/python3.10/site-packages/sarracenia/instance.py', '--no', '25', 'start'] does not match any configured instance, sending it TERM
[2024-04-16 05:07:07] pid: 1007720-['/usr/bin/python3', '/home/sarra/.local/lib/python3.10/site-packages/sarracenia/instance.py', '--no', '26', 'start'] does not match any configured instance, sending it TERM
[2024-04-16 05:07:07] pid: 1015146-['/usr/bin/python3', '/home/sarra/.local/lib/python3.10/site-packages/sarracenia/instance.py', '--no', '28', 'start'] does not match any configured instance, sending it TERM
[2024-04-16 05:07:07] pid: 1015206-['/usr/bin/python3', '/home/sarra/.local/lib/python3.10/site-packages/sarracenia/instance.py', '--no', '29', 'start'] does not match any configured instance, sending it TERM
[2024-04-16 05:07:07] pid: 1025513-['/usr/bin/python3', '/home/sarra/.local/lib/python3.10/site-packages/sarracenia/instance.py', '--no', '30', 'start'] does not match any configured instance, sending it TERM
[2024-04-16 05:07:07] pid: 1036282-['/usr/bin/python3', '/home/sarra/.local/lib/python3.10/site-packages/sarracenia/instance.py', '--no', '31', 'start'] does not match any configured instance, sending it TERM
[2024-04-16 05:07:07] pid: 1040709-['/usr/bin/python3', '/home/sarra/.local/lib/python3.10/site-packages/sarracenia/instance.py', '--no', '32', 'start'] does not match any configured instance, sending it TERM
[2024-04-16 05:07:07] pid: 1044807-['/usr/bin/python3', '/home/sarra/.local/lib/python3.10/site-packages/sarracenia/instance.py', '--no', '33', 'start'] does not match any configured instance, sending it TERM
[2024-04-16 05:07:07] pid: 1837764-['/usr/bin/python3', '/home/sarra/.local/lib/python3.10/site-packages/sarracenia/instance.py', '--no', '127', 'start'] does not match any configured instance, sending it TERM

This raises the question: how can we make sanity monitor the AM server properly?

One problem is that some processes eventually get marked as "strays", and the PID recorded in the pid file appears to change. This is why sanity restarted the active processes: their pid files didn't match. This needs more investigation and might require a separate issue.
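To make the suspected mismatch concrete, here is a minimal sketch of comparing the PID recorded in a pid file against the PID of a process that is actually running. It is not how sr3 sanity is implemented; the function names (`read_pid`, `is_alive`, `classify`) and the result labels are hypothetical and exist only for illustration.

```python
import os
from pathlib import Path


def read_pid(pid_file):
    """Return the PID stored in pid_file, or None if it is absent or unreadable."""
    try:
        return int(Path(pid_file).read_text().strip())
    except (OSError, ValueError):
        return None


def is_alive(pid):
    """Check whether a process with this PID exists (signal 0 sends nothing)."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        # The process exists but belongs to another user.
        return True
    return True


def classify(pid_file, observed_pid):
    """Compare the recorded PID with the PID observed in the process table.

    Hypothetical labels:
      "missing"  - no usable pid file
      "ok"       - recorded PID equals the observed, running PID
      "mismatch" - the observed process runs, but the pid file names another PID
      "dead"     - the observed process is gone
    """
    recorded = read_pid(pid_file)
    if recorded is None:
        return "missing"
    if recorded == observed_pid and is_alive(recorded):
        return "ok"
    if is_alive(observed_pid):
        return "mismatch"
    return "dead"
```

A "mismatch" result corresponds to the situation described above: the process is healthy, but because its pid file points elsewhere, a monitor that trusts the pid file would treat it as a stray.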

Proposed solutions

  • @petersilva suggests having a systemd unit file control restarts of the AM server. We would still need to fix the stray process problem before going ahead with this solution.
  • We could also add an option, such as sanity off, that makes sanity skip a config file during its checks.
@andreleblanc11
Member Author

The temporary fix for now is to turn off sanity and have a cron job restart the AM server's master process whenever it goes down.

@petersilva
Contributor

Perhaps we should try setting keepalive...

https://stackoverflow.com/questions/12248132/how-to-change-tcp-keepalive-timer-using-python-script

@petersilva
Contributor

Set it to drop the connection after 90 minutes.
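A minimal sketch of the keepalive idea, using the standard `socket` module. Where the AM server creates its sockets is not shown in this thread, so `enable_keepalive` is a hypothetical helper, and the default values are an assumption: 80 minutes of idle time plus 10 probes at 60 second intervals is one way to reach roughly 90 minutes before a dead peer is dropped. The `TCP_KEEPIDLE`, `TCP_KEEPINTVL` and `TCP_KEEPCNT` options are available on Linux; the `hasattr` guards skip them on platforms that do not expose them.

```python
import socket


def enable_keepalive(sock, idle_s=4800, interval_s=60, probes=10):
    """Enable TCP keepalive on sock.

    Returns the approximate number of seconds of silence after which the
    kernel gives up on a dead peer (idle time plus all failed probes).
    The defaults are illustrative: 4800 + 60 * 10 = 5400 s = 90 minutes.
    """
    # Ask the kernel to send keepalive probes on this connection.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    # Idle time before the first probe is sent.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_s)
    # Delay between consecutive probes.
    if hasattr(socket, "TCP_KEEPINTVL"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval_s)
    # Number of unanswered probes before the connection is dropped.
    if hasattr(socket, "TCP_KEEPCNT"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)

    return idle_s + interval_s * probes


if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    print("drop after about", enable_keepalive(s), "seconds of silence")
    s.close()
```

Once the probes go unanswered, a blocked read on the socket fails with an error instead of hanging indefinitely, which gives the owning process a chance to exit or reconnect on its own rather than being detected as hung from outside.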
