sr3 sanity not picking up missing instance #927

Closed
andreleblanc11 opened this issue Feb 7, 2024 · 15 comments · Fixed by #1059
Labels
Discussion_Needed developers should discuss this issue. Priority 3 - Important worrisome impact... should study carefully v3 issue deferred to or affects version 3

Comments

@andreleblanc11 (Member)

What happened

A poll failed, and its instance was set to "missing" at 22:47Z

    Component/Config                         Processes   Connection        Lag                              Rates
                                            State   Run Retry  msg data   Queued   LagMax  LagAvg  %rej     pubsub   messages     RxData     TxData
                                              -----   --- -----  --- ----   ------   ------  ------  ----   --------       ----     ------     ------
    poll/airnow_http_poll                    miss    0/1     0   0%   0%      0    0.00s    0.00s  0.0%  0 Bytes/s   0 msgs/s  0 Bytes/s  0 Bytes/s
          Total Running Configs:   0 ( Processes: 24 missing: 1 stray: 0 )
                          Memory: uss:1.7 GiB rss:1.9 GiB vms:5.1 GiB
                      CPU Time: User:221354.24s System:4841.57s
             Pub/Sub Received: 0 msgs/s (0 Bytes/s), Sent:  0 msgs/s (0 Bytes/s) Queued: 0 Retry: 0, Mean lag: 0.00s
                   Data Received: 0 Files/s (0 Bytes/s), Sent: 0 Files/s (0 B

In the sanity logs, we see that it never got restarted after it failed:

[2024-02-05 22:49:03] sanity: killing hung processes... (no point in SIGTERM if it is hung)
[2024-02-05 22:49:03] no missing processes found
[2024-02-05 22:49:03] notice: python module mqtt is missing: cannot connect to mqtt brokers
[2024-02-05 22:49:03] notice: python module redis is missing: cannot use redis implementations of retry and nodupe
[2024-02-05 22:49:03]
[2024-02-05 22:56:04] sanity: killing hung processes... (no point in SIGTERM if it is hung)
[2024-02-05 22:56:04] no missing processes found
[2024-02-05 22:56:04] notice: python module mqtt is missing: cannot connect to mqtt brokers
[2024-02-05 22:56:04] notice: python module redis is missing: cannot use redis implementations of retry and nodupe
[2024-02-05 22:56:04]

Questions

  • Was the system in a weird state, where sanity couldn't find the missing instance?
  • Is this isolated to poll/airnow.py flowcb? @junhu3 wonders
@andreleblanc11 andreleblanc11 added Priority 3 - Important worrisome impact... should study carefully v3 issue deferred to or affects version 3 Discussion_Needed developers should discuss this issue. labels Feb 7, 2024
@petersilva (Contributor)

Tried a trivial case: ran a dynamic_flow test, then did a kill -9 of the poll process, then ran sr3 sanity. It started the instance back up as expected, no problem.
hmm...

@andreleblanc11 (Member Author)

The same thing happened again last night, with the same configuration. Configuration stopped running @19:02Z yesterday.

Again, sanity never noticed anything:

[2024-02-07 19:00:04] sanity: killing hung processes... (no point in SIGTERM if it is hung)
[2024-02-07 19:00:04] no missing processes found
[2024-02-07 19:00:04] notice: python module mqtt is missing: cannot connect to mqtt brokers
[2024-02-07 19:00:04] notice: python module redis is missing: cannot use redis implementations of retry and nodupe
[2024-02-07 19:00:04]
[2024-02-07 19:07:03] sanity: killing hung processes... (no point in SIGTERM if it is hung)
[2024-02-07 19:07:03] no missing processes found
[2024-02-07 19:07:03] notice: python module mqtt is missing: cannot connect to mqtt brokers
[2024-02-07 19:07:03] notice: python module redis is missing: cannot use redis implementations of retry and nodupe
[2024-02-07 19:07:03]
[2024-02-07 19:14:04] sanity: killing hung processes... (no point in SIGTERM if it is hung)
[2024-02-07 19:14:04] no missing processes found
[2024-02-07 19:14:04] notice: python module mqtt is missing: cannot connect to mqtt brokers
[2024-02-07 19:14:04] notice: python module redis is missing: cannot use redis implementations of retry and nodupe
[2024-02-07 19:14:04]

@petersilva (Contributor)

next time this shows up, please confirm:

  • the process is indeed gone... Could it be in some weird hung state, but still present?

it would be good to capture a case, doing a ps with some options... maybe it's only half-dead.

@andreleblanc11 (Member Author)

The auto-restart script had a bug, and the data stopped flowing again this morning at 7:07Z. However, this gave me the chance to get some info on the process.

When I looked for a process, I wasn't able to find anything:

ps aux | grep http_poll
sarra    18824  0.0  0.0  15580  1104 pts/1    S+   13:29   0:00 grep --color=auto http_poll

However, it looks like sanity KILLS the poll instance this time, and afterwards it fails. It's hard to tell, as the sanity logs don't include milliseconds 😢

# Sanity kills the poll

[2024-02-09 07:07:10] found hung poll/airnow_http_poll/1 pid: 59394
[2024-02-09 07:07:10] killing hung processes... (no point in SIGTERM if it is hung)
[2024-02-09 07:07:10] missing: [['poll', 'airnow_http_poll', 1]]
[2024-02-09 07:07:10] starting them up...
[2024-02-09 07:07:10] killing strays...

# The poll is getting reinitialized, after sanity killed it (Notice the timestamps)

2024-02-09 06:56:13,080 [INFO] sarracenia.flowcb.log after_post posted to exchange: xs_AIRNOW_http_poll topic: v02.post a file with baseUrl: https://s
3-us-west-1.amazonaws.com//files.airnowtech.org/airnow/today relPath: HourlyData_2024020904.dat
2024-02-09 07:07:10,927 [INFO] sarracenia.flow.poll __init__ Good! post_exchange: xs_AIRNOW_http_poll and exchange: xs_AIRNOW_http_poll match so multi
ple instances to share a poll.
.
.
.

# This is where the poll fails, one second after the re-init

2024-02-09 07:07:11,427 [INFO] sarracenia.flowcb.poll.airnow poll INFO https://s3-us-west-1.amazonaws.com//files.airnowtech.org/airnow/today/HourlyData_2024020905.dat
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/urllib3/connection.py", line 144, in _new_conn
    (self.host, self.port), self.timeout, **extra_kw)
  File "/usr/lib/python3/dist-packages/urllib3/util/connection.py", line 60, in create_connection
    for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
  File "/usr/lib/python3.6/socket.py", line 745, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -5] No address associated with hostname

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 601, in urlopen
    chunked=chunked)
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 346, in _make_request
    self._validate_conn(conn)
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 852, in _validate_conn
    conn.connect()
  File "/usr/lib/python3/dist-packages/urllib3/connection.py", line 298, in connect
    conn = self._new_conn()
  File "/usr/lib/python3/dist-packages/urllib3/connection.py", line 153, in _new_conn
    self, "Failed to establish a new connection: %s" % e)
urllib3.exceptions.NewConnectionError: <urllib3.connection.VerifiedHTTPSConnection object at 0x7f1fe89f49b0>: Failed to establish a new connection: [Errno -5] No address associated with hostname

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/requests/adapters.py", line 440, in send
    timeout=timeout
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 639, in urlopen
    _stacktrace=sys.exc_info()[2])
  File "/usr/lib/python3/dist-packages/urllib3/util/retry.py", line 398, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='s3-us-west-1.amazonaws.com', port=443): Max retries exceeded with url: //files.airnowtech.org/airnow/today/HourlyData_2024020905.dat (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f1fe89f49b0>: Failed to establish a new connection: [Errno -5] No address associated with hostname',))

@andreleblanc11 (Member Author)

I may have spoken too soon.

Looking at yesterday's sanity logs (and today's), it looks like sanity is now finding the configuration in a 'hung' state, despite this not being the case two days ago.

grep airnow sanity.log.2024-02-07
# Nothing

grep airnow sanity.log.2024-02-08
[2024-02-08 13:49:10] found hung poll/airnow_http_poll/1 pid: 55719
[2024-02-08 13:49:10] missing: [['poll', 'airnow_http_poll', 1]]
[2024-02-08 14:00:11] found hung poll/airnow_http_poll/1 pid: 51119
[2024-02-08 14:00:11] missing: [['poll', 'airnow_http_poll', 1]]
[2024-02-08 14:14:10] found hung poll/airnow_http_poll/1 pid: 10589
[2024-02-08 14:14:10] missing: [['poll', 'airnow_http_poll', 1]]
[2024-02-08 14:28:10] found hung poll/airnow_http_poll/1 pid: 1401
[2024-02-08 14:28:10] missing: [['poll', 'airnow_http_poll', 1]]
[2024-02-08 14:42:11] found hung poll/airnow_http_poll/1 pid: 57000
[2024-02-08 14:42:11] missing: [['poll', 'airnow_http_poll', 1]]
[2024-02-08 14:56:11] found hung poll/airnow_http_poll/1 pid: 48022
[2024-02-08 14:56:11] missing: [['poll', 'airnow_http_poll', 1]]
[2024-02-08 15:07:10] found hung poll/airnow_http_poll/1 pid: 36774
[2024-02-08 15:07:10] missing: [['poll', 'airnow_http_poll', 1]]
[2024-02-08 15:21:11] found hung poll/airnow_http_poll/1 pid: 22721
[2024-02-08 15:21:11] missing: [['poll', 'airnow_http_poll', 1]]
[2024-02-08 15:35:11] found hung poll/airnow_http_poll/1 pid: 13511
[2024-02-08 15:35:11] missing: [['poll', 'airnow_http_poll', 1]]
[2024-02-08 15:49:11] found hung poll/airnow_http_poll/1 pid: 6787
[2024-02-08 15:49:11] missing: [['poll', 'airnow_http_poll', 1]]
[2024-02-08 16:00:11] found hung poll/airnow_http_poll/1 pid: 22166
[2024-02-08 16:00:11] missing: [['poll', 'airnow_http_poll', 1]]
[2024-02-08 16:14:10] found hung poll/airnow_http_poll/1 pid: 64864
[2024-02-08 16:14:10] missing: [['poll', 'airnow_http_poll', 1]]
[2024-02-08 16:28:10] found hung poll/airnow_http_poll/1 pid: 8467
[2024-02-08 16:28:10] missing: [['poll', 'airnow_http_poll', 1]]
[2024-02-08 16:42:11] found hung poll/airnow_http_poll/1 pid: 20318
[2024-02-08 16:42:11] missing: [['poll', 'airnow_http_poll', 1]]
[2024-02-08 16:56:11] found hung poll/airnow_http_poll/1 pid: 33371
[2024-02-08 16:56:11] missing: [['poll', 'airnow_http_poll', 1]]
[2024-02-08 17:07:11] found hung poll/airnow_http_poll/1 pid: 31684
[2024-02-08 17:07:11] missing: [['poll', 'airnow_http_poll', 1]]
[2024-02-08 17:21:11] found hung poll/airnow_http_poll/1 pid: 16311
[2024-02-08 17:21:11] missing: [['poll', 'airnow_http_poll', 1]]
[2024-02-08 17:35:10] found hung poll/airnow_http_poll/1 pid: 6156
[2024-02-08 17:35:10] missing: [['poll', 'airnow_http_poll', 1]]
[2024-02-08 17:49:11] found hung poll/airnow_http_poll/1 pid: 1915
[2024-02-08 17:49:11] missing: [['poll', 'airnow_http_poll', 1]]
[2024-02-08 18:00:11] found hung poll/airnow_http_poll/1 pid: 60619
[2024-02-08 18:00:11] missing: [['poll', 'airnow_http_poll', 1]]
[2024-02-08 18:14:11] found hung poll/airnow_http_poll/1 pid: 27566
[2024-02-08 18:14:11] missing: [['poll', 'airnow_http_poll', 1]]
[2024-02-08 18:28:10] found hung poll/airnow_http_poll/1 pid: 26903
[2024-02-08 18:28:10] missing: [['poll', 'airnow_http_poll', 1]]
[2024-02-08 18:42:10] found hung poll/airnow_http_poll/1 pid: 20685
[2024-02-08 18:42:10] missing: [['poll', 'airnow_http_poll', 1]]
[2024-02-08 18:56:11] found hung poll/airnow_http_poll/1 pid: 12486
[2024-02-08 18:56:11] missing: [['poll', 'airnow_http_poll', 1]]
[2024-02-08 19:07:10] found hung poll/airnow_http_poll/1 pid: 2905
[2024-02-08 19:07:10] missing: [['poll', 'airnow_http_poll', 1]]
[2024-02-08 19:21:11] found hung poll/airnow_http_poll/1 pid: 45382
[2024-02-08 19:21:11] missing: [['poll', 'airnow_http_poll', 1]]
[2024-02-08 19:35:11] found hung poll/airnow_http_poll/1 pid: 34131
[2024-02-08 19:35:11] missing: [['poll', 'airnow_http_poll', 1]]
[2024-02-08 19:49:10] found hung poll/airnow_http_poll/1 pid: 18670
[2024-02-08 19:49:10] missing: [['poll', 'airnow_http_poll', 1]]
[2024-02-08 20:00:10] found hung poll/airnow_http_poll/1 pid: 13385
[2024-02-08 20:00:10] missing: [['poll', 'airnow_http_poll', 1]]
[2024-02-08 20:14:10] found hung poll/airnow_http_poll/1 pid: 42218
[2024-02-08 20:14:10] missing: [['poll', 'airnow_http_poll', 1]]
[2024-02-08 20:28:10] found hung poll/airnow_http_poll/1 pid: 36743
[2024-02-08 20:28:10] missing: [['poll', 'airnow_http_poll', 1]]
[2024-02-08 20:42:10] found hung poll/airnow_http_poll/1 pid: 28323
[2024-02-08 20:42:10] missing: [['poll', 'airnow_http_poll', 1]]
[2024-02-08 20:56:10] found hung poll/airnow_http_poll/1 pid: 16646
[2024-02-08 20:56:10] missing: [['poll', 'airnow_http_poll', 1]]
[2024-02-08 21:07:10] found hung poll/airnow_http_poll/1 pid: 2920
[2024-02-08 21:07:10] missing: [['poll', 'airnow_http_poll', 1]]
[2024-02-08 21:21:10] found hung poll/airnow_http_poll/1 pid: 42506
[2024-02-08 21:21:10] missing: [['poll', 'airnow_http_poll', 1]]
[2024-02-08 21:35:10] found hung poll/airnow_http_poll/1 pid: 31573
[2024-02-08 21:35:10] missing: [['poll', 'airnow_http_poll', 1]]
[2024-02-08 21:49:10] found hung poll/airnow_http_poll/1 pid: 17139
[2024-02-08 21:49:10] missing: [['poll', 'airnow_http_poll', 1]]
[2024-02-08 22:00:11] found hung poll/airnow_http_poll/1 pid: 22926
[2024-02-08 22:00:11] missing: [['poll', 'airnow_http_poll', 1]]
[2024-02-08 22:14:10] found hung poll/airnow_http_poll/1 pid: 49459
[2024-02-08 22:14:10] missing: [['poll', 'airnow_http_poll', 1]]
[2024-02-08 22:28:11] found hung poll/airnow_http_poll/1 pid: 45482
[2024-02-08 22:28:11] missing: [['poll', 'airnow_http_poll', 1]]
[2024-02-08 22:42:10] found hung poll/airnow_http_poll/1 pid: 40305
[2024-02-08 22:42:10] missing: [['poll', 'airnow_http_poll', 1]]
[2024-02-08 22:56:10] found hung poll/airnow_http_poll/1 pid: 31404
[2024-02-08 22:56:10] missing: [['poll', 'airnow_http_poll', 1]]
[2024-02-08 23:07:11] found hung poll/airnow_http_poll/1 pid: 15394
[2024-02-08 23:07:11] missing: [['poll', 'airnow_http_poll', 1]]
[2024-02-08 23:21:10] found hung poll/airnow_http_poll/1 pid: 52833
[2024-02-08 23:21:10] missing: [['poll', 'airnow_http_poll', 1]]
[2024-02-08 23:35:10] found hung poll/airnow_http_poll/1 pid: 46685
[2024-02-08 23:35:10] missing: [['poll', 'airnow_http_poll', 1]]
[2024-02-08 23:49:10] found hung poll/airnow_http_poll/1 pid: 32185
[2024-02-08 23:49:10] missing: [['poll', 'airnow_http_poll', 1]]

# The sanity restarts do seem to work when the configuration doesn't fail

2024-02-08 13:41:30,168 [INFO] sarracenia.flowcb.log after_post posted to exchange: xs_AIRNOW_http_poll topic: v02.post a file with baseUrl: https://s3-us-west-1.amazonaws.com//files.airnowtech.org/ai
rnow/today relPath: HourlyData_2024020811.dat
2024-02-08 13:49:10,930 [INFO] sarracenia.flow.poll __init__ Good! post_exchange: xs_AIRNOW_http_poll and exchange: xs_AIRNOW_http_poll match so multiple instances to share a poll.

We did a configuration change yesterday (around 13:32Z), where we changed sleep from 500 to 900. I'm not sure if this could have caused the sanity trigger spiral.

@andreleblanc11 (Member Author)

I haven't been able to reproduce the problem lately. The poll has only failed once more since Sunday and got restarted by the auto-restart script.

I have a poll on dev to try and catch it again (without the auto-restart), but it won't bite: it hasn't failed yet.

@reidsunderland (Member)

This happened with a sender, on v3.00.52.

The log showed the instance crashed:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/sarracenia/instance.py", line 248, in <module>
    i.start()
  File "/usr/lib/python3/dist-packages/sarracenia/instance.py", line 239, in start
    self.running_instance.run()
  File "/usr/lib/python3/dist-packages/sarracenia/flow/__init__.py", line 676, in run
    time.sleep(increment)
  File "/usr/lib/python3/dist-packages/sarracenia/transfer/__init__.py", line 61, in alarm_raise
    raise TimeoutException("signal alarm timed out")
sarracenia.transfer.TimeoutException: signal alarm timed out

sr3 status reported that it was missing:

$ sr3 status | grep miss
sender/config_name                       miss    0/1     0   0%   0%      0    0.00s    0.00s n/a    0.0%  0 Bytes/s   0 msgs/s  0 Bytes/s  0 Bytes/s

Sanity was running every 7 minutes and did not detect or restart the missing instance.

@andreleblanc11 (Member Author)

andreleblanc11 commented May 13, 2024

This happened again with the same AIRNOW poll this Sunday night.

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/sarracenia/instance.py", line 248, in <module>
    i.start()
  File "/usr/lib/python3/dist-packages/sarracenia/instance.py", line 239, in start
    self.running_instance.run()
  File "/usr/lib/python3/dist-packages/sarracenia/flow/__init__.py", line 460, in run
    self.gather()
  File "/usr/lib/python3/dist-packages/sarracenia/flow/poll.py", line 115, in gather
    new_incoming = plugin()
  File "/usr/lib/python3/dist-packages/sarracenia/flowcb/poll/airnow.py", line 42, in poll
    resp = requests.get(URL)
  File "/usr/lib/python3/dist-packages/requests/api.py", line 72, in get
    return request('get', url, params=params, **kwargs)
  File "/usr/lib/python3/dist-packages/requests/api.py", line 58, in request
    return session.request(method=method, url=url, **kwargs)
  File "/usr/lib/python3/dist-packages/requests/sessions.py", line 522, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/lib/python3/dist-packages/requests/sessions.py", line 632, in send
    r = adapter.send(request, **kwargs)
  File "/usr/lib/python3/dist-packages/requests/adapters.py", line 508, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='s3-us-west-1.amazonaws.com', port=443): Max retries exceeded with url: //files.airnowtech.org/airnow/today/HourlyData_2024051120.dat (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f803594e4e0>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution',))

Sanity never noticed anything:

[2024-05-11 22:35:04]
[2024-05-11 22:42:04] sanity: killing hung processes... (no point in SIGTERM if it is hung)
[2024-05-11 22:42:04] no missing processes found
[2024-05-11 22:42:04] notice: python module mqtt is missing: cannot connect to mqtt brokers
[2024-05-11 22:42:04] notice: python module reassembly is missing: need to lock block segmented files to put them back together
[2024-05-11 22:42:04] notice: python module redis is missing: cannot use redis implementations of retry and nodupe
[2024-05-11 22:42:04]

@petersilva (Contributor)

"Missing" in this context means that the poll process died. In this case, it looks like it raised an exception; it's not clear that it died. It might not be doing anything useful, but can you clarify whether the process actually died?

If not, I guess that means the log was idle for a long time (which should have been caught by sanity). Was the log idle (nothing being written all night)?

@andreleblanc11 (Member Author)

The on-call person does not recall having checked for active/inactive processes when we got paged.

The poll crashed at 22:35:34.
The monitoring picked up the missing instance at 22:40:22, so almost 5 minutes later.

I'm not sure how long the log needs to be dead before the instance is declared missing (5 minutes, i.e. every housekeeping interval?).

@andreleblanc11 (Member Author)

andreleblanc11 commented May 15, 2024

I was also able to recreate the problem (with a sender), and I noticed a couple of things.

I noticed that the missing_instances field gets filled in the state dictionary, but not the hung_instances field.

{'status': 'missing', 'options': <sarracenia.config.Config object at 0x7f2f838ed630>, 'instances': 1}

{'instance_pids': {1: 2948147}, 'queueName': None, 'has_state': False, 'noVip': None, 'instance_metrics': {1: {'flow': {'stop_requested': False, 'last_housekeeping': 1715785919.9663486, 'transferConnected': False, 'transferConnectStart': 0, 'transferConnectTime': 0, 'transferRxBytes': 0, 'transferTxBytes': 0, 'transferRxFiles': 0, 'transferTxFiles': 0, 'msgRate': 0.04711561243407943, 'current_sleep': 6.4}, 'post.message': {}, 'gather.message': {}, 'retry': {'msgs_in_download_retry': 0, 'msgs_in_post_retry': 0}, 'log': {'lagMax': 0, 'lagTotal': 0, 'lagMessageCount': 0, 'rejectCount': 0}, 'status': {'mtime': 1715785943.412397}}}, 'logAge': {1: 643.4065582752228}, 'metrics': {'byteRate': 0, 'rejectCount': 0, 'last_housekeeping': 0, 'rxByteCount': 0, 'rxGoodCount': 0, 'rxBadCount': 0, 'txByteCount': 0, 'txGoodCount': 0, 'txBadCount': 0, 'lagMax': 0, 'lagTotal': 0, 'lagMessageCount': 0, 'disconnectTime': 0, 'transferConnectTime': 0, 'transferRxLast': 0, 'transferTxLast': 0, 'rxLast': 0, 'txLast': 0, 'transferRxBytes': 0, 'transferRxFiles': 0, 'transferTxBytes': 0, 'transferTxFiles': 0, 'msgs_in_post_retry': 0, 'msgs_in_download_retry': 0, 'brokerQueuedMessageCount': 0}, 'missing_instances': [1], 'hung_instances': [], 'resource_usage': {'uss': 0, 'rss': 0, 'vms': 0, 'user_cpu': 0.0, 'system_cpu': 0.0}}

If you look at the code, hung_instances determines what gets killed, so if an instance is never flagged as hung, the associated process won't be killed.

sarracenia/sarracenia/sr.py

Lines 1979 to 2012 in 9f61e90

for f in self.filtered_configurations:
    (c, cfg) = f.split(os.sep)
    component_path = self._find_component_path(c)
    if component_path == '':
        continue
    if self.configs[c][cfg]['status'] in ['stopped']:
        numi = self.configs[c][cfg]['instances']
        for i in range(1, numi + 1):
            if pcount % 10 == 0: print('.', end='', flush=True)
            pcount += 1
    if 'hung_instances' in self.states[c][cfg] and len(self.states[c][cfg]['hung_instances']) > 0:
        for i in self.states[c][cfg]['hung_instances']:
            kill_pid = self.states[c][cfg]['instance_pids'][i]
            print(f'\nfound hung {c}/{cfg}/{i} pid: {kill_pid}')
            kill_hung.append(kill_pid)
            pcount += 1
print('killing hung processes... (no point in SIGTERM if it is hung)')
if (len(kill_hung) > 0) and not self.options.dry_run:
    for pid in kill_hung:
        signal_pid(pid, signal.SIGKILL)
    time.sleep(5)
    self._read_procs()
# next step should identify the missing instances and start them up.
if pcount != 0:
    self._find_missing_instances()
    self._clean_missing_proc_state()
    self._read_states()
    self._resolve()
filtered_missing = []
for m in self.missing:
    if m[0] + os.sep + m[1] in self.filtered_configurations:
        filtered_missing.append(m)

@petersilva (Contributor)

  • missing_instances are not running... they are not there... you can't kill them; killing them makes no sense. They are missing.
  • hung_instances can be killed, but figuring out that they are hung is done by looking at the log files and finding ones that have not been modified within sanity_log_dead time. This is done in _resolve() (around line 999).

You keep referring to "missing_instances", but I still don't know if any instances are missing... missing means the process died (either by crashing or because something killed it).

You say "the poll crashed", but I don't know if that means the process referred to in the state file stopped, or if it just stopped writing messages to the log.

Did you do a ps aux | grep of the PID in the instance file, to find the process?

We need to know whether it is missing or hung. You don't kill missing processes... they are already dead.

You say you "recreated the problem with a sender"... If I kill a sender instance and then run sanity, it finds it and starts it back up; that does not reproduce the problem. Please describe HOW you "reproduced the problem". What did you do to get sanity not to notice?
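The missing-versus-hung distinction above can be sketched in a few lines. This is a hypothetical illustration of the two states being discussed, not sr3's code: a missing instance has no live process behind its recorded PID (restart it, nothing to kill), while a hung one is alive but has gone silent in its log (SIGKILL, then restart).

```python
import os

def pid_alive(pid: int) -> bool:
    """True if a process with this PID exists; signal 0 probes without killing."""
    try:
        os.kill(pid, 0)
        return True
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # exists, but owned by another user

def classify(pid: int, log_idle_seconds: float, log_dead_threshold: float) -> str:
    # missing: the process is gone -> just restart it, nothing to kill.
    # hung: the process exists, but its log has been silent too long
    #       -> kill it first, then restart.
    if not pid_alive(pid):
        return 'missing'
    if log_idle_seconds > log_dead_threshold:
        return 'hung'
    return 'running'
```

In the failures reported in this issue, the instance was in the 'missing' state, which the kill-the-hung path never touches.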

@andreleblanc11 (Member Author)

How I recreated the problem

  • Take my sender config sender/my-config and start it up with sr3.
# From user@host:~/.config/sr3
sr3 start sender/my-config
# Make sure it's running
sr3 status sender/my-config
sender/my-config run     1/1     0 100%   0%      0    0.00s    0.00s n/a    0.0%  0 Bytes/s   0 msgs/s  0 Bytes/s  0 Bytes/s
  • Kill the process associated with sender/my-config and check that the instance shows as missing
ps aux | grep my-config # Get the PID number
kill -9 my-pid
ps aux | grep my-config # Verify it's gone

sr3 status sender/my-config
sender/my-config miss    0/1     0 100%   0%      0    0.00s    0.00s n/a    0.0%  0 Bytes/s   0 msgs/s  0 Bytes/s  0 Bytes/s
  • Run sr3 sanity and see if it restarts my-config
sr3 sanity
> sanity: killing hung processes... (no point in SIGTERM if it is hung)
> no missing processes found
> notice: python module azurestorage is missing: cannot connect natively to Azure Stoarge accounts
> notice: python module reassembly is missing: need to lock block segmented files to put them back together
> notice: python module redis is missing: cannot use redis implementations of retry and nodupe

# Verify the status of my-config after
sr3 status sender/my-config
sender/my-config miss    0/1     0 100%   0%      0    0.00s    0.00s n/a    0.0%  0 Bytes/s   0 msgs/s  0 Bytes/s  0 Bytes/s

ps aux | grep my-config # Returns nothing

cat ~/.cache/sr3/sender/my-config/sender_my-config_01.pid # Returns the PID that was just `kill -9`

@petersilva (Contributor)

collaborative debug session done. PR generated. #1053

@petersilva (Contributor)

For posterity's sake: sanity worked if instances > 1. Sanity was checking for stopped configurations instead of configurations with missing_instances.
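A minimal illustration of the shape of the fix (hypothetical function and dictionary layout, modeled on the state dump quoted earlier in this thread; see the linked PRs for the real change): rather than gating the restart scan on a config's status being 'stopped', walk each configuration's missing_instances list directly, which also covers the single-instance case where the status shows 'missing'.

```python
def find_restartable(states: dict) -> list:
    """Return [component, config, instance] triples that sanity should restart.

    states mirrors the structure shown earlier in this issue:
    states[component][config]['missing_instances'] is a list of
    instance numbers whose processes are gone.
    """
    missing = []
    for c in states:
        for cfg in states[c]:
            # The bug: gating on configs[c][cfg]['status'] == 'stopped'
            # skipped configs whose only instance was 'missing'.
            # The fix: consult missing_instances directly.
            for i in states[c][cfg].get('missing_instances', []):
                missing.append([c, cfg, i])
    return missing
```

With the state shown above, this would flag [['poll', 'airnow_http_poll', 1]] for restart regardless of how many instances the config has.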
