
SIGABRT (signal 6) after a job is killed by the Linux OOM killer #1783

Open
rs-mfurlender opened this issue Dec 22, 2021 · 7 comments

@rs-mfurlender
rs-mfurlender commented Dec 22, 2021

We have a situation where certain memory-heavy tasks will cause our dedicated server to run out of memory and trigger the OOM killer.
The killer terminates the Resque job (which is fine) with a SIGKILL, but it also breaks Resque for any subsequent jobs (not so fine).

To replicate this issue I created a Resque job that purposely eats up all available RAM, triggers an OOM, and gets reaped.
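Something along these lines is enough (MemoryHogJob and the chunk size are just illustrative):

# Illustrative only: a job that allocates until the kernel OOM-kills it.
class MemoryHogJob
  @queue = :high_memory

  def self.perform
    buffers = []
    # Each iteration retains another ~100 MB string.
    loop { buffers << ('x' * 100_000_000) }
  end
end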
After it's reaped, all subsequent tasks fail like this:

[screenshot: subsequent jobs failing with SIGABRT (signal 6)]

Strangely, the Workers tab for the worker in question shows this:
[screenshot: the Workers tab listing the worker with 24 processed jobs and a recent heartbeat]

This doesn't make sense to me, because all 24 of those tasks very much failed (and their failures show up on the Failures tab).
Also, why is there a heartbeat? The PID should be dead, right?

I was hoping for some guidance on this issue. Right now we are thinking of setting up a cron job to check dmesg for "process killed" messages and restart Resque, but that seems like a dirty solution.

I am using version 2.2 of Resque.

Thanks!

@iloveitaly
Contributor

Is the process restarted automatically when it is SIGKILL'd?

From memory, SIGKILL terminates the process immediately. To exit safely, you'll want to send a TERM signal. More info here: https://github.com/resque/resque#signals
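For instance (worker_pid here is a stand-in for the actual worker PID):

# TERM can be trapped, so Resque's signal handler can shut the job down cleanly.
Process.kill('TERM', worker_pid)

# KILL cannot be trapped; the process dies before any cleanup runs.
Process.kill('KILL', worker_pid)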

I'm not sure why the heartbeat is reporting as active. That doesn't make much sense, but I don't have enough knowledge about that part of the code to give you a quick answer.

@rs-mfurlender
Author

The process does not get restarted automatically. I see that a TERM signal would let the child exit safely, but that is out of our hands, because the child is being killed by the Linux OOM killer.

@rs-mfurlender
Author

You can also replicate this situation by simply kill -9 <pid> where <pid> is the forked child process.

@rs-mfurlender
Author

I just want to share my solution for this situation in case it helps someone else or in case you guys decide to integrate it (or something like it) into the core code.

I created a simple Failure backend to detect child processes that were terminated with SIGKILL and kill their parent (worker) process. The workers automatically respawn with new PIDs and future jobs do not fail with SIGABRT.

module Resque
  module Failure
    # Runs for every failure; only acts when the child was SIGKILLed
    # (e.g. by the Linux OOM killer).
    class SigkillFailureBackend < Base
      def save
        return unless exception.to_s.include?('SIGKILL (signal 9)')

        # QUIT asks the stuck parent worker to shut down; it then
        # respawns with a fresh PID, so later jobs stop failing.
        Process.kill('QUIT', worker.pid)
      end
    end
  end
end

I then added it to the Resque backends like this:

Resque::Failure::MultipleWithRetrySuppression.classes = [
  Resque::Failure::Redis,
  Resque::Failure::SigkillFailureBackend
]
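Presumably the combined backend is also activated somewhere, per resque-retry's documented setup:

# Assumes resque-retry's MultipleWithRetrySuppression is the active backend.
Resque::Failure.backend = Resque::Failure::MultipleWithRetrySuppression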

@iloveitaly
Contributor

@rs-mfurlender That's interesting that this works. My assumption was that as soon as SIGKILL is received the process stops executing, which would mean this chunk of code wouldn't run. Is that not the case? How long does a process have to clean up before it gets force-killed by the OS?

@rs-mfurlender
Author

rs-mfurlender commented Dec 28, 2021

From what I understand, the workers spawn actual jobs via fork(). The parent (worker) can detect the signal used to kill the child, even if it is SIGKILL, which, as you said, does not give the child a chance to communicate with the parent.

I think this must be the case, otherwise there is no way that SIGKILL could appear as the cause of termination within the failure panel.

The worker executes this code, not the dead child process.

Or, this may occur entirely outside of the worker thread. I haven't dug that deep.
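A minimal plain-Ruby sketch of that mechanism (illustrative, not Resque's actual code):

# The parent forks a child, the child is SIGKILLed, and the parent
# still learns the terminating signal when it reaps the child.
pid = fork { sleep }               # stand-in for the real job
Process.kill('KILL', pid)          # what the OOM killer effectively does
_, status = Process.waitpid2(pid)  # parent reaps the child

status.signaled?  # => true
status.termsig    # => 9 (SIGKILL)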

@iloveitaly
Contributor

@rs-mfurlender Ah, I didn't realize that was possible with forked processes—I always assumed the signals were isolated to the individual process.

Could you add a note with your example code (and a link to this discussion) to the readme? I know others using this gem would love to see this info.
