
SIGABRT (signal 6) after a job is killed by the Linux OOM killer #1783

Open
rs-mfurlender opened this issue Dec 22, 2021 · 7 comments

@rs-mfurlender
rs-mfurlender commented Dec 22, 2021

We have a situation where certain memory-heavy tasks will cause our dedicated server to run out of memory and trigger the OOM killer.
The killer terminates the Resque job (which is fine) with a SIGKILL, but it also breaks Resque for any subsequent jobs (not so fine).

To replicate this issue I created a Resque job that purposely eats up all available RAM, triggers an OOM, and gets reaped.
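Something along these lines is enough (MemoryHogJob and the chunk size are just illustrative):

# Illustrative only: a job that allocates until the kernel OOM-kills it.
class MemoryHogJob
  @queue = :high_memory

  def self.perform
    buffers = []
    # Each iteration retains another ~100 MB string.
    loop { buffers << ('x' * 100_000_000) }
  end
end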
After it's reaped, all subsequent tasks fail like this:

[screenshot: subsequent jobs failing with SIGABRT (signal 6)]

Strangely, the Workers tab for the worker in question shows this:
[screenshot: the Workers tab listing the worker with 24 processed jobs and a recent heartbeat]

This doesn't make sense to me, because all 24 of those tasks very much failed (and their failures show up on the Failures tab).
Also, why is there a heartbeat? The PID should be dead, right?

I was hoping for some guidance on this issue. Right now we are thinking of setting up a cron job to check dmesg for "process killed" messages and restart Resque, but that seems like a dirty solution.

I am using version 2.2 of Resque.

Thanks!

@iloveitaly
Contributor

Is the process restarted automatically when it is SIGKILL'd?

From memory, SIGKILL terminates the process immediately. To exit safely, you'll want to send a TERM signal. More info here: https://github.com/resque/resque#signals
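For instance (worker_pid here is a stand-in for the actual worker PID):

# TERM can be trapped, so Resque's signal handler can shut the job down cleanly.
Process.kill('TERM', worker_pid)

# KILL cannot be trapped; the process dies before any cleanup runs.
Process.kill('KILL', worker_pid)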

I'm not sure why the heartbeat is reporting as active. That doesn't make much sense, but I don't have enough knowledge about that part of the code to give you a quick answer.

@rs-mfurlender
Author

The process does not get restarted automatically. I see that a TERM signal would let the child exit safely, but that is out of our hands, because the child is being killed by the Linux OOM killer.

@rs-mfurlender
Author

You can also replicate this situation by simply kill -9 <pid> where <pid> is the forked child process.

@rs-mfurlender
Author

I just want to share my solution for this situation in case it helps someone else or in case you guys decide to integrate it (or something like it) into the core code.

I created a simple Failure backend to detect child processes that were terminated with SIGKILL and kill their parent (worker) process. The workers automatically respawn with new PIDs and future jobs do not fail with SIGABRT.

module Resque
  module Failure
    # Runs for every failure; only acts when the child was SIGKILLed
    # (e.g. by the Linux OOM killer).
    class SigkillFailureBackend < Base
      def save
        return unless exception.to_s.include?('SIGKILL (signal 9)')

        # QUIT asks the stuck parent worker to shut down; it then
        # respawns with a fresh PID, so later jobs stop failing.
        Process.kill('QUIT', worker.pid)
      end
    end
  end
end

I then added it to the Resque backends like this:

Resque::Failure::MultipleWithRetrySuppression.classes = [
  Resque::Failure::Redis,
  Resque::Failure::SigkillFailureBackend
]
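Presumably the combined backend is also activated somewhere, per resque-retry's documented setup:

# Assumes resque-retry's MultipleWithRetrySuppression is the active backend.
Resque::Failure.backend = Resque::Failure::MultipleWithRetrySuppression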

@iloveitaly
Contributor

@rs-mfurlender That's interesting that this works. My assumption was that as soon as SIGKILL is received the process stops executing, which would mean this chunk of code wouldn't run. Is that not the case? How long does a process have to clean up before it gets force-killed by the OS?

@rs-mfurlender
Author

rs-mfurlender commented Dec 28, 2021

From what I understand, the workers spawn actual jobs via fork(). The parent (worker) can detect the signal used to kill the child, even if it is SIGKILL, which, as you said, does not give the child a chance to communicate with the parent.

I think this must be the case, otherwise there is no way that SIGKILL could appear as the cause of termination within the failure panel.

The worker executes this code, not the dead child process.

Or, this may occur entirely outside of the worker thread. I haven't dug that deep.
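A minimal plain-Ruby sketch of that mechanism (illustrative, not Resque's actual code):

# The parent forks a child, the child is SIGKILLed, and the parent
# still learns the terminating signal when it reaps the child.
pid = fork { sleep }               # stand-in for the real job
Process.kill('KILL', pid)          # what the OOM killer effectively does
_, status = Process.waitpid2(pid)  # parent reaps the child

status.signaled?  # => true
status.termsig    # => 9 (SIGKILL)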

@iloveitaly
Contributor

@rs-mfurlender Ah, I didn't realize that was possible with forked processes—I always assumed the signals were isolated to the individual process.

Could you add a note with your example code (and a link to this discussion) to the readme? I know others using this gem would love to see this info.
