SIGABRT (signal 6) after a job is killed by the Linux OOM killer #1783
Comments
Is the process restarted automatically when it is SIGKILL'd? From my memory, SIGKILL terminates the process immediately. To exit safely, you'll want to send a TERM signal. More info here: https://github.com/resque/resque#signals I'm not sure why the heartbeat is reporting as active. That doesn't make much sense, but I don't know that part of the code well enough to give you a quick answer. |
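To illustrate the difference between the two signals, here is a minimal Ruby sketch. The helper name `trappable?` is made up for this example and is not part of Resque. It only checks whether Ruby will let a process install a handler for a given signal: TERM can be handled, which is what makes a graceful exit possible, while KILL cannot.

```ruby
# Hypothetical helper (not part of Resque): report whether this process is
# allowed to install its own handler for the given signal.
def trappable?(signal)
  # Signal.trap returns the previously installed handler.
  previous = Signal.trap(signal) { }
  # Put the earlier handler back so the check has no lasting effect.
  Signal.trap(signal, previous || "DEFAULT")
  true
rescue ArgumentError, SystemCallError
  # Ruby or the OS refused to install the handler.
  false
end

puts "TERM trappable: #{trappable?('TERM')}"
puts "KILL trappable: #{trappable?('KILL')}"
```

Because no handler can be installed for KILL, a job killed by the OOM killer gets no chance to run cleanup code of its own.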
The process does not get restarted automatically. I see that a TERM signal would let the child exit safely, but that is out of our hands because the child is being killed by the Linux OOM reaper. |
You can also replicate this situation by simply |
I just want to share my solution for this situation in case it helps someone else or in case you guys decide to integrate it (or something like it) into the core code. I created a simple
I then added it to the Resque backends like this:
|
@rs-mfurlender That's interesting that this works. My assumption was that as soon as SIGKILL is received the process stops executing, which would mean that this chunk of code wouldn't execute. Is that not the case? How long does a process have to clean up before it gets force-killed by the OS? |
From what I understand, the workers spawn actual jobs via I think this must be the case, otherwise there is no way that SIGKILL could appear as the cause of termination within the failure panel. The worker executes this code, not the dead child process. Or, this may occur entirely outside of the worker thread. I haven't dug that deep. |
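The general mechanism described here, a parent process learning that its child was killed by a signal, can be sketched in plain Ruby. This is not Resque's actual code: `run_in_child` is a hypothetical helper, and the child sends SIGKILL to itself only to simulate what the OOM killer would do.

```ruby
# Hypothetical helper (not Resque's implementation): run a block in a forked
# child and report, from the parent's side, how the child ended.
def run_in_child
  pid = fork do
    yield
    exit!(0) # leave the child without running at_exit handlers
  end

  # The parent blocks here until the child is gone, then inspects its status.
  _, status = Process.wait2(pid)

  if status.signaled?
    # The child was terminated by a signal it could not (or did not) handle.
    { outcome: :killed, signal: Signal.signame(status.termsig) }
  else
    { outcome: :exited, code: status.exitstatus }
  end
end

# Simulate the OOM killer: the child SIGKILLs itself and never runs cleanup.
result = run_in_child do
  Process.kill("KILL", Process.pid)
  sleep # never reached in practice; guards against returning early
end

puts result.inspect
```

The child never executes anything after the SIGKILL. It is the surviving parent that sees the termination signal and can record the failure.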
@rs-mfurlender Ah, I didn't realize that was possible with forked processes—I always assumed the signals were isolated to the individual process. Could you add a note with your example code (and a link to this discussion) to the readme? I know others using this gem would love to see this info. |
We have a situation where certain memory-heavy tasks will cause our dedicated server to run out of memory and trigger the OOM killer.
The killer terminates the Resque job (which is fine) with a SIGKILL, but it also breaks Resque for any subsequent jobs (not so fine).
To replicate this issue I created a Resque job that purposely eats up all available RAM, causes an OOM, and gets reaped.
After it's reaped, all subsequent tasks fail like this:
Strangely, the Workers tab for the worker in question shows this:
This doesn't make sense to me because all 24 of those tasks very much failed (and their failure shows up on the Failures tab).
Also, why is there a heartbeat? The PID should be dead, right?
I was hoping for some guidance on this issue. Right now we are thinking of setting up a cron job to check dmesg for "process killed" and restart Resque, but that seems like a dirty solution.
I am using version 2.2 of Resque.
Thanks!