job stuck with active shells after timeout #5967
Another occurrence of this one for a couple of jobs on elcap. Since it is quite annoying and should be a somewhat easy fix, adding it to the v0.63.0 milestone.
grondo added a commit to grondo/flux-core that referenced this issue on May 29, 2024:
Problem: The job-exec module only attempts to send SIGKILL once to job shells after the job shell kill timeout has elapsed. If for some reason the signal is not delivered to one or more shells, then the job can hang forever in CLEANUP state waiting for those shells to exit. Retry sending SIGKILL to job shells as long as the job is still active in the exec system. Since kernel or filesystem issues could cause a shell to become unkillable, implement a backoff scheme so that the kill signal is not sent every 5s in this case. Fixes flux-framework#5967
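The backoff scheme the commit describes can be sketched as a capped exponential delay sequence. This is a minimal illustration, not the actual job-exec implementation; the 5s initial delay matches the commit message, while the doubling factor and 300s cap are assumed values for illustration.

```python
def backoff_delays(initial=5.0, factor=2.0, cap=300.0):
    """Yield capped exponential delays between SIGKILL retries.

    Starts at `initial` seconds and doubles on each retry, so an
    unkillable shell is not signaled every 5s indefinitely, but the
    wait never grows past `cap`.
    """
    delay = initial
    while True:
        yield delay
        delay = min(delay * factor, cap)
```

With these assumed parameters, the retry intervals would be 5, 10, 20, 40, 80, 160, then 300 seconds thereafter.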
grondo added a commit to grondo/flux-core that referenced this issue on May 31, 2024.
grondo added a commit to grondo/flux-core that referenced this issue on Jun 2, 2024.
grondo added a second commit to grondo/flux-core that referenced this issue on Jun 2, 2024.
On elcap a large job was stuck in CLEANUP with many job shells still running. The logs indicate that a SIGKILL was sent to the shells, but some of them apparently never received it and kept running, with no errors recorded in the logs.
Subsequent fatal exceptions did not re-send SIGKILL.
Since signals are inherently racy, perhaps the job-exec module should continue to send SIGKILL with a timeout and backoff to jobs that are stuck in this way. This would have eventually cleaned up this job (I presume, though we don't really know why the initial SIGKILL failed).
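The suggested retry loop could look roughly like the following sketch. This is illustrative only, not the job-exec code: `kill_with_backoff` is a hypothetical name, it operates on a local `subprocess.Popen` handle rather than remote job shells, and the delay parameters are assumptions.

```python
import signal
import subprocess


def kill_with_backoff(proc, initial_delay=5.0, factor=2.0, max_delay=300.0):
    """Repeatedly send SIGKILL to `proc` until it exits.

    Because signal delivery is inherently racy, a single SIGKILL may be
    lost; retrying with a capped exponential backoff eventually reaps
    the process without signaling an unkillable one every few seconds.
    """
    delay = initial_delay
    while proc.poll() is None:          # still active?
        proc.send_signal(signal.SIGKILL)
        try:
            proc.wait(timeout=delay)    # give it `delay` seconds to exit
        except subprocess.TimeoutExpired:
            delay = min(delay * factor, max_delay)
```

In job-exec the equivalent loop would be driven by a reactor timer and would stop as soon as the job is no longer active in the exec system.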