job stuck with active shells after timeout #5967

Closed
grondo opened this issue May 14, 2024 · 1 comment

Comments

grondo commented May 14, 2024

On elcap a large job was stuck in CLEANUP with many active job shells still running. The logs indicate that a SIGKILL was sent to the shells, but it apparently did not take effect on some of them, and there were no errors in the logs.
Subsequent fatal exceptions didn't re-send SIGKILL.

Since signals are inherently racy, perhaps the job-exec module should continue to send SIGKILL, with a timeout and backoff, to jobs that are stuck in this way. That would eventually have cleaned up this job (I presume, though we don't really know why the initial SIGKILL failed).
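For illustration only, here is a minimal sketch of the retry-with-backoff idea using plain POSIX kill(2)/waitpid(2) on a local child process; the actual job-exec module operates on remote job shells through the Flux exec system, and the delays, cap, and helper below are hypothetical:

```c
/* Illustrative sketch only (not the job-exec implementation): keep
 * sending SIGKILL to a local child, doubling the wait between attempts
 * up to a cap, until the child is actually gone. */
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int kill_with_backoff (pid_t pid,
                              unsigned int delay,
                              unsigned int max_delay)
{
    for (;;) {
        if (kill (pid, SIGKILL) < 0 && errno == ESRCH)
            return 0;                         /* process already gone */
        sleep (delay);
        pid_t ret = waitpid (pid, NULL, WNOHANG);
        if (ret == pid || (ret < 0 && errno == ECHILD))
            return 0;                         /* child exited and was reaped */
        fprintf (stderr, "SIGKILL ineffective, retrying in %us\n", delay);
        if (delay < max_delay)
            delay *= 2;                       /* back off rather than spamming */
    }
}
```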

grondo commented May 21, 2024

Another occurrence of this one for a couple of jobs on elcap. Since it is quite annoying and should be a somewhat easy fix, adding it to the v0.63.0 milestone.

@grondo grondo added this to the flux-core-0.63.0 milestone May 21, 2024
grondo added a commit to grondo/flux-core that referenced this issue May 29, 2024
Problem: The job-exec module only attempts to send SIGKILL once to
job shells after the job shell kill timeout has elapsed. If for some
reason the signal is not delivered to one or more shells, then the
job can hang forever in CLEANUP state waiting for those shells to
exit.

Retry the send of SIGKILL to job shells as long as the job is still
active in the exec system. Since kernel or filesystem issues could
cause a shell to become unkillable, implement a backoff scheme so
that the kill is not sent every 5s in this case.

Fixes flux-framework#5967
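
Purely as an illustration of the backoff scheme the commit describes (not the actual job-exec code), a repeating libev-style timer whose interval doubles up to a cap each time it fires could look like the sketch below; the constants and the kill_job()/job_is_active() helpers are made up for this example:

```c
/* Hypothetical sketch of a kill timer with backoff, using a bare libev
 * timer; constants and helpers are illustrative, not flux-core APIs. */
#include <ev.h>

#define KILL_TIMEOUT      5.0   /* initial retry interval (seconds) */
#define KILL_TIMEOUT_MAX  300.0 /* cap on the backoff */

extern void kill_job (void);      /* hypothetical: re-send SIGKILL to shells */
extern int job_is_active (void);  /* hypothetical: any shells still running? */

static void kill_timer_cb (struct ev_loop *loop, ev_timer *w, int revents)
{
    if (!job_is_active ()) {          /* all shells exited: stop retrying */
        ev_timer_stop (loop, w);
        return;
    }
    kill_job ();                      /* re-send SIGKILL */
    if (w->repeat < KILL_TIMEOUT_MAX)
        w->repeat *= 2;               /* back off: double interval, capped */
    ev_timer_again (loop, w);         /* re-arm with the new interval */
}

/* setup:
 *   ev_timer_init (&w, kill_timer_cb, KILL_TIMEOUT, KILL_TIMEOUT);
 *   ev_timer_start (loop, &w);
 */
```

The point of the doubling interval is exactly what the commit message states: an unkillable shell (e.g. stuck in the kernel or on a hung filesystem) should not cause SIGKILL to be re-sent every 5 seconds indefinitely.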
grondo added further commits to grondo/flux-core referencing this issue on May 31 and Jun 2, 2024, with the same commit message as above.
mergify bot closed this as completed in 779c73e Jun 3, 2024