job stuck with active shells after timeout #5967
Another occurrence of this one for a couple of jobs on elcap. Since it is quite annoying and should be a somewhat easy fix, adding it to the v0.63.0 milestone.
grondo added a commit to grondo/flux-core that referenced this issue on May 29, 2024:
Problem: The job-exec module only attempts to send SIGKILL once to job shells after the job shell kill timeout has elapsed. If for some reason the signal is not delivered to one or more shells, then the job can hang forever in CLEANUP state waiting for those shells to exit. Retry sending SIGKILL to job shells as long as the job is still active in the exec system. Since kernel or filesystem issues could cause a shell to become unkillable, implement a backoff scheme so that the kill signal is not sent every 5s in this case. Fixes flux-framework#5967
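The backoff scheme the commit describes can be sketched as a capped exponential delay sequence. This is a minimal illustration, not the actual job-exec implementation; the 5s initial delay matches the commit message, while the doubling factor and 300s cap are assumed values for illustration.

```python
def backoff_delays(initial=5.0, factor=2.0, cap=300.0):
    """Yield capped exponential delays between SIGKILL retries.

    Starts at `initial` seconds and doubles on each retry, so an
    unkillable shell is not signaled every 5s indefinitely, but the
    wait never grows past `cap`.
    """
    delay = initial
    while True:
        yield delay
        delay = min(delay * factor, cap)
```

With these assumed parameters, the retry intervals would be 5, 10, 20, 40, 80, 160, then 300 seconds thereafter.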
grondo added a commit to grondo/flux-core that referenced this issue on May 31, 2024.
grondo added a commit to grondo/flux-core that referenced this issue on Jun 2, 2024.
grondo added a second commit to grondo/flux-core that referenced this issue on Jun 2, 2024.
On elcap a large job was stuck in CLEANUP with many job shells still running. The logs indicate that a SIGKILL was sent to the shells, but some of them apparently never received it and kept running, with no errors recorded in the logs.
Subsequent fatal exceptions did not re-send SIGKILL.
Since signals are inherently racy, perhaps the job-exec module should continue to send SIGKILL with a timeout and backoff to jobs that are stuck in this way. This would have eventually cleaned up this job (I presume, though we don't really know why the initial SIGKILL failed).
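The suggested retry loop could look roughly like the following sketch. This is illustrative only, not the job-exec code: `kill_with_backoff` is a hypothetical name, it operates on a local `subprocess.Popen` handle rather than remote job shells, and the delay parameters are assumptions.

```python
import signal
import subprocess


def kill_with_backoff(proc, initial_delay=5.0, factor=2.0, max_delay=300.0):
    """Repeatedly send SIGKILL to `proc` until it exits.

    Because signal delivery is inherently racy, a single SIGKILL may be
    lost; retrying with a capped exponential backoff eventually reaps
    the process without signaling an unkillable one every few seconds.
    """
    delay = initial_delay
    while proc.poll() is None:          # still active?
        proc.send_signal(signal.SIGKILL)
        try:
            proc.wait(timeout=delay)    # give it `delay` seconds to exit
        except subprocess.TimeoutExpired:
            delay = min(delay * factor, max_delay)
```

In job-exec the equivalent loop would be driven by a reactor timer and would stop as soon as the job is no longer active in the exec system.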