Deal with subprocesses cleanly in the worker #8385
Merged
This PR fixes two things.
Zombies (CORE-1279)
If user code spawns a background process and that process exits without the user code calling `waitpid` on it before the user code itself exits, we end up with a zombie process that takes up a slot in the process table forever. We now run the worker through dumb-init to reap these zombies, preventing the process table from filling up when user code behaves like this. Fixing this is not a huge deal in practice, but it's good for correctness. Setting `DUMB_INIT_DEBUG=1` will print a message whenever a zombie is reaped. A pipeline that runs a bash script like `sleep 30 & touch /pfs/out/foo`
is an example of how to produce a zombie prior to this change.

Subprocesses that don't exit (CORE-1016, CORE-1280)
If user code spawns a background process that does NOT exit, then we have big problems. It could hold stdout open forever, which means we wait forever reading it inside `cmd.Wait()`, even if the user code is killed because of a context timeout. We had a workaround prior to #8066, but it didn't go far enough: it ignores the case where a child of the user code doesn't hold stdout open but uses some other resource on the system, like a TCP port. Issue #8375 is an example of something that does this.

This PR fixes that case by running the user code in a process group and sending SIGKILL to the process group when the user code exits (leaving behind background processes) or is killed (because of an expiring context). This means that everything the user code spawned gets SIGKILL'd and is guaranteed to terminate, releasing any file descriptors we were reading (stdout) or other resources (TCP ports). We make a best effort to log when we're about to do this (the `find /proc/...` we use to enumerate the group races with processes exiting on their own before we call kill), so the logs look something like this when running the opencv example as a bash script like `sleep infinity & python3 /edges.py`:

A process can leave the process group, and then it will have the same problems (stdout stuck open) as before. You have to go out of your way to leave the process group, but if a user does that, we have no way of fixing it. (For that, we'd have to create a PID namespace with unshare, but that requires root, or at least CAP_SYS_ADMIN.)
Anyway, this bug is fairly common; if you've ever run "stop job" and the job didn't stop, or if you've seen a datum run longer than its DatumTimeout, you've hit this bug. I've seen it quite a bit but didn't recognize the cause until looking into this in detail.
`sleep infinity & touch /pfs/out/foo` is an example of a script that used to break, but now works. `TestService` in pachyderm_test would be broken by this bug, but it contains a workaround: `exec python ...` instead of `python ...`. I've made that test do it "wrong" now, so it acts as a regression test for this fix. Now users don't have to be UNIX wizards to have a working web server serving files from a repo!

One reason this bug is problematic in general is that we use SIGKILL (via a hard-coded default in exec.Cmd); if the user code has cleanup logic for terminating its subprocesses, it can't run it, because SIGKILL cannot be caught. SIGTERM + a grace period + SIGKILL would be more polite and less likely to cause problems; a well-behaved app would catch SIGTERM and kill its children. The ideal solution here is to run the user code in a new PID namespace and let it be pid 1; then user code could use dumb-init to remap SIGKILL to SIGTERM if it felt like it. But creating a PID namespace requires root, which we try to avoid, so I didn't pursue that route.
Finally, this is technically a breaking change. If you had a pipeline like `do_actual_work & sleep 10` and `do_actual_work` runs longer than 10 seconds, that worked in the past but won't work now. Worse, the job will be marked successful, because bash exits 0 after the `sleep 10`. In the past we would wait indefinitely for `do_actual_work` to close stdout, but now we SIGKILL it as soon as the shell exits. This is something we'll have to watch out for as users upgrade.

How I tested this
I ran TestPipelineWithSubprocess with and without this PR -- without, it hangs forever; with, it passes.
I ran (this PR's version of) TestService with and without this PR -- without, the first version of the server never gets killed and the test times out; with, it works as expected. Some logs:
(Note how we keep serving the wrong content after "canceling previous service, new job ready"; the requests for file1 at increasing intervals are the backoff loop waiting for the content to change.)
With this change, the logs are much more in line with what we expect: