
Deal with subprocesses cleanly in the worker #8385

Merged: 6 commits into master on Nov 30, 2022

Conversation

jrockway (Member) commented Nov 23, 2022

This fixes two things.

Zombies (CORE-1279)

If user code spawns a background process, and that process exits without the user code ever calling waitpid on it before the user code itself exits, we end up with a zombie process that takes up a slot in the process table forever. We now run the worker through dumb-init to reap these zombies, preventing the process table from filling up when user code behaves like this. Fixing this is not a huge deal in practice, but it's good for correctness. Setting DUMB_INIT_DEBUG=1 will print something whenever a zombie is reaped. A pipeline that runs a bash script like sleep 30 & touch /pfs/out/foo is an example of how to produce a zombie prior to this change.
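
For illustration only, here's a minimal Go sketch of the reaping loop that an init-style process like dumb-init runs as PID 1: on SIGCHLD, call wait until no exited children remain. This isn't the worker's code, and the demo child (sleep 1) is just there to produce something to reap.

```go
package main

import (
	"log"
	"os"
	"os/exec"
	"os/signal"
	"syscall"
)

func main() {
	// Subscribe to SIGCHLD before starting any children so no notification is missed.
	sigchld := make(chan os.Signal, 1)
	signal.Notify(sigchld, syscall.SIGCHLD)

	// Start a child and deliberately never call Wait on it, so it becomes a
	// zombie when it exits; the loop below reaps it instead.
	if err := exec.Command("sleep", "1").Start(); err != nil {
		log.Fatal(err)
	}

	for range sigchld {
		reaped := false
		for {
			var ws syscall.WaitStatus
			// WNOHANG: collect every already-exited child without blocking.
			pid, err := syscall.Wait4(-1, &ws, syscall.WNOHANG, nil)
			if pid <= 0 || err != nil {
				break
			}
			log.Printf("reaped zombie pid=%d status=%d", pid, ws.ExitStatus())
			reaped = true
		}
		if reaped {
			return // a real init (like dumb-init) keeps looping forever
		}
	}
}
```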

Subprocesses that don't exit (CORE-1016, CORE-1280):

If user code spawns a background process that does NOT exit, then we have big problems. It could hold stdout open forever, which means we wait forever reading it inside cmd.Wait(), even if the user code is killed because of a context timeout. We had a workaround prior to #8066, but it didn't go far enough. A case it didn't cover is a child of the user code that doesn't touch stdout but ties up some other resource on the system, like a TCP port. Issue #8375 is an example of something that does this. This PR fixes that case by running the user code in a process group and sending SIGKILL to the process group when the user code exits (leaving behind background processes) or is killed (because of an expiring context). This means that everything the user code spawned gets SIGKILL'd and is guaranteed to terminate, releasing any file descriptors we were reading (stdout) or other resources (TCP ports).
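
The general shape of that in Go looks roughly like the sketch below. This is not the driver's actual code; runUserCode, the bash script, and /tmp/foo (standing in for /pfs/out/foo) are illustrative.

```go
package main

import (
	"context"
	"os"
	"os/exec"
	"syscall"
	"time"
)

func runUserCode(ctx context.Context, script string) error {
	cmd := exec.CommandContext(ctx, "bash", "-c", script)
	// Start the child in its own process group so it and all of its
	// descendants share one pgid that we can signal as a unit.
	cmd.SysProcAttr = &syscall.SysProcAttr{Setpgid: true}
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	if err := cmd.Start(); err != nil {
		return err
	}
	pgid := cmd.Process.Pid // with Setpgid (and Pgid left 0), the pgid equals the child's pid
	err := cmd.Wait()       // returns when the child exits, or ctx expires and it is killed
	// Whatever happened, SIGKILL anything left in the group so stray children
	// release stdout, TCP ports, and whatever else they were holding.
	// A negative pid targets the whole process group.
	_ = syscall.Kill(-pgid, syscall.SIGKILL)
	return err
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	// The stray "sleep infinity" is killed as soon as bash itself exits.
	_ = runUserCode(ctx, "sleep infinity & touch /tmp/foo")
}
```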

We make a best-effort attempt to log when we're about to do this (scanning /proc/... races with processes exiting on their own before we call kill; a sketch of that scan appears after the log excerpt below). Running the opencv example as a bash script like sleep infinity & python3 /edges.py, the logs look something like this:

Nov 22 22:43:24 beginning to run user code pipelineName:↑ projectName:↑ workerId:↑ jobId:↑ data:[{"hash":"q4zXe15LED6ufrTyMlNYCP+dN/coyuQPAQ/KoY7n9iQ=","path":"/1VqcWw9.jpg"}] datumId:aa7e5b5ee2d9e78cfd714209d5d591078f86671e53e1710f1e658e353813cc25
Nov 22 22:43:24 /usr/local/lib/python3.4/dist-packages/matplotlib/font_manager.py:273: UserWarning: Matplotlib is building the font cache using fc-list. This may take a moment. pipelineName:↑ projectName:↑ workerId:↑ jobId:↑ data:↑ datumId:↑ user:true
Nov 22 22:43:24   warnings.warn('Matplotlib is building the font cache using fc-list. This may take a moment.') pipelineName:↑ projectName:↑ workerId:↑ jobId:↑ data:↑ datumId:↑ user:↑
Nov 22 22:43:24 /usr/local/lib/python3.4/dist-packages/matplotlib/font_manager.py:273: UserWarning: Matplotlib is building the font cache using fc-list. This may take a moment. pipelineName:↑ projectName:↑ workerId:↑ jobId:↑ data:↑ datumId:↑ user:↑
Nov 22 22:43:24   warnings.warn('Matplotlib is building the font cache using fc-list. This may take a moment.') pipelineName:↑ projectName:↑ workerId:↑ jobId:↑ data:↑ datumId:↑ user:↑
Nov 22 22:43:24 note: about to kill unexpectedly-remaining subprocess of the user code: pid=35 comm=(sleep) cmdline=sleep infinity  pipelineName:↑ projectName:↑ workerId:↑ jobId:↑ data:↑ datumId:↑
Nov 22 22:43:24 finished running user code after 598.008993ms pipelineName:↑ projectName:↑ workerId:↑ jobId:↑ data:↑ datumId:↑
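
For reference, the best-effort scan could look something like the sketch below; listGroup is a made-up name, and the field positions follow proc(5). Processes can still exit between the scan and the kill, which is exactly the race noted above.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
	"syscall"
)

// listGroup returns the pids of processes whose process group is pgid,
// by reading /proc/<pid>/stat ("pid (comm) state ppid pgrp ...").
func listGroup(pgid int) []int {
	var pids []int
	entries, _ := os.ReadDir("/proc")
	for _, e := range entries {
		pid, err := strconv.Atoi(e.Name())
		if err != nil {
			continue // not a process directory
		}
		stat, err := os.ReadFile(filepath.Join("/proc", e.Name(), "stat"))
		if err != nil {
			continue // the process may have exited already: this is the race
		}
		// comm can contain spaces, so split on the closing paren first.
		i := strings.LastIndex(string(stat), ")")
		if i < 0 {
			continue
		}
		fields := strings.Fields(string(stat)[i+1:])
		if len(fields) < 3 {
			continue
		}
		if g, _ := strconv.Atoi(fields[2]); g == pgid {
			pids = append(pids, pid)
		}
	}
	return pids
}

func main() {
	// Demo: list the members of this program's own process group.
	fmt.Println(listGroup(syscall.Getpgrp()))
}
```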

A process can still leave the process group, and then it has the same problems (stdout stuck open) as before. You have to go out of your way to leave the process group, but if user code does that, we have no way of fixing it. (For that, we'd have to create a PID namespace with unshare, which requires root or at least CAP_SYS_ADMIN.)

Anyway, this bug is fairly common; if you've ever run "stop job" and it didn't stop, or if you've seen something take longer than the DatumTimeout, you've hit this bug. I've seen it quite a bit but didn't recognize the cause until looking into this in detail. sleep infinity & touch /pfs/out/foo is an example of a script that used to break but now works.

TestService in pachyderm_test would have been broken by this bug, but the test worked around it by running exec python ... instead of python .... I've changed that test to do it the "wrong" way, so it now acts as a regression test for this fix. Now users don't have to be UNIX wizards to have a working web server for files in a repo!

One reason this bug is problematic in general is that we use SIGKILL (via a hard-coded default in exec.Cmd); if the user code has cleanup logic to terminate its subprocesses, it never gets to run it, because SIGKILL can't be caught. SIGTERM + grace period + SIGKILL would be more polite and less likely to cause problems; a well-behaved app would catch SIGTERM and kill its children. The ideal solution here is to run the user code in a new PID namespace and let it be pid 1; then user code could use dumb-init to remap SIGKILL to SIGTERM if it felt like it. But creating a PID namespace requires root, which we try to avoid, so I didn't pursue that route.
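
To make the "more polite" option concrete, here's a hedged sketch of SIGTERM + grace period + SIGKILL against a process group. This is not what the PR ships (it keeps exec.Cmd's SIGKILL default); terminateGroup and the grace value are illustrative.

```go
package main

import (
	"os/exec"
	"syscall"
	"time"
)

// terminateGroup asks everything in the process group to exit, waits up to
// the grace period, then forces the issue with SIGKILL.
func terminateGroup(pgid int, grace time.Duration) {
	// A negative pid signals every member of the process group.
	_ = syscall.Kill(-pgid, syscall.SIGTERM)

	deadline := time.Now().Add(grace)
	for time.Now().Before(deadline) {
		// Signal 0 does existence checking only; ESRCH means nothing in
		// the group is left, so we can stop early.
		if err := syscall.Kill(-pgid, 0); err == syscall.ESRCH {
			return
		}
		time.Sleep(100 * time.Millisecond)
	}
	_ = syscall.Kill(-pgid, syscall.SIGKILL)
}

func main() {
	// Demo: the trap only runs because the group signal also kills the
	// foreground "sleep", letting bash get around to handling SIGTERM.
	cmd := exec.Command("bash", "-c", "trap 'exit 0' TERM; sleep 60")
	cmd.SysProcAttr = &syscall.SysProcAttr{Setpgid: true}
	if err := cmd.Start(); err != nil {
		panic(err)
	}
	go cmd.Wait() // reap the child when it exits so the group empties out
	terminateGroup(cmd.Process.Pid, 5*time.Second)
}
```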

Finally, this is technically a breaking change. If you had a pipeline like do_actual_work & sleep 10 and do_actual_work runs longer than 10 seconds, that worked in the past but won't work now. Worse, the job will be marked successful, because bash exits 0 after the sleep 10. In the past we would wait indefinitely for do_actual_work to close stdout, but now we SIGKILL it the moment the main process exits. This is something we'll have to watch out for as users upgrade.

How I tested this

I ran TestPipelineWithSubprocess with and without this PR -- without, it hangs forever; with, it passes.

I ran (this PR's version of) TestService with and without this PR -- without, the first version of the server never gets killed and the test times out; with, it works as expected. Some logs:

Nov 22 21:22:23 beginning to run user code master:true pipelineName:pipelineservice47fafd305644 projectName:default workerId:pipeline-default-pipelineservice47fafd305644-v1-5jpzk data:[{"hash":"tzUlQiTOsvaBavGE0GHMNpn5FVUqne/23Cvboctq8pw=","path":"/"}] datumId:17f6ef2602f33a7faa8e4fe5eca9730c4851f2132fb97dd67c1d48cecf82a1c7

Nov 22 21:22:28 10.244.0.1 - - [23/Nov/2022 02:22:28] "GET /TestService_data9eabed90410e/file1 HTTP/1.1" 200 - pipelineName:↑ projectName:↑ workerId:↑ data:↑ datumId:↑ user:true
Nov 22 21:22:28 canceling previous service, new job ready master:true pipelineName:↑ projectName:↑ workerId:↑
Nov 22 21:22:29 10.244.0.1 - - [23/Nov/2022 02:22:29] "GET /TestService_data9eabed90410e/file1 HTTP/1.1" 200 - pipelineName:↑ projectName:↑ workerId:↑ data:[{"hash":"tzUlQiTOsvaBavGE0GHMNpn5FVUqne/23Cvboctq8pw=","path":"/"}] datumId:17f6ef2602f33a7faa8e4fe5eca9730c4851f2132fb97dd67c1d48cecf82a1c7 user:true
Nov 22 21:22:29 10.244.0.1 - - [23/Nov/2022 02:22:29] "GET /TestService_data9eabed90410e/file1 HTTP/1.1" 200 - pipelineName:↑ projectName:↑ workerId:↑ data:↑ datumId:↑ user:↑
Nov 22 21:22:30 10.244.0.1 - - [23/Nov/2022 02:22:30] "GET /TestService_data9eabed90410e/file1 HTTP/1.1" 200 - pipelineName:↑ projectName:↑ workerId:↑ data:↑ datumId:↑ user:↑
Nov 22 21:22:32 10.244.0.1 - - [23/Nov/2022 02:22:32] "GET /TestService_data9eabed90410e/file1 HTTP/1.1" 200 - pipelineName:↑ projectName:↑ workerId:↑ data:↑ datumId:↑ user:↑
Nov 22 21:22:33 10.244.0.1 - - [23/Nov/2022 02:22:33] "GET /TestService_data9eabed90410e/file1 HTTP/1.1" 200 - pipelineName:↑ projectName:↑ workerId:↑ data:↑ datumId:↑ user:↑
Nov 22 21:22:35 10.244.0.1 - - [23/Nov/2022 02:22:35] "GET /TestService_data9eabed90410e/file1 HTTP/1.1" 200 - pipelineName:↑ projectName:↑ workerId:↑ data:↑ datumId:↑ user:↑
Nov 22 21:22:40 10.244.0.1 - - [23/Nov/2022 02:22:40] "GET /TestService_data9eabed90410e/file1 HTTP/1.1" 200 - pipelineName:↑ projectName:↑ workerId:↑ data:↑ datumId:↑ user:↑
Nov 22 21:22:43 10.244.0.1 - - [23/Nov/2022 02:22:43] "GET /TestService_data9eabed90410e/file1 HTTP/1.1" 200 - pipelineName:↑ projectName:↑ workerId:↑ data:↑ datumId:↑ user:↑
Nov 22 21:22:46 10.244.0.1 - - [23/Nov/2022 02:22:46] "GET /TestService_data9eabed90410e/file1 HTTP/1.1" 200 - pipelineName:↑ projectName:↑ workerId:↑ data:↑ datumId:↑ user:↑
Nov 22 21:22:49 10.244.0.1 - - [23/Nov/2022 02:22:49] "GET /TestService_data9eabed90410e/file1 HTTP/1.1" 200 - pipelineName:↑ projectName:↑ workerId:↑ data:↑ datumId:↑ user:↑
Nov 22 21:22:53 10.244.0.1 - - [23/Nov/2022 02:22:53] "GET /TestService_data9eabed90410e/file1 HTTP/1.1" 200 - pipelineName:↑ projectName:↑ workerId:↑ data:↑ datumId:↑ user:↑
Nov 22 21:22:58 10.244.0.1 - - [23/Nov/2022 02:22:58] "GET /TestService_data9eabed90410e/file1 HTTP/1.1" 200 - pipelineName:↑ projectName:↑ workerId:↑ data:↑ datumId:↑ user:↑
Nov 22 21:23:05 10.244.0.1 - - [23/Nov/2022 02:23:05] "GET /TestService_data9eabed90410e/file1 HTTP/1.1" 200 - pipelineName:↑ projectName:↑ workerId:↑ data:↑ datumId:↑ user:↑
Nov 22 21:23:08 10.244.0.1 - - [23/Nov/2022 02:23:08] "GET /TestService_data9eabed90410e/file1 HTTP/1.1" 200 - pipelineName:↑ projectName:↑ workerId:↑ data:↑ datumId:↑ user:↑
Nov 22 21:23:13 10.244.0.1 - - [23/Nov/2022 02:23:13] "GET /TestService_data9eabed90410e/file1 HTTP/1.1" 200 - pipelineName:↑ projectName:↑ workerId:↑ data:↑ datumId:↑ user:↑
Nov 22 21:23:17 10.244.0.1 - - [23/Nov/2022 02:23:17] "GET /TestService_data9eabed90410e/file1 HTTP/1.1" 200 - pipelineName:↑ projectName:↑ workerId:↑ data:↑ datumId:↑ user:↑
Nov 22 21:23:22 10.244.0.1 - - [23/Nov/2022 02:23:22] "GET /TestService_data9eabed90410e/file1 HTTP/1.1" 200 - pipelineName:↑ projectName:↑ workerId:↑ data:↑ datumId:↑ user:↑
Nov 22 21:23:26 10.244.0.1 - - [23/Nov/2022 02:23:26] "GET /TestService_data9eabed90410e/file1 HTTP/1.1" 200 - pipelineName:↑ projectName:↑ workerId:↑ data:↑ datumId:↑ user:↑
Nov 22 21:23:30 10.244.0.1 - - [23/Nov/2022 02:23:30] "GET /TestService_data9eabed90410e/file1 HTTP/1.1" 200 - pipelineName:↑ projectName:↑ workerId:↑ data:↑ datumId:↑ user:↑

(Note how we keep serving the wrong content after "canceling previous service, new job ready"; the requests for file1 at increasing intervals are the backoff loop waiting for the content to change.)

With this change, the logs are much more in line with what we expect:

Nov 22 21:29:51 beginning to run user code master:true pipelineName:pipelineservice6c2a4335d762 projectName:default workerId:pipeline-default-pipelineservice6c2a4335d762-v1-pljn6 data:[{"hash":"tzUlQiTOsvaBavGE0GHMNpn5FVUqne/23Cvboctq8pw=","path":"/"}] datumId:f69008bc94305836b09b32cedafbbc3544a97d1d1ac08da93cef0b8b4c93dafc
Nov 22 21:29:56 10.244.0.1 - - [23/Nov/2022 02:29:56] "GET /TestService_data23806d8940d0/file1 HTTP/1.1" 200 - pipelineName:↑ projectName:↑ workerId:↑ data:↑ datumId:↑ user:true
Nov 22 21:29:56 canceling previous service, new job ready master:true pipelineName:↑ projectName:↑ workerId:↑
Nov 22 21:29:56 killing remaining children master:↑ pipelineName:↑ projectName:↑ workerId:↑ data:[{"hash":"tzUlQiTOsvaBavGE0GHMNpn5FVUqne/23Cvboctq8pw=","path":"/"}] datumId:f69008bc94305836b09b32cedafbbc3544a97d1d1ac08da93cef0b8b4c93dafc
Nov 22 21:29:56 errored running user code after 5.124852286s: signal: killed master:↑ pipelineName:↑ projectName:↑ workerId:↑ data:↑ datumId:↑
Nov 22 21:29:56 starting new service, job: 073c5aa44d0f4d959f342f16dadae57b master:↑ pipelineName:↑ projectName:↑ workerId:↑
Nov 22 21:29:57 beginning to run user code master:true pipelineName:pipelineservice6c2a4335d762 projectName:default workerId:pipeline-default-pipelineservice6c2a4335d762-v1-pljn6 data:[{"hash":"NXAbkRl+Dg9yqP1d97rKGS/hekrlWOTEN8Tqj8pi+dg=","path":"/"}] datumId:f69008bc94305836b09b32cedafbbc3544a97d1d1ac08da93cef0b8b4c93dafc
Nov 22 21:29:57 10.244.0.1 - - [23/Nov/2022 02:29:57] "GET /TestService_data23806d8940d0/file1 HTTP/1.1" 200 - pipelineName:↑ projectName:↑ workerId:↑ data:↑ datumId:↑ user:true

codecov bot commented Nov 23, 2022

Codecov Report

Merging #8385 (b5889a5) into master (16cce72) will increase coverage by 29.53%.
The diff coverage is 58.33%.

@@             Coverage Diff             @@
##           master    #8385       +/-   ##
===========================================
+ Coverage    9.37%   38.90%   +29.53%     
===========================================
  Files         357      435       +78     
  Lines      107057   120293    +13236     
===========================================
+ Hits        10036    46800    +36764     
+ Misses      95172    64153    -31019     
- Partials     1849     9340     +7491     
Impacted Files Coverage Δ
etc/worker/init.go 0.00% <0.00%> (ø)
src/server/pps/server/worker_rc.go 9.55% <0.00%> (+9.11%) ⬆️
src/server/worker/driver/driver.go 37.15% <35.71%> (+37.15%) ⬆️
src/server/worker/driver/ps_linux.go 63.63% <63.63%> (ø)
src/server/worker/driver/block_linux.go 71.42% <71.42%> (ø)
src/server/worker/driver/driver_unix.go 42.16% <72.22%> (+42.16%) ⬆️
src/internal/task/cmds/cmds.go 48.48% <0.00%> (ø)
src/server/pfs/cmds/cmds.go 32.63% <0.00%> (ø)
src/server/pps/pretty/pretty.go 31.26% <0.00%> (ø)
... and 349 more


@jrockway jrockway marked this pull request as ready for review November 23, 2022 03:55
@jrockway jrockway requested review from molinamelendezj and removed request for a team, brycemcanally and molinamelendezj November 23, 2022 03:55
@msteffen (Contributor) commented:

Fix #2544

@jrockway jrockway merged commit 832c15f into master Nov 30, 2022