
Manager can send SIGQUIT to worker while exiting, causing coredump #2046

Open

brsakai-csco opened this issue Mar 9, 2023 · 3 comments

@brsakai-csco (Contributor)
  • Mojolicious version: 9.22, Waffle
  • Perl version: v5.34.0
  • Operating system: Ubuntu 22.04.2 LTS

Steps to reproduce the behavior

  1. Create and start simple Hypnotoad app
  2. Send a SIGQUIT to a worker process to end it gracefully
  3. The worker's manager will send a second SIGQUIT
  4. Depending on timing, the worker will reset its signal handlers to SIG_DFL before catching the second SIGQUIT, causing it to dump its core

In more detail:

Unpack the attached tarfile, which contains my app, and update mojo_server_wrapper.pl with the correct $ENV{REPRO_PATH}. (You could most likely use any Hypnotoad/Prefork app, but I figured a working example would be best.)

hypno.tar.gz

Start the app from terminal (A)

brsakai@brsakai-1151:~/hypno$ ./mojo_server_wrapper.pl
[2023-03-09 19:03:32.09625] [29198] [info] Listening at "http://127.0.0.1:8080"
Web application available at http://127.0.0.1:8080
[2023-03-09 19:03:32.09659] [29198] [info] Manager 29198 started
[2023-03-09 19:03:32.09910] [29199] [info] Worker 29199 started
[2023-03-09 19:03:32.10025] [29200] [info] Worker 29200 started
[2023-03-09 19:03:32.10118] [29201] [info] Worker 29201 started
[2023-03-09 19:03:32.10202] [29202] [info] Worker 29202 started
[2023-03-09 19:03:32.10239] [29198] [info] Creating process id file "/tmp/mojo_server.pid"
[2023-03-09 19:03:32.10248] [29203] [info] Worker 29203 started

strace one of the workers from another terminal (B)

brsakai@brsakai-1151:~$ sudo strace -p 29199
strace: Process 29199 attached
epoll_wait(8,

Issue a kill -QUIT from a third terminal (C)

brsakai@brsakai-1151:~$ kill -QUIT 29199

Observe that the kill signal is received in the strace output (note that two SIGQUITs are received: one from terminal (C) and one from the manager)

brsakai@brsakai-1151:~$ sudo strace -p 29199
strace: Process 29199 attached
epoll_wait(8, 0x561e175cd970, 64, 3408) = -1 EINTR (Interrupted system call)
--- SIGQUIT {si_signo=SIGQUIT, si_code=SI_USER, si_pid=29085, si_uid=1000} --- <<< We catch the SIGQUIT from terminal (C) and start to exit
rt_sigreturn({mask=[]})                 = -1 EINTR (Interrupted system call)
epoll_wait(8, [], 64, 904)              = 0
rt_sigprocmask(SIG_BLOCK, [QUIT], [], 8) = 0
rt_sigprocmask(SIG_UNBLOCK, [QUIT], NULL, 8) = 0
getpid()                                = 29199
write(5, "29199:1\n", 8)                = 8
rt_sigprocmask(SIG_BLOCK, [TTOU], [], 8) = 0
rt_sigaction(SIGTTOU, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7f87dc42e520}, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7f87dc42e520}, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_BLOCK, [TTIN], [], 8) = 0
rt_sigaction(SIGTTIN, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7f87dc42e520}, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7f87dc42e520}, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_BLOCK, [QUIT], [], 8) = 0
rt_sigaction(SIGQUIT, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7f87dc42e520}, {sa_handler=0x561e14f54450, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7f87dc42e520}, 8) = 0 <<< SIGQUIT handler is uninstalled
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
--- SIGQUIT {si_signo=SIGQUIT, si_code=SI_USER, si_pid=29198, si_uid=1000} --- <<< We catch the SIGQUIT from the manager, and coredump
+++ killed by SIGQUIT (core dumped) +++
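The race visible in the strace above can be demonstrated in miniature. The following sketch (Python for brevity; this is a generic illustration of the signal semantics, not Mojolicious code) shows that a process whose SIGQUIT handler has been reset to SIG_DFL is killed by a second SIGQUIT, while one with a handler installed survives:

```python
import os
import signal

def run_child(reset_handler):
    """Fork a child that receives SIGQUIT; return its wait status."""
    pid = os.fork()
    if pid == 0:
        # Install a no-op handler, as the worker does for graceful shutdown
        signal.signal(signal.SIGQUIT, lambda signum, frame: None)
        if reset_handler:
            # Simulate Perl deregistering its handlers during exit()
            signal.signal(signal.SIGQUIT, signal.SIG_DFL)
        os.kill(os.getpid(), signal.SIGQUIT)  # the manager's second SIGQUIT
        os._exit(0)
    _, status = os.waitpid(pid, 0)
    return status

handled = run_child(reset_handler=False)
defaulted = run_child(reset_handler=True)

print(os.WIFEXITED(handled))      # True: the handler absorbed the signal
print(os.WIFSIGNALED(defaulted))  # True: SIG_DFL killed the process
```

With the handler installed the child exits normally; after the reset the same signal terminates it (dumping core where core dumps are enabled), which is exactly the window the worker hits during exit.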

The manager will print status information about the killed worker

brsakai@brsakai-1151:~/hypno$ PERL5LIB=/home/brsakai/hypno ./mojo_server_wrapper.pl
[2023-03-09 19:03:32.09625] [29198] [info] Listening at "http://127.0.0.1:8080"
Web application available at http://127.0.0.1:8080
[2023-03-09 19:03:32.09659] [29198] [info] Manager 29198 started
[2023-03-09 19:03:32.09910] [29199] [info] Worker 29199 started
[2023-03-09 19:03:32.10025] [29200] [info] Worker 29200 started
[2023-03-09 19:03:32.10118] [29201] [info] Worker 29201 started
[2023-03-09 19:03:32.10202] [29202] [info] Worker 29202 started
[2023-03-09 19:03:32.10239] [29198] [info] Creating process id file "/tmp/mojo_server.pid"
[2023-03-09 19:03:32.10248] [29203] [info] Worker 29203 started
[2023-03-09 19:04:17.10105] [29198] [info] Stopping worker 29199 gracefully (30 seconds)
[2023-03-09 19:04:17.10365] [29210] [info] Worker 29210 started
[2023-03-09 19:04:17.21945] [29198] [info] Worker 29199 stopped

Expected behavior

The app should not generate a coredump when it is sent a graceful shutdown signal

Actual behavior

The second SIGQUIT from the manager process is caught by the default handler and generates core dumps, filling disk space on my server

Open comments

Issue first seen on Supervillain in our product

# mojo version
CORE
  Perl        (v5.32.1, linux)
  Mojolicious (8.09, Supervillain)

OPTIONAL
  Cpanel::JSON::XS 4.04+  (4.08)
  EV 4.0+                 (4.17)
  IO::Socket::Socks 0.64+ (0.74)
  IO::Socket::SSL 2.009+  (2.067)
  Net::DNS::Native 0.15+  (0.18)
  Role::Tiny 2.000001+    (2.000006)

You might want to update your Mojolicious to 9.31!

Reproduced on Ubuntu in a fresh install

brsakai@brsakai-1151:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 22.04.2 LTS
Release:        22.04
Codename:       jammy
brsakai@brsakai-1151:~$ # Installed via sudo apt install libmojolicious-perl
brsakai@brsakai-1151:~$ mojo version
CORE
  Perl        (v5.34.0, linux)
  Mojolicious (9.22, Waffle)

OPTIONAL
  Cpanel::JSON::XS 4.09+   (4.27)
  EV 4.32+                 (4.33)
  IO::Socket::Socks 0.64+  (0.74)
  IO::Socket::SSL 2.009+   (2.074)
  Net::DNS::Native 0.15+   (n/a)
  Role::Tiny 2.000001+     (2.002004)
  Future::AsyncAwait 0.52+ (0.56)

You might want to update your Mojolicious to 9.31!

I suspect that this is due to Perl deregistering all signal handlers when exit is called at the end of Mojo::Server::Prefork::_spawn, so the second SIGQUIT is caught by the default handler (core dump) instead of the Perl handler (graceful stop). This is a little racy: the worker can sometimes receive the manager's SIGQUIT before uninstalling the signal handler, so you might need to try the repro a couple of times.

Possible suggestions:

  1. The manager could avoid sending a second SIGQUIT (not sure why this is triggered)
  2. We could switch the "graceful shutdown" signal to SIGTERM, or something else that does not create core dumps when the signal handlers are deregistered at exit
  3. Add some sort of delay in _spawn so any late signals are caught before the signal handlers are uninstalled during exit
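In the spirit of suggestions (2) and (3), the goal is to make a late signal harmless rather than fatal. One way to see the effect (sketched in Python for brevity; an actual fix would live in the Perl code) is that switching the handler to SIG_IGN before exit, instead of letting it fall back to SIG_DFL, discards a late SIGQUIT:

```python
import os
import signal

# If the exiting process ignores SIGQUIT (rather than reverting to
# SIG_DFL), a late SIGQUIT from the manager is discarded and the
# process exits cleanly with no core dump.
pid = os.fork()
if pid == 0:
    signal.signal(signal.SIGQUIT, signal.SIG_IGN)  # instead of SIG_DFL
    os.kill(os.getpid(), signal.SIGQUIT)           # manager's late SIGQUIT
    os._exit(0)                                    # clean exit
_, status = os.waitpid(pid, 0)
print(os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0)  # True
```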
@brsakai-csco (Contributor, Author)

I was able to prevent this behavior with the following patch:

--- Server/Prefork.pm.bak       2023-03-21 16:51:31.027344078 +0000
+++ Server/Prefork.pm   2023-03-21 16:54:24.053345048 +0000
@@ -194,6 +194,7 @@
     next unless my $w = $self->{pool}{$1};
     @$w{qw(healthy time)} = (1, $time) and $self->emit(heartbeat => $1);
     $w->{graceful} ||= $time if $2;
+    $w->{quit}++ if $2;
   }
 }

As I understand it, the control flow here is:

  1. Mojo server creates a pipe in Mojo::Server::Prefork::run, which is used for heartbeats between the manager and worker
  2. Mojo::Server::Prefork::_spawn sets the signal handler for SIGQUIT to $self->ioloop->stop_gracefully
  3. Upon receipt of SIGQUIT, the worker emits the finish event from Mojo::IOLoop::stop_gracefully
  4. Upon receipt of the finish event (from itself? Not clear), the worker sets $finished to 1 in Mojo::Server::Prefork::_spawn
  5. The worker sends a heartbeat from Mojo::Server::Prefork::_heartbeat with $finished set to 1
  6. The manager receives the heartbeat message and sets $self->{pool}{$pid}->{graceful} equal to the worker's $finished in Mojo::Server::Prefork::_wait
  7. The manager sends a SIGQUIT to the worker
  8. The manager sets $self->{pool}{$pid}->{quit} to nonzero (preventing further SIGQUITs)

This patch "combines" steps 7 and 8, so that $self->{pool}{$pid}->{quit} is updated as soon as the manager is notified of the graceful shutdown. This builds in the following assumptions:

  1. There is no scenario in which the worker reports a graceful shutdown via its heartbeat but still needs a SIGQUIT to actually shut down
  2. The {quit} field isn't used for anything else

Assumption (2) appears to be fine, but I'm not sure about (1); I would appreciate some input from you folks on that point.

This corresponds to (1) in the Possible Suggestions from the original comment
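The manager-side heartbeat handling that the patch modifies can be sketched as follows (Python for brevity; the pool field names mirror the diff above, and the `<pid>:<graceful>\n` message format is the one visible in the strace as `write(5, "29199:1\n", 8)` — this is an illustration, not the actual Mojo::Server::Prefork internals):

```python
import re

# Worker pool keyed by PID; a nonzero {quit} means "SIGQUIT already
# sent", which is what stops the manager from sending a second one.
pool = {29199: {"healthy": 0, "time": 0, "graceful": 0, "quit": 0}}

def on_heartbeat(chunk, now):
    # Heartbeat messages look like "29199:1\n" (pid, graceful flag)
    for pid, graceful in re.findall(r"(\d+):(\d)\n", chunk):
        w = pool.get(int(pid))
        if w is None:
            continue
        w["healthy"], w["time"] = 1, now
        if graceful == "1":
            w["graceful"] = w["graceful"] or now
            w["quit"] += 1  # the proposed patch: mark SIGQUIT as already sent

on_heartbeat("29199:1\n", now=1000)
print(pool[29199]["graceful"], pool[29199]["quit"])  # 1000 1
```

Because `{quit}` is set the moment the graceful heartbeat arrives, the manager never reaches the point of delivering a redundant SIGQUIT to a worker that is already shutting down.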

@jixam (Contributor)

jixam commented Jun 7, 2023

Also see #1449 (with a proposed fix in #1883).

@kraih (Member)

kraih commented Jun 9, 2023

This does indeed seem like the correct solution for the problem from #1883. Please make this a PR.
