"Booting worker" is looping infinitely despite no exit signals #1663

bensalilijames · 2017-12-11T15:33:41Z

I'm trying to get gunicorn set up on Docker. It works well locally, and the production image is exactly the same as the local image, but I'm getting this strange behaviour on the production Docker engine:

ml-server_1     | [2017-12-11 13:18:50 +0000] [1] [INFO] Starting gunicorn 19.7.1
ml-server_1     | [2017-12-11 13:18:50 +0000] [1] [DEBUG] Arbiter booted
ml-server_1     | [2017-12-11 13:18:50 +0000] [1] [INFO] Listening at: http://0.0.0.0:80 (1)
ml-server_1     | [2017-12-11 13:18:50 +0000] [1] [INFO] Using worker: sync
ml-server_1     | [2017-12-11 13:18:50 +0000] [8] [INFO] Booting worker with pid: 8
ml-server_1     | [2017-12-11 13:18:50 +0000] [1] [DEBUG] 1 workers
ml-server_1     | Using TensorFlow backend.
ml-server_1     | [2017-12-11 13:18:54 +0000] [11] [INFO] Booting worker with pid: 11
ml-server_1     | Using TensorFlow backend.
ml-server_1     | [2017-12-11 13:18:58 +0000] [14] [INFO] Booting worker with pid: 14
ml-server_1     | Using TensorFlow backend.
ml-server_1     | [2017-12-11 13:19:02 +0000] [17] [INFO] Booting worker with pid: 17
ml-server_1     | Using TensorFlow backend.

It looks like gunicorn is booting workers every 4-5 seconds, despite no apparent error messages or exit signals. This behaviour continues indefinitely until terminated.

Is it possible that a worker can exit without logging anything to stderr/stdout, or for the arbiter to spawn workers infinitely?

Since they are the same docker image, they are running exactly the same code, on exactly the same architecture, so I'm really confused what this could be (bug?). Any help greatly appreciated!

The text was updated successfully, but these errors were encountered:

bensalilijames · 2017-12-11T16:23:54Z

ssh-ing into the Docker container led me to find this error:

Illegal instruction (core dumped)

Perhaps gunicorn should surface errors like this instead of swallowing them, or handle them differently? Not sure, just thought I'd raise this since it might help someone else!

tilgovi · 2017-12-11T18:32:39Z

Thanks for reporting the issue!

If you can figure out where this happens, that would be very helpful.

Perhaps we can add logging when workers exit. Usually, the worker itself logs, but if it's killed very abruptly it won't.

bensalilijames · 2017-12-11T18:35:06Z

No worries!

There seems to be a problem with Spacy which I've just added to this thread: explosion/spaCy#1589

Anyway, it's causing a SIGILL as strace confirms:

--- SIGILL {si_signo=SIGILL, si_code=ILL_ILLOPN, si_addr=0x7ff48bbe6cea} ---
+++ killed by SIGILL (core dumped) +++
Illegal instruction (core dumped)

I suppose it would be nice if gunicorn could identify this and log an error rather than phantomly restarting the worker, but tbh I know very little about how exit codes work!

tilgovi · 2017-12-11T18:43:25Z

Some exit codes definitely have special meanings and we could probably log those.
http://tldp.org/LDP/abs/html/exitcodes.html

bensalilijames · 2017-12-11T19:04:35Z

Sounds good! Additionally if the exit code isn't a reserved exit code (such as this case), it would be cool if that could be logged (without an explanation) so it's apparent that the worker is indeed terminating 🙂

zetaab · 2017-12-28T17:54:08Z

I have kind of similar issue, gunicorn is booting new worker always when I make http request. I do not get any response back, it just reboots new worker always. Strace log from two http requests:

select(4, [3], [], [], {1, 0})          = 0 (Timeout)
fstat(6, {st_mode=S_IFREG|01, st_size=0, ...}) = 0
fstat(7, {st_mode=S_IFREG|01, st_size=0, ...}) = 0
fstat(9, {st_mode=S_IFREG, st_size=0, ...}) = 0
fstat(8, {st_mode=S_IFREG|01, st_size=0, ...}) = 0
select(4, [3], [], [], {1, 0})          = ? ERESTARTNOHAND (To be restarted if no handler)
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_KILLED, si_pid=510, si_uid=0, si_status=SIGSEGV, si_utime=160, si_stime=32} ---
getpid()                                = 495
rt_sigreturn({mask=[]})                 = -1 EINTR (Interrupted system call)
wait4(-1, [{WIFSIGNALED(s) && WTERMSIG(s) == SIGSEGV && WCOREDUMP(s)}], WNOHANG, NULL) = 510
lseek(8, 0, SEEK_CUR)                   = 0
close(8)                                = 0
wait4(-1, 0x7ffd455ad844, WNOHANG, NULL) = 0
write(4, ".", 1)                        = 1
select(4, [3], [], [], {0, 840340})     = 1 (in [3], left {0, 840338})
read(3, ".", 1)                         = 1
read(3, 0x7f2682025fa0, 1)              = -1 EAGAIN (Resource temporarily unavailable)
fstat(6, {st_mode=S_IFREG|01, st_size=0, ...}) = 0
fstat(7, {st_mode=S_IFREG|01, st_size=0, ...}) = 0
fstat(9, {st_mode=S_IFREG, st_size=0, ...}) = 0
umask(0)                                = 022
getpid()                                = 495
open("/tmp/wgunicorn-q4aa72u7", O_RDWR|O_CREAT|O_EXCL|O_NOFOLLOW|O_CLOEXEC, 0600) = 8
fcntl(8, F_SETFD, FD_CLOEXEC)           = 0
chown("/tmp/wgunicorn-q4aa72u7", 0, 0)  = 0
umask(022)                              = 0
unlink("/tmp/wgunicorn-q4aa72u7")       = 0
fstat(8, {st_mode=S_IFREG|0600, st_size=0, ...}) = 0
ioctl(8, TIOCGWINSZ, 0x7ffd455b8e50)    = -1 ENOTTY (Not a tty)
lseek(8, 0, SEEK_CUR)                   = 0
lseek(8, 0, SEEK_CUR)                   = 0
rt_sigprocmask(SIG_BLOCK, ~[], [], 8)   = 0
fork()                                  = 558
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
select(0, NULL, NULL, NULL, {0, 37381}[2017-12-28 17:50:23 +0000] [558] [INFO] Booting worker with pid: 558
) = 0 (Timeout)
select(4, [3], [], [], {1, 0}loading test-eu-ovh settings
)          = 0 (Timeout)
fstat(6, {st_mode=S_IFREG, st_size=0, ...}) = 0
fstat(7, {st_mode=S_IFREG, st_size=0, ...}) = 0
fstat(9, {st_mode=S_IFREG|01, st_size=0, ...}) = 0
fstat(8, {st_mode=S_IFREG|0600, st_size=0, ...}) = 0
select(4, [3], [], [], {1, 0}
)          = 0 (Timeout)
fstat(6, {st_mode=S_IFREG, st_size=0, ...}) = 0
fstat(7, {st_mode=S_IFREG, st_size=0, ...}) = 0
fstat(9, {st_mode=S_IFREG|01, st_size=0, ...}) = 0
fstat(8, {st_mode=S_IFREG|01, st_size=0, ...}) = 0
select(4, [3], [], [], {1, 0})          = 0 (Timeout)
fstat(6, {st_mode=S_IFREG|01, st_size=0, ...}) = 0
fstat(7, {st_mode=S_IFREG|01, st_size=0, ...}) = 0
fstat(9, {st_mode=S_IFREG, st_size=0, ...}) = 0
fstat(8, {st_mode=S_IFREG|01, st_size=0, ...}) = 0
select(4, [3], [], [], {1, 0})          = 0 (Timeout)
fstat(6, {st_mode=S_IFREG, st_size=0, ...}) = 0
fstat(7, {st_mode=S_IFREG|01, st_size=0, ...}) = 0
fstat(9, {st_mode=S_IFREG|01, st_size=0, ...}) = 0
fstat(8, {st_mode=S_IFREG, st_size=0, ...}) = 0
select(4, [3], [], [], {1, 0})          = 0 (Timeout)
fstat(6, {st_mode=S_IFREG, st_size=0, ...}) = 0
fstat(7, {st_mode=S_IFREG, st_size=0, ...}) = 0
fstat(9, {st_mode=S_IFREG|01, st_size=0, ...}) = 0
fstat(8, {st_mode=S_IFREG, st_size=0, ...}) = 0
select(4, [3], [], [], {1, 0})          = 0 (Timeout)
fstat(6, {st_mode=S_IFREG|01, st_size=0, ...}) = 0
fstat(7, {st_mode=S_IFREG, st_size=0, ...}) = 0
fstat(9, {st_mode=S_IFREG, st_size=0, ...}) = 0
fstat(8, {st_mode=S_IFREG|01, st_size=0, ...}) = 0
select(4, [3], [], [], {1, 0})          = 0 (Timeout)
fstat(6, {st_mode=S_IFREG|01, st_size=0, ...}) = 0
fstat(7, {st_mode=S_IFREG|01, st_size=0, ...}) = 0
fstat(9, {st_mode=S_IFREG, st_size=0, ...}) = 0
fstat(8, {st_mode=S_IFREG, st_size=0, ...}) = 0
select(4, [3], [], [], {1, 0})          = ? ERESTARTNOHAND (To be restarted if no handler)
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_KILLED, si_pid=499, si_uid=0, si_status=SIGSEGV, si_utime=160, si_stime=31} ---
getpid()                                = 495
rt_sigreturn({mask=[]})                 = -1 EINTR (Interrupted system call)
wait4(-1, [{WIFSIGNALED(s) && WTERMSIG(s) == SIGSEGV && WCOREDUMP(s)}], WNOHANG, NULL) = 499
lseek(7, 0, SEEK_CUR)                   = 0
close(7)                                = 0
wait4(-1, 0x7ffd455ad844, WNOHANG, NULL) = 0
write(4, ".", 1)                        = 1
select(4, [3], [], [], {0, 450691})     = 1 (in [3], left {0, 450689})
read(3, ".", 1)                         = 1
read(3, 0x7f2682067de8, 1)              = -1 EAGAIN (Resource temporarily unavailable)
fstat(6, {st_mode=S_IFREG, st_size=0, ...}) = 0
fstat(9, {st_mode=S_IFREG|01, st_size=0, ...}) = 0
fstat(8, {st_mode=S_IFREG, st_size=0, ...}) = 0
umask(0)                                = 022
getpid()                                = 495
open("/tmp/wgunicorn-5x9a40ca", O_RDWR|O_CREAT|O_EXCL|O_NOFOLLOW|O_CLOEXEC, 0600) = 7
fcntl(7, F_SETFD, FD_CLOEXEC)           = 0
chown("/tmp/wgunicorn-5x9a40ca", 0, 0)  = 0
umask(022)                              = 0
unlink("/tmp/wgunicorn-5x9a40ca")       = 0
fstat(7, {st_mode=S_IFREG|0600, st_size=0, ...}) = 0
ioctl(7, TIOCGWINSZ, 0x7ffd455b8e50)    = -1 ENOTTY (Not a tty)
lseek(7, 0, SEEK_CUR)                   = 0
lseek(7, 0, SEEK_CUR)                   = 0
rt_sigprocmask(SIG_BLOCK, ~[], [], 8)   = 0
fork()                                  = 579
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
select(0, NULL, NULL, NULL, {0, 8144}[2017-12-28 17:50:30 +0000] [579] [INFO] Booting worker with pid: 579
)  = 0 (Timeout)
select(4, [3], [], [], {1, 0})          = 0 (Timeout)
fstat(6, {st_mode=S_IFREG, st_size=0, ...}) = 0
fstat(7, {st_mode=S_IFREG|0600, st_size=0, ...}) = 0
fstat(9, {st_mode=S_IFREG|01, st_size=0, ...}) = 0
fstat(8, {st_mode=S_IFREG|01, st_size=0, ...}) = 0

sara-02 · 2018-07-05T11:09:38Z

I am facing the same issue, gunicorn is booting repeated within seconds for sync worker type. Setting worker timeout to 900 is not helping.

In my load before action, I am downloading data from AWS S3. It takes approx 1min 10 sec to download the various files.

benoitc · 2018-07-05T12:04:03Z

@sara-02 what is your command line to launch gunicorn ?

sara-02 · 2018-07-05T12:07:06Z

@benoitc gunicorn --pythonpath /src -b 0.0.0.0:$SERVICE_PORT --workers=1 -k sync -t $SERVICE_TIMEOUT flask_endpoint:app
Present here

benoitc · 2018-07-05T12:22:37Z

@sara-02 Thanks.

Are old workers really exiting or they are kept online and new workers are spawned? What the debug log shows also?

sara-02 · 2018-07-05T12:42:05Z

The logs are mixed with botocore logs, but it is something like this

[INFO] Booting worker with pid:  a
[INFO] Booting worker with pid:  b
[INFO] Booting worker with pid:  c

benoitc · 2018-07-05T13:34:00Z

but are the worker killed ? what return the command ps ax|grep gunicorn ?

sara-02 · 2018-07-05T13:46:16Z

@benoitc

sara-02 · 2018-07-05T13:52:09Z

One question though, why do we see 2 gunicorn processes, when worker limit is set to 1? Is one master, and one worker?

benoitc · 2018-07-06T03:53:57Z

There is 1 arbiter process (master) and N workers processes ye :)

So you run the command each time a worker booted right ? If so it seems that the older worker is killed a new one is spawned. I will investigate.

benoitc · 2018-07-06T03:55:30Z

@sara-02 one last thing, this also happening in docker?

sara-02 · 2018-07-06T09:11:18Z

@benoitc on docker-compose it is working as expected, but when I am putting the same code on Openshift, I see this error. Increasing the memory requirement did fix, but when I run the application via docker-compose it is using less than limited memory.

sara-02 · 2018-08-07T13:43:55Z

Just an update, the issue for me was actually a memory error and was fixed when the memory issue was fixed.

gulshan-gaurav · 2018-10-08T19:57:56Z

@benoitc
I am facing the same Issue while trying to spawn 5 gunicorn worker in docker.
@sara-02
How did you identify the cause to be a memory error?

sara-02 · 2018-10-09T03:35:21Z

@gulshan-gaurav 2 things helped me:
I increased the memory assigned to my Pod and stopped crashing. Second, we checked our Openshift Zabbix logs.

gulshan-gaurav · 2018-10-09T08:24:54Z

@sara-02
Even on my staging Pod the files + models that I am loading in memory amounts to 50Mb so 2GB of memory should be sufficient for 5 workers.

benoitc · 2018-10-09T09:55:09Z

@gulshan-gaurav which issue are you facing? Having 5 process there looks good....

emilwallner · 2018-10-29T13:57:19Z

I had the same issue. I didn't locate the exact problem, but it was solved once I upgraded from python 3.5 to 3.6.

m3h0w · 2019-01-11T16:24:54Z

I am facing the same issue in a Docker container. Gunicorn keeps botting a new worker every time I call an endpoint that causes the failure but not exception or error is outputted into the Gunicorn's log files. Things that I choose to print are logged and then suddenly the log file just says "Booting worker with pid..."

One step that helped was to add env variable PYTHONUNBUFFERED. Before that, even the print statements would disappear and would not be saved in Gunicorn's logs.

2 other endpoints of the app work correctly.

I run Gunicorn with: gunicorn run:app -b localhost:5000 --enable-stdio-inheritance --error-logfile /var/log/gunicorn/error.log --access-logfile /var/log/gunicorn/access.log --capture-output --log-level debug

Already running Python 3.6 and checked with top that the memory doesn't seem to be an issue.

EDIT: It looks like it was a Python issue and not a Gunicorn's fault. Some version discrepancies were causing Python to just die without any trace while performing a certain operation.

sumbb · 2019-01-29T11:51:33Z

I am facing similar issue where worker node keep coming up with
Booting worker with pid: 17636. i don't know if it is killing the previous worker node or previous worker node still exists. But the number of workers mentioned in gunicorn command line arguments is only 3 - -workers=3 . Also I am using python version 3.7

I had a mismatch of scikit-learn dependency, but even after resolving that, i am getting still same infinite workers coming up. what kind of python version discrepancies should i look for and how to identify them?

benoitc · 2019-09-16T12:56:06Z

can someone provides a way to reproduce the issue?

aleSuglia · 2019-09-16T14:09:10Z

It's a manager of several components that are executed in a pipeline. Some of them might start HTTP requests to other components on the same machine or on remote machines. Some of the modules of the pipeline can be executed in parallel but they're executed using a ThreadPoolExecutor. They don't use any shared objects but they only generate data structures that are later on aggregated in a single resulting one.

Unfortunately I'm not sure if I can put together a minimal example without exposing the system we have.

benoitc · 2019-09-16T14:16:54Z

requests does a lot of unsafe things with threads which sometimes fork a new process. I would advise to use another client. can you paste at least the lines you'reusing to do a request? Are you using its timeout feature?

aleSuglia · 2019-09-16T14:20:26Z

One of them could be:

try:
     resp = requests.post(self._endpoint, json=request_data)

     if resp.status_code != 200:
          logger.critical("[Error]: status code is {}".format(resp.status_code))
          return None

     response = resp.json()
     return {"intent": response["intent"], "intent_ranking": response["intent_ranking"]}
except ConnectionError as exc:
     logger.critical("[Exception] {}".format(str(exc)))
     return None

benoitc · 2019-09-17T09:02:15Z

Thanks. I will try to create a simple from it.

It would be cool anyway if someone can send us a pr that reproduce the behaviour either as an example or a unit test so we make sure we are actually fixing the right thing.

g-bon · 2019-10-02T22:33:52Z

Not sure if it can help someone but I had the same issue while running a dockerized flask webapp and solved it updating the base image of my dockerfile to python:3.6.9-alpine

Dmesg on the host showed a segfault on lilibpython3.6m.so.1.0:

[626278.653010] gunicorn[19965]: segfault at 70 ip 00007f6423e7faee sp 00007ffc4e9a2a38 error 4 in libpython3.6m.so.1.0[7f6423d8a000+194000]

My docker image was based on python:3.6-alpine and doing an apk update which was updating python to to 3.6.8.

As said changing the base image to python:3.6.9-alpine solved it for me

Ogala · 2019-11-22T01:10:51Z

I faced the same challenge running a Flask + Docker + Kubernetes. Increasing the CPU and Memory limits solved it for me.

ramnes · 2019-12-18T14:59:33Z

The same thing happened to us. Increasing resource limits fixed the problem.

minderov · 2020-02-08T05:38:16Z

This suddenly happened to me on macOS Catalina (not containerized).

What helped me is:

Installing openssl:

brew install openssl

Running and adding this to my ~/.zshrc:

export DYLD_LIBRARY_PATH=/usr/local/opt/openssl/lib:$DYLD_LIBRARY_PATH

Source: https://stackoverflow.com/a/58445755/5811984

BrightNana · 2020-02-24T08:49:53Z

Am having similar challenge and would be grateful if someone can help me out.
This is what I had;

"root@ubuntu-s-1vcpu-1gb-nyc1-01:~# sudo systemctl status gunicorn.service ● gunicorn.service - gunicorn daemon Loaded: loaded (/etc/systemd/system/gunicorn.service; disabled; vendor preset: enabled) Active: active (running) since Mon 2020-02-24 07:48:04 UTC; 44min ago Main PID: 4846 (gunicorn) Tasks: 4 (limit: 1151) CGroup: /system.slice/gunicorn.service ├─4846 /home/bright/djangoprojectdir/djangoprojectenv/bin/python /home/bright/djangoprojectdir/djangoprojectenv/bin/gunicorn - ├─4866 /home/bright/djangoprojectdir/djangoprojectenv/bin/python /home/bright/djangoprojectdir/djangoprojectenv/bin/gunicorn - ├─4868 /home/bright/djangoprojectdir/djangoprojectenv/bin/python /home/bright/djangoprojectdir/djangoprojectenv/bin/gunicorn - └─4869 /home/bright/djangoprojectdir/djangoprojectenv/bin/python /home/bright/djangoprojectdir/djangoprojectenv/bin/gunicorn - Feb 24 07:48:04 ubuntu-s-1vcpu-1gb-nyc1-01 systemd[1]: Stopped gunicorn daemon. Feb 24 07:48:04 ubuntu-s-1vcpu-1gb-nyc1-01 systemd[1]: Started gunicorn daemon. Feb 24 07:48:05 ubuntu-s-1vcpu-1gb-nyc1-01 gunicorn[4846]: [2020-02-24 07:48:05 +0000] [4846] [INFO] Starting gunicorn 20.0.4 Feb 24 07:48:05 ubuntu-s-1vcpu-1gb-nyc1-01 gunicorn[4846]: [2020-02-24 07:48:05 +0000] [4846] [INFO] Listening at: unix:/run/gunicorn.soc Feb 24 07:48:05 ubuntu-s-1vcpu-1gb-nyc1-01 gunicorn[4846]: [2020-02-24 07:48:05 +0000] [4846] [INFO] Using worker: sync Feb 24 07:48:05 ubuntu-s-1vcpu-1gb-nyc1-01 gunicorn[4846]: [2020-02-24 07:48:05 +0000] [4866] [INFO] Booting worker with pid: 4866 Feb 24 07:48:05 ubuntu-s-1vcpu-1gb-nyc1-01 gunicorn[4846]: [2020-02-24 07:48:05 +0000] [4868] [INFO] Booting worker with pid: 4868 Feb 24 07:48:05 ubuntu-s-1vcpu-1gb-nyc1-01 gunicorn[4846]: [2020-02-24 07:48:05 +0000] [4869] [INFO] Booting worker with pid: 4869 Feb 24 08:03:41 ubuntu-s-1vcpu-1gb-nyc1-01 gunicorn[4846]: - - [24/Feb/2020:08:03:41 +0000] "GET / HTTP/1.0" 400 26 "-" "Mozilla/5.0 (Wi lines 1-20/20 (END)" Can anyone please help me fix that?

g-bon · 2020-02-24T12:26:13Z

@BrightNana can you try to give a dmesg and see if you have any gunicorn errors?
dmesg | grep gunicorn could help filter out the other errors

GUlbricht · 2020-03-12T07:50:37Z

Hello,
i have the same bug in debian 9 when i want to provide gunicorn as a systemd service. If I start it from the CLI, gunicorn runs without errors.

Extract from dmesg | grep gunicorn:
[ 6379.465022] gunicorn[28100]: segfault at b0 ip 00007f06656e4c7c sp 00007ffcc060b030 error 4 in ld-2.24.so[7f06656d9000+23000] [ 6379.686421] gunicorn[28103]: segfault at b0 ip 00007f06656e4c7c sp 00007ffcc060b030 error 4 in ld-2.24.so[7f06656d9000+23000] [ 6379.885803] gunicorn[28106]: segfault at b0 ip 00007f06656e4c7c sp 00007ffcc060b030 error 4 in ld-2.24.so[7f06656d9000+23000] [ 6380.079670] gunicorn[28109]: segfault at b0 ip 00007f06656e4c7c sp 00007ffcc060b030 error 4 in ld-2.24.so[7f06656d9000+23000] [ 6380.302551] gunicorn[28112]: segfault at b0 ip 00007f06656e4c7c sp 00007ffcc060b030 error 4 in ld-2.24.so[7f06656d9000+23000] [ 6380.525240] gunicorn[28115]: segfault at b0 ip 00007f06656e4c7c sp 00007ffcc060b030 error 4 in ld-2.24.so[7f06656d9000+23000] [ 6380.730884] gunicorn[28118]: segfault at b0 ip 00007f06656e4c7c sp 00007ffcc060b030 error 4 in ld-2.24.so[7f06656d9000+23000] [ 6380.898665] gunicorn[28121]: segfault at b0 ip 00007f06656e4c7c sp 00007ffcc060b030 error 4 in ld-2.24.so[7f06656d9000+23000]

Excerpt from journalctl:
Mär 12 07:01:06 build-server gunicorn[828]: [2020-03-12 07:01:06 +0100] [1054] [INFO] Booting worker with pid: 1054 Mär 12 07:01:06 build-server gunicorn[828]: [2020-03-12 07:01:06 +0100] [1057] [INFO] Booting worker with pid: 1057 Mär 12 07:01:06 build-server gunicorn[828]: [2020-03-12 07:01:06 +0100] [1060] [INFO] Booting worker with pid: 1060 Mär 12 07:01:07 build-server gunicorn[828]: [2020-03-12 07:01:07 +0100] [1064] [INFO] Booting worker with pid: 1064 Mär 12 07:01:07 build-server gunicorn[828]: [2020-03-12 07:01:07 +0100] [1067] [INFO] Booting worker with pid: 1067 Mär 12 07:01:07 build-server gunicorn[828]: [2020-03-12 07:01:07 +0100] [1070] [INFO] Booting worker with pid: 1070 Mär 12 07:01:07 build-server gunicorn[828]: [2020-03-12 07:01:07 +0100] [1073] [INFO] Booting worker with pid: 1073 Mär 12 07:01:07 build-server gunicorn[828]: [2020-03-12 07:01:07 +0100] [1076] [INFO] Booting worker with pid: 1076 Mär 12 07:01:08 build-server gunicorn[828]: [2020-03-12 07:01:08 +0100] [1079] [INFO] Booting worker with pid: 1079 Mär 12 07:01:08 build-server gunicorn[828]: [2020-03-12 07:01:08 +0100] [1082] [INFO] Booting worker with pid: 1082 Mär 12 07:01:08 build-server gunicorn[828]: [2020-03-12 07:01:08 +0100] [1085] [INFO] Booting worker with pid: 1085 Mär 12 07:01:08 build-server gunicorn[828]: [2020-03-12 07:01:08 +0100] [1088] [INFO] Booting worker with pid: 1088 Mär 12 07:01:08 build-server gunicorn[828]: [2020-03-12 07:01:08 +0100] [1091] [INFO] Booting worker with pid: 1091 Mär 12 07:01:09 build-server gunicorn[828]: [2020-03-12 07:01:09 +0100] [1094] [INFO] Booting worker with pid: 1094
Extract from systemctl status:
● api.service - API Server for BuildingChallenge served with Gunicorn Loaded: loaded (/etc/systemd/system/api.service; disabled; vendor preset: enabled) Active: active (running) since Thu 2020-03-12 08:26:01 CET; 22min ago Main PID: 8150 (gunicorn) Tasks: 3 (limit: 4915) Memory: 37.7M (high: 100.0M max: 500.0M) CGroup: /system.slice/api.service ├─ 8150 /opt/api/venv/bin/python /opt/api/venv/bin/gunicorn --bind unix:api.sock wsgi:app ├─28936 /opt/api/venv/bin/python /opt/api/venv/bin/gunicorn --bind unix:api.sock wsgi:app └─28938 /usr/bin/python3 -Es /usr/bin/lsb_release -a Mär 12 08:48:01 build-server gunicorn[8150]: [2020-03-12 08:48:01 +0100] [28909] [INFO] Booting worker with pid: 28909 Mär 12 08:48:01 build-server gunicorn[8150]: [2020-03-12 08:48:01 +0100] [28912] [INFO] Booting worker with pid: 28912 Mär 12 08:48:01 build-server gunicorn[8150]: [2020-03-12 08:48:01 +0100] [28915] [INFO] Booting worker with pid: 28915 Mär 12 08:48:01 build-server gunicorn[8150]: [2020-03-12 08:48:01 +0100] [28918] [INFO] Booting worker with pid: 28918 Mär 12 08:48:01 build-server gunicorn[8150]: [2020-03-12 08:48:01 +0100] [28921] [INFO] Booting worker with pid: 28921 Mär 12 08:48:01 build-server gunicorn[8150]: [2020-03-12 08:48:01 +0100] [28924] [INFO] Booting worker with pid: 28924 Mär 12 08:48:02 build-server gunicorn[8150]: [2020-03-12 08:48:02 +0100] [28927] [INFO] Booting worker with pid: 28927 Mär 12 08:48:02 build-server gunicorn[8150]: [2020-03-12 08:48:02 +0100] [28930] [INFO] Booting worker with pid: 28930 Mär 12 08:48:02 build-server gunicorn[8150]: [2020-03-12 08:48:02 +0100] [28933] [INFO] Booting worker with pid: 28933 Mär 12 08:48:02 build-server gunicorn[8150]: [2020-03-12 08:48:02 +0100] [28936] [INFO] Booting worker with pid: 28936

Thanks for your help.

tilgovi · 2020-04-21T02:34:01Z

I put up a PR that might help debug these kinds of situations. Can anyone take a look?
#2315

Ahmed-Mosharafa · 2020-06-17T18:50:34Z

I had the same issue with a Flask application running inside Docker. The workers were restarting infinitely with an increasing process ID.

The issue was memory related for me, when I increased the memory allowed for Docker, the workers spawned up effectively.

mildebrandt · 2020-09-12T06:01:43Z

@tilgovi , I don't mind if you'd like to incorporate my changes into your PR since you got there first. This will cover the workers being killed via signals.

tilgovi · 2020-09-13T19:27:47Z

@mildebrandt I'll take a look, thanks!

ifiddes · 2020-09-23T22:50:29Z

I am also seeing this behavior suddenly, using Gunicorn (20.0.4) + Gevent (1.5.0) + Flask inside a Docker container.

[  328.699160] gunicorn[5151]: segfault at 78 ip 00007fc1113c16be sp 00007ffce50452a0 error 4 in _greenlet.cpython-37m-x86_64-linux-gnu.so[7fc11138d000+3e000]

In my case, as you can see the segfault is being caused by gevent. What is weird is that this container worked fine 5 days ago, and none of the code changes since then changed any versions of any of the libraries, and all of them are set to specific releases. I did remove flask-mail as a dependency, which may have slightly altered the versions of other dependencies.

Updating from gevent==1.5.0 to gevent==20.9.0 resolved the issue for me.

jamadden · 2020-09-23T23:26:42Z

@ifiddes your issue is likely unrelated. You’re seeing an ABI compatibility issue between old versions of gevent with the most recent version greenlet. See python-greenlet/greenlet#178

ifiddes · 2020-09-23T23:36:39Z

Ah, thanks @jamadden. This post was all I could find when searching for infinite spawning of booting workers, but that issue and the timing of that issue fit my problem.

ciotto · 2020-10-15T15:47:50Z

I had a similar error with a new AWS machine with Ubuntu 20.04 Server and with the same code that work on production.

The machine was configured using Ansible like the other production machines.

[2020-10-15 15:11:49 +0000] [18068] [DEBUG] Current configuration:
  config: None
  bind: ['127.0.0.1:8000']
  backlog: 2048
  workers: 1
  worker_class: uvicorn.workers.UvicornWorker
  threads: 1
  worker_connections: 1000
  max_requests: 0
  max_requests_jitter: 0
  timeout: 30
  graceful_timeout: 30
  keepalive: 2
  limit_request_line: 4094
  limit_request_fields: 100
  limit_request_field_size: 8190
  reload: False
  reload_engine: auto
  reload_extra_files: []
  spew: False
  check_config: False
  preload_app: False
  sendfile: None
  reuse_port: False
  chdir: /var/www/realistico/app
  daemon: False
  raw_env: []
  pidfile: None
  worker_tmp_dir: None
  user: 1001
  group: 1001
  umask: 0
  initgroups: False
  tmp_upload_dir: None
  secure_scheme_headers: {'X-FORWARDED-PROTOCOL': 'ssl', 'X-FORWARDED-PROTO': 'https', 'X-FORWARDED-SSL': 'on'}
  forwarded_allow_ips: ['127.0.0.1']
  accesslog: /var/www/realistico/logs/gunicorn/access.log
  disable_redirect_access_to_syslog: False
  access_log_format: %(h)s %(l)s %(u)s %(t)s "%(r)s" %(s)s %(b)s "%(f)s" "%(a)s"
  errorlog: /var/www/realistico/logs/gunicorn/error.log
  loglevel: debug
  capture_output: False
  logger_class: gunicorn.glogging.Logger
  logconfig: None
  logconfig_dict: {}
  syslog_addr: udp://localhost:514
  syslog: False
  syslog_prefix: None
  syslog_facility: user
  enable_stdio_inheritance: False
  statsd_host: None
  dogstatsd_tags: 
  statsd_prefix: 
  proc_name: None
  default_proc_name: realistico.asgi:application
  pythonpath: None
  paste: None
  on_starting: <function OnStarting.on_starting at 0x7f7ba5fdd550>
  on_reload: <function OnReload.on_reload at 0x7f7ba5fdd670>
  when_ready: <function WhenReady.when_ready at 0x7f7ba5fdd790>
  pre_fork: <function Prefork.pre_fork at 0x7f7ba5fdd8b0>
  post_fork: <function Postfork.post_fork at 0x7f7ba5fdd9d0>
  post_worker_init: <function PostWorkerInit.post_worker_init at 0x7f7ba5fddaf0>
  worker_int: <function WorkerInt.worker_int at 0x7f7ba5fddc10>
  worker_abort: <function WorkerAbort.worker_abort at 0x7f7ba5fddd30>
  pre_exec: <function PreExec.pre_exec at 0x7f7ba5fdde50>
  pre_request: <function PreRequest.pre_request at 0x7f7ba5fddf70>
  post_request: <function PostRequest.post_request at 0x7f7ba5f6e040>
  child_exit: <function ChildExit.child_exit at 0x7f7ba5f6e160>
  worker_exit: <function WorkerExit.worker_exit at 0x7f7ba5f6e280>
  nworkers_changed: <function NumWorkersChanged.nworkers_changed at 0x7f7ba5f6e3a0>
  on_exit: <function OnExit.on_exit at 0x7f7ba5f6e4c0>
  proxy_protocol: False
  proxy_allow_ips: ['127.0.0.1']
  keyfile: None
  certfile: None
  ssl_version: 2
  cert_reqs: 0
  ca_certs: None
  suppress_ragged_eofs: True
  do_handshake_on_connect: False
  ciphers: None
  raw_paste_global_conf: []
  strip_header_spaces: False
[2020-10-15 15:11:49 +0000] [18068] [INFO] Starting gunicorn 20.0.4
[2020-10-15 15:11:49 +0000] [18068] [DEBUG] Arbiter booted
[2020-10-15 15:11:49 +0000] [18068] [INFO] Listening at: unix:/run/gunicorn.sock (18068)
[2020-10-15 15:11:49 +0000] [18068] [INFO] Using worker: uvicorn.workers.UvicornWorker
[2020-10-15 15:11:49 +0000] [18080] [INFO] Booting worker with pid: 18080
[2020-10-15 15:11:49 +0000] [18068] [DEBUG] 1 workers
[2020-10-15 15:11:51 +0000] [18083] [INFO] Booting worker with pid: 18083
[2020-10-15 15:11:53 +0000] [18086] [INFO] Booting worker with pid: 18086
...
[2020-10-15 15:12:09 +0000] [18120] [INFO] Booting worker with pid: 18120
[2020-10-15 15:12:11 +0000] [18123] [INFO] Booting worker with pid: 18123

After many many time lost trying to solve this issue without success (and without any errors on logs), I have tried with this Hello world and I have found this error:

ModuleNotFoundError: No module named 'httptools'

After installing httptools the Hello world application works fine and, unexpectedly, works also my application.

I have no idea why the error was not logged or why this library was installed on the other machines but not on the new one, but this fix the problem for me.

brycedrennan · 2020-11-10T18:24:44Z

Had this happen recently and take down the kubernetes node it was on by consuming all the CPU. Thanks to the hint about dmesg I did find an error eventually:

[225027.348869] traps: python[44796] general protection ip:7f8bd8f8f8b0 sp:7ffc21a0b370 error:0 in libpython3.7m.so.1.0[7f8bd8dca000+2d9000]

In the end my issue was another instance of python-greenlet/greenlet#178 and was resolved by updating gunicorn, gevent, and greenlet to the latest version.

Since these types of exceptions create no python logs, cannot be caught, return exit code 0, and can hang the machine when they occur they're pretty difficult to manage.

I propose that gunicorn detect rapid crash-looping of this nature and

either give up or rate-limit the spawning of new workers
provide a useful message directing people to this issue and newly release pypi package logged /usr/local/lib/python3.7/importlib/_bootstrap.py:219: RuntimeWarning: greenlet.greenlet size changed, may indicate binary incompatibility. Expected 144 from C header, got 152 from PyObject return f(*args, **kwds) python-greenlet/greenlet#178

perhaps max_consecutive_startup_crashes with default being num_workers * 10 ?

tilgovi · 2021-02-16T03:40:15Z

Let's track the crash loop feature request in #2504. We also have the PR for additional logging in #2315. I'll close this issue since it seems like everyone has debugged their problems and now we have some feature requests and improvements to help others in the future. Thanks, everyone!

nkmittal · 2021-12-23T08:57:52Z

I was facing this issue with various Python official Docker images (all for Python 3.7.x), yet kept facing the same issue of "Booting worker with pid..." consistently across all images.
Finally I changed gevent, greenlet and gunicorn versions to latest and problem went away. I hope others facing similar issue would try this as well.

tilgovi added help wanted Open for everyone. You do not need permission to work on these. May need familiarity with codebase. Feature/Core Improvement and removed Feature/Core labels Dec 11, 2017

tilgovi mentioned this issue Dec 12, 2017

The gunicorn got timeout but not exit #1657

Closed

ilivans mentioned this issue Dec 6, 2019

Silent OOM (Booting worker) #2215

Closed

mildebrandt mentioned this issue Sep 12, 2020

Add additional logs when worker exits abnormally #2419

Merged

tilgovi self-assigned this Sep 13, 2020

ioneuk mentioned this issue Nov 12, 2020

Uvicorn worker which is using requests library at the startup, is failed due to timeout #2456

Closed

tilgovi closed this as completed Feb 16, 2021

derneuere mentioned this issue Mar 31, 2021

Infinite loop in backend? LibrePhotos/librephotos#223

Closed

2 tasks

TomiBelan mentioned this issue Oct 31, 2023

arbiter: don't log if handling SIGCHLD #3064

Open

"Booting worker" is looping infinitely despite no exit signals #1663

"Booting worker" is looping infinitely despite no exit signals #1663

Comments

bensalilijames commented Dec 11, 2017

bensalilijames commented Dec 11, 2017

tilgovi commented Dec 11, 2017 • edited

bensalilijames commented Dec 11, 2017

tilgovi commented Dec 11, 2017

bensalilijames commented Dec 11, 2017

zetaab commented Dec 28, 2017 • edited

sara-02 commented Jul 5, 2018 • edited

benoitc commented Jul 5, 2018

sara-02 commented Jul 5, 2018 • edited

benoitc commented Jul 5, 2018

sara-02 commented Jul 5, 2018

benoitc commented Jul 5, 2018

sara-02 commented Jul 5, 2018

sara-02 commented Jul 5, 2018

benoitc commented Jul 6, 2018

benoitc commented Jul 6, 2018

sara-02 commented Jul 6, 2018

sara-02 commented Aug 7, 2018 • edited

gulshan-gaurav commented Oct 8, 2018

sara-02 commented Oct 9, 2018 • edited

gulshan-gaurav commented Oct 9, 2018

benoitc commented Oct 9, 2018

emilwallner commented Oct 29, 2018

m3h0w commented Jan 11, 2019 • edited

sumbb commented Jan 29, 2019 • edited

benoitc commented Sep 16, 2019

aleSuglia commented Sep 16, 2019 • edited

benoitc commented Sep 16, 2019

aleSuglia commented Sep 16, 2019 • edited

benoitc commented Sep 17, 2019

g-bon commented Oct 2, 2019

Ogala commented Nov 22, 2019

ramnes commented Dec 18, 2019

minderov commented Feb 8, 2020 • edited

BrightNana commented Feb 24, 2020

g-bon commented Feb 24, 2020

GUlbricht commented Mar 12, 2020

tilgovi commented Apr 21, 2020

Ahmed-Mosharafa commented Jun 17, 2020

mildebrandt commented Sep 12, 2020

tilgovi commented Sep 13, 2020

ifiddes commented Sep 23, 2020

jamadden commented Sep 23, 2020

ifiddes commented Sep 23, 2020

ciotto commented Oct 15, 2020

brycedrennan commented Nov 10, 2020

tilgovi commented Feb 16, 2021

nkmittal commented Dec 23, 2021

tilgovi commented Dec 11, 2017 •

edited

zetaab commented Dec 28, 2017 •

edited

sara-02 commented Jul 5, 2018 •

edited

sara-02 commented Jul 5, 2018 •

edited

sara-02 commented Aug 7, 2018 •

edited

sara-02 commented Oct 9, 2018 •

edited

m3h0w commented Jan 11, 2019 •

edited

sumbb commented Jan 29, 2019 •

edited

aleSuglia commented Sep 16, 2019 •

edited

aleSuglia commented Sep 16, 2019 •

edited

minderov commented Feb 8, 2020 •

edited