"Booting worker" is looping infinitely despite no exit signals #1663
Perhaps gunicorn should surface errors like this instead of swallowing them, or handle them differently? Not sure, just thought I'd raise this since it might help someone else!
Thanks for reporting the issue! If you can figure out where this happens, that would be very helpful. Perhaps we can add logging when workers exit. Usually, the worker itself logs, but if it's killed very abruptly it won't.
No worries! There seems to be a problem with spaCy, which I've just added to this thread: explosion/spaCy#1589. Anyway, it's causing a
I suppose it would be nice if gunicorn could identify this and log an error rather than silently restarting the worker, but to be honest I know very little about how exit codes work!
Some exit codes definitely have special meanings and we could probably log those.
Sounds good! Additionally, if the exit code isn't a reserved exit code (as in this case), it would be cool if that could be logged too (even without an explanation), so it's at least apparent that the worker is indeed terminating 🙂
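For readers curious what such logging could look like: the status a POSIX master process gets back from os.waitpid() distinguishes a normal exit code from death by signal. This is a minimal sketch of the idea, not gunicorn's actual arbiter code; the function name and log wiring are invented:

```python
import logging
import os

log = logging.getLogger("arbiter")

def log_worker_exit(pid, status):
    """Hypothetical helper: decode a waitpid() status and log why a worker died."""
    if os.WIFSIGNALED(status):
        # Killed by a signal, e.g. SIGKILL from the kernel OOM killer or SIGSEGV.
        log.error("Worker %d was killed by signal %d", pid, os.WTERMSIG(status))
    elif os.WIFEXITED(status):
        code = os.WEXITSTATUS(status)
        level = logging.ERROR if code != 0 else logging.INFO
        log.log(level, "Worker %d exited with code %d", pid, code)
```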
I have a kind of similar issue: gunicorn boots a new worker every time I make an HTTP request. I never get any response back; it just boots a new worker each time. Strace log from two HTTP requests:
I am facing the same issue. In my before-load action I am downloading data from AWS S3, and it takes approx 1 min 10 sec to download the various files.
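A side note on this report, offered as a guess rather than a diagnosis: gunicorn's arbiter kills and reboots sync workers that don't respond within the timeout setting (30 seconds by default), so a 70-second download inside the worker could by itself produce an endless "Booting worker" loop. A gunicorn.conf.py sketch that would rule this out (the values are illustrative, not from this thread):

```python
# gunicorn.conf.py: illustrative values, tune for your workload
timeout = 120       # give slow-starting workers more than the 30 s default
preload_app = True  # do import-time work (e.g. S3 downloads) once, in the master
workers = 2
```

Run with `gunicorn -c gunicorn.conf.py app:app` (module name assumed).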
@sara-02 what is your command line to launch gunicorn?
@sara-02 Thanks. Are the old workers really exiting, or are they kept online while new workers are spawned? What does the debug log show?
The logs are mixed with botocore logs, but it is something like this:
But are the workers killed? What does the command return?
One question though: why do we see 2 processes?
There is 1 arbiter (master) process and N worker processes, yes :) So you run the command each time a worker boots, right? If so, it seems the older worker is killed and a new one is spawned. I will investigate.
@sara-02 one last thing, is this also happening in Docker?
@benoitc on
Just an update: the issue for me was actually a memory error, and it went away once the memory problem was fixed.
@gulshan-gaurav 2 things helped me:
@sara-02
@gulshan-gaurav which issue are you facing? Having 5 processes there looks good...
I had the same issue. I didn't locate the exact problem, but it was solved once I upgraded from Python 3.5 to 3.6.
I am facing the same issue in a Docker container. Gunicorn keeps booting a new worker every time I call an endpoint that causes the failure, but no exception or error is output to Gunicorn's log files. Things that I choose to print are logged, and then suddenly the log file just says "Booting worker with pid..." Two other endpoints of the app work correctly.

One step that helped was to add the env variable PYTHONUNBUFFERED. Before that, even the print statements would disappear and would not be saved in Gunicorn's logs.

I run Gunicorn with:

```
gunicorn run:app -b localhost:5000 --enable-stdio-inheritance --error-logfile /var/log/gunicorn/error.log --access-logfile /var/log/gunicorn/access.log --capture-output --log-level debug
```

I am already running Python 3.6 and checked with top that memory doesn't seem to be an issue.

EDIT: It looks like it was a Python issue and not Gunicorn's fault. Some version discrepancies were causing Python to just die without any trace while performing a certain operation.
I am facing a similar issue where workers keep coming up endlessly. I had a mismatch in my scikit-learn dependency, but even after resolving that, I am still getting the same infinite stream of workers booting. What kind of Python version discrepancies should I look for, and how do I identify them?
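One generic way to make these silent deaths visible, suggested here as an addition rather than something from the thread: the standard-library faulthandler module prints a traceback when the interpreter is killed by a segfault, which is often what a C-extension version mismatch (scikit-learn, numpy, etc.) turns into:

```python
# Put this at the top of the app module gunicorn imports.
# faulthandler has been in the standard library since Python 3.3.
import faulthandler
import sys

# On SIGSEGV/SIGFPE/SIGABRT/SIGBUS/SIGILL, dump the Python traceback to stderr.
faulthandler.enable(file=sys.stderr, all_threads=True)
```

Combine it with gunicorn's --capture-output (as in the command above) so the dump actually lands in the error log.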
Can someone provide a way to reproduce the issue?
It's a manager of several components that are executed in a pipeline. Some of them may make HTTP requests to other components on the same machine or on remote machines. Some of the modules of the pipeline can be executed in parallel, and those are run using a ThreadPoolExecutor. They don't use any shared objects; they only generate data structures that are later aggregated into a single result. Unfortunately I'm not sure I can put together a minimal example without exposing the system we have.
requests does a lot of unsafe things with threads, which sometimes fork a new process. I would advise using another client. Can you paste at least the lines you're using to do a request? Are you using its timeout feature?
One of them could be:
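The pasted snippet did not survive extraction, so as a stand-in, here is a hypothetical sketch of the pattern described above: a ThreadPoolExecutor fanning out requests calls with a timeout. The URLs and payload are invented for illustration:

```python
from concurrent.futures import ThreadPoolExecutor
import requests

# Hypothetical component endpoints, not from the original report.
COMPONENT_URLS = [
    "http://localhost:8001/run",
    "http://remote-host:8002/run",
]

def call_component(url):
    # timeout= is the requests feature asked about above.
    resp = requests.post(url, json={"input": "..."}, timeout=30)
    resp.raise_for_status()
    return resp.json()

with ThreadPoolExecutor(max_workers=4) as pool:
    # Each pipeline module runs in parallel; results are aggregated afterwards.
    results = list(pool.map(call_component, COMPONENT_URLS))
```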
Thanks. I will try to create a simple example from it. It would be cool anyway if someone could send us a PR that reproduces the behaviour, either as an example or a unit test, so we make sure we are actually fixing the right thing.
Not sure if it can help someone, but I had the same issue while running a dockerized Flask webapp, and solved it by updating the base image in my Dockerfile. dmesg on the host showed a segfault in libpython3.6m.so.1.0. As said, changing the base image fixed it for me.
I faced the same challenge running Flask + Docker + Kubernetes. Increasing the CPU and memory limits solved it for me.
The same thing happened to us. Increasing resource limits fixed the problem.
This suddenly happened to me on macOS Catalina (not containerized). What helped me was:
```
brew install openssl
export DYLD_LIBRARY_PATH=/usr/local/opt/openssl/lib:$DYLD_LIBRARY_PATH
```
I am having a similar challenge and would be grateful if someone could help me out:

```
root@ubuntu-s-1vcpu-1gb-nyc1-01:~# sudo systemctl status gunicorn.service
● gunicorn.service - gunicorn daemon
   Loaded: loaded (/etc/systemd/system/gunicorn.service; disabled; vendor preset: enabled)
   Active: active (running) since Mon 2020-02-24 07:48:04 UTC; 44min ago
 Main PID: 4846 (gunicorn)
    Tasks: 4 (limit: 1151)
   CGroup: /system.slice/gunicorn.service
           ├─4846 /home/bright/djangoprojectdir/djangoprojectenv/bin/python /home/bright/djangoprojectdir/djangoprojectenv/bin/gunicorn -
           ├─4866 /home/bright/djangoprojectdir/djangoprojectenv/bin/python /home/bright/djangoprojectdir/djangoprojectenv/bin/gunicorn -
           ├─4868 /home/bright/djangoprojectdir/djangoprojectenv/bin/python /home/bright/djangoprojectdir/djangoprojectenv/bin/gunicorn -
           └─4869 /home/bright/djangoprojectdir/djangoprojectenv/bin/python /home/bright/djangoprojectdir/djangoprojectenv/bin/gunicorn -

Feb 24 07:48:04 ubuntu-s-1vcpu-1gb-nyc1-01 systemd[1]: Stopped gunicorn daemon.
Feb 24 07:48:04 ubuntu-s-1vcpu-1gb-nyc1-01 systemd[1]: Started gunicorn daemon.
Feb 24 07:48:05 ubuntu-s-1vcpu-1gb-nyc1-01 gunicorn[4846]: [2020-02-24 07:48:05 +0000] [4846] [INFO] Starting gunicorn 20.0.4
Feb 24 07:48:05 ubuntu-s-1vcpu-1gb-nyc1-01 gunicorn[4846]: [2020-02-24 07:48:05 +0000] [4846] [INFO] Listening at: unix:/run/gunicorn.soc
Feb 24 07:48:05 ubuntu-s-1vcpu-1gb-nyc1-01 gunicorn[4846]: [2020-02-24 07:48:05 +0000] [4846] [INFO] Using worker: sync
Feb 24 07:48:05 ubuntu-s-1vcpu-1gb-nyc1-01 gunicorn[4846]: [2020-02-24 07:48:05 +0000] [4866] [INFO] Booting worker with pid: 4866
Feb 24 07:48:05 ubuntu-s-1vcpu-1gb-nyc1-01 gunicorn[4846]: [2020-02-24 07:48:05 +0000] [4868] [INFO] Booting worker with pid: 4868
Feb 24 07:48:05 ubuntu-s-1vcpu-1gb-nyc1-01 gunicorn[4846]: [2020-02-24 07:48:05 +0000] [4869] [INFO] Booting worker with pid: 4869
Feb 24 08:03:41 ubuntu-s-1vcpu-1gb-nyc1-01 gunicorn[4846]: - - [24/Feb/2020:08:03:41 +0000] "GET / HTTP/1.0" 400 26 "-" "Mozilla/5.0 (Wi
```

Can anyone please help me fix that?
@BrightNana can you try to give a
Hello, I attached an extract and an excerpt from my logs. Thanks for your help.
I put up a PR that might help debug these kinds of situations. Can anyone take a look?
@tilgovi, I don't mind if you'd like to incorporate my changes into your PR, since you got there first. This will cover the workers being killed via signals.
@mildebrandt I'll take a look, thanks!
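While such a PR is pending, gunicorn's documented server hooks can already log most signal-driven worker deaths from a config file. A sketch follows; the log messages are illustrative, but worker_int, worker_abort, and child_exit are real gunicorn settings:

```python
# gunicorn.conf.py: log worker deaths via gunicorn's server hooks

def worker_int(worker):
    # Called when a worker receives SIGINT or SIGQUIT.
    worker.log.warning("worker %s: received INT/QUIT", worker.pid)

def worker_abort(worker):
    # Called when a worker receives SIGABRT, e.g. after a worker timeout.
    worker.log.warning("worker %s: received ABRT (timeout?)", worker.pid)

def child_exit(server, worker):
    # Called in the master each time a worker process exits, however it died.
    server.log.warning("worker %s exited", worker.pid)
```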
I am also seeing this behavior suddenly, using Gunicorn (20.0.4) + Gevent (1.5.0) + Flask inside a Docker container.
In my case, as you can see, the segfault is being caused by gevent. What is weird is that this container worked fine 5 days ago, and none of the code changes since then touched the version of any library; all of them are pinned to specific releases. I did remove flask-mail as a dependency, which may have slightly altered the versions of other dependencies. Updating from gevent==1.5.0 to gevent==20.9.0 resolved the issue for me.
@ifiddes your issue is likely unrelated. You're seeing an ABI compatibility issue between old versions of gevent and the most recent version of greenlet. See python-greenlet/greenlet#178
Ah, thanks @jamadden. This post was all I could find when searching for infinite spawning of booting workers, but that issue and its timing fit my problem.
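For anyone landing here from the same search, a quick way to check whether you're running the gevent/greenlet combination in question (the known-bad and known-good pairs are in the linked greenlet issue, not here):

```python
# An old gevent (e.g. 1.5.0) combined with a newer greenlet can crash at
# import time with no Python traceback at all, hence the silent boot loop.
import gevent
import greenlet

print("gevent  ", gevent.__version__)
print("greenlet", greenlet.__version__)
```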
I had a similar error on a new AWS machine with Ubuntu 20.04 Server, running the same code that works in production. The machine was configured using Ansible, like the other production machines.
After a lot of time lost trying to solve this issue without success (and without any errors in the logs), I tried with this Hello World and found this error:
After installing the library, the problem went away. I have no idea why the error was not logged, or why this library was installed on the other machines but not on the new one, but this fixed the problem for me.
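The Hello World trick generalizes well: running a trivial app under the same gunicorn setup separates "the environment is broken" from "my app is killing the worker". A minimal sketch (the module name hello.py is arbitrary):

```python
# hello.py: minimal WSGI app for isolating gunicorn problems.
# If `gunicorn hello:app` also boot-loops, suspect the environment;
# if it serves fine, suspect the application's imports or code.
def app(environ, start_response):
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Hello, world!\n"]
```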
Had this happen recently; it took down the Kubernetes node it was on by consuming all the CPU. Thanks to the hints in this thread, I found that my issue was, in the end, another instance of python-greenlet/greenlet#178, and it was resolved by updating gunicorn, gevent, and greenlet to the latest versions. Since these types of exceptions create no Python logs, cannot be caught, return exit code 0, and can hang the machine when they occur, they're pretty difficult to manage. I propose that gunicorn detect rapid crash-looping of this nature and perhaps back off or stop instead of respawning forever.
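To make that proposal concrete, here is a purely hypothetical sketch of crash-loop detection in a supervising loop: this is not gunicorn code, and the window and threshold values are invented:

```python
import time

CRASH_WINDOW = 10.0  # seconds (invented value)
MAX_CRASHES = 5      # worker deaths tolerated inside the window (invented value)

_recent_crashes = []

def worker_died():
    """Record a worker death; return True when the crash rate looks like a loop."""
    now = time.monotonic()
    _recent_crashes.append(now)
    # Drop deaths that fell out of the sliding window.
    while _recent_crashes and now - _recent_crashes[0] > CRASH_WINDOW:
        _recent_crashes.pop(0)
    return len(_recent_crashes) >= MAX_CRASHES

# A supervisor could then back off or bail out instead of respawning forever:
# if worker_died():
#     time.sleep(CRASH_WINDOW)  # throttle, or log loudly and exit non-zero
```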
I was facing this issue with various official Python Docker images (all for Python 3.7.x), yet kept hitting the same "Booting worker with pid..." loop consistently across all of them.
I'm trying to get gunicorn set up on Docker. It works well locally, and the production image is exactly the same as the local image, but I'm getting this strange behaviour on the production Docker engine:
It looks like gunicorn is booting workers every 4-5 seconds, despite no apparent error messages or exit signals. This behaviour continues indefinitely until terminated.
Is it possible that a worker can exit without logging anything to stderr/stdout, or for the arbiter to spawn workers infinitely?
Since they are the same Docker image, they are running exactly the same code on exactly the same architecture, so I'm really confused about what this could be (a bug?). Any help greatly appreciated!