
gunicorn CRITICAL WORKER TIMEOUT #1440

Closed
jseaidou opened this issue Jan 21, 2017 · 19 comments

@jseaidou

jseaidou commented Jan 21, 2017

I am running gunicorn with the following settings:
gunicorn --worker-class gevent --timeout 30 --graceful-timeout 20 --max-requests-jitter 2000 --max-requests 1500 -w 50 --log-level DEBUG --capture-output --bind 0.0.0.0:5000 run:app
and I am seeing [CRITICAL] WORKER TIMEOUT in all but 3 workers. After a while, gunicorn can't spawn any more workers, or is at least very slow at spawning them. This leaves the server unreachable and every request fails.

I reduced the number of workers to 3 and gave each worker 2 threads, and now I'm not seeing this issue anymore.

I can't get a stack trace from the timeouts, but it seems like gunicorn can't handle the workers beyond a certain number?

@benoitc
Owner

benoitc commented Jan 25, 2017

When a worker times out, it means that it didn't notify the arbiter in time that it was alive. Do you have any task executed during a request that could take longer than the timeout?

@benoitc
Owner

benoitc commented Jan 27, 2017

@jseaidou bump.

@jseaidou
Author

Sorry for the late reply @benoitc. The issue I'm seeing is actually caused by the non-active threads. My active threads don't time out; the non-active ones, under less load, give more critical timeout errors instead of just gracefully timing out. I switched from gevent to tornado and that seems to have fixed the hang issue, but I'm still seeing 3 workers consistently giving the critical timeout every 30 seconds. If it's a normal timeout, it shouldn't be a critical error.

@saabeilin

I am facing exactly the same issue.
Gunicorn 19.7.1
gevent 1.2.1
Python 3.5.3
Running in Docker, image based on official "python:3.5"

@benoitc
Owner

benoitc commented Jun 25, 2017

@jseaidou it's a normal timeout in the sense that the arbiter reacts to it. It's logged as critical because it should normally not happen.

Most probably one of your workers is doing a blocking operation that prevents the gunicorn worker from notifying the arbiter. If you have a long operation, make sure to yield to the gevent scheduler from time to time by sleeping or similar, or anything that calls back into the tornado scheduler (see the sketch below).

How can I reproduce the issue?

@saabeilin same ^^
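
A minimal sketch of the advice above, assuming a gevent worker and a purely hypothetical long-running Flask view (none of these names come from this thread): calling gevent.sleep(0) inside the loop yields back to the gevent hub, so the greenlet that notifies the arbiter keeps running and the worker is not flagged as timed out.

# Hypothetical example; serve with something like:
#   gunicorn --worker-class gevent --timeout 30 app:app
import gevent
from flask import Flask

app = Flask(__name__)

def do_one_chunk_of_work(i):
    # Stand-in for one slice of a long, CPU-bound job.
    return sum(range(100_000))

@app.route("/long-task")
def long_task():
    for i in range(10_000):
        do_one_chunk_of_work(i)
        gevent.sleep(0)  # yield to the hub so the worker heartbeat can reach the arbiter
    return "done"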

@schwuk

schwuk commented Jun 27, 2017

I'm seeing the same thing: workers are timing out even when serving no requests. All I've done is launch my container on AWS ECS.

[2017-06-27 20:41:56 +0000] [1] [DEBUG] Current configuration:
proxy_protocol: False
worker_connections: 1000
statsd_host: None
max_requests_jitter: 0
post_fork: <function post_fork at 0x7f6bbc3f1938>
errorlog: -
enable_stdio_inheritance: False
worker_class: sync
ssl_version: 2
suppress_ragged_eofs: True
syslog: False
syslog_facility: user
when_ready: <function when_ready at 0x7f6bbc3f1668>
pre_fork: <function pre_fork at 0x7f6bbc3f17d0>
cert_reqs: 0
preload_app: False
keepalive: 2
accesslog: -
group: 0
graceful_timeout: 30
do_handshake_on_connect: False
spew: False
workers: 4
proc_name: None
sendfile: None
pidfile: None
umask: 0
on_reload: <function on_reload at 0x7f6bbc3f1500>
pre_exec: <function pre_exec at 0x7f6bbc3f1ed8>
worker_tmp_dir: None
limit_request_fields: 100
pythonpath: None
on_exit: <function on_exit at 0x7f6bbc3f7758>
config: None
logconfig: None
check_config: False
statsd_prefix:
secure_scheme_headers: {'X-FORWARDED-PROTOCOL': 'ssl', 'X-FORWARDED-PROTO': 'https', 'X-FORWARDED-SSL': 'on'}
reload_engine: auto
proxy_allow_ips: ['127.0.0.1']
pre_request: <function pre_request at 0x7f6bbc3f70c8>
post_request: <function post_request at 0x7f6bbc3f71b8>
forwarded_allow_ips: ['127.0.0.1']
worker_int: <function worker_int at 0x7f6bbc3f1c08>
raw_paste_global_conf: []
threads: 1
max_requests: 0
chdir: /opt/app
daemon: False
user: 0
limit_request_line: 4094
access_log_format: %(h)s %(l)s %(u)s %(t)s "%(r)s" %(s)s %(b)s "%(f)s" "%(a)s"
certfile: None
on_starting: <function on_starting at 0x7f6bbc3f1398>
post_worker_init: <function post_worker_init at 0x7f6bbc3f1aa0>
child_exit: <function child_exit at 0x7f6bbc3f7320>
worker_exit: <function worker_exit at 0x7f6bbc3f7488>
paste: None
default_proc_name: app:app
syslog_addr: udp://localhost:514
syslog_prefix: None
ciphers: TLSv1
worker_abort: <function worker_abort at 0x7f6bbc3f1d70>
loglevel: DEBUG
bind: ['0.0.0.0:5005']
raw_env: []
initgroups: False
capture_output: False
reload: False
limit_request_field_size: 8190
nworkers_changed: <function nworkers_changed at 0x7f6bbc3f75f0>
timeout: 30
keyfile: None
ca_certs: None
tmp_upload_dir: None
backlog: 2048
logger_class: gunicorn.glogging.Logger
[2017-06-27 20:41:56 +0000] [1] [INFO] Starting gunicorn 19.7.1
[2017-06-27 20:41:56 +0000] [1] [DEBUG] Arbiter booted
[2017-06-27 20:41:56 +0000] [1] [INFO] Listening at: http://0.0.0.0:5005 (1)
[2017-06-27 20:41:56 +0000] [1] [INFO] Using worker: sync
[2017-06-27 20:41:56 +0000] [8] [INFO] Booting worker with pid: 8
[2017-06-27 20:41:57 +0000] [9] [INFO] Booting worker with pid: 9
[2017-06-27 20:41:57 +0000] [10] [INFO] Booting worker with pid: 10
[2017-06-27 20:41:57 +0000] [12] [INFO] Booting worker with pid: 12
[2017-06-27 20:41:57 +0000] [1] [DEBUG] 4 workers
[2017-06-27 20:42:15 +0000] [10] [DEBUG] Closing connection.
[2017-06-27 20:42:15 +0000] [8] [DEBUG] Closing connection.
[2017-06-27 20:42:45 +0000] [8] [DEBUG] Closing connection.
[2017-06-27 20:42:45 +0000] [10] [DEBUG] Closing connection.
[2017-06-27 20:42:46 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:9)
[2017-06-27 20:42:46 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:12)
[2017-06-27 20:42:46 +0000] [9] [INFO] Worker exiting (pid: 9)
[2017-06-27 20:42:46 +0000] [12] [INFO] Worker exiting (pid: 12)
[2017-06-27 20:42:46 +0000] [1] [DEBUG] 3 workers
[2017-06-27 20:42:46 +0000] [24] [INFO] Booting worker with pid: 24
[2017-06-27 20:42:46 +0000] [25] [INFO] Booting worker with pid: 25
[2017-06-27 20:42:46 +0000] [1] [DEBUG] 4 workers

This does not occur when running locally. :-/

@schwuk

schwuk commented Jun 27, 2017

Looks like switching to gevent workers has resolved it for me. ¯\_(ツ)_/¯

@tilgovi
Collaborator

tilgovi commented Jun 30, 2017

Duplicate of #1194, I think.

@jenstroeger

I’ve seen this happen repeatedly lately, and for me it seems connected with putting the laptop to sleep. When I open the lid, a bunch of these messages show up. Not sure this helps, but I thought I’d mention it…

@stripcode

gunicorn --daemon --workers 2 --timeout 120 --bind 127.0.0.1:4000 --pid /var/run/mshc_admin.pid --user danilovskoe --group danilovskoe --chdir /home/danilovskoe/mshc2/src/flask/ --env MSHC_PRODUCTION=/etc/monit/mshc.config.py admin_gunicorn:app

timeout 30 seconds
GET request

ubuntu 16.04
flask 0.12.2
Python 3.6.3 (default, Oct 4 2017, 02:55:45)
[GCC 5.4.0 20160609] on linux
gunicorn (version 19.7.1)

@xgfone

xgfone commented Mar 7, 2018

I ran into the same trouble.

Just after starting the app, it works fine. But as soon as a request comes in, [CRITICAL] WORKER TIMEOUT is triggered. For example,

[2018-01-02 16:38:03 +0800] [24355] [INFO] Starting gunicorn 19.7.1
[2018-01-02 16:38:03 +0800] [24355] [DEBUG] Arbiter booted
[2018-01-02 16:38:03 +0800] [24355] [INFO] Listening at: http://0.0.0.0:8080 (24355)
[2018-01-02 16:38:03 +0800] [24355] [INFO] Using worker: gevent
[2018-01-02 16:38:03 +0800] [24358] [INFO] Booting worker with pid: 24358
[2018-01-02 16:38:03 +0800] [24355] [DEBUG] 1 workers

[2018-01-02 16:38:10 +0800] [24358] [DEBUG] GET /v1/bj2/CC/uuid
[2018-01-02 16:38:10 +0800] [24358] [DEBUG] Closing connection. 
[2018-01-02 16:38:41 +0800] [24355] [CRITICAL] WORKER TIMEOUT (pid:24358)
[2018-01-02 16:38:41 +0800] [24358] [INFO] Worker exiting (pid: 24358)
[2018-01-02 16:38:41 +0800] [24381] [INFO] Booting worker with pid: 24381

[2018-01-02 16:48:51 +0800] [24355] [CRITICAL] WORKER TIMEOUT (pid:24381)
[2018-01-02 16:48:51 +0800] [24381] [INFO] Worker exiting (pid: 24381)
[2018-01-02 16:48:51 +0800] [24703] [INFO] Booting worker with pid: 24703
[2018-01-02 16:48:51 +0800] [24703] [INFO] worker pid 24703 notify
[2018-01-02 16:48:51 +0800] [24703] [DEBUG] GET /v1/bj2/CC/uuid
[2018-01-02 16:48:51 +0800] [24703] [DEBUG] Closing connection. 
CentOS: 6.7
Python: 3.6.3
Gevent: 1.2.2
Greenlet: 0.4.9
Gunicorn: 19.7.1

RUN CMD: gunicorn --worker-class gevent --log-level debug --bind 0.0.0.0:8080 app

When I switched the worker class to eventlet, that is,
gunicorn --worker-class eventlet --log-level debug --bind 0.0.0.0:8080 app,
everything works fine.

Note: My app runs on a physical host, not a virtual host or a cloud host.


Update:

So I guess it's a problem with gevent or the gevent worker.

@tilgovi
Collaborator

tilgovi commented Mar 10, 2018

There are reports on this issue of gevent solving the problem and of gevent causing the problem. I cannot identify a root cause here. Some of the reports may be the same as #1194, but others may not be.

If anyone can share a minimal case to reproduce, that would help.
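
Not a confirmed root cause for any of the reports above, but for comparison, a deliberately minimal way to make the arbiter emit [CRITICAL] WORKER TIMEOUT (useful as a sanity check when comparing environments); the file and route names are made up. With the sync worker, any request that blocks longer than --timeout gets the worker killed and replaced.

# app.py -- hypothetical; run with: gunicorn --workers 1 --timeout 30 app:app
# Requesting /block keeps the sync worker busy past the timeout, so the arbiter
# logs [CRITICAL] WORKER TIMEOUT (pid:...) and boots a replacement worker.
import time
from flask import Flask

app = Flask(__name__)

@app.route("/block")
def block():
    time.sleep(60)  # blocks longer than --timeout 30
    return "never returned by the sync worker"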

@mtripp100

I'm not sure it's the same issue, but I can reproduce this 100% of the time using VirtualBox with the following setup:

Host: Windows 10
Guest: Ubuntu 16.04
Gunicorn: 19.7.1

I forward TCP:8000 between host and guest over the default NAT connection. Using a sync worker, any request on the host for localhost:8000 causes these errors to appear in the log, but if I make the same requests on the guest, the log is clear. Switching to --worker-class eventlet makes the errors go away.

I appreciate that VirtualBox is a completely different environment, but it does sound very similar to what is being described above, and it is consistently reproducible (for me at least).

@lordmauve
Contributor

I am seeing this happen with slow uploads. During an upload (to a Django site), if the worker timeout is hit, the upload dies.

@tilgovi
Collaborator

tilgovi commented Apr 28, 2018

@lordmauve if you are using the sync worker, that is expected. Long requests will block the worker and eventually the arbiter will kill it. You can use a different worker type if you expect long requests to succeed.
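
For reference, a hedged sketch of that switch written as a gunicorn Python config file (the values and the module path myproject.wsgi are illustrative, not taken from this thread): an async or threaded worker keeps the heartbeat going while a slow request is in flight, and the timeout can be raised for legitimately long uploads.

# gunicorn.conf.py -- illustrative settings only.
# Start with: gunicorn -c gunicorn.conf.py myproject.wsgi:application
bind = "0.0.0.0:8000"
workers = 4
worker_class = "gevent"   # or worker_class = "gthread" together with threads = 4
timeout = 120             # give slow uploads more room before the arbiter steps in
graceful_timeout = 30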

@tilgovi
Collaborator

tilgovi commented Apr 28, 2018

For anyone reading this thread, please re-open with a minimal case to reproduce. I cannot see any clean investigation to pursue here. For the case of AWS / ECS, I'm still leaving #1194 open until I can test the configurations I listed (#1194 (comment)).

@hashlash

hashlash commented Aug 9, 2018

Anyone who still has this problem: please check the resource availability for the application, in addition to increasing the timeout and changing the worker class.

I was having this problem when I tried to deploy my application using Docker Swarm and realized I had set the resource limits too low for the application. Increasing the resources solved the problem for me:

deploy:
  replicas: 5
  resources:
    limits:
      cpus: "0.1"
      memory: 50M
  restart_policy:
    condition: on-failure

I think this is not a bug, just a matter of how we configure our apps.

@fred-revel

@jseaidou it's a normal timeout in the sense that the arbiter reacts to it. It's logged as critical because it should normally not happen.

Most probably one of your workers is doing a blocking operation that prevents the gunicorn worker from notifying the arbiter. If you have a long operation, make sure to yield to the gevent scheduler from time to time by sleeping or similar, or anything that calls back into the tornado scheduler.

How can I reproduce the issue?

@saabeilin same ^^

Thanks, this makes sense.
For me this is happening when serving a large file that was previously downloaded from cloud storage.
Upon request, the file is retrieved to the local disk from cloud storage, then it is stream-decrypted to the client. The download from cloud storage works fine, even if it takes 10 minutes or longer.
Once the worker starts stream-decrypting the file from disk to the client, it dies after a few hundred MB, probably because it times out while being busy with this blocking operation.
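
A possible mitigation in the spirit of the advice quoted above, assuming a gevent worker; the Flask framing, the file path, and the decrypt_chunk helper are hypothetical stand-ins for whatever the real app uses. Streaming the file in small chunks and yielding to the hub between chunks keeps the worker heartbeat alive during the long decryption.

# Hypothetical sketch; decrypt_chunk() is a placeholder for the real stream cipher.
import gevent
from flask import Flask, Response

app = Flask(__name__)

def decrypt_chunk(chunk: bytes) -> bytes:
    return chunk  # placeholder: real code would decrypt here

@app.route("/download")
def download():
    def generate():
        with open("/tmp/big-encrypted-file", "rb") as fh:  # placeholder path
            while True:
                chunk = fh.read(64 * 1024)
                if not chunk:
                    break
                yield decrypt_chunk(chunk)
                gevent.sleep(0)  # let the worker's heartbeat greenlet run
    return Response(generate(), mimetype="application/octet-stream")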

@1oglop1

1oglop1 commented Mar 15, 2022

@fred-revel Were you able to resolve the issue? I think I have exactly the same problem.
