Clarify what/how timeout and graceful_timeout work #1493

tuukkamustonen · 2017-04-03T09:07:48Z

(Sorry for the monologue here: simple things got complicated and I ended up digging through the stack. Hopefully what I've documented is helpful for the reader, however.)

As I've understood, by default:

After 30 seconds (configurable with timeout) of request processing, gunicorn master process sends SIGTERM to the worker process, to initiate a graceful restart.
If worker does not shutdown during another 30 seconds (configurable with graceful_timeout), master process sends SIGKILL. Seems like this signal is also sent when the worker does gracefully shutdown during the graceful_timeout period (d1a0973).

The questions:

Are the signals correct?
What actually happens when gunicorn (sync) worker receives these signals? How does it tell WSGI app that the signal was caught and something should happen (ok I assume it just "passes it on")?
How does, for example, Flask handle SIGTERM signal - in practice, what happens during request processing? Does it just set a flag for the WSGI application (on werkzeug level) that it should shutdown after the request processing is complete? Or does SIGTERM already somehow affect ongoing request processing - kill IO connections or something to speed up request processing...?

On SIGKILL, I guess request processing is just forcefully aborted.

I could file a tiny PR to improve docs about this, if I get understanding how things actually work.

The text was updated successfully, but these errors were encountered:

tuukkamustonen · 2017-04-03T09:31:21Z

Hmm, I think #1236 (comment) confirms my assumptions about SIGTERM simply setting worker to shutdown after request processing is complete (and setting worker not to accept any new connections).

tuukkamustonen · 2017-04-03T09:59:00Z

Seems like how I interpreted timeout and graceful_timeout is wrong. Both periods actually refer to time at the start of request processing. So, by default, because both settings are set to 30 seconds, there is no graceful restart enabled. If I do something like --graceful-timeout 15 --timeout 30 that should mean graceful restart is initiated at 15 seconds and worker is force killed at 30 seconds if request did not complete before that.

However, it seems like if response is returned between graceful_timeout and timeout, then the worker is not restarted after all? Shouldn't it?

I tested by app.py:

import time
from flask import Flask

app = Flask(__name__)

@app.route('/foo')
def foo():
    time.sleep(3)
    return 'ok'

Then:

12:51 $ gunicorn app:app --timeout 5 --graceful-timeout 1
[2017-04-03 12:51:37 +0300] [356] [INFO] Starting gunicorn 19.6.0
[2017-04-03 12:51:37 +0300] [356] [INFO] Listening at: http://127.0.0.1:8000 (356)
[2017-04-03 12:51:37 +0300] [356] [INFO] Using worker: sync
[2017-04-03 12:51:37 +0300] [359] [INFO] Booting worker with pid: 359

Then I send curl localhost:8000/foo, which returns after 3 seconds. But nothing happens in gunicorn - I see no trace of graceful restart being initiated or happened?

tuukkamustonen · 2017-04-03T11:27:34Z

It seems that on timeout, SystemExit(1,) is thrown, aborting current request processing in Flask. What code or signal generates it, can't say.

This exception gets thrown through Flask stack, and any teardown_request handlers catch it. There's enough time to log something, but if you do time.sleep(1) or something else time consuming in the handler, it gets silently killed. It's as if there was 100-200ms time before the process is actually forcefully terminated and I am wondering what is this delay. It's not graceful timeout, that setting has no impact on the delay. I would expect the process to be just forcefully killed in place, instead of seeing SystemExit being thrown through the stack, but then potentially killing the process mid-air anyway.

In fact, I don't see graceful_timeout doing anything - maybe it's not supported for sync workers, or maybe it doesn't work "stand-alone" (or together with timeout) - only when you manually send SIGTERM?

Also what might be weird is that https://github.com/benoitc/gunicorn/blob/master/gunicorn/arbiter.py#L392 does not check graceful flag at all. I guess https://github.com/benoitc/gunicorn/blob/master/gunicorn/arbiter.py#L390 ensures that self.WORKERS is empty so graceful timeout is not waited when doing non-graceful stop.

tuukkamustonen · 2017-04-06T05:58:34Z

@benoitc @tilgovi Care to give a hand here? Hopefully my writings above make sense...

benoitc · 2017-04-17T09:02:42Z

@tuco86 The graceful timeout is only available when you quit the arbiter, upgrade it (USR2), send an HUP signal to the arbiter or send a QUIT signal to the worker. Ie it's only used when the action is normal

The timeout is here to prevent busy workers to block others requests. If they don't notify the arbiter in a time less than the timeout the worker is simply exited and the connection with the client closed.

tuukkamustonen · 2017-04-17T10:38:03Z

Uh, alright. Does timeout have any effect when you:

quit the arbiter, upgrade it (USR2), send an HUP signal to the arbiter or send a QUIT signal to the worker

I mean, what if the worker doesn't shutdown in graceful_timeout - will timeout kick in after that and workers get forcefully killed, or is it left up to user to call for SIGQUIT in case they don't gracefully die?

QUIT signal to the worker

I assume you meant TERM here (as QUIT is documented as quick shutdown for both master and workers)?

benoitc · 2017-04-21T08:56:33Z

if the worker doesn't shutdown during the graceful time it will be killed without any other delay.

tuukkamustonen · 2017-04-21T10:10:14Z

Of course. Thanks for clarifying things up!

tuukkamustonen · 2017-08-09T11:01:54Z

@benoitc Asking in the context of this old ticket - what does the last sentence in timeout documentation actually mean?

Generally set to thirty seconds. Only set this noticeably higher if you’re sure of the repercussions for sync workers. For the non sync workers it just means that the worker process is still communicating and is not tied to the length of time required to handle a single request.

Not being a native english speaker, I have hard time understanding this. Does it mean that timeout is not supported for non-sync workers (because that's what I seem to be witnessing: I'm using gthread workers and timeout is not kickin in and killing too slow requests)?

tilgovi · 2017-08-09T17:52:12Z

@tuukkamustonen --timeout is not meant as a request timeout. It's meant as a liveness check for workers. For sync workers, this functions as a request timeout because the worker cannot do anything other than process the request. The asynchronous workers heartbeat even while they are handling long running requests, so unless the worker blocks/freezes it won't be killed.

Maybe it would be a good idea for us to change the name if other people find this confusing.

tuukkamustonen · 2017-08-10T04:44:14Z

@tilgovi timeout is just fine, although something like worker_timeout might be more descriptive. I just got initially confused because timeout and graceful_timeout are declared next to each other in the documentation, so my brain assumed they are tightly connected, while they are actually not.

For sync workers, this functions as a request timeout because the worker cannot do anything other than process the request. The asynchronous workers heartbeat even while they are handling long running requests, so unless the worker blocks/freezes it won't be killed.

Would you have an example of when timeout kicks in with non-sync workers? Is it something that should never happen, really - maybe only if there's a bug that causes the worker to block/freeze?

tilgovi · 2017-08-10T04:57:53Z

That's correct. An asynchronous worker that relies on an event loop core might perform a CPU intensive procedure that doesn't yield within the timeout.

tilgovi · 2017-08-10T05:04:51Z

Not only a bug, in other words. Although, sometimes it may indicate a bug, such as a call to a blocking I/O function when an asyncio protocol would be more appropriate.

tuukkamustonen · 2017-08-10T05:22:16Z

Getting stuck in CPU intensive task is a good example, thanks.

Calling blocking I/O in async code is one as well, but I'm not sure how it applies to this context - I'm running a traditional Flask app with blocking code but running it with an async worker (gthread) without any sort of monkey patching. And it works just ok. I know this isn't really in the context of this ticket anymore, but doesn't mixing and matching async/sync code like this cause problems?

tuukkamustonen · 2017-08-10T05:47:42Z

Also, what is the heartbeat interval? What would be a sane value to use for timeout with non-sync workers?

tilgovi · 2017-08-10T05:52:55Z

The gthread worker is not asynchronous, but it does have a main thread for the heartbeat so it won't timeout either. In the case of that worker, you probably won't see a timeout unless the worker is very overloaded or, more likely, you call a C extension module the doesn't release the GIL.

You probably don't have to change the timeout unless you start seeing worker timeouts.

tuukkamustonen · 2017-08-10T06:19:23Z

Alright. Just one more thing:

The gthread worker is not asynchronous

It may be a little bit confusing that gthread worker is not async but is listed as "AsyncIO" workers at http://docs.gunicorn.org/en/stable/design.html#asyncio-workers. Other than that, using "threads" doesn't need asyncio, so that also raises questions in the reader. Just saying this from a perspective of a naive user, I'm sure it's all technically well-founded.

In a nutshell, the gthread worker is implemented with asyncio lib but it spawns threads to handle sync code. Correct me if wrong.

tilgovi · 2017-08-10T06:30:08Z

Glad you asked!

The threaded worker does not use asyncio and does not inherit from the base asynchronous worker class.

We should clarify the documentation. I think it may have been listed as async because the worker timeout is handled concurrently, making it behave more like the async workers than the sync worker with respect to ability to handle long requests and concurrent requests.

It would be great to clarify the documentation and make it more accurately describe all the workers.

benoitc · 2017-08-11T09:04:48Z

yeah the gthreads worker shouldn't listed in asyncio worker. maybe having a section that describe the design of each workers is better?

tilgovi · 2017-08-21T21:29:13Z

Re-opening this so we can track it as work to clarify the section on worker types and timeouts.

aschatten · 2018-01-09T01:22:23Z

@tilgovi

--timeout is not meant as a request timeout. It's meant as a liveness check for workers. For sync workers, this functions as a request timeout because the worker cannot do anything other than process the request. The asynchronous workers heartbeat even while they are handling long running requests, so unless the worker blocks/freezes it won't be killed.

Is there a request timeout option available for async workers? In other words how to make arbiter kill a worker that did not process a request within a specified time?

tilgovi · 2018-01-09T02:56:54Z

@aschatten there is not, unfortunately. See also #1658.

tuukkamustonen · 2018-01-09T04:55:42Z

kill a worker that did not process a request within a specified time

As a worker may be processing multiple requests concurrently, killing whole worker because one request times out sounds pretty extreme. Wouldn't that result in all the other requests getting killed in vain?

I recall uWSGI was planning to introduce thread-based killing in 2.1 or so, though probably even that applies for sync/threaded workers only (and my recollection on this is vague).

aschatten · 2018-01-09T20:09:42Z

As a worker may be processing multiple requests concurrently, killing whole worker because one request times out sounds pretty extreme. Wouldn't that would result in all the other requests getting killed in vain?

The approach can be the same as for max_request, where there is a separate implementation for every worker type.

tilgovi · 2018-01-09T20:29:01Z

We are working on a release this week, at which point it may be time to branch for R20, where we plan to tackle a few major things. That might be the right time to make the current timeout into a proper request timeout for every worker type.

ghost · 2018-02-08T11:15:25Z

Commenting here instead of filing a separate issue since I'm trying to understand how timeout is supposed to work and I'm not sure whether this is a bug or not.

The IMO unexpected behaviour I'm seeing is this:

Every max-requests'th request (the one after which worker will be restarted) is timeouted, whereas the other requests are completed successfully. In the below example 4 requests are performed, requests 1, 2, and 4 are successful, whereas request 3 fails.

Relevant configuration:

gthread worker
serving request takes longer than timeout
max-requests is non-zero

import time

def app(environ, start_response):
    start_response('200 OK', [('Content-type', 'text/plain; charset=utf-8')])
    time.sleep(5)
    return [b"Hello World\n"]

gunicorn:

gunicorn --log-level debug -k gthread -t 4 --max-requests 3 "app:app"
...
[2018-02-08 10:11:59 +0200] [28592] [INFO] Starting gunicorn 19.7.1
[2018-02-08 10:11:59 +0200] [28592] [DEBUG] Arbiter booted
[2018-02-08 10:11:59 +0200] [28592] [INFO] Listening at: http://127.0.0.1:8000 (28592)
[2018-02-08 10:11:59 +0200] [28592] [INFO] Using worker: gthread
[2018-02-08 10:11:59 +0200] [28595] [INFO] Booting worker with pid: 28595
[2018-02-08 10:11:59 +0200] [28592] [DEBUG] 1 workers
[2018-02-08 10:12:06 +0200] [28595] [DEBUG] GET /
[2018-02-08 10:12:11 +0200] [28595] [DEBUG] Closing connection.
[2018-02-08 10:12:15 +0200] [28595] [DEBUG] GET /
[2018-02-08 10:12:20 +0200] [28595] [DEBUG] Closing connection.
[2018-02-08 10:12:23 +0200] [28595] [DEBUG] GET /
[2018-02-08 10:12:23 +0200] [28595] [INFO] Autorestarting worker after current request.
[2018-02-08 10:12:27 +0200] [28592] [CRITICAL] WORKER TIMEOUT (pid:28595)
[2018-02-08 10:12:27 +0200] [28595] [INFO] Worker exiting (pid: 28595)
[2018-02-08 10:12:28 +0200] [28595] [DEBUG] Closing connection.
[2018-02-08 10:12:28 +0200] [28599] [INFO] Booting worker with pid: 28599
[2018-02-08 10:12:32 +0200] [28599] [DEBUG] GET /
[2018-02-08 10:12:37 +0200] [28599] [DEBUG] Closing connection.
^C[2018-02-08 10:12:39 +0200] [28592] [INFO] Handling signal: int

Client:

[salonen@mac ~]$ curl http://127.0.0.1:8000
Hello World
[salonen@mac ~]$ curl http://127.0.0.1:8000
Hello World
[salonen@mac ~]$ curl http://127.0.0.1:8000
curl: (52) Empty reply from server
[salonen@mac ~]$ curl http://127.0.0.1:8000
Hello World

benoitc · 2019-01-10T07:57:50Z

what should be the plan there? I have in mind the following:

update worker description (if still needed)
document the protocol to detect dead or blocked workers

Should it be 20.0 or could we postpone it?

benoitc · 2019-01-24T15:42:23Z

postponing.

lucas03 · 2019-07-31T12:10:56Z

Hey, so this won't be part of 20.0?

That might be the right time to make the current timeout into a proper request timeout for every worker type.

benoitc · 2019-11-22T21:49:13Z

clarified. @lucas03 it's unclear what a request timeout is there. please open a ticket if you need something specific?.

YuryHrytsuk · 2022-06-14T11:02:06Z

Having read the discussion I have got a question. Does it make sense to keep graceful_timeout always bigger than timeout?

benoitc · 2022-06-20T22:59:22Z

they have different purposes. So yes.

diamond-fish · 2023-10-26T22:02:33Z

Interesting that Microsoft Azure docs use --timeout 600 without specifying a larger --graceful-timeout: https://learn.microsoft.com/en-us/azure/app-service/configure-language-python#flask-app

benoitc added Discussion Documentation labels Apr 12, 2017

benoitc added this to Unanswered in Forum Apr 17, 2017

tuukkamustonen closed this as completed Apr 21, 2017

berkerpeksag moved this from Unanswered to To close in Forum Jul 21, 2017

tilgovi reopened this Aug 21, 2017

tilgovi added this to the 20.0 milestone Jan 9, 2018

benoitc assigned benoitc, tilgovi and berkerpeksag Jan 10, 2019

benoitc removed this from the 20.0 milestone Jan 24, 2019

tilgovi mentioned this issue Sep 17, 2019

Can I set the default response when timeout is occurred? #2111

Closed

stefankoegl mentioned this issue Oct 16, 2019

Occasional SystemExit in gunicorn gpodder/mygpo#273

Open

benoitc closed this as completed Nov 22, 2019

euri10 mentioned this issue Apr 7, 2020

Gunicorn with UvicornWorker is not respecting the timeout encode/uvicorn#611

Closed

jpmckinney mentioned this issue Nov 12, 2021

docs: gthread is a sync worker #2686

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Clarify what/how timeout and graceful_timeout work #1493

Clarify what/how timeout and graceful_timeout work #1493

tuukkamustonen commented Apr 3, 2017 •

edited

tuukkamustonen commented Apr 3, 2017

tuukkamustonen commented Apr 3, 2017 •

edited

tuukkamustonen commented Apr 3, 2017 •

edited

tuukkamustonen commented Apr 6, 2017

benoitc commented Apr 17, 2017

tuukkamustonen commented Apr 17, 2017

benoitc commented Apr 21, 2017

tuukkamustonen commented Apr 21, 2017

tuukkamustonen commented Aug 9, 2017

tilgovi commented Aug 9, 2017

tuukkamustonen commented Aug 10, 2017 •

edited

tilgovi commented Aug 10, 2017

tilgovi commented Aug 10, 2017

tuukkamustonen commented Aug 10, 2017 •

edited

tuukkamustonen commented Aug 10, 2017

tilgovi commented Aug 10, 2017

tuukkamustonen commented Aug 10, 2017

tilgovi commented Aug 10, 2017

benoitc commented Aug 11, 2017

tilgovi commented Aug 21, 2017

aschatten commented Jan 9, 2018 •

edited

tilgovi commented Jan 9, 2018

tuukkamustonen commented Jan 9, 2018 •

edited

aschatten commented Jan 9, 2018 •

edited

tilgovi commented Jan 9, 2018

ghost commented Feb 8, 2018

benoitc commented Jan 10, 2019

benoitc commented Jan 24, 2019

lucas03 commented Jul 31, 2019

benoitc commented Nov 22, 2019

YuryHrytsuk commented Jun 14, 2022

benoitc commented Jun 20, 2022

diamond-fish commented Oct 26, 2023

Clarify what/how timeout and graceful_timeout work #1493

Clarify what/how timeout and graceful_timeout work #1493

Comments

tuukkamustonen commented Apr 3, 2017 • edited

tuukkamustonen commented Apr 3, 2017

tuukkamustonen commented Apr 3, 2017 • edited

tuukkamustonen commented Apr 3, 2017 • edited

tuukkamustonen commented Apr 6, 2017

benoitc commented Apr 17, 2017

tuukkamustonen commented Apr 17, 2017

benoitc commented Apr 21, 2017

tuukkamustonen commented Apr 21, 2017

tuukkamustonen commented Aug 9, 2017

tilgovi commented Aug 9, 2017

tuukkamustonen commented Aug 10, 2017 • edited

tilgovi commented Aug 10, 2017

tilgovi commented Aug 10, 2017

tuukkamustonen commented Aug 10, 2017 • edited

tuukkamustonen commented Aug 10, 2017

tilgovi commented Aug 10, 2017

tuukkamustonen commented Aug 10, 2017

tilgovi commented Aug 10, 2017

benoitc commented Aug 11, 2017

tilgovi commented Aug 21, 2017

aschatten commented Jan 9, 2018 • edited

tilgovi commented Jan 9, 2018

tuukkamustonen commented Jan 9, 2018 • edited

aschatten commented Jan 9, 2018 • edited

tilgovi commented Jan 9, 2018

ghost commented Feb 8, 2018

benoitc commented Jan 10, 2019

benoitc commented Jan 24, 2019

lucas03 commented Jul 31, 2019

benoitc commented Nov 22, 2019

YuryHrytsuk commented Jun 14, 2022

benoitc commented Jun 20, 2022

diamond-fish commented Oct 26, 2023

tuukkamustonen commented Apr 3, 2017 •

edited

tuukkamustonen commented Apr 3, 2017 •

edited

tuukkamustonen commented Apr 3, 2017 •

edited

tuukkamustonen commented Aug 10, 2017 •

edited

tuukkamustonen commented Aug 10, 2017 •

edited

aschatten commented Jan 9, 2018 •

edited

tuukkamustonen commented Jan 9, 2018 •

edited

aschatten commented Jan 9, 2018 •

edited