Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clarify what/how timeout and graceful_timeout work #1493

Closed
tuukkamustonen opened this issue Apr 3, 2017 · 33 comments
Closed

Clarify what/how timeout and graceful_timeout work #1493

tuukkamustonen opened this issue Apr 3, 2017 · 33 comments

Comments

@tuukkamustonen
Copy link

tuukkamustonen commented Apr 3, 2017

(Sorry for the monologue here: simple things got complicated and I ended up digging through the stack. Hopefully what I've documented is helpful for the reader, however.)

As I've understood, by default:

  • After 30 seconds (configurable with timeout) of request processing, gunicorn master process sends SIGTERM to the worker process, to initiate a graceful restart.
  • If worker does not shutdown during another 30 seconds (configurable with graceful_timeout), master process sends SIGKILL. Seems like this signal is also sent when the worker does gracefully shutdown during the graceful_timeout period (d1a0973).

The questions:

  • Are the signals correct?
  • What actually happens when gunicorn (sync) worker receives these signals? How does it tell WSGI app that the signal was caught and something should happen (ok I assume it just "passes it on")?
  • How does, for example, Flask handle SIGTERM signal - in practice, what happens during request processing? Does it just set a flag for the WSGI application (on werkzeug level) that it should shutdown after the request processing is complete? Or does SIGTERM already somehow affect ongoing request processing - kill IO connections or something to speed up request processing...?

On SIGKILL, I guess request processing is just forcefully aborted.

I could file a tiny PR to improve docs about this, if I get understanding how things actually work.

@tuukkamustonen
Copy link
Author

Hmm, I think #1236 (comment) confirms my assumptions about SIGTERM simply setting worker to shutdown after request processing is complete (and setting worker not to accept any new connections).

@tuukkamustonen
Copy link
Author

tuukkamustonen commented Apr 3, 2017

Seems like how I interpreted timeout and graceful_timeout is wrong. Both periods actually refer to time at the start of request processing. So, by default, because both settings are set to 30 seconds, there is no graceful restart enabled. If I do something like --graceful-timeout 15 --timeout 30 that should mean graceful restart is initiated at 15 seconds and worker is force killed at 30 seconds if request did not complete before that.

However, it seems like if response is returned between graceful_timeout and timeout, then the worker is not restarted after all? Shouldn't it?

I tested by app.py:

import time
from flask import Flask

app = Flask(__name__)

@app.route('/foo')
def foo():
    time.sleep(3)
    return 'ok'

Then:

12:51 $ gunicorn app:app --timeout 5 --graceful-timeout 1
[2017-04-03 12:51:37 +0300] [356] [INFO] Starting gunicorn 19.6.0
[2017-04-03 12:51:37 +0300] [356] [INFO] Listening at: http://127.0.0.1:8000 (356)
[2017-04-03 12:51:37 +0300] [356] [INFO] Using worker: sync
[2017-04-03 12:51:37 +0300] [359] [INFO] Booting worker with pid: 359

Then I send curl localhost:8000/foo, which returns after 3 seconds. But nothing happens in gunicorn - I see no trace of graceful restart being initiated or happened?

@tuukkamustonen
Copy link
Author

tuukkamustonen commented Apr 3, 2017

It seems that on timeout, SystemExit(1,) is thrown, aborting current request processing in Flask. What code or signal generates it, can't say.

This exception gets thrown through Flask stack, and any teardown_request handlers catch it. There's enough time to log something, but if you do time.sleep(1) or something else time consuming in the handler, it gets silently killed. It's as if there was 100-200ms time before the process is actually forcefully terminated and I am wondering what is this delay. It's not graceful timeout, that setting has no impact on the delay. I would expect the process to be just forcefully killed in place, instead of seeing SystemExit being thrown through the stack, but then potentially killing the process mid-air anyway.

In fact, I don't see graceful_timeout doing anything - maybe it's not supported for sync workers, or maybe it doesn't work "stand-alone" (or together with timeout) - only when you manually send SIGTERM?

Also what might be weird is that https://github.com/benoitc/gunicorn/blob/master/gunicorn/arbiter.py#L392 does not check graceful flag at all. I guess https://github.com/benoitc/gunicorn/blob/master/gunicorn/arbiter.py#L390 ensures that self.WORKERS is empty so graceful timeout is not waited when doing non-graceful stop.

@tuukkamustonen
Copy link
Author

@benoitc @tilgovi Care to give a hand here? Hopefully my writings above make sense...

@benoitc
Copy link
Owner

benoitc commented Apr 17, 2017

@tuco86 The graceful timeout is only available when you quit the arbiter, upgrade it (USR2), send an HUP signal to the arbiter or send a QUIT signal to the worker. Ie it's only used when the action is normal

The timeout is here to prevent busy workers to block others requests. If they don't notify the arbiter in a time less than the timeout the worker is simply exited and the connection with the client closed.

@tuukkamustonen
Copy link
Author

Uh, alright. Does timeout have any effect when you:

quit the arbiter, upgrade it (USR2), send an HUP signal to the arbiter or send a QUIT signal to the worker

I mean, what if the worker doesn't shutdown in graceful_timeout - will timeout kick in after that and workers get forcefully killed, or is it left up to user to call for SIGQUIT in case they don't gracefully die?

QUIT signal to the worker

I assume you meant TERM here (as QUIT is documented as quick shutdown for both master and workers)?

@benoitc
Copy link
Owner

benoitc commented Apr 21, 2017

if the worker doesn't shutdown during the graceful time it will be killed without any other delay.

@tuukkamustonen
Copy link
Author

Of course. Thanks for clarifying things up!

@berkerpeksag berkerpeksag moved this from Unanswered to To close in Forum Jul 21, 2017
@tuukkamustonen
Copy link
Author

@benoitc Asking in the context of this old ticket - what does the last sentence in timeout documentation actually mean?

Generally set to thirty seconds. Only set this noticeably higher if you’re sure of the repercussions for sync workers. For the non sync workers it just means that the worker process is still communicating and is not tied to the length of time required to handle a single request.

Not being a native english speaker, I have hard time understanding this. Does it mean that timeout is not supported for non-sync workers (because that's what I seem to be witnessing: I'm using gthread workers and timeout is not kickin in and killing too slow requests)?

@tilgovi
Copy link
Collaborator

tilgovi commented Aug 9, 2017

@tuukkamustonen --timeout is not meant as a request timeout. It's meant as a liveness check for workers. For sync workers, this functions as a request timeout because the worker cannot do anything other than process the request. The asynchronous workers heartbeat even while they are handling long running requests, so unless the worker blocks/freezes it won't be killed.

Maybe it would be a good idea for us to change the name if other people find this confusing.

@tuukkamustonen
Copy link
Author

tuukkamustonen commented Aug 10, 2017

@tilgovi timeout is just fine, although something like worker_timeout might be more descriptive. I just got initially confused because timeout and graceful_timeout are declared next to each other in the documentation, so my brain assumed they are tightly connected, while they are actually not.

For sync workers, this functions as a request timeout because the worker cannot do anything other than process the request. The asynchronous workers heartbeat even while they are handling long running requests, so unless the worker blocks/freezes it won't be killed.

Would you have an example of when timeout kicks in with non-sync workers? Is it something that should never happen, really - maybe only if there's a bug that causes the worker to block/freeze?

@tilgovi
Copy link
Collaborator

tilgovi commented Aug 10, 2017

That's correct. An asynchronous worker that relies on an event loop core might perform a CPU intensive procedure that doesn't yield within the timeout.

@tilgovi
Copy link
Collaborator

tilgovi commented Aug 10, 2017

Not only a bug, in other words. Although, sometimes it may indicate a bug, such as a call to a blocking I/O function when an asyncio protocol would be more appropriate.

@tuukkamustonen
Copy link
Author

tuukkamustonen commented Aug 10, 2017

Getting stuck in CPU intensive task is a good example, thanks.

Calling blocking I/O in async code is one as well, but I'm not sure how it applies to this context - I'm running a traditional Flask app with blocking code but running it with an async worker (gthread) without any sort of monkey patching. And it works just ok. I know this isn't really in the context of this ticket anymore, but doesn't mixing and matching async/sync code like this cause problems?

@tuukkamustonen
Copy link
Author

Also, what is the heartbeat interval? What would be a sane value to use for timeout with non-sync workers?

@tilgovi
Copy link
Collaborator

tilgovi commented Aug 10, 2017

The gthread worker is not asynchronous, but it does have a main thread for the heartbeat so it won't timeout either. In the case of that worker, you probably won't see a timeout unless the worker is very overloaded or, more likely, you call a C extension module the doesn't release the GIL.

You probably don't have to change the timeout unless you start seeing worker timeouts.

@tuukkamustonen
Copy link
Author

Alright. Just one more thing:

The gthread worker is not asynchronous

It may be a little bit confusing that gthread worker is not async but is listed as "AsyncIO" workers at http://docs.gunicorn.org/en/stable/design.html#asyncio-workers. Other than that, using "threads" doesn't need asyncio, so that also raises questions in the reader. Just saying this from a perspective of a naive user, I'm sure it's all technically well-founded.

In a nutshell, the gthread worker is implemented with asyncio lib but it spawns threads to handle sync code. Correct me if wrong.

@tilgovi
Copy link
Collaborator

tilgovi commented Aug 10, 2017

Glad you asked!

The threaded worker does not use asyncio and does not inherit from the base asynchronous worker class.

We should clarify the documentation. I think it may have been listed as async because the worker timeout is handled concurrently, making it behave more like the async workers than the sync worker with respect to ability to handle long requests and concurrent requests.

It would be great to clarify the documentation and make it more accurately describe all the workers.

@benoitc
Copy link
Owner

benoitc commented Aug 11, 2017

yeah the gthreads worker shouldn't listed in asyncio worker. maybe having a section that describe the design of each workers is better?

@tilgovi tilgovi reopened this Aug 21, 2017
@tilgovi
Copy link
Collaborator

tilgovi commented Aug 21, 2017

Re-opening this so we can track it as work to clarify the section on worker types and timeouts.

@aschatten
Copy link

aschatten commented Jan 9, 2018

@tilgovi

--timeout is not meant as a request timeout. It's meant as a liveness check for workers. For sync workers, this functions as a request timeout because the worker cannot do anything other than process the request. The asynchronous workers heartbeat even while they are handling long running requests, so unless the worker blocks/freezes it won't be killed.

Is there a request timeout option available for async workers? In other words how to make arbiter kill a worker that did not process a request within a specified time?

@tilgovi
Copy link
Collaborator

tilgovi commented Jan 9, 2018

@aschatten there is not, unfortunately. See also #1658.

@tuukkamustonen
Copy link
Author

tuukkamustonen commented Jan 9, 2018

kill a worker that did not process a request within a specified time

As a worker may be processing multiple requests concurrently, killing whole worker because one request times out sounds pretty extreme. Wouldn't that result in all the other requests getting killed in vain?

I recall uWSGI was planning to introduce thread-based killing in 2.1 or so, though probably even that applies for sync/threaded workers only (and my recollection on this is vague).

@aschatten
Copy link

aschatten commented Jan 9, 2018

As a worker may be processing multiple requests concurrently, killing whole worker because one request times out sounds pretty extreme. Wouldn't that would result in all the other requests getting killed in vain?

The approach can be the same as for max_request, where there is a separate implementation for every worker type.

@tilgovi
Copy link
Collaborator

tilgovi commented Jan 9, 2018

We are working on a release this week, at which point it may be time to branch for R20, where we plan to tackle a few major things. That might be the right time to make the current timeout into a proper request timeout for every worker type.

@tilgovi tilgovi added this to the 20.0 milestone Jan 9, 2018
@ghost
Copy link

ghost commented Feb 8, 2018

Commenting here instead of filing a separate issue since I'm trying to understand how timeout is supposed to work and I'm not sure whether this is a bug or not.

The IMO unexpected behaviour I'm seeing is this:

Every max-requests'th request (the one after which worker will be restarted) is timeouted, whereas the other requests are completed successfully. In the below example 4 requests are performed, requests 1, 2, and 4 are successful, whereas request 3 fails.

Relevant configuration:

  • gthread worker
  • serving request takes longer than timeout
  • max-requests is non-zero
import time

def app(environ, start_response):
    start_response('200 OK', [('Content-type', 'text/plain; charset=utf-8')])
    time.sleep(5)
    return [b"Hello World\n"]

gunicorn:

gunicorn --log-level debug -k gthread -t 4 --max-requests 3 "app:app"
...
[2018-02-08 10:11:59 +0200] [28592] [INFO] Starting gunicorn 19.7.1
[2018-02-08 10:11:59 +0200] [28592] [DEBUG] Arbiter booted
[2018-02-08 10:11:59 +0200] [28592] [INFO] Listening at: http://127.0.0.1:8000 (28592)
[2018-02-08 10:11:59 +0200] [28592] [INFO] Using worker: gthread
[2018-02-08 10:11:59 +0200] [28595] [INFO] Booting worker with pid: 28595
[2018-02-08 10:11:59 +0200] [28592] [DEBUG] 1 workers
[2018-02-08 10:12:06 +0200] [28595] [DEBUG] GET /
[2018-02-08 10:12:11 +0200] [28595] [DEBUG] Closing connection.
[2018-02-08 10:12:15 +0200] [28595] [DEBUG] GET /
[2018-02-08 10:12:20 +0200] [28595] [DEBUG] Closing connection.
[2018-02-08 10:12:23 +0200] [28595] [DEBUG] GET /
[2018-02-08 10:12:23 +0200] [28595] [INFO] Autorestarting worker after current request.
[2018-02-08 10:12:27 +0200] [28592] [CRITICAL] WORKER TIMEOUT (pid:28595)
[2018-02-08 10:12:27 +0200] [28595] [INFO] Worker exiting (pid: 28595)
[2018-02-08 10:12:28 +0200] [28595] [DEBUG] Closing connection.
[2018-02-08 10:12:28 +0200] [28599] [INFO] Booting worker with pid: 28599
[2018-02-08 10:12:32 +0200] [28599] [DEBUG] GET /
[2018-02-08 10:12:37 +0200] [28599] [DEBUG] Closing connection.
^C[2018-02-08 10:12:39 +0200] [28592] [INFO] Handling signal: int

Client:

[salonen@mac ~]$ curl http://127.0.0.1:8000
Hello World
[salonen@mac ~]$ curl http://127.0.0.1:8000
Hello World
[salonen@mac ~]$ curl http://127.0.0.1:8000
curl: (52) Empty reply from server
[salonen@mac ~]$ curl http://127.0.0.1:8000
Hello World

@benoitc
Copy link
Owner

benoitc commented Jan 10, 2019

what should be the plan there? I have in mind the following:

  • update worker description (if still needed)
  • document the protocol to detect dead or blocked workers

Should it be 20.0 or could we postpone it?

@benoitc
Copy link
Owner

benoitc commented Jan 24, 2019

postponing.

@benoitc benoitc removed this from the 20.0 milestone Jan 24, 2019
@lucas03
Copy link

lucas03 commented Jul 31, 2019

Hey, so this won't be part of 20.0?

That might be the right time to make the current timeout into a proper request timeout for every worker type.

@benoitc
Copy link
Owner

benoitc commented Nov 22, 2019

clarified. @lucas03 it's unclear what a request timeout is there. please open a ticket if you need something specific?.

@YuryHrytsuk
Copy link

Having read the discussion I have got a question. Does it make sense to keep graceful_timeout always bigger than timeout?

@benoitc
Copy link
Owner

benoitc commented Jun 20, 2022

they have different purposes. So yes.

@diamond-fish
Copy link

Interesting that Microsoft Azure docs use --timeout 600 without specifying a larger --graceful-timeout: https://learn.microsoft.com/en-us/azure/app-service/configure-language-python#flask-app

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
No open projects
Forum
To close
Development

No branches or pull requests

8 participants