
Keep Puma worker metrics segregated by PID #253

Open
KaoruDev opened this issue Nov 14, 2022 · 4 comments

Comments

@KaoruDev

I'm investigating behavior in prod where a few requests seemingly create a backlog of requests in Puma.

From my understanding of the Puma docs, each worker pulls requests from the shared UNIX socket more or less at random. I believe that if one of those workers is processing a "heavy" request, a backlog of requests can build up behind it, causing high latency for requests that are typically "fast". However, it is not possible for me to verify this, because all of the worker stats are aggregated here.

I propose not summing the metrics and instead labeling them with the PID of each worker, e.g.:

ruby_puma_request_backlog{phase="0",pid="1"} 0
ruby_puma_request_backlog{phase="0",pid="2"} 2
ruby_puma_request_backlog{phase="0", pid="3} 6

If PID is too high-cardinality, the worker index (idx) seems reasonable to me too.
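For illustration, something along these lines could gather the per-worker numbers instead of a sum. This is only a rough sketch, not the exporter's current collector: it assumes Puma >= 5 in cluster mode (so Puma.stats_hash exposes :worker_status), and the "puma_worker" type plus the matching server-side collector are hypothetical and would have to be registered separately.

```ruby
# Sketch: push per-worker backlog with a pid label instead of a summed gauge.
require "puma"
require "prometheus_exporter/client"

def report_per_worker_backlog(client: PrometheusExporter::Client.default)
  stats = Puma.stats_hash
  Array(stats[:worker_status]).each do |worker|
    client.send_json(
      type: "puma_worker",                         # hypothetical custom type
      pid: worker[:pid],
      phase: worker[:phase],
      backlog: worker.dig(:last_status, :backlog)  # per-worker, not summed
    )
  end
end
```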

@SamSaffron
Member

Sounds reasonable to me ... @nateberkopec, any concerns here? Should we be providing instrumentation down to this fidelity?

@nateberkopec

Ah, sorry, this is a common misunderstanding. We really should have named the Puma stats option something other than backlog...

As we specify in the docs you linked to:

> This socket backlog is distinct from the backlog of work as reported by Puma.stats or the control server. The backlog that Puma.stats refers to represents the number of connections in the process' todo set waiting for a thread from the ThreadPool.

What Puma.stats reports as backlog is not related to workers processing a heavy request. Most workers should report 0 for backlog most of the time, even when requests are heavy or load is high, because we don't accept on the socket when the thread pool is 100% full/busy.

To log this you would probably need to do something with netstat or use a request header like X-Request-Start, added by a load balancer or reverse proxy, to judge how long a request has waited to be accepted.
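Something like this (an untested sketch, not a built-in) could measure that: a small Rack middleware that reads X-Request-Start and logs how long the request waited before Puma picked it up. It assumes the proxy writes the header as "t=<unix seconds with a fractional part>"; the exact format varies by load balancer, so adjust the parsing accordingly.

```ruby
# Sketch: estimate queue time from an X-Request-Start header set by the
# load balancer / reverse proxy, before the app sees the request.
require "logger"

class QueueTimeLogger
  def initialize(app, logger: Logger.new($stdout))
    @app = app
    @logger = logger
  end

  def call(env)
    if (raw = env["HTTP_X_REQUEST_START"])
      started = raw.delete_prefix("t=").to_f
      queue_ms = ((Time.now.to_f - started) * 1000).round(1)
      @logger.info("request waited #{queue_ms}ms before being accepted") if queue_ms >= 0
    end
    @app.call(env)
  end
end
```

You would add it near the top of the middleware stack (e.g. `use QueueTimeLogger` in config.ru) so the timestamp is taken as early as possible.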

@KaoruDev
Author

> we don't accept on the socket when the thread pool is 100% full/busy.

Ah -- so what causes this gauge to be non-zero then?

@nateberkopec

@KaoruDev Backlog will be >0 as long as any thread is busy. Roughly, most of the time, backlog == the number of threads that have accepted a connection but haven't yet finished responding. So, as I said, backlog is sort of misleading, because it implies work that hasn't started yet, when in fact it has!

Your thesis is still kind of true in a way, though: Puma workers with higher backlogs will have higher response latency than workers with 0 backlog. This fact is the inspiration behind our wait_for_less_busy_worker feature, which is now on by default in Puma 6. With this feature, workers with backlog > 0 will not immediately accept new connections, letting workers with backlog == 0 pick up those connections first, leading to lower response latency overall.

So I would say upgrade to Puma 6 if you haven't already!
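If you are still on Puma 5 for now, the option exists there as an opt-in (experimental) config DSL call. A sketch, where the worker and thread counts are just placeholders:

```ruby
# config/puma.rb (sketch)
# On Puma 5, wait_for_less_busy_worker is opt-in; on Puma 6 it is the default,
# so the explicit call below is only needed on Puma 5.
workers ENV.fetch("WEB_CONCURRENCY", 2).to_i
threads 5, 5

wait_for_less_busy_worker 0.005  # delay (seconds) before a busy worker accepts new connections
```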
