Hard to diagnose problem with pgBouncer #576

depesz · 2021-02-25T14:05:34Z

depesz
Feb 25, 2021

Hi,

we have following situation: databases on AWS/EC2 (not RDS). For each group we have three servers:

primary
secondary (streaming replica from primary)
backup (streaming replica from primary)

There are also 2-3 (usually 3) pgbouncer servers, that have appropriate pools defined to connect to primary and secondary. All 3 bouncers have the same config, they are just in different AWS availability zones.

As far as OS goes, we use Ubuntu bionic, and pgBouncer 1.11.

Some time ago (couple of months) we decided to add more bouncer, not as a server, but plain daemon on pg servers.

The idea is that application connects to nearest pgbouncer on bouncer server, this connects to pgbouncer on appopriate pg servers, and this in turn connect to database.

This was done so that we'll see more reuse of pg connections between app servers from different AZs.

It works. Generally.

Sometimes, in a way that defies all my tries to debug it, something happens.

Something is:

pgbouncers on bouncer servers report steady cl_waiting connections
pgbouncer on pg server is suddenly using 100% of cpu.

this situation doesn't seem to be related to spikes in load, at least in way that I'd see.

What's more troubling - often (though not always) cl_waiting manitest only on 1-2 bouncer servers, and not all!

Since our configs are built by ansible, I'm reasonably sure that configs are the same.

What we do is simply restart pgbouncer on pg server, and the problem goes away.

Out of desperation we added cronjob to daily to online-restart of pgbouncer on pg servers, but the problem still happens.

The thing is that it sometimes happens ~ week apart, and sometimes it works well for 3 weeks.

This config is successfully deployed on ~ 50 such server groups, and the problems seems to happen only on subset of the groups. But - we haven't been able to find any commonalities.

Recently (this week) we upgraded pgbouncer on pg servers to 1.15 in the hope that maybe it will fix the issue. So far the problem hasn't re-appeared, but it hasn't been long since the upgrade.

Is there anything you can think of that would help me debug/diagnose the problem?

I will be working to get some more system stats, nework stats, and similar on these servers, the problem is that due to random nature of the situation, it's not trivial to catch the problem while it's there.

Hope you can help me somehow...

depesz

depesz · 2021-03-17T09:48:44Z

depesz
Mar 17, 2021
Author

OK, not a lot of help here, but we found some more data.

Apparently pgbouncer that is nearest DB doesn't (sometimes) reuse connection.

Speficially - we use transaction pooling everywhere. And in case where cl_waiting was increasing on top-level bouncers (the ones that app talks to), we noticed that lower level bouncer keps idle connections (not idle in transaction) for MANY hours instead of reusing them.

This lead to a pool of 20 connections established to db, where only 2-3 were doing work, and the rest (17-18) were sitting idle, with no work (I checked state_change in pg_stat_activity) for, even, 10+ hours!

This is, of course, not usable, so we added a tool that kills all such connections, but we'd like to be able to debug it more. Any ideas on what to do? How could this happened?

3 replies

petere May 12, 2021
Maintainer

Sometimes, mysterious issues like this are related to the fragile cancel request handling in PgBouncer. Is that something you are using?

depesz May 12, 2021
Author

Not entirely sure what you mean by that. The connections we were seeing were from application (job processing mostly) and, afaik, they weren't dying or anything like this.

query_timeout is 0, if that matters.

eulerto May 13, 2021
Maintainer

There is a fix relate to query canceling (see commit 8df8f41).

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Hard to diagnose problem with pgBouncer #576

{{title}}

Replies: 1 comment 3 replies

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

Hard to diagnose problem with pgBouncer #576

depesz Feb 25, 2021

Replies: 1 comment · 3 replies

depesz Mar 17, 2021 Author

petere May 12, 2021 Maintainer

depesz May 12, 2021 Author

eulerto May 13, 2021 Maintainer

depesz
Feb 25, 2021

Replies: 1 comment 3 replies

depesz
Mar 17, 2021
Author

petere May 12, 2021
Maintainer

depesz May 12, 2021
Author

eulerto May 13, 2021
Maintainer