Replies: 1 comment 3 replies
-
OK, not a lot of help here, but we found some more data. Apparently pgbouncer that is nearest DB doesn't (sometimes) reuse connection. Speficially - we use transaction pooling everywhere. And in case where cl_waiting was increasing on top-level bouncers (the ones that app talks to), we noticed that lower level bouncer keps idle connections (not idle in transaction) for MANY hours instead of reusing them. This lead to a pool of 20 connections established to db, where only 2-3 were doing work, and the rest (17-18) were sitting idle, with no work (I checked state_change in pg_stat_activity) for, even, 10+ hours! This is, of course, not usable, so we added a tool that kills all such connections, but we'd like to be able to debug it more. Any ideas on what to do? How could this happened? |
Beta Was this translation helpful? Give feedback.
-
Hi,
we have following situation: databases on AWS/EC2 (not RDS). For each group we have three servers:
There are also 2-3 (usually 3) pgbouncer servers, that have appropriate pools defined to connect to primary and secondary. All 3 bouncers have the same config, they are just in different AWS availability zones.
As far as OS goes, we use Ubuntu bionic, and pgBouncer 1.11.
Some time ago (couple of months) we decided to add more bouncer, not as a server, but plain daemon on pg servers.
The idea is that application connects to nearest pgbouncer on bouncer server, this connects to pgbouncer on appopriate pg servers, and this in turn connect to database.
This was done so that we'll see more reuse of pg connections between app servers from different AZs.
It works. Generally.
Sometimes, in a way that defies all my tries to debug it, something happens.
Something is:
this situation doesn't seem to be related to spikes in load, at least in way that I'd see.
What's more troubling - often (though not always) cl_waiting manitest only on 1-2 bouncer servers, and not all!
Since our configs are built by ansible, I'm reasonably sure that configs are the same.
What we do is simply restart pgbouncer on pg server, and the problem goes away.
Out of desperation we added cronjob to daily to online-restart of pgbouncer on pg servers, but the problem still happens.
The thing is that it sometimes happens ~ week apart, and sometimes it works well for 3 weeks.
This config is successfully deployed on ~ 50 such server groups, and the problems seems to happen only on subset of the groups. But - we haven't been able to find any commonalities.
Recently (this week) we upgraded pgbouncer on pg servers to 1.15 in the hope that maybe it will fix the issue. So far the problem hasn't re-appeared, but it hasn't been long since the upgrade.
Is there anything you can think of that would help me debug/diagnose the problem?
I will be working to get some more system stats, nework stats, and similar on these servers, the problem is that due to random nature of the situation, it's not trivial to catch the problem while it's there.
Hope you can help me somehow...
depesz
Beta Was this translation helpful? Give feedback.
All reactions