Connection pool issue. Waiting for sv_login #1019

Open
vitabaks opened this issue Feb 6, 2024 · 4 comments

vitabaks commented Feb 6, 2024

Periodically I observe a problem with one of the pools (pool_size=40 pool_mode=transaction): all client connections switch to the waiting state and there are no active connections:

[screenshot]

And we see that there is only one server connection, and it is in the login state:

[screenshots]
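
For reference, the pool and connection states shown in these screenshots can also be read from the PgBouncer admin console; a minimal sketch, assuming the settings from the config below (listen_port = 6432, admin user postgres):

# Per-pool counters; a server connection stuck in the login phase is counted in the sv_login column
psql -h 127.0.0.1 -p 6432 -U postgres -d pgbouncer -c "SHOW POOLS;"

# Per-connection details for client and server connections
psql -h 127.0.0.1 -p 6432 -U postgres -d pgbouncer -c "SHOW CLIENTS;"
psql -h 127.0.0.1 -p 6432 -U postgres -d pgbouncer -c "SHOW SERVERS;"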

PgBouncer log:

2024-02-05 21:12:10.631 MSK [1502899] LOG stats: 784 xacts/s, 785 queries/s, in 403910 B/s, out 1893441 B/s, xact 6107 us, query 6103 us, wait 110 us
2024-02-05 21:13:10.627 MSK [1502899] LOG stats: 759 xacts/s, 759 queries/s, in 394653 B/s, out 1802318 B/s, xact 5603 us, query 5601 us, wait 35 us
2024-02-05 21:14:10.628 MSK [1502899] LOG stats: 740 xacts/s, 741 queries/s, in 439135 B/s, out 2223504 B/s, xact 5705 us, query 5702 us, wait 37 us
2024-02-05 21:15:10.628 MSK [1502899] LOG stats: 782 xacts/s, 783 queries/s, in 402146 B/s, out 1782092 B/s, xact 4810 us, query 4806 us, wait 132 us
2024-02-05 21:16:10.628 MSK [1502899] LOG stats: 663 xacts/s, 663 queries/s, in 355599 B/s, out 1540223 B/s, xact 7040 us, query 7038 us, wait 235 us
2024-02-05 21:16:48.969 MSK [1502899] LOG started sending cancel request
2024-02-05 21:16:48.975 MSK [1502899] LOG started sending cancel request
2024-02-05 21:16:49.017 MSK [1502899] LOG started sending cancel request
2024-02-05 21:16:49.017 MSK [1502899] LOG started sending cancel request
2024-02-05 21:16:49.023 MSK [1502899] LOG started sending cancel request
2024-02-05 21:17:07.070 MSK [1502899] LOG started sending cancel request
2024-02-05 21:17:07.070 MSK [1502899] LOG started sending cancel request
2024-02-05 21:17:07.150 MSK [1502899] LOG started sending cancel request
2024-02-05 21:17:07.150 MSK [1502899] LOG started sending cancel request
2024-02-05 21:17:07.174 MSK [1502899] LOG started sending cancel request
2024-02-05 21:17:07.193 MSK [1502899] LOG started sending cancel request
2024-02-05 21:17:07.193 MSK [1502899] LOG started sending cancel request
2024-02-05 21:17:07.206 MSK [1502899] LOG started sending cancel request
2024-02-05 21:17:07.224 MSK [1502899] LOG started sending cancel request
2024-02-05 21:17:07.229 MSK [1502899] LOG started sending cancel request
2024-02-05 21:17:07.241 MSK [1502899] LOG started sending cancel request
2024-02-05 21:17:07.254 MSK [1502899] LOG started sending cancel request
2024-02-05 21:17:07.295 MSK [1502899] LOG started sending cancel request
2024-02-05 21:17:07.297 MSK [1502899] LOG started sending cancel request
...
2024-02-05 21:17:16.474 MSK [1502899] LOG started sending cancel request
2024-02-05 21:17:16.543 MSK [1502899] LOG started sending cancel request
2024-02-05 21:17:16.543 MSK [1502899] LOG started sending cancel request
2024-02-05 21:18:10.631 MSK [1502899] LOG stats: 502 xacts/s, 502 queries/s, in 244007 B/s, out 1594010 B/s, xact 16062 us, query 16049 us, wait 1849494951 us
2024-02-05 21:19:10.631 MSK [1502899] LOG stats: 499 xacts/s, 499 queries/s, in 268585 B/s, out 1862475 B/s, xact 20141 us, query 20126 us, wait 6732413893 us
2024-02-05 21:20:10.631 MSK [1502899] LOG stats: 492 xacts/s, 493 queries/s, in 278989 B/s, out 2076211 B/s, xact 20656 us, query 20638 us, wait 11015335041 us
2024-02-05 21:21:10.631 MSK [1502899] LOG stats: 481 xacts/s, 481 queries/s, in 428009 B/s, out 3612005 B/s, xact 28941 us, query 28931 us, wait 14730572772 us
2024-02-05 21:21:10.851 MSK [1502899] WARNING C-0x7f1739c227e0: skms/skmsadmin@172.30.58.36:63258 pooler error: query_wait_timeout
2024-02-05 21:21:10.851 MSK [1502899] WARNING C-0x7f1736359820: skms/skmsadmin@172.30.58.36:63298 pooler error: query_wait_timeout
2024-02-05 21:21:10.851 MSK [1502899] WARNING C-0x7f1739c230e0: skms/skmsadmin@172.30.58.36:63314 pooler error: query_wait_timeout
2024-02-05 21:21:10.851 MSK [1502899] WARNING C-0x7f17363595e0: skms/skmsadmin@172.30.58.36:63362 pooler error: query_wait_timeout
2024-02-05 21:21:10.851 MSK [1502899] WARNING C-0x7f1739c23320: skms/skmsadmin@172.30.58.36:63456 pooler error: query_wait_timeout
2024-02-05 21:21:10.851 MSK [1502899] WARNING C-0x7f1736359ca0: skms/skmsadmin@172.30.58.36:63484 pooler error: query_wait_timeout
2024-02-05 21:21:10.851 MSK [1502899] WARNING C-0x7f173b819ae0: skms/skmsadmin@172.30.58.36:63480 pooler error: query_wait_timeout
2024-02-05 21:21:10.851 MSK [1502899] WARNING C-0x7f1736359ee0: skms/skmsadmin@172.30.58.36:63496 pooler error: query_wait_timeout
2024-02-05 21:21:10.851 MSK [1502899] WARNING C-0x7f1739c22ea0: skms/skmsadmin@172.30.58.36:63500 pooler error: query_wait_timeout
2024-02-05 21:21:10.851 MSK [1502899] WARNING C-0x7f1736359a60: skms/skmsadmin@172.30.58.36:63530 pooler error: query_wait_timeout
...

2024-02-05 21:59:59.853 MSK [1502899] WARNING C-0x7f172f45c3e0: skms/skmsadmin@172.30.58.36:65496 pooler error: query_wait_timeout
2024-02-05 21:59:59.853 MSK [1502899] WARNING C-0x7f17366ef160: skms/skmsadmin@172.30.58.36:65512 pooler error: query_wait_timeout
2024-02-05 21:59:59.853 MSK [1502899] WARNING C-0x7f172f2280e0: skms/skmsadmin@172.30.58.36:10002 pooler error: query_wait_timeout
2024-02-05 21:59:59.853 MSK [1502899] WARNING C-0x7f173671fbe0: skms/skmsadmin@172.30.58.36:10006 pooler error: query_wait_timeout

PgBouncer config:

[databases]
postgres = host=127.0.0.1 port=5432 dbname=postgres
sms = host=127.0.0.1 port=5432 dbname=sms pool_size=60 pool_mode=transaction
kms = host=127.0.0.1 port=5432 dbname=kms pool_size=60 pool_mode=transaction
okydrom = host=127.0.0.1 port=5432 dbname=okydrom pool_size=10 pool_mode=transaction
permission = host=127.0.0.1 port=5432 dbname=permission pool_size=10 pool_mode=transaction
errmap = host=127.0.0.1 port=5432 dbname=errmap pool_size=10 pool_mode=transaction
errmapshield = host=127.0.0.1 port=5432 dbname=errmapshield pool_size=10 pool_mode=transaction
shieldems = host=127.0.0.1 port=5432 dbname=shieldems pool_size=10 pool_mode=transaction
stat = host=127.0.0.1 port=5432 dbname=stat pool_size=40 pool_mode=transaction
skms = host=127.0.0.1 port=5432 dbname=skms pool_size=40 pool_mode=transaction
oms = host=127.0.0.1 port=5432 dbname=oms pool_size=10 pool_mode=transaction
stb = host=127.0.0.1 port=5432 dbname=stb pool_size=10 pool_mode=transaction
ums = host=127.0.0.1 port=5432 dbname=ums pool_size=10 pool_mode=transaction
cm_errmap = host=127.0.0.1 port=5432 dbname=cm_errmap pool_size=10 pool_mode=transaction

* = host=127.0.0.1 port=5432

[pgbouncer]
logfile = /var/log/pgbouncer/pgbouncer.log
pidfile = /run/pgbouncer/pgbouncer.pid
listen_addr = 0.0.0.0
listen_port = 6432
unix_socket_dir = /var/run/postgresql
auth_type = md5
auth_user = pgbouncer
auth_dbname = postgres
auth_query = SELECT usename, passwd FROM user_search($1)
admin_users = postgres
stats_users = postgres,pgwatch2,netdata
ignore_startup_parameters = extra_float_digits,geqo,search_path

pool_mode = session
server_reset_query = DISCARD ALL
max_client_conn = 100000
default_pool_size = 20
query_wait_timeout = 120
reserve_pool_size = 1
reserve_pool_timeout = 1
max_db_connections = 50000
pkt_buf = 8192
listen_backlog = 4096

log_connections = 0
log_disconnections = 0

Database - Sessions

[screenshot]

We see that idle sessions were closed during the incident.

There is no max_connections saturation on the Postgres server. There is nothing special in the Postgres log except "could not send data to client: Broken pipe".

We began to observe the problem after upgrading the system from Debian 9 to 11 and, accordingly, the pgbouncer package to 1.21.0.
At the moment we have rolled back to version 1.20.1, but the problem has happened again. We plan to roll back to an older version of the pgbouncer package, since we did not encounter this behavior before the upgrade.
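
For reference, a quick way to confirm which PgBouncer version is actually installed and running after a rollback (a sketch, assuming Debian packaging and the admin console access from the config above):

# Installed package version vs. the version reported by the running process
dpkg -s pgbouncer | grep -i '^Version'
psql -h 127.0.0.1 -p 6432 -U postgres -d pgbouncer -c "SHOW VERSION;"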

vitabaks commented Feb 16, 2024

The problem with the PgBouncer pools leads to incidents in production.

[screenshot]

We will update Debian 11.8 to 11.9 and PgBouncer to version 1.22, and let you know whether that helps.

JelteF commented Feb 16, 2024

Hmm, that's some pretty strange behaviour.

2024-02-05 21:16:48.969 MSK [1502899] LOG started sending cancel request
2024-02-05 21:16:48.975 MSK [1502899] LOG started sending cancel request
2024-02-05 21:16:49.017 MSK [1502899] LOG started sending cancel request
2024-02-05 21:16:49.017 MSK [1502899] LOG started sending cancel request
2024-02-05 21:16:49.023 MSK [1502899] LOG started sending cancel request
2024-02-05 21:17:07.070 MSK [1502899] LOG started sending cancel request
2024-02-05 21:17:07.070 MSK [1502899] LOG started sending cancel request

I'm wondering if these cancellations are a side effect or the cause of the issue. Could you also share the stats of the connections in active_cancel and waiting_cancel states?
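
(For what it's worth, these counters do not need a separate exporter; they can be read directly from the admin console. A sketch, assuming admin access on port 6432 as in the config above; in PgBouncer 1.18+ SHOW POOLS includes cancel-related columns such as cl_active_cancel_req, cl_waiting_cancel_req, sv_active_cancel and sv_being_canceled, though exact names depend on the version.)

# Cancel-related counters appear as extra columns in SHOW POOLS on recent versions
psql -h 127.0.0.1 -p 6432 -U postgres -d pgbouncer -c "SHOW POOLS;"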

2024-02-05 21:21:10.851 MSK [1502899] WARNING C-0x7f1739c227e0: skms/skmsadmin@172.30.58.36:63258 pooler error: query_wait_timeout
2024-02-05 21:21:10.851 MSK [1502899] WARNING C-0x7f1736359820: skms/skmsadmin@172.30.58.36:63298 pooler error: query_wait_timeout
2024-02-05 21:21:10.851 MSK [1502899] WARNING C-0x7f1739c230e0: skms/skmsadmin@172.30.58.36:63314 pooler error: query_wait_timeout

Lowering query_wait_timeout is something worth trying, but probably won't help.

There is no max_connection saturation on the Postgres server. There is nothing special in Postgres log except "could not send data to client: Broken pipe".

"Broken pipe" often suggests some network issue. Did you check that the Postgres server is still reachable from the PgBouncer server?

vitabaks commented:

@JelteF Thanks for the reply.

Hmm, that's some pretty strange behaviour.

I agree, this is the first time in years of operation.

Could you also share the stats of the connections in active_cancel and waiting_cancel states?

I don't seem to have such metrics (in Netdata).

Did you check that the Postgres server is still reachable from the PgBouncer server?

Yes, postgres is covered by several monitoring systems and all show that it was available.
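
For completeness, a direct check from the PgBouncer host looks something like this (127.0.0.1:5432 is the backend address from the config above; the postgres role and database are placeholders):

# Is the backend accepting connections at all?
pg_isready -h 127.0.0.1 -p 5432

# Can we log in and run a trivial query, bypassing PgBouncer?
psql -h 127.0.0.1 -p 5432 -U postgres -d postgres -c "SELECT 1;"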

P.S.

At the moment, the problem only occurs on Debian 11.8.

We have updated to 11.9 and are currently performing load testing. So far everything looks good, but we will leave it under load for a few more days (for reliability). I'll let you know the result.

vitabaks commented:

We have updated to 11.9 and are currently performing load testing. So far everything looks good, but we will leave it under load for a few more days (for reliability). I'll let you know the result.

The problem was not reproduced on Debian 11.9 during 48 hours of load testing.

We will also repeat the tests with PgBouncer 1.22 and plan upgrades in production. If everything is stable, I will close this issue.
