
63k file descriptors used for 400 connections - Too many open files #1037

Closed
bdoublet91 opened this issue Mar 11, 2024 · 25 comments
Labels: pending-close (will be closed if no further discussion)

Comments

@bdoublet91

Hi,
I'm facing a difficult issue with pgbouncer and postgres in production. I have administered a database cluster (haproxy + patroni + pgbouncer + postgres) for 4 years (I'm not a beginner), and now I need to raise the default connection settings because we have more clients.

I have three bare-metal servers (32 GB RAM + 12 vCPU). Each one runs haproxy + pgbouncer + patroni + postgres.

My problem is simple, but I don't have an explanation for it:
When there is a connection spike (around 400 connections, which is not huge ^^), my monitoring shows 63k open file descriptors on the server. I can't split this metric by process, but I see errors in the pgbouncer and postgres logs such as "Too many open files", along with server_login_retry messages from pgbouncer.

First of all, I increased the open file limits for the pgbouncer process (64000) and the postgres process (16384), but it still reaches the limit...

I could increase it again, but I want to understand why pgbouncer uses so many file descriptors for 400 connections.
I read that a client connection opens one file descriptor, so pgbouncer should only have needed about 400 fds, right?

pgbouncer config:

[databases]
* = host=127.0.0.1 port=5432 auth_user=pgbouncer 

[pgbouncer]
logfile = /var/log/pgbouncer/pgbouncer.log
pidfile = /var/run/pgbouncer/pgbouncer.pid
listen_addr = 10.*.*.*,127.0.0.1
listen_port = 6432
unix_socket_dir = /var/run/postgresql
auth_type = md5
auth_query = SELECT usename, passwd FROM user_search($1)
auth_file = /*/*/userlist.txt
ignore_startup_parameters = extra_float_digits,geqo
admin_users = pgbouncer
 
pool_mode = transaction
server_reset_query = DISCARD ALL
max_client_conn = 5000
default_pool_size = 125
reserve_pool_size = 20
reserve_pool_timeout = 1
max_db_connections = 250

server_lifetime = 3600
server_idle_timeout = 600
server_connect_timeout = 15
server_login_retry = 15
client_login_timeout = 60

log_disconnections = 1
log_connections = 1
log_pooler_errors = 1

max open files pgbouncer:

cat /proc/3960110/limits | grep files
Max open files            16384                16384                files 

max open files postgres:

 cat /proc/85875/limits | grep files
Max open files            16384                16384                files

I set max_connections for postgres to 500.

For now, I have disabled pgbouncer so my clients connect directly to the postgres database. Not recommended in production, but I have no choice...
I didn't find any existing issues like this for pgbouncer...

Thanks for your help

@JelteF (Member) commented Mar 11, 2024

I agree that that sounds like quite a lot of file descriptors for just 400 connections. Which PgBouncer version is this? And did you change the version when increasing the connection limits?

max open files pgbouncer:
cat /proc/3960110/limits | grep files
Max open files 16384 16384 files

Was that command run before increasing the file limit to 64000 for pgbouncer? If not, it seems like your increase didn't have the intended effect.

@JelteF (Member) commented Mar 11, 2024

I can't split this metric by process, but I see errors in the pgbouncer and postgres logs such as "Too many open files", along with server_login_retry messages from pgbouncer.

What errors do you have in the postgres logs? Because if pgbouncer is the one opening that many files, then postgres should not be running out of files (the limits are per process). Basically, can you share the exact errors from the logs?

@bdoublet91 (Author) commented Mar 11, 2024

When I disabled pgbouncer, the open file descriptor count dropped from 50k to 3k.

pgbouncer --version
PgBouncer 1.22.0
libevent 2.1.11-stable
adns: c-ares 1.15.0
tls: OpenSSL 1.1.1f  31 Mar 2020
systemd: yes

OpenSSL is old, but I don't use TLS connections for postgres, so it shouldn't matter?
Also, I'm running postgres 12.
I didn't change the version before increasing the limit.

Was that command run before increasing the file limit to 64000 for pgbouncer?
Not sure I understand the question, but before increasing it, the max open files limit was set to 1024.
I raised it in the systemd unit file:

[Service]
Type=notify
User=postgres
ExecStart=/usr/sbin/pgbouncer /etc/pgbouncer/pgbouncer.ini
ExecReload=/bin/kill -HUP $MAINPID
KillSignal=SIGINT
LimitNOFILE=64000
LimitNOFILESoft=64000
StandardOutput=null
StandardError=null

I know there are a lot of issues about how to raise the max open files limit for the pgbouncer process, but I guess if /proc/PID/limits shows 64000, that means the setting is applied? The pgbouncer log also shows the kernel file descriptor limit set to 64000 ^^
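
(As a quick cross-check, a minimal sketch for reading back the limit that actually applies to the running process; it assumes the systemd service is simply named pgbouncer and uses only standard systemd/proc tooling, nothing from this thread:)

# the limit systemd applied to the unit (service name assumed to be "pgbouncer")
systemctl show pgbouncer -p LimitNOFILE
# the limit the kernel actually enforces on the running process
cat /proc/$(pgrep -x pgbouncer | head -1)/limits | grep 'Max open files'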

What errors do you have in the postgres logs? Because if pgbouncer is the one opening that many files, then postgres should not be running out of files (the limits are per process). Basically, can you share the exact errors from the logs?

pgbouncer logs

2024-03-11 07:02:40.780 CET [3182081] ERROR accept() failed: Too many open files in system
2024-03-11 07:02:40.921 CET [3182081] LOG S-0x558e3d45b090: db_dsvbicmr8268/user_dsvbicmr8268@127.0.0.1:5432 closing because: server conn crashed? (age=476s)
2024-03-11 07:02:40.921 CET [3182081] LOG C-0x558e3d453670: db_dsvbicmr8268/user_dsvbicmr8268@10.30.5.1:34388 closing because: server conn crashed? (age=0s)
2024-03-11 07:02:40.921 CET [3182081] WARNING C-0x558e3d453670: db_dsvbicmr8268/user_dsvbicmr8268@10.30.5.1:34388 pooler error: server conn crashed?
2024-03-11 07:02:40.922 CET [3182081] WARNING S-0x558e3d45ade0: db_dsvbicmr8268/***@127.0.0.1:5432 got packet 'N' from server when not linked

Postgres logs

2024-03-11 07:02:38 CET [3726336-1] 127.0.0.1(48350) user_dsvbicmr8268@db_dsvbicmr8268 LOG:  out of file descriptors: Too many open files in system; release and retry

What errors do you have in the postgres logs? Because if pgbouncer is the one opening that many files, then postgres should not be running out of files (the limits are per process). Basically, can you share the exact errors from the logs?

Yeah, I thought about that; the postgres "too many open files" error is possibly a different problem.
Another idea is that pgbouncer creates too many connections (more than 400) and all of them try to connect to postgres, so the "too many open files" limit is reached.
But that would mean there is a bug in pgbouncer's pool size limit ^^ so maybe I'm wrong.

Maybe there is a global max open files limit in the kernel or something else? pgbouncer takes all the fds and postgres doesn't have enough?

Thanks for your help

@JelteF (Member) commented Mar 11, 2024

Maybe there is a global max open files limit in the kernel or something else? pgbouncer takes all the fds and postgres doesn't have enough?

fs.file-max is probably what you want: https://www.tecmint.com/increase-set-open-file-limits-in-linux/
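
(For reference, a minimal sketch of checking and persistently raising the system-wide limit; the drop-in file name and the value are illustrative, not taken from this thread:)

# current system-wide limit and usage (allocated, free, max)
sysctl fs.file-max
cat /proc/sys/fs/file-nr

# raise it persistently via a sysctl drop-in, then reload
echo 'fs.file-max = 655350' | sudo tee /etc/sysctl.d/99-file-max.conf
sudo sysctl --system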

@Melkij commented Mar 11, 2024

Strange, running out of LimitNOFILE has always looked like

ERROR accept() failed: Too many open files

not "... in system".

Yep, it should be the system-wide fs.file-max sysctl: https://www.ibm.com/support/pages/datastage-job-reports-error-too-many-open-files-system-system-message-indicates-system-limit-not-user-limt

@JelteF (Member) commented Mar 11, 2024

I know there are a lot of issues about how to raise the max open files limit for the pgbouncer process, but I guess if /proc/PID/limits shows 64000, that means the setting is applied? The pgbouncer log also shows the kernel file descriptor limit set to 64000 ^^

Then it's good. LimitNOFILE is indeed what you want to use when using systemd. I guess maybe you copied the wrong output of /proc/PID/limits in your first comment, because there it showed 16384 (I guess that was a postgres process, not a pgbouncer process).

@bdoublet91 (Author)

I have also set:

 sysctl fs.file-max
fs.file-max = 65535

@bdoublet91 (Author)

[Service]
Type=notify
User=postgres
ExecStart=/usr/sbin/pgbouncer /etc/pgbouncer/pgbouncer.ini
ExecReload=/bin/kill -HUP $MAINPID
KillSignal=SIGINT
LimitNOFILE=64000
LimitNOFILESoft=64000
StandardOutput=null
StandardError=null

This is the unit file for pgbouncer.

I have another one for patroni, with LimitNOFILE set to 16384.

@bdoublet91 (Author)

Is it possible that pgbouncer creates / duplicates client connections?
In the postgres logs I see the same query several times:

2024-03-11 11:15:13 CET [3932291-359] 127.0.0.1(44896) user_dsvbicmr8268@db_dsvbicmr8268 LOG:  out of file descriptors: Too many open files in system; release and retry
2024-03-11 11:15:13 CET [3932291-360] 127.0.0.1(44896) user_dsvbicmr8268@db_dsvbicmr8268 STATEMENT:  UPDATE "****" SET "api_url" = '****', "status" = 'READY', "updated_at" = '2024-03-11T10:15:12.147545+00:00'::timestamptz WHERE "****"."id" = 7069
--
2024-03-11 11:15:13 CET [3932291-418] 127.0.0.1(44896) user_dsvbicmr8268@db_dsvbicmr8268 STATEMENT:  UPDATE "****" SET "api_url" = '****', "status" = 'READY', "updated_at" = '2024-03-11T10:15:12.147545+00:00'::timestamptz WHERE "****"."id" = 7069
2024-03-11 11:15:13 CET [3932291-419] 127.0.0.1(44896) user_dsvbicmr8268@db_dsvbicmr8268 LOG:  out of file descriptors: Too many open files in system; release and retry
2024-03-11 11:15:13 CET [3932291-420] 127.0.0.1(44896) user_dsvbicmr8268@db_dsvbicmr8268 STATEMENT:  UPDATE "****" SET "api_url" = '****', "status" = 'READY', "updated_at" = '2024-03-11T10:15:12.147545+00:00'::timestamptz WHERE "****"."id" = 7069
2024-03-11 11:15:13 CET [3932291-421] 127.0.0.1(44896) user_dsvbicmr8268@db_dsvbicmr8268 LOG:  out of file descriptors: Too many open files in system; release and retry
2024-03-11 11:15:13 CET [3932291-422] 127.0.0.1(44896) user_dsvbicmr8268@db_dsvbicmr8268 STATEMENT:  UPDATE "****" SET "api_url" = '****', "status" = 'READY', "updated_at" = '2024-03-11T10:15:12.147545+00:00'::timestamptz WHERE "****"."id" = 7069

With server_login_retry, I think pgbouncer retries the query, but maybe there is a problem there?

@JelteF (Member) commented Mar 11, 2024

I have also set:

 sysctl fs.file-max
fs.file-max = 65535

Yeah, so it sounds like that should be set to a larger value, now that you're getting close to that number. It does seem like a fairly high limit already, but even without pgbouncer you already get to ~50k.

I'd say set it to 655350 (ten times the current value) and then run your intended setup again. Then you can find out what is using your file descriptors. But it sounds like it's mostly postgres.

@bdoublet91 (Author) commented Mar 11, 2024

I understand what you mean about increasing this limit, but I still don't understand why pgbouncer uses so many file descriptors for 400 connections. As my monitoring shows, when I disabled pgbouncer the file descriptor count dropped from 64k to 3k, and it has been around 16k for the last hour.

The overall problem seems related to pgbouncer: it uses too many fds and reaches the fs.file-max limit, so postgres doesn't have enough fds and logs "too many open files", but I don't think the root cause is in postgres.

I would rather investigate this problem than raise the fs.file-max value, because if pgbouncer has an fd leak or something similar, it will just consume RAM and CPU for nothing. Do you agree?

@JelteF (Member) commented Mar 11, 2024

Yeah, I also don't understand why the number is so high. But if PgBouncer is not running, it's pretty much impossible to debug. So that's why I'm suggesting to configure the settings so that the system at least runs, and then figure out what is causing so many file descriptors to be used, using a combination of netstat, ss, lsof, SHOW SERVERS, SHOW CLIENTS (and probably more).
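
(For the SHOW commands, a minimal sketch of querying the PgBouncer admin console, using the listen_port and admin_users values from the config posted above; these are standard PgBouncer admin commands:)

psql -h 127.0.0.1 -p 6432 -U pgbouncer pgbouncer

-- then, at the psql prompt:
SHOW POOLS;    -- per-pool client/server counts and waiting clients
SHOW SERVERS;  -- every server connection PgBouncer keeps open to postgres
SHOW CLIENTS;  -- every connected client
SHOW FDS;      -- the file descriptors PgBouncer itself holds open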

Also, what was your pgbouncer config before changing it? Because your file descriptor pattern, even after disabling pgbouncer, seems very different from the (pretty much) flat line it was before.

@bdoublet91 (Author) commented Mar 11, 2024

OK, I will re-enable the setup tonight. Could you give me some points to check / debug?
SHOW SERVERS + SHOW CLIENTS, OK, but also other stats or logs?
Thanks

@bdoublet91 (Author) commented Mar 11, 2024

Should I run pgbouncer in verbose mode?

@JelteF (Member) commented Mar 11, 2024

I think mostly using some OS tools (netstat, ss, lsof) to find out if it's actually the pgbouncer process that opens the bulk of these files, or if it's actually postgres (with pgbouncer making more connections to it, causing postgres to open more files). The fact that the file count goes up significantly from your baseline even without pgbouncer makes me think it could be postgres too.

@bdoublet91 (Author)

OK, I will try it this week and give you feedback.

Thanks for your help.

See you

@bdoublet91 (Author)

Hello again :)
I have reactivated pgbouncer after raising the file descriptor limit to 100k.

I still see high file descriptor consumption when there is a spike of client connections...

2024-03-13 17:29:01.802 CET [3115148] LOG kernel file descriptor limit: 64000 (hard: 64000); max_client_conn: 5000, max expected fd use: 5512

2024-03-13 17:29:31.321 CET [3115148] LOG C-0x55b74fbab720: (nodb)/(nouser)@10.30.5.1:36662 registered new auto-database: db_dsvbicmr8268
2024-03-13 17:29:31.321 CET [3115148] LOG S-0x55b74fbb4460: db_dsvbicmr8268/SApgbouncer@127.0.0.1:5432 new connection to server (from 127.0.0.1:53420)
2024-03-13 17:29:31.325 CET [3115148] LOG S-0x55b74fbb4460: db_dsvbicmr8268/SApgbouncer@127.0.0.1:5432 SSL established: TLSv1.3/TLS_AES_256_GCM_SHA384/ECDH=prime256v1

2024-03-13 17:32:33.795 CET [3115148] LOG C-0x55b74fd1cde0: db_dsvbicmr8268/user_dsvbicmr8268@10.30.5.1:41300 login attempt: db=db_dsvbicmr8268 user=user_dsvbicmr8268 tls=no

UP TO HERE IT'S OK
2024-03-13 17:39:18.578 CET [3115148] WARNING C-0x55b74fdac060: db_dsvbicmr8268/user_dsvbicmr8268@10.30.5.1:54888 taking connection from reserve_pool
2024-03-13 17:39:19.203 CET [3115148] LOG C-0x55b74fbada10: db_dsvbicmr8268/user_dsvbicmr8268@10.30.5.1:58606 closing because: server conn crashed? (age=128s)
2024-03-13 17:39:19.223 CET [3115148] WARNING tls_sbufio_recv: read failed: Connection reset by peer
2024-03-13 17:39:19.223 CET [3115148] LOG S-0x55b74fd48ea0: db_dsvbicmr8268/SApgbouncer@127.0.0.1:5432 closing because: server conn crashed? (age=329s)
2024-03-13 17:39:19.223 CET [3115148] LOG C-0x55b74fda77d0: db_dsvbicmr8268/(nouser)@10.30.5.1:55198 closing because: bouncer config error (age=0s)
2024-03-13 17:39:19.223 CET [3115148] WARNING C-0x55b74fda77d0: db_dsvbicmr8268/(nouser)@10.30.5.1:55198 pooler error: bouncer config error
2024-03-13 17:39:19.223 CET [3115148] LOG S-0x55b74fbb51d0: db_dsvbicmr8268/test@127.0.0.1:5432 closing because: server conn crashed? (age=575s)
2024-03-13 17:39:19.343 CET [3115148] LOG C-0x7f4f5436a340: db_dsvbicmr8268/(nouser)@10.30.5.1:33294 closing because: server login has been failing, try again later (server_login_retry) (age=0s)
2024-03-13 17:39:19.343 CET [3115148] WARNING C-0x7f4f5436a340: db_dsvbicmr8268/(nouser)@10.30.5.1:33294 pooler error: server login has been failing, try again later (server_login_retry)
2024-03-13 17:39:19.343 CET [3115148] LOG S-0x55b74fea87b0: db_dsvbicmr8268/test@127.0.0.1:5432 closing because: server conn crashed? (age=143s)
2024-03-13 17:39:19.343 CET [3115148] WARNING S-0x55b74fea9520: db_dsvbicmr8268/test@127.0.0.1:5432 got packet 'N' from server when not linked

Also some stats:

2024-03-13 17:34:01.804 CET [3115148] LOG stats: 285 xacts/s, 508 queries/s, 0 client parses/s, 0 server parses/s, 0 binds/s, in 171612 B/s, out 336628 B/s, xact 80509 us, query 44089 us, wait 3538664 us
2024-03-13 17:35:01.805 CET [3115148] LOG stats: 1308 xacts/s, 1388 queries/s, 0 client parses/s, 0 server parses/s, 0 binds/s, in 1136966 B/s, out 1849792 B/s, xact 43214 us, query 40579 us, wait 5809437 us
2024-03-13 17:36:01.800 CET [3115148] LOG stats: 2564 xacts/s, 2688 queries/s, 0 client parses/s, 0 server parses/s, 0 binds/s, in 1928066 B/s, out 9333155 B/s, xact 19131 us, query 18122 us, wait 3536764 us
2024-03-13 17:39:01.805 CET [3115148] LOG stats: 568 xacts/s, 596 queries/s, 0 client parses/s, 0 server parses/s, 0 binds/s, in 407608 B/s, out 5073615 B/s, xact 387413 us, query 369161 us, wait 190062437 us

Is 2688 queries/s at 17:36 high? It's higher than in the other pgbouncer stats lines, but how would that cause 70k file descriptors...
High latency at 17:39 due to the lack of file descriptors, I think?

Do all the crashes happen with (nouser)?

2024-03-13 17:39:29.628 CET [3115148] WARNING C-0x7f4f5435cef0: db_dsvbicmr8268/(nouser)@10.30.5.1:35660 pooler error: server login has been failing, try again later (server_login_retry)

Then it's OK again:

2024-03-13 17:39:34.476 CET [3115148] LOG S-0x55b74fea8d10: db_dsvbicmr8268/user_dsvbicmr8268@127.0.0.1:5432 new connection to server (from 127.0.0.1:40328)

Do you have some ideas on how to resolve this, or other things you need me to debug?

Thanks

@JelteF (Member) commented Mar 14, 2024

Can you run this command to see the number of open files per process?

lsof | cut -d ' ' -f1 | sort | uniq -c | sort -n | tail

@bdoublet91 (Author)

Oh, good shot, nice command btw ;)

sudo lsof | cut -d ' ' -f1 | sort | uniq -c | sort -n | tail
    339 NetworkMa
    341 node_expo
    380 snapd
    400 udisksd
    424 pgbouncer
    534 patroni
    952 etcd
   1126 sshd
   1482 haproxy
  73615 postgres

And a lot of the file descriptors look like this:

postgres  3980964                           postgres   43u      REG                9,2  170188800    8138552 /var/lib/postgresql/12/main/base/16402/486815881
postgres  3980964                           postgres   44u      REG                9,2 1043021824    8130679 /var/lib/postgresql/12/main/base/16402/439059455
postgres  3980964                           postgres   45u      REG                9,2  112164864    8129147 /var/lib/postgresql/12/main/base/16402/439059877
postgres  3980964                           postgres   46u      REG                9,2  112164864    8127850 /var/lib/postgresql/12/main/base/16402/439059878
postgres  3980964                           postgres   47u      REG                9,2   70598656    8131446 /var/lib/postgresql/12/main/base/16402/439059876
postgres  3980964                           postgres   48u      REG                9,2   70680576    8128099 /var/lib/postgresql/12/main/base/16402/439059875
postgres  3980964                           postgres   49u      REG                9,2      73728    8127382 /var/lib/postgresql/12/main/base/16402/439058839
postgres  3980964                           postgres   50u      REG                9,2      16384    8127615 /var/lib/postgresql/12/main/base/16402/439058920
postgres  3980964                           postgres   51u      REG                9,2      16384    8128126 /var/lib/postgresql/12/main/base/16402/439058922
postgres  3980964                           postgres   52u      REG                9,2      16384    8131000 /var/lib/postgresql/12/main/base/16402/439058921
postgres  3980964                           postgres   53u      REG                9,2      32768    8130704 /var/lib/postgresql/12/main/base/16402/439059455_vm
postgres  3980964                           postgres   54u      REG                9,2       8192    8130400 /var/lib/postgresql/12/main/base/16402/439032441
postgres  3980964                           postgres   55u      REG                9,2  163381248    8130425 /var/lib/postgresql/12/main/base/16402/2840
postgres  3980964                           postgres   56u      REG                9,2    2244608    8130167 /var/lib/postgresql/12/main/base/16402/2841
postgres  3980964                           postgres   57u      REG                9,2      65536    8127514 /var/lib/postgresql/12/main/base/16402/456496035

cat /tmp/test | grep "/var/lib/postgresql/12/main/base/" | wc -l
54294

So it's the postgres processes that open a lot of file descriptors, but the spikes happen only when pgbouncer is running. When I disabled pgbouncer, there was no spike (maybe because max_db_connections was set to 500, so a lot of connections got dropped?).

this is my postgresql conf:

postgresql:
  parameters:
    archive_command: cd .
    archive_mode: true
    archive_timeout: 1800s
    auto_explain.log_analyze: true
    auto_explain.log_buffers: true
    auto_explain.log_min_duration: 10s
    auto_explain.log_nested_statements: true
    auto_explain.log_timing: false
    auto_explain.log_triggers: true
    auto_explain.log_verbose: true
    autovacuum: true
    autovacuum_analyze_scale_factor: 0.02
    autovacuum_max_workers: 4
    autovacuum_naptime: 1s
    autovacuum_vacuum_cost_delay: 20
    autovacuum_vacuum_cost_limit: 200
    autovacuum_vacuum_scale_factor: 0.01
    checkpoint_completion_target: 0.9
    checkpoint_timeout: 15min
    default_statistics_target: 1000
    effective_cache_size: 15GB
    effective_io_concurrency: 2
    hot_standby: true
    hot_standby_feedback: true
    huge_pages: try
    idle_in_transaction_session_timeout: 3600000
    log_checkpoints: true
    log_directory: /var/log/postgresql
    log_filename: postgresql-%a.log
    log_line_prefix: '%t [%p-%l] %r %q%u@%d '
    log_lock_waits: true
    log_rotation_age: 1d
    log_rotation_size: 0
    log_temp_files: 0
    log_truncate_on_rotation: true
    logging_collector: true
    maintenance_work_mem: 256MB
    max_connections: 700
    max_files_per_process: 4096
    max_locks_per_transaction: 64
    max_parallel_maintenance_workers: 2
    max_parallel_workers: 8
    max_parallel_workers_per_gather: 2
    max_prepared_transactions: 0
    max_replication_slots: 10
    max_wal_senders: 10
    max_wal_size: 4GB
    max_worker_processes: 12
    min_wal_size: 2GB
    pg_stat_statements.max: 10000
    pg_stat_statements.save: false
    pg_stat_statements.track: all
    random_page_cost: 4
    seq_page_cost: 1
    shared_buffers: 3GB
    shared_preload_libraries: pg_stat_statements,auto_explain
    statement_timeout: 86400000
    superuser_reserved_connections: 5
    synchronous_commit: true
    track_activities: true
    track_counts: true
    track_functions: all
    track_io_timing: true
    wal_buffers: 32MB
    wal_keep_segments: 130
    wal_level: replica
    wal_log_hints: true
    work_mem: 8MB

Could you give me some ways to debug this? I'm sure you know postgres better than I do :).

Thank you again!

@JelteF (Member) commented Mar 14, 2024

Okay, so yeah those ~400 file descriptors for pgbouncer itself seem quite normal.

Why postgres has so many files open is probably because you have set:

max_files_per_process: 4096

So each backend process can open up to 4096 files. Multiplied by your default_pool_size of 125, that can add up to 125 * 4096 = 512,000 open files in the worst case.
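
(As an ad-hoc check of how those files are spread across the postgres backends, something like the sketch below lists the biggest consumers; it assumes root access to /proc and is not taken from the thread:)

# open fds per postgres backend, biggest consumers last
for pid in $(pgrep -x postgres); do
  printf '%s %s\n' "$pid" "$(sudo ls /proc/"$pid"/fd 2>/dev/null | wc -l)"
done | sort -k2 -n | tail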

The most probable reason why pgbouncer makes this problem more pronounced is that it will keep connections to postgres open, thus a postgres process that opened a lot of files will keep them open much longer than if your client does:

  1. connect
  2. query that opens many files
  3. disconnect

Because now the disconnect does not happen, and it is the disconnect that would close all file descriptors automatically if pgbouncer were not in the picture.

So I don't really think there's a huge problem going on here, just increase fs.file-max a bunch to something very high and don't worry too much about it.

You could also lower max_files_per_process a bit to reduce the amount of amplification.

One question though: why is the spike so high right after changing to pgbouncer, and why does it go back down after a while? I think this might have to do with you restarting postgres too at that point, and thus shared_buffers of postgres being empty, which in turn would cause many processes to load data from the filesystem into shared_buffers by opening and reading files.

If that's indeed the case you probably want to enable pg_prewarm. That way a single background worker would fill the cache, instead of having the cache be filled by all backends. You can enable pg_prewarm by adding this to your postgresql.conf:

shared_preload_libraries = 'pg_prewarm'

Source: https://www.postgresql.org/docs/current/pgprewarm.html
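
(Since the config posted earlier already preloads pg_stat_statements and auto_explain, the combined setting would presumably look like this in the patroni parameters, keeping the existing entries; note that changing shared_preload_libraries requires a postgres restart:)

shared_preload_libraries: pg_stat_statements,auto_explain,pg_prewarm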

@bdoublet91 (Author)

Oh, I'm very grateful!
Yeah, at first I downgraded max_files_per_process to 1000 (the default value), but I thought the problem was pgbouncer, so I restored the value.
I don't fully understand this setting, because it has been set to 4096 for 4 years with pgbouncer and we had no problem until now.
Maybe we have more connections now, and I raised max_worker_processes from 8 to 12, so we reached the fd limits.

To be sure: when a client connects to pgbouncer, it creates/reuses a pool connection that connects to postgres, which creates a postgres process, and this process can open up to 4096 files. OK, but in my debugging I found that a postgres connection only needs one or two file descriptors? Did I misunderstand?

"One question though: why is the spike so high right after changing to pgbouncer, and why does it go back down after a while?"
Yeah, it happens after a pgbouncer restart, but it has also happened twice since then. I will try pg_prewarm and monitor the behavior.

OK, I will gradually increase fs.file-max and monitor the behavior.

@JelteF (Member) commented Mar 14, 2024

To be sure: when a client connects to pgbouncer, it creates/reuses a pool connection that connects to postgres, which creates a postgres process, and this process can open up to 4096 files. OK, but in my debugging I found that a postgres connection only needs one or two file descriptors? Did I misunderstand?

Yeah, so I guess Postgres doesn't always close the files it opens in a process. And because pgbouncer caches the connections to postgres, these files stay open longer, and then the next query might open more files, which again stay open, eventually adding up to a lot of files per postgres process.

@bdoublet91 (Author)

Should I leave the issue open and give feedback next week?
For the moment I have set max_files_per_process to 1000 and fs.file-max to 150k, with pg_prewarm enabled.
Let's see.
Thanks

@eulerto added the pending-close label Apr 15, 2024
@eulerto (Member) commented May 3, 2024

Having not received a reply, I'm closing this.

@eulerto closed this as completed May 3, 2024
@bdoublet91 (Author)

Hi,

Sorry for the late feedback. I have raised fs.file-max to 400k with pg_prewarm, and lowered the pool size a little, to 100 connections per database. I still have some spikes to 150k open fds, but it doesn't impact RAM or CPU in my monitoring, and there are no more "too many open files" errors, so that's quite good for me.
I have another problem with disk IO and badly optimized queries, but that's not pgbouncer's business.

Thanks again.
