Recovery failure with barman due to max_connections #2478

Closed
gsimko opened this issue Jul 21, 2023 · 8 comments · Fixed by #4564 · May be fixed by #4723
Labels: bug 🐛 Something isn't working


gsimko commented Jul 21, 2023

CNPG version: 1.20.1

I'm trying to recover from a backup that was produced with barmanObjectStore, but recovering the primary instance fails.
In postgresql.parameters.max_connections I use a value of 40, and because of that the following error is produced:
"message":"hot standby is not possible because of insufficient parameter settings","detail":"max_connections = 10 is a lower setting than on the primary server, where its value was 40."
In other words, recovery from a backup must use a max_connections value at least as large as the one that was in effect on the primary that produced the backup.
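
For reference, the recovery Cluster looks roughly like this (a minimal sketch; resource names and the bucket path are placeholders):

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: cluster-restore
spec:
  instances: 1
  storage:
    size: 10Gi
  postgresql:
    parameters:
      max_connections: "40"   # user setting; gets overridden to 10 during recovery
  bootstrap:
    recovery:
      source: source-cluster
  externalClusters:
    - name: source-cluster
      barmanObjectStore:
        destinationPath: s3://my-backup-bucket/
        # credentials omitted
```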

Tracking down the code, I found that the max_connections=10 setting comes from pg_controldata(src), which strictly overwrites the user setting.

I guess the user should be able to override max_connections when running a recovery, so it can match what was backed up?

I've checked, and /var/lib/postgresql/data/pgdata/custom.conf indeed shows max_connections=10.
What's confusing, though, is that running pg_controldata displays "max_connections setting: 40", so it's not obvious where that 10 comes from.
The logs show this: {..., "msg":"enforcing parameters found in pg_controldata","parameters":{"max_connections":"10","max_locks_per_transaction":"64","max_prepared_transactions":"0","max_wal_senders":"10","max_worker_processes":"32"}}


gsimko commented Jul 21, 2023

@phisco @gabriele-wolfox @mnencia
I see that this logic was contributed by you in this commit.
Unfortunately, I couldn't find any explanation of why it's in place. Why do we force-override max_connections and the other settings?


gsimko commented Jul 23, 2023

After more investigation I found the culprit: there was a max_connections change from 10 to 40 after the base backup was written, so that change is only present in the WAL.

When CNPG initializes the config, it uses pg_controldata to determine the max_connections setting. According to the docs, "pg_controldata prints information initialized during initdb", so it returned 10, and CNPG then used that value to override the setting from the postgresql.parameters stanza.

When barman-cloud-restore gets to the WAL entry that increases max_connections, it pauses with the error message shown above.

The fix to this problem is to not overwrite the max_connections setting with the value from pg_controldata. If the user-specified max_connections setting were used instead, things would work fine, since the user can pick a high enough value.

My solution for restoring the data was to manually start a PostgreSQL server on a clean machine, run barman-cloud-restore (I had to override restore_command and archive_command in postgresql.auto.conf), and take a pg_dump of the database. Then, in CNPG, I created a clean database using initdb and pg_restored the data into it using kubectl exec.


phisco commented Jul 23, 2023

@gsimko can you be more explicit about the steps that brought you to this situation?
E.g.:

  • created the first cluster with max_connections yy
  • backed up the first cluster to bucket xxx
  • set max_connections to zz on the first cluster
  • created a second cluster restoring from bucket xxx


gsimko commented Jul 24, 2023

AFAICT the following steps led to this situation:

  • created a cluster with max_connections=10
  • created a base backup in bucket X
  • changed max_connections to 40
  • wrote WAL logs to bucket X
  • created a second cluster restoring from bucket X
  • initialization of the cluster's primary pod failed

The reason for the failure is that the new cluster is initialized with max_connections=10, but when the restoration process reaches the WAL record that raises max_connections to 40, it stops, because hot standby cannot run with a max_connections lower than the value the primary was using.
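
For reference, the first cluster was roughly like this (a sketch; names and the bucket path are placeholders):

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: source-cluster
spec:
  instances: 1
  storage:
    size: 10Gi
  postgresql:
    parameters:
      max_connections: "10"   # later raised to "40", after the base backup was taken
  backup:
    barmanObjectStore:
      destinationPath: s3://bucket-x/
      # credentials omitted
```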

Hope that helps!


phisco commented Jul 24, 2023

But the error message you shared says: max_connections = 10 is a lower setting than on the primary server, where its value was 40. So, it looks like it's trying to go back to 10 for some reason.

Was the second cluster created with max_connections set to 10 or to 40?


gsimko commented Jul 24, 2023

On the second cluster I set it to 40.

The reason it uses 10 - and that's the actual bug - is that CNPG internally overrides the user setting by reading max_connections from pg_controldata, which returns the value recorded in the restored data directory, i.e. the value in effect when the base backup was taken.


phisco commented Jul 25, 2023

Maybe I got it: it's set to 10 from the backup, then at some point while replaying WALs it's set to 40, but we still try to force it back to 10 because pg_controldata says so. I'll try to reproduce it, thanks!

(Which is exactly what you said above, but now I got it too 😂)


litaocdl commented May 15, 2024

I can reproduce it. The problem is that during recovery we enforce the max_connections value from the backup.
So, let's say:

  1. the cluster has max_connections=100, and we create a base backup A;
  2. later, max_connections is increased to 200, but without creating another base backup.

Then, when we do a full recovery, we use max_connections=100 from backup A to start the server in standby mode and replay the WALs. When it reaches the WAL record that increases max_connections, the PostgreSQL instance in the recovery job pauses and the recovery job hangs.
If we create a base backup right after increasing max_connections, a full restore succeeds.
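
For example, taking a fresh on-demand base backup right after changing the parameter would look roughly like this (a sketch; resource and cluster names are placeholders):

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Backup
metadata:
  name: post-change-backup   # hypothetical name
spec:
  cluster:
    name: source-cluster     # the cluster whose max_connections was just raised
```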

@litaocdl litaocdl self-assigned this May 17, 2024
@litaocdl litaocdl added the bug 🐛 Something isn't working label May 18, 2024
leonardoce added a commit that referenced this issue Jun 3, 2024

Ensure the PostgreSQL replication parameters are set to the higher value
between the ones specified in the cluster specification and the ones
stored in the backup.

This will ensure that the backup will be restored correctly while
allowing the users to raise their value to accommodate changes in the
configuration that have happened after the backup was taken.

Partially closes #2478 #2337

Signed-off-by: Tao Li <tao.li@enterprisedb.com>
Signed-off-by: Marco Nenciarini <marco.nenciarini@enterprisedb.com>
Signed-off-by: Leonardo Cecchi <leonardo.cecchi@enterprisedb.com>
Co-authored-by: Marco Nenciarini <marco.nenciarini@enterprisedb.com>
Co-authored-by: Leonardo Cecchi <leonardo.cecchi@enterprisedb.com>
cnpg-bot cherry-picked the same commit to the release branches on Jun 3, 2024 (cherry picked from commit 87f80ce).
dougkirkley pushed a commit to dougkirkley/cloudnative-pg that referenced this issue on Jun 11, 2024 (cloudnative-pg#4564).