Recovery failure with barman due to max_connections #2478

Closed
gsimko opened this issue Jul 21, 2023 · 8 comments · Fixed by #4564 · May be fixed by #4723
Labels: bug 🐛 Something isn't working


gsimko commented Jul 21, 2023

CNPG version: 1.20.1

I'm trying to recover from a backup that was produced with barmanObjectStore, but recovering the primary instance fails.
In postgresql.parameters.max_connections I use a value of 40, and because of that the following error is produced:
"message":"hot standby is not possible because of insufficient parameter settings","detail":"max_connections = 10 is a lower setting than on the primary server, where its value was 40."
In other words, recovery from a backup must use a max_connections value at least as large as the one that was in effect on the primary that produced the backup.
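
For reference, the recovery Cluster looks roughly like this (a minimal sketch; resource names and the bucket path are placeholders):

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: cluster-restore
spec:
  instances: 1
  storage:
    size: 10Gi
  postgresql:
    parameters:
      max_connections: "40"   # user setting; gets overridden to 10 during recovery
  bootstrap:
    recovery:
      source: source-cluster
  externalClusters:
    - name: source-cluster
      barmanObjectStore:
        destinationPath: s3://my-backup-bucket/
        # credentials omitted
```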

Tracking down the code, I found that the max_connections=10 setting comes from pg_controldata(src), which strictly overwrites the user setting.

I guess the user should be able to override max_connections when running a recovery, so it can match what was backed up?

I've checked, and /var/lib/postgresql/data/pgdata/custom.conf indeed shows max_connections=10.
What's confusing, though, is that running pg_controldata displays "max_connections setting: 40", so it's not obvious where that 10 comes from.
The logs show this: {..., "msg":"enforcing parameters found in pg_controldata","parameters":{"max_connections":"10","max_locks_per_transaction":"64","max_prepared_transactions":"0","max_wal_senders":"10","max_worker_processes":"32"}}


gsimko commented Jul 21, 2023

@phisco @gabriele-wolfox @mnencia
I see that this logic was contributed by you in this commit.
Unfortunately, I couldn't find any explanation of why it's in place. Why do we force-override max_connections and the other settings?


gsimko commented Jul 23, 2023

After more investigation I found the culprit: there was a max_connections change from 10 to 40 after the base backup was written, so that change is only present in the WAL.

When CNPG initializes the config, it uses pg_controldata to determine the max_connections setting. According to the docs, "pg_controldata prints information initialized during initdb", so it returned 10, and CNPG then used that value to override the setting from the postgresql.parameters stanza.

When barman-cloud-restore gets to the WAL entry that increases max_connections, it pauses with the error message shown above.

The fix to this problem is to not overwrite the max_connections setting with the value from pg_controldata. If the user-specified max_connections setting were used instead, things would work fine, since the user can pick a high enough value.

My solution for restoring the data was to manually start a PostgreSQL server on a clean machine, run barman-cloud-restore (I had to override restore_command and archive_command in postgresql.auto.conf), and take a pg_dump of the database. Then, in CNPG, I created a clean database using initdb and pg_restored the data into it using kubectl exec.


phisco commented Jul 23, 2023

@gsimko can you be more explicit about the steps that brought you to this situation?
E.g.:

  • created the first cluster with max_connections yy
  • backed up the first cluster to bucket xxx
  • set max_connections to zz on the first cluster
  • created a second cluster restoring from bucket xxx


gsimko commented Jul 24, 2023

AFAICT the following steps led to this situation:

  • created a cluster with max_connections=10
  • created a base backup in bucket X
  • changed max_connections to 40
  • wrote WAL logs to bucket X
  • created a second cluster restoring from bucket X
  • initialization of the cluster's primary pod failed

The reason for the failure is that the new cluster is initialized with max_connections=10, but when the restoration process reaches the WAL record that raises max_connections to 40, it stops, because hot standby cannot run with a max_connections lower than the value the primary was using.
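
For reference, the first cluster was roughly like this (a sketch; names and the bucket path are placeholders):

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: source-cluster
spec:
  instances: 1
  storage:
    size: 10Gi
  postgresql:
    parameters:
      max_connections: "10"   # later raised to "40", after the base backup was taken
  backup:
    barmanObjectStore:
      destinationPath: s3://bucket-x/
      # credentials omitted
```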

Hope that helps!


phisco commented Jul 24, 2023

But the error message you shared says: max_connections = 10 is a lower setting than on the primary server, where its value was 40. So, it looks like it's trying to go back to 10 for some reason.

Was the second cluster created with max_connections set to 10 or to 40?


gsimko commented Jul 24, 2023

On the second cluster I set it to 40.

The reason it uses 10 - and that's the actual bug - is that CNPG internally overrides the user setting by reading max_connections from pg_controldata, which returns the value recorded in the restored data directory, i.e. the value in effect when the base backup was taken.


phisco commented Jul 25, 2023

Maybe I got it: it's set to 10 from the backup, then at some point while replaying WALs it's set to 40, but we still try to force it back to 10 because pg_controldata says so. I'll try to reproduce it, thanks!

(Which is exactly what you said above, but now I got it too 😂)


litaocdl commented May 15, 2024

I can reproduce it. The problem is that during recovery we enforce the max_connections value from the backup.
So, let's say:

  1. the cluster has max_connections=100, and we create a base backup A;
  2. later, max_connections is increased to 200, but without creating another base backup.

Then, when we do a full recovery, we use max_connections=100 from backup A to start the server in standby mode and replay the WALs. When it reaches the WAL record that increases max_connections, the PostgreSQL instance in the recovery job pauses and the recovery job hangs.
If we create a base backup right after increasing max_connections, a full restore succeeds.
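
For example, taking a fresh on-demand base backup right after changing the parameter would look roughly like this (a sketch; resource and cluster names are placeholders):

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Backup
metadata:
  name: post-change-backup   # hypothetical name
spec:
  cluster:
    name: source-cluster     # the cluster whose max_connections was just raised
```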

@litaocdl litaocdl self-assigned this May 17, 2024
@litaocdl litaocdl added the bug 🐛 Something isn't working label May 18, 2024
leonardoce added a commit that referenced this issue Jun 3, 2024

Ensure the PostgreSQL replication parameters are set to the higher value
between the ones specified in the cluster specification and the ones
stored in the backup.

This will ensure that the backup will be restored correctly while
allowing the users to raise their value to accommodate changes in the
configuration that have happened after the backup was taken.

Partially closes #2478 #2337

Signed-off-by: Tao Li <tao.li@enterprisedb.com>
Signed-off-by: Marco Nenciarini <marco.nenciarini@enterprisedb.com>
Signed-off-by: Leonardo Cecchi <leonardo.cecchi@enterprisedb.com>
Co-authored-by: Marco Nenciarini <marco.nenciarini@enterprisedb.com>
Co-authored-by: Leonardo Cecchi <leonardo.cecchi@enterprisedb.com>
cnpg-bot cherry-picked the same commit to the release branches on Jun 3, 2024 (cherry picked from commit 87f80ce).
dougkirkley pushed a commit to dougkirkley/cloudnative-pg that referenced this issue on Jun 11, 2024 (cloudnative-pg#4564).