--dry-run does not catch if e.g. max_workers params are different on servers and hence followers will not be able to attach properly #826

Open
NielsKSchjoedt opened this issue Aug 21, 2023 · 0 comments

We were just carrying out a switchover of our primary using repmgr 5.3.3:

sudo -u postgres repmgr standby switchover --siblings-follow --dry-run

postgres@psql-09:/root$ repmgr standby switchover --siblings-follow --dry-run
NOTICE: checking switchover on node "psql-09" (ID: 9) in --dry-run mode
INFO: SSH connection to host "10.10.10.7" succeeded
INFO: able to execute "repmgr" on remote host "10.10.10.7"
INFO: all sibling nodes are reachable via SSH
INFO: 4 walsenders required, 20 available
INFO: demotion candidate is able to make replication connection to promotion candidate
INFO: archive mode is "off"
INFO: replication lag on this standby is 2 seconds
INFO: 4 replication slots required, 20 available
NOTICE: attempting to pause repmgrd on 5 nodes
NOTICE: local node "psql-09" (ID: 9) would be promoted to primary; current primary "psql-07" (ID: 7) would be demoted to standby
INFO: following shutdown command would be run on node "psql-07":
  "sudo /usr/bin/pg_ctlcluster 15 main stop"
INFO: parameter "shutdown_check_timeout" is set to 60 seconds
INFO: prerequisites for executing STANDBY SWITCHOVER are met

However, psql-09 (a more powerful server) was configured with max_worker_processes=64, while psql-07 had only max_worker_processes=32. When we actually performed the switchover, we ended up in limbo: none of the replicas could rejoin, because PostgreSQL refused to start them in standby mode with a max_worker_processes value lower than the new primary's:

Aug 21 22:11:47 psql-08 postgres[4082218]: [1] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
Aug 21 22:11:47 psql-08 postgres[4082221]: [1] LOG:  database system was interrupted while in recovery at log time 2023-08-21 21:49:15 UTC
Aug 21 22:11:47 psql-08 postgres[4082221]: [2] HINT:  If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
Aug 21 22:11:48 psql-08 postgres[4082221]: [1] LOG:  entering standby mode
Aug 21 22:11:48 psql-08 postgres[4082221]: [1] FATAL:  recovery aborted because of insufficient parameter settings
Aug 21 22:11:48 psql-08 postgres[4082221]: [2] DETAIL:  max_worker_processes = 32 is a lower setting than on the primary server, where its value was 64.
Aug 21 22:11:48 psql-08 postgres[4082221]: [3] HINT:  You can restart the server after making the necessary configuration changes.
Aug 21 22:11:48 psql-08 postgres[4082218]: [1] LOG:  startup process (PID 4082221) exited with exit code 1
Aug 21 22:11:48 psql-08 postgres[4082218]: [1] LOG:  aborting startup due to startup process failure
Aug 21 22:11:48 psql-08 postgres[4082218]: [1] LOG:  database system is shut down
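
To illustrate the kind of check that would have flagged this, here is a minimal sketch in Python. It is not repmgr code and does not reflect how repmgr implements --dry-run. The function name `find_insufficient_settings` and the dictionary layout are hypothetical, and the sketch assumes the values were already collected from each node (for example with `SHOW max_worker_processes`). The list of settings is the set PostgreSQL documents as needing to be at least as high on a standby as on its primary.

```python
# Settings that PostgreSQL requires to be at least as high on a standby
# as on the primary it replays WAL from.
STANDBY_MIN_SETTINGS = (
    "max_connections",
    "max_prepared_transactions",
    "max_locks_per_transaction",
    "max_wal_senders",
    "max_worker_processes",
)


def find_insufficient_settings(new_primary, standbys):
    """Return (node, setting, standby_value, primary_value) tuples for every
    setting that is lower on a standby than on the promotion candidate.

    new_primary: dict of setting name -> int for the promotion candidate.
    standbys: dict of node name -> dict of setting name -> int.
    Both shapes are assumptions made for this sketch.
    """
    problems = []
    for node, settings in sorted(standbys.items()):
        for name in STANDBY_MIN_SETTINGS:
            # Only compare settings that were collected on both sides.
            if name in new_primary and name in settings:
                if settings[name] < new_primary[name]:
                    problems.append((node, name, settings[name], new_primary[name]))
    return problems


if __name__ == "__main__":
    # Values taken from this report: psql-09 is the promotion candidate.
    new_primary = {"max_worker_processes": 64}
    standbys = {
        "psql-07": {"max_worker_processes": 32},
        "psql-08": {"max_worker_processes": 32},
    }
    for node, name, have, need in find_insufficient_settings(new_primary, standbys):
        print(f"{node}: {name} = {have} is lower than {need} on the promotion candidate")
```

Running a comparison like this against the demotion candidate and all siblings before the switchover would have reported psql-07 and psql-08 as unable to follow psql-09.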

It is unexpected that --dry-run did not catch this 😬
