master node fails to automatically rejoin the cluster after recovery from failure #850

Open
nuowei2543 opened this issue Apr 11, 2024 · 1 comment

nuowei2543 commented Apr 11, 2024

Hello. While simulating a host failover, I stopped the PostgreSQL instance on the master host, and the standby node was successfully promoted to become the new master. However, when I restarted the original master node, it did not automatically rejoin the cluster as a standby node.
Versions:
Ubuntu: 20.04
PostgreSQL: 16.2
repmgrd: 5.4.1

1. postgres@ser-compute-01:/disk1/postgresql/repmgr$ repmgr -f /disk1/postgresql/repmgr/repmgr.conf cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+-------+---------+-----------+----------+----------+----------+----------+------------------------------------------------------------------------
1 | node1 | primary | * running | | default | 100 | 3 | host=10.0.14.100 port=5432 user=repmgr dbname=repmgr connect_timeout=2
2 | node2 | standby | running | node1 | default | 100 | 3 | host=10.0.14.101 port=5432 user=repmgr dbname=repmgr connect_timeout=2
3 | node3 | witness | * running | node1 | default | 0 | n/a | host=10.0.14.109 port=5432 user=repmgr dbname=repmgr connect_timeout=2

2. On node1, run:
supervisorctl stop postgresql

3. postgres@ser-compute-02:~$ repmgr -f /disk1/postgresql/repmgr/repmgr.conf cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+-------+---------+-----------+----------+----------+----------+----------+------------------------------------------------------------------------
1 | node1 | primary | - failed | ? | default | 100 | | host=10.0.14.100 port=5432 user=repmgr dbname=repmgr connect_timeout=2
2 | node2 | primary | * running | | default | 100 | 2 | host=10.0.14.101 port=5432 user=repmgr dbname=repmgr connect_timeout=2
3 | node3 | witness | * running | node2 | default | 0 | n/a | host=10.0.14.109 port=5432 user=repmgr dbname=repmgr connect_timeout=2

4. On node1, run:
supervisorctl start postgresql

5. postgres@ser-compute-02:/disk1/postgresql/repmgr$ repmgr -f /disk1/postgresql/repmgr/repmgr.conf cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+-------+---------+-----------+----------+----------+----------+----------+------------------------------------------------------------------------
1 | node1 | primary | ! running | | default | 100 | 1 | host=10.0.14.100 port=5432 user=repmgr dbname=repmgr connect_timeout=2
2 | node2 | primary | * running | | default | 100 | 2 | host=10.0.14.101 port=5432 user=repmgr dbname=repmgr connect_timeout=2
3 | node3 | witness | * running | node2 | default | 0 | n/a | host=10.0.14.109 port=5432 user=repmgr dbname=repmgr connect_timeout=2

WARNING: following issues were detected

  • node "node1" (ID: 1) is running but the repmgr node record is inactive

I don't understand why node1 is still shown as a primary after node2 was promoted.

stephan-hahn commented Apr 25, 2024

Hi, repmgr has no built-in automatic rejoin. Simply starting the old master again creates a split-brain scenario, because it comes back up as a second primary. It is no problem, however, to rejoin the old master automatically via a script once the new master has been promoted.
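
A rough sketch of what such a rejoin script might look like, run on the old master (node1). This is not a tested recipe for this cluster. It assumes that `repmgr node rejoin` with `--force-rewind` is usable here, which in turn needs `pg_rewind` prerequisites such as `wal_log_hints = on` or data checksums. The helper name `rejoin_cmd` and the `RUN` switch are made up for illustration; the config path and connection string are copied from the output above.

```shell
#!/bin/sh
# Hypothetical helper: prints the repmgr command that re-attaches a former
# primary as a standby. Arguments: repmgr.conf path, conninfo of the current
# primary, optional extra flag (for example --dry-run).
rejoin_cmd() {
    printf "repmgr -f %s node rejoin -d '%s' --force-rewind --verbose%s\n" \
        "$1" "$2" "${3:+ $3}"
}

# Values taken from the cluster output in this issue; adjust for other setups.
REPMGR_CONF="/disk1/postgresql/repmgr/repmgr.conf"
NEW_PRIMARY="host=10.0.14.101 port=5432 user=repmgr dbname=repmgr connect_timeout=2"

# Safety switch (an assumption of this sketch): nothing is executed unless
# RUN=1 is set; by default the command is only printed for review.
if [ "${RUN:-0}" = "1" ]; then
    # Stop the local instance first, then check the prerequisites with
    # --dry-run, and only then perform the actual rejoin.
    supervisorctl stop postgresql \
        && eval "$(rejoin_cmd "$REPMGR_CONF" "$NEW_PRIMARY" --dry-run)" \
        && eval "$(rejoin_cmd "$REPMGR_CONF" "$NEW_PRIMARY")"
else
    rejoin_cmd "$REPMGR_CONF" "$NEW_PRIMARY"
fi
```

Note that repmgr starts PostgreSQL itself at the end of a rejoin. With a supervisor-managed instance, as in this report, that probably means pointing `service_start_command` in repmgr.conf at the matching `supervisorctl` call, so that supervisor and repmgr do not disagree about who owns the process.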
