Failover when the postmaster is hung #1371

Open
kmsarabu opened this issue Jan 22, 2020 · 5 comments

@kmsarabu

kmsarabu commented Jan 22, 2020

We are looking into replacing Repmgr with Patroni and testing various failover scenarios.

It seems that Patroni currently keeps a persistent connection to the master database. If the postmaster hangs for any reason, it blocks all new connection requests to the master database. Patroni appears unable to identify such a scenario and does not trigger a failover.

We have also tested the same scenario with repmgr, which was able to identify this case and fail over the database. Instead of keeping a persistent connection, repmgr uses the reconnect_interval and reconnect_attempts parameters and fails over the database if it cannot connect to the master within the configured threshold.
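
For illustration, here is a minimal sketch of that reconnect-based detection in Python, assuming psycopg2 and the connection details from the test case below; the attempt/interval values are placeholders, not repmgr's or Patroni's actual code:

```python
# Minimal sketch of reconnect-based liveness detection (illustrative only).
# Assumes psycopg2 is installed; host/port/user mirror the test case below.
import time
import psycopg2

RECONNECT_ATTEMPTS = 3   # analogous to repmgr's reconnect_attempts
RECONNECT_INTERVAL = 10  # seconds, analogous to repmgr's reconnect_interval

def master_is_reachable():
    """Open a *fresh* connection each time; a hung postmaster never answers
    the startup packet, so connect_timeout fires and the attempt fails."""
    for attempt in range(RECONNECT_ATTEMPTS):
        try:
            conn = psycopg2.connect(
                host="host1", port=4345, dbname="postgres",
                user="krishna", connect_timeout=5,
            )
            conn.close()
            return True
        except psycopg2.OperationalError:
            if attempt + 1 < RECONNECT_ATTEMPTS:
                time.sleep(RECONNECT_INTERVAL)
    return False  # the caller would then consider triggering a failover
```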

Test case:

[host1] : ps -eaf |egrep postgres|egrep config |egrep listen_address
postgres 193623 1 0 17:13 ? 00:00:00 /postgres/product/server/10.3.2/bin/postgres -D /postgres/data001/pgrept1/sysdata --config-file=/postgres/data001/pgrept1/sysdata/postgresql.conf --listen_addresses=host1 --port=4345 --cluster_name=pgrept1 --wal_level=logical --hot_standby=on --max_connections=1000 --max_wal_senders=10 --max_prepared_transactions=0 --max_locks_per_transaction=64 --track_commit_timestamp=off --max_replication_slots=10 --max_worker_processes=8 --wal_log_hints=on

[host1] : kill -STOP 193623

[host1] : psql -h host1 -p 4345 -U krishna
HUNG ..

From the patroni.log:
2020-01-22 13:35:35,082 INFO: Lock owner: host1; I am host1
2020-01-22 13:35:35,090 INFO: no action. i am the leader with the lock
...
2020-01-22 13:37:05,082 INFO: Lock owner: host1; I am host1
2020-01-22 13:37:05,091 INFO: no action. i am the leader with the lock
....

@CyberDem0n
Collaborator

Frankly speaking, I have never seen a situation where the postmaster hung so badly that it was unable to respond at all, but I have seen a lot of situations where all connection slots are occupied and therefore a new connection is rejected.
Opening a new connection is a very expensive operation; if the system is under stress it could easily take seconds. Doing a failover in such a case is, IMO, a bit insane, because:

  1. First of all, you'll have to fence the old primary. Depending on how you do fencing, it might result in some data loss.
  2. Clients that are currently connected would not be happy.
  3. It is highly likely that the new primary would end up under even more stress than the old one.

@kmsarabu
Author

Thanks for the quick response and for looking into this issue.

Unfortunately, we ran into this issue with both Oracle and PostgreSQL databases.

Agreed that spawning connections frequently is an expensive operation. However, it can be justified for highly available services, provided the check interval is small enough to allow timely detection and large enough to limit the associated overhead.

Agreed that the master should be fenced in such a scenario, and the setup should also ensure that there is no data loss during failover.

Oracle has a similar feature. Back in version 11.2.0.4 it introduced ObserverReconnect, a Data Guard observer parameter that allows such a scenario to be detected and a failover to be triggered. As part of the failover, Oracle Data Guard takes the former primary offline to fence it, and synchronous replication is used to prevent data loss. When the parameter is set to 0, the observer uses a persistent connection.

Could something similar be done in PostgreSQL using Patroni? In our PostgreSQL case, the postmaster (or the database) hung and blocked new incoming connections for an extended period of time, leading to a loss of service.

  1. The former primary should be demoted and the cluster should go through an election process to elect a new leader, similar to what Oracle does.
  2. There is already a loss of service if the database is hung, since it cannot process new connection requests or requests from existing connections. Clients will be unhappy anyway in this case.

Please share your thoughts.

@CyberDem0n
Collaborator

> Unfortunately, we ran into this issue with both Oracle and PostgreSQL databases.

I can't say anything about Oracle... We are running hundreds of Postgres clusters, and statistically we would already have hit such a problem, but in fact we never have.

> Agreed that the master should be fenced in such a scenario, and the setup should also ensure that there is no data loss during failover.

If the postmaster is really hanging, a clean shutdown (especially in a timely manner) is barely possible, and therefore you may lose some data.

> In our PostgreSQL case, the postmaster (or the database) hung and blocked new incoming connections for an extended period of time, leading to a loss of service.

As I already explained, there are different reasons why a connection might not be opened. The most common one is that all connection slots are occupied; we have seen that a lot, but we have really never observed a hanging postmaster.

> There is already a loss of service if the database is hung, since it cannot process new connection requests or requests from existing connections. Clients will be unhappy anyway in this case.

How is that possible? Patroni regularly runs queries over its existing connection, so it is in the same position as other connected clients.

> Please share your thoughts.

OK, let's assume we do a failover in such a situation. It is highly likely that the new primary will sooner or later hit the same issue (and usually it happens quickly). In the end you will get a chain of failovers, which would be even worse.

If you are really seeing a hanging postmaster, it should be investigated/debugged and reported as a Postgres bug.

@kmsarabu
Author

kmsarabu commented Jan 22, 2020

> If the postmaster is really hanging, a clean shutdown (especially in a timely manner) is barely possible, and therefore you may lose some data.

A setup utilizing synchronous replication would help overcome this issue (e.g. synchronous_standby_names set to “[FIRST|ANY] num_sync (standby list)” and synchronous_commit set to remote_apply). Transactions on the former master then effectively hang waiting for an acknowledgement once the synchronous standbys start following the newly elected leader. Repmgr does this in a controlled way: it cancels the WAL receiver process on all standby nodes during failover so they no longer receive changes from the master. This ensures zero data loss, since the master is configured with synchronous replication using the above settings. The remaining standby nodes go through leader election, a new leader is elected, and the remaining nodes follow it.
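
As a concrete illustration, a minimal sketch of applying those synchronous replication settings from a script, assuming psycopg2 and hypothetical standby names standby1/standby2:

```python
# Minimal sketch of the synchronous replication settings described above
# (illustrative only; standby names and connection details are assumptions).
import psycopg2

conn = psycopg2.connect(host="host1", port=4345, dbname="postgres", user="krishna")
conn.autocommit = True  # ALTER SYSTEM cannot run inside a transaction block
with conn.cursor() as cur:
    cur.execute("ALTER SYSTEM SET synchronous_standby_names = 'FIRST 1 (standby1, standby2)'")
    cur.execute("ALTER SYSTEM SET synchronous_commit = 'remote_apply'")
    cur.execute("SELECT pg_reload_conf()")  # both settings take effect on reload
conn.close()
```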

> How is that possible? Patroni regularly runs queries over its existing connection, so it is in the same position as other connected clients.

I gave a scenario where only the postmaster was hung. I will be testing various scenarios, including a hung database. I just tested with Patroni's database connection hung (to simulate a hung database) on the master, and no failover action was taken. It looks like Patroni doesn't use any statement_timeout, which probably makes the Patroni process wait on the query call forever.

[host1] : ps -eaf |egrep pgrept1|egrep patroni|egrep "postgres:"
postgres  44282  29922  0 13:59 ?        00:00:00 postgres: pgrept1: patroni postgres xx.xxx.xxx.xx(51856) idle   

[host1] : kill -STOP 44282
[host1] : tail -f patroni.log
...
2020-01-22 16:03:05,362 INFO: Lock owner: host1; I am host1
<No update after this>

> If you are really seeing a hanging postmaster, it should be investigated/debugged and reported as a Postgres bug.

Yes, of course. That's our standard process: we go through a postmortem to find out the cause and get it fixed (with vendors, if the issue pertains to their code).

> OK, let's assume we do a failover in such a situation. It is highly likely that the new primary will sooner or later hit the same issue (and usually it happens quickly). In the end you will get a chain of failovers, which would be even worse.

The chain of failovers (ping-ponging back and forth) can probably be prevented by allowing only x consecutive failovers within a certain time window. The underlying issue that caused the failover is probably limited to that one machine and may not be carried over by the failover.
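
As a sketch of that guard (the window, limit, and storage of past failovers are assumptions for illustration):

```python
# Minimal sketch of limiting automatic failovers within a time window
# (illustrative only; real tooling would persist the history somewhere shared).
import time

FAILOVER_WINDOW = 3600  # seconds
MAX_FAILOVERS = 2       # allow at most this many automatic failovers per window

failover_history = []   # timestamps of past automatic failovers

def failover_allowed(now=None):
    now = now if now is not None else time.time()
    recent = [t for t in failover_history if now - t < FAILOVER_WINDOW]
    return len(recent) < MAX_FAILOVERS

def record_failover(now=None):
    failover_history.append(now if now is not None else time.time())
```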

The point I am trying to make is about database high availability and service continuity across various failure scenarios.

@CyberDem0n
Collaborator

> I gave a scenario where only the postmaster was hung.

Sorry, but no. When you send SIGSTOP you are just shooting yourself in the foot, and it hurts badly.

> It looks like Patroni doesn't use any statement_timeout

You haven't looked into the source code, have you? JFYI, statement_timeout is handled by Postgres!

> which probably makes the Patroni process wait on the query call forever.

This is not a statement_timeout problem. Depending on how Patroni is connected to Postgres, it could be fixed by enabling TCP keepalives (or not, in the case of Unix-domain sockets).
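
For reference, a minimal sketch of enabling TCP keepalives on a libpq connection via psycopg2; the timeout values are assumptions to tune:

```python
# Minimal sketch: TCP keepalives on a libpq connection, so a dead peer or a
# broken network path is eventually detected (applies to TCP connections only,
# not Unix-domain sockets). The values below are assumptions; tune them.
import psycopg2

conn = psycopg2.connect(
    host="host1", port=4345, dbname="postgres", user="krishna",
    keepalives=1,            # enable TCP keepalives
    keepalives_idle=10,      # seconds of idleness before the first probe
    keepalives_interval=5,   # seconds between probes
    keepalives_count=3,      # unanswered probes before the connection is dropped
)
```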

> Yes, of course. That's our standard process: we go through a postmortem to find out the cause

And? What are the results of your investigation? Have you found the reason for the hanging postmaster?

> The underlying issue that caused the failover is probably limited to that one machine and may not be carried over by the failover.

The keyword here is "probably". But my experience clearly shows that such issues are usually related to the workload and persist after a failover.

> The point I am trying to make is about database high availability and service continuity across various failure scenarios.

Doing unnecessary failovers improves neither availability nor service continuity.

P.S. If you would really like to have this feature, go ahead and implement it; we will happily review and merge the pull request.

CyberDem0n pushed a commit that referenced this issue Apr 15, 2020
## Feature: Postgres stop timeout

A switchover/failover operation hangs on the signal_stop (or checkpoint) call when the postmaster doesn't respond or hangs for some reason (issue described in [1371](#1371)). This leads to a loss of service for an extended period of time, until the hung postmaster starts responding or is killed by some other actor.

### master_stop_timeout

The number of seconds Patroni is allowed to wait when stopping Postgres; effective only when synchronous_mode is enabled. When set to a value > 0 and synchronous_mode is enabled, Patroni sends SIGKILL to the postmaster if the stop operation has been running for longer than master_stop_timeout. Set the value according to your durability/availability tradeoff. If the parameter is not set or is set to a value <= 0, master_stop_timeout does not apply.
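
As an illustration of the mechanism only (a sketch under assumptions, not Patroni's actual implementation), the behaviour boils down to something like:

```python
# Illustrative sketch of the stop-timeout idea: request a fast shutdown, and
# if the postmaster is still alive after master_stop_timeout seconds, SIGKILL it.
import os
import signal
import subprocess
import time

MASTER_STOP_TIMEOUT = 30  # seconds; tune to your durability/availability tradeoff

def stop_postgres(pgdata, postmaster_pid):
    subprocess.Popen(["pg_ctl", "stop", "-D", pgdata, "-m", "fast"])
    deadline = time.time() + MASTER_STOP_TIMEOUT
    while time.time() < deadline:
        try:
            os.kill(postmaster_pid, 0)  # signal 0 only checks whether the process exists
        except OSError:
            return True                 # postmaster exited within the timeout
        time.sleep(1)
    os.kill(postmaster_pid, signal.SIGKILL)  # escalate, as the feature describes
    return False
```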