
RFC: design for supporting synchronous quorum commit in Patroni #664

Open
jberkus opened this issue Apr 13, 2018 · 7 comments · May be fixed by #672

jberkus (Contributor) commented Apr 13, 2018

Postgres 10 adds synchronous quorum commit as a feature, which makes using synchronous replication to reduce data loss during failover much more practical for Patroni. Here's a draft of how this could work:

  1. Clusters would get a new setting, synchronous_quorum(int), defaulting to 0.
  2. If turned on (> 0), all Patroni nodes would set synchronous_standby_names to any $synchronous_quorum ( list, of, all, nodes )
  3. The polling cycle would add an extra check to see if new nodes have been added, and if so, update synchronous_standby_names.

For example, if synchronous_quorum=1 and there are 4 nodes, the setting on each Postgres would be:

synchronous_standby_names = 'any 1 ( patroni-0, patroni-1, patroni-2, patroni-3 )'
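
As an illustration only (the helper name and the source of the node list are assumptions, not part of the proposal), each node could derive the setting roughly like this:

# Hypothetical sketch; Patroni's real implementation and naming will differ.
def build_sync_standby_names(all_nodes, synchronous_quorum):
    if synchronous_quorum <= 0:
        return ''  # quorum commit disabled
    return 'ANY %d ( %s )' % (synchronous_quorum, ', '.join(sorted(all_nodes)))

# With synchronous_quorum=1 and nodes patroni-0..patroni-3 this yields:
# 'ANY 1 ( patroni-0, patroni-1, patroni-2, patroni-3 )'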

Questions:

  • We don't have cluster settings, patroni settings are per-node. If some nodes had synchronous_quorum turned on, and other nodes did not, it could result in unexpected behavior. Is this a real problem?
  • In the kubernetes-native version, we no longer maintain a list of all nodes in the DCS. Presumably we would need to add one to the ConfigMap for this to work? Or is it sufficient to estimate based on the statefulset size?
  • How do we want to handle nodes dying or being removed? Do we want to update the list of replicas? If so, how do we want to handle this in order to not create a data-losing race condition? If we don't update automatically, how do we want to enable administrators to do a cleanup?
  • Will admins be able to push a config change of synchronous_quorum=0 to a single remaining node, in the event that they're down to one node and rescuing the cluster?

ants (Collaborator) commented Apr 14, 2018

I have done some thinking along these lines. My notes on this:

  • The list of nodes currently considered synchronous should be stored in SyncState in DCS. As would be the value of synchronous_quorum.
  • A node can consider itself healthy if:
    1. It is considered a synchronous candidate by SyncState
    2. From the list of synchronous_candidates + [leader] it can get the xlog position of all but synchronous_quorum nodes (see the sketch after the examples below).
  • The leader manages the list of synchronous candidates. Compare-and-swap semantics avoid issues with races.
  • If a master cannot be elected due to synchronous_quorum, the admin must force a failover with patronictl failover to acknowledge the possibility of data loss.

A couple of examples to illustrate how it would work:

Simple 3 node cluster:

  • leader=p0 candidates=p1,p2 quorum=1
  • p0 fails
  • p1 attempts to contact p0 and p2.
    • p0 times out
    • p2 can be reached.
  • p1 can promote itself if its xlog position is not behind p2's.

Same as earlier, but p2 cannot be reached:

  • p1 cannot promote itself, because p2 might have the latest transaction.

5 node cluster:

  • leader=p0 candidates=p1,p2,p3,p4 quorum=2
  • p0 and p2 fail
  • p1 attempts to contact p0,p2,p3,p4
    • Gets response from p3 and p4
  • Because at least 2 of {p1,p2,p3,p4} have the latest commit and we know the xlog positions of {p1,p3,p4}, p1 can consider itself able to promote.
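
A rough sketch of this safety check, assuming a get_xlog_position helper that returns None on timeout (function and parameter names are illustrative, not Patroni's API):

# Illustrative only; mirrors the rules and examples above, not Patroni code.
def can_promote(me, leader, sync_candidates, quorum, get_xlog_position):
    if me not in sync_candidates:
        return False  # only synchronous candidates may promote safely
    others = (set(sync_candidates) | {leader}) - {me}
    positions = {}
    for node in others:
        pos = get_xlog_position(node)  # assumed helper; returns None on timeout
        if pos is not None:
            positions[node] = pos
    # At most `quorum` nodes may be unreachable: every committed transaction is
    # on the leader plus at least `quorum` candidates, so at least one node
    # holding it is either reachable or is `me`.
    if len(others) - len(positions) > quorum:
        return False
    my_pos = get_xlog_position(me)
    return all(my_pos >= pos for pos in positions.values())

# 3-node example above: me=p1, leader=p0, sync_candidates={p1, p2}, quorum=1;
# with p0 timed out and p2 reachable, p1 promotes only if it is not behind p2.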

jberkus (Contributor, Author) commented Apr 16, 2018

Ants,

That makes sense. It's also pointless to promote a node if there is no candidate sync replica, since the promoted node would still not be able to accept synchronous writes, so that all works.

What I'm still concerned about is maintaining the list of candidate nodes, particularly in a "rolling blackout" situation. While you've eliminated a race condition by saying only the master can edit the list of nodes, there's still an inherent danger if the master is in the process of adding and removing nodes from the list and then goes dark (this is not a new problem; Cassandra and similar databases have a lot of logic devoted to this issue, and it's why Raft requires a fixed cluster size).

A possible way to ameliorate the uncertainty caused by adding and removing nodes would be simply to make it slow: make the timeout for either adding or dropping a node from the list 5 minutes, for example. The drawback is that it would take a lot longer for a cluster to right itself when the nodes come back.

ants (Collaborator) commented Apr 17, 2018

The synchronous list maintenance is different from maintaining cluster consensus in Cassandra et al., as Patroni delegates the consensus problem to the DCS. Externalizing the consensus allows us to be quite flexible in what constitutes a consensus. We just have to follow ordering constraints to ensure the state stored in the DCS is always conservative/pessimistic: if increasing redundancy (going from a 1-of-2 quorum to a 1-of-3 quorum), replicate first, publish later; if increasing the number of candidates (from a 1-of-2 to a 2-of-3 quorum), publish first, replicate later.

To put it more formally: given a master with synchronous_standby_names = ANY k (SyncStandbys), SyncSet = SyncStandbys ∪ {master}, replication_factor = k+1, and a DCS state of any quorum_size of CandidateSet, at any point the following invariant must hold:

  • |CandidateSet ∪ SyncSet| < replication_factor + quorum_size => any replication set and any quorum set of specified sizes will have at least one overlap.
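
As an illustration only (the function name and the Python set representation are my own, not part of the proposal), the invariant is a one-line check:

# Illustrative check of the invariant; not Patroni code.
def invariant_holds(candidate_set, sync_set, replication_factor, quorum_size):
    # Any replication set of size replication_factor drawn from sync_set and
    # any quorum set of size quorum_size drawn from candidate_set must overlap,
    # which is guaranteed when the union of the two sets is small enough.
    return len(candidate_set | sync_set) < replication_factor + quorum_size

# 3-node example: master p0 with ANY 1 (p1, p2) gives replication_factor = 2;
# a DCS state of "any 2 of {p0, p1, p2}" gives quorum_size = 2.
# |{p0, p1, p2}| = 3 < 2 + 2, so the invariant holds.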

The user interface for this would generalize nicely over what we have now. The user would have to pick 2 things:

  • Maximum number of nodes that can fail at once before quorum is lost.
  • Minimum replication factor.

I haven't yet figured out how to migrate current configs over, but the following mapping from current settings seems to make sense:

  • synchronous_mode=true, synchronous_mode_strict=false -> max_fail=1, min_replication_factor=0
  • synchronous_mode=true, synchronous_mode_strict=true -> max_fail=1, min_replication_factor=1
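
The same mapping in code form, as a hedged sketch (the function name and return shape are assumptions, not an existing Patroni API):

# Hypothetical mapping of the two synchronous_mode=true cases above.
def migrate_sync_settings(synchronous_mode_strict):
    return {'max_fail': 1,
            'min_replication_factor': 1 if synchronous_mode_strict else 0}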

jberkus (Contributor, Author) commented Apr 17, 2018

Ants,

Am I understanding this right? It seems like that design would result in increasing k to match the number of replicas, if max_fail isn't also increased.

ants (Collaborator) commented Apr 19, 2018

In steady state parameter values would be:

SyncStandbys = CandidateSet \ {master}
k = clamp(max_fail, min=min_replication_factor, max=|SyncStandbys|)
quorum_size = |CandidateSet| - k
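
A short sketch of those formulas (names are illustrative; clamp is written out as min/max):

# Illustrative computation of the steady-state values above; not Patroni code.
def steady_state(candidate_set, master, max_fail, min_replication_factor):
    sync_standbys = set(candidate_set) - {master}
    k = max(min_replication_factor, min(max_fail, len(sync_standbys)))
    quorum_size = len(candidate_set) - k
    return sync_standbys, k, quorum_size

# Example: 5 nodes, max_fail=2, min_replication_factor=0
# -> 4 sync standbys, k = 2, quorum_size = 3.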

ants (Collaborator) commented May 1, 2018

I pushed a prototype of this as #672

haslersn (Contributor) commented:

Why do you need a list of all nodes, instead of setting the following?

synchronous_standby_names = 'any 1 ( * )'

hughcapet linked a pull request Dec 3, 2023 that will close this issue.