
RFC: design for supporting synchronous quorum commit in Patroni #664

Open
jberkus opened this issue Apr 13, 2018 · 7 comments · May be fixed by #672

jberkus (Contributor) commented Apr 13, 2018

Postgres 10 adds synchronous quorum commit as a feature, which makes using synchronous replication to reduce data loss during failover much more practical for Patroni. Here's a draft of how this could work:

  1. Clusters would get a new setting, synchronous_quorum(int), defaulting to 0.
  2. If turned on (> 0), all Patroni nodes would set synchronous_standby_names to any $synchronous_quorum ( list, of, all, nodes )
  3. The polling cycle would add an extra check to see if new nodes have been added, and if so, update synchronous_standby_names.

For example, if synchronous_quorum=1 and there are 4 nodes, the setting on each Postgres would be:

synchronous_standby_names = 'any 1 ( patroni-0, patroni-1, patroni-2, patroni-3 )'
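
As an illustration only (the helper name and the source of the node list are assumptions, not part of the proposal), each node could derive the setting roughly like this:

# Hypothetical sketch; Patroni's real implementation and naming will differ.
def build_sync_standby_names(all_nodes, synchronous_quorum):
    if synchronous_quorum <= 0:
        return ''  # quorum commit disabled
    return 'ANY %d ( %s )' % (synchronous_quorum, ', '.join(sorted(all_nodes)))

# With synchronous_quorum=1 and nodes patroni-0..patroni-3 this yields:
# 'ANY 1 ( patroni-0, patroni-1, patroni-2, patroni-3 )'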

Questions:

  • We don't have cluster settings, patroni settings are per-node. If some nodes had synchronous_quorum turned on, and other nodes did not, it could result in unexpected behavior. Is this a real problem?
  • In the kubernetes-native version, we no longer maintain a list of all nodes in the DCS. Presumably we would need to add one to the ConfigMap for this to work? Or is it sufficient to estimate based on the statefulset size?
  • How do we want to handle nodes dying or being removed? Do we want to update the list of replicas? If so, how do we want to handle this in order to not create a data-losing race condition? If we don't update automatically, how do we want to enable administrators to do a cleanup?
  • Will admins be able to push a config change of synchronous_quorum=0 to a single remaining node, in the event that they're down to one node and rescuing the cluster?

ants (Collaborator) commented Apr 14, 2018

I have done some thinking along these lines. My notes on this:

  • The list of nodes currently considered synchronous should be stored in SyncState in DCS. As would be the value of synchronous_quorum.
  • A node can consider itself healthy if:
    1. It is considered a synchronous candidate by SyncState
    2. From the list of synchronous_candidates + [leader] it can get the xlog position of all but synchronous_quorum nodes (see the sketch after the examples below).
  • The leader manages the list of synchronous candidates. Compare-and-swap semantics avoid issues with races.
  • If a master cannot be elected due to synchronous_quorum, the admin must force a failover with patronictl failover to acknowledge the possibility of data loss.

A couple of examples to illustrate how it would work:

Simple 3 node cluster:

  • leader=p0 candidates=p1,p2 quorum=1
  • p0 fails
  • p1 attempts to contact p0 and p2.
    • p0 times out
    • p2 can be reached.
  • p1 can promote itself if its xlog position is not behind p2's.

Same as earlier, but p2 cannot be reached:

  • p1 cannot promote itself, because p2 might have the latest transaction.

5 node cluster:

  • leader=p0 candidates=p1,p2,p3,p4 quorum=2
  • p0 and p2 fail
  • p1 attempts to contact p0,p2,p3,p4
    • Gets response from p3 and p4
  • Because at least 2 of {p1,p2,p3,p4} have the latest commit and we know the xlog positions of {p1,p3,p4}, p1 can consider itself able to promote.
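
A rough sketch of this safety check, assuming a get_xlog_position helper that returns None on timeout (function and parameter names are illustrative, not Patroni's API):

# Illustrative only; mirrors the rules and examples above, not Patroni code.
def can_promote(me, leader, sync_candidates, quorum, get_xlog_position):
    if me not in sync_candidates:
        return False  # only synchronous candidates may promote safely
    others = (set(sync_candidates) | {leader}) - {me}
    positions = {}
    for node in others:
        pos = get_xlog_position(node)  # assumed helper; returns None on timeout
        if pos is not None:
            positions[node] = pos
    # At most `quorum` nodes may be unreachable: every committed transaction is
    # on the leader plus at least `quorum` candidates, so at least one node
    # holding it is either reachable or is `me`.
    if len(others) - len(positions) > quorum:
        return False
    my_pos = get_xlog_position(me)
    return all(my_pos >= pos for pos in positions.values())

# 3-node example above: me=p1, leader=p0, sync_candidates={p1, p2}, quorum=1;
# with p0 timed out and p2 reachable, p1 promotes only if it is not behind p2.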

jberkus (Contributor, Author) commented Apr 16, 2018

Ants,

That makes sense. It's also pointless to promote a node if there is no candidate sync replica, since the promoted node would still not be able to accept synchronous writes, so that all works.

What I'm still concerned about is maintaining the list of candidate nodes, particularly in a "rolling blackout" situation. While you've eliminated a race condition by saying only the master can edit the list of nodes, there's still an inherent danger if the master is in the process of adding and removing nodes from the list and then goes dark (this is not a new problem; Cassandra and similar databases have a lot of logic devoted to this issue, and it's why Raft requires a fixed cluster size).

A possible way to ameliorate the uncertainty caused by adding and removing nodes would be simply to make it slow: make the timeout for either adding or dropping a node from the list 5 minutes, for example. The drawback is that it would take a lot longer for a cluster to right itself when the nodes come back.

ants (Collaborator) commented Apr 17, 2018

The synchronous list maintenance is different from maintaining cluster consensus in Cassandra et al., as Patroni delegates the consensus problem to the DCS. Externalizing the consensus allows us to be quite flexible in what constitutes a consensus. We just have to follow ordering constraints to ensure the state stored in the DCS is always conservative/pessimistic: if increasing redundancy (going from a 1-of-2 quorum to a 1-of-3 quorum), replicate first, publish later; if increasing the number of candidates (from a 1-of-2 to a 2-of-3 quorum), publish first, replicate later.

To put it more formally: given a master with synchronous_standby_names = ANY k (SyncStandbys), SyncSet = SyncStandbys ∪ {master}, replication_factor = k+1, and a DCS state of any quorum_size of CandidateSet, at any point the following invariant must hold:

  • |CandidateSet ∪ SyncSet| < replication_factor + quorum_size => any replication set and any quorum set of specified sizes will have at least one overlap.
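
As an illustration only (the function name and the Python set representation are my own, not part of the proposal), the invariant is a one-line check:

# Illustrative check of the invariant; not Patroni code.
def invariant_holds(candidate_set, sync_set, replication_factor, quorum_size):
    # Any replication set of size replication_factor drawn from sync_set and
    # any quorum set of size quorum_size drawn from candidate_set must overlap,
    # which is guaranteed when the union of the two sets is small enough.
    return len(candidate_set | sync_set) < replication_factor + quorum_size

# 3-node example: master p0 with ANY 1 (p1, p2) gives replication_factor = 2;
# a DCS state of "any 2 of {p0, p1, p2}" gives quorum_size = 2.
# |{p0, p1, p2}| = 3 < 2 + 2, so the invariant holds.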

The user interface for this would generalize nicely over what we have now. The user would have to pick 2 things:

  • Maximum number of nodes that can fail at once before quorum is lost.
  • Minimum replication factor.

I haven't yet figured out how to migrate current configs over, but the following mapping from current settings seems to make sense:

  • synchronous_mode=true, synchronous_mode_strict=false -> max_fail=1, min_replication_factor=0
  • synchronous_mode=true, synchronous_mode_strict=true -> max_fail=1, min_replication_factor=1
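
The same mapping in code form, as a hedged sketch (the function name and return shape are assumptions, not an existing Patroni API):

# Hypothetical mapping of the two synchronous_mode=true cases above.
def migrate_sync_settings(synchronous_mode_strict):
    return {'max_fail': 1,
            'min_replication_factor': 1 if synchronous_mode_strict else 0}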

jberkus (Contributor, Author) commented Apr 17, 2018

Ants,

Am I understanding this right? It seems like that design would result in increasing k to match the number of replicas, if max_fail isn't also increased.

ants (Collaborator) commented Apr 19, 2018

In steady state parameter values would be:

SyncStandbys = CandidateSet \ {master}
k = clamp(max_fail, min=min_replication_factor, max=|SyncStandbys|)
quorum_size = |CandidateSet| - k
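
A short sketch of those formulas (names are illustrative; clamp is written out as min/max):

# Illustrative computation of the steady-state values above; not Patroni code.
def steady_state(candidate_set, master, max_fail, min_replication_factor):
    sync_standbys = set(candidate_set) - {master}
    k = max(min_replication_factor, min(max_fail, len(sync_standbys)))
    quorum_size = len(candidate_set) - k
    return sync_standbys, k, quorum_size

# Example: 5 nodes, max_fail=2, min_replication_factor=0
# -> 4 sync standbys, k = 2, quorum_size = 3.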

ants (Collaborator) commented May 1, 2018

I pushed a prototype of this as #672

haslersn (Contributor) commented:

Why do you need a list of all nodes, instead of setting the following?

synchronous_standby_names = 'any 1 ( * )'

hughcapet linked a pull request Dec 3, 2023 that will close this issue.