Quorum commit feature #672

Open · wants to merge 13 commits into master

Conversation

@ants (Collaborator) commented May 1, 2018

I did some work on the subject; I did not have enough time to get it finished, but it is far enough along to solicit some feedback.

Description

The feature generalizes synchronous_mode to multiple nodes being in sync. The trade-off between synchronization overhead and fault tolerance is user-selectable via the replication_factor configuration parameter, which specifies how many nodes must contain a commit before it is acknowledged.

The cluster retains its self-healing capability if up to replication_factor - 1 nodes fail within one HA loop interval. For example, with replication_factor = 3 every acknowledged commit is present on at least three nodes, so after any two simultaneous failures at least one node with the latest acknowledged commit survives.

The old synchronous_commit corresponds to replication_factor = 2.

On PostgreSQL 10 the quorum commit ANY k (n) feature is used. On 9.6 the k (n) variant is used, picking the first nodes. On even older versions the maximum replication_factor is 2. However, this is still an improvement over the status quo, as it delegates sync standby selection to PostgreSQL, reducing commit latency on sync standby failure.
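To make the version-dependent behaviour concrete, here is a minimal sketch (illustrative only, not code from this PR; the function and names are made up) of how synchronous_standby_names could be derived from replication_factor:

def build_synchronous_standby_names(major_version, replication_factor, standbys):
    # `standbys` is a list of standby application_names, e.g. ['node2', 'node3', 'node4'].
    # The leader itself counts towards replication_factor, hence numsync = replication_factor - 1.
    numsync = replication_factor - 1
    listed = ', '.join('"%s"' % name for name in standbys)
    if major_version >= 100000:
        # PostgreSQL 10+: quorum commit, any `numsync` of the listed standbys may acknowledge
        return 'ANY %d (%s)' % (numsync, listed)
    if major_version >= 90600:
        # 9.6: priority-based `k (n)`, the first `numsync` connected standbys are synchronous
        return '%d (%s)' % (numsync, listed)
    # Older versions support only a single synchronous standby, i.e. replication_factor <= 2
    return '"%s"' % standbys[0] if standbys else ''

For example, with replication_factor = 3 and standbys ['node2', 'node3', 'node4'] this would yield ANY 2 ("node2", "node3", "node4") on PostgreSQL 10.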

DCS sync state is changed from {"leader": .., "sync_standby": ..} to {"leader": .., "quorum": .., "members": [..]}, where leader is kept mainly as an informative field.
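For illustration, the old and new shapes of the sync key could look like this (member names made up; the structure matches the example that appears later in this thread):

# Old format: a single designated synchronous standby
old_sync_state = {"leader": "node1", "sync_standby": "node2"}
# New format: a quorum value plus the list of members eligible to vote on it
new_sync_state = {"leader": "node1", "quorum": 2, "members": ["node1", "node2", "node3"]}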

State

The main quorum management engine should be in a reasonable state although I haven't tested it yet.

Open issues:

  • Is replication_factor a reasonable UI for this? Do we want to keep synchronous_commit as the main switch for this functionality.
  • Nodes that aren't actually taking part in synchronous replication can still promote themselves if they can verify their wal position with enough nodes that are taking part. Does anyone see a problem with this? It causes minor extra complexity when promoting such node, but on the other hand simplifies health checks.
  • I could not figure out a way how to find out from PostgreSQL when synchronous_standby_names becomes active. Currently there is a small race condition there. I'm not even sure if the previous check of pg_stat_replication showing walsender as sync was enough as a cursory review of postgres code hints that a walsender might think it is sync while a backend hasn't gotten the message that sync replication was enabled. More study of the code is needed.
  • Quorum engine manages transitioning DCS and sync state between "safe" states. If somebody manually modifies state into something we don't think of as valid, there currently isn't any code to automatically recover. Should be simple enough, I just didn't get to it.
  • State handling when promoting could be improved. I think we want to keep synchronous replication running with mostly the same settings as before failover, but remove any nodes that we know are down based on DCS information. Speeds up being able to accept commits.
  • Testsuite is obviously broken. Needs fixing up of old tests and new tests to exercise corner cases of the new stuff.
  • patronictl UI needs updating. How do we want to display the information?
  • There is an impedance mismatch between quorum engine and synchronous_standby_names in that quorum engine thinks of leader as just one additional synchronous replica. Currently the boundary transition between two worlds is in postgresql.py. Maybe some other place would work better.
  • flake8 is probably going to have a fit on this code. A bunch of cleanup is needed.

@CyberDem0n (Collaborator) left a comment:

I've spent a few hours trying to understand how QuorumStateResolver works. Not much success :(
It definitely deserves a few sentences explaining how it is supposed to work.

if name != self._synchronous_standby_names:
if name is None:
if state == 'sync' or (state == 'potential' and current is None):
current = member.name

Collaborator:

This branch seems to be unused.

Collaborator Author:

Oops. That should be sync_state, and it looks like I forgot to finish the code that makes sure that if we can only have one sync standby (version <= 9.5) we don't switch the standby around. Will fix.

else:
assert self.name in sync
sync_standbys = sync.difference([self.name])
standby_list = ", ".join(sorted(sync_standbys)) if sync_standbys else "*"

Collaborator:

Lost quote_ident.

Collaborator Author:

Thanks for catching this. I will add it back in.

current = cluster.sync.sync_standby
current = current.lower() if current else current
sync_standby_names = self.query("SHOW synchronous_standby_names")
result = parse_sync_standby_names(sync_standby_names)

Collaborator:

Why do we need to parse it all the time? Is it some kind of protection from unexpected changes from outside?

Collaborator Author:

Basically yes. It helps us automatically resolve the situation if due to a bug or administrator intervention our understanding of the state goes out of sync. If we aren't too worried about that we could cache the state and avoid spamming the database with unnecessary queries. What do you think?

Collaborator:

The problem is that it could be changed with 'ALTER SYSTEM' which takes precedence over postgresql.conf :(

Collaborator Author:

We could detect that from pg_settings. Now the question is: what should the system do in that situation? Override the administrator with ALTER SYSTEM RESET synchronous_standby_names, make the quorum state match the synchronization state, just complain in the logs where nobody is looking anyway, or panic with sys.exit(1)?

Collaborator:

Looks like executing ALTER SYSTEM RESET synchronous_standby_names is the best we can do.
In addition to that we can write a warning with the previous value into the log.
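A minimal sketch of that idea (assuming a DB-API style cursor and Python's standard logging; not code from this PR): detect from pg_settings that the current value comes from postgresql.auto.conf, log the previous value, and reset it.

import logging

logger = logging.getLogger(__name__)

def reset_alter_system_override(cursor):
    # Find out where the current value of synchronous_standby_names comes from.
    cursor.execute("SELECT setting, sourcefile FROM pg_settings"
                   " WHERE name = 'synchronous_standby_names'")
    setting, sourcefile = cursor.fetchone()
    # ALTER SYSTEM writes the value into postgresql.auto.conf, which overrides postgresql.conf.
    if sourcefile and sourcefile.endswith('postgresql.auto.conf'):
        logger.warning("synchronous_standby_names was overridden via ALTER SYSTEM "
                       "(previous value: %r), resetting it", setting)
        # ALTER SYSTEM cannot run inside a transaction block, so the connection
        # should be in autocommit mode.
        cursor.execute("ALTER SYSTEM RESET synchronous_standby_names")
        cursor.execute("SELECT pg_reload_conf()")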

# First get to requested replication factor
increase_numsync_by = self.sync_wanted - self.numsync
if increase_numsync_by:
    add = set(sorted(to_add)[:increase_numsync_by])

Collaborator:

This one I don't really understand.

It is responsible for https://github.com/zalando/patroni/pull/672/files#diff-57e7f41a6358e844d5245ea7b8409311R12

+        self.assertEquals(list(QuorumStateResolver(1, set("a"), 1, set("a"), active=set("abcde"), sync_wanted=3)), [
+            ('sync', 3, set('abc')),
+            ('quorum', 3, set('abcde')),
+            ('sync', 3, set('abcde')),
+        ])

Why does it first add only 'b' and 'c'? Why not all 4 at the same time? Something like:

+            ('sync', 3, set('abcde')),
+            ('quorum', 3, set('abcde')),

Collaborator Author:

Yes, that would be a correct optimization. Currently the algorithm doesn't take into account that the master is special and is always guaranteed to be one of the synced nodes. This may lead to extra steps when adding multiple nodes at the same time. On the plus side, this allows for promotion before we update the sync state in DCS. I need to think a bit about what the algorithm would look like if we consider the master to be special. If it makes it considerably more complicated than it already is, then it might not be worth the minor improvement in efficiency. I will also write down how and why QuorumStateResolver works.

return
logger.info("Synchronous standby status assigned to %s", picked)
elif transition == 'sync':
logger.info("Setting synchronous replication to %d of %d (%s)",

Collaborator:

This message is misleading when num < min_sync.

Collaborator Author:

Will correct.

sync_wanted=sync_wanted):
if transition == 'quorum':
logger.info("Setting quorum to %d of %d (%s)", num, len(nodes), ", ".join(sorted(nodes)))
if not self.dcs.write_sync_state(leader=self.state_handler.name, quorum=num, members=list(nodes),

Contributor:

write_sync_state - this method should also be changed, no?
Here you are passing leader, quorum, members and index as parameters, but the signature of the method wasn't changed in this pull request.

def write_sync_state(self, leader, sync_standby, index=None):

@anikin-aa (Contributor):

@ants Is there any chance you are getting back to work on this feature?
I am pretty interested in this one; if you need any help, feel free to contact me.

"""
current = cluster.sync.sync_standby
current = current.lower() if current else current
sync_standby_names = self.query("SHOW synchronous_standby_names")

Contributor:

As I can see here https://github.com/zalando/patroni/blob/master/patroni/postgresql.py#L1509,
self.query will return a cursor, not the result of the query execution. And after that you are passing the cursor to
sync_rep_parser_re.finditer, which is not acceptable.

@ants (Collaborator Author) commented Aug 3, 2018 via email

@anikin-aa (Contributor):

@ants @CyberDem0n

I have made some changes and this feature is working as a prototype, but without using the new variables replication_factor and minimum_replication_factor; it simply accepts into the quorum all applications from pg_stat_replication (everyone should be in sync in the quorum).

Also, it's pretty hard to understand the purpose of the QuorumStateResolver, so I just commented out this part.

Is my solution acceptable? Or should I just fix everything in the current PR?

This was referenced Aug 22, 2018
@ants (Collaborator Author) commented Nov 23, 2018

Getting closer to something I am happy with. Work still to be done:

  • Fill the couple of holes in test coverage.
  • Review and polish configuration, API and patronictl interfaces to this feature. Some known issues:
    • Currently the sync standby state display is based on DCS state, not on information from the master. It would be neat if we could get the contents of synchronous_standby_names from the master and use it to show the actual state.
    • It would be great if patronictl list showed a warning when the cluster is read-only because there are not enough nodes to satisfy minimum_replication_factor.
  • Handle standby picking in pre-9.6 versions as it was before this feature. Add tests to verify this.
  • Feature tests.
  • Run real-life torture tests to identify any holes in testing.
  • Improve comments until other people besides me can understand how this works.

I should be able to pick this up sometime next week, publishing what I have so far to solicit feedback.

@anikin-aa (Contributor):

Guys, any updates on this PR?

@CyberDem0n (Collaborator) left a comment:

I also found a small corner case: when I started a new single-node cluster with {synchronous_mode: true, replication_factor: 2, minimum_replication_factor: 2} in bootstrap.dcs, it didn't set synchronous_standby_names.

@@ -309,15 +309,15 @@ def is_failover_possible(self, cluster, leader, candidate, action):
if leader and (not cluster.leader or cluster.leader.name != leader):
return 'leader name does not match'
if candidate:
if action == 'switchover' and cluster.is_synchronous_mode() and cluster.sync.sync_standby != candidate:
return 'candidate name does not match with sync_standby'
if action == 'switchover' and cluster.is_synchronous_mode() and cluster.sync.matches(candidate):

Collaborator:

and not cluster.sync.matches(candidate):

@@ -47,6 +47,8 @@ class Config(object):
'master_start_timeout': 300,
'synchronous_mode': False,
'synchronous_mode_strict': False,
'replication_factor': 2,

Collaborator:

I don't really understand the status of synchronous_mode and synchronous_mode_strict. On the one hand they seem redundant, but on the other hand they are still used in ha.py and postgresql.py.

@property
def use_multiple_sync(self):
    return self._major_version >= 90600

Collaborator:

Line 304 references synchronous_mode_strict, which effectively becomes deprecated.

#
# Regardless of voting, if we observe a node that can become a leader and is ahead, we defer to that node.
# This can lead to failure to act on quorum if there is asymmetric connectivity.
quorum_votes = 1 if self.state_handler.name in voting_set else 0

Collaborator:

It would also be good to add some info output here about the current quorum and voting set; it would make potential investigations easier.
Right now such info output is only available on the master when it processes topology changes, but it might happen that the logs from the master are unrecoverable.


def quorum_update(self, quorum, voters):
    if quorum < 1:
        raise QuorumError("Quorum %d < 0 of (%s)" % (quorum, voters))

Collaborator:

Exception message doesn't match the condition: quorum < 1 and %d < 0


def check_invariants(self):
    if self.quorum and not (len(self.voters | self.sync) < self.quorum + self.numsync):
        raise QuorumError("Quorum and sync not guaranteed to overlap: nodes %d >= quorum %d + sync %d" %

Collaborator:

When I removed the sync key from the DCS, the master crashed:

$ etcdctl rm /service/batman/sync
PrevNode.Value: {"leader":"postgresql2","members":["postgresql1","postgresql0","postgresql2"],"quorum":2}

2019-02-10 09:51:33,719 INFO: Lock owner: postgresql2; I am postgresql2
2019-02-10 09:51:33.720 CET [10143] LOG:  received fast shutdown request
2019-02-10 09:51:33.725 CET [10143] LOG:  aborting any active transactions
2019-02-10 09:51:33.725 CET [10178] FATAL:  terminating connection due to administrator command
2019-02-10 09:51:33.730 CET [10143] LOG:  background worker "logical replication launcher" (PID 10193) exited with exit code 1
2019-02-10 09:51:33.731 CET [10146] LOG:  shutting down
2019-02-10 09:51:33.794 CET [10143] LOG:  database system is shut down
2019-02-10 09:51:33,813 INFO: Lock owner: postgresql2; I am postgresql2
Traceback (most recent call last):
  File "./patroni.py", line 6, in <module>
    main()
  File "/home/akukushkin/git/ants/patroni/patroni/__init__.py", line 182, in main
    return patroni_main()
  File "/home/akukushkin/git/ants/patroni/patroni/__init__.py", line 148, in patroni_main
    patroni.run()
  File "/home/akukushkin/git/ants/patroni/patroni/__init__.py", line 119, in run
    logger.info(self.ha.run_cycle())
  File "/home/akukushkin/git/ants/patroni/patroni/ha.py", line 1370, in run_cycle
    info = self._run_cycle()
  File "/home/akukushkin/git/ants/patroni/patroni/ha.py", line 1343, in _run_cycle
    msg = self.process_healthy_cluster()
  File "/home/akukushkin/git/ants/patroni/patroni/ha.py", line 966, in process_healthy_cluster
    'promoted self to leader because i had the session lock'
  File "/home/akukushkin/git/ants/patroni/patroni/ha.py", line 566, in enforce_master_role
    self.process_sync_replication()
  File "/home/akukushkin/git/ants/patroni/patroni/ha.py", line 426, in process_sync_replication
    sync_wanted=sync_wanted):
  File "/home/akukushkin/git/ants/patroni/patroni/quorum.py", line 84, in __iter__
    transitions = list(self._generate_transitions())
  File "/home/akukushkin/git/ants/patroni/patroni/quorum.py", line 95, in _generate_transitions
    self.check_invariants()
  File "/home/akukushkin/git/ants/patroni/patroni/quorum.py", line 62, in check_invariants
    (len(self.voters | self.sync), self.quorum, self.numsync))
patroni.quorum.QuorumError: Quorum and sync not guaranteed to overlap: nodes 3 >= quorum 1 + sync 2
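For context, the invariant being checked is quorum + numsync > len(voters | sync): by the pigeonhole principle this guarantees that any quorum-sized subset of the voters overlaps any numsync-sized set of synced nodes, so every quorum contains at least one node with the latest acknowledged commit. In the state above there are 3 nodes with quorum 1 and sync 2, and 1 + 2 is not greater than 3, so the check correctly fails; the real problem is that a missing /sync key should be handled gracefully instead of raising. A tiny illustration of the condition (assumed semantics, not the PR's code):

def overlap_guaranteed(quorum, voters, numsync, sync):
    # If quorum + numsync exceeds the number of distinct nodes, two disjoint
    # subsets of those sizes cannot exist, so any quorum must include a sync node.
    return quorum + numsync > len(voters | sync)

nodes = {'postgresql0', 'postgresql1', 'postgresql2'}
print(overlap_guaranteed(2, nodes, 2, nodes))                            # True: healthy state
print(overlap_guaranteed(1, nodes, 2, {'postgresql0', 'postgresql1'}))   # False: the state above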

@CyberDem0n (Collaborator) left a comment:

Looks like a lot of corner-cases are not covered in the QuorumStateResolver :(

# First get to requested replication factor
logger.debug("Adding nodes: %s", to_add)
increase_numsync_by = self.sync_wanted - self.numsync
if increase_numsync_by:

Collaborator:

increase_numsync_by could become negative. For example:

QuorumStateResolver(quorum=2, voters=set('abcdef'),
                    numsync=5, sync=set('abcdef'),
                    active=set('abcdefg'), sync_wanted=4)


safety_margin = self.quorum + self.numsync - len(self.voters | self.sync)
if safety_margin > 1:
    logger.debug("Case 3: replication factor is bigger than needed")

Collaborator:

This branch is not covered by tests, so I tried to trigger it:

QuorumStateResolver(quorum=3, voters=set('abcdef'),
                    numsync=5, sync=set('abcdef'),
                    active=set('abcdef'), sync_wanted=3)

and got an exception QuorumError: Quorum and sync not guaranteed to overlap: nodes 6 >= quorum 3 + sync 3 from sync_update()

@Jan-M (Member) commented Mar 6, 2019

This PR is suddenly more interesting. Could this pull request be made simpler by dropping support for PG 9.6? I expected this to be less complicated given that Postgres now supports quorum mode.

@CyberDem0n CyberDem0n added this to the 2.0 milestone Feb 7, 2020
We can then handle any node that is ahead but unable to promote.
Downside is that we might potentially need to rewind more and/or lose
a couple more unsynchronized transactions.
if SyncState is empty.
:param quorum: number of servers from synchronous set we need to see to know we see the latest
commit.
:param members: set of member names that participate in determining quorum.

Member:

SyncState uses the word "members" while everywhere else in the algorithm the term voters is used. Can you please name it consistently?

from patroni.quorum import QuorumStateResolver, QuorumError

class QuorumTest(unittest.TestCase):
    def test_1111(self):

Member:

It is better to explicitly spell out how test names map to the content of the state variable; it is not obvious at first glance

logger.info("Removing synchronous privilege from %s", current)
if not self.dcs.write_sync_state(self.state_handler.name, None, index=self.cluster.sync.index):
sync_state = self.state_handler.current_sync_state(self.cluster)
numsync = min(sync_state['numsync'], len(sync_state['sync']))

Member:

Can this line be moved into self.state_handler.current_sync_state(self.cluster)?

# Active set matches state
self.assertEqual(list(QuorumStateResolver(*state, active=set("ab"), sync_wanted=3)), [
])
# Add node by increasing quorum

Member:

I find this description a bit too short; it would be better to have something like:

add a sync standby without increasing the replication factor; must increase quorum to avoid losing commits on failover

('quorum', 2, set('abc')),
('sync', 2, set('abc')),
])
# Add node by increasing sync

Member:

same here:

add a sync standby and increase the replication factor in the corner case where before and after adding a sync standby all nodes of a PG cluster must ack the commit

('quorum', 3, set('abcde')),
('sync', 3, set('abcde')),
])
# Master is alone

Member:

Is that removing a sync standby from a 2-node cluster?

logger.debug("Case 1: synchronous_standby_names subset of DCS state")
# Case 1: quorum is superset of sync nodes. In the middle of changing quorum.
# Evict from quorum dead nodes that are not being synced.
remove_from_quorum = self.voters - (self.sync | self.active)

Member:

Can it happen at this point that self.sync still contains some node that has ceased to be active and therefore must be evicted from voters? In other words, why do we subtract the union here and not self.sync & self.active, or simply self.active?

candidates = []
# Pick candidates based on who has flushed WAL farthest.
# TODO: for synchronous_commit = remote_write we actually want to order on write_location

for app_name, state, sync_state in self.query(

Member:

sync_state is no longer used in the proposed change

if candidates:
    return candidates[0], False
return None, False
active.append(member.name)

Member:

Just to double check: a standby with pg_stat_replication.sync_state in ('async', 'potential') is still considered active for the purposes of QuorumStateResolver?

if remove_from_sync:
    yield self.sync_update(
        numsync=min(self.sync_wanted, len(self.sync) - len(remove_from_sync)),
        sync=self.sync - remove_from_sync)

Member:

Why not:

if remove_from_sync:
    remaining_sync = self.sync - remove_from_sync
    yield self.sync_update(
        numsync=min(self.sync_wanted, len(remaining_sync)),
        sync=remaining_sync)

# Case 3: quorum or replication factor is bigger than needed. In the middle of changing replication factor.
if self.numsync > self.sync_wanted:
    # Reduce replication factor
    new_sync = clamp(self.sync_wanted, min=len(self.voters) - self.quorum + 1, max=len(self.sync))

Member:

sync denotes a set everywhere else, so here just for the sake of consistency, new_sync should be new_numsync

- A node that is not the leader or current synchronous standby is not allowed to promote itself automatically.

Patroni will only ever assign one standby to ``synchronous_standby_names`` because with multiple candidates it is not possible to know which node was acting as synchronous during the failure.
When in synchronous mode Patroni maintains synchronization state in the DCS, containing the latest primary, the number of nodes required for quorum and the nodes currently eligible to vote on quorum. In steady state the nodes voting on quorum are the leader and all synchronous standbys. This state is updated with strict ordering constraints with regard to node promotion and ``synchronous_standby_names`` to ensure that at all times any subset of voters that can achieve quorum is guaranteed to contain at least one node with the latest successful commit.

Member:

In steady state the nodes voting on quorum are the leader and all synchronous standbys.

Judging from this line, steady state also means updates to quorum and/or replication factor have completed (as opposed to simply having self.voters == self.sync).

@ksarabu1 (Contributor):

We are looking into implementing Patroni in-house, and this feature is one of the shortfalls holding us back right now. We are very much interested in this enhancement and are wondering whether we can be of any help in getting this PR moving. We are open to collaborating and contributing. Thanks.

@ksarabu1 (Contributor) commented Jun 4, 2020

We utilize a mix of synchronous hot standbys (with synchronous_commit set to remote_apply for read consistency between master and sync standby) and async standby databases where the applications tolerate staleness.

With the quorum-based (ANY) implementation, the databases that receive the change synchronously from the list may be random (unless we set sync_num to match the number of replicas).

My suggestion would be to extend the current single-synchronous-standby implementation to support a precise number of guaranteed sync standbys based on a maximum/minimum replication factor. The synchronous standby members list can be updated by Patroni as members join/leave to maintain the maximum/minimum limit on the number of sync standbys.

This functionality can be further extended to make it generic, using new parameter(s) to provide additional support for priority-based (FIRST n) synchronous replication and quorum-based (ANY n) synchronous replication.

Please let me know your thoughts.

Thanks

@hughcapet hughcapet removed this from the 2.0 milestone Nov 1, 2022
CyberDem0n added a commit that referenced this pull request Jan 24, 2023
When the `synchronous_standby_names` GUC is changed, PostgreSQL almost immediately starts reporting the corresponding walsenders as synchronous, while in fact they may not have reached this state yet. To mitigate this problem we memorize the current flush LSN on the primary right after the change of `synchronous_standby_names` becomes visible and use it as an additional check for walsenders.
A walsender will be counted as truly "sync" only when its write/flush/replay LSN has reached the memorized LSN and its `application_name` is known to be part of `synchronous_standby_names`.

The size of the PR is mostly due to refactoring and moving the code responsible for working with `synchronous_standby_names` and `pg_stat_replication` into a dedicated file.
The `parse_sync_standby_names()` function was mostly copied from #672.
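A rough sketch of the mechanism described in that commit message (illustrative helper, not the actual commit code): remember the primary's flush LSN right after synchronous_standby_names changes, and only count a walsender as truly synchronous once it is both listed in the GUC and has replicated past that point.

def parse_lsn(lsn):
    # pg_lsn values come back as text such as '0/16B3748'
    hi, lo = lsn.split('/')
    return (int(hi, 16) << 32) + int(lo, 16)

def truly_sync_walsenders(cursor, ssn_members, remembered_lsn):
    # ssn_members: application_names currently listed in synchronous_standby_names.
    # remembered_lsn: flush LSN captured on the primary right after the GUC change
    # (e.g. from pg_current_wal_flush_lsn() on PostgreSQL 10+).
    cursor.execute("SELECT application_name, flush_lsn FROM pg_stat_replication")
    threshold = parse_lsn(remembered_lsn)
    return [name for name, flush_lsn in cursor.fetchall()
            if name in ssn_members and flush_lsn is not None
            and parse_lsn(flush_lsn) >= threshold]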
@hughcapet hughcapet linked an issue Dec 3, 2023 that may be closed by this pull request

Successfully merging this pull request may close these issues.

RFC: design for supporting synchronous quorum commit in Patroni
7 participants