Quorum based failover #2668
base: master
Conversation
Old keys remain the same for backward compatibility.
- Don't return 200 for quorum nodes in the `GET /sync` endpoint
- New `GET /quorum` endpoint, returns 200 for quorum nodes
- Show the Quorum Standby role for quorum nodes in `patronictl list`
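For illustration, a minimal health-check sketch against these endpoints (assuming the default REST API port 8008 and hypothetical host names; a load balancer would normally issue such probes directly):

```python
# Rough sketch only: probe the health-check endpoints described above.
# Assumes Patroni's REST API listens on the default port 8008.
import requests

def is_quorum_standby(host: str, port: int = 8008) -> bool:
    """True if the node reports itself as a quorum standby (HTTP 200 on /quorum)."""
    try:
        return requests.get(f"http://{host}:{port}/quorum", timeout=2).status_code == 200
    except requests.RequestException:
        return False

def is_sync_standby(host: str, port: int = 8008) -> bool:
    """True only for synchronous standbys; quorum nodes no longer get 200 on /sync."""
    try:
        return requests.get(f"http://{host}:{port}/sync", timeout=2).status_code == 200
    except requests.RequestException:
        return False
```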
Force-pushed from 3b12486 to dbfe844
It takes some time for existing standbys to start streaming from the new primary and we want to do our best to not empty the /sync key before that.
Force-pushed from c1d9520 to a5e1c53
Reduce flakiness of citus behave tests:
- make a few attempts with a timeout when checking registered nodes
- get rid of the artificial sleep
- allow the check_registration() function to check secondaries

These changes are useful for Quorum based failover (#2668) and a future PR that enhances Citus support by registering secondaries in `pg_dist_node`.
Force-pushed from c775805 to e6d251b
partial review...
There's a lot to understand here that I don't feel equipped to review.
My general feel for some of the methods is that they are very large and are steps in a flow that could be better expressed as separate methods or possibly separate classes. (I don't know if Patroni needs some way to express complex workflows in a way that is easier to understand.)
Some of the concepts are new to me, and I can't say I understand what this is doing just from reading the documentation. Do you have any material where you've talked about this, perhaps at a conference?
I still need to continue going through the PR, but I'm sharing some comments that I had so far.
patroni/quorum.py (outdated)
```python
if add_to_sync:
    yield from self.sync_update(self.numsync, CaseInsensitiveSet(self.sync | add_to_sync))
elif self.sync > self.voters:
    logger.debug("Case 2: synchronous_standby_names superset of DCS state")
```
I guess we could include the variables in the debug message.
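For instance, something like this (just a sketch of the suggestion for the snippet above, not the final wording):

```python
# Sketch of the suggestion: include the relevant sets in the debug output.
logger.debug('Case 2: synchronous_standby_names %s is a superset of DCS state %s',
             self.sync, self.voters)
```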
patroni/quorum.py (outdated)
```python
safety_margin = self.quorum + min(self.numsync, self.numsync_confirmed) - len(self.voters | self.sync)
if safety_margin > 0:  # In the middle of changing replication factor.
    if self.numsync > self.sync_wanted:
        logger.debug('Case 3: replication factor is bigger than needed')
```
I guess we could include the variables in the debug message.
patroni/quorum.py (outdated)
```python
    logger.debug('Case 3: replication factor is bigger than needed')
    yield from self.sync_update(max(self.sync_wanted, len(self.voters) - self.quorum), self.sync)
else:
    logger.debug('Case 4: quorum is bigger than needed')
```
I guess we could include the variables in the debug message.
```python
                 sync=CaseInsensitiveSet(self.sync - remove_from_sync))

        # After handling these two cases voters and sync must match.
        assert self.voters == self.sync
```
I wonder if throwing a `QuorumError` instead of an `AssertionError` would look more elegant.
```python
def __remove_gone_nodes(self) -> Iterator[Transition]:
    """Remove inactive nodes from ``synchronous_standby_names`` and from ``/sync`` key.

    :yields: transitions as :class:`Transition` objects.
    """
    to_remove = self.sync - self.active
    if to_remove and self.sync == to_remove:
        logger.debug("Removing nodes: %s", to_remove)
        yield from self.quorum_update(0, CaseInsensitiveSet(), adjust_quorum=False)
        yield from self.sync_update(0, CaseInsensitiveSet())
    elif to_remove:
        logger.debug("Removing nodes: %s", to_remove)
        can_reduce_quorum_by = self.quorum
        # If we can reduce quorum size try to do so first
        if can_reduce_quorum_by:
            # Pick nodes to remove by sorted order to provide deterministic behavior for tests
            remove = CaseInsensitiveSet(sorted(to_remove, reverse=True)[:can_reduce_quorum_by])
            sync = CaseInsensitiveSet(self.sync - remove)
            # when removing nodes from sync we can safely increase numsync if requested
            numsync = min(self.sync_wanted, len(sync)) if self.sync_wanted > self.numsync else self.numsync
            yield from self.sync_update(numsync, sync)
            voters = CaseInsensitiveSet(self.voters - remove)
            to_remove &= self.sync
            yield from self.quorum_update(len(voters) - self.numsync, voters,
                                          adjust_quorum=not to_remove)
        if to_remove:
            assert self.quorum == 0
            numsync = self.numsync - len(to_remove)
            sync = CaseInsensitiveSet(self.sync - to_remove)
            voters = CaseInsensitiveSet(self.voters - to_remove)
            sync_decrease = numsync - min(self.sync_wanted, len(sync))
            quorum = min(sync_decrease, len(voters) - 1) if sync_decrease else 0
            yield from self.quorum_update(quorum, voters, adjust_quorum=False)
```
I must say I'm having some trouble digesting the maths shown in this module.
I wonder if there is any way in which we can improve readability/consumption of the logic that has been implemented?
Maybe we could:
- Expand docstrings
- Expand comments
- Create auxiliary variables with a detailed description
- Describe the workflow

I don't know. I wonder if it's just me who is having a bit of a hard time digesting it all?
(I put a localized comment here, but I guess we could look for improvements in other pieces of this module)
Yes, I know, it is extremely complex. The original PR was opened more than 5 years ago, and it took me a few attempts to get the idea.
All the complexity exists because we can't atomically update both `/sync` and `synchronous_standby_names`.
I would recommend starting to understand the workflow from the `__add_new_nodes()` method:
- 'a' is a leader
- the `/sync` key is empty
- `synchronous_standby_names` is empty
- we have 'b' and 'c' nodes streaming

The `QuorumStateResolver` will produce the following transitions:
- 'sync', 'a', 1, {'b', 'c'}
- 'restart', because after updating `synchronous_standby_names` we need to wait until 'b' and 'c' catch up with the LSN at which `postgresql.conf` was updated.
That is, we get to one of the non-steady cases, where `/sync` and `synchronous_standby_names` don't match.
Such a state will remain until at least one of the nodes 'b' or 'c' catches up with the LSN at which `postgresql.conf` was updated (`numsync_confirmed > 0`).
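Written out as data, the sequence above looks roughly like this (illustrative only; the tuple layout is my assumption based on this description, not the exact `Transition` class from the PR):

```python
from collections import namedtuple

# Illustrative only: field names/order are assumed from the description above,
# not taken from the actual Transition class in patroni/quorum.py.
Transition = namedtuple('Transition', 'transition leader num names')

expected = [
    # 1. update synchronous_standby_names to ANY 1 (b, c)
    Transition('sync', 'a', 1, {'b', 'c'}),
    # 2. stop here until 'b' or 'c' replays past the LSN of the postgresql.conf change
    Transition('restart', 'a', 1, {'b', 'c'}),
]
```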
An interrupt might happen not only because of the `restart` yielded by `QuorumStateResolver`, but also because of other external factors, for example not being able to update the `/sync` key.
Let's consider the state:
- `a` is a leader
- `/sync` key: `{leader: a, sync_standby: b,c, quorum: 1}`
- `synchronous_standby_names`: `ANY 1 (b, c)`
- only `b` is streaming

In this case `QuorumStateResolver` will produce the following transitions:
- 'sync', 'a', 1, {'b'}
- 'quorum', 'a', 0, {'b'}
That is, first 'c' is removed from `synchronous_standby_names`, and only after that from the `/sync` key.
If the update of the `/sync` key fails, we also get to a non-steady state.
A non-steady state might also happen when the replication factor (`sync_wanted`) is increased or reduced by the administrator. The change also must be reflected by performing updates of `/sync` and `synchronous_standby_names` in a very specific order.
Last, but not least, there could also be a combination of different changes, for example new nodes started streaming while the admin decided to increase/decrease the replication factor.
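My own reading of why the order matters, mirroring the `safety_margin` computation quoted earlier (a sketch, not PR code):

```python
# Sketch (not PR code): a /sync + synchronous_standby_names pair is safe when
# quorum + numsync >= len(voters | sync). By pigeonhole, any quorum + 1 voters a
# failover candidate polls must then overlap the numsync standbys whose
# confirmation the leader waited for.
def is_safe(quorum: int, voters: set, numsync: int, sync: set) -> bool:
    return quorum + numsync >= len(voters | sync)

# Steady state from the example above: /sync quorum=1, ANY 1 (b, c).
assert is_safe(1, {'b', 'c'}, 1, {'b', 'c'})

# Removing 'c' from the /sync key first (quorum=0, voters={b}) while
# synchronous_standby_names still says ANY 1 (b, c) would be unsafe -- hence
# synchronous_standby_names has to be updated before the /sync key in that case.
assert not is_safe(0, {'b'}, 1, {'b', 'c'})
```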
If you look at the unit tests, you'll see that `QuorumStateResolver` can produce up to 4 transitions until the steady state is reached: https://github.com/zalando/patroni/blob/feature/quorum-commit/tests/test_quorum.py#L410-L422
And `QuorumStateResolver` must be able to "resume" the chain from an arbitrary state when it was interrupted.
Most of this magic happens in the `__handle_non_steady_cases()` method, and this is why it is so complex.
The unit tests actually restart `QuorumStateResolver` from all intermediate states and check that the yielded transitions match the expectations: https://github.com/zalando/patroni/blob/feature/quorum-commit/tests/test_quorum.py#L10-L28
…to feature/quorum-commit
Some comments on the easier files 😆 Still going through others.
Co-authored-by: Israel <israel.barth@enterprisedb.com>
More comments
```python
            {'pid': 100, 'application_name': self.leadermem.name, 'sync_state': 'quorum', 'flush_lsn': 1},
            {'pid': 101, 'application_name': self.other.name, 'sync_state': 'quorum', 'flush_lsn': 2}]

        # sync node is a bit behind of async, but we prefer it anyway
```
I'm not sure if I understood this comment 🤔
I think it was a copy&paste from https://github.com/zalando/patroni/pull/2668/files#diff-7cdc82c5ea3109d2377741f06aecdcce8d49257482335f49ee25be99853621b3R40
```python
        kwargs['expected'] = expected[1:]
        self.check_state_transitions(**kwargs)

    def test_1111(self):
```
What is the convention on these test names that have trailing numbers?
Very good question :)
These names are from the original PR: https://github.com/zalando/patroni/pull/672/files#diff-77b9ff9cf878f34d15fd5bada8be62ca3475ab49e9d6f15db4261728837be972R6
I think I'll rename them one day.
There are 3 `assert` calls in this module. Maybe we should replace them with some kind of exception?
I mean, `assert` seems more like something we use in tests than something we run in production.
Yeah... These asserts are purely for tests.
Maybe we can also raise `QuorumError` and "handle" it in the `__iter__()` method when calling `transitions = list(self._generate_transitions())`.
Ah cool, yes, I think so.
This is more of a cosmetic change because we would also get exceptions on assertion errors.
I like the suggestion.
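Perhaps something along these lines (a sketch of the idea only; `QuorumError` and `_generate_transitions()` are taken from the discussion above, and the surrounding details are assumptions):

```python
# Sketch of the idea discussed above, not the actual implementation: raise
# QuorumError instead of using bare asserts, and handle it where the
# transitions get materialized so a logic error does not crash the daemon.
def __iter__(self) -> Iterator[Transition]:
    try:
        transitions = list(self._generate_transitions())
    except QuorumError as exc:
        logger.warning('%s', exc)
        transitions = []
    # ...merge/yield the collected transitions as before...
    return iter(transitions)
```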
```python
if increase_numsync_by > 0:
    if self.sync:
        add = CaseInsensitiveSet(sorted(to_add)[:increase_numsync_by])
        increase_numsync_by = len(add)
```
We need this because `to_add` may have fewer members than `increase_numsync_by` initially wants, right?
It's a corner case / optimization.
Adding nodes and changing the replication factor are two different operations and are supposed to be done one after another, which is not always optimal.
Example:
- `/sync`: `{leader: a, sync_standby: b, quorum: 0}`
- `synchronous_standby_names`: `ANY 1 (b)`

We want to add node `c` and increase the replication factor from 1 to 2.

A non-optimal way of doing that would be:
1. `/sync`: `{leader: a, sync_standby: b,c, quorum: 1}` (add node to the /sync key with quorum increase)
2. `synchronous_standby_names`: `ANY 1 (b,c)` (add node to ssn)
3. `synchronous_standby_names`: `ANY 2 (b,c)` (increase replication factor to the requested value)
4. `/sync`: `{leader: a, sync_standby: b,c, quorum: 0}` (reduce quorum)
Steps 2 and 3 will be merged by the `__iter__` method.
With the optimization (because we know that the replication factor should be increased anyway) it produces the following:
1. `synchronous_standby_names`: `ANY 2 (b,c)` (add node to ssn and increase the replication factor)
2. `/sync`: `{leader: a, sync_standby: b,c, quorum: 0}` (add node to the /sync key).
Besides that, there could be other corner cases, for example when we have only 2 replicas (one is already in ssn and we are adding the second one), but the requested replication factor is 3 or higher.
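A toy illustration of the "steps 2 and 3 will be merged by `__iter__`" remark above (my own simplification, not the actual merging logic):

```python
# Toy illustration (not the real __iter__() logic): consecutive transitions of
# the same kind collapse into the last one, so the intermediate "ANY 1 (b,c)"
# value never actually reaches synchronous_standby_names.
from typing import List, Tuple

Step = Tuple[str, int, str]  # (kind, num, members) -- simplified representation

def merge_consecutive(steps: List[Step]) -> List[Step]:
    merged: List[Step] = []
    for step in steps:
        if merged and merged[-1][0] == step[0]:
            merged[-1] = step  # same kind as the previous step: keep only the latest
        else:
            merged.append(step)
    return merged

steps = [
    ('quorum', 1, 'b,c'),  # 1. add c to the /sync key, increase quorum
    ('sync',   1, 'b,c'),  # 2. add c to synchronous_standby_names
    ('sync',   2, 'b,c'),  # 3. raise the replication factor
    ('quorum', 0, 'b,c'),  # 4. reduce quorum
]
print(merge_consecutive(steps))  # steps 2 and 3 collapse into one 'sync' update
```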
I think we could give it a try as an experimental feature and see what users say about it. It is not something that would come enabled by default in any case -- only on Citus clusters from what I could see; maybe we should put some kind of warning here?

The feature is very complex, but the unit tests provide good coverage. I tried understanding in detail and memorizing the quorum state machine code paths, but I'm having a really hard time doing that. However, as I mentioned above, the unit tests are a good thing to take a look at. While reviewing the PR I tried "guessing" the outputs that were written to the unit tests.

From a developer perspective what I can say is: it's complex to understand everything that is being included, but the tests make me more confident that it's working mostly fine. I'd like to get users' fingers on that, as users are brilliant at breaking things 😆

Anyway, I would like to hear other people's thoughts.
Co-authored-by: Israel <israel.barth@enterprisedb.com>
…to feature/quorum-commit
@CyberDem0n what's the latest state on this PR? Need more testing/reviewing?
@martinmarques my current plan is to merge it for v4.0.0. I don't know yet when the release will happen.
Based on #672
To enable quorum commit:

By default Patroni will use `ANY 1 (list, of, standbys)` in `synchronous_standby_names`. That is, only one node out of the listed replicas will be used for quorum. If you want to increase the number of quorum nodes, it is possible to do it with:
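For reference, the merged feature is documented as being controlled by `synchronous_mode: quorum` together with `synchronous_node_count` (an assumption based on the current Patroni documentation, not on this truncated description). As a rough illustration of the `ANY k (list, of, standbys)` semantics, my own sketch rather than PR code:

```python
# Rough illustration (not PR code) of the ANY k (list, of, standbys) semantics:
# a commit is acknowledged once at least k of the listed standbys confirmed it.
from typing import Set

def commit_acknowledged(k: int, listed: Set[str], confirmed: Set[str]) -> bool:
    return len(listed & confirmed) >= k

listed = {'node2', 'node3', 'node4'}
print(commit_acknowledged(1, listed, {'node3'}))           # default ANY 1 -> True
print(commit_acknowledged(2, listed, {'node3'}))           # higher replication factor -> False
print(commit_acknowledged(2, listed, {'node2', 'node4'}))  # True
```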