
Quorum based failover #2668

Open
CyberDem0n wants to merge 72 commits into master

Conversation

CyberDem0n (Collaborator)

Based on #672

To enable quorum commit:

$ patronictl.py edit-config
--- 
+++ 
@@ -5,3 +5,4 @@
   use_pg_rewind: true
 retry_timeout: 10
 ttl: 30
+synchronous_mode: quorum

Apply these changes? [y/N]: y
Configuration changed

By default Patroni will use ANY 1 (list,of,standbys) in synchronous_standby_names. That is, only one node out of the listed replicas will be used for quorum.
If you want to increase the number of quorum nodes, you can do so with:

$ patronictl edit-config
--- 
+++ 
@@ -6,3 +6,4 @@
 retry_timeout: 10
 synchronous_mode: quorum
 ttl: 30
+synchronous_node_count: 2

Apply these changes? [y/N]: y
Configuration changed
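For illustration only (this is not Patroni's actual code, and the node names are hypothetical), the resulting setting has the shape ANY <synchronous_node_count> (list,of,standbys):

# Minimal sketch of how the quorum-mode value of synchronous_standby_names is shaped.
# Assumption: the requested count is capped by the number of available standbys.
def quorum_ssn(standbys, synchronous_node_count=1):
    names = ','.join(sorted(standbys))
    return f'ANY {min(synchronous_node_count, len(standbys))} ({names})'

print(quorum_ssn(['node2', 'node3']))     # ANY 1 (node2,node3)
print(quorum_ssn(['node2', 'node3'], 2))  # ANY 2 (node2,node3)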

coveralls commented May 9, 2023

Pull Request Test Coverage Report for Build 8534982209

Details

  • 404 of 404 (100.0%) changed or added relevant lines in 10 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.02%) to 99.886%

Totals Coverage Status
Change from base Build 8523051901: 0.02%
Covered Lines: 14059
Relevant Lines: 14075

💛 - Coveralls

Old keys remain the same for backward compatibility.
- Don't return 200 for quorum nodes in GET `/sync` endpoint
- New `GET /quorum` endpoint, returns 200 for quorum nodes
- Show Quorum Standby role for quorum nodes in `patronictl list`
It takes some time for existing standbys to start streaming from the
new primary and we want to do our best to not empty the /sync key before
that.
CyberDem0n added a commit that referenced this pull request Jul 5, 2023
- make a few attempts with timeout when checking registered nodes
- get rid of artificial sleep
- allow the check_registration() function to check secondaries

These changes are useful for Quorum based failover (#2668) and a future PR
that enhances Citus support by registering secondaries in `pg_dist_node`.
hughcapet pushed a commit that referenced this pull request Jul 7, 2023
* Reduce flakiness of citus behave tests

CyberDem0n marked this pull request as ready for review July 18, 2023 14:30
@matthbakeredb (Collaborator) left a comment

partial review...

There's a lot to understand here that I don't feel equipped to review.

My general feel for some of the methods is that they are very large and are steps in a flow that could be better expressed as separate methods or possibly separate classes. (I don't know if Patroni needs some way to express complex workflows in a way that is easier to understand.)

Some of the concepts are new to me; I can't say I understand what this is doing from reading the documentation. Do you have any material where you've talked about this, perhaps at a conference?

docs/dynamic_configuration.rst (outdated, resolved)
docs/replication_modes.rst (outdated, resolved)
docs/replication_modes.rst (outdated, resolved)
docs/replication_modes.rst (outdated, resolved)
docs/replication_modes.rst (outdated, resolved)
patroni/ha.py (outdated, resolved)
patroni/postgresql/sync.py (resolved)
patroni/postgresql/sync.py (outdated, resolved)
patroni/postgresql/sync.py (resolved)
patroni/postgresql/sync.py (outdated, resolved)
@barthisrael (Collaborator) left a comment

I still need to continue going through the PR later, but I'm sharing the comments I have so far.

docs/dynamic_configuration.rst (outdated, resolved)
docs/replication_modes.rst (outdated, resolved)
features/steps/quorum_commit.py (outdated, resolved)
patroni/api.py (outdated, resolved)
patroni/api.py (resolved)
if add_to_sync:
    yield from self.sync_update(self.numsync, CaseInsensitiveSet(self.sync | add_to_sync))
elif self.sync > self.voters:
    logger.debug("Case 2: synchronous_standby_names superset of DCS state")
Collaborator

I guess we could include the variables in the debug message.
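For example, a self-contained sketch of what that could look like (the set values below are made up; in the real method they would come from self.sync and self.voters):

import logging

logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger('patroni.quorum')

# stand-ins for self.sync and self.voters from the snippet above
sync, voters = {'node2', 'node3'}, {'node2'}
logger.debug('Case 2: synchronous_standby_names superset of DCS state (sync=%s, voters=%s)',
             sync, voters)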

safety_margin = self.quorum + min(self.numsync, self.numsync_confirmed) - len(self.voters | self.sync)
if safety_margin > 0:  # In the middle of changing replication factor.
    if self.numsync > self.sync_wanted:
        logger.debug('Case 3: replication factor is bigger than needed')
Collaborator

I guess we could include the variables in the debug message.

        logger.debug('Case 3: replication factor is bigger than needed')
        yield from self.sync_update(max(self.sync_wanted, len(self.voters) - self.quorum), self.sync)
    else:
        logger.debug('Case 4: quorum is bigger than needed')
Collaborator

I guess we could include the variables in the debug message.

sync=CaseInsensitiveSet(self.sync - remove_from_sync))

# After handling these two cases voters and sync must match.
assert self.voters == self.sync
Collaborator

I wonder if throwing a QuorumError instead of AssertionError would look more elegant

Comment on lines +305 to +337
def __remove_gone_nodes(self) -> Iterator[Transition]:
    """Remove inactive nodes from ``synchronous_standby_names`` and from ``/sync`` key.

    :yields: transitions as :class:`Transition` objects.
    """
    to_remove = self.sync - self.active
    if to_remove and self.sync == to_remove:
        logger.debug("Removing nodes: %s", to_remove)
        yield from self.quorum_update(0, CaseInsensitiveSet(), adjust_quorum=False)
        yield from self.sync_update(0, CaseInsensitiveSet())
    elif to_remove:
        logger.debug("Removing nodes: %s", to_remove)
        can_reduce_quorum_by = self.quorum
        # If we can reduce quorum size try to do so first
        if can_reduce_quorum_by:
            # Pick nodes to remove by sorted order to provide deterministic behavior for tests
            remove = CaseInsensitiveSet(sorted(to_remove, reverse=True)[:can_reduce_quorum_by])
            sync = CaseInsensitiveSet(self.sync - remove)
            # when removing nodes from sync we can safely increase numsync if requested
            numsync = min(self.sync_wanted, len(sync)) if self.sync_wanted > self.numsync else self.numsync
            yield from self.sync_update(numsync, sync)
            voters = CaseInsensitiveSet(self.voters - remove)
            to_remove &= self.sync
            yield from self.quorum_update(len(voters) - self.numsync, voters,
                                          adjust_quorum=not to_remove)
        if to_remove:
            assert self.quorum == 0
            numsync = self.numsync - len(to_remove)
            sync = CaseInsensitiveSet(self.sync - to_remove)
            voters = CaseInsensitiveSet(self.voters - to_remove)
            sync_decrease = numsync - min(self.sync_wanted, len(sync))
            quorum = min(sync_decrease, len(voters) - 1) if sync_decrease else 0
            yield from self.quorum_update(quorum, voters, adjust_quorum=False)
Collaborator

I must say I'm having some trouble digesting the maths shown in this module.

I wonder if there is any way we can improve the readability of the logic that has been implemented?

Maybe we could:

  • Expand docstrings
  • Expand comments
  • Create auxiliary variables with a detailed description
  • Describe the workflow

I don't know. I wonder if it's just me who is having a bit of a hard time digesting it all?

Collaborator

(I put a localized comment here, but I guess we could look for improvements in other pieces of this module)

CyberDem0n (Collaborator, Author)

Yes, I know, it is extremely complex. The original PR was opened more than 5 years ago, and it took me a few attempts to get the idea.
All the complexity exists because we can't atomically update both /sync and synchronous_standby_names.

I would recommend starting to understand the workflow from the __add_new_nodes() method:

  • 'a' is a leader
  • /sync key is empty
  • synchronous_standby_names is empty
  • we have 'b' and 'c' nodes streaming

The QuorumStateResolver will produce the following transitions:

  1. 'sync', 'a', 1, {'b', 'c'}
  2. 'restart', because after updating synchronous_standby_names we need to wait until 'b' and 'c' catch up to the LSN at which postgresql.conf was updated.

That is, we get to one of the non-steady cases, where /sync and synchronous_standby_names don't match.
Such a state will remain until at least one of the nodes 'b' or 'c' catches up to the LSN at which postgresql.conf was updated (numsync_confirmed > 0).
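As a sketch only (the real Transition class lives in patroni/quorum.py and its exact fields may differ from the hypothetical ones used here), the sequence described above can be written down as:

from collections import namedtuple

# Hypothetical field names, purely to illustrate the two transitions above.
Transition = namedtuple('Transition', 'transition_type leader num names')

transitions = [
    # 1. update synchronous_standby_names first: ANY 1 (b, c)
    Transition('sync', 'a', 1, frozenset({'b', 'c'})),
    # 2. stop here: the /sync key may only be updated once at least one of 'b'/'c'
    #    has confirmed the LSN at which postgresql.conf was updated (numsync_confirmed > 0)
    Transition('restart', 'a', None, None),
]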

An interrupt might happen not only because of the restart yielded by QuorumStateResolver, but also because of other external factors, for example not being able to update the /sync key.
Let's consider the state:

  • a is a leader
  • /sync key: {leader: a, sync_standby: b,c, quorum: 1}
  • synchronous_standby_names: {ANY 1 (b, c)}
  • only b is streaming

In this case QuorumStateResolver will produce the following transitions:

  1. 'sync', 'a', 1, {'b'}
  2. 'quorum', 'a', 0, {'b'}

That is, first 'c' is removed from synchronous_standby_names and only afterwards from the /sync key.
If the update of the /sync key fails, we also end up in a non-steady state.

A non-steady state might also happen when the replication factor (sync_wanted) is increased/reduced by the administrator. That change must also be reflected by performing updates of /sync and synchronous_standby_names in a very specific order.

Last, but not least, there could also be a combination of different changes, for example new nodes started streaming while the admin decided to increase/decrease the replication factor.

If you look at the unit tests, you'll see that QuorumStateResolver can produce up to 4 transitions until the steady state is reached: https://github.com/zalando/patroni/blob/feature/quorum-commit/tests/test_quorum.py#L410-L422
Also, QuorumStateResolver must be able to "resume" the chain from an arbitrary state if it was interrupted.
Most of this magic happens in the __handle_non_steady_cases() method, and this is why it is so complex.

The unit tests actually restart QuorumStateResolver from all intermediate states and check that the yielded transitions match the expectation: https://github.com/zalando/patroni/blob/feature/quorum-commit/tests/test_quorum.py#L10-L28

@barthisrael (Collaborator) left a comment

Some comments on the easier files 😆 Still going through others.

patroni/config.py (outdated, resolved)
patroni/dcs/__init__.py (outdated, resolved)
patroni/dcs/__init__.py (outdated, resolved)
patroni/dcs/__init__.py (outdated, resolved)
patroni/postgresql/sync.py (outdated, resolved)
patroni/postgresql/sync.py (resolved)
CyberDem0n and others added 2 commits October 24, 2023 11:11
Co-authored-by: Israel <israel.barth@enterprisedb.com>
@barthisrael (Collaborator) left a comment

More comments

patroni/quorum.py (outdated, resolved)
patroni/ha.py (outdated, resolved)
patroni/ha.py (outdated, resolved)
patroni/ha.py (outdated, resolved)
patroni/ha.py (outdated, resolved)
patroni/ha.py (outdated, resolved)
{'pid': 100, 'application_name': self.leadermem.name, 'sync_state': 'quorum', 'flush_lsn': 1},
{'pid': 101, 'application_name': self.other.name, 'sync_state': 'quorum', 'flush_lsn': 2}]

# sync node is a bit behind of async, but we prefer it anyway
Collaborator

I'm not sure if I understood this comment 🤔


        kwargs['expected'] = expected[1:]
        self.check_state_transitions(**kwargs)

    def test_1111(self):
Collaborator

What is the convention on these test names that have trailing numbers?

CyberDem0n (Collaborator, Author)

Very good question :)
These names are from the original PR: https://github.com/zalando/patroni/pull/672/files#diff-77b9ff9cf878f34d15fd5bada8be62ca3475ab49e9d6f15db4261728837be972R6

I think I'll rename them one day.

Collaborator

There are 3 assert calls in this module. Maybe we should replace that with some kind of exception?
I mean, assert seems more like something for tests than something we should be running in production.

CyberDem0n (Collaborator, Author)

Yeah... These asserts are purely for tests.
Maybe we can also raise QuorumError and "handle" it in the __iter__() method when calling transitions = list(self._generate_transitions()).
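A minimal sketch of that idea (the class below is a simplified, hypothetical stand-in; only the error-handling pattern is the point):

import logging
from typing import FrozenSet, Iterator, List, Tuple

logger = logging.getLogger('patroni.quorum')

Transition = Tuple[str, str, int, FrozenSet[str]]


class QuorumError(Exception):
    """Raised instead of a bare assert when an inconsistent state is detected."""


class ResolverSketch:
    """Simplified stand-in for QuorumStateResolver that only shows the error handling."""

    def __init__(self, voters: FrozenSet[str], sync: FrozenSet[str]) -> None:
        self.voters = voters
        self.sync = sync

    def _generate_transitions(self) -> Iterator[Transition]:
        # instead of `assert self.voters == self.sync` raise a typed error ...
        if self.voters != self.sync:
            raise QuorumError(f'voters {set(self.voters)} and sync {set(self.sync)} diverged')
        yield ('quorum', 'a', 0, self.sync)

    def __iter__(self) -> Iterator[Transition]:
        # ... and "handle" it where the transitions are materialized, as suggested above
        try:
            transitions: List[Transition] = list(self._generate_transitions())
        except QuorumError as exc:
            logger.warning('%s', exc)
            transitions = []
        return iter(transitions)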

Collaborator

Ah cool, yes, I think so.
This is more of a cosmetic change because we would also get exceptions on assertion errors.
I like the suggestion.

if increase_numsync_by > 0:
    if self.sync:
        add = CaseInsensitiveSet(sorted(to_add)[:increase_numsync_by])
        increase_numsync_by = len(add)
Collaborator

We need this because to_add may have fewer members than increase_numsync_by initially wants, right?

CyberDem0n (Collaborator, Author)

It's a corner case / optimization.
Adding nodes and changing the replication factor are two different operations that are supposed to be done one after another, which is not always optimal.

Example:

  • /sync: {leader: a, sync_standby: b, quorum: 0}
  • synchronous_standby_names: ANY 1 {b}

We want to add node c and increase the replication factor from 1 to 2.
A non-optimal way of doing that would be:

  1. /sync: {leader: a, sync_standby: b,c, quorum: 1} (add node to /sync key with quorum increase)
  2. synchronous_standby_names: ANY 1 {b,c} (add node to ssn)
  3. synchronous_standby_names: ANY 2 {b,c} (increase replication factor to requested value)
  4. /sync: {leader: a, sync_standby: b,c, quorum: 0} (reduce quorum)

Steps 2 and 3 will be merged by the __iter__ method.

With the optimization (because we know that the replication factor should be increased anyway) it produces the following:

  1. synchronous_standby_names: ANY 2 {b,c} (add node to ssn and increase the replication factor)
  2. /sync: {leader: a, sync_standby: b,c, quorum: 0} (add node to /sync key).

Besides that, there could be other corner cases, for example when we have only 2 replicas (one is already in ssn and we are adding the second one), but the requested replication factor is 3 or higher.
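To make the clamping concrete (using plain sets instead of CaseInsensitiveSet, and assuming increase_numsync_by started out as the difference between the wanted and the current replication factor):

sync_wanted, numsync = 3, 1      # admin asked for a replication factor of 3
to_add = {'c'}                   # but only one new node started streaming

increase_numsync_by = sync_wanted - numsync        # wants to grow numsync by 2
add = set(sorted(to_add)[:increase_numsync_by])    # only 'c' is available
increase_numsync_by = len(add)                     # clamp to what can actually be added

print(add, increase_numsync_by)                    # {'c'} 1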

@barthisrael (Collaborator)

I think we could give it a try as an experimental feature and see what users say about this. It is not something that would come enabled by default in any case -- only on Citus clusters from what I could see; maybe we should put some kind of warning here?

The feature is very complex, but the unit tests provide good coverage.
The behave tests make sense to me, and from manual attempts at using the feature it seems to work fine.

I tried understanding in detail and memorizing the quorum state machine code paths, but I'm having a really hard time doing that. However, as I mentioned above, the unit tests are a good place to look. While reviewing the PR I tried "guessing" the outputs that were written in the unit tests of QuorumStateResolver, and I was successful more or less half of the time. I guess that's a good indication at least.

From a developer perspective what I can say is: it's complex to understand everything that is being included, but the tests make me more confident that it's working mostly fine. I'd like to get users' fingers on it, as users are brilliant at breaking things 😆

Anyway, I would like to hear other people's thoughts.

@martinmarques (Contributor)

@CyberDem0n what's the latest state on this PR? Need more testing/reviewing?
I know that it's very large and complex, but at the same time it adds a lot of useful functionality that some end users would like to have in Patroni.

@CyberDem0n (Collaborator, Author)

@martinmarques my current plan is to merge it for v4.0.0. When that release will happen, I don't know yet.

Projects
Status: In review

6 participants