Quorum based failover #2668
base: master
Conversation
Old keys remain the same for backward compatibility.
- Don't return 200 for quorum nodes in the `GET /sync` endpoint
- New `GET /quorum` endpoint, returns 200 for quorum nodes
- Show the Quorum Standby role for quorum nodes in `patronictl list`
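For illustration, a minimal health-check sketch against these endpoints (assuming the default REST API port 8008 and hypothetical host names; a load balancer would normally issue such probes directly):

```python
# Rough sketch only: probe the health-check endpoints described above.
# Assumes Patroni's REST API listens on the default port 8008.
import requests

def is_quorum_standby(host: str, port: int = 8008) -> bool:
    """True if the node reports itself as a quorum standby (HTTP 200 on /quorum)."""
    try:
        return requests.get(f"http://{host}:{port}/quorum", timeout=2).status_code == 200
    except requests.RequestException:
        return False

def is_sync_standby(host: str, port: int = 8008) -> bool:
    """True only for synchronous standbys; quorum nodes no longer get 200 on /sync."""
    try:
        return requests.get(f"http://{host}:{port}/sync", timeout=2).status_code == 200
    except requests.RequestException:
        return False
```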
Force-pushed from 3b12486 to dbfe844
It takes some time for existing standbys to start streaming from the new primary and we want to do our best to not empty the /sync key before that.
Force-pushed from c1d9520 to a5e1c53
Reduce flakiness of citus behave tests:
- make a few attempts with a timeout when checking registered nodes
- get rid of the artificial sleep
- allow the check_registration() function to check secondaries

These changes are useful for Quorum based failover (#2668) and a future PR that enhances Citus support by registering secondaries in `pg_dist_node`.
Force-pushed from c775805 to e6d251b
partial review...
There's a lot to understand here that I don't feel equipped to review.
My general feel for some of the methods is that they are very large and are steps in a flow that could be better expressed as separate methods or possibly separate classes. (I don't know if Patroni needs some way to express complex workflows in a way that is easier to understand.)
Some of the concepts are new to me, and I can't say I understand what this is doing just from reading the documentation. Do you have any material where you've talked about this, perhaps at a conference?
I still need to continue going through the PR, but I'm sharing some comments that I had so far.
patroni/quorum.py (outdated)
```python
if add_to_sync:
    yield from self.sync_update(self.numsync, CaseInsensitiveSet(self.sync | add_to_sync))
elif self.sync > self.voters:
    logger.debug("Case 2: synchronous_standby_names superset of DCS state")
```
I guess we could include the variables in the debug message.
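For instance, something like this (just a sketch of the suggestion for the snippet above, not the final wording):

```python
# Sketch of the suggestion: include the relevant sets in the debug output.
logger.debug('Case 2: synchronous_standby_names %s is a superset of DCS state %s',
             self.sync, self.voters)
```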
patroni/quorum.py (outdated)
```python
safety_margin = self.quorum + min(self.numsync, self.numsync_confirmed) - len(self.voters | self.sync)
if safety_margin > 0:  # In the middle of changing replication factor.
    if self.numsync > self.sync_wanted:
        logger.debug('Case 3: replication factor is bigger than needed')
```
I guess we could include the variables in the debug message.
patroni/quorum.py (outdated)
```python
    logger.debug('Case 3: replication factor is bigger than needed')
    yield from self.sync_update(max(self.sync_wanted, len(self.voters) - self.quorum), self.sync)
else:
    logger.debug('Case 4: quorum is bigger than needed')
```
I guess we could include the variables in the debug message.
```python
                 sync=CaseInsensitiveSet(self.sync - remove_from_sync))

        # After handling these two cases voters and sync must match.
        assert self.voters == self.sync
```
I wonder if throwing a `QuorumError` instead of an `AssertionError` would look more elegant.
```python
def __remove_gone_nodes(self) -> Iterator[Transition]:
    """Remove inactive nodes from ``synchronous_standby_names`` and from ``/sync`` key.

    :yields: transitions as :class:`Transition` objects.
    """
    to_remove = self.sync - self.active
    if to_remove and self.sync == to_remove:
        logger.debug("Removing nodes: %s", to_remove)
        yield from self.quorum_update(0, CaseInsensitiveSet(), adjust_quorum=False)
        yield from self.sync_update(0, CaseInsensitiveSet())
    elif to_remove:
        logger.debug("Removing nodes: %s", to_remove)
        can_reduce_quorum_by = self.quorum
        # If we can reduce quorum size try to do so first
        if can_reduce_quorum_by:
            # Pick nodes to remove by sorted order to provide deterministic behavior for tests
            remove = CaseInsensitiveSet(sorted(to_remove, reverse=True)[:can_reduce_quorum_by])
            sync = CaseInsensitiveSet(self.sync - remove)
            # when removing nodes from sync we can safely increase numsync if requested
            numsync = min(self.sync_wanted, len(sync)) if self.sync_wanted > self.numsync else self.numsync
            yield from self.sync_update(numsync, sync)
            voters = CaseInsensitiveSet(self.voters - remove)
            to_remove &= self.sync
            yield from self.quorum_update(len(voters) - self.numsync, voters,
                                          adjust_quorum=not to_remove)
        if to_remove:
            assert self.quorum == 0
            numsync = self.numsync - len(to_remove)
            sync = CaseInsensitiveSet(self.sync - to_remove)
            voters = CaseInsensitiveSet(self.voters - to_remove)
            sync_decrease = numsync - min(self.sync_wanted, len(sync))
            quorum = min(sync_decrease, len(voters) - 1) if sync_decrease else 0
            yield from self.quorum_update(quorum, voters, adjust_quorum=False)
```
I must say I'm having some trouble digesting the maths shown in this module.
I wonder if there is any way in which we can improve readability/consumption of the logic that has been implemented?
Maybe we could:
- Expand docstrings
- Expand comments
- Create auxiliary variables with a detailed description
- Describe the workflow

I don't know. I wonder if it's just me who is having a bit of a hard time digesting it all?
(I put a localized comment here, but I guess we could look for improvements in other pieces of this module)
Yes, I know, it is extremely complex. The original PR was opened more than 5 years ago, and it took me a few attempts to get the idea.
All the complexity exists because we can't atomically update both `/sync` and `synchronous_standby_names`.
I would recommend starting to understand the workflow from the `__add_new_nodes()` method:
- 'a' is a leader
- the `/sync` key is empty
- `synchronous_standby_names` is empty
- we have 'b' and 'c' nodes streaming

The `QuorumStateResolver` will produce the following transitions:
- 'sync', 'a', 1, {'b', 'c'}
- 'restart', because after updating `synchronous_standby_names` we need to wait until 'b' and 'c' catch up with the LSN at which `postgresql.conf` was updated.
That is, we get to one of the non-steady cases, where `/sync` and `synchronous_standby_names` don't match.
Such a state will remain until at least one of the nodes 'b' or 'c' catches up with the LSN at which `postgresql.conf` was updated (`numsync_confirmed > 0`).
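Written out as data, the sequence above looks roughly like this (illustrative only; the tuple layout is my assumption based on this description, not the exact `Transition` class from the PR):

```python
from collections import namedtuple

# Illustrative only: field names/order are assumed from the description above,
# not taken from the actual Transition class in patroni/quorum.py.
Transition = namedtuple('Transition', 'transition leader num names')

expected = [
    # 1. update synchronous_standby_names to ANY 1 (b, c)
    Transition('sync', 'a', 1, {'b', 'c'}),
    # 2. stop here until 'b' or 'c' replays past the LSN of the postgresql.conf change
    Transition('restart', 'a', 1, {'b', 'c'}),
]
```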
An interrupt might happen not only because of the `restart` yielded by `QuorumStateResolver`, but also because of other external factors, for example not being able to update the `/sync` key.
Let's consider the state:
- `a` is a leader
- `/sync` key: `{leader: a, sync_standby: b,c, quorum: 1}`
- `synchronous_standby_names`: `ANY 1 (b, c)`
- only `b` is streaming

In this case `QuorumStateResolver` will produce the following transitions:
- 'sync', 'a', 1, {'b'}
- 'quorum', 'a', 0, {'b'}
That is, first 'c' is removed from `synchronous_standby_names`, and only after that from the `/sync` key.
If the update of the `/sync` key fails, we also get to a non-steady state.
A non-steady state might also happen when the replication factor (`sync_wanted`) is increased or reduced by the administrator. The change also must be reflected by performing updates of `/sync` and `synchronous_standby_names` in a very specific order.
Last, but not least, there could also be a combination of different changes, for example new nodes started streaming while the admin decided to increase/decrease the replication factor.
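My own reading of why the order matters, mirroring the `safety_margin` computation quoted earlier (a sketch, not PR code):

```python
# Sketch (not PR code): a /sync + synchronous_standby_names pair is safe when
# quorum + numsync >= len(voters | sync). By pigeonhole, any quorum + 1 voters a
# failover candidate polls must then overlap the numsync standbys whose
# confirmation the leader waited for.
def is_safe(quorum: int, voters: set, numsync: int, sync: set) -> bool:
    return quorum + numsync >= len(voters | sync)

# Steady state from the example above: /sync quorum=1, ANY 1 (b, c).
assert is_safe(1, {'b', 'c'}, 1, {'b', 'c'})

# Removing 'c' from the /sync key first (quorum=0, voters={b}) while
# synchronous_standby_names still says ANY 1 (b, c) would be unsafe -- hence
# synchronous_standby_names has to be updated before the /sync key in that case.
assert not is_safe(0, {'b'}, 1, {'b', 'c'})
```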
If you look at the unit tests, you'll see that `QuorumStateResolver` can produce up to 4 transitions until the steady state is reached: https://github.com/zalando/patroni/blob/feature/quorum-commit/tests/test_quorum.py#L410-L422
And `QuorumStateResolver` must be able to "resume" the chain from an arbitrary state when it was interrupted.
Most of this magic happens in the `__handle_non_steady_cases()` method, and this is why it is so complex.
The unit tests actually restart `QuorumStateResolver` from all intermediate states and check that the yielded transitions match the expectations: https://github.com/zalando/patroni/blob/feature/quorum-commit/tests/test_quorum.py#L10-L28
…to feature/quorum-commit
Some comments on the easier files 😆 Still going through others.
Co-authored-by: Israel <israel.barth@enterprisedb.com>
More comments
```python
            {'pid': 100, 'application_name': self.leadermem.name, 'sync_state': 'quorum', 'flush_lsn': 1},
            {'pid': 101, 'application_name': self.other.name, 'sync_state': 'quorum', 'flush_lsn': 2}]

        # sync node is a bit behind of async, but we prefer it anyway
```
I'm not sure if I understood this comment 🤔
I think it was a copy&paste from https://github.com/zalando/patroni/pull/2668/files#diff-7cdc82c5ea3109d2377741f06aecdcce8d49257482335f49ee25be99853621b3R40
```python
        kwargs['expected'] = expected[1:]
        self.check_state_transitions(**kwargs)

    def test_1111(self):
```
What is the convention on these test names that have trailing numbers?
Very good question :)
These names are from the original PR: https://github.com/zalando/patroni/pull/672/files#diff-77b9ff9cf878f34d15fd5bada8be62ca3475ab49e9d6f15db4261728837be972R6
I think I'll rename them one day.
There are 3 `assert` calls in this module. Maybe we should replace them with some kind of exception?
I mean, `assert` seems more like something we use in tests than something we run in production.
Yeah... These asserts are purely for tests.
Maybe we can also raise `QuorumError` and "handle" it in the `__iter__()` method when calling `transitions = list(self._generate_transitions())`.
Ah cool, yes, I think so.
This is more of a cosmetic change because we would also get exceptions on assertion errors.
I like the suggestion.
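Perhaps something along these lines (a sketch of the idea only; `QuorumError` and `_generate_transitions()` are taken from the discussion above, and the surrounding details are assumptions):

```python
# Sketch of the idea discussed above, not the actual implementation: raise
# QuorumError instead of using bare asserts, and handle it where the
# transitions get materialized so a logic error does not crash the daemon.
def __iter__(self) -> Iterator[Transition]:
    try:
        transitions = list(self._generate_transitions())
    except QuorumError as exc:
        logger.warning('%s', exc)
        transitions = []
    # ...merge/yield the collected transitions as before...
    return iter(transitions)
```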
```python
if increase_numsync_by > 0:
    if self.sync:
        add = CaseInsensitiveSet(sorted(to_add)[:increase_numsync_by])
        increase_numsync_by = len(add)
```
We need this because `to_add` may have fewer members than `increase_numsync_by` initially wants, right?
It's a corner case / optimization.
Adding nodes and changing the replication factor are two different operations and are supposed to be done one after another, which is not always optimal.
Example:
- `/sync`: `{leader: a, sync_standby: b, quorum: 0}`
- `synchronous_standby_names`: `ANY 1 (b)`

We want to add node `c` and increase the replication factor from 1 to 2.

A non-optimal way of doing that would be:
1. `/sync`: `{leader: a, sync_standby: b,c, quorum: 1}` (add node to the /sync key with quorum increase)
2. `synchronous_standby_names`: `ANY 1 (b,c)` (add node to ssn)
3. `synchronous_standby_names`: `ANY 2 (b,c)` (increase replication factor to the requested value)
4. `/sync`: `{leader: a, sync_standby: b,c, quorum: 0}` (reduce quorum)
Steps 2 and 3 will be merged by the `__iter__` method.
With the optimization (because we know that the replication factor should be increased anyway) it produces the following:
1. `synchronous_standby_names`: `ANY 2 (b,c)` (add node to ssn and increase the replication factor)
2. `/sync`: `{leader: a, sync_standby: b,c, quorum: 0}` (add node to the /sync key).
Besides that, there could be other corner cases, for example when we have only 2 replicas (one is already in ssn and we are adding the second one), but the requested replication factor is 3 or higher.
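A toy illustration of the "steps 2 and 3 will be merged by `__iter__`" remark above (my own simplification, not the actual merging logic):

```python
# Toy illustration (not the real __iter__() logic): consecutive transitions of
# the same kind collapse into the last one, so the intermediate "ANY 1 (b,c)"
# value never actually reaches synchronous_standby_names.
from typing import List, Tuple

Step = Tuple[str, int, str]  # (kind, num, members) -- simplified representation

def merge_consecutive(steps: List[Step]) -> List[Step]:
    merged: List[Step] = []
    for step in steps:
        if merged and merged[-1][0] == step[0]:
            merged[-1] = step  # same kind as the previous step: keep only the latest
        else:
            merged.append(step)
    return merged

steps = [
    ('quorum', 1, 'b,c'),  # 1. add c to the /sync key, increase quorum
    ('sync',   1, 'b,c'),  # 2. add c to synchronous_standby_names
    ('sync',   2, 'b,c'),  # 3. raise the replication factor
    ('quorum', 0, 'b,c'),  # 4. reduce quorum
]
print(merge_consecutive(steps))  # steps 2 and 3 collapse into one 'sync' update
```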
I think we could give it a try as an experimental feature and see what users say about it. It is not something that would come enabled by default in any case -- only on Citus clusters from what I could see; maybe we should put some kind of warning here?

The feature is very complex, but the unit tests provide good coverage. I tried understanding in detail and memorizing the quorum state machine code paths, but I'm having a really hard time doing that. However, as I mentioned above, the unit tests are a good thing to take a look at. While reviewing the PR I tried "guessing" the outputs that were written to the unit tests.

From a developer perspective what I can say is: it's complex to understand everything that is being included, but the tests make me more confident that it's working mostly fine. I'd like to get users' fingers on that, as users are brilliant at breaking things 😆

Anyway, I would like to hear other people's thoughts.
Co-authored-by: Israel <israel.barth@enterprisedb.com>
…to feature/quorum-commit
@CyberDem0n what's the latest state on this PR? Need more testing/reviewing?
@martinmarques my current plan is to merge it for v4.0.0. I don't know yet when the release will happen.
Based on #672
To enable quorum commit:

By default Patroni will use `ANY 1 (list, of, standbys)` in `synchronous_standby_names`. That is, only one node out of the listed replicas will be used for quorum. If you want to increase the number of quorum nodes, it is possible to do it with:
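For reference, the merged feature is documented as being controlled by `synchronous_mode: quorum` together with `synchronous_node_count` (an assumption based on the current Patroni documentation, not on this truncated description). As a rough illustration of the `ANY k (list, of, standbys)` semantics, my own sketch rather than PR code:

```python
# Rough illustration (not PR code) of the ANY k (list, of, standbys) semantics:
# a commit is acknowledged once at least k of the listed standbys confirmed it.
from typing import Set

def commit_acknowledged(k: int, listed: Set[str], confirmed: Set[str]) -> bool:
    return len(listed & confirmed) >= k

listed = {'node2', 'node3', 'node4'}
print(commit_acknowledged(1, listed, {'node3'}))           # default ANY 1 -> True
print(commit_acknowledged(2, listed, {'node3'}))           # higher replication factor -> False
print(commit_acknowledged(2, listed, {'node2', 'node4'}))  # True
```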