[BACKPORT 2.14.10][#14473] docdb: Accept registrations from tservers on the same node with greater sequence numbers.

Summary:
Original commits:
- 59f0520 / D25761
- 16d2151 / D25886

This diff modifies the master registration logic. The master has some logic for deciding whether to accept a tserver registration from a node it has seen before. Unfortunately, this logic depends on whether the master has previously seen the uuid that is trying to register.

The previous logic is:
- If the uuid is new, accept the registration as long as the seqno is greater than the previously registered tserver's seqno (the seqno is the unix epoch at the start of the tserver process).
- If the uuid is old and has previously been superseded by a subsequent registration of a new uuid with the same hostport, fail in a loop, endlessly requesting that the node re-register.
- Otherwise, if the uuid is old, accept the re-registration.

This diff streamlines this logic by removing the condition on the uuid. That is, regardless of whether the uuid has been previously seen or not: accept the registration if and only if the seqno is greater than the previously registered tserver's seqno.

The existing logic causes problems when operators try to fix broken quorums. When a majority of tservers have failed due to wiped disks, the respawned tserver processes cannot rejoin the quorum because:
# they have different uuids and thus are considered new tservers
# they cannot be added to the quorum because such an action requires an effective leader and thus a quorum

One trick to bypass this is to respawn new tservers on the wiped nodes and override their UUIDs to the original, pre-wipe UUIDs. These processes will be recognized as existing quorum members and can be manually bootstrapped. However, the master leader refuses to register such tservers because it has already seen these UUIDs and marked them as removed.

Concretely, for a tablet group with tservers with UUIDs A, B, and C on nodes N1, N2, and N3:
# B and C fail with wiped disks.
# Automation respawns tserver processes on nodes N2 and N3 with uuids B' and C'.
# B' and C' cannot join the tablet group because there is no tablet leader.
# The operator wipes the disk of N2 and respawns a new tserver process on N2 with uuid B.
# The master rejects B's attempt to register. When B' registered, it marked B as removed.

Jira: DB-3872

Test Plan:
```
ybd --cxx-test master-test --gtest_filter 'MasterTest.TestReRegisterRemovedUUID'
ybd --cxx-test master_heartbeat-itest --gtest_filter 'MasterHeartbeatITestWithExternal.ReRegisterRemovedPeers'
```

Reviewers: bogdan, stiwary, rahuldesirazu, asrivastava

Reviewed By: asrivastava

Subscribers: aaruj, asrivastava, ybase, slingam, bogdan

Differential Revision: https://phorge.dev.yugabyte.com/D26021
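
As an illustration only, the streamlined rule from the summary can be modeled in a few lines of C++. This is a simplified sketch, not the master's actual code: `Registration`, `TServerRegistryModel`, and `Register` are hypothetical names, the node is identified by a plain host/port string, and the sketch assumes that the first registration seen for a node is always accepted.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>

// Hypothetical, simplified stand-in for a tserver registration request.
struct Registration {
  std::string uuid;       // tserver uuid (may be overridden by an operator)
  std::string host_port;  // identifies the node the tserver runs on
  int64_t seqno;          // unix epoch at the start of the tserver process
};

// Hypothetical model of the streamlined rule: the uuid plays no part in the
// decision; only the seqno of the last accepted registration on the node does.
class TServerRegistryModel {
 public:
  // Returns true if the registration is accepted, false if it is rejected.
  bool Register(const Registration& reg) {
    auto it = latest_by_host_port_.find(reg.host_port);
    if (it != latest_by_host_port_.end() && reg.seqno <= it->second.seqno) {
      // Not newer than the tserver already registered on this node: reject.
      return false;
    }
    // Assumption: a node with no prior registration always accepts.
    latest_by_host_port_[reg.host_port] = reg;
    return true;
  }

 private:
  // Last accepted registration per node.
  std::unordered_map<std::string, Registration> latest_by_host_port_;
};
```

Under this model, the N2 scenario resolves as intended: after B' registers on N2, a later respawn that reuses uuid B is accepted because its seqno is greater, while a registration with a seqno that is equal to or lower than the last accepted one on N2 is rejected.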

Showing 4 changed files with 223 additions and 48 deletions.