
Restoration of a member in a multi-node cluster #645

Open
unmarshall opened this issue Jul 14, 2023 · 4 comments

@unmarshall
Contributor

unmarshall commented Jul 14, 2023

Describe the bug:

If an existing etcd member crashes and comes up again, and its data directory is no longer valid, then in a multi-node setup the data directory is removed and only a limited number of attempts are made to add the member back as a learner. Now consider a case where more than one member goes down and both are trying to recover (in a 5-member cluster). Quorum is still maintained, so it can happen that both members attempt to add themselves as learners and one of them fails.

Expected behavior:
In the scale-up case, adding the current candidate as a learner is repeatedly attempted (up to 6 times). The same should be done when the restoration of a member in a multi-node cluster requires it to be added as a learner.
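
To make this concrete, here is a minimal sketch, assuming a bounded retry around the etcd clientv3 `MemberAddAsLearner` call. It is not the actual etcd-backup-restore code; `maxAttempts`, the back-off interval, and the function name are illustrative assumptions mirroring the scale-up behaviour described above.

```go
// Hypothetical sketch (not etcd-backup-restore's actual implementation):
// retry adding a member as a learner a bounded number of times, mirroring
// the scale-up behaviour described in this issue.
package restore

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func addAsLearnerWithRetry(ctx context.Context, cli *clientv3.Client, peerURLs []string) error {
	const maxAttempts = 6 // assumed limit, taken from the scale-up case mentioned above
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		_, err := cli.MemberAddAsLearner(ctx, peerURLs)
		if err == nil {
			return nil
		}
		fmt.Printf("attempt %d/%d to add member as learner failed: %v\n", attempt, maxAttempts, err)
		time.Sleep(5 * time.Second) // assumed back-off between attempts
	}
	return fmt.Errorf("could not add member as learner after %d attempts", maxAttempts)
}
```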

@unmarshall unmarshall added kind/bug Bug size/xs Size of pull request is tiny (see gardener-robot robot/bots/size.py) exp/beginner Issue that requires only basic skills labels Jul 14, 2023
@ishan16696
Member

ishan16696 commented Jul 14, 2023

Now consider a case where more than one member goes down and both are trying to recover (in a 5-member cluster). Quorum is still maintained, so it can happen that both members attempt to add themselves as learners and one of them fails.

I agree with the concern, but take this scenario in a 5-member cluster:

etcd-0 --> leader
etcd-1 --> follower
etcd-2 --> follower
etcd-3 --> goes down due to data-dir corruption
etcd-4 --> goes down due to data-dir corruption
---
quorum is still there (3/5 is up)

Now, as you already mentioned, each backup-restore detects during the initialisation phase that this is a single-member restoration case, so both clean up their old data-dir and then try to add themselves as a learner (non-voting member) at the same time.
But only one backup-restore will succeed in adding its corresponding etcd as a learner; the other's initialisation call will fail:

etcd-0 --> leader
etcd-1 --> follower
etcd-2 --> follower
etcd-3 --> learner --> gets promoted to follower
etcd-4 --> still down, as its add-as-learner call failed
---
quorum is there (4/5 is up)

Now, IMO, initialisation will get re-triggered and backup-restore will detect this as a single-member restoration case again, so this time the member will get added as a learner successfully.
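
To illustrate why only one of the two simultaneous add-as-learner calls can succeed, here is a hypothetical sketch (not backup-restore code) of two recovering members racing to call `MemberAddAsLearner`. It assumes etcd's default limit of a single learner at a time, so one call is expected to return an error; the peer URLs are made up.

```go
// Hypothetical illustration of the race described above: two recovering
// members call MemberAddAsLearner at the same time. Assuming etcd's default
// limit of one learner at a time, only one call succeeds; the failed member
// relies on a re-triggered initialisation to try again later.
package restore

import (
	"context"
	"log"
	"sync"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func raceToAddLearners(ctx context.Context, cli *clientv3.Client) {
	peerURLs := [][]string{
		{"https://etcd-3.etcd-peer:2380"}, // illustrative peer URL for etcd-3
		{"https://etcd-4.etcd-peer:2380"}, // illustrative peer URL for etcd-4
	}

	var wg sync.WaitGroup
	for _, urls := range peerURLs {
		wg.Add(1)
		go func(urls []string) {
			defer wg.Done()
			if _, err := cli.MemberAddAsLearner(ctx, urls); err != nil {
				log.Printf("add-as-learner for %v failed: %v", urls, err)
				return
			}
			log.Printf("add-as-learner for %v succeeded", urls)
		}(urls)
	}
	wg.Wait()
}
```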

@ishan16696
Member

ishan16696 commented Jul 14, 2023

In the scale-up case, adding the current candidate as a learner is repeatedly attempted (up to 6 times). The same should be done when the restoration of a member in a multi-node cluster requires it to be added as a learner.

No, that isn't required. In the scale-up case we want to avoid going down the wrong path if adding a learner fails, which is why we throw a fatal error there.
But in this case backup-restore has detected the single-member restoration correctly, and it will detect it correctly again even if its previous attempt to add a learner failed.
The number of retries is taken care of by re-triggering the initialisation call whenever the previous initialisation call fails.
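
Roughly, the retry-by-re-initialisation behaviour described above could look like the following sketch. This is illustrative only, not the project's actual code; the function names, the data-dir clean-up placeholder, and the delay between initialisation attempts are assumptions.

```go
// A rough sketch of retry-by-re-initialisation: a single attempt to add the
// member as a learner is made per initialisation; if it fails, the
// initialisation itself fails and is simply triggered again.
// All names and the clean-up placeholder are illustrative assumptions.
package restore

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// initializeSingleMemberRestoration cleans up the stale data directory and
// tries once to add this member back to the cluster as a learner.
func initializeSingleMemberRestoration(ctx context.Context, cli *clientv3.Client, peerURLs []string) error {
	// ... clean up the old data directory here (omitted) ...
	_, err := cli.MemberAddAsLearner(ctx, peerURLs)
	return err
}

// runInitialization re-triggers the initialisation whenever the previous call
// failed, so retries are effectively unbounded at this level.
func runInitialization(ctx context.Context, cli *clientv3.Client, peerURLs []string) {
	for {
		err := initializeSingleMemberRestoration(ctx, cli, peerURLs)
		if err == nil {
			return
		}
		log.Printf("initialisation failed, will be re-triggered: %v", err)
		time.Sleep(10 * time.Second) // assumed delay before the next initialisation
	}
}
```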

@ishan16696
Member

@unmarshall can you please close this issue if you are satisfied with this comment? #645 (comment)

@unmarshall
Contributor Author

In a previous ticket we made a change where we attempt to add a learner a few times and then give up and exit the container, resulting in a restart of the container. I discussed this with @shreyas-s-rao and we agreed that the earlier approach of attempting add-as-learner an unlimited number of times was sufficient; a restart of the container does not alleviate this in any way. So it would probably make sense to remove the limit in both situations and wait until the etcd member is added as a learner (whether it is a new member or a restarted existing member in a multi-node cluster).
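
A minimal sketch of this suggestion, assuming the limit is removed in both situations: keep trying to add the member as a learner until it succeeds (or the caller cancels) instead of exiting the container after a fixed number of attempts. The function name, retry interval, and cancellation handling are illustrative, not the project's actual API.

```go
// Minimal sketch of the proposal above: wait indefinitely (until the caller
// cancels) for the member to be added as a learner, rather than exiting the
// container after a fixed number of attempts. Names are illustrative.
package restore

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func waitUntilAddedAsLearner(ctx context.Context, cli *clientv3.Client, peerURLs []string) error {
	for {
		_, err := cli.MemberAddAsLearner(ctx, peerURLs)
		if err == nil {
			return nil
		}
		log.Printf("adding member as learner failed, retrying: %v", err)
		select {
		case <-ctx.Done():
			return ctx.Err() // give up only when the caller cancels
		case <-time.After(5 * time.Second): // assumed retry interval
		}
	}
}
```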

@gardener-robot gardener-robot added the lifecycle/stale Nobody worked on this for 6 months (will further age) label Apr 17, 2024