
AddPeer API #5123

Open: ramonberrutti wants to merge 1 commit into main

Conversation

ramonberrutti (Contributor)

WIP AddPeer API.

Still to do:

  • Test edge cases
  • Recovery from a corrupt disk
  • Recovery from an empty disk

Signed-off-by: Ramon Berrutti ramonberrutti@gmail.com

@derekcollison (Member) left a comment


What specific problem are we trying to solve? Peers automatically get added in. Is this specific to after a peer remove step?

@ramonberrutti (Contributor, Author)

ramonberrutti commented Feb 22, 2024

What specific problem are we trying to solve? Peers automatically get added in. Is this specific to after a peer remove step?

Yes. After a peer removal, if we want the node to join the cluster again, one method is to change the server name, but in our case we want to add it back after some minutes or hours.
Another solution we found is to force leader elections until the node is added back to the Raft group (I haven't looked into how this works, or whether it is just luck).

The solution in this code only works for peers that were already removed but were kept in the hashmap.

@derekcollison (Member)

You could simply shut down the server, do the maintenance needed, and restart?

If you need to move stream and consumer peers off that machine during the downtime, you can do that separately.

I will double-check when the system will re-add a peer that was removed.

@ramonberrutti (Contributor, Author)

You could simply shut down the server, do the maintenance needed, and restart?

If you need to move stream and consumer peers off that machine during the downtime, you can do that separately.

I will double-check when the system will re-add a peer that was removed.

We can't do that because we want to adjust the quorum size.
During our maintenance we scale up new nodes (1/3 of the total node count).

For example, with 9 nodes we need 5 to reach meta-leader quorum.

During maintenance we scale up 3 new nodes and scale down 3 old ones.
Now we need 7 of 12 nodes to reach quorum, but since we have already lost 3, we can only afford to lose 2 more.
We want to remove those peers temporarily so that we can tolerate 3 failures instead of 2.

Also, we need to repeat this process multiple times, and we want to remove the added nodes once the first ones removed have recovered.
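The quorum arithmetic in the scenario above can be sketched as follows (a minimal illustration of Raft majority math, not NATS server code):

```go
package main

import "fmt"

// quorum returns the minimum number of peers needed for a Raft
// majority in a cluster of n voting peers: floor(n/2) + 1.
func quorum(n int) int {
	return n/2 + 1
}

func main() {
	fmt.Println(quorum(9))  // 5: a 9-peer group tolerates 4 failures
	fmt.Println(quorum(12)) // 7: after scaling to 12 peers, quorum rises to 7

	// With 3 old peers already down but still counted as voters,
	// only 12 - 7 - 3 = 2 additional failures are tolerated unless
	// the departed peers are removed from the group.
	fmt.Println(12 - quorum(12) - 3) // 2
}
```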

@derekcollison (Member)

With your 9-node cluster you can tolerate 4 failures while the whole cluster (meta) stays available. What purpose is served by scaling up to 12?

@ripienaar (Contributor)

Why do you want to adjust the quorum number?

The process of swapping machines in and out works very well in a rolling fashion if you bring nodes back with predictably set server_names; there should never be a need to peer-remove a server unless it is gone for good.

Rather than carry API bloat, I'd rather see a better maintenance process used here, and find out how we can help you achieve a process that keeps the Raft layer stable over time.
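The rolling approach described above relies on each replacement node starting with the same server_name as the node it replaces, so it resumes that peer's Raft identity instead of joining as a new voter. A minimal config sketch (the node name, routes, port, and store path below are hypothetical examples, not values from this PR):

```
# nats-3.conf (hypothetical): fixed identity so the node
# rejoins the meta group as the same Raft peer after maintenance.
server_name: nats-3
listen: 0.0.0.0:4222

jetstream {
  store_dir: /data/jetstream
}

cluster {
  name: C1
  listen: 0.0.0.0:6222
  routes: [
    nats-route://nats-1:6222
    nats-route://nats-2:6222
  ]
}
```

Because the identity is pinned, replacing the underlying machine does not change the voter set or the quorum size.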
