Cluster upgrade not working as expected #10786
-
**Describe the bug**

During the rolling upgrade of the nodes of our cluster, some operations did not work as expected.
Next, we used the management console to check the state of the queues, and most of them were in an inconsistent state: some appeared as not running when viewed from the initial cluster nodes (A, B, C), while the other nodes (D, E, F) saw the same queues as running. When we analyzed these queues with `rabbitmq-queues quorum_status`, we found that some of them had three quorum nodes that were all followers, with no leader! Other queues reported a `noproc` Raft state on the old nodes (A, B, C), in a non-deterministic way.

After we brought the cluster into a consistent state, where every queue had the new nodes in its quorum, we tried to forget nodes A, B and C. First we shrank and forgot node A, but the shrink gave timeout errors on a few queues, and manual `delete_member` calls did not help either; the cluster state was unknown to us. Once we manually adjusted the queue membership, we were able to forget the node without errors. When we shrank the last node (again getting errors on some queues), we at first had a "clean" state, where all queues had the new nodes (D, E, F) as members. But when we forgot the last node, almost every queue went down: every queue still had node C as a quorum member, the node we had asked to forget earlier (and for which we had received an OK response). At this point we had the major service disruption.

All of these errors left us with a totally inconsistent cluster state and most of the queues down, with no idea why, and we had to terminate the cluster in a drastic way, losing a lot of (important) data. It is important for us to understand why this happened and how to keep it from happening again; any tips, information or suggestions would be very valuable, and I will reply in this issue with any information requested. I am not sure whether this was caused by some sort of bug or by something else, but I am sure that the cluster should perform some kind of check to avoid entering this type of state.

**Reproduction steps**

These are the operations done on node D, the node joining the cluster:
**Expected behavior**

We should have a safer rolling upgrade procedure, to reduce the probability of an inconsistent cluster state.

**Additional context**

No response
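For reference, here is a minimal sketch of the inspection and membership commands mentioned in the report above; the queue name, vhost and node names are placeholders, not the reporter's actual values.

```bash
# Inspect the Raft membership and state of a quorum queue
# (this is what revealed the leaderless / noproc replicas above).
rabbitmq-queues quorum_status --vhost "/" "my-queue"

# Manually remove a replica from a queue's membership, then remove
# the node from the cluster (run on a remaining cluster member).
rabbitmq-queues delete_member --vhost "/" "my-queue" rabbit@A
rabbitmqctl forget_cluster_node rabbit@A
```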
-
Resetting a node was never recommended as an upgrade step. Performing it during an upgrade is wrong and specifically counterproductive with quorum queues, streams and Khepri (all Raft-based features). Try finding a single recommendation to use `rabbitmqctl reset`.

If you have reasons to "upgrade" by throwing away data, you can use a greatly simplified variation of the Blue/Green deployment where all you do is export and re-import definitions, and otherwise form a brand new cluster.

The only place where …

The claim that you did not want to rebuild the cluster and lose any data absolutely does not add up with the aforementioned upgrade procedure that explicitly resets nodes.
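A minimal sketch of that definitions-only variant, assuming `rabbitmqctl` is available and using a placeholder file path:

```bash
# On the old cluster: export all definitions (users, vhosts, queues,
# exchanges, bindings, policies) to a JSON file.
rabbitmqctl export_definitions /tmp/definitions.json

# Form a brand new cluster on the new version, then on one of its nodes:
rabbitmqctl import_definitions /tmp/definitions.json
```

Note that definitions cover topology and metadata only; messages are not exported, which is why this variant only applies when throwing away data is acceptable.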
-
There are a couple more things that perhaps I should clarify:
-
Another important note: as of 3.13.0, …
-
Thank you Michael for your reply, it's very important for us.
-
Our team is grateful for the upgrade strategy advice, but we would like to know if there is a known method to repair the current situation, which is the following:
-
Our Upgrade guide now has a new section dedicated to this upgrade strategy. For lack of a better-established term, I've named it "grow-then-shrink". Some tools use the term "surge upgrades" but I find that less descriptive.
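As a rough illustration of the strategy (not the guide's exact steps; node names are placeholders, and `rabbit@A` stands for any existing cluster member):

```bash
# Grow: add a new, blank node to the cluster...
rabbitmqctl -n rabbit@D stop_app
rabbitmqctl -n rabbit@D join_cluster rabbit@A
rabbitmqctl -n rabbit@D start_app

# ...and place quorum queue replicas on it.
rabbitmq-queues grow rabbit@D all

# Shrink: once replicas are in sync, move membership off an old node
# first, and only then remove it from the cluster.
rabbitmq-queues shrink rabbit@A
rabbitmqctl forget_cluster_node rabbit@A
```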
We cannot suggest much without having logs from all nodes, but there is one known scenario: if `rabbitmq-diagnostics check_if_node_is_quorum_critical` is ignored in the process, you can end up with fewer than a majority of replicas online. A booting node is technically online according to some metrics but it does not necessarily have sta…
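One way to guard against that scenario, sketched under the assumption that your rolling-restart tooling can run a shell command between nodes (node names and the restart step are placeholders):

```bash
# Gate each node's restart on the quorum-safety check: it exits
# non-zero if stopping the target node would leave any quorum queue
# without an online majority.
for node in rabbit@A rabbit@B rabbit@C; do
  rabbitmq-diagnostics -n "$node" check_if_node_is_quorum_critical || exit 1
  # ... restart "$node" here, then wait for it to fully boot and for
  # its replicas to catch up before moving on to the next node ...
done
```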