RabbitMQ 3.13.0 nodes with Consul peer discovery enabled fail to form a cluster #10760

Closed
dumbbell opened this issue Mar 16, 2024 · 2 comments

@dumbbell
Member

The conclusion of the discussion in #10661 is that the Consul peer discovery backend broke in RabbitMQ 3.13.0, following the rewrite of peer discovery in #9797. In this rewrite, the behavior changed significantly. In particular, the lock is only acquired after the discovery phase and only if the node executing peer discovery must join another node.

This breaks the Consul peer discovery backend because acquiring the lock also opened a Consul session, and discoverable nodes are those that have a session open.

This was ok in RabbitMQ 3.12.x and before because the steps were:

  1. acquire a lock (which implicitly opens a session in the Consul backend)
  2. discover nodes
  3. join nodes
  4. release the lock

In RabbitMQ 3.13.0, the behavior is:

  1. discover nodes; because nodes may not have opened a session yet, the discovery step can return nothing useful
  2. acquire a lock if the node should join another one; that's where the session is opened but the discovery step above likely returned nothing
  3. join nodes
  4. release the lock
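
The difference between the two orderings can be illustrated with a toy model. This is only a sketch of the reasoning above, not RabbitMQ or Consul code: `FakeConsul`, `boot_3_12` and `boot_3_13` are hypothetical names, and the only rule modeled is the one stated above, that a node is discoverable only once it has a session open.

```python
class FakeConsul:
    """Toy stand-in for Consul: a node is discoverable only while it has a session."""

    def __init__(self):
        self.sessions = set()

    def open_session(self, node):
        self.sessions.add(node)

    def list_nodes(self):
        return sorted(self.sessions)


def boot_3_12(node, consul):
    # 1. acquire the lock, which implicitly opens a session
    consul.open_session(node)
    # 2. discover nodes (join and lock release are omitted from this sketch)
    return consul.list_nodes()


def boot_3_13(node, consul):
    # 1. discover nodes before this node has opened any session
    discovered = consul.list_nodes()
    # 2. acquire the lock (and open the session) only if there is a peer to join
    if any(peer != node for peer in discovered):
        consul.open_session(node)
    return discovered


old = FakeConsul()
old_results = [boot_3_12(n, old) for n in ("rabbit@a", "rabbit@b")]

new = FakeConsul()
new_results = [boot_3_13(n, new) for n in ("rabbit@a", "rabbit@b")]
```

In this model, the 3.12.x ordering lets the second node see the first one, while the 3.13.0 ordering returns an empty list for both nodes and no session is ever opened.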

While looking at this, I noticed that the session is never explicitly closed. That is another thing to fix once I know how to improve peer discovery so that the Consul backend can open a session early, separately from the locking.

@dumbbell dumbbell added the bug label Mar 16, 2024
@dumbbell dumbbell self-assigned this Mar 16, 2024
@lukebakken lukebakken self-assigned this Mar 16, 2024
dumbbell added a commit that referenced this issue Mar 18, 2024
…callbacks

[Why]
The Consul peer discovery backend needs to create a session before it
can acquire a lock. This session is also required for nodes to discover
each other.

It must open the session before the `list_nodes/0` callback can return
meaningful results.

[How]
The new `pre_discovery/0` and `post_discovery/1` callbacks are used to
create and delete that session before the whole discover/lock/join
process.

Fixes #10760.
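
As a rough illustration of the callback flow this commit message describes, here is a language-neutral sketch in Python. It is not the actual Erlang backend: `ConsulLikeBackend` and `run_peer_discovery` are hypothetical, and the assumption that `pre_discovery` returns an opaque value later handed to `post_discovery` is inferred only from the `/0` and `/1` arities.

```python
class ConsulLikeBackend:
    """Hypothetical backend where a node is listed only while it holds a session."""

    def __init__(self):
        self.sessions = {}  # session id -> node name
        self._next_id = 0

    def pre_discovery(self, node):
        # Create the session up front and return an opaque handle to it.
        self._next_id += 1
        session_id = f"session-{self._next_id}"
        self.sessions[session_id] = node
        return session_id

    def list_nodes(self):
        return sorted(set(self.sessions.values()))

    def post_discovery(self, session_id):
        # Delete the session created by pre_discovery.
        self.sessions.pop(session_id, None)


def run_peer_discovery(backend, node):
    state = backend.pre_discovery(node)
    try:
        # The lock and join steps would follow discovery here.
        return backend.list_nodes()
    finally:
        backend.post_discovery(state)


backend = ConsulLikeBackend()
discovered = run_peer_discovery(backend, "rabbit@a")
```

With the session created before `list_nodes` runs, the node appears in its own discovery results, and the session is cleaned up afterwards.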
@dumbbell
Member Author

After more investigation, it looks like the problem is not the lock-related changes, but the fact that we register a node after the discovery step. This means that a node can't discover itself (among other members of a cluster).

This was fine in RabbitMQ 3.12.x and before because we had far fewer checks in place. In particular, there was no requirement that the current node be among the discovered nodes. After some timeout, one node would give up on peer discovery and boot. As part of the boot, it would register itself and other nodes would then discover it.

With the new checks in place in 3.13.x, we keep rejecting the list of discovered nodes until that same timeout expires. At that point, all nodes boot as standalone nodes because they did not discover anyone.

One possible solution is to register first, then run peer discovery.
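
To make the ordering problem concrete, here is a minimal sketch of a single discovery attempt per node. It is a simplified model, not the actual implementation: `discovery_accepted`, `discover_then_register` and `register_then_discover` are hypothetical names, and the registry is just a Python set standing in for the discovery backend.

```python
def discovery_accepted(node, discovered):
    """3.13.x-style check: the node's own name must be in the discovered list."""
    return node in discovered


def discover_then_register(node, registry):
    discovered = sorted(registry)  # discovery runs first
    accepted = discovery_accepted(node, discovered)
    registry.add(node)  # registration happens too late for the check
    return accepted


def register_then_discover(node, registry):
    registry.add(node)  # register first
    discovered = sorted(registry)  # the node can now discover itself
    return discovery_accepted(node, discovered)


nodes = ["rabbit@a", "rabbit@b", "rabbit@c"]

late_registry = set()
late = [discover_then_register(n, late_registry) for n in nodes]

early_registry = set()
early = [register_then_discover(n, early_registry) for n in nodes]
```

In this model, every node that discovers before registering fails the check, even when other nodes are already registered, whereas every node that registers first passes it.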

@dumbbell
Member Author

This issue should be fixed by #11045.
