
rabbit_peer_discovery: Fixes and improvements for Consul and etcd #11045

Merged
merged 9 commits into main from peer-disc-register-before-discover May 14, 2024

Conversation

dumbbell
Member

@dumbbell dumbbell commented Apr 19, 2024

Why

Node registration

After the rework of the peer discovery subsystem in RabbitMQ 3.13.0 to fix cases where clusters would be incorrectly initialized, there were regressions in the Consul and etcd backends causing nodes to hang during peer discovery and eventually boot as standalone nodes.

For Consul and etcd, the discovery process relies on node registration: these backends return whatever was previously registered. With the new checks and failsafes added in RabbitMQ 3.13.0, the fact that registration happens after running discovery breaks these backends.

It used to work before because the first node would eventually time out waiting for a non-empty list of nodes from the backend and proceed as a standalone node, registering itself on the way. Following nodes would then discover that first node.

Among the new checks, the node running discovery expects to find itself in the list of discovered nodes. Because it has not registered yet, it will never find itself and will reject the peer discovery results.

Node selection

In addition to that, RabbitMQ 3.13 introduced a new node sorting to make sure the node selection is deterministic. However, with Consul, the order in which nodes registered should be taken into account, in case registration happened in a different order than the one the nodes booted in. Therefore, the patch now lets the backend select the node to join and return a single node instead of a list.

This is committed at the same time as the first part because the new testsuites may fail transiently without this.

How

The solution is to register first, then run discovery. The node should at least get itself in the discovered nodes.

While here, add a testsuite in both the Consul and etcd plugins to actually test clustering against a real Consul/etcd server.

Finally, allow backends to select the node to join and return a single node name instead of a list. We still query that node’s properties to see if it is ready to be joined.
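
A rough sketch of the resulting flow, for illustration only (this is not the actual `rabbit_peer_discovery` code and the helper names are made up):

    %% Illustrative sketch: register first, then discover, and accept either a
    %% node list (previous behaviour) or a single node selected by the backend.
    form_cluster(Backend) ->
        ok = register_with_backend(Backend),
        case Backend:list_nodes() of
            {ok, {Nodes, _NodeType}} when is_list(Nodes) ->
                join(select_node_to_join(Nodes));
            {ok, {SelectedNode, _NodeType}} when is_atom(SelectedNode) ->
                join(SelectedNode)
        end.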

@dumbbell dumbbell added the bug label Apr 19, 2024
@dumbbell dumbbell added this to the 3.13.2 milestone Apr 19, 2024
@dumbbell dumbbell self-assigned this Apr 19, 2024
@dumbbell
Member Author

As of this writing, the etcd testsuite is not finished and the Consul one is yet to be extended.

@dumbbell dumbbell force-pushed the peer-disc-register-before-discover branch 5 times, most recently from ff507a5 to f6442ea Compare April 26, 2024 07:51
@mergify mergify bot added the bazel label Apr 26, 2024
@dumbbell dumbbell force-pushed the peer-disc-register-before-discover branch 7 times, most recently from 62e6e2b to eb6a195 Compare May 2, 2024 15:28
@michaelklishin michaelklishin modified the milestones: 3.13.2, 3.13.3 May 3, 2024
@dumbbell dumbbell force-pushed the peer-disc-register-before-discover branch from eb6a195 to e47ca00 Compare May 3, 2024 13:51
@dumbbell dumbbell changed the title rabbit_peer_discovery: Register node before running discovery rabbit_peer_discovery: Fixes and improvements for Consul and etcd May 3, 2024
@dumbbell
Member Author

dumbbell commented May 3, 2024

In the end, I put two peer discovery improvements in this patch that fix regressions in the Consul and etcd backends, plus new testsuites for these backends. Thus the scope expanded compared to the initial intent of this pull request. I'm going to rewrite the pull request description to explain everything.

@frederikbosch

frederikbosch commented May 3, 2024

I just found out that my whole exercise in this discussion is what will probably be resolved in this PR.

If I may, what I would prefer as a Consul/Nomad user is to let RabbitMQ read what is already registered in Consul by Nomad. Hence, I would like to disable RabbitMQ registering the service: that is something my container orchestrator (Nomad) already takes care of. So every RabbitMQ node sends a query to Consul, sees three nodes (including itself), and constructs the cluster, including leader election.

Config-wise, this might be realized by:

cluster_formation.consul.svc_addr_auto = false
cluster_formation.consul.svc_addr =

I would also include the required Consul policy in the documentation, which for 3.12 I believe is the following:

node_prefix ""
{
  policy = "read"
}

service_prefix "rabbitmq"
{
  policy = "write"
}

agent_prefix ""
{
  policy = "read"
}

session_prefix ""
{
  policy = "write"
}

key_prefix "rabbitmq/"
{
  policy = "write"
}

@dumbbell dumbbell force-pushed the peer-disc-register-before-discover branch from e47ca00 to 9e57581 Compare May 3, 2024 15:47
@dumbbell dumbbell force-pushed the peer-disc-register-before-discover branch from 9e57581 to 39e65f3 Compare May 13, 2024 08:13
@dumbbell dumbbell marked this pull request as ready for review May 14, 2024 07:40
[Why]
A variable can be set to `false` to explicitly unset it in the child
process. This special meaning was ignored and the value was converted to
the string "false".

[How]
We special-case `false` and leave it as is.
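
For illustration (a minimal sketch, not the actual helper), the idea is to keep `false` untouched when building the child process environment, since Erlang's `open_port/2` treats a `false` value in the `env` option as "unset this variable":

    %% Sketch only: keep `false' as-is so the variable is unset in the child
    %% process; stringify other values as before.
    env_value(false) -> false;
    env_value(Value) when is_atom(Value)    -> atom_to_list(Value);
    env_value(Value) when is_integer(Value) -> integer_to_list(Value);
    env_value(Value) -> Value.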
[Why]
The existing testsuite checked whether the communication with an etcd node
would work, but didn't test an actual cluster formation.

[How]
The new testcases try to create a cluster using the local etcd node
started by the testsuite. The first one starts one RabbitMQ node at a
time; the second one starts all of them concurrently.

While here, use the etcd source code added as a Git submodule in a
previous commit to compile etcd locally just for the testsuite.
[Why]
The `consul_svc` parameter is used as the service name and to construct
the service ID. The problem with the way the service ID is constructed
is that it doesn't allow registering several distinct RabbitMQ nodes in
the same Consul agent.

This is a problem for testsuites where we want to run several RabbitMQ
nodes on the same host with a single local Consul agent.

[How]
The service ID now has its own parameter, `consul_svc_id`. If it is
unset, the ID falls back to the previous construction from the service
name. This keeps the behaviour 100% compatible with previous versions.
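
A sketch of that fallback, for illustration only (the helper names and the configuration access are made up):

    %% Illustrative: use `consul_svc_id' when it is set, otherwise derive the
    %% service ID from the service name as previous versions did.
    service_id(Config) ->
        case maps:get(consul_svc_id, Config, undefined) of
            undefined -> derive_service_id(maps:get(consul_svc, Config));
            SvcId     -> SvcId
        end.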
…stration

[Why]
This allows other nodes to discover the actual node names, instead of
deriving one from the Consul agent node name and their own node name.

This permits registering several RabbitMQ nodes in the same Consul
agent, which is handy at least for testsuites.

[How]
The Erlang node name is added to the `Meta` properties list along with
the RabbitMQ cluster name.

Note that this also fixes when the cluster name is added to `Meta`:
before this commit, a non-default cluster name was not added if the
user-configured properties list was initially empty.
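
For illustration, the registered service's `Meta` might then carry entries along these lines (the values and the cluster key name are examples; `erlang-node-name` is the meta key referenced later in this discussion):

    %% Example shape only; actual keys and values depend on the configuration.
    Meta = [{"erlang-node-name", "rabbit@host-1"},
            {"cluster", "my-rabbitmq-cluster"}].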
[Why]
Add a `system_SUITE` testsuite, copied from
rabbitmq_peer_discovery_etcd, that attempts to start a RabbitMQ cluster
where nodes use a Consul server to discover themselves.

[How]
The new testcases try to create a cluster using the local Consul node
started by the testsuite. The first one starts one RabbitMQ node at a
time; the second one starts all of them concurrently.

While here, use the Consul source code added as a Git submodule in a
previous commit to compile Consul locally just for the testsuite.
[Why]
The new implementation of `rabbit_peer_discovery` acquires the lock only
when a node needs to join another one. This is meant to disappear in the
medium/long term anyway.

Here, we need to lock the query to Consul to make sure that queries
happen sequentially, not concurrently. This is a work in progress and we
may not keep it either.
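
Conceptually, it looks like the sketch below (illustrative only; `acquire_lock/0` and `release_lock/1` are placeholders for the real locking):

    %% Sketch: serialize the query to Consul so concurrently booting nodes
    %% register and discover sequentially rather than interleaved.
    locked_query(Fun) ->
        {ok, Lock} = acquire_lock(),
        try Fun() after release_lock(Lock) end.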
[Why]
The two backends that use registration are Consul and etcd. The
discovery process relies on the registered nodes: they return whatever
was previously registered.

With the new checks and failsafes added to peer discovery in RabbitMQ
3.13.0, the fact that registration happens after running discovery
breaks the Consul and etcd backends.

It used to work before because the first node would eventually time out
waiting for a non-empty list of nodes from the backend and proceed as a
standalone node, registering itself on the way. Following nodes would
then discover that first node.

Among the new checks, the node running discovery expects to find itself
in the list of discovered nodes. Because it didn't register yet, it will
never find itself.

[How]
The solution is to register first, then run discovery. The node should
at least get itself in the discovered nodes.
…selves

[Why]
Before, the backend would always return a list of nodes and the
subsystem would select one based on their uptimes, the nodes they are
already clustered with, and the readiness of their database.

This works well in general but has some limitations. For instance with
the Consul backend, the discoverability of nodes depends on when each
one registered and in which order. Therefore, the node with the highest
uptime might not be the first that registers. In this case, the one that
registers first will only discover itself and boot as a standalone node.
However, the one with the highest uptime that registered after will
discover both nodes. It will then select itself as the node to join
because it has the highest uptime. In the end both nodes form distinct
clusters.

Another example is the Kubernetes backend. The current solution works
fine but it could be optimized: the backend knows we always want to join
the first node ("$node-0") regardless of the order in which they are
started because picking the first node alphabetically is fine.

Therefore we want to let the backend select the node to join if it
wants to.

[How]
The `list_nodes()` callback can now return the following term:

    {ok, {SelectedNode :: node(), NodeType}}

If the subsystem sees this return value, it will consider that the
returned node is the one to join. It will still query properties because
we want to make sure the node's database is ready before joining it.
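
A hypothetical backend illustrating both return shapes (the helper names are made up):

    %% Illustrative backend: return the selected node directly when the backend
    %% can decide, otherwise return the whole node list as before.
    list_nodes() ->
        case node_selected_by_backend() of
            undefined -> {ok, {discovered_nodes(), disc}};
            Node      -> {ok, {Node, disc}}
        end.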
[Why]
The default node selection of the peer discovery subsystem doesn't work
well with Consul. The reason is that this selection is based on the
nodes' uptime. However, the node with the highest uptime may not be the
first to register in Consul.

When this happens, the node that registered first will only discover
itself and boot as a standalone node. Then, the node with the highest
uptime will discover both of them, but will select itself as the node to
join because of its uptime. In the end, we end up with two clusters
instead of one.

[How]
We use the `CreateIndex` property in the Consul response to sort
services. We then derive the name of the node to join from the service
that has the lowest `CreateIndex`, meaning it was the first to register.
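
A sketch of that selection, for illustration (how the Consul response is decoded and which keys are present is assumed here):

    %% Illustrative: pick the service with the lowest `CreateIndex', i.e. the
    %% first one that registered.
    select_first_registered(Services) ->
        [First | _] = lists:sort(
                        fun(A, B) ->
                            maps:get(<<"CreateIndex">>, A) =<
                                maps:get(<<"CreateIndex">>, B)
                        end, Services),
        First.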
@dumbbell dumbbell force-pushed the peer-disc-register-before-discover branch from 39e65f3 to 0f054e1 Compare May 14, 2024 07:40
@dumbbell dumbbell merged commit d893c84 into main May 14, 2024
18 checks passed
@dumbbell dumbbell deleted the peer-disc-register-before-discover branch May 14, 2024 08:38
@michaelklishin michaelklishin added the backport-pending Use with PRs that are yet to be backported for any reason label May 14, 2024
@michaelklishin
Member

We will backport this in a week or two once we have some evidence that this PR works as expected during CI runs in main and day-to-day core team development use (of main).

@frederikbosch

frederikbosch commented May 14, 2024

@michaelklishin What is the image name of the main branch? I am planning to test today.

Edit: never mind, already found it.

@frederikbosch

frederikbosch commented May 14, 2024

I have tested this PR using a three-node cluster with two scenarios.

  1. Also register the service in Consul using my orchestration tool (Nomad), with the meta erlang-node-name set, under the same name (rabbitmq) that RabbitMQ registers the service with.
  2. Only let RabbitMQ do the registration in Consul.

The second scenario has one big downside: what if RabbitMQ did not shut down properly? Then the service remains in the registry. This could lead to an unrecoverable cluster. I actually ran into this scenario. What happened?

  • I stopped the cluster, and RabbitMQ did not have time to shut down properly, so the node was killed,
  • leaving services in the registry.
  • When I restarted all the nodes, they query Consul, see the leftover services (status passing), and try to join them.
  • This results in RabbitMQ crashing, because the hostname does not resolve anymore. In my cluster, Docker services only have a resolving FQDN when they are actually running.
=PROGRESS REPORT==== 14-May-2024::12:25:21.123386 ===
    supervisor: {local,inet_gethost_native_sup}
    started: [{pid,<0.97.0>},{mfa,{inet_gethost_native,init,[[]]}}]

=PROGRESS REPORT==== 14-May-2024::12:25:21.133443 ===
    supervisor: {local,kernel_safe_sup}
    started: [{pid,<0.96.0>},
              {id,inet_gethost_native_sup},
              {mfargs,{inet_gethost_native,start_link,[]}},
              {restart_type,temporary},
              {significant,false},
              {shutdown,1000},
              {child_type,worker}]

2024-05-14 12:25:21.171098+00:00 [notice] <0.44.0> Application mnesia exited with reason: stopped
2024-05-14 12:25:21.171789+00:00 [error] <0.253.0> 
2024-05-14 12:25:21.171789+00:00 [error] <0.253.0> BOOT FAILED
2024-05-14 12:25:21.171789+00:00 [error] <0.253.0> ===========
2024-05-14 12:25:21.171789+00:00 [error] <0.253.0> Exception during startup:
2024-05-14 12:25:21.171789+00:00 [error] <0.253.0> 
2024-05-14 12:25:21.171789+00:00 [error] <0.253.0> error:function_clause
2024-05-14 12:25:21.171789+00:00 [error] <0.253.0> 
2024-05-14 12:25:21.171789+00:00 [error] <0.253.0>     rabbit_peer_discovery:select_node_to_join/1, line 873
2024-05-14 12:25:21.171789+00:00 [error] <0.253.0>         args: [[]]
2024-05-14 12:25:21.171789+00:00 [error] <0.253.0>     rabbit_peer_discovery:sync_desired_cluster/3, line 206
2024-05-14 12:25:21.171789+00:00 [error] <0.253.0>     rabbit_db:init/0, line 66
2024-05-14 12:25:21.171789+00:00 [error] <0.253.0>     rabbit_boot_steps:-run_step/2-lc$^0/1-0-/2, line 51
2024-05-14 12:25:21.171789+00:00 [error] <0.253.0>     rabbit_boot_steps:run_step/2, line 58
2024-05-14 12:25:21.171789+00:00 [error] <0.253.0>     rabbit_boot_steps:-run_boot_steps/1-lc$^0/1-0-/1, line 22
2024-05-14 12:25:21.171789+00:00 [error] <0.253.0>     rabbit_boot_steps:run_boot_steps/1, line 23
2024-05-14 12:25:21.171789+00:00 [error] <0.253.0>     rabbit:start/2, line 978
2024-05-14 12:25:21.171789+00:00 [error] <0.253.0> 
2024-05-14 12:25:22.173737+00:00 [debug] <0.253.0> Set stop reason to: {error,function_clause}
2024-05-14 12:25:22.174281+00:00 [debug] <0.253.0> Change boot state to `stopped`

Although I have a scenario I can work with, I would suggest the following:

  • do not let RabbitMQ crash when the hostname does not resolve, but rather skip the node to join
  • allow RabbitMQ not to register at all (via a new config cluster_formation.consul.svc_register = false)

@dumbbell
Member Author

Thank you for testing @frederikbosch! I think these are valid concerns. Could you please file an issue for this? The problem existed before anyway, so this is not a regression that should have been fixed by this pull request.

@frederikbosch

Issue filed
