
rabbit_peer_discovery: Fixes and improvements for Consul and etcd #11045

Merged (9 commits) on May 14, 2024

Commits on May 14, 2024

  1. rabbitmq_ct_helpers: Fix handling of unset env. variables in exec/2.

    [Why]
    A variable can be set to `false` to explicitly unset it in the child
    process. This special case was ignored and the value was converted to
    the string "false".
    
    [How]
    We special-case `false` and leave it as is.
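
    Not part of the commit itself, but a minimal sketch of the special case,
    with made-up function names (Erlang's `open_port/2` treats a value of
    `false` in its `env` option as "remove this variable from the child's
    environment"):

        %% Hypothetical helper: stringify env. values for the child process,
        %% but keep `false' untouched so the variable gets unset.
        format_env(Env) ->
            [{Name, format_env_value(Value)} || {Name, Value} <- Env].

        format_env_value(false) -> false;
        format_env_value(Value) when is_atom(Value) -> atom_to_list(Value);
        format_env_value(Value) when is_integer(Value) -> integer_to_list(Value);
        format_env_value(Value) -> Value.
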
    dumbbell committed May 14, 2024 (commit 6266e64)
  2. rabbitmq_peer_discovery_etcd: Add clustering testcases

    [Why]
    The existing testsuite checked that communication with an etcd node
    worked, but didn't test actual cluster formation.
    
    [How]
    The new testcases try to create a cluster using the local etcd node
    started by the testsuite. The first one starts one RabbitMQ node at a
    time; the second one starts all of them concurrently.
    
    While here, use the etcd source code added as a Git submodule in a
    previous commit to compile etcd locally just for the testsuite.
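
    For context, the node configuration such a cluster-formation test
    exercises looks roughly like the following sketch (the endpoint value is
    just an example pointing at the locally compiled etcd):

        # rabbitmq.conf sketch; values are illustrative
        cluster_formation.peer_discovery_backend = etcd
        cluster_formation.etcd.endpoints.1 = localhost:2379
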
    dumbbell committed May 14, 2024 (commit 50b4901)
  3. rabbitmq_peer_discovery_consul: Separate service name and ID

    [Why]
    The `consul_svc` parameter is used both as the service name and to
    construct the service ID. The problem with the way the service ID is
    constructed is that it doesn't allow registering several distinct
    RabbitMQ nodes with the same Consul agent.
    
    This is a problem for testsuites where we want to run several RabbitMQ
    nodes on the same host with a single local Consul agent.
    
    [How]
    The service ID now has its own parameter, `consul_svc_id`. If it is
    unset, the ID falls back to the previous construction from the service
    name, which keeps the behaviour 100% compatible with previous versions.
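
    A rough sketch of the intended fallback, with hypothetical function and
    configuration key names (not the actual module code):

        %% Hypothetical sketch of the fallback logic.
        service_id(Config) ->
            case maps:get(consul_svc_id, Config, undefined) of
                undefined ->
                    %% No explicit ID: keep the historical construction
                    %% based on the service name (`consul_svc').
                    derive_service_id_from_svc(maps:get(consul_svc, Config));
                ServiceId ->
                    %% Explicit ID: use it as-is, so several RabbitMQ nodes
                    %% can register with the same Consul agent.
                    ServiceId
            end.

        %% Placeholder standing in for the pre-existing ID construction.
        derive_service_id_from_svc(ServiceName) ->
            ServiceName.
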
    dumbbell committed May 14, 2024 (commit 684ec76)
  4. rabbitmq_peer_discovery_consul: Populate Erlang node name during registration
    
    [Why]
    This allows other nodes to discover the actual node names, instead of
    deriving one from the Consul agent node name and their own node name.
    
    This makes it possible to register several RabbitMQ nodes with the same
    Consul agent, which is at least handy for testsuites.
    
    [How]
    The Erlang node name is added to the `Meta` properties list, along with
    the RabbitMQ cluster name.

    Note that this also fixes when the cluster name is added to `Meta`:
    before this commit, a non-default cluster name was not added if the
    user-configured properties list was initially empty.
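
    Roughly, the registration payload now carries entries like the following
    in its `Meta` map (a sketch; the exact key names are assumptions, not
    necessarily those used by the plugin):

        %% Illustrative only; the map would be JSON-encoded into the Consul
        %% service registration request.
        registration_meta() ->
            #{<<"erlang-node-name">> => atom_to_binary(node(), utf8),
              <<"cluster">>          => rabbit_nodes:cluster_name()}.
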
    dumbbell committed May 14, 2024 (commit 750497c)
  5. rabbitmq_peer_discovery_consul: Add clustering testcases

    [Why]
    Add a `system_SUITE` testsuite, copied from
    rabbitmq_peer_discovery_etcd, that attempts to start a RabbitMQ cluster
    where nodes use a Consul server to discover each other.
    
    [How]
    The new testcases try to create a cluster using the local Consul node
    started by the testsuite. The first one starts one RabbitMQ node at a
    time; the second one starts all of them concurrently.
    
    While here, use the Consul source code added as a Git submodule in a
    previous commit to compile Consul locally just for the testsuite.
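
    The corresponding node configuration for these tests would look roughly
    like this sketch (values are examples pointing at the locally compiled
    Consul agent):

        # rabbitmq.conf sketch; values are illustrative
        cluster_formation.peer_discovery_backend = consul
        cluster_formation.consul.host = localhost
        cluster_formation.consul.svc = rabbitmq
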
    dumbbell committed May 14, 2024 (commit 27ed4d2)
  6. rabbitmq_peer_discovery_consul: Handle locking inside list_nodes/0

    [Why]
    The new implementation of `rabbit_peer_discovery` acquires the lock only
    when a node needs to join another one. This is meant to disappear in the
    medium/long term anyway.
    
    Here, we need to lock the query to Consul to make sure that queries
    happen sequentially, not concurrently. This is a work in progress and we
    may not keep it either.
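
    Conceptually, the change wraps the Consul query with the backend's own
    lock, along these lines (a simplified sketch; function names and lock
    signatures are illustrative):

        %% Simplified sketch; not the actual plugin code.
        list_nodes() ->
            {ok, LockData} = lock(node()),
            try
                %% Query Consul while holding the lock, so queries from
                %% concurrently booting nodes are serialized.
                query_consul_for_nodes()
            after
                unlock(LockData)
            end.
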
    dumbbell committed May 14, 2024 (commit a56d82c)
  7. rabbit_peer_discovery: Register node before running discovery

    [Why]
    The two backends that use registration are Consul and etcd. Their
    discovery process relies on the registered nodes: they return whatever
    was previously registered.

    With the new checks and failsafes added to peer discovery in RabbitMQ
    3.13.0, the fact that registration happens after running discovery
    breaks the Consul and etcd backends.
    
    It used to work before because the first node would eventually time out
    waiting for a non-empty list of nodes from the backend and proceed as a
    standalone node, registering itself on the way. Following nodes would
    then discover that first node.
    
    Among the new checks, the node running discovery expects to find itself
    in the list of discovered nodes. Because it hasn't registered yet, it
    never will.
    
    [How]
    The solution is to register first, then run discovery. The node will
    then find at least itself among the discovered nodes.
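
    The reordering boils down to something like this (a simplified sketch of
    the subsystem, with illustrative function names and return shapes):

        %% Simplified sketch; not the actual rabbit_peer_discovery code.
        form_cluster(Backend) ->
            ok = register_node(Backend),              %% register first...
            {ok, Discovered} = Backend:list_nodes(),  %% ...so discovery can return us too
            maybe_join(node(), Discovered).
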
    dumbbell committed May 14, 2024 (commit cb9f0d8)
  8. rabbit_peer_discovery: Allow backends to select the node to join themselves
    
    [Why]
    Before, the backend would always return a list of nodes and the
    subsystem would select one based on their uptimes, the nodes they are
    already clustered with, and the readiness of their database.
    
    This works well in general but has some limitations. For instance with
    the Consul backend, the discoverability of nodes depends on when each
    one registered and in which order. Therefore, the node with the highest
    uptime might not be the first that registers. In this case, the one that
    registers first will only discover itself and boot as a standalone node.
    However, the one with the highest uptime that registered after will
    discover both nodes. It will then select itself as the node to join
    because it has the highest uptime. In the end both nodes form distinct
    clusters.
    
    Another example is the Kubernetes backend. The current solution works
    fine but could be optimized: the backend knows we always want to join
    the first node ("$node-0") regardless of the order in which nodes are
    started, because picking the first node alphabetically is fine.
    
    Therefore we want to let the backend select the node to join if it
    wants to.
    
    [How]
    The `list_nodes()` callback can now return the following term:
    
        {ok, {SelectedNode :: node(), NodeType}}
    
    If the subsystem sees this return value, it will consider that the
    returned node is the one to join. It will still query properties because
    we want to make sure the node's database is ready before joining it.
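
    An illustrative sketch of how both return shapes could be handled
    (helper names are made up; this is not the actual subsystem code):

        choose_node_to_join(Backend) ->
            case Backend:list_nodes() of
                {ok, {SelectedNode, NodeType}} when is_atom(SelectedNode) ->
                    %% New shape: the backend already selected the node.
                    {SelectedNode, NodeType};
                {ok, {Nodes, NodeType}} when is_list(Nodes) ->
                    %% Existing shape: the subsystem picks a node based on
                    %% uptime, cluster membership and database readiness.
                    {select_by_uptime(Nodes), NodeType}
            end.
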
    dumbbell committed May 14, 2024 (commit 3147ab7)
  9. rabbitmq_peer_discovery_consul: Select the node to join

    [Why]
    The default node selection of the peer discovery subsystem doesn't work
    well with Consul. The reason is that the selection is based on the
    nodes' uptime. However, the node with the highest uptime may not be the
    first to register in Consul.
    
    When this happens, the node that registered first will only discover
    itself and boot as a standalone node. Then, the node with the highest
    uptime will discover both of them, but will select itself as the node to
    join because of its uptime. In the end, we end up with two clusters
    instead of one.
    
    [How]
    We use the `CreateIndex` property in the Consul response to sort
    services. We then derive the name of the node to join from the service
    with the lowest `CreateIndex`, meaning it was the first to register.
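
    A rough sketch of that selection, assuming the Consul response has been
    decoded into Erlang maps (helper and key handling are illustrative):

        %% Rough sketch; not the actual plugin code.
        select_first_registered(Services) ->
            [First | _] = lists:sort(
                            fun(A, B) ->
                                    maps:get(<<"CreateIndex">>, A) =<
                                        maps:get(<<"CreateIndex">>, B)
                            end, Services),
            %% Derive the Erlang node name to join from the service that
            %% registered first (lowest CreateIndex).
            service_to_node_name(First).
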
    dumbbell committed May 14, 2024 (commit 0f054e1)