Feature flags need quality of life improvements #9677

Open
dumbbell opened this issue Oct 11, 2023 · 5 comments

dumbbell commented Oct 11, 2023

Why

Since the introduction of the first required feature flags, upgrades have become more painful for users who did not pay attention to the feature flag states. Things like:

  • A node refuses to start because it requires a feature flag that was still disabled before the upgrade.
  • It's difficult to downgrade when a package manager was used, because the previous package version is rarely still easily available.
  • It's difficult to add a node running a newer version to a cluster running an older one.

There is room for improvement in the current subsystem and I would like to follow three routes:

  1. better communicate from RabbitMQ itself that users have to enable feature flags
  2. make changes to the subsystem to handle common situations that are problematic today
  3. prevent foot-shooting when upgrading RabbitMQ using our packages (Debian and RPM)

How

Here is a list of improvements that I plan to make:

  • Log a warning during start, during stop, and perhaps on a regular basis, listing stable feature flags that are still disabled (a sketch of such a check follows this list)
  • Display a warning in the management UI to invite users to pay attention to the Feature flags section
  • Highlight stable feature flags that are still disabled in the management UI's Feature flags section
  • Improve the compatibility check in join_cluster to take into account the fact that the node will be reset. There should be no need to mess with $RABBITMQ_FEATURE_FLAGS because the joining node's feature flag states will be aligned with the remote cluster anyway.
    See the commit referenced below.
  • When a clustered node is upgraded to a version that requires some feature flags, it should be possible to enable them remotely in the cluster and then proceed with the start of the local node.
  • When a node is upgraded, users could configure RabbitMQ to automatically enable all stable feature flags as soon as possible. This could be an opt-in or opt-out behavior.
  • Improve Debian and RPM packages to verify that RabbitMQ can be upgraded. This requires that a preinst script can compare the list of feature flags from the installed version and the new one.
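
As an illustration of the first item above, here is a minimal Erlang sketch of what such a boot-time warning could look like. The module name, function name and data shape are assumptions made for this example; the real feature flags subsystem exposes its own API and data structures:

```erlang
%% Hypothetical sketch only: warn at boot about stable feature flags that
%% are still disabled. `Flags` is assumed to be a list of
%% {FlagName, Stability, State} tuples gathered from the feature flags
%% registry; RabbitMQ's actual representation differs.
-module(ff_boot_warning).
-export([warn_about_disabled_stable_flags/1]).

warn_about_disabled_stable_flags(Flags) ->
    %% Keep only flags that are marked stable but are not enabled yet.
    Disabled = [Name || {Name, stable, disabled} <- Flags],
    case Disabled of
        [] ->
            ok;
        _ ->
            logger:warning(
              "The following stable feature flags are still disabled: ~p. "
              "Enable them before upgrading to a version that requires them.",
              [Disabled]),
            {warned, Disabled}
    end.
```

For example, called with `[{flag_a, stable, enabled}, {flag_b, stable, disabled}]`, it would log a single warning naming `flag_b`; with no disabled stable flags it stays silent.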
dumbbell added this to the 3.13.0 milestone Oct 11, 2023
dumbbell self-assigned this Oct 11, 2023
dumbbell added a commit that referenced this issue Oct 11, 2023
... that considers the local node as if it was reset.

[Why]
When a node joins a cluster, we check its compatibility with the
cluster, reset the node, copy the feature flag states from the remote
cluster and add that node to the cluster.

However, the compatibility check is performed against the current
feature flag states, even though they are about to be reset. Therefore,
a node with an enabled feature flag that is unsupported by the cluster
will refuse to join. This is incorrect because, after the reset and the
states copy, it could have joined the cluster just fine.

[How]
We introduce a new variant of `check_node_compatibility/2` that takes
an argument to indicate whether the local node should be considered a
virgin node (i.e. as if it had just been reset).

This way, the joining node will always be able to join, regardless of
its initial feature flag states, as long as it doesn't require a
feature flag that is unsupported by the cluster.

This also removes the need to use the `$RABBITMQ_FEATURE_FLAGS`
environment variable to force a new node to leave stable feature flags
disabled so that it can join a cluster running an older version.

References #9677.
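
To illustrate the approach, here is a much-simplified Erlang sketch of the idea. The module, function signature and set representation are assumptions made for this example; the actual `check_node_compatibility/2` variant in `rabbit_feature_flags` is more involved:

```erlang
%% Hypothetical sketch: when the local node is considered virgin (it will
%% be reset before joining), its currently enabled feature flags are
%% ignored and only the flags its version requires must be supported by
%% the remote cluster. Flag sets are assumed to be ordsets.
-module(ff_compat_sketch).
-export([check_node_compatibility/4]).

check_node_compatibility(LocalEnabled, LocalRequired, RemoteSupported,
                         LocalNodeAsVirgin) ->
    %% Flags the remote cluster must know about for the join to succeed.
    MustBeSupported = case LocalNodeAsVirgin of
                          true  -> LocalRequired;
                          false -> ordsets:union(LocalEnabled, LocalRequired)
                      end,
    case ordsets:subtract(MustBeSupported, RemoteSupported) of
        []      -> ok;
        Missing -> {error, {incompatible_feature_flags, Missing}}
    end.
```

With the "virgin" argument set to `true`, a node that happens to have an extra feature flag enabled locally no longer fails the check, since that state will be discarded by the reset anyway; only the flags its version requires still have to be supported by the remote cluster.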

wast commented Oct 17, 2023

I'd like to add my 2 cents. There's an error when using 1 replica in a RabbitMQ Cluster Kubernetes operator:
"Feature flags: refuse to enable feature flags while clustered nodes are missing, stopped or unreachable"

@dumbbell
Member Author

I'd like to add my 2 cents. There's an error when using 1 replica in a RabbitMQ Cluster Kubernetes operator: "Feature flags: refuse to enable feature flags while clustered nodes are missing, stopped or unreachable"

All RabbitMQ nodes in a cluster need to be running before a feature flag can be enabled. Could you please expand on your use case?


wast commented Oct 17, 2023

"All nodes" in my scenario is 1 single node (as defined in yaml: replicas: 1), so why is it expecting more?


michaelklishin commented Oct 17, 2023

@wast please start a separate GitHub Discussion; we will not let well-defined issues be turned into open-ended discussions and troubleshooting sessions.

rabbitmq locked as off-topic and limited conversation to collaborators Oct 17, 2023

michaelklishin commented Oct 17, 2023

"All nodes" in my scenario is 1 single node (as defined in yaml: replicas: 1), so why is it expecting more?

Most likely because there were more nodes in the cluster at some point and existing nodes still have knowledge of their prior peers. The Cluster Operator does not support shrinking the cluster, at least not in all cases, IIRC. There is a certain workaround but in general, shrinking member count should not be considered a supported operation.

This is a topic for a separate discussion; this issue has a well-defined and specific scope.

dumbbell removed this from the 3.13.0 milestone Oct 20, 2023