Use persistent_term for the feature flag registry #10988

Open
wants to merge 4 commits into base: main
Conversation

the-mikedavis
Member

This is an experiment for using persistent_term to hold the feature flag registry. rabbit_ff_registry is currently implemented with a module that is dynamically regenerated with erl_syntax, compile:forms/2 and code reloading. The module approach has the advantage of maximizing read speed but it has a few drawbacks:

  • We need workarounds to convince Dialyzer of the real typespecs of rabbit_ff_registry functions.
  • We need to be careful to not deadlock around the code server.
  • Regenerating the registry can be relatively slow: around 35-40ms for me locally per regeneration. With this patch I see a total savings of ~800ms during a regular single-node boot (bazel run broker) and a total savings of ~200ms with rabbitmqctl enable_feature_flag khepri_db for example.

The question is whether persistent_term will be fast enough. Running the usual `one_fast` perf-test setup, I didn't see a difference between this branch and main, but we'll want to run this through the full performance test suite before considering it.

This branch also includes a concurrency fix for rabbit_feature_flags:inject_test_feature_flags/2 which can cause the feature_flags_SUITE:registry_concurrent_reloads/1 case to misbehave.
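For illustration, the persistent_term approach boils down to something like the following minimal sketch. The module name, key, and map layout here are hypothetical stand-ins, not the actual rabbit_ff_registry code:

```erlang
%% Minimal sketch of a persistent_term-backed flag registry.
%% Module name, key and map layout are illustrative assumptions.
-module(ff_registry_sketch).
-export([init/1, is_enabled/1]).

-define(KEY, {?MODULE, registry}).

init(States) when is_map(States) ->
    %% A single put swaps in the whole registry; readers never block.
    persistent_term:put(?KEY, States).

is_enabled(FlagName) ->
    %% Global, constant-time read; no compile/load step involved.
    States = persistent_term:get(?KEY, #{}),
    maps:get(FlagName, States, false).
```

Unlike the generated-module approach, updating this registry never touches the code server, which is what removes the deadlock and Dialyzer workarounds listed above.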

@lin72h

lin72h commented Apr 12, 2024

very creative!

@the-mikedavis
Member Author

I took some microbenchmarks of the two approaches - generating a module (albeit an Elixir module in the test) and reading from a :persistent_term - for the rabbit_ff_registry:is_enabled/1 use-case:

bench.exs.txt

(which can be run with `mv bench.exs.txt bench.exs` and then `elixir bench.exs`)

Benchee results...
Operating System: Linux
CPU Information: AMD Ryzen Threadripper 3970X 32-Core Processor
Number of Available Cores: 64
Available memory: 125.64 GB
Elixir 1.16.2
Erlang 26.2.4
JIT enabled: true

Benchmark suite executing with the following configuration:
warmup: 1 s
time: 10 s
memory time: 2 s
reduction time: 2 s
parallel: 1
inputs: none specified
Estimated total run time: 30 s

Benchmarking dynamic module ...
Benchmarking persistent_term ...
Calculating statistics...
Formatting results...

Name                      ips        average  deviation         median         99th %
dynamic module        30.31 M       32.99 ns  ±7714.91%          30 ns          41 ns
persistent_term       17.67 M       56.58 ns  ±4861.17%          51 ns          70 ns

Comparison: 
dynamic module        30.31 M
persistent_term       17.67 M - 1.71x slower +23.59 ns

Memory usage statistics:

Name               Memory usage
dynamic module              0 B
persistent_term             0 B - 1.00x memory usage +0 B

**All measurements for memory usage were the same**

Reduction count statistics:

Name            Reduction count
dynamic module                2
persistent_term               2 - 1.00x reduction count +0

**All measurements for reduction count were the same**

Reading from persistent_term is ~1.7x slower than calling is_enabled/1 on a generated module using the current set of feature flags. This is better than I expected: persistent_term is certainly slower, but not by a wide margin, since both operations are so cheap.

@the-mikedavis
Member Author

I also ran this branch against the performance tests vs. main: https://grafana.lionhead.rabbitmq.com/d/1OSIKB4Vk/rabbitmq-performance-tests?orgId=1&from=1713189778195&to=1713195192973&shareView=link

At a glance the two branches are hard to tell apart. The persistent_term may be slower but we might not read feature flags so often that it's noticeable. @mkuratczyk could I bug you to take a look and weigh in on these results when you get a chance?

@mkuratczyk
Contributor

Yeah, I can't think of code paths where this could slow down normal operations. Small differences between environments are unfortunately always there in these tests (there are multiple reasons for this, including intentional randomization in a few places, both in perf-test as well as on the broker side). I'm fairly sure the observed differences are due to such entropy, not this PR.

Previously the feature flag registry was a module which was regenerated
(i.e. with `erl_syntax` and `compile:forms/2`) whenever the set of known
feature flags or feature flag states changed. Compiling and loading
a module is fairly expensive though, costing around 35-40ms for me
locally, and the registry module is regenerated many times on boot. So
this change saves a fair amount of time on boot. A regular single-node
broker boots locally for me in around 4250ms via `bazel run broker`.
With this patch the boot time falls to around 3300ms.

This patch also improves the time it takes to enable individual feature
flags. `time rabbitmqctl enable_feature_flag khepri_db` for example
changes locally for me from around 1.2s to 1.0s with this change.

This prevents a potential race which was possible when two processes
attempted to initialize the feature flag registry at the same time. One
process attempting to initialize the registry could take a relatively
long time to read the available feature flags meanwhile another process
could read a more recently changed set of available feature flags.
Locking before reading and writing the flag prevents one update from
clobbering the other.

This race was not possible before the registry used a persistent_term
because we read the version of the registry module as the very first
step when initializing. When we created a new module we checked the
version again and retried the update if it changed in the meantime. That
retry mechanism has been removed since we no longer track the current
version of the registry. Instead we can use a local lock for the same
effect.
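The lock-instead-of-version-retry idea can be sketched roughly like this. All names (the module, the registry key, the stubbed read_feature_flags/0) are made up for illustration, and global:set_lock/2 stands in for whatever local lock the patch actually uses:

```erlang
%% Rough sketch: serialize registry initialization behind a lock so a
%% slow reader cannot clobber a newer registry written by a faster one.
%% All names here are illustrative, not the real RabbitMQ identifiers.
-module(ff_init_sketch).
-export([initialize_registry/0]).

-define(REGISTRY_KEY, {feature_flags, registry}).

%% Stand-in for the potentially slow read of available feature flags.
read_feature_flags() ->
    #{khepri_db => true}.

initialize_registry() ->
    LockId = {feature_flags_registry, self()},
    %% Only one initializer at a time may read-then-write the registry.
    true = global:set_lock(LockId, [node()]),
    try
        Flags = read_feature_flags(),
        persistent_term:put(?REGISTRY_KEY, Flags)
    after
        true = global:del_lock(LockId, [node()])
    end.
```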

`rabbit_feature_flags:inject_test_feature_flags/2` could discard flags
from the persistent term because the read and write to the persistent
term were not atomic. `feature_flags_SUITE:registry_concurrent_reloads/1`
spawns many processes to modify the feature flag registry concurrently
but also updates this persistent term concurrently. The processes would
race so that many would read the initial flags, add their flag and write
that state, discarding any flags that had been written in the meantime.

We can add a lock around changes to the persistent term to make the
changes atomic.
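A hedged sketch of that fix: do the read-modify-write of the test-flag persistent term under a lock, so concurrent injections merge instead of clobbering each other. The module name, the key, and the use of global:set_lock/2 are illustrative assumptions, not the actual test-helper code:

```erlang
%% Illustrative fix: without the lock, two concurrent callers can both
%% read the same old map, and one injection is silently lost.
%% The key below stands in for the real ?PT_TESTSUITE_ATTRS.
-module(ff_inject_sketch).
-export([inject_test_feature_flags/1]).

-define(PT_TESTSUITE_ATTRS, {feature_flags_SUITE, test_flags}).

inject_test_feature_flags(NewFlags) when is_map(NewFlags) ->
    LockId = {?PT_TESTSUITE_ATTRS, self()},
    true = global:set_lock(LockId, [node()]),
    try
        %% Read-merge-write is atomic with respect to other lockers.
        Existing = persistent_term:get(?PT_TESTSUITE_ATTRS, #{}),
        persistent_term:put(?PT_TESTSUITE_ATTRS,
                            maps:merge(Existing, NewFlags))
    after
        true = global:del_lock(LockId, [node()])
    end.
```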

Non-atomic updates to the persistent term caused unexpected behavior in
the `registry_concurrent_reloads/1` case previously. The
`PT_TESTSUITE_ATTRS` persistent_term only ended up containing a few of
the desired feature flags (for example only `ff_02` and `ff_06` along
with the base `ff_a` and `ff_b`). The case did not fail because the
registry continued to make progress towards that set of feature flags.
However the test case never reached the "all feature flags appeared"
state of the spammer and so the new assertion added at the end of the
case in this commit would fail.

These test cases and RPC calls were concerned with deadlocks that were
possible when modifying the code server process from multiple callers.
With the switch to persistent_term in the parent commits, these kinds of
deadlocks are no longer possible.

We should keep the RPC calls to `rabbit_ff_registry_wrapper:inventory/0`
though for mixed-version compatibility with nodes that use module
generation instead of `persistent_term` for their registry.
@the-mikedavis the-mikedavis force-pushed the md/feature-flags/persistent-term branch from f2b8c5d to b95c129 Compare April 23, 2024 15:18
@the-mikedavis the-mikedavis marked this pull request as ready for review April 23, 2024 15:18
-    receive {'DOWN', MRef, process, Spammer, _} -> ok end.
+    receive {'DOWN', MRef, process, Spammer, _} -> ok end,
+    %% All feature flags appeared.
+    ?assertEqual(FinalFFList, ?list_ff(all)).
@the-mikedavis
Member Author

This assertion fails on main but it's because of a bug in the test helpers (see 2e4e321) rather than a bug in the feature flag system

@michaelklishin
Member

michaelklishin commented Apr 23, 2024

On an M1 machine with 10 cores and a fast SSD, this shaves some 9% off of node-reported startup time 👏 on top of the improvements in #10989. This should nicely benefit all future CI runs where clusters are formed and nodes are restarted many times in total.
