Use persistent_term for the feature flag registry #10988

Open
wants to merge 4 commits into base: main
Conversation

the-mikedavis
Member

This is an experiment for using persistent_term to hold the feature flag registry. rabbit_ff_registry is currently implemented with a module that is dynamically regenerated with erl_syntax, compile:forms/2 and code reloading. The module approach has the advantage of maximizing read speed but it has a few drawbacks:

  • We need workarounds to convince Dialyzer of the real typespecs of rabbit_ff_registry functions.
  • We need to be careful to not deadlock around the code server.
  • Regenerating the registry can be relatively slow: around 35-40ms for me locally per regeneration. With this patch I see a total savings of ~800ms during a regular single-node boot (bazel run broker) and a total savings of ~200ms with rabbitmqctl enable_feature_flag khepri_db for example.

The question is whether persistent_term will be fast enough. Running the usual `one_fast` perf-test setup, I didn't see a difference between this branch and main, but we'll want to run this through the full performance test suite before considering it.

This branch also includes a concurrency fix for rabbit_feature_flags:inject_test_feature_flags/2 which can cause the feature_flags_SUITE:registry_concurrent_reloads/1 case to misbehave.
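For illustration, the persistent_term approach boils down to something like the following minimal sketch. The module name, key, and map layout here are hypothetical stand-ins, not the actual rabbit_ff_registry code:

```erlang
%% Minimal sketch of a persistent_term-backed flag registry.
%% Module name, key and map layout are illustrative assumptions.
-module(ff_registry_sketch).
-export([init/1, is_enabled/1]).

-define(KEY, {?MODULE, registry}).

init(States) when is_map(States) ->
    %% A single put swaps in the whole registry; readers never block.
    persistent_term:put(?KEY, States).

is_enabled(FlagName) ->
    %% Global, constant-time read; no compile/load step involved.
    States = persistent_term:get(?KEY, #{}),
    maps:get(FlagName, States, false).
```

Unlike the generated-module approach, updating this registry never touches the code server, which is what removes the deadlock and Dialyzer workarounds listed above.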

@lin72h

lin72h commented Apr 12, 2024

very creative!

@the-mikedavis
Member Author

I took some microbenchmarks of the two approaches - generating a module (albeit an Elixir module in the test) and reading from a :persistent_term - for the rabbit_ff_registry:is_enabled/1 use-case:

bench.exs.txt

(which can be run with `mv bench.exs.txt bench.exs` and then `elixir bench.exs`)

Benchee results...
Operating System: Linux
CPU Information: AMD Ryzen Threadripper 3970X 32-Core Processor
Number of Available Cores: 64
Available memory: 125.64 GB
Elixir 1.16.2
Erlang 26.2.4
JIT enabled: true

Benchmark suite executing with the following configuration:
warmup: 1 s
time: 10 s
memory time: 2 s
reduction time: 2 s
parallel: 1
inputs: none specified
Estimated total run time: 30 s

Benchmarking dynamic module ...
Benchmarking persistent_term ...
Calculating statistics...
Formatting results...

Name                      ips        average  deviation         median         99th %
dynamic module        30.31 M       32.99 ns  ±7714.91%          30 ns          41 ns
persistent_term       17.67 M       56.58 ns  ±4861.17%          51 ns          70 ns

Comparison: 
dynamic module        30.31 M
persistent_term       17.67 M - 1.71x slower +23.59 ns

Memory usage statistics:

Name               Memory usage
dynamic module              0 B
persistent_term             0 B - 1.00x memory usage +0 B

**All measurements for memory usage were the same**

Reduction count statistics:

Name            Reduction count
dynamic module                2
persistent_term               2 - 1.00x reduction count +0

**All measurements for reduction count were the same**

Reading from persistent_term is ~1.7x slower than calling is_enabled/1 on a generated module using the current set of feature flags. This is better than I expected: persistent_term is certainly slower, but not by a wide margin, since both operations are so cheap.

@the-mikedavis
Member Author

I also ran this branch against the performance tests vs. main: https://grafana.lionhead.rabbitmq.com/d/1OSIKB4Vk/rabbitmq-performance-tests?orgId=1&from=1713189778195&to=1713195192973&shareView=link

At a glance the two branches are hard to tell apart. The persistent_term may be slower but we might not read feature flags so often that it's noticeable. @mkuratczyk could I bug you to take a look and weigh in on these results when you get a chance?

@mkuratczyk
Contributor

Yeah, I can't think of code paths where this could slow down normal operations. Small differences between environments are unfortunately always there in these tests (there are multiple reasons for this, including intentional randomization in a few places, both in perf-test as well as on the broker side). I'm fairly sure the observed differences are due to such entropy, not this PR.

Previously the feature flag registry was a module which was regenerated
(i.e. with `erl_syntax` and `compile:forms/2`) whenever the set of known
feature flags or feature flag states changed. Compiling and loading
a module is fairly expensive though, costing around 35-40ms for me
locally, and the registry module is regenerated many times on boot. So
this change saves a fair amount of time on boot. A regular single-node
broker boots locally for me in around 4250ms via `bazel run broker`.
With this patch the boot time falls to around 3300ms.

This patch also improves the time it takes to enable individual feature
flags. `time rabbitmqctl enable_feature_flag khepri_db` for example
changes locally for me from around 1.2s to 1.0s with this change.

This prevents a potential race which was possible when two processes
attempted to initialize the feature flag registry at the same time. One
process attempting to initialize the registry could take a relatively
long time to read the available feature flags meanwhile another process
could read a more recently changed set of available feature flags.
Locking before reading and writing the flag prevents one update from
clobbering the other.

This race was not possible before the registry used a persistent_term
because we read the version of the registry module as the very first
step when initializing. When we created a new module we checked the
version again and retried the update if it changed in the meantime. That
retry mechanism has been removed since we no longer track the current
version of the registry. Instead we can use a local lock for the same
effect.
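The lock-instead-of-version-retry idea can be sketched roughly like this. All names (the module, the registry key, the stubbed read_feature_flags/0) are made up for illustration, and global:set_lock/2 stands in for whatever local lock the patch actually uses:

```erlang
%% Rough sketch: serialize registry initialization behind a lock so a
%% slow reader cannot clobber a newer registry written by a faster one.
%% All names here are illustrative, not the real RabbitMQ identifiers.
-module(ff_init_sketch).
-export([initialize_registry/0]).

-define(REGISTRY_KEY, {feature_flags, registry}).

%% Stand-in for the potentially slow read of available feature flags.
read_feature_flags() ->
    #{khepri_db => true}.

initialize_registry() ->
    LockId = {feature_flags_registry, self()},
    %% Only one initializer at a time may read-then-write the registry.
    true = global:set_lock(LockId, [node()]),
    try
        Flags = read_feature_flags(),
        persistent_term:put(?REGISTRY_KEY, Flags)
    after
        true = global:del_lock(LockId, [node()])
    end.
```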

`rabbit_feature_flags:inject_test_feature_flags/2` could discard flags
from the persistent term because the read and write to the persistent
term were not atomic. `feature_flags_SUITE:registry_concurrent_reloads/1`
spawns many processes to modify the feature flag registry concurrently
but also updates this persistent term concurrently. The processes would
race so that many would read the initial flags, add their flag and write
that state, discarding any flags that had been written in the meantime.

We can add a lock around changes to the persistent term to make the
changes atomic.
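A hedged sketch of that fix: do the read-modify-write of the test-flag persistent term under a lock, so concurrent injections merge instead of clobbering each other. The module name, the key, and the use of global:set_lock/2 are illustrative assumptions, not the actual test-helper code:

```erlang
%% Illustrative fix: without the lock, two concurrent callers can both
%% read the same old map, and one injection is silently lost.
%% The key below stands in for the real ?PT_TESTSUITE_ATTRS.
-module(ff_inject_sketch).
-export([inject_test_feature_flags/1]).

-define(PT_TESTSUITE_ATTRS, {feature_flags_SUITE, test_flags}).

inject_test_feature_flags(NewFlags) when is_map(NewFlags) ->
    LockId = {?PT_TESTSUITE_ATTRS, self()},
    true = global:set_lock(LockId, [node()]),
    try
        %% Read-merge-write is atomic with respect to other lockers.
        Existing = persistent_term:get(?PT_TESTSUITE_ATTRS, #{}),
        persistent_term:put(?PT_TESTSUITE_ATTRS,
                            maps:merge(Existing, NewFlags))
    after
        true = global:del_lock(LockId, [node()])
    end.
```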

Non-atomic updates to the persistent term caused unexpected behavior in
the `registry_concurrent_reloads/1` case previously. The
`PT_TESTSUITE_ATTRS` persistent_term only ended up containing a few of
the desired feature flags (for example only `ff_02` and `ff_06` along
with the base `ff_a` and `ff_b`). The case did not fail because the
registry continued to make progress towards that set of feature flags.
However the test case never reached the "all feature flags appeared"
state of the spammer and so the new assertion added at the end of the
case in this commit would fail.

These test cases and RPC calls were concerned with deadlocks that were
possible when modifying the code server process from multiple callers.
With the switch to persistent_term in the parent commits, these kinds of
deadlocks are no longer possible.

We should keep the RPC calls to `rabbit_ff_registry_wrapper:inventory/0`
though for mixed-version compatibility with nodes that use module
generation instead of `persistent_term` for their registry.
@the-mikedavis the-mikedavis force-pushed the md/feature-flags/persistent-term branch from f2b8c5d to b95c129 Compare April 23, 2024 15:18
@the-mikedavis the-mikedavis marked this pull request as ready for review April 23, 2024 15:18
-    receive {'DOWN', MRef, process, Spammer, _} -> ok end.
+    receive {'DOWN', MRef, process, Spammer, _} -> ok end,
+    %% All feature flags appeared.
+    ?assertEqual(FinalFFList, ?list_ff(all)).
@the-mikedavis
Member Author

This assertion fails on main but it's because of a bug in the test helpers (see 2e4e321) rather than a bug in the feature flag system

@michaelklishin
Member

michaelklishin commented Apr 23, 2024

On an M1 machine with 10 cores and a fast SSD, this shaves some 9% off of node-reported startup time 👏 on top of the improvements in #10989. This should nicely benefit all future CI runs where clusters are formed and nodes are restarted many times in total.
