
How to use a new handle-based API? #271

Open
loyd opened this issue Feb 12, 2022 · 11 comments
Labels
C-core Component: core functionality such as traits, etc. T-ergonomics Type: ergonomics.

Comments

@loyd

loyd commented Feb 12, 2022

I'm writing an actor system (elfo) and have run into inconvenient usage of the handle-based API that landed in the latest metrics release (0.18).

Some prelude before the questions.

All metrics exposed by actors are recorded with special labels (actor_group and, if configured, actor_key).

Let's consider a built-in metric, e.g. the elfo_handling_message_time_seconds histogram, which is measured for every incoming message:

elfo_handling_message_time_seconds_sum{actor_group="some_actor_group", message="SomeMessage", ...} ..
...
elfo_handling_message_time_seconds_sum{actor_group="another_group", message="AnotherMessage", ...} ..
...

This metric is built from a static name and a static label set (generated for each Message) on the fly, without registering any metrics beforehand. These labels are also used elsewhere in production code, without explicitly storing any handles.
Moreover, every metric can be produced twice (per actor group and per actor), if so configured: if a group has few actors, it's useful to collect metrics about each of them individually, and metrics are always collected for the whole group.

All emitted metrics are handled by a dedicated actor whose implementation can be replaced by the user. A default implementation (elfo-telemeter) is also provided; it isn't based on metrics-exporter-prometheus because I want more control: compacting atomic buckets by interval, providing a sliding window, graceful termination, handling zero counters in a special way, providing an API for other actors to get metrics in a backend-agnostic structure, and so on. By the way, I don't understand why such a specific implementation as metrics-exporter-prometheus is stored in this repo at all, but that's another question; I only need the frontend and the useful utils (thanks for them).

Now, using 0.17, the metric storage is implemented entirely in that dedicated actor.

In 0.18, with the handle-based API, I don't fully understand how to handle these cases. I need to register all metrics (both built-in metrics and application-specific production metrics) that can have different actor keys (which can be solved with the layer system), plus per-message metrics, which require efficient storage in different places.

And, more importantly, any metric can be recorded twice (per group and per actor), if so configured for a specific actor group.

Thus, I have two problems:
1. Macro calls can lead to exposing one or two metrics (depending on config) with different labels. I only see a way to add labels in layers.
2. Even in places where the key and labels are static, I cannot emit a metric directly and must register all of them, which leads to placing storages here and there (per message, per log level, per dumping class, and so on). Should I use Registry everywhere?

@tobz
Member

tobz commented Feb 12, 2022

Sorry that you've been having trouble using the new handle-based API in 0.18.

At a high-level, there's not much that changes from the perspective of the recorder:

  • all describe_* methods are only for storing the description and/or unit of a metric, but you could also use them to zero-initialize a metric if you really wanted to
  • all register_* methods are for creating an entry in your metrics storage for the given key/metric type, and returning a handle to the underlying storage

This means that while in 0.17, updating a metric typically meant using the Registry::op pattern, now you must provide a handle to actual metric storage so it can be updated in accordance with what the metric type is.

Let's pretend that the Registry is just HashMap<(MetricKind, Key), AtomicU64>. If you wanted to update a metric in 0.17, you might pass a closure that gets a reference to the atomic, i.e. |atomic| atomic.fetch_add(1, ...), in order to increment.

In 0.18, Registry looks more like HashMap<(MetricKind, Key), Arc<AtomicU64>>, and when a metric is registered, you are given back a clone of the Arc<AtomicU64> wrapped in the appropriate handle type (Counter/Gauge/Histogram), which is then used directly or by the emission macros (increment_counter!, etc).
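The difference can be sketched with a toy registry using only the standard library; String keys and bare atomics stand in for the real Key and handle types here, so this is an illustration of the shape of the two APIs, not the actual metrics-util implementation:

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

// Toy registry: counters only, String keys instead of metrics::Key.
struct Registry {
    counters: HashMap<String, Arc<AtomicU64>>,
}

impl Registry {
    fn new() -> Self {
        Registry { counters: HashMap::new() }
    }

    // 0.17 style: the registry runs a caller-supplied closure against the
    // stored atomic; every update goes through the map.
    fn op(&mut self, key: &str, f: impl FnOnce(&AtomicU64)) {
        let counter = self
            .counters
            .entry(key.to_string())
            .or_insert_with(|| Arc::new(AtomicU64::new(0)));
        f(counter);
    }

    // 0.18 style: registration hands back a clone of the Arc (the "handle"),
    // so later updates can skip the map entirely.
    fn register_counter(&mut self, key: &str) -> Arc<AtomicU64> {
        Arc::clone(
            self.counters
                .entry(key.to_string())
                .or_insert_with(|| Arc::new(AtomicU64::new(0))),
        )
    }
}

fn main() {
    let mut registry = Registry::new();

    // 0.17-style update: goes through the registry each time.
    registry.op("requests", |c| {
        c.fetch_add(1, Ordering::Relaxed);
    });

    // 0.18-style update: look up once, then increment through the handle.
    let handle = registry.register_counter("requests");
    handle.fetch_add(1, Ordering::Relaxed);

    assert_eq!(handle.load(Ordering::Relaxed), 2);
}
```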

If you have a specific example of trying to update one area of your code from metrics 0.17 to 0.18 that you can share, I might be able to provide more help. Looking at the code you linked, it's just not clear to me why it couldn't be adapted to 0.18.

@tobz tobz added C-core Component: core functionality such as traits, etc. T-ergonomics Type: ergonomics. labels Feb 12, 2022
@loyd
Author

loyd commented Feb 13, 2022

Thanks for the fast response!

Let's focus on the problem of exposing several metrics from one place.
For every counter!("some_metric") call I want to produce from 0 to 2 different metrics, depending on the configuration for a specific actor group:

some_metric{actor_group="a", ...}
some_metric{actor_group="a", actor_key="1", ...}

Now, this is implemented by reading the rarely changing atomic configuration and calling Registry::op several times: click. It works because the previous version of Registry allowed me to use a custom key.

Let's try to use the new API:

struct Storage(Registry);

impl Recorder for Storage {
    fn register_counter(&self, key: &Key) -> Counter {
        // I cannot just get one metric by calling
        // `get_or_create_counter`, because:
        // 1. Additional labels need to be added, which means
        //    recreating the `Key`, and that's expensive.
        // 2. If configured, I need to create two counters instead of one,
        //    and this can change at runtime.
        // 3. If the configuration changes at runtime, registration has to
        //    happen again, which is impossible for users who don't use macros.
    }
}

(3) could be resolved with some analogue of https://docs.rs/tracing-core/latest/tracing_core/callsite/fn.rebuild_interest_cache.html

Ok, let's create a special CounterFn instance that does all the work at call time instead of registration time:

struct MyCounter {
    storage: Arc<Storage>,
    key: Key,
}

impl CounterFn for MyCounter {
    fn increment(&self, value: u64) {
        if telemeter_per_actor() {
            self.storage.counter_per_actor(&self.key, value);
        }
        if telemeter_per_group() {
            self.storage.counter_per_group(&self.key, value);
        }
    }
}

The signature of register_counter forces the implementation to return a Counter, which means cloning Arc<Storage>. Registration occurs on every macro call, which means a lot of false sharing on the reference counter inside the Arc. Ok, I can use TLS and keep a separate Arc in each thread to avoid false sharing. Inconvenient, but possible. But in that case, it's impossible to use several storages (for tests or other purposes), and it's almost equivalent to having a global static Storage.

Also, the Key has to be cloned as well, which can also be expensive in the case of Owned labels/name inside the Key.

Another, related problem is reusing Registry with a custom Key. Registry's methods now accept metrics::Key instead of a custom key implementation. So, instead of just passing some kind of struct { actor_group: ..., key: Key }, I need to create a new Key by merging labels. The implementation of Registry is nice; I want to reuse it.

@tobz
Member

tobz commented Feb 13, 2022

Answering out of order:

Another, related problem is reusing Registry with a custom Key. Registry's methods now accept metrics::Key instead of a custom key implementation. So, instead of just passing some kind of struct { actor_group: ..., key: Key }, I need to create a new Key by merging labels. The implementation of Registry is nice; I want to reuse it.

That's a good point about Registry now requiring Key instead of a generic K. I had done some refactoring to memoize the hash of a key to speed up lookups, and I constrained everything to Key to make that easier to enforce. I opened #272 to explore a better design that lets us have generic keys while still supporting memoized hashes.

For every counter!("some_metric") call I want to produce from 0 to 2 different metrics, depending on the configuration for a specific actor group:

Thanks for this example. It's very clear now what you're trying to achieve and why the new handle-based design makes it harder/inefficient.

One thought in my mind: do actors ever change their labels once active? For example, do actor_group or actor_key change once an actor is running?

The reason I ask this question is that I wonder if you might potentially benefit from the new handle-based design by changing how you create metrics for a given actor. If we assume that the metric key/labels are fixed from when an actor starts until when it completes/dies, then it might be worth trying to actually store those metrics as fields on a struct rather than using the macros.

By doing so, you could avoid all the current logic that figures out if a metric needs to be updated or not. For example:

// All metrics for an actor.  This could also be implemented multiple times if
// you wanted to break down the metric groups so that only certain metrics are
// declared near where they are used, closer to the model of only using the macros.
struct ActorMetrics {
    messages_received: Counter,
    ...
}

impl ActorMetrics {
    pub fn create() -> Self {
        Self {
            messages_received: register_counter!("elfo_actor_messages_received"),
        }
    }
}

// Now you would make these changes in your `Recorder` to support returning a handle
// that can affect multiple metrics at the same time:
impl Recorder {
    fn with_params<R>(&self, f: impl Fn(&Storage, &Scope, bool) -> R) -> Option<Vec<R>> {
        scope::try_with(|scope| {
            let mut result = Vec::new();
            let perm = scope.permissions();
            if perm.is_telemetry_per_actor_group_enabled() {
                result.push(f(&self.storage, scope, false));
            }
            if perm.is_telemetry_per_actor_key_enabled() && !scope.meta().key.is_empty() {
                result.push(f(&self.storage, scope, true));
            }
            result
        })
    }
}

impl metrics::Recorder for Recorder {
    fn register_counter(&self, key: &Key) -> Counter {
        // `Storage` still controls and owns all data for all registered metrics, but now
        // you return individual `Counter` handles for each metric, and then wrap them in
        // a newtype that allows for returning a single `Counter` here in `register_counter`.
        let maybe_counters = self.with_params(|storage, scope, with_actor_key| {
            storage.register_counter(scope, key, with_actor_key)
        });
        maybe_counters.map(CounterGroup::wrap).unwrap_or_else(|| Counter::noop())
    }
    
    ...
}

struct CounterGroup {
    // Possible optimization is that you just return a direct `Arc<Atomic...>` clone
    // from `Storage::register_*` methods, since `Recorder::with_params` is generic
    // over return type, which would be slightly more efficient since there would be
    // no optional check like there is in `Counter` itself: the `Vec` having items, or
    // not, is what now dictates whether we update any metrics.
    counters: Vec<Counter>,
}

impl CounterGroup {
    fn wrap(counters: Vec<Counter>) -> Counter {
        let grouped = CounterGroup { counters };
        Counter::from_arc(Arc::new(grouped))
    }
}

impl CounterFn for CounterGroup {
    fn increment(&self, value: u64) {
        for counter in &self.counters {
            counter.increment(value);
        }
    }
    
    ...
}

While this code is not perfect -- there might be a restriction I'm not thinking of -- I think it could represent a workable design for your use case. Some benefits:

  • direct access to the atomic storage: you don't even need to query the registry hashmap anymore if metric name/labels are static and don't change
  • still technically supports using normal increment_counter!, etc, macros (but with the performance hit that you correctly mentioned above with regards to Key cloning)

To your point about runtime changes, that is definitely trickier to solve... but it might be solvable in a generic way by using something like a configuration epoch? For example:

  • a new global atomic is created and incremented any time a runtime configuration change is made (it doesn't have to be global; you could pass an Arc handle of it to interested actors, etc)
  • CounterGroup, etc, gets a clone of the handle to this epoch, stores its own "last epoch seen", and stores the original key used to register the metric
  • every operation in CounterFn checks whether the global epoch differs from the local epoch:
    • if they are the same: run the operation on the existing counters
    • if they differ: store the new global epoch, re-register the counters, and then run the operation on the new counters
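A minimal std-only sketch of that epoch idea; `re_register` is a hypothetical stand-in for re-registering against the real Storage, and bare atomics stand in for Counter handles. Note that `increment` takes `&mut self`, which is exactly the mutability restriction discussed just below:

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

// Bumped on every runtime configuration change.
struct ConfigEpoch(AtomicU64);

// Hypothetical re-registration: a real implementation would consult the
// stored Key and the current per-group/per-actor configuration. Here we
// pretend the config enables one counter on even epochs, two on odd ones.
fn re_register(epoch: u64) -> Vec<Arc<AtomicU64>> {
    let n = if epoch % 2 == 0 { 1 } else { 2 };
    (0..n).map(|_| Arc::new(AtomicU64::new(0))).collect()
}

struct CounterGroup {
    epoch: Arc<ConfigEpoch>,
    registered_for: u64,           // epoch the current handles were built under
    counters: Vec<Arc<AtomicU64>>, // stand-ins for real Counter handles
}

impl CounterGroup {
    fn increment(&mut self, value: u64) {
        let global = self.epoch.0.load(Ordering::Acquire);
        if global != self.registered_for {
            // Config changed: rebuild the handles, remember the new epoch.
            self.counters = re_register(global);
            self.registered_for = global;
        }
        for c in &self.counters {
            c.fetch_add(value, Ordering::Relaxed);
        }
    }
}

fn main() {
    let epoch = Arc::new(ConfigEpoch(AtomicU64::new(0)));
    let mut group = CounterGroup {
        epoch: Arc::clone(&epoch),
        registered_for: 0,
        counters: re_register(0),
    };

    group.increment(1);
    assert_eq!(group.counters.len(), 1);

    // Simulate a config change: the next increment re-registers.
    epoch.0.fetch_add(1, Ordering::Release);
    group.increment(1);
    assert_eq!(group.counters.len(), 2);
}
```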

As I was typing that, I had the thought that it wouldn't quite work as-is, because you would need mutability to replace the existing counters in the proposed CounterGroup. However, if you used the proposed ActorMetrics approach, you would own that struct and thus could implement such functionality at that level instead.

An atomic load for every metric emit isn't great, but I believe my proposed design would still be faster overall, as there would be less centralized access to the Registry, where the internal hashmap is more of a bottleneck than an atomic load.

@loyd
Author

loyd commented Feb 14, 2022

The main disadvantage of the suggested approach is that the metric set is unbounded: all metrics produced by user-written code would have to use the same approach, not only the ones provided by elfo. Moreover, the code producing metrics may not even know that it is called inside an actor system. metrics is becoming the de-facto default ecosystem for collecting metrics in Rust, so it's only a matter of time before some HTTP/WS/DB/etc client includes custom metrics using metrics. It's undesirable to force all of them to cache metrics instead of just using increment_counter!("ws_messages_received_total") in place, which becomes expensive immediately because of the expensive registration step. Nobody expects that this line can be expensive. For instance, tracing caches all registrations after calling Subscriber::register_callsite and provides a method to rebuild that cache. This doesn't solve my problem, but I want to show that it's unexpected to call register() on hot paths anyway. Also, in this case, the previous implementation is expressible by storing the Key in CounterGroup.

Maybe Recorder should have methods increment_counter/etc with a default implementation of self.register_counter().increment(). Although this could prevent further optimizations of the macros (like storing handles in a static, as mentioned above).
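A rough sketch of what that proposal could look like; Counter and Recorder here are simplified std-only stand-ins for the real metrics types, not the actual trait definition:

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

// Simplified stand-in for metrics::Counter.
#[derive(Clone)]
struct Counter(Arc<AtomicU64>);

impl Counter {
    fn increment(&self, value: u64) {
        self.0.fetch_add(value, Ordering::Relaxed);
    }
}

// Hypothetical shape of the proposal: the macros would call
// increment_counter, whose default body keeps today's behavior, while a
// recorder like elfo's could override it to fan out to several metrics
// (or skip the registration round-trip entirely).
trait Recorder {
    fn register_counter(&self, key: &str) -> Counter;

    fn increment_counter(&self, key: &str, value: u64) {
        self.register_counter(key).increment(value);
    }
}

// A trivial recorder that always hands out the same counter.
struct SingleCounter(Counter);

impl Recorder for SingleCounter {
    fn register_counter(&self, _key: &str) -> Counter {
        self.0.clone()
    }
}

fn main() {
    let backing = Arc::new(AtomicU64::new(0));
    let recorder = SingleCounter(Counter(Arc::clone(&backing)));

    // Goes through the default method: register, then increment.
    recorder.increment_counter("ws_messages_received_total", 3);
    assert_eq!(backing.load(Ordering::Relaxed), 3);
}
```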

you would own that and thus could implement such functionality at that level instead.

As I said above, metrics produced by users should behave the same way.

I had the thought that it already wouldn't 100% work because you would need mutability to replace the existing counters in the proposed

Yep, it's problematic. That's the reason arc-swap exists, and even it is 20x slower than plain atomic accesses (but still 20x faster than parking_lot; I need to check seqlock as well).

direct access to the atomic storage: you don't even need to query the registry hashmap anymore if metric name/labels are static and don't change
where the internal hashmap is more of a bottleneck than an atomic load.

As I see it, it's a sharded map internally. Of course, even with a precalculated hash, it's still slower than atomic loads. I'm sure that the handle API + ArcSwap in CounterGroup would work better. In general, I welcome the change; my concerns are about the macros and the unexpected overhead for users due to re-registering on each call.

@tobz
Member

tobz commented Feb 14, 2022

The main disadvantage of the suggested approach is that the metric set is unbounded: all metrics produced by user-written code would have to use the same approach, not only the ones provided by elfo.

There's no reason you can't limit it to metrics that use a prefix of elfo_, etc.

Moreover, the code producing metrics may not even know that it is called inside an actor system. metrics is becoming the de-facto default ecosystem for collecting metrics in Rust, so it's only a matter of time before some HTTP/WS/DB/etc client includes custom metrics using metrics.

This is an explicit goal of metrics: just like tracing, an application can get metrics from its dependencies for "free", just by installing a recorder.

It's undesirable to force all of them to cache metrics instead of just using increment_counter!("ws_messages_received_total") in place, which becomes expensive immediately because of the expensive registration step. Nobody expects that this line can be expensive. ... This doesn't solve my problem, but I want to show that it's unexpected to call register() on hot paths anyway.

I think an important distinction is that "registration" in 0.18 is no different than actually "doing" the specific metric operation in 0.17. You never know if a metric exists yet, so the registry always has to look up the key, see if it exists, clone the key and create the storage if it didn't already exist, etc.

The primary difference in 0.18 is that we've separated looking up/registering a metric and actually performing the operation that updates it.

Yep, it's problematic. That's the reason arc-swap exists, and even it is 20x slower than plain atomic accesses (but still 20x faster than parking_lot; I need to check seqlock as well).

This is a good point, and it actually made me realize that Counter/Gauge/Histogram probably should have had mutable interfaces, since that would have enabled more interesting designs, such as aggregating internally and then emitting a single time at Drop, etc.

In general, I welcome the change, my concerns are about macros and unexpected overhead for users because of re-registering on each call.

I had already mentioned above the subtlety around the changed naming, but I want to dig in a little more on the actual overhead/performance.

So, in general, the "registering" step is not that dissimilar to the code that existed before. For a call of counter!:

In 0.17:

  • construct the key (possibly const, possibly at runtime)
  • call Recorder::increment_counter, which would:
    • look up the key in the registry (get the shard, use the entry API to find whether the key already exists, create it if not)
    • run the closure on the metric handle (increment counter, record histo, etc)

In 0.18:

  • construct the key (possibly const, possibly at runtime)
  • call Recorder::register_counter, which would:
    • look up the key in the registry (get the shard, use the entry API to find whether the key already exists, create it if not)
    • clone the handle and return it (so, cloning an Arc<AtomicU64>, etc)
  • call the appropriate method on the metric handle (increment counter, record histo, etc)

In both cases, having to look up the key in the registry's sharded map is the most expensive part by far. It outweighs the cost of cloning the Arc, although cloning the Arc does involve atomics, so in some cases, there's the potential for the compiler to make sub-optimal choices that reduce instruction-level parallelism, etc.

The explicit trade-off I made in 0.18 was that the performance for "non-optimal" usage -- calling the macros only, maybe with dynamic labels, etc -- becoming slightly worse was worth it for the improvement in the scenarios where applications can pre-register metrics.

In practical terms, the overhead is still very small. If we look at metrics-benchmark, a very stupid program that emits metrics as fast as possible (a counter, a gauge, then a histogram; all static keys), we can see that for a single producer, both 0.17 and 0.18 can easily push 30-35M calls/second on a modern CPU. If we increase the parallelism of the "producer" threads to 16, 0.17 becomes slightly faster than 0.18, managing to push ~14.5M calls/second vs ~12M calls/second, respectively. While 12 vs 14.5 is an 18% difference... you still have to fully saturate 16 producer threads to see that much of a difference.

If you're pushing that many metrics per second, or even 10% of that amount, I would argue that you likely understand the performance well enough to tweak how/when you emit metrics to avoid incurring undesired overhead.

The flipside of all of this is what 0.18 unlocks: the ability to hyperoptimize metrics overhead by using handles directly. While the ~30-35M calls/second for a single producer was the best you could do in 0.17, even using entirely static keys, using metric handles directly in 0.18 blows it away with ~110M calls/second. Even with a producer parallelism of 16, it's still at ~35M calls/second, over 2x what can be done when only using the macros.
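The handle-direct pattern amounts to paying the lookup once and then touching only the atomic afterwards. A std-only sketch of the two paths, with a bare HashMap and Arc<AtomicU64> standing in for the real registry and Counter handle (the benchmark numbers above come from metrics-benchmark, not from this toy):

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

// Macro-style path: every update pays a hashmap lookup (plus key cloning
// in the real crate, where keys carry labels).
fn increment_via_registry(registry: &mut HashMap<String, Arc<AtomicU64>>, key: &str) {
    registry
        .entry(key.to_string())
        .or_insert_with(|| Arc::new(AtomicU64::new(0)))
        .fetch_add(1, Ordering::Relaxed);
}

fn main() {
    let mut registry = HashMap::new();

    for _ in 0..1000 {
        increment_via_registry(&mut registry, "requests_total");
    }

    // Handle-direct path: pay the lookup once, then only an atomic add
    // per update.
    let handle = Arc::clone(registry.get("requests_total").unwrap());
    for _ in 0..1000 {
        handle.fetch_add(1, Ordering::Relaxed);
    }

    assert_eq!(handle.load(Ordering::Relaxed), 2000);
}
```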

@loyd
Author

loyd commented Feb 15, 2022

In 0.18

You're considering the usual usage of metrics, not the case I described. In my case, CounterGroup would have to include more information for re-registering if the config's epoch changes, including a clone of the Key and, maybe, another Arc<Storage>.

If registration only happened once (like in tracing), it would be possible to clone the Key and the additional information needed only once.

However, implementing #273 could also help in this case.

probably should have had mutable interfaces since it would have enabled more interesting designs such as aggregating internally and then emitting a single time at Drop, etc.

A good direction; it could solve my problem. At the very least, Registry should be generic over the handle type; right now it uses a private Primitives. I'm glad that you agree about the need to generalize Registry.

in some cases, there's the potential for the compiler to make sub-optimal choices that reduce instruction-level parallelism

This is a much smaller problem than false sharing.

a very stupid program that emits metrics as fast as possible

Without any complex keys, though. In practice, most metrics contain labels, often dynamic ones. And in 0.18 I'm forced to clone them on every call.

The flipside of all of this is what 0.18 unlocks

I don't argue with you about the pros of a separate registration step. I'm just trying to show that the current API is not complete enough. We either need to somehow cache a handle per macro call, or provide Recorder::increment_counter (with a default register_counter().increment() implementation) and make it possible to override the behavior for cases like mine.

@tobz
Member

tobz commented Feb 15, 2022

Just to take a step back...

My assertion here is that between 0.17 and 0.18, there wasn't a meaningful change to performance/efficiency for callers who just used counter!("my_metric_name", 1, "label" => "value", ...):

  • The construction of the key itself did not change: if it can be constructed in a const fashion, we'll store it as a local static, otherwise, we construct it on the fly.
  • The registry still has to look at the HashMap, sharded or not.
  • We do now clone the Arc of the storage for the specific metric, and return it to the caller vs updating it in the call to the registry.

Assuming I make the changes to Registry to allow keys to be user-specified again, I believe that lets you switch to 0.18 with no meaningful change in performance. Do you agree with that? If not, what do you think is missing from my understanding or proposal?

@tobz
Member

tobz commented Feb 15, 2022

Also, separately:

A good direction, it can solve my problem. At least, Registry should be generic over the handles type, now it uses a private Primitives.

You're right, and this was an oversight on my part. I opened #274 to fix that.

@tobz
Member

tobz commented Mar 11, 2022

I just cut a new release -- metrics-util@v0.12.0 -- which covers a few changes pursuant to this issue:

  • Primitives is now public (renamed to Storage)
  • Registry is now generic over both the key and the storage

I think these two things addressed the bulk of your original issues using the new handle-based design, so I'd be interested in hearing any feedback you have if you have a chance to try them out.

@loyd
Author

loyd commented Jun 14, 2022

I'm going to update metrics in elfo next week; it now seems to cover most of my questions, maybe all of them. I'll be back with feedback. Thanks a lot.

@tobz
Member

tobz commented Jun 14, 2022

Sounds good. 👍🏻

I'm also working on refactoring metrics::Cow to additionally allow for wrapping an Arc<T> (in addition to the normal support for &'a T and T) and exploring a design for inlining strings directly, per your comment in #273.
