Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consider the interplay between metrics expiration and handles. #314

Open
tobz opened this issue Jul 20, 2022 · 4 comments
Open

Consider the interplay between metrics expiration and handles. #314

tobz opened this issue Jul 20, 2022 · 4 comments
Labels
C-core Component: core functionality such as traits, etc. C-util Component: utility classes and helpers. E-complex Effort: complex. T-ergonomics Type: ergonomics.

Comments

@tobz
Copy link
Member

tobz commented Jul 20, 2022

Chatting with a user about the idle timeout support in metrics-exporter-prometheus made me realize that there's one particular -- maybe fatal? -- flaw with the handle-based design: orphaned metrics.

Consider a normal invocation of counter!("key", 1):

  • we acquire a handle to a counter identified by "key"
  • we update that handle

As is, with the handle-based design, that handle is wrapping an Arc<T>, but we query the handle right before modifying it. Either the metric already exists, or it doesn't -- either because it's brand-new or because it was previously expired and removed from the registry -- but we always get a valid handle as of the time of the call to the registry.

Now, consider when we take a handle to a metric via register_counter!(...). The steps to do so are nearly identical to before, specifically in that we query the registry, and the handle is either found if it exists, or is created if it doesn't... but then we never query for it again. If the metric is removed from the registry, the underlying storage may live on (Arc<T>) but it is disconnected from the registry where it originated.

@tobz tobz added C-util Component: utility classes and helpers. C-core Component: core functionality such as traits, etc. E-complex Effort: complex. T-ergonomics Type: ergonomics. labels Jul 20, 2022
@tobz
Copy link
Member Author

tobz commented Jul 20, 2022

Possible answer/solution: warn people about it.

It could be sufficient to warn people that when using metrics handles, they cannot or should not drop idle metrics. For use cases where the metric cardinality doesn't grow over time, and using the "idle timeout" feature of metrics-exporter-prometheus, or equivalent behavior, isn't required... they could just turn it off.

This is only sufficient in those use cases, though.

@tobz
Copy link
Member Author

tobz commented Jul 20, 2022

Possible answer/solution: provide a way to register handles in a strong or weak fashion.

As a play on the strong versus weak pointers of Arc<T>, we could potentially provide a way to indicate that a metric handle should be strongly/weakly stored depending on what is called for. If we're grabbing a handle to support a transient call -- i.e. counter!(...) -- then it could be a weak reference. Otherwise, we'd register it as a strong reference.

The registry would expose this, allowing exporters to understand what metrics they should or shouldn't remove.

This is still susceptible to problems with unbounded cardinality, and requires knowing upfront that your strong handles -- register_counter!(...) -- won't ever be used with labels that have unbounded cardinality, since you're implicitly excluding them from being expired/removed by exporters.

@tobz
Copy link
Member Author

tobz commented Jul 20, 2022

Possible answer/solution: allow handles to re-register.

Given that the recorder is exposed such that there's always a &'static reference to it, one approach could be to use the strong/weak reference capabilities of Arc<T> to allow handles to weakly map back to their originating recorder, as well as store the key used to create them.

When trying to update a handle, we would attempt to upgrade the weak reference in order to actually access the storage. If that failed, we would re-register ourselves with the recorder, acquiring a new weak reference.

Biggest downsides here that I can think of are:

  • we always have to upgrade the weak reference before updating the metric (atomic load + cmpxchg)
  • we have to store a Key and &'static dyn Recorder in the handles now, which blows up their size to some extent (I guess only by like... 48-64 bytes?)

Other thoughts:

  • how do we get the &'static self reference inside the recorder? do we need to change the trait?

@tobz
Copy link
Member Author

tobz commented Feb 3, 2024

It's been an incredibly long time such I've touched this, but recently I've poked around at ways to try and solve this.

What hasn't worked (so far)

First up, I tried to approach this the most complex angle, which is essentially writing our own Arc<T> that supports our requirements here for the lowest possible overhead way to weakly access the storage while supporting the ability to atomically update it. This approach more or less failed.

While it's certainly possible to do this in some way, the biggest issue is that we would need to have a deeply specialized implementation here, and we would likely have to do some very unergonomic things, like exposing virtual tables for each metric type, and writing a lot of specialized code to handle both the happy path of loading the smart pointer, as well as correctly replacing it while under concurrent accesses.

Most of this is hinged on the bit about virtual tables: we can't carry generic types in the metric handles, so we need dyn CounterFn types internally... which puts us into fat pointer territory. Fat pointers are essentially two machine words (data pointer + vtable pointer) so they require something like AtomicU128... which is possible on many modern CPUs, but can be expensive. There's also not a trivial way in safe Rust to split fat pointers into their constituent parts, so this is still tricky to do even if we targeted only very modern platforms and decided to say "screw it" and use nightly.

What could work

What would likely work just fine is double indirection in the form of something like Box<Weak<dyn T>>, giving us a thin pointer to a fat pointer-backed weak reference.

I've been avoiding this because it just feels icky and slow, but I'm probably prematurely optimizing. Practically, it handles the two biggest constraints: atomic access to the underlying data in a way that allows for expiration (strong vs weak references) and the ability to atomically update the storage pointer held by metric handles when they need to re-register since we can handle a thin pointer atomically without issue.

The biggest (relatively speaking) downside to this approach is that it doesn't have a trivial way to mark the storage pointer as "closed". Essentially, we want the registry to be able to mark the storage as closed when it decides that a metric is idle, or if it decides to forcefully remove the metric, so that a constant stream of activity can't keep it active in a way that prevents dropping the Arc<T> and consuming its value.

That scenario is only relevant during expiration (where, you know, if it's gotten to that point, the metric obviously isn't going to be highly contended) and forceful removal. Forceful removal isn't currently a thing, but has been requested.

Despite these potential issues, it's likely the simplest approach to take in the meantime because it actively solves a real problem and in a way that doesn't require us to compromise in terms of MSRV or additional dark magic atomics/unsafe code, etc.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-core Component: core functionality such as traits, etc. C-util Component: utility classes and helpers. E-complex Effort: complex. T-ergonomics Type: ergonomics.
Projects
None yet
Development

No branches or pull requests

1 participant