Add Initial Metrics #2499

Draft: eerhardt wants to merge 5 commits into main

Conversation

@eerhardt (Contributor)

(Note: leaving this in Draft until #2497 is merged, as this needs a net6.0 target. You can skip the first commit in the review for now.)

This adds the initial implementation of using System.Diagnostics.Metrics in the StackExchange.Redis client. For this first round, the following metrics are tracked:

  • redis-operation-count
  • redis-completed-asynchronously
  • redis-completed-synchronously
  • redis-failed-asynchronously
  • redis-failed-synchronously
  • redis-non-preferred-endpoint-count

Other metrics in ConnectionCounters need to use UpDownCounter, which is only available in net7.0.
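For context, a minimal sketch of what an UpDownCounter-based metric could look like on net7.0+ (the instrument name here is illustrative and not part of this PR):

using System.Diagnostics.Metrics;

var meter = new Meter("StackExchange.Redis");

// Unlike Counter<T>, UpDownCounter<T> accepts negative deltas, which suits
// values that can fall as well as rise (e.g. currently open connections).
var openConnections = meter.CreateUpDownCounter<long>(
    "redis-open-connections",
    description: "Number of currently open connections.");

openConnections.Add(1);   // on connect
openConnections.Add(-1);  // on disconnect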

cc @NickCraver @mgravell @JamesNK @noahfalk

@NickCraver (Collaborator)

Starting early global discussions here:

  • What's the metric convention here on names? Based on other top-level conventions for process/rpc/etc., I'd naively expect redis.<name> (rather than redis-<name>); am I way off there?
  • For what's effectively a singleton top to bottom, can we simplify to a static meter and properties, eliminating a lot of code and .Instance everywhere? Or is this a stepping stone? (I'd imagine that since the receiver is pluggable, e.g. an in-memory OTel collector, this doesn't need to be.) If not, I'd like to consolidate this down to just a bunch of static props (rather than a constructor) and methods. Depending on the area, nested classes would look the same (putting an idea below this list).
  • Dimensions are a really good point... we're going to need to think about this heavily. Some are per-endpoint, but we need to consider aggregation behavior later.
  • Should we be recording completion latency? (cc @mgravell)
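A rough sketch of that static consolidation (hypothetical shape only; the class layout and method are invented, and the thread below ends up preferring an injectable Meter for testability):

using System.Collections.Generic;
using System.Diagnostics.Metrics;

internal static class RedisMetrics
{
    private static readonly Meter Meter = new("StackExchange.Redis");

    private static readonly Counter<long> OperationCount =
        Meter.CreateCounter<long>(
            "redis-operation-count",
            description: "The number of operations performed.");

    // Call sites use the static method directly instead of RedisMetrics.Instance.
    public static void IncrementOperationCount(string endpoint) =>
        OperationCount.Add(1, new KeyValuePair<string, object?>("endpoint", endpoint));
}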

Counter ideas:

  • We should have counters on sent, not just completed (so we can see stalls via metrics)
  • Payload size, if cheap (need to look at this)
    • ...or possibly just bytes written to/read from pipe in general
  • Connection counters:
    • Connected seconds
    • Connections created
    • Connection failures
    • Server errors
    • Internal errors
    • Reconfigurations
  • Things I'd love, but maybe add too much volume (verbosity setting on whether we enable these? Or some flags enum? How do others handle this at mass scale?)
    • Internal metrics of basically everything in an error message that's us: byte buffers, backlogged items, etc.

Tip: GitHub will auto-redirect PRs, so you can PR to a PR, and when the first PR is merged, downstream PRs automatically get redirected to their merge target; e.g. this one would change to point at main upon merge, showing only the relevant diff the whole time... very handy for these layering cases :)

Happy to hop on a call next week if y'all want to talk it through; just getting initial ideas out there. Thoughts?

@JamesNK commented Jul 2, 2023

What's the metric convention here on names? Based on other top-level conventions for process/rpc/etc., I'd naively expect redis.<name> (rather than redis-<name>); am I way off there?

There isn't a prescribed format from OTel for what counter names should look like. OTel has some counters they're defining for certain domains, and they use dots to separate words. .NET counters use dashes. When exported to Prometheus, both are turned into underscores.

For what's effectively a singleton top to bottom

Meter + counters don't necessarily have to be a singleton. In ASP.NET Core, the meter is created from IMeterFactory. That allows a different meter per-container and easy unit testing of metrics. We're looking at making HttpClient have an extension point for configuring a meter to allow its values to be isolated.

A singleton is difficult because multiple tests can run in parallel, or work may still be happening from a finished test, which interferes with results.

See https://github.com/JamesNK/MeterFactoryDemo/blob/d0660fb70ea54bfa358606830a4e02f507fc89f6/src/MeterFactoryDemo.Tests/BasicTests.cs#L14-L41 for an example of how easy it can be to test metrics once the meter is injected. I'm guessing the Redis client doesn't have close integration with DI so a more manual approach like what we're doing with HttpClient would be required: dotnet/runtime#86961
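For illustration, the "manual approach" could look roughly like this (a sketch; the MeterFactory property and CreateMeter helper are invented names, not API from this PR or from HttpClient):

using System.Diagnostics.Metrics;

public sealed class ConfigurationOptions
{
    // Hypothetical hook: tests or DI containers supply an IMeterFactory so
    // metrics are isolated per container; null falls back to a shared Meter.
    public IMeterFactory? MeterFactory { get; set; }
}

internal static class MeterSource
{
    internal static Meter CreateMeter(ConfigurationOptions options) =>
        options.MeterFactory?.Create("StackExchange.Redis")
            ?? new Meter("StackExchange.Redis");
}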

Dimensions + other counters

Check out dotnet/aspnetcore#47536 for ideas of the kind of counters that we're adding to ASP.NET Core + dimensions.

@JamesNK commented Jul 2, 2023

  • redis-operation-count
  • redis-completed-asynchronously
  • redis-completed-synchronously
  • redis-failed-asynchronously
  • redis-failed-synchronously

You could consider combining these into one histogram counter: redis-operation-duration. Then have a dimension for whether the operation was async/sync and a dimension for the success/failure result.
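A sketch of that shape (the tag names "execution.mode" and "result" are illustrative, not from the PR):

using System.Collections.Generic;
using System.Diagnostics.Metrics;

internal sealed class RedisOperationMetrics
{
    private readonly Histogram<double> _operationDuration;

    public RedisOperationMetrics(Meter meter) =>
        _operationDuration = meter.CreateHistogram<double>(
            "redis-operation-duration",
            unit: "s",
            description: "Duration of Redis operations.");

    public void Record(double seconds, bool completedAsync, bool failed) =>
        // One instrument with sync/async and success/failure dimensions
        // replaces the four separate completed/failed counters.
        _operationDuration.Record(seconds,
            new KeyValuePair<string, object?>("execution.mode", completedAsync ? "async" : "sync"),
            new KeyValuePair<string, object?>("result", failed ? "failure" : "success"));
}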

Comment on lines 51 to 55
if (_operationCount.Enabled)
{
    _operationCount.Add(1,
        new KeyValuePair<string, object?>("endpoint", endpoint));
}
@JamesNK commented Jul 2, 2023

Counter.Add tests for enabled internally. Having an extra check like this is really only useful if some work happens when building up the tag list.
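Concretely, continuing the snippet above (the endpoint.ToString() call stands in for any non-trivial tag setup):

// Counter<T>.Add already no-ops when nothing is listening, so the plain call
// needs no guard:
_operationCount.Add(1, new KeyValuePair<string, object?>("endpoint", endpoint));

// The Enabled check only pays off when building the tags does real work,
// e.g. formatting the endpoint on every call:
if (_operationCount.Enabled)
{
    _operationCount.Add(1, new KeyValuePair<string, object?>("endpoint", endpoint.ToString()));
}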

Collaborator

+1 - we should have a memoized string version of the endpoint so that this KVP is as cheap as possible (if it isn't already); no objections to adding that.
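A minimal sketch of that memoization (the field and property names are invented):

using System.Net;

internal sealed class EndpointTagCache
{
    private readonly EndPoint _endpoint;
    private string? _endpointString;

    public EndpointTagCache(EndPoint endpoint) => _endpoint = endpoint;

    // Format once and reuse as the tag value on every measurement.
    internal string EndpointString => _endpointString ??= _endpoint.ToString()!;
}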

@NickCraver (Collaborator)

@JamesNK great points all around (and I really appreciate the expertise and time!)

I agree with collapsing and using dimensions for types (sync/async) and the testing story.

I think my confusion on the instance intent here is the direct usage of .Instance at the call sites; if instead the Multiplexer had a RedisMetrics Metrics property (replaceable per test) which defaulted to RedisMetrics.Instance, I'd mostly get it. I think this would have to be surfaced on ConfigurationOptions, because access isn't early enough otherwise. But I'm still fuzzy on how, for example, the in-memory test collector would differentiate given the namespace. Do we do something like making an internal constructor of RedisMetrics take a namespace, so a test can override it random-GUID style for the Meter, or is there a simpler solution to this sans the dependency? Thoughts?

Also agree on the DI assessment: we don't expose the ConnectionMultiplexer constructor, and overloading the static .Connect(Async) methods creates a real maintenance headache (I have regrets there), so IMeterFactory probably isn't something we'd do given the dependency, but I do want to make this as open/testable as possible.

@noahfalk commented Jul 4, 2023

What's the metric convention here on names? Based on other top-level conventions for process/rpc/etc., I'd naively expect redis.<name> (rather than redis-<name>); am I way off there?

There isn't a prescribed format from OTel for what counter names should look like

Just as a heads-up (and not at all specific to Redis): I've been questioning lately whether .NET should continue to use and recommend dashes in these names. I sent some email earlier that touched on it. I can't find an explicit written recommendation for '.' in OTel's guidance, yet '.' is used uniformly in every OTel example, every experimental semantic convention, and every implementation I can find. The fact that the OTel text doesn't explicitly say "we recommend using dot" feels more like an oversight than OTel deliberately not having a recommendation/preference. Some other generalized guidance from OTel on naming is here.

Do we do something like making an internal constructor of RedisMetrics take a namespace, so a test can override it random-GUID style for the Meter, or is there a simpler solution to this sans the dependency? Thoughts?

If you can create an internal constructor for RedisMetrics and get your code under test to use it, then I'd recommend making that internal constructor take a Meter object:

public static readonly RedisMetrics Instance = new RedisMetrics(new Meter("StackExchange.Redis"));

internal RedisMetrics(Meter meter)
{
    _meter = meter;
    _operationCount = _meter.CreateCounter<long>(
            "redis-operation-count",
            description: "The number of operations performed.");
    ...
}

And then use it something like:

// arrange
// the name of this Meter doesn't matter and it doesn't need to be unique across tests
var meter = new Meter("StackExchange.Redis");
// This collector will only collect metrics recorded by this specific Meter object.
// Even if other tests use Meters with the same name, they will not be captured by this collector
using var collector = new MetricCollector<long>(meter, "redis-operation-count");
var redisMetrics = new RedisMetrics(meter);

// act
DoSomeWorkThatEmitsMetrics(redisMetrics);

// assert
Assert.Equal(1, collector.GetMeasurementSnapshot().Count);
...

It's certainly possible to inject strings and use them to create custom Meter names, but I think that winds up being more complex than injecting a Meter object and just using object identity as the thing that keeps each test's metrics isolated.

The other option, if you can't inject anything (or prefer not to), is to create MetricCollector(scope: null, meterName: "StackExchange.Redis", instrumentName: "redis-operation-count") and then not run these tests in parallel with each other. Binding based on name is going to capture data from every Meter+Instrument in the process that matches those names.
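Spelled out as a sketch (MetricCollector<T> from Microsoft.Extensions.Diagnostics.Testing, with the scope argument passed positionally):

using Microsoft.Extensions.Diagnostics.Metrics.Testing;

// Binds by name: collects from ANY meter in the process named
// "StackExchange.Redis", so these tests must not run in parallel.
using var collector = new MetricCollector<long>(
    null, "StackExchange.Redis", "redis-operation-count");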

@NickCraver (Collaborator)

@noahfalk The dash feels weird to me too; everything was namespace-delimited with dots when we did Bosun's system design and metrics, because that's what we saw everywhere else. Dashes weren't a thing at all, and spaces were underscores, e.g. redis.connection_restores. If we're headed that way overall, I'd definitely like to start here. But above all: I don't want to stand out in a metrics list anywhere as the oddball. From what I see with 1P at least, most people have started with the OTel naming because they're using the Runtime packages and such... so the first addition that doesn't follow it is the oddball. I agree with your thinking on dropping the dash; let me know if I can help.

As for testing: great suggestion, I love it. I knew we could narrow in on a custom Meter.Name attachment but hadn't seen the "taking a Meter" overload. Thanks as always for advice <3

@noahfalk commented Jul 6, 2023

But above all: I don't want to stand out in a metrics list anywhere as the oddball.

Yeah, I don't want Redis to look like an oddball either. If we go through with this, I'm going to adjust the names for all the metrics being added to ASP.NET and HttpClient for .NET 8, as well as change the written naming guidance in our public docs. I'm not proposing we change names that already shipped in past releases as part of the EventCounters API, but I expect usage of those metrics will diminish over time and we'd converge towards a place where all the metrics people care about use dots.

@eerhardt (Contributor, Author)

Thanks for all the feedback so far! I still need to address the feedback to make this testable by allowing the unit tests to inject a Metrics object into the Multiplexer. I'll work on that next week.

In the meantime, I've updated the names to better align with OpenTelemetry (no more dash (-) characters). I've also prefixed things with db.redis, following the naming patterns in https://github.com/open-telemetry/semantic-conventions/blob/main/docs/database/redis.md. I also took @JamesNK's excellent advice and collapsed the 4 counters into a single histogram. Thanks!
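Continuing the constructor sketch from earlier, the renamed instruments presumably look something like this (units and descriptions are illustrative; the db.redis.operation.count and db.redis.duration names are the ones discussed below):

_operationCount = _meter.CreateCounter<long>(
    "db.redis.operation.count",
    description: "The number of operations performed.");

_operationDuration = _meter.CreateHistogram<double>(
    "db.redis.duration",
    unit: "s",
    description: "Duration of operations, tagged sync/async and success/failure.");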

Counter ideas:

@NickCraver @mgravell - my intention is for this PR to get the infrastructure and an initial set of metrics added to the library, and then to add more metrics in separate PRs as the need for them arises. How does that approach sound? Do you have an idea of what a "minimal set" of initial metrics would be?

@NickCraver (Collaborator)

@eerhardt I get the intent, but all past experience has taught me not to put a few metrics in and come back later. In every case we've done that, when we went from, say, the first 3 to the first 20, we found out the scheme/name/dimensions/etc. for the first 3 wasn't right and stood out. IMO, we need to figure out a larger set together.

Looking at the current code, I see db.redis.operation.count and db.redis.duration; wouldn't the second be db.redis.operation.duration? (I'm also not sure "operation" is the right word: we do a lot of operations, but in Redis these are typically "commands" and I'd expect that here. This could be bias, but connections, reconnections, reconfigurations, etc. are all "operations".)

I'm not 100% sure operations/commands make overall sense as a single item. For example, we have commands sent and received; based on most of our debugging, I'd guess these should be separate, e.g. the first staying steady while the second pauses indicates a server stall. On commands, we probably want the command name (e.g. SET) as a tag.

Overall, I think we've got the approach figured out; we should now make an overall list of metrics to get a decent API/name/pattern/tag design that's a lot more comprehensive than starting with just a few. Happy to do a call this week if it helps; I just need to get some active blocking release stuff out the door to get more time on SE.Redis. Let me try to get a decent start on the full list here during the Tuesday call.

@eerhardt (Contributor, Author)

For what's effectively a singleton top to bottom

Ok, I've updated the latest code to allow for injection of the Meter object, so tests can be written effectively. I've also added a single test to show how testing works. Let me know what you think of the current structure.

If we like how this is structured, I can move forward with adding/removing/refactoring the metrics we really want to design. And then add more tests for them.

@pantosha commented Jan 2, 2024

@eerhardt, #2497 is merged. Could you prepare the current PR for the merge? 😊

var now = Stopwatch.GetTimestamp();
var duration = new TimeSpan((long)((now - message.CreatedTimestamp) * s_tickFrequency));

var tags = new TagList


For up to 3 tags, it's faster to pass them directly.
https://github.com/open-telemetry/opentelemetry-dotnet/tree/main/docs/metrics#instruments
"When reporting measurements with 3 tags or less, pass the tags directly to the instrument API."
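Applied to the snippet above (a sketch; the tag names and the endpointString variable are illustrative):

// Passing up to three tags directly avoids building the TagList:
_operationDuration.Record(duration.TotalSeconds,
    new KeyValuePair<string, object?>("endpoint", endpointString),
    new KeyValuePair<string, object?>("status", status));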
