Metrics reporting failing after update to v8.x #460

Open
daverant opened this issue Dec 6, 2023 · 7 comments

@daverant

daverant commented Dec 6, 2023

We have a combination of metrics, some using Prometheus.Metrics collectors, and some using System.Diagnostics.Metrics instruments.

We're running .NET 7 and want to update to .NET 8, but we need to take this bugfix because we see the following error crop up a lot when trying to run on .NET 8:

System.FormatException: Input string was not in a correct format. Failure to parse near offset 2. Expected an ASCII digit.
at System.Text.StringBuilder.AppendFormatHelper(IFormatProvider provider, String format, ReadOnlySpan`1 args)
at System.Text.StringBuilder.AppendFormat(String format, Object[] args)
at Prometheus.MeterAdapter.TranslateInstrumentDescriptionToPrometheusHelp(Instrument instrument)
at Prometheus.MeterAdapter.OnInstrumentPublished(Instrument instrument, MeterListener listener)
at System.Diagnostics.Metrics.Instrument.Publish()
at System.Diagnostics.Metrics.Meter.GetOrCreateInstrument[T](Type instrumentType, String name, String unit, String description, IEnumerable`1 tags, Func`1 instrumentCreator)
at System.Net.Http.Metrics.MetricsHandler..ctor(HttpMessageHandler innerHandler, IMeterFactory meterFactory, Meter& meter)
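
For reference, this is the exception AppendFormat throws when the string it is given contains a literal brace that does not start a valid format item, so presumably the instrument description being translated contains braces; that part is an assumption on my side. A minimal standalone repro of just that exception (the description string is made up):

using System;
using System.Text;

// Not the prometheus-net code path itself, only the underlying behaviour:
// AppendFormat treats its first argument as a composite format string, so a
// literal '{' not followed by an index digit fails to parse and throws
// System.FormatException ("Expected an ASCII digit").
var sb = new StringBuilder();
sb.AppendFormat("Request duration in {seconds}", Array.Empty<object>());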

We initialise everything using KestrelMetricServer:

DotNetRuntimeStatsBuilder.Default().StartCollecting();
using var server = new KestrelMetricServer(hostname, port);
server.Start();
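
For context, the mix of instrument types mentioned above looks roughly like this (the metric and meter names here are illustrative, not our real ones):

using System.Diagnostics.Metrics;
using Prometheus;

// prometheus-net native collector (Prometheus.Metrics).
var jobsProcessed = Metrics.CreateCounter(
    "jobs_processed_total", "Number of jobs processed.");
jobsProcessed.Inc();

// System.Diagnostics.Metrics instrument, picked up by prometheus-net's
// MeterAdapter (which I understand is enabled by default in v8.x).
var meter = new Meter("MyCompany.MyApp");
var requestCounter = meter.CreateCounter<long>(
    "app_requests", unit: null, description: "Requests handled by the app.");
requestCounter.Add(1);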

After upgrading from v7.0.0 to v8.x of prometheus-net, we started to see strange behaviour with various collectors and instruments:

MassTransit metrics, using Prometheus.Metrics collectors: running v7 until 17:05, then v8.1.1 after 17:05, at which point they fail to increment and are then dropped:
[screenshot: MassTransit metric graph before/after the upgrade]

Our own System.Diagnostics.Metrics instruments: running v7 until 17:05, then v8.1.1 after 17:05, at which point they fail to increment and are then dropped:
[screenshot: custom instrument graph before/after the upgrade]

This is unfortunately blocking us from moving to .NET 8, because we need to take the above bugfix. Any help or suggestions in rooting out a potential cause would be appreciated.

@sandersaares
Member

Can you provide a minimal sample app to reproduce the problem? Are there any exceptions visible? If you attach a debugger, do you see any exceptions listed (e.g. in the Visual Studio "Output" panel)?

@daverant
Author

daverant commented Dec 6, 2023

Can you provide a minimal sample app to reproduce the problem? Are there any exceptions visible? If you attach a debugger, do you see any exceptions listed (e.g. in the Visual Studio "Output" panel)?

Thanks @sandersaares, yes I'm currently figuring out a minimal repro for this, will share when I have it 👍

@daverant
Author

daverant commented Dec 13, 2023

Thought I'd just add some info from trying to repro this. After upgrading to 8.x it looks like we're seeing a decrease in the overall number of metrics being scraped from our Prometheus endpoints, and metrics appear to be dropped from those endpoints over time. In the graph below you can see when we deploy with 8.x in version 2.0.26693.1 and when we roll back to 7.0.0 in 2.0.26698.1. It doesn't appear selective: both prometheus-net collectors and System.Diagnostics instruments are affected.

[screenshot: scraped metric count across the two deployments]

I stood up a basic app to try and repro this (most recent versions are in the v7 and v8 branches), and there are some potentially interesting behavioural differences between 7.0.0 and 8.x in terms of total metrics shipped over time. Both pods are running identical code, apart from the different prometheus-net dependency. I'd expect those two graphs to be more closely aligned. I've added an event source filter to try to maintain parity on that front between the two prometheus-net versions.

[screenshot: total metrics shipped over time, v7 pod vs v8 pod]

Next I'm going to see whether the metrics server itself is throwing any errors at all, but I first need to get my ducks in a row to attach a remote debugger.

@sandersaares
Member

sandersaares commented Dec 18, 2023

The app you shared includes some metrics from the .NET Meters API, based on one of the prometheus-net samples. This sample code emits different timeseries over time, so after a while the old ones will expire and be dropped. This might explain the nature of the fluctuation in the last graph you shared.

This expiration of metrics is not super precise - the lifetime is just a minimum lifetime guarantee (5 minutes by default), with cleanup happening at an unspecified point after that. The specifics of this logic have changed in recent versions, so some difference in when exactly garbage is cleaned up is not surprising.
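
For reference, the same minimum-lifetime idea is also exposed directly to user code through the managed-lifetime factory; a rough sketch along those lines (metric and label names are illustrative):

using System;
using Prometheus;

// Metric instances created through a managed-lifetime factory are kept for at
// least the configured lifetime after their last use and removed at some
// unspecified point after that. The 5-minute value mirrors the default
// mentioned above.
var expiringFactory = Metrics.WithManagedLifetime(TimeSpan.FromMinutes(5));

// WithExtendLifetimeOnUse() extends the lifetime every time a value is
// recorded; after ~5 minutes of silence the timeseries becomes eligible
// for removal.
var documentsInProgress = expiringFactory
    .CreateGauge("documents_in_progress", "Documents currently being processed.",
        new[] { "operation" })
    .WithExtendLifetimeOnUse();

documentsInProgress.WithLabels("import").Inc();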

[screenshot: metric graph from the sample app]

I was not able to detect any other metrics going away on a random sampling over 30 minutes. Looking forward with interest to more details!

@sandersaares
Member

Could it be that your metrics are not being updated at a fast enough interval to keep them alive? Although, even in this case they should come back as soon as the next value is recorded - your original screenshot shows the timeseries disappearing for good. Still, perhaps an angle to explore?

@daverant
Author

Hi @sandersaares, thanks for the input and apologies for the slow reply; I took a chunk of time off over the holidays! I'll read through this and take a fresh look at it.

@daverant
Author

daverant commented Jan 11, 2024

Could it be that your metrics are not being updated at a fast enough interval to keep them alive? Although, even in this case they should come back as soon as the next value is recorded - your original screenshot shows the timeseries disappearing for good. Still, perhaps an angle to explore?

The particularly interesting behaviour, as you say, is that some metrics get dropped entirely, while others appear to freeze and no longer increment or decrement despite those code paths being hit. I think that's a thread worth tugging at too: why does a metric stop changing in value despite the code path being hit repeatedly?
