
Inconsistent prometheus response from subsequent GET /metrics scrapes (e.g. ibmmq_queue_depth) #238

Open
grahambrereton-form3 opened this issue Jul 21, 2023 · 2 comments

@grahambrereton-form3

Affected version: v5.5.0, latest in master branch

Metrics based on publications are not consistently present in the response from GET /metrics in the mq_prometheus program, dependent on scrape interval. It is expected that all metrics would be present in the metrics endpoint response, regardless of scrape frequency.

If the endpoint is scraped twice in short succession, the second scrape may not contain the values for a large number of metrics, including ibmmq_queue_depth. It appears the link between these missing metrics is that they are updated via publications.

Expected behaviour: if there is no new value for a gauge metric, the metrics endpoint should just return the latest value in its response, rather than omitting the value.

Use case: running prometheus scraping in an HA setup and not wanting to have to carefully coordinate scraping cycles for each replica so that there are never two scrapes within a single publication interval.

It seems that this comes down to the implementation of Collect in the exporter for mq_prometheus: one of the first things it does is reset all the gauge metrics before processing any new publications, so scraping twice within a single publication interval produces inconsistent results.
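To make the suspected mechanism concrete, here is a minimal sketch (hypothetical Python, not the actual Go source of mq_prometheus) of a collector that clears its cached values at the start of every scrape, assuming values only arrive via periodic publications:

```python
# Hypothetical reset-on-Collect pattern. Metric values arrive via periodic
# "publications"; collect() clears the cache first, so a second scrape
# before the next publication sees nothing.

class PublicationBackedCollector:
    def __init__(self):
        self.values = {}          # metric name -> latest published value

    def on_publication(self, updates):
        """Called when the queue manager publishes new metric data."""
        self.values.update(updates)

    def collect(self):
        """Called on every GET /metrics scrape."""
        snapshot = dict(self.values)
        self.values.clear()       # reset before new publications arrive
        return snapshot

c = PublicationBackedCollector()
c.on_publication({"ibmmq_queue_depth": 8})

first = c.collect()    # scrape shortly after a publication
second = c.collect()   # second scrape within the same publication interval

print(first)           # {'ibmmq_queue_depth': 8}
print(second)          # {} -- the metric is missing, as in the repro below
```

Two scrapes inside one publication interval reproduce the 8-then-0 pattern shown in the example output below.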

Example to reproduce for queue depth:

❯ while true; do date; curl -s http://localhost:9157/metrics | grep -c '^ibmmq_queue_depth'; sleep 5; done
Fri 21 Jul 2023 09:49:32 AM EDT
8
Fri 21 Jul 2023 09:49:37 AM EDT
8
Fri 21 Jul 2023 09:49:42 AM EDT
0
Fri 21 Jul 2023 09:49:47 AM EDT
8
Fri 21 Jul 2023 09:49:52 AM EDT
0

More general example:

❯ while true; do date; curl -s http://localhost:9157/metrics | wc -l; sleep 5; done
Fri 21 Jul 2023 09:51:36 AM EDT
1000
Fri 21 Jul 2023 09:51:41 AM EDT
400
Fri 21 Jul 2023 09:51:46 AM EDT
1000
Fri 21 Jul 2023 09:51:51 AM EDT
400
@ibmmqmet
Collaborator

Most of the values returned by the published metrics are counters over the interval. Returning duplicate values when none have actually been reported by the queue manager would be wrong. It would lead to incorrect calculations such as total number of messages. The more common situation we have to deal with here instead is where the scrape interval covers two sets of publications from the queue manager - cleaning out the maps on each iteration makes the aggregation where that's needed more manageable.

While "depth" is an absolute value, and theoretically could be duplicated without harm, trying to handle that as a special case would get very messy, and still potentially misleading if the real depth is varying rapidly as you wouldn't be able to trust it.

If you want to increase the sampling rate, there are tuning parameters for the queue manager which cause it to publish the metrics more frequently, and you could align that with your preferred scrape interval. In particular you can put

TuningParameters:
    MonitorPublishHeartBeat=<n>  # seconds - default 10

@grahambrereton-form3
Author

Hi @ibmmqmet, thanks for your response.

I'm a bit confused by the decision to use gauge metrics to report the value of a counter over an interval, rather than exposing the absolute cumulative value as a counter metric. That would allow use of the rate operator in PromQL to determine the rate of change, or increase to determine the change, rather than directly reporting the increase over the scrape interval. My understanding is that only absolute values should be gauge metrics, while anything counting occurrences of an event should be a counter.

If the sensitive metrics you were referring to were counters rather than gauges, then a duplicate value doesn't seem like it would be an issue. E.g. if "total message count" is a gauge holding the count over an interval, then yielding the same value on two subsequent scrapes could skew the result, while for a cumulative counter the same value being scraped twice would just report no change.

The problem I'm facing is not that I'd particularly like to change the scrape interval. Rather, I have an HA setup for Prometheus scraping in which each instance scrapes the metrics endpoint on its own schedule, with samples deduplicated by their labels. Depending on timing, the accepted sample may be one that lacks all the publication metrics, because an earlier scrape caused a reset. We have worked around this by increasing the scrape interval so that scrape interval / replica count > publication interval, but that only decreases the likelihood of us seeing problems.
