
Inconsistent prometheus response from subsequent GET /metrics scrapes (e.g. ibmmq_queue_depth) #238

Open
grahambrereton-form3 opened this issue Jul 21, 2023 · 2 comments

@grahambrereton-form3

Affected version: v5.5.0, latest in master branch

Metrics based on publications are not consistently present in the response from GET /metrics in the mq_prometheus program, dependent on scrape interval. It is expected that all metrics would be present in the metrics endpoint response, regardless of scrape frequency.

If the endpoint is scraped twice in short succession, the second scrape may not contain the values for a large number of metrics, including ibmmq_queue_depth. It appears the link between these missing metrics is that they are updated via publications.

Expected behaviour: if there is no new value for a gauge metric, the metrics endpoint should just return the latest value in its response, rather than omitting the value.

Use case: running prometheus scraping in an HA setup and not wanting to have to carefully coordinate scraping cycles for each replica so that there are never two scrapes within a single publication interval.

It seems that this comes down to the implementation of Collect in the exporter for mq_prometheus: one of the first things it does is reset all the gauge metrics before processing any new publications, so scraping twice within a single publication interval produces inconsistent results.
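To make the suspected mechanism concrete, here is a minimal sketch (hypothetical Python, not the actual Go source of mq_prometheus) of a collector that clears its cached values at the start of every scrape, assuming values only arrive via periodic publications:

```python
# Hypothetical reset-on-Collect pattern. Metric values arrive via periodic
# "publications"; collect() clears the cache first, so a second scrape
# before the next publication sees nothing.

class PublicationBackedCollector:
    def __init__(self):
        self.values = {}          # metric name -> latest published value

    def on_publication(self, updates):
        """Called when the queue manager publishes new metric data."""
        self.values.update(updates)

    def collect(self):
        """Called on every GET /metrics scrape."""
        snapshot = dict(self.values)
        self.values.clear()       # reset before new publications arrive
        return snapshot

c = PublicationBackedCollector()
c.on_publication({"ibmmq_queue_depth": 8})

first = c.collect()    # scrape shortly after a publication
second = c.collect()   # second scrape within the same publication interval

print(first)           # {'ibmmq_queue_depth': 8}
print(second)          # {} -- the metric is missing, as in the repro below
```

Two scrapes inside one publication interval reproduce the 8-then-0 pattern shown in the example output below.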

Example to reproduce for queue depth:

❯ while true; do date; curl -s http://localhost:9157/metrics | grep -c '^ibmmq_queue_depth'; sleep 5; done
Fri 21 Jul 2023 09:49:32 AM EDT
8
Fri 21 Jul 2023 09:49:37 AM EDT
8
Fri 21 Jul 2023 09:49:42 AM EDT
0
Fri 21 Jul 2023 09:49:47 AM EDT
8
Fri 21 Jul 2023 09:49:52 AM EDT
0

More general example:

❯ while true; do date; curl -s http://localhost:9157/metrics | wc -l; sleep 5; done
Fri 21 Jul 2023 09:51:36 AM EDT
1000
Fri 21 Jul 2023 09:51:41 AM EDT
400
Fri 21 Jul 2023 09:51:46 AM EDT
1000
Fri 21 Jul 2023 09:51:51 AM EDT
400
@ibmmqmet
Collaborator

Most of the values returned by the published metrics are counters over the interval. Returning duplicate values when none have actually been reported by the queue manager would be wrong. It would lead to incorrect calculations such as total number of messages. The more common situation we have to deal with here instead is where the scrape interval covers two sets of publications from the queue manager - cleaning out the maps on each iteration makes the aggregation where that's needed more manageable.

While "depth" is an absolute value, and theoretically could be duplicated without harm, trying to handle that as a special case would get very messy, and still potentially misleading if the real depth is varying rapidly as you wouldn't be able to trust it.

If you want to increase the sampling rate, there are tuning parameters for the queue manager which cause it to publish the metrics more frequently, and you could align that with your preferred scrape interval. In particular you can put

TuningParameters:
    MonitorPublishHeartBeat=<n>  # seconds - default 10

@grahambrereton-form3
Author

Hi @ibmmqmet, thanks for your response.

I'm a bit confused by the decision to use gauge metrics to report the value of a counter over an interval, rather than exposing the absolute cumulative value as a counter metric. That would allow use of the rate operator in PromQL to determine the rate of change, or increase to determine the change, rather than directly reporting the increase over the scrape interval. My understanding is that only absolute values should be gauge metrics, while anything counting occurrences of an event should be a counter.

If the sensitive metrics you were referring to were counters rather than gauges, then a duplicate value doesn't seem like it would be an issue. E.g. if "total message count" is a gauge holding the count over an interval, then yielding the same value on two subsequent scrapes could skew the result, while for a cumulative counter the same value being scraped twice would just report no change.

The problem I'm facing is not that I'd particularly like to change the scrape interval. Rather, I have an HA setup for Prometheus scraping in which each instance scrapes the metrics endpoint on its own schedule, with samples deduplicated by their labels. Depending on timing, the accepted sample may be one that lacks all the publication metrics, because an earlier scrape caused a reset. We have worked around this by increasing the scrape interval so that scrape interval / replica count > publication interval, but that only decreases the likelihood of us seeing problems.
