
[Feature request] Multiple alerts per graph #7832

Closed

sanchitraizada opened this issue Mar 14, 2017 · 149 comments

Comments
@sanchitraizada
Copy link
Contributor

sanchitraizada commented Mar 14, 2017

As per http://docs.grafana.org/alerting/rules/, Grafana plans to track state per series in future releases.

  • "If a query returns multiple series then the aggregation function and threshold check will be evaluated for each series. What Grafana does not do currently is track alert rule state per series." and
  • "To improve support for queries that return multiple series we plan to track state per series in a future release"

But it seems there are use cases where a graph contains a set of metrics for which different sets of alerts are required. This is slightly different from "Support per series state change" (#6041) because:

  1. The action (notifications) can be different.
  2. Also, tracking separate states of an alert is not always preferred (the end user would need to know the details behind the individual states) versus just knowing that the alert is triggered.

Grafana version = 4.x

@gdhgdhgdh
Copy link

gdhgdhgdh commented Apr 13, 2017

Concrete use case: I have instrumented my app to record a histogram in Prometheus for each major function (e.g. where an external HTTP call or disk I/O takes place) and would like to alert when any of these becomes slow.

Presently I have to define dummy graphs for this because of the 1:1 relationship between graph and alert. It would be much more logical to keep the alerts defined in the same place as the graph itself.

@torkelo
Copy link
Member

torkelo commented Apr 13, 2017

And you cannot define that in one query?

@gdhgdhgdh
Copy link

No; a chain of OR conditions is crude, and the single alert name cannot clearly identify the exact reason for the alert. I definitely don't want to send alerts along the lines of "Some part of service X is failing" - the engineers on call would not be my friends...

@torkelo
Copy link
Member

torkelo commented Apr 13, 2017

then it makes more sense to have separate panels for the alerts, if you want separate alert rule names & messages etc.

@gdhgdhgdh
Copy link

Yep that's exactly what I'm doing at the moment. Is there any likelihood of implementing multiple alerts per graph in the near future so I can move away from this workaround?

@torkelo
Copy link
Member

torkelo commented Apr 13, 2017

it's very unlikely

@torkelo
Copy link
Member

torkelo commented Apr 13, 2017

maybe if there is huge demand for it :)

@gdhgdhgdh
Copy link

haha OK - I'll see if I can rustle up an angry mob ;) Seriously tho', thanks for the honesty.

@rssalerno
Copy link

Ok we have a mob of two :-) I'm graphing fuel levels in multiple tanks & wanted to set up a low fuel alert for each tank.

@torkelo
Copy link
Member

torkelo commented May 18, 2017

and each tank has different thresholds or notifications?

@rssalerno
Copy link

Exactly. One is a 285 gal heating oil tank. I wanted to set up a "heating oil low" alert when that tank goes below 70 gal. The other is a 500 gal propane tank; for that I wanted a "propane low" alert when it goes under 100 gal. I set up singlestats for each, but alerts are not available in a singlestat.

[image: fuellevels]

@oododa
Copy link

oododa commented May 28, 2017

I have a graph with a median and a 90th percentile metric. I'd like to get an alert on each. In order to do this, I have to create one graph for each. Then, if I want warnings and critical alerts for each, I have to create a second graph for each.

I have 30 or 40 services to monitor, each with 2 to 5 key metrics. I have graphs where I graph the same metric for multiple customers, and while I don't have to do alerts per customer (yet), it does add to the number of metrics I'd like to have alerts on. The amount of work to create dozens of graphs expands very quickly. It would be very useful in my current production environment (and in my previous production environments) to have warnings and critical alerts, and to display multiple metrics in a single graph and alert on them.

@alex-phillips
Copy link

I'd also like to see this feature. A good example is one alert if a metric goes outside of a threshold and another alert if data fails to update, i.e., if a value goes too high or if values fail to report. This could be used to show that whatever is reporting the data has encountered an issue that is preventing communication with Grafana (or whatever backend).

@rmsys
Copy link

rmsys commented Jun 9, 2017

Hi Torkelo!

I got several "likes" for the feature! Will it make it into the next release =) ?

@torkelo
Copy link
Member

torkelo commented Jun 9, 2017

@rmsys maybe at some point. Solving it from a UX and code-complexity perspective will take time. It's not on any roadmap yet, but maybe next year, as the alerting engine matures and a UX design for this is worked out.

@jpriebe
Copy link

jpriebe commented Jun 9, 2017

Another good use case for multiple alerts is to have different severity thresholds with different actions. If a server starts to exhibit slowdowns, an email might be sufficient, but if the slowdowns become extreme, it might be worth paging the administrator.

@pgporada
Copy link
Contributor

I have a graph that returns a metric with valid and invalid values. This would be useful to me because I could use a single graph containing two queries to create alerts that fire when valid counts are too low and invalid counts are too high.

@torkelo
Copy link
Member

torkelo commented Aug 28, 2017

Also, tracking separate states of an alert is not always preferred (the end user would need to know the details behind the individual states) versus just knowing that the alert is triggered.

Not sure I understand what you mean by this. Can you elaborate?

Can you describe how multiple alerts per graph would work and look? What would the annotations say, and what would the green/red heart beside the panel title show (if, say, 2/5 alert rules were firing)?

Would you want to share something between the alert rules, or would they be completely isolated (besides living in the same graph panel and possibly referring to the same queries)?

How would you visualize thresholds when you have multiple alert rules? Would they show up as separate rules in the alert rules page & alert list panel? Then you would need a way to navigate to a specific instance of a rule and not just to the alert tab.

Grafana is a visual tool and we have chosen to tie an alert rule to a graph so that the alert rule state can be visualized easily (via the metrics, thresholds & alert state history). I am afraid that having each graph be able to represent multiple alert rules will complicate this to a very large extent, and I am not sure about the need for this.

@rssalerno having support for alert rules in singlestat panel seems unrelated to this issue.

@alex-phillips your scenario sounds like it can be solved by making individual alert rules more flexible.

Does someone have some concrete examples where this would be good? I just see it ending up as a confusing graph with 2-5 thresholds where you do not know which metric each relates to, and alert history annotations where you do not know which alert rule they came from (without hovering).

@pdf
Copy link

pdf commented Aug 28, 2017

Can you describe how multiple alerts per graph would work and look? What would the annotations say, and what would the green/red heart beside the panel title show (if, say, 2/5 alert rules were firing)?

I think multiple alert rules would be annotated individually. Hearts might be colour-coded. Rules would need to be named for differentiation in alerts/panels.

Would you want to share something between the alert rules, or would they be completely isolated (besides living in the same graph panel and possibly referring to the same queries)?

Generally I would think not, though I suspect groups would need to have a shared threshold and name if they were implemented (per #6557 (comment)).

How would you visualize thresholds when you have multiple alert rules? Would they show up as separate rules in the alert rules page & alert list panel? Then you would need a way to navigate to a specific instance of a rule and not just to the alert tab.

If rules take an additional colour param, thresholds can be rendered using that and differentiated as such; you'd probably want a tooltip also. Being able to toggle rules would be useful, and a param to render a specific rule takes care of the latter, I think?

@rssalerno having support for alert rules in singlestat panel seems unrelated to this issue.

I believe you'll find he was referring to the graph below that, though since he has separate panels for each tank, singlestat alerting may solve his problem for that specific dashboard.

Does someone have some concrete examples where this would be good? I just see it ending up as a confusing graph with 2-5 thresholds where you do not know which metric each relates to, and alert history annotations where you do not know which alert rule they came from (without hovering).

Primarily, I'd like this to support #6557 and #6553, and multiple thresholds, similar to @alex-phillips. For example, one use case we have for #6557 is to alert differently for different environments (production, beta, dev, etc.); combined with multiple thresholds, that would solve most of our problems. If there's a better way of doing that without multiple rules, it's not obvious to me.

@ddhirajkumar
Copy link

ddhirajkumar commented Aug 31, 2017

@torkelo

Can you describe how multiple alerts per graph would work and look? What would the annotations say, and what would the green/red heart beside the panel title show (if, say, 2/5 alert rules were firing)?

I like the approach suggested by @pdf

Further, annotations would be shown the same way as in the current case where you have an alert rule with more than one condition (each having a different threshold). The green/red heart beside the panel title would be shown as red if at least one alert is firing, similar to the current scenario where at least one condition of an alert rule evaluates to true. And probably also show the number (2/5) along with the red heart in the title.

Would you want to share something between the alert rules, or would they be completely isolated (besides living in the same graph panel and possibly referring to the same queries)?

In most of our use cases, these rules would not share anything between them, and the queries are also different.

How would you visualize thresholds when you have multiple alert rules? Would they show up as separate rules in the alert rules page & alert list panel? Then you would need a way to navigate to a specific instance of a rule and not just to the alert tab.

They would show up as separate rules in the alerts page. The Alert tab would probably have a list of the alerts defined. Right, we would need to highlight/expand the specific alert rule on this tab when the alert rule URL (which should capture the alert id or index) is accessed from the notification. Seems easily solvable.

In the alert list panel, there wouldn't be any change: it shows all of them separately. Semantically, each alert is separate; it just happens to be placed in the same panel.

Does someone have some concrete examples where this would be good? I just see it ending up as a confusing graph with 2-5 thresholds where you do not know which metric each relates to, and alert history annotations where you do not know which alert rule they came from (without hovering).

Considering that a lot of people have upvoted this, it would definitely be a useful feature. If we have support for multiple alerts, then I think it would be up to each user whether it's confusing or not. IMHO, those who think it is confusing would go with the current approach of a separate panel for each alert, and those who think the utility/convenience of having the same panel used for visualization and alerting outweighs the perceived confusion would go the multiple-alerts way. Sure, it would change the UX somewhat.

@rossKayHe
Copy link

In Splunk we have high/low alerts. If multiple alerts were available in Grafana, we'd just use the same search; they are just different thresholds against the same search.

@fadlytabrani
Copy link

+1 for this feature.

@sparr
Copy link

sparr commented Nov 15, 2017

+1 for this. Our use case is as follows: We want to define one chart with, say, cpu usage for all of our servers. Then on that same chart we will make two hidden metrics, one for cpu usage on production servers and one for cpu usage on non-production servers. Each of those metrics would have its own alert, with different notification channels. We do not want to have to create multiple charts or panels or dashboards to accomplish this.

@nelg
Copy link

nelg commented Nov 30, 2017

+1 for this feature.

@StianOvrevage
Copy link

Came here reading some of the other issues regarding categories and severities. I agree all alerts should be actionable. But there is a difference between a "fix this first thing in the morning" alert and a "call out the $400/hour consultant ASAP" alert.

As many have mentioned, this is most commonly solved with Warning and Critical thresholds.

Technically this could be implemented in a bunch of ways: labels, several alerts per panel, several thresholds per alert, etc.

Regarding confusion if the categorization is too complex: a Warning/Critical setup can simply use red/yellow, with red overriding yellow.

For more complex setups, another option besides hovering to locate the offending time series could be a flashing line/area/whatever? That would draw attention to the correct time series easily.

I think most users would be satisfied by a fairly simple Warn/Crit separation though.
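
To make the Warn/Crit idea concrete, here is a minimal sketch (plain Python, not Grafana code; the thresholds and series names are invented) of evaluating two thresholds per series with red overriding yellow:

```python
# Illustrative sketch only (not Grafana internals): a warning/critical pair
# evaluated per series, where critical ("red") overrides warning ("yellow").
from typing import Dict

WARN = 0.80   # hypothetical warning threshold (80% disk usage)
CRIT = 0.90   # hypothetical critical threshold (90% disk usage)

def severity(value: float) -> str:
    """Return the highest matching severity for one reduced value."""
    if value >= CRIT:
        return "critical"   # red overrides yellow
    if value >= WARN:
        return "warning"
    return "ok"

def evaluate(series: Dict[str, float]) -> Dict[str, str]:
    """Evaluate each series independently, keeping one state per series."""
    return {name: severity(value) for name, value in series.items()}

if __name__ == "__main__":
    usage = {"prod-db-1": 0.93, "prod-db-2": 0.84, "prod-db-3": 0.41}
    print(evaluate(usage))
    # {'prod-db-1': 'critical', 'prod-db-2': 'warning', 'prod-db-3': 'ok'}
```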

@KieranP
Copy link

KieranP commented Dec 10, 2017

This is an absolute must for alerting software, especially for server monitoring. Disk space, memory, CPU usage, temperature, load average... all prime examples where one would want multiple alerts configured with different messages and different thresholds. Take disk space, for example: you need one alert for disk usage over 70% and another for disk usage over 90%.

@Caffe1neAdd1ct
Copy link

Caffe1neAdd1ct commented Jan 4, 2018

Bit of an edge case, but we are using alerts to notify us if a product hasn't sold in a few days. We have each product as a metric, which in turn means we only get one alert when one of the metrics enters the alerting threshold. Ideally we would like to receive another alert whenever any additional metric enters the alerting threshold as well.

Also, we are using templating vars to repeat a graph for each selected product, with two metrics overlaid (volume and gross margin) on the left and right y-axes. This kills any chance of using alerting, as the alert query isn't picking up the $sku list variable for our IN ($sku).

To work around this I've tried having another query B, which just runs the template query to look up all SKUs we are interested in and puts that straight into the alert query: IN (SELECT skus from interested_product_table). However, this starts sending us alerts for each graph for all the metrics across every graph, meaning we get, for example:

Email Alert 1 - metric1,metric2,metric3
Email Alert 2 - metric1,metric2,metric3
Email Alert 3 - metric1,metric2,metric3
Email Alert 4 - metric1,metric2,metric3

Email Alert 5 - metric4
Email Alert 6 - metric4
Email Alert 7 - metric4
Email Alert 8 - metric4

This is quite spammy.

@jessover9000
Copy link
Contributor

jessover9000 commented Jan 14, 2021

Hello Grafana community, the Grafana team has picked up the work on Alerting and we're in the process of redesigning it to make the best possible alerting experience happen 🔥 🚀 We would love to find out more about your needs as our beloved users. So if any of you are willing to have a 30-minute interview with me, please just send me an empty e-mail and I will get in touch.

Update: I got so many e-mails in such a short time, you all rock! I'll be reaching out to everyone who sent e-mails, we have enough interviewees now, thank you <3

@lafrech
Copy link

lafrech commented Jan 19, 2021

Looks like I'm late for the interview. Here's a use case, anyway.

Our use case is building monitoring, with many sensors. We want to receive alerts if data is missing or out of bounds (ideally as two different alerts).

(I'd also be interested in alerts such as impossible change rate, like a room temperature going from 15 to 30 in a minute, but that is secondary and should be achievable with thresholds on a derivative query.)

I have a table (.csv) defining for each sensor the expected frequency (which defines a NoData time) and the min and max bounds, if any. I could also work with families of sensors, like sensors with tag "temperature" -> [0°C; 50°C], freq=120m, tag "elec power" -> ... etc.

Ideally, I'd like to be able to launch a single query with a groupby(id) to have each sensor as a separate series, then create an alert that would use parameters inferred from tags to check bounds and NoData, returning an explicit alert for each issue.

I reckon the tags -> parameters mapping is a bit too specific, so I would be happy if I could just manually create an alert per sensor family (e.g. an alert for temperature sensors, another one for elec meters, etc.) with the parameters set in the alert.

In any case, I don't want an all-or-nothing alert. I need an alert state on each sensor. But having it in a single alert, or having some concept wrapping several alerts to get a single report, would be a killer feature, as getting one email per sensor the day the gateway goes down is not ideal.

Since alerts can't be parametrized this way, I need to create many alerts. Too many to do it manually. I can use the API to create them programmatically but there are still limitations.

I still can't parameterize the alert, so it has to be one per sensor, with no global report. A concept of a single report covering multiple alerts would be nice.

I can't set two alerts on the same panel, so I have to create a panel for each alert. This limitation seems arbitrary and, without any knowledge of the code, I naively assume it shouldn't be too difficult to improve. Then one could have a panel with multiple queries and an alert for each query, for instance. Having to create multiple panels is totally feasible with the API but it's kinda lame to use dummy panels just to set alerts. I'm confident the alerts module redesign will address this.
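
As a rough illustration of the dummy-panel API workaround described above, here is a minimal Python sketch that pushes a generated dashboard through Grafana's classic `POST /api/dashboards/db` endpoint, one panel (and one legacy alert) per sensor. The panel/alert JSON fields are illustrative only and vary by Grafana version and data source; `GRAFANA_URL`, `API_KEY`, and the sensor list are placeholders.

```python
# Sketch of the "one dummy panel per alert" API workaround described above.
# Assumes Grafana's classic dashboard endpoint (POST /api/dashboards/db) and
# legacy (pre-v8) panel alerts; JSON field names are illustrative and
# version/data-source dependent.
import requests

GRAFANA_URL = "http://localhost:3000"     # placeholder
API_KEY = "REPLACE_WITH_API_KEY"          # placeholder: editor/admin API key

SENSORS = [  # hypothetical sensor table: id, max bound
    {"id": "room_temp_1", "max": 50},
    {"id": "elec_power_1", "max": 9000},
]

def panel_for(sensor, panel_id):
    """Build one graph panel whose only purpose is to carry one alert rule."""
    return {
        "id": panel_id,
        "type": "graph",
        "title": f"Alert panel: {sensor['id']}",
        # Query and alert blocks are schematic, not an exact schema.
        "targets": [{"refId": "A", "query": f"sensor_value{{id='{sensor['id']}'}}"}],
        "alert": {
            "name": f"{sensor['id']} out of bounds",
            "frequency": "60s",
            "conditions": [{
                "type": "query",
                "query": {"params": ["A", "5m", "now"]},
                "reducer": {"type": "max"},
                "evaluator": {"type": "gt", "params": [sensor["max"]]},
            }],
        },
    }

payload = {
    "dashboard": {
        "title": "Generated sensor alerts",
        "panels": [panel_for(s, i + 1) for i, s in enumerate(SENSORS)],
    },
    "overwrite": True,
}

resp = requests.post(
    f"{GRAFANA_URL}/api/dashboards/db",
    json=payload,
    headers={"Authorization": f"Bearer {API_KEY}"},
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```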

Regarding the NoData case, it's unclear to me what the NoData state is meant for. It's nice to have a separate state if it can trigger a different notification, or at least a different message in the same notification. But even then, one may want to check for out-of-bounds values every hour for the last hour and check for no data on different time periods, so it might end up in separate alerts anyway. I think the NoData state in Grafana right now is just a way to tell Grafana what to do about an alert if the data is missing; it is not exactly meant to create a dedicated no-data alert. For instance, you can't just create a simple no-data alert using this; you need a condition. To create a simple no-data alert, you must use a count <= 0 condition, and then you don't need the NoData state, I guess. And if I set the state to Alerting on no data, I get a message about an error with no faulty value in the report, from which I deduce it is a no-data issue, but it is not very explicit.

Hope this helps and I'm not just adding to the noise.

Thanks for all your work anyway. Grafana is a great tool and we're happy with it. I'm afraid I'm gonna have to implement alerts with a separate InfluxDB client, but I'm eager to see what comes out of the alert module rework. Please keep us posted.

@tgritter
Copy link

tgritter commented Feb 2, 2021

Here's my use case. I have a graph monitoring Disk Usage and we want one non-critical alert to be sent to a particular Slack channel when the disk usage is at 50%, and a critical alert sent to another Slack channel when the usage is above 80%. Thus I'm also someone who would love multiple alerts per graph!

@jloiselle
Copy link

So I have a panel that tracks the number of processes running for 4 instances, each with their own query (a-d).

I now want to set up separate alerts for each instance to make sure the number of processes is above a certain value. It's my understanding that I cannot have separate alert triggers for these without this feature; is that correct?

@Var091
Copy link

Var091 commented Feb 12, 2021

Hello everyone. My team is developing a solution that uses Grafana as a metric viewer, and we have hit this same blocker. We need alerts for metrics in different states on the same chart, such as:

Disk <=50% = OK (Green)
Disk >=65% = Warning (Yellow) -> Teams alert.
Disk >=80% = Severe (Orange) -> Call to Ops team
Disk >=90% = Critical (Red) -> Call to Ops team

Thanks!

@oldhamjosh
Copy link

I am also bumping into this issue. We have thousands of servers, and I want to monitor the disk space on them without creating thousands of dashboards with one host per dashboard. We are currently leveraging the alert API to grab this data, but are bumping into the issue that the database doesn't update unless the state of the alert changes. So if host A causes an alert and 10 minutes later host B causes an alert while host A hasn't cleared, I am not paged for host B's issue. Is there any hope this issue will be resolved, or should we look at other solutions for alerting capabilities?

@alexfouche
Copy link

I don't get it... why do people keep commenting or requesting here?
It must be obvious by now, after all this time, that it will never be done. Take a look at the initial date: 2017!

@anthosz
Copy link

anthosz commented Mar 6, 2021

I don't get it... why do people keep commenting or requesting here?
It must be obvious by now, after all this time, that it will never be done. Take a look at the initial date: 2017!

🤷‍♂️🤦‍♂️
Check the messages just above... you will understand why 😉

@tw-bert
Copy link

tw-bert commented May 28, 2021

@kylebrandt Just to summarize:

Alerting NG (NextGen), planned for Grafana 8, will support multiple alert instances from a single alert definition.

Looking forward to Grafana version 8. Great stuff.

@kylebrandt
Copy link
Contributor

@kylebrandt Just to summarize:

Alerting NG (NextGen), planned for Grafana 8, will support multiple alert instances from a single alert definition.

Looking forward to Grafana version 8. Great stuff.

Yes, this is still true. More info coming with GrafanaCONline and the v8 release (and the docs that will go with it).

@kylebrandt
Copy link
Contributor

kylebrandt commented Jun 8, 2021

The new beta version of alerting in Grafana 8 (opt-in with "ngalert" feature toggle) supports "multi-dimensional" alerting based on labels. So one can have multiple alert instances from a single rule. Alert rules no longer live in dashboards, but are their own entities assigned to folders.

Documented under multi-dimensional rule in https://grafana.com/docs/grafana/latest/alerting/unified-alerting/alerting-rules/create-grafana-managed-rule/#query (sorry for broken images, fix in progress).

For example:

[screenshot: example alert rule definition]

Would create alert (instances) per device,instance,job:

[screenshot: resulting alert instances]

Note: this may not work with all backend data sources yet. Notably, the Graphite data source is not label based (I don't think we return labels for Graphite, which does support labels, yet, and there is nothing to extract labels). That being said, we can address both of these in the future (support labels with the data source, and/or extract labels from parts of the metric name).

Demos etc. regarding the new v8 alerting will be in the GrafanaCONline session (online streaming) on June 16, 2021: https://grafana.com/go/grafanaconline/2021/alerting/
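
To spell out the "one rule, many instances" idea in plain terms, here is a schematic sketch (plain Python, not Grafana's implementation; the label sets and threshold are invented): each distinct label set in the reduced query result becomes its own alert instance with its own state.

```python
# Schematic illustration (not Grafana source) of label-based "multi-dimensional"
# alerting: a single rule evaluated against labelled series yields one alert
# instance, each with its own state, per distinct label set.
from typing import Dict, FrozenSet, Tuple

Labels = FrozenSet[Tuple[str, str]]

def evaluate_rule(reduced: Dict[Labels, float], threshold: float) -> Dict[Labels, str]:
    """One rule, many instances: each label set gets its own Alerting/Normal state."""
    return {
        labels: ("Alerting" if value > threshold else "Normal")
        for labels, value in reduced.items()
    }

if __name__ == "__main__":
    # Hypothetical reduced values keyed by (device, instance, job) labels.
    reduced = {
        frozenset({("device", "sda"), ("instance", "web-1"), ("job", "node")}): 0.95,
        frozenset({("device", "sdb"), ("instance", "web-1"), ("job", "node")}): 0.40,
        frozenset({("device", "sda"), ("instance", "db-1"), ("job", "node")}): 0.88,
    }
    for labels, state in evaluate_rule(reduced, threshold=0.90).items():
        print(dict(labels), "->", state)
```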

@lusid
Copy link

lusid commented Jun 9, 2021

Great progress on this!

I've been able to get multidimensional conditions working correctly in the GUI (as shown in the image below), but notifications never get sent. I've only tried with Prometheus and Elasticsearch datasources, and with email and webhook contact points. I'm pretty sure everything is configured correctly as notifications are sent when I switch to using "classic condition" type.

[screenshot: multi-dimensional conditions configured in the GUI]

Am I wrong in thinking that this should trigger one or more notifications?

Edit: Apparently, I received an email as I was writing this (with extremely strange formatting), but no webhook notification. So at the moment, it seems like email notifications work with Prometheus, but using either the Elasticsearch data source and/or webhook notifications doesn't work at all. Is that expected?

@kylebrandt
Copy link
Contributor

@lusid Thank you for trying out the feature, for the feedback, and for noticing the progress. You are among the first outside of Grafana to try the feature 🏅

Everything you mentioned is meant to work, so I think your understanding of the features is correct.

Each of the issues you mention sounds like it could potentially be a bug or a UX/doc issue:

  • Multi-dimensional alerts not working with elastic
  • Email formatting issues
  • Webhook notifications not working correctly with multi-dimension alerts

Will try to replicate some of the behaviors. If you happen to have the time for more beta testing... can you try to localize the issues further (minimal configuration, check server logs with debugging) so that we can create some new, more specific issues?

@lusid
Copy link

lusid commented Jun 9, 2021

@kylebrandt I would be happy to help in any way I can!

I set up a completely independent Docker container with Grafana tagged 8.0.0 on my local machine, reused only the data sources from production, and set up a minimal test case, and the issue remained. It seems like a Reduce/Math combination with either the Elasticsearch data source or the webhook contact point, even on the simplest query, is presented perfectly in the GUI as expected, but notifications are never sent.

I have tried looking at the logs, but I mostly don't see anything that stands out as unusual. Would it be beneficial to run it in my local environment in development mode? Will that produce additional logs that might be useful?

@lusid
Copy link

lusid commented Jun 9, 2021

As for the email I receive when it does work (with Prometheus data source and Email contact point), it looks like this (as received through Gmail):

[image: alert notification email as rendered in Gmail]

@gabrielmcf
Copy link

Great feature! Congratulations! I'm also testing the new multi-dimensional alerts and so far I'm really liking it!

@lusid
Copy link

lusid commented Jun 9, 2021

I have also tested it now with Google Hangouts Chat contact point. Same outcome.

@gabrielmcf
Copy link

I've tested using Azure Monitor and Prometheus. Both fired and sent the message to Slack.

@tgritter
Copy link

@kylebrandt Thanks so much Kyle and the rest of the Grafana team! This feature is very much appreciated! You've definitely made my life easier! Thanks again and I'll try to convince my boss to pay you more!

@cilerler
Copy link

@kylebrandt I'm experiencing the same issue @lusid is experiencing. Should we create a new issue? How can we get attention?

@kylebrandt
Copy link
Contributor

@kylebrandt I'm experiencing the same issue @lusid is experiencing. Should we create a new issue? How can we get attention?

A new issue would be great. There are a fair number of pieces and things to be fixed, so it's good to track them independently to make sure they go through issue triage etc.

@sgpinkus
Copy link

sgpinkus commented May 17, 2022

@kylebrandt thanks for your work on this.

I found the documentation you linked to a bit confusing. I can't really make sense of it.

The implementation is also a bit confusing. The docs state there are 3 types of "multi-dimensional" condition: reduce, math, resample.

  • Resample is not explained at all in the docs and I can't comprehend what it's about.
  • Math actually only applies to other conditions, so it's like a meta condition and not specifically about "multi-dimensionality" (but it turns out you actually need it, because the Reduce condition does not support a conditional, unlike classic).
  • Reduce seems to apply an aggregation function (min, max, sum, count, avg) across all series returned (as opposed to the classic condition, which, somewhat counter-intuitively, applies a condition independently to each individual series returned).

In my testing with Reduce, an alert is generated for each series even though they are all tied to the same underlying condition (which seems strange). In any case, I don't think the current Reduce functionality satisfies what this issue was asking for. The reduce aggregation could/should be implemented at the query level with an aggregation function. My understanding is that this issue asks for a distinct, independently tracked alert for each series returned by the query.

One simple way to achieve that at the UI level would be a switch that changes the semantics of the classic condition to apply to any series (the current classic semantics) or to each. The new Reduce and Math functionality could still be kept in addition to each, but I think each is more valuable, since the reduction can be applied at the query level and should be supported by most query languages.
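
A tiny sketch of the "any vs. each" distinction suggested above (plain Python, not Grafana's evaluator; the series and threshold are invented):

```python
# Pure illustration of "any" vs "each" semantics, not Grafana's actual evaluator.
from typing import Dict

def classic_any(series: Dict[str, float], threshold: float) -> bool:
    """Classic-style: the single rule fires if ANY series violates the threshold."""
    return any(value > threshold for value in series.values())

def per_series_each(series: Dict[str, float], threshold: float) -> Dict[str, bool]:
    """Proposed "each" semantics: every series is tracked as its own alert state."""
    return {name: value > threshold for name, value in series.items()}

if __name__ == "__main__":
    latencies = {"checkout": 1.8, "search": 0.2, "login": 0.4}
    print(classic_any(latencies, threshold=1.0))      # True  -> one shared state
    print(per_series_each(latencies, threshold=1.0))  # {'checkout': True, 'search': False, 'login': False}
```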
