[Feature request] Multiple alerts per graph #7832
Comments
Concrete use case: I have instrumented my app to record a histogram in Prometheus for each major function (e.g. where an external HTTP call or disk I/O takes place) and would like to alert when any of these becomes slow. Presently I have to define dummy graphs for this because of the 1:1 relationship between graph and alert. It would be much more logical to keep the alerts defined in the same place as the graph itself.
And you cannot define that in one query?
No; a chain of
Then it makes more sense to have separate panels for the alerts, if you want separate alert rule names, messages, etc.
Yep, that's exactly what I'm doing at the moment. Is there any likelihood of implementing multiple alerts per graph in the near future so I can move away from this workaround?
it's very unlikely
maybe if there's huge demand for it :)
haha OK - I'll see if I can rustle up an angry mob ;) Seriously tho', thanks for the honesty.
Ok, we have a mob of two :-) I'm graphing fuel levels in multiple tanks & wanted to set up a low fuel alert for each tank.
and each tank has different thresholds or notifications?
Exactly. One is a 285 gal heating oil tank. I wanted to set up a "heating oil low" alert when that tank goes below 70 gal. The other is a 500 gal propane tank; for that I wanted a "propane low" alert when it goes under 100 gal. I set up singlestats for each, but alerts are not available in a singlestat.
I have a graph with a median and a 90th percentile metric. I'd like to get an alert on each. In order to do this, I have to create one graph for each. Then, if I want warning and critical alerts for each, I have to create a second graph for each. I have 30 or 40 services to monitor, each with 2 to 5 key metrics. I have graphs where I plot the same metric for multiple customers, and while I don't have to do alerts per customer (yet), it does add to the number of metrics I'd like to have alerts on. The amount of work to create dozens of graphs expands very quickly. It would be very useful in my current production environment (and in my previous production environments) to have warning and critical alerts, and to display multiple metrics in a single graph and alert on them.
I'd also like to see this feature. A good example is one alert if a metric goes outside of a threshold and another alert if data fails to update, i.e., if a value goes too high or if values fail to report. This could be used to show that whatever is reporting the data has encountered an issue that is preventing communication with Grafana (or whatever backend).
Hi Torkelo! I got several "likes" for the feature! Will it be in the next release =) ?
@rmsys maybe at some point. Solving it from a UX perspective and a code complexity (and UX complexity) perspective will take time. It's not on any roadmap yet, but maybe next year as the alerting engine matures and a UX design for this is worked out.
Another good use case for multiple alerts is to have different severity thresholds with different actions. If a server starts to exhibit slowdowns, an email might be sufficient, but if the slowdowns become extreme, it might be worth paging the administrator.
I have a graph that returns a metric with the value of
Not sure I understand what you mean by this. Can you elaborate?

Can you describe how multiple alerts per graph would work and look? What would the annotations say, and what would the green/red heart beside the panel title show (if, say, 2/5 alert rules were firing)? Would you want to share something between the alert rules, or would they be completely isolated (besides living in the same graph panel and possibly referring to the same queries)? How would you visualize thresholds when you have multiple alert rules? Would they show up as separate rules in the alert rules page & alert list panel? Then you need a way to navigate to a specific instance of a rule and not just to the alert tab.

Grafana is a visual tool, and we have chosen to tie an alert rule to a graph so that the alert rule state can be visualized easily (via the metrics, thresholds & alert state history). I am afraid that having each graph represent multiple alert rules will complicate this to a very large extent, and I am not sure about the need for this.

@rssalerno having support for alert rules in the singlestat panel seems unrelated to this issue. @alex-phillips your scenario sounds like it can be solved by making individual alert rules more flexible.

Does someone have some concrete examples where this would be good? So far I just see it ending up as a confusing graph with 2-5 thresholds where you do not know which metric each relates to, and alert history annotations where you also do not know which alert rule they came from (without hovering).
I think multiple alert rules would be annotated individually. Hearts might be colour-coded. Rules would need to be named for differentiation in alerts/panels.
Generally I would think not, though I suspect groups would need to have a shared threshold and name, if they were implemented (per #6557 (comment)).
If rules take an additional colour param, thresholds can be rendered using that and differentiated as such; you'd probably want a tooltip also. Being able to toggle rules would be useful, and a param to render a specific rule takes care of the latter, I think?
I believe you'll find he was referring to the graph below that, though since he has separate panels for each tank, singlestat alerting may solve his problem for that specific dashboard.
Primarily, I'd like this to support #6557 and #6553, and multiple thresholds, similar to @alex-phillips. For example, one use-case we have for #6557 is to alert differently for different environments (
I like the approach suggested by @pdf. Further, the approach to showing annotations would be the same as the current case where you have an alert rule with > 1 conditions (each having a different threshold). And the green/red heart beside the panel title would be shown as red if there is at least one alert firing, similar to the current scenario where at least one condition of an alert rule evaluates to true. And probably also show the number (2/5) along with the red heart in the title.
In most of our use cases, these rules would not share anything between them, and the queries are also different.
They would show up as separate rules in the alerts page. The Alert tab would probably have a list of the alerts defined. Right, we would need to highlight/expand the specific alert rule on this tab when the alert rule URL (which should capture the alert id or index) is accessed from the notification. Seems to be easily solvable. In the alert list panel there wouldn't be any change: it shows all of them separately. Semantically, each alert is separate; it has just been placed in the same panel.
Considering that a lot of people have upvoted this feature, it would definitely be useful. If we have support for multiple alerts, then I think it would be up to each user's perception whether it's confusing or not. IMHO, those who think it is confusing would go with the current approach of separate panels for each graph, and those who think the utility/convenience of having the same panel used for visualization and alerting outweighs the perceived confusion will go the multiple-alerts way. Sure, it would change the UX somewhat.
In Splunk we have high/low alerts. If multiple alerts were available in Grafana, we'd just use the same search; they are just different thresholds against the same search.
+1 for this feature.
+1 for this. Our use case is as follows: we want to define one chart with, say, CPU usage for all of our servers. Then on that same chart we will make two hidden metrics, one for CPU usage on production servers and one for CPU usage on non-production servers. Each of those metrics would have its own alert, with different notification channels. We do not want to have to create multiple charts or panels or dashboards to accomplish this.
+1 for this feature.
Came here after reading some of the other issues regarding categories and severities. I agree all alerts should be actionable. But there is a difference between a "fix this first thing in the morning" alert and a "call out the $400/hour consultant ASAP" alert. As many have mentioned, this is most commonly solved by Warning and Critical thresholds. Technically this could be implemented in a bunch of ways: labels, several alerts per panel, several thresholds per alert, etc. Regarding confusion if the categorization gets too complex, a Warning/Critical setup can simply use Red/Yellow; Red overrides Yellow. For more complex setups, another option besides hovering to locate the offending time series could be a flashing line/area/whatever? That could draw attention to the correct time series easily. I think most users would be satisfied by a fairly simple Warn/Crit separation though.
This is an absolute must for alerting software, especially for server monitoring. Disk space, memory, CPU usage, temperature, load average... all prime examples where one would want multiple alerts configured with different messages and different thresholds. Take disk space, for example: you need one alert for disk usage over 70%, another for disk usage over 90%.
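The warn/critical split described in the last few comments reduces to evaluating one query against two thresholds, with a different action per level. A minimal sketch of that logic (the function name and defaults are illustrative, not part of any Grafana API):

```python
def disk_severity(used_pct: float, warn: float = 70.0, crit: float = 90.0) -> str:
    """Classify a disk-usage sample against two thresholds.

    Mirrors the two-alert setup described above (one alert over 70%,
    another over 90%); names and defaults are illustrative only.
    """
    if used_pct >= crit:
        return "critical"
    if used_pct >= warn:
        return "warning"
    return "ok"
```

Each severity could then route to its own notification channel (say, email for warning, paging for critical), which is exactly what a single alert rule per panel cannot express.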
Bit of an edge case, but we are using the alerts to notify us if a product hasn't sold in a few days. We have each product as a metric, which in turn means we only get one alert when one of the metrics enters the alerting threshold. Ideally we would like to receive an alert if any additional metric enters the alerting threshold as well. Also, we are using templating vars to repeat a graph for each selected product with two metrics overlaid (volume and gross margin) on the left and right y axes. This kills any chance of using alerting, as the alert query isn't picking up the
To work around this I've tried having another query
For example, which is quite spammy.
Hello Grafana community, the Grafana team has picked up the work on Alerting and we're in the process of redesigning it to make the best possible alerting experience happen 🔥 🚀 We would love to find out more about your needs as our beloved users. So if any of you are willing to have a 30-minute interview with me, please just send me an empty e-mail and I will get in touch.

Update: I got so many e-mails in such a short time, you all rock! I'll be reaching out to everyone who sent e-mails; we have enough interviewees now, thank you <3
Jess,
Unfortunately I don't have the time, however I'm really pleased to see the customer-focussed response.
Our requirement relates to battery voltages - these are non-commercial, domestic systems. We are plotting 16 cell voltages (16 queries) on each graph. We currently perform an OR operation to detect data outside the permissible range on each cell and generate a single alert for that battery, but we can't indicate which cell(s) are out of range. It would be nice to be able to generate alerts from each cell.
Best wishes,
Eric.
Looks like I'm late for the interview. Here's a use case, anyway.

Building monitoring, many sensors. We want to receive alerts if data is missing or out of bounds (ideally a different alert). (I'd also be interested in alerts such as impossible change rate, like a room temperature going from 15 to 30 in a minute, but that is secondary and should be achievable with thresholds on a derivative query.)

I have a table (.csv) defining for each sensor the expected frequency (which defines a NoData time) and the min and max bounds, if any. I could also work with families of sensors, like sensors with tag "temperature" -> [0°C; 50°C], freq=120m, tag "elec power" -> ... etc.

Ideally, I'd like to be able to launch a single query with a groupby(id) to have each sensor as a separate series, then create an alert that would use parameters inferred from tags to check bounds and NoData, returning an explicit alert for each issue. I reckon the tags -> parameters mapping is a bit too specific, so I would be happy if I could just manually create an alert per sensor family (e.g. one alert for temperature sensors, another for elec meters, etc.) with parameters set in the alert.

In any case, I don't want an all-or-nothing alert. I need an alert state on each sensor. But having it in a single alert, or having some concept wrapping several alerts to get a single report, would be a killer feature, as getting one email per sensor the day the gateway falls over is not ideal.

Since alerts can't be parametrized this way, I need to create many alerts. Too many to do it manually. I can use the API to create them programmatically, but there are still limitations. I still can't parametrize the alert, so it has to be one per sensor, with no global report. A concept of a single report on multiple alerts would be nice. And I can't set two alerts on the same panel, so I have to create a panel for each alert.
This limitation seems arbitrary and, without any knowledge of the code, I naively assume it shouldn't be too difficult to improve. Then one could have a panel with multiple queries and an alert for each query, for instance. Having to create multiple panels is totally feasible with the API, but it's kinda lame to use dummy panels just to set alerts. I'm confident the alerts module redesign will address this.

Regarding the NoData case, it's unclear to me what the NoData state is meant to do. It's nice to have a separate state if it can trigger a different notification, or at least a different message on the same notification. But even then, one may want to check for out-of-bounds values every hour for the last hour, and check for no data on different time periods, so it might end up in separate alerts anyway. I think the NoData state in Grafana right now is just a way to tell Grafana what to do about an alert if the data is missing; it is not really meant to create a dedicated no-data alert. For instance, you can't create a simple no-data alert using this alone, you need a condition. To create a simple no-data alert, you must use a count <= 0 condition, and then you don't need the NoData state, I guess. And if I set the state to Alerting on no data, I get a message about an error with no faulty value in the report, from which I deduce it is a no-data issue, but it is not very explicit.

Hope this helps and I'm not just adding to the noise. Thanks for all your work anyway. Grafana is a great tool and we're happy with it. I'm afraid I'm going to have to implement alerts with a separate InfluxDB client, but I'm eager to see what comes out of the alert module rework. Please keep us posted.
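For what it's worth, the one-dummy-panel-per-alert workaround described above can be scripted against the dashboard HTTP API (`POST /api/dashboards/db`). The sketch below is a rough illustration, not a complete legacy dashboard JSON model; the URL, API key, query text, and field values are all placeholders:

```python
import json
import urllib.request

GRAFANA_URL = "http://localhost:3000"  # assumed local instance
API_KEY = "REPLACE_ME"                 # placeholder, not a real key

def panel_for_sensor(sensor_id, panel_id):
    # One dummy graph panel per sensor, each carrying its own alert --
    # the workaround described above. Fields are illustrative and far
    # from exhaustive; a real panel needs a datasource, conditions, etc.
    return {
        "id": panel_id,
        "type": "graph",
        "title": f"sensor {sensor_id}",
        "targets": [{"refId": "A",
                     "query": f"SELECT value FROM sensors WHERE id = '{sensor_id}'"}],
        "alert": {"name": f"sensor {sensor_id} out of bounds"},
    }

def push_dashboard(sensor_ids):
    # Build one dashboard holding a panel (and thus an alert) per sensor,
    # then POST it to Grafana's dashboard endpoint.
    dashboard = {
        "title": "per-sensor alerts",
        "panels": [panel_for_sensor(s, i + 1) for i, s in enumerate(sensor_ids)],
    }
    body = json.dumps({"dashboard": dashboard, "overwrite": True}).encode()
    req = urllib.request.Request(
        f"{GRAFANA_URL}/api/dashboards/db",
        data=body,
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {API_KEY}"},
    )
    return urllib.request.urlopen(req)  # raises on HTTP errors
```
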
Here's my use case. I have a graph monitoring disk usage, and we want one non-critical alert sent to a particular Slack channel when the disk usage is at 50%, and a critical alert sent to another Slack channel when the usage is above 80%. So I'm also someone who would love multiple alerts per graph!
So I have a panel that tracks the number of processes running for 4 instances, each with their own query (A-D). I want to set up separate alerts for each instance to make sure the number of processes is above a certain value. It's my understanding that I cannot have separate alert triggers for these without this feature; is that correct?
Hello everyone, in my team we are developing a solution in which we use Grafana as a metric viewer, and we are also hitting this same blocker. We need to have alerts for the metrics in different states using the same chart, such as: Disk <= 50% = OK (Green). Thanks!
I am also bumping into this issue. We have thousands of servers, and I want to monitor the disk space on them, versus creating thousands of dashboards with one host per dashboard. We are currently leveraging the alert API to grab this data, but are bumping into the issue that the database doesn't update unless the state of the alert changes. So if host A causes an alert and 10 minutes later host B causes an alert, but host A hasn't cleared, I am not paged for host B's issue. Is there any hope this issue will be resolved, or should we look at other solutions for alerting capabilities?
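Until alert state is tracked per host, one workaround for the gap described above is to poll the alert API and diff the firing set yourself, so host B still gets noticed while host A is firing. A small sketch (the state strings and polling shape are assumptions, not the API's exact schema):

```python
def newly_firing(previous, current):
    """Return hosts that entered the 'alerting' state since the last poll.

    `previous` and `current` map host -> state, as one might build from
    periodic polls of an alert-listing API; the 'alerting' label is an
    assumption about the state names. Tracking the delta ourselves works
    around notifications only being emitted on rule state changes.
    """
    return {host for host, state in current.items()
            if state == "alerting" and previous.get(host) != "alerting"}
```

Usage: on each poll, page only for `newly_firing(last_states, states)`, then replace `last_states` with `states`.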
I don't get it... why do people keep commenting or requesting here?
🤷♂️🤦♂️ |
@kylebrandt Just to summarize:
Looking forward to Grafana version 8. Great stuff. |
Yes this is still true. More info coming with Grafanacon and the v8 release (and the docs that will go with it). |
The new beta version of alerting in Grafana 8 (opt-in with the "ngalert" feature toggle) supports "multi-dimensional" alerting based on labels. So one can have multiple alert instances from a single rule. Alert rules no longer live in dashboards, but are their own entities assigned to folders. Documented under "multi-dimensional rule" in https://grafana.com/docs/grafana/latest/alerting/unified-alerting/alerting-rules/create-grafana-managed-rule/#query (sorry for broken images, fix in progress). For example, a single rule would create alert instances per device, instance, and job.

Note: this may not work with all backend data sources yet. Notably, the Graphite data source is not label based (I don't think we return labels yet for the Graphite versions that do support labels, and there is nothing to extract labels). That being said, we can address both of these in the future (support labels with the data source, and/or extract labels from parts of the metric name).

Demos etc. regarding the new alerting in v8 will be in the GrafanaCONline session (online streaming) on June 16, 2021: https://grafana.com/go/grafanaconline/2021/alerting/
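Conceptually, the multi-dimensional behaviour described above turns each distinct label set returned by the query into its own alert instance. A toy model of that evaluation, not Grafana's actual implementation (the label names and data shape are illustrative):

```python
def alert_instances(samples, threshold):
    """Return one firing alert instance per labelled series whose value
    exceeds the threshold -- a toy model of multi-dimensional alerting,
    where a single rule yields many instances.

    `samples` maps a label set (a frozenset of (key, value) pairs) to
    that series' latest value.
    """
    return [dict(labels) for labels, value in samples.items() if value > threshold]
```
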
Great progress on this! I've been able to get multi-dimensional conditions working correctly in the GUI, but notifications never get sent. I've only tried with Prometheus and Elasticsearch data sources, and with email and webhook contact points. I'm pretty sure everything is configured correctly, as notifications are sent when I switch to the "classic condition" type. Am I wrong in thinking that this should trigger one or more notifications?

Edit: Apparently, I received an email as I was writing this (with extremely strange formatting), but no webhook notification. So at the moment it seems like email notifications work with Prometheus, but using either the Elasticsearch data source and/or webhook notifications doesn't work at all. Is that expected?
@lusid Thank you for trying out the feature, for the feedback, and for noticing the progress. You are among the first outside of Grafana to try the feature 🏅 Everything you mentioned is meant to work, so I think your understanding of the feature is correct. Each of the issues you mention sounds like it could potentially be a bug or a UX/doc issue:
Will try to replicate some of the behaviors. If you happen to have the time for more beta testing... can you try to localize the issues more (minimal configuration, check server logs with debugging) to be able to create some new, more specific issues?
@kylebrandt I would be happy to help in any way I can! I did set up a completely independent Docker container with Grafana tagged 8.0.0 on my local machine, reused only the data sources from production, and set up a minimal test case, and the issue remained. It seems like using a Reduce/Math combination with either the Elasticsearch data source or the webhook contact point, even on the simplest query, is presented perfectly in the GUI as expected, but notifications are never sent. I have tried looking at the logs, but I mostly don't see anything that stands out as unusual. Would it be beneficial to run it in my local environment in development mode? Will that produce additional logs that might be useful?
Great feature! Congratulations! I'm also testing the new multi-dimensional alerts and so far I'm really liking it!
I have also tested it now with a Google Hangouts Chat contact point. Same outcome.
I've tested using Azure Monitor and Prometheus. Both fired and sent the message to Slack.
@kylebrandt Thanks so much Kyle and the rest of the Grafana team! This feature is very much appreciated! You've definitely made my life easier! Thanks again and I'll try to convince my boss to pay you more!
@kylebrandt I'm experiencing the same issue @lusid is experiencing. Should we create a new issue? How can we get attention?
A new issue would be great. A fair number of pieces and things need to be fixed, so it's good to track them independently to make sure they go through issue triage etc.
@kylebrandt thanks for your work on this. I found the documentation you linked a bit confusing; I can't really make sense of it. The implementation is also a bit confusing. The docs state there are 3 types of "multi-dimensional" condition:
In my testing with
One simple way to achieve that at the UI level would be to have a switch that changes the semantics of the classic condition to apply
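The semantic difference raised in this comment, a classic condition that collapses every series into one rule state versus applying the condition to each series, can be sketched like this (function names are mine, not Grafana's):

```python
def classic_condition(series, predicate):
    """Collapse all series into a single alert state: the rule fires if
    any series satisfies the predicate (legacy-style behaviour)."""
    return any(predicate(values) for values in series.values())

def per_series_condition(series, predicate):
    """Apply the same predicate independently to each series, yielding
    one state per series -- the per-series semantics discussed above."""
    return {name: predicate(values) for name, values in series.items()}
```
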
As per http://docs.grafana.org/alerting/rules/, Grafana plans to track state per series in future releases.
But it seems like there can be use cases where we have graphs containing set of metrics for which different sets of alerts are required. This is slightly different from "Support per series state change" ( #6041 ) because
Grafana version = 4.x