Alerting docs: more vale (#86978)
brendamuir committed Apr 27, 2024
1 parent b77763b commit 7077a58
Showing 4 changed files with 24 additions and 25 deletions.
4 changes: 2 additions & 2 deletions docs/sources/alerting/fundamentals/alert-rules/_index.md
@@ -32,7 +32,7 @@ Grafana supports two different alert rule types: Grafana-managed alert rules and

## Grafana-managed alert rules

Grafana-managed alert rules are the most flexible alert rule type. They allow you to create alerts that can act on data from any of our [supported data sources](#supported-data-sources), and use multiple data sources in a single alert rule.
Grafana-managed alert rules are the most flexible alert rule type. They allow you to create alerts that can act on data from any of the [supported data sources](#supported-data-sources), and use multiple data sources in a single alert rule.

Additionally, you can add [expressions to transform your data][expression-queries], set custom alert conditions, and include [images in alert notifications][notification-images].

@@ -77,7 +77,7 @@ When choosing which alert rule type to use, consider the following comparison be

| <div style="width:200px">Feature</div> | <div style="width:200px">Grafana-managed alert rule</div> | <div style="width:200px">Data source-managed alert rule |
| ------------------------------------------------------------------------------ | ---------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Create alert rules<wbr /> based on data from any of our supported data sources | Yes | No. You can only create alert rules that are based on Prometheus-based data. |
| Create alert rules<wbr /> based on data from any of the supported data sources | Yes | No. You can only create alert rules that are based on Prometheus-based data. |
| Mix and match data sources | Yes | No |
| Includes support for recording rules | No | Yes |
| Add expressions to transform<wbr /> your data and set alert conditions | Yes | No |
@@ -20,17 +20,17 @@ weight: 104

# Queries and conditions

In Grafana, queries play a vital role in fetching and transforming data from supported data sources, which include databases like MySQL and PostgreSQL, time series databases like Prometheus, InfluxDB and Graphite, and services like Elasticsearch, AWS CloudWatch, Azure Monitor and Google Cloud Monitoring.
In Grafana, queries play a vital role in fetching and transforming data from supported data sources, which include databases like MySQL and PostgreSQL, time series databases like Prometheus, InfluxDB and Graphite, and services like Elasticsearch, Amazon CloudWatch, Azure Monitor, and Google Cloud Monitoring.

For more information on supported data sources, see [Data sources][data-source-alerting].
For more information on supported data sources, refer to [Data sources][data-source-alerting].

The process of executing a query involves defining the data source, specifying the desired data to retrieve, and applying relevant filters or transformations. Query languages or syntaxes specific to the chosen data source are utilized for constructing these queries.
The process of executing a query involves defining the data source, specifying the desired data to retrieve, and applying relevant filters or transformations. Queries are constructed in the query language or syntax specific to the chosen data source.

In Alerting, you define a query to get the data you want to measure and a condition that needs to be met before an alert rule fires.

An alert rule consists of one or more queries and expressions that select the data you want to measure.

For more information on queries and expressions, see [Query and transform data][query-transform-data].
For more information on queries and expressions, refer to [Query and transform data][query-transform-data].

## Data source queries

@@ -123,7 +123,7 @@ These functions are available for **Reduce** and **Classic condition** expressio

## Alert condition

An alert condition is the query or expression that determines whether the alert will fire or not depending on the value it yields. There can be only one condition which will determine the triggering of the alert.
An alert condition is the query or expression that determines whether the alert fires, based on the value it yields. There can be only one condition, and it determines whether the alert is triggered.

After you have defined your queries and/or expressions, choose one of them as the alert rule condition.
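
As an illustration only, and not Grafana source code, a threshold condition can be thought of as a comparison applied to the reduced value of each series, producing one alert instance per unique set of labels. The threshold and labels below are made-up values for the sketch:

```python
# Illustrative sketch, not Grafana source code: apply a threshold condition
# to the reduced value of each series; each label set is its own alert instance.

THRESHOLD = 1000  # made-up alert condition: fire when the value is above 1000

# Reduced query results: one value per unique label set (made-up data).
reduced = {
    ("host=web1",): 870.0,
    ("host=web2",): 1240.0,
    ("host=web3",): 1010.0,
}

for labels, value in reduced.items():
    state = "Alerting" if value > THRESHOLD else "Normal"
    print(labels, value, state)
```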

@@ -145,11 +145,11 @@ Grafana-managed alert rules are evaluated for a specific interval of time. Durin

It can be tricky to create an alert rule for a noisy metric, that is, when the value of a metric continually goes above and below a threshold. This is called flapping and results in a series of firing - resolved - firing notifications and a noisy alert state history.

For example, if you have an alert for latency with a threshold of 1000ms and the number fluctuates around 1000 (say 980 ->1010 -> 990 -> 1020, and so on) then each of those will trigger a notification.
For example, if you have an alert for latency with a threshold of 1000ms and the number fluctuates around 1000 (say 980 -> 1010 -> 990 -> 1020, and so on), then each of those triggers a notification.

To solve this problem, you can set a (custom) recovery threshold, which basically means having two thresholds instead of one. An alert is triggered when the first threshold is crossed and is resolved only when the second threshold is crossed.

For example, you could set a threshold of 1000ms and a recovery threshold of 900ms. This way, an alert rule will only stop firing when it goes under 900ms and flapping is reduced.
For example, you could set a threshold of 1000ms and a recovery threshold of 900ms. This way, an alert rule only stops firing when it goes under 900ms and flapping is reduced.
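
The following Python sketch (illustrative only, not Grafana's implementation) shows the effect of the recovery threshold on the fluctuating latency values from the example above:

```python
# Illustrative sketch of a recovery threshold, not Grafana's implementation.
# The alert starts firing when latency crosses 1000ms and only resolves
# once latency drops below the 900ms recovery threshold.

FIRING_THRESHOLD = 1000   # ms
RECOVERY_THRESHOLD = 900  # ms

def evaluate(firing: bool, value: float) -> bool:
    """Return True if the alert should be firing after this evaluation."""
    if not firing:
        return value > FIRING_THRESHOLD
    # Already firing: keep firing until the value drops below the recovery threshold.
    return value >= RECOVERY_THRESHOLD

firing = False
for latency in [980, 1010, 990, 1020, 890]:
    firing = evaluate(firing, latency)
    print(latency, "firing" if firing else "normal")
```

With a single 1000ms threshold, the same sequence would resolve and fire again on almost every evaluation; with the recovery threshold it fires once at 1010 and only resolves at 890.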

## Alert on numeric data

@@ -179,7 +179,6 @@ For a MySQL table called "DiskSpace":
| 2021-June-7 | web1 | /etc | 3 |
| 2021-June-7 | web2 | /var | 4 |
| 2021-June-7 | web3 | /var | 8 |
| ... | ... | ... | ... |

You can query the data filtering on time, but without returning the time series to Grafana. For example, an alert that would trigger per Host, Disk when there is less than 5% free space:

@@ -204,7 +203,7 @@ This query returns the following Table response to Grafana:
| web2 | /var | 4 |
| web3 | /var | 0 |

When this query is used as the **condition** in an alert rule, then the non-zero will be alerting. As a result, three alert instances are produced:
When this query is used as the **condition** in an alert rule, the rows with non-zero values are alerting. As a result, three alert instances are produced:

| Labels | Status |
| --------------------- | -------- |
@@ -18,17 +18,17 @@ weight: 250

Starting with Grafana 10, Alerting can record all alert rule state changes for your Grafana-managed alert rules in a Loki instance.

This allows you to explore the behavior of your alert rules in the Grafana explore view and levels up the existing state history modal with a powerful new visualisation.
This allows you to explore the behavior of your alert rules in the Grafana Explore view and levels up the existing state history dialog box with a powerful new visualization.

<!-- image here, maybe the one from the blog? -->

## Configuring Loki

To set up alert state history, make sure to have a Loki instance Grafana can write data to. The default settings might need some tweaking as the state history modal might query up to 30 days of data.
To set up alert state history, make sure to have a Loki instance Grafana can write data to. The default settings might need some tweaking as the state history dialog box might query up to 30 days of data.

The following change to the default configuration should work for most instances, but we recommend looking at the full Loki configuration settings and adjust according to your needs.
The following change to the default configuration should work for most instances, but look at the full Loki configuration settings and adjust according to your needs.

As this might impact the performances of an existing Loki instance, we recommend using a separate Loki instance for the alert state history.
As this might impact the performance of an existing Loki instance, use a separate Loki instance for the alert state history.

```yaml
limits_config:
@@ -38,7 +38,7 @@

## Configuring Grafana

We need some additional configuration in the Grafana configuration file to have it working with the alert state history.
Additional configuration is required in the Grafana configuration file for the alert state history to work.

The example below instructs Grafana to write alert state history to a local Loki instance:

@@ -56,7 +56,7 @@ enable = alertStateHistoryLokiSecondary, alertStateHistoryLokiPrimary, alertStat

## Adding the Loki data source

See our instructions on [adding a data source](/docs/grafana/latest/administration/data-source-management/).
Refer to the instructions on [adding a data source](/docs/grafana/latest/administration/data-source-management/).
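
If you prefer to script this step, the following sketch uses Grafana's HTTP API to create the data source; the Grafana URL, service account token, data source name, and Loki address are placeholders to adjust for your setup:

```python
# Sketch: create a Loki data source through the Grafana HTTP API.
# The URL, token, name, and Loki address below are placeholders.
import requests

GRAFANA_URL = "http://localhost:3000"
API_TOKEN = "<service-account-token>"

payload = {
    "name": "loki-alert-state-history",
    "type": "loki",
    "url": "http://localhost:3100",
    "access": "proxy",
}

response = requests.post(
    f"{GRAFANA_URL}/api/datasources",
    json=payload,
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    timeout=10,
)
response.raise_for_status()
print(response.json())
```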

## Querying the history

@@ -26,7 +26,7 @@ Grafana Alerting uses the Prometheus model of separating the evaluation of alert

{{< figure src="/static/img/docs/alerting/unified/high-availability-ua.png" class="docs-image--no-shadow" max-width= "750px" caption="High availability" >}}

When running multiple instances of Grafana, all alert rules are evaluated on all instances. You can think of the evaluation of alert rules as being duplicated by the number of running Grafana instances. This is how Grafana Alerting makes sure that as long as at least one Grafana instance is working, alert rules will still be evaluated and notifications for alerts will still be sent.
When running multiple instances of Grafana, all alert rules are evaluated on all instances. You can think of the evaluation of alert rules as being duplicated by the number of running Grafana instances. This is how Grafana Alerting makes sure that as long as at least one Grafana instance is working, alert rules are still evaluated and notifications for alerts are still sent.

You can find this duplication in the state history, and it is a good way to confirm whether you are using high availability.

@@ -36,8 +36,8 @@ The Alertmanager uses a gossip protocol to share information about notifications

{{% admonition type="note" %}}

If using a mix of `execute_alerts=false` and `execute_alerts=true` on the HA nodes, since the alert state is not shared amongst the Grafana instances, the instances with `execute_alerts=false` will not show any alert status.
This is because the HA settings (`ha_peers`, etc), only apply to the alert notification delivery (i.e. de-duplication of alert notifications, and silences, as mentioned above).
If you use a mix of `execute_alerts=false` and `execute_alerts=true` on the HA nodes, the instances with `execute_alerts=false` do not show any alert status, because the alert state is not shared amongst the Grafana instances.
This is because the HA settings (`ha_peers`, and so on) only apply to alert notification delivery, that is, de-duplication of alert notifications and silences, as mentioned above.

{{% /admonition %}}

@@ -61,13 +61,13 @@ Since gossiping of notifications and silences uses both TCP and UDP port `9094`,
As an alternative to Memberlist, you can use Redis for high availability. This is useful if you want to have a central
database for HA and cannot support the meshing of all Grafana servers.

1. Make sure you have a redis server that supports pub/sub. If you use a proxy in front of your redis cluster, make sure the proxy supports pub/sub.
1. Make sure you have a Redis server that supports pub/sub. If you use a proxy in front of your Redis cluster, make sure the proxy supports pub/sub (a quick check is sketched after this list).
1. In your custom configuration file ($WORKING_DIR/conf/custom.ini), go to the [unified_alerting] section.
1. Set `ha_redis_address` to the redis server address Grafana should connect to.
1. Set `ha_redis_address` to the Redis server address Grafana should connect to.
1. [Optional] Set the username and password if authentication is enabled on the Redis server using `ha_redis_username` and `ha_redis_password`.
1. [Optional] Set `ha_redis_prefix` to something unique if you plan to share the Redis server with multiple Grafana instances.
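
As a quick, optional sanity check for step 1, the sketch below uses the `redis` Python client to confirm that pub/sub works through whatever address you plan to use for `ha_redis_address`; the host, port, and channel name are placeholders:

```python
# Sketch: confirm that pub/sub works against the Redis endpoint (or proxy)
# you intend to use for `ha_redis_address`. Host, port, and channel are placeholders.
import redis

client = redis.Redis(host="localhost", port=6379)
pubsub = client.pubsub()
pubsub.subscribe("grafana-ha-check")      # fails if pub/sub is unsupported

client.publish("grafana-ha-check", "ping")

print(pubsub.get_message(timeout=1))      # subscription confirmation
print(pubsub.get_message(timeout=1))      # the "ping" message
```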

The following metrics can be used for meta monitoring, exposed by Grafana's `/metrics` endpoint:
The following metrics can be used for meta monitoring, exposed by the `/metrics` endpoint in Grafana:

| Metric | Description |
| ---------------------------------------------------- | -------------------------------------------------------------------------------------------------------------- |
@@ -83,7 +83,7 @@ The following metrics can be used for meta monitoring, exposed by Grafana's `/me

## Enable alerting high availability using Kubernetes

1. You can expose the pod IP [through an environment variable](https://kubernetes.io/docs/tasks/inject-data-application/environment-variable-expose-pod-information/) via the container definition.
1. You can expose the Pod IP [through an environment variable](https://kubernetes.io/docs/tasks/inject-data-application/environment-variable-expose-pod-information/) via the container definition.

```yaml
env:
@@ -115,7 +115,7 @@ The following metrics can be used for meta monitoring, exposed by Grafana's `/me
fieldPath: status.podIP
```

1. Create a headless service that returns the pod IP instead of the service IP, which is what the `ha_peers` need:
1. Create a headless service that returns the Pod IP instead of the service IP, which is what the `ha_peers` need:

```yaml
apiVersion: v1
