Add elasticsearch alerting #11380

Merged: 32 commits, Jun 1, 2018

Conversation

wph95
Contributor

@wph95 wph95 commented Mar 26, 2018

Fixes #5893

wph95 and others added 7 commits March 23, 2018 23:18
upgrade from grafana/grafana
wip
Signed-off-by: wph95 <wph657856467@gmail.com>
Signed-off-by: wph95 <wph657856467@gmail.com>
Signed-off-by: wph95 <wph657856467@gmail.com>
- add some test
@CLAassistant

CLAassistant commented Mar 26, 2018

CLA assistant check
All committers have signed the CLA.

@codecov-io

codecov-io commented Mar 26, 2018

Codecov Report

Merging #11380 into master will decrease coverage by 0.17%.
The diff coverage is 46.49%.

@@            Coverage Diff             @@
##           master   #11380      +/-   ##
==========================================
- Coverage    51.9%   51.72%   -0.18%     
==========================================
  Files         359      365       +6     
  Lines       26066    26509     +443     
  Branches     1509     1556      +47     
==========================================
+ Hits        13530    13713     +183     
- Misses      11796    12030     +234     
- Partials      740      766      +26

@pbuentam

pbuentam commented Apr 2, 2018

I have downloaded this, but it seems that it doesn't work with template-based queries. Do you plan to add this?

Kind regards

@wph95
Contributor Author

wph95 commented Apr 2, 2018

@pbuentam
I don't plan to support template variables in this PR; see #6557.

@Knaky41

Knaky41 commented Apr 16, 2018

Hi,

Did you run any additional tests? Is it working fine?

Thanks for your support,

Cheers.

@Knaky41

Knaky41 commented May 2, 2018

Do you have any news about the tests you ran?

Cheers.

@daniellee daniellee self-assigned this May 2, 2018
@marefr marefr self-assigned this May 3, 2018
@pbuentam

pbuentam commented May 3, 2018 via email

@beriba

beriba commented May 3, 2018

@pbuentam I observed similar behavior in alerting on the Influx data source but didn't have time to look into it further. So that may be a bug on a totally different level. I'm not saying that's the case for you, but it may be a hint.

@marefr marefr mentioned this pull request May 15, 2018
@marefr
Member

marefr commented May 15, 2018

@wph95 just a heads up. We've started reviewing this and have begun refactoring work. If you didn't change the default setting for Allow edits from maintainers when you created the pull request, we'll push our changes to your fork. If you did, we'll need to branch off and create a separate PR.

Our hope is to be able to merge this to master soon.

@wph95
Contributor Author

wph95 commented May 15, 2018

@marefr yep, I didn't change the Allow edits from maintainers setting,
and I will spend more time on this PR to help you merge it to master soon :)

@wph95
Contributor Author

wph95 commented May 15, 2018

@pbuentam could you give more information about the problem you ran into?
E.g. I'd like to know which ES query you used,
so I can reproduce the problem :)
Maybe you use trim Datapoints?

@dcloud9

dcloud9 commented May 23, 2018

@marefr I'll have a go with this branch if it helps, as we need this feature and its absence is blocking us at the moment. How close are we to getting this merged to master and into a release version? A rough guesstimate would do.

@szaroubi

szaroubi commented May 24, 2018

@marefr,
(I am new here, fair warning, don't hesitate to guide me towards the appropriate procedure).

I was able to checkout the pull request, compile and run.
I got behaviour that I can't explain (which doesn't mean it is a bug).

Elasticsearch version

{
  "name" : "OX-PN6t",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "uLPK779kS-OPNE1dxcjeYQ",
  "version" : {
    "number" : "5.1.2",
    "build_hash" : "c8c4c16",
    "build_date" : "2017-01-11T20:18:39.146Z",
    "build_snapshot" : false,
    "lucene_version" : "6.3.0"
  },
  "tagline" : "You Know, for Search"
}

Grafana Panel
Type: Graph
Number of series: 2
Series A: Count with Filter group by Term + group by Date histogram with interval=20s
Series B: Count with same filter only group by Date histogram with interval=20s

Alert Config
When: max
Of: Query(B, 5m, now) Is Above 30

Notes

  1. The graph shows that the Series B goes above 30 (at least once) in past 5 minutes (view attached screenshots)
  2. Click on Test Rule
  3. View output of test rule
firing:false
state:"ok"
conditionEvals:"false = false"
timeMs:"5.822ms"
    logs:Array[2]
        0:Object
            message:"Condition[0]: Query Result"
            data:Array[1]
                0:Object
                    name:"Count"
                    points:Array[16]
                        0:Array[0,1527183240000]
                        1:Array[0,1527183260000]
                        2:Array[0,1527183280000]
                        3:Array[0,1527183300000]
                        4:Array[0,1527183320000]
                        5:Array[0,1527183340000]
                        6:Array[0,1527183360000]
                        7:Array[0,1527183380000]
                        8:Array[0,1527183400000]
                        9:Array[0,1527183420000]
                        10:Array[0,1527183440000]
                        11:Array[0,1527183460000]
                        12:Array[0,1527183480000]
                        13:Array[0,1527183500000]
                        14:Array[0,1527183520000]
                        15:Array[0,1527183540000]
        1:Object
            message:"Condition[0]: Eval: false, Metric: Count, Value: 0.000"
            data:null
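
As a rough illustration of the evaluation logged above, here is a minimal Python sketch of a "max" reducer followed by an "is above" check. It is not Grafana's implementation (the alerting engine is written in Go); the names reduce_max and is_above are hypothetical, and the points are rebuilt from the timestamps in the log output.

```python
from typing import Optional, Sequence, Tuple

# A point is (value, epoch millis), the same order as in the log output above.
Point = Tuple[Optional[float], int]


def reduce_max(points: Sequence[Point]) -> Optional[float]:
    """Return the largest non-null value, or None when there is no data."""
    values = [value for value, _ in points if value is not None]
    return max(values) if values else None


def is_above(value: Optional[float], threshold: float) -> bool:
    """Evaluator sketch for 'Is Above': a missing value never fires."""
    return value is not None and value > threshold


# The 16 points from the log: value 0 every 20 seconds.
points = [(0, 1527183240000 + i * 20000) for i in range(16)]

reduced = reduce_max(points)
firing = is_above(reduced, 30)
print(f"Value: {reduced}, firing: {firing}")
```

With every point at 0, the reduced value is 0 and the condition does not fire, which is consistent with "Eval: false, Metric: Count, Value: 0.000" in the log.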

[Screenshot: screen shot 2018-05-24 at 1 45 10 pm]

@marefr
Member

marefr commented May 25, 2018

Thanks for trying it out @szaroubi - are you sure you're using the latest commit in this branch? I've pushed some changes over the last couple of days.

Can you please include the full json of your panel so I can try it out?

@szaroubi

@marefr,
I was using the latest commit, as that was the only time I have ever checked out the grafana codebase.

As for JSON of panel, I currently don't have access to the environment in which grafana is installed and can't provide the JSON.
Will set a reminder to provide it Monday during the day.

@szaroubi

szaroubi commented May 28, 2018

@marefr

Panel
{
  "alert": {
    "conditions": [
      {
        "evaluator": {
          "params": [
            30
          ],
          "type": "gt"
        },
        "operator": {
          "type": "and"
        },
        "query": {
          "params": [
            "B",
            "1m",
            "now"
          ]
        },
        "reducer": {
          "params": [],
          "type": "max"
        },
        "type": "query"
      }
    ],
    "executionErrorState": "alerting",
    "frequency": "10s",
    "handler": 1,
    "name": "Log levels alert",
    "noDataState": "no_data",
    "notifications": [
      {
        "id": 1
      }
    ]
  },
  "aliasColors": {},
  "bars": false,
  "dashLength": 10,
  "dashes": false,
  "datasource": "ES Radio-IP Prod",
  "fill": 1,
  "gridPos": {
    "h": 5,
    "w": 24,
    "x": 0,
    "y": 15
  },
  "id": 4,
  "legend": {
    "alignAsTable": true,
    "avg": false,
    "current": false,
    "hideEmpty": false,
    "hideZero": false,
    "max": false,
    "min": false,
    "rightSide": true,
    "show": true,
    "total": true,
    "values": true
  },
  "lines": true,
  "linewidth": 1,
  "links": [],
  "nullPointMode": "null as zero",
  "percentage": false,
  "pointradius": 5,
  "points": false,
  "renderer": "flot",
  "seriesOverrides": [],
  "spaceLength": 10,
  "stack": false,
  "steppedLine": false,
  "targets": [
    {
      "bucketAggs": [
        {
          "fake": true,
          "field": "syslog_level.keyword",
          "id": "3",
          "settings": {
            "min_doc_count": 1,
            "order": "desc",
            "orderBy": "_term",
            "size": "10"
          },
          "type": "terms"
        },
        {
          "field": "@timestamp",
          "id": "2",
          "settings": {
            "interval": "20s",
            "min_doc_count": 0,
            "trimEdges": 0
          },
          "type": "date_histogram"
        }
      ],
      "metrics": [
        {
          "field": "select field",
          "id": "1",
          "type": "count"
        }
      ],
      "query": "syslog_level.keyword:$LogLevel AND syslog_program.keyword:$Program",
      "refId": "A",
      "timeField": "@timestamp"
    },
    {
      "bucketAggs": [
        {
          "field": "@timestamp",
          "id": "2",
          "settings": {
            "interval": "20s",
            "min_doc_count": 0,
            "trimEdges": 0
          },
          "type": "date_histogram"
        }
      ],
      "metrics": [
        {
          "field": "select field",
          "id": "1",
          "type": "count"
        }
      ],
      "query": "syslog_level.keyword:$LogLevel AND syslog_program.keyword:$Program",
      "refId": "B",
      "timeField": "@timestamp"
    }
  ],
  "thresholds": [
    {
      "value": 30,
      "op": "gt",
      "fill": true,
      "line": true,
      "colorMode": "critical"
    }
  ],
  "timeFrom": null,
  "timeShift": null,
  "title": "Log levels",
  "tooltip": {
    "shared": true,
    "sort": 1,
    "value_type": "individual"
  },
  "type": "graph",
  "xaxis": {
    "buckets": null,
    "mode": "time",
    "name": null,
    "show": true,
    "values": []
  },
  "yaxes": [
    {
      "format": "short",
      "label": null,
      "logBase": 1,
      "max": null,
      "min": null,
      "show": true
    },
    {
      "format": "short",
      "label": null,
      "logBase": 1,
      "max": null,
      "min": null,
      "show": true
    }
  ],
  "yaxis": {
    "align": false,
    "alignLevel": null
  }
}

@marefr
Member

marefr commented May 28, 2018

@szaroubi seems to me you're using template variables in your queries - that's not supported when using the alerting feature, see #6557.

@szaroubi

@marefr,
I feel like the error message could have been a bit clearer; there was no indication that the query wasn't executing because of template variables.
But all that said, sorry for the false bug report and thank you very much for your prompt support.

@marefr
Member

marefr commented May 28, 2018

@szaroubi yes, that's actually a real bug you found - there should be a clear error message there, which is currently not implemented. Thank you for finding this - will fix that asap.

@szaroubi

@marefr,
I can confirm that when I removed the template variable, I get alerts triggered.

If a datasource implements the targetContainsTemplate function, it can evaluate whether a certain
query contains template variables; this is used to show an error message that
template variables are not supported in alert queries.
Handle all replacements of interval template variables in the client.
Fix issue with client and different versions.
Adds better tests of the client.
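
To make the template-variable check described in the commit messages above more concrete, here is a minimal Python sketch. It is not the targetContainsTemplate code from this PR (that lives in the Grafana frontend); the regular expression, the function names, and the error text are assumptions for illustration, and the pattern only covers the $name, ${name} and [[name]] spellings.

```python
import re

# Hypothetical pattern: $name, ${name} or [[name]].
TEMPLATE_VAR = re.compile(r"\$\w+|\$\{\w+\}|\[\[\w+\]\]")


def contains_template_variable(query: str) -> bool:
    """Return True when the query string appears to reference a template variable."""
    return TEMPLATE_VAR.search(query) is not None


def validate_alert_query(query: str) -> None:
    """Raise a clear error instead of silently evaluating an unresolved query."""
    if contains_template_variable(query):
        raise ValueError("Template variables are not supported in alert queries")


# Query from the panel JSON earlier in this thread.
panel_query = "syslog_level.keyword:$LogLevel AND syslog_program.keyword:$Program"
print(contains_template_variable(panel_query))
```

A purely textual check like this can also flag a query that contains a literal dollar sign, so treat it as a sketch of the idea rather than a complete rule.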
@pbuentam

pbuentam commented Jun 1, 2018

I keep finding alerts inconsistent with the data.
[image]
In this snapshot you can see that the recovery should have been at 02:50

@marefr
Member

marefr commented Jun 1, 2018

@pbuentam please include some details about your metric query and raw response data (use the query inspector for this).

You've selected unit seconds - are you sure that the raw data that comes back from the metric query is in seconds? If it comes back in milliseconds, for example, I think you'll need to use 10000 in the alert config to represent 10 seconds.
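
The unit conversion mentioned above can be sketched in a few lines of Python. The helper name and the unit table are hypothetical; the only point is that the alert threshold is compared against the raw values returned by the query, so it has to be expressed in the raw unit.

```python
# Multipliers from seconds to the raw unit stored in the index (assumed units).
FACTORS = {"s": 1, "ms": 1000}


def threshold_in_raw_unit(threshold_seconds: float, raw_unit: str) -> float:
    """Convert a threshold given in seconds to the unit of the raw data."""
    return threshold_seconds * FACTORS[raw_unit]


print(threshold_in_raw_unit(10, "ms"))
```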

You've set Evaluate every 5m; could you please try changing this to 60s (default), just to make sure that this isn't what is causing your problems?

@pbuentam

pbuentam commented Jun 1, 2018

I have set the units to none and Evaluate to 60s.
The query inspector shows the following for that graph, I can restrict the timespan if you consider that is going to be useful:

Result
{
  "xhrStatus": "complete",
  "request": {
    "method": "POST",
    "url": "api/datasources/proxy/10/_msearch",
    "data": "{\"search_type\":\"query_then_fetch\",\"ignore_unavailable\":true,\"index\":[\"logstash-wlaccess-2018.06.01\"]}\n{\"size\":0,\"query\":{\"bool\":{\"filter\":[{\"range\":{\"@timestamp\":{\"gte\":\"1527812908675\",\"lte\":\"1527816589767\",\"format\":\"epoch_millis\"}}},{\"query_string\":{\"analyze_wildcard\":true,\"query\":\"app:psportal AND env:produccion AND web:autoservicio\"}}]}},\"aggs\":{\"2\":{\"date_histogram\":{\"interval\":\"5m\",\"field\":\"@timestamp\",\"min_doc_count\":0,\"extended_bounds\":{\"min\":\"1527812908675\",\"max\":\"1527816589767\"},\"format\":\"epoch_millis\"},\"aggs\":{\"1\":{\"avg\":{\"field\":\"time_taken\"}}}}}}\n"
  },
  "response": {
    "responses": [
      {
        "took": 2,
        "timed_out": false,
        "_shards": {
          "total": 5,
          "successful": 5,
          "skipped": 0,
          "failed": 0
        },
        "hits": {
          "total": 459,
          "max_score": 0,
          "hits": []
        },
        "aggregations": {
          "2": {
            "buckets": [
              {
                "1": {
                  "value": 0.011500000488013029
                },
                "key_as_string": "1527812700000",
                "key": 1527812700000,
                "doc_count": 2
              },
              {
                "1": {
                  "value": 0.010771929852157962
                },
                "key_as_string": "1527813000000",
                "key": 1527813000000,
                "doc_count": 57
              },
              {
                "1": {
                  "value": 3.4190000845177564
                },
                "key_as_string": "1527813300000",
                "key": 1527813300000,
                "doc_count": 60
              },
              {
                "1": {
                  "value": 21.798712007599093
                },
                "key_as_string": "1527813600000",
                "key": 1527813600000,
                "doc_count": 66
              },
              {
                "1": {
                  "value": 0.00975000043399632
                },
                "key_as_string": "1527813900000",
                "key": 1527813900000,
                "doc_count": 4
              },
              {
                "1": {
                  "value": 0.009500000393018126
                },
                "key_as_string": "1527814200000",
                "key": 1527814200000,
                "doc_count": 4
              },
              {
                "1": {
                  "value": 0.010250000283122063
                },
                "key_as_string": "1527814500000",
                "key": 1527814500000,
                "doc_count": 4
              },
              {
                "1": {
                  "value": 0.00975000043399632
                },
                "key_as_string": "1527814800000",
                "key": 1527814800000,
                "doc_count": 4
              },
              {
                "1": {
                  "value": 0.01000000024214387
                },
                "key_as_string": "1527815100000",
                "key": 1527815100000,
                "doc_count": 4
              },
              {
                "1": {
                  "value": 1.7438275462482125
                },
                "key_as_string": "1527815400000",
                "key": 1527815400000,
                "doc_count": 58
              },
              {
                "1": {
                  "value": 0.010000000474974513
                },
                "key_as_string": "1527815700000",
                "key": 1527815700000,
                "doc_count": 4
              },
              {
                "1": {
                  "value": 0.487445647080439
                },
                "key_as_string": "1527816000000",
                "key": 1527816000000,
                "doc_count": 92
              },
              {
                "1": {
                  "value": 0.5569400003890042
                },
                "key_as_string": "1527816300000",
                "key": 1527816300000,
                "doc_count": 100
              }
            ]
          }
        },
        "status": 200
      }
    ]
  }
}
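
To relate this response to what an alert condition would see, here is a minimal Python sketch that flattens the date_histogram buckets into (value, timestamp) points. It is not Grafana's response parser; the function name is hypothetical, the aggregation ids "2" and "1" are taken from the request above, and the sample holds only three of the buckets shown.

```python
from typing import Any, Dict, List, Optional, Tuple


def buckets_to_points(
    response: Dict[str, Any], bucket_agg_id: str = "2", metric_id: str = "1"
) -> List[Tuple[Optional[float], int]]:
    """Turn date_histogram buckets into (value, epoch millis) points."""
    buckets = response["aggregations"][bucket_agg_id]["buckets"]
    return [(bucket[metric_id]["value"], bucket["key"]) for bucket in buckets]


# Three buckets copied from the response above.
sample = {
    "aggregations": {
        "2": {
            "buckets": [
                {"1": {"value": 0.011500000488013029}, "key": 1527812700000, "doc_count": 2},
                {"1": {"value": 21.798712007599093}, "key": 1527813600000, "doc_count": 66},
                {"1": {"value": 0.5569400003890042}, "key": 1527816300000, "doc_count": 100},
            ]
        }
    }
}

points = buckets_to_points(sample)
for value, timestamp in points:
    print(timestamp, value)
```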

@marefr
Member

marefr commented Jun 1, 2018

@pbuentam since you're using an interval of 5 minutes you'll see that the last 5 minutes are missing in the graph. The general recommendation for Grafana alerting in this scenario is to configure the alert so that the time settings are similar to Query(A, 5m, now-5m).

You're basically telling the alerting engine not to look at the latest 5 minutes, since there won't be any data there. I think you're hitting this problem, and since you have If no data or all values are null = keep last state, you see the behavior you described.
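
One way to read this advice is in terms of bucket boundaries. The Python sketch below uses the "lte" timestamp from the request earlier in the thread as "now"; the helper names are hypothetical and the arithmetic is only an illustration, not the alerting engine's code. It shows that the most recent 5m bucket is still filling at query time, while a window shifted back by 5m ends inside a bucket that has already closed.

```python
INTERVAL_MS = 5 * 60 * 1000  # date_histogram interval of 5m


def bucket_start(ts_ms: int, interval_ms: int = INTERVAL_MS) -> int:
    """Start of the histogram bucket that contains ts_ms."""
    return ts_ms - ts_ms % interval_ms


def is_complete(bucket_key_ms: int, now_ms: int, interval_ms: int = INTERVAL_MS) -> bool:
    """A bucket is complete once its end is not later than 'now'."""
    return bucket_key_ms + interval_ms <= now_ms


def query_window(now_ms: int, span_ms: int, offset_ms: int = 0):
    """Sketch of Query(A, span, now-offset) as a (from, to) pair in epoch millis."""
    to_ms = now_ms - offset_ms
    return to_ms - span_ms, to_ms


now_ms = 1527816589767  # "lte" value from the request above

latest_bucket = bucket_start(now_ms)
print(latest_bucket, is_complete(latest_bucket, now_ms))

from_ms, to_ms = query_window(now_ms, INTERVAL_MS, offset_ms=INTERVAL_MS)
print(from_ms, to_ms, is_complete(bucket_start(to_ms), now_ms))
```

The first bucket key matches the last bucket in the response (1527816300000), which had not closed when the query ran.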

@marefr marefr merged commit b6afe5f into grafana:master Jun 1, 2018
@marefr marefr added this to the 5.2 milestone Jun 1, 2018
@marefr
Member

marefr commented Jun 1, 2018

@wph95 Thank you for your first contribution to Grafana!

@marefr
Member

marefr commented Jun 1, 2018

And thanks to all of you who have been helping us test this. Testing will continue until we release v5.2 stable, and I encourage you to create an issue if you find any problems.

@yossiv

yossiv commented Jun 3, 2018

Hi, I noticed that when the alert is sent, the value gets a decimal point added: for example, if the value is 110 it becomes 0.110.
[image]
Any idea why the value is 0.XXX?
[image]

@marefr
Member

marefr commented Jun 4, 2018

@yossiv looking at the graph it seems correct - the green series is at the bottom. Please change to Query(A, 5m, now) or similar to average over a longer time.

@yossiv

yossiv commented Jun 4, 2018

Hi @marefr, thanks.
It didn't help.
Once I changed the Interval from Auto to 1m, it gave me the correct value.
Cheers!
[image]

@ying-jeanne ying-jeanne added the pr/external This PR is from external contributor label Apr 29, 2021
Successfully merging this pull request may close these issues.

Alerting: Elasticsearch support