Prometheus alerting rule examples: contributions welcome #89

Open
Zhang21 opened this issue Dec 17, 2020 · 7 comments

Comments

@Zhang21
Contributor

Zhang21 commented Dec 17, 2020

This site has many sample Prometheus alerting rules: https://awesome-prometheus-alerts.grep.to/



# Available memory on CentOS 6 and 7 (falls back to a sum when MemAvailable is not exported)
node_memory_MemAvailable_bytes or (node_memory_Buffers_bytes + node_memory_Cached_bytes + node_memory_MemFree_bytes + node_memory_Slab_bytes)
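
As a rough sketch, the fallback expression above could be wrapped in a recording rule so that one alert works on both CentOS versions. The rule name instance:node_memory_available:bytes, the 10% threshold, the for duration and the label values are assumptions for illustration, not part of the original post.

```yaml
groups:
- name: node-memory
  rules:
  # Available memory in bytes; the right-hand side is used only
  # where node_memory_MemAvailable_bytes is not exported.
  - record: instance:node_memory_available:bytes
    expr: |
      node_memory_MemAvailable_bytes
      or
      (node_memory_Buffers_bytes + node_memory_Cached_bytes
        + node_memory_MemFree_bytes + node_memory_Slab_bytes)

  # Hypothetical alert built on the recorded series.
  - alert: Available memory below 10%
    expr: instance:node_memory_available:bytes / node_memory_MemTotal_bytes * 100 < 10
    for: 5m
    labels:
      severity: warning
      level: 2
    annotations:
      summary: "Available memory below 10%"
      description: "Available memory on host {{ $labels.hostname }} is {{ $value | humanize }}%"
```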

An example Prometheus rules file. The level label selects how the alert is delivered; level and kind together are used for alert inhibition.


groups:
- name: node-cpu
  rules:
  # Number of CPU cores
  - record: instance:node_cpus:count
    expr: count without (cpu, mode) (node_cpu_seconds_total{mode="idle"})
  # Per-CPU utilization
  - record: instance_cpu:node_cpu_seconds_not_idle:rate1m
    expr: sum without (mode) (1 - rate(node_cpu_seconds_total{mode="idle"}[1m]))
  # Overall CPU utilization
  - record: instance:node_cpu_utilization:ratio
    expr: avg without (cpu) (instance_cpu:node_cpu_seconds_not_idle:rate1m)

  - alert: CPU usage above 88%
    expr: instance:node_cpu_utilization:ratio * 100 > 88
    for: 5m
    labels:
      severity: critical
      level: 3
    annotations:
      summary: "CPU usage above 88%"
      description: "CPU usage on host {{ $labels.hostname }} is {{ $value | humanize }}%"
  - alert: CPU usage above 93%
    expr: instance:node_cpu_utilization:ratio * 100 > 93
    for: 2m
    labels:
      severity: emergency
      level: 4
    annotations:
      summary: "CPU usage above 93%"
      description: "CPU usage on host {{ $labels.hostname }} is {{ $value | humanize }}%"
      wxurl: "webhook1, webhook2"
      mobile: "13xxx, 15xxx"

  - alert: CPU load above core count
    expr: node_load5 > instance:node_cpus:count
    for: 5m
    labels:
      severity: warning
      level: 2
    annotations:
      summary: "CPU load above core count"
      description: "CPU load on host {{ $labels.hostname }} is {{ $value }}"
  - alert: CPU load above 2x core count
    expr: node_load1 > (instance:node_cpus:count * 2)
    for: 4m
    labels:
      severity: critical
      level: 3
    annotations:
      summary: "CPU load above 2x core count"
      description: "CPU load on host {{ $labels.hostname }} is {{ $value }}"
      alertgroup: ops
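
For the inhibition by level and kind mentioned above, a minimal Alertmanager sketch might look like this. It assumes every alert also carries a kind label (for example kind: CpuUsage or kind: CpuLoad) and an instance label that identifies the host. The source_matchers / target_matchers syntax needs Alertmanager 0.22 or later.

```yaml
# alertmanager.yml (fragment)
inhibit_rules:
  # While a level 4 alert is firing, mute the level 3 alert
  # of the same kind on the same instance.
  - source_matchers:
      - level="4"
    target_matchers:
      - level="3"
    equal: ['instance', 'kind']
  # Likewise, a firing level 3 alert mutes the matching level 2 alert.
  - source_matchers:
      - level="3"
    target_matchers:
      - level="2"
    equal: ['instance', 'kind']
```

Listing kind under equal keeps a CPU usage alert from muting an unrelated CPU load alert on the same host.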

To fire (or suppress) alerts only during certain hours, see: https://www.robustperception.io/combining-alert-conditions

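groups:
- name: time-window
  rules:
  - alert: No alerts between 00:00 and 06:00
    # Note: hour() in Prometheus is evaluated in UTC.
    # Assuming local time is UTC+8, 00:00 to 06:00 local is 16:00 to 22:00 UTC,
    # so the alert may fire only when the UTC hour is outside that range.
    expr: <PromQL expression> and on() (hour() < 16 or hour() >= 22)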
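
To make the time-window pattern concrete, here is a sketch that plugs the CPU usage condition from the rules above into it. It assumes the local time zone is UTC+8, so the quiet window of 00:00 to 06:00 local corresponds to UTC hours 16 through 21. The alert name is made up for illustration.

```yaml
groups:
- name: time-window-example
  rules:
  - alert: CPU usage above 88% (muted 00:00 to 06:00 UTC+8)
    # The right-hand side returns a sample only when the current UTC hour
    # is outside 16..21; "and on()" drops the left-hand result otherwise.
    expr: |
      instance:node_cpu_utilization:ratio * 100 > 88
      and on() (hour() < 16 or hour() >= 22)
    for: 5m
    labels:
      severity: critical
      level: 3
```

Note that a chained comparison such as hour() < 16 > 22 does not express "outside the range": it is evaluated left to right as a filter and never returns a sample, which is why the two comparisons are joined with or.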

feiyu563 pinned this issue Feb 3, 2021

@zhangcaiyuan

Hello! I want to change wxurl here to emailurl. Can it be written as emailurl: url?

@tommy19810810

Hello, with custom templates, how can a single alert be sent through several channels at once (for example webhook, email, and SMS)? In my tests only one channel ever receives the alert. My Alertmanager test configuration is:

global:
  resolve_timeout: 5m
route:
  group_by: ['gateway']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 5m
  receiver: 'webhook'
  routes:

@michael-liumh
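Add continue: true under the receiver of each route in routes, so that matching carries on to the next route instead of stopping at the first match.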
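
A minimal sketch of that suggestion, built on the route settings from the question. The receiver names email and sms are placeholders; each one would still need a matching entry under receivers, which is omitted here.

```yaml
# alertmanager.yml (fragment)
route:
  group_by: ['gateway']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 5m
  receiver: 'webhook'        # default receiver if no child route matches
  routes:
    # Child routes without matchers match every alert.
    - receiver: 'webhook'
      continue: true         # keep evaluating the next sibling route
    - receiver: 'email'      # placeholder receiver name
      continue: true
    - receiver: 'sms'        # placeholder receiver name; last route, no continue needed
```

Without continue: true, Alertmanager stops at the first matching child route, which matches the behaviour described in the question where only one channel received the alert.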

@qianglatiao

Hello, are there any alerting rules for Oracle?

@zhangcaiyuan

zhangcaiyuan commented Jul 20, 2022 via email

@running-db
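
A question about this:

This site has many sample Prometheus alerting rules: https://awesome-prometheus-alerts.grep.to/

# Available memory on CentOS 6 and 7 (falls back to a sum when MemAvailable is not exported)
node_memory_MemAvailable_bytes or (node_memory_Buffers_bytes + node_memory_Cached_bytes + node_memory_MemFree_bytes + node_memory_Slab_bytes)

An example Prometheus rules file. The level label selects how the alert is delivered; level and kind together are used for alert inhibition.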

groups:
- name: node-cpu
  rules:
  # Number of CPU cores
  - record: instance:node_cpus:count
    expr: count without (cpu, mode) (node_cpu_seconds_total{mode="idle"})
  # Per-CPU utilization
  - record: instance_cpu:node_cpu_seconds_not_idle:rate1m
    expr: sum without (mode) (1 - rate(node_cpu_seconds_total{mode="idle"}[1m]))
  # Overall CPU utilization
  - record: instance:node_cpu_utilization:ratio
    expr: avg without (cpu) (instance_cpu:node_cpu_seconds_not_idle:rate1m)

  - alert: CPU usage above 88%
    expr: instance:node_cpu_utilization:ratio * 100 > 88
    for: 5m
    labels:
      severity: critical
      level: 3
      kind: CpuUsage
    annotations:
      summary: "CPU usage above 88%"
      description: "CPU usage on host {{ $labels.hostname }} is {{ $value | humanize }}%"

  - alert: CPU usage above 93%
    expr: instance:node_cpu_utilization:ratio * 100 > 93
    for: 2m
    labels:
      severity: emergency
      level: 4
      kind: CpuUsage
    annotations:
      summary: "CPU usage above 93%"
      description: "CPU usage on host {{ $labels.hostname }} is {{ $value | humanize }}%"
      wxurl: "webhook1, webhook2"
      mobile: "13xxx, 15xxx"

  - alert: CPU load above core count
    expr: node_load5 > instance:node_cpus:count
    for: 5m
    labels:
      severity: warning
      level: 2
      kind: CpuLoad
    annotations:
      summary: "CPU load above core count"
      description: "CPU load on host {{ $labels.hostname }} is {{ $value }}"

  - alert: CPU load above 2x core count
    expr: node_load5 > (instance:node_cpus:count * 2)
    for: 2m
    labels:
      severity: critical
      level: 3
      kind: CpuLoad
    annotations:
      summary: "CPU load above 2x core count"
      description: "CPU load on host {{ $labels.hostname }} is {{ $value }}"
      wxurl: "webhook1, webhook2"

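To fire (or suppress) alerts only during certain hours, see: https://www.robustperception.io/combining-alert-conditions

groups:
- name: time-window
  rules:
  - alert: No alerts between 00:00 and 06:00
    # Note: hour() in Prometheus is evaluated in UTC.
    # Assuming local time is UTC+8, 00:00 to 06:00 local is 16:00 to 22:00 UTC,
    # so the alert may fire only when the UTC hour is outside that range.
    expr: <PromQL expression> and on() (hour() < 16 or hour() >= 22)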

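How is this used? I couldn't find it in the documentation. Where is the address that webhook1 stands for configured, in app.conf? Or should the full addresses be written directly in this field?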

wxurl: "webhook1, webhook2"
mobile: "13xxx, 15xxx"

@Zhang21
Contributor Author

Zhang21 commented Nov 13, 2023

@running-db
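Write the addresses directly in that field, separated by commas. This is covered in the documentation. An alert group feature was added later: you can define alert groups in app.conf and then reference one or more of those groups in the rules.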
