
PromQL expression QOL #31

Open
kedoodle opened this issue Jan 4, 2022 · 8 comments

@kedoodle
Contributor

kedoodle commented Jan 4, 2022

We're updating our alerts to make use of the metrics exposed by the prometheus-exporter feature of aws-quota-checker. We have a generic expression which aims to alert whenever we've breached 70% of any limit.

The expression is quite unwieldy:

round( 100 *
    label_replace({__name__=~"^awsquota_([3a-z_]+(_count|_instances))$",account=~".+"}, "resource", "$1", "__name__", "^awsquota_([3a-z_]+(_count|_instances))$")
    / on (resource)
    label_replace({__name__=~"^awsquota_([3a-z_]+(_limit))$",account=~".+"}, "resource", "$1", "__name__", "^awsquota_([3a-z_]+)(_limit)$")
) > 70

A couple of suggestions that would aid in crafting PromQL expressions:

  • It would be great if the metrics had an additional label e.g.
    awsquota_rds_instances{resource="rds_instances"}
    awsquota_rds_instances_limit{resource="rds_instances"}
    
  • A bigger change, but what if every quota were exposed through the same pair of metrics, distinguished only by the additional label above? e.g. (see the sketch just below)
    awsquota_usage{resource="rds_instances"}
    awsquota_limit{resource="rds_instances"}
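
For illustration, the generic 70% alert could then collapse to something like the following (untested sketch; these metric names are only the proposal above, not something the exporter emits today):

# sketch only: assumes the proposed awsquota_usage / awsquota_limit pair exists and
# that usage and limit series carry identical label sets, so default 1:1 matching applies
round( 100 * awsquota_usage{account=~".+"} / awsquota_limit{account=~".+"} ) > 70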
    

Feel free to disregard this if it's too niche or heads in a direction you'd rather not take. Those facing similar grievances could also work around it with recording rules, for example as sketched below.
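
Something along these lines (untested, and the rule name awsquota:usage_percent is just an example):

# recording rule (in a Prometheus rule file):
#   record: awsquota:usage_percent
#   expr:   the round( 100 * ... ) expression above, without the trailing "> 70"
#
# the generic alert then reduces to:
awsquota:usage_percent > 70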

@brennerm
Owner

brennerm commented Jan 4, 2022

Hey @kedoodle, thanks for opening this issue. I understand the benefit of switching to the awsquota_usage{resource="rds_instances"} scheme. But what would be the advantage of adding a resource label to the existing metrics?

@kedoodle
Contributor Author

kedoodle commented Jan 4, 2022

Hey @brennerm, appreciate the response!

I'm thinking of "generic" expressions where we want to alert on any and all AWS limits reaching a certain threshold (as opposed to a single resource).

TL;DR: it saves a label_replace or two.

Existing metrics:

awsquota_s3_bucket_count{account="123456789012"}
awsquota_s3_bucket_count_limit{account="123456789012"}

Existing expression (same as original issue comment):

round( 100 *
    label_replace({__name__=~"^awsquota_([3a-z_]+(_count|_instances))$",account=~".+"}, "resource", "$1", "__name__", "^awsquota_([3a-z_]+(_count|_instances))$")
    / on (resource)
    label_replace({__name__=~"^awsquota_([3a-z_]+(_limit))$",account=~".+"}, "resource", "$1", "__name__", "^awsquota_([3a-z_]+)(_limit)$")
) > 70

Existing metric names with additional resource label:

awsquota_s3_bucket_count{account="123456789012",resource="s3_bucket_count"}
awsquota_s3_bucket_count_limit{account="123456789012",resource="s3_bucket_count"}

New expression using the existing metric names plus the additional resource label:

round( 100 *
    {__name__=~"^awsquota_([3a-z_]+(_count|_instances))$",account=~".+"}
    / on (resource)
    {__name__=~"^awsquota_([3a-z_]+(_limit))$",account=~".+"}
) > 70

The label could also be nice for specific alerts where you want to use the resource as part of the alert details, e.g. a description (built from metric labels) saying that we have reached 70% of the limit on s3_bucket_count in 123456789012. I understand that you can get the resource from the metric name; it just requires an extra label_replace for a seemingly common use case.

round( 100 *
    {__name__="awsquota_s3_bucket_count"}
    / on (resource)
    {__name__="awsquota_s3_bucket_count_limit"}
) > 70

@brennerm
Owner

brennerm commented Jan 5, 2022

@kedoodle I agree with your point of view. I added a new label called quota in 585f1b6 that contains the quota name.
Could you provide feedback on that change? If it works for you I'll create a new release.
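
With that label in place, the generic expression should no longer need label_replace; roughly (untested, and assuming the usage and limit series carry matching quota and account label values):

# untested; assumes each usage series and its limit series share the same quota and account values
round( 100 *
    {__name__=~"^awsquota_([3a-z_]+(_count|_instances))$",account=~".+"}
    / on (quota, account)
    {__name__=~"^awsquota_([3a-z_]+(_limit))$",account=~".+"}
) > 70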

I'll probably also switch to the proposed awsquota_usage and awsquota_limit scheme at some point in time but that'll be part of a new major release as it's a breaking change.

@kedoodle
Contributor Author

kedoodle commented Jan 5, 2022

Hey @brennerm, I've built and deployed from 585f1b6. The new label looks great!

(screenshots showing the new quota label on the exported metrics)

I understand that awsquota_usage and awsquota_limit would be a breaking change. Would love to see it in a future release.

@brennerm
Owner

brennerm commented Jan 6, 2022

That's great to hear. The change has been released with version 1.10.0.

I'll leave the ticket open until I switch to the breaking change scheme.

@kedoodle
Contributor Author

kedoodle commented Jan 7, 2022

Thanks @brennerm!

I'm in the process of deploying 1.10.0 into a few different k8s clusters. Probably unrelated to #31, but I'm seeing high spikes in memory usage (~800 MiB) while refreshing current values. I've increased the memory limits and will let you know (in another issue?) next week whether the spikes persist over the weekend.

Container logs, after which the pod is OOMKilled:

AWS profile: default | AWS region: ap-southeast-2 | Active checks: cf_stack_count,ebs_snapshot_count,rds_instances,s3_bucket_count
07-Jan-22 04:46:33 [INFO] aws_quota.prometheus - starting /metrics endpoint on port 8080
07-Jan-22 04:46:33 [INFO] aws_quota.prometheus - collecting checks
07-Jan-22 04:46:33 [INFO] aws_quota.prometheus - collected 4 checks
07-Jan-22 04:46:33 [INFO] aws_quota.prometheus - refreshing limits
07-Jan-22 04:46:34 [INFO] aws_quota.prometheus - limits refreshed
07-Jan-22 04:46:34 [INFO] aws_quota.prometheus - refreshing current values

EDIT:
Given enough memory, we can see it takes 3 minutes 30 seconds to refresh current values:

07-Jan-22 05:04:06 [INFO] aws_quota.prometheus - refreshing current values
07-Jan-22 05:07:36 [INFO] aws_quota.prometheus - current values refreshed

This particular AWS account has ~35k EBS snapshots. I suspect pagination may be needed to reduce memory usage during any one particular check, e.g. https://github.com/brennerm/aws-quota-checker/blob/1.10.0/aws_quota/check/ebs.py#L13 in my scenario.

EDIT 2:
Did some troubleshooting, given that most people probably don't have an AWS account with 35k EBS snapshots handy. Opened PR #32.

@tpoindessous

Hello @kedoodle,

Thanks for your work. Your expression doesn't work with this metric:

awsquota_elb_listeners_per_clb

We are trying to find a new alert rule; we will get back to you!

Thanks!

@kedoodle
Contributor Author


Hopefully you can adapt the expression to something that works for your use case until the proposed awsquota_usage and awsquota_limit breaking change is implemented.
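
One possible adaptation (untested sketch) is to stop keying off the _count/_instances suffixes and instead select every awsquota_ series that is not a _limit series, joining on the quota label added in 1.10.0:

# untested sketch; assumes Prometheus accepts multiple __name__ matchers (RE2 has no
# negative lookahead) and that usage and limit series share matching quota/account labels
round( 100 *
    {__name__=~"^awsquota_.+", __name__!~".+_limit$", account=~".+"}
    / on (quota, account)
    {__name__=~"^awsquota_.+_limit$", account=~".+"}
) > 70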
