
How can one add a weekly maintenance window into the calculations for SLO's with sloth? #529

Open
golodhrim opened this issue Dec 13, 2023 · 1 comment

Comments

@golodhrim

I found out about the sloth project today, and after reading through the docs I think it is exactly what I am looking for. The one question I still have is how to add a weekly recurring maintenance window to the SLO calculations, so that an outage inside that window is not counted against the SLO at all.
Greetings

@tokheim

tokheim commented Jan 24, 2024

First of all, Prometheus needs to know about the maintenance windows. Either some system reports them as a metric, or, if the window is at a fixed time, you could build a recording rule with the day_of_week() and hour() functions.
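For a fixed weekly window, such a recording rule could look roughly like the sketch below. The window itself (Sunday 02:00 to 04:00) and the group name are assumptions made up for illustration; the rule name maintenance_period is chosen to match the queries further down. Note that day_of_week() and hour() work in UTC, and day_of_week() returns 0 for Sunday.

```yaml
groups:
  - name: maintenance            # hypothetical group name
    rules:
      - record: maintenance_period
        # 1 while inside the assumed window (Sunday 02:00-04:00 UTC), 0 otherwise.
        # "bool" makes each comparison return 0/1 instead of filtering,
        # so the product is 1 only when all three conditions hold.
        expr: |
          (day_of_week() == bool 0)
          * (hour() >= bool 2)
          * (hour() < bool 4)
```

The result is a single series without labels, which is what scalar() needs in order to return a usable value instead of NaN.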

Then I'd probably cut my losses and just define an inhibit rule in Alertmanager to avoid sending alerts during the maintenance period (maybe with a buffer around it). The SLO calculations and dashboards would still take errors during the maintenance period into account, though. I would consider that an advantage, as I'd encourage limiting the impact of maintenance periods anyway.
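A minimal sketch of the inhibit approach, assuming a maintenance_period series that is 1 during maintenance. Alertmanager inhibits based on firing alerts, so a helper alert is needed as the inhibition source. The alert name MaintenanceWindow and its severity label are hypothetical, and the sloth_severity matcher assumes the sloth-generated alerts carry that label with the values page and ticket; check the labels on your generated rules before relying on it.

```yaml
# Prometheus rules file: helper alert that fires while the window is active.
groups:
  - name: maintenance-alerts     # hypothetical group name
    rules:
      - alert: MaintenanceWindow # hypothetical alert name
        expr: maintenance_period == 1
        labels:
          severity: none
---
# Alertmanager configuration fragment: while MaintenanceWindow fires,
# mute the SLO alerts (label name and values are an assumption, see above).
inhibit_rules:
  - source_matchers:
      - alertname = "MaintenanceWindow"
    target_matchers:
      - sloth_severity =~ "page|ticket"
```

The helper alert itself should be routed to a receiver that notifies nobody. A buffer around the window can be had by widening the hour range in the maintenance_period rule.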

Still, if you really need the calculations to exclude maintenance periods, the approach would likely depend on your query. Assuming a maintenance_period recording rule that reports 1 during maintenance and 0 otherwise, queries like these might do the trick:

  error_query: |
    sum_over_time((
        sum(rate(<error_counter>[30s]))
        * scalar(1-maintenance_period)
    )[{{.window}}:])
  total_query: |
    (sum_over_time((
        sum(rate(<total_counter>[30s]))
        * scalar(1-maintenance_period)
    )[{{.window}}:]) > 0) or vector(1)

Basically, if you take the rate over the full window, you can't tell which errors happened during the maintenance period. Summing the short-range rates with sum_over_time should still give a good approximation of the error ratio for the entire window. The > 0 filter followed by or vector(1) is important to include: without it, the error ratio would have a 0 denominator whenever the whole window falls inside a maintenance period.
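Written out, this is roughly what the two queries compute together. The symbols are my own notation, not anything from sloth: t_i are the subquery evaluation steps inside the window W, r_err and r_total are the 30s rates at each step, and m(t_i) is the value of maintenance_period (0 or 1).

```latex
\[
\text{error ratio}(W) \approx
\frac{\sum_{t_i \in W} r_{\text{err}}(t_i)\,\bigl(1 - m(t_i)\bigr)}
     {\sum_{t_i \in W} r_{\text{total}}(t_i)\,\bigl(1 - m(t_i)\bigr)}
\]
```

Steps inside maintenance contribute 0 to both sums. If every step in W is inside maintenance, both sums are 0, the denominator is replaced by 1, and the ratio comes out as 0.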
