Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add missing cluster label to mixins #12870

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

QuentinBisson
Copy link
Contributor

What this PR does / why we need it:

This PR enhance the loki alert mixins by adding the cluster label to all summary.

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Checklist

  • Reviewed the CONTRIBUTING.md guide (required)
  • Documentation added
  • Tests updated
  • Title matches the required conventional commits format, see here
    • Note that Promtail is considered to be feature complete, and future development for logs collection will be in Grafana Alloy. As such, feat PRs are unlikely to be accepted unless a case can be made for the feature actually being a bug fix to existing behavior.
  • Changes that require user attention or interaction to upgrade are documented in docs/sources/setup/upgrade/_index.md
  • For Helm chart changes bump the Helm chart version in production/helm/loki/Chart.yaml and update production/helm/loki/CHANGELOG.md and production/helm/loki/README.md. Example PR
  • If the change is deprecating or removing a configuration option, update the deprecated-config.yaml and deleted-config.yaml files respectively in the tools/deprecated-config-checker directory. Example PR

@QuentinBisson QuentinBisson requested a review from a team as a code owner May 3, 2024 08:48
@QuentinBisson QuentinBisson changed the title enhance: add missing cluster label to mixins feat: add missing cluster label to mixins May 3, 2024
Signed-off-by: QuentinBisson <quentin@giantswarm.io>
@QuentinBisson QuentinBisson force-pushed the loki-mixins-add-cluster-label branch from a866c00 to 07ac356 Compare May 3, 2024 12:24
Copy link
Contributor

@cstyan cstyan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the moment, some of these won't work as you'd like them to because the queries in each alerts expr are not keeping the cluster label.

@QuentinBisson
Copy link
Contributor Author

Oh you're right :(
Btw, i've just discovered a tool call pint that could detect such issues using ci validation. Do you think it would make sense to add it to the Mimir generated rules?

@cstyan
Copy link
Contributor

cstyan commented May 3, 2024

Oh you're right :( Btw, i've just discovered a tool call pint that could detect such issues using ci validation. Do you think it would make sense to add it to the Mimir generated rules?

It's worth looking into, happy to review a PR if you come up with something that works 👍

@QuentinBisson
Copy link
Contributor Author

Here is the output of pint without colors :(

I think it would be best to add it to the loki-build-image but I'm not sure if you like to have contribution on that repo.

./pint-linux-amd64 --offline lint --min-severity=information /home/quentin/Documents/code/QuentinBisson/loki-upstream/production/loki-mixin-compiled-ssd/*.yaml
level=INFO msg="Finding all rules to check" paths=["/home/quentin/Documents/code/QuentinBisson/loki-upstream/production/loki-mixin-compiled-ssd/alerts.yaml","/home/quentin/Documents/code/QuentinBisson/loki-upstream/production/loki-mixin-compiled-ssd/rules.yaml"]
level=INFO msg="Offline mode, skipping Prometheus discovery"
level=ERROR msg="Execution completed with error(s)" err="invalid --min-severity value: unknown severity: information"
> ./pint-linux-amd64 --offline lint --min-severity=info /home/quentin/Documents/code/QuentinBisson/loki-upstream/production/loki-mixin-compiled-ssd/*.yaml
level=INFO msg="Finding all rules to check" paths=["/home/quentin/Documents/code/QuentinBisson/loki-upstream/production/loki-mixin-compiled-ssd/alerts.yaml","/home/quentin/Documents/code/QuentinBisson/loki-upstream/production/loki-mixin-compiled-ssd/rules.yaml"]
level=INFO msg="Offline mode, skipping Prometheus discovery"
/home/quentin/Documents/code/QuentinBisson/loki-upstream/production/loki-mixin-compiled-ssd/alerts.yaml:6-7 Bug: Template is using `cluster` label but the query removes it. (alerts/template)
 6 |             description: |
 7 |                 {{ $labels.cluster }} {{ $labels.job }} {{ $labels.route }} is experiencing {{ printf "%.2f" $value }}% errors.

/home/quentin/Documents/code/QuentinBisson/loki-upstream/production/loki-mixin-compiled-ssd/alerts.yaml:6-7 Information: Using the value of `rate(loki_request_duration_seconds_count{status_code=~"5.."}[2m])` inside this annotation might be hard to read, consider using one of humanize template functions to make it more human friendly. (alerts/template)
 6 |             description: |
 7 |                 {{ $labels.cluster }} {{ $labels.job }} {{ $labels.route }} is experiencing {{ printf "%.2f" $value }}% errors.

/home/quentin/Documents/code/QuentinBisson/loki-upstream/production/loki-mixin-compiled-ssd/alerts.yaml:19-20 Bug: Template is using `cluster` label but the query removes it. (alerts/template)
 19 |             description: |
 20 |                 {{ $labels.cluster }} {{ $labels.job }} is experiencing {{ printf "%.2f" $value }}% increase of panics.

level=INFO msg="Problems found" Bug=2 Information=1
level=ERROR msg="Execution completed with error(s)" err="found 1 problem(s) with severity Bug or higher"

@@ -52,7 +52,7 @@ groups:
message: |
{{`{{`}} $labels.cluster {{`}}`}} {{`{{`}} $labels.namespace {{`}}`}} has had {{`{{`}} printf "%.0f" $value {{`}}`}} compactors running for more than 5m. Only one compactor should run at a time.
expr: |
sum(loki_boltdb_shipper_compactor_running) by (namespace, cluster) > 1
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Making sure they are all in the same order

Signed-off-by: QuentinBisson <quentin@giantswarm.io>
@QuentinBisson
Copy link
Contributor Author

@cstyan it should be fine now, pint is happy apart from the humanize thing

@cstyan
Copy link
Contributor

cstyan commented May 6, 2024

I think it would be best to add it to the loki-build-image but I'm not sure if you like to have contribution on that repo.

Agreed, it would be best to have it nicely runnable in a reproduce-able way. For now, can you show me how the invocation of pint works and I can setup a CI job via github actions at least post-merging of your PR.

Copy link
Contributor

@cstyan cstyan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM; same as usual, I assume you've tried deploying the jsonnet version already to your own environment?

pint is happy apart from the humanize thing

not sure what you mean here?

@QuentinBisson
Copy link
Contributor Author

I meant this section:

/home/quentin/Documents/code/QuentinBisson/loki-upstream/production/loki-mixin-compiled-ssd/alerts.yaml:6-7 Information: Using the value of `rate(loki_request_duration_seconds_count{status_code=~"5.."}[2m])` inside this annotation might be hard to read, consider using one of humanize template functions to make it more human friendly. (alerts/template)
 6 |             description: |
 7 |                 {{ $labels.cluster }} {{ $labels.job }} {{ $labels.route }} is experiencing {{ printf "%.2f" $value }}% errors.

But I did not really play with humanize functions :)

So I run pint from the cli for this one but I think you could use the pint ci directly:
https://cloudflare.github.io/pint/#pull-requests

There are a lot of things that could be added (like enforcing that some annotations or labels are always set) but an initial version would be good.

I will deploy it tomorrow or later today, I did not have time to full test my changes :( i'll ping you when i'm done and thank you for the review

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants