Extra telemetry on policy evaluation failure #661

Open
the-nando opened this issue Jul 9, 2023 · 0 comments · May be fixed by #660


the-nando commented Jul 9, 2023

We recently ran into two separate issues. In one, the Nomad Autoscaler failed to describe AWS Auto Scaling groups because of an expired AWS token. In the other, it failed to evaluate a scaling policy because it could not reach the APM (Prometheus).

{"@level":"warn","@message":"failed to get target status","@module":"policy_manager.policy_handler","@timestamp":"2023-07-06T16:21:26.029652Z","error":"failed to describe AWS Autoscaling Group: operation error Auto Scaling: DescribeAutoScalingGroups, https response error StatusCode: 403, RequestID: c674bc86-1234-4fb1-5678-b264741176bc, api error ExpiredToken: The security token included in the request is expired","policy_id":"613aeb80-xs23-8f4e-1234-ef2ca2748d8a"}

It would be great if the autoscaler exported a couple of extra Prometheus metrics that we could monitor to detect simple failures like these.
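To make the request concrete, here is a rough sketch of the kind of counter I have in mind, derived from the JSON logs as a stopgap. It is not how the autoscaler works today: the metric name `autoscaler_log_failures_total` and both helper functions are hypothetical, and the only thing taken from the real output is the log shape shown above (`@level`, `@module`, `@message`).

```python
import json
from collections import Counter

# Hypothetical metric name; the autoscaler does not export this today.
METRIC = "autoscaler_log_failures_total"


def count_failures(lines):
    """Count warn/error log entries, keyed by (module, message)."""
    counts = Counter()
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # ignore lines that are not JSON
        if not isinstance(entry, dict):
            continue  # ignore JSON values that are not log objects
        if entry.get("@level") in ("warn", "error"):
            key = (entry.get("@module", ""), entry.get("@message", ""))
            counts[key] += 1
    return counts


def escape_label(value):
    """Escape a label value for the Prometheus text exposition format."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render(counts):
    """Render the counts as Prometheus text exposition lines."""
    out = [f"# TYPE {METRIC} counter"]
    for (module, message), n in sorted(counts.items()):
        labels = f'module="{escape_label(module)}",message="{escape_label(message)}"'
        out.append(f"{METRIC}{{{labels}}} {n}")
    return "\n".join(out) + "\n"
```

Keying on the module and message rather than on the full `error` string keeps label cardinality low, since the error text carries request IDs that change on every call. A counter exported by the autoscaler itself would still be preferable to scraping logs.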
