-
Notifications
You must be signed in to change notification settings - Fork 25
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add prometheus wpa controller reconcile and wpa valid metrics #142
base: main
Are you sure you want to change the base?
Conversation
controllers/metrics.go
Outdated
@@ -26,15 +26,29 @@ const ( | |||
metricNamePromLabel = "metric_name" | |||
reasonPromLabel = "reason" | |||
transitionPromLabel = "transition" | |||
reconcileErrPromLabel = "reconcile_err" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe it can be only error
or type
since the metrics name already contains "error"
controllers/metrics.go
Outdated
) | ||
|
||
// reasonValues contains the 3 possible values of the 'reason' label | ||
var reasonValues = []string{downscaleCappingPromLabelVal, upscaleCappingPromLabelVal, withinBoundsPromLabelVal} | ||
|
||
// reconcileReasonValues contains possible `reconcile_err` label values | ||
var reconcileReasonValues = []string{nullPromLabelVal, scaleNotFoundPromLabelVal, invalidAPIVersionPromLabelVal, unknownResourcePromLabelVal, failedUpdateReplicasPromLabelVal, failedComputeReplicasPromLabelVal, failedScalePromLabelVal} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to be more self explained, the slice name could include "error"
var reconcileReasonValues = []string{nullPromLabelVal, scaleNotFoundPromLabelVal, invalidAPIVersionPromLabelVal, unknownResourcePromLabelVal, failedUpdateReplicasPromLabelVal, failedComputeReplicasPromLabelVal, failedScalePromLabelVal} | |
var reconcileErrorReasonValues = []string{nullPromLabelVal, scaleNotFoundPromLabelVal, invalidAPIVersionPromLabelVal, unknownResourcePromLabelVal, failedUpdateReplicasPromLabelVal, failedComputeReplicasPromLabelVal, failedScalePromLabelVal} |
controllers/metrics.go
Outdated
resourceNamespacePromLabel, | ||
resourceNamePromLabel, | ||
resourceKindPromLabel, | ||
reconcileErrPromLabel, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO the "reconcile_success" should not include a "error" tag.
maybe we can inverse the metrics logic: reconcile_error
and set to 1 if an error happen.
or create 2 metrics:
reconcile_success
withoutreconcileErrPromLabel
reconcile_error
send only in case of error with the error type.
like this a user can monitor WPA withreconcile_success
and understand the issue with the metrics "reconcile_error".
@CharlyF and @celenechang WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also WPA Invalid
can be a reconcile_error
error_type. so maybe we don't need the metric wpa_valid
.
3f999fc
to
43edfde
Compare
What does this PR do?
Adds 2 metrics:
wpa_controller_reconcile_error
:1
with tagreason:<short_error_message>
if the last reconcile results in an error. If there is no error, no metric is reported.wpa_controller_reconcile_success
: Reports1
if the last reconcile is successful,0
if there was an error.Example:
Motivation
More ways to track WPA and WPA controller errors. There's a
controller_runtime_reconcile_total
metric, but that only has the labelscontroller
andresult
, which don't give much detail about the reconcile error.Describe your test plan
Set up a WPA(s) (example). The WPA should be valid, the target resource should be present, and the Datadog metric should be present and reporting consistently. The
wpa_controller_reconcile_success
metric should be present with value1
and the following labels:resource_kind
resource_name
resource_namespace
wpa_name
wpa_namespace
To visualize metrics, either collect them via the node agent with the
prometheus
oropenmetrics
check (example), or check the/metrics
endpoint:/metrics
endpoint:kubectl exec -it <wpa_controller> -- curl localhost:8383/metrics
Update the WPA (and/or target resource) to force an error (examples below) and ensure that the
wpa_controller_reconcile_success
metric reports0
and that thewpa_controller_reconcile_error
metric reports1
with the appropriatereason
tag. There shouldn't be any stale metrics; the metrics should update accordingly when going from an error to an ok state, error to another error state, and when the WPA is deleted.This isn't inclusive of all possible errors (and
reason
values), but here's a list of a few ways to force some errors:system.load.1.invalid
. This should give metrics withreason:failed_compute_replicas
andFailed to compute desired number of replicas based on listed metrics.
logs in the controller pod.spec.scaleTargetRef.apiVersion
to getreason:invalid_api_version
:spec.scaleTargetRef.apiVersion
. For example,extensions/v1beta1
for Deployments looks to have been deprecated in v1.16 so on a newer Kubernetes cluster with the target Deployment using theapps/v1
apiVersion, using the old apiVersion results inreason:unknown_resource
and the log lineunable to determine resource for scale target reference
:spec.minReplicas
larger than thespec.maxReplicas
to show the log messageInvalid WPA specification: watermark pod autoscaler requires the minimum number of replicas to be configured and inferior to the maximum
and the tagreason:invalid_wpa_spec
: