Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add prometheus wpa controller reconcile and wpa valid metrics #142

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

khewonc
Copy link
Contributor

@khewonc khewonc commented Mar 10, 2022

What does this PR do?

Adds 2 metrics:

  • wpa_controller_reconcile_error: 1 with tag reason:<short_error_message> if the last reconcile results in an error. If there is no error, no metric is reported.
  • wpa_controller_reconcile_success: Reports 1 if the last reconcile is successful, 0 if there was an error.

Example:

# HELP wpa_controller_reconcile_error Gauge indicating whether the last recorded reconcile gave an error
# TYPE wpa_controller_reconcile_error gauge
wpa_controller_reconcile_error{reason="failed_compute_replicas",resource_kind="Deployment",resource_name="redis",resource_namespace="default",wpa_name="four",wpa_namespace="default"} 1
# HELP wpa_controller_reconcile_success Gauge indicating whether the last recorded reconcile is successful
# TYPE wpa_controller_reconcile_success gauge
wpa_controller_reconcile_success{resource_kind="Deployment",resource_name="redis",resource_namespace="default",wpa_name="four",wpa_namespace="default"} 0```

Motivation

More ways to track WPA and WPA controller errors. There's a controller_runtime_reconcile_total metric, but that only has the labels controller and result, which don't give much detail about the reconcile error.

Describe your test plan

Set up a WPA(s) (example). The WPA should be valid, the target resource should be present, and the Datadog metric should be present and reporting consistently. The wpa_controller_reconcile_success metric should be present with value 1 and the following labels:

  • resource_kind
  • resource_name
  • resource_namespace
  • wpa_name
  • wpa_namespace

To visualize metrics, either collect them via the node agent with the prometheus or openmetrics check (example), or check the /metrics endpoint:

  • Example request to curl /metrics endpoint: kubectl exec -it <wpa_controller> -- curl localhost:8383/metrics
  • Example using Datadog graphs: error metric, success metric

Update the WPA (and/or target resource) to force an error (examples below) and ensure that the wpa_controller_reconcile_success metric reports 0 and that the wpa_controller_reconcile_error metric reports 1 with the appropriate reason tag. There shouldn't be any stale metrics; the metrics should update accordingly when going from an error to an ok state, error to another error state, and when the WPA is deleted.

This isn't inclusive of all possible errors (and reason values), but here's a list of a few ways to force some errors:

  • Use a Datadog metric name that doesn't exist in the account, e.g. system.load.1.invalid. This should give metrics with reason:failed_compute_replicas and Failed to compute desired number of replicas based on listed metrics. logs in the controller pod.
  • Trigger a parsing error with spec.scaleTargetRef.apiVersion to get reason:invalid_api_version:
spec:
(...)
  scaleTargetRef:
    kind: "Deployment"
    name: "redis"
    apiVersion: "apps/v1///"
  • Use a mismatched spec.scaleTargetRef.apiVersion. For example, extensions/v1beta1 for Deployments looks to have been deprecated in v1.16 so on a newer Kubernetes cluster with the target Deployment using the apps/v1 apiVersion, using the old apiVersion results in reason:unknown_resource and the log line unable to determine resource for scale target reference:
spec:
(...)
  scaleTargetRef:
    kind: "Deployment"
    name: "redis"
    apiVersion: "extensions/v1beta1"
  • Make the spec.minReplicas larger than the spec.maxReplicas to show the log message Invalid WPA specification: watermark pod autoscaler requires the minimum number of replicas to be configured and inferior to the maximum and the tag reason:invalid_wpa_spec:
spec:
(...)
  minReplicas: 3
  maxReplicas: 1

@khewonc khewonc added the enhancement New feature or request label Mar 10, 2022
@khewonc khewonc requested a review from a team as a code owner March 10, 2022 15:28
@@ -26,15 +26,29 @@ const (
metricNamePromLabel = "metric_name"
reasonPromLabel = "reason"
transitionPromLabel = "transition"
reconcileErrPromLabel = "reconcile_err"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe it can be only error or type since the metrics name already contains "error"

)

// reasonValues contains the 3 possible values of the 'reason' label
var reasonValues = []string{downscaleCappingPromLabelVal, upscaleCappingPromLabelVal, withinBoundsPromLabelVal}

// reconcileReasonValues contains possible `reconcile_err` label values
var reconcileReasonValues = []string{nullPromLabelVal, scaleNotFoundPromLabelVal, invalidAPIVersionPromLabelVal, unknownResourcePromLabelVal, failedUpdateReplicasPromLabelVal, failedComputeReplicasPromLabelVal, failedScalePromLabelVal}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to be more self explained, the slice name could include "error"

Suggested change
var reconcileReasonValues = []string{nullPromLabelVal, scaleNotFoundPromLabelVal, invalidAPIVersionPromLabelVal, unknownResourcePromLabelVal, failedUpdateReplicasPromLabelVal, failedComputeReplicasPromLabelVal, failedScalePromLabelVal}
var reconcileErrorReasonValues = []string{nullPromLabelVal, scaleNotFoundPromLabelVal, invalidAPIVersionPromLabelVal, unknownResourcePromLabelVal, failedUpdateReplicasPromLabelVal, failedComputeReplicasPromLabelVal, failedScalePromLabelVal}

resourceNamespacePromLabel,
resourceNamePromLabel,
resourceKindPromLabel,
reconcileErrPromLabel,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO the "reconcile_success" should not include a "error" tag.

maybe we can inverse the metrics logic: reconcile_error and set to 1 if an error happen.

or create 2 metrics:

  • reconcile_success without reconcileErrPromLabel
  • reconcile_error send only in case of error with the error type.
    like this a user can monitor WPA with reconcile_success and understand the issue with the metrics "reconcile_error".

@CharlyF and @celenechang WDYT?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also WPA Invalid can be a reconcile_error error_type. so maybe we don't need the metric wpa_valid.

@khewonc khewonc force-pushed the khewonc/add-prometheus-wpa-reconcile-valid-metrics branch from 3f999fc to 43edfde Compare April 4, 2022 16:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants