Allow users/community to define healthy status conditions per Kubernetes resource, per version #4140

Open
Cajga opened this issue Nov 20, 2023 · 6 comments

@Cajga

Cajga commented Nov 20, 2023

Problem

Currently, Weave GitOps reports a "red" status instead of "green" in the graph view for several resource types even though the resource is in fact healthy.

Example:
HorizontalPodAutoscaler (apiVersion: autoscaling/v2)
It has 3 status conditions:

  • AbleToScale: should be True,
  • ScalingActive: should be True,
  • ScalingLimited: should be False

Solution

Define a way/procedure for how and where users or projects can define the healthy status conditions of a resource at a specific API version. This could be a configuration option, a PR against the project, etc.
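
For illustration only (this is not an existing Weave GitOps feature; all names below are made up), such a per-resource, per-version definition could look roughly like this in Go:

package health

// ExpectedCondition describes what one status condition should look like on a
// healthy resource, e.g. {Type: "AbleToScale", Status: "True"}.
type ExpectedCondition struct {
	Type   string
	Status string
}

// Rule maps one resource kind at one apiVersion to its healthy conditions.
type Rule struct {
	APIVersion string
	Kind       string
	Conditions []ExpectedCondition
}

// Condition is the minimal shape of an observed status condition.
type Condition struct {
	Type   string
	Status string
}

// IsHealthy reports whether the observed conditions satisfy a rule: every
// expected condition must be present with the expected status.
func IsHealthy(rule Rule, observed []Condition) bool {
	byType := map[string]string{}
	for _, c := range observed {
		byType[c.Type] = c.Status
	}
	for _, want := range rule.Conditions {
		if byType[want.Type] != want.Status {
			return false
		}
	}
	return true
}

// Example rule that users/community could contribute for autoscaling/v2 HPAs.
var hpaRule = Rule{
	APIVersion: "autoscaling/v2",
	Kind:       "HorizontalPodAutoscaler",
	Conditions: []ExpectedCondition{
		{Type: "AbleToScale", Status: "True"},
		{Type: "ScalingActive", Status: "True"},
		{Type: "ScalingLimited", Status: "False"},
	},
}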

Additional context

I would be willing to contribute the definitions for several resources if it were well defined how to do it.

@makkes
Member

makkes commented Nov 20, 2023

To determine the health of HPAs, Weave GitOps compares .status.currentReplicas with .status.desiredReplicas and checks each existing condition for a "Failed" or "Invalid" reason.
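
For clarity, here is a minimal sketch of the kind of check described above, assuming the k8s.io/api/autoscaling/v2 types. It is only an approximation, not the actual Weave GitOps code (in particular, whether the reason match is a prefix or substring check may differ):

package health

import (
	"strings"

	autoscalingv2 "k8s.io/api/autoscaling/v2"
)

// hpaLooksHealthy approximates the check described above: the replica counts
// must agree, and no condition may carry a "Failed*" or "Invalid*" reason.
func hpaLooksHealthy(hpa *autoscalingv2.HorizontalPodAutoscaler) bool {
	if hpa.Status.CurrentReplicas != hpa.Status.DesiredReplicas {
		return false
	}
	for _, cond := range hpa.Status.Conditions {
		// Assumption: reasons such as "FailedGetResourceMetric" or
		// "InvalidSelector" indicate an unhealthy HPA.
		if strings.HasPrefix(cond.Reason, "Failed") || strings.HasPrefix(cond.Reason, "Invalid") {
			return false
		}
	}
	return true
}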

@Cajga would you mind posting the complete .status object of your HPA here?

kubectl get hpa -n NAMESPACE HPA -o jsonpath={.status}

@Cajga
Author

Cajga commented Nov 20, 2023

@makkes thanks for looking into this. Sure, here we are:

# kubectl get hpa -n istio-system istiod -o jsonpath={.status}|jq
{
  "conditions": [
    {
      "lastTransitionTime": "2023-11-17T15:15:50Z",
      "message": "recent recommendations were higher than current one, applying the highest recent recommendation",
      "reason": "ScaleDownStabilized",
      "status": "True",
      "type": "AbleToScale"
    },
    {
      "lastTransitionTime": "2023-11-17T15:16:20Z",
      "message": "the HPA was able to successfully calculate a replica count from cpu resource utilization (percentage of request)",
      "reason": "ValidMetricFound",
      "status": "True",
      "type": "ScalingActive"
    },
    {
      "lastTransitionTime": "2023-11-18T19:10:45Z",
      "message": "the desired count is within the acceptable range",
      "reason": "DesiredWithinRange",
      "status": "False",
      "type": "ScalingLimited"
    }
  ],
  "currentMetrics": [
    {
      "resource": {
        "current": {
          "averageUtilization": 0,
          "averageValue": "3m"
        },
        "name": "cpu"
      },
      "type": "Resource"
    }
  ],
  "currentReplicas": 1,
  "desiredReplicas": 1
}

NOTE: this is a default installation of Istio (with metrics-server) into production, which would scale automatically if needed.

@Cajga
Author

Cajga commented Nov 20, 2023

Let me drop another example here, also from Istio's default installation:

# kubectl get poddisruptionbudgets.policy -n istio-system istiod -o jsonpath={.status}|jq
{
  "conditions": [
    {
      "lastTransitionTime": "2023-11-17T15:15:45Z",
      "message": "",
      "observedGeneration": 1,
      "reason": "InsufficientPods",
      "status": "False",
      "type": "DisruptionAllowed"
    }
  ],
  "currentHealthy": 1,
  "desiredHealthy": 1,
  "disruptionsAllowed": 0,
  "expectedPods": 1,
  "observedGeneration": 1
}

While one could argue that this shows that disruption would not be allowed in this case, this is still a healthy installation of Istio, and the "red" status of the PodDisruptionBudget does not look very nice on the graph.
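
To make the difference concrete, here is a hedged sketch assuming the k8s.io/api/policy/v1 types (both functions are illustrative, not Weave GitOps' actual rules), contrasting a condition-based check, which would mark the PDB above as red, with a coverage-based check, which would treat it as healthy:

package health

import (
	policyv1 "k8s.io/api/policy/v1"
)

// pdbConditionGreen treats the PDB as healthy only when DisruptionAllowed is
// True; a rule like this marks the Istio PDB above (disruptionsAllowed: 0) red.
func pdbConditionGreen(pdb *policyv1.PodDisruptionBudget) bool {
	for _, c := range pdb.Status.Conditions {
		if c.Type == "DisruptionAllowed" {
			return c.Status == "True"
		}
	}
	return false
}

// pdbCoverageGreen compares healthy pod counts instead, which reports the same
// PDB (currentHealthy: 1, desiredHealthy: 1) as healthy.
func pdbCoverageGreen(pdb *policyv1.PodDisruptionBudget) bool {
	return pdb.Status.CurrentHealthy >= pdb.Status.DesiredHealthy
}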

@Cajga
Author

Cajga commented Nov 20, 2023

@makkes hmm... looking into the code, it seems your test data for the HPA does in fact not look healthy:

    message: the desired replica count is less than the minimum replica count
    reason: TooFewReplicas
    status: "True"
    type: ScalingLimited

I believe this basically means that the HPA would like to scale down but has reached minReplicas. You should take action and reduce minReplicas to allow it to scale down...
The Red Hat folks have good documentation about this.

@foot
Contributor

foot commented Nov 24, 2023

Thanks for raising the issue!

Sounds like HPA health checking could be improved.

  • Doing good health checking for built-in k8s resources, and at least Flux resources too, would be great to maintain ourselves, and we might not need an extensible system for this.
  • Having a more extensible system that allows declaring a red/green mapping for less common CustomResources would be neat, but needs some thought.

weave gitops reports "red" status instead of "green" at graph view for several resource types

Are the other resource types CustomResources or builtin k8s resources?

@Cajga
Author

Cajga commented Mar 13, 2024

Hi @foot,

Sorry for not coming back. We stopped using/evaluating weave-gitops as it does not support Flux multi-tenant configuration (more details in this ticket).

As far as I remember, there were a few more resources reported red in our environment, but unfortunately I cannot recall which ones.
