HorizontalPodAutoscaler causes degraded status #6287

Open
lindlof opened this issue May 21, 2021 · 13 comments
Labels
bug Something isn't working

Comments

lindlof commented May 21, 2021

Checklist:

  • I've searched in the docs and FAQ for my answer: https://bit.ly/argocd-faq.
  • I've included steps to reproduce the bug.
  • I've pasted the output of argocd version.

Describe the bug

To scale up, the HorizontalPodAutoscaler increases the replica count of a Deployment. That seems to cause Argo CD to consider the service degraded, because immediately after the increase the number of running replicas is lower than the number specified in the Deployment. The status recovers to healthy once the Deployment has started the desired number of replicas.

The status shouldn't be considered degraded because it's working exactly as intended and scaling up, using standard Kubernetes practices.

We send notifications when the status is degraded, so we are constantly being notified whenever the deployment scales up.

To Reproduce

  1. Add a Deployment and a HorizontalPodAutoscaler
  2. Send traffic to trigger a scale-up of the Deployment
  3. After the autoscaler scales up, the application status becomes degraded (a minimal manifest sketch is below)
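
A minimal pair of manifests along these lines should be enough to reproduce this; the names, image and thresholds below are only placeholders, and a metrics server is assumed so the HPA can act on CPU utilization:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: demo-app
  template:
    metadata:
      labels:
        app: demo-app
    spec:
      containers:
        - name: demo-app
          image: nginx:1.21
          resources:
            requests:
              cpu: 100m # a CPU request is required for utilization-based scaling
---
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: demo-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: demo-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50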

Expected behavior

The status shouldn't be considered degraded. Instead, it could stay healthy or be something less severe than degraded.

We expect to get notified when the status truly degrades and not during normal HorizontalPodAutoscaler operations.

Version

{
    "Version": "v1.9.0+98bec61",
    "BuildDate": "2021-01-08T07:46:29Z",
    "GitCommit": "98bec61d6154a1baac54812e5816c0d4bbc79c05",
    "GitTreeState": "clean",
    "GoVersion": "go1.14.12",
    "Compiler": "gc",
    "Platform": "linux/amd64",
    "KsonnetVersion": "v0.13.1",
    "KustomizeVersion": "v3.8.1 2020-07-16T00:58:46Z",
    "HelmVersion": "v3.4.1+gc4e7485",
    "KubectlVersion": "v1.17.8",
    "JsonnetVersion": "v0.17.0"
}
lindlof added the bug label May 21, 2021
zezaeoh commented May 26, 2021

After upgrading the version from 1.x to 2.0, I had the same issue. 👀

pvlltvk commented May 28, 2021

We have the same issue (after upgrading to 2.0.x), but during the deployment's rollout. It seems like maxSurge: 2 in rollingUpdate also causes the degraded status.

juris commented Jul 16, 2021

Having the same issue with a degraded HPA. It looks like it happens because the HPA does not have enough metrics during a rollout. Potentially, it can be mitigated by increasing the HPA CPU initialization period via --horizontal-pod-autoscaler-cpu-initialization-period.
In my case that is not an option, as EKS does not support it yet.
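
For clusters where the control plane is self-managed, that flag is set on the kube-controller-manager; roughly something like the fragment below in its static pod manifest (the path and the chosen duration are only examples):

# /etc/kubernetes/manifests/kube-controller-manager.yaml (fragment; path may differ)
spec:
  containers:
    - name: kube-controller-manager
      command:
        - kube-controller-manager
        # ...existing flags...
        # delay counting CPU samples from freshly started pods towards HPA decisions
        - --horizontal-pod-autoscaler-cpu-initialization-period=5m0s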

mmckane commented Jul 30, 2021

Seeing this as well. This is causing our pipelines to fail because we validate application health as a step using argocd app wait --health, but there is about a 30-second to 1-minute period after pushing a new version of a deployment during which Argo CD marks the HPA as degraded. This causes argocd app wait --health to exit with an error code, failing our pipeline.

Because we are using cloud clusters, we can't change the --horizontal-pod-autoscaler-cpu-initialization-period flag on the kube-controller-manager. It would be nice if there were a way around this on the Argo CD side other than writing a custom health check that always marks HPAs as healthy.

FYI, for anyone looking for a workaround to stop the degraded status from appearing at all, here is the health check we are using.

apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
data:
  resource.customizations: | 
    autoscaling/HorizontalPodAutoscaler:
      health.lua: |
        hs = {}
        hs.status = "Healthy"
        hs.message = "Ignoring HPA Health Check"
        return hs

pentago commented Dec 8, 2021

While the approach above works, it's a workaround rather than a solution. We should somehow solve this on the argocd-notifications side of things.
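
One partial option on the notifications side, sketched under the assumption that triggers are defined in argocd-notifications-cm and that a template named app-health-degraded already exists (both names are illustrative): oncePer limits the trigger to firing once per synced revision, which at least suppresses the repeated alerts on every scale-up, though it cannot distinguish a real degradation from an HPA blip:

apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-notifications-cm
data:
  trigger.on-health-degraded: |
    - when: app.status.health.status == 'Degraded'
      # fire at most once per synced revision instead of on every health flap
      oncePer: app.status.sync.revision
      send: [app-health-degraded]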

artem-kosenko commented Feb 10, 2022

I was playing around a little and found this solution. Please feel free to use it and leave your feedback about your experience.

How it works:

  1. At the very beginning of the application deployment/update, it checks for the HPA and deletes it if it exists (here we need to run kubectl inside a K8s Job, and to do so we have to create a ServiceAccount and a Role for it that allow getting and deleting the HPA resource of this specific app). All of this runs on sync-wave = -10/-5 (make sure you use the correct version of kubectl for your K8s cluster version).
  2. Then the normal Deployment runs on the default sync-wave = 0.
  3. Then one more K8s Job runs with a sleep inside, just to wait until the metrics server has metrics for the newly deployed ReplicaSet (sleep 120 is enough): PostSync, sync-wave = 0.
  4. Then the HPA is deployed on PostSync, sync-wave = 5.

I added hook-delete-policy: HookSucceeded to all the workaround parts so they are deleted at the very end. That leaves only the HPA, which is deployed in PostSync as the final step.

# templates/hpa.yaml
{{- if .Values.autoscaling.enabled }}
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: {{ include "app.fullname" . }}-hpa-delete
  annotations:
    argocd.argoproj.io/hook: PreSync
    argocd.argoproj.io/sync-wave: "-10"
    argocd.argoproj.io/hook-delete-policy: HookSucceeded
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: {{ include "app.fullname" . }}-hpa-delete
  annotations:
    argocd.argoproj.io/hook: PreSync
    argocd.argoproj.io/sync-wave: "-10"
    argocd.argoproj.io/hook-delete-policy: HookSucceeded
rules:
  - apiGroups: ["autoscaling"]
    resources: ["horizontalpodautoscalers"]
    resourceNames: ["{{ include "app.fullname" . }}"]
    verbs: ["get", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: {{ include "app.fullname" . }}-hpa-delete
  annotations:
    argocd.argoproj.io/hook: PreSync
    argocd.argoproj.io/sync-wave: "-10"
    argocd.argoproj.io/hook-delete-policy: HookSucceeded
subjects:
  - kind: ServiceAccount
    name: {{ include "app.fullname" . }}-hpa-delete
roleRef:
  kind: Role
  name: {{ include "app.fullname" . }}-hpa-delete
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: batch/v1
kind: Job
metadata:
  name: {{ include "app.fullname" . }}-hpa-delete
  annotations:
    argocd.argoproj.io/hook: PreSync
    argocd.argoproj.io/sync-wave: "-5"
    argocd.argoproj.io/hook-delete-policy: HookSucceeded
spec:
  backoffLimit: 0
  template:
    spec:
      serviceAccountName: {{ include "app.fullname" . }}-hpa-delete
      restartPolicy: Never
      containers:
        - name: {{ include "app.fullname" . }}-hpa-delete
          image: public.ecr.aws/bitnami/kubectl:1.20
          imagePullPolicy: IfNotPresent
          env:
            - name: NS
              value: {{ .Release.Namespace }}
            - name: APP
              value: {{ include "app.fullname" . }}
          command:
            - /bin/bash
            - -c
            - |-
              echo -e "[INFO]\tTrying to delete HPA ${APP} in namespace ${NS}..."
              echo

              RESULT=`kubectl get hpa ${APP} -n ${NS} 2>&1`

              if [[ $RESULT =~ "Deployment/${APP}" ]]; then
                kubectl delete hpa ${APP} -n ${NS}
                echo
                echo -e "[OK]\tContinue deployment..."
                exit 0
              elif [[ $RESULT =~ "\"${APP}\" not found" ]]; then
                echo "${RESULT}"
                echo
                echo -e "[OK]\tContinue deployment..."
                exit 0
              else
                echo "${RESULT}"
                echo
                echo -e "[ERROR]\tUnexpected error. Check the log above!"
                exit 1
              fi

---
apiVersion: batch/v1
kind: Job
metadata:
  name: {{ include "app.fullname" . }}-hpa-wait
  annotations:
    argocd.argoproj.io/hook: PostSync
    argocd.argoproj.io/sync-wave: "0"
    argocd.argoproj.io/hook-delete-policy: HookSucceeded
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: {{ include "app.fullname" . }}-hpa-wait
          image: public.ecr.aws/docker/library/alpine:3.15.0
          imagePullPolicy: IfNotPresent
          command: ["sh", "-c", "sleep 120"]
---
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  annotations:
    argocd.argoproj.io/hook: PostSync
    argocd.argoproj.io/sync-wave: "5"
  name: {{ include "app.fullname" . }}
  labels:
    {{- include "app.labels" . | nindent 4 }}
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: {{ include "app.fullname" . }}
  minReplicas: {{ .Values.autoscaling.minReplicas }}
  maxReplicas: {{ .Values.autoscaling.maxReplicas }}
  metrics:
    {{- if .Values.autoscaling.cpuAverageUtilization }}
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: {{ .Values.autoscaling.cpuAverageUtilization }}
    {{- end }}
    {{- if .Values.autoscaling.memoryAverageUtilization }}
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: {{ .Values.autoscaling.memoryAverageUtilization }}
    {{- end }}
{{- end }}

@noam-allcloud

Any new suggestions here?

prein commented Mar 3, 2022

@mubarak-j shared a more sophisticated health check workaround in the comment here; pasting it below:

    resource.customizations.health.autoscaling_HorizontalPodAutoscaler: |
      hs = {}
      if obj.status ~= nil then
        if obj.status.conditions ~= nil then
          for i, condition in ipairs(obj.status.conditions) do
            if condition.type == "ScalingActive" and condition.reason == "FailedGetResourceMetric" then
                hs.status = "Progressing"
                hs.message = condition.message
                return hs
            end
            if condition.status == "True" then
                hs.status = "Healthy"
                hs.message = condition.message
                return hs
            end
          end
        end
        hs.status = "Healthy"
        return hs
      end
      hs.status = "Progressing"
      return hs

I'm new to custom health checks. Which of these is correct?

  resource.customizations: | 
    autoscaling/HorizontalPodAutoscaler:
      health.lua: |

or

  resource.customizations: |
     health.autoscaling_HorizontalPodAutoscaler: |

The above question is also discussed in #6175.

@mubarak-j (Contributor)

The new format, as shown in the argocd docs examples, was introduced in ArgoCD v1.2.0 and is explained in the release blog post here.

So unless you're running an older version of argocd, you will need to use the new format.

prein commented Mar 3, 2022

@mubarak-j thanks for answering!
Looking into the blog post, I'm not sure what "In the upcoming release, the resource.customizations key has been deprecated in favor of a separate ConfigMap key per resource" means.

I think I found a different issue in my setup. I'm managing argocd with the Helm chart, and what I came up with in my values.yaml, based on outdated documentation, was

argo-cd:
  server:
    config:
      resourceCustomizations: |
        health.autoscaling_HorizontalPodAutoscaler: |
          hs = {}
          [...]

which, I guess, was ignored.
I thought there was some translation between the Helm values and the ConfigMap, whereas I could simply do:

argo-cd:
  server:
    config:
      resource.customizations.health.autoscaling_HorizontalPodAutoscaler: |
          hs = {}
          [...]

Let's see if it works

BTW, it would be great if there was a way to list/show resource customizations.
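
In the meantime, the customizations that are actually configured can at least be inspected by dumping the ConfigMap directly (assuming the default argocd namespace):

kubectl -n argocd get configmap argocd-cm -o yaml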

@mubarak-j (Contributor)

You can find argocd built-in resource customizations here: https://github.com/argoproj/argo-cd/tree/master/resource_customizations

@chris-ng-scmp (Contributor)

This is a comprehensive custom health check for HPA

I also added a condition to make sure the apiVersion is not autoscaling/v1, as v1 only exposes the status conditions through an annotation.

    resource.customizations.useOpenLibs.autoscaling_HorizontalPodAutoscaler: "true"
    resource.customizations.health.autoscaling_HorizontalPodAutoscaler: |
      hs = {}
      hsScalingActive = {}
      if obj.apiVersion == 'autoscaling/v1' then
          hs.status = "Degraded"
          hs.message = "Please upgrade the apiVersion to the latest."
          return hs
      end
      if obj.status ~= nil then
        if obj.status.conditions ~= nil then
          for i, condition in ipairs(obj.status.conditions) do
            if condition.status == "False" and condition.type ~= 'ScalingActive' then
                hs.status = "Degraded"
                hs.message = condition.message
                return hs
            end
            if condition.type == "ScalingActive" and condition.reason == "FailedGetResourceMetric" and condition.status then
                if string.find(condition.message, "missing request for") then
                  hs.status = "Degraded"
                  hs.message = condition.message
                  return hs
                end
                hsScalingActive.status = "Progressing"
                hsScalingActive.message = condition.message
            end
          end
          if hs.status ~= nil then
            return hs
          end
          if hsScalingActive.status ~= nil then
            return hsScalingActive
          end
          hs.status = "Healthy"
          return hs
        end
      end
      hs.status = "Progressing"
      return hs

zdraganov commented Sep 14, 2022

Does anyone have an idea for a workaround in the Koncrete (https://www.koncrete.dev/) hosted ArgoCD? We do not have access to the K8s API, so we have no way to apply those customizations.
