HorizontalPodAutoscaler causes degraded status #6287
Comments
After upgrading from 1.x to 2.0, I had the same issue. 👀
We have the same issue (after upgrading to 2.0.x), but during the deployment's rollout. It seems like
Having the same issue with a degraded HPA. It looks like it happens because the HPA does not have enough metrics during a rollout. Potentially, it can be mitigated by increasing the HPA CPU initialization period.
Seeing this as well. This is causing our pipelines to fail, as we validate application health as a step using Because we are using managed cloud clusters, we can't change the --horizontal-pod-autoscaler-cpu-initialization-period flag on the kube controller. It would be nice if there were a way around this from an Argo CD standpoint other than writing a custom health check that always marks HPAs as healthy. FYI, for anyone looking for a workaround to stop the degraded status from appearing at all, here is the health check we are using.
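For illustration only (the commenter's exact check is not reproduced in this thread), a resource customization that unconditionally reports HPAs as Healthy can be configured in the argocd-cm ConfigMap. The key name below assumes the newer per-resource format:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
data:
  # Suppresses HPA health evaluation entirely: genuine HPA failures
  # will no longer surface as Degraded, so use with care.
  resource.customizations.health.autoscaling_HorizontalPodAutoscaler: |
    hs = {}
    hs.status = "Healthy"
    hs.message = "HPA health checks are disabled"
    return hs
```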
While the approach above works, it's a workaround rather than a solution. We should solve this somehow on the argocd-notifications side of things.
I was playing around a little and found this solution. Please feel free to use it and leave feedback about your experience. How it works: a PreSync job deletes the existing HPA before the rollout, a PostSync job waits for the new pods to warm up, and the HPA is then re-created as a PostSync hook.
I added hook-delete-policy: HookSucceeded to all workaround parts so they are deleted at the very end. That leaves only the HPA, which is deployed in PostSync at the very end. # templates/hpa.yaml
{{- if .Values.autoscaling.enabled }}
---
apiVersion: v1
kind: ServiceAccount
metadata:
name: {{ include "app.fullname" . }}-hpa-delete
annotations:
argocd.argoproj.io/hook: PreSync
argocd.argoproj.io/sync-wave: "-10"
argocd.argoproj.io/hook-delete-policy: HookSucceeded
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
name: {{ include "app.fullname" . }}-hpa-delete
annotations:
argocd.argoproj.io/hook: PreSync
argocd.argoproj.io/sync-wave: "-10"
argocd.argoproj.io/hook-delete-policy: HookSucceeded
rules:
- apiGroups: ["autoscaling"]
resources: ["horizontalpodautoscalers"]
resourceNames: ["{{ include "app.fullname" . }}"]
verbs: ["get", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
name: {{ include "app.fullname" . }}-hpa-delete
annotations:
argocd.argoproj.io/hook: PreSync
argocd.argoproj.io/sync-wave: "-10"
argocd.argoproj.io/hook-delete-policy: HookSucceeded
subjects:
- kind: ServiceAccount
name: {{ include "app.fullname" . }}-hpa-delete
roleRef:
kind: Role
name: {{ include "app.fullname" . }}-hpa-delete
apiGroup: rbac.authorization.k8s.io
---
apiVersion: batch/v1
kind: Job
metadata:
name: {{ include "app.fullname" . }}-hpa-delete
annotations:
argocd.argoproj.io/hook: PreSync
argocd.argoproj.io/sync-wave: "-5"
argocd.argoproj.io/hook-delete-policy: HookSucceeded
spec:
backoffLimit: 0
template:
spec:
serviceAccountName: {{ include "app.fullname" . }}-hpa-delete
restartPolicy: Never
containers:
- name: {{ include "app.fullname" . }}-hpa-delete
image: public.ecr.aws/bitnami/kubectl:1.20
imagePullPolicy: IfNotPresent
env:
- name: NS
value: {{ .Release.Namespace }}
- name: APP
value: {{ include "app.fullname" . }}
command:
- /bin/bash
- -c
- |-
echo -e "[INFO]\tTrying to delete HPA ${APP} in namespace ${NS}..."
echo
RESULT=$(kubectl get hpa "${APP}" -n "${NS}" 2>&1)
if [[ $RESULT =~ "Deployment/${APP}" ]]; then
kubectl delete hpa ${APP} -n ${NS}
echo
echo -e "[OK]\tContinue deployment..."
exit 0
elif [[ $RESULT =~ "\"${APP}\" not found" ]]; then
echo "${RESULT}"
echo
echo -e "[OK]\tContinue deployment..."
exit 0
else
echo "${RESULT}"
echo
echo -e "[ERROR]\tUnexpected error. Check the log above!"
exit 1
fi
---
apiVersion: batch/v1
kind: Job
metadata:
name: {{ include "app.fullname" . }}-hpa-wait
annotations:
argocd.argoproj.io/hook: PostSync
argocd.argoproj.io/sync-wave: "0"
argocd.argoproj.io/hook-delete-policy: HookSucceeded
spec:
backoffLimit: 0
template:
spec:
restartPolicy: Never
containers:
- name: {{ include "app.fullname" . }}-hpa-wait
image: public.ecr.aws/docker/library/alpine:3.15.0
imagePullPolicy: IfNotPresent
command: ["sh", "-c", "sleep 120"]
---
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
annotations:
argocd.argoproj.io/hook: PostSync
argocd.argoproj.io/sync-wave: "5"
name: {{ include "app.fullname" . }}
labels:
{{- include "app.labels" . | nindent 4 }}
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: {{ include "app.fullname" . }}
minReplicas: {{ .Values.autoscaling.minReplicas }}
maxReplicas: {{ .Values.autoscaling.maxReplicas }}
metrics:
{{- if .Values.autoscaling.cpuAverageUtilization }}
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: {{ .Values.autoscaling.cpuAverageUtilization }}
{{- end }}
{{- if .Values.autoscaling.memoryAverageUtilization }}
- type: Resource
resource:
name: memory
target:
type: Utilization
averageUtilization: {{ .Values.autoscaling.memoryAverageUtilization }}
{{- end }}
{{- end }}
Any new suggestions here?
@mubarak-j shared a more sophisticated health check workaround in the comment here; pasting it below:
I'm new to custom health checks. Which is correct:
or
?
The new format shown in the Argo CD docs examples was introduced in Argo CD v1.2.0 and explained in the release blog post here. So unless you're running an older version of Argo CD, you will need to use the new format.
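For anyone comparing the two, here is a sketch of how the same (illustrative) Lua check looks in each format; both go under data: in the argocd-cm ConfigMap:

```yaml
# Older single-key format: all customizations nested under one key.
resource.customizations: |
  autoscaling/HorizontalPodAutoscaler:
    health.lua: |
      hs = {}
      hs.status = "Healthy"
      return hs

# Newer per-resource key format: one key per group_Kind.
resource.customizations.health.autoscaling_HorizontalPodAutoscaler: |
  hs = {}
  hs.status = "Healthy"
  return hs
```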
@mubarak-j thanks for answering! I think I found a different issue in my setup. I'm managing Argo CD with the Helm chart, and what I came up with in my values.yaml, based on outdated documentation, was
which, I guess, was ignored.
Let's see if it works. BTW, it would be great if there were a way to list/show resource customizations.
You can find argocd built-in resource customizations here: https://github.com/argoproj/argo-cd/tree/master/resource_customizations |
This is a comprehensive custom health check for HPA. I also added a condition to make sure the apiVersion is not v1, as v1 only exposes the status in an annotation.
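The snippet itself isn't reproduced here, but a sketch of the described logic might look like the following health.lua body. The obj global is provided by Argo CD; the specific condition types and the handling of each are assumptions, not the commenter's exact code:

```lua
-- Sketch only. autoscaling/v1 HPAs expose conditions only via an
-- annotation, so report them Healthy without inspecting status.
hs = {}
if obj.apiVersion == "autoscaling/v1" then
  hs.status = "Healthy"
  hs.message = "autoscaling/v1 does not expose status.conditions"
  return hs
end
if obj.status ~= nil and obj.status.conditions ~= nil then
  for _, condition in pairs(obj.status.conditions) do
    if condition.type == "AbleToScale" and condition.status == "False" then
      hs.status = "Degraded"
      hs.message = condition.message
      return hs
    end
    if condition.type == "ScalingActive" and condition.status == "False" then
      -- Often transient during a rollout (missing metrics), so report
      -- Progressing rather than Degraded.
      hs.status = "Progressing"
      hs.message = condition.message
      return hs
    end
  end
end
hs.status = "Healthy"
return hs
```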
Does anyone have an idea for a workaround in Koncrete (https://www.koncrete.dev/) hosted Argo CD? We do not have access to the Kubernetes API, so applying those customizations is not an option.
Checklist:
argocd version

Describe the bug
To scale up, HorizontalPodAutoscaler increases the replicas of a Deployment. That seems to cause Argo CD to consider the service degraded, as the number of replicas running immediately after the increase will be less than what is specified in the Deployment. The status recovers back to Healthy once the Deployment has managed to start the desired number of replicas.

The status shouldn't be considered degraded, because the application is working exactly as intended and scaling up using standard Kubernetes practices. We are receiving notifications when the status is degraded, so we are constantly notified whenever the deployment scales up.

To Reproduce

Expected behavior
The status shouldn't be considered degraded. Instead, it could stay Healthy or be something less severe than Degraded. We expect to be notified when the status truly degrades, not during normal HorizontalPodAutoscaler operations.

Version