Bug: The dsChecksum wasn't updated on the autoscaling request #1340

Open
JKBGIT1 opened this issue Apr 18, 2024 · 0 comments
Labels
bug Something isn't working groomed Task that everybody agrees to pass the gatekeeper

Comments


JKBGIT1 commented Apr 18, 2024

Current Behaviour

On the autoscaling request, Claudie updated only the desiredState and didn't change the dsChecksum. As a result, the autoscaler didn't work (see).

According to the logs, the cluster-autoscaler failed to retrieve the resource lock kube-system/cluster-autoscaler and then lost the master. The same error occurred in #1065 (comment).

$ kubectl logs -n claudie autoscaler-wox01-cluster-qy5w5zl-57f74c7dd5-v8gbd -c cluster-autoscaler -p
...
I0417 18:08:17.383459       1 static_autoscaler.go:673] Decreasing size of compute01-ccx23-auto-fy7ww3o, expected=7 current=6 delta=-1
I0417 18:08:17.383801       1 static_autoscaler.go:426] Some node group target size was fixed, skipping the iteration
I0417 18:08:27.485122       1 static_autoscaler.go:673] Decreasing size of compute01-ccx23-auto-fy7ww3o, expected=7 current=6 delta=-1
I0417 18:08:27.485366       1 static_autoscaler.go:426] Some node group target size was fixed, skipping the iteration
I0417 18:08:28.291313       1 node_instances_cache.go:156] Start refreshing cloud provider node instances cache
I0417 18:08:28.292135       1 node_instances_cache.go:168] Refresh cloud provider node instances cache finished, refresh took 787.249µs
E0417 18:08:33.691966       1 leaderelection.go:330] error retrieving resource lock kube-system/cluster-autoscaler: Get "https://loadbalancer.worldofpotter.eu:6443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/cluster-autoscaler": context deadline exceeded
I0417 18:08:33.695145       1 leaderelection.go:283] failed to renew lease kube-system/cluster-autoscaler: timed out waiting for the condition
F0417 18:08:40.377907       1 main.go:578] lost master

Besides that, there was an error in the builder from 8 days ago. The ansibler produced the error due to a timeout on static nodes while installing the VPN. However, the InputManifest stayed in the DONE state the whole time... (see #1339)

One more thing: the cluster-autoscaler reports 6 nodes with an expected count of 7. On the other hand, the InputManifest record in Mongo has 7 autoscaled nodes in the currentState and 6 autoscaled nodes in the desiredState.

Expected Behaviour

Claudie should update the value of the dsChecksum when it updates the desiredState.

Steps To Reproduce

I don't know.

Anything else to note

The same error in the cluster-autoscaler #1065 (comment)

EDIT: there is a workaround for this error when it appears.

@JKBGIT1 JKBGIT1 added the bug Something isn't working label Apr 18, 2024
@Despire Despire added the groomed Task that everybody agrees to pass the gatekeeper label Apr 19, 2024