
Bug: Longhorn nodes does not match requested amount #1285

Closed · Despire opened this issue Mar 21, 2024 · 6 comments
Assignees: Despire
Labels: bug (Something isn't working), groomed (Task that everybody agrees to pass the gatekeeper)

Comments

Despire (Contributor) commented Mar 21, 2024

In the CI pipeline, the following error has occurred consistently since upgrading the Longhorn version:

2024-03-20T17:36:04Z INF utils.go:89 > Waiting for 1.yaml from test-sets/test-set2 to finish... [ 2430s elapsed ] module=testing-framework
2024-03-20T17:36:04Z ERR claudie_test.go:117 > Error in test sets test-set1  error="error while performing additional test for manifest 1.yaml from test-set1 : error while checking the nodes.longhorn.io in cluster ts1-oci : the count of schedulable nodes (3) is not equal to nodes.longhorn.io (2) in cluster ts1-oci" module=testing-framework
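For reference, the check can be reproduced manually. The sketch below assumes the ts1-oci kubeconfig path and that "schedulable" means the node is not cordoned (spec.unschedulable not set); both are assumptions about the testing framework's exact semantics.

export KUBECONFIG=./ts1-oci-kubeconfig.yaml   # path is a placeholder

# Count schedulable Kubernetes nodes (not cordoned).
kubectl get nodes -o json \
  | jq '[.items[] | select(.spec.unschedulable != true)] | length'

# Count Longhorn node CRs; Longhorn keeps one nodes.longhorn.io object per
# node it manages, in the longhorn-system namespace.
kubectl get nodes.longhorn.io -n longhorn-system --no-headers | wc -l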
Despire added the bug label on Mar 21, 2024
JKBGIT1 added the groomed label on Apr 5, 2024
Despire (Contributor, Author) commented Apr 30, 2024

After adding & deleting nodes it can happen that some nodes end up in not ready state
Wireguard connection works ok, the CNI is corrupted on those nodes.
There are no files in /etc/cni/net.d/ on the corrupted nodes
Screenshot 2024-04-30 at 14 02 05

Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  MemoryPressure   False   Tue, 30 Apr 2024 14:08:32 +0200   Tue, 30 Apr 2024 12:42:19 +0200   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Tue, 30 Apr 2024 14:08:32 +0200   Tue, 30 Apr 2024 12:42:19 +0200   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Tue, 30 Apr 2024 14:08:32 +0200   Tue, 30 Apr 2024 12:42:19 +0200   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            False   Tue, 30 Apr 2024 14:08:32 +0200   Tue, 30 Apr 2024 12:42:19 +0200   KubeletNotReady              container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized

Screenshot 2024-04-30 at 14 01 09
Screenshot 2024-04-30 at 14 04 22
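To confirm the symptom on an affected node, a minimal sketch (SSH access to the node and the node name are assumptions):

# On the corrupted node itself: the CNI config directory is empty, which
# matches the "cni plugin not initialized" message in the Ready condition.
ls -la /etc/cni/net.d/

# From kubectl: list nodes that are not Ready and inspect their conditions.
kubectl get nodes --no-headers | awk '$2 != "Ready"'
kubectl describe node <not-ready-node> | sed -n '/Conditions:/,/Addresses:/p'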

Despire self-assigned this on Apr 30, 2024
Despire (Contributor, Author) commented Apr 30, 2024

After restarting the control plane node, the Cilium operator was stuck in a crash loop because it couldn't connect to the API server:

level=info msg="Cilium Operator 1.14.3 252a99ef 2023-10-18T18:21:56+03:00 go version go1.20.10 linux/amd64" subsys=cilium-operator-generic
level=info msg=Invoked duration="471.431µs" function="pprof.glob..func1 (cell.go:51)" subsys=hive
level=info msg=Invoked duration="118.966µs" function="gops.registerGopsHooks (cell.go:39)" subsys=hive
level=info msg=Invoked duration="648.85µs" function="cmd.registerOperatorHooks (root.go:156)" subsys=hive
level=info msg=Invoked duration=18.109668ms function="api.glob..func1 (cell.go:32)" subsys=hive
level=info msg=Invoked duration="213.824µs" function="apis.createCRDs (cell.go:63)" subsys=hive
level=info msg=Invoked duration="303.092µs" function="lbipam.glob..func1 (cell.go:25)" subsys=hive
level=info msg=Invoked duration="288.037µs" function="auth.registerIdentityWatcher (watcher.go:43)" subsys=hive
level=info msg=Invoked duration="194.58µs" function="cmd.registerLegacyOnLeader (root.go:362)" subsys=hive
level=info msg=Invoked duration="283.965µs" function="identitygc.registerGC (gc.go:82)" subsys=hive
level=info msg=Starting subsys=hive
level=info msg="Started gops server" address="127.0.0.1:9891" subsys=gops
level=info msg="Start hook executed" duration="441.486µs" function="gops.registerGopsHooks.func1 (cell.go:44)" subsys=hive
level=info msg="Establishing connection to apiserver" host="https://10.96.0.1:443" subsys=k8s-client
level=info msg="Establishing connection to apiserver" host="https://10.96.0.1:443" subsys=k8s-client

The WireGuard connection still worked.
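Since the operator hangs on "Establishing connection to apiserver" at https://10.96.0.1:443, a quick connectivity check from the control plane node helps separate a broken ClusterIP/service path from a broken API server. A sketch, with node addresses as placeholders:

# From the control plane node: is the kubernetes ClusterIP VIP reachable?
curl -k --connect-timeout 5 https://10.96.0.1:443/healthz || echo "service VIP unreachable"

# Compare with the API server reached directly over the node/WireGuard
# address, which reportedly still works (address is a placeholder):
curl -k --connect-timeout 5 https://<control-plane-wireguard-ip>:6443/healthz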

Despire (Contributor, Author) commented Apr 30, 2024

After the restart, the Cilium interfaces are no longer up on the control plane. However, the files do exist in /etc/cni/net.d/.
Screenshot 2024-04-30 at 14 50 15
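The same can be seen from the shell (a sketch; SSH access to the control plane is assumed). Cilium normally creates the cilium_host, cilium_net and, in tunnel mode, cilium_vxlan interfaces, so their absence points at the datapath rather than at a missing CNI config:

# On the control plane node: Cilium-managed interfaces are missing/down.
ip link show | grep -E 'cilium_(host|net|vxlan)'

# The CNI configuration itself is still in place.
ls -la /etc/cni/net.d/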

Despire (Contributor, Author) commented Apr 30, 2024

I was able to reproduce this exact issue with Calico as the CNI as well, so it must be something in how we apply the changes.

To reproduce:

Apply manifest 2 from test-set4.
Apply manifest 3 from test-set4.
Increase the count of the GCP nodepool from 1 to 3.

The newly added nodes should end up in the corrupted state even though everything completed successfully; see the verification sketch below.
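A minimal way to verify the corrupted state after the scale-up (a sketch; node names are placeholders):

# The newly added GCP nodes come up NotReady with the CNI error seen above.
kubectl get nodes -o wide
kubectl describe node <new-gcp-node> \
  | grep -E 'NetworkPluginNotReady|cni plugin not initialized'

# Longhorn then registers fewer nodes.longhorn.io objects than there are
# schedulable nodes, which is what the testing framework flags.
kubectl get nodes.longhorn.io -n longhorn-system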

Despire (Contributor, Author) commented May 10, 2024

Let's wait one more week before closing this issue as resolved by #1366.

Despire (Contributor, Author) commented May 15, 2024

This has not been seen since #1366.

Despire closed this as completed on May 15, 2024