
Bug: Longhorn nodes does not match requested amount #1285

Closed · Despire opened this issue Mar 21, 2024 · 6 comments
Assignees: Despire
Labels: bug (Something isn't working), groomed (Task that everybody agrees to pass the gatekeeper)

Comments

Despire (Contributor) commented Mar 21, 2024

In the CI pipeline, the following error has occurred consistently since upgrading the Longhorn version:

2024-03-20T17:36:04Z INF utils.go:89 > Waiting for 1.yaml from test-sets/test-set2 to finish... [ 2430s elapsed ] module=testing-framework
2024-03-20T17:36:04Z ERR claudie_test.go:117 > Error in test sets test-set1  error="error while performing additional test for manifest 1.yaml from test-set1 : error while checking the nodes.longhorn.io in cluster ts1-oci : the count of schedulable nodes (3) is not equal to nodes.longhorn.io (2) in cluster ts1-oci" module=testing-framework
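For reference, the check can be reproduced manually. The sketch below assumes the ts1-oci kubeconfig path and that "schedulable" means the node is not cordoned (spec.unschedulable not set); both are assumptions about the testing framework's exact semantics.

export KUBECONFIG=./ts1-oci-kubeconfig.yaml   # path is a placeholder

# Count schedulable Kubernetes nodes (not cordoned).
kubectl get nodes -o json \
  | jq '[.items[] | select(.spec.unschedulable != true)] | length'

# Count Longhorn node CRs; Longhorn keeps one nodes.longhorn.io object per
# node it manages, in the longhorn-system namespace.
kubectl get nodes.longhorn.io -n longhorn-system --no-headers | wc -l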
Despire added the bug label on Mar 21, 2024
JKBGIT1 added the groomed label on Apr 5, 2024
Despire (Contributor, Author) commented Apr 30, 2024

After adding & deleting nodes it can happen that some nodes end up in not ready state
Wireguard connection works ok, the CNI is corrupted on those nodes.
There are no files in /etc/cni/net.d/ on the corrupted nodes
Screenshot 2024-04-30 at 14 02 05

Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  MemoryPressure   False   Tue, 30 Apr 2024 14:08:32 +0200   Tue, 30 Apr 2024 12:42:19 +0200   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Tue, 30 Apr 2024 14:08:32 +0200   Tue, 30 Apr 2024 12:42:19 +0200   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Tue, 30 Apr 2024 14:08:32 +0200   Tue, 30 Apr 2024 12:42:19 +0200   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            False   Tue, 30 Apr 2024 14:08:32 +0200   Tue, 30 Apr 2024 12:42:19 +0200   KubeletNotReady              container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized

Screenshot 2024-04-30 at 14 01 09
Screenshot 2024-04-30 at 14 04 22
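To confirm the symptom on an affected node, a minimal sketch (SSH access to the node and the node name are assumptions):

# On the corrupted node itself: the CNI config directory is empty, which
# matches the "cni plugin not initialized" message in the Ready condition.
ls -la /etc/cni/net.d/

# From kubectl: list nodes that are not Ready and inspect their conditions.
kubectl get nodes --no-headers | awk '$2 != "Ready"'
kubectl describe node <not-ready-node> | sed -n '/Conditions:/,/Addresses:/p'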

Despire self-assigned this on Apr 30, 2024
Despire (Contributor, Author) commented Apr 30, 2024

After restarting the control plane node, the Cilium operator was stuck in a crash loop because it couldn't connect to the API server:

level=info msg="Cilium Operator 1.14.3 252a99ef 2023-10-18T18:21:56+03:00 go version go1.20.10 linux/amd64" subsys=cilium-operator-generic
level=info msg=Invoked duration="471.431µs" function="pprof.glob..func1 (cell.go:51)" subsys=hive
level=info msg=Invoked duration="118.966µs" function="gops.registerGopsHooks (cell.go:39)" subsys=hive
level=info msg=Invoked duration="648.85µs" function="cmd.registerOperatorHooks (root.go:156)" subsys=hive
level=info msg=Invoked duration=18.109668ms function="api.glob..func1 (cell.go:32)" subsys=hive
level=info msg=Invoked duration="213.824µs" function="apis.createCRDs (cell.go:63)" subsys=hive
level=info msg=Invoked duration="303.092µs" function="lbipam.glob..func1 (cell.go:25)" subsys=hive
level=info msg=Invoked duration="288.037µs" function="auth.registerIdentityWatcher (watcher.go:43)" subsys=hive
level=info msg=Invoked duration="194.58µs" function="cmd.registerLegacyOnLeader (root.go:362)" subsys=hive
level=info msg=Invoked duration="283.965µs" function="identitygc.registerGC (gc.go:82)" subsys=hive
level=info msg=Starting subsys=hive
level=info msg="Started gops server" address="127.0.0.1:9891" subsys=gops
level=info msg="Start hook executed" duration="441.486µs" function="gops.registerGopsHooks.func1 (cell.go:44)" subsys=hive
level=info msg="Establishing connection to apiserver" host="https://10.96.0.1:443" subsys=k8s-client
level=info msg="Establishing connection to apiserver" host="https://10.96.0.1:443" subsys=k8s-client

The WireGuard connection still worked.
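Since the operator hangs on "Establishing connection to apiserver" at https://10.96.0.1:443, a quick connectivity check from the control plane node helps separate a broken ClusterIP/service path from a broken API server. A sketch, with node addresses as placeholders:

# From the control plane node: is the kubernetes ClusterIP VIP reachable?
curl -k --connect-timeout 5 https://10.96.0.1:443/healthz || echo "service VIP unreachable"

# Compare with the API server reached directly over the node/WireGuard
# address, which reportedly still works (address is a placeholder):
curl -k --connect-timeout 5 https://<control-plane-wireguard-ip>:6443/healthz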

Despire (Contributor, Author) commented Apr 30, 2024

After the restart, the Cilium interfaces are no longer up on the control plane. However, the files do exist in /etc/cni/net.d/.
Screenshot 2024-04-30 at 14 50 15
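The same can be seen from the shell (a sketch; SSH access to the control plane is assumed). Cilium normally creates the cilium_host, cilium_net and, in tunnel mode, cilium_vxlan interfaces, so their absence points at the datapath rather than at a missing CNI config:

# On the control plane node: Cilium-managed interfaces are missing/down.
ip link show | grep -E 'cilium_(host|net|vxlan)'

# The CNI configuration itself is still in place.
ls -la /etc/cni/net.d/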

Despire (Contributor, Author) commented Apr 30, 2024

I was able to reproduce this exact issue with Calico as the CNI as well, so it must be something in how we apply the changes.

To reproduce:

Apply manifest 2 from test-set4.
Apply manifest 3 from test-set4.
Increase the count of the GCP nodepool from 1 to 3.

The newly added nodes should end up in the corrupted state even though everything completed successfully; see the verification sketch below.
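A minimal way to verify the corrupted state after the scale-up (a sketch; node names are placeholders):

# The newly added GCP nodes come up NotReady with the CNI error seen above.
kubectl get nodes -o wide
kubectl describe node <new-gcp-node> \
  | grep -E 'NetworkPluginNotReady|cni plugin not initialized'

# Longhorn then registers fewer nodes.longhorn.io objects than there are
# schedulable nodes, which is what the testing framework flags.
kubectl get nodes.longhorn.io -n longhorn-system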

Despire (Contributor, Author) commented May 10, 2024

Let's wait one more week before closing this issue as resolved by #1366.

Despire (Contributor, Author) commented May 15, 2024

This has not been seen since #1366.

Despire closed this as completed on May 15, 2024