[BUG] Longhorn seems unable to connect to any replica other than the one on the same node as the pod #8451

Open
greggameplayer opened this issue Apr 26, 2024 · 6 comments
Labels
kind/bug require/backport Require backport. Only used when the specific versions to backport have not been defined. require/qa-review-coverage Require QA to review coverage

Comments

@greggameplayer

greggameplayer commented Apr 26, 2024

Describe the bug

When I schedule a Docker registry pod with a Longhorn volume attached to it on the master node, the two replicas are placed on master2 and on worker1, and they keep failing, so the Docker registry pod never starts.
When I make the node worker1 unschedulable, one of the replicas moves onto the master node and boom, magically that replica works and the pod starts correctly, but the other replica on the master2 node is still failing.

To Reproduce

In Longhorn v1.6.1, create a volume with 2 replicas, which I've called registry.
In the UI, I created the PV/PVC inside the namespace docker-registry, named the PVC registry, and set ext4 as the filesystem format.
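
For reference, here is a rough manifest equivalent of the setup above (a sketch only: I created the volume and PV/PVC through the Longhorn UI, so the StorageClass name, access mode, and storage size below are assumptions):

  # Sketch: StorageClass with 2 replicas and ext4, plus the registry PVC.
  apiVersion: storage.k8s.io/v1
  kind: StorageClass
  metadata:
    name: longhorn-registry        # assumed name
  provisioner: driver.longhorn.io
  parameters:
    numberOfReplicas: "2"
    staleReplicaTimeout: "2880"
    fsType: "ext4"
  ---
  apiVersion: v1
  kind: PersistentVolumeClaim
  metadata:
    name: registry
    namespace: docker-registry
  spec:
    accessModes:
      - ReadWriteOnce
    storageClassName: longhorn-registry
    resources:
      requests:
        storage: 10Gi              # placeholder size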

Support bundle for troubleshooting

supportbundle_51ccfc95-cc51-470e-8d37-837371d0fb87_2024-04-25T22-10-41Z.zip

I removed the nodes folder from the original archive because it weighs more than 500 MB.

Environment

  • Longhorn version: v1.6.1
  • Impacted volume (PV): registry
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): helm
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: k3s
    • Number of control plane nodes in the cluster: 2
    • Number of worker nodes in the cluster: 1
  • Node config
    • OS type and version: master => AMD64 Ubuntu 22.04 / master2, worker1 => ARM64 Ubuntu 22.04
    • Kernel version: 6.5.0-28
    • CPU per node: 4
    • Memory per node: master => 12 GB / master2, worker1 => 24 GB
    • Disk type (e.g. SSD/NVMe/HDD): master => SSD / master2, worker1 => Oracle OCI storage
    • Network bandwidth between the nodes (Gbps): master => 1 Gbps / master2, worker1 => 4 Gbps
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): master2 => Proxmox at my home / master, worker1 => Oracle Cloud
  • Number of Longhorn volumes in the cluster: 1

Additional context

@greggameplayer greggameplayer added kind/bug require/backport Require backport. Only used when the specific versions to backport have not been defined. require/qa-review-coverage Require QA to review coverage labels Apr 26, 2024
@james-munson
Contributor

To recapitulate, the cluster is spread between your home and Oracle Cloud. When the replica is colocated with the pod on Proxmox at your home, it is functional, but replicas on the cloud nodes are not. As far as you know, there are no latency or permission issues between the nodes.

@greggameplayer
Author

@james-munson It doesn't depend on the pod location; the pod can be on a node in Oracle Cloud or at my home. But if a replica is not on the same node as the pod, it keeps failing and rebuilding over and over again.

@greggameplayer
Author

Also, I've corrected a mistake in my description: it's master2 that is on my Proxmox.
The other two nodes, master and worker1, are on Oracle Cloud, spread across two different datacenters (Frankfurt and Amsterdam).

For information, they are connected through Tailscale.

@greggameplayer
Author

Also, for your analysis, if you need it, I kept a copy of the nodes folder from the support bundle archive.

@james-munson
Contributor

In the logs there are any number of failures like:

logs/longhorn-system/instance-manager-6504c98bf07a50e11a254f7ceed606f4/instance-manager.log:2024-04-25T18:14:13.417286127+02:00 time="2024-04-25T16:14:13Z" level=error msg="I/O error" func="controller.(*Controller).handleErrorNoLock" file="control.go:1129" error="tcp://10.42.1.17:10000: r/w timeout; tcp://10.42.2.15:10000: r/w timeout"
logs/longhorn-system/instance-manager-6504c98bf07a50e11a254f7ceed606f4/instance-manager.log:2024-04-25T18:14:13.417299567+02:00 time="2024-04-25T16:14:13Z" level=error msg="I/O error" func="controller.(*Controller).handleErrorNoLock" file="control.go:1129" error="tcp://10.42.2.15:10000: r/w timeout; tcp://10.42.1.17:10000: r/w timeout"
logs/longhorn-system/instance-manager-6504c98bf07a50e11a254f7ceed606f4/instance-manager.log:2024-04-25T18:14:13.417313287+02:00 time="2024-04-25T16:14:13Z" level=error msg="I/O error" func="controller.(*Controller).handleErrorNoLock" file="control.go:1129" error="tcp://10.42.2.15:10000: r/w timeout; tcp://10.42.1.17:10000: r/w timeout"
logs/longhorn-system/instance-manager-6504c98bf07a50e11a254f7ceed606f4/instance-manager.log:2024-04-25T18:14:13.417334247+02:00 time="2024-04-25T16:14:13Z" level=error msg="I/O error" func="co

and

logs/longhorn-system/longhorn-manager-9fbml/longhorn-manager.log:2024-04-25T18:22:43.717690906+02:00 time="2024-04-25T16:22:43Z" level=warning msg="registry-e-0: time=\"2024-04-25T16:15:58Z\" level=error msg=\"I/O error\" func=\"controller.(*Controller).handleErrorNoLock\" file=\"control.go:1129\" error=\"tcp://10.42.2.15:10000: r/w timeout\"" func="controller.(*InstanceHandler).printInstanceLogs" file="instance_handler.go:467"
logs/longhorn-system/longhorn-manager-9fbml/longhorn-manager.log:2024-04-25T18:22:43.730457593+02:00 time="2024-04-25T16:22:43Z" level=warning msg="registry-e-0: time=\"2024-04-25T16:21:06Z\" level=error msg=\"R/W Timeout. No response received in 8s\" func=\"dataconn.(*Client).loop\" file=\"client.go:148\"" func="controller.(*InstanceHandler).printInstanceLogs" file="instance_handler.go:467"

and

logs/longhorn-system/csi-resizer-7466f7b45f-4hxq6/csi-resizer.log.1:2024-04-25T17:47:27.771057735+02:00 F0425 15:47:27.770927       1 main.go:134] failed to connect to CSI driver: context deadline exceeded
logs/longhorn-system/csi-attacher-57689cc84b-w9cjv/csi-attacher.log.1:2024-04-25T17:47:28.739540560+02:00 E0425 15:47:28.739342       1 main.go:136] context deadline exceeded
logs/longhorn-system/csi-provisioner-6c78dcb664-qm4wp/csi-provisioner.log.1:2024-04-25T17:51:51.895116612+02:00 E0425 15:51:51.894875       1 csi-provisioner.go:215] context deadline exceeded
logs/longhorn-system/longhorn-csi-plugin-z95kw/node-driver-registrar.log.1:2024-04-25T17:47:31.268636940+02:00 E0425 15:47:31.268391    7163 main.go:160] error connecting to CSI driver: context deadline exceeded

All of these indicate timeouts between components. Coupled with the fact that replicas behave when local (meaning in the same data center) but not when they cross data center boundaries, the conclusion is clear that latency between sites is too large to run the block I/O reliably. This is not a cluster setup that Longhorn would recommend.
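
As a side note, the 8s in the "R/W Timeout. No response received in 8s" messages matches Longhorn's engine-replica-timeout setting (default 8 seconds). Raising it only postpones the timeout and will not make cross-datacenter block I/O reliable, but for reference, a sketch of that setting as a custom resource (API version and allowed value range should be checked against the v1.6 docs):

  # Sketch: lengthen the engine-to-replica I/O timeout.
  # This does not fix inter-site latency; it only delays the failure.
  apiVersion: longhorn.io/v1beta2
  kind: Setting
  metadata:
    name: engine-replica-timeout
    namespace: longhorn-system
  value: "30"   # seconds; default is 8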

@greggameplayer
Copy link
Author

On the previous Longhorn version, 1.5, with kernel 5.5 and an earlier version of k3s, it was working perfectly fine on my setup.

Projects
Status: Pending user response