[BUG] Longhorn seems unable to connect to any replica other than the one on the same node as the pod #8451

Open
greggameplayer opened this issue Apr 26, 2024 · 6 comments
Labels
kind/bug require/backport Require backport. Only used when the specific versions to backport have not been defined. require/qa-review-coverage Require QA to review coverage

Comments

@greggameplayer

greggameplayer commented Apr 26, 2024

Describe the bug

When I schedule a Docker registry pod with a Longhorn volume attached to it on the master node, the two replicas are placed on master2 and on worker1, and they keep failing, so the Docker registry pod never starts.
When I make the node worker1 unschedulable, one of the replicas moves onto the master node and boom, magically that replica works and the pod starts correctly, but the other replica on the master2 node is still failing.

To Reproduce

In Longhorn v1.6.1, create a volume with 2 replicas, which I've called registry.
In the UI, I created the PV/PVC inside the namespace docker-registry, named the PVC registry, and set ext4 as the filesystem format.
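
For reference, here is a rough manifest equivalent of the setup above (a sketch only: I created the volume and PV/PVC through the Longhorn UI, so the StorageClass name, access mode, and storage size below are assumptions):

  # Sketch: StorageClass with 2 replicas and ext4, plus the registry PVC.
  apiVersion: storage.k8s.io/v1
  kind: StorageClass
  metadata:
    name: longhorn-registry        # assumed name
  provisioner: driver.longhorn.io
  parameters:
    numberOfReplicas: "2"
    staleReplicaTimeout: "2880"
    fsType: "ext4"
  ---
  apiVersion: v1
  kind: PersistentVolumeClaim
  metadata:
    name: registry
    namespace: docker-registry
  spec:
    accessModes:
      - ReadWriteOnce
    storageClassName: longhorn-registry
    resources:
      requests:
        storage: 10Gi              # placeholder size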

Support bundle for troubleshooting

supportbundle_51ccfc95-cc51-470e-8d37-837371d0fb87_2024-04-25T22-10-41Z.zip

I removed the nodes folder from the original archive because it weighs more than 500 MB.

Environment

  • Longhorn version: v1.6.1
  • Impacted volume (PV): registry
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): helm
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: k3s
    • Number of control plane nodes in the cluster: 2
    • Number of worker nodes in the cluster: 1
  • Node config
    • OS type and version: master => AMD64 Ubuntu 22.04 / master2, worker1 => ARM64 Ubuntu 22.04
    • Kernel version: 6.5.0-28
    • CPU per node: 4
    • Memory per node: master => 12 GB / master2, worker1 => 24 GB
    • Disk type (e.g. SSD/NVMe/HDD): master => SSD / master2, worker1 => Oracle OCI storage
    • Network bandwidth between the nodes (Gbps): master => 1 Gbps / master2, worker1 => 4 Gbps
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): master2 => Proxmox at my home / master, worker1 => Oracle Cloud
  • Number of Longhorn volumes in the cluster: 1

Additional context

@greggameplayer greggameplayer added kind/bug require/backport Require backport. Only used when the specific versions to backport have not been defined. require/qa-review-coverage Require QA to review coverage labels Apr 26, 2024
@james-munson
Contributor

To recapitulate, the cluster is spread between your home and Oracle Cloud. When the replica is colocated with the pod on Proxmox at your home, it is functional, but replicas on the cloud nodes are not. As far as you know, there are no latency or permission issues between the nodes.

@greggameplayer
Author

@james-munson It doesn't depend on the pod location; the pod can be on a node in Oracle Cloud or at my home. But if a replica is not on the same node as the pod, it keeps failing and rebuilding over and over again.

@greggameplayer
Author

Also, I've corrected a mistake in my description: it's master2 that is on my Proxmox.
The other two nodes, master and worker1, are on Oracle Cloud, spread across two different datacenters (Frankfurt and Amsterdam).

For information, they are connected through Tailscale.

@greggameplayer
Author

Also, for your analysis, if you need it, I kept a copy of the nodes folder from the support bundle archive.

@james-munson
Contributor

In the logs there are any number of failures like:

logs/longhorn-system/instance-manager-6504c98bf07a50e11a254f7ceed606f4/instance-manager.log:2024-04-25T18:14:13.417286127+02:00 time="2024-04-25T16:14:13Z" level=error msg="I/O error" func="controller.(*Controller).handleErrorNoLock" file="control.go:1129" error="tcp://10.42.1.17:10000: r/w timeout; tcp://10.42.2.15:10000: r/w timeout"
logs/longhorn-system/instance-manager-6504c98bf07a50e11a254f7ceed606f4/instance-manager.log:2024-04-25T18:14:13.417299567+02:00 time="2024-04-25T16:14:13Z" level=error msg="I/O error" func="controller.(*Controller).handleErrorNoLock" file="control.go:1129" error="tcp://10.42.2.15:10000: r/w timeout; tcp://10.42.1.17:10000: r/w timeout"
logs/longhorn-system/instance-manager-6504c98bf07a50e11a254f7ceed606f4/instance-manager.log:2024-04-25T18:14:13.417313287+02:00 time="2024-04-25T16:14:13Z" level=error msg="I/O error" func="controller.(*Controller).handleErrorNoLock" file="control.go:1129" error="tcp://10.42.2.15:10000: r/w timeout; tcp://10.42.1.17:10000: r/w timeout"
logs/longhorn-system/instance-manager-6504c98bf07a50e11a254f7ceed606f4/instance-manager.log:2024-04-25T18:14:13.417334247+02:00 time="2024-04-25T16:14:13Z" level=error msg="I/O error" func="co

and

logs/longhorn-system/longhorn-manager-9fbml/longhorn-manager.log:2024-04-25T18:22:43.717690906+02:00 time="2024-04-25T16:22:43Z" level=warning msg="registry-e-0: time=\"2024-04-25T16:15:58Z\" level=error msg=\"I/O error\" func=\"controller.(*Controller).handleErrorNoLock\" file=\"control.go:1129\" error=\"tcp://10.42.2.15:10000: r/w timeout\"" func="controller.(*InstanceHandler).printInstanceLogs" file="instance_handler.go:467"
logs/longhorn-system/longhorn-manager-9fbml/longhorn-manager.log:2024-04-25T18:22:43.730457593+02:00 time="2024-04-25T16:22:43Z" level=warning msg="registry-e-0: time=\"2024-04-25T16:21:06Z\" level=error msg=\"R/W Timeout. No response received in 8s\" func=\"dataconn.(*Client).loop\" file=\"client.go:148\"" func="controller.(*InstanceHandler).printInstanceLogs" file="instance_handler.go:467"

and

logs/longhorn-system/csi-resizer-7466f7b45f-4hxq6/csi-resizer.log.1:2024-04-25T17:47:27.771057735+02:00 F0425 15:47:27.770927       1 main.go:134] failed to connect to CSI driver: context deadline exceeded
logs/longhorn-system/csi-attacher-57689cc84b-w9cjv/csi-attacher.log.1:2024-04-25T17:47:28.739540560+02:00 E0425 15:47:28.739342       1 main.go:136] context deadline exceeded
logs/longhorn-system/csi-provisioner-6c78dcb664-qm4wp/csi-provisioner.log.1:2024-04-25T17:51:51.895116612+02:00 E0425 15:51:51.894875       1 csi-provisioner.go:215] context deadline exceeded
logs/longhorn-system/longhorn-csi-plugin-z95kw/node-driver-registrar.log.1:2024-04-25T17:47:31.268636940+02:00 E0425 15:47:31.268391    7163 main.go:160] error connecting to CSI driver: context deadline exceeded

All of these indicate timeouts between components. Coupled with the fact that replicas behave when local (meaning in the same data center) but not when they cross data center boundaries, the conclusion is clear that latency between sites is too large to run the block I/O reliably. This is not a cluster setup that Longhorn would recommend.
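
As a side note, the 8s in the "R/W Timeout. No response received in 8s" messages matches Longhorn's engine-replica-timeout setting (default 8 seconds). Raising it only postpones the timeout and will not make cross-datacenter block I/O reliable, but for reference, a sketch of that setting as a custom resource (API version and allowed value range should be checked against the v1.6 docs):

  # Sketch: lengthen the engine-to-replica I/O timeout.
  # This does not fix inter-site latency; it only delays the failure.
  apiVersion: longhorn.io/v1beta2
  kind: Setting
  metadata:
    name: engine-replica-timeout
    namespace: longhorn-system
  value: "30"   # seconds; default is 8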

@greggameplayer
Copy link
Author

On the previous Longhorn version, 1.5, with kernel 5.5 and an earlier version of k3s, it was working perfectly fine on my setup.

Projects
Status: Pending user response