
[BUG] Lost connection to unix:///csi/csi.sock #8427

Closed
klauserber opened this issue Apr 24, 2024 · 10 comments


klauserber commented Apr 24, 2024

Describe the bug

I am seeing many restarts of my csi-attacher pods. The reason is a lost connection:

I0424 04:00:25.811476       1 csi_handler.go:251] Attaching "csi-77e48635a1cf3972fd685244caa18e5d4527d4cac42753450fd4fa219cab449a"
E0424 04:00:25.818092       1 connection.go:142] Lost connection to unix:///csi/csi.sock.
F0424 04:00:25.818214       1 connection.go:97] Lost connection to CSI driver, exiting

There is no further impact on the workload. After the restart, the volume gets mounted successfully:

I0424 04:00:26.276101       1 main.go:97] Version: v4.4.2
I0424 04:00:26.277646       1 common.go:138] Probing CSI driver for readiness
I0424 04:00:26.279744       1 main.go:154] CSI driver name: "driver.longhorn.io"
I0424 04:00:26.280626       1 main.go:230] CSI driver supports ControllerPublishUnpublish, using real CSI handler
I0424 04:00:26.280916       1 leaderelection.go:250] attempting to acquire leader lease longhorn-system/external-attacher-leader-driver-longhorn-io...
I0424 04:00:26.294174       1 leaderelection.go:260] successfully acquired lease longhorn-system/external-attacher-leader-driver-longhorn-io
I0424 04:00:26.294288       1 leader_election.go:178] became leader, starting
I0424 04:00:26.294346       1 controller.go:130] Starting CSI attacher
I0424 04:00:26.395929       1 csi_handler.go:251] Attaching "csi-77e48635a1cf3972fd685244caa18e5d4527d4cac42753450fd4fa219cab449a"
I0424 04:00:36.566452       1 csi_handler.go:264] Attached "csi-77e48635a1cf3972fd685244caa18e5d4527d4cac42753450fd4fa219cab449a"

After that, I can stop and start the pod with no restarts. But after a while, when no mounts have happened, the connection gets lost again.

To Reproduce

Mount a PVC after no mounts have happened for a longer time. I don't know exactly how long it takes to lose the connection; for me it happens every morning when I start my development pods.

Expected behavior

The PVC is mounted in the pod without any restarts of a csi-attacher pod.

Support bundle for troubleshooting

supportbundle_4bdc31e4-aaf6-4450-add0-84e8f7054857_2024-04-24T05-23-42Z.zip

Environment

  • Longhorn version: v1.6.1
  • Impacted volume (PV): pvc-017776f4-da0e-47db-b863-764f2f228e64
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): Helm
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: RKE2
    • Number of control plane nodes in the cluster: 3
    • Number of worker nodes in the cluster: 3
  • Node config
    • OS type and version: Ubuntu 22.04
    • Kernel version: 5.15.0-100-generic
    • CPU per node: 8
    • Memory per node: 96GB
    • Disk type (e.g. SSD/NVMe/HDD): SSD
    • Network bandwidth between the nodes (Gbps): 10
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): Baremetal
  • Number of Longhorn volumes in the cluster: 9

Additional context

klauserber added the kind/bug, require/backport, and require/qa-review-coverage labels Apr 24, 2024
derekbit (Member) commented

cc @ejweber @james-munson

ejweber (Contributor) commented Apr 29, 2024

Hello @klauserber. As far as I know, you're the first Longhorn user to report this, but it appears to be a known upstream issue that affects other CSI drivers as well.

The AWS EBS CSI driver team did a good analysis over at kubernetes-sigs/aws-ebs-csi-driver#1875. Their analysis directly mentions the version of the external-attacher used in Longhorn v1.6.1 (v4.4.2).

It was fixed in the utility library csi-lib-utils in v0.16.0. Unfortunately, that fix wasn't picked up in external-attacher until v4.5.0.

Looking through the release notes, I think it should be safe for us to bump to this version of external-attacher in Longhorn v1.6.2 to help users avoid the bug, but we will have to do some testing.

If the issue is particularly troublesome for you, you can override the version of external-attacher Longhorn uses with any of the Longhorn deployment mechanisms. Changing it from v4.4.2 to v4.5.1 looks quite safe to me, but any untested change comes with at least some risk.
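
With the Helm chart, for example, an override could look like this (a sketch: the image.csi.attacher.* values keys and the longhornio/csi-attacher:v4.5.1 tag are assumptions based on the chart's values.yaml, so verify both against your chart version):

# Override only the attacher image; --reuse-values keeps the rest of the release unchanged.
helm upgrade longhorn longhorn/longhorn \
  --namespace longhorn-system \
  --reuse-values \
  --set image.csi.attacher.repository=longhornio/csi-attacher \
  --set image.csi.attacher.tag=v4.5.1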

ejweber self-assigned this Apr 29, 2024
ejweber added this to the v1.7.0 milestone Apr 29, 2024
ejweber added the area/upstream and area/csi labels and removed the require/qa-review-coverage label Apr 29, 2024
klauserber (Author) commented

Hello @ejweber, thank you for the information.

Since I see no impact on my workloads, it's fine for me to wait.


J1a-wei commented May 1, 2024

Same issue here:

Longhorn version: v1.6.0
Installation method (e.g. Rancher Catalog App/Helm/Kubectl): Helm
Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: RKE2 v1.26.0
Number of control plane nodes in the cluster: 3
Number of worker nodes in the cluster: 5
Node config
OS type and version: Ubuntu 22.04
Kernel version: 5.15.0-101-generic
CPU per node: 24 cores / 48 threads
Memory per node: 384GB
Disk type (e.g. SSD/NVMe/HDD): NVMe
Network bandwidth between the nodes (Gbps): 10
Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): Baremetal
Number of Longhorn volumes in the cluster: 80 volumes totaling 100 TB

ejweber (Contributor) commented May 1, 2024

Thanks for the confirmation @J1a-wei. I checked v1.6.0 and it also uses the affected version of external-attacher.

ejweber (Contributor) commented May 1, 2024

This was a bit confusing, so I'm recording it here for my own reference:

kubernetes-csi/csi-lib-utils#153 bumps csi-lib-utils to grpc-go v1.59.0 and simultaneously fixes the connection-dropping issue. Any sidecar using csi-lib-utils v0.16.0+ should be fine.

However, to see when sidecars started to be broken (for backporting), we cannot look at the csi-lib-utils version by itself. Instead, we should look at the grpc-go version used by the sidecar. If the grpc-go version is v1.59.0+ (it has likely been updated automatically, independently of csi-lib-utils) but csi-lib-utils is not v0.16.0+, the sidecar is affected.
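
One way to check is to read the module information Go embeds in its binaries (a sketch: the image tag and the /csi-attacher binary path inside the image are assumptions and may differ per sidecar):

# Copy the sidecar binary out of its image without running it.
docker create --name tmp-attacher registry.k8s.io/sig-storage/csi-attacher:v4.4.2
docker cp tmp-attacher:/csi-attacher ./csi-attacher
docker rm tmp-attacher
# "go version -m" prints the module versions compiled into a Go binary.
go version -m ./csi-attacher | grep -E 'google.golang.org/grpc|csi-lib-utils'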

ejweber (Contributor) commented May 1, 2024

Sidecar versions we use that are affected (a workaround sketch follows the list):

  • external-attacher v4.4.2 in master, v1.6.x, v1.5.x -> upgrade to v4.5.1
  • external-provisioner v3.6.2 in master -> upgrade to v4.0.1
  • external-provisioner v3.6.2 in v1.6.x, v1.5.x
    - There is no minor version we can upgrade to per policy.
    - Users can downgrade to v3.6.1 or upgrade to v4.0.1 as potential workarounds.
  • external-resizer v1.9.2 in master, v1.6.x, v1.5.x -> upgrade to v1.10.1
  • external-snapshotter v6.3.2 in master -> upgrade to v7.0.2
  • external-snapshotter v6.3.2 in v1.6.x, v1.5.x
    - There is no minor version we can upgrade to per policy.
    - Users can downgrade to v6.3.1 or upgrade to v7.0.2 as potential workarounds.
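
As a stopgap, the versions above could be applied directly to the sidecar Deployments, for example (a sketch: the Deployment and container names match a typical Longhorn install, and the longhornio mirror tags are assumed to exist; verify both in your cluster):

kubectl -n longhorn-system set image deployment/csi-attacher \
  csi-attacher=longhornio/csi-attacher:v4.5.1
kubectl -n longhorn-system set image deployment/csi-resizer \
  csi-resizer=longhornio/csi-resizer:v1.10.1

Keep in mind that longhorn-manager manages these Deployments, so manual edits may be reverted; overriding through your deployment mechanism (Helm values, etc.) is more durable.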

ejweber (Contributor) commented May 1, 2024

Reproduce:

  1. Install Longhorn master-head.
  2. Wait more than thirty minutes.
  3. Check how many times CSI sidecars have restarted. In my case, there were previous restarts.
eweber@laptop:~/longhorn> kl get pod | grep csi
csi-attacher-57c5fd5bdf-sptgq                       1/1     Running   2 (55m ago)   2d
csi-attacher-57c5fd5bdf-vsdx2                       1/1     Running   0             2d
csi-attacher-57c5fd5bdf-z8lxx                       1/1     Running   0             2d
csi-provisioner-7b95bf4b87-vw8m8                    1/1     Running   0             2d
csi-provisioner-7b95bf4b87-w4cfl                    1/1     Running   1 (54m ago)   2d
csi-provisioner-7b95bf4b87-wfn4z                    1/1     Running   2 (55m ago)   2d
csi-resizer-6df9886858-2zgl9                        1/1     Running   0             2d
csi-resizer-6df9886858-4zz2v                        1/1     Running   0             2d
csi-resizer-6df9886858-w6w9b                        1/1     Running   0             2d
csi-snapshotter-5d84585dd4-5s6xh                    1/1     Running   0             2d
csi-snapshotter-5d84585dd4-nvbm6                    1/1     Running   0             2d
csi-snapshotter-5d84585dd4-wqwp6                    1/1     Running   0             2d
  4. Tail the logs of csi-attacher and csi-provisioner (one way to tail them is sketched after these steps).
  5. In a different window, apply examples/pod_with_pvc.yaml.
  6. Notice that the tailed logs show both csi-attacher and csi-provisioner restarting.
[csi-provisioner-7b95bf4b87-wfn4z] I0501 21:10:44.439910       1 controller.go:1366] provision "default/longhorn-volv-pvc" class "longhorn": started
[csi-provisioner-7b95bf4b87-wfn4z] W0501 21:10:44.440229       1 controller.go:620] "fstype" is deprecated and will be removed in a future release, please use "csi.storage.k8s.io/fstype" instead
[csi-provisioner-7b95bf4b87-wfn4z] I0501 21:10:44.443383       1 event.go:298] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"default", Name:"longhorn-volv-pvc", UID:"bb949859-07a1-43bf-bf8d-7fdd5f9c52f9", APIVersion:"v1", ResourceVersion:"89704388", FieldPath:""}): type: 'Normal' reason: 'Provisioning' External provisioner is provisioning volume for claim "default/longhorn-volv-pvc"
[csi-provisioner-7b95bf4b87-wfn4z] E0501 21:10:44.444240       1 connection.go:142] Lost connection to unix:///csi/csi.sock.
[csi-provisioner-7b95bf4b87-wfn4z] F0501 21:10:44.445549       1 connection.go:97] Lost connection to CSI driver, exiting
[csi-attacher-57c5fd5bdf-sptgq] I0501 21:11:10.190372       1 csi_handler.go:251] Attaching "csi-d12886f1b82b79892bf5c94e42001e0ce4680feed019591654ee302e72eeda93"
[csi-attacher-57c5fd5bdf-sptgq] E0501 21:11:10.331010       1 connection.go:142] Lost connection to unix:///csi/csi.sock.
[csi-attacher-57c5fd5bdf-sptgq] F0501 21:11:10.331268       1 connection.go:97] Lost connection to CSI driver, exiting
  7. Recheck how many times CSI sidecars have restarted. Both affected pods have restarted another time.
eweber@laptop:~/longhorn> kl get pod | grep csi
csi-attacher-57c5fd5bdf-sptgq                       1/1     Running   3 (3m36s ago)   2d
csi-attacher-57c5fd5bdf-vsdx2                       1/1     Running   0               2d
csi-attacher-57c5fd5bdf-z8lxx                       1/1     Running   0               2d
csi-provisioner-7b95bf4b87-vw8m8                    1/1     Running   0               2d
csi-provisioner-7b95bf4b87-w4cfl                    1/1     Running   1 (59m ago)     2d
csi-provisioner-7b95bf4b87-wfn4z                    1/1     Running   3 (4m2s ago)    2d
csi-resizer-6df9886858-2zgl9                        1/1     Running   0               2d
csi-resizer-6df9886858-4zz2v                        1/1     Running   0               2d
csi-resizer-6df9886858-w6w9b                        1/1     Running   0               2d
csi-snapshotter-5d84585dd4-5s6xh                    1/1     Running   0               2d
csi-snapshotter-5d84585dd4-nvbm6                    1/1     Running   0               2d
csi-snapshotter-5d84585dd4-wqwp6                    1/1     Running   0               2d
longhorn-csi-plugin-b6b2c                           3/3     Running   0               2d
longhorn-csi-plugin-kwlnj                           3/3     Running   0               2d
longhorn-csi-plugin-pgfgl                           3/3     Running   0               2d
  8. Delete and reapply examples/pod_with_pvc.yaml before another thirty minutes have elapsed.
  9. Notice that this does not cause any additional restarts.
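
For step 4, one way to tail both sidecars at once (a sketch: the app=csi-attacher and app=csi-provisioner pod labels are assumptions; confirm with kubectl get pod --show-labels):

# --prefix adds the pod name to each line so restarts are easy to attribute.
kubectl -n longhorn-system logs -f -l app=csi-attacher --prefix &
kubectl -n longhorn-system logs -f -l app=csi-provisioner --prefix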


longhorn-io-github-bot commented May 3, 2024

Pre Ready-For-Testing Checklist

roger-ryao commented

Verified on master-head 20240509

Test steps:
#8427 (comment)

Result: Passed

  1. I was also able to reproduce this issue on v1.6.1, and it was fixed on master-head.
