
[BUG] Lost connection to unix:///csi/csi.sock #8427

Closed
klauserber opened this issue Apr 24, 2024 · 10 comments


klauserber commented Apr 24, 2024

Describe the bug

I am seeing many restarts of my csi-attacher pods. The reason is a lost connection:

I0424 04:00:25.811476       1 csi_handler.go:251] Attaching "csi-77e48635a1cf3972fd685244caa18e5d4527d4cac42753450fd4fa219cab449a"
E0424 04:00:25.818092       1 connection.go:142] Lost connection to unix:///csi/csi.sock.
F0424 04:00:25.818214       1 connection.go:97] Lost connection to CSI driver, exiting

There is no further impact on the workload. After the restart, the volume gets mounted successfully:

I0424 04:00:26.276101       1 main.go:97] Version: v4.4.2
I0424 04:00:26.277646       1 common.go:138] Probing CSI driver for readiness
I0424 04:00:26.279744       1 main.go:154] CSI driver name: "driver.longhorn.io"
I0424 04:00:26.280626       1 main.go:230] CSI driver supports ControllerPublishUnpublish, using real CSI handler
I0424 04:00:26.280916       1 leaderelection.go:250] attempting to acquire leader lease longhorn-system/external-attacher-leader-driver-longhorn-io...
I0424 04:00:26.294174       1 leaderelection.go:260] successfully acquired lease longhorn-system/external-attacher-leader-driver-longhorn-io
I0424 04:00:26.294288       1 leader_election.go:178] became leader, starting
I0424 04:00:26.294346       1 controller.go:130] Starting CSI attacher
I0424 04:00:26.395929       1 csi_handler.go:251] Attaching "csi-77e48635a1cf3972fd685244caa18e5d4527d4cac42753450fd4fa219cab449a"
I0424 04:00:36.566452       1 csi_handler.go:264] Attached "csi-77e48635a1cf3972fd685244caa18e5d4527d4cac42753450fd4fa219cab449a"

After that, I can stop and start the pod with no restarts. But after a while, when no mounts have happened, the connection gets lost again.

To Reproduce

Mount a PVC after no mounts have happened for a longer time. I don't know exactly how long it takes to lose the connection; for me it happens every morning when I start my development pods.

Expected behavior

The PVC is mounted in the pod without any restarts of a csi-attacher pod.

Support bundle for troubleshooting

supportbundle_4bdc31e4-aaf6-4450-add0-84e8f7054857_2024-04-24T05-23-42Z.zip

Environment

  • Longhorn version: v1.6.1
  • Impacted volume (PV): pvc-017776f4-da0e-47db-b863-764f2f228e64
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): Helm
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: RKE2
    • Number of control plane nodes in the cluster: 3
    • Number of worker nodes in the cluster: 3
  • Node config
    • OS type and version: Ubuntu 22.04
    • Kernel version: 5.15.0-100-generic
    • CPU per node: 8
    • Memory per node: 96GB
    • Disk type (e.g. SSD/NVMe/HDD): SSD
    • Network bandwidth between the nodes (Gbps): 10
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): Baremetal
  • Number of Longhorn volumes in the cluster: 9

Additional context

klauserber added the kind/bug, require/backport, and require/qa-review-coverage labels Apr 24, 2024
derekbit (Member) commented

cc @ejweber @james-munson

ejweber (Contributor) commented Apr 29, 2024

Hello @klauserber. As far as I know, you're the first Longhorn user to report this, but it appears to be a known upstream issue that affects other CSI drivers as well.

The AWS EBS CSI driver team did a good analysis over at kubernetes-sigs/aws-ebs-csi-driver#1875. Their analysis directly mentions the version of the external-attacher used in Longhorn v1.6.1 (v4.4.2).

It was fixed in the utility library csi-lib-utils in v0.16.0. Unfortunately, that fix wasn't picked up in external-attacher until v4.5.0.

Looking through the release notes, I think it should be safe for us to bump to this version of external-attacher in Longhorn v1.6.2 to help users avoid the bug, but we will have to do some testing.

If the issue is particularly troublesome for you, you can override the version of external-attacher Longhorn uses with any of the Longhorn deployment mechanisms. Changing it from v4.4.2 to v4.5.1 looks quite safe to me, but any untested change comes with at least some risk.
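
With the Helm chart, for example, an override could look like this (a sketch: the image.csi.attacher.* values keys and the longhornio/csi-attacher:v4.5.1 tag are assumptions based on the chart's values.yaml, so verify both against your chart version):

# Override only the attacher image; --reuse-values keeps the rest of the release unchanged.
helm upgrade longhorn longhorn/longhorn \
  --namespace longhorn-system \
  --reuse-values \
  --set image.csi.attacher.repository=longhornio/csi-attacher \
  --set image.csi.attacher.tag=v4.5.1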

ejweber self-assigned this Apr 29, 2024
ejweber added this to the v1.7.0 milestone Apr 29, 2024
ejweber added the area/upstream and area/csi labels and removed the require/qa-review-coverage label Apr 29, 2024
klauserber (Author) commented

Hello @ejweber, thank you for the information.

Since I see no impact on my workloads, it's fine for me to wait.


J1a-wei commented May 1, 2024

Same issue here:

Longhorn version: v1.6.0
Installation method (e.g. Rancher Catalog App/Helm/Kubectl): Helm
Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: RKE2 v1.26.0
Number of control plane nodes in the cluster: 3
Number of worker nodes in the cluster: 5
Node config
OS type and version: Ubuntu 22.04
Kernel version: 5.15.0-101-generic
CPU per node: 24 cores / 48 threads
Memory per node: 384GB
Disk type (e.g. SSD/NVMe/HDD): NVMe
Network bandwidth between the nodes (Gbps): 10
Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): Baremetal
Number of Longhorn volumes in the cluster: 80 volumes totaling 100 TB

ejweber (Contributor) commented May 1, 2024

Thanks for the confirmation @J1a-wei. I checked v1.6.0 and it also uses the affected version of external-attacher.

ejweber (Contributor) commented May 1, 2024

This was a bit confusing, so I'm recording it here for my own reference:

kubernetes-csi/csi-lib-utils#153 bumps csi-lib-utils to grpc-go v1.59.0 and simultaneously fixes the connection-dropping issue. Any sidecar using csi-lib-utils v0.16.0+ should be fine.

However, to see when sidecars started to be broken (for backporting), we cannot look at the csi-lib-utils version by itself. Instead, we should look at the grpc-go version used by the sidecar. If the grpc-go version is v1.59.0+ (it has likely been updated automatically, independently of csi-lib-utils) but csi-lib-utils is not v0.16.0+, the sidecar is affected.
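
One way to check is to read the module information Go embeds in its binaries (a sketch: the image tag and the /csi-attacher binary path inside the image are assumptions and may differ per sidecar):

# Copy the sidecar binary out of its image without running it.
docker create --name tmp-attacher registry.k8s.io/sig-storage/csi-attacher:v4.4.2
docker cp tmp-attacher:/csi-attacher ./csi-attacher
docker rm tmp-attacher
# "go version -m" prints the module versions compiled into a Go binary.
go version -m ./csi-attacher | grep -E 'google.golang.org/grpc|csi-lib-utils'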

ejweber (Contributor) commented May 1, 2024

Sidecar versions we use that are affected (a workaround sketch follows the list):

  • external-attacher v4.4.2 in master, v1.6.x, v1.5.x -> upgrade to v4.5.1
  • external-provisioner v3.6.2 in master -> upgrade to v4.0.1
  • external-provisioner v3.6.2 in v1.6.x, v1.5.x
    - There is no minor version we can upgrade to per policy.
    - Users can downgrade to v3.6.1 or upgrade to v4.0.1 as potential workarounds.
  • external-resizer v1.9.2 in master, v1.6.x, v1.5.x -> upgrade to v1.10.1
  • external-snapshotter v6.3.2 in master -> upgrade to v7.0.2
  • external-snapshotter v6.3.2 in v1.6.x, v1.5.x
    - There is no minor version we can upgrade to per policy.
    - Users can downgrade to v6.3.1 or upgrade to v7.0.2 as potential workarounds.
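
As a stopgap, the versions above could be applied directly to the sidecar Deployments, for example (a sketch: the Deployment and container names match a typical Longhorn install, and the longhornio mirror tags are assumed to exist; verify both in your cluster):

kubectl -n longhorn-system set image deployment/csi-attacher \
  csi-attacher=longhornio/csi-attacher:v4.5.1
kubectl -n longhorn-system set image deployment/csi-resizer \
  csi-resizer=longhornio/csi-resizer:v1.10.1

Keep in mind that longhorn-manager manages these Deployments, so manual edits may be reverted; overriding through your deployment mechanism (Helm values, etc.) is more durable.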

ejweber (Contributor) commented May 1, 2024

Reproduce:

  1. Install Longhorn master-head.
  2. Wait more than thirty minutes.
  3. Check how many times CSI sidecars have restarted. In my case, there were previous restarts.
eweber@laptop:~/longhorn> kl get pod | grep csi
csi-attacher-57c5fd5bdf-sptgq                       1/1     Running   2 (55m ago)   2d
csi-attacher-57c5fd5bdf-vsdx2                       1/1     Running   0             2d
csi-attacher-57c5fd5bdf-z8lxx                       1/1     Running   0             2d
csi-provisioner-7b95bf4b87-vw8m8                    1/1     Running   0             2d
csi-provisioner-7b95bf4b87-w4cfl                    1/1     Running   1 (54m ago)   2d
csi-provisioner-7b95bf4b87-wfn4z                    1/1     Running   2 (55m ago)   2d
csi-resizer-6df9886858-2zgl9                        1/1     Running   0             2d
csi-resizer-6df9886858-4zz2v                        1/1     Running   0             2d
csi-resizer-6df9886858-w6w9b                        1/1     Running   0             2d
csi-snapshotter-5d84585dd4-5s6xh                    1/1     Running   0             2d
csi-snapshotter-5d84585dd4-nvbm6                    1/1     Running   0             2d
csi-snapshotter-5d84585dd4-wqwp6                    1/1     Running   0             2d
  4. Tail the logs of csi-attacher and csi-provisioner (one way to tail them is sketched after these steps).
  5. In a different window, apply examples/pod_with_pvc.yaml.
  6. Notice that the tailed logs show both csi-attacher and csi-provisioner restarting.
[csi-provisioner-7b95bf4b87-wfn4z] I0501 21:10:44.439910       1 controller.go:1366] provision "default/longhorn-volv-pvc" class "longhorn": started
[csi-provisioner-7b95bf4b87-wfn4z] W0501 21:10:44.440229       1 controller.go:620] "fstype" is deprecated and will be removed in a future release, please use "csi.storage.k8s.io/fstype" instead
[csi-provisioner-7b95bf4b87-wfn4z] I0501 21:10:44.443383       1 event.go:298] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"default", Name:"longhorn-volv-pvc", UID:"bb949859-07a1-43bf-bf8d-7fdd5f9c52f9", APIVersion:"v1", ResourceVersion:"89704388", FieldPath:""}): type: 'Normal' reason: 'Provisioning' External provisioner is provisioning volume for claim "default/longhorn-volv-pvc"
[csi-provisioner-7b95bf4b87-wfn4z] E0501 21:10:44.444240       1 connection.go:142] Lost connection to unix:///csi/csi.sock.
[csi-provisioner-7b95bf4b87-wfn4z] F0501 21:10:44.445549       1 connection.go:97] Lost connection to CSI driver, exiting
[csi-attacher-57c5fd5bdf-sptgq] I0501 21:11:10.190372       1 csi_handler.go:251] Attaching "csi-d12886f1b82b79892bf5c94e42001e0ce4680feed019591654ee302e72eeda93"
[csi-attacher-57c5fd5bdf-sptgq] E0501 21:11:10.331010       1 connection.go:142] Lost connection to unix:///csi/csi.sock.
[csi-attacher-57c5fd5bdf-sptgq] F0501 21:11:10.331268       1 connection.go:97] Lost connection to CSI driver, exiting
  7. Recheck how many times CSI sidecars have restarted. Both affected pods have restarted another time.
eweber@laptop:~/longhorn> kl get pod | grep csi
csi-attacher-57c5fd5bdf-sptgq                       1/1     Running   3 (3m36s ago)   2d
csi-attacher-57c5fd5bdf-vsdx2                       1/1     Running   0               2d
csi-attacher-57c5fd5bdf-z8lxx                       1/1     Running   0               2d
csi-provisioner-7b95bf4b87-vw8m8                    1/1     Running   0               2d
csi-provisioner-7b95bf4b87-w4cfl                    1/1     Running   1 (59m ago)     2d
csi-provisioner-7b95bf4b87-wfn4z                    1/1     Running   3 (4m2s ago)    2d
csi-resizer-6df9886858-2zgl9                        1/1     Running   0               2d
csi-resizer-6df9886858-4zz2v                        1/1     Running   0               2d
csi-resizer-6df9886858-w6w9b                        1/1     Running   0               2d
csi-snapshotter-5d84585dd4-5s6xh                    1/1     Running   0               2d
csi-snapshotter-5d84585dd4-nvbm6                    1/1     Running   0               2d
csi-snapshotter-5d84585dd4-wqwp6                    1/1     Running   0               2d
longhorn-csi-plugin-b6b2c                           3/3     Running   0               2d
longhorn-csi-plugin-kwlnj                           3/3     Running   0               2d
longhorn-csi-plugin-pgfgl                           3/3     Running   0               2d
  8. Delete and reapply examples/pod_with_pvc.yaml before another thirty minutes have elapsed.
  9. Notice that this does not cause any additional restarts.
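
For step 4, one way to tail both sidecars at once (a sketch: the app=csi-attacher and app=csi-provisioner pod labels are assumptions; confirm with kubectl get pod --show-labels):

# --prefix adds the pod name to each line so restarts are easy to attribute.
kubectl -n longhorn-system logs -f -l app=csi-attacher --prefix &
kubectl -n longhorn-system logs -f -l app=csi-provisioner --prefix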


longhorn-io-github-bot commented May 3, 2024

Pre Ready-For-Testing Checklist

roger-ryao commented

Verified on master-head 20240509

Test steps:
#8427 (comment)

Result: Passed

  1. I was also able to reproduce this issue on v1.6.1, and it was fixed on master-head.
