
All Persistent Volumes fail permanently after NAS reboot #232

Open
cbc02009 opened this issue Aug 31, 2022 · 16 comments

Comments

@cbc02009

Whenever I reboot the OS on the NAS that hosts my iSCSI democratic-csi volumes, all containers that rely on those volumes fail consistently, even after the NAS comes back online, with the following error:

  Warning  FailedMount  37s               kubelet            MountVolume.MountDevice failed for volume "pvc-da280e70-9bcb-41ba-bbbd-cbf973580c6e" : rpc error: code = DeadlineExceeded desc = context deadline exceeded
  Warning  FailedMount  34s               kubelet            Unable to attach or mount volumes: unmounted volumes=[config], unattached volumes=[config media transcode kube-api-access-2c2w7 backup]: timed out waiting for the condition
  Warning  FailedMount  5s (x6 over 37s)  kubelet            MountVolume.MountDevice failed for volume "pvc-da280e70-9bcb-41ba-bbbd-cbf973580c6e" : rpc error: code = Aborted desc = operation locked due to in progress operation(s): ["volume_id_pvc-da280e70-9bcb-41ba-bbbd-cbf973580c6e"]

I have tried suspending all pods with kubectl scale -n media deploy/plex --replicas 0 to ensure that nothing is using the volume during the reboot.

Unfortunately, I know almost nothing about iSCSI, so it's entirely possible this is 100% my fault. What is the proper process with iSCSI for rebooting either the NAS or the nodes using PVs on the NAS, so as to prevent this lockup? Is there an iscsiadm command I can use to remove this deadlock and let the new container access the PV?

my democratic-csi config is:

---
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: csi-iscsi
  namespace: storage
spec:
  interval: 5m
  chart:
    spec:
      chart: democratic-csi
      version: 0.13.4
      sourceRef:
        kind: HelmRepository
        name: democratic-csi-charts
        namespace: flux-system
      interval: 5m
  values:
    csiDriver:
      name: "org.democratic-csi.iscsi"

    storageClasses:
    - name: tank-iscsi-csi
      defaultClass: true
      reclaimPolicy: Delete
      ## For testing
      # reclaimPolicy: Retain
      volumeBindingMode: Immediate
      allowVolumeExpansion: true
      parameters:
        fsType: ext4

    driver:
      image: docker.io/democraticcsi/democratic-csi:v1.7.6
      imagePullPolicy: IfNotPresent
      config:
        driver: zfs-generic-iscsi
      existingConfigSecret: zfs-generic-iscsi-config

and the driver config is:

apiVersion: v1
kind: Secret
metadata:
    name: zfs-generic-iscsi-config
    namespace: storage
stringData:
    driver-config-file.yaml: |
        driver: zfs-generic-iscsi
        sshConnection:
            host: ${UIHARU_IP}
            port: 22
            username: root
            privateKey: |
                -----BEGIN OPENSSH PRIVATE KEY-----
                ...
                -----END OPENSSH PRIVATE KEY-----
        zfs:
            datasetParentName: sltank/k8s/iscsiv
            detachedSnapshotsDatasetParentName: sltank/k8s/iscsis
        iscsi:
            shareStrategy: "targetCli"
            shareStrategyTargetCli:
                basename: "iqn.2016-04.com.open-iscsi:a6b73d4196"
                tpg:
                    attributes:
                        authentication: 0
                        generate_node_acls: 1
                        cache_dynamic_acls: 1
                        demo_mode_write_protect: 0
            targetPortal: "${UIHARU_IP}"

Not sure what other info is important, but I'd be happy to provide anything else that might help troubleshoot the issue.

@travisghansen
Member

Ah, this is a tricky one, and I'm glad you opened this. There are a couple of issues at play here:

  • democratic-csi ensures that no two (possibly conflicting) operations happen at the same time, and to do so it creates an in-memory lock
  • iSCSI as a protocol will generally not handle this situation well, and in practice it requires all your pods using iSCSI volumes to restart

The first can be remedied by deleting all the democratic-csi pods and just letting them restart. The latter requires you to handle each workload on a case-by-case basis.

Essentially, if the NAS goes down and comes back up, the iSCSI sessions on the node (assuming they recover) go read-only. The only way to remedy that (via k8s) is to restart the pods as appropriate...and even then, in some cases that may not be enough and would require forcing the workload to a new node. I'll do some research on possible ways to go straight to the CLI of the nodes and get them back into a rw state manually, without any other intervention at the k8s layer.
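
For reference, a minimal sketch of that pod-bounce approach (the namespaces and the label selector here are assumptions based on the config above and the chart's usual labels, not something confirmed in this thread):

# clear democratic-csi's in-memory operation locks by recreating its pods
kubectl -n storage delete pod -l app.kubernetes.io/name=democratic-csi

# then bounce the affected workload so it re-attaches the volume cleanly
kubectl -n media rollout restart deployment/plex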

@cbc02009
Author

cbc02009 commented Sep 3, 2022

For the record, deleting all democratic-csi pods and the pod using the PVC did not solve the issue.

Would the NFS version have the same issue? I'm hesitant to use it for something like plex because of the hundreds of thousands of small files, but if it doesn't break on reboot it may be worth it.

@travisghansen
Member

I haven't been able to find an iscsiadm command that will take a device that's become ro and make it rw (maybe it's not needed). I don't recall the exact behavior in this case...does the mount show as ro? If so, perhaps simply remounting the fs as rw will make the existing connections clear up.
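
As a rough illustration of what that would look like on a node (the mount point below is a placeholder, not a path from this cluster):

# list mounts that have flipped to read-only
mount | grep '(ro[,)]'

# attempt an in-place remount of an affected filesystem
mount -o remount,rw /var/lib/kubelet/plugins/kubernetes.io/csi/<driver>/<volume-hash>/globalmount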

There is some generic k8s/csi work being done that would hopefully help correct these situations automatically, but all the pieces haven't come together yet.

To mitigate the issue, you could tweak some settings like these: https://wiki.archlinux.org/title/ISCSI/Boot#Make_the_iSCSI_daemon_resilient_to_network_problems
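
The core tweak from that article is roughly the following iscsid.conf change (the value is illustrative; verify against your distro's defaults before applying):

# /etc/iscsi/iscsid.conf
# keep queueing I/O instead of failing it while the target is unreachable
node.session.timeo.replacement_timeout = 86400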

Regarding NFS: it recovers from reboots of the NAS much better, yes, but file-based storage does have performance implications in certain scenarios versus block-based.

@cbc02009
Author

cbc02009 commented Sep 3, 2022

I haven't been able to find an iscsiadm command that will take a device that's become ro and make it rw (maybe it's not needed). I don't recall the exact behavior in this case...does the mount show as ro? If so, perhaps simply remounting the fs as rw will make the existing connections clear up.

I would definitely try that out and get you the information, but I'm completely clueless about iscsiadm. Could you let me know what command I should run to get the output for you and to remount the volumes?

@travisghansen
Member

Can you send me the output of the mount command from a node with a volume that is currently read only?

@cbc02009
Author

cbc02009 commented Sep 5, 2022

There were hundreds of lines, so I grepped it down to only the democratic-csi ones. Let me know if you need the whole output.

❯ mount | grep csi
/dev/sda on /var/lib/kubelet/plugins/kubernetes.io/csi/org.democratic-csi.iscsi/6a7617911e8723d36cf2ce2d4761552ec9fc45909df51b38650ac81fbe1da466/globalmount type ext4 (rw,relatime,stripe=4)
/dev/sda on /var/lib/kubelet/pods/c65aa595-96f9-4d5c-8b49-b8f31dfab417/volumes/kubernetes.io~csi/pvc-4beb4d11-a72c-4e64-872a-d4964de2dedc/mount type ext4 (rw,relatime,stripe=4)
/dev/sdb on /var/lib/kubelet/plugins/kubernetes.io/csi/org.democratic-csi.iscsi/9c1a1d9f6c298ef3438d4061d2c8e667ae891384e602e95f36afaaa7a5eadd98/globalmount type ext4 (rw,relatime,stripe=4)
/dev/sdb on /var/lib/kubelet/pods/e383b3c7-7f05-4d0e-818c-422736df9a6b/volumes/kubernetes.io~csi/pvc-ec386a17-e734-4d8f-a8d6-d8d87354c0c0/mount type ext4 (rw,relatime,stripe=4)
/dev/sdc on /var/lib/kubelet/plugins/kubernetes.io/csi/org.democratic-csi.iscsi/9097a1f0acddce7985644883763118cd78d31ed9ae97136f99a5d63e952badff/globalmount type ext4 (rw,relatime,stripe=4)
/dev/sdc on /var/lib/kubelet/pods/43c5146a-3ac6-4fd0-95fd-4c9924eae010/volumes/kubernetes.io~csi/pvc-f8832fc7-1cff-4faf-9e46-e6a73a24eae2/mount type ext4 (rw,relatime,stripe=4)
/dev/sdd on /var/lib/kubelet/plugins/kubernetes.io/csi/org.democratic-csi.iscsi/1a30803184ddc3874711e3f48b3e5d328680e443ee128d2320a1702f3cf47a0a/globalmount type ext4 (rw,relatime,stripe=4)
/dev/sdd on /var/lib/kubelet/pods/ab99ad4b-7062-433b-a0e5-a0a69543719c/volumes/kubernetes.io~csi/pvc-91213a25-0e9b-4ff1-8ae1-1afbfd59dfe9/mount type ext4 (rw,relatime,stripe=4)
/dev/sde on /var/lib/kubelet/plugins/kubernetes.io/csi/org.democratic-csi.iscsi/5b7345571bc8bcae92cd9663f382e958ff021773a204777294ab82f0ceb09910/globalmount type ext4 (rw,relatime,stripe=4)
/dev/sde on /var/lib/kubelet/pods/a6f20f40-a274-4ac4-8914-6be9417f9b37/volumes/kubernetes.io~csi/pvc-995f8b55-25e4-4ea7-8f08-28e2513558cf/mount type ext4 (rw,relatime,stripe=4)
/dev/sdf on /var/lib/kubelet/plugins/kubernetes.io/csi/org.democratic-csi.iscsi/3616fabfaeee654e2adbcb545e495673d0254b1eb35479dab243c75b0945de00/globalmount type ext4 (rw,relatime,stripe=4)
/dev/sdf on /var/lib/kubelet/pods/235a485b-15d3-4b53-8171-da0a19b40e82/volumes/kubernetes.io~csi/pvc-126e0a9e-2b55-485c-963e-e7cd3e034012/mount type ext4 (rw,relatime,stripe=4)
/dev/sdg on /var/lib/kubelet/plugins/kubernetes.io/csi/org.democratic-csi.iscsi/3d3eb6033f51ac88ae8fcd05424eeb50c5af2148140218f118871a7e7dc25aa7/globalmount type ext4 (rw,relatime,stripe=4)
/dev/sdg on /var/lib/kubelet/pods/d9af64fc-23df-4f2b-96f5-7381a0170e5e/volumes/kubernetes.io~csi/pvc-31882371-9879-4ff4-80ee-6c911a8d063a/mount type ext4 (rw,relatime,stripe=4)
/dev/sdh on /var/lib/kubelet/plugins/kubernetes.io/csi/org.democratic-csi.iscsi/4be6f7fb94031f3b7178fd31ba56bf4f9c747aa87a11a8a7a5b5f90acd4a6804/globalmount type ext4 (rw,relatime,stripe=4)
/dev/sdh on /var/lib/kubelet/pods/a219c677-af88-4067-9141-31d52f967f8b/volumes/kubernetes.io~csi/pvc-a968d1ac-43f4-417a-a408-6914215fe73b/mount type ext4 (rw,relatime,stripe=4)
/dev/sdi on /var/lib/kubelet/plugins/kubernetes.io/csi/org.democratic-csi.iscsi/8658ea4d3f685c05833b8b0d7348a22c4bb4f2a6d47fec418ea681d2cef16597/globalmount type ext4 (rw,relatime,stripe=4)
/dev/sdi on /var/lib/kubelet/pods/6a55b542-69b4-41b9-81b4-85a3ef5d5eeb/volumes/kubernetes.io~csi/pvc-4e11ae12-65ec-45ac-bbf5-38a5a19d7e09/mount type ext4 (rw,relatime,stripe=4)
/dev/sdj on /var/lib/kubelet/plugins/kubernetes.io/csi/org.democratic-csi.iscsi/5bcf2cf492e91a2da6f903ed7491887ece9f3849a7570628157f9864e14f0cd7/globalmount type ext4 (rw,relatime,stripe=4)
/dev/sdj on /var/lib/kubelet/pods/d62a3627-840a-4dfc-8297-fa4c1305d46f/volumes/kubernetes.io~csi/pvc-ba9d436e-87b3-4166-9b7d-804d523c6635/mount type ext4 (rw,relatime,stripe=4)

@travisghansen
Member

Those mounts are currently non-writable/non-functional?

@cbc02009
Author

cbc02009 commented Sep 5, 2022

Yes, although the new pod got assigned to another host:

Normal   Scheduled    3m43s               default-scheduler  Successfully assigned organizarrs/sonarr-6b58cd8764-ft5mm to uiharu
  Warning  FailedMount  103s                kubelet            MountVolume.MountDevice failed for volume "pvc-c7c23d7e-8fe1-4ca1-8bb8-718c436e2212" : rpc error: code = DeadlineExceeded desc = context deadline exceeded
  Warning  FailedMount  100s                kubelet            Unable to attach or mount volumes: unmounted volumes=[config], unattached volumes=[kube-api-access-k5cg6 backup bittorrent config media]: timed out waiting for the condition
  Warning  FailedMount  39s (x7 over 102s)  kubelet            MountVolume.MountDevice failed for volume "pvc-c7c23d7e-8fe1-4ca1-8bb8-718c436e2212" : rpc error: code = Aborted desc = operation locked due to in progress operation(s): ["volume_id_pvc-c7c23d7e-8fe1-4ca1-8bb8-718c436e2212"]

The volume is still attached to the old host:

❯ mount | grep c7c23d7e-8fe1-4ca1-8bb8-718c436e2212
/dev/sdj on /var/lib/kubelet/pods/3dfa970d-3abe-4766-8279-ee2eaa424448/volumes/kubernetes.io~csi/pvc-c7c23d7e-8fe1-4ca1-8bb8-718c436e2212/mount type ext4 (rw,relatime,stripe=4)

and the CPU on the old host is now going crazy:
[screenshot: CPU usage spiking on the old host]

I haven't tested to see what happens if it re-mounts to the same host (I wasn't paying attention to the host during the rest of the tests...)

Also, this is after making the changes to iscsid.conf from the article that you recommended to me, in case that makes a difference.

@travisghansen
Member

Yeah, that's a dangerous situation (which is why, when iSCSI goes down, the volumes go into ro mode). Two nodes using the same block device simultaneously is not something you want happening. I would use something like kured (https://github.com/weaveworks/kured) or similar to simply trigger all your nodes to cycle so the workloads shift around and everything comes up clean.

@theautomation

Yeah, that's a dangerous situation (which is why, when iSCSI goes down, the volumes go into ro mode). Two nodes using the same block device simultaneously is not something you want happening. I would use something like kured (https://github.com/weaveworks/kured) or similar to simply trigger all your nodes to cycle so the workloads shift around and everything comes up clean.

Any tips on how to tell kured to reboot as soon as an iSCSI mount goes read-only?

@travisghansen
Member

That may not be a great assumption either (there are legitimate cases for read-only iSCSI). I probably wouldn't fully automate that, but if I were to do so, I would use your IaC tool of choice to just touch the reboot-required file on all the nodes when it's clear the storage system was rebooted. For example, I have an Ansible playbook that does only that...but I only run it manually when I know an outage has occurred.

If you really wish to detect read-only iSCSI connections, however, I would probably write up a little script to detect that and put it on a cron/systemd timer right on the nodes.
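
A hypothetical sketch of such a detector (not from this thread), assuming kured's default sentinel file at /var/run/reboot-required:

#!/bin/sh
# run from cron or a systemd timer on each node:
# if any CSI-backed mount has gone read-only, flag the node so kured cycles it
SENTINEL=/var/run/reboot-required
if awk '$2 ~ /kubernetes\.io.csi/ && $4 ~ /(^|,)ro(,|$)/ {found=1} END {exit !found}' /proc/mounts; then
    echo "read-only CSI mount detected; requesting node reboot"
    touch "$SENTINEL"
fi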

@djjudas21

Maybe not the answer you were hoping for, but the best solution I've found is to do a rolling reboot of all my kube nodes whenever I reboot my NAS. It's a pain, but it's also an opportunity for patching etc.

@rouke-broersma

@travisghansen It looks like Longhorn has chosen to support simply deleting the pods managed by a controller when it identifies that a volume is no longer available. Could this be something you would be willing to support in democratic-csi?

See: https://longhorn.io/docs/archives/1.2.0/references/settings/#automatically-delete-workload-pod-when-the-volume-is-detached-unexpectedly

@travisghansen
Member

Interesting, something to consider for sure. I think this could be handled by the health service endpoint. I am hesitant to get into such a thing but think it merits some discussion for sure.

@eaglesemanation

eaglesemanation commented Mar 29, 2024

The Longhorn solution looks promising for my use case, and I would appreciate it getting implemented; unfortunately, I'm too intimidated by the huge JS codebase to try to contribute anything. Instead, I wrote a small HTTP server that deletes all pods mounting PVCs of a given storage class, given a hardcoded bearer token in the Authorization header. I'm using a startup script on the TrueNAS side to trigger it whenever the NAS turns on.

If anyone wants to use it, here is the server itself: https://github.com/eaglesemanation/k8s-csi-restarter
Here is an example configuration for my k8s cluster: https://github.com/eaglesemanation/ops.emnt.dev/tree/main/k8s/apps/storage/k8s-csi-restarter
And on the TrueNAS side it's basically curl --header 'Authorization: Bearer password' http://ingress-or-loadbalancerip/delete

Edit: This does not delete the democratic-csi controller and node pods themselves; I didn't think about that. I will add that functionality soon.

@travisghansen
Member

@eaglesemanation thanks for sharing!
