
freenas-api-iscsi: Concurrently taken snapshots become stuck #332

Open
pf56 opened this issue Oct 21, 2023 · 4 comments

Comments


pf56 commented Oct 21, 2023

I'm using freenas-api-iscsi in conjunction with VolSync to take regular backups of my data. VolSync takes a snapshot of the PV, mounts it, and copies the data offsite. This generally works, but scheduling multiple backups at the same time results in some of them becoming stuck.

Right now I have five different backups scheduled at 2am. Every morning, three have completed successfully and two are stuck in a pending state. Apparently, the driver couldn't create some of the snapshots and timed out after 90 seconds:

$ kubectl -n sonarr get volumesnapshots -o yaml
apiVersion: v1
items:
- apiVersion: snapshot.storage.k8s.io/v1
  kind: VolumeSnapshot
  metadata:
    creationTimestamp: "2023-10-21T02:00:00Z"
    finalizers:
    - snapshot.storage.kubernetes.io/volumesnapshot-as-source-protection
    - snapshot.storage.kubernetes.io/volumesnapshot-bound-protection
    generation: 1
    labels:
      app.kubernetes.io/created-by: volsync
      volsync.backube/cleanup: 390ed6ce-1aa3-4b2d-8788-f7b1a998c62a
    name: volsync-sonarr-backup-src
    namespace: sonarr
    ownerReferences:
    - apiVersion: volsync.backube/v1alpha1
      blockOwnerDeletion: true
      controller: true
      kind: ReplicationSource
      name: sonarr-backup
      uid: 390ed6ce-1aa3-4b2d-8788-f7b1a998c62a
    resourceVersion: "37473618"
    uid: 196a9cf2-6bc4-4713-a823-38eb6363202e
  spec:
    source:
      persistentVolumeClaimName: data-sonarr-0
    volumeSnapshotClassName: freenas-api-iscsi-csi
  status:
    boundVolumeSnapshotContentName: snapcontent-196a9cf2-6bc4-4713-a823-38eb6363202e
    error:
      message: 'Failed to check and update snapshot content: failed to take snapshot
        of the volume pvc-2f49a909-d88e-417e-a6a7-94a7e778f49c: "rpc error: code =
        Aborted desc = operation locked due to in progress operation(s): [\"create_snapshot_name_snapshot-196a9cf2-6bc4-4713-a823-38eb6363202e\",\"volume_id_pvc-2f49a909-d88e-417e-a6a7-94a7e778f49c\"]"'
      time: "2023-10-21T02:01:30Z"
    readyToUse: false
kind: List
metadata:
  resourceVersion: ""

TrueNAS shows the operation as 'in progress' (screenshot: 20231021_11h34m18s_grim).

The actual snapshot has not been created (screenshot: 20231021_11h35m38s_grim).

Restarting the democratic-csi controller causes it to pick up the failed snapshots and successfully create them (sometimes multiple restarts are necessary because one times out again). TrueNAS keeps displaying the jobs in the UI until I reboot the server or manually restart the middleware.

I can work around the issue by scheduling the backups five minutes apart from each other.
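
For reference, the staggering looks roughly like this (a sketch; the cron expressions and restic sections are illustrative rather than my exact manifests):

---
apiVersion: volsync.backube/v1alpha1
kind: ReplicationSource
metadata:
  name: sonarr-backup
  namespace: sonarr
spec:
  sourcePVC: data-sonarr-0
  trigger:
    schedule: "0 2 * * *"    # first backup at 02:00
  restic:
    repository: sonarr-restic-config    # Secret with repository credentials (name is illustrative)
    copyMethod: Snapshot
    volumeSnapshotClassName: freenas-api-iscsi-csi
---
apiVersion: volsync.backube/v1alpha1
kind: ReplicationSource
metadata:
  name: radarr-backup
  namespace: radarr
spec:
  sourcePVC: data-radarr-0
  trigger:
    schedule: "5 2 * * *"    # staggered five minutes later
  restic:
    repository: radarr-restic-config
    copyMethod: Snapshot
    volumeSnapshotClassName: freenas-api-iscsi-csi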

The Helm release (I'm using Talos):

---
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: democratic-csi-freenas-api-iscsi
spec:
  interval: 10m
  chart:
    spec:
      chart: democratic-csi
      version: 0.14.1
      sourceRef:
        kind: HelmRepository
        name: democratic-csi-charts
        namespace: flux-system
      interval: 15m
  values:
    node:
      hostPID: true
      driver:
        extraEnv:
          - name: ISCSIADM_HOST_STRATEGY
            value: nsenter
          - name: ISCSIADM_HOST_PATH
            value: /usr/local/sbin/iscsiadm
        iscsiDirHostPath: /usr/local/etc/iscsi
        iscsiDirHostPathType: ""
    csiDriver:
      name: org.democratic-csi.freenas-api-iscsi
    storageClasses:
      - name: freenas-api-iscsi-csi
        defaultClass: true
        reclaimPolicy: Delete
        volumeBindingMode: Immediate
        allowVolumeExpansion: true
        parameters:
          fsType: ext4
        mountOptions: []
        secrets:
          node-stage-secret:
            node-db.node.session.auth.authmethod: CHAP
            node-db.node.session.auth.username: ${TRUENAS_DEMOCRATIC_CSI_ISCSI_USERNAME}
            node-db.node.session.auth.password: ${TRUENAS_DEMOCRATIC_CSI_ISCSI_PASSWORD}
            node-db.node.session.auth.username_in: ${TRUENAS_DEMOCRATIC_CSI_ISCSI_USERNAME_IN}
            node-db.node.session.auth.password_in: ${TRUENAS_DEMOCRATIC_CSI_ISCSI_PASSWORD_IN}
    volumeSnapshotClasses:
      - name: freenas-api-iscsi-csi
        parameters:
          detachedSnapshots: true
    driver:
      config:
        driver: freenas-api-iscsi
        httpConnection:
          protocols: https
          host: storage.internal.paulfriedrich.me
          port: 443
          apiKey: ${TRUENAS_DEMOCRATIC_CSI_API_KEY}
          allowInsecure: false
        zfs:
          datasetProperties:
            "org.freenas:description": "{{ parameters.[csi.storage.k8s.io/pvc/namespace] }}/{{ parameters.[csi.storage.k8s.io/pvc/name] }}"
          datasetParentName: local/k8s/volumes
          detachedSnapshotsDatasetParentName: local/k8s/snapshots
          zvolEnableReservation: false
        iscsi:
          targetPortal: "10.0.60.3:3260"
          nameTemplate: "{{ parameters.[csi.storage.k8s.io/pvc/namespace] }}-{{ parameters.[csi.storage.k8s.io/pvc/name] }}"
          namePrefix: "csi-"
          targetGroups:
            - targetGroupAuthGroup: 1
              targetGroupAuthType: CHAP Mutual
              targetGroupInitiatorGroup: 1
              targetGroupPortalGroup: 2
          extentInsecureTpc: true
          extentXenCompat: false
          extentDisablePhysicalBlocksize: true
          extentBlocksize: 512
          extentRpm: SSD
          extentAvailThreshold: 80

The installed snapshot-controller is the democratic-csi one:

---
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: democratic-csi-snapshot-controller
spec:
  interval: 10m
  chart:
    spec:
      chart: snapshot-controller
      version: 0.2.4
      sourceRef:
        kind: HelmRepository
        name: democratic-csi-charts
        namespace: flux-system
      interval: 15m

TrueNAS is running version TrueNAS-SCALE-22.12.3.3.

travisghansen (Member) commented

So there are 2 different failures?

  1. the CSI driver is failing to submit the jobs
  2. TrueNAS is failing to actually execute the jobs

Can you send the controller logs showing the CSI driver failing to submit?


pf56 commented Oct 21, 2023

I don't have the logs from tonight because I restarted the controller, but I ran the backups again at 12:40 (UTC).
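
(For a one-off re-run like this, VolSync's manual trigger can be used. A sketch, assuming a restic-based ReplicationSource; the trigger value is an arbitrary string that VolSync compares against status.lastManualSync:)

apiVersion: volsync.backube/v1alpha1
kind: ReplicationSource
metadata:
  name: radarr-backup
  namespace: radarr
spec:
  sourcePVC: data-radarr-0
  trigger:
    manual: rerun-2023-10-21-1240    # changing this string starts a new sync immediately
  restic:
    repository: radarr-restic-config    # illustrative; same mover config as the scheduled runs
    copyMethod: Snapshot
    volumeSnapshotClassName: freenas-api-iscsi-csi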

This is the one that's stuck this time:

➜ kubectl get --all-namespaces volumesnapshots
NAMESPACE   NAME                        READYTOUSE   SOURCEPVC       SOURCESNAPSHOTCONTENT   RESTORESIZE   SNAPSHOTCLASS           SNAPSHOTCONTENT                                    CREATIONTIME   AGE
radarr      volsync-radarr-backup-src   false        data-radarr-0                                         freenas-api-iscsi-csi   snapcontent-2c4f71e3-b0b3-47f9-bad9-89cce9d3459f                  25m

The logs of the containers:
- csi-driver
- csi-proxy
- external-snapshotter
- external-provisioner
- external-attacher
- external-resizer


qlonik commented Apr 9, 2024

Hey, I have a very similar issue with the freenas-api-nfs driver. I usually have around 25 snapshots taken at the same time, but only 4 succeed and the remaining ones get stuck. The snapshot errors bubble up to the PVCs that use those snapshots, and the PVCs show these messages in their events:

  Type     Reason                Age                  From                                                                                                                                Message
  ----     ------                ----                 ----                                                                                                                                -------
  Warning  ProvisioningFailed    16m                  freenas-api-nfs.cluster.local_democratic-csi-freenas-api-nfs-provisioner-controller-84589j55s_588bc8f6-40c9-43d3-a933-5c1e1e1c522d  failed to provision volume with StorageClass "truenas-nfs-main": rpc error: code = DeadlineExceeded desc = context deadline exceeded
  Warning  ProvisioningFailed    13m (x6 over 15m)    freenas-api-nfs.cluster.local_democratic-csi-freenas-api-nfs-provisioner-controller-84589j55s_588bc8f6-40c9-43d3-a933-5c1e1e1c522d  failed to provision volume with StorageClass "truenas-nfs-main": rpc error: code = Aborted desc = operation locked due to in progress operation(s): ["create_volume_name_pvc-345a1b4d-b0bd-4441-bd04-3f17cfd531cd"]
  Normal   ExternalProvisioning  3m5s (x62 over 18m)  persistentvolume-controller                                                                                                         Waiting for a volume to be created either by the external provisioner 'freenas-api-nfs.cluster.local' or manually by the system administrator. If volume creation is delayed, please verify that the provisioner is running and correctly registered.
  Normal   Provisioning          53s (x12 over 18m)   freenas-api-nfs.cluster.local_democratic-csi-freenas-api-nfs-provisioner-controller-84589j55s_588bc8f6-40c9-43d3-a933-5c1e1e1c522d  External provisioner is provisioning volume for claim "default/volsync-jellyseerr-src"
  Warning  ProvisioningFailed    53s (x5 over 12m)    freenas-api-nfs.cluster.local_democratic-csi-freenas-api-nfs-provisioner-controller-84589j55s_588bc8f6-40c9-43d3-a933-5c1e1e1c522d  failed to provision volume with StorageClass "truenas-nfs-main": rpc error: code = AlreadyExists desc = volume has already been created with a different size, existing size: 1073741824, required_bytes: 9214464, limit_bytes: 0

From these messages, the driver seems to attempt the creation, gets locked out, and when it comes around the second time, the volume already exists but with a different size.

While snapshots are being taken, my TrueNAS server's CPU is usually at 100%. I found that if I delete the PVCs that depend on these snapshots 4 at a time, the PVCs proceed and my VolSync snapshots finish. However, with this approach I end up with dangling snapshots and NFS export entries that never get cleaned up.

I also found that the TrueNAS settings have a limit on concurrent replication tasks (I wonder if it's hitting that limit).

Would it be possible to add a bounded queue of provisioning requests to the driver to limit snapshot concurrency? It would be nice if the size of the queue were configurable too. That way, new requests wouldn't be sent to the TrueNAS server before previous requests clear out of the queue.
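
In the meantime, a possible stopgap might be to lower the worker counts on the CSI sidecars so fewer RPCs are in flight at once. A sketch of Helm values, assuming the democratic-csi chart passes extraArgs through to the sidecars; the --worker-threads flags belong to the upstream external-provisioner and csi-snapshotter images:

values:
  controller:
    externalProvisioner:
      extraArgs:
        - --worker-threads=2    # caps concurrent provisioning operations
    externalSnapshotter:
      extraArgs:
        - --worker-threads=2    # caps concurrent snapshot operations

This wouldn't be a real driver-side queue (sidecar retries can still pile up against the middleware), but it should reduce the thundering herd.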


H3mul commented Apr 24, 2024

Hello, I also seem to have this issue. I have 12 VolSync ReplicationSources that trigger backups at the same time, and their source PVCs get stuck in Pending, similar to the reports above.

I noticed that after limiting Replication Tasks to 1 in the TrueNAS Advanced settings, most of the PVCs eventually succeed (within 20 minutes), except exactly 2 (not always the same 2) that remain stuck in Pending indefinitely (I assume; I waited a few hours). At this point I can also see 2 active jobs (replication.run_onetime) in the TrueNAS dashboard.

If I then delete the 2 stuck PVCs, VolSync schedules new ones and those get provisioned. The 2 active jobs remain until a TrueNAS reboot, however.

I can confirm this behavior on both 23.10 and, after upgrading this morning, on 24.04.0.
