
Trident fails to delete child FlexClone of a FlexClone #878

Open
rabin-io opened this issue Dec 19, 2023 · 3 comments

@rabin-io

rabin-io commented Dec 19, 2023

Describe the bug
When using Trident as the backend for virtual machines with KubeVirt, if one restores a volume of a VM and later deletes the VM, we are left with a FlexClone without its parent, which requires manual intervention with the CLI to resolve.

Environment

  • Trident version: 23.10.0
  • Trident installation flags used: -d -n trident
  • Container runtime: cri-o/runC
  • Kubernetes version: v1.27.6
  • Kubernetes orchestrator: OpenShift 4.14
  • Kubernetes enabled feature gates:
  • OS:
  • NetApp backend types: AWS FSx
  • Other:

To Reproduce

  1. Start from a clean OCP install with AWS FSx as the backend
  2. Install CNV/kubevirt → this creates 6 volumes for the boot sources of the default templates
  3. Create a VM from a template → this creates a FlexClone from the source volume of the "golden image"
  4. Create a VM snapshot
  5. Restore the snapshot → this creates a new FlexClone from the FlexClone of step 3.
    (At this stage we see that the FlexClone from step 3 was deleted)
  6. Delete the VM → this deletes the VM and the PV/PVC for the 2nd FlexClone.
    What we see in the backend is that the last FlexClone is left in an offline state and can't be deleted without doing a split.

We see this behavior as part of our testing of OCP on AWS with FSx, and it later blocks deprovisioning of the FSx storage, since the leftover volumes can't be deleted from AWS. A sketch of the manual cleanup we have to do is below.
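For reference, the manual cleanup boils down to splitting the stranded clone from its parent and then removing it. A rough sketch with the ONTAP CLI, assuming a hypothetical SVM name svm1 and clone volume name trident_pvc_xxxx (the real names will differ, and the exact sequence, e.g. whether the clone first has to be brought back online, may vary per setup):

::> volume online -vserver svm1 -volume trident_pvc_xxxx                 (the stranded clone is left offline)
::> volume clone split start -vserver svm1 -flexclone trident_pvc_xxxx
::> volume clone split show -vserver svm1 -flexclone trident_pvc_xxxx    (wait until the split completes)
::> volume offline -vserver svm1 -volume trident_pvc_xxxx
::> volume delete -vserver svm1 -volume trident_pvc_xxxx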

Expected behavior
All resources should be deleted when the VM is deleted.

Additional context

  • Google Doc with pictures - here
@rabin-io rabin-io added the bug label Dec 19, 2023
@akalenyu

The issue seems to be reducible to the following sequence of actions (strictly on k8s entities):

$ oc create -f pvc.yaml 
persistentvolumeclaim/simple-pvc created
$ oc get pvc
NAME         STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS      AGE
simple-pvc   Bound    pvc-acefa31e-61f4-4bef-9e82-daf30a4d85c0   1Gi        RWX            trident-csi-fsx   3s
$ oc create -f snap.yaml 
volumesnapshot.snapshot.storage.k8s.io/snapshot created
$ oc get volumesnapshot
NAME       READYTOUSE   SOURCEPVC    SOURCESNAPSHOTCONTENT   RESTORESIZE   SNAPSHOTCLASS   SNAPSHOTCONTENT                                    CREATIONTIME   AGE
snapshot   true         simple-pvc                           296Ki         csi-snapclass   snapcontent-7d00e584-5ce6-40f2-b1f0-40f254845e3d   3s             3s
$ oc delete pvc simple-pvc 
persistentvolumeclaim "simple-pvc" deleted
$ oc create -f restore.yaml 
persistentvolumeclaim/restore-pvc-1 created
$ oc get pvc
NAME            STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS      AGE
restore-pvc-1   Bound    pvc-4d92a2ea-02a7-404d-9f9d-054c7dd8361b   1Gi        RWX            trident-csi-fsx   2s
$ oc delete pvc restore-pvc-1 
persistentvolumeclaim "restore-pvc-1" deleted
$ oc delete volumesnapshot snapshot 
volumesnapshot.snapshot.storage.k8s.io "snapshot" deleted
# Doesn't converge

Where the manifests are simply

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: simple-pvc
spec:
  storageClassName: trident-csi-fsx
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 1Gi

apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: snapshot
spec:
  volumeSnapshotClassName: csi-snapclass
  source:
    persistentVolumeClaimName: simple-pvc

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: restore-pvc-1
spec:
  dataSource:
    name: snapshot
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  storageClassName: trident-csi-fsx
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 1Gi
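To see what "doesn't converge" looks like, it helps to check which snapshot objects are left behind after the last delete; presumably the VolumeSnapshotContent sticks around because the delete on the backend keeps failing. Something like the following (plain oc plus tridentctl, nothing else assumed) should show it:

$ oc get volumesnapshot,volumesnapshotcontent
$ oc describe volumesnapshotcontent snapcontent-7d00e584-5ce6-40f2-b1f0-40f254845e3d
$ tridentctl -n trident logs | grep -i snapshot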

@uppuluri123

Following this as an issue to investigate.

@akalenyu

akalenyu commented Apr 9, 2024

If anyone is interested in the reproducer in kubevirt terms (I expected the reduced reproducer to be of more interest here), the steps and manifests are below; a rough oc sequence is sketched after the manifests:

  • Create VM
  • Create VMSnapshot of the VM
  • Restore to the same VM from said VMSnapshot
  • Delete VM
  • Delete VMSnapshot (underlying CSI snapshot hangs)
$ cat dv.yaml 
apiVersion: cdi.kubevirt.io/v1beta1
kind: DataVolume
metadata:
  name: simple-dv
spec:
  source:
      registry:
        pullMethod: node
        url: docker://quay.io/kubevirt/fedora-with-test-tooling-container-disk:v0.53.2
  pvc:
    accessModes:
    - ReadWriteOnce
    resources:
      requests:
        storage: 8Gi
$ cat vm.yaml 
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: simple-vm
  namespace: default
spec:
  running: true
  template:
    metadata:
      labels: {kubevirt.io/domain: simple-vm,
        kubevirt.io/vm: simple-vm}
    spec:
      domain:
        devices:
          disks:
          - disk: {bus: virtio}
            name: dv-disk
          - disk: {bus: virtio}
            name: cloudinitdisk
        resources:
          requests: {memory: 2048M}
      volumes:
      - dataVolume: {name: simple-dv}
        name: dv-disk
      - cloudInitNoCloud:
          userData: |
            #cloud-config
            password: fedora
            chpasswd: { expire: False }
        name: cloudinitdisk
$ cat vmsnap.yaml 
apiVersion: snapshot.kubevirt.io/v1alpha1
kind: VirtualMachineSnapshot
metadata:
  name: snap-larry
spec:
  source:
    apiGroup: kubevirt.io
    kind: VirtualMachine
    name: simple-vm
$ cat vmrestore.yaml 
apiVersion: snapshot.kubevirt.io/v1alpha1
kind: VirtualMachineRestore
metadata:
  name: restore-larry
spec:
  target:
    apiGroup: kubevirt.io
    kind: VirtualMachine
    name: simple-vm
  virtualMachineSnapshotName: snap-larry
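
The steps above map onto these manifests roughly as follows (plain oc; waiting for each object to become ready between steps is implied):

$ oc create -f dv.yaml
$ oc create -f vm.yaml                          # wait for the DataVolume import to finish and the VM to start
$ oc create -f vmsnap.yaml                      # wait until snap-larry reports readyToUse: true
$ oc create -f vmrestore.yaml                   # restore to the same VM
$ oc delete vm simple-vm
$ oc delete virtualmachinesnapshot snap-larry   # this is where the underlying CSI snapshot hangs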
