Trident-csi crashloops on invalid snapshot references #490

Closed
uberspot opened this issue Dec 7, 2020 · 9 comments


uberspot commented Dec 7, 2020

Describe the bug

The trident-csi deployment was crashlooping because of:

I1207 08:20:13.609384 1 connection.go:186] GRPC error: rpc error: code = FailedPrecondition desc = Trident initialization failed; error attempting to clean up snapshot snapshot-SOMEUUID from backend ontapnas.... : error reading volume size: flexvol trident_pvc_SOMEOTHERUUID not found

For some reason trident-csi was following leftover references to a snapshot on a PVC backend that didn't support snapshots.
The snapshot itself didn't exist for that PVC anymore (it might have been mistakenly created in Kubernetes in the past [>1 month ago]), and I had even deleted the original PVC. The real problem was that deleting the VolumeSnapshot object for that snapshot didn't seem to delete all the other references to it.

The backend was "ontap-nas-economy" (https://netapp-trident.readthedocs.io/en/latest/kubernetes/operations/tasks/backends/ontap/drivers.html, the qtree one).

It looked like CSI was trying to locate a snapshot for a PV (qtree) provisioned on an 'economy' backend, but it was actually checking for the PVC volume in the regular ontap-nas backend, which is also set as our default. I suspect this happened because the default storage class was the economy one when the snapshot was created. The default was later changed to ontap-nas, which does support snapshots, but the references were probably broken/not properly cleaned up at that point (?).

Those other references include:

  • a VolumeSnapshotContent which referenced the same snapshot/pvc.
  • a TridentVolume object referencing that deleted PVC
  • a TridentTransaction that also contained a reference to that snapshot.

Deleting the past references fixes the issue.
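
Roughly, those leftover objects can be located and removed with commands like the following (a sketch, assuming Trident is installed in the trident namespace; the <stale-...> names are placeholders for whatever objects still reference the deleted snapshot/PVC):

# VolumeSnapshotContent objects are cluster-scoped; Trident's own CRs live in the trident namespace
kubectl get volumesnapshotcontents
kubectl get tridentvolumes -n trident
kubectl get tridenttransactions -n trident

kubectl delete volumesnapshotcontent <stale-snapshotcontent-name>
kubectl delete tridentvolume <stale-volume-name> -n trident
kubectl delete tridenttransaction <stale-transaction-name> -n trident

(The Trident objects may need their trident.netapp.io finalizers removed before the delete completes; see the workaround further down.)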

Initial state:

Some time ago, a VolumeSnapshot resource was created for a backend that didn't support it. Deleting that VolumeSnapshot didn't seem to delete the references to it in other Trident CRD objects.

What triggered the bug:

Restarting some nodes restarted the kubelet on one of the nodes that had the VolumeAttachment for the snapshot/PVC.

Environment
OpenShift 4.6.*

  • Trident version: 20.07.1 (also crashed in 20.10.0 after upgrading)
  • Trident installation flags used: --debug
  • Container runtime: cri-o://1.19.0-24
  • Kubernetes version: v1.19.0
  • Kubernetes orchestrator: OpenShift 4.6.*
  • Kubernetes enabled feature gates: [e.g. CSINodeInfo]
  • OS: Red Hat Enterprise Linux CoreOS 46.82
  • NetApp backend types: ONTAP (ontap-nas and ontap-nas-economy)

To Reproduce

  • Set the default storage backend to one that doesn't support snapshots.
  • Create a VolumeSnapshot object for a backend that doesn't support snapshots (a minimal manifest sketch follows this list)
  • Change the default SC to one that supports snapshots
  • Delete the volumesnapshot
  • Restart the node that contained the attachment for that snapshot.
  • After restarting the node, restart the trident-csi pod.
  • Check if all references are cleared.
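
For the VolumeSnapshot step, a minimal manifest sketch looks roughly like this (names are placeholders; the snapshot class must be backed by the csi.trident.netapp.io driver, and on the Kubernetes 1.19 cluster above the snapshot API group was still at v1beta1, while newer clusters use snapshot.storage.k8s.io/v1):

kubectl apply -f - <<EOF
apiVersion: snapshot.storage.k8s.io/v1beta1
kind: VolumeSnapshot
metadata:
  name: test-snapshot                        # placeholder name
  namespace: default
spec:
  volumeSnapshotClassName: trident-snapclass # placeholder; a class backed by csi.trident.netapp.io
  source:
    persistentVolumeClaimName: test-pvc      # placeholder; a PVC on the backend that doesn't support snapshots
EOF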

Expected behavior

No crashlooping on broken TridentTransaction references.
Print a warning and continue, OR clean up the broken reference, OR add a flag to toggle this behaviour on/off.

@uberspot uberspot added the bug label Dec 7, 2020
@gnarl gnarl added the tracked label Dec 7, 2020

promothp commented Dec 9, 2020

The workaround is to delete the TridentTransactions manually and restart the Trident pods; this is still a bug that needs to be addressed.
Steps for the workaround are below.
Check for the entries in the tridenttransactions CRD.

oc get tridenttransaction -n trident

oc get tridenttransaction -n trident -o json

[corona@stablrebco2 ~]$ oc get ttx -n trident
NAME AGE
pvc-63836175-1515-4326-b73f-cae3e0963be7-snapshot-0462e2ea-f167-4846-86a3-10cc6599b4a8 8d

  1. Delete the entries present in the tridenttransactions CRD.
    Before deleting a tridenttransaction entry, you may need to edit the resource and remove the finalizer (trident.netapp.io).

To edit the entry, use kubectl or oc edit (this opens a vi editor; remember to save the change):

oc edit tridenttransaction pvc-63836175-1515-4326-b73f-cae3e0963be7-snapshot-0462e2ea-f167-4846-86a3-10cc6599b4a8 -n trident

 Delete the line under the finalizers containing entry "trident.netapp.io".

  2. Delete the tridenttransaction entry:

oc delete ttx pvc-63836175-1515-4326-b73f-cae3e0963be7-snapshot-0462e2ea-f167-4846-86a3-10cc6599b4a8 -n trident

Confirm the tridenttransaction has been deleted.

oc get ttx -n trident

  3. After this, the Trident pods should come up and be in Running status. If not, delete the main Trident pod (the pod with the longest pod name).
    This should spawn a new Trident pod, which will try to initialize again; it should succeed because there are no stale transactions left to handle during initialization.

oc get pods -n trident

oc delete pod <trident-pod-name> -n trident
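
As an alternative to editing each entry by hand, the finalizer can be cleared with oc patch before deleting. A rough sketch that clears and deletes all tridenttransactions (only do this if you are sure every remaining transaction is stale):

# clear the trident.netapp.io finalizer on each transaction, then delete it
for ttx in $(oc get tridenttransactions -n trident -o name); do
  oc patch "$ttx" -n trident --type=merge -p '{"metadata":{"finalizers":[]}}'
  oc delete "$ttx" -n trident
done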


promothp commented Dec 9, 2020

Will there be a patch release to fix this bug?


gnarl commented Dec 9, 2020

Hi @promothp,

This bug, along with all other bugs, is evaluated based on severity and prioritized accordingly. If possible, a fix for this issue will be included in the Trident v21.01 release at the end of January.


ghost commented Jan 12, 2021

Adding my vote for a patch sooner rather than later!


gnarl commented Jan 30, 2021

This fix is included in the Trident v21.01 release with commit 0ce1aaf.

@gnarl gnarl closed this as completed Jan 30, 2021
@elmazzun

I had this very same problem with Trident 22.10.0.
I was updating MachineConfigs in my OpenShift cluster: once the update was done, every node rebooted; this made Trident restart too and read its corrupted TridentTransactions.
The trident-csi Pod failed its initialization because it tried to delete a VolumeSnapshot that no longer existed but was still referenced in a TridentTransaction.
I fixed it by deleting the TridentTransactions still referring to missing snapshots and restarting the Trident Pods.

@carillonator

Confirmed this is still happening in 23.04.0 as well.


bert-jan commented Jan 15, 2024

This issue is present in 23.10 as well.
I used the workaround to get the trident-controller and CSI pods up and running again.

@uberspot
Author

I'm still getting this issue as well. On rare occasions it blocks PVC provisioning and complains again about:
GRPC error: rpc error: code = Internal desc = unable to process the preexisting transaction for volume pvc-d2bafb8f-...... : error attempting to clean up snapshot snapshot-c02d2f67....... from backend ontapsan_10......IP: error reading volume size: LUN /vol/trident_pvc_d2bafb8f_......./lun0 not found" logLayer=csi_frontend requestID=d3d358f1-f8a4-.................... requestSource=CSI
