Trident-csi crashloops on invalid snapshot references #490

Closed
uberspot opened this issue Dec 7, 2020 · 9 comments


uberspot commented Dec 7, 2020

Describe the bug

The trident-csi deployment was crashlooping because of:

I1207 08:20:13.609384 1 connection.go:186] GRPC error: rpc error: code = FailedPrecondition desc = Trident initialization failed; error attempting to clean up snapshot snapshot-SOMEUUID from backend ontapnas.... : error reading volume size: flexvol trident_pvc_SOMEOTHERUUID not found

For some reason trident-csi was following leftover references to a snapshot on a PVC backend that didn't support snapshots.
The snapshot itself didn't exist for that PVC anymore (it might have been mistakenly created in Kubernetes in the past [>1 month ago]), and I had even deleted the original PVC. The real problem was that deleting the VolumeSnapshot object for that snapshot didn't seem to delete all the other references to it.

The backend was "ontap-nas-economy" (https://netapp-trident.readthedocs.io/en/latest/kubernetes/operations/tasks/backends/ontap/drivers.html, the qtree one).

It looked like CSI was trying to locate a snapshot for a PV (qtree) provisioned on an 'economy' backend, but it was actually checking for the PVC volume in the regular ontap-nas backend, which is also set as our default. I suspect this happened because the default storage class was the economy one when the snapshot was created. The default was later changed to ontap-nas, which does support snapshots, but the references were probably broken/not properly cleaned up at that point (?).

Those other references include:

  • a VolumeSnapshotContent which referenced the same snapshot/pvc.
  • a TridentVolume object referencing that deleted PVC
  • a TridentTransaction that also contained a reference to that snapshot.

Deleting the past references fixes the issue.
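
Roughly, those leftover objects can be located and removed with commands like the following (a sketch, assuming Trident is installed in the trident namespace; the <stale-...> names are placeholders for whatever objects still reference the deleted snapshot/PVC):

# VolumeSnapshotContent objects are cluster-scoped; Trident's own CRs live in the trident namespace
kubectl get volumesnapshotcontents
kubectl get tridentvolumes -n trident
kubectl get tridenttransactions -n trident

kubectl delete volumesnapshotcontent <stale-snapshotcontent-name>
kubectl delete tridentvolume <stale-volume-name> -n trident
kubectl delete tridenttransaction <stale-transaction-name> -n trident

(The Trident objects may need their trident.netapp.io finalizers removed before the delete completes; see the workaround further down.)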

Initial state:

Some time ago, a VolumeSnapshot resource was created for a backend that didn't support it. Deleting that VolumeSnapshot didn't seem to delete the references to it in other Trident CRD objects.

What triggered the bug:

Restarting some nodes restarted the kubelet on one of the nodes that had the VolumeAttachment for the snapshot/PVC.

Environment
OpenShift 4.6.*

  • Trident version: 20.07.1 (also crashed in 20.10.0 after upgrading)
  • Trident installation flags used: --debug
  • Container runtime: cri-o://1.19.0-24
  • Kubernetes version: v1.19.0
  • Kubernetes orchestrator: OpenShift 4.6.*
  • Kubernetes enabled feature gates: [e.g. CSINodeInfo]
  • OS: Red Hat Enterprise Linux CoreOS 46.82
  • NetApp backend types: ONTAP (ontap-nas and ontap-nas-economy)

To Reproduce

  • Set the default storage backend to one that doesn't support snapshots.
  • Create a VolumeSnapshot object for a backend that doesn't support snapshots (a minimal manifest sketch follows this list)
  • Change the default SC to one that supports snapshots
  • Delete the volumesnapshot
  • Restart the node that contained the attachment for that snapshot.
  • After restarting the node, restart the trident-csi pod.
  • Check if all references are cleared.
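
For the VolumeSnapshot step, a minimal manifest sketch looks roughly like this (names are placeholders; the snapshot class must be backed by the csi.trident.netapp.io driver, and on the Kubernetes 1.19 cluster above the snapshot API group was still at v1beta1, while newer clusters use snapshot.storage.k8s.io/v1):

kubectl apply -f - <<EOF
apiVersion: snapshot.storage.k8s.io/v1beta1
kind: VolumeSnapshot
metadata:
  name: test-snapshot                        # placeholder name
  namespace: default
spec:
  volumeSnapshotClassName: trident-snapclass # placeholder; a class backed by csi.trident.netapp.io
  source:
    persistentVolumeClaimName: test-pvc      # placeholder; a PVC on the backend that doesn't support snapshots
EOF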

Expected behavior

No crashlooping on broken TridentTransaction references.
Print a warning and continue, OR clean up the broken reference, OR add a flag to toggle this behaviour on/off.

@uberspot uberspot added the bug label Dec 7, 2020
@gnarl gnarl added the tracked label Dec 7, 2020

promothp commented Dec 9, 2020

The workaround is to delete the TridentTransactions manually and restart the Trident pods; this is still a bug that needs to be addressed.
Steps for the workaround are below.
Check for the entries in the tridenttransactions CRD.

oc get tridenttransaction -n trident

oc get tridenttransaction -n trident -o json

[corona@stablrebco2 ~]$ oc get ttx -n trident
NAME AGE
pvc-63836175-1515-4326-b73f-cae3e0963be7-snapshot-0462e2ea-f167-4846-86a3-10cc6599b4a8 8d

  1. Delete the entries present in the tridenttransactions CRD.
    Before deleting a tridenttransaction entry, you may need to edit the resource and remove the finalizer (trident.netapp.io).

To edit the entry, use kubectl or oc edit (this opens a vi editor; remember to save the change):

oc edit tridenttransaction pvc-63836175-1515-4326-b73f-cae3e0963be7-snapshot-0462e2ea-f167-4846-86a3-10cc6599b4a8 -n trident

 Delete the line under the finalizers containing entry "trident.netapp.io".

  2. Delete the tridenttransaction entry:

oc delete ttx pvc-63836175-1515-4326-b73f-cae3e0963be7-snapshot-0462e2ea-f167-4846-86a3-10cc6599b4a8 -n trident

Confirm the tridenttransaction has been deleted.

oc get ttx -n trident

  3. After this, the Trident pods should come up and be in Running status. If not, delete the main Trident pod (the pod with the longest pod name).
    This should spawn a new Trident pod, which will try to initialize again; it should succeed because there are no stale transactions left to handle during initialization.

oc get pods -n trident

oc delete pod <trident-pod-name> -n trident
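
As an alternative to editing each entry by hand, the finalizer can be cleared with oc patch before deleting. A rough sketch that clears and deletes all tridenttransactions (only do this if you are sure every remaining transaction is stale):

# clear the trident.netapp.io finalizer on each transaction, then delete it
for ttx in $(oc get tridenttransactions -n trident -o name); do
  oc patch "$ttx" -n trident --type=merge -p '{"metadata":{"finalizers":[]}}'
  oc delete "$ttx" -n trident
done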


promothp commented Dec 9, 2020

Will there be a patch release to fix this bug?


gnarl commented Dec 9, 2020

Hi @promothp,

This bug, along with all other bugs, is evaluated based on severity and prioritized accordingly. If possible, a fix for this issue will be included in the Trident v21.01 release at the end of January.


ghost commented Jan 12, 2021

Adding my vote for a patch sooner rather than later!


gnarl commented Jan 30, 2021

This fix is included in the Trident v21.01 release with commit 0ce1aaf.

@gnarl gnarl closed this as completed Jan 30, 2021
@elmazzun

I had this very same problem with Trident 22.10.0.
I was updating MachineConfigs in my OpenShift cluster: once the update was done, every node rebooted; this made Trident restart too and read its corrupted TridentTransactions.
The trident-csi Pod failed its initialization because it tried to delete a VolumeSnapshot that no longer existed but was still referenced in a TridentTransaction.
I fixed it by deleting the TridentTransactions still referring to missing snapshots and restarting the Trident Pods.

@carillonator

Confirmed this is still happening in 23.04.0 as well.


bert-jan commented Jan 15, 2024

This issue is present in 23.10 as well.
I used the workaround to get the trident-controller and CSI pods up and running again.

@uberspot
Author

I'm still getting this issue as well. On rare occasions it blocks PVC provisioning and complains again about:
GRPC error: rpc error: code = Internal desc = unable to process the preexisting transaction for volume pvc-d2bafb8f-...... : error attempting to clean up snapshot snapshot-c02d2f67....... from backend ontapsan_10......IP: error reading volume size: LUN /vol/trident_pvc_d2bafb8f_......./lun0 not found" logLayer=csi_frontend requestID=d3d358f1-f8a4-.................... requestSource=CSI
