
snapshot-controller logs report failure frequently #748

Open
ambiknai opened this issue Aug 12, 2022 · 27 comments
Labels: help wanted (denotes an issue that needs help from a contributor; must meet "help wanted" guidelines)

Comments

@ambiknai

What happened:

In our CSI driver, the snapshot-controller logs show the error below. There is no functional impact, but the errors appear very frequently.

snapshot controller failed to update ...xxxxx
the object has been modified; please apply your changes to the latest version and try again
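
This message is the API server's optimistic-concurrency conflict: a writer sent back an object whose resourceVersion was already stale. Below is a minimal sketch of the conventional client-go get-mutate-retry pattern that absorbs such conflicts; the helper name and wiring are hypothetical, not the controller's actual code.

```go
// Illustrative sketch only (not the snapshot-controller's code): the
// conventional client-go pattern for absorbing "the object has been
// modified" conflicts.
package example

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/util/retry"

	snapshotv1 "github.com/kubernetes-csi/external-snapshotter/client/v6/apis/volumesnapshot/v1"
	clientset "github.com/kubernetes-csi/external-snapshotter/client/v6/clientset/versioned"
)

// updateSnapshotWithRetry re-reads the latest object and reapplies the
// mutation on every conflict, so a stale resourceVersion never surfaces
// as an error to the caller. RetryOnConflict retries only when the
// returned error is a conflict.
func updateSnapshotWithRetry(ctx context.Context, cs clientset.Interface, ns, name string,
	mutate func(*snapshotv1.VolumeSnapshot)) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		snap, err := cs.SnapshotV1().VolumeSnapshots(ns).Get(ctx, name, metav1.GetOptions{})
		if err != nil {
			return err
		}
		mutate(snap)
		_, err = cs.SnapshotV1().VolumeSnapshots(ns).Update(ctx, snap, metav1.UpdateOptions{})
		return err
	})
}
```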

What you expected to happen:

How to reproduce it:

Anything else we need to know?:

Environment: tested on IKS 1.22, 1.23, and 1.24

@xing-yang
Collaborator

@ggriffiths Can you take a look? These come from the "update" calls that have not yet been replaced with "patch" in the snapshot-controller.

@ggriffiths
Member

ggriffiths commented Aug 24, 2022

Yes, there are many spots where we still use "update" instead of "patch":

  • Update:
pkg/common-controller/snapshot_controller.go:1383:	newSnapshot, err := ctrl.clientset.SnapshotV1().VolumeSnapshots(snapshotClone.Namespace).Update(context.TODO(), snapshotClone, metav1.UpdateOptions{})
pkg/common-controller/snapshot_controller.go:1454:		updatedSnapshot, err = ctrl.clientset.SnapshotV1().VolumeSnapshots(snapshotClone.Namespace).Update(context.TODO(), snapshotClone, metav1.UpdateOptions{})
pkg/common-controller/snapshot_controller.go:1523:	newSnapshot, err := ctrl.clientset.SnapshotV1().VolumeSnapshots(snapshotClone.Namespace).Update(context.TODO(), snapshotClone, metav1.UpdateOptions{})
pkg/common-controller/snapshot_controller.go:1618:	updatedContent, err := ctrl.clientset.SnapshotV1().VolumeSnapshotContents().Update(context.TODO(), contentClone, metav1.UpdateOptions{})
pkg/common-controller/snapshot_controller.go:1660:	updatedSnapshot, err := ctrl.clientset.SnapshotV1().VolumeSnapshots(snapshot.Namespace).Update(context.TODO(), snapshotClone, metav1.UpdateOptions{})
pkg/sidecar-controller/snapshot_controller.go:547:	updatedContent, err := ctrl.clientset.SnapshotV1().VolumeSnapshotContents().Update(context.TODO(), contentClone, metav1.UpdateOptions{})
pkg/sidecar-controller/snapshot_controller.go:642:	updatedContent, err := ctrl.clientset.SnapshotV1().VolumeSnapshotContents().Update(context.TODO(), contentClone, metav1.UpdateOptions{})
  • UpdateStatus:
pkg/common-controller/snapshot_controller.go:813:	newSnapshot, err := ctrl.clientset.SnapshotV1().VolumeSnapshots(snapshotClone.Namespace).UpdateStatus(context.TODO(), snapshotClone, metav1.UpdateOptions{})
pkg/common-controller/snapshot_controller.go:1218:		newSnapshotObj, err := ctrl.clientset.SnapshotV1().VolumeSnapshots(snapshotClone.Namespace).UpdateStatus(context.TODO(), snapshotClone, metav1.UpdateOptions{})
pkg/sidecar-controller/snapshot_controller.go:403:	newContent, err := ctrl.clientset.SnapshotV1().VolumeSnapshotContents().UpdateStatus(context.TODO(), content, metav1.UpdateOptions{})
pkg/sidecar-controller/snapshot_controller.go:459:		newContent, err := ctrl.clientset.SnapshotV1().VolumeSnapshotContents().UpdateStatus(context.TODO(), contentClone, metav1.UpdateOptions{})

This error will still be hit in these scenarios. We reduced the major scenarios in #526, but there is more work to be done. I'm happy to review a PR for this work.
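
For anyone picking this up, the conversion follows the same shape as #526: instead of writing back a whole cloned object with Update, send a patch that touches only the fields being changed, since a patch carries no resourceVersion to conflict on. A rough sketch, assuming a JSON-patch helper like the one in pkg/utils; the type and function names are invented, and the annotation key is my reading of the AnnVolumeSnapshotBeingCreated constant:

```go
// Rough sketch of the update→patch conversion; the project's real helpers
// (e.g. in pkg/utils) may differ in detail.
package example

import (
	"context"
	"encoding/json"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"

	snapshotv1 "github.com/kubernetes-csi/external-snapshotter/client/v6/apis/volumesnapshot/v1"
	clientset "github.com/kubernetes-csi/external-snapshotter/client/v6/clientset/versioned"
)

// patchOp mirrors a single RFC 6902 JSON Patch operation.
type patchOp struct {
	Op    string      `json:"op"`
	Path  string      `json:"path"`
	Value interface{} `json:"value,omitempty"`
}

// removeBeingCreatedAnnotation patches away a single annotation instead of
// writing back the whole object with Update. The patch carries no
// resourceVersion, so it cannot fail with "the object has been modified".
// A real implementation should check that the annotation exists first,
// because a JSON Patch "remove" on a missing path is rejected.
func removeBeingCreatedAnnotation(ctx context.Context, cs clientset.Interface,
	content *snapshotv1.VolumeSnapshotContent) (*snapshotv1.VolumeSnapshotContent, error) {
	// "/" inside a map key is escaped as "~1" per RFC 6901.
	payload, err := json.Marshal([]patchOp{{
		Op:   "remove",
		Path: "/metadata/annotations/snapshot.storage.kubernetes.io~1volumesnapshot-being-created",
	}})
	if err != nil {
		return nil, err
	}
	return cs.SnapshotV1().VolumeSnapshotContents().Patch(
		ctx, content.Name, types.JSONPatchType, payload, metav1.PatchOptions{})
}
```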

@ggriffiths
Member

/help

@k8s-ci-robot
Contributor

@ggriffiths:
This request has been marked as needing help from a contributor.

Guidelines

Please ensure that the issue body includes answers to the following questions:

  • Why are we solving this issue?
  • To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
  • Does this issue have zero to low barrier of entry?
  • How can the assignee reach out to you for help?

For more details on the requirements of such an issue, please see here and ensure that they are met.

If this request no longer meets these requirements, the label can be removed by commenting with the /remove-help command.

In response to this:

/help

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot added the help wanted label on Aug 24, 2022
@ggriffiths removed their assignment on Aug 24, 2022
@camartinez04

I've sent a PR: #757

@ggriffiths
Member

/assign @camartinez04

@blackpiglet

Also hit this issue.

E1011 03:12:02.763356       1 snapshot_controller.go:993] checkandRemovePVCFinalizer [velero-kibishii-data-kibishii-deployment-0-tsdgw]: removePVCFinalizer failed to remove finalizer snapshot controller failed to update kibishii-data-kibishii-deployment-0 on API server: Operation cannot be fulfilled on persistentvolumeclaims "kibishii-data-kibishii-deployment-0": the object has been modified; please apply your changes to the latest version and try again
E1011 03:12:02.763396       1 snapshot_controller.go:189] error check and remove PVC finalizer for snapshot [velero-kibishii-data-kibishii-deployment-0-tsdgw]: snapshot controller failed to update kibishii-data-kibishii-deployment-0 on API server: Operation cannot be fulfilled on persistentvolumeclaims "kibishii-data-kibishii-deployment-0": the object has been modified; please apply your changes to the latest version and try again

@ggriffiths
Member

/unassign @camartinez04

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Feb 8, 2023
@shubham-pampattiwar
Contributor

@ggriffiths @xing-yang If no one is actively working on this issue, I would like to work on it.

@xing-yang
Collaborator

Hi @shubham-pampattiwar, that's great. Can you please assign this issue to yourself? Thanks.

@xing-yang
Collaborator

/remove-lifecycle stale

@k8s-ci-robot removed the lifecycle/stale label on May 30, 2023
@shubham-pampattiwar
Contributor

/assign

@ggriffiths
Member

/unassign @shubham-pampattiwar

@phoenix-bjoern

Spotted this issue when using Velero 1.12.1 and the CSI plugin 0.6.1 together with snapshot-controller 6.3.1:

E1106 14:02:37.353831       1 snapshot_controller.go:107] createSnapshot for content [snapcontent-3cce7669-3854-4ee3-a157-d3abb5ff2dcd]: error occurred in createSnapshotWrapper: failed to remove VolumeSnapshotBeingCreated annotation on the content snapcontent-3cce7669-3854-4ee3-a157-d3abb5ff2dcd: "snapshot controller failed to update snapcontent-3cce7669-3854-4ee3-a157-d3abb5ff2dcd on API server: Operation cannot be fulfilled on volumesnapshotcontents.snapshot.storage.k8s.io \"snapcontent-3cce7669-3854-4ee3-a157-d3abb5ff2dcd\": the object has been modified; please apply your changes to the latest version and try again"
E1106 14:02:37.353865       1 snapshot_controller_base.go:351] could not sync content "snapcontent-3cce7669-3854-4ee3-a157-d3abb5ff2dcd": failed to remove VolumeSnapshotBeingCreated annotation on the content snapcontent-3cce7669-3854-4ee3-a157-d3abb5ff2dcd: "snapshot controller failed to update snapcontent-3cce7669-3854-4ee3-a157-d3abb5ff2dcd on API server: Operation cannot be fulfilled on volumesnapshotcontents.snapshot.storage.k8s.io \"snapcontent-3cce7669-3854-4ee3-a157-d3abb5ff2dcd\": the object has been modified; please apply your changes to the latest version and try again"
I1106 14:02:37.353904       1 snapshot_controller.go:291] createSnapshotWrapper: Creating snapshot for content snapcontent-3cce7669-3854-4ee3-a157-d3abb5ff2dcd through the plugin ...
I1106 14:02:37.353931       1 event.go:298] Event(v1.ObjectReference{Kind:"VolumeSnapshotContent", Namespace:"", Name:"snapcontent-3cce7669-3854-4ee3-a157-d3abb5ff2dcd", UID:"194a4dc1-c11c-44d4-acb4-504a0fed9314", APIVersion:"snapshot.storage.k8s.io/v1", ResourceVersion:"934203821", FieldPath:""}): type: 'Warning' reason: 'SnapshotCreationFailed' Failed to create snapshot: failed to remove VolumeSnapshotBeingCreated annotation on the content snapcontent-3cce7669-3854-4ee3-a157-d3abb5ff2dcd: "snapshot controller failed to update snapcontent-3cce7669-3854-4ee3-a157-d3abb5ff2dcd on API server: Operation cannot be fulfilled on volumesnapshotcontents.snapshot.storage.k8s.io \"snapcontent-3cce7669-3854-4ee3-a157-d3abb5ff2dcd\": the object has been modified; please apply your changes to the latest version and try again"

As the logs indicate, the patch rewrites are not in the 6.3.1 release yet, so please release them ASAP to close this issue.

@phoenix-bjoern

I've cherry-picked commit ff71329 from main to the release-6.3 branch and created new images:

  • phoenixmedia/k8s-snapshot-controller:v6.3.1-patch1
  • phoenixmedia/k8s-csi-snapshotter:v6.3.1-patch1

I can confirm the patch rewrites solve this issue. IMHO the patch could be safely merged to the release-6.3 branch.

@phoenix-bjoern

See #876 (comment)

@hairyhum

We have also seen similar errors on snapshot status updates, most likely coming from the UpdateStatus calls mentioned above. I don't see them addressed in the PR, but I believe those are important as well.

@julienvincent

Any update on this? This happens pretty consistently for me. Running v6.3.3.

@phoenix-bjoern

@julienvincent can you provide logs so we can understand the context?
How do you create the snapshots? Often two or more processes interact with the VolumeSnapshot objects.

@julienvincent

@phoenix-bjoern Sure! What kind of logs would you like? Happy to provide.

Similar to other users in this thread, I am using Velero 1.12.2 with the CSI plugin 0.6.2 and Longhorn 1.5.3. Velero is creating the VolumeSnapshots, and I am getting the same error as reported in this issue.

For example:

  Error:
    Message:     Failed to check and update snapshot content: failed to remove VolumeSnapshotBeingCreated annotation on the content snapcontent-48e5a79d-6d41-4b28-9d17-24cfaa920cad: "snapshot controller failed to update snapcontent-48e5a79d-6d41-4b28-9d17-24cfaa920cad on API server: Operation cannot be fulfilled on volumesnapshotcontents.snapshot.storage.k8s.io \"snapcontent-48e5a79d-6d41-4b28-9d17-24cfaa920cad\": the object has been modified; please apply your changes to the latest version and try again"
    Time:        2023-12-28T00:01:34Z
  Ready To Use:  false

Happy to provide any other information you need.

@phoenix-bjoern

@julienvincent the snapshot controller log should have a trace for the error. Can you share it to identify the exact code lines?

@julienvincent

@phoenix-bjoern sure, here is an example:

E1230 12:33:25.731663       1 snapshot_controller_base.go:470] could not sync snapshot "default/velero-registry-data-7737fc94-tn4m2": snapshot controller failed to update velero-registry-data-7737fc94-tn4m2 on API server: Operation cannot be fulfilled on volumesnapshots.snapshot.storage.k8s.io "velero-registry-data-7737fc94-tn4m2": StorageError: invalid object, Code: 4, Key: /registry/snapshot.storage.k8s.io/volumesnapshots/default/velero-registry-data-7737fc94-tn4m2, ResourceVersion: 0, AdditionalErrorMsg: Precondition failed: UID in precondition: 84d606d2-68a0-4ea7-b670-ea3068ce4975, UID in object meta:
I1230 12:33:25.731729       1 snapshot_controller_base.go:288] Failed to sync snapshot "default/velero-registry-data-7737fc94-tn4m2", will retry again: snapshot controller failed to update velero-registry-data-7737fc94-tn4m2 on API server: Operation cannot be fulfilled on volumesnapshots.snapshot.storage.k8s.io "velero-registry-data-7737fc94-tn4m2": StorageError: invalid object, Code: 4, Key: /registry/snapshot.storage.k8s.io/volumesnapshots/default/velero-registry-data-7737fc94-tn4m2, ResourceVersion: 0, AdditionalErrorMsg: Precondition failed: UID in precondition: 84d606d2-68a0-4ea7-b670-ea3068ce4975, UID in object meta:

@phoenix-bjoern

Thanks @julienvincent for the additional context. The trace shows a storage error: StorageError: invalid object, Code: 4. It seems the snapshot controller itself doesn't have an issue; it only surfaces the error returned by Velero and/or the CSI driver. Maybe check vmware-tanzu/velero#6179 for more context.

@julienvincent

julienvincent commented Jan 4, 2024

@phoenix-bjoern I might be misunderstanding something but the linked issue doesn't seem directly related to this issue.

That issue is more about Longhorn's behaviour of not deleting its internal snapshots when CSI volume snapshots are deleted. AFAIU Velero is not actually involved in the CSI snapshot process other than initially creating the VolumeSnapshot resource. I can reproduce this issue independently of Velero by manually creating a VolumeSnapshot, which I think rules out Velero's involvement.

In this case the snapshots themselves are not being reported as successful (but the underlying driver is successfully performing a snapshot).

But if I understand you correctly, this error message is set on the VolumeSnapshot resource by the driver (Longhorn), and external-snapshotter is just relaying it?

Would you recommend opening an issue with longhorn?

@phoenix-bjoern

@julienvincent The snapshot controller only triggers a process which the storage driver then executes. Since the error seems to occur in the storage driver, there is nothing you can do in the snapshot controller.
So yes, it makes sense to continue investigating the issue in the Longhorn components.

@kain88-de

Hi, we are running into the same issues. Have the UpdateStatus calls already been refactored to use patch instead of update? If not, I would like to help.
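
For reference, here is a sketch of how one of the remaining UpdateStatus call sites could be converted, using a merge patch against the status subresource. The function and the field being patched are illustrative, not the project's actual code.

```go
// Sketch only: converting an UpdateStatus round trip into a merge patch
// on the status subresource.
package example

import (
	"context"
	"encoding/json"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"

	snapshotv1 "github.com/kubernetes-csi/external-snapshotter/client/v6/apis/volumesnapshot/v1"
	clientset "github.com/kubernetes-csi/external-snapshotter/client/v6/clientset/versioned"
)

// patchSnapshotReadyToUse touches only status.readyToUse instead of writing
// back the whole object. The trailing "status" argument targets the status
// subresource, mirroring what UpdateStatus writes to; a merge patch carries
// no resourceVersion, so it avoids the conflict error entirely.
func patchSnapshotReadyToUse(ctx context.Context, cs clientset.Interface,
	snap *snapshotv1.VolumeSnapshot, ready bool) (*snapshotv1.VolumeSnapshot, error) {
	payload, err := json.Marshal(map[string]interface{}{
		"status": map[string]interface{}{
			"readyToUse": ready,
		},
	})
	if err != nil {
		return nil, err
	}
	return cs.SnapshotV1().VolumeSnapshots(snap.Namespace).Patch(
		ctx, snap.Name, types.MergePatchType, payload, metav1.PatchOptions{}, "status")
}
```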
