
Restic Volume Populator proceeds if no valid snapshots found or s3 bucket is not reachable. #1211

Open
cbc02009 opened this issue Apr 15, 2024 · 3 comments
Labels
bug Something isn't working

Comments


cbc02009 commented Apr 15, 2024

Describe the bug
I have found two cases where the volume populator just gives up and allows an empty PVC to be handed to the workload, which is not what I think the expected behavior should be. The result is that once the empty PVC is attached to the workload, any replication sources for that PVC will immediately snapshot the incorrectly populated PVC, potentially removing good snapshots in the process.

I would expect the volume populator to stop on a configuration error and not proceed any further, to avoid potential data loss.

The two scenarios I have found are:

  1. When the s3 bucket used to store the backups is not reachable (more specifically, when its hostname cannot be resolved via DNS)
  2. When the restoreAsOf field is incorrectly formatted (the relevant spec fields are sketched below)

This was tested on v0.9.0 and v0.9.1
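For context, the restoreAsOf setting lives on the restic section of the ReplicationDestination; a rough sketch (name, namespace, and secret below are placeholders, not my actual manifest) looks like this:

apiVersion: volsync.backube/v1alpha1
kind: ReplicationDestination
metadata:
  name: tautulli-restore            # placeholder name
  namespace: default                # placeholder namespace
spec:
  trigger:
    manual: restore-once
  restic:
    repository: tautulli-restic-secret          # placeholder secret holding the s3/minio repo config
    restoreAsOf: "2024-04-14T05:00:00-04:00"    # RFC-3339 timestamp used as the snapshot cutoff
    copyMethod: Snapshot
    capacity: 5Gi
    accessModes: ["ReadWriteOnce"]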

Steps to reproduce

  1. Take a backup of a PVC using a replication source
  2. Configure a replication destination with an invalid restoreAsOf field
  3. Remove and recreate the PVC to let the volume populator work
  4. The replication destination will throw an error, but still allow the empty PVC to attach to the workload
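The recreated PVC in step 3 points its dataSourceRef at the ReplicationDestination so the volume populator takes over; roughly (names are placeholders):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: tautulli-config              # placeholder name
  namespace: default                 # placeholder namespace
spec:
  accessModes: ["ReadWriteOnce"]
  dataSourceRef:
    apiGroup: volsync.backube
    kind: ReplicationDestination
    name: tautulli-restore           # must match the ReplicationDestination above
  resources:
    requests:
      storage: 5Gi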

Expected behavior
As mentioned earlier, to avoid data loss the replication destination should not allow a PVC that wasn't populated to attach to the workload.

Actual results
The empty PVC attached to the workload, the replication source ran, and it overwrote one of my backups with a now-empty snapshot.

[image: screenshot of the replication destination logs showing the invalid restoreAsOf date error and that no valid snapshots were found]

cbc02009 added the bug label on Apr 15, 2024
@tesshuflower
Contributor

From the logs above, this looks like the replicationdestination was successful - if no snapshots are found to be restored, the current behaviour is not to fail.

Do you have an example where the replicationdestination fails but you see the volumepopulator proceed to make a PVC available? I'd like to see the replicationdestination yaml if possible. Normally the volumepopulator should not try to provision the PVC at all until the replicationDestination gets a latestImage set - which would happen after a successful restore on the replicationdestination.
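A quick way to check this on your side (name/namespace are placeholders) is to see whether latestImage has been set on the replicationdestination, for example:

kubectl -n default get replicationdestination tautulli-restore \
  -o jsonpath='{.status.latestImage}'
# empty output means no successful restore has completed yet,
# and the volumepopulator should not have provisioned the PVC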

A couple more things:

  • I would expect the replicationdestination to fail if the actual s3 bucket cannot be reached - if you could provide the replicationdestination for one of these that would help, as I would like to make sure it fails in this scenario.
  • Do you have an example of an "invalid" restoreAsOf field?

@cbc02009
Author

Hi @tesshuflower, thanks for your response.

> if no snapshots are found to be restored, the current behaviour is not to fail.
I'm not sure this should be the expected behavior. As noted above, it leads to data loss: the replication source will immediately snapshot the empty PVC and overwrite one of the older backups. I can see how failing here could cause issues when instantiating a new cluster with no backups yet, though, so maybe it is a fair tradeoff for some potential data loss.

The invalid restoreAsOf field is posted in the image. It exactly mirrors what was used in my replicationdestination at the time:
https://github.com/cbc02009/k8s-home-ops/blame/6ead4889ae9f3f068a6dd70f89c8863e164928d7/kubernetes/main/templates/volsync/minio/replicationdestination.yaml#L25

I think it's in the correct format, but it's possible I messed it up. Also, there were valid snapshots that satisfied the restoreAsOf date:

ID        Time                 Host        Tags        Paths
------------------------------------------------------------
...
105e35cd  2024-04-13 05:00:33  volsync                 /data
367c0499  2024-04-13 06:00:23  volsync                 /data
215fb144  2024-04-14 19:27:28  volsync                 /data
4b41f3e4  2024-04-14 20:50:57  volsync                 /data
...
------------------------------------------------------------
31 snapshots
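That listing is the output of restic snapshots against the same repository, run roughly like this (the repo URL and credentials below are placeholders, not my real minio setup):

export RESTIC_REPOSITORY="s3:https://minio.example.internal:9000/volsync/tautulli"   # placeholder
export RESTIC_PASSWORD="..."            # placeholder
export AWS_ACCESS_KEY_ID="..."          # placeholder
export AWS_SECRET_ACCESS_KEY="..."      # placeholder
restic snapshots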

In the image above you can see where the replication destination found no valid snapshots and completed. I forgot to grab any screenshots, but immediately after the replication destination pod exited, the workload (tautulli in this case) started with an empty PVC. I verified this by accessing the hosted page and seeing that it wanted me to do the initial configuration all over again.

The s3 bucket I used is an internally hosted minio bucket, so I'm not sure exactly what info you want about it. I was having DNS issues while trying to bring up my cluster, and for a reason I haven't figured out yet, the hostname for my minio instance could not be resolved, which led to the issue posted above. I will try to recreate the situation and test it again later today.
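For the retest, an in-cluster DNS check along these lines should show whether the minio hostname resolves before the mover pod ever runs (hostname is a placeholder):

kubectl run dns-test --rm -it --restart=Never --image=busybox:1.36 -- \
  nslookup minio.example.internal
# if this fails to resolve, the restic mover pod will hit the same DNS problem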

Please let me know if there's any more info I can provide.

@tesshuflower
Contributor

Thanks for this extra info - let me split this up into a few different things:

  1. The volume populator should not populate a PVC if the replicationdestination it's based on has never completed a replication (never taken a snapshot). This should be the scenario if the s3 bucket cannot be accessed - if you can re-create this scenario, can you save the replicationdestination? I would expect it to fail in this case.

  2. The scenario where there are no snapshots to be restored - in this case the replicationdestination is successful, and you see the results you've mentioned. This would be expected as the replicationdestination has succeeded.

I do think perhaps there's an argument to be made for a flag somewhere to say that you want the job to fail if no snapshots are found to be restored - by default, however, it does succeed, and this seems to be the intended behaviour for most of our restic users. In these GitOps scenarios where the entire thing is set up at once (replicationsource, replicationdestination, source pvc as a volume-populated pvc), a first-time deploy would always fail if no data exists to be restored. See some discussions here: #1172 and #1181.

  3. There is a final issue you've brought up: your restoreAsOf doesn't seem to be getting parsed correctly. I'm not able to recreate this - if I type in the same date as you, like this: restoreAsOf: "2024-04-14T05:00:00-04:00", then I don't get any error in the log about an invalid date. I'm not sure what to think here; I believe it should be running the following in the pod:

date --date="2024-04-14T05:00:00-04:00" +%s
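Assuming GNU date in the mover image, a valid RFC-3339 value prints the epoch seconds and exits 0, while a malformed value errors out with a non-zero exit - roughly:

date --date="2024-04-14T05:00:00-04:00" +%s
# 1713085200   (exit status 0)
date --date="not-a-real-date" +%s
# date: invalid date 'not-a-real-date'   (non-zero exit status)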
