Delete Files from Existing Snapshot #14

Closed
scoddy opened this issue Nov 15, 2014 · 69 comments · Fixed by #2731
@scoddy
Member

scoddy commented Nov 15, 2014

In cases of accidental backup of e.g. too-large files, I would like to be able to delete specific files or directories (including recursively) from existing snapshots.

@scoddy scoddy added this to the 2014-48 milestone Nov 16, 2014
@scoddy scoddy modified the milestone: 2014-48 Nov 27, 2014
@viric
Contributor

viric commented Feb 16, 2016

That'd be really nice.

@teknico

teknico commented Mar 24, 2016

It would also allow removing sensitive data that got included unwittingly.

@zcalusic
Member

This would be a great feature!

@alphapapa

Any feedback from the devs on this idea? It would be very nice. For example, I just discovered that a program I build from git checkouts has been creating enormous binaries (almost 100 MB), and these have been getting backed up in my Restic backups unnecessarily. I haven't been using Restic for very long, as I'm still in a testing phase, so it's not a problem to delete the old snapshots in question. But this issue can happen quite easily, and it would be good to have long-term solutions for it, other than forgetting every snapshot.

I suppose it would be possible to write a script to restore every snapshot, delete undesired files, and re-backup the snapshot by setting the date manually, but obviously that would take a very long time. It would be great if Restic could do this natively.

Thanks.

@rawtaz
Contributor

rawtaz commented Jan 16, 2018

I think there are multiple valid use cases for this. Seems like a really good feature to have. I would probably use it myself at some point.

@dnnr
Contributor

dnnr commented Jan 18, 2018

It probably doesn't really change the implementation effort, but from a UX viewpoint, this might be done with a rather low profile by extending the backup command instead of adding an entirely new command:

restic backup [flags] FILE/DIR/SNAPSHOT [FILE/DIR/SNAPSHOT] ...

So instead of offering a command that modifies snapshots, this would allow making a new backup based on an existing snapshot ID. Deleting a file would be achieved with exclude rules.
All the documentation on restic backup could basically be "reused" (that is, almost nothing would need to be added for this new feature).

@alphapapa

alphapapa commented Jan 18, 2018

@dnnr See #1550 (comment)

However, I don't follow you here. Removing data from old snapshots is definitely a distinct operation and should have its own command. Something like:

restic purge --snapshots abcd1234 deadbeef --paths /path/to/file1 /path/to/file2

And --snapshots should probably accept an all keyword to operate on all snapshots (or all snapshots with the specified --tag). And the command should probably require confirmation by typing yes.

It would also be good for it to have a --patterns option, which would delete paths matching the given patterns.

purge is one possibility for the command's name. erase might also be a good choice, as well as delete. Whatever is chosen, it should make it clear that the operation permanently deletes data. This is backup software we're talking about, and any dangerous operations should be distinct, explicit, and require confirmation.

@dnnr
Contributor

dnnr commented Jan 18, 2018

Well, I left out the step where you'd delete the source snapshot afterwards (using forget, then maybe prune), because I thought that was obvious.

In my opinion, doing it like this would keep the command set more orthogonal compared to adding a new command that overlaps with the functionality of existing commands. Right now, there is backup, forget and prune and they all do completely separate things. Adding a purge like you describe it, changes that. My suggestion doesn't.

@alvarolm

Since we are proposing single-file operations, it would be nice to be able to rename files as well.

@rawtaz
Contributor

rawtaz commented Jan 18, 2018

I agree with @alphapapa that there should be a distinct command for this type of operation. It might be purge, that's not a bad name, then again there might be other similar operations in the future, e.g. @alvarolm already suggested being able to rename files.

For that reason I think perhaps adding a rewrite command is the best alternative in this case, and make that command have e.g. --purge and --rename options, assuming the latter is relevant to implement. So the final commands would be e.g. restic -r foo rewrite --purge snap1,snap2 path1 path2 ... and restic -r foo rewrite --rename snap1,snap2 pathFrom pathTo.

That said I'm not entirely sure renaming is something that's reasonable to implement - it goes quite a long way from what a backup program is about. But sure, why not.

I don't think it's wise to have the purge stuff be part of the backup command. In one perspective, you could argue that it's fine - you are doing an operation on your backup. But with that rationale the prune and unlock and forget actions should also be part of the backup command, as they too are about maintaining stuff in your backup. I don't think that makes sense, so I think it should indeed be a separate operation/command, e.g. rewrite or purge.

@alphapapa

alphapapa commented Jan 18, 2018

@dnnr

Well, I left out the step where you'd delete the source snapshot afterwards (using forget, then maybe prune), because I thought that was obvious.

It's definitely not obvious. It's also better if Restic handles that for the user, rather than the user having to keep track of which snapshot IDs have changed and need to be forgotten--which would be quite a burden if the user were rewriting all snapshots in the repo.

In my opinion, doing it like this would keep the command set more orthogonal compared to adding a new command that overlaps with the functionality of existing commands.

I don't understand what you mean. The opposite is the case. This proposed purge/delete/rewrite command does not overlap with backup at all--it deletes data from existing snapshots. It is orthogonal to existing commands.

Right now, there is backup, forget and prune and they all do completely separate things. Adding a purge like you describe it, changes that. My suggestion doesn't.

Again, no idea what you're thinking here. purge is completely separate from backup, forget, and prune:

  • backup: Creates a new snapshot of given paths.
  • forget: Removes existing snapshots.
  • prune: Garbage-collects unused blobs from forgotten snapshots.
  • purge/rewrite/whatever: Deletes files from existing snapshots.
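The distinction between these four operations can be sketched with a toy in-memory model. This is only an illustration under simplifying assumptions: a snapshot is a plain mapping from path to blob ID, and the function names and data are hypothetical, not restic's actual data structures.

```python
def forget(snapshots, snapshot_id):
    """Drop a whole snapshot; the blobs it referenced stay in the repository."""
    del snapshots[snapshot_id]


def prune(snapshots, blobs):
    """Delete every blob that no remaining snapshot references."""
    referenced = {blob_id for snap in snapshots.values() for blob_id in snap.values()}
    for blob_id in list(blobs):
        if blob_id not in referenced:
            del blobs[blob_id]


def rewrite(snapshots, snapshot_id, new_id, unwanted):
    """Add a copy of a snapshot that leaves out the unwanted paths."""
    snapshots[new_id] = {
        path: blob_id
        for path, blob_id in snapshots[snapshot_id].items()
        if path not in unwanted
    }


# Toy repository: one snapshot that accidentally contains a large binary.
blobs = {"b1": b"notes", "b2": b"huge binary"}
snapshots = {"snap1": {"/home/notes.txt": "b1", "/home/build/app": "b2"}}

rewrite(snapshots, "snap1", "snap2", {"/home/build/app"})
prune(snapshots, blobs)
after_rewrite = sorted(blobs)   # snap1 still references b2, so nothing is freed
print(after_rewrite)            # ['b1', 'b2']

forget(snapshots, "snap1")
prune(snapshots, blobs)
after_forget = sorted(blobs)    # only now is the large blob unreferenced
print(after_forget)             # ['b1']
```

In this model, removing a file from a snapshot frees no space by itself; the data only disappears once the original snapshot is forgotten and the repository is pruned.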

You are proposing making the backup command operate in two modes, one of which backs up data, and the other of which would delete data.

@rawtaz Yes, rewrite is a good suggestion, because it literally rewrites existing snapshots. I'd suggest a UI like:

restic --repo REPO rewrite --snapshots abcd1234 deadbeef --delete /path/to/file1 "*.unwanted-file-extension-glob"

I recommend against using commas as separators, because it makes constructing command lines in scripts much more complicated.

@dnnr
Contributor

dnnr commented Jan 18, 2018

backup: Creates a new snapshot of given paths.

Well, in a sense, modifying the contents of a snapshot is creating a new snapshot (because it's not the same snapshot as before). Think git commit --amend, which creates a new commit based on an existing commit. The analogy is actually pretty fitting, since this ticket seems to move rapidly towards reinventing Git.
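The git commit --amend analogy can be illustrated with a toy content-addressed store. This is a minimal sketch, assuming objects are identified by the SHA-256 of their JSON encoding; the object layout and the helper names store and rewrite_tree are hypothetical and not restic's real repository format.

```python
import hashlib
import json


def store(repo, obj):
    """Save obj under the SHA-256 of its canonical JSON form and return that ID."""
    data = json.dumps(obj, sort_keys=True).encode()
    obj_id = hashlib.sha256(data).hexdigest()
    repo[obj_id] = obj
    return obj_id


def rewrite_tree(repo, tree_id, excluded, prefix=""):
    """Return the ID of a copy of the tree that omits the excluded paths."""
    entries = []
    for entry in repo[tree_id]["entries"]:
        path = f"{prefix}/{entry['name']}"
        if path in excluded:
            continue
        if "subtree" in entry:
            # Directories are rewritten recursively; untouched ones hash to the same ID.
            entry = dict(entry, subtree=rewrite_tree(repo, entry["subtree"], excluded, path))
        entries.append(entry)
    return store(repo, {"entries": entries})


repo = {}
docs = store(repo, {"entries": [{"name": "notes.txt", "content": "c1"}]})
build = store(repo, {"entries": [{"name": "app.bin", "content": "c2"},
                                 {"name": "Makefile", "content": "c3"}]})
root = store(repo, {"entries": [{"name": "docs", "subtree": docs},
                                {"name": "build", "subtree": build}]})
old_snapshot = store(repo, {"time": "2018-01-18T12:00:00", "tree": root})

new_root = rewrite_tree(repo, root, {"/build/app.bin"})
new_snapshot = store(repo, {"time": "2018-01-18T12:00:00", "tree": new_root})
new_docs = repo[new_root]["entries"][0]["subtree"]

print(new_snapshot != old_snapshot)  # True: the "modified" snapshot has a new ID
print(old_snapshot in repo)          # True: the original is still there
print(new_docs == docs)              # True: the unchanged directory is shared
```

Under these assumptions, "modifying" a snapshot necessarily yields a new snapshot ID, while everything that did not change is shared between the old and the new snapshot.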

You are proposing making the backup command operate in two modes, one of which backs up data, and the other of which would delete data.

I didn't say that. Why would it? There is forget and prune, which are perfectly fine for removing things.

@alphapapa

Well, in a sense, modifying the contents of a snapshot is creating a new snapshot (because it's not the same snapshot as before). Think git commit --amend, which creates a new commit based on an existing commit. The analogy is actually pretty fitting, since this ticket seems to move rapidly towards reinventing Git.

You're right. But at the same time, Restic is not git, and it's not designed to require knowledge of content-based addressing to work. Regardless of how it works under the hood, I think that, to users, the command we are proposing should be considered to modify an existing snapshot, not create a new one, therefore it should be a distinct command.

I didn't say that. Why would it?

Well, you said:

from a UX viewpoint, this might be done with a rather low profile by extending the backup command instead of adding an entirely new command

Maybe you should explain in more detail.

There is forget and prune, which are perfectly fine for removing things.

Let's be specific. forget removes snapshots, and prune removes blobs. We're proposing a command to remove files within snapshots. It should be a distinct command.

@fd0
Member

fd0 commented Jan 19, 2018

I'd like to add my opinion:

I think having a way to modify snapshots in the repo is valuable, judging by how many people have said they would like something like this.

The command should be independent of the backup command, not only for orthogonality reasons (which is quite Go-like), but also out of practical consideration: The backup command is already complex enough so I'd like to separate the other command from it.

I don't like the name purge, because of the similarity to prune. What about change? Then we have restic backup, restic restore and restic change.

For the supported operations of the command, I've seen requests for:

  • Delete files, e.g. --delete
  • Rename files, e.g. --rename

The former is exactly what this issue (originally) is about, but are there really use cases for renaming files?

@fd0 fd0 added the type: feature suggestion suggesting a new feature label Jan 19, 2018
@rawtaz
Contributor

rawtaz commented Jan 19, 2018

I think change sounds more like taking something out and putting something in, rather than modifying the contents of something.

Imagine the repo/backup/snapshot is a bucket. Change is more like swapping the bucket itself for something else, or taking something out of it and putting another thing in, rather than picking something in the bucket up, modifying it a bit, and putting it back.

Perhaps a native English speaker knows which is more proper :) It boils down to linguistics, I think.

@fd0
Member

fd0 commented Jan 19, 2018

Hm, modify then?

@rawtaz
Contributor

rawtaz commented Jan 19, 2018

modify is definitely better than change. So either rewrite or modify out of what's been proposed so far. Curious what others think :)

@dimejo
Contributor

dimejo commented Jan 19, 2018

If this is only about deleting files, would it make sense to enhance the forget command to work with snapshots and files? Or would this be too complex?

If this new feature is about deleting and renaming (or something else) I'd vote for modify.

@rawtaz
Contributor

rawtaz commented Jan 19, 2018

Thanks for your input @dimejo 👍

I think that when you're renaming and/or deleting, you are not forgetting (at least not in the former case).

@pvgoran

pvgoran commented Jan 19, 2018

IMHO "rewrite" conveys the meaning the best.

@fd0
Member

fd0 commented Jan 19, 2018

The forget command is also very complex, we won't add anything to that if we can help it ;)

@dnnr
Contributor

dnnr commented Jan 19, 2018

If it's gonna be a separate command, calling it modify would be my favorite as well (I'd also like modify-snapshot, even though it is rather long). It's also generic enough to be an appropriate place for all kinds of modifying file operations (renaming, maybe even adding). However, I still think that anything beyond removing files smells strongly of feature creep.

By the way, I feel that restic would benefit from command categories, similar to what Git has with its plumbing commands. Right now, restic -h lists all commands in lexical order, mixing low-level commands (e.g., cat, list, which will never be needed by "normal" users) with the primary high-level commands.

@zcalusic
Member

You might also consider update.

@Miosame

Miosame commented Feb 14, 2020

For everyone who monitors this for updates or lands here from Google: there's no need to wait for this issue, which may never come to fruition. Just use duplicati in the meantime; it has first-class support for removing files from snapshots after the fact.

@MorgothSauron

For everyone who monitors this for updates or lands here from Google: there's no need to wait for this issue, which may never come to fruition. Just use duplicati in the meantime; it has first-class support for removing files from snapshots after the fact.

I've been using restic for about a year now and I've stopped waiting for features to be implemented. I don't mean that everything should be added to restic, but there are basic things that should be there. I'm considering moving away from restic: the repository is very fragile and can get broken very easily.

Yesterday I deleted a snapshot because it included files that should not have been in the backup (I forgot to add an exclude). Since then I have had errors in my repository and I haven't been able to repair it yet. I should not have to delete a whole snapshot because some files were included by mistake.

@Miosame

Miosame commented Feb 14, 2020

@MorgothSauron I usually just removed the snapshots that contained the files too, which seems to be the only solution in restic. But again, duplicati has been able to do it with a single command for a while now, so I've switched since and have had no issues.

@rawtaz
Contributor

rawtaz commented Feb 14, 2020

I wish to thank everyone for their input on this matter. As we've seen, many people have wanted in particular the ability to remove files from a snapshot. I guess we all make mistakes once in a while when backing up ;)

At this point in time the available maintainer and developer time is needed on other parts of restic, so I do not foresee this issue being implemented in the foreseeable future. I'm also going to release a new rest-server as soon as I can, and will then start to look into some other issues.

That said, if someone makes a solid PR that is nicely and clearly written, well tested and bug free, and produced in coordination with maintainers, it will definitely be considered for inclusion. This specific issue is one where @fd0 has already given his blessing on the direction, so focus can be mainly on producing a solid implementation (that we know won't corrupt repos) rather than "should we add this feature", which is good.

Such a PR should be basic and act as a starting point which if needed can be built upon. An example of what I mean by that is it should for starters:

  • Just be one new command (e.g. rewrite since that's the most voted for in this issue).
  • Take a list of snapshots as its primary argument(s) (including support for all), e.g. all or 098db9d5 or 098db9d5 af92db33.
  • Take a list of one or more --exclude <pattern> to list the paths that should be excluded/removed from the snapshot (in other words, here's the --exclude that was missing when backing up), e.g. --exclude="*.o", --exclude=*.unwanted, --exclude="*.o" --exclude=*.unwanted --exclude=.DS_Store.

The rationale here is to get a minimal start as a proof of concept and minimum viable product. Once being tested we can adjust it as needed, e.g. by adding the other --exclude-* arguments from the backup command. If we make a rewrite command like this, it will have pretty much the same interface as the backup command that it's meant to "correct":

restic -r /some/repo rewrite all --exclude="*.o" --exclude=*.unwanted --exclude=.DS_Store
restic -r /some/repo rewrite 098db9d5 af92db33 --exclude="*.o" --exclude=*.unwanted --exclude=.DS_Store
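How such exclude patterns would select the paths to drop can be sketched as follows. This is a rough approximation built on Python's fnmatch, assuming each pattern is tested against every component of a path; restic's own pattern rules may differ, and is_excluded and filter_paths are hypothetical helper names.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath


def is_excluded(path, patterns):
    """True if any component of path matches any of the glob patterns."""
    parts = PurePosixPath(path).parts
    return any(fnmatchcase(part, pattern) for part in parts for pattern in patterns)


def filter_paths(paths, patterns):
    """Keep only the paths that no exclude pattern matches."""
    return [p for p in paths if not is_excluded(p, patterns)]


patterns = ["*.o", "*.unwanted", ".DS_Store"]
paths = [
    "/src/main.c",
    "/src/main.o",
    "/home/user/.DS_Store",
    "/data/report.unwanted",
    "/data/report.txt",
]
kept = filter_paths(paths, patterns)
print(kept)  # ['/src/main.c', '/data/report.txt']
```

Because a pattern is checked against directory names as well, a match on a directory drops everything beneath it in this sketch, and an overly broad pattern such as * would drop every path.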

On a related note, perhaps the work done by @middelink in #323 could be used as inspiration or a basis for the implementation, as it does some processing of existing snapshots. I'm going to see if we can get moving with this one too soon.

@nullcake

@rawtaz

Thanks for the thoughtful feedback!

@dionorgua
Contributor

Hi there.

I've added a draft rewrite implementation that is close to what @rawtaz described in his comment.

It works here with a test repo and passes restic check --read-data without errors, but I have not tested it much. So I strongly suggest not using it with important data.

I've tried to keep the syntax very close to the backup command, so --exclude, --iexclude and --exclude-file are supported (but not tested). Ideally I'd also want an --exclude-if-present option (the ideal workflow for me is something like 'oops, this didn't need to be backed up, add CACHEDIR.TAG and run restic rewrite'). But that is pretty complex, because in that case we'd need to run the rewrite on the same host where the backup was made and access the filesystem to collect these files (plus tons of magic with relative paths). So not right now...

Also, I don't like the idea of replacing snapshots by default, so currently the default behavior is to just create a new snapshot with a rewrite tag. Replacing is also possible with the --inplace argument.
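The two behaviors described here (keep the original and tag the new snapshot, or replace the original) could look roughly like this. It is a toy sketch over a plain dictionary of snapshots, with a hypothetical helper name finish_rewrite; it is not the code from the pull request.

```python
def finish_rewrite(snapshots, old_id, new_id, new_snapshot, inplace=False):
    """Return an updated snapshot index after old_id has been rewritten into new_id."""
    result = dict(snapshots)
    new_snapshot = dict(new_snapshot)
    if inplace:
        # Replace: the original snapshot is dropped from the index.
        del result[old_id]
    else:
        # Default: keep the original and mark the new snapshot with a "rewrite" tag.
        new_snapshot["tags"] = list(new_snapshot.get("tags", [])) + ["rewrite"]
    result[new_id] = new_snapshot
    return result


index = {"098db9d5": {"tags": ["daily"]}}
kept_both = finish_rewrite(index, "098db9d5", "af92db33", {"tags": ["daily"]})
replaced = finish_rewrite(index, "098db9d5", "af92db33", {"tags": ["daily"]}, inplace=True)
print(sorted(kept_both))  # ['098db9d5', 'af92db33']
print(sorted(replaced))   # ['af92db33']
```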

Any feedback would be greatly appreciated.

@NovacomExperts

Hey Dmitry,

Thanks for this implementation, great work!

So far it works perfectly on Linux with a small test repo of 600 files and several test snapshots. Restore works and diff correctly shows the excluded folders. I will be doing more intensive tests on a "clone" of a real repo with many GB of data and hundreds of snapshots. I will also try repos sourced from Windows.

One suggestion: have an option to specify a tag for the snapshots that contained the exclusions on a rewrite pass (keeping the "rewrite" tag on newly created snapshots).

restic rewrite --add-tag mytag -i thisfileshouldberemoved.txt all

This would help identify the snapshots that still contain "thisfileshouldberemoved.txt". On the other hand, the more direct --inplace works as expected.

Again very good work.

@dionorgua
Contributor

@NovacomExperts Yes, my initial motivation was to keep 'history editing' as safe as possible. It's very easy to exclude something important with --exclude * and there is almost no way to recover from this (with backup it's just a matter of starting a new backup again). I wanted something like --dry-run, but with the ability to get the actual snapshot and explicitly delete the source snapshot after checking that it's OK.

I fully agree that this is not fully achieved yet. It's easy to 'observe' new snapshots, but too difficult to delete the old ones. Plus, I don't like the hardcoded rewrite tag. Maybe it's better to have --inplace by default and --keep-source-tagged before-rewrite --tag-destination after-rewrite or something like this. (--add-tag is a bit unclear as to whether it applies to the old or the new snapshot.)

In any case I'll wait for feedback from the maintainers. I don't want to spend much time on this if it's moving in the wrong direction.

PS. My primary restic repo is around 2 TB now. I will try it on that later, after making an LVM snapshot.

@NovacomExperts

@dionorgua Your initial motivation is fully correct. I'll cast my vote to keep it like that, with the "dangerous" option --inplace as far away from the user as possible (definitely not the default). I would prefer a missing-argument error on --keep-source-tagged / --tag-destination over --inplace by default.

But I agree, let's wait for feedback on this.

Yesterday, I forgot the cloned test repo (65 GB) inside a folder that was backed up by restic overnight. I could have simply run forget on yesterday's snapshot, but I went "all in" and tried your implementation. After forget + prune, I successfully removed the 65 GB from a 400 GB repo. All good, no errors found.

I will test more intensively with data that spans multiple snapshots.

Cheers

@dionorgua
Contributor

I've replaced the wrong pull request #2720 with a new one, because the old one was created from the master branch. I also added one missing error check. Sorry for the extra noise.

@msmafra

msmafra commented Jul 8, 2020

Hm, modify then?

Very late for this, but rectify is my suggestion for the delete-specific-file-from-backup command.

@ghost

ghost commented Aug 21, 2020

#2731 is exciting, thanks a bunch!

@rawtaz
Contributor

rawtaz commented Aug 21, 2020

Very late for this, but rectify is my suggestion for the delete-specific-file-from-backup command.

I have to say that's not a great name for it. Rectify implies there's something wrong that needs correcting/rectifying. While this may be true in one of the use cases, it's not always the case. A user may want to just remove some data from existing snapshots to free up space for all we know, while keeping the rest of the snapshot. The wording has to be more neutral than rectify, I think.

@filippobottega
Contributor

Hi, if it were possible to add folders or files to an existing snapshot, or remove them from it, restic could work like a deduplicating filesystem such as OpenDedup. An interesting use case could be saving multiple versions of vhd files.

@vstavrinov

The thing should be simple, e.g.

restic rm /srv/git/linux

It will delete the directory from all snapshots where it exists.

@mathstuf
Contributor

mathstuf commented Apr 7, 2021

Such a destructive action should not be so trivial. FWIW, I think the approach currently taken is the right approach (editing snapshots to remove references to paths then using forget to age the old snapshots out of the repository permanently).

@vstavrinov

What do you mean by "the approach currently taken"? Is it planned for a future release, or has it already been implemented?

@mathstuf
Contributor

mathstuf commented Apr 7, 2021

There's a PR for this which incorporates outcomes of discussion here. See #2731.

@nanosparks

nanosparks commented Jan 8, 2022

It would be great for Restic to have functionality analogous to borg recreate. https://borgbackup.readthedocs.io/en/stable/usage/recreate.html

@JsBergbau
Contributor

Any updates on this? Pull request #2731 does not seem to be maintained any more.
I'd also really like to have this feature.

@pascallj

pascallj commented Aug 8, 2022

Because there seems to have been interest in this issue for a long time now, I'll post my crude Python workaround, which I used a couple of days ago and which worked perfectly.

The basic idea is to rewrite all snapshots, but with an 'exclude' filter that excludes the files you want to scrub/purge. Depending on the number and size of the snapshots, this might take some time because it will rescan the metadata of every file in all of your snapshots. It uses the restic mount function for this, so fusermount must be working on your system. Also, if the files you want to purge moved between snapshots, make sure to specify multiple '--exclude=' filters, one for each location the file or directory could be in. It also probably only works on Linux.

You will lose some information from your snapshots but this can be changed by adapting the script to your needs. As far as I know (might be more) with the current script you will lose:

  • Timezone and milliseconds precision from the time the snapshot was taken
  • Existing tags
  • The listed backup path will be incorrect

Requirements:

  • Mount your repository somewhere
  • Get a list of the snapshots you want to rewrite in JSON format: restic -r /srv/restic-repo snapshots --json
  • Prepare your restic command which you normally use, however with some modifications:
    • Add your new '--exclude=home/user/directory_to_scrub' filter. Make sure this filter does not use an absolute path. We will change into the backup directory and therefore all paths will be relative. You can remove your existing filters, as the files they normally affect simply won't be present in the snapshots.
    • Make sure the command is not interactive (no sudo, no password input) as this won't work when scripting.
    • Remove your targets (for example '~/work') from the command
  • In the script substitute:
    • SNAPSHOTS_JSON_MAPPING, with your list of snapshots in JSON format
    • RESTIC_COMMAND, with your modified command, leaving every existing option in place.
    • RESTIC_MOUNT_DIR, with the absolute path to the directory you mounted your repository to
    • PRUNE_TAG, with a tag of choice

If you now run this Python script, it will change into the directory of each snapshot and perform a backup again, using the current snapshot as the parent and tagging the result to make it distinguishable. It might be wise to test it with just one snapshot first to see whether it works to your liking. While restic is making a backup it will show you the 'current_file' it's processing and the total number of files increasing. These numbers should increase at roughly the same rate, as this script is not really writing new files to your repository but only metadata (which is quite fast).

#!/usr/bin/env python3
import datetime
import os

mapping = SNAPSHOTS_JSON_MAPPING

for i in mapping:
	command = """ RESTIC_COMMAND
		--parent {parent_id} \
		--ignore-inode \
		--time "{time}" \
		--tag "PRUNE_TAG" \
		. """
	repo_dir = "RESTIC_MOUNT_DIR"

	os.chdir(f'{repo_dir}/ids/{i["short_id"]}')
	# We are forced to lose the timezone and the sub-second precision
	command = command.format(parent_id=i["id"],
		time=datetime.datetime.fromisoformat(i["time"][0:19]))
	print(f'---- Processing snapshot {i["short_id"]} ----')
	os.system(command)

Example:

I normally use this command restic --no-cache -p /etc/resticpasswd -r "/mnt/vg2-backup_lvol1/" --exclude="home/user/downloads" backup . and I mounted my repository at /mnt/resticmnt/.

The script will become:

#!/usr/bin/env python3
import datetime
import os

mapping = [
  {
    "time": "2022-08-03T03:04:10.22434835+02:00",
    "parent": "6b0a4ca9cbc8bce824588c6343e347405aac3d2bf196ca29b0d59234fc5e4da2",
    "tree": "8d33292e2d616d855e1dfba601abaf0e02f61404ae09075462eb6496e5a7eeba",
    "paths": [
      "/mnt/resticmnt"
    ],
    "hostname": "big-server",
    "username": "root",
    "id": "f4c4093a743a1ce3eb7f6e7a1914f9b13fca7bab87de6fe1bed0c3d0a2cd314c",
    "short_id": "f4c4093a"
  }
]

for i in mapping:
	command = """ restic --no-cache -p /etc/resticpasswd -r "/mnt/vg2-backup_lvol1/" backup \
		--exclude "home/user/directory_i_also_dont_want" \
		--parent {parent_id} \
		--ignore-inode \
		--time "{time}" \
		--tag "prune_downloads" \
		. """
	repo_dir = "/mnt/resticmnt"

	os.chdir(f'{repo_dir}/ids/{i["short_id"]}')
	# We are forced to lose the timezone and the sub-second precision
	command = command.format(parent_id=i["id"],
		time=datetime.datetime.fromisoformat(i["time"][0:19]))
	print(f'---- Processing snapshot {i["short_id"]} ----')
	os.system(command)
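A possible variant of the script above builds the argument list directly and runs it with subprocess instead of going through a shell, which avoids quoting problems and stops at the first failing backup. This is only a sketch under the same assumptions as the script (restic on PATH, repository mounted, non-interactive command); the helper names build_backup_argv and rewrite_all are hypothetical, and the restic flags are the ones already used above.

```python
import datetime
import os
import subprocess


def build_backup_argv(base_argv, snapshot, tag, excludes):
    """Return the argument list for re-backing up one mounted snapshot."""
    # Same truncation as the script above: keep only "YYYY-MM-DDTHH:MM:SS".
    when = datetime.datetime.fromisoformat(snapshot["time"][:19])
    argv = list(base_argv) + ["backup"]
    for pattern in excludes:
        argv += ["--exclude", pattern]
    argv += ["--parent", snapshot["id"], "--ignore-inode",
             "--time", str(when), "--tag", tag, "."]
    return argv


def rewrite_all(base_argv, snapshots, mount_dir, tag, excludes):
    """Run one backup per snapshot from inside its mounted directory."""
    for snap in snapshots:
        print(f'---- Processing snapshot {snap["short_id"]} ----')
        subprocess.run(
            build_backup_argv(base_argv, snap, tag, excludes),
            cwd=os.path.join(mount_dir, "ids", snap["short_id"]),
            check=True,  # raise and stop at the first failing backup
        )
```

With the example values from above, the call would be rewrite_all(["restic", "--no-cache", "-p", "/etc/resticpasswd", "-r", "/mnt/vg2-backup_lvol1/"], snapshots, "/mnt/resticmnt", "prune_downloads", ["home/user/directory_i_also_dont_want"]), where snapshots is the list obtained by parsing the output of restic snapshots --json with json.loads.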

Afterwards you can use restic diff between an old and a 'new' snapshot to make sure your excluded files are not in the new snapshot, but everything else is fine. If everything is fine, you can restic forget everything which does not carry the PRUNE_TAG you chose. For example: restic -p /etc/resticpasswd -r "/mnt/vg2-backup_lvol1/" forget --keep-tag "prune_downloads". And finally, after pruning, your files will be permanently removed from the repository.

Afterwards, your first new backup should contain both the --ignore-inode and --parent {REPLACE_WITH_LATEST_SNAPSHOT_ID} parameters, as restic will not recognize the parent because the path is different. Also don't forget your new exclude filter, otherwise the files will be backed up again in new snapshots...

I'm writing this after the fact, so I might have forgotten some steps or requirements. Let me know and I will update this post. Also feel free to make the script more robust if you so desire.
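To decide which snapshots to compare with restic diff, the old and rewritten snapshots can be paired from a fresh restic snapshots --json listing. The sketch below assumes the rewritten snapshots carry the chosen tag and that their timestamps still agree with the originals down to the second (which relies on the local timezone matching the one the originals were taken in); pair_rewritten is a hypothetical helper, and the second snapshot in the sample data is made up for illustration.

```python
def pair_rewritten(snapshots, tag):
    """Pair each original snapshot with the rewritten one that carries tag.

    Snapshots are matched on their timestamp truncated to whole seconds.
    Returns (pairs, unmatched): short-ID pairs and originals without a match.
    """
    rewritten = {}
    originals = []
    for snap in snapshots:
        # The "tags" key is absent when a snapshot has no tags.
        if tag in (snap.get("tags") or []):
            rewritten[snap["time"][:19]] = snap
        else:
            originals.append(snap)
    pairs, unmatched = [], []
    for snap in originals:
        new = rewritten.get(snap["time"][:19])
        if new is None:
            unmatched.append(snap["short_id"])
        else:
            pairs.append((snap["short_id"], new["short_id"]))
    return pairs, unmatched


listing = [
    {"time": "2022-08-03T03:04:10.22434835+02:00", "short_id": "f4c4093a"},
    {"time": "2022-08-03T03:04:10+02:00", "short_id": "0a1b2c3d",
     "tags": ["prune_downloads"]},
]
pairs, unmatched = pair_rewritten(listing, "prune_downloads")
print(pairs)      # [('f4c4093a', '0a1b2c3d')]
print(unmatched)  # []
```

Each pair can then be checked with restic diff, and any entry in the unmatched list is an original that has no rewritten counterpart yet and would be lost by forget --keep-tag.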

@jniggemann
Contributor

Is there anything we can do to get this rolling? Would a bounty work?

@therealrobster

Hi all, just checking in to see if this has had any movement?

I have several TB of data I need to remove. They're video editing movie data (massive files) that was put into the wrong folder by someone who simply messed up. Human error. Our online backup is now MASSIVE and costing us some coin. We need to remove these files as they're costing us each month to have it there.

Just wanted to see if there's an implementation yet?

Thanks so much.

@pabs3

pabs3 commented Nov 1, 2022 via email

@rawtaz
Contributor

rawtaz commented Nov 2, 2022

I'd say stay tuned because that PR is shaping up pretty well.
