support for merged backup parents #3118
I think @chastings667 is right: there's much room for improvement in the parent selection algorithm.
I think there could be two related enhancements here:
And maybe they could use the same backend merge algorithm:
(pardon the naive pseudocode...) Or maybe #1 doesn't need to write to a real temporary snapshot, just a structure in memory; I'm not familiar with the internals. But they seem like they'd have very different use cases. The first would be applicable to anyone doing routine backups to decrease overall runtime, while the second might only be used by advanced users under specific circumstances to create a new snapshot from existing ones. The first use case could be made even easier, as a simple flag with no policy argument; restic would then choose a reasonable policy for the parent snapshot merge.
I'd like to introduce as little new functionality to users as possible in order to reduce maintenance. Primarily, I'd like to make the parent detection algorithm much more intelligent. So if you run:
Then restic should detect on the third invocation that the files have already been saved. The data structures are already there; the function which does the lookup ("have these files/folders been saved before") needs to look at more previous snapshots. The archiver (which does the reading and saving to the repo) already builds a tree-like structure before starting to inspect files and folders (here), where it has the information that a directory has been saved before.
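The lookup described here ("have these files/folders been saved before") could, in principle, consult trees from all candidate parents rather than a single one. A minimal sketch of that idea, with invented names rather than restic's actual data structures:

```go
package main

import "fmt"

// knownTrees is a hypothetical index of subtree IDs collected from
// every candidate parent snapshot, instead of from one parent only.
type knownTrees map[string]bool

// addSnapshot records the subtree IDs contained in one parent snapshot.
func (k knownTrees) addSnapshot(treeIDs []string) {
	for _, id := range treeIDs {
		k[id] = true
	}
}

// seenBefore answers the archiver's question "has this directory been
// saved before?", so unchanged subtrees can be reused without re-reading.
func (k knownTrees) seenBefore(treeID string) bool { return k[treeID] }

func main() {
	k := make(knownTrees)
	k.addSnapshot([]string{"tree-a", "tree-b"}) // e.g. from a snapshot of /a
	k.addSnapshot([]string{"tree-c"})           // e.g. from a snapshot of /b
	fmt.Println(k.seenBefore("tree-a"), k.seenBefore("tree-z"))
}
```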
@fd0 I agree that adding this to the backup command can be done without many code changes. As I had a bit of time while the children were playing, I started a PR.
So this will look at all snapshots in the repo, and use any that contain the same files/directories as the current backup? Cool. Two questions about your change:
No, only the snapshots given manually by
Right, this is still open. I'm about to implement a better algorithm.
True, but I don't think that the usual case will be lots of parent snapshots. The standard case will be exactly one parent; backup runs with more (maybe two or three, at most on the order of tens) will already be exceptions.
No, as only trees are read and those are locally cached, …
We'll probably want to add a hard limit though. Loading too many trees will otherwise cause some slow-down. Could snapshots with several thousand paths slow things down?
Ah, sorry about that, I'm new to GitHub. Is there a way I can tell whether a change is the final version before submitting?
You still need to determine which snapshots belong as parents, which would mean looking through all snapshots unless constrained somehow. If you search for matching files and paths (are these "targets"?) in each snapshot to determine appropriate parents, this could be really slow. Just searching for relevant paths might be better, and limiting the search somehow (100 recent snapshots, last 90 days, etc.) would also mitigate these concerns.
I would like to argue against a hard limit on the number of snapshots.
We have the convention that in a PR there is a checklist that is going to be checked...
Yes, that is exactly the algorithm I had in mind. The current algorithm to find the one parent is actually also quite simple: it just searches for the latest snapshot with identical host and paths. I finished implementing the algorithm which finds suitable parents with identical hosts and "matching" paths. Further, the algorithm removes "superseded" snapshots, which are basically snapshots whose paths (or at least the subpaths which might be given as arguments to …). I realized that the …
I had a thought this morning: how do excludes interact with the reliance on paths as a proxy for files? I think they can break this algorithm, but I can't think of a good way to fix it. Consider:

latest snapshot: path=/ exclude=/

So the previous snapshot will contain lots of file metadata, while the latest snapshot, which will be chosen by this algorithm, contains none. (I'm not sure you can have this path/exclude combination, but the example still stands.) I'm not even sure this needs to be fixed, as it seems like an unusual case to hit; just wanted to note it.
You are right - I was also thinking about whether there is a good way to include this. However, as this would add quite some complexity (and is also not tackled by the current, much simpler "parent" algorithm), I will keep ignoring the excludes list in the new algorithm.
@chastings667 #3121 is now in a state that I would judge as ready for review. It would be nice if you could test it and report whether it suits your needs or whether there are still issues...
I'd love to test this out, but I don't currently have any way to build the restic binary. There is some automated testing done in the pull request -- does that build the restic binary, and is it stored anywhere? |
I put a compiled linux binary here: But please make sure that you use binaries from untrusted sources (like this one) only in environments where they cannot do any harm! |
Yes, I understand about binaries (and source code, really!) from untrusted sources. Thanks for the warning, though. So I ran the following (init), and the change looks like it's doing what it's supposed to. I measured runtime for the final backup cmd, since this should benefit from the new parent searching.

0.11.0: processed 40 files, 3.906 GiB in 1:12 (real 1m12.562s)
alpha build: processed 40 files, 3.906 GiB in 0:00 (real 0m1.042s)

So yeah, that's quite a bit faster. I'd like to measure speedup on my 4TB backup set, but maybe I'll wait for a release candidate with this change in it. And to verify that single-parent matching still works, I again backed up data/a. No full scan, and it selected the correct parent.
@chastings667 Thanks for testing! Two remarks:
I think the parents are printed whether or not -v is specified. Overall they look OK. Here's the sequence:
At first I thought #7 was a bug, since the data hasn't changed and using the snapshot from #5 as a parent should be sufficient. But I suspect you're using the snapshot creation time, not anything related to the file modtimes, to determine which path is newer. If that's the case, this looks correct.
For the parent detection, this is correct, as only the timestamps are used to choose the parents. As in your case the tree … If however …
Sorry to go back to the original description of the problem being solved. It sounds like there are lots of reasons to have restic be able to find/detect parent snapshots, not just the originally proposed problem. I'm just curious.
What would happen if you instead did:
Would that provide the same benefit as doing the multiple-parent merge? Specifically, it seems like it would fix this problem:
The third backup should reuse the metadata from the first two commands by leveraging excludes in this way, right? |
As I understand it, this is a weakness in the current parent selection algorithm. Only paths and timestamps are used in the algorithm, and parent paths are used as a proxy for "files contained in this snapshot" without consideration of any excludes. I suppose Supersedes() could be modified to consider exclude paths? Are those even stored in snapshot objects? Current:

// Supersedes returns whether sn1 supersedes sn2 w.r.t. one of the given paths.

Additional term:
This would be easy to implement but would grow the number of parents as a multiple of the number of exclude sets being used. There's probably a more precise way to do it, where you consider exclude superpaths in a manner similar to paths above, but the logic is inverted and I would need to give it more thought. |
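One conservative way to phrase that "additional term": sn1 may only supersede sn2 for a target path if every exclude of sn1 under that path is also excluded by sn2, so sn1 cannot be missing files that sn2 would have contained. This sketch assumes plain exclude paths are recorded in snapshots at all, which the question above leaves open, and all names are hypothetical:

```go
package main

import (
	"fmt"
	"strings"
)

// underPath reports whether p lies at or below root.
func underPath(p, root string) bool {
	return p == root || strings.HasPrefix(p, strings.TrimSuffix(root, "/")+"/")
}

// excludesCompatible is a conservative sketch of the extra check:
// for sn1 to supersede sn2 w.r.t. target, every exclude of sn1 that
// falls under target must also be covered by an exclude of sn2.
// Otherwise sn1 might lack file metadata that sn2 still carries.
func excludesCompatible(excl1, excl2 []string, target string) bool {
	for _, e1 := range excl1 {
		if !underPath(e1, target) {
			continue // this exclude is irrelevant to the target path
		}
		covered := false
		for _, e2 := range excl2 {
			if underPath(e1, e2) {
				covered = true
				break
			}
		}
		if !covered {
			return false
		}
	}
	return true
}

func main() {
	// sn1 excludes /a/tmp, sn2 excludes nothing: sn1 may be missing files.
	fmt.Println(excludesCompatible([]string{"/a/tmp"}, nil, "/a"))
	// identical excludes are always compatible.
	fmt.Println(excludesCompatible([]string{"/a/tmp"}, []string{"/a/tmp"}, "/a"))
}
```

This is the inverted-logic direction hinted at above: excludes subtract coverage, so the comparison runs opposite to the path-containment check.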
That would be different backups. If you know in advance that you later want to use this as a parent for another path, you could simply use #3200 and
No, it would only use the second backup as parent and need to re-chunk and re-hash everything under |
@chastings667 In #3405 I added the possibility to merge the files that you pass with the --files-from-verbatim option into a preexisting snapshot. In this list you can specify new, updated, and deleted files, and the new snapshot will be the merge with the parent (which you can specify with the --parent option). I think it's ready for review and I would be really happy if you could test it and report feedback or any issues you spot. Thanks in advance
Output of restic version:
restic 0.11.0 compiled with go1.15.3 on linux/amd64
What should restic do differently? Which functionality do you think we should add?
See https://forum.restic.net/t/backup-parent-behavior/3286 for background
I'd like to suggest an enhancement to allow more choices for the backup parent. Currently you're limited to a single snapshot, but it would be useful to use a merged object that combines data from multiple snapshots. The merge algorithm could keep the most recent metadata for each file.
Source snapshots for this merged parent would be selected by policy:
This functionality could be implemented with special values for the --parent arg or as a new --parent-merge arg:
num:10 => last 10 snapshots in repo
num:25 => last 25
14d => all snapshots <= 14 days old
1m => all snapshots <= 1 month old
mem:1G => use latest snapshots until parent object is 1GB in size
Or in some other similar manner.
What are you trying to do? What problem would this solve?
This change would improve the backup parent metadata hit rate, which would improve backup scan speed by avoiding a full read of source files to detect changes. It would also decouple backup scan time from the order of backup cmds.
Consider these three cmds, run in order:
The third backup will not benefit from metadata stored from backups 1 and 2, since the paths do not match. An explicit specification of either 1 or 2 as the parent is possible, but requires manual intervention, and only includes a subset of what's available in the repo due to the single-snapshot limitation.
Implementing this change would make use of more of the metadata available in the repo, and help keep backup scan times as low as possible. For my 4TB dataset, a full scan with no parent takes >48 hours, while a scan with complete metadata (all files are in the parent and unchanged) takes <1 hour.
Did restic help you today? Did it make you happy in any way?
It didn't directly help me today as I didn't suffer any drive failures. But it did contribute to my overall happiness, as my data is safe and securely stored in multiple locations.