support for merged backup parents #3118

Open
chastings667 opened this issue Nov 21, 2020 · 24 comments · May be fixed by #3405

@chastings667

Output of restic version

restic 0.11.0 compiled with go1.15.3 on linux/amd64

What should restic do differently? Which functionality do you think we should add?

See https://forum.restic.net/t/backup-parent-behavior/3286 for background

I'd like to suggest an enhancement to allow more choices for the backup parent. Currently you're limited to a single snapshot, but it would be useful to use a merged object that combines data from multiple snapshots. The merge algorithm could keep the most recent metadata for each file.

Source snapshots for this merged parent would be selected by policy:

  1. last N snapshots (globally in repo)
  2. snapshots from the last time interval (N days/weeks/months)
  3. as many recent snapshots as possible, constrained to a max memory footprint
  4. etc

This functionality could be implemented with special values for the --parent arg or as a new --parent-merge arg:

num:10 => last 10 snapshots in repo
num:25 => last 25
14d => all snapshots <= 14 days old
1m => all snapshots <= 1 month old
mem:1G => use latest snapshots until parent object is 1GB in size

Or in some other similar manner.
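
To make this more concrete, here is a minimal Go sketch of how such policy strings could be parsed. Everything in it is hypothetical: the ParentMergePolicy type and parsePolicy function don't exist in restic, and only some of the suggested forms are handled.

package main

import (
	"fmt"
	"strconv"
	"strings"
	"time"
)

// ParentMergePolicy is a hypothetical representation of the proposed
// --parent-merge values; restic has no such type today.
type ParentMergePolicy struct {
	LastN    int           // "num:10" => merge the last 10 snapshots
	MaxAge   time.Duration // "14d"    => merge snapshots at most 14 days old
	MaxBytes int64         // "mem:1G" => stop once the merged parent reaches 1 GiB
}

func parsePolicy(s string) (ParentMergePolicy, error) {
	var p ParentMergePolicy
	switch {
	case strings.HasPrefix(s, "num:"):
		n, err := strconv.Atoi(strings.TrimPrefix(s, "num:"))
		if err != nil {
			return p, err
		}
		p.LastN = n
	case strings.HasPrefix(s, "mem:"):
		// only the "G" suffix is handled, to keep the sketch short
		n, err := strconv.ParseInt(strings.TrimSuffix(strings.TrimPrefix(s, "mem:"), "G"), 10, 64)
		if err != nil {
			return p, err
		}
		p.MaxBytes = n << 30
	case strings.HasSuffix(s, "d"):
		n, err := strconv.Atoi(strings.TrimSuffix(s, "d"))
		if err != nil {
			return p, err
		}
		p.MaxAge = time.Duration(n) * 24 * time.Hour
	default:
		return p, fmt.Errorf("unknown parent-merge policy %q", s)
	}
	return p, nil
}

func main() {
	for _, s := range []string{"num:10", "14d", "mem:1G"} {
		fmt.Println(parsePolicy(s))
	}
}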

What are you trying to do? What problem would this solve?

This change would improve the backup parent metadata hit rate, which in turn improves backup scan speed by avoiding a full read of source files to detect changes. It would also decouple backup scan time from the order of backup cmds.

Consider these three cmds, run in order:

  1. backup /home/foo
  2. backup /home/bar
  3. backup /home

The third backup will not benefit from metadata stored from backups 1 and 2, since the paths do not match. An explicit specification of either 1 or 2 as the parent is possible, but requires manual intervention, and only includes a subset of what's available in the repo due to the single-snapshot limitation.

Implementing this change would make better use of the metadata already available in the repo, and help keep backup scan times as low as possible. For my 4TB dataset, a full scan with no parent takes >48H, while a scan with complete metadata (all files are in the parent and unchanged) takes <1H.

Did restic help you today? Did it make you happy in any way?

It didn't directly help me today as I didn't suffer any drive failures. But it did contribute to my overall happiness, as my data is safe and securely stored in multiple locations.

@rawtaz added the labels category: backup and type: feature enhancement (improving existing features) on Nov 21, 2020
@aawsome
Contributor

aawsome commented Nov 21, 2020

I am wondering if we should implement a possibility to save a new snapshot by merging given existing snapshots. This could e.g. be integrated into a rewrite command which could accept a list of source snapshots and a merging policy.

See #2654, #2731

@fd0
Member

fd0 commented Nov 21, 2020

I think @chastings667 is right: there's much room for improvement in the parent selection algorithm backup uses. The data structures given to the archiver certainly make this possible.

@chastings667
Author

I think there could be two related enhancements here:

  1. adding a merged snapshot parent to the backup cmd
  2. adding a snapshot merge functionality to the rewrite cmd

And maybe they could use the same backend merge algorithm:

  1. backup_parent = merged_snapshot(snapshots[], policy); do_backup(backup_parent, ...); delete_snapshot(backup_parent)
  2. new_snapshot = merged_snapshot(snapshots[], policy)

(pardon the naive pseudocode...)

Or maybe #1 doesn't need to write to a real temporary snapshot, just a structure in memory. I'm not familiar with the internals.
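
As a rough sketch of the in-memory idea (with an entirely made-up, simplified Node type rather than restic's real data structures), the merge could look like this:

package sketch

import "time"

// Node is a deliberately simplified, hypothetical stand-in for a file entry
// in a snapshot; restic's real node type carries much more information.
type Node struct {
	Name    string
	ModTime time.Time
	// content/subtree references would live here in a real implementation
}

// mergeParents keeps, for every file name, the entry with the most recent
// metadata across all candidate parent snapshots.
func mergeParents(parents [][]Node) map[string]Node {
	merged := make(map[string]Node)
	for _, nodes := range parents {
		for _, n := range nodes {
			if prev, ok := merged[n.Name]; !ok || n.ModTime.After(prev.ModTime) {
				merged[n.Name] = n
			}
		}
	}
	return merged
}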

But they seem like they'd have very different use cases. The first would be applicable to anyone doing routine backups to decrease overall runtime, while the second might only be used by advanced users under specific circumstances to create a new snapshot from existing ones.

The first use case could be made even easier, as a simple flag with no policy argument. Then restic chooses a reasonable policy for the parent snapshot merge.

@fd0
Member

fd0 commented Nov 22, 2020

I'd like to introduce as little new functionality to users as possible in order to reduce maintenance. Primarily, I'd like to make the parent detection algorithm much more intelligent. So if you run:

$ restic backup /home
$ restic backup /root
$ restic backup /

Then restic should detect on the third invocation that /root and /home have both been saved already and use the metadata from two different snapshots to avoid re-reading data (if possible). That'd cover most use cases I think.

The data structures are already there, the function which does the lookup ("have these files/folders been saved before") needs to look at more previous snapshots. The archiver (which does the reading and saving to the repo) already builds a tree-like structure before starting to inspect files and folders (here), where it has the information that a directory (e.g. /root) should be saved, and that previous metadata can be found in the repo at some id.
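
Roughly speaking, the lookup could then consult several candidate parent trees instead of a single one. A tiny sketch, reusing the simplified, hypothetical Node type from further up (this is not the actual archiver code):

// findPrevious checks the candidate parent trees, newest first, and returns
// the previously stored metadata for name if any of them contains it.
func findPrevious(name string, parentTrees []map[string]Node) (Node, bool) {
	for _, tree := range parentTrees {
		if n, ok := tree[name]; ok {
			return n, true
		}
	}
	return Node{}, false
}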

@aawsome
Contributor

aawsome commented Nov 23, 2020

@fd0 I agree that adding this to the backup command can be done without many code changes. As I had a bit of time while the children were playing, I started a PR.
@chastings667: #3121 should already work if you specify multiple --parent options.

@chastings667
Author

So this will look at all snapshots in the repo, and use any that contain the same files/directories as the current backup? Cool.

Two questions about your change:

  1. I'm not seeing the default multiple-snapshot selection yet. Am I misunderstanding something here? cmd_backup.go:415 shows the fallthrough when no --parent(s) are specified on the cmdline, and it appears to only ever add a single snapshot ID to the array.

  2. Are there going to be any perf issues looking at all files on all snapshots to determine which ones to use? Let's say a repo has 1000 snapshots, each with 1M files. I can think of two potential costs here:
    a) data structure traversal. I think this algorithm will do foreach (files_in_backup) { foreach (snapshots) { file_in_snapshot() } } which seems like it could take a while. Even if file_in_snapshot() does some kind of efficient tree lookup, that's a lot of comparisons. And it grows with the size of the current backup job, not with the snapshots in the repo.
    b) increased network data usage. I have no idea how large snapshot data is, but if it's remote and the network is slow or metered (S3, Azure, etc) increasing reads could be costly.

@aawsome
Contributor

aawsome commented Nov 23, 2020

So this will look at all snapshots in the repo, and use any that contain the same files/directories as the current backup?

No, only the snapshots given manually by --parent or the ones selected by a "find parents" algorithm.

1. I'm not seeing the default multiple-snapshot selection, yet.

Right, this is still open. I'm about to implement a better algorithm.

2. Are there going to be any perf issues looking at all files on all snapshots to determine which ones to use? Let's say a repo has 1000 snapshots, each with 1M files. I can think of two potential costs here:
   a) data structure traversal. I think this algorithm will do foreach (files_in_backup) { foreach (snapshots) { file_in_snapshot() } } which seems like it could take a while. Even if file_in_snapshot() does some kind of efficient tree lookup, that's a lot of comparisons. And it grows with the size of the current backup job, not with the snapshots in the repo.

True, but I don't think that the usual case will be lots of parent snapshots. The standard case will be exactly one parent, and backup runs with more (maybe two or three, at most on the order of tens) will already be exceptions.

   b) increased network data usage. I have no idea how large snapshot data is, but if it's remote and the network is slow or metered (S3, Azure, etc) increasing reads could be costly.

No, as only trees are read and those are locally cached.

@MichaelEischer
Member

True, but I don't think that the usual case will be lots of parent snapshots. The standard case will be exactly one parent, and backup runs with more (maybe two or three, at most on the order of tens) will already be exceptions.

We'll probably want to add a hard limit though. Loading too many trees will otherwise cause some slow-down. Could snapshots with several thousand paths slow things down?

@chastings667
Author

Right, this is still open. I'm about to implement a better algorithm.

Ah, sorry about that, I'm new to GitHub. Is there a way I can tell whether a change is the final version before it is submitted?

True, but I don't think that the usual case will be lots of parent snapshots. The standard case will be exactly one parent, and backup runs with more (maybe two or three, at most on the order of tens) will already be exceptions.

You still need to determine which snapshots belong as parents, which would look through all snapshots unless constrained somehow. If you search for matching files and paths (are these "targets"?) in each snapshot to determine appropriate parents, this could be really slow. Just searching for relevant paths might be better, and limiting the search somehow (100 recent snapshots, last 90 days, etc.) would also mitigate these concerns.

@aawsome
Contributor

aawsome commented Nov 24, 2020

We'll probably want to add a hard limit though. Loading too many trees will otherwise cause some slow-down. Could snapshots with several thousand paths slow things down?

I would like to argue against a hard limit on the number of snapshots.

  1. Yes, loading too many trees slows the backup down. So does computing too many unneeded chunks and SHA256 hashes. All the parent option is about is to trade off loading trees against computing blobs - which could always be the wrong choice.

  2. The number of trees to load does depend on the number of parent snapshots, but much more on the tree structure of the snapshots. So why should 1000 parent snapshots, each containing a single dir, be worse than one parent snapshot containing the "superdir" (which contains all 1000 dirs) and therefore requiring 1001 trees to be loaded?

  3. Again, I think that the absolute standard will be exactly one parent snapshot followed by having zero parent snapshots (for initial backups). All other cases will be special cases like shifting backup strategy or users explicitly specifying parents.

  4. There was a problem in backup: Support multiple parents #3121 which loaded identical trees more than once in pathological cases. I corrected this by sorting the nodes by subtree ID before loading the trees and ignoring duplicates; see the sketch below.
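
A rough sketch of that dedup step, using only the standard library and plain strings in place of restic's binary IDs (the actual #3121 code differs):

package sketch

import "sort"

// uniqueSubtreeIDs sorts the collected subtree references and drops
// duplicates, so that each tree is loaded from the repository only once.
func uniqueSubtreeIDs(ids []string) []string {
	sort.Strings(ids)
	out := ids[:0]
	for i, id := range ids {
		if i == 0 || id != ids[i-1] {
			out = append(out, id)
		}
	}
	return out
}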

@aawsome
Contributor

aawsome commented Nov 24, 2020

Ah, sorry about that, I'm new to GitHub. Is there a way I can tell whether a change is the final version before it is submitted?

We have the convention that a PR contains a checklist which is checked off as the work progresses...

You still need to determine which snapshots belong as parents, which would look through all snapshots unless constrained somehow. If you search for matching files and paths (are these "targets"?) in each snapshot to determine appropriate parents, this could be really slow. Just searching for relevant paths might be better, and limiting the search somehow (100 recent snapshots, last 90 days, etc.) would also mitigate these concerns.

Yes, that is exactly the algorithm I had in mind. The current algorithm to find the one parent is actually also quite simple: It just searches for the latest snapshot with identical host and paths.

I finished implementing the algorithm which finds suitable parents with identical hosts and "matching" paths. Further, the algorithm removes "superseded" snapshots, which are basically snapshots whose paths (or at least the subpaths given as arguments to backup) are completely contained in a later snapshot.
This is now included in #3121

I realized that the Supersedes() method is pretty hard to understand; we need some unit tests which show by example what results to expect from it 😉
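
For illustration, here is a rough sketch of what "supersedes" could mean, under simplified assumptions (a hypothetical Snapshot type, string paths compared by prefix); the real #3121 code differs and also restricts itself to the paths relevant to the current backup:

package sketch

import (
	"strings"
	"time"
)

// Snapshot is a hypothetical, stripped-down snapshot record for this sketch.
type Snapshot struct {
	Hostname string
	Paths    []string
	Time     time.Time
}

// supersedes reports whether sn1 makes sn2 redundant as a parent: sn1 is
// newer and every path of sn2 is equal to, or lies below, one of sn1's paths.
func supersedes(sn1, sn2 Snapshot) bool {
	if !sn1.Time.After(sn2.Time) {
		return false
	}
	for _, p2 := range sn2.Paths {
		covered := false
		for _, p1 := range sn1.Paths {
			if p2 == p1 || strings.HasPrefix(p2, p1+"/") {
				covered = true
				break
			}
		}
		if !covered {
			return false
		}
	}
	return true
}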

@chastings667
Author

I had a thought this morning: how do excludes interact with the reliance on paths as a proxy for files? I think they can break this algorithm, but I can't think of a good way to fix it.

Consider:

latest snapshot: path=/ exclude=/
previous snapshot: path=/ exclude=(none)

So the previous snapshot will contain lots of file metadata, while the latest snapshot, which will be chosen by this algorithm, contains none (I'm not sure you can even have this path/exclude combination, but the example still stands).

I'm not even sure this needs to be fixed, as it seems like an unusual case to hit. Just wanted to note it.

@aawsome
Contributor

aawsome commented Nov 25, 2020

I had a thought this morning: how do excludes interact with the reliance on paths as a proxy for files? I think they can break this algorithm, but I can't think of a good way to fix it.

You are right - I was also thinking about whether there is a good way to include this. However, as this would add quite a bit of complexity (and is also not handled by the current, much simpler "parent" algorithm), I will just keep ignoring the excludes list in the new algorithm.
If there is an ongoing need for this, this can be tackled independently.

@aawsome
Contributor

aawsome commented Nov 25, 2020

@chastings667 #3121 is now in a state that I would judge as ready for review. It would be nice if you could test it and report whether it suits your needs or whether there are still issues...
Thanks in advance!

@chastings667
Author

I'd love to test this out, but I don't currently have any way to build the restic binary. There is some automated testing done in the pull request -- does that build the restic binary, and is it stored anywhere?

@aawsome
Contributor

aawsome commented Nov 26, 2020

I put a compiled linux binary here:
https://github.com/aawsome/restic/raw/multiple-parents-executable/restic-multiple-parents

But please make sure that you use binaries from untrusted sources (like this one) only in environments where they cannot do any harm!

@chastings667
Author

Yes, I understand about binaries (and source code, really!) from untrusted sources. Thanks for the warning though.

So I ran the following:

(init)
backup ./data/a
backup ./data/b
backup ./data/c
backup ./data/d
backup ./data

and the change looks like it's doing what it's supposed to. I measured runtime for the final backup cmd, since this should benefit from the new parent searching.

0.11.0:

processed 40 files, 3.906 GiB in 1:12
snapshot 26f515f2 saved

real 1m12.562s
user 2m17.748s

alpha build:

processed 40 files, 3.906 GiB in 0:00
snapshot d7f38525 saved

real 0m1.042s
user 0m0.572s

So yeah, that's quite a bit faster. I'd like to measure speedup on my 4TB backup set, but maybe I'll wait for a release candidate with this change in it.

And to verify that single-parent matching still works, I again backed up data/a. No full scan and it selected the correct parent.

@aawsome
Contributor

aawsome commented Nov 27, 2020

@chastings667 Thanks for testing! Two remarks:

  • backup -v lists which parents are actually used
  • in your example, another run of backup ./data/d and then backup ./data should also set the right parents, feel free to try it out!

@chastings667
Author

I think the parents are printed whether or not -v is specified. Overall they look OK.

Here's the sequence:

  1. backup ./data/a => snapshot 2a5109e6
  2. backup ./data/b => f01a37ff
  3. backup ./data/c => 2042fac0
  4. backup ./data/d => 140157ed
  5. backup ./data => ffa46959 (parents all of the above)
  6. backup ./data/d => 552c454e (parent ffa46959)
  7. backup ./data => ce25e84b (parents 552c454e ffa46959)

At first I thought #7 was a bug, since the data hasn't changed and using the snapshot from #5 as a parent should be sufficient. But I suspect you're using the snapshot creation time to determine what path is newer, not anything related to the file modtimes. If that's the case this looks correct.

@aawsome
Contributor

aawsome commented Nov 28, 2020

  7. backup ./data => ce25e84b (parents 552c454e ffa46959)

At first I thought #7 was a bug, since the data hasn't changed and using the snapshot from #5 as a parent should be sufficient. But I suspect you're using the snapshot creation time to determine what path is newer, not anything related to the file modtimes. If that's the case this looks correct.

For the parent detection this is correct, as only the timestamps are used to choose the parents. ./data/d could have changed, so it is a good idea to additionally take it as a parent.

As in your case the tree ./data/d is identical in snapshots 5 and 6, the backup run detects this and only loads the tree once, so from the perspective of tree loading and potential network traffic this is completely identical to using only snapshot 5.

If, however, ./data/d were different in snapshot 6, then both trees would be loaded, but I think the additional chance of finding matching entries to speed up the backup justifies this.

@onionjake

Sorry to go back to the original description of the problem being solved. It sounds like there are lots of reasons to have restic be able to find/detect parent snapshots, beyond just the originally proposed problem. I'm just curious.

Consider these three cmds, run in order:

backup /home/foo
backup /home/bar
backup /home

What would happen if you instead did:

backup --exclude /home/bar /home
backup --exclude /home/foo /home
backup /home

Would that provide the same benefit as doing the multiple-parent merge? Specifically, it seems like it would fix this problem:

The third backup will not benefit from metadata stored from backups 1 and 2, since the paths do not match.

The third backup should reuse the metadata from the first two commands by leveraging excludes in this way, right?

@chastings667
Author

As I understand it, this is a weakness in the current parent selection algorithm. Only paths and timestamps are considered, and parent paths are used as a proxy for "files contained in this snapshot" without consideration of any excludes.

I suppose Supersedes() could be modified to consider exclude paths? Are these even stored in snapshot objects?

Current:

// Supersedes returns whether sn1 supersedes sn2 w.r.t. the one of the given paths.
// "supersedes" means:
// - the snapshot is newer
// - for all relevant paths in sn2, sn1 has either identical path or a superpath

Additional term:

  • exclude paths are identical

This would be easy to implement but would grow the number of parents as a multiple of the number of exclude sets being used.

There's probably a more precise way to do it, where you consider exclude superpaths in a manner similar to paths above, but the logic is inverted and I would need to give it more thought.
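
As a rough sketch of the extra term, building on the hypothetical Snapshot/supersedes sketch earlier in the thread (the exclude lists are passed in explicitly here, since whether they are available from the snapshot objects is exactly the open question above):

// supersedesWithExcludes adds the proposed extra term: sn1 only supersedes
// sn2 if, in addition to the path condition, the two exclude sets are
// identical (order-insensitive).
func supersedesWithExcludes(sn1, sn2 Snapshot, excl1, excl2 []string) bool {
	return sameStringSet(excl1, excl2) && supersedes(sn1, sn2)
}

// sameStringSet compares two lists as sets, assuming no duplicate entries.
func sameStringSet(a, b []string) bool {
	if len(a) != len(b) {
		return false
	}
	seen := make(map[string]bool, len(a))
	for _, s := range a {
		seen[s] = true
	}
	for _, s := range b {
		if !seen[s] {
			return false
		}
	}
	return true
}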

@aawsome
Contributor

aawsome commented Jan 15, 2021

Consider these three cmds, run in order:

backup /home/foo
backup /home/bar
backup /home

What would happen if you instead did:

backup --exclude /home/bar /home
backup --exclude /home/foo /home
backup /home

Those would be different backups. backup --exclude /home/bar /home backs up everything in /home except /home/bar. It is only equivalent if /home contains nothing but /home/bar and /home/foo.

If you know in advance that you later want to use this as a parent for another path, you could simply use #3200 and backup --set-path /home /home/foo. The problem with this (and your proposal) is that you might later realize that you want to back up the whole of /home.

The third backup should reuse the metadata from the first two commands by leveraging excludes in this way, right?

No, it would only use the second backup as parent and would need to re-chunk and re-hash everything under /home/foo (as with the #3200 solution above).

@gmgigi96 linked a pull request on May 19, 2021 that will close this issue
@gmgigi96

@chastings667 In #3405 I added the possibility to merge the files that you pass with the --files-from-verbatim option into a preexisting snapshot. In this list you can specify new, updated and deleted files, and the new snapshot will be the merge of these changes with the parent (which you can specify with the --parent option). I think it's ready for review and I would be really happy if you could test it and report feedback or any issues you spot. Thanks in advance!
