
Add copy functionality. #2606

Merged
merged 12 commits into from Aug 30, 2020

Conversation

middelink
Member

What is the purpose of this change? What does it change?

Add a copy command to copy snapshots between repositories. It allows the user
to specify a destination repository, password, password-file, password-command
or key-hint to supply the necessary details to open the destination repository.

You need to supply a list of snapshots to copy; snapshots which already exist
in the destination repository will be skipped.

Note that when using the network this becomes rather slow, as it needs to read the
blobs, decrypt them using the source key, then re-encrypt them using the
destination key before finally writing them out to the destination repository.
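The read, decrypt, re-encrypt, write loop described above can be sketched with simplified stand-in types (plain maps instead of restic's encrypted repositories; all names here are hypothetical, not restic's real API):

```go
package main

import "fmt"

// Stand-in blob store. In restic, LoadBlob decrypts with the source key and
// SaveBlob re-encrypts with the destination key; here a map models the store.
type repo map[string][]byte

func (r repo) LoadBlob(id string) ([]byte, error) {
	buf, ok := r[id]
	if !ok {
		return nil, fmt.Errorf("blob %s not found", id)
	}
	return buf, nil
}

// SaveBlob stores a blob and reports whether it was already known.
func (r repo) SaveBlob(id string, buf []byte) (known bool) {
	if _, ok := r[id]; ok {
		return true // already present: skip the duplicate upload
	}
	r[id] = append([]byte(nil), buf...)
	return false
}

// copyBlobs copies each listed blob, skipping those the destination has.
func copyBlobs(src, dst repo, ids []string) (copied int, err error) {
	for _, id := range ids {
		buf, err := src.LoadBlob(id) // conceptually: decrypt with source key
		if err != nil {
			return copied, err
		}
		if known := dst.SaveBlob(id, buf); !known { // re-encrypt with dest key
			copied++
		}
	}
	return copied, nil
}

func main() {
	src := repo{"a": []byte("1"), "b": []byte("2")}
	dst := repo{"a": []byte("1")}
	n, _ := copyBlobs(src, dst, []string{"a", "b"})
	fmt.Println(n) // only "b" is new; "a" is skipped
}
```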

Was the change discussed in an issue or in the forum before?

closes #323

Checklist

  • I have read the Contribution Guidelines
  • I have added tests for all changes in this PR
  • I have added documentation for the changes (in the manual)
  • There's a new file in changelog/unreleased/ that describes the changes for our users (template here)
  • I have run gofmt on the code in all commits
  • All commit messages are formatted in the same style as the other commits in the repo
  • I'm done, this Pull Request is ready for review

cmd/restic/cmd_copy.go — 5 review comments (resolved)
@FallenWarrior2k

I'm sorry, but may I ask what this is blocked on right now?

@middelink
Member Author

middelink commented Apr 17, 2020

@FallenWarrior2k no idea, very good question.

@fd0 care to intervene? This PR has been sitting idle for almost 2 months now.

@rawtaz
Contributor

rawtaz commented Apr 17, 2020

@FallenWarrior2k @middelink I can only apologize. I am the one who asked @middelink to make a PR out of her work, which she so swiftly did. My intention was to have something we could work from in terms of adjustments and reviews. We were starting to get more of a flow in doing restic maintenance at that point, but then Corona etc. came into the picture, and since then it has been almost impossible to get anything done, as the maintainers' life situations, with isolation and such, require all available time. This PR is still on our radar though, FWIW.

@FallenWarrior2k

FallenWarrior2k commented Apr 17, 2020 via email

Member

@MichaelEischer MichaelEischer left a comment


I've found a few more typos (see review comments) and a few more fundamental points which could also be deferred to a follow-up PR: The Repository object currently does not upload intermediate indexes on its own. That could cause the copy command to create a single gigantic index file when completing the snapshot. It would probably be a good idea to use something like the IndexUploader from the backup command.

The next point is more of a usability question: the initial idea in #323 was to support a one-time copy from one repository to another. But it would probably also be nice to ensure that copying a snapshot multiple times only creates one snapshot in the destination repository.

The current implementation won't scale to large numbers of snapshots. These usually share large parts of their filesystem trees, so iterating over the full tree for every snapshot is rather redundant. A fairly simple solution would be to use FindUsedBlobs to collect all blobs of all snapshots and then copy everything in one go. This has the slight downside that copying a single snapshot will load trees twice, but it should be much faster when copying multiple snapshots, would allow adding a progress bar, and would avoid yet another tree-walking implementation.
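The proposed approach — collect all used blobs across snapshots up front, visiting shared subtrees only once — can be sketched with a toy tree model (all types here are simplified stand-ins, not restic's actual FindUsedBlobs implementation):

```go
package main

import "fmt"

// Hypothetical, simplified tree: each tree references data blobs and subtrees.
type tree struct {
	data     []string
	subtrees []string
}

// collectUsedBlobs walks every snapshot root but visits each tree only once,
// so subtrees shared between snapshots are not iterated redundantly.
func collectUsedBlobs(trees map[string]tree, roots []string) map[string]struct{} {
	used := make(map[string]struct{})
	seen := make(map[string]bool)
	var walk func(id string)
	walk = func(id string) {
		if seen[id] {
			return // shared subtree: already collected
		}
		seen[id] = true
		used[id] = struct{}{} // the tree blob itself is needed too
		t := trees[id]
		for _, d := range t.data {
			used[d] = struct{}{}
		}
		for _, s := range t.subtrees {
			walk(s)
		}
	}
	for _, r := range roots {
		walk(r)
	}
	return used
}

func main() {
	// Two snapshots sharing the subtree "shared".
	trees := map[string]tree{
		"root1":  {subtrees: []string{"shared"}, data: []string{"d1"}},
		"root2":  {subtrees: []string{"shared"}, data: []string{"d2"}},
		"shared": {data: []string{"d3", "d4"}},
	}
	used := collectUsedBlobs(trees, []string{"root1", "root2"})
	fmt.Println(len(used)) // 3 tree blobs + 4 data blobs = 7
}
```

Copying then becomes a single pass over the collected set, which also gives a natural total for a progress bar.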

cmd/restic/cmd_copy.go — 6 review comments (resolved)
@MichaelEischer
Member

MichaelEischer commented May 23, 2020

From what I can tell right now the steps to finish this PR are:

The command needs documentation and a changelog:

  • Changelog: Probably a short explanation and usage example.
  • Documentation: we also need to document that backend parameters specified via environment variables or config options apply to both the source and the destination. That is, it might not be possible to specify e.g. different accounts for the source and destination.

[edit]
As mentioned by @FallenWarrior2k, it's possible to use rclone in case of colliding backend environment variables. For S3 that would mean either storing the access keys in the rclone config or using a named profile which is then referenced by rclone.
[/edit]

Tests:

  • Check that a copied snapshot has the same tree content as the old one (i.e. an identical tree hash)
  • Check that the tree in the new snapshot is intact
  • Check that the copy functionality doesn't add duplicate blobs to the repository

Code fixes:

  • Now that Simplify Repository.LoadBlob usage #2640 has been merged, the code just needs a small tweak to reuse the buffer
  • The duplicate blob upload should be fixed as this could cause lots of duplicates in pathological cases
  • An intermediate index uploader could also be added after merging this.

@FallenWarrior2k

That is, it might not be possible to specify e.g. different accounts for the source and destination.

For most remotes, I think it should be possible to accomplish this by going through rclone. It might not be possible to use e.g. env_auth = true (for S3) or similar in that case, meaning you'd have to temporarily persist your auth data in the rclone config. If so, you could password-protect the rclone config file before adding the credentials, to make sure they're not visible on your filesystem.
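As a hypothetical sketch of that workaround (the remote name, bucket names, credentials and snapshot ID are all made up; the flag names follow this PR, where --repo2 designates the destination repository):

```shell
# Store the destination credentials in the rclone config instead of the
# environment, so they don't collide with the AWS_* variables that restic
# applies to both repositories.
cat >> ~/.config/rclone/rclone.conf <<'EOF'
[dest-s3]
type = s3
provider = AWS
access_key_id = DESTINATION_KEY_ID
secret_access_key = DESTINATION_SECRET
region = eu-west-1
EOF

# Source credentials come from the environment; the destination goes via rclone.
export AWS_ACCESS_KEY_ID=SOURCE_KEY_ID
export AWS_SECRET_ACCESS_KEY=SOURCE_SECRET
restic -r s3:s3.amazonaws.com/source-bucket \
    copy --repo2 rclone:dest-s3:dest-bucket 79766175
```

Optionally, `rclone config --password-command` or an encrypted rclone config keeps the stored keys off disk in plain text.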

@dionorgua
Contributor

Small note: it's no longer possible to build this branch. I think that's due to #2640 being merged.

I've fixed it locally with the following patch and tested it with a one-snapshot copy:

diff --git a/cmd/restic/cmd_copy.go b/cmd/restic/cmd_copy.go
index 4abfcdd7..337129a7 100644
--- a/cmd/restic/cmd_copy.go
+++ b/cmd/restic/cmd_copy.go
@@ -175,18 +175,10 @@ func copySnapshot(ctx context.Context, srcRepo, dstRepo restic.Repository, treeI
 				continue
 			}
 			debug.Log("Copying blob %s\n", blobID.Str())
-			size, found := srcRepo.LookupBlobSize(blobID, restic.DataBlob)
-			if !found {
-				return fmt.Errorf("LookupBlobSize(%v) failed", blobID)
-			}
-			buf := restic.NewBlobBuffer(int(size))
-			n, err := srcRepo.LoadBlob(ctx, restic.DataBlob, blobID, buf)
+			buf, err := srcRepo.LoadBlob(ctx, restic.DataBlob, blobID, nil)
 			if err != nil {
 				return fmt.Errorf("LoadBlob(%v) returned error %v", blobID, err)
 			}
-			if n != len(buf) {
-				return fmt.Errorf("wrong number of bytes read, want %d, got %d", len(buf), n)
-			}
 
 			newBlobID, err := dstRepo.SaveBlob(ctx, restic.DataBlob, buf, blobID)
 			if err != nil {

@seqizz

seqizz commented May 28, 2020

This is a must-have feature for me, thanks @dionorgua 👍 Also tested & working.

@greatroar
Contributor

middelink/restic#1

Contributor

@aawsome aawsome left a comment


I found the two major issues that @MichaelEischer also found.
I think the Index.Has problem should be fixed in another PR. I'll work on this.
The saving of intermediate indexes should IMO be added in this PR.

Moreover, I have some enhancement suggestions and IMO found a way to speed copy up.

EDIT: I started working on the proposed PR and realized that I might easily fix both issues.

cmd/restic/cmd_copy.go — 5 review comments (resolved)
@aawsome aawsome mentioned this pull request Jun 7, 2020
@aawsome
Contributor

aawsome commented Jun 7, 2020

EDIT: I started working on the proposed PR and realized that I might easily fix both issues.

The new PR is ready for review: #2773

cmd/restic/cmd_copy.go — 5 review comments (resolved)
@MichaelEischer
Member

Just noticed that I should have taken a look at #2773 first, which solves my index related concerns.

@aawsome
Contributor

aawsome commented Jun 8, 2020

I just realized that there might be another optimization:
copy uses LoadBlob for every needed blob, which can mean that a pack is downloaded more than once. Sorting the blobs by the packs that contain them would improve performance. I think something similar is implemented in the restorer.

However, I would propose to not solve this in this PR and leave it open for further optimization.
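For reference, the pack-grouping idea can be sketched as follows (a simplified stand-in index, not restic's real data structures):

```go
package main

import "fmt"

// Hypothetical index entry: which pack file contains a given blob.
type packIndex map[string]string // blobID -> packID

// groupByPack buckets the blobs to copy by their containing pack file, so
// each pack is downloaded once instead of once per blob (the idea used by
// the restorer).
func groupByPack(idx packIndex, blobs []string) map[string][]string {
	byPack := make(map[string][]string)
	for _, b := range blobs {
		byPack[idx[b]] = append(byPack[idx[b]], b)
	}
	return byPack
}

func main() {
	idx := packIndex{"b1": "p1", "b2": "p1", "b3": "p2"}
	byPack := groupByPack(idx, []string{"b1", "b2", "b3"})
	fmt.Println(len(byPack), len(byPack["p1"])) // 2 packs; p1 holds two blobs
}
```

Each pack would then be fetched once and all of its needed blobs extracted before moving on.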

Contributor

@aawsome aawsome left a comment


Now that #2773 has been merged, there are only a few open issues with this PR.

cmd/restic/cmd_copy.go — 3 review comments (resolved)
@seqizz

seqizz commented Jun 19, 2020

I think @middelink has not much time to check this PR :)

I wanted to rebase this over @aawsome's new prune branch and got an error on SaveBlob, as mentioned above, so I made this change to get the build to succeed. I am still worse than /dev/urandom on Go, so please correct me if I've misunderstood the new parameter:

diff --git a/cmd/restic/cmd_copy.go b/cmd/restic/cmd_copy.go
index 46e61ccd..5faca0d0 100644
--- a/cmd/restic/cmd_copy.go
+++ b/cmd/restic/cmd_copy.go
@@ -182,7 +182,10 @@ func copySnapshot(ctx context.Context, srcRepo, dstRepo restic.Repository, treeI
                                return fmt.Errorf("LoadBlob(%v) returned error %v", blobID, err)
                        }

-                       newBlobID, err := dstRepo.SaveBlob(ctx, restic.DataBlob, buf, blobID)
+                       newBlobID, known, err := dstRepo.SaveBlob(ctx, restic.DataBlob, buf, blobID, false)
+                       if known {
+                               debug.Log("Skipping since it's already in repo: %s\n", blobID.Str())
+                       }
                        if err != nil {
                                return fmt.Errorf("SaveBlob(%v) returned error %v", blobID, err)
                        }

@greatroar
Contributor

middelink/restic#2

@middelink
Member Author

Hi, sorry for the delay indeed. I seem to no longer get GitHub mail notifications on this matter (I guess I pressed a wrong button somewhere), just the emails themselves, which go off into some folder I hardly ever look at. Anyway, apologies.

I've merged the PR, and (at least on my computer) it compiles again. Thanks!

@MichaelEischer
Member

I've rebased the PR onto the current master branch and adapted it to the repository interface changes. There's also a small optimization to copy each tree only once, so that for later snapshots only the changed parts have to be checked.

I'll take care of adding some tests and a bit of documentation.

@MichaelEischer
Member

I've added some documentation and two basic integration tests for the copy command.

The more interesting change is probably "copy: Mark and skip previously copied snapshots", which uses the Original field of a snapshot to detect whether a snapshot has already been copied to the destination repository (see commit message for further details).
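The duplicate-detection rule described there can be sketched like this (a toy snapshot struct; the real restic.Snapshot has more fields, and the actual comparison lives in the copy command):

```go
package main

import "fmt"

// Simplified snapshot metadata, standing in for restic.Snapshot.
type snapshot struct {
	Time     string
	Tree     string
	Paths    string
	Tags     string
	Original string // persistent ID used to find candidate duplicates
	Parent   string
}

// sameCopy reports whether dst is a copy of src: all fields must match
// except Original and Parent, which legitimately differ across repositories.
// A snapshot modified e.g. via the tags command therefore compares unequal
// and is copied again.
func sameCopy(src, dst snapshot) bool {
	// Parameters are value copies, so blanking fields doesn't mutate callers.
	src.Original, dst.Original = "", ""
	src.Parent, dst.Parent = "", ""
	return src == dst
}

func main() {
	src := snapshot{Time: "t1", Tree: "abc", Tags: "prod"}
	copied := src
	copied.Original = "src-id" // set when copying to the destination
	fmt.Println(sameCopy(src, copied)) // only Original differs: skip re-copy

	retagged := copied
	retagged.Tags = "prod,new"
	fmt.Println(sameCopy(src, retagged)) // tags changed: copy again
}
```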

cmd/restic/cmd_copy.go — 5 review comments (resolved)
Pauline Middelink and others added 4 commits August 26, 2020 22:17
Add a copy command to copy snapshots between repositories. It allows the user
to specify a destination repository, password, password-file, password-command
or key-hint to supply the necessary details to open the destination repository.

You need to supply a list of snapshots to copy, snapshots which already exist
in the destination repository will be skipped.

Note, when using the network this becomes rather slow, as it needs to read the
blocks, decrypt them using the source key, then encrypt them again using the
destination key before finally writing them out to the destination repository.
This speeds up copying multiple overlapping snapshots from one
repository to another, as we only have to copy the changed parts of
later snapshots.
@MichaelEischer
Member

a common use case would be to use identical chunking parameters and keys for both repositories. This fixes the chunking issue. An optimization could check if the keys are identical and then save the re-encryption of blobs. The same optimization could be added to RewritePacks in prune.

The re-encryption part shouldn't be too expensive on somewhat modern processors. After all, a single CPU core with AES-NI is able to encrypt/decrypt over 1 GB/s of data. That is, I don't think it's worth bypassing the current Load/StoreBlob abstractions. Re-encrypting blobs also has the benefit that an attacker can't learn blob boundaries within a pack file after rewriting. (Rewriting would basically shuffle lots of encrypted strings; by checking for common substrings before and after rewriting you'd learn the boundaries between most blobs.) However, I can't tell in what regard that information would be useful.

For matching chunking parameters we'd probably need some way to initialize a repository with the same chunking parameters as another one. Modifying the config file of an existing repository doesn't sound right to me. (Take these remarks with a grain of salt; I haven't thought much about whether that's a good idea or not.)

parallelize processing of trees, e.g. by using the code of internal/archiver

My plan was to go for an all-out parallelization in a few follow-up PRs. The first part would be to extract the parallel tree loading used by the check command into a StreamTrees function, then to parallelize loading data blobs (that would be the only copy command specific functionality) and finally to add the capability for parallel, asynchronous blob/pack uploads to somewhere within the repository (this functionality is also necessary to properly solve #2696).
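The shape of the planned data-blob parallelization might look roughly like this (maps standing in for repositories; the worker pattern and all names are made up, and this is not the StreamTrees design itself):

```go
package main

import (
	"fmt"
	"sync"
)

// parallelCopy fans blob IDs out to a fixed pool of workers. Each worker
// loads a blob from the source and stores it in the destination; a mutex
// guards the destination map, which stands in for a repository uploader.
func parallelCopy(src, dst map[string][]byte, ids []string, workers int) {
	var mu sync.Mutex
	var wg sync.WaitGroup
	ch := make(chan string)
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for id := range ch {
				buf := src[id] // load (and decrypt) from the source
				mu.Lock()
				if _, ok := dst[id]; !ok {
					dst[id] = buf // save (and re-encrypt) to the destination
				}
				mu.Unlock()
			}
		}()
	}
	for _, id := range ids {
		ch <- id
	}
	close(ch)
	wg.Wait()
}

func main() {
	src := map[string][]byte{"a": {1}, "b": {2}, "c": {3}}
	dst := map[string][]byte{}
	parallelCopy(src, dst, []string{"a", "b", "c"}, 2)
	fmt.Println(len(dst)) // all three blobs arrive regardless of ordering
}
```

In restic itself the destination side would be the repository's asynchronous pack uploader rather than a mutex-protected map.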

Contributor

@aawsome aawsome left a comment


LGTM!

I think the most important optimization would be to avoid loading multiple blobs from the same pack separately (as done in the restorer) and to parallelize the saving of blobs (as done in the archiver), but this should be done in a future PR.

cmd/restic/cmd_copy.go — 4 review comments (resolved)
@aawsome
Contributor

aawsome commented Aug 28, 2020

The re-encryption part shouldn't be too expensive on somewhat modern processors. After all, a single CPU core with AES-NI is able to encrypt/decrypt over 1 GB/s of data. That is, I don't think it's worth bypassing the current Load/StoreBlob abstractions. Re-encrypting blobs also has the benefit that an attacker can't learn blob boundaries within a pack file after rewriting. (Rewriting would basically shuffle lots of encrypted strings; by checking for common substrings before and after rewriting you'd learn the boundaries between most blobs.)

You are right, this is a good point! In fact, given this argument I would also opt for always re-encrypting blobs (even when using the same key, but always with a fresh random nonce).

Use the `Original` field of the copied snapshot to store a persistent
snapshot ID. This can either be the ID of the source snapshot if
`Original` was not yet set or the previous value stored in the
`Original` field. In order to still copy snapshots modified using the
tags command the source snapshot is compared to all snapshots in the
destination repository which have the same persistent ID. Snapshots are
only considered equal if all fields except `Original` and `Parent`
match. That way modified snapshots are still copied while avoiding
duplicate copies at the same time.
Member

@MichaelEischer MichaelEischer left a comment


LGTM. Thanks to everyone helping with implementing this PR and for the extensive discussions!


Successfully merging this pull request may close these issues.

Add command to copy all data to another repository
8 participants