[RFC] Design Proposal: Cloud Service backups using Restic #3434

Open
vrusinov opened this issue Jun 17, 2021 · 5 comments

@vrusinov
Contributor

vrusinov commented Jun 17, 2021

Proposed: 2021-06-17

Last update: 2021-06-17

Status: proposed

View this document in: Google docs | my website | Github

Problem statement

Restic is a modern backup program that can back up your files. Individuals and SMBs can solve their file backup problem using restic.

However, a lot of data still lives in cloud services, and it is often more important than the data on one’s HDD or SSD: e-mail, social network profile data, online documents and spreadsheets. And while the majority of online/cloud providers do a good job at keeping the data safe and taking care of durability, mistakes still happen.

It is possible to be locked out of a cloud account (especially if it's a free one) or remove data accidentally. Sometimes services get it wrong and lose your data, or even just shut down.

The cloud data needs to be backed up too.

Some services make it possible (e.g. Google lets you get a copy of your data via Google Takeout), but almost none make it easy and convenient. Wouldn’t it be great if cloud data could be backed up just as easily as the local files using restic?

Design proposal

Summary

Restic already has an ‘FS’ interface which abstracts away filesystem access. There are implementations for local Windows and Unix filesystems. We can add ‘cloud’ filesystem implementations which will represent various objects as files. Depending on the backup source, the corresponding ‘FS’ implementation will be chosen, and the rest of the restic code will be unaware of whether it is working with a local filesystem or a virtual one representing some cloud service.
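To make the idea concrete, here is a minimal sketch of the abstraction. The `FS` interface below is a deliberately simplified, hypothetical stand-in and not restic's actual interface; `memFS` and `backupDir` are likewise names made up for this illustration. The point is only that the archiving code sees nothing but the interface.

```go
package main

import (
	"fmt"
	"io"
	"sort"
	"strings"
)

// FS is a simplified, hypothetical stand-in for restic's filesystem
// abstraction; the real interface has a different, larger method set.
type FS interface {
	// List returns the names of the files directly under dir.
	List(dir string) ([]string, error)
	// Open returns the content of the file at path.
	Open(path string) (io.ReadCloser, error)
}

// memFS is a toy "cloud" implementation that exposes in-memory objects
// (for example e-mail messages) as files.
type memFS struct {
	objects map[string]string // path -> content
}

func (m memFS) List(dir string) ([]string, error) {
	prefix := strings.TrimSuffix(dir, "/") + "/"
	var names []string
	for p := range m.objects {
		if !strings.HasPrefix(p, prefix) {
			continue
		}
		rest := strings.TrimPrefix(p, prefix)
		if !strings.Contains(rest, "/") {
			names = append(names, rest)
		}
	}
	sort.Strings(names)
	return names, nil
}

func (m memFS) Open(path string) (io.ReadCloser, error) {
	content, ok := m.objects[path]
	if !ok {
		return nil, fmt.Errorf("open %s: no such object", path)
	}
	return io.NopCloser(strings.NewReader(content)), nil
}

// backupDir plays the role of the archiver: it reads every file under dir
// through the FS interface and records its size, without knowing whether
// the files are local or virtual.
func backupDir(fsys FS, dir string) (map[string]int, error) {
	sizes := make(map[string]int)
	names, err := fsys.List(dir)
	if err != nil {
		return nil, err
	}
	for _, name := range names {
		f, err := fsys.Open(strings.TrimSuffix(dir, "/") + "/" + name)
		if err != nil {
			return nil, err
		}
		data, err := io.ReadAll(f)
		f.Close()
		if err != nil {
			return nil, err
		}
		sizes[name] = len(data)
	}
	return sizes, nil
}

func main() {
	fsys := memFS{objects: map[string]string{
		"/inbox/1.eml": "hello",
		"/inbox/2.eml": "hi",
		"/sent/3.eml":  "bye",
	}}
	sizes, err := backupDir(fsys, "/inbox")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(sizes)
}
```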

This idea is partially implemented by YoshieraHuang in pull request #2995 (for sftp) and by KrustyHack in pull request #2223 (for Google Cloud Storage).

UX

Pull requests #2995 and #2223 referenced above introduce a large number of additional flags to handle authentication. If we were to implement a dozen different backup sources, we’d have to add even more different flags (or environment variables), and it may quickly become messy.

It is also not clear how to choose the correct ‘FS’ implementation.

I propose to solve this by making the backup source argument URL-like and handling authentication via configuration files. Examples:

  • restic -r <repo> backup /home/user/ - will do a backup of local /home/user/ files.

  • restic -r <repo> backup file:/home/user/ - same as above

  • restic -r <repo> backup sftp:user@host/home/user/ - will log in as user@host via sftp and do a backup of /home/user/.

  • restic -r <repo> backup gmail:/home/user/.config/restic/gmail-auth.conf - will do a backup using ‘gmail’ ‘FS’ implementation and will use authentication from /home/user/.config/restic/gmail-auth.conf file.

And so on, with the general structure being <fs_implementation>: followed by an implementation-specific path.

It will be the responsibility of each FS implementation to interpret that path. For the file FS implementation it will be a local directory; gmail may open and parse settings from a local file, etc.

Where possible different ‘FS’ implementations will share similar config format and behaviour.
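As a rough sketch of how the URL-like source argument could be split, assuming a fixed set of implementation names: `parseSource` and `knownFS` are hypothetical names for this illustration, not existing restic code. Checking the prefix against known names keeps plain paths (including Windows paths such as `C:\data`) working as local sources.

```go
package main

import (
	"fmt"
	"strings"
)

// knownFS lists the implementation names assumed for this sketch.
var knownFS = map[string]bool{"file": true, "sftp": true, "gmail": true}

// parseSource splits a backup source into an FS implementation name and
// the implementation-specific remainder. Arguments without a known prefix
// are treated as local paths.
func parseSource(arg string) (impl, rest string) {
	if i := strings.Index(arg, ":"); i > 0 && knownFS[arg[:i]] {
		return arg[:i], arg[i+1:]
	}
	return "file", arg
}

func main() {
	args := []string{
		"/home/user/",
		"file:/home/user/",
		"sftp:user@host/home/user/",
		"gmail:/home/user/.config/restic/gmail-auth.conf",
	}
	for _, arg := range args {
		impl, rest := parseSource(arg)
		fmt.Printf("%-6s %s\n", impl, rest)
	}
}
```

One trade-off of this approach: a local relative path that happens to start with a known name and a colon would need to be written with an explicit `file:` prefix.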

Restore

Restoring cloud backups may not be straightforward. It is easy to restore filesystem-like ‘sftp’ or ‘gcs’ data by copying/uploading files to the corresponding service. However, ‘facebook’ or ‘strava’ may not provide the ability to restore data in an automated way, if at all.

restic will provide tools to convert a mounted (e.g. via fuse) backup into something usable. Having social network post history in human- and machine-readable formats may still be worthwhile even if it’s not possible to re-import it.

Hostname and path handling

By default restic uses local hostname and path to identify snapshots.

This may not work well for cloud services, especially for the hostname. Using the local hostname and path can easily lead to a mess, e.g. if backups of the same cloud service account are taken from different hosts.

Different ‘FS’ implementations may override the hostname (unless one is explicitly provided via the --host flag). The recommendation will be to use the account name, then @, then the service name as the default hostname, and to avoid using the local hostname for non-local ‘FS’ implementations. Examples could be:

vladimir.rusinov@gmail.com@google_mail

vladimir.rusinov@gmail.com@google_calendar

zuck@facebook

bill@msn_mail

etc.
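The precedence described above could look roughly like this. `snapshotHost` and its parameters are hypothetical names for illustration; an empty `service` stands for a local ‘FS’ implementation.

```go
package main

import "fmt"

// snapshotHost picks the hostname recorded in a snapshot:
// an explicit --host value wins, local sources keep the local hostname,
// and cloud sources default to account@service.
func snapshotHost(hostFlag, account, service, localHost string) string {
	switch {
	case hostFlag != "":
		return hostFlag
	case service == "":
		return localHost
	default:
		return account + "@" + service
	}
}

func main() {
	fmt.Println(snapshotHost("", "zuck", "facebook", "laptop"))
	fmt.Println(snapshotHost("", "", "", "laptop"))
	fmt.Println(snapshotHost("custom", "zuck", "facebook", "laptop"))
}
```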

rdiff-backup "frontend"

Similarly to the rdiff-backup backend, an rdiff-backup "frontend" may be integrated to provide support for a bunch of storage/cloud services. One integration may unlock backups of a large number of filesystem-like services, but will not allow backups of less file-like services. E.g. it may help back up Dropbox but may not help with Google Calendar backups. Also, it will likely be more awkward to use than "native" service support.

More research and more specific design may be needed.

Advantages of this design

  • One restic repository can be used for all backups - local and cloud

  • All benefits of restic snapshot management

  • Some deduplication possible, e.g. for when some subset of data is synced to local filesystems

Downsides

  • Using one restic repository for all backups, local and cloud, can be dangerous if the backup repository is compromised

  • Increased restic binary size. Since it’s in Go and statically-linked, adding more ‘FS’ implementations may pull more dependencies and increase ‘restic’ binary size for everyone.

  • UX is not perfect - we mix paths and config files.

Next steps / Milestones

  1. Write design proposal - done

  2. Send proposal for review

  3. In parallel to (2), start implementing cloud backup for one provider as a proof of concept. Having actual code will help refine design and may help discussion.

  4. Iterate on design comments, adjust the code from (3) accordingly.

  5. Finalize design and code of the first cloud backup source, send PR, merge it into the upcoming version of restic.

  6. Implement support for popular cloud service providers: SFTP and GCS as there are already pull requests which may need a small number of changes, Gmail, Facebook, Github, Hotmail, Dropbox, Google Drive, etc.

Alternatives

Do nothing

Too late, I already wrote this design.

Also, I still need my backups.

Keep restic for local backups only

One can simply have a service-specific backup/dump program and save backups as local files, to be picked up by restic backups. This is the approach currently used by the author of this document, and it has several downsides:

  • Requires managing different tools and different backup schedules

  • Makes it difficult to see which services were backed up when

  • Requires enough local storage to store a copy of all cloud data

Use stdin source + 3rd party binaries

A backup source can be implemented as a separate binary that simply dumps the backup to stdout (e.g. in tar format if the source is file-based). Restic will then consume the backup from stdin.

Such an approach is possible today, and no code changes are required. Restic may provide better documentation with specific examples of how to do this at least for popular services.

Advantages:

  • No code changes are required

  • No restic binary size bloat and no additional code to maintain

Disadvantages:

  • Worse UX

  • Worse deduplication (tar may add its own headers or realign blocks in a way that makes deduplication impossible).

  • Impossible to recover from partial failures - the whole backup/export will have to be started from scratch

  • No advantage from restic cache.

@underdpt

underdpt commented Jan 8, 2022

Hello,

I would like to add another alternative, though I'm not sure whether it's doable or how complex it would be to add: use rclone as a source.

Today we can use rclone as a backend, which adds tons of backends to restic. How about using it as a source? You can then provide cloud-to-local backups and even cloud-to-cloud deduplicated and encrypted backups and both projects would benefit from it.

@vrusinov
Contributor Author

Yes, that's certainly an option. I don't want to limit to just rclone. E.g. as a prototype I was trying to modify restic to back up github repositories along with metadata such as issues and issue comments.

I almost implemented backing up issues, exposing them as "virtual" JSON files (e.g. this issue and its comments would be backed up as /github/restic/restic/issues/3434.json). I've mixed up absolute and relative paths somewhere, so the prototype isn't functional yet, and I haven't had time to work on it further.

@dimejo
Contributor

dimejo commented Jan 26, 2022

Sounds like your proposal would solve #299.

@vrusinov
Contributor Author

Partially. I read #299 more as a request to transition to a client-server architecture, although I agree the goals of #299 may be achieved by having a "chain" of backups.

@rmanibus

rmanibus commented Jan 7, 2024

Very interested in this. I did some experiments to implement an FS for Google Drive here:
rmanibus@cd6eecd

The main issue I am seeing for now is that files are uniquely identified by their id and not by their name. The API does not make it possible to get a file by its path in a single request.

I partially solved it by:

  • making name return the Id in file info
  • making Join just return the last segment on the path
  • making LStat accept an id instead of a path

But this would need to be worked on a bit more, mainly because for now:

  • it is not retaining the file name
  • it is backing up the entire drive using the 'root' alias

I am also thinking of another issue: if we retain the id in the backup, it might work fine until we try to restore it to a blank drive. At that point we will recreate each file under a new id and won't be able to trivially match the file in the next backup.

It is also worth mentioning that in Google Drive two files in the same directory can have the same name.
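To illustrate both points (one lookup per path segment, and duplicate names), here is a small sketch against an in-memory model. `node` and `resolve` are hypothetical names, and no Drive API calls are made; each inner loop stands in for one "list children of this parent" request.

```go
package main

import (
	"fmt"
	"strings"
)

// node is a simplified, hypothetical model of a Drive entry.
type node struct {
	ID, Name, ParentID string
}

// resolve walks path one segment at a time, starting at rootID, and
// returns the id of the final entry. It fails if a name is missing or if
// two entries under the same parent share it.
func resolve(nodes []node, rootID, path string) (string, error) {
	cur := rootID
	for _, seg := range strings.Split(strings.Trim(path, "/"), "/") {
		if seg == "" {
			continue
		}
		var matches []string
		for _, n := range nodes {
			if n.ParentID == cur && n.Name == seg {
				matches = append(matches, n.ID)
			}
		}
		switch len(matches) {
		case 0:
			return "", fmt.Errorf("%q not found under %s", seg, cur)
		case 1:
			cur = matches[0]
		default:
			return "", fmt.Errorf("%q is ambiguous under %s: %v", seg, cur, matches)
		}
	}
	return cur, nil
}

func main() {
	nodes := []node{
		{ID: "a", Name: "docs", ParentID: "root"},
		{ID: "b", Name: "notes.txt", ParentID: "a"},
		{ID: "c", Name: "notes.txt", ParentID: "a"}, // same name, same directory
		{ID: "d", Name: "todo.txt", ParentID: "a"},
	}
	for _, p := range []string{"/docs/todo.txt", "/docs/notes.txt"} {
		id, err := resolve(nodes, "root", p)
		fmt.Println(p, id, err)
	}
}
```

A backup that keeps names would need some rule for the ambiguous case, for example appending the id to the name; that choice is left open here.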
