
Storage Backend: Amazon Cloud Drive #212

Closed
e2b opened this issue Jul 4, 2015 · 56 comments
Labels
category: backend type: discussion undecided topics needing supplementary input

Comments

@e2b

e2b commented Jul 4, 2015

The "Unlimited Everything" plan of Amazon Cloud Drive is quite an affordable backup storage option. Amazon Cloud Drive also has its own RESTful API.

@fd0 fd0 added type: feature enhancement improving existing features category: backend feature and removed type: feature enhancement improving existing features labels Jul 4, 2015
@klauspost
Contributor

Just thought the same thing.

I may look at this once #21 is in place - uploading without compression seems like a waste of bandwidth.

@fd0
Member

fd0 commented Aug 14, 2015

That depends on your use case ;)

@stapelberg
Contributor

There’s some proof of concept code I found: http://sprunge.us/fdQF — it requires that an oauth token is in /tmp/token.json, but seems to work for me.

Motivated people could turn that into a clean backend for ACD :).

@jsimonetti

+1 from me :)
I compiled the backend @stapelberg found, and it does indeed appear to work; it uses code from rclone.

@kisscool

+1 too
I'm trying to use Restic + acd_cli (a FUSE Python client for ACD), but it's very unreliable for now: some operations do not behave as expected (file truncate, rename) and Restic randomly panics as a result.
A working ACD storage backend would do wonders.

@fd0
Member

fd0 commented Jan 26, 2016

I'm currently reworking the interface to the backends, this includes a radical simplification. This is basically done, but not yet merged. For the plan, see #383, the PR is #395.

Afterwards it will be much easier to implement new backends.

Before implementing many new backends, I'd like to have a list of rules that services we write backends for must fulfill, this may include that a test instance of the service must be available that we can run the integration tests against.

Do you by chance know whether there is a test service for ACD we can use for tests?

@kisscool

As far as I know, there is no test instance for ACD. No mention of such a thing here : https://developer.amazon.com/public/apis/experience/cloud-drive/content/restful-api

But the https://github.com/ncw/rclone project already did an ACD backend in go. It seems to be fairly reusable as demonstrated by the proof of concept shared by @stapelberg .

@klauspost
Contributor

Actually, looking at the revised interface, it would be reasonably easy to do a full wrapper for rclone filesystems. Maybe that way separate implementations aren't needed?

@kisscool

I don't know what @fd0's vision for Restic's future is, but it would seem logical to focus the project on the backup intelligence instead of re-implementing a ton of remote filesystems one by one. Besides, both projects' licenses are compatible.
It would also solve the worries about how to test all those backends.

@klauspost was your idea to create a wrapper around rclone/fs/fs.go ? Is it doable without being tightly coupled with the internal logic of rclone ?

@klauspost
Contributor

Each backend implements the fs.Fs interface. Each file is represented as an fs.Object.

It should be fairly easy to create a restic backend that uses an rclone filesystem+folder, provided it is already set up in the rclone configuration.

@fd0
Member

fd0 commented Jan 26, 2016

Hm, interesting idea, I have to think about it. Not having to implement all the backends ourselves looks like a good idea; on the other side (at least at the moment) I must admit that I don't like the thought of a tight coupling between restic and rclone, as this introduces a dependency that we can't control...

I envision for restic that it should be easy to configure and use with a variety of suitable backends. This includes (in my opinion) only one place for configuration, e.g. of the backends. Maybe that's possible with rclone or at least part of their code. The interface looks suitable to be used with restic.

@stv0g

stv0g commented Jan 31, 2016

I pledged a $5 bounty for this feature.

Some thoughts:

  • Amazon Cloud Drive uses AWS S3 / CloudFront as its backend. GET requests are always redirected with a Location header to CloudFront, so you could use the Range header to request only the portion of a pack file required by the new backend API. See: http://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/RangeGETs.html
  • You could start with integrating go-acd. That's the library which is used by rclone.
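The Range idea in the first bullet can be sketched in Go. The URL and the `rangeRequest` helper below are invented for illustration, not restic code:

```go
package main

import (
	"fmt"
	"net/http"
)

// rangeRequest builds a GET request for the byte range [offset, offset+length)
// of the given URL, as one would use to fetch part of a pack file via
// CloudFront. (Hypothetical helper, not from restic's source.)
func rangeRequest(url string, offset, length int64) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	// RFC 7233 byte range: the last byte index is inclusive.
	req.Header.Set("Range", fmt.Sprintf("bytes=%d-%d", offset, offset+length-1))
	return req, nil
}

func main() {
	// Placeholder CloudFront URL; a real one comes from the Location header.
	req, _ := rangeRequest("https://example.cloudfront.net/data/pack", 1024, 512)
	fmt.Println(req.Header.Get("Range")) // prints "bytes=1024-1535"
}
```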

@romusz

romusz commented May 16, 2016

in case the priority of this FR depends on the popular vote: +1

@HeikoBornholdt

👍

@Intensity

👍

@nunofgs

nunofgs commented Jun 25, 2016

Yes please 👍

@stv0g

stv0g commented Jul 2, 2016

How about adding some more bounties to this feature?

See: https://www.bountysource.com/issues/23684796-storage-backend-amazon-cloud-drive

@stapelberg
Contributor

@fd0
Member

fd0 commented Jul 5, 2016

Hey, thanks for your interest in restic in general and this backend in particular.

Just to give you a heads-up on what the blocker is here: I'm not sure how to handle third-party web services. For the local and sftp backends we have extensive tests in place that are run for every push/PR as part of the CI tests on Travis. This is also true for the s3 backend, where we're using a local Minio s3 server instance; thanks to this I've found several bugs in the minio client library we're using in the s3 backend.

How can we run CI tests for backend implementations that require a third-party service? Is there e.g. a test service for ACD we could use? Or maybe just take well-tested code from other projects such as rclone?

@fd0 fd0 added the type: discussion undecided topics needing supplementary input label Jul 5, 2016
@stapelberg
Contributor

One solution might be to register an account with Amazon, whitelist it for the Cloud Drive API and then use that for the CI tests? The downside is that such a test depends on Cloud Drive being available, but I guess we can wait for an hour or so occasionally before merging a PR? :)

@fd0
Member

fd0 commented Jul 5, 2016

That's the only solution I can imagine right now that allows us to run the tests against a live service (and that's desirable in my opinion).

When we add more backends for other services the following will happen:

  • The number of dependent services for running the CI tests grows
  • For each backend we will need a test account to use in the tests
  • This test account must allow parallel connections, as the tests (e.g. for a PR) are run in parallel

Did I forget anything?

@stapelberg
Contributor

Your list looks good. There are of course more effects, but I’m not sure whether they are in scope for the question you’re trying to answer:

  • More backends means changes that touch the backends API become more involved (need to update more code, test more code).
  • People who want to run the tests locally need to create test accounts or use their own account.
  • More backends make restic more appealing to more people :).

This test account must allow parallel connections, as the tests (e.g. for a PR) are run in parallel

I think a simple way to take care of this requirement is to use different directories for each test invocation. Sending requests in parallel is usually not an issue with these services, and the different directories make sure the tests don’t clash.
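The per-invocation directory scheme could be as small as this sketch (the `restic-test-` prefix and `testDir` name are just examples):

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

var testRunID uint64

// testDir returns a fresh remote directory name for one test invocation so
// that parallel CI runs against the same test account never collide: the
// timestamp separates concurrent CI jobs, the counter separates calls
// within one process.
func testDir() string {
	n := atomic.AddUint64(&testRunID, 1)
	return fmt.Sprintf("restic-test-%d-%d", time.Now().UnixNano(), n)
}

func main() {
	fmt.Println(testDir())
	fmt.Println(testDir())
}
```

Each test run would create its repository under its own directory and delete it afterwards, so parallel PR builds never see each other's data.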

@fd0
Member

fd0 commented Jul 5, 2016

I think a simple way to take care of this requirement is to use different directories for each test invocation. Sending requests in parallel is usually not an issue with these services, and the different directories make sure the tests don’t clash.

What I meant was more a question of how many parallel connections a service accepts. For most web-based services this won't be a limit (at least not at the number of connections we require), but this may not be the case for other, more obscure services.

@stapelberg
Contributor

When a service limits the number of connections so aggressively that our testing is impacted, we could ask the service owner for an exception or rate-limit on our end as well. As a last resort, we could disable the tests for the backend in question or remove that backend altogether.

But, I suggest we cross that bridge when we get there :).

@fd0
Member

fd0 commented Dec 7, 2016

Thanks for describing the process in such great detail, that is already very similar to what I had in mind.

I'm wondering: why is the webserver needed at all? This process works for a "workstation" type of machine, but not on a server (where there is no browser). The workflow used by rclone is described here: http://rclone.org/remote_setup/

I don't know why we need a webserver for this, but I haven't implemented an OAuth-based login workflow yet.

We'll also need a config file to store the token configured for the remote in, that's also not yet done.

@Twister915

Twister915 commented Dec 7, 2016

From my understanding, which is limited, the OAuth data is provided to the user via the query string of a redirect.

Have a look at the URL in my screenshot of the browser. That was put there by Amazon. After I hit sign-in on my amazon cloud drive, it redirected me, immediately, to that 127.0.0.1 URL.

Perhaps that is the only way to get this data. That seems likely, because rclone implemented a webserver instead of picking another, simpler solution. When I implemented OAuth before, this seemed to be the implication.

If I am correct, then it follows that you must run your own webserver: one page that redirects the user to Amazon, and one that handles the redirect back from Amazon, and this must be accessed through a web browser.


As for the config file, I think all we need is a file in a default location (~/.restic.conf) that can be overridden via a flag or environment variable. I think this is a bit dirty, but it's the only viable solution that is transparent to the average user yet powerful for those who wish to do it "their way".

@fd0
Member

fd0 commented Dec 7, 2016

That sounds plausible. Let me think about a strategy here, this may take some time.

We'll need to:

  • have a config file to store the authentication tokens
  • have "instances" of backends, e.g. something called my_amazon_account which is an ACD backend configured with a login token, so users can run restic --repo my_amazon_account:/foo/bar/dir ...
  • have a workflow to create these login tokens
  • register a client id and secret for use with restic, and hide it in the source code (similar to what rclone does)

Anything else I'm missing here?

@Twister915

Twister915 commented Dec 7, 2016

I think you got the big stuff outlined there.

Would you want to move all current backends into a single abstraction that supports this, or would this whole system become a "cloud" backend in the current sense of a backend (which itself is configured through special restic commands)?

Each instance of a cloud backend (google drive, onedrive, amazon cloud drive, S3?) has the following components:

  • Optional: authorization mechanism
  • Optional: arbitrary persistent state, typically related to authorization, but should support other things. This would be written by restic to a file for each instance of the backend.
  • A protocol for communication (ie: the interface/API)

and maybe some other stuff I'm missing

The current abstraction, from my quick read, only relates to the last thing. I think this is a pretty smart way of handling backends, if you're looking to revamp it a bit.

The other option is to simply, as I said, implement a "cloud" backend which does all of these things and rolls all the different providers together under its umbrella.

@jsimonetti

  • Have a workflow to refresh auth tokens (if they expire and the provider supplies a refresh token, which should be stored next to the auth token in your point 1)
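A token refresh could be a plain POST with the `refresh_token` grant. The field and parameter names below follow the generic OAuth2 token endpoint shape, not any Amazon-specific API; `parseTokenResponse` and `refresh` are invented helper names:

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"net/url"
)

// tokenResponse mirrors the standard fields of an OAuth2 token endpoint reply.
type tokenResponse struct {
	AccessToken  string `json:"access_token"`
	RefreshToken string `json:"refresh_token"`
	ExpiresIn    int    `json:"expires_in"`
}

// parseTokenResponse decodes a token endpoint reply and rejects replies
// without an access token.
func parseTokenResponse(body []byte) (*tokenResponse, error) {
	var tr tokenResponse
	if err := json.Unmarshal(body, &tr); err != nil {
		return nil, err
	}
	if tr.AccessToken == "" {
		return nil, fmt.Errorf("no access_token in response")
	}
	return &tr, nil
}

// refresh posts a refresh_token grant to the provider's token endpoint.
func refresh(tokenURL, clientID, clientSecret, refreshToken string) (*tokenResponse, error) {
	resp, err := http.PostForm(tokenURL, url.Values{
		"grant_type":    {"refresh_token"},
		"refresh_token": {refreshToken},
		"client_id":     {clientID},
		"client_secret": {clientSecret},
	})
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return nil, err
	}
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("token endpoint returned %s", resp.Status)
	}
	return parseTokenResponse(body)
}

func main() {
	tr, _ := parseTokenResponse([]byte(`{"access_token":"at","refresh_token":"rt","expires_in":3600}`))
	fmt.Println(tr.AccessToken, tr.ExpiresIn) // prints "at 3600"
}
```

The new refresh token (if the provider rotates them) would then be written back next to the auth token in the config file.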

@fd0
Member

fd0 commented Dec 8, 2016

Here's some background in regards to embedding a client secret in open source applications: http://stackoverflow.com/a/28109307

As far as I understand the problem: You're not allowed to embed a client secret in an open source application. rclone employs some obfuscation to hide what they're embedding.

I doubt that embedding a static client id/secret in restic's source code is a good idea. On the other hand, having the user register an application themselves is complicated.

This article describes how to do oauth2 with Go: https://jacobmartins.com/2016/02/29/getting-started-with-oauth2-in-go/

@klauspost
Contributor

klauspost commented Dec 12, 2016

I doubt that embedding a static client id/secret in restic's source code is a good idea.

There is no real solution, it is a broken concept to assume that any client can keep a secret.

However, if you consider what the client secret contains, it is not that important. The only real thing it allows is for Amazon (and others) to be able to identify a specific client, nothing more. It does not grant any special access - your tokens are used for that.

Sure, a publicly available "client secret" can let other applications identify themselves as restic, but other than the risk that "restic" will be banned (or more likely rate-limited) as a client, there is not much risk in exposing the client "secret". It will never put any user data in jeopardy.

@fd0
Member

fd0 commented Dec 12, 2016

The problem here is that somebody needs to register the clientID, for example me. If I'm using my normal Amazon account (or even worse, my Google account), and "violate" the TOS for the service by publishing the client secret, they can terminate my account. That's not something I'm going to risk.

Another problem is that once the client secret changes (or is revoked), we're stuck with older versions of restic e.g. in Debian stable which are unable to communicate with the service because of a hardcoded (and now invalid) client secret. This is the case even if access to the service is restored shortly after, but the client secret has changed.

I've thought about possible solutions and found only two:

  • Live with the risk and just put the client secret into the source
  • Build restic in a way that users need to register their own client ID and client secret, via a nice UI that minimizes the hassle

Currently, I'm in favor of the second option, we need a UI for the oauth token thing anyway. What do you think?

@klauspost
Contributor

If I'm using my normal Amazon account [...]

I know that Nick has had some correspondence with Amazon, since rclone was being rate limited due to many users. It is however my impression (from memory) that they were quite forthcoming and encouraged open source development, and have made exemptions for his client. So I guess my advice would be to contact them and see how things go from there. In the overall picture I don't think they would mind the business coming from restic users.

@fd0
Member

fd0 commented Dec 13, 2016

Interesting idea, do you have any hint on who to contact at Amazon?

For Microsoft OneDrive he said that he did not contact anyone: rclone/rclone#372

@stapelberg
Contributor

I know that @breunigs had bad luck with his amazon cloud drive duplicity backend — they wouldn’t give him any rate limit exemptions AFAIK.

@breunigs

breunigs commented Dec 15, 2016

I have only read the last few comments, so please forgive me if this info is not needed:

  • rclone implements the web server and, on top of it, offers a remote setup where you copy the URL. Having a local webserver is just more convenient
  • if you want to whitelist any redirect target in Amazon, it has to be on a https machine – linking to http URLs is not okay. Only exception is localhost. So, for remote setup you can either redirect the user to a blank page and hope they realize what they have to do, or host some page with instructions. I added https://breunig.xyz/duplicity/copy.html for duplicity, since it doesn't have https infrastructure yet. Amazon will add all details in the query string, so you can get away with making this a static page
  • You need an Amazon Developer account. You can use your existing credentials to log in I believe, but you can also create a new one
  • There is a process where you register your app and then at some later stage you can create a security profile for said app. This process is very confusing, because of horrible UX, but it should work without human interaction from Amazon. (Note: by App they usually refer to "mobile apps", but not always. Click around a bit)
  • What the limits are is unclear; Amazon doesn't say. It's clear there are multiple stages: per user, per API endpoint, per credentials
  • If you want production limits, you send an email with "details" to clouddrive-api-invite@amazon.com. Use a big player mail server, or they will tag you as spam and it takes a month or two until some poor soul went through all their spam.

Also, a final word of advice: read through rclone's workarounds for Amazon Drive. The API contains a lot of undocumented "eventual consistency" gotchas. It even goes out of its way to cache an outdated response it gave you, so that you need to wait even longer if you were too hasty to begin with. This is on top of it reporting errors when there are none, one just needs to wait.

HTH,
Stefan

@fd0
Member

fd0 commented Dec 16, 2016

Thanks for the information!

@jsimonetti

Just throwing something out there:

What if we remove all (but local and REST) backends from restic and stick them into restic/rest-server?

This allows restic to focus on doing backups properly while filesystem implementations are done in the rest-server.
This also leaves restic with just one backend API to maintain.

This doesn't solve the testing problem, but it will certainly help keep the restic source clean/focused, and it is easier to make API changes inside restic.

@fd0
Member

fd0 commented Jan 28, 2017

Thanks for the suggestion. Unfortunately I don't like it at all, in my opinion this approach (adding an intermediate layer including a new transport via HTTP) will lead to even more problems.

The backend API interface was stable for a long time, then changed recently, and will be stable again. The interface is already rather small.

We should try to get backends into restic (including proper CI tests) as soon as possible, that's IMHO the only way to make sure they work.

In case of the Amazon ACD backend, we need to answer the outstanding questions first.

@fd0
Member

fd0 commented Mar 29, 2017

The Amazon Developer Guide for Amazon Drive (as it's called these days) states:

What Not To Build

[...]

  • Don’t build apps that encrypt customer data

I feel that Amazon Drive is not the right platform for securely storing encrypted backups.

@askielboe
Contributor

Interesting. This must be a new addition, as it definitely was not the case when ACD support was added to Arq.

Seems ACD is not a real storage option after all.

@e2b
Author

e2b commented Mar 29, 2017

@stapelberg
Contributor

Amazon has since clarified this in https://forums.developer.amazon.com/questions/54909/impact-of-dont-encrypt-customer-data-part-of-drive.html:

What if the customer chooses to encrypt their data?
They can do that, and that is fine.

So, restic and other apps should be good.

@stv0g

stv0g commented Mar 29, 2017 via email

@stapelberg
Contributor

One other motivation which I find plausible is to increase interoperability — if each application encrypts their files, the user’s ability to switch between applications is severely hampered.

@askielboe
Contributor

I asked Arq Backup support. They encrypt everything, and said that their app had been approved by Amazon and not to worry.

I'm not sure what Amazon is trying to say, but it seems they are now evaluating each case as it comes in.

@mikesager

Not sure if anybody is aware of the recent ACD drama with acd_cli and rclone, but a TL;DR of the situation is that they have had their ACD API access revoked due to TOS violations. Their efforts to regain API access are apparently being hampered by the fact that Amazon has stopped accepting new third-party apps for ACD. I assume this latter revelation stops any Restic ACD support in its tracks, unless the project had already obtained ACD API access.

@sedlund
Contributor

sedlund commented May 30, 2017

acd_cli's API access was revoked due to a security issue with their OAuth app, not a TOS violation. The problem has been fixed and Amazon reinstated their key. Although this is off-topic for this project.

New ACD API access is currently closed.

@fd0
Member

fd0 commented May 30, 2017

Thanks for posting this here, I wasn't aware of it. I had reservations implementing ACD, and it seems that Amazon indeed did not like secrets in the code of an Open Source program: rclone was banned for it: https://forum.rclone.org/t/rclone-has-been-banned-from-amazon-drive/2314

On the other hand, acd_cli implemented an OAuth authorization service (not sure what the correct nomenclature here is). This handles authorization for all users, and there apparently was a bug that allowed people to access/modify other people's files.

Since Amazon isn't accepting new clients anyway I'm closing this issue for now. Thanks!
