Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AWS identity verification #2121

Open
wants to merge 5 commits into
base: dev
Choose a base branch
from
Open

AWS identity verification #2121

wants to merge 5 commits into from

Conversation

bemoody
Copy link
Collaborator

@bemoody bemoody commented Oct 24, 2023

Context

Pull #2086 added the ability to mirror project content on Amazon S3. This is now working and we are in the process of uploading open-access projects from PhysioNet.

The changes here will be needed once we start uploading restricted/credentialed projects, so that we can securely grant access to authorized users. (Identity verification aside, there are also some more significant changes that are needed for handling restricted/credentialed projects; see issue #2094.)

In brief: currently (in the old system Felipe set up), people are asked to self-report their AWS account number, and any person or service within that account would be allowed to access restricted data.

With these changes, in contrast, people will be asked to verify their personal AWS identity; subsequently, we'll be able to grant access only to verified identities (the latter part is yet to be implemented.)

Why

DUAs for MIMIC and other databases require that data is only shared with authorized individuals (each person must register on PhysioNet and be credentialed.) We want to enable cloud access for better performance, but complying with these DUAs requires knowing who is being granted permission to use these cloud services.

Moreover, although each user is ultimately responsible for data security, we want to encourage good practices. People may be using AWS for all sorts of reasons unrelated to PhysioNet. Giving themselves permission to access MIMIC through their personal account should not also grant permission to all of those unrelated and possibly-less-trusted services.

Some people may be using organizational AWS accounts rather than personal ones. Maybe we want to discourage this, or maybe not, but we can't prevent it. One member of an organization having access shouldn't grant access to everyone in the organization.

There is a lot about AWS authentication that is still a bit mysterious to me, but my gut feeling is that the "IAM user" level is the right level of authentication for PhysioNet and MIMIC.

It has been suggested that we could ask people to self-report their AWS username (or ARN?) in addition to their account number. And yes, that would be an improvement; but it has the disadvantage that usernames are variable-length, and may not be long-term stable. Better would be to ask people to self-report their AWS userid, but that's not easy for people to find and more likely to cause mistakes.

Finally, I can imagine that in the future there may be other reasons for wanting to associate a PhysioNet account with an AWS account, and having a strong verification process could enable more interesting forms of integration.

How identity verification works

The concept is that we would have a special-purpose S3 bucket which allows access only if the path matches the requester's AWS account and userid. To prove your identity, you generate a signature for a URL that can only be accessed by you, and paste that signed URL into a form on the site.

The process would be:

  1. You go to your cloud settings page on PhysioNet.

  2. We tell you to run the command aws sts get-caller-identity.

  3. You copy the output into the form.

  4. We then tell you to run a command like aws s3 presign s3://asdfghjk/physionet.org-verification/email=root@example.org/userid=AIDAABCDEFGHIJKL/account=112233445566/username=barackobama/.

  5. You copy the output into the form.

  6. We verify the format of the URL and submit it to AWS to verify the signature.

Wait a minute, what's this "userid" thing you keep talking about?

https://docs.aws.amazon.com/IAM/latest/UserGuide/reference_identifiers.html#identifiers-unique-ids

Setup and testing

Using this feature requires creating a special-purpose S3 bucket (a bucket which probably should not be used for anything else.)

For the time being, you can test this by setting AWS_VERIFICATION_BUCKET_NAME to bm-uverify-test1. I will delete that bucket once we've set up a permanent replacement under the PhysioNet AWS account.

If you want to see exactly how the verification bucket is created, and test it yourself, see the instructions in deploy/README.md.

Background

Although this implementation is guided by the needs of PhysioNet, my goal has been to design a general-purpose authentication protocol that could be used by any site that needs to verify cross-account AWS identities.

This is inspired in part by the technique used by Hashicorp Vault and discussed here:

and similarly: https://stackoverflow.com/a/76099155

We could use the same method, but it would require the person to download and run a small program (and that program involves some pretty hairy digging into the AWS API.)

The method proposed here, in contrast, only requires the person to install the official AWS CLI and run a couple of commands. I think that this is easier to understand and therefore paradoxically more secure (see if you can spot the security flaw in the StackOverflow answer.)

For information about why this works, see AWS documentation on policy variables:
https://docs.aws.amazon.com/IAM/latest/UserGuide/reference_policies_variables.html

Also see the AWS CLI documentation:
https://docs.aws.amazon.com/cli/latest/reference/sts/get-caller-identity.html
https://docs.aws.amazon.com/cli/latest/reference/s3/presign.html

@bemoody bemoody force-pushed the user-aws-verification branch 4 times, most recently from ee7a7b4 to 859822d Compare October 30, 2023 19:45
@bemoody
Copy link
Collaborator Author

bemoody commented Oct 30, 2023

Just using the GetObject permission doesn't quite work - if the resource doesn't exist then we get a 403 even if we have permission to access it.

https://stackoverflow.com/a/19038017 explains why this is (which makes sense), though it's not 100% clear how the backend is interpreting the policy, and I'd like to find some official AWS docs clarifying the matter.

Adding ListBucket permission seems to make it work (and raises the question of whether the GetObject permission itself is needed.)

@bemoody
Copy link
Collaborator Author

bemoody commented Oct 30, 2023

Unfortunately, patching in X-Amz-Expected-Bucket-Owner doesn't work. Don't know if there's a way to include that header in the aws s3 presign command. We can live without it, it just feels a little less safe.

@bemoody bemoody force-pushed the user-aws-verification branch 2 times, most recently from 46ad8f2 to 078df56 Compare October 30, 2023 21:12
@bemoody
Copy link
Collaborator Author

bemoody commented Oct 30, 2023

Now with UI and should be working!

@bemoody bemoody force-pushed the user-aws-verification branch 2 times, most recently from 15b6210 to b177957 Compare October 30, 2023 21:56
@bemoody
Copy link
Collaborator Author

bemoody commented Oct 30, 2023

Nits:

  • modal before deleting
  • better error message for unsupported (non-AIDA) identity
  • save original Arn so we can show it on the edit_cloud page?

@bemoody
Copy link
Collaborator Author

bemoody commented Nov 1, 2023

Hmm. Some sources say that Amazon presigned URLs have a maximum expiration time of one week. Maybe that's true for the v4 signature? It's definitely not true for the v2 signature, I just generated one that was valid for a year and it was accepted.

@bemoody bemoody force-pushed the user-aws-verification branch 2 times, most recently from 65c007c to cab730e Compare November 7, 2023 19:25
@bemoody
Copy link
Collaborator Author

bemoody commented Nov 7, 2023

github-advanced-security, you're adorable. It's a fair point that I should rework that regexp to reduce backtrackability (though I'm not sure it's possible to completely fix the issue; python re is annoying that way.)

As for the rest of it, get outta here. ;)

@bemoody bemoody force-pushed the user-aws-verification branch 2 times, most recently from b0b9e44 to ab2ab47 Compare November 10, 2023 23:36
@bemoody
Copy link
Collaborator Author

bemoody commented Dec 5, 2023

I believe this approach should work if we use S3 Access Points to grant access. If we use S3 Access Grants then it is less clear - can we still grant access based on userid or do we have to use ARN?

So we may have to decide which approach to take before deciding how to validate identities. :(

@bemoody
Copy link
Collaborator Author

bemoody commented Dec 12, 2023

So we may have to decide which approach to take before deciding how to validate identities.

I'm still not clear about the access-grant thing and how it would work.

But just to note: this code could be changed to verify the requester's aws:username in addition to their aws:userid. Arguably it makes sense for us to do that regardless.

So although I think it's questionable, security-wise, to rely on cross-account IAM ARNs, it's not impossible and also wouldn't preclude other approaches.

@bemoody
Copy link
Collaborator Author

bemoody commented Jan 29, 2024

Note that using presigned URLs seems to be broken with bucket regions other than us-east-1: boto/boto3#3015

This affects older versions of awscli. Newer versions don't use boto3 and are possibly unaffected.

For the time being, PhysioNet should use us-east-1 for user verification, and other sites should do the same.

Unrelated to user verification, this might also cause problems if we want to store data in regions other than us-east-1.

# If the signature is correct, and the identity is correct as
# determined by the bucket policy, then S3 should return a 404
# response (because the resource doesn't, in fact, exist.)
response = session.get(signed_url)

Check failure

Code scanning / CodeQL

Full server-side request forgery

The full URL of this request depends on a [user-provided value](1).

# As a sanity check, verify that S3 returns a 403 response if
# the AWS signature is missing.
response = session.get(unsigned_url)

Check failure

Code scanning / CodeQL

Full server-side request forgery

The full URL of this request depends on a [user-provided value](1).
Benjamin Moody added 4 commits January 31, 2024 13:14
This module contains functions for authenticating a person's AWS
identity (account number, username, and user ID) by means of signed
URLs.

Amazon S3 authenticates clients using a per-request "signature" that
incorporates the request path and headers together with a secret key
held by the client.  This means that the client can pre-compute this
signature and send it to someone else, allowing the recipient to
perform that request on that client's behalf, without revealing the
secret key itself.

We can arrange to create an S3 URL that can only be accessed by a
particular AWS identity, and then ask someone to pre-compute the
signature that they would use to access that resource (which they can
do using the AWS CLI or other S3-compatible tools and libraries.)

If we then submit that signature to S3 and it succeeds, we know that
the requester holds the secret key for that identity.  In fact, the
resource in question doesn't need to actually exist, as long as we can
tell the difference between an unauthorized request (HTTP 403) and an
authorized request for something that doesn't exist (HTTP 404).

AWS supports many types of identities.  This module only supports "IAM
user" identities (arn:aws:iam::*:user/*) and not any other types, both
as a matter of policy (data access permissions are granted to
individual people, not to groups, organizations, or computers) and
because IAM users have a stable, fixed-length user ID.
We want to verify a person's AWS identity in order to permit them to
access restricted resources via direct cloud APIs; and possibly for
other purposes in the future.

An AWS account ID is not an identity.  An account may contain many
identities (known as "userids" or "unique IDs"), which might or might
not belong to the same person.  (Even if they do all belong to the
same person, it doesn't mean that person wants or should want to give
all of their identities the ability to access sensitive data.)

Here, we add fields to store the userid alongside the account ID and
ARN (which may be of interest in the future), and the date and time
that these credentials were verified.
In order for a person to verify their AWS identity, they need to
provide a digital signature, in the form of a signed URL that includes
their AWS account information in the path.  We further require the URL
to include the domain name of the site, and the user's primary email
address, to prevent misuse.

This signed URL can be generated using the AWS CLI.  However, the URL
must be exactly correct; if it is wrong, it is difficult to tell why.
In order to hopefully avoid confusion, we first ask the person to run
'aws sts get-caller-identity'; based on that, we tell them the exact
'aws s3 presign' command they need to run.
@bemoody
Copy link
Collaborator Author

bemoody commented Jan 31, 2024

I'm banning this bot. I don't accept contributions from bots unless they are free and open source.

(Seriously. I am not interested in providing free labor to Microsoft to help them develop better tools to lock people into GitHub.)

@bemoody
Copy link
Collaborator Author

bemoody commented Jan 31, 2024

regarding boto3 being silly in test suite, see #2150 (comment)

@bemoody bemoody marked this pull request as ready for review January 31, 2024 19:17
@bemoody bemoody changed the title [WIP] AWS identity verification AWS identity verification Jan 31, 2024
@Chrystinne Chrystinne self-assigned this Apr 17, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants