Do not expose S3 credentials to JupyterHub users #47

Open
martinzugnoni opened this issue Sep 10, 2018 · 12 comments

Comments

@martinzugnoni

When configuring s3contents, we need to provide credentials to connect to S3. We can either write them directly into the ~/.jupyter/jupyter_notebook_config.py config file or read them from environment variables.

In either case, any logged-in user can read the config file or inspect the exposed environment variables from a terminal session.
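
To make the exposure concrete, here is a minimal sketch of what any user could run in a notebook cell or terminal. The marker strings used to spot credential-like variables are my own assumption, not something s3contents defines.

```python
import os

# Assumed substrings that typically mark credential-like variable names.
SENSITIVE_MARKERS = ("SECRET", "ACCESS_KEY", "TOKEN")


def visible_credentials(environ=None):
    """Return the environment variables whose names look like credentials.

    Any process started inside the single-user container inherits the same
    environment, so a notebook user can read these values directly.
    """
    environ = os.environ if environ is None else environ
    return {
        name: value
        for name, value in environ.items()
        if any(marker in name.upper() for marker in SENSITIVE_MARKERS)
    }
```

Calling `visible_credentials()` inside the container would list whatever keys were injected for s3contents, which is exactly the problem described here.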

I can't find a way to securely connect to S3 without exposing the credentials to the JupyterHub users.

I'm using the dockerspawner, with the official jupyterhub/singleuser image.

Any suggestions?

Thanks.

@danielfrg
Owner

Definitely an issue. I'm not sure how to solve this on this side. It might be something you have to ask on the Jupyter side: whether there is a way to disable logging of specific variables in the config.

@martinzugnoni
Author

I'm thinking about using a proxy to connect to S3, and provide unique access tokens for each Hub user. The idea is that the proxy evaluates the token, the logged in user and the action he/she's trying to perform, and determines if it's a valid action or not.

It's important to mention that I'm using a per-user prefix strategy, where each user has their own namespace within the S3 bucket. A user should only have permission to read/write their own namespace in the bucket.
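
A rough sketch of the check such a proxy could perform, assuming HMAC-signed tokens and object keys of the form `<username>/...`. The token format, the `PROXY_SECRET` value, and all function names are hypothetical; nothing here is part of s3contents.

```python
import hashlib
import hmac

# Hypothetical secret known only to the proxy, never to notebook users.
PROXY_SECRET = b"change-me"


def _sign(username: str) -> str:
    return hmac.new(PROXY_SECRET, username.encode(), hashlib.sha256).hexdigest()


def issue_token(username: str) -> str:
    """Return an opaque per-user token: '<username>:<hex HMAC of username>'."""
    return f"{username}:{_sign(username)}"


def user_from_token(token: str):
    """Return the username if the token signature is valid, else None."""
    username, _, signature = token.rpartition(":")
    if not username:
        return None
    valid = hmac.compare_digest(signature.encode(), _sign(username).encode())
    return username if valid else None


def is_allowed(token: str, key: str) -> bool:
    """Allow a request only for object keys under '<username>/'."""
    username = user_from_token(token)
    if username is None:
        return False
    # S3 treats keys as opaque strings, but reject '..' segments defensively.
    if ".." in key.split("/"):
        return False
    return key.startswith(f"{username}/")
```

With this shape, the real S3 credentials live only in the proxy, and a leaked user token grants access to that user's prefix and nothing else. The sketch assumes usernames contain no `/`.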

It might work, but it would require some implementation work on top of your s3contents app.
¯\_(ツ)_/¯

@danielfrg
Owner

That makes sense.

You can take a look at this issue that faces a similar problem: #45 but it doesn't solve the credentials issue.

One solution might be to pass the credentials from a JupyterHub setting that the users have to input.

@martinzugnoni
Author

I wrote that issue. 😂

@danielfrg
Owner

ROFL :)

@rgbkrk
Contributor

rgbkrk commented Oct 5, 2018

There is always setting up an IAM role for the host. That does mean they have access from a server standpoint (which means they can fetch whatever is allowed by that role), but at least doesn't expose the tokens directly in config.

@martinzugnoni
Author

Yes, we thought about IAM roles as well. But, as you say, the correct way would be to have a different user in AWS for each user in your system. Which is not possible if you have a ton of users. That's why we discarded that option.

@milutz
Contributor

milutz commented Jul 6, 2019

@martinzugnoni Is this issue still a problem for you? I have a few different workarounds that I could type up if they would help you.

I have one block of JupyterHub config that drives AWS to dynamically create a new IAM user for each JupyterHub user. It creates the IAM user at first login if it doesn't already exist and, either way, pulls the IAM user's keys at each login (which makes it really hard to use the keys for anything else). It should work for up to 4999 users, since one IAM user is consumed by JupyterHub to do the work.

I also have a block that starts out the same but then generates temporary keys from the IAM user keys, so these keys expire and become harmless, in exchange for a maximum time the user's session can run. Depending on the style chosen, the maximum lifetime of the temporary keys is 12 or 36 hours (minimum 10 minutes).

The above could also be switched to federated users, which allows an essentially unlimited number of users but forces a 12-hour maximum on the keys.

Assuming you still have an issue, let me know which constraint is more important to you and I'll see if I can make some sample code
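
For readers following along, the federated-user variant might look roughly like the sketch below. It is not milutz's code: the function names, the policy layout, and the returned dictionary keys are my own assumptions. `sts_client` is expected to behave like a boto3 STS client (for example `boto3.client("sts")`), whose `get_federation_token` call accepts `Name`, `Policy`, and `DurationSeconds`.

```python
import json


def user_prefix_policy(bucket: str, username: str) -> str:
    """Build an IAM policy limiting access to '<bucket>/<username>/*'."""
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing is allowed only for keys under the user's prefix.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket}"],
                "Condition": {"StringLike": {"s3:prefix": [f"{username}/*"]}},
            },
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/{username}/*"],
            },
        ],
    }
    return json.dumps(policy)


def temporary_keys(sts_client, bucket, username, duration_seconds=43200):
    """Request expiring, prefix-scoped credentials for one Hub user."""
    response = sts_client.get_federation_token(
        Name=username[:32],  # STS limits the federated user name to 32 chars
        Policy=user_prefix_policy(bucket, username),
        DurationSeconds=duration_seconds,
    )
    creds = response["Credentials"]
    return {
        "access_key_id": creds["AccessKeyId"],
        "secret_access_key": creds["SecretAccessKey"],
        "session_token": creds["SessionToken"],
    }
```

The effective permissions of a federated token are the intersection of the calling IAM user's permissions and the policy passed in, so the Hub-side IAM user still needs access to the whole bucket while each Hub user only gets their own prefix.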

@guimou

guimou commented Aug 12, 2019

@martinzugnoni We have a different approach that may be useful for you or other people dealing with this problem. Our whole installation is based on OpenShift, but the approach should work in any Kubernetes environment, and in other environments as well.

We have a central data lake based on Ceph. Each user has their own S3 credentials and a set of buckets they have access to (instead of one bucket and prefixes). We store all user information (credentials) in a HashiCorp Vault instance.

Users authenticate themselves on JupyterHub through OAuth using Keycloak. We store access information, so basically JWT access_token and refresh_token (encrypted), in the JupyterHub database (tokens are refreshed periodically).

When a user launches a notebook, we use the pre_spawn_start function from JupyterHub to connect to Vault using the access_token and retrieve the user's aws_key and secret. We have a special dynamic policy in Vault attached to the path where we store secrets (/some_path/user_id/secret) that allows each user to retrieve their own secrets and only their own (which is why we need a valid access token from Keycloak to enforce this policy).
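
A minimal sketch of such a pre-spawn step, assuming the access token is stored under an `access_token` key in the user's auth state and that the Vault lookup is wrapped in a coroutine passed in as `fetch_secret`. That helper, the secret field names `aws_key` and `aws_secret`, and the `demo` function are hypothetical; the actual implementation is in the repository linked in this comment.

```python
import asyncio
from types import SimpleNamespace


async def inject_s3_credentials(user, spawner, fetch_secret):
    """Body of an Authenticator.pre_spawn_start(user, spawner) hook.

    fetch_secret(user_id, access_token) is a hypothetical coroutine that
    asks the secret store for this user's S3 key pair.
    """
    auth_state = await user.get_auth_state()
    if not auth_state or "access_token" not in auth_state:
        raise RuntimeError("no access token stored for this user")
    secret = await fetch_secret(user.name, auth_state["access_token"])
    # Each user only ever receives their own keys in the notebook environment.
    spawner.environment["AWS_ACCESS_KEY_ID"] = secret["aws_key"]
    spawner.environment["AWS_SECRET_ACCESS_KEY"] = secret["aws_secret"]


def demo():
    """Run the hook against stand-in objects instead of a real Hub."""

    async def get_auth_state():
        return {"access_token": "jwt"}

    async def fake_fetch(user_id, access_token):
        return {"aws_key": "AK", "aws_secret": "SK"}

    user = SimpleNamespace(name="alice", get_auth_state=get_auth_state)
    spawner = SimpleNamespace(environment={})
    asyncio.run(inject_s3_credentials(user, spawner, fake_fetch))
    return spawner.environment
```

In a real deployment the function body would sit inside a custom authenticator's `async def pre_spawn_start(self, user, spawner)` method, with auth state persistence enabled on the Hub.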

Then we simply inject the secrets as env vars in the notebook and use S3Contents to connect. It's no problem if the user sees them, as they are their own!
In fact it's a bit more convoluted, as we use HybridContentsManager to also connect the local filesystem and to mount each accessible bucket at a different path.
Notebook code is here if you want to have a look:
https://github.com/guimou/jupyter-notebooks-s3/blob/0e0979ae3fdb303e84a58ac85b6ec99357457ee6/minimal-notebook/jupyter_notebook_config.py#L28
but in the coming days I will release a clean version, along with the JupyterHub, Keycloak, and Vault configurations and an article explaining everything in detail. I'll update this comment when it's ready.
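
As a rough illustration of the multi-bucket layout described in this comment, the helper below builds the per-path keyword arguments that a HybridContentsManager configuration would need. The function name and the choice to mount each bucket under a path equal to its name are assumptions; the keyword names (`root_dir`, `access_key_id`, `secret_access_key`, `endpoint_url`, `bucket`) are the configurable options of FileContentsManager and S3ContentsManager.

```python
def hybrid_manager_kwargs(buckets, access_key_id, secret_access_key,
                          endpoint_url, local_dir):
    """Map the root path to the local filesystem and each bucket to a path."""
    kwargs = {"": {"root_dir": local_dir}}
    for bucket in buckets:
        kwargs[bucket] = {
            "access_key_id": access_key_id,
            "secret_access_key": secret_access_key,
            "endpoint_url": endpoint_url,
            "bucket": bucket,
        }
    return kwargs


# In jupyter_notebook_config.py this could be wired up roughly as follows
# (left as comments because it needs a running Jupyter config context):
#   c.NotebookApp.contents_manager_class = HybridContentsManager
#   c.HybridContentsManager.manager_classes = {
#       "": FileContentsManager, "<bucket>": S3ContentsManager, ...
#   }
#   c.HybridContentsManager.manager_kwargs = hybrid_manager_kwargs(...)
```

Because the keys come from the user's own environment, the same helper would produce a different bucket list for each user.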

@guimou

guimou commented Sep 4, 2019

@martinzugnoni Here is the article I published with more details on our implementation.
The repos with everything are now there for JupyterHub and there for the notebooks.
A huge thanks to @danielfrg for his work. If we have the chance to meet, the beer's on me!

@chenglinzhang

Thanks @guimou for the Medium article. A question, if you don't mind: in the "Directories mapping" section of jupyter_notebook_config.py, how do I specify the certificate file location so that S3ContentsManager can access S3 over HTTPS?
I looked through the properties of c.S3ContentsManager.* and c.NotebookApp.* and don't have a clue.

@nlhnt

nlhnt commented Mar 23, 2023

Is this still an issue? I tried looking for jupyter_notebook_config.py in my bare-metal (two-node) k3s cluster as a regular user, and I can't find it in the ~/.jupyter directory.
Is this only an issue for dockerspawner? Does this apply to Kubespawner?
