PostMortem storage discussion #114

Open
consideRatio opened this issue Aug 1, 2020 · 1 comment

@consideRatio
Member

Mounting of storage on user pods was slow

Mounting volumes to the pods seems to take a while, impacting the spawn time significantly, though I'm not yet sure which part of the mounting process is slow. There were many mounts happening:

  1. A 10GB GCE PD through a PVC / PV.
  2. An NFS server mount for the /home/curriculum folder, which we populated with a gitpuller pull to avoid relying on GitHub being up.
  3. A set of k8s ConfigMaps.

If it's the mounting that takes time, how much time does it take? If mounting an NFS PVC is slow but mounting a hostPath volume is fast, one could mount the NFS storage on each node and then use a hostPath volume to access that mount indirectly. This is what @yuvipanda's https://github.com/yuvipanda/k8s-nfs-mounter does, but it's also something Yuvi is transitioning away from.
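For reference, a minimal sketch of that indirect hostPath approach, assuming each node already has the NFS share mounted at /mnt/nfs (the path, image, and names below are made up for illustration):

```yaml
# Hypothetical sketch: a user pod that sidesteps NFS PVC attach/mount time
# by pointing a hostPath volume at an NFS share already mounted on the node
# (e.g. by k8s-nfs-mounter or node setup). All names/paths are assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: user-pod-example
spec:
  containers:
    - name: notebook
      image: jupyter/base-notebook
      volumeMounts:
        - name: nfs-via-hostpath
          mountPath: /home/jovyan/shared
  volumes:
    - name: nfs-via-hostpath
      hostPath:
        path: /mnt/nfs   # assumed node-level NFS mount point
        type: Directory
```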

NFS read/write throughput and the rsync cache workaround

Google's managed NFS service, Filestore, promised no more than 100 MB/s of sustained throughput, which is a bit low if we want hundreds of users to have access to 1 GB datasets. Because of this, I ended up running a DaemonSet that created a pod on each node, where I used rsync to stash away a local replica. rsync was used instead of cp or similar in order to ensure we could stay up to date with changes.

Some related PRs for this were #60, #63, #66, #100.
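As a rough illustration of that workaround (not the exact manifests from the PRs above; the image, paths, server address, and sync interval are all assumptions):

```yaml
# Hypothetical sketch of the rsync cache: one pod per node keeps a local
# replica of the shared dataset in sync with the NFS source. Unlike a
# one-off cp, the periodic rsync also propagates later changes.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: dataset-cache
spec:
  selector:
    matchLabels:
      app: dataset-cache
  template:
    metadata:
      labels:
        app: dataset-cache
    spec:
      containers:
        - name: rsync
          image: instrumentisto/rsync-ssh   # assumed rsync-capable image
          command: ["/bin/sh", "-c"]
          args:
            - while true; do rsync -a --delete /nfs/datasets/ /cache/datasets/; sleep 300; done
          volumeMounts:
            - name: nfs
              mountPath: /nfs
              readOnly: true
            - name: cache
              mountPath: /cache
      volumes:
        - name: nfs
          nfs:
            server: 10.0.0.2   # assumed Filestore IP
            path: /share
        - name: cache
          hostPath:
            path: /var/lib/dataset-cache
            type: DirectoryOrCreate
```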

NFS quotas

While we didn't use NFS storage for the users, we could have, and then it would have been relevant to tackle the storage quota issue: with NFS you typically can't set quotas for individual users so easily.

@yuvipanda has demonstrated one solution using a self-hosted NFS server backed by storage on an XFS filesystem, and one can also use a Helm chart called nfs-provisioner to deploy an NFS server.

pangeo-data/pangeo-cloud-federation#654
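For the Helm-chart route, a minimal values sketch along these lines could back the NFS server with a cloud disk and expose a StorageClass for user PVCs. Key names vary between chart versions, so treat these as assumptions to verify against the chart you actually install:

```yaml
# Hypothetical values.yaml sketch for an nfs-server-provisioner style chart.
# All keys and values here are assumptions; check the chart's own defaults.
persistence:
  enabled: true
  storageClass: standard   # cloud disk backing the NFS server itself
  size: 200Gi
storageClass:
  name: nfs                # StorageClass user PVCs would request
  mountOptions:
    - vers=4.1
```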

NFS archiving

A challenge with a bootcamp like this is that we intend to tear it down after a while, but it's not so great to delete users' access to their storage. With that in mind, an option could be to archive it in some object storage and provide users a way to access it later without an NFS server running.

Access to the archived storage should not be public, so a simple solution would be to generate a password for each user, which could be emailed or accessed through JupyterHub, which knows about the user. Perhaps this could be developed as an external JupyterHub service that would be aware of the JupyterHub identity.

@yuvipanda is exploring this, but no GitHub repo is up yet to reference.
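As a sketch of the archiving step itself (bucket name, server address, and credentials wiring are all made up), a one-off Kubernetes Job could sync the NFS export into a GCS bucket before the NFS server is torn down:

```yaml
# Hypothetical one-off archival Job: copy the NFS export into object storage
# with gsutil rsync. The bucket, server address, and the service-account
# setup needed for gsutil to authenticate are assumptions and omitted here.
apiVersion: batch/v1
kind: Job
metadata:
  name: archive-home-dirs
spec:
  template:
    spec:
      restartPolicy: OnFailure
      containers:
        - name: archiver
          image: google/cloud-sdk:slim
          command: ["/bin/sh", "-c"]
          args:
            - gsutil -m rsync -r /export gs://example-bootcamp-archive/home
          volumeMounts:
            - name: nfs-export
              mountPath: /export
              readOnly: true
      volumes:
        - name: nfs-export
          nfs:
            server: 10.0.0.2   # assumed NFS server address
            path: /share
```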

@yuvipanda

> A 10GB GCE PD through a PVC / PV.

Aaaah, this is the slow one. Takes a while always. NFS is usually instant in comparison.
