PostMortem storage discussion #114

Open
consideRatio opened this issue Aug 1, 2020 · 1 comment

@consideRatio
Member

Mounting of storage on user pods was slow

Mounting volumes to the pods seems to take a while, impacting the spawn time significantly, though I'm not yet sure which part of the mounting process is slow. There were many mounts happening:

  1. A 10GB GCE PD through a PVC / PV.
  2. An NFS server mount for the /home/curriculum folder, which we populated with a gitpuller pull to avoid relying on GitHub being up.
  3. A set of k8s ConfigMaps.

If it's the mounting that takes time, how much time does it take? If mounting an NFS PVC is slow but mounting a hostPath volume is fast, one could mount the NFS storage on each node and then use a hostPath volume to access that mount indirectly. This is what @yuvipanda's https://github.com/yuvipanda/k8s-nfs-mounter does, but it's also something Yuvi is transitioning away from.
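For reference, a minimal sketch of that indirect hostPath approach, assuming each node already has the NFS share mounted at /mnt/nfs (the path, image, and names below are made up for illustration):

```yaml
# Hypothetical sketch: a user pod that sidesteps NFS PVC attach/mount time
# by pointing a hostPath volume at an NFS share already mounted on the node
# (e.g. by k8s-nfs-mounter or node setup). All names/paths are assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: user-pod-example
spec:
  containers:
    - name: notebook
      image: jupyter/base-notebook
      volumeMounts:
        - name: nfs-via-hostpath
          mountPath: /home/jovyan/shared
  volumes:
    - name: nfs-via-hostpath
      hostPath:
        path: /mnt/nfs   # assumed node-level NFS mount point
        type: Directory
```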

NFS read/write throughput and the rsync cache workaround

Google's managed NFS service, Filestore, promised no more than 100 MB/s of sustained throughput, which is a bit low if we want hundreds of users to have access to 1 GB datasets. Because of this, I ended up running a DaemonSet that created a pod on each node, where I used rsync to stash away a local replica. rsync was used instead of cp or similar in order to ensure we could stay up to date with changes.

Some related PRs for this were #60, #63, #66, #100.
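As a rough illustration of that workaround (not the exact manifests from the PRs above; the image, paths, server address, and sync interval are all assumptions):

```yaml
# Hypothetical sketch of the rsync cache: one pod per node keeps a local
# replica of the shared dataset in sync with the NFS source. Unlike a
# one-off cp, the periodic rsync also propagates later changes.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: dataset-cache
spec:
  selector:
    matchLabels:
      app: dataset-cache
  template:
    metadata:
      labels:
        app: dataset-cache
    spec:
      containers:
        - name: rsync
          image: instrumentisto/rsync-ssh   # assumed rsync-capable image
          command: ["/bin/sh", "-c"]
          args:
            - while true; do rsync -a --delete /nfs/datasets/ /cache/datasets/; sleep 300; done
          volumeMounts:
            - name: nfs
              mountPath: /nfs
              readOnly: true
            - name: cache
              mountPath: /cache
      volumes:
        - name: nfs
          nfs:
            server: 10.0.0.2   # assumed Filestore IP
            path: /share
        - name: cache
          hostPath:
            path: /var/lib/dataset-cache
            type: DirectoryOrCreate
```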

NFS quotas

While we didn't use NFS storage for the users, we could have, and then it would have been relevant to tackle the storage quota issue: with NFS you typically can't set quotas for individual users so easily.

@yuvipanda has demonstrated one solution using a self-hosted NFS server backed by storage on an XFS filesystem, and one can also use a Helm chart called nfs-provisioner to deploy an NFS server.

pangeo-data/pangeo-cloud-federation#654
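For the Helm-chart route, a minimal values sketch along these lines could back the NFS server with a cloud disk and expose a StorageClass for user PVCs. Key names vary between chart versions, so treat these as assumptions to verify against the chart you actually install:

```yaml
# Hypothetical values.yaml sketch for an nfs-server-provisioner style chart.
# All keys and values here are assumptions; check the chart's own defaults.
persistence:
  enabled: true
  storageClass: standard   # cloud disk backing the NFS server itself
  size: 200Gi
storageClass:
  name: nfs                # StorageClass user PVCs would request
  mountOptions:
    - vers=4.1
```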

NFS archiving

A challenge with a bootcamp like this is that we intend to tear it down after a while, but it's not so great to delete users' access to their storage. With that in mind, an option could be to archive it in some object storage and provide users a way to access it later without an NFS server running.

Access to the archived storage should not be public, so a simple solution would be to generate a password for each user, which could be emailed or accessed through JupyterHub, which knows about the user. Perhaps this could be developed as an external JupyterHub service that would be aware of the JupyterHub identity.

@yuvipanda is exploring this, but no GitHub repo is up yet to reference.
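As a sketch of the archiving step itself (bucket name, server address, and credentials wiring are all made up), a one-off Kubernetes Job could sync the NFS export into a GCS bucket before the NFS server is torn down:

```yaml
# Hypothetical one-off archival Job: copy the NFS export into object storage
# with gsutil rsync. The bucket, server address, and the service-account
# setup needed for gsutil to authenticate are assumptions and omitted here.
apiVersion: batch/v1
kind: Job
metadata:
  name: archive-home-dirs
spec:
  template:
    spec:
      restartPolicy: OnFailure
      containers:
        - name: archiver
          image: google/cloud-sdk:slim
          command: ["/bin/sh", "-c"]
          args:
            - gsutil -m rsync -r /export gs://example-bootcamp-archive/home
          volumeMounts:
            - name: nfs-export
              mountPath: /export
              readOnly: true
      volumes:
        - name: nfs-export
          nfs:
            server: 10.0.0.2   # assumed NFS server address
            path: /share
```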

@yuvipanda

> A 10GB GCE PD through a PVC / PV.

Aaaah, this is the slow one. Takes a while always. NFS is usually instant in comparison.
