Pod Local bcache possible? #143

Open
inviscid opened this issue Sep 14, 2022 · 3 comments
inviscid commented Sep 14, 2022

Is your feature request related to a problem?/Why is this needed
The ability for Carina to provide tiered storage using bcache is very powerful, especially in the context of database operations. However, it currently requires data to reside at the node level rather than combining persistent storage at the pod level with ephemeral NVMe/SSD storage at the node level. This makes it difficult to move pods to new nodes.

Describe the solution you'd like in detail
Would it be possible to construct the bcache volume within a pod, so that it uses the node's local ephemeral NVMe/SSD disks as the cache and a PV exposed at the pod level as the backing device? This way, the persistent part of the bcache moves easily with the pod, while the cache portion is discarded and rebuilt once the pod has been rescheduled to a new node.

For example, in a GCP environment we can create a node with a local 375GB NVMe drive. As pods are scheduled to the node, a portion of the 375GB drive is allocated to each pod as a cache device (raw block device), alongside a PV (raw block device) attached from the GCP persistent disk service. When the pod is initialized, the bcache device is created pod-locally from the two attached block devices.
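To make this concrete, here is a rough sketch (shell, not tested) of what that in-pod assembly step could look like, assuming bcache-tools is available in the image, the pod is allowed to write to /sys, and the two raw block devices show up at the hypothetical paths used below:

```sh
# Hypothetical device paths: a slice of the node-local NVMe (ephemeral cache)
# and the network-attached PV exposed to the pod as a raw block device.
CACHE_DEV=/dev/carina-cache
BACKING_DEV=/dev/pd-backing

# Format the backing device only if it does not already carry a bcache superblock,
# so the persistent data survives rescheduling to a new node.
if ! bcache-super-show "$BACKING_DEV" >/dev/null 2>&1; then
  make-bcache -B "$BACKING_DEV"
fi

# The cache slice is ephemeral, so it is formatted fresh on every node.
make-bcache -C "$CACHE_DEV"

# Register both devices with the kernel, then attach the new cache set
# to the backing device; /dev/bcache0 is the tiered device the database uses.
echo "$BACKING_DEV" > /sys/fs/bcache/register
echo "$CACHE_DEV"   > /sys/fs/bcache/register
CSET_UUID=$(bcache-super-show "$CACHE_DEV" | awk '/cset.uuid/ {print $2}')
echo "$CSET_UUID" > /sys/block/bcache0/bcache/attach
```

A real implementation would also have to handle the reschedule case where the backing device still references the old, now-missing cache set (bcache will not start it until that is dealt with); the sketch glosses over that.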

The benefit of this is that the data is no longer node-bound, and pods can be rescheduled easily to new nodes with their persistent data following. It would also enable resizing individual PVs without worrying about how much disk space is attached at the node level.

Describe alternatives you've considered

  1. Just sticking with standard network-attached PVs. This is not optimal for database operations, since having local disk can significantly boost read/write performance.

  2. Trying a homegrown version of this local bcache concept using TopoLVM (https://github.com/topolvm/topolvm) together with network-attached PVs (a rough sketch of what that could look like follows this list).

  3. We also looked at ZFS ARC, but that requires setting up our own storage layer rather than leveraging GCP, AWS, or Azure managed storage.
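For alternative 2, the homegrown version could look roughly like the sketch below: two raw block PVCs (one node-local via TopoLVM, one network-attached) handed to a privileged pod that then runs the bcache assembly from the earlier sketch. The StorageClass names, sizes, and device paths are assumptions, not tested configuration:

```sh
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pg-backing
spec:
  accessModes: ["ReadWriteOnce"]
  volumeMode: Block                      # raw block device, no filesystem
  storageClassName: standard-rwo         # assumed network-attached GCP PD class
  resources:
    requests:
      storage: 200Gi
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pg-cache
spec:
  accessModes: ["ReadWriteOnce"]
  volumeMode: Block
  storageClassName: topolvm-provisioner  # assumed TopoLVM class on local NVMe
  resources:
    requests:
      storage: 50Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: pg-with-bcache
spec:
  containers:
  - name: postgres
    image: postgres:15
    securityContext:
      privileged: true                   # needed for make-bcache and /sys writes
    volumeDevices:
    - name: backing
      devicePath: /dev/pd-backing        # hypothetical in-pod device paths
    - name: cache
      devicePath: /dev/carina-cache
  volumes:
  - name: backing
    persistentVolumeClaim:
      claimName: pg-backing
  - name: cache
    persistentVolumeClaim:
      claimName: pg-cache
EOF
```

One wrinkle with this homegrown route: the TopoLVM PVC is itself persistent and carries node affinity, so it pins the pod to the node instead of being discarded on reschedule, which is exactly what a pod-local, ephemeral cache in Carina would avoid.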

Additional context
This would have immediate use for Postgres and Greenplum running in Kubernetes. The churn of rebuilding large data drives can be significant for clusters with frequent node terminations (spot instances).

antmoveh (Contributor) commented:

Thank you for your advice. I'm taking a look at your proposal.

antmoveh added the enhancement (New feature or request) label Sep 16, 2022
ZhangZhenhua (Contributor) commented Sep 16, 2022

This is a very interesting idea. If we implement this, pods with local storage can migrate freely, and the cluster (for example, PG or Greenplum) can get back to normal without rebuilding its data.

However, there are some obstacles that need to be addressed first.

  1. Carina needs to support ephemeral storage provisioning, so that the bcache device shares the same lifecycle as the pod. When the pod gets deleted, the bcache device is deleted too, flushing cached data to persistent storage.

  2. We need to find a way to tell Carina which persistent storage to use as the cold layer when building the bcache (a hypothetical sketch of what that could look like follows this list).

  3. The pod needs to specify at least two storage classes: one for Carina and one for the persistent storage provisioner. When kubelet builds the container, it might run into a deadlock while preparing those two volumes.

  4. As the persistent storage is also attached to the Pod, the application inside must never write to that device directly. And actually, I am not sure whether the bare attach/mount operation itself could cause data corruption or not.
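Purely as a strawman for points 1 and 2 (this is not an existing Carina API): the cache could be requested as a generic ephemeral volume so that it shares the pod's lifecycle, and a pod annotation could name the persistent PVC to use as the cold layer. Every name below (annotation key, StorageClass, device paths) is hypothetical:

```sh
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: pg-tiered
  annotations:
    carina.storage.io/backing-pvc: pg-backing   # hypothetical: tells Carina which PVC is the cold layer
spec:
  containers:
  - name: postgres
    image: postgres:15
    volumeDevices:
    - name: backing
      devicePath: /dev/pd-backing               # hypothetical device paths
    - name: cache
      devicePath: /dev/carina-cache
  volumes:
  - name: backing
    persistentVolumeClaim:
      claimName: pg-backing                     # provisioned by the cloud provider's CSI driver
  - name: cache
    ephemeral:                                  # generic ephemeral volume: deleted with the pod
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          volumeMode: Block
          storageClassName: csi-carina-sc       # assumed Carina StorageClass name
          resources:
            requests:
              storage: 50Gi
EOF
```

With this shape, the ephemeral claim is garbage-collected together with the pod, which gives the flush-then-discard lifecycle from point 1; the deadlock concern in point 3 still applies, since kubelet has to stage both volumes before the container can start.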

ZhangZhenhua self-assigned this Sep 16, 2022
ZhangZhenhua (Contributor) commented:

@inviscid any thoughts?

Also, we need to take care of power failure: we may not get a chance to flush hot data.
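One way to sidestep that (a sketch, assuming the /dev/bcache0 device from the earlier example): keep the cache in writethrough mode, so it never holds dirty data and a power failure or spot reclaim loses nothing that is not already on the persistent backing device, at the cost of some write latency:

```sh
# Show the current mode (the active one is shown in brackets),
# then force writethrough so the cache never holds dirty data.
cat  /sys/block/bcache0/bcache/cache_mode
echo writethrough > /sys/block/bcache0/bcache/cache_mode
```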
