[ENH] - Add kubernetes horizontal autoscaler for conda-store workers based on queue depth #2284
**Options**

We have two options to achieve this:

**Option 1: Horizontal Pod Autoscaler** based on external metrics and a load monitor/watcher.
Ref: https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale-walkthrough/

**Option 2: KEDA (Kubernetes-based Event-driven Autoscaling)**
Ref:
The PGSql scaler allows us to run a query on a database, which means we can simply point it at the existing conda-store database to get the queue depth of pending jobs.

**Pros and cons**

Option 1

Option 2

**Should this be part of conda-store?**

Regardless of the option we take, this could be moved upstream to conda-store.

We should agree on these before we start. Please suggest. Thanks.
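For illustration, Option 2 with KEDA's PostgreSQL scaler might look roughly like the sketch below. The deployment name, service host, credentials, and especially the table/column names in the query are assumptions for illustration, not conda-store's actual schema:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: conda-store-worker-scaler
spec:
  scaleTargetRef:
    name: nebari-conda-store-worker          # hypothetical worker Deployment name
  triggers:
    - type: postgresql
      metadata:
        host: nebari-conda-store-postgresql  # hypothetical service name
        port: "5432"
        userName: postgres
        dbName: conda-store
        sslmode: disable
        # Hypothetical query: count builds still waiting in the queue.
        query: "SELECT COUNT(*) FROM build WHERE status = 'QUEUED';"
        targetQueryValue: "1"                # one worker replica per pending build
```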
@pt247 conda-store already has a queue; it is using Redis and Celery. I expect we can pull queue depth from that, so we shouldn't need to deploy extra infra there. The `nebari-conda-store-redis-master` stateful set is what you are looking for. I am unfamiliar with KEDA, but it does look promising and has a Redis scaler too. In general I prefer to use built-in solutions as my default, so the horizontal autoscaler was my first thought, but if KEDA allows for better results with less complexity then I can see going with that. KEDA is a CNCF project that seems to be actively maintained, so that is good.

As to whether this solution belongs in conda-store, I will simply say: it does not. conda-store allows for horizontal scaling by having a queue with a worker pool. That is where conda-store's responsibility ends. Building specific implementation details for scaling on Nebari into conda-store would cross software boundaries and greatly increase coupling between the projects. That would be moving in the wrong direction. We want to decrease coupling between conda-store and Nebari. conda-store has a method for scaling horizontally; it is on Nebari to implement autoscaling that fits its particular environment.
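For reference, KEDA's Redis-lists scaler pointed at the Celery broker could look something like this. The Deployment name and service address are assumptions; `celery` is Celery's default broker queue name, but a deployment may route tasks to a differently named queue:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: conda-store-worker-scaler
spec:
  scaleTargetRef:
    name: nebari-conda-store-worker                  # hypothetical worker Deployment
  triggers:
    - type: redis
      metadata:
        address: nebari-conda-store-redis-master:6379  # assumed service address
        listName: celery      # Celery's default broker queue (a Redis list)
        listLength: "1"       # target: one worker replica per pending task
```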
I bet conda-store devs would have comments on this, and on whether it would be implemented in conda-store. It seems like this issue should be transferred to the conda-store repo to improve visibility with conda-store devs.
I also agree that conda-store already has a sound scaling system; however, we are not using it in our own deployment. Having multiple Celery workers is already supported (as both Redis and Celery handle the task load balancing by themselves); what we need to discuss is how to handle worker scaling on our Kubernetes infrastructure. Today it's a manual process that depends on creating more workers, and we need a way to automate it. I initially suggested using the queue depth on Redis to manage this, which would trigger a CRD to change the number of replicas the worker deployment should have.
Either KEDA or the horizontal autoscaler would work here, and both can scale automatically using the queue depth. KEDA seems a bit more elegant in its implementation, so I would suggest starting with it to see if it works; if for some reason it doesn't, we can fall back to the horizontal autoscaler.
**Notes on POC**

Installing KEDA:
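The install commands did not survive in this thread; the standard Helm-based installation from the KEDA documentation looks like this (requires Helm and access to a running cluster):

```shell
# Add the KEDA Helm repository and install the operator into its own namespace
helm repo add kedacore https://kedacore.github.io/charts
helm repo update
helm install keda kedacore/keda --namespace keda --create-namespace
```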
Scaled job spec:
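Since the spec itself was lost from this thread, here is a rough sketch of what a KEDA `ScaledJob` for the worker could look like. The image name, command, queue address, and counts are placeholders, not the actual values used in the POC:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  name: conda-store-worker
spec:
  jobTargetRef:
    template:
      spec:
        containers:
          - name: conda-store-worker
            image: quansight/conda-store-server      # assumed image
            command: ["conda-store-worker"]
        restartPolicy: Never
  pollingInterval: 30       # seconds between queue checks
  maxReplicaCount: 10       # upper bound on concurrent build jobs
  triggers:
    - type: redis
      metadata:
        address: nebari-conda-store-redis-master:6379  # assumed service address
        listName: celery
        listLength: "1"
```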
I have also tried this:
I am getting the following error:
Hmm, this is strange behavior; I think something might be missing. I will try to reproduce this on my side as well.
I have also tried `TriggerAuthentication`:
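For context (the manifest tried here was lost), a `TriggerAuthentication` for a Redis password typically looks like the sketch below; the Secret name and key are assumptions. The trigger then references it via `authenticationRef: {name: conda-store-redis-auth}`:

```yaml
apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: conda-store-redis-auth
spec:
  secretTargetRef:
    - parameter: password             # maps to the Redis scaler's password parameter
      name: nebari-conda-store-redis  # hypothetical Secret name
      key: redis-password             # hypothetical key within the Secret
```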
This worked. It turns out that the secrets need to be base64-encoded.
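To illustrate the fix: values placed in a Kubernetes Secret's `data` field must be base64-encoded (alternatively, `stringData` accepts plain text and Kubernetes encodes it for you). For example, with a hypothetical password value:

```shell
# Encode a (hypothetical) Redis password for a Secret's `data` field.
# -n prevents a trailing newline from being encoded along with the value.
echo -n 'password' | base64
# cGFzc3dvcmQ=
```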
**Performance improvements**

We try to create 5 conda environments; to the fifth environment we add scikit-learn.

- Current develop branch: 5 minutes 11 seconds
- Default KEDA: 4 minutes 29 seconds
- `minReplicaCount` set to 1 (default is 0): 2 minutes 35 seconds
- `minReplicaCount` set to 1 (default is 0) + polling interval of 15 seconds (default is 30 seconds): 4 minutes 14 seconds
- `pollingInterval: 5` and `minReplicaCount: 1`, tracking the building state as well
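The two knobs varied in these runs map onto fields of the KEDA spec; a fragment showing where they live (the rest of the spec is elided):

```yaml
spec:
  pollingInterval: 15   # how often KEDA checks the queue (default: 30 seconds)
  minReplicaCount: 1    # keep one warm worker instead of scaling to zero (default: 0)
```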
Feature description
Currently conda-store is configured to allow 4 simultaneous builds. This becomes a bottleneck once multiple environments start being built at the same time, and presents a scaling challenge. If we set simultaneous builds to 1 and autoscale based on queue depth, we should be able to handle scaling far more gracefully.
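If the per-worker concurrency is what caps builds at 4, it can presumably be lowered in conda-store's traitlets configuration so each pod handles one build and KEDA scales pods instead. The config key below is an assumption based on conda-store's traitlets-based settings; verify against the conda-store docs:

```python
# conda_store_config.py -- assumed traitlets configuration for the worker
c.CondaStoreWorker.concurrency = 1  # one build per worker pod; scale pods via KEDA
```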
Value and/or benefit
Having the conda-store workers autoscale based on queue depth will allow larger orgs to take advantage of Nebari without hitting scale bottlenecks.
Anything else?
https://learnk8s.io/scaling-celery-rabbitmq-kubernetes