What happened:
With a barebones install of the latest Volcano, the scheduler cannot keep up with a reasonable amount of load. Submitting a relatively small number of jobs/pods (~20 jobs × 100 pods), which kube-scheduler handles easily, causes Volcano to lock up and stop working properly.
The root cause seems to be job/pod admission validation webhooks timing out. The admission controller emits many log lines like the one below, which point to client-side throttling of its API requests:
Waited for 19.550979481s due to client-side throttling, not priority and fairness, request: GET:https://<ip>:443/apis/scheduling.volcano.sh/v1beta1/namespaces/<namespace>/podgroups/0-test--108-3bdcc5aa-dfb7-4150-a472-deb9f3b35c99
What you expected to happen:
Volcano can schedule 2000+ pods without locking up.
How to reproduce it (as minimally and precisely as possible):
Create a new job spec, then submit it more than 20 times (with different names) over a few minutes. The scheduler will start to lock up, fail to schedule the majority of the pods (or even move them to Pending), and will sometimes time out during validation webhooks.
Anything else we need to know?:
Environment:
Volcano Version: 1.8.2
Kubernetes version (use kubectl version): v1.29.0-eks-c417bb3
Cloud provider or hardware configuration: AWS EKS
OS (e.g. from /etc/os-release):
Kernel (e.g. uname -a):
Install tools:
Others:
It looks like there are a lot of places in the admission validation webhooks where Volcano issues Kubernetes API GETs/POSTs directly, rather than using a more performant mechanism such as informers. This is likely causing the slowdown. Does Volcano have any plans to migrate to informers in these areas?
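To illustrate the difference, here is a minimal stdlib-only sketch of the idea behind informers: a local cache kept current by watch events, so that lookups during admission never leave the process. This is not Volcano's code and not the client-go API; a real implementation would presumably use client-go's shared informers and listers. The type `podGroupStore` and its methods are hypothetical names chosen for this example.

```go
package main

import (
	"fmt"
	"sync"
)

// podGroupStore is a hypothetical stand-in for an informer-backed lister:
// watch events update a local map, and reads are served from memory.
type podGroupStore struct {
	mu    sync.RWMutex
	items map[string]string // key: namespace/name, value: phase (illustrative)
}

func newPodGroupStore() *podGroupStore {
	return &podGroupStore{items: make(map[string]string)}
}

// onUpsert would be called from a watch event handler on add/update.
func (s *podGroupStore) onUpsert(key, phase string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.items[key] = phase
}

// onDelete would be called from a watch event handler on delete.
func (s *podGroupStore) onDelete(key string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.items, key)
}

// get answers a lookup from the local map, so it makes no API request
// and is not subject to client-side request throttling.
func (s *podGroupStore) get(key string) (string, bool) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	phase, ok := s.items[key]
	return phase, ok
}

func main() {
	store := newPodGroupStore()
	store.onUpsert("default/job-a", "Pending")

	// 2000 admission-time lookups, none of which touch the API server.
	for i := 0; i < 2000; i++ {
		if _, ok := store.get("default/job-a"); !ok {
			panic("expected cached entry")
		}
	}
	fmt.Println("served 2000 lookups from the local cache")
}
```

The trade-off is that a cache like this can briefly lag behind the API server, so any validation that needs a strictly up-to-date read would still have to be handled separately.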