Extremely poor performance (client-side throttling)? #3318

SebMuir-Smith · 2024-02-06T05:20:47Z

What happened:
Using a barebones install of the latest Volcano, the scheduler cannot keep up with a moderate amount of load. Submitting a relatively small number of jobs/pods (~20*100), which kube-scheduler handles easily, causes Volcano to lock up and stop working properly.

The root cause seems to be the job/pod admission validation webhooks timing out. The admission controller emits many log lines like the one below, which may point to the underlying problem:

Waited for 19.550979481s due to client-side throttling, not priority and fairness, request: GET:https://<ip>:443/apis/scheduling.volcano.sh/v1beta1/namespaces/<namespace>/podgroups/0-test--108-3bdcc5aa-dfb7-4150-a472-deb9f3b35c99
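To illustrate how a wait of this size can arise, here is a minimal sketch of a token-bucket rate limiter, the general mechanism behind client-side throttling. The function name and the `qps`/`burst` values are illustrative assumptions, not Volcano's actual settings, and the model assumes every request arrives at the same instant.

```python
def throttle_waits(num_requests, qps, burst):
    """Return the wait in seconds for each request when all arrive at t=0.

    Hypothetical model: the bucket starts full with `burst` tokens and
    refills at `qps` tokens per second; each request consumes one token.
    """
    waits = []
    for i in range(num_requests):
        # The first `burst` requests use pre-filled tokens and do not wait.
        # Request i (0-indexed) beyond that needs (i + 1 - burst) refilled tokens.
        tokens_needed = i + 1 - burst
        waits.append(max(0.0, tokens_needed / qps))
    return waits


# Illustrative values only: 200 queued API calls, 10 QPS, burst of 20.
waits = throttle_waits(num_requests=200, qps=10.0, burst=20)
print(f"first request waits {waits[0]:.1f}s, last request waits {waits[-1]:.1f}s")
```

Under these assumptions the delay grows linearly with queue depth, so a modest backlog of API calls is enough to push individual requests past a webhook timeout.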

What you expected to happen:
Volcano can schedule 2000+ pods without locking up.

How to reproduce it (as minimally and precisely as possible):

  admission_resources:
    requests:
      cpu: 2000m
      memory: 8G
    limits:
      cpu: 2000m
      memory: 8G
  scheduler_resources:
    requests:
      cpu: 2000m
      memory: 8G
    limits:
      cpu: 2000m
      memory: 8G
  controller_resources:
    requests:
      cpu: 2000m
      memory: 8G
    limits:
      cpu: 2000m
      memory: 8G

Create a new job spec like:

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: test
spec:
  minAvailable: 0
  schedulerName: volcano
  maxRetry: 0
  tasks:
    - replicas: 1
      minAvailable: 1
      name: "ubuntu"
      template:
        metadata:
          name: web
        spec:
          schedulerName: volcano
          containers:
            - image: ubuntu
              imagePullPolicy: IfNotPresent
              name: ubuntu
              resources:
                limits:
                  memory: "1G"
                  cpu: "0.2"
              command:
                - "sleep"
                - "604800"
          restartPolicy: OnFailure
          terminationGracePeriodSeconds: 1

Then submit this job more than 20 times (with different names) over a few minutes. The scheduler will start to lock up, fail to schedule the majority of the pods (or even move them to Pending), and validation webhooks will sometimes time out.

Anything else we need to know?:

Environment:

Volcano Version: 1.8.2
Kubernetes version (use kubectl version): v1.29.0-eks-c417bb3
Cloud provider or hardware configuration: AWS EKS
OS (e.g. from /etc/os-release):
Kernel (e.g. uname -a):
Install tools:
Others:


SebMuir-Smith · 2024-02-06T22:47:00Z

It looks like there are many places in the admission validation webhooks where Volcano calls the Kubernetes API directly (GETs/POSTs) rather than using a more performant mechanism such as informers. This is likely causing the slowdown. Does Volcano have any plans to migrate to informers in these areas?
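As a rough illustration of why this matters, the sketch below compares a validator that fetches each podgroup from the API server with one that reads from a local, informer-like cache. Everything here (`FakeApiServer`, `LocalCache`, the podgroup names) is hypothetical and only models the call counts; it is not Volcano's or client-go's code.

```python
class FakeApiServer:
    """Stand-in for kube-apiserver that counts round trips (hypothetical)."""

    def __init__(self, podgroups):
        self._podgroups = dict(podgroups)
        self.calls = 0

    def get(self, name):
        self.calls += 1
        return self._podgroups.get(name)

    def list_all(self):
        self.calls += 1
        return dict(self._podgroups)


class LocalCache:
    """Informer-like store: one initial list, then updates pushed by events."""

    def __init__(self, api):
        self._store = api.list_all()

    def on_event(self, name, obj):
        # A real informer receives these from a watch; None models a delete.
        if obj is None:
            self._store.pop(name, None)
        else:
            self._store[name] = obj

    def get(self, name):
        return self._store.get(name)


def validate_direct(api, names):
    # One API round trip per admission request.
    return [api.get(n) is not None for n in names]


def validate_cached(cache, names):
    # Lookups are served from memory; no API round trips.
    return [cache.get(n) is not None for n in names]


names = [f"pg-{i}" for i in range(2000)]
podgroups = {n: {"phase": "Pending"} for n in names}

direct_api = FakeApiServer(podgroups)
validate_direct(direct_api, names)

cached_api = FakeApiServer(podgroups)
cache = LocalCache(cached_api)
validate_cached(cache, names)

print(f"direct: {direct_api.calls} API calls, cached: {cached_api.calls} API call")
```

In this model the direct path issues one request per pod, all of which pass through the client-side rate limiter, while the cached path issues a single list call up front.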

SebMuir-Smith · 2024-02-07T00:31:46Z

Confirmed that the main performance issues were resolved by removing the parts of the webhooks that call the k8s API directly.

Monokaix · 2024-02-19T07:57:27Z

Hi, you can try increasing the --kube-api-qps and --kube-api-burst parameters of the admission component to get better performance with kube-apiserver.
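For a rough sense of what values to try, the sketch below estimates the lowest QPS that keeps the worst-case throttling wait under a given limit. The helper `min_qps` is hypothetical, and it assumes a simple token-bucket model, one API call per admission request, all requests arriving at once, and a 10-second webhook timeout; real workloads will differ, so treat the result as a starting point.

```python
import math


def min_qps(num_requests, burst, max_wait_seconds):
    """Smallest integer QPS so the last queued request waits <= max_wait_seconds.

    Hypothetical sizing helper: the first `burst` requests are served
    immediately, and the remainder drain at `qps` requests per second.
    """
    queued = max(0, num_requests - burst)
    return math.ceil(queued / max_wait_seconds)


# Illustrative values: 2000 pods, burst of 100, 10-second timeout (assumed).
print(min_qps(num_requests=2000, burst=100, max_wait_seconds=10))
```

A higher burst shortens the backlog directly, while a higher QPS drains it faster; both raise the load on kube-apiserver, so they are worth increasing gradually.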

SebMuir-Smith added the kind/bug Categorizes issue or PR as related to a bug. label Feb 6, 2024
