Extremely poor performance (client-side throttling)? #3318

Open
SebMuir-Smith opened this issue Feb 6, 2024 · 3 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@SebMuir-Smith

What happened:
A barebones install of the latest Volcano cannot keep up with a reasonable amount of load. Submitting a relatively small number of jobs/pods (~20*100), which kube-scheduler handles easily, causes Volcano to lock up and stop working properly.

The root cause seems to be the job/pod admission validation webhooks timing out. The admission controller emits many logs like the one below, which may point to the underlying problem:

Waited for 19.550979481s due to client-side throttling, not priority and fairness, request: GET:https://<ip>:443/apis/scheduling.volcano.sh/v1beta1/namespaces/<namespace>/podgroups/0-test--108-3bdcc5aa-dfb7-4150-a472-deb9f3b35c99
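
For context on how waits of this size can arise: client-go rate-limits outgoing requests with a token bucket configured by a QPS and a burst value. The sketch below is a simplified model of that behaviour, not Volcano or client-go code. It assumes all requests arrive at once and the bucket starts full. The QPS/burst pairs are illustrative only, since the issue does not state which values the admission component was running with, and `throttle_wait_seconds` is a hypothetical helper name.

```python
def throttle_wait_seconds(n, qps, burst):
    """Approximate wait for the n-th request (1-indexed) in a token bucket.

    Assumes every request is queued at t=0 and the bucket starts with
    `burst` tokens, refilling at `qps` tokens per second.
    """
    if n <= burst:
        # The first `burst` requests consume the initial tokens immediately.
        return 0.0
    # Every later request waits for one more token to be refilled.
    return (n - burst) / qps


if __name__ == "__main__":
    # Illustrative settings only; the real values are not given in the issue.
    for qps, burst in [(5, 10), (50, 100), (500, 1000)]:
        wait = throttle_wait_seconds(2000, qps, burst)
        print(f"qps={qps} burst={burst}: request #2000 waits about {wait:.1f}s")
```

Under this model the wait grows linearly with the number of queued requests, so a webhook that issues one API call per admitted pod can exceed its timeout once a few thousand pods are in flight.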

What you expected to happen:
Volcano can schedule 2000+ pods without locking up.

How to reproduce it (as minimally and precisely as possible):

  admission_resources:
    requests:
      cpu: 2000m
      memory: 8G
    limits:
      cpu: 2000m
      memory: 8G
  scheduler_resources:
    requests:
      cpu: 2000m
      memory: 8G
    limits:
      cpu: 2000m
      memory: 8G
  controller_resources:
    requests:
      cpu: 2000m
      memory: 8G
    limits:
      cpu: 2000m
      memory: 8G

Create a new job spec like:

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: test
spec:
  minAvailable: 0
  schedulerName: volcano
  maxRetry: 0
  tasks:
    - replicas: 1
      minAvailable: 1
      name: "ubuntu"
      template:
        metadata:
          name: web
        spec:
          schedulerName: volcano
          containers:
            - image: ubuntu
              imagePullPolicy: IfNotPresent
              name: ubuntu
              resources:
                limits:
                  memory: "1G"
                  cpu: "0.2"
              command:
                - "sleep"
                - "604800"
          restartPolicy: OnFailure
          terminationGracePeriodSeconds: 1

Then submit this job more than 20 times (with different names) over a few minutes. The scheduler starts to lock up, fails to schedule the majority of the pods (or even move them to pending), and sometimes times out during the validation webhooks.
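
One possible way to produce the differently named copies is a small generator script. This is a sketch, not the script used for the original report: the `test-<i>` naming scheme, the count of 25, and the `render_jobs` helper are assumptions for illustration. It also emits all manifests in one go rather than spreading submissions over a few minutes.

```python
# Job manifest from the report, with only metadata.name turned into a placeholder.
JOB_TEMPLATE = """\
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: {name}
spec:
  minAvailable: 0
  schedulerName: volcano
  maxRetry: 0
  tasks:
    - replicas: 1
      minAvailable: 1
      name: "ubuntu"
      template:
        metadata:
          name: web
        spec:
          schedulerName: volcano
          containers:
            - image: ubuntu
              imagePullPolicy: IfNotPresent
              name: ubuntu
              resources:
                limits:
                  memory: "1G"
                  cpu: "0.2"
              command:
                - "sleep"
                - "604800"
          restartPolicy: OnFailure
          terminationGracePeriodSeconds: 1
"""


def render_jobs(count, prefix="test"):
    """Return `count` manifests that differ only in metadata.name."""
    return [JOB_TEMPLATE.format(name=f"{prefix}-{i}") for i in range(count)]


if __name__ == "__main__":
    # Join the manifests into one multi-document YAML stream on stdout.
    print("---\n".join(render_jobs(25)), end="")
```

If saved as `gen_jobs.py` (a hypothetical filename), the output can be piped to `kubectl apply -f -`.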

Anything else we need to know?:

Environment:

  • Volcano Version: 1.8.2
  • Kubernetes version (use kubectl version): v1.29.0-eks-c417bb3
  • Cloud provider or hardware configuration: AWS EKS
  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:
@SebMuir-Smith SebMuir-Smith added the kind/bug Categorizes issue or PR as related to a bug. label Feb 6, 2024
@SebMuir-Smith
Author

SebMuir-Smith commented Feb 6, 2024

It looks like there are many places in the admission validation webhooks where Volcano calls the Kubernetes API directly (GETs/POSTs) rather than using a more performant mechanism such as informers. This is likely causing the slowdown. Does Volcano have any plans to migrate to informers in these areas?
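
To illustrate the difference in API load, here is a toy model, not Volcano or client-go code. `FakeApiServer`, `validate_direct`, and `validate_cached` are hypothetical names, and the cached variant only approximates an informer: a real informer also keeps its local cache up to date through a watch, which this sketch leaves out.

```python
class FakeApiServer:
    """Stand-in for kube-apiserver that counts the requests it receives."""

    def __init__(self, podgroups):
        self._podgroups = dict(podgroups)
        self.requests = 0

    def get(self, name):
        self.requests += 1
        return self._podgroups.get(name)

    def list(self):
        self.requests += 1
        return dict(self._podgroups)


def validate_direct(api, names):
    """One GET per admission request: API load grows with the pod count."""
    return [api.get(name) is not None for name in names]


def validate_cached(api, names):
    """Informer-style: list once, then answer every lookup from memory."""
    cache = api.list()
    return [name in cache for name in names]


if __name__ == "__main__":
    names = [f"pg-{i}" for i in range(2000)]
    podgroups = {name: {} for name in names}

    direct_api = FakeApiServer(podgroups)
    validate_direct(direct_api, names)

    cached_api = FakeApiServer(podgroups)
    validate_cached(cached_api, names)

    print("direct lookups:", direct_api.requests, "API requests")
    print("cached lookups:", cached_api.requests, "API requests")
```

With per-request GETs, every admitted pod consumes a token from the client-side rate limiter. With a local cache, lookups do not touch the API server at all, so they cannot be throttled.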

@SebMuir-Smith
Author

Confirmed that the main performance issues were resolved by removing the parts of the webhooks that call the k8s API directly.

@Monokaix
Member

Hi, you can try increasing the --kube-api-qps and --kube-api-burst parameters of the admission component to get better performance with kube-apiserver.
