
[BUG] Coscheduling Timeout Cannot exceed 15m #1919

Open
ls-2018 opened this issue Feb 23, 2024 · 14 comments

@ls-2018
Contributor

ls-2018 commented Feb 23, 2024

The timeout has two settings:
1: the scheduler parameters
2: the pod declaration
Should we add logic so that the timeout period is not allowed to exceed 15m?

What happened:

Coscheduling's timeout can be set as a default parameter in the scheduler, and it can also be set on a pod via the gang.scheduling.koordinator.sh/waiting-time annotation. It determines how long coscheduling waits.

However, if that time exceeds 15 minutes, the upstream k8s scheduler removes the pod from waitingPods and it is never scheduled again.

Suppose gang a has a subgroup b, and pods of a have been created but pods of b have not. The pod of a enters waitingPods, waiting to be allowed and then bound. But if b takes longer to become ready than the framework allows, the pod of a is never bound.
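
For reference, the 15-minute ceiling comes from the upstream scheduler framework, which caps the wait duration of every Permit plugin (including Coscheduling). A simplified Go sketch of that behavior, not the exact upstream code:

package main

import (
    "fmt"
    "time"
)

// Simplified sketch of the upstream kube-scheduler framework behavior
// (illustrative, not the exact source): the wait returned by a Permit
// plugin is clamped to a hard 15-minute maximum; when that timer fires,
// the waiting pod is removed from waitingPods and rejected.
const maxPermitTimeout = 15 * time.Minute

func clampPermitTimeout(requested time.Duration) time.Duration {
    if requested > maxPermitTimeout {
        return maxPermitTimeout // anything above 15m is silently capped
    }
    return requested
}

func main() {
    // A gang waiting time of 3000s (50m) is effectively reduced to 15m,
    // which matches "rejected due to timeout after waiting 15m0s".
    fmt.Println(clampPermitTimeout(3000 * time.Second))
}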

What you expected to happen:

How to reproduce it (as minimally and precisely as possible):

1.yaml

apiVersion: scheduling.sigs.k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: a
  namespace: default
  annotations:
    "gang.scheduling.koordinator.sh/total-number": "10"
    "gang.scheduling.koordinator.sh/groups": '["b"]'
spec:
  scheduleTimeoutSeconds: 3000
  minMember: 1
---
apiVersion: v1
kind: Pod
metadata:
  name: pod-example1
  namespace: default
  labels:
    pod-group.scheduling.sigs.k8s.io: a
spec:
  schedulerName: koord-scheduler
  containers:
  - command:
    - "sleep"
    - "365d"
    image: busybox
    name: curlimage

2.yaml

apiVersion: scheduling.sigs.k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: b
  namespace: default
spec:
  scheduleTimeoutSeconds: 3000
  minMember: 1
---
apiVersion: v1
kind: Pod
metadata:
  name: pod-example2
  namespace: default
  labels:
    pod-group.scheduling.sigs.k8s.io: b
spec:
  schedulerName: koord-scheduler
  containers:
  - command:
    - sleep
    - 365d
    image: busybox
    name: curlimage

Anything else we need to know?:

Environment:

  • App version:
  • Kubernetes version (use kubectl version):
  • Install details (e.g. helm install args):
  • Node environment (for koordlet/runtime-proxy issue):
    • Containerd/Docker version:
    • OS version:
    • Kernel version:
    • Cgroup driver: cgroupfs/systemd
  • Others:
@ls-2018 ls-2018 added the kind/bug Create a report to help us improve label Feb 23, 2024
@ls-2018 ls-2018 changed the title from "[BUG] Coscheduling Timeout Cannot exceed 1500s" to "[BUG] Coscheduling Timeout Cannot exceed 15m" on Feb 23, 2024
@eahydra eahydra assigned ZiMengSheng and unassigned eahydra Feb 26, 2024
@ls-2018
Contributor Author

ls-2018 commented Feb 26, 2024

I look forward to your reply, and I would be happy to work with you to solve this problem.

@ZiMengSheng
Contributor

ZiMengSheng commented Mar 5, 2024

Can you exec kubectl describe pod pod-example1 -n default and give me the message about why pod-example1 is unschedulable?

@ZiMengSheng
Contributor

ZiMengSheng commented Mar 5, 2024

In your example, PodGroup a has configured scheduleTimeoutSeconds as 10, so in theory PodGroup a will time out after 10 seconds. However, in our current implementation, the timeout configuration of the PodGroup only means the maximum wait time since the first pod reaches the Permit stage; it is not persisted as PodGroup/pod status in the apiserver, and it does not by itself block the pod scheduling process. So could you give me more detail about why the pod is unschedulable?
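
In other words (an illustrative sketch with hypothetical names, mirroring the behavior described above rather than the actual plugin code), the timeout is only an in-memory deadline measured from when the first child pod reaches Permit:

package main

import (
    "fmt"
    "time"
)

// Illustrative only: these names simply mirror the description above,
// not the real Coscheduling plugin source.
type gang struct {
    waitTime        time.Duration // from scheduleTimeoutSeconds / the waiting-time annotation
    firstPermitTime time.Time     // set when the first child pod reaches the Permit stage
}

// expired reports whether the gang's wait deadline has passed. Nothing is
// persisted to the PodGroup or pod status in the apiserver, and this check
// by itself does not block the scheduling of later pods.
func (g *gang) expired(now time.Time) bool {
    return now.Sub(g.firstPermitTime) > g.waitTime
}

func main() {
    g := &gang{waitTime: 3000 * time.Second, firstPermitTime: time.Now()}
    fmt.Println(g.expired(time.Now())) // false until the deadline passes
}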

@ls-2018
Contributor Author

ls-2018 commented Mar 6, 2024

Sorry, there is an error in the yaml I provided; I will fix it later and provide more information.

@ls-2018
Contributor Author

ls-2018 commented Mar 7, 2024

➜ /Users/acejilam/Desktop/koordinator/bug/case git:(cn) ✗ 🐥 k delete -f .              
podgroup.scheduling.sigs.k8s.io "a" deleted
pod "pod-example1" deleted
podgroup.scheduling.sigs.k8s.io "b" deleted
pod "pod-example2" deleted
➜ /Users/acejilam/Desktop/koordinator/bug/case git:(cn) ✗ 🐥 k apply -f 1.yaml          
podgroup.scheduling.sigs.k8s.io/a created
pod/pod-example1 created
➜ /Users/acejilam/Desktop/koordinator/bug/case git:(cn) ✗ 🐥 date                  
Thu Mar  7 13:38:29 CST 2024
➜ /Users/acejilam/Desktop/koordinator/bug/case git:(cn) ✗ 🐥 sleep 2000 && k apply -f 1.yaml && date
^Z
[1]  + 79592 suspended  sleep 2000
➜ /Users/acejilam/Desktop/koordinator/bug/case git:(cn) ✗ 🐥 date
Thu Mar  7 14:06:39 CST 2024
➜ /Users/acejilam/Desktop/koordinator/bug/case git:(cn) ✗ 🐥 k apply -f 1.yaml && date
podgroup.scheduling.sigs.k8s.io/a unchanged
pod/pod-example1 unchanged
Thu Mar  7 14:06:46 CST 2024
➜ /Users/acejilam/Desktop/koordinator/bug/case git:(cn) ✗ 🐥 k apply -f 2.yaml && date
podgroup.scheduling.sigs.k8s.io/b created
pod/pod-example2 created
Thu Mar  7 14:06:55 CST 2024
➜ /Users/acejilam/Desktop/koordinator/bug/case git:(cn) ✗ 🐥 k describe pod pod-example1
Name:             pod-example1
Namespace:        default
Priority:         0
Service Account:  default
Node:             <none>
Labels:           pod-group.scheduling.sigs.k8s.io=a
Annotations:      <none>
Status:           Pending
IP:               
IPs:              <none>
Containers:
  curlimage:
    Image:      busybox
    Port:       <none>
    Host Port:  <none>
    Command:
      sleep
      365d
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-c9nvx (ro)
Conditions:
  Type           Status
  PodScheduled   False 
Volumes:
  kube-api-access-c9nvx:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age    From             Message
  ----     ------            ----   ----             -------
  Warning  FailedScheduling  13m    koord-scheduler  rejected due to timeout after waiting 15m0s at plugin Coscheduling
  Warning  FailedScheduling  13m    koord-scheduler  running PreFilter plugin "Coscheduling": %!!(MISSING)w(<nil>)
  Warning  FailedScheduling  8m16s  koord-scheduler  running PreFilter plugin "Coscheduling": %!!(MISSING)w(<nil>)
➜ /Users/acejilam/Desktop/koordinator/bug/case git:(cn) ✗ 🐥 

@ls-2018
Contributor Author

ls-2018 commented Mar 7, 2024

As long as you sleep for a while in between, you can reproduce it.

(screenshot attached)

@ls-2018
Contributor Author

ls-2018 commented Mar 7, 2024

/cc @ZiMengSheng

@ZiMengSheng
Contributor

Can you give me the scheduler log about why pod-example1 failed the Coscheduling PreFilter? The current PreFilter failure message is a little confusing due to a known kube-scheduler bug.

@ZiMengSheng
Contributor

ZiMengSheng commented Mar 22, 2024

I made a test and got the point. PodGroup default/a has a total number of 10 and a min number of 1.

With totalChildrenNum's help, when the last pod arrives and makes all of childrenScheduleRoundMap's values equal to scheduleCycle, the gang's scheduleCycle is incremented by 1, which means a new schedule cycle.

In our example, pod-example1 gets rejected due to the timeout of waiting for PodGroup b. The schedule cycle of pod-example1 is advanced to 1 after PreFilter. When pod-example1 enters the scheduling cycle the next time, the gang's schedule cycle won't advance, because the number of children whose schedule cycle equals the gang's schedule cycle is one, which is less than totalChildrenNum; thus PreFilter fails.

A new schedule cycle will never arrive until you submit enough children of PodGroup a. So could you just submit all of the children?
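
A rough sketch of the cycle gating described above (identifiers are illustrative and mirror this comment rather than the exact plugin source):

package main

import "fmt"

// Illustrative sketch of the schedule-cycle gating described above;
// identifiers are hypothetical.
type gangState struct {
    totalChildrenNum         int            // gang.scheduling.koordinator.sh/total-number (10 in this issue)
    scheduleCycle            int            // the gang's current schedule cycle
    childrenScheduleRoundMap map[string]int // child pod -> cycle of its last scheduling attempt
}

// recordAttempt marks the pod's attempt in the current cycle and advances
// the gang's cycle only after all totalChildrenNum children have caught up.
// With total-number=10 but only one pod submitted, the cycle never
// advances, so a pod rejected by the Permit timeout keeps failing PreFilter.
func (g *gangState) recordAttempt(podName string) {
    g.childrenScheduleRoundMap[podName] = g.scheduleCycle
    caughtUp := 0
    for _, cycle := range g.childrenScheduleRoundMap {
        if cycle >= g.scheduleCycle {
            caughtUp++
        }
    }
    if caughtUp >= g.totalChildrenNum {
        g.scheduleCycle++ // a new schedule cycle starts only here
    }
}

func main() {
    g := &gangState{totalChildrenNum: 10, childrenScheduleRoundMap: map[string]int{}}
    g.recordAttempt("pod-example1")
    fmt.Println(g.scheduleCycle) // stays 0: the other nine children never arrive
}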

@ls-2018
Contributor Author

ls-2018 commented Mar 25, 2024

@ZiMengSheng In group a I specified minMember to be 1. If I still need to submit more pods, that is not consistent with my expectation.

@ZiMengSheng
Contributor

ZiMengSheng commented Mar 25, 2024

OK, your opinion is right and welcome. There are some inconsistencies in the design. We need to fix them in the code and the design doc. Do you have the time and interest to fix it?

@ls-2018
Contributor Author

ls-2018 commented Apr 7, 2024

I'd love to fix it. But I don't have a specific idea of how best to fix it. We also want to hear from the community.

@eahydra
Member

eahydra commented Apr 7, 2024

I'd love to fix it. But I don't have a specific idea of how best to fix it. We also want to hear from the community.

Welcome to contribute! Just do it!

@ZiMengSheng ZiMengSheng added this to the v1.5 milestone May 7, 2024
@jasonliu747 jasonliu747 modified the milestones: v1.5, someday May 21, 2024
@jasonliu747
Member

@ls-2018 any updates? ;)
