
[BUG] Coscheduling Timeout Cannot exceed 15m #1919

Open
ls-2018 opened this issue Feb 23, 2024 · 14 comments

@ls-2018
Contributor

ls-2018 commented Feb 23, 2024

The timeout has two settings:
1: the scheduler parameters
2: the pod declaration
Should we add logic so that the timeout period is not allowed to exceed 15m?

What happened:

Coscheduling's timeout can be set as a default parameter in the scheduler, and it can also be set on a pod via the gang.scheduling.koordinator.sh/waiting-time annotation. It determines how long coscheduling waits.

However, if that time exceeds 15 minutes, the upstream k8s scheduler removes the pod from waitingPods and it is never scheduled again.

Suppose gang a has a subgroup b, and pods of a have been created but pods of b have not. The pod of a enters waitingPods, waiting to be allowed and then bound. But if b takes longer to become ready than the framework allows, the pod of a is never bound.
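
For reference, the 15-minute ceiling comes from the upstream scheduler framework, which caps the wait duration of every Permit plugin (including Coscheduling). A simplified Go sketch of that behavior, not the exact upstream code:

package main

import (
    "fmt"
    "time"
)

// Simplified sketch of the upstream kube-scheduler framework behavior
// (illustrative, not the exact source): the wait returned by a Permit
// plugin is clamped to a hard 15-minute maximum; when that timer fires,
// the waiting pod is removed from waitingPods and rejected.
const maxPermitTimeout = 15 * time.Minute

func clampPermitTimeout(requested time.Duration) time.Duration {
    if requested > maxPermitTimeout {
        return maxPermitTimeout // anything above 15m is silently capped
    }
    return requested
}

func main() {
    // A gang waiting time of 3000s (50m) is effectively reduced to 15m,
    // which matches "rejected due to timeout after waiting 15m0s".
    fmt.Println(clampPermitTimeout(3000 * time.Second))
}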

What you expected to happen:

How to reproduce it (as minimally and precisely as possible):

1.yaml

apiVersion: scheduling.sigs.k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: a
  namespace: default
  annotations:
    "gang.scheduling.koordinator.sh/total-number": "10"
    "gang.scheduling.koordinator.sh/groups": '["b"]'
spec:
  scheduleTimeoutSeconds: 3000
  minMember: 1
---
apiVersion: v1
kind: Pod
metadata:
  name: pod-example1
  namespace: default
  labels:
    pod-group.scheduling.sigs.k8s.io: a
spec:
  schedulerName: koord-scheduler
  containers:
  - command:
    - "sleep"
    - "365d"
    image: busybox
    name: curlimage

2.yaml

apiVersion: scheduling.sigs.k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: b
  namespace: default
spec:
  scheduleTimeoutSeconds: 3000
  minMember: 1
---
apiVersion: v1
kind: Pod
metadata:
  name: pod-example2
  namespace: default
  labels:
    pod-group.scheduling.sigs.k8s.io: b
spec:
  schedulerName: koord-scheduler
  containers:
  - command:
    - sleep
    - 365d
    image: busybox
    name: curlimage

Anything else we need to know?:

Environment:

  • App version:
  • Kubernetes version (use kubectl version):
  • Install details (e.g. helm install args):
  • Node environment (for koordlet/runtime-proxy issue):
    • Containerd/Docker version:
    • OS version:
    • Kernel version:
    • Cgroup driver: cgroupfs/systemd
  • Others:
@ls-2018 ls-2018 added the kind/bug Create a report to help us improve label Feb 23, 2024
@ls-2018 ls-2018 changed the title from "[BUG] Coscheduling Timeout Cannot exceed 1500s" to "[BUG] Coscheduling Timeout Cannot exceed 15m" on Feb 23, 2024
@eahydra eahydra assigned ZiMengSheng and unassigned eahydra Feb 26, 2024
@ls-2018
Contributor Author

ls-2018 commented Feb 26, 2024

I look forward to your reply, and I would be happy to work with you to solve this problem.

@ZiMengSheng
Contributor

ZiMengSheng commented Mar 5, 2024

Can you exec kubectl describe pod pod-example1 -n default and give me the message about why pod-example1 is unschedulable?

@ZiMengSheng
Contributor

ZiMengSheng commented Mar 5, 2024

In your example, PodGroup a has configured scheduleTimeoutSeconds as 10, so in theory PodGroup a will time out after 10 seconds. However, in our current implementation, the timeout configuration of the PodGroup only means the maximum wait time since the first pod reaches the Permit stage; it is not persisted as PodGroup/pod status in the apiserver, and it does not by itself block the pod scheduling process. So could you give me more detail about why the pod is unschedulable?
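
In other words (an illustrative sketch with hypothetical names, mirroring the behavior described above rather than the actual plugin code), the timeout is only an in-memory deadline measured from when the first child pod reaches Permit:

package main

import (
    "fmt"
    "time"
)

// Illustrative only: these names simply mirror the description above,
// not the real Coscheduling plugin source.
type gang struct {
    waitTime        time.Duration // from scheduleTimeoutSeconds / the waiting-time annotation
    firstPermitTime time.Time     // set when the first child pod reaches the Permit stage
}

// expired reports whether the gang's wait deadline has passed. Nothing is
// persisted to the PodGroup or pod status in the apiserver, and this check
// by itself does not block the scheduling of later pods.
func (g *gang) expired(now time.Time) bool {
    return now.Sub(g.firstPermitTime) > g.waitTime
}

func main() {
    g := &gang{waitTime: 3000 * time.Second, firstPermitTime: time.Now()}
    fmt.Println(g.expired(time.Now())) // false until the deadline passes
}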

@ls-2018
Contributor Author

ls-2018 commented Mar 6, 2024

Sorry, there is an error in the yaml I provided; I will fix it later and provide more information.

@ls-2018
Contributor Author

ls-2018 commented Mar 7, 2024

➜ /Users/acejilam/Desktop/koordinator/bug/case git:(cn) ✗ 🐥 k delete -f .              
podgroup.scheduling.sigs.k8s.io "a" deleted
pod "pod-example1" deleted
podgroup.scheduling.sigs.k8s.io "b" deleted
pod "pod-example2" deleted
➜ /Users/acejilam/Desktop/koordinator/bug/case git:(cn) ✗ 🐥 k apply -f 1.yaml          
podgroup.scheduling.sigs.k8s.io/a created
pod/pod-example1 created
➜ /Users/acejilam/Desktop/koordinator/bug/case git:(cn) ✗ 🐥 date                  
Thu Mar  7 13:38:29 CST 2024
➜ /Users/acejilam/Desktop/koordinator/bug/case git:(cn) ✗ 🐥 sleep 2000 && k apply -f 1.yaml && date
^Z
[1]  + 79592 suspended  sleep 2000
➜ /Users/acejilam/Desktop/koordinator/bug/case git:(cn) ✗ 🐥 date
Thu Mar  7 14:06:39 CST 2024
➜ /Users/acejilam/Desktop/koordinator/bug/case git:(cn) ✗ 🐥 k apply -f 1.yaml && date
podgroup.scheduling.sigs.k8s.io/a unchanged
pod/pod-example1 unchanged
Thu Mar  7 14:06:46 CST 2024
➜ /Users/acejilam/Desktop/koordinator/bug/case git:(cn) ✗ 🐥 k apply -f 2.yaml && date
podgroup.scheduling.sigs.k8s.io/b created
pod/pod-example2 created
Thu Mar  7 14:06:55 CST 2024
➜ /Users/acejilam/Desktop/koordinator/bug/case git:(cn) ✗ 🐥 k describe pod pod-example1
Name:             pod-example1
Namespace:        default
Priority:         0
Service Account:  default
Node:             <none>
Labels:           pod-group.scheduling.sigs.k8s.io=a
Annotations:      <none>
Status:           Pending
IP:               
IPs:              <none>
Containers:
  curlimage:
    Image:      busybox
    Port:       <none>
    Host Port:  <none>
    Command:
      sleep
      365d
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-c9nvx (ro)
Conditions:
  Type           Status
  PodScheduled   False 
Volumes:
  kube-api-access-c9nvx:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age    From             Message
  ----     ------            ----   ----             -------
  Warning  FailedScheduling  13m    koord-scheduler  rejected due to timeout after waiting 15m0s at plugin Coscheduling
  Warning  FailedScheduling  13m    koord-scheduler  running PreFilter plugin "Coscheduling": %!!(MISSING)w(<nil>)
  Warning  FailedScheduling  8m16s  koord-scheduler  running PreFilter plugin "Coscheduling": %!!(MISSING)w(<nil>)
➜ /Users/acejilam/Desktop/koordinator/bug/case git:(cn) ✗ 🐥 

@ls-2018
Contributor Author

ls-2018 commented Mar 7, 2024

As long as you sleep for a while in between, you can reproduce it.

(screenshot attached)

@ls-2018
Contributor Author

ls-2018 commented Mar 7, 2024

/cc @ZiMengSheng

@ZiMengSheng
Contributor

Can you give me the scheduler log about why pod-example1 failed the Coscheduling PreFilter? The current PreFilter failure message is a little confusing due to a known kube-scheduler bug.

@ZiMengSheng
Contributor

ZiMengSheng commented Mar 22, 2024

I made a test and got the point. PodGroup default/a has a total number of 10 and a min number of 1.

With totalChildrenNum's help, when the last pod arrives and makes all of childrenScheduleRoundMap's values equal to scheduleCycle, the gang's scheduleCycle is incremented by 1, which means a new schedule cycle.

In our example, pod-example1 gets rejected due to the timeout of waiting for PodGroup b. The schedule cycle of pod-example1 is advanced to 1 after PreFilter. When pod-example1 enters the scheduling cycle the next time, the gang's schedule cycle won't advance, because the number of children whose schedule cycle equals the gang's schedule cycle is one, which is less than totalChildrenNum; thus PreFilter fails.

A new schedule cycle will never arrive until you submit enough children of PodGroup a. So could you just submit all of the children?
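
A rough sketch of the cycle gating described above (identifiers are illustrative and mirror this comment rather than the exact plugin source):

package main

import "fmt"

// Illustrative sketch of the schedule-cycle gating described above;
// identifiers are hypothetical.
type gangState struct {
    totalChildrenNum         int            // gang.scheduling.koordinator.sh/total-number (10 in this issue)
    scheduleCycle            int            // the gang's current schedule cycle
    childrenScheduleRoundMap map[string]int // child pod -> cycle of its last scheduling attempt
}

// recordAttempt marks the pod's attempt in the current cycle and advances
// the gang's cycle only after all totalChildrenNum children have caught up.
// With total-number=10 but only one pod submitted, the cycle never
// advances, so a pod rejected by the Permit timeout keeps failing PreFilter.
func (g *gangState) recordAttempt(podName string) {
    g.childrenScheduleRoundMap[podName] = g.scheduleCycle
    caughtUp := 0
    for _, cycle := range g.childrenScheduleRoundMap {
        if cycle >= g.scheduleCycle {
            caughtUp++
        }
    }
    if caughtUp >= g.totalChildrenNum {
        g.scheduleCycle++ // a new schedule cycle starts only here
    }
}

func main() {
    g := &gangState{totalChildrenNum: 10, childrenScheduleRoundMap: map[string]int{}}
    g.recordAttempt("pod-example1")
    fmt.Println(g.scheduleCycle) // stays 0: the other nine children never arrive
}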

@ls-2018
Contributor Author

ls-2018 commented Mar 25, 2024

@ZiMengSheng In group a I specified minMember to be 1. If I still need to submit more pods, that is not consistent with my expectation.

@ZiMengSheng
Contributor

ZiMengSheng commented Mar 25, 2024

OK, your opinion is right and welcome. There are some inconsistencies in the design. We need to fix them in the code and the design doc. Do you have the time and interest to fix it?

@ls-2018
Contributor Author

ls-2018 commented Apr 7, 2024

I'd love to fix it. But I don't have a specific idea of how best to fix it. We also want to hear from the community.

@eahydra
Member

eahydra commented Apr 7, 2024

I'd love to fix it. But I don't have a specific idea of how best to fix it. We also want to hear from the community.

Welcome to contribute! Just do it!

@ZiMengSheng ZiMengSheng added this to the v1.5 milestone May 7, 2024
@jasonliu747 jasonliu747 modified the milestones: v1.5, someday May 21, 2024
@jasonliu747
Member

@ls-2018 any updates? ;)
