With the preempt or reclaim plugin, a high-priority pod cannot be placed on a node that meets the conditions for preemption #3329

Open
LivingCcj opened this issue Feb 26, 2024 · 10 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@LivingCcj

When the volcano scheduler enables the preempt or reclaim plugin, a high-priority pod is unable to preempt a low-priority pod. Although there are nodes that meet the preemption conditions, because one of these predicateFns returns a non-nil err, the potential node is ignored:

predicateFn := func(task *api.TaskInfo, node *api.NodeInfo) ([]*api.Status, error) {
    // Allows scheduling to nodes that are in Success or Unschedulable state after filtering by predicate.
    var statusSets util.StatusSets
    statusSets, err := ssn.PredicateFn(task, node)
    if err != nil {
        return nil, api.NewFitError(task, node, err.Error())
    }
    if statusSets.ContainsUnschedulableAndUnresolvable() || statusSets.ContainsErrorSkipOrWait() {
        return nil, api.NewFitError(task, node, statusSets.Message())
    }
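
For clarity on why this matters for preemption, here is a minimal sketch of the ordering discussed later in this thread, where the status codes are checked before err is treated as fatal. This is only an illustration, not the upstream implementation; ContainsUnschedulable is assumed to exist alongside the other StatusSets helpers shown above, and the success path is simplified.

    predicateFn := func(task *api.TaskInfo, node *api.NodeInfo) ([]*api.Status, error) {
        var statusSets util.StatusSets
        statusSets, err := ssn.PredicateFn(task, node)

        // Reject only the states that preemption cannot resolve.
        if statusSets.ContainsUnschedulableAndUnresolvable() || statusSets.ContainsErrorSkipOrWait() {
            return nil, api.NewFitError(task, node, statusSets.Message())
        }
        // A plain Unschedulable status keeps the node as a preemption candidate,
        // even if a plugin also returned a non-nil err (ContainsUnschedulable is an assumed helper).
        if err != nil && !statusSets.ContainsUnschedulable() {
            return nil, api.NewFitError(task, node, err.Error())
        }
        // The node stays a candidate: victims on it may be evicted to make room.
        return nil, nil
    }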

Environment:

  • Volcano Version: volcano v1.8.2
  • Kubernetes version (use kubectl version): v1.20.15
@LivingCcj LivingCcj added the kind/bug Categorizes issue or PR as related to a bug. label Feb 26, 2024
@LivingCcj LivingCcj changed the title from "when preempt or reclaim plugin, high priority pod can not be placed at some node which could be preempted" to "with preempt or reclaim plugin, the high priority pod can not be placed at some node which meet the conditions for preemption" Feb 26, 2024
@lowang-bh
Member

lowang-bh commented Feb 28, 2024

Could you please supply more information, such as the scheduler configmap, scheduler logs, and job configs?

@LivingCcj
Author

During preempt or reclaim, if a predicate function handler returns a status with the Unschedulable state but err is not nil, the potential node is ignored. According to the comment Allows scheduling to nodes that are in Success or Unschedulable state after filtering by predicate in predicateFn, the node should still be eligible for preemption.
Here is the volcano scheduler configmap:

apiVersion: v1
data:
  volcano-scheduler.conf: |
    actions: "enqueue, allocate, preempt, backfill"
    tiers:
    - plugins:
      - name: priority
      - name: gang
        enablePreemptable: false
      - name: conformance
      - name: cdp
    - plugins:
      - name: drf
        enablePreemptable: false
      - name: predicates
      - name: nodeorder
        arguments:
          nodeaffinity.weight: 5
      - name: binpack
        arguments:
          binpack.weight: 5
          binpack.cpu: 2
          binpack.memory: 1
kind: ConfigMap
metadata:
  name: volcano-scheduler-configmap
  namespace: volcano-system

@lowang-bh
Member

but the err is not nil

What is the error?

@LivingCcj
Author

Consider this scenario: an unscheduled pod requesting GPU resources is in the session of the preempt action, and one node holds some lower-priority pods; if a lower-priority pod were preempted, the unscheduled pod could be placed on that node. However, at the predicate stage the predicateStatus.Code is Unschedulable and err is not nil (refer to the code below), which leads to the potential node being ignored when filtering nodes for preemption.

for _, val := range api.RegisteredDevices {
    if dev, ok := node.Others[val].(api.Devices); ok {
        if dev == nil {
            predicateStatus = append(predicateStatus, &api.Status{
                Code:   devices.Unschedulable,
                Reason: "node not initialized with device" + val,
            })
            return predicateStatus, fmt.Errorf("node not initialized with device %s", val)
        }
        code, msg, err := dev.FilterNode(task.Pod)
        filterNodeStatus := &api.Status{
            Code:   code,
            Reason: msg,
        }
        if err != nil {
            return predicateStatus, err
        }
        if filterNodeStatus.Code != api.Success {
            predicateStatus = append(predicateStatus, filterNodeStatus)
            return predicateStatus, fmt.Errorf("plugin device filternode predicates failed %s", msg)
        }
    } else {
        klog.Warningf("Devices %s assertion conversion failed, skip", val)
    }
}

@Monokaix
Member

(quoting @LivingCcj's scenario and the device predicate snippet above)

It's truly a problem in vgpu preemption. I think we should not return an err here when vgpu resources are insufficient. If you're interested, you're welcome to fix it.
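
A minimal sketch of that idea, as an illustration only and not a committed patch: it replaces the error handling inside the for loop over api.RegisteredDevices quoted earlier, and it assumes dev.FilterNode also reports the insufficiency through its status code. The result is recorded as a status instead of being surfaced as a hard error, so the preempt/reclaim wrapper can keep the node as a candidate.

    code, msg, err := dev.FilterNode(task.Pod)
    filterNodeStatus := &api.Status{
        Code:   code,
        Reason: msg,
    }
    if filterNodeStatus.Code != api.Success {
        // Record "not enough gpu fitted" (and similar) as a status rather than
        // returning an error, so preemption can still consider this node once
        // low-priority victims are evicted.
        predicateStatus = append(predicateStatus, filterNodeStatus)
        continue
    }
    if err != nil {
        // Only treat err as fatal when no usable status code was reported.
        return predicateStatus, err
    }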

@Monokaix
Member

Same problem: #3186. We can fix it to resolve both of them.

@Monokaix
Member

@LivingCcj @lowang-bh You're welcome to fix this :)

@Monokaix
Member

This phenomenon was reproduced when vgpu resources are insufficient.
Here are the volcano scheduler logs:

I0319 02:51:05.281886       1 preempt.go:43] Enter Preempt ...
I0319 02:51:05.281895       1 job_info.go:728] job podgroup-f354bb74-7c3d-4429-aa92-3c02a7ab99ba/kubeflow actual: map[:1], ji.TaskMinAvailable: map[]
I0319 02:51:05.281913       1 preempt.go:65] Added Queue <default> for Job <kubeflow/podgroup-f354bb74-7c3d-4429-aa92-3c02a7ab99ba>
I0319 02:51:05.281925       1 job_info.go:728] job podgroup-ccffee3d-b1e2-4a94-9f4d-f15502dc3f77/kubeflow actual: map[:1], ji.TaskMinAvailable: map[]
I0319 02:51:05.281942       1 job_info.go:728] job podgroup-6ba7f409-d8c0-498f-a2ec-ec7b0c7f75fc/kubeflow actual: map[:1], ji.TaskMinAvailable: map[]
I0319 02:51:05.281973       1 predicates.go:384] pod(kubeflow/x-v1-76d645bc8c-8sr2m) affinity require information is nil, plugin InterPodAffinity is skipped
I0319 02:51:05.282004       1 predicate_helper.go:55] Considering Task <kubeflow/x-v1-76d645bc8c-8sr2m> on node <10.x.x.x>: <cpu 1000.00, memory 4294967296.00, volcano.sh/vgpu-number 2000.00> vs. <cpu 2750.00, memory 8095842304.00, ephemeral-storage 38644306266000.00, hugepages-1Gi 0.00, hugepages-2Mi 0.00>
I0319 02:51:05.282013       1 predicate_helper.go:55] Considering Task <kubeflow/x-v1-76d645bc8c-8sr2m> on node <10.x.x.x>: <cpu 1000.00, memory 4294967296.00, volcano.sh/vgpu-number 2000.00> vs. <cpu 28530.00, memory 126106681344.00, hugepages-2Mi 0.00, nstack/vcuda-core 0.00, nstack/vcuda-memory 0.00, nvidia.com/gpu 2000.00, volcano.sh/vgpu-number 17000.00, ephemeral-storage 482947890401000.00, hugepages-1Gi 0.00>
I0319 02:51:05.282064       1 predicate_helper.go:75] Predicates failed for task <kubeflow/x-v1-76d645bc8c-8sr2m> on node <10.x.x.x>: task kubeflow/x-v1-76d645bc8c-8sr2m on node 10.x.x.x fit failed: plugin TaintToleration predicates failed node(s) had untolerated taint {node-role.kubernetes.io/controlplane: true}
I0319 02:51:05.282078       1 predicates.go:505] pod(kubeflow/x-v1-76d645bc8c-8sr2m) affinity require information is nil, plugin InterPodAffinity is skip for node 10.x.x.x
I0319 02:51:05.282105       1 csi.go:210] "Could not find a CSI driver name or volume handle, not counting volume"
I0319 02:51:05.282125       1 device_info.go:152] DeviceSharing:Into FitInPod x-v1-76d645bc8c-8sr2m
I0319 02:51:05.282136       1 device_info.go:167] DeviceSharing:FitInPod successed
I0319 02:51:05.282143       1 device_info.go:183] 4pdvgpu DeviceSharing starts filtering pods x-v1-76d645bc8c-8sr2m
I0319 02:51:05.282153       1 utils.go:256] counts= [{2 NVIDIA 10240 101 0}]
I0319 02:51:05.282178       1 utils.go:350] Allocating device for container request {2 NVIDIA 10240 101 0}
I0319 02:51:05.282201       1 utils.go:353] Scoring pod 10240:101:0:2i1device:1
I0319 02:51:05.282223       1 utils.go:354] gs 1 = 11441 10250 2
I0319 02:51:05.282244       1 utils.go:353] Scoring pod 10240:101:0:2i0device:0
I0319 02:51:05.282268       1 utils.go:354] gs 0 = 11441 10240 1
E0319 02:51:05.282285      1 device_info.go:187] deviceSharing err= not enough gpu fitted on this node
I0319 02:51:05.282306       1 predicate_helper.go:75] Predicates failed for task <kubeflow/x-v1-76d645bc8c-8sr2m> on node <10.x.x.x>: task kubeflow/x-v1-76d645bc8c-8sr2m on node 10.x.x.x fit failed: not enough gpu fitted on this node

The vital line is: device_info.go:187] deviceSharing err= not enough gpu fitted on this node

@lowang-bh
Member

(quoting @Monokaix's reply above)

@archlitchi owns the vgpu code and is familiar with it. @Monokaix

@dmitsh

dmitsh commented May 8, 2024

I might be experiencing a similar issue.
My cluster has 4 GPU nodes.
First, I start a 4-node job with low priority, which gets scheduled and runs.
A little later I start two 2-node jobs with high priority.
I would expect the high-priority jobs to preempt the first job, but that doesn't happen.
Please refer to the attached files:
volcano.zip
/cc @k82cn
