[REP][Core]New scheduling feature: taints and tolerations #44

larrylian · 2023-08-28T09:00:45Z

We plan to introduce a Taints & Tolerations feature to isolate scarce resources, especially GPU nodes. Preventing ordinary CPU tasks from being scheduled on GPU nodes is a key requirement that many Ray users expect.

Key concepts:

If you don't want normal CPU tasks/actors to be scheduled on a GPU node, you can add a taint to the GPU node (node1) using the Ray CLI. For example:

# node1
ray start --taints='{"gpu_node":"true"}' --num-gpus=1

Normal CPU tasks, actors, and placement groups will not be scheduled on the GPU node.

The following actor and placement group will not be scheduled onto node1:

actor = Actor.options(num_cpus=1).remote()
pg = ray.util.placement_group(bundles=[{"CPU": 1}])

If you then want to schedule a GPU task onto the GPU node (node1), you can specify a toleration for the task.

The following actor and placement group can be scheduled onto node1:

actor = Actor.options(num_gpus=1, tolerations={"gpu_node": Exists()}).remote()
pg = ray.util.placement_group(bundles=[{"GPU": 1}], tolerations={"gpu_node": Exists()})
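
The matching rule implied by the two examples above can be modeled in plain Python. This is only a sketch of one possible semantics, not Ray's implementation: the `Exists` class, the `tolerates` helper, the equality check for plain string tolerations, and the rule that every taint on a node must be tolerated are all assumptions made for illustration.

```python
from typing import Dict, Optional


class Exists:
    """Hypothetical operator: tolerate a taint key regardless of its value."""


def tolerates(taints: Dict[str, str],
              tolerations: Optional[Dict[str, object]] = None) -> bool:
    """Return True if every taint on the node is covered by a toleration.

    Assumed rule: a node is eligible only when each of its taints is
    tolerated, either by an Exists() operator or by an equal value.
    """
    tolerations = tolerations or {}
    for key, value in taints.items():
        if key not in tolerations:
            return False  # untolerated taint: the node is off limits
        rule = tolerations[key]
        if isinstance(rule, Exists):
            continue  # key is tolerated whatever its value
        if rule != value:
            return False  # value-based toleration does not match
    return True


# node1 was started with the taint {"gpu_node": "true"}.
node1_taints = {"gpu_node": "true"}

cpu_actor_ok = tolerates(node1_taints)                          # no tolerations
gpu_actor_ok = tolerates(node1_taints, {"gpu_node": Exists()})  # tolerates the key
untainted_ok = tolerates({})                                    # node without taints
```

Under these assumptions the plain CPU actor is rejected by node1, the actor that tolerates `gpu_node` is accepted, and a node without taints accepts any work.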

You can also use taints to achieve node isolation.

If you want to isolate a node under memory pressure so that tasks are not scheduled onto it, you can use ray taint:

ray taint --node-id {node_id_1} --append '{"memory-pressure":"high"}'

New tasks, actors, and placement groups will then not be scheduled onto node1.

You can restore the node once the memory pressure on the node is reduced to a low level.

ray taint --node-id {node_id_1} --delete '{"memory-pressure":"high"}'

New tasks, actors, and placement groups can then be scheduled onto node1 again.
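
The isolate-and-restore cycle above can be pictured with a small in-memory model. This is a hedged sketch, not the proposed implementation: the `Node` class and its `append_taint`, `delete_taint`, and `accepts_new_work` methods are hypothetical names, and the model only covers new work that carries no tolerations.

```python
class Node:
    """Hypothetical stand-in for the taint state the scheduler keeps per node."""

    def __init__(self, node_id: str) -> None:
        self.node_id = node_id
        self.taints = {}  # taint key -> taint value

    def append_taint(self, key: str, value: str) -> None:
        # Models: ray taint --node-id ... --append '{key: value}'
        self.taints[key] = value

    def delete_taint(self, key: str, value: str) -> None:
        # Models: ray taint --node-id ... --delete '{key: value}'
        # Assumption: the taint is removed only if key and value both match.
        if self.taints.get(key) == value:
            del self.taints[key]

    def accepts_new_work(self) -> bool:
        # New tasks/actors/pgs without tolerations need a taint-free node.
        return not self.taints


node1 = Node("node_id_1")
before = node1.accepts_new_work()

node1.append_taint("memory-pressure", "high")
during = node1.accepts_new_work()

node1.delete_taint("memory-pressure", "high")
after = node1.accepts_new_work()
```

In this model the node accepts new work before the taint is appended, rejects it while the taint is present, and accepts it again once the taint is deleted. Work that was already running on the node is deliberately left out of the model.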

Signed-off-by: 稚鱼 <554538252@qq.com>

larrylian force-pushed the taints_torelation branch from 0014f31 to 3a254ee Compare August 28, 2023 09:11

larrylian requested review from scv119, jjyao, wumuzi520 and SongGuyang August 28, 2023 09:26

larrylian assigned jjyao Aug 28, 2023

larrylian force-pushed the taints_torelation branch 3 times, most recently from 31fcb64 to 9fcb558 Compare August 31, 2023 08:29

larrylian force-pushed the taints_torelation branch from 9fcb558 to a70bb7c Compare October 30, 2023 12:38

[REP][Core]Add taints and tolerations feature

e2c8159

Signed-off-by: 稚鱼 <554538252@qq.com>

larrylian force-pushed the taints_torelation branch from a70bb7c to e2c8159 Compare October 30, 2023 12:47
