
[Core][REP] GPU Memory awareness scheduling #47

Open

jonathan-anyscale wants to merge 7 commits into main
Conversation


@jonathan-anyscale commented Nov 19, 2023

The GPU memory scheduling prototype:
ray-project/ray#41147

Signed-off-by: Jonathan Nitisastro <jonathancn@anyscale.com>
@jonathan-anyscale marked this pull request as draft November 19, 2023 03:34
Signed-off-by: Jonathan Nitisastro <jonathancn@anyscale.com>
Signed-off-by: Jonathan Nitisastro <jonathancn@anyscale.com>
Signed-off-by: Jonathan Nitisastro <jonathancn@anyscale.com>
@jjyao (Contributor) left a comment


will continue

Signed-off-by: Jonathan Nitisastro <jonathancn@anyscale.com>
jonathan-anyscale and others added 2 commits December 2, 2023 20:20
Signed-off-by: Jonathan Nitisastro <jonathancn@anyscale.com>
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
@jonathan-anyscale marked this pull request as ready for review December 6, 2023 20:00
```python
# Request a fractional GPU with specified gpu_memory in bytes.
# Mutually exclusive with num_gpus.
@ray.remote(gpu_memory=1024 * 1024 * 1024)  # 1 GiB request
```
Contributor:

Can we support string-based syntactic sugar? Feels more pythonic that way (e.g., gpu_memory="3gb").

Author:

For now we just follow how memory is defined. I think the pythonic string support can be added separately, covering both gpu_memory and memory.
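
For illustration, a minimal sketch of what such string parsing might look like; parse_memory_string is a hypothetical helper written for this example, not part of Ray's API:

```python
import re

# Hypothetical helper (not part of Ray's API): convert strings like "3gb"
# or "512MiB" into a byte count that could be passed as gpu_memory.
def parse_memory_string(value: str) -> int:
    match = re.fullmatch(r"(\d+(?:\.\d+)?)\s*([kmgt]?i?b)", value.strip().lower())
    if match is None:
        raise ValueError(f"Cannot parse memory string: {value!r}")
    number, unit = float(match.group(1)), match.group(2)
    multipliers = {
        "b": 1,
        "kb": 1024, "kib": 1024,
        "mb": 1024**2, "mib": 1024**2,
        "gb": 1024**3, "gib": 1024**3,
        "tb": 1024**4, "tib": 1024**4,
    }
    return int(number * multipliers[unit])

assert parse_memory_string("3gb") == 3 * 1024**3
```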

```python
pg = placement_group([{"gpu_memory": 1024 * 1024, "CPU": 1}, {"GPU": 1}])
```

Contributor:

I think we need an observability section here, since this complicates the observability semantics.

  • How is it displayed in ray status?
    • For ray status, it should potentially display something like gpu_memory: 4 gpus (A10) * 3gb?
    • In ray status, if a task is scheduled with gpu_memory, are both the GPU and GPU memory values subtracted?
  • How is it displayed in resource_requirement in ray list tasks? Is it translated into num_gpus? Or does it only include gpu_memory? Or both?

Author:

In ray list nodes, it will be shown as GPU (resources left) * gpu_memory_per_gpu, where gpu_memory_per_gpu is the constant stored in the node label. ray status, ray list tasks, and ray.available_resources currently don't show GPU memory, but if we add it, it will be displayed the same way as in ray list nodes.

And yes, basically both the gpu and gpu_memory values are subtracted to show the remaining resources.
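
As a rough illustration of that accounting, here is a minimal sketch; the conversion rule (gpu_memory divided by the per-GPU memory stored on the node) is taken from this thread, and the concrete numbers are made up:

```python
# Assume a node with 4 GPUs, each with 24 GiB of memory (e.g. A10s).
gpu_memory_per_gpu = 24 * 1024**3
total_gpus = 4

# A task requests 6 GiB of gpu_memory; internally this becomes a GPU fraction.
requested_gpu_memory = 6 * 1024**3
gpu_fraction = requested_gpu_memory / gpu_memory_per_gpu  # 0.25 GPU

# Both views stay in sync: remaining GPU memory is just the remaining
# GPU count times the per-GPU constant stored in the node label.
remaining_gpus = total_gpus - gpu_fraction                  # 3.75
remaining_gpu_memory = remaining_gpus * gpu_memory_per_gpu  # 3.75 * 24 GiB
print(remaining_gpus, remaining_gpu_memory)
```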


```python
# Requesting 30GB of GPU memory from an A10 GPU with 24GB of memory.
# The task won't be able to be scheduled.
@ray.remote(gpu_memory=30 * 1024 * 1024 * 1024, accelerator_type="NVIDIA_TESLA_A10G")
```
Contributor:

If you have a 40GB GPU,

and you schedule one task with 20GB of gpu_memory,
and then schedule another task with num_gpus=1, would the second one fail to schedule?

Author:

Yes, the second one will fail, since the GPU remaining after scheduling the 20GB task will be 0.5.
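
Concretely, a small sketch of the arithmetic described in this thread (the 40GB figure comes from the question above):

```python
# A 40 GiB GPU; the first task requests 20 GiB of gpu_memory.
gpu_memory_per_gpu = 40 * 1024**3
task_gpu_memory = 20 * 1024**3

# gpu_memory is translated into a GPU fraction: 20 GiB / 40 GiB = 0.5 GPU.
used_gpu = task_gpu_memory / gpu_memory_per_gpu
remaining_gpu = 1.0 - used_gpu  # 0.5 GPU left on this device

# A second task asking for num_gpus=1 needs a whole GPU, but only 0.5 remains,
# so it cannot be scheduled on this node.
assert remaining_gpu < 1.0
```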


```python
# Requesting a fractional GPU with both num_gpus and gpu_memory is not allowed.
@ray.remote(gpu_memory=1024 * 1024 * 1024, num_gpus=0.5)  # raises a ValueError
```
Contributor:

Is it possible to express 2 GPUs using gpu_memory? Or is it not allowed?

Contributor:

Can you specify this in the REP?

Author:

It's not allowed, since only one of num_gpus or gpu_memory (1 GPU per request) can be specified in a request.


Could they both be allowed? If both num_gpus and gpu_memory are specified, then it would require that much memory on that many GPUs. num_gpus would default to 1, so not specifying it would get the behavior described above. It could be an error condition to specify a fractional value for num_gpus if also specifying gpu_memory. Thoughts?
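
For illustration only, a sketch of how that suggested combination might read; this reflects the commenter's hypothetical semantics, not the API proposed in this REP:

```python
import ray

# Hypothetical combined semantics suggested above (not part of this REP):
# require 8 GiB of GPU memory on each of 2 GPUs.
@ray.remote(num_gpus=2, gpu_memory=8 * 1024**3)
def train():
    ...

# Under the same suggestion, combining a fractional num_gpus with gpu_memory
# would be rejected:
# @ray.remote(num_gpus=0.5, gpu_memory=8 * 1024**3)  # would raise ValueError
```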
