
[Core] Add a new interface for submitting actor tasks in batches (Batch Remote) #31

Conversation

@larrylian (Contributor) commented May 13, 2023

Core Motivation:

  1. Improve the performance of calling actor tasks in batches.
  2. Implement Ray's native collective communication library on top of this interface. (This will be discussed in a separate REP.)

Current situation of batch calling actor tasks:

```python
actors = [WorkerActor.remote() for _ in range(400)]

# Repeatedly invoking remote() in this loop wastes a lot of performance.
for actor in actors:
    actor.compute.remote(args)
```

Using the new Batch Remote API, a single batched call can greatly improve performance:

```python
actors = [WorkerActor.remote() for _ in range(400)]
```

Plan 1:

```python
batch_remote_handle = ray.experimental.batch_remote(actors)
batch_remote_handle.compute.remote(args)
```

Plan 2:

```python
batch_remote_handle = ray.experimental.BatchRemoteHandle(actors)
batch_remote_handle.compute.remote(args)
```

The Batch Remote API can save the following performance costs (N is the number of actors):

  1. Eliminates (N-1) redundant parameter serializations.
  2. Eliminates (N-1) redundant puts of the parameters into the object store, for scenarios with large parameters.
  3. Eliminates (N-1) rounds of Python/C++ execution-layer switching and repeated parameter validation.
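The first two savings come from serializing (and putting) the arguments once instead of once per actor. A Ray-free sketch of that difference, using `pickle` as a stand-in for Ray's serializer (the counting class is illustrative, not a Ray internal):

```python
import pickle

class CountingSerializer:
    """Stand-in for Ray's argument serializer, counting how often it runs."""
    def __init__(self):
        self.calls = 0

    def serialize(self, obj):
        self.calls += 1
        return pickle.dumps(obj)

args = {"weights": list(range(1000))}
n_actors = 400

# Current behavior: every per-actor submission serializes the args again.
per_call = CountingSerializer()
for _ in range(n_actors):
    payload = per_call.serialize(args)

# Batch Remote: serialize once, then reuse the payload for all N submissions.
batched = CountingSerializer()
payload = batched.serialize(args)
for _ in range(n_actors):
    pass  # each submission reuses `payload`

print(per_call.calls, batched.calls)  # 400 1 -> (N-1) serializations saved
```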

[WIP][Core]Add batch remote api for batch submit actor task
ray-project/ray#35597

@jovany-wang (Contributor) left a comment:

Please add the advantages/disadvantages sections.

### General Motivation
**Core Motivation**:
1. Improve the performance of batch calling ActorTask.
2. Implement Ray's native collective communication library through this interface.
Contributor:

We should split this REP into two:

  1. a REP for batch remote in Ray Core
  2. a REP introducing RAY_NATIVE mode in the Ray collective library

Contributor (author):

Yes, this REP just adds the Batch Remote API.

## Summary
### General Motivation
**Core Motivation**:
1. Improve the performance of batch calling ActorTask.
Contributor:

We should describe the bottlenecks, such as frequent context switching between Python and C++, and serializing the same object multiple times.



### Should this change be within `ray` or outside?
This requires adding a new interface in Ray Core.
Contributor:

I believe this should be added as an experimental & internal API first:

```python
ray.experimental._batch_remote(actors).compute.remote(args)
```

Contributor (author):

Yes. At first, it should be placed in the experimental module.

@jovany-wang (Contributor):

I guess this REP also benefits RLlib's sampling and weight-syncing paths.

cc @gjoliver

@jjyao (Contributor) left a comment:

A high-level comment: can we do this optimization transparently, without introducing a new API? For example, cache the last args we saw; if the new args are the same, reuse the previously serialized object ref.
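The transparent-caching idea could be sketched as follows (pure Python, `pickle` standing in for Ray's serializer; `ArgCachingSubmitter` and its method names are hypothetical, not Ray's actual submission path):

```python
import pickle

class ArgCachingSubmitter:
    """Cache the most recently serialized arguments and reuse them
    when the caller submits the same args object again."""
    def __init__(self):
        self._last_args = None
        self._last_payload = None
        self.serializations = 0

    def submit(self, actor, method, args):
        # Identity check keeps the cache lookup cheap; a deep equality
        # check on large args could cost as much as serializing them.
        if args is not self._last_args:
            self._last_payload = pickle.dumps(args)
            self.serializations += 1
            self._last_args = args
        return (actor, method, self._last_payload)

submitter = ArgCachingSubmitter()
shared_args = {"weights": list(range(100))}
for actor_id in range(400):
    submitter.submit(actor_id, "compute", shared_args)

print(submitter.serializations)  # 1 -> serialized once, reused 399 times
```

As the author notes below, this removes only the redundant serialization; the per-call Python-to-C++ dispatch still happens N times.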

@larrylian (Contributor, author):

> A high-level comment: can we do this optimization transparently, without introducing a new API? For example, cache the last args we saw; if the new args are the same, reuse the previously serialized object ref.

@jjyao
The method you mention only reduces the cost of repeatedly serializing parameters; it cannot avoid the cost of frequently switching context between Python and C++.
See my performance test: even with no parameters, the Batch Remote API gives a 40% performance improvement.


@jovany-wang (Contributor):

Following up on the last offline meeting, please:

  1. Add a POC implementation to show the code structure.
  2. Update the failure section.

@larrylian (Contributor, author):

@scv119 @jjyao

  1. I have added a POC implementation PR to show the code structure: [WIP][Core] Add batch remote api for batch submit actor task ray#35597
  2. I have added a performance comparison chart with an ObjectRef as a parameter. When an ObjectRef is used, the batch remote approach shows a 3-4x performance improvement.
  3. I have added a "Failure & Exception Scenario" section.

cc @jovany-wang

Signed-off-by: 稚鱼 <554538252@qq.com>
Plan 1

```python
batch_remote_handle = ray.experimental.batch_remote(actors)
batch_remote_handle.compute.remote(args)
```
Contributor:

What happens if one of the actors fails (i.e., is killed or terminated)?

Contributor (author):

Failure & Exception Scenario:

1. Exceptions occur during parameter validation or preprocessing, before the batch submission of actor tasks.
Since these exceptions occur before any actor task is submitted, they can be handled by raising the specific error directly, as in the current behavior.

2. Some actors throw exceptions during the batch submission of actor tasks.
While traversing the actors and submitting their tasks in a loop, if one actor throws an exception during submission, the remaining submissions are aborted immediately and the exception is raised to the user.

Reason:
Submitting an actor task normally completes without any exception being thrown. If an error does occur, it is most likely a code issue that requires a fix.
This exception behavior is the same as the current foreach remote.
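The fail-fast semantics described in scenario 2 can be sketched in plain Python (names hypothetical; in the proposal the loop would run inside Ray's C++ core):

```python
class ActorSubmissionError(RuntimeError):
    """Raised when submitting a task to one actor in the batch fails."""

def batch_submit(actors, submit_one):
    """Submit a task to each actor in order; abort on the first failure.

    Tasks already submitted stay submitted; actors after the failing one
    are never reached, and the error is raised to the caller.
    """
    submitted = []
    for i, actor in enumerate(actors):
        try:
            submitted.append(submit_one(actor))
        except Exception as e:
            raise ActorSubmissionError(
                f"submission failed at actor index {i}; "
                f"{len(submitted)} task(s) were already submitted"
            ) from e
    return submitted

# Demo: the third actor is "dead" and fails on submission.
def submit_one(actor):
    if actor == "dead":
        raise ConnectionError("actor is dead")
    return f"ref-{actor}"

try:
    batch_submit(["a", "b", "dead", "c"], submit_one)
except ActorSubmissionError as e:
    print(e)  # submission failed at actor index 2; 2 task(s) were already submitted
```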


@rkooo567 (Contributor) commented Jun 20, 2023:

> Eliminates (N-1) rounds of Python/C++ execution-layer switching and repeated parameter validation.

Is there a flamegraph that backs this up? IIUC, the verification is done in Cython, and there is no such thing as "switching" (Cython is just C).

@larrylian larrylian requested a review from scv119 June 21, 2023 02:47
@larrylian (Contributor, author):

> Is there a flamegraph that backs this up? IIUC, the verification is done in Cython, and there is no such thing as "switching" (Cython is just C).

@rkooo567
The "switching" refers to the execution-context switch between Python and (Cython & C++).
I have verified that with the BatchRemote optimization, performance improves 2-3x in the scenario without parameters.

@rkooo567 (Contributor) commented Jun 28, 2023:

(Not a blocker.) That doesn't prove it is context-switching cost, though; I suspect it is something else. Cython is just C code, so there should be no Python <-> C++ context switching, IIUC. Is it different from Java <-> C++?

@jovany-wang (Contributor):

> (Not a blocker.) That doesn't prove it is context-switching cost, though; I suspect it is something else. Cython is just C code, so there should be no Python <-> C++ context switching, IIUC.

The most frequent context switching is between code holding the GIL and code running without it (`with gil` / `with nogil`).

@TRSWNCA commented May 20, 2024:

> That doesn't prove it is context-switching cost, though; I suspect it is something else.

I suppose the context switching happens on each iteration of the loop: the original approach runs the for loop in Python, whereas with the proposed interface the loop would run inside Ray Core, in C++.
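This effect can be illustrated with a pure-Python analogy (no Ray involved): dispatching N calls from a Python-level for loop pays per-iteration interpreter overhead, while a single call that loops internally (here simulated with `map`, whose loop runs in C) pays it once. This is only an analogy for the proposed batched dispatch, not Ray code.

```python
import time

def submit(task_arg):
    # Stand-in for the per-call dispatch work done for each actor task.
    return task_arg * 2

N = 50_000
args = list(range(N))

# Python-level loop: interpreter overhead on every iteration.
t0 = time.perf_counter()
results_loop = []
for a in args:
    results_loop.append(submit(a))
t_loop = time.perf_counter() - t0

# Single batched call: the loop over items runs inside C (map).
t0 = time.perf_counter()
results_batch = list(map(submit, args))
t_batch = time.perf_counter() - t0

assert results_loop == results_batch
print(f"python loop: {t_loop:.4f}s, batched (C loop): {t_batch:.4f}s")
```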
