Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature request: cudaStreamWaitEvent #1139

Open
Ruberik opened this issue Dec 15, 2023 · 7 comments · May be fixed by #1183
Open

Feature request: cudaStreamWaitEvent #1139

Ruberik opened this issue Dec 15, 2023 · 7 comments · May be fixed by #1183
Labels
feature A new feature (or feature request)

Comments

@Ruberik
Copy link
Contributor

Ruberik commented Dec 15, 2023

As my teammates and I move to multiple, big GPUs with many streams doing interdependent work simultaneously, synchronizing at the GPU level is introducing significant amounts of inefficiency. We'd like to be able to use cudaStreamWaitEvent to make a stream wait on an event. There's already RecordEvent, CreateEvent, DestroyEvent and QueryEvent, so I'm hoping this is a straightforward addition.

@MoFtZ MoFtZ added the feature A new feature (or feature request) label Dec 15, 2023
@MoFtZ
Copy link
Collaborator

MoFtZ commented Dec 15, 2023

API Proposal

  1. Add CreateEvent method to Accelerator class, returning new AcceleratorEvent class.
  2. Add RecordEvent and WaitForEvent methods to AcceleratorStream class.

@m4rs-mt what do you think?

@MoFtZ
Copy link
Collaborator

MoFtZ commented Dec 16, 2023

Alternate API Proposal

API Proposal

  1. Add AddEvent method to AcceleratorStream class, returning new AcceleratorEvent class.
  2. AddWaitForEvent method to AcceleratorStream class.

This removes the ability to re-record the event.

@MoFtZ
Copy link
Collaborator

MoFtZ commented Dec 16, 2023

Alternate API Proposal 2

Instead of exposing a new object, introduce an WaitForStream method between two streams.

Not sure how this will affect the lifetime of the native handles.

@Ruberik
Copy link
Contributor Author

Ruberik commented Dec 18, 2023

Both your API proposal and your first Alternate API Proposal seem reasonable. The first more closely mimics the underlying API, while the second strikes me as harder to screw up.

The "screw up" I'm thinking of is that when an event is created, I believe the only reasonable action to take with that event is to call RecordEvent() on it (that may, in fact, be the only action that doesn't produce an error). It thus makes some sense to combine creation with the call to RecordEvent(); but I'd be hesitant to make impossible something that's possible in the underlying API, even if most use cases are invalid.

Here's the "screw up" I'm imagining. Imagine a scenario with 8 streams, each of which has three tasks to complete, in order: A, B and C. No stream can start C until every stream's A is done.

It's natural to want to write code like this, but it fails:

var taskAFinishedEvents = Enumerable.Range(0, 8).Select(_ => acc.CreateEvent()).ToArray();
for (int i = 0; i < 8; i++) {
  DoTaskA(streams[i]);
  streams[i].RecordEvent(taskAFinishedEvents[i]);
  DoTaskB(streams[i]);
  for (int j = 0; j < 8; j++) {
    if (j != i) stream[i].WaitForEvent(taskAFinishedEvents[j]);
  }
  DoTaskC(streams[i]);
}

The correct code would look like this for your first API proposal:

var taskAFinishedEvents = new AcceleratorEvent[8];
for (int i = 0; i < 8; i++) {
  DoTaskA(streams[i]);
  taskAFinishedEvents[i] = acc.CreateEvent();
  streams[i].RecordEvent(taskAFinishedEvents[i]);
}
for (int i = 0; i < 8; i++) {
  DoTaskB(streams[i]);
  for (int j = 0; j < 8; j++) {
    if (j != i) stream[i].WaitForEvent(taskAFinishedEvents[j]);
  }
  DoTaskC(streams[i]);
}

The second proposal (Alternate API proposal) would simply replace two lines with taskAFinishedEvents[i] = streams[i].AddEvent();

I don't think I understand the third proposal (Alternate API proposal 2) well enough to know how it would solve this problem.

@MoFtZ
Copy link
Collaborator

MoFtZ commented Dec 18, 2023

Thanks for the feedback @Ruberik.

The first alternate proposal was to hide the concept of re-recording an event. Cuda supports it, but I'm not sure about OpenCL.

The second alternate proposal was to hide the concept of an AcceleratorEvent, since the only reason to have it is to establish a dependency between two streams. This places the burden of managing the underlying native objects on ILGPU.

In your example, it would look more like:

for (int i = 0; i < 8; i++) {
  DoTaskA(streams[i]);
  for (int j = 0; j < 8; j++) {
    if (j != i) stream[i].WaitForStream(stream[j]);
  }
}
for (int i = 0; i < 8; i++) {
  DoTaskB(streams[i]);
  DoTaskC(streams[i]);
}

@Ruberik
Copy link
Contributor Author

Ruberik commented Dec 18, 2023

@MoFtZ, I think that does something different from my code. In my example, a stream finishes A, then works on B, and only then must wait until other streams are done with A. In your code here, nobody can start B until everybody's done A.

BTW, I think there's a problem with that code. Don't you need to enqueue all the As before you can have them wait on each other, like this?

for (int i = 0; i < 8; i++) {
  DoTaskA(streams[i]);
}
for (int i = 0; i < 8; i++) {
  for (int j = 0; j < 8; j++) {
    if (j != i) stream[i].WaitForStream(stream[j]);
  }
}
for (int i = 0; i < 8; i++) {
  DoTaskB(streams[i]);
  DoTaskC(streams[i]);
}

You need three loops here, because you have to have enqueued A, but not B or C, when you call WaitForStream.

@MoFtZ
Copy link
Collaborator

MoFtZ commented Dec 18, 2023

Yes, you are correct. Perhaps that proposal is too simplified.

It would need to change the meaning from "wait for all Task A to finish before starting C" to "synchronise the start of all Task C".

@MoFtZ MoFtZ linked a pull request Mar 28, 2024 that will close this issue
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature A new feature (or feature request)
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants