
Allow submitting kernels to an existing CUDA stream #1389

Open
fknorr opened this issue Mar 3, 2024 · 6 comments
Labels
enhancement New feature or request

Comments

@fknorr
Contributor

fknorr commented Mar 3, 2024

In this issue I want to make the case for an API that allows submitting SYCL kernels directly onto existing CUDA streams (or equivalent native backend queue types). This would allow some applications to avoid synchronization round-trips to the host that result from calling event::wait or polling event::get_info, while still making use of ACpp's compiler frontend and libkernel / glue.

Motivation

I see the following use cases:

  • Celerity, which maintains its own execution DAG and manually submits cudaMemcpys to a pool of streams, but also needs to invoke SYCL kernels and perform auxiliary work like MPI operations. We currently need to delay the call to sycl::queue::submit until all manually-submitted dependencies have completed, and have to wait on the host until the SYCL kernel is complete before doing manual CUDA submissions. At the same time, we generally have no need for SYCL runtime features like dependency tracking or advanced scheduling.
  • Programs that are CUDA by architecture but are in the process of transitioning to SYCL and therefore would like to invoke SYCL kernels as part of their existing streams.

Ideal API purely from the Celerity perspective

In a fantasy world where we would not need to care about other people, use cases, or implementability, this would be the ideal API for Celerity:

auto invoke = hipsycl::bind([&](sycl::handler &cgh) {
    sycl::local_accessor<int, 1> local_memory(sycl::range(256), cgh);
    cgh.parallel_for(range, [=](sycl::nd_item it) { ... });
});
invoke(cuda_stream); // equivalent to CUDA kernel<<<threads>>>(params)
cudaMemcpy2DAsync(..., cuda_stream);
// ...
  • bind allows ACpp to set up any necessary data structures to minimize the latency of invoke later
  • local_accessor would be nice to have, but we could do without if there was an alternative way to specify the amount of local memory required and somehow get a pointer to it inside the kernel.
  • reduction support is also very optional

If there is no local_accessor but a simpler "raw local memory" API, the CGF indirection could be dropped and we would arrive at something like:

auto invoke = hipsycl::bind_parallel_for(range, num_local_memory_bytes,
        [](sycl::nd_item it, void *local_memory) { ... });

Maybe this can already be constructed from some internal / unstable interface? That might be good enough!

More realistic proposal?

From an earlier conversation with @illuhad on Discord: This feature could work in combination with instant kernel submissions (#1128) by passing the existing CUDA stream together with an inorder property to the queue constructor.

sycl::queue q({sycl::property::queue::inorder(cuda_stream)});
q.submit([&](sycl::handler &cgh) {
    sycl::local_accessor<int, 1> local_memory(sycl::range(256), cgh);
    cgh.parallel_for(range, [=](sycl::nd_item it) { ... });
});
// don't care about the returned event!
cudaMemcpy2DAsync(..., cuda_stream);
// ...

Restrictions like "no reductions" might still apply, that would be fine.

Alternatives

There is hipSYCL_enqueue_custom_operation, which allows single CUDA ops to become SYCL graph nodes. I'm however interested in the exact opposite: inserting single SYCL operations into an existing CUDA application.
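
For reference, the extension is used roughly like this (my sketch from the extension docs; the interop accessor and backend enum names are from memory, and dst/src/pitches are placeholders):

q.submit([&](sycl::handler &cgh) {
    cgh.hipSYCL_enqueue_custom_operation([=](sycl::interop_handle &h) {
        // Retrieve the native CUDA stream backing this queue and enqueue an
        // arbitrary CUDA operation onto it as part of the SYCL graph.
        auto stream = h.get_native_queue<sycl::backend::cuda>();
        cudaMemcpy2DAsync(dst, dst_pitch, src, src_pitch, width_bytes, height,
                          cudaMemcpyDeviceToDevice, stream);
    });
});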

@fknorr fknorr added the enhancement New feature or request label Mar 3, 2024
@illuhad
Collaborator

illuhad commented Mar 3, 2024

Celerity, which maintains its own execution DAG and manually submits cudaMemcpys to a pool of streams, but also needs to invoke SYCL kernels and perform auxiliary work like MPI operations. We currently need to delay the call to sycl::queue::submit until all manually-submitted dependencies have completed, and have to wait on the host until the SYCL kernel is complete before doing manual CUDA submissions. At the same time, we generally have no need for SYCL runtime features like dependency tracking or advanced scheduling.

Wait, so does Celerity only support execution on CUDA devices? What happens when someone wants to use other backends?

Why are you not using a pool of SYCL inorder queues instead? That could be more easily generalized to other backends, and boil down to the same thing. You could then submit your manual CUDA operations using the custom operation extension.
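
I.e. something along these lines (just a sketch; num_lanes, dev, range and the round-robin policy are placeholders):

// A pool of in-order queues; each in-order queue maps to a native backend
// queue (e.g. a CUDA stream).
std::vector<sycl::queue> pool;
for (std::size_t i = 0; i < num_lanes; ++i)
    pool.emplace_back(dev, sycl::property_list{sycl::property::queue::in_order{}});

// Pick a lane per operation, e.g. round-robin, and submit work to it.
sycl::queue &lane = pool[next_op++ % pool.size()];
lane.parallel_for(range, [=](sycl::item<1> it) { /* kernel */ });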

Why do you even need to manually submit CUDA operations in the first place? If there's something missing on the SYCL level that can be expressed in CUDA, it might be a better approach to just add that interface. If you need strided 2d memcopies with SYCL USM pointers like cudaMemcpy2DAsync as in your example, it would be more useful to add that on the SYCL layer, imo.

AdaptiveCpp does not do any fancy scheduling or dependency tracking if you use instant submission mode (and fulfill the requirements for instant submission), so you can have that already. SYCL event construction latency can be addressed using coarse-grained events if that is a problem.

While what you are asking for could probably be implemented easily using internals of AdaptiveCpp, I'm hesitant to add such functionality to the publicly supported API. Committing to a way in which existing CUDA streams are used can be really limiting for future developments.

@fknorr
Contributor Author

fknorr commented Mar 3, 2024

Wait, so does Celerity only support execution on CUDA devices? What happens when someone wants to use other backends?

No, but we currently have specializations for CUDA because it's our most prominent platform. Please think "native backend queue" whenever I say "CUDA stream" :)

Why are you not using a pool of SYCL inorder queues instead? That could be more easily generalized to other backends, and boil down to the same thing. You could then submit your manual CUDA operations using the custom operation extension.

Mhh, that might actually work together with enqueue_custom_operation! So inorder queues map to CUDA streams 1:1 and don't necessarily need to build a real DAG as long as they qualify for instant submission, right? And if that's true I shouldn't even need the extension to submit CUDA ops but would be able to call sycl::get_native(inorder_queue) for the same effect.

Why do you even need to manually submit CUDA operations in the first place? If there's something missing on the SYCL level that can be expressed in CUDA, it might be a better approach to just add that interface. If you need strided 2d memcopies with SYCL USM pointers like cudaMemcpy2DAsync as in your example, it would be more useful to add that on the SYCL layer, imo.

Yes, 2D/3D copies primarily, these are essential. Would be great to see those in SYCL, or as an Acpp extension!

In the future we would like to look into NCCL, and we're also experimenting with virtual-memory shenanigans using the driver API to support allocations that are larger than device memory, although that's not pressing right now.
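
For completeness, the driver-API approach we're experimenting with looks roughly like this (sketch only, no error handling; reserved_size, chunk_size, offset and device_ordinal are placeholders, and all sizes must be rounded up to the granularity reported by cuMemGetAllocationGranularity):

// Reserve a large virtual address range up front and back it with physical
// memory chunk by chunk, only when data actually needs to be resident.
CUdeviceptr va;
cuMemAddressReserve(&va, reserved_size, 0, 0, 0);

CUmemAllocationProp prop = {};
prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
prop.location.id = device_ordinal;

CUmemGenericAllocationHandle handle;
cuMemCreate(&handle, chunk_size, &prop, 0);
cuMemMap(va + offset, chunk_size, 0, handle, 0);

CUmemAccessDesc access = {};
access.location = prop.location;
access.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
cuMemSetAccess(va + offset, chunk_size, &access, 1);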

@illuhad
Collaborator

illuhad commented Mar 4, 2024

No, but we currently have specializations for CUDA because it's our most prominent platform. Please think "native backend queue" whenever I say "CUDA stream" :)

Ok, I see :)

Mhh, that might actually work together with enqueue_custom_operation! So inorder queues map to CUDA streams 1:1

Yes.

and don't necessarily need to build a real DAG as long as they qualify for instant submission, right?

Yes, by virtue of the definition of "qualifying for instant submission". An operation qualifies for instant submission in particular if only USM is used (no buffers) and everything is in-order. In that case, there is no need for the SYCL implementation to perform dependency analysis or build complex DAGs behind the scenes.
You can still have operations that depend on other operations explicitly using depends_on, but that is up to the user (Note: Creating a dependency on a non-instant operation causes loss of the instant submission property!).

The instant submission mode was in particular built for the stdpar model, which also assumes "everything is in-order, no complex scheduling needed".
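
In other words, a submission pattern like the following qualifies (minimal sketch; dev, n and host_ptr are placeholders):

// In-order queue + USM only + no buffers/accessors: the runtime can submit
// each operation instantly, without building a DAG behind the scenes.
sycl::queue q{dev, sycl::property::queue::in_order{}};

int *data = sycl::malloc_device<int>(n, q);
q.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) { data[i] = static_cast<int>(i[0]); });
q.memcpy(host_ptr, data, n * sizeof(int));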

And if that's true I shouldn't even need the extension to submit CUDA ops but would be able to call sycl::get_native(inorder_queue) for the same effect.

I don't think we implement sycl::get_native(inorder_queue) even for in-order queues. It's... complicated. So at least at the moment you'd still have to go through the extension.

Yes, 2D/3D copies primarily, these are essential. Would be great to see those in SYCL, or as an Acpp extension!

Right, I think this is a fairly large gap in the USM model at the moment, and something that has been on the backlog for some time. Intel's OpenCL USM extensions don't have 2d/3d memcopies, so we'll need some emulation there, but that is an Intel problem.

In the future we would like to look into NCCL, and we're also experimenting with virtual-memory shenanigans using the driver API to support allocations that are larger than device memory

How is this different from sycl::malloc_shared, which should also allow memory oversubscription?

@fknorr
Contributor Author

fknorr commented Mar 5, 2024

Note: Creating a dependency on a non-instant operation causes loss of the instant submission property!

Is there a way to assert that instant submission has taken place / should take place? Then we could also avoid calling dag::flush_async after every submission (which we currently have to do because we never wait(), only get_info() on the kernel events due to how our executor loop works).
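
For context, the polling in our executor loop is essentially this (sketch; ev, retire and node are placeholders):

// Poll instead of blocking: check whether the kernel's event has completed,
// and only then retire the node in our own DAG and submit its successors.
auto status = ev.get_info<sycl::info::event::command_execution_status>();
if (status == sycl::info::event_command_status::complete) {
    retire(node);
}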

Edit: Do custom_operations qualify for instant submission the same way as kernels do?

I don't think we implement sycl::get_native(inorder_queue) even for in-order queues. It's... complicated. So at least at the moment you'd still have to go through the extension.

I checked, indeed this doesn't work at the moment.

template <backend Backend>
typename backend_traits<Backend>::template native_type<queue>
get_native(const queue &sycl_object);

... is defined, but gives a type error when used.

DPC++ does have this feature (but not hipSYCL_enqueue_custom_operation, obviously), so there doesn't seem to be a portable way to use this between SYCL implementations that have a CUDA backend at the moment.

If it is not too much effort I can try making this work in Acpp.

Intel's OpenCL USM extensions don't have 2d/3d memcopies, so we'll need some emulation there, but that is an Intel problem.

We're just submitting 1D SYCL memcpys in a loop as a fallback. Very inefficient of course, but correct.
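
I.e. essentially this (sketch; dst, src, the pitches, width_bytes and height are placeholders):

// Fallback: emulate a strided 2D copy with one 1D memcpy per row. On an
// in-order queue the per-row copies execute back to back.
for (std::size_t row = 0; row < height; ++row) {
    q.memcpy(static_cast<char *>(dst) + row * dst_pitch,
             static_cast<const char *>(src) + row * src_pitch,
             width_bytes);
}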

Edit: I just saw that DPC++ has handler::ext_oneapi_memcpy2d.

How is this different from sycl::malloc_shared, which should also allow memory oversubscription?

Hmm having the driver/OS do the memory mapping implicitly is an interesting idea indeed!

Caveats (mostly just things I need to check myself after this):

  • what are the performance implications of doing the mapping page-by-page instead of en-bloc?
  • would this work if the virtual allocation size exceeds the host memory capacity (easily happens in a cluster setting)?
  • does this work with GPUdirect RDMA (very likely not)
  • cudaMallocManaged (which I assume malloc_shared does in Acpp) has a tendency to crash the driver / Linux kernel when Segfaults occur near mapped ranges

@illuhad
Collaborator

illuhad commented Mar 5, 2024

Is there a way to assert that instant submission has taken place / should take place? Then we could also avoid calling dag::flush_async after every submission (which we currently have to do because we never wait(), only get_info() on the kernel events due to how our executor loop works).

I don't think there's a programmatic way at the moment. You can manually verify it by running with ACPP_DEBUG_LEVEL=3. If it's submitting instantly, you will no longer see flush messages from dag_builder because it bypasses that layer.

Edit: Do custom_operations qualify for instant submission the same way as kernels do?

Yes, it should affect all kinds of operations equally.

We're just submitting 1D SYCL memcpys in a loop as a fallback. Very inefficient of course, but correct.

Yeah, we'll have to do something similar.

Edit: I just saw that DPC++ has handler::ext_oneapi_memcpy2d.

Interesting. I assume that for OpenCL they use a similar fallback like you do, and this only really does something different for Level Zero?

DPC++ does have this feature (but not hipSYCL_enqueue_custom_operation, obviously), so there doesn't seem to be a portable way to use this between SYCL implementations that have a CUDA backend at the moment.

It's fairly easy to create a common abstraction for both host task and custom operation. You can get the native queue in both cases from the interop handle. Of course, the host task is not going to be as efficient.
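
Roughly like this (sketch only; the __HIPSYCL__ detection macro and the backend enum names are from memory, so double-check them):

// Enqueue a native CUDA operation either via the custom-operation extension
// (stays asynchronous, no host round-trip) or via a standard SYCL host task.
template <typename NativeOp>
void enqueue_native(sycl::queue &q, NativeOp op) {
    q.submit([&](sycl::handler &cgh) {
#if defined(__HIPSYCL__)
        cgh.hipSYCL_enqueue_custom_operation([=](sycl::interop_handle &h) {
            op(h.get_native_queue<sycl::backend::cuda>());
        });
#else
        cgh.host_task([=](sycl::interop_handle h) {
            op(h.get_native_queue<sycl::backend::ext_oneapi_cuda>());
        });
#endif
    });
}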

If it is not too much effort I can try making this work in Acpp.

You'd need to introspect the execution hints that the queue uses. If the backend supports it, an inorder queue will have an additional hint that carries an inorder_executor. You'd then need to get the cuda_/hip_/ocl_/... queue from that. Not sure if there's an API for that. Then get the native queue from the <backend>_queue. So there are quite a few layers you need to pierce.
Then there might be special cases where we'd have to think about how to handle them. For example, the L0 queue currently assumes that it knows all operations that are submitted to it, so that it can set up dependencies accordingly (AFAIK L0 did not have actual in-order queues until recently).

would this work if the virtual allocation size exceeds the host memory capacity (easily happens in a cluster setting)?

I believe it should. IIRC it follows a first-touch allocation policy, as in the regular Linux virtual memory subsystem.

cudaMallocManaged (which I assume malloc_shared does in Acpp) has a tendency to crash the driver / Linux kernel when Segfaults occur near mapped ranges

Have never observed something like this before -- including in my work with stdpar, where basically every allocation of the entire program is malloc_shared/cudaMallocManaged. At least on NVIDIA, USM has always been rock-solid for me. AMD is a different story, but that's their problem.

what are the performance implications of doing the mapping page-by-page instead of en-bloc?

In my experience working with stdpar, perf is mostly indistinguishable from malloc_device, especially if you use it in conjunction with prefetches.

does this work with GPUdirect RDMA (very likely not)

I don't know.

@fknorr
Contributor Author

fknorr commented Mar 5, 2024

Interesting. I assume that for OpenCL they use a similar fallback like you do, and this only really does something different for Level Zero?

Seems like they're simply launching a copy kernel! As long as all allocations are at least SYCL host memory that would be a viable alternative.
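
For us that would look something like this (sketch; placeholder names, and it assumes both pointers are device-accessible USM):

// Emulate a strided 2D copy with a plain copy kernel.
char *dst_bytes = static_cast<char *>(dst);
const char *src_bytes = static_cast<const char *>(src);
q.parallel_for(sycl::range<2>{height, width_bytes}, [=](sycl::id<2> idx) {
    dst_bytes[idx[0] * dst_pitch + idx[1]] = src_bytes[idx[0] * src_pitch + idx[1]];
});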

It's fairly easy to create a common abstraction for both host task and custom operation. You can get the native queue in both cases from the interop handle. Of course, the host task is not going to be as efficient.

True, in that case it's not going to be much more efficient than waiting on the host and submitting a CUDA call manually.

You'd need to introspect the execution hints that the queue uses...

Hmm that sounds more complex than I hoped. Looks like doing the host task vs custom-enqueue trade-off is the easiest path forward.

Have never observed something like this before -- including in my work with stdpar, where basically every allocation of the entire program is malloc_shared/cudaMallocManaged. At least on NVIDIA, USM has always been rock-solid for me. AMD is a different story, but that's their problem.

That might be very hardware dependent...
