# Allow submitting kernels to an existing CUDA stream #1389

## Comments
Wait, so does Celerity only support execution on CUDA devices? What happens when someone wants to use other backends? Why are you not using a pool of SYCL in-order queues instead? That could be more easily generalized to other backends, and boil down to the same thing. You could then submit your manual CUDA operations using the custom operation extension.

Why do you even need to manually submit CUDA operations in the first place? If there's something missing on the SYCL level that can be expressed in CUDA, it might be a better approach to just add that interface, e.g. if you need strided 2D memcopies with SYCL USM pointers.

AdaptiveCpp does not do any fancy scheduling or dependency tracking if you use instant submission mode (and fulfill the requirements for instant submission), so you can have that already. SYCL event construction latency can be addressed using coarse-grained events if that is a problem.

While what you are asking for could probably be implemented easily using internals of AdaptiveCpp, I'm hesitant to add such functionality to the publicly supported API. Committing to a way in which existing CUDA streams are used can be really limiting for future developments.
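For reference, a minimal sketch of such a pool of in-order queues (the class name and shape are illustrative, not an existing API); on typical implementations each in-order queue maps onto one native backend queue such as a CUDA stream:

```c++
#include <sycl/sycl.hpp>
#include <cstddef>
#include <vector>

// Illustrative pool of in-order queues on a single device, handed out
// round-robin; each queue behaves like one native backend queue.
class inorder_queue_pool {
  std::vector<sycl::queue> queues_;
  std::size_t next_ = 0;

public:
  inorder_queue_pool(const sycl::device &dev, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
      queues_.emplace_back(dev, sycl::property::queue::in_order{});
  }

  sycl::queue &next() { return queues_[next_++ % queues_.size()]; }
};
```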
No, but we currently have specializations for CUDA because it's our most prominent platform. Please think "native backend queue" whenever I say "CUDA stream" :)
Mhh, that might actually work together with
Yes, 2D/3D copies primarily, these are essential. Would be great to see those in SYCL, or as an Acpp extension! In the future we would like to look into NCCL, and we're also experimenting with virtual-memory shenanigans using the driver API to support allocations that are larger than device memory, although that's not pressing right now.
Ok, I see :)
Yes.
Yes, by virtue of the definition of "qualifying for instant submission". An operation qualifies for instant submission in particular if there's only USM used (no buffers) and everything is in-order. In that case, there is no need for the SYCL implementation to perform dependency analysis or build complex DAGs behind the scenes. The instant submission mode was built in particular for the stdpar model, which also assumes "everything is in-order, no complex scheduling needed".
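A submission pattern that should qualify under these rules as I understand them (in-order queue, USM only, no buffers or accessors); treat the qualification itself as an assumption about ACpp internals:

```c++
#include <sycl/sycl.hpp>

int main() {
  // In-order queue and USM pointers only: no accessors or buffers, so the
  // runtime has no dependency analysis to do and no DAG to build.
  sycl::queue q{sycl::property::queue::in_order{}};
  float *data = sycl::malloc_device<float>(1024, q);

  q.parallel_for(sycl::range<1>{1024}, [=](sycl::id<1> i) { data[i] = 1.f; });
  q.parallel_for(sycl::range<1>{1024}, [=](sycl::id<1> i) { data[i] *= 2.f; });
  q.wait();

  sycl::free(data, q);
}
```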
I don't think we implement `get_native` for queues.
Right, I think this is a fairly large gap in the USM model at the moment, and something that has been on the backlog for some time. Intel's OpenCL USM extensions don't have 2d/3d memcopies, so we'll need some emulation there, but that is an Intel problem.
How is this different from
Is there a way to assert that instant submission has taken place / should take place? Then we could also avoid calling

Edit: Do
I checked, indeed this doesn't work at the moment.

```c++
template <backend Backend>
typename backend_traits<Backend>::template native_type<queue>
get_native(const queue &sycl_object);
```

... is defined, but gives a type error when used. DPC++ does have this feature (but not …).

If it is not too much effort I can try making this work in Acpp.
We're just submitting 1D SYCL memcpys in a loop as a fallback (see the sketch below). Very inefficient of course, but correct.

Edit: I just saw that DPC++ has `handler::ext_oneapi_memcpy2d`.
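The fallback mentioned above, roughly; the function name and pitch-based signature are illustrative:

```c++
#include <sycl/sycl.hpp>
#include <cstddef>

// Emulates a strided 2D copy with one 1D memcpy per row. Correct for any
// USM allocations, but pays per-submission overhead for every row.
void memcpy_2d_fallback(sycl::queue &q, void *dst, std::size_t dst_pitch,
                        const void *src, std::size_t src_pitch,
                        std::size_t width_bytes, std::size_t rows) {
  auto *d = static_cast<std::byte *>(dst);
  const auto *s = static_cast<const std::byte *>(src);
  for (std::size_t r = 0; r < rows; ++r)
    q.memcpy(d + r * dst_pitch, s + r * src_pitch, width_bytes);
  q.wait();
}
```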
Hmm, having the driver/OS do the memory mapping implicitly is an interesting idea indeed! Caveats (mostly just things I need to check myself after this):
I don't think there's a programmatic way at the moment. You can manually verify it by running with
Yes, it should affect all kinds of operations equally.
Yeah, we'll have to do something similar.
Interesting. I assume that for OpenCL they use a similar fallback to yours, and this only really does something different for Level Zero?
It's fairly easy to create a common abstraction for both host task and custom operation. You can get the native queue in both cases from the interop handle. Of course, the host task is not going to be as efficient.
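A sketch of such a common abstraction, assuming the CUDA backend and ACpp's `hipSYCL_enqueue_custom_operation` extension; the wrapper name `submit_native` is made up:

```c++
#include <sycl/sycl.hpp>
#include <cuda_runtime.h>

// Runs f(stream) either as a host task (portable, but occupies a host
// thread) or as a custom operation (submitted asynchronously as a graph
// node). Both paths obtain the native queue from the interop handle.
template <typename F>
void submit_native(sycl::queue &q, bool use_host_task, F f) {
  q.submit([&](sycl::handler &cgh) {
    if (use_host_task) {
      cgh.host_task([=](sycl::interop_handle ih) {
        f(ih.get_native_queue<sycl::backend::cuda>());
      });
    } else {
      cgh.hipSYCL_enqueue_custom_operation([=](sycl::interop_handle &ih) {
        f(ih.get_native_queue<sycl::backend::cuda>());
      });
    }
  });
}
```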
You'd need to introspect the execution hints that the queue uses. If the backend supports it, an inorder queue will have an additional hint that carries an
I believe it should. IIRC it follows a first-touch allocation policy, as in the regular Linux virtual memory subsystem.
Have never observed something like this before -- including in my work with stdpar, where basically every allocation of the entire program is
In my experience working with stdpar, perf is mostly indistinguishable from
I don't know.
Seems like they're simply launching a copy kernel! As long as all allocations are at least SYCL host memory that would be a viable alternative.
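For comparison, a strided copy expressed as a single kernel rather than a loop of memcpys (illustrative; both pointers must be device-accessible):

```c++
#include <sycl/sycl.hpp>
#include <cstddef>

// One kernel covers the whole 2D region: one work item per byte. Avoids
// per-row submissions, at the cost of a kernel launch and the requirement
// that both allocations are accessible from the device.
void copy_2d_kernel(sycl::queue &q, std::byte *dst, std::size_t dst_pitch,
                    const std::byte *src, std::size_t src_pitch,
                    std::size_t width_bytes, std::size_t rows) {
  q.parallel_for(sycl::range<2>{rows, width_bytes}, [=](sycl::id<2> idx) {
    dst[idx[0] * dst_pitch + idx[1]] = src[idx[0] * src_pitch + idx[1]];
  });
}
```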
True, that's not going to be a lot more efficient than waiting on the host and submitting a CUDA call manually then.
Hmm that sounds more complex than I hoped. Looks like doing the host task vs custom-enqueue trade-off is the easiest path forward.
That might be very hardware dependent...
## Issue description

In this issue I want to make the case for an API that allows submitting SYCL kernels directly onto existing CUDA streams (or equivalent native backend queue types). This would allow some applications to avoid synchronization round-trips to the host from calling `event::wait` or polling `event::get_info`, while still making use of ACpp's compiler frontend and libkernel / glue.

### Motivation
I see the following use cases:

- Celerity's executor submits `cudaMemcpy`s to a pool of streams but also needs to invoke SYCL kernels and perform auxiliary work like MPI operations. We currently need to delay the call to `sycl::queue::submit` until all manually-submitted dependencies have completed, and have to wait on the host until the SYCL kernel is complete before doing manual CUDA submissions (the sketch below illustrates this round-trip). At the same time we generally have no need for SYCL runtime features like dependency tracking or advanced scheduling.
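To make the round-trip concrete, a sketch of the status quo described above (names are hypothetical; `stream` is one of the manually managed CUDA streams):

```c++
#include <sycl/sycl.hpp>
#include <cuda_runtime.h>
#include <cstddef>

// Every switch between manual CUDA submissions and SYCL submissions
// currently forces a synchronization round-trip through the host.
void interleave(sycl::queue &q, cudaStream_t stream, float *a, float *b,
                std::size_t n) {
  cudaMemcpyAsync(a, b, n * sizeof(float), cudaMemcpyDeviceToDevice, stream);
  cudaStreamSynchronize(stream); // host wait before queue::submit
  sycl::event e =
      q.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) { a[i] *= 2.f; });
  e.wait(); // host wait before further manual CUDA submissions
  cudaMemcpyAsync(b, a, n * sizeof(float), cudaMemcpyDeviceToDevice, stream);
}
```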
### Ideal API purely from the Celerity perspective

In a fantasy world where we would not need to care about other people, use cases, or implementability, this would be the ideal API for Celerity:
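A sketch of the rough shape such an API could take; `bind`, `invoke`, and `bound_kernel` are hypothetical names, not anything ACpp provides:

```c++
#include <sycl/sycl.hpp>
#include <cuda_runtime.h>

namespace sycl_ext { // hypothetical extension namespace

// Opaque handle produced by bind(); owns whatever launch state the
// implementation needs (kernel object, launch configuration, ...).
template <typename CGF>
class bound_kernel;

// Prepares a command group function for repeated low-latency launches on
// an existing CUDA stream. All expensive setup happens here, once.
template <typename CGF>
bound_kernel<CGF> bind(cudaStream_t stream, CGF &&cgf);

// Enqueues the pre-bound kernel on its stream, comparable in cost to a
// raw cudaLaunchKernel: no dependency tracking, no scheduling.
template <typename CGF>
void invoke(bound_kernel<CGF> &kernel);

} // namespace sycl_ext
```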
`bind` allows ACpp to set up any necessary data structures to minimize the latency of `invoke` later. A `local_accessor` would be nice to have, but we could do without if there was an alternative way to specify the amount of local memory required and somehow get a pointer to it inside the kernel.

If there is no `local_accessor` but a simpler "raw local memory" API, the CGF indirection could be dropped and we would arrive at something like the variant sketched below. Maybe this can already be constructed from some internal / unstable interface? That might be good enough!
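The CGF-free variant could then look roughly like this (again purely hypothetical; local memory is requested as a byte count and handed to the kernel as a raw pointer):

```c++
#include <sycl/sycl.hpp>
#include <cuda_runtime.h>
#include <cstddef>

namespace sycl_ext { // hypothetical, continued from the sketch above

template <typename KernelFn>
class bound_kernel;

// Without a command group: range, local memory size, and the kernel
// functor are passed directly. The functor receives the nd_item plus a
// raw pointer to the requested local memory.
template <int Dims, typename KernelFn>
bound_kernel<KernelFn> bind(cudaStream_t stream, sycl::nd_range<Dims> range,
                            std::size_t local_mem_bytes, KernelFn &&fn);
// fn signature: void(sycl::nd_item<Dims>, void *local_mem)

} // namespace sycl_ext
```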
### More realistic proposal?
From an earlier conversation with @illuhad on Discord: this feature could work in combination with instant kernel submissions (#1128) by passing the existing CUDA stream together with an `inorder` property to the queue constructor (see the sketch below). Restrictions like "no reductions" might still apply, that would be fine.
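Expressed as code, the proposal might be used like this; `make_inorder_queue` is an invented placeholder, and #1128's actual design may differ:

```c++
#include <sycl/sycl.hpp>
#include <cuda_runtime.h>

namespace ext { // hypothetical extension surface, for illustration only

// Constructs an in-order queue whose submissions go to an existing CUDA
// stream; combined with instant submission (#1128), qualifying kernels
// would be enqueued on the stream without host round-trips.
sycl::queue make_inorder_queue(const sycl::device &dev, cudaStream_t stream);

} // namespace ext

void example(const sycl::device &dev, cudaStream_t stream) {
  sycl::queue q = ext::make_inorder_queue(dev, stream);
  q.single_task([] { /* kernel lands directly on the bound stream */ });
}
```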
### Alternatives
There is `hipSYCL_enqueue_custom_operation`, allowing single CUDA ops to become SYCL graph nodes (usage sketched below). I'm however interested in the exact opposite: inserting single SYCL operations into an existing CUDA application.
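For reference, a usage sketch of that extension based on my reading of the ACpp docs (details such as the exact backend enum value may differ):

```c++
#include <sycl/sycl.hpp>
#include <cuda_runtime.h>
#include <cstddef>

// Submits a manual CUDA operation as a node of the SYCL task graph. The
// runtime hands us the native queue it scheduled the node on.
void enqueue_native_copy(sycl::queue &q, void *dst, const void *src,
                         std::size_t bytes) {
  q.submit([&](sycl::handler &cgh) {
    cgh.hipSYCL_enqueue_custom_operation([=](sycl::interop_handle &ih) {
      cudaStream_t stream = ih.get_native_queue<sycl::backend::cuda>();
      cudaMemcpyAsync(dst, src, bytes, cudaMemcpyDeviceToDevice, stream);
    });
  });
}
```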