
Create SYCL events for submissions to native queues #13706

Open
fknorr opened this issue May 8, 2024 · 2 comments
Labels
enhancement New feature or request

fknorr commented May 8, 2024

Is your feature request related to a problem? Please describe

SYCL allows submitting native backend operations to (in-order) queues without paying the synchronization overhead of a host_task, by obtaining the native queue via sycl::get_native(queue):

sycl::queue q(device, sycl::property::queue::in_order{});
const auto evt1 = q.submit(/* some SYCL operation */);
auto cuda_stream = sycl::get_native<sycl::backend::ext_oneapi_cuda>(q);
cudaNativeFunctionAsync(..., cuda_stream); // does not wait for evt1 on the host, only on device!
const auto evt2 = ??; // how to obtain an event for the native submission?

As seen above, we can get a sycl::event for the SYCL operation / kernel submission, but not for the native CUDA submission. Such an event would however be desirable so that another operation (on a different queue) could specify a dependency on that exact submission, something that is not possible when manually calling cudaEventRecord.
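
For illustration, the kind of usage such an event would enable (hypothetical: evt2 does not exist today, and q2 / the kernel are placeholders):

sycl::queue q2(device, sycl::property::queue::in_order{});
q2.submit([&](sycl::handler &cgh) {
    cgh.depends_on(evt2);                // would cover the cudaNativeFunctionAsync call above
    cgh.parallel_for(/* some kernel */);
});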

Describe the solution you would like

Multiple ideas, in descending order of complexity:

  1. Support for AdaptiveCpp's enqueue_custom_operation (see the AdaptiveCpp docs); a rough usage sketch follows this list
  2. An extension function sycl::event sycl::queue::ext_record_event() that performs the equivalent of a cudaEventRecord on an in-order queue
  3. A working implementation of sycl::make_event<backend::ext_oneapi_cuda>(cudaEvent_t, context &)
  4. A pointer to whatever internal function needs to be called as a workaround in the meantime
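
A rough sketch of what option 1 could look like (hedged: the member and backend names are from memory of the AdaptiveCpp docs, and cudaNativeFunctionAsync is the placeholder from above):

auto evt2 = q.submit([&](sycl::handler &cgh) {
    // assumed AdaptiveCpp spelling; the runtime enqueues the lambda like a kernel,
    // so it can hand back a regular sycl::event covering the interop work
    cgh.AdaptiveCpp_enqueue_custom_operation([=](sycl::interop_handle &ih) {
        auto stream = ih.get_native_queue<sycl::backend::cuda>(); // AdaptiveCpp backend enum
        cudaNativeFunctionAsync(..., stream);                     // placeholder from above
    });
});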

Describe alternatives you have considered

I have attempted

sycl::event record_cuda_event(sycl::queue &queue) {
    const auto stream = sycl::get_native<sycl::backend::ext_oneapi_cuda>(queue);
    cudaEvent_t event;
    cudaEventCreateWithFlags(&event, cudaEventDisableTiming);
    cudaEventRecord(event, stream);
    // wrap the native event via the (internal, unstable) detail API
    return sycl::detail::make_event(sycl::detail::pi::cast<pi_native_handle>(event),
                                    queue.get_context(), sycl::backend::ext_oneapi_cuda);
}

but the returned event does not make progress when queried using event.get_info<sycl::info::event::command_execution_status>().

Using the official API

return sycl::make_event<sycl::backend::ext_oneapi_cuda>(event, context);

instead fails to compile with

include/sycl/backend.hpp:356:1: note: candidate template ignored: requirement 'detail::InteropFeatureSupportMap<sycl::backend::ext_oneapi_cuda>::MakeEvent == true' was not satisfied [with Backend = sycl::backend::ext_oneapi_cuda]

Additional context

Using host_task as a replacement is not desirable because it needs to wait (on the host) for the previous operations on the (in-order) queue to complete, negating the latency-hiding benefits of eagerly submitting device work in-order.
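
For reference, a minimal sketch of the host_task variant being ruled out here (assuming the SYCL 2020 interop_handle::get_native_queue API and reusing the cudaNativeFunctionAsync placeholder):

const auto evt2 = q.submit([&](sycl::handler &cgh) {
    cgh.host_task([=](sycl::interop_handle ih) {
        // this lambda only runs once all earlier in-order work has completed,
        // so the host stalls instead of enqueueing the native work eagerly
        auto stream = ih.get_native_queue<sycl::backend::ext_oneapi_cuda>();
        cudaNativeFunctionAsync(..., stream);
    });
});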

Please advise if there is any workaround using (non-portable / unstable) internal APIs at the moment to create an event from such a manual submission, or to convert (wrap) a cudaEvent_t to a sycl::event so that other in-order queues may wait on it.

The above code / workarounds were tried with DPC++ e330855 (May 7, 2024).

al42and commented May 17, 2024

An extension function sycl::event sycl::queue::ext_record_event() that performs the equivalent of a cudaEventRecord on an in-order queue

Have you considered using sycl_ext_oneapi_enqueue_barrier? sycl::queue::ext_oneapi_submit_barrier() does exactly that, as far as I can tell.
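
Roughly, the intended usage would be something like this (a sketch, assuming the extension's queue member):

cudaNativeFunctionAsync(..., cuda_stream);
const auto evt2 = q.ext_oneapi_submit_barrier(); // barrier event for everything submitted so far
other_queue.submit([&](sycl::handler &cgh) { cgh.depends_on(evt2); /* ... */ });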

fknorr commented May 19, 2024

Thanks, I've looked into the implementation of ext_oneapi_submit_barrier, and it appears to just return the event from the last SYCL submission made to the in-order queue. It does not seem to record a new event, meaning it is blind to my manual submission through the native queue.

I've found another extension with a promising name, sycl_ext_oneapi_in_order_queue_events, which provides queue::ext_oneapi_set_external_event - but the external event is again a sycl::event, and I have no way of creating one from a cudaEvent_t, so I'm back to the start.
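
To spell out the gap (a sketch; the wrapping step in the middle is exactly what is missing):

cudaEventRecord(cuda_event, cuda_stream);              // record on the native stream
sycl::event wrapped = /* no way to create this from cuda_event today */;
other_queue.ext_oneapi_set_external_event(wrapped);    // what the extension would then enable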

Thinking about this some more, I'm beginning to wonder whether my example code can be made thread-safe at all. As far as I understand, DPC++ is free to perform its own native-queue kernel submissions on a background thread and return control immediately, which means that even if CUDA streams themselves are / were thread-safe, I could experience spurious re-orderings between a kernel launch in q.submit() and my own subsequent operation on the native queue.

Having an equivalent to AdaptiveCpp's enqueue_custom_operation really seems like the best option for semantics that are both unambiguous to the user and allow the implementation to see all of the user's interop work.
