
Create SYCL events for submissions to native queues #13706

Open
fknorr opened this issue May 8, 2024 · 2 comments
Labels
enhancement New feature or request

fknorr commented May 8, 2024

Is your feature request related to a problem? Please describe

SYCL allows submitting native backend operations to (in-order) queues without paying the synchronization overhead of a host_task, by obtaining the native queue via sycl::get_native(queue):

sycl::queue q(device, sycl::property::queue::in_order{});
const auto evt1 = q.submit(/* some SYCL operation */);
auto cuda_stream = sycl::get_native<sycl::backend::ext_oneapi_cuda>(q);
cudaNativeFunctionAsync(..., cuda_stream); // does not wait for evt1 on the host, only on device!
const auto evt2 = ??; // how to obtain an event for the native submission?

As seen above, we can get a sycl::event for the SYCL operation / kernel submission, but not for the native CUDA submission. Such an event would however be desirable so that another operation (on a different queue) could specify a dependency on that exact submission, something that is not possible when manually calling cudaEventRecord.
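
For illustration, the kind of usage such an event would enable (hypothetical: evt2 does not exist today, and q2 / the kernel are placeholders):

sycl::queue q2(device, sycl::property::queue::in_order{});
q2.submit([&](sycl::handler &cgh) {
    cgh.depends_on(evt2);                // would cover the cudaNativeFunctionAsync call above
    cgh.parallel_for(/* some kernel */);
});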

Describe the solution you would like

Multiple ideas, in descending order of complexity:

  1. Support for AdaptiveCpp's enqueue_custom_operation (see the AdaptiveCpp docs); a rough usage sketch follows this list
  2. An extension function sycl::event sycl::queue::ext_record_event() that performs the equivalent of a cudaEventRecord on an in-order queue
  3. A working implementation of sycl::make_event<backend::ext_oneapi_cuda>(cudaEvent_t, context &)
  4. A pointer to whatever internal function needs to be called as a workaround in the meantime
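
A rough sketch of what option 1 could look like (hedged: the member and backend names are from memory of the AdaptiveCpp docs, and cudaNativeFunctionAsync is the placeholder from above):

auto evt2 = q.submit([&](sycl::handler &cgh) {
    // assumed AdaptiveCpp spelling; the runtime enqueues the lambda like a kernel,
    // so it can hand back a regular sycl::event covering the interop work
    cgh.AdaptiveCpp_enqueue_custom_operation([=](sycl::interop_handle &ih) {
        auto stream = ih.get_native_queue<sycl::backend::cuda>(); // AdaptiveCpp backend enum
        cudaNativeFunctionAsync(..., stream);                     // placeholder from above
    });
});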

Describe alternatives you have considered

I have attempted

sycl::event record_cuda_event(sycl::queue &queue) {
    const auto stream = sycl::get_native<sycl::backend::ext_oneapi_cuda>(queue);
    cudaEvent_t event;
    cudaEventCreateWithFlags(&event, cudaEventDisableTiming);
    cudaEventRecord(event, stream);
    // wrap the native event via the (internal, unstable) detail API
    return sycl::detail::make_event(sycl::detail::pi::cast<pi_native_handle>(event),
                                    queue.get_context(), sycl::backend::ext_oneapi_cuda);
}

but the returned event does not make progress when queried using event.get_info<sycl::info::event::command_execution_status>().

Using the official API

return sycl::make_event<sycl::backend::ext_oneapi_cuda>(event, context);

instead fails to compile with

include/sycl/backend.hpp:356:1: note: candidate template ignored: requirement 'detail::InteropFeatureSupportMap<sycl::backend::ext_oneapi_cuda>::MakeEvent == true' was not satisfied [with Backend = sycl::backend::ext_oneapi_cuda]

Additional context

Using host_task as a replacement is not desirable because it needs to wait (on the host) for the previous operations on the (in-order) queue to complete, negating the latency-hiding benefits of eagerly submitting device work in-order.
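
For reference, a minimal sketch of the host_task variant being ruled out here (assuming the SYCL 2020 interop_handle::get_native_queue API and reusing the cudaNativeFunctionAsync placeholder):

const auto evt2 = q.submit([&](sycl::handler &cgh) {
    cgh.host_task([=](sycl::interop_handle ih) {
        // this lambda only runs once all earlier in-order work has completed,
        // so the host stalls instead of enqueueing the native work eagerly
        auto stream = ih.get_native_queue<sycl::backend::ext_oneapi_cuda>();
        cudaNativeFunctionAsync(..., stream);
    });
});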

Please advise if there is any workaround using (non-portable / unstable) internal APIs at the moment to create an event from such a manual submission, or to convert (wrap) a cudaEvent_t to a sycl::event so that other in-order queues may wait on it.

The above code / workarounds were tried with DPC++ e330855 (May 7, 2024).

al42and commented May 17, 2024

An extension function sycl::event sycl::queue::ext_record_event() that performs the equivalent of a cudaEventRecord on an in-order queue

Have you considered using sycl_ext_oneapi_enqueue_barrier? sycl::queue::ext_oneapi_submit_barrier() does exactly that, as far as I can tell.
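
Roughly, the intended usage would be something like this (a sketch, assuming the extension's queue member):

cudaNativeFunctionAsync(..., cuda_stream);
const auto evt2 = q.ext_oneapi_submit_barrier(); // barrier event for everything submitted so far
other_queue.submit([&](sycl::handler &cgh) { cgh.depends_on(evt2); /* ... */ });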

fknorr commented May 19, 2024

Thanks, I've looked into the implementation of ext_oneapi_submit_barrier, and it appears to just return the event from the last SYCL submission made to the in-order queue. It does not seem to record a new event, meaning it is blind to my manual submission through the native queue.

I've found another extension with a promising name, sycl_ext_oneapi_in_order_queue_events, which provides queue::ext_oneapi_set_external_event - but the external event is again a sycl::event, and I have no way of creating one from a cudaEvent_t, so I'm back to the start.
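
To spell out the gap (a sketch; the wrapping step in the middle is exactly what is missing):

cudaEventRecord(cuda_event, cuda_stream);              // record on the native stream
sycl::event wrapped = /* no way to create this from cuda_event today */;
other_queue.ext_oneapi_set_external_event(wrapped);    // what the extension would then enable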

Thinking about this some more, I'm beginning to wonder whether my example code can be made thread-safe at all. As far as I understand, DPC++ is free to perform its own native-queue kernel submissions on a background thread and return control immediately, which means that even if CUDA streams themselves are / were thread-safe, I could experience spurious re-orderings between a kernel launch in q.submit() and my own subsequent operation on the native queue.

Having an equivalent to AdaptiveCpp's enqueue_custom_operation really seems like the best option for semantics that are both unambiguous to the user and allow the implementation to see all of the user's interop work.
