unordered map creation freezes async processes #350

Open
trsh opened this issue Mar 15, 2023 · 10 comments

@trsh

trsh commented Mar 15, 2023

Describe the bug
unordered map creation freezes async processes

Steps to reproduce

runBuldKernel<<<block_size_x, thread_size_x, 0, build_stream>>>(ng, object_size_ui);

// The line below would only complete when runBuldKernel is done
stdgpu::unordered_map<uint32_t, uint32_t> map = stdgpu::unordered_map<uint32_t, uint32_t>::createDeviceObject(8);

Expected behavior
The map creation and memory allocation should complete right away, without waiting for runBuldKernel to complete

Actual behavior
The map creation and memory allocation completes only after runBuldKernel is done

System (please complete the following information):

  • OS: Windows 11 x64
  • Compiler: MSVC Visual Studio 2022
  • Backend: CUDA 12
  • Library version: master
@trsh trsh added the bug label Mar 15, 2023
@trsh
Author

trsh commented Mar 15, 2023

runBuldKernel<<<block_size_x, thread_size_x, 0, build_stream>>>(ng, object_size_ui);

printf("1\n");
//stdgpu::unordered_map<uint32_t, uint32_t> map = stdgpu::unordered_map<uint32_t, uint32_t>::createDeviceObject(8);
Pointer* p;
cudaMalloc(&p, 1 * sizeof(Pointer));
printf("2\n");

This, in contrast, runs asynchronously: the allocation happens without waiting for runBuldKernel to complete.

@stotko
Owner

stotko commented Mar 15, 2023

This is a known limitation. Although the required parallel algorithms from thrust used in stdgpu as well as the intermediate interface in stdgpu all support arbitrary execution_policys (where a CUDA stream can be encapsulated), most functions (which also includes stdgpu::unordered_map::createDeviceObject) fall back to the default stream. Thus, the behavior you observe primarily comes from how the default stream is handled in CUDA, which by default is "legacy" behavior and forces synchronization.

I think adding explicit support for asynchronous streams would be a good enhancement. Until this feature lands in stdgpu, as a workaround you could possibly 1. move the creation of the map to an earlier stage if this is possible, or 2. enable "per-thread" behavior for the default stream which can be set with the --default-stream compiler flag.
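For reference, the second workaround is a build setting rather than a code change. A minimal sketch of the compiler invocation (assuming an nvcc build; `app.cu` is a hypothetical file name):

```shell
# Compile with per-thread default streams: the default stream then no
# longer synchronizes implicitly with explicitly created streams.
nvcc --default-stream per-thread -o app app.cu
```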

@trsh
Author

trsh commented Mar 15, 2023

1 is not possible. And I am not sure what 2 does; I need to read about it first so that it doesn't break something else.

@stotko
Owner

stotko commented Mar 15, 2023

For reference, #351 tracks all affected functions which currently do not have proper support for custom execution_policys such as thrust::device.on(stream).
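Such a policy would let each stdgpu function run its internal thrust algorithms on a caller-provided stream. A minimal sketch (hypothetical pointer `d_data` and size `n`; assuming thrust headers are available):

```cuda
#include <thrust/execution_policy.h>
#include <thrust/fill.h>

// Sketch: run a thrust algorithm on an explicit CUDA stream.
// The host still blocks until completion here, but the device work is
// ordered on `stream` instead of the default stream.
cudaStream_t stream;
cudaStreamCreate(&stream);
thrust::fill(thrust::device.on(stream), d_data, d_data + n, 0u);
cudaStreamDestroy(stream);
```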

@trsh
Author

trsh commented Mar 16, 2023

@stotko it doesn't seem like the default stream is the issue. The code below runs asynchronously:

runBuldKernel<<<block_size_x, thread_size_x>>>(ng, object_size_ui);
printf("1\n");
//stdgpu::unordered_map<uint32_t, uint32_t> map = stdgpu::unordered_map<uint32_t, uint32_t>::createDeviceObject(8);
Pointer* p;
cudaMalloc(&p, 1 * sizeof(Pointer));
printf("2\n");

cudaMalloc and printf("2\n") run right away, without waiting for runBuldKernel (I made it an infinite loop) to finish. As I did not specify a stream, this all goes to the default one.

When I uncomment the map part, it blocks, no matter what comes after it.

@stotko
Owner

stotko commented Mar 16, 2023

Thanks for further testing. I still believe that the issue is related to the default stream. Just to make sure, could you try calling another kernel on the default stream (it could be anything), while runBuldKernel uses build_stream as done before, so that you have the same setup described in the legacy default stream section?

In contrast to a pure cudaMalloc which does not block, stdgpu::unordered_map::createDeviceObject additionally also has to initialize its internal arrays which is done by calling kernels on the default stream.
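Roughly, the interaction looks like this (a sketch with hypothetical kernels `work` and `init`, assuming the legacy default stream):

```cuda
__global__ void work() { /* long-running kernel */ }
__global__ void init() { /* e.g. initialization of internal arrays */ }

// work() runs on an explicitly created stream:
work<<<grid, block, 0, build_stream>>>();

// init() is launched on the legacy default stream, so the GPU serializes
// it after work(); host code after this launch still returns immediately.
init<<<grid, block>>>();
```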

@trsh
Author

trsh commented Mar 16, 2023

runBuldKernel<<<block_size_x, thread_size_x, 0, build_stream>>>(ng, object_size_ui);
printf("1\n");
k_2<<<1, 1>>>();
printf("2\n");
stdgpu::unordered_map<uint32_t, uint32_t> map = stdgpu::unordered_map<uint32_t, uint32_t>::createDeviceObject(8);
printf("3\n");

k_2 is executed without waiting. And then it blocks on the map creation, i.e. only up to 2 is printed.

@stotko
Owner

stotko commented Mar 16, 2023

I have reproduced your observations. In fact, there are two issues:

  1. The legacy default stream forces a strict order on the execution of the kernels and performs implicit (weak) synchronization of the involved scheduled kernels but leaves the CPU asynchronous, hence the non-blocking CPU printf statements.
  2. thrust's algorithms are all synchronous since CUDA 10.1 unless the thrust::async versions are used. More recent versions, i.e. thrust 1.16 and newer, introduced the asynchronous policy thrust::cuda::par_nosync.on(stream) which would make the called (by default synchronous) algorithms behave like custom CUDA kernels without CPU blocking.

In that sense, you are right that my initial explanation was not sufficient. Fortunately, adding support for custom execution_policys would still resolve the issue, as the above policy could be used to force the correct behavior on thrust's side.
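The difference between the two policies, as a sketch (hypothetical pointer `d_data`, size `n`, and stream; assuming thrust 1.16 or newer):

```cuda
#include <thrust/execution_policy.h>
#include <thrust/fill.h>

// Synchronous policy on a custom stream: device work is ordered on
// `stream`, but the host blocks until the algorithm finishes.
thrust::fill(thrust::cuda::par.on(stream), d_data, d_data + n, 0u);

// Asynchronous policy (thrust >= 1.16): the call returns immediately,
// behaving like a plain kernel launch on `stream`.
thrust::fill(thrust::cuda::par_nosync.on(stream), d_data, d_data + n, 0u);
```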

@trsh
Author

trsh commented Mar 16, 2023

So there is currently no solution to make this happen in async?

@stotko
Owner

stotko commented Mar 16, 2023

If you are only concerned about the CPU blocking part and the stream ordering behavior is acceptable, then a workaround could be to create the unordered_map object in a different CPU thread, for instance using std::async. Then, the creation would block in the newly created thread while the main thread would continue normally.
