[FEA] Use SMs to submit small copies to prevent serialization on a busy copy engine #15620

Open
abellina opened this issue Apr 30, 2024 · 4 comments
Labels: feature request (New feature or request), Spark (Functionality that helps Spark RAPIDS)

Comments

abellina (Contributor) commented Apr 30, 2024

We have seen patterns where small cudaMemcpyAsync calls collide with large cudaMemcpyAsync calls being handled by the copy engine. Importantly, the small copy is on a different stream than the large copy. In the example below, we can see an H2D pinned copy of 51KB that was scheduled with a latency of 12ms because another pinned H2D copy (~200MB) is happening at the same time.

[screenshot 2024-04-30_10-35: profiler timeline showing the delayed 51KB copy]

The big issue behind this pattern is that Stream 30, in this case, serializes: nothing else will run in the stream until the small copy is done. When we invoke kernels in cuDF there is usually a pattern of small H2Ds, followed by the kernel invocation, then small D2Hs. Any of the pre/post copies around a kernel is a candidate for getting stuck, serializing all the work in that stream.

We have a PoC that uses thrust::copy_n to copy from pinned to device memory and vice versa using SMs instead of the copy engine, since kernels can directly touch pinned memory. With this approach, the small copy runs with much less latency and subsequent work in the stream is unblocked. This lets kernels run at the same time as large copies, which is a desirable pattern.
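A minimal sketch of that PoC approach, assuming the source buffer was allocated pinned (e.g. via cudaMallocHost); the helper name copy_pinned_to_device is ours for illustration, not cuDF's:

```cpp
// Sketch: copy pinned host memory to device with an SM kernel instead of
// the copy engine.
#include <thrust/copy.h>
#include <thrust/system/cuda/execution_policy.h>

#include <cuda_runtime.h>
#include <cstddef>

void copy_pinned_to_device(unsigned char const* pinned_src,  // from cudaMallocHost
                           unsigned char* device_dst,
                           std::size_t num_bytes,
                           cudaStream_t stream)
{
  // thrust::cuda::par.on(stream) runs the copy as a kernel on `stream`,
  // dereferencing the pinned buffer directly (valid under UVA). It competes
  // for SMs, not the copy engine, so it cannot get queued behind a large
  // copy-engine transfer issued on another stream.
  thrust::copy_n(thrust::cuda::par.on(stream), pinned_src, num_bytes, device_dst);
}
```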

This issue tracks that work and the question of how to bring these changes to cuDF in a configurable way, so we don't affect serial workloads; the effect is most prominent for parallel workloads such as Spark.

This is NDS q9 at 3TB. Looking at the CUDA HW row, we can see how the compute (blue) overlaps more often than not with pinned H2Ds (green).

Before:
[screenshot 2024-04-30_10-43: NDS q9 timeline before the change]

After:
[screenshot 2024-04-30_10-44: NDS q9 timeline after the change]

abellina added the feature request and Spark labels on Apr 30, 2024
bdice (Contributor) commented Apr 30, 2024

Are there any ways in which CCCL's cuda::memcpy_async could help us here? https://nvidia.github.io/cccl/libcudacxx/extended_api/asynchronous_operations/memcpy_async.html

I'm not super familiar with the facilities libcudacxx provides for this, but perhaps this could be in scope.
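For context, cuda::memcpy_async is primarily a device-side primitive: inside a kernel it overlaps a copy (typically global to shared memory) with compute, synchronized through a cuda::barrier. A rough sketch of the API's shape per the linked docs, not a drop-in for the H2D case above:

```cpp
#include <cooperative_groups.h>
#include <cuda/barrier>

// Stage `n` ints from global to shared memory asynchronously, then use them.
__global__ void staged_kernel(int const* in, int* out, int n)
{
  extern __shared__ int smem[];
  __shared__ cuda::barrier<cuda::thread_scope_block> bar;

  auto block = cooperative_groups::this_thread_block();
  if (block.thread_rank() == 0) { init(&bar, block.size()); }
  block.sync();

  // Collective async copy; completion is tracked by the barrier.
  cuda::memcpy_async(block, smem, in, sizeof(int) * n, bar);
  bar.arrive_and_wait();  // wait until the data has landed in shared memory

  if (block.thread_rank() < n) { out[block.thread_rank()] = smem[block.thread_rank()] + 1; }
}
```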

abellina (Contributor, Author) commented
@bdice I am not familiar with it either. A quick code search suggests that cuda::memcpy_async handles D2D copies (shared, global). I suppose we could treat pinned memory as another "device" memory type, so in that case I see some relation, but the main thing is that the code we use to orchestrate copies should likely fall back to cudaMemcpyAsync if pinned memory isn't available to bounce the copy through, or if the feature is disabled. I believe single-threaded (single-stream) applications may want to use cudaMemcpyAsync as-is, because there shouldn't be much contention on the CE in those cases.
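A hypothetical shape for that orchestration; sm_copy_enabled and is_pinned are illustrative placeholders, not existing cuDF APIs:

```cpp
#include <cuda_runtime.h>
#include <cstddef>

// Illustrative placeholders, not existing cuDF APIs:
bool sm_copy_enabled();         // e.g. driven by a config flag
bool is_pinned(void const* p);  // e.g. via cudaPointerGetAttributes
void copy_pinned_to_device(unsigned char const*, unsigned char*,
                           std::size_t, cudaStream_t);  // sketched earlier

// Use the SM-based path only when the source is pinned and the feature is
// enabled; otherwise fall back to the copy engine.
void copy_h2d(unsigned char const* src, unsigned char* dst,
              std::size_t num_bytes, cudaStream_t stream)
{
  if (sm_copy_enabled() && is_pinned(src)) {
    copy_pinned_to_device(src, dst, num_bytes, stream);
  } else {
    // Single-stream / unpinned cases: the copy engine is fine as-is.
    cudaMemcpyAsync(dst, src, num_bytes, cudaMemcpyHostToDevice, stream);
  }
}
```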

jrhemstad (Contributor) commented
I suspect many of the small copies in cuDF come from rmm::device_scalar. We could just update the implementation to use a kernel instead of cudaMemcpy*.
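A sketch of what a kernel-based set could look like for the H2D direction; this is illustrative, not rmm's actual implementation:

```cpp
#include <cuda_runtime.h>

// Set a device scalar by passing the value as a kernel argument, so no
// copy-engine transfer is involved at all.
template <typename T>
__global__ void set_value_kernel(T* dst, T value)
{
  *dst = value;
}

template <typename T>
void set_device_scalar(T* device_ptr, T host_value, cudaStream_t stream)
{
  set_value_kernel<<<1, 1, 0, stream>>>(device_ptr, host_value);
}
```

The D2H direction (reading the value back) still needs a real transfer, which is where the pinned-memory point below comes in.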

vuule (Contributor) commented May 3, 2024

> I suspect many of the small copies in cuDF come from rmm::device_scalar. We could just update the implementation to use a kernel instead of cudaMemcpy*.

AFAIK we also need to get device_scalar to use pinned memory, which implies that we'd also need to pass in a host memory resource to enable the use of a pinned pool.
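A rough sketch of that combination, with cudaMallocHost standing in for a pinned pool backed by a host memory resource (names are illustrative):

```cpp
#include <thrust/copy.h>
#include <thrust/system/cuda/execution_policy.h>

#include <cuda_runtime.h>

// Read a device scalar back through a pinned staging slot, using an
// SM-based copy (as in the PoC above) instead of the copy engine.
template <typename T>
T get_scalar_via_pinned(T const* device_ptr, cudaStream_t stream)
{
  static thread_local T* pinned_slot = [] {
    T* p{};
    cudaMallocHost(&p, sizeof(T));  // stand-in for a pinned pool; never freed in this sketch
    return p;
  }();
  // Kernel-based D2H into pinned memory; legal because the slot is pinned.
  thrust::copy_n(thrust::cuda::par.on(stream), device_ptr, 1, pinned_slot);
  cudaStreamSynchronize(stream);  // caller needs the value now; the slot is reused
  return *pinned_slot;
}
```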
