[FEA] Use SMs to submit small copies to prevent serialization on a busy copy engine #15620

abellina · 2024-04-30T17:45:40Z

We have seen patterns where small cudaMemcpyAsync collide with large cudaMemcpyAsync being handled by the copy engine. Importantly, the small copy is in a different stream than the large copy. In the example below, we can see a H2D pinned copy of 51KB that was scheduled with a latency of 12ms because there is another pinned H2D copy happening at the same time (the larger copy is ~200MB).

The big issue behind this pattern is that Stream 30 in this case is serializing because nothing else will run in the stream until this small copy is done. Usually when we invoke kernels in cuDF there is a pattern of: small H2Ds, followed by kernel invocation, then small D2Hs. Any of the pre/post copies done around a kernel is a candidate to get stuck, serializing all the work in that stream.

We have a PoC that uses thrust::copy_n to copy from pinned to device memory and viceversa using SMs instead of the copy engine, as kernels can directly touch pinned memory. When such an approach is followed, the small copy is able to run with much less latency and subsequent work in the stream is unblocked. This leads to kernels running at the same time as large copies, which is a desirable pattern.

This issue was created in order to track this work and to figure out how to bring these changes to cuDF in a configurable way so we don't affect serial workloads, as the effect is really prominent for parallel workloads such as Spark.

This is NDS q9 at 3TB, looking at the CUDA HW row, we can see how the compute (blue) overlaps more often than not with pinned H2Ds (green).

Before:

After:

The text was updated successfully, but these errors were encountered:

bdice · 2024-04-30T18:01:01Z

Are there any ways in which CCCL's cuda::memcpy_async could help us here? https://nvidia.github.io/cccl/libcudacxx/extended_api/asynchronous_operations/memcpy_async.html

I'm not super familiar with the facilities libcudacxx provides for this, but perhaps this could be in scope.

abellina · 2024-04-30T19:07:47Z

@bdice I am not familiar with it either. A quick code search is yielding that cuda::memcpy_async handles D2D copies (shared, global). I suppose we could refer to pinned memory as another "device" memory type, so in that case I see some relation, but the main thing is code we use to orchestrate copies should likely fallback to cudaMemcpyAsync: if pinned memory isn't available to bounce the copy through, or if it's disabled. I believe single threaded applications (single stream) may want to use cudaMemcpyAsync as is, because there shouldn't be much contention on the CE in those cases.

jrhemstad · 2024-04-30T20:49:54Z

I suspect many of the small copies in cuDF come from rmm::device_scalar. We could just update the implementation to use a kernel instead of cudaMemcpy*.

vuule · 2024-05-03T18:40:50Z

I suspect many of the small copies in cuDF come from rmm::device_scalar. We could just update the implementation to use a kernel instead of cudaMemcpy*.

AFAIK we also need to get device_scalar to use pinned memory. Which implies that we'd also need to pass a host memory resource at to enable the use of a pinned pool.

abellina added feature request New feature or request Spark Functionality that helps Spark RAPIDS labels Apr 30, 2024

ahendriksen mentioned this issue May 2, 2024

[PERF][BUG]: Thrust uses cudaMemcpy for Device->Device copies (66% SoL on H200) NVIDIA/cccl#1672

Open

1 task

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[FEA] Use SMs to submit small copies to prevent serialization on a busy copy engine #15620

[FEA] Use SMs to submit small copies to prevent serialization on a busy copy engine #15620

abellina commented Apr 30, 2024 •

edited

bdice commented Apr 30, 2024

abellina commented Apr 30, 2024

jrhemstad commented Apr 30, 2024

vuule commented May 3, 2024

[FEA] Use SMs to submit small copies to prevent serialization on a busy copy engine #15620

[FEA] Use SMs to submit small copies to prevent serialization on a busy copy engine #15620

Comments

abellina commented Apr 30, 2024 • edited

bdice commented Apr 30, 2024

abellina commented Apr 30, 2024

jrhemstad commented Apr 30, 2024

vuule commented May 3, 2024

abellina commented Apr 30, 2024 •

edited