You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
We have seen patterns where small cudaMemcpyAsync collide with large cudaMemcpyAsync being handled by the copy engine. Importantly, the small copy is in a different stream than the large copy. In the example below, we can see a H2D pinned copy of 51KB that was scheduled with a latency of 12ms because there is another pinned H2D copy happening at the same time (the larger copy is ~200MB).
The big issue behind this pattern is that Stream 30 in this case is serializing because nothing else will run in the stream until this small copy is done. Usually when we invoke kernels in cuDF there is a pattern of: small H2Ds, followed by kernel invocation, then small D2Hs. Any of the pre/post copies done around a kernel is a candidate to get stuck, serializing all the work in that stream.
We have a PoC that uses thrust::copy_n to copy from pinned to device memory and viceversa using SMs instead of the copy engine, as kernels can directly touch pinned memory. When such an approach is followed, the small copy is able to run with much less latency and subsequent work in the stream is unblocked. This leads to kernels running at the same time as large copies, which is a desirable pattern.
This issue was created in order to track this work and to figure out how to bring these changes to cuDF in a configurable way so we don't affect serial workloads, as the effect is really prominent for parallel workloads such as Spark.
This is NDS q9 at 3TB, looking at the CUDA HW row, we can see how the compute (blue) overlaps more often than not with pinned H2Ds (green).
Before:
After:
The text was updated successfully, but these errors were encountered:
@bdice I am not familiar with it either. A quick code search is yielding that cuda::memcpy_async handles D2D copies (shared, global). I suppose we could refer to pinned memory as another "device" memory type, so in that case I see some relation, but the main thing is code we use to orchestrate copies should likely fallback to cudaMemcpyAsync: if pinned memory isn't available to bounce the copy through, or if it's disabled. I believe single threaded applications (single stream) may want to use cudaMemcpyAsync as is, because there shouldn't be much contention on the CE in those cases.
I suspect many of the small copies in cuDF come from rmm::device_scalar. We could just update the implementation to use a kernel instead of cudaMemcpy*.
I suspect many of the small copies in cuDF come from rmm::device_scalar. We could just update the implementation to use a kernel instead of cudaMemcpy*.
AFAIK we also need to get device_scalar to use pinned memory. Which implies that we'd also need to pass a host memory resource at to enable the use of a pinned pool.
We have seen patterns where small
cudaMemcpyAsync
collide with largecudaMemcpyAsync
being handled by the copy engine. Importantly, the small copy is in a different stream than the large copy. In the example below, we can see a H2D pinned copy of 51KB that was scheduled with a latency of 12ms because there is another pinned H2D copy happening at the same time (the larger copy is ~200MB).The big issue behind this pattern is that Stream 30 in this case is serializing because nothing else will run in the stream until this small copy is done. Usually when we invoke kernels in cuDF there is a pattern of: small H2Ds, followed by kernel invocation, then small D2Hs. Any of the pre/post copies done around a kernel is a candidate to get stuck, serializing all the work in that stream.
We have a PoC that uses
thrust::copy_n
to copy from pinned to device memory and viceversa using SMs instead of the copy engine, as kernels can directly touch pinned memory. When such an approach is followed, the small copy is able to run with much less latency and subsequent work in the stream is unblocked. This leads to kernels running at the same time as large copies, which is a desirable pattern.This issue was created in order to track this work and to figure out how to bring these changes to cuDF in a configurable way so we don't affect serial workloads, as the effect is really prominent for parallel workloads such as Spark.
This is NDS q9 at 3TB, looking at the CUDA HW row, we can see how the compute (blue) overlaps more often than not with pinned H2Ds (green).
Before:
After:
The text was updated successfully, but these errors were encountered: