cuda : generalize pocl_cuda_submit_memfill #1027
base: main
conformance_buffers_fill now mostly works on Jetson; before this change it failed when the size was 8+ |
@trixirt, I don't think this is correct. AFAIK, |
Yes, I believe so: that CUDA API takes a value, not a pointer. |
The use of cuMemsetD32|16 depends on the alignment as well as the size. Check the alignment of the inputs and use an appropriate cuMemset. Handle the general size by breaking the memset into the equivalent pocl_cuda_* calls: first copy the pattern from host to device memory, then copy the pattern from device to device memory. Move pocl_cuda_submit_memfill to after the calls it references. Signed-off-by: Tom Rix <trix@redhat.com>
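For readers following along, here is a minimal host-only sketch of the "write once, then copy" idea described in the commit message. It is not the code from this PR: `fill_by_doubling` is a hypothetical name, and plain `memcpy` stands in for both the host-to-device write and the device-to-device copies, so it runs without a GPU.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not a pocl function.
 * Emulates a pattern fill built from one "write" plus repeated copies.
 * In a real driver the first memcpy would be a host-to-device write and
 * the memcpy in the loop a device-to-device copy. */
static void fill_by_doubling(void *dst, size_t size,
                             const void *pattern, size_t pattern_size)
{
    /* clEnqueueFillBuffer requires size to be a multiple of pattern_size. */
    assert(pattern_size > 0 && size % pattern_size == 0);
    if (size == 0)
        return;

    unsigned char *p = dst;

    /* Seed the buffer with a single copy of the pattern. */
    memcpy(p, pattern, pattern_size);
    size_t filled = pattern_size;

    /* Copy the already-filled prefix onto the unfilled tail, doubling the
     * filled region each round. chunk <= filled, so the source and
     * destination ranges never overlap. */
    while (filled < size) {
        size_t remaining = size - filled;
        size_t chunk = filled < remaining ? filled : remaining;
        memcpy(p + filled, p, chunk);
        filled += chunk;
    }
}
```

Because `filled` is always a multiple of `pattern_size`, every copied chunk starts on a pattern boundary, so the pattern continues seamlessly even when the last copy is shorter than the filled prefix.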
I'm afraid the only way to properly implement clEnqueueFillBuffer through CUDA is using kernels. This may be optimized for specific alignment/size cases, but a kernel for the generic case is necessary. |
Can you explain why a write and copies would not work? |
If we're only aiming for correctness, then yes, the approach would work. However, the performance is going to be abysmal. Buffer filling should work at close to peak bandwidth performance. If it's implemented as a doubling copy, you're only going to get < 1/3rd the performance that the hardware could achieve (it's sort of like a “reverse reduction”). Even a simple kernel doing (That being said, it's obviously OK to strive for correctness now and then aim for performance later.) |
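To make the contrast concrete, here is a rough host-side emulation of what a generic fill kernel could look like. This is only a sketch under assumptions: `fill_work_item` and `fill_like_kernel` are hypothetical names, the loop stands in for a grid of GPU work-items, and it is not the kernel the comment above had in mind.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical per-work-item body: work-item `gid` writes one copy of the
 * pattern into its own slot. Nothing is read back from the destination. */
static void fill_work_item(unsigned char *dst, const unsigned char *pattern,
                           size_t pattern_size, size_t gid)
{
    memcpy(dst + gid * pattern_size, pattern, pattern_size);
}

/* Host emulation of the launch: one "work-item" per pattern-sized element.
 * On a GPU these iterations would run in parallel. */
static void fill_like_kernel(void *dst, size_t size,
                             const void *pattern, size_t pattern_size)
{
    assert(pattern_size > 0 && size % pattern_size == 0);
    size_t n = size / pattern_size;
    for (size_t gid = 0; gid < n; gid++)
        fill_work_item(dst, pattern, pattern_size, gid);
}
```

In this form each destination byte is written exactly once and the destination is never read, whereas a copy-based fill has to read back what it previously wrote.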
What's the plan with this work? |
Now that there is builtin CUDA kernel support, it should be easy to write a memfill kernel and plug it to |
The use of cuMemsetD32|16 is dependent on the alignment and
not as much the size. Check the alignment of the inputs
and use an appropriate cuMemset.
Signed-off-by: Tom Rix <trix@redhat.com>
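As an illustration of the alignment check this commit message describes, here is a small host-only sketch. It assumes the CUDA driver API's `cuMemsetD8`, `cuMemsetD16` and `cuMemsetD32` need a destination aligned to 1, 2 and 4 bytes respectively; `pick_memset` and `enum memset_kind` are hypothetical names and not part of pocl or CUDA.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tags for which cuMemsetD* variant could serve a fill.
 * The numeric value is the element width in bytes. */
enum memset_kind {
    MEMSET_NONE = 0, /* no cuMemsetD* fits; fall back to another path */
    MEMSET_D8   = 1,
    MEMSET_D16  = 2,
    MEMSET_D32  = 4
};

/* Pick a memset variant from the destination address, the fill size and the
 * pattern size. Only 1-, 2- and 4-byte patterns map onto cuMemsetD8/16/32,
 * and the destination must be aligned to the element width. */
static enum memset_kind pick_memset(uintptr_t dst_addr, size_t size,
                                    size_t pattern_size)
{
    assert(pattern_size > 0); /* OpenCL rejects an empty pattern */

    if (pattern_size != 1 && pattern_size != 2 && pattern_size != 4)
        return MEMSET_NONE;
    if (dst_addr % pattern_size != 0 || size % pattern_size != 0)
        return MEMSET_NONE;
    return (enum memset_kind)pattern_size;
}
```

A caller would then issue the matching `cuMemsetD*` call with an element count of `size / pattern_size`, passing the pattern by value, and use a different path whenever `MEMSET_NONE` comes back.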