cuda : generalize pocl_cuda_submit_memfill #1027

Draft · wants to merge 1 commit into main
Conversation

@trixirt (Contributor) commented Jan 18, 2022

The use of cuMemsetD32 or cuMemsetD16 depends on the alignment,
not so much on the size. Check the alignment of the inputs
and use the appropriate cuMemset.

Signed-off-by: Tom Rix <trix@redhat.com>

@trixirt (Contributor, Author) commented Jan 18, 2022

conformance_buffers_fill mostly works now on Jetson; before, it failed whenever the size was 8 or larger.
The remaining failures involve the flags that want to use a host pointer.

@isuruf (Member) commented Jan 19, 2022

@trixirt, I don't think this is correct. AFAIK, pattern_size gives the length of the data in pattern, but cuMemsetD32Async considers pattern to be a 32-bit value. Is my understanding incorrect?

@trixirt (Contributor, Author) commented Jan 19, 2022

Yes, I believe so; that CUDA API takes a value, not a pointer.
I must have been getting lucky on the conformance test.
I will rework.

The use of cuMemsetD32|16 depends on the alignment
as well as the size. Check the alignment of the inputs
and use an appropriate cuMemset.

Handle the general size by breaking the memset into
equivalent pocl_cuda_* calls: first copy the pattern
from host to device memory, then copy the pattern from
device to device memory.

Move pocl_cuda_submit_memfill to after the calls it references.

Signed-off-by: Tom Rix <trix@redhat.com>
@Oblomov (Contributor) commented Jan 20, 2022

I'm afraid the only way to properly implement clEnqueueFillBuffer through CUDA is using kernels. This may be optimized for specific alignment/size cases, but a kernel for the generic case is necessary.

@trixirt (Contributor, Author) commented Jan 20, 2022

Can you explain why a write and copies would not work?
The test I am fixing is conformance_buffers_fill.

@Oblomov (Contributor) commented Jan 22, 2022

If we're only aiming for correctness, then yes, the approach would work. However, the performance is going to be abysmal.

Buffer filling should work at close to peak bandwidth performance. If it's implemented as a doubling copy, you're only going to get < 1/3rd the performance that the hardware could achieve (it's sort of like a “reverse reduction”). Even a simple kernel doing a[i] = pattern would do better in this case.

(That being said, it's obviously OK to strive for correctness now and then aim for performance later.)

@pjaaskel (Member) commented

What's the plan with this work?

@pjaaskel pjaaskel marked this pull request as draft March 16, 2022 09:51
@isuruf (Member) commented Jun 6, 2023

Now that there is builtin CUDA kernel support, it should be easy to write a memfill kernel and plug it into pocl_cuda_submit_memfill.

4 participants