cuda : generalize pocl_cuda_submit_memfill #1027

Draft · wants to merge 1 commit into main
Conversation

@trixirt (Contributor) commented Jan 18, 2022

The use of cuMemsetD32 or cuMemsetD16 depends on the alignment,
not so much on the size. Check the alignment of the inputs
and use the appropriate cuMemset.

Signed-off-by: Tom Rix <trix@redhat.com>

@trixirt (Contributor, Author) commented Jan 18, 2022

conformance_buffers_fill mostly works now on Jetson; before, it failed whenever the size was 8 or larger.
The remaining failures involve the flags that want to use a host pointer.

@isuruf (Member) commented Jan 19, 2022

@trixirt, I don't think this is correct. AFAIK, pattern_size gives the length of the data in pattern, but cuMemsetD32Async considers pattern to be a 32-bit value. Is my understanding incorrect?

@trixirt (Contributor, Author) commented Jan 19, 2022

Yes, I believe so; that CUDA API takes a value, not a pointer.
I must have been getting lucky on the conformance test.
I will rework.

The use of cuMemsetD32|16 depends on the alignment
as well as the size. Check the alignment of the inputs
and use an appropriate cuMemset.

Handle the general size by breaking the memset into
equivalent pocl_cuda_* calls: first copy the pattern
from host to device memory, then copy the pattern from
device to device memory.

Move pocl_cuda_submit_memfill to after the calls it references.

Signed-off-by: Tom Rix <trix@redhat.com>
@Oblomov (Contributor) commented Jan 20, 2022

I'm afraid the only way to properly implement clEnqueueFillBuffer through CUDA is using kernels. This may be optimized for specific alignment/size cases, but a kernel for the generic case is necessary.

@trixirt (Contributor, Author) commented Jan 20, 2022

Can you explain why a write and copies would not work?
The test I am fixing is conformance_buffers_fill.

@Oblomov (Contributor) commented Jan 22, 2022

If we're only aiming for correctness, then yes, the approach would work. However, the performance is going to be abysmal.

Buffer filling should work at close to peak bandwidth performance. If it's implemented as a doubling copy, you're only going to get < 1/3rd the performance that the hardware could achieve (it's sort of like a “reverse reduction”). Even a simple kernel doing a[i] = pattern would do better in this case.

(That being said, it's obviously OK to strive for correctness now and then aim for performance later.)

@pjaaskel (Member) commented

What's the plan with this work?

@pjaaskel pjaaskel marked this pull request as draft March 16, 2022 09:51
@isuruf (Member) commented Jun 6, 2023

Now that there is builtin CUDA kernel support, it should be easy to write a memfill kernel and plug it into pocl_cuda_submit_memfill.

4 participants