New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
New kernel qualifier macro #307
Comments
Since this seems specific to HIP/CUDA/OpenCL, what about passing it in as a build property? addVectors = device.buildKernel("addVectors.okl",
"addVectors",
"launch_bounds: 1024"); |
That would be good I think. |
I just ran into this. I'm happy to contribute this but I have a question. Does OCCA know the inner loop dimensions at JIT-time? |
I believe the loop dimensions can be passed in as kernel arguments. |
That's true, but I don't understand why that's useful. I'm asking how occa can emit a |
He's saying the dims of the inner loops can be passed as arguments, i.e. we don't necessarily know the threadblock dimensions at JIT-compile time. |
Oh right. I see what you're saying now @tcew. Good point. |
This is what I had in mind:
OCCA obviously cannot know the numerical loop bounds at JIT-time. It does however, create a launcher that sets the dimensions of the thread grid:
Thus the user can specify any size for the loop bounds at runtime. |
Crossed posts. |
SYCL uses similar syntax:
(Buffers have a range associated with them, which avoids the need to pass in N) |
The SYCL In your example the specification of the threading dimensions is separate from the body of the parallel for loops. The OCCA OKL syntax is specifically designed to bring the loop dimensions and body code into a more familiar parallel for loop syntax.
The code inside the kernel is supposed to keep the parallel for loop bounds in close proximity to the body of the parallel for loops. Also the loop bound is not specified by any input array, since a more general kernel may require a very different thread grid configuration to the data arrays. The OKL kernel constructions was an intentional choice born of having to repeatedly explain CUDA/OpenCL kernel syntax, kernel launch parameters, and kernel threading philosophy when training people. |
Does SYCL have an analogous thread-block size hint like CUDA/HIP's |
Good question. I double-checked the SYCL standard (v1.2.1) to find out. Any attributes available in OpenCL C are supported and can be given with the C++ 11 attribute specifier using the There are two flavours available to specify the thread-group size: |
Some options for OCCA:
Both of these can be done in a backwards compatible way. |
I would actually trust the separate runtimes to manage this, and opt to not recompile anything. In essence, use @dmed256's original proposal of making it a build prop, then adding the respective For CUDA and HIP, |
When running an auto tuner I have noticed segmentation faults for OCCA:HIP kernels that required too much LDS or REG. I was skeptical that HIP will throw errors that will be caught. Hopefully that has been fixed now. |
I am ok with a user supplied hint, or a launcher that spots new thread configurations at runtime. |
FYI I have been happy as a lark just specifying explicit runtime loop bounds from host code via macro constant substitution in OKL kernel code (e.g., Having these loop bounds explicitly defined at runtime also facilitates runtime sizing GPU local memory arrays with them in the OKL code. |
Looks like this feature is going to be important: |
With apologies for the clumsy implementation, this is a workaround I use for a kernel in libparanumal that requires Np (>256) threads:
Using the hipcc compiler flag to specify the launch bounds was suggested by Noel Chalmers. The gross implementation is mine. |
It is important to be careful when doing this since it is unclear what happens if a kernel violates the max bound. To avoid using inappropriate bounds, I create separate copies of the |
It's likely that in the future, violating the launch bounds will become a runtime error. |
It is often helpful to give the kernel compiler some extra information about the number of threads (work-items) in a thread-block (work-group). For instance we can give the HIP compiler an upper bound on the number of threads in a thread-block (say 1024) as follows:
__launch_bounds__(1024) __global__ void fooKernel(...) { ... }
In fact for the current HIP release it is unfortunately the case that this must be specified when the thread-block size exceeds 256 (see ROCm/HIP#1310 )
CUDA also has the same attribute. There is also an extra argument to the launch bounds qualifier for minimum number of thread-blocks (https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#launch-bounds).
The kernel qualifiers in OpenCL are slightly different (see 6.7.2 of https://www.khronos.org/registry/OpenCL/specs/2.2/pdf/OpenCL_C.pdf )
Proposal v1 - in an ideal world when we know the thread-block size at compile time OCCA will add kernel qualifiers for appropriate launch bounds (CUDA, HIP) or work-group size hint (OpenCL).
Proposal v2 - If it is too complicated to do Proposal v1, then it would neat to add an okl attribute for launch bounds @qualifier("inner sizes", B) where B could be a compiler define. This would be expanded to launch_bounds(value of B) for CUDA/HIP or attribute((work_group_size_hint(value of B))) for OpenCL. Multi-dim variant would also be helpful.
The text was updated successfully, but these errors were encountered: