
New kernel qualifier macro #307

Closed
tcew opened this issue Apr 2, 2020 · 22 comments · Fixed by #531
Labels
feature Use this label to request a new feature!

Comments

@tcew
Collaborator

tcew commented Apr 2, 2020

It is often helpful to give the kernel compiler extra information about the number of threads (work-items) in a thread-block (work-group). For instance, we can give the HIP compiler an upper bound on the number of threads in a thread-block (say 1024) as follows:

__global__ void __launch_bounds__(1024) fooKernel(...) { ... }

In fact, for the current HIP release this unfortunately must be specified whenever the thread-block size exceeds 256 (see ROCm/HIP#1310).

CUDA has the same attribute; it also accepts an optional second argument specifying the minimum number of thread-blocks per multiprocessor (https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#launch-bounds).
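As a sketch, the two-argument CUDA form looks like the following (the kernel name and body are illustrative, not from OCCA):

```cuda
// Sketch of the two-argument CUDA form: the first argument caps the
// threads per block, the optional second argument asks the compiler to
// keep register usage low enough for at least that many resident
// blocks per multiprocessor.
__global__ void __launch_bounds__(1024, 2)
fooKernel(int N, const float *x, float *y) {
  int n = blockIdx.x * blockDim.x + threadIdx.x;
  if (n < N) y[n] = 2.0f * x[n];
}
```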

The kernel qualifiers in OpenCL are slightly different (see section 6.7.2 of https://www.khronos.org/registry/OpenCL/specs/2.2/pdf/OpenCL_C.pdf).
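For concreteness, the two relevant OpenCL C attributes from that section can be sketched as follows (kernel names and bodies are illustrative):

```c
// OpenCL C kernel attributes (spec section 6.7.2):
// work_group_size_hint is a soft hint, reqd_work_group_size is a
// hard requirement on the work-group size.
__kernel __attribute__((work_group_size_hint(256, 1, 1)))
void fooHint(__global const float *x, __global float *y) {
  size_t n = get_global_id(0);
  y[n] = 2.0f * x[n];
}

__kernel __attribute__((reqd_work_group_size(256, 1, 1)))
void fooReqd(__global const float *x, __global float *y) {
  size_t n = get_global_id(0);
  y[n] = 2.0f * x[n];
}
```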

Proposal v1 - in an ideal world, when the thread-block size is known at compile time, OCCA would add the appropriate kernel qualifiers: launch bounds (CUDA, HIP) or a work-group size hint (OpenCL).

Proposal v2 - if it is too complicated to do Proposal v1, then it would be neat to add an OKL attribute for launch bounds, @qualifier("inner sizes", B), where B could be a compiler define. This would be expanded to __launch_bounds__(value of B) for CUDA/HIP or __attribute__((work_group_size_hint(value of B))) for OpenCL. A multi-dimensional variant would also be helpful.

@dmed256 added the feature label Apr 6, 2020
@dmed256
Member

dmed256 commented Apr 6, 2020

Since this seems specific to HIP/CUDA/OpenCL, what about passing it in as a build property?

  addVectors = device.buildKernel("addVectors.okl",
                                  "addVectors",
                                  "launch_bounds: 1024");

@tcew
Collaborator Author

tcew commented Apr 6, 2020

That would be good I think.

@dmcdougall
Contributor

I just ran into this. I'm happy to contribute this but I have a question. Does OCCA know the inner loop dimensions at JIT-time?

@tcew
Collaborator Author

tcew commented Jul 1, 2020

I believe the loop dimensions can be passed in as kernel arguments.

@dmcdougall
Contributor

That's true, but I don't understand why that's useful. I'm asking how occa can emit a __launch_bounds__ attribute at JIT-time. It needs to know the loop dimensions at that point. This is orthogonal to the kernel arguments, isn't it?

@noelchalmers
Contributor

He's saying the dims of the inner loops can be passed as arguments, i.e. we don't necessarily know the threadblock dimensions at JIT-compile time.

@dmcdougall
Contributor

Oh right. I see what you're saying now @tcew. Good point.

@tcew
Collaborator Author

tcew commented Jul 6, 2020

This is what I had in mind:

@kernel void runtimeArgs(const int B,
                         const int T,
                         const int N,
                         const float *x,
                         const float *y,
                         float *xPy) {
  for (int b=0;b<B;++b;@outer(0)){
    for (int t=0;t<T;++t;@inner(0)){

      if(b==0 && t==0) printf("B=%d, T=%d\n", B, T);

      int  n = t + T*b;
      if(n<N){
        xPy[n] = x[n] + y[n];
      }
    }
  }
}

OCCA obviously cannot know the numerical loop bounds at JIT-time.

It does, however, create a launcher that sets the dimensions of the thread grid:

extern "C" void runtimeArgs(occa::modeKernel_t * *deviceKernel,
                            const int & B,
                            const int & T,
                            const int & N,
                            occa::modeMemory_t * x,
                            occa::modeMemory_t * y,
                            occa::modeMemory_t * xPy) {
  {
    occa::dim outer, inner;
    outer.dims = 1;
    inner.dims = 1;
    int b = 0;
    outer[0] = B - 0;
    int t = 0;
    inner[0] = T - 0;
    occa::kernel kernel(deviceKernel[0]);
    kernel.setRunDims(outer, inner);
    kernel(B, T, N, x, y, xPy);
  }
}

Thus the user can specify any size for the loop bounds at runtime.

@tcew
Collaborator Author

tcew commented Jul 6, 2020

Crossed posts.

@kris-rowe
Member

SYCL uses similar syntax:

sycl::range<2> global_range(Bx*Tx,By*Ty);
sycl::range<2> local_range(Tx,Ty);
sycl::nd_range<2> kernel_range(global_range, local_range);

device_queue.submit([&](sycl::handler &cgh) {
    ...
  cgh.parallel_for(kernel_range, kernel);
});

(Buffers have a range associated with them, which avoids the need to pass in N)

@tcew
Collaborator Author

tcew commented Jul 7, 2020

The SYCL sycl::range syntax is adapted from OpenCL, which itself is adapted from CUDA.

In your example the specification of the threading dimensions is separate from the body of the parallel for loops.

The OCCA OKL syntax is specifically designed to bring the loop dimensions and body code into a more familiar parallel for loop syntax.

for (int b=0;b<B;++b;@outer(0)){ /*  grid dimension defined here */
    for (int t=0;t<T;++t;@inner(0)){ /* thread block dimension defined here */

      if(b==0 && t==0) printf("B=%d, T=%d\n", B, T);

      int  n = t + T*b;
      if(n<N){
        xPy[n] = x[n] + y[n];
      }
    }
  }

The kernel syntax intentionally keeps the parallel for loop bounds in close proximity to the loop body. Also, the loop bounds are not tied to any input array, since a more general kernel may require a thread-grid configuration very different from the data array sizes.

The OKL kernel construction was an intentional choice, born of having to repeatedly explain CUDA/OpenCL kernel syntax, kernel launch parameters, and kernel threading philosophy when training people.

@noelchalmers
Contributor

Does SYCL have an analogous thread-block size hint like CUDA/HIP's __launch_bounds__ we're discussing here?

@kris-rowe
Member

Good question. I double-checked the SYCL standard (v1.2.1) to find out. Any attributes available in OpenCL C are supported and can be given with the C++11 attribute specifier using the cl namespace. For example, __attribute__((reqd_work_group_size(n))) in OpenCL C is equivalent to [[cl::reqd_work_group_size(n)]] in SYCL.

There are two flavours available to specify the thread-group size: work_group_size_hint(n) is the soft version, suggesting the thread-group size will be n, whereas reqd_work_group_size(n) is a strict requirement.
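A hedged sketch of how that spelling might look on a SYCL kernel; the exact attribute placement (on the lambda versus a named function object) varies between SYCL implementations, so treat this as illustrative only:

```cpp
// Illustrative only: SYCL 1.2.1 exposes OpenCL C attributes through
// the C++11 attribute syntax in the cl namespace.
device_queue.submit([&](sycl::handler &cgh) {
  cgh.parallel_for<class AddVectors>(
      kernel_range,
      [=](sycl::nd_item<2> item) [[cl::reqd_work_group_size(16, 16, 1)]] {
        // kernel body ...
      });
});
```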

@tcew
Collaborator Author

tcew commented Jul 7, 2020

Some options for OCCA:

  1. Add an "innerDimHint" member function to the occa::kernel class, which forces a recompile (if not already in the hash) with a thread-dimension hint for CUDA/OpenCL/HIP.

  2. Add some logic inside the launcher to trigger a recompile when a new thread array size is specified. This feature might be turned on/off by some OCCA_*** define.

Both of these can be done in a backwards compatible way.

@noelchalmers
Contributor

I would actually trust the separate runtimes to manage this, and opt not to recompile anything. In essence, use @dmed256's original proposal of making it a build property, then add the respective __launch_bounds__ hint to the kernel at translation if the backend supports it.

For CUDA and HIP, __launch_bounds__ is really just a hint, so it is more akin to OpenCL's work_group_size_hint. It is only used to tell the compiler how many registers it can assume will be available to each thread in the block. If the user violates the launch bound, it is not necessarily an error, since the kernel may not be using registers heavily. In the case where the user violates the launch bound and there are indeed not enough registers, the runtime will throw an error that OCCA should catch.
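A sketch of where such an error would surface at runtime (CUDA shown; HIP is analogous with hipGetLastError; the kernel and helper names are illustrative):

```cuda
#include <cstdio>

// Compiled assuming at most 256 threads per block; the compiler may
// allocate registers per thread accordingly.
__global__ void __launch_bounds__(256) boundedKernel(float *x) {
  int n = blockIdx.x * blockDim.x + threadIdx.x;
  x[n] += 1.0f;
}

void launchChecked(float *d_x, int blocks) {
  boundedKernel<<<blocks, 512>>>(d_x);  // exceeds the 256-thread bound
  // If the 512-thread launch needs more registers than the compiled
  // code reserved, the failure shows up here, not at compile time.
  cudaError_t err = cudaGetLastError();
  if (err != cudaSuccess)
    fprintf(stderr, "launch failed: %s\n", cudaGetErrorString(err));
}
```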

@tcew
Collaborator Author

tcew commented Jul 7, 2020

When running an auto-tuner I have noticed segmentation faults for OCCA:HIP kernels that required too much LDS or too many registers. I was skeptical that HIP would throw errors that could be caught; hopefully that has been fixed now.

@tcew
Collaborator Author

tcew commented Jul 7, 2020

I am ok with a user supplied hint, or a launcher that spots new thread configurations at runtime.

@pdhahn
Contributor

pdhahn commented Jan 4, 2021

FYI, I have been happy as a lark just specifying explicit runtime loop bounds from host code via macro-constant substitution in OKL kernel code (e.g., K_blockDim for outer, K_threadDim for inner), passed down via the kernel props. Of course, I have my own API code that queries the characteristics and status of the GPU, and I use that information to calculate the loop bounds to pass down. Yes, this kind of dynamic approach sometimes causes a JIT recompile at runtime, but in my case it is rare because I step-threshold changes, so the actual loop bounds usually remain the same or fall into some common set for a given kernel. I also utilize pre-compilation for some of my kernels, which further reduces JIT recompiles.

Having these loop bounds explicitly defined at runtime also facilitates runtime sizing GPU local memory arrays with them in the OKL code.
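A small sketch of this approach, assuming OCCA's "defines/" property path; K_blockDim and K_threadDim are the commenter's own macro names, not OCCA builtins:

```cpp
// Host side: pass loop bounds as compile-time defines through the
// kernel properties. Changing a define changes the kernel hash, which
// triggers a JIT recompile.
occa::properties props;
props["defines/K_blockDim"]  = 512;  // @outer (grid) dimension
props["defines/K_threadDim"] = 256;  // @inner (thread-block) dimension

occa::kernel addVectors =
    device.buildKernel("addVectors.okl", "addVectors", props);

// Kernel side (.okl), the defines appear as ordinary macro constants:
//   for (int b = 0; b < K_blockDim; ++b; @outer(0))
//     for (int t = 0; t < K_threadDim; ++t; @inner(0)) {
//       @shared float s_x[K_threadDim];  // runtime-chosen local array size
//       ...
//     }
```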

@stgeke
Contributor

stgeke commented May 6, 2021

@tcew
Collaborator Author

tcew commented May 6, 2021

With apologies for the clumsy implementation, this is a workaround I use for a kernel in libparanumal that requires Np (>256) threads:

occa::properties kernelInfo;
...
if (platform.device.mode() == "HIP") {
  char newflag[BUFSIZ];
  sprintf(newflag, " --gpu-max-threads-per-block=%d", mesh.Np);
  kernelInfo["compiler_flags"] += newflag;
}

Using the hipcc compiler flag to specify the launch bounds was suggested by Noel Chalmers. The gross implementation is mine.

@tcew
Collaborator Author

tcew commented May 6, 2021

It is important to be careful when doing this since it is unclear what happens if a kernel violates the max bound.

To avoid using inappropriate bounds, I create separate copies of the occa::properties object for kernels that use different max thread counts.

@noelchalmers
Contributor

It's likely that in the future, violating the launch bounds will become a runtime error.
