
New kernel qualifier macro #307

Closed
tcew opened this issue Apr 2, 2020 · 22 comments · Fixed by #531
Labels
feature Use this label to request a new feature!

Comments

@tcew
Collaborator

tcew commented Apr 2, 2020

It is often helpful to give the kernel compiler extra information about the number of threads (work-items) in a thread-block (work-group). For instance, we can give the HIP compiler an upper bound on the number of threads in a thread-block (say 1024) as follows:

__global__ void __launch_bounds__(1024) fooKernel(...) { ... }

In fact, for the current HIP release this unfortunately must be specified whenever the thread-block size exceeds 256 (see ROCm/HIP#1310).

CUDA has the same attribute; it also accepts an optional second argument specifying the minimum number of thread-blocks per multiprocessor (https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#launch-bounds).
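As a sketch, the two-argument CUDA form looks like the following (the kernel name and body are illustrative, not from OCCA):

```cuda
// Sketch of the two-argument CUDA form: the first argument caps the
// threads per block, the optional second argument asks the compiler to
// keep register usage low enough for at least that many resident
// blocks per multiprocessor.
__global__ void __launch_bounds__(1024, 2)
fooKernel(int N, const float *x, float *y) {
  int n = blockIdx.x * blockDim.x + threadIdx.x;
  if (n < N) y[n] = 2.0f * x[n];
}
```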

The kernel qualifiers in OpenCL are slightly different (see section 6.7.2 of https://www.khronos.org/registry/OpenCL/specs/2.2/pdf/OpenCL_C.pdf).
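For concreteness, the two relevant OpenCL C attributes from that section can be sketched as follows (kernel names and bodies are illustrative):

```c
// OpenCL C kernel attributes (spec section 6.7.2):
// work_group_size_hint is a soft hint, reqd_work_group_size is a
// hard requirement on the work-group size.
__kernel __attribute__((work_group_size_hint(256, 1, 1)))
void fooHint(__global const float *x, __global float *y) {
  size_t n = get_global_id(0);
  y[n] = 2.0f * x[n];
}

__kernel __attribute__((reqd_work_group_size(256, 1, 1)))
void fooReqd(__global const float *x, __global float *y) {
  size_t n = get_global_id(0);
  y[n] = 2.0f * x[n];
}
```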

Proposal v1 - in an ideal world, when the thread-block size is known at compile time, OCCA would add the appropriate kernel qualifiers: launch bounds (CUDA, HIP) or a work-group size hint (OpenCL).

Proposal v2 - if it is too complicated to do Proposal v1, then it would be neat to add an OKL attribute for launch bounds, @qualifier("inner sizes", B), where B could be a compiler define. This would be expanded to __launch_bounds__(value of B) for CUDA/HIP or __attribute__((work_group_size_hint(value of B))) for OpenCL. A multi-dimensional variant would also be helpful.

@dmed256 added the feature label Apr 6, 2020
@dmed256
Member

dmed256 commented Apr 6, 2020

Since this seems specific to HIP/CUDA/OpenCL, what about passing it in as a build property?

  addVectors = device.buildKernel("addVectors.okl",
                                  "addVectors",
                                  "launch_bounds: 1024");

@tcew
Collaborator Author

tcew commented Apr 6, 2020

That would be good I think.

@dmcdougall
Contributor

I just ran into this. I'm happy to contribute this but I have a question. Does OCCA know the inner loop dimensions at JIT-time?

@tcew
Collaborator Author

tcew commented Jul 1, 2020

I believe the loop dimensions can be passed in as kernel arguments.

@dmcdougall
Contributor

That's true, but I don't understand why that's useful. I'm asking how occa can emit a __launch_bounds__ attribute at JIT-time. It needs to know the loop dimensions at that point. This is orthogonal to the kernel arguments, isn't it?

@noelchalmers
Contributor

He's saying the dims of the inner loops can be passed as arguments, i.e. we don't necessarily know the threadblock dimensions at JIT-compile time.

@dmcdougall
Contributor

Oh right. I see what you're saying now @tcew. Good point.

@tcew
Collaborator Author

tcew commented Jul 6, 2020

This is what I had in mind:

@kernel void runtimeArgs(const int B,
                         const int T,
                         const int N,
                         const float *x,
                         const float *y,
                         float *xPy) {
  for (int b=0;b<B;++b;@outer(0)){
    for (int t=0;t<T;++t;@inner(0)){

      if(b==0 && t==0) printf("B=%d, T=%d\n", B, T);

      int  n = t + T*b;
      if(n<N){
        xPy[n] = x[n] + y[n];
      }
    }
  }
}

OCCA obviously cannot know the numerical loop bounds at JIT-time.

It does, however, create a launcher that sets the dimensions of the thread grid:

extern "C" void runtimeArgs(occa::modeKernel_t * *deviceKernel,
                            const int & B,
                            const int & T,
                            const int & N,
                            occa::modeMemory_t * x,
                            occa::modeMemory_t * y,
                            occa::modeMemory_t * xPy) {
  {
    occa::dim outer, inner;
    outer.dims = 1;
    inner.dims = 1;
    int b = 0;
    outer[0] = B - 0;
    int t = 0;
    inner[0] = T - 0;
    occa::kernel kernel(deviceKernel[0]);
    kernel.setRunDims(outer, inner);
    kernel(B, T, N, x, y, xPy);
  }
}

Thus the user can specify any size for the loop bounds at runtime.

@tcew
Collaborator Author

tcew commented Jul 6, 2020

Crossed posts.

@kris-rowe
Member

SYCL uses similar syntax:

sycl::range<2> global_range(Bx*Tx,By*Ty);
sycl::range<2> local_range(Tx,Ty);
sycl::nd_range<2> kernel_range(global_range, local_range);

device_queue.submit([&](sycl::handler &cgh) {
    ...
  cgh.parallel_for(kernel_range, kernel);
});

(Buffers have a range associated with them, which avoids the need to pass in N)

@tcew
Collaborator Author

tcew commented Jul 7, 2020

The SYCL sycl::range syntax is adapted from OpenCL, which itself is adapted from CUDA.

In your example the specification of the threading dimensions is separate from the body of the parallel for loops.

The OCCA OKL syntax is specifically designed to bring the loop dimensions and body code into a more familiar parallel for loop syntax.

for (int b=0;b<B;++b;@outer(0)){ /*  grid dimension defined here */
    for (int t=0;t<T;++t;@inner(0)){ /* thread block dimension defined here */

      if(b==0 && t==0) printf("B=%d, T=%d\n", B, T);

      int  n = t + T*b;
      if(n<N){
        xPy[n] = x[n] + y[n];
      }
    }
  }

The kernel syntax intentionally keeps the parallel for loop bounds in close proximity to the loop body. Also, the loop bounds are not tied to any input array, since a more general kernel may require a thread-grid configuration very different from the data array sizes.

The OKL kernel construction was an intentional choice, born of having to repeatedly explain CUDA/OpenCL kernel syntax, kernel launch parameters, and kernel threading philosophy when training people.

@noelchalmers
Contributor

Does SYCL have an analogous thread-block size hint like CUDA/HIP's __launch_bounds__ we're discussing here?

@kris-rowe
Member

Good question. I double-checked the SYCL standard (v1.2.1) to find out. Any attributes available in OpenCL C are supported and can be given with the C++11 attribute specifier using the cl namespace. For example, __attribute__((reqd_work_group_size(n))) in OpenCL C is equivalent to [[cl::reqd_work_group_size(n)]] in SYCL.

There are two flavours available to specify the thread-group size: work_group_size_hint(n) is the soft version, suggesting the thread-group size will be n, whereas reqd_work_group_size(n) is a strict requirement.
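A hedged sketch of how that spelling might look on a SYCL kernel; the exact attribute placement (on the lambda versus a named function object) varies between SYCL implementations, so treat this as illustrative only:

```cpp
// Illustrative only: SYCL 1.2.1 exposes OpenCL C attributes through
// the C++11 attribute syntax in the cl namespace.
device_queue.submit([&](sycl::handler &cgh) {
  cgh.parallel_for<class AddVectors>(
      kernel_range,
      [=](sycl::nd_item<2> item) [[cl::reqd_work_group_size(16, 16, 1)]] {
        // kernel body ...
      });
});
```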

@tcew
Collaborator Author

tcew commented Jul 7, 2020

Some options for OCCA:

  1. Add an "innerDimHint" member function to the occa::kernel class, which forces a recompile (if not already in the hash) with a thread-dimension hint for CUDA/OpenCL/HIP.

  2. Add some logic inside the launcher to trigger a recompile when a new thread array size is specified. This feature might be turned on/off by some OCCA_*** define.

Both of these can be done in a backwards compatible way.

@noelchalmers
Contributor

I would actually trust the separate runtimes to manage this, and opt not to recompile anything. In essence, use @dmed256's original proposal of making it a build property, then add the respective __launch_bounds__ hint to the kernel at translation if the backend supports it.

For CUDA and HIP, __launch_bounds__ is really just a hint, so it is more akin to OpenCL's work_group_size_hint. It is only used to tell the compiler how many registers it can assume will be available to each thread in the block. If the user violates the launch bound, it is not necessarily an error, since the kernel may not be using registers heavily. In the case where the user violates the launch bound and there are indeed not enough registers, the runtime will throw an error that OCCA should catch.
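A sketch of where such an error would surface at runtime (CUDA shown; HIP is analogous with hipGetLastError; the kernel and helper names are illustrative):

```cuda
#include <cstdio>

// Compiled assuming at most 256 threads per block; the compiler may
// allocate registers per thread accordingly.
__global__ void __launch_bounds__(256) boundedKernel(float *x) {
  int n = blockIdx.x * blockDim.x + threadIdx.x;
  x[n] += 1.0f;
}

void launchChecked(float *d_x, int blocks) {
  boundedKernel<<<blocks, 512>>>(d_x);  // exceeds the 256-thread bound
  // If the 512-thread launch needs more registers than the compiled
  // code reserved, the failure shows up here, not at compile time.
  cudaError_t err = cudaGetLastError();
  if (err != cudaSuccess)
    fprintf(stderr, "launch failed: %s\n", cudaGetErrorString(err));
}
```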

@tcew
Collaborator Author

tcew commented Jul 7, 2020

When running an auto-tuner I have noticed segmentation faults for OCCA:HIP kernels that required too much LDS or too many registers. I was skeptical that HIP would throw errors that could be caught; hopefully that has been fixed now.

@tcew
Collaborator Author

tcew commented Jul 7, 2020

I am ok with a user supplied hint, or a launcher that spots new thread configurations at runtime.

@pdhahn
Contributor

pdhahn commented Jan 4, 2021

FYI, I have been happy as a lark just specifying explicit runtime loop bounds from host code via macro-constant substitution in OKL kernel code (e.g., K_blockDim for outer, K_threadDim for inner), passed down via the kernel props. Of course, I have my own API code that queries the characteristics and status of the GPU, and I use that information to calculate the loop bounds to pass down. Yes, this kind of dynamic approach sometimes causes a JIT recompile at runtime, but in my case it is rare because I step-threshold changes, so the actual loop bounds usually remain the same or fall into some common set for a given kernel. I also utilize pre-compilation for some of my kernels, which further reduces JIT recompiles.

Having these loop bounds explicitly defined at runtime also facilitates runtime sizing GPU local memory arrays with them in the OKL code.
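A small sketch of this approach, assuming OCCA's "defines/" property path; K_blockDim and K_threadDim are the commenter's own macro names, not OCCA builtins:

```cpp
// Host side: pass loop bounds as compile-time defines through the
// kernel properties. Changing a define changes the kernel hash, which
// triggers a JIT recompile.
occa::properties props;
props["defines/K_blockDim"]  = 512;  // @outer (grid) dimension
props["defines/K_threadDim"] = 256;  // @inner (thread-block) dimension

occa::kernel addVectors =
    device.buildKernel("addVectors.okl", "addVectors", props);

// Kernel side (.okl), the defines appear as ordinary macro constants:
//   for (int b = 0; b < K_blockDim; ++b; @outer(0))
//     for (int t = 0; t < K_threadDim; ++t; @inner(0)) {
//       @shared float s_x[K_threadDim];  // runtime-chosen local array size
//       ...
//     }
```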

@stgeke
Contributor

stgeke commented May 6, 2021

@tcew
Collaborator Author

tcew commented May 6, 2021

With apologies for the clumsy implementation, this is a workaround I use for a kernel in libparanumal that requires Np (>256) threads:

occa::properties kernelInfo;
...
if (platform.device.mode() == "HIP") {
  char newflag[BUFSIZ];
  sprintf(newflag, " --gpu-max-threads-per-block=%d", mesh.Np);
  kernelInfo["compiler_flags"] += newflag;
}

Using the hipcc compiler flag to specify the launch bounds was suggested by Noel Chalmers. The gross implementation is mine.

@tcew
Collaborator Author

tcew commented May 6, 2021

It is important to be careful when doing this since it is unclear what happens if a kernel violates the max bound.

To avoid using inappropriate bounds, I create separate copies of the occa::properties object for kernels that use different max thread counts.

@noelchalmers
Contributor

It's likely that in the future, violating the launch bounds will become a runtime error.
