
CUDA Max Blocks Fix #612

Open · wants to merge 5 commits into develop

Conversation

@davideberius (Author)

Added a fix for the case where the requested number of CUDA threads requires too many registers and results in a maximum number of CUDA blocks of 0.
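
A minimal sketch of the idea, with illustrative names rather than the exact patch: if the requested thread count needs too many registers to fit even one block, fall back to a block size the occupancy calculator reports as feasible.

```cpp
#include <cuda_runtime.h>

// Hypothetical helper (not the PR's exact code): detect the zero-blocks
// case and adjust the thread count to something that can actually launch.
template <typename Func>
void adjust_max_blocks(Func func, size_t shmem, int &max_blocks, int &actual_threads)
{
  cudaOccupancyMaxActiveBlocksPerMultiprocessor(&max_blocks, func,
                                                actual_threads, shmem);
  if (max_blocks <= 0) {
    // The requested threads cannot fit; ask CUDA for a block size that can.
    int min_grid_size = 0;
    cudaOccupancyMaxPotentialBlockSize(&min_grid_size, &actual_threads,
                                       func, shmem);
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&max_blocks, func,
                                                  actual_threads, shmem);
  }
}
```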

Signed-off-by: David Eberius <davideberius@gmail.com>

codecov bot commented Jun 5, 2019

Codecov Report

❗ No coverage uploaded for pull request base (develop@1b8c262).
The diff coverage is n/a.


@@             Coverage Diff             @@
##             develop      #612   +/-   ##
===========================================
  Coverage           ?   98.663%           
===========================================
  Files              ?        65           
  Lines              ?      1272           
  Branches           ?         0           
===========================================
  Hits               ?      1255           
  Misses             ?        17           
  Partials           ?         0


codecov bot commented Jun 5, 2019

Codecov Report

Merging #612 into develop will decrease coverage by 0.001%.
The diff coverage is n/a.


@@              Coverage Diff              @@
##           develop      #612       +/-   ##
=============================================
- Coverage   98.664%   98.663%   -0.002%     
=============================================
  Files           65        65               
  Lines         1273      1272        -1     
=============================================
- Hits          1256      1255        -1     
  Misses          17        17
Impacted Files                          Coverage Δ
include/RAJA/policy/openmp/scan.hpp     100% <0%> (ø) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@rhornung67 (Member)

@ajkunen and @MrBurmark please look over this PR and weigh in. Thanks.

@ajkunen (Contributor) left a comment

Looks good to me

@ajkunen (Contributor) left a comment

Actually, @MrBurmark says we should be using the cached occupancy calculator that we use elsewhere. I'll let him comment on this.

@rhornung67 (Member)

Yeah, that's why I wanted you and @MrBurmark to weigh in. We need to make sure we aren't adding runtime overhead if we can avoid it.

@davideberius (Author) commented Jun 6, 2019

The code path that determines the recommended threads is only executed if both num_blocks and num_threads are <= 0. The comments suggest that this means that both blocks and threads are determined at runtime. In my case, I'm hitting the logic for 'determine blocks at runtime and determine threads at compile time'.

In this case, the number of threads is not based on what can fit on the card with this kernel, so the cudaOccupancyMaxPotentialBlockSize would only be called once in the code path I am hitting.

@davideberius (Author)

It might be worth verifying that the compile-time chosen num_threads will fit at the point that recommended_threads is assigned.

@MrBurmark (Member)

The overhead concern is running the cudaOccupancy functions each time the kernel is hit in the user code instead of just the first time.
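
For illustration, a common way to pay that cost only once is a per-instantiation cache; a rough sketch with a hypothetical helper, assuming the thread count and shared-memory size are fixed for a given kernel instantiation:

```cpp
#include <cuda_runtime.h>

// Hypothetical caching helper: the occupancy query runs on the first call
// only; later launches of the same kernel reuse the stored result. One
// static per template instantiation, so each kernel gets its own cache.
template <typename Func>
int cached_max_blocks(Func func, int threads, size_t shmem)
{
  static const int max_blocks = [&] {
    int blocks = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks, func, threads, shmem);
    return blocks;
  }();
  return max_blocks;
}
```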

…da_occupancy_max_blocks_threads call.

Signed-off-by: David Eberius <davideberius@gmail.com>
  inline static void max_blocks(int shmem_size,
-                               int &max_blocks, int actual_threads)
+                               int &max_blocks, int &actual_threads)
(Member) commented on this diff:

actual_threads should not be changed at this point if num_threads was specified at compile-time.

@MrBurmark (Member)

It makes sense to have CudaLaunchHelper::max_threads calculate an actual max threads in the runtime case. Perhaps it could error out in the compile-time case if the max threads from the occupancy calculator is too low, so we can avoid this error in the future. A test for these cases should be added as well.

@davideberius (Author)

The problem arises when the compile-time number of threads is too high, which I think should be taken into account when max_threads is calculated.

@MrBurmark (Member)

I agree. How about you move the occupancy calculation for max_threads into the run-time case in max_threads, so there is something reasonable there? In the compile-time case there isn't anything RAJA can do but check that what the user specified is possible and error out in a reasonable way if it isn't.

@davideberius (Author)

The thing that confuses me about the whole compile/runtime decision is that I as the user didn't specify anything about the number of threads in my policies, so how am I always hitting the case where threads are 'determined at compile time'?

@MrBurmark (Member) commented Jun 7, 2019

Looks like I was forgetting how part of this all works, so I'll go over the process in some detail.
The default policy is defined here:

using CudaKernel = CudaKernelFixed<1024, EnclosedStmts...>;

All cudaKernel policies end up instantiating a cuda_launch<async, num_blocks, num_threads> class to specify their num_threads and num_blocks.
These "compile-time" num_threads and num_blocks are upper bounds if specified.
The lower bounds come from the policies inside the cudaKernel.
For example, cudaKernel<Tile<0, tile_fixed<16>, cuda_block_x_loop, For<0, cuda_thread_x_direct, ...>>> requires at least 16 threads in the x direction and at least 1 block in the x direction. These requirements are calculated here:
LaunchDims launch_dims = executor_t::calculateDimensions(data);
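
To make the bound-combining concrete, a hypothetical sketch of the logic described above (illustrative only, not RAJA's actual code): the policy's num_threads caps the launch, while the enclosed statements set the floor.

```cpp
#include <algorithm>
#include <stdexcept>

// Hypothetical helper: clamp a candidate thread count (e.g. a runtime
// recommendation) to the range implied by the policy and the statements.
inline int resolve_num_threads(int candidate,
                               int statement_lower_bound, // e.g. 16 from tile_fixed<16>
                               int policy_upper_bound)    // e.g. 1024; 0 = unspecified
{
  if (policy_upper_bound > 0 && statement_lower_bound > policy_upper_bound) {
    // The loop structure needs more threads than the policy permits.
    throw std::runtime_error("enclosed statements exceed the num_threads bound");
  }
  int threads = std::max(candidate, statement_lower_bound);
  if (policy_upper_bound > 0) threads = std::min(threads, policy_upper_bound);
  return threads;
}
```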

@davideberius (Author)

@MrBurmark, so the upper bound used for these kernels in the compile-time case is based not so much on the kernel as on the definition of the policies? So kernels with high register usage slip through the cracks.

So my question then becomes: where should this fix be implemented? I currently have it at the spot where a failure is about to occur, but I think it would be better to have it where max_threads is calculated, if possible.

@MrBurmark (Member)

@davideberius Yes, the upper bound in the compile-time case is from the policy. Indeed, the upper bound may be too high for kernels with higher register usage, and that was not being checked.
I agree that the upper bound should be fixed when calculating max_threads. What concerns me is that the occupancy calculator function maximizes overall occupancy instead of giving the max possible threads per block. I'm also concerned that whatever is decided in max_threads could be less than min_threads.
Wherever the fix is implemented, we should also be able to catch failure and print a reasonable error message.
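
On catching the failure: a minimal sketch of a post-launch check (hypothetical helper, not the PR's code). A launch whose block size exceeds the kernel's register budget typically fails with cudaErrorLaunchOutOfResources ("too many resources requested for launch"):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical: turn a silent launch failure into a readable message.
inline void check_cuda_launch(const char *kernel_name)
{
  cudaError_t err = cudaGetLastError();
  if (err != cudaSuccess) {
    fprintf(stderr, "CUDA launch of %s failed: %s\n",
            kernel_name, cudaGetErrorString(err));
  }
}
```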

@davideberius (Author)

I've been looking for a CUDA function that determines the number of registers required by a kernel in order to help determine a maximum number of threads, but I haven't found one yet.
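
For reference, cudaFuncGetAttributes does report per-kernel resource usage, including registers per thread and the largest block size that can launch at all; a small sketch with a hypothetical wrapper, assuming `func` is a pointer to the __global__ launcher:

```cpp
#include <cuda_runtime.h>

// Sketch: query the compiled resource usage of a __global__ function.
// attr.maxThreadsPerBlock already accounts for register pressure, so it is
// an upper bound on any num_threads that can successfully launch.
template <typename Func>
int max_launchable_threads(Func func, int &regs_per_thread)
{
  cudaFuncAttributes attr;
  cudaFuncGetAttributes(&attr, reinterpret_cast<const void *>(func));
  regs_per_thread = attr.numRegs; // registers used per thread
  return attr.maxThreadsPerBlock;
}
```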

@@ -477,7 +499,17 @@ struct StatementExecutor<
   // Compute the MAX physical kernel blocks
   //
   int max_blocks;
-  launch_t::max_blocks(shmem, max_blocks, launch_dims.num_threads());
+  int adjusted_threads = launch_dims.num_threads();
+  launch_t::max_blocks(shmem, max_blocks, adjusted_threads);
(Member) commented on this diff:

The adjusted_threads should be corrected if it is less than min_threads.

@MrBurmark (Member)

Thinking about this more, I have convinced myself that the real solution to this problem is to switch the default policy to use the occupancy calculator, so that num_threads is unspecified. That way the initial recommended value comes from the occupancy calculator instead of being changed to use it later. When using the specified-num_threads policies, the kernel should either launch with that number of threads or fail to launch, and we should catch that failure and print a better error message.
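
For what that could look like, a minimal sketch of an occupancy-calculator-driven launch (illustrative names and signature, not RAJA's implementation), using a launcher like the CudaKernelLauncher in the diff just below:

```cpp
#include <cuda_runtime.h>

// Hypothetical: when the policy leaves num_threads unspecified, let the
// occupancy calculator pick the block size for this kernel up front.
template <typename Func, typename Data>
void launch_with_recommended_size(Func func, Data data,
                                  size_t shmem, cudaStream_t stream)
{
  int min_grid_size = 0;
  int block_size = 0;
  // Block size that maximizes occupancy given this kernel's register
  // and shared-memory usage.
  cudaOccupancyMaxPotentialBlockSize(&min_grid_size, &block_size, func, shmem);
  func<<<min_grid_size, block_size, shmem, stream>>>(data);
}
```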

  inline static void max_blocks(int shmem_size,
-                               int &max_blocks, int actual_threads)
+                               int &max_blocks, int &actual_threads)
  {
    auto func = internal::CudaKernelLauncher<Data, executor_t>;
(Member) commented on this diff:

This should use the fixed kernel launch function if num_threads is greater than zero:
internal::CudaKernelLauncherFixed<num_threads, Data, executor_t>
I broke this when I consolidated the CudaLaunchHelper implementation into a single code base.

@MrBurmark (Member)

After talking with @ajkunen, the intended behavior when num_threads is specified at compile time is for it to act as a max_threads.
This should never have failed the way it did, because the global function should have launch_bounds applied; that is not happening because I broke that part the last time I worked on this code.
I'll fix that.
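
For context, a simplified sketch of a launcher with launch_bounds applied (the real launcher's signature differs): with __launch_bounds__, the compiler caps register allocation so that a launch at the specified size cannot fail from register pressure.

```cpp
// Simplified: BlockSize is the compile-time num_threads from the policy.
template <int BlockSize, typename Data>
__global__ void __launch_bounds__(BlockSize) kernel_launcher_fixed(Data data)
{
  // ... execute the enclosed statements; the compiler limits register use
  // so a launch with up to BlockSize threads per block always fits ...
}
```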
