CUDA Max Blocks Fix #612
base: develop
Conversation
…equires too many registers and results in a maximum number of CUDA blocks of 0. Signed-off-by: David Eberius <davideberius@gmail.com>
Codecov Report
@@ Coverage Diff @@
## develop #612 +/- ##
===========================================
Coverage ? 98.663%
===========================================
Files ? 65
Lines ? 1272
Branches ? 0
===========================================
Hits ? 1255
Misses ? 17
Partials ? 0
Continue to review full report at Codecov.
Codecov Report
@@ Coverage Diff @@
## develop #612 +/- ##
=============================================
- Coverage 98.664% 98.663% -0.002%
=============================================
Files 65 65
Lines 1273 1272 -1
=============================================
- Hits 1256 1255 -1
Misses 17 17
Continue to review full report at Codecov.
@ajkunen and @MrBurmark, please look over this PR and weigh in. Thanks.
Looks good to me
Actually, @MrBurmark says we should be using the cached occupancy calculator that we use elsewhere; I'll let him comment on this.
Yeah, that's why I wanted you and @MrBurmark to weigh in. We need to make sure we aren't adding runtime overhead if we can avoid it.
The code path that determines the recommended threads is only executed if both num_blocks and num_threads are <= 0; the comments suggest this means both blocks and threads are determined at runtime. In my case, I'm hitting the logic for 'determine blocks at runtime and determine threads at compile time'. In this case, the number of threads is not based on what can fit on the card with this kernel, so cudaOccupancyMaxPotentialBlockSize would only be called once in the code path I am hitting.
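To make the decision paths above concrete, here is a minimal host-only sketch of the block/thread selection logic being described. This paraphrases the RAJA internals rather than quoting them; `occupancy_recommend` and `pick_launch_dims` are hypothetical stand-ins (the real code would call cudaOccupancyMaxPotentialBlockSize on the kernel), and the numeric values are assumptions for illustration only:

```cpp
#include <cassert>

// Hypothetical stand-in for the occupancy query; real code would call
// cudaOccupancyMaxPotentialBlockSize for the actual kernel function.
static void occupancy_recommend(int &blocks, int &threads)
{
  blocks = 108;   // e.g. one block per SM on an assumed device
  threads = 256;  // occupancy-recommended block size (assumed)
}

// Paraphrase of the launch-dimension selection described above.
// num_blocks / num_threads <= 0 means "determine at runtime".
static void pick_launch_dims(int num_blocks, int num_threads,
                             int &blocks, int &threads)
{
  if (num_blocks <= 0 && num_threads <= 0) {
    // Both determined at runtime: ask the occupancy calculator.
    occupancy_recommend(blocks, threads);
  } else if (num_blocks <= 0) {
    // Blocks at runtime, threads fixed at compile time: the fixed thread
    // count is NOT validated against what actually fits on the device,
    // which is the failure mode this PR addresses.
    blocks = -1;  // to be derived from max_blocks later
    threads = num_threads;
  } else {
    // Blocks fixed at compile time.
    blocks = num_blocks;
    threads = (num_threads > 0) ? num_threads : 256;
  }
}
```

The middle branch is the path described in the comment: the compile-time thread count is taken at face value, so a kernel with heavy register usage can end up with zero launchable blocks.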
It might be worth verifying that the compile-time chosen num_threads will fit at the point that recommended_threads is assigned.
The overhead concern is running the cudaOccupancy functions each time the kernel is hit in the user code instead of just the first time.
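The caching pattern being referred to can be sketched with a function-local static, so the occupancy query runs once per kernel instantiation rather than on every launch. This is a sketch, not the RAJA implementation: `query_occupancy` is a hypothetical stand-in for the real CUDA occupancy call, and the counter exists only to make the once-only behavior observable:

```cpp
#include <cassert>

static int g_query_count = 0;  // instrumentation for this sketch only

// Hypothetical stand-in for cudaOccupancyMaxPotentialBlockSize.
static int query_occupancy()
{
  ++g_query_count;
  return 256;  // assumed recommended block size
}

// Because the real helper is templated on the kernel, each kernel type
// gets its own static, so the cached value is per-kernel, not global.
template <typename Kernel>
static int cached_recommended_threads()
{
  static const int threads = query_occupancy();  // evaluated exactly once
  return threads;
}

struct KernelA {};  // placeholder kernel tag type
```

Subsequent calls for the same kernel type return the cached value without touching the occupancy API, which is the overhead being discussed.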
…da_occupancy_max_blocks_threads call. Signed-off-by: David Eberius <davideberius@gmail.com>
inline static void max_blocks(int shmem_size,
-                              int &max_blocks, int actual_threads)
+                              int &max_blocks, int &actual_threads)
actual_threads should not be changed at this point if num_threads was specified at compile-time.
It makes sense to have CudaLaunchHelper::max_threads calculate an actual max threads in the runtime case. Perhaps it could error if the max threads from the occupancy calculator is too low in the compile-time case so we can avoid this error in the future. A test for these cases should be added as well.
The problem arises when the compile-time number of threads is too high, which I think should be taken into account when max_threads is calculated.
I agree; how about you move the occupancy calculation for max_threads into the run-time case in max_threads so there is something reasonable there. In the compile-time case, there isn't anything RAJA can do but check that what the user specified is possible and error out in a reasonable way if it isn't.
The thing that confuses me about the whole compile/runtime decision is that I, as the user, didn't specify anything about the number of threads in my policies, so how am I always hitting the case where threads are 'determined at compile time'?
Looks like I was forgetting how part of this all works so I'll go over the process in some detail.
All cudaKernel policies end up instantiating a cuda_launch<async, num_blocks, num_threads> class to specify their num_threads and num_blocks.
These "compile-time" num_threads and num_blocks are upper bounds if specified. The lower bounds come from the policies inside the cudaKernel. For example cudaKernel<Tile<0, tile_fixed<16>, cuda_block_x_loop, For<0, cuda_thread_x_direct, ...>>> requires at least 16 threads in the x direction and at least 1 block in the x direction. These requirements are calculated here.
@MrBurmark, so the upper bound used for these kernels in the compile-time case is based not so much on the kernel as on the definition of the policies? That means kernels with high register usage slip through the cracks. My question then becomes: where should this fix be implemented? I currently have it at the spot where the failure is about to occur, but I think it would be better to apply it when calculating max_threads, if possible.
@davideberius Yes, the upper bound in the compile-time case is from the policy. Indeed, the upper bound may be too high for kernels with high register usage, and that was not being checked.
I've been looking for a CUDA function that determines the number of registers required by a kernel in order to help determine a maximum number of threads, but I haven't found one yet. |
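For what it's worth, cudaFuncGetAttributes does report per-thread register usage (cudaFuncAttributes::numRegs) and, more directly, maxThreadsPerBlock, which already accounts for the kernel's resource usage. Below is a host-only sketch of deriving a thread cap from register counts; the actual CUDA query is shown only in a comment so the sketch stays runnable without a device, and the device limits in the test are assumed sample values:

```cpp
#include <cassert>

// In real code one would query the kernel directly:
//   cudaFuncAttributes attr;
//   cudaFuncGetAttributes(&attr, kernel);
//   int regs_per_thread = attr.numRegs;   // registers per thread
//   int cap = attr.maxThreadsPerBlock;    // already resource-aware
// Here the cap is computed by hand from assumed device limits.
static int max_threads_by_registers(int regs_per_thread,
                                    int regs_per_block_limit,
                                    int warp_size = 32)
{
  int raw = regs_per_block_limit / regs_per_thread;
  return (raw / warp_size) * warp_size;  // round down to a warp multiple
}
```

For example, with a 64K-register block limit, a kernel using 255 registers per thread can support at most 256 threads per block; requesting 1024 would give zero launchable blocks, which is exactly the failure this PR guards against.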
@@ -477,7 +499,17 @@ struct StatementExecutor<
   // Compute the MAX physical kernel blocks
   //
   int max_blocks;
-  launch_t::max_blocks(shmem, max_blocks, launch_dims.num_threads());
+  int adjusted_threads = launch_dims.num_threads();
+  launch_t::max_blocks(shmem, max_blocks, adjusted_threads);
The adjusted_threads should be corrected if less than min_threads
Thinking about this more I have convinced myself that the real solution to this problem is to switch the default policy to use the occupancy calculator so that num_threads is unspecified. This way the initial recommended value comes from the occupancy calculator instead of changing it to use that later. When using the specified num_threads policies the kernel should either launch with that number of threads or fail to launch and we should be catching that failure and printing a better error message. |
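Under that approach, the compile-time-specified case would validate and fail loudly rather than silently adjust the thread count. A hedged sketch of what such a fail-fast check might look like (this is not RAJA code; the error path is paraphrased as a C++ exception, where the real code might abort with a message instead):

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Sketch: when num_threads was fixed at compile time, a max_blocks of 0
// means the kernel cannot launch at that block size at all; report it
// clearly instead of proceeding with a zero-sized grid.
static int checked_max_blocks(int occupancy_max_blocks, int fixed_num_threads)
{
  if (occupancy_max_blocks <= 0) {
    throw std::runtime_error(
        "Specified num_threads (" + std::to_string(fixed_num_threads) +
        ") uses too many resources per block for this kernel to launch");
  }
  return occupancy_max_blocks;
}
```

This matches the intent above: with the occupancy-calculator default policy, num_threads is unspecified and a workable value is chosen automatically, while an explicitly specified num_threads either launches as requested or produces a clear error.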
inline static void max_blocks(int shmem_size,
-                              int &max_blocks, int actual_threads)
+                              int &max_blocks, int &actual_threads)
  {
    auto func = internal::CudaKernelLauncher<Data, executor_t>;
This should use the fixed kernel launch function if num_threads is greater than zero.
internal::CudaKernelLauncherFixed<num_threads, Data, executor_t>
I broke this when I consolidated the CudaLaunchHelper implementation into a single code base.
After talking with @ajkunen the intended behavior when num_threads is specified at compile time is for that to be a max_threads. |
Added a fix for the case where the requested number of CUDA threads requires too many registers and results in a maximum number of CUDA blocks of 0.
Signed-off-by: David Eberius <davideberius@gmail.com>