
CUDA Max Blocks Fix #612

Open · wants to merge 5 commits into develop

Conversation

@davideberius (Author)

Added a fix for the case where the requested number of CUDA threads requires too many registers and results in a maximum number of CUDA blocks of 0.
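
A minimal sketch of the idea, with illustrative names rather than the exact patch: if the requested thread count needs too many registers to fit even one block, fall back to a block size the occupancy calculator reports as feasible.

```cpp
#include <cuda_runtime.h>

// Hypothetical helper (not the PR's exact code): detect the zero-blocks
// case and adjust the thread count to something that can actually launch.
template <typename Func>
void adjust_max_blocks(Func func, size_t shmem, int &max_blocks, int &actual_threads)
{
  cudaOccupancyMaxActiveBlocksPerMultiprocessor(&max_blocks, func,
                                                actual_threads, shmem);
  if (max_blocks <= 0) {
    // The requested threads cannot fit; ask CUDA for a block size that can.
    int min_grid_size = 0;
    cudaOccupancyMaxPotentialBlockSize(&min_grid_size, &actual_threads,
                                       func, shmem);
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&max_blocks, func,
                                                  actual_threads, shmem);
  }
}
```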

Signed-off-by: David Eberius <davideberius@gmail.com>

codecov bot commented Jun 5, 2019

Codecov Report

❗ No coverage uploaded for pull request base (develop@1b8c262).
The diff coverage is n/a.


@@             Coverage Diff             @@
##             develop      #612   +/-   ##
===========================================
  Coverage           ?   98.663%           
===========================================
  Files              ?        65           
  Lines              ?      1272           
  Branches           ?         0           
===========================================
  Hits               ?      1255           
  Misses             ?        17           
  Partials           ?         0


codecov bot commented Jun 5, 2019

Codecov Report

Merging #612 into develop will decrease coverage by 0.001%.
The diff coverage is n/a.


@@              Coverage Diff              @@
##           develop      #612       +/-   ##
=============================================
- Coverage   98.664%   98.663%   -0.002%     
=============================================
  Files           65        65               
  Lines         1273      1272        -1     
=============================================
- Hits          1256      1255        -1     
  Misses          17        17
Impacted Files                          Coverage Δ
include/RAJA/policy/openmp/scan.hpp     100% <0%> (ø) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@rhornung67 (Member)

@ajkunen and @MrBurmark please look over this PR and weigh in. Thanks.

@ajkunen (Contributor) left a comment

Looks good to me

@ajkunen (Contributor) left a comment

Actually, @MrBurmark says we should be using the cached occupancy calculator that we use elsewhere. I'll let him comment on this.

@rhornung67 (Member)

Yeah, that's why I wanted you and @MrBurmark to weigh in. We need to make sure we aren't adding runtime overhead if we can avoid it.

@davideberius (Author) commented Jun 6, 2019

The code path that determines the recommended threads is only executed if both num_blocks and num_threads are <= 0. The comments suggest that this means that both blocks and threads are determined at runtime. In my case, I'm hitting the logic for 'determine blocks at runtime and determine threads at compile time'.

In this case, the number of threads is not based on what can fit on the card with this kernel, so the cudaOccupancyMaxPotentialBlockSize would only be called once in the code path I am hitting.

@davideberius (Author)

It might be worth verifying that the compile-time chosen num_threads will fit at the point that recommended_threads is assigned.

@MrBurmark (Member)

The overhead concern is running the cudaOccupancy functions each time the kernel is hit in the user code instead of just the first time.
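
For illustration, a common way to pay that cost only once is a per-instantiation cache; a rough sketch with a hypothetical helper, assuming the thread count and shared-memory size are fixed for a given kernel instantiation:

```cpp
#include <cuda_runtime.h>

// Hypothetical caching helper: the occupancy query runs on the first call
// only; later launches of the same kernel reuse the stored result. One
// static per template instantiation, so each kernel gets its own cache.
template <typename Func>
int cached_max_blocks(Func func, int threads, size_t shmem)
{
  static const int max_blocks = [&] {
    int blocks = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks, func, threads, shmem);
    return blocks;
  }();
  return max_blocks;
}
```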

…da_occupancy_max_blocks_threads call.

Signed-off-by: David Eberius <davideberius@gmail.com>
  inline static void max_blocks(int shmem_size,
-                               int &max_blocks, int actual_threads)
+                               int &max_blocks, int &actual_threads)
(Member) commented on this diff:

actual_threads should not be changed at this point if num_threads was specified at compile-time.

@MrBurmark (Member)

It makes sense to have CudaLaunchHelper::max_threads calculate an actual max threads in the runtime case. Perhaps it could error out in the compile-time case if the max threads from the occupancy calculator is too low, so we can avoid this error in the future. A test for these cases should be added as well.

@davideberius (Author)

The problem arises when the compile-time number of threads is too high, which I think should be taken into account when max_threads is calculated.

@MrBurmark (Member)

I agree. How about you move the occupancy calculation for max_threads into the run-time case in max_threads, so there is something reasonable there? In the compile-time case there isn't anything RAJA can do but check that what the user specified is possible and error out in a reasonable way if it isn't.

@davideberius (Author)

The thing that confuses me about the whole compile/runtime decision is that I as the user didn't specify anything about the number of threads in my policies, so how am I always hitting the case where threads are 'determined at compile time'?

@MrBurmark (Member) commented Jun 7, 2019

Looks like I was forgetting how part of this all works, so I'll go over the process in some detail.
The default policy is defined here:

using CudaKernel = CudaKernelFixed<1024, EnclosedStmts...>;

All cudaKernel policies end up instantiating a cuda_launch<async, num_blocks, num_threads> class to specify their num_threads and num_blocks.
These "compile-time" num_threads and num_blocks are upper bounds if specified.
The lower bounds come from the policies inside the cudaKernel.
For example, cudaKernel<Tile<0, tile_fixed<16>, cuda_block_x_loop, For<0, cuda_thread_x_direct, ...>>> requires at least 16 threads in the x direction and at least 1 block in the x direction. These requirements are calculated here:
LaunchDims launch_dims = executor_t::calculateDimensions(data);
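
To make the bound-combining concrete, a hypothetical sketch of the logic described above (illustrative only, not RAJA's actual code): the policy's num_threads caps the launch, while the enclosed statements set the floor.

```cpp
#include <algorithm>
#include <stdexcept>

// Hypothetical helper: clamp a candidate thread count (e.g. a runtime
// recommendation) to the range implied by the policy and the statements.
inline int resolve_num_threads(int candidate,
                               int statement_lower_bound, // e.g. 16 from tile_fixed<16>
                               int policy_upper_bound)    // e.g. 1024; 0 = unspecified
{
  if (policy_upper_bound > 0 && statement_lower_bound > policy_upper_bound) {
    // The loop structure needs more threads than the policy permits.
    throw std::runtime_error("enclosed statements exceed the num_threads bound");
  }
  int threads = std::max(candidate, statement_lower_bound);
  if (policy_upper_bound > 0) threads = std::min(threads, policy_upper_bound);
  return threads;
}
```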

@davideberius (Author)

@MrBurmark, so the upper bound used for these kernels in the compile-time case is based not so much on the kernel as on the definition of the policies? So kernels with high register usage slip through the cracks.

So my question then becomes: where should this fix be implemented? I currently have it at the spot where a failure is about to occur, but I think it would be better to have it where max_threads is calculated, if possible.

@MrBurmark (Member)

@davideberius Yes, the upper bound in the compile-time case is from the policy. Indeed, the upper bound may be too high for kernels with higher register usage, and that was not being checked.
I agree that the upper bound should be fixed when calculating max_threads. What concerns me is that the occupancy calculator function maximizes overall occupancy instead of giving the max possible threads per block. I'm also concerned that whatever is decided in max_threads could be less than min_threads.
Wherever the fix is implemented, we should also be able to catch failure and print a reasonable error message.
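
On catching the failure: a minimal sketch of a post-launch check (hypothetical helper, not the PR's code). A launch whose block size exceeds the kernel's register budget typically fails with cudaErrorLaunchOutOfResources ("too many resources requested for launch"):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical: turn a silent launch failure into a readable message.
inline void check_cuda_launch(const char *kernel_name)
{
  cudaError_t err = cudaGetLastError();
  if (err != cudaSuccess) {
    fprintf(stderr, "CUDA launch of %s failed: %s\n",
            kernel_name, cudaGetErrorString(err));
  }
}
```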

@davideberius (Author)

I've been looking for a CUDA function that determines the number of registers required by a kernel in order to help determine a maximum number of threads, but I haven't found one yet.
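
For reference, cudaFuncGetAttributes does report per-kernel resource usage, including registers per thread and the largest block size that can launch at all; a small sketch with a hypothetical wrapper, assuming `func` is a pointer to the __global__ launcher:

```cpp
#include <cuda_runtime.h>

// Sketch: query the compiled resource usage of a __global__ function.
// attr.maxThreadsPerBlock already accounts for register pressure, so it is
// an upper bound on any num_threads that can successfully launch.
template <typename Func>
int max_launchable_threads(Func func, int &regs_per_thread)
{
  cudaFuncAttributes attr;
  cudaFuncGetAttributes(&attr, reinterpret_cast<const void *>(func));
  regs_per_thread = attr.numRegs; // registers used per thread
  return attr.maxThreadsPerBlock;
}
```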

@@ -477,7 +499,17 @@ struct StatementExecutor<
   // Compute the MAX physical kernel blocks
   //
   int max_blocks;
-  launch_t::max_blocks(shmem, max_blocks, launch_dims.num_threads());
+  int adjusted_threads = launch_dims.num_threads();
+  launch_t::max_blocks(shmem, max_blocks, adjusted_threads);
(Member) commented on this diff:

The adjusted_threads should be corrected if it is less than min_threads.

@MrBurmark (Member)

Thinking about this more, I have convinced myself that the real solution to this problem is to switch the default policy to use the occupancy calculator, so that num_threads is unspecified. That way the initial recommended value comes from the occupancy calculator instead of being changed to use it later. When using the specified-num_threads policies, the kernel should either launch with that number of threads or fail to launch, and we should catch that failure and print a better error message.
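
For what that could look like, a minimal sketch of an occupancy-calculator-driven launch (illustrative names and signature, not RAJA's implementation), using a launcher like the CudaKernelLauncher in the diff just below:

```cpp
#include <cuda_runtime.h>

// Hypothetical: when the policy leaves num_threads unspecified, let the
// occupancy calculator pick the block size for this kernel up front.
template <typename Func, typename Data>
void launch_with_recommended_size(Func func, Data data,
                                  size_t shmem, cudaStream_t stream)
{
  int min_grid_size = 0;
  int block_size = 0;
  // Block size that maximizes occupancy given this kernel's register
  // and shared-memory usage.
  cudaOccupancyMaxPotentialBlockSize(&min_grid_size, &block_size, func, shmem);
  func<<<min_grid_size, block_size, shmem, stream>>>(data);
}
```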

  inline static void max_blocks(int shmem_size,
-                               int &max_blocks, int actual_threads)
+                               int &max_blocks, int &actual_threads)
  {
    auto func = internal::CudaKernelLauncher<Data, executor_t>;
(Member) commented on this diff:

This should use the fixed kernel launch function if num_threads is greater than zero:
internal::CudaKernelLauncherFixed<num_threads, Data, executor_t>
I broke this when I consolidated the CudaLaunchHelper implementation into a single code base.

@MrBurmark (Member)

After talking with @ajkunen, the intended behavior when num_threads is specified at compile time is for it to act as a max_threads.
This should never have failed the way it did, because the global function should have launch_bounds applied; that is not happening because I broke that part the last time I worked on this code.
I'll fix that.
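
For context, a simplified sketch of a launcher with launch_bounds applied (the real launcher's signature differs): with __launch_bounds__, the compiler caps register allocation so that a launch at the specified size cannot fail from register pressure.

```cpp
// Simplified: BlockSize is the compile-time num_threads from the policy.
template <int BlockSize, typename Data>
__global__ void __launch_bounds__(BlockSize) kernel_launcher_fixed(Data data)
{
  // ... execute the enclosed statements; the compiler limits register use
  // so a launch with up to BlockSize threads per block always fits ...
}
```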
