This repository has been archived by the owner on Mar 21, 2024. It is now read-only.

Wrap launch bounds #570

Draft
wants to merge 1 commit into base: main
Conversation

gevtushenko (Collaborator)

This PR addresses the linked issue (New indirection level for launch bounds) by replacing __launch_bounds__ usages with CUB_DETAIL_LAUNCH_BOUNDS, which expands to __launch_bounds__ only when RDC is not specified. Builds without RDC are not affected by this PR. For builds with RDC, the maximal performance differences are:

| facility | type | diff |
| --- | --- | --- |
| cub::DeviceSelect::If (complex predicate) | All | 0% |
| cub::DeviceSelect::If | U32 | -9% |
| cub::DeviceSelect::If | U64 | 0% |
| cub::DeviceSegmentedReduce::Sum | U8 | 1% |
| cub::DeviceSegmentedReduce::Sum | U64 | -10% |
| cub::DeviceSegmentedRadixSort::SortPairs | U{8,16} | 4% |
| cub::DeviceSegmentedRadixSort::SortPairs | U{32,64} | -32% |
| cub::DeviceSegmentedRadixSort::SortKeys | U{8,16} | 8% |
| cub::DeviceSegmentedRadixSort::SortKeys | U{32,64} | -25% |
| cub::DeviceScan::InclusiveSum | All | 0% |
| cub::DeviceScan::InclusiveSum - complex op | F32 | -7% |
| cub::DeviceScan::ExclusiveSum | All | 0% |
| cub::DeviceReduce::Reduce - custom op | All | -8% |
| cub::DeviceReduce::Reduce | U8 | 20% |
| cub::DeviceReduce::Reduce | U32 | 8% |
| cub::DeviceReduce::Reduce | F64 | -4% |
| cub::DevicePartition::If | All | -4% |
| cub::DeviceHistogram::HistogramRange - A lot of bins | U64 | 30% |
| cub::DeviceHistogram::HistogramRange | All | 0% |
| cub::DeviceHistogram::HistogramEven - A lot of bins | U32 | -20% |
| cub::DeviceHistogram::HistogramEven | All | 0% |
| cub::DeviceAdjacentDifference | All | 0% |

A negative diff means the version without __launch_bounds__ is faster. Since the results are quite mixed, I would not like to advertise the macro as part of our API. If absolutely needed, one might define:

#define CUB_DETAIL_LAUNCH_BOUNDS(...) \
  __launch_bounds__(__VA_ARGS__)

#include <cub/cub.cuh>

But for now it's an implementation detail that fixes compilation with RDC in some corner cases. Going forward, we might consider having a tuning API that would control the __launch_bounds__ specification as well as #pragma unroll usage. The default tuning would be a function of the input types.
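
For reference, a minimal sketch of what such a wrapper could look like, assuming __CUDACC_RDC__ (the macro nvcc defines when relocatable device code is enabled) is used to detect RDC builds; the actual definition in this PR may differ:

#if defined(__CUDACC_RDC__)
// Under RDC, omit the attribute entirely so the compiler is free to pick the
// register allocation.
#  define CUB_DETAIL_LAUNCH_BOUNDS(...)
#else
// Without RDC, forward to the regular attribute.
#  define CUB_DETAIL_LAUNCH_BOUNDS(...) __launch_bounds__(__VA_ARGS__)
#endif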

gevtushenko added this to Inbox in PR Tracking via automation on Sep 5, 2022
gevtushenko added the "type: bug: compiler" label (Bug in a compiler, not this library.) on Sep 5, 2022
gevtushenko added this to the 2.1.0 milestone on Sep 5, 2022
gevtushenko moved this from Inbox to Need Review in PR Tracking on Sep 5, 2022
gevtushenko added a commit to gevtushenko/thrust that referenced this pull request on Sep 5, 2022
gevtushenko added the "testing: gpuCI in progress" label (Started gpuCI testing.) on Sep 5, 2022
dkolsen-pgi (Collaborator)

The performance results are interesting, with the mix of better and worse. My interpretation is that some of the functions are not as well tuned as they could be.

gevtushenko (Collaborator, Author)

> The performance results are interesting, with the mix of better and worse. My interpretation is that some of the functions are not as well tuned as they could be.

My interpretation is that the presence of __launch_bounds__ probably wasn't questioned during the tuning process before.

gevtushenko marked this pull request as draft on September 6, 2022
gevtushenko (Collaborator, Author)

Testing revealed some issues with this approach. We can't simply remove __launch_bounds__; here's a reproducer that shows why:

#include <cstdio>
#include <cub/cub.cuh>

using MaxPolicyT = cub::DispatchRadixSort<false, unsigned short, cub::NullType, unsigned int>::MaxPolicy;

int main() {
  int sm_occupancy{};
  // Ask how many 512-thread blocks of the downsweep kernel fit on one SM.
  if (cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &sm_occupancy,
        cub::DeviceRadixSortDownsweepKernel<MaxPolicyT, false, true, unsigned short, cub::NullType, unsigned int>,
        512, // threads per block
        0))  // dynamic shared memory per block
  {
    std::printf("error\n");
  }
  std::printf("%d\n", sm_occupancy);
}

When DeviceRadixSortDownsweepKernel has __launch_bounds__ specified, cudaOccupancyMaxActiveBlocksPerMultiprocessor reports an occupancy of 1. Otherwise, it reports 0. That's because without __launch_bounds__ nvcc generates code that requires 153 registers per thread, which amounts to 78336 registers for a 512-thread block. V100, however, provides only 65536 registers per thread block, making it impossible to launch the kernel with the required thread block size.

In order to proceed with this PR, we would have to retune the entire CUB for every supported HW with and without __launch_bounds__. Even then, I have doubts that we can do so in the presence of custom user types. It might happen that a custom user type pushes the register usage past a particular count and we are again unable to launch a kernel with the pre-tuned thread block size. So the tuning process has to be complemented with a new abstraction layer that would iterate over the thread block size search space and converge on a tuning based on funcAttrib.numRegs (see the sketch below). @dkolsen-pgi, am I missing something?
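
For illustration only, a rough sketch of such a register-driven clamp; the helper name below is hypothetical and not part of CUB, and, as noted in the next comment, funcAttrib.numRegs is of limited use for __device__ functions under RDC:

#include <cuda_runtime.h>

// Hypothetical helper (not CUB API): shrink the block size until the kernel's
// register requirement fits the device's per-block register budget.
template <class KernelT>
int max_feasible_block_size(KernelT kernel, int preferred_block_size)
{
  cudaFuncAttributes attr{};
  cudaFuncGetAttributes(&attr, kernel);

  cudaDeviceProp prop{};
  cudaGetDeviceProperties(&prop, 0);

  int block_size = preferred_block_size;
  while (block_size > 32 && attr.numRegs * block_size > prop.regsPerBlock)
  {
    block_size /= 2; // fall back to the next smaller block size
  }
  return block_size;
}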

jrhemstad (Collaborator) commented Sep 6, 2022

> So the tuning process has to be complemented with a new abstraction layer that would iterate over the thread block size search space and converge on a tuning based on funcAttrib.numRegs. @dkolsen-pgi, am I missing something?

As a shorter-term workaround, we could clamp the threads-per-block at 128 when RDC is used. With the 255 registers-per-thread limit, we are guaranteed that a CTA size of 128 (at most 128 × 255 = 32640 registers) will always fall within the register allocation requirements.

This would obviously have performance implications, but I don't see what else we can do better any time soon.
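
A minimal sketch of that clamp, assuming __CUDACC_RDC__ is used to detect RDC builds; the macro name is hypothetical and not an existing CUB macro:

#ifdef __CUDACC_RDC__
// Under RDC, cap the tuned block size at 128 threads: even at the
// 255 registers-per-thread ceiling, 128 * 255 = 32640 registers fits the
// per-block register limit.
#  define CUB_DETAIL_CLAMP_BLOCK_THREADS(threads) ((threads) < 128 ? (threads) : 128)
#else
#  define CUB_DETAIL_CLAMP_BLOCK_THREADS(threads) (threads)
#endif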

> a new abstraction layer that would iterate over the thread block size search space and converge on a tuning based on funcAttrib.numRegs

I don't think this would work. You can only use cudaFuncGetAttributes on __global__ functions. If you queried the kernel for its register usage, that doesn't tell you anything about the register requirements of the invoked __device__ functions when using RDC.

gevtushenko (Collaborator, Author)

@jrhemstad I agree with your point, thanks! I'll probably try to clamp the thread block size.

jrhemstad removed their request for review on April 26, 2023
Successfully merging this pull request may close these issues.

New indirection level for launch bounds