Overhead of first reduction with CUDA backend #1558

Open · gzagaris opened this issue Oct 16, 2023 · 6 comments

@gzagaris (Member)

Hi everyone,

We've observed that the execution time of the first reduction is notably high (in some cases, slower than sequential). However, subsequent reductions do not exhibit this behavior, suggesting that there might be some overhead (perhaps internal initialization?) with the first reduction that is invoked by the application.

At your convenience, can you confirm whether this is the expected behavior of the current implementation and elaborate a bit on what is happening?

Steps To Reproduce

  • We first noticed this with the latest RAJA-v2023.06.1 (we haven't tried older versions, but I would expect it to be reproducible there as well)
  • Here is a quick code snippet that illustrates the behavior:
#include <RAJA/RAJA.hpp>

#include <chrono>
#include <iostream>

static constexpr int NUM_ITEMS = 20000;
static constexpr int NUM_BYTES = NUM_ITEMS * sizeof(int);
static constexpr int TEST_VAL_1 = 42;

using LOOP_EXEC   = RAJA::cuda_exec< 256 >;
using REDUCE_EXEC = RAJA::cuda_reduce;

  // Allocate and initialize the data
  int* data_1 = nullptr;
  cudaMalloc((void**)&data_1, NUM_BYTES);
  RAJA::forall< LOOP_EXEC >(
    RAJA::RangeSegment(0, NUM_ITEMS),
    [=] RAJA_HOST_DEVICE(const int& idx) {
    data_1[ idx ] = TEST_VAL_1;
  } );

  // Reduce data
  {
    auto start = std::chrono::high_resolution_clock::now();

    RAJA::ReduceSum< REDUCE_EXEC, int > sum(0);
    RAJA::forall< LOOP_EXEC >( 
      RAJA::RangeSegment(0,NUM_ITEMS), 
      [=] RAJA_HOST_DEVICE(const int& idx) {
      sum += data_1[ idx ];
    } );
    int result = static_cast< int >( sum.get() );

    auto end = std::chrono::high_resolution_clock::now();
    double elapsed = std::chrono::duration< double >(end-start).count();
    std::cout << "elapsed[1] = " << elapsed << std::endl;
  }

  // Reduce data again
  {
    auto start = std::chrono::high_resolution_clock::now();

    RAJA::ReduceSum< REDUCE_EXEC, int > sum(0);
    RAJA::forall< LOOP_EXEC >( 
      RAJA::RangeSegment(0,NUM_ITEMS), 
      [=] RAJA_HOST_DEVICE(const int& idx) {
      sum += data_1[ idx ];
    } );
    int result = static_cast< int >( sum.get() );

    auto end = std::chrono::high_resolution_clock::now();
    double elapsed = std::chrono::duration< double >(end-start).count();
    std::cout << "elapsed[2] = " << elapsed << std::endl;
  }
  cudaFree(data_1);

This produces the following output:

elapsed[1] = 0.0221074    // <--------------- EXECUTION TIME OF FIRST REDUCTION
elapsed[2] = 1.5151e-05   // <--------------- RUNNING THE REDUCTION AGAIN IS SIGNIFICANTLY FASTER

Please, let me know if you need any additional information. Thank you for all your time and help.

@MrBurmark (Member) commented Oct 23, 2023

This is expected behavior. The reducers need memory to function, so the first reducer call allocates and initializes RAJA's internal memory pools (device, device-zeroed, and pinned pools). You should be able to see this happening if you profile with something like Nsight Systems.
I am thinking about how to allow users to pass in their own allocators so we don't have these separate, RAJA-only pools of memory.
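
Until then, one possible workaround, sketched below under the assumption that a tiny throwaway reduction touches the same internal pools (the helper name is made up for illustration), is to pay the one-time cost during application startup rather than inside a timed region:

void warmupCudaReductions()  // hypothetical helper, not a RAJA API
{
  // A 1-iterate reduction forces the device, device-zeroed, and pinned
  // pools to be allocated and initialized up front.
  RAJA::ReduceSum< RAJA::cuda_reduce, int > warm(0);
  RAJA::forall< RAJA::cuda_exec< 256 > >(
    RAJA::RangeSegment(0, 1),
    [=] RAJA_HOST_DEVICE(int /* idx */) { warm += 1; } );
  (void) warm.get();  // get() synchronizes, so initialization completes here
}
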
In other reduction news: for loops with reductions in them, try the policy RAJA::cuda_exec_occ_calc_async<BLOCK_SIZE>. Note that this policy adds overhead the first time each loop that uses it is run, because the occupancy calculator is invoked. For loops with extents significantly over a million, it improves reduction performance substantially. The policy uses the occupancy calculator to launch only as many GPU threads as can run concurrently on the device, then uses a grid-stride loop to run all the iterates of the loop. This results in a simpler reduction tree and less temporary memory usage in the implementation. These policies will likely become the recommended ones to use with reductions in the near future.
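
For concreteness, here is a minimal sketch of that swap applied to the snippet from the issue description (keeping the 256 block size from the original; since the policy is asynchronous, the reducer's get() synchronizes before returning the result):

using LOOP_EXEC   = RAJA::cuda_exec_occ_calc_async< 256 >;
using REDUCE_EXEC = RAJA::cuda_reduce;

RAJA::ReduceSum< REDUCE_EXEC, int > sum(0);
RAJA::forall< LOOP_EXEC >(
  RAJA::RangeSegment(0, NUM_ITEMS),
  [=] RAJA_HOST_DEVICE(const int& idx) {
    sum += data_1[ idx ];
  } );
int result = static_cast< int >( sum.get() );  // synchronizes with the async launch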

@gzagaris (Member, Author)

Thank you for the detailed explanation and suggestions @MrBurmark!

The RAJA::cuda_exec_occ_calc_async< BLOCK_SIZE > policy sounds fancy! I'll definitely try it out.

  • Can this policy also be used for nested loops with RAJA::kernel, i.e., as the policy in a RAJA::statement::For? Would you recommend that?

On a slight tangent, I was thinking a bit more about this the other day, and I was wondering whether hiding initial overheads like this is a use case that justifies providing methods such as RAJA::initialize(argc, argv) and RAJA::finalize(). Moreover, such methods could provide a good way for users to pass allocators and memory pools from application space for RAJA to use internally. I am not certain whether this has been considered in the past, but I wanted to bring it up for your consideration. A hypothetical usage sketch follows.
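
Purely as an illustration of the idea (these methods do not exist in RAJA today, and the allocator hook is invented for the sketch):

// Hypothetical sketch only: RAJA::initialize/RAJA::finalize are NOT part
// of RAJA; this just illustrates the proposed usage pattern.
int main(int argc, char** argv)
{
  RAJA::initialize(argc, argv);  // hypothetical: set up internal pools, parse options
  // RAJA::setDeviceAllocator(myAllocator);  // hypothetical allocator injection point
  // ... application code using RAJA ...
  RAJA::finalize();  // hypothetical: release internal pools and resources
  return 0;
}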

Thank you again for all your help.

@MrBurmark (Member) commented Oct 23, 2023

You can't use that policy with RAJA::kernel.
There are already occupancy-calculator policies for kernel, but using them is a bit more complicated.
You have to use an occupancy CudaKernel statement for the occupancy calculator to be used when choosing launch parameters such as the number of blocks. In addition, block_direct and global_direct policies should probably be replaced with block_loop or global_loop policies to ensure that the number of blocks required stays below the bounds imposed by the occupancy calculator. For example:

      RAJA::KernelPolicy<
        RAJA::statement::CudaKernelFixedAsync<block_sz,
          RAJA::statement::For<0, RAJA::cuda_global_size_x_direct<block_sz>,
            RAJA::statement::For<1, RAJA::cuda_block_y_direct,
              RAJA::statement::For<2, RAJA::cuda_thread_y_direct,
                RAJA::statement::Lambda<0> > > > > >;

vs

      RAJA::KernelPolicy<
        RAJA::statement::CudaKernelOccAsync<
  // or RAJA::statement::CudaKernelExt<cuda_occ_calc_launch<block_sz, true>,
          RAJA::statement::For<0, RAJA::cuda_global_size_x_loop<block_sz>,
            RAJA::statement::For<1, RAJA::cuda_block_y_loop,
              RAJA::statement::For<2, RAJA::cuda_thread_y_direct,
                RAJA::statement::Lambda<0> > > > > >;
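
For completeness, a minimal sketch of how the second (occupancy-calculator) form might be driven from RAJA::kernel; the block size and extents here are placeholder values chosen for the sketch:

constexpr int block_sz = 256;            // placeholder block size
const int NI = 100, NJ = 100, NK = 100;  // placeholder extents

using OCC_POL = RAJA::KernelPolicy<
  RAJA::statement::CudaKernelOccAsync<
    RAJA::statement::For<0, RAJA::cuda_global_size_x_loop<block_sz>,
      RAJA::statement::For<1, RAJA::cuda_block_y_loop,
        RAJA::statement::For<2, RAJA::cuda_thread_y_direct,
          RAJA::statement::Lambda<0> > > > > >;

RAJA::kernel< OCC_POL >(
  RAJA::make_tuple( RAJA::RangeSegment(0, NI),
                    RAJA::RangeSegment(0, NJ),
                    RAJA::RangeSegment(0, NK) ),
  [=] RAJA_DEVICE (int i, int j, int k) {
    // ... loop body ...
  } );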

@gzagaris (Member, Author)

Thanks @MrBurmark -- it's always complicated with RAJA::kernel 😄

The example you provided is exactly what I was looking for!

@rhornung67 (Member)

@gzagaris several years ago, we considered adding RAJA::initialize and RAJA::finalize methods. Kokkos does that, for example. We didn't really see a strong need at the time and thought RAJA would be more flexible without it. We will reconsider and let you know.

@gzagaris (Member, Author) commented Oct 23, 2023

> @gzagaris several years ago, we considered adding RAJA::initialize and RAJA::finalize methods. Kokkos does that, for example. We didn't really see a strong need at the time and thought RAJA would be more flexible without it. We will reconsider and let you know.

Sounds good, thanks @rhornung67!
