Overhead of first reduction with CUDA backend #1558

Open · gzagaris opened this issue Oct 16, 2023 · 6 comments

@gzagaris (Member)

Hi everyone,

We've observed that the execution time of the first reduction is notably high (in some cases, slower than sequential). However, subsequent reductions do not exhibit this behavior, suggesting that there might be some overhead (perhaps internal initialization?) with the first reduction that is invoked by the application.

At your convenience, can you confirm whether this is the expected behavior of the current implementation and elaborate a bit on what is happening?

Steps To Reproduce

  • We first noticed this with the latest RAJA-v2023.06.1 (we haven't tried older versions, but I would expect it to be reproducible there as well)
  • Here is a quick code snippet that illustrates the behavior:
#include <RAJA/RAJA.hpp>

#include <chrono>
#include <iostream>

static constexpr int NUM_ITEMS = 20000;
static constexpr int NUM_BYTES = NUM_ITEMS * sizeof(int);
static constexpr int TEST_VAL_1 = 42;

using LOOP_EXEC   = RAJA::cuda_exec< 256 >;
using REDUCE_EXEC = RAJA::cuda_reduce;

  // Allocate and initialize the data
  int* data_1 = nullptr;
  cudaMalloc((void**)&data_1, NUM_BYTES);
  RAJA::forall< LOOP_EXEC >(
    RAJA::RangeSegment(0, NUM_ITEMS),
    [=] RAJA_HOST_DEVICE(const int& idx) {
    data_1[ idx ] = TEST_VAL_1;
  } );

  // Reduce data
  {
    auto start = std::chrono::high_resolution_clock::now();

    RAJA::ReduceSum< REDUCE_EXEC, int > sum(0);
    RAJA::forall< LOOP_EXEC >( 
      RAJA::RangeSegment(0,NUM_ITEMS), 
      [=] RAJA_HOST_DEVICE(const int& idx) {
      sum += data_1[ idx ];
    } );
    int result = static_cast< int >( sum.get() );

    auto end = std::chrono::high_resolution_clock::now();
    double elapsed = std::chrono::duration< double >(end-start).count();
    std::cout << "elapsed[1] = " << elapsed << std::endl;
  }

  // Reduce data again
  {
    auto start = std::chrono::high_resolution_clock::now();

    RAJA::ReduceSum< REDUCE_EXEC, int > sum(0);
    RAJA::forall< LOOP_EXEC >( 
      RAJA::RangeSegment(0,NUM_ITEMS), 
      [=] RAJA_HOST_DEVICE(const int& idx) {
      sum += data_1[ idx ];
    } );
    int result = static_cast< int >( sum.get() );

    auto end = std::chrono::high_resolution_clock::now();
    double elapsed = std::chrono::duration< double >(end-start).count();
    std::cout << "elapsed[2] = " << elapsed << std::endl;
  }
  cudaFree(data_1);

This produces the following output:

elapsed[1] = 0.0221074    // <--------------- EXECUTION TIME OF FIRST REDUCTION
elapsed[2] = 1.5151e-05   // <--------------- RUNNING THE REDUCTION AGAIN IS SIGNIFICANTLY FASTER

Please, let me know if you need any additional information. Thank you for all your time and help.

@MrBurmark (Member) commented Oct 23, 2023

This is expected behavior. The reducers need memory to function, so the first reducer call allocates and initializes RAJA's internal memory pools (device, device-zeroed, and pinned pools). You should be able to see this happening if you profile with something like Nsight Systems.
I am thinking about how to allow users to pass in their own allocators so we don't have these separate, RAJA-only pools of memory.
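
Until then, one possible workaround, sketched below under the assumption that a tiny throwaway reduction touches the same internal pools (the helper name is made up for illustration), is to pay the one-time cost during application startup rather than inside a timed region:

void warmupCudaReductions()  // hypothetical helper, not a RAJA API
{
  // A 1-iterate reduction forces the device, device-zeroed, and pinned
  // pools to be allocated and initialized up front.
  RAJA::ReduceSum< RAJA::cuda_reduce, int > warm(0);
  RAJA::forall< RAJA::cuda_exec< 256 > >(
    RAJA::RangeSegment(0, 1),
    [=] RAJA_HOST_DEVICE(int /* idx */) { warm += 1; } );
  (void) warm.get();  // get() synchronizes, so initialization completes here
}
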
In other reduction news: for loops with reductions in them, try the policy RAJA::cuda_exec_occ_calc_async<BLOCK_SIZE>. Note that this policy adds overhead the first time each loop that uses it is run, because the occupancy calculator is invoked. For loops with extents significantly over a million, it improves reduction performance substantially. The policy uses the occupancy calculator to launch only as many GPU threads as can run concurrently on the device, then uses a grid-stride loop to run all the iterates of the loop. This results in a simpler reduction tree and less temporary memory usage in the implementation. These policies will likely become the recommended ones to use with reductions in the near future.
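
For concreteness, here is a minimal sketch of that swap applied to the snippet from the issue description (keeping the 256 block size from the original; since the policy is asynchronous, the reducer's get() synchronizes before returning the result):

using LOOP_EXEC   = RAJA::cuda_exec_occ_calc_async< 256 >;
using REDUCE_EXEC = RAJA::cuda_reduce;

RAJA::ReduceSum< REDUCE_EXEC, int > sum(0);
RAJA::forall< LOOP_EXEC >(
  RAJA::RangeSegment(0, NUM_ITEMS),
  [=] RAJA_HOST_DEVICE(const int& idx) {
    sum += data_1[ idx ];
  } );
int result = static_cast< int >( sum.get() );  // synchronizes with the async launch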

@gzagaris (Member, Author)

Thank you for the detailed explanation and suggestions @MrBurmark!

The RAJA::cuda_exec_occ_calc_async< BLOCK_SIZE > policy sounds fancy! I'll definitely try it out.

  • Can this policy also be used for nested loops with RAJA::kernel, i.e., as the policy in a RAJA::statement::For? Would you recommend that?

On a slight tangent, I was thinking a bit more about this the other day, and I was wondering whether hiding initial overheads like this is a use case that justifies providing methods such as RAJA::initialize(argc, argv) and RAJA::finalize(). Moreover, such methods could provide a good way for users to pass allocators and memory pools from application space for RAJA to use internally. I am not certain whether this has been considered in the past, but I wanted to bring it up for your consideration. A hypothetical usage sketch follows.
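
Purely as an illustration of the idea (these methods do not exist in RAJA today, and the allocator hook is invented for the sketch):

// Hypothetical sketch only: RAJA::initialize/RAJA::finalize are NOT part
// of RAJA; this just illustrates the proposed usage pattern.
int main(int argc, char** argv)
{
  RAJA::initialize(argc, argv);  // hypothetical: set up internal pools, parse options
  // RAJA::setDeviceAllocator(myAllocator);  // hypothetical allocator injection point
  // ... application code using RAJA ...
  RAJA::finalize();  // hypothetical: release internal pools and resources
  return 0;
}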

Thank you again for all your help.

@MrBurmark (Member) commented Oct 23, 2023

You can't use that policy with RAJA::kernel.
There are already occupancy-calculator policies for kernel, but using them is a bit more complicated.
You have to use an occupancy CudaKernel statement for the occupancy calculator to be used when choosing launch parameters such as the number of blocks. In addition, block_direct and global_direct policies should probably be replaced with block_loop or global_loop policies to ensure that the number of blocks required stays below the bounds imposed by the occupancy calculator. For example:

      RAJA::KernelPolicy<
        RAJA::statement::CudaKernelFixedAsync<block_sz,
          RAJA::statement::For<0, RAJA::cuda_global_size_x_direct<block_sz>,
            RAJA::statement::For<1, RAJA::cuda_block_y_direct,
              RAJA::statement::For<2, RAJA::cuda_thread_y_direct,
                RAJA::statement::Lambda<0> > > > > >;

vs

      RAJA::KernelPolicy<
        RAJA::statement::CudaKernelOccAsync<
  // or RAJA::statement::CudaKernelExt<cuda_occ_calc_launch<block_sz, true>,
          RAJA::statement::For<0, RAJA::cuda_global_size_x_loop<block_sz>,
            RAJA::statement::For<1, RAJA::cuda_block_y_loop,
              RAJA::statement::For<2, RAJA::cuda_thread_y_direct,
                RAJA::statement::Lambda<0> > > > > >;
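
For completeness, a minimal sketch of how the second (occupancy-calculator) form might be driven from RAJA::kernel; the block size and extents here are placeholder values chosen for the sketch:

constexpr int block_sz = 256;            // placeholder block size
const int NI = 100, NJ = 100, NK = 100;  // placeholder extents

using OCC_POL = RAJA::KernelPolicy<
  RAJA::statement::CudaKernelOccAsync<
    RAJA::statement::For<0, RAJA::cuda_global_size_x_loop<block_sz>,
      RAJA::statement::For<1, RAJA::cuda_block_y_loop,
        RAJA::statement::For<2, RAJA::cuda_thread_y_direct,
          RAJA::statement::Lambda<0> > > > > >;

RAJA::kernel< OCC_POL >(
  RAJA::make_tuple( RAJA::RangeSegment(0, NI),
                    RAJA::RangeSegment(0, NJ),
                    RAJA::RangeSegment(0, NK) ),
  [=] RAJA_DEVICE (int i, int j, int k) {
    // ... loop body ...
  } );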

@gzagaris (Member, Author)

Thanks @MrBurmark -- it's always complicated with RAJA::kernel 😄

The example you provided is exactly what I was looking for!

@rhornung67 (Member)

@gzagaris several years ago, we considered adding RAJA::initialize and RAJA::finalize methods. Kokkos does that, for example. We didn't really see a strong need at the time and thought RAJA would be more flexible without it. We will reconsider and let you know.

@gzagaris (Member, Author) commented Oct 23, 2023

> @gzagaris several years ago, we considered adding RAJA::initialize and RAJA::finalize methods. Kokkos does that, for example. We didn't really see a strong need at the time and thought RAJA would be more flexible without it. We will reconsider and let you know.

Sounds good, thanks @rhornung67!
