Bench/bowen/desul raja atomics #1624
base: develop
Conversation
dad5812 to 7d0da79
7e171df to c4a9582
```cpp
template<int BLOCK_SZ>
struct ExecPolicyGPU {
  using policy = RAJA::cuda_exec<BLOCK_SZ>;
};
```
This is a synchronous policy.
```cpp
void TimeAtomicOp(int num_iterations = 2, int array_size = 100) {
  RAJA::Timer timer;

  for (int i = 0; i < num_iterations; ++i) {
    // ...
```
It might be good to not time the first run of the kernel. Then time multiple iterations of the loop together (running asynchronously for GPUs) instead of timing each iteration individually.
```cpp
template<int BLOCK_SZ>
struct ExecPolicyGPU {
  using policy = RAJA::cuda_exec<BLOCK_SZ>;
};
```
```diff
- using policy = RAJA::cuda_exec<BLOCK_SZ>;
+ using policy = RAJA::cuda_exec<BLOCK_SZ, true /*asynchronous*/>;
```
By default, RAJA exec policies are synchronous. To make one asynchronous, add a boolean template parameter set to `true`.
```cpp
#include "desul/atomics.hpp"
#include "RAJA/util/Timer.hpp"

#define N 1000000000
```
I think `N` should be a command-line argument selected at run time.
```cpp
// GPU benchmarks
std::cout << "Executing CUDA benchmarks" << std::endl;
std::cout << INDENT << "Executing atomic add benchmarks" << std::endl;
TimeAtomicOp<ExecPolicyGPU<64>::policy, int, AtomicAdd<int, typename GPUAtomic::policy>, true>(4);
```
It may be cleaner and easier to work with to set a constexpr thread block size variable to a default value, such as 256, at the top of the file. I think we agreed in a recent group meeting that thread block size doesn't have a significant performance impact for the simple kernels in this file. @MrBurmark what do you think?
That sounds good to me, then we could vary it fairly easily if we wanted to.
And we could follow the pattern in RAJA Perf if we wanted to try block size sweeps.
I added a `for_each_type` function into RAJA for just these kinds of use cases: https://github.com/LLNL/RAJA/blob/develop/include/RAJA/util/for_each.hpp#L88
Comparison of RAJA's implementation of atomic operations with desul