Add stream operations to accelerator components #12356
base: main
Conversation
Could you please add stub functions for the ZE component?
Force-pushed from 3d87780 to ead2c40
@hppritcha I added stubs returning
Force-pushed from df07a9c to 236af1a
Just a few minor comments/requests, nothing major; looks good overall.
Force-pushed from 1c4d11b to 529a68d
```c
opal_mutex_t opal_accelerator_rocm_stream_lock = {0};
int opal_accelerator_rocm_num_devices = 0;
float *opal_accelerator_rocm_mem_bw = NULL;
hipStream_t *opal_accelerator_rocm_MemcpyStream = NULL;
```
Actually, is this change from `hipStream_t opal_accelerator_rocm_MemcpyStream` to `hipStream_t *opal_accelerator_rocm_MemcpyStream` truly correct? We do not allocate memory anywhere for the actual hipStream_t object.
I made the change because of this line: https://github.com/open-mpi/ompi/pull/12356/files#diff-a83aa406fc5f195f910c83df9369e84a3202b9b10476dac3659fae9b4345453eL217
We shouldn't cast an opaque type to `void*`; that opens the door for all kinds of nasty behavior.
The memory is allocated here: https://github.com/open-mpi/ompi/pull/12356/files#diff-a83aa406fc5f195f910c83df9369e84a3202b9b10476dac3659fae9b4345453eR198
OK, for whatever reason I did not see the malloc, thanks. However, follow-up questions:
a) Why do we need two additional streams, the memcpy_stream() and the opal_accelerator_rocm_alloc_stream? Couldn't we use the opal_accelerator_rocm_MemcpyStream instead (at least the first one sounds like it serves the same purpose)?
b) I cannot locate a hipStreamDestroy for those two additional streams.
Ah yes, I believe this was a relic from when I experimented with memory pools (which ultimately didn't seem to help). Only opal_accelerator_rocm_MemcpyStream should be left now.
Force-pushed from ebad01d to 97dbb09
Hi @devreal. Some high-level questions:
- I don't know the motivation behind the need for stream-ordered allocations for reductions. Can you explain the high-level picture?
- How does an MPI user pass streams to the MPI implementation? Without that ability, I'm not sure how the stream-ordered memmove operations in the accelerator component are taken advantage of.
```c
        return OPAL_ERROR;
    }
    result = cuStreamSynchronize(opal_accelerator_cuda_memcpy_stream);
    result = cuMemcpy((CUdeviceptr) dest, (CUdeviceptr) src, size);
```
@devreal what is the rationale behind replacing an asynchronous memcpy on an asynchronous stream with a blocking copy on the default stream?
That's the important change, I guess. The rationale was that I saw that cuMemcpyAsync has to pin the memory, which is expensive. Since this is not an asynchronous function, I don't see the need to pay that cost. On the other hand, cuMemcpy synchronizes with stream 0 (AFAIU), so that's a different cost to pay...
> cuMemcpyAsync has to pin the memory

This would be the case if the host memory isn't pinned (either using cuMemHostRegister or by allocating through cuMemAllocHost). If you're reusing host buffers, then it would probably be better to pin them once and release the pinnings during the cleanup phase.

> On the other hand, cuMemcpy synchronizes with stream 0 (AFAIU) so that's a different cost to pay...

I would recommend against synchronizing with stream 0.
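The pin-once suggestion could look roughly like the following CUDA driver API sketch (error handling elided; `host_buf`, `memcpy_stream`, and the function names are hypothetical; assumes an initialized context and an already-created stream):

```cuda
#include <cuda.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical names: a reused host staging buffer and a private stream. */
static void *host_buf;
static CUstream memcpy_stream;

void setup(size_t size)
{
    host_buf = malloc(size);
    /* Pin the reused buffer once, instead of paying the pinning cost
     * inside every copy. */
    cuMemHostRegister(host_buf, size, 0);
}

void copy_to_device(CUdeviceptr dst, size_t size)
{
    /* Asynchronous copy on a private stream, not stream 0. */
    cuMemcpyAsync(dst, (CUdeviceptr)(uintptr_t)host_buf, size, memcpy_stream);
    cuStreamSynchronize(memcpy_stream); /* only where completion is required */
}

void teardown(void)
{
    /* Release the pinning during the cleanup phase, as suggested above. */
    cuMemHostUnregister(host_buf);
    free(host_buf);
}
```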
> This would be the case if host memory isn't pinned (either using cuMemHostRegister or by allocating through cuMemAllocHost). If you're reusing host buffers, then it would probably be better to pin them once and release the pinnings during the cleanup phase.

I do use temporary buffers in the collectives. I guess I can register them, since they are reused multiple times. For now, I will revert this change. I guess the same would apply to the hip component.
```c
                       OPAL_PROC_MY_HOSTNAME, result);
        return OPAL_ERROR;
    }
    return OPAL_SUCCESS;
}

static int accelerator_cuda_memmove(int dest_dev_id, int src_dev_id, void *dest, const void *src, size_t size,
                                    opal_accelerator_transfer_type_t type)
static int accelerator_cuda_memmove_async(int dest_dev_id, int src_dev_id, void *dest,
```
Seems like this is called from memmove, which is both allocating and issuing a copy. Is memory being allocated to handle src/dst overlap cases?
Indeed, we allocate memory to handle overlap. I guess there could be a better implementation; we just haven't looked for one because we didn't need it (the function was there before, we just added the async version).
```c
 * See https://forums.developer.nvidia.com/t/memory-clock-rate/107940
 */
float bw = ((float)mem_clock_rate*(float)bus_width*2.0) / 1024 / 1024 / 8;
opal_accelerator_cuda_mem_bw[i] = bw;
```
Is bandwidth here reporting peak intra-device memory bandwidth alone?
Yes, we only care about intra-device bandwidth here. I guess using buffers from multiple devices in reductions would be an interesting optimization.
Force-pushed from 0765a9e to 7de3f68
Force-pushed from 7de3f68 to f3133b4
Force-pushed from f3133b4 to 26185d6
@devreal could you rebase on top of the head of main to pull in the CI ZE compiler sanity check?
- Stream-based alloc and free
- Stream-based memmove
- Wait for stream to complete

Also, enable querying for number of devices and memory bandwidth. These operations are needed for operation device offloading.

Co-authored-by: Phuong Nguyen <phuong.nguyen@icl.utk.edu>
Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>
Force-pushed from 26185d6 to c13779f
First batch of changes from #12318 (offloading of reductions to devices).
This PR adds stream operations to the accelerator components:

- Stream-based alloc and free
- Stream-based memmove
- Wait for stream to complete
- Default stream

Also, enable querying for number of devices and memory bandwidth.

This PR is missing implementations for the `ze` component because I haven't had time to dig into that. Maybe someone familiar with that API can contribute the implementation? Otherwise I will need to find some time in the coming week(s) to implement them myself (the `ze` component didn't exist when I made these changes).