
Add stream operations to accelerator components #12356

Open · wants to merge 1 commit into main from opal_accelerator_stream_ops
Conversation

devreal (Contributor) commented Feb 20, 2024:

First batch of changes from #12318 (offloading of reductions to devices).

This PR adds stream operations to the accelerator components:

  • Default stream
  • Stream-based alloc and free
  • Stream-based memmove
  • Wait for stream to complete

Also, enable querying the number of devices and the per-device memory bandwidth (a rough sketch of the resulting interface is shown below).

This PR is missing implementations for the ze component because I haven't had time to dig into that. Maybe someone familiar with that API can contribute the implementation? Otherwise I will need to find some time in the coming week(s) to implement them myself (the ze component didn't exist when I made these changes).
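
To make the scope concrete, here is a rough sketch of what such a stream-operations interface could look like in the OPAL accelerator module. This is an illustration derived from the bullet list above; the actual type and function names in the PR may differ.

```c
/* Sketch only: names and signatures are illustrative, based on the PR
 * description, not copied from the actual patch. */
typedef struct opal_accelerator_stream_t opal_accelerator_stream_t;

struct opal_accelerator_stream_module_sketch {
    /* Default stream of a device. */
    int (*get_default_stream)(int dev_id, opal_accelerator_stream_t **stream);
    /* Stream-ordered alloc and free. */
    int (*mem_alloc_stream)(int dev_id, void **ptr, size_t size,
                            opal_accelerator_stream_t *stream);
    int (*mem_release_stream)(int dev_id, void *ptr,
                              opal_accelerator_stream_t *stream);
    /* Stream-ordered memmove. */
    int (*mem_move_async)(int dest_dev_id, int src_dev_id, void *dest,
                          const void *src, size_t size,
                          opal_accelerator_stream_t *stream,
                          opal_accelerator_transfer_type_t type);
    /* Block until all work queued on the stream has completed. */
    int (*wait_stream)(opal_accelerator_stream_t *stream);
    /* New queries: device count and per-device memory bandwidth. */
    int (*num_devices)(int *num_devices);
    int (*get_mem_bw)(int device, float *bw);
};
```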

devreal marked this pull request as draft on February 20, 2024 21:27.
hppritcha (Member) commented:
Could you please add stub functions for the ZE component?

devreal (author) commented Feb 20, 2024:

@hppritcha I added stubs returning OPAL_ERR_NOT_IMPLEMENTED to the ze component.
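
For illustration, such a stub might look like the following; the function name is hypothetical, and only the OPAL_ERR_NOT_IMPLEMENTED return value is taken from the comment above.

```c
/* Illustrative sketch of one ZE stub. */
static int accelerator_ze_wait_stream(opal_accelerator_stream_t *stream)
{
    (void) stream; /* unused until a real ZE implementation lands */
    return OPAL_ERR_NOT_IMPLEMENTED;
}
```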

devreal force-pushed the opal_accelerator_stream_ops branch 2 times, most recently from df07a9c to 236af1a on February 20, 2024 22:29.
edgargabriel (Member) left a comment:

Just a few minor comments/requests, nothing major; looks good overall.

Resolved review threads (outdated):

  • opal/mca/accelerator/accelerator.h
  • opal/mca/accelerator/rocm/accelerator_rocm_component.c
  • opal/mca/accelerator/rocm/accelerator_rocm_component.c
  • opal/mca/accelerator/rocm/accelerator_rocm_module.c
devreal force-pushed the opal_accelerator_stream_ops branch 2 times, most recently from 1c4d11b to 529a68d on February 25, 2024 15:33.
opal_mutex_t opal_accelerator_rocm_stream_lock = {0};
int opal_accelerator_rocm_num_devices = 0;
float *opal_accelerator_rocm_mem_bw = NULL;
hipStream_t *opal_accelerator_rocm_MemcpyStream = NULL;
Member:

Actually, is this change from hipStream_t opal_accelerator_rocm_MemcpyStream to hipStream_t *opal_accelerator_rocm_MemcpyStream truly correct? We do not allocate memory anywhere for the actual hipStream_t object.

devreal (author):

I made the change because of this line: https://github.com/open-mpi/ompi/pull/12356/files#diff-a83aa406fc5f195f910c83df9369e84a3202b9b10476dac3659fae9b4345453eL217

We shouldn't cast an opaque type to void*; that opens the door for all kinds of nasty behavior.

The memory is allocated here: https://github.com/open-mpi/ompi/pull/12356/files#diff-a83aa406fc5f195f910c83df9369e84a3202b9b10476dac3659fae9b4345453eR198
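
For readers following along, the pattern under discussion looks roughly like this: the global holds a pointer to the opaque hipStream_t, the handle itself lives in malloc'ed storage, and no cast to void* is needed. This is a sketch assuming the standard HIP API; the helper name create_memcpy_stream is hypothetical.

```c
#include <stdlib.h>
#include <hip/hip_runtime.h>

/* Allocate storage for the opaque handle, then create the stream into it. */
static int create_memcpy_stream(void)
{
    opal_accelerator_rocm_MemcpyStream = malloc(sizeof(hipStream_t));
    if (NULL == opal_accelerator_rocm_MemcpyStream) {
        return OPAL_ERR_OUT_OF_RESOURCE;
    }
    if (hipSuccess != hipStreamCreate(opal_accelerator_rocm_MemcpyStream)) {
        free(opal_accelerator_rocm_MemcpyStream);
        opal_accelerator_rocm_MemcpyStream = NULL;
        return OPAL_ERROR;
    }
    return OPAL_SUCCESS;
}
```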

Member:

OK, for whatever reason I did not see the malloc, thanks. However, follow-up questions:
a) Why do we need the two additional streams, memcpy_stream() and opal_accelerator_rocm_alloc_stream? Couldn't we use opal_accelerator_rocm_MemcpyStream instead (at least the first one sounds like it serves the same purpose)?
b) I cannot locate a hipStreamDestroy for those two additional streams.

devreal (author):

Ahh yes, I believe this was a relic from when I experimented with memory pools (which ultimately didn't seem to help). Only opal_accelerator_rocm_MemcpyStream should be left now.
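
For completeness, a matching teardown sketch pairing the remaining stream with the hipStreamDestroy asked about in (b); the helper name is hypothetical.

```c
/* Hypothetical finalize-path helper: destroy the stream, then free the
 * storage holding the opaque handle. */
static void destroy_memcpy_stream(void)
{
    if (NULL != opal_accelerator_rocm_MemcpyStream) {
        (void) hipStreamDestroy(*opal_accelerator_rocm_MemcpyStream);
        free(opal_accelerator_rocm_MemcpyStream);
        opal_accelerator_rocm_MemcpyStream = NULL;
    }
}
```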

devreal force-pushed the opal_accelerator_stream_ops branch 2 times, most recently from ebad01d to 97dbb09 on February 26, 2024 14:55.
Akshay-Venkatesh (Contributor) left a comment:

Hi @devreal. Some high-level questions:

  1. I don't know the motivation behind the need for stream-ordered allocations for reductions. Can you explain the high-level picture?
  2. How does an MPI user pass streams to the MPI implementation? Without that ability, I'm not sure how the stream-ordered memmove operations in the accelerator component can be taken advantage of.

return OPAL_ERROR;
}
result = cuStreamSynchronize(opal_accelerator_cuda_memcpy_stream);
result = cuMemcpy((CUdeviceptr) dest, (CUdeviceptr) src, size);
Contributor:

@devreal what is the rationale behind replacing an asynchronous memcpy on an asynchronous stream with a blocking copy on the default stream?

devreal (author):

That's the important change, I guess. The rationale was that I saw that cuMemcpyAsync has to pin the memory, which is expensive. Since this is not an asynchronous function, I don't see the need to pay that cost. On the other hand, cuMemcpy synchronizes with stream 0 (AFAIU), so that's a different cost to pay...

Contributor:

> cuMemcpyAsync has to pin the memory

This would be the case if the host memory isn't pinned (either using cuMemHostRegister or by allocating through cuMemAllocHost). If you're reusing host buffers, then it would probably be better to pin them once and release the pinnings during the cleanup phase.

> On the other hand, cuMemcpy synchronizes with stream 0 (AFAIU) so that's a different cost to pay...

I would recommend against synchronizing with stream 0.

devreal (author):

> This would be the case if the host memory isn't pinned (either using cuMemHostRegister or by allocating through cuMemAllocHost). If you're reusing host buffers, then it would probably be better to pin them once and release the pinnings during the cleanup phase.

I do use temporary buffers in the collectives. I guess I can register them, since they are reused multiple times. For now, I will revert this change. I guess the same would apply to the hip component.
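
A minimal sketch of the pin-once approach suggested above, using the CUDA driver API; tmp_buf, tmp_size, and the helper names are hypothetical stand-ins for the collective's reused temporary buffer.

```c
#include <cuda.h>

static void *tmp_buf;   /* reused host staging buffer (hypothetical) */
static size_t tmp_size; /* its size in bytes */

/* Pin the buffer once up front so later cuMemcpyAsync calls on it do
 * not pay the registration cost on every transfer. */
static int pin_tmp_buffer(void)
{
    return (CUDA_SUCCESS == cuMemHostRegister(tmp_buf, tmp_size, 0))
               ? OPAL_SUCCESS : OPAL_ERROR;
}

/* Release the pinning during the cleanup phase. */
static void unpin_tmp_buffer(void)
{
    (void) cuMemHostUnregister(tmp_buf);
}
```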

OPAL_PROC_MY_HOSTNAME, result);
return OPAL_ERROR;
}
return OPAL_SUCCESS;
}

static int accelerator_cuda_memmove(int dest_dev_id, int src_dev_id, void *dest, const void *src, size_t size,
opal_accelerator_transfer_type_t type)
static int accelerator_cuda_memmove_async(int dest_dev_id, int src_dev_id, void *dest,
Contributor:

Seems like this is called from memmove, which is both allocating and issuing a copy. Is memory being allocated to handle src/dst overlap cases?

devreal (author):

Indeed, we allocate memory to handle overlap. I guess there could be a better implementation; we just haven't looked for one because we didn't need it (the function was there before, we just added the async version).
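
The copy-through-a-temporary technique described here, sketched with the CUDA driver API for the device-to-device case only; this is an illustration of the pattern, not the PR's actual code.

```c
#include <cuda.h>

static int memmove_d2d_via_tmp(CUdeviceptr dest, CUdeviceptr src,
                               size_t size, CUstream stream)
{
    CUdeviceptr tmp;
    if (CUDA_SUCCESS != cuMemAlloc(&tmp, size)) {
        return OPAL_ERROR;
    }
    /* Staging through tmp keeps the copy correct even when
     * [src, src+size) and [dest, dest+size) overlap. */
    if (CUDA_SUCCESS != cuMemcpyDtoDAsync(tmp, src, size, stream) ||
        CUDA_SUCCESS != cuMemcpyDtoDAsync(dest, tmp, size, stream) ||
        CUDA_SUCCESS != cuStreamSynchronize(stream)) {
        (void) cuMemFree(tmp);
        return OPAL_ERROR;
    }
    (void) cuMemFree(tmp);
    return OPAL_SUCCESS;
}
```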

* See https://forums.developer.nvidia.com/t/memory-clock-rate/107940
*/
float bw = ((float)mem_clock_rate*(float)bus_width*2.0) / 1024 / 1024 / 8;
opal_accelerator_cuda_mem_bw[i] = bw;
Contributor:

Is the bandwidth here reporting peak intra-device memory bandwidth alone?

devreal (author):

Yes, we only care about intra-device bandwidth here. I guess using buffers from multiple devices in reductions would be an interesting optimization.
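
For reference, here is that formula worked through once, assuming mem_clock_rate is reported in kHz and bus_width in bits (the units CUDA's device attributes use), with illustrative V100-class numbers:

bw = 877000 × 4096 × 2 / 1024 / 1024 / 8 ≈ 856

The factor 2 accounts for double-data-rate transfers, the division by 8 converts bits to bytes, and the result lands near the V100's nominal ~900 GB/s HBM2 bandwidth, so the value is effectively in GB/s.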

devreal force-pushed the opal_accelerator_stream_ops branch 2 times, most recently from 0765a9e to 7de3f68 on April 24, 2024 21:40.
devreal marked this pull request as ready for review on April 24, 2024 21:41.
janjust removed their request for review on April 24, 2024 21:46.
devreal force-pushed the opal_accelerator_stream_ops branch from 7de3f68 to f3133b4 on April 25, 2024 14:15.
devreal force-pushed the opal_accelerator_stream_ops branch 2 times, most recently from f3133b4 to 26185d6 on April 25, 2024 18:23.
hppritcha (Member) commented:
@devreal could you rebase on top of head of main to pull in the CI ZE compiler sanity check?

The commit message reads:

- Stream-based alloc and free
- Stream-based memmove
- Wait for stream to complete

Also, enable querying for number of devices and memory bandwidth.
These operations are needed for operation device offloading.

Co-authored-by: Phuong Nguyen <phuong.nguyen@icl.utk.edu>
Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>
devreal force-pushed the opal_accelerator_stream_ops branch from 26185d6 to c13779f on May 2, 2024 15:40.