[abandoned] Allowing zerocopy: makeAvailable(queue, dst, srcView) #1820

Draft (wants to merge 4 commits into base: develop)

Conversation

@bernhardmgruber (Member) commented Oct 29, 2022

Adds an overload set zero_memcpy that only copies a buffer if the destination device requires the buffer in a different memory space in order to access it. Otherwise, no copy is performed and just the handles are adjusted.

Motivating use case (for me):
Avoid duplicating data if the accelerator device is the same as the host device. This is the case in the standard workflow where we create a buffer on the host, then a buffer on the accelerator device, and then copy from host -> device. If e.g. the TBB or OpenMP backend is active, there is no need for two buffers and an actual memory copy.

This zero copy idea can be extended further to also include various other technologies like CUDA's USM (managed memory), or remote memory access, etc. For now, I do not have the time to work out all those details, so I am just proposing a minimal API to accomplish my motivating use case.

Example randomCells2D is rewritten to use the new API.
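
For illustration, a rough sketch of the intended host-to-device usage (a sketch only; devHost, devAcc, queue, extent and Idx are the usual placeholder names from the examples, not part of this changeset):

// classic workflow: two buffers plus an explicit copy
auto bufHost = alpaka::allocBuf<float, Idx>(devHost, extent);
auto bufAcc = alpaka::allocBuf<float, Idx>(devAcc, extent);
alpaka::memcpy(queue, bufAcc, bufHost);

// proposed workflow: copy only if the queue's device cannot access bufHost's
// memory space; otherwise the returned buffer shares bufHost's memory
auto bufAcc2 = alpaka::zero_memcpy(queue, bufHost);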

Addresses part of: #28

TODO:

  • Unit tests
  • Cheatsheet
  • Documentation?
  • Should this be an experimental feature? (namespace alpaka::experimental)

Comment on lines 221 to 228
BufAcc bufAccS{alpaka::zero_memcpy(queue, bufHostS)};
float* const ptrBufAccS{alpaka::getPtrNative(bufAccS)};
auto pitchBufAccS = alpaka::getPitchBytes<1u>(bufAccS);
alpaka::memcpy(queue, bufAccS, bufHostS);
Member Author:

Buffer creation is moved down to where the memory copies were performed, because allocation/memcpy is now one potential step hidden behind zero_memcpy.

Comment on lines 207 to 211
//! Makes the content of the source view available on the device associated with the destination queue. If the
//! destination shares the same memory space as the source view, no copy is performed and the destination view is
//! updated to share the same buffer as the source view. Otherwise, a memcpy is performed.
template<typename TQueue, typename TViewDst, typename TViewSrc>
ALPAKA_FN_HOST void zero_memcpy(TQueue& dstQueue, TViewDst& viewDstOpt, TViewSrc const& viewSrc)
Member Author:

This version is used when the user already has a destination buffer. Typically, the case when transferring back from the device to the host buffer.
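
A rough sketch of that direction (queue, bufHost and bufAcc are placeholder names, not from the diff):

// copy the results back into the existing host buffer; if host and accelerator
// share the same memory space, only the handles are adjusted
alpaka::zero_memcpy(queue, bufHost, bufAcc);
alpaka::wait(queue); // make sure the (potential) copy has finished before reading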

Member:

From our offline discussion yesterday: IMO TQueue& dstQueue must be the source queue, because you should not return control to the caller before the memory operations on the source device are finished.

Contributor:

I'm thinking of two possible use cases:

  • "transfer" data from host to device
  • "transfer" data from device to host

where device could be any device for which we do not want to make the copy, but take the zero-copy path; it could be from a CPU backend, or a GPU backend with unified memory, etc.

In both cases I would expect to use a queue associated to the device, not to the host, same as for the regular memcpy function.

So, maybe this should simply be TQueue queue ?

Member:

@fwyzard I think it must always be the source queue because the zero-copy operation is instantly finished, so it takes no time, but it is not allowed to use viewDstOpt before all operations on viewSrc are finished. To guard this case you need to wait for the source queue. In case we would introduce a zero_memcpy_async, we would need to return an event that provides the information when we can use viewDstOpt.
Even for host to device we need the source queue (host queue) because the user can run a host-accelerated kernel with viewSrc.

Member:

Mhh, I am a little bit confused: the function is called zero_memcpy but it performs a copy in case the memory is not available on the destination device. So it is NOT a zero memcpy, it is some kind of optimized (intelligent) copy.

In that case I think we need the source and the destination queue, to guard when the operation starts in case it is not zero-copyable and to ensure that follow-up operations on the destination do not start before the possible copy is finished.

IMO the function should not be called zero_memcpy, or I have another definition of zero memcpy in mind.

Member Author:

In both cases I would expect to use a queue associated to the device, not to the host, same as for the regular memcpy function.
So, maybe this should simply be TQueue queue ?

Exactly, so I renamed it yesterday like so, but GitHub is only showing Outdated instead of the actual state of the PR.

it is not allowed to use viewDstOpt before all operations on viewSrc are finished. To guard this case you need to wait for the source queue.

So how do we guard against this in the standard alpaka::memcpy(queue, dstView, srcView)? Isn't it the same problem here? So I wondered what the meaning of this queue actually is, and whether it needs to have any relation to the views. Looking into the implementation, there is actually no requirement on the queue related to the views. For CUDA, the associated devices of the views determine whether the memcpy kind is host-to-device, device-to-device, etc., but I could apparently use a queue on device A and then memcpy between two views on different devices B and C. This may be a design bug, but it leads me to believe that the queue's meaning is to allow the user to sequence operations with respect to other operations. And if the user gets that wrong, then that's the user's fault.

Even for host to device we need the source queue (host queue) because the user can run a host-accelerated kernel with viewSrc.

Could you point me to an example where we actually have a queue to the host device as well (in addition to the accelerator device) and how that would be used? This is the first time I hear of this use.

Mhh, I am a little bit confused: the function is called zero_memcpy but it performs a copy in case the memory is not available on the destination device.

Right, which is why I chose the provocative placeholder name zero_memcpy. We still need to have that discussion on how to name this functionality.

I was tempted yesterday to define zero-copy as: zero_memcpy(queue, dev, srcView) -> optional<DstView> or even omit the queue parameter altogether, because there is no real work necessary in case the view is also accessible on the destination device. However, this is not the API I would want to have as a user and I would immediately need to implement:

if (auto dst = zero_memcpy(queue, dev, srcView))
    return *dst; // zero-copy path
else {
    auto dst = allocBuf(...); // fallback: allocate on the destination
    memcpy(queue, dst, srcView);
    return dst;
}

I think @j-stephan suggested in the past to use make_available and @fwyzard suggested to use convey or smart_memcpy in #28.

In that case I think we need the source and the destination queue, to guard when the operation starts in case it is not zero-copyable and to ensure that follow-up operations on the destination do not start before the possible copy is finished.

Again, it is the user's responsibility to use the queues correctly. If you create two queues and use two views on both of them, you can also create a race with just the normal alpaka::memcpy.

Member:

Could you point me to an example where we actually have a queue to the host device as well (in addition to the accelerator device) and how that would be used? This is the first time I hear of this use.

I would say we do not have such an example, because we do not use accelerators co-located in one application.
As far as I know we are very "dirty" in all our examples and run host code without using an accelerator, because we run the host code sequentially.
Even in PIConGPU we use native OpenMP, where normally an accelerator would be the better way to run parallel host code and to re-use all the device functions we implemented. The problem with PIConGPU is that we use cupla, where we cannot use two accelerators at the same time.
To be fair, this is one of the reasons why we made alpaka, but we never created an example :-(
Our tests use multiple accelerators in one test app, but not cooperatively as far as I know :-(

Member Author:

As far as I know we are very "dirty" in all our examples and run host code without using an accelerator, because we run the host code sequentially.

I don't think this is dirty. Alpaka should just coexist with ordinary C++. If I use a std::for_each(std::execution::par, ...) or a std::jthread next to an alpaka kernel call, this should be allowed (given that I also get all my synchronization right).
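
For instance, something along these lines should be fine (a sketch only; Acc, queue, workDiv, Kernel, bufAcc and hostData are placeholders):

// needs <algorithm> and <execution>
std::for_each(std::execution::par, hostData.begin(), hostData.end(), [](float& x) { x *= 2.0f; }); // plain C++ host parallelism
alpaka::exec<Acc>(queue, workDiv, Kernel{}, alpaka::getPtrNative(bufAcc)); // alpaka kernel on the accelerator
alpaka::wait(queue); // the synchronization I have to get right myself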

Our tests use multiple accelerators in one test app, but not cooperatively as far as I know :-(

Yeah, I think we should have a test/example for that, see #1909.

Member:

Yes, it should co-exist with "normal" C++. Nevertheless, all our examples assume we do not need parallelism on the host side, and we simply use a sequential for loop, e.g. to evaluate results, instead of a kernel that could do it in parallel.

Comment on lines 231 to 235
//! Makes the content of the source view available on the device associated with the destination queue. If the
//! destination shares the same memory space as the source view, no copy is performed and the source view is
//! returned. Otherwise
template<typename TQueue, typename TViewSrc>
ALPAKA_FN_HOST auto zero_memcpy(TQueue& dstQueue, TViewSrc const& viewSrc)
Member Author:

This version is used when the user does not have a destination buffer yet. Typically, the case when transferring from the host to a (potentially newly allocated) device buffer.

Member Author:

And I had to add another overload so the user could specify the destination device, because this is not necessarily the same as the queue device.
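
A rough sketch of that overload (devDst is a placeholder for whatever the destination device is):

// destination device given explicitly, e.g. when the queue belongs to a different device
auto bufDst = alpaka::zero_memcpy(queue, devDst, bufHost);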

return viewSrc;
}

auto dst = allocAsyncBufIfSupported<Elem<TViewSrc>, Idx<TViewSrc>>(dstQueue, getExtentVec(viewSrc));
Member Author:

What strategy should I use to allocate a new buffer here?

@bernhardmgruber (Member Author):

For now I implemented the new routines for buffers. I wondered whether it makes any sense to allow more generic views as well. I guess not, because views have no device associated with them?

namespace detail
{
template<typename DevDst, typename DevSrc>
auto canZeroCopy(const DevDst& devDst, const DevSrc& devSrc) -> bool
Contributor:

can this be constexpr ?

Member Author:

Maybe. I chose to be conservative for now in case we ever need to call a runtime function to figure out whether we can zero-copy between two devices.
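
For example (plain CUDA runtime API, just to illustrate the kind of runtime query I have in mind, not code from this PR), peer access between two GPUs can only be determined at runtime:

#include <cuda_runtime.h>

// true if 'dev' can directly access memory allocated on 'peerDev'
bool canAccessPeerMemory(int dev, int peerDev)
{
    int canAccess = 0;
    cudaDeviceCanAccessPeer(&canAccess, dev, peerDev);
    return canAccess != 0;
}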

Contributor:

Understood.

By the way, what should be the behaviour of zero_memcpy(queue, bufOnSomeGpu, bufOnOtherGPu); ?

Make a copy, or let SomeGpu access the buffer from OtherGpu memory through nvlink/PCIe/etc. ?

Member Author:

That is a very good question and I don't know what's best. I added it to #28.

Another reviewer:

I would prefer the second version as it makes use of (fast) hardware capabilities. Now the question to me is: can and should we decide automatically whether this is possible, or should we let the users declare what they want?

Member Author:

So, I renamed the feature to makeAvailable now, which is what its intent is. So makeAvailable(queue, bufOnSomeGpu, bufOnOtherGPu) should make the memory available on another GPU. If zero-copying is possible by any chance (mapping memory through PCIe, nvlink, etc.), it should be done. If none of that is possible, a full copy is performed.

If the user intends to have a full copy anyway, they should just say so and write memcpy(queue, bufOnSomeGpu, bufOnOtherGPu). In that sense, we should decide this automatically. But that decision may depend on the user having declared the involved buffers in a certain way.
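
In other words (a sketch, reusing the placeholder buffer names from above):

alpaka::makeAvailable(queue, bufOnSomeGpu, bufOnOtherGPu); // zero-copy if the hardware allows it, otherwise a real copy
alpaka::memcpy(queue, bufOnSomeGpu, bufOnOtherGPu); // always a real copy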

If the user intended to use a buffer over nvlink between 2 CUDA GPUs (a very explicit use), then they don't need to use alpaka and can just call the CUDA API directly.

@psychocoderHPC (Member):

After the discussion in #1820 (comment) I think this method can be useful, but it is very dangerous for the user. There are two cases:

  • the buffer is visible on the other device, e.g. via unified memory or managed memory. In this case an operation on the destination memory will also change the values in the source
  • in case a deep copy of the source buffer is created, changes must be explicitly copied back to the source after the destination is changed.

As mentioned in #1820 (comment) I think this behavior must be reflected in the name of the "copy" function.

@psychocoderHPC (Member) commented Jan 25, 2023

I think the function canZeroCopy should stay, because it tells you whether you need to copy the data or whether it is directly visible. But instead of the function name zero_memcpy I suggest moveBuffer() (camel case, because I think all names in alpaka are currently camel case). The only problem I see is that move somehow implies that the source buffer should not be used anymore.

Maybe the function should be called makeDataAvailable()?

@bernhardmgruber changed the title from "Zerocopy" to "Allowing zerocopy: makeAvailable(queue, dst, srcView)" on Jan 25, 2023
Adds an overload set `makeAvailable` that only copies a buffer if the destination device requires the buffer in a different memory space. Otherwise, no copy is performed and just the handles are adjusted.

Fixes: alpaka-group#28
typename TViewSrc,
std::enable_if_t<isDevice<TDevDst>, int> = 0,
typename TViewDst = Buf<TDevDst, Elem<TViewSrc>, Dim<TViewSrc>, Idx<TViewSrc>>>
ALPAKA_FN_HOST auto makeAvailable(TQueue& queue, TDevDst const& dstDev, TViewSrc const& viewSrc) -> TViewDst
Member:

Could you please extend the documentation and point to the fact that the user must take care that all operations on viewSrc are finished.
This was one point where I was always confused in this PR, but now I understand that the user must link the queues to avoid data races.

alpaka::enqueue(queueWhereViewSrcIsUsed, event);
alpaka::wait(destQueue, event);
auto newBuff = alpaka::makeAvailable(destQueue, destDev, viewSrc);

Member:

@bernhardmgruber dstDev is available from the queue parameter, is this extra parameter really required?

Member Author:

Could you please extend the documentation and point to the fact that the user must take care that all operations on viewSrc are finished.

If you really insist, I can do it. Btw, it is also not documented for alpaka::memcpy, and I think it is kind of obvious. It's also not true; you can have pending operations on the source view:

auto bufHost = alpaka::allocBuf<int, Idx>(devHost, extents);
auto bufAcc = alpaka::makeAvailable(queue, devAcc, bufHost);
alpaka::exec<Acc>(queue, workdiv, Kernel{}, alpaka::getPtrNative(bufAcc));
alpaka::makeAvailable(queue, bufHost, bufAcc); // pending kernel execution on bufAcc

Member Author:

dstDev is available from the queue parameter, is this extra parameter really required?

yes, because the device of the queue is not always the destination device. See example above, where the destination is the host, but the queue's device is the accelerator.

@bernhardmgruber (Member Author):

After the discussion in #1820 (comment) I think this method can be useful, but it is very dangerous for the user. There are two cases:

  • the buffer is visible on the other device, e.g. via unified memory or managed memory. In this case an operation on the destination memory will also change the values in the source
  • in case a deep copy of the source buffer is created, changes must be explicitly copied back to the source after the destination is changed.

I don't think this function is dangerous. It's just that we don't guarantee what the state of the source buffer is after makeAvailable. It can be disconnected from the destination or not, and you don't know. In fact, I think you don't need to know. But we could provide that extra information via an additional return value if necessary.

I think this behavior must be reflected in the name of the "copy" function.

Maybe the function should be called makeDataAvailable()?

[...] instead of the function name zero_memcpy I suggest moveBuffer()

Are you fine with the proposed makeAvailable?

I think the function canZeroCopy should stay, because it tells you whether you need to copy the data or whether it is directly visible

This is an internal function and I think it's bad, because you actually need a mix of compile-time and runtime information to get the data types of the buffers right, and canZeroCopy does not handle that ATM. See also #28 (comment).

The only problem I see is that move somehow implies that the source buffer should not be used anymore.

Which is exactly what I want.

@bernhardmgruber marked this pull request as draft on July 31, 2023, 14:35
@bernhardmgruber changed the title to "[abandoned] Allowing zerocopy: makeAvailable(queue, dst, srcView)" on Mar 2, 2024
@bernhardmgruber (Member Author) commented Mar 2, 2024

I am abandoning work on this. Feel free to use or discard anything from the changeset provided here.
