Refactor memcpy_async for easier extensions. #348

griwes · 2023-08-16T23:27:24Z

Also make the memcpy_async tests slightly more robust.

Description

Closes #57

Checklist

New or existing tests cover these changes.
The documentation is up to date with these changes.

miscco

First pass review

libcudacxx/include/cuda/std/detail/libcxx/include/__config

libcudacxx/include/cuda/pipeline

libcudacxx/include/cuda/std/detail/libcxx/include/__cuda/barrier.h

libcudacxx/include/cuda/std/detail/libcxx/include/__cuda/memcpy_async.h

miscco · 2023-08-21T09:33:13Z

libcudacxx/include/cuda/std/detail/libcxx/include/__cuda/barrier.h

    _LIBCUDACXX_DEVICE
-    static async_contract_fulfillment __synchronize(__arch::__cuda<80>, barrier<_Sco, _CompF> &, async_contract_fulfillment __acf) {
+    static async_contract_fulfillment __synchronize(__arch::__cuda<80>, barrier<_Sco, _CompF> &, async_contract_fulfillment __acf, _Empty...) {
        if (__acf == async_contract_fulfillment::async) {


Should we be defensive and add static_assert(sizeof...(_Empty) == 0, "Should not be called with additional arguments"); here?
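
A minimal sketch of the defensive check being proposed. The names are hypothetical: synchronize_sketch and its int argument stand in for the PR's __synchronize and async_contract_fulfillment. It assumes the trailing pack is never meant to receive arguments, which the thread does not confirm.

```cpp
// Hypothetical stand-in for the PR's __synchronize overload. The unnamed
// trailing pack is assumed to exist only for overload selection, so any
// call that actually supplies extra arguments is rejected at compile time.
template <typename... Empty>
constexpr int synchronize_sketch(int acf, Empty...) {
    static_assert(sizeof...(Empty) == 0,
                  "Should not be called with additional arguments");
    return acf;
}
```

With this in place, synchronize_sketch(3) compiles, while synchronize_sketch(3, 4) fails with the assertion message instead of silently ignoring the extra argument.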

miscco · 2023-08-21T09:34:54Z

libcudacxx/.upstream-tests/test/support/overrun_guard.h

+
+#include <cuda/std/type_traits>
+
+template<typename T>


I would love some comments on when we need this

At some point during this work I managed to write over some bytes beyond the variable, which results in an endless hang if you happen to overwrite the barrier; fun times. Hopefully this will catch the more likely cases of off-by-ones (that's what I did: I didn't subtract 1 from the first set bit index when turning that index into an actual alignment value) with a reasonable assert message instead of a hang in arrive_and_wait.

Oh I meant in the file for future reference ;)
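
For future readers, the idea described above can be sketched roughly as follows. This is not the contents of overrun_guard.h, which is not shown in this thread; every name here is hypothetical. The sketch places canary bytes after a value and checks that they still hold their fill pattern.

```cpp
#include <cstddef>

// Hypothetical sketch only: a value followed by canary bytes. A write that
// runs past the end of `value` is likely to land in `canary`, where
// intact() can detect it and a test can report it with a clear message.
template <typename T, std::size_t GuardBytes = 16>
struct overrun_guard_sketch {
    static constexpr unsigned char fill = 0xA5;

    T value{};
    unsigned char canary[GuardBytes] = {};

    constexpr overrun_guard_sketch() {
        for (std::size_t i = 0; i < GuardBytes; ++i) canary[i] = fill;
    }

    // True while every canary byte still holds the fill pattern.
    constexpr bool intact() const {
        for (std::size_t i = 0; i < GuardBytes; ++i) {
            if (canary[i] != fill) return false;
        }
        return true;
    }
};

// Simulates a stray write into the guard region and reports whether the
// guard noticed it.
constexpr bool guard_detects_overwrite() {
    overrun_guard_sketch<int> guard;
    const bool before = guard.intact();
    guard.canary[0] = 0;
    return before && !guard.intact();
}
```

A test would assert on intact() after the copy completes, so an off-by-one shows up as an assertion failure rather than a hang.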

libcudacxx/.upstream-tests/test/support/cuda_space_selector.h

miscco · 2023-08-21T09:39:23Z

libcudacxx/include/cuda/pipeline

-
-        return __memcpy_async<alignof(_Type)>(__group, reinterpret_cast<char *>(__destination), reinterpret_cast<char const *>(__source), __size, __pipeline);
-    }
+    template<thread_scope _Sco, __tx_api _Tx, typename _Arch, __space _OutSpace, __space _InSpace, __space _SyncSpace>


Nitpick: I believe we generally trend towards using class instead of typename.

I really dislike class, because it's the less semantically accurate of the two spellings (an argument could be an int, and int is definitely not a class) - but we have both throughout the library, so we should probably settle on a policy and do a library-wide unification if we want to.

miscco · 2023-08-21T09:40:23Z

libcudacxx/include/cuda/pipeline

+        _OutSpace, _InSpace, _SyncSpace
+    > {
+        __host__ __device__
+        static async_contract_fulfillment __synchronize(_Arch, pipeline<_Sco> &, async_contract_fulfillment __acf) {


We might need to macroize async_contract_fulfillment

At the same time, if a user overwrites this with a macro they kind of deserve what they get

Hmm? This is a name defined in the public API of this header, all normal rules apply here.

libcudacxx/include/cuda/std/detail/libcxx/include/__cuda/barrier.h

miscco · 2023-08-21T09:47:29Z

libcudacxx/include/cuda/std/detail/libcxx/include/__cuda/barrier.h

@@ -557,244 +543,103 @@ _LIBCUDACXX_END_NAMESPACE_CUDA
 _LIBCUDACXX_BEGIN_NAMESPACE_CUDA_DEVICE

 _LIBCUDACXX_DEVICE
-inline _CUDA_VSTD::uint64_t * barrier_native_handle(barrier<thread_scope_block> & b) {
-    return reinterpret_cast<_CUDA_VSTD::uint64_t *>(&b.__barrier);
+inline _CUDA_VSTD::uint64_t * barrier_native_handle(barrier<thread_scope_block> & __b) {


Can we move this rename into a separate bugfix PR to reduce the noise?

We can, though that's going to create more issue/PR noise :P

miscco · 2023-08-21T09:54:43Z

libcudacxx/include/cuda/std/detail/libcxx/include/__cuda/memcpy_async.h

+_LIBCUDACXX_DEVICE
+bool __is_grid_constant(const void * __p) {
+#ifdef _LIBCUDACXX_CUDACC_BELOW_11_7
+    return false;


Suggested change

-    return false;
+    (void)__p; return false;
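
The point of the suggestion is that __p is unused on the early-return path, which can trigger unused-parameter warnings. A small sketch, with a hypothetical macro standing in for _LIBCUDACXX_CUDACC_BELOW_11_7 and a placeholder on the other branch, since the real query is not shown in this hunk:

```cpp
// Hypothetical switch standing in for _LIBCUDACXX_CUDACC_BELOW_11_7.
#define SKETCH_TOOLKIT_BELOW_11_7 1

constexpr bool is_grid_constant_sketch(const void* p) {
#if SKETCH_TOOLKIT_BELOW_11_7
    (void)p;  // parameter is unused on this path; the cast silences the warning
    return false;
#else
    return p != nullptr;  // placeholder only, not the real implementation
#endif
}
```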

miscco · 2023-08-21T09:59:36Z

libcudacxx/include/cuda/std/detail/libcxx/include/__cuda/memcpy_async.h

+_LIBCUDACXX_INLINE_VISIBILITY async_contract_fulfillment __dispatch_alignment_bit(_Fn && __f, _CUDA_VSTD::size_t __alignment_fsb) {
+    const _CUDA_VSTD::size_t __alignment_v = 1ull << (__alignment_fsb - 1);
+
+    if (__builtin_expect(__alignment_v >= _MaxInterestingAlignment, true)) {


This is going to be so much fun to port to the various supported CTKs / host compilers.

I believe we will need a macro so that we can keep handling configurations that do not know about the builtin.

Everything currently in the CI matrix works fine with this. We need to run this on Windows though; it's ~~possible~~ likely that MSVC doesn't like it.
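
One way such a macro could look, as a sketch under the assumption that only GCC-compatible compilers provide __builtin_expect. SKETCH_LIKELY and pick_path are hypothetical names, not identifiers from the library:

```cpp
// Hypothetical wrapper: use the builtin where it is known to exist and
// fall back to the bare condition elsewhere (for example on MSVC).
#if defined(__GNUC__) || defined(__clang__)
#  define SKETCH_LIKELY(cond) __builtin_expect(!!(cond), 1)
#else
#  define SKETCH_LIKELY(cond) (!!(cond))
#endif

// Returns 1 for the expected "alignment is large enough" path, else 0.
constexpr int pick_path(unsigned long long alignment,
                        unsigned long long max_interesting) {
    return SKETCH_LIKELY(alignment >= max_interesting) ? 1 : 0;
}
```

The hint only affects code layout; both branches behave the same with or without the builtin.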

miscco · 2023-08-21T11:00:21Z

libcudacxx/include/cuda/std/detail/libcxx/include/__cuda/memcpy_async.h

+        NV_PROVIDES_SM_70,
+        (return _CUDA_VSTD::forward<_Fn>(__f)(__arch::__cuda<70>());),
+        NV_IS_HOST,
+        (return _CUDA_VSTD::forward<_Fn>(__f)(__arch::__host());))


Could you move the closing brace to a separate line? It makes this easier to parse.

wmaxey · 2023-08-21T22:17:18Z

libcudacxx/include/cuda/std/detail/libcxx/include/__cuda/memcpy_async.h

+    NV_IF_ELSE_TARGET(
+        NV_IS_DEVICE,
+        (return __ffsll(__val);),
+        (return _CUDA_VSTD::__libcpp_ctz(__val) + 1;)


I doubt it would be more efficient, but countr_zero will use the right intrinsic depending on the context, possibly even using the host's builtin for constexpr evaluation.
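
For reference, the relationship between the two spellings in the hunk above, and the off-by-one it invites, can be sketched in portable C++. The functions below are hypothetical stand-ins written as plain loops; they are not the library's countr_zero or CUDA's __ffsll.

```cpp
#include <cstddef>
#include <cstdint>

// Stand-in for __ffsll: 1-based index of the lowest set bit, 0 if none.
constexpr int first_set_bit(std::uint64_t value) {
    if (value == 0) return 0;
    int index = 1;
    while ((value & 1u) == 0) {
        value >>= 1;
        ++index;
    }
    return index;
}

// Stand-in for a count-trailing-zeros call; only used for nonzero input here.
constexpr int trailing_zeros(std::uint64_t value) {
    return first_set_bit(value) - 1;
}

// Alignment implied by a 1-based first-set-bit index. The "- 1" is the
// step that is easy to forget.
constexpr std::size_t alignment_from_fsb(int fsb) {
    return std::size_t{1} << (fsb - 1);
}
```

For a nonzero input the first-set-bit index is always the trailing-zero count plus one, which is why the host branch in the diff adds 1.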

miscco

I was looking through PRs and realized I did not send the review 🤦‍♂️

miscco · 2023-09-11T12:00:51Z

libcudacxx/include/cuda/std/detail/libcxx/include/__cuda/memcpy_async.h

+struct __single_thread_group {
+    _LIBCUDACXX_INLINE_VISIBILITY
+    void sync() const {}
+    _LIBCUDACXX_INLINE_VISIBILITY


I believe we would want to add nodiscard here and elsewhere

Suggested change

-    _LIBCUDACXX_INLINE_VISIBILITY
+    _LIBCUDACXX_NODISCARD_ATTRIBUTE _LIBCUDACXX_INLINE_VISIBILITY
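
A sketch of what the attribute buys, using the standard C++17 spelling instead of the library macro. The size() and thread_rank() members are assumptions for illustration; the hunk only shows sync().

```cpp
// Hypothetical mirror of __single_thread_group. Discarding the result of a
// [[nodiscard]] member produces a compiler warning, which catches calls
// whose return value was meant to be used.
struct single_thread_group_sketch {
    void sync() const {}

    [[nodiscard]] constexpr unsigned size() const { return 1; }
    [[nodiscard]] constexpr unsigned thread_rank() const { return 0; }
};
```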

miscco · 2023-09-11T12:02:06Z

libcudacxx/include/cuda/std/detail/libcxx/include/__cuda/memcpy_async.h

+
+template<typename _Tag, _CUDA_VSTD::size_t _Value>
+struct __down_convertible_constant<_Tag, _Value, _CUDA_VSTD::__enable_if_t<_Tag::__min == _Value>> {
+


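Suggested change

template<typename _Tag, _CUDA_VSTD::size_t _Value>
struct __down_convertible_constant<_Tag, _Value, _CUDA_VSTD::__enable_if_t<_Tag::__min == _Value>> {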

miscco · 2023-09-11T12:03:50Z

libcudacxx/include/cuda/std/detail/libcxx/include/__cuda/memcpy_async.h

+    typename = void>
+struct __memcpy_async_invoke_if_applicable {
+    template<typename _Fn>
+    _LIBCUDACXX_INLINE_VISIBILITY


ditto: nodiscard

miscco · 2023-09-11T12:05:26Z

libcudacxx/include/cuda/std/detail/libcxx/include/__cuda/memcpy_async.h

+    template<typename _Fn>
+    _LIBCUDACXX_INLINE_VISIBILITY
+    static async_contract_fulfillment __invoke(_Fn && __f) {
+        return _CUDA_VSTD::forward<_Fn>(__f)(__alignment<_Alignment>());


question: The MSVC folks started using {} for construction to distinguish it from function calls. I am sympathetic to that approach. Could we start doing so?
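
A sketch of the brace style in question, with hypothetical names standing in for __alignment and __invoke. The behavior is identical to the parenthesized form; the braces only make it obvious that a tag object is being constructed.

```cpp
#include <cstddef>
#include <utility>

// Hypothetical tag type standing in for __alignment<_Alignment>.
template <std::size_t Value>
struct alignment_tag {};

// Example callable that reads the alignment back out of the tag type.
struct report_alignment {
    template <std::size_t Value>
    constexpr std::size_t operator()(alignment_tag<Value>) const {
        return Value;
    }
};

template <std::size_t Alignment, typename Fn>
constexpr std::size_t invoke_with_alignment(Fn&& f) {
    // alignment_tag<Alignment>{} cannot be mistaken for a function call,
    // unlike alignment_tag<Alignment>().
    return std::forward<Fn>(f)(alignment_tag<Alignment>{});
}
```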

miscco · 2023-09-11T12:07:04Z

libcudacxx/include/cuda/std/detail/libcxx/include/__cuda/memcpy_async.h

+    template<typename _Fn>
+    _LIBCUDACXX_INLINE_VISIBILITY
+    static async_contract_fulfillment __invoke(_Fn && __f) {
+        _LIBCUDACXX_UNREACHABLE();


I am slightly worried that there are compilers that will scream at us about a missing return value, but I am not sure how to properly guard against that. I guess we will see.

miscco · 2023-09-11T12:10:10Z

libcudacxx/include/cuda/std/detail/libcxx/include/__cuda/memcpy_async.h

+        }
+    }
+
+    switch (__alignment_fsb) {


I am wondering whether a type alias would allow us to avoid the macro altogether:

template <cuda::std::size_t _Value> using __memcpy_async_invoke_if_alignment = ....

miscco · 2023-09-11T12:11:02Z

libcudacxx/include/cuda/std/detail/libcxx/include/__cuda/memcpy_async.h

+        _ADD_CASE(1);
+
+#undef _ADD_CASE
+    }


Could we move the unreachable into the default case to silence potentially stupid compilers?
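
Roughly what that would look like, as a sketch. SKETCH_UNREACHABLE is a hypothetical macro standing in for _LIBCUDACXX_UNREACHABLE, and the three cases are illustrative rather than the PR's actual case list.

```cpp
#include <cstdlib>

// Hypothetical macro; the fallbacks are assumptions about reasonable
// per-compiler spellings, not the library's definition.
#if defined(__GNUC__) || defined(__clang__)
#  define SKETCH_UNREACHABLE() __builtin_unreachable()
#elif defined(_MSC_VER)
#  define SKETCH_UNREACHABLE() __assume(0)
#else
#  define SKETCH_UNREACHABLE() std::abort()
#endif

// Maps a 1-based first-set-bit index to an alignment for a few cases.
constexpr unsigned alignment_for_fsb(unsigned fsb) {
    switch (fsb) {
        case 1: return 1;
        case 2: return 2;
        case 3: return 4;
        default:
            // Callers are assumed to pass only 1..3. Putting the marker in
            // the default case means no path falls off the end of the
            // function, so there is nothing for a missing-return
            // diagnostic to flag.
            SKETCH_UNREACHABLE();
    }
}
```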

miscco · 2023-09-11T12:13:27Z

libcudacxx/include/cuda/std/detail/libcxx/include/__cuda/memcpy_async.h

+template<typename _Tp>
+struct __dependent_false : std::false_type {};
+
+template<typename _Hooks, typename _Size, _CUDA_VSTD::size_t _NativeAlignment, typename = void>


Above there is a trapping fake implementation. Here we use a static assert. Is there a reason for the difference?
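
For context, the static-assert flavor relies on the dependent-false idiom: the assertion only fires if the primary template is actually instantiated. A sketch with hypothetical names (copy_impl_sketch and supported_hooks are illustrative and not part of the PR):

```cpp
#include <type_traits>

template <typename T>
struct dependent_false : std::false_type {};

// Primary template: instantiating it is a compile-time error with a
// readable message, rather than a trap at run time.
template <typename Hooks, typename = void>
struct copy_impl_sketch {
    static_assert(dependent_false<Hooks>::value,
                  "no memcpy_async implementation for these hooks");
};

// Example hooks type that opts in by exposing a nested `enabled` alias.
struct supported_hooks {
    using enabled = void;
};

// Specialization chosen whenever Hooks::enabled names void.
template <typename Hooks>
struct copy_impl_sketch<Hooks, typename Hooks::enabled> {
    static constexpr bool available = true;
};
```

The trade-off between the two approaches is when the failure surfaces: the static assert rejects unsupported combinations at compile time, while a trapping implementation still compiles and fails only if the path is reached.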

miscco · 2023-09-11T12:15:08Z

libcudacxx/include/cuda/std/detail/libcxx/include/__cuda/memcpy_async.h

+using __cuda = __down_convertible_constant<__cuda_tag, _ProvidedSM>;
+
+template<typename _Tp, _CUDA_VSTD::size_t _RequestedSM>
+struct __is_cuda_provides_sm : _CUDA_VSTD::false_type {


I really dislike the "is" in the type name.

jrhemstad · 2024-05-13T20:41:49Z

I think we can just go ahead and close this PR for now. It's not likely to be revived any time soon.

@griwes do you agree?

griwes added the feature request and libcu++ labels Aug 16, 2023

griwes force-pushed the refactor-memcpy-async branch 3 times, most recently from 0e5cd59 to c8c9b9d Compare August 17, 2023 00:57

Refactor memcpy_async for easier extensions.

f9dd2d3

Also make the memcpy_async tests slightly more robust.

griwes force-pushed the refactor-memcpy-async branch from c8c9b9d to f9dd2d3 Compare August 17, 2023 05:11

miscco reviewed Aug 17, 2023

griwes added 6 commits August 19, 2023 20:16

Compilation fixes.

a04b535

Oooops, forgot to offset the input pointers for cp.async...

2f3dee2

SM80 fixes, drive-by improvements.

6706c9f

Significant one is reworking barrier<thread_scope_thread>, because I noticed that it was starting to rot (it didn't get all the new try_wait and wait_parity APIs that were added to the block version).

Merge remote-tracking branch 'origin/main' into refactor-memcpy-async

27ee731

Resolve initial review comments.

56ac0e7

Fix a missed macro rename.

e1cd20e

griwes force-pushed the refactor-memcpy-async branch from 7a74ae6 to e1cd20e Compare August 21, 2023 09:16

griwes marked this pull request as ready for review August 21, 2023 09:49

griwes requested review from a team as code owners August 21, 2023 09:49

griwes requested review from ericniebler and alliepiper and removed request for a team August 21, 2023 09:49

miscco reviewed Aug 21, 2023

Don't widen when aligned_size_t is used, add missing inlines.

100c392

wmaxey reviewed Aug 21, 2023

griwes added 4 commits August 24, 2023 00:45

Fix an overload resolution snafu.

41231d4

Fix the order of testing the memory spaces to fix SM90.

556331e

Also fix the order of the names of the template parameters of __are_memcpy_async_hooks_specialized, and uglify an identifier I missed before.

Import changes to the framework necessary for 1d tma.

a1c489f

Make generated sync shorter even when spaces are not statically known.

f4a413d

griwes force-pushed the refactor-memcpy-async branch from aa89fbf to f4a413d Compare September 8, 2023 22:27

miscco reviewed May 13, 2024
