-
Notifications
You must be signed in to change notification settings - Fork 112
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor memcpy_async for easier extensions. #348
base: main
Are you sure you want to change the base?
Conversation
0e5cd59
to
c8c9b9d
Compare
Also make the memcpy_async tests slightly more robust.
c8c9b9d
to
f9dd2d3
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First pass review
libcudacxx/include/cuda/std/detail/libcxx/include/__cuda/barrier.h
Outdated
Show resolved
Hide resolved
libcudacxx/include/cuda/std/detail/libcxx/include/__cuda/barrier.h
Outdated
Show resolved
Hide resolved
libcudacxx/include/cuda/std/detail/libcxx/include/__cuda/memcpy_async.h
Outdated
Show resolved
Hide resolved
libcudacxx/include/cuda/std/detail/libcxx/include/__cuda/memcpy_async.h
Outdated
Show resolved
Hide resolved
libcudacxx/include/cuda/std/detail/libcxx/include/__cuda/memcpy_async.h
Outdated
Show resolved
Hide resolved
libcudacxx/include/cuda/std/detail/libcxx/include/__cuda/memcpy_async.h
Outdated
Show resolved
Hide resolved
Significant one is reworking barrier<thread_scope_thread>, because I noticed that it was starting to rot (it didn't get all the new try_wait and wait_parity APIs that were added to the block version).
7a74ae6
to
e1cd20e
Compare
_LIBCUDACXX_DEVICE | ||
static async_contract_fulfillment __synchronize(__arch::__cuda<80>, barrier<_Sco, _CompF> &, async_contract_fulfillment __acf) { | ||
static async_contract_fulfillment __synchronize(__arch::__cuda<80>, barrier<_Sco, _CompF> &, async_contract_fulfillment __acf, _Empty...) { | ||
if (__acf == async_contract_fulfillment::async) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we be defensive and add static_assert(sizeof...(_Empty) == 0, "Should not be called with additional arguments");
|
||
#include <cuda/std/type_traits> | ||
|
||
template<typename T> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would love some comments on when we need this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At some point during this work I managed to do a dumb and write over some bytes beyond the variable, which results in a silly behavior of an endless hang if you happen to overwrite the barrier; fun times. Hopefully this will catch the more likely cases of off-by-ones (that's what I did, didn't subtract 1 from the first set bit index when turning that index into an actual alignment value) with a reasonable assert message instead of a hang in arrive_and_wait
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh I meant in the file for future reference ;)
libcudacxx/include/cuda/pipeline
Outdated
|
||
return __memcpy_async<alignof(_Type)>(__group, reinterpret_cast<char *>(__destination), reinterpret_cast<char const *>(__source), __size, __pipeline); | ||
} | ||
template<thread_scope _Sco, __tx_api _Tx, typename _Arch, __space _OutSpace, __space _InSpace, __space _SyncSpace> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nitpick: I believe we generally trend towards using class
instead of typename
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really dislike class
, because it's the less semantically accurate of the two spellings (an argument could be an int
, and int
is definitely not a class) - but we have both throughout the library, so we should probably settle on a policy and do a library-wide unification if we want to.
libcudacxx/include/cuda/pipeline
Outdated
_OutSpace, _InSpace, _SyncSpace | ||
> { | ||
__host__ __device__ | ||
static async_contract_fulfillment __synchronize(_Arch, pipeline<_Sco> &, async_contract_fulfillment __acf) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We might need to macroize async_contract_fulfillment
At the same time, if a user overwrites this with a macro they kind of deserve what they get
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm? This is a name defined in the public API of this header, all normal rules apply here.
@@ -557,244 +543,103 @@ _LIBCUDACXX_END_NAMESPACE_CUDA | |||
_LIBCUDACXX_BEGIN_NAMESPACE_CUDA_DEVICE | |||
|
|||
_LIBCUDACXX_DEVICE | |||
inline _CUDA_VSTD::uint64_t * barrier_native_handle(barrier<thread_scope_block> & b) { | |||
return reinterpret_cast<_CUDA_VSTD::uint64_t *>(&b.__barrier); | |||
inline _CUDA_VSTD::uint64_t * barrier_native_handle(barrier<thread_scope_block> & __b) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we move this rename into a separate bugfix PR to reduce the noise?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can, though that's going to create more issue/PR noise :P
_LIBCUDACXX_DEVICE | ||
bool __is_grid_constant(const void * __p) { | ||
#ifdef _LIBCUDACXX_CUDACC_BELOW_11_7 | ||
return false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return false; | |
(void)__p; return false; |
_LIBCUDACXX_INLINE_VISIBILITY async_contract_fulfillment __dispatch_alignment_bit(_Fn && __f, _CUDA_VSTD::size_t __alignment_fsb) { | ||
const _CUDA_VSTD::size_t __alignment_v = 1ull << (__alignment_fsb - 1); | ||
|
||
if (__builtin_expect(__alignment_v >= _MaxInterestingAlignment, true)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This i going to be soo much fun porting to the various supported CTKs / host compilers
I believe we will need a macro to keep being able to handle this with configurations that do not know about the builtin,
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All things currently in the CI matrix work fine with this. We need to run this on Windows though, it's possiblelikely that MSVC doesn't like it.
NV_PROVIDES_SM_70, | ||
(return _CUDA_VSTD::forward<_Fn>(__f)(__arch::__cuda<70>());), | ||
NV_IS_HOST, | ||
(return _CUDA_VSTD::forward<_Fn>(__f)(__arch::__host());)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you move the closing brace to a separate line, makes it easier to parse
NV_IF_ELSE_TARGET( | ||
NV_IS_DEVICE, | ||
(return __ffsll(__val);), | ||
(return _CUDA_VSTD::__libcpp_ctz(__val) + 1;) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I doubt it might be more efficient, but countr_zero
will use the right intrinsic depending on the context. Possibly even using the host's builtin for constexpr evaluation.
Also fix the order of the names of the template parameters of __are_memcpy_async_hooks_specialized, and uglify an identifier I missed before.
aa89fbf
to
f4a413d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was looking through PRs and realized I did not send the review 🤦♂️
struct __single_thread_group { | ||
_LIBCUDACXX_INLINE_VISIBILITY | ||
void sync() const {} | ||
_LIBCUDACXX_INLINE_VISIBILITY |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe we would want to add nodiscard
here and elsewhere
_LIBCUDACXX_INLINE_VISIBILITY | |
_LIBCUDACXX_NODISCARD_ATTRIBUTE _LIBCUDACXX_INLINE_VISIBILITY |
|
||
template<typename _Tag, _CUDA_VSTD::size_t _Value> | ||
struct __down_convertible_constant<_Tag, _Value, _CUDA_VSTD::__enable_if_t<_Tag::__min == _Value>> { | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typename = void> | ||
struct __memcpy_async_invoke_if_applicable { | ||
template<typename _Fn> | ||
_LIBCUDACXX_INLINE_VISIBILITY |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto: nodiscard
template<typename _Fn> | ||
_LIBCUDACXX_INLINE_VISIBILITY | ||
static async_contract_fulfillment __invoke(_Fn && __f) { | ||
return _CUDA_VSTD::forward<_Fn>(__f)(__alignment<_Alignment>()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
question: The msvc folks started using {}
for construction to disambiguate function calls. I am sympathetic to that approach. Could we start doing so?
template<typename _Fn> | ||
_LIBCUDACXX_INLINE_VISIBILITY | ||
static async_contract_fulfillment __invoke(_Fn && __f) { | ||
_LIBCUDACXX_UNREACHABLE(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am slightly worried, that there are compilers that will scream at us about missing return value, but I am not sure how to properly guard against that. I guess we will see
} | ||
} | ||
|
||
switch (__alignment_fsb) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am wondering whether a type alias would allow us to avoid the macro alltogether:
template <cuda::std::size_t _Value>
using __memcpy_async_invoke_if_alignment = ....
_ADD_CASE(1); | ||
|
||
#undef _ADD_CASE | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we move the unreachable into the default cause to silence potentially stupid compilers?
template<typename _Tp> | ||
struct __dependent_false : std::false_type {}; | ||
|
||
template<typename _Hooks, typename _Size, _CUDA_VSTD::size_t _NativeAlignment, typename = void> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Above there is a trapping fake implementation. here we use a static assert. Is there a reason for the difference?
using __cuda = __down_convertible_constant<__cuda_tag, _ProvidedSM>; | ||
|
||
template<typename _Tp, _CUDA_VSTD::size_t _RequestedSM> | ||
struct __is_cuda_provides_sm : _CUDA_VSTD::false_type { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really dislike the is
in the typename
I think we can just go ahead and close this PR for now. It's not likely to be revived any time soon. @griwes do you agree? |
Also make the memcpy_async tests slightly more robust.
Description
Closes #57
Checklist