Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor memcpy_async for easier extensions. #348

Open
wants to merge 12 commits into
base: main
Choose a base branch
from

Conversation

griwes
Copy link
Collaborator

@griwes griwes commented Aug 16, 2023

Also make the memcpy_async tests slightly more robust.

Description

Closes #57

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@griwes griwes added feature request New feature or request. libcu++ For all items related to libcu++ labels Aug 16, 2023
@griwes griwes force-pushed the refactor-memcpy-async branch 3 times, most recently from 0e5cd59 to c8c9b9d Compare August 17, 2023 00:57
Also make the memcpy_async tests slightly more robust.
Copy link
Collaborator

@miscco miscco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First pass review

libcudacxx/include/cuda/std/detail/libcxx/include/__config Outdated Show resolved Hide resolved
libcudacxx/include/cuda/pipeline Outdated Show resolved Hide resolved
libcudacxx/include/cuda/pipeline Outdated Show resolved Hide resolved
Significant one is reworking barrier<thread_scope_thread>, because I
noticed that it was starting to rot (it didn't get all the new try_wait
and wait_parity APIs that were added to the block version).
@griwes griwes marked this pull request as ready for review August 21, 2023 09:49
@griwes griwes requested review from a team as code owners August 21, 2023 09:49
@griwes griwes requested review from ericniebler and alliepiper and removed request for a team August 21, 2023 09:49
_LIBCUDACXX_DEVICE
static async_contract_fulfillment __synchronize(__arch::__cuda<80>, barrier<_Sco, _CompF> &, async_contract_fulfillment __acf) {
static async_contract_fulfillment __synchronize(__arch::__cuda<80>, barrier<_Sco, _CompF> &, async_contract_fulfillment __acf, _Empty...) {
if (__acf == async_contract_fulfillment::async) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we be defensive and add static_assert(sizeof...(_Empty) == 0, "Should not be called with additional arguments");


#include <cuda/std/type_traits>

template<typename T>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would love some comments on when we need this

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At some point during this work I managed to do a dumb and write over some bytes beyond the variable, which results in a silly behavior of an endless hang if you happen to overwrite the barrier; fun times. Hopefully this will catch the more likely cases of off-by-ones (that's what I did, didn't subtract 1 from the first set bit index when turning that index into an actual alignment value) with a reasonable assert message instead of a hang in arrive_and_wait.

Copy link
Collaborator

@miscco miscco Aug 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I meant in the file for future reference ;)


return __memcpy_async<alignof(_Type)>(__group, reinterpret_cast<char *>(__destination), reinterpret_cast<char const *>(__source), __size, __pipeline);
}
template<thread_scope _Sco, __tx_api _Tx, typename _Arch, __space _OutSpace, __space _InSpace, __space _SyncSpace>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpick: I believe we generally trend towards using class instead of typename

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really dislike class, because it's the less semantically accurate of the two spellings (an argument could be an int, and int is definitely not a class) - but we have both throughout the library, so we should probably settle on a policy and do a library-wide unification if we want to.

_OutSpace, _InSpace, _SyncSpace
> {
__host__ __device__
static async_contract_fulfillment __synchronize(_Arch, pipeline<_Sco> &, async_contract_fulfillment __acf) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might need to macroize async_contract_fulfillment

At the same time, if a user overwrites this with a macro they kind of deserve what they get

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm? This is a name defined in the public API of this header, all normal rules apply here.

@@ -557,244 +543,103 @@ _LIBCUDACXX_END_NAMESPACE_CUDA
_LIBCUDACXX_BEGIN_NAMESPACE_CUDA_DEVICE

_LIBCUDACXX_DEVICE
inline _CUDA_VSTD::uint64_t * barrier_native_handle(barrier<thread_scope_block> & b) {
return reinterpret_cast<_CUDA_VSTD::uint64_t *>(&b.__barrier);
inline _CUDA_VSTD::uint64_t * barrier_native_handle(barrier<thread_scope_block> & __b) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we move this rename into a separate bugfix PR to reduce the noise?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can, though that's going to create more issue/PR noise :P

_LIBCUDACXX_DEVICE
bool __is_grid_constant(const void * __p) {
#ifdef _LIBCUDACXX_CUDACC_BELOW_11_7
return false;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return false;
(void)__p; return false;

_LIBCUDACXX_INLINE_VISIBILITY async_contract_fulfillment __dispatch_alignment_bit(_Fn && __f, _CUDA_VSTD::size_t __alignment_fsb) {
const _CUDA_VSTD::size_t __alignment_v = 1ull << (__alignment_fsb - 1);

if (__builtin_expect(__alignment_v >= _MaxInterestingAlignment, true)) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This i going to be soo much fun porting to the various supported CTKs / host compilers

I believe we will need a macro to keep being able to handle this with configurations that do not know about the builtin,

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All things currently in the CI matrix work fine with this. We need to run this on Windows though, it's possiblelikely that MSVC doesn't like it.

NV_PROVIDES_SM_70,
(return _CUDA_VSTD::forward<_Fn>(__f)(__arch::__cuda<70>());),
NV_IS_HOST,
(return _CUDA_VSTD::forward<_Fn>(__f)(__arch::__host());))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you move the closing brace to a separate line, makes it easier to parse

NV_IF_ELSE_TARGET(
NV_IS_DEVICE,
(return __ffsll(__val);),
(return _CUDA_VSTD::__libcpp_ctz(__val) + 1;)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I doubt it might be more efficient, but countr_zero will use the right intrinsic depending on the context. Possibly even using the host's builtin for constexpr evaluation.

Copy link
Collaborator

@miscco miscco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was looking through PRs and realized I did not send the review 🤦‍♂️

struct __single_thread_group {
_LIBCUDACXX_INLINE_VISIBILITY
void sync() const {}
_LIBCUDACXX_INLINE_VISIBILITY
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe we would want to add nodiscard here and elsewhere

Suggested change
_LIBCUDACXX_INLINE_VISIBILITY
_LIBCUDACXX_NODISCARD_ATTRIBUTE _LIBCUDACXX_INLINE_VISIBILITY


template<typename _Tag, _CUDA_VSTD::size_t _Value>
struct __down_convertible_constant<_Tag, _Value, _CUDA_VSTD::__enable_if_t<_Tag::__min == _Value>> {

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change

typename = void>
struct __memcpy_async_invoke_if_applicable {
template<typename _Fn>
_LIBCUDACXX_INLINE_VISIBILITY
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto: nodiscard

template<typename _Fn>
_LIBCUDACXX_INLINE_VISIBILITY
static async_contract_fulfillment __invoke(_Fn && __f) {
return _CUDA_VSTD::forward<_Fn>(__f)(__alignment<_Alignment>());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: The msvc folks started using {} for construction to disambiguate function calls. I am sympathetic to that approach. Could we start doing so?

template<typename _Fn>
_LIBCUDACXX_INLINE_VISIBILITY
static async_contract_fulfillment __invoke(_Fn && __f) {
_LIBCUDACXX_UNREACHABLE();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am slightly worried, that there are compilers that will scream at us about missing return value, but I am not sure how to properly guard against that. I guess we will see

}
}

switch (__alignment_fsb) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am wondering whether a type alias would allow us to avoid the macro alltogether:

template <cuda::std::size_t _Value>
using __memcpy_async_invoke_if_alignment = ....

_ADD_CASE(1);

#undef _ADD_CASE
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we move the unreachable into the default cause to silence potentially stupid compilers?

template<typename _Tp>
struct __dependent_false : std::false_type {};

template<typename _Hooks, typename _Size, _CUDA_VSTD::size_t _NativeAlignment, typename = void>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Above there is a trapping fake implementation. here we use a static assert. Is there a reason for the difference?

using __cuda = __down_convertible_constant<__cuda_tag, _ProvidedSM>;

template<typename _Tp, _CUDA_VSTD::size_t _RequestedSM>
struct __is_cuda_provides_sm : _CUDA_VSTD::false_type {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really dislike the is in the typename

@jrhemstad
Copy link
Collaborator

I think we can just go ahead and close this PR for now. It's not likely to be revived any time soon.

@griwes do you agree?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request. libcu++ For all items related to libcu++
Projects
Status: In Review
Development

Successfully merging this pull request may close these issues.

[FEA]: Redesign memcpy_async implementation to support more general src/dest dispatch
4 participants