
Implement <cuda/std/bitset> #1496

Open · wants to merge 35 commits into main
Conversation

@griwes (Collaborator) commented Mar 6, 2024

Description

Implements <cuda/std/bitset>, with a complementary backport of constexprness to C++14 and up.

Resolves #1321

Note for reviewers: I separated the pulls of updated upstream headers into their own commits, so you can more or less ignore the first two commits of this PR while reviewing.
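For a sense of what this enables, a minimal usage sketch (it assumes only the header this PR adds and is callable from both host and device code):

#include <cuda/std/bitset>

__host__ __device__ bool demo()
{
  cuda::std::bitset<16> b;
  b.set(3);  // exactly one bit set
  b.flip();  // invert all 16 bits
  return !b.test(3) && b.count() == 15;
}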

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

griwes added 19 commits April 2, 2024 22:35
Math is hard, even for compilers (especially for older compilers).
I admit I am a bit lost as to why this operation sits inside the if in __copy_aligned but outside of it in __copy_backward_aligned, but it seems that things work when they aren't symmetric.
It seems I had the code right from the beginning, and this is simply a codegen snafu with older GCCs, specifically in the backward case? I am confused, but maybe this time it'll fix more than it breaks.
@griwes griwes marked this pull request as ready for review April 23, 2024 08:57
@griwes griwes requested review from a team as code owners April 23, 2024 08:57
@griwes griwes requested review from wmaxey and elstehle April 23, 2024 08:57
@miscco (Collaborator) left a comment:

Second pass review

@@ -32,34 +32,43 @@
# define _CCCL_DIAG_SUPPRESS_GCC(str)
# define _CCCL_DIAG_SUPPRESS_NVHPC(str)
# define _CCCL_DIAG_SUPPRESS_MSVC(str)
# define _CCCL_DIAG_SUPPRESS_ICC(str)
#elif defined(_CCCL_COMPILER_GCC) || defined(_CCCL_COMPILER_ICC)
Collaborator:

Please move ICC to its own section
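For reference, the requested restructuring would give ICC its own branch instead of sharing GCC's; a structure-only sketch (pragma bodies elided, macro names taken from the quoted hunk):

#if defined(_CCCL_COMPILER_GCC)
#  define _CCCL_DIAG_SUPPRESS_GCC(str) // GCC diagnostic pragma goes here
#  define _CCCL_DIAG_SUPPRESS_ICC(str)
#elif defined(_CCCL_COMPILER_ICC)
#  define _CCCL_DIAG_SUPPRESS_GCC(str)
#  define _CCCL_DIAG_SUPPRESS_ICC(str) // ICC warning pragma goes here
#endif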

Comment on lines +65 to +67
#include <cuda/std/detail/libcxx/include/memory> // for __murmur2_or_cityhash

#include <cuda/std/detail/libcxx/include/algorithm> // for search and min
Collaborator:

Important: Please include the individual headers
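For reference, the granular style being asked for would name the specific detail headers instead of the monolithic ones, along these lines (paths follow the libc++ layout vendored into this repository and are illustrative):

#include <cuda/std/detail/libcxx/include/__algorithm/min.h>
#include <cuda/std/detail/libcxx/include/__algorithm/search.h>
#include <cuda/std/detail/libcxx/include/__functional/hash.h> // __murmur2_or_cityhash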

Comment on lines +147 to +148
template <class _CharT>
inline
Collaborator:

Question: Is the inline here needed, given that this is already a template?

Also, why are none of these functions marked _LIBCUDACXX_INLINE_VISIBILITY?

If they are currently unused, I would strongly prefer to just drop them.
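Context for the inline question: the ODR already allows a function template to be defined identically in multiple translation units, so the keyword adds nothing for linkage and only survives as a style/inlining hint. A minimal illustration (hypothetical names):

template <class T>
T twice(T v) // fine in a header without `inline`; repeated identical template definitions are allowed
{
  return v + v;
}

inline int twice_int(int v) // a non-template definition in a header does need `inline`
{
  return v + v;
}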

++__len;
return __len;
}
#endif
Collaborator:

Important: Please add comments to the #endif. Applies throughout.
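I.e., the requested convention (guard name illustrative):

#if defined(_CCCL_COMPILER_MSVC)
// MSVC-specific workaround
#endif // defined(_CCCL_COMPILER_MSVC)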

Comment on lines +77 to +104
template <class _Tp, class _Up>
inline _LIBCUDACXX_HIDE_FROM_ABI _LIBCUDACXX_INLINE_VISIBILITY _CCCL_CONSTEXPR_CXX14 bool
__constexpr_tail_overlap_fallback(_Tp* __first, _Up* __needle, _Tp* __last)
{
while (__first != __last)
{
if (__first == __needle)
{
return true;
}
++__first;
}
return false;
}

template <class _Tp, class _Up>
inline _LIBCUDACXX_HIDE_FROM_ABI _LIBCUDACXX_INLINE_VISIBILITY _CCCL_CONSTEXPR_CXX14 bool
__constexpr_tail_overlap(_Tp* __first, _Up* __needle, _Tp* __last)
{
#if __has_builtin(__builtin_constant_p) || defined(_CCCL_COMPILER_GCC)
NV_IF_ELSE_TARGET(NV_IS_HOST,
(return __builtin_constant_p(__first < __needle) && __first < __needle;),
(return __constexpr_tail_overlap_fallback(__first, __needle, __last);))
#else
return __constexpr_tail_overlap_fallback(__first, __needle, __last);
#endif
}

Collaborator:

I am not a fan of these changes. I would much prefer that we write an appropriate function for bitset rather than change the copy implementation.

Collaborator (Author):

Eh. We already purposefully go to memmove and not memcpy in this implementation at runtime. I don't understand why we wouldn't just extend that and provide a bit of an extra guarantee that is explicitly allowed by the standard.

Collaborator (Author):

@wmaxey and maybe @jrhemstad: any hot takes on this issue?

Member:

I don't necessarily have an opinion on this specific change. However, seeing memcpy/memmove mentioned adds to the concern that those functions have particularly bad performance on device.

Member:

It looks like the same hacky constexpr trickery I had to do in <cuda/std/bit>, which makes me feel okay with what is here.

Collaborator:

Fine by me
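For readers skimming the thread, a standalone sketch of the idiom being debated (hypothetical function name; the expression mirrors the quoted hunk):

// __builtin_constant_p(p < q) folds to 1 only when the compiler can prove
// the comparison at compile time; for unrelated runtime pointers it yields
// 0, so the whole expression conservatively returns false.
template <class T>
constexpr bool known_to_precede(const T* p, const T* q)
{
  return __builtin_constant_p(p < q) && p < q;
}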


template <class _Cp>
-inline _LIBCUDACXX_INLINE_VISIBILITY
-void
+inline _LIBCUDACXX_HIDE_FROM_ABI _LIBCUDACXX_INLINE_VISIBILITY _CCCL_CONSTEXPR_CXX14 void
Collaborator:

Question: Do we want to take the opportunity and modernize the code base?

I was previously of the opinion that we should stay close to libcxx, but I am less and less convinced.

{
__result.__seg_ -= __nw;
__last.__seg_ -= __nw;
_CUDA_VSTD::copy_n(std::__to_address(__last.__seg_), __nw, std::__to_address(__result.__seg_));
Collaborator:

Critical: We should not use std:: qualified functions.

Collaborator (Author):

Ah yes, that's just me failing to add /g at the end of my sed here. Will fix.
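(I.e., the quoted copy_n line would presumably become:)

_CUDA_VSTD::copy_n(_CUDA_VSTD::__to_address(__last.__seg_), __nw, _CUDA_VSTD::__to_address(__result.__seg_));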

static_cast<unsigned>(__size_ % __bits_per_word));
for (size_t __i = 0; __i != __bit_array<_Cp>::_Np; ++__i)
{
_CUDA_VSTD::__construct_at(__word_ + __i, 0);
Collaborator:

Question: AFAIK the storage_type should be trivial. Can we just assign to it?

@griwes (Collaborator, Author) commented Apr 24, 2024:

I can try to check this, but I am expecting to run into problems with compilers that don't implement the implicit lifetime types DR (which would be most of the compilers in our matrix). (Edit: especially at constexpr.)
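As background for this exchange, a minimal C++20 illustration of the lifetime point (the PR targets C++14 via _CUDA_VSTD::__construct_at; std::construct_at and the type below are stand-ins, not the PR's code):

#include <memory>

struct words4
{
  unsigned w[4];
  constexpr words4()
  {
    for (int i = 0; i != 4; ++i)
    {
      // construct_at explicitly begins each element's lifetime, which keeps
      // the constructor usable in constant expressions on compilers that
      // predate the implicit-lifetime wording referenced above.
      std::construct_at(w + i, 0u);
    }
  }
};
static_assert(words4().w[2] == 0);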

{
__bit_array<_Cp> __b(__d1);
_CUDA_VSTD::copy(__first, __middle, __b.begin());
_CUDA_VSTD::copy(__b.begin(), __b.end(), std::copy(__middle, __last, __first));
Collaborator:

Ditto: should not use std:: functions.

#ifndef _CUDA_STD_BITSET
#define _CUDA_STD_BITSET

#include "detail/__config"
Collaborator:

This will need clang-format off to ensure that formatting does not screw up the ordering of the includes.
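I.e., something like (guard placement illustrative):

// clang-format off
#include "detail/__config"
// clang-format on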

public:
using __container = typename _Cp::__self;

_LIBCUDACXX_HIDE_FROM_ABI _CCCL_CONSTEXPR_CXX14 __bit_reference(const __bit_reference&) = default;
Collaborator:

Should we do that?

}
// do middle whole words
__storage_type __nw = __n / __bits_per_word;
_CUDA_VSTD::fill_n(std::__to_address(__first.__seg_), __nw, _FillVal ? static_cast<__storage_type>(-1) : 0);
Collaborator:

This still has std:: in it

@miscco added the "2.5.0", "feature request", and "libcu++" labels May 6, 2024
@fbusato commented May 24, 2024:

Please consider using uint32_t for the storage type if the C++ specification allows it: https://github.com/NVIDIA/cccl/blob/main/libcudacxx/include/cuda/std/detail/libcxx/include/bitset#L151. 64-bit operations are less efficient on GPU architectures.
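If the storage type were switched as suggested, the word math stays the same; a self-contained sketch (names hypothetical, not the header's):

#include <cstddef>
#include <cstdint>

// Hypothetical 32-bit word choice per the suggestion above.
using word_type = std::uint32_t;
constexpr std::size_t bits_per_word = 8 * sizeof(word_type); // 32

// Number of words needed to hold n_bits, rounded up.
constexpr std::size_t words_for(std::size_t n_bits)
{
  return (n_bits + bits_per_word - 1) / bits_per_word;
}

static_assert(words_for(70) == 3, "70 bits need three 32-bit words");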

Labels
2.5.0 · feature request (New feature or request.) · libcu++ (For all items related to libcu++)
Projects
Status: In Review
Development
Successfully merging this pull request may close these issues:
[FEA]: Provide support for <bitset>
4 participants