
[libcu++] Fix undefined behavior in atomics to automatic storage #478

Open
wants to merge 19 commits into base: main
Conversation

gonzalobg (Collaborator)

The current implementation of atomic operations is unsound. It issues generic PTX atomic instructions even if the address falls in the local memory address space, causing well-formed CUDA C++ programs to exhibit PTX undefined behavior.

Since this only impacts objects with automatic storage, the impact is not very widespread, but it does affect beginners trying to learn libcu++ atomic operations, and it also affects most of the examples in our documentation, which use automatic storage for simplicity.

This change tests whether the address of an atomic operation is in local memory using `__isLocal`, and when that is the case, it uses weak memory operations instead. This is sound because CUDA C++ does not allow sharing the address of automatic variables across threads. If that ever changes, this would need to be updated.

Unfortunately, NVIDIA compilers from toolkits older than 12.3 have a bug that miscompiles programs that use `__isLocal`, like our workaround here. On those toolkits we instead use the PTX `isspacep` instruction to perform the detection.
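A minimal sketch of the two detection paths described above (helper names here are illustrative, not the actual libcu++ internals):

// Detection via the CUDA intrinsic, usable on CTK 12.3 and newer where
// __isLocal is not miscompiled:
__device__ bool is_local_via_intrinsic(const void* ptr) {
  return __isLocal(ptr) != 0;
}

// Detection via inline PTX for older toolkits: isspacep.local sets a
// predicate when the generic address maps to the local state space.
__device__ bool is_local_via_ptx(const void* ptr) {
  unsigned int result;
  asm("{\n\t"
      ".reg .pred p;\n\t"
      "isspacep.local p, %1;\n\t"
      "selp.u32 %0, 1, 0, p;\n\t"
      "}"
      : "=r"(result)
      : "l"(ptr));
  return result != 0;
}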

@gonzalobg gonzalobg requested review from a team as code owners September 25, 2023 11:39
@gonzalobg gonzalobg requested review from ericniebler and wmaxey and removed request for a team September 25, 2023 11:39
copy-pr-bot (bot) commented Sep 25, 2023

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

gonzalobg and others added 2 commits September 25, 2023 17:55
Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com>
Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com>
gonzalobg (Collaborator, Author)

@miscco I have hopefully addressed all the issues. This is now blocked by #479.

miscco (Collaborator) commented Sep 25, 2023

I would have added the macro within this PR

miscco (Collaborator) commented Sep 25, 2023

/ok to test

Co-authored-by: Georgy Evtushenko <evtushenko.georgy@gmail.com>
miscco (Collaborator) commented Oct 12, 2023

/ok to test

@@ -40,6 +44,8 @@ void _LIBCUDACXX_DEVICE __atomic_exchange_cuda(_Type volatile *__ptr, _Type *__v

template<class _Type, class _Delta, class _Scope, typename _CUDA_VSTD::enable_if<sizeof(_Type)<=2, int>::type = 0>
_Type _LIBCUDACXX_DEVICE __atomic_fetch_add_cuda(_Type volatile *__ptr, _Delta __val, int __memorder, _Scope __s) {
_Type __ret;
if (__cuda_fetch_add_weak_if_local(__ptr, __val, &__ret)) return __ret;
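For context, a plausible shape for the guarded helper in the hunk above, following the pattern from the PR description (a sketch, not the exact libcu++ code):

template<class _Type, class _Delta>
bool _LIBCUDACXX_DEVICE __cuda_fetch_add_weak_if_local(_Type volatile* __ptr, _Delta __val, _Type* __ret) {
  // Not local: report false so the caller falls through to the atomic path.
  if (!__isLocal(const_cast<const _Type*>(__ptr))) return false;
  // Local: a weak (non-atomic) read-modify-write is sound, because the
  // address of an automatic variable cannot be shared across threads.
  *__ret = *const_cast<_Type*>(__ptr);
  *const_cast<_Type*>(__ptr) = *__ret + __val;
  return true;
}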
Collaborator

important: the compiler is unable to see through the memory and prove that it's not local. This affects codegen and overall performance. Here's a simple kernel:

#include <cuda/atomic>

using device_atomic_t = cuda::atomic<int, cuda::thread_scope_device>;

__global__ void use(device_atomic_t *d_atomics) {
  d_atomics->fetch_add(threadIdx.x, cuda::memory_order_relaxed);
}

On RTX 6000 Ada the change leads to the following slowdown (up to ~3x slower)

[chart: device_scope_atomics benchmark results]

In the case of the block-scope atomics the performance difference is even more pronounced:

using block_atomic_t = cuda::atomic<int, cuda::thread_scope_block>;  // assumed definition, omitted in the original snippet

template <int BlockSize>
__launch_bounds__(BlockSize) __global__ void use(device_atomic_t *d_atomics, int mv) {
  __shared__ block_atomic_t b_atomics;

  if (threadIdx.x == 0) {
    new (&b_atomics) block_atomic_t{};
  }
  __syncthreads();

  b_atomics.fetch_add(threadIdx.x, cuda::memory_order_relaxed);
  __syncthreads();

  if (threadIdx.x == 0) {
    if (b_atomics.load(cuda::memory_order_relaxed) > mv) {
      d_atomics->fetch_add(1, cuda::memory_order_relaxed);
    }
  }
}

Results for RTX 6000 Ada illustrate up to ~4x slowdown:

[chart: block_scope_atomics benchmark results]

I think I agree with:

Since this only impacts objects with automatic storage, the impact is not very widespread

Given this, I think we should explore options that do not penalize the widespread use cases. If the compiler is able to see through the local-space check, that would be a solution. Otherwise, we could consider refining the:

it affects an object in GPU memory and only GPU threads access it.

requirement to talk about global, cluster, or block memory, plus adding a check for automatic storage in debug builds.
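A debug-build check of the kind suggested here could look like the following sketch (the macro name and message are illustrative, not an actual libcu++ API):

#include <cassert>

// Debug-build check instead of a run-time branch: assert that the atomic
// object does not live in local (automatic) storage. assert() is supported
// in CUDA device code; the check compiles away when NDEBUG is defined.
#ifndef NDEBUG
#define ASSERT_NOT_AUTOMATIC_STORAGE(ptr) \
  assert(!__isLocal(ptr) && "atomic objects must not have automatic storage")
#else
#define ASSERT_NOT_AUTOMATIC_STORAGE(ptr) ((void)0)
#endif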

gonzalobg (Collaborator, Author) commented Oct 16, 2023

This is known, but the analysis is incomplete since:

  • this lands on CUDA CTK 12.4,
  • the impact is zero on CUDA CTK 12.3 and newer, and
  • the impact is zero on CUDA CTK 12.2 and older iff CUDA atomics are used through the cuda::atomic bundled in the CTK, since those are not impacted by this.

The performance regression is scoped to:

  • users of CUDA 12.2 and older,
  • who are not using the CUDA C++ standard library bundled with their CTK, but are instead picking a different version from GitHub.

For those users, we could, in a subsequent PR, provide a way to opt back into the broken behavior via some feature macro, e.g., LIBCUDACXX_UNSAFE_ATOMIC_AUTOMATIC_STORAGE, which users would define consistently before including the headers to avoid ODR issues:

#define LIBCUDACXX_UNSAFE_ATOMIC_AUTOMATIC_STORAGE
#include <cuda/atomic>

gonzalobg (Collaborator, Author)

From the slack discussion, an alternative is to enable the check in CTK 12.2 and older only in debug mode, to avoid the perf hit.
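One way to express that gating (a sketch; CHECK_LOCAL_SPACE is a hypothetical macro, while __CUDACC_VER_MAJOR__ / __CUDACC_VER_MINOR__ are the standard NVCC version macros):

// Gate the local-space check on the toolkit version: always on for CTK
// 12.3+, where the compiler sees through it (zero impact); debug-only on
// older toolkits to avoid the performance hit.
#if __CUDACC_VER_MAJOR__ > 12 || \
    (__CUDACC_VER_MAJOR__ == 12 && __CUDACC_VER_MINOR__ >= 3)
#  define CHECK_LOCAL_SPACE 1
#elif !defined(NDEBUG)
#  define CHECK_LOCAL_SPACE 1
#else
#  define CHECK_LOCAL_SPACE 0
#endif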

miscco (Collaborator) commented Oct 17, 2023

Is this something where we could work with attributes, e.g., [[likely]] / [[unlikely]]?
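For illustration, the attribute would go on the local-memory branch. A sketch with a hypothetical name (fetch_add_sketch), valid only for element types atomicAdd supports, and requiring C++20 for [[unlikely]]:

template <class T, class D>
__device__ T fetch_add_sketch(T volatile* ptr, D val) {
  // Hint to the compiler that the local-memory branch is cold.
  if (__isLocal(const_cast<const T*>(ptr))) [[unlikely]] {
    T old = *const_cast<T*>(ptr);      // weak read-modify-write is sound on
    *const_cast<T*>(ptr) = old + val;  // local memory (see the PR description)
    return old;
  }
  return atomicAdd(const_cast<T*>(ptr), static_cast<T>(val));  // atomic path
}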
