
[libcu++] Fix undefined behavior in atomics to automatic storage #478

Open
wants to merge 19 commits into base: main
Conversation

gonzalobg (Collaborator)

The current implementation of atomic operations is unsound. It issues generic PTX atomic instructions even if the address falls in the local memory address space, causing well-formed CUDA C++ programs to exhibit PTX undefined behavior.

Since this only impacts objects with automatic storage, the impact is not very widespread, but it does affect beginners trying to learn libcu++ atomic operations, and it also affects most of the examples in our documentation, which use automatic storage for simplicity.

This change tests whether the address of an atomic operation is in local memory using `__isLocal`, and when that is the case, it uses weak memory operations instead. This is sound because CUDA C++ does not allow sharing the address of automatic variables across threads. If that ever changes, this would need to be updated.

Unfortunately, NVIDIA compilers from toolkits older than 12.3 have a bug that miscompiles programs that use `__isLocal`, like our workaround here. On those toolkits we instead use the PTX `isspacep` instruction to perform the detection.
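A minimal sketch of the two detection paths described above (helper names here are illustrative, not the actual libcu++ internals):

// Detection via the CUDA intrinsic, usable on CTK 12.3 and newer where
// __isLocal is not miscompiled:
__device__ bool is_local_via_intrinsic(const void* ptr) {
  return __isLocal(ptr) != 0;
}

// Detection via inline PTX for older toolkits: isspacep.local sets a
// predicate when the generic address maps to the local state space.
__device__ bool is_local_via_ptx(const void* ptr) {
  unsigned int result;
  asm("{\n\t"
      ".reg .pred p;\n\t"
      "isspacep.local p, %1;\n\t"
      "selp.u32 %0, 1, 0, p;\n\t"
      "}"
      : "=r"(result)
      : "l"(ptr));
  return result != 0;
}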

@gonzalobg gonzalobg requested review from a team as code owners September 25, 2023 11:39
@gonzalobg gonzalobg requested review from ericniebler and wmaxey and removed request for a team September 25, 2023 11:39
copy-pr-bot (bot) commented Sep 25, 2023

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

gonzalobg and others added 2 commits September 25, 2023 17:55
Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com>
Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com>
gonzalobg (Collaborator, Author)

@miscco I have hopefully addressed all the issues. This is now blocked by #479.

miscco (Collaborator) commented Sep 25, 2023

I would have added the macro within this PR

miscco (Collaborator) commented Sep 25, 2023

/ok to test

Co-authored-by: Georgy Evtushenko <evtushenko.georgy@gmail.com>
miscco (Collaborator) commented Oct 12, 2023

/ok to test

@@ -40,6 +44,8 @@ void _LIBCUDACXX_DEVICE __atomic_exchange_cuda(_Type volatile *__ptr, _Type *__v

template<class _Type, class _Delta, class _Scope, typename _CUDA_VSTD::enable_if<sizeof(_Type)<=2, int>::type = 0>
_Type _LIBCUDACXX_DEVICE __atomic_fetch_add_cuda(_Type volatile *__ptr, _Delta __val, int __memorder, _Scope __s) {
_Type __ret;
if (__cuda_fetch_add_weak_if_local(__ptr, __val, &__ret)) return __ret;
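For context, a plausible shape for the guarded helper in the hunk above, following the pattern from the PR description (a sketch, not the exact libcu++ code):

template<class _Type, class _Delta>
bool _LIBCUDACXX_DEVICE __cuda_fetch_add_weak_if_local(_Type volatile* __ptr, _Delta __val, _Type* __ret) {
  // Not local: report false so the caller falls through to the atomic path.
  if (!__isLocal(const_cast<const _Type*>(__ptr))) return false;
  // Local: a weak (non-atomic) read-modify-write is sound, because the
  // address of an automatic variable cannot be shared across threads.
  *__ret = *const_cast<_Type*>(__ptr);
  *const_cast<_Type*>(__ptr) = *__ret + __val;
  return true;
}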
Collaborator

important: the compiler is unable to see through the memory and prove that it's not local. This affects codegen and overall performance. Here's a simple kernel:

#include <cuda/atomic>

using device_atomic_t = cuda::atomic<int, cuda::thread_scope_device>;

__global__ void use(device_atomic_t *d_atomics) {
  d_atomics->fetch_add(threadIdx.x, cuda::memory_order_relaxed);
}

On RTX 6000 Ada the change leads to the following slowdown (up to ~3x slower)

[chart: device_scope_atomics benchmark results]

In the case of the block-scope atomics the performance difference is even more pronounced:

using block_atomic_t = cuda::atomic<int, cuda::thread_scope_block>;  // assumed definition, omitted in the original snippet

template <int BlockSize>
__launch_bounds__(BlockSize) __global__ void use(device_atomic_t *d_atomics, int mv) {
  __shared__ block_atomic_t b_atomics;

  if (threadIdx.x == 0) {
    new (&b_atomics) block_atomic_t{};
  }
  __syncthreads();

  b_atomics.fetch_add(threadIdx.x, cuda::memory_order_relaxed);
  __syncthreads();

  if (threadIdx.x == 0) {
    if (b_atomics.load(cuda::memory_order_relaxed) > mv) {
      d_atomics->fetch_add(1, cuda::memory_order_relaxed);
    }
  }
}

Results for RTX 6000 Ada illustrate up to ~4x slowdown:

[chart: block_scope_atomics benchmark results]

I think I agree with:

Since this only impacts objects with automatic storage, the impact is not very widespread

Given this, I think we should explore options that do not penalize the widespread use cases. If the compiler is able to see through the local-space check, that would be a solution. Otherwise, we could consider refining the:

it affects an object in GPU memory and only GPU threads access it.

requirement to talk about global, cluster, or block memory, plus adding a check for automatic storage in debug builds.
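A debug-build check of the kind suggested here could look like the following sketch (the macro name and message are illustrative, not an actual libcu++ API):

#include <cassert>

// Debug-build check instead of a run-time branch: assert that the atomic
// object does not live in local (automatic) storage. assert() is supported
// in CUDA device code; the check compiles away when NDEBUG is defined.
#ifndef NDEBUG
#define ASSERT_NOT_AUTOMATIC_STORAGE(ptr) \
  assert(!__isLocal(ptr) && "atomic objects must not have automatic storage")
#else
#define ASSERT_NOT_AUTOMATIC_STORAGE(ptr) ((void)0)
#endif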

gonzalobg (Collaborator, Author) commented Oct 16, 2023

This is known, but the analysis is incomplete since:

  • this lands on CUDA CTK 12.4,
  • the impact is zero on CUDA CTK 12.3 and newer, and
  • the impact is zero on CUDA CTK 12.2 and older iff CUDA atomics are used through the cuda::atomic bundled in the CTK, since those are not impacted by this.

The performance regression is scoped to:

  • users of CUDA 12.2 and older,
  • who are not using the CUDA C++ standard library bundled with their CTK, but are instead picking a different version from GitHub.

For those users, we could, in a subsequent PR, provide a way to opt back into the broken behavior via some feature macro, e.g., LIBCUDACXX_UNSAFE_ATOMIC_AUTOMATIC_STORAGE, which users would define consistently before including the headers to avoid ODR issues:

#define LIBCUDACXX_UNSAFE_ATOMIC_AUTOMATIC_STORAGE
#include <cuda/atomic>

gonzalobg (Collaborator, Author)

From the slack discussion, an alternative is to enable the check in CTK 12.2 and older only in debug mode, to avoid the perf hit.
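One way to express that gating (a sketch; CHECK_LOCAL_SPACE is a hypothetical macro, while __CUDACC_VER_MAJOR__ / __CUDACC_VER_MINOR__ are the standard NVCC version macros):

// Gate the local-space check on the toolkit version: always on for CTK
// 12.3+, where the compiler sees through it (zero impact); debug-only on
// older toolkits to avoid the performance hit.
#if __CUDACC_VER_MAJOR__ > 12 || \
    (__CUDACC_VER_MAJOR__ == 12 && __CUDACC_VER_MINOR__ >= 3)
#  define CHECK_LOCAL_SPACE 1
#elif !defined(NDEBUG)
#  define CHECK_LOCAL_SPACE 1
#else
#  define CHECK_LOCAL_SPACE 0
#endif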

miscco (Collaborator) commented Oct 17, 2023

Is this something where we could work with attributes, e.g., [[likely]] / [[unlikely]]?
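For illustration, the attribute would go on the local-memory branch. A sketch with a hypothetical name (fetch_add_sketch), valid only for element types atomicAdd supports, and requiring C++20 for [[unlikely]]:

template <class T, class D>
__device__ T fetch_add_sketch(T volatile* ptr, D val) {
  // Hint to the compiler that the local-memory branch is cold.
  if (__isLocal(const_cast<const T*>(ptr))) [[unlikely]] {
    T old = *const_cast<T*>(ptr);      // weak read-modify-write is sound on
    *const_cast<T*>(ptr) = old + val;  // local memory (see the PR description)
    return old;
  }
  return atomicAdd(const_cast<T*>(ptr), static_cast<T>(val));  // atomic path
}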
