[BUG]: Intermittent wrong output from thrust::remove_if under heavy GPU loading #1730
Comments
Thanks for the detailed write-up @ssadasivam1! Could you provide more information on how the results end up wrong? Is it removing things it shouldn't? Not removing things it should? Is the filtered output actually wrong? Or is just the number of removed elements returned by the algorithm wrong?
I went back and checked what exactly is wrong. It seems like the filtered output in `vecD` is wrong for a small fraction of the elements (roughly 30-40 elements are wrong out of 5 million). The number of elements removed is fine. Also, if you switch from `int16_t` to `int32_t` in the example code, the problem seems to go away (at least in the few runs I have done so far).
Another interesting observation, if it helps with debug: the number of wrong elements seems to be 16 or 32 or 48 or 64 -- it seems to like multiples of 16 :)
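For reference, a minimal sketch of the kind of host-side cross-check behind the numbers above, assuming an illustrative predicate and data; the actual test code may differ:

```cpp
// Hypothetical validation sketch: compare the device-side result of
// thrust::remove_if against a host-side std::remove_if reference.
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/remove.h>
#include <algorithm>
#include <cstdint>
#include <cstdio>

struct removePred // illustrative predicate; the reproducer's may differ
{
    __host__ __device__ bool operator()(int16_t x) const { return x < 0; }
};

void validate(const thrust::host_vector<int16_t>& input)
{
    // Run the algorithm on the device.
    thrust::device_vector<int16_t> vecD = input;
    auto dEnd = thrust::remove_if(vecD.begin(), vecD.end(), removePred{});
    thrust::host_vector<int16_t> got(vecD.begin(), dEnd);

    // Compute the reference result on the host.
    thrust::host_vector<int16_t> ref = input;
    ref.erase(std::remove_if(ref.begin(), ref.end(), removePred{}), ref.end());

    // Count mismatching elements in the filtered output.
    std::size_t wrong = 0;
    if (got.size() == ref.size())
        for (std::size_t i = 0; i < got.size(); ++i)
            wrong += (got[i] != ref[i]);
    std::printf("removed-count matches: %d, wrong elements: %zu\n",
                int(got.size() == ref.size()), wrong);
}
```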
Thanks a lot for adding further details @ssadasivam1! That's very helpful. I'm currently investigating the issue. Just to add my findings:
By the way, just as a general remark, it is always advisable for programs that use CUB and Thrust to be compiled for the architectures they are targeting. CUB uses tuning policies to make sure that the algorithms are compiled with the right meta parameters (e.g., number of threads per block, number of items per thread, etc.) for the hardware they are being compiled for.
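For example, targeting an A100 (SM 80) explicitly might look like the following; the paths are placeholders from the reproducer instructions, not the exact command used:

```sh
/usr/local/cuda/bin/nvcc -O3 -gencode arch=compute_80,code=sm_80 \
    -I /path/to/cccl/thrust/ -I /path/to/cccl/cub \
    -I /path/to/cccl/libcudacxx/include/ test.cu
```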
Thanks, will compile with the target architecture.
Update: I'm also unable to repro on A100 when compiling for the target architecture.
@ssadasivam1, I suspect this may be due to a compiler issue in CUDA 12.2 and 12.3. I was not able to reproduce this issue using nvcc from CUDA 12.4.1 (with driver NVIDIA-SMI 550.54.14). What made me suspicious was that, when using CTK 12.3, changing the tuning policies (i.e., making SM 80 use the tuning policy for SM 52 and vice versa) left the error behaviour unaffected. Can you resolve this issue by compiling for the targeted GPU architectures or by upgrading to a more recent CTK?
@elstehle Our application, from which this standalone simplified reproducer was extracted, still fails with CUDA 12.4 and CUDA 12.5. So I do believe this issue still exists in our app, although the simplified reproducer seems OK with CUDA 12.4.
Is this a duplicate?
Type of Bug
Silent Failure
Component
Thrust
Describe the bug
We found intermittent wrong results from `thrust::remove_if` under heavy GPU loading (multiple processes running simultaneously on the same GPU under MPS). Specifically, the failure seems to happen when using this variant of `remove_if`: `thrust::remove_if(firstIt, lastIt, removePred{})`. The problem seems to go away if the removal predicate is first materialized into a stencil array and `remove_if` is then called with the stencil. I have seen reports of performance improvements when the stencil array is materialized, but in this case we see intermittently wrong output. The issue is not present in the older CCCL v2.2; it seems to have been introduced in v2.3 and is also reproducible with the current main branch of CCCL. I'm attaching a standalone reproducer, but note that many instances of the reproducer need to be run simultaneously to simulate heavy GPU loading. We suspect some sort of race condition that likely only manifests under heavy loading.
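For concreteness, here is a minimal sketch of the two variants, mirroring the reproducer's structure; the predicate and the data fill are illustrative, not the exact reproducer:

```cpp
#include <thrust/device_vector.h>
#include <thrust/functional.h>
#include <thrust/remove.h>
#include <thrust/transform.h>
#include <cstdint>

struct removePred // illustrative predicate; the reproducer's may differ
{
    __host__ __device__ bool operator()(int16_t x) const { return x < 0; }
};

int main()
{
    // Illustrative data; the reproducer uses ~5 million int16_t elements.
    thrust::device_vector<int16_t> vecD(5000000, int16_t{1});

#if 1
    // Failing variant: the predicate is evaluated inside remove_if itself.
    // This intermittently produces wrong output under heavy GPU loading.
    auto newEnd = thrust::remove_if(vecD.begin(), vecD.end(), removePred{});
#else
    // Workaround: materialize the predicate into a boolean stencil first,
    // then call the stencil overload of remove_if. This behaves correctly.
    thrust::device_vector<bool> stencil(vecD.size());
    thrust::transform(vecD.begin(), vecD.end(), stencil.begin(), removePred{});
    auto newEnd = thrust::remove_if(vecD.begin(), vecD.end(), stencil.begin(),
                                    thrust::identity<bool>{});
#endif
    (void)newEnd;
    return 0;
}
```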
How to Reproduce
1. Compile the attached reproducer with `/usr/local/cuda/bin/nvcc -O3 -I /path/to/cccl/thrust/ -I /path/to/cccl/cub -I /path/to/cccl/libcudacxx/include/ test.cu` and run `a.out`.
2. Now run the same binary as 40 different processes on the same GPU under MPS using this bash script (see the sketch after the Expected behavior section).

When I run this on an A100 with 80 GB of memory, I typically see 1-2 processes fail the test. The test does take 6-7 minutes to run -- patience!

Expected behavior
Ideally, all 40 processes should pass and the bash script should print `All processes exited successfully`. Such a successful execution happens when the boolean stencil is materialized (see the commented-out `#else` block in the reproducer) or with the older CCCL version 2.2.
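A minimal sketch of the kind of launcher script referred to above, assuming the binary is `./a.out` and MPS is already running; the original script may differ:

```bash
#!/bin/bash
# Hypothetical launcher: run 40 copies of the reproducer concurrently on
# the same GPU (under an already-running MPS daemon) and report the outcome.
for i in $(seq 1 40); do
    ./a.out &
done

# Wait for every background process and record any nonzero exit status.
fail=0
for pid in $(jobs -p); do
    wait "$pid" || fail=1
done

if [ "$fail" -eq 0 ]; then
    echo "All processes exited successfully"
else
    echo "At least one process failed"
fi
```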
No response
Operating System
SLES 15 SP3 and Ubuntu 20.04.6
nvidia-smi output
```
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03    Driver Version: 535.54.03    CUDA Version: 12.2               |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name            Persistence-M      | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf      Pwr:Usage/Cap      |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100 80GB PCIe          On  | 00000000:C1:00.0 Off |                    0 |
| N/A  35C  P0         44W / 300W         |     4MiB / 81920MiB  |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
```
NVCC version
```
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Jun_13_19:16:58_PDT_2023
Cuda compilation tools, release 12.2, V12.2.91
Build cuda_12.2.r12.2/compiler.32965470_0
```