
[BUG]: Intermittent wrong output from thrust::remove_if under heavy GPU loading #1730

Open
ssadasivam1 opened this issue May 10, 2024 · 8 comments · May be fixed by #1782
Labels: bug (Something isn't working right.)

ssadasivam1 commented May 10, 2024

Type of Bug

Silent Failure

Component

Thrust

Describe the bug

We found some intermittent wrong results from thrust::remove_if under heavy GPU loading (multiple processes running simultaneously on the same GPU under MPS). Specifically, the failure seems to happen when using this variant of remove_if:
thrust::remove_if(firstIt, lastIt, removePred{}).

The problem seems to go away if the removal predicate is first materialized into a stencil array and remove_if is then called with the stencil:

thrust::transform(first, last, stencil, removePred{}); //first materialize the removal criteria into a boolean stencil array
thrust::remove_if(first, last, stencil, thrust::identity<bool>());

I have seen reports of performance improvements when the stencil array is materialized, but in this case we see intermittent wrong output. The issue is not present in the older CCCL v2.2; it seems to have been introduced in v2.3 and is also reproducible with the current main branch of CCCL. I'm attaching a standalone reproducer, but note that many instances of it need to be run simultaneously to simulate heavy GPU loading. We suspect some sort of race condition that only manifests under heavy loading.

How to Reproduce

  1. Compile this standalone program that tests remove_if 10000 times.
    /usr/local/cuda/bin/nvcc -O3 -I /path/to/cccl/thrust/ -I /path/to/cccl/cub -I /path/to/cccl/libcudacxx/include/ test.cu
#include <thrust/device_vector.h>
#include <thrust/remove.h>
#include <thrust/count.h>
#include <thrust/equal.h>        // thrust::equal (used in the sanity checks)
#include <thrust/transform.h>    // thrust::transform (used in the #else branch)
#include <thrust/functional.h>   // thrust::identity (used in the #else branch)
#include <thrust/iterator/zip_iterator.h>

#include <iostream>
#include <vector>

struct RemovePred {
    __host__ __device__ __forceinline__
    bool operator()(const thrust::tuple<int16_t, int16_t, uint32_t, uint32_t>& ele){
        return (thrust::get<0>(ele) == 0 && 
               thrust::get<1>(ele) == 0);
    } 
};

struct RemovePred2 {
    __host__ __device__ __forceinline__
    bool operator()(const thrust::tuple<int16_t, int16_t>& ele){
        return (thrust::get<0>(ele) == 0 && 
               thrust::get<1>(ele) == 0);
    } 
};

struct checkIfOdd {
    __host__ __device__ __forceinline__
    bool operator()(const uint32_t val){
        return (val%2 == 1);
    } 
};

bool testRemoveIf() {
    constexpr int N = 5000000;
    std::vector<int16_t> vec1(N);
    std::vector<uint32_t> vec2(N);
    for(int ii = 0; ii < N; ++ii) {
        vec1[ii] = (int16_t)(ii%2); //alternate 0s and 1s
        vec2[ii] = ii;
    }
    thrust::device_vector<int16_t> vecA(vec1);
    thrust::device_vector<int16_t> vecB(vec1);
    thrust::device_vector<uint32_t> vecC(vec2);
    thrust::device_vector<uint32_t> vecD(vec2);
   
#if 1
    //This produces wrong results intermittently.
    auto inputKeyItBegin = thrust::make_zip_iterator(vecA.begin(), vecB.begin(), vecC.begin(), vecD.begin());
    auto endIt = thrust::remove_if(inputKeyItBegin, inputKeyItBegin+N, RemovePred{});
    auto numEle = endIt - inputKeyItBegin;
#else
    //Materializing the boolean stencil and then calling remove_if seems to work fine.
    thrust::device_vector<bool> bStencil(N);
    auto inputKeyItBegin = thrust::make_zip_iterator(vecA.begin(), vecB.begin());
    thrust::transform(inputKeyItBegin, inputKeyItBegin+N, bStencil.begin(), RemovePred2{});
    auto inputKeyItBegin2 = thrust::make_zip_iterator(vecA.begin(), vecB.begin(), vecC.begin(), vecD.begin());
    auto endIt = thrust::remove_if(inputKeyItBegin2, inputKeyItBegin2+N, bStencil.begin(), thrust::identity<bool>());
    auto numEle = endIt - inputKeyItBegin2;
#endif
    vecA.resize(numEle);
    vecB.resize(numEle);
    vecC.resize(numEle);
    vecD.resize(numEle);

    //Sanity checks to make sure remove_if did the right thing.
    auto numEle2 = thrust::count_if(vecC.begin(), vecC.end(), checkIfOdd{});
    auto numEle3 = thrust::count_if(vecD.begin(), vecD.end(), checkIfOdd{});
    bool bEqual = thrust::equal(vecC.begin(), vecC.end(), vecD.begin());

    return (bEqual && numEle3 == N/2 && numEle2 == N/2);
}

int main() {
    for (int ii = 0; ii < 10000; ++ii) {
        bool bPass = testRemoveIf();
        if (!bPass) {
            std::cout << "Test failed!\n";
            return 1;
        } else {
            if (ii%1000 == 0)
                std::cout << "Test passed for attempt = " << ii << std::endl;
        }
    }

    return 0;
}

  2. Assume the above step produces the binary a.out. Now run that binary as 40 different processes on the same GPU under MPS using the bash script below. When I run this on an A100 with 80 GB memory, I typically see 1-2 processes fail the test. The test takes 6-7 minutes to run -- patience!
#!/bin/bash

export CUDA_VISIBLE_DEVICES=0
nvidia-cuda-mps-control -d # Start MPS server

# Define an array to hold the PIDs of the background processes
declare -a pids

mkdir -p logs

# Run multiple processes simultaneously under MPS to load the GPU
for ((i=1; i<=40; i++))
do
    ./a.out > logs/logit_${i}.log 2>&1 &
    pids+=($!)
done

# Initialize a variable to keep track of any failures
any_failures=0

# Wait for all background processes to finish and check their exit statuses
for pid in "${pids[@]}"; do
    wait $pid
    exit_status=$?
    if [ $exit_status -ne 0 ]; then
        echo "Process with PID $pid exited with a non-zero status: $exit_status"
        any_failures=1
    fi
done

# Check if any process failed
if [ $any_failures -eq 1 ]; then
    echo "At least one process exited with a non-zero status."
else
    echo "All processes exited successfully."
fi

echo quit | nvidia-cuda-mps-control # Shutdown MPS server.

Expected behavior

Ideally, all 40 processes should pass and the bash script should print "All processes exited successfully." That is what happens when the boolean stencil is materialized (see the commented-out #else block in the reproducer) or with the older CCCL v2.2.

Reproduction link

No response

Operating System

SLES 15 SP3 and Ubuntu 20.04.6

nvidia-smi output

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03     CUDA Version: 12.2    |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100 80GB PCIe          On  | 00000000:C1:00.0 Off |                    0 |
| N/A   35C    P0              44W / 300W |      4MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

NVCC version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Jun_13_19:16:58_PDT_2023
Cuda compilation tools, release 12.2, V12.2.91
Build cuda_12.2.r12.2/compiler.32965470_0

ssadasivam1 added the bug label on May 10, 2024
jrhemstad (Collaborator) commented:

Thanks for the detailed write-up @ssadasivam1!

Could you provide more information on how the results end up wrong? Is it removing things it shouldn't? Not removing things it should?

Is the filtered output actually wrong? Or is just the number of removed elements returned from the algorithm wrong?

elstehle self-assigned this on May 15, 2024
ssadasivam1 (Author) commented:

I went back and checked what exactly is wrong. It seems like the filtered output in vecD is wrong for a small fraction of the elements (roughly 30-40 elements out of 5 million). The number of elements removed is fine.
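
(For reference, a per-element comparison along these lines can produce that count; this is just a sketch reusing the variable names from the reproducer, and the Mismatch functor is hypothetical, not part of the original code.)

// Hypothetical helper, defined at file scope next to the other predicates:
// flags positions where the filtered vecC and vecD disagree.
struct Mismatch {
    __host__ __device__ __forceinline__
    bool operator()(const thrust::tuple<uint32_t, uint32_t>& t){
        return thrust::get<0>(t) != thrust::get<1>(t);
    }
};

// ...then inside testRemoveIf(), after remove_if and the resize calls:
auto zipBegin = thrust::make_zip_iterator(vecC.begin(), vecD.begin());
auto numWrong = thrust::count_if(zipBegin, zipBegin + vecC.size(), Mismatch{});
std::cout << "Mismatched elements: " << numWrong << " out of " << vecC.size() << std::endl;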

Also, if you switch from int16_t to int32_t in the example code, the problem seems to go away (at least in the few runs I have done so far).
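
(For concreteness, the change amounts to something like the following sketch; everything else in the reproducer stays the same, and this is illustrative rather than a separately tested program.)

// Hypothetical int32_t variant of the failing setup in testRemoveIf():
std::vector<int32_t> vec1(N);                // was std::vector<int16_t>
thrust::device_vector<int32_t> vecA(vec1);   // was thrust::device_vector<int16_t>
thrust::device_vector<int32_t> vecB(vec1);
// ...and RemovePred's argument becomes thrust::tuple<int32_t, int32_t, uint32_t, uint32_t>.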

ssadasivam1 (Author) commented:

Another interesting observation, in case it helps with debugging: the number of wrong elements seems to be 16 or 32 or 48 or 64 -- it seems to like multiples of 16 :)

elstehle (Collaborator) commented:

Thanks a lot for adding further details @ssadasivam1! That's very helpful.

I'm currently investigating the issue. Just to add my findings:

  • I wasn't able to repro on a V100 yet
  • I wasn't able to repro on an A100 when compiling with -gencode arch=compute_80,code=sm_80

By the way, just as a general remark: it is always advisable for programs that use CUB and Thrust to be compiled for the architectures they are targeting. CUB uses tuning policies to make sure that the algorithms are compiled with the right meta-parameters (e.g., number of threads per block, number of items per thread) for the hardware they are being compiled for.
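
For example, a compile line for the reproducer that targets an A100 (SM 80) explicitly could look roughly like this (same include paths as in the reproduction steps; adjust the paths and arch flags to your setup):

/usr/local/cuda/bin/nvcc -O3 -gencode arch=compute_80,code=sm_80 \
    -I /path/to/cccl/thrust/ -I /path/to/cccl/cub -I /path/to/cccl/libcudacxx/include/ test.cu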

ssadasivam1 (Author) commented:

Thanks, I will compile with -gencode and check. Also, I just cranked up the number of elements N to 5 million (it was 1 million earlier), which seems to increase the probability of failure. You can try the same if it seems hard to reproduce.

ssadasivam1 (Author) commented:

Update: I'm also unable to repro on the A100 when compiling with -gencode arch=compute_80,code=sm_80. It does repro without compiling for the specific architecture.

elstehle (Collaborator) commented:

@ssadasivam1, I suspect this may be due to a compiler issue in CUDA 12.2 and 12.3. I was not able to reproduce the issue using nvcc from CUDA 12.4.1 (with driver 550.54.14).

What made me suspicious was that, when using CTK 12.3, changing the tuning policies (i.e., making SM 80 use the tuning policy for SM 52 and vice versa) left the erroneous behaviour unaffected. Can you resolve the issue by compiling for the targeted GPU architectures or by upgrading to a more recent CTK?

ssadasivam1 (Author) commented:

@elstehle Our application, from which this simplified standalone reproducer was extracted, still fails with CUDA 12.4 and CUDA 12.5, so I do believe the issue still exists in our app, although the simplified reproducer seems OK with CUDA 12.4.

Status: In Progress · 3 participants