
[BUG]: Intermittent wrong output from thrust::remove_if under heavy GPU loading #1730

Open
ssadasivam1 opened this issue May 10, 2024 · 8 comments · May be fixed by #1782
Labels: bug (Something isn't working right.)

ssadasivam1 commented May 10, 2024

Type of Bug

Silent Failure

Component

Thrust

Describe the bug

We found some intermittent wrong results from thrust::remove_if under heavy GPU loading (multiple processes running simultaneously on the same GPU under MPS). Specifically, the failure seems to happen when using this variant of remove_if:
thrust::remove_if(firstIt, lastIt, removePred{}).

The problem seems to go away if the removal predicate is first materialized into a stencil array and remove_if is then called with the stencil:

thrust::transform(first, last, stencil, removePred{}); //first materialize the removal criteria into a boolean stencil array
thrust::remove_if(first, last, stencil, thrust::identity<bool>());

I have seen reports of performance improvements when the stencil array is materialized, but in this case we see intermittent wrong output. The issue is not present in the older CCCL v2.2; it seems to have been introduced in v2.3 and is also reproducible with the current main branch of CCCL. I'm attaching a standalone reproducer, but note that many instances of it need to be run simultaneously to simulate heavy GPU loading. We suspect some sort of race condition that only manifests under heavy loading.

How to Reproduce

  1. Compile this standalone program that tests remove_if 10000 times.
    /usr/local/cuda/bin/nvcc -O3 -I /path/to/cccl/thrust/ -I /path/to/cccl/cub -I /path/to/cccl/libcudacxx/include/ test.cu
#include <thrust/device_vector.h>
#include <thrust/remove.h>
#include <thrust/count.h>
#include <thrust/equal.h>        // thrust::equal (used in the sanity checks)
#include <thrust/transform.h>    // thrust::transform (used in the #else branch)
#include <thrust/functional.h>   // thrust::identity (used in the #else branch)
#include <thrust/iterator/zip_iterator.h>

#include <iostream>
#include <vector>

struct RemovePred {
    __host__ __device__ __forceinline__
    bool operator()(const thrust::tuple<int16_t, int16_t, uint32_t, uint32_t>& ele){
        return (thrust::get<0>(ele) == 0 && 
               thrust::get<1>(ele) == 0);
    } 
};

struct RemovePred2 {
    __host__ __device__ __forceinline__
    bool operator()(const thrust::tuple<int16_t, int16_t>& ele){
        return (thrust::get<0>(ele) == 0 && 
               thrust::get<1>(ele) == 0);
    } 
};

struct checkIfOdd {
    __host__ __device__ __forceinline__
    bool operator()(const uint32_t val){
        return (val%2 == 1);
    } 
};

bool testRemoveIf() {
    constexpr int N = 5000000;
    std::vector<int16_t> vec1(N);
    std::vector<uint32_t> vec2(N);
    for(int ii = 0; ii < N; ++ii) {
        vec1[ii] = (int16_t)(ii%2); //alternate 0s and 1s
        vec2[ii] = ii;
    }
    thrust::device_vector<int16_t> vecA(vec1);
    thrust::device_vector<int16_t> vecB(vec1);
    thrust::device_vector<uint32_t> vecC(vec2);
    thrust::device_vector<uint32_t> vecD(vec2);
   
#if 1
    //This produces wrong results intermittently.
    auto inputKeyItBegin = thrust::make_zip_iterator(vecA.begin(), vecB.begin(), vecC.begin(), vecD.begin());
    auto endIt = thrust::remove_if(inputKeyItBegin, inputKeyItBegin+N, RemovePred{});
    auto numEle = endIt - inputKeyItBegin;
#else
    //Materializing the boolean stencil and then calling remove_if seems to work fine.
    thrust::device_vector<bool> bStencil(N);
    auto inputKeyItBegin = thrust::make_zip_iterator(vecA.begin(), vecB.begin());
    thrust::transform(inputKeyItBegin, inputKeyItBegin+N, bStencil.begin(), RemovePred2{});
    auto inputKeyItBegin2 = thrust::make_zip_iterator(vecA.begin(), vecB.begin(), vecC.begin(), vecD.begin());
    auto endIt = thrust::remove_if(inputKeyItBegin2, inputKeyItBegin2+N, bStencil.begin(), thrust::identity<bool>());
    auto numEle = endIt - inputKeyItBegin2;
#endif
    vecA.resize(numEle);
    vecB.resize(numEle);
    vecC.resize(numEle);
    vecD.resize(numEle);

    //Sanity checks to make sure remove_if did the right thing.
    auto numEle2 = thrust::count_if(vecC.begin(), vecC.end(), checkIfOdd{});
    auto numEle3 = thrust::count_if(vecD.begin(), vecD.end(), checkIfOdd{});
    bool bEqual = thrust::equal(vecC.begin(), vecC.end(), vecD.begin());

    return (bEqual && numEle3 == N/2 && numEle2 == N/2);
}

int main() {
    for (int ii = 0; ii < 10000; ++ii) {
        bool bPass = testRemoveIf();
        if (!bPass) {
            std::cout << "Test failed!\n";
            return 1;
        } else {
            if (ii%1000 == 0)
                std::cout << "Test passed for attempt = " << ii << std::endl;
        }
    }

    return 0;
}

  2. Assume the above step produces the binary a.out. Now run that binary as 40 different processes on the same GPU under MPS using the bash script below. When I run this on an A100 with 80 GB memory, I typically see 1-2 processes fail the test. The test takes 6-7 minutes to run -- patience!
#!/bin/bash

export CUDA_VISIBLE_DEVICES=0
nvidia-cuda-mps-control -d # Start MPS server

# Define an array to hold the PIDs of the background processes
declare -a pids

mkdir -p logs

# Run multiple processes simultaneously under MPS to load the GPU
for ((i=1; i<=40; i++))
do
    ./a.out > logs/logit_${i}.log 2>&1 &
    pids+=($!)
done

# Initialize a variable to keep track of any failures
any_failures=0

# Wait for all background processes to finish and check their exit statuses
for pid in "${pids[@]}"; do
    wait $pid
    exit_status=$?
    if [ $exit_status -ne 0 ]; then
        echo "Process with PID $pid exited with a non-zero status: $exit_status"
        any_failures=1
    fi
done

# Check if any process failed
if [ $any_failures -eq 1 ]; then
    echo "At least one process exited with a non-zero status."
else
    echo "All processes exited successfully."
fi

echo quit | nvidia-cuda-mps-control # Shutdown MPS server.

Expected behavior

Ideally, all 40 processes should pass and the bash script should print "All processes exited successfully." That is what happens when the boolean stencil is materialized (see the commented-out #else block in the reproducer) or with the older CCCL v2.2.

Reproduction link

No response

Operating System

SLES 15 SP3 and Ubuntu 20.04.6

nvidia-smi output

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03     CUDA Version: 12.2    |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100 80GB PCIe          On  | 00000000:C1:00.0 Off |                    0 |
| N/A   35C    P0              44W / 300W |      4MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

NVCC version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Jun_13_19:16:58_PDT_2023
Cuda compilation tools, release 12.2, V12.2.91
Build cuda_12.2.r12.2/compiler.32965470_0

ssadasivam1 added the bug label on May 10, 2024
jrhemstad (Collaborator) commented:

Thanks for the detailed write-up @ssadasivam1!

Could you provide more information on how the results end up wrong? Is it removing things it shouldn't? Not removing things it should?

Is the filtered output actually wrong? Or is just the number of removed elements returned from the algorithm wrong?

elstehle self-assigned this on May 15, 2024
ssadasivam1 (Author) commented:

I went back and checked what exactly is wrong. It seems like the filtered output in vecD is wrong for a small fraction of the elements (roughly 30-40 elements out of 5 million). The number of elements removed is fine.
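
(For reference, a per-element comparison along these lines can produce that count; this is just a sketch reusing the variable names from the reproducer, and the Mismatch functor is hypothetical, not part of the original code.)

// Hypothetical helper, defined at file scope next to the other predicates:
// flags positions where the filtered vecC and vecD disagree.
struct Mismatch {
    __host__ __device__ __forceinline__
    bool operator()(const thrust::tuple<uint32_t, uint32_t>& t){
        return thrust::get<0>(t) != thrust::get<1>(t);
    }
};

// ...then inside testRemoveIf(), after remove_if and the resize calls:
auto zipBegin = thrust::make_zip_iterator(vecC.begin(), vecD.begin());
auto numWrong = thrust::count_if(zipBegin, zipBegin + vecC.size(), Mismatch{});
std::cout << "Mismatched elements: " << numWrong << " out of " << vecC.size() << std::endl;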

Also, if you switch from int16_t to int32_t in the example code, the problem seems to go away (at least in the few runs I have done so far).
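
(For concreteness, the change amounts to something like the following sketch; everything else in the reproducer stays the same, and this is illustrative rather than a separately tested program.)

// Hypothetical int32_t variant of the failing setup in testRemoveIf():
std::vector<int32_t> vec1(N);                // was std::vector<int16_t>
thrust::device_vector<int32_t> vecA(vec1);   // was thrust::device_vector<int16_t>
thrust::device_vector<int32_t> vecB(vec1);
// ...and RemovePred's argument becomes thrust::tuple<int32_t, int32_t, uint32_t, uint32_t>.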

ssadasivam1 (Author) commented:

Another interesting observation, in case it helps with debugging: the number of wrong elements seems to be 16 or 32 or 48 or 64 -- it seems to like multiples of 16 :)

elstehle (Collaborator) commented:

Thanks a lot for adding further details @ssadasivam1! That's very helpful.

I'm currently investigating the issue. Just to add my findings:

  • I wasn't able to repro on a V100 yet
  • I wasn't able to repro on an A100 when compiling with -gencode arch=compute_80,code=sm_80

By the way, just as a general remark: it is always advisable for programs that use CUB and Thrust to be compiled for the architectures they are targeting. CUB uses tuning policies to make sure that the algorithms are compiled with the right meta-parameters (e.g., number of threads per block, number of items per thread) for the hardware they are being compiled for.
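
For example, a compile line for the reproducer that targets an A100 (SM 80) explicitly could look roughly like this (same include paths as in the reproduction steps; adjust the paths and arch flags to your setup):

/usr/local/cuda/bin/nvcc -O3 -gencode arch=compute_80,code=sm_80 \
    -I /path/to/cccl/thrust/ -I /path/to/cccl/cub -I /path/to/cccl/libcudacxx/include/ test.cu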

ssadasivam1 (Author) commented:

Thanks, I will compile with -gencode and check. Also, I just cranked up the number of elements N to 5 million (it was 1 million earlier), which seems to increase the probability of failure. You can try the same if it seems hard to reproduce.

ssadasivam1 (Author) commented:

Update: I'm also unable to repro on the A100 when compiling with -gencode arch=compute_80,code=sm_80. It does repro without compiling for the specific architecture.

elstehle (Collaborator) commented:

@ssadasivam1, I suspect this may be due to a compiler issue in CUDA 12.2 and 12.3. I was not able to reproduce the issue using nvcc from CUDA 12.4.1 (with driver 550.54.14).

What made me suspicious was that, when using CTK 12.3, changing the tuning policies (i.e., making SM 80 use the tuning policy for SM 52 and vice versa) left the erroneous behaviour unaffected. Can you resolve the issue by compiling for the targeted GPU architectures or by upgrading to a more recent CTK?

ssadasivam1 (Author) commented:

@elstehle Our application, from which this simplified standalone reproducer was extracted, still fails with CUDA 12.4 and CUDA 12.5, so I do believe the issue still exists in our app, although the simplified reproducer seems OK with CUDA 12.4.

Status: In Progress · 3 participants