Deadlock issue in OpenBLAS with TBB #1336

Open

goplanid opened this issue Apr 1, 2024 · 2 comments

Comments

goplanid commented Apr 1, 2024

Brief Description: I am trying out the OpenBLAS PR [https://github.com/OpenMathLib/OpenBLAS/pull/4577] with TBB. I first register a callback in my code to dynamically change the threading backend: instead of creating its own threads, OpenBLAS passes its work to the registered callback. I use TBB to run gemm and also want to use TBB to execute the callback, which results in nested TBB parallelism.

Issue: I am hitting a deadlock in OpenBLAS (multiple threads get stuck in the inner_threads function). The deadlock appears to occur when OpenBLAS is invoked with fewer threads than the number of available threads.

Below is my test code and steps to reproduce it.

#include <iostream>
#include <cblas.h>
#include <vector>
#include <tbb/tbb.h>
#include <chrono>

const int MATRIX_DIMENSION = 1000; // Adjust as needed
bool delay_threading = true;

class MatrixMultiplicationTask {
private:
    const std::vector<double>& A;
    const std::vector<double>& B;
    std::vector<double>& C;

public:
    MatrixMultiplicationTask(const std::vector<double>& A,
                             const std::vector<double>& B,
                             std::vector<double>& C)
        : A(A), B(B), C(C) {}

    void operator()(const tbb::blocked_range<int>& range) const {
        for (int i = range.begin(); i != range.end(); ++i) {
            cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                        MATRIX_DIMENSION, MATRIX_DIMENSION, MATRIX_DIMENSION,
                        1.0, A.data(), MATRIX_DIMENSION, B.data(), MATRIX_DIMENSION,
                        0.0, &C[i * MATRIX_DIMENSION], MATRIX_DIMENSION);
        }
    }
};

class InnerLoopTask {
private:
    openblas_dojob_callback dojob;
    int numjobs;
    size_t jobdata_elsize;
    void* jobdata;
    int dojob_data;

public:
    InnerLoopTask(openblas_dojob_callback dojob, int numjobs, size_t jobdata_elsize, void* jobdata, int dojob_data)
        : dojob(dojob), numjobs(numjobs), jobdata_elsize(jobdata_elsize), jobdata(jobdata), dojob_data(dojob_data) {}

    void operator()(const tbb::blocked_range<int>& range) const {
        for (int i = range.begin(); i != range.end(); ++i) {
            void* element_addr = (void*)(((char*)jobdata) + ((unsigned)i) * jobdata_elsize);
            dojob(i, element_addr, dojob_data);
        }
    }
};

class MyObserver : public tbb::task_scheduler_observer {
public:
    MyObserver() {
        observe(true);
    }

    ~MyObserver() {
        observe(false);
    }

    void on_scheduler_entry(bool is_worker) override {
        std::cout << "Task scheduler entry" << std::endl;
    }

    void on_scheduler_exit(bool is_worker) override {
        std::cout << "Task scheduler exit" << std::endl;
    }
};

void myfunction_ (int sync, openblas_dojob_callback dojob, int numjobs, size_t jobdata_elsize, void *jobdata, int dojob_data)
{
    //MyObserver observer;
    //observer.observe(true);
    InnerLoopTask innerLoopTask(dojob, numjobs, jobdata_elsize, jobdata, dojob_data);
    //tbb::global_control gc(tbb::global_control::max_allowed_parallelism, 32);
    tbb::parallel_for(tbb::blocked_range<int>(0, numjobs), innerLoopTask);
}


int main() {
    // Dynamically create matrices using std::vector for easier management
    std::vector<double> A(MATRIX_DIMENSION * MATRIX_DIMENSION, 8.0);
    std::vector<double> B(MATRIX_DIMENSION * MATRIX_DIMENSION, 5.0);
    std::vector<double> C((MATRIX_DIMENSION + 1) * MATRIX_DIMENSION, 0.5); // one extra row so the i = 1 offset write in MatrixMultiplicationTask stays in bounds

    if (delay_threading)
        openblas_set_threads_callback_function(myfunction_);

    auto start = std::chrono::high_resolution_clock::now();

    tbb::parallel_for(tbb::blocked_range<int>(0, 2), MatrixMultiplicationTask(A,B,C));

    auto stop = std::chrono::high_resolution_clock::now();

    // Output a portion of the result (printing the entire matrix would be too much)
    for (int i = 0; i < 10; ++i) {
        for (int j = 0; j < 10; ++j) {
            std::cout << C[i * MATRIX_DIMENSION + j] << "\t";
        }
        std::cout << std::endl;
    }

    // Compute the duration
    auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(stop - start);
    std::cout << "Time taken by function: " << duration.count() << " milliseconds\n";

    return 0;
}

Build command: g++ -std=c++11 -o tbb_nested tbb_nested.cpp -ltbb -lpthread -I/home/openblas/include -L/home/openblas/lib -lopenblas -Wl,-rpath,/home/openblas/lib

Help needed: As you can see, I have the following case of nested parallelism:
outer loop: tbb::parallel_for(tbb::blocked_range<int>(0, 2), MatrixMultiplicationTask(A,B,C));
inner loop: tbb::parallel_for(tbb::blocked_range<int>(0, numjobs), innerLoopTask);

In the above code, the outer loop runs 2 iterations, and each outer iteration runs numjobs inner-loop iterations. I have a dependency in my code such that InnerLoopTask can only make progress when exactly numjobs threads are used. What is the best nested-parallelism solution TBB offers for this problem? Kindly advise.


goplanid commented Apr 3, 2024

@anton-malakhov


dnmokhov commented Apr 5, 2024

Hi @goplanid,

To guarantee parallelism in the inner loop, you could use TBB in the outer loop only. In the inner loop, you could launch numjobs threads (e.g., with std::thread) in myfunction_, with each thread performing an InnerLoopTask.
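
A minimal sketch of that approach (an illustration only, reusing the callback signature from the test code above) could look like this:

#include <thread>
#include <vector>

// Run the inner work on exactly numjobs dedicated threads instead of the TBB
// pool, so the inner loop never waits on TBB workers that are busy elsewhere.
void myfunction_(int sync, openblas_dojob_callback dojob, int numjobs,
                 size_t jobdata_elsize, void* jobdata, int dojob_data)
{
    std::vector<std::thread> workers;
    workers.reserve(numjobs);
    for (int i = 0; i < numjobs; ++i) {
        void* element_addr = (void*)(((char*)jobdata) + ((unsigned)i) * jobdata_elsize);
        workers.emplace_back(dojob, i, element_addr, dojob_data);
    }
    for (auto& t : workers)
        t.join();
}

Creating threads on every call adds overhead; a pool of numjobs long-lived threads would amortize that, but the sketch keeps the structure minimal.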

You can prevent oversubscription by throttling down the oneTBB concurrency (e.g., to hardware_concurrency / numjobs).
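
As a sketch of the throttling idea, main could cap the TBB pool before running the outer loop; numjobs here is an assumed, application-specific value:

#include <algorithm>
#include <thread>
#include <tbb/global_control.h>

int main() {
    const unsigned numjobs = 4; // assumed number of dedicated inner-loop threads
    const unsigned hw = std::thread::hardware_concurrency();
    // Limit TBB workers so outer-loop tasks plus the numjobs inner threads
    // per task do not oversubscribe the machine.
    tbb::global_control gc(tbb::global_control::max_allowed_parallelism,
                           std::max(1u, hw / numjobs));
    // ... register the OpenBLAS callback and run the outer tbb::parallel_for
    //     exactly as in the original test code ...
    return 0;
}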
