Facing Deadlock issue with nested TBB #1316

Open
goplanid opened this issue Feb 26, 2024 · 8 comments

@goplanid

Hi,

I have the following case of nested parallelism:

Level 1 or outer loop: tbb::parallel_for(tbb::blocked_range(0, 2),
outerLoopTask(A,B,C));
Level 2 or inner loop: tbb::parallel_for(tbb::blocked_range(0, numjobs), innerLoopTask);

What I want to do: I want to run the above code with the best possible nested solution provided by TBB. In the above code, Level 1 runs for 2 iterations, and each iteration of Level 1 runs numjobs iterations of the inner loop. I have a dependency in my code such that innerLoopTask can only operate when exactly numjobs threads are used.

Steps tried:
To solve this problem I looked into the work isolation page of the documentation: https://oneapi-src.github.io/oneTBB/main/tbb_userguide/work_isolation.html

  1. I tried to create a separate task arena for each inner loop (Level 2) using the code below, but it didn't help as I continue to see the deadlock issue:
    oneapi::tbb::task_arena nested;
    nested.execute( [innerLoopTask,numjobs]{
    tbb::parallel_for(tbb::blocked_range(0, numjobs), innerLoopTask);
    });

  2. I also tried the isolate function using the below code, but I still see the same issue:
    oneapi::tbb::this_task_arena::isolate([numjobs, innerLoopTask]{
    tbb::parallel_for(tbb::blocked_range(0, numjobs), innerLoopTask);
    });

Help needed:

  1. What is the right way to solve this problem? If it is work isolation, can I understand why it doesn't work in this scenario?
  2. Is there a way we can set the number of threads for each task arena?

Any pointers will be of great help.

@dnmokhov dnmokhov self-assigned this Feb 28, 2024
@pavelkumbrasev
Contributor

Hi @goplanid,
I still have several questions regarding your use case, and a small reproducer would be helpful.

This is how I understood what you wanted to implement:

  1. Create some (not very large) number of tasks at the top level.
  2. These top-level tasks do some computation and run a nested parallel loop.
  3. The nested parallel loop should be statically divided among a particular number of threads.
  4. Execution of the nested tasks can start only when all the required threads have arrived.

In oneTBB we don't guarantee parallelism for a parallel region, so this example might hang:

int num_threads_available = std::thread::hardware_concurrency();
tbb::task_arena arena(/* concurrency = */ num_threads_available);

arena.execute([&] {
     tbb::parallel_for(tbb::blocked_range<int>(0, numjobs), [&] (tbb::blocked_range<int> r) {
         wait_threads();
     },
     /* Divides work statically among num_threads_available threads */ tbb::static_partitioner{});
});

It will work fine in most cases, but it might hang if, for example, you have several loops that wait inside and each wait expects hardware_concurrency threads; then the total number of threads in the thread pool would need to be hardware_concurrency * num_loops, while by default TBB creates only hardware_concurrency - 1 worker threads (you can control this with global_control).
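
For illustration, a minimal sketch of raising that limit with global_control (num_loops is a hypothetical placeholder for the number of loops that each need a full set of threads):

#include <oneapi/tbb.h>
#include <thread>

int main() {
    const unsigned num_loops = 2;  // hypothetical: concurrent loops that each wait for a full set of threads
    const unsigned hw = std::thread::hardware_concurrency();

    // Allow enough workers for every loop to get hardware_concurrency threads.
    // The raised limit applies only while gc is alive.
    tbb::global_control gc(tbb::global_control::max_allowed_parallelism, hw * num_loops);

    // ... run the nested parallel loops here ...
    return 0;
}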

Should the top-level tasks be processed in parallel (1), or can they be done in stages (2)?

(1)
You can try to use parallel_invoke or task_group to create the appropriate number of tasks.
Each task then creates a task_arena with concurrency total_concurrency / number_of_top_level_tasks, which makes it possible to run several static loops with a barrier; a rough sketch is below.
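
For illustration, a rough sketch of option (1), assuming 2 top-level chunks; run_chunk is a hypothetical stand-in for outerLoopTask plus its inner loop:

#include <oneapi/tbb.h>
#include <thread>

// Hypothetical per-chunk work: run the inner loop in an arena limited to threads_per_chunk.
void run_chunk(int chunk_id, int threads_per_chunk) {
    // chunk_id would select which block of data to process (unused in this sketch)
    tbb::task_arena arena(threads_per_chunk);
    arena.execute([&] {
        tbb::parallel_for(tbb::blocked_range<int>(0, threads_per_chunk),
            [&](const tbb::blocked_range<int>& r) {
                // inner work that expects threads_per_chunk threads to participate
            },
            tbb::static_partitioner{});
    });
}

int main() {
    const int num_top_level_tasks = 2;
    const int threads_per_chunk =
        static_cast<int>(std::thread::hardware_concurrency()) / num_top_level_tasks;

    tbb::task_group tg;
    for (int i = 0; i < num_top_level_tasks; ++i)
        tg.run([=] { run_chunk(i, threads_per_chunk); });
    tg.wait();
    return 0;
}

Note that, as discussed above, oneTBB still does not guarantee that threads_per_chunk workers actually join each arena, so a barrier inside the inner body remains risky.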

(2)
You can try to use parallel_pipeline, where the nested task becomes a parallel stage with the required level of parallelism.
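
For illustration, a rough sketch of option (2); process_chunk and the chunk counts are hypothetical placeholders, and the nested parallel work would live inside the parallel stage:

#include <oneapi/tbb.h>

// Hypothetical per-chunk work; a nested parallel_for could run inside here.
void process_chunk(int /*chunk*/) {}

int main() {
    const int num_chunks = 2;   // hypothetical number of top-level chunks
    const int num_tokens = 2;   // limits how many chunks are in flight at once
    int next = 0;

    tbb::parallel_pipeline(num_tokens,
        // Serial input stage: emit one chunk index at a time.
        tbb::make_filter<void, int>(tbb::filter_mode::serial_in_order,
            [&](tbb::flow_control& fc) -> int {
                if (next >= num_chunks) { fc.stop(); return 0; }
                return next++;
            })
        &
        // Parallel stage: process each chunk.
        tbb::make_filter<int, void>(tbb::filter_mode::parallel,
            [](int chunk) { process_chunk(chunk); }));
    return 0;
}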

@goplanid
Author

Hi @pavelkumbrasev, Thank you for providing your inputs. Your understanding is right.

Let me give more details with the below example:

I am dividing my dataset into multiple blocks/chunks, where each block is computed in parallel (using TBB). Each block calls a third-party library function that needs numjobs threads to do its work. The library function has several loops that wait inside, and each wait expects hardware_concurrency threads.

Here are my further experiments:

  1. The example you provided above with task_arena.execute() hangs in my case, as you rightly pointed out, since hardware_concurrency threads are expected.

  2. I tried to set the total number of threads using tbb::global_control like below for an 8-core machine (assuming I have 8 chunks of data):
    tbb::global_control gc(tbb::global_control::max_allowed_parallelism, num_jobs*8);

There are issues with this path:

  • Firstly, I don't see that many TBB threads getting created on the call stack. I see exactly hardware_concurrency threads on the call stack (not hardware_concurrency - 1 either).
  • Second, this might lead to oversubscription, as the number of chunks can be very large for a large dataset.
  3. I am assuming that, based on my use case, you are suggesting (1) parallel_invoke or task_group.
  • I am not sure how to use parallel_invoke, i.e., how do I create a separate function for each chunk (looking at the parallel_invoke syntax), given that this is a nested parallelism scenario?
  • I tried task_group like below in the nested loop but still face the deadlock issue.
tbb::task_group tg;
tg.run([&]() {
    tbb::parallel_for(tbb::blocked_range<int>(0, numjobs), innerLoopTask);
});
tg.wait();

Your guidance will be highly appreciated. Thanks.

@pavelkumbrasev
Contributor

Hi @goplanid,
If each chunk requires hardware_concurrency threads to process and you don't want to bog down the system with oversubscription, perhaps the best performance can be achieved with a serial outer loop and a nested parallel_for with static_partitioner.
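
For illustration, a minimal sketch of that suggestion; num_chunks, numjobs, and the commented innerLoopTask call are placeholders:

#include <oneapi/tbb.h>

int main() {
    const int num_chunks = 2;   // the former outer loop now runs serially
    const int numjobs    = 8;   // hypothetical inner parallelism requirement

    for (int chunk = 0; chunk < num_chunks; ++chunk) {
        // All worker threads are available to this single inner loop.
        tbb::parallel_for(tbb::blocked_range<int>(0, numjobs),
            [&](const tbb::blocked_range<int>& r) {
                // innerLoopTask(r);  // per-range work that contains the internal barrier
            },
            tbb::static_partitioner{});
    }
    return 0;
}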

@dnmokhov
Contributor

Hi @goplanid,

To guarantee parallelism in the inner loop, you could try launching numjobs threads (e.g., with std::thread) in each outerLoopTask, with each thread performing an innerLoopTask.

You can prevent oversubscription by throttling down the oneTBB concurrency (e.g., to hardware_concurrency / numjobs).
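
For illustration, a rough sketch of this approach; numjobs and the inner work are hypothetical placeholders:

#include <oneapi/tbb.h>
#include <algorithm>
#include <thread>
#include <vector>

int main() {
    const unsigned numjobs = 8;   // hypothetical: threads required by the inner work
    const unsigned hw = std::thread::hardware_concurrency();

    // Throttle oneTBB so the extra std::threads do not oversubscribe the machine.
    tbb::global_control gc(tbb::global_control::max_allowed_parallelism,
                           std::max(1u, hw / numjobs));

    // Inside each outerLoopTask: launch exactly numjobs OS threads for the inner work.
    std::vector<std::thread> workers;
    for (unsigned i = 0; i < numjobs; ++i)
        workers.emplace_back([i] {
            // innerLoopTask for index i would run here (hypothetical)
            (void)i;
        });
    for (auto& t : workers) t.join();
    return 0;
}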

@goplanid
Author

goplanid commented Mar 4, 2024

Hi @dnmokhov

  1. I am launching numjobs threads in each outerLoopTask using tbb::parallel_for(tbb::blocked_range(0, numjobs), innerLoopTask); Is this correct? Is there any reason you mentioned using std::thread above?

Hi @pavelkumbrasev @dnmokhov
2. I have tried the isolate/task_arena execute functions so that each innerLoopTask can run in a separate task arena [as I have a dependency on hardware_concurrency threads] but still see the deadlock issue. Kindly help me understand the below points:

2.a How can I get more detailed logs with oneTBB, e.g., how many threads are actually being used in each innerLoopTask? I suspect the number of threads could be an issue.
2.b Will P-1 or P outer-level threads be used with each nested task arena?
2.c What would be the overall number of threads used by TBB in this case, i.e., with the creation of nested task arenas?

@dnmokhov
Contributor

dnmokhov commented Mar 5, 2024

Hi @goplanid,

  1. I am launching numjobs threads in each outerLoopTask using tbb::parallel_for(tbb::blocked_range(0, numjobs), innerLoopTask); Is this correct? Is there any reason you mentioned using std::thread above?

OneTBB parallel algorithms (e.g., parallel_for) use available worker threads and do not launch new threads, so "there is no guarantee that potentially parallel tasks actually execute in parallel, because the scheduler adjusts actual parallelism to fit available worker threads" (https://spec.oneapi.io/versions/latest/elements/oneTBB/source/task_scheduler.html).

2.a How can I get more detailed logs with oneTBB, e.g., how many threads are actually being used in each innerLoopTask? I suspect the number of threads could be an issue.

You can call current_thread_index() in each task to log the thread it is using.
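
For illustration, a minimal sketch of such logging (numjobs is a placeholder):

#include <oneapi/tbb.h>
#include <cstdio>

int main() {
    const int numjobs = 8;   // hypothetical
    tbb::parallel_for(tbb::blocked_range<int>(0, numjobs),
        [](const tbb::blocked_range<int>& r) {
            // Print which worker thread executes each sub-range.
            std::printf("range [%d, %d) on thread %d\n",
                        r.begin(), r.end(),
                        tbb::this_task_arena::current_thread_index());
        });
    return 0;
}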

2.b Will P-1 or P outer-level threads be used with each nested task arena?

As mentioned above, there is no specific parallelism guarantee. The executed tasks are distributed among the available threads. When a thread completes a task, it will run the next available task, so some of the tasks can end up being run serially.

2.c What would be the overall number of threads used by TBB in this case, i.e., with the creation of nested task arenas?

By default, hardware_concurrency threads are used. You can query this value with default_concurrency() and change it with global_control.
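
For illustration, a small sketch of querying the default and changing the limit:

#include <oneapi/tbb.h>
#include <oneapi/tbb/info.h>
#include <iostream>

int main() {
    // Number of threads TBB uses by default on this machine.
    std::cout << "default concurrency: " << tbb::info::default_concurrency() << "\n";

    // Raise (or lower) the limit; it applies only while gc is alive.
    tbb::global_control gc(tbb::global_control::max_allowed_parallelism, 64);
    std::cout << "active limit: "
              << tbb::global_control::active_value(tbb::global_control::max_allowed_parallelism)
              << "\n";
    return 0;
}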

@goplanid
Author

goplanid commented Mar 11, 2024

Hi @dnmokhov, Thank you for your inputs. Sorry for the late reply, I was on leave.

  1. I debugged further using the above pointers and see that my inner loop is getting called using 31 threads when there are 2 outer loop threads. One of the outer loop threads is busy waiting and is not available for use in the inner loop. I want the inner loop to be called using 32 threads (basically all threads on the machine). Is there a mechanism in oneTBB for the outer threads to yield so that they are available for the inner loop execution?

  2. I also tried changing the number of threads using global_control in both the outer and inner loops, but it didn't help. Placing the code here:

outer loop:
tbb::global_control gc(tbb::global_control::max_allowed_parallelism, 32);
tbb::parallel_for(tbb::blocked_range(0, 2), MatrixMultiplicationTask(A,B,C));

inner loop:
oneapi::tbb::task_arena nested;
tbb::global_control gc(tbb::global_control::max_allowed_parallelism, 32);
nested.execute( [innerLoopTask,numjobs]{
tbb::parallel_for(tbb::blocked_range(0, numjobs), innerLoopTask);
});

  3. I tried changing the number of threads in explicit task arenas using the code below, but it didn't help, and I see the below warning.
    oneapi::tbb::task_arena nested{32,0}; //To ensure 0 threads are reserved for the master
    nested.execute( [innerLoopTask,numjobs]{
    tbb::parallel_for(tbb::blocked_range(0, numjobs), innerLoopTask);
    });

TBB Warning: The number of workers is currently limited to 31. The request for 32 workers is ignored. Further requests for more workers will be silently ignored until the limit changes.

Kindly correct me if I am wrong anywhere and advise. Your inputs are really appreciated.

@dnmokhov
Contributor

Hi @goplanid,

The executed tasks are distributed among the available threads, so each of your 2 inner loops will be called using anywhere from 1 to 32 threads.

I want the inner loop to be called using 32 threads (basically all threads on the machine)

To avoid bogging down the system with oversubscription, perhaps the best performance can be achieved with a serial outer loop and a nested parallel_for with static_partitioner, as suggested here: #1316 (comment).
