tbb::task_group thread scaling #313

Open
Dr15Jones opened this issue Dec 9, 2020 · 4 comments · May be fixed by #1310

@Dr15Jones

As part of transitioning from the deprecated tbb::task API to tbb::task_group, I have been doing performance measurements on our applications. I have found that using a single tbb::task_group gives highly diminished thread scaling. To illustrate the problem, I created four highly simplified versions of the main processing loop of our applications. The code for these simple applications can be found here: https://github.com/Dr15Jones/tbb_group_scaling. Each application does the same processing but uses TBB in a different way. The differences are:

  • using tbb::tasks directly, all created using allocate_root (this is how our application typically works)
  • using 1 tbb::task_group to launch all the needed work
  • using N tbb::task_groups, with one task_group per requested thread
  • using tbb::tasks directly but with allocate_additional_child_of (created after studying the performance of the other three cases).

When testing on either an Intel or AMD CPU, the single tbb::task_group was found to either not scale as the number of threads increased or to have extremely weak scaling compared to the other options. The tbb::task version using allocate_additional_child_of had the best performance, followed closely by the N tbb::task_groups case.

My question is: are there plans to improve the performance when using a single tbb::task_group? If not, is the use of multiple tbb::task_groups working together to share the load of creating tasks a supported use case? Alternatively, could a new API for creating a performant hierarchy of task_groups be developed in order to avoid doing a 'spin' loop over the task_group::wait calls?

@Dr15Jones
Author

To give some context, here is a plot of the throughput (effectively groups of actions per second) on my Intel-based laptop with a 4-core Linux VM.

[plot: laptop]

Here is a plot of the throughput for a 32-core AMD machine.

[plot: amd]

@alexey-katranov
Contributor

alexey-katranov commented Dec 16, 2020

I'd slightly refactor the approach with N task_groups to make it similar to the child_task version.
Replace https://github.com/Dr15Jones/tbb_group_scaling/blob/master/with_multiple_groups.cc#L46-L61 with:

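    // Outer group: the only group the main thread waits on.
    tbb::task_group group;
    auto start = std::chrono::high_resolution_clock::now();
    for (unsigned int i = 0; i < nLanes; ++i) {
        group.run(
            [&nEventsProcessed, nEvents, nChains]() {
                // Each lane owns a nested group and blocks until its own work is done.
                tbb::task_group lane_group;
                lane_group.run_and_wait([&nEventsProcessed, nEvents, nChains, &lane_group]() {
                    workInLane(nEventsProcessed, nChains, nEvents, lane_group, 0);
                });
            });
    }

    // Returns once every lane task, and therefore every nested lane_group, has finished.
    group.wait();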

Also you do not need iNGroupsDone any more. Just remove https://github.com/Dr15Jones/tbb_group_scaling/blob/master/with_multiple_groups.cc#L14

This uses a single task_group to wait. It is not as elegant as the child_task version because the task_groups are nested, but it should scale well.
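
For readers without the linked repository at hand, the two shapes being compared (one shared group versus a group per lane nested under a single waiting group) can be sketched roughly as below. This is only an illustration under stated assumptions, not the code from the repository: `ThreadGroup` is a hypothetical stand-in built on `std::thread` so the sketch compiles without TBB, and `Counters`, `processNext`, `runSingleGroup`, `runGroupPerLane` and `eventsPerSecond` are made-up names. The functions are templated on the group type and only call `run` and `wait`, which `tbb::task_group` also provides, so it should be possible to pass `tbb::task_group` instead. The stand-in says nothing about TBB's actual scaling behaviour.

```cpp
#include <atomic>
#include <chrono>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in offering run()/wait() like tbb::task_group.
// It starts one std::thread per task, so it only shows the structure.
class ThreadGroup {
public:
    template <class F>
    void run(F f) {
        std::lock_guard<std::mutex> lock(mutex_);
        threads_.emplace_back(std::move(f));
    }
    // Joins tasks until none are left, including tasks added while waiting.
    void wait() {
        for (;;) {
            std::thread t;
            {
                std::lock_guard<std::mutex> lock(mutex_);
                if (threads_.empty()) return;
                t = std::move(threads_.back());
                threads_.pop_back();
            }
            t.join();
        }
    }
    ~ThreadGroup() { wait(); }

private:
    std::mutex mutex_;
    std::vector<std::thread> threads_;
};

struct Counters {
    std::atomic<unsigned> next{0};  // next event index to claim
    std::atomic<unsigned> done{0};  // events actually processed
};

// One link of a lane's chain (assumed shape): claim an event, count it,
// then schedule the next link on the same group.
template <class Group>
void processNext(Counters& c, unsigned nEvents, Group& group) {
    if (c.next.fetch_add(1) >= nEvents) return;  // no events left for this lane
    c.done.fetch_add(1);
    group.run([&c, nEvents, &group] { processNext(c, nEvents, group); });
}

// Shape 1: every lane schedules its chain on one shared group.
template <class Group>
unsigned runSingleGroup(unsigned nLanes, unsigned nEvents) {
    Counters c;
    Group group;
    for (unsigned i = 0; i < nLanes; ++i)
        group.run([&c, nEvents, &group] { processNext(c, nEvents, group); });
    group.wait();
    return c.done.load();
}

// Shape 2: the outer group starts the lanes; each lane owns and waits on its own group.
template <class Group>
unsigned runGroupPerLane(unsigned nLanes, unsigned nEvents) {
    Counters c;
    Group outer;
    for (unsigned i = 0; i < nLanes; ++i)
        outer.run([&c, nEvents] {
            Group laneGroup;
            laneGroup.run([&c, nEvents, &laneGroup] { processNext(c, nEvents, laneGroup); });
            laneGroup.wait();
        });
    outer.wait();
    return c.done.load();
}

// Throughput in processed events per second for one of the shapes above.
template <class Strategy>
double eventsPerSecond(Strategy strategy, unsigned nLanes, unsigned nEvents) {
    auto start = std::chrono::high_resolution_clock::now();
    unsigned done = strategy(nLanes, nEvents);
    std::chrono::duration<double> elapsed = std::chrono::high_resolution_clock::now() - start;
    return done / elapsed.count();
}
```

A call such as `eventsPerSecond(runGroupPerLane<ThreadGroup>, 4, 1000)` would time the nested shape with the stand-in. The nested shape uses `run` followed by `wait` on the lane group rather than `run_and_wait`, purely to keep the stand-in small.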

@Dr15Jones
Author

@alexey-katranov thank you for taking the time to look at this. Unfortunately, although my example properly shows the performance characteristics of our actual application, it does not exhibit the full range of its capabilities. In the full application, the equivalent of the for loop can spawn multiple independent tasks (in addition to a task starting another task once it finishes). There are also cases that force synchronization across tasks in different iterations of the for loop, which allows multiple tasks from the same for loop iteration to run concurrently on different threads. Therefore run_and_wait is not an option for us and we need to call run.

@alexey-katranov
Contributor

Thank you for the clarification. We will think about how we can improve the tasking interfaces to cover such cases. Notify: @aleksei-fedotov
