
tbb on wasm always executed on the main thread. #1287

Open
jellychen opened this issue Jan 1, 2024 · 35 comments

@jellychen

jellychen commented Jan 1, 2024

On the wasm platform, both tbb::task_group and tbb::parallel_for always execute on the main thread, while std::thread runs on a separate thread. What causes this?

Also, oneapi::tbb::info::default_concurrency() > 10.

@jellychen
Author

==================
This is my situation: I have ported OpenVDB to the web platform, and OpenVDB relies on TBB for its multi-threaded implementation. The porting itself went smoothly; however, the performance test results were unexpectedly poor.

After several comparisons I noticed some patterns. My workload performs voxel processing through VDB, encapsulated in a single function.

==================

  1. When I first call this function, CPU usage does not exceed 100%. Evidently TBB is not utilizing multiple cores at this point.
  2. When I call the function a second time, CPU usage is approximately 200%, meaning it can use two cores in parallel. My computer has 8 cores. This gives roughly a 2x performance improvement over the first run.
  3. On the third and fourth calls, CPU usage reaches around 780% and overall execution is approximately 7.5x faster. So at that point TBB effectively utilizes all the cores.

==================
To summarize: with the same code and execution environment, the only difference is the order of execution, yet TBB exhibits different multicore utilization on wasm.

It seems that TBB needs a warm-up. So I made a change: I compiled the code with Emscripten and added -sPTHREAD_POOL_SIZE=(navigator.hardwareConcurrency), but there doesn't seem to be any difference in performance. Do you have any similar experience?
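For reference, the pool-size experiment above corresponds to a link command roughly like this (a hedged sketch based on the Emscripten settings reference; app.o and app.js are placeholder names, and the memory size is arbitrary):

```shell
# Pre-spawn one pthread per core at load time so workers already exist
# when TBB first requests them (PTHREAD_POOL_SIZE accepts a JS expression).
emcc app.o -o app.js -pthread \
  -sPTHREAD_POOL_SIZE='navigator.hardwareConcurrency' \
  -sINITIAL_MEMORY=1gb
```

Without a pre-sized pool, Emscripten can only start a new web worker once the main thread returns to the browser event loop, which is one plausible reason a blocking parallel region never sees its workers.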

@jellychen
Author

==================
But I conducted an experiment using std::thread, and the code is roughly like the following.

static std::vector<std::thread> threads;
for (int i = 0; i < 8; ++i) {
    auto a = std::thread([]() {
        for (;;) {
            ;  // spin forever to occupy a core
        }
    });
    threads.emplace_back(std::move(a));
}

In this code snippet, threads can directly make use of the multi-core features without the need for pre-warming like in TBB. I wonder if there is any way to bypass this issue or adjust some mechanisms in TBB.
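For a fairer comparison, a joinable variant of this experiment can also measure whether the threads really run concurrently. A standard-library-only sketch (timed_parallel_sleep is a hypothetical helper name, not code from the snippet above):

```cpp
#include <chrono>
#include <thread>
#include <vector>

// Launch n threads that each sleep ~1s, join them, and return the total
// elapsed wall time in milliseconds. If the runtime really runs threads
// in parallel, the result stays near 1000 regardless of n.
long long timed_parallel_sleep(unsigned n) {
    auto start = std::chrono::steady_clock::now();
    std::vector<std::thread> threads;
    for (unsigned i = 0; i < n; ++i)
        threads.emplace_back([] {
            std::this_thread::sleep_for(std::chrono::seconds(1));
        });
    for (auto &t : threads) t.join();  // join instead of leaking the threads
    return std::chrono::duration_cast<std::chrono::milliseconds>(
               std::chrono::steady_clock::now() - start).count();
}
```

A result close to n seconds instead of ~1 second would indicate the threads are being serialized by the platform rather than by the code.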

@jellychen
Author

===============
I have done some research: I modified the TBB source by adding logging for thread creation in rml_thread_monitor.h. Analyzing the logs, I discovered that only a small number of threads (around 2) were created during the first phase of execution. So this does not appear to be an inherent limitation of wasm.

===============
Due to the complexity of TBB's internals, I haven't researched it in depth yet. It could be differences in the semaphore or synchronization primitives on the web platform that cause this. However, I can generally confirm that it is an inherent problem with TBB.

@jellychen
Author

But I have found a possible workaround: execute the following code segment after the program starts, as warm-up code for TBB.

===============

        {
#pragma optimize("", off)  // MSVC-only pragma; clang/Emscripten ignores it
            auto concurrency = std::thread::hardware_concurrency();
            if (concurrency > 1) {
                tbb::task_arena arena;
                arena.initialize(concurrency, 1, tbb::task_arena::priority::high);
                // Note: the arena is initialized but never entered, so the
                // loops below run in the default arena.
                int start = 0, len = static_cast<int>(concurrency) * 5;
                for (unsigned i = 0; i < concurrency; ++i) {
                    tbb::parallel_for(start, len, [](int /*j*/) {
                        // printf("thread id %d\n", std::this_thread::get_id());
                    });
                }
            }
#pragma optimize("", on)
        }

I found that executing this nearly no-op code ahead of time enables the subsequent OpenVDB calls to efficiently utilize multi-core computing.

@JhaShweta1
Contributor

Hi, Did you face this issue with TBB prior to your porting to WASM? As you have said - it doesn't seem to be a WASM issue, but an inherent TBB issue. I will investigate this further and keep you updated.

@jellychen
Author

Hi, Did you face this issue with TBB prior to your porting to WASM? As you have said - it doesn't seem to be a WASM issue, but an inherent TBB issue. I will investigate this further and keep you updated.

I have been using it in non-web scenarios, mainly on macOS, and it works well

@jellychen
Author

Hi @JhaShweta1

=============================
I conducted the same experiment on OpenSubdiv, which is a geometry algorithm library specifically designed for mesh subdivision. I discovered some strange phenomena.

The phenomenon is that using TBB (Threading Building Blocks) for computation is much slower than using a single thread, approximately three times slower.

=============================
The rough process is as follows.

Just like the previous method, warm up TBB by using the code snippet below.

        {
#pragma optimize("", off)  // MSVC-only pragma; clang/Emscripten ignores it
            auto concurrency = std::thread::hardware_concurrency();
            if (concurrency > 1) {
                tbb::task_arena arena;
                arena.initialize(concurrency, 1, tbb::task_arena::priority::high);
                // Note: the arena is initialized but never entered, so the
                // loops below run in the default arena.
                int start = 0, len = static_cast<int>(concurrency) * 5;
                for (unsigned i = 0; i < concurrency; ++i) {
                    tbb::parallel_for(start, len, [](int /*j*/) {
                        // printf("thread id %d\n", std::this_thread::get_id());
                    });
                }
            }
#pragma optimize("", on)
        }

OpenSubdiv extensively utilizes tbb::parallel_for for parallel execution of kernel functions. To ensure the effective utilization of TBB's multithreading, I simulated the invocation of tbb::parallel_for externally beforehand, guaranteeing that each callback function of tbb::parallel_for indeed occurs on different threads.

Yet in the subsequent real workload, CPU utilization never exceeds 100%, which is quite peculiar. As a result, performance drops significantly compared to the single-threaded version without TBB.

=============================
I conducted repeated experiments with the same code on a Mac system, and the conclusion is that using TBB (Threading Building Blocks) effectively utilizes the multi-core capabilities. The code is at least 3 to 5 times faster than the single-threaded version. I am using an 8-core device.

=============================
Maybe these phenomena can help you make better judgments. As far as my results are concerned, the overall effect is unsatisfactory, possibly due to the instability of the Wasm platform itself.

@pca006132

Hi, I am also encountering similar issues, but only in Node.js, not in the browser. With Node.js 18 and the --experimental-wasm-threads flag set, it occasionally works (same file; different runs have different characteristics). With Node.js 20/21 I cannot set --experimental-wasm-threads, and it cannot utilize multiple threads.

elalish/manifold#653 (comment)

@jellychen
Author

elalish/manifold#653 (comment)

There seems to be no way

@pca006132

I wonder if this is something related to the scheduler in tbb, not familiar with the internals so cannot say much. I can try to create a MRE and detailed environment information (emscripten, browser, nodejs version) if that helps.

@JhaShweta1
Contributor

Hi,
Yes, please share a reproducer and environment details. I tried a couple of things suggested by Emscripten previously, but they didn't seem to work.

@pca006132

Sure, but this will take some time as I am busy with other things right now. Debugging this wasm weirdness takes quite a lot of time... Hopefully I have more time next week to do this.

@pca006132

Consider the following code:

#include <chrono>
#include <iostream>
#include <thread>

#include "oneapi/tbb/parallel_for.h"

using namespace std::chrono_literals;

int main() {
  auto start = std::chrono::high_resolution_clock::now();

  oneapi::tbb::parallel_for(  //
      oneapi::tbb::blocked_range<std::size_t>(0, 10), [&](const auto &r) {
        std::this_thread::sleep_for(1s);
        auto end = std::chrono::high_resolution_clock::now();
        std::cout << "worker: "
                  << std::chrono::duration_cast<std::chrono::milliseconds>(
                         end - start)
                         .count()
                  << std::endl;
      });
  return 0;
}

Example results:

worker: 1001
worker: worker: 1005
1005
worker: worker: 1006
1006
worker: 1040
worker: 1058
worker: 1066
worker: 1068
worker: 1069

The results are all close to 1000, indicating this is indeed running in multiple threads. However, CPU utilization never exceeds 100% for a compute-heavy workload:

#include <chrono>
#include <iostream>
#include <thread>

#include "oneapi/tbb/parallel_for.h"

using namespace std::chrono_literals;

int main() {
  auto start = std::chrono::high_resolution_clock::now();

  oneapi::tbb::parallel_for(  //
      oneapi::tbb::blocked_range<std::size_t>(0, 10), [](const auto &r) {
        long long steps = 0;
        for (long long i = 2; i < 1000000000000; i++) {
          long long n = i;
          while (n != 1) {
            if (n % 2)
              n = (3 * n + 1) / 2;
            else
              n /= 2;
            steps++;
          }
        }
        std::cout << "good " << steps << std::endl;
      });
  return 0;
}
time node a.js
node a.js  6.22s user 0.03s system 101% cpu 6.147 total
# emcmake cmake -DCMAKE_BUILD_TYPE=Release -DEMSCRIPTEN_SYSTEM_PROCESSOR=web ..
cmake_minimum_required(VERSION 3.11)
project(test)

include(FetchContent)
set(TBB_TEST OFF CACHE INTERNAL "" FORCE)
set(TBB_STRICT OFF CACHE INTERNAL "" FORCE)
FetchContent_Declare(TBB
    GIT_REPOSITORY https://github.com/oneapi-src/oneTBB.git
    GIT_TAG        v2021.11.0
)
FetchContent_MakeAvailable(TBB)

set(CMAKE_CXX_FLAGS "-pthread")
set(CMAKE_EXE_LINKER_FLAGS "-pthread -sPTHREAD_POOL_SIZE=4 -sINITIAL_MEMORY=1gb")

add_executable(a a.cpp)
target_link_libraries(a PUBLIC TBB::tbb)
target_link_options(a PUBLIC -pthread)
  • Emscripten version: 3.1.47
  • node version: v21.6.2

@SoilRos
Contributor

SoilRos commented Apr 12, 2024

I also found the same issue in my project. Since we are constrained to running in the browser, we also made a reproducible Docker environment for this case:

git clone git@github.com:josephholten/em-multi.git
cd em-multi
docker build -t em-multi .
docker run -d -p 8080:8080 em-multi
firefox -new-tab localhost:8080
# any update on the website (F5) will reproduce the results in the console (Ctrl + Shift + C)

Note: this starts Docker in detached mode, so you need to stop the container manually. If you aren't running any other Docker containers, you can stop the most recent one with docker stop $(docker ps -lq).

Output:

filling random vectors...
calculating sequential scalarproduct...
using thread: 131060
seq scalarprod: 16779532.297833
seq time: 170ms
calculating cpp_threads scalarproduct...
cpp_threads concurrency: 4
using thread: 1074151888
using thread: 1073948128
using thread: 1074016056
using thread: 1074083968
cpp_threads scalarprod: 16779532.297837
cpp_threads time: 64ms
calculating tbb_threads scalarproduct...
tbb_threads concurrency: 8
using thread: 131060
tbb_threads scalarprod: 32815.633438
tbb_threads time: 103ms

As you can see, multiple threads are possible in the same C++ program but TBB scheduler still manages to bind tasks to the main thread.

@jellychen
Author

jellychen commented Apr 13, 2024

TBB on WebAssembly (WASM) is certainly very unstable, yet some open-source projects depend on it. It seems the official team doesn't pay much attention to the bugs discussed here. I wonder if we should consider abandoning this library in the future.

@pavelkumbrasev
Contributor

pavelkumbrasev commented Apr 15, 2024

Hi all, sorry to hear you've hit such a problem. Our team is not yet expert in WASM; we are new to this technology, so it takes us longer to react to such problems.
Talking about the issue: at first glance I thought there was not enough time for TBB to wake up all the threads, and that the main thread finishes the parallel region before the workers can join (the wake-up mechanism is not serial: the main thread wakes at most 2 threads, and each of those in turn wakes at most 2).
But from the provided log information it seems the threads did join the parallel region (the most accurate way to check is to use a thread id or a thread_local variable).
So the TBB scheduler utilizes the available concurrency, but for some reason the system or WASM scheduler doesn't allocate CPU time to these threads, so they execute serially.
That in turn is really bizarre, because you sometimes saw higher system utilization.
@jellychen could you please confirm that with std::thread CPU utilization always covers all the cores? Maybe thread creation in TBB lacks some flag, preventing parallel execution.
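The thread-id check suggested here can be sketched with the standard library alone (count_distinct_executors is a hypothetical name; inside a TBB parallel region the same idea applies by recording std::this_thread::get_id() in each iteration and inspecting the set size afterwards):

```cpp
#include <cstddef>
#include <mutex>
#include <set>
#include <thread>
#include <vector>

// Run `tasks` bodies and count how many distinct OS threads executed them.
// A result of 1 would mean everything ran on a single thread.
std::size_t count_distinct_executors(unsigned tasks) {
    std::mutex m;
    std::set<std::thread::id> ids;
    std::vector<std::thread> workers;
    for (unsigned i = 0; i < tasks; ++i)
        workers.emplace_back([&] {
            std::lock_guard<std::mutex> lock(m);
            ids.insert(std::this_thread::get_id());
        });
    for (auto &w : workers) w.join();
    return ids.size();
}
```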

@SoilRos
Contributor

SoilRos commented Apr 15, 2024

@pavelkumbrasev I think this may be related to issue #1341. I tried to add an observer to at least log the entry point of the threads and found out error stated in #1341. Once solved, I found that the observer hooks are being called after all the parallel loops are invoked (see #1341 (comment) for more details). I think that there must be a bug during the thread initialization related to my comment in that issue.

@jellychen
Author

@jellychen could you please confirm that with std::thread CPU utilization is always all the cores? May be threads creation in TBB lacks some flags that prevents threads from parallel execution.

After testing, it has been found that std::thread can utilize all the cores in almost all scenarios.

@pavelkumbrasev
Contributor

@jellychen, I'm not really familiar with the WASM execution model. Is there a chance you can print the thread stacks during a parallel section where CPU utilization equals a single running thread, so we can see whether the worker threads are sleeping in the thread pool for some reason, or whether their stacks are also involved in the computation?

@b-qp

b-qp commented Apr 19, 2024

Same behavior. Recompiled in debug mode and got this:

Assertion node(val).my_prev_node == &node(val) && node(val).my_next_node == &node(val) failed (located in the push_front function, line in file: 135)
Detailed description: Object with intrusive list node can be part of only one intrusive list simultaneously
...
$tbb::detail::r1::assertion_failure_impl(char const*, int, char const*, char const*) @ a.out.wasm:0x5e516
$tbb::detail::r1::assertion_failure(char const*, int, char const*, char const*)::$_0::operator()() const @ a.out.wasm:0x5e443
$void tbb::detail::d0::run_initializer<tbb::detail::r1::assertion_failure(char const*, int, char const*, char const*)::$_0>(tbb::detail::r1::assertion_failure(char const*, int, char const*, char const*)::$_0 const&, std::__2::atomic<tbb::detail::d0::do_once_state>&) @ a.out.wasm:0x5e00b
$void tbb::detail::d0::atomic_do_once<tbb::detail::r1::assertion_failure(char const*, int, char const*, char const*)::$_0>(tbb::detail::r1::assertion_failure(char const*, int, char const*, char const*)::$_0 const&, std::__2::atomic<tbb::detail::d0::do_once_state>&) @ a.out.wasm:0x5df97
$tbb::detail::r1::assertion_failure(char const*, int, char const*, char const*) @ a.out.wasm:0x5de7c
$tbb::detail::r1::intrusive_list_base<tbb::detail::r1::intrusive_list<tbb::detail::r1::thread_dispatcher_client>, tbb::detail::r1::thread_dispatcher_client>::push_front(tbb::detail::r1::thread_dispatcher_client&) @ a.out.wasm:0x85922
$tbb::detail::r1::thread_dispatcher::insert_client(tbb::detail::r1::thread_dispatcher_client&) @ a.out.wasm:0x85505
invoke_vii @ a.out.js:4760
$tbb::detail::r1::thread_dispatcher::register_client(tbb::detail::r1::thread_dispatcher_client*) @ a.out.wasm:0x852b5
$tbb::detail::r1::threading_control_impl::publish_client(tbb::detail::r1::threading_control_client, tbb::detail::d1::constraints&) @ a.out.wasm:0x94d5f
$tbb::detail::r1::threading_control::publish_client(tbb::detail::r1::threading_control_client, tbb::detail::d1::constraints&) @ a.out.wasm:0x97e32
$tbb::detail::r1::arena::create(tbb::detail::r1::threading_control*, unsigned int, unsigned int, unsigned int, tbb::detail::d1::constraints) @ a.out.wasm:0x1dd0d
$tbb::detail::r1::governor::init_external_thread() @ a.out.wasm:0x3d192
$tbb::detail::r1::governor::get_thread_data() @ a.out.wasm:0x1e4a6
$tbb::detail::r1::allocate(tbb::detail::d1::small_object_pool*&, unsigned long) @ a.out.wasm:0x69e32
$tbb::detail::d1::start_for<tbb::detail::d1::blocked_range<unsigned long>, main::$_0, tbb::detail::d1::auto_partitioner const>* tbb::detail::d1::small_object_allocator::new_object<tbb::detail::d1::start_for<tbb::detail::d1::blocked_range<unsigned long>, main::$_0, tbb::detail::d1::auto_partitioner const>, tbb::detail::d1::blocked_range<unsigned long> const&, main::$_0 const&, tbb::detail::d1::auto_partitioner const&, tbb::detail::d1::small_object_allocator&>(tbb::detail::d1::blocked_range<unsigned long> const&, main::$_0 const&, tbb::detail::d1::auto_partitioner const&, tbb::detail::d1::small_object_allocator&) @ a.out.wasm:0x7d5d
$tbb::detail::d1::start_for<tbb::detail::d1::blocked_range<unsigned long>, main::$_0, tbb::detail::d1::auto_partitioner const>::run(tbb::detail::d1::blocked_range<unsigned long> const&, main::$_0 const&, tbb::detail::d1::auto_partitioner const&, tbb::detail::d1::task_group_context&) @ a.out.wasm:0x78da
$tbb::detail::d1::start_for<tbb::detail::d1::blocked_range<unsigned long>, main::$_0, tbb::detail::d1::auto_partitioner const>::run(tbb::detail::d1::blocked_range<unsigned long> const&, main::$_0 const&, tbb::detail::d1::auto_partitioner const&) @ a.out.wasm:0x46d3
$void tbb::detail::d1::parallel_for<tbb::detail::d1::blocked_range<unsigned long>, main::$_0>(tbb::detail::d1::blocked_range<unsigned long> const&, main::$_0 const&) @ a.out.wasm:0x3fa3
$__original_main @ a.out.wasm:0x3bb3
$main @ a.out.wasm:0xcbbb

@pavelkumbrasev
Contributor

@b-qp I believe we saw this problem before with the static version of TBB (and only with the static version).
Is there a chance you can try running your reproducer with the static version of TBB to see if the problem persists?

@jellychen
Author

@jellychen, I'm not really familiar with the WASM execution model. Is there a chance you can print the thread stacks during a parallel section where CPU utilization equals a single running thread, so we can see whether the worker threads are sleeping in the thread pool for some reason, or whether their stacks are also involved in the computation?
@pavelkumbrasev

I'm sorry for the late response; I've been on vacation recently. I'm not quite sure how to print the call stack. Could you tell me the exact steps?

@pavelkumbrasev
Contributor

@jellychen, this will be just a guess, because I'm not familiar with the technology either.
Is there a chance you can attach gdb to the process and call thread apply all bt? If you place a breakpoint inside the parallel region, I would expect all of the worker threads to be participating.

@jellychen
Author

@jellychen, this will be just a guess, because I'm not familiar with the technology either.
Is there a chance you can attach gdb to the process and call thread apply all bt? If you place a breakpoint inside the parallel region, I would expect all of the worker threads to be participating.

Maybe Wasm does not support gdb debugging

@pavelkumbrasev
Contributor

Could you please provide steps to reproduce the issue? (If you can do it with debug version of the library it also will be helpful)

@jellychen
Author

Could you please provide steps to reproduce the issue? (If you can do it with debug version of the library it also will be helpful)

Almost nothing special is required: compile even the simplest parallel task to wasm and the issue occurs, I can say, 100% of the time.

@pca006132

@pavelkumbrasev see my comment above (#1287 (comment)).

@jellychen
Author

@pavelkumbrasev

I suspect that TBB's multithreading mechanism does not work effectively under Emscripten's web worker model. It might not be an issue with TBB; perhaps it's a problem with the web platform itself.
In any case, I haven't isolated the cause.

However, I have found a workaround: implementing a set of interfaces similar to TBB's, though not the whole API. Many pieces of software use only part of the TBB interface, mainly task_group, parallel_sort, parallel_for, and parallel_reduce.

My approach initializes a std::thread pool at startup and bridges these implementations to std::thread.

So far, this solution has performed better than TBB in some software experiments. Currently, the multithreaded performance of TBB in some wasm software, such as OpenVDB, is even weaker than its single-threaded performance.

I hope this can help most developers working on wasm.
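The bridging approach described above might be sketched minimally like this (parallel_for_pool is a hypothetical name; a real replacement would reuse a persistent thread pool rather than spawning threads on every call):

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Minimal tbb::parallel_for-like loop over std::thread: split [first, last)
// into one contiguous chunk per hardware thread and join before returning.
void parallel_for_pool(std::size_t first, std::size_t last,
                       const std::function<void(std::size_t)> &body) {
    unsigned n = std::max(1u, std::thread::hardware_concurrency());
    std::size_t len = last - first;
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < n; ++t) {
        std::size_t lo = first + len * t / n;
        std::size_t hi = first + len * (t + 1) / n;
        if (lo == hi) continue;  // fewer items than threads
        workers.emplace_back([lo, hi, &body] {
            for (std::size_t i = lo; i < hi; ++i) body(i);
        });
    }
    for (auto &w : workers) w.join();
}
```

Note that joining from the browser main thread still blocks the event loop, so under Emscripten this only helps when the worker threads already exist (e.g. via -sPTHREAD_POOL_SIZE).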

@pavelkumbrasev
Contributor

@jellychen, I'm not sure the problem is Emscripten. I was able to reproduce the described behavior, and from my perspective something is odd. I will continue investigating the problem.

@pavelkumbrasev
Contributor

@jellychen I have summarized the concluded analysis into a set of questions in an Emscripten discussion:
emscripten-core/emscripten#21963

@jellychen
Author

@jellychen I have summarized the concluded analysis into a set of questions in an Emscripten discussion: emscripten-core/emscripten#21963

I have also read quite a bit of the TBB code, and I will keep tracking this issue with the hope that TBB gets even better. I'm grateful for the work you've done.

@pavelkumbrasev
Contributor

Hi @jellychen and @SoilRos, I was thinking about the best way to overcome the current problem.
I think I'm still lacking TBB usage context to make a decision.
It seems -sPROXY_TO_PTHREAD would be the best solution, since TBB would then work with no changes. However, I don't know whether every app that uses TBB can use this flag.
What do you think? Can you build your apps using -sPROXY_TO_PTHREAD?

@jellychen
Author

What do you think? Can you build your apps using -sPROXY_TO_PTHREAD?

Based on the current situation, it doesn't seem feasible. I have many interfaces within my WebAssembly module that require manipulation of the DOM from the main thread, and there's no way to migrate them out of the main thread.

@pavelkumbrasev
Contributor

Based on the current situation, it doesn't seem feasible. I have many interfaces within my WebAssembly module that require manipulation of the DOM from the main thread, and there's no way to migrate them out of the main thread.

That's probably applicable to a lot of WebAssembly users. I will try to come up with a solution keeping this in mind.

@pca006132

I guess I now understand why our project with tbb works fine on the browser but not with nodejs. Will try to figure out how to make it work there. Thanks.

6 participants