callback on idle threads #992

Open
fwyzard opened this issue Dec 14, 2022 · 11 comments · May be fixed by #995

Comments

@fwyzard

fwyzard commented Dec 14, 2022

Hi,
I am looking for a way to keep track of when and how long any TBB thread stays idle within (the framework used by) my application.

After reading a bit of the oneTBB code, my understanding is that I can consider a thread to be idle when it is in the stealing loop at https://github.com/oneapi-src/oneTBB/blob/v2021.8.0-rc1/src/tbb/task_dispatcher.h#L193-L232:

// Stealing loop mailbox/enqueue/other_slots
for (;;) {
    __TBB_ASSERT(t == nullptr, nullptr);
    // Check if the resource manager requires our arena to relinquish some threads
    // For the external thread restore idle state to true after dispatch loop
    if (!waiter.continue_execution(slot, t)) {
        __TBB_ASSERT(t == nullptr, nullptr);
        break;
    }
    // Start searching
    if (t != nullptr) {
        // continue_execution returned a task
    }
    else if ((t = get_inbox_or_critical_task(ed, inbox, isolation, critical_allowed))) {
        // Successfully got the task from mailbox or critical task
    }
    else if ((t = get_stream_or_critical_task(ed, a, resume_stream, resume_hint, isolation, critical_allowed))) {
        // Successfully got the resume or critical task
    }
    else if (fifo_allowed && isolation == no_isolation
             && (t = get_stream_or_critical_task(ed, a, fifo_stream, fifo_hint, isolation, critical_allowed))) {
        // Checked if there are tasks in starvation-resistant stream. Only allowed at the outermost dispatch level without isolation.
    }
    else if (stealing_is_allowed
             && (t = steal_or_get_critical(ed, a, arena_index, tls.my_random, isolation, critical_allowed))) {
        // Stole a task from a random arena slot
    }
    else {
        t = get_critical_task(t, ed, isolation, critical_allowed);
    }
    if (t != nullptr) {
        ed.context = task_accessor::context(*t);
        ed.isolation = task_accessor::isolation(*t);
        a.my_observers.notify_entry_observers(tls.my_last_observer, tls.my_is_worker);
        break; // Stealing success, end of stealing attempt
    }
    // Nothing to do, pause a little.
    waiter.pause(slot);
} // end of nonlocal task retrieval loop

Is this reasonable? That is,

  • is it OK to consider a thread as idle, that is, not executing any task, while it is in that loop?
  • are there other places where an idle thread could spend its time?

As for a way to notify my application when a thread is idle, I have been thinking of extending task_scheduler_observer by adding two methods, on_thread_idle() and on_thread_active().
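
For illustration, a minimal sketch of what an application-side observer could look like if such hooks existed (on_thread_idle() / on_thread_active() are only the proposed names, not part of the current oneTBB API, so they are plain methods here rather than overrides):

#include <oneapi/tbb/task_scheduler_observer.h>
#include <atomic>

// Sketch under the assumption that the two proposed hooks get added to
// tbb::task_scheduler_observer; on_scheduler_entry/exit are the existing ones.
class idle_observer : public tbb::task_scheduler_observer {
public:
    idle_observer() { observe(true); }

    // existing notifications
    void on_scheduler_entry(bool /*is_worker*/) override { ++active_threads_; }
    void on_scheduler_exit(bool /*is_worker*/) override  { --active_threads_; }

    // proposed notifications (would become virtual methods of the base class)
    void on_thread_idle()   { ++idle_threads_; }
    void on_thread_active() { --idle_threads_; }

private:
    std::atomic<int> active_threads_{0};
    std::atomic<int> idle_threads_{0};
};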

Does this seem like a good approach?

Thanks for any comments and suggestions - next I'll see if I can implement something along these lines.

.Andrea

@pavelkumbrasev
Contributor

Hi @fwyzard, could you please describe your use case in more detail? Since the stealing loop is part of the task-stealing mechanism, it might be considered a working state. In balanced scenarios this might introduce some overhead, because we would insert a virtual call on the hot path.

@fwyzard
Author

fwyzard commented Dec 15, 2022

Hi @pavelkumbrasev, of course.

CMSSW is the software used by the CMS experiment at CERN for the simulation, physics reconstruction and analysis of the experimental data, and its framework relies heavily on TBB to implement task-based multithreading.

An optimised application may run on the order of 10k TBB tasks per second per CPU thread.

However, we also have cases where the threads are idle, mostly for two reasons:

  1. the input data arrives at a lower rate than what the system can process, leading to a situation where there simply aren't any tasks to run;
  2. sometimes there are tasks available, but their dependencies are not yet satisfied, so they are not ready to run.

We implemented a "service" inside the application that tracks how much (CPU and real) time is spent inside each "module" of the application (roughly, a module maps to a TBB task). However, we don't have a way to measure how much time the threads spend idling, for example because of one of the two reasons above.

Hence my attempt to extend TBB to let it notify us about idling threads :-)
Though I haven't thought yet about whether the two cases above (no input data, or unsatisfied dependencies) could be treated separately.

So far, my idea is to generate a pair of calls (idle/active) at most once per call to task_dispatcher::receive_or_steal_task(...), by making the idle call only once the thread reaches the waiter.pause(slot); call for the first time, and making the active call only once the thread exits the loop, and only if it was marked as idle - roughly as in the sketch below.
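
Relative to the stealing loop quoted above, the placement would be roughly as follows (a sketch only, not an actual patch; notify_idle_observers() and notify_active_observers() are placeholder names for whatever the observer_list ends up exposing):

// Sketch of the intended placement of the idle/active notifications.
bool marked_idle = false;
for (;;) {
    // ... task retrieval attempts, exactly as in the loop quoted above ...
    if (t != nullptr) {
        // leaving the loop with a task
        break;
    }
    // Nothing to do: notify "idle" at most once per receive_or_steal_task() call,
    // just before the first pause.
    if (!marked_idle) {
        a.my_observers.notify_idle_observers(tls.my_last_observer, tls.my_is_worker);   // placeholder name
        marked_idle = true;
    }
    waiter.pause(slot);
}
// On the way out, notify "active" only if the thread was previously marked idle.
if (marked_idle) {
    a.my_observers.notify_active_observers(tls.my_last_observer, tls.my_is_worker);     // placeholder name
}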

Let me know what you think, and if I can provide more information!

@pavelkumbrasev
Contributor

Seems you're right: the idle notification might be called just before waiter.pause() and the active one on return. Although the applicability of such an API is somewhat limited too - maybe for statistics.
Do you observe that in such scenarios ("1" and "2") the stealing loop is significant enough to mark it as a separate state? In the described scenarios it will lead to frequent waiter.pause() calls, i.e. 2 * P processor pauses + 100 yields, which should be ~100 us in total or even less. After one thread marks the internal arena as empty, the worker threads will leave at that moment.

@fwyzard
Author

fwyzard commented Dec 15, 2022

Yes, my goal is to monitor the active vs idle time, not to affect the properties of the threads.

I know that the fraction of time spent idle in scenario 1. can be significant.

I do not know yet what the fraction of time ending up in scenario 2. could be - first I would need to find a way to monitor when it happens :-)

@pavelkumbrasev
Contributor

I can review your PR once you submit it, and after that we can discuss the applicability of this API in the general case.
Are you OK with this plan?

@fwyzard
Author

fwyzard commented Dec 15, 2022

Sure, thanks.

Though, I have one question first.
In order to call the two new methods on_thread_idle() and on_thread_active() on the task_scheduler_observer, I think I need to extend the observer_list class and implement the equivalent of do_notify_entry_observers() and do_notify_exit_observers() for the new methods.
Looking at their implementation, they are similar but not identical. Could you give me a brief explanation of why they are different?

Thank you,
.Andrea

@pavelkumbrasev
Contributor

TBH that is pretty complicated logic, but at first glance 'entry' iterates until it reaches last == nullptr, while 'exit' iterates up to the last notified observer that was stored in TLS.
I wonder if you could change the interface and reuse the do_notify_entry_observers and do_notify_exit_observers methods, passing a lambda with the needed call, e.g. in do_notify_entry_observers either tso->on_scheduler_entry(worker); or tso->on_thread_idle(worker);, and similarly for do_notify_exit_observers.
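
A standalone illustration of that pattern, with simplified stand-ins for observer_list/observer_proxy rather than the real oneTBB internals: one traversal routine parameterized by a callable, so the same loop can dispatch either the existing entry notification or the proposed idle one.

#include <vector>

// Simplified stand-ins, not the actual oneTBB classes.
struct observer {
    virtual void on_scheduler_entry(bool /*worker*/) {}
    virtual void on_thread_idle(bool /*worker*/) {}   // proposed addition
    virtual ~observer() = default;
};

struct observer_list {
    std::vector<observer*> observers;

    // Single traversal; the notification to deliver is injected by the caller.
    template <typename Callback>
    void notify(Callback&& call) {
        for (observer* o : observers)
            call(*o);
    }
};

// Usage, mirroring the suggestion above:
//   list.notify([worker](observer& o) { o.on_scheduler_entry(worker); });
//   list.notify([worker](observer& o) { o.on_thread_idle(worker); });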

@fwyzard linked a pull request Dec 20, 2022 that will close this issue
@fwyzard
Author

fwyzard commented Dec 20, 2022

@pavelkumbrasev, thanks for the suggestions.

Please find at #995 my first attempt to implement the idle thread notifications.

@fwyzard
Author

fwyzard commented Dec 20, 2022

It's split into two commits:

  • the first implements the minimal changes, but with a large amount of code duplication in src/tbb/observer_proxy.h/.cpp;
  • the second attempts to reduce this duplication, though I'm not very happy about the result.

@fwyzard
Author

fwyzard commented Dec 20, 2022

From a first round of checks in our application, it seems to behave as intended. Next I'll try to implement some proper monitoring on top of it, and then measure the impact on the application performance.
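
One possible shape for that monitoring, assuming the on_thread_idle() / on_thread_active() callbacks from #995 (the observer base class and its registration are omitted to keep the sketch short):

#include <atomic>
#include <chrono>
#include <cstdint>

// Accumulates the total time threads spend idle, using a per-thread timestamp
// taken when a thread goes idle and read back when it becomes active again.
class idle_time_monitor {
    using clock = std::chrono::steady_clock;

public:
    void on_thread_idle() {
        idle_since() = clock::now();
    }

    void on_thread_active() {
        auto delta = clock::now() - idle_since();
        total_idle_ns_ += std::chrono::duration_cast<std::chrono::nanoseconds>(delta).count();
    }

    std::int64_t total_idle_ns() const { return total_idle_ns_.load(); }

private:
    static clock::time_point& idle_since() {
        thread_local clock::time_point t{};   // per-thread idle start time
        return t;
    }

    std::atomic<std::int64_t> total_idle_ns_{0};
};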

@pavelkumbrasev
Contributor

Hi @fwyzard, any updates regarding your research?
