API to specify number of threads, from threadpool, to use for the task #17

Open
wants to merge 2 commits into base: master

Conversation

kimishpatel

Summary:
This PR adds a way to use fewer threads than the number configured in
pthreadpool. Occasionally, using as many threads as there are logical cores
is not efficient. This can be due to system load and other varying factors
that lead to threads either being mapped to slower cores or being mapped to
fewer cores than the number of logical cores (as actually observed).
This PR attempts to fix that.

Approach:

  • Add an API that sets a thread-local variable specifying the number of threads to use.
  • pthreadpool_parallelize then distributes the work only among that many threads.
  • Threads that are not picked continue to wait on the command mutex/condvar for the
    next chunk of work and thus give up their CPU slot.
    Both pthreads.c and windows.c are modified to add this feature. A rough usage sketch
    follows below.
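
A minimal usage sketch of the intent, assuming the pthreadpool_cap_num_threads name from
this PR and the existing pthreadpool_parallelize_1d entry point (my_task and run_capped
are placeholder names; the final API shape may change during review):

    #include <pthreadpool.h>
    #include <stddef.h>

    static void my_task(void* context, size_t i) {
      /* Process element i of whatever is reachable through context. */
    }

    void run_capped(pthreadpool_t threadpool, void* context) {
      /* Dispatch subsequent work from this thread to at most 2 workers;
       * the remaining workers keep waiting for the next command. */
      pthreadpool_cap_num_threads(2);
      pthreadpool_parallelize_1d(threadpool, my_task, context,
                                 /*range=*/1024, /*flags=*/0);
    }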

Test Plan:
4 tests are added to check this.


@kimishpatel
Author

@Maratyszcza would love to get your inputs here. Thanks!

@raziel raziel left a comment

Thanks @kimishpatel
Overall I'm still not understanding why capping the # threads requires the bulk of the changes here.

Is there a simple explanation?

@@ -85,6 +86,12 @@ pthreadpool_t pthreadpool_create(size_t threads_count);
*/
size_t pthreadpool_get_threads_count(pthreadpool_t threadpool);

/*

Maybe we include the why?
E.g.
API to cap the number of threads used to do work, rather than those
currently available in the runtime's threadpool.

    This is useful to counter the potential performance degradation of using more threads
    than optimal for the device and use-case, such as the OS scheduling threads to run on
    smaller cores, at the cost of threading overhead.

Also indicate what happens if num_threads > # threads in pool

Author

Yes, I need to add that.

Owner

Please follow the same pattern (including @param and @returns tag) as the other functions, make sure both pthreadpool_set_max_num_threads and pthreadpool_get_max_num_threads are documented
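
For example, something along these lines for the setter (a sketch only; the exact text
should follow the style of the surrounding header comments, and the getter needs an
analogous block with a @returns tag):

    /**
     * Cap the number of threads used to execute tasks on the thread pool.
     *
     * Threads beyond the cap keep waiting for the next command and give up
     * their CPU slot. If num_threads exceeds the number of threads in the
     * pool, the pool size is used instead.
     *
     * @param  threadpool   the thread pool to configure.
     * @param  num_threads  the maximum number of threads to use; must be non-zero.
     */
    void pthreadpool_set_max_num_threads(pthreadpool_t threadpool, size_t num_threads);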

Owner

I think it is better to use pthreadpool_get_threads_count/pthreadpool_set_threads_count for the new API functions, and add a new function pthreadpool_get_max_threads_count that returns the number of threads in the thread pool (what pthreadpool_get_threads_count does now).
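
In other words, roughly (hypothetical declarations, just to illustrate the naming):

    /* Number of threads in the pool (what pthreadpool_get_threads_count returns today). */
    size_t pthreadpool_get_max_threads_count(pthreadpool_t threadpool);
    /* Limit actually used when dispatching work. */
    size_t pthreadpool_get_threads_count(pthreadpool_t threadpool);
    void pthreadpool_set_threads_count(pthreadpool_t threadpool, size_t threads_count);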

src/gcd.c Outdated
@@ -73,6 +75,15 @@ struct pthreadpool* pthreadpool_create(size_t threads_count) {
return threadpool;
}

void pthreadpool_cap_num_threads(size_t num_threads) {

shouldn't this check vs threads_count?
And return a bool to say if it actually capped anything.

Author

We don't have access to the threadpool object here. I think your terminology of max threads actually explains the behavior better: it says "do not use more than max_threads". "Capping" somehow implies that some pre-existing value has to be capped.

src/gcd.c Outdated
@@ -99,15 +110,16 @@ PTHREADPOOL_INTERNAL void pthreadpool_parallelize(

/* Locking of completion_mutex not needed: readers are sleeping on command_condvar */
const struct fxdiv_divisor_size_t threads_count = threadpool->threads_count;
const struct fxdiv_divisor_size_t num_threads_to_use = fxdiv_init_size_t(min(threads_count.value, capped_num_threads));

I think this should be done directly in the method.

Author

which method?

src/pthreads.c Outdated
@@ -1,3 +1,34 @@
/*
* Overall architecture:
* 1. capped_num_threads is used to specifiy max threads to use.

I think that we should be clear that capped_num_threads is <= threads_count.
Let's not introduce another term (max threads).

Author

Are you saying that we should document this? Because pthreadpool_cap_num_threads does not know threadpool size.

src/pthreads.c Outdated
* However thread 2 will never see cmd 3 because the masked bit is same as cmd1 and it is not perceived as new command.
* To fix this we must add this invariant:
* - Master thread must synchronize with all threads before submitting next command regardless of the eligibility of threads to paricipate in the work.
* Testing:

you mean you ran this test?

Author

That is the test I added. I still could not figure out a good way to ensure that only the specified number of threads is used. Of course I validated it manually, but nothing else besides that.

src/pthreads.c Outdated
Comment on lines 187 to 144
uint32_t last_flags,
size_t thread_id)
{

This code seems to mix tabs and spaces; you should fix this.

Author

My bad. Not sure how that crept in.

src/pthreads.c Outdated
* At this point thread 2 starts waiting with last_command = cmd 1 because it never saw cmd 2. Master thread submits cmd 3. At cmd 3 all three threads are eligible.
* However thread 2 will never see cmd 3 because the masked bit is same as cmd1 and it is not perceived as new command.
* To fix this we must add this invariant:
* - Master thread must synchronize with all threads before submitting next command regardless of the eligibility of threads to paricipate in the work.

have you estimated the cost of doing this sync?

Author

Only on one model we wanted to test, but I need to do more comprehensive benchmarking.

src/pthreads.c Outdated
@@ -291,6 +388,15 @@ struct pthreadpool* pthreadpool_create(size_t threads_count) {
return threadpool;
}

void pthreadpool_cap_num_threads(size_t num_threads) {
assert(num_threads > 0);
capped_num_threads = num_threads;

Same comments as in the other file.
That way we don't need another num_threads_to_use but can use capped_num_threads directly which is cleaner.

Author

I think I asked this before but to be sure, directly where?

src/pthreads.c Outdated
@@ -322,7 +428,11 @@ PTHREADPOOL_INTERNAL void pthreadpool_parallelize(

/* Locking of completion_mutex not needed: readers are sleeping on command_condvar */
const struct fxdiv_divisor_size_t threads_count = threadpool->threads_count;
pthreadpool_store_relaxed_size_t(&threadpool->active_threads, threads_count.value - 1 /* caller thread */);
const struct fxdiv_divisor_size_t num_threads_to_use = fxdiv_init_size_t(min(threads_count.value, capped_num_threads));

align

* As per this change, this feature is not available in GCD based
* pthreadpool
*/
pthreadpool_atomic_size_t num_threads_to_use;

same comment here, we should just have capped_num_threads

Author

You mean just use capped_num_threads? Is that what you mean by "directly"? We cannot do that, because capped_num_threads is a thread-local variable and each thread will have a different value for it, which we cannot communicate to the pool's worker threads except via an atomic variable. But possibly I overlooked something.
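
Roughly, the flow is the following (a self-contained sketch with hypothetical names; publish_thread_count is only for illustration, the actual diff does this inline in pthreadpool_parallelize via pthreadpool_store_relaxed_size_t):

    #include <assert.h>
    #include <stdatomic.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Per-thread cap, set by the calling thread before it dispatches work. */
    static _Thread_local size_t capped_num_threads = SIZE_MAX;

    void pthreadpool_cap_num_threads(size_t num_threads) {
      assert(num_threads > 0);
      capped_num_threads = num_threads;
    }

    /* On dispatch, the calling thread publishes the effective count through an
     * atomic field on the pool object so that the worker threads can read it. */
    static void publish_thread_count(atomic_size_t* num_threads_to_use, size_t pool_size) {
      const size_t n = capped_num_threads < pool_size ? capped_num_threads : pool_size;
      atomic_store_explicit(num_threads_to_use, n, memory_order_relaxed);
    }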

@kimishpatel
Author

Thanks @kimishpatel Overall I'm still not understanding why capping the # threads requires the bulk of the changes here.

Is there a simple explanation?

Not sure what you mean. Are you saying that your intuitive understanding is that it should be a simpler change?

The changes in other places are required because we want to:

  1. Distribute the work only among at most "max threads" threads (I think that terminology is probably better than "capped threads").
  2. Have each thread figure out whether it has work to do. This is needed because each thread in the threadpool does not have its own command queue (I have thought of that change as well, but it is a bigger change).
  3. Ensure the cap is not observed at the time of threadpool creation and destruction.

Owner

@Maratyszcza Maratyszcza left a comment

My biggest concern is that this code doesn't do what you expect: it still wakes up all threads in the pthreadpool, and then sends some of them to sleep right away. I.e. work dispatch doesn't use all threads, but still pays the full latency cost of synchronizing all threads, even unused ones.

Also, many formatting inconsistencies, please format similarly to existing pthreadpool code.

@@ -1,6 +1,7 @@
#ifndef PTHREADPOOL_H_
#define PTHREADPOOL_H_

#include <stdbool.h>
Owner

This is unnecessary, none of the functions in the header use bool.

Author

Aaah. This is from my dev setup. Will fix.

src/pthreads.c Outdated
@@ -54,6 +85,7 @@
#include "threadpool-object.h"
#include "threadpool-utils.h"

thread_local size_t capped_num_threads = UINT_MAX;
Owner

Thread-local cap doesn't make sense, the same thread can work with different pthreadpool_t objects and multiple threads can submit tasks to the same pthreadpool_t object.

Limit on the number of threads should be a property of pthreadpool_t object, not of a thread.

Author

I think that seems fair. I had something else in mind, but what you said makes more sense.

Author

Actually, for the case of "multiple threads can submit tasks to the same pthreadpool_t object", the issue is this. If capped_num_threads were part of the pthreadpool_t object, then we would have to update it atomically, and depending on whose update finishes last we might apply different caps. So this may break when thread 1 sets the cap to 2, but it later gets set to 3 by thread 2, and then thread 1 runs with a cap of 3 threads, as does thread 2. In order to make this work we would have to change the *parallelize* API to account for the cap.

But for "the same thread can work with different pthreadpool_t objects", what you said makes sense.

I don't have a good solution, but I did want to point it out in case I missed something.

@kimishpatel
Author

My biggest concern is that this code doesn't do what you expect: it still wakes up all threads in the pthreadpool, and then sends some of them to sleep right away. I.e. work dispatch doesn't use all threads, but still pays the full latency cost of synchronizing all threads, even unused ones.

That is correct, but a more appropriate fix requires a separate command buffer for each thread in the pool. However, that is a much larger change, so I refrained from it. Another reason is that it also complicates work stealing, although it should be doable.

My thought was to try such a change in a follow-up PR, but I am open to suggestions, and if you think we should do that from the get-go, that is also OK.

Also, many formatting inconsistencies, please format similarly to existing pthreadpool code.

Yes, my bad. I did not realize this. Will fix.

@kimishpatel
Author

My biggest concern is that this code doesn't do what you expect: it still wakes up all threads in the pthreadpool, and then sends some of them to sleep right away. I.e. work dispatch doesn't use all threads, but still pays the full latency cost of synchronizing all threads, even unused ones.

Also, btw, for the behavior we observed where 4 threads get mapped to 3 cores, which results in threads swapping each other out, this solution still works even though unused threads are spuriously woken up, since, I suppose, the latency of the compute tends to be longer than that of waking up and going back to sleep.

@kimishpatel
Author

@Maratyszcza can you please re-review this?

In the latest commit I have addressed two of your concerns:

  • All threads waking up but only some participating. This is fixed by making the
    command/wakeup logic per-thread.
  • Using the pthreadpool object to convey the # of threads to use, rather than a
    thread_local variable. This is in the second commit.

I personally feel the second commit somewhat diminishes the value of what we are trying to achieve here. If you have multiple threads, each running something (PyTorch models in this instance) that uses a global threadpool (and I would assume this to be the more common pattern), then this is what would happen: thread 1 sets the max # of threads on the threadpool object, and subsequently thread 2 sets it to another value. Now if the runs of both threads are interleaved, then thread 1 is forced to use the value set by thread 2.
On the other hand, I do understand your concern about multiple threadpool objects being subject to the same constraint.

If you have a better suggestion, I am happy to hear.

Look forward to your comments.

* Purpose of this is to ameliorate some perf degradation observed
* due to OS mapping a given set of threads to fewer cores.
*/
void pthreadpool_set_max_num_threads(struct pthreadpool* threadpool, size_t num_threads);
Owner

Use pthreadpool_t instead of struct pthreadpool*

* due to OS mapping a given set of threads to fewer cores.
*/
void pthreadpool_set_max_num_threads(struct pthreadpool* threadpool, size_t num_threads);
size_t pthreadpool_get_max_num_threads();
Owner

This function should have void in the parameter list for compatibility with C
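
That is, presumably:

    size_t pthreadpool_get_max_num_threads(void);

In C, an empty parameter list () declares a function taking an unspecified number of arguments, whereas (void) declares one that takes none.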

Author

Aah good to know. Did not know this.

@@ -20,7 +20,6 @@
#include "threadpool-object.h"
#include "threadpool-utils.h"


Owner

Revert

@@ -79,6 +79,31 @@ struct PTHREADPOOL_CACHELINE_ALIGNED thread_info {
*/
HANDLE thread_handle;
#endif

Owner

I don't see why all changes in this file are needed. Please revert.

Author

So this PR adds per-thread wakeup logic so that only participating threads are woken up. That is why this change was needed. If you have better suggestions, I am open to them.

src/windows.c Outdated
@@ -22,6 +22,7 @@
#include "threadpool-object.h"
#include "threadpool-utils.h"

thread_local size_t max_num_threads = UINT_MAX;
Owner

There should be no thread-local variables

@@ -53,11 +54,11 @@ static void wait_worker_threads(struct pthreadpool* threadpool, uint32_t event_i
}

static uint32_t wait_for_new_command(
struct pthreadpool* threadpool,
struct thread_info* thread,
Owner

Why the API change?

Author

Because both the wakeup and the command are now per-thread.

src/windows.c Outdated
@@ -147,6 +148,7 @@ struct pthreadpool* pthreadpool_create(size_t threads_count) {
return NULL;
}
threadpool->threads_count = fxdiv_init_size_t(threads_count);
pthreadpool_store_relaxed_size_t(&threadpool->num_threads_to_use, threads_count);
Owner

Wrong indentation

Author

Sorry.

src/windows.c Outdated
@@ -190,6 +195,14 @@ struct pthreadpool* pthreadpool_create(size_t threads_count) {
return threadpool;
}

void pthreadpool_set_max_num_threads(struct pthreadpool* threadpool, size_t num_threads) {
pthread_mutex_lock(&threadpool->execution_mutex);
Owner

threadpool->execution_mutex is a WinAPI mutex handle, use WaitForSingleObject/ReleaseMutex
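
Roughly (a sketch only; the clamp and the num_threads_to_use field follow the rest of this diff):

    #include <windows.h>
    #include "threadpool-object.h" /* struct pthreadpool, pthreadpool_store_relaxed_size_t */

    void pthreadpool_set_max_num_threads(struct pthreadpool* threadpool, size_t num_threads) {
      /* execution_mutex is a WinAPI mutex HANDLE in the Windows port. */
      WaitForSingleObject(threadpool->execution_mutex, INFINITE);
      if (num_threads > threadpool->threads_count.value) {
        num_threads = threadpool->threads_count.value;
      }
      pthreadpool_store_relaxed_size_t(&threadpool->num_threads_to_use, num_threads);
      ReleaseMutex(threadpool->execution_mutex);
    }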

Author

Aah. Thanks for pointing this out. Surprised that the internal Windows build did not fail. I will make sure to build this on Windows.

src/gcd.c Outdated
@@ -21,6 +21,8 @@
#include "threadpool-object.h"
#include "threadpool-utils.h"

thread_local size_t max_num_threads = UINT_MAX;
Owner

This shouldn't be here

src/gcd.c Outdated
@@ -73,6 +76,14 @@ struct pthreadpool* pthreadpool_create(size_t threads_count) {
return threadpool;
}

void pthreadpool_set_max_num_threads(struct pthreadpool* threadpool, size_t num_threads) {
pthread_mutex_lock(&threadpool->execution_mutex);
Owner

threadpool->execution_mutex doesn't exist when targeting GCD, use threadpool->execution_semaphore
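
i.e. something along these lines (a sketch; field names are taken from the rest of this PR):

    #include <dispatch/dispatch.h>
    #include "threadpool-object.h" /* struct pthreadpool, pthreadpool_store_relaxed_size_t */

    void pthreadpool_set_max_num_threads(struct pthreadpool* threadpool, size_t num_threads) {
      /* The GCD port serializes task dispatch through a dispatch semaphore. */
      dispatch_semaphore_wait(threadpool->execution_semaphore, DISPATCH_TIME_FOREVER);
      if (num_threads > threadpool->threads_count.value) {
        num_threads = threadpool->threads_count.value;
      }
      pthreadpool_store_relaxed_size_t(&threadpool->num_threads_to_use, num_threads);
      dispatch_semaphore_signal(threadpool->execution_semaphore);
    }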

Author

Thanks for pointing this out.

Owner

@Maratyszcza Maratyszcza left a comment

Overall, the code needs substantial changes:

  1. Avoid changing existing internal APIs unless it is absolutely needed to add the new functionality. This PR is already very big by itself, and mixed-in changes in unrelated APIs make it hard to review.
  2. Please rename the existing threads_count to max_threads_count and use threads_count for the new functionality.
  3. Validate that threads_count <= max_threads_count once in the public API that sets this variable. Don't clip to max_threads_count in other functions; just assume this holds (optionally add an assert for this). A sketch follows after this list.
  4. Make sure it is tested on Windows and iOS/Mac. I suspect the current version may not compile on these platforms.
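
A minimal sketch of items 2 and 3, with hypothetical names (assuming the rename from item 2; clamping is shown here, but rejecting the value would also satisfy the validation):

    /* Public setter: validate against the pool size exactly once, here. */
    void pthreadpool_set_threads_count(pthreadpool_t threadpool, size_t threads_count) {
      const size_t max_threads_count = threadpool->max_threads_count.value;
      if (threads_count == 0 || threads_count > max_threads_count) {
        threads_count = max_threads_count;
      }
      pthreadpool_store_relaxed_size_t(&threadpool->threads_count, threads_count);
    }

    /* Elsewhere, e.g. in pthreadpool_parallelize, just assume the invariant holds: */
    assert(pthreadpool_load_relaxed_size_t(&threadpool->threads_count) <=
           threadpool->max_threads_count.value);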

@kimishpatel
Author

Responding here:

Overall, the code needs substantial changes:

  1. Avoid changing existing internal APIs unless it is absolutely needed to add the new functionality. This PR is already very big by itself, and mixed-in changes in unrelated APIs make it hard to review.

Internal API changes are needed because we add per-thread command and wait logic.

  2. Please rename the existing threads_count to max_threads_count and use threads_count for the new functionality.

Good suggestion. Will follow up.

  3. Validate that threads_count <= max_threads_count once in the public API that sets this variable. Don't clip to max_threads_count in other functions; just assume this holds (optionally add an assert for this).

Makes sense.

  4. Make sure it is tested on Windows and iOS/Mac. I suspect the current version may not compile on these platforms.

Will do.

@Maratyszcza this leaves me with one high-level question that I mentioned in my earlier message. I am copy-pasting it here:

Should thread_count be set per threadpool object, or set as a thread-local variable?

If you have multiple threads, each running something (PyTorch models in this instance) that uses a global threadpool (and I would assume this to be the more common pattern), then this is what would happen: thread 1 sets the max # of threads on the threadpool object, and subsequently thread 2 sets it to another value. Now if the runs of both threads are interleaved, then thread 1 is forced to use the value set by thread 2.

I do understand your concern that, with the thread-local pattern, multiple threadpool objects would be subject to the same constraint.

@Maratyszcza
Owner

Should thread_count be set per threadpool object, or set as a thread-local variable?

pthreadpool is a low-level library, and for low-level libraries it is preferred to avoid global objects. If necessary, users can implement this functionality at a higher level on top of pthreadpool.

If you have multiple threads, each running something (PyTorch models in this instance) that uses a global threadpool (and I would assume this to be the more common pattern), then this is what would happen: thread 1 sets the max # of threads on the threadpool object, and subsequently thread 2 sets it to another value. Now if the runs of both threads are interleaved, then thread 1 is forced to use the value set by thread 2.

I don't expect it to be a common use-case to use different number of thread pool threads depending on which thread called into it, especially with the current implementation that still wakes up all threads.

@kimishpatel
Author

I don't expect it to be a common use-case to use different number of thread pool threads depending on which thread called into it, especially with the current implementation that still wakes up all threads.

We usually use a singleton pthreadpool object, so multiple threads use a single threadpool. There are cases when multiple models run simultaneously. When the threads running individual models are not interleaved, the implementation in this PR is OK. However, we don't have any control over how these run in the software stack in which PyTorch is integrated. If two threads running two different models start using the new API, then we may run into weird performance issues that are hard to debug, such as the last thread to set the thread count winning.

The issue is that once we expose this API to clients, they expect certain behavior that is not guaranteed. That is why the thread-local setting seems to make more sense to me.

But if you feel strongly about this, I understand.

@kimishpatel
Author

Tested on Windows and Mac.

Summary:
This diff splits the command queue and instead uses commands specific to
each thread.

This enables:
- Waking up only the subset of threads needed.
- Waiting for only a subset of threads.

In this commit the number of threads to use is a thread-local variable.
A subsequent commit makes it a property of the threadpool object.

Test Plan:
pthreadpool-test

Summary:
This commit changes the API to set the max number of threads. It applies the
limit to the pthreadpool object.
