API to specify number of threads, from threadpool, to use for the task #17

Open
wants to merge 2 commits into base: master

Conversation

kimishpatel

Summary:
This PR adds a way to use fewer threads than the number configured in
pthreadpool. Occasionally, using as many threads as there are logical cores
is not efficient. This can be due to system load and other varying factors
that lead to threads either being mapped to slower cores or being mapped to
fewer cores than the number of logical cores (as actually observed).
This PR attempts to fix that.

Approach:

  • Add an API that sets a thread-local variable specifying the number of threads to use.
  • pthreadpool_parallelize then distributes the work only among that many threads.
  • Threads that are not picked continue to wait on the command mutex/condvar for the
    next chunk of work and thus give up their CPU slot.
    Both pthreads.c and windows.c are modified to add this feature. A rough usage sketch
    follows below.
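
A minimal usage sketch of the intent, assuming the pthreadpool_cap_num_threads name from
this PR and the existing pthreadpool_parallelize_1d entry point (my_task and run_capped
are placeholder names; the final API shape may change during review):

    #include <pthreadpool.h>
    #include <stddef.h>

    static void my_task(void* context, size_t i) {
      /* Process element i of whatever is reachable through context. */
    }

    void run_capped(pthreadpool_t threadpool, void* context) {
      /* Dispatch subsequent work from this thread to at most 2 workers;
       * the remaining workers keep waiting for the next command. */
      pthreadpool_cap_num_threads(2);
      pthreadpool_parallelize_1d(threadpool, my_task, context,
                                 /*range=*/1024, /*flags=*/0);
    }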

Test Plan:
4 tests are added to check this.


@kimishpatel
Author

@Maratyszcza would love to get your inputs here. Thanks!

@raziel raziel left a comment

Thanks @kimishpatel
Overall I'm still not understanding why capping the # threads requires the bulk of the changes here.

Is there a simple explanation?

@@ -85,6 +86,12 @@ pthreadpool_t pthreadpool_create(size_t threads_count);
*/
size_t pthreadpool_get_threads_count(pthreadpool_t threadpool);

/*

Maybe we include the why?
E.g.
API to cap the number of threads used to do work, rather than those
currently available in the runtime's threadpool.

    This is useful to counter the potential performance degradation of using more threads
    than optimal for the device and use-case, such as the OS scheduling threads to run on
    smaller cores, at the cost of threading overhead.

Also indicate what happens if num_threads > # threads in pool

Author

Yes, I need to add that.

Owner

Please follow the same pattern (including @param and @returns tag) as the other functions, make sure both pthreadpool_set_max_num_threads and pthreadpool_get_max_num_threads are documented
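
For example, something along these lines for the setter (a sketch only; the exact text
should follow the style of the surrounding header comments, and the getter needs an
analogous block with a @returns tag):

    /**
     * Cap the number of threads used to execute tasks on the thread pool.
     *
     * Threads beyond the cap keep waiting for the next command and give up
     * their CPU slot. If num_threads exceeds the number of threads in the
     * pool, the pool size is used instead.
     *
     * @param  threadpool   the thread pool to configure.
     * @param  num_threads  the maximum number of threads to use; must be non-zero.
     */
    void pthreadpool_set_max_num_threads(pthreadpool_t threadpool, size_t num_threads);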

Owner

I think it is better to use pthreadpool_get_threads_count/pthreadpool_set_threads_count for the new API functions, and add a new function pthreadpool_get_max_threads_count that returns the number of threads in the thread pool (what pthreadpool_get_threads_count does now).
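
In other words, roughly (hypothetical declarations, just to illustrate the naming):

    /* Number of threads in the pool (what pthreadpool_get_threads_count returns today). */
    size_t pthreadpool_get_max_threads_count(pthreadpool_t threadpool);
    /* Limit actually used when dispatching work. */
    size_t pthreadpool_get_threads_count(pthreadpool_t threadpool);
    void pthreadpool_set_threads_count(pthreadpool_t threadpool, size_t threads_count);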

src/gcd.c Outdated
@@ -73,6 +75,15 @@ struct pthreadpool* pthreadpool_create(size_t threads_count) {
return threadpool;
}

void pthreadpool_cap_num_threads(size_t num_threads) {

shouldn't this check vs threads_count?
And return a bool to say if it actually capped anything.

Author

We don't have access to the threadpool object here. I think your terminology of max threads actually explains the behavior better: it says "do not use more than max_threads". "Capping" somehow implies that some pre-existing value has to be capped.

src/gcd.c Outdated
@@ -99,15 +110,16 @@ PTHREADPOOL_INTERNAL void pthreadpool_parallelize(

/* Locking of completion_mutex not needed: readers are sleeping on command_condvar */
const struct fxdiv_divisor_size_t threads_count = threadpool->threads_count;
const struct fxdiv_divisor_size_t num_threads_to_use = fxdiv_init_size_t(min(threads_count.value, capped_num_threads));

I think this should be done directly in the method.

Author

which method?

src/pthreads.c Outdated
@@ -1,3 +1,34 @@
/*
* Overall architecture:
* 1. capped_num_threads is used to specifiy max threads to use.

I think that we should be clear that capped_num_threads is <= threads_count.
Let's not introduce another term (max threads).

Author

Are you saying that we should document this? Because pthreadpool_cap_num_threads does not know threadpool size.

src/pthreads.c Outdated
* However thread 2 will never see cmd 3 because the masked bit is same as cmd1 and it is not perceived as new command.
* To fix this we must add this invariant:
* - Master thread must synchronize with all threads before submitting next command regardless of the eligibility of threads to paricipate in the work.
* Testing:

you mean you ran this test?

Author

That is the test I added. I still could not figure out a good way to ensure that only the specified number of threads is used. Of course I validated it manually, but nothing else besides that.

src/pthreads.c Outdated
Comment on lines 187 to 144
uint32_t last_flags,
size_t thread_id)
{

This code seems to mix tabs and spaces; you should fix this.

Author

My bad. Not sure how that crept in.

src/pthreads.c Outdated
* At this point thread 2 starts waiting with last_command = cmd 1 because it never saw cmd 2. Master thread submits cmd 3. At cmd 3 all three threads are eligible.
* However thread 2 will never see cmd 3 because the masked bit is same as cmd1 and it is not perceived as new command.
* To fix this we must add this invariant:
* - Master thread must synchronize with all threads before submitting next command regardless of the eligibility of threads to paricipate in the work.

have you estimated the cost of doing this sync?

Author

Only on one model we wanted to test, but I need to do more comprehensive benchmarking.

src/pthreads.c Outdated
@@ -291,6 +388,15 @@ struct pthreadpool* pthreadpool_create(size_t threads_count) {
return threadpool;
}

void pthreadpool_cap_num_threads(size_t num_threads) {
assert(num_threads > 0);
capped_num_threads = num_threads;

Same comments as in the other file.
That way we don't need another num_threads_to_use but can use capped_num_threads directly which is cleaner.

Author

I think I asked this before but to be sure, directly where?

src/pthreads.c Outdated
@@ -322,7 +428,11 @@ PTHREADPOOL_INTERNAL void pthreadpool_parallelize(

/* Locking of completion_mutex not needed: readers are sleeping on command_condvar */
const struct fxdiv_divisor_size_t threads_count = threadpool->threads_count;
pthreadpool_store_relaxed_size_t(&threadpool->active_threads, threads_count.value - 1 /* caller thread */);
const struct fxdiv_divisor_size_t num_threads_to_use = fxdiv_init_size_t(min(threads_count.value, capped_num_threads));

align

* As per this change, this feature is not available in GCD based
* pthreadpool
*/
pthreadpool_atomic_size_t num_threads_to_use;

same comment here, we should just have capped_num_threads

Author

You mean just use capped_num_threads? Is that what you mean by "directly"? We cannot do that, because capped_num_threads is a thread-local variable and each thread will have a different value for it, which we cannot communicate to the pool's worker threads except via an atomic variable. But possibly I overlooked something.
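
Roughly, the flow is the following (a self-contained sketch with hypothetical names; publish_thread_count is only for illustration, the actual diff does this inline in pthreadpool_parallelize via pthreadpool_store_relaxed_size_t):

    #include <assert.h>
    #include <stdatomic.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Per-thread cap, set by the calling thread before it dispatches work. */
    static _Thread_local size_t capped_num_threads = SIZE_MAX;

    void pthreadpool_cap_num_threads(size_t num_threads) {
      assert(num_threads > 0);
      capped_num_threads = num_threads;
    }

    /* On dispatch, the calling thread publishes the effective count through an
     * atomic field on the pool object so that the worker threads can read it. */
    static void publish_thread_count(atomic_size_t* num_threads_to_use, size_t pool_size) {
      const size_t n = capped_num_threads < pool_size ? capped_num_threads : pool_size;
      atomic_store_explicit(num_threads_to_use, n, memory_order_relaxed);
    }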

@kimishpatel
Author

Thanks @kimishpatel Overall I'm still not understanding why capping the # threads requires the bulk of the changes here.

Is there a simple explanation?

Not sure what you mean. Are you saying that your intuitive understanding is that it should be a simpler change?

The changes in other places are required because we want to:

  1. Distribute the work only among at most "max threads" threads (I think that terminology is probably better than "capped threads").
  2. Have each thread figure out whether it has work to do. This is needed because each thread in the threadpool does not have its own command queue (I have thought of that change as well, but it is a bigger change).
  3. Ensure the cap is not observed at the time of threadpool creation and destruction.

Owner

@Maratyszcza Maratyszcza left a comment

My biggest concern is that this code doesn't do what you expect: it still wakes up all threads in the pthreadpool, and then sends some of them to sleep right away. I.e. work dispatch doesn't use all threads, but still pays the full latency cost of synchronizing all threads, even unused ones.

Also, many formatting inconsistencies, please format similarly to existing pthreadpool code.

@@ -1,6 +1,7 @@
#ifndef PTHREADPOOL_H_
#define PTHREADPOOL_H_

#include <stdbool.h>
Owner

This is unnecessary, none of the functions in the header use bool.

Author

Aaah. This is from my dev setup. Will fix.

src/pthreads.c Outdated
@@ -54,6 +85,7 @@
#include "threadpool-object.h"
#include "threadpool-utils.h"

thread_local size_t capped_num_threads = UINT_MAX;
Owner

Thread-local cap doesn't make sense, the same thread can work with different pthreadpool_t objects and multiple threads can submit tasks to the same pthreadpool_t object.

Limit on the number of threads should be a property of pthreadpool_t object, not of a thread.

Author

I think that seems fair. I had something else in mind, but what you said makes more sense.

Author

Actually, for the case of "multiple threads can submit tasks to the same pthreadpool_t object", the issue is this. If capped_num_threads were part of the pthreadpool_t object, then we would have to update it atomically, and depending on whose update finishes last we might apply different caps. So this may break when thread 1 sets the cap to 2, but it later gets set to 3 by thread 2, and then thread 1 runs with a cap of 3 threads, as does thread 2. In order to make this work we would have to change the *parallelize* API to account for the cap.

But for "the same thread can work with different pthreadpool_t objects", what you said makes sense.

I don't have a good solution, but I did want to point it out in case I missed something.

@kimishpatel
Author

My biggest concern is that this code doesn't do what you expect: it still wakes up all threads in the pthreadpool, and then sends some of them to sleep right away. I.e. work dispatch doesn't use all threads, but still pays the full latency cost of synchronizing all threads, even unused ones.

That is correct, but a more appropriate fix requires a separate command buffer for each thread in the pool. However, that is a much larger change, so I refrained from it. Another reason is that it also complicates work stealing, although it should be doable.

My thought was to try such a change in a follow-up PR, but I am open to suggestions, and if you think we should do that from the get-go, that is also OK.

Also, many formatting inconsistencies, please format similarly to existing pthreadpool code.

Yes, my bad. I did not realize this. Will fix.

@kimishpatel
Author

My biggest concern is that this code doesn't do what you expect: it still wakes up all threads in the pthreadpool, and then sends some of them to sleep right away. I.e. work dispatch doesn't use all threads, but still pays the full latency cost of synchronizing all threads, even unused ones.

Also, btw, for the behavior we observed where 4 threads get mapped to 3 cores, which results in threads swapping each other out, this solution still works even though unused threads are spuriously woken up, since, I suppose, the latency of the compute tends to be longer than that of waking up and going back to sleep.

@kimishpatel
Author

@Maratyszcza can you please re-review this?

In the latest commit I have addressed two of your concerns:

  • All threads waking up but only some participating. This is fixed by making the
    command/wakeup logic per-thread.
  • Using the pthreadpool object to convey the # of threads to use, rather than a
    thread_local variable. This is in the second commit.

I personally feel the second commit somewhat diminishes the value of what we are trying to achieve here. If you have multiple threads, each running something (PyTorch models in this instance) that uses a global threadpool (and I would assume this to be the more common pattern), then this is what would happen: thread 1 sets the max # of threads on the threadpool object, and subsequently thread 2 sets it to another value. Now if the runs of both threads are interleaved, then thread 1 is forced to use the value set by thread 2.
On the other hand, I do understand your concern about multiple threadpool objects being subject to the same constraint.

If you have a better suggestion, I am happy to hear.

Look forward to your comments.

* Purpose of this is to ameliorate some perf degradation observed
* due to OS mapping a given set of threads to fewer cores.
*/
void pthreadpool_set_max_num_threads(struct pthreadpool* threadpool, size_t num_threads);
Owner

Use pthreadpool_t instead of struct pthreadpool*

* due to OS mapping a given set of threads to fewer cores.
*/
void pthreadpool_set_max_num_threads(struct pthreadpool* threadpool, size_t num_threads);
size_t pthreadpool_get_max_num_threads();
Owner

This function should have void in the parameter list for compatibility with C
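
That is, presumably:

    size_t pthreadpool_get_max_num_threads(void);

In C, an empty parameter list () declares a function taking an unspecified number of arguments, whereas (void) declares one that takes none.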

Author

Aah good to know. Did not know this.

@@ -20,7 +20,6 @@
#include "threadpool-object.h"
#include "threadpool-utils.h"


Owner

Revert

@@ -79,6 +79,31 @@ struct PTHREADPOOL_CACHELINE_ALIGNED thread_info {
*/
HANDLE thread_handle;
#endif

Owner

I don't see why all changes in this file are needed. Please revert.

Author

So this PR adds per-thread wakeup logic so that only participating threads are woken up. That is why this change was needed. If you have better suggestions, I am open to them.

src/windows.c Outdated
@@ -22,6 +22,7 @@
#include "threadpool-object.h"
#include "threadpool-utils.h"

thread_local size_t max_num_threads = UINT_MAX;
Owner

There should be no thread-local variables

@@ -53,11 +54,11 @@ static void wait_worker_threads(struct pthreadpool* threadpool, uint32_t event_i
}

static uint32_t wait_for_new_command(
struct pthreadpool* threadpool,
struct thread_info* thread,
Owner

Why the API change?

Author

Because both the wakeup and the command are now per-thread.

src/windows.c Outdated
@@ -147,6 +148,7 @@ struct pthreadpool* pthreadpool_create(size_t threads_count) {
return NULL;
}
threadpool->threads_count = fxdiv_init_size_t(threads_count);
pthreadpool_store_relaxed_size_t(&threadpool->num_threads_to_use, threads_count);
Owner

Wrong indentation

Author

Sorry.

src/windows.c Outdated
@@ -190,6 +195,14 @@ struct pthreadpool* pthreadpool_create(size_t threads_count) {
return threadpool;
}

void pthreadpool_set_max_num_threads(struct pthreadpool* threadpool, size_t num_threads) {
pthread_mutex_lock(&threadpool->execution_mutex);
Owner

threadpool->execution_mutex is a WinAPI mutex handle, use WaitForSingleObject/ReleaseMutex
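
Roughly (a sketch only; the clamp and the num_threads_to_use field follow the rest of this diff):

    #include <windows.h>
    #include "threadpool-object.h" /* struct pthreadpool, pthreadpool_store_relaxed_size_t */

    void pthreadpool_set_max_num_threads(struct pthreadpool* threadpool, size_t num_threads) {
      /* execution_mutex is a WinAPI mutex HANDLE in the Windows port. */
      WaitForSingleObject(threadpool->execution_mutex, INFINITE);
      if (num_threads > threadpool->threads_count.value) {
        num_threads = threadpool->threads_count.value;
      }
      pthreadpool_store_relaxed_size_t(&threadpool->num_threads_to_use, num_threads);
      ReleaseMutex(threadpool->execution_mutex);
    }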

Author

Aah. Thanks for pointing this out. Surprised that the internal Windows build did not fail. I will make sure to build this on Windows.

src/gcd.c Outdated
@@ -21,6 +21,8 @@
#include "threadpool-object.h"
#include "threadpool-utils.h"

thread_local size_t max_num_threads = UINT_MAX;
Owner

This shouldn't be here

src/gcd.c Outdated
@@ -73,6 +76,14 @@ struct pthreadpool* pthreadpool_create(size_t threads_count) {
return threadpool;
}

void pthreadpool_set_max_num_threads(struct pthreadpool* threadpool, size_t num_threads) {
pthread_mutex_lock(&threadpool->execution_mutex);
Owner

threadpool->execution_mutex doesn't exist when targeting GCD, use threadpool->execution_semaphore
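
i.e. something along these lines (a sketch; field names are taken from the rest of this PR):

    #include <dispatch/dispatch.h>
    #include "threadpool-object.h" /* struct pthreadpool, pthreadpool_store_relaxed_size_t */

    void pthreadpool_set_max_num_threads(struct pthreadpool* threadpool, size_t num_threads) {
      /* The GCD port serializes task dispatch through a dispatch semaphore. */
      dispatch_semaphore_wait(threadpool->execution_semaphore, DISPATCH_TIME_FOREVER);
      if (num_threads > threadpool->threads_count.value) {
        num_threads = threadpool->threads_count.value;
      }
      pthreadpool_store_relaxed_size_t(&threadpool->num_threads_to_use, num_threads);
      dispatch_semaphore_signal(threadpool->execution_semaphore);
    }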

Author

Thanks for pointing this out.

Owner

@Maratyszcza Maratyszcza left a comment

Overall, the code needs substantial changes:

  1. Avoid changing existing internal APIs unless it is absolutely needed to add the new functionality. This PR is already very big by itself, and mixed-in changes in unrelated APIs make it hard to review.
  2. Please rename the existing threads_count to max_threads_count and use threads_count for the new functionality.
  3. Validate that threads_count <= max_threads_count once in the public API that sets this variable. Don't clip to max_threads_count in other functions; just assume this holds (optionally add an assert for this). A sketch follows after this list.
  4. Make sure it is tested on Windows and iOS/Mac. I suspect the current version may not compile on these platforms.
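
A minimal sketch of items 2 and 3, with hypothetical names (assuming the rename from item 2; clamping is shown here, but rejecting the value would also satisfy the validation):

    /* Public setter: validate against the pool size exactly once, here. */
    void pthreadpool_set_threads_count(pthreadpool_t threadpool, size_t threads_count) {
      const size_t max_threads_count = threadpool->max_threads_count.value;
      if (threads_count == 0 || threads_count > max_threads_count) {
        threads_count = max_threads_count;
      }
      pthreadpool_store_relaxed_size_t(&threadpool->threads_count, threads_count);
    }

    /* Elsewhere, e.g. in pthreadpool_parallelize, just assume the invariant holds: */
    assert(pthreadpool_load_relaxed_size_t(&threadpool->threads_count) <=
           threadpool->max_threads_count.value);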

@kimishpatel
Author

Responding here:

Overall, the code needs substantial changes:

  1. Avoid changing existing internal APIs unless it is absolutely needed to add the new functionality. This PR is already very big by itself, and mixed-in changes in unrelated APIs make it hard to review.

Internal API changes are needed because we add per-thread command and wait logic.

  2. Please rename the existing threads_count to max_threads_count and use threads_count for the new functionality.

Good suggestion. Will follow up.

  3. Validate that threads_count <= max_threads_count once in the public API that sets this variable. Don't clip to max_threads_count in other functions; just assume this holds (optionally add an assert for this).

Makes sense.

  4. Make sure it is tested on Windows and iOS/Mac. I suspect the current version may not compile on these platforms.

Will do.

@Maratyszcza this leaves me with one high-level question that I mentioned in my earlier message. I am copy-pasting it here:

Should thread_count be set per threadpool object, or set as a thread-local variable?

If you have multiple threads, each running something (PyTorch models in this instance) that uses a global threadpool (and I would assume this to be the more common pattern), then this is what would happen: thread 1 sets the max # of threads on the threadpool object, and subsequently thread 2 sets it to another value. Now if the runs of both threads are interleaved, then thread 1 is forced to use the value set by thread 2.

I do understand your concern that, with the thread-local pattern, multiple threadpool objects would be subject to the same constraint.

@Maratyszcza
Owner

Should thread_count be set per threadpool object, or set as a thread-local variable?

pthreadpool is a low-level library, and for low-level libraries it is preferred to avoid global objects. If necessary, users can implement this functionality at a higher level on top of pthreadpool.

If you have multiple threads, each running something (PyTorch models in this instance) that uses a global threadpool (and I would assume this to be the more common pattern), then this is what would happen: thread 1 sets the max # of threads on the threadpool object, and subsequently thread 2 sets it to another value. Now if the runs of both threads are interleaved, then thread 1 is forced to use the value set by thread 2.

I don't expect it to be a common use-case to use different number of thread pool threads depending on which thread called into it, especially with the current implementation that still wakes up all threads.

@kimishpatel
Author

I don't expect it to be a common use-case to use different number of thread pool threads depending on which thread called into it, especially with the current implementation that still wakes up all threads.

We usually use a singleton pthreadpool object, so multiple threads use a single threadpool. There are cases when multiple models run simultaneously. When the threads running individual models are not interleaved, the implementation in this PR is OK. However, we don't have any control over how these run in the software stack in which PyTorch is integrated. If two threads running two different models start using the new API, then we may run into weird performance issues that are hard to debug, such as the last thread to set the thread count winning.

The issue is that once we expose this API to clients, they expect certain behavior that is not guaranteed. That is why the thread-local setting seems to make more sense to me.

But if you feel strongly about this, I understand.

@kimishpatel
Author

Tested on Windows and Mac.

Summary:
This diff splits the command queue and instead uses commands specific to
each thread.

This enables:
- Waking up only the subset of threads needed.
- Waiting for only a subset of threads.

In this commit the number of threads to use is a thread-local variable.
A subsequent commit makes it a property of the threadpool object.

Test Plan:
pthreadpool-test

Summary:
This commit changes the API to set the max number of threads. It applies the
limit to the pthreadpool object.
