
Profiler gets confused when two pipelines run at the same time #5796

Open
dsharletg opened this issue Mar 9, 2021 · 21 comments · May be fixed by #8153

Comments

@dsharletg
Contributor

When profiling is enabled for two pipelines running at the same time, the results are confusing, because both pipelines write the single global profiler token, and the token values mean different things in each pipeline.

This happens when using an extern stage that itself calls another Halide pipeline, but the same issue could occur when running two different pipelines from two different threads.
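A minimal sketch of the hazard being described (hypothetical func ids and simplified types, not the actual runtime API): both running pipelines write into a single global token, so each sample can only be attributed to whichever pipeline wrote last.

    // One global slot shared by every pipeline instance in the process.
    volatile int current_func = -1;

    void run_pipeline_a() { current_func = 2;   /* ... pipeline A's func 2 is running ... */ }
    void run_pipeline_b() { current_func = 107; /* ... pipeline B's func 107 is running ... */ }

    // A sampler that reads current_func sees only one of the two concurrently
    // running pipelines at each sample, so the whole sampling interval is billed
    // to whichever pipeline wrote the token last.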

@mcourteaux
Contributor

@abadams Could you tag this with dev-meeting?

I'm currently seeing something that could be related. The Halide profiler reports this:

neonraw_bilateral_grid_loglum_constructor_8
 total time: 665.483459 ms  samples: 621  runs: 48  time/run: 13.864239 ms
 average threads used: 1.107890
 heap allocations: 0  peak heap usage: 0 bytes
    halide_malloc:                0.000ms ( 0.0%)   threads: 0.000 
    halide_free:                  0.000ms ( 0.0%)   threads: 0.000 
    bilateral_grid:              13.864ms (100.0%)  threads: 1.107 

While Tracy clearly shows that this takes 39ms per run.

[Tracy screenshot]

Looking again at the profiler runtime, I indeed don't see any mechanism to support multiple threads calling Halide pipelines. Is it undocumented but by design that the profiler can only be used from single-threaded applications?

@abadams
Member

abadams commented Mar 6, 2024

Multiple threads with different pipelines running are supposed to work. Each pipeline gets its own unique range of integer func ids, tied to that pipeline by name.

profiler_common.cpp does something like this when a pipeline is entered:

    pipeline_state->first_func_id = global_state->first_free_id;
    ...
    global_state->first_free_id += /* number of funcs in this pipeline */;
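
A worked example with illustrative numbers (not from the source): if pipeline A is entered first with 4 funcs and pipeline B next with 3 funcs, then:

    // a hypothetical allocation after both pipelines have been entered
    // pipeline A: first_func_id = 0, func ids 0..3
    // pipeline B: first_func_id = 4, func ids 4..6
    // a sampled token of 5 therefore always means "pipeline B, second func"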

Multiple threads running the same pipeline get billed to the same entry. That should still work because total time will be the time from both instances, and total runs will be the number of runs for both instances.

Is Tracy telling you the wall-clock time between pipeline entry and exit? That's not what the profiler does. It shows how much time was spent in that pipeline alone, so it doesn't include time spent in any other Halide pipeline running at the same time.

@abadams
Member

abadams commented Mar 6, 2024

(So I believe the original issue is wrong)

@mcourteaux
Contributor

Is Tracy telling you the wall-clock time between pipeline entry and exit? That's not what the profiler does. It shows how much time was spent in that pipeline alone, so it doesn't include time spent in any other Halide pipeline running at the same time.

Tracy measures the wall-clock time between pipeline entry and exit. It does that for every invocation of the pipeline, and then reports various statistics, as shown in the screenshot: mean time per invocation, among others. The mean over all 48 invocations of the pipeline is 39.6ms. The total time spent in all of these invocations (possibly coming from different threads) is 1.9s. The Halide profiler, on the other hand, is very wrong: it shows 665ms of total runtime, and averaged over the 48 invocations it gives 13.8ms per run.

My guess is that two simultaneous invocations of the same pipeline, coming from two different threads, are actually only billed as one? As I mentioned earlier, I don't see any mechanism in the code that would account for the fact that two threads might be doing different work: I just see halide_profiler_state::current_func, which sort of suggests to me that the profiler can only be aware of one Func running at any given time in the entire program.

Multiple threads running the same pipeline get billed to the same entry. That should still work because total time will be the time from both instances, and total runs will be the number of runs for both instances.

I don't see how that would be true, given what I just typed out above. More specifically, the Halide profiler's sampling thread bills each sampling interval only once (i.e., to one function):

WEAK void bill_func(halide_profiler_state *s, int func_id, uint64_t time, int active_threads) {
    halide_profiler_pipeline_stats *p_prev = nullptr;
    for (halide_profiler_pipeline_stats *p = s->pipelines; p;
         p = (halide_profiler_pipeline_stats *)(p->next)) {
        if (func_id >= p->first_func_id && func_id < p->first_func_id + p->num_funcs) {
            if (p_prev) {
                // Bubble the pipeline to the top to speed up future queries.
                p_prev->next = (halide_profiler_pipeline_stats *)(p->next);
                p->next = s->pipelines;
                s->pipelines = p;
            }
            halide_profiler_func_stats *f = p->funcs + func_id - p->first_func_id;
            f->time += time;
            f->active_threads_numerator += active_threads;
            f->active_threads_denominator += 1;
            p->time += time;
            p->samples++;
            p->active_threads_numerator += active_threads;
            p->active_threads_denominator += 1;
            return;
        }
        p_prev = p;
    }
    // Someone must have called reset_state while a kernel was running. Do nothing.
}

extern "C" WEAK int halide_profiler_sample(struct halide_profiler_state *s, uint64_t *prev_t) {
    int func, active_threads;
    if (s->get_remote_profiler_state) {
        // Execution has disappeared into remote code running
        // on an accelerator (e.g. Hexagon DSP)
        s->get_remote_profiler_state(&func, &active_threads);
    } else {
        func = s->current_func;
        active_threads = s->active_threads;
    }
    uint64_t t_now = halide_current_time_ns(nullptr);
    if (func == halide_profiler_please_stop) {
#if TIMER_PROFILING
        s->sampling_thread = nullptr;
#endif
        return -1;
    } else if (func >= 0) {
        // Assume all time since I was last awake is due to
        // the currently running func.
        bill_func(s, func, t_now - *prev_t, active_threads);
    }
    *prev_t = t_now;
    return s->sleep_time;
}

WEAK void sampling_profiler_thread(void *) {
    halide_profiler_state *s = halide_profiler_get_state();
    // grab the lock
    halide_mutex_lock(&s->lock);
    while (s->current_func != halide_profiler_please_stop) {
        uint64_t t1 = halide_current_time_ns(nullptr);
        uint64_t t = t1;
        while (true) {
            int sleep_ms = halide_profiler_sample(s, &t);
            if (sleep_ms < 0) {
                break;
            }
            // Release the lock, sleep, reacquire.
            halide_mutex_unlock(&s->lock);
            halide_sleep_ms(nullptr, sleep_ms);
            halide_mutex_lock(&s->lock);
        }
    }
    halide_mutex_unlock(&s->lock);
}

@abadams
Member

abadams commented Mar 7, 2024

halide_profiler_state::current_func is the sum of the func id from the pipeline, and a base value unique to each pipeline. So distinct pipelines have a distinct range of values. The base value is allocated here: https://github.com/halide/Halide/blob/main/src/runtime/profiler_common.cpp#L67-L94

and the sum is performed here:
https://github.com/halide/Halide/blob/main/src/runtime/profiler_inlined.cpp#L14

So there's no way for two different pipelines to ever set the profiling token to the same value.
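
A simplified sketch of that sum (illustrative only; the real helper in profiler_inlined.cpp may have a different signature):

    #include "HalideRuntime.h"

    // The token written into the shared profiler state is the per-pipeline base
    // (first_func_id, allocated in profiler_common.cpp) plus the func's local id,
    // so the ranges used by different pipelines never overlap.
    void set_current_func(halide_profiler_state *s, int pipeline_base, int local_func_id) {
        s->current_func = pipeline_base + local_func_id;
    }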

So what the profiler ends up measuring (in expectation) is not the wall-clock time between the start and end of a pipeline invocation, like Tracy. It measures the amount of time spent in that pipeline minus the amount of time spent in any other Halide pipeline running at the same time. If another instance of the same pipeline is running, they get billed together, because they have the same base id. But this gets divided by the number of pipeline runs, so the final value is still effectively the time spent in one instance, minus the time spent in any other Halide pipeline running at the same time.

So TL;DR: The Halide profiler and the tracy profiler disagree because they are measuring very different things.

@abadams
Member

abadams commented Mar 7, 2024

I should add that the profiler can produce very biased results if the two pipelines are setting the token at very different rates. The one setting it more frequently clobbers the other one.

@abadams
Member

abadams commented Mar 7, 2024

Continuing that train of thought, are you saying that you'd prefer each pipeline have its own token? The profiler thread would have to iterate through all running pipelines somehow. Currently it can only iterate through all pipelines it has ever seen before.

@mcourteaux
Contributor

mcourteaux commented Mar 7, 2024

halide_profiler_state::current_func is the sum of the func id from the pipeline, and a base value unique to each pipeline. So distinct pipelines have a distinct range of values. The base value is allocated here: https://github.com/halide/Halide/blob/main/src/runtime/profiler_common.cpp#L67-L94

and the sum is performed here: https://github.com/halide/Halide/blob/main/src/runtime/profiler_inlined.cpp#L14

So there's no way for two different pipelines to ever set the profiling token to the same value.

Agreed 100%, but that's not the problem. If a single-threaded pipeline runs twice concurrently, both starting and ending at the exact same time, both taking 100ms, the profiler report will tell you that 100ms is spent in total over 2 runs, giving 50ms per run, which is obviously false.

I think you were thinking more of the statement of the OP of this issue. I'm looking at an even simpler case: one pipeline that gets called multiple times in parallel.

@mcourteaux
Contributor

mcourteaux commented Mar 7, 2024

The reason why this is a problem is that I'm running a bunch of Halide pipelines (both the same and different pipelines) on different threads in my job system. The more worker threads I have, the more Halide pipelines overlap in time, and the "faster" the pipelines appear in the profiler report. The report is super non-deterministic, and depends on how many Halide pipelines are concurrently working.

You can't use the profiler report in any real-world application that launches pipelines from multiple threads.

@abadams
Member

abadams commented Mar 7, 2024

halide_profiler_state::current_func is the sum of the func id from the pipeline, and a base value unique to each pipeline. So distinct pipelines have a distinct range of values. The base value is allocated here: https://github.com/halide/Halide/blob/main/src/runtime/profiler_common.cpp#L67-L94
and the sum is performed here: https://github.com/halide/Halide/blob/main/src/runtime/profiler_inlined.cpp#L14
So there's no way for two different pipelines to ever set the profiling token to the same value.

Agreed 100%, but that's not the problem. If a single-threaded pipeline runs twice concurrently, both starting and ending at the exact same time, both taking 100ms, the profiler report will tell you that 100ms is spent in total over 2 runs, giving 50ms per run, which is obviously false.

I think you were thinking more of the statement of the OP of this issue. I'm looking at an even simpler case: one pipeline that gets called multiple times in parallel.

Ah, I was assuming all pipelines use all available cores, and was trying to report the actual CPU time spent by one instance of that pipeline, not the wall-clock time between the start and end. If the two pipelines were using all cores and running two of them at the same time took 100ms from start to end, 50ms would be correct. Hrm. I don't think a separate current_func variable per pipeline would fix that. You'd need a separate current_func per instance running. We could certainly do that, but it's measuring something different from what the profiler currently measures.
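
A rough sketch of the "separate current_func per instance" idea (purely illustrative data layout, not the actual branch):

    // Each running instance carries its own token. The sampling thread walks the
    // list of live instances and bills each one separately, instead of reading a
    // single global current_func.
    struct instance_profiling_state {
        int current_func;                  // token for this running instance only
        int active_threads;
        instance_profiling_state *next;    // intrusive list of running instances
    };

    struct profiler_shared_state {
        instance_profiling_state *running_instances;  // walked by the sampler
    };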

Let's figure out the ideal behavior at the next dev meeting, because I think any of these choices are implementable.

@abadams abadams added the dev_meeting Topic to be discussed at the next dev meeting label Mar 7, 2024
@abadams
Member

abadams commented Mar 7, 2024

So the top level question we need to answer is: Should interference from other Halide pipelines running at the same time count towards the profiling results for a pipeline? Given that we can only possibly account for interference from other Halide pipelines, and not interference from other stuff happening, maybe the answer is yes. Currently the system is designed to say "no" and subtract any interference, but as you point out that doesn't work if the pipeline isn't using all available resources.

@mcourteaux
Contributor

mcourteaux commented Mar 7, 2024

If the two pipelines were using all cores and running two of them at the same time took 100ms from start to end, 50ms would be correct.

Wow, this is confusing. I understand the argument, and I'm inclined to agree with this statement, but dang, that complicates things... 😅 On the other hand, it is much more tangible for the profiler to report 100ms, because it really just took that long (wall time, start to stop). But true: either only half the machine, or the full machine for half of the time, was working on it. So the time spent by "one full machine" is 50ms. I am starting to think this is a matter of choosing units correctly. We might report both wall time (start to end) and FMms, "full-machine milliseconds". Then the question becomes: what do we do with a single-threaded pipeline running on a 16-core machine with a wall-time duration of 160ms? Do we report 160 FMms, or 10 FMms? I'm inclined to report 10 FMms, which is not misleading as long as you have the accompanying information that the wall time start-to-end was 160ms, and perhaps that machine occupation was 1/16 (or even less if the pipeline got scheduled out by the OS).

Going back to your original statement (the one I quoted on top): the 50ms there that we kinda think of as "correct" would then actually be 50 FMms.

@abadams
Member

abadams commented Mar 8, 2024

I think you have convinced me that the current definition of time taken isn't really workable, and we should just switch to wall-clock times, because it's less confusing and better defined in tricky situations like the ones we're describing.

I have some ideas for how to do this efficiently. I'll try to hack something together for discussion.

@mcourteaux
Contributor

Late at night, I was thinking that FMms is rather confusing, and that core-milliseconds (Cms) are more intuitive and more consistent across machines. Something that takes 10ms of wall time in a single-threaded pipeline will report 10 Cms and 10ms. Something that takes 10ms of wall time in a parallel pipeline that uses all resources available on a 16-core machine gets you 160 Cms and 10ms. I think that makes sense, in case we wish to keep the nuance of machine utilization.
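
A small worked formula for the proposed unit, as I read the examples above (illustrative helper, not an actual API):

    // core-milliseconds = wall-clock milliseconds * average number of cores occupied
    double core_ms(double wall_ms, double avg_cores) { return wall_ms * avg_cores; }
    // core_ms(10, 1)   == 10   -> 10 Cms   (single-threaded, 10ms wall time)
    // core_ms(10, 16)  == 160  -> 160 Cms  (all 16 cores busy, 10ms wall time)
    // core_ms(160, 1)  == 160  -> 160 Cms  (the earlier 160ms single-threaded case,
    //                                       i.e. 10 FMms in full-machine units)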

@abadams
Member

abadams commented Mar 11, 2024

Try out the branch abadams/per_instance_profiling and see what you think.

@mcourteaux
Contributor

Try out the branch abadams/per_instance_profiling and see what you think.

Thanks for this effort! I'll have a more thorough look at it tomorrow. I was very busy with other things today.

@mcourteaux
Contributor

Okay, let's test this. For reference, with the original profiler, here's the same pipeline invoked from 4 worker threads:

neonraw_bilateral_grid_loglum_constructor_8
 total time: 323.572388 ms  samples: 306  runs: 48  time/run: 6.741091 ms
 average threads used: 3.068627
 heap allocations: 0  peak heap usage: 0 bytes
    halide_malloc:                0.000ms ( 0.0%)   threads: 0.000 
    halide_free:                  0.000ms ( 0.0%)   threads: 0.000 
    bilateral_grid:               3.193ms (47.3%)   threads: 2.827 
    logged:                       0.748ms (11.1%)   threads: 2.058  stack: 4096
    lum:                          2.798ms (41.5%)   threads: 3.614  stack: 32

While Tracy reports:
[Tracy screenshot]

Some time is not billed, as discussed before.


Now, with the per-instance-profiling branch:

neonraw_bilateral_grid_loglum_constructor_8
 total time: 456.430878 ms  samples: 408  runs: 48  time per run: 9.508977 ms
 heap allocations: 0  peak heap usage: 0 bytes
    waiting on task queue:        0.000ms ( 0.0%)   
    halide_malloc:                0.000ms ( 0.0%)   
    halide_free:                  0.000ms ( 0.0%)   
    bilateral_grid:               5.646ms (59.3%)   
    logged:                       0.594ms ( 6.2%)    stack: 4096
    lum:                          3.267ms (34.3%)    stack: 32

and Tracy:
[Tracy screenshot]
The total estimate is off by 20ms, or 20ms/48 = 0.41ms per run, which sounds acceptable given that it's a sampling profiler. I validated that this behaves correctly by changing the sleep time to 0ms instead of 1ms. The result is a much closer match:

neonraw_bilateral_grid_loglum_constructor_8
 total time: 447.379089 ms  samples: 8010  runs: 48  time per run: 9.320397 ms
 heap allocations: 0  peak heap usage: 0 bytes
    waiting on task queue:        0.000ms ( 0.0%)   
    halide_malloc:                0.000ms ( 0.0%)   
    halide_free:                  0.000ms ( 0.0%)   
    bilateral_grid:               5.068ms (54.3%)   
    logged:                       0.789ms ( 8.4%)    stack: 4096
    lum:                          3.462ms (37.1%)    stack: 32

and Tracy confirms:
[Tracy screenshot]

I was surprised to see that the sampling thread stops sampling when I'm not running any Halide pipelines! Very neat! I was expecting to see one thread run at 100% sampling. I'm thinking now that perhaps a faster sampling speed would be nice. I'm a bit surprised that a 0ms sleep time only yields 8010 samples / 447ms ≈ 18 samples per millisecond. That sounds very slow. Or perhaps usleep(0) actually yields the thread? I quickly tested again with the call to sleep commented out.

neonraw_bilateral_grid_loglum_constructor_8
 total time: 441.692749 ms  samples: 1066208  runs: 48  time per run: 9.201932 ms
 average threads used: 0.960744
 heap allocations: 0  peak heap usage: 0 bytes
    overhead:                     0.013ms ( 0.1%)   threads: 1.000 
    waiting on task queue:        0.380ms ( 4.1%)   threads: 0.000 
    halide_malloc:                0.000ms ( 0.0%)   threads: 0.000 
    halide_free:                  0.000ms ( 0.0%)   threads: 0.000 
    bilateral_grid:               4.927ms (53.5%)   threads: 1.000 
    logged:                       0.870ms ( 9.4%)   threads: 1.000  stack: 4096
    lum:                          3.010ms (32.7%)   threads: 1.000  stack: 32

Now we get a lot of samples per ms. This is great! However, Tracy disagrees a bit with this result:
[Tracy screenshot]

Perhaps the sampling thread constantly dies and gets respawned, causing lost samples at the beginning?


Next, the "threads: xxx" indication is gone, which surprised me, but is actually perfect, since the pipeline I tested is fully single-threaded.


Testing once more without my sleep-fiddling:

  • total time: 463.312988 ms samples: 421 runs: 48 time per run: 9.652354 ms
    [Tracy screenshot]

And on a non-serial pipeline:

  • total time: 305.989166 ms samples: 280 runs: 48 time per run: 6.374774 ms
    [Tracy screenshot]

Lastly, it's a really nice addition to also profile the amount of time spent waiting for all threads to finish their work in a parallel for (and other threading constructs)!

One minor nitpick: when decreasing the thread count for the last time when the pipeline finishes, there is one final call to set the profiled function to "waiting on thread queue", which is not technically correct:

[screenshot]


Also clever to allocate the instance_state struct on the stack of the pipeline itself! Another comment there: it's not necessary to actually allocate the struct on the stack of the inject_profiling() function to calculate the size of the thing. A simple calculation like:

const int instance_size_bytes = sizeof(halide_profiler_instance_state) + num_funcs * sizeof(halide_profiler_func_stats);

should work as well, I believe.


Overall, I think this works well, and the changes look solid.

@abadams
Member

abadams commented Mar 13, 2024

I was surprised to see that the sampling thread stops sampling when I'm not running any Halide pipelines! Very neat! I was expecting to see one thread run at 100% sampling. I'm thinking now that perhaps a faster sampling speed would be nice. I'm a bit surprised that a 0ms sleep time only yields 8010 samples / 447ms ≈ 18 samples per millisecond. That sounds very slow. Or perhaps usleep(0) actually yields the thread? I quickly tested again with the call to sleep commented out.

Perhaps the sampling thread constantly dies and gets respawned, causing lost samples at the beginning?

It should spawn once and continue sampling until shutdown, but if there are no instances running, all it does is grab the lock, check whether any instances are running, release the lock, and sleep, so it might not show up as anything in a trace. I considered adding a condition variable so that it would sleep for real, but I'm worried any delay in waking up would result in the first Func of each pipeline getting undersampled.

Lastly, it's a really nice addition to also profile the amount of time spent waiting for all threads to finish their work in a parallel for (and other threading constructs)!

This was meant to measure time spent with no tasks scheduled for this pipeline because tasks from some other pipeline got enqueued. The thread pool is most-recently-enqueued-job first, so a pipeline can get starved out (except for the main thread of that pipeline, which doesn't wait on the queue, so at worst it'll run single-threaded).

But now that you mention it, it does indeed mostly measure the amount of time waiting for all threads to finish. Accidental but neat. Given that, is "waiting on thread queue" the right name? I've changed it to "waiting for parallel tasks to finish".

One minor nitpick: when decreasing the thread count for the last time when the pipeline finishes, there is one final call to set the profiled function to "waiting on thread queue", which is not technically correct:

I think it should be OK, because immediately after that the instance state is removed from the list of running instances, so the only way that value would be sampled is if the sampling profiler is holding the lock at the time the pipeline tries to exit, which blocks the instance from removing itself from the list. In that case there is indeed a delay before the instance can exit, while it waits to grab the lock. Still, I think I'll bill that as overhead instead of waiting on thread queue.

Also clever to allocate the instance_state struct on the stack of the pipeline itself! Another comment there: it's not necessary to actually allocate the struct on the stack of the inject_profiling() function to calculate the size of the thing. A simple calculation like:

const int instance_size_bytes = sizeof(halide_profiler_instance_state) + num_funcs * sizeof(halide_profiler_func_stats);

The struct allocated on the stack is only ever used for address calculations, so according to godbolt the actual memory is culled.

I was worried the direct size computation would be wrong if halide_profiler_func_stats doesn't use a multiple of 8 bytes, because halide_profiler_func_stats needs an 8-byte alignment, so there could be 4 bytes of padding... Actually checking godbolt, the way I did it doesn't account for padding either. It's ok right now because the struct contains 8-byte members so sizeof on it rounds up to the next multiple of 8. I'll just add a static assert that sizeof(halide_profiler_instance_state) is a multiple of 8.
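
A minimal sketch of that assertion (assuming the struct name from the branch; the exact wording there may differ):

    // Guarantees that the 8-byte-aligned per-func stats array can be laid out
    // immediately after the instance state without any padding in between.
    static_assert(sizeof(halide_profiler_instance_state) % 8 == 0,
                  "halide_profiler_instance_state size must be a multiple of 8");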

TL;DR I was overthinking it.

Finally, I'm also going to change the profiler to ignore bounds queries to the pipeline. Those run super fast, so if there's one bounds query per actual invocation, it misleadingly reports the runtime as being half of the true runtime, on average.

@abadams abadams linked a pull request Mar 13, 2024 that will close this issue
@abadams
Member

abadams commented Mar 13, 2024

A remaining question: the time-spent-waiting id could be per-func, so you know which parallel loops might need their task size adjusting. That would add a lot of lines of output though. Or maybe it should be another column that only shows up if threads > 1?

@mcourteaux
Contributor

I like the column idea. That sounds perfect! I'm also thinking we could get rid of the malloc and free lines if there are none in the pipeline. That would make sense to me.

@abadams
Member

abadams commented Mar 14, 2024

Actually I've soured on the column idea, because if there are multiple pipelines running, the overhead goes almost entirely to the first parallel Func in the pipeline, while the jobs for that pipeline are waiting to be scheduled. But that's not a property of that Func, so it's misleading.

@abadams abadams removed the dev_meeting Topic to be discussed at the next dev meeting label Mar 29, 2024