Profiler gets confused when two pipelines run at the same time #5796
@abadams Could you tag this with dev-meeting? I'm currently seeing something that could be related. The Halide profiler reports this:
Tracy, meanwhile, clearly shows that this takes 39ms per run. Looking again at the profiler runtime, I indeed don't see any mechanism to support multiple threads calling Halide pipelines. Is it an undocumented, by-design limitation that the profiler can only be used from single-threaded applications?
Multiple threads with different pipelines running are supposed to work. Each func id belongs to a unique range of integers tied to a specific pipeline by name. profiler_common.cpp does something like this when a pipeline is entered:
Multiple threads running the same pipeline get billed to the same entry. That should still work, because total time will be the time from both instances, and total runs will be the number of runs for both instances. Is Tracy telling you the wall-clock time between pipeline entry and exit? That's not what the profiler does. It shows how much time was spent in that pipeline alone, so it doesn't include time spent in any other Halide pipeline running at the same time.
(So I believe the original issue is wrong.)
Tracy measures the wall-clock time between pipeline entry and exit. It does that for every invocation of the pipeline, and then reports various statistics as shown in the screenshot: mean time per invocation, among others. The mean over all 48 invocations of the pipeline is 39.6ms. The total time spent in all of these invocations (possibly coming from different threads) is 1.9s. The Halide profiler, on the other hand, is very wrong: it shows 665ms of total runtime, which averaged over the 48 invocations gives 13.8ms per invocation. My guess is that two simultaneous invocations of the same pipeline, coming from two different threads, are actually only billed as one. As I mentioned earlier, I don't see any mechanism in the code that would account for the fact that two threads might be doing different work: I just see
I don't see how that would be true, given what I just typed out above. More specifically, the Halide profiler's sampling-thread code bills each loop iteration's time only once (i.e., to one function); see Halide/src/runtime/profiler_common.cpp, lines 99 to 172 at 22868a4.
halide_profiler_state::current_func is the sum of the func id from the pipeline and a base value unique to each pipeline, so distinct pipelines have distinct ranges of values. The base value is allocated here: https://github.com/halide/Halide/blob/main/src/runtime/profiler_common.cpp#L67-L94 and the sum is performed here:

So there's no way for two different pipelines to ever set the profiling token to the same value. What the profiler ends up measuring (in expectation) is therefore not the wall-clock time between the start and end of a pipeline invocation, as Tracy does. It measures the amount of time spent in that pipeline minus the amount of time spent in any other Halide pipeline running at the same time. If another instance of the same pipeline is running, the two get billed together, because they share the same base id. But that total gets divided by the number of pipeline runs, so the final value is still effectively the time spent in one instance, minus the time spent in any other Halide pipeline running at the same time.

TL;DR: The Halide profiler and the Tracy profiler disagree because they are measuring very different things.
I should add that the profiler can produce very biased results if the two pipelines are setting the token at very different rates: the one setting it more frequently clobbers the other.
Continuing that train of thought, are you saying that you'd prefer each pipeline to have its own token? The profiler thread would have to iterate through all running pipelines somehow; currently it can only iterate through all pipelines it has ever seen before.
Agreed 100%, but that's not the problem. If a single-threaded pipeline runs twice concurrently, with both instances starting and ending at exactly the same time and both taking 100ms, the profiler report will tell you that 100ms was spent in total over 2 runs, giving 50ms per run, which is obviously false. I think you were thinking more of the statement by the OP of this issue; I'm looking at an even simpler case: one pipeline that gets called multiple times in parallel.
The reason this is a problem is that I'm running a bunch of Halide pipelines (both the same one and different ones) on different threads in my job system. The more worker threads I have, the more Halide pipelines overlap in time, and the "faster" the pipelines appear in the profiler report. The report is highly non-deterministic and depends on how many Halide pipelines happen to be working concurrently. You can't use the profiler report in any real-world application that launches pipelines from multiple threads.
Ah, I was assuming all pipelines use all available cores, and was trying to report actual CPU time spent by one instance of that pipeline, not wall-clock time between the start and end. If the two pipelines were using all cores, and running two of them at the same time took 100ms from start to end, then 50ms would be correct. Hrm. I don't think a separate current_func variable per pipeline would fix that; you'd need a separate current_func per running instance. We could certainly do that, but it would measure something different from what the profiler currently measures. Let's figure out the ideal behavior at the next dev meeting, because I think any of these choices are implementable.
So the top level question we need to answer is: Should interference from other Halide pipelines running at the same time count towards the profiling results for a pipeline? Given that we can only possibly account for interference from other Halide pipelines, and not interference from other stuff happening, maybe the answer is yes. Currently the system is designed to say "no" and subtract any interference, but as you point out that doesn't work if the pipeline isn't using all available resources. |
Wow, this is confusing. I understand the argument, and I'm inclined to agree with this statement, but dang, that complicates things... 😅 On the other hand, it is much more tangible for the profile result to report 100ms, because it really just took that long (wall time, start to stop). But true: either only half the machine, or the full machine half of the time, was working on it, so the time spent by "one full machine" is 50ms. I am starting to think this is a matter of choosing units correctly. We might report both wall-time start-to-end, and

Going back to your original statement (the one I quoted on top): the
I think you have convinced me that the current definition of time taken isn't really workable, and we should just switch to wall-clock times, because that's less confusing and better defined in tricky situations like the ones we're describing. I have some ideas for how to do this efficiently. I'll try to hack something together for discussion.
Late at night, I was thinking that |
Try out the branch abadams/per_instance_profiling and see what you think. |
Thanks for this effort! I'll have a more thorough look at it tomorrow. I was very busy with other things today. |
It should spawn once and continue sampling until shutdown, but if there are no instances running, all it does is grab the lock, check whether any instances are running, release the lock, and sleep, so it might not show up as anything in a trace. I considered adding a condition variable so that it would sleep for real, but I'm worried any delay in waking up would result in the first Func of each pipeline getting undersampled.
This was meant to measure time spent with no tasks scheduled for this pipeline because tasks from some other pipeline got enqueued. The thread pool is most-recently-enqueued-job first, so a pipeline can get starved out (except for the main thread of that pipeline, which doesn't wait on the queue, so at worst it'll run single-threaded). But now that you mention it, it does indeed mostly measure the amount of time waiting for all threads to finish. Accidental but neat. Given that, is "waiting on thread queue" the right name? I've changed it to "waiting for parallel tasks to finish".
I think it should be OK, because immediately after that the instance state is removed from the list of running instances, so the only way that value would be sampled is if the sampling profiler is holding the lock at the time the pipeline tries to exit, which blocks the instance from removing itself from the list. In that case there is indeed a delay before the instance can exit, while it waits to grab the lock. Still, I think I'll bill that as overhead instead of waiting on thread queue.
The struct allocated on the stack is only ever used for address calculations, so according to Godbolt the actual memory is culled. I was worried the direct size computation would be wrong if sizeof(halide_profiler_func_stats) isn't a multiple of 8 bytes: the struct needs 8-byte alignment, so there could be 4 bytes of padding... Actually, checking Godbolt, the way I did it doesn't account for padding either. It's OK right now because the struct contains 8-byte members, so sizeof rounds up to the next multiple of 8. I'll just add a static assert that sizeof(halide_profiler_instance_state) is a multiple of 8. TL;DR: I was overthinking it.

Finally, I'm also going to change the profiler to ignore bounds queries to the pipeline. Those run super fast, so if there's one bounds query per actual invocation, the profiler misleadingly reports the runtime as half of the true runtime, on average.
A remaining question: the time-spent-waiting id could be per-func, so you know which parallel loops might need their task size adjusting. That would add a lot of lines of output though. Or maybe it should be another column that only shows up if threads > 1? |
I like the column idea. That sounds perfect! I'm also thinking of getting rid of the malloc and free lines if there are none in the pipeline. That would make sense to me.
Actually I've soured on the column idea, because if there are multiple pipelines running, the overhead goes almost entirely to the first parallel Func in the pipeline, while the jobs for that pipeline are waiting to be scheduled. But that's not a property of that Func, so it's misleading. |
When profiling is enabled for two pipelines running at the same time, the results are confusing, because both pipelines are setting the global profiler token to tokens with different meanings.
This is happening when using an extern stage which also calls another Halide pipeline, but the same issue could occur if running two different pipelines from two different threads.