
Per-pipeline-invocation profiling #8153

Open · wants to merge 9 commits into main

Conversation

@abadams (Member) commented Mar 13, 2024

This is a new sampling profiler as discussed in #5796. When it samples, it samples all simultaneously running Halide pipelines, and all simultaneously running instances of the same pipeline, tracking their stats separately (they have separate sampling tokens, instead of fighting over a single global one). At pipeline exit these stats are accumulated into global pipeline stats and static per-pipeline stats.
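As a rough illustration of that bookkeeping (hypothetical struct and field names, not the actual halide_profiler_* definitions):

```cpp
// Hypothetical sketch of the per-instance vs. per-pipeline bookkeeping
// described above; the real runtime uses halide_profiler_* structs whose
// layout differs.
#include <cstdint>

struct pipeline_stats_t {
    uint64_t total_time_ns = 0;   // accumulated across all finished instances
    uint64_t num_invocations = 0;
};

struct instance_stats_t {
    int current_func = 0;                  // this instance's own sampling token
    uint64_t billed_time_ns = 0;           // time attributed to this instance
    pipeline_stats_t *pipeline = nullptr;  // owning pipeline's static stats
};

// At pipeline exit, fold this instance's stats into the pipeline totals.
void accumulate_at_exit(instance_stats_t *instance) {
    instance->pipeline->total_time_ns += instance->billed_time_ns;
    instance->pipeline->num_invocations += 1;
}
```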

The big differences from the current sampling profiler are:

  • Pipeline invocations are measured as total wall-clock time, rather than wall-clock time minus time spent in any other Halide pipeline running at the same time. The old behavior didn't really work when pipelines could use different numbers of threads, or when simultaneously running pipelines set the sampling token at different rates.

  • In the per-pipeline results, time the thread pool spends busy doing work on some other Halide pipeline is tracked as its own entry. This conveniently also measures time spent in the ragged end of a too-coarse parallel for loop, or time spent trying to grab thread pool locks in a too-fine parallel for loop, so it's also useful for standalone micro-benchmarking (a sketch of this billing scheme follows the list).

  • Pipeline invocations that are just bounds queries are ignored. Previously these could misleadingly drag the average runtime way down.
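A sketch of how a single sampler tick might be billed under the first two points above; all names here are illustrative, not the runtime's actual code:

```cpp
// Illustrative only (not the actual runtime sampler): every running instance
// is billed the full tick against its current func (wall-clock semantics),
// and pool threads lent to some other pipeline are billed to a synthetic
// "on another pipeline" entry in this instance's results.
#include <cstdint>
#include <vector>

struct Instance {
    int current_func;                // index of the func this instance is in
    int threads_on_other_pipelines;  // pool threads currently working elsewhere
    std::vector<uint64_t> entry_ns;  // per-entry billed time; entry 0 is the
                                     // synthetic "on another pipeline" slot
};

void bill_tick(std::vector<Instance> &running, uint64_t tick_ns) {
    for (Instance &inst : running) {
        inst.entry_ns[inst.current_func] += tick_ns;
        inst.entry_ns[0] += tick_ns * inst.threads_on_other_pipelines;
    }
}
```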

I'm going to need some help testing (and probably fixing) the hvx changes. At the very least the remote runtime needs to be rebuilt.

Fixes #5796

This should give better results when multiple Halide pipelines are running at the same time.
- Don't profile bounds queries
- Simplify layout calculation
- Bill time after decrementing main thread as overhead, not waiting on parallel tasks
- Change waiting on parallel tasks label
@abadams added the dev_meeting label (Topic to be discussed at the next dev meeting) Mar 13, 2024
@abadams added the release_notes label (For changes that may warrant a note in README for official releases) Mar 13, 2024
@mcourteaux (Contributor) commented:

May I suggest adding the ability to do microsecond-level sleeps to set the sampling rate, with a user-facing API to set it? I'd maybe like to take 5 or 10 samples per millisecond, instead of just 1. Additionally, now that we measure per-pipeline and per-instance, an accurate rdtsc-based wall-time measurement would be nice (if available on the system). Thoughts on this?

Review threads on src/runtime/HalideRuntime.h and src/runtime/profiler_common.cpp (resolved).
// Unfortunately a user_context is not available or meaningful if
// there are zero or multiple instances running.
halide_error(nullptr, "Cannot use device profiling while multiple profiled pipelines are running on host. Shutting down profiler.");
return -1;
Contributor:

I have never worked with this Hexagon thing, but perhaps shutting down the profiler is overkill? It might not be user-friendly. You might still want to see profiler results for another pipeline that runs alone at a later point in the program. Instead, omitting the profile report for just the pipelines that ran simultaneously might be a less extreme alternative? As I said: no clue what I'm talking about, but trying to spot potential annoyances for anyone using this.

Member Author:

I think this situation means that the profiler state is already corrupt. This should already have thrown an error in hexagon_host when trying to launch the remote kernel. If you try to launch a profiled hexagon kernel, and there's already an instance in flight, it will refuse to even launch the kernel. I'll change it to a hard assert because it's supposed to be unreachable.

Member Author:

I changed it to a debug assert in this place, and instead put in a hard error message if you either try to launch a profiled pipeline while a profiled pipeline is running on device, or try to launch a profiled pipeline on device while another profiled pipeline is running.


// Tell the instance the pipeline to which it belongs.
instance->pipeline_stats = p;

Contributor:

This might be a good point to record halide_current_time_ns() as the start time, to produce accurate wall times, as opposed to summing samples (which I demonstrated to be very noisy/inaccurate in #5796).

Member Author:

Will fix in next push, though I won't resolve this because the way I did it might need discussion.

Member Author:

Pushed. There was a question about whether pipeline start and end should themselves count as sampling events, which could be used to ensure that the sum of the billed times for each func equalled the wall-clock time of the whole pipeline. I decided that would actually introduce some biases, so instead there's an awkward adjustment step at pipeline exit that rescales the billed times to add up to the wall-clock time.
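For concreteness, a minimal sketch of that rescaling step, with hypothetical names (the actual adjustment in the runtime may differ):

```cpp
// Minimal sketch of rescaling billed times so they sum to the measured
// wall-clock time of the instance. Names are hypothetical, not Halide's API.
#include <cstdint>
#include <vector>

void rescale_billed_times(std::vector<uint64_t> &billed_ns,
                          uint64_t wall_clock_ns) {
    uint64_t total = 0;
    for (uint64_t t : billed_ns) {
        total += t;
    }
    if (total == 0) {
        return;  // no samples landed in this instance; nothing to rescale
    }
    // Scale in floating point to avoid 64-bit overflow on long runs.
    const double scale = (double)wall_clock_ns / (double)total;
    for (uint64_t &t : billed_ns) {
        t = (uint64_t)((double)t * scale);
    }
}
```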

halide_profiler_state *s = halide_profiler_get_state();
if (!s->sampling_thread) {
    return;
}

s->current_func = halide_profiler_please_stop;
int one = 1;
atomic_store_relaxed(&(s->shutdown), &one);
Contributor:

What is the purpose of having a relaxed "atomic" operation and this auxiliary function? Doesn't the C++ memory consistency model guarantee the same behavior even without this relaxed atomic store on all platforms?

Member Author:

The implementation of atomic_store_relaxed is just *a = *b, so this is really just documentation that this is an intentional unlocked communication of a value across threads.
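For reference, the same intent expressed in standard C++ would be a relaxed atomic store; this is a sketch assuming C++20's std::atomic_ref, not the runtime's actual implementation (which, as noted above, is a plain copy):

```cpp
// Sketch only: what atomic_store_relaxed documents, expressed with C++20's
// std::atomic_ref. A relaxed store is atomic (no torn writes) but imposes no
// ordering with respect to other memory operations.
#include <atomic>

inline void atomic_store_relaxed(int *dst, const int *src) {
    std::atomic_ref<int>(*dst).store(*src, std::memory_order_relaxed);
}
```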

Further review threads on src/runtime/HalideRuntime.h, src/Profiling.cpp, and src/IR.cpp (resolved).
@abadams (Member Author) commented Mar 14, 2024

> May I suggest adding the ability to do microsecond-level sleeps to set the sampling rate, with a user-facing API to set it? I'd maybe like to take 5 or 10 samples per millisecond, instead of just 1. Additionally, now that we measure per-pipeline and per-instance, an accurate rdtsc-based wall-time measurement would be nice (if available on the system). Thoughts on this?

I changed things to microseconds (except on Windows, which now just does a yield if the requested sleep time is under a millisecond). Regarding rdtsc, I think that would be a separate change that provides a different implementation for one of the *_clock.cpp runtime modules to use on x86.
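A sketch of the sleep behavior described here (illustrative, not the actual runtime module):

```cpp
// Hedged sketch: sleep for a requested number of microseconds, falling back
// to a yield on Windows when the request is below the ~1 ms granularity of
// Sleep(). Not the actual Halide runtime code.
#include <cstdint>
#ifdef _WIN32
#include <windows.h>
#else
#include <time.h>
#endif

void sleep_us(uint64_t us) {
#ifdef _WIN32
    if (us < 1000) {
        SwitchToThread();  // sub-millisecond sleeps aren't reliable; just yield
    } else {
        Sleep((DWORD)(us / 1000));
    }
#else
    timespec ts = {(time_t)(us / 1000000), (long)((us % 1000000) * 1000)};
    nanosleep(&ts, nullptr);
#endif
}
```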

@mcourteaux (Contributor) left a review:

Looks pristine! 😄

@aankit-ca (Contributor) commented:

@abadams I'm testing the PR for hvx. Will update you soon.

@aankit-ca (Contributor) commented:

@abadams @mcourteaux I tested the PR on simulator (host-hvx) and device (arm-hvx). Profiling worked with a few minor changes on the simulator but I see a crash on device. I'm trying to debug the device crash. I see the crash on device with the main branch too, so the crash might not be related to this PR. I'll be OOO starting tomorrow till April 7. @prasmish will work on resolving the issue in the meantime.

@steven-johnson (Contributor) commented:

Ready to land?

@abadams (Member Author) commented Apr 5, 2024

Waiting on the hvx issue.

@aankit-quic (Contributor) commented:

@abadams These changes in #8187 are needed to get this PR working on Hexagon. There is still a failure on device, but the error is not related to this PR; we can reproduce it on the main branch too. I don't want to hold up this PR because of the crash.

@mcourteaux (Contributor) commented:

Can be merged. The HVX issue is unrelated to this, it seems.

@abadams (Member Author) commented Apr 19, 2024

It looks like there's other stuff in #8187 that's needed, though. Only the missing-sampling-token issue was broken on main.

Labels
dev_meeting (Topic to be discussed at the next dev meeting), release_notes (For changes that may warrant a note in README for official releases)

Merging this pull request may close: Profiler gets confused when two pipelines run at the same time (#5796)

5 participants