
Per-pipeline-invocation profiling #8153

Open · wants to merge 9 commits into main

Conversation

@abadams (Member) commented Mar 13, 2024

This is a new sampling profiler as discussed in #5796. When it samples, it samples all simultaneously running Halide pipelines, and all simultaneously running instances of the same pipeline, tracking their stats separately (they have separate sampling tokens, instead of fighting over a single global one). At pipeline exit these stats are accumulated into global pipeline stats and static per-pipeline stats.
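As a rough illustration of that bookkeeping (hypothetical struct and field names, not the actual halide_profiler_* definitions):

```cpp
// Hypothetical sketch of the per-instance vs. per-pipeline bookkeeping
// described above; the real runtime uses halide_profiler_* structs whose
// layout differs.
#include <cstdint>

struct pipeline_stats_t {
    uint64_t total_time_ns = 0;   // accumulated across all finished instances
    uint64_t num_invocations = 0;
};

struct instance_stats_t {
    int current_func = 0;                  // this instance's own sampling token
    uint64_t billed_time_ns = 0;           // time attributed to this instance
    pipeline_stats_t *pipeline = nullptr;  // owning pipeline's static stats
};

// At pipeline exit, fold this instance's stats into the pipeline totals.
void accumulate_at_exit(instance_stats_t *instance) {
    instance->pipeline->total_time_ns += instance->billed_time_ns;
    instance->pipeline->num_invocations += 1;
}
```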

The big differences from the current sampling profiler are:

  • Pipeline invocations are measured as total wall-clock time, rather than wall-clock time minus time spent in any other Halide pipeline running at the same time. The old behavior didn't really work when pipelines could use different numbers of threads, or when simultaneously running pipelines set the sampling token at different rates.

  • In the per-pipeline results, time the thread pool spends busy doing work on some other Halide pipeline is tracked as its own entry. This conveniently also measures time spent in the ragged end of a too-coarse parallel for loop, or time spent trying to grab thread pool locks in a too-fine parallel for loop, so it's also useful for standalone micro-benchmarking (a sketch of this billing scheme follows the list).

  • Pipeline invocations that are just bounds queries are ignored. Previously these could misleadingly drag the average runtime way down.
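A sketch of how a single sampler tick might be billed under the first two points above; all names here are illustrative, not the runtime's actual code:

```cpp
// Illustrative only (not the actual runtime sampler): every running instance
// is billed the full tick against its current func (wall-clock semantics),
// and pool threads lent to some other pipeline are billed to a synthetic
// "on another pipeline" entry in this instance's results.
#include <cstdint>
#include <vector>

struct Instance {
    int current_func;                // index of the func this instance is in
    int threads_on_other_pipelines;  // pool threads currently working elsewhere
    std::vector<uint64_t> entry_ns;  // per-entry billed time; entry 0 is the
                                     // synthetic "on another pipeline" slot
};

void bill_tick(std::vector<Instance> &running, uint64_t tick_ns) {
    for (Instance &inst : running) {
        inst.entry_ns[inst.current_func] += tick_ns;
        inst.entry_ns[0] += tick_ns * inst.threads_on_other_pipelines;
    }
}
```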

I'm going to need some help testing (and probably fixing) the hvx changes. At the very least the remote runtime needs to be rebuilt.

Fixes #5796

This should give better results when multiple Halide pipelines are running at the same time.
- Don't profile bounds queries
- Simplify layout calculation
- Bill time after decrementing main thread as overhead, not waiting on parallel tasks
- Change waiting on parallel tasks label
@abadams added the dev_meeting label (Topic to be discussed at the next dev meeting) Mar 13, 2024
@abadams added the release_notes label (For changes that may warrant a note in README for official releases) Mar 13, 2024
@mcourteaux (Contributor) commented:

May I suggest adding the ability to do microsecond-level sleeps to set the sampling rate, with a user-facing API to set it? I'd maybe like to take 5 or 10 samples per millisecond, instead of just 1. Additionally, now that we measure per-pipeline and per-instance, an accurate rdtsc-based wall-time measurement would be nice (if available on the system). Thoughts on this?

Review threads on src/runtime/HalideRuntime.h and src/runtime/profiler_common.cpp (resolved).
// Unfortunately a user_context is not available or meaningful if
// there are zero or multiple instances running.
halide_error(nullptr, "Cannot use device profiling while multiple profiled pipelines are running on host. Shutting down profiler.");
return -1;
Contributor:

I have never worked with this Hexagon thing, but perhaps shutting down the profiler is overkill? It might not be user-friendly. You might still want to see profiler results for another pipeline that runs alone at a later point in the program. Instead, omitting the profile report for just the pipelines that ran simultaneously might be a less extreme alternative? As I said: no clue what I'm talking about, but trying to spot potential annoyances for anyone using this.

Member Author:

I think this situation means that the profiler state is already corrupt. This should already have thrown an error in hexagon_host when trying to launch the remote kernel. If you try to launch a profiled hexagon kernel, and there's already an instance in flight, it will refuse to even launch the kernel. I'll change it to a hard assert because it's supposed to be unreachable.

Member Author:

I changed it to a debug assert in this place, and instead put in a hard error message if you either try to launch a profiled pipeline while a profiled pipeline is running on device, or try to launch a profiled pipeline on device while another profiled pipeline is running.


// Tell the instance the pipeline to which it belongs.
instance->pipeline_stats = p;

Contributor:

This might be a good point to record halide_current_time_ns() as the start time, to produce accurate wall times, as opposed to summing samples (which I demonstrated to be very noisy/inaccurate in #5796).

Member Author:

Will fix in next push, though I won't resolve this because the way I did it might need discussion.

Member Author:

Pushed. There was a question about whether pipeline start and end should themselves count as sampling events, which could be used to ensure that the sum of the billed times for each func equalled the wall-clock time of the whole pipeline. I decided that would actually introduce some biases, so instead there's an awkward adjustment step at pipeline exit that rescales the billed times to add up to the wall-clock time.
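For concreteness, a minimal sketch of that rescaling step, with hypothetical names (the actual adjustment in the runtime may differ):

```cpp
// Minimal sketch of rescaling billed times so they sum to the measured
// wall-clock time of the instance. Names are hypothetical, not Halide's API.
#include <cstdint>
#include <vector>

void rescale_billed_times(std::vector<uint64_t> &billed_ns,
                          uint64_t wall_clock_ns) {
    uint64_t total = 0;
    for (uint64_t t : billed_ns) {
        total += t;
    }
    if (total == 0) {
        return;  // no samples landed in this instance; nothing to rescale
    }
    // Scale in floating point to avoid 64-bit overflow on long runs.
    const double scale = (double)wall_clock_ns / (double)total;
    for (uint64_t &t : billed_ns) {
        t = (uint64_t)((double)t * scale);
    }
}
```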

halide_profiler_state *s = halide_profiler_get_state();
if (!s->sampling_thread) {
    return;
}

s->current_func = halide_profiler_please_stop;
int one = 1;
atomic_store_relaxed(&(s->shutdown), &one);
Contributor:

What is the purpose of having a relaxed "atomic" operation and this auxiliary function? Doesn't the C++ memory consistency model guarantee the same behavior even without this relaxed atomic store on all platforms?

Member Author:

The implementation of atomic_store_relaxed is just *a = *b, so this is really just documentation that this is an intentional unlocked communication of a value across threads.
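For reference, the same intent expressed in standard C++ would be a relaxed atomic store; this is a sketch assuming C++20's std::atomic_ref, not the runtime's actual implementation (which, as noted above, is a plain copy):

```cpp
// Sketch only: what atomic_store_relaxed documents, expressed with C++20's
// std::atomic_ref. A relaxed store is atomic (no torn writes) but imposes no
// ordering with respect to other memory operations.
#include <atomic>

inline void atomic_store_relaxed(int *dst, const int *src) {
    std::atomic_ref<int>(*dst).store(*src, std::memory_order_relaxed);
}
```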

Further review threads on src/runtime/HalideRuntime.h, src/Profiling.cpp, and src/IR.cpp (resolved).
@abadams (Member Author) commented Mar 14, 2024

> May I suggest adding the ability to do microsecond-level sleeps to set the sampling rate, with a user-facing API to set it? I'd maybe like to take 5 or 10 samples per millisecond, instead of just 1. Additionally, now that we measure per-pipeline and per-instance, an accurate rdtsc-based wall-time measurement would be nice (if available on the system). Thoughts on this?

I changed things to microseconds (except on Windows, which now just does a yield if the requested sleep time is under a millisecond). Regarding rdtsc, I think that would be a separate change that provides a different implementation for one of the *_clock.cpp runtime modules to use on x86.
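A sketch of the sleep behavior described here (illustrative, not the actual runtime module):

```cpp
// Hedged sketch: sleep for a requested number of microseconds, falling back
// to a yield on Windows when the request is below the ~1 ms granularity of
// Sleep(). Not the actual Halide runtime code.
#include <cstdint>
#ifdef _WIN32
#include <windows.h>
#else
#include <time.h>
#endif

void sleep_us(uint64_t us) {
#ifdef _WIN32
    if (us < 1000) {
        SwitchToThread();  // sub-millisecond sleeps aren't reliable; just yield
    } else {
        Sleep((DWORD)(us / 1000));
    }
#else
    timespec ts = {(time_t)(us / 1000000), (long)((us % 1000000) * 1000)};
    nanosleep(&ts, nullptr);
#endif
}
```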

@mcourteaux (Contributor) left a review:

Looks pristine! 😄

@aankit-ca (Contributor) commented:

@abadams I'm testing the PR for hvx. Will update you soon.

@aankit-ca (Contributor) commented:

@abadams @mcourteaux I tested the PR on simulator (host-hvx) and device (arm-hvx). Profiling worked with a few minor changes on the simulator but I see a crash on device. I'm trying to debug the device crash. I see the crash on device with the main branch too, so the crash might not be related to this PR. I'll be OOO starting tomorrow till April 7. @prasmish will work on resolving the issue in the meantime.

@steven-johnson (Contributor) commented:

Ready to land?

@abadams (Member Author) commented Apr 5, 2024

Waiting on the hvx issue.

@aankit-quic (Contributor) commented:

@abadams These changes in #8187 are needed to get this PR working on Hexagon. There is still a failure on device, but the error is not related to this PR; we can reproduce it on the main branch too. I don't want to hold up this PR because of the crash.

@mcourteaux (Contributor) commented:

Can be merged. The HVX issue is unrelated to this, it seems.

@abadams (Member Author) commented Apr 19, 2024

It looks like there's other stuff in #8187 that's needed, though. Only the missing-sampling-token issue was broken on main.

Labels
dev_meeting (Topic to be discussed at the next dev meeting), release_notes (For changes that may warrant a note in README for official releases)

Merging this pull request may close: Profiler gets confused when two pipelines run at the same time (#5796)

5 participants