
Add a label to ParallelFor that is forwarded to TinyProfiling #3899

Open
DerNils-git opened this issue Apr 19, 2024 · 2 comments

Comments

@DerNils-git

The most expensive regions of a simulation are (probably) wrapped in an amrex::ParallelFor.
The overhead of wrapping amrex::ParallelFor in a tiny profiling call is admittedly not very high,
but it could simplify quick profiling of AMReX-based codes if a label could be attached directly
to an amrex::ParallelFor.

Enforcing the label would probably require too much effort from existing codes to adapt to the new interface.
But ParallelFor could be extended with an optional string as the last argument, which would
be forwarded to the tiny profiling calls if the label is provided.
Current usage:

ParallelFor(box, numcomps,
            [=] AMREX_GPU_DEVICE (int i, int j, int k, int n) { ... });

Extended usage:

ParallelFor(box, numcomps,
            [=] AMREX_GPU_DEVICE (int i, int j, int k, int n) { ... }, "MyParallelKernel");

A possible default argument could be

  • an empty string. Then a ParallelFor is only profiled if a label is given explicitly.
  • a default string. Then all ParallelFor regions without a label are still profiled, and the measurement
    always provides insight into the time spent in shared-memory kernels.
@WeiqunZhang
Member

We actually have versions of ParallelFor functions that were added with this in mind.

void ParallelFor (Gpu::KernelInfo const& info, Box const& box, L const& f) noexcept

Here, Gpu::KernelInfo is defined as class KernelInfo. We could add a non-explicit constructor to Gpu::KernelInfo. Then we can do

ParallelFor("MyKernelName", box, [=] ....);

But we still need to decide what to do with the string. Note that by default the kernel launch is async. So if we use a TinyProfiler in ParallelFor, we will have to set tiny_profiler.device_synchronize_around_region=true to make the kernel launch synchronous. But that might hurt performance. Alternatively, we can just call nvtx's and roctx's RangePush and RangePop. The vendor's profiling tools will get the names.

@DerNils-git Since you mentioned Kokkos: do you know what they do?

@DerNils-git
Author

If I understand the approach of the ParallelFor above correctly, a developer has to decide at implementation time whether a kernel should be profiled or not.
For a performance analysis it would be nice to get all kernel information by default, rather than depend on the developer's decision about which kernels to profile. I might not know in advance which kernels are invoked during a simulation at all, so getting information on all invoked kernels at runtime by default would be nice.

I am only a user of the Kokkos profiling tools, so I can only guess a bit.
Whether profiling tools are used is decided at runtime, not at compile time. The profiling library has
the option to synchronize kernels, if I interpret the code correctly, but it depends on the profiler whether kernels are synchronized or not. That's not a lot of information, but looking closer into the implementation details would be a bit too much overhead for now, sorry.
