Hi @rcarson3, getting hardware counters such as FLOP counts from GPU kernels is not straightforward. For GPU FLOP metrics and the like, we generally recommend the NVIDIA Nsight tools (nsys and ncu, or the older nvprof). You can forward Caliper-annotated code regions to the NVIDIA tools with the "nvtx" service ("nvprof" in Caliper <= 2.4).
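As a rough illustration of the forwarding setup described above, here is a minimal Python sketch that builds the environment and command line for running an application under nsys with Caliper's "nvtx" service enabled. The helper name `nsys_with_caliper_regions` and the binary `./my_app` are hypothetical, and the sketch assumes a Caliper build that includes the nvtx service; check the documentation for your Caliper version.

```python
import os


def nsys_with_caliper_regions(app_cmd):
    """Return (env, cmd) for running app_cmd under `nsys profile`.

    Assumption: Caliper was built with NVTX support, so enabling the
    "nvtx" service forwards Caliper-annotated regions as NVTX ranges.
    """
    env = dict(os.environ)
    # Enable the Caliper service that forwards annotations to NVTX.
    env["CALI_SERVICES_ENABLE"] = "nvtx"
    # Wrap the application in the Nsight Systems command-line profiler.
    cmd = ["nsys", "profile"] + list(app_cmd)
    return env, cmd


# Hypothetical application binary, used only for illustration.
env, cmd = nsys_with_caliper_regions(["./my_app"])
print("CALI_SERVICES_ENABLE=" + env["CALI_SERVICES_ENABLE"], " ".join(cmd))
```

The returned pair can be passed to `subprocess.run(cmd, env=env)` from a job script once nsys and the application are available on the node.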
You can then use Caliper annotations to profile only a specific region; check the NVIDIA tools documentation for details. You can also use Caliper to profile CUDA API calls or trace GPU activities (kernel execution, memcopies, etc.). For both of these you'll need to build Caliper with CUPTI support. CUDA API profiling is supported for many built-in configs.
Activity tracing reports time on the device and host for CUDA activities.
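The two CUPTI-backed modes mentioned above (CUDA API profiling and activity tracing) might be selected as in the following sketch. The config strings `runtime-report,profile.cuda` and `cuda-activity-report` are assumptions based on Caliper's built-in configs and should be verified against your Caliper version; the helper name `caliper_cuda_env` is hypothetical.

```python
import os

# Assumed CALI_CONFIG strings; verify against your Caliper documentation.
CUDA_CONFIGS = {
    # Runtime report that also profiles host-side CUDA API calls.
    "api": "runtime-report,profile.cuda",
    # Report of host and device time per CUDA activity (kernels, memcopies).
    "activity": "cuda-activity-report",
}


def caliper_cuda_env(mode):
    """Return a copy of the current environment with CALI_CONFIG set.

    mode is "api" or "activity". Both require a Caliper build with
    CUPTI support.
    """
    if mode not in CUDA_CONFIGS:
        raise ValueError("mode must be one of: " + ", ".join(sorted(CUDA_CONFIGS)))
    env = dict(os.environ)
    env["CALI_CONFIG"] = CUDA_CONFIGS[mode]
    return env
```

For example, `subprocess.run(["./my_app"], env=caliper_cuda_env("activity"))` would launch a (hypothetical) application with activity tracing requested.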
While we currently don't support GPU metrics directly in Caliper, PAPI has a CUDA component. If you get that to work, Caliper should be able to read these metrics via the PAPI service. However, it looks like the PAPI CUDA component is no longer supported on newer devices. I also don't think we have a PAPI installation on Lassen, so you'd have to build it yourself. It may be worth a try.
-
For what it's worth, I've worked extensively with the CUPTI Callback API, which underpins nvprof and the PAPI CUDA component. It is a pain to work with, and there is a reason NVIDIA moved away from it. The new CUPTI Profiler API, which underpins Nsight Compute, has a massive problem for PAPI and, in theory, Caliper: it doesn't support getting data values in a nested context. At least, I could not get it to do so. For example, if marker "A" contains marker "B", any attempt to get the FLOP counts (or any other hardware counter metric) exclusively for "B" either invalidated "A" resuming collection after the "B" metrics were recorded, or returned zeros for "B" because the profiler had not been fully stopped. It appears that NVIDIA designed the API with the expectation that you would only want values at the end of the application. That obviously causes issues for tools like PAPI and Caliper, whose APIs implicitly (e.g. flushing output) or explicitly (e.g. PAPI_read, callbacks) expect that numerical results can be retrieved at runtime.
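To make the nested-region requirement concrete, here is a toy Python model. It is not CUPTI or Caliper code: `FakeDevice` and `RegionCounters` are hypothetical names, and the "counter" is a plain integer. It only illustrates what tools of this kind need, namely a counter that can be read at every region boundary without disturbing the enclosing region.

```python
class FakeDevice:
    """Stand-in for a hardware counter that can be read at any time."""

    def __init__(self):
        self.flops = 0

    def run(self, n):
        self.flops += n  # pretend a kernel executed n floating-point ops

    def read(self):
        return self.flops


class RegionCounters:
    """Accumulate inclusive and exclusive counts for nested regions."""

    def __init__(self, read_counter):
        self._read = read_counter
        self._stack = []  # entries: [name, value_at_begin, total_in_children]
        self.inclusive = {}
        self.exclusive = {}

    def begin(self, name):
        self._stack.append([name, self._read(), 0])

    def end(self, name):
        top, start, in_children = self._stack.pop()
        if top != name:
            raise RuntimeError("regions must be properly nested")
        total = self._read() - start  # mid-run read: the step CUPTI makes hard
        self.inclusive[name] = self.inclusive.get(name, 0) + total
        self.exclusive[name] = self.exclusive.get(name, 0) + total - in_children
        if self._stack:
            self._stack[-1][2] += total  # charge this region to its parent


dev = FakeDevice()
rc = RegionCounters(dev.read)
rc.begin("A")
dev.run(100)
rc.begin("B")
dev.run(40)
rc.end("B")
dev.run(10)
rc.end("A")
print(rc.inclusive)  # {'B': 40, 'A': 150}
print(rc.exclusive)  # {'B': 40, 'A': 110}
```

The scheme depends on reading the counter at the end of "B" while "A" is still open. If the only reliable read happens once collection has fully stopped, as described above, neither the value for "B" nor the exclusive value for "A" can be computed this way.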
-
@daboehme I'm currently using Caliper to get basic timing information out of my code ExaConstit, which is used within an ECP application. As part of this ECP app, we're looking at various performance tools (TAU, HPCToolkit, etc.) to potentially get information such as FLOP counts or memory usage. I was wondering if either of these is possible in Caliper for GPU code.
I've tried using the PAPI feature on Lassen to get at the FLOP metric, based on the config file for LULESH in the caliper-example repo, but have been striking out with that.