PTXprofiler

A simple profiler to count Nvidia PTX assembly instructions of OpenCL/SYCL/CUDA kernels for roofline model analysis.

How to compile?

- On Windows: open PTXprofiler.sln in Visual Studio Community and compile.
- On Linux: run chmod +x make.sh, then ./make.sh path/to/kernel.ptx

How to use?

- Generate a .ptx file from your application; this works only with an Nvidia GPU. With the OpenCL-Wrapper, you can simply uncomment #define PTX in src/opencl.hpp, then compile and run the application. This creates a file kernel.ptx containing the PTX assembly code.
- Run bin/PTXprofiler.exe path/to/kernel.ptx. For FluidX3D, for example, it generates this table:

kernel name                     |flops  (float int    bit  )|copy  |branch|cache  (load  store)|memory (load  cached store)
--------------------------------|---------------------------|------|------|--------------------|---------------------------
initialize                      |   283    129     61     93|    33|     6|     0      0      0|   135     35      0    100
stream_collide                  |   363    261     35     67|    23|     2|     0      0      0|   153     77      0     76
update_fields                   |   160     56     37     67|    21|     2|     0      0      0|    93     77      0     16
voxelize_mesh                   |   170     91     34     45|    40|    11|    84     48     36|    37     36      0      1
transfer_extract_fi             |   460      0    221    239|   122|    63|     0      0      0|   180     80     20     80
transfer__insert_fi             |   483      0    247    236|   115|    47|     0      0      0|   180     80     20     80
transfer_extract_rho_u_flags    |    47      0     39      8|    23|     1|     0      0      0|    68     34      0     34
transfer__insert_rho_u_flags    |    47      0     39      8|    23|     1|     0      0      0|    68     34      0     34

For each OpenCL/CUDA kernel, instructions are counted and listed:
- GPUs compute floating-point, integer and bit manipulation operations on the same ALUs, so they are counted combined as flops, but also listed separately as float, int and bit.
- Data movement operations are listed under copy.
- Branches are listed under branch.
- Total shared/local memory (L1 cache) accesses in bytes are listed under cache, with separate counters for load and store.
- Total global memory (VRAM) accesses in bytes are listed under memory, with separate counters for load, cached (load from VRAM or L2 cache) and store.
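To make the categories above more concrete, here is a minimal Python sketch of how PTX instructions could be sorted into such counters. It is not PTXprofiler's actual implementation: the opcode sets, the classify and count_ptx helpers, and the SAMPLE kernel are all assumptions made for illustration, and the cached counter is not modelled.

```python
import re

# Assumed, simplified opcode groups; the real profiler may classify differently.
ARITH_OPS = {"add", "sub", "mul", "mad", "fma", "div", "neg", "abs", "min", "max"}
BIT_OPS = {"and", "or", "xor", "not", "shl", "shr"}
COPY_OPS = {"mov", "cvt", "selp"}
FLOAT_TYPES = {"f16", "f32", "f64"}

def access_bytes(mods):
    """Bytes moved by one ld/st, from its type suffix (e.g. f32) and vector width (e.g. v4)."""
    width, vector = 0, 1
    for m in mods:
        t = re.fullmatch(r"[busf](\d+)", m)
        v = re.fullmatch(r"v([24])", m)
        if t:
            width = int(t.group(1)) // 8
        if v:
            vector = int(v.group(1))
    return width * vector

def classify(line):
    """Return (category, amount) for one PTX line, or None if it is not an instruction."""
    line = line.strip().rstrip(";")
    if not line or line.startswith(("//", ".", "{", "}")) or line.endswith(":"):
        return None  # blank line, comment, directive, brace or label
    tokens = line.split()
    if tokens[0].startswith("@"):  # skip a predicate guard such as "@%p1"
        tokens = tokens[1:]
    if not tokens or not re.fullmatch(r"[a-z][a-z0-9]*(\.\w+)*", tokens[0]):
        return None
    op, *mods = tokens[0].split(".")
    if op in ("ld", "st") and ("global" in mods or "shared" in mods):
        space = "memory" if "global" in mods else "cache"
        kind = "load" if op == "ld" else "store"
        return f"{space}_{kind}", access_bytes(mods)  # counted in bytes
    if op == "bra":
        return "branch", 1
    if op in COPY_OPS:
        return "copy", 1
    if op in BIT_OPS:
        return "bit", 1
    if op in ARITH_OPS:
        return ("float" if FLOAT_TYPES & set(mods) else "int"), 1
    return "other", 1

def count_ptx(ptx_text):
    """Sum instruction counts (or bytes, for memory accesses) per category."""
    totals = {}
    for line in ptx_text.splitlines():
        result = classify(line)
        if result is not None:
            category, amount = result
            totals[category] = totals.get(category, 0) + amount
    return totals

# Hypothetical hand-written kernel fragment, only for demonstration.
SAMPLE = """
.visible .entry square_add(
    .param .u64 square_add_param_0
)
{
    .reg .f32 %f<4>;
    ld.param.u64 %rd1, [square_add_param_0];
    mov.u32 %r1, %tid.x;
    mul.wide.s32 %rd2, %r1, 4;
    add.s64 %rd3, %rd1, %rd2;
    ld.global.f32 %f1, [%rd3];
    fma.rn.f32 %f2, %f1, %f1, %f1;
    st.global.f32 [%rd3], %f2;
    ret;
}
"""

totals = count_ptx(SAMPLE)
print(totals)
```

In this sketch, memory and cache categories accumulate bytes while all other categories accumulate instruction counts, mirroring the units described in the list above.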
You can use the counted flops and memory accesses, together with the measured execution time of the kernel, to place it in a roofline model diagram.
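As a rough sketch of that last step, the Python snippet below turns one table row into a roofline point. It assumes that the listed counts apply to a single thread and that every thread executes the kernel once; the thread count, the measured time and the hardware peaks are hypothetical placeholders, not measurements.

```python
def roofline_point(flops_per_thread, bytes_per_thread, threads, seconds):
    """Return (arithmetic intensity in flops/byte, achieved flops/s, achieved bytes/s)."""
    intensity = flops_per_thread / bytes_per_thread
    performance = flops_per_thread * threads / seconds
    bandwidth = bytes_per_thread * threads / seconds
    return intensity, performance, bandwidth

def roofline_limit(intensity, peak_flops, peak_bandwidth):
    """Attainable flops/s at a given intensity: the lower of the compute and memory roofs."""
    return min(peak_flops, intensity * peak_bandwidth)

# stream_collide row from the table: 363 flops and 153 bytes of global memory traffic.
FLOPS, MEMORY_BYTES = 363, 153

# Hypothetical launch size and measured time (assumptions for illustration only).
THREADS = 256 ** 3
SECONDS = 10e-3

# Hypothetical hardware peaks (assumptions): 10 TFLOP/s and 500 GB/s.
PEAK_FLOPS = 10e12
PEAK_BANDWIDTH = 500e9

intensity, performance, bandwidth = roofline_point(FLOPS, MEMORY_BYTES, THREADS, SECONDS)
limit = roofline_limit(intensity, PEAK_FLOPS, PEAK_BANDWIDTH)

print(f"arithmetic intensity: {intensity:.2f} flops/byte")
print(f"achieved: {performance / 1e9:.1f} Gflops/s, {bandwidth / 1e9:.1f} GB/s")
print(f"roofline limit: {limit / 1e9:.1f} Gflops/s")
```

The kernel is plotted at x = intensity and y = achieved performance; if intensity times peak bandwidth is below the compute peak, the kernel sits under the memory-bound part of the roof.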

Limitations

Matrix/tensor operations are not yet supported.
Non-unrolled loops are counted for only one iteration, but may be executed multiple times, which multiplies the number of instructions actually executed inside the loop.
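If the trip count of such a loop is known, the listed count can be corrected by hand. The Python sketch below shows the arithmetic; the corrected_count helper and all numbers are hypothetical, and the number of instructions inside the loop body has to be determined separately, for example by inspecting the .ptx file.

```python
def corrected_count(counted_total, loop_body_count, iterations):
    """Add the loop body for every iteration beyond the single one already counted."""
    return counted_total + loop_body_count * (iterations - 1)

# Hypothetical kernel: 100 instructions counted, 20 of them inside a loop that runs 8 times.
executed = corrected_count(counted_total=100, loop_body_count=20, iterations=8)
print(executed)
```

The same correction applies to byte counters when the loop body contains memory accesses.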

ProjectPhysX/PTXprofiler
