Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Wrong values (half of the expected) of Nvidia GPU metrics when using the marker api #593

Open
teojgo opened this issue Dec 19, 2023 · 4 comments
Labels

Comments

@teojgo
Copy link

teojgo commented Dec 19, 2023

Describe the bug
When running a simple CUDA vector addition kernel, the expected number of floating point additions is equal to the number of elements in the vector. Nvidia Nsight Compute gives the correct result for the same metric.

To Reproduce
The following is the vectorAdd.cu source code (without error checking to simplify it):

#include <likwid-marker.h>
#include <cuda_runtime.h>

__global__ void vectorAdd(const float *A, const float *B, float *C, 
                          const int N) {
  int i = blockDim.x * blockIdx.x + threadIdx.x;

  if (i < N) {
    C[i] = A[i] + B[i];
  }
}

int main(void) {

  LIKWID_NVMARKER_INIT;
  LIKWID_NVMARKER_REGISTER("vecAdd");

  const int N = 500;
  size_t size = N * sizeof(float);
  float *h_A = (float *)malloc(size);
  float *h_B = (float *)malloc(size);

  for (int i = 0; i < N; ++i) {
    h_A[i] = 1.0f;
    h_B[i] = 2.0f;
  }

  float *d_A, *d_B, *d_C;
  cudaMalloc((void **)&d_A, size);
  cudaMalloc((void **)&d_B, size);
  cudaMalloc((void **)&d_C, size);
  cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
  cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

  int threadsPerBlock = 128;
  int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;

  LIKWID_NVMARKER_START("vecAdd");
  vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N); 
  cudaDeviceSynchronize();
  LIKWID_NVMARKER_STOP("vecAdd");

  cudaFree(d_A);
  cudaFree(d_B);
  cudaFree(d_C);
  free(h_A);
  free(h_B);

  LIKWID_NVMARKER_CLOSE;
  return 0;
}

The above can be easily compiled using nvcc as follows:

nvcc -ccbin g++ -Xcompiler -DLIKWID_NVMON -Xcompiler -I$LIKWID_INSTALL_PREFIX/include  -o vectorAdd vectorAdd.cu -Xlinker -L$LIKWID_INSTALL_PREFIX/lib -llikwid

And then executed with:
likwid-perfctr -W FLOPS_SP -m ./vectorAdd

The output is giving:

.
.
.
--------------------------------------------------------------------------------
Region vecAdd, Group 1: FLOPS_SP
+-------------------+----------+
|    Region Info    |   GPU 0  |
+-------------------+----------+
| RDTSC Runtime [s] | 0.000085 |
|     call count    |        1 |
+-------------------+----------+

+----------------------------------------------------+---------+-------+
|                        Event                       | Counter | GPU 0 |
+----------------------------------------------------+---------+-------+
| SMSP_SASS_THREAD_INST_EXECUTED_OP_FADD_PRED_ON_SUM |   GPU0  |   250 |
| SMSP_SASS_THREAD_INST_EXECUTED_OP_FMUL_PRED_ON_SUM |   GPU1  |     0 |
| SMSP_SASS_THREAD_INST_EXECUTED_OP_FFMA_PRED_ON_SUM |   GPU2  |     0 |
+----------------------------------------------------+---------+-------+
.
.
.

Running using Nvidia Nsight Compute:

ncu  --metrics="smsp__sass_thread_inst_executed_op_fadd_pred_on.sum"  ./vectorAdd
.
.
.
[43265] vectorAdd@mymachine
 vectorAdd(const float *, const float *, float *, int) (4, 1, 1)x(128, 1, 1), Context 1, Stream 7, Device 0, CC 8.9
   Section: Command line profiler metrics
   --------------------------------------------------- ----------- ------------
   Metric Name                                         Metric Unit Metric Value
   --------------------------------------------------- ----------- ------------
   smsp__sass_thread_inst_executed_op_fadd_pred_on.sum        inst          500
   --------------------------------------------------- ----------- ------------
  • The Likwid version is the current master, commit (dd8443b)
  • OS: Linux Mint 21.1
  • Using Cuda 12.1
  • Using the NvMarkerAPI

To Reproduce with a LIKWID command

$LIKWID_INSTALL_PREFIX/bin/likwid-perfctr -V 3  -W FLOPS_SP -m ./vectorAdd

DEBUG - [hwloc_init_cpuInfo:359] HWLOC CpuInfo Family 6 Model 158 Stepping 10 Vendor 0x0 Part 0x0 isIntel 1 numHWThreads 12 activeHWThreads 12
DEBUG - [hwloc_init_nodeTopology:568] HWLOC Thread Pool PU 0 Thread 0 Core 0 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:568] HWLOC Thread Pool PU 6 Thread 1 Core 0 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:568] HWLOC Thread Pool PU 1 Thread 0 Core 1 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:568] HWLOC Thread Pool PU 7 Thread 1 Core 1 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:568] HWLOC Thread Pool PU 2 Thread 0 Core 2 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:568] HWLOC Thread Pool PU 8 Thread 1 Core 2 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:568] HWLOC Thread Pool PU 3 Thread 0 Core 3 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:568] HWLOC Thread Pool PU 9 Thread 1 Core 3 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:568] HWLOC Thread Pool PU 4 Thread 0 Core 4 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:568] HWLOC Thread Pool PU 10 Thread 1 Core 4 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:568] HWLOC Thread Pool PU 5 Thread 0 Core 5 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:568] HWLOC Thread Pool PU 11 Thread 1 Core 5 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_cacheTopology:798] HWLOC Cache Pool ID 0 Level 1 Size 32768 Threads 2
DEBUG - [hwloc_init_cacheTopology:798] HWLOC Cache Pool ID 1 Level 2 Size 262144 Threads 2
DEBUG - [hwloc_init_cacheTopology:798] HWLOC Cache Pool ID 2 Level 3 Size 12582912 Threads 12
--------------------------------------------------------------------------------
CPU name:	Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz
CPU type:	Intel Coffeelake processor
CPU clock:	3.19 GHz
CPU family:	6
CPU model:	158
CPU short:	skylake
CPU stepping:	10
CPU features:	FP ACPI MMX SSE SSE2 HTT TM RDTSCP MONITOR VMX EIST TM2 SSSE FMA SSE4.1 SSE4.2 AES AVX RDRAND AVX2 RDSEED SSE3 
CPU arch:	x86_64
--------------------------------------------------------------------------------
PERFMON version:			4
PERFMON number of counters:		4
PERFMON width of counters:		48
PERFMON number of fixed counters:	3
--------------------------------------------------------------------------------
NVMON GPU 0 compute capability:	8.9
NVMON GPU 0 short:		nvidia_gpu_cc_ge_7
--------------------------------------------------------------------------------
DEBUG - [nvmon_init:184] Device 0 runs with CUPTI Profiling API backend
DEBUG - [nvmon_perfworks_createDevice:939] link_perfworks_libraries in createDevice
DEBUG - [link_perfworks_libraries:443] LD_LIBRARY_PATH = /home/myuser/likwid_install/lib/::/usr/local/cuda/lib64
DEBUG - [link_perfworks_libraries:445] CUDA_HOME = (null)
DEBUG - [link_perfworks_libraries:613] Run cuInit
DEBUG - [link_perfworks_libraries:615] Run cuDeviceGetCount
DEBUG - [link_perfworks_libraries:620] Run cuDeviceGet
DEBUG - [link_perfworks_libraries:622] Run cuDeviceGetAttribute for major CC
DEBUG - [link_perfworks_libraries:627] Run cuDeviceGetAttribute for minor CC
DEBUG - [nvmon_perfworks_createDevice:955] Found  1 GPUs
DEBUG - [nvmon_perfworks_createDevice:962] Current GPU  0
DEBUG - [nvmon_perfworks_createDevice:987] Current GPU chip AD104
DEBUG - [nvmon_perfworks_createDevice:1001] Create metric context for chip 'AD104'
DEBUG - [nvmon_perfworks_createDevice:1005] Create metric context done
DEBUG - [nvmon_perfworks_createDevice:1020] Create metric context getMetricNames
DEBUG - [nvmon_perfworks_createDevice:1076] Destroy metric context getMetricNames
DEBUG - [nvmon_perfworks_createDevice:1080] Destroy metric context
DEBUG - [_nvml_linkLibraries:398] Init NVML Libaries
DEBUG - [_nvml_linkLibraries:425] Init NVML Libaries
Executing: ./vectorAdd
DEBUG - [nvmon_addEventSet:556] Allocating new group structure for group.
DEBUG - [nvmon_addEventSet:558] NVMON: Currently 1 groups of 2 active
DEBUG - [nvmon_addEventSet:602] Performance group for PerfWorks backend
DEBUG - [perfgroup_readGroup:873] Reading group FLOPS_SP from /home/myuser/likwid_install/share/likwid/perfgroups/nvidia_gpu_cc_ge_7/FLOPS_SP.txt
DEBUG - [nvmon_addEventSet:653] EventStr SMSP_SASS_THREAD_INST_EXECUTED_OP_FADD_PRED_ON_SUM:GPU0,SMSP_SASS_THREAD_INST_EXECUTED_OP_FMUL_PRED_ON_SUM:GPU1,SMSP_SASS_THREAD_INST_EXECUTED_OP_FFMA_PRED_ON_SUM:GPU2
DEBUG - [nvmon_addEventSet:671] Calling addevents
DEBUG - [nvmon_perfworks_addEventSet:1739] Add events to GPU device  0 with context 1410751488
DEBUG - [perfworks_check_nv_context:677] Current context  94241583160320 DevContext  0
DEBUG - [perfworks_check_nv_context:691] Reuse context 94241583160320 for device 0
DEBUG - [nvmon_perfworks_addEventSet:1769] SMSP_SASS_THREAD_INST_EXECUTED_OP_FADD_PRED_ON_SUM
DEBUG - [nvmon_perfworks_addEventSet:1775] Adding real event smsp__sass_thread_inst_executed_op_fadd_pred_on.sum
DEBUG - [nvmon_perfworks_addEventSet:1769] SMSP_SASS_THREAD_INST_EXECUTED_OP_FMUL_PRED_ON_SUM
DEBUG - [nvmon_perfworks_addEventSet:1775] Adding real event smsp__sass_thread_inst_executed_op_fmul_pred_on.sum
DEBUG - [nvmon_perfworks_addEventSet:1769] SMSP_SASS_THREAD_INST_EXECUTED_OP_FFMA_PRED_ON_SUM
DEBUG - [nvmon_perfworks_addEventSet:1775] Adding real event smsp__sass_thread_inst_executed_op_ffma_pred_on.sum
DEBUG - [nvmon_perfworks_addEventSet:1799] Increase size of eventSet space on device  0
DEBUG - [nvmon_perfworks_addEventSet:1812] Filling eventset  0 on device  0
DEBUG - [nvmon_perfworks_createConfigImage:1474] Create config image for chip AD104
DEBUG - [nvmon_perfworks_getMetricRequests114:1147] Create scratch buffer for AD104 and 0x55b657198c40
DEBUG - [nvmon_perfworks_getMetricRequests114:1161] Init Metric evaluator
DEBUG - [nvmon_perfworks_getMetricRequests114:1275] Destroy Metric evaluator
DEBUG - [nvmon_perfworks_createConfigImage:1476] Create config image for chip AD104 with 3 metric requests
DEBUG - [nvmon_perfworks_createConfigImage:1570] Allocated 296 byte for configImage
DEBUG - [nvmon_perfworks_createConfigImage:1580] nvmon_perfworks_createConfigImage_out enter  0
DEBUG - [nvmon_perfworks_createConfigImage:1582] NVPW_RawMetricsConfig_Destroy
DEBUG - [nvmon_perfworks_createConfigImage:1586] NVPW_MetricsContext_Destroy
DEBUG - [nvmon_perfworks_createConfigImage:1602] nvmon_perfworks_createConfigImage returns  296
DEBUG - [nvmon_perfworks_getMetricRequests114:1147] Create scratch buffer for AD104 and (nil)
DEBUG - [nvmon_perfworks_getMetricRequests114:1161] Init Metric evaluator
DEBUG - [nvmon_perfworks_getMetricRequests114:1275] Destroy Metric evaluator
DEBUG - [nvmon_perfworks_createCounterDataPrefixImage:1679] Allocated 172 byte for configPrefixImage
DEBUG - [nvmon_perfworks_createCounterDataPrefixImage:1691] nvmon_perfworks_createCounterDataPrefixImage_out enter  0
DEBUG - [nvmon_perfworks_createCounterDataPrefixImage:1716] nvmon_perfworks_createCounterDataPrefixImage returns  172
DEBUG - [nvmon_perfworks_addEventSet:1844] Filling eventset  0 on device  0
DEBUG - [nvmon_perfworks_addEventSet:1885] Adding eventset  0
--------------------------------------------------------------------------------
DEBUG - [nvmon_init:184] Device 0 runs with CUPTI Profiling API backend
DEBUG - [nvmon_perfworks_createDevice:939] link_perfworks_libraries in createDevice
DEBUG - [link_perfworks_libraries:443] LD_LIBRARY_PATH = /home/myuser/likwid_install/lib/::/usr/local/cuda/lib64
DEBUG - [link_perfworks_libraries:445] CUDA_HOME = (null)
DEBUG - [link_perfworks_libraries:613] Run cuInit
DEBUG - [link_perfworks_libraries:615] Run cuDeviceGetCount
DEBUG - [link_perfworks_libraries:620] Run cuDeviceGet
DEBUG - [link_perfworks_libraries:622] Run cuDeviceGetAttribute for major CC
DEBUG - [link_perfworks_libraries:627] Run cuDeviceGetAttribute for minor CC
DEBUG - [nvmon_perfworks_createDevice:955] Found  1 GPUs
DEBUG - [nvmon_perfworks_createDevice:962] Current GPU  0
DEBUG - [nvmon_perfworks_createDevice:987] Current GPU chip AD104
DEBUG - [nvmon_perfworks_createDevice:1001] Create metric context for chip 'AD104'
DEBUG - [nvmon_perfworks_createDevice:1005] Create metric context done
DEBUG - [nvmon_perfworks_createDevice:1020] Create metric context getMetricNames
DEBUG - [nvmon_perfworks_createDevice:1076] Destroy metric context getMetricNames
DEBUG - [nvmon_perfworks_createDevice:1080] Destroy metric context
DEBUG - [_nvml_linkLibraries:398] Init NVML Libaries
DEBUG - [_nvml_linkLibraries:425] Init NVML Libaries
DEBUG - [nvmon_addEventSet:556] Allocating new group structure for group.
DEBUG - [nvmon_addEventSet:558] NVMON: Currently 1 groups of 2 active
DEBUG - [nvmon_addEventSet:602] Performance group for PerfWorks backend
DEBUG - [nvmon_addEventSet:653] EventStr SMSP_SASS_THREAD_INST_EXECUTED_OP_FADD_PRED_ON_SUM:GPU0,SMSP_SASS_THREAD_INST_EXECUTED_OP_FMUL_PRED_ON_SUM:GPU1,SMSP_SASS_THREAD_INST_EXECUTED_OP_FFMA_PRED_ON_SUM:GPU2
DEBUG - [nvmon_addEventSet:671] Calling addevents
DEBUG - [nvmon_perfworks_addEventSet:1739] Add events to GPU device  0 with context 3747273696
DEBUG - [perfworks_check_nv_context:677] Current context  94613286869984 DevContext  0
DEBUG - [perfworks_check_nv_context:691] Reuse context 94613286869984 for device 0
DEBUG - [nvmon_perfworks_addEventSet:1769] SMSP_SASS_THREAD_INST_EXECUTED_OP_FADD_PRED_ON_SUM
DEBUG - [nvmon_perfworks_addEventSet:1775] Adding real event smsp__sass_thread_inst_executed_op_fadd_pred_on.sum
DEBUG - [nvmon_perfworks_addEventSet:1769] SMSP_SASS_THREAD_INST_EXECUTED_OP_FMUL_PRED_ON_SUM
DEBUG - [nvmon_perfworks_addEventSet:1775] Adding real event smsp__sass_thread_inst_executed_op_fmul_pred_on.sum
DEBUG - [nvmon_perfworks_addEventSet:1769] SMSP_SASS_THREAD_INST_EXECUTED_OP_FFMA_PRED_ON_SUM
DEBUG - [nvmon_perfworks_addEventSet:1775] Adding real event smsp__sass_thread_inst_executed_op_ffma_pred_on.sum
DEBUG - [nvmon_perfworks_addEventSet:1799] Increase size of eventSet space on device  0
DEBUG - [nvmon_perfworks_addEventSet:1812] Filling eventset  0 on device  0
DEBUG - [nvmon_perfworks_createConfigImage:1474] Create config image for chip AD104
DEBUG - [nvmon_perfworks_getMetricRequests114:1147] Create scratch buffer for AD104 and 0x560ce25dd490
DEBUG - [nvmon_perfworks_getMetricRequests114:1161] Init Metric evaluator
DEBUG - [nvmon_perfworks_getMetricRequests114:1275] Destroy Metric evaluator
DEBUG - [nvmon_perfworks_createConfigImage:1476] Create config image for chip AD104 with 3 metric requests
DEBUG - [nvmon_perfworks_createConfigImage:1570] Allocated 296 byte for configImage
DEBUG - [nvmon_perfworks_createConfigImage:1580] nvmon_perfworks_createConfigImage_out enter  0
DEBUG - [nvmon_perfworks_createConfigImage:1582] NVPW_RawMetricsConfig_Destroy
DEBUG - [nvmon_perfworks_createConfigImage:1586] NVPW_MetricsContext_Destroy
DEBUG - [nvmon_perfworks_createConfigImage:1602] nvmon_perfworks_createConfigImage returns  296
DEBUG - [nvmon_perfworks_getMetricRequests114:1147] Create scratch buffer for AD104 and (nil)
DEBUG - [nvmon_perfworks_getMetricRequests114:1161] Init Metric evaluator
DEBUG - [nvmon_perfworks_getMetricRequests114:1275] Destroy Metric evaluator
DEBUG - [nvmon_perfworks_createCounterDataPrefixImage:1679] Allocated 172 byte for configPrefixImage
DEBUG - [nvmon_perfworks_createCounterDataPrefixImage:1691] nvmon_perfworks_createCounterDataPrefixImage_out enter  0
DEBUG - [nvmon_perfworks_createCounterDataPrefixImage:1716] nvmon_perfworks_createCounterDataPrefixImage returns  172
DEBUG - [nvmon_perfworks_addEventSet:1844] Filling eventset  0 on device  0
DEBUG - [nvmon_perfworks_addEventSet:1885] Adding eventset  0
DEBUG - [perfworks_check_nv_context:677] Current context  94613286869984 DevContext  94613286869984
DEBUG - [perfworks_check_nv_context:708] Context 94613286869984 fits for device 0
DEBUG - [nvmon_perfworks_setupCounters:2112] Setup Counters on device  0
DEBUG - [nvmon_perfworks_setupCounterImageData:1910] counterDataPrefixSize  172
DEBUG - [nvmon_perfworks_setupCounterImageData:1935] Resize counterDataImage to  26523
DEBUG - [nvmon_perfworks_setupCounterImageData:1944] Resized counterDataImage to  26523
DEBUG - [nvmon_perfworks_setupCounterImageData:1962] Resize counterDataScratchBuffer to  40
DEBUG - [nvmon_perfworks_setupCounterImageData:1972] Resized counterDataScratchBuffer to  40
DEBUG - [perfworks_check_nv_context:677] Current context  94613286869984 DevContext  94613286869984
DEBUG - [perfworks_check_nv_context:708] Context 94613286869984 fits for device 0
DEBUG - [nvmon_perfworks_startCounters:2160] Start Counters on device  0(Eventset  0)
DEBUG - [nvmon_perfworks_startCounters:2189] (START)counterDataImageSize  26523
DEBUG - [nvmon_perfworks_startCounters:2195] (START)counterDataScratchBufferSize  40
DEBUG - [nvmon_perfworks_startCounters:2207] (START)configImage  296
DEBUG - [perfworks_check_nv_context:677] Current context  94613286869984 DevContext  94613286869984
DEBUG - [perfworks_check_nv_context:708] Context 94613286869984 fits for device 0
DEBUG - [nvmon_perfworks_stopCounters:2252] Stop Counters on device  0(Eventset  0)
DEBUG - [nvmon_perfworks_stopCounters:2307] Get results on device  0(Eventset  0)
DEBUG - [nvmon_perfworks_getMetricValue:2078] Final Eval smsp__sass_thread_inst_executed_op_fadd_pred_on.sum :  0.000000
DEBUG - [nvmon_perfworks_getMetricValue:2078] Final Eval smsp__sass_thread_inst_executed_op_fmul_pred_on.sum :  0.000000
DEBUG - [nvmon_perfworks_getMetricValue:2078] Final Eval smsp__sass_thread_inst_executed_op_ffma_pred_on.sum :  0.000000
DEBUG - [nvmon_perfworks_stopCounters:2338] smsp__sass_thread_inst_executed_op_fadd_pred_on.sum Last  0.000000 Full  0.000000
DEBUG - [nvmon_perfworks_stopCounters:2338] smsp__sass_thread_inst_executed_op_fmul_pred_on.sum Last  0.000000 Full  0.000000
DEBUG - [nvmon_perfworks_stopCounters:2338] smsp__sass_thread_inst_executed_op_ffma_pred_on.sum Last  0.000000 Full  0.000000
DEBUG - [perfworks_check_nv_context:677] Current context  94613286869984 DevContext  94613286869984
DEBUG - [perfworks_check_nv_context:708] Context 94613286869984 fits for device 0
DEBUG - [nvmon_perfworks_startCounters:2160] Start Counters on device  0(Eventset  0)
DEBUG - [nvmon_perfworks_startCounters:2189] (START)counterDataImageSize  26523
DEBUG - [nvmon_perfworks_startCounters:2195] (START)counterDataScratchBufferSize  40
DEBUG - [nvmon_perfworks_startCounters:2207] (START)configImage  296
DEBUG - [nvmon_markerStartRegion:418] START Device  0 Event  0 :  0.000000
DEBUG - [nvmon_markerStartRegion:418] START Device  0 Event  1 :  0.000000
DEBUG - [nvmon_markerStartRegion:418] START Device  0 Event  2 :  0.000000
DEBUG - [perfworks_check_nv_context:677] Current context  94613286869984 DevContext  94613286869984
DEBUG - [perfworks_check_nv_context:708] Context 94613286869984 fits for device 0
DEBUG - [nvmon_perfworks_stopCounters:2252] Stop Counters on device  0(Eventset  0)
DEBUG - [nvmon_perfworks_stopCounters:2307] Get results on device  0(Eventset  0)
DEBUG - [nvmon_perfworks_getMetricValue:2078] Final Eval smsp__sass_thread_inst_executed_op_fadd_pred_on.sum :  250.000000
DEBUG - [nvmon_perfworks_getMetricValue:2078] Final Eval smsp__sass_thread_inst_executed_op_fmul_pred_on.sum :  0.000000
DEBUG - [nvmon_perfworks_getMetricValue:2078] Final Eval smsp__sass_thread_inst_executed_op_ffma_pred_on.sum :  0.000000
DEBUG - [nvmon_perfworks_stopCounters:2338] smsp__sass_thread_inst_executed_op_fadd_pred_on.sum Last  250.000000 Full  250.000000
DEBUG - [nvmon_perfworks_stopCounters:2338] smsp__sass_thread_inst_executed_op_fmul_pred_on.sum Last  0.000000 Full  0.000000
DEBUG - [nvmon_perfworks_stopCounters:2338] smsp__sass_thread_inst_executed_op_ffma_pred_on.sum Last  0.000000 Full  0.000000
DEBUG - [perfworks_check_nv_context:677] Current context  94613286869984 DevContext  94613286869984
DEBUG - [perfworks_check_nv_context:708] Context 94613286869984 fits for device 0
DEBUG - [nvmon_perfworks_startCounters:2160] Start Counters on device  0(Eventset  0)
DEBUG - [nvmon_perfworks_startCounters:2189] (START)counterDataImageSize  26523
DEBUG - [nvmon_perfworks_startCounters:2195] (START)counterDataScratchBufferSize  40
DEBUG - [nvmon_perfworks_startCounters:2207] (START)configImage  296
DEBUG - [nvmon_markerStopRegion:468] STOP Device  0 Event  0 :  250.000000 -  0.000000
DEBUG - [nvmon_markerStopRegion:468] STOP Device  0 Event  1 :  0.000000 -  0.000000
DEBUG - [nvmon_markerStopRegion:468] STOP Device  0 Event  2 :  0.000000 -  0.000000
DEBUG - [perfworks_check_nv_context:677] Current context  94613286869984 DevContext  94613286869984
DEBUG - [perfworks_check_nv_context:708] Context 94613286869984 fits for device 0
DEBUG - [nvmon_perfworks_stopCounters:2252] Stop Counters on device  0(Eventset  0)
DEBUG - [nvmon_perfworks_stopCounters:2307] Get results on device  0(Eventset  0)
DEBUG - [nvmon_perfworks_getMetricValue:2078] Final Eval smsp__sass_thread_inst_executed_op_fadd_pred_on.sum :  166.666667
DEBUG - [nvmon_perfworks_getMetricValue:2078] Final Eval smsp__sass_thread_inst_executed_op_fmul_pred_on.sum :  0.000000
DEBUG - [nvmon_perfworks_getMetricValue:2078] Final Eval smsp__sass_thread_inst_executed_op_ffma_pred_on.sum :  0.000000
DEBUG - [nvmon_perfworks_stopCounters:2338] smsp__sass_thread_inst_executed_op_fadd_pred_on.sum Last  166.666667 Full  416.666667
DEBUG - [nvmon_perfworks_stopCounters:2338] smsp__sass_thread_inst_executed_op_fmul_pred_on.sum Last  0.000000 Full  0.000000
DEBUG - [nvmon_perfworks_stopCounters:2338] smsp__sass_thread_inst_executed_op_ffma_pred_on.sum Last  0.000000 Full  0.000000
--------------------------------------------------------------------------------
Region vecAdd, Group 1: FLOPS_SP
+-------------------+----------+
|    Region Info    |   GPU 0  |
+-------------------+----------+
| RDTSC Runtime [s] | 0.000092 |
|     call count    |        1 |
+-------------------+----------+

+----------------------------------------------------+---------+-------+
|                        Event                       | Counter | GPU 0 |
+----------------------------------------------------+---------+-------+
| SMSP_SASS_THREAD_INST_EXECUTED_OP_FADD_PRED_ON_SUM |   GPU0  |   250 |
| SMSP_SASS_THREAD_INST_EXECUTED_OP_FMUL_PRED_ON_SUM |   GPU1  |     0 |
| SMSP_SASS_THREAD_INST_EXECUTED_OP_FFMA_PRED_ON_SUM |   GPU2  |     0 |
+----------------------------------------------------+---------+-------+

+---------------------+--------+
|        Metric       |  GPU 0 |
+---------------------+--------+
| Runtime (RDTSC) [s] | 0.0001 |
|     SP [MFLOP/s]    | 2.7225 |
+---------------------+--------+
@teojgo teojgo added the bug label Dec 19, 2023
@TomTheBear
Copy link
Member

Thanks for the output. It seems to me that despite calling cudaDeviceSynchronize() before LIKWID_NVMARKER_STOP, the counts are not finished yet. In LIKWID_NVMARKER_CLOSE, the library finally stops the counting (last lines of debugging output after nvmon_markerStopRegion) and reads still 166 fadd instructions although the GPU should be inactive after the execution of LIKWID_NVMARKER_STOP. I have to investigate what's going on there. Thanks for the test code.

@TomTheBear
Copy link
Member

I played around a little but no solution yet.

@teojgo
Copy link
Author

teojgo commented Jan 8, 2024

I played around a little but no solution yet.

Have you managed to reproduce the problem?

@TomTheBear
Copy link
Member

Yes but I couldn't find any problematic parts until now.

My holidays start tomorrow, so there will be no activity in the next month. I just wanted to let you know that it is not forgotten.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants