[BUG] perfctr crashes on a64fx #599

Open
jdomke opened this issue Jan 24, 2024 · 2 comments

@jdomke

jdomke commented Jan 24, 2024

Describe the bug
likwid-perfctr aborts with different glibc heap-corruption errors (Aborted (core dumped)) depending on the runtime of the sleep command:

$ likwid-perfctr -C 0 -g L2 sleep 1
--------------------------------------------------------------------------------
CPU name:
CPU type:       Fujitsu A64FX
CPU clock:      0.00 GHz
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
malloc(): unaligned tcache chunk detected
[1]+  Aborted                 (core dumped) likwid-perfctr -C 0 -g L2 sleep 1
Aborted (core dumped)
$ likwid-perfctr -C 0 -g L2 sleep 2
--------------------------------------------------------------------------------
CPU name:
CPU type:       Fujitsu A64FX
CPU clock:      0.00 GHz
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Group 1: L2
+------------------+---------+------------+
<<snip>>
|    L1<->L2 data volume [GBytes]    |     0.0020 |
+------------------------------------+------------+

double free or corruption (out)
Aborted (core dumped)

To Reproduce

  • LIKWID command and/or API usage
    ** see above
  • LIKWID version and download source (GitHub, FTP, package manager, ...)
    ** v5.3.0 tag compiled with GCCARMv8 and ACCESSMODE=direct
  • Operating system
    ** RHEL 8.8 (Ootpa)
  • Does your application use libraries like MPI, OpenMP or Pthreads?
  • In case of Nvidia GPUs, which CUDA version?
  • Are you using the MarkerAPI (CPU code instrumentation) or the NvMarkerAPI (Nvidia GPU code instrumentation)?

To Reproduce with a LIKWID command
Please supply the output of the command with -V 3 added to the command:

  • likwid-perfctr
$ likwid-perfctr -V 3 -C 0 -g L2 sleep 1
DEBUG - [hwloc_init_cpuInfo:367] HWLOC CpuInfo Family 8 Model 1 Stepping 0 Vendor 0x46 Part 0x1 isIntel 0 numHWThreads 24 activeHWThreads 24
DEBUG - [proc_init_nodeTopology:713] PROC Thread Pool PU 0 Thread 0 Core 0 Die 0 Socket 0 inCpuSet 1
DEBUG - [proc_init_nodeTopology:713] PROC Thread Pool PU 1 Thread 0 Core 1 Die 0 Socket 0 inCpuSet 1
DEBUG - [proc_init_nodeTopology:713] PROC Thread Pool PU 2 Thread 0 Core 6 Die 0 Socket 0 inCpuSet 1
DEBUG - [proc_init_nodeTopology:713] PROC Thread Pool PU 3 Thread 0 Core 7 Die 0 Socket 0 inCpuSet 1
DEBUG - [proc_init_nodeTopology:713] PROC Thread Pool PU 4 Thread 0 Core 8 Die 0 Socket 0 inCpuSet 1
DEBUG - [proc_init_nodeTopology:713] PROC Thread Pool PU 5 Thread 0 Core 10 Die 0 Socket 0 inCpuSet 1
DEBUG - [proc_init_nodeTopology:713] PROC Thread Pool PU 6 Thread 0 Core 0 Die 0 Socket 1 inCpuSet 1
DEBUG - [proc_init_nodeTopology:713] PROC Thread Pool PU 7 Thread 0 Core 1 Die 0 Socket 1 inCpuSet 1
DEBUG - [proc_init_nodeTopology:713] PROC Thread Pool PU 8 Thread 0 Core 6 Die 0 Socket 1 inCpuSet 1
DEBUG - [proc_init_nodeTopology:713] PROC Thread Pool PU 9 Thread 0 Core 7 Die 0 Socket 1 inCpuSet 1
DEBUG - [proc_init_nodeTopology:713] PROC Thread Pool PU 10 Thread 0 Core 8 Die 0 Socket 1 inCpuSet 1
DEBUG - [proc_init_nodeTopology:713] PROC Thread Pool PU 11 Thread 0 Core 10 Die 0 Socket 1 inCpuSet 1
DEBUG - [proc_init_nodeTopology:713] PROC Thread Pool PU 12 Thread 0 Core 0 Die 0 Socket 2 inCpuSet 1
DEBUG - [proc_init_nodeTopology:713] PROC Thread Pool PU 13 Thread 0 Core 5 Die 0 Socket 2 inCpuSet 1
DEBUG - [proc_init_nodeTopology:713] PROC Thread Pool PU 14 Thread 0 Core 6 Die 0 Socket 2 inCpuSet 1
DEBUG - [proc_init_nodeTopology:713] PROC Thread Pool PU 15 Thread 0 Core 8 Die 0 Socket 2 inCpuSet 1
DEBUG - [proc_init_nodeTopology:713] PROC Thread Pool PU 16 Thread 0 Core 10 Die 0 Socket 2 inCpuSet 1
DEBUG - [proc_init_nodeTopology:713] PROC Thread Pool PU 17 Thread 0 Core 11 Die 0 Socket 2 inCpuSet 1
DEBUG - [proc_init_nodeTopology:713] PROC Thread Pool PU 18 Thread 0 Core 0 Die 0 Socket 3 inCpuSet 1
DEBUG - [proc_init_nodeTopology:713] PROC Thread Pool PU 19 Thread 0 Core 5 Die 0 Socket 3 inCpuSet 1
DEBUG - [proc_init_nodeTopology:713] PROC Thread Pool PU 20 Thread 0 Core 6 Die 0 Socket 3 inCpuSet 1
DEBUG - [proc_init_nodeTopology:713] PROC Thread Pool PU 21 Thread 0 Core 8 Die 0 Socket 3 inCpuSet 1
DEBUG - [proc_init_nodeTopology:713] PROC Thread Pool PU 22 Thread 0 Core 10 Die 0 Socket 3 inCpuSet 1
DEBUG - [proc_init_nodeTopology:713] PROC Thread Pool PU 23 Thread 0 Core 11 Die 0 Socket 3 inCpuSet 1
DEBUG - [affinity_init:547] Affinity: Socket domains 4
DEBUG - [affinity_init:549] Affinity: CPU die domains 4
DEBUG - [affinity_init:554] Affinity: CPU cores per LLC 12
DEBUG - [affinity_init:557] Affinity: Cache domains 0
DEBUG - [affinity_init:561] Affinity: NUMA domains 4
DEBUG - [affinity_init:562] Affinity: All domains 13
DEBUG - [affinity_addNodeDomain:370] Affinity domain N: 24 HW threads on 24 cores
DEBUG - [affinity_addSocketDomain:401] Affinity domain S0: 6 HW threads on 6 cores
DEBUG - [affinity_addSocketDomain:401] Affinity domain S1: 6 HW threads on 6 cores
DEBUG - [affinity_addSocketDomain:401] Affinity domain S2: 6 HW threads on 6 cores
DEBUG - [affinity_addSocketDomain:401] Affinity domain S3: 6 HW threads on 6 cores
DEBUG - [affinity_addDieDomain:438] Affinity domain D0: 6 HW threads on 6 cores
DEBUG - [affinity_addDieDomain:438] Affinity domain D1: 6 HW threads on 6 cores
DEBUG - [affinity_addDieDomain:438] Affinity domain D2: 6 HW threads on 6 cores
DEBUG - [affinity_addDieDomain:438] Affinity domain D3: 6 HW threads on 6 cores
DEBUG - [affinity_addCacheDomain:474] Affinity domain C0: 6 HW threads on 6 cores
DEBUG - [affinity_addCacheDomain:474] Affinity domain C0: 6 HW threads on 6 cores
DEBUG - [affinity_addCacheDomain:474] Affinity domain C0: 6 HW threads on 6 cores
DEBUG - [affinity_addCacheDomain:474] Affinity domain C0: 6 HW threads on 6 cores
DEBUG - [affinity_addMemoryDomain:504] Affinity domain M0: 6 HW threads on 6 cores
DEBUG - [affinity_addMemoryDomain:504] Affinity domain M1: 6 HW threads on 6 cores
DEBUG - [affinity_addMemoryDomain:504] Affinity domain M2: 6 HW threads on 6 cores
DEBUG - [affinity_addMemoryDomain:504] Affinity domain M3: 6 HW threads on 6 cores
DEBUG - [create_lookups:295] T 0 T2C 0 T2S 0 T2D 0 T2LLC 0 T2M 0
DEBUG - [create_lookups:295] T 1 T2C 1 T2S 0 T2D 0 T2LLC 0 T2M 0
DEBUG - [create_lookups:295] T 2 T2C 6 T2S 0 T2D 0 T2LLC 0 T2M 0
DEBUG - [create_lookups:295] T 3 T2C 7 T2S 0 T2D 0 T2LLC 0 T2M 0
DEBUG - [create_lookups:295] T 4 T2C 8 T2S 0 T2D 0 T2LLC 0 T2M 0
DEBUG - [create_lookups:295] T 5 T2C 10 T2S 0 T2D 0 T2LLC 0 T2M 0
DEBUG - [create_lookups:295] T 6 T2C 0 T2S 1 T2D 1 T2LLC 0 T2M 1
DEBUG - [create_lookups:295] T 7 T2C 1 T2S 1 T2D 1 T2LLC 0 T2M 1
DEBUG - [create_lookups:295] T 8 T2C 6 T2S 1 T2D 1 T2LLC 0 T2M 1
DEBUG - [create_lookups:295] T 9 T2C 7 T2S 1 T2D 1 T2LLC 0 T2M 1
DEBUG - [create_lookups:295] T 10 T2C 8 T2S 1 T2D 1 T2LLC 0 T2M 1
DEBUG - [create_lookups:295] T 11 T2C 10 T2S 1 T2D 1 T2LLC 0 T2M 1
DEBUG - [create_lookups:295] T 12 T2C 0 T2S 2 T2D 2 T2LLC 0 T2M 2
DEBUG - [create_lookups:295] T 13 T2C 5 T2S 2 T2D 2 T2LLC 0 T2M 2
DEBUG - [create_lookups:295] T 14 T2C 6 T2S 2 T2D 2 T2LLC 0 T2M 2
DEBUG - [create_lookups:295] T 15 T2C 8 T2S 2 T2D 2 T2LLC 0 T2M 2
DEBUG - [create_lookups:295] T 16 T2C 10 T2S 2 T2D 2 T2LLC 0 T2M 2
DEBUG - [create_lookups:295] T 17 T2C 11 T2S 2 T2D 2 T2LLC 0 T2M 2
DEBUG - [create_lookups:295] T 18 T2C 0 T2S 3 T2D 3 T2LLC 0 T2M 3
DEBUG - [create_lookups:295] T 19 T2C 5 T2S 3 T2D 3 T2LLC 0 T2M 3
DEBUG - [create_lookups:295] T 20 T2C 6 T2S 3 T2D 3 T2LLC 0 T2M 3
DEBUG - [create_lookups:295] T 21 T2C 8 T2S 3 T2D 3 T2LLC 0 T2M 3
DEBUG - [create_lookups:295] T 22 T2C 10 T2S 3 T2D 3 T2LLC 0 T2M 3
DEBUG - [create_lookups:295] T 23 T2C 11 T2S 3 T2D 3 T2LLC 0 T2M 3
--------------------------------------------------------------------------------
CPU name:	
CPU type:	Fujitsu A64FX
CPU clock:	0.00 GHz
CPU family:	8
CPU model:	1
CPU short:	arm64fx
CPU stepping:	0
CPU features:	FP ASIMD AES PMULL ASIMDRDM SVE 
CPU arch:	armv8
--------------------------------------------------------------------------------
[likwid-pin] Main PID -> hwthread 0 - OK
Executing: sleep 1
DEBUG - [perfmon_addEventSet:2326] Currently 1 groups of 2 active
DEBUG - [perfgroup_readGroup:873] Reading group L2 from /home/domke/CPUStudy_A64FX_2600Mhz/testCompile_llvm/dep/likwid/share/likwid/perfgroups/arm64fx/L2.txt
DEBUG - [perfmon_addEventSet:2385] Eventstring INST_RETIRED:PMC0,CPU_CYCLES:PMC1,L1D_CACHE_REFILL:PMC2,L1D_CACHE_WB:PMC3,L1I_CACHE_REFILL:PMC4
DEBUG - [perfmon_addEventSet:2512] Added event INST_RETIRED for counter PMC0 to group 0
DEBUG - [perfmon_addEventSet:2512] Added event CPU_CYCLES for counter PMC1 to group 0
DEBUG - [perfmon_addEventSet:2512] Added event L1D_CACHE_REFILL for counter PMC2 to group 0
DEBUG - [perfmon_addEventSet:2512] Added event L1D_CACHE_WB for counter PMC3 to group 0
DEBUG - [perfmon_addEventSet:2512] Added event L1I_CACHE_REFILL for counter PMC4 to group 0
DEBUG - [perfmon_setupCountersThread_perfevent:1084] SETUP_PMC [0] Register 0x0 , Flags: 0x8 
DEBUG - [perfmon_setupCountersThread_perfevent:1416] perf_event_open: cpu_id=0 pid=-1 flags=0
DEBUG - [perfmon_setupCountersThread_perfevent:1084] SETUP_PMC [0] Register 0x1 , Flags: 0x11 
DEBUG - [perfmon_setupCountersThread_perfevent:1416] perf_event_open: cpu_id=0 pid=-1 flags=0
DEBUG - [perfmon_setupCountersThread_perfevent:1084] SETUP_PMC [0] Register 0x2 , Flags: 0x3 
DEBUG - [perfmon_setupCountersThread_perfevent:1416] perf_event_open: cpu_id=0 pid=-1 flags=0
DEBUG - [perfmon_setupCountersThread_perfevent:1084] SETUP_PMC [0] Register 0x3 , Flags: 0x15 
DEBUG - [perfmon_setupCountersThread_perfevent:1416] perf_event_open: cpu_id=0 pid=-1 flags=0
DEBUG - [perfmon_setupCountersThread_perfevent:1084] SETUP_PMC [0] Register 0x4 , Flags: 0x1 
DEBUG - [perfmon_setupCountersThread_perfevent:1416] perf_event_open: cpu_id=0 pid=-1 flags=0
--------------------------------------------------------------------------------
DEBUG - [perfmon_startCountersThread_perfevent:1472] RESET_COUNTER [0] Register 0x0 , Flags: 0x0 
DEBUG - [perfmon_startCountersThread_perfevent:1485] START_COUNTER [0] Register 0x0 , Flags: 0x0 
DEBUG - [perfmon_startCountersThread_perfevent:1472] RESET_COUNTER [0] Register 0x0 , Flags: 0x0 
DEBUG - [perfmon_startCountersThread_perfevent:1485] START_COUNTER [0] Register 0x0 , Flags: 0x0 
DEBUG - [perfmon_startCountersThread_perfevent:1472] RESET_COUNTER [0] Register 0x0 , Flags: 0x0 
DEBUG - [perfmon_startCountersThread_perfevent:1485] START_COUNTER [0] Register 0x0 , Flags: 0x0 
DEBUG - [perfmon_startCountersThread_perfevent:1472] RESET_COUNTER [0] Register 0x0 , Flags: 0x0 
DEBUG - [perfmon_startCountersThread_perfevent:1485] START_COUNTER [0] Register 0x0 , Flags: 0x0 
DEBUG - [perfmon_startCountersThread_perfevent:1472] RESET_COUNTER [0] Register 0x0 , Flags: 0x0 
DEBUG - [perfmon_startCountersThread_perfevent:1485] START_COUNTER [0] Register 0x0 , Flags: 0x0 
DEBUG - [perfmon_readCountersThread_perfevent:1559] FREEZE_COUNTER [0] Register 0x5 , Flags: 0x0 
DEBUG - [perfmon_readCountersThread_perfevent:1563] READ_COUNTER [0] Register 0x5 , Flags: 0x7049 
DEBUG - [perfmon_readCountersThread_perfevent:1586] UNFREEZE_COUNTER [0] Register 0x5 , Flags: 0x0 
DEBUG - [perfmon_readCountersThread_perfevent:1559] FREEZE_COUNTER [0] Register 0x6 , Flags: 0x0 
DEBUG - [perfmon_readCountersThread_perfevent:1563] READ_COUNTER [0] Register 0x6 , Flags: 0x10653 
DEBUG - [perfmon_readCountersThread_perfevent:1586] UNFREEZE_COUNTER [0] Register 0x6 , Flags: 0x0 
DEBUG - [perfmon_readCountersThread_perfevent:1559] FREEZE_COUNTER [0] Register 0x7 , Flags: 0x0 
DEBUG - [perfmon_readCountersThread_perfevent:1563] READ_COUNTER [0] Register 0x7 , Flags: 0xE4 
DEBUG - [perfmon_readCountersThread_perfevent:1586] UNFREEZE_COUNTER [0] Register 0x7 , Flags: 0x0 
DEBUG - [perfmon_readCountersThread_perfevent:1559] FREEZE_COUNTER [0] Register 0x8 , Flags: 0x0 
DEBUG - [perfmon_readCountersThread_perfevent:1563] READ_COUNTER [0] Register 0x8 , Flags: 0x58 
DEBUG - [perfmon_readCountersThread_perfevent:1586] UNFREEZE_COUNTER [0] Register 0x8 , Flags: 0x0 
DEBUG - [perfmon_readCountersThread_perfevent:1559] FREEZE_COUNTER [0] Register 0x9 , Flags: 0x0 
DEBUG - [perfmon_readCountersThread_perfevent:1563] READ_COUNTER [0] Register 0x9 , Flags: 0x213 
DEBUG - [perfmon_readCountersThread_perfevent:1586] UNFREEZE_COUNTER [0] Register 0x9 , Flags: 0x0 
DEBUG - [perfmon_stopCountersThread_perfevent:1508] FREEZE_COUNTER [0] Register 0x5 , Flags: 0x0 
DEBUG - [perfmon_stopCountersThread_perfevent:1512] READ_COUNTER [0] Register 0x5 , Flags: 0x952C8 
DEBUG - [perfmon_stopCountersThread_perfevent:1537] RESET_COUNTER [0] Register 0x5 , Flags: 0x0 
DEBUG - [perfmon_stopCountersThread_perfevent:1508] FREEZE_COUNTER [0] Register 0x6 , Flags: 0x0 
DEBUG - [perfmon_stopCountersThread_perfevent:1512] READ_COUNTER [0] Register 0x6 , Flags: 0x1070C7 
DEBUG - [perfmon_stopCountersThread_perfevent:1537] RESET_COUNTER [0] Register 0x6 , Flags: 0x0 
DEBUG - [perfmon_stopCountersThread_perfevent:1508] FREEZE_COUNTER [0] Register 0x7 , Flags: 0x0 
DEBUG - [perfmon_stopCountersThread_perfevent:1512] READ_COUNTER [0] Register 0x7 , Flags: 0xD87 
DEBUG - [perfmon_stopCountersThread_perfevent:1537] RESET_COUNTER [0] Register 0x7 , Flags: 0x0 
DEBUG - [perfmon_stopCountersThread_perfevent:1508] FREEZE_COUNTER [0] Register 0x8 , Flags: 0x0 
DEBUG - [perfmon_stopCountersThread_perfevent:1512] READ_COUNTER [0] Register 0x8 , Flags: 0x4F8 
DEBUG - [perfmon_stopCountersThread_perfevent:1537] RESET_COUNTER [0] Register 0x8 , Flags: 0x0 
DEBUG - [perfmon_stopCountersThread_perfevent:1508] FREEZE_COUNTER [0] Register 0x9 , Flags: 0x0 
DEBUG - [perfmon_stopCountersThread_perfevent:1512] READ_COUNTER [0] Register 0x9 , Flags: 0xF23 
DEBUG - [perfmon_stopCountersThread_perfevent:1537] RESET_COUNTER [0] Register 0x9 , Flags: 0x0 
--------------------------------------------------------------------------------
Group 1: L2
+------------------+---------+------------+
|       Event      | Counter | HWThread 0 |
+------------------+---------+------------+
|   INST_RETIRED   |   PMC0  |     611016 |
|    CPU_CYCLES    |   PMC1  |    1077447 |
| L1D_CACHE_REFILL |   PMC2  |       3463 |
|   L1D_CACHE_WB   |   PMC3  |       1272 |
| L1I_CACHE_REFILL |   PMC4  |       3875 |
+------------------+---------+------------+

+------------------------------------+------------+
|               Metric               | HWThread 0 |
+------------------------------------+------------+
|         Runtime (RDTSC) [s]        |     1.0025 |
|                 CPI                |     1.7634 |
|  L1D<-L2 load bandwidth [MBytes/s] |     0.8843 |
|  L1D<-L2 load data volume [GBytes] |     0.0009 |
| L1D->L2 evict bandwidth [MBytes/s] |     0.3248 |
| L1D->L2 evict data volume [GBytes] |     0.0003 |
|  L1I<-L2 load bandwidth [MBytes/s] |     0.9895 |
|  L1I<-L2 load data volume [GBytes] |     0.0010 |
|    L1<->L2 bandwidth [MBytes/s]    |     2.1986 |
|    L1<->L2 data volume [GBytes]    |     0.0022 |
+------------------------------------+------------+

double free or corruption (out)

jdomke added the bug label Jan 24, 2024
@jdomke

jdomke commented Jan 24, 2024

Note: building with FCC (the Fujitsu compiler) results in similar crashes.

@jdomke

jdomke commented Jan 25, 2024

The issue results from disabled cores in a 24-core version of the A64FX (the die has all 48 cores, but only 24 are active). Unlike on Intel/AMD, the kernel does not remap the core IDs to be consecutive, so the IDs of the active cores contain gaps. This is visible here:

DEBUG - [proc_init_nodeTopology:713] PROC Thread Pool PU 0 Thread 0 Core 0 Die 0 Socket 0 inCpuSet 1
DEBUG - [proc_init_nodeTopology:713] PROC Thread Pool PU 1 Thread 0 Core 1 Die 0 Socket 0 inCpuSet 1
DEBUG - [proc_init_nodeTopology:713] PROC Thread Pool PU 2 Thread 0 Core 6 Die 0 Socket 0 inCpuSet 1
DEBUG - [proc_init_nodeTopology:713] PROC Thread Pool PU 3 Thread 0 Core 7 Die 0 Socket 0 inCpuSet 1
DEBUG - [proc_init_nodeTopology:713] PROC Thread Pool PU 4 Thread 0 Core 8 Die 0 Socket 0 inCpuSet 1
DEBUG - [proc_init_nodeTopology:713] PROC Thread Pool PU 5 Thread 0 Core 10 Die 0 Socket 0 inCpuSet 1

This excerpt covers one of the chip's four CMGs (Core Memory Groups); the other three show similar gaps in the full log above.
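To check whether a machine has this kind of sparse core numbering without running LIKWID, the raw IDs can be read from the standard Linux sysfs topology attributes (/sys/devices/system/cpu/cpuN/topology/core_id and physical_package_id). The following is a minimal sketch, not LIKWID code; `read_topology_attr` and `count_missing_core_ids` are hypothetical helper names chosen for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Read /sys/devices/system/cpu/cpu<cpu>/topology/<attr> as an integer.
 * Returns -1 if the file does not exist or cannot be parsed. */
int read_topology_attr(int cpu, const char *attr)
{
    char path[256];
    int value = -1;

    assert(attr != NULL);
    snprintf(path, sizeof(path),
             "/sys/devices/system/cpu/cpu%d/topology/%s", cpu, attr);
    FILE *fp = fopen(path, "r");
    if (fp == NULL)
        return -1;
    if (fscanf(fp, "%d", &value) != 1)
        value = -1;
    fclose(fp);
    return value;
}

/* Given the raw core IDs of all hardware threads in one socket, return how
 * many IDs between 0 and the largest ID are unused (0 means the numbering
 * is dense). Repeated IDs, e.g. SMT siblings, are counted once.
 * Returns -1 on invalid input or allocation failure. */
int count_missing_core_ids(const int *core_ids, int n)
{
    int max_id = -1;

    if (n < 0 || (n > 0 && core_ids == NULL))
        return -1;
    for (int i = 0; i < n; i++) {
        if (core_ids[i] < 0)
            return -1;
        if (core_ids[i] > max_id)
            max_id = core_ids[i];
    }
    if (max_id < 0)
        return 0;                              /* empty socket */

    /* seen[k] != 0 if core ID k occurs at least once */
    char *seen = calloc((size_t)max_id + 1, 1);
    if (seen == NULL)
        return -1;
    for (int i = 0; i < n; i++)
        seen[core_ids[i]] = 1;

    int missing = 0;
    for (int k = 0; k <= max_id; k++)
        if (!seen[k])
            missing++;
    free(seen);
    return missing;
}
```

For socket 0 in the log above (core IDs 0, 1, 6, 7, 8, 10), `count_missing_core_ids` reports 5 missing IDs (2, 3, 4, 5 and 9), while a dense numbering reports 0. Any code that sizes an array by the number of cores but indexes it by these raw IDs would write past its end.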

I was able to "fix" this part with

diff --git a/src/topology_proc.c b/src/topology_proc.c
index 398be11f..77fa871a 100644
--- a/src/topology_proc.c
+++ b/src/topology_proc.c
@@ -602,6 +602,7 @@ proc_init_nodeTopology(cpu_set_t cpuSet)
     int (*ownatoi)(const char*);
     ownatoi = &atoi;
     int last_socket = -1;
+    int last_coreid = -1;
     int num_sockets = 0;
     int num_cores_per_socket = 0;
     int num_threads_per_core = 0;
@@ -631,6 +632,7 @@ proc_init_nodeTopology(cpu_set_t cpuSet)
             {
                 num_sockets++;
                 last_socket = packageId;
+                last_coreid = -1;
             }
             fclose(fp);
         }
@@ -639,7 +641,7 @@ proc_init_nodeTopology(cpu_set_t cpuSet)
         if (NULL != (fp = fopen (bdata(file), "r")))
         {
             bstring src = bread ((bNread) fread, fp);
-            hwThreadPool[i].coreId = ownatoi(bdata(src));
+            hwThreadPool[i].coreId = (++last_coreid); //ownatoi(bdata(src));
             if (hwThreadPool[i].packageId == 0)
             {
                 num_cores_per_socket++;

but this only moves the error to other parts of the code. I think LIKWID has severe issues whenever cores, sockets, cache domains, etc. are not numbered under ideal conditions, e.g., when their IDs are not contiguous.
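The patch above numbers cores by incrementing a counter for every hardware thread and resets it when a new socket appears. That works for the A64FX output in this report, where every PU reports Thread 0 and threads are listed socket by socket, but it would give SMT siblings different core IDs and depends on the enumeration order. A minimal, order-independent alternative is sketched below: sort the threads by (package ID, raw core ID) and give each distinct raw core ID its rank within the socket. `ThreadInfo`, `SortKey` and `densify_core_ids` are hypothetical names standing in for the hwThreadPool entries; this is not LIKWID's actual type or API and has not been tried against the LIKWID code base.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a hwThreadPool entry (not LIKWID's real type). */
typedef struct {
    int packageId;  /* socket ID, e.g. from topology/physical_package_id */
    int rawCoreId;  /* core ID as reported by the kernel; may contain gaps */
    int coreId;     /* output: dense core index within the socket */
} ThreadInfo;

typedef struct {
    int packageId;
    int rawCoreId;
    int index;      /* position of the thread in the caller's array */
} SortKey;

/* Order by socket first, then by raw core ID. */
static int cmp_sort_key(const void *a, const void *b)
{
    const SortKey *x = a;
    const SortKey *y = b;
    if (x->packageId != y->packageId)
        return (x->packageId > y->packageId) - (x->packageId < y->packageId);
    return (x->rawCoreId > y->rawCoreId) - (x->rawCoreId < y->rawCoreId);
}

/* Give every thread a dense core ID: the rank of its raw core ID among the
 * distinct raw core IDs of its socket. Threads sharing a raw core ID (SMT
 * siblings) get the same dense ID, and the input order does not matter.
 * Returns 0 on success, -1 on allocation failure. */
int densify_core_ids(ThreadInfo *threads, int n)
{
    assert(n >= 0 && (n == 0 || threads != NULL));
    if (n == 0)
        return 0;

    SortKey *keys = malloc((size_t)n * sizeof *keys);
    if (keys == NULL)
        return -1;
    for (int i = 0; i < n; i++) {
        keys[i].packageId = threads[i].packageId;
        keys[i].rawCoreId = threads[i].rawCoreId;
        keys[i].index = i;
    }
    qsort(keys, (size_t)n, sizeof *keys, cmp_sort_key);

    int dense = 0;
    for (int i = 0; i < n; i++) {
        if (i == 0 || keys[i].packageId != keys[i - 1].packageId)
            dense = 0;                  /* first thread of a new socket */
        else if (keys[i].rawCoreId != keys[i - 1].rawCoreId)
            dense++;                    /* next distinct core in this socket */
        threads[keys[i].index].coreId = dense;
    }
    free(keys);
    return 0;
}
```

Applied to the socket 0 and socket 2 threads from the log (raw IDs 0, 1, 6, 7, 8, 10 and 0, 5, 6, 8, 10, 11), this assigns core IDs 0 to 5 in each socket. It only addresses core numbering; other per-domain structures (for example the cache domains, which the log reports as 0 while listing C0 four times) would need the same kind of treatment.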
