MEM1 and MEM2 are both zero on AMD 9654 #613

Open
kadircs opened this issue Mar 4, 2024 · 13 comments

@kadircs

kadircs commented Mar 4, 2024

I am trying to measure memory bandwidth for a stencil application that runs on both sockets of a two-socket AMD 9654 system.
I am getting zero as the memory bandwidth, as seen below. Is there an issue with the DFC counters on the Zen4 architecture? Are they fully supported? I tried with and without -f.

for metric in MEM2 MEM1;do export OMP_NUM_THREADS=192; srun --nodes=1 --cpus-per-task=192 --threads-per-core=1  -t 1-0:00 --hint=nomultithread likwid-perfctr -f -C 0-191 -g ${metric}  ./a.out 512 512 512 201 2504 ;done

INFO: You are running LIKWID in a cpuset with 192 CPUs. Taking given IDs as logical ID in cpuset
--------------------------------------------------------------------------------
CPU name:   AMD EPYC 9654 96-Core Processor
CPU type:   AMD K19 (Zen4) architecture
CPU clock:  2.40 GHz
--------------------------------------------------------------------------------

+---------------------------+---------+---------------+------------+-------------+--------------+
|           Event           | Counter |      Sum      |     Min    |     Max     |      Avg     |
+---------------------------+---------+---------------+------------+-------------+--------------+
|   ACTUAL_CPU_CLOCK STAT   |  FIXC1  | 1998528125004 | 6754696363 | 14015654916 | 1.040900e+10 |
|     MAX_CPU_CLOCK STAT    |  FIXC2  | 1297140105696 | 4381439136 |  9099460176 | 6.755938e+09 |
| RETIRED_INSTRUCTIONS STAT |   PMC0  |  319022774933 |  783746373 | 14053108949 | 1.661577e+09 |
|  CPU_CLOCKS_UNHALTED STAT |   PMC1  | 1975117444353 | 6649101710 | 13903071879 | 1.028707e+10 |
|    DRAM_CHANNEL_4 STAT    |   DFC0  |             0 |     inf    |           0 |            0 |
|    DRAM_CHANNEL_5 STAT    |   DFC1  |             0 |     inf    |           0 |            0 |
|    DRAM_CHANNEL_6 STAT    |   DFC2  |             0 |     inf    |           0 |            0 |
|    DRAM_CHANNEL_7 STAT    |   DFC3  |             0 |     inf    |           0 |            0 |
+---------------------------+---------+---------------+------------+-------------+--------------+
+-------------------------------------------------+-------------+-----------+-----------+-----------+
|                      Metric                     |     Sum     |    Min    |    Max    |    Avg    |
+-------------------------------------------------+-------------+-----------+-----------+-----------+
|             Runtime (RDTSC) [s] STAT            |    737.0496 |    3.8388 |    3.8388 |    3.8388 |
|            Runtime unhalted [s] STAT            |    834.0599 |    2.8190 |    5.8493 |    4.3441 |
|                 Clock [MHz] STAT                | 708884.2274 | 3689.5780 | 3694.0565 | 3692.1054 |
|                     CPI STAT                    |   1226.1420 |    0.7612 |    9.6593 |    6.3862 |
| Memory bandwidth (channels 4-7) [MBytes/s] STAT |           0 |         0 |         0 |         0 |
| Memory data volume (channels 4-7) [GBytes] STAT |           0 |         0 |         0 |         0 |
+-------------------------------------------------+-------------+-----------+-----------+-----------+
INFO: You are running LIKWID in a cpuset with 192 CPUs. Taking given IDs as logical ID in cpuset
--------------------------------------------------------------------------------
CPU name:       AMD EPYC 9654 96-Core Processor
CPU type:       AMD K19 (Zen4) architecture
CPU clock:      2.40 GHz
--------------------------------------------------------------------------------

+---------------------------+---------+---------------+------------+-------------+--------------+
|           Event           | Counter |      Sum      |     Min    |     Max     |      Avg     |
+---------------------------+---------+---------------+------------+-------------+--------------+
|   ACTUAL_CPU_CLOCK STAT   |  FIXC1  | 2007244857500 | 6802115600 | 14075787486 | 1.045440e+10 |
|     MAX_CPU_CLOCK STAT    |  FIXC2  | 1302715101408 | 4412394336 |  9137566680 | 6.784974e+09 |
| RETIRED_INSTRUCTIONS STAT |   PMC0  |  319758910632 |  786951029 | 14250478646 | 1.665411e+09 |
|  CPU_CLOCKS_UNHALTED STAT |   PMC1  | 1983547352860 | 6693526027 | 13952660032 | 1.033098e+10 |
|    DRAM_CHANNEL_0 STAT    |   DFC0  |             0 |     inf    |           0 |            0 |
|    DRAM_CHANNEL_1 STAT    |   DFC1  |             0 |     inf    |           0 |            0 |
|    DRAM_CHANNEL_2 STAT    |   DFC2  |             0 |     inf    |           0 |            0 |
|    DRAM_CHANNEL_3 STAT    |   DFC3  |             0 |     inf    |           0 |            0 |
+---------------------------+---------+---------------+------------+-------------+--------------+


+-------------------------------------------------+-------------+-----------+-----------+-----------+
|                      Metric                     |     Sum     |    Min    |    Max    |    Avg    |
+-------------------------------------------------+-------------+-----------+-----------+-----------+
|             Runtime (RDTSC) [s] STAT            |    740.3136 |    3.8558 |    3.8558 |    3.8558 |
|            Runtime unhalted [s] STAT            |    837.6987 |    2.8388 |    5.8744 |    4.3630 |
|                 Clock [MHz] STAT                | 708920.2595 | 3690.0568 | 3694.0458 | 3692.2930 |
|                     CPI STAT                    |   1229.3118 |    0.7538 |    9.8461 |    6.4027 |
| Memory bandwidth (channels 0-3) [MBytes/s] STAT |           0 |         0 |         0 |         0 |
| Memory data volume (channels 0-3) [GBytes] STAT |           0 |         0 |         0 |         0 |
+-------------------------------------------------+-------------+-----------+-----------+-----------+
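The core-counter metrics in the tables above can be recomputed from the raw "Sum" values, which suggests that the core counters work and only the DFC counters return nothing. A minimal sketch, assuming the usual definitions (runtime unhalted = ACTUAL_CPU_CLOCK / base clock, clock = ACTUAL_CPU_CLOCK / MAX_CPU_CLOCK * base clock, CPI = cycles / instructions) and taking the base clock from the "CPU Clock" line of the likwid-bench output further down. The helper names are illustrative and not part of LIKWID.

```python
# Assumption: the "CPU Clock" value printed by likwid-bench is the base clock.
BASE_CLOCK_HZ = 2396160729


def runtime_unhalted_s(actual_cpu_clock: int, base_clock_hz: float = BASE_CLOCK_HZ) -> float:
    """Unhalted runtime in seconds (assumed: ACTUAL_CPU_CLOCK / base clock)."""
    return actual_cpu_clock / base_clock_hz


def clock_mhz(actual_cpu_clock: int, max_cpu_clock: int,
              base_clock_hz: float = BASE_CLOCK_HZ) -> float:
    """Effective clock in MHz (assumed: ratio of both clock counters times base clock)."""
    return 1.0e-6 * (actual_cpu_clock / max_cpu_clock) * base_clock_hz


def cpi(cycles: int, instructions: int) -> float:
    """Cycles per instruction."""
    return cycles / instructions


# "Sum" column of the MEM2 run (channels 4-7) shown above.
FIXC1 = 1998528125004  # ACTUAL_CPU_CLOCK
FIXC2 = 1297140105696  # MAX_CPU_CLOCK
PMC0 = 319022774933    # RETIRED_INSTRUCTIONS
PMC1 = 1975117444353   # CPU_CLOCKS_UNHALTED

if __name__ == "__main__":
    print(f"Runtime unhalted (sum) [s]: {runtime_unhalted_s(FIXC1):.4f}")
    print(f"Clock [MHz]:                {clock_mhz(FIXC1, FIXC2):.4f}")
    print(f"CPI (aggregate):            {cpi(PMC1, PMC0):.4f}")
```

The runtime and clock come out close to the reported 834.06 s (Sum) and roughly 3692 MHz (Avg). The aggregate CPI is not expected to match the Avg column exactly, because that column averages the per-thread ratios.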
likwid-perfctr -a
Group name      Description
--------------------------------------------------------------------------------
  BRANCH        Branch prediction miss rate/ratio
   CACHE        Data cache miss rate/ratio
   CLOCK        Cycles per instruction
     CPI        Cycles per instruction
    DATA        Load to store ratio
  DIVIDE        Divide unit information
  ENERGY        Power and Energy consumption
FLOPS_DP        Double Precision MFLOP/s
FLOPS_SP        Single Precision MFLOP/s
  ICACHE        Instruction cache miss rate/ratio
      L2        L2 cache bandwidth in MBytes/s (experimental)
 L2CACHE        L2 cache miss rate/ratio (experimental)
      L3        L3 cache bandwidth in MBytes/s
 L3CACHE        L3 cache miss rate/ratio (experimental)
    MEM1        Main memory bandwidth in MBytes/s (channels 0-3)
    MEM2        Main memory bandwidth in MBytes/s (channels 4-7)
    NUMA        L2 cache bandwidth in MBytes/s (experimental)
     TLB        TLB miss rate/ratio
$ size=$((100*1024));srun --nodes=1 --cpus-per-task=192 --threads-per-core=1  -t 1:00:00 --hint=nomultithread likwid-bench -t load_avx -W N:${size}kB:128
Cycles:                 5695362048
CPU Clock:              2396160729
Cycle Clock:            2396160729
Time:                   2.376870e+00 sec
Iterations:             33554432
Iterations per thread:  262144
Inner loop executions:  6250
Size (Byte):            102400000
Size per thread:        800000
Number of Flops:        0
MFlops/s:               0.00
Data volume (Byte):     26843545600000
MByte/s:                11293654.25
Cycles per update:      0.001697
Cycles per cacheline:   0.013579
Loads per update:       1
Stores per update:      0
Load bytes per element: 8
Store bytes per elem.:  0
Instructions:           1468006400016
UOPs:                   1258291200000
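As a sanity check on the likwid-bench figures above, the reported MByte/s can be recomputed from the data volume and the runtime. A minimal sketch, assuming MByte/s means 10^6 bytes per second; the function name is illustrative and not part of LIKWID.

```python
def bandwidth_mbytes_per_s(data_volume_bytes: int, time_s: float) -> float:
    """Return bandwidth in MByte/s, assuming 1 MByte = 10**6 bytes."""
    if time_s <= 0:
        raise ValueError("time_s must be positive")
    return data_volume_bytes / time_s / 1.0e6


# Values copied from the likwid-bench output above.
DATA_VOLUME = 26843545600000    # "Data volume (Byte)"
RUNTIME_S = 2.376870            # "Time"
THREADS = 128                   # from the workgroup N:${size}kB:128
SIZE_PER_THREAD = 800000        # "Size per thread"
ITERATIONS_PER_THREAD = 262144  # "Iterations per thread"

# The data volume equals the per-thread working set, read once per
# iteration, summed over all threads.
assert SIZE_PER_THREAD * ITERATIONS_PER_THREAD * THREADS == DATA_VOLUME

if __name__ == "__main__":
    print(f"{bandwidth_mbytes_per_s(DATA_VOLUME, RUNTIME_S):.2f} MByte/s")
```

Each thread works on only 800000 bytes here, so this particular run is probably served largely from cache, and the figure of about 11 TByte/s likely reflects cache bandwidth rather than DRAM bandwidth.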
@TomTheBear
Member

I assume you are using the perf_event backend and suspect that /proc/sys/kernel/perf_event_paranoid is set too high. It has to be zero to get data from the Uncore devices. Run with -V 1 and there should be a message.

@ziyht

ziyht commented Mar 18, 2024

The above situation also occurs on AMD 9554.

(Note: I output summed statistics, so the runtime is reported as 384.)

Runtime (RDTSC) [s]:  384.015717
Runtime unhalted [s]:  0.058734
Clock [MHz]:  199629.250000
CPI:  nan
Memory bandwidth (channels 0-3) [MBytes/s]:  0.000000
Memory data volume (channels 0-3) [GBytes]:  0.000000
----------------------------
Runtime (RDTSC) [s]:  384.044250
Runtime unhalted [s]:  0.041370
Clock [MHz]:  186808.812500
CPI:  nan
Memory bandwidth (channels 0-3) [MBytes/s]:  0.000000
Memory data volume (channels 0-3) [GBytes]:  0.000000
----------------------------
Runtime (RDTSC) [s]:  384.012970
Runtime unhalted [s]:  0.113669
Clock [MHz]:  187998.656250
CPI:  nan
Memory bandwidth (channels 0-3) [MBytes/s]:  0.000000
Memory data volume (channels 0-3) [GBytes]:  0.000000
----------------------------
Runtime (RDTSC) [s]:  384.045624
Runtime unhalted [s]:  0.561691
Clock [MHz]:  191052.828125
CPI:  nan
Memory bandwidth (channels 0-3) [MBytes/s]:  0.000000
Memory data volume (channels 0-3) [GBytes]:  0.000000

@ziyht

ziyht commented Mar 18, 2024

I assume perf_event backend and suspect a too high setting in /proc/sys/kernel/perf_event_paranoid. It has to be zero to get data from the Uncore devices. Run with -V 1 and there should be a message.

I attempted this, but it seems to have been ineffective.

@TomTheBear
Member

What has been ineffective? Setting the value to zero or getting messages?

LIKWID with perf_event backend requires the unit amd_df to be present (/sys/devices/amd_df). If this device does not exist, there is no chance to get the memory traffic through perf_event and consequently LIKWID. You need a newer or patched kernel.
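The two prerequisites named so far (perf_event_paranoid at zero and the amd_df device being present) can be checked before starting a measurement. A minimal sketch; the function name and the `root` parameter (which only exists to make the check testable against a fake directory tree) are assumptions for illustration and not part of LIKWID.

```python
import os
from typing import List


def check_perf_event_uncore(root: str = "/") -> List[str]:
    """Return a list of problems; an empty list means both checks passed."""
    problems = []

    paranoid_path = os.path.join(root, "proc/sys/kernel/perf_event_paranoid")
    try:
        with open(paranoid_path) as f:
            level = int(f.read().strip())
        # Assumption: values below zero are at least as permissive as zero.
        if level > 0:
            problems.append(f"perf_event_paranoid is {level}, expected 0")
    except (OSError, ValueError) as exc:
        problems.append(f"cannot read {paranoid_path}: {exc}")

    df_path = os.path.join(root, "sys/devices/amd_df")
    if not os.path.isdir(df_path):
        problems.append(f"{df_path} is missing (newer or patched kernel needed)")

    return problems


if __name__ == "__main__":
    found = check_perf_event_uncore()
    for problem in found:
        print("PROBLEM:", problem)
    if not found:
        print("perf_event prerequisites for amd_df look fine")
```

Passing both checks is necessary but, as the later comments in this thread show, not sufficient: the counters can still be rejected inside LIKWID itself.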

@marquis-wang

marquis-wang commented Mar 22, 2024

I encountered the same problem.
OS: Rocky Linux 8.6
Kernel version: 4.18.0-372.9.1.el8.x86_64
/proc/sys/kernel/perf_event_paranoid is 0
/sys/devices/amd_df and /sys/devices/amd_l3 both exist

[root@localhost bin]# grep -i perf_event /boot/config-4.18.0-372.9.1.el8.x86_64
CONFIG_HAVE_PERF_EVENTS=y
CONFIG_PERF_EVENTS=y
CONFIG_HAVE_PERF_EVENTS_NMI=y
CONFIG_PERF_EVENTS_INTEL_UNCORE=m
CONFIG_PERF_EVENTS_INTEL_RAPL=m
CONFIG_PERF_EVENTS_INTEL_CSTATE=m
CONFIG_PERF_EVENTS_AMD_POWER=m

[root@localhost bin]# likwid-perfctr -f -V 1 -g MEM2 /home/pcadmin/stream

CPU name: AMD EPYC 9554 64-Core Processor
CPU type: AMD K19 (Zen4) architecture
CPU clock: 3.10 GHz
CPU family: 25
CPU model: 17
CPU short: zen4
CPU stepping: 1
CPU features: FP MMX SSE SSE2 HTT MMX RDTSCP MONITOR SSSE FMA SSE4.1 SSE4.2 AES AVX RDRAND AVX2 AVX512 RDSEED SSE3
CPU arch: x86_64

DEBUG - [access_client_startDaemon:157] Starting daemon /usr/local/sbin/likwid-accessD
DEBUG - [access_client_startDaemon:235] Successfully opened socket /tmp/likwid-83685 to daemon for CPU 127
Executing: /home/pcadmin/stream
DEBUG - [perfmon_addEventSet:2328] Currently 1 groups of 2 active
DEBUG - [perfgroup_readGroup:873] Reading group MEM2 from /usr/local/share/likwid/perfgroups/zen4/MEM2.txt
DEBUG - [perfmon_addEventSet:2514] Added event ACTUAL_CPU_CLOCK for counter FIXC1 to group 0
DEBUG - [perfmon_addEventSet:2514] Added event MAX_CPU_CLOCK for counter FIXC2 to group 0
DEBUG - [perfmon_addEventSet:2514] Added event RETIRED_INSTRUCTIONS for counter PMC0 to group 0
DEBUG - [perfmon_addEventSet:2514] Added event CPU_CLOCKS_UNHALTED for counter PMC1 to group 0
DEBUG - [checkAccess:237] WARNING: Counter DFC0 does not exist
DEBUG - [perfmon_addEventSet:2437] Cannot access counter register DFC0
DEBUG - [checkAccess:237] WARNING: Counter DFC1 does not exist
DEBUG - [perfmon_addEventSet:2437] Cannot access counter register DFC1
DEBUG - [checkAccess:237] WARNING: Counter DFC2 does not exist
DEBUG - [perfmon_addEventSet:2437] Cannot access counter register DFC2
DEBUG - [checkAccess:237] WARNING: Counter DFC3 does not exist
DEBUG - [perfmon_addEventSet:2437] Cannot access counter register DFC3

@marquis-wang

Zen4 CPUs have 12 memory channels (https://www.amd.com/en/products/cpu/amd-epyc-9554), so why does the LIKWID library only support 8 memory channels for perfmon data?

@marquis-wang

marquis-wang commented Mar 25, 2024

I may have found the reason for the "Counter DFC0 does not exist" WARNING messages: the zen4_counter_map struct in src/include/perfmon_zen4_counters.h is missing the index "PMC17".

@TomTheBear
Member

@marquis-wang Yes, you found it. I fixed it last night. Please test it: 7027aa6

Yes, it should be 12 channels. I will add the additional memory channels to the branch.

@marquis-wang

marquis-wang commented Mar 26, 2024

@TomTheBear Great! I tested branch amd_zen4 (44cf4ca) and it works well.

@TomTheBear
Member

It works but it is not done. I made some major updates to the branch yesterday, but the branch cannot be merged, so I will create a new one with only the fixes.

The events currently configured in MEM1 and MEM2 do not exist for Zen4 anymore, so it is unclear whether they actually count memory traffic. The updated version will no longer have MEM1 and MEM2 but MEMREAD and MEMWRITE instead, and it will use the officially documented metrics for memory traffic.

@marquis-wang

marquis-wang commented Mar 29, 2024

I want to use the LIKWID library to develop collection tools for our Zen4 cluster. The memory bandwidth data from 7027aa6 is missing 4 memory channels. I saw that the newest commit (44cf4ca) adds all channels, so I tested it: I compared likwid-perfctr's output (MEMREAD and MEMWRITE) with STREAM's output, and the results do not differ much. In the official documentation (AMD PPR Family 19h), I found a new event (DATA_BW) that may be helpful for monitoring memory bandwidth. I will test this event.

@TomTheBear
Member

I'm glad that it works for you now. Please be careful with the PPRs: you have to use the one for the right family and model. AMD Family 19h Model 11h should be the right one. Its third document describes a DATA_BW event, but that is just the detailed explanation/breakdown of the events already documented in #618. Unfortunately, even with these details, it is impossible to perform read and write measurements in one go.
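Since reads and writes cannot be measured in one go, one option is to run the application twice, once per group, and add the two bandwidths afterwards. A minimal sketch that only builds the two command lines: the group names MEMREAD and MEMWRITE come from this thread, the -C and -g options are the ones used in the original report, and the helper name and example arguments are illustrative.

```python
import shlex
from typing import List, Sequence


def build_perfctr_commands(cpus: str, app: Sequence[str],
                           groups: Sequence[str] = ("MEMREAD", "MEMWRITE")) -> List[List[str]]:
    """Return one likwid-perfctr command line per performance group."""
    return [["likwid-perfctr", "-C", cpus, "-g", group, *app] for group in groups]


if __name__ == "__main__":
    # Application and arguments taken from the original report.
    app = ["./a.out", "512", "512", "512", "201", "2504"]
    for cmd in build_perfctr_commands("0-191", app):
        # Only print the commands; each list could be passed to subprocess.run.
        print(shlex.join(cmd))
```

The total memory traffic is then the sum of the read and write results from the two runs, under the assumption that the application behaves the same in both.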

@TomTheBear
Member

The UMC performance counters would be of interest to count at the memory controller instead of the DataFabric but they seem quite complicated to add. There is already infrastructure for MMIO based counters but some effort would be required. Unfortunately, they are never exposed by perf_event, so they can be added for accessdaemon/direct only.
