

# Conquering Noise with Hardware Counters on HPC Systems

Marcus Ritter<sup>1</sup>, Ahmad Tarraf<sup>1</sup>, Alexander Geiß<sup>1</sup>, Nour Daoud<sup>2</sup>, Bernd Mohr<sup>2</sup>, **Felix Wolf<sup>1</sup>**

<sup>1</sup>Technical University of Darmstadt

<sup>2</sup>Forschungszentrum Jülich GmbH



# Motivation

Performance and complexity of HPC systems are constantly increasing  
→ Important to examine the scaling behavior of an application and identify **performance bottlenecks** early  
→ Use **empirical performance modeling** (e.g., Extra-P)

## Problem:

In **noisy** environments → difficult to create accurate performance models

- Strong run-to-run variations of the underlying measurements
- Irreproducible and misleading
- Deviation from *intrinsic* application behavior

# Empirical performance modeling



Performance measurements  
with different execution  
parameters  $x_1, \dots, x_n$



$$t = f(x_1, \dots, x_n)$$

Alternative metrics:  
**Hard- and software counters**

# Motivation

## Problem (cont.):

- Application runtime affected by noise
- Most common performance metric



## Solution:

- Use hardware counters
- Noise has little impact on some **hardware counters**
  - E.g., floating-point operations
  - Selecting the right counters requires a thorough analysis
  - Once found, the counters can be used for performance modeling

# Contributions



TECHNISCHE  
UNIVERSITÄT  
DARMSTADT

Parallel  
Programming

A detailed noise analysis on various hardware counters on different systems:

→ Total of **26950** experiments (**PAPI preset** events only):

 **Five** systems

 **Four** hardware architectures

 **Three** applications

 **With and without** injected noise

 **Multiple** resource configurations  
(number of nodes)

 **Five** repetitions per setup

**Categorized** the counters across the different systems according to their noise resilience  
and provided a **user guide**

# Analysis Methodology



Find noise-resilient hardware counters:

- Examine if counters' values change when repeating the measurements
- Expose the counters to different levels of noise
  - Using NOIGENA (NOIse GENerator Application) developed in Jülich
  - NOIGENA processes were running on the odd processors
  - Inject different noise patterns using NOIGENA

# NOIse GENerator Application



TECHNISCHE  
UNIVERSITÄT  
DARMSTADT



- Produces shared-resource contention
- Uses several benchmarks
  - Memory: Stream
  - Network: FzjLinkTest
  - I/O: IOR
- Consecutively runs configurable patterns

```
PATTERN_1:  
Sequence:  
  - REPEATED_NOISE:  
    REPEAT: inf  
    Sequence:  
      - NETWORK_NOISE: 2  
      - NO_NOISE: 2  
      - MEMORY_NOISE: 2  
      - NO_NOISE: 2
```

Noise pattern used by NOIGENA to configure the amount and duration of generated noise.

# Analysis Methodology

- Instrument all call paths of the applications with **Score-P** using **PAPI**
- We do the analysis **for each counter, on all systems, with and without injected noise** to get the Score-P measurements for the:
  - $a^{\text{th}}$  application kernel (call path),
  - $p^{\text{th}}$  MPI rank,
  - $t^{\text{th}}$  OpenMP thread, and
  - $i^{\text{th}}$  repetition:

a counter value  $v_{a,p,t,i}$



# Analysis Methodology

Compare counter values across the **repeated experiments** for each  $a^{\text{th}}$  application kernel (call path),  $p^{\text{th}}$  MPI rank, and  $t^{\text{th}}$  OpenMP thread

- Find the **arithmetic mean** across the repetitions:

$$\bar{v}_{a,p,t} = \text{mean}(v_{a,p,t,i})$$

- Calculate the **relative deviation from the arithmetic mean** in percent:

$$\frac{|v_{a,p,t,i} - \bar{v}_{a,p,t}|}{\bar{v}_{a,p,t}} * 100\%$$

→ Use this metric to compare the counter results

# Application Benchmarks



TECHNISCHE  
UNIVERSITÄT  
DARMSTADT



## MiniFE

- Mini-app that mimics: finite element generation, assembly, and solution
- Unstructured grid problem, required by many engineering applications

## LULESH

- Proxy app solves a Sedov blast problem with analytic answers
- Represents algorithms, data motion, and programming style typical in scientific applications

## LAMMPS

- Classical molecular dynamics code focusing on materials modeling
- We use the atomic fluid, Lennard-Jones (LJ) potential

For all of them, we used OpenMP and MPI for the measurements

# Evaluation Systems



TECHNISCHE  
UNIVERSITÄT  
DARMSTADT



| Alias   | Name                                       | Nodes                   | Processor                                                              | RAM                           | Network                                                     |
|---------|--------------------------------------------|-------------------------|------------------------------------------------------------------------|-------------------------------|-------------------------------------------------------------|
| CM      | DEEP-EST,<br>Cluster Module                | 50                      | 2x Intel Xeon Skylake Gold<br>6146 CPUs<br>(12 cores, 24 threads)      | 192 GB DDR4 RAM<br>(2666 MHz) | InfiniBand EDR (100 GBit/s)                                 |
| ESB     | DEEP-EST,<br>Extreme Scale<br>Booster      | 75                      | 1x Intel Xeon Cascade Lake<br>Silver 4215 CPU<br>(8 cores, 16 threads) | 48 GB DDR4 RAM<br>(2400 MHz), | InfiniBand EDR (100 GBit/s)                                 |
| Jureca  | JURECA DC<br>Module, std.<br>compute nodes | 480                     | 2x AMD EPYC 7742 CPUs<br>(64 cores, 128 threads)                       | 512 GB DDR4 RAM<br>(3200 MHz) | InfiniBand HDR100 (100 GBit/s)                              |
| Jetson  | OACISS,<br>Franken-cluster<br>Jetson ARM64 | 12x Jetson<br>Tegra TX1 | 1x Quad-Core ARM Cortex®-<br>A57 MPCore<br>(4 cores, 4 threads)        | 4 GB 64-bit<br>LPDDR4 RAM     | 1 GBit/s ethernet                                           |
| Cyclops | OACISS,<br>Franken-cluster<br>Cyclops      | 1                       | 2x 20c IBM Power9 CPUs<br>(20 cores, 80 threads)                       | 384 GB of RAM                 | BNX2 10G Ethernet NICs, 2x<br>Infiniband EDR<br>(25 Gbit/s) |

# Application Configurations

| App    | System  | Experiment configuration                                           |
|--------|---------|--------------------------------------------------------------------|
| MiniFE | CM      | $n = [1, 2, 4, 8]$ , $p = n$ , $t = 12p$ , $s = 20^3 p$            |
|        | ESB     | $n = [1, 2, 4, 8, 12, 16, 32]$ , $p = n$ , $t = 8p$ , $s = 50^3 p$ |
|        | Jureca  | $n = [4, 8, 12, 16, 20, 24]$ , $p = n$ , $t = 128p$ , $s = 20^3 p$ |
|        | Jetson  | $n = [2, 3, 4, 5, 6]$ , $p = n$ , $t = 2p$ , $s = 50^3 p$          |
|        | Cyclops | $n = 1$ , $p = [4, 8, 12, 16, 20]$ , $t = 4p$ , $s \approx 24^3 p$ |
| LULESH | CM      | $n = [1, 8, 27]$ , $p = n$ , $t = 12p$ , $s = 10^3 p$              |
|        | ESB     | $n = [1, 8, 27]$ , $p = n$ , $t = 8p$ , $s = 30^3 p$               |
|        | Jureca  | $n = [1, 8, 27]$ , $p = n$ , $t = 128p$ , $s = 10^3 p$             |
|        | Jetson  | $n = 8$ , $p = n$ , $t = 2p$ , $s = 15^3 p$                        |
|        | Cyclops | $n = 1$ , $p = [1, 8, 27, 64]$ , $t = p$ , $s = 5^3 p$             |
| LAMMPS | CM      | $n = [1, 2, 4, 8]$ , $p = n$ , $t = 12p$ , $s = 20^3 p$            |
|        | ESB     | $n = [1, 2, 4, 8, 12, 16, 32]$ , $p = n$ , $t = 8p$ , $s = 20^3 p$ |
|        | Jureca  | $n = [1, 2, 4, 8, 16]$ , $p = n$ , $t = 128p$ , $s = 20^3 p$       |
|        | Jetson  | $n = [2, 3, 4, 5, 6]$ , $p = n$ , $t = 2p$ , $s = 20^3 p$          |
|        | Cyclops | $n = 1$ , $p = [4, 8, 12, 16, 20]$ , $t = 4p$ , $s = 20^3 p$       |

s: problem size  
n: number of computational nodes  
p: the number of MPI processes  
t: the number of OpenMP thread

# Evaluation Results

Visualizing the results for **each counter** is not an easy task:

- Three different application
  - Thousands of different call paths
  - Different resource configurations
  - Several repetitions per setup
  - Five different systems
  - Several hardware counters
  - Presence and absence of noise
- 
- Count relative metric (deviation from the mean)
  - Intensity (importance to the overall behavior)
  - Distinct plots
  - Separate axes

To compare distinct counter → Scale the plots with the peak occurrence of the relative deviation

# Evaluation Results

- Height = how often the corresponding relative deviation occurs
- Intensity = share of the total counter value for that call path (importance)
- Only call paths with > 1% of the total counter value are visible



# Evaluation Results: How to Interpret the Results

## Bad Counters [X]

Counter strongly influenced by noise:

- Large distribution in the presence of noise, small one in the absence



Counter with large deviation:

- Large distribution disregarding the noise
- Not suited for modeling



# Evaluation Results: How to Interpret the Results

## Good Counters [+]

Counter robust against noise:

- Similar distribution in the presence and absence of noise
- Small deviation ( < 20%)



## OK Counters [0]

Counter to some extent robust against noise:

- Small deviation (< 20%) without noise
- Small deviation (< 20%) in the presence of noise for **significant call paths**



# Evaluation Results: Runtime



# Evaluation Results: Floating Point Operations



# Evaluation Results: Instructions, Cycles



# Evaluation Results: Instructions, Cycles



# Evaluation Results: L1, L2 Instruction Cache



# Evaluation Results: Stalls & Reference Cycles



# Evaluation Results: L2, L3 Cache



# Impact of the Application: Time and Load Inst.



LAMMPS



LULESH



MiniFE



# Best Practice User Guide

|                            | Jureca<br>(AMD) | ESB (Intel) | CM (Intel) | Jetson<br>(ARM) | Cyclops<br>(IBM) |
|----------------------------|-----------------|-------------|------------|-----------------|------------------|
| Floating point ops./instr. | ++              | ++          | ++         | ++              | ++               |
| Cycles                     | X               | +           | +          | 0               | 0                |
| Instructions               | X               | +           | +          | +               | +                |
| L1                         | +               | X           | X          | 0               | +                |
| L2                         | +               | 0           | X          | 0               | +                |
| L3                         |                 | X           | X          |                 | X                |

| Legend |           |
|--------|-----------|
| ++     | very good |
| +      | good      |
| 0      | ok        |
| X      | bad       |

# Conclusion

- Examined noise resilience of hardware counters on five systems with different architectures (Intel, AMD, ARM, IBM Power9)
- Analyzed all available presets and a selection of native events
- Most hardware counters are affected by noise, but still less than the runtime
- Floating-point operations or instructions are noise resilient on all systems
- Variability significantly depends on the system architecture
- Best practice guide helps choose suitable counters for given system



Marcus Ritter, Ahmad Tarraf, Alexander Geiß, Nour Daoud, Bernd Mohr, Felix Wolf: Conquering Noise With Hardware Counters on HPC Systems. In *Proc. of the Workshop on Programming and Performance Visualization Tools (ProTools), held in conjunction with the Supercomputing Conference (SC22)*, pages 1–10, IEEE, 2022.

# Future Work

- Create a tool that performs the analysis and evaluation on the system
- Examine more counters (i.e., native counters)
- Examine the correlation between the counters
- Identify the source of variation
- Started working with the PAPI developers

# Acknowledgment



Hessisches Ministerium  
für Wissenschaft und Kunst



EuroHPC  
Joint Undertaking



Federal Ministry  
of Education  
and Research



# Questions?

**Thank you for your attention!**