



# Automatic Loop Kernel Analysis and Performance Modeling with Kerncraft

Julian Hammer  
Georg Hager  
Jan Eitzinger  
Gerhard Wellein

# Overview

1. Loop Kernels
2. Roofline and ECM
3. Kerncraft
  1. Overview and Structure
  2. Output and Results
4. 3D-long-range Example



# LOOP KERNELS



Automatic **Loop Kernel** Analysis and  
Performance Modeling with Kerncraft

# Loop Kernels

```
double a[5000], b[5000];
double s;

for(i=0; i<5000; ++i)
    a[i] = s * b[i];
```

```
double a[5000][5000];
double b[5000][5000];
double s;

for(j=1; j<5000-1; ++j)
    for(i=1; i<5000-1; ++i)
        b[j][i] = ( a[j][i-1] + a[j][i+1]
                    + a[j-1][i] + a[j+1][i] )
                    * s;
```

- Many inner-loop iterations
- No branching
- Access fully determined by loop counters (i.e., no irregularities)

# Loop Kernels

```
double a[5000], b[5000];
double s;

for(i=0; i<5000; ++i)
    a[i] = s * b[i];
```



```
double a[5000][5000];
double b[5000][5000];
double s;

for(j=1; j<5000-1; ++j)
    for(i=1; i<5000-1; ++i)
        b[j][i] = ( a[j][i-1] + a[j][i+1]
                    + a[j-1][i] + a[j+1][i] )
                    * s;
```



## Streaming Kernel

- Simple structure
- No data-reuse

## Stencil Code

- Complex Structure
- Heavy data-reuse

# Loop Kernels

How to predict performance on complex architectures?

Two major contributions/bottlenecks:



# Potential Benefits

- Guided optimizations
  - › Which optimization strategy to pursue?
  - › What gain is to be expected?
- Guiding decisions for or against a specific architecture
  - › What can be expected on new hardware?
  - › Enabling hardware/software co-design
- Energy optimized computing
  - › Which configuration yields the best energy to solution?
- Deeper understanding of code and hardware interactions

# ROOFLINE AND ECM



Automatic Loop Kernel Analysis and  
**Performance Modeling** with Kerncraft

# Roofline



- All memory levels are separate bottlenecks

$$P = \min(P_{comp.}, I \cdot b_s)$$

|             |                                   |
|-------------|-----------------------------------|
| $P_{comp.}$ | Peak performance<br>[FLOP/s]      |
| $I$         | Operational Intensity<br>[FLOP/B] |
| $b_s$       | Peak bandwidth<br>[B/s]           |

- Bandwidths are measured by suitable benchmarks

# Roofline: Performance vs Time



- CPU frequency?  
→ Cycles!
- Basic memory units?  
→ Cache Lines! (64 Byte)

# Roofline: Performance vs Time



- CPU frequency?
    - Cycles!
  - Basic memory units?
    - Cache Lines! (1 CL=64 B)
    - 1 unit of work = 1 CL
- Cycle / Cache Line (cy/CL)
- Lower is better

# Execution-Cache-Memory (ECM) Model

- Memory and cache levels contribute to runtime

$T_{OL}$  computation & stores

$T_{nOL}$  loads from L1

$T_{L1-L2}$  loads from L2 into L1

$T_{L2-L3}$  loads from L3 into L2

$T_{L3-MEM}$  loads from main memory into L3

{  $T_{OL} \parallel T_{nOL} \mid T_{L1-L2} \mid T_{L2-L3} \mid T_{L3-MEM}$  }

- One measured input:  
full-socket mem. bandwidth



# Roofline and ECM



# Performance Modeling

```
double a[5000], b[5000];
double s;

for(i=0; i<5000; ++i)
    a[i] = s * b[i];
```

```
double a[5000][5000];
double b[5000][5000];
double s;

for(j=1; j<5000-1; ++j)
    for(i=1; i<5000-1; ++i)
        b[j][i] = ( a[j][i-1] + a[j][i+1]
                    + a[j-1][i] + a[j+1][i] )
                    * s;
```

## STREAM Scale

- For each cache line (8 it.):
  - › 1 CL is stored
  - › 8 FLOP
  - › 1 CL are loaded

## 2D 5-point Stencil

- For each cache line (8 it.):
  - › 1 CL is stored
  - › 32 FLOP
  - › 1-3 CL are loaded

Why 1-3?

# Layer Conditions



1D layer condition:  $\text{stencil-width} * \text{stencil-height} < \text{cache-size}$

2D layer condition:  $\text{stencil-height} * \text{matrix-width} < \text{cache-size}$

# Performance Modeling

```
double U[M][N][N];
double V[M][N][N];
double ROC[M][N][N];
double c0, c1, c2, c3, c4, lap;

for(int k=4; k < M-4; k++) {
    for(int j=4; j < N-4; j++) {
        for(int i=4; i < N-4; i++) {
            lap = c0 * V[k][j][i]
                + c1 * ( V[ k ][ j ][i+1] + V[ k ][ j ][i-1] )
                + c1 * ( V[ k ][j+1][ i ] + V[ k ][j-1][ i ] )
                + c1 * ( V[ k+1 ][ j ][ i ] + V[ k-1 ][ j ][ i ] )
                + c2 * ( V[ k ][ j ][i+2] + V[ k ][ j ][i-2] )
                + c2 * ( V[ k ][j+2][ i ] + V[ k ][j-2][ i ] )
                + c2 * ( V[ k+2 ][ j ][ i ] + V[ k-2 ][ j ][ i ] )
                + c3 * ( V[ k ][ j ][i+3] + V[ k ][ j ][i-3] )
                + c3 * ( V[ k ][j+3][ i ] + V[ k ][j-3][ i ] )
                + c3 * ( V[ k+3 ][ j ][ i ] + V[ k-3 ][ j ][ i ] )
                + c4 * ( V[ k ][ j ][i+4] + V[ k ][ j ][i-4] )
                + c4 * ( V[ k ][j+4][ i ] + V[ k ][j-4][ i ] )
                + c4 * ( V[ k+4 ][ j ][ i ] + V[ k-4 ][ j ][ i ] );
            U[k][j][i] = 2.f * V[k][j][i] - U[k][j][i]
                + ROC[k][j][i] * lap;
        }
    }
}
```



# KERN CRAFT



## Automatic Loop Kernel Analysis and Performance Modeling **with Kerncraft**

# Kerncraft



# Kerncraft – Output

ECM model: {  $T_{OL} \parallel T_{nOL} | T_{L1-L2} | T_{L2-L3} | T_{L3-MEM}$  }

```
$ kerncraft --machine snb.yaml 2d-5pt.c --pmodel ECM -D N 5000 -D M 500
=====
kernels/2d-5pt.c
=====
{ 9.0 || 8.0 | 10 | 6 | 12.74 } = 36.74 cy/CL
{ 9.0 \ 18.00 \ 24.00 \ 36.74 } cy/CL
saturating at 3 cores
$ 
$ kerncraft --machine snb.yaml 2d-5pt.c --pmodel Roofline --unit cy/CL -D N 5000 -D M 500
=====
kernels/2d-5pt.c
=====
Cache or mem bound with 1 core(s)
29.79 cy/CL due to L3-MEM transfer bottleneck (bw from copy benchmark)
Arithmetic Intensity: 0.17 FLOP/b
$
```

# Kerncraft – Results

2d-5pt.c in memory on single Intel Xeon E5-2680 (SandyBridge) core



# 3D-LONG-RANGE EXAMPLE



```
double U[M][N][N];
double V[M][N][N];
double ROC[M][N][N];
double c0, c1, c2, c3, c4, lap;

for(int k=4; k < M-4; k++) {
    for(int j=4; j < N-4; j++) {
        for(int i=4; i < N-4; i++) {
            lap = c0 * V[k][j][i]
                + c1 * ( V[ k ][ j ][i+1] + V[ k ][ j ][i-1] )
                + c1 * ( V[ k ][j+1][ i ] + V[ k ][j-1][ i ] )
                + c1 * ( V[k+1][ j ][ i ] + V[k-1][ j ][ i ] )
                + c2 * ( V[ k ][ j ][i+2] + V[ k ][ j ][i-2] )
                + c2 * ( V[ k ][j+2][ i ] + V[ k ][j-2][ i ] )
                + c2 * ( V[k+2][ j ][ i ] + V[k-2][ j ][ i ] )
                + c3 * ( V[ k ][ j ][i+3] + V[ k ][ j ][i-3] )
                + c3 * ( V[ k ][j+3][ i ] + V[ k ][j-3][ i ] )
                + c3 * ( V[k+3][ j ][ i ] + V[k-3][ j ][ i ] )
                + c4 * ( V[ k ][ j ][i+4] + V[ k ][ j ][i-4] )
                + c4 * ( V[ k ][j+4][ i ] + V[ k ][j-4][ i ] )
                + c4 * ( V[k+4][ j ][ i ] + V[k-4][ j ][ i ] );
            U[k][j][i] = 2.f * V[k][j][i] - U[k][j][i]
                + ROC[k][j][i] * lap;
        }
    }
}
```

# 3D-long-range Example



# 3D-long-range Example

3d-long-range-stencil.c in memory on single Intel Xeon E5-2680 (SandyBridge) core



# 3D-long-range Example



# Benefits

- Guiding optimizations
- Hardware-software co-design
- Energy optimized computing
- Deeper understanding of code and hardware interactions

## Kerncraft...

- is a white-box utility
- takes some of the pain out of performance modeling
- is free (as in free beer and freedom)
- is NOT for inexperienced programmers
- is NOT a fully-automated jack-of-all-trades yielding better performance

# Open Source



# Further Work

- Support for other compilers than ICC (done)
- Lift some code restrictions (partially done)
- Replacement for IACA (under investigation)
- Support for non-Intel Architectures

# ERLANGEN REGIONAL COMPUTING CENTER [RRZE]



## Thank You for Your Attention!

Julian Hammer <[julian.hammer@fau.de](mailto:julian.hammer@fau.de)>

RRZE High Performance Computing Group

<http://www.rrze.fau.de/hpc>