

Wednesday 14<sup>th</sup> October 2015

## Performance Analysis & Optimization.

- Ultimate goal:

Answer questions: 1) How fast does my code run?

→ Performance measurement.

2) How fast can my code run?

→ Performance analysis / modeling.

3) How can we achieve 2)

→ Optimization.

- Sessions

Wed. 09:00 - 10:00 This lecture.

10:30 - 12:00 Exercises.

Thu. 08:30 - 10:30 Cont. Exercises from Wed.

Interleaved lecture/exercise.

## 1) How fast does my code run?

- Flop-rate
- Bandwidth
- Wall-time

→ Measure time from beginning to end.

time ./a.out.

- Depending on the os, 'time' gives user/system/cpu.
- No fine-grained info.
- Inconvenient ... - not portable.

→ Measure time using a system function.

- Need "wall time": not cpu-ticks, not program time.
- gettimeofday()

Returns time elapsed since epoch (Jan 1<sup>st</sup>, 1970 for most Unix Systems)

Returns two integers: tv\_sec (seconds)

tv\_usec (useconds)

$$\Rightarrow \text{Seconds} = \text{tv\_sec} + \text{tv\_usec} \cdot 10^{-6}$$

- Can wrap around my kernels for fine-grained info.
- Even though it returns usec, don't assume error is  $O(\mu s)$ .
- O.S. "jitters"  $\Rightarrow$  runtime fluctuates.
- Take a sufficient number of measurements

- Eg.

```
for i=1,...,Nouter
    t0 = Stop-watch()
    for j=1,...,N.inner
        | my-kernel()
    end for
    t1 = Stop-watch()
    T[i] = (t1-t0)/Ninner;
end for.
```

Take average and error:

$$\bar{T} = \frac{1}{Nouter} \sum_i T_i \quad \sigma_T = \sqrt{\frac{(T_i - \bar{T})^2}{Nouter}}$$

: Increase Ninner until  $\frac{\sigma_T}{\bar{T}}$  acceptable (e.g. 5%)

- Measure performance

- Sustained Floating-point operations per second:  
 ————— " ————— Bytes -read/written per second.

- Assumptions for our kernels:

- Fixed number of FP ops per kernel call.

- ————— " ————— Bytes of I/O per kernel call

$$\Rightarrow \beta_{FP} = \frac{N_{FP}}{T} : \text{sustained floating point operations.}$$

$$\beta_{IO} = \frac{N_{IO}}{T} : \text{sustained bandwidth.}$$

$N_{FP}$ : Count floating point operations explicitly.

+, -, \* : 1 FP operation.

/, sqrt(), etc : depend on architecture  $\gg 2$  Flops.

$N_{IO}$ : Count bytes of I/O explicitly.

- Assume that for every kernel call, each byte read is never re-read

"maximum data reuse"

- Count I/O of  $O(N)$  and above, where  $N$  is the problem size.

- Example:

Kernel:  $y_i^{ab} = \sum_{c=0}^{N-1} A^{ac} x_i^{cb}$  : Array of matrices-times-array.

c.g.  $N=2$ :  $y_i^{ab} = A^{a0} x_i^{0b} + A^{a1} x_i^{1b}$   $i=1, \dots, L$ ,  $L \gg N$ .



$$Y_i^{ab} = \underbrace{A^{ab} \cdot X_i^{ob} + A^{ai} \cdot X_i^{ib}}_{F.P.: 3 \cdot 2 \cdot 2 \cdot L = 12 \cdot L}$$

$$I/O: \underbrace{(2 \cdot 2 + 2 \cdot 2)}_{\downarrow \times \quad \downarrow Y} \cdot L + \underbrace{2 \cdot 2}_{A} \simeq 8L + \text{low order terms (L)} \\ = 8 \cdot 8L = 64L \text{ Bytes.}$$

$$\rightarrow L = \frac{2^{22}}{T} = 4M \quad \left\{ \begin{array}{l} \Rightarrow P_{FP} = \frac{2^{22} \cdot 12}{0.1 \text{ sec.}} = 120 \cdot 2^{22} \text{ Flop/sec} = 0.15 \text{ G.Flops/sec} \end{array} \right.$$

$$\Rightarrow P_{IO} = \frac{2^{22} \cdot 64}{0.1 \text{ sec.}} = 164 \cdot 2^{22} \text{ Bytes/sec.} \\ \simeq 2.7 \text{ GByte/sec.}$$

Note: 1) Such kernels have a fixed number of Flops/Byte independent of "L"

"Arithmetic or Computational Intensity": e.g. here:  $\frac{12L}{64L} \simeq 0.1875 \frac{\text{Flop}}{\text{Byte}}$

$$I_k = \frac{P_{FP}}{P_{IO}} = \frac{N_{FP}}{N_{IO}} = \text{independent of problem size.}$$

2) Assumed max. data reuse.

Re-using the elements of  $X_i$  did not as extract  $X_i^{ob}$  as extract  $X_i^{ib}$

$$Y_i^{oo} = A^{oo} X_i^{oo} + A^{oi} X_i^{io} \\ Y_i^{oi} = A^{oo} X_i^{oi} + A^{ii} X_i^{ii} \\ Y_i^{io} = A^{io} X_i^{oo} + A^{ii} X_i^{io} \\ Y_i^{ii} = A^{io} X_i^{oi} + A^{ii} X_i^{ii}$$

Remark: Other ways to count performance: Hardware counters (PAPI)

→ For codes with non-deterministic FP/I/O counts.



## 2) How fast can my code run?

Depends on:

i) Machine/Hardware



ii) Kernel.

- Arithmetic intensity.

Assume a simple model for the architecture:



$\delta_{FP}$ : Rate at which processor consumes FP ops.

$\delta_{IO}$ : Rate at which bytes of I/O move between memory - CPU.

Peak F.P.: - Clock rate of CPU (GHz)  
- Flop/cycle of CPU (Vector-length).

Vector length:

- Modern CPUs allow processing multiple data at the same time!

Single Instruction Multiple Data: SIMD.

- The number of elements processed depends on the vector length of the architecture.

i) SSE(2,3,4): 128 bits (Westmere/Judge) ii) AVX : 256 bits iii) KNC: 512 bits (Xeon Phi)

E.g. SSE, Single precision; axpy operation.

$$\text{for } i=0, \dots, L-1 \quad | \quad \begin{array}{l} \text{GP: 4bytes} \\ \Rightarrow 4 \text{ elements} \\ \text{per 128bit vector} \end{array} \quad \left. \right\} \text{for } i=0, 4, 8, \dots, L \quad | \quad \begin{array}{l} y_i = a \cdot x_i + y_i \\ y_{i+4} = a \cdot x_{i+4} + y_{i+4} \\ y_{i+8} = a \cdot x_{i+8} + y_{i+8} \\ y_{i+12} = a \cdot x_{i+12} + y_{i+12} \end{array}$$

$$y_i = a \cdot x_i + y_i$$



Need : vector  $y$ :  $v_y = \{y_i, y_{i+1}, y_{i+2}, y_{i+3}\}$

vector  $x$ :  $v_x = \{x_i, x_{i+1}, x_{i+2}, x_{i+3}\}$

vector  $a$ :  $v_a = \{a, a, a, a\}$

Pseudo-Code  
 $v_q = vload1(a)$

for  $i=0, 4, 8, \dots, L$ .

$v_y = vload(y_i)$

$v_x = vload(x_i)$

$v_y = v_a + v_x + v_y$

$vstore(y_i, v_y)$

- The number of vector-ops per cycle.

> 0.5, 1, 2 } 0.5: 1 vector op every other cycle. (old).

1: 1 ——— every cycle.

2: 1 multiply + 1 add op every cycle (madd).  
op.

Example: SSE4 architecture, 1madd per cycle,  $F=2$  GHz.

$$\text{Peak: } \frac{4 \text{ Flop}}{\text{Vector length}} \times 2 \times 2 \text{ GHz} = 16 \text{ GFlop/sec. peak}$$

$\gamma_{FP}^{\text{Peak}}$

$\gamma_{IO}$ : More difficult to figure out!!

• Read. of trans Specs of vector.

• Depends on RAM channels populated: for HPC systems, assume fully populated channels.

So: How fast can my kernel run?

$$T_{FP} = \frac{N_{FP}}{\gamma_{FP}}, T_{IO} = N_{IO} / \gamma_{IO} \quad \left. \right\} T = \max [T_{FP}, T_{IO}]$$

Assumes: I/o ops can be issued in parallel with

FP ops.

How far is  $\bar{T}$  measured from theoretical  $T = \max [T_{FP}, T_{IO}]$ ?

Remarks: - Some systems may not yield  $\gamma_{IO}^{\text{Peak}}$ . - Maximum data reuse must apply.  
- Assumption of fixed ops in  $\gamma_{FP}^{\text{Peak}}$

### 3) How to get to optimal performance

Remark: In the same way we defined "computational intensity" for a kernel,

we can determine a similar ratio for the machine :  $I_m = \frac{\gamma_{FP}}{\gamma_{I/O}}$ .

- a)  $I_m > I_k \Rightarrow$  The kernel is said to be "bandwidth-bound" on this machine  
 ↓  
 (Most typical situation nowadays.) i.e. at peak bandwidth, the F.P. units of the machine will be under-utilized.  
 (for lattice kernels).
- b)  $I_m < I_k \Rightarrow$  The kernel is said to be "computational-bound" on this machine.  
 i.e. We can reach Peak F.P. at less than peak BW.
- c)  $I_m = I_k \Rightarrow$  The kernel is balanced. (Ideal situation!)

Depending on where kernel is (a, b, or c?), we optimize for better BW utilization, FP utilization or both.

Optimization Strategies covered here.

Multi-core system :



1) Effectively multiple  $\gamma_{FP}$ .

2) Closer to  $\gamma_{I/O}$ .

1) Open MP for multi-core processing.

2) Vectorization of kernels for better FP and I/o utilization.

Other optimizations : - Cache locality (spacial, temporal)

→ Not covered, but some indirect. insight from exercises.

- Optimization for distributed memory systems (MPI, overlap comm.-comp.)
- Optimization for streaming processors (GPUs)

## 1) Open-MP for multi-processing.

Add pragmas to your code:

```
for(i=0; i<N; i++) → #pragma omp for
    x[i] = a * x[i];           for(j=0; j<N; j++)
                                x[j] = a * x[j].
```

} Split N over number of openmp threads, which is controlled by OMP\_NUM\_THREADS variable.

$$x = \begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ \vdots \\ x_{n-1} \end{bmatrix}; \quad \begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ \vdots \\ x_{n-1} \end{bmatrix} = \begin{bmatrix} a & 0 & 0 & \dots & 0 \\ 0 & a & 0 & \dots & 0 \\ 0 & 0 & a & \dots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \dots & a \end{bmatrix} \cdot \begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ \vdots \\ x_{n-1} \end{bmatrix}$$

double a = 2.0; ← local variable.  
#pragma omp parallel {

```
int ithr = omp_get_thread_num();
double b = ithr * a;
```

```
#pragma omp for
for(int i=0; i<N; i++)
    x[i] = b * x[i];
```

}

Full API and pragma listing.  
→ [www.OpenMP.org](http://www.OpenMP.org).  
(Google: openmp reference.)

## 2) Vectorisation

- a) Bring kernel to vectorizable form.
- b) Implement with intrinsics.

} Improve both FP AND BW performance

BW: improved because of less shuffles, simpler load/stores

Remark:

If code is in vectorizable form (a), then implementing (b) may not be necessary - many be handled by compiler auto-vectorization.

Short vectorization how-to:

- Here for Intel intrinsics. Should work for gcc/icc.
- Similar for other vector types (e.g. Altivec)

Includes: <xmmintrin.h>      Types: `--m128` (128-bit floating point, holds 4)  
`--m128d` (128-bit double, holds 2)

Functions: `_mm_load_ps(*float)` : load from memory (S.P.)  
`_mm_load_pd(*double)` : load from memory (D.P.)

`_mm_load1_ps(*float)` (`*double`)

`_mm_store_ps(s, *double)` (`single *`, `--m128d`)

`_mm_mul_ps/d(--m128/d, --m128/d)`.

`_mm_add_ps/d(-----)`

Shuffle:

`C = _mm_shuffle_ps(a, b, mask)`      mask: 8 bits.

`a = (a3, a2, a1, a0)`

`b = (b3, b2, b1, b0)`

`mask = (x2|x0|x1|x3|x4|x5|x2|x1.|x0)`

`C[0] = a[x1, x0], C[1] = a[x3, x2].`

`C[2] = b[x5, x4], C[3] = b[x7, x6].`

Example: `a = [0, 1, 2, 3]`, `b = [4, 5, 6, 7]`, `mask = 0b10011011`

`C[0] = a[3], C[1] = a[2], C[2] = b[1], C[3] = b[0], C = [3, 2, 5, 4].`

More info: [software.intel.com/sites/landingpage/IntrinsicsGuide/](http://software.intel.com/sites/landingpage/IntrinsicsGuide/)