

# Optimizing Applications: an Ant Farm Approach



Jack Deslippe  
August 2016



U.S. DEPARTMENT OF  
**ENERGY**

Office of  
Science



# Agenda

---



1. Many Core vs Multi Core
2. Performance Optimization Concepts for Many Core
3. Performance Optimization Strategy for Many Core
4. Example Case Studies

# Many Core HPC Systems



U.S. DEPARTMENT OF  
**ENERGY**

Office of  
Science



# Many Core Systems Coming to NERSC, ALCF and More



- **NERSC's Cori will begin to transition the workload to more energy efficient architectures**
- **Cray XC system with over 9300 Intel Knights Landing (Xeon-Phi) compute nodes**
  - Self-hosted, (not an accelerator) manycore processor with 68 cores per node
  - On-package high-bandwidth memory



System named after Gerty Cori, Biochemist and first American woman to receive the Nobel prize in science.



Office of  
Science



# What is different about Many-Core



## Edison (Multi-Core):

- 5000+ Ivy Bridge Nodes
- 12 Cores Per CPU
- 24 Virtual Cores Per CPU
- 2.4-3.2 GHz
- Can do 4 Double Precision Operations per Cycle (+ multiply/add)
- 2.5 GB of Memory Per Core
- ~100 GB/s Memory Bandwidth

## Cori (Many-Core):

- 9000+ Knights Landing Nodes
- 68 Physical Cores Per CPU
- Up to 272 Virtual Cores Per CPU
- Much slower GHz
- Can do 8 Double Precision Operations per Cycle (+ multiply/add)
- < 0.3 GB of Fast Memory Per Core  
< 2 GB of Slow Memory Per Core
- Fast Memory has ~ 4-5x DDR4 Bandwidth



U.S. DEPARTMENT OF  
**ENERGY** | Office of  
Science



# Basic Optimization Concepts



U.S. DEPARTMENT OF  
**ENERGY**

Office of  
Science



Need to explicitly consider both inter and on-node parallelism in application.

Existing applications may suffer from:

- Memory overhead due to duplicated data in traditional MPI tasks
- Lack of SIMD/Vectorization expressiveness in app.
- Potential MPI latency in all-to-all communication patterns

Possible Solutions:

MPI+MPI, MPI+OpenMP, PGAS (MPI+PGAS), Task Based Programming



# PARATEC Use Case For OpenMP

PARATEC computes parallel FFTs across all processors.

Involves MPI all-to-all communication (small messages, latency bound).

Reducing the number of MPI tasks in favor OpenMP threads makes large improvement in overall runtime.



Figure Courtesy of Andrew Canning

# Vectorization

There is another important form of on-node parallelism

```
do i = 1, n
    a(i) = b(i) + c(i)
enddo
```

$$\begin{pmatrix} a_1 \\ \dots \\ a_n \end{pmatrix} = \begin{pmatrix} b_1 \\ \dots \\ b_n \end{pmatrix} + \begin{pmatrix} c_1 \\ \dots \\ c_n \end{pmatrix}$$

Vectorization: CPU does identical operations on different data; e.g., multiple iterations of the above loop can be done concurrently. Works best with long/aligned vectors.

# Vectorization

There is another important form of on-node parallelism

```
do i = 1, n
    a(i) = b(i) + c(i)
enddo
```

$$\begin{pmatrix} a_1 \\ \dots \end{pmatrix} = \begin{pmatrix} b_1 \\ \dots \end{pmatrix} + \begin{pmatrix} c_1 \\ \dots \end{pmatrix}$$

Intel Xeon Sandy-Bridge/Ivy-Bridge:

4 Double Precision Ops Concurrently

Vectorize Intel Xeon Phi:  
above 1

8 Double Precision Ops Concurrently

of the

Compilers want to “vectorize” your loops whenever possible. But sometimes they get stumped. Here are a few things that prevent your code from vectorizing:

## Loop dependency:

```
do i = 1, n  
    a(i) = a(i-1) + b(i)  
enddo
```

## Task forking:

```
do i = 1, n  
    if (a(i) < x) cycle  
    if (a(i) > x) ...  
enddo
```

## Example From NERSC User Group Hackathon - (Astrophysics Transport Code)

```
for (many iterations) {  
    ... many flops ...  
    et = exp(outcome1)  
    tt = pow(outcome2,3)  
    IN = IN * et +tt  
}
```

# Things that prevent vectorization in your code



## Example From NERSC User Group Hackathon - (Astrophysics Transport Code)

```
for (many iterations) {  
    ... many flops ...  
    et = exp(outcome1)  
    tt = pow(outcome2,3)  
    IN = IN * et +tt  
}
```



```
for (many iterations) {  
    ... many flops ...  
    et(i) = exp(outcome1)  
    tt(i) = pow(outcome2,3)  
}  
for (many iterations) {  
    IN = IN * et(i) + tt(i)  
}
```

## Example From NERSC User Group Hackathon - (Astrophysics Transport Code)

```
for (many iterations) {  
    ... many flops ...  
    et = exp(outcome1)  
    tt = pow(outcome2,3)  
    IN = IN * et +tt  
}
```



```
for (many iterations) {  
    ... many flops ...  
    et(i) = exp(outcome1)  
    tt(i) = pow(outcome2,3)  
}  
for (many iterations) {  
    IN = IN * et(i) + tt(i)  
}
```

**30% speed up for entire application!**

# Things that prevent vectorization in your code

Original

```
real(8),dimension
  (5,(col_f_nvr-1)*(col_f_nvz-1),
  (col_f_nvr-1)*(col_f_nvz-1)) :: Ms

do index_ip = 1, mesh_Nzml
  do index_jp = 1, mesh_Nrml
    index_2dp = index_jp+mesh_Nrml*(index_ip-1)

    tmp_vol = cs2%local_center_volume(index_jp)
    tmp_f_half_v = f_half(index_jp, index_ip) *
    tmp_vol
    tmp_dfdv = dfdr(index_jp, index_ip) *
    tmp_vol
    tmp_dfdz_v = dfdz(index_jp, index_ip) *
    tmp_vol

    tmpr(1:3)= tmpr(1:3) +
    Ms(1:3,index_2dp,index_2D)* tmp_f_half_v
    tmpr(5) = tmpr(5) +
    Ms(4,index_2dp,index_2D)*tmp_dfdv +
```

Optimized

```
real (8),dimension
  ((col_f_nvr-1),5,(col_f_nvz-1),
  (col_f_nvr-1)*(col_f_nvz-1)) :: Ms

do index_ip = 1, mesh_Nzml
  do index_jp = 1, mesh_Nrml
    index_2dp = index_jp+mesh_Nrml*(index_ip-1)
    tmp_vol = cs2%local_center_volume(index_jp)
    tmp_f_half_v = f_half(index_jp, index_ip) *
    tmp_vol
    tmp_dfdv = dfdr(index_jp, index_ip) * tmp_vol
    tmp_dfdz_v = dfdz(index_jp, index_ip) * tmp_vol

    tmpr(index_jp,1) = tmpr(index_jp,1) +
    Ms(index_jp,1,index_ip,index_2D)*
    tmp_f_half_v
    tmpr(index_jp,2) = tmpr(index_jp,2) +
    Ms(index_jp,2,index_ip,index_2D)*
    tmp_f_half_v
    tmpr(index_jp,3) = tmpr(index_jp,3) +
    Ms(index_jp,3,index_ip,index_2D)*
    tmp_f_half_v
    tmpr(index_jp,5) = tmpr(index_jp,5) +
    Ms(index_jp,4,index_ip,index_2D)*
    Ms(index_ip,2,index_ip,index_2D)*
    tmp_dfdv
    tmp_dfdz_v
```

Example From Cray COE Work on XGC1

# Things that prevent vectorization in your code

Original

```
real(8),dimension
((5,(col_f_nvr-1)*(col_f_nvz-1),
((col_f_nvr-1)*(col_f_nvz-1)) :: Ms

do index_ip = 1, mesh_Nzm1
do index_jp = 1, mesh_Nrml
    index_2dp = index_jp+mesh_Nrml*(index_ip-1)

    tmp_vol = cs2%local_center_volume(index_jp)
    tmp_f_half_v = f_half(index_jp, index_ip) *
    tmp_vol
    tmp_dfdv = dfdr(index_jp, index_ip) *
    tmp_vol
    tmp_dfdz_v = dfdz(index_jp, index_ip) *
    tmp_vol

    tmpr(1:3)= tmpr(1:3) +
    Ms(1:3,index_2dp,index_2D)* tmp_f_half_v
    tmpr(5) = tmpr(5) +
    Ms(4,index_2dp,index_2D)*tmp_dfdv +
    Ms(5,index_2dp,index_2D)*tmp_dfdz_v
```

Optimized

```
real (8),dimension
((col_f_nvr-1),5,(col_f_nvz-1),
((col_f_nvr-1)*(col_f_nvz-1)) :: Ms

do index_ip = 1, mesh_Nzm1
do index_jp = 1, mesh_Nrml
    index_2dp = index_jp+mesh_Nrml*(index_ip-1)
    tmp_vol = cs2%local_center_volume(index_jp)
    tmp_f_half_v = f_half(index_jp, index_ip) *
    tmp_vol
    tmp_dfdv = dfdr(index_jp, index_ip) * tmp_vol
    tmp_dfdz_v = dfdz(index_jp, index_ip) * tmp_vol

    tmpr(index_jp,1) = tmpr(index_jp,1) +
    Ms(index_jp,1,index_ip,index_2D)*
    tmp_f_half_v
    tmpr(index_jp,2) = tmpr(index_jp,2) +
    Ms(index_jp,2,index_ip,index_2D)*
    tmp_f_half_v
    tmpr(index_jp,3) = tmpr(index_jp,3) +
    Ms(index_jp,3,index_ip,index_2D)*
    tmp_f_half_v
    tmpr(index_jp,5) = tmpr(index_jp,5) +
    Ms(index_jp,4,index_ip,index_2D)*
    Ms(index_jp,5,index_ip,index_2D)*
    tmp_dfdv +
    tmp_dfdz_v
```

Example From Cray COE Work on XGC1

**~40% speed up  
for kernel**



U.S. DEPARTMENT OF  
**ENERGY**

Office of  
Science



BERKELEY LAB

Lawrence Berkeley National Laboratory

Consider the following loop:

```
do i = 1, n  
  do j = 1, m  
    c = c + a(i) * b(j)  
  enddo  
enddo
```

Assume, n & m are very large such that a & b don't fit into cache.

Then,

During execution, the **number of loads From DRAM** is

$$n*m + n$$

# Memory Bandwidth

Consider the following loop:

```

do i = 1, n
    do j = 1, m
        c = c + a(i) * b(j)
    enddo
enddo

```

Assume, n & m are very large such that a & b don't fit into cache.

Then,

During execution, the **number of loads From DRAM** is

$$n*m + n$$

Requires 8 bytes loaded from DRAM per FMA (if supported). Assuming 100 GB/s bandwidth on Edison, we can **at most achieve 25 GFlops/second** (2 Flops per FMA)

Much lower than 460 GFlops/second peak on Edison node. **Loop is memory bandwidth bound.**

# Roofline Model For Edison



# Processor Memory Hierarchy on MultiCore



U.S. DEPARTMENT OF  
**ENERGY**

Office of  
Science



BERKELEY LAB

Lawrence Berkeley National Laboratory

# Improving Memory Locality



Improving Memory Locality. Reducing bandwidth required.

```
do i = 1, n  
  do j = 1, m  
    c = c + a(i) * b(j)  
  enddo  
enddo
```



```
do jout = 1, m, block  
  do i = 1, n  
    do j = jout, jout+block  
      c = c + a(i) * b(j)  
    enddo  
  enddo  
enddo
```

Loads From DRAM:

$$n*m + n$$

Loads From DRAM:

$$\begin{aligned} & m/\text{block} * (n+\text{block}) \\ & = \mathbf{n*m/block + m} \end{aligned}$$

# Improving Memory Locality Moves you to the Right on the Roofline



# Optimization Strategy



U.S. DEPARTMENT OF  
**ENERGY**

Office of  
Science



# The Ant Farm!



The Ant Farm Flow Chart





## 1. Determine your roofline position:

<http://www.nersc.gov/users/application-performance/measuring-arithmetic-intensity/>



# Measuring Your Memory Bandwidth Usage (VTune)

Measure memory bandwidth usage in VTune. (Next Talk)

Compare to Stream GB/s.

If 90% of stream, you are memory bandwidth bound.

If less, more tests need to be done.



# Are you memory or compute bound? Or both?



# Are you memory or compute bound? Or both?



Run Example  
in “Half  
Packed” Mode

If you run on only half of the cores on a node, each core you do run has access to more bandwidth

aprun -n 24 -N 12 -S 6 ...

VS

aprun -n 24 -N 24 -S 12 ...

If your performance changes, you are at least partially memory bandwidth bound



U.S. DEPARTMENT OF  
**ENERGY** | Office of  
Science



# Are you memory or compute bound? Or both?



Run Example  
in “Half  
Packed” Mode

aprun -n 24 -N

If your performance

If you run on only half of the cores on a node, each core you do run has access to more bandwidth



2 ...

ound



U.S. DEPARTMENT OF  
**ENERGY**

Office of  
Science



# Are you memory or compute bound? Or both?



Run Example  
at “Half Clock”  
Speed

Reducing the CPU speed slows down computation, but doesn't reduce memory bandwidth available.

aprun --p-state=2400000 ...

VS

aprun --p-state=1900000 ...

If your performance changes, you are at least partially compute bound

# So, you are Memory Bandwidth Bound?



## What to do?

1. Try to improve memory locality, cache reuse
2. Identify the key arrays leading to high memory bandwidth usage and make sure they are/will-be allocated in HBM on Cori.



Profit by getting ~ 5x more bandwidth GB/s.

## What to do?

1. Make sure you have good OpenMP scalability. Look at VTune to see thread activity for major OpenMP regions.



2. Make sure your code is vectorizing. Look at Cycles per Instruction (CPI) and VPU utilization in vtune.

See whether intel compiler vectorized loop using compiler flag: -qopt-report=5

# High latency instructions : Complex-Division (without -fp model fast=2)



# Are you latency bound?

You may be memory latency bound (or you may be spending all your time in IO and Communication).

If running with hyper-threading on Edison improves performance, you *\*might\** be latency bound:

aprun -j 2 -n 48 ....

VS

aprun -n 24 ....

If you can, try to reduce the number of memory requests per flop by accessing contiguous and predictable segments of memory and reusing variables in cache as much as possible.

On Cori, each core will support up to 4 threads. Use them all.

# NESAP Case Study



U.S. DEPARTMENT OF  
**ENERGY**

Office of  
Science



- ★ Big systems require more memory. Cost scales as  $N_{\text{atoms}}^2$  to store the data.
- ★ In an MPI GW implementation, in practice, to avoid communication, data is duplicated and **each MPI task has a memory overhead**.
- ★ Users sometimes forced to use 1 of 24 available cores, in order to provide MPI tasks with enough memory. **90% of the computing capability is lost**.



In house code (I'm one of main developers). Use as “prototype” for App Readiness.

Significant Bottleneck is large matrix reduction like operations. Turning arrays into numbers.

$$\begin{aligned} \langle n\mathbf{k} | \Sigma_{\text{CH}}(E) | n'\mathbf{k} \rangle &= \frac{1}{2} \sum_{n''} \sum_{\mathbf{q}\mathbf{G}\mathbf{G}'} M_{n''n}^*(\mathbf{k}, -\mathbf{q}, -\mathbf{G}) M_{n''n'}(\mathbf{k}, -\mathbf{q}, -\mathbf{G}') \\ &\times \frac{\Omega_{\mathbf{G}\mathbf{G}'}^2(\mathbf{q}) (1 - i \tan \phi_{\mathbf{G}\mathbf{G}'}(\mathbf{q}))}{\tilde{\omega}_{\mathbf{G}\mathbf{G}'}(\mathbf{q}) (E - E_{n''\mathbf{k}-\mathbf{q}} - \tilde{\omega}_{\mathbf{G}\mathbf{G}'}(\mathbf{q}))} v(\mathbf{q}+\mathbf{G}') \end{aligned}$$

# Optimization Path

Optimization process for Kernel-C (Sigma code):

1. Refactor (3 Loops for MPI, OpenMP, Vectors)
2. Add OpenMP
3. Initial Vectorization (loop reordering, conditional removal)
4. Cache-Blocking
5. Improved Vectorization
6. Hyper-threading



# Vectorization

```

!$OMP DO reduction(+:achtemp)
do my_igp = 1, ngpown ←
...
do iw=1,nfreq ! nfreq is 3 ←
    scht=0D0
    wxt = wx_array(iw)
    do ig = 1, ncoul ←
        if (abs(wtilde_array(ig,my_igp) * eps(ig,my_igp)) .lt. TOL) cycle ←
        wdiff = wxt - wtilde_array(ig,my_igp)
        delw = wtilde_array(ig,my_igp) / wdiff ←
        ...
        scha(ig) = mygpvar1 * aqsntemp(ig) * delw * eps(ig,my_igp)
        scht = scht + scha(ig) ←
    enddo ! loop over g ←
    sch_array(iw) = sch_array(iw) + 0.5D0*scht ←
enddo
achtemp(:) = achtemp(:) + sch_array(:) * vcoul(my_igp)
enddo

```

ngpown typically in 100's to 1000s. Good for many threads.

Original inner loop. Too small to vectorize!

ncoul typically in 1000s - 10,000s. Good for vectorization.

Attempt to save work breaks vectorization and makes code slower.

# Change in Roofline

Haswell Roofline Optimization Path



KNL Roofline Optimization Path





## Why KNC worse than Haswell for GPP Kernel?

- 2S Haswell 27.9s    KNC 39.9s    (Bandwidth bound on KNC but not on Haswell)

```
!$OMP DO
do my_igp = 1, ngpown
    do iw = 1 , 3
        do ig = 1, igmax
            load wtilde_array(ig,my_igp) 819 MB, 512KB per row
            load aqsntemp(ig,n1) 256 MB, 512KB per row
            load l_eps_array(ig,my_igp) 819 MB, 512KB per row
            do work (including divide)
```

Required Cache size to reuse 3 times:

1536 KB

L2 on KNL is 512 KB per core

L2 on Has. is 256 KB per core

L3 on Has. is 3800 KB per core

**Without blocking we spill out of L2 on KNC and Haswell. But, Haswell has L3 to catch us.**



## Why KNC worse than Haswell for GPP Kernel?

- 2S Haswell 27.9s    KNC 39.9s    (Bandwidth bound on KNC but not on Haswell)

```
!$OMP DO
do my_igp = 1, ngpown
  do igbeg = 1, igmax, igblk
    do iw = 1 , 3
      do ig = igbeg, min(igbeg + igblk,igmax)
        load wtilde_array(ig,my_igp) 819 MB, 512KB per row
        load aqsntemp(ig,n1) 256 MB, 512KB per row
        load l_eps_array(ig,my_igp) 819 MB, 512KB per row
      do work (including divide)
```

Required Cache size to reuse 3 times:

1536 KB

L2 on KNL is 512 KB per core

L2 on Has. is 256 KB per core

L3 on Has. is 3800 KB per core

**Without blocking we spill out of L2 on KNC and Haswell. But, Haswell has L3 to catch us.**

# Cache Blocking Optimization

Haswell Roofline Optimization Path



KNL Roofline Optimization Path



# Why Complex Divides so Slow?



Found significant x87 instructions from 1/complex\_number instead of AVX/AVX-512



Can significantly speed up by

- a) Doing complex divide manually

Or

- b) Using -fp-model fast=2

# Additional Speedups from Hyperthreading

Haswell Roofline Optimization Path



KNL Roofline Optimization Path



# Conclusions



U.S. DEPARTMENT OF  
**ENERGY**

Office of  
Science



1. Optimizing code is not always straightforward. It is a continual discovery process that involves many sequential and coupled changes.
2. Use profiling tools to find and characterize hotspots.
3. Understanding bandwidth and compute limitations of hotspots are key to deciding how to improve code.

# The End (Extra Slides)



U.S. DEPARTMENT OF  
**ENERGY**

Office of  
Science



# Thread Scaling

KNL DDR performance saturates at around 50 threads, becomes memory bandwidth limited.

KNL MCDRAM performance beats dual socket Haswell by 63%.

## Kernel C Thread Scaling



# Why Complex Divides so Slow?



---

Code performance now limited by complex divides

why??

For complex division in performance critical loop, I had already removed the explicit complex divide but what is faster?

a)  $c = 1 / c$  vs. b)

$$r = c * \text{conj}(c)$$

$$r = 1 / r$$

$$c = \text{conj}(c) * r$$

c/d) Compiling with/without -fp-model fast=2

# Real-Division (with or without -fp model fast=2)



# Complex-Division (with -fp model fast=2)



# Profile Your Application (VTune / CrayPat)



---

Approximation:

- a. Real Division
- b. Complex Division
- c. Complex Division  
+ -fp-model fast=2



Wall Time:

|              |
|--------------|
| 6.37 seconds |
| 4.99 seconds |
| 5.30 seconds |

---

Approximation:

a. Real Division



Wall Time:

6.37 seconds

b. Complex Division

4.99 seconds

c. Complex Divsion +  
-fp-model=fast

5.30 seconds

Approximation:

a. Real Division



Wall Time:

6.37 seconds

b. Complex Division

4.99 seconds

c. Complex Division +  
-fp-model fast=2

5.30 seconds

d. Complex Division +  
-fp-model=fast=2 +  
!dir\$ nounroll

4.89 seconds

# Early NESAP (Advances with Cray and Intel) Advances



Thread Scaling in BerkeleyGW GPP Kernel on Xeon-Phi



Source: [GPP Thread](#)

BerkeleyGW FF Kernel Runtimes on Xeon and Xeon-Phi (Nathan)



## Overall Improvement

|                |        |
|----------------|--------|
| BGW GPP Kernel | 0-10%  |
| BGW FF Kernel  | 2x-4x  |
| BGW Chi Kernel | 10-30% |
| BGW BSE Kernel | 10-50% |

## Notes

Pretty optimized to begin with. Thread scalability improved by fixing ifort allocation performance.  
Unoptimized to begin with. Cache reuse improvements  
Moved threaded region outward in code  
Created custom vector matmuls



U.S. DEPARTMENT OF  
**ENERGY**

Office of  
Science



**Breakdown of Application Hours**  
**on Hopper and Edison 2013**



NESAP Tier-1, 2 Code  
 NESAP Proxy Code or Tier-3 Code



# Early Lessons Learned



Cray and Intel very helpful in profiling/optimizing the code. See following slides for using Intel resources effectively

Generating small tangible kernels is important for success -

Targeting Many-Core greatly helps performance  
back on Xeon.

Complex division is slow on (particularly on KNC)

BGW 1.0 vs 1.1 Sigma Performance

