

## Course Outline

- . General Overview – HPC ✓
- . Hardware + basic optimization techniques
- . Parallel algorithm design
- . Shared and Distributed
- . Technical (OpenMP and MPI)
- . Performance modeling of parallel algorithms
- . Important parallel patterns ✓
- . Application (assignments and project)

✓ **Operating system :** Scientific Linux

✓ **Compilers and Libraries:** The entire GNU compiler suite - gcc, g++, and g77

The Java Execution and Development Environments.

Python programming language. Also Matplotlib, Numpy and Scipy

## **Parallel Programming Libraries and tools:**

✓ OpenMP - API for directing multi-threaded shared memory parallelism.

✓ OPENMPI- open source implementation of MPI (access to HPC Cluster)

✓ GNU Gprof - performance analysis tool for Unix applications.

**Important Scientific Software:** Scilab; Octave; R - software environment for statistical computing and graphics.

✓ **Visualization:** Gnuplot. Paraview.

**Documentation and reader:** Latex

## **Access to HPC Cluster** ✓

Goal of HPC: to achieve the maximum possible performance out of a particular system for a particular problem



What is expected from a good programmer?

Accuracy ✓  
Efficiency ✓

Accuracy & N. of  
Computations



Evaluation of the program in the given software-hardware setting



## Thinking in serial



- \* Resonate → cleaving Enzyme
- \* Workers - process



## Thinking in Parallel

Time - 15 min



- \* 4 equally efficient workers
- \* 4 (multiple resonases)
- \* Work Allocation → Load balancing  
→ Decomposition →
- \* Coordinates

Mamsu





Dependency

Analysis

1m in diameter

1m in depth

Serial  $\rightarrow$  1 worker  $\rightarrow$  1 log - 2nd logs  $\rightarrow$  3rd logs

Parallel  $\rightarrow$  (10 workers)  $\rightarrow$



# Serial vs Parallel



## Serial

- Break problem → discrete series of instructions
- Instructions executed → sequentially.
- Executed on a single processor
- Only one instruction @ any moment in time .

**Improve Latency.**

Unit  $\rightarrow$  (Second)



## Parallel: multiple compute resources.

- Break problem → discrete parts that can be solved concurrently.
- Each part again broken down to a series of instructions
- Each instructions (from each part) → execute simultaneously on several processors.
- An overall control/coordination mechanism is needed.

**Improve throughput.**

Combine  
Memory

per second  
 $\rightarrow$  Rate

$n \times n =$

Via point

We need a systems perspective : why?

systems are in general non-terminating and non-deterministic, whereas the behavior of algorithms is terminating, deterministic and platform-independent

$$M \downarrow \\ O(n^2) > O(n^3)$$



Theory of parallel algorithm design or theory of parallel computing is based on abstract concepts of time and memory which may ignore real life constraints for simplicity, and therefore not take into account non-deterministic and hardware factors

The performance of a computer program depends on a wide range of factors like the nature of the algorithm, the machine (several hardware factors), compiler optimizations, the runtime environment, the input, the measurement methodology etc. and their mutual interaction

Cloud

Not enough just to get some speedup, but being able to explain the speedup from a systems viewpoint in terms of resources.

$$\text{Speedup}_{(S)} = \frac{\text{Serial Time}}{\text{Parallel Time}}$$
$$= \frac{2}{1} = 2$$

$$S = 3 \text{ on } P = 4$$
$$S = 4.2 \text{ on } P = 4$$

## Organization of Parallel Platforms



# Main Outcome

- ✓ 1. Start with any Serial Algorithm / Application
- ✓ 2. Understand the Data Access and Compute Pattern
- ✓ 3. Implement the most optimized version (multiple versions possible) - right choice of data structures etc.
- ✓ 4. Do runtime analysis - memory requirement and execution time
- ✓ 5. Figure out Scope of Parallelism - Dependency analysis etc. → Scope
- ✓ 6. Naive parallel algorithm for shared memory system (multiple threads) - parallel run-time analysis
- ✓ 7. Do profiling and optimize the parallel version - develop a better parallel algorithm - if possible (new parallel)
- ✓ 8. Scalability study
- ✓ 9. Comparision of different parallel versions, speedup, efficiency - as a function of problem-size and architecture
- ✓ 10. Interpretation of run-time data in terms of memory access etc. Comparision with theoretical estimations.





Figure 1. Concept map for parallel computing. Adapted from [9].

## Module 1:

Modern Processor, Performance Metrics and benchmark, Memory Hierarchies, Caches, Pipelining, SIMD, SPMD, performance estimates, profiling, vector triad, serial code optimizations, data access optimizations, balance analysis, dependence testing, granularity, concurrency, loop fusion, loop distribution. Cache miss, cache hit, block matrix multiplication. Amdahl's law.

# Computer Performance

5

What is HPC and goal of HPC ??

Quantitative measure !!



# Computer Performance

What is HPC??

➤ Measure of computer performance - > **FLOPS** ( FLoating-point Operations Per Second).  $\text{FLOPS}/\text{Second}$

➤ Performance → We need Clock speed and microprocessor (Flops/cycle)

➤ **FLOPS** = clock rate x (flop/cycle) x no. of cores =  $2.5 \times 10^9 \frac{\text{Cycles}}{\text{Second}} \times 4$

➤ Microprocessors today can do 4 FLOPs per clock cycle.

➤ A single-core 2.5 GHz processor has a theoretical performance of  $10 \text{ e}9$  FLOPS =  $10 \text{ GFLOPS}$ .

Aggregating computing power in such a way that delivers much higher performance than a typical desktop computer (Look at FLOPS formula !!.) → to solve large problems.

**Useful in - science, engineering, or business.**

$$\frac{\text{FLOPS}}{\text{Sec cycle}}$$

$$\begin{aligned}
 & \text{FLOPS} \quad \times \quad 1 \\
 & \text{Cycle} \\
 & = \frac{\text{FLOPS}}{\text{Second}} \\
 & = 10^3 \text{ GFLOPS} \\
 & = 1 \text{ TFLOPS} \\
 & \text{HPC}
 \end{aligned}$$

$$\begin{aligned}
 n \times n &= 10 \times 10^3 \\
 &= 10^4
 \end{aligned}$$

$$\begin{aligned}
 O(n^3) &= 10^6
 \end{aligned}$$

[ ]

- ✓ Performance =  $\frac{\text{Work}}{\text{Time}} = \frac{\text{FLOP}}{\text{Wall clock time}}$
- ✓ FLOP (addition, multiplication, division etc.)
- ✓ FLOPS --> Number of FLOP/Second

FLOPS (flops per second) can be used to characterize a computing system (peak theoretical performance) and it can also be used to characterize a program (code) i.e. actual performance of the program in terms of how many flops per seconds

$$\frac{10^6 \text{ FLOPS}}{1 \text{ Sec}}$$

1 MFLOPS

$$\frac{10^9 \text{ FLOPS}}{\text{Laptop}}$$

$$\frac{10^{12} \text{ FLOPS}}{\text{Code}}$$

# Software Developer Jobs

Mentioning “performance”



Mentioning “optimization”



Mentioning “parallel”



Mentioning “concurrency”



Source: Monster.com

# High performance computing trend

21

Cray -  $10^9$   
→ Tera -  $10^{12}$   
Peta -  $10^{15}$   
-  $10^{18}$   
-  $10^{21}$



Figure 1. Data showing performance of Top 500 supercomputers on the LINPACK test over the past 20 years. The top line is the sum of the top 500 systems, the middle line is no. 1, and the bottom line is no. 500 [1]. (Online version in colour.)

<http://dx.doi.org/10.1098/rsta.2013.0319>

# HPC trends

CPU  
1000 core



Multi-core

Mary (Cars)



$C_1 | C_2$   
 $C_3 | C_4$

Confidence  
Prediction

- No more serial --- It's a Parallel World.....We need to know how to work with these instruments!

# HPC Applications

24

## Science and Engineering:

- Atmosphere, Earth, Environment
- Physics - applied, nuclear, particle, condensed matter, high pressure, fusion.
- Bioscience, Biotechnology, Genetics
- Chemistry, Molecular Sciences
- Mechanical Engineering
- Electrical Engineering, Microelectronics
- Computer Science, Mathematics
- Defense, Weapons

## Industrial and Commercial:

- Databases, data mining
- Oil exploration
- Medical imaging and diagnosis
- Pharmaceutical design
- Financial modeling
- Entertainment industry
- Web search engines, web based business services

**Optimizing a code → There is no alternative to knowing what is going on between your code and the hardware. → Performance Modeling.**

# What do we learn?

25

## Serial Performance Scaling is Over

- ! Cannot continue to scale processor frequencies
- ! no 10 GHz chips

- ! Cannot continue to increase power consumption
- ! can't melt chip

- ! Can continue to increase transistor density
- ! as per Moore's Law

You **must** re-think your algorithms to be parallel !

High Time to learn Parallel Programming →

Design

# Profiling

40

Profiling → information about program's behavior.

Runtime → "hot spots"

Hot spots → performance bottleneck → optimization.

Context      Memory      Access time

Function and line based profiling.

Function profiling → (e.g. gprof from GNU)

→ Flat function profile and callgraph profile.



Addition

$$\text{Add} \quad n_1 + n_2$$

Ent (an)

$$\text{Sum} \quad \text{B} \quad \text{2m}$$

To get a break up of time taken by each subroutines/functions of the code :

compile with

**f95 -pg file.f**

file.e

after running the executable (a.out) we will get a file  
**gmon.out**

We have to give the command      **gprof --line a.out**  
**gmon.out** to get this breakup times.

→ “hotspots”

| % cumulative | self    | self    | total  |        |        |                 |
|--------------|---------|---------|--------|--------|--------|-----------------|
| time         | seconds | seconds | calls  | s/call | s/call | name            |
| 48.85        | 19.47   | 19.47   | 3000   | 0.01   | 0.01   | adv_efield_vel_ |
| 23.29        | 28.75   | 9.28    | 3000   | 0.00   | 0.00   | rms_            |
| 12.15        | 33.59   | 4.84    | 3000   | 0.00   | 0.00   | inc_efield_     |
| 11.87        | 38.32   | 4.73    | 3000   | 0.00   | 0.00   | mr_mur_         |
| 3.77         | 39.82   | 1.50    | 3000   | 0.00   | 0.00   | adv_hfield_     |
| 0.08         | 39.85   | 0.03    | 4      | 0.01   | 0.01   | elec_dens_      |
| 0.03         | 39.86   | 0.01    | 394416 | 0.00   | 0.00   | fioniz_         |
| 0.00         | 39.86   | 0.00    | 1      | 0.00   | 39.86  | MAIN_           |
| 0.00         | 39.86   | 0.00    | 1      | 0.00   | 0.00   | setup_          |

Old code - time taken (39.86 seconds)

Old code

| % cumulative | self    | self    | total  |        |        |                 |
|--------------|---------|---------|--------|--------|--------|-----------------|
| time         | seconds | seconds | calls  | s/call | s/call | name            |
| 52.76        | 14.04   | 14.04   | 3000   | 0.00   | 0.00   | adv_efield_vel_ |
| 17.47        | 18.70   | 4.65    | 3000   | 0.00   | 0.00   | rms_            |
| 17.40        | 23.33   | 4.63    | 3000   | 0.00   | 0.00   | inc_efield_     |
| 6.61         | 25.09   | 1.76    | 3000   | 0.00   | 0.00   | mr_mur_         |
| 5.52         | 26.56   | 1.47    | 3000   | 0.00   | 0.00   | adv_hfield_     |
| 0.08         | 26.58   | 0.02    | 394416 | 0.00   | 0.00   | fioniz_         |
| 0.04         | 26.59   | 0.01    | 4      | 0.00   | 0.01   | elec_dens_      |
| 0.00         | 26.59   | 0.00    | 1      | 0.00   | 26.59  | MAIN_           |
| 0.00         | 26.59   | 0.00    | 1      | 0.00   | 0.00   | setup_          |

New code - time taken (26.59 seconds)

Peak Performance. → for benchmarking.

Performance matrices →

Low level benchmarking → program to understand the chief performance characteristics of a processor/system.

How to do “time measurement” for different sections of the code. → “elapsed time” ?

**% time** - the percentage of the total running time of the program used by this function.

**Cumulative upto this function.**

**self** - the number of seconds accounted for by this function alone.  
**seconds**

**calls** - the number of times this function was invoked.

**self** - the average number of milliseconds spent in this function per call .  
**ms/call**

**total** - the average number of ms spent in this function and its descendants per  
**call**  
**ms/call**

**name** - the name of the function.

# Callgraph profile / butterfly graph

44

Runtime profile of a function → several different callers.

|     | index         | % time | self      | children | called    | name                |
|-----|---------------|--------|-----------|----------|-----------|---------------------|
|     |               | 0.00   | 26.59     | 1/1      |           | main [2]            |
| [1] | 100.0         | 0.00   | 26.59     | 1        | MAIN_ [1] |                     |
|     | 14.04         | 0.00   | 3000/3000 |          |           | adv_efield_vel_ [3] |
|     | 4.65          | 0.00   | 3000/3000 |          |           | rms_ [4]            |
|     | 4.63          | 0.00   | 3000/3000 |          |           | inc_efield_ [5]     |
|     | 1.76          | 0.00   | 3000/3000 |          |           | mr_mur_ [6]         |
|     | 1.47          | 0.00   | 3000/3000 |          |           | adv_hfield_ [7]     |
|     | 0.01          | 0.02   | 4/4       |          |           | elec_dens_ [8]      |
|     | 0.00          | 0.00   | 1/1       |          |           | setup_ [10]         |
|     | -----         |        |           |          |           |                     |
|     | <spontaneous> |        |           |          |           |                     |
| [2] | 100.0         | 0.00   | 26.59     |          |           | main [2]            |
|     | 0.00          | 26.59  | 1/1       |          |           | MAIN_ [1]           |
|     | -----         |        |           |          |           |                     |
|     | 14.04         | 0.00   | 3000/3000 |          |           | MAIN_ [1]           |
| [3] | 52.8          | 14.04  | 0.00      | 3000     |           | adv_efield_vel_ [3] |

## Contd ....

---

```
[7]   1.50  0.00 3000/3000      MAIN_ [1]
[7]   3.8   1.50  0.00 3000      adv_hfield_ [7]
```

---

```
[8]   0.03  0.01  4/4      MAIN_ [1]
[8]   0.1   0.03  0.01   4      elec_dens_ [8]
[8]   0.01  0.00 394416/394416  fioniz_ [9]
```

---

```
[9]   0.01  0.00 394416/394416  elec_dens_ [8]
[9]   0.0   0.01  0.00 394416      fioniz_ [9]
```

---

```
[10]  0.00  0.00  1/1      MAIN_ [1]
[10]  0.0   0.00  0.00      1      setup_ [10]
```

- Amdahl's Law states that potential program speedup → the fraction of code (P) that can be parallelized:

$$\text{Speedup} = \frac{1}{1-P}$$

- If none of the code can be parallelized,  $P = 0$  and the speedup = 1 (no speedup).
- If 50% of the code can be parallelized, maximum speedup = 2, meaning the code will run twice as fast.

- Theoretical Maximum speedup ??

$$\cancel{\text{Speedup}} = \frac{1}{1-P} = \frac{1}{1-0} = \underline{\underline{1}}$$

Code  
1 w/ ①  
50% ②  
Can be parallelized X  
 $P = .5$

$N=2^x$ 

$$\text{Speedup} = \frac{1}{(P/N) + S}$$



$$\text{Speedup} = \frac{1}{1-p}$$

$$P \text{ in fraction}$$

$$P+S=1$$

$$S = \frac{1}{N} + S_0$$

$N = \text{No. of processors}$

$$P+S = 1$$

$$\downarrow \quad \downarrow \quad S = 1 - p$$

Parallel fraction  
Serial fraction

$$\bullet \quad S$$

$$S = \frac{1}{1-p}$$

Project - 6 member team - continuous evaluation - project report submission via overleaf - 4 Submissions with different marks

$$\text{Speedup} \quad \frac{\text{Serial Time}}{\text{Parallel Time}} = \frac{10 \text{ min}}{1 \text{ min}} = 10 \quad (12) \\ (15)$$

$$10 \leftarrow 10+ \\ 7 \leftarrow 2^{2n} \rightarrow 32 \text{ min} \\ 2 \leftarrow 32 - 21$$



1. Reading and literature survey --> Analyze and formulate problem. Figure out the key issues and questions. Project report (literature review with at least 10-15 papers)

- 2. Understand the algorithm, choice of data structure
- 3. Serial code development. Reproduce published results. Run-time analysis.
- 4. Parallel algorithm design and code development. Validation (comparison with serial code).
- 5. Data collection on different benches for different problem sizes - trends.
- 6. Profiling and optimization. Analysis of different performance metrics.
- 7. Project report/ Paper writing. One team member must have good writing skills.

→ Data Analysis

(speedup)

Parallel  
Graph



## Practical constraints for faster serial Machines: why parallel computing?

1. **Data Transmission speeds** - the speed of a serial computer directly depends on how fast data can move through hardware. Absolute limits -> the speed of light (30 cm/nanosecond). The transmission limit of copper wire (9 cm/ nanosecond).
2. **Limits to diminishing size** - processor technology is currently allowing an increasing number of transistors to be placed on a chip. Even with molecular or atomic-level components -> a limit will be reached on how small components can be.
3. **Commercial limitations** - very expensive to make a single processor faster. Using a larger number of available processors to achieve a better performance is less expensive. → **Parallel processing**.

OLD : Hardware, Architecture, Compilers → Speed

NEW : Parallel programming → Speed

## **New Trends (Reinterpretation of Moore's Law):**

- Number of cores per chip can double every two years
  - Clock speed will not increase (possibly decrease)
  - Systems with millions of concurrent threads (e.g. GPUs)
  - Inter-chip parallelism (MPI) and intra-chip parallelism (OpenMP)
- 

## Modern Processor and Principle of multiplicity

- ① Multiple memories (Cache, RAM, etc.)
- ② Multiple Processors
- ③ Multiple data-paths



## Single core Architecture (stored program digital computer)

Instructions are stored as data in memory

- ✓ CU - instructions read & execute
- ✓ ALU - computation
- ✓ Memory - store data (for ALU) and instructions (for CU)
- ✓ I/O - communication with user



Main  
Performance  
Limitation- ?

Memory

# Terminology

18

130

- 1 The cores perform FLOPS.
- 2 A processor contains one or more (CPU) cores.
- 3 A socket holds one processor.
- 4 A node contains one or more sockets.

**Supercomputer = several nodes → HPC**

HPC world → node peak theoretical performance:

Node performance in GFlops = (CPU speed in GHz) x (number of CPU cores) x (CPU instruction per cycle) x (number of CPUs per node).

CPU → Serial & Parallel (MPI, OpenMP etc.)

How many cores we need for 1 Teraflop (cores@2.5 GHz)?



# Terminology --- visual



Supercomputer - each blue light is a node

Single O/S

Node - standalone

Von Neumann computer

CPU / Processor / Socket - each has multiple cores / processors.



Processor/Socket



The same Parallel code can be executed on a laptop (multi-core) and can be also executed on a supercomputer

# Measurement Metrics

- Flop: floating point operation (usually double precision)
- Flop/s: floating point operations per second
- Bytes: size of data (a double precision floating= 8 bytes)

bit →  
bytes → 8  
words → 4

- Typical sizes
  - Mega Mflop/s =  $10^6$  flop/sec
  - **Gflop/s =  $10^9$  flop/sec**
  - Tera Tflop/s =  $10^{12}$  flop/sec
- 
- Peta Pflop/s =  $10^{15}$  flop/sec
  - Exa Eflop/s =  $10^{18}$  flop/sec
  - Zetta Zflop/s =  $10^{21}$  flop/sec
  - Yotta Yflop/s =  $10^{24}$  flop/sec

Kbyte  $2^{10}$   
Mbyte =  $2^{20} \approx 10^6$  bytes  
**Gbyte =  $2^{30} \approx 10^9$  bytes**  
Tbyte =  $2^{40} \approx 10^{12}$  bytes

Pbyte =  $2^{50} \approx 10^{15}$  bytes  
Ebyte =  $2^{60} \approx 10^{18}$  bytes  
Zbyte =  $2^{70} \approx 10^{21}$  bytes  
Ybyte =  $2^{80} \approx 10^{24}$  bytes

Quiz:

We want a speed-up of 90x with 100 processors,  
the sequential part <sup>of the code</sup> can only be \_\_\_\_\_? %.

Approach →

## Quiz:

We want a speed-up of  $90x$  with 100 processors,  
the sequential part can only be  $\underline{1\%}$ ?

$$\text{Speed} = \frac{1}{(1 - \text{Fraction time affected}) + \frac{\text{Fraction time affected}}{100}}$$

*Parallel*

$$\rightarrow \text{Fraction time affected} = \frac{89}{89.1} = 0.999$$

$$1 - 0.999 = \dots$$

# Memory hierarchy

Time scale  $\sim 10^{-9}$

- Data transfers are the #1 limiting factor in computing
  - Main memory is too slow to keep up with the CPU's hunger for data
- You can either build a small and fast memory or a large and slow memory
  - Caches hold often-used data for fast reference
  - Multiple levels (the larger the slower)
  - Data transfers occur in "bursts" of single cache lines (typically 64 bytes)
- The purpose of many optimizations is to avoid slow data paths



# Characterization of data paths

multiple data path

- Basic model: Latency & bandwidth  
Transfer time for message ( $N$  bytes)

$$T = \lambda + \frac{N}{b}$$

$\lambda$ : latency (set-up time) [s]

$b$ : bandwidth of data path [byte/s]

- Effective bandwidth of message transfer:

$$B_{\text{eff}} = \frac{N}{T} = \frac{N}{\lambda + \frac{N}{b}}$$



## Memory Hierarchies



DRAM Gap - increasing distance between CPU and memory in terms of latency and bandwidth  
Cache alleviate the effects of DRAM gap.

✓ CPU issues a read request to transfer data item to register, first level cache logic checks whether this item already exists in cache or not

\* If it exists - cache hit (low latency)

\* If it does not exist - cache miss (high latency) --> fetch from outer cache --> worse from memory

Instruction cache miss are rare events compared to data cache miss

# What's Driving Parallel Computing Architecture?

31

Performance =  
? / Time



**von Neumann bottleneck!! (memory wall)**

## Code balance

$B_c$   
Code balance of a loop = data traffic/ floating point operations

Software

$$= \frac{3N}{N}$$

$$= \frac{\text{bytes}}{\text{FLOPS}}$$

Reciprocal of code balance  $\rightarrow$  Computational Intensity.

$$= \frac{\text{FLOPS}}{\text{bytes}}$$

for  $i = 1, N$

$$\begin{cases} A[i] + B[i] \\ \quad C[i] \end{cases}$$

End

$$I = \frac{3N}{2N}$$



Machine Balance ( $B_m$ ) = Possible Memory Bandwidth



Peak Performance Computer

$$= \frac{\text{Bytes/Sec}}{\text{Flops/Sec}} = \frac{\text{Bytes}}{\text{Flops}}$$

$$B_m = \frac{b_{\text{max}}}{P_{\text{max}}}$$

$$B_m P_{\text{max}} = b_{\text{max}}$$

Bandwidth Based Performance Modeling

$$MB = \frac{10 \text{ MB/Sec}}{10 \text{ GFLOPS/Sec}} = 1$$

$$= 2.5 \times 10^9 \frac{\text{cycles}}{\text{sec}} \times \frac{4 \text{ FLOPs}}{\text{cycle}} \times 1$$



Roofline  
Model.

Grid code



depends on code  
 $l \approx 1$

If  $B_m \sim B_c$ ,

$$\frac{B_m}{B_c} \ll 1$$

$$l = \left( \frac{B_m}{B_c} \right)$$

Performance @ P =  $\min(P_{\text{max}}, P_{\text{min}})$

$$\frac{B_m}{B_c}$$

Machine

$$= \min \left[ P_{\text{max}}, \frac{l_{\text{max}}}{B_c} \right]$$

Code

for  $i=1, N$

## Arithmetic Intensity: (AI, CI)

- ratio of computation to data traffic; measured in FLOPs:bytes.
- Traffic-Associated with particular memory - not the number of loads and stores.
- when arithmetic intensity exceeds machine balance (the ratio of peak floating-point performance to peak bandwidth) - likely compute bound

Lets define - Arithmetic intensity is the ratio of total floating-point operations to total DRAM bytes

# AI=?

$$= \frac{N \text{ flops}}{8N \text{ bytes}}$$

~~1. B� k~~

```

temp=0.0;
for(i=0;i<N;i++){
    temp = A[i] * A[i];
}
magnitude = sqrt(temp);

```

$$N = 10^9 \quad 9.18 \text{ Giga}$$

$N$  is small

Double precision  $A[i] \rightarrow$  cache

$$N = 10^9 \\ 8 \text{ bytes}$$

$$\text{Size of } A = 80 \text{ bytes}$$

$$\text{Cache} = 32 \text{ KB}$$

- Arithmetic intensity is the ratio of total floating-point operations to total DRAM bytes.
- Assume,  $N$  is sufficiently large that the array does not fit in cache
- The second access to  $A[i]$  exploits the cache/register locality within the processor.

$$N \gg 1$$

$$\text{Size} \rightarrow 10$$

```
temp=0.0;
for(i=0;i<N;i++){
    temp = A[i]*A[i];
}
magnitude = sqrt(temp);
```

0.125

performs N flops while transferring only 8N bytes.

# A|=?

```
for(i=0; i<N; i++) {  
    for(j=0; j<N; j++) {  
        C[i, j] = a * A[i, j] +  
                  b * (A[i, j-1] +  
                         A[i-1, j] +  
                         A[i+1, j] +  
                         A[i, j+1]);  
    }  
}
```

$$= \frac{6N^2}{8N + 16N^2}$$

$$= \frac{6}{16}$$

Cache line  $\rightarrow$  8 Data points



$$N \gg 10$$

Assume cache is substantially larger than 8 N, but substantially smaller than  $16 N^2$

$$N = 10^3$$

$$\text{8KB Cache} < 16 \text{ MB}$$

$$16N^2 = 16 \times (10^3)^2 = 16 \times 10^6 = 16 \text{ MB}$$

SKB

```
for(i=0;i<N;i++){
    for(j=0;j<N;j++){
        C[i,j] = a*A[i,j] +
                  b*( A[i,j-1] +
                      A[i-1,j] +
                      A[i+1,j] +
                      A[i,j+1] );
    }
}
```

0.25



the leading point in the stencil  $A[i,j+1]$  will eventually be reused by subsequent stencils as  $A[i+1,j]$ ,  $A[i,j]$ ,  $A[i-1,j]$ , and  $A[i,j-1]$ .

- references to  $A[i,j]$  only generates  $8 N^2$  bytes of communication
- accesses to  $C[i,j]$  generate  $16 N^2$  bytes because write-allocate cache architectures will generate both a read for the initial fill on the write miss in addition to the eventual write back
- code performs  $6 N^2$  flops

## *Cache Lines*

*The bytes in RAM are grouped in a block of 64 each*



Travel through the array → forward or backward →  
we will get the benefit.

G G G B  
L G G B  
Z G G B

$$A[1] \quad \xrightarrow{\hspace{1cm}} \quad A[8]$$

# AI=?

Compute  
Data Access Pattern

---

```
c[i,j]=0.0;  
for(i=0;i<N;i++){  
    for(j=0;j<N;j++){  
        for(k=0;k<N;k++){  
            c[i,j] += A[i,k]*B[k,j];  
        }  
    }  
}
```

$$AI = \frac{2N^3}{?} \text{ Flops}$$

Assume the cache is substantially larger than  $24 N^2$



$8N^2$   
bytes

$8N^2$   
"

$8N^2$   
bytes

```

C[i,j]=0.0;
for(i=0;i<N;i++){
    for(j=0;j<N;j++){
        for(k=0;k<N;k++){
            C[i,j] += A[i,k]*B[k,j];
        }
    }
}

```

N / 16

AJ ↑

$$\frac{2N^3}{32N^2} = \frac{N}{16}$$

$A[i,j]$ ,  $B[i,j]$ , and  $C[i,j]$  can be kept in cache and only their initial and write back references will generate DRAM memory traffic.

the loop nest will perform 2 N<sup>3</sup> flops while only transferring 32 N<sup>2</sup> bytes.

# Common kernels

First 7 kernels/dwarfs covers most computational science applications

14

|    |                            |                                                                                                                                          |
|----|----------------------------|------------------------------------------------------------------------------------------------------------------------------------------|
| 7  | Monte Carlo or MapReduce   | Calculation is done as per repeated random trials which are embarrassingly parallel                                                      |
| 8  | Combinational Logic        | Implemented using logical function and stored states used to perform simple operation on large amount of data e.g. calculating CRC       |
| 9  | Graph Traversal            | Visit nodes following successive edges, computationally less expensive                                                                   |
| 10 | Dynamic Programming        | Solve simpler overlapping problems, useful for optimization                                                                              |
| 11 | Backtrack and Branch+Bound | Finds an optimal solution by recursively dividing the feasible region into subdomains, and then pruning sub-problems that are suboptimal |
| 12 | Graphical Models           | Graphs with nodes as random variables and edges as conditional dependencies. E.g. Bayesian Network, Hidden Markov Models                 |
| 13 | Finite State Machine       | System behavior defined by states, transitions defined by input and current state and event associated with transitions or states        |

# Common kernels

Numerical Algorithm behind an Application

| ID | Dwarf                            | Description                                                                                                                                      |
|----|----------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------|
| 1  | Dense Linear Algebra             | Dense Vector-Vector, Matrix-Vector, Matrix-Matrix operations, Use of stride access for rows and columns                                          |
| 2  | Sparse Linear Algebra (SpMV etc) | Sparse data (with many zeroes) stored in compressed format. Data access with indexed load and store                                              |
| 3  | Spectral Methods (FFT etc)       | Data in frequency domain, data accessed in multiple butterfly stages with all-to-all for some stages and local to others                         |
| 4  | N-Body methods                   | Depends on interaction between many discrete points                                                                                              |
| 5  | Structure Grids                  | Data in regular grid format, points on grid updated together with high spatial locality. Subdivides grids into finer grids with area of interest |
| 6  | Unstructured Grids               | Data locations in grids as per application characteristics                                                                                       |

Compute  
Data Access  
Pattern



# Important Patterns

| Dwarf/Kernel            | Embedded Computing | General Computing | Machine Learning | Graphics and Games | Databases |
|-------------------------|--------------------|-------------------|------------------|--------------------|-----------|
| Dense Linear Algebra    | ✓                  | ✓                 | ✓                |                    | ✓         |
| Sparse Linear Algebra   | ✓                  | ✓                 | ✓                | ✓                  |           |
| Spectral Methods        | ✓                  |                   | ✓                | ✓                  |           |
| N-Body methods          |                    | ✓                 |                  |                    |           |
| Structure Grids         | ✓                  | ✓                 |                  | ✓                  |           |
| Unstructured Grids      |                    |                   | ✓                |                    |           |
| Monte Carlo (MapReduce) |                    | ✓                 | ✓                |                    | ✓         |

# Common patterns

| Dwarf/Kernel               | Embedded Computing | General Computing | Machine Learning | Graphics and Games | Databases |
|----------------------------|--------------------|-------------------|------------------|--------------------|-----------|
| Combinational Logic        | √                  |                   | √                |                    | √         |
| Graph Traversal            | √                  |                   | √                | √                  | √         |
| Dynamic Programming        | √                  | √                 | √                |                    | √         |
| Backtrack and Branch+Bound |                    | √                 | √                |                    |           |
| Construct Graphical Models | √                  | √                 | √                |                    |           |
| Finite State Machine       | √                  | √                 |                  | √                  |           |

# Challenges

17

- Application design may have combination of different kernels → How to use architecture and programming model tuned in individual kernels/dwarfs for combination design
- Algorithm implementation for different kernels have preferred data structure → Solution having combination of different dwarfs require data structure translation

HP C

Send performance

## Performance Metrics and Benchmarks

We know that all components can operate at some max speed called peak performance.

10 GFLOPS

10 GB/S

Two important quantities:  
Compute Throughput (GFLOPS/Sec) Time (cycles)  
Memory Bandwidth (GBbytes/Sec)

1 GHz  $\rightarrow \frac{10^9 \text{ cycles}}{\text{sec}}$

1 cycle  $\rightarrow 1 \text{ ns}$

Low level Benchmark: Program that tries to test some specific features of the architecture.

Isolate small set of instructions to separate influences and determine specific machine capabilities

Vector Triad - popular microbenchmarking exercise  
Multiply-Add nested loop on the elements of three vectors

Latency  $\rightarrow 5 \text{ cycles}$   
5 ns

## Vector Triad: $A[i] = B[i] + C[i]*D[i]$

int minSize = pow(2, 3); int maxSize = pow(2, 29);  
 int total = maxSize; →

```
for(int size=minSize; size<=maxSize; size*=2){ →
    /* init data */
    for(int i=0; i<size; i++){
        b[i]=3; c[i]=2; d[i]=1;
    }
}
```

RUNS  $\times S_B = \text{Const}$

start = clock();

```
for(int run=0; run<RUNS; run++) { →
    /* vector triad */
```

```
    for(int ind=0; ind<size; ind++) {
        a[ind] = b[ind] + c[ind]*d[ind];
        if(((double)ind)==333.333)
            dummy(ind);
    }
}
```

end = clock();

wallTime = (end - start)/((double)CLOCKS\_PER\_SEC);

/\* Avg throughput \*/

double throughput = ((double)sizeof(double) \* total \* 2)/ wallTime

Complete ↗



$$\text{RUNS} = \frac{2^8 \times 2^9}{2^4} \rightarrow \text{iter } 2 \\ = 2^{25}$$

$$\text{Total FLOPS} = 2^{30} (2 \times 2^{29})$$

2 FLOPS  
 $B, C, D$  - Read  
 $A \rightarrow$  Write  
 $A, B, C, D$   
 → how big?

$2^{30} \sim 1G$

Compute Throughput

$$= \frac{\text{Total FLOPS}}{\text{Time}}$$

Distance = 100 km  
 1 hr  
 5 km  
 A - ()  
 A ->  
 A - V



# Most important limitation – data access !!

Loop based code → large amount of data movement in and out of the CPU.  
Concern : underutilization of on-chip resources.

Several data paths → range of bandwidths and latencies.

Objective : reducing traffic over slow data paths.

## Microbenchmarking

- **Probing of the memory hierarchy**
- **Saturation effects in cache and memory**
- **Typical overheads?**

## Memory Read Bandwidth/Latency



1ns

## Latency Numbers Every Programmer Should Know

|                             |
|-----------------------------|
| ■ 1ns                       |
| ■ L1 cache reference: 0.5ns |
| ■ Branch mispredict: 5ns    |
| ■ L2 cache reference: 7ns   |
| ■ Mutex lock/unlock: 25 ns  |
| ■ 100 ns                    |



Source: <https://gist.github.com/2841832>

1μs

# Balance analysis

35

Theoretical performance of loop based codes.

Machine balance (processor chip) =  
memory bandwidth / peak performance

| Data Path    | Balance (units) |
|--------------|-----------------|
| cache        | ? $10^0$        |
| RAM          | ?               |
| interconnect | ?               |
| Disk         | ?               |

Machine balance will decrease or increase in future ? *Decrease*

Typical historical trend (balance) ?? *Present*

Computer Behavior

$$= \textcircled{2.5 \text{ GB}} \times \textcircled{4} \times \textcircled{10}$$

FLOPs

$$2.5 \times 10^9 \text{ cycles/sec} \times 4 \frac{\text{Flops}}{\text{cycles}}$$

$$\text{MB} = \textcircled{10^{11}} \times \textcircled{10^3} \text{ Sec}$$

$$10^6 \text{ FLOPs/sec}$$

$$= \frac{10^{11}}{10^3}$$

$$= \frac{10^6 \text{ FLOPS}}{\text{sec}}$$

To understand the results of vector triad , we need certain concepts i.e. techniques to improve application performance

- \* Pipelined functional units
- \* superscalar architecture
- \* Data Parallelism through SIMD
- \* Out of order execution
- \* Larger Caches

Vector Processor



This pipeline has a latency (depth) of five cycles because after this 1st result is available

Food - some operation like Addition or Multiplication

Items/ dishes - sub-tasks to complete operation

Steps - stages of pipeline

N students - N operations

Simplest pipeline --> fetch (item 1) - decode (item 2) - execute (item 3)

5 Stage pipeline - —



\* Pipeline context- Students taking food in a queue (line)

\* Total N students

\* Complete Food (lunch) contains 5 items (dishes) - to be taken one after the other

\* All students taking same food

\* serving/taking each item is a task (step) - total 5 steps required

\* Workers (serving food) are highly skilled and specialized for a single task

\* Each worker executes same work but works on different objects (serving same food item to different students)

\* All tasks (steps) take around same time (1 cycle)

\* Forwarding the partially-finished work to the next

Dinner / Lunch



cycle - 1

cycle nins

for i = 1 N  
 $A[i] = A[i] + C[i]$   
 }

## Pipelining of Functional units

Core can work on several independent instructions simultaneously (parallelly) - working on 5 additions at the same time after the wind-up phase

One instruction gets finished at each cycle after the pipeline is full

Widely used in modern processors

Generalize : Pipeline of depth "m"

Total independent operations to be executed - N

Total time taken to finish ( $T_{\text{pipeline}}$ ) in cycles =  $m + N - 1 = \underline{N} + \underline{m - 1}$

Unit without pipeline takes "m" cycles to generate a single result; how much for N results -  $\underline{mN}$

$$\text{Speedup} = \frac{T_{\text{Serial}}}{T_{\text{Pipeline}}} = \frac{mN}{N+m-1} \approx \frac{m}{m} \quad (N \gg m)$$

Compute Throughput of the pipeline =  $\frac{N}{N+m-1} = \frac{1}{1 + \frac{m-1}{N}}$

$N \rightarrow \text{large}$

so. Throughput  $\rightarrow 1$

\* Right loop performance is not good

✓ Pipeline subtask - load, store, address calculation, decode, execute etc.

Job of compiler - arrange instructions in such a way as to make efficient use of all the different pipelines

$N = 1 \text{ loop}$

$m = 5$

loop

for  $i=1$ ,  $\underline{N}$

$+$   
 $m$

$\overbrace{N \gg m}$

Addition - 5

Opuntion - 50



$$\left| \begin{array}{l} P = \\ \cdot 5 = \frac{1}{1 + \frac{m-1}{N}} \\ \cdot 8 \end{array} \right. \quad \left. \begin{array}{l} N \gg m \\ 0 < P \leq 1 \end{array} \right.$$



\* Either build a small and fast memory or a large and slow memory

RAM

Cache

✓ Multiplicity- multiple levels - the larger the slower. Typical caches - KBytes to MBytes

\* Data transfers occur in a block (bursts) of single cache lines (~64 bytes) from one level to another level.

Goal-avoid slow data paths- optimization

Data transfer basic model in terms of latency and bandwidth - for different data paths

Data - N bytes ✓

Latency -  $T_L$  (wait time) ✓

Bandwidth - B bytes/sec (after data starts flowing)

Transfer time for message  $T = T_L + N/B$

Effective bandwidth of data transfer =  $N/T = N/(T_L + N/B)$

If it does - Cache hit

if it does not  
→ Cache Miss





## *Summary*

- *Statement order must not matter.*
- *Statements must not have dependences.*
  - *Some dependences can be removed.*
  - *Some dependences may not be obvious.*

*Eliminating dependence by Scalar Expansion or privatization*

```
for (I = 0; I < 100, I++  
T = A[I];  
A[I] = B[I];  
B[I] = T;
```

## Cost of cache miss

understand the difference between hit and miss

Hit time - time to deliver a line in the cache to processor (time to determine whether line is in cache included) - typical hit times - 1-2 clock cycles for L1 cache

Miss penalty - typically 10-100 of cycles

99% hit vs 97% hit --> do we need to worry and do something (increase hit)

Cache hit time - 1 cyle

Cache miss penalty - 100 cycles (L1 cache vs main memory)



Worry



✓ Peak Compute Performance  $\rightarrow$  Clock Speed  $\times$   $\left( \frac{\text{GFLOPs}}{\text{cycle}} \right) \times \text{no. of rows} \quad ( \quad )$

10 GFLOPS/sec

### Memory System Performance (Read throughput)

- \* 64 bit DDRAM interface (Double rate)
- ✓ \* 8 channels
- \* 1 GHz clock freq.

Peak Access Throughput =  $\frac{(64 \text{ bit})}{\text{Transfer}} \times \frac{(2) \text{ transfers}}{\text{Clock cycle / Channel}} \times 8 \text{ channels} \times \frac{1 \text{ (G) clock cycle}}{\text{second}}$

$$= 128 \text{ GB/sec}$$

# Simultaneous multi-threading (SMT)



- Pipelines often underutilized due to dependencies, waiting times, etc.
- "Pipeline bubbles" mean wasted resources



- SMT can improve the utilization of pipelines
- It does not improve any other resources on the chip!
- Need to run multiple threads

# SIMD processing

$$A[i] = \beta[i] \oplus c_i$$

- Single Instruction Multiple Data (SIMD) instructions allow the execution of the same operation on “wide” registers from a single instruction
- x86 SIMD instruction sets:
  - SSE: register width = 128 Bit → 2 double precision floating point operands
  - AVX(2): register width = 256 Bit → 4 double precision floating point operands
  - AVX-512: ... you guessed it!
- It is **not specified** if these operations are **concurrent**
  - They mostly are, though, on modern standard CPUs



# Scalar (non-SIMD) execution

```
double *A, *B, *C;  
for (int j=0; j<size; j++) {  
    A[j] = B[j] + C[j];  
}
```

Register width:

- 1 operand (scalar)



# Data-parallel execution (SIMD)

$\text{freq} \leftarrow \frac{\text{blocks}}{\text{num}} \times 4$

```
double *A, *B, *C;  
for (int j=0; j<size; j++) {  
    A[j] = B[j] + C[j];  
}
```

Register widths (double prec.):

- 1 operand
- 2 operands (SSE)
- 4 operands (AVX)  $\swarrow 256$
- 8 operands (AVX512)

$A[0]$   
 $A[1]$   
 $A[2]$   
 $A[3]$

## SIMD execution



## When two statements can be executed in parallel?

Suppose we have 2 statements

✓ statement 1 ✓  
✓ statement 2 ✓

✓ and we wish (2 processors execute independently, i.e. no control over order of execution between processor):

✓ statement 1 in proc 1  
✓ statement 2 in proc 2

✓ Main requirement: their order of execution should not matter i.e.

statement 2 ✓  
statement 1 ✓

statement 1 ✓  
statement 2 ✓

case 1

a=1 ✓  
b=2 ✓

case 2

a=1 ✓  
b=a ✓

Case 3

b=a ✓  
a=1 ✓

Case 4  
a=2 ✓  
a=1 ✓

Case 5  
a=f(y)  
b=a ✓

## Dependences : types

### \* Flow (true) dependence

statement i precedes j and i computes a value that j uses

- (1)  $x=1$       (2)  $y=x+2$       (3)  $x=z-w$       (4)  $x=y/z$

Dependence between  
 $1 \rightarrow 2$   
 $2 \rightarrow 4$

### \* Anti dependence

statement i precedes j and i uses a value that j computes

- (1)  $x=1$       (2)  $y=x+2$       (3)  $x=z-w$       (4)  $x=y/z$

$2 \rightarrow 3$

### \* Output dependence

statement i precedes j and i computes a value that j also computes

- (1)  $x=1$       (2)  $y=x+2$       (3)  $x=z-w$       (4)  $x=y/z$

$3 \rightarrow 4$

### \* Input dependence

statement i precedes j and i uses a value that j also uses

- (1)  $x=1$       (2)  $y=x+2$       (3)  $x=z-w$       (4)  $x=y/z$

$3 \rightarrow 4$

RAW  $\rightarrow$  Read after write dependence. (time depend.)

$$a_1 = a + 1$$

$$\text{Term1} = a_1 * S1$$

$$a_1 = a_1 + b$$

$$\text{Term2} = a_1 * S2$$

WAR  $\rightarrow$  Write after Read (Anti-depend.)

Dependencies can be eliminated by renaming variables.

\* If we say dependence flows from  $i$  to  $j$  then  $i$  is source and  $j$  is sink.

Flow dependence is the true dependence, other dependences can be eliminated by renaming

\* Data dependency graph: (1) nodes are the statements and (2) directed edges are dependence relations

Dependence between loop iteration

$\left\{ \begin{array}{l} \text{for } i=1, n \\ \text{for } j=2, m \\ b[i, j] = \dots + b[i, j-1] \end{array} \right\}$

$$\begin{aligned} a[0] &= 1 \\ \text{for } i=1, N \\ a[i] &= a[i-1] + a[i-1] \end{aligned}$$

$$\left\{ \begin{array}{l} i=3 \\ i=1 \\ i=2 \end{array} \right. \quad \left. \begin{array}{l} a[1] = a[0] + a[0] \\ a[2] = a[1] + a[1] \end{array} \right\} \Rightarrow \begin{array}{l} T_1 \\ T_2 \end{array}$$

has to be evaluated in a particular order.

(1) loop independent dependence  
(2) loop carried dependence

$\left\{ \begin{array}{l} \text{for } i=1, N \\ a[i] = b[i] + c[i] \\ d[i] = a[i] \end{array} \right\}$

dependence flows within the same iteration  
dependence distance = 0 ✓

in the above case dependence flows within same iteration (loop independent ; dependence distance=0)

$\left\{ \begin{array}{l} \text{for } i=1, N-1 \\ a[i] = b[i] + c[i] \\ d[i] = a[i-1] \end{array} \right\}$

dependence flows from 1 iteration to next (loop carried dependence; dependence distance=1) ✓

$$\left\{ \begin{array}{l} a[1] = b[1] + c[1] \\ d[1] = a[0] \end{array} \right\} \text{ iteration 1}$$
$$\left\{ \begin{array}{l} a[2] = b[2] + c[2] \\ d[2] = a[1] \end{array} \right\} \text{ iteration 2}$$

## ⑦ Detect dependences (rules)

1) Variable only read & never written  $\rightarrow$  no dependence.

2) For each written variable  $\rightarrow$  if there are access ~~is~~ in other iterations than the current  $\rightarrow$  dependence problem present.

\* When loop has no dependence.

1) Let's say all assignments are made to arrays.

2) Each element is assigned by at most one iteration.

3) No iteration reads elements assigned by any other iteration.

1)  $\text{for}(i=1, i \leq N, i++)$

$$a[i] = a[i] + a[i-1]$$

$$\left\{ \begin{array}{ll} i=1 & a[1] = a[1] + a[0] \\ i=2 & a[3] = a[3] + a[2] \\ i=3 & a[5] = a[5] + a[4] \end{array} \right. \rightarrow \begin{array}{l} \text{Can be parallelized} \\ \text{No dependence} \end{array}$$

2)  $\text{for}(i=0, i \leq N_L, i++)$

$$a[i] = a[i] + a[i+N_L]$$

$$\left\{ \begin{array}{ll} i=0 & a[0] = a[0] + a[N_L] \\ i=1 & a[1] = a[1] + a[N_L+1] \end{array} \right. \rightarrow \begin{array}{l} \text{No dependence.} \\ \text{No dependence.} \end{array}$$

3)  $\text{for}(i=0, i \leq N_L, i++)$

$$a[i] = a[i] + a[i+N_L]$$

$$\left\{ \begin{array}{ll} i=0 & a[0] = a[0] + a[N_L] \\ i=1 & a[1] = a[1] + a[N_L+1] \\ i=N_L & \end{array} \right. \rightarrow \begin{array}{l} \text{No dependence} \\ \text{Not parallelizable} \end{array}$$

## *Loop carried dependences*

```
a[0]=1  
for (i=1; i<N; i++) {  
    a[i]= a[i] +a[i-1]  
}
```

*Approach:*

*Look into the pattern for some iterations*

*Can this loop be parallelized?*

## *How to detect dependences*

- \* Look carefully how each variable is used within a iteration
- \* Is the variable only read and not written (if yes - no dependences)
- \* For each variable which is written - find out whether there is any accesses in other iterations than the current? if yes then dependences are present

*Other important points for no dependences:*

- \* Each element is assigned by at most one iteration
- \* No iteration reads elements assigned by any other iteration

```
for (i=1; i<N; i+=2) {  
    a[i]= a[i] +a[i-1]  
}
```

*Approach:  
Look into the pattern for some iterations*

*Can this loop be parallelized?*

```
for (i=0; i<N/2; i++) {  
    a[i]= a[i] +a[i+N/2]  
}
```

*Approach:*  
*Look into the pattern for some iterations*

*Can this loop be parallelized?*

```
for (i=0; i<=N/2; i++) {  
    a[i]= a[i] +a[i+N/2]  
}
```

*Approach:*  
*Look into the pattern for some iterations*

*Can this loop be parallelized?*

```
a[0]=0  
for (i=1; i<N; i++) {  
    a[i]= a[i-1] +i  
}
```

*Approach:*

*Can this loop be parallelized?*

```
for (j=0; j<N; j++) {  
    a[ id[j] ]= a[ id[j] ] +b[ id[j] ]  
}
```

*Approach:*

*Can this loop be parallelized?*

Can we remove dependences in the following cases - is there false dependences

### Case-1

- (1)  $sum = a1 + 1$
- (2)  $b1 = sum * c1$
- (3)  $sum = a2 + 1$
- (4)  $b2 = sum * c2$

4 threads X

### Case 2

```
for (i=0; i<N; i++) {  
    x=(b[i] + c[i]) / 2  
    a[i]= a[i+1] + x  
}
```

i=0

$$x = (b[0] + c[0]) / 2$$

$$a[0] = a[0] + x$$

$$i=1$$
$$x = (b[1] + c[1]) / 2$$
$$a[1] = a[2] + x$$

i=3

$$i=2$$
$$x = (b[2] + c[2]) / 2$$
$$a[2] = a[3] + x$$

Can this loop be parallelized?

```
for (i=0, i<N, i++) {  
    a2[i] = a[i+1]  
}
```



$$\begin{aligned} & \text{for } (r=0, i \leftarrow n, i \leftarrow) \\ & \quad \times \text{ for } (j=0, j \leftarrow n, j \leftarrow) \\ & \quad a[i, j+1] = a[i, j] + 1 \end{aligned}$$

} can we parallelize

$$\begin{array}{ll} \bar{v} = 0 & T_0 \\ v = 1 & T_1 \\ \bar{v} = 2 & T_2 \end{array}$$

Can parallelise, but not on i

for ( $i = 1, i \leq \underline{N}, i++$ )  
 /  $\alpha[i-1] = b[i]$   
 /  $c[i] = \underline{\alpha[i]}$  }

$$\left. \begin{array}{l} a[0] = b[1] \\ c[1] = a[1] \\ a[1] = b[2] \\ c[2] = a[2] \end{array} \right\} \begin{matrix} i=1 \\ i=2 \end{matrix}$$

for ( $i=1$ ,  $i < N$ ,  $i \rightarrow i+1$ )  
 $\alpha[i-1] = \underline{\lambda[i]}$   
 for ( $j=1$ ,  $j < N$ ,  $j \rightarrow j+1$ )  
 $C[j] = \lambda[j]$

$$\begin{aligned} a[0] &= b[1] \\ a[1] &= b[2] \\ a[2] &= b[3] \end{aligned}$$

proposed  
Replacement

$$\underline{N = 10^9}$$

Bottom  
Locality

The diagram shows a horizontal line representing a program's flow. A curved arrow labeled "Loop fusion" points from the left towards the center. Another curved arrow labeled "Loop distribution" points from the center towards the right. The word "Loop" is written above both arrows.

The  
bot can  
be  
parallelized

- \* Assuming both are giving same results
- \* Both are doing same amt of computation.

## *Summary*

- *Statement order must not matter.*
- *Statements must not have dependences.*
  - *Some dependences can be removed.*
  - *Some dependences may not be obvious.*

*Eliminating dependence by Scalar Expansion or privatization*

```
for (I = 0; I < 100, I++  
T = A[I];  
A[I] = B[I];  
B[I] = T;
```

# Simple Optimizations of serial code

# Case A vs. Case B

47

## Case A

```
logical :: flag  
flag = .false.  
do i=1,N  
if(complex_func(A(i)) < TRESHOLD) then  
    flag=.true.  
    EXIT  
Endif
```

Do less work

## Case B

```
logical :: flag  
flag = .false.  
do i=1,N  
if(complex_func(A(i)) < TRESHOLD) then  
    flag=.true.  
endif  
enddo
```

**Which one  
we should  
choose !!**

**Why?**

# Case A vs Case B

$A = A + B^{**}2$   
(A,B float)

$$\boxed{a = a + b^2}$$

case  $\Rightarrow$  one to one translation

~~$A = A + B^{**}2.0$~~  (case A) ~~X~~  
 $A = A + \underline{B} * B$  (case B) ✓

✓ Add, multiplication

✓ DIV | SAR | SIN | COS | TAN

Cost difference?

$$\underline{B^{**}2} = \exp \left\{ 2 \ln(B) \right\}$$

Lookup Table

SIN | COS | TAN

Know the form of arguments

# Case A vs B

## Case A

```
2 do i=1,N  
3     A(i)=A(i)+s+r*sin(x)  
4 enddo
```

## Case B

```
tmp=s+r*sin(x)  
do i=1,N  
    A(i)=A(i)+tmp  
enddo
```

```
1 do j=1,N  
2   do i=1,N  
3     if(i.ge.j) then  
4       sign=1.d0  
5     else if(i.lt.j) then  
6       sign=-1.d0  
7     else  
8       sign=0.d0  
9     endif  
10    C(j) = C(j) + sign * A(i,j) * B(i)  
11  enddo  
12 enddo
```

## Case A

```
1 do j=1,N  
2   do i=j+1,N  
3     C(j) = C(j) + A(i,j) * B(i)  
4   enddo  
5 enddo  
6 do j=1,N  
7   do i=1, j-1  
8     C(j) = C(j) - A(i,j) * B(i)  
9   enddo  
0 enddo
```

## Case B

```

do i=1,N,2          (N^2)
  do j=1,N
    c(i)=c(i)+a(i,j) * b(j)
    c(i+1)=c(i+1)+a(j,i+1)*b(j)
  enddo
enddo

```

### Case A

Locality is better.

$$N=10^9$$

### Case B

```

do i=1,N          (N^2)
  do j=1,N
    c(i)=c(i)+a(j,i)*b(j)
  enddo
enddo

```

(C1)

(C1)

- C2

(C2)

## *Example of locality*

### *Case-1*

```
total=0  
for (i=0,i<n,i++)  
total += a[i]  
}  
return total
```

*Which locality (data)?  
Look into each iteration*

*Skill: identify the pattern of data access in the code/ algorithm*

CASE-2 (C code)  
*n is very large*

```
total=0  
for (i=0,i<n,i++)  
    for (j=0, j<n, j++)  
        for(k=0, k<n, k++)  
total += a[k] [i] [j]  
  
return total
```

*Optimization for the memory hierarchy - improve locality*

- \* Proper choice of algorithms
- \* Loop transformations

Principle of locality → Spatial locality and temporal locality.

Exploit principle of locality. Memory access on blocks of data instead of individual data items. These are called cache blocks.

- **Do less work!**
- **Avoid expensive operations!**

FP MULT & FP ADD are the fastest way to compute

- **Replace expensive functions by table lookup**
  - Avoid DIV / SQRT / SIN / COS / TAN → table lookup
- Explore whether data is close to CPU!
- **Elimination of common subexpressions!**
- **Avoid branches!**

Support the compiler to understand and optimize your code!

Previous Lecture:

## Memory Levels : DRAM and Caches

### Cache Hit and Cache Miss

Cache Lines: 64 bytes (CPU requests data from RAM in blocks of 64 bytes)

Latency in Clock Cycles.

Clock cycle: 1 tick of the CPUs timer - smallest possible duration of the time for the device. 1GHz clock cycle - 1 billion ticks per second.

Latency - the time the CPU takes to wait to perform something (measured in clock cycles)

X86/ X64 caches: L1, L2, L3

L1 - smallest (KB) and fastest

L2 - medium sized (hundreds of KB to few MB) and medium speed

L3 - largest (tens of MB) but slowest

Some hypothetical representative numbers

| Memory               | Latency (Clock Cycles) |
|----------------------|------------------------|
| Register             | 1                      |
| L1                   | 3-4                    |
| L2                   | 15-20                  |
| L3                   | 50-60                  |
| Memory<br>Hard Drive | 100-200<br>~ Millions  |

## The Landscape of Parallel Computing Research: A View from Berkeley<sup>8</sup>

- Why parallel computing ?
- What should parallel computing achieve ?
- Which disciplines parallel computing should address ?

| Conventional Wisdom                                                                                                                                                             | Conceptual shift                                                                                                                    |
|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------|
| Power is Free, Transistors are expensive                                                                                                                                        | Transistors are free, Power is expensive → <b>Power Wall</b>                                                                        |
| Uniprocessor performance doubles every 18 months (Moore's law)                                                                                                                  | <b>Power Wall + Memory Wall + ILP Wall → Brick Wall.</b> Now doubling uniprocessor performance may take 5 years as per recent trend |
| Hardware design can increase in size with more transistors without any negative impact                                                                                          | Wire delay, clock jitter, noise, cross coupling stretches the limits in design for feature size < 65nm                              |
| Multiply functional unit is slow but load and store memory operations are fast                                                                                                  | Due to increasing DRAM gap, load and store are much slower than floating point multiplication → <b>Memory Wall</b>                  |
| Instruction Level Parallelism (ILP) through Out-of-Order execution, speculation, compiler optimization such as loop unrolling improves performance of "multicore" architectures | Diminishing results for ILP in "manycore" architectures → <b>ILP Wall</b>                                                           |

# continued

10

| Conventional Wisdom                                                                 | Conceptual shift                                                                                                     |
|-------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------|
| Increase in frequency is primary method of improving processor performance          | Can't burn the processor by increase in frequency. Increasing parallelism is primary method of improving performance |
| Less than linear scaling in performance for multiprocessor application is a failure | By switching to parallel computing, any speedup via parallelism is a success                                         |
|                                                                                     |                                                                                                                      |

# Applications

12

| Application Domain        | Application Examples                                                                                                                                                          |
|---------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Embedded Computing        | Consumer: JPEG, RGB to CYMK, Entertainment: MPEG decode-encode, Networking: IP NAT, OSPF, Route Lookup, Automation: Text Processing etc                                       |
| General Purpose Computing | Computational science, Fluid dynamics, Molecular dynamics, Computational Electromagnetics, Weather modelling, network simulation, XML transformation , Video compression, etc |
| Machine Learning          | Support Vector Machines, Principal Component Analysis, Spectral Clustering, Expectation Maximization, Bayesian Networks etc                                                   |
| Graphics and Game         | Reverse Kinetics, Spring models, Texture Maps, Smoothing, Interpolation, Collision Detections and Responses etc                                                               |
| Databases                 | Query Optimization, MapReduce, Hashing etc                                                                                                                                    |