

## Course Outline

- . General Overview – HPC ✓
- . Hardware + basic optimization techniques
- . Parallel algorithm design ✓
- . Shared and Distributed ✓
- . Technical (OpenMP and MPI)
- . Performance modeling of parallel algorithms ✓
- . Important parallel patterns
- . Application (assignments and project)  
\_\_\_\_\_

# Lab 207

**Operating system :** Scientific Linux ✓

**Compilers and Libraries:** The entire GNU compiler suite - gcc, g++, and g77

The Java Execution and Development Environments.

Python programming language. Also Matplotlib, Numpy and Scipy

## **Parallel Programming Libraries and tools:**

✓ OpenMP - API for directing multi-threaded shared memory parallelism.

✓ OPENMPI - open source implementation of MPI (access to HPC Cluster)

GNU Gprof - performance analysis tool for Unix applications.

**Important Scientific Software:** Scilab; Octave; R - software environment for statistical computing and graphics.

**Visualization:** Gnuplot. Paraview.

**Documentation and reader:** Latex

## **Access to HPC Cluster**

## *Module 1: Main topics*

1. *Introduction to high performance computing and Systems Perspective.*
2. *Thinking in serial vs thinking in parallel.Need for Parallel Computing.*
3. *Performance Metrics : Compute performance and memory performance. Latency, bandwidth, throughput. Balance Analysis.*
4. *Modern Processors*
5. *Pipeline and Memory Hierarchies*
6. *Data access optimizations. Performance Modeling- benchmarks: Vector Triad. Measuring performance and profiling.*
7. *Dependences and loops*
8. *Scope of parallelism and Parallel computing metrics*

## *References:*

- \* *Introduction to HPC for Scientists and Engineers by G Hager and G Wellein*
- \* *Let's HPC: A web-based platform to aid parallel, distributed and high performance computing education*  
<https://doi.org/10.1016/j.jpdc.2018.03.001>

Goal of HPC: to achieve the maximum possible performance out of a particular system for a particular problem

\* Performance  $\rightarrow$  Time ( $\text{Work}/\text{time} \rightarrow \text{Rate}$ )

\* System  $\rightarrow$  Architecture (Computing)  $\rightarrow$  Non-deterministic is nature

\* Problem  $\rightarrow$  Algorithm (Application)

Need to understand as a whole not in parts.

What is expected from a good programmer?

\* Accuracy  $\rightarrow$  Correctness

\* Efficiency (Performance - time)

Accuracy is proportional to no. of computations



Evaluation of the program in the given software-hardware setting

## Algorithm design

### Thinking in serial

1 km Road (Sweep)



Resource → Scrubbing tool

Adds  $n$  numbers  
 $\frac{5 \text{ hrs}}{\text{for } i=1 \dots n}$   
 end



### Thinking in Parallel



- \* Problem decomposition (ideal)  $\Rightarrow$  person
- & Resource to every person  $\hookrightarrow$  1 hr
- ~~for  $i=1 \dots n$~~   $\left|$  for  $i=201 \dots 4n$
- ~~partial Avg~~  $\left|$  partial Avg



Dig a hole

---

1m diameter  
1m deep -

5 feeding

Reproducing

10 grasses

# Serial vs Parallel



We need a systems perspective : why?

systems are in general non-terminating and non-deterministic, whereas the behavior of algorithms is terminating, deterministic and platform-independent



## Organization of Parallel Platforms



Complexity

$O(n)$

$O(\log n)$

$O(n^3)$

Solution



Theory of parallel algorithm design or theory of parallel computing is based on abstract concepts of time and memory which may ignore real life constraints for simplicity, and therefore not take into account non-deterministic and hardware factors

The performance of a computer program depends on a wide range of factors like the nature of the algorithm, the machine (several hardware factors), compiler optimizations, the runtime environment, the input, the measurement methodology etc. and their mutual interaction

Not enough just to get some speedup, but being able to explain the speedup from a systems viewpoint in terms of resources.

$$T = S$$

$$S = 4.5 \text{ ms}$$

$$1 \text{ ms}$$

$$5 \text{ ms}$$



Figure 1. Concept map for parallel computing. Adapted from [9].

## Module 1:

Modern Processor, Performance Metrics and benchmark, Memory Hierarchies, Caches, Pipelining, SIMD, SPMD, performance estimates, profiling, vector triad, serial code optimizations, data access optimizations, balance analysis, dependence testing, granularity, concurrency, loop fusion, loop distribution. Cache miss, cache hit, block matrix multiplication. Amdahl's law.

# Computer Performance

5

What is HPC and goal of HPC ??

~~✓~~ Quantitative measure !!

# Computer Performance

What is ~~HPC~~??

➤ Measure of computer performance - FLOPS (FLoating-point Operations Per Second).

➤ Performance → We need Clock speed and microprocessor (Flops/cycle)

➤  $\text{FLOPS} = \text{clock rate} \times (\text{flop/cycle}) \times \text{no. of cores}$  units

➤ Microprocessors today can do 4 FLOPs per clock cycle.

➤ A single-core 2.5 GHz processor has a theoretical performance of  $10 \text{ e}9$   
 $\text{FLOPS} = 10 \text{ GFLOPS}$

Aggregating computing power in such a way that delivers much higher performance than a typical desktop computer (Look at FLOPS formula !!.) → to solve large problems.

Useful in - science, engineering, or business.

$$\begin{aligned}
 & 2.5 \times 10^9 \times 4 \text{ FLOP} \\
 & \times 1 \\
 & (2.5 \text{ GHz}) \\
 & \text{Clock rate} \\
 \rightarrow & \\
 & \text{Flops} \\
 & \frac{\text{cycles}}{\text{second}} \times \frac{\text{FLOPs}}{\text{cycle}}
 \end{aligned}$$

Lecture 2  
CS301 - HPC  
Course Instructor : B. Chaudhury

## Previous Lecture

Performance= Work/Time = FLOP/Wall clock time

FLOP (addition, multiplication, division etc.)

FLOPS --> Number of FLOP/Second

FLOPS (flops per second) can be used to characterize a computing system (peak theoretical performance) and it can also be used to characterize a program (code) i.e. actual performance of the program in terms of how many flops per seconds

# High performance computing trend

21

1 exaFLOPS (EFLOPS)  
Before 2020.



**Figure 1.** Data showing performance of Top500 supercomputers on the LINPACK test over the past 20 years. The top line is the sum of the top 500 systems, the middle line is no. 1, and the bottom line is no. 500 [1]. (Online version in colour.)

<http://dx.doi.org/10.1098/rsta.2013.0319>

# HPC trends

23



Figure 3. Evolution in transistor count, clock frequency, number of cores and power consumption for processors over the past 40 years. (Online version in colour.)

- No more serial --- It's a Parallel World.....We need to know how to work with these instruments!

# HPC Applications

## Science and Engineering:

- Atmosphere, Earth, Environment
- Physics - applied, nuclear, particle, condensed matter, high pressure, fusion.
- Bioscience, Biotechnology, Genetics
- Chemistry, Molecular Sciences
- Mechanical Engineering
- Electrical Engineering, Microelectronics
- Computer Science, Mathematics
- Defense, Weapons

## Industrial and Commercial:

- Databases, data mining
- Oil exploration
- Medical imaging and diagnosis
- Pharmaceutical design
- Financial modeling
- Entertainment industry
- Web search engines, web based business services

# Software Developer Jobs

Mentioning "performance"



Mentioning "optimization"



Mentioning "parallel"



Mentioning "concurrency"



Source: Monster.com

# What do we learn?

~~Serial Performance Scaling is Over~~

- ! Cannot continue to scale processor frequencies
- ! no 10 GHz chips

- ! Cannot continue to increase power consumption

- ! can't melt chip

- ! Can continue to increase transistor density

- ! as per Moore's Law

You **must** re-think your algorithms to be parallel !

High Time to learn Parallel Programming →

## The Landscape of Parallel Computing Research: A View from Berkeley<sup>8</sup>

- Why parallel computing ?
- What should parallel computing achieve ?
- Which disciplines parallel computing should address ?

| Conventional Wisdom                                                                                                                                                                     | Conceptual shift                                                                                                                                          |
|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------|
| <u>Power is Free, Transistors are expensive</u>                                                                                                                                         | Transistors are free, Power is expensive → <u>Power Wall</u>                                                                                              |
| <u>Uniprocessor performance doubles every 18 months</u> (Moore's law)                                                                                                                   | <u>Power Wall + Memory Wall + ILP Wall</u> → <u>Brick Wall</u> . Now doubling <u>uniprocessor</u> performance may take 5 years as per <u>recent trend</u> |
| Hardware design can increase in size with more transistors without any negative impact                                                                                                  | Wire delay, clock jitter, noise, cross coupling stretches the limits in design for feature size < 65nm                                                    |
| <u>Multiply functional unit is slow but load and store memory operations are fast</u>                                                                                                   | Due to <u>increasing DRAM gap</u> , <u>load and store are much slower than floating point multiplication</u> → <u>Memory Wall</u>                         |
| Instruction Level Parallelism (ILP) through <u>Out-of-Order execution, speculation</u> , compiler optimization such as loop unrolling improves performance of "multicore" architectures | <u>Diminishing results for ILP</u> in "manycore" architectures → <u>ILP Wall</u>                                                                          |

# continued

10

| Conventional Wisdom                                                                 | Conceptual shift                                                                                                    |
|-------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------|
| Increase in frequency is primary method of improving processor performance          | Cant burn the processor by increase in frequency. Increasing parallelism is primary method of improving performance |
| Less than linear scaling in performance for multiprocessor application is a failure | By switching to parallel computing, any speedup via parallelism is a success                                        |
|                                                                                     |                                                                                                                     |



# Applications

12

| Application Domain                 | Application Examples                                                                                                                                                          |
|------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Embedded Computing<br>_____        | Consumer: JPEG, RGB to CYMK, Entertainment: MPEG decode-encode, Networking: IP NAT, OSPF, Route Lookup, Automation: Text Processing etc                                       |
| General Purpose Computing<br>_____ | Computational science, Fluid dynamics, Molecular dynamics, Computational Electromagnetics, Weather modelling, network simulation, XML transformation , Video compression, etc |
| Machine Learning<br>_____          | Support Vector Machines, Principal Component Analysis, Spectral Clustering, Expectation Maximization, Bayesian Networks etc                                                   |
| Graphics and Game<br>_____         | Reverse Kinetics, Spring models, Texture Maps, Smoothing, Interpolation, Collision Detections and Responses etc                                                               |
| Databases<br>_____                 | Query Optimization, MapReduce, Hashing etc                                                                                                                                    |

# Common kernels

Numerical Algorithm behind an Application

| ID | Dwarf                            | Description                                                                                                                                      |
|----|----------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------|
| 1  | Dense Linear Algebra             | Dense Vector-Vector, Matrix-Vector, Matrix-Matrix operations, Use of stride access for rows and columns                                          |
| 2  | Sparse Linear Algebra (SpMV etc) | Sparse data (with many zeroes) stored in compressed format. Data access with indexed load and store                                              |
| 3  | Spectral Methods (FFT etc)       | Data in frequency domain, data accessed in multiple butterfly stages with all-to-all for some stages and local to others                         |
| 4  | N-Body methods                   | Depends on interaction between many discrete points                                                                                              |
| 5  | Structure Grids                  | Data in regular grid format, points on grid updated together with high spatial locality. Subdivides grids into finer grids with area of interest |
| 6  | Unstructured Grids               | Data locations in grids as per application characteristics                                                                                       |

# Common kernels

First 7 kernels/dwarfs covers most computational science applications

14

|    |                            |                                                                                                                                          |
|----|----------------------------|------------------------------------------------------------------------------------------------------------------------------------------|
| 7  | Monte Carlo or MapReduce   | Calculation is done as per repeated random trials which are embarrassingly parallel                                                      |
| 8  | Combinational Logic        | Implemented using logical function and stored states used to perform simple operation on large amount of data e.g. calculating CRC       |
| 9  | Graph Traversal            | Visit nodes following successive edges, computationally less expensive                                                                   |
| 10 | Dynamic Programming        | Solve simpler overlapping problems, useful for optimization                                                                              |
| 11 | Backtrack and Branch+Bound | Finds an optimal solution by recursively dividing the feasible region into subdomains, and then pruning sub-problems that are suboptimal |
| 12 | Graphical Models           | Graphs with nodes as random variables and edges as conditional dependencies. E.g. Bayesian Network, Hidden Markov Models                 |
| 13 | Finite State Machine       | System behavior defined by states, transitions defined by input and current state and event associated with transitions or states        |

# Important Patterns

| Dwarf/Kernel            | Embedded Computing | General Computing | Machine Learning | Graphics and Games | Databases |
|-------------------------|--------------------|-------------------|------------------|--------------------|-----------|
| Dense Linear Algebra    | √                  | √                 | √                |                    | √         |
| Sparse Linear Algebra   | √                  | √                 | √                | √                  |           |
| Spectral Methods        | √                  |                   | √                | √                  |           |
| N-Body methods          |                    | √                 |                  |                    |           |
| Structure Grids         | √                  | √                 |                  | √                  |           |
| Unstructured Grids      |                    |                   | √                |                    |           |
| Monte Carlo (MapReduce) |                    | √                 | √                |                    | √         |

# Common patterns

| Dwarf/Kernel               | Embedded Computing | General Computing | Machine Learning | Graphics and Games | Databases |
|----------------------------|--------------------|-------------------|------------------|--------------------|-----------|
| Combinational Logic        | √                  |                   | √                |                    | √         |
| Graph Traversal            | √                  |                   | √                | √                  | √         |
| Dynamic Programming        | √                  | √                 | √                |                    | √         |
| Backtrack and Branch+Bound |                    | √                 | √                |                    |           |
| Construct Graphical Models | √                  | √                 | √                |                    |           |
| Finite State Machine       | √                  | √                 |                  | √                  |           |

# Challenges

17

- Application design may have combination of different kernels → How to use architecture and programming model tuned in individual kernels/dwarfs for combination design
- Algorithm implementation for different kernels have preferred data structure → Solution having combination of different dwarfs require data structure translation

# Terminology

- The cores perform FLOPS.
- A processor contains one or more (CPU) cores.
- A socket holds one processor.
- A node contains one or more sockets.



**Supercomputer = several nodes → HPC**

HPC world → node peak theoretical performance:

Node performance in GFlops = (CPU speed in GHz) x (number of CPU cores) x (CPU instruction per cycle) x (number of CPUs per node).



**CPU → Serial & Parallel (MPI, OpenMP etc.)**

How many cores we need for 1 Teraflop (cores@2.5 GHz)?

# Terminology --- visual



Single O/S  
Node - standalone Von Neumann computer  
CPU / Processor / Socket - each has multiple cores / processors.



The same Parallel code can be executed on a laptop (multi-core) and can be also executed on a supercomputer

# What do we learn?

## Serial Performance Scaling is Over

- ! Cannot continue to scale processor frequencies
- ! no 10 GHz chips

- ! Cannot continue to increase power consumption
- ! can't melt chip

- ! Can continue to increase transistor density
- ! as per Moore's Law

You **must** re-think your algorithms to be parallel !

High Time to learn Parallel Programming →

# Modern Processor and Principle of multiplicity



## Single core Architecture (stored program digital computer)

Instructions are stored as data in memory



CU - instructions read & execute  
ALU - computation

Memory - store data (for ALU) and  
instructions (for CU)

I/O - communication with user

## Different ways to classify parallel computers

- \* Instruction Stream
- \* Data Stream

Two possible states  
(single or multiple)



Stream refers to a sequence or flow of either instruction or data operated on by the computer

\* Instruction Stream: flow of instructions from main memory to CPU

\* Data Stream : flow of operands between processor and memory (bi-directional)

# Performance From where?

- ~~Device Technology~~
  - Logic switching speed
  - device density
  - Memory capacity and access time
  - Communications bandwidth and latency
- Computer Architecture
  - Execution pipelining
  - Cache management
  - Parallelism
    - Parallelism – number of operations per cycle per processor
      - Instruction level parallelism (ILP)
      - Vector processing
    - Parallelism – number of processors per node
    - Parallelism – number of nodes in a system

$\left\{ \begin{array}{l} \text{Do } i=1, N \\ \quad A(i) = B(i) + C(i) \end{array} \right.$

$$N = 10^6$$

$$\text{Perf} = \frac{10^6}{\text{Time}(s)}$$

Instruction  $\rightarrow$

Runtime  $\rightarrow$  Data + Coherence + Misc  
transfer

*Lecture 3*  
*CS301 - HPC*  
*Course Instructor : B. Chaudhury*

# Previous Lecture

## Data Paths



✓ Main performance bottleneck  
- Memory Access is slow

Performance depends on (Runtime)

→ Instruction Execution - }  
  Data Movement - }

Do  $i=1, N$   
 $A[i] = B[i] + C[i]$   
end

① Total work  $\rightarrow N$  adds  $\rightarrow N$  FLOPs  $\leftarrow$  Programmer.  
 $\hookrightarrow$  Several Instructions (load, store, etc.)  $\leftarrow$  Architect

② Data transferred  $\rightarrow$  How much ?  
8 bytes each A, B, C  
24 bytes per iteration

Measure ?  
Bytes / Sec

Classification of Computers: Some Computer organizations and their effectiveness by M. J. Flynn, 1972.

Stored Program computer.

✓ Von Neumann architecture → main memory, CPU and interconnection.

✓ Von Neumann bottleneck ?

Improvements → modifications to von Neumann model.

- Caching.
- Pipeline

## *Performance Metrics and Benchmarks*

We know that all components can operate at some max speed called peak performance.

Two important quantities :

Compute Throughput (GFLOPS/Sec)

Memory Bandwidth (GBytes/Sec)

✓ **Low level Benchmark:** Program that tries to test some specific features of the architecture.

Isolate small set of instructions to separate influences and determine specific machine capabilities

✓ Vector Triad - popular microbenchmarking exercise

Multiply-Add nested loop on the elements of three vectors

## Vector Triad: $A[i] = B[i] + C[i]*D[i]$

```
int minSize = pow(2, 3); int maxSize = pow(2, 29);
int total = maxSize;
```

```

for(int size=minSize; size<=maxSize; size*=2) {
    /* init data */
    for(int i=0; i<size; i++) {
        b[i]=3; c[i]=2; d[i]=1;
    }

    int RUNS = total/size;
    start = clock();
    for(int run=0; run<RUNS, run++) {
        /* vector triad */
        for(int ind=0; ind<size; ind++) {
            a[ind] = b[ind] + c[ind]*d[ind];
            if((double)ind==333.333)
                dummy(ind);
        }
    }
    end = clock();
    wallTime = (end - start)/((double)CLOCKS_PER_SEC);
    /* Avg throughput */
    double throughput = ((double)sizeof(double) * total * 2)/
        Memory " = ( total * 4 * 8 bytes ) / Walltime
}

```

$$\boxed{\text{total} = \text{Runs} + \text{size}}$$

Something which is never true

$\boxed{\text{total} = 2^{29}}$   
 for  $s = 2^3 + n \cdot 2^{29}$   
 $\rightarrow 2^3 + 2^4$   
 $\rightarrow 2^3 + 2^5$   
 Size ↑      ↓ Runs



## Vector Triad - Performance as a function of N



# Most important limitation – data access !!

✓ Loop based code → large amount of data movement in and out of the CPU.  
Concern : underutilization of on-chip resources.

Several data paths → range of bandwidths and latencies.

Objective : reducing traffic over slow data paths. ←

## ✓ Microbenchmarking

- **Probing of the memory hierarchy** ✓
- **Saturation effects in cache and memory** ↗
- **Typical overheads?** ↗

# What's Driving Parallel Computing Architecture?

31



Latency (Sec)



# Latency and bandwidth in modern computer environments



## Balance Analysis

bytes

$$\beta_c = \frac{\text{Code balance of a loop}}{\text{operations}} = \frac{\text{data traffic}}{\text{operations}} = \frac{\text{Bytes}}{\text{Flops}}$$

Reciprocal of code balance  $\rightarrow$  Computational Intensity.Data  $\rightarrow$  load, store and execution of operation.Data traffic  $\rightarrow$  performance limiting data path (hardware dependent).

$$\left\{ \begin{array}{l} \text{do } i=1, N \\ \quad A[i] = B[i] + C[i] \\ \text{end} \end{array} \right.$$

$$\left( \frac{\text{Flops}}{\text{Bytes}} \right)$$



8GB

# Balance analysis

35

Theoretical performance of loop based codes.

$B_m$

Machine balance (processor chip) =  
memory bandwidth / peak performance

$$= \frac{\text{Bytes/sec}}{\text{FLOPs/sec}}$$

| Data Path    | $B_m$     | Balance (units) |
|--------------|-----------|-----------------|
| cache        | $10^{11}$ | ? $10^1$ 10     |
| RAM          | $10^{10}$ | ? 1             |
| interconnect |           | ?               |
| Disk         | $10^9$    | ? 1             |

Core  $\rightarrow$  loc FLOPs/sec

2.5GB  $\times$  4 FLOPs/cycle  $\times$  1 core

Machine balance will decrease or increase in future ?

Typical historical trend (balance) ??

Good Code 

$B_m$  is fixed

$\frac{B_m}{B_C}$  will increase for Good (now).

Optimizing a code → There is no alternative to knowing what is going on between your code and the hardware. → Performance Modeling.

# Profiling

40

- ✓ Profiling → information about program's behavior.
  - ✓ Runtime → “hot spots”
  - ✓ Hot spots → performance bottleneck → optimization.
  - ✓ Function and line based profiling.
- Function profiling → (e.g. gprof from GNU)  
→ Flat function profile and callgraph profile.



To get a break up of time taken by each subroutines/functions of the code :

compile with

f95 -pg file.f  
gcc - file.c

after running the executable (a.out) we will get a file gmon.out

We have to give the command gprof --line a.out  
gmon.out to get this breakup times.

→ “hotspots”

Peak Performance. → for benchmarking.

Performance matrices →

Low level benchmarking → program to understand the chief performance characteristics of a processor/system.

How to do “time measurement” for different sections of the code. → “elapsed time” ?

fall back

| % time | cumulative seconds | self seconds | calls  | self s/call | total s/call | name            |
|--------|--------------------|--------------|--------|-------------|--------------|-----------------|
| 48.85  | 19.47              | 19.47        | 3000   | 0.01        | 0.01         | adv_efield_vel_ |
| 23.29  | 28.75              | 9.28         | 3000   | 0.00        | 0.00         | rms_            |
| 12.15  | 33.59              | 4.84         | 3000   | 0.00        | 0.00         | inc_efield_     |
| 11.87  | 38.32              | 4.73         | 3000   | 0.00        | 0.00         | mr_mur_         |
| 3.77   | 39.82              | 1.50         | 3000   | 0.00        | 0.00         | adv_hfield_     |
| 0.08   | 39.85              | 0.03         | 4      | 0.01        | 0.01         | elec_dens_      |
| 0.03   | 39.86              | 0.01         | 394416 | 0.00        | 0.00         | fioniz_         |
| 0.00   | 39.86              | 0.00         | 1      | 0.00        | 39.86        | MAIN_           |
| 0.00   | 39.86              | 0.00         | 1      | 0.00        | 0.00         | setup_          |

Code

Old code - time taken (39.86 seconds)

| % time | cumulative seconds | self seconds | calls  | self s/call | total s/call | name            |
|--------|--------------------|--------------|--------|-------------|--------------|-----------------|
| 52.76  | 14.04              | 14.04        | 3000   | 0.00        | 0.00         | adv_efield_vel_ |
| 17.47  | 18.70              | 4.65         | 3000   | 0.00        | 0.00         | rms_            |
| 17.40  | 23.33              | 4.63         | 3000   | 0.00        | 0.00         | inc_efield_     |
| 6.61   | 25.09              | 1.76         | 3000   | 0.00        | 0.00         | mr_mur_         |
| 5.52   | 26.56              | 1.47         | 3000   | 0.00        | 0.00         | adv_hfield_     |
| 0.08   | 26.58              | 0.02         | 394416 | 0.00        | 0.00         | fioniz_         |
| 0.04   | 26.59              | 0.01         | 4      | 0.00        | 0.01         | elec_dens_      |
| 0.00   | 26.59              | 0.00         | 1      | 0.00        | 26.59        | MAIN_           |
| 0.00   | 26.59              | 0.00         | 1      | 0.00        | 0.00         | setup_          |

New code - time taken (26.59 seconds)



**% time** - the percentage of the total running time of the program used by this function.

**Cumulative upto this function.**

**self** - the number of seconds accounted for by this function alone.  
**seconds**

**calls** - the number of times this function was invoked.

**self** - the average number of milliseconds spent in this function per call .  
**ms/call**

**total** - the average number of ms spent in this function and its descendants per  
**call**  
**ms/call**

**name** - the name of the function.

# Callgraph profile / butterfly graph

44

Runtime profile of a function → several different callers.

|       | index | % time | self  | children  | called              | name                |
|-------|-------|--------|-------|-----------|---------------------|---------------------|
|       |       | 0.00   | 26.59 | 1/1       |                     | main [2]            |
| [1]   | 100.0 | 0.00   | 26.59 | 1         | MAIN_ [1]           |                     |
|       |       | 14.04  | 0.00  | 3000/3000 |                     | adv_efield_vel_ [3] |
|       |       | 4.65   | 0.00  | 3000/3000 |                     | rms_ [4]            |
|       |       | 4.63   | 0.00  | 3000/3000 |                     | inc_efield_ [5]     |
|       |       | 1.76   | 0.00  | 3000/3000 |                     | mr_mur_ [6]         |
|       |       | 1.47   | 0.00  | 3000/3000 |                     | adv_hfield_ [7]     |
|       |       | 0.01   | 0.02  | 4/4       |                     | elec_dens_ [8]      |
|       |       | 0.00   | 0.00  | 1/1       |                     | setup_ [10]         |
| <hr/> |       |        |       |           |                     |                     |
|       |       |        |       |           |                     | <spontaneous>       |
| [2]   | 100.0 | 0.00   | 26.59 |           | main [2]            |                     |
|       |       | 0.00   | 26.59 | 1/1       | MAIN_ [1]           |                     |
| <hr/> |       |        |       |           |                     |                     |
|       |       | 14.04  | 0.00  | 3000/3000 | MAIN_ [1]           |                     |
| [3]   | 52.8  | 14.04  | 0.00  | 3000      | adv_efield_vel_ [3] |                     |

## Contd ....

---

```
      1.50  0.00 3000/3000    MAIN_ [1]
[7] 3.8   1.50  0.00 3000    adv_hfield_ [7]
```

---

```
      0.03  0.01  4/4    MAIN_ [1]
[8] 0.1   0.03  0.01   4    elec_dens_ [8]
      0.01  0.00 394416/394416  fioniz_ [9]
```

---

```
      0.01  0.00 394416/394416  elec_dens_ [8]
[9] 0.0   0.01  0.00 394416    fioniz_ [9]
```

---

```
      0.00  0.00  1/1    MAIN_ [1]
[10] 0.0   0.00  0.00     1    setup_ [10]
```

*Lecture 4*

*CS301- High Performance Computing*

*Course Instructor : B. Chaudhury*

To understand the results of vector triad, we need certain concepts i.e. techniques to improve application performance

- \* Pipelined functional units 
- \* Superscalar architecture
- \* Data Parallelism through SIMD
- \* Out of order execution
- \* Larger Caches



Food - some operation like Addition or Multiplication

Items/ dishes - sub-tasks to complete operation

Steps - stages of pipeline

N students - N operations ( $A[i]=B[i]+C[i]$  where  $i=1, N$ )

Simplest pipeline --> fetch (item 1) - decode (item 2) - execute (item 3)



\* Pipeline context- Students taking food in a queue (line)

\* Total N students

\* Complete Food (lunch) contains 5 items (dishes) - to be taken one after the other

\* All students taking same food

\* serving/taking each item is a task (step) - total 5 steps required

\* Workers (serving food) are highly skilled and specialized for a single task

\* Each worker executes same work but works on different objects (serving same food item to different students)

✓ All tasks (steps) take around same time (1 cycle)

\* Forwarding the partially-finished work to the next

## Pipelining of Functional units

- Core can work on several independent instructions simultaneously (parallelly) - working on 5 additions at the same time after the wind-up phase
- One instruction gets finished at each cycle after the pipeline is full
- Widely used in modern processors

Pipeline subtask - load, store, address calculation, decode, execute etc.

Job of compiler - arrange instructions in such a way as to make efficient use of all the different pipelines

Generalize : Pipeline of depth "m"

Total independent operations to be executed - N

for  $i=1, N$

Pipeline  $\rightarrow$  Total time taken to finish ( $T_{\text{pipeline}}$ ) in cycles -  $m + N - 1$

Unit without pipeline takes "m" cycles to generate a single result; how much for N results -  $m/N$

$$\text{Speedup} = \frac{m/N}{m+N-1} \approx m \quad N \gg m$$

Conclusion: Depth of pipeline decides speedup

$$\text{Throughput of the pipeline} = \frac{N}{N+m-1} = \frac{1}{1 + \frac{m-1}{N}}$$

for large  $N \therefore \text{Throughput} \rightarrow 1$

& Conclusion: Tight Input ( $N$  is small)  $\rightarrow$  pipeline bottleneck



## Memory Hierarchies



DRAM Gap - increasing distance between CPU and memory in terms of latency and bandwidth.  
Cache alleviate the effects of DRAM gap.

CPU issues a read request to transfer data item to register, first level cache logic checks whether this item already exists in cache or not

\* If it exists - cache hit (low latency)

\* If it does not exist - cache miss (high latency) --> fetch from outer cache --> worse from memory

Instruction cache miss are rare events compared to data cache miss



\* Either build a small and fast memory or a large and slow memory

\* Multiplicity- multiple levels - the larger the slower. Typical caches - KBytes to MBytes

~~\* Data transfers occur in a block (bursts) of single cache lines (~64 bytes) from one level to another level.~~

✓ Goal-avoid slow data paths- optimization

Data transfer basic model in terms of latency and bandwidth - for different data paths

Data - N bytes

Latency - T\_L (wait time)

Bandwidth - B bytes/sec (after data starts flowing)

✓ Transfer time for message  $T = T_L + N/B \rightarrow$  Seconds

Effective bandwidth of data transfer =  $N/T = N / (T_L + N/B)$

## Application (algorithms)

\* Access pattern - shows some locality of reference

Data items loaded into cache gets used again before getting evicted --> Temporal locality (reuse in time)

Estimation of performance gain from cache:

Suppose cache is  $k$  times faster than memory -  $k$  comes from both latency and bandwidth

Cache reuse ratio (fraction of load or store that can be satisfied from cache because there was a recent load or store to the same address) -  $r$

Access time to main memory  $\geq T_m$

Cache access time  $T_c = \frac{T_m}{K}$

Average access time (for some  $r$ )  $T_{av} = \lambda T_c + (1-\lambda) T_m$

$$\begin{aligned} \text{Performance Gain} &= \frac{T_m}{\lambda T_c + (1-\lambda) T_m} = \frac{K T_c}{\lambda T_c + (1-\lambda) K T_c} \\ &= \frac{K}{\lambda + (1-\lambda) K} = G_r(K, \lambda) \end{aligned}$$

Reuse Ratio  $\rightarrow 1$ , Performance Gain =  $K$



Limitation - many applications reuse ratio (temporal locality) is very low -->0 (for example image processing)

✓ Each new load is expensive -- huge latency

To solve this issue --> concept of cache line comes into picture

Goal - increase cache hit ratio (not cache reuse)

The cache line is fetched as a whole from memory, therefore latency penalty of a cache miss occurs only on the first miss.

Neighbouring items can be loaded from cache with much lower latency.

Whenever there is spatial locality --> latency problem can be reduced

Disadvantage - erratic data access pattern -- polluting the message bus

Suppose cache line length = 16

Spatial locality will fix the hit ratio at  $= \frac{15}{16} = .94$

Algorithm - Performance is governed by memory bandwidth and latency - memory bound application



Stencil



## Cost of cache miss

Profess

understand the difference between hit and miss

Hit time - time to deliver a line in the cache to processor (time to determine whether line is in cache included) - typical hit times - 1-2 clock cycles for L1 cache

Miss penalty - typically 10-100 of cycles

99% hit vs 97% hit --> do we need to worry and do something (increase hit)

99% vs 80%

Cache hit time - 1 cycle

Cache miss penalty - 100 cycles (L1 cache vs main memory)

Average Access time

$$97\text{-l. hit} \rightarrow \frac{97}{100} \times 1 + \frac{3}{100} \times 100 \sim 4 \text{ cycles}$$

$$99\text{-l. hit} \rightarrow \frac{99}{100} \times 1 + \frac{1}{100} \times 100 \sim 2 \text{ cycles}$$

$$80\text{-l. hit} \rightarrow \frac{80}{100} \times 1 + \frac{20}{100} \times 100 \sim 21 \text{ cycles}$$



$L_1 \rightarrow t_B$       Access latency  $\rightarrow 4$  cycles

$L_2 \rightarrow t_B$

$L_3 \rightarrow M_B$

"       $\rightarrow 3_0 - 4_0$  cycles

Memory  $\rightarrow C_B$       "       $\rightarrow$  few low cycles

Block Size - 64 bytes

pipeline efficient



### Vector Triad interpretation



$$2^3 \rightarrow 8 \times 3L = 256 \text{ bytes}$$

$$2^5 \rightarrow 3L$$

Variables  
4 × 8 = 32  
↓  
bytes

$$2^{10} \rightarrow 11L$$

$$2^{20} - 1M$$

$$2^{30} - 1G$$



At small  $N$  processor pipeline is too long to be efficient

Vector triad performance analysis:

1. At small  $N$ , processor pipeline is too long to be efficient. (tight loop)
2. Reasonable  $N$ , processor becomes efficient.
3. Performance saturates at some value of  $N$  set by  $L1$  cache - all 4 arrays fit into cache
4. The above value depends on bandwidth and latency of  $L1$  cache.
5. Further increase in  $N$  -- sharp drop in performance because innermost cache is not large enough to hold all the data.
6.  $L2$  has similar bandwidth that of  $L1$  but higher latency.
7. Now  $L1$  has to support both - provide data to registers as well as continuously evict cache lines from  $L2$ .

*Lecture 5  
HPC: CS301  
Course Instructor: B. Chaudhury*



Previous Lecture:

## ✓ Memory Levels : DRAM and Caches

### ✓ Cache Hit and Cache Miss

Cache Lines: 64 bytes (CPU requests data from RAM in blocks of 64 bytes)

### Latency in Clock Cycles.

Clock cycle: 1 tick of the CPUs timer - smallest possible duration of the time for the device. 1GHz clock cycle - 1 billion ticks per second.

Latency - the time the CPU takes to wait to perform something (measured in clock cycles)

✓ X86/ X64 caches: L1, L2, L3

L1 - smallest (KB) and fastest

L2 - medium sized (hundreds of KB to few MB) and medium speed

L3 - largest (tens of MB) but slowest

Some hypothetical representative numbers

| Memory     | Latency (Clock Cycles) |
|------------|------------------------|
| Register   | 1                      |
| L1         | 3-4                    |
| L2         | 15-20                  |
| L3         | 50-60                  |
| Memory     | 100-200                |
| Hard Drive | ~ Millions             |

# Cache Lines

The bytes in RAM are grouped in a block of 64 each



Travel through the array → forward or backward →  
we will get the benefit.

## Example of locality

Case-1

```
total=0  
for (i=0, i<n, i++)  
    total += a[i]  
}  
return total
```

Which locality (data)?  
Look into each iteration

Spatial  $\rightarrow a[i]$   
Temporal  $\rightarrow total$

$a[1]a[2] \dots a[3]$

✓ Skill: identify the pattern of data access in the code/ algorithm

CASE-2 (C code)  
 $n$  is very large

```
total=0  
for (i=0, i<n, i++)  
    for (j=0, j<n, j++)  
        for (k=0, k<n, k++)  
            total += a[k][i][j]  
return total
```

Optimization for the memory hierarchy - improve locality

- \* Proper choice of algorithms
- \* Loop transformations

## Case Study Matrix multiplication

Row major layout for a 2D C array (4 x 4 matrix)

|                 |                 |                 |                 |
|-----------------|-----------------|-----------------|-----------------|
| M <sub>00</sub> | M <sub>01</sub> | M <sub>02</sub> | M <sub>03</sub> |
| M <sub>10</sub> | M <sub>11</sub> | M <sub>12</sub> | M <sub>13</sub> |
|                 |                 | X               |                 |
|                 |                 |                 | M <sub>33</sub> |

16 elements



Linearized into 16 element 1D array - flat memory space in modern computers

$$\text{Element} = \text{Row} * \text{width} + \text{column}$$

$$10 \text{th element} = 2 \times 4 + 2 = 10$$

## Matrix multiplication code

```
int row, col, k  
for (row=0, row<N, row++)  
    for (col=0, col<N, col++)  
        for (k=0, k<N, k++)  
            element [row*width + col] += A[row*N + k]*B[k*N + col]
```

$\text{Size}(A) \rightarrow N^2$

where  $N = \text{width of the matrix}$



# Case Study: Matrix Multiplication

For single element of  $C(i,j)$  --> Read : Row  $(i)$  x Column  $(j)$  {dot product}

Cache starts empty, now look into hit and miss



If matrices are stored in rows in RAM -- reading columns will be slow (not using cache lines)

## Cache miss analysis

- \* Matrix elements - double
- \* Cache block - 64 bytes (8 elements)
- \* Cache size  $C << N$

Lets say  $N = 10000$

First iteration:  
Total cache Misses

$$= \left(\frac{N}{8}\right) + N$$

(A)      (B)

$$= \frac{9N}{8} \text{ Misses}$$

Total Misses

$$= \frac{9N}{8} \times N^2 = \frac{9N^3}{8}$$

(8 bytes)

Matrix -  $10^4 \times 10^4$

A  $\rightarrow 8 \times 10^8$

Matrix  $\rightarrow 10^2 \times 10^2$

A  $\rightarrow 80 \text{ KB}$

Matrix  $\rightarrow 10^3 \times 10^3$

A  $\rightarrow 8 \text{ MB}$

*What if we transpose the matrix B*



Read the rows or the columns -- we are using the entire cache lines

Store the matrices in both in rows and columns -- swap the X and Y coordinates

Message: try to access data in consecutive blocks

## *Block matrix multiplication*



(Blocking Technique)

Phase 1 calculation

$n$        $l$        $y$   
 $m$        $z$        $t$

# Block matrix multiplication

Block  $\rightarrow 2 \times 2$

$$\begin{array}{c}
 \text{4} \\
 \begin{bmatrix} A_{0,0} & A_{0,1} \\ A_{1,0} & A_{1,1} \\ A_{2,0} & A_{2,1} \\ A_{3,0} & A_{3,1} \end{bmatrix} \times \begin{bmatrix} B_{0,0} & B_{0,1} & B_{0,2} & B_{0,3} \\ B_{1,0} & B_{1,1} & B_{1,2} & B_{1,3} \\ B_{2,0} & B_{2,1} & B_{2,2} & B_{2,3} \\ B_{3,0} & B_{3,1} & B_{3,2} & B_{3,3} \end{bmatrix} = \begin{bmatrix} C_{0,0} & C_{0,1} \\ C_{1,0} & C_{1,1} \end{bmatrix}
 \end{array}$$

$16 \times 16$

$3 \times 3$

$m \times n \times p$

$$\begin{bmatrix} A_{0,0} \\ A_{1,0} \end{bmatrix} \times \begin{bmatrix} B_{0,0} \\ B_{1,0} \end{bmatrix} + \begin{bmatrix} A_{0,1} \\ A_{1,1} \end{bmatrix} \times \begin{bmatrix} B_{1,0} \\ B_{2,0} \end{bmatrix}$$

$$= \begin{bmatrix} a_{0,0}b_{0,0} + a_{0,1}b_{1,0} & a_{0,0}b_{0,1} + a_{0,1}b_{1,1} \\ a_{1,0}b_{0,0} + a_{1,1}b_{1,0} & a_{1,0}b_{0,1} + a_{1,1}b_{1,1} \end{bmatrix} + \begin{bmatrix} a_{0,2}b_{2,0} + a_{0,3}b_{3,0} & a_{0,2}b_{2,1} + a_{0,3}b_{3,1} \\ a_{1,2}b_{2,0} + a_{1,3}b_{3,0} & a_{1,2}b_{2,1} + a_{1,3}b_{3,1} \end{bmatrix}$$

$$= \begin{bmatrix} a_{0,0}b_{0,0} + a_{0,1}b_{1,0} + a_{0,2}b_{2,0} + a_{0,3}b_{3,0} & a_{0,0}b_{0,1} + a_{0,1}b_{1,1} + a_{0,2}b_{2,1} + a_{0,3}b_{3,1} \\ a_{1,0}b_{0,0} + a_{1,1}b_{1,0} + a_{1,2}b_{2,0} + a_{1,3}b_{3,0} & a_{1,0}b_{0,1} + a_{1,1}b_{1,1} + a_{1,2}b_{2,1} + a_{1,3}b_{3,1} \end{bmatrix}$$

$= C_{0,0}$

$B \cdot d^{-1} \cdot B$

N

M

$\begin{bmatrix} N & M \\ B & D \end{bmatrix}$



✓ Misses for each block =  $\frac{B^2}{8}$  → Cache line size & alignment

how many blocks in vertical & horizontal dimension

$$\frac{N}{B} + \frac{N}{B} = \frac{2N}{B}$$

$$\text{Misses in the first iteration} = \frac{B^2}{8} \times \frac{2N}{B} = \frac{NB}{4}$$

$$\text{Total Misses} = \left(\frac{N}{B}\right)^2 \times \frac{NB}{4} = \frac{N^3}{4B}$$

↳ total Blocky

- \* Cache block  $\Rightarrow$  64 bytes
- \* Cache Size  $C \ll N$  → Row & cols
- \* Assume  $(B \times B)$
- \* Block size (elements)  $\rightarrow B^2$

$$3B^2 \leq C$$



Double precision vs Single precision  
8 bytes vs 4 bytes

$$\text{No Blocking} \rightarrow \frac{9N^3}{8}$$

$$1 \text{ Blocking} \rightarrow \frac{N^3}{4B}$$

$$A_{min} = \frac{\frac{9N^3}{8}}{\frac{N^3}{4B}} = \boxed{\frac{9}{2}B}$$

$$\text{Block } B=8$$

$$\frac{9}{2} \times 8 = \underline{36} \times$$

$$\text{Block } B=16$$

$$\frac{9}{2} \times 16 = \underline{72} \times$$

### Optimize Code Performance

- \* Data organization
- \* Data Access
- \* Loop structures
- \* Blocking Technique

keep working set small  $\rightarrow$  Temporal Locality}

Small Stride  $\rightarrow$  Spatial Locality

Inner loop code  $\rightarrow$  Data Access



Block size  $\rightarrow 4 \times 4$

Precision  $\rightarrow$  Single, Double

↳ how many elements per cache line  
4 elements, 8 elements

$3B < \text{Cache size}$

$3B \rightarrow 3 \times (4+4) \times 8$   
 $\rightarrow$

Uniprocessor throughput (FPS/Sec)  $\rightarrow$  how to calculate

Thought of the Memory system  $\Rightarrow$

64 bit DDRAM interface (Double data Rate)  
8 channels  
1 GHz clock freq

Peak Access Throughput (GB/S)

$$= \frac{8 \text{ byte}}{\text{transfer}} \times \frac{2 \text{ transfers}}{\text{Clock cycle}} \times 8 \text{ channels} \times \frac{1 \text{ Octet}}{\text{Sec}}$$

$$= \boxed{\frac{128 \text{ GB}}{\text{S}}}$$

Each data - 4 byte

$$\text{Elements / Sec} \rightarrow \boxed{\frac{32 \text{ G}}{\text{S}}}$$

# *Memory Hierarchies*



*Lecture 6  
HPC: CS301  
Course Instructor: B. Chaudhury*



Principle of locality → Spatial locality and temporal locality.

Exploit principle of locality. Memory access on blocks of data instead of individual data items.  
These are called cache blocks.

## *When two statements can be executed in parallel?*

Suppose we have 2 statements

statement 1  
statement 2

and we wish (2 processors execute independently, i.e. no control over order of execution between processor):

statement 1 in proc 1  
statement 2 in proc 2

Main requirement: their order of execution should not matter i.e.

statement 2     $\curvearrowleft$     Equivalent    statement 1     $\curvearrowleft$   
statement 1     $\curvearrowleft$

case 1  
 $a=1$   
 $b=2$     }     $\checkmark$

case 2  
 $a=1$   
 $b=a$

Case 3  
 $b=a$   
 $a=1$      $\times$

Case 4  
 $a=2$   
 $a=1$      $\times$

Case 5  
 $a=f(y)$   
 $b=a$      $\times$

## Dependences : types

### \* Flow (true) dependence

statement i precedes j and i computes a value that j uses

- (1)  $x=1$       (2)  $y=x+2$       (3)  $x=z-w$       (4)  $x=y/z$



### \* Anti dependence

statement i precedes j and i uses a value that j computes

- (1)  $x=1$       (2)  $y=x+2$       (3)  $x=z-w$       (4)  $x=y/z$



### \* Output dependence

statement i precedes j and i computes a value that j also computes

- (1)  $x=1$       (2)  $y=x+2$       (3)  $x=z-w$       (4)  $x=y/z$

1 & 3  
3 & 4

### \* Input dependence

statement i precedes j and i uses a value that j also uses

- (1)  $x=1$       (2)  $y=x+2$       (3)  $x=z-w$       (4)  $x=y/z$

3 → 4

\* If we say dependence flows from  $i$  to  $j$  then  $i$  is source and  $j$  is sink.  
 Flow dependence is the true dependence, other dependences can be eliminated by renaming

\* Data dependency graph: (1) nodes are the statements and (2) directed edges are dependence relations

### Dependence between loop iteration

$\left\{ \begin{array}{l} \text{for } i=1, n \\ \text{for } j=2, m \\ b[i,j] = \underline{\dots} + b[i,j-1] \end{array} \right.$

\* iterations of the loop  $j$  must be done sequentially  
 $\& \quad \text{", " ", " } i$  can be executed  $\rightarrow$  parallel.

$\left\{ \begin{array}{l} \text{for } i=1, n \\ \text{for } j=2, m \\ a[i] = b[i] + c[i] \\ d[i] = a[i] \end{array} \right.$

(1) loop independent dependence  
 (2) loop carried dependence

$i=1 \rightarrow \text{case 1}$   
 $i=2 \rightarrow \text{case 2}$   
 $i=3 \rightarrow \text{case 3}$

in the above case dependence flows within same iteration (loop independent; dependence distance=0)

$\left\{ \begin{array}{l} a[i] = b[i] + c[i] \\ d[i] = a[i-1] \end{array} \right.$

dependence flows from 1 iteration to next (loop carried dependence; dependence distance=1)

$a[1] = b[1] + c[1]$   
 $a[2] = b[2] + c[2]$   
 $a[2] = a[1]$

$a[1] = b[1] + c[1]$   
 $a[1] = a[0]$   
 $a[2] = b[2] + c[2]$   
 $a[2] = a[1]$

## Loop carried dependences

```
a[0]=1  
for (i=1; i<N; i++) {  
    a[i]= a[i] +a[i-1]  
}
```

Can this loop be parallelized?

Approach:

Look into the pattern for some iterations

~~i=1~~

~~i=2~~

i>3

$$a[1] = a[1] + a[0]$$

$$a[2] = a[2] + a[1]$$

$$a[3] = a[3] + a[2]$$

$$i=N-1 \quad a[N-1] = a[N-1] + a[N-2]$$

## How to detect dependences

- \* Look carefully how each variable is used within a iteration
- \* Is the variable only read and not written (if yes - no dependences)
- \* For each variable which is writtten - find out whether there is any accesses in other iterations than the current? if yes then depepnces are present

Other important points for no dependences:

- \* Each element is assigned by at most one iteration
- \* No iteration reads elements assigned by any other iteration

```
for (i=1; i<N; i+=2) {  
    a[i] = a[i] + a[i-1]  
}
```

Can this loop be parallelized?

Approach:

Look into the pattern for some iterations

$$\begin{aligned} a[1] &= a[1] + a[0] \\ - \quad a[3] &= a[3] + a[2] \\ a[5] &= a[5] + a[4] \end{aligned}$$

*✓* `for (i=0; i<N/2; i++) {  
 a[i] = a[i] + a[i+N/2]  
}`

Can this loop be parallelized?

Yes

Approach:

Look into the pattern for some iterations

$$\left\{ \begin{array}{l} a[1] = a[1] + a[1 + \frac{N}{2}] \\ a[2] = a[2] + a[2 + \frac{N}{2}] \\ \vdots \\ a[\frac{N}{2}-1] = a[\frac{N}{2}-1] + a[\frac{N}{2}-1 + \frac{N}{2}] \end{array} \right.$$

*✓* `for (i=0; i<=N/2; i++) {  
 a[i] = a[i] + a[i+N/2]  
}`

Can this loop be parallelized?

No

Approach:

Look into the pattern for some iterations

$N=10$

$$\left( \begin{array}{l} a[0] = a[0] + a[5] \\ a[5] = a[5] + a[5+5] \end{array} \right)$$

$$a[\frac{N}{2}] = a[\frac{N}{2}] + a[\frac{N}{2} + \frac{N}{2}]$$

$a[0]=0$   
 ✓  
 for ( $i=1; i < N; i++$ ) {  
      $a[i] = a[i-1] + i$   
 }

Can this loop be parallelized?

Approach:

$$\begin{array}{lll}
 i=1 & a[1] = a[0] + 1 & = 0 + 1 = 1 \\
 i=2 & a[2] = a[1] + 2 & = 1 + 2 = 3 \\
 i=3 & a[3] = a[2] + 3 & = 3 + 3 = 6 \\
 i=4 & a[4] = a[3] + 4 & = 6 + 4 = 10
 \end{array}$$

$$\boxed{a[i] = \frac{i(i+1)}{2}} \rightarrow \begin{array}{c} i \rightarrow 3 \rightarrow 6 \rightarrow 10 \\ \downarrow \quad \downarrow \quad \downarrow \\ 1 \times 5 = 10 \end{array}$$

✓  
 for ( $j=0; j < N; j++$ ) {  
      $a[\underline{id}[j]] = a[\underline{id}[j]] + b[\underline{id}[j]]$   
 }

Approach:

Can this loop be parallelized?

Can we remove dependences in the following cases - is there false dependences

Case-1

$sum = a1 + 1$   
 $b1 = sum * c1$

$sum = a2 + 1$   
 $b2 = sum * c2$



Sum 1



Sum 2

Case 2

```
for (i=0; i<N; i++) {  
    x=(b[i] + c[i]) /2  
    a[i]= a[i+1] + x  
}
```

Assignment

Can this loop be parallelized?

## Summary

- Statement order must not matter. ✓
- Statements must not have dependences. ✓
  - Some dependences can be removed. ✓
  - Some dependences may not be obvious. ✓

Eliminating dependence by Scalar Expansion or privatization

*for (I = 0; I < 100, I++)*  
*T = A[I];*  
*A[I] = B[I];*  
*B[I] = T;*

$$\begin{aligned}T[0] &= A[0] \\T[1] &= B[0] \\B[1] &= T[0]\end{aligned}$$

$$T = A[0] \quad \left. \right\} \quad T = A[5]$$

$\hat{i} = 1$

$\hat{i} = 5$

## Distribution

Loop instance  $\Delta$  Loop fusion

- \* Loop fusion  $\rightarrow$  Combining 2 loops into a single loop.
- \* Reason  $\rightarrow$  Improve locality & (parallelism)
- \* Loop distribution (opposite of fusion)  $\rightarrow$  Can increase parallelism, worse locality
- $\rightarrow$  Can turn a non-parallelizable loop into a parallelizable loop.



# Simple Optimizations of serial code

# Case A vs. Case B

## Case A

```

logical :: flag
flag = .false.
do i=1,N
if(complex_func(A(i)) < TRESHOLD) then
  flag=.true.
  EXIT
endif

```

W<sup>h</sup>Y

*Which one  
we should  
choose !!  
Why?*

## Case B

```

logical :: flag
flag = .false.
do i=1,N
if(complex_func(A(i)) < TRESHOLD) then
  flag=.true.
endif
enddo

```

Lesson → Do less work

# Case A vs Case B

$$A = A + B^{**2}$$

(A,B float)

*Expensive operation*

$$\rightarrow A = A + B^{**2} \quad 0$$

(case A)

$$A = A + B * B$$

(case B)

$$B^{**2} = \cancel{B^{\cancel{**}} 2} - \cancel{emb} 2 \ln(B)$$

Strength      Reducton

Algebraic , Sequential

Lesson  $\rightarrow$  Avoid expensive operation

# Case A vs B

## Case A

```
2 do i=1,N  
3     A(i)=A(i)+s+r*sin(x)  
4 enddo
```

## Case B

```
tmp=s+r*sin(x)  
do i=1,N  
    A(i)=A(i)+tmp  
enddo
```



```
1 do j=1,N  
2   do i=1,N  
3     if(i.ge.j) then  
4       sign=1.d0  
5     else if(i.lt.j) then  
6       sign=-1.d0  
7     else  
8       sign=0.d0  
9     endif  
10    C(j) = C(j) + sign * A(i,j) * B(i)  
11  enddo  
12 enddo
```

## Case A

```
1 do j=1,N  
2   do i=j+1,N  
3     C(j) = C(j) + A(i,j) * B(i)  
4   enddo  
5 enddo  
6 do j=1,N  
7   do i=1, j-1  
8     C(j) = C(j) - A(i,j) * B(i)  
9   enddo  
0 enddo
```

## Case B

```

do i=1,N,2
  do j=1,N
    c(i)=c(i)+a(i,j) * b(j)
    c(i+1)=c(i+1)+a(j,i+1)*b(j)
  enddo
enddo

```

Case A

Loop unrolling

$N^2$  additions

$2N^2$  operations  
 $\approx N^2$  Accesses

```

do i=1,N
  do j=1,N
    c(i)=c(i)+a(j,i)*b(j)
  enddo
enddo

```

Case B

- **Do less work!**



- **Avoid expensive operations!**

FP MULT & FP ADD are the fastest way to compute



- **Replace expensive functions by table lookup**

Avoid DIV / SQRT / SIN / COS / TAN → table lookup

- Explore whether data is close to CPU!



- **Elimination of common subexpressions!**

- **Avoid branches!**

Support the compiler to understand and optimize your code!



*Lecture 7  
HPC: CS301  
Course Instructor: B. Chaudhury*



Compute n values and add them together

Serial code – simple !! →

Sum=0; for loop  $i=1, n$ ; compute  $x$ ; sum  $+=x$ .



Suppose you have "p" cores and "p < n"

How you will do it in parallel?



**Compute n values and add them together**

**Suppose you have “p” cores and “p<<n”**

**“partial -sum” for each core**

**Each core → divide in independent blocks, private variables, Say “my\_sum” of each core.**

**Then “global-sum” → sum of all “my\_sum”**

**Master core gets all “my\_sum” from other cores.**

## First Example

Is it the best way always?

Method is generalization of the serial global sum ; divide the work among cores and then master core simply repeats the basic serial addition.

Efficient Parallelization is not parallelization of each step of the serial algorithm, but devising an entirely new algorithm.



[Arjun]

b1 b2 b3 b4



## Granularity

Ratio of computation to communication.

- Coarse: large amounts of computational work between communication events
- Fine: small amounts of computational work between communication events

Comp  
Comm



## Parallel Overhead

The amount of time required to coordinate parallel tasks, as opposed to doing useful work. Parallel overhead includes:

- Task start-up time
- Synchronizations
- Data communications
- Software overhead imposed by parallel languages, libraries, operating system, etc.
- Task termination time

Better algorithm is possible !!

- **Communication among the cores,**
- **load balancing,**
- **and synchronization.**



**Complexity of the algorithm.**

**Types of Parallelism. → Partitioning among cores.**

## Parallel Computing Metrics - Speedup and efficiency

$T(1)$  - execution time with one core or processor

$T(p)$  - execution time with  $p$  processors

$$\text{Speedup} = \frac{T(1)}{T(p)} \rightarrow S(p)$$

$$\text{Efficiency } E = \frac{S(p)}{p}$$

✓ Is there any limit on speedup - what if we have infinite processors?

Ideal world  $S=p$ , assuming perfect scaling  $p$  times with  $p$  cores and work is fully parallelizable and constant workload

✓ Speedup and efficiency are not performance (work per unit time) metrics!

Lets assume normalized execution time for single core can be divided into two parts

$$T(1) = s + p = 1$$

$$p = 1 - s$$

- $\curvearrowleft$   $s$  - sequential or serial part
- $\curvearrowleft$   $p$  - purely parallel part



Parallel execution time with  $N$  processors  $T(N) = s + \frac{p}{N}$

Speedup  $S = 1 / (s + (1-s)/N)$

$$\begin{aligned} &= \frac{\text{Serial}}{\text{Parallel}} \\ &= \frac{1}{s + \frac{(1-s)}{N}} \\ &= \frac{1}{s} = \frac{1}{1-p} \end{aligned}$$

Possibility of Parallelization

- Amdahl's Law states that potential program speedup → the fraction of code ( $P$ ) that can be parallelized:

$$\cancel{\text{Speedup} = \frac{1}{1-p}}$$

- If none of the code can be parallelized,  $P = 0$  and the speedup = 1 (no speedup).
- If 50% of the code can be parallelized, maximum speedup = 2, meaning the code will run twice as fast.
- Theoretical Maximum speedup ??  $\infty$

Profiling the code

10 Subtracts  
9 Subtracts  
 $\cancel{1 \times \text{Parallel}}$

Sequential

$S = \frac{1}{1-p}$

Std: 1

Profiling an sequential code

$1 \times 10 = 1$

$S = 1 \quad p = .9$

$S =$



$$\text{Speedup} = \frac{1}{1-p}$$

$P$  in fraction  
 $P+S=1$

Y axis → speedup

Introducing the number of processors performing the parallel fraction of work, the relationship can be modeled by:

$$\text{Speedup} = \frac{1}{[(P/N)+S]}$$



$$\begin{cases} S = 0.01 \\ S = 0.1 \end{cases}$$

$$S = 2$$

$$P = 0.9$$

$$\frac{P}{N} \downarrow 0$$

$$\frac{1}{S} = \frac{1}{-1}$$

$$= 10$$

Now if we introduce overhead ?

JW, info Valider - ~~(n)~~



(a) \* Segmented computation time  $\underline{\sigma(n)} = \underline{=}$

(b) \* Parallel computation time (<sup>Serial</sup>  
that can be executed parallelly)

(c) Parallel overhead  $\rightarrow \underline{\phi(n)} =$   
 $\rightarrow K(n, p)$



Segmented on a single processor =  $\underline{\sigma(n)} + \underline{\phi(n)}$

Parallel time for the parallel code =  $\underline{\sigma(n)} + \underline{\frac{\phi(n)}{P}} + \underline{K(n, p)}$

$$\boxed{\Psi(n, p) = \frac{\sigma(n) + \phi(n)}{\sigma(n) + \frac{\phi(n)}{P} + K(n, p)}} = \frac{\text{Serial}}{\text{Parallel}}$$

Not perfectly divided, speedup will be less.

Ir  
3

33, 33, 3

\* What will happen if we increase  $p$  -  
 Computation time decreases  
 Communication time increases

$$T(n, p) = \frac{G(n) + \phi(n)}{\frac{G(n) + \phi(n)}{p} + \frac{f(n)}{p}}$$

Conf Conf Comm

fix n → fixing the problem size



Conclusion: →  
 For any fixed problem size, there is an optimal no of processes  
 that minimizes overall execution time.

$$\text{Efficiency} = \frac{\text{Speedup}}{P}$$

$$E = 50\% \leftarrow$$

$$E = 100\%$$

Range of  $\in (n, p)$

Range  $[0 \leq E \leq 1]$

$$E(n, p) = \frac{G(n) + Q(n)}{P \left[ G(n) + \frac{Q(n)}{P} + K(n, p) \right]}$$

~~Computation~~  $\xrightarrow{\quad}$  overhead

$$= \frac{G(n) + Q(n)}{P G(n) + Q(n) + P K(n, p)}$$

Best case Scenario  
for parallelizat<sup>n</sup>  $\rightarrow G(n) \rightarrow 0, K(n, p) \rightarrow 0$   $E(n, p) \rightarrow 1$

Worst case Scenario

$$Q(n) \rightarrow 0, G(n) \rightarrow \text{finite}, K(n, p) \rightarrow \text{finite}, P \rightarrow \infty$$

$$E(n, p) \rightarrow 0$$

$$\Psi(n/p) \leq \frac{G(n) + Q(n)}{G(n) + Q(n) \underbrace{+ K(n/p)}_{P}} + K(n/p)$$

$$\Psi(n/p) \leq \frac{G(n) + Q(n)}{G(n) + Q(n) \underbrace{\phantom{G(n) + Q(n)}}_P} + K(n/p)$$

Amdahl's law does not consider overhead

Let  $f$  denote the inherently sequential portion of the computation

$$f = \frac{G(n)}{G(n) + Q(n)}$$

$$fG(n) - G(n) = -fQ(n) \\ Q(n) = G(n) \left[ \frac{1-f}{f} \right] = G(n) \left( \frac{1}{f} - 1 \right)$$

$$\begin{aligned} \Psi(n/p) &\leq \frac{G(n) + Q(n)}{G(n) + Q(n) \underbrace{\phantom{G(n) + Q(n)}}_P} + K(n/p) \\ &\leq \frac{G(n) + \cancel{fG(n)}}{G(n) + G(n) \left( \frac{1}{f} - 1 \right) \underbrace{\phantom{G(n) + G(n) \left( \frac{1}{f} - 1 \right)}}_P} + K(n/p) \end{aligned}$$

$$\leq \left( f + \frac{1-f}{f} \right) + K(n/p)$$

$n/p$  terms

Matrix Multiplication

Data Structure (Memory Required)

L1 - 32

L2 - 256



Real

Problem size

$\sqrt{64}$   
128

256

512

1024

2048



Random

1 step  
2 step

} for  $i = 1, 2^{\text{nd}}$   
 $A[i] = B[i] + C[i] \times D[i]$



} Time for this  
case  
contin

for  $j = 1, 2^{\text{nd}}$   
 $f[1] = 1, 2$   
 $\boxed{a = b + c}$  ~~a~~

$2^{\text{nd}} \neq 2$   
 $\rightarrow 2 \times 10$   
for  
Process  
 $\rightarrow$  ~~locvars~~

A rectangular step function with a height of 2.

~~locvars~~ 1sec.

\* Which time ? Resolution  $\rightarrow$   $\Delta t_{Sel}$

[ms, ms]

Start time  
for  $i=1, N$

$$a[i] = \underline{[s_i]} + \underline{[r_i]} - \underline{[f_i]}$$



Compute  $\rightarrow$  Problem size  $\times$   $N$   $\times$   $\frac{\text{No. of operations}}{\text{iteration}}$   
 $\times 2$



~~Compute~~ | ~~Process~~

$$N = 1 \approx$$

Compute  $\rightarrow \underline{\underline{200}}$

$$\underline{\underline{10 \text{ FLOPs}}}/\underline{\underline{\text{sec.}}}$$

$$\frac{200}{10 \times 10^9} \text{ sec.} \\ = \underline{\underline{2 \times 10^{-10}}} \text{ sec.}$$

~~10<sup>-10</sup>~~



$$\frac{\text{Total time}}{10^9} = \underline{\underline{\mu s}}$$

$$\left[ \text{Loop } 10^9 \right]$$

$$\begin{bmatrix} 1 & [r_i] \\ C & [r_i] \\ D & [r_i] \end{bmatrix} =$$



$$\frac{200 \text{ Flops}}{10 \times 10^9 \text{ Flops}}$$

$$= 2 \text{ ns}$$

Clock  $\rightarrow$  ~~and~~  
-1 sec



$$\frac{10 \text{ sec}}{10^9} =$$



Pushku

C

$$a[i] = a$$

a =

✓ for  $i=1, N$

$$\checkmark a[i] = f[i] + C_f[i] \times d[i]$$



| for  $i=1, N$

$a = b + c * d$

(~~old~~ ~~it~~ new)

)

10<sup>9</sup>  
10<sup>16</sup>

N → Bandwidth  
L1 L2 L3 12 AM  
Latency →

GB/s  
sec  
↓  
↓ total time

GB/s  
sec  
↓  
↓ complete

## Algorithm +

- Scope of parallelism? (Amdahl's Law)
- Granularity
- Locality
- Load balance
- Coordination and Synchronization

These makes parallel programming harder than sequential programming.