

Date 19<sup>th</sup> feb 24

Monday

## Quiz 1 :-

Uniprocessor are fast but some probs require : too much data  
too much computation  
too many parameters to explore

### \* Lecture 1 :-

Moore's Law: The num of transistors on microchips doubles every 2 years.

Moore's Law does not hold because ?

- 1.) multiple cores on single chip cause heat issues
- 2.) increasing num of cores may not be able to increase speeds (due to inter-process interactions)
- 3.) transistors would reach the limit of minituzierung

Memory / Disk Speed Argument :-

→ Clock Rates of processors have increased 40% per year over past decade, DRam has improved roughly 10% per year

→ Mismatch in speeds = performance bottlenecks

→ Parallel Platforms :-

- ① provide increased bandwidth
- ② provide higher aggregate caches

Data Communication Argument :-

→ Volume of data is too much in databases & data mining probs so it can not be moved => analysis of this data is done by using parallel techniques.

## ① Distributed Systems:

collection of autonomous computers, connected thru a network and distribution middleware

→ computers coordinate their activities to share the resources.

↳ basic of the system

→ system is perceived as a single, integrated computing facility

## ② Distributed Computing:

a specific use of dist sys, to split a large & complex processing into subparts & execute them in parallel to increase the productivity

① develop algos for dist cluster systems

② computers without any shared mem & have geographical dist

③ high latency & no shared clock

## ③ Parallel Computing:

Develop concurrent solutions for 2 types of systems

① Multi-Core Arch

② Many-Core Arch

## Limitations of Parallel Computing:

- ① design proper communication & synchronization b/w processes & sub-tasks
- ② program must have low coupling & high cohesion
- ③ requires more technical skills to code a parallel prog
- ④ exploring proper parallelism from a prob is an hectic process

See Applications of Parallel Computing  
from slides.

## Lecture 2 :-

Amdeahl's Law :

- formulated in 1967
- shows an upper bound on the max-speedup that can be achieved

$$\text{Speedup (p)} = \frac{1}{F + \left(1-F\right) \frac{p}{p}}$$

processors =  $p$

Derivation :-

Time for sequential code =  $T(s)$

Time for parallel code =  $T(p)$

$T(p) = \text{serial comp time} + \text{parallel comp time}$

$$T(p) = F \cdot T(s) + (1-F) \cdot T(s)$$

$$\text{Speedup} = \frac{T(s)}{T(p)} = \frac{T(s)}{F \cdot T(s) + (1-F) \cdot T(s)} = \frac{T(s)}{T(s) \left[ F + \frac{(1-F)}{p} \right]} = \frac{1}{F + \frac{1-F}{p}}$$

Example 1: 70% of seq algo is parallelizable. Remaining must be calculated sequentially. Calc Speedup

1.) 4 Processors

2.) Infinite Processors

$$1-F = 0.70$$

$$F = 0.3$$

$$0.3 + (1-0.7)$$

$$\frac{4}{4}$$

$$=$$

Date

Example 2: 25% of seq algo is parallelizable. Remaining is sequential. Calc max speedup

1.) 5 processors :-

$$1-F = 0.25$$

$$F = 0.75$$

$$0.75 + (1-0.25) = 1.333$$

$$[5 + 1]d = 6$$

=

2.) Infinite processors

$$\frac{1}{0.75 + (1-0.25)}$$

becomes zero

∴ parallelism is lost

Q. How many proc needed to achieve max speedup?

$$\text{speedup} = \frac{1/(1+F)}{F + (1-F)} = \frac{1}{0.75 + (0.25)} = \frac{1}{1.00} = 1 \Rightarrow 1.33 \Rightarrow \frac{1}{0.75p + 0.25}$$

$$1.33 = \frac{P}{0.75p + 0.25} = \frac{0.9975p + 0.3325}{0.75p + 0.25} = p \Rightarrow 0.0025p = 0.3325$$

$$p = 0.3325 / 0.0025 = 133 \text{ processors}$$

Karp-Flatt Metric :-

calculate serial fraction for a given parallel config

$$e = \frac{1/s - 1/p}{1 - 1/p}$$

∴  $s = \text{Speedup}$

Q. in a parallel prog, for 5 processors, you gained speedup of 1.25 determine seq. fraction.

$$e = \frac{1/1.25 - 1/5}{1 - 1/5} =$$

Date

- ① Sometimes the primary reason for the poor speedup is the sequential part of code
  - ② Sometimes when the serial fraction  $\epsilon$  is steadily increasing with  $p$ , parallel overhead contributes to the poor speedup

## Types of Parallelism :-

- ① Data Parallelism :- independent tasks applying same operation to diff elements of a data set. e.g.  $\text{for } (\text{int } i=0; i < 99; i++)$   
 $a[i] = b[i] + c[i]$

All 100 iterations of loop could be executed simultaneously.

② Functional Parallelism :- independent tasks applying diff operations to diff data elements e.g.  $a = 2$ ,  $b = 3$

③ ② can be performed concurrently ①  $m = (a+b)/2$  ②  $s = (a^2+b^2)/2$

④ Pipelining :- divide whole computation of each instance into multiple stages provided that there are multiple instances of the problem

  - output of one stage is input of the other stage
  - used for probs where single instance of prob can not be parallelized

engine doors wheels paint

|       |       |       |       |
|-------|-------|-------|-------|
| car 1 |       |       |       |
| car 2 | car 1 |       |       |
| car 3 | car 2 | car 1 |       |
| car 4 | car 3 | car 2 | car 1 |
|       | car 4 | car 3 | car 2 |

For each car, car 4 follows, car 3 follows, car 2 follows, car 1 follows

\* **Multi-Processor :-**

- Multiple CPUs with shared mem
- same addr on 2 diff CPU's refers to same memory location

① **Centralized MP :-**

- all processors share same primary mem
- all memory is at one place
- same access time from every processor
- UMA (uniform memory access)
- SMP (symmetric multi-processor)



② **Distributed MP :-**

- distributed collection of memories forms one logical addr space
- same addr on diff processors refers to same mem location
- NUMA (Non-Uniform Mem Access)
- Mem access time varies depending on physical location of referenced addr



\* **Multi-Computer :-**

- elist memory
- disjoint local addr spaces
- each processor has direct access to their local mem only
- same addr on diff processors refers to two diff physical mem location
- processors interact w each other thru passing msgs

Date

## ① Asymmetric Multi-Computer:

- front-end computers interact with users & I/O devices
- ↳ executes a full, multi-programmed OS & provides all functions needed for prog development.
- Back-ends are reserved for executing parallel prog processors → used for "number crunching"

## ② Symmetric Multi-Computer:

- every computer executes same OS
- all nodes participate in execution of a parallel prog
- Users may log into any of the computers → enables multiple users to concurrently login, edit & compile prog.



## Cluster

Co-located collection of low-cost computers & switches dedicated to running parallel jobs.

Run Same Version of OS

Some computers may not have interfaces for user login

- Use high speed networks for communication

## Network of Workstations

• Dispersed collection of computers

• Diff OS & executable programs

• User can login & poweroff their workstations

• Ethernet speed for this network

- 1.) Cache Coherence : In a distributed env, each CPU's cache needs to be continuously updated for current values
- 2.) Snooping : achieve data consistency b/w cache mem & shared mem thru a bus-based system. "Write-Invalidate" & "Write-Update" policies are used for maintaining cache consistency.
- 3.) Branch - Prediction : technique used in CPU design that guesses the outcome of a conditional operation & prepare for the most likely result.

## Lecture 3 :-

### \* Flynn's Taxonomy :-

#### 1.) SISD :-

- single instruction, single data
- single core computers
- single instruction stream is in execution at a given time
- only one data stream active at a time



#### 2.) SIMD :-

- parallel architecture with multiple cores
- cores execute same instruction stream, data stream is diff
- Used with Array Operations, image processing & graphics
- Well-Suited for matrix operations & scientific operations
- Cray Vector Proc Machine



|                           |                           |
|---------------------------|---------------------------|
| load A(1)                 | load A(2)                 |
| load B(1)                 | load B(2)                 |
| $C(1) = A(1) \times B(1)$ | $C(2) = A(2) \times B(2)$ |
| store C(1)                | store C(2)                |

(P1)

(P2)

Date \_\_\_\_\_

Load A(1)      Load A(1)  
 $C(1) = A(1) \times 1$        $C(2) = A(1) \times 2$   
 store C(1)      store C(2)

## MTS1 :-

multiple instr stream,

single data stream

pipeline of multiple independently executing functional units,  
 same data set



## MIMD :-

multiple instr, multiple data

clifff (PUs) can simultaneously execute  
 clifff instruction streams manipulating  
 clifff data



e.g. Multi-processor, 3 multi-computer

Many MIMD arch include SIMD

executions by default

Control Unit

## SIMD :-



- req less hardware (1 CU)
- expensive & long design cycles
- not all apps are suited to SIMD

## MIMI :-



- SPMD (Same Program, Multiple Data) paradigm built from inexpensive common comp with lit effort & short time

\* SPMD is close variant of MIMD

5

19

## \* PRAM :-

- parallel random access machine but execute diff inst in each cycle
- consists of P processors
- global memory  $\rightarrow$  unbounded size



proc share common clock cycle

$\rightarrow$  uniformly available to all proc with same address space

① EREW :-  
 $\rightarrow$  exclusive read, exclusive write  
 $\rightarrow$  No two proc can perform read/write

## ② CREW :-

- $\rightarrow$  read concurrently but can't write at same time
- $\rightarrow$  multiple write entries to mem but location core serialized

③ ERWC :-  
 $\rightarrow$  Can Read Concurrently but can write

## ④ CRCW :-

- $\rightarrow$  Most powerful PRAM model

## \* Concurrent Reads cause no prob, but concurrent write

### Arbitration Protocols for Concurrent Write :-

- 1.) Common : write only if all values that proc are attempting to write are same.
- 2.) Arbitrary : write data from random proc ignore rest
- 3.) Priority : follow predetermined priority order. Proc with highest priority succeeds, rest fail
- 4.) Sum : write sum of data items in all write regs.  
 $\hookrightarrow$  Can be extended for any associative operator

## Communication Costs :-

- comm is major overhead in parallel programs
- dist sys suffer from major comm overheads

① Startup Time ( $t_s$ ) :- Time spent at sending 3 receiving nodes, prepare msg, add headers, trailers, execute route algo, establish interface b/w node & router

② Per-hop time ( $t_h$ ) :- num of hops 3 includes factors such as switch latencies, network delays etc.

③ Per-Word transfer time ( $t_w$ ) :- includes all overheads determined by length of msg. includes link bandwidths & buffering overheads.

## \* Store & Forward Routing :-

→ Msg traversing multiple hops is completely received at an intermediate hop before being forwarded to the next hop.

$$t_{comm} = t_s + (m(t_w + t_h))l$$

$\therefore l$  comm links

$$\Rightarrow t_h \text{ small so } \Rightarrow t_{comm} = t_s + mlt_w$$

Packet Routing breaks msg into packets, each packet carries routing info, error checking & other header info so

$$t_{comm} = t_s + t_{hl} + t_{wm} \text{ i.e., } \therefore t_w = \text{overheads in}$$

packet headers

about network

8881-116

Date 1-2

## 1. transmission

1.0

### Cut-Through Routing:-

- div msg into basic units called "flits", they are very small so header info minimized.
- Router Msg programs tell intermediate routers → flits then take the same route
- No seq. num needed

$$t_{comm} = t_s + t_{nl} + t_{wm}$$

$$t_{comm} = t_s + t_{wm}$$

∴  $t_n$  very small, ignore

$$(1-q)q_t + (1-q)q_t = (\text{exit to mult}) \cdot 10 + (\text{exit to mult}) \cdot \text{workstation} : t_{ao}$$

$$6 \cdot q_t = (1-q)q_t = (q-1)q_t = q_t - q = q_t - q : t_{ao}$$

$$6 - q_t = 1 - q_t + 1 - q_t = (\text{exit to mult}) + (\text{exit to mult}) : \text{return to mult}$$

$$(1-q)q_t = (\text{exit to mult}) + (\text{workstation})$$

$$6 - q_t = 1 - q_t + 1 - q_t = (\text{exit to mult}) + (\text{workstation})$$

$$6 - q_t = 1 - q_t + 1 - q_t = (\text{exit to mult}) + (\text{workstation})$$

6 : return to mult

browsing after class



last work off 6 : return to mult

9 : exit to mult

7 : error

6 : exit to mult, 10 : system

8 : for