

# Types of Parallel Computers

Flynn's Classical

Taxonomy

| S I S D                           | S I M D                             |
|-----------------------------------|-------------------------------------|
| Single Instruction, Single Data   | Single Instruction, Multiple Data   |
| M I S D                           | M I M D                             |
| Multiple Instruction, Single Data | Multiple Instruction, Multiple Data |

Instruction Set  
Data Stream

Parallel

Serial Computer

S

SISD



SIMD



time

MISD



Research



|               |
|---------------|
| prev instruct |
| load A(1)     |
| $C(1)=A(1)*1$ |
| store C(1)    |
| next instruct |

P1

|               |
|---------------|
| prev instruct |
| load A(1)     |
| $C(2)=A(1)*2$ |
| store C(2)    |
| next instruct |

P2

|               |
|---------------|
| prev instruct |
| load A(1)     |
| $C(n)=A(1)*n$ |
| store C(n)    |
| next instruct |

Pn

time



MIMD  
==

|                  |
|------------------|
| prev instruct    |
| load A(1)        |
| load B(1)        |
| $C(1)=A(1)*B(1)$ |
| store C(1)       |
| next instruct    |

P1

|                |
|----------------|
| prev instruct  |
| call funcD     |
| $x=y^z$        |
| $sum=x^2$      |
| call sub1(i,j) |
| next instruct  |

P2

|                 |
|-----------------|
| prev instruct   |
| do 10 i=1,N     |
| $\alpha=w^{*3}$ |
| $\zeta=C(i)$    |
| 10 continue     |
| next instruct   |

Pn

time

# m/y Architectures.

## 1. Shared

Executable code

Two forms

(1) Uniform m/y access (UMA)

(2) non-uniform m/y access (NUMA)



Cache  
preserving  
locality

CC - UMA

Cache coherent



a processor can  
access its own local  
memory faster than  
non-local memory

## SMP - symmetric multiprocessor system

=



## Distributed Memory

my address  
↳ no concept  
of global address  
space.



Concept of cache coherency ? {not applicable}

message passing → Programming

(MPI) =

## Hybrid models

=====

Shared +  
distributed

Types of parallel  
computers



1. Shared
2. Distributed
3. Hybrid

# Parallel Programming Models & Paradigms

Traditional — sequential (cost of components)  
↓ high availability

parallel paradigms



{cluster} → cost-effective

n/w/s



Super computers

# Scalability

Scalable computing clusters ?



{ PC or WS or SMPs. }

homogeneous  
heterogeneous

Intelligent mechanisms

(huge data set)

n/w wide resource sharing

↳ Scalability performances

↳ flexible use of n/w  
multi-user mgmt →

HPC



High Performance  
computing powers

Allocation





## Main motivation

- minimize economic risk
- Consumer off-the-shelf technology



# Strategies for developing parallel App<sup>ms</sup>



*Successful*

### Strategy 2: Parallel Libraries



{reusability}

Parallel Libraries

(1) Control structures

(2) provide parallel impl<sup>n</sup> of mathematical routines

### **Strategy 3: Major Recoding**



*Scratch*

Level of parallelism

h/w

s/w

Code granularity

- 1. Signal } h/w
- 2. Circuit } h/w
- 3. Component } s/w
- 4. S/m level } s/w

(umps of code)  
grain size

Code granularity

(latency)

Parallelised by

- (1) Very fine → multiple "inst" issue → Processor
- (2) Fine → data level → Compiler
- (3) Medium → Control level → Programmer
- (4) Large → task - level → "

Level of parallelism

h/w

s/w

Code granularity

- 1. Signal } h/w
- 2. Circuit } h/w
- 3. Component } s/w
- 4. S/m level } s/w

(umps of code)  
grain size

Code granularity

(latency)

Parallelised by

- (1) Very fine → multiple "inst" issue → Processor
- (2) Fine → data level → Compiler
- (3) Medium → Control level → Programmer
- (4) Large → task - level → "



Large grain  
(Task level)

Medium grain  
(control level)

Fine grain  
(data level)

Very fine grain  
(multiple issue)

Programmer → Compiler → Processor

## Tools

1. Parallelizing Compilers ✓ 3 architectures
2. Parallel languages ✓
3. message passings → MPI, PVM
4. Virtual shared m/y
5. Object Oriented pgmming
6. Pgmming skeleton.

vs implements  
shared m/y  
pgmming model ←  
in a distributed  
m/y environment

Distributed Shared m/y  
hybrid



# Methodical Design of Parallel Algorithms

machine independent ✓ → early

machine-specific tasks. → bit delayed

$$\frac{\# \text{Computation}}{\# \text{Comm}^n}$$

- (1) Partitioning → decomposing
- (2) Communication → Coordinate all the partitioned tasks
- (3) Agglomeration → Evaluate first two stages
- (4) Mapping → Load balancing

- ↓  
(a) maximizing processor utilization  
(b) communication cost ↓

{ Scalability  $\swarrow h/w$   
Efficiency  $\swarrow s/w$

# Methodical Design of Parallel Algorithms

## 1. Partitioning

Datastructure



Decomposition → Very fine

Computation data

functional decomposition — Computations ← associate data

domain decomposition — first focus is on data — associate comput<sup>n</sup> to data



{ Disjoint sets  
Overlapping

- 1.  $n$  no. of processors  $\rightarrow$   $n$  no. of tasks
- 2. redundant computation / storage reqmts.
- 3. Granularity  $\rightarrow$  Equal amount of work to processors
- 4. Problem size  $\uparrow$  Does the no. of tasks  $\uparrow$  or  $\downarrow$ ?



# Methodical Design of Parallel Algorithms

## Communication

1. channel link
2. messages

$$\frac{\# \text{ computation}}{\# \text{ communication}}$$



grid  
 tree  
 set of tasks  
 prod cons

1. local/global comm<sup>n</sup> ← graph  
 2. Structured/unstructured comm<sup>n</sup>  
 3. Static/dynamic comm<sup>n</sup> ← varying over time.  
 4. Synchronous/asynchronous comm<sup>n</sup>  
 ↓  
 Consumers

$$x_{i,j}^{(t+1)} = \underbrace{4x_{i,j}^{(t)}}_{\text{Consumed}} + x_{i-1,j}^{(t)} + x_{i+1,j}^{(t)} + x_{i,j+1}^{(t)} + x_{i,j-1}^{(t)}$$

## Domain

{ Data disjoint  
 Set }  
 (Challenging)

## Functional

↓  
 data flow  
 b/w tasks

$\{x_{ij}^{(1)}, x_{ij}^{(2)}, x_{ij}^{(3)} \dots\}$

$x_{i-1,j}^{(1)}, x_{i-1,j}^{(2)} \dots$

$x_{i+1,j}^{(1)} \dots$

$x_{i,j-1}^{(1)} \dots$

$x_{i,j+1}^{(1)} \dots$

for  $t=0$  to  $T-1$   
 { send  $x_{ij}^{(t)}$  to each neighbour  
 receive 4 values from  
 neighbours  
 Compute  $x_{ij}^{(t+1)}$

}

|   |   |   |   |
|---|---|---|---|
| x | x | x | x |
| x | x | x | x |
| x | x | • | x |
| - |   |   |   |

|   |   |   |
|---|---|---|
| x |   | x |
|   | x |   |
| x | • | x |
| - |   |   |

# Methodical Design of Parallel Algorithms



# Methodical Design of Parallel Algorithms

Divide & Conquer { $\log n$ }  
concurrently

procedure DAC

{

if base case

solve the problem

else

partitioning problem into L & R

Solve L using DAC

Solve R using DAC

Combine L & R to get solution

}



Unstructured

—

graphs

mesh



Dynamic

# Methodical Design of Parallel Algorithms

Asynchronous  
or  
synchronous.



1. polls      polling

2.

- (C)
1.  $A[i]$
  2.  $A[i] \times B[i]$
  3.  $\rightarrow$  request C polling
  4.  $B[j] \times C[j]$
  5. :
  6. request D polling

## Communication —

1. Unbalanced → Scalability ?

2. global →  
local

3. efficiency → Concurrency  
PAC

# Methodical Design of Parallel Algorithms — Agglomeration

Determine → replicate the data

1. reduction in communication cost  
increasing computation
2. retain the flexibility w.r.t  
Scalability and mapping decisions
3. Reduce the S/w Engineering Cost

task < processors



# Methodical Design of Parallel Algorithms

## 1. Increasing Granularity

Surface - to - volume effects

### Replicating Computations

① ② ③



# Methodical Design of Parallel Algorithms



Sum → replicate

S - Summation  
 b - broad casting

- $2(N-1)$  steps
- $2 \log N$  steps.



Butterfly      Comm



$n$  tasks  $\Rightarrow$  single processor

more processors

Waiting  
for remote data

{ Overlapping  
Comp<sup>n</sup> & Comm<sup>n</sup> }

$\geq$  more no: of tasks than processors

{ mapping strategies }

↳ load balance

## 4 - Mapping

mapping algms → minimize the total execution time

### Goal

- (1) To execute concurrently on different processors.
- (2) Tasks which communicate frequently on the same processor , which increases locality.

## Load balancing algms

Dynamic load balancing strategy - Agglomeration & mapping

Probabilistic load balancing

{ local algorithms }

{ Global computation state }

{ domain decomposition }

master / slave

Functional  
decomposition

} → task - scheduling Algm →

Decentralized  
—  
RR



# RAM model

Count the primitive operations

\*

frequency count

-

$O(n)$

=

$s = \text{sum}$

$\Theta(n)$

=

$s = d(0)$

do  $i = 1, N-1$

$s = s + d(i)$

end do



# PRAM

1. P processors connected to single shared m/y
2. processor id ( $1 \leq i \leq p$ )
3. SIMD



multiprocessor system with shared memory, on which all  $n$  processors operate in lockstep in memory access and program execution operations. Each processor can access any memory location in unit time

# PRAM Model Variants

1. EREW → Exclusive Read (Upper bound)

↳ Exclusive Write

↳ does not allow any kind of simultaneous to a single memory location

2. CREW → Concurrent Read exclusive Write

3. ERCW → Exclusive Read Concurrent Write.

4. CRCW — Concurrent Read & Write. (Lower bound)

| Writes     | Reads                          |                               |
|------------|--------------------------------|-------------------------------|
|            | Exclusive                      | Concurrent                    |
| Exclusive  | EREW                           | CREW                          |
|            | Least effective<br>Most simple | Most common                   |
| Concurrent | ERCW                           | CRCW                          |
|            | Practically no use             | Most powerful<br>Most complex |

CRCW → How to resolve concurrent writes?

→ (1) Common CRCW → Concurrent writes are allowed when processors are attempting to write the same value

(2) Arbitrary CRCW → a value arbitrarily chosen from the values written to the common my location

(3) Priority CRCW →

(4) Combining CRCW →

# Matrix multiplication — PRAM



$$A = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{pmatrix}, \quad B = \begin{pmatrix} b_{11} & b_{12} & \cdots & b_{1p} \\ b_{21} & b_{22} & \cdots & b_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ b_{n1} & b_{n2} & \cdots & b_{np} \end{pmatrix}$$

$$C = \begin{pmatrix} a_{11}b_{11} + \cdots + a_{1n}b_{n1} & a_{11}b_{12} + \cdots + a_{1n}b_{n2} & \cdots & a_{11}b_{1p} + \cdots + a_{1n}b_{np} \\ a_{21}b_{11} + \cdots + a_{2n}b_{n1} & a_{21}b_{12} + \cdots + a_{2n}b_{n2} & \cdots & a_{21}b_{1p} + \cdots + a_{2n}b_{np} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1}b_{11} + \cdots + a_{mn}b_{n1} & a_{m1}b_{12} + \cdots + a_{mn}b_{n2} & \cdots & a_{m1}b_{1p} + \cdots + a_{mn}b_{np} \end{pmatrix}$$

→ individual multiplication

SIMD

$$\{ C_{ij} = A_{ik} K \times B_{ki} \}$$

$$\downarrow O(1)$$

$$\downarrow O(n^3)$$

↓  
Strassen's mult

$$\downarrow O(n^{2.81})$$

# Individual processors

## Step 1

1. Read  $A(i, k)$
2. Read  $B(k, j)$
3. Compute  $A(i, k) \times B(k, j)$
4. Store in  $C(i, j, k)$



## Step 2

```
1.  $\ell \leftarrow n$ 
2. Repeat
    $\ell \leftarrow \ell/2$ 
   if ( $k < \ell$ ) then
      begin
         Read  $C(i, j, k)$ 
         Read  $C(i, j, k + \ell)$ 
         Compute  $C(i, j, k) + C(i, j, k + \ell)$ 
         Store in  $C(i, j, k)$ 
      end
   until ( $\ell = 1$ )
```

Summation

of

$n$  numbers

$\downarrow$   
DAC

$\underline{\underline{O(n^3)}}$

Merging 2 sorted list

Parallel Merge

$L_1$        $L_2$   
 $n/2$        $n/2$        $\rightarrow n$  elements

$h \rightarrow$  processors

binary Search

$A = \{a_1, a_2, \dots\}$   
 $B = \{b_1, b_2, \dots\}$   
 $C = \{a_1, a_2, b_1, \dots\}$

$O(n)$

$A[1]$        $P_3$   
 $A = 1, 5, \boxed{7}, 13, 17, 19, 23 \rightarrow \text{lower}$

- Processors  
One for each element

$B = 2, \textcircled{4}, 8, 11, 12, 21, 24 \rightarrow \text{Upper}$   
 $A[9]$        $A[16]$

$C = ?$

Task of  $P_3$

$(i-1) + (\text{high} - n/2) \Rightarrow$  position of 7 in merged list.

Binary Search :  $A[i=3]$

↓

$A[3]$

$i-1 = (3-1) = 2$  elements in the lower array

7 is larger 2 elements "

high = index of the largest integer smaller than 7

index = 10     $\uparrow (\text{high} - n/2) = 10 - 8 = 2$  elements

in upper array

$A = [1, 5, 7, 13, 17, 19, 23]$  → lower  
 $B = [2, 4, 8, 11, 12, 21, 24]$  → upper  
 $C = ?$

- Processors  
 One for each element

Task of  $P_{11}$

$$(i - (n/2 + 1) + \text{high})$$

$$A[i=11] = 8$$

$$i - (n/2 + 1) = 11 - (8 + 1) = 2$$

Binary Search with  $A[11]$  in the lower array  
 $\text{high} = \text{index of largest integer smaller than } 8$   
 $\text{high} = 3$

$A = [1, 5, 7, 13, 17, 19, 23]$  → lower  
 $B = [2, 4, 8, 11, 12, 21, 24]$  → upper  
 $C = ?$

- Processors  
 One for each element

Task of  $P_{11}$

$$(i - (n/2 + 1) + \text{high})$$

$$A[i=11] = 8$$

$$i - (n/2 + 1) = 11 - (8 + 1) = 2$$

Binary Search with  $A[11]$  in the lower array  
 $\text{high} = \text{index of largest integer smaller than } 8$   
 $\text{high} = 3$

### MERGE.LISTS (CREW PRAM):

Given: Two sorted lists of  $n/2$  elements each, stored in  $A[1] \dots A[n/2]$  and  $A[(n/2) + 1] \dots A[n]$   
 The two lists and their unions have disjoint values  
 Final condition: Merged list in locations  $A[1] \dots A[n]$   
 Global  $A[1 \dots n]$   
 Local  $x, low, high, index$   
 begin  
     spawn ( $P_1, P_2, \dots, P_n$ )  
     for all  $P_i$  where  $1 \leq i \leq n$  do  
         ( Each processor sets bounds for binary search )  
         if  $i \leq n/2$  then  
              $low \leftarrow (n/2) + 1$   
              $high \leftarrow n$   
         else  
              $low \leftarrow 1$   
              $high \leftarrow n/2$   
         endif

↑↑  
Array C

### Binary search

( Each processor performs binary search ) → ? T/F

```

x ← A[i]
repeat
    index ← ⌊(low + high)/2⌋
    if x < A[index] then
        high ← index - 1
    else
        low ← index + 1
    endif
    until low > high
    ( Put value in correct position on merged list )
    A[high + i - n/2] ← x
endfor
end

```

{ lower array }  
 { upper array }

Sequential algm

$O(n)$

Parallel Merge  
 Time Complexity ?

Binary S =  $O(\log n)$

Overall TC =  $O(n \log n)$

$(\text{const} + B)$  → one element

### MERGE.LISTS (CREW PRAM):

Given: Two sorted lists of  $n/2$  elements each, stored in  $A[1] \dots A[n/2]$  and  $A[(n/2) + 1] \dots A[n]$   
 The two lists and their unions have disjoint values  
 Final condition: Merged list in locations  $A[1] \dots A[n]$

```

Global  A[1 ... n]
Local   x, low, high, index
begin
  spawn (P1, P2, ..., Pn)
  for all Pi where 1 ≤ i ≤ n do
    { Each processor sets bounds for binary search }
    if i ≤ n/2 then
      low ← (n/2) + 1
      high ← n
    else
      low ← 1
      high ← n/2
    endif
  endfor
end
    
```

↑↑  
Array C

### Binary search

{ Each processor performs binary search } → ? T/F

```

x ← A[i]
repeat
  index ← ⌊(low + high)/2⌋
  if x < A[index] then
    high ← index - 1
  else
    low ← index + 1
  endif
  until low > high
  { Put value in correct position on merged list }
  A[high + i - n/2] ← x
endfor
end
    
```

{ lower array }  
 { upper array }

Sequential algm

O(n)

Parallel Merge

Time Complexity ?

(Const + B) → One element

Binary S = O(log n)

Overall TC = O(n log n)

## Work - Depth Model

→ DAG

Single processor

Time taken  $\propto$  number of operations

$T_1 \rightarrow$  amount of time algm take on  
one processor

$T_p \rightarrow$  " P processors.

Wait for the slowest processor to finish

This is known as the depth of an algm

Work :- no: of computations  $\times$  no: of processors used

↳ Topological ordering



course

Lower bound

$$\frac{T_1}{p} \leq T_p$$

Depth of PRAM with Infinite Processors

- Representing algms as DAG's

↳ multiple computations      vs      infinite processors

↳ representing the dependences b/w operations  
in an algm using a DAG.

Constructing a DAG from an algm.



depth =  $T_{\infty}$  = depth of computation DAG.