

Forward → Separate data  
Backward & indirect space

## CACSC20 - HPC

Von Neumann architecture → sequential  
Program stored in memory  
Inputs & outputs ← Processor  
Shared data interface

★) FLYNN'S CLASSIFICATION OF PARALLEL ARCHITECTURES : Based on the multiplicity of instruction stream and data streams in a computer system. Given by Michael J Flynn in 1966

→ Two types of information flow into a processor → Instructions  
→ Data

★) Instruction Stream : Sequence of instructions executed by the processing unit

★) Data Stream : Sequence of data including i/p, partial or temporal results, called by the instruction stream.

According to Flynn's classification, either of the instruction or data streams can be single or multiple. Computer architecture can be classified into four categories :-

### Computer Architecture



- uniprocessor machine executes single instruction on a single data stream
- conventional single-processor Von-neumann computers
- Serial computer (Not II)
- Pipelining present (Parallelism)
- More than functional unit under a control unit
- Eg : PCs, workstations, mainframes  
CDC-6600, VAXII, IBM 7001



→ Diff IS can be issued on diff DS  
→ Maximum Parallelism  
→ Loosely/Tightly coupled  
→ Shared memory  
→ Eg : Processors  
Supercomputers  
IBM-370,  
"grids"  
UNIVAC-1100/80

## MISD



## MIMD



\* TEMPORAL PARALLELISM: Breaking up a job into a set of tasks to be executed overlapped in time. Also called assembly line processing, pipeline processing or vector processing.

Eg: 4 teachers checking a question paper, 1 question each      T<sub>1</sub>    T<sub>2</sub>    T<sub>3</sub>    T<sub>4</sub>  
 Job = checking paper      Task = checking one answer

Conditions for temporal parallelism :-

- i) Jobs must be identical
- ii) Job can be divided into independent tasks
- iii) Each task takes same time
- iv) Time to pass a task from 1 process to other is negligible w.r.t task execution time
- v) No. of tasks << No. of jobs

If no. of jobs = n, time to do a job = p. Each job divided into k tasks

Time for each task =  $p/k$

Time to complete n jobs without pipeline = np

$$\text{Time to complete n jobs with pipeline} = p + (n-1)p/k = p \frac{(k+n-1)}{k}$$

$$\text{Speed-up} = \frac{np}{p + \frac{(k+n-1)}{k}} = \frac{np}{1 + \frac{k-1}{n}}$$

No. of tasks

### \* Problems in temporal parallelism :

- i) Synchronization : Time to do each task is equal, else holdup
- ii) Bubbles in pipeline : If some tasks are missing in some jobs  $\rightarrow$  bubbles Idle processors  
do these tasks
- iii) Fault tolerance : If 1 processor fails, all fail
- iv) Inter-task communication : Time to pass task must be negligible to task execution time
- v) Scalability : The no. of processor to divide task = no. of individual tasks  
 $\hookleftarrow$  can't increase after certain limit

Pipelining is still used in vector supercomputers like CRAY as each processor expertizes its work

### \* DATA PARALLELISM : Dividing input data into independent sets and processing simultaneously.

Eg: 100 answer sheets are divided among 4 teachers with 25 each  
And each teacher checks complete paper

| $T_1$    | $T_2$    | $T_3$    | $T_4$     |
|----------|----------|----------|-----------|
| $P_1$    | $P_{26}$ | $P_1$    | $P_{76}$  |
| $P_2$    | $P_{27}$ | $P_2$    | $P_{77}$  |
| $\vdots$ | $\vdots$ | $\vdots$ | $\vdots$  |
| $P_{25}$ | $P_{50}$ | $P_{75}$ | $P_{100}$ |

Let no. of jobs =  $n$ , time to do a job =  $p$   
and there are  $k$  processors

Time to complete  $n$  jobs by single processor =  $np$

$$\text{Time to complete } n \text{ jobs by } k \text{ processors} = kp + \frac{np}{k}$$

$\downarrow$  Time to distribute jobs to each processor

$$\text{So, Speed-up} = \frac{np}{kp + \frac{np}{k}} = \frac{k}{1 + \frac{k^2 p}{np}} \xrightarrow{\text{for } k^2 p \ll np} \alpha_k$$

$$\text{Efficiency} = \frac{\text{Speedup}}{k}$$

### \* Advantages :

- i) No synchronization reqd
- ii) No bubble problem
- iii) Fault tolerant  $\rightarrow$  If one fails, others can continue
- iv) No communication delay

### \* Disadvantages :

- i) Static assignment  $\rightarrow$  pre-defined jobs, may have slow tasks
- ii) Sets must be equal time taking & mutually independent
- iii) Each processor must do all, no expertise
- iv) Time to divide the jobs must be small
- v) No of subsets  $\ll$  no. of jobs  $\rightarrow$  Must hold

\*) MIXED PARALLELISM: Both qualities of Temporal and Data parallelism



Example: Supercomputer like NEC-SX

\*) CONDITIONS OF PARALLELISM: For Hism, there must be 'independent' segments



\*) Dependence graph: Describes the relat<sup>n</sup> b/w statements with nodes as statements & directed edges showing dependence, ordered relat<sup>n</sup>

1) Data Dependence: Ordering relationship b/w statements

Data dependency is when a program statement refers to the data of a preceding statement.





Random Access Machine

Parallel instruction

2) CONTROL DEPENDENCE: When order of execution of statements can't be determined before runtime. Eg: If condition often prohibits parallelism.

Compilers are used to eliminate it & exploit the parallelism.

Eg:

`for(i=0; i<n; i++)`

$a[i] = c[i];$

$\text{if}(a[i] < 0) a[i] = 1;$

}

Control-independent

`for(i=0; i<n; i++)`

$\text{if}(a[i-1] < 0) a[i] = 1;$

}

Control dependent

3) RESOURCE DEPENDENCE: When conflicts in using shared resources like integer and float (ALUs), registers etc.

→ ALU conflicts are called ALU Dependence

→ Memory " " " Storage " " → Load, Store

Eg:  $I_1$  and  $I_2$  both accessing addition unit, so ALU dependence

$$I_1: A = B + C$$

$$I_2: G = D + H$$



→ Parallel Random Access Machine

\* PRAM MODEL: → shared memory

→ Can be used in SIMD & MIMD both

→ It has each step of SIMD consist of :-

0(i)  $\leftarrow$  i) Read: Processors from memory (11ly)

0(ii)  $\leftarrow$  ii) Compute: " perform computation in their registers

0(iii)  $\leftarrow$  iii) Write: " write (11ly) to memory



↓ To access shared memory

→ It is divided into 4 categories :-

- Exclusive Read Exclusive Write (EREW): No sharing in reading or writing. least concurrent
- Concurrent " concurrent " (EREW): Write is exclusive, ready is shared. Most common.
- Exclusive " concurrent " (ERCW): Write is shared, read exclusive. Not used greatly
- Concurrent " " " (CRCW): Both shared. Most powerful for 11ism.

Interconnect Networks: No shared memory, processors communicate via bus  
M → memory locat<sup>n</sup> → divided in N → processors  
each processor in its local memory → M/N locat<sup>n</sup>

N/W Topology: Way of connect<sup>n</sup> b/w processors

→ Protocols to specify the value written in concurrent write :-

- Priority CW : Highest priority processor writes
- Common CW : Write allowed only if all want to write same value.
- Arbitrary CW : Arbitrarily a processor succeeds in writing, others fail.
- Combining CW : A function of multiple values that want to be written is written in memory locat<sup>n</sup>. Eg: ip<sub>1</sub>, ip<sub>2</sub>, ip<sub>3</sub> wants to be written so f(ip<sub>1</sub>, ip<sub>2</sub>, ip<sub>3</sub>) is written.

Garbage CW:  
A garbage value  
will be written

\* GRANULARITY: Also called grain size is the measure of amount of computation involved in a ~~soft~~ process. Simplest measure is count of no. of instructions in a grain (program segment). It determines the basic program segment chosen for II processing.



\* Latency: Time measure of the communication overhead incurred b/w machine subsystems.

Eg: Memory latency → Time reqd to access the memory

Synchronization " " → " " for two processors to synchronize with each other

\* Levels of Parallelism: Lower the level, finer the grain



For exploiting fine grain IIism  $\rightarrow$  optimized compiler  $\rightarrow$  auto detect IIism & translates source code to II form to be recognized by run-time system

grains contain

- 1) Instruction Level: Lowest level, less than 20 instructions, fine grain  
Depending upon individual programs, fine grain IIism varies from 2 to 1000s  
Can be detected and exploited within processors.
- 2) Loop Level: Iterative loop operations, contains less than 500 instructions  
Most optimized to execute on II computers, pipeline can be used if successive iterations are independent  
Also fine grain. Recursive loops difficult to IIize
- 3) Procedure Level: Medium grain IIism, contains less than 2000 instructions  
<sup>grain here</sup>  
<sup>inter-process</sup>  
"Detect" of IIism is difficult, Dependence analysis important, Combinatorial requirement  $\leq$  MIMD execution mode, SPMD  $\rightarrow$  special case. Eg: Multitasking  
<sup>Program recorded with compiler assistance</sup>
- 4) Subprogram Level: Jobs steps / related subprograms, grain size in range of 10-100 thousand instructions. Overlapping job steps, Eg:  
Eg: Multiprogramming, IIism exploited by designers instead of compilers  
Coarse / medium grain IIism  $\rightarrow$  require <sup>(Algo design)</sup> suitably designed II programming language

- 5) Job or Program Level: II execut<sup>n</sup> of independent jobs in a II computer.  
High grain size  $\sim$  Millions of instructions in a single program. Used in supercomputers having small no. of  $\uparrow$  power processors, Coarse-grain IIism handled by OS and program loader, Eg: Time / Space sharing multiprocessor

- \* COMMUNICATION LATENCY: Balance granularity & latency  $\rightarrow$   $\uparrow$  performance  
Latencies attribute to machine architecture, implementing technology & communication pattern. Latency imposes a limiting factor on scalability of machine size. Eg: Memory latency  $\uparrow$  with processor cycle time over the years  
Latency in IPC also need to be minimized  $\rightarrow$  also depend on communication pattern  
G:  $n$  <sup>task</sup> <sub>processors</sub>  $\rightarrow$   $\frac{n(n-1)}{2}$  links  $\rightarrow O(n^2)$   $\rightarrow$  Limits no. of processors allowed in a large computer system  
Frequent patterns  $\rightarrow$  burst<sup>n</sup>, broadcast, multicast etc.



Ulk = tick phase followed by tick-phase



## \* ) PROGRAM DECOMPOSITION TECHNIQUES : To introduce Nism in program

### Recursive Decomposition

→ General purpose

→ Divide & conquer strategy used

→ Problem



→ Eg: Quick Sort

### Data Decomposition

→ General purpose

→ Useful for algorithms operating on large datasets

→ i) Partition the i/p data

→ ii) This induces partitioning of comput'n of tasks

→ Similar operations performed

### Types

#### Partitioning o/p data

→ Each processor produces a portion of o/p

→ Independent of other i/p

Eg: Matrix Multiplication

$$\begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix} \begin{bmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{bmatrix} = \begin{bmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{bmatrix}$$

Partition of o/p

$$C_{11} = A_{11}B_{11} + A_{12}B_{21}$$

$$C_{12} = A_{11}B_{12} + A_{12}B_{22}$$

$$C_{21} = A_{21}B_{11} + A_{22}B_{21}$$

$$C_{22} = A_{21}B_{12} + A_{22}B_{22}$$

#### Partitioning i/p data

→ Each processor is given a portion of i/p

→ Eg: SPMD

array add<sup>n</sup>, sorting

#### Partitioning intermediate data

→ Partially computed data is partitioned

### \* ) Owner-Computer Rule : The

process that owns a part of data, performs all computat'n related to it

For i/p partitioning → Process computes

all computat'n involving that data

For o/p partitioning → Process computes all

the o/p it is assigned to compute.

## \* ) Performance factor :-

→ Cycle Time : Time for one cycle ( $\tau$ )

\* ) Clock rate : No. of cycles per unit time in a processor clock. If  $f$  = Processor cycle time =  $\tau$

$$f = \frac{1}{\tau}$$

Book Pg 40 → Table

$$\text{Average CPI} = \frac{\text{Total no. of cycles}}{\text{Total no. of instructions}}$$

Memory cycle → Time for 1 memory reference mem cycle =  $k \times T$  depends on cache speed, tech, bus width, memory access time

\* Instruction Count ( $I_c$ ): No. of machine instructions to be executed in the program. Determines program size. Diff. machine instructions take diff. size no. of clock cycles to execute.

\* Cycles per Instruction (CPI): Measure of time taken by an instruction to execute. No. of cycles reqd. by an instruction to execute.

\* Total No. of cycles reqd to execute a program =  $C$

\* Performance Factor:

i) Execution Time ( $T$ ): Time taken by a program to execute

$$\text{CPU Time} \quad T = I_c \times \text{CPI} \times T$$

$$T = I_c \times (p + m \times k) \times T$$

$p$  = no. of processor cycles needed for instruction decode & execution

$$T = C \times T$$

$m$  = no. of memory references needed

$k$  = ratio b/w memory cycle & process cycle

ii) MIPS rate (Million instructions per second): Processor speed in terms of Mips.

It varies wrt  $f$ ,  $I_c$  and CPI

$$\text{MIPS} = \frac{I_c}{T \times 10^6} = \frac{f}{\text{CPI} \times 10^6} = \frac{f \times I_c}{C \times 10^6}$$

iii) Floating point operations per second: No. of floating point operations per unit time. Measured in mflops (megaflops), gigaflops, teraflops, petaflops.

1 flop = 1 floating point operation per second

iv) Throughput Rate:

$W_s$  = System throughput rate = no. of programs a system can execute per unit time

$W_p$  = Processor throughput rate = " " " " " " " " " " " " " " " " " " " " based on only MIPS rate and  $I_c$

Usually  $W_s < W_p \rightarrow$  system overheads I/O  
OS overhead

$$W_p = \frac{f}{I_c \times \text{CPI}} = \frac{\text{MIPS} \times 10^6}{I_c}$$

| System Attributes                  | J. | P | M | T | K | L |
|------------------------------------|----|---|---|---|---|---|
| Instruction-set Architecture (ISA) | ✓  | ✓ | ✓ | ✓ | ✓ | ✓ |
| Compiler Technology                | ✓  | ✓ | ✓ | ✓ | ✓ | ✓ |
| Processor Implementation & Control |    |   |   |   |   |   |
| Cache & Memory Hierarchy           |    |   |   |   |   |   |

System attributes influence  
Performance Factor

\* SHARED MEMORY PROGRAMMING : All processors share a common program and data stored in memory shared by all. Main program creates processes for each processor and provides them space and info to compute. After computation all processes rejoin. Main program executes after all created processes finish.

i) Fork : Process create statement

ii) Join : Invoking process needs results from invoked processes to continue



Process A

fork B

lock sum

sum := sum + A1

unlock sum

join B

end A

Process B

lock sum

sum := sum + B1

unlock sum

end B

↳ Produces sum = (sum+A1+B1)

fork(1) Add-array (input: A[N/2+1:N], output: sum)

means fork & move to process 1, and start Add-array () with given parameters

G: Program for Adding an array with 2-processor shared memory computer

Sol Main program for the 2-processor computer [ Processor 0 (main), Processor 1 (slave) ]

begin {main}

global sum, Array A[1:N];

for i := 1 to N do

Read A[i]

end for;

Sum := 0;

fork(1) Add-array (input: A[N/2+1:N], output: sum);

Add-array (input: A[1:N/2], output: sum);

join 1; // wait for Processor1 to complete

Write sum

end {main}

Add-array (input: A[P:N], output: sum)

begin {Add-array}

sum\_local := 0;

for i := P to N do

sum\_local := sum\_local + A[i];

end for

lock sum;

sum := sum + sum\_local;

unlock sum;

end; {Add-array}

★) Program for Adding an array with p-processors shared memory computer

for Processor 0 :-

```
begin & main }  
    global Sum, Array [1:N];  
    for i := 1 to N do  
        Read A[i]  
    end for  
    Sum := 0;  
    increment := N div p;  
    for j := 1 to (p-2) do  
        fork(j) Add-array (input: A[j+increment+1:(j+1)*increment], output: sum)  
    end for;  
    fork(p-1) Add-array (input: A[(p-1)+increment+1:N], output: sum);  
    Add-array (input: A[1:increment], output: sum);  
    for i := 1 to (p-1) do  
        join(i);  
    end for;  
    Write sum  
end & main }
```

procid used for branch

★) SPMW MODEL : Single Program Multiple Data - each instance of ~~multiple~~ program

working on its own data, and may follow different conditional branches or execute loops differently. Program written once → replicated many times.

Barrier primitive makes process wait at this primitive for other processes.

→ Program of SPMW add " for array

for begin

```
    global Sum, increment, Array A[1:N];  
    if procid = 0 then  
        for i := 1 to N do  
            Read A[i]  
        end for;  
        Sum := 0;  
        increment := N div P  
    end if;  
    if procid ≠ p-1 then  
        Add-array (input: A[procid*increment+1:(procid+1)*increment], output: sum)  
    else Add-array (input: A[(p-1)+increment+1:N], output: sum) else if  
    end if  
    barrier;  
    if procid = 0 then  
        Write sum  
    end if  
end
```

\*) MESSAGE PASSING PROGRAMMING : Processors send message to each other running parallelly.

Asynchronous primitives: No order, no fixed time of message arrival

- i) Send (destination processor address, variable name, size)
- ii) Receive (source " " , " " , " " )

Synchronous primitives: Order wise message delivery in given bounds. Message send block Receive all block

- i) Blocked send → same argument as send → waits till message is delivered
- i) Blocked receive → " " " receive → waits till specific message is arrived

Program for Two processor addition of array

Processor 0 :

```
Array A[1:N];
begin
  for i := 1 to N do
    Read A[i]
  end for
  Send(1, A[N/2+1], N/2);
  Sum0 := 0;
  for i := 1 to N/2 do
    Sum0 := Sum0 + A[i]
  end for
  Receive(1, Sum1, 1);
  Sum := Sum0 + Sum1
  Write Sum
end
```

Processor 1 :

```
Array B[1:N/2];
begin
  Receive(0, B[1], N/2);
  Sum1 := 0;
  for i := 0 to N/2 do
    Sum1 := Sum1 + B[i]
  end for
  Send(0, Sum1, 1)
end
```

Addition using p-processors.

Processor 0 :

```
Array A[1:N], Sum[0:p-1];
begin
  for i := 1 to N do
    Read A[i]
  end for;
  increment = N div p;
  last-inc = increment + N mod p;
  grand-sum = 0;
```

```
for j := 1 to (p-1) do
  Send(j, A[j+increment+1], increment)
end for
Send(p-1, A[(p-1) * increment + 1], last-inc);
Sum[0] := 0;
for i := 1 to increment do
  Sum[0] := Sum[0] + A[i]
end for
for j := 1 to (p-1) do
  Receive(j, Sum[j], 1)
end for;
grand-sum := 0;
for j = 0 to (p-1) do
  grand-sum := grand-sum + Sum[j]
end for;
Write grand-sum
```

```

Processor k = 1, 2, ..., p-2
Array B[1 : increment];
begin
    Receive(0, B[1], increment);
    Sum[k] := 0;
    for i=1 to increment do
        Sum[k] := Sum[k] + B[i]
    end for;
    Send(0, Sum[k], 1)
end

```

```

Processor p-1
Array B[1 : last_inc];
begin
    Receive(0, B[1], last_inc);
    Sum[p-1] := 0;
    for i=1 to last_inc do
        Sum[p-1] := Sum[p-1] + B[i]
    end for;
    Send(0, Sum[p-1], 1)
end

```

\* SPMD Model - Message Passing Model: It is a loosely synchronous program. Conditional branching due to diff procid. Processors asynchronously execute same program with their own data.

→ Program for SPMD addition of array

```

begin
    Sum-local, Sum-revd, increment;
    Array A[1:N];
    if procid = 0 then
        for i:=1 to N do
            Read A[i]
        end for
    end if
    if procid ≠ p-1 then
        increment := N div p
    else
        increment := N div p + N mod p
    end if
    if procid = 0 then
        for j:=1 to (p-2) do
            Send(j, A[j+increment+1], increment)
        end for
        Send(p-1, A[(p-1)*increment+1], increment+Nmodp)
    else
        Receive(0, A[1], increment)
    end if
    Sum-local := 0;
    for i:=1 to increment do
        Sum-local := Sum-local + A[i]
    end for
    if procid ≠ 0 then
        for j:=1 to p-1 do
            Receive(j, sum-revd, 1);
            Sum-local := Sum-local + sum-revd
        end for
        Write Sum-local
    end if
end

```

MPI → Message Passing Interface

OpenMP → Open specifications for Multi-processing → shared memory parallel computing  
7.11 Pg. 325

- \* ) MPI (Message Passing Interface): It is a standard specification for a library of message passing functions. It specifies public-domain, platform-independent standard of MP functions → providing portability.
- No feature specific to Hardware, OS or vendor
  - It is a library not a programming language
  - Can be called from FORTRAN 77, FORTRAN 95, C, C++ programs
  - Programs are compiled and then linked with MPI library
  - + → The design of MPI is based on:- (4 orthogonal concepts)
    - i) message datatypes
    - ii) Communicators
    - iii) Communication Operations
    - iv) Virtual topology → How 11 computers are connected Eg: Mesh topology

Categorization

### \* ) Running MPI Program:

- MPI assumes all processes are created when a 11 program is loaded.
- No ~~process~~ terminated in middle of program except
- MPI\_COMM\_WORLD: default process group consisting of all ~~active~~ processes in a program
- int MPI\_INIT(int \*argc, char \*\*\*argv) ⇒ first MPI call to initialize MPI environment
  - ↑ no. of params      ↓ actual params
  - Must be called once
  - Multiple calls ⇒ erroneous
- MPI\_COMM\_SIZE → to find out no. of processes in the program (n)
- MPI\_COMM\_RANK → " " " rank of each process (0 to n-1)
- MPI\_FINALIZE → Terminates the MPI environment

Program on Pg. 303

- \* ) Message →   
Message Buffer: The content of message  
Message Envelope: Destination of message

\* ) MPI\_Send(&N, 1, MPI\_INT, i, i, MPI\_COMM\_WORLD) ✓  
(value) Message address / / /  
Message count ← / / /  
Message Datatype ← / / /  
↳ supports heterogeneous & platform independent

Message Buffer      Message envelope

Destinat<sup>Y</sup> / / /  
process id      tag of msg      communicator  
↳ integer used by messages to label types of messages & restrict msg reception

process group with a content

## \* ) Message Buffer :

- i) Message Address : Anything in sender's address space, refers to starting address of message buffer
- ii) Message Count : No. of occurrences of data items of the message datatype starting from message address
- iii) Message Datatype : Heterogeneous computing  $\rightarrow$  vendor, processor, OS independent  
Eg: MPI-BYTE  $\rightarrow$  8 bits, MPI-PACKED, (Read from Book Pg. 205)  
Derived datatypes  $\rightarrow$

## \* ) Message Envelope :

- i) Destinat<sup>n</sup> process id : Id of receiving process
- ii) Message Tag : To order the messages

Eg:

Process X                          Process Y  
Send(M, 64, Y)                  recv(P, X, 64)  
send(N, 32, Y)                  recv(Q, X, 32)

We want to receive P before Q

Process X                          Process Y  
send(M, 64, Y, tag1)            recv(P, X, 64, tag1)  
send(N, 32, Y, tag2)            recv(Q, X, 32, tag2)

If N reaches earlier  $\rightarrow$  buffered

## iii) Communicator :

Tags are allocated by user  $\rightarrow$  mistakes prone,  
Sol<sup>n</sup>  $\Rightarrow$  Content : Allocated at run-time by system in response to user requests & used for matching messages by machine

Eg: Process 0 :-

MPI-Send(m1, c1, MPI-INT, 1, tag1, comm1)

(Parallel-mat-inv());

contains MPI-Send(m2, c1, MPI-INT, 1, tag1, comm2)

Process 1 :-

MPI-Recv(m1, c1, MPI-INT, 0, tag1,

parallel-mat-inv();

comm1, ..)

$\rightarrow$  Might receive m2 here if no

Communicator examples: MPI-COMM-WORLD, MPI-COMM-SELF  $\rightarrow$  contains group of all processes

Comm only the process it user

## \* ) MPI-RCV (Message address, Message count, Message datatype, Dest<sup>n</sup>, procid, Tag, Communicator, status)

status: returns info on the data received, returns error code if any error like overflow occurred

## \* ) Types of Communicat<sup>n</sup>

- Point-to-Point
- Intra-Communicator  $\rightarrow$  Broadcast, multicast
- Inter-Communicator

\* ) OPENMP: It is an API for programming in FORTRAN, C/C++ on shared memory // copiles

MPI → //ize entire applicat" immediately

OpenMP → Incrementally //ize the applicat" without major restructuring

OpenMP



→ Central entity is a thread, not a process

→ Creates multiple cooperating threads running on multiple cores / processors



→ Master thread



### OpenMP Execution Model

→ Syntax of parallel construct :- # pragma omp parallel [clause [clause] ... ]

ensures structured block executes //ly { structured block } one entry & one exit point  
↳ Team of thread does //ly (using SPMD) called parallel one or multiple statements

a command or operator directive (called sentinel)

→ omp\_get\_thread\_num() → returns ID / rank of thread

→ omp\_get\_num\_threads() → returns no. of threads in the team

→ # pragma omp barrier waits until all threads arrive at barrier

→ Variables declared in main → shared by all threads → shared scope

→ " " " " structure → only access to that thread → Private scope  
block

→ 'private' clause changes variable scope from shared to private for local changes in a block  
Similarly 'shared' clause

### \* ) Running an OpenMP Program

→ A multi-threaded program where each thread is identified by its ID & master thread prints no. of threads present

→ Program on Pg. 318

→ //ism in OpenMP is controlled

by compiler directives

# pragma omp string (called sentinel)

3mp

## \* ) Parallel Loops



\* ) Loop Scheduling: For distributing or mapping of loop iteration to thread, OpenMP supports scheduling strategy given by 'schedule' parameter.

- Schedule (static, chunk-size) → Static Scheduling: Assigns blocks of size chunk-size in Round-Robin fashion to threads

Eg:  $\langle \text{Thread 0: } 0, 1, 2, 9 \rangle$   
 $\langle \text{Thread 1: } 3, 4, 5 \rangle$   
 $\langle \text{Thread 2: } 6, 7, 8 \rangle$

→ If comput<sup>n</sup> per loop is not same → some thread may get overload & some get less load

Solution: Dynamic Scheduling ← Load imbalance

Solution: Dynamic Scheduling  $\leftarrow$  Load imbalance

- Schedule (dynamic, chunk-size) → Dynamic Scheduling: Assigns a new block of size chunk-size to thread as soon as it completes computation of previously assigned block  
→ Overhead to keep track

→ schedule (guided, chunk\_size) → Guided Scheduling → Similar to dynamic but chunk size is relative to no. of iterations left.

$$\text{Chunk size} = \frac{\text{No. of iterations left}}{\text{No. of threads}}$$

Eg: 100 iterations, 2 threads

Chunk sizes are 50, 25, 12, 6, 3, 1, 1

→ schedule (auto) : Mapping decision is delegated to compiler and/or runtime system

### \* Non-Iterative Constructs :

for construct → shares iteration of a loop across team → Data parallelism

sections construct → breaks work into separate independent → Temporal/Task parallelism sections to execute 1 by 1

# pragma omp sections

{ # pragma omp section  
  { structured block } } → independent of each other → executed 1 by 1 by diff threads

# pragma omp section  
  { structured block } } → implicit synchronization at end of construct

}

JHP

\* Synchronization Constructs: Protects critical sections (CS) or race condition as multiple threads 1 by 1 access shared data

1) # pragma omp critical → guarantees that structured block (CS) is executed by only 1 thread at a time  
  { structured block }

2) # pragma omp atomic → Ensures single line of code is executed by 1 thread <sup>only</sup> at a time  
  statement-expression  
  single line of code  
  critical → protects all array elements  
  atomic → " excluding 1 array element

3) #pragma omp barrier → Synchronizes the threads at a specific point  
It waits for all threads to arrive at barrier

4) nowait

G: #pragma omp parallel for nowait

→ Eliminates implicit synchronization occurring by default at the end of a parallel section

5) Lock functions

omp\_init\_lock() → initializes ↑ lock variable

omp\_set\_lock() → set the lock

omp\_unset\_lock() → release the lock

6) #pragma omp flush [(list)] → Updates new values of elements in [list] to memory (flushes) so all threads can see correct value  
→ No list → all thread visible variables update  
→ OpenMP uses a relaxed memory consistency model

7) Reduction (operator : list) → reduces list of variables into one, using  
→ neat way to combine results of parallel regions into a single result at the end of parallel regions

Sequential code:

```
int A[MAX], i, sum = 0;  
double avg;  
for(i=0; i<MAX; i++)  
{ sum = sum + A[i];}  
avg = sum / MAX;
```

Parallelized code:

```
# pragma omp parallel for reduction(+:sum)  
for (i=0; i<MAX; i++)  
{ sum = sum + A[i];}  
avg = sum / MAX
```

- i) By only 1 line addition, serial → Parallel code
- ii) With reduction clause, a)
  - a) A Private copy of 'sum' is created for each thread
  - b) Initialized sum
  - c) At end, all copies are reduced to one by 'add' operation
- iii) No synchronization construct like 'critical' reqd

- \* Environment Variables: To control II execution environment.
  - i) OMP\_SCHEDULE → alters 'schedule' clause when schedule(runtime) is specified in 'for' or 'parallel for' directive
  - ii) OMP\_NUM\_THREADS → Sets max no. of threads to use in a II region
  - iii) OMP\_DYNAMIC → Specifies whether runtime can adjust the no. of threads in II region
  - iv) OMP\_STACKSIZE → Controls size of stack for slave threads
- \* OpenMP 3.0: 'task' derivative parallelizes recursive functions & while loops  
 'collapse' clause for ?? nested loops
- \* OpenMP 4.0 (2013): SIMD constructs to support SIMD or vector-level IIism to exploit full potential of today's multicore architectures

\* Combinational Circuit: It is a family of models of computation.  
 It has multiple components (processors) arranged in columns  $\rightarrow$  Stages  
 No. of i/p lines to a component = Fan-in  
 .. " off " of " " = Fan-out  
 Each unit performs ALU operation in 'one time unit' & produce output.  
 Component is active only after receiving all necessary i/p  
 It has no feedback connection



Analysis of Algo → Running time → Best case, worst case, avg case analysis, cost analysis & upper bounds  
 Speedup =  $\frac{\text{Running time of best sequential algo}}{\text{Running time of given Algo}}$   
 No. of processes →  $P(n)$  for  $n$  bits size =  $n$   
 Cost = Running time × No. of processes  
 Efficiency = Worst case running time of best sequential algo / Cost of  $n$  bit algo

\* Butterfly circuit: also a combinational circuit with  $n$  inputs and  $n$  outputs

Depth =  $1 + \log n$  and Width =  $n$ , Size =  $n(1 + \log n) = n + n \log n$

For  $n=8$ :



Decreasing comparator

\* Bitonic Sorting Network:  $\langle a_0, a_1, \dots, a_{n-1} \rangle$  is a bitonic sequence if:-  
 a)  $\langle a_0, a_1, \dots, a_i \rangle$  is monotonically increasing &  $\langle a_{i+1}, \dots, a_{n-1} \rangle$  is monotonically decreasing for some  $0 \leq i \leq n-1$   
 b) OR The sequence is a cyclic shift of (a) Eg:  $\langle 5, 7, 8, 9, 6, 4, 2, 1 \rangle$  or  $\langle 5, 7, 9, 6, 4, 2, 1, 3 \rangle$

Analysis of Bitonic Sort : 1st sorting network,  $n = \text{width}$ ,  $\log n$  stages, last stage  $\rightarrow (+)BM(n) \rightarrow \text{depth of } \log(n)$  all other stages sort  $n/2$  elements.

$$\text{depth} \leftarrow d(n) = d(n/2) + \log n \Rightarrow d(n) = ((\log n)^2 + \log n)/2 = O(\log^2 n)$$

Let  $S = \langle a_0, a_1, \dots, a_{m-1} \rangle \rightarrow$  Bitonic sequence where  $a_0 \leq a_1 \leq \dots \leq a_{n/2-1}$  and  
Consider two subsequences:-  $a_{n/2} \geq a_{n/2+1} \geq \dots \geq a_{m-1}$

$$S_1 = \langle \min(a_0, a_{n/2}), \min(a_1, a_{n/2+1}), \dots, \min(a_{n/2-1}, a_m) \rangle$$

$$S_2 = \langle \max(a_0, a_{n/2}), \max(a_1, a_{n/2+1}), \dots, \max(a_{n/2-1}, a_m) \rangle$$

Now,

all values in  $S_1 < S_2$  <sup>values in</sup> and  $S_1, S_2$  both are bitonic sequences

This is called

This splitting of a bitonic sequence into two bitonic sequences is called BITONIC SPLIT. We use this recursive algo to sort bitonic sequence:-

- i) Split the sequence  $S$  into two sequences  $S_1$  &  $S_2$  using bitonic split
- ii) Recursively sort the two sequences  $S_1$  &  $S_2$
- iii) The sorted sequence of  $S =$  sorted sequence of  $S_1$  followed by sorted sequence of  $S_2$

This process of sorting a bitonic sequence is known as Bitonic Merging. This can be implemented using combinational circuit. (Pg. 256). For a sequence of length  $= n$ , it has  $\lceil \log n \rceil$  layers ~~depth~~ each with  $\lceil n/2 \rceil$  comparators

If all comparators are increasing comparators  $\rightarrow$  <sup>circuit represented</sup> ascending output  $\rightarrow (+)BM(n)$   
 " " " " decreasing " "  $\rightarrow$  descending output  $\rightarrow (-)BM(n)$

Depth =  $\lceil \log n \rceil$ , Width =  $n$

$(+)BM(n)$  and  $(-)BM(n)$  are building blocks for combinational ckt for sorting.

- i) Any sequence of only 2 numbers forms a bitonic sequence  $\rightarrow$  Trivial
- ii) A sequence consisting of a monotonically increasing sequence & a monotonically decreasing sequence forms a bitonic sequence.

So, we start with a sequence of size 2 and incrementally construct bitonic sequences. If we have <sup>two</sup> bitonic sequences of length  $m/2$  each  $\rightarrow (+)BM(m/2)$  to one sequence. Concatenate these sequences  $\rightarrow$  Bitonic sequence of  $m$  length  $\rightarrow (-)BM(m/2)$  to other segment. Once we have a bitonic sequence of length  $n$ , we  $(+)BM(n)$  to sort. (Pg. 257)

\* Combination ckt for Sorting by Merging: Splits sequence in two halves, sorts & merges

→ Odd even merging ckt

Sequence  $S_1 = \langle x_1, x_2, \dots, x_m \rangle$  and  $S_2 = \langle y_1, y_2, \dots, y_m \rangle$  can be merged to sorted sequence  $S = \langle z_1, z_2, \dots, z_{2m} \rangle$  when  $m = 2^n$ .

So to merge two sequences we follow  $(m, m)$  Odd-even merging :-  
length of each i/p

i) Using an  $(m/2, m/2)$  merging ckt, merge odd-indexed elements of two sequences :  $\langle x_1, x_3, \dots, x_{m-1} \rangle$  and  $\langle y_1, y_3, \dots, y_{m-1} \rangle \Rightarrow \langle u_1, u_2, \dots, u_{m-1} \rangle$

ii) Similarly merge even-indexed elements  $\langle x_2, x_4, \dots, x_m \rangle$  &  $\langle y_2, y_4, \dots, y_m \rangle \Rightarrow \langle v_1, v_2, \dots, v_{m-1} \rangle$

iii) Output  $\langle z_1, z_2, \dots, z_{2m} \rangle$  is given as :-

$$z_1 = u_1 \text{ and } z_{2m} = v_m$$

$$z_{2i} = \min(u_{i+1}, v_i) \text{ and } z_{2i+1} = \max(u_{i+1}, v_i) \quad \forall i \geq 1$$

Pg. 259 → diagram

\* Analysis : Width =  $m$  = input size , Depth :  $d(m) = 1 ; m=1$

Size :

$$p(1) = 1 ; m=1$$

$$p(m) = 2p(m/2) + m-1 ; m>1$$

$$\text{So, } p(m) = 1 + m \log m$$

$$d(m) = 1 + \log m \rightarrow \text{very fast}$$

→ Odd even Merge sorting ckt : For input of size =  $n$  :-

i) Split the unsorted i/p  $\langle a_1, a_2, \dots, a_n \rangle$  into two unsorted sequences  $\langle a_1, a_2, \dots, a_{n/2} \rangle$  and  $\langle a_{n/2+1}, a_{n/2+2}, \dots, a_n \rangle$

ii) Recursively sort these sequences

iii) Merge the two sorted sequences using  $(m/2, m/2)$  merging ckt to get output

\* Analysis :

Width:  $O(n/2)$

Depth:  $d_s(n) = d_s(n/2) + \underbrace{d_m(n/2)}_{\text{sorting depth}} + \underbrace{\log(n/2)}_{\text{merging depth}} \Rightarrow d_s(n) = O(\log^2 n)$  → Time reqd. for sorting

Size:  $p_s(n) = 2p_s(n/2) + p_m(n/2)$

$$\hookrightarrow p_s(n) = O(n \log^2 n)$$

ii) Using PRAM Models: Based on the idea of sorting by enumerations. Each element finds its own position in sorted sequence by comparing its value with all other elements.

Rank of element  $s_i$  = position of  $s_i$  in sorted sequence  $\ell = \pi_{i+1}$

$\pi_i$  = no. of elements smaller than  $s_i$ ;  $R = \langle \pi_1, \pi_2, \dots, \pi_n \rangle$

\* CRCW Sorting: If we have  $n^2$  processors and if more than 1 processor tries to write in a <sup>memor</sup> local, then sum of all is written there. So  $P_{ij} = \text{Processor in } i^{\text{th}} \text{ row \& } j^{\text{th}} \text{ column}$  (Total  $m$  rows &  $n$  columns  $\rightarrow n^2$  processors). So,  $P_{ij}$  tries to write 1 in  $\pi_i$  if  $s_i > s_j$  or ( $s_i = s_j$  and  $i > j$ )  $\rightarrow$   $\begin{cases} s_j \text{ is} \\ \text{smaller than } s_i \end{cases}$ . So if we place  $s_i$  at position  $\pi_{i+1}$   $\rightarrow$  sorted

Procedure for CRCW Sorting:

for  $i := 1$  to  $n$  do in parallel

    for  $j := 1$  to  $n$  do in parallel

        if  $s_i > s_j$  or ( $s_i = s_j$  and  $i > j$ ) then

$P_{ij}$  writes 1 to  $\pi_i$

Time  $\rightarrow O(1)$

        end if

Space  $\rightarrow O(n^2)$

    end for

+ Processor

for  $i := 1$  to  $n$  do in parallel

$P_{i,1}$  puts  $s_i$  in  $(\pi_{i+1})$  position of  $S$

Unrealistic write

conflict resolution

end for

\* CREW Sorting:

One process  $P_i$  computes  $\pi_i$  on completion of  $n$  iterations

Time  $\rightarrow O(n)$

In  $j^{\text{th}}$  iteration, all processors read  $s_j$

Space  $\rightarrow O(n)$

$P_i$  increments  $\pi_i$  if  $s_i > s_j$  or ( $s_i = s_j$  and  $i > j$ )

+  $n$  processors

So in the end, we will have  $\pi_i = \text{no. of elements smaller than } s_i$

So place  $s_i$  at position  $\pi_{i+1}$   $\rightarrow$  sorted sequence

Procedure:

for  $i := 1$  to  $n$  do in parallel

    for  $j := 1$  to  $n$  do

        if ( $s_i > s_j$ ) or ( $s_i = s_j$  and  $i > j$ ) then

```

 $P_i$  adds 1 to  $r_i$ 
end if
end for
for  $i := 1$  to  $n$  do in parallel
     $P_{i,1}$  puts  $s_i$  in  $(r_i+1)$  position
end for

```

### \* EREW Sorting :

Time :  $O(n)$

To avoid concurrent read, allow read in cyclic order Space :  $O(n)$

In first iteration  $P_1, P_2, \dots, P_n$  reads  $s_1, s_2, \dots, s_n$

" Second " ———— reads  $s_2, s_3, \dots, s_n, s_1$  and so on

"  $j^{th}$  " ———— reads  $s_j, s_{j+1}, \dots, s_{j-1}$  and update corresponding  $r$  values

#### Procedure :

for  $i := 1$  to  $n$  do in parallel

for  $j := 0$  to  $n-1$  do

$k := (i+j) \bmod n$

if  $(s_i > s_k) \text{ or } (s_i = s_k \text{ and } i > k)$  then

$P_i$  adds 1 to  $r_i$

end if

end for

$P_{i,1}$  puts  $s_i$  in  $(r_i+1)$  position

end for

### ii) Sorting on Interconnection Networks



Sorting on a linear array

Odd-even transposition sort  
and processors are connected  
to form a linear array

Based on bubble sort

Parallel Compare Exchange Operation

## \* ODD EVEN TRANSPOSITION SORT:

Time:  $O(n^2)$

Let input be  $\langle a_1, a_2, \dots, a_n \rangle$  unsorted.

- i) Odd phase:  $(a_1, a_2), (a_3, a_4), \dots, (a_{n-1}, a_n)$  compared & swapped if out of order
- ii) Even phase:  $(a_2, a_3), (a_4, a_5), \dots, (a_{n-2}, a_{n-1})$  " " " "

[ $a_1$  &  $a_n$  not compared]

Alternatively applied  
After  $n/2$  odd &  $n/2$  even phases  $\rightarrow$  Sorted (Total  $n$  phases)

So in a linear array,  $P_i$  has element  $a_i$

In odd phase,  $P_i$  does a compare-exchange with its right neighbour if  $i = \text{odd}$   
 " even " , " " " " " " " " " " " " " " if  $i = \text{even}$

Procedure

```
for i := 1 to n do
```

```
    if i is odd then
```

```
        for k := 1, 3, ...,  $2\lfloor n/2 \rfloor - 1$  do in parallel
            compare-exchange ( $P_k, P_{k+1}$ )
```

```
    end for
```

```
else
```

```
    for k := 2, 4, ...,  $2\lfloor (n-1)/2 \rfloor$  do in parallel
        compare-exchange ( $P_k, P_{k+1}$ )
```

```
    end for
```

```
end if
```

```
end for
```

\* Sorting on a Hypercube: Hypercube network can be used to optimize existing ~~sort~~ algorithms like Bitonic sort Algo

In bitonic sort let input lines be numbered 0000, 0001, ..., 1110, 1111 (Binary)  
 Comparison is done b/w 2 wires that differ in 1 bit only.

In a hypercube, such wires are neighbours. Thus each i/p is mapped to

Time Complexity still  $O(n^2)$

for i := 0 to  $d-1$  do { $n = 2^{d-1}$  dimensions of hypercube}

    for j := i down to 0 do

        for each  $P_k$  such that  $(i+1)^{\text{st}}$  bit of  $k$  &  $j^{\text{th}}$  bit of

            do in parallel

                compare-exchange ( $P_k, P_k^{(i)}$ )

            flipped bit at  $j^{\text{th}}$  pos

Diagram on Pg. 266



end for end for

## \* MATRIX OPERATIONS ON PRAM MODELS:

i) Matrix Multiplication : If A is  $m \times n$  and B is  $n \times k$  then C is  $m \times k$

$$A \times B = C$$

$$c_{ij} = \sum_{s=1}^n a_{is} \times b_{sj} \quad (1 \leq i \leq m, 1 \leq j \leq k)$$

Sequential algorithm runs in  $O(n^3)$  time , Strassen Algo:  $O(n^{2.37})$  for matrix multiplication

But  $n^2$  values are to be produced,

So min complexity will be around  $O(n^2)$

Sequential procedure:

for  $i := 1$  to  $m$  do

    for  $j := 1$  to  $k$  do

$$c_{ij} = 0$$

    for  $s := 1$  to  $n$  do

$$c_{ij} = c_{ij} + a_{is} \times b_{sj}$$

    end for

end for

end for

Assume A, B, C all are  $n \times n$

## \* CREW Matrix Multiplication:

$n^2$  processors arranged in  $n \times n$  2D array

$P_{ij}$  computes  $c_{ij}$  in off matrix C  
↓  
one inner product

Overall time complexity =  $O(n)$

Procedure:

for  $i := 1$  to  $n$  do parallel

    for  $j := 1$  to  $n$  do parallel

$$c_{ij} := 0;$$

    for  $k := 1$  to  $n$  do

$$c_{ij} := c_{ij} + a_{ik} \times b_{kj}$$

    end for

end for

end for

## \* EREW Matrix Multiplication: Assume in $k^{th}$ iteration, $P_{ij}$ select the pair

$(a_{l_k}, b_{l_k})$  for multiplicand where  $1 \leq l_k \leq n$ . So :-

i)  $1 \leq l_k \leq n$       ii)  $l_1, l_2, \dots, l_m$  are all distinct (not same)

iii)  $a_{l_k}$  &  $b_{l_k}$  can be read only by Processor  $P_{ij}$  in iteration  $k$

$$\text{So } l_k = ((i+j+k) \bmod n) + 1$$

(As if  $P_{ij}$  &  $P_{ab}$  read together - then either  $i=a$  or  $j=b$   
 $\therefore (i+j) \bmod n = (k+1) \bmod n$ )

Procedure for EREW :

```
for i := 1 to n do in parallel
    for j := 1 to n do in parallel
        cij := 0;
        for k := 1 to n do
            dk := (i+j+k) mod n + 1;
            cij = cij + aik * bdkj;
        end for
    end for
end for
```

Time : O(n)  
n<sup>2</sup> processors

+ CRCW Matrix Multiplication

Time : O(1)

All write conflicts resolved by storing the sum of all values. Processors → conceptually  $\rightarrow n \times n \times n$  array

n<sup>3</sup> processors

Procedure

```
for i := 1 to n do in parallel
    for j := 1 to n do in parallel
        for s := 1 to n do in parallel
            cij = 0;
            cij = ais * bsj;
        end for
    end for
end for
```

Cost of all three algorithm  
match with sequential

Matrix Multiplication on a Mesh : nxn processors arranged in a mesh



When P<sub>ij</sub> receives a 'b' :-

- It multiplies them
- Adds the result to c<sub>ij</sub>
- Sends 'a' to P<sub>ij+1</sub> if j ≠ n
- Sends 'b' to P<sub>i+1,j</sub> if i ≠ n

Finally P<sub>ij</sub> has c<sub>ij</sub>

### Parallelism property

- i) Commutative :  $P_i \parallel P_j \Leftrightarrow P_j \parallel P_i$
- ii) Non-transitive :  $P_i \parallel P_j$  and  $P_j \parallel P_k \not\Rightarrow P_i \parallel P_k$
- iii) Associative :  $(P_i \parallel P_j) \parallel P_k \Rightarrow P_i \parallel (P_j \parallel P_k)$

### Procedure

```

for i:=1 to n do in parallel
    for j:=1 to n do in parallel
         $c_{ij} := 0$ 
        while  $P_{ij}$  receives two inputs a and b do
             $c_{ij} := c_{ij} + a+b$ 
            if  $i < n$  then send b to  $P_{i+1,j}$ 
            endif
            if  $j < n$  then send a to  $P_{i,j+1}$ 
            endif
        end while
    end for
end for

```

\* Analysis:  $P_{ij}$  receives its input after  $(i-1) + (j-1)$  steps from beginning and it takes  $n$  steps to calculate  $c_{ij}$ . So  $c_{ij}$  is computed after  $(i-1) + (j-1) + n$  steps.

No. Time complexity =  $O(n-1 + n-1 + n) = O(n)$

No. of processors =  $O(n^2)$

No. Cost =  $O(n^3)$

\* Bernstein's Condition: To check if two processes can execute parallelly

$I_i \rightarrow$  Input set or Read set or Domain of  $P_i$



$O_i \rightarrow$  Output set or Write Set or Range of  $P_i$

Two processes  $P_1$  and  $P_2$  can execute in  $\parallel$  if they are independent & do not create confusing results :  $(P_1 \parallel P_2)$  if :-



|                               |                  |
|-------------------------------|------------------|
| $I_1 \cap O_2 \neq \emptyset$ | Anti-independent |
| $I_2 \cap O_1 \neq \emptyset$ | Flow-independent |
| $O_1 \cap O_2 = \emptyset$    | Op-independent   |

## \* Pipeline Ideal Conditions :

- i) Each instruction can be broken into independent tasks each taking equal time.
- ii) All instructions in order  $\rightarrow$  no branch or jump (Locality in instruction execution)
- iii) Successive instructions are independent i.e.  $i^{\text{th}}$  instruction is independent of  $(i-1)^{\text{th}}$  or previous
- iv) Sufficient resources are available readily

Delays in pipeline  $\Rightarrow$  Pipeline Hazards due to non-ideal conditions

## \* HAZARDS

- i) Structural Hazards
  - $\rightarrow$  Delays due to resource constraints, non-availability of resources
  - $\rightarrow$  one cycle takes longer than others
  - $\rightarrow$  Avoid:
    - More HW Resource
- ii) Data Hazard
  - $\rightarrow$  Delays due to data dependency
    - $\rightarrow$  If instruction J is followed by I and  $D(J) = \text{Domain of } I$  &  $R(I) = \text{Range of } J$
    - So -
- iii) Control Hazard
  - $\rightarrow$  Delays due to branch or control dependency
    - $\rightarrow$  Jump instruction is found only in decode stage
    - $\rightarrow$  Avoidance

Compute delay beforehand & decide whether to tolerate or not



### $\rightarrow$ Avoidance

i) Operand Forwarding: Special circuitry to give less time in a pipeline segment

ii) Code reordering: Special SW  $\rightarrow$  HW-dependent compiler

iii) Stall-injection: Insert NOP instruction to provide delay till operand is not ready (efficiency still  $\downarrow$ )

Unit-3 : Instruction Level Parallel Processing

- How to use ILPP to design high performance PE (Processing elements)
- Earliest use was Pipelining → widely used in RISC
- After RISC, Superscalar processors → execute multiple instructions in 1 clk cycle
  - ↳ uses ILPP by ↑ no. of arithmetic & functional units in PE
- further in VLIW one instruction word encodes more than one operation
- Recent processors schedule no. of instructions on processor to execute parallelly
- No. of transistors on a chip doubles every 24 months (Moore's Law)
- We need to effectively use these multiple processor ('cores') on a single chip
- We will discuss pipelining & superscalar here

\* Pipelining of PEs: Pipelining uses temporal parallelism to ↑ speed effective method of ↑ execut<sup>n</sup> speed given conditions (Ideal):-

- i) It is possible to break instruction into no. of independent tasks, each almost same time-taking to execute
- ii) Locality in instruction execution ⇒ Instructions executed sequentially without JUMP
- iii) Execution of one instruction helps execution of successive instruction & instruction are independent
- iv) Sufficient resources are available in a processor for each instruction in pipeline

But these are rarely achieved!

We will try to achieve pipelining under non-ideal conditions. Let's describe a hypothetical computer architectural model :- SMAC2P → Small Computer 2 Parallel

- RISC Computer (Similar to SMAC2 (Small Computer 2))
- Data memory: Data to be read / written stored here → Register file of 32 registers
- Instruc<sup>n</sup> " : Instruc<sup>n</sup> " " executed " " " General purpose reg to store operand values index values
- IMAR: # Instruc<sup>n</sup> memory address register
- DMAR: Data " " " "
- IR: Data register of Instruction memory
- MDR: " " " Data "
- PC: Program counter → Contains address of next instruc<sup>n</sup> to execute
- Machine is word addressable. A word = 32 bits

→ All instructions are of same length and relative position of operands are fixed

3 types of instructions :-

i) R-type



ii) M-type



iii) J-type



op-code → 6 bit →  $2^6 = 64$   
operations possible

However we consider only a small set (Pg. 58) Rajanikan

→ The only instructions which access memory → Load & Store instructions

Reads words from data memory ↪

& stores it in specified register

↑  
stores the contents of register  
in data memory

→ ALU carries out 1 arithmetic or 1 logic operation in 1 clk cycle

Stages in pipeline → Pg. 59 to 64 (See tables & diagrams only)

Buffer registers are needed to store the intermediate results so that they are not overridden by next instruction

In ideal case, if total no. of instructions to be executed = m

& no. of clk cycles per instruction = n

So time taken in non-pipelining = mn cycles

& .. .. Pipelining = n + m - 1 cycles

$$\text{Speedup due to pipeline} = \frac{mn}{n+m-1} \quad \text{if } (m \gg n)$$

Due to this, each stage takes  $(1+e)$  cycle

$$\text{So, Speedup} = \frac{mn}{m+(mn)(1+e)} \approx \frac{n}{1+e}$$

$$= \frac{m}{\frac{m}{m} + \frac{mn}{m}} \approx m$$

\* Delays in pipeline Execution: Delays in pipeline execution of instructions due to non-ideal conditions are called Pipeline Hazards

- i) Delays due to resource constraints  $\rightarrow$  Structural Hazards
- ii) " " " data dependency b/w instructions  $\rightarrow$  Data Hazard
- iii) " " " branch instructions or control dependency  $\rightarrow$  Control Hazard

1) Delays due to Resource Constraints: Suppose one instruction is reading data from memory and other wants to fetch instruction. If there is a common memory for data & instruction  $\rightarrow$  only one instruction can execute.  $\leftarrow$  Not in S-MAC2P  $\rightarrow$  No stalls  $\leftarrow$

Forced waiting of an instruction in pipeline processing is called <sup>Pipeline</sup> stall

Delay can also occur if execution of one instruction takes longer than others

Eg: A floating point division takes longer than integer addition.

Eg: (i+1) instruction takes 3 clk cycles. So (i+2) can't start at 5  $\rightarrow$  needs to wait till 7.

| Instructions | clk cycles $\rightarrow$ |    |    |     |    |    |     |     |     |    |
|--------------|--------------------------|----|----|-----|----|----|-----|-----|-----|----|
|              | 1                        | 2  | 3  | 4   | 5  | 6  | 7   | 8   | 9   | 10 |
| i            | FI                       | DE | EX | MEM | SR |    |     |     |     |    |
| i+1          |                          | FI | DE | EX  | EX | EX | MEM | SR  |     |    |
| i+2          |                          |    | FI | DE  | X  | X  | EX  | MEM | SR  |    |
| i+3          |                          |    |    | FI  | DE | X  | X   | EX  | MEM | SR |

X = idle period or stall of an instruction due to resource constraint

How to avoid stall?  $\rightarrow$  Speed up floating point to take only 1 clk cycle by using extra H/W

But let's see if the delay is too much to be avoided?

Let a fraction 'f' of total instruction be floating point taking  $n+k$  clk cycles each  
Let total no. of instructions = m

So time to execute m instructions without pipelining =  $m(1-f)m + mf(n+k)$

" " " " " with " " =  $m + (m-1)(1+f)$  clk cycles

$$\text{Speedup} = \frac{m(1-f)n + mf(n+k)}{m + (m-1)(1+f)}$$

$$\text{Speedup} = \frac{n + kf}{\frac{m}{m} + \frac{(m-1)}{m}(1+kf)} \approx \frac{n}{1+kf}$$

(Assuming  $m \gg n$  &  $n \gg kf$ )

for  $k=2$  and  $f=0.1$  Speedup =  $\frac{n}{1+0.2} = 0.833n$

So loss of speedup = 16.6%  $\rightarrow$  Not worth spending for extra H/W

Locking Pipeline: In some designs, whenever work cannot continue in a particular cycle, all stages except the one executing, are stalled  $\rightarrow$  Locking

So :-

clk cycle  $\rightarrow$  1 2 3 4 5 6 7 8 9 10

Instruction

|   |    |    |    |     |    |
|---|----|----|----|-----|----|
| i | FI | DE | EX | MEM | SR |
|---|----|----|----|-----|----|

|     |  |    |    |    |    |    |     |    |
|-----|--|----|----|----|----|----|-----|----|
| i+1 |  | FI | DE | EX | EX | EX | MEM | SR |
|-----|--|----|----|----|----|----|-----|----|

|     |  |  |    |    |   |   |    |     |    |
|-----|--|--|----|----|---|---|----|-----|----|
| i+2 |  |  | FI | DE | X | X | EX | MEM | SR |
|-----|--|--|----|----|---|---|----|-----|----|

|     |  |  |  |    |   |   |    |    |     |    |
|-----|--|--|--|----|---|---|----|----|-----|----|
| i+3 |  |  |  | FI | X | X | DE | EX | MEM | SR |
|-----|--|--|--|----|---|---|----|----|-----|----|

Commonly used in many machines

Here DE could have been done because only EX unit is busy  
But due to Locking  $\rightarrow$  No work

Advantage of Locking: Ease of H/W Implementation

Ensures instructions executed in order <sup>the in which they are issued</sup>

Some machines do not lock & allow instructions without resource constraints to execute. They may get 'out of order' but if it is logically accepted by the program  $\rightarrow$  No need to lock.

2) Delay Due to Data Dependency: Delay might be produced because successive iterations are not always independent of one another. Eg:

| clk Cycle $\rightarrow$ | 1              | 2  | 3  | 4  | 5   | 6  | 7   | 8   | 9   | 10 |
|-------------------------|----------------|----|----|----|-----|----|-----|-----|-----|----|
| (R1=R1+R2)              | ADD R1, R2, R3 | FI | DE | EX | MEM | SR |     |     |     |    |
| (R5=R4+R3)              | MUL R3, R4, R5 |    | FI | DE | X   | X  | EX  | MEM | SR  |    |
| (R6=R7-R2)              | SUB R7, R2, R6 |    |    | FI | DE  | EX | MEM | SR  |     |    |
| (R3=R3+1)               | INC R3         |    |    |    | FI  | DE | X   | EX  | MEM | SR |

stall due to data dependency  $\rightarrow$  Data hazard

H/W is busy

Here, third instruction completes before the second one. This is called 'out-of-order completion' → maybe unacceptable under some situations.  
To avoid this we can use locking: → preserves order but takes 1 more cycle

| Clk Cycle →    | 1  | 2  | 3  | 4   | 5  | 6  | 7   | 8   | 9   | 10 |
|----------------|----|----|----|-----|----|----|-----|-----|-----|----|
| ADD R1, R2, R3 | FI | DE | EX | MEM | SR |    |     |     |     |    |
| MUL R3, R4, R5 |    | FI | DE | X   | X  | EX | MEM | SR  |     |    |
| SUB R7, R2, R6 |    |    | FI | X   | X  | DE | EX  | MEM | SR  |    |
| INC R3         |    |    |    | X   | X  | FI | DE  | EX  | MEM | SR |

How to avoid delay due to data dependency?

i) H/W Method → Register forwarding → Consider ADD R1, R2, R3  
after EX step we will have the result of  $R1 + R2$  stored in buffer register B3.

$$\begin{array}{l} B1 \leftarrow R1 \\ B3 \leftarrow B1 \text{ op } B2 \\ B2 \leftarrow R2 \end{array}$$

(DE step)

So we provide a path from B3 to ALU ifp and bypass MEM & SR stages,  
so instead of waiting it to be written to R3, the result can be used via B3.  
This is register forwarding. It also requires H/W to detect that next instruction requires o/p of current instruction and should be fed back as ifp to ALU.  
Hardware should have facility to forward the o/p to all instructions that require it.

|     |                |                                  |
|-----|----------------|----------------------------------|
| Eg: | LD R2, R3, Y   | $C(R2) \leftarrow C(Y) + C(R3)$  |
|     | ADD R1, R2, R3 | $C(R3) \leftarrow C(R1) + C(R2)$ |
|     | MUL R4, R3, R1 | $C(R1) \leftarrow C(R4) * C(R3)$ |
|     | SUB R7, R8, R9 | $C(R9) \leftarrow C(R7) - C(R8)$ |

| Clk Cycles →   | 1  | 2  | 3  | 4   | 5  | 6  | 7   | 8   | 9   | 10 |
|----------------|----|----|----|-----|----|----|-----|-----|-----|----|
| LD R2, R3, Y   | FI | DE | EX | MEM | SR |    |     |     |     |    |
| ADD R1, R2, R3 |    | FI | DE | X   | X  | EX | MEM | SR  |     |    |
| MUL R4, R3, R1 |    |    | FI | DE  | X  | X  | EX  | MEM | SR  |    |
| SUB R7, R8, R9 |    |    |    | FI  | DE | X  | X   | EX  | MEM | SR |

Using register forwarding

Here we can see that SUB instruction is independent of above instructions but still it is delayed so much → Solution is software scheduling

ii) Software scheduling → Though H/W register forwarding reduced delay, it has not eliminated it completely. So we can re-order some independent instructions to reduce delays without changing the meaning of program. It can further help to reduce or in some cases eliminate delays. For eg. if we reorder the instructions:-

LD R2, R3, Y

---

SUB R7, R8, R9

ADD R1, R2, R3

MUL R4, R3, R1

Clock cycle → 1 2 3 4 5 6 7 8 9

LD R2, R3, Y FI DE EX MEM SR

SUB R7, R8, R9 FI DE EX MEM SR

ADD R1, R2, R3 FI DE X, EX MEM SR

MUL R4, R3, R1 FI DE X, EX MEM SR

R2 not loaded → EX is busy

→ Can be removed by adding more independent instructions in b/w

→ disrupts normal flow of control

3) Delay due to Branch Instruction: Last & most important non-ideal condition → Branch  
If an instruction is branch instruction (known after DE stage)

↓ Next instruction will be

Next sequential instruction  
(Branch not taken)

↓ Instruction specified in branch  
(Branch taken)

In SMAC2P, JMP → unconditional jump → jump address known at DE stage

JMP, JEQ, BCT → conditional

jump address known after DE

jump address known only after EX stage

But we have defined  
PC set in stage 4 for  
all → MEM

So the next instruction to be executed will be available only after MEM stage.

Eg:  $(i+1)$  instant is branch. But it will be known only after DE stage.

Meanwhile  $(i+2)^*$  is fetched. So it needs to be stalled till MEM of  
 $DE \text{ of } (i+2)$

If Branch not taken  $\rightarrow$  DE of  $i+2 \rightarrow$  delay of 2 cycles

" " taken  $\rightarrow$  Fetch will be for new branch instruction

So later this instruction will be fetched again  $\rightarrow$  delay of 3 cycles

| Clk cycle       | → | 1  | 2  | 3  | 4   | 5   | 6  | 7  | 8  | 9   | 10 |
|-----------------|---|----|----|----|-----|-----|----|----|----|-----|----|
| Instruction $i$ |   | FI | DE | EX | MEM | SR  |    |    |    |     |    |
| $i+1$ branch    |   |    | FI | DE | EX  | MEM | SR |    |    |     |    |
| $i+2$           |   |    |    | FI | X   | X   | FI | DE | EX | MEM | SR |

If branch taken

Now if maximum ideal speedup of pipeline = 5

& let % of unconditional branches in a <sup>set of</sup> program = 5 %.

" " " conditional " " " = 15 %.

" " " " " taken in program = 80 %.

So, Ideal no. of cycles per instruction = 1

Avg delay due to unconditional branches =  $3 \times 5\% = 3 \times 0.05 = 0.15$

" " " " " " " = Delay due to taken branch +  
" " " " " " " non - " "

$$\begin{aligned} \text{So speedup with } &= \frac{5}{1 + 0.15 + 0.42} \\ \text{branches} &= 3 \times (0.15 \times 80\%) + \\ &\quad 2 \times (15\% \times 20\%) \\ &\approx 3.18 \\ &= 0.36 + 0.06 \end{aligned}$$

% loss of speedup = 36.4 %  $\rightarrow$  High

Need to resolve it

There can be H/W methods & S/W method to reduce branch delays.

## H/W Method to reduce branch delay

In SMAC2P, The branch address of JMP & JMI is available at the end of DE stage. But address of JEQ & BCT .. " only after EX stage. If we add an extra ALU in DE stage to calculate B1-B2 , then all addresses will be available after DE stage.

This way delay = 1 cycle if branch taken  
= 0 cycle if .. not taken

Now if we calculate the speedup :-

$$\begin{aligned}\text{Avg delay cycles} &= 1 \times 0.05 + 1 \times 0.15 \times 0.8 + 0.0 \times 0.15 \times 0.2 \\ &= 0.05 + 0.12 \\ &= 0.17\end{aligned}$$

$$\text{So speedup} = 5 / 0.17 \approx 4.27$$

and % loss of speedup = 14.6% (from 36.4%)

So by adding H/W we get a gain of 22% → H/W is worth it because of less instructions

Adding this H/W in SMAC2P was not expensive! But in commercial processors there are large no. of <sup>branch</sup> instructions  $\rightarrow$  Expensive to add H/W.

Some other technologies which are more cost-effective in certain processors are :-

- i) Branch Prediction Buffer (BPB)
- ii) Branch Target Buffer (BTB)

i) Branch Prediction Buffer (BPB) → less expensive H/W  
→ less effective  
→ small memory

| Address of BPB memory                 | Address                        | Contents | Prediction bits                                                                          | Prediction                                                                               |
|---------------------------------------|--------------------------------|----------|------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------|
| Low order bit of branch instn address | Address where branch will jump | 2 bits   |                                                                                          | Not taken                                                                                |
| Address of branch instruction         | Branch address target          |          | 00 → Strongly Not taken<br>01 → Weakly .. ..<br>10 → Weakly Taken<br>11 → Strongly .. .. | 00 → Strongly Not taken<br>01 → Weakly .. ..<br>10 → Weakly Taken<br>11 → Strongly .. .. |

At DE step → We will know that instruction is a branch instn or not  
Use lower order bits as address to find in BPB memory  
if pred bits = 10, 11 → Branch taken for  
if pred bits = 00, 01 → Branch not taken → Sequential instruction Predict ~

Initially prediction bits = 00, following table is used to change the prediction bits.

| Current prediction bits | 00 | 00 | 01 | 01 | 10 | 10 | 11 | 11 |
|-------------------------|----|----|----|----|----|----|----|----|
| Branch taken?           | Y  | N  | Y  | N  | Y  | N  | Y  | N  |
| New prediction bits     | 01 | 00 | 10 | 00 | 11 | 01 | 11 | 10 |

Experimentally, the predictions are 90% correct

With 1000 entries in BPP, estimated probability of finding a branch instruction in the buffer is 95%.

So prob. of finding the branch address is at least  $0.9 \times 0.95 = 0.855$

How many clk cycles do we gain?

- In case of SMAC2P with H/W modification → No gain as branch is predicted at DE only
- In case of SMAC2P without H/W; calculated at DE only
- we gain two cycles (MEM → DE) if predictions are correct
- Very useful in machines with slow computations

Why we need two bits for prediction and not one bit?

- Single-bit predictor incorrectly predicts branches more often, particularly in most loops, in comparison to a 2-bit predictor. So 2-bit predictors are found to be more cost-effective.



0: NT, 1: T  
1-bit predictor

00: Strongly Not taken  
01: Weakly Not taken  
10: Weakly taken  
11: Strongly taken

BPB & BTB can be combined!

→ More expensive H/W  
→ More effective memory  
→ More large memory

2) Branch target Buffer (BTB): Unlike BPB, it can be used at FI stage.

| Address                       | Contents                                              | Prediction bits        |
|-------------------------------|-------------------------------------------------------|------------------------|
| Address of branch instruction | Address where branch will jump<br>↓<br>target address | 1 or 2 bits (optional) |

→ Instruction address

At the end of DE stage → we know the instruction is branch

" " " " EX " → " " " target address of branch

Entry is made in a loop in 1st iteration & used by later iterations  
After this BTB entry is made & prediction initialized to 01 bits  
saves 3 clk cycles

Pipeline stages



Here it is assumed that branch target address is found at DE → SMAc2P with H/W

If No H/W → Match of predicted & target address → possible after EX stage

BTB needs to be searched for instruction address → Should not be too large ~ 1000 entries

Now assume unconditional branch = 5%, Conditional branch = 15%, Taken branch = 80%  
= 0.05 + 0.15 = 0.8

Prob. that instruction found in BTB = 95% = 0.95

∴ prediction is correct = 90% = 0.9

So, by having BTB the avg delay cycle when unconditional branch found in BTB = 0

" " " " " NOT " " " = 3  $\rightarrow$  5%

$$\text{So, Avg. delay due to unconditional branch} = 0 \times 0.95 + 3 \times 0.05 \\ = 0.15$$

(No misprediction for unconditional branch)

$$\text{Avg delay due to conditional branch if found in BTB} = 0 \rightarrow 95\% \xrightarrow{\text{Branch taken}}$$
$$\text{" " " " " NOT " " " = } (3 \times 0.8) \xrightarrow{\text{(2} \times 0.2\text{)}} \xrightarrow{\text{Not taken}}$$
$$= 2.8 \rightarrow 5\%$$

$$\text{So, avg delay due to conditional branch} = 0 \times 0.95 + 0.05 \times 2.8 \\ = 0.14$$

$$\text{Avg delay due to misprediction of conditional branch} = 0.1 \times 3 \times 0.95$$

misprediction delay composed  
(10%) cycle for in BTB  
conditional branch

$$= 0.285$$

So,

$$\text{Avg delay due to branches} = \underbrace{5\% \times 0.15}_{\text{Unconditional}} + \underbrace{15\% \times (0.14 + 0.285)}_{\text{Conditional}}$$
$$= 0.05 \times 0.15 + 0.15 \times (0.14 + 0.285)$$
$$= 0.0075 + 0.064$$
$$= 0.0715$$

$$\text{So, Speedup with BTB} = 5 / (1 + 0.0715) = 4.666$$

% loss of speedup = 6.8%. (compare to 36.4%)  $\rightarrow$  very good  
BTB is extremely useful!

\* S/W Method to reduce delay due to branch: In absence of H/W methods, we need to rearrange the statements in such a way that statement following the branch statement (called delay slot) is always executed once it is fetched with continuation of program. Often successful

Eg1

Original program

ADD R4, R5, R6

ADD R1, R2, R3  
JMP X

DELAY SLOT

X

Rearranged program

ADD R4, R5, R6

JMP X

ADD R1, R2, R3

X

Allowed to complete while JMP is decoded if it don't disturb program meaning.  
Else NOP inserted

Eg:2

Y: ADD R4, R5, R6 ←

ADD R1, R2, R3

— — — — —

JMI Y ←

— — — — —

Y+1: ADD R1, R2, R3 ←

JMI Y+1 ←

ADD R4, R5, R6 Delay slot

→ executed while JMS decoding

Eg:3

ADD R4, R5, R6

JMI Z ←

— — — — —

Z: ADD R1, R2, R3 ←

— — — — —

ADD R4, R5, R6

JMI Z+1 ←

ADD R1, R2, R3

— — — — —

Z+1: — — — — —

Eg: Loop unrolling

→ speeds up execution, adopted in many commercial processor like MIPS R4000

\* SUPERPIPELINING: In pipelined processor we assume that all stages take same time of 1 clk cycle. But in practice, the DE & MEM stage takes less than 1 clk cycle. So we can allocate them half a cycle & one cycle to all other stages. Further we divide each clk cycle into two phases of diff resource requirements → then no resource conflicts even for same stage! This is called superpipelining.

## Superpipelined Processing

EX1 & EX2 have diff resource req → No conflict

→ completes in 5 cycles

Clock cycle →

|     |     |    |     |     |      |         |
|-----|-----|----|-----|-----|------|---------|
| 1   | 2   | 3  | 4   | 5   | 6    | 7       |
| FI1 | FI2 | DE | EX1 | EX2 | MEM1 | MEM2 SR |
|     |     |    |     |     |      |         |
| FI1 | FI2 | DE | EX1 | EX2 | MEM1 | MEM2 SR |

## Pipelined processing

|    |    |    |     |    |                            |
|----|----|----|-----|----|----------------------------|
| FI | DE | EX | MEM | SR | → completes in<br>7 cycles |
|    |    |    |     |    |                            |
| FI | DE | EX | MEM | SR |                            |

- ★) SUPERSCALAR PROCESSING: Combines temporal parallelism of pipeline with data parallelism by issuing several instructions in one cycle. The instructions may not be in order but maximizes H/W resources. This is called Superscalar processing.

Superscalar processing with 2 instructions simultaneously :-

Clock cycle →

|    |    |    |     |    |   |   |                                               |
|----|----|----|-----|----|---|---|-----------------------------------------------|
| 1  | 2  | 3  | 4   | 5  | 6 | 7 | → 6 instructions<br>completed in<br>7 seconds |
| FI | DE | EX | MEM | SR |   |   |                                               |
| FI | DE | EX | MEM | SR |   |   |                                               |
|    |    |    |     |    |   |   |                                               |
| FI | DE | EX | MEM | SR |   |   |                                               |
| FI | DE | EX | MEM | SR |   |   |                                               |
|    |    |    |     |    |   |   |                                               |
| FI | DE | EX | MEM | SR |   |   |                                               |
| FI | DE | EX | MEM | SR |   |   |                                               |

Here, in steady state → two instructions will be completed every cycle (ideal case)  
For successful superscalar processing :-

- H/W should permit fetching several instructions simultaneously → several FI
- Several independent ports for Read/Write in data cache → several DE  
If Instruction = 32 bit, 2 fetch together → 64 bit path from Instruction cache regd & 2 IR regd
- Multiple EX units regd to avoid resource conflicts → several EX  
Floating pt + Integ add<sup>n</sup>

Assume two integer execution units & one floating pt. unit & superscalar with 2 instruction fetched together:-

As. I1 → I2

| Instruction.   | No. of cycles for "grat" ALU needed. | Cycles       | 1  | 2  | 3   | 4   | 5   | 6   | 7  |
|----------------|--------------------------------------|--------------|----|----|-----|-----|-----|-----|----|
| I14 R1 ← R1/R5 | 2                                    | INT          | FI | DE | EX1 | EX2 | MEM | SR  |    |
| I2 R3 ← R1+R2  | 1                                    | INT          | FI | DE | X   | X   | EX1 | MEM | SR |
| I3 R2 ← R5+3   | 1                                    | INT          | FI | DE | EX1 | X   | MEM | SR  |    |
| I4 R7 ← R1-R11 | 1                                    | INT          | FI | DE | X   | X   | EX1 | MEM | SR |
| I5 R6 ← R4×R8  | 2                                    | FLOATING PT. |    |    |     |     |     |     |    |
| I6 R5 ← R1+6   | 1                                    | INT          |    |    |     |     |     |     |    |
| I7 R1 ← R2+1   | 1                                    | INT          |    |    |     |     |     |     |    |
| I8 R10 ← R9×R8 | 2                                    | FLOATING PT. |    |    |     |     |     |     |    |

I4 ← I1

Idle cycle

To take care of dependencies :-

|    | cycles → 1 | 2  | 3   | 4   | 5   | 6   | 7   | 8   | 9  | 10 |
|----|------------|----|-----|-----|-----|-----|-----|-----|----|----|
| I1 | FI         | DE | EX1 | EX2 | MEM | SR  |     |     |    |    |
| I2 | FI         | DE | X   | X   | EX1 | MEM | SR  |     |    |    |
| I3 | FI         | DE | X   | X   | EX1 | MEM | SR  |     |    |    |
| I4 | FI         | DE | X   | EX1 | MEM | SR  |     |     |    |    |
| I5 |            | FI | DE  | EXF | EXF | MEM | SR  |     |    |    |
| I6 |            | FI | DE  | X   | X   | EX1 | MEM | SR  |    |    |
| I7 |            |    | FI  | DE  | X   | X   | EX1 | MEM | SR |    |
| I8 |            |    | FI  | DE  | X   | EXF | EXF | MEM | SR |    |

Invalid Cycle

I2 3,4 ⇒ I1 → I2 flow dependency (Read After Write (RAW) Hazard)

I3 4,5 ⇒ I2 → I3 anti dependency (Write After Read (WAR) — )

I4 4 ⇒ I1 → I4 flow dependency (RAW)

I6 5,6 ⇒ I1 → I6 → 5<sup>th</sup> cell (RAW) & I3 → I6 6<sup>th</sup> cell (WAR)

I7 6,7 ⇒ I3 → I7 → 6<sup>th</sup> cell (RAW) & I6 → I7 7<sup>th</sup> cell, I1 → I7 (Output default write after write hazard RAW)

I8 6 ⇒ EXF is only one & is used by I5 → Resource constraint

\*)) Register Scoreboarding: To keep track of dependencies, we use a separate register called Scoreboard. If a computer has 32 registers, then a 32-bit scoreboard indicates busy/free status of registers. When an instruction uses a set of registers  $\rightarrow$  their scoreboard bits set to 1. When execution is completed  $\rightarrow$  scoreboard bits reset to 0. With scoreboard instructions can check if a register is free or not. Now, flow dependency delay can't be reduced but output and anti-dependency delays can be reduced using register renaming. We allocate extra registers for this.

Eg:

|    | Old                           | Register renamed              |                                                  |
|----|-------------------------------|-------------------------------|--------------------------------------------------|
| I1 | $R1 \leftarrow R1 / R5$       | $R1 \leftarrow R1 / R5$       | This removes anti-dependency & output-dependency |
| I2 | $R3 \leftarrow R1 + R2$       | $R3 \leftarrow R1 + R2$       |                                                  |
| I3 | $R2 \leftarrow R5 + 3$        | $R2N \leftarrow R5 + 3$       |                                                  |
| I4 | $R7 \leftarrow R1 - R11$      | $R7 \leftarrow R1 - R11$      |                                                  |
| I5 | $R6 \leftarrow R4 \times R8$  | $R6 \leftarrow R4 \times R8$  |                                                  |
| I6 | $R5 \leftarrow R1 + 6$        | $R5N \leftarrow R1 + 6$       | This delays are also reduced                     |
| I7 | $R1 \leftarrow R2 + 1$        | $R1N \leftarrow R2N + 1$      |                                                  |
| I8 | $R10 \leftarrow R9 \times R8$ | $R10 \leftarrow R9 \times R8$ | All instructions complete in 9 cycles            |

Is it possible to reschedule issue of instructions to reduce delay?  $\rightarrow$  Yes  
We can use an instruction window containing several instructions.  
From this window, our system picks 2 instructions & schedules them to minimize time.

Larger window size  $\rightarrow$  Better optimization

Eg: Window size = 4  $\Rightarrow \{I1, I2, I3, I4\}$   
 $\Downarrow$  picks

Instructions picked  $\leftarrow \{I1, I3\}$

$\downarrow$  Next cycle

$\Rightarrow \{I2, I4, I5, I6\}$

$\downarrow$  {I2, I5}

$\downarrow$  Next cycle

$\Rightarrow \{I4, I6, I7, I8\}$

$\downarrow$  {I4, I8}

$I1 \rightarrow I2$  so let's swap it with I3

$I1 \rightarrow I4 \rightarrow$  delay I4

$I6, I7 \rightarrow$  INT unit busy

## Rescheduling instructions in superscalar

|    | 1  | 2  | 3   | 4   | 5   | 6   | 7   | 8  | 9 | 10 |
|----|----|----|-----|-----|-----|-----|-----|----|---|----|
| I1 | FI | DE | EX1 | EX2 | MEM | SR  |     |    |   |    |
| I3 | FI | DE | EXI | MEM | SR  |     |     |    |   |    |
| I2 |    | FI | DE  | X   | EXI | MEM | SR  |    |   |    |
| I5 |    | FI | DE  | CXF | EXF | MEM | SR  |    |   |    |
| I4 |    |    | FI  | DE  | EXI | MEM | SR  |    |   |    |
| I8 |    |    | FI  | DE  | EXF | EXF | MEM | SR |   |    |
| I6 |    |    | FI  | DE  | EXI | MEM | SR  |    |   |    |
| I7 |    |    | FI  | DE  | EXI | MEM | SR  |    |   |    |

Now I3 completes before I1 and I2. Now due to branch prediction there may be some other instructions that need to be executed & these instructions might be abandoned. Thus it is incorrect to update memory with these instructions. Their results must be stored in buffers. These buffers will later be permanently committed to appropriate storage taking into account the sequential program correctness. This is called Committing or retiring an instruction.

So :-

Both RISC & CISC are superscalar

MIPS 10000, PowerPC  $\rightarrow$  RISC superscalar  
Pentium 4  $\rightarrow$  CISC superscalar





## \* VERY LONG INSTRUCTION WORD (VLIW) PROCESSOR :

Problems with Superscalar → Need to duplicate IR, Decoder & ALU  
→ Difficult to dynamically schedule instructions

Alternative method is to use compiler to expose a sequence of instructions with no dependency & diff resource requirements because Compilers can take a more global view of program & rearrange code to better utilize the resources.

Here, a single word incorporates many independent operations → pipelined easily

Eg: 2 integer operation, 2 floating pt. operation, if not found NO-OP inserted  
2 load/store .. , 1 branch .. → All packed in 1 word

→ The word size varies from 128 bits to 256 bits

→ It requires processor to have enough resources to execute all operations ↑

→ Challenging task: Maintaining enough parallelism in instructions to keep all units busy

→ Parallelism in programs is exposed by :-

i) Loop unrolling

ii) Scheduling instructions by examining program globally

iii) Trace scheduling: Predicting path taken by a branch operat<sup>n</sup> at compile time.  
It uses some heuristics/hints given by program.

If branches follow predicted path → straight line code

→ Challenges :-

i) Lack of sufficient instruct<sup>n</sup> level parallelism: For 1 FLOP (floating point operation) with 4 stages in pipeline, it requires 10 instruction to effectively use. + 2 Integer operations.

Instruction level parallelism may not be high.

ii) Difficulties in building H/W: Requires high memory & large register file. Extra read/write ports → requires large silicon area on chip.

iii) Inefficient use of bits in a very long instruction word: It is difficult to find this many independent instructions. About half of each word is filled with NO-OP

Thus VLSW is not very popular commercially.

## Rescheduling instructions in superscalar

|    | 1  | 2  | 3   | 4   | 5   | 6   | 7  | 8 | 9 | 10 |
|----|----|----|-----|-----|-----|-----|----|---|---|----|
| I1 | FI | DE | EX1 | EX2 | MEM | SR  |    |   |   |    |
| I3 | FI | DE | EX1 | MEM | SR  |     |    |   |   |    |
| I2 |    | FI | DE  | X   | EX1 | MEM | SR |   |   |    |
| I5 |    | FI | DE  | EXF | EXF | MEM | SR |   |   |    |
| I4 |    | FI | DE  | EX1 | MEM | SR  |    |   |   |    |
| I8 |    | FI | DE  | EXF | EXF | MEM | SR |   |   |    |
| I6 |    | FI | DE  | EX1 | MEM | SR  |    |   |   |    |
| I7 |    | FI | DE  | EX1 | MEM | SR  |    |   |   |    |

Now I3 completes before I1 and I2. Now due to branch prediction there may be some other instruction that need to be executed & these instructions might be abandoned. Thus it is incorrect to update memory with these instructions. These results must be stored in buffers. These buffers will later be permanently committed to appropriate storage taking into account the sequential program correctness. This is called Committing or retiring an instruction.

So:-

Both RISC & CISC are superscalar

MIPS 10000, PowerPC → RISC superscalar  
Pentium 4 → CISC superscalar





## A) VERY LONG INSTRUCTION WORD (VLIW) PROCESSOR :

Problems with Superscalar → Need to duplicate IR, Decoder & ALU  
 → Difficult to dynamically schedule instructions

Alternative method is to use compilers to expose a sequence of instructions with no dependency & diff resource requirements because Compilers utilize the can take a more global view of program & rearrange code to better utilize resources.  
 Here, a single word incorporates many independent operations → pipelined easily

Eg: 2 Integer operation, 2 floating pt. operation,  
 2 load/store .. , 1 branch .. → if not found No-OP inserted  
 All packed in 1 word

- The word size varies from 128 bits to 256 bits
- It requires processor to have enough resources to execute all operations ↑
- Challenging task: Maintaining enough parallelism in instructions to keep all units busy
- Parallelism in programs is exposed by :-
  - i) Loop unrolling
  - ii) Scheduling instructions by examining program globally
  - iii) Trace scheduling: Predicting path taken by a branch operation at compile time.  
 It uses some heuristics/hints given by programmer.  
 If branches follow predicted path → straight line code

→ Challenges :-

- i) Lack of sufficient instruction level parallelism: For 1 FLOP (floating point operation) with 4 stages in pipeline, it requires 10 instruction to effectively use. + 2 Integer operations.  
 Instruction level parallelism may not be this high.
  - ii) Difficulties in building H/W: Requires high memory & large register file. Extra read/write ports → requires large silicon area on chip.
  - iii) Inefficient use of bits in a very long instruction word: It is difficult to find this many independent instructions. About half of each word is filled with NO-OP
- Thus VLIW is not very popular commercially.

Scalar instruction → handles one piece of info at a time → like superscalar  
Vector " → handles multiple pieces of info at a time → like array / vector processors

\* Array Processors : Performs computations on large array of data.

There are two types of array processors :-

- Attached array processor → Not in syllabus → array processor is attached to ↑ compute speed
- SIMD Array Processor

→ SIMD Array Processor :

i) It has multiple process unit operating in parallel.

ii) It also manipulate vectors but internal organization is diff from attached array processor



→ PEs are sync together to perform same operat<sup>n</sup> under control of Master Control Unit

→ Thus Master Control provides single instruction that is applied to multiple data (SIMD)

→ Each PE includes :-

- ALU
- Floating pt. arithmetic unit
- Working registers

→ MCU decodes the instruction & determines how to execute it

→ If instruction is scalar / program control instruction → executed within MCU

→ Main memory stores the program, each PE uses operands stored in this local memory

Advantages of Array Processors = v) More energy efficient than MIMD → One fetch for all data

- Increases overall instruction processing speed vi) Higher speed up than MIMD (2x)
- Improves overall capacity of system as most Array Processors operate asynch to system
- Array processors have their own memory, thus provide extra memory for systems with low memory.
- Exploits significant level of data parallelism



→ SIMD Units : The H/W Units performing operations parallelly

It receives two vectors as input, each with a set of operands

It performs same operat<sup>n</sup> on both set of operands (one from each vector) & outputs a vector with result



\* ) Vector Processor<sup>(VP)</sup>: It is a central processing unit that can perform the complete vector input in individual instruction.

→ Used in scientific & research computation → high power requirement

→ Scientific problems can be specified in terms of vectors or matrices → <sup>use</sup> vector processing

→ Features :

i) Vector is a structured set of elements. The elements are scalar instructions, vector elements. Struct called it's layout

ii) In one clk period, 2 successive pairs of vector elements are processed.

The vector has dual vector pipes & dual set of vector functional units → Two 2 pairs of

iii) In parallel vector processing, more than 2 results generated per clk cycle. elements

It automatically starts when:-

→ Successive vector instructions use different functional unit & multiple vector register

→ Successive " " " the results flow from one vector reg as operand of other utilizing diff functional unit. This is called chaining.

iv) VP implements better with higher vector due to foundat<sup>n</sup> delay in pipeline

v) VP decreases overhead to maintain loop control variables

\* ) Vector processor Classification

Memory to Memory Architecture

Register to Register Architecture

→ Three categories to e  
→ Instn. level parallelism

→ Thread feed parallel

→ multiple inputs & parallel

→ vector data paralleliz.



## i) Memory-to-Memory Architecture

- Memory - to - memory transfer
  - Source operands, intermediate & final results are read directly from memory
  - Base address, offset, increment, vector length must be specified
  - Eg: TI-ASC, CDC STAR-100, Cyber-205
  - No "limitation" of size
  - Speed is comparatively slow

## ii) Register-to-register Architecture

- ii) Register-File-registered memory
    - Operands and results read indirectly from memory via large no. of scalar or vector registers
    - Eg: Cray-1, Fujitsu VP-200
    - Limited size
    - Speed is very high compared to memory to memory
    - R/W cost is high



## Shared Memory Symmetric multiprocessing:

Very common parallel computer architecture (Uses 4 to 32 processors)

Provides global address space for writing 11 programs → easy to program

Each processor has a private cache

Each processor is connected to main memory (shared) either with shared bus or using Interconnect<sup>TM</sup> Network

The avg time to access main memory from any processor is same  $\rightarrow$  Symmetric Multiprocessing(SMP)



i) fork : Statement used to create a process

ii) join : " " when parent process needs the result of child process

Eg:

Process A

```

fork B
sum ← sum + f(A)
join B
end A
  
```

Process B

```

:
sum ← sum + f(B)
end B
  
```

When A forks B, then still join, both A and B are working concurrently. Thus data may be misread or miswritten.

Suppose Sum = 0 → A read

→ B also read simultaneously

A did Sum ← sum + f(A)

B did Sum ← sum + f(B)

Now A writes Sum as f(A) & then B writes Sum as f(B). Final Sum → f(B)

This is because B should have read sum after A has done its work.

Need for Lock

We have the following hardware for lock :-

lock: LD R1, 0, L // C(R1) ← C(L)

CMP R1, #0 // Compare R1 with 0

BNZ Lock // If R1 ≠ 0 try again  
goes on busy-wait if L=1

ST L, #1 // L ← 1

RET // Return

Locking

Process A

```

fork B
lock sum;
Sum ← sum + f(A)
unlock sum;
join B
end A
  
```

Process B

```

lock sum
Sum ← sum + f(B)
unlock sum
end B
  
```

immediate operand ↑

unlock: ST L, #0 // Store 0 in L

L=0 → unlocked

L=1 → locked

However the problem is that lock is not an atomic instruction. Between the instructions of lock, if other process is reading lock then there may be dirty read. So for atomic instruction:

$\rightarrow$  Test & Set  $\rightarrow$  atomic read-modify-write

lock:  $TST\ R1, 0, L // C(R1) \leftarrow L, L \leftarrow 1$  unlock:  $ST\ L, 1, 0 // L \leftarrow 0$   
 $CMP\ R1, \#0 // Compare R1 with 0$  | RET  
 $BNZ\ lock // If R1 \neq 0 try again$

Barrier  $\rightarrow$  ensures all processes complete before proceeding further  
 $\hookrightarrow$  also requires atomic instruction

Sequential Consistency: A multiprocessor is sequentially consistent if the result of any execution is the same as if the operation of all processors were executed in some sequential order and the operations of each individual processor occurs in this sequence in the order specified by its program.

## \* Difference between Grid & Cluster Computing

### ~~Grid~~ Cluster Computing Homogeneous

- i) Collection of tightly or loosely coupled computers that work together to act as a single entity.
- ii) Computers are dedicated to some work & perform no other task.
- iii) Computers are located close to each other.
- iv) Computers are connected by high speed LAN.

v) Centralized Network Topology - Central Server  
vi) Whole system functions as a single system.



### Grid Computing

- i) A network of ~~central~~ homogeneous or heterogeneous computers working together to perform a task difficult for a single machine.
- ii) Computers contribute their unused processing resources to grid network.
- iii) Computers may be located at large distances from each other.
- iv) Computers are connected by slow speed bus or internet.
- v) Decentralized Network Topology Peer to Peer  
vi) Each node is autonomous & can opt out anytime.



\* Interconnection Networks: Each system needs to interact to other system via connection links.

Direct

Each node connected to other node or neighbour node by dedicated communication line

Indirect

Each node is connected to many other nodes using one or more switching elements

Processor → Memory

→ Crossbar Switch: Simultaneously connects  $(P_i, M_j)$

Non-blocking N/W

Routing mechanism is called Crosspoint Switch

For  $m$  Processors &  $n$  Memory modules,  $n^2$  switches are reqd

At any time

$P_i \leftrightarrow M_j$

$P_i \leftrightarrow M_k$

( $i \neq k$ )  
 $(j \neq k)$



Structure of Crossbar Switch :-



→ Multistage Interconnect N/W or Generalized Cube N/W: It's basic component

has a control input  $C$ .



A general multistage N/W has  $N$  i/p,  $N$  o/p,  $N = 2^m$  → no. of stages

Each stage →  $N/2$  interstage switches. So No. of switch =  $m(N/2) = \frac{N}{2} \log_2 m$   
(See fig in Boos-Rajaraman Pg. 154)

→ Shuffle Interconnect:  $\text{Shuffle}(a_m a_{m-1} \dots a_2 a_1) = a_{m-1} a_{m-2} \dots a_2 a_1 a_m$

↳ Circular left shift by 1 bit

→ Omega N/W :  $\log_2 N$  cascaded switched using shuffle connect<sup>m</sup>  
 See fig. Pg 156

→ Butterfly N/W : Does not allow N i/p to N o/p connection without conflicts



\* ) Direct Interconnected<sup>m</sup> N/W : Characteristics of Interconnected<sup>m</sup> N/W

i) Total N/W Bandwidth : Bytes/second that N/W can support  
 For a ring of n CEs with individual link

$$BW = B, \text{ total } BW = mB, \text{ for Bus} = B$$

ii) Bisection Bandwidth : Imagine a cut dividing N/W into 2 parts & calculating BW across the cut. In case of ring of n CEs,  
 it is  $2B$ . for Bus =  $B$

iii) No. of ports each N/W switch has . For ring , it is 3.  $\xrightarrow{\text{for Bus} = 1}$   $\xrightarrow{\text{for Bus} = }$

iv) Total no. of links b/w switches & b/w switches & CEs : In ring it is  $2m$



→ 2D grid :



$n$  processors,  $\sqrt{n}$  rows,  $\sqrt{n}$  columns,

Each row  $\rightarrow (\sqrt{n} - 1)$  links  
 Each col  $\rightarrow (\sqrt{n} - 1)$  links

$$\text{So, total no. of link} = 2\sqrt{n}(\sqrt{n} - 1) + n$$

$$\text{If BW of each link} = B \\ \text{So, total BW} = 2\sqrt{n}(\sqrt{n} - 1)B$$

↓  
 Each switch to CE

→ 2D Toroidal: Switches at edges are also connected by links.



→ Max symmetric N/W

→ Disadvantage: Non-planar

If  $m$  CE, total BW =  $2mB$

Bisection BW =  $2mB$

No. of links =  $3m$

Ports per switch = 5 ( $4+1$ )

→  $N$  cube or Hypercube: Interconnects  $2^N$  CEs one at each corner of a cube.

$N=1$



$N=2$



1 cube

$N=3$



2 cube

3 cube

$$\text{Total BW} = \frac{B(n \log_2 n)}{2} \quad (n = 2^N)$$

$$\text{Bisection BW} = \frac{Bn}{2}$$

$$\text{No. of ports per switch} = 1 + \log_2 n$$

$$\text{Total no. of links} = n + n \log_2 \left(\frac{n}{2}\right)$$

\* Routing Mechanisms for directly connected systems:

→ Each node connected to other node via switch → sophisticated to → Router

→ Router: Examine incoming messages & decided when it should be sent (for which CE it is destined)

→ Messages  $\xrightarrow{\text{broken}} \text{packets} \xrightarrow{\text{Header}} \boxed{\text{Header} \quad \text{Data}}$  → Packet switching

Time taken by datagram to reach destination =  $\frac{\text{Info of Destination CE}}{\text{order of packet (seq. No.)}}$

Since Router Destined packets re-ordered

→ Routier require storage at each router to buffer packets → to order them

→ Virtual Cut through Routier: Router buffers a ~~node~~ only if the next link it is to take is occupied

packet at a node

→ Wormhole Routier: Packet sub-divided into flits (flow control digits)

→ First flit = Header info, rest → data Header flit → block, other flit → buffered

→ All flit follow the header flit → Very fast, space efficient

→ Flits of diff packet → Not interleaved

→ Successive flits of a packet → async pipeline w/ H/W



## \* Performance Benchmarks:

i) Peak-performance: Theoretical maximum based on best possible utilization of resources

ii) Sustained performance: Based on running application-oriented benchmarks

Benchmark: Set of programs or

Program fragments

Used to decide which system to use in which application to compare performance of various machines.

Indicate instant<sup>n</sup> In terms of event<sup>n</sup> rate  $\rightarrow$  MIPS, a Mflops

Indicates floating pt. capability

Digital pt.  
of act<sup>n</sup>

i) Synthetic benchmarks: Small programs constructed for benchmarking but don't represent real computation (workload)

G: Whetstone formulated for FORTRAN to measure flop performance,

Dhrystone to measure integer performance in C

$\rightarrow$  Doesn't reflect actual behavior of complex computation

i) Kernel benchmarks: Program fragments extracted from real programs. Heavily used core of the programs.

G: Livermore Loops for FORTRAN, Linpack for FORTRAN

Drawback: Performance results (in Mflops) are very large for non-scientific computations

ii) Application benchmarks: Several complete applications / programs that reflect the workload of a standard user, called (benchmark suites)

Implementation is very time & resource intensive

Produces meaningful results

G: SPEC (system performance evaluation corporal<sup>n</sup> group)

SPEC2006, SPEC MPI2007, SPEC OMP2012  
for parallel computing

$$S(m) = \frac{T_s + \alpha(m) T_p(m, n)}{T_s + \frac{\alpha(m)}{n} T_p(m, n)} = \frac{\alpha + \alpha(m)(1-\alpha)}{\alpha + \frac{\alpha(m)}{n}(1-\alpha)}$$

i)  $G(m)=1 \rightarrow$  Amdahl's law  
ii)  $G(m)=n \rightarrow$  Gustafson's law

\* One overhead in Parallel Processing: Parallel overhead is the time reqd to coordinate  $n$  tasks as opposed to doing useful work. Reasons for this are :-

i) IPC: If each processor ( $p$  processors) spends  $t_{comm}$  time for data transfer, then IPC time =  $p \times t_{comm}$  (or  $p \times t_{comm} + \alpha$ ) ( $0 < \alpha < 1$ )  $\xrightarrow{\text{computing/communication overlap factor}}$   
If  $\uparrow$  high  $\Rightarrow$   $\uparrow$  parallel overhead

ii) Load Imbalance: Always not possible to determine size of sub-problem for each processor. Load may not be uniform. Some may be idle, some overloaded  
 $\hookrightarrow$  parallel overhead

iii) Inter-task Synchronization: If subtask B on processor  $P_B$  depends on subtask A running on processor  $P_A$ , then execution of B is delayed till A is complete. For that time, B is idle  $\rightarrow$  parallel overhead  
 $\uparrow$   
(wait or join)  
or barrier

iv) Extra Computation: If the best sequential algorithm does not have higher degree of parallelism, then we may be forced to use a parallel algo with more operations count - Eg: Bidirectional Gaussian elimination  
The extra computation accounts to parallel overhead.

v) Other overheads: Task creation, Task scheduling, Task termination, memory latency, cache coherence enforcement etc.

Coarse grain task  $\rightarrow$  Less II overhead b/w less degree of parallelism & fine " " "  $\rightarrow$  Large " " " Large " " "

\* Parallel Balance Point: On  $\uparrow$  no. of processors  $\rightarrow$  execution time of prog  $\downarrow$   
If problem size is fixed, " " " "  $\rightarrow$  comput<sup>ing</sup> time per processor + (Workload)

Execution time starts  $\uparrow$   $\leftarrow$  Workload  $\searrow$  After some time  
~~Parallel overhead~~

Do an optimal no. of processors where on  $\uparrow$  processors without  $\uparrow$  problem size, the overall execution time  $\downarrow$  is called Parallel Balance point.



Workload same,  
Time ↓

Strong Scaling  
efficiency not fixed  
on ↑ processors

### \* Speedup performance laws

i) Amdahl's Law: Let a program has  $\alpha$  "fixed" of non-parallelizable opert &  $1-\alpha$  is parallelizable. If serial implementation takes  $T_s$  time.

$$\text{So, } T_p = \alpha T_s + (1-\alpha) T_s$$

and  $T_p = \alpha T_s + (1-\alpha) \frac{T_s}{n}$  for  $n$  processors

Problem size  
is fixed

$$\text{So Speedup} = S(n) = \frac{T(1)}{T(n)} = \frac{T_s}{T_p} = \frac{T_s}{\alpha T_s + (1-\alpha) \frac{T_s}{n}} = \frac{1}{\alpha + \frac{(1-\alpha)}{n}}$$



Workload is constant  
but execution time  
decreases.

ii) Gustafson's Law: When accuracy is more important than response time.

As machine size  $\uparrow$  to obtain more computing power, problem size can  $\uparrow$  to create a greater workload to produce more accurate sol<sup>n</sup> & execution time is unchanged.

On  $\uparrow$  problem size, sequential problem  $\rightarrow$  no longer bottleneck.

$$\text{Here } S(n) = \frac{T_s + T_p(1, w)}{T_s + T_p(n, w)} \quad (w = \text{workload}, n = \text{processors})$$

Weak  
Scaling  
efficiency fixed  
on processor

$$\text{Now } T_p(n, w) = n T_p(1, w) \quad \text{and } \alpha = \frac{T_s}{T_s + T_p(n, w)}$$

$$\text{So, } S(n) = \alpha + n(1-\alpha) \quad \text{if } n \text{ is large} \quad S(n) \approx n(1-\alpha)$$

$$= n - \alpha(n-1)$$

Time same, Workload  $\uparrow$



iii) Sun and Ni's Law: Generalizes Amdahl & Gustafson's law by(memory-bounded speedup)

Solve maximum ~~size~~ possible size of problem limited by memory capacity

If we  $n$  processors of small memory  $\rightarrow$  we get a total large memory. Now using this memory we can scale up problem size then if time limit satisfies

Gustafson law then  $\uparrow$  size  $\rightarrow$  better sol<sup>n</sup>, better efficiency. Execution time may increase



w



$$\text{Time} \times T = g(w) \rightarrow \text{memory}$$
 ~~$\cdot \text{Speedup} = g(w)(1-\alpha)$~~