

Throughput for non-pipeline

$$= \frac{95}{4} \times 10^6 \text{ instructions/sec}$$

Page:

Date:

$$= 6.15 \text{ MIPS}$$

Linear Pipelines: They are used to perform fixed functions.

Non-linear Pipeline: are dynamic pipelines. It allows feed back and feed forward connections in addition to streamline connections.

- It can be reconfigured to perform different functions at different time i.e. Multifunction
- Gives more than one output, output of the pipeline is not necessarily from the last stage.



3-stage non-linear pipeline

| Two function X and Y<br>not necessarily from last stage |                          |                            |
|---------------------------------------------------------|--------------------------|----------------------------|
| Streamline connections:<br>S1 to S2                     | Feed forward<br>S1 to S3 | Feed back<br>From S3 to S2 |
| S2 to S3                                                |                          | S3 to S1                   |

### Reservation Table

It specifies the utilization pattern of successive stages in a pipeline

- Rows represents stages and columns represent clock time units.

## For Linear Pipeline

$\frac{TC}{hd}$

$c = \text{reg/c cost}$   
 Page = Total time  
 Date:  $b$  = batch by  
 $d$  = delay

- \* Reservation table for Linear Pipeline Processing follows diagonal stream line pattern.

R-T for 4 Stage Pipeline  $\rightarrow$  Time (clock cycles)

|        | 1              | 2 | 3 | 4 |
|--------|----------------|---|---|---|
| Stages | S <sub>1</sub> | X |   |   |
|        | S <sub>2</sub> |   | X |   |
|        | S <sub>3</sub> |   |   | X |
|        | S <sub>4</sub> |   |   | X |

## For Non-Linear Pipeline

- Non-Linear pattern
- On the evaluation of different functions multiple reservation table can be generated.
- Each R-T display form of data in pipeline for one complete evaluation of pipeline.

Reservation table for X

|                | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|----------------|---|---|---|---|---|---|---|---|
| S <sub>1</sub> | X |   |   |   | X |   | X |   |
| S <sub>2</sub> |   | X | X |   |   |   |   |   |
| S <sub>3</sub> |   |   | X | X |   | X |   |   |

$\theta$  ← clock units.  
 = evaluation time.

Processing sequence for function X

$S_1 \rightarrow S_2 \rightarrow S_3 \rightarrow S_2 \rightarrow S_3 \rightarrow S_1 \rightarrow S_3 \rightarrow S_1$

- Multiple check marks in a row means repeated usage of same stage in different cycle.
- Contiguous check marks - extended usage of stage over more than one cycle.
- Multiple check marks in simultaneous usage of multiple stages see fig.
- Total no of clock units = evaluation time.

static pipeline  $\rightarrow$  linear pipeline.  
 dynamic  $\rightarrow$  non-linear pipeline

Page :

Date :

### Latency:

- An initiation refers to the start of single function evaluation.
- No of clock cycles between two initiation of pipeline is the latency.
- Latency values must be positive integers.

### Collisions:

- An attempt by two or more initiation to use the same pipeline stage at the same time is called collision.
- Collision implies resource conflicts between two initiations.
- Collisions must be avoided.

|                       | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | Time | 1              | 2              | 3              | 4                             | 5                             | 6                             | 7              | 8              | 9 | 10 | 11 | 12 |
|-----------------------|---|---|---|---|---|---|---|---|------|----------------|----------------|----------------|-------------------------------|-------------------------------|-------------------------------|----------------|----------------|---|----|----|----|
| S <sub>1</sub>        | X |   |   |   | X | X |   |   | Time | S <sub>1</sub> | X <sub>1</sub> | X <sub>2</sub> | X <sub>3</sub> X <sub>1</sub> | X <sub>1</sub> X <sub>2</sub> | X <sub>2</sub> X <sub>3</sub> |                |                |   |    |    |    |
| Stages S <sub>2</sub> |   | X | X |   |   |   |   |   | Time | S <sub>2</sub> | X <sub>1</sub> | X <sub>2</sub> |                               | X <sub>3</sub> X <sub>4</sub> | X <sub>4</sub>                |                |                |   |    |    |    |
| S <sub>3</sub>        |   | X | X | X |   |   |   |   | Time | S <sub>3</sub> | X <sub>1</sub> | X <sub>2</sub> | X <sub>3</sub>                | X <sub>1</sub> X <sub>2</sub> | X <sub>2</sub> X <sub>3</sub> | X <sub>3</sub> | X <sub>4</sub> |   |    |    |    |

Reservation table for functions

### Forbidden latencies

Latencies that cause collisions are called forbidden latencies.

e.g. Considering the reservation table

$$S_1 : \{(6-1), (8-1), (8-6)\} \text{ i.e } \{5, 7, 2\}$$

$$S_2 : \{(4-2)\} \text{ i.e } \{2\}$$

$$S_3 : \{(3-3), (7-3), (7-5)\} \text{ i.e } \{2, 4, 2\}$$

forbidden latencies  $\{2, 4, 5, 7\}$

## Permissible latencies

- Latencies that does not cause collisions are permissible latencies.
- Permissible latencies =  $\{1, 3, 6, 8\}$  for last problem.

Latency sequence?

Sequence of permissible latencies  
 $\{1, 3, 6, 8\}$

Latency cycle:

A latency sequence that repeats the same subsequence (cycle) indefinitely.  
 For ex L.S repeat after 1, 3, 6, or 8.

Average latency  
 the sum of all latencies divided by the number of latencies along the cycle.

$$\frac{1+8}{2} = 4.5$$

constant cycle

latency cycle only one latency value  
 $6, 6, 6, 6$ .

Collision free scheduling

To avoid collisions and to achieve high throughput, all the tasks awaiting for initiation should be properly scheduled.

While scheduling main objective is

To obtain shortest average latency between initiations without causing collisions.

Steps:

Collision vector:

It is m-bit binary vector  $C = (C_m C_{m-1} \dots C_1)$  that can be obtained from  $C_m$  and  $C_{m-1}$  set of permissible and forbidden latencies.

$$m \leq n-1, n = \text{column}.$$

$C_i = 1$  Forbidden.

$C_i = 0$  Permissible

e.g.  $C_2 = 1011010$

Page :

Date :

In the previous example

forbidden latencies =  $\{2, 4, 5, 7\}$

Maxim F.L = 7

so m = 7

Collision vector  $G = \{C_7 C_6 C_5 C_4 C_3 C_2 C_1\}$

Permissible latency =  $\{1, 3, 6\}$

Now for Forbidden latencies =  $\{2, 4, 5, 7\}$

collision vector

$$= \{C_7 C_6 C_5 C_4 C_3 C_2 C_1\} = (1011010)$$

1 0 1      0 1 0

State diagram

Constructed from collision vector



It specifies the permissible state transitions among successive initiations.

Numericals

|                | 1              | 2 | 3 | Time | 4 | 5 | 6 | 7 | 8 |
|----------------|----------------|---|---|------|---|---|---|---|---|
| S <sub>1</sub> | X              |   |   |      |   |   | X |   | X |
| Stages         | S <sub>2</sub> |   | X |      | X |   |   |   |   |
|                | S <sub>3</sub> |   |   | X    |   | X |   | X |   |

S<sub>1</sub>:  $\{(6-1), (8-6), (8-1)\}$  i.e.  $\{S_1, 2, 7\}$

S<sub>2</sub>:  $\{(4-2)\}$  i.e.  $\{2\}$

S<sub>3</sub>:  $\{(5-3), (7-5), (7-3)\}$  i.e.  $\{2, 2, 4\}$

forbidden latencies =  $\{2, 4, 5, 7\}$

It cause collision.

Permissible latencies = {1, 3, 6, 8<sup>+9</sup>}  
It does not cause collision

Page \_\_\_\_\_  
Date \_\_\_\_\_  
8 of more  
all are  
permissible

Maximum forbidden latency (cm) = 7  
so the collision vector is of 7 bits.

Collision vector = {c<sub>7</sub> c<sub>6</sub> c<sub>5</sub> c<sub>4</sub> c<sub>3</sub> c<sub>2</sub> c<sub>1</sub>}

1 = forbidden.

0 = permissible.

Collision vector = (1011010)

Permissible latency = {1, 3, 6, 8<sup>+9</sup>}

Initial collision vector

1011010

with latency 8

1011010

+ 0101101

—————

1111111 Next state

(So, the new state 1111111 with latency 1)

with latency 3

1011010

+ 0001011

—————

1011011

next state = 1011011

with latency 6

1011010

+ 0000001

—————

1011011

next state = 1011011

with latency 8

1011010

+ 0000000

—————

1011010

next state = 1011011



simple Cycle: It is a latency cycle in which each state appears only once.

greedy cycle: A greedy cycle is a simple cycle whose edges are all made with minimum latency or minimum avg

| Latency cycle    | Types of cycle     | Average latency |
|------------------|--------------------|-----------------|
| (3)              | Simple cycle const | 3 (MAL) Greedy  |
| minimum edge (6) | Simple cycle 11    | 6               |
| (8)              | Simple cycle 11    | 8               |
| (1, 8)           | 9                  | 4.5 Greedy      |
| (3, 8)           | 11                 | 5               |
| (6, 8)           | 11                 | 5.5             |
| (3, 6, 13)       | --                 |                 |
| (3, 13, 6)       | --                 |                 |
| (1, 8, 13, 8)    | --                 |                 |

generally there are 2 greed cycles

$$MAL = 3$$

## Question-2

Page :

Date : / /

|           | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|-----------|---|---|---|---|---|---|---|---|---|
| S1        | X |   |   |   |   |   |   |   | X |
| Stages S2 |   | X | X |   |   |   |   | X |   |
| S3        |   |   | X |   |   |   |   |   |   |
| S4        |   |   |   | X | X |   |   |   |   |
| S5        |   |   |   |   |   | X | X |   |   |

Forbidden Latency =  $\{1, 5, 6, 8\}$  cause collision

Permissible latency =  $\{2, 3, 4, 7, 9\}$  does not cause collision

Collision vector

$$:\{C_8 C_7 C_6 C_5 C_4 C_3 C_2 C_1\} = \{10110001\}$$



bound: Throughput =  $\frac{f}{mal} = \frac{1}{MAL^T C}$

Lower Bound on MAL

The maximum no of checkmarks in any row of the reservation table.

~~if~~ Lower bound  $\neq$  MAL  
then optimal latency not achieved

Upper bound of MAL

Then no of is in icv + 1

Hazards: Circumstances that would cause incorrect execution if the next instruction was launched.

Structural hazards: Attempting to use the same hardware to do different things at the same time.

Data hazards: Instruction depend on result of prior instruction still in the pipeline.

Control hazards: Caused by delay between the fetching of instructions and decisions about changes in control flow (branch and jump)

Tomasulo's Algorithm.

- It is a method of implementing dynamic scheduling.
- Robert Tomasulo.
- It removes name dependency through register renaming.
- It tracks when operands are available to satisfy the data dependency.

## AMDAHL Law

It is a law governing the speedup of using parallel processors on a problem, versus using only one serial processor, under the assumption that problem size remains same.

Speed up:

- Speed up metric is a quantitative measure of performance.
- Speed up is ratio of time it takes to execute a program in serial to the time it takes to execute in parallel.
- $S(n) = \frac{\text{Time taken serially}}{\text{Time taken parallelly}}$   
for example  $T(1) = 1 \text{ sec}$  & if  $n=2$  processor then  
 $T(2) = \frac{1}{2} = 0.5 \text{ sec}$   
Hence  $S(n) = \frac{T(1)}{T(2)} = \frac{1}{0.5} = 2$  mean speed up by 2 times.

$$S(n) \leq \frac{1}{f + \frac{1-f}{\sum_{i=1}^n \frac{1}{P_i}}} \quad f = \text{serial fraction.}$$

$1-f = \text{parallel fraction.}$

## Numerical 1

1. 70% is parallel  
30% is serial  
maximum speedup ?  
8 CPUs

$$S(n) = \frac{1}{5 + \frac{1}{1-\frac{1}{n}}} = \frac{1}{0.3 + \frac{0.7}{8}} \quad \begin{array}{l} \text{Page: } \\ \text{Date: } \end{array}$$

$$= \frac{8}{2.4 + 0.7} = \frac{8}{3.1} = \frac{80}{31}$$

$$= 2.6$$

### Numericals:

Q-1 ~~Speed~~ 15000 instruction = ?

$$f = 25 \text{ MHz}$$

$$K = 5$$

One instruction per clock cycle.

speed up =  $\frac{\text{Time in non parallel}}{\text{Time in parallel}}$

$$= \frac{15000 \times T}{(K+n-1)T}$$

$$= \frac{\text{no of instruction} \times \text{time for one inst}}{(\text{no of ins}) \times \text{time}}$$

$$= \frac{15000 \times 5}{5 + 15000 - 1} = \frac{15000 \times 5}{14994} = \underline{\underline{9.999}}$$

$$\text{Efficiency} = \frac{4.99}{5} = 0.99\%$$

$$\text{Throughput} = \frac{E_K}{T} = \frac{0.99}{15}$$

$$= \frac{0.99 \times 0.84 \times 10^6}{15} = \underline{\underline{99000.0}}$$

$$= 0.94 \times 25 \times 10^6$$

$$= 235$$

PCR - Performance cost ratio =

$$\frac{\text{Performance}}{\text{cost}}$$

$$= \frac{f}{C+Kh}$$

where  $f$  = clock rate

$C$  = cost of all logic gates

$h$  = cost of  $K$ -stage pipeline.

$$\text{PCR} = \frac{f}{C+Kh} =$$

$$(Kh + C) \text{PCR} = f$$

$$Kh \text{PCR} + C \text{PCR} = f$$

$$K = \frac{f - C \text{PCR}}{h \text{PCR}}$$

|    | 1 | 2 | 3 | 4 | 5 | 6 |
|----|---|---|---|---|---|---|
| S1 | X |   |   |   |   | X |
| S2 |   | X |   | X |   |   |
| S3 |   |   | X |   |   |   |
| S4 |   | . |   | X | X |   |

Considering the reservation table

S1:  $\{6-1\}$  i.e  $\{5\}$

S2:  $\{4-2\}$  i.e  $\{2, 3\}$

S3:

S4:  $\{5-4\}$  i.e  $\{1\}$

Forbidden latencies  $\{1, 2, 5\}$

Permissible latencies  $\{3, 4, 6^+\}$

Initial collision vector

$$C_x = S C_4 C_3 C_2 C_1 = 10011$$

10011

Page :

Date :

/ /

shift 00010  
 10011  
 -----  
 10011

gated latency gun

3

7

6

Type of gun  
 const  
 low  
 w

A memory  
bus3  
4  
5

6

$$MAL = 3$$

To make 1 permissible we get

|    | 1 | 2    | 3 | 4 | 5  | 6    | 7 |
|----|---|------|---|---|----|------|---|
| S1 | X |      |   |   |    |      | X |
| S2 |   | X    |   | X |    |      |   |
| S3 |   |      | X |   |    |      |   |
| S4 |   |      |   | X |    | → Y1 |   |
| D  |   | 001X |   |   | D1 |      |   |

Optimal constant latency = 2, which is special for MAL.

c Maxi Theroypt  $\leftarrow \frac{E_k}{C}$

$$= \frac{SD}{kC}$$

$$= \frac{(n)^k}{(k+n-1) \times C}$$

$$= \frac{n}{C(k+n-1)}$$

# Hazards in Pipelining

→ Data Hazards

RAW (TD)

WAR (AD)

WAW (DD)

Page :

Date :

/ /

→ Structural Hazards

When multiple instruction needs same resource

→ Control Hazards

All instructions who change the program count lead to control hazards.

Non -

$$\textcircled{1} \quad f = 25 \text{ MHz}$$

$$\text{CPI} = 4$$

Pipel

$$K = 5$$

$$T = \frac{1}{25 \times 10^6}$$

$$n = 100$$

$$\text{speed up} = \frac{T_{\text{in Non}}}{T_{\text{Pip}}}$$

$$= \frac{4}{25 \times 10^6} \times 100$$

$$T_f = \frac{(K+n-1)T}{64} = \frac{K+(n-1)}{64} = 5.2 \text{ ms}$$

So that the speed up is

$$S_k = \frac{T_f}{T_x} = \frac{16}{5.2} = 3.078$$

$$\text{PCR} = \text{Performance Cost Rate} = \frac{\text{Performance}}{\text{cost}} - \frac{t}{c+kh}$$

$c$  = cost of digital gates

$h$  = latency cost

$K$  = No of stages

$t$  = latch time

$t$  = time taken by how without pipeline.

$t$  = time taken by how with pipeline.

$$t = \frac{1}{d + \frac{t}{K}}$$

$$\text{PCR} = \frac{K}{dk+t} \times \frac{1}{ctkh}$$

$$\text{max PCR} = \frac{d}{dk} \left[ \frac{K}{dk+t} \times \frac{1}{ctkh} \right]$$

$$\text{freq} = \frac{1}{d + \frac{K}{K}}$$

$$K = \sqrt{\frac{ct}{dh}}$$

### Non-Linear Pipeline

#### Reservation Table

|       | $T_1$ | $T_2$ | $T_3$ | $T_4$ | $T_5$ |
|-------|-------|-------|-------|-------|-------|
| $S_1$ | X     | /     |       |       |       |
| $S_2$ | X     |       |       |       |       |
| $S_3$ |       | X     |       |       |       |
| $S_4$ |       |       | X     |       |       |
| $S_5$ |       |       |       | X     |       |



### Linear Pipeline

$S_1$

$S_2$

$S_3$

$S_4$

$S_5$

Page :

Date :

output X



|       | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|-------|---|---|---|---|---|---|---|---|
| $S_1$ | X |   |   |   |   | X |   | X |
| $S_2$ |   | X | X |   |   |   |   |   |
| $S_3$ |   |   | X | X | X | X |   |   |

Reservation table for X

|       | 1 | 2 | 3 | 4 | 5 | 6 |
|-------|---|---|---|---|---|---|
| $S_1$ | X |   |   |   | X |   |
| $S_2$ |   |   | X |   |   |   |
| $S_3$ |   | X | X |   | X |   |

Reservation table for Y

Permissible  $\rightarrow$  Not cause collision  
forbidden  $\rightarrow$  cause collision

For X

Forbidden latencies  $\rightarrow \{2, 4, 5, 7\}$

$$S_1 = \{5, 7, 2\}$$

$$S_2 = \{2\}$$

$$S_3 = \{2, 4\}$$

$$\text{Permissible} = \{1, 3, 6, 8^+\}$$

- Initial collision vector -

7 6 5 4 3 2 1

ICV - 1 0 1 1 0 1 0



|                |          |
|----------------|----------|
|                | 00000001 |
|                | 1011010  |
| 1              | 2        |
| 2              | 3        |
| 3              | 4        |
| 4              | 5        |
| 5              | 6        |
| S <sub>1</sub> | X        |
| S <sub>2</sub> |          |
| S <sub>3</sub> | X X X    |

$$S_1 = \{4\}$$

$$S_2 = \{ \}$$

$$S_3 = \{2, 4\}$$

Forbidden latencies: {2, 4}

Permissible latencies: {1, 3, 5, 6<sup>+</sup>}

Initial collision vector.

6 5 4 3 2 1  
0 0 1 0 1 0

Maximum of  
Forbidden latencies.

00° 00° 00° 00° 00° 00°  
00 01 01 01 01 01  
00 1 01 1 01 1  
00 1 01 1 01 1

0 0 1 0 1 0  
0 0 0 1 0 1  
0 0 1 1 1 1

0 0 0 1 1 1  
0 0 1 0 1 0  
1 1 1 1

0 0 0 0 0 1  
0 0 0 0 1 0  
0 0 1 0 1 1



Simple cycle  $\rightarrow (5), (3, 5) (1, 5) (3)$

Greedy cycle  $\rightarrow (3), (1, 5)$

containing minima.

Page :

Date : / /

MAL = 3

|                | 1 | 2 | 3 | 4 | 5 | 6 |
|----------------|---|---|---|---|---|---|
| S <sub>1</sub> | X |   |   |   |   | X |
| S <sub>2</sub> |   | X |   | X |   |   |
| S <sub>3</sub> |   |   | X |   |   |   |
| S <sub>4</sub> |   |   |   | X | X |   |

$$S_1 = \{5\}$$

$$S_2 = \{2\}$$

$$S_3 = \{\}$$

$$S_4 = \{1\}$$

Forbidden latency =  $\{1, 2, 5\}$

Permissible latency =  $\{3, 4, 6^+\}$

10011

10011

Initial Collision Vector

5 1 3 2 1  
1 0 0 1 1

100 11  
000 10  
100 11



7.1

Simple cyl 3, 4, 6'

9.2

Greedy - 3, 4

10

MAL = 3

9.2

9.83

## Bounds on MAL

- Page: \_\_\_\_\_  
Date: \_\_\_\_\_
1. MAL is lower bounded by maximum no. of checkmarks in any row of reservation table.
  2. MAL is less than or equal to average latency of any greedy cycle of state, transition diagram.
  3. upper bound of MAL is no of 1's in initial collision vector.

maximum no of checkmark  $\leq$  MAL (no of 1's in initial collision vector)



|                | 1 | 2 | 3 | 4 | 5 |
|----------------|---|---|---|---|---|
| S <sub>1</sub> | X |   |   |   | X |
| S <sub>2</sub> |   | X |   | X |   |
| S <sub>3</sub> |   |   | X | X |   |



$\frac{100010}{10001} \xrightarrow{10001}$   
 $\frac{10001}{110011} \xrightarrow{110011}$

$2 \leq \text{MAL} \leq (4)$

MAL = 3

|                | 1 | 2 | 3 | 4              | 5              | 6 | 7 | 8 |
|----------------|---|---|---|----------------|----------------|---|---|---|
| S <sub>1</sub> | X |   |   |                |                |   | X |   |
| S <sub>2</sub> |   | X |   |                | X              |   |   |   |
| S <sub>3</sub> |   |   | X |                | X              |   |   |   |
| D <sub>1</sub> |   |   |   | D <sub>1</sub> |                |   |   |   |
| D <sub>2</sub> |   |   |   |                | D <sub>2</sub> |   |   |   |



# Simple cycle.      4      7       $(4, 5)$   
 $(5, 7)$ ,  $(3, 4)$ ,  $(3, 5)$ ,  $(1, 3)$ ,  $(1, 7)$ ,  $(1, 3, 7)$ ,  
 $(1, 3, 5, 4)$

# Greedy way       $(1, 3)$

$$\# \text{ MAL} = \frac{1+3}{2} = 2$$

|       |                |                |                |                |                |                |                |                |                |                 |                 |                 |                 |                 |                 |                 |                 |                 |                 |                 |                 |
|-------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|-----------------|-----------------|-----------------|-----------------|-----------------|-----------------|-----------------|-----------------|-----------------|-----------------|-----------------|-----------------|
| $s_1$ | x <sub>1</sub> | x <sub>2</sub> | x <sub>3</sub> | x <sub>4</sub> | x <sub>5</sub> | x <sub>6</sub> | x <sub>7</sub> | x <sub>8</sub> | x <sub>9</sub> | x <sub>10</sub> | x <sub>11</sub> | x <sub>12</sub> | x <sub>13</sub> | x <sub>14</sub> | x <sub>15</sub> | x <sub>16</sub> | x <sub>17</sub> | x <sub>18</sub> | x <sub>19</sub> | x <sub>20</sub> | x <sub>21</sub> |
| $s_2$ | x <sub>1</sub>  | x <sub>1</sub>  | x <sub>1</sub>  | x <sub>1</sub>  | x <sub>1</sub>  | x <sub>1</sub>  | x <sub>1</sub>  | x <sub>1</sub>  | x <sub>1</sub>  | x <sub>1</sub>  | x <sub>1</sub>  | x <sub>1</sub>  |
| $s_3$ | x <sub>1</sub>  | x <sub>1</sub>  | x <sub>1</sub>  | x <sub>1</sub>  | x <sub>1</sub>  | x <sub>1</sub>  | x <sub>1</sub>  | x <sub>1</sub>  | x <sub>1</sub>  | x <sub>1</sub>  | x <sub>1</sub>  | x <sub>1</sub>  |

Latency cycle (2)

| $x_2$ | 1              | 2              | 3              | 4              | 5              | 6              | 7              | 8              | 9              | 10             | 11             | 12             | 13             | 14             | 15             | 16             | 17             | 18             | 19             | 20             | 21             | 22             | 23 |
|-------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----|
| $s_1$ | x <sub>1</sub> | x <sub>2</sub> | x <sub>1</sub> | x <sub>1</sub> | x <sub>1</sub> | x <sub>2</sub> | x <sub>1</sub> | x <sub>3</sub> | x <sub>1</sub> | x <sub>4</sub> | x <sub>2</sub> | x <sub>3</sub> | x <sub>5</sub> | x <sub>3</sub> | x <sub>4</sub> | x <sub>6</sub> | x <sub>4</sub> | x <sub>5</sub> | x <sub>5</sub> | x <sub>6</sub> | x <sub>6</sub> | x <sub>6</sub> |    |
| $s_2$ | x <sub>1</sub> | x <sub>1</sub> | x <sub>1</sub> | x <sub>2</sub> | x <sub>2</sub> | x <sub>3</sub> | x <sub>2</sub> | x <sub>3</sub> | x <sub>3</sub> | x <sub>4</sub> | x <sub>4</sub> | x <sub>5</sub> | x <sub>4</sub> | x <sub>5</sub> | x <sub>5</sub> | x <sub>6</sub> |    |
| $s_3$ | x <sub>1</sub> | x <sub>1</sub> | x <sub>1</sub> | x <sub>1</sub> | x <sub>2</sub> | x <sub>1</sub> | x <sub>2</sub> | x <sub>3</sub> | x <sub>2</sub> | x <sub>3</sub> | x <sub>4</sub> | x <sub>3</sub> | x <sub>4</sub> | x <sub>3</sub> | x <sub>5</sub> | x <sub>4</sub> | x <sub>5</sub> | x <sub>6</sub> | x <sub>5</sub> | x <sub>6</sub> | x <sub>6</sub> | x <sub>6</sub> |    |



reduced execution

$\Rightarrow$  Fraction of  $^{\wedge}$  time no enhancement is used =

$$\frac{1 - (\alpha_1 + \alpha_2 + \dots + \alpha_m)}{1 - (\alpha_1 + \alpha_2 + \dots + \alpha_m) + \frac{\alpha_1}{S_1} + \frac{\alpha_2}{S_2} + \dots + \frac{\alpha_m}{S_m}}$$

ii) Suppose we double the speed ratio b/w the vector mode and the scalar mode by hardware improvement. Cal. the effective speedup that can be achieved.

iii) Suppose the same speedup obtained in part (ii) must be obtained by compiler improvement instead of hardware improvement. What could be new vectorization ratio  $\alpha$  that should be supported by the vectorizing compiler to achieve the same enhancement.

$$\text{speedup} = \frac{1}{1 - \alpha + \frac{\alpha}{S}}$$

$$= \frac{1}{1 - 0.75 + \frac{0.75}{18}}$$

$$\text{Speedup} = 3.43$$



$\Rightarrow$  Let no of instructions to be executed  
 $= K$  millions

$m \rightarrow$  Parallel mode

$$\alpha K$$

sequential mode

$$(1-\alpha)K$$

$$T = \frac{\alpha K}{mn} + \frac{(1-\alpha)K}{n}$$

$$T = \frac{K(\alpha + (1-\alpha)m)}{mn}$$

$$\Rightarrow \text{Effective MIPS} = \frac{K}{\frac{K[\alpha + (1-\alpha)m]}{mn}}$$

$$\text{effective MIPS} = \frac{mn}{\alpha + (1-\alpha)m}$$

## ★ MPI Broadcast :-

Date .....



# Syntax :



`MPI_Bcast (*buf, count, datatype, root, communication)`

→ root : who is broadcasting .

ex matrix A[ ] mxn Send to all ?.

double A[N][N];

`MPI_Bcast (A, N*N, MPI_DOUBLE, 0, MPI_COMM_WORLD);`

Spiral

```

double A[N][N];
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
if (rank == 0)
{
    for (i=0; i<N; i++)
        for (j=0; j<N; j++)
            A[i][j] = i*j;
}

```

# MPI Scatter:-



Syntax :-

`MPI_Scatter(*Sbuf, Scount, Stype, *Dbuf, &scount, &stype, root, communicator)`

ex

Date .....

Ex:-

```
main( int argc, char *argv[] )  
{  
    int size, *Sendbuf, recvbuf[100];  
    MPI_Init( &argc, &argv );  
    MPI_Comm_size( MPI_COMM_WORLD, &size );  
    Sendbuf = (int*) malloc( size * 100 * sizeof(int) );  
    MPI_Scatter( Sendbuf, 100, MPI_INT, recvbuf,  
                 100, MPI_INT, 0, MPI_COMM_WORLD );  
    MPI_Finalize();
```

}

# Gather :- Date .....



Syntax:-

`MPI_Gather (*sbuf, scount, stype, *rbuf, rcount,  
stype, root, comm);`

# for program replace Scatter by Gather.

# Reduce :-

Syntax:-

`MPI_Reduce (*sbuf, *recvbuf, count, datatype,  
op, root, comm);`

Op = operation → MPI\_MAX  
MPI\_MIN  
MPI\_SUM  
MPI\_PROD

# MPI.org

Date .....

est



# write a program to find value of  $\pi$ .

#     "     "     "     "     of     factorial     sum



## OpenMP :-

Date .....

→ Thread based

```
#include<omp.h>
#pragma omp parallel
{
    |
    |
    |
}
```

# gcc: -fopenmp file\_name

```
#include<omp.h>
#include<stdio.h>
int main( int argc, char *argv[] )
{
    #pragma omp parallel private(thread_id)
    {
        thread_id = omp_get_thread_num();
        printf("Hello world. from thread id %d of
               %d \n", omp_get_thread_num(),
               omp_get_num_threads());
    }
}
```

id

no. of threads

Date .....

```
int thread, size;  
thread = omp_get_thread_num();  
size = omp_get_num_threads();
```

Thread id  
no. of threads.

Threadid = omp\_get\_thread\_num();

Size = omp\_get\_num\_threads();

Book # parallel programming in C openmp using Quinn

```
# pragma omp parallel shared(a,b,c,d, size) private  
                  (i, thread id)
```

```
{  
if ( threadid == 0 )  
{  
    a = 50;  
    _____  
    _____  
    _____  
}
```

Date .....

# Pragma omp parallel

# Pragma omp for

for ( i=0; i<=n; i++)

—  
—  
—

parallelize  
Threads.

int thread\_id, size, a[100], b[100], c[100];

Threadid = omp. set\_thread\_num();

size = omp\_set\_num\_threads();

for ( i= Threadid ; i<= (max(100) <sup>i=i+4</sup> ~~(threadid+1)\*size~~; i++ )  
{  
    a[i] = ~~b~~ b[i] + c[i]

}

|                |                |                |                |                  |
|----------------|----------------|----------------|----------------|------------------|
| 0              | 1              | 2              | 3              | } for 4 threads. |
| 4              | 5              | 6              | 7              |                  |
| 8              | 9              | 10             | 11             |                  |
| 12             |                |                |                |                  |
| T <sub>0</sub> | T <sub>1</sub> | T <sub>2</sub> | T <sub>3</sub> |                  |

Thread

Date .....

| Thread | 0    | 1     | 2     | 3     |
|--------|------|-------|-------|-------|
|        | 0-24 | 25-49 | 50-74 | 75-99 |

```
for( i = Threadid*25 ; i < (Threadid+1)*25 ; i++ )  
{  
    a[i] = b[i] + c[i]
```

```
#include <mpi.h>
```

```
main(int argc, char **argv) {
    char message[20];
    int i, rank, size, type = 99;
    MPI_Status status;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if(rank == 0) {
        strcpy(message, "Hello, World");
        for(i=1, i<size, i++)
            MPI_Send(message, 13, MPI_CHAR, i, type, MPI_COMM_WORLD);
    }
}
```

```
else
```

```
    MPI_Recv(message, 13, MPI_CHAR, 0, type, MPI_COMM_WORLD, &status);
```

```
    printf("Message from process = %d : %s\n", rank, message);
    MPI_Finalize();
}
```

```
#include <mpi.h>
```

```
main( int argc, char **argv ) {
```

```
    char message[20];
```

```
    int i, rank, size, type = 99;
```

```
    MPI_Status status,
```

```
    MPI_Init(&argc, &argv);
```

```
    MPI_Comm_size(MPI_COMM_WORLD, &size);
```

```
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
```

```
    if (rank == 0) {
```

```
        strcpy(message, "Hello, World");
```

```
        for (i = 1, i < size, i++)
```

```
            MPI_Send(message, 13, MPI_CHAR, i, type, MPI_COMM_WORLD);
```

```
}
```

```
else
```

```
    MPI_Recv(message, 13, MPI_CHAR, 0, type, MPI_COMM_WORLD,
```

```
&status);
```

```
if (rank == 0)
```

```
    printf("Message from process = %d : %s\n", rank, message);
```

```
    MPI_Finalize();
```

```
}
```

```
#include<mpi.h>
```

```
main(int argc, char **argv) {
```

```
    char message[20];
```

```
    int i, rank, size, type = 99;
```

```
    MPI_Status status,
```

```
    MPI_Init(argc, &argv);
```

```
    MPI_Comm_size(MPI_COMM_WORLD, &size);
```

```
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
```

```
    if(rank == 0){
```

```
        strcpy(message, "Hello, World");
```

```
        for(i=1, i < size, i++)
```

```
            MPI_Send(message, 13, MPI_CHAR, i, type, MPI_COMM_WORLD);
```

```
    else
```

```
        MPI_Recv(message, 13, MPI_CHAR, 0, type, MPI_COMM_WORLD,
```

```
        &status)
```

```
    if(rank == 0) printf("Message from process = %d : %s\n", rank, message);
```

```
    MPI_Finalize();
```

```
}
```

int argc, myrank, msgtag = 1;

MPI\_Comm\_rank(MPI\_COMM\_WORLD, &myrank);

if (myrank == 0)

    MPI\_Send(&x, 1, MPI\_INT, 1, msgtag, MPI\_COMM\_WORLD);

else

    if (myrank == 1)

        MPI\_Recv(&y, 1, MPI\_INT, 0, msgtag, MPI\_COMM\_WORLD, &status);

Books:

# Programming Massively Parallel Processors

by David B Kirk ↗ (CUDA)

OpenMP , IMP

only those from Unit-4

# Pragma omp barriers

- Reduction clause in omp.

Syntax:

```
## Pragma omp parallel for reduction(+ : sum)
for (k=0 ; k<100 ; k++)
{
    sum = sum + funct(k);
}
```



This will parallel execute k=0 to 100 (Barriers iteration equally to each thread & then partial sum of each thread is replaced added (+) thread, & final sum will be kept in each thread).