

# Pipeline Architecture: Instruction Pipeline.

In this a stream of instructions can be executed by overlapping fetch, decode and execute phase of a instruction cycle.

A instruction pipeline reads instruction from the memory while previous instructions are being executed in other segments of the pipeline. Thus we can execute multiple instructions simultaneously.

| Steps                   | 1  | 2  | 3  | 4  | 5  | 6  | 7  | 8  | 9  | 10 |
|-------------------------|----|----|----|----|----|----|----|----|----|----|
| Instruction<br>(Branch) | IF | ID | OF | EX |    |    |    |    |    |    |
| ②                       |    | IF | ID | OF | EX |    |    |    |    |    |
| 3                       |    |    | IF | ID | OF | EX |    |    |    |    |
| 4                       |    |    |    | IF | ID | OF | EX |    |    |    |
| 5                       |    |    |    |    | IF | ID | OF | EX |    |    |
| 6                       |    |    |    |    |    | IF | ID | OF |    |    |
| 7                       |    |    |    |    |    |    | IF | ID |    |    |
| 8                       |    |    |    |    |    |    |    | IF | ID |    |
|                         |    |    |    |    |    |    |    |    | IF | ID |

Timing of instruction Pipeline

IF  $\Rightarrow$  Instruction Fetch

ID  $\Rightarrow$  Instruction Decode

OF  $\Rightarrow$  Operand Fetch

EX  $\Rightarrow$  Execute



Problems in instructions Pipeline :

- \* Resource conflicts

When multiple segments are trying to access same resource, then it leads to resource conflicts.

- ° Data dependency conflicts

When a instruction depends on another instruction cycle, but that instruction is not calculated till now.

6 Designing pipelining processor is complex.

Page :

Date :

/ /

- Branch difficulty conflicts when we are executing

### ~~Branch Prediction~~

Advantages of Pipelining:-

- Increase in the number of pipeline stages increases the number of instructions executed simultaneously.
- Faster ALU can be designed.
- It increases the overall performance of CPU

### questions.

Instructions are executing using 4 stage Pipeline  $S_1 \rightarrow 1\text{ ms}$ ,  $S_2 \rightarrow 2\text{ ms}$ ,  $S_3 \rightarrow 1\text{ ms}$ ,  $S_4 \rightarrow 1\text{ ms}$  then calculate its efficiency.

|                | 1     | 2              | 3              | 4              | 5              | 6              | 7              | 8              |
|----------------|-------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|
| I <sub>1</sub> | $S_1$ | I <sub>1</sub> | I <sub>2</sub> | I <sub>3</sub> |                |                |                | X              |
| I <sub>2</sub> |       | $S_2$          | I <sub>1</sub> | I <sub>1</sub> | I <sub>2</sub> | I <sub>2</sub> | I <sub>3</sub> | X X            |
| I <sub>3</sub> |       |                | $S_3$          | I <sub>1</sub> | I <sub>1</sub> | I <sub>2</sub> |                | X              |
| I <sub>4</sub> |       |                |                | $S_4$          | I <sub>1</sub> | I <sub>2</sub> | I <sub>3</sub> | I <sub>3</sub> |

$$\text{Efficiency} = \frac{\text{No of cycles Utilized}}{\text{Total no of cycles}}$$

$$= \frac{15}{36} \times \text{Efficiency}.$$

## Hazards in a Pipeline:

It is the scenario that prevent the next instruction in the instruction stream from executing during its designated clock cycles.

### → Data Hazards

It occurs when instructions that exhibit data dependence, modify data in different stages of a pipeline.

IF / ID / OF / EX / save  
 Read After Write      IF / ID / OF / EX / save  
 Write after read      IF ID OF / EX  
 Write after write

### → Structural hazard.

It occurs due to resource conflict in the pipeline.

→ Control hazard: It occurs due to program control instructions. Branch, jump, SKIP, CALL, RETURN

## Branch Prediction



take my branch  
if my guess

① Static Prediction

② Dynamic Prediction

## Delayed Branching



Branch history table

| Branch address | Target address | prediction |
|----------------|----------------|------------|
| 80             | 100            | /          |
| 129            | 400            | /          |
| 34             | 300            | 0          |
| 75             | 600            | /          |
| 25             | 700            | /          |
| 100            | 800            | 0          |
| 161            | 900            | 0          |
| 518            | 1000           | /          |
| 15             | P              |            |



fetching

Data Dependence .

- ① Data Dependence
- ② Control dependence
- ③ Resource Dependence .

## Dependence graph

It is a directed graph which is used to describe the dependence relationship between the statements.

Nodes: correspond to statement (instructions)

Directed edges: shows the ordered relation among statements.



Data Dependence: The ordering relationship between statements is indicated by data dependence

Data dependency: is a situation in which program statement refers to the data of a preceding statement.

5-types of data dependency:

- ① flow dependency
- ② Anti dependency
- ③ output dependency
- ④ I/O dependency
- ⑤ unknown

① Flow dependence.

A statement  $s_2$  is flow dependent on statement  $s_1$  if an execution path exists from  $s_1$  to  $s_2$  and if at least one output of  $s_1$  feeds to an input in  $s_2$ .

It is denoted as:

$$S_1 \rightarrow S_2$$

example

Page :

Date : / /

S1: Load R1, A //  $R1 \leftarrow \text{Memory}[A]$

S2: ADD R2, R1 //  $R2 \leftarrow R2 + R1$

First S1 is executed then S2

## ② Anti dependence

Statement S2 is anti dependent on the statement S1 if S2 follows S1 in the program order and if the output of S2 overlaps the input to S1.

It is denoted as

$$S_1 \not\rightarrow S_2$$

example

S1: Add R2, R1 //  $R2 \leftarrow R2 + R1$

S2: Move R1, R3 //  $R1 \leftarrow R3$

## ③ Two statements are output dependent if they produce (write) the same output variable.

It is denoted as

$$S_1 \circlearrowright S_2$$

example:

S1: Load R1, A //  $R1 \leftarrow \text{Memory}[A]$

S2: Move R1, R3 //  $R1 \leftarrow R3$

## ④ I/O (Input + Output) dependence.

Read and write are I/O statements. I/O dependence occurs not because the same variable is involved but because the same file is referenced by both I/O statements.

denoted as

$S_1 \xrightarrow{I/O} S_3$

example

$S_1$ : Read (4), A[1]

$S_3$ : Write (4), A[1]

### ⑤ Unknown dependence :

When dependence relation between two statements cannot be determined, it is known as unknown dependency.

Following situations:

- The subscript of a variable is itself subscripted (indirect addressing). Eg:  $a[i[j]]$
- The subscript does not have loop index variable  $a[i]$
- Variable appears more than once with subscripts having different coefficient of loop variables. eg  $a[ij]$  and  $a[lj]$

Example 1:

$S_1$ : Load R1, A //  $R1 \leftarrow \text{Memory}[A]$

$S_2$ : Add R2, R1 //  $R2 \leftarrow R2 + R1$

$S_3$ : Move R1, R3 //  $R1 \leftarrow R3$

$S_4$ : Store B, R1 //  $\text{Memory}[B] \leftarrow R1$

- o  $S_1 \rightarrow S_2$ ,  $S_2$  is flow dependent on  $S_1$
- o  $S_1 \rightarrow S_3$ ,  $S_3$  is output dependent on  $S_1$
- o  $S_1 \rightarrow S_4$ ,  $S_4$  is flow dependent on  $S_1$
- o  $S_2 \rightarrow S_3$ ,  $S_3$  is antidependent on  $S_2$
- o  $S_2 \rightarrow S_2$ ,  $S_2$  is flow dependent on  $S_2$

$S_2 \rightarrow S_4$  : No dependency

$S_3 \rightarrow S_4$  :  $S_4$  is flow dependent on  $S_3$



Flow dependency

$S_1 \rightarrow S_2$

$S_3 \rightarrow S_4$

$S_2 \rightarrow S_2$

$S_1 \rightarrow S_4$

Anti-dependency

$S_2 \rightarrow S_3$

Output dependency

$S_1 \rightarrow S_3$

Example 2: I/O Dependencies

$S_1$ : read( $C_4$ ),  $A(i)$  // Read array  $A$  from tape unit 4  
 $S_2$ : rewind( $C_4$ ) // Rewind tape unit 4  
 $S_3$ : write( $C_4$ ),  $B(i)$  // Write array  $B$  to tape unit 4  
 $S_4$ : rewind( $C_4$ ) // Rewind tape unit 4

I/O dependency

$S_1 \rightarrow S_3$

No other dependency.



Control dependence:

control dependency arise.

When the order of execution of statements cannot be determined before runtime.

for example: if condition, will not be resolved until runtime.

Control dependence often prohibits parallelism from being exploited.

Compilers are used to eliminate this control dependence and exploit the parallelism.

Control-independent example

```
for (i=0; i<n; i++) {
    a[i] = c[i]
```

if (a[i] < 0) a[i] = 1;

Control-dependent example

```
for (i=1; i<n; i++) {
```

if (a[i-1] < 0) a[i] = 1;

}

depends on previous value.

Resource dependence.

Resource dependence is concerned with conflicts in using shared resources, such as integer and floating point units (ALU), registers and memory areas among parallel events etc.

→ ALU conflicts are called ALU dependence

→ Memory (storage) conflicts are called storage dependence.

Example:

$I_1$  and  $I_2$  are accessing addition unit, so ALU dependent.

$$I_1: A = B + C$$

$$I_2: G = D + H$$



## Numericals

Page :

Date : / /

- a) S1 :  $A = B + D$   
 S2 :  $C = A * 3$   
 S3 :  $A = A + C$   
 S4 :  $E = A / 2$

S1 to S2 :  $S_2$  is flow dependent on S1

S1 to S3 :  $S_3$  is <sup>flow and</sup> output dependent on S1

S1 to S4 :  $S_4$  is flow dependent on S1  
and flow.

S2 to S3 :  $S_3$  is Anti dependent on S2

S2 to S4 : No dependency

⊗

S3 to S3 : flow dependency.

S3 to S4 : flow dependency



example 2 :

- S1 :  $X = \sin(Y)$   
 S2 :  $Z = X + W$   
 S3 :  $Y = -2.5 * W$   
 S4 :  $X = \cos(Z)$



S1 to S2 : flow

S1 to S3 : anti

S1 to S4 : output

S2 to S3 : NO

S2 to S4 : flow, anti

S3 to S4 : NO d.

question 2<sup>o</sup>

$$DO \ I = I, N$$

$$S1: A(I+1) = B(I-1) + C(I)$$

Page :

Date : / /

$$B(I) = A(I)^*$$

$$C(I) = B(I)^* - 1$$

to continue



Q-3

$$P1: C = D^* E$$

$$P2: M = G_1 + C$$

$$P3: A = B + C$$

$$P4: C = L + M$$

$$P5: F = G_1 \div E$$



D-3

S1: Load R1, A //  $R1 \leftarrow \text{Memory}[A]$

S2: Add R2, R1 //  $R2 \leftarrow R2 + R1$

S3: Move R1, R3 //  $R1 \leftarrow R3$

S4: Store B, R1 //  $\text{Memory}[B] \leftarrow R1$



~~Question 26~~Bernstein's conditions

In 1966, Bernstein gave a set of conditions on the basis of which two processes can execute in parallel.

## ⇒ NOTATION:

- o Process  $P_i$  is a software entity.
- o Input  $I_i$  is the set of all input variables for a process  $P_i$   
It is also called read set or domain of  $P_i$
- o Output  $O_i$  is the set of all output variables for  $P_i$   
 $O_i$  is also called write set of  $P_i$ .

Consider two processes  $P_1$  and  $P_2$ , their inputs are  $I_1$  and  $I_2$  and outputs are  $O_1$  and  $O_2$  respectively



The two processes can execute in parallel if they are independent and do not create conflicting results

Bernstein conditions

Two processes  $P_1, P_2$  with input sets  $I_1, I_2$  and output set  $O_1, O_2$  can execute in parallel (denoted by  $P_1 \parallel P_2$ ) if

$$\left\{ \begin{array}{l} I_1 \cap O_2 = \emptyset \\ I_2 \cap O_1 = \emptyset \\ O_1 \cap O_2 = \emptyset \end{array} \right.$$

These 3 conditions are known as Bernstein's conditions.

Term of data dependencies:

Bernstein conditions imply that two processes can execute in parallel if they are:

- o flow-independent  $I_2 \cap O_1 = \emptyset$
- o anti-independent  $I_1 \cap O_2 = \emptyset$
- o output-independent  $O_1 \cap O_2 = \emptyset$

In general  $P_1, P_2, P_3, \dots, P_k$  can execute in parallel if Bernstein's conditions are satisfied on a pair-wise basis. This means that:

$P_i \parallel P_2 \parallel P_3 \dots \parallel P_k$  if and only if  $P_i \parallel P_j$  for all  $i \neq j$

Property of Parallelism relation ( $\parallel$ ) is:

Commutative ( $P_i \parallel P_j$  implies  $P_j \parallel P_i$ )

Non-transitive ( $P_i \parallel P_j$  and  $P_j \parallel P_k$  does not imply  $P_i \parallel P_k$ )

Associative  $(P_i \parallel P_j) \parallel P_k = P_i \parallel (P_j \parallel P_k)$

### Question 1

$$a = x + y;$$

$$b = x + z;$$

Solution:

$$I_1 = \{x, y\}$$

$$I_2 = \{x, z\}$$

$$O_1 = \{a\}$$

$$O_2 = \{b\}$$

$$\text{Here } I_1 \cap O_2 = \emptyset$$

$$I_2 \cap O_1 = \emptyset$$

$$I_1 \cap O_2 = \emptyset$$

So, Bernstein Condition are satisfied.

Question 2:

$$a = x + y$$

$$b = a + b'$$

Page :

Date : / /

Solution:

$$\text{Here } I_1 = \{x, y\} \quad O_1 = \{a\}$$

$$I_2 = \{a, b'\} \quad O_2 = \{b\}$$

$$\text{Here } I_1 \cap O_2 = \emptyset$$

$$I_2 \cap O_1 \neq \emptyset$$

$$O_1 \cap O_2 = \emptyset$$

So, Bernstein conditions are not satisfied.

Hence, these two statements can not be executed in parallel.

Q.3

$$P_1 : C = D \times E$$

$$P_2 : M = G_1 + C$$

$$P_3 : A = B + C$$

$$P_4 : C = L + M$$

$$P_5 : F = G_1 \div E$$

Input sets:

$$I_1 = \{D, E\}$$

$$I_2 = \{G_1, C\}$$

$$I_3 = \{B, C\}$$

$$I_4 = \{L, M\}$$

$$I_5 = \{G_1, E\}$$

Output sets:

$$O_1 = \{C\}$$

$$O_2 = \{M\}$$

$$O_3 = \{A\}$$

$$O_4 = \{C\}$$

$$O_5 = \{F\}$$

① Using Bernstein condition we find that:

For  $P_1 \wedge P_2$   $I_1 \cap O_2 = \emptyset, I_2 \cap O_1 = \{C\}, O_1 \cap O_2 = \emptyset$

Bernstein condition does not satisfy

So  $P_1$  and  $P_2$  can not be executed in parallel.

For  $P_1 \wedge P_3$  same as them.

Not satisfied.



Question 2:

$$S1: A = B + C$$

$$S2: C = B^* D \rightarrow S3 = S = 0$$

$$S4: DO I = A, 100 \\ S = S + X(I)$$

END DO

$$SS: IF(S <= 100) C = C^*$$

Input sets

$$I1 = \{B, C\}$$

$$I2 = \{B, D\}$$

$$I3 = \{I\}$$

$$I4 = \{A, S, X\}$$

$$I5 = \{S, C\}$$

Output sets

$$O1 = \{A\}$$

$$O2 = \{C\}$$

$$O3 = \{S\}$$

$$O4 = \{S\}$$

$$O5 = \{C\}$$

Now check pair wise

For  $S_1 \wedge S_2$

$$I_1 \cap O_2 = \{C\}, I_2 \cap O_1 = \emptyset, O_1 \cap O_2 = \emptyset \quad (S_1 \wedge S_2) X$$

For  $S_1 \wedge S_3$

$$I_1 \cap O_3 = \emptyset, I_3 \cap O_1 = \emptyset, O_1 \cap O_3 = \emptyset \quad (S_1 \wedge S_3) \checkmark$$

For  $S_1 \wedge S_4$ :

$$I_1 \cap O_4 = \emptyset, I_4 \cap O_1 = \{A\}, O_1 \cap O_4 = \emptyset \quad (S_1 \wedge S_4) X$$

For  $S_1 \wedge S_5$ :

$$I_1 \cap O_5 = \{C\}, I_5 \cap O_1 = \emptyset, O_1 \cap O_5 = \emptyset \quad (S_1 \wedge S_5) X$$

For  $S_2 \wedge S_3$

$$I_2 \cap O_3 = \emptyset, I_3 \cap O_2 = \{C\}, O_2 \cap O_3 = \emptyset \quad (S_2 \wedge S_3) \checkmark$$

For  $S_2 \wedge S_4$

$$I_2 \cap O_4 = \emptyset, I_4 \cap O_2 = \emptyset, O_2 \cap O_4 = \emptyset \quad (S_2 \wedge S_4) \checkmark$$

For  $S_2 \wedge S_5$ :

$$I_2 \cap O_5 = \emptyset, I_5 \cap O_2 = \{C\}, O_2 \cap O_5 = \emptyset \quad (S_2 \wedge S_5) X$$

For  $S_3 \wedge S_4$ :

$$I_3 \cap O_4 = \emptyset, I_4 \cap O_3 = \{S\}, O_3 \cap O_4 = \{S\}, (S_3 \wedge S_4) X$$

For  $S_3 \wedge S_5$ :

$$I_3 \cap O_5 = \emptyset, I_5 \cap O_3 = \{S\}, O_3 \cap O_5 = \emptyset \quad (S_3 \wedge S_5) X$$

For  $S_4 \wedge S_5$

$$I_4 \cap O_5 = \emptyset, I_5 \cap O_4 = \{S\}, O_4 \cap O_5 = \emptyset \quad (S_4 \wedge S_5) X$$

So 3 pairs can execute in parallel

$$(S_1 \wedge S_3), (S_2 \wedge S_3), (S_2 \wedge S_4)$$

Q- Now reconstruct the given program to maximize the parallelism.

Because 3 pairs can execute in parallel  
 $(S11|S3), (S21|S3), (S21|S4)$

Page :

Date :

P1:  $A = B + C$        $S = 0$

P2:  $C = B * D$        $DO I = A, 100$   
 $S = S + X(I)$   
 END DO

P3: if ( $S < T \cdot 10^6$ )       $Z = C^T Z$

Numericals on Pipeline:

$K = \text{no of stages}$

$n = \text{no of tasks}$

Total time to process 'n tasks' in 'K stage pipelined processor'

$T_K = [K + (n-1)] \text{ clock cycles}$   
 $K$  cycles for first task and  
 $n-1$  cycles for remaining  $n-1$  tasks

when pipeline not used

$T_1 = nK$

Some common terms related to pipeline structure:

① Pipeline clock period ( $\tau$ )

Let

$T_i = \text{Time delay of stage } S_i$

$d = \text{time delay of latch } L$

$T_m = \text{maximum stage delay}$



Clock Period:  $\tau$

$$\tau = \max\{\tau_i\}_{i=1}^k + d$$

$$\tau = \tau_m + d$$

Pipeline frequency  $f$

$$f = \frac{1}{\tau}$$

Ideally,  $k$ -stage pipeline processes  $n$  tasks in  $k(n-1)$  clock cycles.

$k$  cycles for first task

$n-1$  cycles for remaining  $n-1$  tasks

Total time to process  $n$  tasks in "pipelined processor":

$$T_{lc} = [k + (n-1)]\tau \quad (\tau \rightarrow \text{clock period})$$

Total time for non pipelined processor:

$$T_l = nk\tau$$

Speed up factor =  $\frac{\text{Time for non pipelined}}{\text{Time for pipeline}}$

$$= \frac{nK}{(k+n-1)K} = \frac{nK}{k+n-1}$$

Efficiency: It is an indicator how efficiently the resources of the pipeline are used.

$$E_k = \frac{\text{Speed up}}{K} = \frac{nK}{(k+n-1)K}$$

$$E_k = \frac{n}{k+n-1}$$

Throughput:  
The number of tasks that can be completed per unit time is called Pipeline throughput.

$$T_K = \frac{E_K}{T} = \frac{n}{(k+n-1)T} = \frac{nT}{n+k-1}$$

Numericals:

1. Determine the number of clock cycles that it takes to process 200 tasks in a six segment pipeline.

$$K=6$$

$$n=200$$

$$\begin{aligned} \text{No of clock cycles} &= k+n-1 \\ &= 6+200-1 \\ &\approx 205 \text{ clock cycles.} \end{aligned}$$

P.2 Non pipeline takes 50ns to process a task

same task can be processed in a six segment pipeline with a clock cycle of 10ns.

Find (a) Speed up ratio of 100 tasks  $T_n = 50\text{ns}$

$$\text{Speed up} = \frac{n T_n}{(n+k-1) T_p} \quad \left| \begin{array}{l} T_p = 10\text{ns}, K=6 \\ n=100 \end{array} \right.$$

$$\begin{aligned} &= \frac{100 \times 50}{(100+6-1) \times 10} \\ &= \frac{100 \times 50}{105} = 4.76 \end{aligned}$$

$$\text{Speedup} = \frac{n T_n}{(n+k-1) T_p} = \frac{T_n}{\left(1 + \frac{k-1}{n}\right) T_p}$$

now  $n \rightarrow \infty$

$$\text{Speedup} = \frac{T_n}{T_p}$$

The maximum speed up ( $S$ ) can be achieved when  
no of task increases is,  $S = k$  as  $n \rightarrow \infty$

so Maximum speed up will be 6  
i.e. ( $S = 6$ )

Q-2

Time delays of 4 stages

$$T_1 = 60 \text{ ns}$$

$$T_2 = 70 \text{ ns}$$

$$T_3 = 90 \text{ ns}$$

$$T_4 = 80 \text{ ns}$$

and interface latch has a delay of 10ns

- (a) Clock Period
- (b) Frequency
- (c) Speed up

$$\begin{aligned} \text{Clock period} &= T_{\text{max}} + \Delta \\ &= 90 + 10 \\ &= 100 \text{ ns} \end{aligned}$$

$$\begin{aligned} \text{frequency} &= \frac{1}{\text{Clock Period}} \\ &= \frac{10^9}{100} = 10^7 \text{ Hz} \end{aligned}$$

$$\text{Speed Up} = \frac{\text{Non-Pipeline Cycle}}{\text{Pipeline Cycle Time}}$$

$$= \frac{T_1 + T_2 + T_3 + T_4}{T}$$

$$= \frac{60 + 70 + 90 + 80}{100} = 2.8$$

Q-3

Consider 4-segment adder pipeline.

$T_1 = 5\text{ns}$ ,  $T_2 = 3\text{ns}$ ,  $T_3 = 9\text{ns}$ ,  $T_4 = 4.5\text{ns}$   
interface latencies =  $3\text{ns}$

Page : / /  
Date : / /

(a) How long would it take to add 100 pairs of numbers in the pipeline?

$$\text{Clock period} = 9\text{s} + 5 = 100\text{ns}$$

$$\begin{aligned}\text{Time for 100 pairs} &= (n+k-1)\tau \\ &= (100+4-1) \cdot 100 \\ &= 103 \times 100 \text{ ns} \\ &= 10.3 \text{ ms}\end{aligned}$$

(b) How can we reduce the total time to about to one half of the time calculated in (a)?

To reduce total time period, divide 3 stages into 2 stages

$$T_3 = 9\text{ns} \quad \begin{cases} T_{31} = 5\text{ns} \\ T_{32} = 4\text{ns} \end{cases}$$

$$\begin{aligned}\text{Clock period} &= T_m + d \\ &= 5\text{ns} + 3 \\ &= 8\text{ns}\end{aligned}$$

$$\begin{aligned}\text{Total time} &= (k+n-1)\tau \\ &= (5+100-1) \times 8 \text{ ns}\end{aligned}$$

$$T_K = 8720 \text{ ns}$$

Q-4

$$D = 15000$$

$$f = 25 \text{ MHz}$$

$$k = 5$$

Speed up factor ?

Efficiency  
throughput

$$\text{Speed up} = \frac{nK\tau}{[k+(n-1)]\tau}$$

Page :

Date : / /

$$= \frac{nk}{k+(n-1)} = \frac{5000 \times 5}{5+15000-1} \\ = 4.999$$

$$\text{Efficiency} = \frac{\text{Speed up}}{K}$$

$$= \frac{4.999}{5} = \frac{4.999}{5} = 0.999$$

$$\text{Throughput}(H_K) : \frac{E}{\tau}$$

$$= \frac{nk}{k+n-1} = 24.99$$

Question 5:

| non-pipelined                                                 | upgraded<br>pipelined |
|---------------------------------------------------------------|-----------------------|
| clock rate = 15 MHz                                           | K = 5                 |
| average cycles per<br>instruction = $\frac{4}{\text{second}}$ | clock speed = 90 MHz  |

Speed up =

$$\frac{\text{non-pipeline}}{\text{Pipeline}}$$

$$= \frac{\frac{4}{15} \times 100}{(k+n-1)\tau} - \frac{\frac{4}{90} \times 100}{(k+n-1)\tau}$$

$$= \frac{T_1}{T_2} = \frac{16}{5 \cdot 2} = 3.02$$

$$\text{Throughput} = \frac{\text{Efficiency}}{\text{Clock period}} = \frac{nk}{k+n-1}$$

$$= 19.23$$

Throughput for non pipeline

$$= \frac{25}{4} \times 10^6 \text{ instructions/sec}$$

Page: 1  
Date: 1

$$= 6.25 \text{ MIPS}$$

Linear Pipelines: They are used to perform fixed function

Non-Linear Pipeline: are dynamic pipelines. It allows feed back and feed forward connections in addition to streamline connections.

- It can be reconfigured to perform different functions at different time i.e. Multifunction
- Gives more than one output, output of the pipeline is not necessarily from the last stage.



3-stage non-linear pipeline

Two function X and Y  
not necessarily from last stage

Streamline connections:  
S1 to S2  
S2 to S3  
S1 to S3  
S3 to S2  
S3 to S1

### Reservation Table

It specifies the utilization pattern of successive stages in a pipeline

- Rows represents stages and columns represent clock time units.

## For Linear Pipeline

Page :

Date :

- \* Reservation table for Linear Pipeline processes follows diagonal stream line pattern.

R-T for 4 Stage Pipeline  $\rightarrow$  Time (clock cycles)

|        | 1              | 2 | 3 | 4 |
|--------|----------------|---|---|---|
| Stages | S <sub>1</sub> | X |   |   |
|        | S <sub>2</sub> |   | X |   |
|        | S <sub>3</sub> |   |   | X |
|        | S <sub>4</sub> |   |   | X |

## For Non-Linear Pipeline

- Non-Linear pattern
- On the evaluation of different functions multiple reservation table can be generated.
- Each R-T display form of data in pipeline for one complete evaluation of pipeline.

Reservation table for X

|                | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|----------------|---|---|---|---|---|---|---|---|
| S <sub>1</sub> | X |   |   |   | X |   |   | X |
| S <sub>2</sub> |   | X | X |   |   |   |   |   |
| S <sub>3</sub> |   |   | X | X |   | X |   |   |

$\delta \leftarrow$  clock units.  
= evaluation period.

Processing sequence for function X

S<sub>1</sub>  $\rightarrow$  S<sub>2</sub>  $\rightarrow$  S<sub>3</sub>  $\rightarrow$  S<sub>2</sub>  $\rightarrow$  S<sub>3</sub>  $\rightarrow$  S<sub>1</sub>  $\rightarrow$  S<sub>3</sub>  $\rightarrow$  S<sub>1</sub>

→ Multiple check marks in a row means repeated usage of same stage in different cycle.

→ Contiguous check marks - extended usage of stage over more than one cycle.

→ Multiple check marks in simultaneous usage of multiple stages same time.

→ Total no of clock units = evaluation time.

static pipeline  $\rightarrow$  linear pipeline.  
dynamic  $\rightarrow$  non-linear pipeline

Page : / /  
Date : / /

Latency:

- An initiation refers to the start of single function evaluation.
- No of clock cycles between two initiation of pipeline is the latency.
- Latency values must be positive integers.

Collisions:

- An attempt by two or more initiation to use the same pipeline stage at the same time is called collision.
- Collision implies resource conflicts between two initiations.
- Collisions must be avoided.

|                | 1              | 2 | 3 | 4 | 5 | 6 | 7 | 8 | Time | 1              | 2              | 3                             | 4                             | 5                             | 6                             | 7              | 8 | 9 | 10 | 11 | 12 | Time |
|----------------|----------------|---|---|---|---|---|---|---|------|----------------|----------------|-------------------------------|-------------------------------|-------------------------------|-------------------------------|----------------|---|---|----|----|----|------|
| S <sub>1</sub> | X              |   |   |   | X | X |   |   |      | X <sub>1</sub> | X <sub>2</sub> | X <sub>3</sub> X <sub>1</sub> | X <sub>1</sub> X <sub>2</sub> | X <sub>2</sub> X <sub>3</sub> |                               |                |   |   |    |    |    |      |
| Stages         | S <sub>2</sub> | X | X |   |   |   |   |   |      | X <sub>1</sub> | X <sub>2</sub> |                               | X <sub>3</sub> X <sub>4</sub> | X <sub>4</sub>                |                               |                |   |   |    |    |    |      |
| S <sub>3</sub> |                | X | X | X | X |   |   |   |      | X <sub>1</sub> | X <sub>2</sub> | X <sub>3</sub> X <sub>2</sub> | X <sub>1</sub> X <sub>3</sub> | X <sub>2</sub> X <sub>4</sub> | X <sub>3</sub> X <sub>4</sub> | X <sub>4</sub> |   |   |    |    |    |      |

Reservation table of functions

Forbidden latencies

## Permissible latencies

- Latencies that does not cause collisions are permissible latencies.
- Permissible latencies =  $\{1, 3, 6, 8\}$  for last problem

Latency sequence?

Sequence of permissible latencies  
 $\{1, 3, 6, 8\}$ .

Latency cycle:

A latency sequence that repeats the same subsequence (cycle) indefinitely.  
For ex: L.S repeat after 1, 3, 6, or 8.

Average latency

the sum of all latencies divided by the number of latencies along the cycle.

$$\frac{1+8}{2} = 4.5$$

constant cycle

latency cycle only one latency value  
6, 6, 6, 6.

Collision free scheduling

To avoid collisions and to achieve high throughput, all the tasks awaiting for initiation should be properly scheduled.

While scheduling main objective is

to obtain shortest average latency between initiations without causing collisions.

Steps:

Collision vector:

It is m-bit binary vector  $C = (C_m C_{m-1} \dots C_1)$  that can be obtained from ~~combined~~ from ~~combined~~ set of permissible and forbidden latencies.

$$M \leq n-1, n = \text{columns}.$$

Date: / /

$c_i = 1$  Forbidden  
 $c_i = 0$  Permissible  
 If  $c_2 = 1011010$

Page : / /  
 Date : / / /

In the previous example  
 forbidden latencies =  $\{2, 4, 5, 7\}$   
 Max F.L = 7  
 So m = 7

Collision vector  $G = \{c_7 c_6 c_5 c_4 c_3 c_2 c_1\}$   
 Permissible latency =  $\{1, 3, 6\}$   
 Now for Forbidden latencies =  $\{2, 4, 5, 7\}$   
 Collision vector  
 $= \{c_7 c_6 c_5 c_4 c_3 c_2 c_1\} = (1011010)$

State diagram  
 Constructed from collision vector

It specifies the permissible state transitions among successive initiations.



### Numericals

|        | 1     | 2 | 3 | Time | 4 | 5 | 6 | 7 | 8 |
|--------|-------|---|---|------|---|---|---|---|---|
| Stages | $S_1$ | X |   |      |   |   | X |   | X |
|        | $S_2$ |   | X |      | X |   |   |   |   |
|        | $S_3$ |   |   | X    | X | X |   |   |   |

$S_1: \{(6-1), (8-6), (8-1)\}$  i.e.  $\{5, 2, 7\}$   
 $S_2: \{(4-2)\}$  i.e.  $\{2\}$   
 $S_3: \{(5-3), (7-5), (7-3)\}$  i.e.  $\{2, 2, 4\}$

forbidden latencies =  $\{2, 4, 5, 7\}$   
 It cause collision.

Permissible latencies = {1, 3, 6, 8<sup>+</sup>}

It does not cause collision

Page :  
Date :  
8 of many  
all are  
permissible

Maximum forbidden latency (m) = 7  
so the collision vector is of 7 bits.

Collision vector = {c<sub>7</sub> c<sub>6</sub> c<sub>5</sub> c<sub>4</sub> c<sub>3</sub> c<sub>2</sub> c<sub>1</sub>}

1 = forbidden.

0 = permissible.

Collision vector = (1011010)

Permissible latency = {1, 3, 6, 8<sup>+</sup>}

Initial collision vector

1011010

with latency 8

$$\begin{array}{r} 1011010 \\ + 0101101 \\ \hline 1111111 \end{array}$$

Next state

So, the new state 111111 with latency 1

With latency 3

$$\begin{array}{r} 1011010 \\ + 0001011 \\ \hline 1011011 \end{array}$$

next state = 1011011

With latency 6

$$\begin{array}{r} 1011010 \\ 0000001 \\ \hline 1011011 \end{array}$$

next state = 1011011

With latency 8

$$\begin{array}{r} 1011010 \\ 0000000 \\ \hline 1011010 \end{array}$$

Next state = 1011011



Simple Cycle: It is a latency cycle in which each state appears only once.

Greedy cycle: A greedy cycle is a simple cycle whose edges are all made with minimum latency or maximum avg

| Latency cycle   | Types of cycle     | Average latency |
|-----------------|--------------------|-----------------|
| (3)             | Simple cycle const | 3 (MAL) Greedy  |
| (6)             | Simple cycle 11    | 6               |
| (8)             | Simple cycle 4     | 8               |
| (3, 8)          | 9                  | 4.5 Greedy      |
| (3, 6)          | n                  | 5               |
| (6, 8)          | n                  | 7               |
| (2, 6, 3)       | --                 |                 |
| (3, 3, 6)       | --                 |                 |
| (1, 8, 1, 3, 8) | --                 |                 |

generally there are 2 greed cycles

$$MAL = 3$$

## Question-2

Page :

Date : / /

|        | 1  | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|--------|----|---|---|---|---|---|---|---|---|
| S1     | X  |   |   |   |   |   |   |   | X |
| Stages | S2 |   | X |   |   |   |   | X |   |
| S3     |    |   | X |   |   |   |   |   |   |
| S4     |    |   |   | X | X |   |   |   |   |
| S5     |    |   |   |   |   | X | X |   |   |

forbidden latency =  $\{1, 5, 6, 8\}$  cause collision  
 permissible latency =  $\{2, 3, 4, 7, 9^+\}$  does not cause collision

collision vector

$$:\{C_8 C_7 C_6 C_5 C_4 C_3 C_2 C_1\} - \{10110001\}$$



Bound

Lower Bound on MAL

The maximum no of checkmarks in any row of the reservation table.

~~if~~ if lower bound  $\neq$  MAL

then optimal latency not achieved

Upper bound of MAL

Then no of is in icv + 1