

## Pipelining

used for to increase the overall performance of CPU

Non-Pipelining - Single h/w component which can be take only 1 task at a time, from its input and produce the result at the output.

$$I/P \rightarrow \boxed{H/W \text{ Design}} \rightarrow O/P$$

Drawback! —  $\rightarrow$  only 1 input can be processed at a time

$\rightarrow$  Partial o/p or segment o/p is not possible.



E-engine  
F-fitting  
P-painting

It is sequential work every person.  
complete the work at first then started next car making.



Pipeline — pipeline is the process of accumulating instruction from the processor through a pipeline. It allows storing and executing instructions in an orderly process. It is also known as pipeline processing.

Instead of a single h/w component we can split the h/w design into small component as segment.

The segment are connected with each other through an interface register & they can execute multiple tasks independently or parallelly.



In Pipeline system each segment consists of an I/P register followed by a combinational circuit. The register is used to hold data and combinational circuit performs operation on it. The O/P of combinational circuit is applied to the I/P register of next segment.

### Type of Pipeline —

1. Arithmetic Pipeline
2. Instruction Pipeline

### 1. Arithmetic Pipeline:

Its are usually found in most of the computers. They are used for floating point operators, multiplication of fixed point number etc. for example - The input to the floating point Adder pipeline

$$x = A \times 2^a \quad y = B \times 2^b$$

Here A & B are mantissas (significant digit of floating point no.), while a & b are exponent.

The floating point addition & subtraction is done in 4 parts

1. Compare the exponent.
2. Align the mantissas
3. Add or subtract mantissas
4. Product the result.

Instruction Pipelining → It is a pipeline process which executes different instructions belonging to a process in its different stages.

## Stages in an Instruction Pipeline

1. Instruction Fetch (IF)  $\rightarrow$  In this stage the CPU reads instructions from the address in the memory whose value is present in the program counter.  
e.g ADD, SUB.
  2. a) Instruction Decode (ID)  $\rightarrow$  Instruction is decided and the register file is accessed to get the values from the register used in the instruction.  
b) Operand Fetch (OF)  $\rightarrow$  e.g.  $A + B \xrightarrow{\text{Fetch}}$
  3. Instruction Execution (IE)  $\rightarrow$  ALU operations are performed
  4. Result Writeback to Main Memory or Register (WB)
  5. Instruction checking (IC)



$\therefore$  ID & IC are very fast operations.



example of instruction pipeline

The time that each step takes to fulfill its  $f^2$  depends on the instructions and the way it is executed. The design of an instruction pipeline would be most efficient if the instruction cycle is divided into segments to equalize the time taken by each segment.

(4)

Calculation of No. of clock cycles Required to process a group of Tasks

→ Let there be 100 task & they have to be execute in 4 stage pipelining system.

|       | Time space Diagram → clock cycle |       |       |       |       |       |       |       |       |          |          | ... |
|-------|----------------------------------|-------|-------|-------|-------|-------|-------|-------|-------|----------|----------|-----|
|       | 1                                | 2     | 3     | 4     | 5     | 6     | 7     | 8     | 9     | 10       | 11       | ... |
| $s_1$ | $t_1$                            | $t_2$ | $t_3$ | $t_4$ | $t_5$ | $t_6$ | $t_7$ | $t_8$ | $t_9$ | $t_{10}$ | $t_{11}$ | ... |
| $s_2$ | $t_1$                            | $t_2$ | $t_3$ | $t_4$ | $t_5$ | $t_6$ | $t_7$ | $t_8$ | $t_9$ | $t_{10}$ | ...      | ... |
| $s_3$ | $t_1$                            | $t_2$ | $t_3$ | $t_4$ | $t_5$ | $t_6$ | $t_7$ | $t_8$ | $t_9$ | ...      | ...      | ... |
| $s_4$ |                                  | $t_1$ | $t_2$ | $t_3$ | $t_4$ | $t_5$ | $t_6$ | $t_7$ | $t_8$ | $t_9$    | ...      | ... |
|       |                                  |       |       |       |       |       |       |       |       |          |          |     |
|       |                                  |       |       |       |       |       |       |       |       |          |          |     |

For task

$$2 \rightarrow 5 \xrightarrow{\text{no of task}} (2+3)$$

$$3 \rightarrow 6 \xrightarrow{\text{no of task}} (3+3)$$

$$4 \rightarrow 7 \xrightarrow{\text{no of task}} (4+3)$$

↑  
Stages-1

Total no of clocks for 100 tasks

$$\text{No. of clock cycles} = \frac{4 + 99}{K} = 103$$

$$n = n + (K-1)$$

$$= K + (n-1) \rightarrow \text{Each task 1 clock cycle time each}$$

↓  
Stages no of task  
above diagram.

- Q) 4-instructions are executed in a 4-stage pipelining  
 Assume that each instruction requires different stage delay (in terms of no. of clock cycles) as mentioned in the table below:-

|                | IF | ID | EX | WB |     |
|----------------|----|----|----|----|-----|
| I <sub>1</sub> | 1  | 3  | 2  | 1  | → 7 |
| I <sub>2</sub> | 2  | 2  | 3  | 1  | → 8 |
| I <sub>3</sub> | 1  | 1  | 1  | 1  | → 4 |
| I <sub>4</sub> | 2  | 1  | 1  | 2  | → 6 |

How many clock cycles will be required to complete these 4 instructions in the given pipeline system.

|    | t <sub>1</sub> | t <sub>2</sub> | t <sub>3</sub> | t <sub>4</sub> | t <sub>5</sub> | t <sub>6</sub> | t <sub>7</sub> | t <sub>8</sub> | t <sub>9</sub> | t <sub>10</sub> | t <sub>11</sub> | t <sub>12</sub> | t <sub>13</sub> |
|----|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|-----------------|-----------------|-----------------|-----------------|
| IF | I <sub>1</sub> | I <sub>2</sub> | I <sub>2</sub> | I <sub>3</sub> | I <sub>4</sub> | I <sub>4</sub> |                |                |                |                 |                 |                 |                 |
| ID | I <sub>1</sub> | I <sub>1</sub> | I <sub>1</sub> | I <sub>1</sub> | I <sub>2</sub> | I <sub>2</sub> | I <sub>3</sub> | I <sub>4</sub> |                |                 |                 |                 |                 |
| EX |                |                |                | I <sub>1</sub> | I <sub>1</sub> | I <sub>2</sub> | I <sub>2</sub> | I <sub>2</sub> | I <sub>3</sub> | I <sub>4</sub>  |                 |                 |                 |
| WB |                |                |                |                |                | I <sub>1</sub> |                |                | I <sub>2</sub> | I <sub>3</sub>  | I <sub>4</sub>  |                 |                 |

Answer → 13 clock cycles

// Let's see the speed up in this case  
 $\text{speed up} = \frac{t_{np}}{t_p} = \frac{f + 8 + 4 + b}{13} = \frac{25}{13} \approx 2$

Instructions execution in pipeline are :-

- Speed up ratio (S)
- Frequency (f)
- Efficiency (E)
- Throughput (H)

(4)

(5)

Speed up — If we have  $n$  tasks & they have to be executed in a  $K$ -stage pipeline system & clock cycle time of pipeline system is  $t_{cp}$ , then no. of clock cycles required to execute  $n$  tasks =  $n + (K-1) = K + (n-1)$

Total time required to execute  $n$  tasks =  $(n + K - 1) t_{cp}$

or  $t_{cp}$  = time of 1 clock cycle

If the clock cycle time for a non-pipeline system (time required by a non-pipeline system to execute 1 task) =  $t_{nlp} = n \times t_{cp}$

So, to execute  $n$  tasks in non-pipeline system, time required =  $n \times t_{nlp}$

Speed up factors of pipeline system

$$\text{Speed up} = \frac{\text{Time required by non-pipeline system}}{\text{Time required by pipeline system}}$$

$$= \frac{n \cdot t_{nlp}}{(n+K-1) t_{cp}}$$

if  $n \gg K$  then  $(K+n-1) \approx n$

$$\text{Speedup ratio} \approx \frac{n \cdot t_{nlp}}{n t_{cp}} \approx \frac{t_{nlp}}{t_{cp}}$$

• Frequency( $f$ ) =  $\frac{1}{t_{cp}}$

• Efficiency  $E_K = \frac{S_K}{K}$

$$S_K = \frac{n \times t_{nlp}}{(K+n-1) t_{cp}} = \frac{n \times K \times t_{cp}}{(n+K-1) t_{cp}}$$

$$= \frac{n \times K}{K+n-1}$$

$$\text{So, } E_K = \frac{S_K}{K} = \frac{n}{K+n-1}$$

Throughput :- No. of instructions execute per meter clock cycles

$$T = \frac{n}{(K+n-1) t_{cp}} = \frac{\text{no. of instructions}}{\text{Total time to complete the instructions}}$$

$$= \frac{n f}{K+n-1}$$

### Pipeline Model (Four stage)

cycle time  
be excess

- Q) A 4 stage pipeline has the stage delay of 150 ns, 120 ns, 160 ns and 140 ns respectively. Registers that are used b/w the stages have a delay of 5 ns each. Assuming constant clocking rate, what is the total time taken by pipeline to process 1000 data items.

$$t_p = \max(120, 150, 160, 140) + \text{Register stage delay}$$

$$= 160 + 5 = 165 \text{ ns}$$

$$\text{Total times} = 165(n+K-1)$$

$$= 165(1000+3) = 165.5 \mu\text{s}$$

- Q) 5 functional units operate using pipelining with clock cycle delay 10 ns, 8 ns, 10 ns, 10 ns, 7 ns. Assume pipeline have no 1 ns overhead. Find the speed up factors.

$$t_p = \max(10, 8, 10, 10, 7) \text{ ns} + 1 \text{ ns}$$

$$= 10 + 1 = 11 \text{ ns.}$$

$$\text{For non-pipeline system, } t_{NP} = 10 + 8 + 10 + 10 + 7$$

$$= 45 \text{ ns}$$

$$\text{Speedup factors} = \frac{45}{11} \approx 4$$

⑧ operand Forwarding Technique -

Non-Topic - Hazard During Pipelining

Hazard: Circumstances that would cause incorrect execution of the next instruction was launched.

Data Hazard - When an instruction attempts to use a resource before it is ready. It is classified 3 types, depending on the order of read & write access in the instruction. There are three types of Data Hazards possible:-

→ Read after Write Hazard [WR or True dependency WAR RAW]

→ Write after Read Hazard [RW] or Anti dependency RAW WAR

→ Write after Write Hazard [WW] output dependency RAW WAR

\* READ AFTER WRITE HAZARD -

Example -

i: ADD R<sub>1</sub>, R<sub>2</sub>, R<sub>3</sub>; // R<sub>1</sub> ← R<sub>2</sub> + R<sub>3</sub> destination operands

j: SUB R<sub>4</sub>, R<sub>1</sub>, R<sub>6</sub>; // R<sub>4</sub> ← R<sub>1</sub> - R<sub>6</sub> source operands

|    | 1 | 2   | 3                               | 4                                  | 5                               | 6 | 7 | 8 |
|----|---|-----|---------------------------------|------------------------------------|---------------------------------|---|---|---|
| IF | i | j   |                                 |                                    |                                 |   |   |   |
| ID |   | ADD | SUB                             |                                    |                                 |   |   |   |
| OF |   |     | R <sub>2</sub> , R <sub>3</sub> | (R <sub>1</sub> , R <sub>6</sub> ) |                                 |   |   |   |
| RF |   |     |                                 | R <sub>2</sub> + R <sub>3</sub>    | R <sub>1</sub> - R <sub>6</sub> |   |   |   |
| WB |   |     |                                 |                                    | R <sub>1</sub>                  |   |   |   |

Inconsistency  
∴ R<sub>1</sub> loaded at clock pulse 4 is not the correct value.  
It should have been done only after the result had been stored in R<sub>1</sub> at clock

Solution:-

Delay the Pipe! Solution for WR Hazard pulse 5.

Idea:- Delay the execution of operand fetch till the result gets stored in the register which is the cause of inconsistency.

|    | 1 | 2   | 3                               | 4                               | 5                               | 6                       | 7 | 8 |
|----|---|-----|---------------------------------|---------------------------------|---------------------------------|-------------------------|---|---|
| IF | i | j   |                                 |                                 |                                 |                         |   |   |
| ID |   | ADD | SUB                             |                                 |                                 |                         |   |   |
| OF |   |     | R <sub>2</sub> , R <sub>3</sub> |                                 | R <sub>1</sub> , R <sub>6</sub> |                         |   |   |
| RF |   |     |                                 | R <sub>2</sub> + R <sub>3</sub> |                                 | R <sub>1</sub> ← Result |   |   |
| WB |   |     |                                 |                                 |                                 | R <sub>4</sub> ← Result |   |   |

unbalanced  
cause incorrect  
the next instruction  
resource  
depending

### Sol 2 Operand forwarding Technique.

Idea:— Forward the result of execution unit of first instruction to the operand fetch unit of the second instruction, from where it can be read by the required instruction as their operand.

|    | 1 | 2   | 3         | 4                                           | 5                          | 6                 | 7                          |
|----|---|-----|-----------|---------------------------------------------|----------------------------|-------------------|----------------------------|
| IF | C | S   |           |                                             |                            |                   |                            |
| ID |   | ADD | SUB       |                                             |                            |                   |                            |
| OF |   |     | $R_2 R_3$ | $\xrightarrow{\text{Delay}}$<br>$R_2 + R_3$ | $R_2 + R_3$                | $R_4$             |                            |
| EX |   |     |           | $R_2 + R_3$                                 | $R_2 + R_3$                | $R_2 + R_3 - R_4$ |                            |
| WB |   |     |           |                                             | $R_4 \leftarrow$<br>Result |                   | $R_4 \leftarrow$<br>Result |

Benefit → Delay of only 1 clk cycle incurred in this case in comparison to 2 in Sol 1.

Ex-  $i: \text{ADD } R_1, R_2, R_3$        $j: \text{SUB } R_4, R_1, R_2$

$\overbrace{W_i}^{\text{Register}} \quad \overbrace{R_j}^1$

in which the results are written

$$W_i = \{R_1\}, R_i = \{R_2, R_3\}$$
$$R_j = \{R_1, R_2\}, W_j = \{R_4\}$$

Condition- If  $i < j$  &  $W_i \cap R_j \neq \emptyset$ , then WR Hazard

Let's check for given example- may occur

$$W_i \cap R_j = \{R_1\} \text{ & } i < j \text{ (or given order)}$$

So, WR Hazard occur

WR Hazard - occurs when an instruction refers to a result that has not yet been calculated or returned.

control  
value is it  
the pr  
one  
pi

10)

Data Hazard :- when Instructions that exhibit data dependence modify data in different stages of pipeline

Write after Read Hazard (RW / Anti-dependency Hazard)

Example -

This can also occurs with both arithmetic & logical ops

$i : ADD R_1, R_2, R_3 ; \text{ if } R_1 \leftarrow R_2 + R_3$

$j : OR R_2, R_3, R_4 ; \text{ if } R_2 \leftarrow R_3 \text{ or } R_4$

Logical instruction execution is faster than arithmetic.  
Hence, it may happen that the  $j$ th instruction gets executed before the  $i$ th instruction.

→ If in case this happens, then the  $i$ th instruction would add into  $R_2$  a modified value of  $R_2$  & thus it would produce incorrect result. This is called RW Hazard.

Condition of RW Hazard :-

If  $i < j$  and  $R_i \cap W_j \neq \emptyset$ ; then RW Hazard may occurs.

Solution:- Delay the Pipe → Discussed in an elaborate manner in the Pipeline Stall, Pipeline stall is also called pipeline stall, pipeline break, pipeline bubbling.

$$R_i = \{R_2, R_3\}, W_j = \{R_2\}$$

$$R_i \cap W_j = \{R_2\} \neq \emptyset$$

## Control Hazard (Branch Hazard)

What is it?

on many instruction pipeline microarchitectures, the process will not know the outcome of the branch when it needs to insert a new instructions to the pipeline (generally the fetch stage)

Pipeline cannot operate normally due to control flow (non-sequential)

Example

S<sub>1</sub>

S<sub>2</sub> if(cond)

S<sub>3</sub>

S<sub>4</sub>

S<sub>5</sub>

if the cond is false  
Jump to S<sub>6</sub>

wasteage of pipeline so this is not required.



Drawbacks of this hazard problem caused by this hazard:-

It reduces the speedup of the pipeline.

Structural Hazard :-

→ It occurs due to a resource conflict

in the pipeline

- A resource conflict is a situation when more than one instruction tries to access the same resource in the same cycle. [In the same clock pulse]
- The resource can be register, memory or ALU.

Example -

I<sub>1</sub> : ADD R<sub>1</sub>, R<sub>2</sub>, R<sub>3</sub>; // R<sub>1</sub> ← R<sub>2</sub> + R<sub>3</sub>

I<sub>2</sub> : ADD R<sub>4</sub>, R<sub>5</sub>, R<sub>6</sub>; // R<sub>4</sub> ← R<sub>5</sub> + R<sub>6</sub>

I<sub>3</sub> : SUB R<sub>7</sub>, R<sub>1</sub>, R<sub>8</sub>; // R<sub>7</sub> ← R<sub>1</sub> - R<sub>8</sub>

|    | 1                                      | 2                                      | 3                                      | 4                               | 5 | 6 | 7 |
|----|----------------------------------------|----------------------------------------|----------------------------------------|---------------------------------|---|---|---|
| IF | I <sub>1</sub>                         | I <sub>2</sub>                         | I <sub>3</sub>                         |                                 |   |   |   |
| ID | ADD<br>R <sub>2</sub> , R <sub>3</sub> | ADD<br>R <sub>5</sub> , R <sub>6</sub> | SUB<br>R <sub>1</sub> , R <sub>8</sub> |                                 |   |   |   |
| Ex |                                        | R <sub>2</sub> + R <sub>3</sub>        | A <sub>5</sub> + R <sub>6</sub>        | R <sub>1</sub> - R <sub>8</sub> |   |   |   |
| WB |                                        |                                        | R <sub>1</sub> ...<br>result           | R <sub>4</sub>                  |   |   |   |

In consistency as  
I<sub>2</sub> needs previous  
value of R<sub>1</sub>

R<sub>1</sub> used by two different instructions for  
different processor in the same clock cycle

Solution :

Delay the pipe (Delay I<sub>3</sub>'s ID phase till I<sub>1</sub> complete)

Also called Pipeline Bubbling

Break.

Pipeline Hazards :- The hazard occurs due to unavailability of required data, instruction or resource during execution of an instruction in pipelining architecture.

eg I<sub>1</sub> F D E WB

I<sub>2</sub> F D E WB

I<sub>3</sub> F D STALL E WB

I<sub>4</sub> STALL F D E STALL WB

STALL - It is the delay of execution process. This helps in not occurring pipeline hazards.

support + But due to STALL, this instruction execution process takes more time for execution which needs to the decrease in the performance. OP

Hazard: Circumstances that would cause incorrect executions if the next instruction was launched.

→ Structural hazard :- It will basically arise b/c it don't have enough resources in a pipeline system implementation.



→ This is a stall bubble. (stall bubble)

When two or more different instructions want to use same resource e.g. MEM uses the same memory port as IF as shown in this.

Dealing with structural hazard:

Stall

- low cost, simple
- increases CPI
- use for rate care

⇒ Pipeline hardware resource

- use for multi-cycle resources
- good performance
- sometimes complex e.g. RAM

Replicate resource

- good performance
- increases cost (+ maybe interconnect delay)
- useful for cheap or divisible resource

2.   
 a)   
 b)   
 c)   
 d)   
 e)   
 f)   
 g)   
 h)   
 i)   
 j)   
 k)   
 l)   
 m)   
 n)   
 o)   
 p)   
 q)   
 r)   
 s)   
 t)   
 u)   
 v)   
 w)   
 x)   
 y)   
 z)   
 T = 20ns  
 S = 4  
 N = 100  
 • Pipelining

Write after write hazard (WAW) :- occurs when transaction tries to write an operand out of orders.

e.g/

$$i : R_2 \leftarrow R_4 + R_7$$

$$j : R_2 \leftarrow R_1 + R_3$$

→ If instruction  $j$  updates the value of  $R_2$  before  $i$ , the final value written into register  $R_2$  would be that of instruction  $i$  but what was required was of instruction  $j$ . Hence a WAW hazard is said to occur.

### Solution

Delay the writeback of  $j$  until  $i$  finishes executing.

Conditions of WW →

if  $O(i) \cap O(j) \neq \emptyset$  with  $W_i \neq \emptyset$ , WW hazard may occur.

Checking for the current example

$$W_i = \{R_2\}$$

$$W_j = \{R_2\}$$

$$W_i \cap W_j = \{R_2\} \neq \emptyset$$

Hence WW hazard may occurs.

## Pipeline

- To improve the performance of a CPU we have two options:
  1. Improve the hardware by introducing faster circuits.
  2. Arrange the hardware such that more than one operation can be performed at the same time.
- Since there is a limit on the speed of hardware and the cost of faster circuits is quite high, we have to adopt the 2nd option

## Arithmetic pipeline

$$x = 0.9505 \times 10^3 = Ax \times 10^a$$

$$y = 0.8100 \times 10^2 = Bx \times 10^b$$

$$= 0.0810 \times 10^3$$

$$z = x + y$$

$$= 1.0315 \times 10^3$$

$$= 0.10315 \times 10^4$$

Pipeline for floating point addition / subtraction

In this arithmetic pipeline used to do some arithmetic op's like FP additions. We considering that these type of pipeline 4 stage or three stage 1-4. Let us see them.

In case of floating point add/sub the exponent must be same for the both add/sub type calculate exponent. explain problem.



### Instruction pipelining

Step'

|               | 1  | 2  | 3  | 4  | 5  | 6  | 7  | 8  | 9  | 10 | 11 | 12 | 13 |
|---------------|----|----|----|----|----|----|----|----|----|----|----|----|----|
| Instruction 1 | IF | ID | OF | Ex |    |    |    |    |    |    |    |    |    |
| 2             | IF | ID | OF | Ex |    |    |    |    |    |    |    |    |    |
| Branch 3      |    | IF | ID | OF | Ex |    |    |    |    |    |    |    |    |
| 4             |    |    | IF | -  | -  | IF | ID | OF | Ex |    |    |    |    |
| 5             |    |    |    | -  | -  | -  | IF | ID | OF | Ex |    |    |    |
| 6             |    |    |    | *  | -  | -  | -  | IF | ID | OF | Ex |    |    |
|               |    |    |    |    | -  | -  | -  | IF | ID | OF | Ex |    |    |

### Timing of Instruction pipeline

We know that every instruction in a program can consist of multiple instructions in instruction pipeline architecture where each instruction can perform different stages or phases.

1st instruction IF then goto 10 in 2nd cycle & so on.

In 3rd instruction, there are two categories Branch & jump instruction conditional & unconditional. what happens linear or sequential execution are hamper as well that the next instructions update which is the result in memory update the 3rd instruction, so not executable. Then due to the presence of jump or goto statement instruction that program control suppose to target the different memory location. that's next instruction fetch to enhance.

This is known as pipeline stalling.

That's why pipeline efficiency decrease. And the deterioration of the performance of the pipeline is

Stall - The next instruction will be cleaned up means the pipeline is empty.

## Pipeline Model (Four stage)



In pipeline system each segment consists of an I/P register followed by a combinational circuit. The register is used to hold data and Combinational Circuite perform operation on it. The off of Combinational circuite is applied to the I/P register of next segment.

### Example

- Suppose an instruction is divided into 4 segments and a program has 6 instructions. Find the total time required to complete all operations in pipelining as well as non-pipelining.

Sol.

$$\begin{aligned} t_{NP} &= N \times S \times T \\ &= 6 \times 4 \times T = 24T \text{ sec} \end{aligned}$$

$$\begin{aligned} t_P &= 1 \times S T^{\text{cycle time}} + (N-1) T \\ &= 4T + 5T = 9T \end{aligned}$$

$t_P$  = Time taken for 1st task + Time taken for remaining tasks



- If a processor takes 20ns to execute a sub operation, assume that a pipeline has 4 segments & execute 100 no. of instructions in a sequence. Find the following.

(a) Time taken to execute all the instruction in

a) non pipeline b) pipeline

c) what is the overall speed of pipelining over non-pipelining.

$$\begin{aligned} T &= 20 \text{ ns} & a) t_{NP} &= 100 \times 4 \times 20 \text{ ns} & b) t_P &= 4 \times 20 + 99 \times 20 \\ & & & & & = 103 \times 20 \text{ ns} = 2060 \text{ ns} \\ S &= 4 & & & & \end{aligned}$$

$$N = 100 \quad c) \text{Overall speed of P over NP} = \frac{8000}{2060} = 3.88$$

∴ pipelining is 3.88 times faster than non-pipeline.

Pipeline for  
4 stage pipeline for a RISC



Stages :

1. Instruction Fetch (IF) :-

- Fetching instruction

- Incrementing Program Counter

2. Instruction Decode (ID) :-

- Interpretation of opcode

- Decode and search where source registers are present

- Read the register contents → check for branch

3. Execution (Ex) :-

- Effective address Computation.

- performs the operation specified by ALU.

4. Memory operation (Mem)

- Read

- Write

5) Write back (WB)

- wrote the result into register file.

CA-PF



or



Q. How many clock cycle required to finished.



For 1st instruction = 12 CC required ( $1+2+3+4+2$ )

For 2nd instruction = 16 CC required

Pipeline takes the slowest stage time:

|                | 1              | 2              | 3              | 4              | 5              | 6              | 7              | 8              | 9              | 10             | 11             | 12             | 13             | 14             | 15             | 16             | 17             | 18             | 19 | 20 |
|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----|----|
| I <sub>1</sub> | F <sub>1</sub> | D <sub>1</sub> | E <sub>1</sub> | M <sub>1</sub> | M <sub>1</sub> | M <sub>1</sub> | M <sub>1</sub> | N <sub>1</sub> |                |                |                |                |                |                |                |                |                |                |    |    |
| I <sub>2</sub> | F <sub>2</sub> | -              | D <sub>2</sub> | -              | E <sub>2</sub> | E <sub>2</sub> | E <sub>2</sub> | -              | M <sub>2</sub> | M <sub>2</sub> | m <sub>2</sub> | m <sub>2</sub> | w <sub>2</sub> | w <sub>2</sub> |                |                |                |                |    |    |
| I <sub>3</sub> | F <sub>3</sub> | -              | -              | D <sub>3</sub> | D <sub>3</sub> | -              | -              | E <sub>3</sub> | E <sub>3</sub> | E <sub>3</sub> | -              | -              | M <sub>3</sub> | M <sub>3</sub> | M <sub>3</sub> | M <sub>3</sub> | N <sub>3</sub> | W <sub>3</sub> |    |    |
| I <sub>4</sub> |                |                |                |                |                |                |                |                |                |                |                |                |                |                |                |                |                |                |    |    |

Q. 4- Instructions are executed in a 4 stage pipelining. Assume that each instruction requires different stage delay (in terms of no. of clock cycles) as mention in the table below.

|                | IF | ID | EX | WB |     |
|----------------|----|----|----|----|-----|
| I <sub>1</sub> | 1  | 3  | 2  | 1  | → 7 |
| I <sub>2</sub> | 2  | 2  | 3  | 1  | → 8 |
| I <sub>3</sub> | 1  | 1  | 1  | 1  | → 4 |
| I <sub>4</sub> | 2  | 1  | 1  | 2  | → 6 |

How many clock cycle will be required to complete these 4 instructions in the given pipeline system.

|                | f <sub>1</sub> | f <sub>2</sub> | f <sub>3</sub> | f <sub>4</sub> | f <sub>5</sub> | f <sub>6</sub> | f <sub>7</sub> | f <sub>8</sub> | f <sub>9</sub> | f <sub>10</sub> | f <sub>11</sub> | f <sub>12</sub> | f <sub>13</sub> |
|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|-----------------|-----------------|-----------------|-----------------|
| I <sub>1</sub> | IF             | D <sub>1</sub> | D <sub>1</sub> | D <sub>1</sub> | E <sub>1</sub> | B <sub>1</sub> | W <sub>1</sub> |                |                |                 |                 |                 |                 |
| I <sub>2</sub> | IF             | F <sub>2</sub> | F <sub>2</sub> | D <sub>2</sub> | D <sub>2</sub> | E <sub>2</sub> | B <sub>2</sub> | E <sub>2</sub> | W <sub>2</sub> |                 |                 |                 |                 |
| I <sub>3</sub> | IF             |                |                | F <sub>3</sub> | -              | D <sub>3</sub> | -              | -              | -              | E <sub>3</sub>  | W <sub>3</sub>  |                 |                 |
| I <sub>4</sub> | IF             |                |                | F <sub>4</sub> | F <sub>4</sub> | D <sub>4</sub> | -              | -              | -              | E <sub>4</sub>  | W <sub>4</sub>  | W <sub>4</sub>  |                 |

Answer → 13 clock cycle.

## Classification of Pipeline :-

Instruction Pipelining — Execution of stream of instructions

is pipelined by overlapping the execution of current instruction with fetch, decode, and operand fetch of subsequent instruction. These are also called instruction look-ahead pipeline. It is inter-instruction pipeline.

It is executed by follows steps.

- IF, ID, - Fetch the opcode
- Decode
- Fetch the operand
- Execution of the above ins.
- Store / write back result

CA-PQ



*019-13*

The time that each step takes to fulfill its function depends on the instruction and the way it is executed. The design of an instruction pipeline would be most efficient if the instruction cycle is divided into segments to equal duration.

## Arithmetic Pipeline

CB-PI0

The ALU of Computer is segmented in various data formats for pipeline operations. They are used for floating point operators, multiplication of fixed point no. etc.

example for the input to the floating point Adder. pipeline

$$x = A + 2^a \quad y = B + 2^b$$

Here A & B are mantissas (significant digit of floating point no.) while a & b are exponent.

The floating point addition & Subtraction is done in 4 parts.

1. Compare the exponent
2. Align the mantissas
3. Add or Subtract mantissas
4. Product the result.

$$x = 0.9784 \times 10^3$$

$$y = 0.8621 \times 10^2$$

$$N = \pm m \times 2^{\pm E}$$

$m$  = representation of bits  
Number = N

Step 1 :- Comparison of exponent

Step 2 : If it is unequal, then align the mantissa with respect to exponent

$$y = 0.0862 \times 10^3$$

Step 3 :- Add / subtract mantissas

$$\begin{aligned} z &= (0.97840 + 0.0862) \times 10^3 \\ &= 1.06461 \times 10^3 \end{aligned}$$

Step 4 :- Normalize the mantissa if necessary.

Architecture of Arithmetic Pipelining:-

→ It is divided into 4 Categories/stages/segments



EXPO NENT



Stage 1:

Compare abs  
(a-b)

MANTISA



Stage 2:

Select exponent

Realignment  
of mantisa

Stage 3:

Addition of  
mantisa

Stage 4:

Adjust Exponent

normalize  
Result

In this arithmetic pipeline used to do some arithmetic op's like FP addition. We Considering that there type of pipeline 4 stages or phases.

Let us see no.

In case of floating point add/sub the exponent must be same for both the no then the mantisa of the significant can get before the significant add/sub. I should equate exponent.

Instruction execution in pipeline are :-

- Speed up ratio (S)
- Frequency (f)
- Efficiency (B)
- Throughput (H)

### Processor Pipeline :-

Some data streams is processed by a cascade of processors. Each processor does some specific task.



$$\text{Speedup ratio} = \frac{\text{Execution Time in a Non-pipeline architecture}}{\text{Execution Time in a pipeline architecture}}$$

Let.

$T_{NP}$  = Execution time of a single instruction on a non-pipeline architecture.

$T_P$  = Execution time in a pipeline architecture

$N$  = No. of instructions to be executed

$K$  = No. of steps/stages.

$$\text{So. Speedup ratio or } \frac{\frac{N \cdot T_{NP}}{K \cdot T_P + (N-1)T_P}}{=} \frac{N \cdot T_{NP}}{(N+K-1)T_P}$$

$$\text{When } N \gg K, \text{ we can ignore } K = \frac{N \cdot T_{NP}}{N \cdot T_P}$$

$$\therefore \text{Speedup Ratio} = \frac{T_{NP}}{T_P}$$

$$\text{In ideal case pipeline } T_{NP} = K T_P$$

$$\text{Speedup ratio} = K \frac{T_P}{T_P}$$

$$\approx \text{Speedup ratio} \propto K$$

## Calculate cycle Time

### In pipelined architecture —

- There is global clock that synchronizes the working of all stages.
- At the beginning of each clock cycle, each stage reads the data from its register and process it.
- Frequency of clock is set such that all the stages are synchronized.
- Cycle time is the value of one clock cycle.

There are two cases possible.

#### Case 1 All the stages offer same delay

Cycle time = Delay offered by one stage including the delay due to its register.

#### Case 2 All the stages do not offer same delay :-

Cycle time = Maximum delay offered by any stage including the delay due to its register

- Q. A 4 stage pipeline has the stages delay of 150 ns, 120 ns, 160 ns and 140 ns respectively. Register that are used b/w the stages have a delay of 5 ns each. Assuming constant clocking rate, what is the total time taken by pipeline to process 1000 data items.

$$\begin{aligned} T_p &= \max(120, 150, 160, 140) + \text{Register stage delay} \\ &= 160 + 5 = 165 \text{ ns} \end{aligned}$$

$$\text{Total time} = 165(n+k-1) = 165(1000+3) = 165.5 \text{ ms}$$

- Calculating Frequency of clock :-

$$\text{Frequency of the clock } (f) = \frac{1}{\text{Cycle time}} \\ = \frac{1}{T_p}$$

- Efficiency ( $\eta$ )

$$\text{Efficiency } (\eta) = \frac{\text{Speed up}}{\text{No. of stages in Pipelined Architecture.}} \\ = \frac{s_K}{K}$$

$$\therefore s_K = \frac{n \times T_{np}}{(K+n-1) T_p}$$

$$= \frac{n \times K \times T_p}{(n+K-1) T_p} = \frac{nK}{n+K-1}$$

$$\therefore E_K = \frac{s_K}{K} = \frac{n}{K+n-1}$$

- Throughput :- No. of instruction executed per unit time

$$H = \frac{n}{(K+n-1) T_p} = \frac{\text{No. of Instruction executed}}{\text{Total time taken.}} \\ = \frac{n f}{(K+n-1)}$$

CA-P15

5 functional units operate using pipeline with clock cycle delay 10ns, 8ns, 10ns, 10ns, 7ns. Assume pipeline have no 1ns overhead. Find the speed up factors.

$$T_p = \max(10, 8, 10, 10, 7) \text{ ns} + 1 \text{ ns}$$

$$= 11 \text{ ns.}$$

For non pipeline system,  $T_{np} = 10+8+10+10+7 = 45 \text{ ns}$

$$\text{Speedup factor} = \frac{45}{11} \approx 4$$

Q. The stage delays in a 4 stage pipeline are 800, 500, 400, 300 picoseconds. The first stage is replaced with a J/K equivalent design involving two stages with respective 600 and 350 picoseconds. The throughput increase of the pipeline is — %.

Execution Time in 4 Stage Pipeline

$$T_p = \max(800, 500, 400, 300) + 0 = 800 \text{ picosecond.}$$

Throughput (throughput)  $\frac{1}{800} =$

Execution Time in 2 Stage

$$\text{Cycle time}(t_p) = \max(600, 350) + 0$$

$$= 600 \text{ ps}$$

Throughput 2 Stage pipeline

$$\text{Throughput} = \text{Number of instruction}/600 \text{ ps}$$

Throughput Increase

$$= \frac{\text{Final throughput} - \text{Initial throughput}}{\text{Initial throughput}} \times 100$$

$$= (1/600 - 1/800) / (1/800) \times 100$$

$$= \{(800/600) - 1\} \times 100 = (1.33 - 1) \times 100$$

$$= 33.33\%$$

## Pipeline Optimization

1. Linear pipeline are static pipeline bcz they are used to perform fixed function. The output of the pipeline is produced from the last stage. Static pipeline is specified by single Reservation Table. The reservation Table is trivial in the sense that data flows in linear streamline. All initiations to a static pipeline use the same reservation table.

Non-linear pipeline are dynamic pipeline bcz they can be reconfigured to perform variable fn at different times. Functions partitioning is relatively difficult because the pipeline stages are interconnected with loops, <sup>(feedforward & feedback)</sup> in addition to streamline connections. The output of the pipeline is not necessarily produced from the last stage. Dynamic pipeline may allow different initiations to follow a mix of reservation table.



A three stage pipeline.

A reservation Table is a way of representing the task flow pattern of a pipelined system. A reservation table has several rows and columns. Each row of the reservation table represents one resource of the pipeline and each column represent one time-slice of the pipeline.

|       |                | Time →         | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|-------|----------------|----------------|---|---|---|---|---|---|---|---|
|       |                | S <sub>1</sub> | X |   |   |   | X |   | X |   |
| Stage | S <sub>2</sub> |                | X |   | X |   |   |   |   |   |
|       | S <sub>3</sub> |                |   | X |   | X |   | X |   |   |

Diagram → RT Table → Collision vector  
 ↓ State diagram → forbidden latency  
 → permissible "

(i) Simple cycle (ii) Greedy cycle (iii) M AL (minimum avg latency)

CPD-2

### latency

The number of time units (clock cycles) b/w two initiations of a pipeline is the latency b/w them. Latency values must be non-negative integers.

Any attempt by two or more initiations to use the same pipeline stage at the same ~~time~~ will cause a collision. A collision implies resource conflicts b/w two initiations in the pipeline. Latencies that cause collisions are called forbidden latencies.

To detect a forbidden latency, one needs simply to check the distance b/w any two checkmarks on the same row of the reservation table.

e.g. The distance b/w the 1st mark and the second mark in row 5, in fig. is 5 is a forbidden latency.

Similarly latencies 2, 4, 5, 7 are all seen to be forbidden from inspecting the same reservation table.

$$\begin{aligned}\text{Forbidden latency} &= \{ (6-1), (8-1), (8-6) \}; (4-2), (5-3); (7-3), (7-5) \} \\ &= \{ 5, 7, \underline{2}, \underline{2}, 2, 4, \underline{2} \} \\ &= \{ 2, 4, 5, 7 \}.\end{aligned}$$

### Collision Free Scheduling —

The main objective is to obtain the shortest avg. latency b/w initiations without causing collisions.

then we study —

collision vector — is a method of analyzing how often we can initiate a new operation into the pipeline and maintain synchronous flow without collisions.

For reservation table with  $n$  columns, the maximum forbidden latency  $m \leq n-1$ .

The permissible latency  $p$  should be as small as possible.

The choice is made in the range  $1 \leq p \leq m-1$

CPO-3

A permissible latency of  $P=1$  corresponds to the ideal case.

The combined set of permissible and forbidden latencies can be easily displayed by a collision vector, which is an  $m$ -bit binary vector.  $C = (C_m C_{m-1} \dots C_2, C_1)$

The value of  $C_i = 1$  if latency  $i$  causes a collision and  $C_i = 0$  if latency  $i$  is permissible.

Note that it is always true the  $C_m = 1$ , corresponding to the maximum forbidden latency.

permissible latencies = {1, 3, 6, ~~7~~}

Collision vector = { $C_7 C_6 C_5 C_4 C_3 C_2 C_1$ }  
1 0 1 1 0 1 0

The collision Vector

$C_x = \{1011010\}$  for  $f^n X$

$C_y = \{1010\}$  for  $f^n Y$

CPO

CPO-4  
Simple cycle / no one wasted in twice



Not a simple cycle

List of all the simple cycle

| <u>cycle</u> | <u>Avg. latency</u> |
|--------------|---------------------|
| (1,8)        | 8                   |
| (3,8)        | 4.5                 |
| (6,8)        | 5.5                 |
| (8,8)        | 7                   |

Greedy cycle - Greedy cycle are those cycle in which avg. latency  $\leq$  no. of bits in initial collision vector.