

# COMPUTER SCIENCE

## Computer Organization and Architecture

### Instruction Pipelining

Lecture\_08



Vijay Agarwal sir



A graphic of a construction barrier made of orange and white striped panels, with two yellow bollards on top, positioned on the left side of the slide.

**TOPICS  
TO BE  
COVERED**

**o1**

## **Pipelining Hazards**

SolutionResource Redirection  
Renaming.HW Interlock &  
OPERAND Forwarding

① Structural Hazards

Resource Conflict

② Data Hazards

Data

③ Control Hazards

Branch Instr  
(TOC Instr)

Program Control Instr

Hw SW

Branch Prediction Delay Branch

# Pipeline Hazards

Occur when the Pipeline, or some portion of the pipeline must stall Because conditions Do not permit Continued execution

There are three Types of hazards:

- ① Resource
- ② Data
- ③ Control



Also referred to as a Pipeline bubble

In general, there are three major difficulties that cause the instruction pipeline to deviate from its normal operation.

1. *Resource conflicts* caused by access to memory by two segments at the same time. Most of these conflicts can be resolved by using separate instruction and data memories.
2. *Data dependency* conflicts arise when an instruction depends on the result of a previous instruction, but this result is not yet available.
3. *Branch difficulties* arise from branch and other instructions that change the value of PC.

## Hazards/Dependencies In the pipeline

- Dependency is a major problem in the pipeline, causes extra cycle.
- Cycle in the pipeline without new input is called as extra cycle. Also named as “Stall”.
- When stall is present in the pipeline then  $CPI \neq 1$ .
- There are 3 kinds of dependencies possible in the pipeline-
  - I. Structural dependency/ Structural Hazards
  - II. Data dependency/ Data Hazards
  - III. Control dependency/ Control Hazards

## Resource Conflict

- Sol<sup>n</sup>
  - Resource Replication  
(Using Additional Resource)
  - Renaming

## Data Dependency

I<sub>1</sub> : ADD r<sub>1</sub> r<sub>2</sub> r<sub>3</sub>;       r<sub>1</sub>  $\leftarrow$  r<sub>2</sub> + r<sub>3</sub>

I<sub>2</sub> : MUL r<sub>4</sub> r<sub>1</sub> r<sub>5</sub>;      r<sub>4</sub>  $\leftarrow$  r<sub>1</sub> \* r<sub>5</sub>

Data Dep. → H/w Interlock  
→ OPERAND Forwarding | By Passing | Short Circuit

ANTI (WAR) Dep.  
OUTPUT (write) Dep } Solution Register Renaming.

Q. The performance of a pipelined processor suffer if

[GATE-2002 : 2 Marks]

→ Control

- A the pipeline stages have different delays
- B consecutive instructions are dependent on each other  
→ Data Dep.
- C the pipeline stages share hardware resources  
→ Structural Def.
- D All of the above

Q.

For a pipelined CPU with a single ALU, consider the following situations

1. The  $j^{th} + 1^{st}$  instruction uses the result of the  $j^{th}$  instruction as an operand → *Data Def.*
2. The execution of a conditional jump instruction → *Control Def.*
3. The  $j^{th}$  and  $j^{th} + 1^{st}$  instructions require the ALU at the same time → *Structural Def.*

Which of the above can cause a hazard?

[GATE-2003 : 1 Marks]

- A 1 and 2 only
- B 2 and 3 only
- C 3 only
- D All the three

Q.

A pipelined processor uses a 4-stage instruction pipeline with the following stages: Instruction fetch (IF), Instruction decode (ID), Execute (EX) and Writeback (WB). The arithmetic operations as well as the load and store operations are carried out in the EX stage. The sequence of instructions corresponding to the statement  $X = (S - R * (P + Q))/T$  is given below. The values of variables P, Q, R, S and T are available in the registers R0, R1, R2, R3 and R4 respectively, before the execution of the instruction sequence.

|       |            |                           |
|-------|------------|---------------------------|
| ADD   | R5, R0, R1 | ; $R5 \leftarrow R0 + R1$ |
| MUL   | R6, R2, R5 | ; $R6 \leftarrow R2 * R5$ |
| SUB   | R5, R3, R6 | ; $R5 \leftarrow R3 - R6$ |
| DIV   | R6, R5, R4 | ; $R6 \leftarrow R5 / R4$ |
| STORE | R6, X      | ; $X \leftarrow R6$       |

The IF, ID and WB stages take 1 clock cycle each. The EX stage takes 1 clock cycle each for the ADD, SUB and STORE operations, and 3 clock cycles each for MUL and DIV operations. Operand forwarding from the EX stage to the ID stage is used. The number of clock cycles required to complete the sequence of instructions is

[GATE-2006: 2 Marks]

- A 10
- ~~B 12~~
- C 14
- D 16

Q. Consider the sequence of machine instructions given below:

|     |            |
|-----|------------|
| MUL | R5, R0, R1 |
| DIV | R6, R2, R3 |
| ADD | R7, R5, R6 |
| SUB | R8, R7, R4 |

In the above sequence, R0 to R8 are general purpose registers. In the instructions shown, the first register stores the result of the operation performed on the second and the third registers. This sequence of instructions is to be executed in a pipelined instruction processor with the following 4 stages: (1) Instruction Fetch and Decode (IF), (2) Operand Fetch (OF), (3) Perform Operation (PO) and (4) Write back the Result (WB). The PO state takes 1 clock cycle for ADD or SUB instruction. 3 clock cycles for MUL instruction and 5 clock cycles for DIV instruction. The pipelined processor uses operand forwarding from the PO stage to the OF stage. The number of clock cycles taken for the execution of the above sequence of instructions is \_\_\_\_\_.

[GATE-2015 (Set-2): 2 Marks]

**NAT**

The instruction pipeline of a RISC processor has the following stages. Instruction Fetch (IF), Instruction Decode (ID), Operand Fetch (OF), Perform Operation (PO) and Writeback (WB). The IF, ID, OF and WB stages take 1 clock cycle each for every instruction. Consider a sequence of 100 instructions. In the PO stage, 40 instructions take 3 clock cycles each, 35 instructions take 2 clock cycles each, and the remaining 25 instructions take 1 clock cycle each. Assume that there are no data hazards and no control hazards.

The number of clock cycles required for completion of execution of the sequence of instructions is \_\_\_\_.

[GATE-2018-CS: 2M]

**MCQ**

Consider an instruction pipeline with five stages without any branch prediction: Fetch Instruction (FI), Decode Instruction (DI), Fetch Operand (FO), Execute Instruction (EI) and Write Operand (WO). The stage delays for FI, DI, FO, EI and WO are 5 ns, 7 ns, 10 ns, 8 ns and 6 ns, respectively. There are intermediate storage buffers after each stage and the delay of each buffer is 1 ns. A program consisting of 12 instruction  $I_1, I_2, I_3, \dots, I_{12}$  is executed in this pipelined processor. Instruction  $I_4$  is the only branch instruction and its branch target is  $I_9$ . If the branch is taken during the execution of this program, the time (in ns) needed to complete the program is

[GATE-2013-CS: 2M]

A 132

B 165

C 176

D 328

## Types of Data Hazard

- Read after write (RAW), or true dependency
  - ❖ An instruction modifies a register or memory location.
  - ❖ Succeeding instruction reads data in memory or register location.
  - ❖ Hazard occurs if the read takes place before write operation is complete.
- Write after read (WAR), or antidependency
  - ❖ An instruction reads a register or memory location.
  - ❖ Succeeding instruction writes to the location.
  - ❖ Hazard occurs if the write operation completes before the read operation takes place.
- Write after write (WAW), or Write/output dependency
  - ❖ Two instruction both write to the same location.
  - ❖ Hazard occurs if the write operations take place in the reverse order of the intended sequence.

## Types of Data Hazard

- CPU always execute a sequence the program from top to bottom in a sequence called as “in-order” execution.
- In this execution sequence if any instruction is dependent then the remaining instruction are also sharing the stall cycles even, they are independent.

Ex:- Code



In- order execution sequence



Stalls

## Types of Data Hazard

- In the above execution sequence ' $I_2$ ' is data dependent on " $I_1$ " So,  $I_2$  will be waiting until the ' $I_1$ ' execution is completed when the hardware does not support operand forwarding.  
This waiting creates stall in the pipeline this stall are also shared by the  $I_3$  and  $I_4$  instruction even they are independent.
- To utilize this point instruction scheduling concept states that insert the independent instruction first.  
It causes out of order execution {Re-order execution}.
- Out of order execution sequence is  $I_1 - I_3 - I_4 - I_2$
- Out of order execution create two more dependencies in the pipeline ..
  - (i) Anti data dependency
  - (ii) output/Write data dependency

# Types of Data Hazard

# Instruction "J"      Instruction "I"

Read – Before – write [True Data Loss]

Write – Before – Read [ANTI Data loss]

Write – Before – write [O/P DATA Loss]

↓  
AFTER

↓  
Delay

↓  
Stall

↓  
Hazard

## Types of Data Hazard

**Anti Dependency** will be occurred when the instruction ‘J’ try to write the data before instruction ‘I’ reads it.

Example - I<sub>3</sub> executed before I<sub>2</sub> so I<sub>3</sub> modified the register (r<sub>4</sub>) before I<sub>2</sub> reads it, therefore I<sub>2</sub> incorrectly read the new value from (r<sub>4</sub>) [Data loss].

**Output dependency** will be occurred when the instruction “J” tried to write the data before instruction ‘I’ writes.

Ex- I<sub>4</sub> executes before I<sub>2</sub> so, I<sub>4</sub> modifies the register (r<sub>3</sub>) before I<sub>2</sub> writes it therefore destination is incorrectly updated with old value [data loss].

## Types of Data Hazard

**Note:** To handle the above dependency problem hardware technique is used i.e., register renaming.

This technique states that, use the re-order buffer to store the out-of-order instruction output later update the register file with a re-order buffer content after the completion of a dependent instruction therefore data is safe.

# Types of Data Hazard

# Register Re-naming:

Execution

$$I_1 \rightarrow r_0$$

$$\begin{array}{l} I_3 \rightarrow \\ I_4 \rightarrow \end{array} \left. \right\} \text{Re-order buffer}$$

Other Temp. Register

$$I_2 \rightarrow r_3 \text{ (Status 1)}$$

Re-order buffer  $\rightarrow$  Reg. file {R<sub>4</sub>, R<sub>3</sub>}

(R<sub>3</sub>, R<sub>4</sub>)

↳ finish then Temp Reg Data Stored in Register file

## Control Hazard

- Also known as a branch hazard
- Occurs when the pipeline makes the wrong decision on a branch prediction.
- Brings instructions into the pipeline that must subsequently be discarded
- Dealing with Branches:
  - ❖ Multiple Streams
  - ❖ Prefetch branch target
  - ❖ Loop buffer
  - ❖ Branch prediction
  - ❖ Delayed branch

# Control Dependency:

↳ Control Dep. is created in the Pipeline Due to

↳ Branch Instn

↳ Program Control

↳ TOC Instruction

↳ Unconditional TOC

↳ Conditional TOC

## Control Dependency/Control Hazards

### Control Dependency:

Control dependency problem, will be occurred in the pipeline, when the Branch Instruction, Function call, program Control/transfer of control instruction are executed.

Consider the program code-

1000 :  $I_1 \downarrow$  Falling /Fall through path

1001 :  $I_2$

1002 :  $I_3$  (JMP 2000)

1003 :  $I_4$

...

2000       $BI_1 \downarrow$  Taken path

2001       $BI_2 \downarrow$

Expected O/P sequence:

$I_1 - I_2 - I_3 - BI_1 - BI_2$

# Control Dependency:

Target Address is available at the end of k Stage

$$\text{Branch Penalty} = k - 1$$



## Topic : Dependencies In he pipeline

- Analysis:

Target address availability with respect to question

1. Stage number given then

Branch penalty = stage number -1

2. Stage name given then

Branch penalty = [Respective name stage position] -1

3. Until Instruction execution is completed /All Instruction proceed with all stage then

Branch penalty = [Last stage number] -1



## Control Dependency/Control Hazards

| PC: 1000          | C1             | C2             | C3             | C4                     | C5             | C6             |
|-------------------|----------------|----------------|----------------|------------------------|----------------|----------------|
| $I_1$<br>PC: 1001 | IF<br>PC: 1001 | ID             | EX             | MA                     | WB             |                |
| $I_2$             |                | IF<br>PC: 1002 | ID             | EX                     | MA             | WB             |
| $I_3$             |                |                | IF<br>PC: 1003 | ID<br>2000<br>PC: 1004 | EX             | MA             |
| $I_4$             |                |                |                | IF<br>2000<br>PC: 1004 | ID             | EX             |
| $BI_1$            |                |                |                |                        | IF<br>PC: 2000 | ID             |
| $BI_2$            |                |                |                |                        |                | IF<br>PC: 2000 |

∴ Actual O/P sequence:  $I_1 - I_2 - I_3 - I_4 - BI_1 - BI_2$



## Control Dependency/Control Hazards

In the above execution sequence, un-wanted instruction is executed in the program so, program functionality is unsafe. This kind of disturbance in pipeline is called control dependency.

To handle the above problem, we need to stop the unwanted instruction fetch. For this purpose “NOP” : Instruction is inserted after the jump instruction. This process is called as freeze or flush & that is described below \_\_\_\_\_.

NOP: opcode: [Delay instruction, no operation]

# Control Dependency.

- ✓ ①. Flushing
- ✓ ②. Stalling

- W<sup>n</sup> {
- (A) · Branch Prediction
  - (B) · Delayed Branch
  - (C) · Nop Instruction
- } Sel<sup>n</sup>

# Control Dependency.

Stalling : Design a pipeline in such a manner, if it Detects the Branch Instruction, then we have to stall (Stop Fetching the Next Sequential Instruction into the pipeline) until Branch Instr Evaluation [Execution] Not Completed.

# Control Dependency.

Sol'n of Control Dependency



# Control Dependency.



# Control Dependency.



## Branch Inst

(i) Unconditional : JMP x

Conditional :  
JNZ [Jump on Not zero]  
JZ [Jump on zero]  
BNE [Branch on Not equal]  
etc.

Branch Prediction: In this we Predict Branch is taken  $\textcircled{a}$  Branch is Not taken & If Prediction is Correct then Accordingly Instruction will execute.

(i) Branch is Always Taken: (In this we assume every time Branch is taken, So in this Next Instruction is Target Address Instruction for execution.

(ii) Branch is Never Taken: In this we Assume every time Branch is Not taken, So in this Next Instruction is the Next Sequential Instruction for execution.



① Branch is always Taken :

(1001 - 2000 - 2001 - 2002)



If guess [Prediction]  
Wrong then Suffer from  
Stalls.

↓  
If it is Normal Instruction [Non Branch Inst]  
then Prediction fail &  
Suffer from Stalls.

② Branch is Never Taken:

1000, 1001, 1002, 1003, 1004.

In this type Branch Inst<sup>n</sup> Not Coming. Branch is Not taken  
Cues: (Non Branch Inst<sup>n</sup> Coming)  
So Next Sequential Instruction inserted into Pipeline.

~~wrong~~ If we getting Branch Instruction (Prediction Wrong)  
then we have to Suffer from Stalls

These Two are Static Branch Prediction Method.

## Dynamic Prediction

(i) bit Implementation.



Branch Loop Buffer : It is a High Speed Buffer, Maintained in the Instruction Fetch Stage of Pipeline.

It's Just Kind  
a Database

| Branch Instn Address | Target Address |
|----------------------|----------------|
| <u>1001</u>          | 2000           |

4

When the Next time this type of Instn Come then if Match with the Address Directly getting the Target Address.



Delayed Branch: It is a Compile technique [Slow Path] in which trying to Rearrange the Instruction if possible.

If Not Possible to Rearrange then Insert 'NOP Instruction' After the Branch Instruction.

NOP Instruction: all the Instruction they Fetch & execute but Not update Any Register or Not effect our Desired output.



Rearrange  
the Instn

Delay Slot

→ After the Branch Instruction

Independent Instruction (which effect  
Prog Execution  
Output.)

RISC: Target Address  
2<sup>nd</sup> Stalls

$$\text{Delay Slot} = 2 - 1 = 1 \text{ stall}$$

Delay Slot

'K' Stage Target Address

Delay slot :  $k-1$  Delay slot.



## Control Dependency/Control Hazards

Code: [with flush operation]:

|       |   |                   |
|-------|---|-------------------|
| 1000  | : | $I_1$             |
| 1001  | : | $I_2$             |
| 1002  | : | $I_3$ (JMP .2000) |
| 1003  | : | <u>NOP</u>        |
| 1004  | : | $I_4$             |
| ..... |   | .....             |
| 2000  | : | B $I_1$           |
| 2001  | : | B $I_2$           |



## Topic : Dependencies In the pipeline



| PC: 1000 | C1             | C2             | C3             | C4                                                     | C5             | C6             |
|----------|----------------|----------------|----------------|--------------------------------------------------------|----------------|----------------|
| $I_1$    | IF<br>PC: 1001 | ID             | EX             | MA                                                     | WB             |                |
| $I_2$    |                | IF<br>PC: 1002 | ID             | EX                                                     | MA             | WB             |
| NOP      |                |                | IF<br>PC: 1003 | Unconditional TOC<br>ID<br>2000<br>PC: <del>1004</del> | EX             | MA             |
| $I_4$    |                |                |                | IF<br>2000<br>PC: <del>1004</del>                      | ID             | EX             |
| $BI_1$   |                |                |                |                                                        | IF<br>PC: 2001 | ID             |
| $BI_2$   |                |                |                |                                                        |                | IF<br>PC: 2002 |

∴ Actual O/P sequence:  $I_1 - I_2 - I_3 - \underline{\text{NOP}} - BI_1 - BI_2$



## Topic : Dependencies In he pipeline



Flush operation creates stall in the pipeline. Number of stalls created in the pipeline due to a branch instruction is called as “branch penalty”.

It is depending on the availability of a “Target Address” in the pipeline i.e.

Branch penalty = [At what stage target address available in the pipeline] -1

**NOTE :** RISC pipeline branch penalty is always one because target address is available in the second stage of a pipeline.



## Topic : Dependencies In he pipeline

Number of stall created from the branches during the program execution is calculated as :

$$\# \text{ Stall} = [\text{Branch frequency}] * [\text{Branch penalty}]$$

number stalls from Branches = Number of Branch  
\* instruction in the prog.  
Number of Stalls/  
Branch



## Topic : Dependencies In he pipeline



- **Analysis:**

Target address availability with respect to question

1. Stage number given then

Branch penalty = stage number -1

2. Stage number given then

Branch penalty = [Respective name stage position] -1

3. Until Instruction execution is completed /All Instruction proceed with all stage then

Branch penalty = [Last stage number] -1



## Topic : Dependencies In he pipeline



**NOTE:** To minimize the control dependency stalls in the pipeline, hardware technique is used i.e., branch predication buffer [Branch target buffer or loop buffer]

It is a high-speed buffer present in the IF used to hold the predicted target address. address is present in first address then stall is not created.

When the target address is present in first address then stall is not created. [Not practically present].



## Topic : Dependencies In he pipeline

**NOTE:** When the question states that pipeline with branch prediction ,then assume that target address is available in the 1<sup>st</sup> stage so, Branch penalty '0'.

Other wise [pipeline without branch prediction] assume that target address is not present in the 1<sup>st</sup> stage. So, penalty depends on the target address availability in the pipeline.



## Topic : Dependencies In he pipeline



**Note:** To minimize the control dependency stalls software technique is used i.e., Delayed Branch.

- **Delayed Branch:** It is a compiler technique so, Compiler rearranges the code, if possible, to arrange or substitute the non instruction after the branch instruction if not possible to rearrange to preserve the execution path.



## Topic : Dependencies In he pipeline

Code:

1000 :  $I_1$

1001 :  $I_2$

1002 :  $I_3$  (JMP, 2000)

1003 :  $I_4$

.....

.....

2000  $BI_1$

2001  $BI_2$

.....

.....

Expected o/p seq.

$I_1 - I_2 - I_3 - BI_1 - BI_2$

Actual o/p Seq.

$I_1 - I_2 - I_3 - I_4 - BI_1 - BI_2$



## Topic : Dependencies In he pipeline



#      'Compiler'  
      'Re-arrangement'

Code:

System Generated Address :     $I_1$   
                                         $I_3$  (JMP 2000)  
                                         $I_2$   
                                         $I_4$   
                                       ....  
                                        $Bl_1$   
                                        $Bl_2$



## Topic : Dependencies In the pipeline

| PC: I <sub>1</sub> | C1 | C2                       | C3                             | C4                         | C5                         |
|--------------------|----|--------------------------|--------------------------------|----------------------------|----------------------------|
| I <sub>1</sub>     | IF | ID                       | EX                             | MA                         | WB                         |
| PC: I <sub>3</sub> |    |                          |                                |                            |                            |
| I <sub>3</sub>     |    | IF<br>PC: I <sub>2</sub> | ID<br>Unconditiona<br>l<br>TOC | EA                         | MA                         |
| I <sub>2</sub>     |    |                          | IF<br>PC : BI <sub>1</sub>     | ID                         | EX                         |
| BI <sub>1</sub>    |    |                          |                                | IF<br>PC : BI <sub>2</sub> | ID                         |
| BI <sub>2</sub>    |    |                          |                                |                            | IF<br>PC : BI <sub>3</sub> |

Actual o/p : I<sub>1</sub> - I<sub>3</sub> - I<sub>2</sub> - BI<sub>2</sub> - BI<sub>1</sub>



## Topic : Dependencies In he pipeline

# ‘Compiler’  
‘Non- substitution’

Code:

System Generated address:  $I_2$   
 $I_3$  (JMP  $BI_1$ )  
~~NOP~~  
 $I_4$   
.....  
 $BI_1$   
 $BI_2$



## Topic : Dependencies In the pipeline

| PC: I <sub>1</sub> | C1                       | C2            | C3                                 | C4                        | C5                        | C6 |
|--------------------|--------------------------|---------------|------------------------------------|---------------------------|---------------------------|----|
| I <sub>1</sub>     | IF                       | ID            | EX                                 | MA                        | WB                        |    |
| PC: I <sub>2</sub> |                          |               |                                    |                           |                           |    |
| I <sub>3</sub>     | IF<br>PC: I <sub>3</sub> | ID            | EX                                 | MA                        | WB                        |    |
| I <sub>3</sub>     |                          | IF<br>PC: NOP | ID<br>Unconditional<br>NOP.<br>TOC | EX                        | MA                        |    |
| NOP                |                          |               | IF<br>PC: BI <sub>1</sub>          | ID                        | EX                        |    |
| BI <sub>1</sub>    |                          |               |                                    | IF<br>PC: BI <sub>2</sub> | ID                        |    |
| BI <sub>2</sub>    |                          |               |                                    |                           | IF<br>PC: BI <sub>2</sub> |    |

∴ Actual O/P sequence: I<sub>1</sub> - I<sub>2</sub> - I<sub>3</sub> - NOP Stall - BI<sub>1</sub> - BI<sub>2</sub>

# Control Dependency

Generally  $CPI = 1$

$$\text{Avg Instruction ET} = CPI \times \text{Cycle time}$$

But Due to Control Hazard  $CPI \neq 1$

$$CPI = 1 + \frac{\# \text{Stalls}}{\text{Instn}}$$

*Per*

$$\frac{\# \text{Stall}}{\text{Per Instn}} = \frac{\text{Branch Instn}}{\text{Frequency}} \times \text{Branch Penalty}$$

$$\text{Avg Instn ET} = \left( 1 + \frac{\# \text{Stalls}}{\text{Instn}} \right) \times \text{Cycle time}$$

$$\text{Branch Penalty} = \left( \frac{\text{Target Address Stage}}{L} \right) - 1$$

## In Pipeline

$$\text{Avg Bus MET} = \frac{(1 + \# \text{ stalls}/\text{inst}) \times \text{cycle time}}{\text{ETPIPE}}$$

## Speed Up factor with Stalls

Performance analysis with stalls :

$$S = \frac{\text{Avg Instruction ET}_{\text{nonpipe}}}{\text{Avg Instruction ET}_{\text{pipe}}}$$

$$S = \frac{ET_{NP}}{ET_P}$$

$$S = \frac{CPI_{\text{nonpipe}} \times \text{cycle time}_{\text{non pipe}}}{CPI_{\text{pipe}} \times \text{cycle time}_{\text{pipe}}}$$

## Speed Up factor with Stalls

Ideal CPI of pipeline is always '1' due to a dependency problem some stalls are created in the pipeline.

$$S = \frac{CPI_{\text{nonpipe}} \times \text{cycle time}_{\text{nonpipe}}}{(1 + \text{number of stalls / Instruction})(\text{cycle time}_{\text{pipe}})}$$

When the pipeline stages are perfectly balanced then 1 task execution time in the non-pipeline is also equal to number of stages in the pipeline i.e.....

$$\begin{aligned} t_n &= 2 \times 2 \times 2 \\ &= 8 \quad t_n = k \cdot t_p \\ \text{under this condition } t_n &= 8 \times 2 \times 2 \\ t_n &= 8 \text{ ns} \end{aligned}$$

$$S = \frac{k \cdot t_p}{(1 + \text{number of stalls / Instruction})(\text{cycle time}_{\text{pipe}})}$$

~~# Stage~~

$$S = \frac{(\text{Pipeline Depth})^k}{(1 + \text{number of stalls / Instruction})}$$

A CPU has five-stages pipeline and runs at 1 GHz frequency. Instruction fetch happens in the first stage of the pipeline. A conditional branch instruction computes the target address and evaluates the condition in the 'third stage' of the pipeline. The processor stops fetching new instructions following a conditional branch until the branch outcome is known. A program executes  $10^9$  instructions out of which 20% are conditional branches. If each instruction takes one cycle to complete on average, then total execution time of the program is

[GATE-2006 : 2 Marks]

- A 1.0 second
- B 1.2 second
- C 1.4 second
- D 1.6 second

$$\text{Frequency} = 1 \text{ GHz} \Rightarrow \text{Cycletime} = \frac{1}{1 \text{ GHz}} \text{ sec} = 1 \text{ nsec}$$

5 Stage Pipeline

$$\text{Total} = 10^9 \text{ Instn}.$$

Target Address = 3rd Stage

$$\begin{aligned}\text{Branch Penalty} &= 3 - 1 \\ &= 2 \text{ Stalls.}\end{aligned}$$

$$\begin{aligned}\text{Avg Instn ET} &= (1 + \# \text{Stall/Instn}) \times \text{Cycle time} \\ &\Rightarrow (1 + 0.4) \times 1 \text{ nsec}\end{aligned}$$

$$\text{Avg Instn ET} = 1.4 \text{ nsec}$$

Best Approach



2 STALL'S

$$\# \text{Stalls/Instn} = 0.80 \times 0 + 0.20 \times 2$$

$$\# \text{Stall/Instn} = 0.4.$$

Program Contain =  $10^9$  Instn

$$\text{Prog. ET} = 1.4 \times 10^{-9} \times 10^9$$
$$= 1.4 \text{ sec } \underline{\text{Avg}}$$

Another  
Approach.

5 Stage Pipeline.

3<sup>rd</sup> STAGE



$$\text{Avg CPI} = 0.80 \times L + 0.20 \times 3$$

$$= 0.8 + 0.6$$

$$\boxed{\text{Avg CPI} = 1.4 \text{ cycle}}$$

$$\text{Prog ET} = 1.4 \times 10^{-9} \times 10^9$$

$$= 1.4 \text{ Sec}$$

$$\begin{aligned}\text{Avg Instn ET} &= 1.4 \times 10^{-9} \\ &= 1.4 \text{ nsec}\end{aligned}$$

**Q**

Consider a 5-stage instruction pipeline, where all stages are perfectly balanced. & have a cycle time of 20ns. Target address of branch Instruction is available at the end of the execution(last Stage), there are 30% instruction.

(i) What is Average Instruction Execution time(ignore the fact some are conditional)? 44 nsec

(ii) What is the speed up factor? 2.27 nsec

(iii) Among the branch instruction 30% are conditional & 70% of them does not satisfy the condition then What is Average Instruction Execution time ?

① # Stage = 5

Cycle time = 20usec

Target Address is Available  
at the end of LAST Stage (5<sup>th</sup> Stage)

Branch  
Penalty = 5 - 1 = 4

① Avg Instn ET<sub>PIPE</sub> =  $(1 + \# \text{Stalls/Instn}) \times \text{Cycle time}$

$$\Rightarrow (1 + 1.2) \times 20$$
$$= 2.2 \times 20$$

①  $\boxed{\text{ET}_{\text{PIPE}} = 44 \text{ usec}}$  Avg

30.1 Branch Instn.



$$\# \text{Stalls/Instn} = \frac{70 \times 0 + 30 \times 4}{2}$$

$$\# \text{Stalls/Instn} = 1.2$$

Q (ii)

$$\text{Speed up Factor} = \frac{ET_{NP}}{ET_{PIPE}}$$

$$\Rightarrow \frac{5 \times 20 \text{ sec}}{(1 + \# Stalls / T_wt^n) \times \text{Cycle}}$$

$$\Rightarrow \frac{5 \times 20}{(1 + 1.2) \times 20}$$

$$= \frac{5}{2.2}$$

(ii)  $S = 2.27 \text{ Ans}$

↓ Perfectly balanced

$$\Rightarrow \frac{5}{(1 + \# Stalls / T_wt^n)}$$

$$= \frac{5}{1 + 1.2} = 2.27$$

# Stage = 5

③ Cycle time = 20ns

Target Address is Available  
at the end of LAST Stage (5<sup>th</sup> Stage)

Branch  
Penalty =  $5 - 1 = 4$

30.I Branch  
Instn.



Now Complete for 3<sup>rd</sup> Question  
Diagram

(iii)

$$\begin{aligned}\# \text{Stalls/Instr} &= \\ &\cdot 70 \times 0 + \\ &\cdot 30 \times \cdot 70 \times 4 + \\ &\cdot 30 \times \cdot 30 \times \cdot 30 \times 4 + \\ &\cdot 30 \times \cdot 30 \times \cdot 70 \times 0 = 0.948.\end{aligned}$$



$$\text{Avg } \text{Dwell} \text{ ET} = (1 + \# \text{Stalls} | \text{Dwell}) \times \text{cycle time}$$

$$\Rightarrow (1 + 0.948) \times 20$$

$$= 1.948 \times 20 \text{ ns}$$

$$\boxed{\text{ET} = 38.96 \text{ nsec}} \quad \text{Ans}$$

Q

Consider a 6-stage instruction pipeline, where all stages are perfectly balanced. Assume that there is no cycle-time overhead of pipelining. When an application is executing on this 6-stage pipeline, the speedup achieved with respect to non-pipelined execution if 25% of the instructions incur 2 pipeline stall cycles is \_\_\_\_\_.

P  
W

[GATE-2014(Set-1) : 2 Marks]



Assume time =  $t_C$

$$t_n = \underline{6t_C}$$

$$S = \frac{ET_{NONPIPE}}{ET_{PIPE}}$$

$$t_n = K * t_P$$

$$\Rightarrow \frac{6 * t_C}{(1 + \#Stalls/Inst^n) * \text{Cycle time}}$$

$$\Rightarrow \frac{6 * t_C}{(1 + 0.5) * t_C}$$

$$= \frac{6}{1.5} = \textcircled{4}$$

*S=4*



$$\#Stalls/Inst^n = .75 \times 0 + .25 \times 2$$

$$\#Stalls/Inst^n = 0.5$$

**Q. 6**

**HW**

P  
W

An instruction pipeline has five stages, namely, instruction fetch (IF), instruction decode and register fetch (ID/RF), instruction execution (EX), memory access (MEM), and register writeback (WB) with stage latencies 1ns, 2.2ns, 2ns, 1ns. and 0.75ns, respectively (ns stands for nanoseconds). To gain in terms of frequency, the designers have decided to split the ID/RF stage into three stages (ID, RF1, RF2) each of latency 2.2/3 ns. Also, the EX stage is split into two stages (EX1, EX2) each of latency 1ns. The new design has a total of eight pipeline stages. A program has 20% branch instructions which execute in the EX stage and produce the next instruction pointer at the end of the EX stage in the old design and at the end of the EX2 stage in the new design. The IF stage stalls after fetching a branch instruction until the next instruction pointer is computed. All instructions other than the branch instructions have an average CPI of one in both the designs. The execution times of this program on the old and the new design are P and Q nanoseconds, respectively. The value of P/Q is \_\_\_\_.

**[GATE-2014 (Set-3): 2 Marks]**

Consider a non-pipelined processor operating at 2.5 GHz. It takes 5 clock cycles to complete an instruction. You are going to make a 5-stage pipeline out of this processor. Overheads associated with pipelining force you to operate the pipelined processor at 2 GHz. In a given program, assume that 30% are memory instructions, 60% are ALU instructions and the rest are branch instructions. 5% of the memory instructions cause stalls of 50 clock cycles each due to cache misses and 50% of the branch instructions cause stalls of 2 cycles each. Assume that there are no stalls associated with the execution of ALU instructions. For this program, the speedup achieved by the pipelined processor over the non-pipelined processor (round off to 2 decimal places) is 2.16 Avg

[GATE-2020-CS: 2M]

Pipeline    Cycles time =  $\frac{1}{2 \text{ GHz}} = 0.5 \text{ nsec}$

$$\# \text{Stalls/Inst^n} =$$

$$30 \times 0.05 \times 50$$

$$+ 60 \times 0$$

$$+ 10 \times 50 \times 2$$

$\# \text{Stall/Inst^n} = 0.85$

$$\text{Avg Inst^NET} = (1 + \# \text{Stall/Inst^n}) \times \text{Cycle time}$$

$$\Rightarrow (1 + 0.85) \times 0.5 \\ = 1.85 \times 0.5$$

$E_{PIPE} = 0.925 \text{ nsec}$



Non pipeline

5 clock cycle

2.5 GHz

$$\text{Cycletime} = \frac{1}{2.5 \text{ GHz}} = 0.4 \text{ nsec}$$

$$ET_{\text{Nonpipe}} = 5 \times 0.4 \text{ nsec}$$

$$ET_{NP} = 2 \text{ nsec}$$

$$S = \frac{ET_{NP}}{ET_{PIPE}} = \frac{2}{0.925}$$

$$S = 2.16$$

Ans

✓ Register renaming is done in pipelined processors

[GATE-2012-CS: 1M]

- A As an alternative to register allocation at compile time
- B For efficient access to function parameters and local variables
- C To handle certain kinds of hazards
- D As part of address translation

Delayed branching can help in the handling of control hazards.

The following code is to run on a pipelined processor with one branch delay slot:

I1: ADD R2  $\leftarrow$  R7 + R8  
I2: SUB R4  $\leftarrow$  R5 - R6x  
I3: ADD R1  $\leftarrow$  R2 + R3  
I4: STORE Memory [R4]  $\leftarrow$  R1

BRANCH to Label if /R1 == 0

Which of the instruction I1, I2, I3 or I4 can legitimately occupy the delay slot without any other program modification?

A I1

Ans (D)

B I2

[GATE-2008-CS: 2M]

C I3

D I4

@  $\textcircled{I}_1 : R_2 \leftarrow R_7 + R_8$

the OLP of  $\textcircled{I}_1$  ( $R_2$  value)

is used in the  $\textcircled{I}_3$  as a

Input

so  $\textcircled{I}_1 \times$   
 $\textcircled{I}_1$  can not insert in Delay slot.

b)  $\textcircled{P}_2 : R_4 \leftarrow R_5 - R_6$ .

this  $R_4$  is using in  $M[R_4]$  string

so  $\textcircled{I}_2$  can not insert the result  
in Delay slot.

c)  $\textcircled{I}_3 : R_1 \leftarrow R_2 + R_3$ .

This  $R_1$  value using in the  
 $\textcircled{P}_4$  & condition checking  
as well.

so  $\textcircled{I}_3$  can not insert in Delay  
slot.

d)  $\textcircled{I}_4$  ✓

can Insert  
in Delay slot Bcz  
No changes in Program.

**THANK  
YOU!**

