

# COMPUTER SCIENCE

## Computer Organization and Architecture

### Instruction Pipelining

Lecture\_04



Vijay Agarwal sir





**TOPICS  
TO BE  
COVERED**



o1

## Pipelining Hazards

## PIPELINE CONCEPT

ET in PIPELINE & NON PIPELINE

### SPEED UP FACTOR

Throughput & efficiency.

$$\boxed{CPI = L}$$

Concept

↳ Cycle Per Instruction

How to set this CPI in Uniform & NonUniform Delay.

Practice Question

IS-GATE QUESTION

NK EXAM Question.

# Pipelining Strategy

Similar to the use of  
An assembly line in a  
Manufacturing plant

To apply this concept  
To instruction  
Execution we must  
Recognize that an  
Instruction has a  
Number of stages



New inputs are  
Accepted at one end  
Before previously  
Accepted inputs  
Appear as output at  
The other end

- Pipelining is a mechanism which is used to improve the performance of the system in which task (Instruction) are executed in overlapping manner.
- Pipelining is a decomposition technique that means the problem is divided into sub problem & Assign the sub problem to the pipes then operate the pipe under the same clock.

Let us consider 4 segment pipeline used to execute 4 instructions the execution sequence as:

Segment/stages =  $[S_1 \ S_2 \ S_3 \ S_4]$

Instruction:  $[I_1 \ I_2 \ I_3 \ I_4]$



$n = 4, t_n = 4$ , Non pipeline

**Non-PIPELINE**



PIPELINE

 $k = 4$  $n = 4$ 

PIPELINE

# PIPELINE Design



# Pipeline Hazard's



## In The RISC Pipeline 5 Stages:

1. Instruction Fetch {IF Stage} IF
2. Instruction Decode{ID Stage} ID
3. Execute {EX Stage} EX
4. Memory Access {MA Stage} MA
5. Write Back {WB Stage} WB.

**Note** If 'N Stage' entire CPU is divided  
into 'N functional Unit'.

## RISC

- ① IF
- ② ID : (Decode . Operand Loaded from  
Registers )
- ③ EX : Data Processing
- MA : Memory Access
- WB : Write Back .

## 1. Instruction Fetch {IF Stage}:

In this stage Instruction is fetched from Memory.

(MEM to  
CPU(IR))

## 2. Instruction Decode{ID Stage} :

In this Stage 2 operation are performed:

(i) Decode the instruction

(ii) Operand loading(fetching) from the register file.

This stage also contain comparator circuit to evaluate the branch condition.

Directly in the Register File,

Q)

How the operand is available in the beginning of the Instruction ?



$I_1: LOAD \gamma_1 A$

$I_2: LOAD \gamma_2 B$

$I_3: ADD \gamma_3 \gamma_1 \gamma_2$

$I_4: STORE C, \gamma_3$   
 $M[C] \leftarrow \gamma_3.$

## In ID Phase

(Sel)

In the ID Stage we Fetch [Load] Operand from  
the 'Register file'. [Not from Memory], Because Previous

③

$I_3$ : ADD  $R_3, R_1 \underline{R_2}$

Instruction already stored  
that Operand in the  
Register file.

Here from the Register file  $R_1, R_2$  Register

Content load into the input Registers I<sub>1</sub> & I<sub>2</sub>  
for the ALU operation.

$I_1 \& I_2$   
R<sub>1</sub> & R<sub>2</sub> store  
'A' & 'B'

3. Execute {EX Stage}: In this stage Data Processing (ALU Operations) are performed
4. Memory Access {MA Stage} : In this Stage Operand (Data) will be accessed from memory(load or store).
5. Write Back {WB Stage} : In this stage Register write (Operand storing into reg. file) operation performed.

In RISC    5 Stage

IF : [Mem to CPU]

ID : Decode

EX : ALU.

MA : Mem.

WB : WB.

But

In the GATE Question

Some times 4 stage,

Consider! • 5 stage• 6 stage  
are given

① IF

② ID

③ OF (operand Fetch)

④ EX (PO (Perform operation))

⑤ MA

⑥ WB.

①

Sometimes

IF & ID into L Stage .

②

Sometimes

ID & OF into L Stage .

## Additional Stages

### ① Fetch Instruction (FI)

- ❖ Read the next expected Instruction into a buffer.

### ② Decode Instruction (DI)

- ❖ Determine the opcode and the operand specifiers.

### ③ Calculate operands(CO)

- ❖ Calculate the effective address of each source operand.

- ❖ This may involve displacement, register indirect or other forms of address calculations.

### ④ Fetch Operands(FO)

- ❖ Fetch each operand from memory.
- ❖ Operands in register need not be fetched.

### ⑤ Execute Instruction(EI)

- ❖ Perform the indicated operation and store the result, if any, in the specified destination operand location

### ⑥ Write Operand(WO)

- ❖ Store the result in memory



Timing Diagram for Instruction pipeline operation

|       |    |    |    |    |    |  |
|-------|----|----|----|----|----|--|
| $I_1$ | IF | ID | EX | MA | WB |  |
| $I_2$ | IF | ID | EX | MA | WB |  |
|       | IF | ID | EX | MA | WB |  |
|       |    |    |    |    |    |  |

Extra cycle  
(Stalls, Bubbles)

$$I_1 : \underline{\gamma_0} \in \gamma_1 + \gamma_2$$

$$(I_2) : \underline{\gamma_4} \in \underline{\gamma_0} * \gamma_3$$

$I_2$  Depend on  $I_1$   
ie  $I_2$  use the Result of  $I_1$   
as a operand.

So  $I_2$  Must wait until  
Completion of  $I_1$  'Stall'  
This waiting create 'extra cycle'

## Pipeline Hazards

Occur when the  
Pipeline, or some  
portion of the  
pipeline must stall  
Because conditions  
Do not permit  
Continued execution

## Dependency in the Pipeline



There are three  
Types of hazards:

- ① Resource
- ② Data
- ③ Control



Also referred to as a  
Pipeline bubble

In general, there are three major difficulties that cause the instruction pipeline to deviate from its normal operation.

1. *Resource conflicts* caused by access to memory by two segments at the same time. Most of these conflicts can be resolved by using separate instruction and data memories.
2. *Data dependency* conflicts arise when an instruction depends on the result of a previous instruction, but this result is not yet available.
3. *Branch difficulties* arise from branch and other instructions that change the value of PC.



## Hazards/Dependencies In the pipeline

- Dependency is a major problem in the pipeline, causes extra cycle.
- Cycle in the pipeline without new input is called as extra cycle. Also named as "Stall".
- When stall is present in the pipeline then CPI  $\neq 1$ .
- There are 3 kinds of dependencies possible in the pipeline-
  - I. Structural dependency/ Structural Hazards
  - II. Data dependency/ Data Hazards
  - III. Control dependency/ Control Hazards

- ① Structural Dep. [Resource Conflict]
- ② Data Dep. [Due to operand]
- ③ Control Dep. [Branch | TOC Inst']

# Structural Dependency

Structural Dependency is created when 2 or More Phase require the same Resource, so that they are Not Able to Run (Process/Execute) simultaneously.

- Structural Dependency is created in the Pipeline Due to Resource Conflict.
- Resource May be Register, ALU or functional Unit etc.

# Structural Dependency

CC: Clock Cycle.

P  
W

|                | CC1 | CC2 | CC3 | CC4 | CC5 |
|----------------|-----|-----|-----|-----|-----|
| I <sub>1</sub> | MEM | ID  | ALU | MEM | WB  |
| I <sub>2</sub> |     | MEM | ID  | ALU | MEM |
| I <sub>3</sub> |     |     | MEM | ID  | ALU |
| I <sub>4</sub> |     |     |     | MEM |     |

Diagram (I)  
Resource Conflict

- In CC4 Both I<sub>1</sub> & I<sub>4</sub> Accessing the Same Resource [Memory] at the Same time.
- This situation is called Resource Conflict & No One (I<sub>1</sub> & I<sub>4</sub>) go for Execution.

• Here Both  $I_1$  &  $I_4$  Can not go for Execution .

So keep  $I_4$  into the Waiting Untill the Resource become Available.

This Waiting Creates an Extra Cycle [Stalls] in the Pipeline .

# Structural Dependency

CC: clock cycle.



|    | CC1 | CC2 | CC3 | CC4 | CC5 | CC6 | CC7 |
|----|-----|-----|-----|-----|-----|-----|-----|
| I1 | MEM | ID  | ALU | MEM | WB  |     |     |
| I2 |     | MEM | ID  | ALU | MEM | WB  |     |
| I3 |     |     | MEM | ID  | ALU | MEM | WB  |
| I4 |     |     |     | MEM | MEM | MEM | MEM |

3 Extra cycle  
[3 STALL's]

In this  $I_1$  is Accessing Memory for Operand  
(Data) Store

$I_4$  is Accessing Memory to Fetch the Instruction



To minimize the stalls due to structural dependency  
one hardware mechanism is used called as Renaming.

Renaming mechanism State Divide the memory into  
Independent Module to store the Instruction & Data separately

Called [ I-Cache | CODE  
Memory ] & [ D-Cache | Data Memory ].  
[ CM ] [ DM ]

# Structural Dependency

CC: clock cycle. W



## Data Dependency

(Ii)

$I_i I_1 : \text{ADD } r_1 \ r_2 \ r_3 ;$

$$r_1 \leftarrow r_2 + r_3$$

$I_j I_2 : \text{MUL } r_4 \ r_1 \ r_5 ;$

$$r_4 \leftarrow r_1 * r_5$$

(Ii+1)

$I_2$  is Data Dependant On  $I_1$

Data Dependency is created in the Pipeline when

Next Instruction  $[I_2 | I_{i+1} | I_j]$  use the Result of  
Previous Instruction  $[I_1 | I_i | I_i]$  as operand / As a Input.

# Data Dependency

$I_1: \text{ADD } \gamma_1 \gamma_2 \gamma_3; \quad \gamma_1 \leftarrow \gamma_2 + \gamma_3$

$I_2: \text{MUL } \gamma_4 \gamma_1 \gamma_5; \quad \gamma_4 \leftarrow \underline{\gamma_1 * \gamma_5}$

|        | cc1 | cc2 | cc3 | cc4 | cc5 | cc6 | cc7 | cc8 | cc9 |
|--------|-----|-----|-----|-----|-----|-----|-----|-----|-----|
| $I_1:$ | IF  | ID  | OF  | EX  | MA  | WB  |     |     |     |
| $I_2:$ | IF  | ID  |     |     |     | OF  | Ex  | MA  | WR. |
| .      | .   | .   |     |     |     |     |     |     |     |

STALL'S

## Data Dependency

- In this  $I_2$  is Data Dependant on  $I_1$
- So  $I_2$  will wait until the output of  $I_1$  ( $r_1$  value) is available.
- This waiting creates an extra cycle [Stalls] in the Pipeline.

## Solution of DATA Dep.

To Minimize the Stall in the Pipeline

Due to Data Dependency H/w M/c ism

is used Called ① OPERAND Forwarding /  
② Hardware Interlock

OPERAND FORWARDING: Access the Value from Register  
before Update into Register file.

# Data Dependency

## OPERAND Forwarding

$I_1: \text{ADD } \tau_1 \ \tau_2 \ \tau_3;$        $\tau_1 \leftarrow \tau_2 + \tau_3$

$I_2 \quad \text{MUL } \tau_4 \ \tau_1 \ \tau_5;$        $\tau_4 \leftarrow \tau_1 * \tau_5$



# Solution Of Data Dependency

## Hardware Interlocks

The most straightforward method is to insert hardware interlocks. An interlock is a circuit that detects instructions whose source operands are destinations of instructions further up in the pipeline.

## Operand Forwarding

Another technique called operand forwarding uses special hardware to detect a conflict and then avoid it by routing the data through special paths between pipeline segments. For example, instead of transferring an ALU result into a destination register, the hardware checks the destination operand, and if it is needed as a source in the next instruction, it passes the result directly into the ALU input, bypassing the register file.

# Data Dependency

## Data Hazard

(Read After Write) RAW (Data Dep.)

Write After Read WAR (Anti Dep.)

(Write After Write) WAW (Output Dep.)

Q.

Consider a pipelined processor with the following four stages

IF: Instruction Fetch

ID : Instruction Decode and Operand Fetch

EX : Execute

WB : Write Back

The IF, ID and WB stages take one clock cycle each to complete the operation. The number of clock cycles for the EX stage depends on the instruction. The ADD and SUB instructions need 1 clock cycle and the MUL instruction need 3 clock cycles in the EX stage. Operand forwarding is used in the pipelined processor. What is the number of clock cycles taken to complete the following sequence of instructions?

P  
W

Without operand forwarding.

IF  
ID  
EX  
WB

EX

|       |     |     |     |    |                            |
|-------|-----|-----|-----|----|----------------------------|
| $I_1$ | ADD | R2  | R1  | R0 | $R_2 \leftarrow R_1 + R_0$ |
| $I_2$ | MUL | R4, | R3, | R2 | $R_4 \leftarrow R_3 * R_2$ |
| $I_3$ | SUB | R6, | R5, | R4 | $R_6 \leftarrow R_5 - R_4$ |

A 7  
B 8  
C 10  
D 14



P  
W

With operand forwarding.

IF  
ID  
EX  
WB

EX

I<sub>1</sub>

I<sub>2</sub>

I<sub>3</sub>

ADD

MUL

SUB

R2

R3,

R5,

R0

R2

R4

R2  $\leftarrow$  R1 + R0

R4  $\leftarrow$  R3 \* R2

R6  $\leftarrow$  R5 - R4

With operand Forwarding

[GATE-2007 : 2 Marks]

A 7

B 8

C 10

D 14

⑧ Ans



DF

DD

EX

WB

EX

|   |     |     |     |    |                         |
|---|-----|-----|-----|----|-------------------------|
| ⊥ | ADD | R2  | R1  | R0 | $R2 \leftarrow R1 + R0$ |
| 3 | MUL | R4, | R3, | R2 | $R4 \leftarrow R3 * R2$ |
| ⊥ | SUB | R6, | R5, | R4 | $R6 \leftarrow R5 - R4$ |

[GATE-2007 : 2 Marks]

A 7

B 8

C 10

D 14

Q.8

The performance of a pipelined processor suffer if

[GATE-2002 : 2 Marks]

- A the pipeline stages have different delays
- B consecutive instructions are dependent on each other
- C the pipeline stages share hardware resources
- D All of the above

**MCQ**

A 5-stage pipelined processor has Instruction Fetch (IF), Instruction Decode (ID), Operand Fetch (OF), Perform Operation (PO) and Write Operand (WO) stage. The IF, ID, OF and WO stage take 1 clock cycle each for any instruction. The PO stage takes 1 clock cycle for ADD and SUB instructions ,3 clock cycles for MUL instruction, and 6 clock cycles for DIV instruction respectively. Operand forwarding is used in the pipeline. What is the number of clock cycles needed to execute the following sequence of instructions?

**Instruction**

- I<sub>0</sub>: MUL R<sub>2</sub>, R<sub>0</sub>, R<sub>1</sub>
- I<sub>1</sub>: DIV R<sub>5</sub>, R<sub>3</sub>, R<sub>4</sub>
- I<sub>2</sub>: ADD R<sub>2</sub>, R<sub>5</sub>, R<sub>2</sub>
- I<sub>3</sub>: SUB R<sub>5</sub>, R<sub>2</sub>, R<sub>6</sub>

**Meaning of Instruction**

- R<sub>2</sub>  $\leftarrow$  R<sub>0</sub>\*R<sub>1</sub>
- R<sub>5</sub>  $\leftarrow$  R<sub>3</sub>/R<sub>4</sub>
- R<sub>2</sub>  $\leftarrow$  R<sub>5</sub> + R<sub>2</sub>
- R<sub>5</sub>  $\leftarrow$  R<sub>2</sub> - R<sub>6</sub>

**A** 13

**B** 15

**C** 17

**D** 19

[GATE-2010-CS: 2M]

**Q.**

For a pipelined CPU with a single ALU, consider the following situations

P  
W

1. The  $j^{th} + 1^{st}$  instruction uses the result of the  $j^{th}$  instruction as an operand
2. The execution of a conditional jump instruction
3. The  $j^{th}$  and  $j^{th} + 1^{st}$  instructions require the ALU at the same time

Which of the above can cause a hazard?

**[GATE-2003 : 1 Marks]**

- A** 1 and 2 only
- B** 2 and 3 only
- C** 3 only
- D** All the three

Q.

A pipelined processor uses a 4-stage instruction pipeline with the following stages: Instruction fetch (IF), Instruction decode (ID), Execute (EX) and Writeback (WB). The arithmetic operations as well as the load and store operations are carried out in the EX stage. The sequence of instructions corresponding to the statement  $X = (S - R * (P + Q))/T$  is given below. The values of variables P, Q, R, S and T are available in the registers R0, R1, R2, R3 and R4 respectively, before the execution of the instruction sequence.

|       |            |                           |
|-------|------------|---------------------------|
| ADD   | R5, R0, R1 | ; $R5 \leftarrow R0 + R1$ |
| MUL   | R6, R2, R5 | ; $R6 \leftarrow R2 * R5$ |
| SUB   | R5, R3, R6 | ; $R5 \leftarrow R3 - R6$ |
| DIV   | R6, R5, R4 | ; $R6 \leftarrow R5 / R4$ |
| STORE | R6, X      | ; $X \leftarrow R6$       |

The IF, ID and WB stages take 1 clock cycle each. The EX stage takes 1 clock cycle each for the ADD, SUB and STORE operations, and 3 clock cycles each for MUL and DIV operations. Operand forwarding from the EX stage to the ID stage is used. The number of clock cycles required to complete the sequence of instructions is

[GATE-2006: 2 Marks]

- A 10
- B 12
- C 14
- D 16

Q. Consider the sequence of machine instructions given below:

|     |            |
|-----|------------|
| MUL | R5, R0, R1 |
| DIV | R6, R2, R3 |
| ADD | R7, R5, R6 |
| SUB | R8, R7, R4 |

In the above sequence, R0 to R8 are general purpose registers. In the instructions shown, the first register stores the result of the operation performed on the second and the third registers. This sequence of instructions is to be executed in a pipelined instruction processor with the following 4 stages: (1) Instruction Fetch and Decode (IF), (2) Operand Fetch (OF), (3) Perform Operation (PO) and (4) Write back the Result (WB). The PO state takes 1 clock cycle for ADD or SUB instruction. 3 clock cycles for MUL instruction and 5 clock cycles for DIV instruction. The pipelined processor uses operand forwarding from the PO stage to the OF stage. The number of clock cycles taken for the execution of the above sequence of instructions is \_\_\_\_\_.

[GATE-2015 (Set-2): 2 Marks]

**THANK  
YOU!**

