

- Yourselves of pipelining
- 3) Pipelined data Path
- 3) Pipeline hazards
- 4) Pipeline with interlocks
- 5) Forwarding
- 5) Performance metrics
- 7) Interrupts / Exceptions
- instr. pipeline
- hardware mech.

### Pipelining: Stages



The stages include:

Instruction Fetch (IF)

Instruction decode (ID)

Execute (EX)

Memory (MEM)

Write back (WB)

## Pipeline hazards

### Pipelining:

overlapping different stages of an instr is called pipelining.

- An instruction requires several steps which mainly involve fetching, decoding and execution. So to make processor faster a method called pipelining is applied.

### \* 2 stage Pipelining - 8085

8085 = non-pipelined processor (F E F E F E . . . )



8086 = pipelined processor (F E E E E E . . . )



### \* 3-stage Pipelining - 80386 / ARM - 7 (P)

Here in 2 stages the instruction process is divided into 3 stages: 1) Fetching  
2) decoding  
3) execution.

total time (by non-pipelined)  $\rightarrow$



Total cycles  $K \times N$

$$3 \times 4 = 12 \text{ cycles}$$

K

(F D E E E E . . .)



$$\begin{aligned} \text{Total cycle} &= [K + N - 1] \\ &= 3 + 4 - 1 \\ &= (6 \text{ cycles}) \end{aligned}$$

total time  
(pipelined)

\* 4 stage pipelining

This involves Fetch, Decoder, execute and store.

\* 5 stage pipelining - PENTIUM

Fetch, decoder, Address generation, execute, store

\* 6 stage pipelining - Pentium Pro

IF, ID, AG, OF, EX, WR.

$I_1$

$I_2$

$I_3$

IF | ID | AG | OF | EX | WR | IF | ID | AG | OF | EX | WR | IF | ID | AG | OF | EX | WR |

$G = K$

IF | ID | AG | OF | EX | WR |  $\leftarrow I_1$

IF | ID | AG | OF | EX | WR |  $\leftarrow I_2$

IF | ID | AG | OF | EX | WR |  $\leftarrow I_3$

Total cycles.

$K + N - 1$

Q) Instruction pipeline with 5 stages  
 +D IF, ID, AG, EX, OW without any branch prediction  
 Total 12 instr.

The stage delays: 5 ns IF  
 7 ns ID  
 10 ns AG  
 8 ns EX  
 6 ns OW

\* Propagation delay: time taken for all signals to travel from one end of the stage to the other.



Intermediate storage buffer latency buffer delay of 1ns.

I4 is the only branch inst. branch target = 9  
 time needed to complete program?

Pipeline diagram:

[1] add  $r_1, r_2 \rightarrow r_3$

[2] sub  $r_4, r_2 \rightarrow r_5$

[3] mul  $r_5, r_8 \rightarrow r^2$



## Instruction Packet

- Instr. contents
- Program counter
- Intermediate results
- control signals



## Data Hazards

- ① RAW (read after write)

## Pipeline Hazards

### ① Data hazards / Data dependency hazard

#### 1.1) RAW (Read after write)

Data hazard is caused when the result of one instruction becomes the source of the next instruction



## IN-ORDER PIPELINE

→ a preceding instr. is always ahead of a succeeding instr. in the pipeline

→ It includes **RAW** hazards only.

\* out-of-order pipelines can have WAR and WAW hazards

### • WAW hazard

[1]: add  $r_1, r_2, r_3$        $r_1 \leftarrow r_2 + r_3$

[2]: sub  $r_1, r_4, r_3$        $r_1 \leftarrow r_4 - r_3$

⇒ Instr [2] cannot write the value of  $r_1$ , before  
Instr [1] writes to it. will lead to a WAW hazard.

### • WAR hazard

[1]: add  $r_1, r_2, r_3$        $r_1 \leftarrow r_2 + r_3$

[2]: add  $r_2, r_5, r_6$        $r_2 \leftarrow r_5 + r_6$

⇒ Instr [2] cannot write the value of  $r_2$ , before  
Instr [1] writes to it leads the value of  $r_2$ . leads  
to WAR hazard.

### ② Control hazard

→ Pipelining assumes that the program will always flow in a sequential manner. The programs are sequential most of the times, it is not true always.

Sometimes, branches do occur in our program. So in this, all the forthcoming instructions that have been fetched/decided have to be flushed/ discarded. and the process has to start all over again from the branch add.

[1]: breq .  
 [2]: mov r1, 4  
 [3]: add r2, r4, r3  
 ...  
 : foo:  
 [4]: add r4, r1, r2



if branch is taken, then  
 instr [2] and next [3] may  
 get fetched incorrectly

## \* Structural hazards

→ caused by physical constraints in the architecture  
like the buses.  
 or diff instr try to access the same resource

even in the most basic form of pipelining, we want to execute one instr. and fetch the next one.  
 Now if execution involves registers, pipelining is possible. But if execution requires to read/write data from memory.

[1]: st r4, 20[r5]  
 [2]: sub r8, r9, r10  
 [3]: add r1, r2, 10[r3]

[3]: tries to read w[r3] (MA unit) in cycle 4

[1]: tries to write to 20[r5] in cycle 4.

## \* Solutions in software

### • Data hazards.

[1]: add r1, r2, r3  
 [2]: sub r3, r1, r4,

## Insert nop instructions.

- [1]: add  $r_1, r_2, r_3$  ← done
- [2]: nop } RW
- [3]: nop } MA
- [4]: nop } EX
- [5]: sub  $r_3, r_1, r_4$  OF sequence lab 1001

Reorder the code:

- Q add  $r_1, r_2, r_3$   
add  $r_4, r_1, r_3$   
add  $r_8, r_5, r_6$  →  
add  $r_9, r_8, r_5$   
add  $r_{10}, r_{11}, r_{12}$   
add  $r_{13}, r_{10}, r_2$

- add  $r_1, r_2, r_3$   
add  $r_8, r_5, r_6$   
add  $r_{10}, r_{11}, r_{12}$   
nop  
add  $r_4, r_1, r_3$   
add  $r_9, r_8, r_5$   
add  $r_{13}, r_{10}, r_2$

Solution for control hazard



- Q add  $r_1, r_2, r_3$   
add  $r_4, r_5, r_6$   
b. foo  
add  $r_8, r_9, r_{10}$

- b. foo
- add  $r_1, r_2, r_3$   
add  $r_4, r_5, r_6$  } delay  
add  $r_8, r_9, r_{10}$

if there is no valid inst b4 branch inst then,  
pt inserts  $\boxed{\text{nops}}$ .

## Interlocks:

→ Type of hardware mechanism to enforce correctness (Interlock).

### Two kinds of interlocks

#### 1). Data-Lock

→ Do not allow the consumer instr to move beyond the OF stage till it has reached the correct values.

Implication: Stall the IF and OF stages.

#### 2). Branch Lock

→ we never execute instr. in the wrong paths.

Pipeline bubble → up inst

→ it is inserted into a stage when the prev stage needs to be stalled.

### Q How to stall a pipeline?

1. Disable the write functionality of

→ IF - OF register

PC (program counter)

2 to insert a bubble.

→ write a bubble (nop instr) into the OF-EX register.

Forwarding:  
(No bubbles):



Q what does not

when does inst 2 need the value of  $\sigma_1$ ?

→ cycle 3, OF stage **X**

cycle 4, EX stage

Q. When does I1 produce the value of  $\sigma_1$ ?

→ cycles 1, RW stage **X**

End of cycle 3, EX stage

### Types of Forwarding paths

\* 3 stage path

→ RW → OF



\* 2 Stage path

RW → EX

~~MA → EX~~

MA → OF (Not required)

\* 1 Stage path.

MA → EX (alu, load, store)

MA → OF (X)

RW → MA (load to store)

Ans.

### Interlocks with forwarding

1. Data Lock.

→ only need to check for the load-use hazard

→ if the instr. in the EX stage is load and OF stage uses its loaded value - then stall for 1 cycle.



## 2) Branch - lock

→ Remains lock.

## \* The curious case of all instruction

- The call instruction generates the value of ra in the RW stage
- Any instruction that uses the value of ra (such as ret) still works.



## \* Performance metrics:

- Performance is inversely proportional to the time it takes to execute a program.

$$T = \# \text{seconds}$$

$$= \frac{\# \text{seconds}}{\# \text{cycle}} \times \frac{\# \text{cycle}}{\# \text{instr}} \times \frac{\# \text{instr}}{\text{CPI}}$$

time  $\downarrow f$

$$T = \frac{\text{CPI} \times \# \text{instr}}{f}$$

CPI: cycles per instr.

f: frequency  
(cycles/sec)

$$P \propto \frac{1/\text{PC} \times f}{\# \text{instr}}$$

→ performance

## \* Computing the CPI

$n$  inst<sub>j</sub>,  $K$  stages..

$$CPI = \frac{n + K - 1}{n}$$

cycles  
inst<sub>j</sub>

~~real~~  $(n-1+K)$ .



## \* max frequency ..

$t_{max}$  = maximum amount of time that it takes to execute any insts to algo work.

minimum clock cycle time of a single cycle

$$\text{pipeline} = t_{max}$$

If the pipeline has stages and balanced then

$$\text{Time per stage} = \boxed{t_{max}/k}$$

If stages are not balanced, having some delay.  
latch delay =  $d$ .

$$t_{stage} = \frac{t_{max}}{K} + d$$

$$\frac{1}{f} = \frac{t_{max}}{K} + d$$

