

## Pipelined Processor

Reference: Eric Goren Schmidt's EE446 Slides from Harris

- Built top on the single cycle processor
- Add registers for each stage

1. Fetch Stage: Read instruction from instruction memory
2. Decode Stage: Read the source operands, decode instruction
3. Execute Stage: Perform ALU operation
4. Memory Stage: Read/Write data memory
5. Writeback stage: Write the result to the register file

If # of stages < 5  
 ↓  
 single-cycle processor

### Critical Path and Logic Delays of Single Cycle



$$T_{c1} = t_{pfa-pf} + 2t_{Mem} + t_{dec} + t_{RF\ read} + t_{ALU} + 2t_{MRW} + t_{RF\ setup}$$

ignore

$$T_{c1} = 680 \text{ ps}$$

Throughput: 1 instruction per 680 ps  $\approx 1.47$  billion instruction



Throughput: 1 instruction per 200 ps  $\approx 5$  billion instructions per second

### Pipelined Processor Abstraction





! Problem 1: In Cycle 5, LDR is in WB  $\rightarrow$  we want WA3D = R<sub>2</sub>.  
 However AND is in decode WA3D = R<sub>5</sub> !

Rule: All signals associated with a particular instruction must advance through the pipeline in unison-

Datapath Version 2



Datapath Final Version 3



\* At Fetch stage, logically  $(PC+4)F = PC+8$ . Thus there is no need for the additional adder.



\* Branch Signal is different from single-cycle-

### Instruction Execution Example

0xA000 LDR R2, R0, #40  
 0xA004 AND R3, R2, R10  
 0xA008 STR R4, R1, #20  
 0xA00C BUI #10

$PCSrcW = 0$ ,  $RegWriteW = 0$ ,  $MemtoRegW = X$ ,  $NewWriteM = 0$

Cycle 1:



Cycle 2:



## Cycle 3 :



Cycle 4 :



Cycle 5:



## Cycle 6:



Cycle 7:



Cycle 8:



① Control Hazard which instruction to fetch next?

## HAZARD TYPES

### Data Hazard

An instruction tries to read a register that has not yet been written back by a previous instruction

### Control Hazard

The decision of what instruction to fetch next has not been made by the time the fetch takes place.

### Data Hazard: Example 1

ADD R1, R4, R5

AND R8, R1, R3

ORR R9, R6, R1

SUB R10, R1, R7



→ Software Solution: Adding 2 NOPs

→ Solution 2: Forwarding



# Datapath v4 (with Hazard Unit)



## Match

$$\text{Match-IE-M} = (\text{RAIE} == \text{WA3M})$$

$$\text{Match-IE-W} = (\text{RAIE} == \text{WA3W})$$

if (Match-IE-M . RegWriteM) → Forward AE = 10

else if (Match-IE-W . RegWriteM) → Forward AE = 01  
ResultW

else ForwardAE = 00  
[RAI]

→ Memory stage has priority

\* STR and B instructions do not write results to RF (no need to have their results forwarded)  
thus use RegWrite W, M signals in the conditions

## Data Hazard: Example 2

LDR R2, [R4, #40]

AND R8, R1, R3

ORR R9, R6, R1

SUB R10, R1, R7



## Solution: Stalling



In Decode Stall: → Execute stage control signals are zero

→ Execute stage register is flushed

→ All previous stages must be stalled

\* The pipeline register directly after the stalled stage must be cleared (flushed)



when LDR occurs

$$\text{Match-12D-E} = (\text{RA1D} == \text{WA3E}) + (\text{RA2D} == \text{WA3E})$$

$$\text{LDRstall} = \text{Match-12D-E} \cdot \text{MemtoRegE} \rightarrow 1 \text{ for LDR}$$

Stall F = Stall D = Flush E = LDRstall → look at the next page  
 later it will change!

## Control Hazard

### Branch (B)

- Instruction after branch fetched before branch occurs
- These 4 instructions must be flushed if branch happens
  - ↓
  - ! Degrades the system performance

### Reducing Branch Misprediction Penalty

Early branch decision: BTA and Cond $\bar{x}$  are known in Execute Stage

### Flush 2 instructions

### Stall and flush Branches

- when branch is taken, flush Decode and Execute
- when a PC write is in the pipeline → stall fetch
- Stall fetch → Flush Decode
- PCWrPending is asserted when a PC write is in progress
- During this time, fetch is stalled, decode is flushed
- When the PC write reaches the Writeback stage (PCSrcW asserted), StallF is released to allow the write to occur # but FlushD is still asserted

$PCWrPending^F = PCSrcD + PCSrc\bar{E} + PCSrcM \rightarrow$  Fetch is stalled, Decode is flushed

$$StallF = LDRstall + PCWrPending^F$$

$$StallD = LDRstall$$

$$FlushD = PCWrPending^F + PCSrcW + Branchtaken\bar{E}$$

$$Flush\bar{E} = LDRstall + Branchtaken\bar{E};$$

## Pipelined Processor Performance

Ideally  $CPI = 1$ , however flushes or stalls may occur  $\rightarrow$  waste cycle  
 LDR  $CPI = 1 \rightarrow 2$  (for each LDR followed by a Register)



### Example:

25% loads  
10% stores  
13% branches  
52% data processing

Suppose

40% of loads used by next instruction that uses the result  
50% of branches mispredicted (taken)

$\rightarrow$  load  $CPI = 1$  (no stalling)  $\rightarrow 2$  (when stalling)

$$CPI_{Iw} = 1 \times 0.6 + 2 \times 0.4 = 1.4 \text{ for LDR}$$

$\rightarrow$  Branch  $CPI = 1$  (no stalling)  $\rightarrow 3$  (when stalling)

$$CPI_{beq} = 1 \times 0.5 + 3 \times 0.5 = 2$$

$$\boxed{\text{Avg. CPI}} = (0.25) \downarrow_{\text{LDR}} \times 1.4 + (0.1) \downarrow_{\text{St}} \times 1 + (0.13) \downarrow_{\text{B}} \times 2 + (0.52) \downarrow_{\text{DP}} \times 1 = 1.23$$

### Critical Path

$$T_{CS} = \max \left[ \begin{array}{l} t_{pq} + t_{mem} + t_{setup} \\ 2(t_{RF} \downarrow_{\text{read}} + t_{setup}) \\ t_{pq} + 2t_{mux} + t_{RF} + t_{setup} \\ t_{pq} + t_{mem} + t_{setup} \\ 2(t_{pq} + t_{mux} + t_{RF} \downarrow_{\text{setup}}) \end{array} \right] \begin{array}{l} F \\ D \\ E \\ M \\ WB \end{array}$$

↓  
half cycle work  
thus multiply with 2