

---

# Pipelining and Hazard

# Pipeline Analogy: Doing Laundry

- Ann, Brian, Cathy, and Dave each have one load of clothes to wash, dry, fold, and put away
  - Washer takes 30 minutes
  - Dryer takes 30 minutes
  - “Folder” takes 30 minutes
  - “Stasher” takes 30 minutes to put clothes into drawers



# Sequential Laundry



- Sequential laundry takes 8 hours for 4 loads

# Pipelined Laundry



- Pipelined laundry takes 3.5 hours for 4 loads!

# Pipelining Lessons (1/2)



- Pipelining doesn't help *latency* of single task, just *throughput* of entire workload
- *Multiple* tasks operating simultaneously using different resources
- **Potential speedup = number of pipeline stages**
- Speedup reduced by time to *fill* and *drain* the pipeline: 8 hours/3.5 hours or 2.3X v. potential 4X in this example

# Pipelining Lessons (2/2)



- Suppose new Washer takes 20 minutes, new Stasher takes 20 minutes. How much faster is pipeline?
  - Pipeline rate limited by *slowest* pipeline stage
  - Unbalanced lengths of pipeline stages reduces speedup

# Recall: 5 Stages of MIPS Datapath

- 1) IF: Instruction Fetch, Increment PC
- 2) ID: Instruction Decode, Read Registers
- 3) EX: Execution (ALU)
  - Load/Store: Calculate Address
  - Others: Perform Operation
- 4) MEM:
  - Load: Read Data from Memory
  - Store: Write Data to Memory
- 5) WB: Write Data Back to Register

# Pipelined Datapath



- Add registers between stages
  - Hold information produced in previous cycle
- 5 stage pipeline
  - Clock rate *potentially* 5x faster

# Pipelining Changes

- Registers affect flow of information
  - Name registers for adjacent stages (e.g. IF/ID)
  - Registers *separate* the information between stages
  - At any instance of time, each stage working on a *different* instruction!
- Will need to re-examine placement of wires and hardware in datapath

# More Detailed Pipeline

- Examine flow through pipeline for  $l_w$



# Instruction Fetch (IF) for Load



# Instruction Decode (ID) for Load



# Execute (EX) for Load



# Memory (MEM) for Load



# Write Back (WB) for Load

There's something wrong here! (Can you spot it?)

lw  
Write back



# Corrected Datapath

- Now any instruction that writes to a register will work properly



# Pipelined Execution Representation



- Every instruction must take same number of steps, so some will idle
  - e.g. MEM stage for any arithmetic instruction

# Graphical Pipeline Diagrams



- Use datapath figure below to represent pipeline:



# Graphical Pipeline Representation

- RegFile: right half is read, left half is write



# Instruction Level Parallelism (ILP)

- Pipelining allows us to execute parts of multiple instructions at the same time using the same hardware!
  - This is known as *instruction level parallelism*
- **Recall:** Types of parallelism
  - DLP: same operation on lots of data (SIMD)
  - TLP: executing multiple threads “simultaneously” (OpenMP)

# Pipeline Performance (1/2)

- Assume time for stages is
  - 100ps for register read or write
  - 200ps for other stages

| Instr    | Instr fetch | Register read | ALU op | Memory access | Register write | Total time |
|----------|-------------|---------------|--------|---------------|----------------|------------|
| lw       | 200ps       | 100 ps        | 200ps  | 200ps         | 100 ps         | 800ps      |
| sw       | 200ps       | 100 ps        | 200ps  | 200ps         |                | 700ps      |
| R-format | 200ps       | 100 ps        | 200ps  |               | 100 ps         | 600ps      |
| beq      | 200ps       | 100 ps        | 200ps  |               |                | 500ps      |

- What is pipelined clock rate?
  - Compare pipelined datapath with single-cycle datapath

# Pipeline Performance (2/2)



# Pipeline Speedup

- Use  $T_c$  (“time between completion of instructions”) to measure speedup
  - $T_{c,\text{pipelined}} \geq \frac{T_{c,\text{single-cycle}}}{\text{Number of stages}}$
  - Equality only achieved if stages are *balanced* (i.e. take the same amount of time)
- If not balanced, speedup is reduced
- Speedup due to increased throughput
  - Latency for each instruction does not decrease

# Pipelining and ISA Design

- MIPS Instruction Set designed for pipelining!
- All instructions are 32-bits
  - Easier to fetch and decode in one cycle
- Few and regular instruction formats, 2 source register fields always in same place
  - Can decode and read registers in one step
- Memory operands only in Loads and Stores
  - Can calculate address 3<sup>rd</sup> stage, access memory 4<sup>th</sup> stage
- Alignment of memory operands
  - Memory access takes only one cycle

# Review: Pipelined Datapath



# Graphical Pipeline Representation

- RegFile: right half is read, left half is write



**Question:** Which of the following signals (buses or control signals) for MIPS-lite does NOT need to be passed into the EX pipeline stage?

- PC + 4**
- MemWr**
- RegWr**
- imm16**



# Pipelining Hazards

A *hazard* is a situation that prevents starting the next instruction in the next clock cycle

## 1) *Structural hazard*

- A required resource is busy  
(e.g. needed in multiple stages)

## 2) *Data hazard*

- Data dependency between instructions
- Need to wait for previous instruction to complete its data read/write

## 3) *Control hazard*

- Flow of execution depends on previous instruction

# Agenda

- Structural Hazards
- Data Hazards
  - Forwarding
- Data Hazards (Continued)
  - Load Delay Slot
- Control Hazards
  - Branch and Jump Delay Slots
  - Branch Prediction

# 1. Structural Hazards

- Conflict for use of a resource
- MIPS pipeline with a single memory?
  - Load/Store requires memory access for data
  - Instruction fetch would have to *stall* for that cycle
    - Causes a pipeline “*bubble*”
- Hence, pipelined datapaths require separate instruction/data memories
  - Separate L1 I\$ and L1 D\$ take care of this

# Structural Hazard #1: Single Memory



# Structural Hazard #2: Registers (1/2)



## Structural Hazard #2: Registers (2/2)

- Two different solutions have been used:
  - 1) Split RegFile access in two: Write during 1<sup>st</sup> half and Read during 2<sup>nd</sup> half of each clock cycle
    - Possible because RegFile access is *VERY* fast (takes less than half the time of ALU stage)
  - 2) Build RegFile with independent read and write ports
- **Conclusion:** Read and Write to registers during same clock cycle is okay

# Agenda

- Structural Hazards
- Data Hazards
  - Forwarding
- Data Hazards (Continued)
  - Load Delay Slot
- Control Hazards
  - Branch and Jump Delay Slots
  - Branch Prediction

## 2. Data Hazards (1/2)

- Consider the following sequence of instructions:

add \$t0, \$t1, \$t2

sub \$t4, \$t0, \$t3

and \$t5, \$t0, \$t6

or \$t7, \$t0, \$t8

xor \$t9, \$t0, \$t10

## 2. Data Hazards (2/2)

- Data-flow *backward* in time are hazards



# Data Hazard Solution: Forwarding

- Forward result as soon as it is available
  - OK that it's not stored in RegFile yet



# Datapath for Forwarding (1/2)

- What changes need to be made here?



# Datapath for Forwarding (2/2)

- Handled by *forwarding unit*



# Agenda

- Structural Hazards
- Data Hazards
  - Forwarding
- Data Hazards (Continued)
  - Load Delay Slot
- Control Hazards
  - Branch and Jump Delay Slots
  - Branch Prediction

# Data Hazard: Loads (1/4)

- **Recall:** Dataflow backwards in time are hazards



- Can't solve all cases with forwarding
  - Must *stall* instruction dependent on load, then forward (more hardware)

# Data Hazard: Loads (2/4)

- *Hardware* stalls pipeline
  - Called “hardware interlock”

Schematically, this is what we want, but in reality stalls done “horizontally”

**lw \$t0, 0(\$t1)**

**sub \$t3,\$t0,\$t2**

**and \$t5,\$t0,\$t4**

**or \$t7,\$t0,\$t6**



# Data Hazard: Loads (3/4)

- Stall is equivalent to nop

**lw \$t0, 0(\$t1)**

**nop**

**sub \$t3,\$t0,\$t2**

**and \$t5,\$t0,\$t4**

**or \$t7,\$t0,\$t6**



# Data Hazard: Loads (4/4)

- Slot after a load is called a *load delay slot*
  - If that instruction uses the result of the load, then the hardware interlock will stall it for one cycle
  - Letting the hardware stall the instruction in the delay slot is equivalent to putting a `nop` in the slot (except the latter uses more code space)
- **Idea:** Let the compiler put an unrelated instruction in that slot → no stall!

# Code Scheduling to Avoid Stalls

- Reorder code to avoid use of load result in the next instruction!
- MIPS code for  $A=B+E; C=B+F;$



# Agenda

- Structural Hazards
- Data Hazards
  - Forwarding
- Data Hazards (Continued)
  - Load Delay Slot
- Control Hazards
  - Branch and Jump Delay Slots
  - Branch Prediction

## 3. Control Hazards

- Branch (beq, bne) determines flow of control
  - Fetching next instruction depends on branch outcome
  - Pipeline can't always fetch correct instruction
    - Still working on ID stage of branch
- **Simple Solution:** Stall on *every* branch until we have the new PC value
  - How long must we stall?

# Branch Stall

- When is comparison result available?



### 3. Control Hazard: Branching

- **Option #1:** Insert **special branch comparator** in ID stage
  - As soon as instruction is decoded, immediately make a decision and set the new value of the PC
  - **Benefit:** Branch decision made in 2<sup>nd</sup> stage, so only one `nop` is needed instead of two
  - **Side Note:** This means that branches are idle in EX, MEM, and WB

# Improved Branch Stall

- When is comparison result available?



# Datapath for ID Branch Comparator

- What changes need to be made here?



# Datapath for ID Branch Comparator

- Handled by *hazard detection unit*



### 3. Control Hazard: Branching

- **Option #2:** *Branch Prediction* – guess outcome of a branch, fix afterwards if necessary
  - Must cancel (*flush*) all instructions in pipeline that depended on guess that was wrong
  - How many instructions do we end up flushing?
- Achieve simplest hardware if we predict that all branches are NOT taken

### 3. Control Hazard: Branching

- **Option #3:** *Branch delay slot*
  - Whether or not we take the branch, *always* execute the instruction immediately following the branch
  - Worst-Case: Put a `nop` in the branch-delay slot
  - Better Case: Move an instruction from before the branch into the branch-delay slot
    - Must not affect the logic of program

### 3. Control Hazard: Branching

- MIPS uses this *delayed branch* concept
  - Re-ordering instructions is a common way to speed up programs
  - Compiler finds an instruction to put in the branch delay slot  $\approx 50\%$  of the time
- Jumps also have a delay slot
  - Why is one needed?

# Delayed Branch Example

## Nondelayed Branch

```
or $8, $9, $10
```

```
add $1, $2, $3
```

```
sub $4, $5, $6
```

```
beq $1, $4, Exit
```

```
xor $10, $1, $11
```

Exit:

## Delayed Branch

```
add $1, $2, $3
```

```
sub $4, $5, $6
```

```
beq $1, $4, Exit
```

or \$8, \$9, \$10

```
xor $10, $1, $11
```

Why not any of the  
other instructions?

# Delayed Jump in MIPS

- MIPS Green Sheet for `jal`:

$R[31] = PC + 8;$     $PC = \text{JumpAddr}$

- $PC + 8$  because of *jump delay slot!*
- Instruction at  $PC + 4$  always gets executed before `jal` jumps to label, so return to  $PC + 8$

# Agenda

- Structural Hazards
- Data Hazards
  - Forwarding
- Data Hazards (Continued)
  - Load Delay Slot
- **Control Hazards**
  - Branch and Jump Delay Slots
  - **Branch Prediction**

# Dynamic Branch Prediction

- Branch penalty is more significant in deeper pipelines
  - Also superscalar pipelines
- Use *dynamic branch prediction*
  - Have branch prediction buffer (a.k.a. branch history table) that stores outcomes (taken/not taken) indexed by recent branch instruction addresses
  - To execute a branch
    - Check table and predict the same outcome for next fetch
    - If wrong, flush pipeline and flip prediction

# 1-Bit Predictor: Shortcoming

- Examine the code below, assuming both loops will be executed multiple times:



- Inner loop branches are predicted wrong twice!
  - Predict as taken on last iteration of inner loop
  - Then predict as not taken on first iteration of inner loop next time around

# 2-Bit Predictor

- Only change prediction after two successive incorrect predictions



# Summary

## Pipelined Control (Simplified)



# Pipelined Control

- Control signals derived from instruction
  - As in single-cycle implementation



# Pipelined Control

PCSrc



# Data Hazards in ALU Instructions

- Consider this sequence:

```
sub $2, $1,$3  
and $12, $2,$5  
or $13, $6,$2  
add $14, $2,$2  
sw $15,100($2)
```

- We can resolve hazards with forwarding
  - How do we detect when to forward?

# Dependencies & Forwarding



# Detecting the Need to Forward

- Pass register numbers along pipeline
  - e.g., ID/EX.RegisterRs = register number for Rs sitting in ID/EX pipeline register
- ALU operand register numbers in EX stage are given by
  - ID/EX.RegisterRs, ID/EX.RegisterRt
- Data hazards when
  - 1a. EX/MEM.RegisterRd = ID/EX.RegisterRs
  - 1b. EX/MEM.RegisterRd = ID/EX.RegisterRt
  - 2a. MEM/WB.RegisterRd = ID/EX.RegisterRs
  - 2b. MEM/WB.RegisterRd = ID/EX.RegisterRt



# Detecting the Need to Forward

- But only if forwarding instruction will write to a register!
  - EX/MEM.RegWrite, MEM/WB.RegWrite
- And only if Rd for that instruction is not \$zero
  - EX/MEM.RegisterRd  $\neq 0$ ,
  - MEM/WB.RegisterRd  $\neq 0$

# Forwarding Paths



b. With forwarding

# Forwarding Conditions

- EX hazard
  - if (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)  
and (EX/MEM.RegisterRd = ID/EX.RegisterRs))  
**ForwardA = 10**
  - if (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)  
and (EX/MEM.RegisterRd = ID/EX.RegisterRt))  
**ForwardB = 10**
- MEM hazard
  - if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0)  
and (MEM/WB.RegisterRd = ID/EX.RegisterRs))  
**ForwardA = 01**
  - if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0)  
and (MEM/WB.RegisterRd = ID/EX.RegisterRt))  
**ForwardB = 01**

# Double Data Hazard

- Consider the sequence:

add \$1, \$1, \$2

add \$1, \$1, \$3

add \$1, \$1, \$4

- Both hazards occur
  - Want to use the most recent
- Revise MEM hazard condition
  - Only fwd if EX hazard condition isn't true

# Revised Forwarding Condition

- MEM hazard
  - if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0)  
and not (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)  
and (EX/MEM.RegisterRd = ID/EX.RegisterRs))  
and (MEM/WB.RegisterRd = ID/EX.RegisterRs))  
ForwardA = 01
  - if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0)  
and not (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)  
and (EX/MEM.RegisterRd = ID/EX.RegisterRt))  
and (MEM/WB.RegisterRd = ID/EX.RegisterRt))  
ForwardB = 01

# Datapath with Forwarding



# Load-Use Data Hazard



# Load-Use Hazard Detection

- Check when using instruction is decoded in ID stage
- ALU operand register numbers in ID stage are given by
  - IF/ID.RegisterRs, IF/ID.RegisterRt
- Load-use hazard when
  - ID/EX.MemRead and
$$((ID/EX.RegisterRt = IF/ID.RegisterRs) \text{ or } (ID/EX.RegisterRt = IF/ID.RegisterRt))$$
- If detected, stall and insert bubble

# How to Stall the Pipeline

- Force control values in ID/EX register to 0
  - EX, MEM and WB do nop (no-operation)
- Prevent update of PC and IF/ID register
  - Using instruction is decoded again
  - Following instruction is fetched again
  - 1-cycle stall allows MEM to read data for 1w
    - Can subsequently forward to EX stage

# Stall/Bubble in the Pipeline



# Stall/Bubble in the Pipeline



# Datapath with Hazard Detection



# Stalls and Performance

## The BIG Picture

- Stalls reduce performance
  - But are required to get correct results
- Compiler can arrange code to avoid hazards and stalls
  - Requires knowledge of the pipeline structure

# Branch Hazards

- If branch outcome determined in MEM



# Reducing Branch Delay

- Move hardware to determine outcome to ID stage
  - Target address adder
  - Register comparator
- Example: branch taken

```
36: sub $10, $4, $8
40: beq $1, $3, 7
44: and $12, $2, $5
48: or $13, $2, $6
52: add $14, $4, $2
56: slt $15, $6, $7
      ...
72: lw $4, 50($7)
```

# Example: Branch Taken



# Example: Branch Taken



# Data Hazards for Branches

- If a comparison register is a destination of 2<sup>nd</sup> or 3<sup>rd</sup> preceding ALU instruction



- Can resolve using forwarding

# Data Hazards for Branches

- If a comparison register is a destination of preceding ALU instruction or 2<sup>nd</sup> preceding load instruction
  - Need 1 stall cycle



# Data Hazards for Branches

- If a comparison register is a destination of immediately preceding load instruction
  - Need 2 stall cycles



# Exceptions and Interrupts

- “Unexpected” events requiring change in flow of control
  - Different ISAs use the terms differently
- Exception
  - Arises within the CPU
    - e.g., undefined opcode, overflow, syscall, ...
- Interrupt
  - From an external I/O controller
- Dealing with them without sacrificing performance is hard

# Handling Exceptions

- In MIPS, exceptions managed by a System Control Coprocessor (CP0)
- Save PC of offending (or interrupted) instruction
  - In MIPS: Exception Program Counter (EPC)
- Save indication of the problem
  - In MIPS: Cause register
  - We'll assume 1-bit
    - 0 for undefined opcode, 1 for overflow
- Jump to handler at 8000 00180

# An Alternate Mechanism

- Vectored Interrupts
  - Handler address determined by the cause
- Example:
  - Undefined opcode: C000 0000
  - Overflow: C000 0020
  - ...: C000 0040
- Instructions either
  - Deal with the interrupt, or
  - Jump to real handler

# Handler Actions

- Read cause, and transfer to relevant handler
- Determine action required
- If restartable
  - Take corrective action
  - use EPC to return to program
- Otherwise
  - Terminate program
  - Report error using EPC, cause, ...

# Exceptions in a Pipeline

- Another form of control hazard
- Consider overflow on add in EX stage
  - add \$1, \$2, \$1
    - Prevent \$1 from being clobbered
    - Complete previous instructions
    - Flush add and subsequent instructions
    - Set Cause and EPC register values
    - Transfer control to handler
- Similar to mispredicted branch
  - Use much of the same hardware

# Pipeline with Exceptions



# Exception Properties

- Restartable exceptions
  - Pipeline can flush the instruction
  - Handler executes, then returns to the instruction
    - Refetched and executed from scratch
- PC saved in EPC register
  - Identifies causing instruction
  - Actually PC + 4 is saved
    - Handler must adjust

# Cortex A8 and Intel i7

| Processor                             | ARM A8                 | Intel Core i7 920                     |
|---------------------------------------|------------------------|---------------------------------------|
| Market                                | Personal Mobile Device | Server, cloud                         |
| Thermal design power                  | 2 Watts                | 130 Watts                             |
| Clock rate                            | 1 GHz                  | 2.66 GHz                              |
| Cores/Chip                            | 1                      | 4                                     |
| Floating point?                       | No                     | Yes                                   |
| Multiple issue?                       | Dynamic                | Dynamic                               |
| Peak instructions/clock cycle         | 2                      | 4                                     |
| Pipeline stages                       | 14                     | 14                                    |
| Pipeline schedule                     | Static in-order        | Dynamic out-of-order with speculation |
| Branch prediction                     | 2-level                | 2-level                               |
| 1 <sup>st</sup> level caches/core     | 32 KiB I, 32 KiB D     | 32 KiB I, 32 KiB D                    |
| 2 <sup>nd</sup> level caches/core     | 128-1024 KiB           | 256 KiB                               |
| 3 <sup>rd</sup> level caches (shared) | -                      | 2- 8 MB                               |

# ARM Cortex-A8 Pipeline

F0 F1 F2 D0 D1 D2 D3 D4 E0 E1 E2 E3 E4 E5

Branch mispredict  
penalty = 13 cycles



# Core i7 Pipeline



**Question:** For each code sequences below, choose one of the statements below:

1:

```
lw $t0,0($t0)  
add $t1,$t0,$t0
```

2:

```
add $t1,$t0,$t0  
addi $t2,$t0,5  
addi $t4,$t1,5
```

3:

```
addi $t1,$t0,1  
addi $t2,$t0,2  
addi $t3,$t0,2  
addi $t3,$t0,4  
addi $t5,$t1,5
```

- No stalls as is**
- No stalls with forwarding**
- Must stall**

# Code Sequence 1



# Code Sequence 2



# Code Sequence 3



# Summary

- Hazards reduce effectiveness of pipelining
  - Cause stalls/bubbles
- Structural Hazards
  - Conflict in use of datapath component
- Data Hazards
  - Need to wait for result of a previous instruction
- Control Hazards
  - Address of next instruction uncertain/unknown
  - Branch and jump delay slots