



# Pipelined Processor Design

Instructor: Huzefa Rangwala, PhD

CS 465

Fall 2012

# Review on Single Cycle Datapath

- Subset of the core MIPS ISA
  - Arithmetic/Logic instructions: AND, OR, ADD, SUB, SLT
  - Data flow instructions: LW, SW
  - Branch instructions: BEQ, J
- Five steps in processor design
  - Analyze the instruction
  - Determine the datapath components
  - Assemble the components
  - Determine the control
  - Design the control unit

# Complete Single Cycle Datapath



How lw, sw, R-Type, beq, j instructions work?

# Delays in Single Cycle Datapath



What are the delays for **lw, sw, R-Type, beq, j** instructions?

# Single Cycle Implementation

- Calculate cycle time assuming negligible delays except:
  - memory (2ns), ALU and adders (2ns), register file access (1ns)

| Instruction class | Instruction Fetch | Register Access | ALU | Register/ Memory Access | Register Access |   |
|-------------------|-------------------|-----------------|-----|-------------------------|-----------------|---|
| R-Type            | X                 | X               | X   | R                       |                 | 6 |
| Load              | X                 | X               | X   | M                       | X               | 8 |
| Store             | X                 | X               | X   | M                       |                 | 7 |
| Branch            | X                 | X               | X   |                         |                 | 5 |
| Jump              | X                 |                 |     |                         |                 | 2 |

# Remarks on Single Cycle Datapath

- Single cycle datapath ensures the execution of any instruction within one clock cycle
  - Functional units must be duplicated if used multiple times by one instruction, e.g. ALU **Why?**
  - Functional units can be shared if used by different instructions
- Single cycle datapath is not efficient in time
  - Clock cycle time is determined by the instruction taking the longest time, eg. **lw in MIPS**
  - Variable clock cycle time is too complicated
- Alternative design/implementation approaches
  - Multiple clock cycles per instruction –

# Outline

- Today's topic
  - Pipelining is an implementation technique in which multiple instructions are overlapped in execution
  - Subset of MIPS instructions
    - lw, sw, and, or, add, sub, slt, beq
- Outline
  - Pipeline high-level introduction
    - Stages, hazards
  - Pipelined datapath and control design

# Pipelining is Natural!

- Laundry example

- Ann, Brian, Cathy, Dave each has one load of clothes to wash, dry, and fold



- Washer takes 30 minutes



- Dryer takes 40 minutes



- “Folder” takes 20 minutes



# Sequential Laundry



- Sequential laundry takes 6 hours for 4 loads
- If they learned pipelining, how long would laundry take?

# Pipelined Laundry



- Start work ASAP
- Pipelined laundry takes 3.5 hours for 4 loads

# Pipelining Lessons (I)



- Multiple tasks operating simultaneously using different resources
- Pipelining doesn't help latency of single task, it helps throughput of entire workload
- Pipeline rate is limited by slowest pipeline stage
  - Unbalanced lengths of pipeline stages reduces speedup

# Pipelining Lessons (II)



- Potential speedup = Number pipeline stages
- Time to “fill” pipeline and time to “drain” it reduces speedup- startup and wind down
- Stall for dependencies

# MIPS Pipeline

- Five stages, one step per stage
  1. IF: Instruction fetch from memory
  2. ID: Instruction decode & register read
  3. EX: Execute operation or calculate address
  4. MEM: Access memory operand
  5. WB: Write result back to register

# Pipeline Performance

- Assume time for stages is
  - 100ps for register read or write
  - 200ps for other stages
- Compare pipelined datapath with single-cycle datapath

| Instr    | Instr fetch | Register read | ALU op | Memory access | Register write | Total time |
|----------|-------------|---------------|--------|---------------|----------------|------------|
| lw       | 200ps       | 100 ps        | 200ps  | 200ps         | 100 ps         | 800ps      |
| sw       | 200ps       | 100 ps        | 200ps  | 200ps         |                | 700ps      |
| R-format | 200ps       | 100 ps        | 200ps  |               | 100 ps         | 600ps      |
| beq      | 200ps       | 100 ps        | 200ps  |               |                | 500ps      |

# Pipeline Performance

Single-cycle ( $T_c = 800\text{ps}$ )



Pipelined ( $T_c = 200\text{ps}$ )



# Pipeline Speedup

- If all stages are balanced
  - i.e., all take the same time
  - Time between instructions<sub>pipelined</sub>  
= Time between instructions<sub>nonpipelined</sub>
- If not balanced, speedup is less
- Speedup due to increased throughput
  - Latency (time for each instruction) does not decrease

# Pipelining and ISA Design

- MIPS ISA designed for pipelining
  - All instructions are 32-bits
    - Easier to fetch and decode in one cycle
    - c.f. x86: 1- to 17-byte instructions
  - Few and regular instruction formats
    - Can decode and read registers in one step
  - Load/store addressing
    - Can calculate address in 3<sup>rd</sup> stage, access memory in 4<sup>th</sup> stage
  - Alignment of memory operands
    - Memory access takes only one cycle

# Hazards

- Situations that prevent starting the next instruction in the next cycle
- Structure hazards
  - A required resource is busy
- Data hazard
  - Need to wait for previous instruction to complete its data read/write
- Control hazard
  - Deciding on control action depends on previous instruction

# Structure Hazards

- Conflict for use of a resource
- In MIPS pipeline with a single memory
  - Load/store requires data access
  - Instruction fetch would have to *stall* for that cycle
    - Would cause a pipeline “bubble”
- Hence, pipelined datapaths require separate instruction/data memories
  - Or separate instruction/data caches

# Structural Hazard: One Memory

Time (clock cycles)



- Solution 1: add more HW
- Hazards can always be resolved by waiting

# Structural Hazard: One Memory

Time (clock cycles)



- Hazards can always be resolved by waiting

# Data Hazard Example

- Data hazard: an instruction depends on the result of a previous instruction still in the pipeline

**add r1 ,r2,r3**

**sub r4, r1 ,r3**

**and r6, r1 ,r7**

**or r8, r1 ,r9**

**xor r10, r1 ,r11**

# Data Hazard Example

- Dependences backward in time are hazards



- Compilers can help, but it gets messy and difficult

# Data Hazard Solution



- Solution : “forward” result from one stage to another

# Data Hazard Even with Forwarding



- Can't go back in time! Must delay/stall instruction dependent on loads

# Data Hazard Even with Forwarding



- Must delay/stall instruction dependent on loads
- Sometimes the instruction sequence can be reordered to avoid pipeline stalls

# Code Scheduling to Avoid Stalls

- Reorder code to avoid use of load result in the next instruction
- C code for  $A = B + E; C = B + F;$



# Control Hazards

- Branch determines flow of control
  - Fetching next instruction depends on branch outcome
  - Pipeline can't always fetch correct instruction
    - Still working on ID stage of branch
- In MIPS pipeline
  - Need to compare registers and compute target early in the pipeline
  - Add hardware to do it in ID stage

# Stall on Branch

- Wait until branch outcome determined before fetching next instruction



# Branch Prediction

- Longer pipelines can't readily determine branch outcome early
  - Stall penalty becomes unacceptable
- Predict outcome of branch
  - Only stall if prediction is wrong
- In MIPS pipeline
  - Can predict branches not taken
  - Fetch instruction after branch, with no delay

# Control Hazard Solution: Predict

- Predict: guess one direction then back up if wrong
- Impact: 0 lost cycles per branch instruction if right, 1 if wrong
  - Need to “Squash” and restart following instruction if wrong
- Prediction scheme
  - Random prediction: correct - 50% of time
  - History-based prediction: correct- 90% of time

# MIPS with Predict Not Taken



# More-Realistic Branch Prediction

- Static branch prediction
  - Based on typical branch behavior
  - Example: loop and if-statement branches
    - Predict backward branches taken
    - Predict forward branches not taken
- Dynamic branch prediction
  - Hardware measures actual branch behavior
    - e.g., record recent history of each branch
  - Assume future behavior will continue the trend
    - When wrong, stall while re-fetching, and update history

# Pipeline Summary

## The BIG Picture

- Pipelining improves performance by increasing instruction throughput
  - Executes multiple instructions in parallel
  - Each instruction has the same latency
- Subject to hazards
  - Structure, data, control
- Instruction set design affects complexity of pipeline implementation

# MIPS Pipelined Datapath



# Pipeline registers

- Need registers between stages
  - To hold information produced in previous cycle



# Pipeline Operation

- Cycle-by-cycle flow of instructions through the pipelined datapath
  - “Single-clock-cycle” pipeline diagram
    - Shows pipeline usage in a single cycle
    - Highlight resources used
  - c.f. “multi-clock-cycle” diagram
    - Graph of operation over time
- We’ll look at “single-clock-cycle” diagrams for load & store

# IF for Load, Store, ...



# ID for Load, Store, ...



# EX for Load



# MEM for Load



# WB for Load

Iw  
Write back



# Corrected Datapath for Load



# EX for Store



# MEM for Store



# WB for Store

SW  
Write-back



# Multi-Cycle Pipeline Diagram

- Form showing resource usage



# Multi-Cycle Pipeline Diagram

- Traditional form



# Single-Cycle Pipeline Diagram

- State of pipeline in a given cycle



# Pipelined Control (Simplified)



# Overview on Datapath Control

|                 |               |  |           |           |            |             |
|-----------------|---------------|--|-----------|-----------|------------|-------------|
| <b>op</b>       | 00 0000       |  | 10 0011   | 10 1011   | 00 0100    | 00 0010     |
|                 | <b>R-type</b> |  | <b>lw</b> | <b>sw</b> | <b>beq</b> | <b>jump</b> |
| <b>RegDst</b>   | I             |  | 0         | x         | x          | x           |
| <b>ALUSrc</b>   | 0             |  | I         | I         | 0          | x           |
| <b>MemtoReg</b> | 0             |  | I         | x         | x          | x           |
| <b>RegWrite</b> | I             |  | I         | 0         | 0          | 0           |
| <b>MemWrite</b> | 0             |  | 0         | I         | 0          | 0           |
| <b>Branch</b>   | 0             |  | 0         | 0         | I          | 0           |
| <b>Jump</b>     | 0             |  | 0         | 0         | 0          | I           |
| <b>ALUop</b>    | “R-type”      |  | Add       | Add       | Subtract   | xx          |

- For the subset of instructions under consideration
  - ALUOp = 00 for Add, 01 for Sub, and 10 for R-type

# Observations

- No write control for all pipeline registers and PC since they are updated at every clock cycle
- To specify the control for the pipeline, set the control values during each pipeline stage
- Control lines can be divided into 5 groups:
  - IF –                NONE
  - ID                –                NONE
  - ALU                –                RegDst, ALUOp, ALUSrc
  - MEM                –                Branch, MemRead, MemWrite
  - WB                –                MemtoReg, RegWrite
- Group these nine control lines into 3 subsets:
  - ALUControl, MEMControl, WBControl
- Control signals are generated at ID stage, how to pass them to other stages?

# Pipelined Control

- Control signals derived from instruction
  - As in single-cycle implementation



# Pipelined Control

PCSrc



# Add Forwarding Paths



b. With forwarding

# Exceptions and Interrupts

- “Unexpected” events requiring change in flow of control
  - Different ISAs use the terms differently
- Exception
  - Arises within the CPU
    - e.g., undefined opcode, overflow, syscall, ...
- Interrupt
  - From an external I/O controller
- Dealing with them without sacrificing performance is hard

# Handling Exceptions

- In MIPS, exceptions managed by a System Control Coprocessor (CP0)
- Save PC of offending (or interrupted) instruction
  - In MIPS: Exception Program Counter (EPC)
- Save indication of the problem
  - In MIPS: Cause register
  - We'll assume 1-bit
    - 0 for undefined opcode, 1 for overflow
- Jump to handler at 8000 00180

# An Alternate Mechanism

- Vectored Interrupts
  - Handler address determined by the cause
- Example:
  - Undefined opcode: C000 0000
  - Overflow: C000 0020
  - ....: C000 0040
- Instructions either
  - Deal with the interrupt, or
  - Jump to real handler

# Handler Actions

- Read cause, and transfer to relevant handler
- Determine action required
- If restartable
  - Take corrective action
  - use EPC to return to program
- Otherwise
  - Terminate program
  - Report error using EPC, cause, ...

# Exceptions in a Pipeline

- Another form of control hazard
- Consider overflow on add in EX stage
  - add \$1, \$2, \$1
    - Prevent \$1 from being clobbered
    - Complete previous instructions
    - Flush add and subsequent instructions
    - Set Cause and EPC register values
    - Transfer control to handler
- Similar to mispredicted branch
  - Use much of the same hardware

# Pipeline with Exceptions



# Exception Properties

- Restartable exceptions
  - Pipeline can flush the instruction
  - Handler executes, then returns to the instruction
    - Refetched and executed from scratch
- PC saved in EPC register
  - Identifies causing instruction
  - Actually PC + 4 is saved
    - Handler must adjust

# Exception Example

- Exception on `add` in

```
40      sub   $11, $2, $4
44      and   $12, $2, $5
48      or    $13, $2, $6
4C      add   $1, $2, $1
50      slt   $15, $6, $7
54      lw    $16, 50($7)
```

...

- Handler

```
80000180      sw    $25, 1000($0)
80000184      sw    $26, 1004($0)
```

...

# Exception Example



# Exception Example



# Multiple Exceptions

- Pipelining overlaps multiple instructions
  - Could have multiple exceptions at once
- Simple approach: deal with exception from earliest instruction
  - Flush subsequent instructions
  - “Precise” exceptions
- In complex pipelines
  - Multiple instructions issued per cycle
  - Out-of-order completion
  - Maintaining precise exceptions is difficult!

# Instruction-Level Parallelism (ILP)

- Pipelining: executing multiple instructions in parallel
- To increase ILP
  - Deeper pipeline
    - Less work per stage  $\Rightarrow$  shorter clock cycle
  - Multiple issue
    - Replicate pipeline stages  $\Rightarrow$  multiple pipelines
    - Start multiple instructions per clock cycle
    - CPI < 1, so use Instructions Per Cycle (IPC)
    - E.g., 4GHz 4-way multiple-issue
      - 16 BIPS, peak CPI = 0.25, peak IPC = 4
    - But dependencies reduce this in practice

# Multiple Issue

- Static multiple issue
  - Compiler groups instructions to be issued together
  - Packages them into “issue slots”
  - Compiler detects and avoids hazards
- Dynamic multiple issue
  - CPU examines instruction stream and chooses instructions to issue each cycle
  - Compiler can help by reordering instructions
  - CPU resolves hazards using advanced techniques at runtime

# Speculation

- “Guess” what to do with an instruction
  - Start operation as soon as possible
  - Check whether guess was right
    - If so, complete the operation
    - If not, roll-back and do the right thing
- Common to static and dynamic multiple issue
- Examples
  - Speculate on branch outcome
    - Roll back if path taken is different
  - Speculate on load
    - Roll back if location is updated

# Compiler/Hardware Speculation

- Compiler can reorder instructions
  - e.g., move load before branch
  - Can include “fix-up” instructions to recover from incorrect guess
- Hardware can look ahead for instructions to execute
  - Buffer results until it determines they are actually needed
  - Flush buffers on incorrect speculation

# Static Multiple Issue

- Compiler groups instructions into “issue packets”
  - Group of instructions that can be issued on a single cycle
  - Determined by pipeline resources required
- Think of an issue packet as a very long instruction
  - Specifies multiple concurrent operations
  - ⇒ Very Long Instruction Word (VLIW)

# Scheduling Static Multiple Issue

- Compiler must remove some/all hazards
  - Reorder instructions into issue packets
  - No dependencies with a packet
  - Possibly some dependencies between packets
    - Varies between ISAs; compiler must know!
  - Pad with nop if necessary

# MIPS with Static Dual Issue

- Two-issue packets
  - One ALU/branch instruction
  - One load/store instruction
  - 64-bit aligned
    - ALU/branch, then load/store
    - Pad an unused instruction with nop

| Address | Instruction type | Pipeline Stages |    |    |     |     |     |    |
|---------|------------------|-----------------|----|----|-----|-----|-----|----|
| n       | ALU/branch       | IF              | ID | EX | MEM | WB  |     |    |
| n + 4   | Load/store       | IF              | ID | EX | MEM | WB  |     |    |
| n + 8   | ALU/branch       |                 | IF | ID | EX  | MEM | WB  |    |
| n + 12  | Load/store       |                 | IF | ID | EX  | MEM | WB  |    |
| n + 16  | ALU/branch       |                 |    | IF | ID  | EX  | MEM | WB |
| n + 20  | Load/store       |                 |    | IF | ID  | EX  | MEM | WB |

# MIPS with Static Dual Issue



# Hazards in the Dual-Issue MIPS

- More instructions executing in parallel
- EX data hazard
  - Forwarding avoided stalls with single-issue
  - Now can't use ALU result in load/store in same packet
    - add \$t0, \$s0, \$s1  
load \$s2, 0(\$t0)
    - Split into two packets, effectively a stall
- More aggressive scheduling required

# Thought Question:

- In our simple single-issue five-stage pipeline we have a use latency of one clock cycle --- prevents one instruction from using the result without stalling.
- How many extra stalls may be needed for loads or even ALU in a 2-issue 5-stage pipeline?
  - Remember we had zero stalls for ALU instructions (zero-use latency) in the single issue pipeline.

# Scheduling Example

- Schedule this for dual-issue MIPS

```
Loop: lw    $t0, 0($s1)      # $t0=array element
       addu $t0, $t0, $s2      # add scalar in $s2
       sw    $t0, 0($s1)      # store result
       addi $s1, $s1,-4        # decrement pointer
       bne  $s1, $zero, Loop  # branch $s1!=0
```

|       | ALU/branch              | Load/store          | cycle |
|-------|-------------------------|---------------------|-------|
| Loop: | nop                     | lw    \$t0, 0(\$s1) | 1     |
|       | addi \$s1, \$s1,-4      | nop                 | 2     |
|       | addu \$t0, \$t0, \$s2   | nop                 | 3     |
|       | bne  \$s1, \$zero, Loop | sw    \$t0, 4(\$s1) | 4     |

- $\text{IPC} = 5/4 = 1.25$  (c.f. peak IPC = 2)

# Loop Unrolling

- Replicate loop body to expose more parallelism
  - Reduces loop-control overhead
- Use different registers per replication
  - Called “register renaming”
  - Avoid loop-carried “anti-dependencies”
    - Store followed by a load of the same register
    - Aka “name dependence”
      - Reuse of a register name

# Loop Unrolling Example

|       | ALU/branch             | Load/store        | cycle |
|-------|------------------------|-------------------|-------|
| Loop: | addi \$s1, \$s1,-16    | lw \$t0, 0(\$s1)  | 1     |
|       | nop                    | lw \$t1, 12(\$s1) | 2     |
|       | addu \$t0, \$t0, \$s2  | lw \$t2, 8(\$s1)  | 3     |
|       | addu \$t1, \$t1, \$s2  | lw \$t3, 4(\$s1)  | 4     |
|       | addu \$t2, \$t2, \$s2  | sw \$t0, 16(\$s1) | 5     |
|       | addu \$t3, \$t4, \$s2  | sw \$t1, 12(\$s1) | 6     |
|       | nop                    | sw \$t2, 8(\$s1)  | 7     |
|       | bne \$s1, \$zero, Loop | sw \$t3, 4(\$s1)  | 8     |

- $\text{IPC} = 14/8 = 1.75$ 
  - Closer to 2, but at cost of registers and code size

# Dynamic Multiple Issue

- “Superscalar” processors
- CPU decides whether to issue 0, 1, 2, ... each cycle
  - Avoiding structural and data hazards
- Avoids the need for compiler scheduling
  - Though it may still help
  - Code semantics ensured by the CPU

# Dynamic Pipeline Scheduling

- Allow the CPU to execute instructions out of order to avoid stalls
  - But commit result to registers in order
- Example

```
lw      $t0, 20($s2)
addu   $t1, $t0, $t2
sub    $s4, $s4, $t3
slti   $t5, $s4, 20
```

- Can start sub while addu is waiting for lw

# Dynamically Scheduled CPU



# Register Renaming

- Reservation stations and reorder buffer effectively provide register renaming
- On instruction issue to reservation station
  - If operand is available in register file or reorder buffer
    - Copied to reservation station
    - For the issuing instruction, register copy of the operand is no longer written.
  - If operand is not yet available
    - It will be provided to the reservation station by a function unit bypassing the register file.
    - Register update may not be required

# Speculation

- Predict branch and continue issuing
  - Don't commit until branch outcome determined
- Load speculation
  - Avoid load and cache miss delay
    - Predict the effective address
    - Predict loaded value
    - Load before completing outstanding stores
    - Bypass stored values to load unit
  - Don't commit load until speculation cleared

# Why Do Dynamic Scheduling?

- Why not just let the compiler schedule code?
- Not all stalls are predictable
  - e.g., cache misses
- Can't always schedule around branches
  - Branch outcome is dynamically determined
- Different implementations of an ISA have different latencies and hazards

# Does Multiple Issue Work?

## The BIG Picture

- Yes, but not as much as we'd like
- Programs have real dependencies that limit ILP
- Some dependencies are hard to eliminate
  - e.g., pointer aliasing
- Some parallelism is hard to expose
  - Limited window size during instruction issue
- Memory delays and limited bandwidth
  - Hard to keep pipelines full
- Speculation can help if done well

# Power Efficiency

- Complexity of dynamic scheduling and speculations requires power
- Multiple simpler cores may be better

| Microprocessor | Year | Clock Rate | Pipeline Stages | Issue width | Out-of-order/ Speculation | Cores | Power |
|----------------|------|------------|-----------------|-------------|---------------------------|-------|-------|
| i486           | 1989 | 25MHz      | 5               | 1           | No                        | 1     | 5W    |
| Pentium        | 1993 | 66MHz      | 5               | 2           | No                        | 1     | 10W   |
| Pentium Pro    | 1997 | 200MHz     | 10              | 3           | Yes                       | 1     | 29W   |
| P4 Willamette  | 2001 | 2000MHz    | 22              | 3           | Yes                       | 1     | 75W   |
| P4 Prescott    | 2004 | 3600MHz    | 31              | 3           | Yes                       | 1     | 103W  |
| Core           | 2006 | 2930MHz    | 14              | 4           | Yes                       | 2     | 75W   |
| UltraSparc III | 2003 | 1950MHz    | 14              | 4           | No                        | 1     | 90W   |
| UltraSparc T1  | 2005 | 1200MHz    | 6               | 1           | No                        | 8     | 70W   |

# The Opteron X4 Microarchitecture



# The Opteron X4 Pipeline Flow

- For integer operations



- FP is 5 stages longer
- Up to 106 RISC-ops in progress
- Bottlenecks
  - Complex instructions with long dependencies
  - Branch mispredictions
  - Memory access delays

# Concluding Remarks

- ISA influences design of datapath and control
- Datapath and control influence design of ISA
- Pipelining improves instruction throughput using parallelism
  - More instructions completed per second
  - Latency for each instruction not reduced
- Hazards: structural, data, control
- Multiple issue and dynamic scheduling (ILP)
  - Dependencies limit achievable parallelism
  - Complexity leads to the power wall