



## Chapter 4 The Processor



### Introduction

§4.1 Introduction

- CPU performance factors
  - Instruction count
    - Determined by ISA and compiler
  - CPI and Cycle time
    - Determined by CPU hardware
- We will examine two MIPS implementations
  - A simplified version
  - A more realistic pipelined version
- Simple subset, shows most aspects
  - Memory reference: lw, sw
  - Arithmetic/logical: add, sub, and, or, slt
  - Control transfer: beq, j



## What Hardware to Implement?

1. Immediate addressing



`addi $s0,$s1,100`

2. Register addressing



`add $s0,$s1,$s3`

3. Base addressing



`lw $s0,4($s1)`

4. PC-relative addressing



`beq $s0,$s1,L1`

5. Pseudodirect addressing



`J L2`

3/2/2016 Chapter 4-The Processor- RST@NTHU

## 32-bit MIPS Instruction Set

|        | 31        | 26                    | 21        | 16               | 11           | 6            | 0 |  |  |  |  |
|--------|-----------|-----------------------|-----------|------------------|--------------|--------------|---|--|--|--|--|
| R-type | <b>op</b> | <b>rs</b>             | <b>rt</b> | <b>rd</b>        | <b>shamt</b> | <b>funct</b> |   |  |  |  |  |
|        | 6 bits    | 5 bits                | 5 bits    | 5 bits           | 5 bits       | 6 bits       |   |  |  |  |  |
| I-type | <b>op</b> | <b>rs</b>             | <b>rt</b> | <b>immediate</b> |              |              |   |  |  |  |  |
|        | 6 bits    | 5 bits                | 5 bits    | 16 bits          |              |              |   |  |  |  |  |
| J-type | <b>op</b> | <b>target address</b> |           |                  |              |              |   |  |  |  |  |
|        | 6 bits    | 26 bits               |           |                  |              |              |   |  |  |  |  |

### The different fields are:

|                       |                                   |
|-----------------------|-----------------------------------|
| <b>op</b>             | operation of the instruction      |
| <b>rs, rt, rd</b>     | source and destination register   |
| <b>shamt</b>          | shift amount                      |
| <b>funct</b>          | selects variant of the “op” field |
| <b>Immediate</b>      | address / constant                |
| <b>target address</b> | target address of jump            |

## A MIPS Subset

- R-Type:

- add rd, rs, rt
- sub rd, rs, rt
- and rd, rs, rt
- or rd, rs, rt
- slt rd, rs, rt

- Load/Store:

- lw rt,rs,imm16
- sw rt,rs,imm16

- Imm operand:

- addi rt,rs,imm16

- Branch:

- beq rs,rt,imm16

- Jump:

- j target



3/2/2016 Chapter 4-The Processor- RST@NTHU

## Instruction Execution

- PC → instruction memory, fetch instruction
- Register numbers → register file, read registers
- Depending on instruction class
  - Use ALU to calculate
    - Arithmetic result
    - Memory address for load/store
    - Branch target address
  - Access data memory for load/store
  - PC ← target address or PC + 4

3/2/2016 Chapter 4-The Processor- RST@NTHU

# CPU Overview



302

3/2/2016 Chapter 4-The Processor- RST@NTHU

## Multiplexers

- Can't just join wires together
  - Use multiplexers



3/2/2016

Chapter 4-The Processor- RST@NTHU



## Control



3/2/2016 Chapter 4-The Processor- RST@NTHU



## Logic Design Basics

- Information encoded in binary
  - Low voltage = 0, High voltage = 1
  - One wire per bit
  - Multi-bit data encoded on multi-wire buses
- Combinational element
  - Operate on data
  - Output is a function of input
- State (sequential) elements
  - Store information

§4.2 Logic Design Conventions

3/2/2016 Chapter 4-The Processor- RST@NTHU



## Combinational Elements

- AND-gate

- $Y = A \& B$



- Adder

- $Y = A + B$



- Multiplexer

- $Y = S ? A : B$



- Arithmetic/Logic Unit

- $Y = F(A, B)$



3/2/2016 Chapter 4-The Processor- RST@NTHU



## Sequential Elements

- Register: stores data in a circuit

  - Uses a clock signal to determine when to update the stored value
  - Rising edge-triggered: update when Clk changes from 0 to 1



305

3/2/2016 Chapter 4-The Processor- RST@NTHU

## Sequential Elements

- Register with write control
  - Only updates on clock edge when write control input is 1
  - Used when stored value is required later



3/2/2016 Chapter 4-The Processor- RST@NTHU

## Clocking Methodology

- Combinational logic transforms data during clock cycles
  - Between clock edges
  - Input from state elements, output to state element
  - Longest delay determines clock period



306

3/2/2016 Chapter 4-The Processor- RST@NTHU

## Building a Datapath

- Datapath
  - Elements that process data and addresses in the CPU
    - Registers, ALUs, mux's, memories, ...
- We will build a MIPS datapath incrementally
  - Refining the overview design

239

3/2/2016 Chapter 4-The Processor- RST@NTHU

## Instruction Fetch



241

3/2/2016



## R-Format Instructions

- Read two register operands
- Perform arithmetic/logical operation
- Write register result
- $R[rd] \leftarrow R[rs] \text{ op } R[rt]$  Ex: add rd, rs, rt
  - Ra, Rb, Rw come from inst.'s rs, rt, and rd fields
  - ALU and RegWrite: control logic after decode

|    |    |    |    |       |       |
|----|----|----|----|-------|-------|
| op | rs | rt | rd | shamt | funct |
|----|----|----|----|-------|-------|



241

3/2/2016 Chapter 4-The Processor- RST@NTHU



## Load/Store Instructions

- Read register operands
- Calculate address using 16-bit offset
  - Use ALU, but sign-extend offset
- Load: Read memory and update register
- Store: Write register value to memory
- $R[rt] \leftarrow \text{Mem}[R[rs] + \text{SignExt}[\text{imm16}]]$  Ex: lw rt,rs,imm16

|    |    |    |                     |
|----|----|----|---------------------|
| op | rs | rt | constant or address |
|----|----|----|---------------------|



243

3/2/2016 Chapter 4-The Processor- RST@NTHU



## Branch Instructions

op | rs | rt | constant or address

- Read register operands
- Compare operands
  - Use ALU, subtract and check Zero output
- Calculate target address
  - Sign-extend displacement
  - Shift left 2 places (word displacement)
  - Add to PC + 4
    - Already calculated by instruction fetch

242

3/2/2016 Chapter 4-The Processor- RST@NTU



## Branch Instructions



beq rs, rt, imm16

op | rs | rt | constant or address

3/2/2016 Chapter 4-The Processor- RST@NTU



## Composing the Elements

- First-cut data path does one instruction in one clock cycle
  - Each datapath element can only do one function at a time
  - Hence, we need separate instruction and data memories
- Use multiplexers where alternate data sources are used for different instructions

313

3/2/2016 Chapter 4-The Processor- RST@NTHU



## R-Type/Load/Store Datapath



246

3/2/2016 Chapter 4-The Processor- RST@NTHU

## Full Datapath



3/2/2016 Chapter 4-The Processor- RST@NTHU

## Data Flow for add rd, rs, rt



3/2/2016 Chapter 4-The Processor- RST@NTHU



## §4.4 A Simple Implementation Scheme

## ALU Control

000000 | rs | rt | rd | shamt | funct



- ALU used for
  - Load/Store:  $F = \text{add}$
  - Branch:  $F = \text{subtract}$
  - R-type:  $F$  depends on funct field

| ALU control | Function         |
|-------------|------------------|
| 0000        | AND              |
| 0001        | OR               |
| 0010        | add              |
| 0110        | subtract         |
| 0111        | set-on-less-than |
| 1100        | NOR              |

247

3/2/2016 Chapter 4-The Processor- RST@NTHU

## A 1-bit ALU: and, or, add



C-32

3/2/2016 Chapter 4-The Processor- RST@NTHU



## ALU Control

- Assume 2-bit ALUOp derived from opcode
  - Combinational logic derives ALU control



| opcode | ALUOp | Operation        | funct  | ALU function     | ALU control |
|--------|-------|------------------|--------|------------------|-------------|
| lw     | 00    | load word        | xxxxxx | add              | 0010        |
| sw     | 00    | store word       | xxxxxx | add              | 0010        |
| beq    | 01    | branch equal     | xxxxxx | subtract         | 0110        |
| R-type | 10    | add              | 100000 | add              | 0010        |
|        |       | subtract         | 100010 | subtract         | 0110        |
|        |       | AND              | 100100 | AND              | 0000        |
|        |       | OR               | 100101 | OR               | 0001        |
|        |       | set-on-less-than | 101010 | set-on-less-than | 0111        |

3/2/2016 Chapter 4-The Processor- RST@NTHU



## The Main Control Unit

- Control signals derived from instruction



3/2/2016 Chapter 4-The Processor- RST@NTHU

## Datapath With Control



253

3/2/2016 Chapter 4-The Processor- RST@NTHU

## R-Type Instruction



254

add  
sub  
and  
or  
slt  
...

3/2/2016 Chapter 4-The Processor- RST@NTHU

## Load Instruction lw \$s1, 4(\$t0)



255

3/2/2016 Chapter 4-The Processor- RST@NTHU

## Branch-on-Equal Instruction



256

3/2/2016 Chapter 4-The Processor- RST@NTHU

## Implementing Jumps



- Jump uses word address
- Update PC with concatenation of
  - Top 4 bits of old PC  $\text{PC}_{31\ldots 28}$
  - 26-bit jump address  $\text{Address}_{26\text{ bits}}$
  - 00
- Need an extra control signal decoded from opcode

258

3/2/2016 Chapter 4-The Processor- RST@NTHU

## Datapath With Jumps Added



259

3/2/2016 Chapter 4-The Processor- RST@NTHU



## Performance Issues

- Longest delay determines clock period
  - Critical path: load instruction
  - Instruction memory → register file → ALU → data memory → register file
- Not feasible to vary period for different instructions
- Violates design principle
  - Making the common case fast
- We will improve performance by **pipelining**

259

3/2/2016 Chapter 4-The Processor- RST@NTHU

### §4.5 An Overview of Pipelining



260

3/2/2016 Chapter 4-The Processor- RST@NTHU

- Four loads:
  - $4 * 4 = 16$  hours
- Pipeline non-stop:
  - Speedup  
 $= (4 * 4) / 7 = 2.3$
  - $4n / (1n + 3) \approx 4$   
= number of stages

## Ford's Assembly Line



3/2/2016 Chapter 4-The Processor- RST@NTHU

## Multicycle Implementation

- Balance the work amount of each partition



3/2/2016 Chapter 4-The Processor- RST@NTHU

## Partition Single-Cycle Datapath

- Add intermediate registers



## MIPS Pipeline

Five stages, one step per stage

- IF:** Instruction fetch from memory
- ID:** Instruction decode & register read
- EX:** Execute operation or calculate address
- MEM:** Access memory operand
- WB:** Write result back to register

## Micro-Instructions

| Step name | Action for R-type instructions | Action for memory-reference instructions                 |                                                                             | Action for branches              | Action for jumps |
|-----------|--------------------------------|----------------------------------------------------------|-----------------------------------------------------------------------------|----------------------------------|------------------|
| IF        |                                | IR = Memory[PC]<br>PC = PC + 4                           |                                                                             |                                  |                  |
| ID        |                                | A = Reg [IR[25-21]]<br>B = Reg [IR[20-16]]               |                                                                             |                                  |                  |
| EXE       | ALUOut = A op B                | ALUOut = A + sign-extend (IR[15-0])                      | if (A == B) then ALUOut = PC + (sign-extend (IR[15-0]) << 2)<br>PC = ALUOut | PC = PC [31-28]    (IR[25-0]<<2) |                  |
| MEM       |                                | Load: MDR = Memory[ALUOut] or Store: Memory [ALUOut] = B |                                                                             |                                  |                  |
| WB        | Reg [IR[15-11]] = ALUOut       | Load: Reg[IR[20-16]] = MDR                               |                                                                             |                                  |                  |

3/2/2016 Chapter 4-The Processor RST©NTHU

## Pipeline Performance

- Assume time for stages is
  - 100ps for register read or write
  - 200ps for other stages
- Compare pipelined datapath with single-cycle datapath

| Instr    | Instr fetch | Register read | ALU op | Memory access | Register write | Total time |
|----------|-------------|---------------|--------|---------------|----------------|------------|
| lw       | 200ps       | 100 ps        | 200ps  | 200ps         | 100 ps         | 800ps      |
| sw       | 200ps       | 100 ps        | 200ps  | 200ps         |                | 700ps      |
| R-format | 200ps       | 100 ps        | 200ps  |               | 100 ps         | 600ps      |
| beq      | 200ps       | 100 ps        | 200ps  |               |                | 500ps      |

263

3/2/2016 Chapter 4-The Processor RST©NTHU



## Pipeline Performance



264

3/2/2016 Chapter 4-The Processor- RST@NTHU



## Pipeline Speedup

- If all stages are balanced
  - i.e., all take the same time
  - Time between instructions<sub>pipelined</sub> = Time between instructions<sub>nonpipelined</sub> / Number of stages
- If not balanced, speedup is less
- Speedup due to increased throughput
  - Latency (time for each instruction) does not decrease

3/2/2016 Chapter 4-The Processor- RST@NTHU

# Pipelining and ISA Design

- MIPS ISA designed for pipelining
  - All instructions are 32-bits
    - Easier to fetch and decode in one cycle
    - c.f. x86: 1- to 17-byte instructions
  - Few and regular instruction formats
    - Can decode and read registers in one step
  - Load/store addressing
    - Can calculate address in 3<sup>rd</sup> stage, access memory in 4<sup>th</sup> stage (IF, ID, EXE, MEM, WB)
  - Alignment of memory operands
    - Memory access takes only one cycle

3/2/2016 Chapter 4-The Processor- RST@NTHU

# Hazards

- Situations that prevent starting next instruction in next cycle
- **Structure** hazards
  - A required resource is busy
- **Data** hazard
  - Need to wait for previous instruction to complete its data read/write
- **Control** hazard
  - Deciding on control action depends on previous instruction

265

3/2/2016 Chapter 4-The Processor- RST@NTHU



## Structure Hazards

- Conflict for use of a resource
- In MIPS pipeline with a single memory
  - Load/store requires data access
  - Instruction fetch would have to *stall* for that cycle
    - Would cause a pipeline “bubble”
- Hence, pipelined datapaths require **separate** instruction/data memories
  - Or separate instruction/data caches

265

3/2/2016 Chapter 4-The Processor- RST@NTHU



## Data Hazards

- An instruction depends on completion of data access by a previous instruction
  - add \$s0, \$t0, \$t1
  - sub \$t2, \$s0, \$t3



266

3/2/2016 Chapter 4-The Processor- RST@NTHU

## Forwarding (aka Bypassing)

- Use result when it is computed
  - Don't wait for it to be stored in a register
  - Requires extra connections in the datapath



266

3/2/2016 Chapter 4-The Processor- RST@NTHU

## Load-Use Data Hazard

- Cannot always avoid stalls by forwarding
  - If value not computed when needed
  - Cannot forward in time!



268

3/2/2016 Chapter 4-The Processor- RST@NTHU

## Code Scheduling to Avoid Stalls

- Reorder code to avoid use of load result in the next instruction
- C code for  $A = B + E; C = B + F;$



268

3/2/2016 Chapter 4-The Processor- RST@NTHU

## Control Hazards

- Branch determines flow of control
  - Fetching next instruction depends on branch outcome
  - Pipeline can't always fetch correct instruction
    - Still working on ID stage of branch
- In MIPS pipeline
  - Need to compare registers and compute target early in the pipeline
  - Add hardware to do it in ID stage

269

3/2/2016 Chapter 4-The Processor- RST@NTHU

## Check EQ using XNOR



| A | B | Y |
|---|---|---|
| 0 | 0 | 1 |
| 0 | 1 | 0 |
| 1 | 0 | 0 |
| 1 | 1 | 1 |

3/2/2016 Chapter 4-The Processor- RST@NTU

## Stall on Branch

- Wait until branch outcome determined before fetching next instruction



270

3/2/2016 Chapter 4-The Processor- RST@NTU



## Branch Prediction

- Longer pipelines can't readily determine branch outcome early
  - Stall penalty becomes unacceptable
- Predict outcome of branch
  - Only stall if prediction is wrong
- In MIPS pipeline
  - Can predict branches not taken
  - Fetch instruction after branch, with no delay

271

3/2/2016 Chapter 4-The Processor- RST@NTHU



## MIPS with Predict Not Taken



272

3/2/2016 Chapter 4-The Processor- RST@NTHU



## More-Realistic Branch Prediction

- Static branch prediction
  - Based on typical branch behavior
  - Example: loop and if-statement branches
    - Predict backward branches taken
    - Predict forward branches not taken
- Dynamic branch prediction
  - Hardware measures actual branch behavior
    - e.g., record recent history of each branch
  - Assume future behavior will continue the trend
    - When wrong, stall while re-fetching, and update history



3/2/2016 Chapter 4-The Processor- RST@NTHU



## Pipeline Summary

### The BIG Picture

- Pipelining improves performance by increasing instruction throughput
  - Executes multiple instructions in parallel
  - Each instruction has the same latency
- Subject to hazards
  - Structure, data, control
- Instruction set design affects complexity of pipeline implementation

273

3/2/2016 Chapter 4-The Processor- RST@NTHU

## §4.6 Pipelined Datapath and Control



# Pipeline Operation

- Cycle-by-cycle flow of instructions through the pipelined datapath
  - “Single-clock-cycle” pipeline diagram
    - Shows pipeline usage in a single cycle
    - Highlight resources used
  - c.f. “multi-clock-cycle” diagram
    - Graph of operation over time
- We'll look at “single-clock-cycle” diagrams for load & store

3/2/2016 Chapter 4-The Processor- RST@NTHU

## IF for Load, Store, ...



279

3/2/2016 Chapter 4-The Processor- RST@NTHU

## ID for Load, Store, ...

lw  
Instruction decode



279

3/2/2016 Chapter 4-The Processor- RST@NTHU

## EX for Load

lw  
Execution



280

3/2/2016 Chapter 4-The Processor- RST@NTHU

## MEM for Load



281

3/2/2016 Chapter 4-The Processor- RST@NTHU

## WB for Load



281

3/2/2016 Chapter 4-The Processor- RST@NTHU

## Corrected Datapath for Load



284

3/2/2016 Chapter 4-The Processor- RST@NTHU

## EX for Store

sw \$t0, 16(\$s1)



282

3/2/2016 Chapter 4-The Processor- RST@NTHU

## MEM for Store

sw \$t0, 16(\$s1)



283

3/2/2016 Chapter 4-The Processor- RST@NTHU

## WB for Store



283

3/2/2016 Chapter 4-The Processor- RST@NTHU



## Multi-Cycle Pipeline Diagram

- Form showing **resource** usage



## Multi-Cycle Pipeline Diagram

- Traditional** form



## Multiple-Cycle Pipeline Diagram

- State of pipeline in a given cycle



287

3/2/2016 Chapter 4-The Processor- RST@NTHU

## Pipelined Control (Simplified)



289

3/2/2016 Chapter 4-The Processor- RST@NTHU

## Pipelined Control

- Control signals derived from instruction
  - As in single-cycle implementation



291

3/2/2016 Chapter 4-The Processor- RST@NTHU

## Pipelined Control (cont.)

- Signals for EX (ExtOp, ALUSrc, ...) are used 1 cycle later
- Signals for MEM (MemWr, Branch) are used 2 cycles later
- Signals for WB (MemtoReg, RegWr) are used 3 cycles later



3/2/2016 Chapter 4-The Processor- RST@NTHU

## Pipelined Control



292

3/2/2016 Chapter 4-The Processor- RST@NTHU

## Data Hazards in ALU Instructions

§4.7 Data Hazards Forwarding vs. Stalling

- Consider this sequence:
 

```
sub $2, $1,$3
and $12,$2,$5
or $13,$6,$2
add $14,$2,$2
sw $15,100($2)
```
- We can resolve hazards with forwarding
  - How do we detect when to forward?

291

3/2/2016 Chapter 4-The Processor- RST@NTHU

## Data Dependency & Availability

- R-type      

|    |    |    |    |       |       |
|----|----|----|----|-------|-------|
| op | rs | rt | rd | shamt | funct |
|----|----|----|----|-------|-------|

  - Inputs: rs, rt @ ID/EX
  - Output: rd @ EX/MEM
- I-type      

|    |    |    |                     |  |  |
|----|----|----|---------------------|--|--|
| op | rs | rt | constant or address |  |  |
|----|----|----|---------------------|--|--|

  - lw
    - Inputs: rs @ ID/EX
    - Output: rt @ MEM/WB
  - SW:
    - Inputs: rs @ ID/EX, rt @ EX/MEM
  - branch
    - Inputs: rs, rt @ ID
    - Output: PC @ ID (move from EX)

3/2/2016 Chapter 4-The Processor- RST@NTHU

## Dependencies & Forwarding



3/2/2016 Chapter 4-The Processor- RST@NTHU



## Detecting the Need to Forward

- Pass register numbers along pipeline
  - e.g., ID/EX.RegisterRs = register number for Rs sitting in ID/EX pipeline register
- ALU operand register numbers in EX stage are given by
  - ID/EX.RegisterRs, ID/EX.RegisterRt
- Data hazards when

- |                                                                                                                                                                              |                                                                  |
|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------|
| 1a. EX/MEM.RegisterRd = ID/EX.RegisterRs<br>1b. EX/MEM.RegisterRd = ID/EX.RegisterRt<br>2a. MEM/WB.RegisterRd = ID/EX.RegisterRs<br>2b. MEM/WB.RegisterRd = ID/EX.RegisterRt | Fwd from EX/MEM pipeline reg<br><br>Fwd from MEM/WB pipeline reg |
|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------|



3/2/2016 Chapter 4-The Processor- RST@NTHU



## Detecting the Need to Forward

- But only if forwarding instruction will write to a register!
  - EX/MEM.RegWrite, MEM/WB.RegWrite
- And only if Rd for that instruction is not \$zero
  - EX/MEM.RegisterRd ≠ 0, MEM/WB.RegisterRd ≠ 0

3/2/2016 Chapter 4-The Processor- RST@NTHU

## Forwarding Paths



b. With forwarding

297

3/2/2016 Chapter 4-The Processor- RST@NTHU

## Forwarding Conditions

- EX hazard (EX/MEM)
  - if (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)  
and (EX/MEM.RegisterRd = ID/EX.RegisterRs))  
**ForwardA = 10**
  - if (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)  
and (EX/MEM.RegisterRd = ID/EX.RegisterRt))  
**ForwardB = 10**
- MEM hazard (MEM/WB)
  - if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0)  
and (MEM/WB.RegisterRd = ID/EX.RegisterRs))  
**ForwardA = 01**
  - if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0)  
and (MEM/WB.RegisterRd = ID/EX.RegisterRt))  
**ForwardB = 01**

296,298

3/2/2016 Chapter 4-The Processor- RST@NTHU



## Double Data Hazard

- Consider the sequence:  
$$\begin{array}{l} \text{add \$1,\$1,\$2} \\ \text{add \$1,\$1,\$3} \\ \text{add \$1,\$1,\$4} \end{array}$$
 } } \*
- Both hazards occur
  - Want to use the most recent
- Revise MEM hazard condition
  - fwd only if EX hazard condition isn't true

298

3/2/2016 Chapter 4-The Processor- RST@NTHU



## Revised Forwarding Condition

- MEM hazard
  - if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0)  
and not (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)  
and (EX/MEM.RegisterRd = ID/EX.RegisterRs))  
and (MEM/WB.RegisterRd = ID/EX.RegisterRs))  
ForwardA = 01
  - if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0)  
and not (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)  
and (EX/MEM.RegisterRd = ID/EX.RegisterRt))  
and (MEM/WB.RegisterRd = ID/EX.RegisterRt))  
ForwardB = 01

299

3/2/2016 Chapter 4-The Processor- RST@NTHU

## Datapath with Forwarding



299

3/2/2016 Chapter 4-The Processor- RST@NTHU

## Load-Use Data Hazard



301

3/2/2016 Chapter 4-The Processor- RST@NTHU



## Load-Use Hazard Detection

- Check when using instruction is decoded in ID stage
- ALU operand register numbers in ID stage are given by
  - IF/ID.RegisterRs, IF/ID.RegisterRt
- Load-use hazard when
  - ID/EX.MemRead and  
 $((ID/EX.RegisterRt = IF/ID.RegisterRs)$   
or  
 $(ID/EX.RegisterRt = IF/ID.RegisterRt))$
- If detected, stall and insert bubble

3/2/2016 Chapter 4-The Processor- RST@NTHU



## How to Stall the Pipeline

- Force control values in ID/EX register to 0
  - EX, MEM and WB do **nop** (no-operation)
- Prevent update of PC and IF/ID register
  - Using instruction is decoded again
  - Following instruction is fetched again
  - 1-cycle stall allows MEM to read data for 1w
    - Can subsequently forward to EX stage

3/2/2016 Chapter 4-The Processor- RST@NTHU



## Stall/Bubble in the Pipeline



3/2/2016 Chapter 4-The Processor- RST@NTHU



## Stall/Bubble in the Pipeline



3/2/2016 Chapter 4-The Processor- RST@NTHU

## Datapath with Hazard Detection



## Stalls and Performance

### The BIG Picture

- Stalls reduce performance
  - But are required to get correct results
- Compiler can arrange code to avoid hazards and stalls
  - Requires knowledge of the pipeline structure

## Branch Hazards

- If branch outcome determined in MEM



304

## Reducing Branch Delay

- Move hardware to determine outcome to ID stage
  - Target address adder
  - Register comparator
- Example: branch taken

```

36: sub $10, $4, $8
40: beq $1, $3, 7 # (40+4)+7*4=72
44: and $12, $2, $5
48: or $13, $2, $6
52: add $14, $4, $2
56: slt $15, $6, $7
      ...
72: lw $4, 50($7)
  
```



Move  
comparator  
to ID

## Example: Branch Taken 1/2



308

3/2/2016 Chapter 4-The Processor- RST@NTHU



## Example: Branch Taken 2/2



308

3/2/2016 Chapter 4-The Processor- RST@NTHU

## Data Hazards for Branches

- If a comparison register is a destination of 2<sup>nd</sup> or 3<sup>rd</sup> preceding ALU instruction



- Can resolve using forwarding

3/2/2016 Chapter 4-The Processor- RST@NTHU

## Data Hazards for Branches

- If a comparison register is a destination of preceding ALU instruction or 2<sup>nd</sup> preceding load instruction
  - Need 1 stall cycle



(what if branch address is calculated using ALU?)

3/2/2016

Chapter 4-The Processor- RST@NTHU

## Data Hazards for Branches

- If a comparison register is a destination of immediately preceding load instruction
  - Need 2 stall cycles



3/2/2016 Chapter 4-The Processor- RST@NTHU

## Dynamic Branch Prediction

- In deeper and superscalar pipelines, branch penalty is more significant
- Use dynamic prediction
  - Branch prediction buffer (aka branch history table)
  - Indexed by recent branch instruction addresses
  - Stores outcome (taken/not taken)
  - To execute a branch
    - Check table, expect the same outcome
    - Start fetching from fall-through or target
    - If wrong, flush pipeline and flip prediction

309

3/2/2016 Chapter 4-The Processor- RST@NTHU

## 1-Bit Predictor: Shortcoming

- Inner loop branches mispredicted twice!



- Mispredict as taken on last iteration of inner loop
- Then mispredict as not taken on first iteration of inner loop next time around

3/2/2016 Chapter 4-The Processor- RST@NTHU

## 2-Bit Predictor

- Only change prediction on two successive mispredictions



310

3/2/2016 Chapter 4-The Processor- RST@NTHU

## Calculating the Branch Target

- Even with predictor, still need to calculate the target address
  - 1-cycle penalty for a taken branch
- Branch target buffer
  - Cache of target addresses
  - Indexed by PC when instruction fetched
    - If hit and instruction is branch predicted taken, can fetch target immediately

3/2/2016 Chapter 4-The Processor- RST@NTHU

§4.9 Exceptions

## Exceptions and Interrupts

- “Unexpected” events requiring change in flow of control
  - Different ISAs use the terms differently
- Exception
  - Arises within the CPU
    - e.g., undefined opcode, overflow, syscall, ...
- Interrupt
  - From an external I/O controller
- Dealing with them without sacrificing performance is hard

313

3/2/2016 Chapter 4-The Processor- RST@NTHU



## Handling Exceptions

- In MIPS, exceptions managed by a System Control Coprocessor (CP0)
- Save PC of offending (or interrupted) instruction
  - In MIPS: Exception Program Counter (EPC)
- Save indication of the problem
  - In MIPS: Cause register
  - We'll assume 1-bit
    - 0 for undefined opcode, 1 for overflow
- Jump to handler at 8000 0180

314

3/2/2016 Chapter 4-The Processor- RST@NTHU



## An Alternate Mechanism

- Vectored Interrupts
  - Handler address determined by the cause
- Example:
  - Undefined opcode: C000 0000
  - Overflow: C000 0020
  - ....: C000 0040
- Instructions either
  - Deal with the interrupt, or
  - Jump to real handler

3/2/2016 Chapter 4-The Processor- RST@NTHU

## Handler Actions

- Read cause, and transfer to relevant handler
- Determine action required
- If restartable
  - Take corrective action
  - use EPC to return to program
- Otherwise
  - Terminate program
  - Report error using EPC, cause, ...

3/2/2016 Chapter 4-The Processor- RST@NTHU

## Exceptions in a Pipeline

- Another form of control hazard
- Consider overflow on add in EX stage  
**add \$1, \$2, \$1**
  - Prevent \$1 from being clobbered
  - **Complete previous instructions**
  - Flush add and subsequent instructions
  - Set Cause and EPC register values
  - Transfer control to handler
- Similar to mispredicted branch
  - Use much of the same hardware

3/2/2016 Chapter 4-The Processor- RST@NTHU

## Pipeline with Exceptions



317

3/2/2016 Chapter 4-The Processor- RST@NTHU

## Exception Properties

- Restartable exceptions
  - Pipeline can flush the instruction
  - Handler executes, then returns to the instruction
    - Refetched and executed from scratch
- PC saved in EPC register
  - Identifies causing instruction
  - Actually PC + 4 is saved
    - Handler must adjust

3/2/2016 Chapter 4-The Processor- RST@NTHU



## Exception Example

- Exception on **add** in

```

40    sub   $11, $2, $4
44    and   $12, $2, $5
48    or    $13, $2, $6
4C  add  $1, $2, $1
50    slt   $15, $6, $7
54    lw    $16, 50($7)
...

```

- Handler

```

80000180  sw    $26, 1000($0)
80000184  sw    $27, 1004($0)
...

```

|    |    |        |
|----|----|--------|
| 26 | k0 | for OS |
| 27 | k1 | kernel |

316

3/2/2016 Chapter 4-The Processor- RST@NTHU



## Exception Example



318

3/2/2016 Chapter 4-The Processor- RST@NTHU

## Exception Example



3/2/2016 Chapter 4-The Processor- RST@NTHU

## Multiple Exceptions

- Pipelining overlaps multiple instructions
  - Could have multiple exceptions at once
- Simple approach: deal with exception of the **earliest** instruction
  - Flush subsequent instructions
  - “Precise” exceptions
- In complex pipelines
  - Multiple instructions issued per cycle
  - Out-of-order completion
  - Maintaining precise exceptions is difficult!

3/2/2016 Chapter 4-The Processor- RST@NTHU



## Imprecise Exceptions

- Just stop pipeline and save state
  - Including exception cause(s)
- Let the handler work out
  - Which instruction(s) had exceptions
  - Which to complete or flush
    - May require “manual” completion
- Simplifies hardware, but more complex handler software
- Not feasible for complex multiple-issue out-of-order pipelines

3/2/2016 Chapter 4-The Processor- RST©NTHU



## Instruction-Level Parallelism (ILP)

- Pipelining: executing multiple instructions in parallel
- To increase ILP
  - Deeper pipeline
    - Less work per stage  $\Rightarrow$  shorter clock cycle
  - Multiple issue
    - Replicate pipeline stages  $\Rightarrow$  multiple pipelines
    - Start multiple instructions per clock cycle
    - CPI < 1, so use Instructions Per Cycle (IPC)
    - E.g., 4GHz 4-way multiple-issue
      - 16 BIPS, peak CPI = 0.25, peak IPC = 4
    - But dependencies reduce this in practice

§4.10 Parallelism and Advanced Instruction Level Parallelism

3/2/2016 Chapter 4-The Processor- RST©NTHU



## Multiple Issue

- **Static** multiple issue
  - Compiler groups instructions to be issued together
  - Packages them into “issue slots”
  - Compiler detects and avoids hazards
- **Dynamic** multiple issue
  - CPU examines instruction stream and chooses instructions to issue each cycle
  - Compiler can help by reordering instructions
  - CPU resolves hazards using advanced techniques at runtime

320

3/2/2016 Chapter 4-The Processor- RST@NTHU



## Speculation

- “Guess” what to do with an instruction
  - Start operation as soon as possible
  - Check whether guess was right
    - If so, complete the operation
    - If not, roll-back and do the right thing
- Common to static and dynamic multiple issue
- Examples
  - Speculate on branch outcome
    - Roll back if path taken is different
  - Speculate on load
    - Roll back if location is updated

321

3/2/2016 Chapter 4-The Processor- RST@NTHU

## Compiler/Hardware Speculation

- Compiler can reorder instructions
  - e.g., move load before branch
  - Can include “fix-up” instructions to recover from incorrect guess
- Hardware can look ahead for instructions to execute
  - Buffer results until it determines they are actually needed
  - Flush buffers on incorrect speculation

3/2/2016 Chapter 4-The Processor- RST@NTHU

## Speculation and Exceptions

- What if exception occurs on a speculatively executed instruction?
  - e.g., speculative load before null-pointer check
- Static speculation
  - Can add ISA support for deferring exceptions
- Dynamic speculation
  - Can buffer exceptions until instruction completion (which may not occur)

3/2/2016 Chapter 4-The Processor- RST@NTHU



## Static Multiple Issue

- Compiler groups instructions into “issue packets”
  - Group of instructions that can be issued on a single cycle
  - Determined by pipeline resources required
- Think of an issue packet as a very long instruction
  - Specifies multiple concurrent operations
  - ⇒ Very Long Instruction Word (VLIW)

322

3/2/2016 Chapter 4-The Processor- RST@NTHU



## Scheduling Static Multiple Issue

- Compiler must remove some/all hazards
  - Reorder instructions into issue packets
  - No dependencies with a packet
  - Possibly some dependencies between packets
    - Varies between ISAs; compiler must know!
  - Pad with nop if necessary

3/2/2016 Chapter 4-The Processor- RST@NTHU



## MIPS with Static Dual Issue

- Two-issue packets
  - One ALU/branch instruction
  - One load/store instruction
  - 64-bit aligned instruction
    - ALU/branch, then load/store
    - Pad an unused instruction with nop

| Address | Instruction type | Pipeline Stages |    |    |     |     |     |    |  |
|---------|------------------|-----------------|----|----|-----|-----|-----|----|--|
|         |                  | IF              | ID | EX | MEM | WB  |     |    |  |
| n       | ALU/branch       |                 |    |    |     |     |     |    |  |
| n + 4   | Load/store       | IF              | ID | EX | MEM | WB  |     |    |  |
| n + 8   | ALU/branch       |                 | IF | ID | EX  | MEM | WB  |    |  |
| n + 12  | Load/store       |                 | IF | ID | EX  | MEM | WB  |    |  |
| n + 16  | ALU/branch       |                 |    | IF | ID  | EX  | MEM | WB |  |
| n + 20  | Load/store       |                 |    | IF | ID  | EX  | MEM | WB |  |

323

3/2/2016 Chapter 4-The Processor- RST@NTHU



## MIPS with Static Dual Issue



324

3/2/2016 Chapter 4-The Processor- RST@NTHU



## Hazards in the Dual-Issue MIPS

- More instructions executing in parallel
- EX data hazard
  - Forwarding avoided stalls with single-issue
  - Now can't use ALU result in load/store in same packet
    - add \$t0, \$s0, \$s1  
load \$s2, 0(\$t0)
    - Split into two packets, effectively a stall
- Load-use hazard
  - Still one cycle use latency, but now two instructions
- More aggressive scheduling required

3/2/2016 Chapter 4-The Processor- RST@NTHU



## Scheduling Example

- Schedule this for dual-issue MIPS

```
Loop: lw    $t0, 0($s1)      # $t0=array element
      addu $t0, $t0, $s2      # add scalar in $s2
      sw    $t0, 0($s1)      # store result
      addi $s1, $s1,-4        # decrement pointer
      bne  $s1, $zero, Loop   # branch $s1!=0
```

|       | ALU/branch             | Load/store       | cycle |
|-------|------------------------|------------------|-------|
| Loop: | nop                    | lw \$t0, 0(\$s1) | 1     |
|       | addi \$s1, \$s1,-4     | nop              | 2     |
|       | addu \$t0, \$t0, \$s2  | nop              | 3     |
|       | bne \$s1, \$zero, Loop | sw \$t0, 4(\$s1) | 4     |

- IPC = 5/4 = 1.25 (c.f. peak IPC = 2)

325

3/2/2016 Chapter 4-The Processor- RST@NTHU



## Loop Unrolling

- Replicate loop body to expose more parallelism
  - Reduces loop-control overhead
- Use different registers per replication
  - Called “register renaming”
  - Avoid loop-carried “anti-dependencies”
    - Store followed by a load of the same register
    - Aka “name dependence”
      - Reuse of a register name

326

3/2/2016 Chapter 4-The Processor- RST@NTHU



## Loop Unrolling Example

|       | ALU/branch             | Load/store        | cycle |
|-------|------------------------|-------------------|-------|
| Loop: | addi \$s1, \$s1,-16    | lw \$t0, 0(\$s1)  | 1     |
|       | nop                    | lw \$t1, 12(\$s1) | 2     |
|       | addu \$t0, \$t0, \$s2  | lw \$t2, 8(\$s1)  | 3     |
|       | addu \$t1, \$t1, \$s2  | lw \$t3, 4(\$s1)  | 4     |
|       | addu \$t2, \$t2, \$s2  | sw \$t0, 16(\$s1) | 5     |
|       | addu \$t3, \$t4, \$s2  | sw \$t1, 12(\$s1) | 6     |
|       | nop                    | sw \$t2, 8(\$s1)  | 7     |
|       | bne \$s1, \$zero, Loop | sw \$t3, 4(\$s1)  | 8     |

- IPC = 14/8 = 1.75
  - Closer to 2, but at cost of registers and code size

326

3/2/2016 Chapter 4-The Processor- RST@NTHU



## Dynamic Multiple Issue

- “Superscalar” processors
- CPU decides whether to issue 0, 1, 2, ... each cycle
  - Avoiding structural and data hazards
- Avoids the need for compiler scheduling
  - Though it may still help
  - Code semantics ensured by the CPU

327

3/2/2016 Chapter 4-The Processor- RST@NTHU



## Dynamic Pipeline Scheduling

- Allow the CPU to execute instructions **out of order** to avoid stalls
  - But commit result to registers in order
- Example

```
lw      $t0, 20($s2)
addu   $t1, $t0, $t2
sub    $s4, $s4, $t3
slti   $t5, $s4, 20
```

  - Can start sub while addu is waiting for lw

327

3/2/2016 Chapter 4-The Processor- RST@NTHU



3/2/2016 Chapter 4-The Processor- RST@NTHU

## Register Renaming

- Reservation stations and reorder buffer effectively provide register renaming
- On instruction issue to reservation station
  - If operand is available in register file or reorder buffer
    - Copied to reservation station
    - No longer required in the register; can be overwritten
  - If operand is not yet available
    - It will be provided to the reservation station by a function unit
    - Register update may not be required



400

3/2/2016 Chapter 4-The Processor- RST@NTHU

## Speculation

- Predict branch and continue issuing
  - Don't commit until branch outcome determined
- Load speculation
  - Avoid load and cache miss delay
    - Predict the effective address
    - Predict loaded value
    - Load before completing outstanding stores
    - Bypass stored values to load unit
  - Don't commit load until speculation cleared

3/2/2016 Chapter 4-The Processor- RST@NTHU

## Why Do Dynamic Scheduling?

- Why not just let the compiler schedule code?
- Not all stalls are predictable
  - e.g., cache misses
- Can't always schedule around branches
  - Branch outcome is dynamically determined
- Different implementations of an ISA have different latencies and hazards

3/2/2016 Chapter 4-The Processor- RST@NTHU



# Does Multiple Issue Work?

## The BIG Picture

- Yes, but **not** as much as we'd like
- Programs have real dependencies that limit ILP
- Some dependencies are hard to eliminate
  - e.g., pointer aliasing
- Some parallelism is hard to expose
  - Limited window size during instruction issue
- Memory delays and limited bandwidth
  - Hard to keep pipelines full
- Speculation can help if done well

3/2/2016 Chapter 4-The Processor- RST@NTHU



# Power Efficiency

- Complexity of dynamic scheduling and speculations requires power
- Multiple simpler cores may be better

| Microprocessor | Year | Clock Rate | Pipeline Stages | Issue width | Out-of-order/Speculation | Cores | Power |
|----------------|------|------------|-----------------|-------------|--------------------------|-------|-------|
| i486           | 1989 | 25MHz      | 5               | 1           | No                       | 1     | 5W    |
| Pentium        | 1993 | 66MHz      | 5               | 2           | No                       | 1     | 10W   |
| Pentium Pro    | 1997 | 200MHz     | 10              | 3           | Yes                      | 1     | 29W   |
| P4 Willamette  | 2001 | 2000MHz    | 22              | 3           | Yes                      | 1     | 75W   |
| P4 Prescott    | 2004 | 3600MHz    | 31              | 3           | Yes                      | 1     | 103W  |
| Core           | 2006 | 2930MHz    | 14              | 4           | Yes                      | 2     | 75W   |
| UltraSparc III | 2003 | 1950MHz    | 14              | 4           | No                       | 1     | 90W   |
| UltraSparc T1  | 2005 | 1200MHz    | 6               | 1           | No                       | 8     | 70W   |

3/2/2016 Chapter 4-The Processor- RST@NTHU

### §4.11 Real Stuff: The ARM Cortex-A8 and Intel Core i7 Pipelines

## Cortex A8 and Intel i7

| Processor                             | ARM A8                 | Intel Core i7 920                     |
|---------------------------------------|------------------------|---------------------------------------|
| Market                                | Personal Mobile Device | Server, cloud                         |
| Thermal design power                  | 2 Watts                | 130 Watts                             |
| Clock rate                            | 1 GHz                  | 2.66 GHz                              |
| Cores/Chip                            | 1                      | 4                                     |
| Floating point?                       | No                     | Yes                                   |
| Multiple issue?                       | Dynamic                | Dynamic                               |
| Peak instructions/clock cycle         | 2                      | 4                                     |
| Pipeline stages                       | 14                     | 14                                    |
| Pipeline schedule                     | Static in-order        | Dynamic out-of-order with speculation |
| Branch prediction                     | 2-level                | 2-level                               |
| 1 <sup>st</sup> level caches/core     | 32 KiB I, 32 KiB D     | 32 KiB I, 32 KiB D                    |
| 2 <sup>nd</sup> level caches/core     | 128-1024 KiB           | 256 KiB                               |
| 3 <sup>rd</sup> level caches (shared) | -                      | 2- 8 MB                               |

3/2/2016 Chapter 4-The Processor- RST@NTHU

## ARM Cortex-A8 Pipeline



3/2/2016 Chapter 4-The Processor- RST@NTHU

# ARM Cortex-A8 Performance



3/2/2016 Chapter 4-The Processor- RST@NTHU

# Core i7 Pipeline



3/2/2016 Chapter 4-The Processor- RST@NTHU

# Core i7 Performance



3/2/2016 Chapter 4-The Processor- RST@NTHU

## §4.12 Instruction-Level Parallelism and Matrix Multiply

### Matrix Multiply

#### Unrolled C code

```

1 #include <x86intrin.h>
2 #define UNROLL (4)
3
4 void dgemm ( int n, double* A, double* B, double* C)
5 {
6     for ( int i = 0; i < n; i+=UNROLL*4 )
7         for ( int j = 0; j < n; j++ ) {
8             __m256d c[4];
9             for ( int x = 0; x < UNROLL; x++ )
10                 c[x] = _mm256_load_pd(C+i+x*4+j*n);
11
12            for( int k = 0; k < n; k++ )
13            {
14                __m256d b = _mm256_broadcast_sd(B+k+j*n);
15                for (int x = 0; x < UNROLL; x++)
16                    c[x] = _mm256_add_pd(c[x],
17                                         _mm256_mul_pd(_mm256_load_pd(A+n*k+x*4+i), b));
18            }
19
20            for ( int x = 0; x < UNROLL; x++ )
21                _mm256_store_pd(C+i+x*4+j*n, c[x]);
22        }
23    }

```

3/2/2016 Chapter 4-The Processor- RST@NTHU



## Matrix Multiply

### Assembly code:

```

1 vmovapd (%r11),%ymm4
2 mov %rbx,%rax
3 xor %ecx,%ecx
4 vmovapd 0x20(%r11),%ymm3
5 vmovapd 0x40(%r11),%ymm2
6 vmovapd 0x60(%r11),%ymm1
7 vbroadcastsd (%rcx,%r9,1),%ymm0
8 add $0x8,%rcx # register %rcx = %rcx + 8
9 vmulpd (%rax),%ymm0,%ymm5
10 vaddpd %ymm5,%ymm4,%ymm4
11 vmulpd 0x20(%rax),%ymm0,%ymm5
12 vaddpd %ymm5,%ymm3,%ymm3
13 vmulpd 0x40(%rax),%ymm0,%ymm5
14 vmulpd 0x60(%rax),%ymm0,%ymm0
15 add %r8,%rax
16 cmp %r10,%rcx
17 vaddpd %ymm5,%ymm2,%ymm2
18 vaddpd %ymm0,%ymm1,%ymm1
19 jne 68 <dgemm+0x68>
20 add $0x1,%esi
21 vmovapd %ymm4,(%r11)
22 vmovapd %ymm3,0x20(%r11)
23 vmovapd %ymm2,0x40(%r11)
24 vmovapd %ymm1,0x60(%r11)

# Load 4 elements of C into %ymm4
# register %rax = %rbx
# register %ecx = 0
# Load 4 elements of C into %ymm3
# Load 4 elements of C into %ymm2
# Load 4 elements of C into %ymm1
# Make 4 copies of B element
# Parallel mul %ymm1,4 A elements
# Parallel add %ymm5, %ymm4
# Parallel mul %ymm1,4 A elements
# Parallel add %ymm5, %ymm3
# Parallel mul %ymm1,4 A elements
# Parallel mul %ymm1,4 A elements
# register %rax = %rax + %r8
# compare %r8 to %rax
# Parallel add %ymm5, %ymm2
# Parallel add %ymm0, %ymm1
# jump if not %r8 != %rax
# register %esi = %esi + 1
# Store %ymm4 into 4 C elements
# Store %ymm3 into 4 C elements
# Store %ymm2 into 4 C elements
# Store %ymm1 into 4 C elements

```

3/2/2016 Chapter 4-The Processor- RST@NTHU



## Performance Impact



3/2/2016 Chapter 4-The Processor- RST@NTHU

## Fallacies

- Pipelining is easy (!)
  - The basic idea is easy
  - The devil is in the details
    - e.g., detecting data hazards
- Pipelining is independent of technology
  - So why haven't we always done pipelining?
  - More transistors make more advanced techniques feasible
  - Pipeline-related ISA design needs to consider technology trends
    - e.g., predicated instructions

407

3/2/2016 Chapter 4-The Processor- RST©NTHU

## Pitfalls

- Poor ISA design can make pipelining harder
  - e.g., complex instruction sets (VAX, IA-32)
    - Significant overhead to make pipelining work
    - IA-32 micro-op approach
  - e.g., complex addressing modes
    - Register update side effects, memory indirection
  - e.g., delayed branches
    - Advanced pipelines have long delay slots

407

3/2/2016 Chapter 4-The Processor- RST©NTHU

## Concluding Remarks

- ISA influences design of datapath and control
- Datapath and control influence design of ISA
- Pipelining improves instruction throughput using parallelism
  - More instructions completed per second
  - Latency for each instruction not reduced
- Hazards: structural, data, control
- Multiple issue and dynamic scheduling (ILP)
  - Dependencies limit achievable parallelism
  - Complexity leads to the power wall