



# Chapter 4

## The Processor

# Introduction

- CPU performance factors (  $T = I \times CPI \times T_c$  )
  - Instruction count
    - Determined by ISA and compiler
  - CPI and Cycle time
    - Determined by CPU hardware
- We will examine two RISC-V implementations
  - A simplified version (**limited functionality**)
  - A more realistic pipelined version
- **Simple subset**, shows most aspects
  - Memory reference: ld, sd
  - Arithmetic/logical: add, sub, and, or
  - Control transfer: beq

# Instruction Execution

- PC → instruction memory, fetch instruction
- Register numbers → register file, read registers
- Depending on instruction class
  - Use ALU to calculate
    - Arithmetic result
    - Memory address for load/store
    - Branch comparison
  - Access data memory for load/store
  - PC ← target address or PC + 4

# CPU Overview



# CPU Overview

Two important aspects **of instruction execution** are missing in this design:

1. Data from two different sources cannot join a **single link**, this can be done using **multiplexers**
2. Units' activities must be controlled using control lines (**Control signals**) through the **control unit**.

# Multiplexers



# Control unit (Control signals)



# Logic Design Basics

- Information encoded in binary
  - Low voltage = 0, High voltage = 1
  - One wire per bit
  - Multi-bit data encoded on multi-wire buses
- Combinational element
  - Operate on data
  - Output is a function of input
- State (sequential) elements
  - Store information

# Combinational Elements

- AND-gate

- $Y = A \& B$



- Multiplexer

- $Y = S ? I_1 : I_0$



- Adder

- $Y = A + B$



- Arithmetic/Logic Unit

- $Y = F(A, B)$



# Sequential Elements

- Register: stores data in a circuit
  - Uses a clock signal to determine when to update the stored value
  - Edge-triggered: update when Clk changes from 0 to 1



# Sequential Elements

- Register with write control
  - Only updates on clock edge when write control input is 1
  - Used when stored value is required later



# Clocking Methodology

- Combinational logic transforms data during clock cycles
  - Between clock edges
  - Input from state elements, output to state element
  - Longest delay determines clock period



# Building a Datapath

- Datapath
  - Elements that process data and addresses in the CPU
    - Registers, ALUs, mux's, memories, ...
  - We will build a RISC-V datapath incrementally
    - Refining the overview design

# Instruction Fetch



# R-Format Instructions

- Read two register operands (**comb**)
- Perform arithmetic/logical operation
- Write register result (**State**)
- Read and write in the same clock



# Load/Store Instructions

- Read register operands
- Calculate address using 12-bit offset
  - Use ALU, but sign-extend offset
- Load: Read memory and update register
- Store: Write register value to memory



# Branch Instructions

- Read register operands
- Compare operands
  - Use ALU, subtract and check Zero output
- Calculate target address
  - Sign-extend displacement
  - Shift left 1 place (halfword displacement)
  - Add to PC value

# Branch Instructions



# Composing the Elements

- First-cut data path does an instruction in one clock cycle
  - Each datapath element can only do one function at a time
  - Hence, we need separate instruction and data memories
- Use multiplexers where alternate data sources are used for different instructions

# R-Type/Load/Store Datapath



# Full Datapath



# ALU Control

- ALU used for
  - Load/Store:  $F = \text{add}$
  - Branch:  $F = \text{subtract}$
  - R-type:  $F$  depends on opcode

| ALU control | Function |
|-------------|----------|
| 0000        | AND      |
| 0001        | OR       |
| 0010        | add      |
| 0110        | subtract |

# ALU Control

- Assume 2-bit ALUOp derived from opcode
  - Combinational logic derives ALU control

| opcode | ALUOp | Operation       | Opcode field | ALU function | ALU control |
|--------|-------|-----------------|--------------|--------------|-------------|
| ld     | 00    | load register   | XXXXXXXXXXXX | add          | 0010        |
| sd     | 00    | store register  | XXXXXXXXXXXX | add          | 0010        |
| beq    | 01    | branch on equal | XXXXXXXXXXXX | subtract     | 0110        |
| R-type | 10    | add             | 100000       | add          | 0010        |
|        |       | subtract        | 100010       | subtract     | 0110        |
|        |       | AND             | 100100       | AND          | 0000        |
|        |       | OR              | 100101       | OR           | 0001        |



# The Main Control Unit

## Control signals derived from instruction

|             | Name<br>(Bit position) | 31:25           | 24:20 | Fields<br>19:15 | 14:12  | 11:7          | 6:0    |
|-------------|------------------------|-----------------|-------|-----------------|--------|---------------|--------|
| (a) R-type  |                        | funct7          | rs2   | rs1             | funct3 | rd            | opcode |
| (b) I-type  |                        | immediate[11:0] |       | rs1             | funct3 | rd            | opcode |
| (c) S-type  |                        | immed[11:5]     | rs2   | rs1             | funct3 | immed[4:0]    | opcode |
| (d) SB-type |                        | immed[12,10:5]  | rs2   | rs1             | funct3 | immed[4:1,11] | opcode |

| ALUOp | Funct7 field |        |       |       |       |       |       |       |       |       |       |       | Funct3 field | Operation |
|-------|--------------|--------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|--------------|-----------|
|       | ALUOp1       | ALUOp0 | I[31] | I[30] | I[29] | I[28] | I[27] | I[26] | I[25] | I[14] | I[13] | I[12] |              |           |
| 0     | 0            | X      | X     | X     | X     | X     | X     | X     | X     | X     | X     | X     | 0010         |           |
| X     | 1            | X      | X     | X     | X     | X     | X     | X     | X     | X     | X     | X     | 0110         |           |
| 1     | X            | 0      | 0     | 0     | 0     | 0     | 0     | 0     | 0     | 0     | 0     | 0     | 0010         |           |
| 1     | X            | 0      | 1     | 0     | 0     | 0     | 0     | 0     | 0     | 0     | 0     | 0     | 0110         |           |
| 1     | X            | 0      | 0     | 0     | 0     | 0     | 0     | 0     | 0     | 1     | 1     | 1     | 0000         |           |
| 1     | X            | 0      | 0     | 0     | 0     | 0     | 0     | 0     | 0     | 1     | 1     | 0     | 0001         |           |



# Control Unit

| <u>Step #1</u>    | <u>Types</u>            | <u>Steps and Control signals</u>                                                                                                        |                            |                                      |
|-------------------|-------------------------|-----------------------------------------------------------------------------------------------------------------------------------------|----------------------------|--------------------------------------|
| Instruction fetch | Data Movement (ld / sd) | 1. Read operands (X and immediate)                                                                                                      | 2. Calculate address (ALU) | 3. Read or write from/to data memory |
|                   |                         | <b>Load: ALUSrc (1) , ALUOp (ADD), MemRead (1) , MemtoReg (1), RegWrite (1)</b><br><b>Store: ALUSrc (1) , ALUOp (ADD), MemWrite (1)</b> |                            |                                      |
|                   | ALU                     | 1. Read operands (2 Regs )                                                                                                              | 2. Calculate result (ALU)  | 3. Update Register                   |
|                   |                         | <b>ALUSrc (0) , ALUOp (XX) , MemtoReg (0), RegWrite (1)</b>                                                                             |                            |                                      |
|                   | Branch                  | 1. Read operands (2 Regs )                                                                                                              | 2. Compare (ALU subtract)  | 3. Select and update PC              |
|                   |                         | <b>Branch (1), ALUSrc (0), MUX PC select = (Zero (X) . Branch (1))</b>                                                                  |                            |                                      |

- All steps are occurred within the same clock cycle (control signals for those steps should be activated)

# Datapath With Control



# R-Type Instruction



# Load Instruction



# BEQ Instruction



# Performance Issues

- Longest delay determines clock period
  - Critical path: load instruction
  - Instruction memory → register file → ALU → data memory → register file
- Not feasible to vary period for different instructions
- Violates design principle
  - Making the common case fast
- We will improve performance by pipelining

# Instruction pipelining

Pipelining is a CPU implementation technique in which multiple instructions are overlapped in execution.

Instruction pipelining exploits Instruction-Level Parallelism (ILP)

An instruction execution pipeline involves a number of steps, where each step completes a part of an instruction. Each step is called **a pipeline stage** or a *pipeline segment*.

**The stages or steps are connected in a linear fashion:** one stage to the next to form the pipeline -- instructions enter at one end and progress through the stages and exit at the other end.

The time to move an instruction one step down the pipeline is equal to *the machine cycle* and is determined by the stage with the longest processing delay.

**Pipelining increases the CPU instruction throughput: The average number of instructions completed per cycle.**

**Under ideal conditions (no stall cycles), instruction throughput is one instruction per machine cycle, or ideal CPI = 1**

**The time needed to complete all processing steps of an instruction (also called instruction completion latency).**

Minimum instruction latency = n cycles, where n is the number of pipeline stages

# Pipelining Analogy (Example)

- Pipelined laundry: overlapping execution
  - Parallelism improves performance



- Four loads:
  - Speedup  
 $= 8/3.5 = 2.3$
- Non-stop:
  - Speedup  
 $= 2n/0.5n \approx 4$   
= number of stages

# RISC-V Pipeline

- Five stages, one step per stage
  1. IF: Instruction fetch from memory
  2. ID: Instruction decode & register read
  3. EX: Execute operation or calculate address
  4. MEM: Access memory operand
  5. WB: Write result back to register

# Pipeline Performance

- Assume time for stages is
  - 100ps for register read or write
  - 200ps for other stages
- Compare pipelined datapath with single-cycle datapath

| Instr    | Instr fetch | Register read | ALU op | Memory access | Register write | Total time |
|----------|-------------|---------------|--------|---------------|----------------|------------|
| ld       | 200ps       | 100 ps        | 200ps  | 200ps         | 100 ps         | 800ps      |
| sd       | 200ps       | 100 ps        | 200ps  | 200ps         |                | 700ps      |
| R-format | 200ps       | 100 ps        | 200ps  |               | 100 ps         | 600ps      |
| beq      | 200ps       | 100 ps        | 200ps  |               |                | 500ps      |



# Pipeline Performance



# Pipeline Speedup

- For the previous example (**Single cycle - unpipelined**) with 3 instructions, the total time for executing the instructions is  $3 * 800 = 2400$  ps
- For the (**pipelined**) example, the total time is 1400 ps
- Speedup =  $2400/1400$
- This is not the case in general (**large number of instructions**), if the pipeline is full then the time to execute an instruction in the pipelined example is equal to one cycle = 200.
- Speedup =  $\frac{\text{Average Time to execute an instruction (unpipelined)}}{\text{Average Time to execute an instruction (pipelined)}} = \frac{800}{200} = 4$

# Pipeline Speedup

## Example:

For an **unpipelined** CPU: **Clock cycle = 1ns**, 4 cycles for **ALU** operations and **branches** and 5 cycles for **memory** operations with instruction frequencies of **40%, 20% and 40%**, respectively.

If pipelining adds **0.2 ns** to the machine clock cycle, **calculate** the speedup in instruction execution from pipelining assuming one instruction will be executed per one clock cycle, CPI = 1.

## Solution:

Non-pipelined Average instruction execution time = Clock cycle x Average CPI

$$= 1 \text{ ns} \times ((40\% + 20\%) \times 4 + 40\% \times 5) = 1 \text{ ns} \times 4.4 = 4.4 \text{ ns}$$

In the pipelined implementation five stages are used with an average instruction execution time of:  $1 \text{ ns} + 0.2 \text{ ns} = 1.2 \text{ ns}$

**Speedup = Instruction time unpipelined /Instruction time pipelined**  
 $= 4.4 \text{ ns} / 1.2 \text{ ns} = 3.7 \text{ times faster}$

# Pipeline Speedup

- If all stages are balanced
  - i.e., all take the same time
  - Time between instructions<sub>pipelined</sub>  
= Time between instructions<sub>nonpipelined</sub>  
                                Number of stages
- If not balanced, speedup is less
- Speedup due to increased throughput
  - Latency (time for each instruction) does not decrease

# Pipelining and ISA Design

- RISC-V ISA designed for pipelining
  - All instructions are 32-bits
    - Easier to **fetch and decode** in one cycle
    - Compared to x86: 1- to 17-byte instructions (**hard**)
  - Few and regular instruction formats
    - Can decode and read registers in one step
  - Load/store addressing
    - Can calculate address in 3<sup>rd</sup> stage, access memory in 4<sup>th</sup> stage

# Hazards

- There are **situations** in pipelining when the **next instruction** cannot execute in the **following** clock cycle. These events are called **hazards**, and there are **three** different types.

## 1. Structure hazards

A required resource is busy

## 2. Data hazard

Need to wait for previous instruction to complete its data read/write

## 3. Control hazard

Deciding on control action depends on previous instruction

# Structure Hazards

- Conflict for use of a resource
- If RISC-V pipeline with a **single memory (data and instruction)**
  - Load/store instruction requires data access
  - Instruction fetch would have to *stall* for that cycle
    - Would cause a pipeline “bubble”
- Hence, pipelined datapaths require **two separate instruction/data memories**
  - Or separate instruction/data caches

# Data Hazards

- An instruction depends on completion of data access by a previous instruction

- add  $x19, x0, x1$
  - sub  $x2, x19, x3$



- In **add instruction**:  $x19$  will be updated after the fifth cycle (**End of cycle**)
- In **sub instruction**:  $x19$  should be available before the second cycle.

# Hazard Solution

- To overcome the problem of hazard there are many techniques can be used:
  1. Pipeline **stall** (Adding **Bubbles**): NOP instructions
  2. **Forwarding (Bypassing)**: adding hardware shortcut between the source and the destination
  3. Instruction **Reordering** (Out of order)

# Data Hazards

- An instruction depends on completion of data access by a previous instruction
  - add **x19**, x0, x1
  - sub x2, **x19**, x3



# Forwarding (aka Bypassing)

- Use result when it is computed
  - Don't wait for it to be stored in a register
  - Requires extra connections in the datapath



# Load-Use Data Hazard

- Can't always avoid stalls by forwarding
  - If value not computed when needed
  - Can't forward backward in time!



# Code Scheduling to Avoid Stalls

- Reorder code to avoid use of load result in the next instruction
- C code for  $a = b + e; c = b + f;$



# Control Hazards

- Branch determines flow of control
  - Fetching next instruction depends **on branch outcome**
  - Pipeline can't always fetch correct instruction
    - Still working on ID stage of branch
- In RISC-V pipeline
  - Need to compare registers and compute target early in the pipeline
  - Add **hardware** to do it in **ID stage (second stage)**

# Stall on Branch

- Wait until branch outcome determined before fetching next instruction



# Branch Prediction

- Longer pipelines can't readily determine branch outcome early
  - Stall penalty becomes **unacceptable**
- **Predict** outcome of branch
  - Only stall if prediction is wrong
- In RISC-V pipeline
  - Can predict branches not taken
  - Fetch instruction after branch, with no delay

# More-Realistic Branch Prediction

- **Static** branch prediction
  - Based on typical branch behavior
  - Example: loop and if-statement branches
    - Predict backward branches taken
    - Predict forward branches not taken
- **Dynamic** branch prediction
  - Hardware measures actual branch behavior
    - e.g., record recent history of each branch
  - Assume future behavior will continue the trend
    - When wrong, stall while re-fetching, and update history

# Pipeline Summary

## The BIG Picture

- Pipelining improves performance by increasing instruction throughput
  - Executes multiple instructions in parallel
  - Each instruction has the same latency
- Subject to hazards
  - Structure, data, control
- Instruction set design affects complexity of pipeline implementation



**MK**  
MORGAN KAUFMANN

# RISC-V Pipelined Datapath



# Pipeline registers

- Need registers between stages
  - To hold information produced in previous cycle
  - To share the data with other stage if needed



■ 96-bit      256-bit      193-bit      128-bit

# Pipeline Operation

- Cycle-by-cycle flow of instructions through the pipelined datapath
  - “Single-clock-cycle” pipeline diagram
    - Shows pipeline usage in a single cycle
    - Highlight resources used
  - c.f. “multi-clock-cycle” diagram
    - Graph of operation over time
- We'll look at “single-clock-cycle” diagrams for load & store

# IF for Load, Store, ...



# ID for Load, Store, ...



# EX for Load



# MEM for Load



# WB for Load



# Corrected Datapath for Load



# EX for Store



# MEM for Store



# WB for Store



# Multi-Cycle Pipeline Diagram

- Form showing resource usage



# Multi-Cycle Pipeline Diagram

## ■ Traditional form



# Single-Cycle Pipeline Diagram

## State of pipeline in a given cycle



# Pipelined Control (Simplified)



# Pipelined Control

- Control signals derived from instruction

| Signal name | Effect when deasserted                                                           | Effect when asserted                                                                                    |
|-------------|----------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------|
| RegWrite    | None.                                                                            | The register on the Write register input is written with the value on the Write data input.             |
| ALUSrc      | The second ALU operand comes from the second register file output (Read data 2). | The second ALU operand is the sign-extended, 12 bits of the instruction.                                |
| PCSrc       | The PC is replaced by the output of the adder that computes the value of PC + 4. | The PC is replaced by the output of the adder that computes the branch target.                          |
| MemRead     | None.                                                                            | Data memory contents designated by the address input are put on the Read data output.                   |
| MemWrite    | None.                                                                            | Data memory contents designated by the address input are replaced by the value on the Write data input. |
| MemtoReg    | The value fed to the register Write data input comes from the ALU.               | The value fed to the register Write data input comes from the data memory.                              |

EX

M

WB

| Instruction | Execution/address calculation stage control lines |        | Memory access stage control lines |          |           | Write-back stage control lines |           |
|-------------|---------------------------------------------------|--------|-----------------------------------|----------|-----------|--------------------------------|-----------|
|             | ALUOp                                             | ALUSrc | Branch                            | Mem-Read | Mem-Write | Reg-Write                      | Memto-Reg |
| R-format    | 10                                                | 0      | 0                                 | 0        | 0         | 1                              | 0         |
| ld          | 00                                                | 1      | 0                                 | 1        | 0         | 1                              | 1         |
| sd          | 00                                                | 1      | 0                                 | 0        | 1         | 0                              | X         |
| beq         | 01                                                | 0      | 1                                 | 0        | 0         | 0                              | X         |



# Pipelined Control

- Control signals derived from instruction
  - As in single-cycle implementation



# Pipelined Control



# Data Hazards in ALU Instructions

- Consider this sequence:

```
sub    x2, x1, x3  
and    x12, x2, x5  
or     x13, x6, x2  
add    x14, x2, x2  
sd     x15, 100(x2)
```

- We can resolve hazards with forwarding
  - How do we detect when to forward?

# Dependencies & Forwarding



# Detecting the Need to Forward

- Pass register numbers along pipeline
  - e.g., ID/EX.RegisterRs1 = register number for Rs1 sitting in ID/EX pipeline register
- ALU operand register numbers in EX stage are given by
  - ID/EX.RegisterRs1, ID/EX.RegisterRs2
- Data hazards when
  - 1a. EX/MEM.RegisterRd = ID/EX.RegisterRs1
  - 1b. EX/MEM.RegisterRd = ID/EX.RegisterRs2
  - 2a. MEM/WB.RegisterRd = ID/EX.RegisterRs1
  - 2b. MEM/WB.RegisterRd = ID/EX.RegisterRs2



# Detecting the Need to Forward

- But only if forwarding instruction will write to a register!
  - EX/MEM.RegWrite, MEM/WB.RegWrite
- And only if Rd for that instruction is not x0
  - EX/MEM.RegisterRd  $\neq$  0,  
MEM/WB.RegisterRd  $\neq$  0



# Forwarding Paths



# Forwarding Conditions \*

| Mux control   | Source | Explanation                                                                    |
|---------------|--------|--------------------------------------------------------------------------------|
| ForwardA = 00 | ID/EX  | The first ALU operand comes from the register file.                            |
| ForwardA = 10 | EX/MEM | The first ALU operand is forwarded from the prior ALU result.                  |
| ForwardA = 01 | MEM/WB | The first ALU operand is forwarded from data memory or an earlier ALU result.  |
| ForwardB = 00 | ID/EX  | The second ALU operand comes from the register file.                           |
| ForwardB = 10 | EX/MEM | The second ALU operand is forwarded from the prior ALU result.                 |
| ForwardB = 01 | MEM/WB | The second ALU operand is forwarded from data memory or an earlier ALU result. |

# Double Data Hazard

- Consider the sequence:

add  $x_1, x_1, x_2$

add  $x_1, x_1, x_3$

add  $x_1, x_1, x_4$

- Both hazards occur
  - Want to use the most recent
- Revise MEM hazard condition
  - Only fwd if EX hazard condition isn't true

# Revised Forwarding Condition

- MEM hazard
  - if (MEM/WB.RegWrite  
and (MEM/WB.RegisterRd ≠ 0)  
and not(EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)  
and (EX/MEM.RegisterRd ≠ ID/EX.RegisterRs1))  
and (MEM/WB.RegisterRd = ID/EX.RegisterRs1)) ForwardA = 01
  - if (MEM/WB.RegWrite  
and (MEM/WB.RegisterRd ≠ 0)  
and not(EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)  
and (EX/MEM.RegisterRd ≠ ID/EX.RegisterRs2))  
and (MEM/WB.RegisterRd = ID/EX.RegisterRs2)) ForwardB = 01

# Datapath with Forwarding



# Load-Use Hazard Detection

- Check when using instruction is decoded in ID stage
- ALU operand register numbers in ID stage are given by
  - IF/ID.RegisterRs1, IF/ID.RegisterRs2
- Load-use hazard when
  - ID/EX.MemRead and
$$((ID/EX.RegisterRd = IF/ID.RegisterRs1) \text{ or } (ID/EX.RegisterRd = IF/ID.RegisterRs2))$$
- If detected, stall and insert bubble



# How to Stall the Pipeline

- Force control values in ID/EX register to 0
  - EX, MEM and WB do nop (no-operation)
- Prevent update of PC and IF/ID register
  - Using instruction is decoded again
  - Following instruction is fetched again
  - 1-cycle stall allows MEM to read data for 1d
    - Can subsequently forward to EX stage



# Load-Use Data Hazard



# Datapath with Hazard Detection



# Stalls and Performance

## The BIG Picture

- Stalls reduce performance
  - But are required to get correct results
- Compiler can arrange code to avoid hazards and stalls
  - Requires knowledge of the pipeline structure



**MK**  
MORGAN KAUFMANN

# Branch Hazards

- If branch outcome determined in MEM



# Reducing Branch Delay

- Move hardware to determine outcome to ID stage
  - Target address adder
  - Register comparator
- Example: branch taken

```
36: sub x10, x4, x8
40: beq x1, x3, 16 // PC-relative branch
           // to 40+16*2=72
44: and x12, x2, x5
48: orr x13, x2, x6
52: add x14, x4, x2
56: sub x15, x6, x7
...
72: ld x4, 50(x7)
```



# Example: Branch Taken



# Example: Branch Taken



# Dynamic Branch Prediction

- In deeper and superscalar pipelines, branch penalty is more significant
- Use dynamic prediction
  - Branch prediction buffer (aka branch history table)
  - Indexed by recent branch instruction addresses
  - Stores outcome (taken/not taken)
  - To execute a branch
    - Check table, expect the same outcome
    - Start fetching from fall-through or target
    - If wrong, flush pipeline and flip prediction

# 1-Bit Predictor: Shortcoming

- Inner loop branches mispredicted twice!



- Mispredict as taken on last iteration of inner loop
- Then mispredict as not taken on first iteration of inner loop next time around

# 2-Bit Predictor

- Only change prediction on two successive mispredictions



# Calculating the Branch Target

- Even with predictor, still need to calculate the target address
  - 1-cycle penalty for a taken branch
- Branch target buffer
  - Cache of target addresses
  - Indexed by PC when instruction fetched
    - If hit and instruction is branch predicted taken, can fetch target immediately



MK  
MORGAN KAUFMANN

# Exceptions and Interrupts

- “Unexpected” events requiring change in flow of control
  - Different ISAs use the terms differently
- Exception
  - Arises within the CPU
    - e.g., undefined opcode, syscall, ...
- Interrupt
  - From an external I/O controller
- Dealing with them without sacrificing performance is hard

# Handling Exceptions

- Save PC of offending (or interrupted) instruction
  - In RISC-V: Supervisor Exception Program Counter (SEPC)
- Save indication of the problem
  - In RISC-V: Supervisor Exception Cause Register (SCAUSE)
    - 64 bits, but most bits unused
      - Exception code field: 2 for undefined opcode, 12 for hardware malfunction, ...
  - Jump to handler
    - Assume at 0000 0000 1C09 0000<sub>hex</sub>

# An Alternate Mechanism

- Vectored Interrupts
  - Handler address determined by the cause
- Exception vector address to be added to a vector table base register:
  - Undefined opcode       $00\ 0100\ 0000_{\text{two}}$
  - Hardware malfunction:  $01\ 1000\ 0000_{\text{two}}$
  - ...                        ...
- Instructions either
  - Deal with the interrupt, or
  - Jump to real handler

# Handler Actions

- Read cause, and transfer to relevant handler
- Determine action required
- If restartable
  - Take corrective action
  - use SEPC to return to program
- Otherwise
  - Terminate program
  - Report error using SEPC, SCAUSE, ...

# Exceptions in a Pipeline

- Another form of control hazard
- Consider malfunction on add in EX stage
  - add x1, x2, x1
    - Prevent x1 from being clobbered
    - Complete previous instructions
    - Flush add and subsequent instructions
    - Set SEPC and SCAUSE register values
    - Transfer control to handler
  - Similar to mispredicted branch
    - Use much of the same hardware

# Pipeline with Exceptions



# Exception Properties

- Restartable exceptions
  - Pipeline can flush the instruction
  - Handler executes, then returns to the instruction
    - Refetched and executed from scratch
- PC saved in SEPC register
  - Identifies causing instruction

# Exception Example

- Exception on `add` in

```
40      sub   x11, x2, x4
44      and   x12, x2, x5
48      orr   x13, x2, x6
4c      add   x1, x2, x1
50      sub   x15, x6, x7
54      ld    x16, 100(x7)
```

...

- Handler

```
1c090000      sd   x26, 1000(x10)
1c090004      sd   x27, 1008(x10)
```

...



# Exception Example



# Exception Example



# Multiple Exceptions

- Pipelining overlaps multiple instructions
  - Could have multiple exceptions at once
- Simple approach: deal with exception from earliest instruction
  - Flush subsequent instructions
  - “Precise” exceptions
- In complex pipelines
  - Multiple instructions issued per cycle
  - Out-of-order completion
  - Maintaining precise exceptions is difficult!

# Imprecise Exceptions

- Just stop pipeline and save state
  - Including exception cause(s)
- Let the handler work out
  - Which instruction(s) had exceptions
  - Which to complete or flush
    - May require “manual” completion
- Simplifies hardware, but more complex handler software
- Not feasible for complex multiple-issue out-of-order pipelines

# Instruction-Level Parallelism (ILP)

- Pipelining: executing multiple instructions in parallel
- To increase ILP
  - Deeper pipeline
    - Less work per stage  $\Rightarrow$  shorter clock cycle
  - Multiple issue
    - Replicate pipeline stages  $\Rightarrow$  multiple pipelines
    - Start multiple instructions per clock cycle
    - CPI < 1, so use Instructions Per Cycle (IPC)
    - E.g., 4GHz 4-way multiple-issue
      - 16 BIPS, peak CPI = 0.25, peak IPC = 4
    - But dependencies reduce this in practice

# Multiple Issue

- Static multiple issue
  - Compiler groups instructions to be issued together
  - Packages them into “issue slots”
  - Compiler detects and avoids hazards
- Dynamic multiple issue
  - CPU examines instruction stream and chooses instructions to issue each cycle
  - Compiler can help by reordering instructions
  - CPU resolves hazards using advanced techniques at runtime



# Speculation

- “Guess” what to do with an instruction
  - Start operation as soon as possible
  - Check whether guess was right
    - If so, complete the operation
    - If not, roll-back and do the right thing
- Common to static and dynamic multiple issue
- Examples
  - Speculate on branch outcome
    - Roll back if path taken is different
  - Speculate on load
    - Roll back if location is updated

# Compiler/Hardware Speculation

- Compiler can reorder instructions
  - e.g., move load before branch
  - Can include “fix-up” instructions to recover from incorrect guess
- Hardware can look ahead for instructions to execute
  - Buffer results until it determines they are actually needed
  - Flush buffers on incorrect speculation

# Speculation and Exceptions

- What if exception occurs on a speculatively executed instruction?
  - e.g., speculative load before null-pointer check
- Static speculation
  - Can add ISA support for deferring exceptions
- Dynamic speculation
  - Can buffer exceptions until instruction completion (which may not occur)

# Static Multiple Issue

- Compiler groups instructions into “issue packets”
  - Group of instructions that can be issued on a single cycle
  - Determined by pipeline resources required
- Think of an issue packet as a very long instruction
  - Specifies multiple concurrent operations
  - ⇒ Very Long Instruction Word (VLIW)

# Scheduling Static Multiple Issue

- Compiler must remove some/all hazards
  - Reorder instructions into issue packets
  - No dependencies with a packet
  - Possibly some dependencies between packets
    - Varies between ISAs; compiler must know!
  - Pad with nop if necessary

# RISC-V with Static Dual Issue

- Two-issue packets
  - One ALU/branch instruction
  - One load/store instruction
  - 64-bit aligned
    - ALU/branch, then load/store
    - Pad an unused instruction with nop

| Address | Instruction type | Pipeline Stages |    |    |     |     |     |     |
|---------|------------------|-----------------|----|----|-----|-----|-----|-----|
|         |                  | IF              | ID | EX | MEM | WB  |     |     |
| n       | ALU/branch       |                 |    |    |     |     |     |     |
| n + 4   | Load/store       | IF              | ID | EX | MEM | WB  |     |     |
| n + 8   | ALU/branch       |                 | IF | ID | EX  | MEM | WB  |     |
| n + 12  | Load/store       |                 |    | IF | ID  | EX  | MEM | WB  |
| n + 16  | ALU/branch       |                 |    |    | IF  | ID  | EX  | MEM |
| n + 20  | Load/store       |                 |    |    | IF  | ID  | EX  | MEM |
|         |                  |                 |    |    |     |     |     | WB  |



# RISC-V with Static Dual Issue



# Hazards in the Dual-Issue RISC-V

- More instructions executing in parallel
- EX data hazard
  - Forwarding avoided stalls with single-issue
  - Now can't use ALU result in load/store in same packet
    - add  $x_{10}$ ,  $x_0$ ,  $x_1$   
ld  $x_2$ , 0( $x_{10}$ )
    - Split into two packets, effectively a stall
- Load-use hazard
  - Still one cycle use latency, but now two instructions
  - More aggressive scheduling required

# Scheduling Example

- Schedule this for dual-issue RISC-V

```
Loop: ld    x31,0(x20)      // x31=array element
      add   x31,x31,x21      // add scalar in x21
      sd    x31,0(x20)      // store result
      addi  x20,x20,-8       // decrement pointer
      blt   x22,x20,Loop     // branch if x22 < x20
```

|       | ALU/branch       | Load/store    | cycle |
|-------|------------------|---------------|-------|
| Loop: | nop              | ld x31,0(x20) | 1     |
|       | addi x20,x20,-8  | nop           | 2     |
|       | add x31,x31,x21  | nop           | 3     |
|       | blt x22,x20,Loop | sd x31,8(x20) | 4     |

- IPC = 5/4 = 1.25 (c.f. peak IPC = 2)

# Loop Unrolling

- Replicate loop body to expose more parallelism
  - Reduces loop-control overhead
- Use different registers per replication
  - Called “register renaming”
  - Avoid loop-carried “anti-dependencies”
    - Store followed by a load of the same register
    - Aka “name dependence”
      - Reuse of a register name



MK  
MORGAN KAUFMANN

# Loop Unrolling Example

|       | ALU/branch       | Load/store      | cycle |
|-------|------------------|-----------------|-------|
| Loop: | addi x20,x20,-32 | ld x28, 0(x20)  | 1     |
|       | nop              | ld x29, 24(x20) | 2     |
|       | add x28,x28,x21  | ld x30, 16(x20) | 3     |
|       | add x29,x29,x21  | ld x31, 8(x20)  | 4     |
|       | add x30,x30,x21  | sd x28, 32(x20) | 5     |
|       | add x31,x31,x21  | sd x29, 24(x20) | 6     |
|       | nop              | sd x30, 16(x20) | 7     |
|       | blt x22,x20,Loop | sd x31, 8(x20)  | 8     |

- IPC =  $14/8 = 1.75$ 
  - Closer to 2, but at cost of registers and code size

# Dynamic Multiple Issue

- “Superscalar” processors
- CPU decides whether to issue 0, 1, 2, ... each cycle
  - Avoiding structural and data hazards
- Avoids the need for compiler scheduling
  - Though it may still help
  - Code semantics ensured by the CPU

# Dynamic Pipeline Scheduling

- Allow the CPU to execute instructions out of order to avoid stalls
  - But commit result to registers in order
- Example

ld x31,20(x21)

add x1,x31,x2

sub x23,x23,x3

andi x5,x23,20

- Can start sub while add is waiting for ld



# Dynamically Scheduled CPU



# Register Renaming

- Reservation stations and reorder buffer effectively provide register renaming
- On instruction issue to reservation station
  - If operand is available in register file or reorder buffer
    - Copied to reservation station
    - No longer required in the register; can be overwritten
  - If operand is not yet available
    - It will be provided to the reservation station by a function unit
    - Register update may not be required



# Speculation

- Predict branch and continue issuing
  - Don't commit until branch outcome determined
- Load speculation
  - Avoid load and cache miss delay
    - Predict the effective address
    - Predict loaded value
    - Load before completing outstanding stores
    - Bypass stored values to load unit
  - Don't commit load until speculation cleared

# Why Do Dynamic Scheduling?

- Why not just let the compiler schedule code?
- Not all stalls are predictable
  - e.g., cache misses
- Can't always schedule around branches
  - Branch outcome is dynamically determined
- Different implementations of an ISA have different latencies and hazards

# Does Multiple Issue Work?

## The BIG Picture

- Yes, but not as much as we'd like
- Programs have real dependencies that limit ILP
- Some dependencies are hard to eliminate
  - e.g., pointer aliasing
- Some parallelism is hard to expose
  - Limited window size during instruction issue
- Memory delays and limited bandwidth
  - Hard to keep pipelines full
- Speculation can help if done well



**MK**  
MORGAN KAUFMANN

# Power Efficiency

- Complexity of dynamic scheduling and speculations requires power
- Multiple simpler cores may be better

| Microprocessor | Year | Clock Rate | Pipeline Stages | Issue width | Out-of-order/Speculation | Cores | Power |
|----------------|------|------------|-----------------|-------------|--------------------------|-------|-------|
| i486           | 1989 | 25MHz      | 5               | 1           | No                       | 1     | 5W    |
| Pentium        | 1993 | 66MHz      | 5               | 2           | No                       | 1     | 10W   |
| Pentium Pro    | 1997 | 200MHz     | 10              | 3           | Yes                      | 1     | 29W   |
| P4 Willamette  | 2001 | 2000MHz    | 22              | 3           | Yes                      | 1     | 75W   |
| P4 Prescott    | 2004 | 3600MHz    | 31              | 3           | Yes                      | 1     | 103W  |
| Core           | 2006 | 2930MHz    | 14              | 4           | Yes                      | 2     | 75W   |
| UltraSparc III | 2003 | 1950MHz    | 14              | 4           | No                       | 1     | 90W   |
| UltraSparc T1  | 2005 | 1200MHz    | 6               | 1           | No                       | 8     | 70W   |



# Cortex A53 and Intel i7

| Processor                             | ARM A53                            | Intel Core i7 920                        |
|---------------------------------------|------------------------------------|------------------------------------------|
| Market                                | Personal Mobile Device             | Server, cloud                            |
| Thermal design power                  | 100 milliWatts<br>(1 core @ 1 GHz) | 130 Watts                                |
| Clock rate                            | 1.5 GHz                            | 2.66 GHz                                 |
| Cores/Chip                            | 4 (configurable)                   | 4                                        |
| Floating point?                       | Yes                                | Yes                                      |
| Multiple issue?                       | Dynamic                            | Dynamic                                  |
| Peak instructions/clock cycle         | 2                                  | 4                                        |
| Pipeline stages                       | 8                                  | 14                                       |
| Pipeline schedule                     | Static in-order                    | Dynamic out-of-order<br>with speculation |
| Branch prediction                     | Hybrid                             | 2-level                                  |
| 1 <sup>st</sup> level caches/core     | 16-64 KiB I, 16-64 KiB D           | 32 KiB I, 32 KiB D                       |
| 2 <sup>nd</sup> level caches/core     | 128-2048 KiB                       | 256 KiB (per core)                       |
| 3 <sup>rd</sup> level caches (shared) | (platform dependent)               | 2-8 MB                                   |

# ARM Cortex-A53 Pipeline



# Fallacies

- Pipelining is easy (!)
  - The basic idea is easy
  - The devil is in the details
    - e.g., detecting data hazards
- Pipelining is independent of technology
  - So why haven't we always done pipelining?
  - More transistors make more advanced techniques feasible
  - Pipeline-related ISA design needs to take account of technology trends
    - e.g., predicated instructions

# Pitfalls

- Poor ISA design can make pipelining harder
  - e.g., complex instruction sets (VAX, IA-32)
    - Significant overhead to make pipelining work
    - IA-32 micro-op approach
  - e.g., complex addressing modes
    - Register update side effects, memory indirection
  - e.g., delayed branches
    - Advanced pipelines have long delay slots

# Concluding Remarks

- ISA influences design of datapath and control
- Datapath and control influence design of ISA
- Pipelining improves instruction throughput using parallelism
  - More instructions completed per second
  - Latency for each instruction not reduced
- Hazards: structural, data, control
- Multiple issue and dynamic scheduling (ILP)
  - Dependencies limit achievable parallelism
  - Complexity leads to the power wall