



**EECS 151/251A  
Spring 2020  
Digital Design and Integrated  
Circuits**

Instructor:  
John Wawrzynek

**Lecture 11: RISC-V**

# *Project Introduction*

- You will design and optimize a RISC-V processor
- Phase 1: Design and demonstrate a processor
- Phase 2:
  - ASIC Lab – implement cache memory and generate complete chip layout
  - FPGA Lab – Add video display and graphics accelerator

*Today discuss how to design the processor*



# What is RISC-V?

- Fifth generation of RISC design from UC Berkeley
- A high-quality, license-free, royalty-free RISC ISA specification
- Experiencing rapid uptake in both industry and academia
- Supported by growing shared software ecosystem
- Appropriate for all levels of computing system, from micro-controllers to supercomputers
  - 32-bit, 64-bit, and 128-bit variants (we're using 32-bit in class, textbook uses 64-bit)
- Standard maintained by non-profit RISC-V Foundation

*<https://riscv.org/specifications/>*

# Foundation Members (60+)



**Platinum:**



Berkeley  
Architecture  
Research

bluespec

Google

IBM.

Mellanox®  
TECHNOLOGIES



Microsoft

Rambus

Cryptography Research



SiFive

cortus

D R A P E R

SAMSUNG

Microsemi

QUALCOMM®



NVIDIA.

WD Western  
Digital®



**Gold, Silver, Auditors:**

AMD

ANDES  
TECHNOLOGY

Codasip

Esperanto  
Technologies

IDT

INTRINSIX

MP Processor

Sur Technology

antmicro



Blockstream

GRAY  
RESEARCH



ROA  
LOGIC

Technion

ESPRESSIF

ETH zürich

LATTICE  
SEMICONDUCTOR

Rumble

VectorBlox  
embedded supercomputing

ICT

中国科学院  
INSTITUTE OF COMPUTING TECHNOLOGIES

MEDIATEK

lowRISC

runtime.io

# Instruction Set Architecture (ISA)

- Job of a CPU (*Central Processing Unit*, aka *Core*): execute *instructions*
- Instructions: CPU's primitives operations
  - Instructions performed one after another in sequence
  - Each instruction does a small amount of work (a tiny part of a larger program).
  - Each instruction has an *operation* applied to *operands*,
  - and might be used change the sequence of instruction.
- CPUs belong to “families,” each implementing its own set of instructions
- CPU’s particular set of instructions implements an *Instruction Set Architecture (ISA)*
  - Examples: ARM, Intel x86, MIPS, RISC-V, IBM/Motorola PowerPC (old Mac), Intel IA64, ...



*If you need more info on processor organization.*

# RISC Processor Instructions in Brief



- *Compilers generate machine instructions to execute your programs in the following way:*
- Load/Store instructions move operands between main memory (cache hierarchy) and core register file.
- Register/Register instructions perform arithmetic and logical operations on register file values as operands and result returned to register file.
- Register/Immediate instructions perform arithmetic and logical operations on register file value and constants.
- Branch instructions are used for looping and if-than-else (data dependent operations).
- Jumps are used for function call and return.

# Complete RV32I ISA

|                       |     |          |     |             |       |
|-----------------------|-----|----------|-----|-------------|-------|
| imm[31:12]            |     |          | rd  | 0110111     | LUI   |
| imm[31:12]            |     |          | rd  | 0010111     | AUIPC |
| imm[20 10:1 11 19:12] |     |          | rd  | 1101111     | JAL   |
| imm[11:0]             |     | rs1      | 000 | rd          | JALR  |
| imm[12 10:5]          | rs2 | rs1      | 000 | imm[4:1 11] | BEQ   |
| imm[12 10:5]          | rs2 | rs1      | 001 | imm[4:1 11] | BNE   |
| imm[12 10:5]          | rs2 | rs1      | 100 | imm[4:1 11] | BLT   |
| imm[12 10:5]          | rs2 | rs1      | 101 | imm[4:1 11] | BGE   |
| imm[12 10:5]          | rs2 | rs1      | 110 | imm[4:1 11] | BLTU  |
| imm[12 10:5]          | rs2 | rs1      | 111 | imm[4:1 11] | BGEU  |
| imm[11:0]             |     | rs1      | 000 | rd          | LB    |
| imm[11:0]             |     | rs1      | 001 | rd          | LH    |
| imm[11:0]             |     | rs1      | 010 | rd          | LW    |
| imm[11:0]             |     | rs1      | 100 | rd          | LBU   |
| imm[11:0]             |     | rs1      | 101 | rd          | LHU   |
| imm[11:5]             | rs2 | rs1      | 000 | imm[4:0]    | SB    |
| imm[11:5]             | rs2 | rs1      | 001 | imm[4:0]    | SH    |
| imm[11:5]             | rs2 | rs1      | 010 | imm[4:0]    | SW    |
| imm[11:0]             |     | rs1      | 000 | rd          | ADDI  |
| imm[11:0]             |     | rs1      | 010 | rd          | SLTI  |
| imm[11:0]             |     | rs1      | 011 | rd          | SLTIU |
| imm[11:0]             |     | rs1      | 100 | rd          | XORI  |
| imm[11:0]             |     | rs1      | 110 | rd          | ORI   |
| imm[11:0]             |     | rs1      | 111 | rd          | ANDI  |
| 00000000              |     | 00010011 |     |             |       |

|              |       |      |       |     |         |         |
|--------------|-------|------|-------|-----|---------|---------|
| 0000000      | shamt | rs1  | 001   | rd  | 0010011 | SLLI    |
| 0000000      | shamt | rs1  | 101   | rd  | 0010011 | SRLI    |
| 0100000      | shamt | rs1  | 101   | rd  | 0010011 | SRAI    |
| 0000000      | rs2   | rs1  | 000   | rd  | 0110011 | ADD     |
| 0100000      | rs2   | rs1  | 000   | rd  | 0110011 | SUB     |
| 0000000      | rs2   | rs1  | 001   | rd  | 0110011 | SLL     |
| 0000000      | rs2   | rs1  | 010   | rd  | 0110011 | SLT     |
| 0000000      | rs2   | rs1  | 011   | rd  | 0110011 | SLTU    |
| 0000000      | rs2   | rs1  | 100   | rd  | 0110011 | XOR     |
| 0000000      | rs2   | rs1  | 101   | rd  | 0110011 | SRL     |
| 0100000      | rs2   | rs1  | 101   | rd  | 0110011 | SRA     |
| 0000000      | rs2   | rs1  | 110   | rd  | 0110011 | OR      |
| 0000000      | rs2   | rs1  | 111   | rd  | 0110011 | AND     |
| 0000         | pred  | succ | 00000 | 000 | 00000   | 0001111 |
| 0000         | 0000  | 0000 | 00000 | 001 | 00000   | 0001111 |
| 000000000000 |       |      | 00000 | 000 | 00000   | 1110011 |
| 000000000001 |       |      | 00000 | 000 | 00000   | 1110011 |
| csr          | rs1   | 001  | rd    |     |         | 1110011 |
| csr          | rs1   | 101  | rd    |     |         | 1110011 |
| csr          | rs1   | 011  | rd    |     |         | 1110011 |
| csr          | zimm  | 101  | rd    |     |         | 1110011 |
| csr          | zimm  | 110  | rd    |     |         | 1110011 |
| csr          | zimm  | 111  | rd    |     |         | 1110011 |

Not in EECS151/251A

SLLI  
SRLI  
SRAI  
ADD  
SUB  
SLL  
SLT  
SLTU  
XOR  
SRL  
SRA  
OR  
AND  
FENCE  
FENCE.I  
ECALL  
EBREAK \*CSRRW  
CSRRS  
CSRRC  
CSRRWI  
CSRRI  
CSRCI

\* implemented in the ASIC project

# Summary of RISC-V Instruction Formats

*Binary encoding of machine instructions. Note the common fields.*

| 31 | 30 | 25 24 | 21 | 20 | 19 | 15 14 | 12 11 | 8 | 7 | 6 | 0 |        |
|----|----|-------|----|----|----|-------|-------|---|---|---|---|--------|
|    |    |       |    |    |    |       |       |   |   |   |   | R-type |
|    |    |       |    |    |    |       |       |   |   |   |   | I-type |
|    |    |       |    |    |    |       |       |   |   |   |   | S-type |
|    |    |       |    |    |    |       |       |   |   |   |   | B-type |
|    |    |       |    |    |    |       |       |   |   |   |   | U-type |
|    |    |       |    |    |    |       |       |   |   |   |   | J-type |

# “State” Required by RV32I ISA

Each instruction reads and updates this state during execution:

- Registers (**x**0 .. **x**31)
  - Register file (or *regfile*) **Reg** holds 32 registers x 32 bits/register:  
**Reg**[0] .. **Reg**[31]
  - First register read specified by *rs1* field in instruction
  - Second register read specified by *rs2* field in instruction
  - Write register (destination) specified by *rd* field in instruction
  - **x**0 is always 0 (writes to **Reg**[0] are ignored)
- Program Counter (PC)
  - Holds address of current instruction
- Memory (MEM)
  - Holds both instructions & data, in one 32-bit byte-addressed memory space
  - We’ll use separate memories for instructions (**IMEM**) and data (**DMEM**)
    - *Later we’ll replace these with instruction and data caches*
  - Instructions are read (*fetched*) from instruction memory (assume **IMEM** read-only)
  - Load/store instructions access data memory

# RISC-V State Elements

- State encodes everything about the execution status of a processor:

- PC register
- 32 registers
- Memory



*Note: for these state elements, clock is used for write but not for read (asynchronous read, synchronous write).*

# RISC-V Microarchitecture Organization

## Datapath + Controller + External Memory



# Microarchitecture

Multiple implementations for a single instruction set architecture:

- *Single-cycle*
  - Each instruction executes in a single clock cycle.
- *Multicycle*
  - Each instruction is broken up into a series of shorter steps with one step per clock cycle.
- *Pipelined (variant on “multicycle”)*
  - Each instruction is broken up into a series of steps with one step per clock cycle
  - Multiple instructions execute at once by overlapping in time.
- *Superscalar*
  - Multiple functional units to execute multiple instructions at the same time
- *Out of order...*
  - Instructions are reordered by the hardware

# First Design: One-Instruction-Per-Cycle RISC-V Machine

On every tick of the clock, the computer executes one instruction



1. Current state outputs drive the inputs to the combinational logic, whose outputs settle at the values of the state before the next clock edge
2. At the rising clock edge, all the state elements are updated with the combinational logic outputs, and execution moves to the next clock cycle (next instruction)  
13

# Basic Phases of Instruction Execution



# Implementing the add instruction

|         |     |     |     |    |         |     |
|---------|-----|-----|-----|----|---------|-----|
| 0000000 | rs2 | rs1 | 000 | rd | 0110011 | ADD |
|---------|-----|-----|-----|----|---------|-----|

**add rd, rs1, rs2**

- Instruction makes two changes to machine's state:

$$\text{Reg}[rd] = \text{Reg}[rs1] + \text{Reg}[rs2]$$

$$PC = PC + 4$$

# Datapath for add



# Timing Diagram for add



# Implementing the **sub** instruction

|         |     |     |     |    |         |     |
|---------|-----|-----|-----|----|---------|-----|
| 0000000 | rs2 | rs1 | 000 | rd | 0110011 | ADD |
| 0100000 | rs2 | rs1 | 000 | rd | 0110011 | SUB |

**sub rd, rs1, rs2**

$$\text{Reg}[rd] = \text{Reg}[rs1] - \text{Reg}[rs2]$$

- Almost the same as add, except now have to subtract operands instead of adding them
- `inst[30]` selects between add and subtract

# Datapath for add/sub



# Implementing other R-Format instructions

|         |     |     |     |    |         |      |
|---------|-----|-----|-----|----|---------|------|
| 0000000 | rs2 | rs1 | 000 | rd | 0110011 | ADD  |
| 0100000 | rs2 | rs1 | 000 | rd | 0110011 | SUB  |
| 0000000 | rs2 | rs1 | 001 | rd | 0110011 | SLL  |
| 0000000 | rs2 | rs1 | 010 | rd | 0110011 | SLT  |
| 0000000 | rs2 | rs1 | 011 | rd | 0110011 | SLTU |
| 0000000 | rs2 | rs1 | 100 | rd | 0110011 | XOR  |
| 0000000 | rs2 | rs1 | 101 | rd | 0110011 | SRL  |
| 0100000 | rs2 | rs1 | 101 | rd | 0110011 | SRA  |
| 0000000 | rs2 | rs1 | 110 | rd | 0110011 | OR   |
| 0000000 | rs2 | rs1 | 111 | rd | 0110011 | AND  |

- All implemented by decoding funct3 and funct7 fields and selecting appropriate ALU function

# Implementing the addi instruction

- RISC-V Assembly Instruction: *Uses the “I-type” instruction format*

**addi rd, rs1, integer**

$Reg[rd] = Reg[rs1] + \text{sign\_extend}(\text{immediate})$

*example:*

**addi x15, x1, -50**

| 31           | 20 19 | 15 14  | 12 11 | 7 6     | 0      |
|--------------|-------|--------|-------|---------|--------|
| imm[11:0]    | rs1   | funct3 | rd    | opcode  |        |
| 12           | 5     | 3      | 5     | 7       | 0      |
| 111111001110 | 00001 | 000    | 01111 | 0010011 |        |
| imm=-50      |       | rs1=1  | ADD   | rd=15   | OP-Imm |

# Review: Datapath for add/sub



# Adding addi to datapath



# I-type Format immediates



- High 12 bits of instruction (inst[31:20]) copied to low 12 bits of immediate (imm[11:0])
- Immediate is sign-extended by copying value of inst[31] to fill the upper 20 bits of the immediate value (imm[31:12])

# Adding addi to datapath



# Implementing Load Word instruction

- RISC-V Assembly Instruction: *Also uses the “I-type” instruction format*

**lw rd, integer(rs1)**

$Reg[rd] = DMEM[Reg[rs1]] + sign\_extend(immediate)$

*example:*

**addi x14,8(x2)**

| 31           | 20 19 | 15 14  | 12 11 | 7 6     | 0    |
|--------------|-------|--------|-------|---------|------|
| imm[11:0]    | rs1   | funct3 | rd    | opcode  |      |
| 12           | 5     | 3      | 5     | 7       | 0    |
| 000000001000 | 00010 | 010    | 01110 | 0000011 |      |
| imm=+8       |       | rs1=2  | LW    | rd=14   | LOAD |

# Review: Adding addi to datapath



# Adding 1w to datapath



# Adding 1w to datapath



# All RV32 Load Instructions

|           |     |     |    |         |     |
|-----------|-----|-----|----|---------|-----|
| imm[11:0] | rs1 | 000 | rd | 0000011 | LB  |
| imm[11:0] | rs1 | 001 | rd | 0000011 | LH  |
| imm[11:0] | rs1 | 010 | rd | 0000011 | LW  |
| imm[11:0] | rs1 | 100 | rd | 0000011 | LBU |
| imm[11:0] | rs1 | 101 | rd | 0000011 | LHU |

↑ funct3 field encodes size and signedness of load data

- Supporting the narrower loads requires additional circuits to extract the correct byte/halfword from the value loaded from memory, and sign- or zero-extend the result to 32 bits before writing back to register file.

# Implementing Store Word instruction

- RISC-V Assembly Instruction: *Uses the “S-type” instruction format*

```
sw rs2, integer(rs1
```

$$DDEM[Reg[rs1] + sign\ extend(immediate)] = Reg[rs2]$$

## *example:*

**sw x14, 8(x2)**



offset[11:5] rs2=14 rs1=2 SW offset[4:0] STORE

= 0

rs1=2

SW

offset[4:0]

# STORE

= 8



combined 12-bit offset = 8

# Review: Adding 1w to datapath



# Adding **sw** to datapath



# Adding *sw* to datapath



# Review: I-Format immediates



- High 12 bits of instruction (inst[31:20]) copied to low 12 bits of immediate (imm[11:0])
- Immediate is sign-extended by copying value of inst[31] to fill the upper 20 bits of the immediate value (imm[31:12])

# I & S -type Immediate Generator

inst[31:0]



31

11 10

5 4

0

imm[31:0]

- Just need a 5-bit mux to select between two positions where low five bits of immediate can reside in instruction
- Other bits in immediate are wired to fixed positions in instruction

# Implementing Branches

*Uses the “B-type” instruction format*



- RISC-V Assembly Instruction, example:

**beq rs1, rs2, label**

*if  $rs1 == rs2$   $pc \leftarrow pc + offset$  // offset computed by compiler/assembler and stored in the immediate field(s)*

*example:*

**beq x1, x2, L1**

- B-format is mostly same as S-Format, with two register sources (rs1/rs2) and a 12-bit immediate
- But now immediate represents values -4096 to +4094 in 2-byte increments
- The 12 immediate bits encode *even* 13-bit signed byte offsets (lowest bit of offset is always zero, so no need to store it)

# Review: Adding sw to datapath



# Adding branches to datapath



# Adding branches to datapath



# Branch Comparator



- $\text{BrEq} = 1$ , if  $A=B$
- $\text{BrLT} = 1$ , if  $A < B$
- $\text{BrUn} = 1$  selects unsigned comparison for  $\text{BrLT}$ ,  $0=\text{signed}$
- BGE branch:  $A \geq B$ , if  $!(A < B)$

# RISC-V Branch Immediates

- 12-bit immediate encodes PC-relative offset of -4096 to +4094 bytes in multiples of 2 bytes
- RISC-V approach: keep 11 immediate bits in fixed position in output value, and rotate LSB of S-format to be bit 12 of B-format



S-Immediate



B-Immediate (shift left by 1)

Only one bit changes position between S and B, so only need a single-bit 2-way mux

# RISC-V Immediate Encoding

## Instruction Encodings, $\text{inst}[31:0]$

| 31 | 30 | 25 24 | 21 | 20 | 19 | 15 14 | 12 11 | 8 | 7 | 6 | 0 |        |
|----|----|-------|----|----|----|-------|-------|---|---|---|---|--------|
|    |    |       |    |    |    |       |       |   |   |   |   | opcode |
|    |    |       |    |    |    |       |       |   |   |   |   | R-type |
|    |    |       |    |    |    |       |       |   |   |   |   | I-type |
|    |    |       |    |    |    |       |       |   |   |   |   | S-type |
|    |    |       |    |    |    |       |       |   |   |   |   | B-type |

## 32-bit immediates produced, $\text{imm}[31:0]$

| 31 | 30 | 20 19                 | 12 | 11               | 10                   | 5                    | 4                    | 1                 | 0 |             |
|----|----|-----------------------|----|------------------|----------------------|----------------------|----------------------|-------------------|---|-------------|
|    |    | — $\text{inst}[31]$ — |    |                  |                      | $\text{inst}[30:25]$ | $\text{inst}[24:21]$ | $\text{inst}[20]$ |   | I-immediate |
|    |    | — $\text{inst}[31]$ — |    |                  |                      | $\text{inst}[30:25]$ | $\text{inst}[11:8]$  | $\text{inst}[7]$  |   | S-immediate |
|    |    | — $\text{inst}[31]$ — |    | $\text{inst}[7]$ | $\text{inst}[30:25]$ | $\text{inst}[11:8]$  | 0                    |                   |   | B-immediate |

Upper bits sign-extended from  $\text{inst}[31]$  always

Only bit 7 of instruction changes role in immediate between S and B

# Implementing JALR Instruction (I-Format)



- JALR rd, rs, immediate
  - Writes PC+4 to Reg[rd] (return address)
  - Sets PC = Reg[rs1] + offset
  - Uses same immediates as arithmetic and loads
    - *no* multiplication by 2 bytes

# Review: Adding branches to datapath



# Adding jalr to datapath



# Adding jalr to datapath



# Implementing jal Instruction

*Uses the “J-type” instruction format*



- JAL saves PC+4 in Reg[rd] (the return address)
- Set PC = PC + offset (PC-relative jump)
- Target somewhere within  $\pm 2^{19}$  locations, 2 bytes apart
  - $\pm 2^{18}$  32-bit instructions
- Immediate encoding optimized similarly to branch instruction to reduce hardware cost

# Adding jal to datapath



# Adding jail to datapath



# Single-Cycle RISC-V RV32I Datapath



# *Controller Implementation:*

- ❑ Control logic works really well as a case statement...

```
always @* begin
    op = instr[26:31];
    imm = instr[15:0]; ...

    reg_dst = 1'bx;    // Don't care
    reg_write = 1'b0; // By default don't write
    ...
    case (op)
        6'b000000: begin reg_write = 1; ... end
        ...
    endcase
end
```

# COVID-19 Announcements (tentative)

- ❑ Lectures:
  - ❑ All lectures are recorded and available online.
  - ❑ If you feel ill, *please do not come to class*.
  - ❑ If you feel uncomfortable coming to class, or would simply like to do your part to help stop the spread, please feel free not to come.
- ❑ Labs:
  - ❑ Lab meetings will continue, however, ...
  - ❑ If you feel ill, *please do not come to lab*, but do let us know.
  - ❑ All lab software is available remotely, so feel free to not work in 125 Cory.
  - ❑ FPGA boards are available for check-out.
  - ❑ Check-offs will be done in person as before, or now can be done with Zoom/Skype (contact Tan).

# **COVID-19 Announcements (tentative)**

- ❑ Exam:
  - ❑ To be held during previously scheduled time slot, 6-9PM Thursday.
  - ❑ However, it will be take-at-home style.
  - ❑ Exam will be posted at 6PM and your solution must be turned in using gradescope by 9PM.
  - ❑ You will be allowed to use lecture notes and other resources, but not allowed to discuss exam questions or answers with other students during the exam time.
- ❑ Office hours:
  - ❑ for meeting the GSIs or professor, in addition to in person meetings, teleconference (Skype/Zoom) will be available
- ❑ Discussion Session:
  - ❑ Will be held as usual, and Zoom will be used for remote participation.



## Processor Pipelining

# Review: Processor Performance

## Program Execution Time

$$= (\# \text{ instructions})(\text{cycles/instruction})(\text{seconds/cycle})$$

$$= \# \text{ instructions} \times \text{CPI} \times T_C$$

# Single-Cycle Performance

- $T_C$  is limited by the critical path ( $1w$ )



# Single-Cycle Performance

- *Single-cycle critical path:*

$$T_c = t_{q\_PC} + t_{mem} + \max(t_{RFread}, t_{sext} + t_{mux}) + t_{ALU} + t_{mem} + t_{mux} + t_{RFsetup}$$

- *In most implementations, limiting paths are:*

- *memory, ALU, register file.*

- $T_c = t_{q\_PC} + 2t_{mem} + t_{RFread} + t_{mux} + t_{ALU} + t_{RFsetup}$

# Pipelined Processor

- Use temporal parallelism
- Divide single-cycle processor into 5 stages:
  - Fetch
  - Decode
  - Execute
  - Memory
  - Writeback
- Add pipeline registers between stages

# Single-Cycle vs. Pipelined Performance

## Single-Cycle



## Pipelined



# Single-Cycle and Pipelined Datapath



# Corrected Pipelined Datapath

- *WriteReg must arrive at the same time as Result*



# Pipelined Control



Same control unit as single-cycle processor  
Control delayed to proper pipeline stage

# Pipeline Hazards

- Occurs when an instruction depends on results from previous instruction that hasn't completed.
- Types of hazards:
  - **Data hazard:** register value not written back to register file yet
  - **Control hazard:** next instruction not decided yet (caused by branches)

*We need to design ways to avoid hazards, else we pay the price in CPI (cycles per instruction) and processor performance suffers.*

# Processor Pipelining

*Deeper pipeline example.*

|            |            |           |            |            |           |           |           |
|------------|------------|-----------|------------|------------|-----------|-----------|-----------|
| <b>IF1</b> | <b>IF2</b> | <b>ID</b> | <b>X1</b>  | <b>X2</b>  | <b>M1</b> | <b>M2</b> | <b>WB</b> |
|            |            |           | <b>IF1</b> | <b>IF2</b> | <b>ID</b> | <b>X1</b> | <b>X2</b> |

*Deeper pipelines => less logic per stage => high clock rate.*

*But*

*Deeper pipelines\* => more hazards => more cost and/or higher CPI.*

*Cycles per instruction might go up because of unresolvable hazards.*

*Remember, Performance = # instructions X Frequency<sub>clk</sub> / CPI*

*\*Many designs included pipelines as long as 7, 10 and even 20 stages (like in the [Intel Pentium 4](#)). The later "Prescott" and "Cedar Mill" Pentium 4 cores (and their [Pentium D](#) derivatives) had a 31-stage pipeline.*

*How about shorter pipelines ... Less cost, less performance (but higher cost efficiency)*



## 3-Stage Pipeline

# 3-Stage Pipeline (used for FPGA/ASIC project)

*The blocks in the datapath with the greatest delay are: IMEM, ALU, and DMEM. Allocate one pipeline stage to each:*



*Use PC register as address to IMEM and retrieve next instruction. Instruction gets stored in a pipeline register, also called “instruction register”, in this case.*

*Use ALU to compute result, memory address, or branch target address.*

*Access data memory or I/O device for load or store. Allow for setup time for register file write.*

*Most details you will need to work out for yourself. Some details to follow ... In particular, let's look at hazards.*

# 3-stage Pipeline

## Data Hazard



The fix:

Selectively forward ALU result back to input of ALU.



- Need to add mux at input to ALU, add control logic to sense when to activate. Check reference for details.

# 3-stage Pipeline

## Load Hazard

|                   |   |   |   |   |  |
|-------------------|---|---|---|---|--|
| Iw x5, offset(x4) | I | X | M |   |  |
| add x7, x6, x5    |   | I | X | M |  |

value needed here!

Memory value known here. It is written into the regfile on this edge.

The fix: *Delay the dependent instruction by one cycle to allow the load to complete, send the result of load directly to the ALU (and to the regfile). No delay if not dependent!*

|                   |   |   |     |     |   |
|-------------------|---|---|-----|-----|---|
| Iw x5, offset(x4) | I | X | M   |     |   |
| add x7, x6, x5    |   | I | nop | nop |   |
| add x7, x6, x5    |   |   | I   | X   | M |

# 3-stage Pipeline

## Control Hazard



Several Possibilities:<sup>\*</sup>

The fix:

1. Always delay fetch of instruction after branch
  2. Assume branch “not taken”, continue with instruction at PC+4, and correct later if wrong.
  3. Predict branch taken or not based on history (state) and correct later if wrong.
- 
1. Simple, but all branches now take 2 cycles (lowers performance)
  2. Simple, only some branches take 2 cycles (better performance)
  3. Complex, very few branches take 2 cycles (best performance)

\* MIPS defines “branch delay slot”, RISC-V doesn’t

# Predict “not taken”

## Control Hazard

Branch address ready at end of X stage:

- If branch “not taken”, do nothing.
- If branch “taken”, then kill instruction in I stage (about to enter X stage) and fetch at new target address (PC)

|                           |   |   |   |   |   |
|---------------------------|---|---|---|---|---|
| <b>bneq x1, x1, L1</b>    | I | X | M |   |   |
| <b>add x5, x3, x4</b>     |   | I | X | M |   |
| <b>add x6, x1, x2</b>     |   |   | I | X | M |
| <b>L1: sub x7, x6, x5</b> |   |   |   | I | X |

**Not taken**

|                           |   |   |     |     |   |
|---------------------------|---|---|-----|-----|---|
| <b>beq x1, x1, L1</b>     | I | X | M   |     |   |
| <b>add x5, x3, x4</b>     |   | I | nop | nop |   |
| <b>L1: sub x7, x6, x5</b> |   |   | I   | X   | M |
|                           |   |   |     |     |   |

**Taken**

## EECS151 Project CPU Pipelining Summary



- Pipeline rules:
  - Writes/reads to/from DMem are clocked on the leading edge of the clock in the “M” stage
  - Writes to RegFile at the end of the “M” stage
  - Instruction Decode and Register File access is up to you.
- Branch: predict “not-taken”
- Load: 1 cycle delay/stall on *dependent* instruction
- Bypass ALU for data hazards
- More details in upcoming spec