

# *RISC-V microarchitecture*

---

These transparencies are based on those provided with the following book:  
David Harris and Sarah Harris, “**Digital Design and Computer Architecture, RISC-V Edition**”,  
1st Edition, 2021, Elsevier

# Introduction

- **Microarchitecture:** how to implement an architecture in hardware
- Processor:
  - **Datapath:** functional blocks
  - **Control:** control signals
- RISC-V standard defines the ISA, not the architecture
- Different RISC-V architectures can be / have been developed
- We will consider a (CISC-like) multicycle architecture (because it is the simplest to understand)

# Processor Performance

- **Program execution time**

Execution Time = (#instructions)(cycles/instruction)(seconds/cycle)

- **Definitions:**

- CPI: cycles/instruction
- clock period: seconds/cycle

- **Challenge is to satisfy constraints of:**

- Cost
- Power
- Performance

# RISC-V Processor

- Consider **subset** of RISC-V instructions:
  - **R-type ALU instructions:**
    - **add, sub, and, or, slt**
  - **Memory instructions:**
    - **lw, sw**
  - **Branch instructions:**
    - **beq**
  - **I-type ALU instructions:**
    - **addi, andi, ori, xori**
  - **J-type instructions:**
    - **jal**

# Architectural State Elements

Determine everything about a processor:

– Architectural state:

- 32 registers
- PC
- Memory

# RISC-V Architectural State Elements



**RF**

# RISC-V Architectural State Elements



**RF**

# Multicycle RISC-V Processor



# Multicycle Control

# RISC-V instruction formats

| 7 bits                          | 5 bits | 5 bits | 3 bits | 5 bits                | 7 bits |               |
|---------------------------------|--------|--------|--------|-----------------------|--------|---------------|
| funct7                          | rs2    | rs1    | funct3 | rd                    | op     | <b>R-Type</b> |
| imm <sub>11:0</sub>             |        | rs1    | funct3 | rd                    | op     | <b>I-Type</b> |
| imm <sub>11:5</sub>             | rs2    | rs1    | funct3 | imm <sub>4:0</sub>    | op     | <b>S-Type</b> |
| imm <sub>12,10:5</sub>          | rs2    | rs1    | funct3 | imm <sub>4:1,11</sub> | op     | <b>B-Type</b> |
| imm <sub>31:12</sub>            |        |        |        | rd                    | op     | <b>U-Type</b> |
| imm <sub>20,10:1,11,19:12</sub> |        |        |        | rd                    | op     | <b>J-Type</b> |
| 20 bits                         |        |        | 5 bits |                       | 7 bits |               |

# Multicycle Control

## High-Level View



## Low-Level View



# Multicycle Control: Instruction Decoder



| op | Instruction | ImmSrc |
|----|-------------|--------|
| 3  | lw          | 00     |
| 35 | sw          | 01     |
| 51 | R-type      | XX     |
| 99 | beq         | 10     |

# Multicycle Control: Main FSM



**To declutter FSM:**

- **Write enable signals** (**RegWrite**, **MemWrite**, **IRWrite**, **PCUpdate**, and **Branch**) are **0** if not listed in a state.
- **Other signals are don't care** if not listed in a state

# Main FSM: Fetch



# Main FSM: Decode `lw` instruction



# Main FSM: Address



# Main FSM: Read Memory



# Main FSM: Read Memory Datapath



# Main FSM: Write RF



# Main FSM: Write RF Datapath



# Main FSM: Fetch Revisited

Calculate **PC+4**  
during Fetch stage  
(ALU isn't being  
used for other  
purposes)



# Main FSM: Fetch (PC+4) Datapath



# Multicycle Control: Other Instructions

# Main FSM: SW



# Main FSM: sw Datapath



# Main FSM: R-Type Execute



- **R-type ALU instructions: add, sub, and, or, slt**

# Main FSM: R-Type Execute Datapath



# Main FSM: R-Type ALU Write Back



# Main FSM: R-Type ALU Write Back



# Main FSM: beq

- **Need to calculate:** Branch Target Address:  
Calculate branch target address = PC + imm
  - Branch Target Address (BTA = PC + imm)
  - **rs1 - rs2** (to see if equal)
- **ALU** isn't being used in Decode stage
  - Use it to calculate Target Address (PC + imm)
- PC is updated in Fetch stage, so need to save **old (current) PC**

**B-Type**

| 31:25                  | 24:20  | 19:15  | 14:12  | 11:7                  | 6:0    |
|------------------------|--------|--------|--------|-----------------------|--------|
| imm <sub>12,10:5</sub> | rs2    | rs1    | funct3 | imm <sub>4:1,11</sub> | op     |
| 7 bits                 | 5 bits | 5 bits | 3 bits | 5 bits                | 7 bits |

# Main FSM: Decode Revisited



# Main FSM: Decode (Target Address)



# Main FSM: beq



# Main FSM: beq Datapath



# **Extending the RISC-V Multicycle Processor**

# Main FSM: I-Type ALU Execute



**I-type ALU instructions:**  
**addi ,andi ,ori ,xori**

# Main FSM: I-Type ALU Exec. Datapath



# Main FSM: jal



# Main FSM: jal Datapath



# Main FSM: jal



**PC + 4 is written to rd in S7: ALUWB**

**J-Type**



# Multicycle Processor Main FSM

| State    | Datapath μOp                                     |
|----------|--------------------------------------------------|
| Fetch    | Instr $\leftarrow$ Mem[PC]; PC $\leftarrow$ PC+4 |
| Decode   | ALUOut $\leftarrow$ PCTarget                     |
| MemAdr   | ALUOut $\leftarrow$ rs1 + imm                    |
| MemRead  | Data $\leftarrow$ Mem[ALUOut]                    |
| MemWB    | rd $\leftarrow$ Data                             |
| MemWrite | Mem[ALUOut] $\leftarrow$ rd                      |
| ExecuteR | ALUOut $\leftarrow$ rs1 op rs2                   |
| Executel | ALUOut $\leftarrow$ rs1 op imm                   |
| ALUWB    | rd $\leftarrow$ ALUOut                           |
| BEQ      | ALUResult = rs1-rs2; if Zero, PC = ALUOut        |
| JAL      | PC = ALUOut; ALUOut = PC+4                       |



# Performance

# Processor Performance

- Instructions take different number of cycles:
  - 3 cycles: `beq`
  - 4 cycles: R-type, `addi`, `sw`, `jal`
  - 5 cycles: `lw`
- CPI is weighted average
- SPECINT2000 benchmark:
  - **25%** loads
  - **10%** stores
  - **13%** branches
  - **52%** R-type

$$\text{Average CPI} = (0.13)(3) + (0.52 + 0.10)(4) + (0.25)(5) = 4.12$$

# Critical Path



# Processor Performance

Multicycle critical path:

- **Assumptions:**
  - RF is faster than memory
  - Writing memory is faster than reading memory

$$T_{c\_multi} = t_{pcq} + t_{dec} + 2t_{mux} + \max(t_{ALU}, t_{mem}) + t_{setup}$$

# Performance Example

| Element                | Parameter            | Delay (ps) |
|------------------------|----------------------|------------|
| Register clock-to-Q    | $t_{pcq \text{ PC}}$ | 40         |
| Register setup         | $t_{\text{setup}}$   | 50         |
| Multiplexer            | $t_{\text{mux}}$     | 30         |
| AND-OR gate            | $t_{\text{AND-OR}}$  | 20         |
| ALU                    | $t_{\text{ALU}}$     | 120        |
| Decoder (Control Unit) | $t_{\text{dec}}$     | 25         |
| Extend unit            | $t_{\text{dec}}$     | 35         |
| Memory read            | $t_{\text{mem}}$     | 200        |
| Register file read     | $t_{RF\text{read}}$  | 100        |
| Register file setup    | $t_{RF\text{setup}}$ | 60         |

$$T_{c\_multi} = t_{pcq} + t_{\text{dec}} + 2t_{\text{mux}} + \max(t_{\text{ALU}}, t_{\text{mem}}) + t_{\text{setup}}$$

$$= \boxed{\quad}$$

# Performance Example

For a program with **100 billion** instructions executing on a **multicycle** RISC-V processor

- **CPI** = 4.12 cycles/instruction
- **Clock cycle time:**  $T_{c\_multi}$  = 375 ps

**Execution Time** = (# instructions)  $\times$  CPI  $\times$   $T_c$

.