

ECE 281

# Lesson 34 & 36: Microarchitecture

Digital Design and Computer Architecture Lecture Notes

© 2021 Sarah Harris and David Harris

These notes may be used and modified for educational and/or  
non-commercial purposes so long as the source is attributed.

Modified for use by USAF Academy, 2025

# Chapter 7 :: Topics

- Introduction
- Performance Analysis
- Single-Cycle Processor
- Multicycle Processor
- Pipelined Processor
- Advanced Microarchitecture

which is faster?

Next lesson:  
Lab 5 Prelab Workday

instructor  
perogative  
day

No Lab 5 Report

- 1: [Introduction](#)
- 2: [Single-Cycle Processor: Datapath lw Instruction](#)
- 3: [Single-Cycle Processor: Datapath Other Instructions](#)
- 4: [Single-Cycle Processor: Control](#)
- 5: [Single-Cycle Processor: Extending](#)
- 6: [Single-Cycle Processor: Performance](#)
- 6a: [Single-Cycle Processor: Testbench](#)
- 6b: [Single-Cycle Processor: SystemVerilog](#)
- 6c: [Single-Cycle Processor: Tie Celebration](#)



# Last Time



← J ALPHA

**Address of main:**  
0x10180

← pc = 0x00010180

# Introduction

- **Microarchitecture:** how to implement an architecture in hardware
- Processor:
  - **Datapath:** functional blocks
  - **Control:** control signals



# Microarchitecture

- **Multiple implementations** for a single architecture:
  - **Single-cycle**: Each instruction executes in a single cycle
  - **Multicycle**: Each instruction is broken up into series of shorter steps
  - **Pipelined**: Each instruction broken up into series of steps & multiple instructions execute at once



# Review: Instruction Formats

| 7 bits                 | 5 bits                          | 5 bits | 3 bits | 5 bits                | 7 bits |        |
|------------------------|---------------------------------|--------|--------|-----------------------|--------|--------|
| funct7                 | rs2                             | rs1    | funct3 | rd                    | op     | R-Type |
| imm <sub>11:0</sub>    |                                 | rs1    | funct3 | rd                    | op     | I-Type |
| imm <sub>11:5</sub>    | rs2                             | rs1    | funct3 | imm <sub>4:0</sub>    | op     | S-Type |
| imm <sub>12,10:5</sub> | rs2                             | rs1    | funct3 | imm <sub>4:1,11</sub> | op     | B-Type |
|                        | imm <sub>31:12</sub>            |        |        | rd                    | op     | U-Type |
|                        | imm <sub>20,10:1,11,19:12</sub> |        |        | rd                    | op     | J-Type |
| 20 bits                |                                 |        |        | 5 bits                | 7 bits |        |

# RISC-V Processor

- Consider **subset** of RISC-V instructions:

- R-type ALU instructions:

- add, sub, and, or, slt  $\leftarrow$  R-Type

- Memory instructions:

$\xrightarrow{J\text{-type}}$  • lw, sw  $\leftarrow$  S-Type

- Branch instructions:

- beq  $\leftarrow$  B-Type

# RISC-V Architectural State Elements



# Chapter 7: Microarchitecture

## **Single-Cycle RISC-V Processor**

# Example Program

- Design datapath
- View example program executing

*in Instruction Memory*

## Example Program:

| Address | Instruction        | Type | Fields                                                                                                          | Machine Language |
|---------|--------------------|------|-----------------------------------------------------------------------------------------------------------------|------------------|
| 0x1000  | l7: lw x6, -4 (x9) | I    | imm <sub>11:0</sub> 111111111100<br>rs1 01001<br>f3 010<br>rd 00110<br>op 0000011                               | FFC4A303         |
| 0x1004  | sw x6, 8 (x9)      | S    | imm <sub>11:5</sub> 0000000<br>rs2 00110<br>rs1 01001<br>f3 010<br>imm <sub>4:0</sub> 01000<br>op 0100011       | 0064A423         |
| 0x1008  | or x4, x5, x6      | R    | funct7 0000000<br>rs2 00110<br>rs1 00101<br>f3 110<br>rd 00100<br>op 0110011                                    | 0062E233         |
| 0x100C  | beq x4, x4, L7     | B    | imm <sub>12,10:5</sub> 1111111<br>rs2 00100<br>rs1 00100<br>f3 000<br>imm <sub>4:1,11</sub> 10101<br>op 1100011 | FE420AE3         |

# Single-Cycle RISC-V Processor

- **Datapath:** start with `lw` instruction

- **Example:**      `lw x6, -4(x9)`

`lw rd, imm(rs1)`

## I-Type



# Single-Cycle Datapath: lw fetch

## STEP 1: Fetch instruction



| Address | Instruction                     | Type | Fields                                                                                                  | Machine Language                   |
|---------|---------------------------------|------|---------------------------------------------------------------------------------------------------------|------------------------------------|
| 0x1000  | <code>l7: lw x6, -4 (x9)</code> | I    | $\text{imm}_{11:0}$<br>111111111100<br><code>rs1</code> 01001 <code>f3</code> 010 <code>rd</code> 00110 | <code>op</code> 0000011   FFC4A303 |

# Single-Cycle Datapath: lw Reg Read

**STEP 2:** Read source operand (**rs1**) from RF



# Single-Cycle Datapath: lw Immediate

## **STEP 3:** Extend the immediate



# Single-Cycle Datapath: lw Address

## STEP 4: Compute the memory address



| ALUControl <sub>2:0</sub> | Function |
|---------------------------|----------|
| 000                       | add      |
| 001                       | subtract |
| 010                       | and      |
| 011                       | or       |
| 101                       | SLT      |

| Address | Instruction        | Type | Fields                                                                            | Machine Language |
|---------|--------------------|------|-----------------------------------------------------------------------------------|------------------|
| 0x1000  | L7: lw x6, -4 (x9) | I    | imm <sub>11:0</sub> : 111111111100<br>rs1: 01001 f3: 010 rd: 00110<br>op: 0000011 | FFC4A303         |

# Single-Cycle Datapath: lw Mem Read

**STEP 5:** Read data from memory and write it back to register file



| Address | Instruction        | Type | Fields                                       | Machine Language                              |
|---------|--------------------|------|----------------------------------------------|-----------------------------------------------|
| 0x1000  | I7: lw x6, -4 (x9) | I    | imm <sub>11:0</sub><br>rs1<br>f3<br>rd<br>op | 111111111100 01001 010 00110 0000011 FFC4A303 |

6

# Single-Cycle Datapath: PC Increment

## STEP 6: Determine address of next instruction



# Chapter 7: Microarchitecture

## **Single-Cycle Datapath: Other Instructions**

# Single-Cycle Datapath: sw

- **Immediate:** now in  $\{\text{instr}[31:25], \text{instr}[11:7]\}$
- **Add control signals:** ImmSrc, MemWrite



| Address | Instruction   | Type | Fields                         |              |              |           |                             |               | Machine Language |
|---------|---------------|------|--------------------------------|--------------|--------------|-----------|-----------------------------|---------------|------------------|
| 0x1004  | sw x6, 8 (x9) | S    | imm <sub>11:5</sub><br>0000000 | rs2<br>00110 | rs1<br>01001 | f3<br>010 | imm <sub>4:0</sub><br>01000 | op<br>0100011 | 0064A423         |

# Single-Cycle Datapath: Immediate

| ImmSrc | ImmExt                                                    | Instruction Type |
|--------|-----------------------------------------------------------|------------------|
| 0      | <code>{{20{instr[31]}}, instr[31:20]}</code>              | I-Type           |
| 1      | <code>{{20{instr[31]}}, instr[31:25], instr[11:7]}</code> | S-Type           |

Difference?



# Single-Cycle Datapath: R-type

- Read from **rs1** and **rs2** (instead of **imm**)
- Write **ALUResult** to **rd**



| Address | Instruction   | Type | Fields                                                                                                                 | Machine Language |
|---------|---------------|------|------------------------------------------------------------------------------------------------------------------------|------------------|
| 0x1008  | or x4, x5, x6 | R    | funct7<br>0000000      rs2<br>00110      rs1<br>00101      f3<br>110      rd<br>00100      op<br>0110011      0062E233 |                  |

# Single-Cycle Datapath: beq

Calculate **target address: PCTarget = PC + imm**



| Address | Instruction    | Type | Fields                                                                                              | Machine Language |
|---------|----------------|------|-----------------------------------------------------------------------------------------------------|------------------|
| 0x100C  | beq x4, x4, L7 | B    | imm <sub>12,10:5</sub> rs2 rs1 f3 imm <sub>4:1,11</sub> op<br>1111111 00100 00100 000 10101 1100011 | FE420AE3         |

# Single-Cycle Datapath: ImmExt

| ImmSrc <sub>1:0</sub> | ImmExt                                                                                                                  | Instruction Type |
|-----------------------|-------------------------------------------------------------------------------------------------------------------------|------------------|
| 00                    | $\{{\text{20}\{\text{instr}[31]\}}, \text{instr[31:20]}$                                                                | I-Type           |
| 01                    | $\{{\text{20}\{\text{instr}[31]\}}, \text{instr[31:25]}, \text{instr[11:7]}$                                            | S-Type           |
| 10                    | $\{{\text{19}\{\text{instr}[31]\}}, \text{instr[31]}, \text{instr[7]}, \text{instr[30:25]}, \text{instr[11:8]}, 1'b0\}$ | B-Type           |

I-Type



S-Type



B-Type



# Single-Cycle RISC-V Processor



Chapter 7: Microarchitecture

# **Single-Cycle Control**

# Single-Cycle Control

## High-Level View



## Low-Level View



# Single-Cycle Control: Main Decoder

| op | Instr. | RegWrite | ImmSrc | ALUSrc | MemWrite | ResultSrc | Branch | ALUOp |
|----|--------|----------|--------|--------|----------|-----------|--------|-------|
| 3  | lw     | 1        | 00     | 1      | 0        | 1         | 0      | 00    |
| 35 | sw     | 0        | 01     | 1      | 1        | X         | 0      | 00    |
| 51 | R-type | 1        | XX     | 0      | 0        | 0         | 0      | 10    |
| 99 | beq    | 0        | 10     | 0      | 0        | X         | 1      | 01    |



# Review: ALU

| ALUControl <sub>2:0</sub> | Function |
|---------------------------|----------|
| 000                       | add      |
| 001                       | subtract |
| 010                       | and      |
| 011                       | or       |
| 101                       | SLT      |



# Review: ALU

| <b>ALUControl<sub>2:0</sub></b> | <b>Function</b> |
|---------------------------------|-----------------|
| 000                             | add             |
| 001                             | subtract        |
| 010                             | and             |
| 011                             | or              |
| 101                             | SLT             |



# Single-Cycle Control: ALU Decoder



# Single-Cycle Control: ALU Decoder

| <i>ALUOp</i> | <i>funct3</i> | <i>op<sub>5</sub>, funct<sub>7</sub><sub>5</sub></i> | <i>Instruction</i> | <i>ALUControl<sub>2:0</sub></i> |
|--------------|---------------|------------------------------------------------------|--------------------|---------------------------------|
| 00           | x             | x                                                    | <b>lw, sw</b>      | 000 (add)                       |
| 01           | x             | x                                                    | <b>beq</b>         | 001 (subtract)                  |
| 10           | 000           | 00, 01, 10                                           | <b>add</b>         | 000 (add)                       |
|              | 000           | 11                                                   | <b>sub</b>         | 001 (subtract)                  |
|              | 010           | x                                                    | <b>slt</b>         | 101 (set less than)             |
|              | 110           | x                                                    | <b>or</b>          | 011 (or)                        |
|              | 111           | x                                                    | <b>and</b>         | 010 (and)                       |



# Example: and

And

| op | Instruct | RegWrite | ImmSrc | ALUSrc | MemWrite | ResultSrc | Branch | ALUOp |
|----|----------|----------|--------|--------|----------|-----------|--------|-------|
| 51 | R-type   | 1        | XX     | 0      | 0        | 0         | 0      | 10    |



# Chapter 7: Microarchitecture

**Extending the  
Single-Cycle  
Processor**

# Extended Functionality: I-Type ALU

Enhance the single-cycle processor to handle **I-Type ALU instructions**: addi, andi, ori, and slti

- Similar to R-type instructions
- But second source comes from immediate
- Change **ALUSrc** to select the immediate
- And **ImmSrc** to pick the correct immediate

# Extended Functionality: I-Type ALU

| op        | Instruct.     | RegWrite | ImmSrc    | ALUSrc   | MemWrite | ResultSrc | Branch   | ALUOp     |
|-----------|---------------|----------|-----------|----------|----------|-----------|----------|-----------|
| 3         | <b>lw</b>     | 1        | 00        | 1        | 0        | 1         | 0        | 00        |
| 35        | <b>sw</b>     | 0        | 01        | 1        | 1        | X         | 0        | 00        |
| 51        | <b>R-type</b> | 1        | <b>XX</b> | <b>0</b> | <b>0</b> | <b>0</b>  | <b>0</b> | <b>10</b> |
| 99        | <b>beq</b>    | 0        | 10        | 0        | 0        | X         | 1        | 01        |
| <b>19</b> | <b>I-type</b> | <b>1</b> | <b>00</b> | <b>1</b> | <b>0</b> | <b>0</b>  | <b>0</b> | <b>10</b> |



# Extended Functionality: addi

addi

| op | Instruct. | RegWrite | ImmSrc | ALUSrc | MemWrite | ResultSrc | Branch | ALUOp |
|----|-----------|----------|--------|--------|----------|-----------|--------|-------|
| 19 | I-type    | 1        | 00     | 1      | 0        | 0         | 0      | 10    |



# Extended Functionality: jal

Enhance the single-cycle processor to handle **jal**

- **Similar to beq**
- But jump is **always taken**
  - *PCSrc* should be 1
- **Immediate format** is different
  - Need a new *ImmSrc* of 11
- And **jal must compute PC+4 and store in rd**
  - Take PC+4 from adder through ResultMux

① jump subroutine  
②  $rd \leftarrow PC + 4$

or gate

# Extended Functionality: jal



# Extended Functionality: *ImmExt*

| <b>ImmSrc<sub>1:0</sub></b> | <b>ImmExt</b>                                                                        | <b>Instruction Type</b> |
|-----------------------------|--------------------------------------------------------------------------------------|-------------------------|
| 00                          | <code>{{20{instr[31]}}, instr[31:20]}</code>                                         | I-Type                  |
| 01                          | <code>{{20{instr[31]}}, instr[31:25], instr[11:7]}</code>                            | S-Type                  |
| 10                          | <code>{{19{instr[31]}}, instr[31], instr[7], instr[30:25], instr[11:8], 1'b0}</code> | B-Type                  |
| <b>11</b>                   | <b><code>{{12{instr[31]}}, instr[19:12], instr[20], instr[30:21], 1'b0}</code></b>   | <b>J-Type</b>           |

**I-Type**

| 31:20                           | 19:15            | 14:12               | 11:7            | 6:0             |
|---------------------------------|------------------|---------------------|-----------------|-----------------|
| <code>imm<sub>11:0</sub></code> | <code>rs1</code> | <code>funct3</code> | <code>rd</code> | <code>op</code> |
| 12 bits                         | 5 bits           | 3 bits              | 5 bits          | 7 bits          |

**B-Type**

| 31:25                              | 24:20            | 19:15            | 14:12               | 11:7                              | 6:0             |
|------------------------------------|------------------|------------------|---------------------|-----------------------------------|-----------------|
| <code>imm<sub>12,10:5</sub></code> | <code>rs2</code> | <code>rs1</code> | <code>funct3</code> | <code>imm<sub>4:1,11</sub></code> | <code>op</code> |
| 7 bits                             | 5 bits           | 5 bits           | 3 bits              | 5 bits                            | 7 bits          |

**S-Type**

| 31:25                           | 24:20            | 19:15            | 14:12               | 11:7                           | 6:0             |
|---------------------------------|------------------|------------------|---------------------|--------------------------------|-----------------|
| <code>imm<sub>11:5</sub></code> | <code>rs2</code> | <code>rs1</code> | <code>funct3</code> | <code>imm<sub>4:0</sub></code> | <code>op</code> |
| 7 bits                          | 5 bits           | 5 bits           | 3 bits              | 5 bits                         | 7 bits          |

**J-Type**

| 31:12                                              | 11:7            | 6:0             |
|----------------------------------------------------|-----------------|-----------------|
| <b><code>imm<sub>20,10:1,11,19:12</sub></code></b> | <code>rd</code> | <code>op</code> |
| 20 bits                                            | 5 bits          | 7 bits          |

# Extended Functionality: jal

| op  | Instruct. | RegWrite | ImmSrc | ALUSrc | MemWrite | ResultSrc | Branch | ALUOp | Jump |
|-----|-----------|----------|--------|--------|----------|-----------|--------|-------|------|
| 3   | lw        | 1        | 00     | 1      | 0        | 01        | 0      | 00    | 0    |
| 35  | sw        | 0        | 01     | 1      | 1        | XX        | 0      | 00    | 0    |
| 51  | R-type    | 1        | XX     | 0      | 0        | 00        | 0      | 10    | 0    |
| 99  | beq       | 0        | 10     | 0      | 0        | XX        | 1      | 01    | 0    |
| 19  | I-type    | 1        | 00     | 1      | 0        | 00        | 0      | 10    | 0    |
| 111 | jal       | 1        | 11     | X      | 0        | 10        | 0      | XX    | 1    |

# Extended Functionality: jal

- jumps to subroutine  
- return address saved in ra

| op  | Instruct. | RegWrite | ImmSrc | ALUSrc | MemWrite | ResultSrc | Branch | ALUOp | Jump |
|-----|-----------|----------|--------|--------|----------|-----------|--------|-------|------|
| 111 | jal       | 1        | 11     | X      | 0        | 10        | 0      | XX    | 1    |



# Chapter 7: Microarchitecture

## **Single-Cycle Performance**

# Processor Performance

## Program Execution Time

$$= (\# \text{instructions})(\text{cycles/instruction})(\text{seconds/cycle})$$

$$= \# \text{instructions} \times \text{CPI} \times T_c$$

100      ·      |      + 100ns  
↑            ↑      ✎



# Single-Cycle Processor Performance



$T_C$  limited by critical path (1w)

# Single-Cycle Processor Performance

- **Single-cycle critical path:**

$$T_{c\_single} = t_{pcq\_PC} + t_{mem} + \max[t_{RFread}, t_{dec} + t_{ext} + t_{mux}] + t_{ALU} + t_{mem} + t_{mux} + t_{RFsetup}$$

- **Typically, limiting paths are:**

- memory, ALU, register file

- So,  $T_{c\_single} = t_{pcq\_PC} + t_{mem} + t_{RFread} + t_{ALU} + t_{mem} + t_{mux} + t_{RFsetup}$   
 $= t_{pcq\_PC} + 2t_{mem} + t_{RFread} + t_{ALU} + t_{mux} + t_{RFsetup}$

# Single-Cycle Performance Example

| Element                | Parameter            | Delay (ps)         |
|------------------------|----------------------|--------------------|
| Register clock-to-Q    | $t_{pcq\_PC}$        | 40                 |
| Register setup         | $t_{\text{setup}}$   | 50                 |
| Multiplexer            | $t_{\text{mux}}$     | 30                 |
| AND-OR gate            | $t_{\text{AND-OR}}$  | 20                 |
| ALU                    | $t_{\text{ALU}}$     | 120                |
| Decoder (Control Unit) | $t_{\text{dec}}$     | 25                 |
| Extend unit            | $t_{\text{ext}}$     | 35                 |
| Memory read            | $t_{\text{mem}}$     | 200 <i>slowest</i> |
| Register file read     | $t_{RF\text{read}}$  | 100                |
| Register file setup    | $t_{RF\text{setup}}$ | 60                 |

$$\begin{aligned}T_{c\_single} &= t_{pcq\_PC} + 2t_{\text{mem}} + t_{RF\text{read}} + t_{\text{ALU}} + t_{\text{mux}} + t_{RF\text{setup}} \\&= (40 + 2*200 + 100 + 120 + 30 + 60) \text{ ps} = 750 \text{ ps}\end{aligned}$$

# Single-Cycle Performance Example

Program with 100 billion instructions:

$$\begin{aligned}\text{Execution Time} &= \# \text{ instructions} \times \text{CPI} \times T_C \\ &= (100 \times 10^9)(1)(750 \times 10^{-12} \text{ s}) \\ &= \mathbf{75 \text{ seconds}}\end{aligned}$$