

# Microarchitecture (Fundamentals & Design)

Microarchitecture: Implementation of the ISA under specific design constraints and goals

## How Does a Machine Process Instructions?

- What does processing an instruction mean?
- We will assume the von Neumann model (for now)



- Processing an instruction: Transforming AS to AS' according to the ISA specification of the instruction

# The “Process Instruction” Step

- ISA specifies abstractly what AS' should be, given an instruction and AS
  - It defines an **abstract finite state machine** where
    - State = programmer-visible state
    - Next-state logic = instruction execution specification
  - From ISA point of view, there are no “intermediate states” between AS and AS' during instruction execution
    - One state transition per instruction
- Microarchitecture implements how AS is transformed to AS'
  - There are many choices in implementation
  - We can have programmer-invisible state to optimize the speed of instruction execution: **multiple** state transitions per instruction
    - Choice 1: AS → AS' (transform AS to AS' in a single clock cycle)
    - Choice 2: AS → AS+MS1 → AS+MS2 → AS+MS3 → AS' (take multiple clock cycles to transform AS to AS')

microactual State -

## A Very Basic Instruction Processing Engine

- Each instruction takes a single clock cycle to execute
- Only combinational logic is used to implement instruction execution
  - No intermediate, programmer-invisible, state updates

AS = Architectural (programmer visible) state  
at the beginning of a clock cycle



AS' = Architectural (programmer visible) state  
at the end of a clock cycle

- # Single-cycle vs. Multi-cycle Machines
- Single-cycle machines
    - Each instruction takes a single clock cycle
    - All state updates made at the end of an instruction's execution
    - Big disadvantage: The slowest instruction determines cycle time → long clock cycle time
  - Multi-cycle machines
    - Instruction processing broken into multiple cycles/stages
    - State updates can be made during an instruction's execution
    - Architectural state updates made at the end of an instruction's execution
    - Advantage over single-cycle: The slowest "stage" determines cycle time
  - Both single-cycle and multi-cycle machines literally follow the von Neumann model at the microarchitecture level

## Single-cycle vs. Multi-cycle: Control & Data

- Single-cycle machine:
  - Control signals are generated in the same clock cycle as the one during which data signals are operated on
  - Everything related to an instruction happens in one clock cycle (serialized processing)
- Multi-cycle machine:
  - Control signals needed in the next cycle can be generated in the current cycle
  - Latency of control processing can be overlapped with latency of datapath operation (more parallelism)

# Instruction Processing Viewed Another Way

- Instructions transform Data (AS) to Data' (AS')
- This transformation is done by **functional units**
  - Units that "operate" on data
- These units need to be told what to do to the data
- An instruction processing engine consists of two components
  - **Datapath:** Consists of **hardware elements** that deal with and transform data signals
    - **functional units** that operate on data
    - **hardware structures** (e.g., wires, muxes, decoders, tri-state bufs) that enable the flow of data into the functional units and registers
    - **storage units** that store data (e.g., registers)
  - **Control logic:** Consists of **hardware elements** that determine control signals, i.e., **signals that specify what the datapath elements should do to the data**

## Many Ways of Datapath and Control Design

- There are many ways of designing the **datapath** and **control logic**
- Example ways
  - Single-cycle, multi-cycle, pipelined datapath and control
  - Single-bus vs. multi-bus datapaths
  - Hardwired/combinational vs. microcoded/microprogrammed control
    - Control signals generated by combinational logic versus
    - Control signals stored in a memory structure
- **Control signals and structure depend on the datapath design**

# Above are the fundamentals of Microarchitecture

Next we will step into the design.

## Performance Analysis:

Execution time of a single instruction:

$$\underbrace{\{CPI\}}_{\downarrow} \times \{\text{clock cycle time}\}$$

cycles per instruction

Execution time of an entire program

Sum over all instructions  $\sum \{CPI\} \times \{\text{clock cycle time}\}$

Also:  $\sum \{\# \text{ of instructions}\} \times \{\text{Average CPI}\} \times \{\text{clock cycle time}\}$

Single-cycle uarch performance

$$CPI = 1$$

Clock cycle time = longggg

Multi-cycle uarch performance

CPI = different for each instruction

Clock cycle time = short

Average CPI  $\rightarrow$  hopefully small

We have 2 degrees of freedom to optimize independently

# MIPS State Elements



- Program counter:  
32-bit register
- Instruction memory:  
Takes 32-bit address A and reads the 32-bit data (i.e., instruction) from that address to the read data output RD
- Register file:  
The 32-element, 32-bit register file has 2 read ports and 1 write port
- Data memory:  
If the write enable, WE, is 1, it writes 32-bit data WD into memory location at 32-bit address A on the rising edge of the clock.  
If the write enable is 0, it reads 32-bit data from address A onto RD.

For now, we will assume:  
ultra-fast memory (and reg-file)  
Combinational read

Output of the read data port is a combinational function of the memory contents and the corresponding input address.

Synchronous write

target location is updated at the positive edge clock transition when write enable is asserted.

Single-cycle, synchronous memory

Contrast this with memory that tells when the data is ready (i.e. Read Signal : indicating the r/w is done)

# Instruction Processing

- 5 generic steps (P&H book)
  - Instruction fetch (IF)
  - Instruction decode and register operand fetch (ID/RF)
  - Execute/Evaluate memory address (EX/AG)
  - Memory operand fetch (MEM)
  - Store/writeback result (WB)



# R-Type ALU Instructions

- R-type: 3 register operands

MIPS assembly (e.g., register-register signed addition)

```
add $s0, $s1, $s2      #$s0=rd, $s1=rs, $s2=rt
```



- Semantics

```
if MEM[PC] == add rd rs rt  
GPR[rd] ← GPR[rs] + GPR[rt]  
PC ← PC + 4
```

## (R-Type) ALU Datapath



# I-Type ALU Instructions

- I-type: 2 register operands and 1 immediate

MIPS assembly (e.g., register-immediate signed addition)

addi \$s0, \$s1, 5

#\$s0=rt, \$s1=rs

## Machine Encoding

|          |        |        |           |
|----------|--------|--------|-----------|
| addi (0) | rs     | rt     | immediate |
| 6 bits   | 5 bits | 5 bits | 16 bits   |

I-Type

## Semantics

if  $\text{MEM}[\text{PC}] == \text{addi } \text{rs } \text{rt } \text{immediate}$

$\text{PC} \leftarrow \text{PC} + 4$

$\text{GPR}[\text{rt}] \leftarrow \text{GPR}[\text{rs}] + \text{sign-extend}(\text{immediate})$

## Datapath for R- and I-Type ALU Insts.



if  $\text{MEM}[\text{PC}] == \text{ADDI } \text{rt } \text{rs } \text{immediate}$   
 $\text{GPR}[\text{rt}] \leftarrow \text{GPR}[\text{rs}] + \text{sign-extend}(\text{immediate})$   
 $\text{PC} \leftarrow \text{PC} + 4$

IF ID EX MEM WB

Combinational state update logic

# Load Instructions

## ■ Load 4-byte word

MIPS assembly

`lw $s3, 8($s0)`

`#$s0=rs, $s3=rt`

Machine Encoding



## ■ Semantics

if  $\text{MEM}[\text{PC}] == \text{lw rt offset}_{16}(\text{base})$

$\text{PC} \leftarrow \text{PC} + 4$

address = sign-extend(offset) + GPR(base)

$\text{GPR}[rt] \leftarrow \text{MEM}[\text{address}]$

General Purpose Register.

Add this part  
based on I-type  
Datapath

## LW Datapath



if  $\text{MEM}[\text{PC}] == \text{lw rt offset}_{16}(\text{base})$   
 $\text{address} = \text{sign-extend}(\text{offset}) + \text{GPR}[\text{base}]$   
 $\text{GPR}[rt] \leftarrow \text{MEM}[\text{address}]$   
 $\text{PC} \leftarrow \text{PC} + 4$



# Store Instructions

## ■ Store 4-byte word

MIPS assembly

sw \$s3, 8(\$s0)

#\$s0=rs, \$s3=rt

Machine Encoding



I-Type

## ■ Semantics

if  $\text{Mem}[\text{PC}] == \text{sw rt offset}_{16}(\text{base})$

$\text{PC} \leftarrow \text{PC} + 4$

address = sign-extend(offset) + GPR(base)

$\text{MEM}[\text{address}] \leftarrow \text{GPR}[rt]$

## SW Datapath



if  $\text{MEM}[\text{PC}] == \text{SW rt offset}_{16}(\text{base})$   
address = sign-extend(offset) + GPR(base)  
 $\text{MEM}[\text{address}] \leftarrow \text{GPR}[rt]$   
 $\text{PC} \leftarrow \text{PC} + 4$

IF ID EX MEM WB

Combinational state update logic

We combine all the above together!

# Datapath for Non-Control-Flow Insts.



The data written back can come from either the "ALU result" or "Read data"

# Control Flow Instruction

## Jump Instruction

- Unconditional branch or jump

j target

|        |           |
|--------|-----------|
| j (2)  | immediate |
| 6 bits | 26 bits   |

J-Type

- 2 = opcode
- immediate (target) = target address

- Semantics

if  $\text{MEM}[\text{PC}] == j \text{ immediate}_{26}$

Computational logic

target = {  $\text{PC}^+ [31:28]$ , immediate<sub>26</sub>, 2' b00 }

$\text{PC} \leftarrow \text{target}$

BIT Slicing!

Similar

## Other Jumps in MIPS

- jr: jump register

Semantics

if  $\text{MEM}[\text{PC}] == jr \text{ rs}$

$\text{PC} \leftarrow \text{GPR(rs)}$



- jal: jump and link (function calls)

Semantics

if  $\text{MEM}[\text{PC}] == jal \text{ immediate}_{26}$

$\$ra \leftarrow \text{PC} + 4$

Store ahead in order to return  
in future

target = {  $\text{PC}^+ [31:28]$ , immediate<sub>26</sub>, 2' b00 }

$\text{PC} \leftarrow \text{target}$

Similar

- jalr: jump and link register

Semantics

if  $\text{MEM}[\text{PC}] == jalr \text{ rs}$

$\$ra \leftarrow \text{PC} + 4$

$\text{PC} \leftarrow \text{GPR(rs)}$

# Conditional Branch Instructions

### ■ beq (Branch if Equal)

**beq \$s0, \$s1, offset** # $\$s0=rs, \$s1=rt$



- Semantics (assuming no branch delay slot)

if  $\text{MEM}[\text{PC}] == \text{beq } \text{rs } \text{rt } \text{immediate}_{16}$

$$\text{target} = \text{PC}^+ + \text{sign-extend}(\text{immediate}) \times 4$$

if GPR[rs]==GPR[rt] then PC  $\leftarrow$  target

else  $\text{PC} \leftarrow \text{PC} + 4$

#### ❑ Variations: beq, bne, blez, bgtz

## Conditional Branch Datapath (for you to finish)



# Single-Cycle Hardwired Control

## ■ As combinational function of $\text{Inst} = \text{MEM}[\text{PC}]$

| 31     | 26     | 25 | 21     | 20 | 16     | 15 | 11        | 10 | 6      | 5 | 0      |
|--------|--------|----|--------|----|--------|----|-----------|----|--------|---|--------|
| 0      | rs     |    | rt     |    | rd     |    | shamt     |    | funct  |   | R-Type |
| 6 bits | 5 bits |    | 5 bits |    | 5 bits |    | 5 bits    |    | 6 bits |   |        |
| 31     | 26     | 25 | 21     | 20 | 16     | 15 |           |    |        |   | 0      |
| opcode | rs     |    | rt     |    |        |    | immediate |    |        |   | I-Type |
| 6 bits | 5 bits |    | 5 bits |    |        |    | 16 bits   |    |        |   |        |
| 31     | 26     | 25 |        |    |        |    |           |    |        |   | 0      |
| opcode |        |    |        |    |        |    | immediate |    |        |   | J-Type |
| 6 bits |        |    |        |    |        |    | 26 bits   |    |        |   |        |

## ■ Consider

- All R-type and I-type ALU instructions
- lw and sw
- beq, bne, blez, bgtz branch
- j, jr, jal, jalr omitted jump

# Single-Bit Control Signals (I)

|          | <b>When De-asserted</b>                                             | <b>When asserted</b>                                                 | <b>Equation</b>                                               |
|----------|---------------------------------------------------------------------|----------------------------------------------------------------------|---------------------------------------------------------------|
| RegDest  | GPR write select according to <b>rt</b> , i.e., inst[20:16]         | GPR write select according to <b>rd</b> , i.e., inst[15:11]          | opcode==0                                                     |
| ALUSrc   | 2 <sup>nd</sup> ALU input from 2 <sup>nd</sup> <b>GPR</b> read port | 2 <sup>nd</sup> ALU input from sign-extended 16-bit <b>immediate</b> | (opcode!=0) && (opcode!=BEQ) && (opcode!=BNE)                 |
| MemtoReg | Steer <b>ALU result</b> to GPR write port                           | Steer <b>memory output</b> to GPR write port                         | opcode==LW                                                    |
| RegWrite | GPR <b>write disabled</b>                                           | GPR <b>write enabled</b>                                             | (opcode!=SW) && (opcode!=Bxx) && (opcode!=J) && (opcode!=JR)) |

# Single-Bit Control Signals (II)

|                    | <b>When De-asserted</b>               | <b>When asserted</b>                                      | <b>Equation</b>                       |
|--------------------|---------------------------------------|-----------------------------------------------------------|---------------------------------------|
| MemRead            | Memory <b>read disabled</b>           | Memory <b>read port returns load value</b>                | opcode==LW                            |
| MemWrite           | Memory <b>write disabled</b>          | Memory <b>write enabled</b>                               | opcode==SW                            |
| PCSrc <sub>1</sub> | According to <b>PCSrc<sub>2</sub></b> | next PC is based on <b>26-bit immediate jump target</b>   | (opcode==J)    (opcode==JAL)          |
| PCSrc <sub>2</sub> | next PC = <b>PC + 4</b>               | next PC is based on <b>16-bit immediate branch target</b> | (opcode==Bxx) && “bcond is satisfied” |

# R-Type ALU



# I-Type ALU



# LW



Based on original figure from [P&H CO&D, COPYRIGHT 2004]

# SW



Based on original figure from [P&H CO&D, COPYRIGHT 2004]

# Branch (Not Taken)



Based on original figure from [P&H CO&D, COPYRIGHT 2004  
Eduardo A. Rodriguez, All Rights Reserved 1

# Branch (Taken)



Based on original figure from [P&H CO&D, COPYRIGHT

# Jump



'Based on original figure from [P&H CO&D, COPYRIGHT 2014, REPRODUCED WITH PERMISSION OF THE AUTHOR.]' [RIGHTS RESERVED](#)

# What is in That Control Box?

Combinational Logic → Hardwired Control  
Control signals generated combinatorially based on bits in instruction encoding

Sequential Logic → Sequential Control  
A memory structure contains the control signals associated with an instruction  
Called "control store"

Both types of control structure can be used in single-cycle processors

# Another Complete Single-Cycle Processor



Similarly, We Need to Design the Control Unit

- Control signals are generated by the decoder in control unit

| Instruction | $Op_{5:0}$ | RegWrite | RegDst | AluSrc | Branch | MemWrite | MemtoReg | $ALUOp_{1:0}$ | Jump |
|-------------|------------|----------|--------|--------|--------|----------|----------|---------------|------|
| R-type      | 000000     | 1        | 1      | 0      | 0      | 0        | 0        | 10            | 0    |
| lw          | 100011     | 1        | 0      | 1      | 0      | 0        | 1        | 00            | 0    |
| sw          | 101011     | 0        | X      | 1      | 0      | 1        | X        | 00            | 0    |
| beq         | 000100     | 0        | X      | 0      | 1      | 0        | X        | 01            | 0    |
| addi        | 001000     | 1        | 0      | 1      | 0      | 0        | 0        | 00            | 0    |
| j           | 000010     | 0        | X      | X      | X      | 0        | X        | XX            | 1    |