



# Computer Architecture (CS F342)

Design, Analysis, Execution and Optimization of Instructions

Reduced Instruction Set Computer (RISC): MIPS  
Multicycle Datapath and Control unit

Book: COD by Patterson and Hennessy from Chapter **4 A**

```
#define OPCODE 0b11111000000000000000000000000000
#define RS      0b00000011110000000000000000000000
#define RT      0b00000000000011110000000000000000
#define RD      0b00000000000000001111000000000000
#define SHIFT   0b0000000000000000000000001111000000
#define OFFSET  0b00000000000000001111111111111111
```

```
int PC, IMM[1024], DMM[1024], RF[32], ALUControl; bool ZERO;
```

**Load**(IMM, DMM);

Set PC with address 1st instruction which is stored in IMM;

```
while (1){
    switch((IMM[PC] & OPCODE) >>26){
```

case **R-type**:

Set **ALUControl**; //ADD, SUB, AND, OR, etc

RF[(IMM[PC] & RD)>>11] = **ALU**(RF[(IMM[PC] & RS)>>21], RF[(IMM[PC] & RT) >>16]); PC = PC + 4;

case **SW-type**:

Set **ALUControl** = 0b0010;

DMM[**ALU**((IMM[PC] & RS)>>21, (IMM[PC] & OFFSET) ] = RF[(IMM[PC] & RT)>>16]; PC = PC + 4;

case **LW-type**:

Set **ALUControl** = 0b0010;

RF[IMM[PC] & RT]= DMM[**ALU**((IMM[PC] & RS) >>21, IMM[PC] & OFFSET) ]; PC = PC + 4;

case **B-type**:

Set **ALUControl** = 0b0110;

**ALU**(RF[(IMM[PC] & RS)>>21], RF[(IMM[PC] & RT) >>16]);

IF (ZERO ==1) PC = (PC + 4) + ((IMM[PC] & OFFSET) <<2); **ELSE** PC = PC + 4;

}

## Fetch-and-Execute Algorithm for MIPS Microprocessor (32 bit)

**ALU**(Src1, Src2){

switch (**ALUControl**) {

case **B-type**: ZERO = (Src1 - Src2) == 0 ? 1: 0;

break;

case **ADD**: return (Src1 + Src2); break;

...

}

Need conversion for  
16 bit offset to 32 bits

# Single-cycle datapath



Redundant  
Adders &  
Memory

# Performance analysis

- ! The execution time of a program is a metric

- *Execution Time = (#instructions)  $\left(\frac{\text{Cycles}}{\text{instruction}}\right) \left(\frac{\text{Seconds}}{\text{cycle}}\right)$*
- #instructions or length of a program depends on ISA
- How did we measure the clock period?
- Which instruction was dominating the clock period?
- **How can we reduce the clock period or improve the frequency ( $f=1/T$ )?**

# Improvement on clock period

4

How about LW?



Combinational Logic = Combinational Logic-1 U Combinational Logic-2

# Placement of Registers/ Partition the logic

- How does one place the registers in the datapath?
- $T = \text{Max}\{ (\text{src-1}, \text{dstn-1}), (\text{src-2}, \text{dstn-2}), \dots \}$
- What  $\text{Max}\{ \}$  is telling us?
- Each partition can take almost equal time for computation

# Problems of Single-cycle Datapath Design

- Single-cycle design works well but inefficient design
- Clock length (worst-case delay) is same for all instructions
- It is not a **balanced** design
- CPI is 1
- Use more resources:
  - Adder
  - Memory
- Necessity of balanced datapath design technique by focusing on common-case design & analysis principle

# Balanced datapath design: Multi-cycle approach

- Instruction execution can be broken down to smaller steps
- Simple instruction can complete the execution earlier than the complex instructions
- One can design the multi-cycle datapath as similar in single-cycle
  - Connecting architectural elements with **the storage** using combinational logic
  - Next, design the controller

# Balanced datapath design: Multi-cycle approach

- The key difference is
  - Controller produces different signals on different steps/states
  - A finite state machine (FSM) approach

# Balanced datapath design: Multi-cycle approach

- Combined the instruction and data memory
- Remove the redundant adders
- Incorporate the **non-architectural state/storage** elements
  - Not visible to the programmer

# Single-cycle datapath



# Balanced datapath design: Multi-cycle approach

- Remove the redundant adders & memory
- Where do we place this operations?
- How does one control such operations?

```
#define OPCODE 0b11111000000000000000000000000000
#define RS      0b00000011110000000000000000000000
#define RT      0b00000000000011110000000000000000
#define RD      0b00000000000000001111000000000000
#define SHIFT   0b0000000000000000000000001111000000
#define OFFSET  0b00000000000000001111111111111111
```

int PC, IMM[1024], DMM[1024], RF[32], ALUControl; bool ZERO;

**Load**(IMM, DMM);

Set PC with address 1st instruction which is stored in IMM;

```
while (1){
    switch((IMM[PC] & OPCODE) >>26){
```

case **R-type**:

Set **ALUControl**; //ADD, SUB, AND, OR, etc  
 $RF[(IMM[PC] \& RD) \gg 11] = \text{ALU}(RF[(IMM[PC] \& RS) \gg 21], RF[(IMM[PC] \& RT) \gg 16]);$  PC =  $PC + 4;$

case **SW-type**:

Set **ALUControl** = 0b0010;  
 $DMM[\text{ALU}((IMM[PC] \& RS) \gg 21, (IMM[PC] \& OFFSET))] = RF[(IMM[PC] \& RT) \gg 16];$  PC = PC + 4;

case **LW-type**:

Set **ALUControl** = 0b0010;  
 $RF[IMM[PC] \& RT] = DMM[\text{ALU}((IMM[PC] \& RS) \gg 21, IMM[PC] \& OFFSET)];$  PC = PC + 4;

case **B-type**:

Set **ALUControl** = 0b0110;  
 $\text{ALU}(RF[(IMM[PC] \& RS) \gg 21], RF[(IMM[PC] \& RT) \gg 16]);$   
IF (ZERO ==1) PC =  $(PC + 4) + ((IMM[PC] \& OFFSET) \ll 2);$  ELSE PC = PC + 4;

}

## Fetch-and-Execute Algorithm for MIPS Microprocessor (32 bit)

**ALU**(Src1, Src2){

switch (**ALUControl**) {

case **B-type**: ZERO = (Src1 - Src2) == 0 ? 1 : 0;

break;

case **ADD**: return (Src1 + Src2); break;

...

}

Need conversion for  
16 bit offset to 32 bits

# Key points: Resource Sharing leads to Multi-Cycle Approach

14

- Datapath components are shared
  - Functional units
  - Memory
  - Interconnect
- Balanced datapath design

Book: COD by Patterson and Hennessy from Chapter **4 A**

```
#define OPCODE 0b11111000000000000000000000000000
#define RS      0b00000011110000000000000000000000
#define RT      0b00000000000011110000000000000000
#define RD      0b00000000000000001111000000000000
#define SHIFT   0b0000000000000000000000001111000000
#define OFFSET  0b00000000000000001111111111111111
```

```
int PC, MM[1024], RF[32], ALUControl; bool ZERO;
int IR, A, B, ALUOut, MDR; //non-architectural elements
Load(MM);
Set PC with address 1st instruction which is stored in IMM;
while (1){ IR = MM[PC]; ALUControl = 0b0010; PC = ALU(PC, 4);
A = RF[(IR & RS)>>21]; B = RF[(IR & RT) >>16];
switch((IR & OPCODE) >>26){
    case R-type:
        Set ALUControl; //ADD, SUB, AND, OR, etc
        ALUOut = ALU(A, B); RF[(IR & RD)>>11] = ALUOut;
    case LW-type:
        ALUControl = 0b0010;
        ALUOut = ALU(A, (IR & OFFSET)); MDR = MM[ALUOut]; RF[(IR & RT)>>16] = MDR;
    case SW-type:
        ALUControl = 0b0010;
        ALUOut = ALU(A, (IR & OFFSET)); MM[ALUOut] = B;
    case B-type: ALUControl = 0b0010;
        ALUOut = ALU(PC, (IR & OFFSET) <<2);
        ALUControl = 0b0110; ALU(A, B);
        IF (ZERO ==1) {PC = ALUOut;}
}
```

## Fetch-and-Execute Algorithm for MIPS Microprocessor (32 bit)

```
ALU(Src1, Src2){
    switch (ALUControl){
        case B-type: ZERO = (Src1 - Src2) == 0 ? 1 : 0; break;
        case ADD: return (Src1 + Src2);
        ...
    }
}
```

Need conversion for  
16 bit offset to 32 bits

Multi-Cycle Approach

## Single memory unit, **Fetch stage** and datapath for reusing the ALU

16

# Single memory unit, Fetch stage and datapath for reusing the ALU

17



# Balanced datapath design: Multi-cycle approach

- Combined the instruction and data memory
- Remove the redundant adders
- Incorporate the **non-architectural state/storage** elements

## Fetch stage and Non-architectural elements Instruction register

19

## Fetch stage and Non-architectural elements Instruction register

20



1:  $IR = M[PC]$ ;  $PC = PC + 4$

## Iw-instr. and non-architectural elements: A, Data and ALUOut-register

21

- 1:  $IR = M[PC]$ ;  $PC = PC + 4$
- 2:  $A = Reg[25:21]$ ;
- 3:  $ALUOut = A + \text{SignExt}(Imm)$
- 4:  $MDR = M[ALUOut]$
- 5:  $Reg[20:16] = MDR$

# Iw-instr. and non-architectural elements: A, Data and ALUOut-register

22



## sw-instr and non-architectural elements: B-register

23

- 1:  $IR = M[PC]$ ;  $PC = PC + 4$
- 2:  $A = Reg[25:21]$ ;  $B=Reg[20:16]$
- 3:  $ALUOut = A + \text{SignExt}(Imm)$
- 4:  $M[ALUOut] = B$

# sw-instr and non-architectural elements: B-register

24



- 1:  $IR = M[PC]$ ;  $PC = PC + 4$
- 2:  $A = Reg[25:21]$ ;  $B = Reg[20:16]$
- 3:  $ALUOut = A + \text{SignExtn}(Imm)$
- 4:  $M[ALUOut] = B$

## R-type instruction

25

- 1:  $IR = M[PC]$ ;  $PC = PC + 4$
- 2:  $A = Reg[25:21]$ ;  $B = Reg[20:16]$
- 3:  $ALUOut = A + B$
- 4:  $Reg[15:11] = ALUOut$

# R-type instruction

26



Fetch stage and B(R)-type instruction   BEQ R1, R2, offset; if true, PC =PC+4+offset  
else PC = PC+4

27

- 1: IR[PC]; PC = PC + 4
- 2: ALUOut = PC + Shift(Sign(offset))
- 3: PC = ALUOut if Zero == 1 //A-B

Fetch stage and B(R)-type instruction   BEQ R1, R2, offset; if true, PC = PC+4+offset  
else PC = PC+4

28



## I-type instruction: ADDI

29

| op            | rs            | rd            | Immediate     |              |              |
|---------------|---------------|---------------|---------------|--------------|--------------|
| 6-bits(31-26) | 5-bits(25-21) | 5-bits(20-16) | 5-bits(15-11) | 5-bits(10-6) | 6-bits (5-0) |

- 1:  $IR = M[PC]$ ;  $PC = PC + 4$
- 2:  $A = Reg[25:21];$
- 3:  $ALUOut = A + \text{SignExt}(Imm)$
- 4:  $Reg[20:16] = ALUOut$

# I-type instruction: ADDI

30

| op            | rs            | rd            | Immediate     |              |              |
|---------------|---------------|---------------|---------------|--------------|--------------|
| 6-bits(31-26) | 5-bits(25-21) | 5-bits(20-16) | 5-bits(15-11) | 5-bits(10-6) | 6-bits (5-0) |



## Fetch stage and Jump instruction

PC □ PC+4[31:28] addrs[27:0]

- 
- 1:  $IR = M[PC]$ ;  $PC = PC + 4$
  - 2:  $PC = \{PC[31:28], LShift(Addr)\}$

# Fetch stage and Jump instruction

PC  $\square$  PC+4[31:28] addrs[27:0]



# Combined Datapath of all types of instructions

33



# Control signals

- IorD
- Jump
- Memwrite
- MemRead
- IRWrite
- RegDst
- MemtoReg
- RegWrite
- ALUSrcA
- ALUSrcR1:0
- PCWrite
- Branch
- PCSrc
- ALUOp
- ALUControl

Truth Table?

15 Control signals

How many bits are needed to represent a state?

- States

# Multicycle model



# Summary of Steps taken to execute any instruction class

| Step name                                              | Action for R-type instructions | Action for memory reference instructions                                                                      | Action for branches                 | Action for jumps                           |
|--------------------------------------------------------|--------------------------------|---------------------------------------------------------------------------------------------------------------|-------------------------------------|--------------------------------------------|
| Instruction fetch                                      |                                | $IR \leq Memory[PC]$<br>$PC \leq PC + 4$                                                                      |                                     |                                            |
| Instruction decode/register fetch                      |                                | $A \leq Reg[IR[25:21]]$<br>$B \leq Reg[IR[20:16]]$<br>$ALUOut \leq PC + (\text{sign-extend}(IR[15:0]) \ll 2)$ |                                     |                                            |
| Execution, address computation, branch/jump completion | $ALUOut \leq A \text{ op } B$  | $ALUOut \leq A + \text{sign-extend}(IR[15:0])$                                                                | if ( $A == B$ )<br>$PC \leq ALUOut$ | $PC \leq \{PC[31:28], (IR[25:0]), 2'b00\}$ |
| Memory access or R-type completion                     | $Reg[IR[15:11]] \leq ALUOut$   | Load: $MDR \leq Memory[ALUOut]$<br>or<br>Store: $Memory[ALUOut] \leq B$                                       |                                     |                                            |
| Memory read completion                                 |                                | Load: $Reg[IR[20:16]] \leq MDR$                                                                               |                                     |                                            |

# Fetch stage

| Machine state | Operation                                  | Control signals                                                                              | Next state |
|---------------|--------------------------------------------|----------------------------------------------------------------------------------------------|------------|
| T0            | InsR $\square$ M[PC];<br>PC $\square$ PC+4 | IorD=0, IRWrite=1, MemRead=1<br>ALUSrcA=0, ALUSrcB=01, ALUOp=00, PCSrc=00, PCWrite=1, Jump=0 | T1         |

# Decode stage

| Machine state | Operation                                                           | Control signals                                  | Next state |
|---------------|---------------------------------------------------------------------|--------------------------------------------------|------------|
| T1            | $(PC+4) + Lshift(SigExtn(offset))$<br>A = Reg[25:21] B = Reg[20:16] | ALUSrcA=0, ALUSrcB <sub>1:0</sub> = 11, ALUOp=00 | x          |

Useful for BRZ instruction

# LW type instruction

39

| Machine state | Operation                        | Control signals                                  | Next State |
|---------------|----------------------------------|--------------------------------------------------|------------|
| T2            | A + sigEx(offset)                | ALUSrcA=1, ALUSrcB <sub>1:0</sub> = 10, ALUOp=00 | T3         |
| T3            | MDR $\leftarrow$ M[A+sigEx(off)] | IorD=1 , MemRead=1                               | T4         |
| T4            | RF[dest] $\leftarrow$ MDR        | RegDst=0, MemtoReg=1, RegWrite=1                 | T0         |

# SW type instruction

40

| Machine state | Operation                                    | Control signals                                                | Next state |
|---------------|----------------------------------------------|----------------------------------------------------------------|------------|
| T2            | $A + \text{sigEx}(\text{offset})$            | $\text{ALUSrcA}=1, \text{ALUSrcB}_{1:0} = 10, \text{ALUOp}=00$ | T5         |
| T5            | $M[A+\text{sigEx}(\text{offset})] \square B$ | $\text{IorD}=1, \text{MemWrite}=1$                             | T0         |

# R-type instruction

41

| Machine state | Operation         | Control signals                                  | Next state |
|---------------|-------------------|--------------------------------------------------|------------|
| T6            | A Op B            | ALUSrcA=1, ALUSrcB <sub>1:0</sub> = 00, ALUOp=00 | T7         |
| T7            | RF[dstn] □ A Op B | RegDst=1, MemtoReg=0, RegWrite=1                 | T0         |

# B-type instruction

42

| Machine state | Operation                                                                           | Control signals                                                                          | Next State |
|---------------|-------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------|------------|
| T1            | $(PC+4) +$<br>SigExtn(offset)<br>$A = \text{Reg}[25:21]$<br>$B = \text{Reg}[20:16]$ | $\text{ALUSrcA}=0$ , $\text{ALUSrcB}_{1:0} = 11$ , $\text{ALUOp}=00$ , $\text{PCSrc}=01$ | T8         |
| T8            | A-B                                                                                 | $\text{ALUSrcA}=1$ , $\text{ALUSrcB}_{1:0} = 00$ , $\text{ALUOp}=01$ , Branch=1          | T0         |

Decoding stage

# ADDI instruction

43

| Machine state | Operation                           | Control signals                                  | Next state |
|---------------|-------------------------------------|--------------------------------------------------|------------|
| T2            | A OP SigExtn(offset)                | ALUSrcA=1, ALUSrcB <sub>1:0</sub> = 10, ALUOp=00 | T9         |
| T9            | RF[Destn] □ A OP<br>SigExtn(offset) | RegDst=0, MemtoReg=0                             | T0         |

# Jump instruction

44

| Machine state | Operation               | Control signals  | Next state |
|---------------|-------------------------|------------------|------------|
| T10           | A OP<br>SigExtn(offset) | Jump=1, PCSrc=10 | T0         |

# Controller States' information: Graphical Representation

45



# Clock-cycle needed for the instructions

| Instructions | Clock-cycle |
|--------------|-------------|
| LW           | 5           |
| SW           | 4           |
| R-type       | 4           |
| BEQ          | 3           |
| ADDI         | 4           |
| J            | 3           |

# (Finite state) Control Unit

47



# Fetch stage

| Machine state<br>(Inputs) | Operation                                        | Control signals<br>(Outputs)                                                                    | Next state<br>(Outputs) |
|---------------------------|--------------------------------------------------|-------------------------------------------------------------------------------------------------|-------------------------|
| T0                        | InsR $\leftarrow$ M[PC];<br>PC $\leftarrow$ PC+4 | IorD=0, IRWrite=1, MemRead=1<br>ALUSrcA=0, ALUSrcB=01, ALUOp=00,<br>PCSrc=00, PCWrite=1, Jump=0 | T1                      |

# Microprogram based control unit

49

| Address | IorD | Jump | Memwrite | MemRead | IRWrite | RegDst | MemtoReg | RegWrite | ALUSrcA | ALUSrcB_1:0 | PCWrite | Branch | PCSsrc | ALUOp | NextAddress | Mode |        | States |
|---------|------|------|----------|---------|---------|--------|----------|----------|---------|-------------|---------|--------|--------|-------|-------------|------|--------|--------|
| 0       | 0    | 0    | 0        | 1       | 1       | x      | x        | 0        | 0       | 01          | 1       | 0      | 00     | 00    | 1           | 0    | Fetch  | T0     |
| 1       | x    | 0    | 0        | 0       | 0       | x      | x        | 0        | 0       | 11          | 0       | 0      | 00     | 00    | xxxx        | 1    | Decode | T1     |
| 2       | x    | 0    | 0        | 0       | 0       | x      | x        | 0        | 1       | 10          | 0       | 0      | 00     | 00    | 3           | 0    |        | T2     |
| 3       | 1    | 0    | 0        | 1       | 0       | x      | x        | 0        | x       | xx          | 0       | 0      | 00     | xx    | 4           | 0    | LW     | T3     |
| 4       | x    | 0    | 0        | 0       | 0       | 0      | 1        | 1        | x       | xx          | 0       | 0      | 00     | xx    | 0           | 0    |        | T4     |
| 5       | x    | 0    | 0        | 0       | 0       | x      | x        | 0        | 1       | 10          | 0       | 0      | 00     | 00    | 6           | 0    |        | T2     |
| 6       | 1    | 0    | 1        | 0       | 0       | x      | x        | 0        | x       | xx          | 0       | 0      | 00     | xx    | 0           | 0    | SW     | T3     |
| 7       | x    | 0    | 0        | 0       | 0       | x      | x        | 0        | 1       | 00          | 0       | 0      | 00     | 00    | 8           | 0    |        | T2     |
| 8       | x    | 0    | 0        | 0       | 0       | 1      | 0        | 1        | x       | xx          | 0       | 0      | 00     | xx    | 0           | 0    | ADD    | T3     |
| 9       | x    | 0    | 0        | 0       | 0       | x      | x        | 0        | 1       | 10          | 0       | 0      | 00     | 00    | 10          | 0    |        | T2     |
| 10      | x    | 0    | 0        | 0       | 0       | 0      | 0        | 0        | x       | xx          | 0       | 0      | 00     | xx    | 0           | 0    | ADDI   | T3     |
| 11      | x    | 1    | 0        | 0       | 0       | x      | x        | 0        | x       | xx          | 0       | 0      | 10     | xx    | 0           | 0    | J      | T2     |
| 12      | x    | 0    | 0        | 0       | 0       | x      | x        | 0        | 1       | 00          | 0       | 1      | 01     | 01    | 0           | 0    | BEQ    | T2     |

Can we reduce the bits used for NextAddress field?

Book-Hamacher-ch5  
P&H – ch-appx-C

## Controller's Truth (state) Table

CF: Control Field: 15 Signals or 17 bits

NA: Next Address: 4 bit

For better understanding, we have represented the value in NextAddress and the Address as decimal.

| Decoding Table |         |
|----------------|---------|
| Opcode         | Address |
| LW             | 2       |
| SW             | 5       |
| ADD            | 7       |
| ADDI           | 9       |
| J              | 11      |
| BEQ            | 12      |

# Microprogrammed Control

- Organization of microprogrammed control unit



Microprogram sequencer:  
all units, except control  
memory.

CF: Control Field

NA: Next Address

# Hardwired based control unit

51

| Address | lOrD | Jump | Memwrite | IRWrite | RegDst | MemtoReg | RegWrite | ALUSrcA | ALUSrcB_1:0 | PCWrite | Branch | PCSsrc | ALUOp | NextAddress | Mode |        | States |
|---------|------|------|----------|---------|--------|----------|----------|---------|-------------|---------|--------|--------|-------|-------------|------|--------|--------|
| 0       | 0    | 0    | 0        | 1       | x      | x        | 0        | 0       | 01          | 1       | 0      | 00     | 00    | 1           | 0    | Fetch  | T0     |
| 1       | x    | 0    | 0        | 0       | x      | x        | 0        | 0       | 11          | 0       | 0      | 00     | 00    | xxxx        | 1    | Decode | T1     |
| 2       | x    | 0    | 0        | 0       | x      | x        | 0        | 1       | 10          | 0       | 0      | 00     | 00    | 3           | 0    |        | T2     |
| 3       | 1    | 0    | 0        | 0       | x      | x        | 0        | x       | xx          | 0       | 0      | 00     | xx    | 4           | 0    | LW     | T3     |
| 4       | x    | 0    | 0        | 0       | 0      | 1        | 1        | x       | xx          | 0       | 0      | 00     | xx    | 0           | 0    |        | T4     |
| 5       | x    | 0    | 0        | 0       | x      | x        | 0        | 1       | 10          | 0       | 0      | 00     | 00    | 6           | 0    |        | T2     |
| 6       | 1    | 0    | 1        | 0       | x      | x        | 0        | x       | xx          | 0       | 0      | 00     | xx    | 0           | 0    | SW     | T3     |
| 7       | x    | 0    | 0        | 0       | x      | x        | 0        | 1       | 00          | 0       | 0      | 00     | 00    | 8           | 0    |        | T2     |
| 8       | x    | 0    | 0        | 0       | 1      | 0        | 1        | x       | xx          | 0       | 0      | 00     | xx    | 0           | 0    | ADD    | T3     |
| 9       | x    | 0    | 0        | 0       | x      | x        | 0        | 1       | 10          | 0       | 0      | 00     | 00    | 10          | 0    |        | T2     |
| 10      | x    | 0    | 0        | 0       | 0      | 0        | 0        | x       | xx          | 0       | 0      | 00     | xx    | 0           | 0    | ADDI   | T3     |
| 11      | x    | 1    | 0        | 0       | x      | x        | 0        | x       | xx          | 0       | 0      | 10     | xx    | 0           | 0    | J      | T2     |
| 12      | x    | 0    | 0        | 0       | x      | x        | 0        | 1       | 00          | 0       | 1      | 01     | 01    | 0           | 0    | BEQ    | T2     |

| Inputs |   | Output |
|--------|---|--------|
| A      | B | Y      |
| 0      | 0 | 0      |
| 0      | 1 | 0      |
| 1      | 0 | 0      |
| 1      | 1 | 1      |



| Decoding Table |         |
|----------------|---------|
| Opcode         | Address |
| LW             | 2       |
| SW             | 5       |
| ADD            | 7       |
| ADDI           | 9       |
| J              | 11      |
| BEQ            | 12      |

# Controller States' information for Hardwired sequential CU

52



# Hardwired based control unit

53



# Hardwired based control unit

$IorD = T0' + T3.LW + T3.SW +$   
 $Jump = T2.J$   
 $Memwrite = T3.SW$   
 $MemRead = T0' +$   
 $IRWrite =$   
 $RegDst =$   
 $MemtoReg =$   
 $RegWrite =$   
 $ALUSrcA =$   
 $ALUSrcB\_1:0 =$   
 $PCWrite =$   
 $Branch =$   
 $PCSsrc =$   
 $ALUOp = \{T0', T0'\} + \{T1', T1'\} + \{(T2.LW)', (T2.LW)'\}$   
 $INCR = T0 + T1 + T2.LW + T3.LW + T2.SW$   
 $CLR = T4.LW + T3.SW +$

# Two approaches

- All the ALU operations (ALUOp) similar as in Single-cycle approach
- Generate the control signals using:
  - Hardwired-based approach
  - Microprogrammed-based approach
- Inputs of the control unit similar as in Single-cycle approach

# Comparison between Single and Multi-cycle datapath

- Single cycle
  - No non-architectural component
  - Every instruction takes one clock cycle (CPI=1)
- Multi-cycle
  - Non-architectural elements/register
    - Instruction, data (memory), A, B, ALU-output
  - Instructions take different unit of clock cycle
    - Average CPI (>1)

# Comparison between Single and Multi-cycle Control unit

Single Cycle

CLK

10 ns

| ALUOp | Meaning                    |
|-------|----------------------------|
| 00    | add                        |
| 01    | subtract                   |
| 10    | Look at <i>funct</i> field |
| 11    | n/a                        |

| Instr.    | Jump     | RegDst   | RegWrite | ALUSrc   | Branch   | ALUOp1   | ALUOp0   | MemRead  | MemWrite | MemtoReg |
|-----------|----------|----------|----------|----------|----------|----------|----------|----------|----------|----------|
| R-type    | 0        | 1        | 1        | 0        | 0        | 1        | 0        | 0        | 0        | 0        |
| <b>lw</b> | <b>0</b> | <b>0</b> | <b>1</b> | <b>1</b> | <b>0</b> | <b>0</b> | <b>0</b> | <b>1</b> | <b>0</b> | <b>1</b> |
| sw        | 0        | x        | 0        | 1        | 0        | 0        | 0        | 0        | 1        | x        |
| addi      | 0        | 0        | 1        | 1        | 0        | 0        | 0        | 0        | 0        | 0        |
| B-type    | 0        | x        | 0        | 0        | 1        | 0        | 1        | 0        | 0        | x        |
| J-type    | 1        | x        | 0        | x        | x        | x        | x        | 0        | 0        | x        |

Multi-Cycle

LW's FSM

5 ns

| Machine state | Operation              | Control signals                                  | Next State |
|---------------|------------------------|--------------------------------------------------|------------|
| T2            | A + sigEx(offset)      | ALUSrcA=1, ALUSrcB <sub>1:0</sub> = 10, ALUOp=00 | T3         |
| T3            | Data □ M[A+sigEx(off)] | IorD=1                                           | T4         |
| T4            | RF[dest] □ Data        | RegDst=0, MemtoReg=1, RegWrite=1                 | T0         |

# Performance Analysis

- The program consists of approximately 25% loads, 10% stores, 11% branches, 2% jumps, and 52% R-type instructions. Determine the average CPI for this program.

# Performance Analysis

- The program consists of approximately 25% loads, 10% stores, 11% branches, 2% jumps, and 52% R-type instructions. Determine the average CPI for this program.
- The average CPI is the sum over each instruction of the CPI for that instruction multiplied by the fraction of the time that instruction is used
- **Average CPI =  $(0.11 + 0.02)(3) + (0.52 + 0.10)(4) + (0.25)(5) = 4.12$**
- For Single-cycle approach, Avg. CPI is 1

# Performance Analysis

60

Each cycle involved **one** ALU operation, memory access, or register file access

- Each instruction is using only one stage at any time

- Assumptions:

- the register file is faster than the memory and
- writing memory is faster than reading memory

- Datapath has two possible critical paths:

$$T_c = t_{clk2q} + t_{mux} + \max\{t_{mux} + t_{ALU}, t_{mem}\} + t_{setup}$$

Begin of  
each  
stage

Delay elements

Setup  
time for  
each  
stage

$t_{clk2q}$  = clock-to-Q  
D-ff

# Performance Analysis

61

- Each cycle involved **one** ALU operation, memory access, or register file access
  - Each instruction is using only one stage at any time
  - Assumptions:
    - the register file is faster than the memory and
    - writing memory is faster than reading memory
  - Datapath has two possible critical paths:
  - $T_c = t_{pcq} + t_{mux} + \max\{t_{ALU} + t_{mux} + t_{mem}, t_{mem}\} + t_{setup}$

**Path-1:**  $t_{pcq} + t_{mux} + t_{ALU} + t_{mux} + t_{mem} + t_{setup}$

**Path-2:**  $t_{pcq} + t_{mux} + t_{mem} + t_{setup}$

Combined Datapath of all types of instructions

**Green line decides the Critical paths**



# Performance Analysis

- XYZ-organization is contemplating building the multi-cycle MIPS processor instead of the single-cycle processor. For both designs, the organization plans on using a 65-nm CMOS manufacturing process. The organization has determined that the logic elements have the delays given in Table. Help the organization compare each processor's execution time for a program with 100 billion instructions

| Parameter | Delay (ps) |
|-----------|------------|
|           | 30         |
|           | 250        |
|           | 20         |
|           | 200        |
|           | 25         |
|           | 20         |

# Performance Analysis of Multi-Cycle Implementation

- XYZ-organization is contemplating building the multi-cycle MIPS processor instead of the single-cycle processor. For both designs, the organization plans on using a 65-nm CMOS manufacturing process. The organization has determined that the logic elements have the delays given in Table. Help the organization compare each processor's execution time for a program with 100 billion instructions

- $$T_c = t_{pcq\_PC} + t_{mux} + \max\{t_{ALU} + t_{mux}, t_{mem}\} + t_{setup}$$
- $$T_c = 30 + 25 + 250 + 20 = \textcolor{red}{350 \text{ ps}}$$
- Execution time =
 
$$(100 * 10^9 \text{ instrs.}) * (\textcolor{red}{4.12} \text{ cycle/instrs.}) * (350 * 10^{-12} \text{ s/cycle})$$

$$= 133.9 \text{ seconds}$$

| Parameter | Delay (ps) |
|-----------|------------|
|           | 30         |
|           | 250        |
|           | 20         |
|           | 200        |
|           | 25         |
|           | 20         |

# Performance Analysis: A Comparison

- For multi-cycle,  $T_c = 350$  ps and CPI = 4.12
- For single-cycle,  $T_c = 925$  ps and CPI = 1
- For multi-cycle, execution time = 133.9 seconds
- For single-cycle, execution time = 92.5 seconds
- This example shows multi-cycle processor is slow than the single-cycle processor; why is it so?
  - Sequencing overhead: 30 (clk-Q) + 20 ( $t_{setup}$ )
- Multi-cycle processor is less expensive
- It has 5-nonarchitectural elements

| Parameter | Delay (ps) |
|-----------|------------|
|           | 30         |
|           | 250        |
|           | 20         |
|           | 200        |
|           | 25         |
|           | 20         |

# Homeworks

- How many cycles are required to run the following program on the multicycle MIPS-processor? What is the CPI of this program?

```
Addi $s1, $s2, 5  
Sub $t0, $t1, $t2  
Lw $t3, 15($s1)  
Sw $t5, 72($t0)  
Or $t2, $s4, $s5
```

# Homework

- Write CPP programs which can simulate the behaviors of the following instructions at **functional level** for MIPS-processor with **single cycle** and **multi-cycle** datapath:
  - R-type: ADD
  - M-type: LW & SW
  - B-type: BEQ
  - I-type: ADDI
  - J-type: J
- Write a CPP program for microprogramme-based CU for MIPS ISA

# Processor Design With Constraints on Interconnection

Will it be a Single-Cycle or Multi-Cycle Processor?

# Design this 32 bits MIPS Processor: LW, SW, ADDI, ADD, BNE

69



# Design this 32 bits MIPS Processor: LW, SW, ADDI, ADD, BNE

70

## Fetch Stage

| Machine state | Operation                                   | Control signals                                        | Next state |
|---------------|---------------------------------------------|--------------------------------------------------------|------------|
| T0            | MAR $\square$ PC<br>A $\square$ PC          | PCOut=1, MARIn=1<br>Ain=1                              | T1         |
| T1            | MDR $\square$ Mem[MAR]<br><br>B $\square$ 4 | mdrMuxSel=0, MDRIIn=1,<br><br>FrOut=1, Bin=1, ALUOp=00 | T2         |
| T2            | ALUOut $\square$ A+B                        | ALUOutIn=1                                             | T3         |
| T3            | PC $\square$ ALUOut                         | PCIIn=1, ALUROut=1                                     | T4         |
| T4            | IR $\square$ MDR                            | MDROut=1, IRIn=1                                       | xxx        |

# Design this 32 bits MIPS Processor: LW, SW, ADDI, ADD, BNE

71

## ADD

| Machine state | Operation | Control signals | Next state |
|---------------|-----------|-----------------|------------|
|               |           |                 |            |
|               |           |                 |            |
|               |           |                 |            |
|               |           |                 |            |
|               |           |                 |            |

# Homework: Design this 32 bits MIPS Processor: LW, SW, ADDI, ADD, BNE

72

## All other Instructions

## No of states: T0 to T29

# Organization of a CISC-style processor.



# Three-bus CISC-style processor organization.



# Summary

- Disadvantages of Single-cycle processor
- Necessity of balanced design approach and multi-cycle processor
- Datapath design of multi-cycle processor
- Design of Controls for multi-cycle processor
- Performance analysis of multi-cycle processor
- Comparison between single-cycle and multi-cycle processor
- Disadvantages of multi-cycle processor