

# RISC Design

---

Virendra Singh

Professor

Computer Architecture and Dependable Systems Lab

Department of Electrical Engineering, and  
Dept. of Computer Science & Engineering

Indian Institute of Technology Bombay

<http://www.ee.iitb.ac.in/~viren/>

E-mail: [viren@{ee, cse}.iitb.ac.in](mailto:viren@{ee, cse}.iitb.ac.in)

*EE-739: Processor Design @ IITB*

---



Lecture 15 (16 February 2022)

CADSL

$$S \xrightarrow{I} S'$$

Single cycle.

T.

(

$$S \xrightarrow{I} S'$$



Multicycle implementation

↑

{ Reuse of resources  
(cost effective)

↑

(HFC)

Merged: Level 2 FC



# Multicycle Datapath



# 3 to 5 Cycles for an Instruction

| Step                                                      | R-type<br>(4 cycles)                                             | Mem. Ref.<br>(4 or 5 cycles)                                                                                                                                                    | Branch type<br>(3 cycles)                                          | J-type<br>(3 cycles)                                                       |
|-----------------------------------------------------------|------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------|----------------------------------------------------------------------------|
| Instruction fetch                                         |                                                                  | $\text{IR} \leftarrow \text{Memory}[\text{PC}]$ ; $\text{PC} \leftarrow \text{PC} + 4$                                                                                          |                                                                    | $S_1$                                                                      |
| Instr. decode/<br>Reg. fetch                              |                                                                  | $A \leftarrow \text{Reg}(\text{IR}[21-25])$ ; $B \leftarrow \text{Reg}(\text{IR}[16-20])$<br>$\text{ALUOut} \leftarrow \text{PC} + (\text{sign extend } \text{IR}[0-15]) \ll 2$ |                                                                    | $S_2$                                                                      |
| Execution, addr.<br>Comp., branch &<br>jump<br>completion | $\text{ALUOut} \leftarrow A \text{ op } B$<br>$S_3$              | $\text{ALUOut} \leftarrow A + \text{sign extend } (\text{IR}[0-15])$<br>$S_5$                                                                                                   | If $(A = B)$<br>then $\text{PC} \leftarrow \text{ALUOut}$<br>$S_6$ | $\text{PC} \leftarrow \text{PC}[28-31]$<br>  <br>$(\text{IR}[0-25] \ll 2)$ |
| Mem. Access or<br>R-type<br>completion                    | $\text{Reg}(\text{IR}[11-15]) \leftarrow \text{ALUOut}$<br>$S_4$ | $\text{MDR} \leftarrow M[\text{ALUout}]$<br>or $M[\text{ALUOut}] \leftarrow B$                                                                                                  |                                                                    | $S_7$                                                                      |
| Memory read<br>completion                                 |                                                                  | $\text{Reg}(\text{IR}[16-20]) \leftarrow \text{MDR}$                                                                                                                            |                                                                    |                                                                            |



# Cycle 1 of 5: Instruction Fetch (IF)

- Read instruction into IR,  $M[PC] \rightarrow IR$

- Control signals used:

|                  |   |   |
|------------------|---|---|
| » IorD           | = | 0 |
| » MemRead        | = | 1 |
| » <u>IRWrite</u> | = | 1 |

select PC  
read memory  
write IR



- Increment PC,  $PC + 4 \rightarrow PC$

- Control signals used:

|            |   |    |
|------------|---|----|
| » ALUSrcA  | = | 0  |
| » ALUSrcB  | = | 01 |
| » ALUOp    | = | 00 |
| » PCSource | = | 00 |
| » PCWrite  | = | 1  |

select PC into ALU  
select constant 4  
ALU adds  
select ALU output  
write PC



# Cycle 2 of 5: Instruction Decode (ID)

---

|   | 31-26  | 25-21 | 20-16 | 15-11                  | 10-6  | 5-0     |
|---|--------|-------|-------|------------------------|-------|---------|
| R | opcode | reg 1 | reg 2 | reg 3                  | shamt | fnicode |
| I | opcode | reg 1 | reg 2 | word address increment |       |         |
| J | opcode |       |       | word address jump      |       |         |

- Control unit decodes instruction
- Datapath prepares for execution
  - R and I types, reg 1 → A reg, reg 2 → B reg
    - » No control signals needed
  - Branch type, compute branch address in ALUOut
    - » ALUSrcA = 0 select PC into ALU
    - » ALUSrcB = 11 Instr. Bits 0-15 shift 2 into ALU
    - » ALUOp = 00 ALU adds



# Cycle 3 of 5: Execute (EX)

---

- R type: execute function on reg A and reg B, result in ALUOut
  - Control signals used:
    - » ALUSrcA = 1 A reg into ALU
    - » ALUSrcB = 00 B reg into ALU
    - » ALUOp = 10 instr. Bits 0-5 control ALU
- I type, lw or sw: compute memory address in ALUOut  $\leftarrow$  A reg + sign extend IR[0-15]
  - Control signals used:
    - » ALUSrcA = 1 A reg into ALU
    - » ALUSrcB = 10 Instr. Bits 0-15 into ALU
    - » ALUOp = 00 ALU adds



# Cycle 3 of 5: Execute (EX)

---

- I type, beq: subtract reg A and reg B, write ALUOut to PC
  - Control signals used:

|                                         |   |    |                |
|-----------------------------------------|---|----|----------------|
| » ALUSrcA                               | = | 1  | A reg into ALU |
| » ALUsrcB                               | = | 00 | B reg into ALU |
| » ALUOp                                 | = | 01 | ALU subtracts  |
| » If zero = 1, PCSource                 | = | 01 | ALUOut to PC   |
| » If zero = 1, PCwriteCond =            | = | 1  | write PC       |
| » <b>Instruction complete, go to IF</b> |   |    |                |
- J type: write jump address to PC  $\leftarrow$  IR[0-25] shift 2 and four leading bits of PC
  - Control signals used:

|                                         |   |    |          |
|-----------------------------------------|---|----|----------|
| » PCSource                              | = | 10 |          |
| » PCWrite                               | = | 1  | write PC |
| » <b>Instruction complete, go to IF</b> |   |    |          |



# Cycle 4 of 5: Reg Write/Memory

---

- R type, write destination register from ALUOut
  - Control signals used:
    - » **RegDst** = 1      Instr. Bits 11-15 specify reg.
    - » **MemtoReg** = 0      ALUOut into reg.
    - » **RegWrite** = 1      write register
    - » **Instruction complete, go to IF**
- I type, lw: read M[ALUOut] into MDR
  - Control signals used:
    - » **IorD** = 1      select ALUOut into mem adr.
    - » **MemRead** = 1      read memory to MDR
- I type, sw: write M[ALUOut] from B reg
  - Control signals used:
    - » **IorD** = 1      select ALUOut into mem adr.
    - » **MemWrite** = 1      write memory
    - » **Instruction complete, go to IF**



# Cycle 5 of 5: Reg Write

---

- I type, lw: write MDR to reg[IR(16-20)]

- Control signals used:

|            |   |   |                                 |
|------------|---|---|---------------------------------|
| » RegDst   | = | 0 | instr. Bits 16-20 are write reg |
| » MemtoReg | = | 1 | MDR to reg file write input     |
| » RegWrite | = | 1 | read memory to MDR              |

» **Instruction complete, go to IF**

For an alternative method of designing datapath, see  
N. Tredennick, *Microprocessor Logic Design, the Flowchart Method*,  
Digital Press, 1987.



# 1-bit Control Signals

| Signal name | Value = 0                             | Value = 1                             |
|-------------|---------------------------------------|---------------------------------------|
| RegDst      | Write reg. # = bit 16-20              | Write reg. # = bit 11-15              |
| RegWrite    | No action                             | Write reg. $\leftarrow$ Write data    |
| ALUSrcA     | First ALU Operand $\leftarrow$ PC     | First ALU Operand $\leftarrow$ Reg. A |
| MemRead     | No action                             | Mem.Data Output $\leftarrow$ M[Addr.] |
| MemWrite    | No action                             | M[Addr.] $\leftarrow$ Mem. Data Input |
| MemtoReg    | Reg.File Write In $\leftarrow$ ALUOut | Reg.File Write In $\leftarrow$ MDR    |
| IorD        | Mem. Addr. $\leftarrow$ PC            | Mem. Addr. $\leftarrow$ ALUOut        |
| IRWrite     | No action                             | IR $\leftarrow$ Mem.Data Output       |
| PCWrite     | No action                             | PC is written                         |
| PCWriteCond | No action                             | PC is written if zero(ALU)=1          |



# 2-bit Control Signals

---

| Signal name | Value | Action                                                                               |
|-------------|-------|--------------------------------------------------------------------------------------|
| ALUOp       | 00    | ALU performs add                                                                     |
|             | 01    | ALU performs subtract                                                                |
|             | 10    | Funct. field (0-5 bits of IR ) determines ALU operation                              |
| ALUSrcB     | 00    | Second input of ALU $\leftarrow$ B reg.                                              |
|             | 01    | Second input of ALU $\leftarrow$ 4 (constant)                                        |
|             | 10    | Second input of ALU $\leftarrow$ 0-15 bits of IR sign ext. to 32b                    |
|             | 11    | Second input of ALU $\leftarrow$ 0-15 bits of IR sign ext. and left shift 2 bits     |
| PCSource    | 00    | ALU output (PC +4) sent to PC                                                        |
|             | 01    | ALUOut (branch target addr.) sent to PC                                              |
|             | 10    | Jump address IR[0-25] shifted left 2 bits, concatenated with PC+4[28-31], sent to PC |



# Control: Finite State Machine



# State 0: Instruction Fetch (CC1)



# State 0 Control FSM Outputs



# State 1: Instr. Decode/Reg. Fetch/ Branch Address (CC2)

*S2*



# State 1 Control FSM Outputs



# State 1 (Opcode = lw) → FSM-M (CC3-5)

Rst 'fromif.'

(S6)

CC4

PCSource



# State 1 (Opcode= sw) → FSM-M (CC3-4)



# FSM-M (Memory Access)



# State 1(Opcode=R-type) → FSM-R (CC3-4)



# FSM-R (R-type Instruction)

---



# State 1 (Opcode = beq ) → FSM-B (CC3)



# Write PC on “zero”

---



# FSM-B (Branch)

---



# State 1 (Opcode = j) → FSM-J (CC3)



# Write PC

---



# FSM-J (Jump)

---



# Control FSM



# Control FSM (Controller)



# Designing the Control FSM

---

- Encode states; need 4 bits for 10 states, e.g.,
  - State 0 is 0000, state 1 is 0001, and so on.
- Write a truth table for combinational logic:

| Opcode | Present state | Control signals  | Next state |
|--------|---------------|------------------|------------|
| 000000 | 0000          | 0001000110000100 | 0001       |
| .....  | .....         | .....            | .....      |

- Synthesize a logic circuit from the truth table.
- Connect four flip-flops between the next state outputs and present state inputs.



# Block Diagram of a Processor



# How Long Does It Take? Again

---

- Assume control logic is fast and does not affect the critical timing. Major time components are ALU, memory read/write, and register read/write.
- Time for hardware operations, suppose
  - Memory read or write
  - Register read
  - ALU operation
  - Register write



# ✓ Single-Cycle Datapath

- R-type 6ns
- Load word (I-type) 8ns
- Store word (I-type) 7ns
- Branch on equal (I-type) 5ns
- Jump (J-type) 2ns
- Clock cycle time = 8ns
- Each instruction takes *one* cycle



$$CPI = 1$$



# Multicycle Datapath

- Clock cycle time is determined by the longest operation, ALU or memory:  
R:  $S_1 \rightarrow S_2 \rightarrow S_3 \rightarrow S_4$   
Lw:  $S_1 \rightarrow S_2 \rightarrow S_5 \rightarrow S_6 \rightarrow S_7$   
Sw:  $S_1 \rightarrow S_2 \rightarrow S_5 \rightarrow S_8$

- Clock cycle time = 2ns

- Cycles per instruction (CPI):

- lw
- sw
- R-type (A/L)
- beq
- j

|     |        |
|-----|--------|
| 5 ✓ | (10ns) |
| 4 ✓ | (8ns)  |
| 4 ✓ | (8ns)  |
| 3 ✓ | (6ns)  |
| 3 ✓ | (6ns)  |

*Uniform  
instr.  
CPI*

$$= 5 + 4 + 4 + 3 + 3 = \frac{19}{5} = 3.8$$



$$\frac{\text{Time}}{\text{Proj}} = \frac{DC \times \underline{\underline{CPL}} \times \underline{\underline{C}}}{\underline{\underline{=}}} =$$

$$= DC \times 3.8 \times 2 \text{ ns}$$

$$= DC \times \underline{\underline{7.6}},$$

Supply circ

$$= DC \times \underline{\underline{CPL}} \times \underline{\underline{C}}$$

$$= DC \times \underline{\underline{1}} \times \underline{\underline{8}}$$

saving

Multibit.



Weighted CP  $\frac{1}{\text{freq}}$

| $R \rightarrow$ | 4  | 50%   |
|-----------------|----|-------|
| LW              | 5  | 20%   |
| Stock.          | 4. | 10%.  |
| Branch          | 3  | 10%.  |
| Jmp,            | 9  | 10%). |

MX

$$= 4 \times 5 + 5 \times 2 + 4 \times 1 \\ + 3 \times 1 + 3 \times 1$$

$$= \frac{2 + 1 + \underline{8} + \underline{3} + 3}{4}$$

$$\frac{\text{Time}}{\text{Proof}} = \frac{PC \times CQ \times G}{C} = PC \times 4 \times 2 \therefore \underline{8 \times 2}$$



# CPI of a Computer

---

$$\text{CPI} = \frac{\sum_k (\text{Instructions of type } k) \times \text{CPI}_k}{\sum_k (\text{instructions of type } k)}$$

where

$\text{CPI}_k$  = Cycles for instruction of type k

**Note:** *CPI is dependent on the instruction mix of the program being run. Standard benchmark programs are used for specifying the performance of CPUs.*



# Example

---

- Consider a program containing:

|              |     |   |
|--------------|-----|---|
| • loads      | 25% | ✓ |
| • stores     | 10% | ✓ |
| • branches   | 11% | ✓ |
| • jumps      | 2%  | — |
| • Arithmetic | 52% | — |

dynamic.

$$\begin{aligned} \text{CPI} = & 0.25 \times 5 + 0.10 \times 4 + 0.11 \times 3 + \\ & 0.02 \times 3 + 0.52 \times 4 \end{aligned}$$

mixture

$$= 4.12 \text{ for multicycle datapath}$$

$$\text{CPI} = 1.00 \text{ for single-cycle datapath}$$

$$\begin{aligned} \frac{\text{Time}}{\text{Pf}} &= 1 \times 4.12 \times 2 \\ &= 8.24 \text{ ns} \end{aligned}$$



# Multicycle vs. Single-Cycle

---

$$\begin{aligned}\text{Performance ratio} &= \frac{\text{Single cycle time / Multicycle time}}{\frac{(\text{CPI} \times \text{cycle time}) \text{ for single-cycle}}{(\text{CPI} \times \text{cycle time}) \text{ for multicycle}}} \\ &= \frac{1.00 \times 8\text{ns}}{4.12 \times 2\text{ns}} = 0.97\end{aligned}$$

*Single cycle is faster in this case, but remember, performance ratio depends on the instruction mix.*



Single Cycle

$$\underline{\text{CPI}} = 1$$

long  $\checkmark \tau \uparrow$

Multicycle



$= n\tau$

$$\text{CPI} \uparrow \cdot \underline{\text{CPI} > 1}$$

Way to go beyond this performance.

Time/  
Prog.  $= \underline{I \times CPI \times \tau}$

Reuse of resource



## Combined Datapaths



# Multicycle Datapath



# ILP: Instruction Level Parallelism

- Single-cycle and multi-cycle datapaths execute one instruction at a time.
- How can we get better performance?



# Traffic Flow



# Way Ahead. . . .



a alamy stock photo

HEWNJW  
www.alamy.com



16 Feb 2022

EE-739@IITB

Chaining of instruction  
**CADSL**

# Automobile Team Assembly



1 car assembled every four hours  
6 cars per day  
180 cars per month  
2,040 cars per year

# Automobile Assembly Line



First car assembled in 4 hours (pipeline latency)

thereafter 1 car per hour ✓

21 cars on first day, thereafter 24 cars per day

717 cars per month ✓

8,637 cars per year ✓

# Throughput: Team Assembly

---



Time of assembling one car =  $n$  hours

where  $n$  is the number of nearly equal subtasks,  
each requiring 1 unit of time

Throughput =  $1/n$  cars per unit time



# Throughput: Assembly Line



Time to complete first car =  $n$  time units (latency)

$$24 - 4 + 1 \approx 21$$

Cars completed in time  $T$  =  $T - n + 1$

$$\frac{T - n + 1}{T}$$

Throughput =  $1 - (n - 1)/T$  car per unit time

sub tasks,

Throughput (assembly line) ✓ =  $\frac{1 - (n - 1)/T}{1/n} = n - \frac{n(n - 1)}{T}$

Throughput (team assembly) =  $n - \frac{n(n - 1)}{T}$  →  $n$  as  $T \rightarrow \infty$



# Some Features of Assembly Line



# Instructions

## Subtask

- Fetch instruction
- Read Operands
- Execute / Compute the address of memory
- Access memory
- Write back (Update system state)

↙ {  
Branch of  
programmer's  
promise

$I_1 \ I_2 \ \dots$

Update  
exactly in  
the same  
sequence.

{  
ILLUSION



# INSTRUCTIONS

I<sub>1</sub>  
I<sub>2</sub>  
I<sub>3</sub>  
:  
I<sub>N</sub>

chaining of Instruction

↓  
Advantage of both  
| Single cycle impl: 8  
mult-cycle implement'







$$\begin{aligned}
 S_1 : a &= \text{SRC} \\
 D_1 : d &\rightarrow c + e
 \end{aligned}$$

Both do not meet



How often these assumptions fail?



NOT

OFTEN

$$a = b + c$$
$$d = b \times 2$$

$$\gamma = \delta \tau t$$

Implement — assuming that  
these assumptions are always respected

- ⊕ Should have a way to detect the failure.
- ⊕ Mechanism to correct.



R:

IF OR. EX WB

1 I<sub>1</sub>

2 I<sub>2</sub> I<sub>1</sub>

3 I<sub>3</sub> I<sub>2</sub> \$I\_1

4 I<sub>4</sub> I<sub>3</sub> I<sub>2</sub> I<sub>1</sub> →

5 I<sub>5</sub> I<sub>4</sub> I<sub>3</sub> I<sub>2</sub>

Local

IF OR EX MA WB

1 I<sub>1</sub>

2 I<sub>2</sub> I<sub>1</sub> - -



R: IF - OR - EX - WB

L: IF - OR - EX - MA - WB

S: IF - OR - EX - MA

B: IF - OR - EX

J: IF - OR - EX

IF - OR - EX - MA - WB .)



$\Rightarrow$  IF - OR - EX - MEM. WB

Load  $I_1$

$Z$   $L_1$

$I_1$



$I_1$

— $\Rightarrow$  Single



# Thank You

