

## Computer organization & Architecture

→ prerequisite

↳ How a Computer works internally → Digital logic

↳ CPU, RAM, I/O → Keyboard, Mouse, disk  
circuits & Design

Intel i3, i5, ...

4GB | 8GB | ...

Smart phone



Laptop

Smart watch

COA → Hardware

OS → Software

CD, DB

Micro-processor

what's a Computer : programmable Computer

CPU



Neumann-Architecture



stored program computer

program: Sequence of Instructions

Compiler  
C → MLL

Instruction: Binary Sequence / String

Data : Binary - seq | String

Block Diagram:



When the power is switched off all the data in main memory is removed.

**RAM** - Random-Access Memory

**ROM** - Read only Memory

Mother-board:

**ROM**

- CPU slot (Micro processor)
- PCI slot (can connect sound cards etc)
- Slots for HDD, DVD-ROMS
- RAM (Memory) slots
- Graphics card slot
- Back panel with connections for peripherals and USB Memory.

Micro controllers

ESP-32 chip

ARDUINO boards

Raspberry Pi boards



Every thing inside a chip (All Hardware of Computer) is a Micro controller chip.

Ph: 844-844-0102

### Basics of Memory organization:

→ All the programs are in Main Memory.

→ CPU performs memory-Requests

Eg: 6264 Memory Chip

Data sheet 6264

64K x 16

Size of each cell in bits

N = # Cells

Logically → cell/word location



$$\dots : 2^2$$

$$\dots : 2^3$$

$$1K = 2^{10} = 1024 \approx 10^3$$

$$1M = 2^{20} \approx 10^6$$

$$1G = 2^{30} \approx 10^9$$

$$1T = 2^{40} \approx 10^{12}$$

64K  
 $\Rightarrow 2^6 \times 2^{10} = 16 \text{ bits}$   
 $2^{16} \text{ cells} / \text{Locations}$

K bits  $\rightarrow 2^K$

16 bits  $\rightarrow 2^{16} \text{ unique addresses}$

64K x 32  
↓      ↓  
Address bus    databus = 32 bits

$$\begin{array}{ccccccc} & B & C & 1 & C \\ (1011\ 1100\ 0001\ 1100)_2 & \downarrow & \downarrow & & \downarrow \\ (B\ C\ 1\ C)_{16} & & & & \end{array}$$

C<sub>1</sub> DLC

### Address-decoding Logic:



$n \times 2^n$  : Decoder



At any time only one output is active/ high

$(0000\ 0000\ 0000\ 0010)_2$

high: 1 5V  
Low: 0 OV

### Pin-diagram of 6264 :

VCC: Power

A<sub>12</sub> - A<sub>0</sub>: Address Bus

D<sub>7</sub> - D<sub>0</sub>: Data Bus

GND: Ground OV

13-bit Address

$$\begin{aligned} A_0 - A_{12} &\Rightarrow 2^{13} \\ &\Rightarrow 2^3 \times 2^{10} \\ &\Rightarrow 8K \end{aligned}$$

8Kx8

R/W : 1=R 0=W

NC : Not Connected

OE : Enable output  
CE1: chipselect1 } Control pins

Note: No coherency address will be issued. They are corresponding to the output line of address decoder.

Ph: 844-844-0102



### Memory Addressability:

#### Byte and word - Addressability

#### Physical Address Space:



$2^n$  cells / words

0 to  $2^n - 1$

$$n = \log_2 N$$

Byte Addressable: one byte using

→ 8 bits one Address if  $N \times 8$

Nibble: 4 bits

Word Addressable:

$$IW = 1B \quad (N \times 8)$$

$$IW = 2B \quad (N \times 16)$$

$$IW = 3B \quad (N \times 24)$$

→ One Address corresponds to 2B.

Up: Byte Addr



Ignore the first Byte

16-bit processor:

→ MOV R1 2000



Register  
↓  
Memory  
inside  
CPU

$$A \leftarrow x^i$$

$$x = x + 1^i$$

$$\downarrow 2B [2000, 2001]$$



MOV R1 2000



Little Endian

Ph: 844-844-0102

### Big-endian System:



### Little-Endian System:



MOV R1 2000 → ALL MLL

{1011010 }

MOV R1 2000

INCR R1

MOV 2000 R1



(1224)<sub>16</sub>



### CPU-Memory Interfacing:

#### Pin-Structure:

① Active-low → 0

② Active-High → 1 ↓

ALE

8085 A

8-bit

8086

16bit

$\overline{RD}$  RD' }  
 $\overline{WR}$  WR' }  
 $\overline{INTA}$  }  
 Active when they are low

③ Time-multiplexed-Pin:

8085 A: AD0 - - - AD7  
 - A8 - - - A15

8 bit

1W=1B

16-addr pins  $\Rightarrow$  2<sup>16</sup> cells

$$= 2^{16} B$$

= 64KB size of MM

A0 - - - A7 A8 - - - A15  $\leftarrow$  addr  
 D0 - - - D7  $\leftarrow$  data

ALE: Address Latch Enable

{ ALE = 1 (A0 - - - A7) Address Lines  
 ALE = 0 (D0 - - - D7) Data Lines

cpu  $\leftrightarrow$  RAM

- ① send the address
- ② R/W data

8086 CPU: 16bit MP

AD0 - - - AD15 A16 - - - A19

(0-19) 20 bit Address

Data  $\Rightarrow$  16 bits

$$2^{20} \text{ cells} = 1 \text{ MW}$$

$\downarrow$

2B

$\Rightarrow$  2MB

MAX Mode  $\leftarrow$  MN/HX  $\rightarrow$  0  
 MIN Mode  $\leftarrow$

## Address-Latch-chip:

**Ph: 844844-0102**

STB: Strobe  
OE: output Enable



STB      0 → Input is disabled.  
          1 → Input is enabled

OE      0 → Output is Enabled.  
          1 → Output is disabled.

## CPU-Memory Interfacing:



3MHz

16 Address pins

8 data pins

when ALE & Data pins are connected to RAM

More the pins, the complexity of the chip will increase.

① Clock-Cycle:

$$\text{cycle-time} = \frac{1}{f}$$

$$= \frac{1}{\text{freq of operation of MP}}$$



Speed of MP

$$2\text{GHz} = f$$

$$2 \times 10^9 \text{ Hz} = f$$

$$2 \times 10^9 = f$$

$$2 \times 10^9 \text{ Hz}$$

$$2 \times 10^9 \text{ cycles/sec}$$

$$t = \frac{1}{2 \times 10^9} \text{ sec}$$

$$= 0.5 \times 10^{-9}$$

$$= 0.5 \text{ nsec}$$

$$\text{ms} = 10^3$$

$$\text{ns} = 10^6$$

$$\text{nsec} = 10^9$$

② Machine (M/C) cycle:

# of Cycles required to complete an operation.

one instruction LDA (2080)<sub>16</sub>

Load into accumulator Register

(a) Send the address

T<sub>1</sub>: ALE to I

$$\left. \begin{array}{l} \{ AD_7 - AD_0 : (80)_{16} \\ A_{15} - AS : (20)_{16} \end{array} \right\}$$



(b) Send Control Signals:

$$T_2: \begin{cases} \text{ALE} = 0 \\ \overline{\text{RD}} = 0 \\ \overline{\text{WR}} = 1 \end{cases}$$

MP: 2GHz - 4GHz

RAM: 1044 MHz  $\approx$  1.4 GHz

MM

3GHz Very Exp  
↓  
Slowest

8GB  
16GB  
32GB

(c) Read the data to ACC:

$$\left. \begin{array}{l} T_3: \\ T_4: \end{array} \right\}$$

than MP

Do - - - D<sub>7</sub>

Ph: 844-844-0102

slower than up

RAM clock cycle relatively

{ T<sub>3</sub>: } Do - - - D<sub>7</sub>  
 { T<sub>4</sub>: }

once the data is read from Memory to Data lines, processor will store them inside Accumulator.

INCR ACC



control unit

Arithmetic logic unit

### System Bus Configuration:



printer: page are over

↓ Need to send the interrupt signal to CPU

to display a message to the user



cpu  $\xrightarrow{1011}$  M

if keyboard address = 1011

### Ambiguity problem:



Bus Configuration : I/O processor ①

High performance Computing  
Ph: 844-844-0102  
NVIDIA + IOP

- Common control Signals
- Common Address space
- Different Buses
- I/O - Memory direct access
- Expensive

when I/O is active, M is low  
M is active, I/O is low



Bus Configuration : Isolated I/O ②

Eg: 8085A



pin 34

I/O/M  
1 → I/O  
0 → Mem

we can have multiple I/O devices

Let addr = 1011, RD=1, I/O/M Signal

Each I/O device have unique address

| I/O/M | RD |          |
|-------|----|----------|
| 1     | 0  | I/O Read |
| 0     | 0  | Mem Read |

No requirement of I/O processor.

Bus Configuration : Memory mapped I/O ③

→ Shared address space using decoders

Ph: 844-844-0102



2<sup>4</sup> addr    00 - -  
               01 - - } Mem-addr  
               10 - -  
               11 - -

⇒ 11 - - → I/O device addr



④ up that supports 128MW memory

Memory-Mapped I/O

3MSB core → assign to I/O ports | devices

128MW

$$2^7 \times 2^{20} W$$

$$\Rightarrow 2^7$$

Address : 27 bits

# I/O addresses : ?

# Memory addresses : ?

I/O Addresses : 2<sup>24</sup>

Memory Addresses:

$$7 \times 2^{24}$$



① Fetch-Cycle:

② Execute-Cycle: CPU-ORG



J1: LDA 2080 8085-Syntax

Fetch-Cycle:

Load Accumulator

(93)<sub>H</sub> 0x23

program:

opcode operand

SIM

Set Interrupt Mask

I1: LDA (2080)<sub>16</sub> → 3B

→ 1B

I2: SIM

Move Immediate

program Counter: 0000

Step-size: 3B

→ stores the address  
of the next instruction to be  
executed.

MP Fetches LDA 2080

goes to Execute cycle



A, B, C, D, E, H, L

C → Loader  
GCC → MLL → MM

program counter: 0003

step-size: 1B

MP fetches SIM

Next goes to Execute Cycle

program counter: 0004

step-size: 2B

If instruction is executed

program counter: 0006

Fixed-Length Instruction CPU:

↓  
Step size is fixed.

All instructions are fixed length  
**Ph: 844-844-0102**

Variable-length instruction CPU:

(8085, 8086 - - -)

Stepsize = f (op code)



Q) 32-bit MP  $\Rightarrow 1W = 4B$

$I_1 : 2W \quad 234 \text{ to } 23B$  starting memory addr  $= 0X234$

$I_2 : 1W \quad 23C$   $\Rightarrow$  value of PC when  $I_6$  is being executed  
 $(258)_{16}$

$I_3 : 1W \quad 240$  Memory is byte addressable.

$I_4 : 2W \quad 244$

$I_5 : 1W \quad 24C$

$I_6 : 2W \quad 250$

$I_7 : 1W \quad 258$

$I_8 : 1W \quad 25C$

0, 1, 2, ..., 9, A, B, C, D, E, F  
Hexadecimal

$$\begin{array}{r} 234 \\ 8 \\ \hline 23C \end{array}$$

$$\begin{array}{r} 23C \\ 14 \\ \hline 240 \end{array}$$

$$\begin{array}{r} 240 \\ 4 \\ \hline 244 \\ 8 \\ \hline 24C \\ 8 \\ \hline 250 \\ 8 \\ \hline 258 \end{array}$$

Q) Memory is word addressable (Same as above)

$$1W = 32b = 4B$$

$$\text{program-starting-addr} = (2346)_{10}$$

pc-value when  $I_6$  is being executed.

I<sub>1</sub>: 2346I<sub>2</sub>: 2348I<sub>3</sub>: 2349I<sub>4</sub>: 2350I<sub>5</sub>: 2352I<sub>6</sub>: 2353PC → I<sub>7</sub>: 2355

IP : 2356

 $(2355)_{10}$ 

(Q3)

CPU with 2-bit instruction program:

Fixed-len-instruction

- CP4

program-load-address =  $(300)_{10}$ 

which of these is a valid value of PC.

2-bit = 3B

Ⓐ 400 Ⓑ 500 Ⓒ 600 Ⓓ 700

~~Ⓐ 400~~ ~~Ⓑ 500~~ ~~Ⓒ 600~~ ~~Ⓓ 700~~

By default memory

Byte addressable

PC : 300

PC : 303

PC : 306

; 309

;



300 + i \* 3

300 + 100 \* 3

Execution-Cycle:→ Instruction format → 1B  
→ 2B  
→ 3B ...LDA 2080  
opcode operand

→ CPU-organization

Gatecs@appliedcourse.com

Stack-Cpus: (Zero address Instructions)

- $I_1 : \text{MOV } A, 2080$
- $I_2 : \text{MOV } B, 2090$
- $I_3 : \text{MOV } C, 2096$
- $I_4 : \text{PUSH } A$
- $I_5 : \text{PUSH } B$
- $I_6 : \boxed{\text{MUL}}$  opcode, No addr
- $I_7 : \text{PUSH } C$
- $I_8 : \boxed{\text{ADD}}$
- $I_9 : \text{POP } D$   
 $D = A * B + C$



TR - Temporary Register



TOS  
Top of the stack

ALU: Arithmetic and logic unit

- zero-address Instruction format ← ALU instr
- Non-All instructions need not be 0-address-instr

→ Instruction cycle: MUL  $I_6$

Fetch: Read MUL from memory to CPU based on PC by the control unit.

Execute: Interpret the Opcode

pop  $\leftarrow$  2nd op

pop  $\leftarrow$  1st op

MUL  $\leftarrow$

PUSH  $\rightarrow$



MUL



Stack-CPU:

I<sub>1</sub>: Push b

I<sub>2</sub>: push x

3: Add

4: pop c

5: push c

6: push y

7: ADD

8: push c

9: sub

10: pop z



$$c = b + x$$



$$c+y - c \Rightarrow y$$



$$z = y$$

CPU Organization: Accumulator CPU

1. Address Instruction

I<sub>1</sub>: Load 2080;  $ACC \leftarrow M[2080]$

I<sub>2</sub>: MUL 2090;  $ACC \leftarrow ACC \times M[2090]$

I<sub>3</sub>: ADD 2096;  $ACC \leftarrow ACC + M[2096]$

Instruction Cycle: I<sub>2</sub>



(a) Fetch: MUL 2090 read from Memory.

$$(a \times b) + c$$

(b) Execute: Interpret

$$TR \leftarrow M[2090]$$

Multiply

$$ACC \leftarrow M[2080] + M[2090]$$

Example: ENIAC (1945) had 20 ACC

Ph: 844-844-0102

{ IBM 701 (1952)  
IBM 650  
PDP 8

General register CPU

one-addr-Instructions : opcode Addr      memory/Register

$$X = (A+B) \mid (C+D)$$

(P1)

- 1: Load A
- 2: Add B
- 3: STORE T

4: Load C

5: Add D

6: Store Y

7: Load T

8: DIV Y

9: STORE X

$$Acc \leftarrow A+B$$

$$T \leftarrow A+B$$

$$Acc \leftarrow C+D$$

$$Y \leftarrow C+D$$

$$Acc \leftarrow (A+B)$$

$$Acc \leftarrow (A+B) / (C+D)$$

X

(P2)

- 1: Load C
- 2: ADD D      Acc C+D
- 3: STORE T      T  $\leftarrow C+D$

4: Load A

5: ADD B      Acc  $\leftarrow A+B$

6: DIV T

$$Acc \leftarrow (A+B) / (C+D)$$

7: STORE X

Expand opcode-technique:

Up with fixed-length instructions (16b):

① 1-addr instruction:



② 0-addr instruction:





0000 0000 }  
1111 1111 }

- ① up that supports 0-addr-instr & 1-addr-instr.

Memory size:  $16W = 2^4 W$

fixed-sized instr: 6 bits  $\geq 1W$

TWO 1-addr instructions

# possible zero-addr instructions = ?



### 0-addr



- ②

32-bit up:

256 MW memory (max-supported)

fixed-size addr-instr: 1W

Eight 1-addr instr

# possible 0-address instr = ?

256 HW  $\Rightarrow 2^8 \times 2^{20}$  W

$$\Rightarrow 2^{28} \text{ W}$$

Memory  
Ph: 044-844-0102

Memory

CPU:



### 0-address instructions:



$$\Rightarrow 2^3 \times 2^{28}$$

$\Rightarrow 2^{31}$  possible 0-address instructions.



### Q) up with 0-addr & 1-addr instruction

fixed-sized-instr : 32b

$$\text{Memory : } 128 \text{ KW} = 2^{17} \text{ W}$$

if n 1-addr instruction exist, # 0-address-instr = ?



$$\Rightarrow 2^{15}$$



n 1-addr instructions

- Reg-Mem Reference architecture.
- Reg-Reg Reference Architecture.

### Reg-Reg Reference CPU:

I<sub>1</sub>:  $\text{opcode dest op1 op2}$   
 $\text{ADD } r_0, r_1, r_2 \quad r_0 \leftarrow r_1 + r_2$



### - 3-Address instruction

- Loading Store: Reg  $\leftrightarrow$  Mem

Eg:  $x = (A+B) * (C+D)$

3-reg

load  $r_0, A$  (2-address-instruction)  
 load  $r_1, B$   
 add  $r_0, r_1, r_1$   
 load  $r_1, C$   
 load  $r_2, D$   
 add  $r_1, r_1, r_2$  (3-address-instruction)  $r_1 \leftarrow C+D$   
 MUL  $r_0, r_0, r_1$   $r_0 \leftarrow (A+B) * (C+D)$   
 store  $X, r_0$

### 4-Address instructions:

I<sub>1</sub>: opcode dest-addr src1 src2  
 I<sub>2</sub>: op1 op2  
 :  
 I<sub>20</sub>

addr-next-instr  
 addr(I<sub>20</sub>)

→ Not used widely

PC

Address of the next instruction.

Ph: 844-844-0102

(\*)

## Instruction - Execution Sequence:

### Accumulator - CPU:



1-Address  
instruction



### Fetch:

① PC

RD = 0  
addr of instr

CPU generates  
mem-request



Instruction register



readonly  
Memory

Design time  
binding

{

ROM

### Instruction decode:

- 000 - Add
- 001 - MUL
- 010 - Sub
- 011 - LOAD
- ⋮

3b | 5b

x3



PSW

Flag Reg

program status

wad: will store

All miscellaneous information  
about the currently executed  
instruction.

### operand Fetch:

#### (a) From Register



ALU 81TR

#### (b) From Memory



ALU 81TR



### Reg-Mem Reference Architectures

in Reg-Mem Reference

MUL  $r_0, 2080$   
 ↓      ↓  
 operand1    operand2      Memory Location

MUL  $r_0, r_1$   
 ↓      ↓  
 Reg    Reg

→  $r_1 : ADD r_0, 2080$        $r_0 \uparrow (r_0 + m[2080])$

→ 2-addr-format

→ Fetch

→ Execute: Interpret

Fetch  $m[2080]$

ADD  $r_0 + m[2080]$

$r_0 \leftarrow r_0 + m[2080]$



Reg-file  
 Collection  
 of  
 Registers

$$X = (A+B) * (C+D)$$

1: MOV  $r_0, A$

$r_0 \leftarrow A$

2: Add  $r_0, B$

$r_0 \leftarrow A+B$

3: MOV  $r_1, C$

$r_1 \leftarrow C$

4: Add  $r_1, D$

$r_1 \leftarrow C+D$

5: **MUL  $r_0, r_1$**

$r_0 \leftarrow r_0 * r_1$

6: MOV  $X, r_0$

{MOV  $T, r_1$   
 {MUL  $r_0, T$

20-bit instructions

512W memory

TWO 2-addr-instr

150 1-addr-instr

# possible 0-addr-instructions = ?



Memory

512W

2-addr {  
 00  
 01  
 10  
 11 } → 1-address  
 0-address



$$2 \times 2^9 \Rightarrow 2^{10} \Rightarrow 1024$$

- given 150-1 addr

instructions

zero-address Instructions:



→ 874 \* 2^9 zero-address  
instructions possible.

Q

instruction-format  $\langle \text{opcode}, \text{reg-addr}, \text{mem-addr} \rangle$

$$\# \text{ registers} = 128 = 2^7$$

256 MW Memory

32-bit instructions

@ n 2-addr instructions

$$\# 1\text{-addr instr} = ?$$

$$256 \Rightarrow 2^8 \text{ MW} \Rightarrow 2^8 N$$



2^7 possible 2-address  
Instr

$2^{7-n}$ 

1-address Instructions



$$\Rightarrow \boxed{(2^{7-n}) \times 2^7} \Rightarrow 1\text{-address instructions.}$$

(b) n 2-addr instr

m 1-addr instr

# 0-addr instructions = ?



$$\frac{(2^{7-n}) \times 2^7}{\text{possible 1-address}} \text{Instructions}$$

Given m - 2address Instructions

$$(2^{7-n}) \times 2^7 - m$$

 $\Rightarrow$  # of possible zero address instructions.

$$[(2^{7-n}) \times 2^7 - m] \times 2^{18} \quad \text{opcode } \underline{\text{Memory}}$$

opcode, reg



No memory

$$[(2^{7-n}) \times 2^{18} - m] \times 2^7$$

Ph: 844-844-0102

Program Status Word:  
(PSW)  
Currently executing instr

i<sub>3</sub>, i<sub>5</sub>, i<sub>7</sub>, i<sub>9</sub>

8085: 8bit  
1W=1B Flag-Register



- Carry (8086)
- Parity
- Auxiliary carry
- zero
- sign
- overflow.

ALU

Flag: 1 (Set)

0 (Reset)

8086: 16-bit



### ① Carry-Flag: CY

ALU → 8 bit unsigned int  $\Rightarrow$  0 to  $2^8 - 1$   
0 to 15

$$\rightarrow \underline{8+5'}$$

$$\begin{array}{r} 1000 \\ + 0001 \\ \hline 1101 \end{array}$$

No carry  $0101 = ACC, \underline{CY=0}$

$$\Rightarrow \underline{8+9}$$

$$\begin{array}{r} 1000 \\ + 1001 \\ \hline 0001 \end{array}$$

$$0001 = ACC \quad \underline{CY=1} \quad \Rightarrow \boxed{10001}$$

### ② Parity-Flag:

$$(Eq) \quad ACC = \boxed{1010\ 1010}$$

$$ACC = 1011\ 1111$$

$$\begin{array}{r} 1011\ 1010 \\ \hline \end{array}$$

PF=1 if # of ones is even

PF=0 if # of ones is odd.

$\Rightarrow$  Error-Correcting & Detection Codes

③ Auxiliary-carry:

set if  $\exists$  a carry from LN to UN

**Ph: 844-844-0102**

$$\begin{array}{r} 2A \\ + 34 \\ \hline ACC = 5E \end{array}$$

ubits: Nibble

BCDEF

$$\begin{array}{r} SA \\ 2A \\ \hline ACC = 54 \\ AC = 1 \end{array}$$

$$\begin{array}{r} 10 \\ 10 \\ \hline 20 \end{array}$$

AC = 0

Binary-coded Decimal Arithmetic: (BCD)

$$\begin{array}{c} \xrightarrow{\quad} \text{Unpacked} \leftarrow \\ \text{BCD} \\ \xrightarrow{\quad} \text{packed} \end{array}$$

$$(26)_{10} = 0000\ 0010\ 0000\ 0110$$

$$(26)_{10} = (11010)_2$$

$$16+8+2$$

packed  
 $\downarrow$

use Nibble

④  $(26)_{10}$

$$\begin{array}{r} \Rightarrow 26 = 0010\ 0110 \\ 27 = 0010\ 0111 \\ \hline 53 = 0100\ 1101 = ACC \end{array}$$

$CY = 0$   
 $AC = 0$

0---15 Hexadecimal  
0---9 decimal

Rule: If  $LN > 9$  then add 6 to Accumulator

$$\begin{array}{r} 0100\ 1101 \\ 0000\ 0110 \\ \hline 0101\ 0011^3 \end{array}$$

$$\begin{array}{r} 28 \\ 29 \\ + \\ \hline 57 \end{array}$$

$$\begin{array}{r} 0010\ 1000 \\ 0010\ 1001 \\ \hline 0101\ 0001 \end{array}$$

$$\begin{array}{r} 0101\ 0001 \\ 0000\ 0110 \\ \hline 0101\ 0111 \end{array}$$

Rule: If  $LN < 9$  &  $AC = 1$  then Add 6

$$\begin{array}{r} 88 \\ 89 \\ \hline 177 \end{array} = \begin{array}{r} 1000 \curvearrowleft 1000 \\ 1000 \quad 1001 \\ \hline 10001 \quad 0001 \end{array}$$

AC=1

CY=1

Rule: AC=1 & LN < 9 add 6 to LN

CY=1 & UN < 9 add 6 to UN

$$\Rightarrow \begin{array}{r} 10001 \quad 0001 \\ 0110 \quad 0110 \\ \hline 10111 \quad 0111 \end{array}$$

177

(4) Zero-Flag: ACC=0 ZF=1

2000: LDA 2080

JZ: Jump if zero

2003: JZ 4000

pc: 2009

2004:

:

4000: 

↓  
4000

if ( $x == 0$ )  $\rightarrow$   $i = 10; i > 0;$   
 {  
 }  
 else  
 {  
 }  
 =

while  
switch

JNZ: Jump if Non-zero

Signed-Integer

(5) Sign-Flag:



(3)

overflow: Signed-Airthmatic

unsigned  
 $0 - (2^{n-1})$

Ph: 844-844-0102

2's Complement-Binary

$0010$   
 $0111$   
 $0000$   
 $\{ \quad 1000$   
 $1111$   
 $1101$

Equivalent Decimal

$+2$        $0 - --$   
 $+7$        $0 - --$   
 $+0$   
 $-8$   
 $-1$   
 $-3$

Signed

$-8 \dots -1 +0 \dots +7$

$$\begin{array}{r}
 1000 \\
 1's \quad 0111 \\
 \text{Complement} \\
 \downarrow \\
 2's \text{ Complement} \quad \overline{\begin{array}{r} 0111 \\ 1000 \end{array}} \quad (-8)
 \end{array}$$

$$\begin{array}{r}
 \neg 1 \Rightarrow \quad 0001 \\
 \quad \quad \quad \downarrow \\
 \neg 1's \quad 1110 \\
 \neg 2's \quad \overline{1111}
 \end{array}$$

$$\begin{array}{r}
 \neg 3 \Rightarrow \quad 0011 \\
 \neg 1's \Rightarrow 1100 \\
 \neg 2's \quad \overline{1101}
 \end{array}$$

n-bit  
unsigned Integer  
 $0 - 2^{n-1}$

Signed Integer

$-2^{n-1}$  to  $+(2^{n-1}-1)$

(a)

$$\begin{array}{r}
 +7 \\
 +3 \\
 \hline +10 > +7
 \end{array}
 \quad \begin{array}{r}
 0111 \\
 0011 \\
 \hline 1010
 \end{array}
 \Rightarrow \boxed{-} \quad \text{overflow} = 1$$

$$\begin{array}{r} +7 \\ -3 \\ \hline +4 < 7 \end{array}$$

$$\begin{array}{r} 0111 \\ 1101 \\ \hline 0100 \end{array}$$

$$\begin{array}{r} -7 \\ +3 \\ \hline -4 \end{array}$$

$$\begin{array}{r} 11 \\ 1001 \\ 0011 \\ \hline 1100 \end{array}$$

$$\begin{array}{r} 1100 \\ -4 \\ \hline \end{array}$$

NO In carry  
NO OUT carry

OV=0

|   | Incarry | out-carry | overflow |
|---|---------|-----------|----------|
| c | 0       | 0         | 0        |
| d | 0       | 1         | 1        |
| a | 1       | 0         | 1        |
| b | 1       | 1         | 0        |

d

$$\begin{array}{r} 1101 \\ -3 \\ \hline 0110 \end{array}$$

-10 -8 -7

OV=1

MSB

Incarry XOR OUT carry

Control Flags:

① TRAP  $\leftrightarrow$  0 (G10)  
 $\leftrightarrow$  1 (CTRCE | single step) break

debugging : gdb

$\begin{cases} \rightarrow I_1: \leftrightarrow \text{Var \& Reg} \\ \rightarrow I_2: \end{cases}$

② Interrupt-Flag:



- 0 (disable-INTR)
- 1 (Enable-INTR)

③ Direction Flag: String-instruction → Seq of char  
↓  
IB



0 : L  $\rightarrow$  H      ACX  
1 : H  $\rightarrow$  L      XCA



More Solved problems:

① The chip select logic for a certain DRAM chip in a single memory system design is shown below. Assume that the memory system has 16 address lines denoted by A<sub>15</sub> to A<sub>0</sub>. What is the range of addresser (in hexa decimal) of the memory system that can get enabled by the chip select (CS) signals?

- A C800 to CFFF
- B CA00 to CAFF
- C C800 to C8FF
- D DA00 to DFFF





C 8-F 0-F 0-F



- ② A processor has 16-integer registers ( $R_0, R_1, \dots, R_{15}$ ) and 64 floating point registers ( $F_0, F_1, \dots, F_{63}$ ). It uses a 2-byte instruction format. There are 4 categories of instructions:
- Type-1, Type-2, Type-3 and Type-4. Type-1 category consists of 4 instructions, each with 3 integer register operands (3Fr).
  - Type-2 category consists of eight instructions, each with 2 floating point register operands (2Fs).
  - Type-3 category consists of fourteen instructions, each with one integer register operand and one floating point register operand.
  - Type-4 category consists of N instructions, each with a floating point register operand (1F). The maximum value of N is —.



Type 1 [4]  $\underline{\underline{4 \quad 4 \quad 4}}$   $\Rightarrow 2^{12} \times 4 = 2^{14}$

$$\begin{array}{r} 2^{16} \\ - 2^{14} \\ \hline - 2^{15} \end{array}$$

Type 2 [8]  $\underline{\underline{4 \quad 6 \quad 6}}$   $\Rightarrow 2^{12} \times 8 = 2^{15}$

Type 3 [14]  $\underline{\underline{6 \quad 4 \quad 6}}$   $\Rightarrow 2^{10} \times 14$

$$\begin{array}{r} - 2^{10} \times 14 \\ \hline 2048 = 2^{11} \end{array}$$

Type 4 [N] 16b 6b  $\Rightarrow N \times 2^6 = 2048$

$$\Rightarrow N = \frac{2^{11}}{2^6} \Rightarrow 2^5 = 32$$

(Q3) A processor can support a maximum memory of 4GB, where memory is word-addressable (a word consists of two bytes). The size of the address bus of the processor is at least 31 bits.



(Q4) A processor has 40 distinct instructions and 24 general purpose registers. A 32-bit instruction word has an opcode, two register operands and an immediate operand. The number of bits available for the immediate operand field is \_\_\_\_\_.



$$\Rightarrow 32 - (6+5+5)$$

$$2^5 = 32$$

$$2^6 = 64$$

$$2^4 = 16 \text{ reg}$$

$$2^5 = 32$$



@Q5

Consider a processor with 64 registers and an instruction set of size twelve. Each instruction has five distinct fields, namely opcode, two source register identifiers, one destination register identifier, and a twelve-bit immediate value. Each instruction must be stored in memory in a byte-aligned fashion. If a program has 100 instructions the amount of memory (in Bytes) consumed by the program

Text is



twelve  
↓  
4 bits

$$\frac{34b}{8} = 4.25B \approx 5B$$

$$5 \times 100 = \underline{\underline{500B}}$$



Q6

For Computer based on 3-address instruction format, each address field can be used to specify which of the following

**Ph: 844-844-0102**

S1: A memory operand

S2: A processor register

~~S3: An implied accumulator register.~~

(A) Either S1 or S2

(B) Either S2 or S3

(C) Only S2 and S3

(D) All of S1, S2 and S3.

3-address

op code    addr1, addr2

Q7 Consider two processors P<sub>1</sub> and P<sub>2</sub> executing the same instruction set. Assume that under identical conditions for the same input, a program running on P<sub>2</sub> takes 25% less time but incurs 20% more CPI (clock cycles per instruction)

a) Compared to the program running on P<sub>1</sub>. If the clock

frequency of P<sub>1</sub> is 1GHz, then the clock frequency of P<sub>2</sub> is

1.1GHz, then the clock frequency of P<sub>2</sub> (in GHz) is

~~(A) 1.6~~

(B) 3.2

(C) 1.2

(D) 0.8

P<sub>1</sub>      P<sub>2</sub>

f<sub>2</sub> 1GHz      f = ?

$$\text{Diagram: } \text{Clock cycle symbol} \rightarrow t = \frac{1}{f}$$

$$= \frac{1}{1 \times 10^9}$$

$$= 10^{-9} \text{ s} = 1 \text{ ns}$$

P<sub>1</sub>

1 CPI

$$t_1 = 1 \text{ ns}$$

P<sub>2</sub>

Ph: 844-344-0102

1.2 CPI

$$t_2 = 0.75 \text{ ns}$$



$$\Rightarrow 0.75 \text{ ns} = 1.2 t_2$$

$$\Rightarrow t_2 = \frac{0.75 \text{ ns}}{1.2}$$

$$f_2 = \frac{1.2}{0.75} \text{ GHz}$$

$$= \boxed{1.6 \text{ GHz}}$$

(Q8) A machine has 32-bit architecture, with word length 32 bits instructions. It has 64 registers, each of which is 32 bits long. It needs to support 45 instructions, which have an immediate operand in addition to two register operands. Assuming that the immediate operand is an unsigned integer, the maximum value of immediate operand is \_\_\_\_\_



$$14 \text{ bits}$$

$$0 \rightarrow (2^{14} - 1)$$

$$\Rightarrow 16384 - 1$$

$$\Rightarrow \boxed{16383}$$



Effective Address



Register-based AM:

Eg:



$$r0 \leftarrow r0 + r1$$

Register  $\rightarrow$  memory Reference (Reg)



Execute stage: 3-reg mem-ref



mem-reference

Implied / Implicit AM:

Eg: STC      ~~1 reg mem ref~~      PSN/flag-reg  
set carry      CY <- 1



CLC      clear carry      CY <- 0  
ADD (stack op)



zero-address instructions

Immediate AM:

(Eg) MOV R0  $(2^3)_B$       source  
Constant       $R0 \leftarrow 2^3$



data (limited bits)

int a = 56;

(Eg) MOV dest  $(2^3)_B$  Src  
R0      meaningless

$(2^3)_B \leftarrow R0 X$

int 56 = a;

⑧ up with multiplexed  $AD_0 - AD_{23}$

address - 24 bits

⑨ 256 2-word instructions

Data - 24 bits

⑩ 128 processor-registers

Instr format: opcode, reg, reg, immediate

Largest unsigned number in immediate field?



$$= 2^{26} - 1$$

Direct Addressing Mode:

Eg: Add  $r_0, [2080]$   
 $r_0 \leftarrow r_0 + m[2080]$



Usage: static variable

int main()  
 Compiler Design      Symbol Table      static int i=0;  
 Dir AM      MOV [2080] (00)<sub>16</sub>  
 Imm AM

throughout the  
 life-time of static variable : program

Indirect Addressing Mode:- pointers

Eg: Add  $@r_0, r_1$  (memory Register Indirect)  
 reg.indir      reg.Am  
 $[r_0]$



$$M[2080] \leftarrow M[2080] + R_1$$

**Ph: 844-844-0102**

Eg: Add @2080,  $R_1$  (Memory-Indirect AM)

↓      ↓

mem-indir reg Am  
AM

$$m[4000] \leftarrow m[4000] + R_1$$

IR



Indexed AM: Arrays → Randomly access data @ any index



Eg: MOV RI, BX, IDX

↓

Reg AM

Base-Addr → Base-addr-Reg

IDX: Index-Reg



Indirect-Indexed AM:



Indexed part



→ Auto-increment AM

→ Auto-decrement AM



$$2080 + 2 \text{ Step size} \\ \Rightarrow 2082$$



Auto-increment AM

Pre-auto incr  
post auto incr

MOV R0, +(R1)

MOV R0, (R1)+

C: x++ vs  
i+x

Auto-decrement AM

pre-auto decr  
post-auto deer

MOV R0, -(R1)

MOV R0, (R1)-

R1: 2080

S.S: 2B

More Solved problems:

①

32 bit up



A: Mem-addr

B: reg-addr

# memory cycle organized during execution-cycle.

Fetch  
2 Mem-R

Execute

ID: 0 1-Reg R

- ① 2 mem-R
- ② 1 reg-R
- ③ Arithmetic
- ④ 1 mem-R + 1 mem-W



A = 2080

⑧ 3.2 GHz MP



8-clock cycles: Mem-Reg

4: Arithmetic of  
0: Register + operand

Aug-fetch-rate = ?  
(in M/S)

7.24 Cycles

7.24 × 0.3125

⇒ 2.26 ns

⇒ 1W

$$\begin{aligned} & f = 2.2 \text{ GHz} \\ & \Rightarrow t = \frac{1}{3.2 \times 10^9} \\ & = 0.3125 \text{ ns/cycle} \end{aligned}$$

Aug.fetch rate = ?

$$= \frac{1}{0.26 \times 10^9}$$

$$= 442.4 \text{ MW/S}$$

(Q3) Consider the C struct defines below:

Ph: 844-844-0102

Struct data S

let  
int marks [100]; → 2B  
char grade; → 1B  
int cnumber; → 2B  
};

Struct data student:



The base address of the student is available in register R1.

The field student.grade can be accessed efficiently using

x (A) post-increment addressing mode (R1)

x (B) post-decrement addressing mode - (R1)

x (C) Register direct addressing mode R1

(D) Index addressing mode,  $X(R1)$ , where  $X$  is an offset represented in 2's complement 16-bit representation.

Transfer of Control AM: Based + Relative

→ PC-relative AM

→ Based / Base-reg AM

→ Jump / Branch: unconditional



→ Conditional Jump / Branch if ZF in psw=1

JZ2 addr  
JNZ

Gatecse@appliedcourse.com

$\text{for}(\cdot; \cdot; \cdot)$

while(  
{  
y = }  
y = }

Ph: 844-844-0102

## Segmented Memory:

20 bit addr

$2^{20}$  cells = 1 MB mem

$$\frac{2^{20}}{2^4} = 2^{16} = 64\text{KB}$$



16 - equal sized Segments

## PC-relative Addressing:



$$4000 - 2084 \\ = 1916$$

(46) Base-Addr Am: / Based-Am

$$EA = \text{Base} + \text{Addr} + \frac{\text{addr-offset}}{11}$$

1000                    2345

Dynamic program-relocation



$P_2$

jump BX offset  
jump R1 2345



Solved problems:

- ① 6B long pc-relative jump-instr started at addr  
(543025)<sub>10</sub> onwards. JUMP Rel-addr (-48)
- (-48) Signed displacement is present in the addr-field of the instruction.

ⓐ what is the branch-addr:

$$\begin{aligned} \text{pc: } & 543031 + (-48) \\ = & 542983 \end{aligned}$$



- ⓑ If base-reg contains 683823; what's the branch address if base-reg. addr is used.

jump R1 -48

Base reg: 683823  
+ -48

Q3

~~based~~-Indexed Am is used with index-value = 32

Ph: 644-844-0102

branch-addr = ?

$$(68\ 38\ 23) + 32 = 48$$

Q2

A 2B-instruction is present in the memory location starting at  $(254)_{10}$ .

| opcode | Addr | $(232)_{10}$                 |
|--------|------|------------------------------|
| 8b     | 255  | $M[232] = 185$<br>$R1 = 222$ |

a) calc EA if immediate AM

$$\rightarrow 255$$



b) Reg Am : Reg Address of R1

c) Direct Am : 232

d) Reg-indirect Am : 222

e) Mem-indirect Am : 185

f) Indexed-Am (R1; index-reg) : BA + RA  
 $232 + 222 = 454$

g) PC-relative AM

$$\text{jump} \rightarrow \text{pc} : 256 + 232 = 488$$

Q3 Consider a RISC machine where each machine instruction is exactly 4 bytes long. Conditional and unconditional branch instructions use pc-relative addressing mode with offset specified in bytes to the target location of the branch.

Instruction. Further, the offset is always with respect to

consider the following instruction sequence.

| Instr. No               | Instruction           |
|-------------------------|-----------------------|
| $0-3 \rightarrow i$     | add R2, R3, R4        |
| $4-7 \rightarrow i+1$   | sub R5, R6, R7        |
| $8-11 \rightarrow i+2$  | cmp R1, R9, R10       |
| $12-15 \rightarrow i+3$ | <b>beq R1, offset</b> |
| 16 →                    |                       |



If the target of the branch instruction is  $i$ , then the decimal value of the offset is  $-16$ .

## Instruction Set: Introduction:

- ① Data Transfer : MOV, LOAD, STORE, PUSH, POP, IN, OUT etc.
- ② Data Manipulation
- ③ Transfer of Control / Program Control
  - ↳ jump
  - JZ
  - DNZ ...

### Data-Manipulation-instr:

### Data Sheet

- ① Arithmetic instruction: ADD, MUL, SUB, DIV, INC, DEC etc

(a) INC Ry

Ry: 1111 1111

$\begin{array}{|c|} \hline 0 \\ \hline \end{array} \leftarrow CY: \quad \begin{array}{|c|c|} \hline 0 & 0000 0000 \\ \hline \end{array}$

(b) DEC Ry

Ry: 0000 0000

$\begin{array}{|c|} \hline 1 \\ \hline \end{array} \leftarrow CY: \quad \begin{array}{|c|c|} \hline 1 & 1111 1111 \\ \hline \end{array}$

cy bit does not change.



- ② Logical Instructions : AND, OR, XOR, CMP etc  
MOD

④ CMP  $r_0$   $r_1$        $r_0 - r_1 \rightarrow \text{SUB(CALU)}$

$$x_0 = x_1 \Rightarrow cy: 0 \quad \begin{matrix} \square \\ \text{1011 1011} \\ \text{1010 1010} \end{matrix} \quad \begin{matrix} r_0 \\ r_1 \end{matrix}$$

$$r_0 > r_1 \Rightarrow cy:0 \geq 0$$

Diagram illustrating the state transition of a 4-bit counter. The initial state is 1010, and the final state is 1011. The counter has a synchronous clear input (S) and a parallel load input (PSN).

- ### ③ Shift-operations :-



$$a < 2; \quad a > 2;$$

Signed: Numeri



#### ④ Rotate Instruction:



rotate with carry  
Right



#### Transfer-of-Control/ program-control Instructions:

unconditional      Conditional

TOC

↓  
Jump  
HALT

TOC

↓  
JNZ  
JZ

{  
sub-routines  
sub-programs  
functions}

#### ① jump / BRANCH

1000: J1  
1002: J2  
→ 1004: J2 (HALT)  
1006: J3

JUMP [2000]  
PC: 1003      PC: 2000

1006 : I<sub>3</sub>

!

2000 : I<sub>4</sub>2002 : I<sub>5</sub>

!

!

PC = 1002

I<sub>1</sub> I<sub>2</sub> I<sub>3</sub> I<sub>4</sub> I<sub>5</sub>

② **HALT** → User view : Computer is Halted/Stopped **RESTART**

→ CPU-view → HALT ⇒ (PC = 1002)

**HLT**1000 : I<sub>1</sub>1001 : I<sub>2</sub>

→ 1002 : I<sub>3</sub> [HALT]

PC: 1003 : I<sub>4</sub>

!

PC: 1003 X

IR: [HALT]

PC: 1002

③ **JZ & JNZ** (Conditional Transfer of Control)

✓ 1000: MOV R<sub>0</sub>, (03)<sub>16</sub>R<sub>0</sub>: ~~X~~ → ACC & ALU

ZF: 1

✓ 1001: I<sub>2</sub>✓ 1002: DEC R<sub>0</sub>

1004: JNZ [100]

ZF: 1 if ACC ≠ 0

→ 1006: I<sub>3</sub>ZF: 1 if result of  
prev-instr  
is zero1007: I<sub>4</sub>

!

1000: I<sub>1</sub>1001: I<sub>2</sub>→ 1002: I<sub>3</sub> (CALL 2000)1003: I<sub>4</sub>2000: I<sub>5</sub>2001: I<sub>6</sub>

→ 2002: RET

\* 2003: I<sub>7</sub>

pc: 1003 2000

pc: 2003 1003

PUSH PC

POP

 Run-time  
Stack

Main Memory



### Direct vs Indirect Transfer of Control

|                    |                                                  | ↓ | ↓                          | EA & Addressing-Mode |
|--------------------|--------------------------------------------------|---|----------------------------|----------------------|
| un-<br>conditional | Jump addr<br>CALL addr<br>HALT                   |   | RET → unconditional        |                      |
| conditional        | JNZ addr<br>JZ addr<br>CALLZ addr<br>CALLNZ addr |   | RETZ, RETNZ<br>RETC, RETNC | Conditional<br>RET   |



H/W interrupt pins

- RST 7.5
- RST 6.5
- RST 5.5
- INTR

Instruction Fetch

Instruction Decode

Operand Fetch

Write back

INTA - Interrupt Acknowledgement

### Interrupt-Cycle

① Check the INTR pins

② Context-switch/Transfer of Control

using Interrupt vector table (IVT)

③ Interrupt sub-routine (ISR)

④ IRET/RETI

Interrupt Return | Return|Interrupt



1000: I<sub>1</sub>

1002: I<sub>2</sub> ← INTR

1004: I<sub>3</sub> ← check INTR Pins

1005: I<sub>4</sub>

→ I<sub>3</sub> → I<sub>1</sub> → I<sub>2</sub> → I<sub>4</sub>



(9)

## Single-INTR vs Nested-INTR processing

Ph: 844-844-0102



INTR → Interrupt Cycle

RST 5.5 → ignore this



$\left\{ \begin{array}{l} \text{RST 7.5} \\ \text{6.5} \\ \text{5.5} \\ \text{INTR} \end{array} \right.$

[JOT]

Need to place  
the highest priority interrupt.

First ↑

[Oven]

IN=4B

(Q1)

### Instruction

### Meaning

### Size(0)

| <u>Addr</u> | <u>IF &amp;<br/>ID</u> | <u>OF<br/>WD</u> |
|-------------|------------------------|------------------|
| 5230        | 16                     | 16               |
| 5238        | 16                     | 16               |
| 5246        | 8                      | 2                |
| 5250        | 08                     | 2                |
| 5254        | 16                     | 2+8              |
| 5262        | 16                     | 2+8              |
| 5270        | 8                      | —                |
| 5274        | 88                     | 56               |

I<sub>1</sub>: MOV r0, @3000

$r_0 \leftarrow M[3000]$

2

I<sub>2</sub>: MOV r1, @2000

$r_1 \leftarrow M[2000]$

2

I<sub>3</sub>: Add r0, r1

$r_0 \leftarrow r_0 + r_1$

1

I<sub>4</sub>: MUL r0, r1

$r_0 \leftarrow r_0 * r_1$

!

I<sub>5</sub>: SUB r0, [4000]

$r_0 \leftarrow r_0 - m[4000]$

2

I<sub>6</sub>: MOV 3(r0), r1

$M[30+r_0] \leftarrow r_1$

2

INT → I<sub>7</sub>: Halt

stop

(a) Byte-address-Memory 32 bit MP

Starting-addr =  $(5230)_{10}$

INTR occurs during execution of Halt

Gatecse@appliedcourse.com What's the return address pushed to the stack?



- (b) INT occurs during execution of SUB, Return-address pushed into Stack = ?

while executing IS

PC 5262

contains



- (c) MP : 2.5 GHz

IFGID: 8 cycles / word

$$\text{Cycle time} = \frac{1}{2.5 \text{ GHz}}$$

$$= 0.4 \text{ ns}$$

Arithmeticop{} : 2 cycles

Mem-ref : 8 cycles

Reg-ref : 0 cycles

Execution time = ?

Total cycle = 144

$$\Rightarrow 144 \times 0.4 = 57.6 \text{ ns}$$

④ up that supports 4 INTR :  $I_1, I_2, I_3, I_4$

Ph: 844-844-0102

ISR

Service time: 20 ns, 40 ns, 16 ns & 10 ns

Response time of INTR : 5 ns



Priority:  $I_1 > I_2 > I_3 > I_4$

Max & Min time required to service  $I_4$  given nested - INT - Processing.

If only  $I_4$  Interrupt occurs then  
time to resolve  $\approx I_4: 5 + 10$   
Min Time  $= 15 \text{ ns}$

Max time to resolve  $I_4$   
 $= 25 + 45 + 21 + 15 = 106 \text{ ns}$

If high priority interrupts are coming frequently  
then serving of  $I_4$  may lead to starvation.

### Types of Interrupts:

① Hardware: Pins



a) External: I/O devices & Sensors

b) Internal: clock-error, CPU-temp, invalid-opcode,  
div-by-zero, stack-overflow.

## Software Interrupts:

8085      RST0      →  
               RST1  
               RST2  
               !  
               RST7

IUT Ph: 844-844-0102

| INTR | Addr |
|------|------|
| RST0 | 00H  |
| RST1 | 08H  |
| RST2 | 10H  |
| RST3 | 18H  |
| RST4 | 20H  |
| RST5 | 28H  |
| RST6 | 30H  |
| RST7 | 38H  |

Non-Maskable

Maskable

8085

③ Maskable & Non-Maskable INTR  
     ↓  
     Ignore (RST 7.5 6.5 5.5)      (TRAP)      ↓  
     cannot be ignored by the Micro processor

Eg: SIM : Set Instruction Mask  
     RIM : Read Instruction Mask.  
     zero - addr - instruction.



8085

④ Vectored vs Non-vectored Interrupt  
     ↓  
     7.5 0  
     6.5 1  
     5.5 2  
     Addr 0 2080 → ISR  
     1 3080  
     2 4080  
     3 5080  
     INT# IUT

INTERRUPT

8085 : INTR

INTA & wait for ISR-addr



## (5) Level-Triggered

Elec Engg



Ph: 844-844-0102

## (6) Edge-Triggered



## RISC (Reduced Instruction Set Computer) (UCB, ...)

### CISC (Complex Instruction Set Computer)

CISC MUL [2080] [3080]  
 {  
 MUL R0, R1  
 MUL R0, [2080]

- More # of instr → More circuitry
- fewer registers
- variable # cycles/ instr
- Pentium, i3, i5, i7, ... (Win)

RISC  
 1 } LOAD R0, [2080]  
 1 } LOAD R1, [2080]  
 1 } MUL R0, R1  
 1 } STORE [2080], R0

- fewer # of instr → Simple circuitry
- More # of reg → Pipelining
- 1 cycle/instr
- ARM Mobility, ipads, ...  
 CPU  
 Laptops  
 → Pipelining

## Register Organization in RISC:

$$\text{Total # of reg} = g + \frac{w \times l}{2} + w(2 \times c)$$

g: # of Global-reg

l: # of local reg | window

pipelining



Global Register file



reg window

C: Common  
L: Local



over Lapped

reg windows

Q.1

Consider the following processor design characteristics.

**Ph: 844-844-0102**

- ✓ I. Register-to-Register arithmetic operations only.
- ✓ II. Fixed-length instruction format 1 cycle/instr
- ✓ III. Hardwired Control unit  
L ↴ **DLC**

IV. Which of the characteristics above are used in the design of a **RISC** processor?

- (A) I and II only
- (B) II and III only
- (C) I and III only
- ✓ (D) I, II and III

### Components of a Computer:



$$\textcircled{1} \uparrow \text{performance} \propto \frac{1}{\text{execution time}} \downarrow$$

$$\textcircled{2} \downarrow \text{execution time} \propto \text{access-time} \downarrow \quad \text{LDA @ 2080} \quad \left. \begin{array}{l} \text{MOV } r0, r1 \\ \vdots \end{array} \right]$$

$$\text{Performance} \propto \frac{1}{\text{access-time}}$$

Memory-addr-reg : Stores the address temporarily  
(MAR)

Memory Buffer data Reg  
(MBR)

↓  
Stores the data temporarily.



### Micro-operation (uop)

Eg:  $MOV R0, R1 \rightarrow$  single uop

→ cycles for uop → 1 clock cycle

→ Memory uop (multiple-cycles)

→ Control Signals : seq of CS

→ Control Signals :



### Micro-program (u-program)

sequence of u-operations



① Fetch-Cycle:



② Execution-Cycle: Instruction-Decode



③ Execution-Cycle: Operand Fetch (OF)

④ Direct Mode:

e.g.) LDA [2080]

T<sub>1</sub>: IR[addr]  $\xrightarrow{LB}$  MAR

T<sub>2</sub>: M[ MAR]  $\xrightarrow{SB}$  MBR  
PC + I  $\xrightarrow{LB}$  PC

T<sub>3</sub>: MBR  $\xrightarrow{LB}$  ACC



T1:  $IR[addr] \xrightarrow{LB} MAR$

T2:  $M[MAR] \xrightarrow{SB} MBR [2080]$   
 $PC + I \xrightarrow{LB} PC$

\* T3:  $MBR \xrightarrow{LB} MAR$

T4:  $M[MAR] \xrightarrow{SB} MBR$  operand  
 $MBR \xrightarrow{LB} ACC$

④ Execution-stage: Write-back

Eg: STA [2000] direct-Addressing

T1:  $ACC \xrightarrow{LB} MBR$

T2:  $IR[addr] \xrightarrow{LB} MAR$

T3:  $MBR \xrightarrow{SB} M[MAR]$

$PC + I \xrightarrow{LB} PC$



⑤ INTR-Cycle:

T1:  $PC \rightarrow MBR$

T2:  $SP \rightarrow MAR$

T3:  $MBR \rightarrow M[MAR]$

T4:  $ISP\text{-addr} \rightarrow PC$



$ISP \rightarrow PC$   
 $addr$

ADD r0 r1



⑤ INTR-Cycle:

✓ T<sub>1</sub>: PC  $\xleftarrow{LB}$  MBR

✓ T<sub>2</sub>: SP  $\xrightarrow{LB}$  MAR

T<sub>3</sub>: [MBR  $\rightarrow$  M[MAR]]

✓ T<sub>4</sub>: ISR-addr  $\rightarrow$  PC  $\rightarrow$  IRT  
 ||  
 MBR  $\rightarrow$  PC



Q which of the following operations is satisfied by the u-prog

T<sub>1</sub>: PC  $\rightarrow$  MBR

T<sub>2</sub>: X  $\rightarrow$  MAR

T<sub>3</sub>: MBR  $\rightarrow$  memory

T<sub>4</sub>: Y  $\rightarrow$  PC

a) IF      b) OF      c) conditional jump

pc  $\rightarrow$  Memory (stack)

INTR subprogram initialization.

[CALL...ret]

control unit: (cu)  
 Hardwired  
 Micro programmed  
 → control-signal-sequence  
 → # Control signals  
 # instructions

inst → mop

mop → control Signals



data path  
 (eg) PC  $\xrightarrow{LB}$  MAR (MOP)

S0: copy value from PC to LB  
S1: copy from LB to MAR

(eg)

APPLIED  
COURSE

$$CS = \{S_0, S_1, \dots, S_7\}$$

$$\text{Inst-set} = \{ \begin{array}{l} I_1 \\ I_2 \\ I_3 \\ \text{ADD, SUB, MUL} \end{array} \}_{\begin{array}{l} 00 \\ 01 \\ 10 \end{array}}$$

Ph: 844-844-0102

CU

| Mop   | $I_1$               | $I_2$               | $I_3$               |
|-------|---------------------|---------------------|---------------------|
| $T_1$ | $S_0, S_2$          | $S_1, S_7$          | $S_6$               |
| $T_2$ | $S_1, S_4$          | $S_0, S_5$          | $S_1, S_2$          |
| $T_3$ | $S_0, S_6$<br>$S_3$ | $S_1, S_2$<br>$S_3$ | $S_6, S_7$<br>$S_3$ |

Control Signal - DB

$I_1 \rightarrow T_1$ :

$T_2$ :

$T_3$ :

Designing CU

### Hardwired Approach:

→ DLC using sum-of-products expressions

→ fast & real-time-applications

→ Inflexible

→ RISC

| Mop   | $I_1$           | $I_2$           | $I_3$           |
|-------|-----------------|-----------------|-----------------|
| $T_1$ | $S_0, S_2, S_3$ | $S_1, S_0, S_3$ | $S_2, S_3, S_1$ |
| $T_2$ | $S_1, S_0, S_2$ | $S_2, S_1, S_3$ | $S_2, S_1$      |
| $T_3$ | $S_0, S_2, S_3$ | $S_2, S_1$      | $S_0, S_2, S_3$ |

$$S_0 = T_1(I_1 + I_2) + T_2(I_1) + T_3(I_1 + I_3)$$

SOP

$$S_1 = T_1 I_1 + T_1 I_2 + T_2 I_1 + T_3 I_3$$

SOP

$$S_2 = T_1 I_1 + T_2 I_1 + T_3 I_3$$

SOP

$$S_3 = T_1 I_1 + T_2 I_1 + T_3 I_3$$

SOP



# Micro-programmed Control Unit:

MOP  
 $T_1: S_1, S_2$   
 $T_2: \rightarrow$   
 $T_3: \vdots$



## Control word / μ-instruction format:

| (U, Z, N)<br>Branch<br>condition | carry, sign, Z<br>Flags | Control<br>field<br>S <sub>0</sub> S <sub>1</sub> ... S <sub>n-1</sub> | next-Control<br>mem-addr | Control<br>ROM |
|----------------------------------|-------------------------|------------------------------------------------------------------------|--------------------------|----------------|
| 0000 U                           | 101                     | ↓ ↓ ↓<br>1 0 1 ... 1                                                   | 0006                     |                |
| 0001 N                           | 010                     | 01 00 ... 1                                                            | 0002                     |                |
| 0002 Z                           | 101                     | 1011 ... 1                                                             | 0010                     |                |

Control-word → Instruction

## Horizontal-Minstauction:





$$K = \log_2 m$$

$n = \# \text{Control Signals}$

$$n = 1024 = 2^{10}$$

$$m = 100 < 2^7$$

Horizontal

vertical

$\rightarrow$  longer cw  $\Rightarrow$  longer CROM

shorter cw = small CROM

$\rightarrow$  No decoder  $\Rightarrow$  faster

decoder  $\Rightarrow$  slower

Speed : Hardwired  $>$  Horizontal  $>$  Vertical

Flexibility : Hardwired  $<$  Horizontal  $<$  Vertical

C-ROM

CROM + decoder

programmable ROM



PROM

PROM + decoder

Eg.:

cu with 32K control-word ROM  
 64 control-signals

8 flags



address = 15 bits

Horizontal cw: BC, flags, controlfield, next-addr in ROM

x 8 64

15

82 bits

Virtual cw: x 3 6

15

24 bits

Q1 6-groups of mutually exclusive CS.



② # bits in control field (Horizontal) CF

$$1+8+15+23+14+30 = 91 \text{ bits.}$$

③ vertical



$$\text{④ bits saved } 91 - 22 = 69$$



⑤ Two groups of CS: G1 & G2



$$\# \text{ bits in Control Field} = 8 + 6 = 14$$

⑥ 256 instructions  $\Rightarrow 256 \times 16 \Rightarrow 2^8 \times 2^4 = 2^{12}$

T<sub>1</sub>, T<sub>2</sub> ... T<sub>16</sub> 16 cycles/instaution

Horizontal - w

32  
12  
240  
284

240-CS

16-flags

16-BC

Size of CAR & CDR?

12 bits 284 bits.



= 284

default: vertical

256-CS  
Each Mop has 2 CS-active  
CF  
8 bits # bits needed for CF: ?  
16 bit



## Performance Evaluation: Pipelining



Speed-up-factor (S):

$$S = \frac{P_x}{P_y} = \frac{1/ET_x}{1/ET_y} = \frac{ET_y}{ET_x} = \frac{6}{2} = 3$$

How fast is X as compared to Y?  
 ↓      ↓  
 2ns    6ns

Arndhals Law:

Common-case-fast

$$E(ET_{old}) = 0.1f_2 + 0.9f_2 \\ = 2ns$$



$$E(ET_{\text{new}}) = 0.1 \times 2 + 0.9 \times 1 \\ = 1.1 \text{ ns}$$



$$S_{\text{overall}} = \frac{2}{1.1}$$

Generalization:

$$E(ET_{\text{old}}) = \sum_{i=1}^n x_i F_i$$

$$E(ET_{\text{new}}) = \sum_{i=1}^n F_i (x_i | s_i)$$

$$\boxed{S_{\text{overall}} = \frac{ET_{\text{old}}}{ET_{\text{new}}}}$$



Special Case:

$$S_{\text{overall}} = \frac{E(ET_{\text{old}})}{E(ET_{\text{new}})} = \frac{1}{(1-F) + F/S}$$

$$= [(1-F) + F/S]$$

$$(1-F) * 1 \\ + \frac{1}{S} * F$$



{ Assumption:  
Each instruction takes same time}

Integer and floating point instructions.

FP-unit is enhanced & it runs 2x faster

10% of instructions in program are FP-operations

$$\text{performance-gain} = S_{\text{overall}} = ?$$

$$ET_{\text{old}} = 1.05$$

$$ET_{\text{new}} = 0.9 + \frac{0.1}{2}$$

$$S_{\text{overall}} = \frac{1}{0.9 + 0.05}$$

$$= \boxed{1.05}$$



Q Cache is 16X faster than main-memory

CPU-refers to cache 70% of time

$$\text{performance-gain} = S_{\text{overall}} = ?$$

$$= \frac{1}{(0.3 \times 1 + 0.7 \times \frac{1}{16})}$$



$$= 2.9$$

3-enhancements:

$$\text{Overall} = 10$$

Find  $x\%$ 

$$E_{old} = 1.15$$

$$E_{new} = 0.25 \times \frac{1}{30} +$$

$$\frac{0.25}{20} + \frac{x}{15} + (0.5 - x)$$



$$\text{Overall} = 10$$

$$\Rightarrow \frac{0.25}{30} + \frac{0.25}{20} + \frac{x}{15} + (0.5 - x) = 10$$

$$\Rightarrow$$

$$x = 0.45$$

### Calculation of CPU-time:

 $\rightarrow$  process-clock

$$\rightarrow \text{cycle-time } (t) = \frac{1}{f}$$

 $\rightarrow$  Constant

$$\text{frequency } (f) = \frac{1}{t}$$



3.2 GHz

1 cycle  $\rightarrow t$  sec $\frac{1}{f}$  Hz  $\leftarrow$  1 Sec

Ic: Instruction-Count

cpi: cycles-per-instruction

$$\text{cpu-time: } Ic \times CPI \times t$$

$$\# \text{instr} \times \frac{\text{cycles}}{\text{instr}} \times \frac{\text{Time}}{\text{cycles}}$$

(72)

Different Instruction types: data-transfer, data-manipulation,

Transfer-of-control

**Ph: 844-844-0102**

$$\text{cpu-time} = \sum_{i=1}^K I_{C_i} * CPI_i * t$$

(Q)  $f = 3.2 \text{ GHz}$     $\mu\text{p}$     $t = \frac{1}{3.2 \times 10^9} = 0.31 \text{ ns}$

| Instr-Type | I C  | CPI |
|------------|------|-----|
| Load       | 200  | 12  |
| Store      | 300  | 10  |
| ARITH.     | 200  | 8   |
| Logical    | 100  | 6   |
| shift      | 150  | 4   |
| Branch     | 50   | 2   |
|            | 1000 |     |

(a) Aug-instr-EI  $= 8.3 \times 0.31 \text{ ns} = 2.57 \text{ ns}$

avg       $0.2 \times 12 + 0.3 \times 10 + 0.2 \times 8 + 0.1 \times 6 + 0.15 \times 4 + 0.05 \times 2$   
 cycles per instruction       $= 8.3 \text{ cycle/instr}$

(b) MIPS-rate: avg 1 instr  $\rightarrow 2.57 \text{ ns}$

← 15

MIPS  
millions of       $0.3861 \times 10^9 \leftarrow \frac{1}{2.57 \times 10^9}$

Instructions per  
Second       $\Rightarrow 386.1 \times 10^6$

$\Rightarrow 386 \text{ MIPS}$

(c) program ET

$$\Rightarrow 1000 \times 2.57 \text{ ns}$$

$$\Rightarrow 2.57 \times 10^6 \text{ s}$$

@

CPU-1

$$t = 4 \text{ ns} \text{ (Cycle Time)}$$

Load - 8 cycles  
STORE - 8      } -40%

ALU - 6      } → 40%

Branch - 4      } → 20%

$$\begin{aligned} \text{avg. instr ET old} &= (0.4 \times 8 + \\ &\quad \text{Speed up} \geq ? \quad 0.4 \times 6 + \\ &\quad 0.2 \times 4) \times 4 \\ &= \underline{\underline{25.6 \text{ ns}}} \end{aligned}$$

$$\frac{25.6}{4.8} = \boxed{5.33}$$

CPU-2

$$t = 4.8 \text{ ns}$$

$$CPI = 1$$

$$\begin{aligned} \text{avg. instr ET new} &= [0.4 \times 1 + \\ &\quad 0.4 \times 1 + \\ &\quad 0.2 \times 1] \times 4.8 \end{aligned}$$

$$= \underline{\underline{48 \text{ ns}}}$$

Flynn's CPU classification :- performance → ET  
→ Throughput      RISC VS CISC

Concurrency: 2 or more events @ same-time.

Event: program / sub-program / instruction | stage of an instruction.

SISD, SIMD, MISD, MIMD

Single Instruction stream & Single data Stream.

PH.644-044-0102



ALU: PU

MU: Reg |

Cache |

Main-mem

$$x = x + 5;$$

Uni-processor System



{ CU: Control unit  
PU: Processor unit  
MU: Memory unit

②

SIMD: Single Instruction multiple data.



Eg: Add 1 to all elements of an array



→ Vector-processing  
→ GPU (Graphical processing unit)  
AI & ML (DL)

Ardono 90's  
Nvidia  
AMD

→ Scientific-Applications



Not very popular.

uni processor  
SISD  
SIMD  
MISD  
MIMD  
↓  
Multi processor



$x$ : Variable

$$y = x^2$$

$$z = \sqrt{x}$$

$$k = x^{1/2}$$

$$m = x/\sqrt{2}$$

④ MIMD: Multi-processor System



p7, p51  
QualComm  
Snapdragon

Pipelining:

A Concurrency Approach → 2 or more Events @ same time

$I_1$ : IF<sub>1</sub> S<sub>1</sub> [2080] ID<sub>1</sub> S<sub>2</sub> [2082] OF<sub>1</sub> S<sub>3</sub> ALU<sub>1</sub> S<sub>4</sub> WB<sub>1</sub> S<sub>5</sub>

$I_2$ : IF<sub>2</sub> [2082] ID<sub>2</sub> OF<sub>2</sub> ALU<sub>2</sub> WB<sub>2</sub>



### Pipelined - Overlapped Execution:



$T_i : Task_i / instruction$

Pipeline-depth = 3 = # Stages.

$$T_p = 3 * t_p \Rightarrow K t_p$$

### Non-overlapped Execution:



Non-overlapped execution (g) non-pipelined

## Performance of a pipeline:

$$\# \text{ segments} = K$$

$$\text{clock-cycle-time} = t_p, t_n$$

$$\# \text{ Tasks} \underset{\text{or}}{=} \# \text{ instructions} = n$$

→ Each Segment takes one-cycle

$$ET_{\text{pipe}} = \underline{(K + (n-1))t_p}$$

$$ET_{\text{non-pipe}} = \underline{n(t_n + K)}$$

$$S = \frac{n \cdot t_n \cdot K}{(K + (n-1))t_p}$$

$$S = \boxed{\frac{n \cdot T_n}{T_p + (n-1)t_p}}$$

$$\text{Case: } n \rightarrow \infty, S = \frac{n T_n}{n t_p - t_p + K t_p}$$

$$= \frac{n T_n}{n t_p + (K-1) t_p}$$

K-Small  $\leq n$  is very Large

$$\Rightarrow \boxed{S = T_n / t_p = \frac{K t_n}{t_p}} \approx K, \text{ if } (t_n \approx t_p)$$

## Throughput of a pipeline

$$= \frac{\# \text{ tasks - completed}}{\text{time - taken}}$$

$$n \text{ tasks} \rightarrow (K + n-1) t_p \text{ sec.}$$

$$\boxed{\frac{n}{(K+n-1)t_p} \leftarrow 1 \text{ sec}}$$

$$\text{Efficiency of a pipeline: } \frac{S}{K} = \boxed{n}$$

Ph: 844-844-0102

efficiency # stages = ?

$$\eta_b = 70\%$$

$$= \frac{S}{K}$$

$$= \frac{12}{K}$$

$$= 0.7K = 12$$

$$\Rightarrow K = \frac{12}{0.7} = 17.1 \Rightarrow \boxed{18}$$

### Types of Pipelines:

#### ① Instruction / Linear Pipeline



forward-direction

#### ② Non-Linear pipeline

Tasks: Segments/ stages



#### ③ Synchronous:

pipe-line operation is controlled by the clock-Signal

↳ All linear pipelines are Synchronous

#### ④ Asynchronous Pipeline:

operations are controlled based on Hand shaking signals,

↳ All Non-linear pipelines are asynchronous.

⑤ Uniform-delay | balanced:



Pb: 044-844-0102

$$t_p = \frac{t_{IS}}{(\text{stage-delay} + \text{buffer-delay} + \text{overhead})}$$

⑥ Non-uniform delay | Imbalanced:



$$t_p = \max(SD_i + BD_i + O_i)$$

clock  
time



⑦

Pipeline-A

$$K_1 = 8$$

$$\text{uniform-delay} = 2\text{ns}$$

Time saved when 100-tasks are

instead of B.

$$\text{ETA} = (8+99)2 = 214\text{ns}$$

$$\approx (K+n-1)t_p$$

Pipeline-B

$$K_2 = 6$$

Stage-wise delays: 2, 6, 4, 1, 3, 2ns  
pipelined through A

$$(6+99)*6 = 630\text{ns}$$

$$\Rightarrow 630 - 214 = \boxed{416\text{ns}}$$

(Q2) 4-Segment pipeline with delays: 20 ns, 30 ns, 40 ns & 15 ns  
**Ph: 844-844-0102**

speedup & n<sub>l</sub> cohen Large-number of tasks are Executed.

$$S = \frac{T_n}{t_p} = \frac{\text{Total time for non-pipelined execution}}{\text{clocktime for pipelined Execution}}$$

$$= \frac{105}{40} = 2.625$$

$$\eta_l = \frac{S}{K} = \frac{2.625}{4} * 100 \\ = 65.6\%$$

(Q3) 4-stage pipeline with stage-delays : 30ns, 40ns, 20ns &  
 between the stages has a delay  
10 ns - Interface-register  
Pipelined of 5ns. performance-gain of the pipeline?

let n = Large.

$$S = \frac{T_n}{t_p} = \frac{100}{45} \\ = 2.22$$

$$\begin{aligned} \text{MAX}(30, 40, 20, 10) \\ \text{MAX}(35, 45, 25, 15) \end{aligned}$$

$$\eta_l = \frac{S}{K} = \frac{2.22}{K=4} * 100 \\ = 55.5\%$$

**(Q4)** 4-Segment pipeline with stage-delays: 2<sup>nd</sup>, 8<sup>th</sup>, 5<sup>th</sup>, 10 ns. An Alternative pipeline contains same stages with same delays but stage 2 is divided into 2 sub-stages of equal-delays & stage 4 is divided into 2 sub-stages of same delay. Speedup of alternative pipeline over the older one.

let  $B = 1 \text{ ns}$

Sol:



**(Q5)** { 4-stage pipeline where different instructions spent different amount of time @ each-stage }

| real/world       | S <sub>1</sub> | S <sub>2</sub> | S <sub>3</sub> | S <sub>4</sub> |
|------------------|----------------|----------------|----------------|----------------|
| 7 I <sub>1</sub> | 1              | 3              | 1              | 2              |
| 6 I <sub>2</sub> | 1              | 1              | 3              | 1              |
| 7 I <sub>3</sub> | 1              | 2              | 1              | 3              |
| 4 I <sub>4</sub> | 1              | 1              | 1              | 1              |
| 24               |                |                |                |                |

① # cycles to complete the above program.

13 Cycles



(2)

$$\text{for } i=1 \text{ to } 1000 \text{ do }$$

{

I<sub>1</sub>;I<sub>2</sub>;I<sub>3</sub>;I<sub>4</sub>;

}

I<sub>4</sub>-Completed

Loop-level  
parallelism

|                |                | I <sub>1</sub> | I <sub>2</sub> | I <sub>3</sub> | I <sub>4</sub> | I <sub>1</sub> | I <sub>2</sub> | I <sub>3</sub> | I <sub>4</sub> | I <sub>1</sub> | I <sub>2</sub> | I <sub>3</sub> | I <sub>4</sub> | I <sub>1</sub> | I <sub>2</sub> | I <sub>3</sub> | I <sub>4</sub> | I <sub>1</sub> | I <sub>2</sub> | I <sub>3</sub> | I <sub>4</sub> | I <sub>1</sub> | I <sub>2</sub> | I <sub>3</sub> | I <sub>4</sub> | I <sub>1</sub> |  |
|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|--|
| S <sub>4</sub> |                |                |                |                |                |                |                |                |                |                |                |                |                |                |                |                |                |                |                |                |                |                |                |                |                |                |  |
| S <sub>3</sub> |                | I <sub>4</sub> | -              | -              | -              | I <sub>1</sub> |                |                |                |                |                |                |                |                |                |                |                |                |                |                |                |                |                |                |                |                |  |
| S <sub>2</sub> | I <sub>4</sub> | I <sub>1</sub> | I <sub>1</sub> | I <sub>1</sub> |                |                |                |                |                |                |                |                |                |                |                |                |                |                |                |                |                |                |                |                |                |                |  |
| S <sub>1</sub> | I <sub>1</sub> |                |                |                |                |                |                |                |                |                |                |                |                |                |                |                |                |                |                |                |                |                |                |                |                |                |  |
|                | 9              | 10             | 11             | 12             | 13             | 14             | 15             | 16             | 17             | 18             | 19             | 20             |                |                |                |                |                |                |                |                |                |                |                |                |                |                |  |

# Cycles-overlapped between 2-iterations = 13 - 9 + 1 = 5

i=1 I<sub>4</sub>i=2 I<sub>1</sub>

RISC Pipelining: 5-Stage Instruction-pipeline

Recap: 3 types of instruction

smaller Instruction set



(2)

Data-Manipulation:Add  $r_0, r_1, r_2$ 

Ph: 844-844-0102

$$r_0 \leftarrow \overline{\text{ALU}}(r_1 + r_2)$$

all operands in registers  
(RISC)

(3)

Transfer of Control:

Unconditional

Jump 2080

$$\boxed{PC \leftarrow 2080}$$

Conditional

DJNZ  $r_0, 2080$ 

$$\rightarrow \text{Dec } r_0$$

$$\text{Compare } r_0 == 0$$

T

F

$$\boxed{PC \leftarrow 2080}$$

complex conditional  
Transfer  
of Control

JZ  
JNZ

5 Stages  $\rightarrow$  IF, ID, ALU-Execute,



② STORE  $r_0, r_1$

$M[3+r_0] \leftarrow r_1$



③ ADD  $r_0, r_1, r_2$



(4) Jump 2080

① IF       $IR \leftarrow \text{instruction}$   
               $PC \leftarrow PC + \text{step-size}$

Diagram illustrating a 1D instruction format:

- opcode**: An 8-bit field.
- PC ← 2080**: An 8-bit field.

③  $\text{Exe} - X$

⑨ Mem-ACC - X

(5) WB ← PC ← 2080<sup>WF</sup>

⑤ DANZ 80, 2080

① IF       $IR \leftarrow \text{instruction}$   
               $PC \leftarrow PC + \text{step size}$

② FD ALU-operation on  $r_0$

③ Exc  $r_0 \leftarrow r_0 - 1$   
if  $r_0 \neq 0$   
 $ACC = 1$

## ④ Mem-Acc

⑤   $PC \leftarrow 2080$  if  $ACC = 1$   
- if  $ACC = 0$



Pipeline Dependencies:

→ Major problems/issues

→ Structural

→ Data

→ Control

Phy 844 844-0102

Structural Dependency

→ Resource Conflict

→ Waiting / Stall

|                | C <sub>1</sub> | C <sub>2</sub> | C <sub>3</sub> | C <sub>4</sub> | C <sub>5</sub> |
|----------------|----------------|----------------|----------------|----------------|----------------|
| I <sub>1</sub> | IF             | ID             | ALU            | WB             |                |
| I <sub>2</sub> |                | IF             | 1D             | ALU            |                |
| I <sub>3</sub> |                |                | IF             | 1D             |                |
| I <sub>4</sub> |                |                |                | IF             |                |

↓  
wait/stall



Eg 2:

|                | C <sub>1</sub> | C <sub>2</sub> | C <sub>3</sub> | C <sub>4</sub> | C <sub>5</sub>  | C <sub>6</sub>  | C <sub>7</sub> |
|----------------|----------------|----------------|----------------|----------------|-----------------|-----------------|----------------|
| I <sub>1</sub> | Mem            | 1D             | ALU            | Mem            | WB <sub>R</sub> |                 |                |
| I <sub>2</sub> | Mem            | 1D             | ALU            | Mem            | WB <sub>R</sub> |                 |                |
| I <sub>3</sub> |                | Mem            | 1D             | ALU            | Mem             | WB <sub>R</sub> |                |
| I <sub>4</sub> |                |                | X              | X              | X               | Mem             |                |

3-Cycle-Stall

5-Stage RISC

WB: reg-write

Solution to structural-dep: Renaming

Ph: 844-844-0102

→ Code-Memory (CM)

→ Data-Memory (DM)

2-independent modules



|                | C <sub>1</sub> | C <sub>2</sub> | C <sub>3</sub> | C <sub>4</sub> | C <sub>5</sub> | C <sub>6</sub> | C <sub>7</sub> | C <sub>8</sub> |
|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|
| I <sub>1</sub> | CM             | ID             | ALU            | DM             | WB             |                |                |                |
| I <sub>2</sub> |                | CM             | ID             | ALU            | DM             | WB             |                |                |
| I <sub>3</sub> |                |                | CM             | ID             | ALU            | DM             | WB             |                |
| I <sub>4</sub> |                |                |                | CM             | ID             | ALU            | DM             | WB             |

zero-stalls

## ② Data Dependency:



|                | C <sub>1</sub> | C <sub>2</sub> | C <sub>3</sub> | C <sub>4</sub> |
|----------------|----------------|----------------|----------------|----------------|
| I <sub>1</sub> | IF             | ID             | EXC            | WB             |
| I <sub>2</sub> |                | IF             | ID             |                |

src<sub>1</sub>: r<sub>0</sub>  
src<sub>2</sub>: r<sub>4</sub>  
dest: r<sub>3</sub>

## ID-Stage:

|                | OP  | DEST           | Indep src <sub>1</sub> | Indep src <sub>2</sub> | dep src <sub>1</sub>             | dep src <sub>2</sub> |
|----------------|-----|----------------|------------------------|------------------------|----------------------------------|----------------------|
| I <sub>1</sub> | ADD | r <sub>0</sub> | r <sub>1</sub>         | r <sub>2</sub>         | —                                | —                    |
| I <sub>2</sub> | SUB | r <sub>3</sub> | —                      | r <sub>4</sub>         | r <sub>0</sub> (I <sub>1</sub> ) | —                    |

TOMASULO  
-Algorithm

|                | C <sub>1</sub> | C <sub>2</sub> | C <sub>3</sub> | C <sub>4</sub> | C <sub>5</sub>         |
|----------------|----------------|----------------|----------------|----------------|------------------------|
| I <sub>1</sub> | IF             | ID             | EXC            | WB             |                        |
| I <sub>2</sub> |                | IF             | —              | —              | ID → r <sub>0new</sub> |

Q-Stalls

Eg: 5-stage RISC



|                | C <sub>1</sub> | C <sub>2</sub> | C <sub>3</sub> | C <sub>4</sub> | C <sub>5</sub> | C <sub>6</sub> | C <sub>7</sub> |
|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|
| I <sub>1</sub> | IF             | ID             | EX             | MA             | WB             |                |                |
| I <sub>2</sub> |                | IF             | ID             | -              | -              | EX             | - - -          |

2-stalls

Bypass / Latch

Solution:Operand Forwarding

↑ default in RISC

|                | C <sub>1</sub> | C <sub>2</sub> | C <sub>3</sub> | C <sub>4</sub> | C <sub>5</sub> | C <sub>6</sub> |
|----------------|----------------|----------------|----------------|----------------|----------------|----------------|
| I <sub>1</sub> | IF             | ID             | EX             | MA             | WB             |                |
| I <sub>2</sub> |                |                | ID             | EX             | MA             | WB             |

$r_1 + r_2$

$r_0 = r_1 + r_2$

No stalls

Eq:I<sub>1</sub>: Load  $r_0, 3(r_1)$ I<sub>2</sub>: Add  $r_3, r_0, r_2$ I<sub>3</sub>: Load  $r_4, 4(r_2)$ I<sub>4</sub>: Sub  $r_3, r_5, r_4$ 

→ Load &amp; Store Spend 3-Cycles in MA-stage.

|                | 1  | 2  | 3  | 4  | 5  | 6  | 7  | 8  | 9  | 10 | 11 | 12 | 13 | 14 | 15 |
|----------------|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|
| I <sub>1</sub> | IF | ID | EX | MA | MA | MA | WB |    |    |    |    |    |    |    |    |
| I <sub>2</sub> |    | IF | ID | -  | -  | -  | EX | MA | WB |    |    |    |    |    |    |
| I <sub>3</sub> |    |    | IF | -  | -  | -  | ID | EX | MA | HA | HA | MA | WB |    |    |
| I <sub>4</sub> |    |    |    |    |    |    | IF | ID | -  | -  | EX | MA | NB |    |    |

|                | IF | ID | EX | MA  | WB |  | # Cycles to Complete           |
|----------------|----|----|----|-----|----|--|--------------------------------|
| I <sub>1</sub> | 1  | 1  | 1  | (3) | 1  |  | I <sub>1</sub> -I <sub>4</sub> |
| I <sub>2</sub> | 1  | 1  | 1  | 1   | 1  |  |                                |
| I <sub>3</sub> |    |    |    |     |    |  |                                |
| I <sub>4</sub> | 1  | 1  | 1  | 1   | 1  |  |                                |

14

$I_1: \text{Add } r_0, r_1, r_2$   
 $I_2: \text{MUL } r_3, r_0, r_4$   
 $I_3: \text{DIV } r_5, r_6, r_7$   
 $I_4: \text{SUB } r_8, r_5, r_7$

→ EX takes 3 cycles for MUL & 6 cycles for DIV

→ RISC → operand Forwarding

MAK

|       | C1 | C2 | C3 | C4 | C5 | C6 | C7 | C8 | C9 | C10 | C11 | C12 | C13 | C14 | C15 |
|-------|----|----|----|----|----|----|----|----|----|-----|-----|-----|-----|-----|-----|
| $I_1$ | IF | ID | EX | WB |    |    |    |    |    |     |     |     |     |     |     |
| $I_2$ | IF | ID | EX | EX | EX | WB | WB |    |    |     |     |     |     |     |     |
| $I_3$ |    | IF | ID | -  | -  | EX | EX | EX | EX | WB  | -   |     |     |     |     |
| $I_4$ |    |    | IF | ID | -  | -  | -  | -  | EX | WB  | -   | EX  | WB  |     |     |

ALU

# clock cycles to complete  $I_1 - I_4$ : 14

### ③ Control - Dependency:

✓ 1000:  $I_1$

✓ 1001:  $I_2$

✓ 1002:  $I_3$  (Jump 2080)

✗ 1003:  $I_4$

:

✓ 2080: BI<sub>1</sub>

:

✓ 2081: BI<sub>2</sub>

:

RISC

Jump 2080 { IF : IR ← instruction  
PC ← PC + SS }

{ ID : PC ← 2080 }

|                 | C1 | C2 | C3 | C4 | C5 | C6 | C7  |
|-----------------|----|----|----|----|----|----|-----|
| $I_1$           | IF | ID | EX | MA | WB |    |     |
| $I_2$           |    | IF | ID | EX | MA | WB |     |
| $I_3$           |    |    | IF | 1D |    |    |     |
| $I_4$           |    |    |    | IF | ID | EX | X   |
| BI <sub>1</sub> |    |    |    |    | IF | ID | EX  |
| BI <sub>2</sub> |    |    |    |    |    | IF | ID  |
|                 |    |    |    |    |    |    | ... |

unwanted instruction  
↓  
FLUSH DELAY

pc ← pc + SS

→ Branch prediction:  
→ { additional HW }  
No stalls



→ jump @ 2080

↳ Not the target-addr ⇒ Branch prediction is not possible.

Branch-penalty = # stalls due to branch operation.

Software - Solutions: Delayed-Branch (Compiler)

→ Rearrange (1) (a) add (2) NOP instruction

↓  
User-program:

1000: I<sub>1</sub>

→ 1001: I<sub>2</sub>

→ 1002: I<sub>3</sub> (jump 2000)

1003: I<sub>4</sub>

:

2000 : BI<sub>1</sub>

2001 : BI<sub>2</sub>

|                 | C <sub>1</sub> | C <sub>2</sub> | C <sub>3</sub> | C <sub>4</sub> | C <sub>5</sub> | C <sub>6</sub> | C <sub>7</sub> |
|-----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|
| I <sub>1</sub>  | IF             | ID             | EX             | MA             | WB             |                | PC → 2080      |
| I <sub>3</sub>  |                | IF             | ID             |                |                |                |                |
| I <sub>2</sub>  |                |                | IF             | ID             | EX             | MA             | WB             |
| BI <sub>1</sub> |                |                |                | IF             | ID             | EX             | MA             |
| BI <sub>2</sub> |                |                |                |                | IF             | ID             | EX...          |

I<sub>2</sub> & I<sub>3</sub> are independent

→ NOP-instruction

User program

1000: I<sub>1</sub>

1001: I<sub>2</sub>

1002: I<sub>3</sub> (jump 2000)

1003: NOP

X 1004: I<sub>4</sub>

⋮

2000: BI<sub>1</sub>

2001: BI<sub>2</sub>

IF ID

IF ID

IF ...

Gat@appliedcourse.com

## Instruction Scheduling (to avoid stalls)

Ph: 844-844-0102

→ DBMS

$I_1: \text{Add } r_0, r_1, r_2$

$I_2: \text{Sub } r_3, r_0, r_4$

$I_3: \text{MUL } r_4, r_5, r_6$

$I_4: \text{DIV } r_3, r_7, r_8$

out-of-order:

data dep



order:  $I_1 \rightarrow I_2 \rightarrow I_3 \rightarrow I_4$

(Stalls)

Error! Anti-data dependency:

$I_1: \text{Add } r_0, r_1, r_2$

$I_2: \text{Sub } r_3, r_0, r_4$

$I_3: \text{MUL } r_4, r_5, r_6$

$I_4: \text{DIV } r_3, r_7, r_8$

$r_4 = 10$

$r_5 = 10$

$r_6 = 10$



$I_3 - I_2$

$r_4$

write-before-read

Order:  $I_1 \rightarrow I_2 \rightarrow I_3 \rightarrow I_4$

$r_4 = 10$

Error! OLP Dependency:

$I_1: \text{Add } r_0, r_1, r_2$

$I_2: \text{Sub } r_3, r_0, r_4$

$I_3: \text{MUL } r_4, r_5, r_6$

$I_4: \text{DIV } r_3, r_7, r_8$

out-of-order:



$I_4 - I_2$

$r_3 = r_2$

write-before-write

Order:

$I_1 \rightarrow I_2 \rightarrow I_3 \rightarrow I_4$





Re-order Buffers to  $r_4$  &  $r_3$  from  $I_3$  &  $I_4$  resp

### More Solved problems -1

- ① Consider the sequence of machine instructions

$\checkmark$  MUL R5, R0, R1      3  
 $\checkmark$  DIV R6, R2, R3      5  
 $\checkmark$  ADD R7, R5, R6      1  
 $\checkmark$  SUB R8, R7, R4      1

In the above Sequence R0 to R8 are general purpose Registers. In the instructions shown, the first register stores the result of the operation performed on the second and third registers. This sequence of instruction is to be executed in a pipelined instruction processor.

with the following 4 stages: (1) Instruction fetch and decode(IF) (2) operand Fetch(OF) (3) perform operation(po)

and (4) write back the result (WB). The IF, OF and WB

stage taking 1 clock cycle each for any instruction.

The po stage takes 1 clock cycle for ADD or SUB

instruction, 3 clock cycles for MUL instruction and 5 clock cycles for DIV instruction. The pipelined processor uses "operand forwarding" from the PO stage to the OF stage. The number of clock cycles taken for the execution of the above sequence of instructions is \_\_\_\_\_

- (A) 11
- (B) 12
- (C) 13
- (D) 14

|                | 1  | 2  | 3  | 4  | 5  | 6  | 7  | 8  | 9  | 10 | 11 | 12 | 13 | 14    |
|----------------|----|----|----|----|----|----|----|----|----|----|----|----|----|-------|
| I <sub>1</sub> | IF | OF | PO | PO | PO | WB |    |    |    | PO | PO | PO | PO | WB    |
| I <sub>2</sub> |    | IF | OF |    |    |    | PO | PO | PO | PO | WB |    | PO | WB    |
| I <sub>3</sub> |    |    | IF | -  | -  | OF | -  | -  | -  | -  | -  | -  | PO | WB    |
| I <sub>4</sub> |    |    |    | IF | -  | -  | OF | -  | -  | -  | -  | -  | -  | PO WB |

- ② Suppose the functions F and G can be computed in 5 and 3 nano seconds by functional units UF and UG, respectively. Given two instances of UF and two instances of UG, it is required to implement the computation  $F(G(x_i))$  for  $1 \leq i \leq 10$ . Ignoring all other delays, the minimum time required to complete this computation is \_\_\_\_\_ nano seconds.



Both Even & odd cases run parallelly.

- ③ The instruction pipeline of a RISC processor has the following stages. Instruction Fetch (IF), Instruction Decode (ID), operand Fetch (oF), Perform operation (po) and write back (WB). The IF, ID, oF and WB stages take 1 clock cycle each for every instruction. Consider a sequence of 100 instructions. In the po stage, 40 instructions take 3 clock cycles each, 35 instructions take 2 clock cycles each, and the remaining 25 instructions take 1 clock cycle each. Assume that there are no data hazards and no control hazards. The number of clock cycles

required for Completion of execution of the Sequence of  
 instruction is 219 Ph: 844-844-0102

IF, ID, OF, WB  $\rightarrow$  1 Cycle

$n = 100$  instructions  $\rightarrow 40 \rightarrow 3$  cycles (P0)  
 $35 \rightarrow 2$  cycles (P0)  
 $25 \rightarrow 1$  cycle (P0)

K=5

$$(K+n-1) + (40 \times 2) + (35 \times 1) + (25 \times 0) \\ = 104 + 80 + 35 \\ = 219$$

- (4) Instruction Execution in a processor is divided into 5 stages. Instruction Fetch (IF), Instruction Decode (ID), Operand Fetch (OF), Execute (EX) and Write Back (WB). These stages take 5, 4, 20, 10 and 3 nano seconds respectively. A pipelined implementation of the processor requires buffering between each pair of consecutive stages with a delay of 2ns. Two pipelined implementations of the processor are contemplated.

(i) a Naive pipeline implementation (NP) with 5 stages and

(ii) An Efficient pipeline (EP) where the OF stage is divided into stages OF1 and OF2 with execution time of 12ns and 8ns respectively.

The Speedup (correct to two decimal places) achieved by EP over NP in Executing 20 independent instructions with no hazard is \_\_\_\_\_

**Ph: 844-844-0102**

$$\begin{array}{c}
 \text{NP} \quad K=5 \\
 \hline
 t_p = 22 \text{ ns} \\
 n=20 \\
 K=5 \\
 \Rightarrow (5+20-1)*22 \text{ ns} \\
 \Rightarrow 24*22 \text{ ns} \\
 \Rightarrow \boxed{528 \text{ ns}}
 \end{array}
 \quad
 \begin{array}{c}
 \text{EP} \quad K=6 \\
 \hline
 t_p' = 14 \text{ ns} \\
 n=20 \\
 K=6 \\
 \Rightarrow (20+6-1)*14 \text{ ns} \\
 \Rightarrow 25*14 \\
 \Rightarrow \boxed{350 \text{ ns}}
 \end{array}
 \quad
 \begin{array}{c}
 K+n-1 \\
 \hline
 \boxed{5, 4, 20, 10 \text{ and } 3} \\
 5, 4, 12, 8, 7, 10, 3
 \end{array}$$

$$\Rightarrow \text{Speedup} = \frac{528}{350} = 1.508 = \boxed{1.51}$$

- ⑤ The stage delays in a 4-stage pipeline are 800, 500, 400 and 300 pico seconds. The first stage (with delay 800 pico seconds) is replaced with a functionality equivalent design involving two stages with respective delays 600 and 350 pico seconds. The throughput increase of the pipeline is \_\_\_\_\_ percent.

$$\begin{array}{c}
 4=K \\
 \hline
 800, 500, 400, 300 \\
 \hline
 600, 350, 500, 400, 300
 \end{array}
 \quad
 \begin{array}{l}
 \text{Nano sec} - 10^{-9} \text{ sec} \\
 \text{Pico sec} - 10^{-12} \text{ sec}
 \end{array}$$

$$t_p = 800 \text{ ns}$$

If  $n$  is very Large

$$\frac{n-K+1}{n} \approx 1 \text{ instruction}$$

→ cycle

$$\text{Thr}_1 = \frac{1}{800}$$

$$t_p = 600$$

**Ph: 844-844-0102**

$$\text{Thr}_2 = \frac{1}{600}$$

$$\Rightarrow \frac{\text{Thr}_2 - \text{Thr}_1}{\text{Thr}_1} * 100$$

$$= \boxed{33.33\%}$$

### More solved problems-11

- ① Consider a **3 GHz** processor with a three stage pipeline and stage latencies  $v_1, v_2$  and  $v_3$  such that  $v_1 = 3v_2/4 = 2v_3$ . If the longest pipeline stage is split into two pipeline stages of equal latency, the new frequency is **(4) GHz**, ignoring delays in the pipeline registers.

- (A) 2
- (B) 4**
- (C) 8
- (D) 16

$$\begin{matrix} 6x & 8x & 3x \\ & \swarrow & \searrow \\ 6x & 4x & 4x & 3x \end{matrix}$$

$$t_p = 6x$$

$$f = \frac{1}{6x} = 4 \text{ GHz}$$

$$\left\{ \begin{array}{l} K=3 \\ v_1 = \frac{3}{4}v_2 = 2v_3 \end{array} \right.$$

$$\text{let } v_1 = 6x \text{ ns}$$

$$\left\{ \begin{array}{l} v_2 = \frac{4}{3} \times 6x = 8x \text{ ns} \\ v_3 = 3x \text{ ns} \end{array} \right.$$

$$t_p = 8x \text{ ns}$$

$$f = \frac{1}{8x} = \frac{1}{24} \text{ GHz}$$

$$\frac{1}{2} = 24 \text{ GHz}$$

(2) Consider a non-pipelined processor with a clock rate of  $2.5 \text{ GHz}$  and average cycles per instruction of four. The same processor is upgraded to a pipelined processor with five stages, but due to the internal pipeline delay, the clock speed is reduced to  $2 \text{ GHz}$ . Assume that there are no stalls in the pipeline. The speedup achieved in this pipelined processor is \_\_\_\_\_

(A) 3.2

(B) 3.0

(C) 2.2

(D) 2.0

$$f = 2.5 \text{ GHz} \Rightarrow T = \frac{1}{2.5}$$

$$CPI = 4 \quad \sum T_{old} \Rightarrow 4 * \frac{1}{2.5} \text{ ns} \\ 1.6 \text{ ns}$$

pipeline  
processor

$K=5$

$$f_p = 2 \text{ GHz} \quad T = \frac{1}{2 \text{ GHz}}$$

$$CPI = 1$$

$$T = 0.5 \text{ ns}$$

$$ET_{new} = 1 * 0.5 \text{ ns}$$

$$= 0.5 \text{ ns}$$

$$\text{Speedup} = \frac{1.6}{0.5} \\ = 3.2$$

(3) Consider the following code sequence having five instructions  $I_1$  to  $I_5$ . Each of these instructions has the following format.

OP Ri, Rj, RK

where operation op is performed on the contents of



Consider the following three statements.

$\times$  S<sub>1</sub>: There is an anti-dependence between instructions I<sub>2</sub> and I<sub>5</sub> → write-before-read

$\checkmark$  S<sub>2</sub>: There is an anti-dependence between instructions I<sub>2</sub> and I<sub>4</sub>.

$\times$  S<sub>3</sub>: Within an instruction pipeline an anti-dependence always creates one or more stalls.

which one of the above statements is/are correct?

(A) Only S<sub>1</sub> is true

(B) Only S<sub>2</sub> is true

(C) Only S<sub>1</sub> and S<sub>2</sub> are true.

(D) Only S<sub>2</sub> and S<sub>3</sub> are true.

④ Consider the following processes (ns stands for nano-seconds).

Assume that the pipeline registers have zero latency.

Ph: 844-844-0102

P<sub>1</sub>: Four-stage pipeline with stage latencies 1 ns, 2 ns, 2 ns, 1 ns  $\Rightarrow 2$

P<sub>2</sub>: Four-stage pipeline with stage latencies 1 ns, 1.5 ns, 1.5 ns, 1.5 ns  $\Rightarrow 1.5$

P<sub>3</sub>: Five-stage pipeline with stage latencies 0.5 ns, 1 ns, 1 ns, 0.6 ns, 1 ns  $\Rightarrow 1$

P<sub>4</sub>: Five-stage pipeline with stage latencies 0.5 ns, 0.5 ns, 1 ns, 1 ns, 1.1 ns  $\Rightarrow 1.1$

which processor has the highest peak clock frequency?

- (A) P<sub>1</sub>
- (B) P<sub>2</sub>
- (C) P<sub>3</sub>
- (D) P<sub>4</sub>

$$f = \frac{1}{T}$$

- ⑤ An instruction pipeline has five stages, namely instruction fetch (IF), instruction decode and register access (ID/RF), instruction execution (EX), memory access (MEM), and register write back (WB) with stage latencies 1 ns, 2.2 ns, 2 ns, 1 ns, and 0.75 ns respectively. (ns stands for nano seconds).

To gain in terms of frequency, the designers have decided to split the **ID/RF** stage into three stages **(ID, RF1, RF2)** Ph: 844-844-0102

each of Latency **2.2/3 ns.** Also EX Stage is split into two

stages **(EX1, EX2)** each of latency 1ns. The new design has a

total of **8** pipeline stages. A program has **20%** branch

instructions which execute in the **EX stage** and produce

the next instruction pointer at the end of the **EX stage**

~~and produced~~ in the old design and at the end of the

~~EX stage~~ in the new design. The **IF stage stalls** after

fetching a branch instruction until the **next instruction**

pointer is computed. All instructions other than the

branch instruction have an average **CPI of one** in both

the designs. The execution times of this program on the

old and the new design are P and Q nanoseconds, respec-

tively. The value of **P/Q** is **1.54**

- (A) 1.5
- (B) 1.4
- (C) 1.8
- (D) 2.5



**20% of branch**

**instructions**

$$P = 3.08$$

$$= (0.8 \times 1 + 0.2 \times 3) 2.2$$

**2nd Case**

$$= (0.8 \times 1 + 0.2 \times 6) 1 \text{ ns}$$

$$Q = 2 \text{ ns}$$

$$P/Q = 3.08/2 = 1.54$$

## Memory Organization:

Reg, Cache, RAM, Disks

Ph: 844-844-0102



Hierarchical-access-memory-hierarchy  
(Default)



Locality of Reference:

Spatial

Temporal



Spatial



Temporal



Keep instructions & data

in Cache is known as Spatial LOR

i, j, k, l, m variable

Repeatedly accessing the variables kept in cache known as Temporal LOR.

$$T_1 \leq T_2 \leq T_3 \leq \dots$$

Simultaneous - Access:

Expected time to access

$$T_{avg} =$$

$$H_1 T_1 + (1-H_1) H_2 T_2$$

$$+ (1-H_1)(1-H_2) H_3 T_3 + \dots$$

$$+ (1-H_1)(1-H_2) \dots (1-H_{n-1}) H_n T_n$$

Expected Time  
to Access



$$m_{mem} = \frac{1}{T_{avg}} \text{ words/sec}$$

Avg Access Speed.

$$\text{Hit-ratio rate} = \frac{\# \text{hits}}{\# \text{accesses}} \quad 0 \leq H_r \leq 1$$

$$= H_i \quad \text{Miss.}$$

$T_i$ : time to access stage  $i$

(104)



$$\begin{aligned}
 T_{avg} = & H_1 T_1 + (1-H_1) H_2 (T_2 + T_1) + (1-H_1)(1-H_2) H_3 (T_1 + T_2 + T_3) \\
 & + \dots \\
 & + (1-H_1)(1-H_2) \dots (1-H_{n-1}) H_n (T_1 + T_2 + \dots + T_n) \\
 L_1 & \text{ Miss } \quad L_2 \text{ Miss } \quad L_{n-1} \text{ Miss } \quad L_n \text{ Hit in } \\
 & \downarrow \qquad \downarrow \qquad \downarrow \qquad \downarrow \\
 & \Rightarrow H_n = 1
 \end{aligned}$$

(10) 2-level memory-hierarchy (sequential)

L<sub>1</sub> is 8 times faster than L<sub>2</sub>

$$T_1 = 20 \text{ ns}$$

$$T_1 = T_{avg} - 40 \text{ ns}$$

Find H<sub>1</sub>

$$\begin{aligned}
 T_{avg} = & H_1 \times T_1 + (1-H_1) H_2 T_2 = T_1 + 40 \text{ ns} \\
 20 \times H_1 + (1-H_1) \times 1 \times 160 & = T_1 + 40 \\
 20 H_1 + 160 - 160 H_1 & = 20 + 40 \\
 \Rightarrow H_1 & = 0.714
 \end{aligned}$$

(71.4%)

## Cache-Memory organization:

- Memory-organization
- Mapping-techniques
- Replacement-Methods
- Updates
- Multi-level Cache  $\rightarrow L_1 \ L_2 \ L_3$



## Memory Organization:



$$4 \text{ GB} = \frac{32}{2^3} \text{ B} = 2^{32} \text{ B}$$

$n = \# \text{ blocks in MM}$

Addr: 32b

Data: 8b



Cache Index



$$4 \text{ GB} = \frac{32}{2^3} \text{ B} = 2^{32} \text{ B}$$

$$\# \text{ Blocks in MM} = \frac{2^{32}}{2^{14}} = 2^{18} = n$$

$$\# \text{ Blocks in CM} = \frac{2^{25}}{2^{14}} = 2^1 = m$$



Q1  
 $32\text{KB} \text{ cache} = 2^{15} \text{ B}$   
 $32\text{B} \text{-blocks} = 2^5 \text{ B} \Rightarrow 1\text{B} = 1\text{W}$   
 CPU: 32b physical Addr

# lines in CM =  $\frac{2^{15}}{2^5} = 2^{10} = m$

# blocks in MM =  $\frac{2^{32}}{2^5} = 2^{32-5} = 2^{27} = n$

Addr-Format =



f(C) : mod

### Mapping Techniques:

#### Direct-Address-Mapping

physical-addr:



# blocks in =  $2^{18} = n$   
 MM

cache-Index:



# lines in =  $2^{11} = m$

CM

$p_{A1} = B_2$   
 $p_{A2} = B_{m+2}$

f: tag mod m = line#



Computing - Mod:

$$\text{Base 10: } \underline{123} \bmod \frac{10}{10} = 3$$

$$\underline{1678} \bmod \frac{100}{10} = 78$$



Base 2:

$$\underline{1000} \bmod 100 = 00$$

$$\underline{1001} \bmod 100 = 01$$

$$\underline{1010} \bmod 100 = 10$$



In case of cache miss, we need to replace in the cache.

Limitation of Direct Mapping:

Eg:  $64\text{KB}$  direct-mapped cache  
 $2^{16}\text{B}$   $32\text{B-blocks/line} \Rightarrow 2^5$

cpu: 32b-addr

cache-Controller: one valid bit, one modified bit, tag-bits.

# tag-bits  $\leftarrow$  Tag  
# size of tag-memory!



### Associative - addr- Mapping: (Fully)



$m - 2^{11}$   
Circuit Complex

### $m$ - parallel Comparisons:



215

$$\text{Line size} = 32\text{B} \Rightarrow 2^5 \text{B}$$

$$\text{Main-mem-size} = 2^{28} \text{B}$$

1 valid bit, 1 modified bit, 2 replacement bits.

a) # tag-bits : 23

b) Tag-Memory size:  $(4 + 23) * 2^{10}$  # blocks =  $n = \frac{2^{28}}{2^{25}} = 2^{23}$



$$\Rightarrow n = \frac{2^{15}}{2^5} = 2^{10}$$

Set-Associative Mapping / Cache Memory

Flexible Than  
DM AM  
Line/Block  
 $m = 2^{11}$



each set : 2 lines

$$S = \frac{m}{2} = \frac{2^{11}}{2} = 2^{10}$$

$$\text{Set-Index} = \text{Tag} \oplus \text{A} \bmod 2^{10}$$

a: b0

2-Comparisons

4-Comparisons



# Comparisons = 2 in 2-way SA-Mem  
=  $K$  in K-way SA-mem

1 way - set - Associative  $\Rightarrow$  Direct-Mapping



In practice  
2-way  
4-way

M-way Set-associative  $\Rightarrow$  Fully - Associative



$$S = \frac{210}{24} = 26$$

$$2^4 = \underline{\text{16-way}} - \text{SA} - \text{MM}$$

(a) addr-format

(b) Size of Tag-Mem with 4-additional bits / block.



$$(4b + 17b) * 2^{10}$$

## Replacement - Strategies:



Direct  
Addressing

addr: B<sub>0</sub>, B<sub>1</sub>, B<sub>m</sub>, B<sub>2</sub> ...

$$m \bmod m \Rightarrow 0 \Rightarrow L_0$$



## Associative mapping

B<sub>3m</sub>



2-way  
Set-Associative  
Mapping

$$g \bmod s \quad s \bmod s \Rightarrow 0$$

OS

Replacement: FIFO Queue

addr: B<sub>0</sub>, B<sub>s</sub>, B<sub>1</sub>, B<sub>2s</sub>, B<sub>s+1</sub>, B<sub>2s+1</sub>...

1 mod s  $\Rightarrow$  1

2s mod s  $\Rightarrow$  0

2s+1 mod s  $\Rightarrow$  1

| v/I TAG |   |                   |                 |
|---------|---|-------------------|-----------------|
| 1       | v | B <sub>0</sub>    | S <sub>0</sub>  |
| 1       | v | B <sub>s</sub>    | S <sub>1</sub>  |
| 1       | v | B <sub>1</sub>    | S <sub>2</sub>  |
| 1       | v | B <sub>2s</sub>   | S <sub>3</sub>  |
| 1       | v | B <sub>s+1</sub>  | S <sub>4</sub>  |
| 1       | v | B <sub>2s+1</sub> | S <sub>5</sub>  |
| 1       | v |                   | S <sub>6</sub>  |
| 1       | v |                   | S <sub>7</sub>  |
| 1       | v |                   | S <sub>8</sub>  |
| 1       | v |                   | S <sub>9</sub>  |
| 1       | v |                   | S <sub>10</sub> |
| 1       | v |                   | S <sub>11</sub> |
| 1       | v |                   | S <sub>12</sub> |
| 1       | v |                   | S <sub>13</sub> |
| 1       | v |                   | S <sub>14</sub> |
| 1       | v |                   | S <sub>15</sub> |
| 1       | v |                   | S <sub>16</sub> |
| 1       | v |                   | S <sub>17</sub> |
| 1       | v |                   | S <sub>18</sub> |
| 1       | v |                   | S <sub>19</sub> |
| 1       | v |                   | S <sub>20</sub> |

2-way set associative Mapping

Replacement : LRU (Least Recently Used)  $\rightarrow$  used in practical

addr: B<sub>0</sub>, B<sub>s</sub>, B<sub>1</sub>, B<sub>2s</sub>, B<sub>s+1</sub>, B<sub>2s+1</sub>...  
• LRU      M      T LRU      Most

2s mod s  $\Rightarrow$  0

B<sub>2s+1</sub> mod s  $\Rightarrow$  1

| v/I Tag |   |                   |                 |
|---------|---|-------------------|-----------------|
| 1       | v | B <sub>0</sub>    | S <sub>0</sub>  |
| 1       | v | B <sub>s</sub>    | S <sub>1</sub>  |
| 1       | v | B <sub>1</sub>    | S <sub>2</sub>  |
| 1       | v | B <sub>2s</sub>   | S <sub>3</sub>  |
| 1       | v | B <sub>s+1</sub>  | S <sub>4</sub>  |
| 1       | v | B <sub>2s+1</sub> | S <sub>5</sub>  |
| 1       | v |                   | S <sub>6</sub>  |
| 1       | v |                   | S <sub>7</sub>  |
| 1       | v |                   | S <sub>8</sub>  |
| 1       | v |                   | S <sub>9</sub>  |
| 1       | v |                   | S <sub>10</sub> |
| 1       | v |                   | S <sub>11</sub> |
| 1       | v |                   | S <sub>12</sub> |
| 1       | v |                   | S <sub>13</sub> |
| 1       | v |                   | S <sub>14</sub> |
| 1       | v |                   | S <sub>15</sub> |
| 1       | v |                   | S <sub>16</sub> |
| 1       | v |                   | S <sub>17</sub> |
| 1       | v |                   | S <sub>18</sub> |
| 1       | v |                   | S <sub>19</sub> |
| 1       | v |                   | S <sub>20</sub> |

2-way set associative Mapping

Replacement : optimal

addr: B<sub>0</sub>, B<sub>s</sub>, B<sub>1</sub>, B<sub>2s</sub>, B<sub>s+1</sub>,  
B<sub>0</sub>, B<sub>s</sub>  
time  
Future

Not practical

| v/I Tag |   |                  |                 |
|---------|---|------------------|-----------------|
| 1       | v | B <sub>0</sub>   | S <sub>0</sub>  |
| 1       | v | B <sub>s</sub>   | S <sub>1</sub>  |
| 1       | v | B <sub>1</sub>   | S <sub>2</sub>  |
| 1       | v | B <sub>2s</sub>  | S <sub>3</sub>  |
| 1       | v | B <sub>s+1</sub> | S <sub>4</sub>  |
| 1       | v |                  | S <sub>5</sub>  |
| 1       | v |                  | S <sub>6</sub>  |
| 1       | v |                  | S <sub>7</sub>  |
| 1       | v |                  | S <sub>8</sub>  |
| 1       | v |                  | S <sub>9</sub>  |
| 1       | v |                  | S <sub>10</sub> |
| 1       | v |                  | S <sub>11</sub> |
| 1       | v |                  | S <sub>12</sub> |
| 1       | v |                  | S <sub>13</sub> |
| 1       | v |                  | S <sub>14</sub> |
| 1       | v |                  | S <sub>15</sub> |
| 1       | v |                  | S <sub>16</sub> |
| 1       | v |                  | S <sub>17</sub> |
| 1       | v |                  | S <sub>18</sub> |
| 1       | v |                  | S <sub>19</sub> |
| 1       | v |                  | S <sub>20</sub> |

Eq 1:

4-block Cache (Initially Empty)

**PIN 844-844-0102**

Order of main-memory block reference

4, 5, 7, 12, 14, 5, 13, 4, 5, 7

calculate hit-ratio if

① DAM

② 2-way-set-associative with LRU

①

| V/I | Tag |      |
|-----|-----|------|
| 1 ✓ | 4   | 12/4 |
| 1 ✓ | 5   | 13/5 |
|     | 7   |      |
| 1 ✓ |     | CM   |



$m=4$

$\Rightarrow 4 \bmod 4 = 0$

$5 \bmod 4 = 1$

$7 \bmod 4 = 1$

$12 \bmod 4 = 0$

$4 \bmod 4 = 0$

$5 \bmod 4 = 1 \Rightarrow \text{HIT} \checkmark$

$13 \bmod 4 = 1 \Rightarrow$

$4 \text{ HIT } \checkmark$

$5 \text{ MIS}$

$7 \text{ HIT} \checkmark$

M M M M H H H H H H H H M  
 4, 5, 7, 12, 4, 5, 13, 4, 5, 7

②

| V/I | Tag |        |
|-----|-----|--------|
| ✓   | 4   | 0 } S0 |
| ✓   | 12  | 1 } S0 |
| ✓   | 5   | 2 } S1 |
| ✓   | 13  | 3 } S1 |
|     |     | CM     |

$\Rightarrow 4 \bmod 2 = 0 \text{ Set0}$

$\Rightarrow 5 \bmod 2 = 1 \text{ Set1}$

$\Rightarrow 7 \bmod 2 = 1 \text{ Set1}$

$\Rightarrow 12 \bmod 2 = 0 \text{ Set0}$

$13 \bmod 2 = 1$

$\Rightarrow \frac{4}{10} = 40\% \text{ Hit Ratio}$

Updation Techniques:

→ update / Modify bit

Cache-coherence:

→ Write-Through

→ Write-back

Any Edit/updation in Cache  
memory need to update in  
Main memory.



2-way set associative  
Mapping

addr: B<sub>0</sub>, B<sub>1</sub>, B<sub>2</sub>, ..., B<sub>n-1</sub>  
W R R

Write-Through

$$T_W = \max(CM_w, MM_w)$$

↳ Simultaneous Write

Expected time

$$T_{avg-r} = H_r T_r + (1-H_r) (T_{cr} + T_{mr})$$

+ Talloc → which  
block  
is to be  
Replaced

$$T_{avg-w} = \frac{H_w T_w}{Hit} + \frac{(1-H_w) (T_{alloc} + T_w) + T_{mr}}{Miss}$$



$$T_{avg-wt} = f_r * T_{avg-r} + f_w * T_{avg-w}$$

Write-back :



$$T_{avg-r} = H_r T_{cr} + (1-H_r) \left[ P_d * (T_{WB} + T_{alloc} + T_{mr} + T_{cr}) + P_c * (T_{alloc} + T_{mr} + T_{cr}) \right]$$

c: clean bit  
 d: dirty bit

$$T_{avg-w} = H_w T_{cw} + (1-H_w) \left[ P_d (T_{WB} + T_{alloc} + T_{mr} + T_{cw}) + P_c (T_{alloc} + T_{mr} + T_{cw}) \right]$$

$$T_{avg-WB} = f_r T_{avg-r} + f_w T_{avg-w}$$

Multi-Level Cache:



Miss-Penalty:

Time taken to service a miss

Global-miss-rate:  $\frac{#addr\ gen}{10}$

$$\frac{2}{10}$$

$$\frac{0}{10}$$

Local-miss-rate

$$\rightarrow \frac{4}{10}$$

$$\frac{2}{4}$$

$$\frac{0}{20}$$

$$T_{avg} = (H_L * T_L) + \text{Miss rate}_L * \text{Miss-penalty}_L$$

$$\text{Miss-penalty}_L = (H_{L2} * T_{L2}) + \text{Miss rate}_{L2} * \text{Miss-penalty}_{L2}$$

$$\text{Miss-penalty}_{L2} = T_{MM}$$

Types of Cache-Miss:

① Compulsory / Cold-start / First-ref miss

cpu → B1



② Capacity - Miss



size of cache ↑

③ Conflict-Miss



K-way SA ↑

Initially cache is Empty. we need to bring the data

from MM and store in CM ⇒ Compulsory Miss  
cold-start

More Solved problems-1

① A certain process deploys a single-level cache. The cache block size is 8 words and the word size is 4 bytes. The memory system uses a 60 MHz clock. To solve a cache-miss, the memory controller first takes 1 cycle to accept the starting address of the block, it then takes 3 cycles to fetch all the eight words of the block and finally transmits the words of the requested block at the rate of 1 word per cycle. The maximum

bandwidth for the memory system when the program

running on the processor issues a series of read operations

$160 \times 10^6$  bytes/sec.

⇒ 160

$$f = 60 \text{ MHz}$$

Ph: 844-844-0102

$$t = \frac{1}{60M}$$



best-case

$$8W = 1 + 3 + 8 \times 1$$

$= 12 \text{ Cycles}$

$$\frac{4B \times 8W}{W} \rightarrow 12$$

$$32B \rightarrow 12 \times \frac{1}{60M} \text{ sec}$$

$$= 160 \text{ MB/s}$$

- (2) A certain processor uses a fully associative cache of size 16 KB, The cache block size is 16 bytes. Assume that the main memory is byte addressable and uses a 32-bit address. How many bits are required for the tag and index fields respectively in the address generated by the processor?

(A) 24 bits and 0 bits

(B) 28 bits and 4 bits

(C) 24 bits and 4 bits

(D) 28 bits and 0 bits.



The size of the physical address space of a processor

**Ph: 844-844-0102**

is  $2^P$  bytes. The word length is  $2^W$  bytes. The capacity of cache memory is  $2^N$  bytes. The size of each cache block is  $2^M$  words. For a **K-way set-associative** cache memory, the length (in number of bits) of the tag field is

(A)  $P - N - \log_2 K$

(B)  $P - N + \log_2 K$

(C)  $P - N - M - W - \log_2 K$

(D)  $P - N - M - W + \log_2 K$



$$CM_W = 2^{N-W}$$

$$\frac{2^N}{2^W} = \# \text{ Blocks in CM}$$

$$\# \text{ sets} = \left( \frac{2^{N-W-M}}{K} \right)$$

$$\Rightarrow P-W - N-W-M + \log_2 K - + M$$

**$P - N + \log_2 K$**

$$\Rightarrow N-W-M - \log_2 K$$

- (4) Consider a two-level cache hierarchy L1 and L2 caches. An application incurs **1.4** memory accesses per instruction on average. For this application, the miss rate of L1 cache **0.1**, the L2 cache experience **7 misses per 1000 instructions**. The miss rate of L2 expressed correct to two decimal places is **0.05**



Ph: 844-844-0102

$$LMR = \frac{7}{140}$$

$$= \frac{1}{20}$$

$$= 0.05 = 5\%$$

Global Miss-rate

- ⑤ Consider a two-way set associative cache with 256 blocks and uses LRU replacement. Initially the cache is empty. Conflict misses are those misses which occur due to the contention of multiple blocks for the same cache set. Compulsory misses occur due to first time access to the block. The following sequence of access to memory blocks: (0, 128, 256, 128, 0, 128, 256, 128, 11, 129, 257, 129, 257, 129) is repeated 10-times. The number of Conflict misses experienced by the cache is \_\_\_\_\_

(0, 128, 256, 128, 0, 128, 256, 128, 11, 129, 257, 129, 257, 129) is repeated 10-times. The number of Conflict misses experienced by the cache is \_\_\_\_\_



256 both Compulsory & conflict miss



$$1: 0, 256, \frac{0+256}{2+2} = 4$$

$$2: 0, 256, 0, 256 \xrightarrow{4} 4 \\ 1, 129, 1, 129 \xrightarrow{4} 4 \quad \} 8$$

$$\begin{array}{l} 1 \rightarrow 4 \\ 2 \rightarrow 8 \\ 3 \rightarrow 8 \\ \vdots \\ 10 \rightarrow 8 \end{array}$$

|     |                |
|-----|----------------|
| 256 | S <sub>0</sub> |
| 128 |                |
| 257 | S <sub>1</sub> |
| 129 |                |

$$10 \rightarrow 8 \Rightarrow 9 \times 8 + 4 = 72 + 4 = 76 \text{ Conflict misses}$$

- ⑥ A cache memory unit with a capacity of  $N$  words and block size of  $B$  words is to be designed. If it is designed as a direct mapped cache, the length of the TAG field is 10 bits. If the cache unit is now designed as a 16-way set associative cache, the length of the TAG field is \_\_\_\_\_ bits.



$$\text{Number of Line} = \frac{N}{B}$$

$$10 + \log \frac{N}{B} + \log B = x + \log \frac{N}{16B} + \log B$$

$$10 + \log \frac{N}{B} = x + \log \frac{N}{B} - \log 16$$

$$10 + 4 = x$$

$$\Rightarrow x = 14$$

- ① In a two-level cache system, the access time of L<sub>1</sub> and L<sub>2</sub> are 1 and 8 clock cycles, respectively. The miss penalty from the L<sub>2</sub> cache to main memory is 18 clock cycles. The miss rate of L<sub>1</sub> cache is twice that of L<sub>2</sub>. The average memory access time (AMAT) of this cache system is 2 cycles. The miss rates of L<sub>1</sub> and L<sub>2</sub> respectively are
- A. 0.111 and 0.056  
 B. 0.056 and 0.111  
 C. 0.0892 and 0.1784  
 D. 0.1784 and 0.0892



$$Q = 1 + x(8) + x \cdot \frac{1}{2} \cdot 18^2$$

$$\Rightarrow 9x^2 + 8x + 1 = 2$$

$$x = \frac{1}{18} = 0.0555$$

- ② Consider a machine with byte addressable memory 32/2  
 bytes divided into blocks of size 32 bytes. Assume a direct mapped cache having 512 cache lines are used with this machine. The size of the tag field in bits
- A 12   B 16   C 18   D 24



(3)

The read access times and the hit ratios for different levels of caches in a memory hierarchy are as given below.

**Ph: 844-844-0102**

Instruction  
Data

| Cache                 | Accounts for Hit rate                |           |
|-----------------------|--------------------------------------|-----------|
|                       | Read access time<br>(in Nanoseconds) | Hit ratio |
| I-cache               | 2                                    | 0.8       |
| D-Cache               | 2                                    | 0.9       |
| L <sub>2</sub> -Cache | 8                                    | 0.9       |

The read-access time of main memory is 90 nano seconds.

Assume that the caches use the ~~staggered-word-first read~~ policy and the ~~write back~~ policy. Assume that all the

caches are direct mapped address. Assume that the dirty bit is always 0 for all the blocks in the caches. In execution of a program, 60% of memory reads are for instruction fetch and 40% are for memory operand fetch. The

average read access time in nano seconds (upto 2 decimal places) is \_\_\_\_\_

$$\text{Instr: } 2 + 0.2(8) + 0.2 \times 0.1(90)$$

$$= 2 + 1.6 + 0.18 = 5.4 \text{ ns}$$

$$\text{Data: } = 2 + 0.1 \times 8 + 0.1 \times 0.1 \times 90$$

$$= 3.7 \text{ ns}$$

- ④ The width of the physical address on a machine is 40 bits.  
 The width of the TAG field in a 512KB 8-way set-associative cache is \_\_\_\_\_

~~A) 24~~   B) 20   C) 30   D) 40



$$\Rightarrow 2^{19} = 2^x \times 2^3 \times 2^y$$

$$\Rightarrow 19 = x + y + 3$$

$$\Rightarrow x + y = 16$$

$$\Rightarrow 40 - 16 = 24$$

$\Rightarrow 2^{19}$

$\# \text{ sets} = 2^x$

$\text{Block size} = 2^y$

- ⑤ A file system uses an in-memory cache to cache disk

blocks. The miss rate of the cache is shown in the figure. The latency to read a block from the cache is **1 ms** and to read a block from the disk is **10 ms**. Assume that the cost of checking whether a block exists in the cache is negligible. Available cache size one in multiply of **10MB**.

The smallest cache size required to ensure an average

(14)

read Latency less than  $6 \text{ ms}$  is  $30 \text{ MB}$ .

Cache  $\rightarrow$  DISK  
1ms      10ms

MISS  
rate( $x$ )



$$\Rightarrow (1-x)1 + x(10)$$

$$\text{miss-rate} = x$$

$$\text{Hit-rate} = 1-x$$

cache  
size(MB)

$$9x + 1 < 6 \text{ ms}$$

$$x < 0.5556 \text{ ms}$$

$$\Rightarrow 55.56\%$$

$$\Rightarrow [30 \text{ MB}]$$

- (6) Assume that for a certain processor, a read request takes  $50 \text{ nanoseconds}$  on a cache miss and  $5 \text{ nanoseconds}$  on a cache hit. Suppose while running a program, it was observed that  $80\%$  of the processor read requests result in a cache hit. The average read access time in nanoseconds is \_\_\_\_\_

(A) 10

$$.80 \times 5 + 0.2(50)$$

(B) 12

$$= 4 + 10$$

(C) 13

$$= 14$$

(D) 14

Answer: (C) 13

Explanation: Average access time =  $0.8 \times 5 + 0.2 \times 50 = 13 \text{ ns}$

⑦ In designing a Computer cache system, the cache block (or cache line) size is an important parameter. Which one of the following statements is correct in this context?

Ph: 844-844-0102

- X(A) A smaller block size implies better spatial locality.
- X(B) A smaller block size implies a smaller cache tag and hence lower cache tag overhead.
- X(C) A smaller block size implies a larger cache tag and hence lower cache hit time.
- ✓(D) A smaller block size incurs a lower cache miss penalty

⑧ If the set associativity of a processor cache is doubled while keeping the capacity and block size unchanged, which one of the following is guaranteed to be NOT affected?

- X(A) width of TAG Comparator
- X(B) width of set index decoder
- X(C) width of way selection multiplexer.
- ✓(D) width of processor to main memory data bus-word size



Input-output-organizationI/O Transfer-modes1. Programmed I/O2. INT-Driven I/O3. DMAProgrammed I/O: $I_1 : \dots$  $I_2 : LDA \text{ I/o-addr}$  $I_3 : \dots$  $I_4 : \dots$ CPU-wait-timeInterrupt-driven I/O:PLCprogrammable InterruptControllerI/O Interface

$$ET_{INT-I/O} = 0.45 \mu\text{sec}$$



### single-buffer vs Cascading Mode



DMA (Direct Memory Access):  
I/O Command: port #, mem. add<sub>64</sub>, count, control-signals



### DMA - Modes:

Cycle-stealing : Byte by Byte      word by word

Burst-Mode: Batch Mode | Bulk mode

Timings:

① Preparation-time (cpu-busy-state)

② Transfer time (cpu-blocked-state)



- ① Consider a 1MBPS I/O device interfacing with a CPU using DMA in cycle-stealing mode. whenever 32B of data present in the Buffer, it is transferred to the main memory. Main memory cycle time is 60 micro-sec. percentage of time CPU is in Blocked state during the DMA operation is \_\_\_\_\_

Sol:

$$\text{Preparation time} = 32 \mu s - \text{Busy}$$

$$\Rightarrow 32 B = 32 \mu s$$

$$\text{Trans time} = 60 \mu s \Rightarrow \text{Blocked}$$

$$= \frac{60}{92} = \boxed{\frac{60}{(60+32)}}$$

- ② The size of the data count register of a DMA controller is 16 bits. The processor needs to transfer a file of 29,154 KB from the disk to main memory. The memory is byte addressable. The minimum number of times the DMA controller needs to get control of the system bus from the processor to transfer the file from the disk to main memory is 456.

$2^{16}$ 

64KB

 $1 \rightarrow 64^K$  $? \leftarrow 29154KB$ 

$$\begin{array}{r} 29154 \\ \hline 64 \end{array}$$

$$\Rightarrow [455.53]$$

$$\Rightarrow \boxed{456}$$

       = 0 =