

# COA

## - : COA : -

### #) Memory Organization:



\* Lines      Blocks      Pages  
CM            MN            SM

### \* Direct mapping

Fully Associative "

Set-Asst. "

$2^{10}$  — Kilo

$2^{20}$  — Mega

$2^{30}$  — Giga

$2^{40}$  — Tera

$2^{50}$  — Peta

### \* Direct mapping :

Block no. = Remainder is  
No. of cache line Address of block  
in cache



No. of cache lines =  $2^{\text{cache line}}$   
Tag size = No. of cache lines  $\times$  Tag bits  
(First cache line 3rd Tag)

### \* Fully Associative mapping:



Tag bits = Comparator bits

No. of CL = No. of comparators

Hit Latency = Time taken by one comparator + Time taken by OR gate

Hit Latency = Time taken by one comparator + Time taken by OR gate

### \* Set Associative mapping:

W-way set ass.  $\Rightarrow$  C.Lines grouped into sets

each set contains 'K' no. of lines

Main mem. (Physical Addr.)



Cache size  
no. of sets =  $\frac{\text{no. of lines in cache}}{\text{no. of lines in a set (K)}}$

Cache replacement policies:  
(only for full ass. & set-Ass.)

1. FIFO & Belady Anomaly



2. LRU

3. MRU

4. Optimal algo (Future ref. Unimplementable)

Ex: 4-way set ass.

16 cache blocks

LRU

Request order: 0, 2, 5, 5, 1, 4

Procedure: Request order % 4

Exs: Direct mapped

8 cache blocks

Req. orders 3, 5, 2, 8, 0

Procedure: order % 8

|   |   |   |
|---|---|---|
| 0 | 8 | 0 |
| 1 | 2 | 3 |
| 2 | 3 | 4 |
| 3 | 5 | 6 |
| 4 | 7 | 8 |

\*) Memory Org. :

I) Hierarchical Access:

$$EMAT = H_1 T_1 + (1-H_1) H_2 (T_1 + T_2) + (1-H_2) H_3 (T_1 + T_2 + T_3)$$

$$C_{avg} = \frac{C_1 S_1 + C_2 S_2 + C_3 S_3}{S_1 + S_2 + S_3}$$

C  $\rightarrow$  cost per byte  
S  $\rightarrow$  size

II) Simultaneous Access:

$$EMAT = H_1 T_1 + (1-H_1) H_2 T_2 +$$

$$(1-H_1)(1-H_2) H_3 T_3$$

Here, hit rate of Last level is always 1 in I & II both



Note: By default = Hierarchical access

Note: )



## #) Magnetic Disk :



Total transfer time = Seek time + Rotational latency + Transfer time

1. Surface no. Seek time = Time taken by R/W head to reach correct track.
2. Track no. Rotational latency = Time for 1 rotation/revolution
3. Sector no.
4. Byte no.

$$\text{Time for } \frac{\text{bits transferred}}{2} = \frac{\text{Transfer time}}{\text{Time}} = \frac{\text{File size}}{\text{Track size}} \times \frac{\text{Time for } \frac{1}{2}}{\text{Rotation/Revolution}}$$

Ex: Tracks = 500

Sectors = 100  
(per track)

Bytes = 500  
(per sector)

Seek time = 1 ms

RPM = 600 x P.M.

File size = 250 B

$$\text{TT} = \left( \frac{250 \text{ B}}{100 \times 500 \text{ B}} \right) \times 100 \text{ ms} \\ = 0.5 \text{ ms}$$

Note:  
C, h, S  
cylinder no.  
sector no.

$$① \text{TT} = ST + RL + TT$$

$$RL: 600 \times \rightarrow 60 \rightarrow$$

$$\frac{600}{60} \times \rightarrow 10 \rightarrow$$

$$1 \times \rightarrow \frac{60}{600} \rightarrow$$

$$1 \times = 100 \text{ ms}$$

$$\Rightarrow RL = \frac{100}{2} = 50 \text{ ms}$$

$$③ TT = 1 + 50 + 0.5 = 51.5 \text{ ms}$$

$$② 1 \times \rightarrow \frac{100 \times 500 \text{ B}}{250 \text{ B}} \rightarrow \frac{100}{100 \times 500} \times 250$$

$$= 0.5 \text{ ms}$$

Note: Disk interleaving:

No interleaving  $\Rightarrow 1 \text{ rotation} = 1 \text{ track read}$

Single interleaving  $\Rightarrow 2 \text{ " } = " \text{ " } = " \text{ " }$

Double "  $\Rightarrow 2.75 \text{ " } = " \text{ " } = " \text{ " }$

~~D-maps:~~

\* Direct Mapping:  
MM = 256 MB  
Cache = 16 KB  
Block size = 1 KB  
Tag = 4;  $\text{Dir.} = 2 \times 14$



\* Fully Associative:

MM = 32 GB  
Cache = 32 KB  
Block size = 1 KB  
Tag = 25;  $\text{Dir.} = 25 \times 2^5$ ;  $\text{comparator bits} = 25$ ;  $\text{no. of comparators} = 2^5$



\* Set Associative:

MM = 32 GB  
Cache = 32 KB  
Block size = 1 KB  
4-way



Tag = 22

Tag dir. =  $22 \times 2^5$

MM = 16 GB  
Block size = 4 KB

Tag = 10

4-way

Tag dir. =  $10 \times 2^{14}$

cache = 64 MB



~~Cylinders~~  $\leftrightarrow$  Sec No.

Q:  $c \text{ Sur } sec$   
 $add = \langle 400, 16, 29 \rangle; \text{ sec.no.} = ?$

$\text{Sec.no.} = \frac{400 \times 16 \times 29}{1 \times 20 \times 63} = \frac{400 \times (10 \times 2) \times 6^3 + 16 \times 6^3 + 29}{1 \times 20 \times 63} = \frac{505037}{63} = 805037$

Ans:  $805037$

1 cylinder  $\rightarrow$  8 platters = 16 surfaces = 20 sectors  
~~Surface~~  $\rightarrow$  63 sectors  
1000 cylinders  $\rightarrow$  8000 platters = 16000 surfaces = 20000 sectors

$805037 = 805037$

3

Q: In above, if Sec.no. = 1,24,875; add = ?

$cyl.no. = \frac{1,24,875}{1 \times 20 \times 63} = \frac{99.107}{63} = 135 \text{ sectors}$

$\Rightarrow 124875 - 99 \times 20 \times 63 = 135$

$\text{Sur. No.} = \frac{135}{63} = 2.1 \uparrow = 2$

$\Rightarrow 135 - 2 \times 63 = 9 \text{ sectors}$

Sec.no. = 9  
add  $\langle 99, 2, 9 \rangle$

Q: File size = 42797KB  
 $\langle \text{starting add} \rangle = \langle 1200, 9, 10 \rangle$   
 $\langle \text{ending add} \rangle = ?$

Sectors req. for file =  $\frac{\text{File size}}{\text{Sector size}} = \frac{42797 \times 1024}{512} = 85594$

1 cylinder  $\rightarrow$  8 platters  
1 cylinder  $\rightarrow$  16 surfaces  
1 surface  $\rightarrow$  64 sectors  
1 sector = 512B  
16384 cylinders

$\langle 1200, 9, 10 \rangle \Rightarrow \text{no. of sectors} = 1200 \times (16) \times 64 + 9 \times 64 = 1229416 + 576 = 1229472$

$1229472 = 1229472$

$1229472$   
 $+ 0085594$   
 $1,315,010$

cyl.no.  $\frac{1315010}{1 \times (16) \times 64} = 1284.18 = 1284$

$\Rightarrow 1315010 - 1284 \times (16) \times 64 = 194 \text{ sectors}$

Sur.no.  $\frac{194}{64} = 3.03 = 3$

$\Rightarrow 194 - 3 \times (64) = 2 \text{ sectors}$

Sec.no. = 2

$\langle \text{ending add} \rangle = \langle 1284, 3, 2 \rangle$

## #> Pipelining :-

\* Formulas -

$$T_{WP} = \left( \frac{\text{Sum of clock for each phase of one instruction}}{\text{no. of instructions}} \right) \times (\text{Time of one instruction})$$

$$T_p = \left[ \left( \frac{\text{no. of Phases}}{\text{Instructions}} \right) + \left( \frac{\text{no. of Insts} - 1}{\text{Instructions}} \right) \right] \times (\text{Time of one clock}) = (m+n-1) \times t_c$$

$$\text{Speedup} = \frac{T_{WP}}{T_p}$$

Note:- For single instruction  
 $T_p \geq T_{WP}$  ( $\because$  pipelined after buffer delay)

$$\text{Max. Speedup} = \text{No. of stages}$$

$$\text{Efficiency} = \frac{\text{Speedup}}{\text{Max. Speedup}} \times 100$$

Note:- When CPI = 1 ; Speedup = No. of stages

\* Example:-

$$1 \text{ clk} = 13$$

|                | F | D | E | WB | Phases |
|----------------|---|---|---|----|--------|
| I <sub>1</sub> | 1 | 2 | 1 | 1  | → 5    |
| I <sub>2</sub> | 1 | 2 | 2 | 1  | → 6    |
| I <sub>3</sub> | 2 | 1 | 3 | 2  | → 8    |
| I <sub>4</sub> | 1 | 3 | 2 | 1  | → 7    |
| I <sub>5</sub> | 1 | 2 | 1 | 2  | → 6    |

|                | 1 | 2 | 3 | 4 | 5  | 6 | 7  | 8 | 9  | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 |
|----------------|---|---|---|---|----|---|----|---|----|----|----|----|----|----|----|----|----|----|----|
| I <sub>1</sub> | F | D | D | E | wB |   |    |   |    |    |    |    |    |    |    |    |    |    |    |
| I <sub>2</sub> | F | X | D | D | E  | E | wB |   |    |    |    |    |    |    |    |    |    |    |    |
| I <sub>3</sub> | F | F | X | D | X  | E | E  | E | wB | wB |    |    |    |    |    |    |    |    |    |
| I <sub>4</sub> | X | F | X | D | D  | D | X  | E | E  | wB |    |    |    |    |    |    |    |    |    |
| I <sub>5</sub> | X | F | X | X | X  | D | D  | X | E  | wB | wB |    |    |    |    |    |    |    |    |

\* Example (Stage latencies) :-

$$\text{Stages} = 4 ; \text{Instructions} = 1000$$



- (Or) Constant clocking rate
- (Or) Steady state under ideal cond.
- (Or) Large no. of instrns
- (Or) Synchronous pipeline

$$\Rightarrow CPI = 1$$

i) Freq. (const.) = ? at 1 clk = 165 ns, CPI = 1

$$f_{\text{freq.}} = \frac{1}{T} = \frac{1}{165} \text{ GHz} = 6.06 \times 10^3 \text{ GHz} = 6.06 \text{ MHz}$$

$$\begin{aligned} \text{i)} T_{WP} &= 150 + 120 + 160 + 140 = 570 \text{ ns} \\ \text{ii)} T_p &= [m + (n-1)] \times 165 \end{aligned}$$

$$\text{iii)} T_{WP} = (150 + 120 + 160 + 140) \times 1000 \times 1 = 570 \times 1000 = 570,000 \text{ ns}$$

$$\text{iv)} T_p = [4 + (1000-1)] \times 165 = 1003 \times 165 = 165,495 \text{ ns}$$

Note:- Throughput increase = final freq. - initial freq.

$$= \frac{\text{final freq.} - \text{initial freq.}}{\text{initial freq.}}$$

$$= \dots \times 100 \%$$

4

## \* Hazards

- 1. Structural hazard
- 2. Control
- 3. Data

5

\*) Structural hazard :-



IF = instr." Fetch  
OL = operand load  
OS = " store

Those who work on memory but we only have single memory so, 2 stalls  $\otimes$  occurred

\*) Control hazard :-



I, the Execute phase in which stall can occur at I<sub>1</sub>, I<sub>2</sub> or I<sub>10</sub>. So, 3 stalls  $\otimes$  occurred

\*) Ex :- Stages = 6 ; CPI = 1

Non-Branch instr." = 75%

Branch instr." = 25%

Branch instr." incur 2 stalls

$$T_{wp} = 6 \times 1 = 6$$

$$T_p = 0.75 \times 1 + 0.25(1+2) \\ = 1.5$$

\*) Ex :- IF ID EX MEM WB

1st 2.2 ns

2nd 1 ns

0.75 ns

CPI = 1

Non-Branch instr." = 80% Total = 100  
Branch instr." = 20% instr." executed

calc." of Stalls

| IF        | ID        | EX | MEM | WB |
|-----------|-----------|----|-----|----|
| $\otimes$ | $\otimes$ |    | IF  |    |

$\Rightarrow$  2 stalls

$$\text{Execution time} = [0.8 \times 1 + 0.2 \times (1+2)] \times 2.2 \times 100 = 308 \text{ ns}$$



Total instr." = 12, but

I<sub>1</sub>

I<sub>2</sub>

I<sub>3</sub>

I<sub>4</sub>

I<sub>5</sub>

I<sub>6</sub>

I<sub>7</sub>

I<sub>8</sub>

I<sub>9</sub>

I<sub>10</sub>

I<sub>11</sub>

I<sub>12</sub>

not Executed

$$\text{Stalls} = 3; \text{Total instr." executed} = 8$$

$$\text{Execution time, without stalls} = [m + (n-1)] + \text{Stalls}] \times 11 \\ = [5 + (8-1) + 3] \times 11 \\ = 165 \text{ ns}$$

• Data Hazard :- Inst<sup>n</sup>'s exhibiting data dependency modifies data in different stages of pipeline.

(3 types)

1. RAW (Read after write) : Let, J follows Inst<sup>n</sup> I

$$I: R_2 \leftarrow R_1 + R_3$$

$$J: R_4 \leftarrow R_2 + R_3$$

J trying to read  $R_2$  before I written it

Sol<sup>n</sup> is operand forwarding

2. WAR (Write after read) :

$$\text{anti-dependence} \quad I: R_2 \leftarrow R_1 + R_3$$

$$J: R_3 \leftarrow R_4 + R_5$$

J trying to write  $R_3$  before I read it.

3. WAW (Write after write) :

$$I: R_2 \leftarrow R_1 + R_3$$

$$J: R_2 \leftarrow R_4 + R_5$$

J trying to write  $R_2$  before I write it

EX-1) Stages = 5 | Operand forwarding used

IF ] 1 clk

ID ] 1 clk

OF ] 1 clk

Perform op<sup>n</sup> → PO = 1 for ADD & SUB ; 3 for MUL ; 6 for DIV

WO ] 1 clk

Instructions:

$$I_0: MUL \quad R_2, R_0, R_1$$

meaning:

$$R_2 \leftarrow R_0 * R_1$$

$$I_1: DIV \quad R_5, R_3, R_4$$

$$R_5 \leftarrow R_3 / R_4$$

$$I_2: ADD \quad R_2, R_5, R_2$$

$$R_2 \leftarrow R_5 + R_2$$

$$I_3: SUB \quad R_5, R_2, R_6$$

$$R_5 \leftarrow R_2 + R_6$$

Dependencies :

$$I_0, I_1 \quad | \quad I_2, I_3$$

$$I_2 \quad | \quad I_3$$

when done without operand forwarding:

|                | 1  | 2  | 3  | 4  | 5  | 6  | 7  | 8  | 9  | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 |
|----------------|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|
| I <sub>0</sub> | IF | ID | OF | PO | PO | PO | PO | W0 |    |    |    |    |    |    |    |    |    |    |    |
| I <sub>1</sub> | IF | ID | OF | X  | X  | PO |    |
| I <sub>2</sub> | IF | ID | X  | X  | X  | X  | X  | X  | X  | X  | X  | X  | X  | X  | X  | OF | PO | W0 |    |
| I <sub>3</sub> | IF | ID | X  | X  | X  | X  | X  | X  | X  | X  | X  | X  | X  | X  | X  | X  | OF | PO | W0 |

$$I_2 \text{ का OF नहीं होगा जबकि}$$

$$I_1, I_0 \text{ के प्रारंभिक स्टेप में यह चुनौती है}$$

$$I_3 \text{ का OF नहीं होगा जबकि}$$

$$I_1, I_0 \text{ के प्रारंभिक स्टेप में यह चुनौती है}$$

when operand forwarding used:

|                | 1  | 2  | 3  | 4  | 5  | 6  | 7  | 8  | 9  | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 |
|----------------|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|
| I <sub>0</sub> | IF | ID | OF | PO | PO | PO | PO | W0 |    |    |    |    |    |    |    |    |    |    |    |
| I <sub>1</sub> | IF | ID | OF | X  | X  | PO |    |
| I <sub>2</sub> | IF | ID | X  | X  | X  | X  | X  | X  | X  | X  | X  | X  | X  | X  | X  | OF | PO | W0 |    |
| I <sub>3</sub> | IF | ID | X  | X  | X  | X  | X  | X  | X  | X  | X  | X  | X  | X  | X  | X  | OF | PO | W0 |

$$\Rightarrow 15 \text{ clock}$$

∴ Total clock cycles =  $(2+1) \times 5 = 15$  clock cycles

∴ Total clock cycles =  $(2+1) \times 5 = 15$  clock cycles

∴ Total clock cycles =  $(2+1) \times 5 = 15$  clock cycles

∴ Total clock cycles =  $(2+1) \times 5 = 15$  clock cycles

∴ Total clock cycles =  $(2+1) \times 5 = 15$  clock cycles

∴ Total clock cycles =  $(2+1) \times 5 = 15$  clock cycles

Ex-2: Stages = 4 ; Operand forwarding used from P0 to OF

IF ] 1 clk  
OF ] 1 for ADD/SUB ; 3 for MUL ; 5 for DIV  
P0 ] 1 for ADD/SUB ; 3 for MUL ; 5 for DIV  
WB ] 1 clk

- I<sub>0</sub> : MUL R5, R0, R1
- I<sub>1</sub> : DIV R6, R2, R3
- I<sub>2</sub> : ADD R7, R5, R6
- I<sub>3</sub> : SUB R8, R7, R4

$$\begin{array}{l}
 R5 \leftarrow R0 * R1 \\
 R6 \leftarrow R2 / R3 \\
 R7 \leftarrow R5 + R6 \\
 R8 \leftarrow R7 - R4
 \end{array}$$

|                | 1  | 2  | 3  | 4  | 5  | 6  | 7  | 8  | 9  | 10 | 11 | 12 | 13 |
|----------------|----|----|----|----|----|----|----|----|----|----|----|----|----|
| IF             | IF | OF | P0 | P0 | P0 | P0 | WB |    |    |    |    |    |    |
| J <sub>1</sub> | IF | OF | X  | X  | P0 | P0 | P0 | P0 | P0 | WB |    |    |    |
| J <sub>2</sub> | IF | X  | X  | X  | X  | X  | OF | P0 | WB |    |    |    |    |
| J <sub>3</sub> | IF | X  | X  | X  | X  | X  | X  | OF | P0 | WB |    |    |    |

$\Rightarrow 13 \text{ clocks}$

Ex-3: Stages = 4 ; Operand forwarding is used

IF ] 1 clk  
DF = (ID + DF) ] 1 clk

EX = 1 for ADD/SUB ; 3 for MUL

WB ] 1 clk

I<sub>0</sub> : ADD R2, R1, R0

$\xrightarrow{I_0}$  I<sub>1</sub> I<sub>2</sub>

I<sub>1</sub> : MUL R4, R3, R2

I<sub>2</sub> : SUB R6, R5, R4

I<sub>0</sub> IF DF EX WB

J<sub>1</sub> IF DF EX EX EX WB

J<sub>2</sub> IF X X DF EX WB



$\Rightarrow 8. \text{ clocks}$

# Instruction format :-

→ 4 address instruction

| Mode/opcode | Operand 1 | Operand 2 | Result | NextInstr. |
|-------------|-----------|-----------|--------|------------|
| x: ADD      | A1        | A2        | A3     | 0x233989   |

→ 3 address instruction: Program Counter stores nextInstr. address

| Opcode | Operand 1 | Operand 2 | Result |
|--------|-----------|-----------|--------|
|        |           |           |        |

ex:  $x = (A+B)*(C+D)$

ADD R1, A, B

$$R1 \leftarrow M[A] + M[B]$$

Note: Immediate operand = operand  
Inst. set length = opcode

ADD R2, C, D

$$R2 \leftarrow M[C] + M[D]$$

Inst. format length = Length of int.  
opcode = log. No. of instructions  
operand = log. words

MUL X, R1, R2

$$M[X] \leftarrow R1 * R2$$

→ 2 address instruction: Operand 1 used to store Result

| Opcode | Operand 1 | Operand 2 |
|--------|-----------|-----------|
|        |           |           |

Ex:  $x = (A+B)*(C+D)$

MOV R1, A

$$R1 \leftarrow M[A]$$

Note:

Result is actually processor register

ADD R1, B

$$R1 \leftarrow R1 + M[B]$$

MOV R2, C

$$R2 \leftarrow M[C]$$

ADD R2, D

$$R2 \leftarrow R2 + M[D]$$

MUL R1, R2

$$R1 \leftarrow R1 * R2$$

MOV X, R1

$$M[X] \leftarrow R1$$

• 1 address instruction: ACC contains result of all operations

| opcode | operand 1 |
|--------|-----------|
|--------|-----------|

$$\text{Ex: } X = (A+B) * (C+D)$$

|         |                           |
|---------|---------------------------|
| LOAD A  | $AC \leftarrow M[A]$      |
| ADD B   | $AC \leftarrow AC + M[B]$ |
| STORE T | $M[T] \leftarrow AC$      |
| LOAD C  | $AC \leftarrow M[C]$      |
| ADD D   | $AC \leftarrow AC + M[D]$ |
| MUL T   | $AC \leftarrow AC * M[T]$ |
| STORE X | $M[X] \leftarrow AC$      |

• Examples :-

$$\begin{aligned} Q1: \text{words} &= 512 = 2^9 \\ \text{no. of 1-addrs instr.} &= 2048 = 2^{11} \\ \frac{2}{2} &= 4 = 2^2 \\ \text{Length of instr.} &=? \end{aligned}$$

$$\begin{aligned} Q2: \text{Registers} &= 64 = 2^6 \\ \text{Instr. Set length} &= \text{opcode} = 12 \leq 2^4 \end{aligned}$$

| opcode | R <sub>1</sub> source | R <sub>2</sub> source | R <sub>3</sub> result | operand 1 |
|--------|-----------------------|-----------------------|-----------------------|-----------|
| 4      | 6                     | 6                     | 6                     | 12        |

100 such instr.

$$Q3: \text{Instr. Length} = 24 \text{ b}$$

$$\text{words} = 32 = 2^5$$

$$50\% \text{ instr.} = 3 - \text{Addrs}$$

$$25\% \text{ } \underline{\quad} \text{ } \underline{\quad}$$

$$25\% \text{ } \underline{\quad} \text{ } \underline{\quad}$$



$$Q4: \text{Int Registers} = 16 = 2^4$$

$$\text{float } " = 64 = 2^6$$

$$\text{Instr. length} = 2B = 16 \text{ b}$$

categories of instr. :-

$$\text{Type-1: 4 instr.} = 2^2$$

each instr. = 3 INT Regs.  
                          operand

$$\text{Type-2: 8 instr.} = 2^3$$

      "      2 float "

$$\text{Type-3: 14 instr.} \approx 2^4$$

      "      1 INT, 1 float "

$$\text{Type-4: N-instr.} = 192N$$

      "      1 float

$$N(\max) = ?$$

|     |   |   |   |
|-----|---|---|---|
| opc | I | I | I |
| 4   | 4 | 4 | 4 |

$$\text{Total} = 2^4 = 16$$

$$\text{Used} = 4$$

$$\text{Unused} = 12$$

$$\xleftarrow{16} \xrightarrow{16}$$

|     |   |   |
|-----|---|---|
| opc | F | F |
| 4   | 6 | 6 |

$$\frac{1}{4} \frac{1}{2} \frac{1}{2}$$

$$\text{Total} = 12$$

$$\text{Used} = 8$$

$$\text{Unused} = 4$$

$$\xleftarrow{16} \xrightarrow{16}$$

|     |   |   |
|-----|---|---|
| opc | I | F |
| 6   | 4 | 6 |

$$\frac{1}{4} \frac{1}{2} \frac{1}{2}$$

$$\text{Total} = 4 \times 2^2$$

$$\text{Used} = 16$$

$$\text{Unused} = 14$$

$$\text{Unused} = 2$$

$$\xleftarrow{16} \xrightarrow{16}$$

|     |   |
|-----|---|
| opc | F |
| 10  | 6 |

$$\frac{1}{6} \frac{1}{4}$$

$$\text{Total} = 2 \times 2^4$$

$$= 32$$

$$\text{Ans}$$

|                                     |                         |
|-------------------------------------|-------------------------|
| • 0 address instruction: STACK Used | Ex: $X = (A+B) * (C+D)$ |
|                                     | $= AB + CD + *$         |

|                                                   |          |
|---------------------------------------------------|----------|
| Ex: ADD                                           | opcode 8 |
| i.e. Pop 2 people from stack & add them then push |          |



$$\text{Immediate} = \text{operand} = 12 \text{ b}$$

$$\Rightarrow 34 \text{ b for 1 instr.}$$

$$\Rightarrow 1000 \times 34 \text{ b} = 5 \text{ B per instr.}$$

$$\Rightarrow 500 \text{ B total}$$

|    |        |           |
|----|--------|-----------|
| 2  | 17     | 5         |
| 00 | opcode | operand 1 |

|    |        |           |           |
|----|--------|-----------|-----------|
| 2  | 12     | 5         | 5         |
| 01 | opcode | operand 1 | operand 2 |

|   |        |           |           |           |
|---|--------|-----------|-----------|-----------|
| 1 | 8      | 5         | 5         | 5         |
| 0 | opcode | operand 1 | operand 2 | operand 3 |

- 1) Addressing modes:** - ways for reference to operand
- I: Immediate:** - Instr<sup>n</sup> itself have operand
- Used for constants, where value known.
  - Large constants & constants whose value are unknown can't be used.
- II: Direct (or Absolute):** - Instr<sup>n</sup> have address (Effective address/Direct address/Absolute address) of operand.
- Used for variables with unknown value, Large value variables, global variables.
  - No. of variables used are limited.
  - Fails where large no. of variables.
- III: Indirect:** - Instr<sup>n</sup> have address of EA of operand.
- IV: Implied:** - Address already provided, no need to provide explicitly.  
Ex: 0-addx. ins<sup>n</sup> (Top of stack)  
Accumulator (ACC)
- V: Register mode:** - Instr<sup>n</sup> have register no. which contains operand.
- Applicable only with few variables, bcz. Regs are limited.
- VI: Register indirect mode:** - Instr<sup>n</sup> have reg. no. of register which have EA of operand.
- Auto inc./dec. register indirect :-  
EA + gives next element address  
(>i++ would be i+2 for i being INT)
  - Array implementation is done.
- VII: Base register/Offset mode:** - Handles relocatable code.  
 $EA = \text{base address} + \text{offset}$  (of instr<sup>n</sup>)
- VIII: Index addr. mode / Relative mode:** -
- Array is accessed.
  - Instr<sup>n</sup> have base addr. of array, Index register have index value where we wanna go.
  - Position independent code is implemented where no change in code needed.
-

## \*> Complex Instruction Set Format (CISC)

Large no. of instr'n  
 Few instr'n → perform specific task  
 Large no. of addressing modes  
 Variable length instr'n format  
 Instr'n inside MM  
 Powerful  
 costly  
 Microprogrammed  
 slow  
 $CPI \rightarrow 1$   
 Less # of registers

## Reduced Instruction Set Format (RISC)

Few no. of instr'n  
 Few instr'n → used very frequently  
 Few addressing modes (3-4)  
 Fixed length instr'n format  
 Loads instr'n in Register, then does work.  
 (All operations done within register)  
 Less powerful  
 cheap  
 Hardwired, so, fast  
 $CPI = 1$   
 More # registers

10

### Notes:

\*> Register renaming is done to eliminate WAR & WAW data hazards

\*> For calculating Twp or Tp :-  $T = CPI \times \text{cycle time}$

|   | Data BW or freq. |
|---|------------------|
| K | $2^{10}$         |
| M | $2^{20}$         |
| G | $2^{30}$         |

- \*> Fully associative mapping
- 8 cache blocks (0 to 7)
- LRU cache replacement policy used.

- Request order: 4, 3, 25, 8, 19, 6, 25, 8, 16, 35, 45, 22, 8, 3, 16, 25, 7

A) Procedure → Put in LRU manner



\*> Write-through: where both MM & cache are updated simultaneously  
 write-back : where MM is updated at the end.



\*> Special purpose registers:

- 1) AC : stores result of ALU
- 2) PC : address of next instr. to be executed.
- 3) IR : instruction register
  - address of current instr. being executed.

- 4) SP : stack pointer
  - address of TOS

- 5) MAR : memory address register
  - used to send address to memory
  - unused =  $2^{25} - 2^{24}$

- 6) DR : data buffer register
  - acts as buffer while accessing memory (R/W)



\*> consider a memory of size  $2K \times 8$  bits  
Decoder size = ? to access memory.



\*> MM RAM (volatile)  
ROM (non-volatile)

\*> Microoperation: break one operation into small-small operations which CPU can execute.

\*> Memory cycle time: Time to read/write 1 unit cell of memory.  
(i.e. 1 B memory) (or 1 word memory)

\*> Throughput: work or task done in 1 unit of time. i.e.

Ex: 1 task is completed in 5ns  $\rightarrow$  1 task/5ns  $\rightarrow$  tasks/sec.

5ns  $\rightarrow$  1 task  
1 s  $\rightarrow$   $\frac{1}{5}$  task

$$\Rightarrow \text{Throughput} = \frac{1}{5n}$$

Instructions

Fixed length  
(opcode variable)

Ex: instr. size = 16  
Address size = 5

15



Variable length  
(opcode fixed)

Ex: opcode size = 6

Address size = 5

16



\*> CPU supports 2-Addrs. & 1-Addrs. instr.

Instruction size = 16b

Address size = 6b

# 2-Addrs. instr. = 10

# 1-Addrs. " = ? = 7L

16

6 6

Remaining from 2-Addrs. =  $2^1 - 10$

= 6

16

10 6

6

$6 \times 2^6 = 72$

$\Rightarrow 7L = 384$

power cons. ↑

| A              | B              | X <sub>RESULT</sub> | A+B              | result                 | Add. " table size     | Mul. " table size     |
|----------------|----------------|---------------------|------------------|------------------------|-----------------------|-----------------------|
| 4b             | 4b             | 5b                  | 8b               | $2^8 \times 5b$        | $2^8 \times 8b$       | $2^8 \times 8b$       |
| n <sub>b</sub> | n <sub>b</sub> | n+1 b               | 2 <sup>n</sup> b | $2^{2n} \times (n+1)b$ | $2^{2n} \times (2n)b$ | $2^{2n} \times (2n)b$ |

\*> 32Kx1 RAM chip  $\Rightarrow$  32Kb capacity

\*> Multiple chips in single memory:-

2) Vertical arrangement: - used when no. of addresses required are more than no. of addresses in a single chip.

Ex: 4 chips of size  $128 \times 8b$  (each)

$\Rightarrow$  Total capacity =  $512 \times 8b$

CPU generates 6A = 9b



2) Horizontal arrangement: Used when data size on each address required more than a chip capacity.

Ex: 2 chips of  $128 \times 8$  b  $\Rightarrow$  7b LA  
we need capacity of  $128 \times 16$  b



3) Hybrid: Used when no. of addrs & no. of data bits both are required more than 1 chip

Ex: From  $128 \times 8$  b chips  $\Rightarrow$  7b LA  
create  $256 \times 16$  b chip.

$$\#(\text{cols}) = X \\ 8 + 8 + 8 + \dots + X = 16 \\ \Rightarrow X = 2$$



$$\# \text{rows} = \frac{256}{128} = 2 \Rightarrow 1:2 \text{ decoder}$$

i.e.



1:2  
Decoder

Q1) Required capacity

|                    | 1 chip capacity   |
|--------------------|-------------------|
| $128K \times 8$ b  | $128K \times 8$ b |
| $256M \times 16$ b | $64M \times 8$ b  |
| $128M \times 64$ b | $128M \times 8$ b |
| $4G \times 16$ b   | $256M \times 8$ b |

Arrangement  
#chips | rows | cols

- 1:  
8 : 4 x 2
- 8 : 1 x 8
- 32 : 16 x 2

→ Effective hit ratio =  $\frac{\text{no. of time accessed (MM)}}{\text{no. of time not accessed (MM)}}$

- 2) Write back: MM updated at the end
- $\rightarrow \text{EMAT} = h \cdot T_{cm} + (1-h)(T_{mm} + \text{write back})$
- (or)  $(T_{mm} + T_{cm} + \text{write back})$

\*> DRAM refresh:



$$1 \text{ chip refresh time} = K \text{-chip refresh time} = \frac{\text{no. of rows}}{\text{time}}$$

Note:

$$\text{Ex: } h = 0.9 \\ \% \text{ read op's} = 70\% \\ \% \text{ write op's} = 30\%$$

MM not accessed

write back = dirty %  $\times$  Tblock  
(i.e. fraction of dirty blocks) (i.e. MM block access time)



## I) Direct mapped cache:

### a) MUX:

$$\# \text{ of MUX} = \text{Tag bits} \Rightarrow \\ \# \text{ of MUX} = \text{no. of blocks in cache} : 1 = 2^{80} : 1$$

### b) comparator:

$$\# \text{ of comparator} = 1 \\ \text{Size} = \text{Tag bit} \Rightarrow$$

$$\Rightarrow \text{Hit latency} = \text{MUX Delay} + \text{comparator delay}$$

## II) Set associative: (K-wars)

### a) MUX:

$$\# = K \times \text{Tag bits}$$

$$\text{Size} = 2^{\text{set no.}} : 1 ; \text{set no.} = \text{bits for set no.}$$

### b) comparators:

$$\# = K$$

$$\text{Size} = \text{Tag bits}$$

### c) OR gate

$$\# = 1$$

$$\text{ips} = K$$

} can be implemented  
using  $K:1$  MUX

$$\text{d) Hit latency} = \text{MUX delay} + \text{comparator delay} + \text{OR gate delay}$$

## Fully associative:

### a) MUX:

NO MUX needed

### b) comparator:

$$\# = 2^{80}$$

$$\text{Size} = \text{Tag bits}$$

### c) OR gate

$$\# = 1$$

$$\text{ips} = 2^{80}$$

$$\text{d) Hit latency} = \text{comparator delay} + \text{OR gate delay}$$

### Note) Control unit design:-

1) Hardwired (fastest)

2) Horizontal

3) Vertical (slowest)

microprog

→ Control Unit Design → Hardwired CU  
 → Hardwired Control Unit:- → Microprogrammed CU

14

## Question GATE-2005

A hardwired CPU uses 10 control signals S1 to S10, in various time steps T1 to T5, to implement 4 instructions I1 to I4 as shown below:



|    | T1         | T2                 | T3                 | T4         | T5         |
|----|------------|--------------------|--------------------|------------|------------|
| I1 | S1, S3, S5 | S2, S4, S6         | S1, S7             | <b>S10</b> | S3, S8     |
| I2 | S1, S3, S5 | S8, S9, <b>S10</b> | <b>S5</b> , S6, S7 | S6         | <b>S10</b> |
| I3 | S1, S3, S5 | S7, S8, <b>S10</b> | S2, S6, S9         | <b>S10</b> | S1, S3     |
| I4 | S1, S3, S5 | S2, S6, S7         | <b>S5, S10</b>     | S6, S9     | <b>S10</b> |

Which of the following pairs of expressions represent the circuit for generating control signals S5 and S10 respectively?

(A)

$$S5 = T1 + T2 \cdot T3 \text{ and}$$

$$S10 = (I1 + I3) \cdot T4 + (I2 + I4) \cdot T5$$

(B)  $S5 = T1 + (I2 + I4) \cdot T3 \text{ and}$

$$S10 = (I1 + I3) \cdot T4 + (I2 + I4) \cdot T5$$

(C)  $S5 = T1 + (I2 + I4) \cdot T3 \text{ and}$

$$S10 = (I2 + I3 + I4) \cdot T2 + (I1 + I3) \cdot T4 + (I2 + I4) \cdot T5$$

(D)  $S5 = T1 + (I2 + I4) \cdot T3 \text{ and}$

$$S10 = (I2 + I3) \cdot T2 + I4 \cdot T3 + (I1 + I3) \cdot T4 + (I2 + I4) \cdot T5$$

For  $S5$

$$= T_1 (I_1 + I_2 + I_3 + I_4) + T_3 (I_2 + I_4)$$

$$T_1 \cdot I_1 + T_3 \cdot I_2 + T_3 \cdot I_4$$

$$= T_1 + I_2 T_3 + I_4 T_3$$

For  $S10$

$$= I_2 T_2 + I_3 T_2 + I_1 T_4 + I_3 T_4 + I_2 T_5 + I_4 T_5 + I_4 T_3$$

$$= (I_2 + I_3) T_2 + (I_1 + I_3) T_4 + (I_2 + I_4) T_5 + I_4 T_3$$

## Question GATE-2004

A CPU has only three instructions I1, I2 and I3, which use the following signals in time steps T1 - T5.

I1 : T1 : Ain, Bout, Cin

T2 : PCout, Bin

T3 : Zout, Ain

T4 : Bin, Cout

T5 : End

I2 : T1 : Cin, Bout, Din

T2 : Aout, Bin

T3 : Zout, Ain

T4 : Bin, Cout

T5 : End

I3 : T1 : Din, Aout

T2 : Ain, Bout

T3 : Zout, Ain

T4 : Dout, Ain

T5 : End

$$\cancel{I_1 (T_1 + T_3)} = I_1 T_1 + I_1 T_3 + I_2 T_3 +$$

$$+ \cancel{I_2 (T_2)} = I_1 T_1 + (I_1 + I_2 + I_3) T_3 +$$

$$+ \cancel{I_3 (T_2 + T_4 + T_5)} = I_1 T_1 + I_3 T_2 + I_3 T_4 +$$

$$= I_1 T_1 + I_3 (T_2 + T_4) + \cancel{T_5}$$

$$I_3 T_2 + I_3 T_3 + I_3 T_1$$

$$I_3 T_2 + I_3 T_4$$

$$T_3$$

Which of the following logic functions will generate the hardwired control for the signal Ain ?

- A.  $T1 \cdot I1 + T2 \cdot I3 + T4 \cdot I3 + T3$   
 C.  $(T1 + T2) \cdot I1 + (T2 + T4) \cdot I3 + T3$

- B.  $(T1 + T2 + T3) \cdot I3 + T1 \cdot I1$   
 D.  $(T1 + T2) \cdot I2 + (T1 + T3) \cdot I1 + T3$

- Microprogrammed CPU :-
- 1) Horizontal MCU :-
- Q: #instr's = 256
- ✓ Each instr.  $\Rightarrow$  16 microops
- Ctrl mem. size = ?
- $\# \text{microoperations} = 256 \times 16 = 2^{12}$
- $\Rightarrow \text{Address} = 12 \text{ b}$

- Q: 8 status bits  $\Rightarrow$  MUX select lines = 3b  
(in MUX input)
- one microoperation is 26b; Ctrl signal = 13b  
Ctrl mem. size = ?



## 2) Vertical MCU :-

- Q: #control signals = 120
- Control signals divided into groups :-
- |                   |                   |                   |                  |                   |                   |
|-------------------|-------------------|-------------------|------------------|-------------------|-------------------|
| G <sub>1</sub> 30 | G <sub>2</sub> 13 | G <sub>3</sub> 12 | G <sub>4</sub> 3 | G <sub>5</sub> 27 | G <sub>6</sub> 35 |
| 55                | 4b                | 4b                | 2b               | 5b                | 6b                |
- $$\sum = 120 \text{ b}$$
- $$\sum = 26 \text{ b}$$

#Bits saved on using vertical MCU as compared to Horizontal MCU = ?

Therefore, 26b for control signal in vertical MCU  
120b " " " " Horizontal "

$$\Rightarrow \# \text{bits saved (per control signal)} = 120 - 26 = 94$$

- Q: ... same as previous.
- Mux select lines = 4b
- #microoperations = 2<sup>11</sup>  $\Rightarrow$  address = 11b
- Size of memory = ?



$$\text{Mem. size} = 2^{11} \times 135 \text{ b}$$



$$\text{Mem. size} = 2^{11} \times 41 \text{ b}$$

- Q: #ctrl signals = 100
- 94 ctrl signals are grouped as :-
- Remaining 6 ctrl signals can't be kept in any group  
(These 6 ctrl signals have to be kept in horizontal manner)

|                   |                   |                  |                  |                   |                   |
|-------------------|-------------------|------------------|------------------|-------------------|-------------------|
| G <sub>1</sub> 30 | G <sub>2</sub> 25 | G <sub>3</sub> 2 | G <sub>4</sub> 5 | G <sub>5</sub> 10 | G <sub>6</sub> 22 |
| 6b                | 5b                | 1b               | 3b               | 1b                | 5b                |

$$\sum = 94 \text{ b}$$

$$\sum = 23 \text{ b}$$

Therefore, for ctrl signal in Vertical MCU = 23 + 6 = 29b  
" " " " " " " " Horizontal " = 100b