

co

(5)

Memory organisation: Hierarchy, cache memory - Address mapping  
 - updation  
 - Replacement technique

principle of virtual and associate memories

2. Instructions, Addressing modes, Instruction pipeline (fixed point and floating point arithmetic)
3. control unit design -  $\mu$  operations, HW control unit design,  $\mu$ -programmed control unit design, Tracing assembly program
4. IO : addressing, data transfer technique, disk and tape memories.

Textbook:

computer architecture : J-P. Hayes

computer organisation : paul chowdary

→ computer architecture - It does not deal with physically touchable components just like civil architect

- It concerns with conceptual design of the system
- It concerns instructions, addressing modes & data format.

Based on instructions, systems are

1) CISC (complex Instruction set)

2) RISC (Reduced Instruction set)

↳ Here every instruction completes in 1 clock cycle  
 ie. clocks per Instruction (CPI) = 1

CISC - More instructions and CPI is different for different instructions  
 - programmer overhead is less  
 - compiler designer overhead is more

RISC - constant set of instructions  
programmer overhead is more  
compiler designer overhead is less

Addressing modes

↓  
how operand reference is given in the instruction  
more addressing modes → more flexibility

Immediate Direct Indirect Relative

CISC has more flexibility as it has more addressing modes than RISC

Data format

↓  
How to interpret the binary string

- Eg: 1) Some systems use 7 bits for character ⇒ ASCII  
2) Some systems use 8 bits for character ⇒ EBCDIC

computer organisation:

- It deals with logical design of the system
- It concerns with physical devices and their interconnection

with the perspective of improving the performance  
similar to library organisation  
(how to arrange books)  
goal: Reducing average access time

Performance of a machine ⇒ Amount of work done in unit

no. of instructions executed, measured in  
MIPS (Million Instructions per second)

suppose MIPS = x then goal is increasing x.  
this can be done by reducing the  
time for each individual instruction

floating point



3 memory references

} without changing H/W  
and S/W, performance  
is increase



1 memory reference to retrieve  
number

→ In RISC more no. of registers.

Accessing registers takes less time than accessing memory

↑ parameter sharing is easy in RISC  
Eg: nested subroutines

2.

90% to 10% rule:

90% of the program time is consumed by 10% of program code. This 10% program code is called as locality of reference. This locality of reference is supplied to the processor fastly so the execution time of program could be reduced

∴ Increase in performance for this purpose  
1st memory is introduced  
this is cache memory.

→ Every instruction

Hardware Fetch (H)

H/W2 - Decode (D) Execution cycle

H/W3 - Execute (E)

H/W3 - Store (S)

i.e. Each stage requires different hardware

At any time, an instruction utilises only one stage H/W

∴ idle H/W is  $\frac{3}{4}$  (conventional way)

- So a technique that allows efficient utilization of resources = Instruction pipeline

- overlapping of instruction stages



Here individual instruction time is not reduced.  
But the average instruction time is reduced

Here all the resources are busy

- Data is either in memory (or) device finally



3.01011

### Classification of Computer Architecture:

#### 1) memory based architecture

- von neumann
- stored program
- Inplace modification



#### 2) Harvard Architecture

- similar to von neumann
- program & data are separately stored.
- parallel access of data & instructions (program)
- modification of program does not disturb data and vice versa



super scalar architecture

- separate processing of integers and floating point numbers

FPU

FPU

FLYN architecture:

Basis: Instruction stream and data stream (CU, PEM, MU)

SISD - single instruction stream & single data stream

SIMD - " " " & multiple

MISD -

MIMD - parallel processing

(n-control units (CU), m-processing element (PEM), memory modules (PEM))

Memory organisation:

not physically placing memories. But what information is to be stored in which memory — Memory organisation

Memory hierarchy:



In the memory hierarchy,  $i^{\text{th}}$  level memory is placed higher than  $(i+1)^{\text{th}}$  level memory. (Time of access:  $T_i < T_{i+1}$ )

size of memory

cost/bit

frequency of access

Instructions

$$S_i < S_{i+1}$$

$$C_i > C_{i+1}$$

$$f_i > f_{i+1}$$

$$I_i \subset I_{i+1}$$

The performance of hierarchy is given by hit ratio. (Availability of referred information at referred level is called as hit).

The hit ratio of bottom most memory is 1 ie hit ratio of disk memory is 1.

Hitratio ( $H$ ) & size ( $S$ )

$$H \propto \frac{1}{T}$$

avg.access time ( $T_{ave}$ )

( $H \propto T^{-1}$ ) can't be derived  
 $H \propto \frac{1}{T}$

Because individual access times are not affected by hit)

The side effect of memory hierarchy is data inconsistency. Same information is available differently at different levels. This problem is resolved using proper updation techniques. Mathematical expressions for memory hierarchy:

Let  $H_1, T_1, S_1, C_1$  gives the specifications of Level 1 memory of 2 level system.  $H_2, T_2, S_2, C_2$  are the specifications of level 2 memory.



$$1) H_2 = 1$$

$$2) \text{Size of total memory} = S = S_1 + S_2$$

$$3) \text{Average cost/bit} \quad C_{ave} = \frac{C_1 S_1 + C_2 S_2}{S_1 + S_2}, \quad \begin{matrix} \text{total cost of level 2 system} \\ \text{total cost of level 1 system} \end{matrix}$$

$$\text{Average access time } T_{ave} = H_1 T_1 + (1-H_1) T_2 \quad \begin{matrix} : \\ (\text{Default}) \end{matrix}$$



$$\text{If } H_1 = 1 \Rightarrow T_{ave} = T_1$$

$$\text{If } H_1 = 0 \Rightarrow T_{ave} = T_2$$

↓ if it is available in level 1

The above hierarchy is not strict hierarchy  
processor can interact with only one level of memory



$$T_{ave} = H_1 T_1 + (1-H_1) (T_1 + T_2) = T_1 + T_2(1-H_1)$$

when there is miss in level 2, instead of taking a word from level 2 to level 1, a group of words are moved (focality of reference).



From bus, data is moved word by word. But we are moving a block. At a time it can't be placed so placement time is to be considered  
placement time,  $T_B = N(T_2 + T_1)$

In general,

for  $n$  level

$$T_{ave} = T_1 + (1-H_1)T_{B_1} + (1-H_1)(1-H_2)T_{B_2} + \dots + (1-H_1)(1-H_2)\dots(1-H_{n-1})T_{B_{n-1}}$$

p) consider a 2-level memory system, in which the average access time without level 1 is 120ns, with level 1 it is 30ns the level 1 access time is 20ns. what is hit ratio?

Sol: for a 2 level system

$$T_{ave} = H_1 T_1 + (1-H_1) T_2$$

Given  $T_2 = 120\text{ns}$

$T_{ave} = 30\text{ns}$

$T_1 = 20\text{ns}$

$$\Rightarrow 30 = H_1 \times 20 + (1-H_1) \times 120$$

$$100H_1 = 90 \Rightarrow H_1 = 0.9$$

P) consider a 2 level memory system in which the access speed of level 1 memory is 5 times faster than level 2. The hit ratio is 0.8. It is needed to increase the average access time by 20% from 60ns

- 1) what is the access time of level 1 memory
  - 2) what is the n
  - 3) what is the new hit ratio
  - 4) what is the % of change in hit ratio
  - 5) If hit ratio = 100%, what will be the access time of  $T_1$  &  $T_2$

Sol: consider default.  $T_{ave} = H_1 T_1 + (1-H_1) T_2$

Given  $T_1 = \frac{T_2}{5}$  (level 1 is 5 times faster than level 2)

Initially  $H_1 = 0.8$

$$T_{ave} = 60\text{ns}$$

$$\Rightarrow 60 = (0.8)T_1 + (1-0.8)5T_1$$

$$60 = 0.8T_1 + T_2$$

$$\Rightarrow T_1 = \frac{60}{1.8} = 33.3 \text{ ns}$$

$$\therefore T_2 = 166.5 \text{ ns}$$

∴ Access time of level 1 = 333ns

Access time of level 2 = 166.5 ns

$$\text{iii) } T_{\text{new}} = ?, \quad T_{\text{aven}} = T_{\text{ave}} + (20\%) T_{\text{ave}}$$

$$= 60 + \frac{20}{100} \times 60$$

= 42.06

$$\Rightarrow T_2 = H_N(33.3) + (1-H_N)(166.5)$$

$$72 = -4 \times 33.3 H_N + 166.5 \Rightarrow H_N = 0.71$$

iv)  $H_1$ : Hit ratio reduced

$$\begin{aligned} \text{Hold} & \xrightarrow{H_1=0.8} \text{Hit} \\ 0.8 & \rightarrow 0.71 \\ \text{ie for } & 0.8 \rightarrow 0.09 \text{ reduction} \\ 100 & \rightarrow ? \\ & = \frac{0.09}{0.8} * 100 \\ & = 11.2\% \end{aligned}$$

v) Hit ratio does not change individual access times. Hence

$$T_1 = \frac{100}{3} \text{ ns} = 33.3 \text{ ns}$$

$$T_2 = \frac{500}{3} \text{ ns} = 166.6 \text{ ns}$$

\* Hit ratio effects average access time.

b) consider 3-level memory system in which individual access times are  $T_1 = 10 \text{ ns/word}$ ,  $T_2 = 100 \text{ ns/word}$ ,  $T_3 = 1000 \text{ ns/word}$  and  $H_1 = 0.8$ ,  $H_2 = 0.9$ ,  $H_3 = 1$ . If the referred word is not available in level 1 memory, a 2 word block is moved from level 2 to level 1. and referred word is handed over to the processor. If it is not present in level 2, a 4 word block is first moved from level 3 to level 2 and the concerned block is taken to level 1. what will be the average access time and throughput of the system

Sol: without placement time

$$T_{ave} = T_1 + (1-H_1)T_{B_1} + (1-H_1)(1-H_2)T_{B_2}$$

$$T_{B_1} = 2 * T_2, \quad T_{B_2} = 4 * T_3$$

$$= 10 + (1-0.8)2 * 100 + (0.2)(0.1)4 * 1000$$

$$T_{ave} = 130 \text{ ns}$$

No. of words accessed/transferred per second

throughput =

$$= \frac{1}{T_{ave}} \text{ word/sec}$$

$$= 7.6 \text{ Mwords/sec}$$

Suppose placement time is considered

$$T_{ave} = T_1 + (1-H_1)T_{B_1} + (1-H_2)(1-H_1)T_{B_2}$$

$$T_{B_1} = 2(T_2 + T_1) \quad T_{B_2} = 4(T_3 + T_2)$$

44  
58  
40  
2

$$T_{ave} = 10 + (0.2) * 2 = 110 + (0.2)(0.1) 4 * 1100 \\ = 142 \text{ ns}$$

$$\text{Throughput} = \frac{1}{142 \times 10^9} = 7.04 \text{ Mwords/sec}$$

011111

### Address mapping

- cache memory and main memory are to be divided into equal sized partitions
- Address mapping decide which main memory block is to be placed where in the cache
  - 1. Direct mapping
  - 2. Associate mapping
  - 3. Set-associate mapping
- Direct mapping:
  - $k^{\text{th}}$  main memory block in  $(k \bmod n)^{\text{th}}$  cache location
  - Tag bits denote which main memory block is currently residing in that cache position

e.g.:



main memory blocks

0, 4, 8, 12, 16, 20, 24, 28 compete for cache block '0'  
↓ ↓  
ccc 001  
110 111  
8 blocks  $\Rightarrow$  3 bits needed

for cache block 1  $\rightarrow$  1, 5, 9, 13, 17, 21, 25, 29

for cache block 2  $\rightarrow$  2, 6, 10, 14, 18, 22, 26, 30

for cache block 3  $\rightarrow$  3, 7, 11, 15, 19, 23, 27, 31

Therefore 4 clusters & each cluster having 8 blocks

$\therefore$  No. of clusters = No. of cache blocks

$$\text{No. of blocks in each cluster} = \frac{\text{No. of mainmemory blocks}}{\text{No. of clusters or cache blocks}} \\ = \frac{M}{N}$$

$$\text{No. of tag bits} (t) = \log_2 \left( \frac{M}{N} \right)$$

$\rightarrow$  suppose tag information

cache memory



$\therefore$  Different main memory blocks are in cache

$\therefore$  Non contiguous

- direct mapping - address mapping is physical address



the location of addressed word in the block

The referenced word is likely to be in which cache block

If word is present in cache, this tag should match with tag of associated cache block

the location of addressed word in the block

Suppose there are  $p$  words/block

$$\text{word offset} = \log_2 p$$

$$\text{cache block offset} = \log_2 N$$

$$\text{tag} = \log_2 (\frac{M}{N})$$

Eg: whether 374 word of main memory is present in cache block?

So: physical address requires 9 bits as it is 512 word main memory

$$374 = \begin{array}{ccccccccc} 256 & 128 & 64 & 32 & 16 & 8 & 4 & 2 & 1 \\ & & & & \underline{0} & \cdot & \cdot & \cdot & \cdot \\ & & & & 1 & 1 & 0 & 1 & 0 \end{array}$$

→ main memory block = 10111<sub>2</sub>

Here

$$\text{word offset} = \log_2 16 = 4 \text{ bits}$$

$$\text{Cache block offset} = \log_2 4 = 2 \text{ bits}$$

$$\text{tag} = \log_2 8 = 3 \text{ bits}$$

Tag Cache block offset word offset

|   |   |   |   |   |   |   |
|---|---|---|---|---|---|---|
| 1 | 0 | 1 | 1 | 0 | 1 | 0 |
|---|---|---|---|---|---|---|

23rd block

$\downarrow$   
 $23:4 = 3$  block  
(cache)

this word has to be placed in cache block 3

The tag information of cache block is 110

But this physical address has tag 101

Not matched

The referred word is not present in cache

→ The reference to the word 374 is resulting in miss since the 3rd block tag information is not matching with tag portion of physical address 374 word

Eg: check whether the word 293 is present in cache

(or) not

$$293 = \begin{array}{ccccccccc} 256 & 128 & 64 & 32 & 16 & 8 & 4 & 2 & 1 \\ & & & \underline{0} & \cdot & \cdot & \cdot & \cdot & \cdot \\ & & & 1 & 0 & 0 & 1 & 1 & 0 \end{array} \rightarrow \text{cache position} = 18:4 = 2 \text{ block}$$

cache block 2

Tag matched, word is present

since the physical address of word 293 is referring the  $2^{14}$  cache block and its tag address is matching with 2nd cache block tag address, the reference is resulting hit

HW implementation for direct mapping:

- To say whether the word is present (or) not we checked only one cache block i.e. compared with tag of a particular cache block

K mod N block

∴ Tag comparator is needed

one input is tag of physical address

other input is tag of selected cache block.

7

for this a MUX is needed

Its select lines are CB offset bits in physical address



→ Hit/Miss

Time for knowing

Internally comparator is



$$\text{Time} = T_{\text{mux}} + T_{\text{comparator}}$$

key in tag comparator  $\Rightarrow T_{\text{Comp}} = T_{\text{EX-NOR}} + T_{\text{AND}}$

$\rightarrow$  Here one problem

multiplexer can give output only one bit at a time

But tag is 3 bits

Hence 3 multiplexers are needed



$\rightarrow$  No. of multiplexers required = No. of tag bits  
 $= \log_2(\frac{N}{B})$

Size of each mux = No. of cache blocks  $\times 1$   
 $= N \times 1$

$\Rightarrow$  The direct mapping is having the limitation which may result too many block replacements. It is exposed to conflict penalties i.e. the presence one main memory block in cache is obstructing incoming block (Main memory) into the cache for same position inspite of other blocks are free.

e.g. consider main memory block references

|           |   |   |   |   |   |   |    |   |    |
|-----------|---|---|---|---|---|---|----|---|----|
| 4         | 8 | 0 | 4 | 8 | 4 | 0 | 12 | 0 | 12 |
| (k mod 4) | 8 | 8 | 8 | 8 | 8 | 8 | 0  | 0 | 0  |
| 0         | 4 | 8 | 0 | 4 | 8 | 0 | 12 | 0 | 12 |
| 1         |   |   |   |   |   |   |    |   |    |
| 2         |   |   |   |   |   |   |    |   |    |
| 3         |   |   |   |   |   |   |    |   |    |

0 hits  
10 faults/miss

This problem is resolved using associate mapping

Associate mapping:

In associate mapping, any block of main memory can be placed anywhere in the block

e.g. 4 8 0 4 8 4 0 12 0 12  
F F F H H H F H H

|   |    |
|---|----|
| 0 | 4  |
| 1 | 8  |
| 2 | 0  |
| 3 | 12 |

6 hits

4 faults / miss

$$\text{Hit ratio} = \frac{6}{6+4} = 0.6$$

In associate mapping, the physical address is divided into two portions



all the main memory blocks are competing for each cache block

⇒ tag bits are more

in blocks (main memory)

$$\text{tag bits} = \log_2 M$$

No. of clusters = 1 (All main memory blocks)

To know whether a word is present (or) not in cache, each cache block is to be searched.

⇒ one tag comparator is needed for each cache block

No. of tag comparators = No. of cache blocks

If all tag comparators says miss then the word is not present in cache block

- There is no need of multiplexer



- It requires more tag bits, more no. of comparators  
⇒ More H/W hence cost is more
- It is fastest in mapping

### Set-associate mapping:

The set-associate mapping contains advantages of both direct mapping and associate mapping.

- The cache is divided into logical sets
- In K-way set-associate mapping, each set is allocated with K-cache blocks

$$\text{No. of sets} = S = \frac{\text{No. of cache blocks (N)}}{K}$$

- The physical address is divided into



e.g.:  $N = 4$ , 2-way set association,  $M = 32$ .

$$\therefore S = \frac{4}{2} = 2$$



If it is one-way set associate mapping then it reduces to direct mapping (No. of sets = No. of blocks)

If it is N-way set associate mapping then it reduces to associate mapping

Ques 11/11

2 million word main memory and 4K word cache are partitioned into 256 word blocks. The word addressable memory has 32-bit words.

1) what is the size of main memory & cache memory.

$$\begin{aligned}\text{Main memory size} &= 2 \times 2^{20} \times 32 \text{ b} \\ &= 2 \times 2^{20} \times 2^2 \text{ B} \\ &= 8 \text{ MB}\end{aligned}$$

$$\begin{aligned}\text{Cache memory size} &= 4 \times 2^{10} \times 32 \text{ b} \\ &= 4 \times 2^{10} \times 2^2 \text{ B} \\ &= 16 \text{ KB}\end{aligned}$$

2) How many bits are required to address physical memory?

It is word addressable

$$\begin{aligned}\text{Total words} &= 2 \times 2^{20} \text{ words} \\ &= 2^{21}\end{aligned}$$

∴ 21 bits are required

3) How many blocks are present in cache memory & main memory

Total no. of blocks in main memory

$$M = \frac{\text{Total words in main memory}}{\text{blocksize}}$$

$$= \frac{2^{21}}{2^8} = 2^{13} = 8 \text{ K blocks}$$

Total number of blocks in cache memory

$$N = \frac{\text{Total words in cache memory}}{\frac{\text{blocksize}}{2^{12}}} \text{ in K blocks}$$

If it is one-way set associate mapping then it reduces to direct mapping (No. of sets = No. of blocks)

If it is N-way set associate mapping then it reduces to associate mapping

Ques

Q) 2 million word main memory and 4K word cache are partitioned into 256 word blocks. The word addressable memory has 32-bit words

i) what is the size of main memory & cache memory.

$$\begin{aligned}\text{Main memory size} &= 2 \times 2^{20} \times 32 \text{ b} \\ &= 2 \times 2^{20} \times 2^2 \text{ B} \\ &= 8 \text{ MB}\end{aligned}$$

$$\begin{aligned}\text{Cache memory size} &= 4 \times 2^{10} \times 32 \text{ b} \\ &= 4 \times 2^{10} \times 2^2 \text{ B} \\ &= 16 \text{ KB}\end{aligned}$$

ii) How many bits are required to address physical memory?  
It is word addressable

$$\begin{aligned}\text{Total words} &= 2 \times 2^{20} \text{ words} \\ &= 2^{21}\end{aligned}$$

∴ 21 bits are required

iii) How many blocks are present in cache memory & main memory.

Total no. of blocks in main memory

$$M = \frac{\text{Total words in main memory}}{\text{blocksize}}$$

$$= \frac{2^{21}}{2^8} = 2^{13} = 8 \text{ K blocks}$$

Total number of blocks in cache memory

$$N = \frac{\text{Total words in cache memory}}{\text{blocksize}}$$

$2^{12} \text{ in LRU}$



$$Time \text{ for } \text{Killing} = T_{\text{Max}} + \text{Compare}$$

An internal comparator is

Time for killing

hit/miss



bits in physical address

if select lines are CB offset

needed

for this  $\Rightarrow$  MUX is

$t$  after IP is tag of selected cache block

one IP is tag of physical address

: tag comparator is needed

kinda block

$\cancel{\text{but}}$  implemented for direct mapping  
i.e. compared with tag of a particular cache

only one cache block

to say whether the word is present (or) not we checked

this implementation for direct mapping

tag address, the reference is resulting in

cache block and if tag address is matching with 2nd cache

since the physical address of word 293 is referring the 2nd

The blocksize in cache 2 is 64 words. A system of 32 bit main memory address with 16 bit word employees 256KB cache memory. Compute the tag bits & tag memory for each case.

### Cache 1 design:

$$\text{Main memory blocks} = \frac{\text{size}}{2^4} = 256 \text{ blocks } 2^{32} / 2^4 = 8GB$$

$$\text{cache 1 block size} = 1K \text{ words} = 2KB$$

$$\text{cache size} = 256 \text{ KB}$$

$$= \frac{256 \text{ KB}}{2^10}$$

$$= 128 \text{ Kwords}$$

$$\therefore \text{No. of blocks in cache 1} = \frac{128K}{1K} = 128 \text{ blocks}$$

It is 4-way set association

$$\therefore \text{No. of sets} = \frac{128}{4} = 32 \text{ sets}$$



1K words &

It is word addressable

$$\therefore \text{Tag bits} = 17$$

$$\text{Tag memory} = N * \text{tag}$$

$$= 128 * 17$$

### Cache 2 design:

$$\text{cache 2 block size} = 64 \text{ words} = 128 \text{ bytes}$$

$$\text{cache size} = \frac{256 \text{ KB}}{128 \text{ B}} (\text{OR}) \quad \frac{128 \text{ KB}}{64 \text{ W}}$$

$$= 2K \text{ blocks}$$

It is 8-way set association

$$\therefore \text{No. of sets} = \frac{2K}{8} = 2^8 \text{ sets}$$



Tag bits = 18

$$\text{Tag memory} = N \times \text{tag}$$

$$= 2K \times 18$$

$$= 36K \text{ bits}$$

p) consider a main memory of  $2CK$  blocks and 2-way set association is used. The position of  $p$ th block of main memory in cache of  $2C$  block is

- (a)  $p \bmod 2K$  set
- (b)  $p \bmod K$  set
- (c)  $p \bmod 2C$  set
- (d)  $p \bmod C$  set

$$\begin{aligned} \text{No. of sets} &= \frac{\text{No. of blocks in cache}}{K} \\ &= \frac{2C}{2} \\ &= C \end{aligned}$$

In set association  $p$ th block is placed in  $p \bmod s$  set where  $s$  is total no. of sets

p) consider a direct mapped cache, where the no. of tag bits is equal to the no. of words in block. No. of words in cache equal to no. blocks of in cache. Let there are  $N$  blocks in cache, what is the size of physical memory in words.

- (a)  $N^2 \cdot 2^N$
- (b)  $2N \cdot 2^N$
- (c)  $N \cdot 2^N$
- (d) none



cache has  $N$  blocks

each block has  $N$  words

$\therefore$  TAG bits =  $N$



$$d = N + \log_2 N + \log_2 N$$

$$= N + \log_2 N^2$$

$$\text{No. of bits in physical address} = N + \log_2 N^2$$

$$\therefore \text{Total no. of words} = 2^{N + \log_2 N^2}$$

$$= 2^N \cdot 2^{\log_2 N^2}$$

$$= 2^N \cdot N^2$$

) A system employs 1000 references. Out of them 100 references result conflict penalties, 150 references result compulsory penalties. Another 150 references result capacity penalty. What will be the hit ratio with direct mapping and associate mapping?

Sol:

Direct Mapping  $\rightarrow$  conflict penalties present ✓

Associate "  $\rightarrow$  NO "

$$\text{Miss ratio in direct mapping} = \frac{100 + 150 + 150}{1000} = \frac{400}{1000} = 0.4$$

$$\therefore \text{Hit ratio} = 1 - 0.4 = 0.6$$

$$\text{Miss ratio in associate mapping} = \frac{150 + 150}{1000}$$

$$= \frac{300}{1000} = 0.3$$

$$\therefore \text{Hit ratio} = 1 - 0.3 = 0.7$$

compulsory penalties are due to fixed block size. The more the size of the block, the less the compulsory penalties

e.g. If block size = 16 word

and we request 17 words  $\Rightarrow$  new block  
 $\therefore$  penalty

If block size = 24 words then no penalty.

capacity penalties are due to the fixed size of cache.  
The more the size of the cache, the less the capacity penalties

- If cache size = Main memory then all the penalties are absent

### updation techniques:

- The write operation result cache coherence problem where as the read operation does not give any problem. In order to reduce/avoid this problem two updation techniques are proposed.

1. write-through

2. write-back

- In the write-through updation, the cache memory and main memory are simultaneously updated. The cache coherence problem is eliminated with write-through updation in a single processor system.

$$\text{the time for updation} = \max(T_c, T_m)$$
$$= T_m$$

$T_c$  - cache memory access time

$T_m$  - Main " "

The write-through updation gives better performance for less number of updations.

In write-back updation, the updation of main memory is done only when the concerned block in cache is chosen for replacement. only last updation is visible to in the main memory. To know whether the block in the cache is updated (or) not an additional bit (dirty bit) is allocated. At the time of replacement, the block that is chosen for replacement require either it is to be copied back into the main memory before getting the incoming block (dirty bit=1) (OR) it is simply overwritten with incoming block (dirty bit=0).

Time for updation is:

$$T_{\text{updation}} = \underbrace{T_B + T_C}_{(T_B^{\text{copy back}} + T_B^{\text{incoming}})} \quad (\text{if dirty bit}=1)$$

$$(T_B^{\text{copy back}} + T_B^{\text{incoming}})$$

$$= T_B + T_C \quad (\text{if dirty bit}=0)$$

Eg: for  $\{ i=0; i<1000; i++ \}$



In write through cache block updated 1000 times  
main memory " 1000 "

In write back cache is updated 1000 times  
Main memory " only 1 time

Consider a cache memory which has 80% hit for read operation and 90% hit for write operation. The cache is 5 times faster than main memory. The main memory access time is 10ns. Let there are 100 references for write operation. Answer the following.

question is using write through strategy.

1) what is the average read access time

" " write "

2) " access time

4) what is the throughput of system.

Note: whether it is read miss (or) ~~average~~ write miss a word block is first moved from main memory to cache memory

Sol:

1) Average read access time

$$T_{Ave_r} = H_r T_c + (1-H_r)(T_B + T_c) \quad \text{Here no replacement}$$

$$T_m = \text{Main memory access time} = 100 \text{ ns}$$

$$T_c = \text{Cache } " " " = \frac{100}{5} = 20 \text{ ns}$$

$$H_r = \text{Read hit ratio} = 80\% = 0.8$$

$$T_{Ave_r} = 0.8 \times 20 + (1-0.8)(2 \times 100 + 20)$$

$$= 16 + 44$$

$$= 60 \text{ ns}$$

2) Average write access time

$$T_{Ave_w} = H_w \times T_{updation} + (1-H_w)(T_B + T_{updation})$$

write-allocate - updated both in Cache & Main memory.

At is default

$$T_{updation} = \text{Max}(T_c, T_m)$$

$$= 100 \text{ ns}$$

$$T_{Ave_w} = 0.9 \times 100 + (1-0.9)(2 \times 100 + 100)$$

$$= 90 + 30$$

$$= 120 \text{ ns}$$

3) Average access time  $\Rightarrow$  for a single word how much time?

Here reference can be write (OR) Read

$$T_{ave} = fr * T_{avg} + fw * T_{avw}$$

fr - Read frequency = 80%

fw - write frequency = 20%

$$\therefore T_{ave} = 80\% * 60 + 20\% * 120 \\ = 48 + 24$$

$$T_{ave} = 72 \text{ ns}$$

4) Throughput =  $\frac{\text{No. of words}}{\text{sec}}$

$$= \frac{1}{T_{ave}}$$

$$= \frac{1}{72 * 10^9 \text{ ns/word}}$$

$$= \frac{10^9}{72} \text{ words/sec}$$

$$= \frac{1000}{72} * 10^6 \text{ words/sec}$$

$$= 13.8 \text{ million words/sec}$$

L3

## 20/11/11 Block replacements:

Objective: Reduce the number of block movements

- 1) FIFO --- Arrival time
  - 2) LRU --- Basis: usage
  - 3) Direct mapped ---  $(K \bmod N)^{th}$  cache block
- performance of block replacement techniques is based on the number of hits

Eg:

consider a cache with 4 blocks

Main memory block references: -

7 0 8 12 # 18 # 13 17 # 8 0 0 16



○ → moved out of cache

→ → movement

□ → hits

∴ In FIFO,

Hits

last block maid out  
of cache

status of cache

-RU :

○ → old reference



last block moved out of cache    status of cache

4

17

7 8 0 16

rect mapped:



4) 3 0 0 0 . . .

last block moved out of cache

4

0

status of cache

16 17 18 7  
14

7, 18, 17, 16 are not disturbed after their first arrival

Repeat the above using 2-way set associate mapping with LRU strategy

each set is applied with LRU

⇒  $(K \text{ mods } 4)$  set in cache

= (K mod 2)

$$\text{No. of sets} = S = \frac{N}{K} = \frac{4}{2} = 2$$



Set. 1

last block moved out of cache

3

8

status of cache

0 16 17 7  
set-0 set-1

## Principle of virtual memory:

The virtual memory allows the construction of programme which exceeds physical memory size. The operating system takes care about swapin and swapout operations. Here the secondary memory and primary memory are divided into equisized partitions (frames/pages). The address mapping converts virtual address to physical address. Paging is the simplest one but it consumes more table space. The segmented paging reduces effective table space but uses two level address translation (segment table, page table).

- Q) consider a 32-bit virtual address and 4 MB main memory are partitioned into 64 KB blocks (frames/pages)
- 1) what is the table space required for paging technique?
  - 2) what is the effective table space needed for segmented paging, with each segment having 64 pages

Sol:  $\Rightarrow \text{Framesize} = \text{pagesize} = 64 \text{ KB}$

Assume 1 word = 1 Byte

$$\text{No. of pages} = \frac{2^{32}}{2^{16}} = 2^{16} \text{ pages}$$

$$\text{No. of frames} = \frac{2^{22}}{2^{16}} = 2^6 \text{ frames} = 64 \text{ frames.}$$



$$\text{Page-table size} = \frac{2^{16} \times 6}{8} \text{ bytes}$$

total 2 references

$$= 48 \text{ KB}$$

2) No. of segments =  $\frac{\text{total pages}}{\text{no. of pages per segment}}$

$$= \frac{2^{16}}{2^6}$$

$= 2^{10} \text{ segments}$



Total 3 references

segmented/paging is slower than paging

### Principle of associate memory:

The associate memory is the fastest memory and the information is retrieved using either the content (or) partial content. It is also called as content addressable memory (CAM). The functional blocks are:

1. Argument register (n)
2. Mask register (n)/Key register
3. Associate memory array ( $m \times n$ )
4. Match logic

- perform the read operation, place the content (or) partial content in argument register. In case of partial matching set the appropriate mask register bits to "one". The match logic gives the concerned matched words. In case of multiple matching, access the words sequentially until the desired word is obtained.
- To perform the write operation, after ensuring that the word is not already present, transfer it into one of the free locations.
- the expression for match logic

$$M_P = \prod_{g=1}^G [K_g + (F_{gj} \odot A_g)]$$

$\forall g, \text{ if } K_g = 0 \rightarrow$  It reduces to complete matching  
 else partial matching

$A_g$  - jth bit of argument register

$K_g$  - gth bit of Mask register

$F_{gj}$  - jth bit in gth row of associate memory

If  ~~$F_{gj} = A_g$~~  and

$F_{g2} = A_2$  and

$F_{g3} = A_3$  and

$F_{g4} = A_4$  then only word 'q' is called as matched one.

If we don't want to match a bit, that bit is masked with Mask register bit ( $K_g$ )

Q)  $K_g = 1 \Rightarrow M_P = \underbrace{(1 + F_{gj} \odot A_g)}_1$



Associate memory does not require any address.

Hence by construction, it is fastest memory.

Registers, cache require addresses some time is spent in address decoding

16

Though the associate memory is the fastest, it needs more time to compare the words in parallel. Hence it is the most costliest memory.

A direct mapped cache implements the following programs. It uses  $512 \times 512$  array. Each element occupies 8 bytes. The row-major order is used for placing the array in MM. Let the cache size is 32 KB and block size 16 B

P<sub>1</sub>: float A[512][512];  
 for (i=0; i<512; i++)  
 for (j=0; j<512; j++)  
 A[i][j] = 5.0;

P<sub>2</sub>: float A[512][512];  
 for (i=0; i<512; i++)  
 for (j=0; j<512; j++)  
 A[i][j] = 5.0

The miss ratio for P<sub>1</sub> is M<sub>1</sub> and for P<sub>2</sub> is M<sub>2</sub>. Find

1) M<sub>1</sub> and 2)  $\frac{M_1}{M_2}$

3) which element of array first time replaces A[0][0].

Sol: Direct Mapped  $\Rightarrow$   $(K \bmod N)^{th}$  block.

Each element of array is 8 Bytes

Block size = 16 Bytes

$\Rightarrow$  No. of elements in each block = 2

$$\text{No. of blocks in cache} = \frac{\text{cache size}}{\text{block size}}$$

32 KB  
16 B

$$= 2^{11} \text{ blocks}$$

$$= 2048 \text{ blocks} = 2K \text{ blocks}$$

Total elements of array =  $512 \times 512$

No. of elements in each block = 2

No. of blocks for array in MM =  $\frac{2^{18}}{2}$

二二四

= 128 K blocks



for program 1:

Block '0' referred twice  
1 is miss & 1 is hit

program 1 refers

$A[0][0], A[0][1], A[0][2], A[0][3], \dots$

三

11

1

517

Every block is referred twice and 1 is hit & 1 is miss

$$\text{Miss ratio } M_1 = \frac{1}{2} = 50\%$$

Total blocks to be moved from MM to cache memory  
= 128 K blocks

Each block causes 1 hit & 1 miss

$$\therefore \text{Total miss references} = 128 \text{ K}$$

$$\text{Total references} = 256 \text{ K}$$

for program 2:

program 2



for every element reference a block is to be moved from main memory

Hence it is resulting miss  
The total blocks that are to be moved = total no. of elements

$$\therefore \text{Total Miss references} = 256 \text{ K}$$

$$\text{Miss ratio } M_2 = \frac{\text{Total miss references}}{\text{Total references}} = \frac{256 \text{ K}}{256 \text{ K}} = 100\%$$

$$\frac{1 \times 11}{M_B} = \frac{504}{1008} = \frac{1}{2}$$

3) No. of blocks occupied by each row =  $\frac{512}{2} = 256$  blocks

$A[0][0]$  is in 0th block.

$0 : 1 \text{ to } 2048 = 0^{\text{th}}$  cache block

next Main memory block = 2048

Total rows =  $\frac{2048}{256} = 8$  rows

$\therefore A[8][0]$  replaces  $A[0][0]$ , first time

| <u>blocks</u> | <u>Row</u> |
|---------------|------------|
| 0 - 255       | 0          |
| 256 - 511     | 1          |
| 512 - 767     | 2          |
| 768 - 1023    | 3          |
| 1024 - 1279   | 4          |
| 1280 - 1535   | 5          |
| 1536 - 1791   | 6          |
| 1792 - 2047   | 7          |
| 2048 -        | 8          |

$\Rightarrow A[8][0]$

Instructions, Addressing modes, Instruction pipelines:

Instruction has - 1) operation code (opcode)

- 2) operand reference (OPR ref)

Instruction 

|        |         |
|--------|---------|
| opcode | OPR ref |
|--------|---------|

Based on opcode (functional classification)

1. Data transfer instructions

2. Arithmetic

3. Logical

Based on the number of operands i.e. number of addresses

## 4-address instructions

ADD A<sub>1</sub>, A<sub>2</sub>, A<sub>3</sub>, A<sub>4</sub>



contain address of next instruction

There is no program counter (PC)

It is equivalent to  $M[A_4] \leftarrow M[A_2] + M[A_3]$

Advantage: simplicity

disadvantage: Lengthy

## 3-address instructions

ADD A<sub>1</sub>, A<sub>2</sub>, A<sub>3</sub>

PC is introduced

suppose 100 instructions  $\Rightarrow$  it required 100 memory references  
for ~~next~~ address of instruction in  
4-address

but in 3-address only PC

adv: fast

disadv: More cost

## 2-address instructions

ADD A<sub>1</sub>, A<sub>2</sub>

$M[A_1] \leftarrow M[A_1] + M[A_2]$

adv: shorter length

disadv: one operand is overwritten with result.

i.e. one operand is permanently lost.

## 1-address instructions

adv: size is reduced

Memory operand is not disturbed

∴ They introduced accumulator (A) register

ADD A<sub>2</sub>

$A \leftarrow (A) + M[A_2]$

## ⇒ 0-address instruction



Here CPI < 1

i.e. More than one instruction in one clock cycle  
ie one fetch

disadv: More complexity.

stack contents should be rigidly placed properly

- Q) A system supports 1-address and 2-address instructions. The 16 bit instruction is stored in 128 word memory. If there exists two 2-address instructions, what will be the number of 1 address instructions supported by system

- ① 256    ② 128    ③ 512    ④ 384

Sol: 128 word  $\Rightarrow$  7 bits required to address

one-address instruction



two-address instruction



If only one-address instructions then  $2^7 = 128$  possible  
but two address also present

For one two-address instructions 7 bits are to be pulled from  
opcode of 1-address instruction

i.e. for one 2-address instruction,  $2^7 = 128$  1-address  
instructions are lost

∴ for two 2-address instruction =  $2 \times 128$ , 1-address  
instructions are lost

∴ Total 1-address instructions =  $256 - 128 - 128$  instructions

$$\text{ie } P + Q * 2^7 = 255$$

valid  
address  
instructions

Total possible

invalid  
1-address  
instructions  
due to the presence  
of Q two-address  
instructions

1-address instructions

Max possible 1-address instructions are  
ie P is Maximum iff Q is minimum

$$\Rightarrow Q=0$$

but 2-address instructions should be present

$$\Rightarrow Q=1$$

$$\therefore \text{Max 1-address instructions} = 384$$

$$\text{and } 1 \leq Q \leq 3$$

19

$$Q \neq 4 \text{ because } P=0$$

### Addressing modes:

The addressing modes indicate how the reference of operand is made in the instruction. The reference of the operand is called effective address.

In non-computable addressing modes, effective address is to be accessed, while it is computed in computable addressing modes. The non-computable addressing modes include

- 1) Immediate addressing mode
- 2) Direct addressing mode
- 3) Indirect addressing mode

Immediate addressing mode is the fastest one. There is no effective address for this mode. It is used for accessing program constants. However, operand is limited.

In the direct addressing mode, effective address is given

The "disadvantage" it is suitable for accessing static variables. It avoids limited range of operands but result limited address range

- In the indirect addressing mode, these two problems are avoided. It is used for accessing local variables. However it consumes more time compared to immediate and direct addressing modes.

In the computable addressing modes, effective address is computed by adding displacement of the instruction to the fixed processor register.

- If the base register is used in the computation then the addressing mode is called as base addressing mode and is useful for relocatable programs
- If the fixed register is index register, the addressing mode is called as index addressing mode and it is suitable for accessing array elements
- The relative addressing mode is obtained using program counter in the computation. It is used for intra segment branching (forward branching if displacement is +ve, backward branching if the displacement is -ve)

### Instruction pipeline:

- Efficient usage of resources
- Time-overlap the instruction phase
- $T_{ave}$  is  $\downarrow$  ( $CPI_{ave}=1$ )
- for  $k$ -stages,  $T_n = (k+n-1)$  clocks,  $t_{clock} = \max(\text{stage delay})$

| S | Register       | I <sub>1</sub> | I <sub>2</sub> | I <sub>3</sub> | I <sub>4</sub> | I <sub>5</sub> | clocks |
|---|----------------|----------------|----------------|----------------|----------------|----------------|--------|
| F |                | I <sub>1</sub> | I <sub>2</sub> | I <sub>3</sub> |                |                |        |
| O |                | I <sub>1</sub> | I <sub>2</sub> | I <sub>3</sub> | I <sub>4</sub> |                |        |
| F | I <sub>1</sub> | I <sub>2</sub> | I <sub>3</sub> | I <sub>4</sub> | I <sub>5</sub> |                |        |

- performance of the instruction pipeline is given by the speedup factor ( $s > 1$ )

$$S = \frac{\text{Time without pipeline}}{\text{Time with pipeline}}$$

If  $T_1$  is time for execution with pipeline &  $T_2$  is time without pipeline

$$S = \frac{T_2}{T_1}$$

In ideal case  $S = k$  (no. of stages)

effective speed up factor  $S_{\text{eff.}} = \frac{S_{\text{ideal}}}{(1 + \frac{\text{stall frequency}}{\text{cycles}})}$

- The performance of instruction pipelines is influenced by uneven stage delays, buffer overhead and dependencies among the instructions

- Due to dependencies the instructions in the pipeline are to be reorganised and hence it consumes more time which in turn reduces the average performance.

→ For  $k$ -stage

If  $n$  instructions are to be executed

without pipeline, time =  $(n \times k)$  clocks

with pipeline, time =  $(n+k-1)$  clocks

As  $n \gg k$

$$\begin{aligned} S_{\text{ideal}} &= \frac{n \times k}{n+k-1} \\ &\approx \frac{n \times k}{n} \end{aligned}$$

$$\therefore S_{\text{ideal}} \approx k \checkmark$$

- If no. of stages are increased, pipeline activity ↑. Hence performance increases



- The data dependency occurs when one instruction in pipeline waiting for the result which is not yet computed by previous instruction. The data dependencies are dealt with instruction rescheduling, stall cycle introduction and operand forwarding. Arithmetic instructions are the main resources for the data dependencies.
  - The control dependency occurs when the flow of instruction is changed due to the earlier instruction cycle. The branch instructions are the main resources for the control dependencies. The control dependency is tackled with multiple pipelines, prediction techniques, delayed load.
  - The structural dependency occurs due to the nonavailability of resources. Resource duplication resolves this problem.
- Q) Consider a 5-stage instruction pipeline which implements the two programs. Program A contains 50 instructions. Program B contains 500 instructions. Compute the speed up factor and pipeline efficiency for each case.

Sol: Given, No. of stages =  $K=5$

$$\underline{PA} \quad \underline{PB}$$

$$n=50 \quad n=500$$

An  $K$ -stage pipeline,  $T_n = (K+n-1)$  clocks

$$\therefore \text{for program } PA, T_n = (5+50-1) = 54 \text{ clocks}$$

$$\text{for program } PB, T_n = (5+500-1) = 504 \text{ clocks}$$

In non pipeline,

instructions in PA takes =  $50 \times 5$  clocks

instructions in PB takes =  $500 \times 5$  clocks

for program PA, speed up factor =  $\frac{T_{NP_1}}{T_{P_1}} = \frac{50 \times 5}{54} = 4.6$

program PB, speed up factor =  $\frac{T_{NP_2}}{T_{P_2}} = \frac{500 \times 5}{504} = 4.96$

∴ program B has better performance

Agnoring other factors, as the no. of instructions increases

Performance also increases

We have in ideal case,  $S_{ideal} = K$

But  $P_A \Rightarrow S = 4.6$

$P_B \rightarrow S = 4.96$

$$\therefore \text{efficiency} = \frac{5}{4.6} \times 100 = 92\%$$

$$\therefore \text{efficiency} = \frac{5}{4.96} \times 100 = 97.2\%$$

i.e.  $S=5 \rightarrow \text{efficiency}=100\%$

i.e.  $S=5 \rightarrow \text{efficiency}=100\%$

$$S=4.6 \rightarrow ?$$

$$= 92\%$$

$$S=4.96 \rightarrow ?$$

$$= 99.2$$

D) let the two pipelines are implemented by a system so that  
Pipeline 1 has 8 stages (phases) with uniform delay of  
2ns. The pipeline 2 has 5 stages with the stage delays  
3ns, 1ns, 1ns, 2ns, 3ns.

i) How much time is saved with pipeline 1 for 100 instructions

compared to pipeline 2

so: with pipeline 1, time =  $(K+n-1)$  clocks  
=  $(K+n-1) t_{clock}$

we have  $t_{clock} = 2\text{ns}$

$$T_1 = (8+100-1) \times 2 = 214\text{ ns}$$

with pipeline 2,  $t_{clock} = \max(\text{stage delay})$   
 $= 3\text{ns}$

$$T_2 = (5+100-1) \times 3$$

$$= 312 \text{ ns}$$

$$\therefore \text{Time saved} = 312 - 214 \\ = 98 \text{ ns}$$

For program P<sub>1</sub>, speed up factor =  $\frac{\text{Time without pipeline}}{\text{Time with pipeline}}$

$$= \frac{8 \times 100 \times 2}{214} \\ = 7.47$$

for program P<sub>2</sub>, speed up factor =  $\frac{\text{Time without pipeline}}{\text{Time with pipeline}}$

$$= \frac{100 \times (3+1+1+2+3)}{312} \\ = 3.2$$

$$\therefore \eta_1 (\text{efficiency}) = \frac{8 \times 100}{7.47} = 93.37\%$$

$$\eta_2 (\text{efficiency}) = \frac{5}{3.2} = 64.7\% \quad \text{due to uneven delays}$$



speed up factor.

So: Here no. of instructions are not given

∴ average time is to be considered

without pipeline,  $T_{ave} = \frac{\sum \text{stage delay}}{n}$

$$= (3+1+1+2+3) \\ = 10 \text{ ns}$$

with pipeline

$$T_{ave} = 1 \text{ clock}$$

this is known as  $T_{PIPE} = 1 \text{ clock}$

$$T_{clock} = (3+1) \text{ ns}$$

↑ buffer time as it is operated by same clock

i.e. Buffers are not included in non pipeline systems

$$\therefore S = \frac{10 \text{ ns}}{4 \text{ ns}} = 2.5$$

Consider a 4-stage instruction pipeline which was aimed to implement the following 4 instructions. Here each instruction can spend different number of clocks at different stages. The following table gives the no. of clocks required per instruction per stage. What will be the no. of clocks required to complete the following 4 instructions?

22

|       | $S_1$ | $S_2$ | $S_3$ | $S_4$ |
|-------|-------|-------|-------|-------|
| $I_1$ | 3     | 1     | 2     | 1     |
| $I_2$ | 1     | 3     | 1     | 2     |
| $I_3$ | 1     | 1     | 1     | 1     |
| $I_4$ | 2     | 1     | 2     | 1     |

so:

1) The instruction has to complete its previous stage when it requests the next stage.

2) the stage is given to instruction if the stage is free i.e. the previous instruction has released that stage

if stage  $i$  for instruction  $I_g$  is given if

1)  $I_g$  must complete  $S_{i-1}$  time =  $x$

2)  $I_{g-1}$  must have released  $S_i$  time =  $y$

time of allocation =  $\max(x, y)$

|                | 1              | 2              | 3              | 4              | 5              | 6              | 7              |                |    |  |  |
|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----|--|--|
| I <sub>1</sub> | S <sub>1</sub> | S <sub>2</sub> | S <sub>3</sub> | S <sub>4</sub> | 8              | 9              | 10             |                |    |  |  |
| I <sub>2</sub> | -->            | S <sub>1</sub> | S <sub>2</sub> | S <sub>3</sub> | S <sub>4</sub> |                | 11             |                |    |  |  |
| I <sub>3</sub> | -->            | S <sub>1</sub> |                | S <sub>2</sub> | S <sub>3</sub> |                |                | S <sub>4</sub> | 12 |  |  |
| I <sub>4</sub> | -->            | S <sub>1</sub> |                |                | S <sub>2</sub> | S <sub>3</sub> | S <sub>4</sub> |                |    |  |  |

3 stall cycles  
waiting cycles for I<sub>3</sub>  
1 stall cycle for I<sub>4</sub>

∴ These 4 instructions are completed in 12 clock cycles.

b) for (i=1 to 2)

{ I<sub>1</sub>; I<sub>2</sub>; I<sub>3</sub>; I<sub>4</sub>; }

find total no. of clocks required for above problem

Sol:

Suppose for above problem.

|                | S <sub>1</sub> | S <sub>2</sub> | S <sub>3</sub> | S <sub>4</sub> | S <sub>1</sub> | S <sub>2</sub> | S <sub>3</sub> | S <sub>4</sub> |
|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|
| I <sub>1</sub> | (3) 3          | ① 4            | 2 12 7         | 6 10           | 10             | 11             | 13             | 14             |
| I <sub>2</sub> | ① 4            | 3 ④            | 1 ③ 2          | 10             | 11             | 12             | 15             | 17             |
| I <sub>3</sub> | 1 5            | 1 ②            | 1 4            | 1 9 1 11       | 12             | 15             | 16             | 13             |
| I <sub>4</sub> | ② 7            | ① 4            | 2 11           | ④ 12           | 14             | 15             | 18             | 19             |

∴ It takes 19 clocks

At the end of first iteration

$$\text{Speed up factor} = \frac{25 \text{ clocks}}{19 \text{ clocks}} = \cancel{2.67} 2$$

At the end of two iterations

$$\text{Speed up factor} = \frac{\cancel{50}}{19} : \frac{48}{19} = 2.53$$

The pipeline system contains 4 stages with stage delays 3 clocks, 2 clocks, 2 clocks, 3 clocks. The pipeline is operated with 1 GHz clock. The non-pipeline is implemented with 2 GHz clock. what is the speed up factor?

$$\begin{aligned}
 \text{Speed up factor} &= \frac{\text{Time without pipeline}}{\text{Time with pipeline}} \\
 &= \frac{(3+2+2+3) * \frac{1}{2 \text{ GHz}}}{3 * \frac{1}{1 \text{ GHz}}} \\
 &= \frac{5}{3} \\
 &\approx 1.7
 \end{aligned}$$

### Data dependency:

The data dependencies occur due to arithmetic instructions, where the result of one instruction are to be used in another instruction.

23

$$\begin{array}{lll}
 \text{Eq: } & \text{SUB } R_1, R_2, R_3 & R_1 \leftarrow R_2 - R_3 \\
 & \text{ADD } R_4, R_1, R_5 & R_4 \leftarrow R_1 + R_5
 \end{array}$$

suppose each instruction takes 1 clock per stage  
pipeline do not consider the data dependency.

| F   | D          | E           | S           |       |
|-----|------------|-------------|-------------|-------|
| SUB | $R_2, R_3$ | $R_2 - R_3$ | $R_1$       | 5     |
| ADD | $(R_1)$    | $R_5$       | $R_1 + R_5$ | $R_4$ |

Here it takes old value.

The correct implementation is

| 1   | 2          | 3           | 4     |  | 5 | 6 | 7 |
|-----|------------|-------------|-------|--|---|---|---|
| F   | D          | E           | S     |  |   |   |   |
| SUB | $R_2, R_3$ | $R_2 - R_3$ | $R_1$ |  |   |   |   |

  

|     |   |   |   |  |  |  |
|-----|---|---|---|--|--|--|
| F   | D | E | S |  |  |  |
| ADD |   |   |   |  |  |  |
|     |   |   |   |  |  |  |

Latency is on write down

→) Introduction of stall cycles solves all the problems  
But low performance

## 2) Instruction rescheduling:

Between the dependent instructions, independent instructions are scheduled

$I_1$   
 $I_g$   
 $I_K$  } Independent instructions.  
 $I_2$

|       |   |       |   |       |   |   |
|-------|---|-------|---|-------|---|---|
| 1     | 2 | 3     | 4 |       |   |   |
| $I_1$ |   |       |   | $I_g$ |   |   |
|       |   |       |   |       | 6 |   |
|       |   | $I_K$ |   |       |   | 7 |
|       |   |       |   | $I_2$ |   |   |

It also takes 7 clocks.

But how to know independent instructions?

$I_g, I_K$  are to be independent among them and also independent of  $I_1, I_2$  and their successors. Independent instructions, are known by compiler by constructing DAG

∴ efficiency of this method depends on

- 1) compiler
- 2) program

Instruction rescheduling does not solve all the problems

## 3) operand forwarding:

In operand forwarding, the value of the operand is given to required stage via interstage buffer registers.

|   |   |   |   |  |  |
|---|---|---|---|--|--|
| 1 | 2 | 3 | 4 |  |  |
| F | D | E | S |  |  |

These buffer registers are shift registers operating in PIPo.

Eg: SUB R<sub>1</sub>, R<sub>2</sub>, R<sub>3</sub>

ADD R<sub>4</sub>, R<sub>1</sub>, R<sub>5</sub>

|     | 1  | 2                               | 3  | 4                                                  | 5              |
|-----|----|---------------------------------|----|----------------------------------------------------|----------------|
| F   | id | D                               | DE | E                                                  | S              |
| SUB |    | R <sub>2</sub> , R <sub>3</sub> |    | R <sub>2</sub> -R <sub>3</sub>                     | R <sub>1</sub> |
| ADD |    | F                               | D  | (R <sub>2</sub> -R <sub>3</sub> ) + R <sub>5</sub> | R <sub>4</sub> |

It requires 5 clocks.

extra H/W is needed



24

A path is needed between ELS & E. It should not be permanent path.

Multiplexer is needed.

Select line is activated by compiler as it knows dependencies.

It solves all problems but more cost

Q) IF-Instruction fetch, ID-Instruction Decode, OF-operand fetch, PO-perform operation, WO-write operation are stages for the instruction to be executed.

MUL R<sub>2</sub>, R<sub>1</sub>, R<sub>4</sub>

DIV R<sub>5</sub>, R<sub>3</sub>, R<sub>2</sub>

SUB R<sub>2</sub>, R<sub>5</sub>, R<sub>6</sub>

AD10 R<sub>4</sub>, R<sub>2</sub>, R<sub>5</sub>

Except PO all other stages take 1 clock/instruction-stage. PO takes 3 clock for MUL and 6 clocks for DIV, 1 clock for sub. Final total 10 clocks for above execution.

16 cycles needed

- p) consider a 4-stage instruction pipeline (Fetch, Decode, Execute, write back). The fetch, decode, write back will take 1 clock each for completion. The number of clocks required for execution is 1 clock for addition and subtraction and 3 clocks for multiplication. what will be the min. no. of clocks required to implement the following three instructions. operand forwarding is allowed

ADD R<sub>2</sub>, R<sub>1</sub>, R<sub>0</sub> ; R<sub>2</sub> ← R<sub>1</sub> + R<sub>0</sub>

MUL RU, RB, R2 ; RU < R3 \* R2

SUB R6, R5, R4 ; R6 ← R5 - R4

501

Diagram illustrating the sequential execution of three instructions (ADD, MUL, SUB) across four stages (IF, ID, EX, WB). The MUL instruction is currently being executed, and the SUB instruction has just begun.

| 1                   | 2            | 3                  | 4        | 5 | 6 | 7 |
|---------------------|--------------|--------------------|----------|---|---|---|
| IF<br>ADD<br>R1, R2 | ID<br>R2, R3 | EX<br>R3           | WB<br>R4 |   |   |   |
| IF<br>MUL<br>R1, R2 | ID<br>R2     | EX<br>$(R1+R2)*R3$ | WB<br>R4 |   |   |   |
| IF<br>SUB<br>R1, R2 | ID<br>R2     | EX<br>R3           | WB<br>R4 |   |   |   |

8 cycles i.e. 8 clocks are required

## Types of data dependencies.

The data dependencies are classified into:

1. Read After write (RAW) (OR) TRUE dependency
2. write After Read (WAR) (OR) ANTI dependency
3. write After write (WAW) (OR) OUTPUT "
4. Read After Read (RAR) (OR) ~~Read~~ dependency.

The domain and range relations are used for detecting data dependencies. Domain indicates the source operand and range denotes destination operand  
Any violation to the order will cause hazard. Hence data dependencies also called as hazards

Eg:

|                                                                       | <u>Domain</u>                   | <u>Range</u>   |
|-----------------------------------------------------------------------|---------------------------------|----------------|
| I <sub>1</sub> : ADD R <sub>1</sub> , R <sub>2</sub> , R <sub>3</sub> | R <sub>2</sub> , R <sub>3</sub> | R <sub>1</sub> |
| I <sub>2</sub> : OR R <sub>4</sub> , R <sub>1</sub> , R <sub>5</sub>  | R <sub>1</sub> , R <sub>5</sub> | R <sub>4</sub> |

25

Read operation is done on Domain  
write operation is done on Range

Suppose two pipelines then

OR completes before ADD as it is fast

∴ Error

∴ OR should Read R<sub>1</sub> only after ADD writes R<sub>1</sub>

∴ It is Read After write

$\frac{\text{I instruction}}{\text{II instruction}}$  It is due to  $D_2 \cap R_1 \neq \emptyset$

$\frac{\text{Domain}}{\text{Range}}$

RAW  $\Leftrightarrow D_2 \cap R_1 \neq \emptyset$

WAR  $\Leftrightarrow D_2 \cap D_1 \neq \emptyset$

WAW  $\Leftrightarrow R_2 \cap R_2 \neq \emptyset$

RAR  $\Leftrightarrow D_2 \cap D_1 \neq \emptyset$

Eg: I<sub>1</sub>: ADD R<sub>1</sub>, R<sub>2</sub>, R<sub>3</sub>

I<sub>2</sub>: OR R<sub>2</sub>, R<sub>1</sub>, R<sub>5</sub>

R<sub>1</sub> should write R<sub>2</sub> only after I<sub>1</sub> has read R<sub>2</sub>

## control dependencies:

- The control dependencies occur due to change in the flow of execution as a result of instruction cycle of earlier one
- The control dependency require reorganisation of pipeline. Hence performance will be reduced.

Ex: I<sub>1</sub>: BZ I<sub>7</sub>      BZ - Branch on Zero

next sequential  
instruction

Target I<sub>7</sub>:

| F              | D              | E              | S              |
|----------------|----------------|----------------|----------------|
| BZ             | I <sub>1</sub> | I <sub>1</sub> | I <sub>1</sub> |
| F              | I <sub>2</sub> | I <sub>2</sub> | I <sub>2</sub> |
| I <sub>2</sub> |                |                |                |
| F              | I <sub>3</sub> | I <sub>3</sub> | I <sub>3</sub> |
| I <sub>3</sub> |                |                |                |
|                |                |                | I <sub>7</sub> |
|                |                |                |                |

If zero flag = 1

then the next instruction is I<sub>7</sub>

Hence whatever work is done by pipeline is waste

Hence its contents are to be flushed out.

It requires (K-1) stalls

If zero flag = 0

then next instruction is I<sub>2</sub>

pipeline is useful

- This zero flag value can be known only after the execution of I<sub>1</sub>

∴ This is control dependency

consider a 5-stage instruction pipeline in which all the instructions can be overlapped except branch instructions. In case of branch instructions, the target is not available until the current branch instruction is completed. Let there are 20% branch instructions. Assume that each stage consumes 2ns delay.

1) what is the average instruction time

Sol: the stall cycles (waiting cycles) for target instruction

$$= 5-1 = 4$$

Every instruction on an average takes 1 clock for completion  
 Total clocks required for an instruction = waiting time + completion time

Branch instructions time = 4+1 = 5 clocks

other instructions time = 0+1 = 1 clock

$\therefore$  20% branch instructions and 80% other instructions

Average instruction time = 80% (0+1) + 20% (4+1) clocks

$$= 0.8 + 1$$

$$= 1.8 \text{ clocks}$$

$$= 1.8 \times 2 \text{ ns}$$

$$= 3.6 \text{ ns}$$

(OR)

$$T_{ave.} = \left( 1 + \frac{\text{stall freq}}{\text{clock}} \times \text{stall cycles} \right) \times T_{clock}$$

$$= (1 + 20\% \times 4) \times 2 \text{ ns}$$

$$= 3.6 \text{ ns}$$

2) effective speed factor

$$S_{eff} = \frac{S_{ideal}}{1 + \frac{\text{stall freq}}{\text{clock}} \times \text{stall cycles}}$$

$$= \frac{5}{1.8}$$

$$= 2.78$$

Q) In the above problem, among the branch instructions, 40% are unconditional. An L of conditional branch instructions does

in anxiety the condition. If there is no penalty due to such branch instructions what is the average instruction time.

Sol:



Branch is taken in unconditional if the condition is satisfied

stall cycles in case of branch = 4 clocks

$$\begin{aligned}
 T_{ave} &= \left[ 1 + 20\% [40\% \times 4 + 60\% (40\% \times 4)] \right] * 2 \text{ ns} \\
 &= (1 + 0.2 [1.6 + 0.6 \times 0.4 \times 4]) * 2 \text{ ns} \\
 &= [1 + 0.2 [1.6 + 0.96]] * 2 \\
 &= 3.024 \text{ ns}
 \end{aligned}$$

2.56 ns  
0.12 ns  
1.512 ns  
3.024 ns

Techniques to resolve the control dependencies:

### 1. Multiple pipelines:

In a two pipeline system, the next sequential instruction is placed for the coming clock after the branch instruction. In second pipeline the target is placed for the next clock. In the fetch stage, one of the pipelines always maintain correct flow of instructions. At the end of branch instruction one pipeline is allowed to continue while the other is stopped. This approach successfully deals IF-THEN-ELSE statement.

IF (cond) THEN(Ix) ELSE(Iy)

the two pipelines system tails for the nested If instructions

e.g:



- If target itself is branch then three pipelines
- Implementation cost is more

27

### 2) Delayed Load Technique:

The load of the branch instruction is the next sequential (or) target instruction. Between the branch and its load independent instructions are scheduled. Here whether the branch is taken (or) not taken the load is getting delayed

### 3) Prediction Techniques:

The prediction techniques will decide whether the next sequential (or) target instruction is to be placed in fetch stage

for the next clock. The predictions can be static (or) dynamic. In dynamic prediction, the next prediction will depend upon the history of previous branch instructions.

The prediction may be:  
1) Branch never takes  
2) Branch Always takes

In branch never takes, the pipeline is placed with next sequential instruction

In branch always takes case, the target instruction is placed in the fetch stage

- The conventional pipeline follows branch never takes assumption

### Implementation of dynamic prediction:

Here the prediction is changed, if two consecutive previous predictions are wrong. The binary flag 'TAKEN' maintains the status of previous branch instructions.



→  $S_1, S_3$  are strong prediction states

i.e. prediction is not changed whatever may be

the outcome of its previous prediction.

It can be implemented using two T-flipflops

→  $S_2, S_4$  are weak prediction states

### Structural dependencies:

- It is resulted due to non-availability of resources
- Tackled with resource duplication. Resource duplication is subject to the constraint of cost and reliability



$$R = R_1 R_2 = 0.81$$

suppose stage 1 has 2 copies of resource



$$R = R_1 \cdot R_2$$

$$R_1 = 1 - (1 - R_{11})(1 - R_{12}) \\ = 0.99$$

$$R = (0.99) \cdot 0.9$$

$$R = 0.901$$

If the system has a single port memory, instruction fetch and memory of another instruction can't be overlapped i.e storage

28

In case of single ALU system, we cannot overlap instruction Fetch and Execution of another instruction if fetch requires updation of program counter (PC).



Consider the two pipeline implementations. Both of them are having same no. of stages and allow overlapping of all instructions except memory related ones. Let there are 20% instructions belong to memory. The pipeline 1 employs single port memory while pipeline 2 uses a port memory. If two memory operations can't be performed simultaneously there will be one stall cycle penalty. Get the speed up factor for two pipeline systems and find

5

$$S_{eff} = \frac{v_{max}}{(1+stall\ freq + stall\ cycles)}$$

Let No. of stages = k

Pipeline 1

|             |        |
|-------------|--------|
| stages      | K      |
| Memory      | 1-port |
| stallcycles | 1      |
| stall freq. | 20%    |

$$S_1 = \frac{K}{1+0.2*1}$$

$$S_1 = \frac{K}{1.2}$$

$$\therefore \frac{S_1}{S_2} = \frac{1}{1.2}$$

Pipeline 2

|        |
|--------|
| K      |
| 2-port |
| 0      |
| 20%    |

$$S_2 = \frac{K}{1+0*0.2}$$

$$S_2 = K$$

$$\Rightarrow S_2 = 1.2 * S_1$$

i.e. performance of pipeline 2 is 1.2 times of pipeline 1

- Q) A system employs a 5-stage instruction pipeline, in which 5% instructions result data dependencies, 10% instructions result control dependencies, 2% instructions result structural dependencies. 10% instructions are exposed for both data and control dependencies. If the penalties for control dependency and data dependency are 3 clocks and 2 clocks. what is average instruction time

$$\text{sol: } T_{ave} = (1 + 5\% * 2 + 10\% * 3 + 2\% * 1 + 10\% * (3+2)) \times \text{default for structural dependency}$$

$$= 1.92 \text{ clocks}$$

## control unit

control unit generates control signals which implement micro-operations. ( $\mu$ -operation)

micro-operation is the elementary operation which results datamovement across registers

Register Transfer Language (RTL) is used to represent  $\mu$ -operation set of operations form uprogram

- control program implement uprogram

Instruction cycle contains

Fetch uprogram

Execute "

Interrupt "

29

control unit can be considered as 4-state sequential machine (fetch, decode, execute, store) each state require different control signals.



control unit can be designed by

- 1) Hardware control unit (HCU)
- 2) microprogrammed control unit (MPCU)



each operation requires

- If control signals are generated only using H/W then it

simultaneous control signals

- All control signals are generated using memory and H/w then it MPCU.
- RTL : syntax is condition : opcode

Operation:

Eg: move data from Register  $R_2$  to Register  $R_1$  at clock 1  
symbolically it is written as

$$T_1 : R_1 \leftarrow R_2 \quad \text{if it is RTL}$$

H/w is



$R_{2\text{out}}, R_{1\text{in}} \leftarrow$  (just like 'taps for water')  
control signals required  $R_{2\text{out}}, R_{1\text{in}}$   
condition, is at clock 1 ( $T_1$ )

Eg: swap operation

H/w implementation require less resources than  
S/w implementation.  $\nwarrow$  registers

3 variables  
ie 3 registers



i.e swap  $R_1$  and  $R_2$  either at clock 1 ( $T_1$ ) or clock 2 ( $T_2$ )

$R_{1\text{in}}, R_{1\text{out}}, R_{2\text{in}}$  and  $R_{2\text{out}}$  are to be opened simultaneously otherwise copy operation is done

$$T_1 + T_2 : R_1 \leftarrow R_2, R_2 \leftarrow R_1$$

Suppose in above diagram, delays are given then, how much time is taken for swapping to be successful



→ consider a single bus architecture in which the minimal register set of the system is connected Program Counter (PC), Instruction Register (IR), Accumulator (AC) and memory.



30

MAR - Memory Address Register

MDR - Memory Data Register

### Fetch:

Here, move the instruction from memory into IR.



visit memory  $\Rightarrow$  place contents of PC into MAR - In first clock

Get instruction  $\Rightarrow$  Read it to MDR — In second clock

Transfer to IR  $\Rightarrow$  Move from MDR to IR - In third clock.

$$PC \leftarrow PC + 1$$

|                              |                |
|------------------------------|----------------|
| $T_1: MAR \leftarrow PC$     | fetch μprogram |
| $T_2: MDR \leftarrow M[MAR]$ |                |
| $T_3: IR \leftarrow MDR$     |                |

$PC \leftarrow PC + 1$

In 8085, it requires 4 clocks (or) 4 T-states

(3 fetch + 1 stable)  
before execution

control signals  $\Rightarrow$   $PC_{out}, MAR_{in} \leftarrow T_1$  clock  
(Address decoder)

$MAR_{out}, MDR_{in} \leftarrow T_2$  clock  
(Address decoder)

~~PC~~:  $MDR_{out}, PC_{in}, IR_{in} \leftarrow T_3$  clock

μprogram for execution phase of ~~instruction~~  $\xrightarrow{PC_{out}}$

It varies from one instruction to other

Eg: ADD X

$\rightarrow$  Add Accumulator contents to the contents of memory location and place the result in accumulator

This is done after fetch.

At the end of fetch IR contains

| IR | opcode | opr ref |
|----|--------|---------|
|    | ADD    | X       |

|                                                         |          |
|---------------------------------------------------------|----------|
| $T_1: MAR \leftarrow IR(\text{opr ref}/\text{Address})$ | 3 clocks |
| $T_2: MDR \leftarrow M[MAR]$                            |          |
| $T_3: A \leftarrow A + MDR$                             |          |

execution phase μprogram

Eg:

INC X

Assume incrementation possible at register level

$T_1: MAR \leftarrow IR(\text{Address})$

$T_2: MDR \leftarrow M[MAR]$   $\therefore$  4 clocks

$T_3: MDR \leftarrow MDR + 1$

$T_4: M[MAR] \leftarrow MDR$

BUN X

The next instruction is taken from X location

↓

placed in PC

T<sub>1</sub>: PC  $\leftarrow$  IR(Address)

1 clock cycle

Hence different instructions require different no. of operations.

If we know which control signals are required at what time, we can design control unit

(1) HCU

(2) MPCU

21

Hardware control unit design:

In HCU design, each control signal is expressed sop expression of instructions, instructions timings and instruction phases.

Each control signal is then derived from dedicated H/W. It provides all control signals parallelly at fixed time intervals (3 gate delays (NOT, AND, OR))

It is used with RISC processors

The only problem is any modification requires redesign and reconnection of H/W components and hence it is not flexible and it is not suitable for design and testing environment

Eg: consider hypothetical control unit which supports only two control signals S<sub>0</sub>, S<sub>1</sub>, instructions. The system is employing 4 S<sub>2</sub>, S<sub>3</sub>. Each instruction requires 4 microoperations. The following table gives instructions and associated microoperations, obtain expressions for S<sub>0</sub> and S<sub>1</sub>.



## Control signals

| MOP   | $I_1$      | $I_2$      |
|-------|------------|------------|
| $T_1$ | $S_0, S_1$ | $S_1, S_3$ |
| $T_2$ | $S_2, S_3$ | $S_1, S_0$ |
| $T_3$ | $S_0, S_2$ | $S_0, S_3$ |
| $T_4$ | $S_1, S_2$ | $S_1, S_3$ |

Q: Each control signal  $S_i = f(I_1, I_2, T_1, T_2, T_3, T_4)$

$$S_1 = (I_1 T_1 + I_2 T_1) + I_2 T_2 + (I_1 T_4 + I_2 T_4)$$

here it supports only 2 instructions  
and  $S_1$  is required for both in  $T_1$

$$\Rightarrow S_1 = T_1 + I_2 T_2 + T_4$$

$$S_3 = I_2 T_1 + I_1 T_2 + I_2 T_3 + I_1 T_4 + I_2 T_4 \\ = I_2 T_1 + I_1 T_2 + I_2 T_3 + T_4$$

Similarly for  $S_0$  and  $S_2$

1 bit opcode is needed for 2 instructions.



- It is fastest design
- HWCU is not flexible

12/11

## P control unit design

offer flexibility

used with CISC system

It does not directly generate control signals. But binary pattern of control signals is stored in control memory.

This binary pattern is called as control word

control word size is decided by nature of microprogramming

1) Horizontal microprogramming (for N control signals N bits)

2) vertical microprogramming (for N control signals  $\log_2 N$  bits)



- Time for control signal  $T_{CS} = T_{CM} + T_{HCO}$

Eg:

|       | $S_0, S_1$ |
|-------|------------|
| $T_1$ | $S_0, S_1$ |
| $T_2$ | $S_2, S_3$ |
| $T_3$ | $S_0, S_2$ |
| $T_4$ | $S_1, S_3$ |

control word

|       |       |       |       |
|-------|-------|-------|-------|
| $s_3$ | $s_2$ | $s_1$ | $s_0$ |
|-------|-------|-------|-------|



$$s_0 = \text{clock} \text{ iff } B_0 = 1$$

- This is horizontal upprogramming. It offers higher degree of parallelism and is used with parallel processing. No. of control signals generated by a single control word =  $N$  signals.

control word is too lengthy and most bits are zeros.

#### vertical upprogramming

- dont store binary pattern. Instead encoded binary pattern of control signals is stored in control memory.  $N$  signals require  $\log N$  bits.

Advantage: smaller control word

Disadvantage: it requires a decoder

it takes more time

Max. degree of parallelism = 1

$$T_{CS} = T_{DM} + T_{Decoder} + T_{HW}$$

- it is slower than control word division



sequencing the control words:

- 1) using the another program counter in the system  
 2) using linked list  
    i.e. having pointer to next location



This is 1. address control

Condition      Control signals      Next address of control word

- 1-address control signal is same for both vertical and horizontal but they differ in the no. of bits for control signals

N bits - Horizontal  
 $\log_2 N$  bits - vertical

- The beginning address of control word is given by the opcode of instruction.

b) consider a micro programmed control unit which uses horizontal microprogramming and it supports 256 instructions. Each instruction on average requires 16 micro operations. The system is using 16 flag conditions and contains 64 control signals

- 1) How many bits are required for each control word
- 2) what is the size of control memory in bytes.

Let 1-address control instructions are required

Sol:

16 flag conditions.

we can use only one condition at a time  
so these can be encoded

∴ 4 bits are required for encoding 16 conditions

it is horizontal micro programming

∴ No. of bits for control signal = 64 bits

Each instruction requires 16 Moperations

⇒ Each instruction require 16 words

Total instructions = 256

∴ Total words in control memory =  $256 \times 16$   
 $= 2^8 \times 2^4$   
 $= 2^{12}$

12 bits are required for each word to address

|                     |                           |                    |
|---------------------|---------------------------|--------------------|
| condition<br>4 bits | control signal<br>64 bits | Address<br>12 bits |
|---------------------|---------------------------|--------------------|

$$\text{Total bits for each control word} = 4 + 64 + 12 \\ = 80 \text{ bits}$$

$$\begin{aligned}\text{size of control Memory} &= 2^{12} \text{ words} \times 80 \text{ bits/word} \\ &= \frac{2^{12} \times 80}{8} \text{ bytes} \\ &= 40 \text{ KB}\end{aligned}$$

b) Repeat the above using vertical microprogramming

|                |                     |               |
|----------------|---------------------|---------------|
| condition<br>4 | control signal<br>6 | Address<br>12 |
|----------------|---------------------|---------------|

34

22 bits

$$\text{size} = \frac{2^{12} \text{ words} \times 22 \text{ bits/word}}{8 \text{ bits}}$$

$$= 11 \text{ KB}$$

c) consider a microprogrammed control unit that has to support the control signals such that it has to generate either 1 (or) none of the 63 signals and almost four from the remaining ones. what will be the minimum number of bits needed in the address field of one control instruction

address control instruction

sol:- either 1 (or) none of 63 signals

→ vertical microprogramming  
signals can be encoded

6 bits are needed

If all are 0 → no signal

- almost four from the remaining  
at max generate 1 (or) 2 (or) 3 (or) 4

Minimum bits required = 4 bits

Total bits in control signal field =  $6+4=10$  bits

| flag | control signal<br>(10 bits) | Address |
|------|-----------------------------|---------|
|------|-----------------------------|---------|

b) consider a microprogrammed control unit design which has to support 6 groups mutually exclusive control signals. The number control signals generated by each group are given below.

|       |       |       |       |       |       |
|-------|-------|-------|-------|-------|-------|
| $G_1$ | $G_2$ | $G_3$ | $G_4$ | $G_5$ | $G_6$ |
| 7     | 11    | 12    | 3     | 6     | 10    |

what will be the minimum no. of bits stored per control word w.r.t. Horizontal  $\mu$  programming.

Sol: Mutually exclusive

⇒ either one (or) none of signals

each group can use vertical  $\mu$  programming

|         |       |       |       |       |       |
|---------|-------|-------|-------|-------|-------|
| $G_1$   | $G_2$ | $G_3$ | $G_4$ | $G_5$ | $G_6$ |
| signals | 7     | 11    | 12    | 3     | 6     |
| bits    | 3     | 4     | 4     | 2     | 3     |

$$\text{Total bits} = 3+4+4+2+3+4 = 20 \text{ bits}$$

$$\text{In Horizontal } \mu\text{ programming} = 7+11+12+3+6+10 \\ = 49 \text{ bits}$$

$$\therefore \text{No. of bits saved} = 49-20 = 29 \text{ bits}$$

The above approach is combinational  $\mu$  programming.

within group - vertical  $\mu$  programming

between groups - Horizontal  $\mu$  programming

## nano programmed control unit design:

- the nano programmed control unit offers most flexibility
- it follows horizontal uprogramming, at the same time it is reducing no. of bits per control word
- it follows two level control memory
- The micro control memory performs sequencing activity while the nano control memory generates the control signal. The time for accessing the control signals is

$$T_{CS} = T_{UCM} + T_{NCM} + T_{HW}$$

- It is used for firm-ware based applications
- the boot strap os program is also stored in nano control memory

25

The total control memory size =  $M \text{ control memory size} + N \text{ nano control memory size}$

$$= H_M (\log_2 H_M + \log_2 H_N) + H_N * P$$

$H_M$  - No. of words in micro control memory

$H_N$  - No. of words in Nano control Memory

P - Number of control signals.

- Any change is made to Nano control Memory. It is the slowest one

eg:  
flexibility ascending order :- HW, HMP, VMP, NCM  
access time ascending order :- HW, HMP, VMP, NCM

- Max degree of parallelism in Nano programmed control unit design is same as horizontal uprogramming

= N

- Time is precious → Horizontal up
- ~~Memory~~ Memory is precious - Nano control Memory



## I/O (Input-output):

### Assembly Language program tracing:

Tracing the assembly language program concerns the computing the contents of registers (or) memory locations. and also computing about time and space requirements.

- The behaviour of the program is decided by the one (or) two key instructions.
- while tracing , status of flag conditions is to be noticed for these key instructions.

Ex:

|                                             | (64 bit word) | size in words | Meaning                                          |
|---------------------------------------------|---------------|---------------|--------------------------------------------------|
| Load R <sub>1</sub> , R <sub>2</sub> (1000) | 4             |               | R <sub>1</sub> ← M[R <sub>2</sub> +1000]         |
| ADD R <sub>2</sub> , R <sub>1</sub>         | 2             |               | R <sub>2</sub> ← R <sub>1</sub> + R <sub>2</sub> |
| STORE R <sub>1</sub> (000), R <sub>2</sub>  | 4             |               | M[R <sub>1</sub> +000] ← R <sub>2</sub>          |
| HALT                                        | 1             |               |                                                  |

If initial contents of R<sub>1</sub>=10, R<sub>2</sub>=1, M[001]=1

/ find the contents after executing the above program

add  $R_1, R_2(1000)$

$R_2 \leftarrow R_1 + R_2$

$[R_f+1000] \leftarrow R_2$

$R_1$        $R_2$

1      1

1      2

$R_1 = 1 \quad R_2 = 2$

$M[1001] = 2$

Let the above program is stored in a byte organised memory from 800 decimal address onwards. If an interrupt occurred after the fetch of ADD instruction, what will be the address stored in stack. 36

|     |                            |
|-----|----------------------------|
| 800 | Load<br>$4 \times 8 = 32$  |
| 831 |                            |
| 832 | $2 \times 8 = 16$<br>ADD   |
| 847 |                            |
| 848 | $4 \times 8 = 32$<br>store |
| 879 |                            |
| 880 | $1 \times 8 = 8$<br>Halt   |
| 887 |                            |

when interrupt occurs it completes present instruction, saves the address of next instruction in stack and branches to interrupt. After completing interrupt it gets a return address from stack.

Address of store instruction is stored on stack decimal value  
= 848

b) If the interrupt is occurred during the halt instruction what will be the saved address

Sol: Halt is implemented like

Halt  $\Rightarrow$  Here: JMP Here

program counter is not incremented

$\therefore$  Address 880 is stored on stack

P) Let the fetch, decode process consume two clocks per word for each instruction. Execution requires 4 clocks per memory instruction and 1 clock per arithmetic instruction. How many clocks are required to complete the program.

|                  | <u>Subinstruction</u> | <u>F+D</u> | <u>E</u> |
|------------------|-----------------------|------------|----------|
| LD R1, R2(1000)  | 4                     | 8          | 4        |
| ADD R2, R1       | 2                     | 4          | 1        |
| store R(100), R2 | 4                     | 8          | 4        |
| Halt             | 1                     | 2          | -        |

$$22 + 9 = 31 \text{ clocks}$$

P) consider the following program where R1, R2, R3 are 32 bit registers. R2=8, R1=0. If the following program is implemented find their final values.

```

MOV A, R3
UP: RRC
      JNC DOWN
      INC R1
DOWN: DEC R2
      JNZ UP
Halt
    
```

- Sol:
- (a) R1=8, R2=0
  - (b) R1=0, R2=0
  - (c) R1=No. of 0's of R3, R2=0
  - (d) R1=No. of 1's of R3, R2=0

Sol:

|       |                         |       |
|-------|-------------------------|-------|
| After | A7 A6 A5 A4 A3 A2 A1 A0 | carry |
| RRC   | 0 A7 A6 A5 A4 A3 A2 A1  | A0    |

P) what will be the instruction required at the blank place so that content of 'A' is restored back.

- (a) NOP (no operation)
- (b) RLC
- (c) RRC
- (d) ADD A, 01H

consider  $R_2 = 4$   $R_3 = (0110)_2$

|     | <u>A</u> | <u>carry</u> | <u><math>\frac{R_2}{R_3}</math></u> | <u><math>R_1</math></u> |
|-----|----------|--------------|-------------------------------------|-------------------------|
| 1.  | 0110     | 0            | 4                                   | 0                       |
| RRC | 0011     | 0            | 3                                   | 0                       |
| RRC | 0101     | 1            | 2                                   | 1                       |
| RRC | 1000     | 1            | 1                                   | 2                       |
| RRC | 0100     | 0            | 0                                   | 2                       |

To get the content of A, we apply another RRC

$$0110 = A$$

## I/O organisation:

37

- Addressing
  - Data transfer techniques
  - Disk and tape memories.
- CPU is not directly connected to I/O devices



because

- 1) speed mismatch
- 2) different format
- 3) different device drivers

so a black box is required between CPU & I/O.  
Based on the capability, the black box can be called

- 1) Interface
- 2) Module — supports multiple interfaces
- 3) I/O processor — implements I/O instructions
- 4) I/O channel — contains I/O processor as one component.

## Interface

It varies from device to device based on the nature of device.

## 1. serial interface

Eg : programmable communication interface (PCI) 8251  
It is also called as USART (universal synchronous Asynchronous Receiver Transmitter)

## 2. parallel interface

Eg : programmable peripheral interface 8255

- Serial interface has to do extra work

1) Responsibility to convert serial to parallel (device to processor)

2) Responsibility to convert parallel to serial (processor to device)

→ serial to parallel conversion is done when data is getting from device. This is at receiver section.

→ parallel to serial conversion is done when data is sending to device. This is at transmitter section.

→ Hence serial interface has two register

serial In parallel out (SIPo) - Receiver section

parallel In serial out (PiSo) - Transmitter section



- Every interface has to do

- buffering
- latching

Buffering:

- Amplification of digital signals without changing the logic levels

- It is selective amplification because only logic-0 and logic-1 are amplified



the strength is decreasing. So buffering is needed

### Latching:

- delaying / retaining
- to synchronise speed between processor and device
- Interface cannot perform error control. It is only a media.

Interface + Error control = Module

IO processor = Module + All IO instructions are executed

IO channel = separate IO processor and separate resources

selector channel; maximum data transfer rate =  $\frac{1}{q} \text{ (Pp)}$

multiplexer channel, max. data transfer rate =  $\frac{q}{P} \text{ (Pp)}$  38

### IO issues:

How to address device in order to communicate?  
there are two ways - 1) Memory mapped IO

2) IO mapped IO

### Memory mapped IO:



Adv: → Addresses are different  
→ All memory transfer instructions can be extended

to IO

LDA 1020

LDA 8020



$A \leftarrow M[1020]$

$A \leftarrow (8020)$

used to transfer data

to accumulator

Memory device data

IO device data

- more I/O addressing modes. Hence more flexibility
- disadv: change in I/O space effect memory space



### I/O Mapped I/O:

- Also called as isolated I/O



- Hence we can decrease (or) increase I/Os without affecting MS.
- Here the addresses are same for both MS and I/Os. For finding the correct address mode bit (M) is used.

### disadv:

1. I/O/M is required to distinguish memory and I/O device addresses, so more no. of address bits
2. Separate I/O instructions are needed  
Eg: IN, OUT
3. Addressing modes are less. So less flexibility

### Data transfer techniques:

How to get data?

Three ways to get data, (or) transfer data

1. program driven — processor control data transfer
2. Interrupt driven - device control data transfer
3. DMA

### program driven:

one instruction continuously checks the status of device if it is ready then perform the data transfer. If it is not ready then it is kept in lim



Adv: It is simple to implement  
No design complexity

disadv: Inefficient usage of CPU

Interrupt driven:  
Responsibility of device to tell processor whenever it is ready

If ready - perform data transfer

29

If not ready - device will do its work.

Device is controlling data transfer so it is peripheral

transfer



Adv: Processor is effectively used

implementation complexity

disadv: More

Access (DMA)

Direct Memory

Access to data transfer

to

- processor is isolated in the data transfer overhead

avoid fetch, decode & execution

transfer

Normally, to perform IO transfer

fetch the set of instructions

execute them

this results to transfer data

- when bulk data is to be transferred this overhead is

ii) processor is isolated: Hence bulk data can be transferred at faster rate

eg: disk-Memory traffic

- To supervise (or) coordinate this activity, DMA is introduced
- DMA can't fetch, decode and execute. It only prepares a channel on which data can be transferred.
- To perform data transfer system bus is required

Address bus + data bus + control bus

- If two system buses, one for CPU and one for DMA it is more costly and is waste
- So sharing of only one system bus between CPU & DMA  
Hence both can't perform their work parallelly
- So due to DMA, process is blocked ie performance is reduced.

$$\% \text{ of time CPU blocked} = \frac{T_x}{T_x + T_y} * 100$$

$T_x$ : word transfer time

$T_y$ : word preparation time



processor is blocked when no system bus

this cycle stealing mode.

- Based on how the system bus shared between processor and DMA controller (or) when the system bus is returned, DMA operation modes can be

- Burst mode
  - cycle stealing mode
  - Bulk transfer mode
- In burst mode, system bus is returned to processor only after the entire data is transferred. So processor delay time is more. So processor performance is effected more.
- In cycle stealing mode - the system bus is returned or exchanged after every word transfer and acquired after every instruction cycle. DMA waiting time is more (word transfer takes less time as it requires only one memory reference. But instruction requires more than one memory reference) 40
- In Block transfer mode, DMA controller returns the system bus after transferring the block of data. If block size is one word then it reduces to cycle stealing mode. If block size is entire data then it reduces to burst mode.
- Q) Consider a 2 Mbps device operating in cycle stealing mode of DMA. Whenever 64 bit data is available the controller transfers it to memory in 2  $\mu$ sec. What is the % of time the processor is blocked.
- Sol:
- $2 \times 10^6 \text{ B} \rightarrow 1 \text{ sec}$
- $8 \text{ B} \rightarrow 4 \mu\text{sec}$
- $\therefore T_A = 4 \mu\text{sec}$
- Given transfer time  $T_x = 2 \mu\text{sec}$



$$\therefore \% \text{ of time CPU blocked} = \frac{d}{d+u} * 100 \\ = 33.33\%$$

∴ efficiency of processor (CPU) = 66.67%.

i.e. processor is busy for 66.67% of time

- D) consider a system which uses 600 MHz clock. If it is needed to transfer 500 Bytes. The DMA initialisation and overhead consumes 80 clocks. Every Byte will take 2 clocks per transfer. How much time taken for this DMA.

Sol:

$$= \text{DMA initialisation time} + \text{Data Transfer time} \\ = 80 + 500 * 2 \\ = 1080 \text{ clocks } \checkmark \\ = \frac{1080}{600} * 10^6 \text{ sec} = 1.8 \mu\text{sec}$$

- D) In the above problem, if the interrupt driven transfer is used for every word, it requires 6 instruction overhead. Two of the instructions take 2 clocks and other instructions take 1 clock each. How much time is required for interrupt driven transfer?

6 instructions - [ 4 instructions - each 1 clock  
 2 instructions - each 2 clocks ]

$$\text{Total} = 4 \times 1 + 2 \times 2 \\ = 8 \text{ clocks}$$

Interrupt driven transfer time =  $8 \times 500 \text{ clocks}$   
 $= \frac{4000}{600} \times 10^6 \text{ sec}$   
 $= 6.6 \mu\text{sec}$

what is the minimum performance gain possible w.r.t DMA when compared to interrupt driven transfer

DMA performance =  $\frac{\text{interrupt time}}{\text{total DMA time}} = \frac{6.6}{1.8} = 3.7$

How many clocks are lost to the processor due to DMA.

These are lost during data transfer time  
 $= 1000 \text{ clocks}$

% of time processor blocked =  $\frac{1000}{1080} \times 100$   
 $= 92.6\%$

### Issues in interrupt implementation:

→ wrt. CPU

→ wrt. device

→ when to interrupt CPU

→ how to interrupt CPU

- The device can interrupt CPU at any time because interrupts are asynchronous in nature

- The device can interrupt CPU by changing the signal at hardware pin. This can

i) level triggered interrupt

ii) edge triggered interrupt

- How much duration should the interrupt be maintained  
 i.e. how much time calling bell should be in Dressed state

It is based on implementation

- 1) Maintain until it is recognised
- 2) Maintain some time until it is recorded in some place

### s.r.t. CPU-ISSUES:

- 1) When to recognise interrupt?

After completion of current instruction

- 2) How to recognise interrupt?

By reading the status of interrupt flag

- 3) What to do?

Branch to Interrupt service Routine (ISR) whenever an interrupt is recognised. ISR address is given either by device (or) it is default.

Vector Interrupt

Scalar Interrupt

- 4) How to resolve in case of multiple interrupts?  
- use priorities

### Secondary Memories:

- It is often required to have larger programs which exceeds the size of main memory. Hence secondary memory is needed for this purpose.
- Secondary memory can be
  - 1) disk (magnetic disk)
  - 2) tape
- Based on the access type, memories can be
  - 1) random access
  - 2) semirandom access
  - 3) serial access

- Disk is a semi random access and data is distributed across circular tracks and sectors.



- on plastic base, magnetic material is coated on concentric circles - called tracks!
- Each track is divided into manageable units called as sectors. It is basic unit for accessing disk. In general, there are equal no. of sectors per track. But the diameter of track is different. So how to place data?

42

Based on method of placing the data, the disk can be:

- 1) constant track capacity disk
- 2) variable track capacity disk.

constant track capacity disk: Here diameter is constant

- Here reading takes place constant time.
- Here reading takes place constant time, the recording density is varied i.e. the space between bits is varied

$$\text{Recording density} = \frac{\text{No. of bits}}{\text{unit distance}}$$

- (e)
- $\rightarrow$  Max. for inner track  
 $\rightarrow$  Min. for outer track

- To perform reading, the disk has to be moved with angular velocity.