

# Cosmos(Sajal)

A-PDF Watermark DEMO: Purchase from www.A-PDF.com to remove the watermark

Date

4.04.12

Textbook :- Morris Mano, William Stallings

Microprocessor:- A.K.Ray

Advanced CA:- Hennessy

Date \_\_\_\_\_  
Page No. \_\_\_\_\_

## COMPUTER ORGANISATION

Syllabus:-

- ① Functionality of Computer
- ② Component of Computer

Computer:- It converts one form of data into another form under the control of program.

(Execution of program)



While execution of program, one form of data is converted into another form. System functionality is execution of program.

Program:- Sequence of instructions along with data.

Program - [instruction  
data]

Instruction :-



To get a response from the master, we need to give instructions & data in Master understandable form.

# Cosmos(Sajal)

Date:  
Page No.

Kitty

Instruction:- Binary sequence which is designed inside the processor to perform some task.

Data:- Binary sequence which is associated with a value based on a weighing factor.

To execute a program we need 3 components:-

- ① CPU :- processing
- ② Memory :- storage
- ③ I/O Devices :- external communication

## Memory Organisation

Program is stored in the memory.



- Memory chip is divided into equal parts called as cell.
- Each cell is recognised by a unique no. called address.
- Address enable the memory cell.
- Each cell recognises two control signals, i.e. RD (read) & WR (write).

# Cosmos(Sajal)

Date: \_\_\_\_\_  
Page No.: \_\_\_\_\_

- When CPU wants to process the instructions, it generates a memory request to read / read the instructions from Memory.

Mem. Request :-

[Address|CS] → Control Signal (Read or Write)

↑  
Where the instruction is present in memory

What type of operation we need to perform with the instruction

- \* For performing an operation, we need address & control signal both.
- Memory Address depends on the memory chip configuration.

## Memory Configuration

Memory chip configuration is represented using following format :-

64K X 8

64K X 16

64K X 32, 4 so on.

Consider 64K X 8 memory chip

(64K) X 8  
↓  
Capacity of the memory

based on capacity, we can define no. of mem. cells.

$$64 \times 2^6 = 2^{10} \text{ cells}$$

based on no. of mem. cells, we can calculate no. of address bits req. of enable a cell :- 16 bits (which is the power) no. of address lines req. = 16 bits

Mem. map Space :-  
0000H to FFFFH  
(hexadecimal representation)

$$1. (0000)_{16} \quad (FFFF)_{16}$$

based on no. of bits req., we determine Mem. map/Address space / Address range  
∴ Mem. map :-  $0000^{16}$  (16 0's) to  $1^{16}$  (16 1's).

# Cosmos(Sajal)

Date: \_\_\_\_\_  
Page No. \_\_\_\_\_

64 K x 8

it represents cell size

- Byte Addressable Mem.
- Word Addressable Mem.

## BYTE ADDRESS V/s WORD ADDRESS

- Byte Addressable Memory :-  
Each memory cell points to an 8-bit info.
- Word Addressable Memory :-  
Each memory cell points one word info.

| Processor            | Byte (B) | Word (W) |
|----------------------|----------|----------|
| 8-bit microprocessor | 8 bits   | 8 bits   |
| 16-bit               | "        | 16 bits  |
| 32-bit               | "        | 32 bits  |
| n-bit                | "        | n bits   |

Imp Word size depends on the word length of the processor, i.e. no. of bits processed by the processor at a time.

- Existing Mem. system requires Byte Addressable Memory because it is unambiguous.

NOTE:- The mem. system's default address space is byte addressable space.

64 KB → 64 kilo byte

Memory system is represented by bytes, & not by words.

\* Word is ambiguous, because it is diff. for diff. processors.

# Cosmos(Sajal)

Date \_\_\_\_\_  
Page No. \_\_\_\_\_

NOTE:- When processor word size is greater than the memory cell size, then there is need for accessing multiple memory cells.

e.g.

16 bit processor :-

MOV AX, [2000] → Memory Address  
↓  
mnemonic (user understandable format) → destination. → AX represents register  
converted into machine understandable format,  
i.e. opcode

& opcode indicates the type of operation,  
that MOV is data transfer operation.

[2000] → Source

↓  
Memory Address

↓  
8 bits

[Byte Addressable Memory]

- \* To transfer the mem. content, we need to transfer 16 bit data, so we need to access 2 memory cells.

M[2000 H]

& M[2001 H]

MOV R0, [2000]

↓  
16 bit  
processor  
register

[32 bit microprocessor]

[2000H] represent only 8 bit data, so we need to access 4 cells for the transfer.

To → [M[2000 H]  
M[2001 H]  
M[2002 H]  
M[2003 H]]

# Cosmos(Sajal)

Date \_\_\_\_\_  
Page No. \_\_\_\_\_

Memory:-

|      |    |
|------|----|
| 0000 |    |
| 2000 | 23 |
| 2001 | 18 |
| 2002 | 12 |
| 2003 | 24 |
| FFFF |    |

\*  $\text{MOV AX, [2000H]}$

$AX = \begin{matrix} 23 \\ 18 \\ 2001 \end{matrix}$  2000 contains  
higher byte      lower byte

or

$AX = \begin{matrix} 23 \\ 18 \\ 2001 \end{matrix}$  2000 contains  
lower byte      higher byte

\*  $RO = 24 \ 12 \ 18 \ 23$  or  $23 \ 18 \ 12 \ 24$

We don't know whether 2000H contains  
lower byte or higher byte.

As there is no order of data storage in the  
memory, the above instructions generates multiple  
outputs(ambiguity). To avoid ambiguity there is a  
need of memory address interpretation mechanism  
(Endianness (Endianess Mechanism))

Endianness shows the order of data storage.

Little Endian

Big Endian

# Cosmos(Sajal)

Date: \_\_\_\_\_  
Page No. \_\_\_\_\_

In Little Endian, the order of data storage is ~~that~~ lower address contains lower byte & higher address contains higher byte, i.e.



∴ acc. to little endian

$$AX = 18\ 23$$

In big endian, lower address contains higher byte & higher address contains lower byte, i.e.



∴ acc. to big endian

$$AX = 23\ 18$$

~~little~~ Little endian mechanism is a default interpretation Mechanism.

$$RO = 2\ 4\ 12\ 18\ 23  
(\text{little endian})$$

$$RO = 23\ 18\ 12\ 24  
(\text{big endian})$$

$$AX: \begin{array}{|c|c|} \hline 18 & 23 \\ \hline \end{array} \begin{matrix} 2001 & 2000 \end{matrix} \\ (\text{little endian})$$

$$AX: \begin{array}{|c|c|} \hline 23 & 18 \\ \hline \end{array} \begin{matrix} 2001 & 2000 \end{matrix} \\ (\text{big endian})$$

$$RO: \begin{array}{|c|c|c|c|} \hline 24 & 12 & 18 & 23 \\ \hline \end{array} \begin{matrix} 2003 & 2002 & 2001 & 2000 \end{matrix}$$

$$RO: \begin{array}{|c|c|c|c|} \hline 23 & 18 & 12 & 24 \\ \hline \end{array} \begin{matrix} 2003 & 2002 & 2001 & 2000 \end{matrix}$$

Control Signals:- The processor supports b/w pins to perform the operations to the externally connected components.

The pins are classified into 3 types:-

① Active low pin

These pins are enabled when input is zero ('0').

They are denoted with "Pinname", e.g.

RD, WR, INTA, CS

# Cosmos(Sajal)

Date: \_\_\_\_\_  
Page No. \_\_\_\_\_

## ② Active High Pins:-

These pins are enabled when their input is 1, they are denoted as "Pinname", e.g. ALE, INTR, HOLD, HLDA.

## ③ Time Multiplexed Pins:-

These pins carry two meanings, where the meaning is defined by another pin. The advantage is that it reduces the no. of pins.

e.g. AD<sub>0</sub>-AD<sub>7</sub> (8085)

Address/Data

The meaning of this pin is defined by ALE. When ALE is '1', AD<sub>0</sub>-AD<sub>7</sub> carries the address & when ALE is '0', AD<sub>0</sub>-AD<sub>7</sub> carries the data, so the selection i/p's of AD<sub>0</sub>-AD<sub>7</sub> is ALE.

Buffer:- It is the temporary storage where the i/p & o/p lines are always in the enable state, there is no locking.  
i.e. When there is data in i/p lines, then immediately the data is transferred to the buffer, & when there is data in the buffer, the data is transferred immediately to the o/p lines, we can't lock data in buffer.



Latch:- Latch is also a temporary storage device, where the i/p & o/p lines are under the control of strobe & output-enable pins respectively.

(STB)

(OE)

# Cosmos(Sajal)

**STB**

- o  $\rightarrow$  i/p lines are disabled, & data can't be transferred from i/p lines to latch.
- 1  $\rightarrow$  i/p lines are enabled & data can be transferred from the i/p lines to the latch

**OE**

- o  $\rightarrow$  O/p lines are enabled, & data can be transferred from latch to the output lines
- 1  $\rightarrow$  O/p lines are disabled, & data can't be transferred from latch to the O/p lines.



- \* Latch may act as a buffer, but buffer can't act as a latch.  
Latch will act as a buffer when STB is high & OE is low.

## Memory Interfacing :-



CPU: 2000H & RD  
A<sub>15</sub>-A<sub>8</sub> A<sub>7</sub>-A<sub>0</sub>

# Cosmos(Sajal)

Date: \_\_\_\_\_  
Page No. \_\_\_\_\_

|     | T <sub>1</sub> | T <sub>2</sub> | T <sub>3</sub> | T <sub>4</sub> |
|-----|----------------|----------------|----------------|----------------|
| ALE | 1              | 0              | 0              | 0              |
| RD  | 0              | 1              | 0              |                |

## Communication Channel i.e. Bus:-

There is a need of communication channel b/w the major components of the system, i.e. CPU, memory & I/O to satisfy the objective of the computer. This channel is called System Bus. The characteristic of Bus is shared transmission media. The limitation of the bus is only one transmission at a time.



CPU :- Master

Memory, I/O :- Slaves, these have the addresses  
I/O address = port address, two operations are possible namely RD, WR.

\* System bus consists 3 category of lines to perform the communication named as -

- ① Address Lines
- ② Data Lines
- ③ Control Lines.

(Unidirectional) Address Lines :- towards mem. & I/O, they are unidirectional, based on the no. of address lines we can determine the capacity of the system.

# Cosmos(Sajal)

Date: \_\_\_\_\_  
Page No. \_\_\_\_\_

e.g. in 8085

$$\text{AD}_0 - \text{AD}_7 \quad A_8 - A_{15} = 16 \text{ address lines}$$
$$\Rightarrow 2^{16} \text{ cells}$$
$$\Rightarrow 64 \text{ K cells}$$
$$\Rightarrow 64 \text{ KB}$$

in 8086

$$\text{AD}_0 - \text{AD}_{15} \quad A_8 - A_{19} = 20 \text{ address lines}$$
$$\Rightarrow 2^{20} \text{ cells}$$
$$\Rightarrow 2^{10} * 2^{10} \text{ cells}$$
$$\Rightarrow 1 \text{ K} * 1 \text{ K cells}$$
$$\Rightarrow 1 \text{ M cells}$$
$$\Rightarrow 1 \text{ MB}$$

e.g. Do 8 cells (1), D1, D2, D3, D4, D5, D6, D7, D8, D9, D10, D11, D12, D13, D14, D15

a system with 32 address lines:

$$\Rightarrow 2^{32} \text{ cells}$$
$$\Rightarrow 2^2 * 2^{10} * 2^{10} * 2^{10}$$
$$\Rightarrow 4 * 1\text{K} * 1\text{K} * 1\text{K}$$
$$\Rightarrow 4 * 1\text{M} * 1\text{K}$$
$$\Rightarrow 4 * 1\text{G}$$
$$\Rightarrow 4\text{G cells}$$
$$\Rightarrow 4\text{GB}$$



Data lines carry the data stored in the selected cell (from (i) to (iv)), each data line carrying one bit from the selected cell.  
So, no. of data lines specified the "WORD" length & not "BYTE" length.

(Bidirectional)

- \* Data Lines :- These lines carry the binary sequence b/w CPU, memory & I/O, hence they are bidirectional. Based on the no. of data lines, the processor's word length depends. Based on word length, performance will be measured.

(Unidirectional)

- \* Control Lines :- These lines carries the control & timing signals. Control signals are used to indicate the type of operation & timing signals are used to synchronise the mem. & I/O operations with processor clock.

# Cosmos(Sajal)

Date: \_\_\_\_\_  
Page No. \_\_\_\_\_

| T <sub>1</sub> | T <sub>2</sub> | T <sub>3</sub> | T <sub>4</sub> |
|----------------|----------------|----------------|----------------|
| 1              | 0              | 0              | 0              |
| 01             | 0              |                |                |

## Communication Channel i.e. Bus:-

There is a need of communication channel b/w the major components of the system, i.e. CPU, memory & I/O to satisfy the objective of the computer. This ~~is~~ channel is called System Bus. The characteristic of Bus is shared transmission media. The limitation of the bus is only one transmission at a time.



CPU:- Master

Memory, I/O:- Slaves, these have the addresses  
I/O address = port address, two operations are possible, namely RD, WR.

★

★ System bus consists 3 category of lines. to perform the communication named as -

- ① Address Lines
- ② Data Lines
- ③ Control Lines.

★

Address Lines:- towards mem. & I/O, they are unidirectional, based on the no. of address lines we can determine the capacity of the system.

Imp

# Cosmos(Sajal)

Date: \_\_\_\_\_  
Page No. \_\_\_\_\_

e.g. in 8085

$$\begin{aligned} \text{AD}_0 - \text{AD}_7 \quad \text{A}_8 - \text{A}_{15} &= 16 \text{ address lines} \\ \Rightarrow 2^{16} \text{ cells} \\ \Rightarrow 64 \text{ K cells} \\ \Rightarrow 64 \text{ KB} \end{aligned}$$

in 8086

$$\begin{aligned} \text{AD}_0 - \text{AD}_{15} \quad \text{A}_{16} - \text{A}_{19} &= 20 \text{ address lines} \\ \Rightarrow 2^{20} \text{ cells} \\ \Rightarrow 2^{10} \times 2^{10} \text{ cells} \\ \Rightarrow 1 \text{ K} \times 1 \text{ K cells} \\ \Rightarrow 1 \text{ M cells} \\ \Rightarrow 1 \text{ MB} \end{aligned}$$

e.g. Do ~~earries~~ <sup>earries</sup>  
D<sub>1</sub>, D<sub>2</sub>, D<sub>3</sub>, D<sub>4</sub>

a system with 32 address lines:

$$\begin{aligned} \Rightarrow 2^{32} \text{ cells} \\ \Rightarrow 2^2 * 2^{10} * 2^{10} * 2^{10} \\ \Rightarrow 4 * 1\text{K} * 1\text{K} * 1\text{K} \\ \Rightarrow 4 * 1\text{M} * 1\text{k} \\ \Rightarrow 4 * 1\text{G} \\ \Rightarrow 4\text{G cells} \\ \Rightarrow 4\text{GB} \end{aligned}$$



Data lines carry the data stored in the selected cell (from (i) to (k)), each data line carries one bit from the selected cell.  
So, no. of data lines specified is the "WORD" length (not "BYTE" length)

\* Data Lines :- These lines carry the binary sequence b/w CPU, memory & I/O, hence they are bidirectional. Based on the no. of data lines, the processor's word length depends. Based on word length, performance will be measured.

\* Control Lines :- These lines carries the control & timing signals. Control signals are used to indicate the type of operation & timing signals are used to synchronise the mem. & I/O operation with processor clock.

# Cosmos(Sajal)



- There is ambiguity when both I/O & Memory controllers both want to access the bus
- Because of common channel & common control signals & common address space maintained b/w memory & I/O, then there is a possibility of ambiguity.  
To avoid the ambiguity bus configurations are used.

## BUS CONFIGURATIONS

- There are 3 bus configurations used in the processor design, named as:-
  - IOP (Input-Output processor)
  - Isolated I/O (IO mapped I/O)
  - Memory-Mapped-IO

IOP:- In this configuration, separate buses are used for memory & I/O, introducing an additional bus is expensive.  
Only high-performance architecture are using this configuration.

# Cosmos(Sajal)

Date: \_\_\_\_\_  
Page No. \_\_\_\_\_

## • Isolated IO :-

This configuration uses common bus & common address space but different control signals, i.e.



MEMRD is initiated by LOAD control signal.  
MEMWR is initiated by STORE control signal.  
IORD is initiated by IN control signal.  
IOWR is initiated by OUT control signal.

- \* both 8085 & 8086 processors use the Isolated IO configuration that can be implemented as follows.

8086:

|      |    |    |
|------|----|----|
| M/IO | RD | WR |
|------|----|----|

↓  
this pin  
selects whether  
to access memory  
or Input-Output

| M/IO | RD | WR |                |
|------|----|----|----------------|
| 1    | 0  | 1  | → Memory Read  |
| 1    | 1  | 0  | → Memory Write |
| 0    | 0  | 1  | → IO read      |
| 0    | 1  | 0  | → IO write     |

- \* In 8085, we only have M/IO as the only difference from 8086 where it is M/IO.

# Cosmos(Sajal)

Date: \_\_\_\_\_  
Page No. \_\_\_\_\_



where 1000 is some I/O device with I/O port Address as 1000.



\* So there is no ambiguity.

## Memory-Mapped I/O :-

In this configuration, common channel & common control signals are used but unique address space is maintained b/w IO & memory.

Note:- The unique address is maintained by taking some of the addresses from the pool of memory address space & assigning them to the IO ports & disable the same in the memory.

\* At the time of design, we check how many IO devices are there, & we will take that no. of addresses from the memory space of memory.

e.g.



We know that there are 2 IO devices, so we take 2 addresses (assume address 0000 & 0001) from the pool of addresses of memory & assign these addresses to IO devices. Now these 2 addresses are not assigned for memory usage.

Imp.

# Cosmos(Sajal)



OI address is taken from the pool of addresses of memory to I/O device.

## Instruction Cycles

- ① This concept describes the execution sequence of an instruction.
- ② Instruction cycle consists 2 subcycles, i.e.
  - (i) Fetch cycle
  - (ii) Execution cycle.



## Fetch Cycle :-

In this cycle, the instruction is transferred from memory to CPU based on the program counter register. This process is called instruction fetch. At the end of instruction fetch, the program counter is incremented ~~to~~ by step-size to point to next instruction address. ∴ the functionality of program counter is that it holds the starting instruc<sup>n</sup> address & immediately points to the next instruction address.

e.g. if our program is stored starting from mem. location 1000, then we pass this command to let PC knows about the starting address of program.

# Cosmos(Sajal)

Date: \_\_\_\_\_  
Page No. \_\_\_\_\_

$-t = 1000$   
↓  
trace  
executable  
command  
Starting  
address  
of the program

This means that the user transfers the starting address of the program to the Program Counter.

this 1000 is transferred to PC to let it know about the start of the program



then  $PC \leftarrow PC + \text{stepsize}$

→ step-size is the size of the instruction,  
if instruction is 8-bit, then

$$PC \leftarrow PC + 1$$

& if step instruction is 16-bit, then

$$PC \leftarrow PC + 2$$

[because the step-size is 4].

if instruction size is  $n$  bytes, then step-size is  $n$ . [Step-size is in bytes.]

Q1. Consider the following program segment

| In S <sup>n</sup> | Size(words) | PC                      |
|-------------------|-------------|-------------------------|
| I <sub>1</sub>    | 2           | PC = 1008 ; 1000 + 1007 |
| I <sub>2</sub>    | 1           | 1008 + 1011             |
| I <sub>3</sub>    | 1           | 1012 + 1015             |
| I <sub>4</sub>    | 2           | 1015 + 1023             |
| I <sub>5</sub>    | 1           | 1024 + 1027             |

(a) assume that word size is 32 bit & the program has been loaded in the memory with starting address is 1000 (decimal) onwards

bz  
224

# Cosmos(Sajal)

Date: \_\_\_\_\_  
Page No. \_\_\_\_\_

During execution of  $I_4$ , what value will be present in PC

Ans 32 bit will be stored in 4 bytes(4 bytes).  
 $\therefore$  PC = 1024 during execution of  $I_4$ .

Imp.

During Execution of  $I_4$ , the program counter is incremented.

(b) Assume that memory is word addressable with word size 32 bit. The program has been loaded in the memory with st. address of 1000 (decimal onwards). During execution of  $I_3$ , what is PC

| Ins <sup>n</sup> | size (words) | valid PC   |
|------------------|--------------|------------|
| $I_1$            | 2            | 1000-11001 |
| $I_2$            | 1            | 1002       |
| $I_3$            | 1            | 1003       |
| $I_4$            | 2            | 1004-1005  |
| $I_5$            | 1            | 1006       |

$\therefore$  PC = 1004. [In both (a) & (b), step-size is variable because the processor is supporting variable-length instructions.]

\* We can see that Step-size is variable, because

when the processor supports fixed-length instruction, then step-size is constant,  
if processor supports variable-length instruction, then step-size is variable.

[this is byte addressable memory & each memory cell is of 2 bytes, but each instruction is of 24 bit so to store 1 instruction we need 3 cells.]

Q2. A computer has 24-bit instructions. The program has been loaded in the memory with st. address of 300(decimal onwards). Which is valid value for PC.

- (i) 400  
(ii) 500

- (iii) 600  
(iv) 700

[as step-size is 3 byte.]

The step-size is constant because the processor is supporting fixed length instruction of 24 bits.

# Cosmos(Sajal)

valid PC.

|     |     |
|-----|-----|
| 200 | 302 |
| 303 | 305 |
| 306 | 308 |
| 309 | 311 |

Date: \_\_\_\_\_  
Page No. \_\_\_\_\_

## Execution Cycle :-

In this cycle, the fetched instruction undergoes processing.

To process the instruction, we need opcode info.  
Instruction format defines the opcode.

## Instruction format :-

It gives the layout of an Instruction.

- Encoding:-  $N$  combinations requires  $\log_2 N$  bits  
e.g. we want to do four operations, like add, sub, mul, div., then there are 4 operations, so we require  $\log_2 4 = 2$  bits in opcode.
- Decoding:-  $N$  bit decoder generates  $2^N$  o/p combinations.  
e.g. 2 bits in opcode, then we can perform 4 operations.

Instruction Format is classified into 5 types based on CPU organisation, EPU organisation is classified into 3 types based on internal storage.

- ① Stack Machine
- ② Accumulator Machine
- ③ General - register orgn

} → CPU organisation

Stack Machine: In this organisation, the ALU operations are performed only on the top of the stack.

[0-Address Instruction]

# Cosmos(Sajal)



TOS (default)

This means that we don't need to give the source & destination address, as the values are fetched & stored on the top of the stack, so the instruction only contains the opcode.

In this organisation, Source & destination addresses are default addresses, i.e. instruction contain only one field, i.e. opcode.

e.g.

Opcode

indicates type of operation.

data transfer opn.  
ALU operations.

as we data can only be transferred from CPU to TOS & vice versa, so source & destinations are default, we don't need to give source or destination address.

Destination

TOS

CPU(Acc.)

Source

CPU(Accumulator)

TOS

? → Data transfer operations  
Push  
Pop

→ ALU operations.

→ 2nd type of operation :- ALU operations.

In this case we need two operands, we will get this from TOS, perform the operation & stores the result back in stack.

e.g.

TOS → Acc  
Pop

Push

TOS ← Acc

Pop

Acc ← TOS

Add

TOS ← TOS + TOS

POP

POP

then perform addition  
& then push the result back into the stack.

Q5. Consider following program & execute onto the stack machine. Each arithmetic operation pops the 2nd operand, pops the 1st operand, operates on them & then store the result back onto the stack.

# Cosmos(Sajal)

ADD instrn inherently contain two pop & 1 push instrn, so I thought it should be obviously larger than a push or pop instrn, but I was wrong. & push & add instrn do ~~not~~ have any dependency to each other, the most we can say that is the microprogram corresponding to add instrn will contain control signals for push & pop.

Push b

Push x

ADD

POP C

Push C

Push y

ADD

Push C

Sub

pop z



instrn also, so microprogram for  $c = bt + x$  add > microprogram of push/pop instrn (definitely)



pop y - bt \* x



Q Which of the following statement(s) is/are true?

- (i) If each push & pop instruction req. 5 bytes of storage, each arithmetic instrn req. 1 byte of storage, then the whole program req. 40 bytes of storage.
- (ii) At the end of execution z contain the same value as y.
- (iii) At the end of the execution, stack is empty.

- (a) 1 only  
(b) 2 only  
(c) 2,3  
(d) 1,2,3

Accumulator In this organisation, one of the ALU operand Machine is by default present in the accumulator, (1-address other ALU operand may be present in instruction memory or processor register. The Accumulator acts as the source as well as the destination. The instruction ~~con reg~~ contains:-

opcode & address of 1 register

# Cosmos(Sajal)

Date: \_\_\_\_\_  
Page No. \_\_\_\_\_

7.8.1

The operand is transferred from memory or register to the temporary register (TR).

- \* In this organisation, the Inst<sup>n</sup> format consists of 2 fields, i.e. Opcode & address field.

opcode | address → of the second source.

|          |            |                                     |
|----------|------------|-------------------------------------|
| Destn    | Source     | for Data transfer op.<br>we need to |
| Mem      | (CPU(Acc)) |                                     |
| CPU(Acc) | Mem.       |                                     |

The destination for ALU operations is always the accumulator.

|       |          |          |                   |
|-------|----------|----------|-------------------|
| Destn | Source 1 | Source 2 | for ALU operation |
| ACC   | Acc      | Mem/Reg. |                   |

e.g.

→ Load [3000] → ADD [5000]

Acc ← M[3000]

Acc ← Acc + M[5000]

→ Store [5000]

M[5000] ← Acc.

this means that if some free space will be left, we can assign it to stack & hence can work it as a stack machine.

→ ADD R0

Acc ← Acc + R0

Ques- One Address Machine also supports Zero

Address Machine if there is a free space after allocating One address instruction.

Q4. Consider a hypothetical processor which has 4-bit instruction & 2-bit addresses. It supports both 1-address & 0-address instruction.

If there exist 2 One address instructions, how many Zero instructions can be formulated?

Ans- Identify the highest instruction format.

Take 1-address format



# Cosmos(Sajal)

In the question it is given that we have 2 1-address instruction, this means that we only have 2 opcodes, these 2 opcodes requires storage of 2 bits (as per the question), rest 2 bits are for the address. Now take the opcode:

these 2 are free combinations [? 9] → these 2 opcode combinations are used by 1-address format.

Step 2:- Identify the no. of opcode combinations.  
no. of opcode combinations =  $4$ . [ $2^{\text{opcode size}}$ ]

Step 3:- Calculate the no. of free combinations after allocating the highest instructions.

$$\rightarrow \text{no. of free combination} = 4 - 2 = 2$$

Step 4:- Calculate the no. of low order instructions by taking the no. of free combination & decode the value of address field.

Step 2 :- # no. of opcode combinations = 4

'00 address', → assume it is  
'01 address', taken by 1-address  
10 address format.  
11 address

Now, out of these 4, 2 is taken by 1-address instruction.

Step 3:- So, no. of free opcode combn =  $4 - 2 = 2$

$$\text{where free combin.} = \begin{matrix} 10 \text{ Addr.} \\ 11 \text{ Addr.} \end{matrix}$$

Now, for 0-address format.  
address is 2 bits.

∴ we can have a total of 4 combinations from 2 bit address

|                                                                                  |       |       |
|----------------------------------------------------------------------------------|-------|-------|
| OpCodes of 4-bit for 0-address instructions starting with free combination '10'. | 10 00 | 11 00 |
|                                                                                  | 10 01 | 11 01 |
|                                                                                  | 10 10 | 11 10 |
|                                                                                  | 10 11 | 11 11 |

Now for 0-address instruction we have need to subby the opcode only [as we dont need to give the source & destination addresses], so

10 leave the free combinations  
11 can use the address bits also

↳ opcodes of 4 bits for 0-address instruction starting with '11'.

$$\star \quad \text{no. of zero address instn} = \frac{\text{no. of free combinations}}{2 \text{ no. of address bits}}$$

# Cosmos(Sajal)

Date: \_\_\_\_\_  
Page No. \_\_\_\_\_

- Q5. Consider a hypothetical processor, a nine-bit instruction is placed in the 128-bit word memory, the processor supports both 1-address & 0-address instruction, if there exists 3 1-address instruction, how many 0-address instructions can be formulated.



128 word  $\rightarrow 2^7$  cells  $\rightarrow$  7 bit address is req.



# of combinations of opcode = 4

# of free combination of opcode = 1

∴ We have

$$\# \text{ of } 0\text{-address instructions} = 1 \times 2^7 = 128$$

- Q6. Consider a computer which has 32 bit instructions & 24 bit address. It supports both 0-address & 1-address instn.

(a) What is the range of 1-address instructions

(b) What is the range of 0-address instructions

Ans. (a)  $0^{32}$  to  $1111110\ 1^{24}$   $0^8 0^{24}$  to  $0^8 1^{24}$   $0^8 1^{24}$  to  $1^{24}$

(b)  $1111110\ 0^{24}$  to  $1111110\ 1^{24}$

$0^8 0^{24}$  to  $1^8 1^{24}$   $1^8 0^{24}$  to  $1^8 1^{24}$

No. of opcode combination =  $256 = 2^8$

(a) 1 address - instruction: 1 to 255  
there

(b) 0 address - instruction:  $255 * 2^{24}$  to  $1 * 2^{24}$ .

# Cosmos(Sajal)

Date: \_\_\_\_\_  
Page No. \_\_\_\_\_

General Registers: Based on no. of Registers supported by the Organisation processors, this architecture is divided into 2 types :-

- ① Reg-Memory ref. Arch.
- ② Reg-Reg ref. Arch.

## Reg-Memory ref. Arch. :-

This arch. supports less no. of registers. The register acts as the source as well as the destination. The 2nd operand is present in memory or register.



Source 1: Register  
the same register will act as the destination.

Source 2: Register or memory.

\* We are having 4 address registers here.

In this architecture the instruction format consists 2 address fields along with the opcode.

Opcode | Addr1 | Addr2

Type of op<sup>n</sup>

1. Data transfer op<sup>n</sup>.

2. ALU op<sup>n</sup>

Destn

CPU(Register)  
Memory

Destn.

Source1  
reg.

Data transfer

Op<sup>n</sup> doesn't takes place b/w registers,  
i.e. R0  $\leftarrow$  R1 (not possible)

& also not b/w two mem.

Source (e.g. M[1000]  $\leftarrow$  M[12])

Memory (not possible)

CPU(Register)

Source1 reg.  
Source2 reg./Mem.

These are same.

# Cosmos(Sajal)

Date: \_\_\_\_\_  
Page No. \_\_\_\_\_

- $\text{Mov } r0 \ [4000]$   
 $r0 \leftarrow M[4000]$
- $\text{Add } r0, r1$   
 $r0 \leftarrow r0 + r1$
- $\text{Mov } [5000] \ r0$   
 $M[5000] \leftarrow r0$
- $\text{Add } r0, [4000]$   
 $r0 \leftarrow r0 + M[4000]$

Q. Consider a hypothetical processor which has 128 word memory. A 16-bit instruction is placed in 1 memory cell. It supports 2-address, 1-address & 0-address instructions. It uses expanding opcode technique. If there exist two 2-address instr<sup>n</sup> & 250 1-address instr<sup>n</sup>, how many 0-address instr<sup>n</sup> can be formulated?

Ans.



For 2-address instructions:-



opcode      address bits  
 as only 2 opcode combinations are used, i.e. # of free combinations for 1 address instructions are 2, now, out of rest 14 address bits we can only use 7 bits for opcode as the other 7 bits will be used for addresses.

$$\text{No. of opcodes} = 2^2 = 4$$

$$\text{No. of free combinations} = 2(4 - 2)$$

$$\text{No. of 1 address instr.} = 2 \times 2^7 = 256 \text{ 7 bits for opcode as the}$$

Now,

$N = 256$  (in 2nd stage.)

$\therefore$  we have

now we have 250 1-address instr<sup>n</sup>.

$$\therefore \text{No. of free combinations} = 6 (256 - 250 = 6), \text{ address}$$

$$\therefore \text{No. of 0-address instr.} = 6 \times 2^7$$

no. of 1-address instr<sup>n</sup> that can be formed are:  $- 2 \times 2^7 = 2^8 = 256$ ,



now out of 256 opcode combinations, only

250 1-address opcode combinations are used, so no. of free combinations for 0-address instructions is 6.

so, for 0-address  $6 \times 2^7$ .



we can use only 6 combinations out of 2<sup>9</sup> combinations.

# Cosmos(Sajal)

$$2^8 \times 2^{10} = 2^{18}$$

$$2^8 + 2^{10} = 2^{18}$$

Date \_\_\_\_\_  
Page No. \_\_\_\_\_

Q. Consider a computer which has 256 K word memory. It supports an instruction format with 4-fields i.e. Opcode, register addressing field (used to represent one of 1 of 60 processor registers), addressing mode field (to represent 1 of the 7 addressing mode), Memory address field. How many no. of operations are supported when a 32 bit instruction is placed in one memory cell.

Ans.



$$32 - (18 + 6 + 3) = 5 \text{ bits}$$

$$\therefore \text{no. of opcodes} = 2^5 = \boxed{32 \text{ opcodes}}$$

(b) If system supports 2-address & 1-address instructions. If there exist 30 2-address instructions, how many 1-address mem. reference instructions are possible?

Ans.



$$\# \text{ of free Opcode} = 32 - 30 = 2$$

$2 \times 2^6$  (as it is mem. reference instructions)  
 $\therefore$  mem ref first bits will not be taken into consideration.)  
 $\therefore 2 \times 2^6$ .

# Cosmos(Sajal)

Date: \_\_\_\_\_  
Page No. \_\_\_\_\_

- ① The above instruction format contain register reference & mem. reference.
- ② As a part of 2-address  $inst^n$ , we can read the source1 or from reg. & source2 from the memory.
- ③ After allocating 30 2-address  $inst^n$ , 2 combinations are free.
- ④ Those combinations are used to define either 1-address mem. ref.  $inst^n$  or 1-address reg. ref.  $inst^n$ .
- ⑤ For the mem.-ref.  $inst^n$ , the free opcode combinations are combined with the register address, i.e.  
 $2 \times 2^6$
- ⑥ For register-ref.  $inst^n$ , the free opcodes are combined with memory address, i.e.  
 $2 \times 2^{18}$

## Register to Register Ref. Architeture

[Destination-reg.  
Source1-reg.  
Source2-reg.]

This architecture supports more no. of supports registers,  $\therefore$  there is no possibility of overwriting. The ALU operations are performed only on the registers. Load & Store  $inst^n$  are used to transfer the data b/w memory & registers.



\* In reg-to-mem. ref. arch., the ALU opn. are taking place b/w reg. (source 1) & reg./mem. (source 2) but in reg-reg ref. arch., the ALU opn. are taking place b/w reg. (source 1) & reg. (source 2).

# Cosmos(Sajal)

Date: \_\_\_\_\_  
Page No. \_\_\_\_\_

ALU step op<sup>n</sup> supports 3-address "inst" format,  
i.e.

| Opcode                 | addr1                         | addr2   | addr3   |
|------------------------|-------------------------------|---------|---------|
| ALU<br>Op <sup>n</sup> | Register<br>Dest <sup>n</sup> | source1 | source2 |

e.g. Add r0, r1, r2

$$r0 \leftarrow r1 + r2$$

Add r0, [4000], r2 → not possible  
[as all must be registers.]

## 4-Address Inst<sup>n</sup> format

There is a special "inst" format that contains 4-address field along with the opcode

| Opcode                     | addr1             | addr2   | addr3   | addr4                   |
|----------------------------|-------------------|---------|---------|-------------------------|
| Type of<br>Op <sup>n</sup> | Dest <sup>n</sup> | source1 | source2 | next instrn.<br>address |

- ★ Not implemented in program design.  
Only Concept, no reality.

Note: 8085 is an example of Accumulator Machine.

8086 is an example of Register to Memory ref. architecture.

RISC is an example of Register to Register ref. architecture.

- Q. Consider the following expression & identify how many 3-address inst<sup>n</sup>, 2-address inst<sup>n</sup>, 1-address inst<sup>n</sup> & 0-address inst<sup>n</sup> req. to execute the expression:

# Cosmos(Sajal)

Date: \_\_\_\_\_  
Page No. \_\_\_\_\_

$$X = (A + B) * (C + D)$$

Ans. Where  $X, A, B, C, D$  are the mem. addresses.  
Register-to-Register :-

First load the data into registers from memory

Load  $r_0, A$

Load  $r_1, B$

Load  $r_2, C$

Load  $r_3, D$

Add  $r_4, r_0, r_1$

Add  $r_5, r_2, r_3$

Mul  $r_6, r_4, r_5$

Store  $X, r_6$

Q-Reg. - Mem. ref. :-

MOV  $r_0, A$

MOV  $r_1, B$

MOV

ADD  $r_0, B$

MOV  $r_1, C$

ADD  $r_1, D$

MUL  $r_0, r_1$

Store  $\rightarrow$

MOV  $X, r_0$

Q-addr. instn :-

Push A

Push B

ADD

Push C

Push D

ADD

Mul

Pop X

Acc. Machine:-

Load A ; Acc  $\leftarrow M[A]$

Add B ; Acc  $\leftarrow Acc + M[B]$

Store T ;  $M[T] \leftarrow Acc$

Load C ; Acc  $\leftarrow M[C]$

Add D ; Acc  $\leftarrow Acc + M[D]$

MUL T ; Acc  $\leftarrow Acc * M[T]$

Store X ;  $M[X] \leftarrow Acc$

# Cosmos(Sajal)

Date

Execution Cycle & Fetch cycle:-

- ① Inst<sup>n</sup> Fetch
- ② Decode Inst<sup>n</sup>
- ③ Operand Fetch
- ④ Process data & Store result

21.04.12

Date \_\_\_\_\_  
Page No. \_\_\_\_\_

- Q1 Consider Accumulator Machine. It supports 8-bit instr<sup>n</sup> & 3 bit opcode, ∴ instr<sup>n</sup> format is:-



-t = 2000 → PC → Mem. Reg. → ~~Mem.~~

1 executable command to CPU

(trace the command at address 2000),  
PC ← PC + step-size

-t = 2000 → PC → Memory → ~~2001 0000~~

- ① this address is transferred to PC to start executing the instructions starting from address 2000  
 ② the mem. req. is generated to access the memory

③ (Inst<sup>n</sup> fetch)  
 The instr<sup>n</sup> is fetched & placed in their IR register in CPU

④ Operands (one operand is transferred from ACC to ALU.)

Flag → Extra Storage  
 Inst Decoder

000 | 10001 IR  
 OPCODE ADDRS.  
 ④ this opcode is given to the decoder to decode it (if we give 000 to decoder, we get 1000000 as O/P.)

- ⑤ (when operand 2 is present in reg.)  
 During design time, the binding operations are performed which specifies which operations are to be performed for a particular opcode. This info. is present in ROM. (that contains info. that 000 opcode when decoded to 1000000 should perform addition, etc.)

⑤ This decoded opcode is given to ALU to perform the operation. D<sub>6</sub> D<sub>7</sub>

★ But when operand 2 is in Memory.



→ Now operand is fetched & instr<sup>n</sup> was decoded earlier.



★ Execution Cycle consists of following parts:-

- ① Inst<sup>n</sup> Decode
- ② Operand Fetch
- ③ Process Data  
 (converting i/p to o/p.)

# Cosmos(Sajal)

## Execution Cycle

Date: \_\_\_\_\_  
Page No. \_\_\_\_\_

- Instruction Register:- It holds the currently fetched instruction to decode because the instruction format is predefined in this register.

The encoded opcode format is connected to the decoding circuit to identify the type of operation. Each decoded signal is bind with a unique meaning & the info. about the decoded signal is stored in ROM during design time.



- During operation, the corresponding operation related control signals are generated based on the opcode.
- The process of enabling the ALU gates based on the opcode present in the instruction is called instruction decoding.
- Acc. to accumulator machine, the 1st operand is transferred from accumulator & 2nd operand may be present in the register/memory.
- The 2nd operand address is present in the address field of instruction register.
- The process of transferring the binary sequence, either from register or from memory to ALU is based on IR register address field, this is called operand fetch.
- The process of converting one form of data into another form based on the enable ALU gates is called as process data.

\* PSW (Program Storage Word) / FLAG REGISTER :-  
To store the exceptional conditions of ALU; there is a need of extra storage. The extra storage is called as

# Cosmos(Sajal)

Date: \_\_\_\_\_  
Page No. \_\_\_\_\_

flag. Flag is a flip-flop. flip-flop is a bi-state device used to store 1-bit information

Flag → Flip-flop → [Set(1)  
Reset(0)]

Flags are classified into 2 types:-

- ① Conditional Flags
- ② Control Flags.

## Conditional flags:-

These flags are set or reset based on some conditions (on the result nature of ALU). If condition is true, the flag is set & when condition is false, flag is reset.

There are 6 conditional flags used in the computer:-

- |                   |            |
|-------------------|------------|
| ① Carry           | ⑤ Sign     |
| ② Parity          | ⑥ Overflow |
| ③ Auxiliary Carry |            |
| ④ Zero Flag       |            |

- Carry Flag :- If there is an extra bit after MSB

Carry flag is also set/reset during subtraction operation. In such cases it acts as a borrow flag, when there is borrow from MSB, then carry flag is set, otherwise it is reset.

True      False  
Set = 1 = C    Reset = 0 = NC  
(C: carry)    (NC: no carry)

This flag is used in (unsigned) arithmetic operations to represent the range exceeding conditions.

"n" bit unsigned range = 0 to  $2^n - 1$   
4 bit      "      " = 0 to 15

|                 |       |
|-----------------|-------|
| 1001            | 1001  |
| 0010            | 1000  |
| 1011 → in range | 10001 |

out of range.

-0111  
0011

as there is no borrow for MSB.

CY = 0  
1010

-1011

1111

as there is borrow for MSB,

∴ CY = 1

# Cosmos(Sajal)

Date: \_\_\_\_\_  
Page No. \_\_\_\_\_

Parity Flags, i.e. if the result of an ALU opn contains even no. of 1's  
If the ALU O/p contains even no. of 1's

$$\begin{array}{c} \text{T} \\ \text{Set}=1=\text{PE} \\ (\text{even parity}) \end{array} \quad \begin{array}{c} \text{F} \\ \text{Reset}=0=\text{PO} \\ (\text{odd parity}) \end{array}$$

This flag is used in serial data communication.

Auxiliary Carry :- (in Packet BCD format)

"If there is an extra bit from lower nibble to higher nibble". or "3rd position to 4th position".

$$\begin{array}{c} \text{T} \\ \text{Set}=1=\text{AC} \\ (\text{Auxiliary carry}) \end{array} \quad \begin{array}{c} \text{F} \\ \text{Reset}=0=\text{NA} \\ (\text{No auxiliary carry}) \end{array}$$

This flag is used in the BCD arithmetic operations.

Packed BCD:- each decimal is represented by 4 bits.

Unpacked BCD:-

This format is used to represent decimal numbers.

There are 2 BCD formats:-

- Unpacked BCD:- each decimal digit is represented by using 8 bits,

$$\text{e.g. } 23 \rightarrow \underline{\underline{0000}} \ \underline{\underline{0010}} \ \underline{\underline{0000}} \ \underline{\underline{0011}}$$

- To minimize this storage, there is packed BCD format.

In this format, each decimal digit is represented by using 4 bits.

$$\text{e.g. } 23 \rightarrow \underline{\underline{0010}} \ \underline{\underline{0011}}$$

e.g.

$$\begin{array}{r} 23 \\ 28 \\ \hline 51 \end{array}$$

$\star$  Case - when no. is outside the range of BCD.

In this case we add 6 because in lower nibble addition, the B is outside the range of BCD.

$$\begin{array}{r} 0100 \\ + 0110 \\ \hline 0101 \end{array}$$

$$+ 0001$$

$$\hline 0110$$

$$\text{(add 6)}$$

$$\text{Even though auxiliary carry is 0.}$$

∴ Hence Auxiliary carry is not set.

# Cosmos(Sajal)

Case 2:- When nos. are in the range of BCD, but there is a carry from lower nibble to higher.

|                |                                                              |                                                              |                                                              |                                                              |                           |
|----------------|--------------------------------------------------------------|--------------------------------------------------------------|--------------------------------------------------------------|--------------------------------------------------------------|---------------------------|
| 29<br>28<br>57 | $\begin{array}{r} 0011 \\ + 0010 \\ \hline 0101 \end{array}$ | $\begin{array}{r} 1001 \\ + 1000 \\ \hline 0001 \end{array}$ | $\begin{array}{r} 0101 \\ + 0110 \\ \hline 0111 \end{array}$ | $\begin{array}{r} 0001 \\ + 0110 \\ \hline 0111 \end{array}$ | Date: _____<br>Key: _____ |
|----------------|--------------------------------------------------------------|--------------------------------------------------------------|--------------------------------------------------------------|--------------------------------------------------------------|---------------------------|

$$\begin{array}{r} 0101 \\ + 0110 \\ \hline 0111 \end{array}$$

$$\begin{array}{r} 0101 \\ + 0110 \\ \hline 0111 \end{array}$$

nibble 4  
Auxiliary  
Flag is set.

→ Here auxiliary carry is 1.

In this case, both 5 & 1 are in the range of BCD, but there is an auxiliary carry from lower nibble to higher nibble. ∴ even though both 5 & 1 are valid, but Auxiliary Carry is set. ∴ we add 6.

## • Zero Flag :-

If the ALU result is zero



This flag is used to implement the control structures.

## • Sign Flag :-

If in ALU result, MSB is 1



## • Overflow Flag :-

★ ✓ If there is a carry into MSB & no carry out of MSB

★ ✗ If there is no carry into MSB & carry out of MSB



Overflow is used in signed arithmetic computations.

# Cosmos(Sajal)

Date: \_\_\_\_\_  
Page No. \_\_\_\_\_

- Signed nos. are represented by using 2's complement format.
- When MSB bit is '0', then the no. is +ve, no need to take the complement to conclude the value.  
If MSB bit is '1', then the no. is -ve, so there is a need of taking the 2's compliment to conclude the value.

The range of  $n$ -bit 2's complement no. is  $-(2^{n-1})$  to  $+(2^{n-1}-1)$

4 bit 2's complement range is:-

$$-(2^{4-1}) \text{ to } +(2^{4-1}-1) = -8 \text{ to } +7 \text{ [total } 16 \text{ combinations.]}$$

## Binary Sequence

### Binary (2's complement)

|   |         | <u>Value :-</u> |                 |
|---|---------|-----------------|-----------------|
| 0 | 0 0 0 0 | = + 0 (0)       | <del>0000</del> |
| 0 | 0 0 0 1 | = + 1           | <del>0001</del> |
| 0 | 0 0 1 0 | = + 2           | <del>0010</del> |
| 0 | 0 0 1 1 | = + 3           | <del>0011</del> |
| 0 | 0 1 0 0 | = + 4           | <del>0100</del> |
| 0 | 0 1 0 1 | = + 5           | <del>0101</del> |
| 0 | 0 1 1 0 | = + 6           | <del>0110</del> |
| 0 | 0 1 1 1 | = + 7           | <del>0111</del> |
| 1 | 0 0 0 0 | = - 8           | <del>1000</del> |
| 1 | 0 0 0 1 | = - 7           | <del>1001</del> |
| 1 | 0 0 1 0 | = - 6           | <del>1010</del> |
| 1 | 0 0 1 1 | = - 5           | <del>1011</del> |
| 1 | 0 1 0 0 | = - 4           | <del>1100</del> |
| 1 | 0 1 0 1 | = - 3           | <del>1101</del> |
| 1 | 0 1 1 0 | = - 2           | <del>1110</del> |
| 1 | 0 1 1 1 | = - 1           | <del>1111</del> |

# Cosmos(Sajal)



Example :-

$$(a) \begin{array}{r} +7 \\ +3 \\ \hline +10 \end{array}$$

Justify:-

$$\begin{array}{r} 01010 \\ \text{acc.} \end{array}$$

when the sign of result +10 is +ve,  
5th bit is 0.

$$\begin{array}{r} 0011 \\ 1100 \\ \hline 1100 \end{array}$$

"Learn" range is exceeding, & we can see that there is carry into MSB & no carry out of MSB.

$$\therefore OV = 1.$$

$$(b) \begin{array}{r} +7 \\ -3 \\ +4 \\ \hline 10100 \end{array}$$

as there is carry into MSB & carry out of MSB,  $\therefore OV = 0$ .

$$(c) \begin{array}{r} -7 \\ +3 \\ -4 \\ \hline 1001 \\ 0011 \\ 1100 \end{array}$$

-4 is in the range of -8 to +7.

$\therefore$  Overflow flag is '0'.

$$\begin{array}{r} 0011 \\ 1100 \\ \hline 1100 \end{array}$$

$$OV = 0.$$

Justify:- (0.1100)  $\rightarrow$  accumulator,

Overflow flag  $\rightarrow$  0 means the range is not exceeded, so no need of 5th bit.)

$$(d) \begin{array}{r} -7 \\ -3 \\ -10 \\ \hline 1001 \\ 1101 \\ 10110 \end{array}$$

$$\begin{array}{r} 0011 \\ 1100 \\ -10 \\ -1010 \\ 0101 \\ +1 \\ \hline 0110 \end{array}$$

no carry into MSB, but carry out of MSB,  $\therefore OV = 1$ .

We need 5th bit. (0.0110)  $\rightarrow$  in accumulator  
 $\rightarrow$  in overflow flag.

Justify:-  $OV = 1$ , acc. = 0110

5th bit depends upon the sign of the result, which is -10. ( $\because$  5th bit is 1).

$\therefore$  result:- 10110 (2's complement)

$$\begin{array}{r} 01001 \\ +1 \\ \hline 01010 \end{array}$$

$\rightarrow$  ans.

# Cosmos(Sajal)

eqn. of overflow flag

$$O(xyz) = \bar{x}\bar{y}z + xy\bar{z}$$

MSB of operand 1  
MSB of operand 2  
MSB of result

Date: \_\_\_\_\_  
Page No. \_\_\_\_\_

- Q1. Consider following two 2's-complement nos. & perform addition operation & evaluate flag conditions.

|           |                    |                   |
|-----------|--------------------|-------------------|
| 10010110  | (C) Carry = 1      | Zero = 0 (NZ)     |
| 10110101  | (PE) Parity = 01   | Sign = 0 (PPL)    |
| ①01001011 | (NA) Auxiliary = 0 | Overflow = 1 (OV) |

## Control Flags :-

Based on the status of these flags, the ALU execution sequence is changed. Three flags are employed under this category:-

- ① Trap Flag
- ② Interrupt Flag
- ③ Direction Flag

### Trap

- Trap Flag  $\rightarrow (go=G)$

"If it is 0" :- all statements are executed at a time.  
"If it is 1" :- only one statement is executed at a time.  $\rightarrow (trace=t)$  Used for debugging purpose.

$-t=2000 \rightarrow$  this means that trap flag is set & we can only one statement executes at a time starting from 2000.

- Interrupt flag -
  - 0 → Disable the interrupt (DI)
  - 1 → Enable the interrupt (EI)
- Direction flag -
  - Autoincrementing - 0 (CLD)  
(Clear Direction Flag)
  - Autodecrementing - 1 (STD)  
(Set Direction Flag)

# Cosmos(Sajal)

23 23  
Date: 23  
Page No. 1

## Addressing Modes :-

It shows the way where the required object is present.

(Focused on data) Object may be an instruction or data, : the addressing modes are classified :-

Data oriented addressing mode (1) Sequential control flow addressing mode. (2) Transfer of control flow

Instruction oriented addressing mode

(Focused on instrn)

Instructions are stored sequentially.

Instructions are accessed using incrementation of program counter.

[No if, while, etc statements]

- O/p address of Addressing mode is effective address(EA). EA is the address where the required object is present.
- Object = [EA] ← content of EA

Addressing modes are implemented in 2 ways :-

(1) Implicit :

(2) Explicit

- **Implicit** :- In this implementation, the type of addressing mode info is present in opcode, ∴ there is no need of extra field in the instrn format.



eg MVI B,23

move immediate  
immediate addressing mode  
which is specified in the opcode itself.

Mov 80, #23 → immediate mode

reg name → register addressing mode

II J → direct addressing mode.

# Cosmos(Sajal)

Date: \_\_\_\_\_  
Page No.: \_\_\_\_\_

May 20,

@ or () → Indirect addressing mode.

Explicit :-

In this implementation, addressing mode field is maintained in Instn format to identify the type of addressing mode.



Decoders are required to decode which addressing mode is encoded.

Sequential Control Flow Addressing Mode:-

This mode is focussed on the data.

Data is present at two places:-

→ Registers

→ Memory

∴ This addressing mode is further divided

into 2 types:-

① Register Based AM :- When data is present in registers.

② Memory Based AM. :- When data is present in memory.

Register Based AM :- This addressing mode used to read & write the data when it is present in processor registers. Under this category only 1 addressing mode is used i.e. register addressing mode.

This mode is used to access local variables.

Registers are referred by the names.

EA = Register name

data = [EA] = [register name]

# Cosmos(Sajal)

- \* Register Based AM :- When the data is present in the register itself. This mode is used to access local variables because registers reside in the processor architecture whereas memory is separate from processor architecture & it takes less time to access data from registers.

e.g.  $MOV R0, R1$   
FA = R0      FN = R1

opcode → R  
takes less time to access data from registers.



## Memory Based Addressing Mode :-

It is used to read/write the data when it is present in the memory.

There are diff. addressing modes employed in this category :-

- (i) **Implied** :- In this mode, the data is itself available in the opcode, ∴ no effective address.  
e.g. STC :- set carry [there is no address field, it is implied in the opcode itself].  
in this case we have to set the carry flag, the data is present in the carry flag, so it is implied that operand is in carry flag, other examples are CLD (Clear Direction Flag), etc.

- \* all the zero address instructions employ the implied addressing mode, e.g. stack machine.

- (ii) **Immediate Addressing mode** :- This mode is used to access the constant. In this mode, the data is present in the address field of instruction, i.e.

e.g.  $MOV R0, \#25$       Immediate mode is used to initialize the registers with data.

Source AM = immediate, then data = 25.  
Destn AM = register, then EA = R0  
 $R0 \leftarrow 25$

Note :- The limitation of immediate is that the range

# Cosmos(Sajal)

\* \* Add A, B where A & B are the memory addresses.  
 $M[A] \leftarrow M[A] + M[B]$   
3 memory references.

Date: \_\_\_\_\_  
Page No. \_\_\_\_\_

of possible constants are delimited by the size of address field.



$$\therefore \text{Possible Constants} = 0 \text{ to } 2^2 - 1 = 0 - 3$$

## (iii) Direct Addressing Mode :-

This mode is used to access the static variables. In this mode the data is present in the memory, so that memory address is present in the address field of instruction.

EA = address field value.  
Memory Addr.

Data = [EA]

= [Mem. Addr.]

1 memory reference is required to read/write the data using this mode.



ADD r0, [4000]

$r0 \leftarrow r0 + M[4000]$

## (iv) Indirect Addressing Mode :-

This mode is used to implement the pointers. Based on the availability of EA, this mode is divided into 2 types, i.e. register indirect AM, memory indirect AM.

# Cosmos(Sajal)

EA = content of address field value

Date: \_\_\_\_\_  
Page No. \_\_\_\_\_

## • Register Indirect AM :-

In this mode, the EA is present in the register, that register address name is present in the address field of instn.

∴ EA = content of address field value.

OR EA = content of register name in address field.

Data = [EA]

[[reg. name]]  
1. reg. name  
2. ref. EA  
1 mem. ref = data

To access the content of register which is given in the address field. This content gives the EA of ~~it~~ in the memory to get the data.



e.g. ADD [4000], @R0

• M[4000]  $\leftarrow$  M[4000] + M[R0]. :-

Total references = 4  
Total memory reference = 3

• Add @R0, #23

MOV R1, R2

M[R0]  $\leftarrow$  M[R0] + 23

R1  $\leftarrow$  R2

1 reg. ref = EA

MOV [R1], [R2]

1 mem. ref = To write data

1 reg. ref = EA  
1 mem. ref = read data

constant

## • Third Memory Indirect AM :-

In this mode the EA is present in the memory, that memory address is present in the address field of instruction.

# Cosmos(Sajal)

Date: \_\_\_\_\_  
Page No. \_\_\_\_\_

EA = [Addr. field value]

= [Mem. Address]

data = [EA]

= [[Mem. Address]]

1 mem.ref. - EA

1 mem.ref. = To read/write data

2 memory reference are reg.



→ e.g. ADD @4000,(5000) → 6 memory reference.  
M[4000] ← M[M[4000]] + M[M[5000]]

M[[4000]] ← M[[4000]] + M[[5000]]

1 mem.ref. = EA

1 mem.ref. = write  
data

→ ADD @5000,@r1 → 5 memory reference  
M[[5000]] ← M[[5000]] + M[[r1]]  
2 mem.ref.      2 mem.ref.      1 mem.ref.  
                  1 reg.ref.

→ MOV @5000,@r1 → Load @ r1  
M[[5000]] ← M[[r1]]  
2 mem.ref.      1 mem.ref.  
                  1 reg.ref.

→ Load @ 5000  
Acc ← M[[5000]]  
2 mem.ref.

# Cosmos(Sajal)



Date \_\_\_\_\_  
Page No. \_\_\_\_\_

(v)

## Indexed AM:-

Case I:-  
(when  
base address  
resides in  
the address  
field of the  
index & the  
index value  
resides in  
the index  
register)

This addressing mode is used to implement the arrays. To access the array element, there is a need of base address & the index value. Base address points the starting address of an array & the index value gives the offset address of an array.. EA = base address + index value

data = [EA] = [Base address + index value]

Arithmetic computation is req. to calculate the EA in the index addressing mode.



Data = [Base address + [R<sub>i</sub>]]

1 reg. ref. to get index value

1 arithmetic opn to calculate EA

1 mem. ref. is req. to read/write data.

e.g.       $\rightarrow$  base address

ADD R<sub>0</sub>, 2(R<sub>1</sub>)    where R<sub>1</sub> is an index reg.  
                         $\leftarrow$  index addressing mode

$$\bullet \quad R_0 \leftarrow R_0 + M[2 + [R_1]] \quad \rightarrow \quad 1 \text{ mem. ref.}$$

1 reg. freq. ref.

1 mem. ref.

1 reg. ref.

1 arithmetic opn.

$$\bullet \quad \text{ADD } @4000, 200(R_2) \quad \rightarrow \quad 5 \text{ mem. ref.}$$

$$M[R[4000]] \leftarrow M[R[4000]] + M[200 + [R_2]]$$

\* Index AM is also known as Base-Indexed AM.

# Cosmos(Sajal)

(Base Index AM) :-

address field = ?

Date: \_\_\_\_\_  
Page No. \_\_\_\_\_

Case I:  
When both base address & index value resides in register

Some of the processors supports base & index registers to hold the base address & the displacement espec.

$$EA = [\text{Base register}] + [\text{Index reg.}]$$

$$\text{Data} = [[\text{Base reg.}] + [\text{Index reg.}]]$$

1 reg. ref. = to read base addr.

1 reg. ref. = to read index value

1 arithmetic op = to calculate EA

1 mem. ref. = to read/write data.

e.g.

MOV AX, [BX][SI]

$$AX \leftarrow M[BX] + M[SI]$$

Case II: Hypothetical Case:-

Indirect Index AM :-

In this mode the base address is present in the address field condt. of memory whose EA of base address value is stored in index register

$$EA = [\text{mem. Addr.}] + [\text{Index Register}]$$
$$= [\text{Addr. Field value}] + [\text{Index reg.}]$$

$$\text{Data} = [EA]$$

$$= [[\text{Mem. Addr.}] + [\text{Index reg.}]]$$

1 M. ref. = Base addr. ✓

1 reg. ref. = Index value ✓

1 Arithmetic - EA ✓

1 M. ref = read/write. ✓

e.g. MOV R0, [2000][R1] R1 = index reg.

$$R0 \leftarrow M[2000] M[R1] + [R0]$$

1 reg.  
ref.  
with R1

1 Mem.  
ref.  
1 reg.  
ref.

1 Arithmetic op = EA

1 mem. ref. = data

# Cosmos(Sajal)



Date \_\_\_\_\_  
Page No. \_\_\_\_\_

*Indirect Addressing Mode*

ADD A[R<sub>1</sub>] #B where A & B are mem. addrs.  
R<sub>1</sub> is index address

$$M[EA] + [R_1] = M[EA] + M[B]$$

★ Using index addressing mode, we can access the random array elements. (by ~~randomly changing the value of Index register~~)

(vi) Auto-Indexed :-

This addressing mode is used to access the linear array element.

∴ EA = Base address + Step size

data = [EA]

= [Base address + step size]

If array element size = 1 byte, then step size = 1

If size of array element = 2 byte, then step size = 2.

The step size depends on the amount of the data element to be accessed from the array.

Classification Of Auto-Indexed AM :-



Q. Diff. addressing modes are used to access the operands. Their frequency of operand accessing is shown :-

# Cosmos(Sajal)

Date: \_\_\_\_\_  
Page No. \_\_\_\_\_

| Operand Accessing | freq.(%) |
|-------------------|----------|
| Register          | 30       |
| Immediate         | 20       |
| Direct            | 22       |
| Mem. Indirect     | 17       |
| Indexed           | 11       |

If 2 cycles are consumed for mem. operations,  
1 cycle consumed for arithmetic computation  
0 cycles consumed if operand is available in  
the register/inst" itself, what is the avg.  
Operand fetch rate of the machine.

Avg.  $0.22 \times 2 + 4 \times 0.17 + 3 \times 0.11 = 0.44 + 0.68 + 0.33$   
~~2.45~~  
~~2.45~~  
 $= 1.45$

Avg. operand  
fetch =  $(0.3 \times 0) + (0.2 \times 0) + (0.22 \times 2) +$   
 $(4 \times 0.17) + (3 \times 0) + 2 \times 0.11 + 0.11 = 1.45$  cycles.  
to fetch the one  
operand.

## Transfer Of Control Flow AM

To implement the control structures, i.e. selection statements (If-then-else, switch-case & goto), iterative statements (while, do-while & for) & subprogram concept, there is a need of transfer of control operations.

During the execution of the above statements, the program control is transferred from current location to target location.

The transfer of control operations are associated with the target address. There are 3 possible mnemonics available to implement the transfer of control operations:- ① Branch ② Jump ③ Skip.

# Cosmos(Sajal)

Date: \_\_\_\_\_  
Page No. \_\_\_\_\_

Transfer of control operations are basically divided into 2 types:-

- ① Unconditional Transfer Of Control
- ② Conditional Transfer of Control

## • Unconditional Transfer Of Control:-

While execution of these instn, the control is transferred to the target address w/o even checking any condition.

e.g.

1000: JMP 2000 → Target Address

1001:      |  
                | unconditional TOC

At the end of this instn

PC ← 2000

If JMP 2000 is at 1000 location & next instn is at 1001 location, then before to the execution of JMP 2000, PC contains 1001, but after execution of JMP 2000, 1001 is overwritten by 2000.

Consider the program segment

1000: I<sub>1</sub>

1001: I<sub>2</sub>

1002: I<sub>3</sub> (Jmp 2000)

1003: I<sub>4</sub>

:

2000: BI<sub>1</sub>

2001: BI<sub>2</sub> (JMP 1003)

The execution sequence of the program is:-

- t = 1000

PC = 1000

1001

to 1003 even      1002 JMP 1200d

JMP 2000 PC = 1003 → before execution of JMP 2000 is executed, PC = 2000 → after execution of JMP 2000  
∴ PC becomes

1003  
after execution of JMP 2000, the PC is replaced by 2000.

# Cosmos(Sajal)

Spaghetti

Date \_\_\_\_\_  
Page No. \_\_\_\_\_

PC = 2000 : BI,

PC = 2001 : ~~BD~~ JMP 1003

PC = 2002 → before execution of JMP 1003.

PC = 1003 : I<sub>4</sub> → after execution of JMP 1003  
& before execution of I<sub>4</sub>.

★ To implement the goto, halt statements, machine control statements unconditional transfer of control statements are used.

→ Halt is implemented using ~~and~~ unconditional Transfer of control when the target address is same as current instn.

2000 : Halt;

or 2000 : JMP 2000

l<sub>1</sub> : JMP l<sub>1</sub>

this will halt the processor by executing the same statements again & again by loading the PC with same inst<sup>n</sup> address.

To come out of halt inst<sup>n</sup>, we need to restart the system.

1000 : I<sub>1</sub>

1001 : I<sub>2</sub>

1002 : I<sub>3</sub> (halt)

1003 : I<sub>4</sub>

:

:

PC = 1000

= 1001

= 1002 : halt

= 1003 → before execution of halt

→ Halt 1002 :- after execution of this halt.

1003 :- before execution of this halt.

→ Halt 1002 :- after execution of this halt.

1003 :- before execution of this halt.

# Cosmos(Sajal)



Date: \_\_\_\_\_  
Page No. \_\_\_\_\_

DAT

22.04

## Machine Control Meaning:-

### Conditional Transfer Of Control:-

These instr<sup>n</sup> are associated with the conditions, during execution of conditional TOC execution, the associated cond<sup>n</sup> undergoes evaluation. If cond<sup>n</sup> is true, then PC is loaded with target address otherwise no change in PC. (i.e. PC will hold the next sequential address.)

e.g. JNZ [2000]

↳ Jump  
when not zero

If the If instr<sup>n</sup> before JNZ 2000 makes the zero flag, then only the PC is loaded with 2000, if instr<sup>n</sup> before JNZ 2000 sets the ZF, then the PC is only incremented to the next instruction after JNZ 2000.

1000: MOV CL,03

PC=1000 CL=03

1001: ADD AL,BL

1001 +

1002: DEC CL

1002 CL=02

1003: JNZ 1001

1003 NZ (when PC=1003 then CL=02.)

1004: MOV [4000],AL

1004 ↴

\*Note:- To implement the selection & iterative statements conditional transfer of control are used.

PC=1001 + non-zero condition is checked, but  
PC=1002 CL=01 PC=1004, but Non-  
PC=1003 NZ zero condition  
PC=1004 is satisfied,  
PC=1001 + so 1004 is overwritten  
PC=1002 CL=00 by 1001.  
PC=1003 NZ(F)  
PC=1004 → at this

time, PC will remain loaded with 1004, no overwriting of PC takes place.

# Cosmos(Sajal)

Date

22.04.12

V: 2nd p.

## Subprogram

Date: \_\_\_\_\_  
Page No. \_\_\_\_\_

Subprogram means reusable program. Separately developed instn in the main application & store them in the separate memory space called subprogram.

The advantage with the subprogram is that it reduces the memory space, development time & development cost.

Characteristics of subprogram:-

- ① Single entry, single exit.
- ② Main program is suspended during execution of subprogram.
- ③ Control is transferred back to the main program after completion of the subprogram.

Implementation of Subprogram:-

Subprogram concept is implemented by using pair instn, i.e.

call-ret

entry exit  
pt. pt.  
main program present in the subprogram.  
(i.e. entry pt. is present in the main program)  
call invokes the bush operation of the stack to save the return address.

Consider the following program segment & show execution sequence.

1000: I<sub>1</sub>, 2000: BI<sub>1</sub>,  
1001: I<sub>2</sub>, 2001: BI<sub>2</sub>,  
1002: I<sub>3</sub> [call 2000], 2002: BI<sub>3</sub>,  
1003: I<sub>4</sub>, 2003: BI<sub>4</sub> (Ret),  
1004: . . . 2004: . . .

(wait)

As we haven't saved the return address anywhere, the subprogram waits for the return address.

PC = 1001 I<sub>1</sub> is executed

because return address is not saved anywhere. [We need runtime stack to save the return address]

PC = 1002 I<sub>2</sub> " "

PC = 1003 I<sub>3</sub> " "

PC = 2000 . . .

PC = 2001 BI<sub>1</sub> is executed

PC = 2002 BI<sub>2</sub> " "

PC = 2003 BI<sub>3</sub> " "

PC = 2004 BI+ " "

control is transferred to the main program, so PC value 2004 is replaced by

Now, the subprogram waits at PC = 2004 because the return address is not available.

# Cosmos(Sajal)



Date: \_\_\_\_\_  
Page No.: \_\_\_\_\_

Due to absence of return address, the control is not transferred back to the main program. To make the implementation successful, there is a need of runtime stack. Runtime stack is used to store the return address.

Execution seq. with stack

PC = 1000

1001 I<sub>1</sub> is executed

1002 I<sub>2</sub> is "

1003 I<sub>3</sub> " " [I<sub>3</sub>: - call 2000]

PC = 2000

2001 BI<sub>1</sub> is executed

2002 BI<sub>2</sub> is "

2003 BI<sub>3</sub> " "

2004 BI<sub>4</sub> " " [BI<sub>4</sub> is a return statement.]

at this time, return  
pops the TOS

PC = 1003

↑ retrieves  
the return  
address of  
the main  
program. This  
return address  
is used to  
overwrite PC  
value of 2004



\* Only PC is changed during transfer of control operations.

JMP 200 Transfer Of Control

Direct TOC

The target address is present in the instruction itself. e.g. JMP 2000,

JNZ 2000

CALL 2000

target address in the Instn.

Indirect TOC

Target address is not present in the instruction, e.g. RET

no target address in the Instn.

# Cosmos(Sajal)

| TOC      | Unconditional            | Conditional                                                                            | (call on not carry) (call on zero) |
|----------|--------------------------|----------------------------------------------------------------------------------------|------------------------------------|
| Direct   | JMP, goto, Halt,<br>call | JNZ, if-then-else,<br>while-do-while-switch, for, callc(call on carry), callnc, callnz |                                    |
| Indirect | Ret                      | RetC, Retnc, Retz, Retnz<br>Return on carry, Return on zero, Return on no zero         |                                    |

Under the transfer of control flow, 2 addressing modes are employed to calculate the EA of the next instr. i.e.

## ① Relative / PC-relative AM

$$② EA = PC + IR[\text{Address field value}]$$

When the control is to be transferred in the same segment. [intrasegment]

Displacement/relative value  
from the current PC value, what displacement must be added to PC to get the EA of next instruction, so the address field of IR contains this displacement.

$$\therefore PC \leftarrow PC + \text{displacement}.$$

## ② Base-register addressing mode :-

In this mode, the base register is used to point the code segment address. When the program is present in the diff. segments, to execute such program base register addressing mode is used, [easy deallocation of program].

When the control is to be transferred from one segment to other [intersegment]. In this case, the base register contains the starting address of the other segment & the address field of IR contains the displacement in other segment where next instn is present.

$$\therefore EA = [\text{Base reg.}] + IR[\text{Add. field value}]$$

Starting Add. of Displacement/relative value  
Segment

Q1 A PC relative branch instn size is 32 bits. It is placed in the memory with starting address of 342032 (decimal). What is the target address if the signed displacement -12 is placed in the instn.

Ans.



$$PC = 342036$$

$$\text{Now, } PC = PC + IR[\text{addr. field value}] \\ = 342036 - 12 = 342024$$

$$EA = PC + IR[\text{addr. field value}] \\ = 342036 + (-12) = 342024$$

before execution of this instn, PC is incremented to 342036. after execution of this instn, PC = 342024.

# Cosmos(Sajal)

Q3  
Q9

Date: \_\_\_\_\_  
Page No. \_\_\_\_\_

- Q2. A PC-relative branch instr with 5 bytes long is stored in memory 243028(decimal) address onwards. If EA = 243009, what's the displacement?

Ans. EA = PC + displacement

$$243009 = 243033 + \text{displacement}$$
$$\text{displacement} = -24$$

243028 ?  
243029  
243030  
243031  
243032  
PC = 243033

## Types Of Operations/Types of Instrn./ Instrn Set

- Instrn set of processor consists 3 category of instrn to perform 3 diff. operations, named as:-
  - ① Data Transfer Opn.
  - ② Data Manipulation Opn.
  - ③ Transfer of Control.
- Data Transfer Instn.: While execution of these instrn, the data is transferred from source to destination. The constraint is that source may or may not have the storage but the destination always have the storage.

| Source        | Destn              |
|---------------|--------------------|
| reg           | reg → MOV          |
| reg           | Mem → STORE        |
| Mem           | reg → LOAD         |
| ? → Mem.      | Mem → not possible |
| Immediate     | Reg. → MOV         |
| Immediate     | Mem. → MOV         |
| TOS           | Reg. → POP         |
| reg           | TOS → PUSH         |
| Input device  | Reg. → IN          |
| Output device | Reg. → OUT         |

# Cosmos(Sajal)

Date: \_\_\_\_\_  
Page No. \_\_\_\_\_

Diff. mnemonics are used to implement various data transfer operations.

MOV: General purpose data transfer operation.

MOV R0, R1

MOV R0, [4000]

MOV R0, #23 → 8051 instn.

Load: It performs only memory read operations.

Load R0, [4000]

Load R0, @R1

Load R0, #23

Load R0, @5000

Load R0, @R1 → invalid

Load R0, #23 → invalid [no memory to memory operations]

Store: It is used to perform memory write operations only.

Store [4000], R0

Store @R0, R1

Store R0, [R1]

Store R0, #23

Store @R0, @R1 → not valid

[because it is memory to memory opn.]

Store R0, @R1 → invalid

[because it is memory to memory opn.]

push: Data is

SP:- Stack pointer

Push AX



Now, TOS will contain some data,  
∴ 2008 will contain some data,  
SP is decremented by 1, ∴ SP=2007,  
AH is stored at 2007, then SP is  
again decremented by 1, ∴ SP=2006  
& AL is stored at 2006.

It is used to transfer the data from register to TOS. The displacement depends upon the size of register.

For every successful push instn the stack pointer is decremented by displacement times, in above case stack pointer is decremented by 2.

SP ← SP-1  
TOS ← AH  
SP ← SP-1  
TOS ← AL

# Cosmos(Sajal)

Date: \_\_\_\_\_  
Page No. \_\_\_\_\_

Pop:

TOS →

|     |      |      |
|-----|------|------|
|     |      | 0000 |
| AL  | 2006 |      |
| RTI | 2007 |      |
|     | 2008 |      |
|     | FFFF |      |

Pop BX

Now BX is 16 bit, so we require 2 cell values.

$$\therefore BL = M[PC] \quad \{ \rightarrow \text{post increment}$$

$$PC \leftarrow PC + 1$$

$$BH = M[PC]$$

PC ← PC + 1, now PC will be at 2008, & 2008 mem. address will now be the TOS.

For every successful pop op<sup>n</sup>, stack pointer is incremented by the displacement times.

If pop r

where r is a register of 82 bits,  
then SP is incremented by 4

## Data Manipulation Instr.

While execution of these instr<sup>n</sup> one form of data is converted into another form. The classification of Data Manipulation instr<sup>n</sup> is:-

- ① Arithmetic Instr<sup>n</sup> (Add, Sub, Mul, Div, Inc, Dec)
- ② Logical Instr<sup>n</sup> (And, OR, NOT, XOR, CMP)
- ③ Shift & Rotate Instr<sup>n</sup>

- CMP :- This instr<sup>n</sup> invokes the subtraction operation, after the operation the result is reflected in the Zero & Carry Flags, e.g.  $CMP R0, R1 ; R0 = 23 ; R1 = 23$ .  
in this case Zero flag is set.

If both  
operands  
are same,  
then

ZF = 1  
CF = X

After the op<sup>n</sup> read the status of zero flag, if it is 1, then we conclude that both operands are equal.

example 2:-  $CMP R0, R1 ; R0 = 40 ; R1 = 20$

If  $R0 > R1$ , then  
 $ZF = 0$   
 $CF = 0$

SUB: 0100 0000

0010 0000

0010 0000 → NZ

→ NC

→ Carry fl.  
If  $ZF = 0$  &  $CF = 0$ , then we conclude 1st operand  $>$  2nd operand.

# Cosmos(Sajal)

Date \_\_\_\_\_  
Page No. \_\_\_\_\_

example:- CMP  $\text{R}0,\text{R}1$ ,  $\text{R}0=20$ ,  $\text{R}1=10$   
Borrow  $\begin{array}{r} 0010 \\ - 0100 \\ \hline 1110 \end{array}$

If  
 $\text{OP} = \text{COP2}$   
 $\text{ZF} = 0$   
 $\text{CF} = 1$

$\therefore \text{CF} = 1$  [the borrow in carry flag is set when we take ZF = 0 from left hand side of MSB.]

} If  $\text{ZF} = 0$  &  $\text{CF} = 1$ , then we conclude 2nd operand is greater than the 1st operand.

- ③ Shift Opern -
- logical shift -
    - [left (LSL)]
    - [right (LSR)]
  - arithmetic shift -
    - [left (ASL)]
    - [right (ASR)]
- while execution of this instn, the bits are moving towards either left or right by 1. The classification is shown below:
- In logical shift opn, all the bits are moving towards left or right including the sign bit.



The MSB bit is lost, & a '0' is inserted at LSB.



The LSB bit is lost, & a '0' is inserted at MSB.

\* Shift opn is used in serial data communication.

LSL  $\text{R}0 ; \text{R}0 = 23$  Multiplication      LSR  $\text{R}0 ; \text{R}0 = 23$  Division by 2.



NOTE:- logical shift opn is used in the serial data communication. In arithmetic shift opn all bits are moving towards left or right except the sign bit.

# Cosmos(Sajal)



Date: \_\_\_\_\_  
Page No. \_\_\_\_\_

ASL:  sign bit remains same,  $(n-2)$ th bit is lost.

ASR:  LSB is lost, & at sign bit the same bit is shifted in the sign bit.

This is used in signed arithmetic computations.

ASL 80 : 80=82H



ASR 80



## ④ Rotate Operation:-

While execution of this "inst" all the bits are moving towards left or right without loss of data. The classification



ROL: carry 

ROR: carry 

RCL: carry 

RCR: carry 

Note:- Rotate "inst" is used in optimization in digital motors, to check whether a no. is even or odd (check LSB), to check whether a no. is +ve or -ve (check MSB).

# Cosmos(Sajal)

## Interrupts

Date: \_\_\_\_\_  
Page No. \_\_\_\_\_

An usual event to disturb the normal flow of execution. Instr<sup>n</sup> cycle with interrupt concept describes the execution sequence of the program along with the interrupts.



CPU responds to interrupt only when current instruction is executed, i.e. if the interrupt occurs during the execution of instr<sup>n</sup>, then the interrupt will only be handled after the current instr<sup>n</sup> is completely executed. If two interrupt occurs during

\* The above flow chart shows that the CPU will respond to the interrupts only after execution/completion of current instruction.

NOTE: The interrupt sources are enabled or disabled based on the status of the interrupt flag. If interrupt flag → Disable the interrupt sources. → No interrupt cycle

L- Enable the interrupt sources → Interrupt cycle is required to service the interrupts.

- After completion of every normal instr<sup>n</sup>, CPU checks whether there is an interrupt or not.



After completion of every instr<sup>n</sup> execution, the CPU reads the status of interrupt sources, e.g. INIR.

INIR = {  
    0 - NO-interrupt  
    1 - Interrupt.

The processor supports diff. interrupts, so diff. subprograms are required to handle diff. interrupts. All these interrupt subprograms(ISR) are stored in the long term memory which is called as Interrupt Vector Table(IVT). Each ISR must end with IRET (Interrupt Return) or RETI instruction.

When the CPU identifies the interrupt, it pushes the PC into the stack & load the vector address into the program counter. When the CPU encounters IRET/RETI instructions, it invokes the pop operation to restore the PC with the return address. ∴ the CPU can fetch the next instruction from the main program.

## Types Of Interrupt:-

- ① H/W interrupt :- These interrupts are present at the h/w pin of the processor. These are also known as external interrupts.
- ② S/W interrupt :- These interrupts are present in the program, i.e. S/W interrupts are the instruction.
- ③ Maskable interrupts :- These interrupts can be enabled or disabled.
- ④ Non-maskable interrupts (NMI) :- These interrupts are always in the enabled state, so the power failure conditions, temp. sensor conditions & critical condition of the processor are connected to the NMI pin.
- ⑤ Vector Interrupts :- These interrupts are associated with static vector pointers (fixed vector address).



INT0, INT1 are vector interrupts as only one device (which obviously have a single address) is attached to the pin,

but INT2 is non-vector interrupt when more than one device is attached to the interrupt pin.

# Cosmos(Sajal)

8085  
Interrupts

**Hardware Interrupts**

- RST7.5 → Maskable → Vector → Edge
- RST6.5 → Maskable → Vector → Level
- RST5.5 → Maskable → Vector → Level
- RST4.5 → NMI → Vector → Both
- INTR → Maskable → Non-Vector → Level

**S/W interrupts**

No concept of Maskable/NMI

|      |                                                 |
|------|-------------------------------------------------|
| RST0 | → All 8 bit vector/interrupts<br>(addresses): - |
| RST1 | 0000                                            |
| RST2 | 0008                                            |
| RST3 | 0010                                            |
| RST4 | 0018                                            |
| RST5 | 0020                                            |
| RST6 | 0028                                            |
| RST7 | 0030                                            |
|      | 0038                                            |

## Interrupt Priority:-

RST 4.5 > RST 7.5 > RST 6.5 > RST 5.5 > INTR

| addresses | RSTF                 | not applicable. |
|-----------|----------------------|-----------------|
| (0024)    | RST 4.5 (Trap) → NMI | RST0 → 0000     |
| (0030)    | RST 7.5 → Maskable   | RST1 → 0008     |
| (0034)    | → RST 6.5 → Maskable | RST2 → 0010     |
|           |                      | RST3 → 0018     |
|           |                      | RST4 → 0020     |

# Cosmos(Sajal)



Date: \_\_\_\_\_  
Page No. \_\_\_\_\_

- Non-vector interrupt: These interrupts are associated with the dynamic vector interrupts. These in the example, INT2 pin is connected with multiple sources, i.e. multiple vector address are possible.
- Level triggered interrupts: - These are enabled or disabled based on the levels, i.e. 1 or 0.



e.g. INTR pin

when Level is 1, only then interrupts are enabled.

- Edge triggered interrupts: - These are enabled or disabled based on the rising edge transition or falling edge transition.

CLK      ↑↑↑↑↑↑↑↑

## Interrupt Structure :-



|      |        |
|------|--------|
| RST0 | → 0000 |
| RST1 | → 0008 |
| RST2 | → 0010 |
| RST3 | → 0018 |
| RST4 | → 0020 |
| RST5 | → 0028 |
| RST6 | → 0030 |
| RST7 | → 0038 |

## Edge or Level :-

- RST 7.5 → Edge  
RST 6.5 → Level  
RST 5.5 → Level  
RST 4.5 → Both level & edge  
INTR → Level

# Cosmos(Sajal)

## Interrupt Priority

When the processor supports multiple interrupt sources, then there is a possibility of simultaneous interrupt. So to service 1 interrupt at a time, there is a need of priority level.

RST4.5 > RST7.5 > RST6.5 > RST5.5 > INTR  
↑                          ↓  
Highest priority              Lowest priority

## RISC Vs CISC

Based on the Type of instruction supported by the processor, architecture is divided into 2 types, i.e.

- Reduced Instruction Set Computer
- Complex     "        "        "

### Characteristic of RISC:-

- ① It contains more no. of registers. (This means that possibility of getting data from register is high.)
- ② It supports less no. of Addressing Modes. (As most of the times data is present in the registers, memory access is low, leading to less addressing modes needed.)
- ③ It supports fixed length instn.
- ④ One instruction per cycle because of fixed length instn.
- ⑤ It allows successful pipeline.
- ⑥ Motorola processor, power PC, advanced RISC machine(ARM).

### Characteristic Of CISC:-

- ① It contains less no. of registers.
- ② It supports more no. of Addressing modes. (Because as less no. of registers, the data is present in memory most of the time, so more no. of address modes are there.)
- ③ Supports variable length instn.
- ④ CPI  $\neq 1$ . Instructions Cycles per instn.
- ⑤ Unsuccessful pipelining.  
Pentium processor.

# Cosmos(Sajal)

## Register Organisation in RISC Processor :-

- ① In RISC processor, the registers are organised as overlapping register windows.
- ② RISC processor supports diff. types of registers, e.g. global registers (G), Local registers (L), In reg. (C), Out reg (C).

Global reg. are accessed by all the register windows.  
Register window consists of Local, In & Out reg.

Window (C)  
In (C)  
Out (C)  
Local (C)

Window size =  $L + 2(C + G) \rightarrow$  as global reg. are accessed by register windows, we have added G also.

\* The In reg. of one window is accessing the Out reg. of another window.

The register file indicates the no. of registers supported by the processor. Overlapping register window organisation means one window Out register acts as the In registers of another window.



Now, window W<sub>1</sub> is overlapping with W<sub>2</sub>, W<sub>2</sub> with W<sub>3</sub>, W<sub>3</sub> with W<sub>4</sub> & W<sub>4</sub> with W<sub>1</sub>.

Register file =  $W * (L + C) + G \rightarrow$  total no. of registers in the processor?

W: total no. of windows

L: no. of local reg.

C: no. of common reg (IN-OUT)

G: no. of global reg.

# Cosmos(Sajal)

Date

Date: \_\_\_\_\_  
Page No. \_\_\_\_\_

28.04.12

## Components Of the Computer

The computer consists 3 basic components:-

- ① CPU
- ② Memory
- ③ I/O

### CPU Organisation:-

CPU consists of 3 components, named as:-

- ① Registers
- ② ALU
- ③ CU

→ Registers:- Registers are used to store the data temporarily. (Registers are present in the CPU itself, i.e they are not connected to CPU via Bus.)

While execution of the program, the CPU is frequently communicating with the memory, synchronisation mechanism is seq. Memory hierarchy design is the synchronisation mechanism b/w memory & CPU, the objective of memory hierarchy design is

"Balancing the bandwidth b/w CPU & memory or Minimizing the speed gap b/w CPU & memory or Synchronizing the data transfer rate b/w CPU & memory."

Memory hierarchy means arranging all the system supported memory on priority basis & assign the accessing order.



# Cosmos(Sajal)

From bottom to top (from Secondary Mem to Registers)

- ① Capacity Decreased
- ② Access time decreases.
- ③ Performance  $\propto \frac{1}{\text{Access time}}$

- ④ Performance increases
- ⑤ Cost per bit increases, expensive

Acc. to the memory hierarchy design, the registers are small sized storage components, so the frequently used data is placed in the registers. Those registers are present in the CPU, they are not connected to the system bus.

## Mandatory Registers :-

- ① Program Counter :- holds the starting instr<sup>n</sup> address & immediately points to next instr<sup>n</sup> address
- ② Inst<sup>n</sup> register (IR) :- used to hold the currently fetched instr<sup>n</sup> to decode because instr<sup>n</sup> format is predefined in this register.
- ③ Accumulator (Acc) :- It is used to hold the ALU i/p's & o/p's.
- ④ Memory Address Register (MAR) :- used to carry the address. This reg. is directly connected with the address lines of the system bus.
- ⑤ MBR (Mem. Buffer register) / MDR (Mem. data register) :- used to carry the binary sequence. It is directly connected with the data lines of the system bus.

## Inst<sup>n</sup> execution Sequence using microoperations:-



# Cosmos(Sajal)

Date: \_\_\_\_\_  
Page No. \_\_\_\_\_



## ② Operand Fetch

- ① Reg. Addressing Mode:-

IR[Addrs]  $\rightarrow$  regname  $\rightarrow$  ALU

- ② Immediate AM

IR[Addrs]  $\rightarrow$  ALU

- ③ Direct AM

IR[Addrs]  $\rightarrow$  MAR  $\rightarrow$  ALU's  
RD  $\rightarrow$  CLS

Memory

Binary Seq.  $\rightarrow$  DL's  $\rightarrow$  MBR  $\rightarrow$  ALU

- ④ Reg. Indirect AM

IR[Addrs]  $\rightarrow$  Reg. name  $\xrightarrow{\text{EA}}$  MAR  $\rightarrow$  ALU's  
RB  $\rightarrow$  CLS

Memory

Binary Seq.  $\rightarrow$  DL's  $\rightarrow$  MBR  
ALU

- ⑤ Mem. Indirect AM

IR[Addrs]  $\rightarrow$  MAR  $\rightarrow$  ALU's  
RD  $\rightarrow$  CLS

Memory

MAR  
RD

ALU  $\leftarrow$  MBR  $\leftarrow$  DL's

The data seq.  
is address,  
it is transferred  
to MAR.

- ⑥ Process Data :-



# Cosmos(Sajal)

- **Micro Instruction** :- written by programmer, which is decoded by Control Unit.
- **Microinstruction** :- Every instn has a sequence of microinstn behind it that executes all operations of that instruction.

- Microoperation is a basic seq. to seq. transfer op<sup>n</sup>. It is also known as elementary op<sup>n</sup> or atomic atomic op<sup>n</sup>. Each microop<sup>n</sup> can be completed in 1 cycle. Some of the microoperations requires more than one cycle. e.g.  $R_1 \rightarrow R_2$ , to execute the  $\mu$ op<sup>n</sup>, there is a need of control signals, i.e.



all registers are connected to each other, so for transfer of  $R_1 \rightarrow R_2$ ,  $R_{1\text{out}}$  should be enabled, but as the data is accessible to all other reg., but only those reg. which want to access the data will set their  $R_{1\text{in}}$  pin.

**Microprogram** - The seq. of  $\mu$ op<sup>n</sup> is known as the  $\mu$ program.  $\mu$ op<sup>n</sup> is also known as minstr<sup>n</sup> or control word. e.g. the fetch cycle  $\mu$ program is

Inst<sup>n</sup>  
Fetch

T<sub>1</sub>: PC  $\rightarrow$  MAR, PCout MARin

T<sub>2</sub>: M[MAR]  $\rightarrow$  MBR; MARout MBRin : This  $\mu$ op<sup>n</sup> can't be completed in 1 cycle as it includes mem. access.

T<sub>3</sub>: MBR  $\rightarrow$  IR, MBRout IRin

PC  $\leftarrow$  PC + step-size ; PCout, PCin

## Control Unit :-

- ① System functionality is execution of the program. Program contains 2 objects, i.e. inst<sup>n</sup> & data. Inst<sup>n</sup> cycle is used to execute the inst<sup>n</sup>. Inst<sup>n</sup> cycle contains fetch & execution cycle. These subcycles invokes their  $\mu$ program.  $\mu$ program means sequence of  $\mu$ op<sup>n</sup>. Control signals are req. to execute a  $\mu$ op<sup>n</sup>. Control signals are implemented in the control unit during the design time of processor. CU generates the control signal. Basic b/w unit is able to process the control signals.

# Cosmos(Sajal)



Execution of the program

To design the control unit, the designer identifies 3 points :-

- ① no. of inst<sup>n</sup> supported by the processor.
- ② no. of HOp<sup>n</sup> req. to execute one inst<sup>n</sup>.
- ③ no. of control signals req. to execute 1 HOp<sup>n</sup>

b

e.g. 3 Inst<sup>n</sup>: ADD, SUB, MUL  
4 HOp<sup>n</sup>: T<sub>1</sub>, T<sub>2</sub>, T<sub>3</sub>, T<sub>4</sub>  
control signals: S<sub>0</sub>, S<sub>1</sub>, S<sub>2</sub>, S<sub>3</sub>

| HOp <sup>n</sup> | ADD                             | SUB                             | MUL                             |
|------------------|---------------------------------|---------------------------------|---------------------------------|
| T <sub>1</sub>   | S <sub>0</sub> , S <sub>3</sub> | S <sub>0</sub> , S <sub>2</sub> | S <sub>3</sub> , S <sub>0</sub> |
| T <sub>2</sub>   | S <sub>1</sub> , S <sub>0</sub> | S <sub>1</sub> , S <sub>3</sub> | S <sub>1</sub> , S <sub>2</sub> |
| T <sub>3</sub>   | S <sub>2</sub> , S <sub>3</sub> | S <sub>2</sub> , S <sub>0</sub> | S <sub>0</sub> , S <sub>2</sub> |
| T <sub>4</sub>   | S <sub>1</sub> , S <sub>3</sub> | S <sub>1</sub> , S <sub>0</sub> | S <sub>0</sub> , S <sub>1</sub> |

In RISC processor, all the microinst<sup>n</sup> completes in same no. of T-states, while in CISC processor, each microinst<sup>n</sup> may complete in variable time states.

This database is stored in the control unit for decoding the HOp<sup>n</sup> for execution of the program.

→ In (i) we write the program as ADD  
ADD

→ In (ii), the program is fetched & executed, & the inst<sup>n</sup> are decoded to find out which op<sup>n</sup> to perform by seeing accessing the CU, so fetch cycle sees the opcode in inst<sup>n</sup> & checks the CU for that opcode, which op<sup>n</sup> to perform, as ADD is to be performed.

→ In (iv), as we get to know which op<sup>n</sup> is to be performed, i.e. we check the database in CU to check which HOp<sup>n</sup> are req. to perform the program execution.

# Cosmos(Sajal)



After maintaining the database related to the instr<sup>n</sup>, pOp<sup>n</sup> & control Signals, the designer can use any one of the following approaches to design the control unit :- i.e.

① Hardwired control unit

② Microprogrammed CU

Hardwired CU:-

① In this design, the control signals are expressed as

Sum of Product expression (SOP). & they are directly realized on the independent h/w.

② This design is used in real time applications.

③ It is the fastest control unit design.

④ Even a minor modification requires redesign & reconnection of the control signals, ∵ it is not flexible.

⑤ It is not suitable for testing & design changes.

⑥ The RISC control unit is the hardwired CU.

Q. Consider a hypothetical CU, it supports 3 instr<sup>n</sup> ( $I_1, I_2, I_3$ ) & 4 control signals ( $S_0, S_1, S_2, S_3$ ). Each instr<sup>n</sup> requires 4 pOp<sup>n</sup> ( $T_1, T_2, T_3, T_4$ ). The following table indicates the no. of control signals used for each pOp<sup>n</sup> for each instr<sup>n</sup>. Get the control exp<sup>n</sup> for  $S_0$  &  $S_3$  signals

$$S_0 = T_1 + I_1 T_2 + I_2 T_3 + I_3 T_4 + I_2 T_4 + I_3 T_4$$

$$S_3 = T_1 I_1 + T_1 I_3 + T_2 I_2 + T_3 I_1 + T_4 I_1$$



# Cosmos(Sajal)

Date: \_\_\_\_\_  
Page No. \_\_\_\_\_

## Microprogrammed CU

- ① In this design, the control memory is used to store the microprogram.
- ② The control memory is a permanent memory, i.e. ROM.
- ③ The "Inst<sup>n</sup>" (control word) are stored in the control memory.
- ④ The format of μop<sup>n</sup> is described below:-

Control word | Inst<sup>n</sup> | μop<sup>n</sup>



- ⑤ Control Signals are stored in 2 formats, i.e. Decoded or Encoded format.  
If control signals are stored in decoded format, it is called as horizontal control word, the corresponding CU is called as horizontal μprogram CU.  
If control signals are stored in encoded binary format, it is called as vertical control word, the corresponding control unit implementation is called vertical μprogrammed CU.

Decoded format [1 bit/control signal]      Encoded (control signals format are stored in control memory)  
When control word is stored in decoded format, then it is called Horizontal control word.

When control word is stored in encoded format, then it is called vertical control word.

## Horizontal Vs Vertical -

### Horizontal programming:

- ⑥ Decoded format of control signals, i.e. 1 bit / Control signals. e.g. n control signals requires n bits.

# Cosmos(Sajal)

$$\begin{aligned}
 & \text{10}^3 \text{ cycles pp} \\
 & 2.3 \text{ Giga} \\
 & 2.3 \times 10^3 \text{ Giga} \\
 & \Rightarrow 2.3 \text{ Giga cycles} \\
 & \text{Data} \\
 & 1000 \\
 & 4.102 + 4.102 \\
 & 12.053 + 12.053 \\
 & = 2.4 + 5.8 + 3.6 + 2.4
 \end{aligned}$$

- ~~It supports 2<sup>log<sub>2</sub>N</sup>~~ log<sub>2</sub>N control words.

No need of external decoders to generate the control signals, so it is faster than vertical.

It allows high degree of parallelism.  
e.g. If degree of parallelism is 'N', which means n control signals are enabled at a time, so it is used in parallel processing applications.

It is bit flexible compared to the hardwired CU.

### Vertical Programming:

It supports encoded format of control signals, i.e. n control signals req.  $\lceil \log_2 N \rceil$  bits.

Shorter control words.

Decoders are req. to generate the control signals, so it is slower than horizontal.

Low degree of parallelism, i.e. 1 or none.

It is very flexible.

- Ascending Order in terms of speed -

Vertical < Horizontal < Handicized

- Ascending order in terms of flexibility

Hardwired < Horizontal < Vertical

Consider a CU which supports 48 control signals.

4. 12 flags. It has 1024 control word memory, what is the size of control word & control mem. in bytes using horizontal programming.



(n size = 62 bits

$$\text{control word mem.} = 62 \times 1024 \text{ bits}$$

$$= 62 \times 1024 \text{ bytes}$$

# Cosmos(Sajal)

Date \_\_\_\_\_  
Page No. \_\_\_\_\_

- Q. Considers a hypothetical control unit, it uses 6 groups of mutually exclusive control signals.

| Group   | G <sub>1</sub> | G <sub>2</sub> | G <sub>3</sub> | G <sub>4</sub> | G <sub>5</sub> | G <sub>6</sub> |
|---------|----------------|----------------|----------------|----------------|----------------|----------------|
| CS & CS | 1              | 5              | 8              | 14             | 3              | 6              |

How many no of bits are saved using the vertical over horizontal.

$$\rightarrow \text{Horizontal} = 1 + 5 + 8 + 14 + 3 + 6 = 37$$

$$\text{Vertical} = \lceil \log_2 1 \rceil + \lceil \log_2 5 \rceil + \lceil \log_2 8 \rceil + \lceil \log_2 14 \rceil + \lceil \log_2 3 \rceil + \lceil \log_2 6 \rceil \\ = 1 + 3 + 3 + 4 + 2 + 3 = 16$$

$$\text{No. of bits saved} = 37 - 16 = 21$$

## High Performance Architecture:

Performance is an indirect measurement which depends upon the execution time or throughput.

- Execution Time: Time required to complete the program execution. It is also called as elapsed time or response time.
- Consider two systems X & Y, used to execute task 1, X system requires 5ms & Y requires 10ms to complete the task.

$$\text{Performance} \propto \frac{1}{\text{Execution Time}}$$

- Throughput means the no. of tasks completed in a unit time. Consider X & Y systems used to execute 10 tasks in 12 hours. X completes 10 tasks & Y completes 5 tasks.

$$\text{Performance} \propto \text{Throughput}$$

- Performance Gain can be calculated using speed up factor (S). Speed-up factor compares 2 entities.

$$S = \frac{\text{Performance of } X}{\text{Performance of } Y} = \frac{1/ET_X}{1/ET_Y} = \frac{ET_Y}{ET_X} = n$$

X System runs "n" times faster than Y System.

# Cosmos(Sajal)

Date: \_\_\_\_\_  
Page No. \_\_\_\_\_

$$S = \frac{10}{5} = 2 \Rightarrow \text{Speed-up factor - 2}$$

$$S = \frac{\text{Throughput}_X}{\text{Throughput}_Y} = n$$

$\times$  system runs "n" times faster than Y  
 $S = \frac{10}{5} = 2$

(CPU time calculation:-

CPU time - program execution time. It can be calculated based on the processor clock



Processor contains clock pin as an I/P pin, this pin is externally connected with clock generator. Clock generator is operating with constant frequency. It then generates the clock pulses. These clock pulses are carried into the processor through the clock pin. All the processor's activities are controlled by the clock. The time

The time state is defined as rising edge to rising edge transition or falling edge to falling edge transition. It is also called as cycle, clock cycle, tick, clock tick, clock period etc.



Cycle Time:-

The time required to transfer the pulse either from rising edge to rising edge or from falling edge to falling is called as cycle time. The cycle time depends on clock frequency. If 1 GHz clock is used, then cycle time is

$$t = \frac{1}{f} = \frac{1}{1 \text{ GHz}} \text{ sec} = \frac{1}{10^9} \text{ sec} = 10^{-9} \text{ s} = 1 \text{ ns}$$

CPU time = Program Execution Time =  
no. of inst<sup>n</sup> per program \*  
no. of cycles per inst \*  
no. of seconds per cycle

# Cosmos(Sajal)

Date: \_\_\_\_\_  
Page No. \_\_\_\_\_

or  $\text{CPU time} = \text{no. of cycles req. per program} * \text{cycle time}$

If we know the no. of cycles req. for the program & the  $\text{inst}^n$  count, then we can compute  $\text{CPI} (\text{cycles per inst}^n)$ .

i.e  $\text{CPI} = \frac{\text{no. of cycles req. per program}}{\text{Inst}^n \text{ count}}$

$\text{CPU time} = (\text{CPI} * \text{Inst}^n \text{ count}) * \text{cycle time}$

$\text{CPU time} = (\sum_i \text{CPI}_i * \text{IC}_i) * \text{cycle time}$

$i^{\circ}$  :- type of Instn

Q. Consider 1 GHz clock frequency processor used to execute the following program:-

| Inst <sup>n</sup>        | CPI | freq. |                            |
|--------------------------|-----|-------|----------------------------|
| Load & Store             | 4   | 30%   | <u>0.12</u><br><u>+1.2</u> |
| ALU                      | 8   | 50%   | <u>+1.0</u><br><u>+1.2</u> |
| Branch Inst <sup>n</sup> | 6   | 20%   |                            |

- (a) What is avg.  $\text{inst}^n$  execution time.  
(b) performance in terms of MIPS [Million  $\text{Inst}^n$  per second].  
(c) If program contain  $10^9$   $\text{inst}^n$ , what is the program execution time.

Ans (a)  $6.4 = \text{avg. CPI}$   
cycle time  $\frac{1}{10^9} \text{ s} = 10^{-9} \text{ s}$

avg.  $\text{inst}^n$  execution time =  $6.4 \text{ ns}$

(b) Ans.  $1 \text{ inst}^n \rightarrow 6.4 \text{ ns}$   
 $1 \text{ s} \rightarrow \frac{1}{6.4} \times 10^9$   
 $= 156.25 \text{ MIPS}$

# Cosmos(Sajal)

$$\textcircled{c} \quad 1 \text{ instn} \rightarrow 6.4 \text{ ns}$$

$$10^9 \text{ instn} \rightarrow 6.4 \times 10^9 \times 10^{-9} \text{ s} = 6.4 \text{ s}$$

Q. Consider 1 ns clock cycle processor which consumes 8 cycles for load  $inst^n$ , 7 cycles for ALU op $n^2$  & 6 cycles for branch op $n^3$ . The relative freq. of these  $inst^n$  are 30%, 40%, 30% resp. An ideal processor with CPI = 1. A clock speed is 1.3 ns used to execute the program, what is the speed-up using the ideal processor over another processor? 7.0

$$\begin{aligned}
 & 7 \times 1 \text{ ns} = 7 \text{ ns.} \rightarrow \text{Processor 1} \\
 & 24 \cancel{\times 0.3} \cancel{\times 0.27} \\
 & 1 \times 0.3 + 1 \times 0.4 + 1 \times 0.3 = 1 \cancel{\times} 1.3 \text{ ns} \\
 & \quad = 1.3 \text{ ns} \rightarrow \text{Processor 2} \quad \cancel{7} \cancel{27} \cancel{3}
 \end{aligned}$$

$$S = \frac{7 \text{ ns}}{13 \text{ ns}} = 5.3$$

$$\begin{array}{r}
 3 + 2 * 6 \\
 = 15 \\
 \hline
 2 \\
 2.4 \\
 0.8 \\
 3.6 \\
 2.4 \\
 \hline
 9.2 \\
 \hline
 \end{array}
 \quad
 \begin{array}{r}
 1.8 \quad 0 \\
 + 0.45 \\
 \hline
 + 2.00 \\
 \hline
 4.25 \\
 \hline
 \end{array}
 \quad
 \begin{array}{l}
 2ns \rightarrow 1 \text{ instn} \\
 1s \rightarrow \frac{1}{2} \times 10^9 \\
 \hline
 \end{array}$$

$$1 \text{ inst}^n \rightarrow 4 \text{ ns} \quad 1 \text{ inst}^n \rightarrow \frac{9.2 \times 10^{-9}}{2.3} \text{ s inst}^n$$

$$1 \text{ s} \rightarrow \frac{1}{4} \times 10^9 \text{ inst}^n \quad 1 \text{ s} \rightarrow \frac{2.3}{9.2} \times 10^9$$

$$= 0.25 \quad = 250 \text{ MIPS}$$

$$\text{avg. instr}^n \quad \text{cycle time} = \frac{1}{2 \cdot 3} \times 10^{-7} \text{ s}$$

# Cosmos(Sajal)

Date: \_\_\_\_\_  
Page No. \_\_\_\_\_

Kathy

High performance architecture exploit the concurrency.

- Concurrency - means two or more event execution is called concurrency, this event may be a subprogram, program, inst<sup>n</sup>, stages in inst<sup>n</sup>. concurrency implies to parallelism, simultaneous, pipelining.
- Parallelism means two or more events are executed at the same time period (in range of time states).
- Simultaneous - two or more events are executed in the same time instance (in same time state, → two initiation are seq. for simultaneous)
- Pipeling - two or more events are executed in overlapping time span.

High Performance Architecture is classified in 4 types acc to Flynn's Classification:-

- ① SISD (Single Inst<sup>n</sup> Stream Single Data Stream)
- ② SIMD (" " " Multiple " "
- ③ MISD (Multiple " " Single " ")
- ④ MIMD ( " " " Multiple " ")



# Cosmos(Sajal)

Date \_\_\_\_\_  
Page No. \_\_\_\_\_

- Simultaneous Execution [ same inst<sup>n</sup> <sup>is</sup> performed at the same time ]



- Only one processor is in active state, because there is only one memory unit, so at a time only one processor can access the memory unit.
- This concept is not implemented.



- It is parallelism.
- Different inst<sup>n</sup> are performed at the same time, ∴ parallelism.
- To improve the SISD, we add pipelining concept

## Pipelining

Definition: ① One pipe output is connected to the input of another pipe.

# Cosmos(Sajal)

??

Date: \_\_\_\_\_  
Page No. \_\_\_\_\_

Definition: Accepting new I/P's at one end before previously accepted input appears as an O/P at the other end. ∴ Pipelining allows overlapping execution.

Note: The defn states that the new I/P's are inserted into the pipe before completion of the old I/P's, which means that new I/P's are overlapping with old I/P's. Pipelining allows overlapping execution. Non-pipelining means new I/P's are inserted only after completion of the old I/P's; it allows non-overlapping execution.

Q. 

(CPT = 1)

- Successful characteristic of pipeline is for every new cycle new I/P must be inserted into the pipe, CPI = 1.
- Overlapping execution sequence can be represented by using the space time diagram.
- In space-time diagram, x-axis indicates the no. of cycles & y-axis indicates the no. of stages.



Designing → Pipe has two ends, i.e. I/P & O/P end. B/w I/P of the end & output end different pipes are interconnected pipeline. With each other based on the function of the pipeline to satisfy the objective of the pipeline.

- Each pipe is called stage or segment. B/w the stages interface registers are used to store the intermediate result. It is also called as pipeline register or buffer or latch.

# Cosmos(Sajal)

5/10

Date \_\_\_\_\_  
Page No. \_\_\_\_\_

- All the stages along with the buffer are under the control of common clock.



- Consider 4 Segment pipeline used to execute the 4 task, show the execution seq. using space-time diagram.



\* The 1st i/p to the pipeline requires  $k$  cycles to execute ( $k$ : no. of segment).

→ The rest i/p's require 1 cycle to output  
 $\therefore [k + (n-1)]$  [n : no. of tasks.]

→ In non-overlapping execution =  $k \times n$

Non-overlapping  
Performance evaluation of the pipeline processor:-

→ consider  $k$  segment pipeline

with clock cycle time of  $t_p$ . The very 1st inst<sup>n</sup> (task) in the pipeline is executed in the non-overlapping order, so it requires  $k$  cycles to complete the op<sup>n</sup>. The remaining  $(n-1)$  tasks emerge from the pipe @ one cycled one task/cycle,  $(n-1)$  cycles are req. to complete the remaining task,  $\therefore$  the total time req. to

# Cosmos(Sajal)

Date \_\_\_\_\_  
Page No. \_\_\_\_\_

execute  $n$  task using  $k$  segment pipeline  $\Rightarrow$

$$ET_{\text{pipe}} = (k+n-1)t_p$$

- Consider a non-pipeline processor used to execute  $n$  tasks, each task requires  $t_n$  time to complete  
∴ the total time req. to execute  $n$  tasks using non-pipeline processor is  $ET_{\text{non-pipe}} = nt_n$
- The performance gain of the pipeline over non-pipeline is :-

$$S = \frac{\text{performance}_{\text{pipeline}}}{\text{performance}_{\text{non-pipe}}} = \frac{\frac{1}{ET_{\text{pipe}}}}{\frac{1}{ET_{\text{non-pipe}}}}$$

$$S = \frac{n t_n}{(k+n-1)t_p}$$

- When the no. of tasks increases,  $n \ggg k-1$ , ∴

$$k+n-1 \approx n$$

$$\therefore S = \frac{n t_n}{n t_p} = \frac{t_n}{t_p} = \frac{1}{k} \cancel{k} / \cancel{1} \cancel{k}$$

$$\therefore S = \frac{t_n}{t_p}$$

- When all inst<sup>n</sup> takes same no. of cycles, then non-pipeline inst<sup>n</sup> execution time is equal to the no. of segments in the pipeline

$$t_n = kt_p \Rightarrow S = k t_p / t_p = k$$

$$\therefore S = k \rightarrow \text{Pipeline Depth} \star$$

- When processor operates at 100% efficiency, then the max. speedup is pipeline depth, i.e.

$$S_{\text{max}} = k$$

$$\eta = \frac{S}{S_{\text{max}}} = \frac{S}{k}$$

pipeline efficiency

The throughput of pipeline is no. of tasks processed per total time req. to process the tasks.

$$\text{i.e. Throughput} = \frac{n}{(k+n-1)t_p}$$

# Cosmos(Sajal)

## Types Of Pipelines

Date: \_\_\_\_\_  
Page No. \_\_\_\_\_

- ① **Linear Pipeline :-** This pipeline is used to perform only one function. (only feed forward.)
- ② **Non-Linear Pipeline -** This can perform multiple functions. It can consist of feed-forward & feed-backward connections.



- ③ **Synchronous Pipeline :-** In this pipeline, the initiations are controlled by the clock, i.e. each stage need to complete its op<sup>n</sup> with the clock cycle, e.g.
- ④ **Asynchronous Pipeline :-** initiations are controlled by handshaking signals, i.e. take the feedback of the stage before assigning the op<sup>n</sup> to the next stage. [Not in use at the moment]

## Hypothetical Pipelines

- ① **Uniform Delay pipeline :-** in this pipeline, all the stages completes the op<sup>n</sup> with the same delay.  $2\text{ns}$     $2\text{ns}$     $2\text{ns}$     $2\text{ns}$



Case ①:  $t_p = \text{stage delay}$

→ If the stage takes  $2\text{ns}$ , & the reg. takes  $1\text{ns}$

Case ②:  $\text{delay} \rightarrow t_p = 3\text{ns}$ .

$t_p = \text{Stage delay} + \text{buffer delay}$

Case ③

# Cosmos(Sajal)

Date: \_\_\_\_\_  
Page No. \_\_\_\_\_

Case(1):-  $t_p = \text{stage delay}$

Case(2):- If buffer delay is included, then

$$t_p = \text{Stage delay} + \text{buffer delay}$$

Case(3):- If we have uniform stage delay but non-uniform buffer delay,

$$t_p = \text{Stage delay} + \max. \text{ buffer delay}$$

Case(4):- If non-uniform delay < buffers

case(5)- If Skew time overhead/setup time overhead is included,

$$\therefore t_p = t_p + \text{skewtime} / \text{setup time overhead.}$$

→ Non-uniform Delay Pipeline:-

In this pipeline diff. stage delays are maintained b/w the stages

①  $t_p = \max. \text{ stage delay}$

② If buffer delay is included,

$$\therefore t_p = \max. \text{ stage delay} + \text{buffer delay}$$

③ If non-uniform buffer delays are used :-

$$t_p = \max. \text{ stage delay} + \max. \text{ buffer delay.}$$

# Cosmos(Sajal)

\* In Non-pipeline processor, buffer is not included.

Date: \_\_\_\_\_  
Page No.: \_\_\_\_\_

Q1. M.12

An inst<sup>n</sup> pipeline which has the speed up factor 10, while CP<sup>n</sup> with 80% efficiency. What is the no. of stages in the pipeline?

$$0.8 = \frac{10}{k}$$

$$k = \frac{10}{0.8} = 12.5 \approx 13.$$

Q2. Consider 2 pipelines A & B where A is having 6 stages of uniform delay of 2ns. Pipeline B is having 5 stages w.r.t. stage delays of 2ns, 1ns, 3ns, 2ns, 2ns. How much time is saved using pipeline A instead of pipeline B if 100 tasks are pipelined.

Ans.  $t_{pA} = 2 \times 6 = 12 \text{ ns}$

$t_{pB} = 3 \text{ ns}$

$$ET_A = (6+99) \times 2 = 210 \text{ ns}$$

$$ET_B = (5+99) \times 3 = 312 \text{ ns}$$

Time save  $102 \text{ ns}$

$$\frac{10^9}{3} = 3.33 \times 10^8$$

Q3. Consider 4 segment pipeline w.r.t. stage delays 3ns, 5ns, 8ns, 2ns. The interface reg. are having the delay of 1 ns. What is the asymptotic speedup when very large no. of inst<sup>n</sup> are pipelined.

Ans.  $t_p = 8 + 1 = 9 \text{ ns}$

$$S = \frac{t_n}{t_p} + \frac{36}{9} = 14$$

$$\frac{4 \times 8 - 32 + 4}{9} = 36 \text{ ns}$$

Note:  $1 \text{ ns} + 8 \text{ ns} + 1 \text{ ns} + 8 \text{ ns} + 1 \text{ ns} + 8 \text{ ns} + 1 \text{ ns} = 36 \text{ ns}$ .  
 $t_m = 3 \text{ ns} + 1 \text{ ns} + 5 \text{ ns} + 1 \text{ ns} + 8 \text{ ns} + 1 \text{ ns} + 2 \text{ ns} + 1 \text{ ns} = 24 \text{ ns}$ .

Q4. Consider a non-pipeline processor. It takes 50 ns to complete the task. The same task can be processed by using 6 segment pipeline w.r.t. stage delays of 5ns, 8ns, 10ns, 12ns, 2ns, 4ns. What is the speedup factor.

# Cosmos(Sajal)

Date \_\_\_\_\_  
Page No. \_\_\_\_\_

Ans.  $S = \frac{m t_n}{(n+k-1) t_p} = \frac{50}{12} = 4.16$

$$S = \frac{m t_n}{(n+k-1) t_p} \quad m >> k-1$$

Q5. Consider 4 stage pipeline where diff. inst<sup>n</sup> are taking diff amount of time at diff stages. Rep<sup>i</sup>eshn below.

|       | $S_1$ | $S_2$ | $S_3$ | $S_4$ |                     |
|-------|-------|-------|-------|-------|---------------------|
| $I_1$ | 2     | 3     | 2     | 1     | $I_1 \rightarrow 3$ |
| $I_2$ | 1     | 2     | 1     | 2     | $I_2 \rightarrow 2$ |
| $I_3$ | 2     | 2     | 3     | 2     | $I_3 \rightarrow 3$ |
| $I_4$ | 1     | 2     | 2     | 1     | $I_4 \rightarrow 2$ |

Asynchronous pipeline

- (a) How much time is req. to complete the inst<sup>n</sup>?
- (b) What is the speedup factor?
- (c) What is the efficiency?
- (d) What is the throughput?

Ans. (a)  $I_1 \rightarrow t_p = 4 \times 3 = 12 \text{ ns}$

In Asynchronous pipelining, before passing the inst<sup>n</sup> to next segment, he checks whether the next segment is busy or not

|       | $S_1$ | $S_2$ | $S_3$ |                              |
|-------|-------|-------|-------|------------------------------|
| $I_1$ | 2     | 3     | 5     |                              |
| $I_2$ | 2     | 3     | 5     | → This is non-uniform delay. |
| $I_3$ | 2     | 3     | 5     |                              |

For asynchronous pipeline, draw the space time diagram. [Feedback op<sup>n</sup> is performed via handshaking]



During this time,  $I_2$  can't be forwarded to  $S_2$  as it is executing  $I_1$ ,  $I_2$  will wait in  $S_1$  stage.

# Cosmos(Sajal)

KMSE

Date \_\_\_\_\_  
Page No. \_\_\_\_\_

(a) 15 cycles were req. to complete the opn.

$$(b) S = \frac{t^n}{t_p} = \frac{8ns + 6ns + 9ns + 6ns}{15ms} = \frac{29}{15} = 1.9$$

$$(c) \eta = \frac{S}{K} = \frac{1.9}{4} = 0.47$$

(d) Throughput = no. of tasks executed per unit time  
 $= [4/15]$

$$t_p = 3ns$$

$$t_n = 7ns$$

$$\underline{\underline{2}}$$

$$t_p = (1 + 999) \times 165$$

$$= \frac{1000 \times 165}{ns}$$

$\therefore 1 = 12 ns$

$$S = \frac{12}{K}$$

$$t_p = \frac{12ns}{S}$$

$$S = \frac{30ns}{12ns}$$

$$(15) 4 \times 0.4$$

$$1.6 + 0.8 + 2 = 4.4$$

Avg. instr execution time = 4.4 ns

$$ET_{\text{pipe}} = CPI * (0.2 + 1) = 1.2 * 1 = 1.2 ns$$

$$\therefore S = \frac{4.4 \times 1.2}{1.2} = 3.7$$

$$(16) n \times 2^9 \times 2^{10} \times 2^3 = 2 \times 2^{16} \times 2^{10} \times 2^5$$

$$n \times 2^{12} = 2^{16}$$

$$n = 2^4 = 16.$$

$$(17) S = \frac{100 \times 110 \times 110}{110} = 110$$

$$= \frac{310}{110} = 2.8$$

$$= \frac{31}{11} = 2.8$$

$$= \frac{31}{11} = 2.8$$

$$(18) \frac{50 \times 100}{(6+99) \times 10} = \frac{5000}{1050} = 4.70$$

$$(19) \frac{250 \times 10^2}{400 \times 10^2} = 2500$$

$$(20) \cancel{10} \rightarrow 0.8 = \frac{10}{K}$$

$$K = \frac{10}{0.8} = \frac{100}{8} = 12.$$

$$S = \frac{4}{2500}$$

$$(21) (8+99) \times 2 = 107 \times 2 = 214 ns$$

$$(8+99) \times 3 = 107 \times 3 = 321 ns$$

# Cosmos(Sajal)

$$\begin{aligned}
 \textcircled{29} \quad & (t_p)(D) = (5 + 99)4 = 104 \times 4 = 416 \text{ ns} \\
 & (t_p)D_2 = (8 + 99) \times 2 = 107 \times 2 = 214 \\
 & \text{time saved} = \frac{214}{202} \text{ ns}
 \end{aligned}$$

Date \_\_\_\_\_  
Page No. \_\_\_\_\_

| $S_4$ |       |       |       | $I_1$ | $I_1$ |       |       | $I_2$ |       | $I_3$ | $I_3$ | $I_4$ | $I_4$ |
|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|
| $S_3$ | -     |       |       | $I_1$ | $I_1$ | $I_2$ | $I_2$ | $I_2$ | $I_2$ | $I_3$ | $I_3$ | $I_4$ | -     |
| $S_2$ |       |       | $I_1$ | $I_1$ | $I_2$ | $I_2$ | $I_2$ | $I_3$ | $I_3$ | -     | $I_4$ | $I_4$ |       |
| $S_1$ | $I_1$ | $I_1$ | $I_2$ | $I_3$ | $I_3$ | -     |       | $I_4$ | -     |       |       |       |       |

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

$$\textcircled{29} \quad S = \frac{29}{15} = 1.9$$

$$\eta = \frac{1.9}{4} = 0.475$$

\textcircled{29}

| $S_4$ |       |       |       | $I_1$ | $I_1$ | $I_2$ |       | $I_3$ |       | $I_4$ | $I_4$ | $I_1$ |  |
|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|--|
| $S_3$ |       |       |       | $I_1$ | $I_2$ | $I_2$ | $I_3$ | $I_3$ | $I_4$ | $I_4$ | $I_4$ | $I_1$ |  |
| $S_2$ |       | $I_1$ | $I_1$ | $I_2$ | $I_3$ | -     | $I_4$ | -     | $I_1$ | -     |       |       |  |
| $S_1$ | $I_1$ | $I_2$ | $I_2$ | $I_3$ | $I_4$ | $I_4$ | $I_1$ | $I_1$ |       |       |       |       |  |

K  $\xrightarrow{I=1}$   $\xrightarrow{I=2}$   $\xrightarrow{I=3}$   $\xrightarrow{I=4}$   $\xrightarrow{I=5}$   $\xrightarrow{I=6}$   $\xrightarrow{I=7}$   $\xrightarrow{I=8}$   $\xrightarrow{I=9}$   $\xrightarrow{I=10}$   $\xrightarrow{I=11}$   $\xrightarrow{I=12}$   $\xrightarrow{I=13}$

Overlapping cycles  $\rightarrow 7$  to 11

\textcircled{11}

| $S_4$ |       |       |       | $I_1$ |       | $I_2$ | $I_2$ | $I_3$ | $I_3$ | $I_3$ | $I_3$ | $I_4$ | $I_4$ |
|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|
| $S_3$ |       |       |       | $I_1$ |       | $I_2$ | $I_2$ | $I_3$ | -     | $I_4$ | $I_4$ | -     | $I_1$ |
| $S_2$ |       | $I_1$ | $I_2$ | $I_2$ | $I_3$ | -     | $I_4$ | -     | $I_1$ | -     | -     | $I_2$ | $I_2$ |
| $S_1$ | $I_1$ | $I_1$ | $I_2$ | $I_3$ | $I_3$ | -     | $I_4$ | -     | $I_1$ | $I_2$ | -     | $I_3$ | $I_4$ |

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19

\textcircled{12}

|       |       |       |       |       |       |       |
|-------|-------|-------|-------|-------|-------|-------|
| $I_1$ | $I_3$ | $I_3$ | $I_3$ | $I_4$ | $I_4$ | $I_4$ |
| -     | $I_4$ | $I_4$ | -     |       |       |       |
|       | $I_4$ |       |       |       |       |       |

20 21 22 23 24 25 26

# Cosmos(Sajal)

## Advanced Pipeline

Date: \_\_\_\_\_  
Page No. \_\_\_\_\_

RISC processor supports 3 categories of instn:-

- ① Load & Store : Load  $r_0, 2(r_1)$

$$r_0 \leftarrow M[2 + [r_1]]$$

$$\text{Store } 3(r_1), r_0$$

$$M[3 + [r_1]] \leftarrow r_0$$

→ RISC processor is register to register architecture.

- ② ALU op<sup>n</sup>: Add  $r_0, r_1, r_2$

$$r_0 \leftarrow r_1 + r_2$$

- ③ Transfer of Control:-

Unconditional TOC: JMP 2000

$$PC \leftarrow 2000$$

Conditional TOC: JNZ ~~2000~~  $r_0, 2000$



RISC processor uses 5 stage pipeline to execute the instn:-

- I<sub>1</sub>  
I<sub>2</sub>  
I<sub>3</sub>  
I<sub>4</sub>  
19
- ① Inst<sup>n</sup> fetch(IF) :- In this stage, based on the PC, the inst<sup>n</sup> is transferred from memory to CPU, at the end of this process, program counter is incremented by step size to point the next inst<sup>n</sup> address.
  - ② Inst<sup>n</sup> Decode(ID) :- 2 op<sup>n</sup> are performed in this stage, i.e. decode the inst<sup>n</sup> to identify the type of op<sup>n</sup> & accessing the register file to read the operands. This stage also contains comparator logic to evaluate the branch conditions, i.e., TOC inst<sup>n</sup> execution is completed at the end of 2nd stage of the pipeline.
  - ③ Execution(EX) :- ALU op<sup>n</sup> are performed in this stage.
  - ④ Memory Access(MA) :- Memory read & mem. write op<sup>n</sup> are performed in this stage.
  - ⑤ Write Back(WB) :- Register file is updated in this stage. This stage op<sup>n</sup> is divided in 2 parts, in 1st part register file is updated & in 2nd part, reg. file is read to access the data.

# Cosmos(Sajal)

— King



- ① 1000: Load 70,20(r.)  
PC = 1001

MEMRD  
Source: M

1 2 3  
+ Destination

register  
to  
We will  
get the value  
of  $\alpha$

The registered value was available in 2nd stage.

$2 + [\text{X}_p]$

1

The value is stored in the latch & not in the reg

$\leftarrow M[2 \uparrow [x_1]]$

- ② Store 30,200  
PC ← 1002

It is a mem

Source: registers, EA  
Destination: 3,7, EAA

he will get  
the value of  
 $\sigma_1$  &  $\sigma_2$  in  
2nd stage itself

$$34[\text{O}_2] = EA$$

EA is a  
test.

EA was calculated in previous stage &  $\sigma_1$  is stored at EA.

As no reg.  
write opn  
is needed,  
nothing will  
happen in  
this stage.

- ③ Add 80, 81, 8,

+ → add op is identified  
Source register addressed  
Destination register addressed  
we will get value of  $s_1$  &  $s_2$  in this stage itself

$s_1 + s_2$  → [ ]  
 If  $s_2$  is added & stored in ALU decision

This result  
is transferred  
to the mem-  
brane without  
accessing  
memory

- ④ 1000: JMP 2000

Unconditional  
TOC is identified

in this stage

- © 1000 : TNZ ~ P. 2005

'Cond. TOG is iden-

Identified  
Second = TMZ

Sovereign - 80

I forgot this  
M&P NZ

This is performed by comparator logic in 1st stage.  
 if  $BO = NZ$ , then  $PC \leftarrow 2000$   
 if  $IO = NZ$ , then no change in PC

# Cosmos(Sajal)

Date: \_\_\_\_\_  
Page No. \_\_\_\_\_

## Dependencies in pipeline :-

① Structural Dependency

② Data                  "

③ Control              "

→ Stalls in the pipeline

→ Extra cycle w/o opn.

- Structural Dependency :- this is present in the pipeline due to resource conflict, the resource may be memory, register or functional unit.



- In clock cycle 4, I1 & I4 are accessing memory at the same time, hence there is a resource conflict.
- Conflict is an unsuccessful opn, so keep instn ④ into the waiting state until the resource becomes available, this waiting creates the stalls.



→ I4 can't be fetched in CC4, as I1 is accessing the memory.

→ In this case CPI=1 because of presence of extra cycles due to dependencies present in the pipeline.

★ Due to 78 minimize the stalls in the pipeline due to structural dependency renaming concept is used.  
eg in clock cycle 4 ,instn 1 is accessing the memory to read the data & I4 is accessing mem. to read the instn because Instn & data both are present in same mem. area

# Cosmos(Sajal)



Date: \_\_\_\_\_  
Page No. \_\_\_\_\_

these is a conflict. Renaming implies that dividing the mem. in two modules & stores instn & data in diff. modes i.e. code memory & data memory.

|                | CC <sub>1</sub> | CC <sub>2</sub> | CC <sub>3</sub> | CC <sub>4</sub> | CC <sub>5</sub> | CC <sub>6</sub> | CC <sub>7</sub> | CC <sub>8</sub> |
|----------------|-----------------|-----------------|-----------------|-----------------|-----------------|-----------------|-----------------|-----------------|
| I <sub>1</sub> | CM              | ID              | ALU             | DM              | WB              |                 |                 |                 |
| I <sub>2</sub> | CM              | ID              | ALU             | DM              |                 |                 |                 |                 |
| I <sub>3</sub> |                 | CM              | ID              | ALU             |                 |                 |                 |                 |
| I <sub>4</sub> |                 |                 | CM              | ID              | CM              |                 |                 |                 |
| I <sub>5</sub> | CM - code mem.  |                 |                 |                 |                 |                 |                 |                 |
|                | DM - Data mem.  |                 |                 |                 |                 |                 |                 |                 |

→ no stall cycles,  
CPI=1

↳ when code & data are in diff. modules of the mem.

## \* Data Dependency:-

consider the program segment where instn j follows instn i in program.

i: Instn  
j: Instn

j is data dependent on i when j tries to read the data before i writes it. This means one of the source in j instn is the destn in i instn.

e.g. I<sub>1</sub>: Add r0, r1, r2 → until completion of .  
I<sub>2</sub>: MUL r<sub>3</sub>, r<sub>0</sub>, r<sub>4</sub> ↳

Data dependency exists b/w adjacent instn, i.e. it is known as true data dependency.

If Data Dependency exists b/w instn, keep dependent instn in waiting state until the o/p of previous instn is available.

e.g. Keep instn j in waiting state until instn i's execution is completed. This waiting creates stalls in the pipeline.

# Cosmos(Sajal)

|                                                                          | CC <sub>1</sub>                                                                                                                        | CC <sub>2</sub>                    | CC <sub>3</sub> | CC <sub>4</sub>                                | CC <sub>5</sub> | CC <sub>6</sub> | CC <sub>7</sub> |
|--------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------|------------------------------------|-----------------|------------------------------------------------|-----------------|-----------------|-----------------|
| I <sub>1</sub> : Add x <sub>1</sub> /x <sub>2</sub> IF                   | ID                                                                                                                                     | ALU                                | MA              | WB                                             |                 |                 |                 |
|                                                                          | Source x <sub>1</sub> , x <sub>2</sub><br>(I <sub>1</sub> )<br>Dest. x <sub>0</sub><br>Dif x <sub>2</sub><br>Value will<br>be accessed | x <sub>1</sub> +x <sub>2</sub> [ ] |                 | x <sub>0</sub> =x <sub>1</sub> -x <sub>2</sub> |                 |                 |                 |
| I <sub>2</sub> : MUL x <sub>3</sub> , x <sub>0</sub> , x <sub>4</sub> IF | ID                                                                                                                                     | ALU                                |                 |                                                |                 |                 |                 |
|                                                                          | Source x <sub>3</sub> , x <sub>0</sub><br>(I <sub>1</sub> )<br>Dest. x <sub>5</sub><br>Dif x <sub>3</sub><br>Value                     | x <sub>3</sub> *x <sub>0</sub> [ ] |                 |                                                |                 |                 |                 |

To check whether the resources are dependent or independent resources.

we can't get the value of x<sub>0</sub>, because it is dependent upon the result of I<sub>1</sub>.

| Tomasulo | S/No. | Function <sub>1</sub> TOP <sub>1</sub> | TOP <sub>2</sub> | DOP <sub>1</sub>                 | DOP <sub>2</sub> | Status |
|----------|-------|----------------------------------------|------------------|----------------------------------|------------------|--------|
| Algo.    | 1     | + x <sub>1</sub>                       | x <sub>2</sub>   | -                                | -                | 1      |
|          | 2     | * x <sub>3</sub>                       | - x <sub>4</sub> | x <sub>0</sub> (I <sub>1</sub> ) | -                |        |

The x<sub>0</sub> which acts as operand in I<sub>2</sub> is the destination for I<sub>1</sub>.

If the status → IOP - Independent Operand  
 DOP - Dependent Operand

| I <sub>3</sub> | IF | CC <sub>3</sub> | CC <sub>4</sub> | CC <sub>5</sub> | CC <sub>6</sub> | IF | Total 2 stalls. |
|----------------|----|-----------------|-----------------|-----------------|-----------------|----|-----------------|
| I <sub>4</sub> |    |                 |                 |                 |                 |    |                 |

★ I<sub>3</sub> remains in IF, because ID segment is occupied by the instn I<sub>2</sub>.

In the above execution sequence in clock cycle 3, during the decoding process of I<sub>2</sub>, the CPU recognizes that dependency exists b/w I<sub>2</sub> & I<sub>1</sub>, so execution state is not altered for I<sub>2</sub> until completion of I<sub>1</sub>, so 2 stall cycles are created.

To minimize the stalls in the pipeline due to data dependency, operand forwarding mechanism is used. Operand forwarding is a hw technique, it is also known as By-passing/short-circuiting.

→ By-passing means one stage ALU i/p<sup>instn</sup> acts as the another instn ALU i/p.

# Cosmos(Sajal)

Date: \_\_\_\_\_  
Page No. \_\_\_\_\_

This means before updating the result into the reg file, we can access it from the latch.



When

- I<sub>1</sub>: Add  $r_0, r_1, r_2 \rightarrow$  only one  
I<sub>2</sub>: MUL  $r_3, r_0, r_4 \rightarrow$  data dependency  
I<sub>3</sub>: DIV  $r_5, r_0, r_6 \downarrow$  because I<sub>1</sub>'s o/p  
I<sub>4</sub>: SUB  $r_7, r_0, r_8$  acts as I<sub>2</sub>'s i/p.

no data forwarding in this case, because it will read the data from the reg. file

I<sub>2</sub> instr gets r<sub>0</sub> from the I<sub>1</sub> ALU's reg. rather than waiting & taking it after WB from reg. file.

\* While execution of transfer of control instr, the program execution sequence is altered. Control dependency exists in the pipeline due to the TOC instr execution.



Unconditional TOC :-

Consider the following program segment :-

1000 : I<sub>1</sub>  
1001 : I<sub>2</sub>  
1002 : I<sub>3</sub> (JMP 2000)  
1003 : I<sub>4</sub>  
:  
2000 : BI<sub>1</sub>  
2001 : BI<sub>2</sub>

Execution sequence :-

I<sub>1</sub> - I<sub>2</sub> - I<sub>3</sub> - BI<sub>1</sub> = BI<sub>2</sub>

fall through

1000

1001

1002 : JMP

taken path] 2000 : BT,  
2001

# Cosmos(Sajal)

PC is incremented after "inst<sup>n</sup>" fetch  
Op<sup>m</sup>.

Date \_\_\_\_\_  
Page No. \_\_\_\_\_



The process of removing the unwanted instr<sup>n</sup> is known as flush/freeze.

V.V Imp.

- \* If the target address is available at "k<sup>th</sup>" inst segment, then no. of flush stalls will be "k-1"
- \* In RISC processor, the no. of stalls will be 1, as the target address is available after the 2nd stage. [When only 1 branch instr<sup>n</sup> is present.]
- \* If
  - Inst<sup>n</sup> fetch is overlapping with instr<sup>n</sup> decode, ∴ before decoding the previous instr<sup>n</sup>, a new instr<sup>n</sup> is inserted into the pipeline.
  - If the decoding stage decodes the instr<sup>n</sup> as data transfer or data manipulation, the next instr<sup>n</sup> is a wanted instr<sup>n</sup>.
  - If the decoding stage decodes the instr<sup>n</sup> as unconditional JOC, then the next instr<sup>n</sup> is unwanted instr<sup>n</sup>, the unwanted instr<sup>n</sup> are flushed out from the pipeline. The process of removing unwanted instr<sup>n</sup> is known as flush/freeze.

# Cosmos(Sajal)



Date: \_\_\_\_\_  
Page No. \_\_\_\_\_

- \* The flush/freeze op<sup>n</sup> creates the stalls in the pipeline. These stalls are known as Branch penalties.

Branch Penalty = At what stage the target address is available - 1

for RISC, branch penalty is always 1.

The no. of stall cycles created from the branch op<sup>n</sup> = Branch freq. \* Branch penalty  
e.g.

Consider 4 stage pipeline, the target address is not available until completion of inst<sup>n</sup>, if program contains 30% unconditional branch inst<sup>n</sup>, how many no. of stall cycles are created from those branch op<sup>n</sup>?

Ans.

$$\text{Branch penalty} = 3$$

$$180 \text{ stall cycles} = 0.3 * 3 = 0.9$$

## Conditional TOC:-

Consider the program segment:

I<sub>1</sub>

I<sub>2</sub>

I<sub>3</sub> (JNZ 80, 2000)

I<sub>4</sub>

BI<sub>1</sub>

BI<sub>2</sub>

CC<sub>1</sub>

T<sub>F</sub>

PC=1001

CC<sub>2</sub>

ID

IF

PC=1002

CC<sub>3</sub>

ID

IF

PC=1003

CC<sub>4</sub>

ID

Cond. TOC

NZ (CMP R0, CTR R0=23)

condn is true

PC=1004

2000

CC<sub>5</sub>

ID

IF

PC=1004

2001

CC<sub>6</sub>

ID

Execution Sequence  
(Condition True):-

I<sub>1</sub> - I<sub>2</sub> - I<sub>3</sub> - BI<sub>1</sub> - BI<sub>2</sub>

Execution Sequence  
(False condition):-

I<sub>1</sub> - I<sub>2</sub> - I<sub>3</sub> - I<sub>4</sub>

unwanted  
inst  
when cond  
is true

(1003 I<sub>4</sub>)  
BI<sub>1</sub>(2000)  
by comparison  
in Inst stage

# Cosmos(Sajal)

Date: \_\_\_\_\_  
Page No. \_\_\_\_\_

So,  $I_4$  is flushed & a stall is created.

- ★ When the decoding stage decodes the  $inst^n$  as conditional TOC with true condition, then the next  $inst^n$  is unwanted  $inst^n$ , flushout the unwanted  $inst^n$ , otherwise (if it is false cond<sup>n</sup>) the next  $inst^n$  is a wanted  $inst^n$ , so no need to flush.

Note:- If  $inst^n$  is conditional TOC, how many times the cond<sup>n</sup> evaluates to true, only those many stalls are present.

e.g.

$30\% \text{ cond}^n \text{ TOC}$        $70\% \text{ (true)} \rightarrow \text{Branch Penalty}$   
                                         $30\% \text{ (false)} \rightarrow \text{No branch Penalty}$

$$0.3 \times 0.7 = 0.21$$

No. of stall cycles from cond. branch =  $0.21 \times \frac{\text{Branch freq}}{\text{Branch penalty}}$

$$= 0.3 \times 0.7 \times 3 \quad [ \text{from previous ques} ]$$
$$= 0.21 \times 3 = 0.63$$

★ Due to Unconditional TOC & Conditional TOC, the pipeline doesn't operate at 100% efficiency.

- To minimize the stalls in the pipeline due to control dependency, branch prediction buffer (or) Branch target buffer is used.
- Branch target buffer is placed in the  $inst^n$  fetch Stage.

| PC <sub>entry</sub> | Expected PC |
|---------------------|-------------|
|                     |             |

\* No exact soln. for stalls in TOC.

- ★ In the PC entry, the next  $inst^n$  address after the TOC

# Cosmos(Sajal)

Date: \_\_\_\_\_  
Page No. \_\_\_\_\_

Inst<sup>n</sup> is maintained. In the expected PC, the target address is maintained.

cj 1000; I<sub>1</sub> : CL 04

1001; I<sub>2</sub> : ADD

1002; I<sub>3</sub> : ATNZ R<sub>0</sub>, 1000 ; let R<sub>0</sub> = 04

1003; I<sub>4</sub>.



an entry

is made

in the

buffer,

which shows PC value & Target address

Table, if they

do not match

then, execute

decode the current

inst<sup>n</sup> & If the PC

value matches with PC

value, then Target

address is made equal

to expected PC.

\* 1003, hence no

stall.

the next inst<sup>n</sup> is fetched from 1000 without letting the decoding operation to complete.

When decoding completes, it tells

that next address is 1000, which

is already known by expected PC, 2 inst<sup>n</sup> were fetched from 1000, rather than

★ This concept is useful to execute the loop, in the loop execution, the stalls are present during 1st & last cycle of the loop. The objective of prediction buffer is to predict the TA before decoding.

# Cosmos(Sajal)

Kings

Date: \_\_\_\_\_  
Page No. \_\_\_\_\_

## Delayed Branches:-

This concept preserves the execution path in the pipeline while execution of transfer of control opn.

- Delayed Branch slot is associated with NOP opn.
- Unwanted instn slots are defined as the delayed slots, i.e. it substitutes the NOP opn., so functionality is safe & not missing.

|                | CC <sub>1</sub><br>IF | CC <sub>2</sub><br>ID<br>(if it add<br>opn then<br>next instn<br>is wanted) | CC <sub>3</sub><br>IF | CC <sub>4</sub><br>ID<br>(if it is<br>MEMB or<br>next instn<br>is wanted) | CC <sub>5</sub><br>IF | CC <sub>6</sub><br>ID<br>(if it is<br>Uncond<br>TOC<br>then next<br>instn is<br>unwanted) |                                                                                                               |
|----------------|-----------------------|-----------------------------------------------------------------------------|-----------------------|---------------------------------------------------------------------------|-----------------------|-------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------|
| I <sub>1</sub> |                       |                                                                             |                       |                                                                           |                       |                                                                                           | The first statement is never a delayed slot.                                                                  |
| I <sub>2</sub> |                       |                                                                             |                       |                                                                           |                       |                                                                                           | → Delayed slot<br>(as next instn is wanted, then delayed slot is forwarded to next instn)                     |
| I <sub>3</sub> |                       |                                                                             |                       |                                                                           |                       |                                                                                           | → Delayed slot<br>(as next instn is unwanted, next slot is fixed delay slot & a NOP is inserted in its place) |
| I <sub>4</sub> |                       |                                                                             |                       |                                                                           |                       |                                                                                           |                                                                                                               |
| BI             |                       |                                                                             |                       |                                                                           |                       |                                                                                           |                                                                                                               |

★ The delayed slot is defined in the 2nd cycle of the pipeline. If the decoding stage decodes the instn as data transfer or data manipulation, then the delayed slot is forwarded to the next cycle. If the decoding stage decodes the instn as unconditional TOC or conditional TOC with true cond, then the delayed slot is fixed, so NOP is substituted during the decoding process.

## Scheduling

Consider following program segment:

I<sub>1</sub>: Add R0, R1, R2

I<sub>2</sub>: MUL R3, R0, R4

I<sub>3</sub>: Add R4, R5, R6

I<sub>4</sub>: Add R3, R7, R8

# Cosmos(Sajal)



Date: \_\_\_\_\_  
Page No. \_\_\_\_\_

The processor executes the inst<sup>n</sup> in in-order-execution i.e.  $I_1 \rightarrow I_2 \rightarrow I_3 \rightarrow I_4$ .

$I_2$  is data dependent on  $I_1$ ,  $(I_3 \& I_4)$  are independent instn.

★ Scheduling causes the out-of-order execution

$\rightarrow I_2$  undergoes waiting till the completion of  $I_1$ ,

$\therefore I_3 \& I_4$  are also sharing the stall cycles even though they are independent.

|       | IF | ID | ALU | Mem | NB  |
|-------|----|----|-----|-----|-----|
| $I_1$ | IF | ID | -   | -   | ALU |
| $I_2$ | IF | ID | -   | -   | ALU |
| $I_3$ | IF | -  | -   | -   | ID  |
| $I_4$ | -  | -  | -   | -   | IF  |

To avoid above problem, schedule the inst<sup>n</sup>,

Scheduling causes out-of-order execution or reorder execution, i.e.  $I_1 - I_3 - I_4 - I_2$ .

Out-of-order execution creates two more dependencies in the pipeline, Anti Dependency & Output Dependency.

as  $I_3$  is executed before  $I_2$ ,  $I_3$  writes result in  $r_4$  &  $I_2$  have  $r_4$  as i/p, then  $I_2$  will read

★ ★ ○ Consider the program segment where inst<sup>n</sup> j follows inst<sup>n</sup> i in the program order.

Anti-Dependency exists when the inst<sup>n</sup> j tries to modify the registers before inst<sup>n</sup> i reads it.  
e.g.

inst<sup>n</sup>  $I_3$  modifies reg.  $r_4$  before  $I_2$  reads it.

○/p dependency :- It exists when inst<sup>n</sup> j tries to write the reg. before inst<sup>n</sup> i writes it, i.e. the reg. is loaded with old value.

# Cosmos(Sajal)

e.g. inst<sup>n</sup> I<sub>4</sub> writes the reg. R<sub>3</sub> before inst<sup>n</sup> I<sub>2</sub> writes it.

|           |      |
|-----------|------|
| Date:     | KMIS |
| Page No.: |      |

2012  
GATE

Due to the register conflict, the Anti & O/p dependencies are present in the pipeline, reg. renaming concept is used to avoid the anti & o/p dependency, i.e. use the temporary storage (reorder buffers) to store the o/p of the I<sub>3</sub> & I<sub>4</sub>.

After receiving the exception signal from the inst<sup>n</sup> I<sub>2</sub>, the reorder buffer values are updated into the reg. file, i.e. data is safe.

## Hazards :-

Hazard is created when the dependency exists in the pipeline. Hazard is a delay. Delay creates the extra cycles. Extra cycles are known as stalls.

Hazards are classified into 3 types based on the order of read & write op<sup>n</sup>, i.e. :-

- ① RAW (Read after Write)
- ② WAR (Write after Read)
- ③ WAW (Write after Write)



When j performs these op<sup>n</sup> before i performs these op<sup>n</sup>, there is risk of hazard.

- RAW :- This hazard is created when the inst<sup>n</sup> j tries to read the data before inst<sup>n</sup> i writes it. (True Data Dependency).
- WAR :- This hazard is created when the inst<sup>n</sup> j writes the data before inst<sup>n</sup> i reads it. (Anti-Dependency).
- WAW :- This hazard is created when the inst<sup>n</sup> j writes the data before inst<sup>n</sup> i writes it. (O/p dependency).

## • Performance Evaluation of Pipeline with Stalls

- ① Speed-up (S) =  $\frac{\text{Avg. instr. execution time of non-pipeline}}{\text{Avg. instr. execution time of pipeline}}$ .

# Cosmos(Sajal)



Date: \_\_\_\_\_  
Page No. \_\_\_\_\_

②  $S = \text{Avg. CPI}_{\text{Non-pipe}} * \text{Cycle Time}_{\text{Non-pipe}}$

$$\text{Avg. CPI}_{\text{pipe}} * \text{Cycle Time}_{\text{pipe}}$$

③ The ideal CPI of pipeline processor is always almost 1, so  $\boxed{\text{avg. CPI}_{\text{pipe}} = \text{ideal CPI} + \text{no. of stall cycles/instr.}}$

$$\therefore \boxed{\text{avg. CPI}_{\text{pipe}} = 1 + \text{no. of stall cycles/instr.}}$$

under this:-

$$S = \frac{\text{Avg. CPI}_{\text{Non-pipe}} * \text{cycle-time}_{\text{Non-pipe}}}{(1 + \text{no. of stall cycles/instr.}) * (\text{cycle-time}_{\text{pipe}})}$$

Q Assume that pipelining & non-pipelining circuits are perfectly balanced, i.e. no skew time / setup time overhead, pipeline & non-pipeline cycle times are equal.

$$\boxed{S = \frac{\text{Avg. Instn ET}_{\text{Non-pipe}}}{(1 + \text{no. of stall cycles/instr.})}}$$

When all the instn are taking same no. of cycles, then 1 instn execution time = the no. of stages in the pipeline.

Under this condition:-

$$\boxed{S = \frac{(\text{no. of stages in pipeline})}{1 + \text{no. of stall cycles/instr.}}}$$

If there is no stalls, the processor is operating with 100% efficiency,

$$\boxed{S = \text{Pipeline Depth} = k.}$$

Chp 4  
②

$$\text{Ans } \frac{n}{(k+n-1)t_p} = \frac{1}{t_p} = \frac{1}{t_{\text{cycle time}}}$$

Avg Execution Time =  $\frac{1 + \text{no. of stages/instr.}}{\text{stalls/instr.}} * \text{cycle time}$   
 $= (1 + 4 * 0.2) * 10 \text{ ns} = 18 \text{ ns}$

# Cosmos(Sajal)

Date

05.05.12

V.V.Imp.

Date: \_\_\_\_\_  
Page No. \_\_\_\_\_

## Memory Organisation

- Cache Memory :- Acc. to the memory hierarchy, the system supported memory standards are defined below:-

| ① Level :           | 1                                | 2                                           | 3         | 4                    |
|---------------------|----------------------------------|---------------------------------------------|-----------|----------------------|
| ② Name              | Register                         | Cache                                       | Main mem. | Sec Mem.             |
| ③ Typical size      | <1KB                             | <16MB                                       | <16 GB    | >100GB               |
| ④ Access Time(ns)   | 0.25-0.5                         | 0.5-25                                      | 80-250    | 50,00,000            |
| ⑤ Bandwidth(MB/sec) | 20k-100k                         | 5k-10k                                      | 1k-5k     | 20-150               |
| ⑥ Managed by        | Compiler                         | H/W                                         | OS        | OS                   |
| ⑦ Backed by         | Cache<br>(Taking data from mem.) | Main Mem.<br>(Data is taken from Main mem.) | Sec. Mem. | CD                   |
| ⑧ Implementation:   | Multiport port                   | Onchip/offchip SRAM                         | DRAM      | Magnetic (capacitor) |

- Acc. to the above standards, the memory hierarchy is designed to minimize the speed gap b/w the CPU & Memory.
- The accessing order of the hierarchy mem. system is



- If CPU requires data, then it checks data in registers, if it is present in reg., then event is a hit, otherwise it is a miss.
- If the data is not present in reg., then it checks in the cache memory
- The data transfer b/w Cache Mem. & Main mem. is controlled by Computer Organisation.

# Cosmos(Sajal)

KM

|           |  |
|-----------|--|
| Date:     |  |
| Page No.: |  |

- When CPU access the memory, it checks the availability of data in registers, if data is available, op<sup>n</sup> is hit, then the req. data is transferred to ALU, otherwise op<sup>n</sup> is a miss. When the op<sup>n</sup> is miss in the register, then the next reference is given to the cache memory.
- If op<sup>n</sup> is hit in cache, the data is transferred to the CPU in the form of words. If it is a miss in cache, the reference is forwarded to main memory.
- If op<sup>n</sup> is hit in main memory, then the req. data is transferred to the cache in the form of blocks & to CPU in form of words.
- Miss in main memory, the reference is forwarded to Sec. memory. The op<sup>n</sup> is always hit in sec. mem., i.e. the corresponding data is transferred to main mem. in the form of pages, transferred to cache mem. (via main mem.) in the form of blocks & to CPU (via Cache mem.) in the form of words.

\* Locality Of Reference :- The CPU performs read & write op<sup>n</sup> only on the cache, the cache mem. maintains the image of main mem. This property is called as the principle of locality of reference.

→ Based on the type of accessing, the mem. organisation is divided into 2 types :-

① Simultaneous / Independent Mem. orgn.  
② Hierarchical Mem. orgn.

→ Independent Mem :- In this orgn, the CPU directly communicates with all the levels of

# Cosmos(Sajal)

Date: \_\_\_\_\_  
Page No.: \_\_\_\_\_

mem. directly, still it follows accessing order, from level 1 to level n



If data is not present in  $L_1$ , then reference is forwarded to  $L_2$ , & if it is a hit in  $L_2$ , then data is transferred directly to CPU, w/o transfer of data from  $L_2$  to  $L_1$ .

- Here  $H_1, H_2, \dots, H_n$  indicates the hit ratio of corresponding levels of mem.  $\text{Hit ratio} = \frac{\text{no. of hits in mem}}{\text{total no. of accesses to that mem.}}$
- Hit ratio + Miss ratio = 1
- $T_1, T_2, \dots, T_n$  are the accessing times of the respec. levels of mem.
- In this orgn, there is no communication established b/w the levels of memory.
- If there is a miss in  $L_1$ , data is transferred from  $L_2$  to CPU without involvement of  $L_1$ .
- If there is a miss in  $L_2$ , data is transferred from  $L_3$  to CPU w/o involvement of  $L_2 \& L_1$ .

The avg. access time of memory is

$$\begin{aligned} \text{independent/} \\ \text{Simultaneous} \\ \text{accessing time.} \end{aligned} \quad T_{avg} = H_1 * T_1 + H_2 * (1-H_1)H_2 * T_2 + (1-H_1)(1-H_2)H_3 T_3 + \dots + (1-H_1)(1-H_2)\dots(1-H_{n-1})H_n T_n$$

where  $H_n = 1$

- $T_{avg}$  = time req. to access one word from the memory.

1 word  $\rightarrow T_{avg}$

efficiency of mem. system =  $\frac{1}{T_{avg}}$  words/sec, no. of words accessed in 1 sec.

# Cosmos(Sajal)

(1)

Date \_\_\_\_\_  
Page No. \_\_\_\_\_

- Hierarchical Mem. Orgn :- In this orgn, the CPU can perform read or write opn only on the level 1 memory. If the opn is miss in level 1, take the data from the higher levels into L<sub>1</sub>, later the CPU can perform the read/write opn on the L<sub>1</sub> memory, i.e.



Where H<sub>n</sub>=1.

Hierarchical Mem. System :- Avg (for hierarchical mem. system) =

$$H_1 * T_1 + (1-H_1)H_2(T_1+T_2) + (1-H_1)(1-H_2)H_3(T_2+T_3+T_1) + \dots + H_{n-1}(1-H_1)(1-H_2)\dots(1-H_{n-1})H_n(T_n+T_{n-1}+\dots+T_1).$$

$$\text{if } T_{avg} = H_1 * T_1 + (1-H_1)H_2(T_1+T_2)$$

$$= H_1 T_1 + H_2 T_1 + H_2 T_2 - H_1 H_2 T_1 - H_1 H_2 T_2$$

$$= H_1 T_1 + T_1 + T_2 - H_1 T_1 - H_1 T_2 \quad [H_2 = 1]$$

$$[T_{avg} = T_1 + (1-H_1)T_2]$$

Q1. In a 2-level memory, the Level 1 mem. is 5 times faster than level 2 & its access time is 10ns less than T<sub>avg</sub>, let L<sub>1</sub> access time be 20 ns, what is the hit ratio? (no hierarchy word, independent).

Ans.  $30 = H_1 * 20 + (1-H_1)H_2 * 100$

$$30 = H_1 20 + (1-H_1) \times 100$$

$$30 = 20H_1 + 100 - 100H_1$$

$$30 + 70 = 780H_1$$

$$H_1 = 7/8 \approx 0.9$$

★  $S = \frac{\text{Performance of } L_1}{\text{Performance of } L_2} = \frac{1/\text{Access time of } L_1}{1/\text{Access time of } L_2} = \frac{T_2}{T_1}$

★ If the question doesn't say that mem. is hierarchical, ∴ in that case, take it as independent.

# Cosmos(Sajal)

The complete block must be transferred which comprises

Date: \_\_\_\_\_  
Page No. \_\_\_\_\_

Q2. 3 level memory has the following specifications:-

| Level | Access time /word | Block size (words) | Hit ratio                  |
|-------|-------------------|--------------------|----------------------------|
| 1     | 20                | -                  | 0.8 $T_1 = 20 \text{ ns}$  |
| 2     | 50                | 2                  | 0.9 $T_2 = 100 \text{ ns}$ |
| 3     | 100               | 4                  | 1 $T_3 = 400 \text{ ns}$   |

If the referred block is not available in  $L_1$ , transfer it from  $L_2$  to  $L_1$ , if not available in  $L_2$ , transfer it from  $L_3$  to  $L_2$  &  $L_2$  to  $L_1$ , what is  $T_{avg}$  to access the block.

Ans.

$$\begin{aligned} T_{avg} &= 0.8 \times 20 + 0.2 \times 0.9 \times (120) + 0.2 \times 0.1 \times (520) \\ &= 16 + 24 \times 0.9 + 10.4 \\ &= 16 + 10.4 + 21.6 \\ &= 48 \text{ ns} \end{aligned}$$

- ★ We have to access the block, it takes ~~50 ns~~ to access 1 word from level 2, so it will take 100 ns to access a block in  $L_2$  (which is of 2 words length).

## • Cache Memory :-

Cache Mem. acts as intermediate mem. b/w CPU & main memory. Therefore, CPU performs read & write op<sup>n</sup> only on the Cache data.



- ① Mem. Orgn      ③ Replacement Algo.  
② Mapping techniques      ④ Uptaking Techniques.  
⑤ Multilevel Caches

# Cosmos(Sajal)

★ 6a

Date \_\_\_\_\_  
Page No. \_\_\_\_\_

## Memory Orgn:-

The data is transferred from the main memory to Cache memory in the form of blocks, i.e. the main mem. & Cache mem. both are organised into blocks.

## Cache Mem. Orgn:-

The Cache Mem. is divided into equal parts based on the block size. Each part is called as Cache block or Cache line, i.e.  $\frac{\text{Cache Mem. Size}}{\text{Block size}}$ .

Consider 8 byte Cache mem. & 2 byte block size,  
no. of Cache blocks =  $\frac{8}{2} = 4$

## Before Orgn

8-block Cache

8 byte Cache



## After Orgn

$$\# \text{ of cache blocks} = \frac{8}{2} = 4$$

|    |   |    |
|----|---|----|
| 00 | 0 | 2B |
| 01 | 0 | 2B |
| 10 | 0 | 2B |
| 11 | 0 | 2B |

Line Offset

Word offset

Each block contains 2 bytes

★ No change in capacity but the internal orgn changes (internal structure changes)



{ 3 bit Cache address is interpreted as 2 bit line offset & 1 bit word offset.

# Cosmos(Sajal)

Date: \_\_\_\_\_  
Page No. \_\_\_\_\_

## \* Before Main Mem. Orgn:

The main mem. is also divided into equal parts based on the block size. Each part is called as main mem. block. The main mem. block no. is identified by "tag". No. of main mem. blocks =  $\frac{\text{main mem. size}}{\text{block size}}$

- consider 16 byte main mem. & 2 byte block size, the no. of main mem. blocks =  $\frac{16}{2} = 8$ .

## Before Orgn

16 byte  
main mem.



## After Orgn

no. of main mem. blocks =  $\frac{16}{2} = 8$

$$8 * 2 \text{ byte} = 16 \text{ B.} = 8^2$$

|     |   |    |
|-----|---|----|
| 000 | 0 | 2B |
| 001 | 0 | 2B |
| 010 | 0 | 2B |
| 011 | 0 | 2B |

111  
tag  
word offset

used to  
identify  
main mem.  
block no.

word address  
in the block.

## Physical Address

|     |             |
|-----|-------------|
| tag | word offset |
|-----|-------------|

indicates  
main mem.  
block no.

indicates  
coord address  
in block

$$\begin{aligned} &\log_2(\text{no. of main mem. blocks}) \\ &= \log_2(8) \\ &= 3 \text{ bit} \end{aligned}$$

$$\begin{aligned} &\log_2(\text{Block size}) \\ &= \log_2 2 \\ &= 1 \text{ bit} \end{aligned}$$

# Cosmos(Sajal)

Date: \_\_\_\_\_  
Page No.: \_\_\_\_\_

- \* In this orgn, there is no change in the capacity, but internal structure changes. i.e. the main mem. controller interprets the physical address as tag field & word offset field. This format is described earlier.



- \* Cache controller :



- Q6. Consider the cache mem. size is 64 KB & block size is 64 bytes, if CPU generates 20 bit physical address, how many blocks are present in the cache mem & main mem?

Ans. Blocks in Cache = 1024

Blocks in main mem.

Mem size of main mem. =  $2^{20} = 1 \text{ MB}$

Blocks in main mem. =  $\frac{2^{20} \text{ Bytes}}{2^6 \text{ Bytes}} = 2^{14}$

# Cosmos(Sajal)

Date: \_\_\_\_\_  
Page No. \_\_\_\_\_

- ★ The 20 bit physical address interpretation at Cache controller is :-



↳ these tags are stored in the tag mem. in the Cache controller.

- ★ 20 bit address interpretation at main mem.:-



## Mapping techniques:-

The process of transferring the data from main mem. to Cache mem. is called Mapping. There are 3 mapping techniques used:-

- ① Associative Mapping
- ② Direct Mapping
- ③ Set associative Mapping

- **Associative Mapping**: In this mapping technique, no mapping function is used, i.e. anyone of the main mem. block can be placed into anyone of the cache mem. lines, i.e. there is cache addressing is not used.
- The main mem. address format is used at the Cache controller.

# Cosmos(Sajal)

(1)

Kathy  
Date: \_\_\_\_\_  
Page No. \_\_\_\_\_

- The complete data along with tag is transferred from main mem. to cache mem.



## \* The address

- Consider the program segment

I<sub>1</sub>: MOV R0,[0000]

I<sub>2</sub>: MOV R1,[1000]

I<sub>3</sub>: ADD R0,R1

I<sub>1</sub>:- CPU generates the Memory Address

0000

↓

Associative cache known format is Main mem. format.

tag | wo  
3 bit 1 bit



1mb

The tag field is compared with the tag stored in cache memory controller

# Cosmos(Sajal)

Date: \_\_\_\_\_  
Page No. \_\_\_\_\_

I<sub>2</sub>: CPU generates Mem. Addr.



★ ① During execution of the program, whenever CPU identifies the mem. ref. instn, it generates the mem. address & transfer to the cache memory.

② The Associative Cache interprets the mem. addr. into its known format, i.e.

tag | word

- ③ The CPU generated tag is compared with the existing tag in the cache controller.
- ④ If anyone of the existing tags matches with CPU generated tag, then it is hit, i.e. based on the

# Cosmos(Sajal)

①

|           |
|-----------|
| KM        |
| Date:     |
| Page No.: |

word offset, the resp. word is transferred to the dest<sup>n</sup>.

- ⑤ If none of the existing tags matches with the CPU generated tag, opn is miss, so the ref. is forwarded to the main memory.
- ⑥ Acc. to the main mem. address format, the corresponding block maps into anyone of the cache line, later word is transferred to the CPU.
- ⑦ Tag mem. size = no. of Cache lines \* no. of tag bit at each line.

⑧ Data mem. size = no. of Cache lines \* no. of bits in one block

$$\text{Tag mem. size} = 4 * 3 = 12 \text{ bits}$$

$$\text{Data mem. size} = 4 * 16 = 64 \text{ bits}$$

$$\text{Cache size} = \text{Tag mem. size} + \text{Data mem. size}$$

$$\therefore \text{Cache size} = 12 + 64 = 76 \text{ bits}$$

## • Direct Mapping:-

- ① In this mapping technique, mapping function is used to transfer the data from the main mem. to cache mem.
- ② ∵ there is a binding that takes b/w the main mem. block no. to cache mem. line no. So, cache addressing is used.
- ③ The mapping function is  $K \text{ MOD } N = i$ .  
K :- main mem. block no.      Main mem. block no.  
N :- no. of cache lines      no. of cache lines  
i :- Cache mem. line no.      line no. in Cache

# Cosmos(Sajal)

K24

Date: \_\_\_\_\_  
Page No. \_\_\_\_\_

e.g.

$$\begin{array}{ll} 0 \bmod 4 = 0 & 4 \bmod 4 = 0 \\ 1 \bmod 4 = 1 & 5 \bmod 4 = 1 \\ 2 \bmod 4 = 2 & 6 \bmod 4 = 2 \\ 3 \bmod 4 = 3 & 7 \bmod 4 = 3 \end{array}$$

- ① The above mapping function shows the relationship b/w the main mem. block no. to cache mem. line no., so cache mem. addressing is used.
- ② The address format at the Cache controller is,



② this means that 2 blocks of main mem. blocks can be stored at one line in cache mem., in our example block 0 & block 4 can be stored in line 0, but not at the same time, i.e. only 1 block from main mem. can be stored in 1 line.

Direct



which makes the tag field at cache of 3 bits. The "0" extra bit is stored in cache controller.

In this case, we get 0 after doing  $0 \bmod 4$ , So the B0 from main mem. is stored in 00 (binary for 0) line no., now the tag field is 2 bit (00), the extra bit '0' is also stored\*

# Cosmos(Sajal)

①

Date: \_\_\_\_\_  
Page No. \_\_\_\_\_

\* During mapping, complete data is transferred from main mem. to corresponding cache line along with the tag.

During this process, some of the tag bits maps into the cache mem. address logic & remaining bits are stored in the cache controller.

→ Consider program segment:-

I<sub>1</sub>: MOV R0,[0000]

I<sub>2</sub>: MOV R1,[1000]

I<sub>3</sub>: ADD R0,R1

I<sub>1</sub>: CPU generates mem. address

↓  
[1000]

Direct Cache known format



I<sub>2</sub>: CPU generates mem. address.

↓  
[1000]

Direct Cache known format



# Cosmos(Sajal)

- ① Compulsory Miss
- ② Capacity Miss
- ③ Conflict Miss

X =  $\lceil \frac{X}{T} \rceil$   
X =  $\lceil \frac{X}{T} \rceil$   
X =  $\lceil \frac{X}{T} \rceil$

So, the address is forwarded to Main mem.

1000  
↓  
known format

Tag | offset  
3 bit | 1 bit

100 | 0

000  
001  
⋮  
111



So, the block B4 is transferred to the 0th line no. of Cache mem.  
[0 → decimal  
00 → binary]  
but 00 contains B0 block. So it is overwritten by B4 in the cache mem. by B4.

- ① CPU generated address is transferred to the direct cache.
- ② The direct cache interprets the reg. based on its known format.
- ③ The line offset is directly connected to the cache memory address logic, so the corresponding cache line is enabled w.r.t. line offset.
- ④ The existing tag of the enabled line is compared with CPU generated tag.
- ⑤ When both matches, op<sup>n</sup> is hit. Then data is transferred to CPU based on the word offset.
- ⑥ When both doesn't match, op<sup>n</sup> is miss, so map the respective block from main mem. to Cache memory based on the tag, later the data is transferred to the CPU.

\* Tag mem. size = no. of Cache \* no. of bits in each line.

# Cosmos(Sajal)

①

Kitty  
Date \_\_\_\_\_  
Page No. \_\_\_\_\_

The disadvantage in the direct mapping is only 1 block is possible in each Cache Line at a time, ∴ Conflict Miss increases.

Program:

- I<sub>1</sub> MOV R0,[00000]
- I<sub>2</sub> MOV R1,[10000]
- I<sub>3</sub> MOV R2,[0001]
- I<sub>4</sub> MOV R3,[0001]
- I<sub>5</sub> Add R0,R1
- I<sub>6</sub> Add R2,R3



★ I<sub>5</sub> & I<sub>6</sub> do not have to do anything with cache as it is just reg. to reg. opn.

mem. is not involved.

- (i) During I<sub>1</sub>, there will be a hit.
- (ii) During I<sub>2</sub>, the 10000 Mod 4 = 0, but 0 line no. contains B0 & not B4, so there will be a miss & B4 need to be transferred to line no. 0, now line no. 0 contains B4.
- (iii) During I<sub>3</sub>, 0 Mod 4 = 0, but when tag field is compared, it doesn't match, ∴ it will be a miss & B0 will be transferred from main mem. to cache.
- (iv) During I<sub>4</sub>, 4 Mod 4 = 0, but when tag field is compared, it doesn't match & there will be a miss & B4 will be transferred from main mem. to cache.

★ To avoid the above problem, there is a need of alternate cache organisation which is used to hold multiple blocks in the same line at the same time. This alternate approach is called as Set-Associative Cache Organization.

## Set Associative Mapping:-

In this mapping, each Cache line can hold the multiple blocks, i.e. the Cache memory is organised into no. of sets based on the no. of blocks present on the same line.

$$\text{No. of sets} = \frac{N}{P}, \text{ where } N \text{ indicates no. of cache lines}$$

of cache lines, P indicates no. of blocks in the same line.

# Cosmos(Sajal)

Date: \_\_\_\_\_  
Page No. \_\_\_\_\_

- Consider 8 byte Cache & 2 byte block using 2-way set associative Cache orgn. The no. of sets is:-

$$\text{no. of Cache lines} = \frac{8}{2} = 4$$

$$\therefore \text{no. of sets} = \frac{4}{2} = 2.$$



8 byte cache

|     |   |
|-----|---|
| 000 | B |
| 001 | B |
| 010 |   |
| 011 |   |
| 111 | B |



|    |    |
|----|----|
| 00 | 2B |
| 01 | 2B |
| 10 | 2B |
| 11 | 2B |

$$\# \text{ of lines} = \text{CM size}$$

$$\begin{aligned} \text{Block size} \\ = 8 \\ \therefore 2 \end{aligned}$$

|   |    |    |
|---|----|----|
| 0 | 2B | 2B |
| 1 | 2B | 2B |

$$\# \text{ of sets} = \frac{4}{2} = 2$$

Line word offset

Set word offset

$$\begin{aligned} \log_2 (\# \text{ of sets}) \\ = \log_2 2 \\ = 1 \end{aligned} \quad \begin{aligned} \log_2 (\text{Block size}) \\ = \log_2 4 \\ = 2 \end{aligned}$$

[as one set contains 2 blocks, ...]

Size of one set = 4 bytes

- ★ Too many sets  $\rightarrow$  fully associative cache memory.
- ★ To little sets  $\rightarrow$  direct Cache memory.  
 $\therefore$  2 way & 4 way set associative cache gives the best result in the Cache design.

Mapping function:- In this technique, mapping fn is used to transfer data from main mem. to cache mem., i.e. there is a binding provided with the main mem. block no. to cache mem. line no.

- ② The mapping fn. is  $k \bmod S = i$ , where  $k$ : main memory block no.,  $S$ : no. of sets in cache,  $i$ : cache memory set no.

# Cosmos(Sajal)

①

$$\begin{array}{ll}
 0 \bmod 2 = 0 & 4 \bmod 2 = 0 \\
 1 \bmod 2 = 1 & 5 \bmod 2 = 1 \\
 2 \bmod 2 = 0 & 6 \bmod 2 = 0 \\
 3 \bmod 2 = 1 & 7 \bmod 2 = 1
 \end{array}$$

The above mapping fn. shows the relationship b/w main memory block no. & cache mem. set no., i.e. cache mem. addressing is used.

- The physical address can be interpreted at set associative cache controller is :-

2 bit  
 (it tells  
 that how  
 many main  
 main mem.  
 blocks are  
 mapping into  
 same line =  $2^2 = 4$ )  
 4 blocks in  
 same line.



These bits are stored in Cache controller.

| Main Memory |                 |
|-------------|-----------------|
| B0          | $0 \bmod 2 = 0$ |
| [000]       | $= 0$           |

  

| Cache |                                               |
|-------|-----------------------------------------------|
| 0     | → this is the set no. where B0 will be stored |
| 1     | → this is the set no. where B7 will be stored |

1 → this is the set no. where B7 will be stored.

During this mapping process, the complete data along with tag is transferred to cache mem. Some of the tag fields maps into cache mem. Address logic & remaining in Cache controller.

e.g. I<sub>1</sub>: MOV R0[0000]  
 I<sub>2</sub>: MOV RI,[1 000]  
 I<sub>3</sub>: Add R0,R1

I<sub>1</sub>: CPU generates the Mem. address  
 $\boxed{10000}$   
 2-way associative cache known form

The MUX O/P's 00 in one cycle & x in next cycle. This is compared with CPU generated tag (00).



# Cosmos(Sajal)

T<sub>2</sub>

CPU generates Mem. Address

1000

2-way Set - Associative cache known format

tag | so | wo



10 | 0 | 0



miss  
the MUX will output 00 in 1st cycle & xx in next cycle  
& these outputs will be compared with CPU generated tag '10', which doesn't matches, hence it will be a miss.

then, the address is forwarded to main mem.

1000

Main mem. known format

tag | wo

3      1



100 | 0



$$\begin{aligned} &\text{K MODS} \\ &= 4 \text{ MOD } 2 \\ &= 0 \end{aligned}$$

it will go into 0th set



this tag field is added when B4 is placed into 0th set.

- ① The CPU generated mem. req. is initially referred in the cache mem.
- ② Acc to 2-way set-associative address format, the cache controller interprets the req. i.e.
- ③ The Set offset field is directly connected to the addr logic of cache mem., the respec. Cache line is enabled based on the set offset value.
- ④ The enabled set contains 2 possible tags, ∴ MUX is req.

# Cosmos(Sajal)

3

to compare these tags one after another to the CPU generated tag. If anyone is matching op<sup>n</sup>'s hit, the resp. data is transferred to the dest<sup>n</sup> based on the word offset value.

- on the word offset value.

⑤ When none of the tags are matching, op<sup>n</sup> is miss, then map the respec. block into the respec. set using the mapping function. Later the data is transferred to the CPU.

The tag mem. size = no. of cache lines \* no. of blocks in each set (or no. of sets) \* no. of bits in one tag.

If some more bits are there, then we will add it in tag field.

## Types Of Cache Misses

These are 3 types of Cache Misses possible:-

- ① Compulsory Miss if first reference miss/cold-start :-

The very first reference to the Cache is itself a miss, called as compulsory miss.

e.g. when Cache is initially empty

- ## ② Capacity Miss:-

Because of the limited capacity of Cache, during the execution of the program, some of the cache blocks are replaced & later retrieved.

- ### ③ Conflict Miss/Collision /Inference Miss :-

Too many blocks when mapped on the same line is known as conflict miss.

e.g. Direct mapping & set associative mapping creates conflict miss.

## Types Of Locality :-

Based on the way of accessing reusable data, the locality of reference is divided into 2 types:-

- ① Temporal Locality, → same word in same block can be referenced by CPU in near future.

② Spatial Locality, → adjacent word in same block can be referenced by CPU in near future.

③ Sequential " .

# Cosmos(Sajal)

Temporal Locality means the same word in same block can be referenced by CPU in the near future.  
e.g. I<sub>1</sub>: MOV R0,[0000]

I<sub>2</sub>:

I<sub>3</sub>:

I<sub>4</sub>: MOV R0,[0000]

I<sub>5</sub>:

I<sub>6</sub>:

I<sub>7</sub>: MOV R0,[0000]

Spatial locality means the adjacent words in the same block can be referenced by CPU in near future.  
e.g. I<sub>1</sub>: MOV R0,[0000]

I<sub>3</sub>: MOV R0,[1000]

I<sub>5</sub>: MOV R0,[0000]

## Replacement Algorithm:-

When the cache is full, then replacement algorithm are req. to replace the Cache block with new blocks.  
There are 2 replacement algo used:-

① FIFO (First in First Out):

In the FIFO, replace the Cache block which is having longest time stamp.

② LRU (Least Recently Used)

In LRU, replace the Cache block which is having less no of references along with the longest time stamp.

- Quantitative principle to improve the performance  
Quantitative principle states that improve the performance within the cost & price limit by making the Common Case fast. This principle is called as Amdhal's Law.



Amdhal's law focus on performance gain after the enhancement.

# Cosmos(Sajal)

(5)

$$S = \frac{\frac{1}{ET_{new}}}{\frac{1}{ET_{old}}} = \frac{ET_{old}}{ET_{new}} - ①$$

$$ET_{new} = ET_{unenhanced \ portion} + ET_{enhanced \ portion}$$

$$\quad \quad \quad ET_{UO} \qquad \qquad ET_{FO}$$

To calculate the ET-new, there is a need of two factors, i.e.

- ① fraction enhance (F)
- ② Speedup enhance (S)

- F indicates how much portion of system undergoes enhancement
- Speedup enhance indicates how many times the enhanced portion is running faster than old portion.

$$S = \frac{\text{performance of new portion}}{\text{performance of old portion}} =$$

$$= \frac{1/\text{Exec^n time of new portion}}{1/\text{Exec^n time of old portion}}$$

$$S = \frac{(F)ET(\text{Old})}{(F)ET(\text{New})} \quad \begin{matrix} \downarrow \text{of frequent case only.} \\ \downarrow \text{of } ET_{FO} \end{matrix} \quad \begin{matrix} \text{as it is the only unit that undergoes enhancement.} \\ \text{Floating Point Unit} \end{matrix}$$

$$ET_{new} = ET_{unenhanced \ portion} + ET_{enhanced \ portion}$$

$$\quad \quad \quad \underbrace{ET_{old}}_{\text{complete system}} = (1-F)ET_{old} + \frac{ET_{old}(F)}{S} \quad \begin{matrix} \rightarrow \text{old system} \\ \rightarrow \text{unenhanced (complete)} \end{matrix}$$

$$\text{Overall} = \frac{ET_{old}}{ET_{old}(F) + ET_{old}(1-F)} \quad \begin{matrix} \rightarrow \text{unenhanced portion} \\ (\text{no change in its execution time}) \end{matrix}$$

$$\text{Overall} = \frac{1}{\frac{F}{S} + (1-F)} \quad \begin{matrix} \rightarrow \text{enhanced portion} \\ (\text{decrease of execution time in the enhanced portion.}) \end{matrix}$$

Let  $ET_{old} = 1$

$\therefore ET_{new} < 1$        $ET_{old} :- \text{of complete system}$

$\therefore \text{Overall} > 1$

\* If system contains multiple frequent cases:-

$$\text{Overall} = \left[ \left( 1 - \sum_i F_i \right) + \sum_i \frac{F_i}{S_i} \right]^{-1}, \quad i :- \text{no. of frequent cases}$$

# Cosmos(Sajal)

## A/ System

Q. A system consists 2 functional units, i.e. Floating point & Integer which is used in mathematical model simulation. The FP unit is enhanced (it is running 2 times faster) but actually 10% inst<sup>n</sup> are floating point, what is the performance gain.

Ans.  $F = 0.1 \quad 2 = \frac{ET_{old}(F)}{ET_{new}(F)}$   
 $1-F = 0.9$

#  $S = 2$

$$ET_{new \text{ overall}} = ET_{unenhanced} + ET_{enhanced}$$

(complete system)  $= 0.9 \times ET_{old} + \frac{ET_{old}(F)}{2} = 0.9 \times ET_{old} + \frac{0.1 \times ET_{old}}{2} = 0.9$

$$S_{\text{overall}} = \left[ \left( 1 - F \right) + \frac{F}{S} \right]^{-1} = \left[ 0.9 + \frac{0.1}{2} \right]^{-1} = 1.05$$

\* When rare case undergoes enhancement, there is no considerable gain.

Q. 3 enhancements are proposed for a new architecture with following speedups:-

$$S_1 = 30$$

$$S_2 = 20$$

$$S_3 = 15$$

If enhancement 1 & 2, each usable of 25% of time, how much % of the time enhancement 3 will be used when the  $S_{\text{overall}} = 10$ .

$$\text{Ans. } 10 = \left[ 1 - (50+x) + \frac{25}{30} + \frac{25}{20} + \frac{x}{15} \right]^{-1}$$

$$10 = \left[ 1 - (50+x) + \frac{25+37.5+20x}{300} \right]^{-1}$$

$$\begin{array}{c|ccccc} 15 & 30, 20, 15 \\ \hline 2 & 2, 20, 1 \\ \hline & & & & & \end{array} \quad = \quad \frac{1}{0.37}$$

$$10 = \left[ 1 - (50+x) + \frac{62.5+20x}{300} \right]^{-1}$$

$$10 = \left[ 1 - \right]$$

$$\begin{array}{l} 0.8 \times 4 + \\ 0.2 \times 10^6 \\ @ 3.2 \times \\ 2.00000 \end{array} \quad \begin{array}{l} ET_{old} \\ ET_{new} \\ ET_{mm} \\ 0.7 \times ET_{old} \\ 0.7 \times ET_{mm} \\ 0.3 \times ET_{mm} \\ 10 \\ 0.07 \end{array}$$

# Cosmos(Sajal)

(7)

## Chapter-5

Ans3. Cache Size = 8 kB

Block Size = 32 B

$$\text{no. of blocks in cache} = \frac{8 \times 2^{10} \times 8}{2^2} = 256$$



$$\text{no. of blocks in main mem.} = 2^{32}$$

$$0 \bmod 256 = 0$$

$$32 \quad 2288$$

$$256$$

8 bits

24 bits + 2 bits

$$\begin{array}{r} 33 \\ 256 \times 26 \\ \hline 1536 \\ 512 \times \\ \hline 0056 \end{array}$$

$(q+2) = 2L \times 256 = 5376 \text{ bits}$

$$Ans1. S=10 \quad F=0.7$$

Amdhal's Law

$$\text{Overall} = \left[ (1-F) + \frac{F}{S} \right]^{-1}$$

$$= \left[ (1-0.7) + \frac{0.7}{10} \right]^{-1} = 2.7$$

System with ~~cache~~ cache runs 2.7 times faster than sys. w/o cache.

Ans2.

|   |      |
|---|------|
| 0 | 28   |
| 1 |      |
| 2 |      |
| 3 | X 11 |
| 4 | 4    |
| 5 |      |
| 6 | 14   |
| 7 |      |
| 8 |      |
| 9 |      |

no. of Cache lines = 8

$$1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 = 8$$

$$7 \times 8$$

$$= 56$$

|   |    |
|---|----|
| 0 | 40 |
| 1 |    |
| 2 |    |
| 3 |    |
| 4 |    |
| 5 |    |
| 6 |    |
| 7 |    |
| 8 |    |
| 9 |    |

$$\frac{\text{no. of sets}}{\text{P-way}} = \frac{8}{2} = 4$$

$$0 \bmod 2$$

$$0 \bmod 4 = 0$$

$$14 \bmod 4 = 2$$

$$0 + 1 + 1 + 1 + 1 = 5$$

$$= 7 \times 10 = 70$$

Ans3.

Ans6. no. of Cache lines =  $2^9$

block size =  $2^4$  Bytes

mem. size =  $2^{29}$  Bytes



$$16 \times 2^9$$

$$0.8 \times 4 \text{ ns} + 0.2 \times (1 \text{ msec} + 4 \text{ ns})$$

$$\begin{array}{l} 5 \\ \log_2 29 = 5 \\ \downarrow \\ = 5 \end{array}$$

$$\begin{array}{l} 4 \\ \log_2 24 = 4 \\ \downarrow \\ = 4 \end{array}$$

$$\begin{array}{l} 4 \\ \log_2 16 = 4 \\ \downarrow \\ = 4 \end{array}$$

# Cosmos(Sajal)

Q.

Consider a 4-block Cache, with following main memory block references.

4, 5, 7, 12, 4, 5, 13, 4, 5, 7

assume Cache is initially empty, identify hit ratios by using

(a) FIFO

(b) LRU

(c) Direct Mapped Cache (Associative)

(d) 2-Way Set Associative with LRU.

Ans.

Total = 10

(a) ~~0.15~~ or 3

$$H = \frac{2}{10} = 0.2$$



(b) keep all recent entries on the top



$$H = 0.4$$

(c)

|   |    |    |   |
|---|----|----|---|
| 0 | 4  | 12 | 4 |
| 1 | -5 | 13 | 5 |
| 2 |    |    |   |
| 3 | 7  |    |   |

$$\textcircled{1} + \textcircled{1} + \textcircled{1}$$

$$H = 0.3$$

$$\begin{aligned} 4 \bmod 4 &= 0 \\ 5 \bmod 4 &= 1 \\ 7 \bmod 4 &= 3 \\ 12 \bmod 4 &= 0 \\ 4 \bmod 4 &= 0 \\ 13 \bmod 4 &= 1 \\ 4 \bmod 4 &= 0 \\ 5 \bmod 4 &= 1 \\ 7 \bmod 4 &= \end{aligned}$$

(d)

|   |        |
|---|--------|
| 4 | 12     |
| 5 | 7 13 7 |

~~$4 \bmod 4 = 0$~~

$$4 \bmod 2 = 0$$

$$5 \bmod 2 = 1$$

$$7 \bmod 2 = 1$$

$$12 \bmod 2 = 0$$

$$13 \bmod 2 = 1$$

$$\textcircled{1} + \textcircled{1} + \textcircled{1} + \textcircled{1}$$

$$H = 0.4$$



# Cosmos(Sajal)

(4)

Ans 1.

$$\frac{80 \times 4}{10} = 32 \text{ ns}$$

$$T_{avg}(\text{old}) = 80 \text{ ns}$$

$$T_{avg}(\text{new}) = 112 \text{ ns}$$

$$S = \frac{T_{MM}}{T_{CM}} = 9$$

$$T_{MM} = 9 \times T_{CM}$$

$$80 = 112 - H_1 \times T_{CM} + (1-H_1) \times 1 \times 9 \times T_{CM}$$

$$80 = H_1 \times 0.7 \times X + 0.3 \times 1 \times 9 \times X$$

$$80 = 0.7X + 2.7X$$

$$80 = 8 \times X$$

$$X = 21.62 \text{ ns}$$

$$112 = H_1 \times 21.62 + (1-H_1) \times 1 \times 9 \times 21.62$$

$$112 = 21.62X + (1-X) \times 194.58$$

$$82.58 = \frac{21.62}{172.94} - 172.94X$$

$$X = \frac{82.58}{172.94}$$

$$\begin{array}{r} 15 \\ 21.62 \\ \hline 172.9 \\ 194.58 \\ \hline 21.62 \\ \hline 172.94 \end{array}$$

$$\begin{array}{r} 15 \\ 21.62 \\ \hline 172.9 \\ 194.58 \\ \hline 21.62 \\ \hline 172.94 \end{array}$$

$$\begin{array}{r} 15 \\ 21.62 \\ \hline 172.9 \\ 194.58 \\ \hline 21.62 \\ \hline 172.94 \end{array}$$

$$\begin{array}{r} 15 \\ 21.62 \\ \hline 172.9 \\ 194.58 \\ \hline 21.62 \\ \hline 172.94 \end{array}$$

$$\begin{array}{r} 15 \\ 21.62 \\ \hline 172.9 \\ 194.58 \\ \hline 21.62 \\ \hline 172.94 \end{array}$$

$$\begin{array}{r} 15 \\ 21.62 \\ \hline 172.9 \\ 194.58 \\ \hline 21.62 \\ \hline 172.94 \end{array}$$

Ans 8.



①+②+④

Ans 10.



$$\frac{2^{25} \times 2^{10}}{2^5} = 2^{10}$$

Ans 11.

Block size = 64 byte      One time accessing it generates 48 bytes, 1 time

Ans 12.



$$\text{no. of Set lines} = \frac{32 \times 7}{32 \times 2} = 2^9$$

Ans 11. 1 time accessing it generates = 48 bytes  
1 time accessing time =  $\frac{K}{2} \text{ ns} + 80 \text{ ns}$   
 $= 92 \text{ ns}$

Because of 64 byte cache line, there is a need of 2nd time accessing.

To transfer  $\frac{64}{2} \text{ B} = 32 \text{ B}$  from MM to Cache Mem, it req.

# Cosmos(Sajal)

Date  
06.05.12

## Updating Techniques



Direct :-



$\frac{32\text{KB}}{32\text{B}} = 1024$   
Block size = 32B  
no of sets =  $\frac{32}{2} = 16$

I<sub>1</sub>: MOV r0, #23; r0=23  
I<sub>2</sub>: MOV r1, [0000]; r1=23  
I<sub>3</sub>: Add r0, r1; r0+r1=r1, r0=46  
I<sub>4</sub>: MOV [0000], r0; CM[0000]=46

I<sub>2</sub> :- is a hit opn, so the content will be available from cache.  
I<sub>4</sub> :- is a write hit opn, so the data is transferred or written to cache.

Cache Memory -      Main Memory -  
[0000]=46                  [0000]=23

When the same address contains diff. values at diff. places (e.g. Main mem. & Cache mem.) then it is called 'coherence'.

① The same address contains two diff. values at 2 diff. places is called "coherence".

② The coherence causes the loss of data.

I<sub>5</sub>: MOV r0, [1000]: Cache mem. block 0[B0] is replaced by block B4,  
I<sub>6</sub>: -                      ↓  
I<sub>7</sub>: -                      updated block → updated data is lost

I<sub>8</sub>: MOV r1, [0000]; r1=23 (old value)  
During the execution of I<sub>5</sub>, the block 4 is transferred from the main mem. to cache mem. line 0. The cache mem. line 0 contains the updated block.

③ Due to the conflict miss, the updated block is replaced with new block. Updated data is lost. After sometime if CPU wanted to e.g.

During exec<sup>n</sup> of I<sub>8</sub> CPU want to read the updated value present in the block 0, because of replacement, the updated data is lost, the old data is reflected.

To avoid the above problem there is a need of updating techniques. There are 2 updating techniques used, named as :-

- ① Write through
- ② Write back

Write through: ① The CPU performs read & write opn only on the Cache. When the CPU performs read opn, check the availability of the block in the Cache. If it is available opn is read hit, then CPU reads the data from the Cache.

② If block is not available in Cache, opn is read miss, there is a need of read allocate, i.e. transfer the respect. block from main mem. to Cache mem. for the read opn.

# Cosmos(Sajal)

- Write Op<sup>n</sup>: When CPU performs write op<sup>n</sup>, it checks the availability of block in Cache mem.
- ② If the block is available, op<sup>n</sup> is write hit, so the CPU performs the simultaneous write op<sup>n</sup> in the Cache mem. & main mem.
  - ③ If block is not available, op<sup>n</sup> is write miss, ∴ there is a need of write allocate i.e. transfer the respective block from main mem. to Cache Mem. for the write op<sup>n</sup>. Later, the CPU performs simultaneous updation in Cache & Main memory.

Inclusion Property :-

Inclusion means the low level memory data is always subset of the high level memory data, i.e. low level mem. must be a write through memory.

Inclusion Property is not applicable for write back.



The avg. access time of read :-

$$T_{avg(\text{read})} = H_R T_C + (1-H_R)[T_M + T_C]$$

↓      ↓      ↓      ↓  
read hit    read data    read allocate    read data (from cache)

$$T_{avg(\text{write})} = H_W T_W + (1-H_W)(T_M + T_W)$$

↓      ↓      ↓      ↓  
write hit    because of simultaneous write op<sup>n</sup>    write miss    write allocate

↑      ↓      ↓      ↓  
read data (from cache)    simultaneous write op<sup>n</sup> (max. time for write op<sup>n</sup> in each mem.)

e.g.  
 $T_{access\ of\ L_1} = 10\text{ns}$   
 $T_{access\ of\ L_2} = 20\text{ns}$   
 $T_{WC} = 20\text{ns}$

The avg. access time of write through mem. when considering both read & write op<sup>n</sup> is :-

$$T_{avg(WT)} = F_R T_{avg(\text{read})} + F_W T_{avg(\text{write})}$$

↓      ↓  
freq. of read op<sup>n</sup>    freq. of write op<sup>n</sup>

The efficiency of the write through is :-

$$\frac{1 \text{ words/sec}}{T_{avg(WT)}}$$

# Cosmos(Sajal)

Note: ① In the write through mechanism, there is no coherence at all.

② Write through mechanism is also applicable in independent mem. organisation.

$$\text{?? } T_{\text{avg}}(\text{write}) = H_w T_n \quad H_w = 1 \text{ (for independent)}$$

\*\* If  $H_w$  is not given, then  $H_w = 1$ , then it is independent.  
If  $H_w$  is given, then  $H_w$  = given value, it is hierarchical.

If write through mechanism consists, Hit ration for write op<sup>n</sup> is 1, then it is independent mem.

## Write Back Technique:-

In this technique, the CPU performs the write op<sup>n</sup> only on Cache, which implies coherence is still present, but this coherence is not creating loss of data problem because before replacing any cache line, the updated data is write back to the main mem. ∴ the updated data is safe.

Write Back Cache maintains one extra bit at each Cache line called Update bit, if it is "1" (Dirty bit) it means that the block is updated, otherwise (Clean bit) → the block is not updated.)

Before replacing the cache lines with new blocks, read the % of dirty bits, take those updations into main mem., later replace the block.



When the CPU performs the read op<sup>n</sup>, it checks the availability of blocks in the Cache, if it is available, op<sup>n</sup> is read hit, then CPU directly read the data, otherwise op<sup>n</sup> is read miss, ∴ there is a need of read allocate, which implies transfer the block from main mem. to Cache mem. for read op<sup>n</sup>. So there is a possibility of replacement, ∴ before read allocate, take the updations back

# Cosmos(Sajal)

the main memory. After read allocation, CPU directly read the data from the cache.

The avg access time of read cycle is:-

$$T_{avg\ read} = H_s T_c + (1 - H_s) \left[ \% \text{ of dirty bits} (T_{mt} + T_{mt} + T_c) + \% \text{ of clean bits} (T_m + T_c) \right] + \dots$$

↓                    ↓                    ↓                    ↓                    ↓                    ↓                    ↓  
 read hit      read miss      no. of updat.      write back      read data  
 read data from cache      miss      to be made in main mem.      time      allocate  
 ↓                    ↓                    ↓                    ↓                    ↓                    ↓  
 no updat.      read allocate      read data from cache      (time taken to update blocks in main mem. from cache)  
 read data from cache      (time taken to put the req. block from main mem. to cache)

When the CPU performs the write op<sup>n</sup> check the availability of block in Cache, if it is available, if it is available Op<sup>n</sup> is write hit, then CPU performs the write Op<sup>n</sup> only on Cache (still coherency is present).

Otherwise,  $Op^n$  is write miss,  $\therefore$  there is a need of write  $Op^n$ . Due to write allocate, there is a possibility of replacement, so before write allocate take the updations back into the main mem. The avg. access time of write cycle is :- 
$$\text{avg. access time} = \frac{\text{write miss}}{\text{misses}} + \frac{\text{updations}}{\text{misses}}$$

$$T_{\text{avg write}} = H_w T_c + (1-H_w) [ \% \text{ of dirty bits} (T_M + T_M + T_c) + \% \text{ of clean bits} (T_M + T_c) ]$$

↓  
 write hit      |  
 write data      ↓  
 no update  
 ↓  
 write allocate      |  
 write back      ↓  
 write data      |  
 write allocate

The avg. access time of write back cache when considering both read & write opn is :-

$$T_{avg(CWb)} = F_r T_{avg(read)} + F_w T_{avg(write)}$$

Efficiency :-  $\frac{1}{T_{avg}(wb)}$

# Cosmos(Sajal)

## Multilevel Caches

To reduce the miss penalty, multilevel caches are used in the system design.



- ① Local Missrate = no. of misses in the cache / no. of accesses
- ② Global Missrate = no. of misses in the cache / total no. of CPU generated references.

Avg. access time of memory is :-

$$T_{avg} = \text{Hit Time}_{L_1} + \text{Miss Rate}_{L_1} * \text{Miss Penalty}_{L_1}$$

$$\cdot \text{Miss Penalty}_{L_1} = \text{Hit Time}_{L_2} + \text{Miss Rate}_{L_2} * \text{Miss Penalty}_{L_2} \quad \xrightarrow{\text{local miss rate}}$$

$$\cdot \text{Miss Penalty}_{L_2} = \text{Main Mem. Access Time}$$

The no. of mem. stalls/instr = Misses in  $L_1/\text{instr} * \text{Hit time}_{L_2}$  + Misses in  $L_2/\text{instr} * \text{Miss Penalty}_{L_2}$

## Chapter-5

$$(7) \frac{2500}{64} = 3$$

No. of blocks req. to store the array =  $\frac{2500}{64} = 40$  blocks

Physical Addr.: 16 bit

Starting addr of an array = 1100H



|     | 000000 | 001100000000 |
|-----|--------|--------------|
| tag | 00     | 100000000000 |
| lo  | 00     | 000000000000 |
| hi  | 00     | 000000000000 |
| 0   | B28    | B32          |
| 1   | B29    | B33          |
| 2   | B30    | B34          |
| 3   | B31    | B35          |
| 4   | B0     | B32          |
| 5   | B1     | B33          |
| 6   | B2     | B34          |
| 7   | B3     | B35          |
| 8   | B4     | B36          |
| 9   | B5     | B37          |
| 10  | B6     | B38          |
| 11  | B7     | B39          |
| 12  | miss   | 56 miss      |
| 13  | B9     | 8 miss       |
| 14  |        | 40 miss      |
| 15  |        | 8 miss       |
| 16  |        | 8 miss       |
| 17  |        | 8 miss       |
| 18  |        | 8 miss       |
| 19  |        | 8 miss       |
| 20  |        | 8 miss       |
| 21  |        | 8 miss       |
| 22  |        | 8 miss       |
| 23  |        | 8 miss       |
| 24  |        | 8 miss       |
| 25  |        | 8 miss       |
| 26  |        | 8 miss       |
| 27  |        | 8 miss       |
| 28  |        | 8 miss       |
| 29  |        | 8 miss       |
| 30  |        | 8 miss       |
| 31  |        | 8 miss       |

# Cosmos(Sajal)

Ans

Ans 20. CM size = 64 KB

Block size = 16 B

$$\text{no. of blocks} = \frac{64 \times 2^{10}}{16} = 2^{12}$$

Main mem. size =  $2^{32} \times 4 \text{ KB} = 2^{32} \times 2^{12} \text{ B} = 2^{44} \text{ B}$

page size = 4 KB



No. of blocks =  $2^{12}$

No. of sets =  $\frac{2^{12}}{2} = 2^{11}$

$17 \times 2^4 \text{ bit}$   
 $2^{10} \times 17 \times 4 = 68 \text{ KB}$

$0x\text{ff } 000$   
 $8 \text{ bytes} \times 2^{20}$   
 $= 2^{23} \text{ bytes}$

11

One array element size = 8 B

One block size = 16 B

One block holds 2 array elements

One row req. 512 blocks

Ans 14.

CM size =  $32 \times 2^{10} \text{ B}$

Block size = 128 B

$$\text{no. of blocks} = \frac{2^{15}}{128} = 2^8$$



1 row  $\rightarrow$  512 elements

32 blocks

32 miss opn

512 rows  $\rightarrow$   $32 \times 512 = 2^5 \times 2^9 = 2^{14}$

$$= \frac{128}{2^4} \times \frac{64}{2^4} = 2^{14} \text{ blocks}$$

each block contains  $\rightarrow \frac{128}{8} = 16 \text{ elements}$

$$\frac{512}{16} = 32 \quad \frac{512 \times 512}{16} = \frac{2^{18}}{2^4} = 2^{14} \text{ blocks}$$

|     |                                 |         |
|-----|---------------------------------|---------|
| 0   | $a[0][0] \rightarrow a[0][15]$  | 1 miss  |
| 1   | $a[0][16] \rightarrow a[0][31]$ | 15 hits |
| 2   |                                 |         |
| 31  | $a[0][$                         |         |
| 256 |                                 |         |

$\begin{matrix} 32 \\ + 32 \\ + 32 \\ + 32 \\ = 128 \end{matrix}$

No. of misses = 16384

# Cosmos(Sajal)

|                                    |        |
|------------------------------------|--------|
| $a[0][0]$ along                    | → miss |
| $a[1][0] \rightarrow a[0][15]$     | → miss |
| $a[5][0] \rightarrow a[2][15]$     | → miss |
| $a[255][0] \rightarrow a[255][15]$ |        |

$a[256][0] \rightarrow a[256][15]$

$a[511][0]$

512 miss → To access 1 column

To access 512 columns →  $512 * 512$  miss

$$\frac{2^{14}}{2^{18}} = \frac{1}{16}$$

Ams 24.  $L_1$ : - local =  $\frac{40}{1000} = 0.04$

Global =  $\frac{40}{1000} = 0.04$

$L_2$ : - local =

25. Targ. =  $1 + 0.04 * (10 + 0.5 * 100)$   
 $= 1 + 0.04 * (10 + 50)$   
 $= 1 + 60 * 0.04$   
 $= 1 + 6 * 0.4 = 3.4$

26. Stall = Misses in  $L_1$  / instrn. \* Hit time of  $L_2$  +  
 Misses in  $L_2$  / instrn. \* Miss penalty of  $L_2$

1 instrn req. → 1.5 mem ref.

1000 mem. ref →  $\frac{1}{1.5} \times 1000 = 667$  instrn



Stall =  $40/667 * 10 + 20/667 * 100$   
 $= 3.59$  cycles  $\approx 3.6$  cycles.

27.

64 Bytes

$2^6$  Bytes

$2^{30}$  Bytes → 18

1 byte →  $\frac{1}{2^{30}}$

64 bytes →  $\frac{2^6}{2^{30}} = \frac{1}{2^{23}}$

$$0.08 \times \frac{64 \times 10^{-9}}{10^9} = 64 \text{ ns}$$

$$\text{total time} = (64 + 32) \text{ ns} = 96 \text{ ns}$$

FBFC  
512

+ 1023  
240  
15  
16  
11  
240

12  
+ 240  
240  
8  
9  
256  
16  
4096  
15  
61440

3048  
2560  
512  
512  
X 5  
X 6  
2560 31 768



31 B2F

# Cosmos(Sajal)

Ans 28. ~~16x 4W~~  
~~4x20 = 80ns~~

Q L<sub>1</sub> - Miss  
 Q L<sub>2</sub> - Hit (4W)  
 Q L<sub>1</sub> - Hit  
 Q L<sub>2</sub> - Miss / Hit (4W)  
 Q L<sub>1</sub> - Hit  
 Q L<sub>2</sub> - Hit (4W)  
 Q L<sub>1</sub> - Hit  
 Q L<sub>2</sub> - Hit (4W)  
 L<sub>1</sub> - Hit  
 L<sub>2</sub> - Hit (9W)  
88 ns

L<sub>1</sub> - Miss  
 L<sub>2</sub> - Miss  
 MM - Hit (4W)  
 L<sub>2</sub> - Hit  
 L<sub>1</sub> - Hit (but miss as complete 16 words are not transferred)  
 L<sub>2</sub> - Miss  
 MM - Hit (4W)  
 L<sub>2</sub> - Hit  
 L<sub>1</sub> - Hit (but miss also)  
 L<sub>2</sub> - Miss  
 MM - Hit (4W)  
 L<sub>2</sub> - Hit  
800  
+ 80  
+ 8

Ans 30. set lines =  $\frac{16}{4} = 4$  set lines

|   |     |     |      |     |
|---|-----|-----|------|-----|
| 0 | 048 | 432 | 800  | 216 |
| 1 | 1   | 133 | 129  | 73  |
| 2 |     |     |      |     |
| 3 | 255 | 3   | 1159 | 63  |

Ans 31. ~~x 2<sup>8</sup> x 2<sup>10</sup> = x 2<sup>8</sup> x 2<sup>10</sup>~~  
~~x = 8~~

Ans 32.

|   |     |    |
|---|-----|----|
| 0 | 04  | 16 |
| 1 | 5   | 9  |
| 2 |     |    |
| 3 | 535 | 7  |

① + ① + ① + ①

33.

|   |     |    |
|---|-----|----|
| 0 | 808 | 12 |
| 1 |     |    |

34. no. of blocks



# Cosmos(Sajal)



PP

Ans 37.

$$\begin{aligned} T_{avg} &= 0.9 \times 50 + 0.1 \times [500 + 50] \\ &= 45 + 0.1 \times 550 \\ &= 45 + 55 \\ &= 100 \text{ ns}. \end{aligned}$$

Ans 36. Lock up free cache :- used to hold outstanding miss information, it is also called as miss information status handling register.

Ans 38.

Main mem. size =  $2^{20}$   
line size =  $256 \text{ B}$

Diagram showing memory organization:

|       |                      |       |        |
|-------|----------------------|-------|--------|
| tag   | no. of lines = $2^6$ |       |        |
| 6 bit | 6 bit                | 8 bit |        |
|       |                      |       | $2^8$  |
|       |                      |       | 20 bit |

no. of blocks in main mem. =  $\frac{2^{20}}{2^8} = 2^{12}$

Ans 39.

$$K \bmod S = i$$

V :- no. of sets  
K :- set lines in 1 set

Ans 43.  $T_{avg}$  (write through)

$$\begin{aligned} T_{avg}(\text{read}) &= 0.8 \times 20 + 0.2 \times (200 + 20) \\ &= 16 + 0.2 \times (240) \\ &= 16 + 48 = 64 \text{ ns} \end{aligned}$$

$$\begin{aligned} T_{avg}(\text{write}) &= 0.9 \times 200 + 0.1 \times (200 + 100) \\ &= 18 + 24 = 120 \text{ ns} \end{aligned}$$

$$T_{avg}(\text{WT}) = 64 \times 0.7 + 120,$$

$$T_{avg}(\text{WT}) = 60 \times 0.7 + 120 \times 0.3$$

$$\therefore \text{Efficiency} = \frac{10}{78}$$

Ams A.A.

## Input-Output Differentiation

- ① I-O devices are very slow devices, ∴ they are not directly connected with the CPU.
- ② High speed interface logic is used b/w the CPU & I/O devices.
- ③ High speed interface logic is req. because I/O devices are electromagnetic devices & CPU is electronic device, ∴
  - ① there is a diff. in operating modes.
  - ② there is a diff. in word format.
  - ③ there is a diff. in data transfer rate.

To avoid all the above problems, high speed interface logic is used. All the I/O devices are connected to the CPU through the high-speed interface.



- ① High Speed interface control logic receives the CPU generated request & interpret it. Based on the address & control signal, the high speed interface enables the corresponding device for op", after preparing the data, the device transfers it to the buffer.
- ② When the data is present in buffer, the high speed interface generates the request to the CPU & waits for the acknowledgement.
- ③ After receiving ACK from CPU, the high speed interface transfers the data with high rate, ∴ speed gap is minimized.

## Types Of Interfaces:

 Non-programmable  
(Configuration is pre-defined during the design time)  
e.g. buffer, latch

Programmable  
(Configuration is defined like that in the command word register (CWR))  
in 8255

# Cosmos(Sajal)



18

| Programmable                                        |                                                                                             |
|-----------------------------------------------------|---------------------------------------------------------------------------------------------|
| Serial Interface<br>(Bit-by-bit)<br>e.g. 8251 USART | Parallel Interface<br>(more than 1 bit)<br>e.g. 8255 PPI<br>8259 AIC<br>82372<br>8257 → DMA |

## I/O Transfer Modes :-

There are 3 I/O transfer modes used to transfer the data from I/O devices to CPU/memory:-

- ① Programmed IO
- ② Interrupt Driven IO
- ③ DMA

### Programmed IO :-

- ① In this mode, no high speed interface logic is used, which implies I/O devices are directly connected with CPU, ∴ processor utilization is inefficient.
- ② Processor undergoes waiting until completion of I/O operation. This waiting time depends upon the speed of I/O device.



In the above diagram 10 KB/s is connected to the CPU in the programmed I/O mode, if data transfer is byte level, the CPU requires

$$10 \text{ KB} \rightarrow 1s$$
$$1 \text{ B} \rightarrow \frac{1}{10^4} = 10^{-4} \text{ s} (\approx 100 \mu\text{s})$$

### Interrupt Driven IO :-

- ① In this mode high speed interface logic is used b/w the CPU & basic I/O devices, ∴ processor utilization is good.
- ② This implies that CPU is not directly communicating with I/O, it only communicates with high speed interface, executing time depends on the latency of the high speed interface.

# Cosmos(Sajal)



When the CPU want to read/write 1 byte data from the 10 KB/S device which is connected to the CPU in the interrupt mode, it requires 4  $\mu$ sec.



## DMA(Direct Memory Access) :-

- ① In this mode the bulk amount of data is transferred from the I/O device to main memory through the DMA w/o involvement of the CPU
- ② DMA is a module, all the secondary storage devices are connected through the DMA.
- ③ When the user program capacity is greater than the main memory capacity use the virtual memory concept to increase the address space.
- ④ This implies that store the user program into the secondary storage device.
- ⑤ During execution of user program



# Cosmos(Sajal)



18

- During the execution of user program part 0, the CPU initializes the DMA along with IO address, Mem. address, control signals & count values. Later, the CPU is busy with user program.
- ② Simultaneously, DMA control logic interprets the req. & enables the corresponding IO device for the opn. After preparing the data, the IO device transfers it to the buffer present in the DMA.
- ③ DMA enables the HOLD signal to gain the control of the bus & waits for the acknowledgement.
- ④ After receiving the ACK, it transfers the data to the main memory until the count becomes 0.
- ⑤ After the opn, the DMA reestablish the bus connection to the CPU.
- During DMA opn, CPU may be in 2 states, i.e. busy state & blocked/hold state.
- Until preparation of data, the CPU is busy & until transfer of data, CPU is blocked.
- The preparation time depends on the speed of device (disk) & transfer time depends on the latency of the memory.

→ Suppose, prep<sup>n</sup> time denoted by x  
& transfer " " " y

then % of time CPU is busy =  $\left(\frac{x}{x+y}\right) \times 100$

% of time CPU is blocked =  $\left(\frac{y}{x+y}\right) \times 100$

→ DMA is operating under 3 modes:-

- ① Burst Mode
- ② Cycle Stealing Mode,
- ③ Block Level Mode

In the burst mode, after receiving HLDA signal, DMA transfers the bulk amount of data to the main memory.

In cycle stealing mode, the DMA forcefully suspends the CPU & take the control of the bus & transfers very imp. data to the main memory.

Block level mode is intermediate b/w burst & cycle stealing. In this mode data is transferred block wise.