

# COMPUTER ORGANISATION

1D

What is the functionality of computer?

Computer:- It converts one form of data into another form under the control of program. While execution of program, one form of data is converted into another form. Therefore, functionality of computer is execution of the program.

Program:- It is the sequence of instructions along with data.

Instruction:- It is a binary sequence which is designed inside the processor to perform some task i.e. binary sequence is bind with the operation. [Binary seq  $\leftrightarrow$  operat<sup>n</sup> rel at process design]

Data:- It is a binary sequence which is associated with the value based on the weighing factor.

To execute the program, the computer contain three(3) basic components named as: (1) CPU (2) Memory (3) I/O(External communication)

(Processing) (Storage) - cation

Program is initially stored in memory.

## MEMORY ORGANIZATION

- Memory chip is divided into equal parts called as cells.
- Each cell is able to be identified with unique no. called as address.
- Each cell is able to recognise control signals, i.e. read and write.



Additional CKT for address. Not in memory.

- To access any memory content, there is a need of address along with control signal.
- To execute the

program, ~~memory~~ CPU generates the memory request to access the memory to read the ~~instructions~~. The memory request contains



The no. of address bits required to access the memory is depending on the memory chip configuration.

MEMORY CHIP CONFIGURATION: The memory chips are configured as follows:-

$64K \times 8$ ,  $64K \times 16$ ,  $64K \times 32$  ...

Let us consider  $64K \times 8$  memory chip

$64K \times 8$  capacity of memory chip i.e.

$64K \rightarrow$  # of Memory cells present in the memory chip i.e.  $\rightarrow 2^{16}$  cells  $\rightarrow$  # of address bits required to access 1 cell i.e. 16 bits.

(NOTE :- If the memory chip contains  $2^n$  cells, n-bit address is required to access 1 cell among  $2^n$  cells) Address space/Address range/memory map of the mem chip i.e. 000... (16 times) to 111... (16 times) ... 1  $\rightarrow$

0000H to FFFFH (Based on limited human recognition space is converted into hexadecimal address space. (0000)<sub>16</sub>/H/0x  $\rightarrow$  (FFFF)<sub>16</sub>/H/0x)

$64K \times 8$  → Cell Size.

Based on the cell size, the memory chips are classified into 2 types → byte addressable memory and word addressable memory.

Byte addressable memory :- When the memory cell size is 8 bits, then the corresponding address space is called as byte address.

Word addressable memory :- When the memory cell size is based on the word length of the processor then the corresponding address space is called as word address.

Processor

Byte(B)

Word(W)  
wordsize = wordlength  
of the processor

8 bit μP

8 bit

8 bit

16 bit μP

8 bit

16 bit

32 bit μP

8 bit

32 bit

n-bit μP

8 bit

n bit

fixed irrespective  
of size of μP size

↑  
variable

Byte → Unique meaning i.e. unambiguity

Word → Multiple meanings i.e. ambiguity.

DEFINITE →  $64KB \rightarrow 64K \times B \rightarrow 64K \times 8$  (definite)  
UNDEFINITE →  $64KW \rightarrow 64K \times W \rightarrow 64K \times 2$  (undefined)

NOTE :- When the processor word size is greater than the memory cell size then there is a need of accessing the multiple memory cells to transfer the word from memory to processor. e.g.

16 bit processor :- MOV AX, [0000]

MOV → Mnemonic → User understandable format of the operation → converted to opcode (M/c understandable format of the operation) → indicates type of operation, MOV indicates data transfer operation.

AX → Destination  $\xrightarrow{\text{reg}}$  (Registers are always referred by the names) ~~reg~~ → 16 bits.

Register size is equal to processor size.

[2000] → Source → Memory (Mem. is referred by the addresses) → Byte address. → points to 8 bit data (cell size). But 2 cells are required.  $\therefore \{M[2000], M[2001]\}$  data sent to AX.

Eg → 32 bit processor  
MOV R0, [2000]  $\xrightarrow{\text{oprand}}$  DEST → Register (32bit)  
 $\xrightarrow{\text{SRC}}$  SRC (Memory add)(8bit)  
 $\{M[2000], M[2001], M[2002], M[2003]\}$  data transferred to R0 register.

\* Consider 64KB memory with following contents

| 64KB |     |
|------|-----|
| 0000 |     |
| !    |     |
| 2000 | 18H |
| 2001 | 23H |
| 2002 | 04H |
| 2003 | 16H |
| !    |     |
| !    |     |
| FFFF |     |

AX = M[2000] M[2001]  
If [2000] contains LB

AX → 2318 H

If [2000] contains HB  
AX → 1823 H

R0 = M[2000] M[2001] M[2002] M[2003]

If [2000] contains LB

R0 = 16042318 H

otherwise

R0 = 18230416 H.

During the execution of above instructions, due to the lack of order of data storage, the instructions generates the multiple outputs.

To avoid the above problem, there is a need of the memory address interpretation (Endian-ness) Mechanisms.

ENDIAN-NESS MECHANISMS:- These mechanisms show the order of data storage. They are 2 types:-

(i) little endian

(ii) Big endian

little endian mechanism shows the order of data storage as lower address contains the lower bytes & higher address contains the higher ..

|              |   |   |   |   |   |
|--------------|---|---|---|---|---|
| 5            | 4 | 3 | 2 | 1 | 0 |
| Byte storage |   |   |   |   |   |

0 4 3 2 1 0 Addr

Big endian mechanism shows the order of data storage as lower address contains the higher byte and higher address contains the lower byte.

|              |   |   |   |   |   |
|--------------|---|---|---|---|---|
| 0            | 1 | 2 | 3 | 4 | 5 |
| Byte storage |   |   |   |   |   |

5 4 3 2 1 0 Add

NOTE:- Little endian is a default address interpretation <sup>tech</sup> used in processor design. Theref

Therefore the above instructions produce the outputs as AX 

|      |      |
|------|------|
| HB   | LB   |
| 23   | 18   |
| 2001 | 2000 |

 $\Rightarrow$  2318H.

$R_0 = \boxed{16 | 04 | 23 | 18} \Rightarrow 16042318H.$   
 2003 2003 2001 2000  
 ↑ HB                      ↑ LB

## Memory



## Processor PIN STRUCTURE:-

Processor supports 3 categories of h/w pins to perform the various operations on externally connected components i.e. memory or I/O.

- ① Active low pins: These pins are enabled when the pin is zero. They are denoted as Pinnname; Eg  $\rightarrow \overline{RD}, \overline{WR}, \overline{CS}, \overline{INTA}$  etc.
- ② Active high pins: These pins are enabled when the input is 1. They are denoted as Pinnname. Eg  $\rightarrow ALE, INTR, HOLD, HLDA$  etc. Some component are anode, some cathode  
 $\therefore$  Above type of pins used in processors.
- ③ Time multiplexed pins:- This pin carries the multiple meanings but the meaning is defined by another pin. Eg. In 8085,  $AD_0 - AD_7$ ,  $A \rightarrow$  Address,  $D \rightarrow$  data controlled with  $ALE$ .  $L_0 \rightarrow$  Data is carried by Time multiplexed pins.  $L_1 \rightarrow$  Address.  $AD_8 - AD_{15}$  in 8086.  
 The advt of TM pin is that it reduces the no. of h/w pins.

Buffer :- Buffer is a temporary storage component without locking. Input + output lines are always enabled.



Latch :- Latch is a temporary storage component with locking i.e. I/P + O/P lines are there in the control of locking pins i.e. strobe (STB) and output enable ( $\overline{OE}$ ) pins respectively.



NOTE :- Latch may be configured as the buffer but buffer can never be configured as latch.

Latch is configured as the buffer by connecting STB pin to +5V and  $\overline{OE}$  pin to GND.



Latch as Buffer.

## MEMORY INTERFACING :

To execute the program, there is a need of communication required b/w CPU + memory to read the instruction & data.  
Therefore, interfacing is required b/w the CPU + memory.

done using BUS in practical.



CPU generates the mem. request :  $\frac{20804}{RD}$

M/C cycles:- The no. of clock cycles required to access memory (to read or write) is called machine cycles.

Machine cycles  $\rightarrow \{T_1, T_2, T_3, T_4\}$

ALE  $\rightarrow \frac{1}{T_1}, \frac{0}{T_2}, \frac{0}{T_3}, \frac{0}{T_4}$

$T_1, T_2, T_3, T_4$ .

$\overline{RD} \rightarrow \frac{0}{T_1}, \frac{1}{T_2}, \frac{1}{T_3}, \frac{1}{T_4}$

$\overline{WR} \rightarrow \frac{1}{T_1}, \frac{1}{T_2}, \frac{1}{T_3}, \frac{1}{T_4}$

$\overline{WR} \rightarrow \frac{1}{T_1}, \frac{1}{T_2}, \frac{1}{T_3}, \frac{1}{T_4}$

$2080H \downarrow$

00100000 10000000

↓ ↓ ↓ ↓  
A15-A8 AD7-AD0

$T_1 \Rightarrow AD_0 - AD_7 \rightarrow 80$ ,  $A_8 - A_{15} \rightarrow 20 \leftarrow \text{CPU}$ ,  $ALE \rightarrow 1$   
 &  $AO - A_7 \rightarrow 80$ ,  $A_8 - A_{15} \rightarrow 20 \leftarrow \text{RAM}$ ,  $\text{Latch-80}$ ,  $STB \rightarrow 1$   
 $T_2 \Rightarrow \overline{WR} - 1$  } CPU, RAM     $ALE - 0$ ,  $STB - 0$ .  
 $\overline{RD} - 0$

$T_3, T_4 \rightarrow$  Binary sequence at  $D_0 - D_7$  in RAM  
 to  $AD_0 - AD_7$ . This sequence is not forwarded to Latch. Data is moving  
 between memory and CPU. 2 cycles  
 because memory is slower than CPU.

NOTE: In every m/c cycle, in first time  
 state, ALE pin is in the high/enable  
 state, & remaining time states, ALE is  
 disabled to carry the address & data  
 efficiently over the TM pins.

BUS: Bus is a communication channel  
 used between the major components  
 of the system i.e. CPU, memory and I/O.  

- It is also called as the system bus.
- The key characteristic of the bus is  
 shared transmission media. The limitation  
 of the bus is only one transmission at  
 a time.



The communication channel contains 3 lines to provide communication.

1) Address lines

2) Data lines

3) Control lines

### Address lines

These lines are used to carry the address towards memory & I/O. They are unidirectional.

Based on the no. of address lines, we can determine the capacity of the memory system. Eg. 8085 supports

16 address lines i.e.  $A_{D_0} - A_{D_7}$ ,  $A_{A_8} - A_{A_{15}}$ .

$\Rightarrow 2^{16}$  address cells  $\Rightarrow 64\text{KB}$  (64K cells).

8086 supports 20 address lines i.e.

$A_{D_0} - A_{D_{15}}$ ,  $A_{A_{16}} - A_{A_{19}}$ .  $\Rightarrow 2^{20}$  address cells  $\Rightarrow 1\text{MB}$  cells

$\Rightarrow 1\text{MB}$  storage space.

Data Lines:- These lines are used to carry the binary sequence between CPU, memory & I/O. Therefore, data lines are bidirectional.

Based on the no. of data lines, we can determine the word length of the processor. Based on word length, we can determine the performance of the processor.

Eg: In 8085 processor, 8 data lines are time multiplexed with 8 address lines i.e.

$A_{D_0} - A_{D_7}$ .

Word length  $\Rightarrow 8\text{ bit} \Rightarrow 8\text{ bit}$  operations are performed.

In 8086, 16 data lines are time multiplexed.

with 16 address lines. i.e.  $AD_0 - AD_{15}$ .

Total word length = 16 bit  $\Rightarrow$  16 bit operations are performed.

### Control lines:-

These lines are used to carry the control signals + timing signals. Control signals are used to indicate the type of the operation and timing signals are used to synchronise the memory & I/O operations with processor clock. Control lines are bidirectional.

NOTE:- When there is a common channel, common address space, and common control sequence signals maintained between the memory and I/O, then there is a possibility of ambiguity.



To avoid the above problem, there is a need of bus configuration.

Bus Configuration:- There are 3 bus configuration in system design.

- (i) IB Processor (Top)
- (ii) Isolated processor
- (iii) Memory mapped I/O

(1) IOP (Input Output processor): This configuration uses the separate busses for I/O and memory. Introducing an additional ~~process~~ bus is an expensive process. So this config is used in the high performance processor design.

(2) Isolated I/O: This configuration uses the same bus and common address space but different control sequence signals.



8085 & 8086 uses isolated I/O.

LOAD → LOAD R0, [2000]

STORE → STORE [2000], R1

IN → IN R0, Port Addr

OUT → OUT PortAddr, R1

In 8086, isolated I/O configuration is implemented by using the following H/W pins.

|       | M1         | IO | RD        | WR        |              |
|-------|------------|----|-----------|-----------|--------------|
|       | (dual pin) |    | (unipins) | (unipins) |              |
| MemRD | 1          |    | 0         | 1         | Memory Read  |
| MemWR | 1          |    | 1         | 0         | Memory write |
| IORD  | 0          |    | 0         | 1         | IO read      |
| IOWR  | 0          |    | 1         | 0         | IO write     |

In 8085, isolated I/O configuration is implemented by using the following H/W pins:-

different from 8086

IO/M

RD

WR

1

0

1

IORD

1

1

0

IOWR

0

0

1

memRD

0

1

0

memWR

So, the diagram changes to:



CPU generates  
mem req.

$1000 \rightarrow$  AL's  
MemRD  $\rightarrow$  CL's



### Memory Mapped I/O

This configuration uses the common bus and common control signal but unique address space.

Unique address space is maintained by taking some of the addresses from the pool of the address space, assigned them to I/O ports. That means, port addresses are the part of the memory address space.



Disadvrt  $\rightarrow$  Not complete memory space is not in use. (Not allowed to be used by user received by processor)

The disadvantage in memory mapped I/O is that the complete memory is not in the use.

CPU generates:  $00 \rightarrow AL$ 's  
Mem Reg:  $\overline{RD} \rightarrow CL$ 's

Mem 00 valid } Mem  
controller RD-valid } op<sup>n</sup>  
I/O 00-Invalid } STOP  
controller RD-valid }

## Instruction Cycle:-

This concept describes the execution sequence of an instruction.

Instruction cycle consists of 2 subcycles named as fetch & execution cycle.

The sequence diagram of instruction execution is



## Fetch cycle:-

- The objective of the fetch cycle is to read the instruction related binary sequence from the memory to CPU.
- The CPU generates the memory request based on the program counter to read the instruction from the memory.
- Program counter points to the instruction address only.
- The functionality of PC is to hold the starting instruction address and immediately points to the next instruction address.
- Here the starting address is provided by the

user and the next instruction address is calculated by incrementing the PC with step size.

- Step size depends on the instruction length i.e if the instruction size is  $n$  bytes long, then step size is " $n$ ".
- The process of transfer the binary sequence from the memory to CPU based on PC is called as instruction fetch.
- At the end of instruction fetch, PC is incremented to step size to point the next instruction address.



Q Consider the program segment executed on a hypothetical processor:-

| Address    | Instruction    | Size (Words) |    |
|------------|----------------|--------------|----|
| 1000-1007  | I <sub>1</sub> | 2            | +8 |
| 11008-1011 | I <sub>2</sub> | 1            | -4 |
| 1012-1015  | I <sub>3</sub> | 1            | +4 |
| 1016-1023  | I <sub>4</sub> | 2            | +8 |
| 1024-1027  | I <sub>5</sub> | 1            | +4 |
| 1028-1035  | I <sub>6</sub> | 2            | +8 |
| 1036-1039  | I <sub>7</sub> | 1            | +4 |

Assume the word size is 32 bits. The program has been loaded in the memory with the starting address of 1000 (decimal) onward. During the execution of inst<sup>n</sup> 5. what could be the value present in PC?

ANS → 1028.

(b) Assume that memory is a word addressable with word size of 32 bits. The program has been loaded in memory with the starting address of 1000 (decimal) onwards. During the execution of I<sub>8</sub> what could be the value of PC?

Address:- 1000-1001, 1002, 1003, 1004-5, 6, 7-8, 9.

NOTE:- When the processor is supported with fixed length instructions then step size is a fixed constant. When the processor is supported with variable length instructions, then step size is also variable.

Q. A computer has 24 bit instruction. The program has been loaded in the memory with the starting address of 300 (decimal) onwards. Which one of the following is the valid program counter values? :

- (a) 400
- (b) 500
- (c) 600 ✓
- (d) 700

Execution cycle:-

- The objective of the execution cycle, currently fetched instruction undergoes processing.
- To process the instruction there is a need of identifying the type of operat

associated with the binary sequence.

- Opcode indicates the type of operation
- Opcode is present in the instruction but that is defined by using instruction format.

- Instruction format shows the layout of an instruction that means, it describes the internal structure of the instruction.

### Encoding (Many to 1 mapping)

In the encoding process,  $n$  different combinations can be represented using  $\log_2^n$  bits to indicate only one operation at a time.



NOTE:- Instructions are always represented in the form of encoded binary format. This means when the processor supports 16 different operations then 4 bit opcode is required to indicate 1 operation at a time.

### Decoding (1 to many mapping)

In the decoding process,  $n$  bit decoder generates the  $2^n$  output combinations.



NOTE: During the execution of the instruction decoder is required to identify the operation associated with the encoded binary format.

Instruction format is classified into 5 types based on the type of CPU organization.

The CPU organization is classified into 3 types based on the availability of ALU operands.

### Classification of Instruction set Architecture

- (1) Stack machine
- (2) Accumulator machine
- (3) General register organization

#### Stack Machine

In this organization, both the ALU operands are available in the stack. The same stack is also used as the destination.

The stack contains only one entry point and only 1 exit point i.e. top of stack (tos). So tos is default location. Therefore, No need to pass the tos address explicitly in the instruction.

- The instruction format contains only 1 field i.e. opcode
- The above instruction format is known as zero instruction format.



## OPCODE

Type of op<sup>n</sup>

ALU op<sup>n</sup>: src1 src2 Dest  
TOS TOS TOS.

POP → POP → operate → push.

Eg. ADD.

Sequence of operations → Pop, pop, +, push

NOTE:- In the stack machine ALU operations are the zero address instructions but data transfer operations are not zero. <sup>addr.</sup> instruction

Q. Consider the following program segment used to execute on a stack machine. Each arithmetic operation pops the 2<sup>nd</sup> operand, pops the 1<sup>st</sup> operand, operates on them & push back the result onto the stack

|        |        |
|--------|--------|
| push b | push y |
| push x | ADD    |
| ADD    | push c |
| pop c  | SUB    |
| push c | POP z  |

Which of the following statements is/are true?

1. If each push and pop instruction requires 5 bytes of storage and each arithmetic instruction requires 1 byte of storage, then the whole program requires 40 bytes of storage.
2. At the <sup>end of</sup> execution, z contains the same value as y.
3. At the end of execution stack is empty.
4. (Consider stack is initially empty)

- a) 1 only      (c) 2 & 3 only ✓  
 b) 2 only      (d) 1, 2, 3.

$$\text{Total storage} = 7 \times 5 + 3 \times 1 = 38 \text{ bytes.}$$



(2D)

## (2) Accumulator Machine

In this organization, the first ALU operand is default available in Accumulator. The same accumulator is used as the destination. The CPU contains only 1 accumulator register so it becomes the default location.

The 2nd ALU operand may be present in the memory or register. So, memory address or register name is required in the instruction to access the operand 2. Therefore, the instruction format contains 2 fields i.e.  $\boxed{\text{opcode}} | \boxed{\text{Addr}}$

The above format is known as 1 address inst' format.



$\boxed{\text{opcode}} | \boxed{\text{Addr}}$

Type of operation

Data Transfer operat'

Source  
Acc  
Mem

Dest  
Mem (write) STORE  
Acc (Read) LOAD

ALU OP:

SRC1  
Acc

SRC2  
Mem/Reg

DSTN  
Acc

Eg1: LOAD [5000]



$$ACC \leftarrow M[5000]$$

Eg2: STORE [6000]



Eg3: ADD Ro.



Eg4: MUL [6000]



Q. Consider a hypothetical processor. If it supports both 1 addr. & 0 addi instructions. It has 6 bit instruction & 4 bit addresses. If there exists 2 one address inst's, how many 0 address instructions can be formulated?

NOTE: 1 Address m/c also supported by 0 address instructions if there is a free space after allocating the 1 address inst. This technique is called as expand - opcode technique.

Step 1: Identify the highest instruction format according to the problem statement.

S2: Identify the no. of possible operations.

(Instructions)

S3: Identify the no. of free combinations (free abcodes) after allocating the high order instructions.

S4: Calculate the no. of low order Ins<sup>n</sup> by multiplying the decoded value of the Addr field with the free opcode.



(4)

Expansion  $\Rightarrow$



Q. Consider a hypothetical processor. It supports both 1 addr + 0 addr inst. A 16 bit instruction placed in the 256W memory. If there exists 1001 address instructions, How many 10 address & instructions are possible?

$$156 \times 2^8$$



Q. Consider a processor which has 128W memory. It supports both 1 addr & 0 addr instructions. A 12 bit instruction is placed in 1 memory word. What is the possible range of 1 addr & 0 addr inst supported by the processor.



$$N = 2^5 = 32$$

share the 32 op's. B/w the 1 Addr & zero add.



Range of 1 address = { $1$  to  $31$ }

$$\text{Range of 0 address} = \{1 \times 2^{4+4+3} \times 2^1\}$$

- 3) General Register Organization:- Based on the no. of registers supported by the processor, the architecture is divided into 2 types:-

- (a) Register to memory reference architecture
- (b) Reg-Reg reference architecture.

### Register to memory reference Architecture

In this architecture, processor supports less no. of registers. That means, register file size is small.

In this organization, the first ALU operand is always present in the registers, the same registers also used as the destination. same registers available

The second ALU operator is available either in memory or in registers. So this architecture supports an instruction format with 3 different fields → 

|               |            |            |
|---------------|------------|------------|
| <u>opcode</u> | <u>Rd1</u> | <u>Rd2</u> |
|---------------|------------|------------|

 The above instruction is called as 2 address instruction format.



| opcode | Addr1 | Addr2 |
|--------|-------|-------|
|--------|-------|-------|

Type of operation

Data transfer operation

~~src dest~~



ALU opn:



Eg 5:- Add R<sub>0</sub> / 6000

$$R_0 \leftarrow R_0 + M[6000]$$

Eg 1:- MOV R<sub>0</sub>, R<sub>1</sub>

| MOV | R <sub>0</sub> /R <sub>1</sub> |
|-----|--------------------------------|
| opn | dest exc                       |

$$R_0 \leftarrow R_1$$

Eg 2:- mov R<sub>0</sub>[4000]

$$R_0 \leftarrow M[4000]$$

Eg 3:- mov [6000], R<sub>0</sub>

$$M[6000] \leftarrow R_0$$

Eg 4:- Add R<sub>0</sub>, R<sub>1</sub>

$$R_0 \leftarrow R_0 + R_1$$

Q. Consider a processor that supports 2 addr, 1 addi & 0 add. A 16 bit instruction is placed in 128W memory. If there exists 2-2 address instructions, 100-1 address instructions, how many instructions for 0 can be formulated?

| opcode | Addr1 | Addr2 |
|--------|-------|-------|
|--------|-------|-------|

← 2 → ← 7 → ← 7 →

← → → → → → → →

$$N = 2^2 = 4,$$

No of free opcodes after allocating the 2 addi instructions =  $4 - 2 = 2$ .

$$\begin{aligned} \# \text{ of } 1 \text{ addi instructions} &= \frac{\text{Op}}{\text{Op} \text{ add1 add2}} \\ &= 2 \times 2^7 = 256 \end{aligned}$$

$$\begin{aligned} \# \text{ of free opcodes after allocating the 1 addi} \\ \text{instruction} &= 256 - 10 = 156 \\ \therefore \text{No. of } 0 \text{ add } \frac{\text{Op}}{\text{Op add1 add2}} \text{ instructions} &= 156 \times 2^7 \end{aligned}$$

Q. A computer has an instruction format with 3 fields i.e., opcode, register field which indicates 1 of the 60 processor register and memory address. A 32 bit instruction is placed in the 256 KW memory.

(a) How many no. of operations are possible.

(b) If there exists ~~n~~ 2-address instructions which uses both register and memory reference, then how many 1 address memory reference instructions are possible.  
 (register merging with opcodes)

(c) If there exists  $n$ -2 address instructions which uses both register & memory reference,  $m-1$  address register references then how many zero address instructions are possible.

Ans:-



$$N = 256 \rightarrow \# \text{ free opcodes after allocating 2addr. instructions} = (256-n)$$

$$\# \text{ of 1 address mem ref} \rightarrow (256-n) \times 2^6 \rightarrow \text{Reg merged.}$$



$$\# \text{ of 1 addr reg ref} = (256-n) \times 2^{18}$$

$$\# \text{ of free opcodes after allocating 1addr mem ref Inst} = [(256-n) \times 2^6] - m$$

$$\# \text{ of free opcodes after allocating 1 add reg ref Inst} = [(256-n) \times 2^{18}] - M$$

$$\# \text{ of free opcodes after 0-address} = ((256-n) \times 2^6) - m + 1$$

$$\text{add} \rightarrow ((256-n) \times 2^{18}) - M + 2$$

Register to register Reference architecture  
In this architecture, the processor supports more no. of registers i.e. register file size is large. No possibility of register overwriting.

In this architecture ALU operations are performed only on registers. i.e. Both the ALU operands are available in register. Due to the more no. of registers present in this processor, use a separate register to store the output of the ALU operations.

Therefore ALU operations require 3 address fields in the instruction. All the address fields contains register names.

In this architecture, 2 address instruction format is used to implement the data transfer operation. LOAD and STORE instr are used to transfer the data b/w the registers and memory.



The 3 instr format is [opcode] [Addr1] [Addr2] [ALU]  
Opcode → Type of operation → ALU operation

Addr1 → Dst<sup>n</sup> → Reg

Addr2 → SRC1 → Reg

Addr3 → SRC2 → Reg

Eg: ADD 90, 94, 92

$$90 \leftarrow 94 + 92.$$

#### FOUR ADDRESS INSTRUCTION FORMAT

There is an instruction format which contains 4 address fields along with the opcode.

| Opcode                      | Addr1                       | Addr2 | Addr3 | Addr4                       |
|-----------------------------|-----------------------------|-------|-------|-----------------------------|
| Type of operat <sup>n</sup> | <del>DEST<sup>n</sup></del> | SRC1  | SRC2  | NXT inst <sup>n</sup> addr. |

∴ Addr4 is already implemented using Program counter. ∴ 4-address instruction format is imaginary & Not yet implemented. PC is mandatory register in processor design. It is used to hold the next Inst<sup>n</sup> address. So, above concept is not in use.

NOTE :- 8085 μP is an example of accumulator machine.

① 8086 μP is an example of the reg to mem reference architecture.

② RISC μP is an example of the reg to reg reference architecture.

Q. Consider the following expression  
 $x = (A + B) * (C + D)$  used to execute on a stack machine, accumulator m/c & reg to mem & reg to reg machine. Here A, B, C, D and X are ~~not~~ present in the memory locations. How many no. of instructions are required to execute the expression on the respective m/cs.

LOAD A       $Acc \leftarrow M[A]$        $X = (A+B)*(C+D)$   
 ADD B       $Acc \leftarrow Acc + M[B]$   
 STORE P       $M[P] \leftarrow Acc$   
 LOAD C       $Acc \leftarrow M[C]$   
 ADD D       $Acc \leftarrow [C] + [D]$   
 MUL P       $Acc \leftarrow (A+B)*(C+D)$   
 STORE X       $M[X] \leftarrow (A+B)*(C+D)$

$\left. \begin{array}{l} \\ \\ \\ \\ \\ \end{array} \right\}$  Accumulator  
M/C.

PUSH A       $[A]$   
 PUSH B       $\boxed{B}$   
 ADD       $\boxed{A+B}$   
 PUSH C       $\boxed{C}$   
 PUSH D       $\boxed{D}$   
 ADD       $\boxed{C+D}$   
 MUL       $\boxed{(A+B)*(C+D)}$   
 POP X       $M[X] \leftarrow (A+B)*(C+D)$

$\left. \begin{array}{l} \\ \\ \\ \\ \\ \end{array} \right\}$  Stack  
Machine

MOV R0, A       $R_0 \leftarrow M[A]$   
 ADD R0, B       $R_0 \leftarrow R_0 + M[B]$   
 MOV R1, C       $R_1 \leftarrow M[C]$   
 ADD R1, D       $R_1 \leftarrow R_1 + M[D]$   
 MUL R0, R1       $R_0 \leftarrow R_0 * R_1$   
 MOV X, R0       $M[X] \leftarrow R_0$

$\left. \begin{array}{l} \\ \\ \\ \\ \\ \end{array} \right\}$  Reg to  
Mem.

LOAD R0, A       $R_0 \leftarrow M[A]$   
 LOAD R1, B       $R_1 \leftarrow M[B]$   
 ADD R2, R0, R1       $R_2 \leftarrow R_0 + R_1$   
 LOAD R0, C       $R_0 \leftarrow M[C]$   
 LOAD R1, D       $R_1 \leftarrow M[D]$   
~~ADD R3, R0, R1~~       $R_3 \leftarrow R_0 + R_1$   
~~ADD R4, R2, R3~~       $R_4 \leftarrow R_2 * R_3$   
 MUL R4, R2, R3       $M[X] \leftarrow R_4$   
 STORE X, R4

$\left. \begin{array}{l} \\ \\ \\ \\ \\ \end{array} \right\}$  Reg to  
Reg

NOTE:- To analyse the execution sequence of the program, let us consider an example CPU & its compatible instruction format i.e. accumulator m/c as an exq. CPU. It supports 8-bit instructions with 8 different operations. So instr format is 1-address format.



### EXECUTION SEQUENCE OF AN INSTR

Instr fetch  
 $t = \text{starting addr of program}$   
 $\downarrow$   
 $\text{PC} \rightarrow \text{CPU generates the mem Req}$   
 $\text{PC} \leftarrow \text{PC + stepsize}$

Design  
 Time  
 Binding  
 $D_0 = +$   
 $D_1 = *$   
 $D_2 = \text{AND}$   
 ROM



### operand fetch:-

Reg :-  $\text{IR}[\text{Addr}] \xrightarrow{\text{Reg Name}} \text{ALU}$



Mem:-

$\text{IR}[\text{Addr}] \xrightarrow{\text{mem Addr}} \text{CPU generates the mem Req}$



### Process Data:-



Based on the PC, the instr related binary seq is transferred from memory to CPU. This process is called as instruction fetch. At the end of instr fetch, PC is incremented by step size to point to the next instr addr.

- All the fetched instr is placed into the instr register to decode because instr format is predefined in this register (IR).
- The process of identifying the type of the combination associated with the opcode present in the instr is called as instr decode.
- Based on the type of CPU, the data related binary sequence is transferred from either register or memory to ALU. This process is called as operand fetch.
- The process of converting one form of data into another form based on the enabled operation is called as process data.

After completion of 1<sup>st</sup> instr the CPU fetch the next instr from the memory based on the program counter.

### FLAG / PSW

Acc = 89H  
R0 → 82H

10001001

10000010

100001011 ← Acc.

↓ Extra, lost

To store the exception conditions, there exists extra storage provided in processor called as flag register / PSW (Program status word).

Flag - Flag is a flip-flop. Flip-flop is a bistate device. It is used to store the one bit information.

Flags are classified into 2 types:

- Conditional flag.
- Control flags

Conditional flag :-

These flags are set or reset based on the result nature of ALU. There are 6 conditional flags used in processor:-

- (i) Carry (ii) Parity (iii) Auxiliary carry (iv) Zero
- (v) Sign flag (vi) Overflow flag.

Carry :- It is a conditional flag set on "Is there an extra bit out of MSB".

$$\begin{cases} T = \text{set} = 1 = C \\ F = \text{reset} = 0 = NC \end{cases}$$

Carry flag is used to indicate the range exceeding conditions of the unsigned arithmetic operations.

In the unsigned representation, we can represent only the positive no. only.

n-bit unsigned range is 0 to  $2^n - 1$

Eg. 4-bit unsigned range is 0 to 15.

$$\begin{array}{r} 7 = 0111 \\ 3 = 0011 \\ \hline 10 & 1010 \\ \hline CY-0 & CY-0 \\ \text{bimal} & \end{array}$$

$$\begin{array}{r} g = 1001 \\ 8 = 1000 \\ \hline 17 & \boxed{10001} \\ \hline CY-1 & CY-1 \\ \text{*ACC} & \end{array}$$

Parity Flag :- "Is the ALU output (Acc content) contain even no. of 1's".

$\rightarrow \begin{cases} T = \text{set} = 1 = PE(\text{even}) \\ F = \text{reset} = 0 = Podd \end{cases}$

This flag is used in the serial communication to prepare the error detection codes.

Auxiliary carry :- "Is there an extra bit from the lower nibble to higher nibble (3<sup>rd</sup> position to 4<sup>th</sup> position)"

$\rightarrow \begin{cases} T = \text{Set} = 1 = AC \\ F = \text{Reset} = 0 = NA \end{cases}$

Used in BCD arithmetics to indicate the range exceeding conditions.

Zero :- "Is the ALU output (Acc) zero"

$\rightarrow \begin{cases} T = \text{set} = 1 = Z \\ F = \text{reset} = 0 = NZ \end{cases}$

Control Structure

zero flag is used to implement the control structures

- (i) Selection
- (ii) Iteration
- (iii) Subprogram

Sign :- "Is the MSB of ALU output (Acc) one"

$\rightarrow \begin{cases} T = \text{set} = 1 = NG \text{ (Negative logic)} \\ F = \text{reset} = 0 = PG \text{ (Positive logic)} \end{cases}$

Overflow :- "There is a carry into the MSB & no carry out of the MSB or vice versa."

$\rightarrow \begin{cases} T = \text{set} = 1 = OV \\ F = \text{reset} = 0 = NV \end{cases}$

It is used in the signed arithmetic operations to indicate the range exceeding conditions.

SIGNED NOS :- In the signed format we can represent both positive and negative numbers.

- 2's complement no. representation is used to indicate signed nos. in the processor
- when the MSB bit is 1, then the no. is negative. So take the 2's complement to conclude the value.

When the MSB bit is zero, no. is positive  
So no need to take the complement to conclude the value.

n-bit 2's complement range is  $(-2^{n-1})$  to  $(2^{n-1})$

Eg. 4-bit 2's complement range = {8 to +7}

### 2's comple. Binary

### Equivalent Decimal

|      |    |
|------|----|
| 0000 | +0 |
| 0001 | +1 |
| 0010 | +2 |
| 0011 | +3 |
| 0100 | +4 |
| 0101 | +5 |
| 0110 | +6 |
| 0111 | +7 |
| 1000 | -8 |
| 1001 | -7 |
| 1010 | -6 |
| 1011 | -5 |
| 1100 | -4 |
| 1101 | -3 |
| 1110 | -2 |
| 1111 | -1 |

Eg.

$$\begin{array}{r} +7 \\ +3 \\ \hline +10 \end{array}$$

OV = set

$$\begin{array}{r} 0111 \\ 0011 \\ \hline 1010 \\ \text{OV} \equiv 1 \end{array}$$

$$01010 \Rightarrow (+10)$$

Eg

$$\begin{array}{r} +7 \\ -3 \\ \hline +4 \\ \text{OV} = 0 \end{array}$$

$$\begin{array}{r} 0111 \\ 1101 \\ \hline 0100 \\ \text{OV} = 0 \end{array} \Rightarrow (+4)$$

Eg

$$\begin{array}{r} -7 \\ +3 \\ \hline -4 \\ \text{OV} = 0 \end{array}$$

$$\begin{array}{r} 1100 \\ \text{OV} = 0 \end{array} \Rightarrow (-4)$$

Eg

$$\begin{array}{r} -7 \\ -3 \\ \hline -10 \\ \text{OV} \equiv 1 \end{array}$$

$$\begin{array}{r} 1001 \\ 1101 \\ \hline 10110 \\ \text{OV} \equiv 1 \end{array} \Rightarrow (10110) \Rightarrow (-10)$$

$$OV(x, y, z) = \bar{x}y^2 + xy\bar{z}$$

$x$ - MSB of OP<sub>1</sub>,  $y$ - MSB of OP<sub>2</sub>  $z$ - MSB of Result

condition of overflow set

Q. Consider the following 2 two's complement nos. and perform arithmetic opn. What could be status of the carry, sign, and overflow flags after the execution.

$$\begin{array}{r}
 11001101 \\
 10110011 \\
 \hline
 \boxed{1} 10000000
 \end{array}$$

$$\begin{aligned}
 \text{Carry} &= 1 \\
 \text{Parity} &= 0 \\
 A \cdot C &= 1 \\
 \text{Zero} &= 0 \\
 \text{Sign} &= 1 \\
 \text{Overflow} &= 0
 \end{aligned}$$

CONTROL FLAGS:- Based on the status of these flags, the program execution sequence is changed. This is changed by user. There are 3 control flags used:

- (i) Trap  $\rightarrow 1 \rightarrow$  Trace  $\rightarrow$  Single step execution  
 $\rightarrow 0 \rightarrow$  Go  $\rightarrow$  At a time prog " (-t) " (g)
- (ii) Interrupt  $\rightarrow 1$ : Enable interrupt  
 $\rightarrow 0$ : Disable INTs
- (iii) Direction  $\rightarrow 1$ : Auto decrement (STD=Set Direct flag)  
 $\rightarrow 0$ : Auto increment (CLD=Clear ..)

### ADDRESSING MODES:-

- It shows the way where the required object is present.
- Object may be an inst<sup>r</sup> / data.
- The O/P of the addressing mode is Effective address (EA).

- EA is the actual address where the required object is present  
i.e. object = content of EA = [EA]
- There are different conventions used to indicate different addressing modes i.e.
  - # / I = Indicates Immediate AM
  - Register Name = Register AM
  - [ ] = Direct AM
  - @ / C ) = Indirect AM
  - Indexing register name = Indexed AM

# / I → Processor supports different funct<sup>n</sup>s.  
 There are diff. AM used in computer. To use diff. AM, diff. user conventions. convenience codes are used during creating the program. These are also mentioned.

AM are basically classified into 2 types

- (a) Sequential control flow AM (focussed on data)
- (b) Transfer of control flow AM (focussed on instr)

### SEQUENTIAL CONTROL FLOW AM's -

These ~~actr~~ AM's are focussed on data.

- Data present in computer in 2 possible places i.e. processor register + memory
- ∴ Sequential control flow AM further divided into 2 types:

(i) Register based AM's: These are used only when the data is present in the register ∴ the EA is "register name".

(ii) Memory based AM's :

These are used when the data present in the memory ∴ EA is "Memory Address".

## Register based AM:-

### (i) Register AM:-

- This mode is used to access local variables.
- In this mode data is present in the register, the reg. name is available in the addr field of inst<sup>r</sup>.

∴ EA = Addr field of ~~inst<sup>r</sup>~~ value

↓  
Register Name.

Data = [EA] = [Reg. Name]

↑ Reg. Ref → read/write.



$R_0 \leftarrow R_0 + R_1$   
 1 reg ref    1 reg ref    1 reg ref  
 (write)    (read)    (read)

### (ii) Memory based AM:-

There are diff. types of AM employed under this category.

- (i) Implied / Implicit AM:- In this mode the data information is present in the opcode itself.

OPCODE

Type of opn  
EA not required  
STC Set carry  
 $c=1$

operand info  
CLC Clear carry  
 $c=0$

**[ADD]** STACK M/C.

Here data is part  
of opcode.

POP  
POP  
+  
push.

NOTE:- All the zero address instr use  
the implied addressing mode.

(ii) Immediate:-

- This mode is used to access the constants.
- In this mode the data present in the address field of instr.

i.e. 

|        |         |
|--------|---------|
| OPCODE | Address |
|--------|---------|

↓

Data

Data part of instr.

Eg: 

|     |              |
|-----|--------------|
| ADD | $r_0, \# 23$ |
|-----|--------------|

$r_0 \leftarrow r_0 + 23$



NOTE:- The limitation of immediate AM is  
the range of possible const. are restricted  
based on the size of the addr field . i.e. if  
the addr field size of the instr is n-bit, then  
the possible range of unsigned constants are  
{0 to  $2^n - 1$ }, Possible range of signed constants  
are { $-(2^{n-1})$  to  $(2^{n-1} - 1)$ }.

(iii) Direct AM (Absolute AM):-

- This AM is used to access static variables.
- In this mode data present in the memory,  
that memory address available in the  
addr field of the instr.

$\therefore EA = \text{Addr field value} \Rightarrow \text{Mem. Addr.}$

$Data = [EA] = [\text{Memory Addrs}]$

- 1 Mem. reference is req. to read/write data to/ from the memory by using this AM.



|        |        |          |
|--------|--------|----------|
| ADD    | [4000] | , [5000] |
| ↓      | ↓      | ↓        |
| +      | S1 & D | S2       |
| ↓      | ↓      | ↓        |
| Direct |        | Direct   |

$$M[4000] \leftarrow M[4000] + M[5000]$$

1 MemRef    1 MemRef    1 MemRef  
 $\underline{(3 \text{ Mem. Ref})}$

|     |        |      |
|-----|--------|------|
| ADD | [4000] | , R0 |
|-----|--------|------|

$\Rightarrow (2 \text{ Mem Ref})$

|     |        |          |
|-----|--------|----------|
| Mov | [5000] | , [4000] |
|-----|--------|----------|

$\Rightarrow (2 \text{ Mem Ref})$

#### (iv) Indirect AM :-

- This is used to implement the pointer concept
- In the indirect AM, data present in memory that memory Addr [EA] may be present in register/memory.
- ∴ Based on the availability of EA the indirect AM is further divided into two types:-

(iv)(a) Register indirect AM:- In this mode, EA is present in register, that reg. name is available in the addrs field of the instr.

∴ EA = [Addr. field value]  
 $\downarrow$   
 Reg. Name

$$\begin{aligned} Data &= [EA] \\ &= [(reg. name)] \end{aligned}$$

2 Ref :-

1 Reg. ref  $\rightarrow$  EA  
 1 Mem ref  $\rightarrow$  Read/write



Eg:- ADD @ R<sub>0</sub>, [2000]



$$M[R_0] \leftarrow M[R_0] + M[2000]$$

1. reg. ref = EA    1. reg.ref = EA    1. mem.ref (read)  
 1. mem.ref = write    1. mem.ref = read    /

(3 references required)



### ③ Memory Indirect Addressing Mode:-

In this mode the effective address is present in the memory, That memory address is available in the address field of the instruction.

$$EA = [\text{Address field value}]$$

↓  
Mem. addr.

Register indirect  
is faster

$$\text{data} = [EA] = [\text{mem Add}]$$

$$\text{1 mem. refer} = EA$$

1 mem. ref = read/write  
2 mem ref



Eg:-

|     |        |                  |
|-----|--------|------------------|
| ADD | @ 5000 | @ R <sub>0</sub> |
|-----|--------|------------------|

↓ + ↓ S<sub>1</sub> + D ↓ S<sub>2</sub>

Indirect ↓ Indirect  
2+2 1 → 5 mem. ref.

$$M[5000] \leftarrow M[5000] + M[R_0]$$

2 2 1

### Indexed Addressing Mode

It is used to implement the arrays. Array is an ordered set of homogeneous data elements. Array is always stored in the sequential memory locations. To access the array element, there is a need of 2 parameters i.e. (i) Base address (ii) Index value.

In the indexed mode, EA is calculated by adding the index value to the base address. i.e.  $EA = BA + \text{Index value}$ .

BA  $\rightarrow$  starting or ending address of the array.

Index  $\rightarrow$  Position of the array element.

Processor supports one special purpose register called as Index register ( $R_i$ ). It is used to hold the index value.

Base address is present in the address field of Instn.  $\therefore EA = \text{Addr. field value} + [R_i]$

$$\text{Data} = [EA] = [\text{Addr field value} + [R_i]]$$

1 reg ref = Index

1 arithmetic = EA(Mem. addr)

1 Mem. ref = read/write.



Gg

Add | 3( $r_i$ ), @6000       $r_i = \text{Index reg}$

+ S1 D S2

Indexed Indirect

$$M[3 + [R_i]] \leftarrow M[3 + [R_i]] + M[6000]$$

$\{ \equiv \} \leftarrow \begin{cases} 1 \text{ reg ref} = \text{Index} \\ 1 \text{ Arith} = EA \\ 1 \text{ Mem ref} = \text{read} \end{cases}$

$1 \text{ Mem ref} \rightarrow EA$

$1 \text{ Mem ref} \rightarrow \text{read}$

Indexed addressing mode is also called as based indexed addressing mode.

NOTE:- Some of the processors support base and index register used to hold the base and index value respectively.

Eg:- 8086 μP supports 2 registers named as

BX → Base reg,

SI/DI → Index reg. (source/Destination index)

$$EA = [Base\ Reg] + [Index\ Reg]$$

↓  
Base Addr

↓  
Index value

$$Data = [EA]$$

$$= [[Base\ Reg] + [Index\ Reg]]$$

1 reg ref = Base Addr

1 reg ref = Index

1 Arithm. > EA

1 Mref = read/write



↓  
Base Reg  
Name

↓  
Base Reg

↓  
Index Reg  
Name

↓  
Ri

↓  
+  
EA



→ read/write

Eg:- For 8086 μP :

|      |              |
|------|--------------|
| ADD  | AX, [BX][SI] |
| ↓    | S1           |
| + S1 | 2D           |
| ↓ S2 |              |

$AX \leftarrow AX + M[BX] + [SI]$   
 Reg.ref (write) Reg.ref (read)  
 1 Reg.ref = BA  
 1 Reg.ref = Index  
 1 Arithm. = EA  
 1 M.ref = Data

- 8086 processor supports another kind of addressing mode, in which 8 bit or 16 bit displacement is maintained along with the base index & index register names. This mode is called as Relative Base Indexed AM.

- In this mode the EA is calculated as

$$EA = [\text{Base reg}] + [\text{Index reg}] + 8\text{bit}/16\text{bit displacement}$$

$$\text{Data} = [EA]$$

$$= [\text{Base}] + [\text{Index}] + 8\text{bit}/16\text{bit disp}]$$

1 Reg.ref = Base  
 1 Reg.ref = Index  
 1 Arithm. = B + Index  
 1 Arithm. = dispadd  
 1 M.ref. = Data

Eg:- ADD AX, 80H[BX][SI]

$$AX \leftarrow AX + M[BX] + [SI] + 80H$$

Eg:-  $EA \rightarrow \{(BA), (SI), (DI)\} \rightarrow \text{Base Indexed AM}$   
 $EA = \{(BA), (SI), (DI), (8/16 \text{bit})\} \rightarrow \text{Relative Base Indexed AM}$

### Indirect Indexed Addressing Mode:-

In this mode EA is present in the memory. That memory addr. is calculated by adding the index value with the base address.

$$\therefore EA = [\text{Address field value} + [Ri]] / [Base Addr] + [Ri]$$

Data - [EA]

$$\Rightarrow [[\text{Addr} \cdot \text{field} + [\text{R}_i]]] / [[[\text{base Reg}] + [\text{R}_i]]]$$

I Reg ref = Index

I Arith. = Mem Addr where EA is present

I Mem Ref = EA

I Mem Ref = Read/write

(or)

I Reg ref = base Reg.

I Reg ref = Index/Val

I Arith. = Mem Addr, where the EA is present

I Mem ref = EA

I Mem ref = Read/write

| OPCODE | Addr |
|--------|------|
|--------|------|

Base Addr      Index Reg.  
Name



Not in use bcoz it takes more time

Indirect  
Indexed Addressing mode takes more # cycles  
to read / write data to / from memory.

Eg: ADD 3(94), @6000



$$M[3+[\text{R}_1]] \leftarrow M[3+[\text{R}_1]] + M[6000]$$

I Reg ref = Index      I Reg  $\rightarrow$  R<sub>1</sub> (Index)      1 Mem  $\rightarrow$  EA  
1 Arith. = Mem Addr / Arithm  $\rightarrow$  Mem Addr      1 Mem  $\rightarrow$  read

1 Mem ref = EA      1 Mem ref  $\rightarrow$  EA  
1 Mem ref = write      1 Mem ref  $\rightarrow$  read

NOTE:- Indexed addressing Mode is used to access the random array element based on the index value.

AUTO INDEXED AM. It is used to access the linear array elements. To access the linear array elements, there is a need of base address.

- Base address ~~is~~ indicates the starting or ending of an array.
- Base reg. is used to hold the base addr. In this mode EA is calculated either by incrementing or by decrementing the base address with the step size. i.e.

$$EA = [\text{Base Reg}] \downarrow (+/-) \text{Step Size} \quad \downarrow \text{Base Addr of} \quad \downarrow \text{Size of the data} \\ \text{an array} \quad \text{element.}$$

$$\text{Data} = [EA]$$

$$= [[\text{Base Reg}] (+/-) \text{Step Size}]$$

1 reg. ref. = Base Addr.

1 arithm = EA

1 M. ref = read/write

The classification of auto index AM is



1.  $\text{Mov } R_0, +(\text{R}_1)$

AM

2.  $\text{Mov } R_0, (\text{R}_1) +$

Post Auto Inc. ④

3.  $\text{Mov } R_0, -(\text{R}_1)$

where  $R_1$  = base register

4.  $\text{Mov } R_0, (\text{R}_1) -$

Post Auto Dec. ④

- Q. Consider a hypothetical CPU. It uses different operands accessing modes.

| Operand & accessing mode. | Frequency(%) |
|---------------------------|--------------|
| { Register                | 30           |
| { Immediate               | 20           |
| Direct                    | 22           |

Mem. Indirect = 19

Indexed = 11.

Assume the 2 cycles required for memory access, one cycle consumes for arithmetic operations & zero cycles consumes when the operand is coming in the register or instruction itself. What is the average operand fetch rate of the processor?

Ans :-

$$\begin{aligned} \text{Avg. operand fetch rate} &= \frac{(0.3 \times 20) + (0.2 \times 0) + (0.22 \times 2) + (0.17 \times 4)}{11} \\ &+ (0.11 \times 3) \\ &= 0.44 + 0.68 + 0.33 \\ &= 1.45 \text{ cycles} \end{aligned}$$

### Transfer of control flow AM :-

While execution of control structures i.e. selection statements (goto, if-then-else, switch etc), iterative statements (while, do-while, for loop etc) and subprogram concept, the program control is transferred from 1 location to another location.

- To implement the control structures, processor supports special type of instructions called as transfer of control flow.
- Transfer of control is associated with the target address.
- While execution of the TOC oper<sup>n</sup>, the control is transferred from current location to target location.
- There are 3 possible mnemonics used to implement the TOC op<sup>n</sup>.
  - (i) Branch
  - (ii) Jump
  - (iii) Skip

TOC op<sup>n</sup> classified into 2 types.

- (i) Unconditional TOC
- (ii) Conditional TOC

Unconditional TOC : While execution of these instr<sup>r</sup>, without checking any cond<sup>n</sup>, the control will be transferred from current location to target location. Eg. JMP 2000



Consider a program segment:

1000: I<sub>1</sub>

1001: I<sub>2</sub>

1002: I<sub>3</sub> (JMP 1020)

1003: I<sub>4</sub>

⋮

1020: B<sub>1</sub>

1021: B<sub>2</sub> (JMP 1012)

1022: B<sub>3</sub>

⋮

Execution sequence of program is:-

$$-t = 1000$$

$$\downarrow$$
$$PC = 1000$$

1001

1002 : [JMP/1020]

1003

1020

1021 : [JMP/1012]

1022

1012

1013

⋮

NOTE: Uncond. TOC operations are used to implement the goto statement, halt instr<sup>r</sup> and M/C control instr<sup>n</sup>.

HALT: It is implemented using the unconditional TOC with starting addr as the target address.

Eg: 1000: I<sub>1</sub>  
1001: I<sub>2</sub>  
1002: I<sub>3</sub> (Halt)  
1003: I<sub>4</sub>

-t = 1000  
↓  
PC = 1000  
1001  
1002 : Halt → [JMP | 1002]  
1003  
1002 : halt  
1003  
1002 : halt

### M/C control inst:-

Uncond<sup>n</sup> TOC op<sup>n</sup> are used to control the functionality of some specific m/cs.  
Eg: Traffic light controller.

I<sub>1</sub>: G: ON

Y: OFF

R: OFF

call delay()

G: OFF

Y: ON

R: OFF

call delay1()

G: OFF

Y: OFF

R: ON

call delay2()

Jmp I<sub>4</sub>.

### Conditional TOC:

Whole execution of this instruction, the associated cond<sup>n</sup> undergoes evaluation. When it evaluates to true, then the control is transferred to the target location, otherwise the next sequential inst<sup>n</sup> is executed.

Eg: JNZ 12000 pce seq. Addr. cond: NZ

Cond<sup>n</sup>  
TOC

NZ comp status of previous  
false → true: inst<sup>n</sup>  
Pc ← sequential addr Target Addr

target addr

Consider the program segment:

```
1000 : MOV r0, #4  
1001 : ADD r1, r2  
1002 : DEC r0  
1003 : JNZ 1001  
1004 : MOV  
;  
;
```

Execution Seq:-

$$-t = 1000$$

PC : 1000 :  $r_0 = 4$   
1001 : +  
1002 :  $r_0 = 3$  (NZ)  
1003 : NZ (True)  
1004  
1001 : +  
1002 :  $r_0 = 2$

1003 : NZ (True)  
1004  
1001 = +  
1002 =  $r_0 = 1$   
1003 = NZ (True)  
1004  
1001 = +  
1002 =  $r_0 = 0$   
1003 = zero (false)  
1004 = MOV.  
1005.

NOTE :- Cond<sup>rd</sup> TOC opn<sup>rs</sup> are used to implement the cond<sup>rd</sup> select<sup>n</sup> statmnts & iterative select<sup>n</sup> stmnts.

Subprogram :- Subprogram is a reusable program that means when the same functionality is repeatedly occurred in the main app, develop the functionality related code only once (Subprogram) and call the functionality many times in the main app.

The advantage of subprogram is it reduces the memory space, development time & development cost.

## Characteristics of subprogram:-

- (i) Single entry point - single exit points
- (ii) Main program is suspended during the execution of the subprogram.
- (iii) Control will be transferred to the main program after completion of the subprogram

## IMPLEMENTATION:-

To implement the subprogram concept, the processor supports pair instructions.

i.e. CALL-RET



Consider the program segment



-t=1000 without stack

PC : 1000 unsuccessful implementation

1001

1002 **CALL 2000**

1003

2000

2001

2002

2003 **RET**

2004

2005

with stack

1000

1001

1002

1003

2000

2001

2002

2003

2004

1003



In the above execution sequence, due to the lack of ~~return statements~~, RET address during the execution of RET instr., CPU enters into the WAIT state. ∴ It is not

- a successful implementa~~t~~ of subprogram.
- To make it successful, need of return address.
- Return address is Next inst<sup>r</sup> address after the CALL instr<sup>r</sup>.
- Runtime stack is used to store the return addresses
- During the execut<sup>n</sup> of program, when the CPU encounters the CALL instr<sup>r</sup>, it pushes the PC value into the stack. Later target address is placed into PC.
- During the execution of subprogram, when the CPU encounters the RET instr<sup>r</sup>, it invokes the POP operat<sup>n</sup> to restore PC with return address.

1. JMP 2000      PC  $\leftarrow$  seq. Addr  
                      Target "
2. JNZ 2000      T  $\Rightarrow$  PC  $\leftarrow$  seq. Addr  
                      target "  
                      f  $\Rightarrow$  PC  $\leftarrow$  seq. Addr.
3. CALL 2000 : PC  $\xrightarrow{\downarrow}$  stack  
                      PC  $\leftarrow 2000$
4. RET             PC  $\leftarrow$  seq. Addr    TOS (Return Addr)

TOC instr further divided into 2 types:-  
 (i) direct TOC + (ii) Indirect TOC.  
 In direct TOC, the target addr. is available in instruction itself. whereas in the indirect TOC, the target addr is not present in the instr.

| TOC      | uncond.                 | cond.                                                              |
|----------|-------------------------|--------------------------------------------------------------------|
| Direct   | JMP, CALL, goto<br>Halt | JNZ, CALLZ,<br>CALLNZ, CALLC,<br>For, while, do-while<br>if/switch |
| Indirect | RET                     | RETZ, RETC,<br>RETNZ, RETNZ                                        |

In the transfer of control flow AM category, two addressing mode are employed:-

(a) Relative / PC relative AM

(b) Based / Base reg AM.

→ Inst<sup>n</sup> are always stored in memory. So to access the instruct<sup>n</sup>s efficiently, there is a need of understanding the memory org.

→ Consider 8086 mem org<sup>n</sup>. It contains 1MB physical memory. The physical memory is divided into 16 logical segments with each segment size of 64 KB. (No change in size)

$$1\text{ MB} \Rightarrow 2^{20} \text{ cells.}$$



$$1\text{ MB} = 16 \times 64\text{ KB} = 2^{20}\text{ B.}$$

change in organization

$$\text{Physical addr} = \text{BA} + \text{Offset Addr.}$$

In 8086, segment registers are used to hold BA and pointer or index register is used to hold offset address.

Segment Registers → CS

→ DS  
→ SS  
→ ES

Pointer/Index Reg → IP } Pointer Reg  
→ BP }  
→ SP }  
→ SI } Index Reg  
→ DI }

If CS=0  $\Rightarrow$  {00000} Ins<sup>n</sup>  
0FFF }  
If DS=2  $\Rightarrow$  {20000} Data  
2FFFF }  
Notation

CS:IP  
DS:BP/SI/DI

To access code  $\Rightarrow$  CS+IP

To access data  $\Rightarrow$  DS+BP/SI/DI.

### Relative / PC-relative AM:-

When the target instr<sup>r</sup> is present in the same segment, then the control will be transferred within the segment (intrasegment TOC).

In the intrasegment TOC the base address is common. So only 1 parameter is required to calculate the target instr<sup>r</sup> addr. i.e. relative value.

Relative value is the distance b/w the current instr<sup>r</sup> location to target instr<sup>r</sup> location.

In the PC-relative addressing mode, the relative value is present in the address field of instr<sup>r</sup>.

$$EA = PC + \text{Addr field value}$$

↓      ↓      ↓  
Next instr<sup>r</sup>    current    relative  
Addr            Instr Addr    value

$$PC \leftarrow PC + \text{Addr. field value}$$

PC relative addressing mode is suitable for program relocation at runtime.

## Base register AM

When the target inst<sup>n</sup> is present in different segment, then the control is transferred from current segment to target segment (Intersegment transfer of control)

→ In the intersegment TOC, to calculate the Target inst<sup>n</sup> address, 2 parameters are required (i) Base Addr & relative value.

→ In the base register AM, the base Addr is maintained in the base register & relative value is present in the address field of instruction.

$$\therefore EA \leftarrow [\text{Base Reg}] + \text{Addr. field value}$$

↓                      ↓                      ↓  
 Next ins<sup>n</sup> addrs.    Base Addr of the    Relative value  
 Base segment

↓  
 PC  $\leftarrow$  [Base Reg] + Relative value.

Base register AM is suitable for write position independent codes (subprograms).

Q. A 3 byte long PC-relative ins<sup>n</sup> is stored in memory, starting address 243076 (decimal). The signed displacement of present in the addr field value is -21. Then what is the branch address?



Ans  
243058

$$\therefore \text{Target Addr.} = \text{PC} + \text{Add field} = \frac{243079 - 21}{243058}$$

## INSTRUCTION SETS :-

Processor supports 3 categories of ins<sup>n</sup>.

- Data Transfer Ins<sup>n</sup>
- Data Manipulation Ins<sup>n</sup>
- Transfer of control Ins<sup>n</sup>

### (i) Data Transfer Instruction

While execution of this instr, data is transferred from 1 location to another location. The limitation of data transfer operation is source may or may not have the storage functionality but the dest<sup>n</sup> always has the storage.  
Immediate addressing mode may be for

There may be different data transfer operations possible.

(i) MOV :- General purpose data transfer instruction.

| <u>source</u> | <u>Dest<sup>n</sup></u> |
|---------------|-------------------------|
| reg           | reg                     |
| reg           | mem                     |
| mem           | reg                     |
| mem           | mem                     |
| Immediate     | reg                     |
| Immediate     | mem.                    |

- ② LOAD ] These are memory specific oper<sup>n</sup>
- ③ STORE ] used to perform the memory read & memory write op<sup>n</sup> resp.
- ④ PUSH ] These op<sup>n</sup> are stack specific op<sup>n</sup>
- ⑤ POP ] used to perform the insert & delete operations resp.
- ⑥ IN ] No specific op<sup>n</sup> used to perform the
- ⑦ OUT ] Thread & I/O write op<sup>n</sup> resp.

## ② Data Manipulation Instructions

While execution of this inst<sup>n</sup>, one form of data is converted into another form

1. Arithmetic (ADD, SUB, MUL, DIV, INC, DEC)
2. Logical ( AND, OR, NOT, XOR, CMP)
3. Shift and Rotate Instructions.

NOTE: During the execution of arithmetic op<sup>n</sup>, the carry flag will be used except in the INC and DCR operations.

$$\text{INC } 255 \Rightarrow 255$$

$$\text{DCR } 0 \Rightarrow 0$$

CMP: - It is used to compare the 2 entities. During the execution of CMP inst<sup>n</sup>, it invokes the subtraction operation.

→ The O/P of the CMP inst<sup>n</sup> will be maintained in Z flag & carry flag.

Eg:- 

|     |                                 |
|-----|---------------------------------|
| CMP | r <sub>0</sub> , r <sub>1</sub> |
|-----|---------------------------------|

 ; r<sub>0</sub> = 80H  
r<sub>1</sub> = 80H

↓  
Invokes  
SUB

$$r_0 - r_1 = 0 \Rightarrow \text{zero flag} = 1 (\text{set})$$

After the CMP execution, if the zero flag is set, i.e. both the operands are equal

Eg:- 

|     |                                 |
|-----|---------------------------------|
| CMP | r <sub>0</sub> , r <sub>1</sub> |
|-----|---------------------------------|

 r<sub>0</sub> = 80H  
r<sub>1</sub> = 40H

$$\begin{array}{r} 1000\ 0000 \\ - 0100\ 0000 \\ \hline 0100\ 0000 \end{array}$$

→ zero flag is reset  
carry flag is reset.

After CMP execution, if zero flag & carry flag both are reset, that means 2nd operand is less than the first operand

Eg3:  $r_0 = 40H$   $r_1 = 80H$

$$\begin{array}{r} 01000000 \\ 10000000 \\ \hline 11000000 \end{array}$$

(0) zero-reset  
(1) carry

After CMP execution, if zero flag is reset & carry flag is set, that means 2nd operand is greater than 1st operand.

### Shift operation:-

- While execution of shift operation, data is transferred towards left or right bit by bit with loss of data.
- The classification of shift operations



In the logical shift operation all the bits are moving towards left or right including the sign bit.



e.g. LSL  $r_0, \#2$

$$1000\ 0011 \xrightarrow{\leftarrow, 2 \text{ times}} 0000\ 1100$$

OR CH



e.g. LSR  $r_0, r_0, r_0, 83H$

$$1000\ 0011$$

Appl<sup>n</sup> :- Transfer data into another reg. Used in the serial data comm. to transfer the data bit by bit.

In the arithmetic shift operations, all the bits are moving towards left or right except the sign bit.

ASL

Reg.



Eg.: ACL  $r_0 = 83H$

$1000\ 0011$

$1000\ 0110$

$86H$

ASR



Eg.:  $r_0 = 83$

$1000\ 0011$

$\Rightarrow 1100\ 0001$   
 $(C1)H.$

Appn:- Used in signed arithmetic.

Rotate Instruction:- While execution of rotate ins, all the bits are moving towards left or right without loss.

The classification of rotate operation is

- without carry -
  - left (ROL)
  - right (ROR)
- with carry -
  - left (RCL)
  - right (RCR)

ROL:-



No new data is added  
Existing data is not lost.  
only position is changed.

Eg.:  $83H$

$1000\ 0011 \rightarrow$

$(07H)$   
 $0000\ 0011$

ROR:-



$10000011 \rightarrow 11000001$   
 $(C1)H$

RCL:



$1000\ 0001$  C-0

$\Rightarrow 0000\ 0110$  C1  
 $06H$  & C1

RCR:-



10000011, 0  
01000001, 1

41H & C7H

App<sup>n</sup>:- optimization because implement is simple:  
Motor movement  $\Rightarrow$  ROL or ROR.  
Even odd  $\Rightarrow$  RCR.

Instruction cycle with interrupts:-

This concept describes the execution sequence of instructions along with the interrupts.

Interrupt is an unusual event to disturb the normal flow of execution.

Instruction cycle with interrupt consists of 3 subcycles:

- (i) fetch cycle
- (ii) execution cycle
- (iii) interrupt cycle

The sequence diagram of interrupt handling process is



The CPU responds to interrupts only after completion of the current instruction execution.

When the interrupt flag is ENABLED that means interrupt sources are enabled. So interrupt cycle is required to service the interrupt. Otherwise, interrupt sources are disabled, so no need of interrupt cycle.



After completion of every instr execution the CPU reads the status of the interrupt sources to identify whether the interrupt is present or not.

INTR  $\begin{cases} 1 & ; \text{INT is present} \\ 0 & ; \text{INT is not present} \end{cases}$

- When the int. are not present, then the CPU fetch the next inst. from the memory.
- When the int. is present, the CPU push the program counter value into the stack and transfers the control to the INTERRUPT VECTOR TABLE.
- IVT is a part of memory. It is used to store the interrupt subprograms.
- There are multiple interrupts possible in the processor. So different INT subprograms are required to service the different INT.
- Each INT subprogram must be end with IRET/RETI instruction.
- During execution of INT subprogram, when the CPU encounters the RETI instr it invokes the POP instr to restore the PC with return address.

∴ After service the interrupt, the CPU fetch the next instr based on the PC.

|                 | Data Transfer                         | ALU                       | Uncond TOC                                  | Cond TOC                           |
|-----------------|---------------------------------------|---------------------------|---------------------------------------------|------------------------------------|
| Fetch cycle     | PC $\leftarrow$ Seq. Addr.            | PC $\leftarrow$ Seq. Addr | PC $\leftarrow$ Seq.                        | PC $\leftarrow$ Seq.               |
| Execution cycle | No change in PC                       | No change in PC           | Change in PC<br>PC $\leftarrow$ target Addr | Change or<br>Not PC changed or not |
| Int Service     | Yes<br>RA : Seq. Addr<br>return Addr. | Yes<br>RA : Seq. Addr     | Yes<br>RA : target Addr                     | Yes<br>RA : Changed or same        |

**HD** Types of interrupts: Mainly  $\rightarrow$  H/w & S/w INT — rest are part of these

1. Hardware INT:- Hardware int. are present at the H/w pins of the processor. H/w INTs are also called as the ext. int.

2. Internal int.

External INTs are generated by the ext. sources i.e. Power supply, basic IO devices etc.

Internal INTs generated by the internal components of the processor i.e. invalid OPCODE, divide by zero etc.

2. Software INT:- S/w INT is an inst. It is defined in the instr set of the processor. Bg.: — supervisor call.

3. Maskable INT:- Maskable INT pins are enabled or disabled based on the signals. All the basic IO devices are connected to maskable INT pins.

4 Non Maskable INT:- Here, the NMI pin is always present in the enabled state. So critical conditions of the processor i.e. power failure & temp sensor etc. are connected to the NMI pin.

5. Vector Interrupt:- Vector INT is associated with fixed vector point (static vector Addr)

6. Non Vector INT:- It is associated with dynamic vector point (means vector point add. is supplied by the source).

Eg



7. Level triggered INT:- This INT is enabled based on the clock signals.

8. Edge triggered INT:- These INT are enabled either with raising edge transition or with falling edge transition.

INT structure:- To analyse the implementation of various INTs, let us consider 8085 processor as a reference component.

8085 H/W INT

- |                    |                                  |
|--------------------|----------------------------------|
| 1) RST 7.5 (0030H) | (4) RST 4.5 (TRAP) (NMI) (0024H) |
| 2) RST 6.5 (0034H) | (5) INTR                         |
| 3) RST 5.5 (002CH) |                                  |

1, 2, 3, 5 → Maskable

## S/W INT - RST 0-4.

For S/W INTs, ~~maskable~~ and non-maskable is not applicable because ~~intn~~ can not be enabled or disabled.

INTR - Non vectored INT.

Other 4 h/w INTs are vectored.

All the S/W INT are vectored.

RST 0 - 0000      RST 2 - 0010      RST 4 - 0020      RST 6 - 0030

RST 1 - 0008      RST 3 - 0018      RST 5 - 0028      RST 7 - 0028

For S/W INTs, level triggering and edge triggering is not applicable.

RST 4.5 → edge triggered

RST 6.5, 5.5, INTR → level triggered

TRAP (RST 4.5) → Both edge & level triggered

INT Priority: When the processor supports multiple h/w INT pins, then there is a possibility of simultaneous INTs.

- To service only one INT at a time, there is a need of priority levels.

- |                                                                        |                                                                                                                 |
|------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------|
| 1. TRAP (RST 4.5)<br>2. RST 7.5<br>3. RST 6.5<br>4. RST 5.5<br>5. INTR | <br>High Priority<br><br>Low |
|------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------|

Q. A certain P/P supports 4 INTs, I<sub>1</sub>, I<sub>2</sub>, I<sub>3</sub>, I<sub>4</sub>. I<sub>1</sub> → priority decreases. The CPU responds to INT in every 2.5 μsec. The service time of INTs are 20 μsec, 40 μsec, 30 μsec, 25 μsec resp. The INTs may or maynot occur simultan.

What is the possible range of time required to execute the INTs.

$$\Rightarrow \text{Min} \rightarrow 27.5 \mu\text{s} \quad \text{Max} - 105 + 10 = 125 \mu\text{s}.$$

- \* If many INTs are present simultaneously, after servicing one INT, an instr is executed. Only after execution of an instr, next INT is serviced.

Ans to Ques:-

min  $\rightarrow$  (without simul)  $\rightarrow$  2.5  $\mu$ s. with simul  $\rightarrow$  2.5  $\mu$ s.  
 max  $\rightarrow$  125  $\mu$ s with simul  $\rightarrow$  2.5  $\mu$ s.

RISC and CISC :-

Char. of RISC  $\rightarrow$

- It supports more no. of registers
- It supports less no. of addressing modes.
- It supports fixed length instructions.
- One instr per cycle  $\therefore CPI = 1$   
 $(\text{cycles per instr}) \Rightarrow$
- It supports successful pipeline implementation.
- Motorola, PowerPC, ARM (Advd. risc M/c) etc are eg. of RISC processors.

Characteristics of CISC :-

- It supports less no. of registers.
- It supports more no. of addressing modes.
- It supports variable length instructions.
- There is not one instr per cycle.  $CPI \neq 1$ .
- Supports unsuccessful pipeline implementation.
- Eg - Pentium.

Register organization in RISC CPU :-

- In the RISC CPU, the registers are organized as overlapped register windows
- The risc processor registers are classifi

- ed as
- (i) Global reg (G) (iii) IN reg (C)
- (ii) Local reg. (L) (iv) OUT reg. (C)
- The register window contains 3 set of registers i.e. local, IN and OUT reg.
- The global registers are accessible by all the register window.

- Window size = L + 2C + G.

In the RISC CPU, all the system supported registers are organized as overlapped register windows. Overlapping means window OUT registers are used as IN registers in the another window.



Ex:-



In overlapped (complete), register file size  
 $= \text{No. of window} * (L+C) + G$

## Components of computer:-

The computer contains 3 basic components.

(i) CPU (ii) Memory (iii) I/O

### CPU ORGANIZATION:

CPU functionality is processing the inst'. It consists 3 basic components (i) Registers (ii) ALU (iii) Control units.

REGISTERS :- Reg. are used to store the frequently referred data.

- During the execution of the program, there is a frequent communication between CPU and memory.

- Due to slow working of memory component, the CPU performance degrades
- To avoid this problem, memory hierarchy design is used in the system design.
- The objective of memory hierarchy is to balance the bandwidth between the CPU and memory. (or) to minimise the speed gap between CPU + memory. (or) sync the data transfer rate b/w CPU + memory.
- Memory hierarchy describes the way of organization and way of accessing the system supported memories.



Bottom to top approach of memory hierarchy is

- (1) Capacity decreases
- (2) Access time reduces (Expensive)
- (3) Cost increases. (Performance)
- (4)  $P = \frac{1}{\text{exec. Time}}$ ,  $E = \text{Acc. Time}$ ,  $D = \frac{1}{P}$

- 5) Performance increases.
  - 6) ~~Expensive~~ cost/bit increases.
- NOTE: During the execution of instruction, data becomes the most imp. So use the smallest storage <sup>(register)</sup> component to hold the frequently accessed data. The processor maintains minimum set of registers.

### Mandatory Register:

1. PC: It is used to hold the starting address & immediately points to the next instr<sup>n</sup> addrs.
2. IR: It is used to hold the currently fetched instr<sup>n</sup> to be decoded because instr<sup>n</sup> format is predefined in this register, so it is not user accessible register.
3. ACC: It is user accessible register, used to hold the ALU input & output.
4. TR: It is used to hold the 2<sup>nd</sup> ALU operand. It is not a user accessible register.
5. MAR: It is used to carry memory address. MAR register is directly connected to address lines of system bus.
6. MBR/MDR: It is used to carry the binary sequence. MDR register is directly connected to data lines of system bus.

NOTE:- Apart from the above registers, processor supports set of the general purpose registers.

### INSTRUCTION EXECUTION SEQUENCE USING MICROOPERATIONS:-



Mem Based:

Immed:-  $\text{IR}[\text{Addr}] \xrightarrow{\text{data}} \text{ALU}$

Direct:-  $\text{IR}[\text{Addr}] \xrightarrow{\text{mem Addr}} \text{MAR} \rightarrow \text{AL's}$

$\overline{\text{RD}} \rightarrow \text{CL's}$

$\rightarrow \text{MEM} \rightarrow \text{BIn} \rightarrow \text{DL's} \rightarrow \text{MBL}$

Reg Indirect:  $\text{IR}[\text{Addr}] \xrightarrow{\text{reg Name}} \text{Reg file} \rightarrow \text{EA}$

$\rightarrow \text{MAR} \rightarrow \text{ALU}$

$\overline{\text{RD}} \rightarrow \text{CL's}$

$\rightarrow \text{MEM} \rightarrow \text{BIn} \rightarrow \text{ALU}$

Mem Indirect:-

$\text{IR}[\text{Addr}] \xrightarrow{\text{mem Addr}} \text{MAR} \rightarrow \text{AL's}$

$\overline{\text{RD}} \rightarrow \text{CL's}$

$\rightarrow \text{MEM} \rightarrow \text{BIn} \rightarrow \text{EA}$

$\rightarrow \text{DL's} \rightarrow \text{MBR} \rightarrow \text{MAR}$

$\overline{\text{RD}} \rightarrow \text{CL's}$

$\rightarrow \text{ALU} \leftarrow \text{MBR} \leftarrow \text{DL's}$

$\rightarrow \text{ALU} \leftarrow \text{MBR} \leftarrow \text{CC's}$

Microoperation is a basic reg. to reg. transfer op<sup>n</sup>. Each  $\mu\text{inst}^n$  can be executed in 1 cycle  
 $\mu\text{op}^n$  is also called as  $\mu\text{inst}^n$  or control word or Atomic op<sup>n</sup>.

Control signals are reg. to execute the  $\mu\text{inst}^n$ .

Eg. (a)  $T_1 : R_1 \rightarrow R_2 ; R_{1\text{out}}, R_2 \text{ in.}$

(b)  $R_{1\text{out}}, R_{2\text{in}}, R_{3\text{in}} \Rightarrow \begin{cases} R_1 \rightarrow R_2 \\ R_1 \rightarrow R_3 \end{cases} \left\{ \begin{array}{l} \text{In same} \\ \text{cycle} \end{array} \right.$

MicroProgram: The sequence of MOp<sup>n</sup> is called as the μprogram.

- The fetch cycle μprogram is

|                               |                                          |              |
|-------------------------------|------------------------------------------|--------------|
| T <sub>1</sub> : PC → MAR     | ; PC <sub>out</sub> , MAR <sub>in</sub>  | Control unit |
| T <sub>2</sub> : M[MAR] → MBR | ; MAR <sub>out</sub> , MBR <sub>in</sub> |              |
| T <sub>3</sub> : MBR → IR     | ; MBR <sub>out</sub> , IR <sub>in</sub>  |              |

### CONTROL UNIT:-

- control signals are directly executed on base machine.
- Control unit generates the control signals
- Control signals are implemented in CO.
- MOp<sup>n</sup> are executed based on the sequence of the control signals.
- MOp<sup>n</sup> is a basic reg to reg transfer op<sup>n</sup>.
- μprog is a sequence of MOp<sup>n</sup>'s.
- Subcycles of the inst<sup>n</sup> cycle invokes their own μprograms.
- Inst<sup>n</sup> cycle contains 2 subcycles i.e. fetch & execute cycle.
- Inst<sup>n</sup> execute sequence is described by using the inst<sup>n</sup> cycle.
- Program is a sequence of inst<sup>n</sup> along with data.
- System functionality is execution of the program.



## Control unit

To design the control unit the designer must identify following 3 parameters:

- The no. of op<sup>n</sup> (Inst<sup>n</sup>) possible in the processor.
- The no. of μop<sup>n</sup> required to execute each inst<sup>n</sup>.
- The no. of ~~control~~ signals required to execute each op<sup>n</sup>.

After identifying the above 3 parameters designer maintains the database.

Eg: #inst<sup>n</sup> { I<sub>1</sub>, I<sub>2</sub>, I<sub>3</sub> }      op code = 2 bits  
 { ADD SUB MUL }  
 00 - ADD  
 01 - SUB  
 10 - MUL  
 11 - Undef.

# of control signals recognized by b/w = { S<sub>0</sub>, S<sub>1</sub>, S<sub>2</sub> }

# μop<sup>n</sup> req to execute each inst<sup>n</sup>: { T<sub>1</sub>, T<sub>2</sub>, T<sub>3</sub> }

Database :

| Inst <sup>n</sup> /Op <sup>n</sup> : | 00                              | 01                              | 10                              |
|--------------------------------------|---------------------------------|---------------------------------|---------------------------------|
|                                      | I <sub>1</sub><br>ADD           | I <sub>2</sub><br>SUB           | I <sub>3</sub><br>MUL           |
| T <sub>1</sub>                       | S <sub>0</sub> , S <sub>1</sub> | S <sub>0</sub> , S <sub>2</sub> | S <sub>2</sub> , S <sub>1</sub> |
| T <sub>2</sub>                       | S <sub>2</sub> , S <sub>0</sub> | S <sub>2</sub> , S <sub>1</sub> | S <sub>0</sub> , S <sub>2</sub> |
| T <sub>3</sub>                       | S <sub>1</sub> , S <sub>2</sub> | S <sub>1</sub> , S <sub>0</sub> | S <sub>1</sub> , S <sub>0</sub> |

→ implemented in CU

Inst<sup>n</sup>: ADD → ADD → 00 → T<sub>1</sub> → T<sub>1</sub>: S<sub>0</sub>, S<sub>1</sub>  
 set    SUB                          T<sub>2</sub>: S<sub>2</sub>, S<sub>0</sub> → T<sub>2</sub>: S<sub>2</sub>, S<sub>0</sub>  
 MUL                                  T<sub>3</sub>: S<sub>1</sub>, S<sub>2</sub> → T<sub>3</sub>: S<sub>1</sub>, S<sub>2</sub>

Match to previous circle diagram.

- After defining database, designer uses any one of the following approaches to imp. the database

- Hardwired Approach
- Microprogrammed Approach

## Hardwired control unit design :-

- In this design, control signals are expressed as sum of product expression; they are directly implemented on the independent h/w.
- It is the fastest control unit design when control signals are implemented on independent hardware.
- It is used in real time app<sup>n</sup>. Eg:- aircraft simulation & weather forecasting, etc.
- Even a minor modification requires redesign and reconnection of the control signals. So It is not flexible.
- It is not suitable in the designing and testing places.
- Risc control unit is hardwired CU.

Q. Consider a hypothetical control unit. It supports 3 inst<sup>n</sup> ( $I_1, I_2, I_3$ ) and 4 control signals ( $s_0, s_1, s_2, s_3$ ). Each inst<sup>n</sup> requires 4 Mop<sup>n</sup> ( $T_1, T_2, T_3, T_4$ ) to complete the execution. The following table indicates what are the control signals required for each Mop<sup>n</sup>, for each inst<sup>n</sup>.

| Inst <sup>n</sup> /Mop <sup>n</sup> | $I_1$     | $I_2$     | $I_3$     |
|-------------------------------------|-----------|-----------|-----------|
| $T_1$                               | $s_0 s_1$ | $s_2 s_0$ | $s_3 s_2$ |
| $T_2$                               | $s_1 s_2$ | $s_3 s_2$ | $s_0 s_2$ |
| $T_3$                               | $s_3 s_2$ | $s_1 s_0$ | $s_3 s_0$ |
| $T_4$                               | $s_0 s_3$ | $s_2 s_3$ | $s_3 s_0$ |

Get the control exp. for  $s_0, s_1, s_2, \& s_3$ .

$$\begin{aligned}
 S_0 &= T_0(I_1 + I_2) + T_2 I_3 + T_3(I_2 + I_3) + T_4(I_1 + I_3) \\
 S_1 &= T_1 I_4 + T_2 I_1 + T_3 I_2 \\
 S_2 &= T_1(I_2 + I_3) + T_2(I_1 + I_2 + I_3) + T_3 I_1 + T_4 I_2 \\
 S_3 &= T_1 I_3 + T_2 I_2 + T_3(I_1 + I_2) + T_4(I_1 + I_2 + I_3) \\
 &\quad T_1(I_2 + I_3) + T_2 + T_3 I_1 + T_4 I_2 \\
 &\quad T_1 I_3 + T_2 I_2 + T_3(I_1 + I_2) + T_4
 \end{aligned}$$



## MicroProgram Control Unit Design :-

- In this design, control memory is used to store the MProg.. The control memory is permanent memory i.e. ROM.
- MProg is a sequence of Moperations.
- Minstr are stored in the control memory by using the following format.

Minstr format:-



If control format is horizontal  $\rightarrow$  Horizontal Prog.

Based on way of representing control signals,  $\mu$ inst is divided into 2 types.  
i.e. (i) Horizontal  $\mu$ inst (ii) Vertical  $\mu$ inst.

In Horizontal format, the control signals are represented in the decoded binary format, whereas in the vertical format, the control signals are represented in the encoded binary format.

- Based on type of control word supported by control memory, the CU is divided into 2 types:-

- (i) Horizontal microprogram CU
- (ii) Vertical microprogram CU.

### Horizontal Programming:-

1. In this design, the CS are expressed in decoded binary format i.e. 1 bit/CS.
2. It supports longer control word.
3. No need of the external decoder to generate the CS, so it is faster than vertical.
4. It allows high degree of parallelism.  
i.e. none are more than one.
5. It is used in II<sup>l</sup> processing appn.
6. It is bit flexible compared with hardwired approach.

### Vertical Programming:-

1. In this design, the CS are expressed in encoded binary format, i.e. N CS require  $\log N$  bits.
2. It supports shorter control word.
3. There is a need of external decoder to generate the control signals. So it is slower.

- than horizontal.
4. It allows low degree of parallelism i.e. none or one.
  5. It allows easy implementation of new inst'. So it is more flexible.

### NOTE:-

The ascending order of the CU design in terms of speed is Vertical - Horizontal - Hardwired  
 low → high

The ascending order of CU design in terms of flexibility  
 low → high  
 Hardwired - Horizontal - Vertical.

- Q Consider a hypothetical CU. It supports 1024 control word memory. It has 48 CS, and 16 flags. What is the size of Cword in bits & control memory in bytes using the nonhorizontal programming?

Ans:-

| BC | flag | CS | SM Addr |
|----|------|----|---------|
|----|------|----|---------|

↓      ↓      ↓

4      48      10

= 62 bits

CW size = 62 bits

= 1024 CW

=  $1024 \times 62$  bits

=  $\frac{(1024 \times 62)}{8}$  bytes

- Q A hypothetical CU supports 8 mutually exclusive groups of CS that is tabulated below:-

|        |                |                |                |                |                |                |                |                |
|--------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|
| Groups | G <sub>1</sub> | G <sub>2</sub> | G <sub>3</sub> | G <sub>4</sub> | G <sub>5</sub> | G <sub>6</sub> | G <sub>7</sub> | G <sub>8</sub> |
| CS     | 1              | 3              | 7              | 6              | 12             | 14             | 20             | 2              |

How many bits saved using the vertical programmed over horizontal programming?

Ans:- H = 1 + 3 + 7 + 6 + 12 + 14 + 20 + 2 = 65 bits

$$V = \log_2 1 + \log_2 3 + \log_2 7 + \log_2 6 + \log_2 12 + \log_2 14 + \log_2 20 + \log_2 2$$

= 23 bits

$$\therefore \text{No. of bits saved} = (65 - 23) = 42 \text{ bits}$$

- Q. A C field ~~has~~ of  $\mu$ inst<sup>n</sup> supports 2 groups of cs. Group 1 indicates number of 24 CS and group 2 indicates at most 6 from remaining. What is the size of control field?



## Performance Evaluation of Processing

Performance is an indirect measured parameter.

It is depending on ~~on~~ 2 directly measured parameters

(a) Execution time

(b) Throughput

- The time required to complete the program is called execution time or elapsed time or response time.

- Consider 2 systems X and Y. Both are used to execute the same task in which X system takes 5 ns & Y system takes 10 ns time to complete the task.  $P \propto \frac{1}{ET}$

- The no. of tasks completed in a unit of time is called as throughput.  $P \propto T$

- The performance gain can be measured by using the speedup factor. ~~or~~ Speedup factor compares the 2 different entities.

$$\text{i.e. Speedup} = \frac{\text{Performance}}{(S) \text{ Performance}}$$

$$S = \frac{ET_y}{ET_x} ; S = n$$

System 'X' runs 'n' times faster than system 'Y'

- Quantitative principles are used to improve the performance.
- These principles states that improve the performance of processor within the cost & price limits while making the common case fast.
- This principle is called as "Amdahl's Law"



Amdahl's law is focussed on performance gain after enhance the frequent components.

i.e.  
 $\text{Overall} = \frac{\text{Performance of the Sys with enhancement}}{\text{Performance of the sys without enhancement}}$

$$\text{Overall} = \frac{1/ET_{\text{new}}}{1/ET_{\text{old}}} - \frac{ET_{\text{old}}}{ET_{\text{new}}} \quad (1)$$

- After enhance the system, it contains 2 portions i.e. enhanced portion and unenhanced portion.

$$\Rightarrow ET_{\text{new}} = ET_{\text{old}} + ET_{\text{unenhanced portion}} \quad (2)$$

2 parameters are needed to calculate the  $ET_{\text{new}}$

- Fraction enhanced ( $f$ ).
- Speedup  $S_{\text{enhanced}}$ .

Fraction enhanced indicates how much portion of the system is enhanced.

$f$  = enhanced

$(1-f)$  = unenhanced

→ Speedup enhanced indicates, how many times the enhanced portion is running faster than the old portion.

$S = \frac{\text{Performance of enhanced portion}}{\text{Performance of old portion.}}$

$$= \frac{\cancel{S}}{\cancel{ET_{\text{new}}(F)}} \cdot \frac{1}{ET_{\text{new}}(F)} = \frac{1/ET_{\text{old}}(F)}{ET_{\text{new}}(F)}$$

$$\textcircled{2}^{\text{nd}} \text{ in } \textcircled{3}^{\text{rd}} \quad ET_{\text{new}}(F) = \frac{ET_{\text{old}}(F)}{S} - \textcircled{2}$$

$$ET_{\text{new}} = ET_{\text{old}}(1-F) + \frac{ET_{\text{old}}(F)}{S}$$

Substitute \textcircled{2} in \textcircled{1},

$$S_{\text{overall}} = \frac{ET_{\text{old}}}{ET_{\text{old}}(1-F) + \frac{ET_{\text{old}}(F)}{S}}$$

Let  $ET_{\text{old}} = 1$ .

$$S_{\text{overall}} = \frac{1}{(1-F) + \frac{F}{S}} \quad \left[ \left(1 - \frac{F}{S}\right) + \frac{F}{S} \right]^{-1}$$

If  $ET_{\text{old}} = 1$ , then  $ET_{\text{new}}$  is always less than 1 &  $S_{\text{overall}}$  is always greater than 1.

NOTE:- When the system contains multiple enhancements then  $S_{\text{overall}} = \left[ (1 - \sum f_i) + \sum \frac{f_i}{S_i} \right]^{-1}$

$i = \# \text{enhancements}$ .

Q. Consider a computer which is used in the mathematical model simulations. It consists of 2 components:- Integer & floating point units. FPU is enhanced then it runs 2 times faster but only 10% of the instr are floating point. What is the performance gain.

$$\text{Ans: - } S_{\text{overall}} = \left( (1-f) + \frac{f}{s} \right)^{-1} \cdot \left[ (1-0.1) + \frac{0.1}{2} \right]^{-1}$$

$$= 1.05$$

$$\approx 1$$

According to the problem statement, instead of frequent case the rare case goes on enhancement.  $\therefore$  there is no considerable performance gain.

The corrective measurement is enhance the frequent case i.e. Integer unit.

$$S_{\text{overall}} = 1.82 \approx 2 \therefore \text{considerable gain}$$

Q. Suppose the cache memory is 10 times faster than main memory & it is used 70% of the time. How much speedup do we gain by using this cache.

$$\left[ (1-0.7) + \frac{0.7}{10} \right]^{-1} = \left( 0.3 + \frac{0.7}{10} \right)^{-1}$$

$$= \left( \frac{3.7}{10} \right)^{-1} = \frac{10}{3.7} = 2.7$$

$\Rightarrow$  Introducing cache improves performance by 2.7 times.

Q. 3 enhancements are proposed for a new architecture with the following speedups  $\Rightarrow S_1 = 30, S_2 = 20, S_3 = 15$ .

If the enhancement 1 & 2 both will

be used 25% of the time, what fraction of the time enhancement 3 will be used to gain the overall of 10

$$\cancel{D = \frac{1}{10}}$$

$$10 = \left[ [1 - (0.25 + x)] + \frac{0.25}{30} + \frac{0.25}{20} + \frac{x}{15} \right]^{-1}$$

$$10 = \left[ 0.5 + x + \left( \frac{0.5 + 0.75}{60} + \frac{x}{15} \right) \right]^{-1}$$

$$10 = \left( 0.5 + x + \frac{0.25}{60} + \frac{x}{15} \right)^{-1}$$

### Calculation of CPU time :-

- CPU time means program execution time.
- Processor is having clock pin as an input pin.
- The clock pin is externally connected with the clock generator.
- Clock generator is operating with the constant frequency to generate the clock pulses.
- To generate clock pulses.
- These clock pulses are carried into the CPU through the CLK pins. So the CPU operations are controlled by the clock.



The program execution time can be calculated based on cycle & cycle time.

- cycle is defined as raising edge to raising edge transition or falling edge or falling edge transition.
- It is also called as the 'tick'/clk tick/cycle/clock cycle etc.



- The time required to transfer the pulse either from raising edge to raising edge or from falling edge to falling edge is called as cycle time.
- The cycle time is depending on the clock frequency.

$$1 \text{ GHz} = \text{frequency}$$

$$\Rightarrow t = \frac{1}{1 \text{ GHz}} = 10^{-9} \text{ sec.} = 1 \text{ ns.}$$

The program execution time = #sec per ~~prog~~  
 $\Rightarrow \# \text{inst}^n / \text{prog} * \# \text{cycles/inst}^n * \cancel{\#inst^n / \text{prog}}$   
~~#seconds/ cycle~~

$$ET = \text{Inst Count} * \text{CPI} * \text{cycle time}$$

~~inst~~

Program is a mixing of ~~the~~ 3 categories of inst<sup>n</sup>. So different inst<sup>n</sup> take different # cycles. ∴ The prog. execut<sup>n</sup> time / CPU time  
 $= \sum_{i=1}^{n} (IC_i * CPI_i) * \text{cycle time}$

$i = \text{type of inst}^n$ .

- Q. Consider 1ns clock cycle processor. It consumes 4 cycles for load & store op<sup>n</sup>, 6 cycles for ALU op<sup>n</sup> and 5 cycles for branch op<sup>n</sup>.

The relative frequencies of these instructions are 40%, 40% & 20% resp.

- (a) What is the average inst<sup>n</sup> execution time?
- (b) What is the performance in terms of MIPS (million inst<sup>n</sup> per sec)?
- (c) If the prog contains  $10^6$  Inst<sup>n</sup>s what is the prog execution time?

Ans:-

(a) Avg inst<sup>n</sup> execution time =  $\sum (I_i \times CPI_i)^{\frac{1}{cycle}}$

$$= [(0.4 \times 4) + (0.4 \times 6) + (0.2 \times 5)] \times 1 \text{ ns}$$
$$= 5 \text{ ns.}$$

(b) 1 inst<sup>n</sup>  $\rightarrow$  5 ns.

#?  $\rightarrow$  1 sec.

$$\frac{1}{5 \times 10^{-9}} = 2 \times 10^8 \text{ ins/sec.} = 200 \text{ MIPS.}$$

(c)  ~~$10^6$~~   $\rightarrow 5 \times 10^7 \times 10^6$   
 ~~$10^6$~~   
 $= 5 \times 10^{-3}$   
 $= 5 \text{ ms.}$

## ⑤D HIGH PERFORMANCE ARCHITECTURE

High performance architecture exhibits the concurrency.

→ Concurrency means more than one event execution.

→ Event may be program, subprograms, inst<sup>n</sup> or stage (intrainst).

Concurrency implies parallelism, simultaneous, pipelining.

→ Parallelism means 2 or more event execu<sup>n</sup>

- at the same time period.
- Simultaneous means 2 or more event execution at the same time instance.
- Pipelining means 2 or more event executed in overlapping span.

According to the Flynn's classification, the parallel architecture is divided into 4 categories:

- SISD → Single inst<sup>n</sup> single data stream
- SIMD → Single " multiple "
- MISD → Multiple " single "
- MIMD → Multiple " multiple "

PU: Processing Unit

IS: Inst<sup>n</sup> stream

CU: Control Unit

DS: Data stream

MU: Memory Unit

### SISD:



Inst<sup>n</sup> & data both are stored in same memory. After performing the operation, the result is also stored in same memory. This concept is known as stored program structure.

→ Von Neumann Archi. was designed based on stored program concept.

→ Uniprocessor is using the SISD architecture.  
Eg → VAX, IBM 360, etc.

### SIMD:



→ Array processors use the SIMD archt.  
eg → STERAN, PEPE, MPP etc.

MISD :-



Not yet implemented.

MIMD :-



Multiprogramming system uses MIMD architecture  
Multiprocessor system uses MIMD architecture  
use: Eg. Cray, Cyber etc.

To improve the performance of uniprocessor system pipelining is used.

PIPELINING :-

Function of the pipeline  
→ If pipe off is connected as input to the another pipe.  
Defn:- ~~Accessing~~ Accepting the new inputs at one end before previously accepted input appears as an off at the other end.

Def<sup>n</sup> states that insert the new IP before completion of old inputs. That means new IPs are executed alongwith the old inputs. So pipeline allows the overlapping execution.

Non pipeline allows the non-overlapping execution i.e. new inputs are inserted only after completion of old inputs.

The overlapping execution sequence can be represented using space time diagram.

- In this diagram  $x \rightarrow$  cycles,  $y \rightarrow$  stages.



The successful char. of pipelines is that in every new cycle, new input must be inserted into pipeline.

$$\therefore \text{cos} = 1.$$

Designing of the pipeline - IP end

→ Pipe has 2 ends. ← off end  
Between the inp & outputs ends, its input & output pipe, is interconnected

→ Interface registers are used b/w the stages  
→ to hold the intermediate results & they are also known as buffer / latch, pipelining register.

All the stages along with interface registers are controlled by the common clock.



Consider 4 segment pipeline used to process the 4 tasks.

Stages:  $S_1, S_2, S_3, S_4$   
IIP :  $T_1, T_2, T_3, T_4$

| (Op) |   | $S_4$ |   |       | $T_1$ | $T_2$ | $T_3$ | $T_4$ | ... | - |
|------|---|-------|---|-------|-------|-------|-------|-------|-----|---|
| (up) |   | $S_3$ |   | $T_1$ | $T_2$ | $T_3$ | $T_4$ |       |     |   |
|      |   | $S_2$ |   | $T_1$ | $T_2$ | $T_3$ | $T_4$ |       |     |   |
|      |   | $S_1$ |   | $T_1$ | $T_2$ | $T_3$ | $T_4$ |       |     |   |
| 1    | 2 | 3     | 4 | 5     | 6     | 7     | 8     | 9     |     |   |

4 tasks  
= 7 cycles

Sequential:

| $S_4$ |   | 1 | 2 | 3 | $T_1$ | 5 | $T_2$ | 7 | $T_3$ | 9  | $T_4$ | 11 |
|-------|---|---|---|---|-------|---|-------|---|-------|----|-------|----|
| $S_3$ |   |   |   |   | $T_1$ |   | $T_2$ |   | $T_3$ |    | $T_4$ |    |
| $S_2$ |   |   |   |   | $T_1$ |   | $T_2$ |   | $T_3$ |    | $T_4$ |    |
| $S_1$ |   |   |   |   | $T_1$ |   | $T_2$ |   | $T_3$ |    | $T_4$ |    |
| 1     | 2 | 3 | 4 | 5 | 6     | 7 | 8     | 9 | 10    | 11 | 12    | 13 |

4 tasks  $\rightarrow$  16 cycles.

Performance evaluation of pipeline processor

- Consider  $k$  segment pipeline used with the clock cycle time  $t_p$  of  $T_p$  used to execute  $n$  tasks.
- The very first task is executed in the pipeline in non-overlapping order ...

It takes  $k$  cycles to complete the operation. The remaining  $n-1$  tasks are emerged from the pipe at the rate of 1 cycle per task.  $\therefore n-1$  tasks consumes  $n-1$  cycles to complete the operation.  $\rightarrow$  so the total time required to process the  $n$  tasks using the  $k$  segment pipeline is

$$ET_{\text{pipe}} = (k+n-1) t_p$$

$$= (k+n-1) t_p.$$

Consider non pipeline processor used to execute the  $n$ -tasks in which each task takes  $t_n$  time to complete the operation.

- $\therefore$  The total time required to complete the  $n$  tasks in non pipeline is

$$ET_{\text{non-pipe}} = n t_n$$

The performance gain of the pipeline processor over the nonpipeline is

$$\text{Speedup (S)} = \frac{\text{Performance}_{\text{pipe}}}{\text{Performance}_{\text{non-pipe}}}$$

$$S = \frac{1/ET_{\text{pipe}}}{1/ET_{\text{non-pipe}}} = \frac{ET_{\text{non-pipe}}}{ET_{\text{pipe}}}$$

$$= \frac{n t_n}{(k+n-1) t_p}$$

when the no. of Tasks increases, then  $n$  becomes too large than  $k-n$ . Then  $k+n-1$  value is approached to  $n$ .

When  $n \gg k$ , then

$$k+n-1 \approx n.$$

Now

$$S = \frac{n t_p}{(k+n-1) t_p} = \frac{n t_p}{n t_p}$$

$$\boxed{S = t_p / t_p}$$

When all the tasks take same no. of cycles then 1 task execution time is also equal to the no. of stages in the pipeline.  $t_n = k$  cycles.

$$t_n = k \cdot t_p$$

Under this cond<sup>n</sup>,  $S = \frac{k t_p}{t_p}$

$$\Rightarrow S = k$$

$k \rightarrow$  no. of stages in pipeline

Also called as pipeline depth.

When the processor is operating with 100% efficiency then maximum speedup depth the pipeline depth.

$$100\% \eta \rightarrow S_{max}$$

$$?, \eta \rightarrow \cancel{S} \cancel{\eta} S$$

$$n_{pipe} = \frac{S}{S_{max}} = \frac{S}{k}$$

The throughput of the pipeline is # tasks processed per total time taken to process the task.

$$T_{pipe} = \frac{\text{# tasks processed}}{\text{total time taken to process the tasks}}$$

$$= n / (k + n - 1) t_p$$

Types of the pipeline:-

(1) Linear pipeline:- This pipeline performs only 1 specific functionality. It consists only the feed forward connections.



Non-linear pipeline: This pipeline can perform multiple functionality. It consists of feed forward & feed backward connections.



Reservation Table is required to use the pipeline stages.

- Synchronous pipeline:-

In this pipeline operations are initiated based on the clock.

- Asynchronous pipeline:-

In this pipeline, operations are initiated based on the hand shaking signals.

Uniform delay pipeline:- In this pipeline all the stages maintains the same amt of delay to complete the assigned operation.



(i)  $t_p = \text{stage delay} = 2\text{ns}$ .

(ii) If buffer delay is included then  
 $t_p = \text{stage delay} + \text{buffer delay}$   
 $= 3\text{ns}$

(iii) Skew or setup time overhead is included. Then

$$t_p = t_p + \text{overhead time.}$$

## Non-uniform delay :-

In this pipeline, different stage delays are used to perform the different stage operation.



- (1)  $t_p = \max \text{ of stage delay}$
- (2)  $t_p = \text{if buffer delay is included then}$   
 $t_p = \max(\text{stage delay}) + 3 \text{ buffer delay.}$
- (3)  $t_p = \max(\text{stage delay} + \text{buffer delay}).$

Q. Consider an inst<sup>n</sup> pipeline which has the speedup factor ~~10~~ while operating with 80% efficiency. What could be # stages present in the pipeline?

$$\Rightarrow \eta = 80\% \quad n = \frac{s}{k} \quad k = \frac{10}{0.8} = 12.5 \quad (13)$$

$$s = 10$$

Q. Consider 2 pipelines A + B where pipeline A is having 8 stages of uniform delay of 2ns. Pipeline B is having 5 stages with the respective stage delays of 2ns, 3ns, 4ns, 1ns, 2ns. How much time is saved when 100 tasks are pipelined using A instead of B.

$$\Rightarrow \text{Pipe A: } K=8$$

$$n=100$$

$$t_p = 2 \text{ ns}$$

$$\text{ETA} = (K+n-1)t_p$$

$$= (8+100-1)2 = 214 \text{ ns.}$$

$$\text{Pipe B: } (B+100-1) \cdot t_p = 104 \times 4 = 416 \text{ ns}$$

$$\text{Time saved} = 416 - 214 = \underline{\underline{202 \text{ ns}}}$$

Q Consider 4 segment pipeline with respective stage delays of 10, 20, 5, 15 ns. What is the approx speedup when very large no. of inst<sup>n</sup> are pipelined?

$$\Rightarrow S = \frac{t_n}{t_p}$$

$$t_p = \max(10, 20, 5, 15)$$

$$t_p = 20 \text{ ns}$$

$$\begin{aligned} (\text{Nonoverlapping execution}) \quad t_n &= s_1 + s_2 + s_3 + s_4 = 10 + 20 + 5 + 15 \\ &= 50 \text{ ns} \end{aligned}$$

$$S = \frac{50}{20} = 2.5$$

$$\eta = \frac{S}{K} = \frac{2.5}{4} = 0.625$$

Q Consider 4 stage pipeline with the following time delays:

$$t_1 = 20 \text{ ns} \quad t_2 = 28 \text{ ns}$$

$$t_3 = 40 \text{ ns}$$

$$t_4 = 15 \text{ ns}$$

The interface regis. used b/w the stages have the delay of 5 ns, what is the speedup?

$$\begin{aligned} \Rightarrow S &= \frac{t_n}{t_p} = \frac{100}{45} \\ &= \underline{\underline{2.2}} \end{aligned}$$

No need of adding buffered delay because there is no need to save intermediate results in non overlapping execution.

Q. Consider 4 stage instruction pipeline where different instructions are sending different amount of time at different stages that is shown below:-

|       | $S_1$ | $S_2$ | $S_3$ | $S_4$ |                  |
|-------|-------|-------|-------|-------|------------------|
| $I_1$ | 1     | 3     | 1     | 2     | $\Rightarrow 7$  |
| $I_2$ | 2     | 1     | 2     | 1     | $\Rightarrow 6$  |
| $I_3$ | 1     | 2     | 1     | 2     | = 6              |
| $I_4$ | 1     | 2     | 2     | 1     | $= \frac{6}{25}$ |

- (i) How many cycles required to complete the above instn?
- (ii) what is the speedup?

$\Rightarrow$  ~~Not~~ Not uniform delay, not non-uniform delay. Special case. Solved using space time diagram.



(i)  $\Rightarrow 12$  cycles.

$$(ii) \Rightarrow S = \frac{ET_{nonpipe}}{ET_{pipe}} = \frac{I_1 + I_2 + I_3 + I_4}{12} = \frac{25}{12} = 2.08$$

Q. Consider 4 stage instr pipeline where different instr are spending different amount of time at different stages that is shown below:-

|    | S1 | S2 | S3 | S4 |
|----|----|----|----|----|
| I1 | 1  | 3  | 1  | 1  |
| I2 | 2  | 1  | 2  | 1  |
| I3 | 1  | 2  | 1  | 2  |
| I4 | 2  | 1  | 2  | 1  |

The following loop is executed on the pipeline

for ( $i=1$ ;  $i \leq 1000$ ;  $i++$ )

{ I1; I2; I3; I4; }

The output of the I2 for  $i=2$  will be available after \_\_\_\_\_ cycles.



(overlapping cycles)  $\leftarrow$  overlapping because it is pipelining  
 Ans  $\Rightarrow 15$  (Called loop level parallelism).

## RISC PIPELINE

To analyse the implementation issues of pipelining, let us consider a successful pipelining as reference.

RISC P<sup>n</sup> is a successful pipelining.  
 RISC processor supports 5 stage instr

pipeline. It is used to execute the inst<sup>n</sup>. RISC processor supports 3 categories of instruct<sup>n</sup>.

(i) Data transfer Inst<sup>n</sup>. LOAD & STORE  
Inst<sup>n</sup> are used to transfer the data between CPU & memory, in which to access the memory indexed addressing mode is used.

The inst<sup>n</sup> format is ~~10addr, 2Op~~

LOAD  $r_0, s(r_1)$ ;  $r_0 \leftarrow M[r_1 + s]$

STORE  $4(r_2), r_3$ ;  $M[r_2 + 4] \leftarrow r_3$

(ii) ALU operations: ALU operations are performed only on registers.

Inst<sup>n</sup> format is

Add  $r_0 r_1 r_2$ ;  $r_0 \leftarrow r_1 + r_2$ .

(iii) Branch op<sup>n</sup>:-

unconditional jump: JMP 2000; PC  $\leftarrow$  seg<sub>TA</sub> Addr

conditional jump :- JNZ  $r_0, 2000$

NZ CMP  $r_0$

T; PC  $\leftarrow$  seg<sub>TA</sub> Addr

F; PC  $\leftarrow$  seg Addr.

To execute the above inst<sup>n</sup>, RISC processor supports 5 stage inst<sup>n</sup> pipeline.

STAGE 1 (Inst<sup>n</sup> fetch):- In this stage inst<sup>n</sup> are fetched from memory based on the program counter. At the end of this stage, the program counter value is incremented to fixed constant.

STAGE 2 (INST DECODE) :- In this stage 2 operations are performed i.e. decode inst<sup>n</sup> & accessing the register file.

This stage also contains the comparator logic to evaluate the branch conditions.

STAGE 3 (EXECUTION) : In this stage ALU operations are performed.

STAGE 4 (Memory Access) : In this, memory read & memory write operations are performed.

STAGE 5 (Write Back) : This stage op<sup>n</sup> is divided into 2 halves.

In the first half, register write operat<sup>n</sup> is performed, in the 2<sup>nd</sup> half, register read operation is performed.



JNZ Me, 2000  
PC < Seg Addr

Cond T0C  
TA : 2000  
Cond: NZ, no  
 $r_0$  = value



Branch inst<sup>h</sup> → 2<sup>nd</sup> stage

ALU + reg. load → 5<sup>th</sup> stage  
reg ← reg store → 4<sup>th</sup> stage

Depend

There are 3 types of dependencies possible in pipeline.

- (i) structural dependency.
- (ii) data dependency
- (iii) control dependency.

Dependency creates the extra cycles in the pipeline. Extra cycle without operation (New i/p initiation) called as stall.

Structural dependency :- This dependency is present in the pipeline due to the resource conflict. The resource may be a memory, register or functional unit.

|    | CC1 | CC2 | CC3 | CC4 | CC5 |
|----|-----|-----|-----|-----|-----|
| I1 | Mem | ID  | ALU | Mem |     |
| I2 |     | Mem | ID  | ALU |     |
| I3 |     |     | Mem | ID  |     |
| I4 |     |     |     | Mem |     |

In the above sequence, inst<sup>h</sup> I1 & I4 both are trying to access the same resource memory. This is known as memory conflicts.

Conflict is an unsuccessful operation. So to make it successful, keep inst's 4 into the wait until the resource becomes available i.e. this waiting creates the stalls in the pipeline.

|    | CC1 | CC2 | CC3 | CC4 | CC5 | CC6   | CC7 | CC8 |
|----|-----|-----|-----|-----|-----|-------|-----|-----|
| I1 | mem | RD  | ALU | mem | WB  |       |     |     |
| I2 |     | mem | ID  | mem | mem |       | WB  |     |
| I3 |     |     | mem | ID  | ALU | mem   | WB  |     |
| I4 |     |     |     | x   | x   | x     | mem |     |
|    |     |     |     |     |     | stall |     |     |

To minimise the no. of stalls in pipeline, due to the structural dependency, renaming concept is used.

Renaming states that use the independent memory modules to store the instructions & data separately called as code memory & data memory. Use the code memory in stage 1 & data memory in stage 4. Without conflict, inst 1 fetches the data from data memory and insts 4 & 5 fetched from codememory.

|    | CC1 | CC2 | CC3 | CC4 | CC5 | CC6 |
|----|-----|-----|-----|-----|-----|-----|
| I1 | CM  | ID  | ALU | DM  | WB  |     |
| I2 | -   | CM  | ID  | ALU | DM  | WB  |
| I3 | -   | -   | CM  | ID  | ALU | DM  |
| I4 | -   | -   | -   | CM  | ID  | ALU |
| I5 | -   | -   | -   | -   | CM  | ID  |
| I6 | -   | -   | -   | -   | -   | CM  |

Q1 ~~04~~

1.  $H = 02$

2.  $L = 05$

3.  $L_2 04(NZ)$   $L = 03, 02, 01, 00$   
 $(NZ)$       2      3      4      false  
 $\underline{1}$

$H = 01, 00$   
 $\downarrow$   
 $\text{false}$   
 $(1)$

Ans  $\rightarrow 4, 1.$

Q2  $L = 04, 3, 2, 1, 00$

$H = 0 \rightarrow L = \underbrace{255 \dots 0}_{255 \text{ True.}}$

$\therefore 255 + 4 = (259, +1)$

Q3: TOC AM = base + relative

Q4  $MOV B, #0 \quad B \leftarrow 0$

$MOV C, #8 \quad C \leftarrow 8$

Z:  $CMP C, #0 \quad 8 \text{ cmp } 0(NZ)$   
 $JZ X \quad (\text{false})$

$SUB C, #1 \quad C \leftarrow 7$

$RRC A, #1 \quad [CA_7A_6A_5A_4A_3A_2A_1], [R_0] \xrightarrow{\text{after } 8} [A_6A_5A_4A_3A_2A_1]$

$JC Y;$

$JMP Z$

Y:  $Add B, H1$

$JMP Z;$

X:  $\leftarrow$  To see original A, ignore  $RRC A, #1$

Q6  $\rightarrow$  TOC  $\rightarrow$  relative + base.

Q7  $\rightarrow$   $I_1: 1000 - 7$   
 $I_2: 8 - 11$   
 $I_3: 12 - 15$   
 $I_4: 16 - 23$   
 $I_5: 24 - 27$

Q9  $\rightarrow$

| F+D            | OF+PD |
|----------------|-------|
| $2w \times 2c$ | 3     |
| $1 \times 2$   | 3     |
| $1 \times 2$   | 1     |
| $2 \times 2$   | 3     |
| $1 \times 2$   | -     |
| <hr/>          | <hr/> |
| 14             | 10    |



= 24 cycles.

$$A[R_0] \leftarrow A[R_0] + @B$$

$$\downarrow$$

$$M[A+R_0] \leftarrow M[A+R_0] + M[B]$$

1 reg ref - Index      1 reg. ref = Index      1 Mref  $\rightarrow$  EA  
 1arith  $\rightarrow$  A      1arith  $\rightarrow$  A      1 Mref  $\rightarrow$  read  
 1 mem ref  $\rightarrow$  write      1 mem ref  $\rightarrow$  read

1 mem ref  $\rightarrow$  A      1 mem ref  $\rightarrow$  B      1 Mref  $\rightarrow$  read

(20)



4728 35  
⑤

(21)

$$R_S \leftarrow 1$$

$$[R_D \leftarrow 1] \quad 1000 + R_S = 1001 = ①$$

$$R_D \leftarrow 1001$$

$$[0 + 1001] \leftarrow 20$$

20  
1001

(22)

Before replacing the register, take the existing value and compare the remaining program instruction to check whether it's used as i/p or not.

- If it is not used at input, replace register content otherwise verifies the i/p as old i/p or new i/p.
- Replace register when it is a new i/p and don't replace register when it is old output.

(23)

$$a = 1$$

$$b = 10$$

$$c = 20$$

$$d = a+b = 11$$

$$e = c+d = 31$$

$$f = c+e$$

$$b = c+e$$

| reg1                                  | reg2                                  | reg3                                  |
|---------------------------------------|---------------------------------------|---------------------------------------|
| <input checked="" type="checkbox"/> a | <input checked="" type="checkbox"/> b | <input checked="" type="checkbox"/> c |
| <input checked="" type="checkbox"/> d | <del>f = b</del>                      | <del>b</del>                          |
| <input checked="" type="checkbox"/> d |                                       |                                       |

Chap 2 Q 20

CISC

|         |               |
|---------|---------------|
| 1       | 1             |
| 1       | 1             |
| 40      | MC $\uparrow$ |
| 6 times |               |
| 1       | 1             |
| MC      | MC            |

RISC  $\rightarrow$  LPI. cycles per inst?

$$S = \frac{ET_{CISC}}{ET_{RISC}} = \frac{42}{15} = 2.8$$

12. #inst<sup>n</sup> = 256

# Mop<sup>n</sup> = 8

Total = 2048 (256x8)  
= CM

CM size = 2048 Mop<sup>n</sup>  
HCW



11  $\rightarrow$  BC | f | CS | CM<sub>A</sub>  
— — 125 10  $\Rightarrow$  135.

10  $\rightarrow$  PC<sub>out</sub>, R<sub>in</sub>, MAR<sub>in</sub> change  $\frac{PC \leftarrow M[R_0]}{\times}$

MAR<sub>out</sub>, MBR<sub>in</sub>

MBR<sub>out</sub>, PC<sub>in</sub>

8  $\rightarrow$  X | Y | Mop<sup>n</sup>  $\Rightarrow$  26 bits

Nxt Mop<sup>n</sup> Addr \$  
↓  
3bit

$\Rightarrow X + 3 + 13 = 26 \Rightarrow X = 10$

Y = 3  
CM > 2<sup>10</sup> cells

= 1024 CW

$\Rightarrow 1024 \times 26 \text{ Bits}$

$(\frac{1024 \times 26}{8}) \text{ bits}$

To implement the BBS Inst<sup>n</sup> by using the alternative stmt, make sure that mask must contain only 1 bit as 1 that itself based on the position & rest bits must be zero. During AND op<sup>n</sup>, corresponding bit value is maintained by temp

## 6D) Data Dependency:-

→ Consider the program segment where the instruction  $j$  follows inst $n$   $i$  in the program order i.e.

$i : \text{inst}^n$   
 $j : \text{inst}^n$   
 $\vdots$

- Data dependency exists when Inst $n$   $j$  tries to read data before inst $n$   $i$  writes it.

→ Data dependency existing between adjacent inst $n$  is called true data dependency.

Eg:  $I_1 : \text{Add } r_0, r_1, r_2$   
 $I_2 : \text{Mul } r_3, r_0, r_4$

- When the above inst $n$  are executed in non overlapping order, there is no dataloss problem because  $I_2$  is inserted after completion of  $I_1$ .

- When above inst $n$  are executed in pipeline,  $I_2$  is inserted before completion of  $I_1$ . So  $I_2$  tries to read  $r_0$  before  $I_1$  writes it.  
 $\therefore$  old data is injected into the process.

- To make it correct, keep  $I_2$  in waiting until completion of  $I_1$ .

- This waiting creates "stalls" in the pipeline.

- To detect the data dependency condition, some kind of database will be maintained in the decoding stage.

- The dB contains following fields:-

| Serial No | Fun Unit | Dest $n$ | Independent | Indep $n$ S1 | Indep $n$ S2 | Dep $n$ S1 | Dep $n$ S2 | Status |
|-----------|----------|----------|-------------|--------------|--------------|------------|------------|--------|
|-----------|----------|----------|-------------|--------------|--------------|------------|------------|--------|

| SNO | Fun Unit | Dest $n$ | Independent | Indep $n$ S1 | Indep $n$ S2 | Dep $n$ S1 | Dep $n$ S2 | Status |
|-----|----------|----------|-------------|--------------|--------------|------------|------------|--------|
|-----|----------|----------|-------------|--------------|--------------|------------|------------|--------|

|                |     |                |                |                |   |   |   |                          |
|----------------|-----|----------------|----------------|----------------|---|---|---|--------------------------|
| I <sub>1</sub> | ADD | r <sub>0</sub> | r <sub>4</sub> | r <sub>2</sub> | - | - | - | 0 $\Rightarrow$ EX stage |
|----------------|-----|----------------|----------------|----------------|---|---|---|--------------------------|

|                |     |                |   |                |                |     |   |                            |
|----------------|-----|----------------|---|----------------|----------------|-----|---|----------------------------|
| I <sub>2</sub> | MUL | r <sub>3</sub> | - | r <sub>4</sub> | r <sub>0</sub> | (I) | - | 0 $\Rightarrow$ NOEX stage |
|----------------|-----|----------------|---|----------------|----------------|-----|---|----------------------------|

No further fetch are performed because Is is not passed to Ex stage & there is a stall. The above algo is called "TOMASULO Algo."

Status 0 means the execution is not yet completed. Execution completed  $\Rightarrow$  status 1.

|                | CC1 | CC2                                          | CC3 | CC4 | CC5 | CC6 |
|----------------|-----|----------------------------------------------|-----|-----|-----|-----|
| I <sub>1</sub> | IF  | ID<br>+                                      | EX  | MA  | WB  |     |
|                |     | S: r <sub>1</sub> , r <sub>2</sub><br>(I)(I) |     |     |     |     |
|                |     | D: r <sub>10</sub>                           |     |     |     |     |
|                |     | r <sub>4</sub> = value                       |     |     |     |     |
|                |     | r <sub>2</sub> = value                       |     |     |     |     |
| I <sub>2</sub> | IF  | ID                                           | X   | X   | EX  |     |
|                |     | *                                            |     |     |     |     |
|                |     | S: r <sub>0</sub> , r <sub>4</sub><br>(O)(I) |     |     |     |     |
|                |     | D: r <sub>3</sub>                            |     |     |     |     |
|                |     | r <sub>4</sub> = value                       |     |     |     |     |
| I <sub>3</sub> | IF  | X                                            | X   | ID  |     |     |
| I <sub>4</sub> |     |                                              |     |     |     | IF  |

CC4 & CC5 are called "stall cycles" because no new input is accepted due to presence of I<sub>2</sub> in IF stage.

- To minimise the no. of stalls, due to data dependencies, one hw technique is used called as operand forwarding. It is also called as bypassing or short-circuit.
- Operand forwarding states that access the new data from interface register before updating the register file.

Eg. I<sub>1</sub>: ADD r<sub>0</sub>, r<sub>4</sub>, r<sub>2</sub> ← True data dependency  
 I<sub>2</sub>: SUB r<sub>3</sub>, r<sub>0</sub>, r<sub>4</sub>  
 I<sub>3</sub>: MUL r<sub>5</sub>, r<sub>0</sub>, r<sub>6</sub> → dependency but not special name.  
 I<sub>4</sub>: DIV r<sub>7</sub>, r<sub>0</sub>, r<sub>8</sub> → update of register files



CC5 divided into two halves. 1st half → Decoding of I<sub>4</sub> & update of register file as I<sub>1</sub>.  
 2nd half I<sub>4</sub> → reading updated value from r<sub>0</sub>.

- In the above sequence diagram due to operand forwarding technique, the dependent inst<sup>n</sup> (I<sub>2</sub> & I<sub>3</sub>) are accessing the data from interface registers before updating the register file.  
 ∴ NO wait No stall.

- During execution of I<sub>4</sub>, it reads r<sub>0</sub> value from register file. At that time, r<sub>0</sub> contains the new value because CC5 cycle is divided into 2 halves. In the first half of CC5

$I_4$  updates  $RO$  &  $I_4$  decodes ins<sup>n</sup>.  
In the second half of CC5  $I_4$  reads the register file. At that time  $RO$  contains new value only.

## CONTROL DEPENDENCY

Program consists of mixing of 3 categories of inst<sup>n</sup>.

(i) Data transfer (ii) ALU (iii) Branch

During the execution of branch inst<sup>n</sup>, the control will be transferred from 1 location to another.

- When the program is executed in non overlapping order, due to branch inst<sup>n</sup>, no functionality goes missing.
- When the program is executed in pipeline, due to the branch operations, some funct<sup>n</sup> will be missing. To make it correct.
- Unwanted ins<sup>n</sup> must be flushed out of the pipeline. This operation creates the stalls in the memory pipeline.

## UNCONDITIONAL BRANCH:

Consider the prog segment.



|                                                                     | CC1             | CC2             | CC3             | CC4                                                   | CC5             | CC6             |
|---------------------------------------------------------------------|-----------------|-----------------|-----------------|-------------------------------------------------------|-----------------|-----------------|
| I <sub>1</sub>                                                      | IF<br>PC = 1000 | ID              | EX              | MA                                                    | WB.             |                 |
| I <sub>2</sub>                                                      |                 | IF<br>PC = 1001 | ID              | EX                                                    | MA              | WB              |
| I <sub>3</sub>                                                      |                 |                 | IF<br>PC = 1002 | ID<br>uncond.<br>TOC<br>TA = 2000<br>PC: 1003<br>2000 | EX              | MA              |
| I <sub>4</sub><br>needs to be<br>flush/<br>freeze<br>if<br>unwanted |                 |                 |                 | IF<br>PC = 1004<br>stall                              | ID              |                 |
| BI <sub>1</sub>                                                     |                 |                 |                 |                                                       | IF<br>PC = 2001 | ID              |
| BI <sub>2</sub>                                                     |                 |                 |                 |                                                       |                 | IF<br>PC = 2002 |

I<sub>3</sub> completes at CC4 because uncond. TOC completes at ID stage.

1. Inst<sup>n</sup> fetch stage is overlapping with Inst<sup>n</sup> decode stage. Before decoding fetched ins<sup>n</sup> the next sequential ins<sup>n</sup> is already inserted into the pipeline.

If decoding stage decodes the fetched ins<sup>n</sup> as data transfer or data manipulation then next sequential addr becomes the wanted instruction.

When decoding stage decodes the ins<sup>n</sup> as uncond. TOC then the next sequential ins<sup>n</sup> becomes the unwanted ins<sup>n</sup>.

To maintain the correct functionality while executing the program, there is a need of removing the unwanted ins<sup>n</sup> from the pipeline.

The process of removing the unwanted ins<sup>n</sup> from the pipeline is called as flush.

- On freeze operation creates the stalls in the pipeline.
- The no. of stall cycles created due to branch op<sup>n</sup> is called as "branch penalty".

Branch penalty = At what stage the target Addr -1 is available

NOTE: The risk pipeline branch penalty is always 1 because the target addr is available at 2<sup>nd</sup> stage.

The no. of stall cycles created during the execution of program, due to branch operation is calculated as

# stalls from branches =  $\frac{\text{branch frequency} * \text{branch penalty}}{\text{Prog GT}}$

### CONDITIONAL BRANCH:-

Consider the program segment.

1000 : I<sub>1</sub>

with true :-

1001 : I<sub>2</sub> (JNZ r<sub>0</sub>, 2000)

I<sub>1</sub> I<sub>2</sub> B<sub>1</sub> B<sub>2</sub>

1002 : I<sub>3</sub>

with false

1003 : I<sub>4</sub>

I<sub>1</sub> I<sub>2</sub> I<sub>3</sub> I<sub>4</sub> ...

2000 : B<sub>1</sub>

2001 : B<sub>2</sub>

|                 | CC1                    | CC2                                                                       | CC3 | CC4 | CC8 | CC6 |
|-----------------|------------------------|---------------------------------------------------------------------------|-----|-----|-----|-----|
| I <sub>1</sub>  | IF<br>PC=1001          | ID                                                                        | EX  | MA  | WB  |     |
| I <sub>2</sub>  | IF<br>PC>1002          | ID<br>Cond. TOC<br>NZCNPZ <sub>0</sub><br>↓<br>(T)<br>PC:<br>1003<br>2000 | EX  | MA  |     |     |
| I <sub>3</sub>  | IF<br>PC>1003<br>Stall | ID                                                                        | EX  |     |     |     |
| BI <sub>1</sub> |                        | IF<br>PC:2001                                                             | ID  |     |     |     |
| BI <sub>2</sub> |                        | IF<br>PC:2002                                                             |     |     |     |     |

- When decoding stage decodes fetchdins as cond. TOC with true condition then next sequential inst<sup>n</sup> becomes unwanted inst<sup>n</sup>. So, flushout the "unwanted" inst<sup>n</sup>. It creates stalls.
- If cond<sup>n</sup> evaluates to false, the next <sup>wanted</sup> inst<sup>n</sup> becomes "wanted" inst<sup>n</sup>, so no flush no stall.

To minimise the no. of stalls in pipeline due to branch operations, one h/w techniques used i.e. branch prediction buffer or Branch target buffer or loop buffer. (BTB / BTB/LB).

BTB :- BTB is a high speed buffer maintained at the inst<sup>n</sup> fetch stage to predict the target addresses. Branch target buffer consists of following ~~fields~~ fields.

|          |             |
|----------|-------------|
| PC entry | expected PC |
|----------|-------------|

→ During execution of the program when CPU identifies stalls in the pipeline due to the true condition, (first addr is target addr). Then the flush out inst<sup>n</sup> address and target address both are inserted into the PC entry and expected PC fields respectively.

⇒ from the next fetch onwards the PC value is compared with the PC entry field.

→ If none of the values matches, fetch the inst<sup>n</sup> based on current PC, if any one entry is matching then fetch the inst<sup>n</sup> based on the corresponding expected PC value.

1000: MOV R0, #4  
 1001: ADD R1, R2  
 1002: DJNZ R0, 1001  
 1003: MOV

PC: 1000: R0 = 4

1001: +

1002: R0 = 3(N2) → T

flush 1003 1001: +

1002: R0 = 2(N2) → T

1003: +

No flush 1001: +

No stall 1002: R0 = 1(N2) → T

No flush 1003 1001: +

1002: R0 = 0(N2) → T

No flush 1003

flush 1003 1003

| PC entry | PC Expected |
|----------|-------------|
| 1003     | 1001        |
| 1001     |             |

insert

← delete

Useful only in loops.

{ 1st flush → insertion } in expected - entry table.  
 { last flush → deletion }

- When there is a stall due to the false cond<sup>n</sup> delete the ~~entry~~ expected PC value from the buffer & fetch the next inst<sup>n</sup> based on the ~~current~~

corresponding PC value.

### DELAYED BRANCH :-

- It is a compiler technique. It is used to preserve the execution part in the pipeline.
- When there is an unwanted inst<sup>n</sup> present in the pipeline that slot is fixed as the delayed slot.
- When the slot is fixed as delayed then it substitutes the NOP operation.

|                 | CC1 | CC2     | CC3        | CC4        | CC5 | CC6 |
|-----------------|-----|---------|------------|------------|-----|-----|
| I <sub>1</sub>  | If  | ID<br>* | EX         | MA         | WB  |     |
| I <sub>2</sub>  |     | If      | ID<br>load | EX         | MA  | WB  |
| I <sub>3</sub>  |     |         | If         | ID         | EX  | MA  |
| I <sub>4</sub>  |     |         |            | uncond JOC |     |     |
| BI <sub>1</sub> |     |         |            | If(NOP)    | ID  | EX  |
| BI <sub>2</sub> |     |         |            |            | If  | ID  |

CC<sub>1</sub> is never delayed slot (1<sup>st</sup> inst<sup>n</sup> in the falling path)

CC<sub>2</sub> is a delayed slot

CC<sub>3</sub> " . " " → forwarded because of I<sub>1</sub>, ID

CC<sub>4</sub> " . " " → (fixed) → uncond JOC

CC<sub>5</sub> " never delayed slot (1<sup>st</sup> inst<sup>n</sup> in taken path)

• CC<sub>6</sub> " delayed slot (depends on decoding of ID)  
(wanted or unwanted depends on decoding unit)

### INSTRUCTION SCHEDULE :-

The processor executes the program in sequential order called as inorder execution.

In the inorder execution sequence, if any inst<sup>n</sup> is data dependent on other inst<sup>n</sup>, the

remaining inst<sup>n</sup>s are also sharing the stall cycles of the dependent inst<sup>n</sup>.

|                  |                                                                     |                                                      |
|------------------|---------------------------------------------------------------------|------------------------------------------------------|
| Eg:              | I <sub>1</sub> : Add r <sub>0</sub> , r <sub>4</sub> r <sub>2</sub> | Add r <sub>0</sub> , r <sub>1</sub> r <sub>2</sub>   |
| I <sub>2</sub> : | Mul r <sub>3</sub> , r <sub>0</sub> , r <sub>4</sub>                | Mul r <sub>3</sub> , r <sub>0</sub> , r <sub>4</sub> |
| I <sub>3</sub> : | Sub r <sub>4</sub> , r <sub>5</sub> , r <sub>6</sub>                | Sub \$, r <sub>5</sub> , r <sub>6</sub>              |
| I <sub>4</sub> : | Div r <sub>3</sub> , r <sub>7</sub> , r <sub>8</sub>                | Div T, r <sub>7</sub> , r <sub>8</sub>               |

In order Execution sequence :- I<sub>1</sub>-I<sub>2</sub>-I<sub>3</sub>-I<sub>4</sub>

- In the above example I<sub>2</sub> is data dependent on I<sub>1</sub>, so it undergoes wait until completion of I<sub>1</sub>. Due to this, I<sub>3</sub> & I<sub>4</sub> are also sharing stall cycles even they are independent.
- To avoid the above problem, there is a need of scheduler instructions.

Scheduling causes the out of order exec<sup>n</sup> (reorder exec<sup>n</sup>). I<sub>1</sub>-I<sub>3</sub>-I<sub>4</sub>-I<sub>2</sub> (out of order exec<sup>n</sup>).

- The out of order exec<sup>n</sup> creates 2 more dependencies in the pipeline
  - (i) Antidependency
  - (ii) Output dependency
- Antidependency exists when the inst<sup>n</sup> j tries to write the data before inst<sup>n</sup> i reads it.  
e.g:- I<sub>3</sub> modifies r<sub>4</sub> before I<sub>2</sub> reads it.
- Output dependency exists when inst<sup>n</sup> j tries to write the data before inst<sup>n</sup> i writes it.  
e.g:- I<sub>4</sub> executed before I<sub>3</sub>, so I<sub>4</sub> writes r<sub>3</sub> before I<sub>3</sub> writes it.

To handle the anti and output dependencies register renaming is used. Register renaming states that use some temp. regis storage (reorder buffer) to store the out of order exec<sup>n</sup> outputs. After receive the accept<sup>n</sup> signal from the dependent inst<sup>n</sup>, update the reg. file with reorder buffer contents.

### HAZARDS :-

- Hazard is a delay. Delay creates the extra cycles. Extra cycle without operation is called stall.

There are 3 types of data hazards.

- (i) RAW Hazard (Read after write)
- (ii) WAR Hazard (write after read)
- (iii) WAW Hazard (write after write)

RAW hazard is created when the inst<sup>n</sup> 'j' tries to read the data before inst<sup>n</sup> 'i' writes it (true data dependency).

WAR H is created when inst<sup>n</sup> j tries to write the data before inst<sup>n</sup> i reads it (antidependency).

WAWH is created when inst<sup>n</sup> j tries to write data before inst<sup>n</sup> i writes it (output dependency).

### Performance evaluation of pipelining with stalls :-

$$\begin{aligned}
 S &= \frac{\text{Avg ins}^n \text{ exec}^n \text{ of non-pipe}}{\text{Avg ins}^n \text{ exec}^n \text{ of pipe}} \\
 &= \frac{\text{CPI}_{\text{NP}} * \text{cycletime}_{\text{NP}}}{\text{CPI}_{\text{P}} * \text{cycletime}_{\text{P}}}
 \end{aligned}$$

$$\frac{\text{Avg Ins}^n E_{T_{\text{NP}}}}{\text{Avg Ins}^n E_{T_{\text{P}}}}$$



The ideal CPI<sub>P</sub> is almost 1 (always).

$$S = \frac{CPI_{np} + \text{cycletime}_{np}}{(1 + \# \text{stalls}/\text{inst}) * CPI_p}$$

↓ Ideal due to dependencies ↓

When all stages are perfectly balanced, then both cycle times are equal.

$$\therefore S = \frac{CPI_{np}}{(1 + \# \text{stalls}/\text{inst})} \quad [\because CPI_{np} = CPI_p]$$

When all insts take same # cycles, then 1 inst's exec is also equal to # stages in pipeline.

$$\therefore S = \frac{\text{pipeline depth}}{1 + \# \text{stalls}/\text{inst}}$$

When system operates with 100% efficiency then no stalls exist.

$$\Rightarrow S = \boxed{\text{pipeline depth}}$$

## Pg 17 Q2.

$$\text{cycle time} = 10 \text{ ns.}$$

$$\text{Branch freq} = 20\%,$$

$$\text{Branch penalty} = 5 - 1 = 4 \text{ cycles.}$$

$$\text{Avg inst's ET} = (1 + \# \text{stalls}/\text{inst}) \times \text{cycletime}$$

$$= [1 + (\text{Branch freq} * \text{Branch penalty})] \times \text{cycletime}$$

$$= (1 + (0.2 \times 4)) \times 10 \text{ ns}$$

$$= 18 \text{ ns.}$$

$$1 \text{ inst} \rightarrow 18 \text{ ns}$$

$$1 \text{ s} = \frac{1}{18} \times 10^9 \text{ ns/sec}$$

$$= 55 \text{ MIPS}$$

Q3 Pg 17 :-

|                    | IF             | ID             | EX                      | WB             |                |                |                |  |
|--------------------|----------------|----------------|-------------------------|----------------|----------------|----------------|----------------|--|
| I <sub>1</sub>     | 1              | 1              | 1                       | 1              |                |                |                |  |
| I <sub>2</sub>     | 1              | 1              | 1                       | 3              | 1              |                |                |  |
| I <sub>3</sub>     |                |                |                         |                |                |                |                |  |
| CC1                | IF             | ID             | EX                      | WB             |                |                |                |  |
| I <sub>1</sub>     | IF             | ID             | EX                      | WB             |                |                |                |  |
| I <sub>2</sub>     |                | IF             | ID → EX                 | EX             | EX             | WB             |                |  |
| I <sub>3</sub>     |                | IF             | ID<br>(I <sub>2</sub> ) | //             | //             | EX             | WB             |  |
| ALU0 <sup>WB</sup> |                |                | I <sub>1</sub>          | -              | -              | I <sub>2</sub> | I <sub>3</sub> |  |
| ex                 |                |                | I <sub>1</sub>          | I <sub>2</sub> | I <sub>3</sub> | I <sub>2</sub> | I <sub>3</sub> |  |
| ID                 | I <sub>1</sub> | I <sub>2</sub> | I <sub>3</sub>          | -              | -              |                |                |  |
| IF                 | I <sub>1</sub> | I <sub>2</sub> | I <sub>3</sub>          | .              |                |                |                |  |

← cycle diagram

space time  
← diagram

Q4 Pg 18 :-

|                                   | IF | ID | EX | MA  | WB      |    |    |  |
|-----------------------------------|----|----|----|-----|---------|----|----|--|
| I <sub>1</sub>                    | 1  | 1  | 1  | 1   | 1       |    |    |  |
| I <sub>2</sub>                    | 1  | 1  | 1  | 0   | 1       |    |    |  |
| I <sub>3</sub>                    | 1  | 1  | 1  | 1   | 1       |    |    |  |
| CC1                               | IF | ID | EX | MA  | WB      |    |    |  |
| I <sub>1</sub>                    | IF | ID | EX | MA  | WB      |    |    |  |
| I <sub>2</sub>                    |    | IF | ID | /// | EX      | MA | WB |  |
| I <sub>3</sub>                    |    |    | IF | /// | 1D → EX | MA | WB |  |
| If 1 <sup>st</sup> inst ALU → CC7 |    |    |    |     |         |    |    |  |

## MEMORY ORGANIZATION :-

Memory org<sup>n</sup> shows effective utilization of system supported memories.

To satisfy that obj. memory hierarchy design is used.

Memory Hierarchy uses the following specifications :-

|                  | 1                   | 2                   | 3                   | 4         |
|------------------|---------------------|---------------------|---------------------|-----------|
| Name:            | Reg.                | Cache               | MM                  | SM        |
| Typical Size:    | <1KB                | <16MB               | <16GB               | >100GB    |
| Implementation:  | custom<br>multiport | SRAM<br>(flip-flop) | DRAM<br>(capacitor) | magnetic  |
| Access time(ns): | 0.25-0.5            | 0.5-25              | 80-250              | 50,00,000 |
| Bandwidth(MB/s): | 20K-11ac            | 5000-10000          | 1000-5000           | 20-150    |
| Managed by :     | compiler            | H/W                 | OS                  | OS        |
| Backed by :      | cache               | MM                  | SM                  | CD        |

The mem hierarchy design shows access order of system supported memories as



- During the exec<sup>n</sup> of program, when CPU encounters memory reference ins<sup>n</sup>, it generates memory request & prefers to cache mem.
- If data is available in cache, op<sup>n</sup> is hit. So, the respective data is transferred to CPU in terms of words
- If op<sup>n</sup> is miss in cache mem., then reference is forwarded to main memory.
- When op<sup>n</sup> is hit in main mem., it transfers the data to the cache memory in the form of blocks and cache to CPU in form of words.
- Otherwise, ref. is forwarded to SM.
- The operation is always hit in SM because it is the last level of memory in hierarchy structure. ∴ hit ratio is always 1 in SM
- ∵ data is transferred from SM to MM in terms of pages, MM to cache in blocks & from cache to CPU in the form of words.
- According to hierarchy design, the data is transferred from higher to lower. ∴ lower level data is always

a subset of higher level data.

- this property is also called as inclusion.
- According to heirarchical design, the CPU can perform the read & write op<sup>n</sup> only the cache memory.
- Cache memory is a storage which holds the image of main memory so CPU performs op<sup>n</sup> on reusable space. This property is called as locality of reference.

### Types of memory organizations:-

Based on way of accessing the system supported memories, the memory org is divided into 2 types:-

- (i) Simultaneous Access memory orgn.
- (ii) Hierarchical access " "

### Simultaneous access:-

- In this org. the CPU is able to access all the levels of memories. In this org. level to level comm. is not allowed.
- CPU refers the level 1 memory to read or write the data.

If there is a miss operation, the reference is forwarded to level 2. If op<sup>n</sup> is hit in level 2 i.e. directly transferred to the CPU without the involvement of level 1. otherwise reference is forwarded to next level & so on.



where  $H_1, H_2, \dots, H_n$  are hit ratios of respective memories.

$$\text{Hit ratio} = \frac{\text{no. of hits}}{\text{Total # accesses}}$$

$T_1, T_2, \dots, T_n$  are access time of respective mem.

- The time required to access 1 operand from memory is called as average access time.

$$T_{avg} = H_1 T_1 + (1-H_1) H_2 T_2 + (1-H_1)(1-H_2) H_3 T_3 + \dots + (1-H_1)(1-H_2)(1-H_3) \dots (1-H_{n-1}) H_n T_n.$$

1 word =  $T_{avg}$ .

? words = 1 sec.

$$\Rightarrow \boxed{\eta, \text{ efficiency} = \frac{1}{T_{avg}} \text{ words/sec}}$$

(Always when cache is used)  
Hierarchical Access :- In this org. the CPU can perform read & write op<sup>n</sup> only on level 1 memory. If there is a miss in level 1, transfer the data from higher level to level 1, then only CPU can ~~do~~ read or write.



The avg access time of hierarchical access org. is

$$T_{avg} = H_1 T_1 + (1-H_1) H_2 (T_1 + T_2) + (1-H_1)(1-H_2) H_3 (T_1 + T_2 + T_3) + \dots + (1-H_1)(1-H_2)(1-H_3) \dots (1-H_{n-1}) H_n \left( \sum_{j=1}^n T_j \right)$$

Q In a 2 level memory, level 1 memory is 5 times faster than level 2 memory. And its access time is 10ns less than average access time. Let Level 1 access time be 20 ns. What is the hit ratio?

→ (Simultaneous)

$L_1 + L_2$

⇒ speed up factor is given

$$S = \frac{T_2}{T_1} = 5$$

$T_1$  faster

$$\Rightarrow T_2 = 5T_1$$

$$\begin{aligned} \textcircled{1} \Rightarrow T_1 &= T_{avg} - 10 \\ &= T_{avg} = T_1 + 10 \end{aligned}$$

$$T_1 = 20 \text{ ns}$$

$$\therefore T_{avg} = 30 \text{ ns}$$

$$\& T_2 = 100 \text{ ns}$$

$$\begin{aligned} T_{avg} &= H_1 T_1 + (1-H_1) H_2 T_2 \\ &= H_1 T_1 + (1-H_1) T_2 \end{aligned}$$

$$30 = H_1 * 20 + (1-H_1) 100$$

$$H_1 = 0.87 \text{ Ans}$$

last level Hit ratio = 1 (always)

Q. 3 level memory has following specifications

| Level | Access time/word | Block size (word) | Hit ratio |
|-------|------------------|-------------------|-----------|
| 1     | 20               | —                 | 0.8       |
| 2     | 50               | 2                 | 0.9       |
| 3     | 100              | 4                 | 1         |

If the required block is not present in  $L_1$ , transfer it from  $L_2$  to  $L_1$ , if not available in  $L_2$ , transfer from  $L_3$  to  $L_2$  +  $L_2$  to  $L_1$ . How long will it take to transfer the blocks

$$\begin{aligned} \text{Ans} \Rightarrow T_1 &= 20 \times 1 = 20 \quad \& T_3 = 100 \times 4 = 400 \\ T_2 &= 50 \times 2 = 100 \end{aligned}$$

$$\begin{aligned}
 T_{avg} &= H_1 T_1 + (1-H_1) H_2 (T_1+T_2) + (1-H_1)(1-H_2) H_3 (T_1+T_2) \\
 &= 0.8 \times 20 + 0.2 \times \cancel{0.9} (120) + (0.2)(0.1) \cancel{1} (520) \\
 &= 16 + 18(120) + 0.02 \times 520 \\
 &= 16 + 21.6 + 10.40 \\
 &= \underline{\underline{48 \text{ ns}}}.
 \end{aligned}$$

## CACHE MEMORY:-

Cache memory is used as the intermediate memory between CPU + main memory.  
 $\therefore$  CPU perform read + write opn only on the cache memory.



Basic elements are

- (i) Memory organization
- (ii) Mapping techniques
- (iii) Replacement algorithms
- (iv) Updating techniques
- (v) multilevel caches

Memory organization: The data is transferred from ~~one~~ main memory to cache memory in form of blocks. So both memories are divided into equal parts based on block size. Each part is called as cache mem. block + main memory block respectively.  
 $\# \text{cache mem. blocks (lines)} = \frac{\text{cache mem size}}{\text{block size}}$

$$\# \text{main memory blocks} = \frac{\text{MM size}}{\text{Block size}}$$

Consider 8 byte cache memory + 2 byte block size.



$$\# \text{lines} = \frac{8B}{2B} = 4$$



Before and after organization of cache memory there is no change in capacity but the internal structure differs. After cache mem. orgn., address can be interpreted as



Consider 16 byte main memory & 2B block size.

Before orgn

$$MM = 16B = 4 \text{ bit addr}$$

|      |   |
|------|---|
| 0000 | B |
| 0001 | B |
| :    | B |
| 1111 | B |

After orgn



Before & after orgn of the MM, there is no change in capacity but the internal structure differs so physical address can be interpreted as





**Data mem size** =  $\# \text{cache lines} * \# \text{bits in each line}$

## MAPPING TECHNIQUES

The process of transferring data from MM to cache memory is known as mapping.

There are 3 mapping techniques used:-

(i) Associative mapping

(ii) Direct mapping

(iii) Set associative mapping.

### Associative mapping:-

In this technique, there is no mapping function used to transfer the data. That means any MM block can be mapped to any cache line.

3. Cache address is not in use.

- The associative cache controller inter-

- interprets the CPU requests based on MM address format.
- During the mapping, the complete data block is transferred to CM along with Tag.



(i) initially empty cache.



Consider the program segment:-

- MOV R0, [0000]
- MOV R1, [1000]
- Add R0, R1

i: CPU generates mem req.



I<sub>2</sub> CPU generates the mem req.



The CPU generated <sup>mem</sup> request is initially referred to associative cache. The assoc. cache controller interprets the request into its known format i.e. [tag] wo.

- The CPU generated Tag is compared with the existing tags in the cache controller.
- If any one is matching, operation is hit. So based on the word offset, the data is transferred into CPU.
- If none of them is matching op<sup>n</sup> is miss. So the reference is forwarded to main memory.
- The main memory controller interprets the request into its known format i.e. [tag Two].
- The tag field is directly connected with the address logic of main memory. So corresponding MM block is enabled & transferred in cache mem. by using associative mapping.

technique. Later data is transferred to CPU.  
The tag memory size is equal to # of cache lines \* # tag bits in each line.

$$\text{Tag Mem size} = 4 \times 3 = 12 \text{ bits}$$

### DIRECT Mapping:-

In this technique memory function is used to transfer the data. The mapping function is

$$K \bmod N = i$$

↓      ↓      ↓  
MM Block#    #cache lines    CM line number  
 $(0-7) \bmod 4 = (0-3, 0-3)$

The above mapping fun shows the relationship b/w MM block & cache mem line.

∴ cache mem. address is in use.

The direct cache controller interprets the CPU generated requests as

physical Addr(4 bits)



|     |                |
|-----|----------------|
| 000 | B <sub>0</sub> |
| 001 | B <sub>1</sub> |
| :   | :              |
| 111 | B <sub>7</sub> |

|          |
|----------|
| Tag   WO |
|----------|



Consider the program segment :-

I<sub>1</sub>: MOV R<sub>0</sub>, [0000]

I<sub>2</sub>: MOV R<sub>1</sub>, [0001]

I<sub>3</sub>: Add R<sub>0</sub>, R<sub>1</sub>

I<sub>4</sub>: CPU generates mem req

I<sub>2</sub>: CPU generates mem req.



Tag memory size = 4 lines × 1 bit = 4 bits.

- Direct cache controller interprets the req. into its known format i.e. [tag|L0|W0]
- The line offset is directly mapped with

the address logic of cache memory... correct  
-onding line is enabled.

- The existing Tag in enabled line is compared with CPU generated tag.
  - when both matches, the op<sup>n</sup> is hit. So based on the word offset data is transferred to CPU. otherwise it is a miss.
- Reference is forwarded to MM.
- The required block is transferred from MM to cache memory based on the direct mapping function.

7D

Disadvantage in direct mapping is that each cache line is able to hold one block at a time. ∴ conflict misses will be increased.



I<sub>1</sub>: MOV R0,[0000] → B<sub>0</sub>, 0<sup>th</sup> Byte, Hit  
I<sub>2</sub>: MOV R0,[1000] → B<sub>4</sub>, " , Miss  
I<sub>3</sub>: MOV R0,[0001] → B<sub>0</sub>, 1<sup>st</sup> Byte, Miss  
I<sub>4</sub>: MOV R0,[1001] → B<sub>4</sub>, " , Miss

### Alternative cache organization

|   |                |                |
|---|----------------|----------------|
| 0 | B <sub>0</sub> | B <sub>4</sub> |
| 1 | B <sub>5</sub> |                |

I<sub>1</sub>: Hit  
I<sub>2</sub>: Miss  
I<sub>3</sub>: H  
I<sub>4</sub>: H

SET-ASSOCIATIVE  
CACHE  
ORGANIZATION

To overcome the disadvantage of direct mapping, alternative cache organization is used. In this, each cache line is able to hold more than 1 block at a time. The alternative cache organization is called as SET ASSOCIATIVE CACHE.

SET ASSOCIATIVE CACHE ORGANIZATION

In this organization, cache memory lines are again divided into equal parts based on no of main memory blocks can be placed at a time at each cache line.

Each part is called set. No. of sets,  $s = \frac{\text{no. of cache lines}}{\# \text{blocks placed at a time in each line.}}$

$$= \frac{N}{p\text{-ways}} - s$$

Consider 2-way set associative cache, with block size 2 bytes.

Before orgn (BB)

|     |   |
|-----|---|
| 000 | B |
| 001 | B |
| 010 | : |
| 011 | : |
| 100 | : |
| 101 | : |
| 110 | : |
| 111 | B |

After orgn into the lines

|    |    |
|----|----|
| 00 | 2B |
| 01 | 2B |
| 10 | 2B |
| 11 | 2B |

After orgn into the sets

$$\# \text{sets} = \frac{4}{2} = 2$$

# lines =  $\frac{8}{2} = 4$   
direct cache

|   |    |    |
|---|----|----|
| 0 | 2B | 2B |
| 1 | 2B | 2B |

Set associative cache.

Set Associative Mapping: In this technique, mapping function is used to transfer the data from main memory to cache. The mapping function is  $k \text{ mod } s = i$  where  $k$  = main memory block number,  $s \doteq \# \text{cache sets}$ .

$i$  = cache memory set no.

The above function shows the relationship b/w MM block & Cache memory set. So Cache memory address is in the use.

The set associative cache controller interprets the CPU generated ~~requests~~ address request

into the following format :-  
Physical Address (9-bit)



Main memory block

---

B0  
[000]

B7  
[111]

$$\begin{array}{c} k \bmod s = i \\ \xrightarrow{k \bmod 2 = 0} \\ k \bmod s = i \\ \xrightarrow{k \bmod 2 = 1} \end{array}$$

set 0

set 1.

Consider the program segment:

- I<sub>1</sub> : MOV R0, [0000]  
I<sub>2</sub> : MOV R4, [1001]  
I<sub>3</sub> : ADD R0, R4



If CPU generates the mem req.

[0000]

↓  
2-way set Assoc.

↓  
Known format





Set associative cache controller interprets the CPU generated request as follows:-

tag Pwo!

offset field is directly with the address logic of cache memory. The respective cache set will be enabled.

Each set contains multiple tags. So to compare the existing tags in the enabled set with the CPU generated tag there is a need of multiplexer.

- If any one tag is matching, spr<sup>n</sup> is hit. So based on the word offset, the data is transferred to the CPU.

If none tags are matching, spr<sup>n</sup> is miss so the reference is forwarded to main memory.

Based on the tag value, the respective block is transferred to the cache memory by using the set associative mapping technique.

Tag memory size  $\Rightarrow$  # sets \* # Blocks / no. of sets  
=  $S * P * \# \text{tag bits}$

$$= 2 * 2 * 2 = 8 \text{ bits}$$

## REPLACEMENT ALGO

When the cache is full, there is a need of replacement algorithms to replace the existing cache blocks with new blocks.

There are 2 replacement algos used:

(i) FIFO (ii) LRU.

In first in first out, replace the cache block with new block which is having the longest time stamp.

In least recently used algo, replace the cache block with new block which is having the less no. of references with longest time stamp.

Q. Consider 4 block cache memory (initially empty) with the following MM block references.

4, 5, 4, 12, 4, 5, 13, 4, 5, 7

Identify the hit ratio with:-

- (i) FIFO (ii) LRU (iii) Direct mapped cache  
(iv) 2 way set assoc. with LRU.

| M  | M  | M  | M  | H  | M  | M  | M  | M |
|----|----|----|----|----|----|----|----|---|
| 4  | 4  | 4  | 4  | 4  | 13 | 13 | 13 | B |
| 5  | 5  | 5  | 5  | 5  | 4  | 4  | 4  |   |
| 7  | 7  | 7  | 7  | 7  | 7  | C  | C  |   |
| 12 | 12 | 12 | 12 | 12 | 12 | 12 | 12 | 7 |
| 4  | 5  | 7  | 12 | 4  | 5  | 13 | 4  | 5 |

$$\text{Hit ratio} = \frac{2}{10} = 0.2$$

(ii) LRU

|   |   |   |    |   |    |    |   |   |   |
|---|---|---|----|---|----|----|---|---|---|
| 4 | 5 | 7 | 12 | 4 | 5  | 13 | 4 | 5 | 7 |
| 5 |   |   |    | 5 |    |    |   |   |   |
| 7 |   |   |    |   | 13 |    |   |   |   |
| 4 | 5 | 7 | 12 | 4 | 5  | 13 | 4 | 5 | 7 |

$$\text{hit ratio} = \frac{4}{10} = 0.4$$

(iii) direct mapped

|   |   |    |   |
|---|---|----|---|
| 0 | 4 | 12 | 4 |
| 1 | 8 | 13 | 5 |
| 2 |   |    |   |
| 3 | 7 |    |   |

m, m, m, m, m, h, m, h, m, h

$$\text{hit ratio} = \frac{3}{10} = 0.3$$

(iv) 2-way set asso with LRU :-

|   |   |    |   |   |   |    |   |   |    |   |   |   |
|---|---|----|---|---|---|----|---|---|----|---|---|---|
| 0 | 4 | 12 | 4 | 5 | 7 | 12 | 4 | 5 | 13 | 4 | 5 | 7 |
| 1 | 8 | 13 | 5 |   |   |    |   |   |    |   |   |   |
| 2 |   |    |   |   |   |    |   |   |    |   |   |   |

$$\text{hit ratio} = \frac{4}{10} = 0.4$$

UPDATING TECHNIQUES :-

CPU performs read & write operation only on the cache ~~memory~~. Before operation, it checks for availability of block in cache. If the block is available, operation is called as read hit or write hit. If the block is not available, operation is called as read miss / write miss.

When there is a miss operation, the required block is transferred from MM to cache memory. This process is called read allocate / write allocate. After the allocation, CPU performs read / write operation.



I<sub>0</sub>: MOV R<sub>0</sub>, [0000] ; B<sub>0</sub>, 0<sup>th</sup> byte  $\rightarrow r_0$ , M  $r_0 \leftarrow 23$

I<sub>1</sub>: ADD R<sub>0</sub>, #24 ; R<sub>0</sub>  $\leftarrow r_0 + 23$ ; R<sub>0</sub> = 47

I<sub>2</sub>: MOV [0000], R<sub>0</sub> ; B<sub>0</sub>, 0<sup>th</sup> Byte  $\rightarrow r_0 \Rightarrow CM$ ,  
 MM [0000] CM [0000]  $(0000) \leq 47$   
 $23 \leftrightarrow 47$   
 coherence.

During the execution of above program, instruction 3 indicates mem.wr op<sup>n</sup>. So, CPU performs write op<sup>n</sup> only onto the cache memory. Therefore, different values are maintained in cache memory & main memory with the same address.

- The same address contains different values at different places is called as incoherence.
- Coherence creates the loss of data problems, i.e.

I<sub>3</sub>: MOV R<sub>0</sub>, [1000] ; B<sub>4</sub>-0<sup>th</sup> byte  $\rightarrow M$ : CM =  $23 \mod 4 = 0$

I<sub>4</sub> : —

I<sub>5</sub> : —

I<sub>6</sub> : mov R<sub>0</sub>, [0000] ; B<sub>0</sub>- 0<sup>th</sup> Byte  $\Rightarrow M$ : CM =  $0 \mod 4 = 0$   
 $\Rightarrow B_4 B_0$   
 $23 \text{ (old)}$

During the execution of I<sub>6</sub>, due to the miss operation, the updated cache block is replaced with new block. So the updated information is lost. To avoid this problem, updating techniques are used:-

- (i) write through (ii) write back,

### (i) Write Through :-

In this technique, CPU perform simultaneous Write op<sup>n</sup> in the cache memory and main memory. ∵ There is no coherence.



$$\text{word update time } (T_w) = \max \left( \begin{array}{l} \text{word update time in cache,} \\ \text{time in MM} \end{array} \right)$$

The average access time of read cycle is

$$T_{avg\ read} = H_r T_c + (1 - H_r) (T_m + T_c)$$

$\downarrow$        $\downarrow$        $\downarrow$        $\downarrow$        $\downarrow$   
 read hit    read    read miss    read allocated    read

The average access time of write cycle is

$$T_{avg\ write} = H_w * T_w + (1 - H_w) (T_m + T_w)$$

$\downarrow$        $\downarrow$        $\downarrow$        $\downarrow$        $\downarrow$   
 write hit    word update time    write miss    write allocate    word update time.

The total average access time when considering both read & write is

$$T_{avg\ w+} = f_r * T_{avg\ read} + f_w * T_{avg\ write}$$

$\downarrow$        $\downarrow$   
 frequency of read op<sup>n</sup>    frequency of write op<sup>n</sup>.

$$n_{w+} = \frac{1}{T_{avg\ w+}} \text{ words/sec}$$

NOTE:- Write Through is also ~~availa~~ applicable to simultaneous access memory organizat<sup>n</sup>.



$$T_{avg} = H_w T_n \Rightarrow T_{avg} = T_n$$

Note: If  $H_w = 1 \Rightarrow$  (simultaneous mem.org.)  
 $H_w \neq 1 \Rightarrow$  (Hierarchical .. .)

### Write back:

In this technique, the CPU performs write operations only in the cache memory so still coherence is present but this coherence does not create the data loss problem because the CPU reads status of cache block before replace.

If the cache block is updated block then, it transfers the information back to main memory. Later replaces the cache block. Otherwise the new block will be placed into the cache.

In write back cache, each line maintains one extra bit called as update bit. This bit will be set when the block is updated otherwise reset.



The avg access time of read cycle is

$$T_{avg\ read} = H_r \cdot T_c + (1-H_r) [\% \text{ dirty bits} (T_m + T_m + T_c) \\ + \% \text{ clean bits} (T_m + T_c)]$$

↓      ↓      ↓      ↓      ↓

read hit    read    read miss    write back    Read allocate  
read

The avg access time of write cycle is

$$T_{avg\ write} = H_w \cdot T_c + (1-H_w) [\% \text{ dirty bits} (T_m + T_m + T_c) \\ + \% \text{ clean bits} (T_m + T_c)]$$

↓      ↓      ↓      ↓      ↓

write hit    write misses    write allocate    write    write allocate  
write    write

The total average access time when considering both read & write oprn is

$$t_{avgWB} = f_r * T_{avg\ read} + f_w * T_{avg\ write}$$

$$\eta_{WB} = \frac{1}{T_{avgWB}} \text{ words/sec}$$

This technique is not applicable for simultaneous access memory organization.

Multi level Caches :-

To reduce the miss penalty multi level cache memories are used in the system design. The time required to transfer the data from higher level to level 1

memory due to the miss operation is called as miss penalty.



In multilevel cache orgn we can calculate two types of miss-rate

- (i) Local miss rate
- (ii) Global miss rate

Local miss rate =  $\frac{\# \text{ misses in cache}}{\text{Total no. of accesses to that cache.}}$

Global miss rate =  $\frac{\# \text{ misses in cache}}{\text{Total # CPU generated references.}}$

$$LMR_{L_1} = 5/10 = 0.5$$

$$LMR_{L_2} = 2/5 = 0.4$$

$$GMR_{L_1} = 5/10 = 0.5$$

$$GMR_{L_2} = 2/10 = 0.2$$

In multilevel cache orgn. the avg. access time calculated as.

$$T_{avg} = \text{Hit time}_{L_1} + \text{missrate}_{L_1} * \text{miss penalty}_{L_1}$$

$$\text{Miss penalty}_{L_1} = \text{Hit time}_{L_2} + \text{missrate}_{L_2} * \text{miss penalty}_{L_2}$$

Miss penalty<sub>L<sub>2</sub></sub> = Main memory access time.

In the multilevel cache orgn, avg stall cycles per inst<sup>n</sup> is calculated as

$$\text{Avg stalls/inst}^n = \# \text{ misses in } L_1/\text{inst}^n + \text{Hit time}_{L_2} \\ + \# \text{ misses in } L_2/\text{inst}^n * \text{miss penalty}_{L_2}$$

## Types of MISSES :-

Cache memory consists of 3 types of misses :-

(1) Compulsory miss (First-reference or cold start miss)

This miss will occur when the very first reference to cache is miss (cache is initially empty).

(ii) ~~Capacity~~ miss :- Due to the small size of cache, it cannot hold all memory blocks required for program. So some blocks are discarded & later retrieved.

(iii) Conflict miss (Collision miss/Inferencemiss)  
This miss will occur when too many blocks are mapped into the same cache line or cache set.

## Types of Locality of reference :-

(1) Temporal locality :-

When the CPU is referring same word in same block in near future is called as temporal locality.

(2) Spatial locality :-

When CPU refers the adjacent words in same block in near future is called as spatial locality.

## IO ORGANIZATION :-

- IO devices are very slow devices. ∴
- IO devices are never directly connected with system buses.
- IO devices are always communicating with CPU through high speed interface (IO module).
- High speed interface is required because IO devices are electromagnetic devices and CPU is an electronic component.  
∴ there is a difference in operating modes, difference in word format & data transfer rate.
- ∵ to avoid the above problems, IO devices are connected with CPU through high speed interface.



When CPU executes IO operations, it generates IO requests and transfers to high speed interface. Later CPU continues its own some other task. During that process, CPU periodically reads the status of interface.

- High speed interface control logic interprets the request & enables the

corresponding port for respective operation.  
Based on the speed of I/O device,  
it prepares data & later transfers  
to buffer.

- When buffer has the data, high speed interface generates the signals to CPU & waits for ack.
- After receiving ack, the buffer contents are placed on data lines.  $\therefore$  speed gap will be minimised.

## Types of interface :-



## I/O transfer modes :-

There are 3 different ways present to transfer data from I/O to CPU/memory.

- Programmed I/O
- Interrupt driven I/O
- DMA

Programmed I/O:- In this mode high speed interface is not present. The CPU is directly connected with IO devices.

∴ Processor utilization is inefficient. This means that processor waits until completion of IO operation. This wait depends on speed of IO device.



$1KB \rightarrow 1\text{ sec}$   
 $1B \rightarrow 1\text{ msec}$   
Byte transfer time  
 $\leq 1\text{ msec.}$

Interrupt driven I/O:- In this mode, high speed interface is used to connect the various basic IO devices.

∴ Processor utilization is efficient because processor only communicates with high speed interface.

- So CPU time depends on latency of interface.



$$\text{Byte transfer time} = \text{latency of interface} \\ = 0.4 \mu\text{sec.}$$

Eg:-



Direct Memory Access: In this mode, bulk amount of data is transferred from I/O device to main memory without involvement of CPU.

When user program capacity is greater than main memory capacity, CPU uses virtual memory concept to increase the address space, i.e. the user program is stored into the secondary storage component. All the secondary storage components are connected to DMA module.



Actual implementation:-



DMA  $\rightarrow$  disk  
Int controller  $\rightarrow$  KB  
Mouse, Display, etc

The CPU initializes DMA along with I/O address, memory address, control signal & count value. Later CPU is busy with other task. DMA control logic interprets the request & enables the respective port for resp. open.

- Based on the speed of disk, it prepares the data. Later it transfers the data to buffer.

- When buffer contains data, DMA enables the hold signal to gain the control of the bus and wait for HLDA signal.

- After receiving the HLDA signal, DMA transfers data from I/O to main memory until counts becomes zero.

- After the operation, DMA reestablishes bus connections to CPU.

- In DMA opn, CPU is present in 2 states  
(i) Busy state (ii) Blocked (HOLD) state.

Until preparing the data, the CPU is busy and until transferring the data the CPU is blocked.

- Let  $x$  be the preparation time and  $y$  be the transfer time. Then % of time CPU is busy =  $\frac{x}{x+y} \times 100\%$

$$\% \text{ of CPU blocked} = \left( \frac{y}{x+y} \right) \times 100\%$$

Preparation time depends on disk speed & transfer time depends on MM latency.

DMA is operating in the 3 different modes.

(i) burst mode

(ii) cycle stealing mode

(iii) block mode

In burst mode of DMA, after receiving HLDA signal, bulk amount of data is transferred from I/O device to main memory.

In cycle stealing mode of DMA, before receiving HLDA signal, it forcibly suspend the CPU operation and gain the control of bus & transfers very important data from I/O to MM.

In block mode of DMA, after receiving HLDA signal, the data is transferred from I/O to main memory in blocks.

Pg 18. Q. 8. Branch Penalty =  $3-1=2$  cycles

Branch frequency = 20%

$$\begin{aligned}\text{Average Inst. Exec. Time} &= \left(1 + \# \text{stall}/\text{inst}\right) \\ &= \left(1 + (0.2 \times 2)\right) \frac{\text{cycle time}}{16 \text{Hz}} \\ &= 1.4 \text{ ns.}\end{aligned}$$

$$\begin{aligned}\text{Prog. Exec. Time} &= 10^9 \times 1.4 \text{ ns} \\ &= 1.4 \text{ sec.}\end{aligned}$$

Pg 19. Q 13.

|                |  | I <sub>1</sub> | I <sub>2</sub> | I <sub>3</sub> | I <sub>4</sub> | I <sub>5</sub> | I <sub>6</sub> | I <sub>7</sub> |
|----------------|--|----------------|----------------|----------------|----------------|----------------|----------------|----------------|
| S <sub>4</sub> |  |                |                |                |                |                |                |                |
| S <sub>3</sub> |  | I <sub>1</sub> | I <sub>2</sub> | I <sub>3</sub> | I <sub>4</sub> | I <sub>5</sub> | I <sub>6</sub> | I <sub>7</sub> |
| S <sub>2</sub> |  |                |                |                |                |                |                |                |
| S <sub>1</sub> |  | I <sub>1</sub> | I <sub>2</sub> | I <sub>3</sub> | I <sub>4</sub> | I <sub>5</sub> | I <sub>6</sub> | I <sub>7</sub> |
| Q              |  | I <sub>1</sub> | I <sub>2</sub> | I <sub>3</sub> | I <sub>4</sub> | I <sub>5</sub> | I <sub>6</sub> | I <sub>7</sub> |

$$\begin{aligned}Q 15. \quad S = \text{Avg ET}_{\text{np}} &= \frac{\sum (I_i * CPI_i) \cdot \text{cycle time}}{\text{Avg ET}_P} \\ &= \frac{CPI * \text{cycle time}}{1 * \text{Skew time overhead}}\end{aligned}$$

$$= \frac{(0.4 \times 4) + (0.2 \times 4) + (0.4 \times 5)}{1 * \text{Skew time overhead}} * 1 \text{ ns}$$

$$= \frac{4.4 \text{ ns}}{1 * (1 + 0.2) \text{ ns}} = 3.7$$

Q16.  $\frac{\text{needed}}{\text{given}} = \frac{2^x 2^{20} \times 2^5}{2^9 \times 2^{10} \times 2^3} = 2^4 = 16.$

Q19.  $\frac{250 \times 10^9 \text{ flops}}{100 \times 10^6 \text{ flops}} = 2500 \text{ sec.}$

Q24. Arg.  $\text{Ins}^n \text{ FT}_{\text{pipe}} = (1 + \# \text{stalls}/\text{inst}^n) \text{ cycle time}$   
 $\Rightarrow \left[ 1 + \left( \frac{20}{100} \times (5 - 1) \right) \right] 20 \text{ ns.}$   
 $\Rightarrow 24 \text{ ns.}$

25.  $S = \frac{\text{pipeline depth}}{1 + \# \text{stalls}/\text{inst}^n} = \frac{5}{2.2} = 2.27.$



$$\begin{aligned}\# \text{stalls} &= (0.3 + 0.3 \times 0.4 \times 0) + (0.3 \times 0.3 \times 0.3 \times 4) \\ &\quad + (0.3 \times 0.7 \times 0.7) \\ &= 0 + 0.027 \times 4 + 0.21 \times 4 \\ &= 0.108 + 0.84 \\ &= 0.948\end{aligned}$$

Arg  $\text{Ins}^n \text{ FT} = (1 + \# \text{stalls}/\text{inst}^n) \text{ cycle time}$   
 $= (1 + 0.948) \times 20$   
 $\Rightarrow 38.96 \text{ ns.}$

30.

| <u>True</u> | <u>Anti</u> | <u>Output</u> |
|-------------|-------------|---------------|
| <u>in</u>   | <u>out</u>  | <u>out</u>    |
| $I_2 - I_1$ | $I_3 - I_2$ | $3 - 1$       |
| $I_3 - I_2$ | $I_4 - I_3$ | $4 - 2$       |
| $I_4 - I_3$ | $I_5 - I_4$ | $2$           |
|             |             |               |

$$32. t_n = 12$$

$$t_p > 6$$

$$\therefore \frac{t_n}{t_p} = \frac{2}{1}$$

4

|       | If | ID | OF | PO | WB |
|-------|----|----|----|----|----|
| $I_0$ | 1  | 1  | 1  | 3  | 1  |
| $I_1$ | 1  | 1  | 1  | 6  | 1  |
| $I_2$ | 1  | 1  | 1  | 1  | 1  |
| $I_3$ | 1  | 1  | 1  | 1  | 1  |

~~•~~ once the stage is over, op is available.  
do not wait for completion of execution  
till WB.

38.  $80\% \text{ cond}^n$



$$\begin{aligned}
 & 0.8 \times 0.5 \times 4 \\
 & = 1.6 \\
 & (1+1.6) \times 2 \\
 & 5.2 \text{ ns.}
 \end{aligned}$$

Q23. Q2.



$\Rightarrow 7 \text{ miss}$

Total miss penalty  
 $= 7 \times 8 = 56$ .



$\Rightarrow 1 \text{ miss} = 10 \text{ miss penalty}$

$$Q3. \# \text{lines} = \frac{\text{cache size}}{\text{block size}} = \frac{8K}{32} = 256$$



$$\begin{aligned}
 \text{tag mem size} &= (19 + 4 + 4) \times 256 \\
 &= 256 \times 24 \text{ bits} \\
 &= 5876 \text{ bits}
 \end{aligned}$$

$$\frac{S_A}{S_B} = \frac{K}{1.2} \quad \frac{S_A/S_B}{1.2} = 0.833$$

chapters

Q. 6.



$$\begin{aligned} \text{tagmemsize} &= \# \text{lines} \times \\ &\quad \# \text{tag bits} \\ &\Rightarrow 2^9 \times 16 \text{ bits} \\ &= 2^{13} \text{ bits} \end{aligned}$$

$$\textcircled{1} \Rightarrow H = 0.7$$

$$S = \frac{T_m}{T_c}$$

$$\therefore T_m = ? T_c$$

$$T_{avg} = 80 \text{ ns}$$

$$T_{avg} = H \cdot T_c + (1-H)(T_m + T_c)$$

$$\Rightarrow T_{avg_{new}} = \frac{H \cdot T_c + (1+H)}{(T_m + T_c)} 80 \text{ ns} = 0.7 T_c + (1-0.7)(9 T_c + T_c) = 0.7 T_c + \cancel{0.3} 3 T_c$$

$$112 = H(21.6) + (1-H) 80 \text{ ns} \quad 3.7 T_c$$

$$(21.6 + 194.4) T_c = \frac{80}{3.7} = 21.6 \quad , T_m = 194.4$$

$$112 = H(21.6) + (2160) - 216H$$

$$- 104 = (194.4)H$$

$$H = 0.52$$

11. Decoder



One iteration generates = KC Bytes

↓  
Decoder \* Bank latency = KC Bytes

$$\frac{K}{2} + 80 \text{ ns} = \text{KC Bytes}$$

To access cache block, 2 iterations are required  $\Rightarrow 2 \times 92 = 184 \text{ ns.}$

$$\cancel{2^{24}} \cdot 92 \text{ ns.} = 48$$

12.



13.

Comp + Mux latency

$$= 0.6 + \frac{18}{10}$$

$$= 2.4 \text{ ns}$$

Q20.



Tag Mem size

$$\begin{aligned} &= S \times P \times \# \text{tag bits} \\ &= 2^4 \times 2 \times 17 \\ &= 68 \text{K bits} \end{aligned}$$

2 way set associative.

CM size = 84 KB

Block size = 16B

Addr size = 32 bit

$$\# \text{Lines} = \frac{64 \text{K}}{16} = 2^{12}$$

$$\# \text{sets} (S) = \frac{N}{P \cdot \text{way}} = \frac{2^{12}}{2} = 2^4$$

= 2048.



NOTE- Array is an ordered set of homogeneous data elements. Array is always stored in row major order in sequential memory locations. Array elements can be accessed in 2 ways:-

- ① Row major
- ② Column major.

When the main memory block references are sequential, then arrangement in cache memory is also sequential.

→ One element size = 8B

One block size = 16B

∴ one block holds 2 elements

memory  $\Rightarrow$  1024  $\Rightarrow$  512 blocks required to carry 1 row.

|      |         |    |
|------|---------|----|
| 0    | a00 a01 | a4 |
| 1    |         |    |
| 2    |         |    |
| 511  |         |    |
| 512  |         |    |
| !    |         |    |
| 1023 | a11     | a5 |
| 1024 |         |    |
| 1535 | a21     | a6 |
| 1536 |         |    |
| ;    |         |    |
| 1637 | a31     | a7 |
| ;    |         |    |
| 1638 | a41     | a8 |
| ;    |         |    |
| 1639 |         |    |

Row Major:-

a00  $\Rightarrow$  M  $\rightarrow$  a00, a01  $\Rightarrow$  CM

a01 = H

a02 = M  $\Rightarrow$  a02, a03  $\Rightarrow$  CM

: 50% hit ratio

Column Major

a00  $\Rightarrow$  M  $\rightarrow$  a00, a01  $\Rightarrow$  CM

a10  $\Rightarrow$  M  $\rightarrow$  a10, a11  $\Rightarrow$  CM

$\Rightarrow$  0% hit ratio.

14, 18, 16:-

Cache  $\Rightarrow$  32 KB

Block size = 128 B.

Element size = 8 B.

Each block has  $\frac{128}{8} = 16$  elements

# Blocks req/row =  $\frac{512}{16} = 32$ .

Row major :-

$a_{00} = M \Rightarrow a_{00} \rightarrow a_{015} \Rightarrow CM$   
 $a_{01} \rightarrow a_{015} \Rightarrow H$ .

$\therefore$  No. of Miss per row  $\geq 32K$

To access complete array =  $512 \times 32$  miss  
 $= 2^9 \times 2^5 = 2^{14}$   
 $= 16K$

Column major :-

$a_{00:}M \Rightarrow a_{00} \rightarrow a_{015} \Rightarrow CM$  } 512 Miss for 1 column  
 $a_{10:}M \Rightarrow a_{10} \rightarrow a_{1,15} \Rightarrow CM$

$\therefore$  Miss =  $512 \times 512 = 2^{18}$

$\therefore M_1/M_2 = \frac{1}{16}$

17 #Lines = 32 . Block size = 64 B

Array size =  $50 \times 50$  bytes = 2500 Bytes

#Blocks req to store array =  $\frac{2500}{64} = 39.04$

Starting addr: 1100H

Direct map:

| tag | lo | hi |
|-----|----|----|
| 50  | 5  | 6  |

|        |       |         |
|--------|-------|---------|
| 00 610 | 06100 | 00 0000 |
|--------|-------|---------|

line no  $\Rightarrow$  4

|    |     |
|----|-----|
| 0  |     |
| 1  |     |
| 2  |     |
| 3  |     |
| 4  | B31 |
| 5  | B32 |
| 6  | B33 |
| 7  | B34 |
| 8  | B35 |
| 9  | B36 |
| 10 | B37 |
| 11 | B38 |
| 12 | B39 |
| 13 | B40 |
| 14 | B41 |
| 15 | B42 |
| 16 | B43 |
| 17 | B44 |
| 18 | B45 |
| 19 | B46 |
| 20 | B47 |
| 21 | B48 |
| 22 | B49 |
| 23 | B50 |

40 miss + 16 miss = 56 miss.

$$25 \Rightarrow 1 + \frac{40}{1000} \times \left( 10 + \frac{20}{40} * 100 \right) \\ = 3.4 \text{ cycles.}$$

$$26 \Rightarrow 1.5 \text{ M.ref} \rightarrow 1 \text{ Ins}^n \\ 1000 \text{ M.ref} \Rightarrow ? \# \text{ ins.} \\ = 667.$$

$$\text{Avg mem stall cycles/ins}^n = \# \text{ miss in L1} / \text{Ins}^n * H_{L2} \\ + \# \text{ misses in L2/ins}^n * H_{L2} \\ = \frac{40}{667} * 10 + \frac{20}{667} * 100 \\ > \underline{\underline{36}}$$

27:- 1GB - 1sec

$$64B \Rightarrow ? \quad \frac{64B}{1GB} = \frac{2^6}{2^{32}} \text{ GB} = 64 \text{ ns} + 32 = \underline{\underline{96 \text{ ns}}}.$$



$$\text{Ans} = \frac{2}{4} \times \frac{16 \times 32}{4} = \underline{\underline{88 \text{ ns}}}.$$

$$\frac{16}{4} (20+2) = 88$$

29.

30.

|   |     |  |  |
|---|-----|--|--|
| 0 | 0   |  |  |
| 1 |     |  |  |
| 2 |     |  |  |
| 3 | 255 |  |  |

34. # lines = 512  
Block size = 32B  
direct map.  
Addr. FBFC (16bit)

Lock up free Cache is also called as miss-information status handling register. It is used to store outstanding miss-information.

40. Simultaneous.  
Hit ratio, write = 1



$$S = \frac{512}{8} = 26$$



$$T_m = 100 \text{ ns}$$

$$S = \frac{T_m}{T_c}$$

$$T_c > 10 \text{ ns}.$$

• write through.

$$H_r = 0.92$$

$$H_w = 1 \text{ (simultaneous memory org)}$$

$$f_r = 0.8\%$$

$$f_w = 15\%$$

$$\begin{aligned} T_{avg,read} &= H_r T_c + (1-H_r) T_m \\ &= 0.92 * 10 + (1-0.92) 100 \\ &= 17.2 \text{ ns} \end{aligned}$$

$$T_{avg,write} = 100 \text{ ns.}$$

$$T_{avg} = f_r * T_{avg,read} + f_w * T_{avg,write}$$

$$= 29.6 \text{ ns.}$$

Q3:-  $H_r = 0.8$   
 $H_w = 0.9$  (Hierarchical)

$$\text{Block size} = 2^W$$

$$f_w = 30\% \text{ (dirty bit)}$$

$$f_r = 70\% \text{ (read bit)}$$

$$T_c > 20 \text{ ns.}$$

$$T_m > 100 \text{ ns}/W.$$

$$= 200 \text{ ns}/\text{block}$$

$$T_w = \max(\text{word update in cache}, \text{word update in MM})$$

$$> \max(20, 100)$$

$$= 100 \text{ ns.}$$

$$\begin{aligned} T_{avg,r} &= H_r T_c + (1-H_r)(T_m + T_c) \\ &= 0.8 * 20 + (1-0.8)(200 + 20) \\ &= 16 + 44.0 \end{aligned}$$

$$\begin{aligned} T_{avg,w} &= \cancel{0.9} * H_w T_w + (1-H_w)(T_m + T_w) \\ &= 0.9 * 100 + (1-0.9)(200 + 100) \\ &= 120 \text{ ns} \end{aligned}$$

$$T_{avg} = f_r * T_{avg,r} + f_w * T_{avg,w}$$

$$= 78 \text{ ns}$$

$$R_{WT} = \frac{1}{T_{avg}} = 12.8 \text{ million words/sec.}$$

$$\begin{aligned} \underline{\text{Q4.}} \quad T_{avg,r} &= H_r \cdot T_c + (1-H_r) [\% \text{ dirty bit} (T_m + T_m + T_c) + \\ &\quad \% \text{ clean bit} (T_m + T_c)] \\ &= 72 \text{ ns.} \end{aligned}$$

$$T_{avg_w} = H_w T_c + (1-H_w) \left[ \frac{1}{2} \cdot \text{dirty bit } (T_m + T_m + T_c) + \frac{1}{2} \cdot \text{clean } (T_m + T_c) \right]$$

$\Rightarrow 46 \text{ ns}$ .

$$T_{avg} = f_r \times T_{avg_r} + f_w \times T_{avg_w}$$

$$= \dots 49.4 + 13.8$$

$$= 63.2$$

$$\eta_{WB} = \frac{1}{T_{avg}} = 15.5 \text{ million words}$$