

# COMPUTER ORGANIZATION

(5)



# Computer Organization

✓ 8086 → μP ✓

✓ Microcontroller 8051 ✓ H/w issues ↑ 3/10 issues.

✓ CO Theory — Morris Mano / William Stallings.

[Hennessy]

✓ Advanced Computer Architecture - Multiprocessor Env.

✓ Different tech to improve computer performance.

High  
level  
abstraction

A.K Ray → Gate CO

Ayala → H/w Architecture implemented in 8051.

Von Neumann →

~~3rd ed~~ [COMPUTER ARCH & quantitative approach :- Hennessy] ✓

✓ Three kinds of memory:

↳ long term memory ROM

↳ short term ↳ SRAM → flip flop memory.

↳ DRAM → middle term memory.

↳ When the power is off, data is present until capacitor gets discharged

## Computer

Def'n → computer  $\xrightarrow{\text{S/m}}$  [It converts one form of data into another form] under the control of program.

Objective of computer  $\xrightarrow{\text{S/m}}$   
Execution of a program



- \* Analog signals means continuous wave
- \* Digital signals means discrete wave.

Analog



Digital



signals understand only 2 values:-  
0 & 1

- Computer understands only digital signals.  
It understands sequence of 0's & 1's!
- The language understood by computer is known as Binary language.

- ↗ The digital signals consists of 2 values or 2 states 1's and 0's.  
(base).
- ↗ The no s/m whose radix is 2 called Binary no s/m so ultimately computer understands binary language.
- ↗ CPU understand only binary language. As in our computer, ASCII is a translator to convert one form to another such as keyboard & CPU.
- ↗ Program  $\rightarrow$  Sequence of instructions along with data.

a+b  $\rightarrow$  Identifying the INST<sup>n</sup> is not meaningful operation. Processing is main key for getting meaningful data.



~~Def~~ ① Instruction is a binary sequence/pattern (1's and 0's pattern) 101111, which is designed made by the processor to perform some task.

have their own task.

↗ Data  $\rightarrow$  Binary sequence associated with a value.

- A. While executing the instrn, how the processor come to know about task.
- B. Program stored in memory  $\rightarrow$  address.
- C. How our processor recognize the operations associated with respect to binary pattern.  
It is done by instrn format.
- D. Instrn format  $\rightarrow$  layout of instruction
  - $\downarrow$  Depends upon size of the instrn

Internal structure of instruction.

Before giving the layout of instrn, we need to give the size of instrn / length of instrn.



opcode  $\rightarrow$  Type of Task

$\rightarrow$  How many operation introduced (opcode).

Encoding If there are  $2^n$  combinations. These are defined by n bits (output).

Decoding If there are n bits <sup>O/P</sup>. Then it is defined by  $2^n$  bits (O/P).

Qn(2) Instructions are encoded binary pattern which is associated with unique operation.

$\leftarrow$  12 bit  $\rightarrow$



Q How the length of Opcode is restricted?

A The length of the opcode field depends based on no of operations supported by processor.



After introducing the operations, the operation is enabled by ALU. But not only enabling the operation is the criteria. We need to process the data i.e. input the data.



→ Whatever instr<sup>n</sup>  
is fetched from  
memory ie it  
is fitted onto  
IF.

→ Each and every signal  
associated with our opcod<sup>n</sup>.

→ There are no 2  
decoded signals  
generated w.r.t to 1  
operation.

→ After Decoding signal is generated, then  
binding took place.

→ D<sub>0</sub> associated with + operat<sup>n</sup>. That +  
operation is enabled in ALU. Then  
there is a need of data. That data is  
given to ALU itself.

→ Address of operand → gives the info about  
data.

→ Read the data Info from AD and give  
it to ALU and then OP is generated.

# Performance Evaluation of Computer S/m

## Performance $\Rightarrow$

- 32 bit processor    64 bit processor, There is a variation in performance.
- mips processor       $\uparrow$        $\rightarrow$  computable element.

~~Eg.~~  $C = a + b.$

Computable

element, we are computing the result of  $a$  after performing add<sup>n</sup> on  $a$  and  $b$ .

Performance is now computable element i.e. it is ~~function of~~ <sup>duration of execution of</sup> dependable element  $\rightarrow$  ~~function of~~ <sup>latency of</sup> program.

## Single S/m performance $\Rightarrow$

It is depending on execution time, elapsed time or response time.

Execution time  $\Rightarrow$  means latency of a program execution s/m

single s/m/c.

y.n/c.  $\Rightarrow$  single s/m

When the exec<sup>n</sup>

| Task           | T <sub>i</sub> |
|----------------|----------------|
| Execution Time | 5nsec          |

T<sub>i</sub> time is less, performance is high

|   |            |   |
|---|------------|---|
| * | $P_x / ET$ | ✓ |
|---|------------|---|

✓ If Execut<sup>n</sup> time increases, performance decreases.  
 ✓ If Execut<sup>n</sup> time decreases, performance increases.

Speedup  $\Rightarrow$   $\frac{\text{performance of } x}{\text{performance of } y} = \frac{1}{ET_x}$

~~variations in  
approach of  
input~~

Speedup  $\Rightarrow \frac{ET_y}{ET_x}$  ✓

In previous eg  $\Rightarrow$  Speedup =  $\frac{10}{5} = 2$ .

x is running 2 times faster than y.

✓ Speedup is used to compare 2 entities and choose the best. one.

✓ Performance of a Server sys.

This performance is depending upon throughput of a sys.

Throughput = no of tasks executed in a unit of time.

server (not a single S/m).

Date: \_\_\_\_\_

Page No.: \_\_\_\_\_



Eg Speedup =  $\frac{10}{5} = 2$   $x$  server runs 2 times faster as compared to  $y$  server

With respect to CO, we need to concentrate on single S/m environment

How to decrease execution time, is a objective of S/m designer

Andhal's Law:

This law focus on improvement gain by making use of common case fast.

Design: High performance processor

(P)  $\rightarrow$  Price 100\$

Within the price limit, we need to accommodate the performance. Then we call it as efficient designer.

While analyzing S/m we divide into 2 parts

Frequently used Parts

rare parts

Eg Television and remote control:

Introducing 100 keys on remote is not efficient design.

90% are floating pt ] they are utilized  
10% are integer ] by processor.

frequently  
used).

Rare Operat<sup>n</sup>s.

common case

Points:

Analyzing S/m before taking modifications into frequent case and rare case.

Take the enhancement only on frequent case.  
It improves the performance within the cost and price limits by making using common case fast criteria.

Speed up = ?

We need to calculate the speed up or performance gain. How much perf is gained with enhancement.

Benchmark program.

Speed up = Performance of task which are working on the S/m with enhancement

perf of task which are running on the S/m w/o enhancement.

⇒ 1

Execution Time System Enhancement

ET S/m w/o enhancement

Denominator prog  $\rightarrow$  program which can run  
any type of instrns.

Date:

Page No.:

Speed up = ET sum w/o enhancement

ET sum with enhancement

\* Speedup  $\Rightarrow \frac{ET_{old}}{ET_{new}}$

This eqn depending on 2 factors :-

① The first factor is fraction enhancement

i.e. How much fraction of original m/c undergoes enhancement. Only common case undergoes enhancement i.e. fraction

Eg fraction enhancement is 90%. floatingpt.

② Speedup Enhancement : how much faster the tasks run, if the enhancement portion also included.

$$\text{Speedup} = \frac{ET_{old}}{ET_{new}} \quad \xrightarrow{\text{map probability}}$$

/  $ET_{new}$  = Execution time of unenhanced portion + execution time of enhanced portion.

Eg  $ET_{old} = 1$

~~$ET_{new} < ET_{old}$~~

~~$ET_{new} > 1$~~

$\frac{ET_{old}}{ET_{new}} < 1$

$ET_{old} = 1$   
 $ET_{new} < ET_{old}$

7. Fraction enhancement =  $F$ . Overall =  $\frac{ET_{old}}{ET_{new}}$ .

7. Speedup enhancement =  $S$ .

Calculate  $ET_{new}$ .

$ET_{new} = ET \text{ of Unenhancement} + ET \text{ of enhancement}$

$$\Rightarrow ET_{old}(1-F) + ET_{old}F$$

$S$ .

$$\Rightarrow ET_{old} \left[ (1-F) + \frac{F}{S} \right].$$

$$S = \frac{ET_{old}F}{ET_{new}F}$$

$$ET_{new} \text{ Fract}^n = \frac{ET_{old} \text{ fract}^n}{S}$$

overall  $\Rightarrow ET_{old}$

$$ET_{old} (1-F) + \frac{ET_{old}F}{S}$$

$$\Rightarrow \frac{1}{(1-F) + F} = \left[ \frac{(1-F) + F}{S} \right]^{-1}$$

$$\frac{ET_{old} (1-F)}{ET_{old} F + ET_{old} F} \cdot S$$

single enhancement overall =  $\left[ \frac{(1-F) + F}{S} \right]^{-1}$

Total single speedup after enhancement

$$S = \frac{ET_{old} (1-F) + ET_{old} F}{ET_{old} F} \cdot S$$

$$S = \frac{ET_{old} F}{ET_{new} F}$$

If multiple enhancement takes place-

$$\text{Overall} = \left[ (1 - \sum F_i) + \sum F_i \cdot S_i \right]^{-1}$$

Where  $i = 1, 2, 3, \dots$

If 2 enhancement takes place -

$$\text{Overall} = \left[ 1 - (F_1 + F_2) \right] + \frac{F_1}{S_1} + \frac{F_2}{S_2} \right]^{-1}$$

## Numerical:

Ques Consider hypothetical system used in mathematical model simulation. Suppose floating point instructions improved to run at the rate of  $2x$ . but only  $10\%$  of instructions are floating point. what is the overall speedup?. Examine is it successful improvement or not? & take corrective decision.

10% inst<sup>n</sup>g are floating point ✓

( $b=2$ )  
10% FP  
90% I.

$$S_{\text{overall}} = \frac{1}{(1-F) + F} = \frac{1}{\left(\frac{1-10}{100}\right) + \frac{10}{100}} = 2$$

System (mathematical

model simulation)

$\Rightarrow 1$

✓ floating int

$$(1-0.1) + 0.1$$

$$\frac{0.05}{2}$$

10%

$$\frac{0.9 + 0.05}{2}$$

(improved).

$$\Rightarrow \frac{1}{0.95} \quad \frac{0.05}{0.95} \quad \frac{1.0}{0.9}$$

$\Rightarrow \downarrow$

$2x$

$\downarrow$

$S = 2$

$F = 10\%$

$S_{\text{overall}}$

$\Rightarrow 1.05$

$$\frac{1}{0.95} \quad \frac{1}{10 \times 2} \quad \frac{0.05}{0.5}$$

$$= 1.10$$

↳ it is not correct enhancement  
Here rare case goes under enhancement  
only fraction of speedup is improved,  
perf gain is very less.

In the above case rare event undergoes enhancement so the performance gain is very less.

To take the corrective measures, integer unit undergoes enhancement instead of floating point.

Under this condition, the speed up is equivalent to  $\Rightarrow$  Integer

$$f=0.9, s=2.$$

$$\Rightarrow \frac{1}{(1-0.9)+\frac{0.9}{2}} \quad \checkmark$$

$$\Rightarrow \frac{1}{0.1+0.45} \Rightarrow 1.81 \quad \checkmark$$

$$\begin{array}{r} 1.0 \\ 0.9 \\ \hline 0.1 \\ 0.45 \\ \hline 0.9 \end{array}$$

Speed up increases, if we improve frequent operations.

Ques. ② Consider memory  $S/m$ , suppose a cache memory is 10 times faster than main memory and that can be used in  $70\%$  of the time. How much speedup we gain.



$$S_{\text{overall}} = \frac{1}{(1-f)+\frac{f}{s}} = \frac{1}{(1-0.7)+\frac{0.7}{10}} = \frac{1}{0.3} = 3.33$$

$$\Rightarrow \frac{1}{(1-0.7)+0.7} = \frac{1}{0.3} = 3.33 \quad \checkmark$$

Ques → 3 enhancements with following specifications are proposed for new architecture i.e.  $s_1 = 30$ ,  $s_2 = 20$ ,  $s_3 = 15$ . If enhancements 1 and 2 are each used for 25% of the time. What fraction of the time enhancement 3 will be used to achieve overall speedup of 10.

enhancement

$$S_{\text{overall}} = \left[ 1 - (F_1 + F_2 + F_3) + \frac{F_1}{s_1} + \frac{F_2}{s_2} + \frac{F_3}{s_3} \right]^{-1}$$

$$10 \Rightarrow \left[ 1 + \left[ \frac{25}{100} + \frac{25}{100} + F_3 \right] + \frac{25}{100} + \frac{25}{100} + \frac{F_3}{s_3} \right]^{-1}$$

$$\Rightarrow \left[ 1 + \left[ 0.25 + 0.25 + F_3 \right] + \frac{25}{30} + \frac{25}{20} + \frac{F_3}{15} \right]^{-1}$$

$$10 \Rightarrow \left[ 1 + \left[ 0.50 + F_3 \right] + 0.8 + 1.25 + \frac{F_3}{15} \right]^{-1} \quad \begin{matrix} 0.8 \\ 0.2 \end{matrix}$$

$$10 \Rightarrow \left[ 1 + [0.50 + F_3] + 0.8 + 1.25 + 0.06 F_3 \right]^{-1} \quad \begin{matrix} 4/5 \\ 4/10 \end{matrix}$$

$$10 \Rightarrow \left[ 1 + [0.50 + F_3] + 2.05 + 0.06 F_3 \right]^{-1} \quad \begin{matrix} 15/8 \\ 15/10 \end{matrix}$$

$$10 \Rightarrow \left[ 3.55 + 1.06 F_3 \right]^{-1} \quad \begin{matrix} 0.1 \\ 0.06 \end{matrix}$$

~~$$10 \Rightarrow 3.55 + 1.06 F_3$$~~

~~$$0.1 = 3.55 + 1.06 F_3$$~~

~~$$0.1 - 3.55 = 1.06 F_3$$~~

~~$$F_3 \Rightarrow 0.45$$~~

~~$$\Rightarrow 45\%$$~~

## CPU Execution time Calculation :-

- Each and every processor is connected with common clock which runs at constant speed.
- $T_1 = 1\text{ ns}$   $T_2 = 2\text{ ns}$ } Its not constant running clock.



→ Clock pin is also known as I/P some info is carried over to processor  
Clock Generator & clock pin is connected with C.G.

The O/P of CG are time states. 11111.

The entire 8086 is controlled by clock. Clock generator must run at constant speed Then only time states will be having unique value.

For every operation, 4 ts are involved.

$t_1 t_2 t_3 t_4$  must be common ie 1ns

Time state is defined as 11111 (101) rising edge to rising edge and 010 ie falling edge to falling edge. 0 0 101 → time state

Edge triggering signals  $\rightarrow$  when the signal is activated based on time state ie 101 ✓

Level triggering signals

Active High  $\rightarrow$

Active Low  $\rightarrow$  0

The signals which are activated based on their transitions called as edge triggered signals.

\* Clock / tick / clock tick / cycle / clock cycle /  
clock period  $\rightarrow$  101 ✓

\* Based on clock 010 ✓

we are calculating execution time reqd by a computer

CPU Execution time = program execution time

= no of seconds per program.

Execution time  $\Rightarrow$  seconds ✓  
Program ✓

Program  $\Rightarrow$  no of inst<sup>n</sup>.  $\rightarrow$  To execute  
one inst<sup>n</sup>  $\rightarrow$  how many clock cycles  
reqd

CPU Time  $\Rightarrow$  seconds  $\Rightarrow$  Inst<sup>n</sup>  $\times$  Cycles  $\times$  seconds  
Prog. Prog. Inst<sup>n</sup> Cycles.

CPU Execution time = no of clock cycles reqd  $\times$   
clock cycle time [time is given]



Q1.

time is yet given, frequency is given

~~CPU~~ OUTime  $\Rightarrow$  no of clock cycle reqd  $\times$  clock cycle time.  
rate.

- If we know the no of clock cycles, instr<sup>n</sup> count, we can calculate average no of clock cycles per instruction.

CPI  $\Rightarrow$  no of clock cycle reqd  
Instruction count

Under that cond<sup>n</sup> no of clock cycles reqd

$\Rightarrow$  CPI  $\times$  IC

Substitute no of clock cycles reqd for the program value  $\times$  CPU time

CPU Time  $\Rightarrow$  CPI  $\times$  IC  $\times$  clock cycle time  
 $\Rightarrow$  CPI  $\times$  IC  
rate.

Each and every instr<sup>n</sup> will be having their own clock cycle.

CPU time =  $\sum_i$  CPI<sub>i</sub>  $\times$  IC<sub>i</sub>  $\times$  cctime

or

=  $\sum_i$  CPI<sub>i</sub>  $\times$  IC<sub>i</sub>  
Rate

i → Data transfer instr<sup>n</sup>, Datamanipulation  
Transfer of control.

### Instruct<sup>n</sup> set

A complete processor supports with 3 categories of Instr<sup>n</sup> i.e.

- (1) Data transfer Instr<sup>n</sup>
- (2) Data manipulation Instr<sup>n</sup>.
- (3) Transfer of control Instr<sup>n</sup>.

name  
convention  
Inst<sup>n</sup>

Data transfer Instr<sup>n</sup> → while execution of these instructions, data is transferred from source to destination w/o change.

We need to identify source of addresses, How many places the data is present.

The data is present in 2 places →

- (1) Processor register
- (2) Memory.

we need addressing mode. It shows where the data is present to operate Inst<sup>n</sup>.



→ multiple locations.

Source - in either Register or memory.  
Destination

- If its read operation → Dent<sup>n</sup> → register
- If its write operation → Dent<sup>n</sup> → memory.

To execute these instructions register & memory acts as a source and destination based on operation nature.

If operation is read, register is destination and if it is write, register memory is destination.

rule

7. The constraint of data transfer instruction is source may or may not have storage but destination is always have storage

MOV #23, AX

constant

X

Eg

Nemonic indicator  
type of operat^n

Nemonic followed by dest^u, dest^u is "source"

Constant does not have storage functionally

Register → Storage functionality

MOV AX, (# 23)

D S ✓

DU  
↓  
storage

register

source may  
or may not have  
storage

Data transfer Inst<sup>n</sup>s are divided into 2 types:-

- ✓ ① Register to Register transfer operation (Mov)
- ✓ ② Memory to Register " "
- ✓ ③ Register to memory " "

① Mov → Data transfer Inst<sup>n</sup> ~~Mov by rs~~

② Mov no #2345 ✓ (reverse is not valid)  
 ↓      ↓      ↓  
 Dest<sup>n</sup> need to have  
~~storage~~ constant (Immediate value). ~~storage~~

③ (Point) → 8086 Up allows move Inst<sup>n</sup> to perform data transfer operation from register to register & memory also.

Mov no, 2345. ✓ (register) [read operat<sup>n</sup>]

~~mov~~ ~~no, 2345~~  
 ↓  
 addrs.

If any no is not followed by # it is treated as address -  
 (Address of memory).

Mov 2345, no. ✓ [Dest<sup>n</sup> memory] [write operat<sup>n</sup>]  
 addrs.  
 of memory (storage facility).

Registers are present in memory

Memory to memory transfer operation is not allowed. There is no use.

Q. Memory reference data transfer operations are :-

(1) read and write

Memory is divided into equal parts ie known as cell  
Cell consists of unique address. No is known as

Q. Memory chip is divided into equal sized parts called cells.

Q. Each cell is identified with unique no called as address.

Q. Each cell understands 2 control signals.

(1) Read & write

To access any memory cell, send address along with control signal.

Memory chip is represented as following notations

i) cell size:

$\Rightarrow 64K \times 8, 64K \times 16, 64K \times 32$ .

Total capacity of chip.

By using this capacity, we came to know about no. of cells. It itself gives how many address lines are reqd to enable 1 cell. We are able to identify address space of that chip.

$$\text{Total no. of cells} = 2^6 \times 2^{10} \Rightarrow 2^{16}$$

$$\text{no. of address lines} = 16.$$

Address space  $\rightarrow$  min val to max val.

$$\rightarrow (0000000000000000)_2 \text{ to } (1111\ 1111\ 1111\ 1111)_2$$

Each and every memory chip will be having their own address space.

Memory addresses are represented in hexadecimal no sys. because of easier recognition. Memory is also important.

$$\begin{array}{ccccccccc} 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ | & | & | & | & | & | & | & | & | \\ 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \end{array} \rightarrow 0000 \rightarrow \text{each digit is equivalent to 4 binary bits.}$$



→ The 2nd parameter indicates cell size:-

Default cell size = 8 bits

64KB → invalid

$$K = 16, \cancel{K = 32}$$

Букви,

↑ it is only valid, if there is binding  
with 16, 32 & so on...

→ If memory address points 8 bit memory cells called as BYTE ADDRESSABLE

→ If memory address points the cell size based on the processor word length. Called as WORD ADDRESSABLE

Suppose      16 bit processor      ] word length = 16, 32.  
                  32    4        4      ]

Mensley addresses interpretation methods.

e.g. 8086 whose word length is 16 bits.

Jan.

MOV AX 2000

| Higher Byte     | Q3: | 2000 |
|-----------------|-----|------|
| 2000 → 23, → HB | 45  | 2001 |
| 2001 - 45 → LB  |     | 2002 |
|                 |     |      |

register  
size + 16 bits

1665  
Piscator

↳ we identify through 2 types  
as follow:

Memory interpretation are of 2 types :-

① Little endian.

② Big endian.

① Little endian

3 2 1 0 (address)  
| 3 | 2 | 1 | 0 | byte

(low to high)

low address consists of low byte

High 4      "      "      High 4.

If 8086 allows IET method, accessed 2 locations.

↓  
low (2000) - 23 → LB. ] ✓  
High (2001) - 45 → HB ] ✓

8bit = 1 byte  
16bit = 2 bytes  
24bit = 3 bytes  
AX = 23 | 45  
AX = (HB) | (LB)

32bit processor we need to access 4 locations.

|    |       |
|----|-------|
| 23 | 2000. |
| 45 | 2001  |
| 20 | 2002  |
| 64 | 2003  |

reg = 64 28 45 23 ✓

② Big endian method :-

4 3 2 1 0  
| 1 | 2 | 3 | 4 |

low address → Higher Byte

High address → lower byte

The default Memory address Interpretation method  
**LITTLE ENDIAN**

## RISC processors

In RISC processor, memory to register, memory to register, Data transfer instruction are implemented with load and store instr's respectively.

(Memory + register)

- Load - read operat'n. mem  $\rightarrow$  reg (I/P)
- Store  $\rightarrow$  write Operat'n. reg  $\rightarrow$  mem (O/P).

After enabling LRU, SW is waiting for data is present in register. Data is transferred through Data transfer instr's.

Eg Load r0, [2000]  $\rightarrow$  contents.

memory content whose address is 2000 is transferred to r0.

Eg Store r1, [2000] register  $\rightarrow$  memory

register.

In the RISC, store instr is followed by source But all other instr is followed by Destination

\* Except store instr all the instr's are followed by Destination

## Data Manipulation Instn's

While execution of these instructions, the data undergoes change. These are divided into 2 types.

- ① Bit manipulation instrn's
- ② Byte manipulation instrn's

Bit manipulation instrn: Only one bit undergoes change.

- ① Instrn's of set, clear

(Set)  $\rightarrow$  carry flag is equivalent to 1 while executing this instrn.



- ② Clr b no. (1)

7 6 5 4 3 2 1 0  
 0 0 1 0 0 0 1 1

Change.

0 0 1 0 0 0 0 1

This type of Instructions are useful in setting or resetting command word register.

Command word register each bit is having their own meaning.

If bit = 1  $\rightarrow$  Set  $\rightarrow$  rotate priority is reflected  
 If bit = 0  $\rightarrow$  Reset  $\rightarrow$  fixed priority is reflected.



## ② Byte manipulation

These instructions are divided into 3 types

- ① Arithmetic instr<sup>n</sup>
- ② Logical instr<sup>n</sup> ✓
- ③ shift & rotate instr<sup>n</sup> ✓

1. one form of BIP is accepted & other form of BIP is generated based upon instr<sup>n</sup> ✓

① Arithmetic instr<sup>n</sup>: Arithmetic instr<sup>n</sup> are used to perform mathematical models simulation mathematical formula / sol<sup>n</sup>.

These Instr<sup>n</sup>s are divided into several types :→

1. ADD

SUB

MUL

DIV

INC

DEC

②

Logical instr<sup>n</sup>: To enable or disable the internal components of the processor logical instr<sup>n</sup>s are used.

Level Triggered Signals are using logical instr<sup>n</sup>s.  
 ↓

Based on pin value, we are having 2 values i.e high & low. Some pin are activated with high or low.

~~eg~~ If chip is in enable state, then memory is active.



~~eg~~ Nand gates are communicating with chip select pin

\* logical operations are used to enable / disable the internal components of processor in memory.

If any one of the IP of Nand gate is 0, the output is high.

\* Level Triggered Signals:

Active High  
Signal

Eg : (ALE)  
(Address Latch Enable).  
(Pin name).

Active Low

Signal  
Pin name with bar  
It indicates Active low

It is connected with latch.  
Pin without bar  $\rightarrow$  Active name  
high.

Eg CS



Logical Instns: AND, OR, NOT, XOR,  $\uparrow$ , CMP  
compare

② Shift & Rotate Instn

Shift Instn: This instruction is used in serial data communication. serial

Data communication means bit by bit transmission. It is supported (useable) when the distance is large between the nodes. (source node or dest<sup>n</sup> node)

(2) 8251 USART (Universal Synchronous Asynchronous Receiver & Transmitter).

↗ It is an interface to perform Serial Data Communication.

↗ How it is executed?



↗ It consults internal buffer.

Buffer is used to hold data.

Based on buffer status we

can conclude the transmission is over or not.

Once the buffer is full, transmission takes place.

Shift register are reqd for serial transmission.



When sender buffer gets empty, stop transmission.

one node as transmitter & another as receiver by setting bit in the command word register. When the ~~idle~~ side is full, start the transmission i.e. LSB of buffer is placed on transmission medium.

Shift operations are divided into 2 types :-

rest of the bits are shift towards right one bit position. Repeat the same operation until buffer becomes empty.

Two Types:-



logical shift operation  $\rightarrow$  sign bit is shifted either towards left or right.



$\rightarrow$  LSB is lost, 0 is appended. All bits are moves towards right direction.



Serial data commun  $\rightarrow$  logical shift operation is used.

Arithmetic shift operation  $\rightarrow$  preserve the sign bit, we are not changing the position.



Common property of shifting → loss of data

Shift left:

$n-1 \ n-2$



If its n bit register:

$n-1 \ n-2 \ n-3$



remain same

$n-2$  is lost. If  $n-2$  is moved towards  $n-1$  then sign bit is lost.

$\text{Ex} = \begin{pmatrix} 1 & 0 & 1 & 1 & 0 & 1 & 1 \end{pmatrix} \xrightarrow{\text{lost}} 0.$

$= \begin{pmatrix} 1 & 1 & 0 & 1 & 1 & 0 \end{pmatrix} \xrightarrow{\text{sign}} 101$

Here we preserve sign

→ Arithmetic shift operators are used in Sign no's multiplicatn.

Rotate

Conclusion of shift operatn

\* [After certain operatn, the original Data is lost] \*

\* Use / Significance → Rotate instrns are used to perform optimizations in the program.

Rotate instructions are divided into 2 types:



→ Rotate w/o carry  $\rightarrow$  ROR

$\rightarrow$  ROL



→ No new data entered onto register, only position's  
are changed.



Stepper motor

$$s_c = 10111 \leftarrow$$

10111

anticlockwise  
 $\Rightarrow$  rotate ROL  
clockwise  $\rightarrow$  ROR

→ Stepper motor implementation :-

Book NK  
Author

Rotate operations are meant for optimization.  
In stepped motor interface, implementation with  
8086, ROR and ROL instructions are used  
to perform clockwise & anticlockwise  
movements.

RCR



Rotate right along with carry. LSB is transferred to carry & carry moved to LSB. rest of bits are moving towards right.

RCL



These type of instr<sup>ns</sup>s are used in

- (1) to identify the no is +ve or -ve. In Rcl. After MSB shifted to Cy. If Cy is set the no is -ve otherwise if it is reset then no is +ve.

- (2) to identify the no is even or odd.

RCP

Cy bit is always 1 if no is odd.

LSB bit is always 0 if no is even.

Num 2 0 1 0 1 0 1 0  
0 1 2 3 4 5 6 7

|   |   |   |      |
|---|---|---|------|
| 0 | 0 | 0 | even |
| 1 | 0 | 1 |      |
| 2 | 1 | 0 | odd  |
| 3 | 1 | 1 |      |
| 4 |   |   |      |
| 5 |   |   |      |
| 6 |   |   |      |
| 7 |   |   |      |

~~give concept~~

## Transfer of Control Instructions.

- While execution of these instructions, the program execution sequence is changed. Every transfer of control instr<sup>n</sup> is associated with target address.

Program execution

- The program execution sequence is controlled by program counter register. Program counter points starting inst<sup>n</sup> addresses & immediately it is pointing to next inst<sup>n</sup> address.



Once the inst<sup>n</sup>  $I_1$  is transferred to CPU  $\rightarrow$  fetch cycle. or Before executing the  $I_1$ , the program counter points to next address (updated).

When Inst<sup>n</sup> are stored in random way / locality of memory. Transfer of control Inst<sup>n</sup> are needed to load the successive address into Program counter.



Notes Points :

- ① Program counter is operated under 2 modes:-
- ① Implicit mode ✓ \*  $PC \leftarrow PC + \text{Step size}$
- ② Explicit mode ✓ \*  $PC \leftarrow \text{modified value}$ .

Step size → Memory-byte addressable }  
 Inst<sup>n</sup> is 16 bits }  
 Memory location 2 bytes }

Here memory location reqd to store Inst<sup>n</sup> is 2 byte

Here step size = 2 byte

Program counter is incremented with Step size 1, 2, 3, 4 --- so on.

- ① PC is operated under implicit mode where all the inst<sup>n</sup>s are stored in sequential memory locations

- ② PC is operated under explicit mode where the inst<sup>n</sup>s are stored in random locations of memory

under the 2<sup>nd</sup> mode of operation, there is a need of transfer of control instructions to load the modified value into PC.

Transfer of control operations are divided into 3 types

Branch      Jump      Skip

These transfer of control operations again classified into

- ① Unconditional ✓
- ② Conditional ✓

Unconditional transfer of control while executing these instructions, program counter is loaded with target address w/o checking any cond'n.

Eg  
 unconditional  
 (B) 2000 H PC = 2000.  
 JMP 3000 H PC = 3000.  
 SKIP 2050 H PC  $\Rightarrow$  2050.

TA (next address instn).

CALL 3500 H R  $\Rightarrow$  3500.

Conditional transfer of control instructions:- while execution of these instructions, condition undergoes evaluation. If it evaluates to true target address is loaded into PC. If it is false, next instn is executed

JNZ 2000 Jump non zero. PC  $\leftarrow$  2000

Taking previous instr<sup>n</sup> into account we can evaluate true or false.

Mov CX, 0010 H

Dec CX 22

True

Mov CX, 01H

Dec CX, 00H

False

As long cond<sup>n</sup> is true program counter is loaded with target address. otherwise next instr<sup>n</sup> is executed.

In 8051. djnz R0, 2000H

↓

decrement R0 if non zero.

No need to take previous instr<sup>n</sup> into account before loading target address.

## Subprograms

To implement subprograms, transfer of control instr<sup>n</sup> are used.

① call

② return

Repeatedly appeared instr<sup>n</sup>s in the main program collect as 1 program and store it in the separate memory space.

Sort() — 5 instr<sup>n</sup>.



- ✓. 250 Inst<sup>n</sup> ✓ if program language is not supported  
 → LOC is increased by subprogram then  
 → Memory space Inc. ← disadvantage  
 → Time inc.  
 → Cost inc.

~~Advantages~~

### Characteristics of Subprograms →

- ① Single entry point ] → call → main program
  - ② - single exit point ] → return → subprogram.
  - ③ Main program is suspended during the subprogram execution!
  - ④ Control is send back to main program after completion of subprogram.
7. To implement subprograms, there is a need of run time stack to store the return addresses



When there is call, there is return.



→ To handle the interrupt: →

To implement Interrupt controller concept in the processor environment. There is a need of Subprogram concept. Each interrupt related Subprogram is present in the memory to handle different interrupts. There is a need of transfer the control from main program to subprogram while maintaining the return address in the runtime stack.

\* The transfer control operation is divided into 2 categories: →

- ① Direct transfer of control operat<sup>n</sup>.
- ② Indirect transfer of control operat<sup>n</sup>.

(Call)

(Return)

Direct transfer of control operat<sup>n</sup>. → target address is associated with mnemonic.  
Indirect transfer of control operat<sup>n</sup>: →

| TOC      | Uncond <sup>n</sup> al     | Cond <sup>n</sup> al |
|----------|----------------------------|----------------------|
| Direct   | call, jump<br>Branch, skip | JNZ (loops)          |
| Indirect | Return<br>(ret)            | -                    |

Point: Implement one control met<sup>n</sup>'s are enough to get the next met<sup>n</sup> address.

## Numericals :-

Q1. Consider the non pipelined processor  
Assume that it has 1 ns clock cycle  
and it takes 4 cycles for ALU operations  
and branches, 5 cycles for memory  
operations. Assume that relative  
frequency of these operations are  
40%, 20% and 40% respectively.  
What is the Instn execution rate?

CPU Time = ?

Clock cycle time = 1 ns.

No of cycles = 4 cycles for ALU +  
No of cycles for memory branches

$$\text{CPU time} = \sum \text{CPI}_i * \text{I}_i * \text{cc time}$$

$$\text{CPU time} \Rightarrow (0.4 * 5 + 0.2 * 5 + 0.4 * 5) \text{ ns}$$

$$\Rightarrow 4.4 \text{ ns}$$

Page no 90  
Q4

Point the components of

# Computer Architecture

→ Computer Arch is classified into 2 types based on memory i.e. it is single memory architecture, <sup>in memory</sup> multi-level architecture.

Single memory architecture: This architecture is also known as VON NEUMANN ARCHITECTURE. This architecture is designed based on stored program concept.

only one memory space is used to store program and data.



Performance  $\propto$  1

Access Time = low.

\* Von Neumann arch → If the memory is organized properly, access time is less.

→ Stored program concept means Inst<sup>n</sup> & data both are in their same memory.

After performing the operat<sup>n</sup> result is also stored into the same memory.

To make it successful architecture use memory management policies to separate the Inst<sup>n</sup> area from data area. So ultimately access time is decreases. This concept is implemented in 8086 CPU.

(16 bit word length)  
8086  
Byte addressable

→ word length  
↓  
16 bit processor

Bit register size = 16 bit

Physical memory = 1MB  $\Rightarrow$   $1M \times 8$  bit

$$\Rightarrow 1M \times 8$$

$$\Rightarrow 1024 \times 1024$$

$$\Rightarrow 2^{10} \times 2^{10}$$

$$\Rightarrow 2^{20}$$

$\Rightarrow$  No of address lines = 20.

Address space  $\Rightarrow$  20's to

20 0's to 20 1's.

$\Rightarrow$  5 0's. i.e. 00 000

5 1's. FFFFP

$$2^{20} \rightarrow \text{either } ] \text{ Both is present}$$

data ]

Eg

Col1

100H

Col2

104H

No policy  
is maintained  
while using  
whole using  
HMO (Random)

Policy Row-column

~~100 75~~ M  
~~62~~

1 20 63<sup>HMO</sup>  
11 M. want  
to access.

AT<sub>2</sub> search time +  
to search

AT  $\Rightarrow$  AT

There is no  
search time.

AT  $\Rightarrow$  high.

AT<sub>2</sub> less

Perf = dec.

Perf = inc

$AT \propto \frac{1}{\text{Perf}}$

Segmentation  $\Rightarrow$  To organize  $2^{20}$  address space we need segmentation.

When 2 different entities are stored in same memory location to access them efficiently there is a need of memory mgt policies.  
8086 designer used segmentation policy i.e. 1MB physical memory is divided into 16 logical segments. Each segment size is of 64 KB.

1MB  $\Rightarrow$  16 logical segments



$$\Rightarrow 16 \times 64 \text{ KB.}$$

$$\Rightarrow 2^4 \times 2^6 \times 2^{10} \Rightarrow 16 \text{ address lines.}$$

$\Rightarrow 2^{20}$ . (no new cell added  
[Same cell status is no cell removed  
maintained). after segmentation)

Base address, offset address

The outside address is known as Base address and inside address is offset <sup>16 segments</sup> address.

Two entities are stored in

same memory To access each entity in efficient way we need segmentation



↗ 8086 supports 4 Segment registers called as CS (code segment register, data segment register, stack segment register, Extra Segment register).

↗ Segment registers are used to store the Base address of a segment where code is available.

①

$CS = 0$



Code segment register is used to store the Base address of the segment where code is available.

② DS → is used to Base address of the segment where <sup>the</sup> data is available. total address space.



Point If Base address is not shared by different segment registers then it is called Non overlapping memory organization

- (3) Stack segment → is a register used to store the base address of the segment where return address are available.
- (4) Extra segment register used to store the base address of the segment where the data is available

Ques. Why 2 Data area is reqd.

There is need of identify source address of data & Dest<sup>n</sup> address of data, In case of string operation [ data xfer operat<sup>n</sup> ], Extra segment is used.

↗ If there is no string operat<sup>n</sup> → No use of Extra segment



Point, In order to execute the string operat<sup>n</sup>s, there is a need of extra segment used to point Dest<sup>n</sup>. address.

↗ All our pc's allows VON NEUMANN Arch. Due to this reason it run in RAM not in ROM. Code memory is present in physical memory. (code Area & data area is same).

## Pointer & Index register

- The offset address of memory segments pointed by pointer registers and index registers. A part of  $\downarrow$ .  
To point the offset address of the segment.
- 8086 supports with 3 pointer registers and 2 index registers



Point: Registers are referred by names or internal addresses.

To enable single memory location there is need of BA and offset address.  
 When there is a need of address of data we use BP.  
 Need of offset

|      |                          |
|------|--------------------------|
| Base | Offset                   |
| CS   | : IP : Inst <sup>n</sup> |

→ Instruction pointer register.

Instruction pointer register holds the offset of Inst<sup>n</sup>.

CS : IP      ① If there is no binding: DS ; BP / SI / DI

If there is binding then DS ; BP / SI / DI string operation.

If there is no binding then DS ; BP / SI / DI not string operation.

Three possibility of holding address of data.

② If operation is string operat<sup>n</sup>

✓ DS: SI      ES: DI  
Source      Dest

D. If Base pointer register, source Index register,  
Dest<sup>n</sup> Index register points the data offset.  
(If the operat<sup>n</sup> is not a string operat<sup>n</sup>)

③ If operation is a string, Data segment register  
is binded with SI to point source  
address and Extra segment is binded with  
DI to point the Dest<sup>n</sup> address.

Case 3 Stack Area : \* stack pointer ?  
points / holds the offset of  
BS: SPT → return address (stack).

Suppose each register size is 16 bit. But internally it  
consists of Address generation logic circuit. As we are  
having a need of 20 bit address. But  
when we combine Base and offset ie 16 bit +  
16 bit = 32 bit. To convert this 32 bit  $\rightarrow$  16 bit  
we need AGIC  $\rightarrow$  4 zeros are added at LSB.



## Point

To read an instruction or data from the memory there is a need of Base and offset addresses to enable the memory cell.

- 7. These 2 addresses are pointed by Segment register and pointer registers. that means 32 bit address is given as I/P to the memory. Because we are having 8086 memory size is  $1MB(2^{20})$ . There is need of logic circuit to convert 32 bit address into 20 bit address.

That circuit is called as (AGLC) Address generation logic circuit.

- 8. It consists of 20 bit buffer it accepts the Base address as an I/P and performs shift left operation upto 4 times.

- 9. After that it accepts offset as a input and performs OR operation with Intermediate regt. Then it generates 20 bit address.



offset segment register

16bit

↓

16bit

, each digit

is 4 binary

bit. Total = 16



Hexadecimal: 3000 : 1234

bits



## Multimemory Architecture

This architecture is also called as "HORVARD" architecture. It is also called LOAD & STORE architecture.

It consists of 2 memory chips. One is code memory & Data memory. (2 memory are separately maintained in processor, one is holding instr<sup>n</sup> & other holds Data). No segmentation policy is reqd.

Need → Separation of Instr<sup>n</sup> space and data space.

→ Harvard architecture is implemented in 8051 MCU (microcontroller).

→ Processor (8086)

① CPU is off chip

*Computer  
I/O  
CPU  
Memory*  
There is CPU, memory and I/O. They are communicated through Bus.



Controller (8051)

① MCU is on chip processor.

*I/O device*  
There is no need of external communication. As I/O, CPU, memory are all present on the same board.

Ques: How the MCU implements HARVARD Arch?



→ 8051 consists of 2 memory chips one is

- ✓ ① Read only memory
- ✓ ② Random access memory.

① ROM is pointed by Program Counter

② RAM is pointed by Data pointer Register

→ ROM is code memory whereas RAM is data memory  
to execute

\* User programs are not allowed in 8051 mc.

→ In the embedded system, 8051 mc are controlled.

↓ Embedded System

Any device which includes programmable processor called as Embedded System. Any

Microwave oven ① Accept user IP. ✓ 23 ✓

② Decrement user IP 22.

NZ : 00  
T F

③ JNZ L (Jump if non zero).  
move to previous

Once the user IP label becomes zero. It target address cond

Stops the process. ④ Stop. (Stop the heating process).

→ Data (23) is stored in RAM. As power is off, Data is lost due to RAM is now volatile memory.

→ Code memory → ROM (Code remain same, IP data is different due to RAM).

→ Von Neumann arch is suitable for PC's and Harvard arch is suitable for Embedded System.

## Components of Computer System.

Computer Syst. consists of 3 components:-

- (1) CPU
- (2) memory.
- (3) I/O Devices

These 3 components are communicating each other using an efficient channel.

Efficient communication channel is called as Bus



Communication is rapid due to the reason that all of three acquiring different functionality such as.

(1) CPU → processing !



(2) Memory → storage of program.

(3) Input/O/P → provide external communication  
System → user  
User → Syst.

in a need of communication b/w 3 components  
 i.e. SB [System Bus].

Some " b/w fast & slow, there is speed gap.  
 To minimize the speed gap, we used synchronization.

System Bus Key characteristics is

Shared transmission media i.e. one transmission at a time

• Bus control is in the hands of CPU  
 because it is fast device which is communicated with S/m bus.

• S/m bus consists of 3 categories of lines.  
 ① Address line ② Data line ③ Control line.

① Address lines are used to carry the address to memory & I/O.

→ Address lines are Unidirectional  
 CPU only initiates memory / I/O operation by sending the address to I/O or memory.

Based on no of address lines total capacity of memory which is supported by the processor is determined.

→ Memory is controlled by address lines if we are having 20 address lines, we can have 1 MB. Memory.

② Data lines are used to carry the data b/w CPU & memory & I/O. Data lines are

Bidirectional Based on the no of data line, the performance of processor depends.

eg. 8 bit processor  $\Rightarrow$  8 data lines.

(3) Control lines are used to carry the control signals and timing signals.

Control signals are used to enable the operation and timing signals are used to provide synchronization.

Control lines are Bidirectional. It is used to carry read & write signals from CPU to memory or I/O devices and also carries interrupt signals from I/O devices to CPU.



\* Each and every I/O device has its own address which is equivalent to port address. (assume I/O address = 1000)  
Memory address space  
I/O address space

want  
CPU communicate memory.

→ need of "comm" channel → [ need of Address & CS ]



→ Comparison takes place b/w memory & I/O. Some address is present in both. There is & even both components are having same control signals. In this scenario we lead to unsuccessful operatn. CPU time wasted / Performance decreases. To avoid this prob, we require to modify S/m bus.

When the CPU communicates with memory. It sends the signal requests i.e. address along with the control signal is placed on the S/m bus. (Address lines & control lines). Assume that memory address is matching with I/O address and control signals leads to same scenario.

The characteristic of bus is shared media so memory and I/O devices access the address and control signal present on the S/m bus.

If the address and control signals are matching with memory and I/O. Two components are initiating the job at a time it is unsuccessful operatn.

because of the bus allows one transmission at a time.

The no of clock cycles are increases, the perf. decreased.



"Single Requests" enables 2 devices.

Point:- To overcome this problem we can use 2 types of system bus organizations one is

(1) Isolated I/O.

(2). Memory mapped I/O.

Isolated I/O: Use the same bus but different control signals



Control signals are divided into 2 types

- ① Memory related control signals <sup>MemRD<br/>MemWR</sup>
- ② I/O related control signals <sup>IORD<br/>IOWR</sup>



Isolated I/O mechanism is implemented in 8086 μp.

8086

As the no. of pins increases, complexity increases!

Active High

consists of 3 pins

Up to 3 pins.

- ① M/I/O → <sup>active low</sup> Pin (Dual pin/one pin)

instead of 4 pins.

Set=1 ⇒ Active High enabled ⇒ memory enabled.

Set=0 → I/O enabled (Active low pin).

Three Pins  $\Rightarrow$

Both are enabled

② ① M/I/O  $\Rightarrow$   $\begin{array}{|c|c|c|} \hline 1 & 1 & \\ \hline \end{array}$   $\Rightarrow$  memory

$\overline{RD} \Rightarrow \begin{array}{|c|c|c|} \hline 0 & 0 & 1 \\ \hline \end{array}$  here memory read

$\overline{WR} \Rightarrow \begin{array}{|c|c|c|} \hline 0 & 0 & 0 \\ \hline \end{array}$  operation is enabled

Now read:

$\Rightarrow$  we can't have - Read  
and write simultaneously  
I/O

Memory mapped I/O



Points  $\Rightarrow$  The processor which supports in and out ~~opera~~ instructions. That processor implements Isolated I/O concept.

IN out related to I/O device.  
 $\hookrightarrow$  when these are executed,  
memory is unaffected.

Isolated I/O is also known as I/O  
mapped I/O

$\Rightarrow$  It consists of single bus and some control signals.

Take some addresses from the pool of memory address space and assign those addresses to I/O devices and disable the same in the memory.

Memory





I/O(01) is matched with a  
I/O(01) enabled but do

Even we have efficient memory space, we can't utilize efficiently bcoz of I/O devices are sharing the memory address space

→ 8051 implements memory mapped concept

## Central Processing Unit :

CPU consists of three components one is

(1) register

(2) ALU

(3) control unit.

Register :- functionality is storage.

Why it is not connected with S/M bus?

As we have seen in previous scenario, all memory are connected via S/M bus. But register is not connected with S/M bus.

CPU → execution of Prog → Prog → memory.

(communication)

To get an info → we need S/M bus

After the enabling the ~~not~~ ACQ, we are accessing the data through memory. Access in Hertz.

To reduce the Access time, There is need of register inside CPU

Design constraint

Ques why registers are present in CPU?

Ans When the CPU executes the program, always communication takes place b/w memory and CPU.

- (1) If the communication takes place, b/w the fast component and slow component. There is a need of synchronization to minimize the speed gap.
- (2) Under that condition, memory hierarchy is the synchronization mechanism used b/w CPU and memory.

## Memory Hierarchy

Gathering 5m supported memories and arrange them on some priority basis.

Objective : → ① Synchronization.

② Balancing the Bandwidth b/w CPU & memory.

↓  
Data transfer rate

no { bits / sec }

System supported memory :-



Follow Bottom to top approach of memory Hierarchy.

① Capacity Decreases:

② Access time Decreases Perf  $\propto \frac{1}{\text{Access time}}$   
 ↓  
 Performance Increases.  
 ↓  
 cost per bit Increases.

Instead of accessing from memory, up, (valuable) data is stored in registers

Registers  
↓  
from  
memory

Registers are used to store frequently used data in it.

Registers are expensive than cache due to security.

There is no processor w/o register. The mandatory register set is regd for use of floating point numbers.

Registers are used to store the data which is referenced by CPU frequently.

Each and every processor consists of a minimum set of registers.

Mandatory register is as follows :-

### (1) Program Counter

↓  
Mandatory register

PC points starting instrn address and immediately it points to the next instrn address.

### (2) Registers)

### (3) Instruction register :-

It holds the currently fetched instruction for the purpose of Decode

Because IR is predefined by Instruction format.  
register  
Inst"address"

PC → M[ ]



If there is no IR, no one knows about the opcode. Due to this we can't decode to the next module. So due to this reason of is need of

# ADDRESSING MODES

→ data related.  
→ addressing modes.  
→ Init<sup>n</sup> related  
addr modes

Shows the way where the required object is present



Addressing Modes are classified in 2 types.

- ① Sequential control flow Addressing modes  
(PC = PC + 1)  
(Imperative mode)
- ② Transfer of control flow addressing modes  
(Data related addressing mode)



- ① Sequential → concentrates on data not on Inst<sup>n</sup> control flow
  - Because the process of execution of Sequential Instructions called as Straight line Screening (no need of zig zaging, only straight path).

- ② Transfer of control flow addressing modes
  - focus on Instructions because of executing the program from random locations

## Types of Sequential control flow

### Addressing modes

These type of instructions are divided into subtypes.

① Register based Addressing modes.

② Memory Based Addressing modes.

① REGISTER BASED AM :- only one addressing mode is implemented i.e called as Register Addressing mode.

It shows the register name where the data is present.

Eg. ①  $\boxed{\text{MOV AX, BX}}$  → data is present in reg with name (BX).  
↓      ↓  
register    register  $[\text{EA} = \text{BX}]$ .  
(D)      (S)

Eg. ②  $\boxed{\text{MOV } \text{R}_0, \text{R}_1}$        $[\text{EA} = \text{R}_1]$ .  
↓      ↓  
D      S

EA  $\Rightarrow$  register name.  
Effective address : The actual address where

1 Arithmetic computation required, 1 register reference required, 1 memory reference is reqd to get the data from the memory.

### \* Autodecrement & Autoincrement Indexed Addressing Mode

Autodec ] Displacement is always constant ie 1  
Auto inc ]

MOV R0, (R2) + → Post Inc.      Auto

Auto-increment      MOV R0, (R2) + Post Inc.  
 $EA = 0 + [R2]$ ,       $ER_2 = 0 + [R2]$ .  
 $EA = [R2] + 1$

Prec increment / Post dec - :

MOV R0, +(R2)

$EA = 1 + [R2]$  → first displacement is added, then we refer data.

This addressing modes are implement the loops in 2 types :

- (1) prec increment      & Post dec
- (2) pred decrement      Post inc

Some of the processors also introduced extra addressing modes called as Scaled addressing modes.

Scaled addressing modes means perform some arithmetic computations b/w the registers to calculate the effective address.

e.g.

MOV AX, [SI] [DI]

$$EA = [SI] + [DI].$$

8086  
allows  
scaled  
addressing

MOV AX, [BP] [DI]. Scaled means we  
need to perform arithmetic operatn

EA = [BP] [DI]. displacement  
in add. to calculate EA.

MOV AX, [BP] [DI] 600.

↓      ↓  
Base index Relative

Eg MOV AX, [BP] [DI] 600 , displacement  
↓  
Base address      → Indexed  
address

Effective addr  $\Rightarrow$  Base + Indexed + displacement  
↓      ↓  
times

Only scaled Addressing Mode's is supported  
by 8086 up: NO other processor supports.

Transfer of Control flow Addressing  
modes

↓ PC relative addressing mode.

is used in transfer of control flow  
addressing modes. In this addressing mode,  
the address of the successive instruction  
is transferred into program counter.

Eg Tmp 2000      prog counter is  
PC  $\leftarrow$  2000      automatically  
loaded with 2000.

1. Effective address = contents of reg.  $\Rightarrow$  [reg]

Eg reg = 2000

mov r0, (reg)

Indicate Indirect

Addressing mode (Memory)



outer → access the data in memory to get data).

1 reg reference to get the EA.

1 memory reference to get the data.

④ Memory indirect Addressing mode :→

The effective address is present in the address of operand field



NOTE

1 memory reference is reqd to read the effective address. and another memory reference is reqd to read the data.

Total memory ref = 2.

→ Register Indirect is better than Mem

30/Dec/2010

### Indexed Addressing

This addressing mode is used to implement the ARRAYS.

In this mode one register act as indexed register is denoted by  $R_i$

Effective address =  $\underbrace{\text{constant}}_{\text{constant}} + \underbrace{R_i}_{\text{Base address}} + \underbrace{[R_i]}_{\text{Displacement}}$  of an Array

Eg. MOV R0, 2(R2).  $R_2 = 1000$

constant

$\Rightarrow [2 + 1000]$

$$EA = 2 + 1000 = 1002$$

$$\text{Data} = [EA] = [1002]$$

$R_2 = 1000$

|      |  |      |
|------|--|------|
|      |  | 1000 |
|      |  | 1001 |
| 2345 |  | 1002 |
|      |  | 0    |
|      |  | 9    |

How these addressing modes are implemented?

### Implementation of addressing modes

- Addressing modes are implemented in 2 ways:-
- ① one is Implicit mode
- ② Explicit mode.

Implicit Mode Symbols are bounded with name

Eg # → Anything followed by # is constant  
 It is only suitable when there are limited no of addressing modes are supported.  
 Otherwise explicit mode is required if there are large no of add<sup>n</sup> modes.

MOV R0, # 2345

data

MOV R0, R1      bound with reg name.  
 The processor automatically

direct      identify that data  
 ↑      & placed in R1

MOV R0, 2000 (no # symbol, 2000 is  
 (data is present at memory address).  
 in memory.

read the data & move to R0).

Indirect

(a) → Indirect, min 2 ref are required.

MOV R0, @ R1  
 MOV R0, @ 2000

1 ref → Reg name.

1 ref → other that value access  
 the memory

If the SPU supports with min ~~one~~ set of addressing modes, Implicit implementation is enough to decode the type of addressing mode.

So,  $\text{instr}^n$  format consists of only 2 fields one is op code and other is address of operand (No need of decoders to decode the type of addressing mode).

Explicit Mode: There is a need of extra field regd in the  $\text{instr}^n$  format to enable the type of Addressing mode.



Here mappings of addressing modes are there so we can say here explicit mode implementation is reqd.

(3)

## Accumulator register

Accumulator is used to store the input data and O/P data of ALU



Here this implementation undergoes acc to PC



Acc. One register is reqd to pass the input data to ALU. We need register i.e. ACC. Whatever we need data we need to pass to ALU. If the connectivty is I.P. data. one direction is providing the I.P. data & other direction is providing the regt.

register which has  
directly  
communication with ALU  
with ALU  
is known as PC

MOV ACC, no.

# ① Memory Address Register / Memory Buffer register

MAR

MBR

These registers carries address and data.  
resp.

(MAR)

→ memory addr register is connected with address lines of S/m bus.

(MBR)

→ memory buffer Register are connected with Data line of S/m bus.

total seq of Prog Exec

Draw the sequence Diagram for read operation  
(Read an instr from memory).

(CPU)

I want to communicate

with Memory & [PC] → MAR

I/O. There is a

need of practically

RD → C/L's of

I/O

transfer the address of

SB.

into MAR bus. Memory mapped

of same

it is connected with

add & same

signals). PC

Memory is enabled to

Memory.

[PC]

(instn address)  
Value transferred MBR

instn address

Memory is enabled to PC

Address lines of SB.

Request  
multinucleate  
multicore  
address  
control signal

Value transferred MBR

instn address

Memory is enabled to PC

Address lines of SB.

RD → Control lines of

SB

Binary sigs → C/L's of SB → MBR

Memory is enabled acc to PC

PC used to access Instn address

Byte data

Q Int

Currently  
fetched  
instn transfer  
to IF

$PC \leftarrow PC + \text{step size}$

When the source is PC, the dest<sup>n</sup> is always IR.  
This process is called as fetch cycle.

Instruction Fetch (Memory to CPU)

The process of transferring memory content to CPU based on program counter called as Instruction fetch.   
→ It is a write, it has control memory, it initiates output.

• Inst<sup>n</sup> fetch



After Inst<sup>n</sup> fetch, decoding takes place.

The process of enabling the ALU gates is called as Decoding.

If the data is present in the register, IR[A0] IRreq, A0 field (Contd) IR[A0] field gives the register name where the data is present so the sequence of operation is

$$\text{reg} \Rightarrow \text{IR[A0]} \rightarrow \text{register name} \rightarrow \text{ACC}$$

Access register  
Content transferred to Acc

~~8/12~~

If the data is there in the memory, different addressing modes are effected. Under that condition, assume that immediate mode is applicable. So sequence is

Immediate  $\Rightarrow$  IR[AO]  $\rightarrow$  Acc Dut<sup>n</sup>

Source = IR[AO] Dut<sup>n</sup>  $\Rightarrow$  Acc

No it  
itself  
data

② Other than Immediate addressing mode we need to calculate the effective address.

AD  $\quad \quad$  If itself is  
If itself is not data then it is memory address.

Mem  $\Rightarrow$  IR[AO]  $\rightarrow$  MAR  $\rightarrow$  AL's  $\rightarrow$  Binary seq  $\rightarrow$  MBR  
 $\downarrow$  RD  $\rightarrow$  CL's

Source is address  
of operand  
field (memory)  
content itself  
is data)

Acc

Process of transferring the data from the IR[AO] into acc is called as

OPERAND FETCH  $\quad$  ~~10<sup>th</sup> dec 2010~~  
~~30/12/2010~~

The process of transferring the data into ALU and convert one form to another form of data based on ALU operation called as Process data

Memory to CPU  $\rightarrow$  Fetch cycle

Inside the CPU  $\rightarrow$  Execution cycle

The Instr<sup>n</sup> execution sequence is represented by Inst<sup>n</sup> cycle or consists of 2 subcycles one is fetch cycle & other is execution cycle that is shown below.



If source is PC → Dest<sup>n</sup> ( $\rightarrow$  IR)  
 If source is IR → Dest<sup>n</sup> (Acc)

one Inst<sup>n</sup> execution is completed, processor is ready to execute next Inst<sup>n</sup>.

There is always delay due to more time took by EC.  
 So we can say its not successful operat<sup>n</sup>.

## CPU Organization

7 CPU <sup>org</sup> is classified 3 types, based on no of address fields present in the Instruction. i.e ① stack organization  
② accumulator based m/c  
③ general register org.

Stack Org: It allows zero address instruction  
it only supports with push and pop  
operations to transfer the data. It  
consists of only opcode field (no  
address field).



The organization allows 1 address inst<sup>n</sup>  
The default register is accumulator.



③ General register organization:

This org allows 2 address & 3 address inst<sup>ng</sup>

→ 2 address inst<sup>n</sup> consists of 2 address fields  
in the instruction → forward address / base



After performing operat<sup>n</sup>  
see act as best n

(original data is nowhere present, In other words I/p data is modified with o/p data)

⇒ 3 address Inst<sup>n</sup> format → To overcome the prob<sup>m</sup> of 2 address Inst<sup>n</sup>, where the no of registers are more, we can use 3 address Inst<sup>n</sup>.

Format:



best  
add's

Operand 1 & operand 2  
address

Add  $\overline{x_0}, \overline{x_1}, \overline{x_2}$

$r_1, r_2$  added } NO overlap?

Reduced Instruction Set Computer (RISC) allows 3addr Instns

beacuz the no of registers are more.



Add  $r_7, r_2, r_3$  (no need of maintaining)

Beacuz whole reg file is connected with ALU

One register as Acc).

4 Address Instns  $\rightarrow$  This Instn consists of 4 address fields



There is no need,

4 A4 as PC

already point

to next instn

addrs.

As long as PC is not present in processor then there is no need of 4addr. But PC is mandatory register. So need of 4addr Instn.

If the four address comb<sup>n</sup> is specified  
no possibility of [Date] [Time]  
Page No.: Unit 7

Consider the hypothetical GM which supports 1 address size and 2 address inst<sup>n</sup> formats. The 16 bit inst<sup>n</sup> is placed in 128 word memory. If there exists  $(2)^2$  address inst<sup>n</sup> how many 1 one address inst<sup>n</sup> can be formulated.

16 bit inst<sup>n</sup>  
↓

128 word.

$\Rightarrow 2^7$



2 address inst<sup>n</sup>

opcode | no. |

1 address inst<sup>n</sup>

$2^7$  cells  $\rightarrow$  1 word.

There are  $2^7$  cells, each cell contains 1 word.  
Address size = 7

2  $\rightarrow$  4 possible  
opcodes

Address line = 7

Among 4 comb<sup>n</sup>, only 2 are defined.

only one  
address is reqd

Comb<sup>n</sup> [10 9 7 ]  
[11 ] [opcode | Addr.]

There is  
no free comb<sup>n</sup>  
so for it move  
further

$\Rightarrow$  no of free comb<sup>n</sup>

$\Rightarrow 2 \times 2^7$

no of address free

free comb<sup>n</sup>

One address Inst<sup>n</sup>  
opcode size

0000

0001

1111

2 address Inst<sup>n</sup>

00

[op | A1 | A2]

01

00

10

01

11

10

opcode

Q. Total possibilities = 4.

Now possibility Assume 3 possibility of 2 address. ]  
possibility [ " of 1 address ].

1 possibility of 2 address. ] Map  
3 possibility of 1 address ] possibility.

Case 1 If 2 addresses are already there, only 2 comb<sup>n</sup> are possible.

$$\Rightarrow 2 \times 2^7 \Rightarrow \underline{\underline{2^8}}$$

Address size = 7 bits

Inst<sup>n</sup> size = 16 bits

Consider 2 address inst<sup>n</sup> ie [opcode | Address A2 ]

Total = 2 address inst<sup>n</sup> are possible with the above format is 4.

i.e. 

|    |   |
|----|---|
| 00 | 1 |
| 01 |   |
| 10 |   |
| 11 |   |

 Assume already 2 inst<sup>n</sup> are used

The free comb<sup>n</sup> are 2. Use this comb<sup>n</sup> to define 1 address inst<sup>n</sup>.

i.e.



Date: \_\_\_\_\_  
Page No: \_\_\_\_\_

If there is no free comb<sup>n</sup>, then no alternative.

Case 1: If there exists 4 2 address inst<sup>n</sup>s one address inst<sup>n</sup>s are zero.

Case 2: The min no of 1 address inst<sup>n</sup>s supported by the above S/m is equivalent to 1 possibility x addresses.

1 Possibility x Address is  $2^7$

$$\Rightarrow 2^7 \text{ inst } n's$$

$$\Rightarrow 128$$

Case 3: The max no of 1 address inst<sup>n</sup>s supported by the above S/m is max possibility  $\times 2^7$ .

Max  $\rightarrow 3 \times 2^7 \rightarrow 128 \times 3$   
 $= 384$

Q: Consider a computer S/m that has 32 bit inst<sup>n</sup>s and supports with 12 bit addresses if there are 250 address inst<sup>n</sup>s How many 1 address inst<sup>n</sup>s are supported.



Inst<sup>n</sup>s  $\Rightarrow 6 \times 2^{12}$



Q. Halt

→ Every time PC is loaded

everytime PC is loaded

- ③ Word 40 contains 60 with 4, same operation is loaded, require loop.

- ⑥ whatever 30, that is loaded X.

- ⑦ load indirect 30

$$30 \rightarrow 50 \rightarrow 70 \quad X$$

EA

- ⑧ load indirect 20.

$$20 \rightarrow 40 \rightarrow 60 \quad \checkmark$$

Inst "Size"

R1  
15000



1006-1007 Direct MOV R1, 5000

1008-1011 Indirect MOV R2(R1).

1012-1015 Reg. Add R2 R3.

1016-1023 MOV MOV 6000, R2

1024-1027 Halt

→ enters into loop operation → 1. bit, 32 bit, 4 byte  
bit require 4 memory locations

To store one word, we need 4 locations  
1 word = 32 bits memory is byte addressable.

PT

⇒ CPU only respond to interrupts only after completion of current execution Inst "



if interrupt is  
present

During the add operat<sup>n</sup>, return address will be 1016

Once add undergoes execution PC pt  
not inst<sup>n</sup> address is 1016 is pushed  
onto stack.

(b) Register  $\leftarrow$  memory  $\Rightarrow$  3 clock cycles.

| Inst <sup>n</sup> | F+D | execute time | Instruction size<br>(words) |
|-------------------|-----|--------------|-----------------------------|
| I <sub>1</sub>    | 4   | 3            | 2                           |
| I <sub>2</sub>    | 2   | 3            | 1                           |
| I <sub>3</sub>    | 2   | 1            | 1                           |
| I <sub>4</sub>    | 4   | 3            | 2                           |
| I <sub>5</sub>    | 2   | 0 ✓          | 1                           |

fetch cycle  
I<sub>1</sub> I<sub>2</sub> I<sub>3</sub> I<sub>4</sub> I<sub>5</sub>

decoding cycle  
I<sub>1</sub> I<sub>2</sub> I<sub>3</sub> I<sub>4</sub> I<sub>5</sub>

execute cycle  
I<sub>1</sub> I<sub>2</sub> I<sub>3</sub> I<sub>4</sub> I<sub>5</sub>

Ans. 24

I<sub>1</sub>: R1, 5000  
I<sub>2</sub>: R2(R1) : indirectly data  
I<sub>3</sub>: Add: in memory.  
I<sub>4</sub>: Register  $\rightarrow$  Memory

F

D

OF X

PD X

Op

This completes at the  
end of Decoding.  
No need of operand  
fetch & process data

~~6.9 my~~Add A[R<sub>0</sub>], @B.

Indirect. Indirect

addressing add mode - fetch (Even)

mode

1st cycle

How many  
mem ref  
are there?

A and B → addresses present  
in memory.

ADD A[R<sub>0</sub>], @B; → mem addr.Operand | Operand  
fetch | fetchEA = displacement + R<sub>0</sub>  
(is memory)2 M.R → one:

2 mem refs are reqd

to read op 1

1 mem ref are req.  
to read op 2.Add A[R<sub>0</sub>], @B

2 2

✓ 4 memory ref -  
4 → to read data

+

(2) → regt. ✓ (mem  
ref). writing

going to  
memory  
writing

Ques

Consider a base m/c which supports with data transfer, data manipulation, transfer of control instructions which is running at 2 Gigahz speed. If the load instruction takes 3 clock cycles and store instruction takes 5 clock cycles, Manipulation inst. takes 6 clock cycles and branch inst. takes 8 clock cycles. Suppose the frequency of these instructions are 30%, 20%, 30% & 20% respectively what is the speedup when the processor when the processor uses 1ns. cycle time. where CPI is equal to 1.

$$\text{Load Inst} = 3 \text{ clock cycles} \Rightarrow 30\%$$

$$\text{Store Inst} = 5 \text{ clock cycles} \Rightarrow 20\%$$

$$\text{Manipulation Inst} = 6 \text{ clock cycles} = 30\%$$

$$\text{Branch Inst} = 8 \text{ clock cycles} = 20\%$$

$$\text{CPU Time} = \text{no of clock cycle reqd} \times \text{clock cycle}$$

9  
CPU rate

CPI  $\Rightarrow$  no of clock cycle reqd

Inst count

1/8  
5  
3

$$CPU \text{ time} \Rightarrow 3 \times \frac{3}{10^9} + 5 \times \frac{5}{10^9} + 6 \times \frac{3}{10^9} + 8 \times \frac{2}{10^9}$$

$$CPU \text{ time} = (0.9 + 1 + 1.8 + 1.6) \times 10^{-9} / 2 \times 10^9$$

$$\Rightarrow 5.3 \times 10^{-9}$$

$$T = \frac{1}{2 \times 10^9}$$

$$T = 0.5 \times 10^{-9}$$

$$\Rightarrow 0.5 \text{ ns}$$



CPI = 1.

$$\text{CPU time of Ideal processor} = \text{CPI} * I_c * \text{clock time}$$

$$= 1 * 1 * 1 \text{ ns.}$$

$$\Rightarrow 1 \text{ ns.}$$

Speed up  $\Rightarrow$  Execution time of Base m/c

Execution time of Ideal m/c

$$\Rightarrow \frac{2.65}{1} = 2.65 \text{ ns.}$$

Ideal m/c is 2.65 faster than Base m/c.

workbook - 91 (Pg)

(b) a) Array  $\rightarrow$  Indexed Add<sup>n</sup> mode.b)  $*A + i$  (Pointers)  $\rightarrow$  Indirectc) init temp =  $*x$   $\rightarrow$  Auto inc (loops).

(b)

$$R_3 = 2000$$

$$10 \quad 3000$$



M. Reference

- (i) Direct  $1000 - 1007 \rightarrow \text{MOV } R1, 3000$   $\rightarrow$  Direct  $\rightarrow$  (to)  
 Register  $1008 - 1001 \rightarrow \text{MOV } R2, (R3)$   $\rightarrow$  10  $R1 = 3000$   
 Register  $1012 - 1015 \rightarrow \text{Add } R2, R1$   $\{ 0 + 10 = 20 \}$   $R2 = 10$   
 Register  $1016 - 1019 \rightarrow \text{MOV } (R2), R2$   $\rightarrow$  10  $10 + 10 = 20$   
 A.M.  $1020 - 1023 \rightarrow \text{Inc } R3$   $R3 = R3 + 1$   $R3 = 20 + 1 = 21$

 $1024 \rightarrow 1027$ 

entry

 $R1 \leftarrow R1 - 1$  $R1 \leftarrow 10 - 1$ 

(d)

 $1024 \rightarrow 1028 \rightarrow 1035 \rightarrow 1036 - 1039 \rightarrow \text{BNZ } (loop) \rightarrow \text{Target addr.}$ 

Halt

 $\rightarrow \text{TOC (unconditional). m[3000] = 1000}$  $R1 = 10$  $R = 10$ 

loop initiated by  $R1$  value  $10$   $M[3000] = 1000$

$$R_3 = 2000$$

$$M[2000] = 100$$

Instead of

register if we

are having M. Index

AM then we will

use M.I as 41



$$M[2010] = 100$$

(b) (a) Reading data, modifying the data, updating the data at same location

Loop is executed 10 times, as data changed  
upto 2009. ~~(2010)~~<sup>th</sup> location  $\rightarrow$  no operation  
is performed. After execution, same  
values is reflected. i.e. 100.

Byte add. represent size of word i.e. 4.

① M-R, R-M  $\Rightarrow$  3 cycles : 2 clock cycles/  
 word.  
 Add "Inst"  $\Rightarrow$  1 cycle :  
 Inc " " = 1 cycle  
 Dec " " = 1 cycle  
 ③ Branch! Inst " " = 9

|   | Op.                                    | Expect'n                           | F+D                         |
|---|----------------------------------------|------------------------------------|-----------------------------|
| 1 | MOV R <sub>1</sub> , 3000              | $\rightarrow$ 3 cycles             | $2 \times 2 = 2$            |
| 2 | MOV R <sub>2</sub> , (R <sub>3</sub> ) | $\rightarrow$ 3 cycles $\times 10$ | $1 \times 2 = 2 \times 10$  |
| 3 | Add R <sub>2</sub> , R <sub>4</sub>    | $\rightarrow$ 1 cycle $\times 10$  | $1 \times 2 = 2 \times 10$  |
| 4 | MOV (R <sub>3</sub> ), R <sub>2</sub>  | $\rightarrow$ 3 cycles $\times 10$ | $1 \times 2 = 2 \times 10$  |
| 5 | Inc R <sub>3</sub>                     | $\rightarrow$ 1 cycle $\times 10$  | $1 \times 2 = 2 \times 10$  |
| 6 | Dec R <sub>3</sub>                     | $\rightarrow$ 1 cycle $\times 10$  | $1 \times 2 = 2 \times 10$  |
| 7 | Brnz                                   | $\rightarrow$ D <sub>90</sub> 90   | $2 \times 2 = 4 \times 10$  |
| 8 | Halt                                   | $\rightarrow$ 0                    | $1 \times 2 = 2 \text{ NO}$ |
|   |                                        | There is no operation              |                             |

There is no need of operand fetch and Precess data.



- ① mov R<sub>s</sub> 1 → Immediate  $R_s \leftarrow 1 \Rightarrow R_s = 1$
- ② load Rd, 1000(R<sub>i</sub>); → Index  $R_d \leftarrow 1000 + 1 = 1001$
- ③ add Rd, 1000 → Direct  $R_d = 1000 + 1 = 1001$
- ④ store 0(Rd), ~~20~~ → Index.

EN 0 + 1001  $\Rightarrow$  if I  $\rightarrow$  "not add", data  
1001



$R_1 \rightarrow R_2$  transferred. First  $R_1 \rightarrow R_2$  transferred  $R_3$ . speed up.  
 $R_1 \rightarrow M[100]$ : Register time is less as compared to Mem access time instead  
If transferring the data from Reg to Mem we will transfer data from Reg to Reg.

(14) (a)

Self relocating code  $\rightarrow$  Subprograms.  
Auto Invc  $\rightarrow$  add " mode of data "

(15) (b)

Privileged  $\rightarrow$  GRU compulsory to handle  
the interrupt.

(c)

RFE with  $\rightarrow$  After control send back  
to main prog, ( $\therefore$  interrupt  
handling is not possible.  
 $\therefore$  Only one interrupt is served per.

If interrupt is present, def service routine  
need to be executed. Then control  
comes back to main prog.

(16) (a)

(b)

$I_1 = 4.5 \mu\text{sec}$   $I_1 \rightarrow H.P$

$I_1 = 25 \mu\text{sec}$   $I_1 \rightarrow L.P$

(c)

$I_2 = 35 \mu\text{sec}$  Range of time for  $I_3$ ?.

$I_3 = 20 \mu\text{sec}$ .

✓ Response time  $\rightarrow$  Time required to  
respond to that interrupt  $= 4.5 \mu\text{sec}$

3 Interrupt occur simultaneously)

$\Rightarrow$  After  $I_1, I_2$ , CPU respond  $\rightarrow$  to  $I_3$  i.e

$\Rightarrow$  Min  $\rightarrow$   $I_3$  Suppose one interrupt  $I_3$  is present



$4.5 \mu\text{sec}$  Range ( $I_3$ ) =  $20 + 4.5$   
 $\downarrow$  Response  
Service time  
time

Min

(b)

$$R_3 = \text{Response time} + \text{Service time}$$

$$\Rightarrow 20 + 4.5$$

$$\Rightarrow 24.5$$

(c)

$$\text{Map } k_3 = 25 + 4.5 + 4.5 + 35 + 4.5 + 20$$

$$\Rightarrow 93.5 \mu\text{sec}$$

(d)

Absolute add  $\rightarrow$  address of operand field

Relative  $\rightarrow$  arithmetic operat<sup>n</sup> also.  
 ↓ deals with  
 data

(e)

a

(f)

c

usage

## Instruction Cycle

Instruction cycle consists of 2 subcycles, one is

- (1) fetch cycle
- (2) execution cycle

- (1) Fetch cycle reads the instr<sup>n</sup> from the memory based on program counter address.
- (2) At the end of this stage, program counter is incremented by step size to point the next instr<sup>n</sup> address.
- (3) In the execution cycle, three operations are carried out
  - ① Decoding  $\rightarrow$  it enables the SW gates.
  - ② operand fetch : It reads the data either from register or memory & Accumulator.
  - ③ process data : It converts one form of data into another form.

Suppose ACC is  
16 bit



Do  $16 + 16 \rightarrow$   
16bit generated by ALU = 17bit.  
(17bit data)  
ACC. 16bit (capacity). is 10<sup>16</sup>.

So we can say. Extra storage is reqd.



When ALU generates a result which is greater than the size of Accumulator then it is a loss of data.

Extra storage is changed from ALU to ALU. ALU is not accepting any I/P data from ACC.

To maintain correct result generated by ALU, there is a need of extra storage which is communicated with ALU in outward direction.

Extra storage is a flag register or program status word register.

Flag is flip flop (Bistable device), it can be set or reset based on the nature of ALU.

→ flag register consists of set of flip flop. Individual flag is flip flop which is set or reset based on the result quantity of ALU.

→ flags are classified into 2 types

- ① conditional flags.
- ② control flags.

~~Ans~~  
Conditional flags → These flags are associated with conditions. They are set or reset based on the status of cond'n.

All cond'n of flags are directly connected with ALU.

Control flags → Based on the status of these flags, program execution sequence is altered.

Conditional flags → Types:

① CARRY FLAG



Cond'n : If there extra bit out of MCB.

10110.

10001

10011! → This operation is performed by ALU. After ALU, it checks the cond'n of extra storage.

① If cond'n is True, then set the carry flag.

$$[CY = 1.] \text{ (Set)}$$

(2) If the cond<sup>n</sup> is false?

$$F = \text{reset} = 0 \quad \text{CY}$$

In mathematical model similar<sup>n</sup>, this carry flag is effectively used.

(2) parity flag → ?

is used in the serial data communication.  
The cond<sup>n</sup> is

odd no of 1's present in the result

In Arithmetic, parity flag is of no use.

$$\rightarrow \text{It is set } T=1 \quad [PO = 1]$$

$$F=0 \quad [PE=1]$$

(3) Auxilliary flag carry flag is used in BCD arithmetic computations ie decimal arithmetic auxilliary carry are used.

Decimal no are used in 2 ways.

(1) unpacked BCD.

(2) Packed BCD.

(1)

Unpacked BCD  $\Rightarrow$  1 digit = 8 bits

(8 bits) 0 = 0000 0000

(16 bits) 23 = 0000 0010 0000 0011

byte

It requires more memory space to store the no. into the memory. To reduce the memory space packed BCD notation is used. In the packed BCD, (1 digit = 4 bits).

1 digit = 4 bits

0 - 0000 ✓

23 - 0010 0011 ✓

→ By default we consider packed BCD of BCD.  
If not mentioned as packed BCD or unpacked BCD.

Position

Condition: If there is an extra bit from 3rd bit to 4th bit.

→ lower nibble → higher nibble  
 $89 \rightarrow 1000 \ 1001$       89 converted into packed BCD.  
 $+ 28 \rightarrow 0010 \ 0100$   
 $\hline 1011 \ 0001$       There is an extra bit from 3rd to 4th bit i.e.

Higher nibble      Lower nibble

If the condn satisfies true then  $T = 1 = AC$   
 False = 0 = NA No auxiliary carry

#### (4) Zero flag:

Condn  $\rightarrow$  Is the acc. value is zero. ✓

Zero flag  $\rightarrow$  If it is true, if the condn is zero.

eg  $R_0 = 12, R_1 = 12$   
 $\text{SUB } R_0, R_1$       given condn is set  
 $\text{acc} = 0$       conclusion  $\rightarrow$  2 nos are equal  
 $\rightarrow$  it is false, if the condn is non-zero.

#### # Compare operat<sup>n</sup>s

If both are zero, then set  
 If both are not zero, then don't

$r_1 = 8$

$r_0 = 12$

Sub  $r_0 - r_1$

$Acc = 4 \text{ zero } 4$

If the no is not equal then

We can say whether

the no is greater or less.

$r_0 = 12$

$r_1 = 8$

$\boxed{\text{Zero flag} = 0}$

4

$\boxed{Cy = 0}$

It means  $\downarrow$  The operation is such that  $r_0 > r_1$ .

Sub  $r_0, r_1$

$r_0 - r_1$

$\boxed{r_1 = 23}$

$\boxed{r_0 = 12}$

$\Rightarrow 12$

$\boxed{\text{Zero flag} = 0} \rightarrow \text{reset}$

$\boxed{23}$

$\therefore \text{borrow is reqd}$

$\boxed{Cy = 1} \text{ set}$

Conclusion<sup>n</sup> Second parameter > 1<sup>st</sup> parameter

When executing Compare Operation, It involves subtraction. If the 2 values are equal Zero flag is set. If 2<sup>nd</sup> value is less than 1<sup>st</sup> value, then zero flag is reset and carry flag is also reset.

Q. If the 2<sup>nd</sup> operand is greater than 1<sup>st</sup> operand, then zero flag is reset & carry flag is set. It means the op of compare operat<sup>n</sup> reflects on 2 flags called as zero flag and carry flag.

111  
 1011      4 bit register  
 0101      4 bit register  
1 00 00      4 bit register  
 If value of acc is zero, then condition of zero flag is satisfied.

All       $Zero = Set = 1 = (Z)$  ✓  
 see content in zero result is zero.

$Zero = Next \neq 0, = NZ$  ✓ JNZ

loop expected.

recognized by processor

Sign flag :-

Condition: If the MSB Bit of Accumulator is 1. If it is 0, it means negative logic otherwise if it is zero, then positive no. logic.

101011  
 100100  
1 001111 This part is reflected in acc  
 MSB.

Sign flag = F = reset = 0 = PL

Positive logic.

Sign flag = T = set = 1 = RL

Negative logic.

Overflow flag :-

Use: Overflow flag indicates the range of not supported by the S/m.

If the max limit is exceeded, it sets the overflow flag to 1.

Cond'n → Is there any carry into MSB & no carry out of MSB. Or vice versa.

If any of the cond'n sat'sfy  $\Rightarrow$  overflow flag is set.

Otherwise reset

|                                                                     |                                     |
|---------------------------------------------------------------------|-------------------------------------|
| <br>$\begin{array}{r} 111 \\ + 10111 \\ \hline 0010101 \end{array}$ | Carry into MSB.<br>Carry out of MSB |
|---------------------------------------------------------------------|-------------------------------------|

$1 / \underline{000\ 1100} \rightarrow$  reset

Overflow = f = Reset = 0  $\rightarrow$  No overflow (NV)

$$= T = \text{Set} = 1 - \underline{\text{OV}} \text{ (overflow)}$$

Page No. 112

(4)

Only ACU takes the resp of setting and resetting all flags.

(b)

$\begin{array}{r} 01001101 \\ + 11101001 \\ \hline 00110110 \end{array}$

Status  $\Rightarrow$  cy = 1 ✓

Overflow = 0 ✓

Sign = 0 (reset). ✓

work

$\underline{\text{MSB}} \Rightarrow 10, 0 \checkmark$

char

cy NV Sign AC P/E NZ

Binary - 1 0 0 1 0 0 ✓

## Control flags:

→ There are 3 control flags.

① trap flag.

② interrupt flag.

③ direction flag.

① TRAP FLAG (Programmer). If Trap flag is equivalent to 1, single step execution is activated.

Suppose prog contain 10 inst<sup>n</sup>, then one inst<sup>n</sup> is executed (Stop the process).

Each enter button indicates we enable the trap. For the next inst<sup>n</sup>, we are pressing the enter button, or means we are setting trap.



t=0 expect starts from starting to ending until break pt..

F10 ⇒ t=0 → all inst<sup>n</sup>s are executed until Break pt.

define  
start prog. F8 = t=1 → Single step execution

eg t=1 →  
g=0 →

If trap flag is disabled, it means execute all inst<sup>n</sup>s until Break pt.

(2)

## Interrupt flag

To enable or disable the interrupts this flag is used.

Markable  $\rightarrow$  listening the status of int.flag.  
 Non markable are not listening the  
 Interrupts status of Int. flag.



(3)

Direction flag : This flag is used in string manipulations.

There is a need of transferring of bulk data  
~~for~~ then we use string operation

→ auto Decrement  
↓ Directn → +  
↓ 0 → Auto Increment

### steps of string manipulation :-

- (1) Initialize source and destination area
- (2) Identify the length of the string.
- (3) Identify starting address or ending add<sup>r</sup> of the string which is passed as a input.
- (4) set or reset the Direct<sup>n</sup> flag.
- (5) perform string move operat<sup>n</sup>.



③ Identify the length of the string is 3.

$$CX = 3$$

Want  
register

④ Start<sup>n</sup> add<sup>r</sup> and ending add<sup>r</sup>.  
Dec count?      CLD ⇒ clear Direct<sup>n</sup> flag.  
One offset

~~21/12/10~~ Last day  
~~2010~~

Date:

Page No.:

NO need of Inc Source & Dec<sup>n</sup> address. The source & Dest<sup>n</sup> add<sup>n</sup> are automatically Inc until count becomes 0.

→ CLD (All dest<sup>n</sup> lags are included incs).

→ MOVSB → Move Byte String from source to Dest<sup>n</sup>.

Due to CLD, we won't write MOV SB again as count automatically perform dec operat<sup>n</sup>.

→ STD Set Direct<sup>n</sup> flag → auto dec of source & Dest<sup>n</sup> add<sup>n</sup>.

\* It is optimization because it is performed acc to Direct<sup>n</sup> flag. The total execution is changed.

## RISC Vs CISC

Architecture is divided into RISC & CISC two types based on the types of Inst<sup>n</sup>, i.e.

① fixed length Inst<sup>n</sup> → RISC

② Variable length Inst<sup>n</sup>. → CISC

Characteristics of RISC processor :-

more..

① RISC Supports Rich register set.

→ RISC processor consists of Overlapped Register windows.  
Each register window consists of 3 types of registers.

- ① Input register
- ② Local register
- ③ O/P register



→ RISC processor also supports with global registers.



\* Register file = no of windows  $[L + C] + 4$ .

Common register  $\Rightarrow$  1 set.

Each window consists of one local + one common register (C).

More no of registers inc, addressing no of nodes will  
be done  
↓  
Perf inc

## 2<sup>nd</sup> characteristics

→ less no of addr nodes. Because more no of registers. Most of time data is available in Register. So, addr nodes is reduced.

Indexed or register indirect are efficient mode to read the data from memory. So, most of the time we will use these addr nodes.

3rd char : fixed length of Instr is supported.

4th char → one Instr per cycle (all the instr size is same).

Average CPI of RISC = 1

If all instr is completed within 1 cycle, we can say successful pipeline. is implemented.

Eg of RISC ① → ARM → Advanced RISC m/c.

② → power pc

③ → Motorola 68000

④ → MIPS ( Microprocessor w/o Interlocking pipeline stages)

(20) (a).

(23) (d) 12 reg windows = 12

global windows = 16

I/P = 8.

local = 16

O/P = 8.

window size

$\Rightarrow 12$ .

Total no of reg.  $\Rightarrow W(L+C) + Q$

$$\Rightarrow 12(16+8) + 16$$

$$\Rightarrow 12 \cancel{(24)} + 16$$

$$\Rightarrow \cancel{384} \cancel{16}, 304$$

~~410 0~~

Reg windows

$$\Rightarrow 16+16+16$$

24

16 2

14 4

24 X

38 9

3

Reg window =  $L+C+Q$

$$\Rightarrow 16+16+16 = 48 \checkmark$$

24

12

48

16

24 28

16 16

18 16

30 4

(24) CISC

RISC

MOV AX, 05 ✓ 1 clock

MOV AX, 00 — 1

MOV BX, 06 ✓ 1 clock

MOV BX, 05 — 1

MUL AX, BX, 40.

MOV CX, 06 — 1

Multiply AX with BX

start ADD AX, BX  
 $\frac{2^{16}}{2^4} = 12$   
 loop start, loop till CX=0.

42 clock  
cycle

No mul operat<sup>n</sup>, But

then also we are

performing mul

= (15)

By default Immediate  
addr mode, the clock cycle

$$1+1+1+12 = 15$$

is 1

Denominator  
run faster than  
run slower

$$\text{Speed up} = \frac{\text{ET CISC}}{\text{ET RISC}} = \frac{42}{15} = 2.8$$

RISC Code will run faster than CISC  
by 2.8

(a)  $1.18 \xrightarrow{\text{change}} 1.45$

Date: \_\_\_\_\_  
Page No.: \_\_\_\_\_

25

Accessing

operand Add<sup>x</sup> node

Frequency %

|                 |                                                                                |
|-----------------|--------------------------------------------------------------------------------|
| Registers       | 30.                                                                            |
| Immediate       | 20                                                                             |
| Direct          | 22 $\rightarrow$ (Memory reference)                                            |
| Memory Indirect | 17 (2 memory refs) $\frac{50}{33}$                                             |
| Index           | 11 (3 memory refs) $\frac{17}{100}$<br><u>100</u> (1 $\rightarrow$ arithmetic) |

No of cycles

- (memory)  
MT) (1) Read from memory 2.
- (2) Index Arithmetic 1
- Configurations.
- (3) Operands are available  
in registers or within  
the Instn itself. 0

Average operand fetch = ?

$$\Rightarrow 30\% \text{ of } 0 + 20\% \text{ of } 0 + 22\% \text{ of } 2 + \\ 17\% \text{ of } 4 + 11\% \text{ of } 3.$$

$$\Rightarrow \frac{30}{100} \times 0 + \frac{20}{100} \times 0 + \frac{22}{100} \times 2 + \frac{17}{100} \times 4$$

$$+ \frac{11}{100} \times 3$$

$$\Rightarrow \underline{\underline{1.45}}$$

26 Change

What is the speedup in the Instn execution rate if the same or pipelined Acyclic a 0.4 ns/c overhead is and consumed in setup and clock skew together.

$AW = 4 \text{ cycles} \Rightarrow 4 \times 10^{-9} \text{ sec}$   
 Branches = 3 cycles  $\Rightarrow 3 \times 10^{-9} \text{ sec}$   
 Memory access = 5 cycles  $\Rightarrow 5 \times 10^{-9} \text{ sec}$

~~$$T_2 = 1 \times 10^{-9} \text{ sec}$$

$$F = \frac{1}{T} = \frac{1}{1 \times 10^{-9}} \text{ Hz}$$~~

Execution time  
of pipelined  $\Rightarrow 1 + 0.4$   
 $\checkmark 1.4 \text{ nsec}$ .

$\Rightarrow$  Execution time of unpipelined

28 (b)

(30) relocation  $\rightarrow$  PC relative

(31) (a)

(32)

(33) (c)  $\rightarrow EA = \{ BX \}$  scaling Ind.  $\oplus$  <sup>disp</sup> <sub>concept</sub>  
 (34) b  $\rightarrow EA = \{ DI \}$   $\oplus$  <sub>disp</sub> <sup>2000</sup>

$$\Rightarrow [BX] + [SI] + 2000$$

<sub>displacement</sub>

(34) (b) Need of 3 reg to

store 3 values

giving no where  
to look at if  
so we are  
replacing



# Pipelining

→ High performance architecture

↳ exhibits concurrent execution.

High performance architecture performs concurrent execution. ✓  
program

Concurrency → more than one event execution.  
is called as concurrent execution.

event → program → Inst<sup>n</sup> → Subprogram.

Concurrency implies to ① parallelism ② simultaneous  
③ pipelining.

- ① Parallelism
- ② Simultaneous
- ③ pipelining

↗ Parallelism: → Two or more events executed at same time period (min and max value both are present). (0-5 sec).

↗ Simultaneous Two or more events executed at same time instance. (no range).

↗ Pipelining → Two or more events executed in overlapped time span.

## Classification of High performance Architecture

According to Flynn's classification, architecture is divided into four types based on Inst<sup>n</sup> stream and data stream.

- (1) SISD    (2) SIMD    (3) MISD    (4) MIMD

\* Single Inst<sup>n</sup> Single Data Stream

- (1) SISD (only one Inst<sup>n</sup> / data fetched at a time).

↳ Uniprocessor System

\* Single Port

The operations are :



Each processor consists of one control unit whereas if processor consists of more than 1 control unit it is multiprocessor. Only one memory access is possible.

Inst<sup>n</sup> fetch → Unidirectional

Memory fetch → Bidirectional      Multipro<sup>p</sup> / memory.  
 Single Inst<sup>n</sup> Multiple Data Stream

(2)

SIMD



Single port only one element goes execution.

|           |  |
|-----------|--|
| Date:     |  |
| Page No.: |  |

multiple

1. If it's multiport memory, w/o conflict data is transferred.
2. Same "Port" is operated on multiple data elements.
3. In matrix multiplication, data elements are different i.e. (A<sub>1</sub>, B<sub>1</sub>, C<sub>1</sub>), operation on the data element is same.
4. Two implementation of this architecture:
  - ① Array processor.
  - ② Associative Array processor.
5. The memory w.r.t Array processor is RAM chip.
6. Associative Array processor → Associative memory chip.

① Array processor: MPP → Massively parallel processor.

② Associative Array processor



Storage

PEPE

parallel element processor  
Insensible?

(bit serial  
associative)

(bit <sup>1</sup> bit  
associative).

If all the processes are homogenous program elements (all are <sup>accessing</sup> same memory, performing same operations). So it means simultaneous execution is possible.

The way of accessing data is known as concurrency.

SIMD exhibits (Spatial) concurrency.

MIMD  $\rightarrow$  (multiple instr<sup>n</sup> fetch)

$\rightarrow$  there is no implementation. multiple processors are there, only one process is executed once. So it means its not economical.

MIMD  $\rightarrow$  Multi-processor architecture  $\rightarrow$ .

Multi-processor consists of control units, their associated processing units & memory.



- $\rightarrow$  Multiple control units.
- $\rightarrow$  Multiple part memory.

No simultaneous operations. Always Homogeneous is not ensured. Similar processor is not designed. simultaneous exec<sup>n</sup> is not possible.

parallelism is achieved in form of asynchronous Helpup. All processor need not to be simultaneous. Here we are achieving asynchronous concurrency.

Based on the 4 arch of SIMD in having concurrent exec<sup>n</sup>, mid or having concurrent exec<sup>n</sup>.

Point: In the unprocessor, <sup>5th</sup> By the implementation of pipelining concept, concurrent executions are implemented. Concurrent executions are exhibited.

### Pipelining (Overlapped executions).

Principle: The principle of pipelining is one stage opt is connected with another stage ~~opt~~ Input

Defn: Accepting one input (new) at one end and before previously accepted Input appears at an output. at the other end.

Old & new IP both are there. Old exec<sup>n</sup> is not yet completed, but new IP is accepted. Both are possible for exec<sup>n</sup>.



The pipelining principle is accepting new I/P. for every new clock cycle.

Concept 6. (Pipelining)

(decomposition)

Dividing the prob

↓  
Subproblem  
(Execute simultaneously)

It is continuous process - pipe consists of different stages such as-



Intermediate  
latches (to store  
regt).

OR  
pipelined  
register

Pipelining consists of  $k$  stages, one stage O/P is connected to another stage I/P. All these stages are controlled by common clock.

Intermediate regt is stored in high speed latches. also called as pipelined registers or buffers.

→ The instr<sup>n</sup> execution sequence in the pipeline is represented by space-time diagram.

Q. Space diagram.

Program = 4 Inst<sup>n</sup>. ( $I_1, I_2, I_3, I_4$ )



X axis → time  
y axis → stages

$I_1 \rightarrow$  It appears as an off at one end when new I/P is accepted.

If execution is completed till  $S_4$ .

Pipelining allows I/P's even if previous are not yet completed.

→ Same clock cycle is used by  $I_{inst1}$  and  $I_{inst2}$  if it is overlapped exec<sup>n</sup>. The first and last stage are non-overlapped and rest of the stages are overlapped.

- New I/P is accepted as pipeline'd I/P. For every clock cycle, new I/P must be entered in characteristics of pipeline.
- After 4th clock cycle, no new I/P is accepted as all Instn are over.
- Except 1 and 7, all overlapped execution are overlapped exec<sup>n</sup> is possible. It is known pipelining.

### Temporal concurrency → Parallelism

Same time can be utilized, same stage is utilized by multiple Instn.

Space diagram indicates exec<sup>n</sup> sequence of pipelining.

→  $I_1 = 4$  clock cycles (  $I_1$  is present, Among 4 clock cycles, 3 clock cycles are overlapped with other Instn).

→ There are  $n$  inputs,  $K$  stages. If  $n$  inputs are divided into 1 and  $n-1$ ,



$$n \text{ I/P's} \rightarrow 1 \text{ to } n-1$$

→  $n$  I/P's |  $n-1$   $(K+n-1)$  cycles  
 $\checkmark K$  stages  $(K+n-1)$  cycles. to execute  
 n instns of K stage.

cycle time depend upon oscillator.

## Derivations :-

- When  $n$  tasks are executed with  $K$  staged pipelined the first task requires  $K$  cycles to complete the operation. The remaining  $(n-1)$  tasks, emerge from the pipe at the rate of 1 task per clock cycle, so it requires  $(n-1)$  cycles to complete the operation therefore

$$\text{total cycles} \rightarrow [K+n-1] \approx$$

If clock cycle time is  $(tp)$ , the total execution time is equivalent  $\underline{(K+n-1) tp}$ .

Consider now pipelined processor that performs same operation and each tasks takes a time  $t$  to complete the operation. The total execution time is equivalent  $\underline{n.tn}$ .

→ Speedup :- execution time of nonpipelined / execution time of pipelined.

$$\Rightarrow \underline{ntn}$$

$$(K+n-1) tp.$$

As the no. of tasks increases i.e. consider an ideal pipelined processor where average CPI = 1.

$$\boxed{\text{CPI} = 1}$$

$$\text{Clock per Inst} \Rightarrow 1.$$

Under this cond'n, execution time of pipelined

$$\boxed{n \times t_p}$$

$$\text{Exec time} = n \times t_p$$

$$\text{Speedup} = \frac{n \times t_n}{n \times t_p}$$

$$= \frac{t_n}{t_p}$$

$$\boxed{\text{Speedup} \Rightarrow \frac{t_n}{t_p}}$$

If we assume that <sup>the</sup> time taken to process a task is same in the pipelined and non-pipelined processors.

$$T_n = K \times t_p$$

~~The time req.~~  $\rightarrow$  pipelined exec time.

Under this cond'n  $\rightarrow \frac{K \times t_p}{t_p} = \text{Speedup}$

$$\boxed{K \text{ Speedup}}$$

The above formula states that if the CPU is working with 100% efficiency.

The maximum speedup = pipelined depth

no. of stages present in pipeline  $\rightarrow$  pipelined depth

Problem  $\Rightarrow$

The Inst<sup>n</sup> pipelined which has speedup factor (10) while operat<sup>n</sup> with 80% efficiency. How many no of stages are present in that inst<sup>n</sup> pipeline.

$$\text{Speedup factor} = 10.$$

$$\text{Efficiency} = 80\%.$$

$$\text{No of stages} = ?$$

$$\begin{aligned} \frac{80}{100} &= \frac{10}{x} \\ 80 \times 100 &= 1000 \\ 80 &= 100 \\ 8 &= 10 \end{aligned}$$

Possible  
[ $S = K$ ] with 100%.

$$80\% \quad 10$$

$$\begin{aligned} \frac{80}{100} &= \frac{10}{x} \\ 100 &= 100 \\ 80 &= 100 \\ 8 &= 10 \end{aligned}$$

Max Speed up

Ans

① 13 Ans ✓



Consider 2 pipelined A and B where Pipelining A having 8 stages of uniform delay of 1 nsec. The pipelined B is having 5 stages with respect to stage 1 delays 2 nsec, 3 nsec, 1 nsec, 2 nsec and 1 nsec. How much time is saved if 100 Inst<sup>n</sup>s are pipelined using A and instead of B?

Suppose Buffers are non uniform  $\rightarrow$  then

$$TP = \text{Max of stage delay} + \text{max of Buffer delay + setup time}$$

Uniform delay pipelined  $\rightarrow$  All stage delay are equal

$$if t = 1 \text{ nsec.} \rightarrow \text{All stages must complete in 1 nsec.}$$

Non Uniform delay  $\rightarrow$  The stage operation itself takes more time.



Different times are associated with stages.  
or is known as non pipelined.

$$t_p = t_p$$

$$t_p = \text{max stage delay.}$$

CLK cycle is changed to max delay i.e.  $3 \text{ nsec}$   
for every  $3 \text{ nsec}$ , the O/P line is  
given as I/P to next line. It  
is non uniform stage.

Buffers are used as interface reg, Latches,  
If all buffers are controlled by  
clock, If buffer delay is zero then  
there is no time adjustment w.r.t clock.

Eg. Buffer delay =  $5 \text{ nsec}$

$$\therefore t_p = \text{max stage delay} + \text{buffer delay.}$$

case

After all the possible cases, at the time  
of designing the pipeline, Intec is  
time to maintain. Designed pipeline  
is in operation, whatever disruption before

every stage on  
inside

→ pipeline, once it is in operation small fraction delay is possible to complete the stages.

→ Set up the clock  $\Rightarrow$  before the pipelined aligning is over. Before delivery we need to adjust the clock. The fraction of time is changed i.e set up time, skew time.

→  $t_p = \text{max stage delay} + \text{Buffer delay.}$   
 $+ \text{Setup time.}$

→ Uniform  $\Rightarrow [t_p = t_p]$

→ Non Uniform  $\Rightarrow [t_p = \text{max stage delay.}]$

→ Buffer delay  $\Rightarrow t_p = \text{max stage delay} +$   
 $+ \text{Buffer delay.}$

→ Setup time  $\Rightarrow t_p = \text{max stage delay} +$   
 $+ \text{Buffer delay} + \text{Set up time}$

2 stage pipelining  $\Rightarrow$

→ Consider four stage  $^{inst^n}$  pipeline, where different Inst's are spending different amount of time at different stages.



① How many clocks are reqd. to complete them?

② what is the speed up?

|       | $S_1$ | $S_2$ | $S_3$ | $S_4$ |                                                                                         |
|-------|-------|-------|-------|-------|-----------------------------------------------------------------------------------------|
| $I_1$ | 2     | 1     | 2     | 2     | 7                                                                                       |
| $I_2$ | 1     | 3     | 3     | 1     | 8                                                                                       |
| $I_3$ | 2     | 2     | 2     | 2     | 8 $\rightarrow$ 15 cycles                                                               |
| $I_4$ | 1     | 2     | 1     | 2     | $\frac{1}{29}$ Speed up = $\frac{\text{exec time of NP}}{\text{exec time of pipeline}}$ |

grain  
over  
uniform  
and now  
uniform  
delay



Waiting for stage 1 to begin with  $I_3$ .

Q1

Pipelined process is having fetch decode execution and ~~write~~<sup>write</sup> stages. F DEW PDEW

If following instns are executed in the above pipeline what is the speed up if fetch decode write operation takes 1 clock which execution takes 3 clocks for multiplication and division and 1 clock for other instns.

ADD R<sub>5</sub>, R<sub>0</sub>, R<sub>1</sub>      no data dependency

MUL R<sub>7</sub>, R<sub>8</sub>, R<sub>9</sub> → 3 ce

SUB R<sub>3</sub>, R<sub>4</sub>, R<sub>6</sub>

DIV R<sub>10</sub>, R<sub>11</sub>, R<sub>12</sub> → 1 clock

STORE R<sub>13</sub>, X



~~10/10/00~~ day 11

|       | F | D | E | W |
|-------|---|---|---|---|
| Add   | 1 | 1 | 1 | 1 |
| Mul   | 1 | 1 | 3 | 1 |
| Sub   | 1 | 1 | 1 | 1 |
| Div   | 1 | 1 | 3 | 1 |
| Store | 1 | 1 | 1 | 1 |

$$\text{Clock cycles} = 12 \checkmark.$$

$$\text{Speedup} \Rightarrow \frac{\text{non pipeline}}{\text{pipeline time}} = \frac{24}{12} \approx 2$$

## 2 Stage pipelining:

- The pipeline consists of 2 stages. one is fetch and execution. is not successful pipelining because the execution cycle takes more time than fetch cycle for decoding; and operand fetch & process data).



- Always the delay is present at the execution stage, to make it successful two methods are used ① Decomposition of delayed stage

## ② Replication of delayed stage.



→ Decomposition States that divide the delay stage into Substages while maintaining uniform delay.

This is possible in instruction pipeline with substituting 3 stages in place of execution stage.

- (1) Fetch (2) Decode
- (3) Operand fetch.
- (4) Process data.



- (1) 1st 2 I/P's are accepted; To accept 3rd, it accepts at 5th  
It is unsuccessful.

| Ex |       | $I_1$ | $I_1$ | $I_1$            | $I_2$ | $I_2$ | $I_2$ | $I_3$ | $I_3$ |
|----|-------|-------|-------|------------------|-------|-------|-------|-------|-------|
| IP | $I_1$ | $I_2$ | -     | $\overline{I_3}$ | $I_3$ |       |       |       |       |
|    | 1     | 2     | 3     | 4                | 5     | 6     | 7     | 8     | 9     |



If Decomposition is not possible use the 2nd method i.e replication. Replication means arrange the duplicate stages in parallel along with at the delayed stage.



Each stage require 3ns.

Inst<sup>n</sup> fetch      Execution Stage



1 2 3  
EX<sub>1</sub>, EX<sub>2</sub>, EX<sub>3</sub>

Clock : I<sub>1</sub>  
Cycle 4

CC<sub>2</sub>      I<sub>2</sub>      I<sub>1</sub>

CC<sub>3</sub>      I<sub>3</sub>      I<sub>1</sub>      I<sub>2</sub>.

CC<sub>4</sub>      I<sub>4</sub>      I<sub>1</sub>      I<sub>2</sub>      I<sub>3</sub>.

CC<sub>5</sub>      I<sub>5</sub>      I<sub>4</sub>      I<sub>2</sub>      I<sub>3</sub>.

CC<sub>6</sub>      I<sub>6</sub>      I<sub>4</sub>      I<sub>5</sub>      I<sub>3</sub>.

it is successful pipelining.

loop level Parallelism:

→ one iteration stages are not overlapped with other iteration stages.  
ambiguity

loop level Hlasm - one Iteration stages overlapped with other Iteration:

I<sub>2</sub>, I<sub>3</sub>, I<sub>4</sub>      After completion of I<sub>1</sub>  
Then start with (I<sub>2</sub>)

## Reservation Table

|    | 51 | 52 | 53 | 54 |
|----|----|----|----|----|
| I1 | 2  | 1  | 1  | 1  |
| I2 | 1  | 3  | 2  | 2  |
| I3 | 2  | 1  | 1  | 3  |
| I4 | 1  | 2  | 2  | 2  |

- Non linear pipelined process, execution sequence is depending on reservation station/table.
- don't know about time v.

|    |                                                            |       |       |       |       |       |       |       |       |       |       |       |       |       |       |       |       |    |    |    |    |
|----|------------------------------------------------------------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|----|----|----|----|
| 54 |                                                            | $I_1$ | $I_2$ | $I_2$ | $I_3$ | $I_3$ | $I_4$ | $I_4$ |       | $I_2$ | $I_2$ | $I_1$ |       |       |       |       |       |    |    |    |    |
| 53 |                                                            | $I_1$ | $I_2$ | $I_3$ | -     | $I_4$ | $I_4$ | -     | $I_1$ | -     | $I_2$ | $I_2$ | $I_3$ | -     | -     |       |       |    |    |    |    |
| 52 |                                                            | $I_1$ | $I_2$ | $I_2$ | $I_2$ | -     | $I_4$ | $I_4$ | $I_1$ | -     | $I_2$ | $I_2$ | $I_3$ | -     | $I_4$ | $I_4$ | $I_4$ |    |    |    |    |
| 51 | $I_1$                                                      | $I_1$ | $I_2$ | $I_3$ | $I_3$ | -     | $I_1$ | $I_1$ | $I_2$ | -     | $I_3$ | $I_3$ | $I_4$ | -     |       |       |       |    |    |    |    |
|    | 1                                                          | 2     | 3     | 4     | 5     | 6     | 7     | 8     | 9     | 10    | 11    | 12    | 13    | 14    | 15    | 16    | 17    | 18 | 19 | 20 | 21 |
|    | $\leftarrow \quad \rightarrow \quad \leftarrow$            |       |       |       |       |       | $i=1$ | $i=2$ |       |       |       |       |       | $i=3$ |       |       |       |    |    |    |    |
|    | for ( $i = 1 \text{ to } 2$ ) ( $11 : 12 : I_3 : I_4 :$ ). |       |       |       |       |       |       |       |       |       |       |       |       |       |       |       |       |    |    |    |    |

$$(11) \quad @ \quad ET = (K + n - 1) \cdot kp.$$

$$\Rightarrow (n+1000 -1) \cdot 160 \text{ ns}$$

(15)

No of clock cycles = 12

(6)

31

(16)

(17) (b)

(20) (c)

CPU Control Design

(3)

bbs, reg, pos, label

Emulation

Simulator

→ If instru is not supported, what

are the other possible ways to execute in environment

with eg.

→ instru is not yet designed, but in need of functionality

→ bbs → processor are not supporting

branch

bit set

(multiplex logic).

bbs

LSB

→ Reg : 00010101

→ Read the position, all the reg bits undergoes evaluation



✓ Mask indicates the position. The reg along position value reading onto temp.

✓ 31 — 0 (Right to left).  
 ←  
 Left representation

Read the 0<sup>th</sup> position — — — .

31 — 0 position is pointed by pos in mask.  
 ↗ pos.  
 1 1 0

→ a) mask ← 0x1 → pos.  
 ↓  
 Hexadecimal no.

(If any value followed by Hema → Hexadecimal value).

→ 0x1 position initially pointing to LEB to be loaded to Mask.  
 ↗ pos.

5 4 3 2 1 0 (Each time, we are reading one position).

Suppose we are reading 20<sup>th</sup> position, control is transferred to soth then check it is 1 or not  
 If yes then load onto up asc

② perform shift left operation on pos  
 → 1 → load onto mask

pos → 0 → No operat<sup>n</sup>

## RISC pipelining.

In a successful pipeline it consists of 5 stages, one is

- ① Inst<sup>n</sup> fetch
- ② Inst<sup>n</sup> decode
- ③ Execution
- ④ Memory access
- ⑤ Write Back

Reason :

RISC m/c supports with  
3 types of Inst<sup>n</sup>.

- (a) Data transfer (load & store)
- (b) For memory reference.
- (c) MBR for register reference

Total = ③ ✓ Data Transfer.  
Inst<sup>n</sup> ✓

- ② Data manipulation Inst<sup>n</sup> ↗.

→ unconditional & conditional branches.

Significance

## 5. Five Stage pipeline

- ① First stage → Instruction fetch

At this stage Inst<sup>n</sup> is transferred from memory to IR based on program counter.

At the end of this stage, PC is incremented by step size.

$$PC \leftarrow PC + \text{Step size}$$

Based upon PC → Inst<sup>n</sup> transferred Memory.

assume fetched Instn is load

Date: \_\_\_\_\_  
Page No.: \_\_\_\_\_

①. 2<sup>nd</sup> Stage.

Decode stage performs 2 functionalities

① Decode the Instn

② Register file accessing.

→ ① load  $r_0, [2000]$

① load:

Register file  
accessing

Indirect register

or designate it as  $r_0$  as destn.

(load)  $l_0, 2[R_i^0]$

index register content

base reg

If base, it is read from memory.

displacement

②

Store:

offset

Store Reg,  $3[R_i^0]$ .

Index register value

Identified/  
accessed

③

Mov →

destn

(Mov)

$l_0, l_1 \rightarrow$

only Identification

↓

↓

↓

is done about source  
& destn ✓

Data file

Operat'n

④

Add

Add  $r_0, r_1, r_2$

Add  $r_0, r_1, \#23$

↓ ↓ ↑ immediate value

↓ ↑ 1  
identify read read

~~✓~~ While executing the PC, Decoder itself  
counts of comparators.

Jump, branch.

branch 2000 if its PC,  $PC \leftarrow 2000$

~~whether it is found or not~~ Execution is completed at the end of  
Decoding stage.

→ Branch Not equal to zero in

BNZ so  $2000$

so value is compared with BNZ  
 $PC \leftarrow 2000$

Branch Instructions are executed at the end  
of Decoding stage. The O/P is updated  
in the PC

~~Execution~~ The major work is done by  
ALU

It executes the instrn ✓.

① Register → Immediate  $\rightarrow$ .

② CPU recognize Immediate value.

→ 2 data elements are given as I/P to ALU.

ADD: 80, 81, #123,

enable during access control

no register is not updated. The off line of exec<sup>n</sup> does not.

### (2) Reg - Reg :

~~CD  
organise  
scheduling~~ Add  $r_0, r_1, r_2$   
 result is present at  
 off line of exec<sup>n</sup> stage.  
 Execution is over, the off is not reflected at PUL.

### (3) Memory reference

#### a) Load :

$1 +$  Add<sup>n</sup> b/w relative & content

#### b) Store

of index register.

or generate ER.

Reg<sup>n</sup>: The effective add<sup>n</sup> is the off of AW.

~~points~~ → If the Inst<sup>n</sup> is register reference and  
 ① Reg - Immediate, the AW executes  
 that Inst<sup>n</sup> but the off is not updated  
 into the register.

② If the Inst<sup>n</sup> is memory reference Inst<sup>n</sup>  
 effective add<sup>n</sup> is the off of AW.

(q)

Memory Access: Based on the O/P of AW,  
Memory Reference Instn's are executed at  
 this stage (EA)

excn<sup>n</sup> is completed load  $\rightarrow$  EA is already calculate, it enables  
 the mem to cat<sup>n</sup>, read the contents  
 but length is not updated in lo.

Store: excn<sup>n</sup> is completed at the end  
 of memory access.

(Store)

- STORE no, 3(Ri);  $\rightarrow$  store instn  
 excn<sup>n</sup> is over at  
 end of 4 stage.
- No change in mem acc in Arithmetic.



Data transfer  
 Instn acc  
 already solved  
 completely.

only load  
 & store is allowed.  
 It is just forward from 1 latch  
 to other latch.

Write Back:

The final O/P is updated into Data register.

Identified  
 at Decodif  
 level.

Q2.

|       |       |       |       |       |       |       |       |       |       |       |       |       |       |    |    |
|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|----|----|
| $S_4$ |       |       |       |       | $I_1$ | $I_1$ |       |       | $I_2$ | $I_3$ | $I_3$ | $I_4$ | $I_4$ |    |    |
| $S_3$ |       |       | $I_1$ | $I_1$ | $I_2$ | $I_2$ | $I_2$ | $I_2$ | $I_3$ | $I_3$ | $I_4$ | -     | -     |    |    |
| $S_2$ |       | $I_1$ | $I_1$ | $I_2$ | $I_2$ | $I_2$ | $I_3$ | $I_3$ | -     | $I_4$ | $I_4$ |       |       |    |    |
| $S_1$ | $I_1$ | $I_1$ | $I_2$ | $I_3$ | $I_3$ | -     | $I_4$ | -     | =     |       |       |       |       |    |    |
|       | 1     | 2     | 3     | 4     | 5     | 6     | 7     | 8     | 9     | 10    | 11    | 12    | 13    | 14 | 15 |

$$E = \frac{\text{Enon pipeline}}{\text{E pipeline}} = \frac{29}{15} \approx 2.$$

Q3.

|   |   |    |    |   |   |    |   |    |   |    |    |    |  |  |
|---|---|----|----|---|---|----|---|----|---|----|----|----|--|--|
|   |   |    |    |   |   |    |   |    |   |    |    |    |  |  |
| W |   |    | A  |   | M | S  |   |    | D |    | D  | S  |  |  |
| E |   | A. | M. | M | M | S  | D | D  | D | D  | S  |    |  |  |
| D | A | M  | S  | D | - | -  | D | S. | - | -  |    |    |  |  |
| F | A | M  | S  | D |   | S. |   |    |   |    |    |    |  |  |
|   | 1 | 2  | 3  | 4 | 5 | 6  | 7 | 8  | 9 | 10 | 11 | 12 |  |  |

Q4.

Tuesday

4/ Jan/2011

Date:

Page No.:



### 5 Stage pipeline

mem+IR

DMI      DTI      TOC

which instruction is executed at which stage is given by 5 Stage pipeline

7. Memory Takes place at every part - (compulsory)

DTI → offset  $\xrightarrow{\text{load}} \text{store}$  EA.

DMI → Register file along with operation

TOC → program counter is loaded

with Target address  
(TOC is completed).

conditional

unconditional

executed at

the end of Decoding

\* TOC is not allowed in exec<sup>n</sup> stage only other

then  $\nearrow$  are allowed



## ① Structural Dependencies

↳ means resource conflict. The resource may be memory, functional unit or register.

Structural dependencies cause structural hazard.

Hazard means "STALL" (no operation assigned in the particular clock).

Extra clock cycle is added.

(STALL inc, exec time inc)

<sup>used to access</sup>  
data



Pipeline for every new cycle, new IP must present here there are 5 Instns, 5 clock cycle so we can say it is successful pipeline.

Memory → conflict resource.

Here it is not clear picture about memory whether it is going to access new data.

on first → It leads to new conflict

## → Data Manipulation Instr<sup>n</sup>

(Inst<sup>n</sup>)

Add  $r_0 \leftarrow r_1 + r_2 \rightarrow$  DM  $\rightarrow$  Regfile (Opn) (+)  $\rightarrow$  Opn is processed  
 $(r_1 + r_2)$

$\xrightarrow{\text{forward regt}}$   $\rightarrow$  Destination  
 $r_0 = r_3 + r_4$

Regt is  
 not reflected  
 onto dest<sup>n</sup>  
 Same regt

in forward

## → Data Transfer Instr<sup>n</sup>

Load  $r_3, 2(R_1) \rightarrow DT \rightarrow$  offset  $\xrightarrow{\text{load}} EA \xrightarrow{\text{load}} \text{Accumulator}$   
 $\xrightarrow{\text{store}} \text{Memory} \xrightarrow{\text{store}} \text{to read}$

Destination  
 $(R_2)$  update

Store the reg. content  
 in memory

Point:

(WB) only updating the register.

(Opnd<sup>n</sup>) is completed at the end of ALU, and regt is stored at the latch

## → Dependencies Among Instructions

The program tendency is Dependency. (No straight line Screening).

There are three types of Dependencies

(S) Exist State b/w Instr<sup>n</sup>.

↳ Structural Dependencies

↳ Data Dependencies

↳ Control Dependencies

In both cases pointer register & memory space is different.

Unsuccessful operat<sup>n</sup> → cyc is not allowed to fetch any operat<sup>n</sup>, one clock cycle is wasted. This waste clock cycle is known as stall as no. instr<sup>n</sup> fetch takes place.

→ Same resource used by multiple Inst<sup>n</sup>s that leads to conflict. Now we are considering about von Neumann concept it will lead to unsuccessful operat<sup>n</sup> upto memory is not separated.

Point ① In the clock cycle 4, Instruction 1 is accessing the memory to read the data.

In the same clock, Instr<sup>n</sup> 4 is accessing the memory to ~~fetch~~<sup>fetch</sup> the Instr<sup>n</sup>.

② Two different operat<sup>n</sup>s pointing same memory. So memory conflict is there so structural hazard is present in the pipeline.

③ To resolve the structural Hazards, use renaming mechanism.

(2)

## Data Dependency :-

Consider the program Order  $I_1, I_2, I_3 \dots$ . If  $I_2$  one of the I/P is an o/p of  $I_1$  until completion of  $I_1$  instrn,  $I_2$  instrn undergoes wait state.

This wait state causes Stall (Hazard) bcz of lack of data.

waitage  
of cycle cycle

( $I_1$ ) one of the I/P.  $\rightarrow$  Data dependency

is an o/p of  $I_1$

The identification of dependency is already implemented in ALU.

Inst<sup>n</sup> ① Add  $r_0, r_1, r_2$

②

Until completion of Inst<sup>n</sup>,

Input ② Sub  $r_3, r_0, r_2$

no value is not

available. This is

Called as Data Dependency.

Pipeline is implement in the h/w level.

Internally processor is maintaining the database, where the database is present is known as Instruction maintained.

① Reservation      } maintain db in h/w  
② Reorder

\* Completion of Inst<sup>n</sup> removes it from table

\* At the time of Decoding, R-F is there  
(Register file).





Now take one more int<sup>n</sup>  $I_3 \Rightarrow$

- ③ MUL  $\bar{v}_5, \bar{v}_6, \bar{v}_7$  (Independent).

| $I_3$                                                                                                                                                                                                                                                                                  | <u>IF</u> | <u>////</u> | <u>////</u> | <u>ID</u> | Present in<br>Prog, it<br>is present<br>for all<br>Incr up<br>whether<br>it is depend<br>ent or indep<br>endent |
|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------|-------------|-------------|-----------|-----------------------------------------------------------------------------------------------------------------|
| $I_2$ is Data Dependent on Instruction ( $I_1$ ).<br>Until completion of $I_1$ , $I_2$ undergoes waits.<br>The Regt is updated in the register at the<br>end of write back stage. So 2 clock<br>cycles are waited w/o any operat'n when<br>the data is available, 2nd fetch continues. |           |             |             |           |                                                                                                                 |

Consider  $I_3$  instr. If it is independent instr when the processor executes the program it follows in order execution i.e.  $I_1 - I_2 - I_3 \dots$ .  
 Bcoz of dependency b/w  $I_1$  &  $I_2$  the independent instr. are also sharing the stall cycles.

Resolving mechanism is reqd. to remove the dependency in the program.

Data dependency is compulsory, then resolving mechanism is compulsory.

Resolving mechanism :- (To minimize the stalls, when data dependency is encountered in prog).

→ is used in the data dependency in order to minimize the no of stalls. Called as operand forwarding or bypassing or short circuiting.

Operand forwarding means one stage ALU O/P is connected as another stage ALU I/P common data Bus carry. Intermediate regt. Intermediate regt to the dependent stages.



only forwarding is possible, Bwd is not possible, it will cause stalls

## Control Dependency $\Rightarrow$

By execution of transfer of control instr's, control is transferred from one location to other.

In the pipeline process, the instr's are accepted into pipe based on the program counter in sequential order. When there is a transfer of control instr's are executed, the unsuccessful instr is removed from the pipeline and the control is transferred to the successful instr.

(PC)

1000 I<sub>1</sub>: Add R<sub>0</sub>, R<sub>1</sub>, R<sub>2</sub> Consider the instr sequence

1001 I<sub>2</sub>: Jmp 2000  $\Rightarrow$  CPU want inorder exec.

(I<sub>1</sub>  $\rightarrow$  I<sub>2</sub>  $\rightarrow$  [2000]  $\rightarrow$  [2001])

1002 I<sub>5</sub>: SUB R<sub>3</sub>, R<sub>4</sub>, R<sub>5</sub>

2000 Dec R<sub>8</sub>.

2001 Add R<sub>9</sub>, R<sub>10</sub>, R<sub>11</sub>



Here in this case In 5 cycles, 4 instructions are processing, there is wastage of one clock cycle i.e. Cycles that leads to stall. ~~because of~~

Point 1: For every unconditional for loop, it causes stalls.

→ If 20 instructions are there, 20 ~~extra~~<sup>stalls</sup> are added } extra cycle.

Point 2: The process of removing unsuccessful instruction from the pipeline is called as FLUSH OR FREEZE.

This process is used only to preserve the execution sequence.

→ Every unconditional for instruction causes one stall in the pipeline.

### Conditional T0C

PC 1000 I<sub>1</sub> : Add r<sub>0</sub>, r<sub>1</sub>, r<sub>2</sub>

1001 I<sub>2</sub>, BNEZ r<sub>13</sub>, 2000

r<sub>13</sub> = 0 → Next instruction is executed at 1002 (false).

r<sub>13</sub> ≠ 0 → True. (Content is transferred to 2000)

1002 I<sub>3</sub>, SUB r<sub>3</sub>, r<sub>4</sub>, r<sub>5</sub>

Next instruction is fetched from 2000 as.

| IF | ID | Ex | Mem | WB |
|----|----|----|-----|----|
|----|----|----|-----|----|

IF  
ID.  
(True)  
PC=2000

Ex New WB

No need to remove  
remove  
(IF) no  
stall is present  
Inst<sup>n</sup> stall is present  
remove  
it from  
pipe and  
or not it  
target  
to target  
addr

~~2~~  
already  
create of  
Inst<sup>n</sup> and  
leads to  
unsuccessful  
inst<sup>n</sup>

\* Conanal TOC causes the no of stalls as the  
cond<sup>n</sup> evaluates to true

Point If the Inst<sup>n</sup> in conanal TOC, at the stage of  
Decoding it evaluates to true or false If the  
cond<sup>n</sup> is true, remove the unsuccessful inst<sup>n</sup>  
from pipe and fetch the successful Inst<sup>n</sup> based  
on target address.

\* If the cond<sup>n</sup> is false, no stall is present in the  
pipeline

No mechanism just prediction. (Guess work is done  
to remove the stalls when control dependency  
is depicted. The target addr is stored or  
depicted in advance)

To minimize the control dependency stalls Branch  
prediction Buffer is used or Branch target buffer

or delayed Branches are used.

## SCHEDULING

- When there is a dependency existed b/w 2 Inst<sup>n</sup>, if the rest of Instructions are independent then also stalls are shared among Inst<sup>n</sup>. (If the processor follows Inorder execution)
- To avoid that problem, Use the out of order exec<sup>n</sup>, that states Simultaneous exec<sup>n</sup> along with dependent Inst<sup>n</sup>.

Eg



$I_3 \left\{ \begin{array}{l} \rightarrow \text{Independent Inst}^n \\ \rightarrow \text{Arrangement of Inst}^n \end{array} \right.$  Before accepting  $I_2$  we will accept  $I_3$  &  $I_4$  bcz data will be available at end.

$I_1 \left\{ \begin{array}{l} I_3 \\ I_4 \\ I_2 \end{array} \right. \rightarrow \text{Arrangement of Inst}^n \text{ based upon dependency is known as Scheduling.}$

$\rightarrow$  There is no stalls.

Scheduling is possible when there is multiple Opal units

Name dependency is part of Structural Dependency

Date:

Page No.:

Add1      Add2  
I1      I2.

Scheduling is possible when there is no structural hazards

I1 : Add  $r_0, r_1, r_2$  } dependency. } Dependent  
I2 : SUB  $r_3, r_0, f_1$  } Intra

I3 : MUL  $r_1, r_5, r_6$ . } no dependency } Independent  
I4 : DIV  $r_3, r_8, r_9$ . }                            

Because  
of this  
dependency  
all Intra  
undergoes

→ we move to scheduling concept.  
scheduling → Old data lost, new data is updated



After exec<sup>n</sup> I4, I2 is writing.



Latest data is reflected, an old value  
(new) is removed in these cases

After Scheduling the Intra & named dependency are present, Name dependency occurs when 2

inst<sup>n</sup> use the same register or memory location.  
There are 2 types of name dependencies:-  
b/w instr<sup>n</sup>(i) that precedes instr<sup>n</sup>(j) in  
the program order.

Q

Antidependency :-

When instruction (J) writing a register or  
memory location before instr<sup>n</sup>(i) reads that.

Q

Output Dependency :- This dependency occurs when  
instr<sup>n</sup>(i) and instr<sup>n</sup>(j) modifies the same register  
or memory location.

Point

(1) Antidependency existed b/w I<sub>2</sub> & I<sub>3</sub> ie I<sub>3</sub> modifies  
the R<sub>1</sub> before I<sub>2</sub> reads.

(2) O/P Dependency: existed b/w I<sub>3</sub> and I<sub>4</sub> ie I<sub>4</sub> modifies  
the data before I<sub>3</sub>.

These dependencies causes Hazards in the pipeline.

Types of Hazards :-

→ Hazards are created whenever there is dependence  
b/w instr<sup>n</sup>s.

→ Data Hazards are classified into 3 types :- depending  
on the order of Read & write operations.

→ Consider 2 instr<sup>n</sup>s i & j with i occurring before j in

the program order.

Hazard

"RAW" → Read after write Hazard.

→ J tries to read "data" before i writes it so J incorrectly reads old value. This is called True Data Dependency.

"WAR" → Write after read Hazard.

→ J tries to write "Dest" before it is read by i. so J incorrectly got new data. (Antidependency).

"WAW" → Write after write

→ J tries to write data to dest before it wrote by i. (Output Dependency)

(RAR)

→ Cause no stall in the pipeline.

## Performance of Pipeline With Stalls.

\* SPEED UP:  $\frac{\text{Execution time of Non pipeline}}{\text{Execution time of pipeline.}}$

$$\Rightarrow \frac{\text{CPI Unpipelined}}{\text{CPI pipelined}} * \frac{\text{Clock cycle}}{\text{Clock cycle unpipelined}}$$

$$\text{CPI pipelined} = \text{Clock cycle pipelined}$$

Assume ideal CPI on a pipelined processor is almost 1 (always). Hence CPI pipelined is equal to

$$\text{CPI pipelined} = \text{Ideal CPI} + \frac{\text{Pipeline stalls}}{\text{Instructions}}$$

(stalls are present  
in random manner.)

$$CPI_{\text{pipelined}} = 1 + \frac{\text{Pipeline stall cycles}}{\text{Inst}^n}$$

→ If we ignore cycle time overhead (switching) the clock cycle of unpipelined and pipelined is same.

~~For pipeline  
is no stall  
no overhead~~

$$\text{Speedup.} = \frac{CPI_{\text{unpipelined}}}{CPI_{\text{pipelined}}}$$

$$= \frac{CPI_{\text{unpipelined}}}{1 + \frac{\text{Pipeline stall cycles}}{\text{Inst}^n}}$$

$$1 + \frac{\text{Pipeline stall cycles}}{\text{Inst}^n}$$

\* Important Case: where all the Inst<sup>n</sup> take the same no of cycles which will also equal to no of pipelined stages. (Pipelined depth)

$$\text{Unpipelined CPI} \Rightarrow \text{Depth of pipeline}$$

$$\text{Hence Speedup} = \frac{\text{Pipelined Depth}}{1 + \frac{\text{Pipeline stall cycles}}{\text{Inst}^n}}$$

$$1 + \frac{\text{Pipeline stall cycles}}{\text{Inst}^n}$$

If there is no stalls,

$$\text{Speedup} = \text{pipeline Depth}$$

\* Pipeline stall cycles from Branches:-

*in part of prog*

$$\Rightarrow (\text{Branch frequency}) * \text{Branches Parallel}$$

~~↳ flow control  
extra clock  
register  
added for  
every branch  
Inst<sup>n</sup>~~

need to identify  
with the help  
of space time diag

WB:

$$(17) \quad X = (S - R * (P + Q)) / T$$

Add R5, R0, R1;  $R5 \rightarrow R0 + R1$

RAW  $\rightarrow 4$ .

Mul R6, R2, R5;  $R6 \rightarrow R2 * R5$

Sub R5, R3, R6;  $R5 \rightarrow R3 - R6$

WAR  $\rightarrow 2$ .

Div R6, R5, R4;  $R6 \rightarrow R5 / R4$

WAW  $\rightarrow 2$ .

Store R1, X

source.

(18) ~~IF~~

~~IF ID~~

~~IF~~

ADD IF IP ID I<sub>1</sub>

|  | F | D | Ex | W |
|--|---|---|----|---|
|--|---|---|----|---|

|                |   |   |   |   |
|----------------|---|---|---|---|
| I <sub>1</sub> | 1 | 1 | 1 | 1 |
|----------------|---|---|---|---|

SUB. IF I<sub>2</sub>

|                |   |   |   |   |
|----------------|---|---|---|---|
| I <sub>2</sub> | 1 | 1 | 3 | 1 |
|----------------|---|---|---|---|

STORE I<sub>3</sub>

|                |   |   |   |   |
|----------------|---|---|---|---|
| I <sub>3</sub> | 1 | 1 | 1 | 1 |
|----------------|---|---|---|---|

MUL I<sub>4</sub>

|                |   |   |   |   |
|----------------|---|---|---|---|
| I <sub>4</sub> | 1 | 1 | 3 | 1 |
|----------------|---|---|---|---|

DIV I<sub>5</sub>

|                |   |   |   |   |
|----------------|---|---|---|---|
| I <sub>5</sub> | 1 | 1 | 1 | 1 |
|----------------|---|---|---|---|

DIV.

with Operand forwarding (12 clock cycles)  $\rightarrow$  Ans

|                | CC1 | CC2 | CC3 | CC4 | CC5 | CC6 | CC7 | CC8 | CC9 | CC10 | CC11 | CC12 |
|----------------|-----|-----|-----|-----|-----|-----|-----|-----|-----|------|------|------|
| I <sub>1</sub> | F   | D   | E   | W   |     |     |     |     |     |      |      |      |
| I <sub>2</sub> |     | F   | D   | E   | Q   | E   | E   | W   |     |      |      |      |
| I <sub>3</sub> |     |     | E   | D   | F   | III | III | E   |     |      |      |      |
| I <sub>4</sub> |     |     |     |     | III | III | D   | E   | E   | E    |      |      |
| I <sub>5</sub> |     |     |     |     | III | III | F   | D   | III | III  | E    | W    |

Ques

Consider five Stage pipelined which allows overlapping of all instrns except branch instrn, the target of branch instrn are not available until the Branch instrn is complete.

Let each stage delay is 20nscc ( $t_p$ ) and there are 30% Branch Instns (ignore the fact that some of them are conditional). What is the average instrn execution time & what is speedup?

$$\text{CPU Time} = 1 + \text{pipeline stall cycle} / \text{Instrn}.$$

Pipeline stall cycle = Branch freq  $\times$  Branch Penalty.

I1: ADD  $r_0, r_1, r_2$

I2: Jmp 3000

I3: Sub  $r_7, r_8, r_9$ .

C1 C2 C3 C4 C5 C6 CCT

I1 S1 S2 S3 S4 S5.

I2 S1 (S2) S3 S4 S5 until S5, no  
B.I (no overlapped)  
further exec'g

I3

{ Unexecuted  
flush } ~~S3~~

/// == ==

$$\textcircled{1} \text{ Execution time} = [1 + (30\% \times 4)] * 20\text{nscc}$$

\textcircled{2} Speedup  $\Rightarrow$  pipeline depth

1. Pclock cycle stalls:

$$\begin{aligned}
 & \xrightarrow{\text{Pclock}} 5 \xrightarrow{\text{Pclock}} 5 \xrightarrow{\text{Pclock}} 5 \xrightarrow{\text{Pclock}} 5 \\
 & 1 + 30\% \times 4 \quad \frac{1+30 \times 4}{100} \quad 1 + \frac{120}{100} \quad \frac{220}{100} \\
 & \Rightarrow \frac{5}{2.2} \Rightarrow \frac{5}{4.4} \approx 2.2
 \end{aligned}$$

frequency = 1 GHz

Conduct branch = 20%

Inset inf = 10%

Date:

Page No.:

(11)

CC1

CC2

CC3

CC4

CC5

(First instn itself Branch).

6F

52

53

Branch.

I<sub>2</sub>

~~51~~

~~52~~

flush

1111

Branch penalty = 2.

~~109~~ (1 + Branch freq \* Penalty)  
rate

$$\Rightarrow \frac{(1 + 2 * 20\%)}{1 * 10^9} 109$$

~~52 53~~  
~~51 52~~  
~~50 51~~  
~~49 50~~  
~~48 49~~  
~~47 48~~  
~~46 47~~  
~~45 46~~  
~~44 45~~  
~~43 44~~  
~~42 43~~  
~~41 42~~  
~~40 41~~  
~~39 40~~  
~~38 39~~  
~~37 38~~  
~~36 37~~  
~~35 36~~  
~~34 35~~  
~~33 34~~  
~~32 33~~  
~~31 32~~  
~~30 31~~  
~~29 30~~  
~~28 29~~  
~~27 28~~  
~~26 27~~  
~~25 26~~  
~~24 25~~  
~~23 24~~  
~~22 23~~  
~~21 22~~  
~~20 21~~  
~~19 20~~  
~~18 19~~  
~~17 18~~  
~~16 17~~  
~~15 16~~  
~~14 15~~  
~~13 14~~  
~~12 13~~  
~~11 12~~  
~~10 11~~  
~~9 10~~  
~~8 9~~  
~~7 8~~  
~~6 7~~  
~~5 6~~  
~~4 5~~  
~~3 4~~  
~~2 3~~  
~~1 2~~  
~~0 1~~  
 $\Rightarrow 1.2 \text{ pcc}$

(11)

(10)

I<sub>1</sub>

IP ID ERK WEB

I<sub>2</sub>

I<sub>3</sub>

I<sub>4</sub>

Add R<sub>2</sub> R<sub>1</sub> R<sub>0</sub> R<sub>2</sub>  $\leftarrow$  R<sub>1</sub> + R<sub>0</sub>

Mul. R<sub>4</sub> R<sub>3</sub> R<sub>2</sub> R<sub>4</sub>  $\leftarrow$  R<sub>3</sub> \* R<sub>2</sub>

(3) (6)

~~(5)~~ (15) operand forwarding.

~~15~~ → 15 clock cycles

~~(6)~~ ✓

~~(10)~~ ~~(C)~~

~~(9)~~ (b) ✓

~~(8)~~ (d)

|                | IF | ID | OF | PD       | WR |
|----------------|----|----|----|----------|----|
| I <sub>0</sub> | ✓  | ✓  | ✓  | <u>3</u> | ✓  |
| I <sub>1</sub> | ✓  | ✓  | ✓  | 6        | ✓  |
| I <sub>2</sub> | ✓  | ✓  | ✓  | ✓        | ✓  |
| I <sub>3</sub> | ✓  | ✓  | ✓  | ✓        | ✓  |

C1 C2 C3 C4 C5 C6 C7 C8 C9 C10 O' C12 C3 and  
I<sub>0</sub> IF ✓ I<sub>0</sub> ✓ OF ✓ PD ✓ PD WR

I<sub>1</sub> IF ✓ ID ✓ OF ✓ III, III PD PD PD PD

I<sub>2</sub> IF ✓ ID ✓ III, III OF III III III PD

I<sub>3</sub> IF III III ID III III III III OF



## Memory

Memory is used to store the program while exec<sup>n.</sup> of the program CPU frequently referencing the memory. So memory access time is controllable to calculate CPU performance.

Performance is dependent upon Access Time with inverse relatn.

To decrease the access time, memory hierarchy is used b/w memory and CPU.

Table Shows:

| Level                          | 0                                | 1                              | 2           | 3                                         |
|--------------------------------|----------------------------------|--------------------------------|-------------|-------------------------------------------|
| Name                           | reg                              | cache                          | M-M         | S-M                                       |
| Size                           | <1KB                             | <16MB                          | <16GB       | >100GB                                    |
| Implementation                 | Custom memory                    | onchip/off chip                | DRAM        | Magnetic Disk.                            |
| Technology                     | with multiports                  | SRAM                           |             |                                           |
| Access time                    | 0.25 - 0.5                       | 0.5-25                         | 80-250      | 5,00,000<br>A.T. High as compared to S.M. |
|                                | A.T. (low) - perf (High)         |                                |             |                                           |
| Bandwidth (Data transfer rate) | 20,000-1,00,000<br>[D.TR (High)] | 5,000-10000<br>due to register | 1,000-5,000 | 20 - 150                                  |
| Wr. (MB/sec)                   |                                  |                                |             |                                           |
| Managed by                     | Compiler                         | H/W                            | OS          | OS                                        |
| Packed by                      | Cache                            | MM                             | SM          | CD/Floppy                                 |

Cache Memory is an intermediate memory b/w CPU & main memory.

Cache Underlying principle is Locality of reference

### Locality of Reference

Large Data is placed in small area (<sup>data</sup> which is needed)  
so Access time will be less, so performance is higher

Locality of reference is of 2 types :-

① Spatial locality    ② Temporal locality

Spatial locality : The adjacent words of referenced block is used by CPU in the near future i.e.



Temporal locality : Same word in the block referenced by the CPU in near future.



Cache Memory is intermediate memory b/w CPU and Main memory



\* Block means group of memory cells.

\* Main memory is divided into no of blocks based on size of block.

✓ → No of blocks = Memory size  
(group of M.C.) / Block size



→ How cache memory is organized?

\* Cache Memory size is less than main memory

divided into no of parts based on block size.

line size → divided into no of parts each part is known as line.



∴ No of lines = Cache Memory size / Block size

Points

① Cache memory is divided into equal size parts based on size of the block. Each partition is called as line. Line size is always equal to block size. Main memory is also divided into equal partitions based on size of the block.

\* No. of blocks in main memory  $\Rightarrow$  is more than no. of lines in cache memory.

② Memory is organized in two ways :-

① Independent memory.

② Hierarchical memory

Independent memory organization: not present in cache. Data present in cache then move in L1 then move to L2 then move to main memory.



When the CPU executes the program initial data reference takes place in the registers. If the data is not present, next reference is to cache memory.

If the data is present in the cache memory operation is called hit, data is transferred to CPU.

If the data is not present in cache, upon architecture, CPU enables the level of main memory cell. and immediately access the data w/o involvement of cache.

Average access time under this organization:

\* Miss penalty = access time of main memory

Accessing given of 2 level memory independent org

$$T_{av} = T_1 H_1 + (1 - H_1) \cdot T_2$$

↓  
access time of  
q.C.M.

} No connection b/w  
L0 and L1.

main memory.

Because of independent memory areas, the content of main memory is directly move to CPU.



So we can say it is not suitable. Here access time inc, pen de.

\* flat ratio = no of hits  
(Hits + misses)  
total access

hit = miss = 0  
time ↗  
latency ↗

No extra time is required to check whether it is hit or miss.

\* By default we need to consider Independent if its  
 Date \_\_\_\_\_  
 Page No. \_\_\_\_\_  
 one & is not mentioned  
 about any

- \* In this organization if there is a miss operation takes place i.e. always miss throughout program execution in L1 (Memory) because the image of L1 is not copied into L0.
- \* Consider hierarchical memory concept when there is a miss operation in L0 CPU identifies the data in L1 and transfers that data into L0 and reads the data from L0.  
 So  $T_{avg} = T_1 \cdot H_1 + (1 - H_1)(T_2 + T_0)$
- \* Principle of locality sequence is possible due to Hierarchical as we can see the data in Hierarchical is dynamic (changing) But in case of Independent, the data is static.  
 In every part, extra storage is present

## ABC's of Cache

- ① Mapping Technique
- ② Replacement algo's
- ③ Cache coherency
- ④ Levels of cache

## Mapping Technique

→ The process of transferring data from main memory to cache memory is called as mapping. This mapping can be done with 3 Techniques :-

- ① Associative mapping
- ② Direct mapping
- ③ Set associative mapping.

→ Associative mapping :- While transferring the data from main memory to cache lines, it checks the availability of line. If the line is free, immediately data is transferred w/o any formulae (rules). If line is not available with the help of replacement algorithm any one of line is replaced by new data.



no need of checking where Block Bo is located in cache. we just check the availability of line.

→ Direct mapping while transferring the data from main memory to cache memory, it uses the formula  $K \bmod N$  to replace any line. Where K indicates main memory block no. N indicates no of cache lines.

→ CPU want to copy data, but data is not present on Cache line. Then it referred to main memory & referred to Block No such as B<sub>8</sub>

$$\Rightarrow K \bmod N$$

$$\Rightarrow B \bmod 4$$

$$\Rightarrow 0 \checkmark \rightarrow B_0 \rightarrow \underline{\text{data}}$$

CPU want to read data from B<sub>0</sub> and then replace the existing data on B<sub>0</sub> with B<sub>8</sub>.

Cache

|                |                |
|----------------|----------------|
| B <sub>0</sub> | B <sub>8</sub> |
| B <sub>1</sub> |                |
| B <sub>9</sub> |                |
| B <sub>1</sub> |                |

\* with the help of this formula is K MOD N

↳ There is association b/w cache line and

main memory block.

cache.



→ How many no. of main memory blocks are associated with one cache line.

$$\checkmark \text{ tag} = 2 \Rightarrow 2^2 = 4$$

Point

\* When the CPU generates the physical address if the Cache memory supports Direct mapped the address is interpreted as B fields info.

← Physical memory addr →



$$\checkmark \text{ Word offset} = \log \text{ Block size}$$

Problem of Direct mapped → one line holding one reg.

If same line uses 2 stages, no of mis operrat<sup>n</sup> is reduced

Date

Page No.:

Suppose address 00 00 110110 . 2 2 6.



- Consider main memory size is  $1\text{ KB}$  ( $2^{10}$ ) and Cache memory size ( $64 \times 4 = 256$  bytes), where Block size is equal to 64 bytes. The main memory is divided into

No. of blocks  $\Rightarrow$  64 KB  $\Rightarrow$  Memory size  
64 Block size

$$\Rightarrow \frac{2^{10} - 2^4}{2^6} \Rightarrow 16 \text{ blocks.}$$

and denoted with  $B_0 \rightarrow B_{15}$

Cache memory is divided into 4 lines denoted with 10 to 13.



$\geq \log_2$

No. of lines = 4

$$\log_2 2^2 = (2 \rightarrow \text{bits})$$

8 bytes four  
comps i.e.  $\begin{matrix} 0 \\ 1 \\ 0 \\ 1 \end{matrix}$

\* Line offset  $\hat{=}$  log # lines

\* Tag bits indicate the possible no. of main memory blocks present in cache line.

If tag is 00  $\rightarrow$  Block 0  $\rightarrow$  line 0 is present.

01  $\rightarrow$  B4

Physical memory.

$\Rightarrow$  tag + word offset

line no.  
01

10  $\rightarrow$  B8

11  $\rightarrow$  B12

| Line no. | Tag > 10 | Line no. | Tag no |
|----------|----------|----------|--------|
|          |          | 00000    | 00000  |
|          |          | 00001    | 00001  |
|          |          | 00010    | 00010  |
|          |          | 00011    | 00011  |
|          |          | 00100    | 00100  |
|          |          | 00101    | 00101  |
|          |          | 00110    | 00110  |
|          |          | 00111    | 00111  |
|          |          | 01000    | 01000  |
|          |          | 01001    | 01001  |
|          |          | 01010    | 01010  |
|          |          | 01011    | 01011  |
|          |          | 01100    | 01100  |
|          |          | 01101    | 01101  |

Blocks in main memory.

0000  $\rightarrow$  1111

due to direct mapped.



size of block = 64 bytes

CPU want to read data so addrs are generated



Tag matching

01

110110



Cache line (Direct mapped cache)

Memory interpretation in read



The direct mapping known formula shows that which memory block is transferred to which line of cache. Transfer Bo block, if  $i \text{ mod } 4 \} = 0$ . The data is transferred from main memory to cache memory line along with tag.



The CPU generates Physical addr to read the data from cache, memory add<sup>r</sup> is interpreted in the following format.



\* Tag 00 is compared with existing tag's of cache memory. If it is matching, it enables the respective line along with word offset. The byte is transferred from <sup>cache</sup> memory to CPU.

\* If it is not matching, it enables memory block with the help of KMODN formula, block is transferred from main memory to Cache memory along with tag.

& then it transfer the data to CPU.

Drawback of Direct mapped cache  $\rightarrow$  One line holds only one tag.

This problem is overcome by using set associative mapping.

## Set Associative

While transferring the data from main memory to cache memory, this method uses  $K \text{ mod } S$  formula to identify the cache line, where  $K$  indicates main memory reference no. and  $S$  indicates no of sets.



$$\text{No of sets} = N \rightarrow \text{no of Cache line}$$

(P-way).

Value of how many sets are present in cache line

Once the CPU generates physical addr to read the data from set associative cache memory; the addr is interpreted as the following format



$$\text{No of Cache lines} = 4$$

Mapping Tech = 2 way set associative.

$$\text{No of sets} = 4 = 2$$

Each cache line is divided into 2 sets.

|   |                |                |
|---|----------------|----------------|
| 0 | B <sub>0</sub> | B <sub>4</sub> |
| 1 | B <sub>3</sub> | B <sub>7</sub> |

Same line is used to access. In direct mapped B<sub>0</sub> is changed to B<sub>4</sub>. Other data is transferred. but here there is no need of replacement. There is enough space for B<sub>4</sub>.



Direct mapped

only one time miss

operation other hit operation

Prog - B<sub>0</sub>

Prog - B<sub>4</sub>

" → B<sub>0</sub>

(set associative mapping) " → B<sub>4</sub>

" → B<sub>0</sub>

only one time hit

operation other time miss is happened.

\* So we can conclude set associative is better than Direct mapped

## \* Replacement Algo

When there is a miss operat<sup>n</sup> in the cache, take the data from main memory. while transferring the data from main memory to cache, if the cache is full which line of cache is replaced with new block. This info is given by replacement algo's.

Replacement algos are divided into 3 types →

- ① Random replacement
- ② First in first out replacement
- ③ Least recently used replacement

- ① Random → In this method, no policy is used to replace cache line with new block.
- ② FIFO → In this method, each line time stamp is taken into account, replace the line with new block whose time stamp is high.
- ③ LRU In this method, 2 parameters are considered into the account, no of references and time stamp.  
Replace the line with new block where the line is having less no of registered references with higher <sup>time</sup> stamp.

Associative Mapping :-(Memory chip does not have address).

The memory is called as content addressable memory or parallel search memory or multi access memory.

Used to maintain 1 bit off time.



Count the no of 1's in matched reg, those elements are retrieved from it.

III XXX X

X indicates disable that bit

III   1  

enable the bit

we will do enable / disable acc to arg reg.

X → Disable.

1 → enable.

respective bits of arg reg considered as key.

100 content

111 → w/o address, Based on

content, other read/write is possible.

↓



100

110

→ Once the content is identified,

all the words of memory undergo full search.

These are used in S where search time is very critical.

1, 2, 3, 4

word retrieved from memory

(Railway website)

Here huge database is maintained

### Numerical

In a 2 level memory, if the level 1 memory is 5 times faster than level 2 and its access time is 10 times less than average access time let the level 1 memory access time ( $T_1$ ) is  $t$ . What is the hit ratio?

$$\text{level 1 memory} = 5 \text{ times}$$

$$\frac{T_{AV}}{T_1 H_1 + (1-H_1) T_2} = \frac{1}{t}$$

$$T_{AV} = T_1 H_1 + (1-H_1) T_2$$

$$\Rightarrow ?$$

① Speedup = access time of  $t_1$  / access time of  $t_2$ .

$$5 \rightarrow \frac{\text{access time of } T_2}{\text{access time of } T_1} = \frac{T_2}{T_1}$$

②

$$5 \times t = T_2$$

$$100 \text{ nsec} = T_2$$

$$T_{AV} = T$$

$$T_1 = T_{AV} - 10$$

$$20 = T_{AV} - 10$$

$$30 = T_{AV}$$

\* Associative memory chip is expensive than RAM chip because of these is need of extra bit required for every cell  
(match logic)  
Size, speed is same in both cases.

$$T_{AV} = 20(H_1) + (1-H_1)100$$

$$30 = 20H_1 + 100 - 100H_1$$

$$30 \Rightarrow -80H_1 + 100$$

$$80H_1 = 70$$

$$H_1 = \frac{70}{80}$$

$$H_1 \Rightarrow 0.8$$

Ques. Consider four block cache with following main memory block references

4, 5, 7, 12, 4, 5, 13, 4, 5, 7

D) Assume cache is initially empty. Identify hit ratios using  
① first in first out  $\Rightarrow 0.2$

② LRU

$\Rightarrow 0.4$

5 13  
4 12  
7  
5  
3

③ Direct mapping  $\Rightarrow 0.5$

④ 2way set associative with LRU

5 13  
4 12  
7  
5  
3

8

① First in first out

① Hit ratio

$$\Rightarrow \frac{5}{10} = 0.5$$

① Which block is present not pre-emption after 12

Hit ratios = No of hits

Total access time

$$\Rightarrow \frac{2}{10} = 1 - 0.2 \text{ CPU}$$

② No

|      |
|------|
| 4 13 |
| 5 4  |
| 7 5  |
| 12 7 |

② How many blocks are there after completion?

② LRU

|    |               |                     |
|----|---------------|---------------------|
| 4  | $\rightarrow$ | 4 $\rightarrow$ M   |
| 5  |               | 5 $\rightarrow$ M.  |
| 7  |               | 7 - M.              |
| 12 |               | 12 $\rightarrow$ M. |

4  $\rightarrow$  M

5  $\rightarrow$  M.

7 - M.

12  $\rightarrow$  M.

4  $\rightarrow$  M C

5  $\rightarrow$  M transfer

7  $\rightarrow$  M

12  $\rightarrow$  M

4 - H

5 - H

13 - M

4 - H

5 - H

7 - M

|    |    |    |    |    |    |
|----|----|----|----|----|----|
| 4  | 12 | 4  | 5  | 13 | 4  |
| 5  | 7  | 12 | 7  | 5  | 13 |
| 4  | 5  | 7  | 12 | 4  | 5  |
| 12 | 7  | 5  | 12 | 13 | 12 |

of references by row along with

|    |    |
|----|----|
| 5  | 7  |
| 4  | 4  |
| 13 | 9  |
| 12 | 13 |

$$\Rightarrow H = 4 = 0.4$$

12 is not

there

③ Direct mapping  $\rightarrow$

$$\begin{aligned} 5 \bmod 4 &= 1 & 4 \bmod 4 &= 0 \\ 13 \bmod 4 &= 1 & 5 \bmod 4 &= 1 \\ 7 \bmod 4 &= 3 & 7 \bmod 4 &= 3 \end{aligned}$$

|      |     |
|------|-----|
| 4 12 | 4   |
| 5    | 3 5 |
| 7    |     |
| 12   |     |

$4 \bmod 4$

= 0

$$5 \bmod 4 = 1$$

$$7 \bmod 4 = 3$$

$$3/10 = 0.5$$

$$12 \bmod 4 = 0$$

## 4 2way set associative cache

$$\begin{array}{c} \text{No of Sets} \rightarrow N \\ \text{Pway} \\ \Rightarrow \frac{4}{2} \end{array}$$

|   |   |    |
|---|---|----|
| 0 | 4 | 12 |
| 1 | 5 | 7  |

= 2

$$\text{Formula} = k \bmod s$$

$$① 4 \bmod 2 \Rightarrow 0$$

$$② 5 \bmod 2 \Rightarrow 1$$

$$③ 7 \bmod 2 \Rightarrow 1$$

$$④ 12 \bmod 2 \Rightarrow 0$$

$$⑤ 4 \bmod 2 \Rightarrow 0 \rightarrow \text{Hit}$$

$$⑥ 5 \bmod 2 \Rightarrow 1 \rightarrow \text{Hit}$$

$$\xrightarrow{\text{next}} ⑦ 7 \bmod 2 \Rightarrow 1 \rightarrow \text{miss} \rightarrow \text{2Set associative with LRU So 5}$$

7 becomes least recent so

becomes latest

$$⑧ 4 \bmod 2 \Rightarrow 0 \rightarrow \text{Hit} \rightarrow \text{replace it.}$$

we can't replace S

$$⑨ 5 \bmod 2 \Rightarrow 1 \rightarrow \text{Hit}$$

$$⑩ 7 \bmod 2 \Rightarrow 1 \rightarrow \text{Mem.}$$

$$\text{Hit ratio} \Rightarrow \frac{4}{10} = 0.4$$

13 is not there in Cache

Ques Consider 1 MB memory and 2KB cache  
memory both are partitioned into 64 Byte word block, How many bits are read for the physical add? what will be the total no of blocks in main memory and cache memory. How many tag bits are read if two cache memory don't mapped.

(23)  $2^7$  lines

$$\text{line size} = 64 \text{ words} \Rightarrow 2^6$$

Physical add<sup>r</sup> 220bit     $S = \text{No of lines}$   
20.                  g

|        |        |      |   |   |  |   |     |       |
|--------|--------|------|---|---|--|---|-----|-------|
| 1      | 9      | *    | 5 | 6 |  | → | 128 | 10    |
| Tag    | gate   | word |   |   |  |   | 4   |       |
| offset | offset |      |   |   |  |   | 5   | ⇒ 128 |

black size = 32  $\Rightarrow 2^5$

Physical address = 32 bit =  $2^{32}$

(a)

~~32~~

Cache size = 32 kbytes

$$\text{No of lines} = \frac{32K}{32} = 1K \Rightarrow 2^{10}$$

|    |    |   |
|----|----|---|
| 17 | 10 | 5 |
|----|----|---|

Tag line word

(10, 17) Ans

3215\*23

(23)

C

三

-6

3.2 KByte

32 but

→ 32 k byte

四

→ 16 Y2

$$\Rightarrow 2^{\frac{t}{4}}$$

326d

~~19~~  
~~13~~ 865  
36

(15) 15.5

of physical add<sup>r</sup>

## Cache Coherence :-



write Image of

main mem; updated



op perform.  
if write  
is a  
feed of  
add<sup>r</sup>

This update is not present in main memory

Same add<sup>r</sup> consists of multiple values i.e.  
Know Cache Coherence

\* After write operation, Cache memory is enabled so  
so value is changed to 45

Modified data will be lost if we apply some  
replacement algo such as same add<sup>r</sup> consists  
of multiple values, then updated data will  
be lost. This is cache coherence. If we  
don't handle cache coherence, it leads to  
problem, such as It damages prog exec.  
Multiprocessor.

Loosely coupled



Total no. of processor  
with shared communication  
with shared memory  
→ Large no. of bits  
(Max no. of bits)

Tightly coupled.



→ Small  
(More no. of miss operat<sup>n</sup>)

## Cache Coherence :-

Date: \_\_\_\_\_  
Page No.: \_\_\_\_\_

| $P_1$ | $P_2$ | MH<br>(x) |
|-------|-------|-----------|
| $x=0$ | -     | 0         |
| $x=0$ | $x=0$ | 0         |
| $x=1$ | $x=0$ | $x=0$     |

Coherence means same addr consists of multiple values

- \* In the Uniprocessor System, when CPU executes a program, it performs write operat'n, only cache memory is updated. The respective addr in the main memory is not updated so same addr consists of 2 values in cache and main memory, i.e. Cache Coherence.

If the cache coherence is not handled properly the new data will be mixing. To overcome cache coherence, there is a need of updating techniques.

- \* There are 2 techniques to avoid the coherence one is write through, write back Policy

- ① write through  $\rightarrow$  simultaneous updation
- ② write Back



$$T_m = 200 \text{ ns}$$

$\uparrow$   
 to access complete  
 block

$T_w \Rightarrow$  word accessing  
 time of  
 main memory.

$$\text{Block size} = 2 \text{ words}$$

To access 1 word it

$$\text{require } \Rightarrow \frac{200}{2}$$

Case 2.



~~20ns are  
 overlapped  
 with 100ns~~

$$T_w = 20 \text{ ns}, 100 \text{ ns}$$

when there is no  
 uniform delay then

$$T_w = \frac{100 \text{ ns}}{\downarrow \text{ delay}} \text{ (max stage}$$

$t_p = \text{max stage}$   
 delay

For successful update, 100ns is reqd.

Write through policy  $\Rightarrow$  indicates simultaneous update  
 ie when the CPU performs write operation it  
 updated into Cache memory and main memory

at a time here  $t_c$  indicates cache memory  
 access time ( $t_m$ ) indicates main memory block  
 access time,  $t_w$  indicates main memory word  
 accessing time

When there is a successful write operat<sup>n</sup>  
 the time reqd to perform that is  $t_w$  is maximum  
 (Cache word access time, main memory  
 word access time)

$$\Rightarrow T_{\max} (\text{cache word access time, main memory access time})$$

$$\Rightarrow T_{\max} (T_c, T_w)$$

$$\Rightarrow T_w$$



(11)

$$32 \times 2^{10}$$

 $\Rightarrow$ 

No. of chips

No. of chips

$$\frac{32 \times 2^{10}}{2^2} \Rightarrow \frac{32 \times 2^{10}}{2^8 \times 2^0}$$

$$2^4$$

8

8

(11)  
2

$$\text{No. of chips} \Rightarrow 256 \text{ K} \Rightarrow \frac{256 \times 2^3}{32 \times 2^{10} \times 8} \Rightarrow 2^6 = 64$$

(13)

$$\text{Cache blocks} = 16 \Rightarrow 2^4$$

Main memory  $\approx 256$  blocks.  $\Rightarrow 2^8$

Main memory.

Cache memory

(d)

$$\text{No. of sets} = 16 \Rightarrow 4$$

|   | 0   | 1 | 2   | 3  | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
|---|-----|---|-----|----|---|---|---|---|---|---|----|----|----|----|----|----|
| 0 |     |   |     |    |   |   |   |   |   |   |    |    |    |    |    |    |
| 1 |     |   |     |    |   |   |   |   |   |   |    |    |    |    |    |    |
| 2 |     |   |     |    |   |   |   |   |   |   |    |    |    |    |    |    |
| 3 | 255 | 3 | 159 | 63 |   |   |   |   |   |   |    |    |    |    |    |    |

✓  
✓

10  
11  
12  
13  
14  
15

$$\text{① } \Rightarrow k \bmod s. \quad \text{② } 1 \bmod 4 \quad \text{③ } 0 \bmod 4$$

10  
11  
12  
13  
14  
15

$$\Rightarrow 0 \bmod 4$$

$$\Rightarrow 1$$

$$0 \bmod 4$$

$$\Rightarrow 0$$

$$\Rightarrow 0.$$

④

$$4 \bmod 4$$

$$\Rightarrow 0.$$

⑤

$$3 \bmod 4$$

$$\Rightarrow 3.$$

62  
11  
2  
1  
0  
1  
2  
3  
4  
5  
6  
7  
8  
9  
10  
11  
12  
13  
14  
15

Physical memory =  $\frac{2^{20} \cdot 2^{14}}{2^6}$

(1) Physical memory = 20 bits

(2) no of blocks =  $\frac{2^{20}}{2^6} = \underline{\underline{2^4}}$  in main memory

(3) word offset =  $\log_2$  Block size  $\Rightarrow \cancel{\log_2 2^6}$   
 $\log_2 2^6 = 6$   $\cancel{2^{14}}$

line offset =  $\log_2$  no of lines  $\log_2 2^5 = 5$

(3) Cache memory =  $\frac{2^4}{2^6} = \underline{\underline{2^1}}$



(8) Memory = 20 bit address Cache =  $\frac{2^8 \text{ bit}}{\text{Line size}}$

$\underline{20.} \quad \underline{\underline{8 \text{ bit}}} +$

6 | 6 | 8 | 8 |

Tag. line size word offset

no of blocks in main memory  $\Rightarrow$  Memory size  
 $\cdot$  Block size

$$\Rightarrow \frac{2^{20}}{2^8} = \underline{\underline{2^{12}}} = \underline{\underline{4096}}$$

$$\text{no of lines in cache} = \frac{2^8}{2^3} = \underline{\underline{2^5}} = \underline{\underline{32}}$$

## System performance

read operat'n

write operat'n

$$T_{avg}(\text{read}) =$$



$$T_{avg}(\text{read}) = T_C \cdot H_r + (1 - H_r) (T_m + T_c)$$

(Read operat'n) (T<sub>m</sub> + T<sub>c</sub>)

$$T_{avg}(\text{write})$$



$$T_{avg}(\text{write}) = \text{Successful write} + \text{Miss write}$$

$$= H_w \cdot T_w + (1 - H_w) (T_m + T_w)$$

Assume the block which is modified, that block is present in cache & M.M. (max time need to get data simultaneously for m.m. & cache)

Miss write is also known write allocate  $\rightarrow$  The process of transferring the block from main memory to cache memory even only write operation is performed.

### Write OPERATION:-

In the case of write operatn, first CPU checks the cache memory for the availability of block which is going to be modified. If the block is found in cache, that operation is called as hit operatn related to write. The CPU immediately performs simultaneous updates on Cache and main memory. So it will take time.

Hitprocess  $[tw * Hw]$  - ①

② If the reqd block which is going to be modified is not in the cache, operation is miss.

so. there is a need of transferring the reqd block from main memory to cache before write operation. This process is called as write allocate

When the block is available in the cache, the CPU simultaneously update the block, this process requires the time.

$$\text{Miss process} = (1-Hw) (Tm + Tw) \quad \text{-- ②}$$

$\downarrow$  word update

Block transfer

$$Tm = \text{Hit process} + \text{Miss process} \text{ ie}$$

$$\Rightarrow tw * Hw + (1-Hw) (Tm + Tw)$$

$$T_{\text{read}} = T_C H_R + (1-H_R)(T_m + T_C)$$

$$\Rightarrow 20 \times 0.80 + (1-0.80)(200+20)$$

$$\Rightarrow \frac{1600}{100} + 0.2 \times 220$$



$$\Rightarrow 16 + 44 = 60$$

$$T_{\text{write}} \Rightarrow H_w \cdot T_c + (1-H_w)(T_m + T_w)$$

$$\Rightarrow 0.90 \times 100 (1-0.9) (200+200)$$

$$\Rightarrow \frac{90}{100} \times 200 + 0.1 \times 300$$

$$= 120$$

$$T_{\text{avg}} = H_R \cdot T_C + (1-H_R) \left[ \frac{1}{2} \text{ clean bits} (T_m + T_C) + 30\% [T_m + T_m + T_C] \right]$$

$$\Rightarrow 0.80 \times 20 + (1-0.80) \left[ 70\% (200+20) + 30\% [200+200+20] \right]$$

$$\Rightarrow \frac{80}{100} \times 20 + (0.2) \left[ \frac{70}{100} \times 220 + \frac{30}{100} \times 420 \right]$$

$$\Rightarrow 16 + 0.2 [154 + 126]$$

$$\Rightarrow 16 + \frac{2}{5} [280] = \frac{152}{5} = 30.4$$

$$H_R = 80\%$$

$$H_w = 90\% \quad 30\% \text{ update}$$

$$\frac{56}{16} = \frac{72}{2} = \frac{28}{5.6} = 5$$

$$C.R \% = 100\% \\ M.R \% = 50\%$$

$$T_{\text{write}} = H_w \cdot T_c + (1-H_w) [70\% (T_m + T_C) + 30\% (2T_m + T_C)]$$

$\Rightarrow 40 \text{ msec}$

# COMPUTER ORGANIZATION