

## Introduction

### Computer organization

- physical organization
- Touchable
- Computer ~~organization~~ Architecture

#### 1) Instructions

based on capability

##### RISC

- Reduced Instruction set Computing
- Limited capability
- clocks per unit instruction = 1
- Fixed size instructions
  - reduces compiler complexity
- H/w control unit design
- parameter passing is more efficient because of register window
- Rich set of registers
- programmer overhead is more
- Limited addressing modes (AM)
- AM or flexibility of pgm development (less flexible)

##### CISC

- Complex Instruction set Computing
- More capability
- clocks per instruction  $\geq 1$
- Variable size instructions
  - increases compiler complexity
- micro operation control unit design
- programmer overhead is less
- More Addressing modes
- More flexible

#### Note:

Instruction pipeline is efficiently used in RISC systems due to fixed size.

#### 2) Addressing modes: The way how operand is given in the instruction

- ① Immediate: operands are provided along with the instruction (fastest)
- ② Direct: Reference to the operands
- ③ Indirect: Reference to the effective address

#### 3) Data format: How to interpret the binary string

- For meaningful communication



#### Data formats



#### EBCDIC

- Extended binary coded decimal interchange code
- 8 bits / char

- Computer organization deals with physical devices and their interconnections with the aim of improving Performance  $\rightarrow$  metric: MIPS (Million Instructions per second)

| in/c | mips |
|------|------|
| A    | 10   |
| B    | >10  |

More performance

- Mips depends on
- nature of instructions
  - nature of h/w
  - nature of bus & mem & compiler

{ different on diff mips

(metric)  
FLOPs

(floating point operations per second)

#### Goals of Computer organization

- Improving Performance
- Reducing time

Performance is improved based on the way data is represented, stored, accessed, processed and communicated.

### How Data represented:

Data: (3,52)

Rep1:



3 Memory Access.

Rep2:



1 Memory Access  
(More performance)

### How Data stored:

#### Patterson 90 to 10% rule:

10% of pgm consume 90% of the time.



$$\text{Without Cache} \rightarrow 10 \times 100 = 1000 \text{ ns}$$

$$\text{With Cache} \rightarrow 40 + 10 = 50 \text{ ns (More performance)}$$

### How Data Accessed:



### How Data Communicated:

#### Message passing

#### Message process interface

(More performance)  $\downarrow$  Supports  $\Rightarrow$  Parallel Computation

#### shared memory

work division  
+ work assignment  
+ work collection

### Note:

|              | More performance          |
|--------------|---------------------------|
| Represented  | [M E]                     |
| stored       | Cache                     |
| Accessed     | DMA                       |
| Processed    | Pipelining                |
| Communicated | Message process interface |

### Computer Architecture Classifications:

#### Von Neumann

→ First memory based EA

Basis: store of pgm and data  
 $\downarrow$

Stored pgm concept

∴ Problem size and obs dependency  
memory space

#### Harvard

Basis: separate pgm store + data store

Stored pgm concept

#### FlyN

Basis: instruction stream  
and data stream

### Note: Fastness



## Von Neumann Architecture



⇒ If program size is more than our memory then this Arch fails to execute.

## Flynn Architecture

- It mainly focuses on flow of instructions



SIMD - single instruction multiple data

Ex:- `for(i=0; i<n; i++)`

$$c[i] = a[i] + b[i]$$



## Memory organization:

- Arrangement of memory with an objective of reducing Average Access Time.

### Memory Hierarchy:



CMOS - Complementary metal oxide semiconductor field effect transistor

BJT - Bipolar junction transistor.

-  $L_0, L_1, L_2, L_3$  are levels according to distance b/w processor

## Harvard Architecture (Patterson)



Data bus.

- changes in data will not affect program & viceversa
- parallel buses are used so more performance is obtained.

| Note :  |                 |                    |             |  |
|---------|-----------------|--------------------|-------------|--|
| streams | Control unit CU | Processing element | Memory      |  |
| SISD    | 1               | 1                  | Single unit |  |
| SIMD    | 1               | M                  | Modules     |  |
| MISD    | M               | 1                  | Single unit |  |
| MIMD    | M               | N                  | modules     |  |

### Practically Implemented Arch



$$\text{Single Core MLC} = 100 \times \text{sec}$$

$$\text{Dual Core MLC} = 40\text{sec} + \frac{60\text{ sec}}{2}$$

$$\text{Speed up} = \frac{100 \times 40}{100 \times 2} \Rightarrow 1.42 \Rightarrow 42\% \text{ Increase.}$$

## RAM

- Static RAM
- 6 Transistors
- ∴ Faster
- No Refreshing
- Costly bcoz to store 1 bit we require 6 transistors

- DRAM**
- Dynamic RAM
  - 1 Transistor + 1 capacitor
  - charged (Logic 1)      discharged (Logic 0)
  - Discharging is done at regular intervals so Refreshing is done
  - ↓ Refresh cycle placing same data into capacitor
  - slower compared to SRAM
  - cheaper compared to SRAM



- #)



- Performance of Memory Hierarchy is given by Hit ratio

$$H \propto S \text{ (size)}$$

$$H \propto \frac{1}{\text{Avg mem access time}}$$

The side effect of memory hierarchy is Data inconsistency at diff levels

↓ sol  
Proper write opens

Note:  $i \rightarrow \text{level}$   
 $H_0 < H_1 < H_2 < H_3 = 1$

- $T_i < T_{i+1}$
- $S_i < S_{i+1}$
- $C_i > C_{i+1}$
- $F_i > F_{i+1} \Rightarrow \text{Freq. of mem access}$
- $I_i < I_{i+1} \Rightarrow \text{Information}$

## Mathematical Expression for Memory Hierarchy:

### (I) Average Cost Expression



$$\text{Total size} = S_1 + S_2$$

$$\text{Total cost} = C_1 S_1 + C_2 S_2$$

$$\text{Avg cost} = \frac{C_1 S_1 + C_2 S_2}{S_1 + S_2}$$

### (II) Average Mem Access time Expression:

Case 1 : Processor have Access to both levels.



$$T_{avg} = h_1 T_1 + (1-h_1) T_2$$

Case 2 : Processor have access with faster memory



Note :

In Case 2 For multiple levels of memory

$$T_{avg} = T_1 + (1-h_1) T_2 + (1-h_1)(1-h_2) T_3 + \dots + (1-h_1)(1-h_2)\dots(1-h_{k-1}) T_k$$

### Case 3 : Exploring locality of reference

- In this case the current address is not available in faster memory, then a block of words are moved from slower memory to faster memory.
- This ensure the subsequently addressed words are supplied from faster memory hence in future reference miss penalty is reduced.



$$T_{avg} = b_1 T_1 + (1-h_1) [T_B + T_2]$$

$$T_{avg} = T_1 + (1-h_1) T_B$$

Note :

For multiple (k) levels of memory

$$T_{avg} = T_1 + (1-h_1) T_2 + (1-h_1)(1-h_2) T_3 + \dots + (1-h_1)(1-h_2)\dots(1-h_{k-1}) T_B$$

Note :

Hit value never changes! Individual mem access but affects effective mem access time  
(upon increase / decrease)

### Cache Memory : Overview

- Cache is the smallest and fastest memory component in hierarchy.
- It bridges the speed mismatch between the fastest processor and slowest memory component.
- It maintains locality of reference.
- Cache memory
  - Based on Contents
  - Two cache Data cache combined cache
- The Cache memory and main memory are divided into equi sized blocks.
- The num of blocks in cache memory is less than the num of blocks in main memory.
- So, Address Mapping decide which main mem block has to be placed in Cache memory.

#### Cache Memory

Based on which address used

physically indexed cache  
(PA is used)

virtually indexed Cache  
(VA is used)

#### Direct Mapping

- Simplest
- Exposed to conflict problems

#### Cache Coherence problem

##### Single processor system

- Write through update
- Resolved by

- Block replacement techniques are aimed for reducing num of block moves from main memory to Cache memory.

#### FIFO

- Arrival time is used as basis for replacement

#### LRU

- Most recently used block is prevented from replacement

#### Block size vs performance

- Block size selection must minimize the faults associated with locality of reference.

Optimal block size is selected

Spatial Locality

For bigger blocks are better

Temporal Locality

Smaller blocks are better



### Address mapping:

- The Address mapping decide the location of main mem block in Cache mem

Tag bits: They denote which main memory block is present in that cache position at that time.

- All this tag information is placed in a new memory called Cache directory

\* Space reqd for cache directory =  $N * (\text{Tag bits} + \text{Status bits})$

### Cache directory

Cache controller

| Tag bits   | Status bits |       |          |
|------------|-------------|-------|----------|
|            | use bit     | valid | invalid  |
|            |             |       | Modified |
|            |             |       |          |
| 1101       | 1           | 1     | 0        |
| :          | :           | :     | :        |
| $N$ Blocks |             |       |          |

Info about Main mem in Cache directory

### ① Direct Mapping :

- place the  $k^{\text{th}}$  main memory block in  $\lfloor k \text{ Mod } N \rfloor$  Cache block.

Ex:- Cache = 64 words

$$\text{Blocks}(N) = 4 \Rightarrow [1B \times 16W]$$

| $B = k \text{ mod } 4$ |          |
|------------------------|----------|
| 0                      |          |
| 1                      | $B_1$    |
| 2                      | $B_{20}$ |
| 3                      |          |



- The word offset will never change whether the block is in Cache or main mem

Conflict problem: let  $k_1, k_2$  be two main mem blocks

If  $\lfloor k_1 \text{ Mod } N \rfloor = \lfloor k_2 \text{ Mod } N \rfloor \rightarrow \text{Conflict}$



→ Direct  
→ Associative  
→ Set associative

Main memory = 512 words

$$\text{Blocks} = \frac{512W}{16W} = 32 \text{ Blocks.}$$

| $k$ |
|-----|
| 0   |
| 1   |
| 2   |
| 3   |
| ... |
| 29  |
| 30  |
| 31  |



In Direct addressing: (PA)



The direct mapping need least num of mappings  
Simpler to design  
Uses only one tag comparator

(1)

| Tag |     |     |     |     |     |     |     |
|-----|-----|-----|-----|-----|-----|-----|-----|
| 000 | 001 | 010 | 011 | 100 | 101 | 110 | 111 |
| 00  | 4   | 8   | 12  | 16  | 20  | 24  | 28  |
| 01  | 1   | 5   | 9   | 13  | 17  | 21  | 25  |
| 10  | 2   | 6   | 10  | 14  | 18  | 22  | 26  |
| 11  | 3   | 7   | 11  | 15  | 19  | 23  | 27  |

Cache offset

| Tag | 0 | B8  |
|-----|---|-----|
| 100 | 1 | B17 |
| 101 | 2 | B10 |
| 101 | 3 | B7  |



Note :

$$k = f(T, B, N) = N * T + B$$

word 300

$\frac{300}{16} \rightarrow 16^2 18 + 12$   
18<sup>th</sup> Block of 12<sup>th</sup> word  
Not present in Cache

word 167

$\frac{167}{16} \rightarrow 16^2 10 + 7$   
10<sup>th</sup> Block of 7<sup>th</sup> word  
Present in Cache.  
(or)  
010 10 0111  $\rightarrow$  167  
Tag Cache word offset

- A word is said to be hit, if tag info of PA is matching with the tag info of the contained cache block which was referred by cache block offset

| Block | Tag | Main mem block (k) | word range (k*16, k*16+15) |
|-------|-----|--------------------|----------------------------|
| 0     | 2   | 8                  | 128 - 143                  |
| 1     | 4   | 17                 | 272 - 287                  |
| 2     | 2   | 10                 | 160 - 175                  |
| 3     | 1   | 7                  | 112 - 127                  |

∴

Easy to Implement

(N-1) MUX are reqd

1 (N-1) comparator reqd

∴

Conflict can present

#### How Approach for hit Computation :



$$\text{Hit latency} = \text{mux delay} + \text{comparator delay}$$

Multiplexer :



Comparator :



#### ② Associate mapping :

- In Associate mapping any block of main memory can be placed anywhere in Cache and hence there is no conflict problem.
- To know the address word present in cache, not all cache blocks have to be CMN so we require  $n$  tag comparators

- The physical address is splitted into two portions



Note :  
 Tag bits AM  $\rightarrow$  Tag bits DM + Cache offset DM

- More tag bits are present in Associate mapping

∴ No conflicts  $\Leftrightarrow$  More hits reqd  
 $N \times m$  bit comparator + 1 or gate  $\rightarrow$  Implies reqd

### HW Approach for hit Computing :



$$\text{Hit latency} = \text{Comparator delay} + \text{OR gate delay} / \text{MUX delay}$$

### ③ Set Associate Mapping :

∴ Conflicts reduced  
 less HW than Associative  $\Leftrightarrow$  Slower than Associative

- In set associate mapping the cache is divided into logical sets

#### k-way Set Association :

- Each set contains 'K' num of blocks.
- Total num of sets in cache (S) =  $\frac{N}{K}$

Note :  
 If  $K=1 \Rightarrow$  Direct mapping  
 $K=N \Rightarrow$  Associate mapping

- The Block  $x$  of main memory has to be placed in  $x \bmod S$  set of cache where  $0 \leq x \leq M-1$ , within the set it can be placed anywhere

so, it requires ' $K$ ' num of comparators and ' $S * \log \frac{M}{S}$ ' num of multiplexers

- Conflict problem is still present with set assoc mapping with reduced magnitude

$K \rightarrow$  small  $\rightarrow$  more conflicts  
 $K \rightarrow$  large  $\rightarrow$  less conflicts

- Physical Address is divided into -

- Tag bits SMM  $\rightarrow$   $\log_2 \frac{M}{S}$        $S = \frac{N}{K}$   
 $= \log_2 \frac{M K}{N}$



Note :  $\log_2 \frac{M}{N} \leq \log_2 \frac{M}{N} + \log_2 K < \log_2 M$

$$\text{Tag bits SMM} = \log_2 \frac{M}{N} + \log_2 K$$

Used when block size is not  $\frac{W}{K}$

## Hardware approach for hit computing :

- Num of Comparators =  $k$  (N bit Comp)
- Num of Mux =  $S + \log_2 \frac{M}{S}$  or  $S + \text{num of tag bits}$



$$\text{Hit latency} = \text{Mux delay} + \text{Comparator delay} + \text{OR gate delay}$$

## Points to remember

- If block size is not known

$$\text{Num of tag bits} = \log_2 \frac{M}{N} + \log_2 k$$

M - capacity of main memory

N - capacity of cache memory

W - num of words

B - Num of Blocks

k - set assoc splitting value



## Cache Coherence problem :

Write through : cache memory & main memory are updated simultaneously

$$\text{Updation} = \text{Max}(T_c, T_m) + T_M$$



☺ Simplicity

✗ unnecessary intermediate updations are performed in main memory.

↓  
for more write oper's write through gives worst performance

e.g. for(i=0; i<1000; i++) → a is updated 1000 times in both cache & main memory  
a=a;

## Write back update :



- current updation is performed in cache memory

- main memory is updated when concerned block in cache is chosen for replacement

✗ Updation off is more. → it uses modified bit concept

Modified bit (dirty bit)  $\begin{cases} 0 \rightarrow \text{not modified only referred ie read} \\ 1 \rightarrow \text{Modified ie referred and write are done} \end{cases}$

- Denote the status block modification in cache memory



$$T_{\text{update}} = T_{\text{BNEW}} + T_c \text{ (current opn on new block)}$$

cleanslot / empty slot

$$T_{\text{update}} = T_{\text{Bnew}} + T_{\text{Copy}} + T_c \text{ (current opn on new block)}$$

Dirty block

### Cache Coherence problem in Multiprocessor Environment:

The write through/<sup>wire back</sup> strategy can't resolve the cache coherence problem in multiprocessor environment due to Global variables defined at other places.



$$\begin{aligned} T_1 : (M_1) &: x = 5 \\ &x = x + 2 \\ T_2 : (M_2) &: x = x + 7 \\ T_3 : (M_1) &: y = x + 2 \end{aligned}$$

∴ Write through, write back can't resolve cache coherence problem



### Update all:

- updation of shared variable at all places
- ii - unnecessary updations are performed if that shared variable is no longer needed ie (Next use  $\rightarrow$  null)
- central node will take responsibility

### Selective update:

- Individual machines will take updation responsibility by broadcasting that the shared variable is updated.

### Directory based protocol:

In directory based approach the central node maintains the variable and its latest value and the bit string representing which mle are having that shared variable.

### i) Full directory based protocol

Present for Every local + Global var



∴ It consumes more space for directory information.

## 2) Limited directory protocol :

For each variable a list is maintained for directory information.

- Num of lists = Num of nodes that access a particular variable.



i) Effective directory space

ii) Complex to implement.

### Points to remember :



### Block replacement strategies :

- These strategies are aimed to reduce the num of block moments so that the performance is improved.

#### Techniques

##### Optimal

-  $K=4$

Conflict : Arrival time



Hits - 6

removals - 3

Blocks in cache : 25, 23, 7, 13

##### FIFO

- Imp using Queue
- with spatial locality its performance is better

##### Fully Associate Cache

our Assumption is MRU block may require again & again queue, stack are used for implementing LRU + use bit

##### Set Associate Cache

LRU



Optimal replacement Algo uses forward reference distance.

ie K: 4 5 7 12 4 5 13 4 5 7 25 23 F



- To know which arrived first in the cache we place a counter in the cache that tracks all the blocks coming in the cache.

##### ② FIFO

-  $K=4$

K: 4, 6, 7, 12, 4, 5, 13, 4, 5, 7, 25, 23, F

Hits = 3

Num of removals = 6

Last block moved out : 4

Blocks in cache : 5, 7, 25, 23

$$\text{Efficiency} = \frac{\text{Hits}_{\text{FIFO}}}{\text{Hits}_{\text{opt}}} * 100$$

$$= \frac{3}{6} * 100 = 50\%$$

③ LRU : (old → olduse) (Fully Associate Cache)

$- k=4$

$k: 4 \ 5 \ 7 \ 12 \ 4 \ 5 \ 13 \ 4 \ 5 \ 7 \ 25 \ 23 \ 7$

$$\rightarrow \eta = \frac{5}{6} \times 100 \approx 83.3\%$$

Queue Implementation :

Hits = 5

Removals = 4

Last block moved out = 4

Blocks in cache : 7 23 25 4

MRU Block : 7



Set Associate Cache :



$\times \text{MODS}$

$\downarrow$

$\times \text{MOD2}$



Hits = 4  
Removals = 5  
Blocks in S0 = 4, 12  
S1 = 7, 23

$$\eta = \frac{4}{6} \times 100 = 66.66\%$$

Direct Cache :

$\times \text{MODN} \rightarrow \times \text{MOD4}$

$k: 4 \ 5 \ 7 \ 12 \ 4 \ 5 \ 13 \ 4 \ 5 \ 7 \ 25 \ 23 \ 7$

$\times \text{MOD4}$



Hits = 3

Removals = 7

Last Block = 23 taken out

Blocks in cache = 4, 25, 7

$$\eta = \frac{3}{6} \times 100 = 50\%$$

Effect of Array placement in Memory and its influence on performance :

- The array is normally stored in RMO, if the Accessing is performed row wise then it gives optimal performance however the column wise references degrades the performance.
- If the array size is fairly bigger than the cache capacity for each block 1 fault occurs in the best case and for each reference 1 fault occur in the worst case.

i.e. for eg :-

Float array  $A[512, 512]$  is stored in MM and let Index var and x avail in processor registers. Direct mapped Cache of 64KB with 64B blocks are used for implementing PGMS P1 and P2. If  $M_1$  and  $M_2$  are miss references and each array element occupies 8B then

(i)  $M_1$  32K       $P_1:$  for  $i=1$  to 512  
                        for  $j=1$  to 512  
                         $x+=A[i][j];$

(ii)  $M_2$  256K       $P_2:$  for  $j=1$  to 512  
                        for  $i=1$  to 512  
                         $x+=A[i][j];$

Ques) Direct mapped cache = 64KB

Blocks =  $\frac{64KB}{64B} = 1024$  blocks in cache

Main Memory size  $\Rightarrow 2^9 \times 2^9 \times 8 = 2 \text{ MB}$ ,

Total Element references  $\Rightarrow 2^9 \times 2^9 \approx 2^{18} \Rightarrow 256K$

Blocks reqd.  $\frac{2 \text{ MB}}{64} \Rightarrow 32 \text{ K blocks}$



For P<sub>1</sub>: A[0|0] [0|1] ... [0|7] → 8 ref  
 M H ... H → 1 miss 7 hits

8 ref → 1 miss  
 256K ref → x  
 $x = \frac{256K}{8} \Rightarrow 32K$  miss (worst case)

For P<sub>2</sub>: A[0|0] [1|0] [2|0] ... [7|0] → 8 ref → 8 miss  
 M M M ... M  
 256K ref → x  
 $x = 256K$  miss (worst case)

Consider P<sub>1</sub>, which Array element replaces A[0][5] → A[16][5]

Note: A[i][j] → A[i+16][j]

If cache is organized into 2 way set assoc then A[0][4] is occupied with.

$$\text{Sets} = \frac{1024}{2} \approx 512 \quad \text{Bo}$$

A[8][4] → 8 \* 64 → 512 ✓  
 A[16][4] → 16 \* 64 → 1024  
 A[4][4] → 4 \* 64 → 256  
 A[0][4] → 4 \* 64 → 256.

$\times \text{MOD } 512$

0 512 1024 1536

### Performance Analysis of main memory:

The main memory is divided into modules and info is distributed across the modules so that the data transfer rate is more and thereby increasing the performance

This distribution of info is called Interleaving

Higher order → Improves spatial locality  
 Lower order → fast response.

- The modules are accessed in parallel ie (Internal oper's can be overlapped in time)

### Lower order interleaving:



⇒ Faster Access

## Higher order interleaving:



∴ spatial locality  
Page faults reduced

## Memory Expansion:

- Num of Basic capacity IC's =  $\frac{\text{Target Capacity}}{\text{Basic Size}}$
- Resultant memory is organized in rows & cols.
- For  $2^m$  rows we require  $m \times 2^m$  decoder is used to provide enable lines.

### Memory Expansion



## D-Ram Cells:

- D-Ram cells are arranged in 2D (R & C)
- In one refresh entire row get refreshed  
ie with m-rows and row refresh time t<sub>r</sub> then

$$\text{Refresh overhead} = mt_r$$

- Let refresh-time cycle is 'T' then T-mt<sub>r</sub> is used for memory read/write operations.



$$\therefore \text{Time used for refresh} = \frac{mt_r}{T} \times 100$$

$$\therefore \text{time used for R/W op's} = \frac{T - mt_r}{T} \times 100.$$

## RAID organization:

- Redundant array of independent disks
- It is adopted for secondary memories to improve performance
- The secondary mem is divided into Independent disks and can be accessed in parallel  
Thus Raid is increasing data transfer rate and also provide fault tolerance
- Raid-0: Improve data transfer rate  
No fault tolerance
- Raid-1: Disk mirroring  
1 disk failure tolerant
- Raid-2: Hamming code  
1 disk failure tolerant
- Raid-3: Parity disk  
1 disk failure tolerant

Raid-4 : interleaving parity disk (Bytelevel)  
1 Disk failure tolerant

Raid-5 : Interleaving parity disk (Blocklevel)  
1 Disk failure tolerant

Raid-6 : 2 parity disks  
2 disk failure tolerant



| Raid | Organization                               | Remarks                                                                                                                                                                                  |
|------|--------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 0    | $1, 2, \dots, n$                           | $D = N \times (\text{Raid-0})$<br>- No fault tolerance                                                                                                                                   |
| 1    | $1, 2, \dots, n$<br>$1', 2', \dots, n'$    | $D_{\text{read}} = N \times (\text{Raid-0})$<br>$D_{\text{write}} = \frac{N}{2} \times (\text{Raid-0}) \Rightarrow \frac{\text{Raid-0}}{2}$<br>- Disk mirroring<br>1 Disk fault tolerant |
| 2    | $1, 2, \dots, n$<br>$c_1, c_2, \dots, c_k$ | $D = \frac{N}{N+k} \times (\text{Raid-0})$<br>- Hamming Code<br>- 1 disk fault tolerant                                                                                                  |
| 3    | $1, 2, \dots, n$<br>Parity disk.           | $D_R = \text{Raid-0}$<br>$D_W = \frac{N}{N+1} \times (\text{Raid-0})$<br>- 1 disk fault tolerant<br>- parity disk                                                                        |
| 4    | $1, 2, \dots, n$                           | - Byte level parity interleaving<br>- 1 disk fault tolerant                                                                                                                              |
| 5    | $1, 2, \dots, n$                           | - Block level parity interleaving<br>- 1 disk fault tolerant                                                                                                                             |
| 6    | $1, 2, \dots, n$<br>$c_1, c_2$             | - 2 parity disks<br>- 2 disk failure tolerant<br>- Not yet implemented<br>- More statistics reqd                                                                                         |

Note: Block level interleaving is better than byte level interleaving

If Block size = Disk size, then Raid-5  $\Rightarrow$  Raid-3

### Instructions, Addressing Modes and Instruction Pipeline

Instruction: Basic Component in a program

|         |        |
|---------|--------|
| Op Code | Op Ref |
|---------|--------|

→ specifies len of instruction  
tell JTS microprogram in mem

Note:

JTS Arch Contains  $2^n$  JTS  
where  $n = \text{opcode bits}$

## Classification of Instructions

Based on opcode

- Data transfer
- Arithmetic Ins
- Logical Ins
- Conditional Ins
- Iterative Ins

(special ins)

MC control Ins

deals with logic rather than data

Based on opref:

num of ref

### 4-Address Ins

- No Pgm Counter
- SELF Seq. Ins
- Ex:- ADD 100, 200, 300, 400
- $\nwarrow$  Addref nextans
- $\nwarrow$  Simplicity
- $\nwarrow$  lengthy

### 3-Address Ins

- Pgm Counter
- Ex:- ADD 100, 200, 300
- $\nwarrow$  Simplicity
- $\nwarrow$  Costly to Imp

### 2-Address Ins

- pgm Counter
- Ex:- ADD 100, 200
- $\nwarrow$  Small len ins
- $\nwarrow$  Mem overwriting
- $\nwarrow$  Costly to Imp

### 1-Address Ins

- Pgm Counter
- Accumulator
- Ex:- ADD 100  
Res stored in Acc
- $\nwarrow$  No mem overwriting
- $\nwarrow$  Costly to Imp

### 0-Address Ins

- Pgm Counter
- Accumulator
- Stack pointer
- Ex:- ADD
- $\downarrow$
- Stack addr ins
- $\boxed{\text{pop}} + \boxed{\text{pop}} + \boxed{\text{push}}$

### Note



Num of ins at T<sub>4</sub> N =

$$\begin{aligned} \frac{\text{Num of ins}}{\text{sacrificed}} &= N + C \cdot 2^{10-6} + b \cdot 2^{10-4} + C \cdot 2^{10-4} = 2^{10} \\ &= N + C \cdot 2^4 + b \cdot 2^6 + C \cdot 2^6 = 2^{10} \end{aligned}$$

## Instruction pipeline:

### 1-stage Pipeline



### 2-stage Pipeline



### 3-stage Pipeline



### 4-stage Pipeline



### 5-stage Pipeline



If num of stages  $> 6$  then Pipeline off is more. So,

Optimal stages = 6



## Types of Pipeline

### Synch pipeline



→ Handshake.

### Asynch pipeline



## Types

### Linear pipeline

- Feed forward pipeline
- Moves only in forward direction



### Non-linear pipeline

- Feed forward and feedback pipeline
- Reservation table : which stage is using desired stage

|    | 1 | 2 | 3 | 4 | 5 |
|----|---|---|---|---|---|
| S1 | X |   |   |   | X |
| S2 |   | X |   | X |   |
| S3 |   |   | X |   |   |

Clock



## Performance of Instruction pipeline:

- Given by speedup factor  $(s) = \frac{\text{Time without pipeline}}{\text{Time with pipeline}}$

- The Pipeline produces optimal performance if the ins are Independent.

Ideal case for Speed up factor = Num of stages ( $K$ ) ( $n=100\%$ )

- For  $n$  Instructions the  $k$ -stage pipeline will consume  $K+n-1$  clocks.

- The clock period of pipeline =  $\max(S_i \text{ delay}) + \text{Buffer time}$

- For the first ins, Non pipeline system will give better performance than pipelined system. However, on Avg the pipeline system will give better performance than Non pipeline system

- In General, the pipeline is expected to complete 1 ins/clk on average

$CPI_{avg} = 1$

- The pipelining will not reduce individual ins time (The ins has to do same amt of work whether it is pipelined or non-pipelined)

- The **dependencies** Among ins require reorganization of ins pipeline and reduces the performance of ins pipeline

## Dependencies

### Data dependency

Caused by  
Arithmetic instructions

↓ sol

Ins rescheduling  
Operand forwarding

'K' clocks  
per ins



### Control dependency

Caused by  
Conditional instructions

↓ sol

Delay load  
Prediction dependencies

n-1 cks.

sem ins

### Structural dependency

Caused by  
Resource limitation

↓ sol

Resource duplication

Speed up factor =  $\frac{n * K * \text{clocks}}{K+n-1 \text{ per ins}}$  (avg)

If  $n \gg K$  then

$$s = \frac{nK}{n} \Rightarrow K \text{ (ideal)}$$

$$CPI_{avg} = \frac{T_n}{n} = \frac{K+n-1}{n} \Rightarrow \frac{1}{n} = 1 \text{ (ideal)}$$

## Note:

For certain problems

Total CLKs need to

complete a process

$$= \text{Num of CLKs for 1st ins} + \text{Sum of CLKs of unique stage except 1st ins}$$

## Dependencies / Hazards



### ① Structural dependency:



T1:

|    |    |     |    |
|----|----|-----|----|
| IF | OL | ALU | OS |
| IF | OL | ALU | OS |

$S_2$ :

|    |    |     |    |
|----|----|-----|----|
| IF | OL | ALU | OS |
| IF | DL | ALU | OS |

Resource Replication

Still Conflict possible

↓ sol  
reorganization of stages

Many reasons are possible for structural dependency

- Due to registers
- Due to cache

### ② Control dependency:

caused due to (i) Conditional branches

(ii) Unconditional branches

(iii) Function calls / return

(iv) Pgm Control that may effect PC



(i) flushing (clearing the contents of pipeline)

(ii) stalling. (In case of Branch Ins the processor should stall some time to know whether it is branched or not)

(iii) Nop (No operation : compiler should intelligently place Nop where the processor do work but it was not effected.)

(iv) Reschedule and rearrange of some steps in pgm that doesn't affect pgm

Branch penalty was not reduced

Branch penalty was reduced to some extent

\* Consider a program



SF Target adds known after stage 2



CPI = 2

$$\text{Avg CPI} = 0.8(1) + 0.2(2) = 1.2$$

SF Target adds known after stage 3



CPI = 3

$$\text{Avg CPI} = 0.8(1) + 0.2(3) = 1.4$$

### Banches

#### Unconditional

Jmp 200.

- Target adds known after decode phase.



#### Effective CPI :

$n \rightarrow$  Total num of ins

$p \rightarrow$  prob that a ins is conditional branch

$q \rightarrow$  prob that a cond branch is true

$m \rightarrow$  num of stages after which the target addr is known

i. Total cond ins =  $np$

Total cond ins that are true =  $npt$

Not branched ins =  $n - npq$

$$\text{Avg CPI} = \frac{nq * m + n(1-p)}{n} = mpq + 1 - pq$$

$$\text{Avg CPI} = 1 + pq(m-1)$$

Ways to avoid stall cycles in Control dependencies :

#### 1) Branch prediction and speculative execution (Guess)



→ Flushing involved of Branching taken

→ Branch always taken. (100, 200, 201, 202, 101)

→ Branch never taken (100, 101, 102, 103, 200)

#### Dynamic prediction

- bit implementation
- Branch prediction buffer Implementation

Flushed if B taken

All one prediction based  
Not 100% guaranteed

#### Bit Implementation



B.P.B Jmp



## 2) Delayed Branching:



→ Instead of wasting two stall cycles of any Jns can be moved such that the meaning of pgm will not change is called Delayed Branching.

Ex:-

|     |    |          |
|-----|----|----------|
| 98  | ST | $R_1, B$ |
| 99  | LD | $A, R_0$ |
| 100 | B2 | 200      |

Delayed Branching



Requirements

- (1) Pgm meaning shouldn't change
- (2) Smart Compiler

60% stalls can be reduced  
A/C to research

## 3) Data dependency:

### Read after write (RAW)

- True dependency

$$I_i : R_1 \leftarrow R_2 + R_3$$

$$I_j : R_4 \leftarrow R_1 + R_5$$



### Write after read (WAR)

- Anti dependency

$$I_i : R_1 \leftarrow R_2 + R_3$$

$$I_2 : \text{load } M[R_2] \quad R_2 \stackrel{20}{\leftarrow}$$

Iins are read in such a way that the loaded  $R_2$  is read by  $I_i$ .

$$D(I_i) \cap R(I_2) \neq \emptyset$$

### Write after write (WAW)

- write / output dependency

$$I_1 : R_1 \leftarrow R_2 + R_3$$

$$I_2 : R_1 \leftarrow R_5 + R_6$$

Actual



$$R(I_i) \cap R(I_j) \neq \emptyset$$

### Operand forwarding: (Resolves RAW conflict)

- short circuiting



$$I_1 : R_1 \leftarrow R_2 + R_3$$

$$I_2 : R_4 \leftarrow R_1 + R_5$$

due to  
operand forwarding



- To resolve WAR WAW ] Register renaming

### Register Renaming :

| Temp reg's                       | (R1) ← R <sub>2</sub> * R <sub>3</sub>           | 10 5                 |
|----------------------------------|--------------------------------------------------|----------------------|
| 50                               | R <sub>1</sub> ← R <sub>2</sub> * R <sub>3</sub> |                      |
| 7                                | R <sub>1</sub> ← R <sub>4</sub> + R <sub>5</sub> |                      |
| T <sub>3</sub>                   |                                                  | (WAW)                |
| T <sub>4</sub>                   |                                                  |                      |
| T <sub>5</sub> (R <sub>3</sub> ) | R <sub>1</sub> ← R <sub>3</sub> * R <sub>4</sub> |                      |
| T <sub>6</sub> (R <sub>3</sub> ) | R <sub>3</sub> ← m[100]                          |                      |
| ;                                |                                                  | (WAR)                |
| ;                                |                                                  |                      |
| 100                              |                                                  | R <sub>3</sub> [100] |
| 5                                |                                                  |                      |
| ;                                |                                                  |                      |

### Machine Instructions and Addressing modes :

(WHR)

## Machine Instructions and Addressing modes:

Programmers

Developers



Ins set / Machine Ins



Instruction:

sequence of bits



## Types of CPU organizations:

- Single accumulator organization
- General register organization
- Stack organization

(Addx) (InCA)

- 1 Addr Ins, 0 Addr Ins

(Add R<sub>1</sub>, R<sub>2</sub>) (Add R<sub>1</sub>, R<sub>2</sub>, R<sub>3</sub>)

1 Addr Ins  
(push x)

- handles 1 Addr Ins

ADD X → AC ← AC + M[X]

- handles 0 Addr Ins

INCA → AC ← AC + 1



(Architecture)

## Stack organization:



## General register organization:



- handles 2 Addr Ins

ADD R<sub>1</sub>, R<sub>2</sub> → R<sub>1</sub> ← R<sub>1</sub>+R<sub>2</sub>

- handles 3 Addr Ins

ADD R<sub>1</sub>, R<sub>2</sub>, R<sub>3</sub> → R<sub>1</sub> ← R<sub>2</sub>+R<sub>3</sub>

## Addressing modes :

- To give programmers facilities such as pointers, counters for loops control, indexing of data, program relocation
- To reduce the number of bits in the addressing field of the instruction

### Modes :

#### 1) Implied mode:

- Implicit
- No need to specify anything

Ex:- **INCA**

increments Acc value

#### 2) Immediate mode:

- Deals with constants

Ex:- LD 5 → AC=5  
ADD LO → AC=AC+10

#### 3) Register mode:

- short instructions because only reg's are mentioned

Ex:- Add R<sub>1</sub> → AC ← AC + [R<sub>1</sub>]

#### Note :-

n reg in reg mode → operand bits  $[log_2 n]$

n reg in reg mode → maximum size represented  $\frac{n^1}{2} - 1$   
(2's comp)

#### 4) Register Indirect mode:

Effective Addr: It is the addr in the memory where the operand Addr is present

Ex:- ADD R<sub>1</sub> → R<sub>4</sub>  
AC ← AC + M[R<sub>1</sub>]  
R<sub>1</sub> | 300 |  
R<sub>2</sub> | 301 |  
R<sub>3</sub> | 302 |  
R<sub>4</sub> | 303 |  
Mem | operand |

#### 5) Auto Increment

#### Auto decrement mode:



#### Jump

To increment / decrement we need not go to ALU or CPU because registers are made from flip flops. They can perform Inc, dec, load without ALU

#### 6) Direct Addressing mode:

##### Absolute Addressing mode

Ex:- Add X → AC ← AC + M[X]

Eff Addr: X

#### 7) Indirect Addressing mode

Ex:- ADD X  
X | 300 | 1100 |  
| 301 | 500 |  
| 302 | 000 |  
| 303 | 000 |  
Mem | value |

- ∴ 2 Memory Access
- ∴ far parameter
- implement pointer

#### 8) Relative Addressing mode:

- To go for next ins how many addr should be ascended / descended

Eff Addr: PC + A



Used for program relocation

#### 9) Indexed Address mode:

- Index reg is present



only one instruction to access any array index

- used to access array

#### 10) Base register method

##### Base offset register method / Base offset method

- used in multiprogramming environment (Context switching, program relocation)

#### Problem :-



Eff Addr: Base + offset

## Note :

- Indirect addressing
- Immediate addressing
- Auto decrement addressing
- Program relocation - Based Addressing
- Relative Addressing
- Indexed addressing - Array Implementation

## Types of Instructions

### Data transfer Instructions



### Data manipulation



### Program Control Instructions

- Decisions Implementation
- if
- for
- while

### Data transfer Instructions : (load, store, exchange, input, output, push, pop)



### Data manipulation Instructions

#### Arithmetic Instructions:

- 1) Increment
- 2) Decrement
- 3) Add <sub>int</sub> <sub>float</sub> (Precision is present)
- 4) subtract
- 5) multiply
- 6) Divide
- 7) Add with carry
- 8) sub with borrow
- 9) Negate (2's comp)

#### logical Instructions:

- 1) clear (All reg bits change to 0)
- 2) complement
- 3) AND
- 4) OR
- 5) EX-OR (odd bit detector). (Mod 2 Arithmetic)
- 6) clear carry
- 7) set carry
- 8) Complement carry
- 9) Enable Interrupt
- 10) Disable Interrupt

### shift Instructions

- 1) logical shift right
- 2) logical shift left
- 3) Arithmetic shift right
- 4) Arithmetic shift left
- 5) Rotate right
- 6) Rotate left
- 7) Rotate right through carry
- 8) Rotate left through carry

### logical shift right and left :



### Arithmetic shift left : (2's comp)

sign bit is conserved



### Arithmetic shift right



- Same as logical shift



## Numerical Examples on flags :

QF Signed representation

$$A = 1100 \quad (-4)$$

$$B = 0010 \quad (2)$$

Now, perform  $A - B$

$$\begin{array}{r} A = 1100 \\ + 2's B = 1110 \\ \hline 11010 \end{array} \quad (-6)$$

$$\text{PSW} \quad \begin{array}{c} Z \ V \ S \ C \\ 0 \ 0 \ 1 \ 1 \end{array} \quad ; S = \text{MSB} \quad C = \text{cout}$$

$$V = C_{in} \oplus C_{out} = 1 \oplus 1 = 0$$

$$V=0 \quad S=1 \quad Z=0 \quad ACB$$

QF unsigned representation :-

$$A = 1100 \quad (12)$$

$$B = 0010 \quad (2)$$

$A - B$

$$\begin{array}{r} A = 1100 \\ - B = 0010 \\ \hline 1010 \end{array} \quad (10)$$

$$\begin{array}{c} Z \ V \ S \ C \\ 0 \ 0 \ 1 \ 1 \end{array}$$

$$Z=0 \quad C=1 \quad A > B$$

Note :-

QF signed

| V | Z | S | Res               |
|---|---|---|-------------------|
| 1 | X | X | Overflow (in dat) |
| 0 | 1 | X | $A = B$           |
| 0 | 0 | 0 | $A > B$           |
| 0 | 0 | 1 | $A < B$           |

QF unsigned

| Z | C | Res     |
|---|---|---------|
| 1 | X | $A = B$ |
| 0 | 0 | $A < B$ |
| 0 | 1 | $A > B$ |

## Call and Return program control instruction:



(iii) Store it in fixed memory  
Instead of subroutine memory.  
(iv) use memory stack → Best sol

Passes in  
self recursion.

fails in recursion

### Memory Stack :



only Addr are stored

No parameter  
passing

Modifications  
to memory stack



$M[sp] \leftarrow PC$   
 $Sp++$   
 $PC \leftarrow \text{Eff Addr}$

$Sp--$   
 $PC \leftarrow M[sp]$

return

### Note :-

Call & return - Ret Addr, parameters are stored  
Context switching - Ret Addr, Content in Regs are stored

(PSW)  
(Pgm status word)

## Interrupts :

Difference b/w subroutine & interrupt

### Subroutine

- Transfer of control is initiated by 'call'  
instruction

- pgm counter is saved on the stack

### Interrupt

- It is initiated because of a signal generated

① External :

(I/O devices, Timer devices  
Power supply)

③ Internal

divide by 0, Reg overflow  
- protection violation, stack

② Slow Interrupt

(System call)

- PC, PSW, processor reg's  
is saved on the context

- Effective Addr is computed by call and placed in PC.
- Eff Addr is given by H/w and placed in PC

## CISC and RISC

Depending on Type of Ins computers

CISC  
RISC

## Assembly language Tracing:

It is about calculating the contents of the registers, memory locations, and finding the space and time requirement.

One or more key Instructions are used to decide the flow of execution and hence the effect of these key Ins on flags have to be carefully analyzed.

### Note :-

\* Return Addr of HLT is address of halt only

## Register Allocation :

- If the Info is available in Registers then the system consume less Computational Time.
- However num of reg available in the processor are limited.
- Register Allocation is a process of assigning Registers to the variables and reusing for the subsequent allocation process.

### Register Allocation

#### Local register Allocation

Allocation consider block as a unit

#### Global register Allocation

Allocation consider program as a unit

Obj of Reg Allocation : Minimizes the num of spills  
Maximizes the resource usage

## Belady register Allocation :

Data structure : Allocated list , Free list

Algorithm : If a variable request a register

(i) If there is a free register Allocate it

(ii) Else

    find a reg that contains deadvariable

    if dead var found

        release it & Allocate it

(iii) Else

    Identify the reg whose live var is used after long time then

        a) spill it (Save it in memory)

        b) Allocate it.

## Global register Allocation

- Uses Graph Colouring approach.

$$\text{Ex:- } (a - b) + (c - (d + e))$$



Num of seg's = 3 (min)

### Micro programmed Control Unit:

- Binary pattern of control signals stored (decoded | Encoded) in control memory
- Control signals are generated after accessing control word and tiles using

$$T_{cg} = T_{cm} + T_{hw}$$

### Horizontal micropaging:



#### Note :

n control signals  $\rightarrow$  n bits

- No need of external decoders
- Degree of parallelism is high
- $DOP_{max} = \text{num of control sig}$

∴ control word is too long  
control memory is not efficiently used

Advantages

↓  
Diagonal micropaging

### Vertical micropaging:

obj: To reduce length of control word.



#### Note :

n control signals  $\rightarrow \lceil \log_2 n \rceil$  bits

∴ More flexible than HMP  
reduces length of control word

∴ Requires external decoder

- Consumes more time than HMP
- Degree of parallelism is low

$$DOP_{max} = 1$$

- Faster clock rates

## Diagonal Mapping :-

- It explores the relationships among the control signals.
- The control signals are divided into groups such that each group maintains mutually exclusive control signals.
- Encoding is done group wise.
- $DOP_{max} = \text{Total num of groups}$
- Below the groups it follows H/up, within the groups it follows V/up.



$\Rightarrow$  Num of bits = 27 bits/cw

$DOP_{max} = 6$

↑ DOP Increased  
Num of bits reduced.

## Control word Sequencing :

- Unless Control words are accessed one after the other, micro operations are not getting implemented and hence instruction is not getting implemented. Thus Control word Sequencing is aimed to provide address of next control word to be accessed.



How to send Next cw ??

Sol 1 : A programmed program counter - Maintenance will become difficult

Sol 2 : 

|      |    |
|------|----|
| CS   | NW |
| Addr |    |

 - Conditional stmt failed here



↓ sol  
Add 2 CW Addr fields

- length of cw increases
- which address field should be accessed is not known.



## Architecture:



INC : Next Addr  
LD : Target Addr

M.V. woframe  
Up model

for ex:-

|       |     |                         |
|-------|-----|-------------------------|
| 0 0 0 | INC | - Nop (normal nop)      |
| 0 0 1 | JNC | - cy = 0                |
| 0 1 0 | JC  | - cy = 1                |
| 0 1 1 | JNZ | - Z = 0                 |
| 1 0 0 | JZ  | - Z = 1                 |
| 1 0 1 | JP  | - S = 0                 |
| 1 1 0 | PM  | - S = 1                 |
| 1 1 1 |     | unconditional branching |

Design of BSL



### Nano programmed Control unit design

- Uses two level Control memory

micro control memory

Used for CW sequencing

nano controlled memory

control Signal generation

- maintains distinct control signals pattern and can be reused by maintaining add in cw of nano memory
- follows H/w

APPS: boot strap loader  
firm ware based apps.

More flexibility

due to two control memories it is the slowest control unit design



$$\begin{aligned} \text{Control memory size} &= \text{LcM} + \text{Ncm} \\ &= H_M [\log_2 H_m + \log_2 H_N] + H_N (P) \end{aligned}$$

$$T_{CS\ Gen} = T_{LcM} + T_{Ncm} + T_{H/w}$$

Points to remember:

| Characteristic    | CU design in Ascending order of Performance. |     |     |           |
|-------------------|----------------------------------------------|-----|-----|-----------|
| Speed             | Nano                                         | V/w | D/w | H/w       |
| Response time     | H/w                                          | H/w | D/w | V/w, Nano |
| Flexibility       | H/w                                          | H/w | D/w | V/w, Nano |
| Control mem space | H/w                                          | V/w | D/w | Nano H/w  |

speed  $\propto$   $\frac{1}{\text{Resp time}}$   $\propto$   $\frac{1}{\text{flexibility}}$

## I/O organization



### Issues

- 1) Addressing
- 2) Data transfer techniques

CPU and device is not directly connected because

- (i) Speed mismatch
- (ii) Different device driver
- (iii) Different dataformat used by devices.

$$1 \times M < \text{Top} < \text{IOC}$$

### Capability

| Interface          | Module                       | Top               | IOC                                                                                                                                                                                                                               |
|--------------------|------------------------------|-------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| - No Error control | Error control (Hamming code) | Implement I/O Ops | <ul style="list-style-type: none"> <li>- consists of Top as one element</li> <li>- Types           <ul style="list-style-type: none"> <li>Selector channel → selector channel</li> <li>Multiplexer channel</li> </ul> </li> </ul> |

### Selector channel:



### Multiplexer channel:



### Note : I/O channel



## Interface (USART)



Convert  
Serial to // / to serial

### CPU to device :



### Device to CPU :



### Parallel (8255 PPI)

Programmable peripheral Interface

We have to tell about device is S/p or C/p, about mode of operation

Control Information  
(should be provided using registers)

## Addressing of I/O :

### Goals of I/O :

- (i) Flexibility - Memory mapped I/O
- (ii) Expendability - I/O mapped I/O

## Addressing of IO

### Memory mapped IO

- ∴ Mem Addr and IO Addr are distinct
  - No need of control line (IO/M)
  - All mem ins extending to IO
  - More flexibility bcz it supports direct, indirect, immediate --- etc

### Mapping:



- ∴ Changes in IO space effects mem space

### Δ IOs effect MS

### IO mapped IO

- ∴ Δ IOs doesn't effect ms
  - More expandability due to independent IOs

### Mapping:



- ∴ If Addr = 80

sys doesn't know whether it is mem space or IO space

↓ sol

control line (IO/M) is introduced

- separate IO ins are needed (IN, OUT --)
- less flexible

## Data transfer techniques:

obj: data transfer tech are aimed to move data between device & processor.

### Types of data transfer tech

#### Pgm driven data transfer:

- processor controlled transfer
- ∴ Simplicity
- ∴ More waiting time wastes CPU cycle.



#### Interrupt driven data transfer

- device controlled data transfer
- ∴ Efficiently uses CPU than DMA
- ∴ Imp Comp is more.



#### DMA (Direct Mem Access)

- In DMA, system bus is shared b/w processor & DMA controller
- Based on when system bus is seized the DMA open mode can be
  - 1) cycle stealing - more waiting for DMA controller
  - 2) Burst - more waiting to CPU
  - 3) Block transfer

Adv of. (Cs + B)

- DMA Controller (8257 | 8237)

### Cycle stealing mode:

$$\% \text{ time CPU blocked in cycle stealing mode} = \frac{tx}{tx+ty} * 100\%$$

tx = word transfer time  
 ty = word prep time

The sys bus is returned after each word transfer and acquired after each I/Os cycle

### Burst mode:

$$\% \text{ time processor active contributing in burst mode} = \frac{tx}{tx+ty} * 100\%$$

tx = initialization & termination time

ty = data transfer time

Sys bus is returned only after entire data transfer is completed

### Block transfer mode:

- In this mode processor initially fills count reg & addr reg of DMA
- On each word transfer count decremented & addr incremented
- System bus is exchanged whenever count is zero

→ Burst mode + cycle stealing

Within block → Burst mode

B/w blocks → Cycle stealing

## In cycle stealing mode



Difference b/w INTR and DMA req:



## Disk:

- Semi Random access memory
- Seek time is Random Access and Rotational latency is Serial access
- Based on how data is placed on the tracks the disk construction can be:
  - (i) Constant capacity disks
  - (ii) Variable track capacity disks
- The unit of transfer in disks is 1 sector
- Disk is addressed sector wise.

## Disk Address



Time taken to Access a sector :

$$T_{\text{Access}} = T_{\text{Seek}} + T_{\text{Rot}} + T_{\text{DT}}$$

Note: If position of sector is not known then Avg Rot latency is taken

$$\text{Avg Rot Latency} = \frac{R}{2}$$

### Note:

$$\text{Capacity of the disk} = TSB$$

T = total num of tracks

S = num of sectors / track

B = Num of Bytes / sector



## In Block transfer mode

$$\begin{aligned} \text{Req} &= 16 \text{ bits}, \text{ Data} = 29,154 \text{ KB}, \text{ Byte Addressable} \\ \text{Min num of DMA req} &= 16 \\ \text{CR} &= 64 \text{ KB} \\ &\leftarrow 16 \rightarrow \\ 1 \text{ DMA req} &\rightarrow 64 \text{ KB} \\ x \text{ req} &\rightarrow 29,154 \text{ KB} \\ x &= \frac{29,154 \text{ KB}}{64 \text{ KB}}, \text{ USB 2eq} \end{aligned}$$

## In cycle stealing mode.

In cycle stealing mode, 1 cycle is stolen so that we transfer ~~1B~~ 1B (Byte Addressable)

$$\begin{aligned} 1 \text{ DMA req} &\rightarrow 1 \text{ B} \\ x \text{ req} &\rightarrow 29,154 \text{ KB} \\ x &= 29,154 \times 1024 \end{aligned}$$

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

1024 = 1024 \* 1024

The linear address of a sector in disk system =  $C * m + H * n + S$

 $m = p * n \quad p \rightarrow \text{num of sec surfaces}$ 
 $m \rightarrow \text{num of sectors/cylinder}$ 
 $n \rightarrow \text{num of sectors/surface}$

- The Address bits reqd for a sector in disk system



Note:

$\text{capacity of a disk system } (C_{DS}) = P T S B$

Note:

Num of sectors are same in every track

- \* - Angular velocity
- Variable recording density
- **Const track capacity**
- recording density

max  
(inner track) min  
(outer track)

dist b/w bits  
is same

- Const recording density
- **Variable track capacity**

- \* - Linear velocity

dist b/w bits  
is not same



Hard disk:



72 sectors

Note:

Num of tracks = num of cylinders

$P \rightarrow \text{num of surfaces}$

$T \rightarrow \text{num of tracks}$

$S \rightarrow \text{num of sec/track}$

$B \rightarrow \text{num of Bytes/Sec}$

- Sectors are addressed cylinder wise

Logical Address of sector 0 :  $(cy_0, sur_0, sect_0)$

$\langle C, H, S \rangle \Rightarrow \langle 0, 0, 0 \rangle$

Logical Address of sector 71 :  $(cy_2, sur_5, sect_3)$

$\langle C, H, S \rangle \Rightarrow \langle 2, 5, 3 \rangle$

logical addr to linear addr

Linear Address of sector 71

$LA = C * m + H * n + S$

$m = p * n$

$m = 24 \text{ sect/cylinder}$

$n = 4 \text{ sect/surface}$

$LA = 2 * 24 + 5 * 4 + 3$

$= 71$

Linear Addr to logical Addr

$m) LA (C$

$\overline{n) 2em+ (H}$

$\overline{2em \rightarrow S}$

$24) 71 (2 \rightarrow C$

$48$

$4) 23 (5 \rightarrow H$

$20$

$3 \rightarrow S$

$\langle 2, 5, 3 \rangle$

## Number Representation:

- Positional weighted system
- Value Expr & Range  $[V_{\min} \text{ to } V_{\max}]$   
 $\quad$  [underflow | overflow]



Note :

| Number   | Value Expr                                         | $V_{\min}$ | $V_{\max}$   |
|----------|----------------------------------------------------|------------|--------------|
| Integer  | $V = \sum_{i=0}^{n-1} a_i * 2^i ; a_i \in \{0,1\}$ | 0          | $2^n - 1$    |
| Fraction | $V = \sum_{i=1}^n a_i * 2^{-i} ; a_i \in \{0,1\}$  | 0          | $1 - 2^{-n}$ |

Resolution of fraction =  $2^{-n}$

## Floating point representation:

- Floating point nums denoted with Mantissa, Exponent

- Rep 1: Normalized sign magnitude fraction for mantissa  
(Biased sign exponent)



Biased Exponent with k bits we use Bias  $2^{k-1}$  value of expr

### Value Expr

$$V = (-1)^S (0.M)_2 * 2^{E-\text{Bias}}$$

(default)

$$V = (-1)^S (1.M)_2 * 2^{E-\text{Bias}}$$

(Explicit)  
(if mentioned)

(Implicit)

### Calculating Bias

$$\begin{array}{l} e \\ \text{(k bits)} \end{array} : \begin{array}{l} -2^{k-1} \\ \text{to} \\ 2^{k-1} \end{array}$$

Bias  $\oplus$

$$\begin{array}{l} E \\ \text{(k bits)} \end{array} : \begin{array}{l} 0 \\ \text{to} \\ 2^k - 1 \end{array}$$

↓  
(Biased exponent)

$$\begin{array}{l} \text{True Exponent} \rightarrow e \\ \text{Biased Exponent} \rightarrow E \end{array} \quad \left. \begin{array}{l} \rightarrow E = e + \text{Bias} \\ ?? \end{array} \right.$$

### Limitations of Normalized mantissa:

- Zero can't be represented
- Rep. Expose to overflow & underflow problems

↓ sol  
IEEE 754 floating point standard

### Rep 2: IEEE 754 floating point standard:

- IEEE 754 gives provision for storing special values  $(0, \pm\infty)$

Refer prob in  
Notes pg - 63

mantissa bit should always start with 1

XX can't be rep

-0.5 0 0.5

### Types

↓  
single precision (32 bits)

↓  
double precision (64 bits)

- The floating point num can be stored in Implicit form or in Fractional form
- Certain Mantissa Exponent pattern doesn't denote any number (NaN)
- This specification uses base 2, and Exponent in biased form

Single precision method (32 bits)

$$\text{Value Expr (V)} = (-1)^S (1.M) \times 2^{E-127}$$

Note: For fraction  $E=0$ .

$$V = (-1)^S (1.M) \times 2^{0-127}$$

$$V = (-1)^S (1.M) \times 2^{-126}$$

Note:

| S(1) | E(8)                 | M(23)              | value (V)                                       |
|------|----------------------|--------------------|-------------------------------------------------|
| 0/1  | 0000 0000<br>$E=0$   | 000-----0<br>$M=0$ | $\pm 0$                                         |
| 0/1  | 1111 1111<br>$E=255$ | 000-----0<br>$M=0$ | $\pm \infty$                                    |
| 0/1  | $1 \leq E \leq 254$  | $M = XXX--X$       | Implicit<br>$V = (-1)^S (1.M) \times 2^{E-127}$ |
| 0/1  | $E=0$                | $M = XXX--X$       | $V = (-1)^S (1.M) \times 2^{-126}$              |
| 0/1  | $E=255$              | $M \neq 0$         | NaN                                             |

### Fixed point Arithmetic:

- Addition and multiplication may result overflow
  - unsigned Add<sup>n</sup> | Sub<sup>n</sup> can be extended to signed num if operands are denoted in 2's Comp form.
  - unsigned Mul<sup>n</sup> cannot be extended to signed num hence separate Algo is needed
- for fixed point Arithmetic (Booth's Algo)

### Unsigned Arithmetic:

Overflow detection: presence of carry

Correction: Add  $2^n$  to result

### Signed Arithmetic:

Overflow detection: presence of carry in sign pos

$$Z = X + Y$$

In 1's Comp  
2's Comp

$$2S + (X_S - Y_S) = V$$

$$\text{Cin} (+) \text{ Cout} = V$$

To detect of

Correction: - Compliment sign bit of result

- Add  $\pm 2^{n-1}$  (depends upon sign bit)

Note: when operands of same sign are added. if the result is having opposite sign then the Arithmetic opern denote overflow

| $2S$ | $X_S$ | $Y_S$ | $V$ |
|------|-------|-------|-----|
| 1    | 0     | 0     | 1   |
| 0    | 1     | 1     | 1   |

$$V = 2S X_S Y_S + 2S X_S Y_S$$

## Booth's Algorithm :

- It provides faster fixed point signed multiplication
- Uses Arithmetic right shift process.
- The sign bit is protected during the computation
- Algorithm shifts the partial sum and hence there is no need to store partial product
- The Algo supports reusing of multiple register and partial sum register for storing the result.

## Booth's Notation :

- It is used to denote signed numbers. It is similar to that of 2's comp notation.
- The multiplier ( $Q$ ) can have one or more complete blocks of 1's and at most one partial block of 1's.

Ex:-

| Num                                                                                                                             | Complete Block | Partial Block | Add <sup>n</sup> | sub <sup>n</sup> |
|---------------------------------------------------------------------------------------------------------------------------------|----------------|---------------|------------------|------------------|
| 00 <span style="border: 1px solid red; padding: 2px;">1110</span> <span style="border: 1px solid red; padding: 2px;">111</span> | 2              | 0             | 2                | 2                |
| 000 <span style="border: 1px solid red; padding: 2px;">111111</span>                                                            | 1              | 0             | 1                | 1                |
| <span style="border: 1px solid red; padding: 2px;">111111000</span>                                                             | 0              | 1             | 0                | 1                |
| <span style="border: 1px solid red; padding: 2px;">111</span> 000 <span style="border: 1px solid red; padding: 2px;">111</span> | 1              | 1             | 1                | 2                |

For Complete block : 1 Add + 1 sub  
For partial block : 1 sub

## Booth's recording pattern :

- Used to find num of arithmetic & num of shift operations.

Pattern :

|   |   |    |  |
|---|---|----|--|
| 0 | 0 | 0  |  |
| 0 | 1 | -1 |  |
| 1 | 0 | 1  |  |
| 1 | 1 | 0  |  |

We should move from right to left

Ex:-

|    |    |   |   |   |   |   |
|----|----|---|---|---|---|---|
| 1  | 1  | 0 | 0 | 1 | 1 | 0 |
| AB | CB |   |   |   |   |   |

Pattern : 00 - 101000 - 1  

$$\begin{array}{ccccccc} 0 & 0 & 1 & 0 & 1 & 0 & 0 \\ \hline 7 & 6 & 5 & 4 & 3 & 2 & 1 & 0 \end{array}$$
 pos

Arithmetic : 1CB - 1Add, 1sub  

$$\begin{array}{c} 1PB \\ - 1 \\ \hline 1Add, 2sub \end{array}$$

Shift operations : 0 + 3 + 5 = 8 ✓

## Ex:- Booth's recording pattern for -54

$$-54 \Rightarrow 2's \text{ Comp}$$

$$\Rightarrow -64 + 8 + 2 \Rightarrow \boxed{10010000}, 0$$

$$\begin{array}{r} -101-11-10 \\ \hline 65\ 43\ 21\ 0 \end{array}$$

$$\text{shift oper's} \Rightarrow 1 + 2 + 3 + 4 + 6$$

$$\Rightarrow 16$$

$$\text{Arithmetic} \Rightarrow 2CB \Rightarrow 2Add, 2sub$$

$$\begin{array}{c} 1PB \\ - 1 \\ \hline 1Add, 2sub \end{array}$$

$$2Add, 3sub$$

Ex:-  $13 * -6 \xrightarrow{\text{0110}}$  1010 2s,

$$1101 * -11-10$$

$$\begin{array}{r} 00000000 \\ 1110011X \\ 001101XX \\ 10011XXX \\ \hline 110110010 \end{array}$$

$$\boxed{1010} 0$$

$$\boxed{-11-10} \text{ Rec pattern}$$

$$1101 \rightarrow \text{ds.}$$

$$\boxed{0011}$$

$$2's \text{ comp of } -78,$$

## Leftovers

- floating point Arithmetic
- Tape memory
- Interrupts. - DMA working

## Interrupts :

- Interrupts are highly Async (not known when to occur)
- edge triggering is preferred than level triggering because edge triggering is less sensitive towards noise

| Processor                               | Device                                                                 |
|-----------------------------------------|------------------------------------------------------------------------|
| When to recognize ??                    | When to Interrupt ?? → <u>sol</u> → Async (so any time)                |
| How to recognize ??                     | How to Interrupt ?? → <u>sol</u> → change the INTR signal at this pin. |
| What to do ??                           |                                                                        |
| How to resolve ??<br>multiple interrupt |                                                                        |

## At processor phase:

- When to recognize ?? → sol → At the end of execution of current bus cycle
- How to recognize ?? → sol → By reading INTR flag status. we can recognize interrupt  
 INTR →   
 1 → Intr raised  
 0 → Intr not raised
- What to do ?? → sol → save set address, contents and transfer control to ISR (Intr service routine)  
 where is ISR?  
 ↓  
 (a) Vector interrupt → 1) use device info  
 (b) Scalar interrupt → 2) use default addrs
- How to resolve multiple interrupts ??  
sol: use priorities.

## Interrupt service

### Pended mode

- If we store all the interrupt and serve interrupt acc to priorities

### How to assign priorities

#### Static

- priorities can't be altered
- starvation to least priority interrupts
- uses daisy chain method.  
 position → priority



#### Dynamic

- priorities can be altered
- served interrupt priority is made lower than the lowest priority interrupt
- no starvation to least priority interrupts
- uses polling strategy



### Note:

Interrupt can interrupt the system but interrupt cannot be interrupted.

### Topic 2 : How DMA Works ?



- ① Device raises DMA req.
- ② DMA Controller holds CPU for system bus (Hold).
- ③ CPU initializes AR and CR.
- ④ CPU ACK to device (HLD ACK).
- ⑤ DMA Controller ACK to device (DMA ACK).

- ⑥ Device gets data in DR.
- ⑦ DR sends data to bus.
- ⑧ Bus sends data to memory.
- ⑨ Device releases DMA req.
- ⑩ DMA releases Hold req.
- ⑪ CPU releases HLD ACK.
- ⑫ DMA releases DMA ACK.

