

A decorative banner featuring five large, bold, red letters stacked vertically: 'I' at the top, followed by 'N', 'D', 'E', and 'X' at the bottom. Each letter is enclosed in a white square frame with a red double-line border. The letters are slightly tilted to the right.

NAME: K-Growtham STD.: \_\_\_\_\_ SEC.: \_\_\_\_\_ ROLL NO.: \_\_\_\_\_ SUB.: \_\_\_\_\_

→ Here current write operation is always done in cache.

### Multiprocessor Environment

- \* Every CPU has its own cache

- \* local variable: a variable within the CPU memory is local to that CPU.

global variable: a variable which is within some other CPU



- \* Here even using write-through also can't eliminate the cache coherence problem, coz some other processor could be using the variable or data as global.

#### (i) Cache Snooping Problem:

##### (i) Update all:

All processors having an updated variable.

If a processor has updated a variable, the central system observes this updation and gets it reflected in all the caches.

→ Here the disadvantage is that after updation in other CPU's the CPU sometimes may not use the variable anymore and there is no use of this updation.

##### (ii) Selective Update:

Processors listen updation and update the shared variable only if required. It is gives better performance.

### disadvantage:

- The processor which performs updation has to broadcast the information related to the updation, the bus.
- So every processor must monitor the bus activity.

### (ii) Directory based Protocols:

Here we use an active central system which maintains a directory. Here we have 2 approaches

#### i) Full Directory based    ~~ii) Limited Directory based~~

- A k-bit entry is maintained for every variable where k is no of processors. '1' denotes the variable is being used by the corresponding processor.
- Also the value (recent one) is maintained.
- A V/I flag is also maintained.
  - If  $V/I = 1$  then the value stored in directory is valid
  - If  $V/I = 0$  " " " " " " " " invalid
- If a processor wants to update the value, then until the updation is finished  $V/I$  is set to 0 and later after updation  $V/I$  is set to 1.
- This protocol uses array for directory implementation

#### Limitations:

- more space is required.
- For large value of k, most of the bits go waste if ~~we don't~~ only few processors use the variable.

(This limitation is overcome by using ~~the~~ limited directory)

ij Lim

Ac

Di

Block

~~Hop~~

Optimal

→ It

bloc

Ad

Disad

LRU:

- Here
- Use b  
block;
- If
- Que  
are

### iii) Limited directory based Protocol

\* linked list is used for directory information.

→ if a variable is used by only ~~two~~ ~~two~~ P machines then the variable has ~~two~~ P nodes.



Advantage: less space for directory

Disadvantage: more work (more overhead)

## Block Replacement Techniques:

~~HRP~~

### Optimal strategy:

→ It gives best performance. For replacement it identifies the MM block in CM which is not needed in future for longest time.

Adv: Best performance

Disadv: difficult to know the future references.

### LRU:

→ Here most recently used block is protected from replacement. Use bits are used for the implementation. least recently used block is selected for replacement.

→ If there is ~~no~~ no hit, then it is similar to FIFO

→ Queue can be extended for implementation, however on hit contents are to be reorganized

## FIFO

- The block which entered first into the cache is selected for replacement (largest time spending)
- Here assumption is that most of the references for that ~~new~~ block might have completed
- Queue is suitable for implementation.

### FIFO

Note:

- FIFO & LRU can be used with direct & set associative mapping.
- In direct mapped cache ~~we~~ we have only one option to place any MM block into cache. ∴ we don't use any of these.

Eg: A cache of 4-blocks, the following MM block references are made by the CPU

optimal:

|      |
|------|
| 8 17 |
| 13   |
| 7    |
| 9 16 |

$$\text{total ref} = 12$$

∴ 6 misses

∴ ~~hit~~ ∴ 6 hits

$$\therefore \text{hit ratio}, h = 0.5$$

## FIFO

|        |
|--------|
| 8 16 7 |
| 13 8   |
| 7 13   |
| 9 17   |

∴ ~~misses~~

∴ misses = 9

∴ hits = 3

i.e., 50% of the optimal strategy

$$\therefore \text{hit ratio} = 3/12 = 0.25$$

## LRU:

|          |
|----------|
| 8 13     |
| 13 16 17 |
| 7        |
| 9 8      |

misses = 8

hits = 4

$$\therefore h = \frac{4}{12} = 0.33$$

i.e., 66.6% of the optimal strategy

(Q31) A 2-D array occupies 48 blocks and cache is having 32 lines.

Array is accessed 3-times. The last block of array is filled 10% only. How many cache lines are overwritten during the successive accesses and what is the minimum value denoting total no of miss references?

- a) 16, 112    b) 8, 112    c) 16, 122    d) 8, 122

Sol:



The cache blocks ~~16 to 31~~ will be brought in first array access and will be left untouched during next access.

$\therefore$  In the next two accesses access to element of block 16 to 31

will be hits i.e.,  $3 \times 16 \times 2 = 32$

total no of references to blocks =  $48 \times 3 - 32 = 112$

Also in every access the cache blocks 0 to 15 will be overwritten.

$\therefore 16$

Ans: (a)

Eg: Block replacement with 2-way set association

8, 13, 7, 8, 19, 16, 7, 13, 8, 13, 17, 7

optimal: LRU:

|       |              |                                                    |
|-------|--------------|----------------------------------------------------|
| set 0 | 8            | even<br>odd-number block are<br>mapped to this set |
|       | 16           |                                                    |
| Set 1 | 13, 9, 18, 7 | even<br>odd number block maps to this<br>set       |
|       | X 17         |                                                    |

$$\therefore \text{no of misses} = 8$$

$$\Rightarrow \text{hits} = 4$$

$$\Rightarrow \text{hit ratio} = 4/12 = 0.33$$

Eg: Block replacement with Direct mapped cache

|   |               |
|---|---------------|
| 0 | 8, 16, 8      |
| 1 | 13, 9, 18, 17 |
| 2 |               |
| 3 | 7             |

$(k \bmod 4)^{\text{th}}$  block for 10th MM block

8, 13, 7, 8, 9, 16, 7, 13, 8, 13, 17, 7

$$\therefore \text{misses} = 8$$

$$\Rightarrow \text{hits} = 4$$

$$\therefore \text{hit ratio} = 4/12 = 0.33$$

- Q32) A 4-block cache is organized in 2-way set associate mapping. It is empty initially. LRU is used for block replacement. For the following MM ref what will minimum no of faults

8, 0, 11, 2, 18, 0, 12

|       |          |
|-------|----------|
| set 0 | 8, 12, 0 |
| set 1 | ∅, 8, 12 |
|       |          |

$$\therefore \text{no of faults} = 6$$

Method

For

hit

Main

i) Me

ii) DR

iii) M

Memory

→ Constr

Eg:

Q33 4-way set associative, 16-block cache.

MM ref: 0, 255, 1, 4, 3, 8, 133, 159, 216, 129, 63, 8, 48, 32, 73, 92, 155

which MM block will not be present in cache with LRU

Sol: a) 8 b) 129 c) 216 d) 3

remainder

Ref

0 0, 4, 18, 216, 8, 48, 32, 92

1 1, 133, 129, 73. (all will be present)

2 —

3 255, 3, 159, 63, 155 → 255 will not be present

|   |        |
|---|--------|
| 0 | 048    |
| 1 | X 32   |
| 2 | 8      |
| 3 | 216 92 |

∴ opt C no of misses = 16

Method 2:

For the reference string we can ~~not~~ check from opposite direction & find it.

Main memory:

i) Memory Expansion

ii) DRAM

iii) Memory interleaving

Memory Expansion

→ Construction of bigger capacity memory using smaller size chip

Eg: 8k x 16 memory using 1k x 4 chips

Target memory: 8k words

16 bits/word

Basic chip: 1k words

4 bits/word

$$\text{no of chips req} = \left\lfloor \frac{\text{target capacity}}{\text{Basic chip capacity}} \right\rfloor$$

$$= \frac{8k \times 16}{1k \times 4}$$

$$= 8 \times 4 = 32 \text{ chips}$$

$1k \times 4$  means  $\frac{1k}{4}$  rows 1k rows & 4 columns

These are arranged as

8 ( $1k \times 4$ ) rows with 4 ( $1k \times 4$ ) columns  
i.e., 8 rows & 4 column



Chip select lines are connected in parallel.  
i.e., 1 to all columns

Here each row is called a module.

Thus we have 8 modules in which each module gives  $1k \times 16$

| Decoder |   | (within module) |          |          |       |       |       |       |       | Address |       |       |       |       |                  |              |
|---------|---|-----------------|----------|----------|-------|-------|-------|-------|-------|---------|-------|-------|-------|-------|------------------|--------------|
|         |   | $A_{12}$        | $A_{11}$ | $A_{10}$ | $A_9$ | $A_8$ | $A_7$ | $A_6$ | $A_5$ | $A_4$   | $A_3$ | $A_2$ | $A_1$ | $A_0$ |                  |              |
| 0       | 0 | 0               | 0        | 0        | 0     | 0     | 0     | 0     | 0     | 0       | 0     | 0     | 0     | 0     | Row-0 (Module 0) | 00FF to 0FFF |
| 0       | 0 | 1               | 0        | 0        | 0     | 0     | 0     | 0     | 0     | 0       | 0     | 0     | 0     | 0     | Row-1 (Module 1) | 0400 to 07FF |
| :       | : | :               | :        | :        | :     | :     | :     | :     | :     | :       | :     | :     | :     | :     |                  |              |
| 1       | 1 | 1               | 1        | 1        | 1     | 1     | 1     | 1     | 1     | 1       | 1     | 1     | 1     | 1     | Row-7 (Module 7) | 1C00 to 1FFF |

Adv

Total address range 0000 to 1FFF

Eg: Target size : 64 KB

available : 4kx4 chips

Find no of chips required & size of decoder.

Sol:

$$\text{no of chips} = \frac{64 \text{ KB}}{4 \text{ k} \times 4} = 16 \times 2$$

∴ 32 chips are required

arrangement :  $16 \times 2$

i.e., 16 rows & 2 columns

size of decodes depends on no of rows

i.e.,  $4 \times 16$  decoder in this case.

## Dynamic RAM (DRAM)

- \* DRAM is used in primary memory construction
- \* ~~volt~~ volatile
- \* Contains one transistor & one capacitor for each bit
- \* Refreshing is required.
- \* In one refresh entire <sup>row</sup> gets refreshed
- \* In one refresh cycle all rows are refreshed
- \* To perform memory read or memory write the memory must be ~~state~~ stable. i.e., refreshing must be done.



Here 'a' is overhead time

$$t = a + b$$

$$\text{Refresh overhead \%} = \frac{a}{t} * 100$$

$$\therefore \text{time available for memory read/write} = \frac{b}{t} * 100$$

→ Here we use counter for regular refresh cycles.

→ DRAM has more package density than SRAM, however it is comparatively slower.

(Q34) A MM 1GB is constructed by 256Mx4 bit DRAM chips. The no of rows in DRAM chip is  $2^{14}$ . It takes 50ns for refreshing one row.

The refresh period is 2ms - percentage time that is available for memory read/write is.

Sol:



$$\text{req \%} = \frac{b}{2\text{ms}} * 100$$

$$= \frac{2\text{ms} - a}{2\text{ms}} * 100 = \frac{2\text{ms} - 2^{14} * 50\text{ns}}{2\text{ms}} * 100$$

$$= 59\%$$

### Memory Interleaving:

→ In order to improve the performance, MM uses modules rather than single large unit.

→ Module internal operation can be performed in parallel. However accessing same module multiple times must be sequential.

Information is always distributed  $\alpha$  (interleaved) across the modules.

→ 2 ways  
j)  
j)  
Lower order  
addresses  
it suffi

Higher order

addresses

→ PROV

→ MEM

→ MEMORY

→ If we

Eg: Consi

→ 2 ways of interleaving:

(i) lower order interleaving

(ii) higher order interleaving

Lower order interleaving:

uses higher address bits for word selection and lower address bits for module selection. It gives faster response. However it suffers from spatial locality.

Higher Order interleaving:

uses lower address bits for word selection and higher address bits for module selection.

→ provides better spatial locality (consecutive words in same module)

## → Memory

→ Memory bank uses combination of both.

→ If we use k modules we call it k-way interleaving

Eg: Consider 64 word memory

6 bit address is needed ( $A_5 A_4 A_3 A_2 A_1 A_0$ )

4-way interleaving (lower order)



→ Here words are stored in wrap around fashion.

Assume accessing memory takes 50 ns

and decoder delay is 2 ns

→ If memory is a single large unit,

to access 4 consecutive words it would take

$$4 \times 50 = 200 \text{ ns}$$

→ But in lower order 4-way interleaving



And this organization takes only 58 ns

→ Internal operations of memory are overlapped (but not addressing).

31/08/20

Higher Order Interleaving:



Control

\* CPU

\* Each

\* CU

Stack  
Points  
register

Q35 In a system design MM employs 8 modules with 8-way interleaved design. The words are stored in the modules in wrap around fashion. Internal operations of modules can be done in parallel, however accessing the modules multiple times has to be serialized. Let  $\text{soons}$  required for module internal operation and  $\text{loons}$  for bus activation. Max no of memory operations allowed to be initiated in 1ms for this main memory.

- a) 1000   b) 10000   c) 100000   d) 1000000

Sol:

for every initiation require  $\text{loons}$

$$\therefore \frac{1\text{ms}}{100\text{ns}} = \frac{10^{-3}}{100 \times 10^{-9}} = \frac{10^6}{100} = 10^4 = 10000$$

Q36 A DRAM chip is having  $2^{10}$  rows & each row contains  $2^{10}$  cells. Time required for one refresh operation is 100ns. How much time is req for refreshing the DRAM.

Sol:

one refresh is meant for entire row

$$\therefore 2^{10} * 100\text{ns}$$

## Control Unit Design:

- \* CU is responsible for implementing instructions
- \* Each processor will have its own instruction set architecture (ISA)
- \* CU has processor register for specific purpose / general purpose.

PC: maintain address of next instruction to be executed

IR: holds the current instruction which is to be implemented.

Stack Pointer: Maintains address of (~~top~~) top of the stack.  
register(SPR)

## general purpose registers

\* These are also known as scratch pad registers.  
They are used for storing information temporarily.

Accumulator: maintaining one of the operand and also it is the destination where result is stored.

- \* It is central register to interact with I/O devices
- \* It supports implied addressing mode.

## ISA (Instruction Set Architecture):

- ISA influences the design of processor
- It indicates behaviour of instruction

Eg:  $\phi$

### Sample Instruction formats

- In general most processor do not allow direct data transfer b/w two memory location. generally it is done via registers.

Eg: A processor uses 16 bit instruction and each instruction will have opcode, destination register, source operand one register, and source operand register. If ISA is having 100 instructions what will be the no of general purpose registers

Sol:



100 instruction  $\rightarrow$  7 bit opcodes

$\Rightarrow$  9 bits  $\rightarrow$  3 reg

$\Rightarrow$  1 reg  $\rightarrow$  3 bits

$\therefore 2^3 = 8$  general purpose registers.

Eg: A Machine has 32-bit architecture with 1 word long instruction. It has 64 GPRs (general purpose registers). It needs to support 45 instructions. Each instruction will have opcode, 2 general purpose registers. What is the maximum decimal value of unsigned operand supported by immediate field?

Sol:



$$\therefore \text{bits for operand} = 32 - 18 = 14$$

$$\therefore \text{Max value} = 2^{14} - 1 = 16383$$

(Q37) A processor is having 40 distinct instructions and 24 GPRs.

A 32-bit instruction has an opcode, 2 register operands and one immediate operand. The no of bits available for immediate operand is \_\_\_\_\_

Sol:



$$\therefore 32 - 16 = 16 \text{ bits}$$

(Q38) A processor's ISA has 80 instructions and 64 GPRs. Each instruction contains opcode with 3 registers (1-dest & 2 source operand). A program is having 100 instructions how many bytes of code memory is needed.

bytes of code memory is req?

Sol:



i.e., 25 bits

$$\begin{aligned}
 \text{Size of code memory} &= 100 \times 25 \text{ bits} \\
 &= 2500 \text{ bits} \\
 &= \frac{2500}{8} = 312.5 \text{ bytes}
 \end{aligned}$$

PC points to byte but it cannot point to a bit

∴ we consider size of instruction as 4 bytes

$$\therefore 100 \times 4 = 400 \text{ bytes}$$

## Addressing Modes:

- Addressing Mode denotes how operand is referred in the instruction
- Addressing modes give flexibility in program development.
- The address of the operand is termed as effective address (EA)
- In broad addressing modes are
  - 1) Non-computable addressing mode (calculation is not needed)
  - 2) Computable addressing mode (calculation is needed to get the address.)

## Non-computable addressing modes:

### (i) Immediate:

- Operand itself is present in the instruction.
- There is no EA.
- Fastest addressing mode and suitable for referring program constants

format:



Limitation: Operand range is small.

### (ii) Direct Addressing Mode:

Here operand address is given instead of operand



adv: overcomes limited operand range

\* but it requires one extra memory read.

EA: operand address

Operand:  $M[\text{operand address}]$

→ It is suitable for accessing static variables

limitation:

Limited address range.

### (iii) Indirect Addressing Mode:

Here address of the effective address is given



EA:  $M[\text{Ref to EA}]$

operand:  $M[EA]$

$$= M[M[\text{Ref to EA}]]$$

→ Two memory ref are needed

∴ slowest among non-computable addressing mode

→ It overcomes the problem of limited operand range & limited address range

→ It is used to support ~~pointers~~ pointers, parameter passing.

→ This is also known as single indirection.

double indirection:



EA:  $M[\mu[n[Ref of Ref of EA]]]$

operand:  $M[\mu[n[Ref of Ref of EA]]]$

Here we need 3 memory references.

Similarly for n-indirection, we need " $n+1$ " memory references.

### Computable Addressing Modes:

→ Here EA is computed using portion of instruction (displacement)

using one or more processor registers



### Applications:

#### Base Addressing Mode:

\* used for relocatable codes

#### Index addressing mode:

\* used for accessing array elements

Relative

#### Program Counter Relative Addressing Mode:

\* used for Inter segment (long jump) branching & Intra segment (short jump) branching

# Instruction Design

→ A instruction is having 2 basic components.

i.e., opcode & operand reference



we can classify the instruction based on ~~opcode~~

- i) opcode
- ii) operand references.

## Opcode based classification (functional classification)



## Classification based on operand references

### 4-address instruction:



Destination

Source

used for instruction

sequencing (earlier there was no PC,

so every instruction gives information about next one)

### Adv:

→ Here instructions need not to be stored contiguously ~~because~~ because A<sub>4</sub> creates a linked list of ~~loop~~ instruction.

→ design is simple.

### disadv:

→ instruction length is very high

### 3-address instruction:

- \* PC is added & A<sub>4</sub> is removed.



### 2-address instruction

- \* One of the source operand will be used as destination too.



disadv:

- \* Overwriting is on (on memory)

to avoid this problem of over

### 1-address instruction:

- \* Here accumulator ~~reg~~ is implied operand.



Here the problem of memory overwriting is overcome

### zero Addressing Instruction:

Implicit stack is used for data and operations are performed on stack contents.

Hence it is also known as stack based instruction.

Here we add stack pointer register (SP)

disadv:

Now here we need PC, accumulator & SP

∴ Complexity is more.

1/09/20

Eg: Let us consider a machine that supports 1-address and 2-address instructions. Instruction length is 16-bit. Program is stored in 128 word memory. If there exist 2 two address instruction, what will be the no of 1-address instructions in the ISA

128 word memory  $\Rightarrow$  7 bits for address

1-Address instruction



If we have only 1-address instructions then we can have

$$2^9 = 512 \text{ instructions}$$

2-address instruction



$\therefore$  for each two address instruction  $2^7$  1-address instructions are to be sacrificed

Now we have 2 such two address instruction

$\therefore$  we need to accommodate  $2 * 2^7$  1-address instructions

$\therefore$  no of 1-address instructions

$$= 512 - 256 = 256$$

are

Q39) A processor is having 64 floating point registers and 16 integer registers. ISA supports 4 types of instructions each having 16 bit length

| Type | Format                                                  | No of instructions |
|------|---------------------------------------------------------|--------------------|
| 1    | OP, IR <sub>1</sub> , IR <sub>2</sub> , IR <sub>3</sub> | 4                  |
| 2    | OP, FR <sub>1</sub> , FR <sub>2</sub>                   | 8                  |
| 3    | OP, IR, FR                                              | 14                 |
| 4    | OP, FR                                                  | N                  |

Find N

Sol:

IR — 4 bits      FR — 6 bits

address  
128  
16

Type 1:

|    |     |     |     |
|----|-----|-----|-----|
| OP | IR1 | IR2 | IR3 |
| 4  | 4   | 4   | 4   |

} 4

Type 2:

|    |     |     |
|----|-----|-----|
| OP | FR1 | FR2 |
| 4  | 6   | 6   |

} 8

Type 3:

|    |    |    |
|----|----|----|
| OP | FR | IR |
| 6  | 6  | 4  |

} 14

Type 4:

|    |    |
|----|----|
| OP | FR |
| 10 | 6  |

} N

If only type 4 are present, we will have  $2^{10}$

Adding type 3 takes  $2^4$  type 4 instruction

" type 2 takes  $2^6$  type 4 instruction

" type 1 "  $2^6$  " " "

$$\therefore N = 2^{10} - 14 * 2^4 - 8 * 2^6 - 4 * 2^6$$

$$= 2^{10} - 224 - 512 - 256$$

$$= 32$$

## Instruction Pipeline:

→ Instruction pipeline overlaps instruction phases and thus reduces avg implementation time.

→ The phases of an instruction are:

i) fetch : getting the instruction from memory to IR

ii) decode : Instruction is decode (obtain operation from opcode) and req operands are fetched from memory to GPR

iii) Execute : perform operation in ALU

iv) Store : store the result at destination.

- Each phase is done by separate hardware. So ~~one~~ when one phase is being done, other phases' hardware is idle.
- To remove this idleness & improve performance we use instruction pipelining.

### Concept chart of Instruction pipeline

- Instructions must be independent for giving optimal performance.

Q40) Match the following

P: Indirect addressing

1. Array Implementation

Q: Indexed addressing

2. Writing reloadable code

R: Base Register Addressing

3. Passing array as parameter

4. Accessing constants

Ans :

P-3 Q-1 R-2

Q41) In a hypothetical processor an instruction format is



where xx is displacement and R<sub>i</sub> is base register. The displacement is denoted as 2'complement signed number. If x = 1111 1111 1000

and R<sub>i</sub> decimal value is 300. The effective address in decimal is

Sol :

$$1111\ 1111\ 1000 \equiv 1000 = (-8)_{10}$$

$$\therefore 300 - 8 = 292$$

(Q42) A processor is having  $m$ -bit length instructions and support one address and two address instructions. The memory is having  $2^n$  words. If there are 8-two address instructions what will be no of one address instructions.

Sol.:



$$\therefore 2^{m-n} \text{ are possible}$$



$\therefore$  For each 2-address instruction  $2^n$  one address instruction are to be sacrificed.

$\therefore$  no of 1-address instructions

$$= 2^{m-n} - 8 * 2^n$$

$$= 2^{m-n} - 2^{n+3}$$

$\rightarrow$  performance of instruction pipeline is given by speed up factor.

$$S = \frac{\text{time without pipeline}}{\text{time with Pipeline}}$$

$$\therefore S > 1$$

Evolution of instruction pipeline :-

1- Non-pipeline system / single stage Instruction pipeline :





## ~~2~~ 2-Stage Instruction Pipeline



At any point of time  
2 instructions can be active

Ideal performance = 2 times of the non-pipelined system

## 4-Stage Instruction Pipeline



Ideal speed = 4

## 5 - Stage instruction pipe pipeline :



ideal speed up = 5

→ Here store may be done in registers  
or in memory

So we can split the stage

## Types

- i) i
- ii) i
- iii) i
- iv) i

## 6 - Stage instruction pipeline



ideal speedup = 6

(6 instructions can be active & needed)

## linear

## non-lin

## Synch

## Pipel

→ There is limit on how many stages we can have. we can't have large no of stages. Cuz overhead increases and performance drops. So 6 is the max no of stages possible with optimal performance.

## Dirct



### Types of instruction pipelines

- i) linear : Instructions move only forward (feed forward)
- ii) Non-linear : Forward as well as backward (feed back) (i.e., back to previous stage)
- iii) Synchronous : All stages move data at the same time.
- iv) Asynchronous : Data movement b/w the stages is not dependent on other stage's ~~stage~~ status.

linear:



$$CPI_{avg} = 1$$

Instruction will not revisit any stage

non-linear:



$$CPI_{avg} > 1 \text{ (Reservation tables)}$$

Synchronous:

Here we have interstage registers (buffer registers)



→ These registers are implemented with flip-flops operation with same clock.

Disadv: → Some stages may require more time. So other stages has to wait. Thus slowest stage decides performance.

registers

Stage

active if needed

have arbitrary  
lengths.  
since drops.

## Asynchronous Pipeline:

→ Handshake signals are used before data transfer



→ Here data transfer b/w  $S_1 \& S_2$  is not dependent on data transfer b/w  $S_3 \& S_4$ .

→ Here we don't need any buffer registers.

Now assume each stage takes 1 clock

4-stage



∴ In 1000<sup>th</sup> clock  $I_{997}$  would complete its execution if all the instructions are independent.

∴ n instruction would take  $4+n-1$  clocks.

∴ For k-stage pipelining,  $k+n-1$  clocks.

→ At 4<sup>th</sup> clock,  $CPI_{avg} = \frac{4}{1} = 4$

At 5<sup>th</sup> clock,  $CPI_{avg} = \frac{5}{2} = 2.5$

At 6<sup>th</sup> clock,  $CPI_{avg} = \frac{6}{3} = 2$

At 1000<sup>th</sup> clock  $CPI_{avg} = \frac{1000}{997} \approx 1$



$\therefore$  for large no. of instruction  $CPI_{avg} \approx 1$

### Performance of pipeline:

→ performance of pipeline is given speed up factor.

$$S = \frac{\text{time without pipeline}}{\text{time with pipeline}}$$

ideal value of  $S = k$

$k \rightarrow$  no. of stages

→ for  $n$ -instructions,  $k$ -stage instruction pipeline takes

$(k + n - 1)$  clocks

assuming each stage takes one clock.

clock period of instruction pipeline

$$= \cancel{Clock \text{ period}} \cancel{buff} \geq \text{Max(stage delay)} + \text{Buffer overhead}$$

→ On avg pipeline produces one instruction per clock

$$CPI_{avg} \approx 1$$

- Dependencies among instructions reduces pipeline performance.
- They require reorganization of pipeline contents. This inturn require waiting clocks (stall cycles)

$$\therefore CPI_{avg} = \left( 1 + \frac{\text{Stall freq}}{\text{clocks}} * \frac{\text{Stall cycles}}{\text{clocks}} \right) \text{clocks}$$

↳ i. instruction causing dependency.

$$\rightarrow \text{Effective speedup}, S_{eff} = \frac{S_{ideal}}{1 + \frac{\text{Stall freq}}{\text{clocks}} * \frac{\text{Stall cycles}}{\text{clocks}}}$$

$$S_{eff} = \frac{k}{1 + \frac{\text{Stall freq}}{\text{clocks}} * \frac{\text{Stall cycles}}{\text{clocks}}}$$

k → no of stages

→ Data dependencies may occur due to arithmetic instructions.

→ Techniques to deal data dependency are

i) Multiple

(i) Instruction rescheduling

ii) Operand forwarding

iii) Introduction of stall cycles.

iv) Register Renaming (for WAR) (for WAW)

→ Control dependencies may result due to branch instructions.

techniques to deal control dependencies are:

i) Multiple pipelines.

ii) Delayed load.

iii) Prediction techniques.

iv) Delayed branching

- Structural dependencies are resulted due to the limitation of resources.
- Resource duplication will resolve this problem.

(Q1B) Consider a 4-stage instruction pipeline with stage delays 20 ns, 15 ns, 18 ns, 10 ns - Inter stage buffer overhead is 5 ns

i) How much time is needed on an avg per instruction in non-pipelined system.

ii) How much time is needed on an avg per instruction in a pipelined system.

iii) Calculate speed up.

So:

$$\text{i) Time} = 20 + 15 + 18 + 10$$

$$= 63 \text{ ns} \quad (\text{No buffer registers})$$

$$\text{ii) CPI}_{\text{avg}} = 1 \text{ clock}$$

$$\text{Clock period} = \text{Max}(20, 15, 18, 10) + \text{Buffer overhead}$$

$$= 20 + 5 = 25 \text{ ns}$$

$$\therefore 25 \text{ ns}$$

$$\text{iii) Speed up} = \frac{63}{25} = 2.52$$

(Q1C) In the previous problem, if 100 instructions program is given

i) time of execution in non-pipelined system

ii) " " " " " pipelined system

iii) speed up

Sol:

$$i) 100 * (20 + 15 + 18 + 10) = 6300 \text{ ns}$$

$$ii) T_{dec} \text{ CPI}_{avg} \geq 1 \text{ clock}$$

$$\text{Clock period} = \max(20, 15, 18, 10) + 5$$

$$= 25 \text{ ns}$$

for 100 instruction is 4-stage pipeline

$$\text{no of clock req} = 4 * 1 + 99 \\ = 103$$

$$\therefore \text{time req} = 103 * 25 \\ = 2575 \text{ ns}$$

$$iii) \text{speed up} = \frac{6300}{2575} = 2.446$$

### Split Pipelines:

→ Identify max delay - pipeline and split it into further stages and thus we can decrease clock period. Thus performance increases.

Eg:



Assume buffer overhead = 2 ns

Clock period of 5-stage =  $20 + 2 = 22$  (i.e., avg time for execution of instruction)

Clock period of 6-stage =  $12 + 8 = 14$

$$\therefore \text{speed up} = \frac{22}{14} = \frac{11}{7} = 1.57$$

**Ques** A non-pipelined system is operating with 2.5GHz takes 5 clock cycles on an avg per instruction. If a 5-stage instruction pipeline due to stage management operates with 2GHz clock. A program contains 30% memory operations, 60% ALU operations and rest branch operations. The cache miss result 5% of memory operations 50 clock cycle stalls, 50% of branch operations result 2 clock stalls. What is speed up of pipeline over non-pipeline system.

Sol:

$$\text{Avg execution time per instruction in non-pipelined system} = 5 * \frac{1}{2.5 \times 10^9} \\ = 2 \times 10^{-9} \\ = 2 \text{ ns}$$

In pipeline System

$$CPI_{avg} = 1$$

$$\text{clock period } \Phi = \frac{1}{2 \times 10^9} = 0.5 \text{ ns}$$

$$\text{no of stall cycles} = 0.3(0.05)(50) + 0.1(0.5)(2)$$

↓                      ↓  
Memory          Branch

$$= 15(0.05) + 2(0.05)$$

$$= 17(0.05) = 0.85 \text{ cycle per instruction}$$

$$\therefore \text{avg execution time per instruction} = 0.5 + 0.85$$

~~2.35 ns~~

$$= 1 + 0.85$$

$$= 1.85 \text{ cycles}$$

$$\text{clock} = 1.85 \times 0.5 \text{ ns}$$

$$\text{speed up} = \frac{2 \text{ ns}}{1.85 \times 0.5 \text{ ns}} = \frac{4}{1.85} \approx 2.16$$

(Ques) A pipeline system operates with 2 GHz clock and result of 2-stall cycles for 30% of instructions. Non-pipeline system operates with 2.5 GHz clock and takes 5 clock on avg. Find speed up

Sol:

for non-pipeline

$$t_{avg} = 5 + \frac{1}{2.5 \times 10^9} = 2 \text{ ns}$$

for pipeline

$$\text{Clock period} = \frac{1}{2 \times 10^9} = 0.5 \text{ ns}$$

$$\therefore \text{avg no of clocks per instruction} = 1 + \text{stall cycles}$$

$$= 1 + 0.3(2)$$

$$= 1 + 0.6$$

$$= 1.6 \text{ cycles}$$

$$\therefore \text{avg time per instruction} = 1.6 \times 0.5 = 0.8$$

$$\therefore \text{Speed up} = \frac{2}{0.8} = 2.5$$

## Data Dependencies:

→ Data dependencies result when one instruction in pipeline is waiting for the result of earlier instruction which is not yet computed.

Eg: Consider I<sub>1</sub>: ADD R<sub>1</sub>, R<sub>2</sub>, R<sub>3</sub>

$$\dots R_1 = R_2 + R_3$$

I<sub>2</sub>: SUB R<sub>4</sub>, R<sub>1</sub>, R<sub>5</sub>  $\dots R_4 = R_1 - R_5$

|                |          |                                       |                                      |                        |
|----------------|----------|---------------------------------------|--------------------------------------|------------------------|
| I <sub>1</sub> | F<br>add | D<br>R <sub>2</sub> , R <sub>3</sub>  | E<br>R <sub>2</sub> + R <sub>3</sub> | S<br>(R <sub>1</sub> ) |
| I <sub>2</sub> | F<br>sub | D<br>(R <sub>1</sub> ) R <sub>5</sub> | E<br>R <sub>1</sub> - R <sub>5</sub> | S<br>R <sub>4</sub>    |

\* Here  $I_2$  is using old value of  $R_1$ . Actually it should use the new value of  $R_1$  modified by  $I_1$ .

$\therefore$  we say  $I_2$  is having data dependency over  $I_1$ .

So the correct way of implementation is



Now with correct implementation no of clocks increases to 7 from 5.

Thus performance is reduced.

### types of data dependencies:

→ 4 types are possible based on how shared register is used b/w the instructions.

- i) RAW (Read After Write) (True or flow dependency)
- ii) WAR (Write After Read) (Anti flow or Anti dependency)
- iii) WAW (Write After Write) (O/P dependency)
- iv) RAR (Read After Read) (I/P dependency or no dependency)

### RAW (Flow dependency)

→ dependent instruction has to read shared register only after it was written by the earlier one

WAR

## WAR (Antiflows)

Consider

$$I_1: \text{SUB } R_1, R_2, R_3 \quad \dots \quad R_1 = R_2 - R_3$$

$$I_2: \text{OR } R_2, R_3, R_4 \quad \dots \quad R_2 = R_2 \mid R_4$$

Generally Arithmetic instruction takes more time than logical instruction. So in a multiprocessor environment there is a danger that  $I_1$  read  $R_2$  value that is modified by  $I_2$ .

→ Dependent Instruction has to write into shared register only after it was read by earlier one.

03/09/20

## WAW (Output dependency)

Consider

$$I_1: \text{ADD } R_1, R_2, R_3 \quad \dots \quad R_1 = R_2 + R_3$$

$$I_2: \text{ADD } R_1, R_4, R_5 \quad \dots \quad R_1 = R_4 + R_5$$

Here destination register is same for both  $I_1$  &  $I_2$

Second instruction has to write into  $R_1$  only after  $I_1$  does.

## RAW (Input Dependency)

Consider

$$I_1: \text{ADD } R_1, R_2, R_3$$

$$I_2: \text{ADD } R_4, R_2, R_5$$

$R_2$  is source for the instructions.

Q47

6-Stage pipeline instruction pipeline has perfectly balanced stages and assume no buffer overhead. What is the speed of this pipeline if it results 2-stalls for 25% instructions compared to non-pipeline system.

Sol:

CPI for non-pipeline = 6

$$\text{CPI for pipeline} = 1 + \left( \frac{\text{Stall}}{\text{freq}} \right) (\text{Stall penalty})$$

$$= 1 + 0.25(2) = 1.5$$

$$\therefore \text{Speed up} = \frac{6}{1.5} = 4$$

pipeline

(Q8) A 4-stage instruction delays in (picoseconds) are 800, 500, 400, 300.

The first stage is replaced with two stages of delay 600, 350.

The throughput increase with this design is \_\_\_\_\_ percent

compared to 4 stage instruction pipeline

Sol:

$$\text{CPI}_1 = 800 \text{ ps}$$

$$\text{CPI}_2 = 600 \text{ ps}$$

throughput is no of instructions per second

∴ throughput increase =  $\frac{1}{600} - \frac{1}{800}$   $\times 100$

$$= \frac{1}{600} - \frac{1}{800} \times 100 = \frac{48 - 36}{48} \times 8 \times 100$$

$$= \frac{1}{600} - \frac{1}{800} \times 100 = \frac{48 - 36}{48} \times 8 \times 100 = \frac{2}{48} \times 8 \times 100$$

$$= \frac{8 - 6}{48} \times 100 = \frac{2}{48} \times 8 \times 100 = 33.33\%$$

### Detection of data dependencies:

Domain (source register) and Range (destination register) relations can be used to detect data dependencies.

Eg:

|                                                                       | Domain                             | Range             |
|-----------------------------------------------------------------------|------------------------------------|-------------------|
| I <sub>1</sub> : ADD R <sub>1</sub> , R <sub>1</sub> , R <sub>3</sub> | {R <sub>1</sub> , R <sub>3</sub> } | {R <sub>1</sub> } |
| I <sub>2</sub> : SUB R <sub>4</sub> , R <sub>1</sub> , R <sub>5</sub> | {R <sub>1</sub> , R <sub>5</sub> } | {R <sub>4</sub> } |

D<sub>2</sub> ∩ R<sub>1</sub> ≠ φ ⇒ RAW

R<sub>2</sub> ∩ D<sub>1</sub> ≠ φ ⇒ WAR

D<sub>2</sub> ∩ R<sub>5</sub> ≠ φ ⇒ ~~WAR~~ RAR

R<sub>1</sub> ∩ R<sub>2</sub> ≠ φ ⇒ ~~RAW~~ WAW

### Domain & Range

→ Since compiler knows these ~~dependencies~~ at compilation,

it is possible to detect data dependencies during compilation.

→ DAG is used at the time of compilation. to know which instruction are dependent on whom.

### Techniques

#### Handling Data Dependencies:

##### i) Instruction rescheduling:

→ Here compiler places independent instruction b/w dependent ones. The performance of this approach depends on the program & ability of compiler.

→ Based on no of stall cycles required b/w the dependent instructions, no of independent instructions to be placed is decided.

Eg: If there are 2 stall cycles b/w two data dependent instructions, then we need to place 2 independent instruction b/w them.

If I<sub>1</sub> & I<sub>2</sub> are dependent



→ Thus there will be no performance loss with instruction rescheduling.

→ However, here, the main problem is identifying them independent instructions ( $I_x$  &  $I_y$ ) in the program.

It is because

~~\*  $I_x$  &  $I_y$  should be independent b/w  $I_x$  &  $I_y$  also b/w  $I_1$  &  $I_2$ . Along with that they should also be independent from successors of  $I_1$ ,  $I_x$ ,  $I_y$  &  $I_2$ .~~

This requirement makes identifying such instructions difficult.

→ Detecting direct dependencies is easy, but detecting indirect dependencies is hard. It is NP-hard problem.

Eg: If  ~~$I_7$  is dependent on  $I_1$~~

~~It is~~

If  $I_7$  is dependent on  $I_1$ ,  $I_{10}$  is dependent on  $I_7$  and  $I_x$  is dependent on  $I_{10}$  then we say  $I_x$  is indirectly dependent on  $I_1$ .



Data hazards can be also handled at software level by inserting which the compiler inserts NOP b/w dependent instruction. Now compiler may rearrange instruction if need. no of NOP inserted = no of stalls

### iii) Operand Forwarding:

\* used to resolve data dependency.

\* Here operand value (before it is stored) is forwarded to the required stage using inter-stage buffer registers.

\* only one dependent operand is allowed.

\* Forwarding can be done not be done to previous clocks.

\* Operand forwarding can't resolve the data dependencies, if execution stage is computing effective address of the memory word and result stall cycles.

Consider the following code:

$$I_1: ADD R_1, R_2, R_3 \quad R_1 = R_2 + R_3$$

$$I_2: SUB R_4, R_1, R_5 \quad R_4 = R_1 - R_5$$

$$(i.e., (R_2 + R_3) - R_5)$$

So here we forward  $R_1$  value before it is stored



→ This approach requires additional hardware & cost of implementation is more.



- Selection on Mux is done by control unit using the information from compiler.
- Mux can send only one bit of data, thus if size of data bus is 16 bits, then we need  $16 \times 2 \times 1$  Muxes. This is the additional hardware required for implementation of operand forwarding.
- since executional unit can produce only one o/p at most, so only one operand can be forwarded.

Qnq) Consider 5-stage pipelined processor with IF, ID, OF, PO or WO stages. Except PO stage all other stages take one clock each for any instruction. PO stage takes 1 clock for ADD and SUB instruction, 3 clocks for MUL instruction and 6 clocks for DIV instruction respectively. Operand forwarding is used in the pipeline. what will be no of clocks required to execute the following sequence of instructions.

I<sub>1</sub> : MUL R<sub>2</sub>, R<sub>0</sub>, R<sub>1</sub>

PO : perform Operation

I<sub>2</sub> : DIV R<sub>5</sub> R<sub>3</sub> R<sub>4</sub>

WO : write operand.

I<sub>3</sub> : ADD R<sub>2</sub> R<sub>5</sub> R<sub>2</sub>

I<sub>4</sub> : SUB R<sub>5</sub> R<sub>2</sub> R<sub>6</sub>

Sol :



∴ 15 clocks.

|                | 1         | 2  | 3                                   | 4                                                | 5                                | 6              | 7 | 8 | 9 | 10                   | 11                                    | 12                               | 13                               |    |
|----------------|-----------|----|-------------------------------------|--------------------------------------------------|----------------------------------|----------------|---|---|---|----------------------|---------------------------------------|----------------------------------|----------------------------------|----|
| I <sub>1</sub> | IF<br>MUL | ID | OF<br>P <sub>0</sub> R <sub>1</sub> | P <sub>0</sub><br>R <sub>0</sub> *R <sub>1</sub> | W <sub>0</sub><br>R <sub>2</sub> |                |   |   |   |                      |                                       |                                  | W <sub>0</sub><br>R <sub>3</sub> |    |
| I <sub>2</sub> | IF<br>DIV | ID | Stall                               | OF<br>P <sub>0</sub> R <sub>4</sub>              |                                  | P <sub>0</sub> |   |   |   |                      |                                       |                                  | W <sub>0</sub><br>R <sub>3</sub> | 14 |
| I <sub>3</sub> | IF<br>ADD | ID |                                     | Stall Cycles                                     |                                  |                |   |   |   | OF<br>R <sub>2</sub> | P <sub>0</sub><br>x+R <sub>2</sub> =y |                                  | W <sub>0</sub><br>R <sub>2</sub> | 15 |
| I <sub>4</sub> | IF<br>SUB | ID |                                     | Stall Cycles                                     |                                  |                |   |   |   | OF<br>R <sub>6</sub> | P <sub>0</sub><br>y-R <sub>6</sub>    | W <sub>0</sub><br>R <sub>5</sub> |                                  |    |

∴ 15 clocks.

★ Here the reason for not performing OF of I<sub>2</sub> at 4th clock is due to design constraints.

★ i.e., Data should not be exposed.

★ i.e., Immediately after fetching we perform P<sub>0</sub> so that data won't be exposed much.

### Method 2: (short cut)

Except one stage all other stages take one each

MUL R<sub>2</sub> R<sub>0</sub> R<sub>1</sub>

no of clock req

DIV R<sub>5</sub> R<sub>3</sub> R<sub>4</sub>

$$= (k+n-1) + \text{extra clocks}$$

ADD R<sub>2</sub> R<sub>5</sub> R<sub>2</sub>

$$= (s+4-1) + 5 + 2 \text{ to to}$$

SUB R<sub>5</sub> R<sub>2</sub> R<sub>6</sub>

extra  
for DIV  
extra  
for MUL

$$= 15$$

→ This ~~short~~ shortcut works only if the condition is that except one stage rest takes 1 clock.

(Q50)

A RISC processor of 5-stage instruction pipeline (IF, ID, OF, P<sub>0</sub>, WB).

Except P<sub>0</sub>, all other stages take 1 clock each. A program

sequence of 100 instruction the P<sub>0</sub> takes 3 clocks for 40

instructions, 2 clocks for 35 instructions and 1 clock for

remaining 25 instructions. How many clock are req to complete

the program?

program sequence. Assume there are no hazards and no buffer overhead.

Sol:

$$\text{no of clocks req} = (k+n-1) \text{ clocks} + \text{extra clocks}$$

$$= (5+99) + 2*40 + 1*35$$

$$= 104 + 80 + 35$$

$$= 219 \text{ clocks.}$$

### Control Dependencies:

→ Control dependencies result due to change in flow of execution.

Branch instruction may result change in the flow of execution.

→ Pipeline contents are reorganized due to control dependencies.

→ This may take stall cycles from 1 to  $k-1$  based on intelligence of the pipeline.

$$\rightarrow \text{CPI}_{\text{avg}} = \left( 1 + \frac{\text{stall} * \text{stall freq}}{\text{cycles}} \right)$$

$$\rightarrow \text{Effective speed up} = \frac{k}{1 + \frac{\text{stall} * \text{stall freq}}{\text{cycles}}}$$



Target of branch instruction will not be fetched until branch instruction is fetched

→ Thus in the worst case  $k-1$  stall cycles are possible.

→ For the previous situation, if execution stage is capable of evaluating branch condition and compute address of target instruction, the no of stall cycles = 3



2 stall cycle.

→ If the branch is unconditional then decoding state itself can know the nature of instruction and computes target instruction address



Stall cycle = 1 (is minimum stall)

∴ from this we say no of stall cycles may vary from 1 to k-1 based on the intelligence of pipeline.

(Q5) A five stage pipeline (IF, ID|RF, EX, MEM, WB) is having delay(ns) -

1, 2, 2, 2, 1, 1, 0.75. In order to improve the performance the stage

ID|RF is split into 3 stages of equal delays. Further EX stage

is split into 2 stages of equal delays. 20% of the instructions

are branch instructions. The next instruction after the branch

will not be fetched until branch condition is not evaluated and target address is not computed. EX stage of 1st pipeline and

EX2 stage of 2nd pipeline is having this ability. What

is the performance gain of 2nd pipeline over the 1st one.

So



|         |                          |                          |                          |                      |                      |          |            |
|---------|--------------------------|--------------------------|--------------------------|----------------------|----------------------|----------|------------|
| IF<br>1 | ID <sub>1</sub><br>2 2/3 | ID <sub>2</sub><br>2 2/3 | ID <sub>3</sub><br>2 2/3 | EX <sub>1</sub><br>1 | EX <sub>2</sub><br>1 | MEM<br>1 | WB<br>0.75 |
|---------|--------------------------|--------------------------|--------------------------|----------------------|----------------------|----------|------------|

$$CPI_1 = 2.2 \text{ ns} \quad \text{clock period for pipeline 1} = 2.2 \text{ ns}$$

$$CPI_2 = 1 \text{ ns} \quad \text{clock period for pipeline 2} = 1 \text{ ns}$$

for Pipeline 1

$$\begin{aligned} CPI_{avg} &= 1 + (0.2)(2) \\ &= 1.4 \text{ clocks} \end{aligned}$$

for Pipeline 2

$$\begin{aligned} CPI_{avg} &= 1 + 0.2(5) \\ &= 2 \text{ clocks} \end{aligned}$$

performance gain =  $\frac{\text{time for pipeline 1}}{\text{time for pipeline 2}}$

$$= \frac{1.4 * 2.2}{2 * 1} = 1.54$$

### Handling Control Dependencies:

#### Multiple Pipelines:

- Here all possible target instructions after the branch instruction are placed in different pipelines.
- At the end of the branch instruction, only one of them is allowed to progress (it contains correct flow) and all others are stopped.

→ Here drawback is that, cost of implementation is more.

→ In general, however, if-then-else instructions will be implemented in 2-pipeline system

→ However, we cannot implement nested conditional statements with 2-pipeline system.

Consider below code implementation with 2-pipeline system

Pipeline 1

|                |   |   |   |   |
|----------------|---|---|---|---|
| I <sub>1</sub> | F | D | E | S |
| I <sub>2</sub> | F | D | E | S |

I<sub>1</sub>: if (---)

else

I<sub>2</sub>:

I<sub>7</sub>:

Pipeline 2

|                  |   |   |   |   |
|------------------|---|---|---|---|
| I <sub>1</sub> : | F | D | E | S |
| I <sub>2</sub>   | F | D | E | S |

Based on whether the condition is true or not, one of the above two pipelines proceeds with the execution.

→ If depth of nesting is "n" then we need n-pipelines.

### ii) Delay Slot (or) Delay load

In case of branch instruction, the next sequential target instruction are fetched for subsequent clocks.



So here irrespective of condition 1-delay slot happen  
(i.e., 1 stall cycle)

(Q52) A 5-stage instruction pipeline is used to implement the program that contain 20% branch instructions. The first stage is fetch and the instruction following branch will not be fetched until the branch instruction is completed. The no of stalls and avg CPI of this pipeline is

sol: ~~branch and condition will be stall when pipeline is at branch~~



$$\begin{aligned} \text{CPI}_{\text{avg}} &= 1 + 0.2(4) \\ &= 1.8 \\ \therefore & 4, 1.8 \end{aligned}$$

(Q53) In (Q52), if branch is taken by branch instruction then only stall cycles occur otherwise no stall cycles. Assume among the

branch instructions 40% are unconditional branches and 50% of conditional branch instruction, ~~branch~~ branch will not be taken. The average CPI is

sol:

$$\text{CPI}_{\text{avg}} = 1 + \frac{(0.2)(0.4)(4)}{\text{unconditional}} + \frac{(0.2)(0.6)(0.5)(4)}{\text{conditioned}}$$

$$\begin{aligned} &= 1 + 0.32 + 0.24 \\ &= 1.56 \end{aligned}$$

## Note:

~~1. Raw~~

I<sub>1</sub>: MOV R<sub>2</sub>, SO(R<sub>3</sub>)

I<sub>2</sub>: SUB R<sub>4</sub>, R<sub>2</sub>, RS

~~The~~ I<sub>1</sub> & I<sub>2</sub> has RAW dependency.

since we are using computable effective address, Operand Forwarding technique doesn't work in this case. So even after using operand forwarding, we get one stall cycle.

|     | 1              | 2                  | 3   | 4                       | 5              | 6 | 7              | 8 |
|-----|----------------|--------------------|-----|-------------------------|----------------|---|----------------|---|
| IF  | IF             | ID                 | Exe | Mem                     | WB             |   |                |   |
| MOV | R <sub>3</sub> | R <sub>3</sub> +SO |     | MEM(R <sub>3</sub> +SO) | R <sub>2</sub> |   |                |   |
|     |                |                    |     |                         |                |   |                |   |
|     | IF             | IF                 |     | Exe                     | Mem            |   | WB             |   |
| SUB | SUB            | Shift              |     | X-R <sub>3</sub>        | -              |   | R <sub>4</sub> |   |
|     |                |                    |     |                         |                |   |                |   |

↓  
Operand  
forwarding

## (iii) Prediction:

Prediction for branch instruction is

- Static Predictions {  
 (Compiletime)  
 (i) Branch always takes (fetch: branch target)  
 (ii) Branch never takes (fetch: next sequential)

## Dynamic Predictions:

### 1) BHT (Branch history table)

\* Prediction will depend on status of previous predicted results.

\* Majority previous decisions will decide the prediction.

Limitation: the size of the table.

(i) grows with no of branch instructions.



### iii) Un-related previous decisions influence current prediction

Consider

```

for (i=0; i<1000; i++) {
    if (x>y)
        a[i] = b[i]+5
}

```

Here condition B is true (writing)  
and false one time

So it is predicted that this  
condition could also be true  
even though  $x > y$  is not related to  $i < 1000$  in any way

So this is another limitation.

### 2) 2-bit dynamic prediction:

Whenever two consecutive previous predictions are wrong, then it changes the prediction

Flag variable:  $T$

$T=0$  (not taken)

$T=1$  (taken)

Fetch: Target instruction



BAT: Branch Always Taken

BNT: Branch Never Taken

Fetch: Next sequential instruction

→ Since we have 4 states, we need 2 bits to implement this.

So we need only 2 flip flops which is very less cost compared to BHT.

### 3) Branch Target Buffer:

→ It is a small cache which maintains the target address of the conditional branch instructions. ~~and~~ and prediction bits.

→ If it result in a hit and prediction is correct, then no penalty.

→ BTB cache miss penalty is normal miss penalty. ( $1 \text{ to } k-1$ ) stalls

→ BTB hit & wrong prediction penalty is few clocks more than

normal penalty as BTB has to be updated and the target is to be brought into BTB.

Miss  $\Rightarrow$  normal penalty

BTB hit + correct prediction  $\Rightarrow$  no penalty

BTB hit + wrong prediction  $\Rightarrow$  extra penalty

(Q54) Consider a BTB is employed in the system to deal with conditional branch instructions. There are 15.1. Conditional branch instructions and 90.1. instructions. The prediction of BTB is branch always takes. Among the conditional branch instructions 92.1. satisfies condition. Let normal branch stalls are 3 and wrong prediction stalls is 5 (because BTB is to be updated on wrong prediction). Compute avg CPI.



## Structural Dependencies

- It results due to non-availability of resources for different stages of pipeline.
- ~~Result~~ Resource duplication will solve this problem.

single port memory ~~does~~ doesn't allow two different memory operations at the same time for two different stages.



At the 4th clock we need two different memory operations.

- So instead of having single port memory, having Dual-port or multiport memory can resolve the issue.

## Resource Duplication:

It is subjected to cost & reliability



$$\text{reliability} = 0.9 + 0.9 = 0.81 \approx 81\%$$



$$R_2 = 1 - 0.1 + 0.1 = 0.99$$

$$\therefore \text{reliability} = 0.9 + 0.99 \approx 90\%$$

Eg: In a 5-stage instruction pipeline, 10% instructions result in data dependencies, 20% instructions result in control dependencies, 15% instructions result in structural dependencies. If stall cycles are 3, 2, 1 respectively. find CPI<sub>avg</sub>

Sol:

$$CPI_{avg} = 1 + 0.1(3) + 0.2(2) + 0.15(1)$$

$$= 1 + 0.3 + 0.4 + 0.15$$

$$CPI_{avg} = 1.85$$

### Super Scalar Architecture.

- Parallel pipelines are used to improve the performance.
- In m-way interleaved design there are m-pipelines operated such that in the first clock m instructions are placed in fetch stage of the pipelines.
- If these pipelines are having k-stages then in m-instructions are implemented in first k-clocks.
- The time taken for n instruction in this m-way interleaved k-stage pipelines is

$$T_n = (k + (n-m)/m) \text{ clocks.}$$

→ Speed up factor,  $S = \frac{n*k}{k + \frac{n-m}{m}}$

If  $n \gg m$  then

$$S = m*k$$

04/09/20

Clock 1:  $I_1, I_2, I_3 \dots I_m$  (Fetch)

Clock 2:  $I_{m+1}, I_{m+2}, I_{m+3}, \dots, I_{2m}$  (Fetch)

Thus if it is a  $k$ -stage pipeline, after  $k$ th clock,  $m$  instructions would finish their execution.

From then onwards, ~~we~~  $m$  instructions would finish ~~their~~ their execution per each clock.

If total no of instructions are  $n$ ,

1st  $k$  clocks  $\rightarrow m$  instruction

~~then~~  $n-m$  instructions are left for which  $m$  instructions will be executed for each clock.

$\therefore$  ~~no~~  $n-m$  instructions require  $\frac{n-m}{m}$  clocks.

$\therefore$  time required to finish  $n$  instructions are,

$$T_n = k + \frac{n-m}{m}$$

If  $m=1$ , i.e., single pipeline

$$T_n = k+n-1 \text{ clocks}$$

$$\text{Speed up} = \frac{\text{time without pipeline}}{\text{time with pipeline}}$$

$$S = \frac{n * k}{k + \frac{n-m}{m}}$$

If  $n \gg 1$  then

$$S = \frac{n+k}{\frac{n}{m}} \Rightarrow S = m+k$$

## 2-way interleaved pipeline



$\therefore$  Pipeline 1:  $I_1, I_3, I_5, I_7, I_9$

Pipeline 2:  $I_2, I_4, I_6, I_8, I_{10}$

$\therefore 10$  instruction are executed in 9 clocks.

i.e., ~~Clock~~ CPIavg < 1

→ However to do this kind of execution the instructions must be independent.

(QSS) Consider 4-way interleaved super scalar architecture where 4 pipelines of 6-stages are operated. Assume 1000 independent instructions are to be implemented, how many clocks are taken by this super scalar architecture. Assume there are no hazards  
Sol:

After ~~at least~~  $6(k)$  clocks  $4(m)$  instruction are executed.

From now on ~~one~~ 4 instructions are executed per each clock.

$\therefore$  no of clocks req = ~~for 1000~~  $\frac{1000}{4}$

$$= 6 + \frac{1000-4}{4}$$

$$= 6 + \frac{996}{4} = 6 + 249 = 255$$

Also speedup  $\approx$  min. of stalls  $\frac{1000 * 6}{255} = 23.52$

analogous to  $\frac{1000 * 6}{24} = 255$

→ pipeline latency  $= 24$  if no pipeline register between stages.

→ pipeline latency  $= 255$  if each stage has one pipeline register.

(min) Pipeline stall count  $= 24$

### Non-linear Pipeline:

- In non-linear pipeline, an instruction can visit the same stage more than once and hence  $CPI_{avg} > 1$
- Reservation Table indicate the usage of the stages at different clocks. We can detect conflict free scheduling using reservation table.
- The scheduling gap b/w the successive instructions is always greater than 1.

Eg.:  
3-stage pipeline reservation table

| Stage / clock | 1 | 2 | 3 | 4 | 5 |
|---------------|---|---|---|---|---|
| S1            | x |   |   |   | x |
| S2            |   | x | x | x |   |
| S3            | x |   | x |   |   |



Conflict: More than one instruction trying to occupy same stage at same clock.

latency: The scheduling gap in clocks b/w successive instructions

without causing conflicts in the instruction pipeline.

Many possible values are there for latency.

Minimum Available Latency (MAL):

min no of clock req for scheduling non-linear pipeline without resulting in conflicts.

Forbidden Latency of a stage (Hidden conflicts):

The time gap in clocks b/w successive usages in the stage.

It denote conflicts.

Forbidden Vector:

The binary representation of the ~~forbidden~~ latencies of all the stages. The length of the forbidden vector indicate the maximum value of forbidden latency among the stages. If the binary bit is one in position  $k$ , then it denote conflict for  $k^{\text{th}}$  clock.

→ For the previous example of reservation table,

$$\text{forbidden latency of Stage 1} = 5 - 1 = 4$$

Q56

For the following reservation table of a pipeline with stages S1, S2 and S3 the minimum available latency is \_\_\_\_\_

| Stage \ clock | 1 | 2 | 3 | 4 | 5 |
|---------------|---|---|---|---|---|
| S1            | x |   |   |   | x |
| S2            |   | x |   | x |   |
| S3            |   |   | x |   |   |

- a) 1
- b) 2
- c) 3
- d) 4

Sol:

$$\text{forbidden latency of } S1 = 5 - 1 = 4$$

$$" " " S2 = 4 - 2 = 2$$

~~S3 doesn't have forbidden latency~~

( $\because$  it is used only once)

Max forbidden latency = 4

∴ length of forbidden vector = 4

forbidden vector:

Clock 1 2 3 4

binary bit: 0 1 0 1

∴ forbidden vector: 0101

From this we can say, if latency is 2 or 4 it results in conflict

∴ Ans: 3

Method 2:

If latency = 2

I<sub>1</sub>: S<sub>1</sub> S<sub>2</sub> S<sub>3</sub> S<sub>2</sub> S<sub>1</sub>

I<sub>2</sub>: S<sub>1</sub> S<sub>2</sub> S<sub>3</sub> S<sub>4</sub> S<sub>5</sub>



conflict

If latency = 4

I<sub>1</sub>: S<sub>1</sub> S<sub>2</sub> S<sub>3</sub> S<sub>2</sub> S<sub>1</sub>

I<sub>2</sub>:



conflict

We know that latency is always greater than 1

If latency = 3

I<sub>1</sub>: S<sub>1</sub> S<sub>2</sub> S<sub>3</sub> S<sub>2</sub> S<sub>1</sub>

I<sub>2</sub>: S<sub>1</sub> S<sub>2</sub> S<sub>3</sub> S<sub>4</sub> S<sub>5</sub>

∴ Ans: 3

Binary string representing stage utilization

Stage 1 & latency = 1

Conflict in 5th clock for stage 1, blw

|                  |   |   |   |   |   |   |   |   |   |
|------------------|---|---|---|---|---|---|---|---|---|
| I <sub>1</sub> : | 1 | 0 | 0 | 0 | 1 |   |   |   |   |
| I <sub>2</sub> : | 0 | 1 | 0 | 0 | 0 | 1 |   |   |   |
| I <sub>3</sub> : | 0 | 0 | 1 | 0 | 0 | 0 | 1 |   |   |
| I <sub>4</sub> : | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |   |
| I <sub>5</sub> : | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |

instructions I<sub>1</sub> & I<sub>5</sub> for  
latency = 1

latency = 2

|                  |   |   |   |   |   |   |   |   |   |
|------------------|---|---|---|---|---|---|---|---|---|
| I <sub>1</sub> : | 1 | 0 | 0 | 0 | 1 |   |   |   |   |
| I <sub>2</sub> : | 0 | 0 | 1 | 0 | 0 | 0 | 1 |   |   |
| I <sub>3</sub> : | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |

conflict for stage 1 in 5th clock  
blw instruction I<sub>1</sub>&I<sub>3</sub> for  
latency = 2

## Micro Operations & Control Signals



These signals can be for both internal/external devices

∴ Control signal

$$C = F(T, F, O, I)$$

Information (MOP)



Mop: micro operation

$$f = x_1, \text{ if } x_2 = 1$$

$$f = 0, \text{ if } x_2 = 0$$

∴ x<sub>2</sub> acts like a gate

∴ Control signal do the job of allowing or blocking the information.

## Program - Instruction - Micro Operation

- \* program is collection of sequential instructions.
  - \* Each ~~seq~~ instruction is collection of sequence of sub operation  
 i.e. (Fetch, Exec, interrupt)
  - \* The time req to execute an instruction is called instruction cycle.
  - \* Each suboperation is sequence of elementary basic operations (atomic) which are known as micro operations. These operation can't be further ~~de~~ divide.
  - \* The collection of microoperations of fetch is called fetch microprogram.
  - So we have execute microprogram & interrupt microprogram.
  - \* Microoperation is smallest unit of instruction.
  - \* Microoperations are denoted with RTL (Register transfer language)
  - \* Each microoperation requires set of control signals which needed at the same time.  
 So CU generates these signals.
  - If CU uses only H/w to generate signals, this is called H/w CU design. This is RISC system.  
 Also since only H/w is involved it is faster.
  - If CU uses memory & H/w to generate signals, this is called Memory control Unit. This is CISC system.  
 Since memory is involved it is slower.
- Control Memory

# Register Transfer Language

Condition : Micro-operation

Condition : destination  $\leftarrow$  source

Mop: Transfer data from register  $R_1$  to register  $R_2$  at clock<sub>1</sub>

RTL:  $T_1 : R_2 \leftarrow R_1$



Swap at S/W level.

$T_1 : R_1 \leftarrow R_2, R_2 \leftarrow R_1$



S/W level

$$\begin{aligned} \text{temp} &= a \\ a &= b \\ b &> \text{temp} \end{aligned}$$

Swapper

Here  $R_1$ ,  $R_1$  out,  $R_2$ ,  $R_2$  out are control signals.

So in this way we can swap the contents in one microoperation.

Let us con

Q57 what is the proper arrangement of following instructions if they are implemented in 5-stage instruction pipeline (IF, IP, EXE, MEM, WB) for optimal performance.

I1: LD R2, 100(R1)

I2: ADD R3, R1 + R2

I3: LD R4, 100(R5)

I4: ADD R6, R4 / R7

a) I1, I2, I3, I4

b) I1, I3, I2, I4

c) I1, I3, I4, I2

d) I3, I2, I1, I4

→ Single

Now for

for



Executing I<sub>2</sub> after I<sub>1</sub> gives a stall

But executing I<sub>4</sub> after I<sub>3</sub> also gives a stall

However, I<sub>1</sub> is independent from I<sub>3</sub> & I<sub>4</sub>

I<sub>2</sub>    "    "    "    "

∴ I<sub>1</sub> I<sub>3</sub> I<sub>2</sub> I<sub>4</sub>

→ and we're not getting stalling of 2nd register bank after instruction

I<sub>3</sub> fills the stall

∴ Add after I<sub>3</sub> → opt(b)

Let us consider a basic system with

PC, IR, acc, memory, MAR, MDR

Constraint: Memory is accessible only to MAR & MDR

→ single bus architecture is used to connect registers & memory

Now for this hypothetical system, we develop microinstructions

for

→ fetch, branch, writeback off bus etc.

→ Execute: ADD, Branch, RET etc.

→ INTR (interrupt)

Fetch → PC → RAM → F

(FIFO) PROM → RAM → F

F → PC → RAM → F

Execution unit

Memory



Bus controller tells which register has to put data on the bus or take data from the bus

sizes of registers

$$PC = MAR, \quad IR = ACC = MDR$$

### Fetch operation:

- \* Get instruction into IR from Memory ~~and~~ using the location given by PC.

steps:

(i) Copy instruction address from PC into MAR

(ii) Read the instruction from memory to MDR

(iii) Copy the instruction from MDR to IR

(iv) Update PC to next instruction address

### RTL notation:

$$T_1: MAR \leftarrow PC$$

$T_1$  - at clock 1

$$T_2: MDR \leftarrow \text{memory } (M[MAR])$$

$$T_3: IR \leftarrow MDR, \quad PC \leftarrow PC + 1 \rightarrow \text{two micro operations are done in parallel.}$$

## Execution phase Micro program

65

- Micro operation here vary from instruction to instruction.
- IR is having the instruction before executional micro programs is activated.

### ADD X instruction executional phase micro program:

- ① Add contents of memory location  $X$  to the accumulator and store the result in accumulator.

Steps:      IR: 

|    |   |
|----|---|
| op | X |
|----|---|

 → address

i) place address  $X$  into MAR

ii) Read the memory contents of location  $X$  into MDR

iii) Add MDR with accumulator. and hold result in accumulator.

### RTL:

$$T_1: \text{MAR} \leftarrow \text{IR}[\text{operand ref}] \quad \left. \begin{array}{l} \text{operand fetch} \\ \text{addressing} \end{array} \right\}$$

$$T_2: \text{MDR} \leftarrow \text{Memory}$$

$$T_3: A \leftarrow A + \text{MDR}$$

### Microprogram for execution phase of $\overbrace{\text{BUN } X}$

↳ Branch Unconditional

- Control to be transferred from current instruction to the location  $X$  where the next required instruction is present.

### Steps:

PC is to be updated to  $X$

### RTL:

$$T_1: \text{PC} \leftarrow \text{IR}[\text{reference}]$$

## Interrupt Microprogram

- All instructions check for interrupt at the end of instruction cycle.
- When interrupt occurs the current PC contents have to be saved in predefined memory location (say)  $X$ . ↳ i.e., return address.
- Then transfer control to beginning instruction address of ISR.



So we need to store PC in memory & update PC.

### RTL:

- Saving return address in memory
- |                                                                                  |                                                                        |
|----------------------------------------------------------------------------------|------------------------------------------------------------------------|
| $T_1: MAR \leftarrow X$ <small>(X is predefined address)</small>                 | $T_2: MDR \leftarrow PC$ <small>for saving return address (PC)</small> |
| $T_3: Memory \leftarrow MDR$                                                     |                                                                        |
| $T_4: PC \leftarrow Y$ <small>now PC points to first instruction of ISR.</small> |                                                                        |

### Microprogram for RET:

- It takes return address from memory location  $X$  and control transfers to that location.

### RD:

#### Steps:

- Go to location  $X$  by placing  $X$  into MAR and collect return address in MDR.
- Update PC with  $X$  which is in MDR.

RTL:T<sub>1</sub>: MAR  $\leftarrow$  XT<sub>2</sub>: MDR  $\leftarrow$  MemoryT<sub>3</sub>: PC  $\leftarrow$  MDRNote:

Every register has in & out control signals. If "in" is high then value are stored into register. If "out" is high then data in the register is placed on the bus.

Eg: R<sub>1</sub>  $\leftarrow$  R<sub>2</sub>

The above instruction stores value of R<sub>2</sub> in R<sub>1</sub>.

R<sub>1</sub> in & R<sub>2</sub> out signals have to be high simultaneously.

~~So control logic takes 2 inputs i.e. R<sub>1</sub> & R<sub>2</sub>~~

In this way control signals implement micro operations.

$\rightarrow$  MDR  $\leftarrow$  Memory

$\Rightarrow$  MAR<sub>out</sub>, MDR<sub>in</sub> are high, MEMR are activated  
 $\qquad\qquad\qquad$   $\hookrightarrow$  memory read

So in this way for every ~~control~~ in microprogram CU activates appropriate control signals

## Control Unit Design

$\rightarrow$  CU is responsible to identify which control signal is required when and have a mechanism to activate that control signal at that time.

## H/W CU Design

- Each control signal is expressed as SOP (sum of Product Expression) and realized using dedicated hardware.
- control signal,  $c$  is a function of  $\text{clock}(T)$ , Instruction Phase ( $I_{ph}$ ) and for which instruction.

Control signal,  $c = \cdot F(T, I_{ph}; I)$

$$\text{where } T = \{T_1, T_2, T_3, \dots, T_k\}$$

$$I_{ph} = \{\text{fetch}, \text{Exe}, \text{interrupt}\}$$

$$I = \{I_1, I_2, I_3, \dots, I_m\}$$

Eg: Consider H/W CU design which support 2 instructions

$I_1, I_2$  and using 4 control signal  $C_1, C_2, C_3, C_4$

Let each instruction require 4-micro operations.

The following table describes the above details

| Mop   | Control Signals |            |
|-------|-----------------|------------|
|       | $I_1$           | $I_2$      |
| $T_1$ | $C_1, C_2$      | $C_1, C_4$ |
| $T_2$ | $C_2, C_4$      | $C_1, C_3$ |
| $T_3$ | $C_2, C_1$      | $C_1, C_3$ |
| $T_4$ | $C_4, C_1$      | $C_4, C_2$ |

Here  $C$  is function of  $\text{clock}(T)$  & instruction  $I$ .

$$C = F(T, I)$$

$$\text{where } T = \{T_1, T_2, T_3, T_4\}$$

$$I = \{I_1, I_2\}$$

68 SOP for Control signals

69

$$C_1 = I_1 T_1 + I_2 T_1 + I_2 T_2 + I_1 T_3 + I_2 T_3 + I_1 T_4$$

$$C_2 = I_1 T_1 + I_1 T_2 + I_1 T_3 + I_2 T_4$$

$$C_3 = I_2 T_2 + I_2 T_3$$

$$C_4 = I_2 T_1 + I_1 T_2 + I_1 T_4 + I_2 T_4$$

on Photo  
(I<sub>ph</sub>)

Now consider simplification of these SOPs

$$C_1 = \underbrace{I_1 T_1 + I_2 T_1}_{\text{common}} + \underbrace{I_2 T_2 + I_1 T_3 + I_2 T_3 + I_1 T_4}_{\text{common}}$$

At  $T_1$ ,  $C_1$  is required for both  $I_1$  &  $I_2$   
only at  $T_3$

$$\therefore C_1 = T_1 + I_2 T_2 + T_3 + I_1 T_4$$

$$C_2 = I_1 T_1 + I_1 T_2 + I_1 T_3 + I_2 T_4$$

$$C_3 = I_2 T_2 + I_2 T_3$$

~~$$C_4 = I_2 T_1 + I_2 T_3$$~~

$$C_4 = I_2 T_1 + I_1 T_2 + T_4$$

Ex:

| MOP   | Control |       | Signals |
|-------|---------|-------|---------|
|       | $I_1$   | $I_2$ |         |
| $T_1$ | $C_1$   | $C_1$ |         |
| $T_2$ |         | $C_1$ |         |
| $T_3$ |         | $C_1$ |         |
| $T_4$ |         | $C_1$ |         |

$$\text{Now } C_1 = \underbrace{I_1 T_1 + I_2 T_1}_{\text{common}} + \underbrace{I_2 T_2 + I_2 T_3 + I_2 T_4}_{\text{common}}$$

~~NEED~~  
at  $T_1$ ,  $C_1$  is activated

& for  $I_2$ ,  $C_1$  is always activated

$\therefore C_1$  can be minimized as  $C_1 = T_1 + I_2$

Q58



Consider an instruction  $R_0 \leftarrow R_1 + R_2$ . The following steps are used to execute it over given data path. Assume that PC is incremented appropriately. The subscripts  $g$  and  $w$  indicate read and write operations respectively.

1.  $R_2_{g1}$ , TEMP1<sub>g1</sub>, ALU<sub>add</sub>; TEMP2<sub>w1</sub>
2.  $R_1_{g1}$ , TEMP1<sub>w1</sub>
3.  $PC_{g1}$ , MAR<sub>w</sub>, MEM<sub>g1</sub>
4. TEMP2<sub>g1</sub>,  $R_0_w$
5. MDR<sub>g1</sub>, IR<sub>w</sub>

which one of the following is the correct order of execution of the above steps.

- a) 2,1,4,5,3
- b) 1,2,4,3,5
- c) 3,5,1,2,4
- d) 3,5,1,4,2

Sol:

Initially Instruction fetch has to be done  
so 3,5 must be the first instructions

~~now operands has to~~

Now fetching operand  $R_1$ , (∴ 2)

Now fetch  $R_2$  & add. (∴ 1)

Now store (∴ 4)

∴ opt (c)

## Advantages of H/w CU:

- Since every thing is H/w level implementation, control signal can be generated parallelly and hence it is faster. It is used in RISC system.

## Disadvantage:

- Even a minor modification requires re-design and re-connection of H/w component. ∴ not flexible.
- More cost of implementation.

→ Hence H/w CU is not suitable for design & testing.

## Micro-Programmed CU Design :

- Binary pattern of control signals is stored in control memory.  
↳ (Control word)
- Control signals are generated after accessing control word from control memory and using hardware.
- Any changes confine to binary pattern and ~~hardware~~ hardware remains unchange. Hence this design is flexible.
- Based on how the binary pattern is derived Micro program can be
- horizontal
  - vertical
  - diagonal

### i) Horizontal Micro program

→ It assigns 1-bit per each control signal



control word for first microoperation of instruction  $I_1$ .

Length of control word = no of control signals in the system.

#### Drawback:

→ Control words are lengthy.

i.e., for example if a system contains 64 signals, and it requires only 1 signal, still we need to allocate 64 bits for it.

Thus if length is more, access time is more.

→ Control memory is inefficiently utilized.

i.e., most of the bits in control word are 0's.

Advantage: Provides higher degree of parallelism (DOP)

DOP: No of control signals per control word.

These values are written from previous example under H/w design



MCAR: Micro Control Address Register

MDR: Micro Control Data register

→ Control Memory holds one control word for every micro operation.

No of control words = no of Map

Eg: If an ISA has 256 instructions in which each instruction has 16 Mop, then find no of control words.

$$256 * 16 = 4096$$

∴ 4K Control words

## (ii) Vertical Micro programming

→ Control signal binary pattern is encoded and stored in the memory.

Thus if we have  $k$  control signals

then control word require  $\log k$  bits

i.e., size of control word =  $k$ .

→ Hence it results in shorter control words.

Eg:  $C_1 = 00$

$C_2 = 01$

$C_3 = 10$

$C_4 = 11$

→ However for generation control signal, external decoder is need in addition to hardware.

→ So it has shorter control word, but cost is more.



- 74
- Here each signal is generated one after other. i.e.,  $c_1, c_2, c_3, c_4$  in  $1\mu s$
- For example consider we need  ~~$c_1, c_2$~~  in  $1\mu s$ .
- If they are generated ~~sequential~~, parallelly then  $4\text{MHz}$  clock will be sufficient.

However if it is sequential then we need ~~4 MHz~~

- $4\text{MHz}$  (time period =  $0.25\mu s$ )



Hence vertical micro programming require ~~not~~ faster clocks.

#### Limitations:

- Degree of parallelism  $\geq 1$  (i.e., 1 control signal / word)
- Entire control word can generate only one signal at a time.

## Diagonal Micro programming

- Control signals are encoded function wise.
- Hence control wordsize is bigger than vertical and smaller than horizontal microprogramming

~~Assume we have~~

- Here we divide control function signals based on the function they do. ~~do~~ ~~function~~ ~~function~~ ~~function~~ ~~function~~

Eg: Assume we have 5 different functions and 64 control signals. Let distribution of control signals, function wise, be as shown below

$$F_1 - 16, F_2 - 32, F_3 - 8, F_4 - 4, F_5 - 4$$

now size of control word is

|       |       |       |       |       |
|-------|-------|-------|-------|-------|
| 4     | 5     | 3     | 2     | 2     |
| $F_1$ | $F_2$ | $F_3$ | $F_4$ | $F_5$ |

(i.e., 16 bits)

Here  $DOP = 5$

$\rightarrow DOP$  is higher than vertical & less than horizontal.

### Control Word Sequencing

$\rightarrow$  Unless all control words for all micro operations are generated in the sequence, the instruction will not be implemented.

$\rightarrow$  Control word sequencing takes the responsibility of providing address of next word to be accessed from control memory.



A portion of CDR is used to get the next address for CAR.

$\rightarrow$  Each control word has link to next address to be accessed.

$\rightarrow$  The last instruction has NIL.

$\rightarrow$  1-Address control instruction is used in most designs.



→ If condition is satisfied, next control word address given by address field. If condition is not satisfied, next CWA address is obtained by incrementing CAR.



**Q59** Horizontal microprogrammed CU has 256 instructions in ISA.

Each instruction requires 16 MOPS. The system supports 16-Flag conditions and 32 control signals. The control word format is:



i) Find no of bit in control words

ii) In byte organized CM how many bytes are needed for control memory

|      |    |         |
|------|----|---------|
| Cond | CS | Address |
| 4    | 32 | 12      |

$$\text{No. of control words} = 256 \times 16 = 2^{12}$$

∴ size of address = 12

In horizontal Mprogramming, CS is 32 bits

$$16 \text{ conditions} \Rightarrow \log_2 16 \text{ bits} = 4 \text{ bits}$$

i) No of bits = 48

ii) No of bytes = 6

ch size of control memory

$$= 4 \text{ k} \times 48 \text{ bits}$$

$$= 4 \text{ k} \times 6 \text{ bytes}$$

$$= 24 \text{ kB}$$

(Q6) Control signal field is divided in 6 groups.

If no of control signals per group is given below

| $g_1$ | $g_2$ | $g_3$ | $g_4$ | $g_5$ | $g_6$ |
|-------|-------|-------|-------|-------|-------|
| 14    | 15    | 7     | 6     | 5     | 6     |

what will be the min no of bits save for control field w.r.t to horizontal Mprogramming.

sol:

$$\text{total no of control signals} = 14 + 15 + 7 + 6 + 5 + 6 \\ = 53 \text{ bits}$$

$$\therefore \text{no of bits in horizontal} = 53$$

$$\text{no of bits in diagonal} = \boxed{4 \ 4 \ 3 \ 3 \ 3 \ 3}$$

$$\therefore \text{no of bits in diagonal} = 4 + 4 + 3 + 3 + 3 \\ = 20$$

$$\therefore \text{no of bits saved} = 53 - 20 = 33$$

$$\text{no of bits for vertical} = \log_2 53 = 6$$

$$\text{here no of bits saved} = 53 - 6 = 47$$

06/09/20

Q61 Find no of clock req for  $R_1 \leftarrow R_1 + R_2$  i.e., ADD  $R_1, R_2$ .

The instruction has to be fetched from memory.



operand fetch into temporary ALU register temp1. Result after addition to be moved from Temp2 to  $R_1$ . Assume that any transfer using data bus takes one clock

Sol:

$\text{MAR} \leftarrow \text{PC}$  Add doesn't require any clock  
 $\text{MDR} \leftarrow \text{Memory}$   
 $\text{IR} \leftarrow \text{MDR}$   
 $\text{temp1} \leftarrow R_2$  Add doesn't require any clock  
 $\text{temp2}_{\text{in}}, \text{ADD}$   
 $R_2 R_1 \leftarrow \text{temp2}$   
∴ 6 clocks

### Nano Programmed Control Unit:

→ It is most flexible CU design and uses two level

- control memory
  - i.e., (i) Micro-Control Memory.
  - (ii) Nano-control memory.

→ Micro-control memory is responsible for control word sequencing.

→ Nano-control memory is responsible for control signal generation.

→ It follows horizontal micro-programming but reduces effective control memory. This due to saving only distinguishable micro operations.

control words in nano control memory.

- Because of 2-level control memory it is the slowest CU design.
  - It is used to store bootstrap programs or firmware of the application programs.
  - It provides same degree of parallelism as that of horizontal programming.
  - Effective Control memory size = micro control memory size + nano control memory size
- $$= H_M (\log H_M + \log H_N) + H_N * P$$

$H_M$ : total words in micro control memory

$H_N$ : total words in nano control memory

$P$ : no of control signal to be generated.



- Nano Control memory has distinguish control words and micro control memory by ~~do~~ references to those

| Characteristic          | Control Unit Arrangement<br>(in Descending Order) |
|-------------------------|---------------------------------------------------|
| 1. Speed                | H/w, H <sub>μ</sub> P, V <sub>μ</sub> P, NPCU     |
| 2. flexibility          | NPCU, V <sub>μ</sub> P, H <sub>μ</sub> P, H/w     |
| 3. Control Memory Space | H <sub>μ</sub> P, Nano, V <sub>μ</sub> P, H/w     |

## Assembly Program Tracing

- find contents of the register
- " " " specific memory locations
- find memory space arrangements

Time Complexity & Space Complexity  
(clocks)

Tracing will consider "program flow" which depends on one or more key instructions

MOV R1, (8000)

Loop: MOV R2, 0(R3)

ADD R2, R1

MOV 0(R3), R2

INC R3

DEC R1

BNZ Loop

HALT

Memory location 3000 has value 10

$$R_3 = 2006$$

Memory contents from 2006 to 2016 are having value 100.

j) How many memory accesses are done for data

- a) 10    b) 11    c) 20    d) 21

so is the new book just out now

Move R1, (3000) -----

loop: MOV R2, 0(R3) ... |  
ADD R2, R1  
MOV 0(R3), R2 ... | } 2 accesses within loop  
INC R3  
DEC R1  
BNZ loop

HAL



loop: R<sub>2</sub> 



From value of  $R_1$  we can say that

the loop executes 10 times

$$\therefore 1 + 2810 = 21$$

d) when machine halts what is the value in memory location 2010

- a) 100 b) 50 c) 109 d) 110

when the program is about to halt the value in R<sub>1</sub> is

ADD R<sub>2</sub>, R<sub>1</sub> ---- R<sub>2</sub> [10]

$$R_2 = 100 + 1 = 101$$

~~(Q1)~~ MOV O(R<sub>3</sub>), R<sub>2</sub> But loc 2010 is not affected : opt @ 2009 [lo1]

(iii) what are the final values of  $R_1, R_2, R_3$

$$R_1 = 0$$

$$R_2 = 101 \text{ (from (ii))}$$

$$R_3 = 2010$$

(iv) If the program is stored in word organized memory

Instruction fetch takes 2 clocks / word and execution for memory related instruction is 2 clocks and for other one clock. How many clocks are req for this non-pipelined implementation. Assume HALT instruction req only fetch.

Sol:

$$\text{MOV } R_1, (3000) \dots 2*2 + 2 = 6 \text{ clocks}$$

$$\text{loop: MOV } R_2, 0(R_3) \dots 2 + 2 = 4 \text{ clocks}$$

$$\text{ADD } R_2, R_2 \dots 2 + 1 = 3 \text{ clocks}$$

$$\text{MOV } 0(R_3), R_2 \dots 2 + 2 = 4 \text{ clocks} \quad \left. \begin{array}{l} \\ \\ \end{array} \right\} 22 * 10 = 220$$

$$\text{INC } R_3 \dots 2 + 1 = 3 \text{ clocks}$$

$$\text{DEC } R_1 \dots 2 + 1 = 3 \text{ clocks}$$

$$\text{BNZ loop} \dots 4 + 1 = 5 \text{ clocks}$$

$$\text{HALT} \dots 2 \text{ clocks}$$

$$\therefore 220 + 6 + 2 = 228 \text{ clocks}$$

(v) If the above program is stored in byte organized memory from decimal address 1000 onward. If interrupt is occurred during INC R3. what will be the return address (i.e., the address saved in stack). Instruction word length is 32 bits.

Sol:

MOV R<sub>1</sub>(3000) --- 1000 to 1007 word = 4 bits (bytes)

loop: MOV R<sub>2</sub>, O(R<sub>3</sub>) --- 1008 to 1011 here is byte organized

ADD R<sub>2</sub>, R<sub>1</sub> --- 1012 to 1015

MOV O(R<sub>3</sub>), R<sub>2</sub> --- 1016 to 1019

INC R<sub>3</sub> --- 1020 to 1023

DEC R<sub>1</sub> --- (1024) to 1027

∴ Interrupt is occurred during INC R<sub>3</sub>, execution of INC R<sub>3</sub> will be finished and starting address of DEC R<sub>1</sub> will be saved.

(ii) If interrupt occurs during BNZ instruction, possible return addresses are

Sol:

R<sub>1</sub> = 0 --- then 1008

R<sub>1</sub> = 1 --- then 1038

(iii) If interrupt occurs during halt, the return address is

- a) 1036 b) 1039 c) 1040 d) 1037

Sol:

HALT instruction means same instruction is repeated again and again. i.e., in the implementation PC never gets incremented.

∴ Beginning address of HALT is return address

i.e., 1036

### Register Allocation

→ If variables are maintained in CPU registers it takes less time to access & process data.

↑ However the no. of registers are limited. Hence allocation plays a significant role.

→ Register allocation policy minimizes the no of spills. A spill is required when a live variable is stored into memory.

→ two allocation policies

Local Register allocation

Global register allocation

### Local Register allocation:

#### Belady's algorithm

→ Place all the registers in free list as and when it is allocated place that in allocated list.

i) A variable request for register allocation consult free list, if free register is available allocate it. Also map the register to allocated list.

ii) If no free register is available (i.e., free list is empty) then identify that allocated register which contains 'dead' variable. Release it and allocate to the variable.

iii) If no dead variable is found then identify the allocated register whose variable is required after longest time. Spill it & allocate it.

but maps between variable & register pattern is saving to memory.

Q63

Load & store instruction are only supported memory operations.

Arithmetic operation are performed only on register operands.

If no spill is allowed min no of register req is \_\_\_\_\_



Method 1:



$\therefore 3$  registers.

Method 2:

$$R_1 = a$$

$$R_2 = b$$

$$R_1 = R_1 - R_2$$

$$R_2 = c$$

$$R_3 = d$$

$$R_2 = R_2 + R_3$$

$$R_3 = e$$

$$R_2 = R_3 - R_2$$

$$R_1 = R_1 + R_2$$

- (ii) Now for the same question, if one memory operand is allowed then min no of register req is :



$\therefore 2$  registers

\* Q64 \* Consider expression  $(a-1) * ((b+c)/3 + d)$  for a load-store architecture, arithmetic instructions can have only register or immediate operand. Min no of register req for optimal code generation (without spill) is \_\_\_\_\_

Sol:

~~from~~ Multiplication is commutative

$$((b+c)/3 + d) * (a-1)$$

$$\begin{array}{l} R_1 = b \\ R_2 = c \\ R_1 \leftarrow R_1 + R_2 \\ R_1 \leftarrow R_1 / 3 \\ R_2 \leftarrow d \\ R_1 \leftarrow R_1 + R_2 \end{array} \quad \left| \begin{array}{l} R_2 = a \\ R_2 \leftarrow R_2 - 1 \\ \dots \\ R_2 \leftarrow a \end{array} \right.$$

$$R_1 = R_1 * R_2$$

Q65)  $x = a + b * c - d / f$

No spill allowed

Operands must be registers.

Find min no of registers req.

Sol:



## IO Organization

- CPU has to interact with external devices using IO ports.
- CPU can interact with external devices only through address.  
i.e., addresses assigned to devices.
- Assigning addresses to IO devices can be done in two ways

### i) Memory Mapped IO

→ address space is partitioned into memory space & IO space



→ adv:

\* all addressing modes can be extended to I/O

\* address are distinct. So just from the address we can say whether it is I/O address or memory address.

∴ There is no need of address separation

∴ i.e., no need of IO/M

Disadv:

\* less I/O expandability.

~~Exchange in I/Os~~

### ii) IO Mapped IO (Isolated IO)



Disadv:

\* Addresses are not distinct. So address separation is needed.

i.e., IO/M control bit is needed.

\* Limited IO address & IO addressing mode  
∴ limited flexibility

In the case of RISC, only direct addressing mode is possible.

Adv:

→ More Expansion of IO.

→ How to move the data from/to IO devices?

3 ways:



### (i) Program Driven Data Transfer

→ Simple & useful for all devices.

→ It will get the status device & check whether device is ready or not. This checking goes on until the device is ready.



→ Program instructions are used to transfer a word.

Disadv: Most of the time of CPU is spent in waiting

### iii) Interrupt Driven Transfer:

- Whenever device is ready, it will interrupt processor  
(i.e., Device controlled data transfer)
  - So here CPU will not be in the loop



- Processor (CPU) is most efficiently used in interrupt driven data transfer.

Drawback:

More complexity in interrupt implementation.

## Interrupt implementation issues



.. w.r.t devices

- when it should interrupt?

Asynchronous I/O API

i.e., it can interrupt at any time.

- How to interrupt?

By changing signal at H/w pin

(, level trigger (or) edge trigger)

w.r.t processor

- when to recognize

Complete the execution of current instruction  
& recognize.

→ How to recognize?

\* Whenever interrupt occurs interrupt flag is set.

\* At the end of every instruction the flag is checked.

→ What to do after recognizing?

\* Save return address & branch to ISR

\* Saving return address is done at a default location

\* Use the device information to know the location of the ISR. Such devices which tell the location of the interrupt are called vector interrupt device.

\* Second way to find the location of ISR is using a default address.

Such devices are called Scanner Interrupt Device.

→ How to resolve if more than one interrupt occurs?

use priorities & serve highest priority interrupt.

→ Who will assign the priority? and what kind?

\* OS may assign them.

\* Priorities

→ Static

Dynamic

Daisy chain

Polling

Polling

## Daisy chain :

highest priority devices are connected closer to the processor.



- \* Here the position decides the priority.
- \* Starvation

## Polling :



PIC: Priority Interrupt Controller.

- Every Device has its own INTR & INTA line
- Services are done in a round robin fashion  
∴ no starvation

→ Now when one of multiple interrupts are served what about other request?

## Pended mode:

- Save in interrupt queue & serve
- Non-pended

→ others are Rejected.

### (iii) Direct Memory Access (DMA)

- Here processor is isolated & DMA controller coordinates data transfer.
- Processor & DMA controller use the system bus.  
(Address bus, data bus & control bus) in master-slave mode

- During DMA controller is Master & during normal operation processor is the master.

→ DMA mode is based on when system bus is returned to CPU.

#### Burst mode

→ return after entire data is transferred

\* preferred for

#### Cycle stealing mode

→ After transferring each word.

\* preferred for lower IO burst.

#### Block transfer mode

→ After a block transfer.

→ Count register specifies no of words of the

word block to be transferred.

If it is 8 bit then, max possible for transfer is

$$2^8 = 256 \text{ words.}$$

→ DMA deprives ~~of~~ the system bus availability to CPU, hence

it will be blocked due to DMA.

→ In cycle stealing mode

$$\% \text{ time processor block} = \frac{t_x}{t_x + t_y} * 100$$

t<sub>x</sub>: word transfer time t<sub>y</sub>: word preparation time.

during word preparation CPU can use the system time but not ~~do~~ during word transfer time.



Here if word is 8 bytes then

$$ty (\text{word prep}) = 8 * 1 = 8 \mu s$$

$$t_x = 2 \mu s$$

$$\% \text{ time CPU blocked} = \frac{2}{10} * 100 = 20\%$$

~~During DMA~~

→ In burst mode the system bus is blocked entirely for DMA.

$$\therefore \text{time of CPU contribution} : \frac{tx}{tx + ty} * 100$$

$tx$ : CPU time for initialization & termination of DMA

$ty$ : Data transfer time by DMA

→ Acknowledgement of interrupt is done at the end of each sub-operation of the instruction.

⇒ In case of DMA there is no need to save address and all ~~like int~~ unlike the case with ~~int~~ interrupt.

Q66) Consider a disk with data transfer rate 50 MBPS. It is operated with cycle stealing mode of DMA. Here whenever 64 bit information is available it transferred in 40 ns. what is the % time CPU blocked due to DMA.

Sol:

50 MBPS

$$\frac{50 \text{ MB}}{1 \text{ B}} = \frac{1}{\frac{50 \times 10^6}{50 \times 10^9}} = \frac{1000}{50} = 20 \text{ ns}$$

64 bit = 8 bytes --- 160 ns

$$\therefore \% \text{ time blocked} = \frac{40}{40 + 160} = 20\%$$

Q67) In Q66 if DMA controller is having 14 bit count register and it is required to transfer 64 MB. minimum no. of times system bus exchanged b/w DMA controller and CPU is \_\_\_\_\_

Sol:

Max no of bytes that can be transferred at once

$$= 2^{14} \text{ bytes} \quad (\text{for min, we need to transfer } 2^{26} \text{ bytes})$$

$$\text{Total bytes} = 2^{26} \text{ bytes}$$

For every transfer the bus is exchanged twice

$$\therefore \text{no of transfers} = \frac{2^{26}}{2^{14}} = 2^{12} = 4096$$

$$\therefore \text{no of exchanges} = 4096$$

Q68 A processor is operating with 50MHz clock. The DMA requires 1000 clocks for initialization & termination. The disk with data transfer rate 1MBPS is used to transfer 2KB file using DMA. What is % of CPU time used in this transfer?

Sol:

This is burst mode

$$\Rightarrow \text{time period} = \frac{1}{50 \times 10^6} = 20 \text{ ns}$$

$$\therefore 1000 \text{ clocks} = 1000 \times 2 \times 10^4 \text{ ns} \\ = 20 \mu\text{s} \quad \text{--- CPU is used}$$

A efficient priority: idle time of CPU

$$1 \text{ MB} \quad 1 \text{ s}$$

(except transfer of 2KB)  $\rightarrow 1 \text{ ms}$

$$2 \text{ KB} : 2 \text{ ms} \quad \text{--- CPU is blocked}$$

A efficient interrupt: time CPU used

$$= \frac{20}{20 + 2} \text{ ms} \approx 0.96 \cdot 1$$

$\approx 0.96 \text{ s}$

$$= \frac{20}{20 + 2} = \frac{1}{10} \approx 0.96 \cdot 1$$

Q69 Let variables A, B, C, D denote CPU temperature high, non-maskable interrupt, high priority and maskable words. Which is the correct relation?

- a)  $A \rightarrow BC$
- b)  $A \rightarrow CD$
- c)  $A \rightarrow BD$
- d)  $A \rightarrow BCD$

Sol:

CPU high temperature issue has high priority.

and thus it will be non-maskable

$$\therefore A \rightarrow BC$$

## Disk Memory

- \* Semi Random access memory
- \* information is recorded in terms of tracks & sectors.
- \* unit of transfer in disk is sector.
- \* Based on the recording density the disk construction can be
  - (i) Constant track capacity (default consideration)
  - (ii) variable track capacity
- \* In constant track capacity variable recording density is used (max for innermost track and min for outermost track)

Here disk will have different angular velocity  $\omega_r$  for different tracks. This changing of velocity is called angular velocity modulation. (Angular velocity is more for outer tracks and less for inner tracks)

However, track coverage time is same.

### \* Var In variable track capacity

- inner tracks will have lesser data
- recording density is constant across every track.
- Angular velocity is constant

so it takes less time for covering inner tracks.

### \* To perform read/write on disk

- i) Read/write head is moved onto the concerned track.  
(seek time)

(ii) Disk is rotated so that concerned sector is under read/write head. (Rotational latency)

(iii) perform data transfer.

∴ total time for disk access = seek time + rotational latency + data transfer time.

→ In general seek time  $\gg$  rotational  $\gg$  transfer latency

∴ Disk latency  $\approx$  seek time + rotational latency

→ the capacity of disk

$$C = T * S * B$$

T: no of tracks  $\Rightarrow$  number of radial sectors

S: sectors / track

B: bytes / sector.

→ DISK Addressing (sector is addressed)

| track num | sector num |
|-----------|------------|
|-----------|------------|

$\log_2 T$  bits

$\log_2 S$  bits

→ If format overhead per sector  $k$  bytes,

$$\text{the disk capacity} = C - (T * S * k)$$

→ Data transfer rate:

No of bytes transfer per unit time.

In one rotation disk covers one track

$\therefore$  in one rotation time it covers one track capacity.

→ Taking DMA into consideration

for word preparation time we consider the ~~data~~ time from transfer rate (calculated through rotation speed)

## Disk System

A disk in which both sides' data is stored is called platter.

→ It is collection of disk stacked on one another.

Information is distributed in terms of cylinders.

→ cylinder is logical collection of similar position tracks.

→ Thus in one rotation entire cylinder is covered.

→ Sectors are addressed cylinder wise. Each sector has a logical address and linear address.

logical address  $(C, H, S)$

C - cylinder No.

H - Surface No.

S - (sector No.)

Cylinders are no number from inward to outward

→ Each disk has 2 surfaces.

→ Relation b/w linear & logical address

last logical add  $(C, H, S)$

linear add  $L = C * m + H * n + S$

$m$  = no of sectors per cylinder

$n$  = no of sectors per track

If there are  $p$  surfaces then

$$m = p * n$$

→ ~~logical~~ (linear to logical)

Linear address =

$m) \downarrow (c \dots \text{cylinder number})$

$n) \downarrow g_1 \quad (H \dots \text{Surface number})$

$g_2 = S \dots \text{sector number}$

→ capacity of disk system → no of surfaces

$$= p * T * S * B$$

(a)

$$= K * M * B$$

no of cylinders      sectors/cylinder      bytes/cylinder

→ no of bits for addressing disk system

| surface<br>NO | track<br>NO | sector<br>NO |
|---------------|-------------|--------------|
|---------------|-------------|--------------|

$\log_2 P \quad \log_2 T \quad \log_2 S$

| cylinder<br>NO | sector<br>Num |
|----------------|---------------|
|----------------|---------------|

$\log_2 K \quad \log_2 m$

-- sector no among entire cylinder

Q8

## Hard disk

16 surfaces

16384 cylinder des,

64 sectors / track

Sector capacity is 512 bytes.

Data is organized cylinder wise. Address format  $\langle C, H, S \rangle$

A file of size 42797 KB is stored from  $\langle 1200, 9, 40 \rangle$

What is the cylinder at last sector of file.

- so) a) 1281      b) 1282      c) 1283      d) 1284

so)

Assume file is stored from 0th location

Now calculate last sector

Sectors/cylinder

$$\begin{aligned} &= 16 * 64 \\ &= 1024 \end{aligned}$$

file size = 42797 KB

sector size = 512 bytes

$$\therefore \text{file size} = 42797 * 2 = 85594 \text{ bytes}$$

sectors/cylinder

1024 ) 85594 ( 83 -- cylinder number

sectors/track

64 ) 602 ( 9 -- surface number

26 -- sector number

$\langle 83, 9, 26 \rangle$

now we add  $\langle 83, 9, 26 \rangle$  to  $\langle 1200, 9, 40 \rangle$

|      |   |    |
|------|---|----|
| 1200 | 9 | 40 |
| 83   | 9 | 26 |
| 1284 | 3 | 2  |

$$40 + 26 = 66$$

$$= 64 + 2$$

$$9 + 9 + 1 = 19 = 16 + 3$$

Method 2:

~~Calculate linear add of starting add~~

~~add no of sectors to it.~~

~~Now convert it to logical~~

~~linear add of starting add =  $1200 \times 1024 + 9 \times 64 + 40$~~

## Number Representation

Numbers represented can be

1. Fixed Point --- radix position is fixed

2. Floating Point ... radix position may change  
∴ position need not be stored

They may be signed or unsigned  
∴ position has to be stored.

→ For n-bit number, it will have  $2^n$  combinations.

These combinations have to be interpreted as per the number.

→ Fixed point numbers can be integers or fractions

fixed fraction part of binary (10001.0) (0.00101)

→ For n-bit fixed point unsigned number range is 0 to  $2^n - 1$

The decimal value =  $\sum b_i \cdot 2^i$

→ For n-bit fixed point unsigned fraction range is

0 to  $1 - 2^{-n}$

decimal value =  $\sum b_i \cdot 2^{-i}$

Eg: find max value of 8 bit fraction

Consider

$$+ 11111111 + (1-1)$$

$$v = 1/2 + 1/2^2 + \dots + 1/2^8 = \frac{\frac{1}{2}}{1 - \frac{1}{2}} = 1$$

if fraction is having infinite bits

$$\text{max value} = 1$$

Max value of 8 bit fraction

$$= \frac{1}{2} + \frac{1}{2^2} + \frac{1}{2^3} + \dots + \frac{1}{2^8}$$

$$= 1 - \left( \frac{1}{2^9} + \frac{1}{2^{10}} + \dots \right)$$

$$= 1 - \frac{1}{2^8} \left( \frac{1}{2} + \frac{1}{2^2} + \dots \right)$$

$$\therefore \text{Max value of } n\text{-bit fraction} = 1 - \frac{1}{2^n}$$

$$\therefore \text{Max value of } n\text{-bit fraction} = 1 - 2^{-n}$$

i.e., Max value of n-bit fraction =  $1 - 2^{-n}$

Here this  $2^{-8}$  ( $2^{-n}$ ) is called resolution.

i.e., minimum change required to be recognized by the system.

The resolution is equal to the weight of least significant bit.

→ resolution ↓ → accuracy ↑

E.g.: A 12 bit fraction has less resolution than 8 bit fraction.

(Q8) A real number (unsigned) has i bits for integer and j bits for fractional field. What is its range.

Sol:

$$0 \text{ to } (2^i - 1) + (1 - 2^{-j})$$

$$0 \text{ to } 2^i - 2^{-j}$$

## Signed Numbers

i) Signed - Magnitude

ii) 1's complement

iii) 2's complement

Common things:

→ All representations use most significant bit to denote sign.

0 ----+ve

1 ---- -ve

→ for all +ve number all representation have same pattern and same value.

~~extra~~

Variations

→ There are two patterns for zero in sign Mag. & 1's comp.

Sign Mag. & 1's comp have same range

whereas 2's comp have only one pattern for '0' and hence it covers one extra -veg. number in range

→ range: ~~design~~

$(-2^{n-1} - 1)$  to  $(2^{n-1} - 1)$  --- sign Mag & 1's Comp

$(-2^{n-1})$  to  $(2^{n-1} - 1)$  --- 2's comp

→ Sign Mag & 2's comp representations follow weighted system where 1's comp is non-weighted code

→ In sign-mag sign bit doesn't have any weight but in 2's comp sign bit has -ve weight.

- (882) An 8-bit binary pattern ~~have~~ has value of  $(-47)_{10}$  in sign Mag. what its value if interpreted in 2's comp.

Sol :

(-47)<sub>10</sub>

$$(47)_{10} = (101111)_2$$

$$(-47)_{10} = 10101111$$

0101111 in 2's Co

Another problem was  $-128 + 47 = -81$ . In addition sets, this is a subtraction problem. The numbers are 128 and 47. The first number is positive and the second is negative. The result is negative.

Sign Extension : ~~Sign extension is a process of extending the number of bits of a binary number.~~

- It allows storing smaller number in bigger register.
  - Extended bits do not change either sign or value.
  - In 2's comp & 1's comp sign bits are copied in extended

## Floating Point Representation

- A floating point number contains integer part, fractional part and radix point has to be stored.
  - Most of the systems use mantissa-exponent combination to represent floating number
  - Mantissa covers both integer portion and fractional portion and represented as normalized sign-magnitude fraction  
$$(\text{fraction beginning with '1'}) - (\text{in other system beginning with non-zero})$$

- Exponent represent relative position of radix point.
- Most of the systems use biased exponent for denoting normal signed exponent (true exponent)

~~E~~ E = et biased

Ex-3 Example of finding et

Normalized fraction f

Normalized fraction f

|   | 10 | 11 | 100 | 1000 | 10000 | 100000 |
|---|----|----|-----|------|-------|--------|
| 0 | 0  | 0  | 0   | 0    | 0     | 0      |
| 1 | 0  | 0  | 0   | 0    | 0     | 0      |
| 2 | 0  | 0  | 0   | 0    | 0     | 0      |
| 3 | 0  | 0  | 0   | 0    | 0     | 0      |
| 4 | 0  | 0  | 0   | 0    | 0     | 0      |
| 5 | 0  | 0  | 0   | 0    | 0     | 0      |
| 6 | 0  | 0  | 0   | 0    | 0     | 0      |
| 7 | 0  | 0  | 0   | 0    | 0     | 0      |
| 8 | 0  | 0  | 0   | 0    | 0     | 0      |
| 9 | 0  | 0  | 0   | 0    | 0     | 0      |

Ex-4 Normalizing of floating point number  
and quantization of et

Ex-5 Normalizing of floating point number

Ex-6 Normalizing of floating point number

Ex-7 Normalizing of floating point number

Ex-8 Normalizing of floating point number

01/10/20

## Number System:

Learned about binary numbers, octal, decimal, 2's complement form.



| bit pattern | unsigned | SM | 1's | 2's |
|-------------|----------|----|-----|-----|
| 0 0 0       | 0        | 0  | 0   | 0   |
| 0 0 1       | 1        | 1  | 1   | 1   |
| 0 1 0       | 2        | 2  | 2   | 2   |
| 0 1 1       | 3        | 3  | 3   | 3   |
| 1 0 0       | 4        | -0 | -3  | -4  |
| 1 0 1       | 5        | -1 | -2  | -3  |
| 1 1 0       | 6        | -2 | -1  | -2  |
| 1 1 1       | 7        | -3 | 0   | -1  |

→ Find min no of bits req to represent  $(-32)_{10}$  in 2's comp form.

Sol:

If n is no of bits the max value represented is

$$2^{n-1}$$

$$\therefore 2^{n-1} \geq 32$$

$$2^{n-1} \geq 32$$

$$n-1 \geq 5 \Rightarrow n \geq 6 \dots$$

$$\Rightarrow n = \lceil 5.7 \rceil = 6$$

$$n = \lceil 6 \dots \rceil = 7$$

→ Find min no of bits req to represent  $(-32)_{10}$  in 2's comp

form.

Sol:

min number possible is  $-2^{n-1}$

$$\therefore -2^{n-1} \leq -32 \Rightarrow 2^{n-1} \geq +32$$

$$n-1 \geq 5$$

$$n \geq 6 \therefore \underline{6 \text{ bits}}$$

ozpoh

## Binary Codes



|      |   |
|------|---|
| 0000 | 0 |
| 0001 | 1 |
| 0010 | 2 |
| 0011 | 3 |
| 0100 | 4 |
| 0101 | 5 |
| 0110 | 6 |
| 0111 | 7 |
| 1000 | 8 |
| 1001 | 9 |

Weighted      Non-weighted

Weighted codes: 8421, BCD, Excess-3, Gray code, Excess-3 (sequential & self complement)

Non-weighted codes: Self complementing, 2's complement form

### Self Complementing Code:

In self complementing code, if the code of a digit and its complement is represented by the same number of bits, then the codes of 9's complement of each digit are 1's complement to each other.

Weighted

Excess-3: 00000 is valid. But 00000 is invalid.

|   | 8421 | Ex's -3 |
|---|------|---------|
| 0 | 0000 | 0011    |
| 1 | 0001 | 0100    |
| 2 | 0010 | 0101    |
| 3 | 0011 | 0110    |
| 4 | 0100 | 0111    |
| 5 | 0101 | 1000    |
| 6 | 0110 | 1001    |
| 7 | 0111 | 1010    |
| 8 | 1000 | 1011    |
| 9 | 1001 | 1100    |

Ex's-3 is sequential

i.e., adding '1' gives next number.

∴ Ex's-3 is both sequential &

self complementing

3321:

|   | 3321    |
|---|---------|
| 0 | 0 0 0 0 |
| 1 | 0 0 0 1 |
| 2 | 0 0 1 0 |
| 3 | 0 0 1 1 |
| 4 | 0 1 0 1 |
| 5 | 1 0 1 0 |
| 6 | 1 1 0 0 |
| 7 | 1 1 0 1 |
| 8 | 1 1 1 0 |
| 9 | 1 1 1 1 |

For numbers  $\leq 4$ , if multiple ways are possible take the one with least binary value.

Eg: Consider representing 3 in 3321

$$\begin{array}{r} 3 \ 3 \ 2 \ 1 \\ \hline 0 \ 0 \ 1 \ 1 \\ 0 \ 1 \ 0 \ 0 \\ \hline 1 \ 0 \ 0 \ 0 \end{array} \left. \begin{array}{l} \} 3 \\ \hline \end{array} \right. \begin{array}{l} 0 \ 0 \ 0 \end{array}$$

But we take 0011

(~~0011~~ matching) ~~0011~~

However the above is general convention followed to make the code self complementing

For number  $\geq 5$ , assign its complement of its 9's complement

Note:

In case of weighted codes, for every self complementing code, sum of weights = 9

→ 3321 is self complementing but not sequential.

Eg: Represent  $(863)_{10}$  in

(i) BCD    (ii) 3321    (iii) Ex's-3

(i) BCD

1000 0110 0011

(ii) 3321

1110 1100 0011

(when asked BCD,  
it generally refers  
to 8421.)

(iii) Ex's -3

Additional CDRs in these categories often add up to 20-30%  
100% 100% 80%

1011 1001 0110

## BCD Addition

$$\begin{array}{r} \rightarrow 3 - 0011 \\ 6 - 0110 \\ \hline 9 \end{array}$$

$$\rightarrow \begin{array}{r} 1 \\ \times 3 \\ \hline 0011 \end{array}$$

$$\begin{array}{r}
 \overline{10} \\
 + \overline{1010} \\
 \hline
 0110 \\
 + \overline{1010} \\
 \hline
 10000
 \end{array}$$

↙ Invalid pattern

0100 10 so add 00

$$\begin{array}{c} 0001 \\ \swarrow \\ \downarrow \\ 1 \end{array} \quad \begin{array}{c} 0000 \\ \swarrow \\ \downarrow \\ 0 \end{array} \quad \text{i.e., } 10$$

→ 9-1001 - Igualdo si adattato - 2000 mod. II

9 - 1001  
  
 (E) 1001 < 10010 valid pattern  
 valid pattern  
 10010 possible 2nd primitive 0001 0010

0001 0010

Valid pattern

↓  
0001 0010 | 4 bits possible with given logic slides

Aug 6 1970 21<sup>st</sup> 1970 35°C Wettest day of the year

Here pattern is valid but still result is wrong

The ~~rest~~ It is because of carry.

1001.0

00110 ← add 6

11000

卷之三

0001 1000

1. e., 18

→ So we add 6, if invalid pattern occurs or if there is a carry, i.e., the correction used in BCD addition is 6.

Eg:

$$\begin{array}{r}
 128 \\
 789 \\
 \hline
 912
 \end{array}
 \quad \rightarrow \quad
 \begin{array}{c|c|c}
 0001 & 0010 & 0011 \\
 0111 & 1000 & 1001 \\
 \hline
 1001 & 1011 & 1100 \\
 0110 & & 0110 \\
 \hline
 10001 & 10010
 \end{array}
 \quad \rightarrow \text{invalid pattern.}$$

valid pattern &  
no carry

~~carry has to be~~  
passed to next digit

∴ Result: 1001 0001 0010

9      1      2

i.e., 912

### Excess-3 addition:

→ If carry occurs → Correction is add 6 + 3.

→ If there is no carry → add (-3)

→ while performing this addition make sure that no of digits in i/p is same as no of digits in o/p

Eg:  $7 + 3 = 10$

so 8 odd  
07 passed at 11  
03  
10 → 660 560 02

Reason:

no carry:

$$\begin{array}{r}
 x+3 \\
 y+3 \\
 \hline
 x+y+6 \\
 -3 \\
 \hline
 x+y+3
 \end{array}$$

Eg:  $7 - 1010$

$2 - 0101$

9       $\overline{1111}$  → no carry  
0011  
1100 → so subtract 3.  
→ q in ext-3.

Eg:  $\begin{array}{r}
 07 \\
 03 \\
 \hline
 10
 \end{array}
 - \begin{array}{r}
 0011\ 1010 \\
 0011\ 0110 \\
 \hline
 0111\ 0000
 \end{array}
 \quad \text{carry occurred} \\
 \text{so add } +3$

$\begin{array}{r}
 0011\ 1010 \\
 +0011\ 0011 \\
 \hline
 0100\ 0011
 \end{array}
 \quad \text{Carry } 1 \rightarrow 0100 + 1$

no carry  
so subtract  
3

$\therefore \text{Result: } 0100\ 0011$

∴  $0100\ 0011$  is the sum of  $0100\ 0011$  and  $0011\ 0011$ .

## Gray Code

- Reflexive
- Unit Distance code
- Cyclic Code
- Non-weight code

1-bit

0

1

2-bits



give upper half

MSB as 0

lower half

MSB as 1

3-bits



for 3-bits

write all 2-bit codes in upper half and use reflection

for lower half

MSB of upper → 0

MSB of lower → 1

∴ this procedure can be extended to any no of bits.

→ Hence gray code is called Reflexive.

→ Every successive numbers differ exactly in one bit position.

so gray code is unit distance code.

→ Also the distance b/w last & first number is also 1.

Therefore gray code is called cyclic.

## Conversions blw binary & gray codes

Binary to gray:

$$\text{Ex: } (12)_{10} - (1100)_2$$

$\begin{array}{cccc} \downarrow & \downarrow & \downarrow \\ 1 & 0 & 1 & 0 \end{array} \rightarrow_{12} \text{in } \cancel{\text{binary}} \text{ gray code}$

Let  $b_4 b_3 b_2 b_1$  be binary code

Let  $g_4 g_3 g_2 g_1$  be corresponding gray code

$$\text{then } g_4 = b_4 \quad \boxed{\text{if}} \\ g_3 = b_4 \oplus b_3 \\ g_2 = b_3 \oplus b_2 \\ g_1 = b_2 \oplus b_1$$



Diagram illustrating boundary conditions for a 4x4 grid:

- Top row: Fixed (F)
- Bottom row: Zero flux (Q)
- Left column: Zero flux (Q)
- Right boundary:  $-15^{\circ}\text{C}$
- Interior values: 100, 100, 100, 0

Find  $11001000 - 200$  in binary of regd.

↑  
10101100 — 200 in gray

gray to binary:

~~94 93 92 91~~ → 12 in gray

$$\begin{array}{cccc} 1 & 1 & 1 & 1 \\ \downarrow & \downarrow & \downarrow & \downarrow \\ 1 & 1 & 0 & 0 \end{array} \rightarrow \text{convert to binary}$$

by  $b_3 b_2 b_1$

In binary to gray

$$g_3 = b_4 \oplus b_3$$

$$\Rightarrow b_2 = g_2 \oplus b_4$$

$$bu = g_3$$

$$b_3 = b_4 \oplus g_3$$

$$L = \cup_{j=1}^k S_j$$

$$b_4 = g_4$$

$$\begin{aligned}b_3 &= b_4 \oplus g_3 \\&= g_4 \oplus g_3\end{aligned}$$

so  $b_2 = b_3 \oplus g_2$  and next matching bit is  $g_2$   
so  $b_1 = g_4 \oplus g_3 \oplus g_2$  and next matching bit is

$$b_1 = b_2 \oplus g_1$$

so  $b_0 = g_4 \oplus g_3 \oplus g_2 \oplus g_1$  and next matching bit is  $g_0$

e.g.: 

while converting gray to

binary

~~odd no of 1's~~ is XOR of the  
~~last k bits~~ before

$g_k$  (including  $g_k$ )

so just put '1' if  
~~no of 1's = odd~~

~~and 0 if no of 0's = even~~

Note:

$$g_n = b_{n+1} \oplus b_n$$

$$b_n = b_{n+1} \oplus g_n$$

$$b_n = g_n \oplus g_{n+1} \oplus g_{n+2} \oplus \dots \oplus g_k$$

Most significant bit.

in binary to gray  
 $b_3 = b_4 \oplus g_3$   
 $\Rightarrow b_3 = g_3 \oplus b_4$

## Error Handling



- If error changes pattern from one valid pattern to other valid pattern, then we cannot detect the error. (If we are not using any extra information)
- If error changes a pattern from valid to invalid then error can be detected.

### Error Detection: (~~Parity Check~~)

- To detect error 1-bit errors, the distance b/w any two valid codes should be atleast 2.

For example:

$\{0001\}$      $\{0001\}$      $\{1001\}$      $\{1100\}$

These patterns cannot be used to detect one bit error detection. because distance b/w  $0001$  &  $1001$  is 1.

Eg:

$\{0000\}$      $\{0011\}$      $\{1010\}$      $\{1001\}$

Here distance b/w any two valid codes is 2.  
∴ This pattern can be used for detecting 1-bit errors.

- Similarly to detect 2-bit errors, the distance b/w any two valid patterns has to be atleast 3.
- Similarly to detect ~~n~~-bit errors, the distance b/w any valid codes has to be atleast  $t+1$ .
- In a code,  $d_{min}$  distance b/w any two valid patterns is called Hamming distance.

→ so to detect t-bit errors,

03/10/20 Hamming distance has to be  $t+1$

## Error Correction:

→ To correct 1-bit error, the minimum distance between any two valid codes has to be 3.



→ To correct 2-bit error, the minimum distance b/w any two valid codes has to be 5.



→ Only to correct  $n$ -bit error, the minimum distance b/w any two valid codes has to be  $2n+1$ .

→ We increase distance b/w valid codes by increasing no of bits.

For example consider gray code.

3-bit gray code

|         | <u>p</u> --- parity bit | even parity   |
|---------|-------------------------|---------------|
| 0 0 0   | 0                       | (even parity) |
| 0 0 1   | 1                       |               |
| 0 1 0   | 0                       |               |
| 0 1 1   | 1                       |               |
| 1 0 0   | 0                       |               |
| 1 0 1   | 1                       |               |
| 1 1 0   | 0                       |               |
| 1 1 1   | 1                       |               |
| 1 0 1 0 | 0                       |               |
| 1 0 1 1 | 1                       |               |
| 1 1 0 0 | 0                       |               |
| 1 1 0 1 | 1                       |               |
| 1 1 1 0 | 0                       |               |
| 1 1 1 1 | 1                       |               |

even parity: bit is added such that no of 1's is even

By adding this parity bit the minimum distance b/w any two codes is 2.

so we can use it for 1-bit error detection.

## Hamm Hamming Code:

→ used for 1-bit Error Correction with m+1 bits of data & p bits of parity.

→ Let m be no of bits to be transferred.

so we add p parity bit which are used in error correction.

→ So we transfer m+p bits.

out of this any one bit can get corrupted

or there could be no error.

∴ m+p bits can get transferred in m+p+1 way

'p' bits has to handle all these cases

$$2^p \geq m+p+1$$

Eg: Calculate min no of parity bits required to transfer

4 bits.

sol:

$$m=4, P$$

$$\Rightarrow 2^P \geq m+p+1 \Rightarrow 2^P \geq 4+1$$

$$\Rightarrow P=2$$

### Working:

Let us consider we need to transfer 0101  
i.e.,  $m=4 \Rightarrow P=3$

Let us consider even parity

$m_1 m_2 m_3 m_4$   
0 1 0 1

Total bits to be transferred

1 2 3 4 5 6 7  
 $P_1 P_2 m_1 P_3 m_2 m_3 m_4$   
0 1 0 0 1 0 1

$$P_1 \rightarrow P_1 3 5 7  
0 1 1$$

$\because P_1 = 0$  (even parity)

$$P_2 \rightarrow P_2 3 6 7  
0 0 1$$

$$\therefore P_2 = 1$$

$$P_4 \rightarrow P_4 5 6 7  
1 0 1$$

$$\therefore P_4 = 0$$

| $P_3$ | $P_2$ | $P_1$ | Check bits | error position |
|-------|-------|-------|------------|----------------|
| —     | —     | —     | —          | no error       |
| 0     | 0     | 0     | —          | 1              |
| 0     | 1     | 0     | —          | 2              |
| 0     | 1     | 1     | —          | 3              |
| 1     | 0     | 0     | —          | 4              |
| 1     | 0     | 1     | —          | 5              |
| 1     | 1     | 0     | —          | 6              |
| 1     | 1     | 1     | —          | 7              |

from the table,

$P_2$  makes sure bits positions (1, 3, 5, 7) has even parity

slly  
 $P_2 \rightarrow (2, 3, 6, 7)$

$P_3 \rightarrow (4, 5, 6, 7)$

$\therefore$  Data transferred is 0100101

Assume that bit at position 5 is corrupted

i.e., Data received is 0100001

Now receiver can calculate  $C_1, C_2, C_3$

$C_1 \rightarrow (4, 5, 6, 7)$  has to be of even parity

1 2 3 4 5 6 7  
0 1 0 0 0 0 1

It has odd parity

$\Rightarrow$  one of 4, 5, 6, 7 has been corrupted

$\therefore C_1 = 1$  ( $\because$  error)

$$c_2 \rightarrow (2, 3, 6, 7)$$

$1 \ 0 \ 0 \ 1$  odd part of bytes has odd sum  
 ↳ even parity  
 ∴ no error  
 $c_2 = 0$  without error detection error.

$$c_3 \rightarrow (1, 3, 5, 7)$$

|                        | 0 | 0 | 0 | 1 | 0 |
|------------------------|---|---|---|---|---|
| position               | 1 | 2 | 3 | 4 | 5 |
| number                 | 0 | 1 | 0 | 1 | 1 |
| sum                    | 0 | 1 | 1 | 0 | 1 |
| $c_1, c_2, c_3$        | 0 | 1 | 0 | 1 | 1 |
| $c_1, c_2, c_3 = 101$  |   |   |   |   |   |
| i.e., 5                |   |   |   |   |   |
| ∴ location 5 is error. |   |   |   |   |   |

Eg: let  $M = 1100$

odd and even positions 1 4 5 6 7 i.e. 3rd position is error  
 parity check (P<sub>1</sub>, P<sub>2</sub>, P<sub>3</sub>)  
 $P_1 \rightarrow m_1 \quad P_2 \rightarrow m_2 \quad P_3 \rightarrow m_3 \quad P_4 \rightarrow m_4$   
 (idle 0) → 1 1 1 0 0  
 (idle 0) → Data transferred

let data received: 0 1 0 1 1 0 0

$$P_1 \rightarrow (0010) \quad C_3 = 0$$

$$P_2 \rightarrow (1000) \quad C_2 = 1$$

$$P_3 \rightarrow (1100) \quad C_1 = 0$$

∴ C<sub>1</sub>C<sub>2</sub>C<sub>3</sub> indicates error position will be 3rd position

$$0 \ 1 \ 1$$

i.e. 1 error at 3rd position

∴ applies here only F. method

(idle 0) → 1 1 1 0 0



∴ Data transferred: 00101011

Let data received: 001010011

$$C_1 = (89) - (11) \checkmark \Rightarrow C_1 = 1$$

$$C_2 = (4567) - (0100) \Rightarrow C_2 = 1$$

$$\cancel{C_2} \quad \text{Carry from 4th position}$$

$$C_3 = (2367) - (0100) \Rightarrow C_3 = 1$$

$$C_4 = (13579) - (01101) \Rightarrow C_4 = 1$$

$$\begin{array}{cccc} & & & \\ \text{Parity bit} & \text{C}_1 & \text{C}_2 & \text{C}_3 & \text{C}_4 \\ & 0 & 1 & 1 & 1 \end{array}$$

∴ error at 7th position.

## Overflow:

→ Overflow is a condition where the result ~~does~~ of an operation doesn't fit in the given space.

### Overflow in unsigned numbers

→ In unsigned numbers addition

overflow occurs  $\Leftrightarrow$  carry occurs

### Overflow in 2's complement number system:

→ adding one +ve & one -ve never results in overflow

→ If adding two -ve numbers result in +ve then overflow has occurred.

∴ if adding two +ve numbers result in -ve then overflow has occurred.

- $\begin{array}{l} \text{+ve} \\ \text{-ve} \end{array}$  +  $\begin{array}{l} \text{-ve} \\ \text{+ve} \end{array}$  }  $\rightarrow$  no overflow      *Carry flag = 0*  
 $\begin{array}{l} \text{-ve} \\ \text{+ve} \end{array}$  +  $\begin{array}{l} \text{+ve} \\ \text{-ve} \end{array}$  }  $\rightarrow$  no overflow      *Carry flag = 1*  
 $\begin{array}{l} \text{-ve} \\ \text{-ve} \end{array}$  +  $\begin{array}{l} \text{-ve} \\ \text{-ve} \end{array}$  }  $\rightarrow$  overflow      *Carry flag = 1*  
 $\begin{array}{l} \text{-ve} \\ \text{-ve} \end{array}$  +  $\begin{array}{l} \text{+ve} \\ \text{+ve} \end{array}$  }  $\rightarrow$  overflow      *Carry flag = 1*

For example  $A_3 A_2 A_1 A_0$   $B_3 B_2 B_1 B_0$

$$\begin{array}{r} \text{maximum length} \\ \hline \text{length of sum} \\ \text{length of sum} \end{array}$$

$$\boxed{\text{Carry} \rightarrow A_3 B_3 \bar{S}_3 + \bar{A}_3 \bar{B}_3 S_3}$$

Instead of using above method we can use Cin & Cout

to determine overflow

$$\begin{array}{c} \text{Cin} \\ \curvearrowleft \\ \text{Cout} : A_3 A_2 A_1 A_0 \\ \hline B_3 B_2 B_1 B_0 \\ \hline S_3 S_2 S_1 S_0 \end{array}$$



Carry occurs if Cout & Cin are different

Carry doesn't occur if Cout & Cin are same

$$\boxed{\text{Carry} \rightarrow \text{Cin} \oplus \text{Cout}}$$

Cin  $\Rightarrow$  Overflow flag (in Processor)  
 Cout  $\Rightarrow$  Overflow flag (in Status word)

## Floating Point Representation:

- The floating point numbers are stored in mantissa (M), exponent (E) form.
- Most of the notations represent mantissa as normalised sign magnitude fraction.
- The normalization can be explicit or implicit.
- The exponent is denoted in biased form
  - \* The biased exponent is unsigned number, which can represent signed exponent of original number.

$$\text{True Exponent} = \text{Biased exponent} - \text{Bias}$$

- The floating number is stored in the following way

|   |   |   |
|---|---|---|
| S | E | M |
|---|---|---|

S - Sign ; E - Biased Exponent , M - Mantissa

- If we use n bits



$$\text{Biased Exponent: } 0 \leq E \leq 2^{k-1}-1$$

$$\text{Bias: } 2^{k-1}$$

## IEEE Standards

1985 : Single precision, double precision

2008 : half precision, quadruple precision (not needed for gate)

### 1985 Standard:

- The base of the system is 2.
- Two kinds of precision:
  - single precision (32 bits)
  - Double precision (64 bits)
- The floating point number can be represented in
  - \* Fractional form ( $0.011\ldots$ )
  - \* Implicit normalised form. ( $(1.M)$ )
    - ↳ Mantissa
- Certain mantissa and exponent combinations does not represent any number (Not a Number -NAN)
- Used to represent division by 0.  $\times 10^{\infty}$

### Single Precision



### Floating point Representation:

| $100.1$ | $1001$ | $E$  | $s \text{ in } E$ |
|---------|--------|------|-------------------|
| $0$     | $1001$ | $+1$ | $0 1001 100$      |

$0.1001 \times 2^3$

$0.01001 \times 2^4$

so here in the above  $100.1$  has 2 representations

as it is first digit is 1 so we can have more representations for one number.

To overcome this we use normalization.

## Normalization (Explicit Normalization)

Ex: 100.1

Move the radix point to the left of the ~~1st~~ 1<sup>st</sup> '1' from left side

$s \cdot E^{(4)} \cdot M(5)$  (bias)  $\rightarrow 1001 \times 2^3$

|   |      |        |
|---|------|--------|
| 0 | 0011 | 10010. |
|---|------|--------|

Ex: 0.00101 (biased and no uniform binary pattern)  $\rightarrow$

$0.101 \times 2^{-2}$

Exponent can even be negative.

$s \cdot E^{(4)} \cdot M(5)$  (bias)

so we use 2's complement

|   |      |       |
|---|------|-------|
| 0 | 1110 | 10100 |
|---|------|-------|

Exp  $\rightarrow -2$

0010

(as 2's comp  $\rightarrow$  1110 to 0010)

we calculate the number using below formula

$$(-1)^s \times 0.M \times 2^E$$

$$= (-1)^0 \times 0.10100 \times 2^{-2}$$

Note:

As exponent can be negative, while comparing two numbers it is bit difficult.

So to make it easy we use biassing. (i.e., add  $-2^{n-1}$  to E)

Ex: Consider  $0.101 \times 2^{-2}$

no of bits for exponent be 5.

Biassing is done by adding  $+2^{5-1} = 2^4 = 16$  to exponent

$-2 + 16 = 14$   $\rightarrow$  Biased exponent.

|   |       |       |
|---|-------|-------|
| s | E     | M     |
| 0 | 01110 | 10100 |

n is no of bits used for exponent.

$\text{value} = (-1)^S \times 0.M \times 2^{E-\text{bias}}$

Consider  $(4 \cdot 875)_a$

$$\begin{array}{r} 0.875 \\ 0.875 \\ \hline 0 \\ ① \longrightarrow 1.750 \\ 0.750 \\ \hline 0.500 \\ ① \longrightarrow 1.000 \end{array}$$

Assume no of bits in exponent field = 4  $\times 10^4$  to follow

" " " " Mantessa field  $\approx$  5

So we need to discard LSB's from manuscript

8-9 we have only 5 bids left

$$\begin{array}{c|cc} S & E & M \\ \hline 0 & 1011 & 1001 \end{array} \quad \text{bias} = 3 + 1 = 4$$

$$= 3+8 = 11 = (1011)_2$$

now let us calculate the value

A few of them walked off as Ebbig did at a recent (D)

$$vallee = (-1)^s \times 0.M \times 2$$

En betonkőről tű (N) osztályozott, rendmeneti fajták eredményeiből

$$\tilde{\gamma} = (-1)^{\theta} \times 0.10011 \times 2^{11-8}$$

$$= 0.10011 \times 2^3$$

$\approx 100:1$

卷之三十一

but original value is 4.875

Now have error.

To overcome this we can use implicit normalization.

Implicit Normalization: If nothing is given take implicit as default  
 Here we place the radix point on the right of first '1' from left 'side'.

It because we already know that there will be one, we don't need to store it again.

100.111

$$\approx 1.00111 \times 2^2$$

$$\text{biased exponent} = 2+8=10 = (1010)_2$$

|   |      |       |
|---|------|-------|
| S | E    | M     |
| 0 | 1010 | 00111 |

$$\text{value} = (-1)^S \times 1.M \times 2^{E-\text{bias}}$$

implied

(implicit)

positive mantissa 1 and bias of exponent of 8

$$\text{Here value} = (-1)^0 \times 1.00111 \times 2^{10-8}$$

$$\text{value} = 1.00111 \times 2^2$$

$$= 100.111_2$$

$$= 4.875$$

→ Implicit Normalization has no representation for '0'.

- Q) Consider a 16 bit register of the following format is used to store a floating point number. Mantissa (M) is denoted as normalized signed magnitude fraction, exponent (E) is expressed in excess-64 form. Base of the system is 2.

- i) How many bits are allocated for fractional mantissa.

|   |   |   |
|---|---|---|
| S | E | M |
|   |   |   |
| 1 | 7 |   |

Excess 64

$$2^{n-1} = 64$$

$$n-1 = 6$$

$$n = 7$$

∴ bits for Mantissa

$$= 16 - 1 - 7 = 8$$

2) What is the expression for decimal value?

$$\cancel{(-1)^S \times 1.0 \times 2^{E-64}}$$

$$(-1)^S \times 1.0 \times 2^{E-64}$$

Normalizing floating point

Scientific notation floating point

3) What is value of the largest number that can be represented in base 10.

Sol:

Sign is positive

Exponent 4 - Mantissa has to be as large as possible

| S | E       | M       |
|---|---------|---------|
| 0 | 1111111 | 1111111 |

$$(-1)^0 \times 1.111111 \times 2^{127-64}$$

$$1.111111 \times 2^{63} \\ 1111111 \times 2^{55}$$

$$(2^9 - 1) \times 2^{55} = 2^{64} - 2^{55}$$

4) What is 16 bit pattern for  $(-7.5)_{10}$

$$-(111-1)_2$$

$$1.111 \times 2^{-2}$$

$$\text{biased exp} = 2^2 + 64 = 66 = (1000010)_2$$

| S | E       | M        |
|---|---------|----------|
| 1 | 1000010 | 11100000 |

$$2^2 \times (1110000)$$

→ Increasing no of bits in exponent increases the range.

→ Increasing no of bits in mantissa gives increases accuracy i.e., precision

5) What is the difference b/w 1st smallest tie number and 2nd smallest positive number.

1st smallest positive

$\begin{array}{c} S \quad E \quad M \\ \boxed{0 \quad 0000 \quad 0000 \quad 0000} \end{array}$

$$(-1)^0 \times 1 \cdot 0000 \quad 0000 \times 2^{-64}$$

$$= 1 \times 2^{-64} = 2^{-64}$$

2nd smallest positive

$\begin{array}{c} S \quad E \quad M \\ \boxed{0 \quad 0000 \quad 0001 \quad 0000} \end{array}$

$$(-1)^0 \times (1 \cdot 0000 \quad 0001) \times 2^{-64} = 1 \times 2^{-64}$$

$$\Leftrightarrow (1 \times 2^{-64}) + (0 \cdot 0000 \quad 0001) \times 2^{-64}$$

$$\text{difference} \rightarrow (0 \cdot 0000 \quad 0001) \times 2^{-64}$$

$$2^{-8} \times 2^{-64} = 2^{-72}$$

6) find difference b/w 1st highest tie number & 2nd highest positive number.

$$\text{1st highest} \rightarrow (-1)^0 \times (1 \cdot 1111 \quad 1111) \times 2^{127-64}$$

$$(1 \cdot 1111 \quad 1111) \times 2^{63}$$

$$(1111 \quad 1111) \times 2^{55}$$

$$(2^{9-1}) \times 2^{55} = 2^8 \times 2^{55} = 2^{63}$$

$$\text{2nd highest} \rightarrow (-1)^0 \times (1 \cdot 1111 \quad 1110) \times 2^{127-64}$$

$$(1 \cdot 1111 \quad 1110) \times 2^{63}$$

$$(1 \cdot 1111 \quad 1111) \times 2^{63} - (-1)(0 \cdot 0000 \quad 0001) \times 2^{63}$$

$$\text{diff} \rightarrow (0.00000001) \times 2^{63}$$

$$2^{-8} \times 2^{63} = 2^{55}$$

Here difference is very huge.

Note:

- The numbers are densely distributed to '0' and sparsely distributed away from '0'
- So number that are close to '0' have less error.
- This happens because numbers which are close to zero has less exponent value (magnitude)



7) Find pattern for  $(-16.125)_{10}$

$$\approx -(10000.001)_2$$

$$\approx -1.0000001 \times 2^4$$

$$\text{biased exp} \rightarrow 4 + 64 = 68$$

S E  
011000100100000010

## Single Precision

| S | F | M  |
|---|---|----|
| 1 | 8 | 23 |

(Excess 127)

| S(1) | E(8)                | M(23)                | value                    |
|------|---------------------|----------------------|--------------------------|
| 0/1  | 0000 0000<br>E=0    | 0000 ... 0<br>M=0    | $\pm 0$                  |
| 0/1  | 1111 1111<br>E=255  | 0000 ... 0<br>M=0    | $\pm \infty$             |
| 0/1  | $1 \leq E \leq 254$ | $M = \text{xxx...x}$ | Implicit Normalized Form |
| 0/1  | E=0                 | M $\neq 0$           | Fractional form          |
| 0/1  | E=255               | M $\neq 0$           | NAN                      |

If we use ex-128  
 (overflow) under floating point representation

then it may result in value of E=255

so we use Ex-127

Q) Consider the 32 bit register which stores floating point numbers in IEEE single precision format.

i) what is the value of the number, if 32 bits are as given below

| (1) | (2)       | (3)         |
|-----|-----------|-------------|
| 0   | 1000 0001 | 11000 ... 0 |

It falls under  $1 \leq E \leq 254$

$\therefore$  Implicit normalized form

$$E = 131 - 127$$

$$= 4$$

$$\text{10 } (-1)^S (1.1100 \dots 0) \times 2^4$$

$$11100.0000 \dots 0 = 28$$

2) what is the value of the number in decimal for given bit pattern

$0\ 11111111110000\ldots0$   
 $\text{S } 0 \text{ E } 11111111110000\ldots0$   
 $E = 255 \text{ & } M \neq 0$

∴ It is NAN.

### Double Precision:



| S(1) | E(11)                | M(52)             | value                  |
|------|----------------------|-------------------|------------------------|
| 0/1  | 000...00<br>E=0      | 0000...00<br>M=0  | $\pm 0$                |
| 0/1  | 111...11<br>E=2047   | 0000...00<br>M=0  | $\pm \infty$           |
| 0/1  | $1 \leq E \leq 2046$ | $M = XXX\ldots X$ | Implicit norm.<br>Form |
| 0/1  | 000...00<br>E=0      | M ≠ 0             | Fractional<br>form     |
| 0/1  | 111...11<br>E=2047   | M ≠ 0             | NAN                    |

$\xrightarrow{E-1023}$

$\xrightarrow{(-1)^S (1.M) 2^{E-1023}}$

$\xrightarrow{(-1)^S (0.M) 2^{-1022}}$

- Q) Consider the 64 bit register which stores floating point numbers in IEEE double precision format

i) what is the value of below 64 bit pattern

$1\ 1000\ 0000\ 000\ 11011000\ldots0$   
 $\text{S } 1 \text{ E } 1000\ 0000\ 000\ 11011000\ldots0$

$$E = 1024 + 3 = 1027$$

∴ Implicit normalized form

$$(-1)^1 \times (1.1101100\ldots0) \times 2^{1027-1023}$$

$$-(1.11011) \times 2^4 = -11101.1 = -29.5$$

2) Find value

$0111\ 1111\ 1111\ 0000\ 0000 \times 2^{2047}$

Sign      E      M

$E = 2047$

$M > 0$

$\therefore \infty$

$$(111)_6 = (110101)_2$$

④ Represent  $(117)_6$  in IEEE single & double precision format

| S | E(8)      | M(23)        |
|---|-----------|--------------|
| 0 | 010000101 | 11010100...0 |
|   |           |              |
|   |           |              |
|   |           |              |
|   |           |              |
|   |           |              |
|   |           |              |
|   |           |              |

$(117)_6 = (110101)_2$

$= 1.110101 \times 2^{6+127} = 1.110101 \times 2^{133}$

Single Precision

Biased exp  $\rightarrow 6 + 127 = 133$

i.e., 10000101

| S(1) | E(8)      | M(23)        |
|------|-----------|--------------|
| 0    | 010000101 | 11010100...0 |

Double Precision:

$$\text{Biased exp} \rightarrow 6 + 1023 = 1029$$

$$\text{i.e., } 100000000101$$

| S | E(11)        | M(52)        |
|---|--------------|--------------|
| 0 | 100000000101 | 11010100...0 |

$$2^{(6+1023)} \times 1.110101 \times 2^{1029}$$

$$2^{1029} \times 1.10111 \times 2^{1029} = 1.10111 \times 2^{1029} \times (2^{1029} + 1)$$

Q What is the range of floating-point numbers using ~~float~~<sup>IEEE 754</sup> fractional form of IEEE 754 single precision format.

50

smallest



.....01

$E=0$   $M \neq 0$   
*i.e. fractional form*

value → -

$$\delta \mathcal{L}^k = \delta \lambda \left( (1 - \alpha) I + \delta \right) \mathcal{L}^k (1)$$

smallest



254 - 127

$$= (1, 1, \dots, 1) 2^{127}$$

- (1111111111) 2 124

*2. fomber* *Revised*

$$- (2^{24} - 1) 2^{124}$$

$$-(\frac{148}{2} - 2^{124})$$

*i piso del estrecho*

~~§ 8 - 170 50000 \$ et toutes , 200000 offert~~

~~(S-2)~~ basidionyces fimbriati sp. nov. istotriangularis testaceae et A. basidi-

Given fractional form; convert it into decimal form.

$\Rightarrow E=0, H \neq 0$

smallest: (-ve)



$$(-1)^1 \cdot (\underbrace{a - 1111\cdots 1}_{2^3}) \cdot 2^{-126}$$

$$= \underbrace{(01111\cdots 1)}_{23} \times 2^{149} \times (1000000000\cdots 1)$$

$$= \left( 2^{23} - 1 \right) \times 2^{-149}$$

smallest(ive): ~~smallest finite, positive~~ ~~is zero and take~~

$0|0000\ 0000|000\dots01$

$$(-1)^0 \times (0.000\dots01) \times 2^{-126}$$

$$2^{-23} \times 2^{-126} = 2^{-149}$$

largest tie:

$0|0000\ 0000|111\dots11$

$$(-1)^0 \times (0.111\dots11) \times 2^{-126}$$

$$(1111\dots11) \times 2^{-149}$$

$$(2^{23}-1) \times 2^{-149}$$

$$\approx (1-2^{-23}) \times 2^{-126}$$

Q IEEE 754 single precision format is used to store floating point numbers, what is the value of  $A-B$ ?

where A is smallest representable using implicit normalized form, B is largest representable fractional form. A, B are positive.

Sol:

$\Rightarrow E=0$  & E is smallest possible  $\Rightarrow E=1$

$0|0000\ 0001|0000\dots001$

$$(-1)^0 \times (1.000\dots000) \times 2^{-127}$$

$$(1.0000\dots000) \times 2^{-126}$$

$$1 \times 2^{-126}$$

$$2^{-126}$$

$$2^{-126}$$

$\therefore$   $B$ :

from previous question

$$\text{value of } B = 2^{-126} (1 - 2^{-23})$$

$$\Rightarrow A - B = 2^{-126} (1 \cancel{- 2^{-23}} - 1 + 2^{-23}) \\ = 2^{-126} (2^{-23})$$

$$\therefore \cancel{2^{-148}} = 2^{-149}$$