

## Lecture - 10

### Parallel Processing

#### Simultaneous Data Processing

##### Types

- Vector Processing
  - Array Processing
  - Pipelining Processing
- } Not in GATE

### Flynn's Classification

⇒ SISD :

Single instruction stream, single data stream.

Fetch single instruction at a time.  
Run only one instruction at a time.

Eg : Von Neumann

⇒ SIMD :

Single instruction stream, multiple data stream.

Fetch single instruction at a time.  
Execute multiple at a time  
Eg. pipeline processor.

⇒ MISD

Multiple instruction stream, Single data stream

Fetch multiple instructions at a time.  
Execute single instruction at a time.

Only hypothetical.

⇒ MIMD

Multiple instruction stream, Multiple data stream

Fetch multiple instructions.  
Execute " "

Multiple pipelines

Eg, super scalar computers.

These computers have instruction level parallelism.

Pipelining

Useful when the same processing is applied over multiple inputs.

We divide the processing into sub operations.

⇒ We decompose a sequential process into sub-operations.

- ⇒ Suboperations are performed in segments.
- i) We decompose a sequential process into sub operations.
- ii) Sub operations are performed by in segments (separate hardware).
- \* → Each segment can perform its respective sub operation over different input in parallel to one another.

Task ⇒ One operation performed in all segments forms a complete task.

Think of this like a car assembly. Each worker ~~or~~ represents a segment or stage. Each worker is doing his specialized work which represents doing that suboperation.

A single car represents a task and the ~~as~~ complete assembly line represents the pipeline.

Eg. of pipelining,

Suppose we want to do

$$A_i * B_i + C_i \quad \text{for } i = 1, 2, 3, \dots, 5$$

Let's ~~divide~~ decompose into suboperations as,

$$\text{Segment 1: } R_1 \leftarrow A_i^0, R_2 \leftarrow B_i^0$$

$$\text{Segment 2: } R_3 \leftarrow R_1 * R_2, R_4 \leftarrow C_i^0$$

$$\text{Segment 3: } R_5 \leftarrow R_3 + R_4$$

Rough hardware design  $\Rightarrow$



| Clock Pulse | Segment 1      |                | Segment 2                       |                | Segment 3                                        |
|-------------|----------------|----------------|---------------------------------|----------------|--------------------------------------------------|
|             | R <sub>1</sub> | R <sub>2</sub> | R <sub>3</sub>                  | R <sub>4</sub> | R <sub>5</sub>                                   |
| 1           | A <sub>1</sub> | B <sub>1</sub> |                                 |                |                                                  |
| 2           | A <sub>2</sub> | B <sub>2</sub> | A <sub>1</sub> * B <sub>1</sub> | C <sub>1</sub> |                                                  |
| 3           | A <sub>3</sub> | B <sub>3</sub> | A <sub>2</sub> * B <sub>2</sub> | C <sub>2</sub> | A <sub>1</sub> * B <sub>1</sub> + C <sub>1</sub> |
| 4           | A <sub>4</sub> | B <sub>4</sub> | A <sub>3</sub> * B <sub>3</sub> | C <sub>3</sub> | A <sub>2</sub> * B <sub>2</sub> + C <sub>2</sub> |
| 5           | A <sub>5</sub> | B <sub>5</sub> | A <sub>4</sub> * B <sub>4</sub> | C <sub>4</sub> | A <sub>3</sub> * B <sub>3</sub> + C <sub>3</sub> |
| 6           |                |                | A <sub>5</sub> * B <sub>5</sub> | C <sub>5</sub> | A <sub>4</sub> * B <sub>4</sub> + C <sub>4</sub> |
| 7           |                |                |                                 |                | A <sub>5</sub> * B <sub>5</sub> + C <sub>5</sub> |
| 8           |                |                |                                 |                |                                                  |

The same process in sequential manner will take 15 cycles

As the number of inputs increase the efficiency will only increase further.

### Pipeline Cycle Time

Minimum time in which all segments can perform their respective suboperations.

In simple terms, it is the time to which clock pulse can be set so that all segments can complete their operation.

Now let's look at the above table and make general considerations of for pipelines.

Consider  $k$  segment pipeline  
 clock cycle time =  $t_p$   
 Number of inputs =  $n$

Time for first input =  $k \times t_p$ ,  
 Or if we want in terms of cycle number,  
 we are numbering as cycle no = 1 complete cycle.

Cycles for first input =  $k$

Cycles for all  $n$  inputs =  $k + (n-1)$

Time for processing all  $n$  inputs =  $(k + n - 1) * t_p$

We are not taking hazards into consideration yet.

Speed up ratio

Speed up ratio is the performance gain.

Speed up ratio  $> 1$  when comparing with non-pipeline

speed up ratio =  $\frac{\text{non-pipeline time}}{\text{pipeline time}}$

$S = \frac{\text{non-pipeline time}}{\text{pipeline time}}$

If we consider that both have same clock pulse time,

$S = \frac{\text{non-pipeline cycles}}{\text{pipeline cycles.}}$

Therefore,

$$S = \frac{R \times n}{k + (n-1)} \quad [k = \text{no. of suboperations}]$$

$n = \text{no. of inputs}$

Suppose  $n \geq k$  [We can ignore first  $k-1$  cycles in this case]

This is ideal case for pipeline system.

$$\cancel{S = R \times}$$

$$S_{\text{ideal}} = S_{\text{max}} = \frac{k \times n}{n} = k$$

So maximum/ideal speed up is  $k$  times.

If the question does not tell cycles, and wants the maximum speed up, use,

$$S_{\text{max}} = \frac{\text{Time for one task non-pipeline}}{\text{Time of single clock pulse}}$$

We use time of single clock pulse in denominator since  $k-1$  cycles are ignored.

The above formulae is better for numericals.

Hdr ; ignore first  $k-1$  cycles for non pipeline if asking for max/ideal speed up.

## Lecture - 47

### Synchronous Pipeline

Suppose all segments take different amount of time to do their operation.



So to keep it synchronous, we choose the max segment delay as clock pulse.

$$t_p = \max(\text{segment delay})$$

But there can be loss of intermediate values if some segments are faster.  
So we have intermediate register/buffer.



Intermediate registers

But since they will also cause some delay

$$t_p = \max(\text{segment delay}) + \text{register delay}$$

Therefore in pipelines there is a delay caused by registers. This is not present in non-pipeline system.

So we modify the formula for speed up ratio as,

$$S = \frac{n \times t_n}{(k+t_n-1) \times t_p} \quad \left[ \begin{array}{l} t_n = \text{time to do 1 task in non-pipeline} \\ t_p = \text{time for 1 clock pulse} \end{array} \right]$$

$$S_{\max}/S_{\text{ideal}} = \frac{t_n}{t_p}$$

### Latency

Time after which machine takes next input.

Latency for non-pipelined system =  $t_n$

Latency for pipelined system =  $t_p$

### Throughput

Number of inputs processed per unit time.

In pipelined system,

Time taken for n inputs =  $(k+t_n-1) \times t_p$

Inputs processed per unit time =  $\frac{n}{(k+t_n-1) \times t_p}$

For ideal case, ignore k-1

$$\text{Throughput}_{\text{ideal}} = \frac{1}{t_p}$$

Note : This lecture also has many questions in the lecture et.

## Lecture - 42

### Instruction Pipeline

We will apply pipeline processing on instruction cycle, it is called an instruction pipeline.

Assume such pipeline has 5 phases,

IF : Instruction Fetch

ID : Instruction Decode and Address calculation

OF : Operand Fetch

EX : Execution

WB : Write Back

Eg

Clock Pulses

|                | 1  | 2  | 3  | 4  | 5  | 6  | 7  | 8  |
|----------------|----|----|----|----|----|----|----|----|
| I <sub>1</sub> | IF | ID | OF | EX | WB |    |    |    |
| I <sub>2</sub> |    | IF | ID | OF | EX | WB |    |    |
| I <sub>3</sub> |    |    | IF | ID | OF | EX | WB |    |
| I <sub>4</sub> |    |    |    | IF | ID | OF | EX | WB |

This works well when we don't have branching instruction.  
We follow the certain rule if there is a branching instruction.

|                | 1  | 2  | 3  | 4  | 5  | 6  | 7  | 8  | 9  | 10 | 11 | 12 |
|----------------|----|----|----|----|----|----|----|----|----|----|----|----|
| I <sub>1</sub> | IF | ID | OF | Ex | WB |    |    |    |    |    |    |    |
| I <sub>2</sub> |    | IF | ID | OF | Ex | WB |    |    |    |    |    |    |
| I <sub>3</sub> |    |    | IF | ID | OF | Ex | WB |    |    |    |    |    |
| I <sub>4</sub> |    |    |    | IF | -  | -  | IF | ID | OF | Ex | WB |    |
| I <sub>5</sub> |    |    |    |    |    |    |    | IF | ID | OF | Ex | WB |

In cycle 7, I<sub>3</sub> is decoded as a branch type instruction. So all further instructions are removed from the pipeline. So the I<sub>4</sub> fetch is discarded.

No further instructions enter pipeline until branch decision is taken.

By standard, branch decision is made in EXECUTION Phase.

So in our example at the end of cycle 6. So at start of 7<sup>th</sup> cycle, instructions can be entered in pipeline.

Let's consider branch is not taken. The I<sub>4</sub> is fetched again because fetch was discarded.

Then pipeline continues it's usual working.

Cycles 4, 5 and 6 above had the pipeline stalled, so they are called stalled cycles (also called pipeline bubbles).

Example when jump is taken.

|                | 1  | 2  | 3  | 4  | 5  | 6  | 7  | 8  | 9  | 10 | 11 | 12 |
|----------------|----|----|----|----|----|----|----|----|----|----|----|----|
| I <sub>1</sub> | IF | ID | OF | EX | WB |    |    |    |    |    |    |    |
| I <sub>2</sub> | IF | ID | OF | EX | WB |    |    |    |    |    |    |    |
| I <sub>3</sub> | IF | ID | OF | EX | WB |    |    |    |    |    |    |    |
| I <sub>4</sub> |    | IF | -  | -  |    |    |    |    |    |    |    |    |
| I <sub>5</sub> |    |    |    |    |    |    |    |    |    |    |    |    |
| I <sub>6</sub> |    |    |    |    |    |    |    |    |    |    |    |    |
| I <sub>7</sub> |    |    |    |    | IF | ID | OF | EX | WB |    |    |    |
| I <sub>8</sub> |    |    |    |    |    |    | IF | ID | OF | EX | WB |    |

Here I<sub>4</sub> is the instruction where which is branch instruction.

So when it is decoded at ~~now~~ (cycle 4), the further pipeline is stalled till cycle 6 completes.

And since branch is taken we jump to I<sub>7</sub> at start of cycle 7.

Number of instructions processed = 5(n)  
Number of segments (k) = 5

$$\begin{aligned}
 \text{Total cycles required} &= n+k-1 \\
 &= 5+5-1 \\
 &= 9
 \end{aligned}$$

But in our example, we took 12 cycles because 3 were stalled.

To calculate number of stalled cycles.

No. of stalled cycles =  $i - 1$   
 where,  $i^o$  = the cycle at the end of which branch decision is taken.

In our example we have 5 segments

IF ID OF EX WB

We taken decision at end of 4th segment. So

$$\text{stalled cycles} = 4 - 1$$

"  $i^o$  =  $i^o$ th segment at the end of which branch decision is taken.

### Pipeline Hazards

Situations that prevent next instruction from being executed during its designated clock cycle.

They cause stalled cycles.

### Pipeline Hazards

- Structural Hazards / Resource Conflict
- Data Hazard / Data Dependency
- Control Hazard / Branch Dificulty

## Structural Hazard

2 different segments try to use same resource at same time.

Suppose we have two instructions I<sub>1</sub> and I<sub>2</sub>, but I<sub>1</sub> takes 3 cycles for execution.

So in pipeline,



Since EX segment is already in use, this causes resource conflict.

Another conflict of this type is if we have operand fetch and instruction fetch in the same cycle in a pipeline.

Since only a single access can be done, it causes resource conflict. To solve this we can use two separate caches, a data cache and an instruction cache.

## Lecture - 43

### Data Dependency.

Result of an instruction is used as input in next.

If In :  $R_2 \leftarrow R_2 \times R_3$   
 Inter :  $R_4 \leftarrow R_1 + R_3$

So in pipeline



A general pipeline hardware cannot detect data dependency.

### Solution for data dependency

- Software (By compiler)
- Hardware

Software solution is called delayed load and is used when general pipeline is used.

Compiler generates code in a way that avoids data dependency.

## Delayed Load

We either

- Move independent instructions between dependent instructions
- Insert no operation instruction.

$I_1$

$I_2$  ↗ Data  
Dependency

$I_3$

$I_4$  } Independent instructions  
 $I_5$

↓ compiler converts to.

$I_1$

$I_2$  ↗ Data  
Dependency

$I_4$

$I_5$

$I_3$

or compiler adds no operation instructions.

$I_1$  converted to  
 $I_2$  ↗  
 $I_3$  ↗  
 $I_4$

$I_1$   
 $I_2$  ↗  
 " " " "  
 $I_3$  —  
 $I_4$

Data  
Dependency

## Hardware solutions

- ↳ Hardware interlock
- ↳ Operand Forwarding

### Hardware interlock

This method will make stall cycles in pipeline to solve this problem.

This is what we saw in the very first example.

In :  $R_1 \leftarrow R_2 \times R_3$

In<sub>2</sub> :  $R_4 \leftarrow R_1 + R_5$

IF ID OF EX WB

IF ID - - - OF EX WB

Hardware locked  
to create stall cycles

### Operand Forwarding

Rather than stalling cycles, we take operand value and ALU data from the intermediate registers within the pipeline rather than waiting for write back (WB).

So if we have data dependency as

$$I_n : R_1 \leftarrow R_2 \times R_3$$

$$I_{n+1} : R_4 \leftarrow R_1 + R_5$$

IF ID OF EX WB



IF ID OF EX WB

In hardware implementation,



The decision unit detect data dependency.  
This is also called bypassing.

Because a specialized decision unit is need this method is expensive.

Forwarding can't make it so that All to All data dependency causes no stall cycles.

This cannot prevent stall cycles in memory data dependency i.e. load and store operation data dependency.

## Load Data Dependency

$$R_1 \leftarrow M[\text{add}]$$

$$R_4 \leftarrow R_1 + R_2$$

By standard, many CPUs do LOAD operation in EX phase. So we consider a single stall cycle.

|    |    |    |    |       |
|----|----|----|----|-------|
| IF | ID | OF | EX | WB    |
| IF | ID | -  | OF | EX WB |

## ~~Stop~~ Read Data Dependency

$$R_4 \leftarrow R_1 \times R_5$$

$$M[\text{add}] \leftarrow R_4$$

There are always 2 stalled cycles.

|    |    |    |    |          |
|----|----|----|----|----------|
| IF | ID | OF | EX | WB       |
| IF | ID | -  | -  | OF EX WB |

## Control Hazard or Branch Difficulty

Hazards because of branch instruction.  
General pipeline hardware cannot detect branch difficulty.

### Solution

- └ Software solution (Delayed Branch)
- └ Hardware solution

## Software Solution (Delayed Branch)

The compiler will add no operation instructions equal to the number of cycles to stall, similar to software delayed load.

## Hardware solutions

### 1. Prefetch Target Instruction

Two pipelines will work together and the wrong pipeline is discarded.

### 2. Branch Prediction

The pipeline will try to predict the branch decision and if wrong decision taken, it will rollback.

#### Types

- Static (Either always take branch or not take branch)
- Dynamic (will make guess according to instruction)

### 3. Look Buffer

In a pipeline we have a buffer which will store instructions of a look and continuously run them without fetching or decoding over and over.

### 4. Branch Target Buffer

We use this frequently with branch prediction.

If a certain branch instruction previously taken decision we will take same decision next time.

The previous decisions result are stored in a buffer.

## Lecture - 44

### Data Hazard Classification

Assume there are two instructions  $i$  and  $j$  and  $i$  is executed before  $j$ .

#### Read After Write (RAW)

$j$  tries to read a source before  $i$  writes it.  
So  $j$  incorrectly gets the old value.

$i: R_1 \leftarrow R_2 + R_3$  ) called data dependency  
 $j: R_4 \leftarrow R_1 * R_5$  )

Solution is operand forwarding.

#### Write After Write (WAW)

$i: R_1 \leftarrow R_2 + R_3$  ) called no write dependency  
 $j: R_1 \leftarrow R_4 * R_5$  ) ( It is a false dependency )

$j$  tries to write an operand before it is written by  $i$ .

#### Write After Read (WAR)

$j$  tries to write a destination before  $i$  reads it. So  $j$  gets new value

$i: R_1 \leftarrow R_2 + R_3$  ) anti-dependency  
 $j: R_2 \leftarrow R_4 * R_5$  ) ( It is a false dependency )

## Solution for false dependencies (Register Renaming)

# For  $j^{\text{th}}$  instruction, replace the register causing false dependency with some other register and do the same for all subsequent instructions that depend on the  $j^{\text{th}}$  instruction's operation on that register.

This is a hardware solution.

$$\begin{array}{ll} i: R_1 \leftarrow R_2 + R_3 & \xrightarrow{\text{reg. renaming}} \\ j: R_1 \leftarrow R_4 \times R_5 & j: R_6 \leftarrow R_4 \times R_5 \end{array}$$

## Pipeline Efficiency

Because of hazards (stall cycles), efficiency is reduced.

$$\begin{aligned} \text{Speedup without hazard} &= S \\ \text{Speedup with hazard} &= \cancel{S} \quad S' \end{aligned}$$

$$\text{Efficiency} = \frac{S'}{S}$$

$$0 \leq \text{Efficiency} \leq 1$$

## CPI (Cycles Per Instruction)

$CPI = \frac{\text{Number of cycles for all instructions}}{\text{Number of instructions}}$

No. of instructions =  $n$   
No. of segments =  $k$

$$CPI = \frac{n}{k} + k - 1 + \text{(stalled cycles)}$$

$$CPI_{\text{ideal}} \text{ (ignore } k-1) = \frac{n + \text{stalled cycles}}{n}$$

$$CPI_{\text{ideal}} = 1 + \frac{\text{stalled cycles}}{n}$$

To calculate stalled cycles,

stalled cycles = stalling instructions  $\times$  number of cycles it stalls.

$$CPI_{\text{ideal}} = 1 + \frac{\text{stalling instructions} \times \text{no. of cycle it stalls}}{n}$$

$$CPI_{\text{ideal}} = 1 + (\text{ratio of stalling instruction}) \times \text{no. of cycle it stall}$$

So if ques says, 25% of instructions cause 2 stalled cycles then,

$$\text{ratio of stalling instructions} = 0.25$$

$$\text{no. of cycles it stalls} = 2$$

$$CPI_{\text{ideal}} = 1 + 0.25 \times 2$$

Note : Questions for pipeline in future.

## Lecture - 45

### Non-Linear Pipeline (Not frequently in Gate)

Input is not passed in sequential manner and a single segment can be used multiple time, to represent this we use a reservation table.

|         |                | Clock → |   |   |   |   |   |   |   |
|---------|----------------|---------|---|---|---|---|---|---|---|
|         |                | 1       | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
| Segment | S <sub>1</sub> | X       |   |   |   |   | X | X |   |
|         | S <sub>2</sub> |         | X |   | X |   |   |   |   |
|         | S <sub>3</sub> |         |   | X |   | X | X |   |   |

### Latency

Number of time gap units (cycles) after which next input can be given to pipeline after ~~from~~ a input enters pipeline.

For 3 cycles between two inputs.

- permissible latency : latency which does not cause collision.
- forbidden latency : latency which causes collision.

Collision : when a segment in pipeline is tried to be used for 2 inputs at same time. This is a structural hazard.

## Forbidden latency calculation

For example,

|                | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|----------------|---|---|---|---|---|---|---|---|
| S <sub>1</sub> | X |   |   |   |   | X |   | X |
| S <sub>2</sub> |   | X |   | X |   |   |   |   |
| S <sub>3</sub> |   |   | X |   | X |   |   | X |

All latencies  $\Rightarrow 1, 3, 5, \dots, 8$

For S<sub>1</sub>,

since it is used in cycle 8 and then 6  
 Forbidden latency = 8 - 6 = 2

Similarly, it is used in 8 and 1

Forbidden latency = 8 - 1 = 7

Similarly, it is used in 6 and 1

Forbidden latency = 6 - 1 = 5

Forbidden latencies = 2, 5, 7

For S<sub>2</sub>,

similar calculate ~~for S<sub>1</sub>~~ as in S<sub>1</sub>  
 Forbidden latencies = 2

For S<sub>3</sub>,

Forbidden latencies = 4, 2, ~~2~~ 2

So all forbidden frequencies are,  
2, 4, 5, 7

Remaining are permissible latencies  
1, 3, 6, 8

### Collision Vector

It is a n-bit vector (number) where where  
n is the number of clocks ~~to~~ to process 1 input.

$C_n \ C_{n-1} \dots \ C_2 \ C_1$

If ~~n<sup>th</sup>~~ cycle number is Forbidden then,

$$C_n = 1$$

If ~~n<sup>th</sup>~~ cycle number is permissible then,

$$C_n = 0$$

From our example in previous section

| $C_8$ | $C_7$ | $C_6$ | $C_5$ | $C_4$ | $C_3$ | $C_2$ | $C_1$ |
|-------|-------|-------|-------|-------|-------|-------|-------|
| 0     | 1     | 0     | 1     | 1     | 0     | 1     | 0     |
| P     | F     | P     | F     | F     | P     | F     | P     |

(P = Permissible; F = Forbidden)

So collision vector for our example,  
01011010

## State diagram using collision vector

- ⇒ Take a state  $S$
- ⇒ For every 0(zero) at position  $i$ , shift collision vector of state  $S$  in right by  $i$  and then take bitwise OR for new state with collision vector.
- ⇒ Draw arrows from  $S$  to new state and write  $i$  on the arrows.
- ⇒ Do same process for new states acquired.  
*Note: Shift state  $S$  and do OR with original collision vector*
- \* Continuing old example

For state 01011010,

\* 0 at  $i=1$

$$\Rightarrow 01011010 \text{ shr } 1 = 00101101$$

$$\begin{array}{r} \xrightarrow{\text{OR}} 01011010 \\ 00101101 \\ \hline 01111111 \end{array}$$



\* 0 at  $i=3$

$$\Rightarrow 01011010 \text{ shr } 3 = 00001011$$

$$\begin{array}{r} \cancel{0} \cancel{1} \cancel{0} \cancel{1} 0 \\ \cancel{0} \cancel{1} \cancel{0} \cancel{1} 0 \\ \hline 00001011 \end{array} \quad \text{OR } 01011010$$



\* 0 at  $i=6$

$$\Rightarrow 01011010 \text{ shr } 6 = 00000001$$

$$\begin{array}{r} \text{OR } 01011010 \\ 00000001 \\ \hline 0101011 \end{array}$$



\* 0 at  $i=8$

$$\Rightarrow 01011010 \text{ shr } 8 = 00000000$$

$$\begin{array}{r} 01011010 \\ \text{OR } 00000000 \\ \hline 01011010 \end{array}$$



For state 0111111

\* 0 at  $i=8$

$$\Rightarrow 010 \ 0111111 \text{ shr } 8 = 00000000$$

~~01011010~~  
~~00000000~~  
~~01111111~~

$$\Rightarrow \begin{array}{l} \text{OR} \\ \underline{01011010} \\ 00000000 \\ 01011010 \end{array}$$



For state 01011011

\* 0 at  $i=3$

$$\Rightarrow 01011011 \text{ shr } 3 = 00001011$$

$$\Rightarrow \begin{array}{l} \text{OR} \\ \underline{00001011} \\ 01011011 \\ 01011011 \end{array}$$

\* 0 at  $i=6$

$$\Rightarrow 01011011 \text{ shr } 6 = 00000001$$

$\Rightarrow 00000001$   
 OR  
 $\underline{01011010}$   
 $\underline{01011011}$

\* 0 at  $i=8$   
 $01011011 \text{ shr } 8 = 00000000$

$\Rightarrow 00000000$   
 OR  
 $\underline{01011010}$   
 $\underline{01011010}$

Complete state diagram



### Simple Cycles

A cycle in the state diagram is a complete loop from one state back to itself, but no other state is involved.

Note: including all states is not required.

Eg, simple cycles from state 01011010  $\Rightarrow (8), (1,8), (6,8)$  etc.  
 cycles from 01011011 =  $(3), (6), (8,3), (8,6)$   
 " 01111111 =  $(8,1)$

## Greedy Cycles

Each state will have

A simple cycle whose first edge taken is smallest for the starting state is known as a greedy cycle.

Each state will have a greedy cycle if it has atleast a single simple cycle. The greedy cycle will also have smallest average cycles of all simple cycles.  
Eg, in our example,

for state 01011010, has smallest edge ~~(1, 8)~~, 1  
therefore, greedy cycle is (1, 8).

for state 01011011, greedy cycle is (3).

for state 01111111, greedy cycle is (8,1).

∴ For a given state, the simple cycle with least average cycle is the greedy cycle.

## Minimum Average Latency

The average latency of a cycle can be calculated by taking average of all the edges.

Minimum Average Latency (MAL) is the smallest average latency of all greedy cycles.

So in our example,

Greedy Cycle

(1, 8)

(3)

(8, 1)

Average latency

$$\frac{8+1}{2} = 4.5$$

$$\frac{3}{1} = 3$$

$$\frac{8+1}{2} = 4.5$$

$$MAL = 3$$

## Lecture 46

### Floating Point Representation



#### Biased Exponent

Suppose we have an exponent field of 4 bits.

We know that when using 2's complement for negative numbers, the range for  $n$  bit negative number is from  $-2^{n-1}$  to  $(2^{n-1}-1)$  in decimal.

So 4 bits can store from  $-2^3$  to  $2^3-1$  i.e.  $-8$  to  $7$  which is from  $(000)_2$  to  $(011)_2$

We do not want to use a negative number for storage. So to transform it we can just add a bias to make number start from 0 in storage.

In our 4 bit example bias is 8

In  $n$  bit exponent, bias is  $2^{n-1}$

And exponent + bias = E (biased exponent)

## Normalized Mantissa

### i) Explicit Normalization (default for GATE ques)

When we normalize explicitly, we convert to form  $0.\text{1}0\dots$

So

$$101.11 \xrightarrow{\text{converts to}} 0.10111 \times 2^3$$

And mantissa ( $M$ ) = 10111

and biased exponent =  $3 + 2^{e-1}$  ( $e$  is no. of exponent bits)

### ii) Implicit Normalization (used for IEEE) bits

When we do implicit,

$$101.11 \xrightarrow{\text{converts to}} 1.0111 \times 2^{12}$$

$$\text{Mantissa } (M) = 0111$$

Note: Sometimes bias =  $b$  is written as excess- $n$  form of exponent

So if exponent is excess- $n$ , the bias is  $n$ .

For floats it is a tradeoff of exponent range and accuracy

More bits in mantissa ( $M$ ) = more precision

More bits in exponent ( $E$ ) = larger range of numbers

## Disadvantages of Floating Point Representation

then the largest

- ⇒ The smallest +ve number that can be represented in a explicit normalized mantissa is

S   E   M

0
000...1000...

This is because we always have a 1 after the point in explicit normalized form.

- ⇒ The largest -ve number is similarly

1
100...1100...

S   E   M

So on the number line



This range of numbers cannot be stored in conventional floating point representation

(This is called underflow)

So conventional floating point cannot :

- \* store zero
- \* store infinity
- \* store a number which cannot be normalized

To fix this issue, we use standard IEEE-754 format.

In some questions, mantissa is said to not be normalized and said to be stored in pure fraction.

In these questions, don't do normalization.

Eg, if  $0.239 \times 2^{13}$  will be directly converted to  $0.0011101 \times 2^{13}$  and we have to take  $M = 0011101$  and then burn 13 to biased exponent.

## Lecture - 47

### IEEE-754 Floating Point Representation

#### Single Precision

( $S = 1$  bit,  $E = 8$  bits,  $M = 23$  bits) (bias = 127)

#### Double Precision

( $S = 1$  bit,  $E = 11$  bits,  $M = 52$  bits) (bias = 1023)

We don't use the same bias calculating method as we do in conventional method because

$E = 00\ldots 0$  and  $E = 111\ldots 1$  are used for special numbers. [if (e) exponent bits, bias is  $2^{e-1} - 1$ ]

Note  $\Rightarrow$  We do implicit normalization in IEEE-754

| S   | E                                      | M                | Number                                    |
|-----|----------------------------------------|------------------|-------------------------------------------|
| 0   | 00...0                                 | 0...0            | +0                                        |
| 1   | 00...0                                 | 0...0            | -0                                        |
| 0   | 11...1                                 | 0...0            | $+\infty$                                 |
| 1   | 11...1                                 | 0...0            | $-\infty$                                 |
| 0/1 | 11...1                                 | $M \neq 0$       | NaN (Not a number)                        |
| 0/1 | 00...0                                 | $M \neq 0$       | Denormalized or fraction no.              |
| 0/1 | $E \neq 0..0$<br>and<br>$E \neq 11..0$ | $M = xx\ldots x$ | Implicit Normalization<br>(Normal Number) |

special numbers

Not A Number

means normalize to 1.0111... form

Since we normalized chose a bias such that  $E = 1..1$  and  $00...0$  is reserved, if we try to store a number such that  $E = 11..1$  and  $M \neq 0$ , it gives NaN error.

## Denormalized Number

Very small number, which cannot be normalized.

In this case in order to store this in IEEE standard,

we will only normalize till exponent is  $-(\text{bias}-1)$  and storing the Mantissa we get and  $E = 00\dots 0$ .

Eg suppose we are using single precision float and we have a number  $0.0\dots 0011$  and we cannot completely normalize, then we treat it as special number,

$$0.0\dots 0011 \rightarrow 0.00011 \times 2^{-126} \leftarrow \text{bias-1}$$

$$\text{And store } M = 000011 \\ E = 00\dots 0$$

So for a denormalize number,

$$|\text{Value}| = 0.M \times 2^{-(\text{bias}-1)}$$

Note : A question came in GATE 2020 on single precision number division.

The question asked for answer of  $0x42200000 \div 0xC1200000$