

$$i * B_i + C_i \quad \text{for } i = 1, 2, \dots, 5$$



| Clock pulse | Seg 1<br>R1    | Seg 1<br>R2    | Seg 2<br>R3                    | Seg 2<br>R4    | Seg 3<br>R5                                    |
|-------------|----------------|----------------|--------------------------------|----------------|------------------------------------------------|
| 1           | A <sub>1</sub> | B <sub>1</sub> |                                |                |                                                |
| 2           | A <sub>2</sub> | B <sub>2</sub> | A <sub>1</sub> *B <sub>1</sub> | C <sub>1</sub> |                                                |
| 3           | A <sub>3</sub> | B <sub>3</sub> | A <sub>2</sub> *B <sub>2</sub> | C <sub>2</sub> | A <sub>1</sub> *B <sub>1</sub> +C <sub>1</sub> |
| 4           | A <sub>4</sub> | B <sub>4</sub> | A <sub>3</sub> *B <sub>3</sub> | C <sub>3</sub> | A <sub>2</sub> *B <sub>2</sub> +C <sub>2</sub> |
| 5           | A <sub>5</sub> | B <sub>5</sub> | A <sub>4</sub> *B <sub>4</sub> | C <sub>4</sub> | A <sub>3</sub> *B <sub>3</sub> +C <sub>3</sub> |
| 6           |                |                | A <sub>5</sub> *B <sub>5</sub> | C <sub>5</sub> | A <sub>4</sub> *B <sub>4</sub> +C <sub>4</sub> |
| 7           |                |                |                                |                | A <sub>5</sub> *B <sub>5</sub> +C <sub>5</sub> |
| 8           |                |                |                                |                |                                                |



SISD (concurrent)  
 SIMD (parallel comp)  
 MIMD (hypothetical)  
 MIMD (Super scalar  
or  
Super comp)  
~~OF, FD, EX, MEM, RW~~

No. of cycles  $\rightarrow$  no. of input

no. of input = 5 so  $\rightarrow$  7 cycles

for non-pipeline deline

3 cycle for 1 input

so  $\rightarrow$   $3 \times 5 = 15$  cycles.

|         | Sequential        | Pipeline |
|---------|-------------------|----------|
| $n = 5$ | $3 \times 5 = 15$ | 7        |
| $n = 6$ | $3 \times 6 = 18$ | 8        |
| $n = 7$ | $3 \times 7 = 21$ | 9        |

- Q
- Best temp register  
 1) ~~control sig~~  
 2) ~~vertical~~ (2)  
 Vertical (3)
- 2) Raw War Ray.  
 example (2)  
 pipeline example  
 numbered
  - 3) ~~diff~~ ~~stages~~  
 numbered  
 first initiation  
 pipeline numerical  
 with Branch  
 and its resolution  
 (3)

## Pipeline Cycle Time

- ④ min. time in which all segments can perform their respective sub-operations.
- in previous examples each clock pulse needs min. amt. of time to complete one segment. (or perform their sub oprs)

denote  $\Rightarrow t_p$

### General Consideration About Pipeline

Consider  $k$ -segment pipeline with clock cycle time =  $t_p$  to perform  $n$  tasks

- Time req. to perform 1<sup>st</sup> task =  $k t_p$
- Time req. " " remaining  $(n-1)$  tasks  
 $= (n-1) t_p$

Time req. for all 'm' tasks

$$= \underbrace{(k+m-1)}_{\text{no. of cycles}} t_p [t_p k + (m-1)t_p]$$

→ verifying it from previous ex.

$$k = 3, m = 5, t_p = ?$$

$$\text{total cycles} = (3 + (5-1)) = \underline{7 \text{ cycles}}$$

$$\text{total time} = 7 \times m t_p$$

\* As soon as one output comes out of pipeline each next output starts coming <sup>at</sup> ~~out~~ of each time cycle

→ Consider a non-pipeline system that takes  $t_m$  time to perform a task (Sequential, non-pipelined)

Time req. for 'm' tasks =  $m * t_m$

e.g.: - water flowing <sup>out</sup> of pipe, firstly

[It is irrespective to come out but then the water of pipe will keep flowing till you stop length] → the tap or 'm' tasks are complete.

## Speed up Ratio

if speed up ratio is 'm' it means pipelined architecture is 'm' times faster than non-pipelined architecture.

$$\text{Speed up ratio} = \frac{\text{non-pipeline time}}{\text{pipeline time}}$$

$$S = \frac{m * t_n}{(K+m-1) t_p} \quad - \textcircled{1}$$

as the number of tasks increases,  
'm' value is extremely large

$$m \gg k \quad [\text{ignore } k-1 \text{ from } \textcircled{1}]$$

$$S = \frac{t_n + m}{m + t_p} = \frac{t_n}{t_p} \quad - \textcircled{2}$$

[ideal case]

→ with initial  $k-1$  cycle we are not getting any output after that one cycle one output

→ Special Case: If to perform one task pipeline & non-pipeline system take equal time

$$m = k * t_p$$

in ②

$$S_{ideal} = \frac{k t_p}{t_p} = k$$

Speed up can never be more than 'K',

Q) A non-pipeline system takes 50 ns to process a task. The same task can be processed in a 6-Segment pipeline with a clock cycle of 10 ns. Determine the speedup ratio of the pipeline for 100 tasks. What is the maximum speedup that can be achieved?

$$t_m = 50 \text{ ns} \quad k = 6, \quad t_p = 10 \text{ ns}$$

$$m = 100$$

$$S = \frac{m * t_m}{(k+m-1) t_p} = \frac{100 * 50}{(6+99) 10}$$
$$= \frac{500}{105} = 4.76$$

$$S_{max} = \frac{t_m}{t_p} = \frac{50}{10} = 5$$



$$\begin{aligned}
 t_p &= \max(\text{seg delay}) \\
 &= \max(10, 12, 8, 9) \\
 &= 12 \text{ ms}
 \end{aligned}$$

$$\begin{aligned}
 t_p &= \text{max of seg. delays} + \text{reg delay} \\
 &= \max(10, 12, 8, 9) + 2 = 14 \text{ ms}
 \end{aligned}$$

→ For non pipelined no intermediate register needed

$$\begin{aligned}
 t_m &= \text{Sum of all seg delay} \\
 &= 10 + 12 + 8 + 9 = 39 \text{ ms}
 \end{aligned}$$

Cycle time in Synchronous Pipeline

$$t_p = \text{max of seg. delays} + \text{reg delay}$$

$$t_m = \text{Sum of seg. delay}$$

- (\*) If in question value of 'n' is given  
 $\frac{t_m}{t_m - t_p}$  is 'n' not given  $S_{\max} = \frac{t_m}{t_p}$

## Latency & Throughput

time after which machine takes next input

Latency in non-pipelined =  $t_m$

Latency in pipelined =  $t_p$

## Throughput

No. of inputs processed per unit time

$$n / (k + n - 1) t_p$$

In  $(k + n - 1) t_p$  time, no. of inputs processed =  $n$

$$\frac{1}{(k + n - 1) t_p} \text{, no of } -$$

$$\frac{n}{(k + n - 1) t_p}$$

ideal case ( $n \gg k$ )

$$\frac{1}{t_p} \Rightarrow \text{throughput} = \frac{1}{t_p}$$

$[n \gg k]$

Q) Consider a 5 stage pipeline with cycle time of 15 ns. Calculate processing time of pipeline for 500 tasks.

A)  $m = 500, k = 5, t_p = 15$   
pipeline time

$$= (k + m - 1)t_p = (5 + 500 - 1)15$$
$$= 7560 \text{ ns}$$

## Instruction Pipeline

IF : Instruction Fetch

ID : Instruction Decode

OF : Operand Fetch

EX : Execution

WB : Write Back

10 11 12 | 13 K

IF  
IF  
ID  
ID  
IF  
ID  
IF  
ID  
OF EXWB  
OF EXWB

IF

ID

IF

ID

IF

ID

IF

ID

IF

ID

IF

Stack cycles  
(bubbles)

IF ID OF EXWB  
IF ID OF EXWB

no. of instructions executed =  $T_1 - T_4 + T_3 - T_1$

$$= 6 \quad (m=6)$$

$$K = 5 \{ IF ID OF EXWB \}$$

w/o and branch

$$= (K+m-1) = (5+6-1) = 10$$

$$exten = 3$$

If branch is evaluated at  
ith stage of pipeline.

no. of stalls because of  
branch =  $i-1$

Q)

in a k-segment pipeline,  
if branch outcome is available  
after ith stage.

By standard  $\Rightarrow$  branch decision  
is made in 'Execution' Phase

Actually only 6 instructions executed

So  $m=6$ ,  $k=5$ ,

$\rightarrow$  After completion of branch i-stuff  
Next instruction is jettisoned

If one cycles time = 10 ns  
so total execution time

$$= \text{no. of cycles} * 10$$

$$= 13 \times 10 = 130 \text{ ns}$$

actual.

$$10 \times 10 = 100 \text{ ms}$$

Q) FI : 5

DI : 7 LD : 1 ms

FO : 10

EI : 8

WO : 6

$$\begin{aligned} t_p &= \max(5, 7, 10, 3, 6) + \\ &= 10 + 1 = 11 \end{aligned}$$

$$m = 12, \quad T_q \rightarrow I_q \quad [ \text{start} = 4-1 \\ = 3 ]$$

$$(k+n-1) = (5+18-1) = 12$$

~~22~~

No. of cycles

$$= 18 \times 11 = 165$$

# Pipeline Hazards

Situations that prevent the next instruction from being executing during its designated clock cycle.

↳ hazards leads to have stall cycles.

- 1) Structural hazards / Resource Conflict
- 2) Data hazard / Data dependency
- 3) Control Hazard / Branch Difficult

## Structural hazard

- ⊗ 2 diff. <sup>inputs</sup> ~~segments~~ try to use same resource at same time

|     | T1            | T2            | T3            | T4            | T5            | T6 | T7    |
|-----|---------------|---------------|---------------|---------------|---------------|----|-------|
| MUL | IF            | ID            | DF            | EX            | EX            | EX | WB    |
| ADD | <del>IF</del> | <del>ID</del> | <del>DF</del> | <del>OP</del> | <del>OP</del> | -  | EX WB |
| ADD | IF            | ID            | DF            | -             | -             | -  | EX WB |

## Example

|    |    |    |    |    |    |    |    |    |
|----|----|----|----|----|----|----|----|----|
| 0  | 1  | 2  | 3  | 4  | 5  | 6  | 7  | 8  |
| I1 | IF | ID | OF | EX | WB |    |    |    |
| I2 |    | IF | ID | OF | EX | WB |    |    |
| I3 |    |    | IF | ID | OF | EX | WB |    |
| I4 |    |    |    | IF | ID | OF | EX | WB |

If at T3, I1 and I3 both are fetching instruction & operand from memory. So there will be a resource hazard causing stalls and hence some stall cycles.

## Resolution

→ Inclusion of two cache memories one for operands (data) and one for instructions.

Data Hazard / Data dependency  
(we are considering General Pipeline hardware)  
result of an instruction is used  
as input in next.

i:  $R1 \leftarrow R2 * R3$       T1 T2 T3 T4 T5 T6  
                                   IF ID OF EX WB  
                                   IF ID ~~-~~ - OF,  
                                   stall cycle

i+1:  $R4 \leftarrow R1 + R5$

- ④ We cannot use RI value in the next instruction ( $i+1$ ) as it is yet not updated in RI (i.e. WB stage)  
So we have to wait for 2 cycles to get updated RI value  
 $\rightarrow$  So RI data dependency was there
  - ⑤ A general pipeline hardware can't detect data hazard

## Solution

- ④ Software Solution (provided by compiler)  
Compiler generates binary seq. for CPU so compiler detects hardware

and judge whether hardware can handle data dependency or not otherwise it will generate instructions in such a manner that there will not be any data dependency.

compiler soln  $\Rightarrow$  delayed load

(\*) Hardware Solution (Smart h/w)

- ↳ hardware interlock
- ↳ operand forwarding.

### Delayed Load

|    |                         |
|----|-------------------------|
| I1 | R1 K -> R2 * R3         |
| I2 | R5 $\leftarrow$ R1 + R3 |

$\rightarrow$  there is data dependency in R1  
so there shld be 2 cycle delay in OF of I2

$\rightarrow$  compiler will add two independent instruction in between I1 & I2

the compiler will insert no-  
operation instructions in between  
I1 & I2 (dependent instructions)

|              | 1  | 2  | 3  | 4  | 5  | 6  | 7  | 8  |
|--------------|----|----|----|----|----|----|----|----|
| R1 ← R2 * R3 | IF | ID | OF | EX | WB |    |    |    |
| NOP          |    |    |    |    |    | IF | ID | OF |
| NOP          |    |    |    |    |    | EX | WB |    |
| R5 ← R1 + R4 |    |    |    |    |    | IF | ID | OF |
|              |    |    |    |    |    | EX | WB |    |

→ So compiler solved the data dependency.

→ IN NOP no register values will change  
nothing will change.

↳ give example of no item on assembly line but still all operations will perform

### Smart hardware

#### Hardware interlock:

| I1 R1 ← R2 * R3 | 1  | 2  | 3  | 4  |    |   |   |    |    |    |
|-----------------|----|----|----|----|----|---|---|----|----|----|
|                 | IF | ID | OF | EX | WB |   |   |    |    |    |
| I2 R4 ← R1 * R5 |    |    |    | IF | ID | — | — | OF | EX | WB |

OF hardware will be locked for I2  
in 4th cycle as no control signal will be received.

## Operand Forwarding

(bypass is expensive)

- $R1 \leftarrow R2 + R3$  IF ID OF EX  $\rightarrow$  WB
- $R5 \leftarrow R1 * R4$  IF ID OF EX  $\rightarrow$  WB

Example: You and friend giving viva,  
of exam. he is inside and you outside.  
Both your projects are in one branch.  
You tell the viva coordinator that  
you also have the pendrive (Bluff)  
when your friend comes out you  
take pendrive and proceed for viva.)

→ In operand forwarding the result  
of EX stage for I1 is forwarded to  
WB as well as EX stage of I2 instr.  
inside ALU itself. So we saved 2 cycles.



This will only happen for  
viva data dependency.

Decision unit will decide that  
operand forwarding will happen  
or not.

### ALU to ALU data dependency

↳ no stall cycle because of  
data dependency if operand  
forward used.

ex  $R1 \leftarrow M[\text{add}]$   
LOAD     $R4 \leftarrow R1 * R5$

operand forward will not help  
coz data coming from outside  
processor.

STORE     $R4 \leftarrow R1 * R5$   
 $\text{MEM}[\text{ADD}] \leftarrow R4$

⊗ in viva example your friend not  
inside viva room

## Data Hazards

Write after write (WAW) Hazard

$$I_1 : R_3 \leftarrow R_1 * R_2$$

$$I_2 : R_3 \leftarrow R_4 + R_5$$

→ RAW (Read after Write (True dependency))

$$\text{ADD } R_1, R_2, R_3 \quad R_1 \leftarrow R_2 + R_3$$

$$\text{ADD } R_4, R_1, R_5 \quad R_4 \leftarrow R_1 + R_5$$

→ Reading of R1 after writing of R1

→ WAR (Write after Read) (Anti dependency)

$$\text{Add } R_1 \quad R_2 \quad R_3$$

$$\text{Add } R_2 \quad R_6 \quad R_5$$

read R2 earlier than writing R2

→ WAW (Write after Write) (output dependency)

$$\text{Add } R_1 \quad R_2 \quad R_3$$

$$\text{Add } R_1 \quad R_6 \quad R_5$$

R1 writes after R1 in both instructions

→ RAR (Read after Read) (No dependency)

$$\text{Add } R_1 \quad R_2 \quad R_3$$

$$\text{Add } R_7 \quad R_6 \quad R_3$$

R3 read after R5 read

## Control Hazard or Branch difficulty

- After the branch instruction CPU is not able to decide which instruction to execute as this depends on whether condition is met for branching or not
- So this leads to generation of stall cycles. and control hazard
- General hardware can't detect this control hazard.(branch difficulty)

### Handling

#### Software

Compiler

delayed branch

#### Hardware

- Prefetch Target Instruction
- Branch Prediction
- Loop Buffer
- Branch Target Buffer

### Delayed Branch

Instruction (branch) IF ID OF EX WB

NOP

IF ID OF EX WB

NOP

IF ID OF

NOP

IF ID OF

- Number of NOP inserted by compiler depending upon after how many cycles next instruction needs to be executed

Based on condition the PC value will be updated and so the <sup>next</sup> instruction will be executed

→ Because of NOP no stall cycles generated

### Hardware Solution

→ Prefetch Target Instruction

( Hardware is able to handle branch



→ Two separate pipelines will be started

### Decode phase

→ Understands which instruction (whether branch or not)

→ calculate address (target)

→ So in decode phase processor knows where to jump but doesn't know whether it should jump or not

→ 2 pipelines will go together

↳ As per condition one pipeline is discarded and one is accepted

## Branch Prediction

- Hardware will predict whether instruction will jump or not
- Static
  - Branch not taken (always)
  - Branch taken (always)
- Dynamic (intelligent)
  - decide on current execution of program
  - current hardware (80-90% correct prediction)

## Loop Buffer



- If we are running a loop we have to jump 100 times, so to accommodate this there is a special buffer in pipeline which feeds these instructions and executes them for 100 times without fetch

## Branch Target Buffer

| Branch | Target |
|--------|--------|
| I4     | I12    |
| I5     | I16    |

we keep track of where the jump happened in previous instance. By looking at the previous data we may be able to guess branch.