

# Pipelining

Using pipelining we can execute more than one instruction at a time.

Suppose we have four stages as

IF - instruction fetch

ID → Instruction decode

Ex - Execute

WB - write back

Suppose  
1 Each stage takes one clock. So if we have to execute 3 instruction (without pipeline) we need  $3 \times 4 = 12$  clock cycle

By using pipelining.

|    | 1              | 2              | 3              | 4              | 5              | 6              | 7              | 8              | 9              | 10             | 11             | 12 |
|----|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----|
| IF | I <sub>1</sub> | I <sub>2</sub> | I <sub>3</sub> | I <sub>4</sub> | I <sub>5</sub> | I <sub>6</sub> | I <sub>7</sub> | I <sub>8</sub> | I <sub>9</sub> |                |                |    |
| ID | I <sub>1</sub> | I <sub>2</sub> | I <sub>3</sub> | I <sub>4</sub> | I <sub>5</sub> | I <sub>6</sub> | I <sub>7</sub> | I <sub>8</sub> | I <sub>9</sub> |                |                |    |
| Ex |                | I <sub>1</sub> | I <sub>2</sub> | I <sub>3</sub> | I <sub>4</sub> | I <sub>5</sub> | I <sub>6</sub> | I <sub>7</sub> | I <sub>8</sub> | I <sub>9</sub> |                |    |
| WB |                |                | I <sub>1</sub> | I <sub>2</sub> | I <sub>3</sub> | I <sub>4</sub> | I <sub>5</sub> | I <sub>6</sub> | I <sub>7</sub> | I <sub>8</sub> | I <sub>9</sub> |    |

CPI = 1, in 12 clock we can execute 9 instruction.

$$\text{Clock Cycle} = 4 + 8 = 12$$

Suppose n no. of instruction,

$$\text{Clock Cycle needed} = 4 + (n-1)$$

①

So, if we have  $m$  stage and  $n$  no. of instructions.

$$\text{Clock cycle needed} = m + (n-1)$$

$$\text{Speedup} = \frac{\text{Time without pipeline}}{\text{Time with pipeline}}$$

Suppose time for one clock =  $t_p$

so Time using Pipeline

$$T_p = [m + (n-1)] t_p$$

without Pipeline,

$$\text{clock cycle needed} = m * n$$

$$\text{Time } \not= T_{wp} = m * n * t_p$$

$$\text{Speedup} = \frac{m * n * t_p}{[m + (n-1)] t_p} = \frac{m * n}{m + (n-1)}$$

$$= \frac{m * n}{n + (m-1)} = \frac{m}{1 + \frac{(m-1)}{n}}$$

$n \approx$  large

$$\text{Speedup} \approx m$$

$$\text{Efficiency} = \frac{\text{Speedup}}{m}$$

efficiency  $\approx 1$

Q.

$$\text{Speedup} \cong m$$

if we increase no. of stages Speedup will increase into what will be the max limit of stage?

Q.2 Comparing the time  $T_1$  taken to a single instruction on a pipelined CPU with time  $t_2$  taken on a non-pipelined but identical CPU, we can say that

(a)  $T_1 \leq T_2$ , b

(b)  $T_1 \geq T_2$

(c)  $T_1 < T_2$

(d)

How many clock cycle needed?

|                | F | D | E | W | stage          | 1        | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
|----------------|---|---|---|---|----------------|----------|---|---|---|---|---|---|---|---|----|----|----|----|----|----|
| I <sub>1</sub> | 1 | 2 | 1 | 1 | I <sub>1</sub> | <b>F</b> | D | D | E | W |   |   |   |   |    |    |    |    |    |    |
| I <sub>2</sub> | 1 | 2 | 2 | 1 | I <sub>2</sub> | <b>F</b> | X | D | D | E | E | W |   |   |    |    |    |    |    |    |
| I <sub>3</sub> | 2 | 1 | 3 | 2 | I <sub>3</sub> | <b>F</b> | F | X | D | X | E | E | W | W |    |    |    |    |    |    |
| I <sub>4</sub> | 1 | 3 | 2 | 1 | I <sub>4</sub> | <b>F</b> | F | X | D | X | E | E | W | W |    |    |    |    |    |    |
| I <sub>5</sub> | 1 | 2 | 1 | 2 | I <sub>5</sub> | <b>F</b> | X | D | D | D | X | E | E | W | W  |    |    |    |    |    |

total 15 clock cycle.

Q) How many clock cycle needed for following program

f8 (i=1 to 2)

$$\left\{ \begin{array}{l} I_1; \\ I_2; \\ I_3; \\ I_4; \end{array} \right\}$$

|       | $S_1$ | $S_2$ | $S_3$ | $S_4$ | $\leftarrow$ stage |
|-------|-------|-------|-------|-------|--------------------|
| $I_1$ | 2     | 1     | 1     | 1     | }                  |
| $I_2$ | 1     | 3     | 2     | 2     |                    |
| $I_3$ | 2     | 1     | 1     | 3     |                    |
| $I_4$ | 1     | 2     | 2     | 2     |                    |

↓

clock cycle

## Instructions

## Hazard in Pipeline

- ① Structural
- ② Control
- ③ Data

Structural: because of resource conflicts



memory

I<sub>1</sub> both try to access the  
memory at same time

Solution: ~~diff~~ Replicate Resource  
Memory 



Data Memory



Instruction  
Memory

Control depends on dependencies



Suppose I<sub>2</sub> is Branch instruction.

due to branching we have to flush I<sub>3</sub>, I<sub>4</sub>, I<sub>5</sub> and I<sub>6</sub> executions.

## Solution

(i) Flushing

(ii) Stalling

(iii) Nop  
(no operation)  
Compiler insert Nop

(iv) Reschedule & Rearrange  
Compiler job to reschedule

I01 BZ 200

I02 NOP

I03 NOP

I04 NOP

I05 -



Suppose 4 stage pipeline ~~is~~ S1, S2, S3, S4

We know the target address of Branch instruction after completion of S2 stage



If we know the target address after S3 stage



$$CPI = 3$$

If we know the target address after S4 stage

$$CPI = 4$$

Branch instructions ~~are~~



If we know the target address after E.

Assume, there are n no. of instruction in program

Probability of Conditional Branching = p

Probability of Condition true = q

no. of Stage = m, we get target address after m<sup>th</sup> stage

Total Conditional branch instruction =  $np$

Total instruction for which condition true =  $npq$



Total instruction which will not branch

$$= n - npq$$

$$= n(1-pq)$$



$$\text{effective CPI} = \frac{n pq * m + n(1-pq)}{n}$$

$$= m pq + (1-pq)$$

$$= 1 + pq(m-1)$$

## Way to avoid Stall Cycles



- Branch prediction and speculative execution.



## Delayed Branching

$S_1, S_2, S_3 | S_4, S_5$   
 $\times \times \quad S_1$



- (i) program Meaning should not change
- (ii) Smart compiler.

The following code is run on a pipelined processor with one delay slot

I1 : ADD  $R_2 \leftarrow R_7 + R_8$

I2 : SUB  $R_4 \leftarrow R_5 - R_6$

I3 : ADD  $R_1 \leftarrow R_2 + R_3$

I4 : STORE  $M[R_4] \leftarrow R_1$

BRANCH to label if  $R_1 = 0$

Which of the instructions I1, I2, I3 & I4 can legitimately occupy the delay slot without any other program modification?

Ans: I4