

## Chapter - 6 Pipelining

63

Pipelining - an implementation technique in which multiple tasks are overlapped in execution.

Eg: Laundry (Wash, Dry, Fold)

Eg: Ordering food item in canteen.

If 2 people wants to buy food, [place order +  
(B) buy  
in non pipelined:-  $\rightarrow$  2 stages

5m 5m 5m 5m 5m 5m.

P1 P B

P2 P B

P3 P B

Total time -  $10 \times 3 = \underline{\underline{30}}$

In pipelined:

5m 5m 5m 5m

P1 P B  $\rightarrow 2 \times 5$

P2 P B  $\rightarrow 2 \times 5 + 5$

P3 P B  $\rightarrow 2 \times 5 + 5 + 5$

\* Total time =  $\underline{\underline{20m}}$ .

Theoryput - executive time of n tasks.

Speed up ratio =  $\frac{\text{time non pipe}}{\text{time pipe.}}$

(Q) Consider a  $k$  stage pipeline with  $t_p$  as time reqd for each stage to process for  $n$  no. of processes. Calculate the total no. of cc to complete <sup>pipelined & non-p</sup> exec. in. is also speed up ratio.

eg: If 3 stages. & 4 processes.  
& non pipelined



$$\text{Total cc} = 3 \times 4$$

$$= 12$$

$$\text{i.e. } (k \times n)$$

$$\begin{aligned}\text{Total time} &= \frac{\text{no. of cc} \times \text{time for each stage}}{=\cancel{k} \times \cancel{n} \times t_p} \\ &= k \times n \times t_p\end{aligned}$$



$$\therefore \text{Total time} = 3+3 = 6 \quad (k+n-1)$$

$$\therefore \text{time} = (k+n-1) \times t_p$$

$$\text{Speed up ratio} = \frac{\text{Time non-pipe}}{\text{Time pipe}} \quad (15)$$

$$\text{Speed up} = \frac{kxn \times tp}{(k+n-1) \times tp} = \frac{kxn}{k+n-1}$$

Non pipelined:

$$\text{Time} = kxn \times tp, CC = kxn$$

Pipelined

$$\text{Time} = (k+n-1)tp, CC = k+n-1$$

a) Calculate no. of cc reqd by a 4 stage pipeline to execute 100 instr if it takes 30ps for each stage. Also calculate speed up ratio.

$$\text{No. of } CC = k+n-1 = 4+99 = \underline{\underline{103}}$$

$$\text{Speed up} = \frac{kn}{k+n-1} = \frac{4 \times 100}{4+99} = \underline{\underline{3.88}}$$

b) Assume operation time for major functional units: Memory - 200ps, ALU - 100ps, Reg Read/Write - 50ps. Consider a 5 stage pipeline in MIPS with stages Inst Fetch, Reg Read, Execution, Mem Access and Reg Write.

Calculate no. of cc reqd to execute 200 instr in pipelined and non pipelined. and also speed up ratio?

In  $\underline{\text{single}}$ ,  $\underline{\text{SCL}}$ ,

as in page 62.

600ps = total time for an instruction

$$\text{Here } (\underline{k} \times \underline{t_p}) = 600 \text{ ps}$$

$\therefore$  no need to find  $k \times t_p$

$$\begin{aligned}\text{Time}_{\underline{\text{SCL}}} &= 600 \text{ ps} \times \left( \frac{\underline{I_c} \times \underline{CPI} \times \underline{CET}}{(\underline{= 200} \times \underline{1} \times \underline{600})} \right) \\ &= 600 \times 200 = 120 \text{ ns} = \underline{\underline{120000 \text{ ps}}}\end{aligned}$$

Pipelined:

max time for each stage = 200ps

$$\therefore t_p = 200 \text{ ps}$$

$$\begin{aligned}\therefore \text{time pipe} &= \cancel{n} \times \cancel{t_p} \times (k + n - 1) t_p \\ &= (5 + 99) \times 200 \\ &= 104 \times 200 = \underline{\underline{20800 \text{ ps}}}\end{aligned}$$

Non-pipelined =  $n \times h \times t_p$

$$= 200 \times 5 \times \cancel{200}$$

max time for  
each stage

$$= \underline{\underline{200000 \text{ ps}}}$$

In MIPS, instr execution can be split 67  
 into 5 stages -

- (IF) Instr fetch - uses DM
- (ID) Instr decode - uses Reg File
- (EX) Execution - ALU
- (MEM) Memory access - DM
- (WB) Write Back - Reg File

If we have ~~four~~ <sup>four</sup> mistakes,  
 add \$s1 \$s2 \$s3  
 lw \$s4 4(\$s6)  
~~sub~~ \$s2 \$s7 \$s3  
~~or~~ \$s3 \$s7 \$s7

multiple clock cycle pipeline diagram

can be drawn as.



→ altogether it requires 8cc

If not in pipeline, how much?

$$5 \times 4 = \underline{\underline{20cc}}$$

## Pipeline hazards:

68

→ events in pipelining when the next instr cannot execute in the full cc.

① ↳ structural hazard

② ↳ data hazard

↳ load-use hazard

③ ↳ control or branch hazard.

① Structural hazard.

An occurrence in which a planned instr cannot be executed in a proper cc because h/w cannot support the combination of instr that are set to be executed.

MIPS is designed in such a way that this hazard is avoided. How?

\* structural hazard arises if a conflict (Regfile, Mem, ALU) arises when the same unit is used by diff instr in the same cc.

2 cases:

① Consider 4 cc

if same mem was used for IF and MEM access, structural hazard

will arise, because both the write and the data is being fetched / written in the same unit.

But we have separate mem - DM & DM  
 $\therefore$  It's not the same unit hence no hazard

### ⑥ Consider the 5th cc-

4th instr uses Reg file to decode

1st instr uses Reg file to write value

But this hazard is also avoided  
 because, reading the file happens  
 in the 2nd half of cc & writing  
in 1st half.



$\therefore$  MIPS has no structural hazard.

[Note: ALU is also used by only one instr in pipeline in any cc.]

$\therefore$  no hazard.

### ② Data hazard

An occurrence in which a planned instr cannot execute in the proper cc because data that is needed to execute the instr is not yet available.

eg: sub \$s1 \$s2 \$s3  
 add \$s3 \$s1 \$s2

In pipeline



In the 4th cc, add instruction requires the value of \$s1 which is written to regfile only in 5th cc

$\rightarrow$  data hazard.

Solution:

Forwarding / bypassing.

$\rightarrow$  a method of resolving the hazard by retreiving the missing data element from internal buffers rather than waiting for it to arrive from registers or memory.

e.g.: at the end of 3rd cc, the value of  $s2 - s3$  to be written to  $s1$  is ready in ALU. (for sub)

This value is forwarded to 4cc & given as i/p to ALU to for add instruction.



\* what ever value was decoded by 71  
 add  
~~the~~ instr in 3rd ic is overwritten  
 by the value forwarded from  
 above instrs & this is forwarded  
 to ALU for execution. [an occurrence in which  
 data requested from a  
 load instr has not yet  
 become available]

### 2.1 Load-use hazard - ~~an occurrence in which date requested by load~~

Consider the below set of instrs.

lw \$s1 4(\$s2)

add \$s3 \$s1 \$s4

1cc 2cc 3cc 4cc 5cc 6cc



As in the prev case, here forwarding alone will not work. Because the actual value to be written to \$s1 is available only at the end of 4cc.

But add needs \$s1 in the beginning of 4cc.

How to resolve this?

Ans: Stalling & Forwarding.

\* If add instr can wait for 1cc, the value can be forwarded. - stall - no operation.

[will see later]

## Control hazard

(72)

An occurrence in which proper  
instructions can't be executed in a cc  
because instruction fetched is not the  
one needed.

Soln: Stalling.

Eg:

beq \$s1 == \$s2 L



sub

for beq misst, if \$s1 == \$s2 are  
same, then the next misst to  
be executed is and (L), but in  
pipeline, in 2nd cc, add misst is  
automatically fetched. ∴ we  
need to stall until the decision  
is made.

[will see later]

\* actual decision making happens in the 4cc.

solution : \* predict branch will not be taken and proceed with normal sequential execution.

\* on 4cc, If PC consid consider the instrs given in pr ex:

\* if  $s_1 \neq s_2$  are same then next instr to be executed should be L : and...

$\therefore$  if bne - then stall the instrs pushed to pipeline till 4cc, else proceed with exec



(decision made here)

If one prediction goes wrong, then flush the three instrs & start executing L in 5th cc

# Pipelined Datapath.



Data stored in the pipeline registers?

IF/ID :  
 $\left. \begin{array}{l} \text{PC+4, (32 bits)} \\ \text{Instr, (32 bits)} \end{array} \right\} = 64 \text{ bits (Max size)}$

ID/EX :  
 $\left. \begin{array}{l} \text{PC+4, (32)} \\ \text{RD1, (32)} \\ \text{RD2, (32)} \\ \text{sign extended, (32)} \\ \text{st/rd no: (5)} \end{array} \right\} = 133 \text{ bits}$

EX/MEM :  
 $\left. \begin{array}{l} \text{branch address, (32)} \\ \text{zero flag, (1)} \\ \text{ALU Result, (32)} \\ \text{RD2, (32)} \\ \text{st/rd no: (5)} \end{array} \right\} = 102 \text{ bits}$

MEM/WB :  
 $\left. \begin{array}{l} \text{RD (DM), (32 bits)} \\ \text{ALU res, (32 bits)} \\ \text{st/rd no: (5)} \end{array} \right\} = 69 \text{ bits}$

a) Consider the below set of insts. (75)

84  $i_1$  lw \$s5 \$s2 - WB  
88  $i_2$  add \$s6 \$s3 \$s4 - MEM  
92  $i_3$  or \$s7 \$s2 \$s1 - EX  
96  $i_4$  add \$s0 \$s4 \$s3 - ID  
100  $i_5$  sub \$s1 \$s2 \$s1 - IF

At CC5, right before the insts are executed, the processor state is as follows. PC has address 100.

Every reg has an initial value of  $10 + \text{reg no.}$ . Determine the value of every field in the 4 pipeline reg at the end of SCC.

IF / ID: a) sub mstr

b) 104 (PC+4)

$s_0 = 26(10+16)$   
 $s_1 = 27(10+17)$   
 $s_2 = 28(10+18)$   
 $s_3 = 29$   
 $s_4 = 30$



→ ID / EX: a) PC+4 of  $i_4$  - 100 (add mstr)  
b) RD1 - \$s4 - 30  
c) RD2 - \$s3 - 29  
d) rd no. - of \$s0 - 16.

EX / MEM:

c) ALU Result of  $\text{or } i_3$  (\$s7)

$$s_2 || s_1 = 28 || 27 = \underline{\underline{31}}$$

b) rt/rd no: = of  $\$S7$  = 23 (78)

MEM / WB :

a) ALU Res of  $\$a$  (add misl.)

$$\cancel{\$S3} + \cancel{\$S4} = 29 + 30 = \underline{\underline{59}}$$

b) rt/rd no: of  $\$S6$  = 22

Code - reordering to avoid pipeline stall.

Consider the below set of instructions.

Identify the dependencies, hazards and solve. Reorder the instr. to avoid any pipeline stalls.

- 1 lw  $\$t1$   $\$0(\cancel{\$t0})$
- 2 lw  $\cancel{\$t2}$   $4(\cancel{\$t0})$
- 3 add  $\cancel{\$t3}$   $\$t1 \cancel{\$t2}$
- 4 sw  $\cancel{\$t3}$   $12(\cancel{\$t0})$
- 5 lw  $\cancel{\$t4}$   $8(\cancel{\$t1})$
- 6 add  $\cancel{\$t5}$   $\$t1 \cancel{\$t4}$
- 7 sw  $\cancel{\$t5}$   $16(\cancel{\$t0})$

|     | dependencies                      | hazard   | solve               |
|-----|-----------------------------------|----------|---------------------|
| (a) | b/w 2 $\cancel{3}(\cancel{\$t2})$ | load-use | stalling + forward. |
| (b) | 3 $\cancel{4}(\cancel{\$t3})$     | data     | forward.            |
| (c) | 5 $\cancel{6}(\cancel{\$t4})$     | load-use | stalling + forward  |
| (d) | 6 $\cancel{7}(\cancel{\$t5})$     | data     | forward             |

To avoid stalling b/w 2 & 3, (77)

- \* If we can put some independent instr b/w them, stalling can be avoided.
- \* check for one such instr.



if an instr is added,



ans:  
if instr ② is put b/w ① & ③

- 1 lw \$t1 0(\$t0)
- 2 lw \$t2 4(\$t0)
- 3 ls \$t4 8(\$t0)

- 4 add \$t3 \$t1 \$t2
- 5 sw \$t3 12(\$t0)
- 6 add \$t5 \$t1 \$t4
- 7 sw \$t5 16(\$t0)

→ ② and ③ → avoided.

# Branch hazards.

solutions:

- ① Assume branch not taken.
- ② Reducing delay of branches.
  - moving branch decision to FD stage
  - ∴ again predict BNT
  - ∴ if prediction goes wrong,  
we need to flush only one  
miss.

## Dynamic branch prediction

(79)

- ① 1-bit prediction scheme.

\* a branch history table



0 - not taken

1 - taken.

1-bit FSM



for an address - check the bit  
if 0 - assume will be taken  
if 1 - assume will not be taken  
if prediction goes wrong, then  
invert the bit.

∴ every time prediction goes wrong,  
bit is inverted.

- ② 2-bit prediction scheme.

- prediction must be wrong  
fairly, b/c it is changed.

2-bit FSN:



For the foll eg: (branching sequences)  
calculate the prediction accuracy. (assume initial state is not taken)

① NNN TTNNNTTN

actual seq predict H/M

|    |   |   |
|----|---|---|
| N  | N | H |
| N  | N | H |
| N  | N | H |
| T  | N | H |
| T  | T | M |
| TN | T | H |
| N  | N | M |
| T  | N | H |
| T  | T | M |
| N  | T | H |

$$\text{accuracy} = \frac{7}{10} = \underline{\underline{0.7}}$$

≈ calculate for  
TTTNNTTNNT