



# Computer Organization and Architecture

Pipelining and  
vector processing  
**part-2**

# ABOUT ME : MURALIKRISHNA BUKKASAMUDRAM

- MTech with 20 years of Experience in Teaching GATE and Engineering colleges
- IIT NPTEL Course topper in Theory of computation with 96 %
- IGIP Certified (Certification on International Engineering educator)
- GATE Qualified
- Trained more than 50 Thousand students across the country
- Area of Expertise : TOC,OS,COA,CN,DLD



# Pipelining and vector processing Part-2

## Data Hazards

5-stage



### Data Dependency

[ RAW (Read After Write) ]

True Dependency

Example :-

- I<sub>1</sub> : { ADD R<sub>1</sub>, R<sub>2</sub>, R<sub>3</sub> ; R<sub>1</sub>  $\leftarrow$  R<sub>2</sub> + R<sub>3</sub> }  
I<sub>2</sub> : { MUL R<sub>4</sub>, R<sub>1</sub>, R<sub>5</sub> ; R<sub>4</sub>  $\leftarrow$  R<sub>1</sub> \* R<sub>5</sub> }  
I<sub>3</sub> : Next Instrn.

## Operand Forwarding



# Pipelining and vector processing Part-2

WAR [ Write After Read ] | Anti - Dependency

$I_1 : ADD R_1, \overbrace{R_2, R_3}^{\text{Read from } R_2}; R_1 \leftarrow R_2 + R_3$   
 $I_2 : MUL R_2, \overbrace{R_4, R_5}^{\text{Read from } R_4}; R_2 \leftarrow R_4 * R_5$



WAW [ Write After Write ] | output Dependency

$\checkmark I_1 : MUL \overbrace{R_1, R_2, R_3}^{\text{Write to } R_1}; R_1 \leftarrow R_2 * R_3$

$\checkmark I_2 : DIV \overbrace{R_1, R_4, R_5}^{\text{Write to } R_1}; R_1 \leftarrow R_4 / R_5$



# Pipelining and vector processing Part-2

## Control Hazards [ unconditional Branch ]

I<sub>1</sub> : ADD R<sub>1</sub>, R<sub>2</sub>, R<sub>3</sub> ; R<sub>1</sub> ← R<sub>2</sub> + R<sub>3</sub>

I<sub>2</sub> : BUN Target ; Branch Unconditionally

I<sub>3</sub> : Next instrn.

⋮

✓ I<sub>T</sub> : Target Instruction.



Penalty 3 stall cycles

# Pipelining and vector processing Part-2

## Conditional Branch Instructions

Example :-

$y=1$ ,  $y=0$

$I_1 : \boxed{\text{ADD } R_1, R_2, R_3}$ ;  $R_1 \leftarrow R_2 + R_3$   
 $I_2 : \underline{\text{BV}} \quad \underline{\text{Target}}$ ; Branch to Target if overflow.  
 $I_3 : \text{Next Instn.}$   
 $\vdots$   
 $I_T : \underline{\text{Target}}$



# Pipelining and vector processing Part-2



- The instruction pipeline of a RISC processor has the following stages: Instruction Fetch (IF), Instruction Decode (ID), Operand Fetch (OF), Perform Operation (PO) and Writeback (WB). The IF, ID, OF and WB stages take 1 clock cycle each for every instruction. Consider a sequence of 100 instructions. In the PO stage, 40 instructions take 3 clock cycles each, 35 instructions take 2 clock cycles each, and the remaining 25 instructions take 1 clock cycle each. Assume that there are no data hazards and no control hazards.

The number of clock cycles required for completion of execution of the sequence of instructions is \_\_\_\_\_.

# Pipelining and vector processing Part-2



$$\begin{array}{r}
 124 \\
 70 \\
 28 \\
 \hline
 219
 \end{array}$$

# Pipelining and vector processing Part-2

$$\begin{array}{c}
 \text{Naive } [NP] \\
 (\text{IF}, \text{ID}, \text{OF}, \text{EX}, \text{wB}) \\
 \downarrow \quad \downarrow \quad \downarrow \quad \downarrow \quad \downarrow \\
 (5 \quad 4 \quad 20 \quad 10 \quad 3) = 20
 \end{array}$$

$$\text{Buffer Delay} = 2 \text{ ms}$$

E.P (Efficient)



20

12

- 20** 2. Instruction execution in a processor is divided into 5 stages. Instruction Fetch (IF), Instruction Decode (ID), Operand Fetch (OF), Execute (EX), and Write Back (WB). These stages take 5, 4, 20, 10, and 3 nanoseconds (ns) respectively. A pipelined implementation of the processor requires buffering between each pair of consecutive stages with a delay of 2 ns. Two pipelined implementations of the processor are contemplated:

- (i) a naive pipeline implementation (NP) with 5 stages and
  - (ii) an efficient pipeline (EP) where the OF stage is divided into stages OF1 and OF2 with execution times of 12 us and 8 ns respectively.

The speedup (correct to two decimal places) achieved by EP over NP in executing 20 independent instructions with no hazards is

# Pipelining and vector processing Part-2

NP. :-

$$5 \times 20 \text{ n.s} = 100 \text{ n.s}$$

$$19 \times 20 = 380 \text{ n.s}$$

$$5+19 = 24$$

$$\begin{array}{r} 480 \\ 148 \\ \hline 528 \end{array}$$

$$24 \times 2 \text{ n.s}$$

$$48 \text{ n.s}$$



EP



$$12 \times 6 = 72 \text{ n.s}$$

$$12 \times 19 = 228 \text{ n.s}$$

$$25 \times 2 = 50$$

$$\begin{array}{r} 298 \\ 72 \\ \hline 300 \end{array}$$

350

# Pipelining and vector processing Part-2

Non-Pipelining

$$\begin{aligned}
 &= 50 \text{ n.s} \times 100 \\
 &= 5000 \text{ n.s}
 \end{aligned}$$

3. A non pipelined system takes 50 ns to process a task. The same task can be processed in a 6 segment pipeline with a clock cycle of 10 ns. What is the speed up of the pipeline for 100 tasks ?

- A. 5.1
- B. 4.76
- C. 4.9
- D. None of these

$$6 \times 10 \text{ n.s} + 99 \times 10 \text{ n.s}$$

$$60 + 990 = 1050 \text{ n.s}$$

clocks - 1 2 3 4 5 6 7 8 9

|       |       |       |       |       |       |       |  |
|-------|-------|-------|-------|-------|-------|-------|--|
| $T_1$ | $s_1$ | $s_2$ | $s_3$ | $s_4$ | $s_5$ | $s_6$ |  |
|-------|-------|-------|-------|-------|-------|-------|--|

|       |       |       |       |       |       |       |  |
|-------|-------|-------|-------|-------|-------|-------|--|
| $T_2$ | $s_1$ | $s_2$ | $s_3$ | $s_4$ | $s_5$ | $s_6$ |  |
|-------|-------|-------|-------|-------|-------|-------|--|

$$\text{Speed UP} = \frac{5000}{1050} =$$

# Pipelining and vector processing Part-2

$$\text{clock Cycle} = \frac{1}{\text{frequency}}$$

P<sub>1</sub>

$$1 \text{ cycle} = \frac{1}{10^9} \text{ sec} = 1 \text{ n.s}$$

5. Consider two processors P<sub>1</sub> and P<sub>2</sub> executing the same instruction set . Assume that under identical conditions , for the same input a program running on P<sub>2</sub> takes 25% less time but incurs 20% more CPI (Clock cycles per instruction) as compared to the program running on P<sub>1</sub>. If the clock frequency of P<sub>1</sub> is 1 GHz then the clock frequency of P<sub>2</sub> is ?

Sol :- Assume that P<sub>1</sub> takes 5 cycles

P<sub>1</sub> takes 5 n.s

$$6 \text{ cycles} = 3.75 \text{ n.s}$$

$$6 \times \frac{1}{x \text{ GHz}} = 3.75$$

$$x = 6 / 3.75 = 1.6$$

$$\text{the time taken by P}_2 = 5 - 0.25(5)$$

$$= \underline{\underline{3.75 \text{ n.s}}}$$

$$\text{P}_2 \text{ takes } 5 + 0.20(5) = 6 \text{ cycles}$$

# Pipelining and vector processing Part-2

$$\text{Speed Up} = \frac{\text{Time taken for Non-Pipeline for 1-instruction}}{\text{Time taken for Pipelining}}$$

6. Consider a non pipelined processor with a clock rate of 2.5 GHz and average Cycles per instruction of four. The same processor is upgraded to a pipelined processor with five stages, but due to the internal pipeline delay, The clock speed is reduced to 2 GHz. Assume that there are no stalls in the pipeline. The speed up achieved in this pipelined processor is ?

3.2 ✓

$$1 \text{ cycle} = \frac{1}{2.5 \times 10^9} \text{ sec}$$

$$= \frac{1}{2.5} \text{ n.s.}$$

$$\begin{aligned} 1 \text{ cycle} &= \frac{1}{2 \times 10^9} \text{ sec} \\ &= \frac{1}{2} \text{ n.s.} = 0.5 \text{ n.s} \end{aligned}$$

$$1 \text{ instruction takes } 4 \times \frac{1}{2.5} = 1.6$$

$$\text{Speed Up} = \frac{1.6}{0.5} = \underline{\underline{3.2}}$$

# Pipelining and vector processing Part-2

Consider the clock cycles taken by the 4 instructions as follows.

|                | F | D | E | W |
|----------------|---|---|---|---|
| I <sub>1</sub> | 1 | 2 | 1 | 1 |
| I <sub>2</sub> | 2 | 1 | 3 | 1 |
| I <sub>3</sub> | 1 | 1 | 2 | 1 |
| I <sub>4</sub> | 1 | 1 | 1 | 2 |

Find  
Total No. of Clock Cycles ?

|                | clocks - | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|----------------|----------|---|---|---|---|---|---|---|---|---|----|----|----|
| I <sub>1</sub> |          | F | D | D | E | W |   |   |   |   |    |    |    |
| I <sub>2</sub> |          |   | F | F | D | E | E | E | W |   |    |    |    |
| I <sub>3</sub> |          |   |   | F | D | — | — | E | E | W |    |    |    |
| I <sub>4</sub> |          |   |   |   | F | D | — | — | — | E | W  | W  |    |

12 cycles