



- \* Program is a combination of data transfer, data manipulation, & transfer of control (COC) instructions. Different instructions take (consume) different cycle to complete the executing so,

$$\text{Program ET.} = \left[ \sum_i (I_{ci} \times CPI_i) \right] \text{Cycle time.} \quad i = \text{instruction.}$$

Let  $CPI_i$  be the number of cycles required for the instruction type  $i$  and  $I_i$  be the number of executed instructions of type  $i$  for a given program. Then we can calculate an overall (average)  $CPI$  as follows:

$$CPI = \frac{\sum_i^n (CPI_i \times I_i)}{I_c}$$

The processor time  $T$  needed to execute a given program can be expressed as  $T = I_c \times CPI \times T$  (where  $T = 1/f$ , and  $I_c = \text{instr. count.}$ )

A common measure of performance for a processor is the rate at which instructions are executed, expressed as millions of instructions per second (MIPS), referred to as the MIPS rate. We can express the MIPS rate in terms of the clock rate and CPI as follows:

$$\text{MIPS rate} = \frac{I_c}{T \times 10^6} = \frac{f}{CPI \times 10^6} \text{ MIPS}$$

Q2. Common class question 1, 2, and 3. Consider a 1.5 GHz clock frequency processor used to execute the following program segment.

| Instruction Type | Instruction Count | CPI |
|------------------|-------------------|-----|
| Load             | 200               | 11  |
| Store            | 200               | 9   |
| Arithmetic       | 250               | 7   |
| Shift            | 150               | 6   |
| Branch           | 50                | 4   |

(i) What is the average instruction execution?

$$\text{Ans} \quad \text{Avg CPI} = 200 \times 11 + 200 \times 9 + 250 \times 7 + 150 \times 6 + 50 \times 4$$

$$\text{Avg CPI} = 8.36 \text{ cycles} \quad \text{Gt. time} = \frac{950}{1.5 \text{ GHz}} = 0.66 \text{ nsec}$$

$$\text{Avg Instruction cycle} = 8.36 \text{ cycles} \quad = 8.36 \times 0.66 \text{ nsec} = 0.66 \text{ nsec}$$

$$\boxed{\text{Avg Inst. ET} = 5.51 \text{ nsec}}$$

(ii) What is the MIPS rate of a program?

$$\text{Ans} \quad 1 \text{ Instruction Avg. ET} = 5.51 \times 10^9 \text{ sec}$$

$$1 \text{ Instruction} = 5.51 \times 10^9 \text{ sec}$$

$$\text{To 1 sec} = \# \text{ Instr} = \frac{1}{5.51 \times 10^9} \text{ Inst/sec}$$

$$= \frac{1}{5.51 \times 10^9} \text{ Inst/sec} = 0.1814 \times 10^9 \text{ Inst/sec}$$

$$= 181.4 \text{ MIPS}$$

(iii) What is the total program ET?

$$\text{Ans} \quad \text{Avg Inst. ET} = 5.51 \text{ nsec}, \text{ Program Contains } 950 \text{ instructions}$$

$$\text{Program ET} = 950 \times 5.51 \text{ nsec}$$

$$= 5.234 \times 10^{-6}$$

$$= 5.234 \times 10^{-6}$$

$$\boxed{\text{Program ET} = 5.234 \text{ usec}}$$

Q1. 1 nsec clock cycle processor consumes 4 cycles for load and store operation and 6 cycles for ALU operations and 2 cycles for branch operations. The relative frequency of these operations are 40%, 40% and 20% respectively.

What is the average instruction ET?

$$\text{Avg CPI} = 4.0 \times 4 + 4.0 \times 6 + 2.0 \times 2 \\ = 1.6 + 2.4 + 0.4$$

$$\text{Avg CPI} = 4.4 \text{ cycles}$$

$$\text{Avg. Instruction ET} = 4.4 \times 1 \text{nsec} \\ = 4.4 \text{nsec}$$

(ii) What is the performance in terms of MIPS?

$$1 \text{ Instruction} = 4.4 \times 10^9 \text{ sec}$$

$$\text{In 1 sec} = \frac{1}{4.4} \times 10^9 \text{ Inst/sec} \\ = \frac{1000}{4.4} \times 10^6 \text{ Inst/sec}$$

$$= 227.2 \text{ MIPS}$$

(iii) If program contains  $10^6$  instructions then what is total program ET?

$$1 \text{ Instruction} = 4.4 \times 10^{-9} \text{ sec}$$

$$\text{Program Contains} = 10^6 \text{ Instructions}$$

$$\text{Program ET} = 4.4 \times 10^{-9} \times 10^6 \\ = 4.4 \times 10^{-3} \\ = 4.4 \text{ msec}$$

Q4. Consider a 2.8ns clock cycle processor which consumes 9 cycles for load and store instructions and 7 cycles for ALU instructions and 3 cycles for branch instructions. Relative frequency of these instructions are 40%, 40% and 20% respectively. processor is enhanced with an average CPI of 1. During the enhancement

# INSTRUCTION PIPELINING

## Pipelining

- \* Pipeline is a mechanism which are used to improve the performance of the system, in which task are Executed in an Overlapping manner.
- \* Pipelining is a decomposition technique that means the problem is divided into Sub problem and Assign the Sub problem to the pipes then Operate the pipe under the same clock.
- \* Accepting new input before the previously accepted input appear as a output at other end. i.e it means new input are Executed along with old input in a Overlapping manner.

Non Pipeline: Accepting new input after the Previously accepted input appear as a output at other end. In which no Overlapping Execution occur.

## Non Pipeline

| S <sub>4</sub> |                |                | I <sub>1</sub> | I <sub>2</sub> | I <sub>3</sub> | I <sub>4</sub> |                |                |  |                |  |  |
|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|--|----------------|--|--|
| S <sub>3</sub> |                | I <sub>1</sub> |                | I <sub>3</sub> |                | I <sub>3</sub> |                | I <sub>3</sub> |  | I <sub>4</sub> |  |  |
| S <sub>2</sub> |                | I <sub>1</sub> |                | I <sub>2</sub> |                | I <sub>3</sub> |                | I <sub>4</sub> |  |                |  |  |
| S <sub>1</sub> | I <sub>1</sub> |                | I <sub>2</sub> |                | I <sub>3</sub> |                | I <sub>4</sub> |                |  |                |  |  |

↑  
Input end.

| S <sub>4</sub> |                |                | I <sub>1</sub> | I <sub>2</sub> | I <sub>3</sub> | I <sub>4</sub> |  |  |  |  |  |  |
|----------------|----------------|----------------|----------------|----------------|----------------|----------------|--|--|--|--|--|--|
| S <sub>3</sub> |                | I <sub>1</sub> | I <sub>2</sub> | I <sub>3</sub> | I <sub>4</sub> |                |  |  |  |  |  |  |
| S <sub>2</sub> | I <sub>1</sub> | I <sub>2</sub> | I <sub>3</sub> | I <sub>4</sub> |                |                |  |  |  |  |  |  |
| S <sub>1</sub> | I <sub>1</sub> | I <sub>2</sub> | I <sub>3</sub> | I <sub>4</sub> |                |                |  |  |  |  |  |  |

↑  
Output end.

Pipeline Design:

- \* Pipeline has 2 ends
  - Input end
  - Output end

- \* Between the input and output end multiple pipes are interconnected to satisfy the functionality of the pipeline.

Clock



- \* These pipes are called Stage or Segment.
- \* Between the stages buffer are used to store the intermediate result.
- \* These buffer are also called as latch, pipeline registers, interface registers.
- \* All the stages along with the buffer are controlled by common clock.

### Pipeline Performance Evaluation:

Now consider the case where a k-segment pipeline with a clock cycle time  $t_p$ , is used to execute n tasks. The first task  $T_1$  requires a time equal to  $k t_p$ , to complete its operations since there are k-segments in the pipe. The remaining  $(n-1)$  tasks emerge from the pipe at the rate of one task per clock cycle and they will be completed after a time equal to  $(n-1)t_p$ .

Therefore, to complete n tasks using a k-segment pipeline requires  $k + (n-1)$  clock cycles.

$$ET_{\text{pipe}} = K + tp + (n-1) tp$$

$$ET_{\text{pipe}} = [K + (n-1)] tp$$

$K$ : # stages (segments)

$n$ : # instruction

$tp$ : each stage delay in pipeline.

Execution Time in Non-Pipeline:

$$ET_{\text{Non pipeline}} = n \cdot tn$$

$tn$ : each instruction execution time in non pipeline

$n$ : # instructions (tasks)

Performance Gain of Pipeline Processor

Performance Gain

Speed up factor [S]

Performance of Pipeline

Performance of non pipeline

$$= \frac{1/ET_{\text{pipe}}}{1/ET_{\text{non pipe}}}$$

$$\rightarrow \text{Speed up factor } [S] = \frac{ET_{\text{non pipe}}}{ET_{\text{pipe}}} \Rightarrow S = \frac{n \cdot tn}{[K + (n-1)] tp}$$

$$S = \frac{tn}{tp}$$

As the number of task increase,  $n$  becomes much larger than  $K-1$ , and  $K+n-1$  approaches the value of  $n$ .

When very large number of task/instruction executed  
on

When no. of task is not given.

eg:  $k=4, n=1$

$$t_p = (\text{Stage delay}) = 2 \text{ nsec.}$$

$$\begin{aligned} ET_{\text{pipe}} &= [4 + (1-1)] \times 2 \text{ nsec.} \\ &= 8 \text{ nsec} \end{aligned}$$

Pipeline

$$T_n = 2+2+2+2 = 8 \text{ nsec}$$

$$ET_{\text{non-pipe}} = n \cdot t_n$$

$$= 1 \times 8 \text{ nsec}$$

$$= 8 \text{ nsec}$$

Non-pipeline

Note: In ideal case when pipeline & non-pipeline are perfectly balanced then One Task ET in non-pipeline

$$ET \cdot NP = K \cdot t_p$$

eg:  $k=4$  1 Task ET in non-pipeline

$$t_p = 2 \text{ cycle} \quad (2 \text{ cycle in each stage})$$

$$ET_{NP} = 2+2+2+2$$

$$ET_{NP} = 8 \text{ cycle}$$

$$ET \cdot NP = K \cdot t_p$$

$$= 4 \times 2$$

$$ET \cdot NP = 8 \text{ cycle}$$

eg: 1 Task ET in non-pipeline

(1 cycle in each stage)

$$ET_{NP} = 1+1+1+1 = 4 \text{ cycle}$$

$$ET \cdot NP = K \cdot t_p$$

$$= 4 \times 1 \Rightarrow$$

$$ET \cdot NP = 4 \text{ cycle}$$

Note: This only happens in the condition when Pipeline are perfectly balanced

Each stage takes same amount of time/cycle

$$S = K \cdot t_p$$

$$t_p = \text{constant}$$

$$S = K$$

Throughput

$[K + (n-1)] \cdot tp$  time unit in task completed

In 1 time unit =  $n$

(Throughput)  $[K + (n-1)] \cdot tp$

Note: In pipeline when number of tasks are not given  
Then number of tasks completed per unit time

$$\text{Throughput} = \frac{1}{tp}$$

Efficiency:

$$\eta = \frac{S}{K}$$

S: Speed up factor

K: number of storage (segments)

Types of Pipeline:

1. Uniform Delay Pipeline

2. Non Uniform Delay Pipeline

1. Uniform Delay Pipeline: In uniform pipeline each storage taking same amount of time/delay to execute the assigned operation/task.

1 Task ET in pipeline

$$K = 4 \text{ (stage)}$$

$$tp = \text{storage delay} \times 2$$

$$n = 1$$

$$ET_{\text{pipe}} = [K + (n-1)] \cdot tp$$

$$= [4 + (1-1)] \times 2 \text{ nsec}$$

$$= 8 \text{ nsec.}$$

1 Task ET in non pipeline

$$ET_{NP} = 1 \times 8 \text{ nsec}$$

$$t_n = 2 + 2 + 2 + 2$$

$$= 8 \text{ nsec}$$

If buffer delay is included:

$$tp = (\text{storage} + \text{buffer}) \text{ delay}$$

$$tp = 2 + 1 = 3 \text{ nsec.}$$

$$ET_{\text{pipe}} = [4 + (1-1)] \times 3$$

$$= 12 \text{ nsec}$$

$$ET_{NP} = 8 \text{ nsec.}$$

1 Task ET in pipeline = 1 Task ET in non pipeline.

Non Pipeline has more overhead than pipeline.

2. Non uniform delay pipeline: In non uniform delay pipeline each stage take different amount of time (delay) to complete the assigned task/operation.

1 Task ET in pipeline

$$k = 4, n = 1$$

$$t_p = \max(\text{stage delay})$$

$$t_p = \max(2, 8, 4, 2)$$

$$t_p = 8 \text{ nsec}$$

$$ET_{\text{pipe}} = [4 + (n-1)] t_p$$

$$ET_{\text{pipe}} = 32 \text{ nsec}$$

1 Task ET in non pipeline

$$ET_{\text{NP}} = n t_p$$

$$t_p = 2 + 8 + 4 + 2$$

$$= 16 \text{ nsec}$$

$$ET_{\text{NP}} = 1 \times 16 \text{ nsec}$$

$$ET_{\text{nonpipe}} = 16 \text{ nsec}$$

If buffer delay is included:

$$t_p = \max(\text{stage} + \text{buffer delay})$$

$$= \max((2+1), (8+1), (4+1), (2+1))$$

$$= 9 \text{ nsec}$$

$$ET_{\text{pipe}} = [4 + (n-1)] \times 9$$

$$ET_{\text{pipe}} = 36 \text{ nsec}$$

When number of task/instruction increase :

If  $n = 1000$  is previous question. (above one)

$$ET_{\text{pipe}} = [k + (n-1)] t_p$$

$$= [4 + (1000-1)] \times 8$$

$$ET_{\text{pipe}} = 8024 \text{ nsec}$$

$$ET_{\text{nonpipe}} = n t_p$$

$$t_p = 2 + 8 + 4 + 2 = 16 \text{ nsec}$$

$$ET_{\text{nonpipe}} = 1000 \times 16$$

$$ET_{\text{nonpipe}} = 16000 \text{ nsec}$$

### Important Points:

1. When pipelines are perfectly balanced (Uniform delay) then 1 task (instruction) execution time in pipeline is same as 1 task E.T. in non pipelined.

2. When pipeline are not perfectly balanced (non uniform delay) then 1 task execution in pipeline is generated/equal to 1 Task ET in non Pipeline.

$T_1$ : One task ET in Pipeline

$$T_1 \geq T_2$$

$T_2$ : One task ET in non Pipeline

3. When the number of tasks are increased then pipeline performance is better compared to non pipeline.

- ④ Buffer delay is not included in non pipeline, because in non pipeline we are not storing the intermediate result. So buffer delay is included only in pipeline when it is given in question.

Uniform delay.

$$t_p = \text{Storage} + \text{buffer delay}$$

Non uniform delay

$$t_p = \max(\text{Storage} + \text{buffer delay}, \text{delay})$$

- Q1 Consider an instruction Pipeline which has Speed up factor 20 while operating with 80% efficiency. What could be the number of stages in the pipeline?

Ans  $\eta = \frac{s}{k} \Rightarrow 80\% = \frac{20}{K}$

~~input size maps to output size if you input size is 1000 then K =  $\frac{20 \times 1000}{2488} = 25$~~

- Q2 A pipeline is having Speed factor as 10 and operating with efficiency of 80%. What will be the number of stages in the pipeline?

Ans  $\eta = \frac{s}{k} \Rightarrow 80\% = \frac{10}{k}$

$$K = \frac{10^5}{4.88} \times 100 \quad K = 12.5 = 13$$

- Q3 4 Segment pipeline have the respective stage delay of 10ns, 15ns, 20ns, and 30ns. What is the efficiency of the pipeline when very large number of task are executed?

Ans  $K = 4$ ,  $t_p = 10 + 15 + 20 + 30 = 75 \text{ ns}$   $\eta = \frac{s}{t_p} = \frac{75}{30} = 2.5$

$$t_p = \max(\text{stage delay})$$

$$= \max(10, 15, 20, 30)$$

$$t_p = 30 \text{ ns}$$

$$\eta = \frac{s}{t_p} = \frac{75}{30} = 0.625 = 62.5\%$$

Q4. A 4 segment instruction pipeline has the respective stage delay of 8ns, 11ns, 20ns, 21ns respectively. The interface register are used between the stages have a delay of 2ns. What is the approx Speed up value when very large number of tasks are pipelined?

Ans.  $k=4$ . Stage delays = (8ns, 11ns, 20ns, 21ns) Buffer delay = 2ns  
 $t_p = 8+11+20+21 = 60\text{ ns}$   
 $t_p = \max(\text{stage delay} + \text{buffer delay}) = 21+2 = 23$   $S = \frac{t_p}{t_p} = \frac{41}{23} = 1.86$

Q5. Consider 4 Stage Pipeline with the respective stage delay of 20ns, 80ns, 50ns, and 10ns. To the enhancement process of the Pipeline target stage is split into 2 equal stage delay:

(i) What is the speedup between new and old design?

Ans. 4 stages, Stage delay = (20ns, 80ns, 50ns, 10ns)



$$S = \frac{\text{performance of old}}{\text{performance of new}} = \frac{80}{50}$$

$$S = 1.6$$

(ii) What is the clock frequency of the new pipeline?

Ans. Frequency  $\propto \frac{1}{\text{cycle time}}$

$$\text{Frequency} = \frac{1}{50 \times 10^9} = \frac{1}{50} \times 10^9 = \frac{1000 \times 10^6}{50}$$

$$= 20\text{ MHz}$$

Q6. The stage delay in 5 stage pipeline are 900, 600, 550, 450 and 400 nanoseconds. The larger stage delay is replaced using a functionally equivalent design involving two stages with respective stage delay 440 and 460 nanoseconds. What is the throughput increase of the pipeline?

Ans.

Old design:

$K=5$ , Stage delay = (900, 600, 550,  
450, 400)

$t_p = \max(\text{stage delay})$

$$= 900 \text{ nsec.}$$

1 Inst<sub>g</sub> in old pipe =  $900 \times 10^{-9} \text{ sec}$

In 1 sec # inst<sub>r. ext.</sub> =  $\frac{1}{900 \times 10^{-9}}$

$$\boxed{\text{Throughput} = \frac{1}{t_p} = \frac{1}{900 \text{ nsec}}}$$

New design:

$$K=6$$

Stage delay = (440, 460, 600, 550, 450,  
400)

$t_p = \max(\text{delay})$

$$t_p \text{ new} = 600 \text{ nsec}$$

$$\boxed{\text{Throughput}_{\text{new}} = \frac{1}{600}}$$

$$\% \text{ of throughput} = \frac{\text{new} - \text{old}}{\text{old}} \Rightarrow \frac{1/600 - 1/900}{1/900} = \frac{1/6 - 1/9}{1/9} = \frac{1/6}{1/9} = \frac{3}{2} = 1.5$$

Q7. A 4-stage pipeline has the stage delays as 150, 120, 160 and 140 nsec respectively. Registers that are used between the stages have a delay of 5 nsec each. Assuming constant clocking rate, the total time taken to process 1000 data items on the pipeline will be

Ans

$$n=1000 \quad K=4$$

$t_p = \max(\text{stage delay} + \text{buffer delay})$

$$t_p = 165 \text{ nsec. (160+5)}$$

$$T_{\text{pipe}} = [K + (n-1)] t_p = [4 + (1000-1)] \times 165 \times 10^{-9}$$

$$= 1003 \times 165 \times 10^{-9}$$

$$= 165.5 \times 10^{-6} \rightarrow 165.5 \text{ microsecond}$$

Q8. We have two designs, D<sub>1</sub> and D<sub>2</sub> for a synchronous pipeline processor. D<sub>1</sub> has 5 pipeline stages with execution times of 3, 2, 4, 2, and 3 nsec while the design D<sub>2</sub> has 8 pipeline stages each with 2 nsec execution time. How much time can be saved using design D<sub>2</sub> over design D<sub>1</sub> for executing 100 instructions?

Ans.

Design D<sub>1</sub>

$$K=5, \text{ Stage} = (3, 2, 4, 2, 3) \\ \text{delay}$$

$$t_p = \max(\text{stage delays}) = 4 \text{ nsec.}$$

$$ET_{D_1} = [K + (n-1)] t_p \\ = [5 + (100-1)] \times 4$$

$$ET_{D_1} = 416 \text{ nsec}$$

Design D<sub>2</sub>

$$K=8, t_p = 2 \text{ nsec } n=100$$

$$ET_{D_2} = [K + (n-1)] t_p \\ = [8 + (100-1)] \times 2$$

$$ET_{D_2} = 214 \text{ nsec}$$

$$\text{Time Saved} = 416 - 214 \Rightarrow \underline{\underline{202 \text{ nsec}}}$$

Q9. A five stage pipeline has stage delays of 150, 120, 150, 160 and 140 nanoseconds. The registers that are used between the pipeline stages have a delay of 5 nanoseconds each. The total time to execute 100 independent instructions on the pipeline, assuming there are no pipeline stalls, is 17160 nsec.

Ans

$$K=5 \quad \text{Stage delay} = (150, 120, 150, 160, 140)$$

$$\text{Buffer delay} = 5 \text{ nsec}$$

$$t_p = \max(\text{stage delay} + \text{buffer delay}) = 160 + 5 \\ = 165 \text{ nsec}$$

$$ET_{\text{pipe}} = [K + (n-1)] t_p \\ = [5 + (100-1)] \times 165 \\ = 104 \times 165 = \underline{\underline{17,160 \text{ nsec}}}$$

Q10.

Consider a 3-stage pipelined processor having a delay of 10ns, 20ns and 14ns for the first, second and third stage respectively. Assume that there is no other delay and the

Processor does not suffer from any pipeline hazard. Also assume that one instruction is fetched every cycle.

The total execution time for executing 100 instructions on this processor is ns.

Ans  $K=3$ , Stage Delay =  $(10, 20, 14 \text{ nsec})$   $n=100$

$$t_p = \max(10, 20, 14) = 20.$$

$$\begin{aligned} ET_{\text{pipe}} &= [K + (n-1)] t_p \\ &= [3 + (100-1)] \times 20 \\ &= 102 \times 20 = 2040 \text{ msec} \end{aligned}$$

Q. Consider the following processors. Assume that the pipeline registers have zero latency.

P<sub>1</sub>: Four stage pipeline with stages latencies: 1 ns, 2 ns, 2 ns, 1 ns.

P<sub>2</sub>: " " " " " " " " 1 ns, 1.5 ns, 1.5 ns, 1.5 ns

P<sub>3</sub>: Five stage " " " " " 0.5, 1, 1, 0.6, 1 ns

P<sub>4</sub>: " " " " " 0.5, 0.5, 1, 1, 1.1 ns.

Which processor has the highest clock (peak) frequency?

Ans  $P_1 \Rightarrow t_p = \max(1, 2, 2, 1) = 2 \text{ nsec}$

$$P_2 \Rightarrow t_p = \max(1, 1.5, 1.5, 1.5) = 1.5 \text{ nsec}$$

$$P_3 \Rightarrow t_p = \max(0.5, 1, 1, 0.6, 1) = 1 \text{ nsec}$$

$$P_4 \Rightarrow t_p = \max(0.5, 0.5, 1, 1, 1.1) = 1.1 \text{ nsec.}$$

Frequency  $\propto \frac{1}{t_p} \therefore P_3$  has higher clock frequency.

Q. The stage delays in a 4-stage pipeline are 800, 500, 400 and 300 picoseconds. The first stage (with delay 800 picoseconds) is replaced with a functionally equivalent design involving two stages with respective delays 600 and 350 picoseconds. The throughput increase of the pipeline is 33.33 percent.

Old design:  $t_p = \max(800, 500, 400, 300) = 800 \text{ nsec. per}$

$$\text{Throughput} = \frac{1}{800}$$

New design:  $T_p \text{ new} = \max(600, 350, 500, 400, 300) = 600 \text{ ps}$

$$\Delta T_{\text{throughput}} = \frac{1}{T_p \text{ new}} = \frac{1}{600}$$

$$\begin{aligned}\% \text{ Increment in throughput} &= \frac{T_p \text{ new} - T_p \text{ old}}{T_p \text{ old}} = \frac{1/600 - 1/800}{1/800} \\ &= \frac{8-6}{48} \times \frac{8}{1} = \frac{2}{48} \times \frac{8}{1} = \frac{16}{48} = \frac{1}{3} \\ &= \underline{\underline{33.3\%}}\end{aligned}$$

Note: In the successful pipeline

$$\begin{aligned}\text{Cycle per instruction} &= 1 \\ \therefore \text{for } n \text{ instructions } ET_{\text{pipe}} &= [K + (n-1)] \text{ cycles} \\ 1 \text{ instruction} &= K + n - 1 = \frac{n}{n} = 1 \quad (K-1) \text{ is ignored}\end{aligned}$$

$$\therefore CPI = 1$$

Q1. How to construct/design the pipeline?

Ans If we want to construct 'n' stage pipeline, the entire CPU (Control Unit) is divided into 'n' functional unit (Independent functional unit) which is independent from each other.

Independent function unit means when one functional unit performing one task, at the same time other functional unit performing the other task/operation

Eg: Instruction Pipeline

At clock cycle no. 2

$I_1$  is in Decode Stage and

$I_2$  is in Fetch Stage at

the same time (clock cycle 2)

| WB |       |       |   |
|----|-------|-------|---|
| Ex |       |       |   |
| Do |       | $I_1$ |   |
| IF | $I_1$ | $I_2$ |   |
| 1  | 2     | 3     | 4 |

Q2.

What is the meaning of CPI=1 in pipeline?

Ans

Characteristics of Pipeline is in every new cycle new input must be inserted into the pipeline.

$$\boxed{\text{CPI} = 1}$$



Q3.

Why Clock is required?

Ans

Because whenever we enable the clock then operations are performed and to provide the synchronization between the stages.

Q4. How to set this CPI in Uniform delay Pipeline?

Ans

$$\boxed{\text{Uniform delay; } t_p = \text{Stage delay}}$$

If buffer delay is given:

$$t_p = \text{Stage delay} + \text{Buffer delay.}$$

Q5. How to set this CPI in non uniform delay pipeline?

Ans

$$\boxed{t_p = \max(\text{Stage delay})}$$

If buffer delay is included:

$$\boxed{t_p = \max(\text{Stage delay} + \text{buffer delay})}$$

Q6.

Inby delay we take minimum in calculating t<sub>p</sub>?

Ans



$$t_p = 20 + 2 = 22 \text{ ns}$$

$$F = \frac{1}{22} \text{ ns}$$

If we take minimum (2 ns) then Stage 1 work finish and data move from Stage 1 to stage 2. But in 22 ns Stage 2, Stage 3 and Stage 4 work not finish.

So for proper functioning and working we take maximum

If we take maximum



If we take maximum 30 nsec but stage 1 task finish at 20 nsec  
But clock is set to be 30 nsec. Every phase output (Synchronization) will be available after 30 nsec.

When One cycle complete the delay given from stage 1 to stage 2 so Synchronization.

Q1. Consider a non-pipelined processor with a clock rate of  $2.5 \text{ GHz}$  and average cycles per instruction of four. The same processor is upgraded to a pipelined processor with five stages. Due to the internal pipeline delay, the clock speed is reduced to  $2 \text{ GHz}$ . Assume that there are no stalls in the pipeline. The speedup achieved in the pipelined processor is \_\_\_\_\_

Ans

Non Pipeline

Each instruction takes = 4 cycle

$$\text{Cycle time} = \frac{1}{2.5 \text{ GHz}}$$

$$\text{Cycle time} = 0.4 \text{ nsec}$$

$$ET \text{ in non pipeline } (t_0) = 4 \times 0.4 \text{ nsec}$$

$$= 1.6 \text{ nsec}$$

Pipeline

Each instruction takes 1 cycle

$$C.PI = 1 \text{ cycle}$$

$$\text{Cycle time} = \frac{1}{2 \text{ GHz}} = 0.5 \text{ nsec}$$

$$tp = 0.5 \text{ nsec}$$

$$S = \frac{ET_{non\ pipe}}{ET_{pipe}} = \frac{1.6}{0.5} = 3.2$$

Q2.

Consider a 4 Stage Pipeline processor. We want to execute a loop:

For ( $i=1$ ;  $i \leq 100$ ;  $i++$ ) {  $I_1, I_2, I_3, I_4$ } where the time taken (in ns) by instructions  $I_1$  to  $I_4$  for stages  $S_1, S_2, S_3, S_4$  is shown below

|       | $S_1$ | $S_2$ | $S_3$ | $S_4$ |
|-------|-------|-------|-------|-------|
| $I_1$ | 1     | 2     | 1     | 2     |
| $I_2$ | 2     | 1     | 2     | 1     |
| $I_3$ | 1     | 1     | 2     | 1     |
| $I_4$ | 2     | 1     | 2     | 1     |

The output of  $I_1$  for  $i=2$  will be available after?

Ans.

|                | S <sub>1</sub> | S <sub>2</sub> | S <sub>3</sub> | S <sub>4</sub> |
|----------------|----------------|----------------|----------------|----------------|
| I <sub>1</sub> | ↓ 11           | 3              | 4              | 6              |
| I <sub>2</sub> | ↓ 3            | 4              | 6              | 7              |
| I <sub>3</sub> | ↓ 4            | 5              | 8              | 9              |
| I <sub>4</sub> | ↓ 6            | 7              | 10             | 11             |
| I <sub>1</sub> | 7              | 9              | 11             | 13             |

Output of I<sub>i</sub> for i=2 can be 13.

Ans.

Q3 Consider a 4-stage pipeline Processor. The number of cycles needed by the four instructions I<sub>1</sub>, I<sub>2</sub>, I<sub>3</sub>, I<sub>4</sub> in stage S<sub>1</sub>, S<sub>2</sub>, S<sub>3</sub>, S<sub>4</sub> is shown below:

|                | S <sub>1</sub> | S <sub>2</sub> | S <sub>3</sub> | S <sub>4</sub> |
|----------------|----------------|----------------|----------------|----------------|
| I <sub>1</sub> | 2              | 1              | 1              | 1              |
| I <sub>2</sub> | 1              | 3              | 2              | 2              |
| I <sub>3</sub> | 2              | 1              | 1              | 3              |
| I <sub>4</sub> | 1              | 2              | 2              | 2              |

What is the number of cycle needed to execute the following loop: for (i=1 to 2) {I<sub>1</sub>, I<sub>2</sub>, I<sub>3</sub>, I<sub>4</sub>}.

Ans

|                | S <sub>1</sub> | S <sub>2</sub> | S <sub>3</sub> | S <sub>4</sub> |
|----------------|----------------|----------------|----------------|----------------|
| I <sub>1</sub> | ↓ 12           | 3              | 4              | 5              |
| I <sub>2</sub> | ↓ 3            | 6              | 8              | 10             |
| I <sub>3</sub> | ↓ 5            | 7              | 9              | 13             |
| I <sub>4</sub> | ↓ 6            | 9              | 11             | 15             |
| I <sub>1</sub> | ↓ 8            | 10             | 12             | 16             |
| I <sub>2</sub> | ↓ 9            | 3              | 15             | 18             |
| I <sub>3</sub> | ↓ 11           | 14             | 16             | 21             |
| I <sub>4</sub> | ↓ 12           | 16             | 18             | 23             |

23 cycles will be needed to execute the loop

Ans

Q4 Consider a 3GHz processor with a three-stage pipeline and stage latencies  $\tau_1$ ,  $\tau_2$ ,  $\tau_3$  such that  $\tau_1 = 3\tau_2/4 = 2\tau_3$ . If the longest pipeline stage is split into two pipeline stages of equal latency, the new frequency is 4 GHz, ignoring delays in the pipeline registers.

Ans.

$$T_1 = \frac{3T_3}{4} = 2T_2$$

$$T_1 : T_2 : T_3 = 6 : 8 : 3$$

$$T_1 = \frac{3T_2}{4}$$

$$\frac{3T_2}{4} = 2T_3$$

$$\frac{T_1}{T_2} = \frac{3}{4} [6/8]$$

$$\frac{T_2}{T_3} = \frac{8}{3}$$

New design:

Assume time =  $x$ 

$$T_1 = 6x, T_2 = 8x, T_3 = 3x$$

$$tp_{\text{new}} = \max(6x, 8x, 3x) = 8x$$

$$\text{Frequency} = \frac{1}{\text{Time}}$$

$$tp_{\text{new}} = \max(6x, 8x, 4x, 3x)$$

$$tp_{\text{new}} = 6x$$

$$\text{Frequency} = \frac{1}{\text{Time}}$$

$$3G\text{Hz} = \frac{1}{8x}$$

$$= \frac{1}{6x} \Rightarrow \frac{1}{6} \times \frac{1}{x}$$

$$\frac{1}{x} = 24G\text{Hz} \quad \text{eq(1)}$$

$$\therefore \text{Clock freq} = \frac{1}{6} \times 24G\text{Hz}$$

$$\therefore 4G\text{Hz}$$

Qs. Consider two processors P<sub>1</sub> and P<sub>2</sub> executing the same instruction set. Assume that under identical conditions, for the same input, a program running on P<sub>2</sub> takes 25% less time but incurs 20% more CPI (clock cycles per instruction) as compared to the program running on P<sub>1</sub>. If the clock frequency of P<sub>1</sub> is 1Ghz, then the clock frequency of P<sub>2</sub> (in GHz) is

Ans

$$ET = IC \times CPI \times \text{Cycle time}$$

In the question same no of instructions

$$ET = \text{cycle/instruction} \times \text{cycle time}$$

$$ET_{P1} = CPI \times \text{cycle time}_1$$

$$ET_{P2} = 1.2 CPI \times \text{cycle time}_2$$

$$ET_{P2} = 0.75 \times ET_{P1}$$

$$\text{Cycle time } P_1 = \frac{1}{1\text{GHz}} = 1\text{usec}$$

$$0.75 \times ET_{P1} = 1.2 CPI \times \text{cycle time}_2$$

$$0.75 \times CPI \times 1\text{usec} = 1.2 CPI \times \text{cycle time}_2$$

$$\text{Cycle time } T = \frac{0.75}{1.2} = 0.625 \text{ nsec}$$

$$\text{Clock frequency} = \frac{1}{T} = \frac{1}{0.625 \text{ nsec}} = 1.6 \text{ GHz}$$

Q6. Consider a machine with 40 MHz processor which has run a benchmark program. The executed programs consist of 100,000 instruction executions, with the following instruction mix and clock cycle count. What will be the effective CPI, pipe rate and execution time.

| Instruction Type   | Instruction Count | Cycle / Instruction |
|--------------------|-------------------|---------------------|
| Integer arithmetic | 15000             | 1                   |
| Data transfer      | 32000             | 2                   |
| Floating point     | 15000             | 2                   |
| Control transfer   | 8000              | 2                   |

$$\text{Avg. CPI} = 0.45 \times 1 + 0.32 \times 2 + 0.15 \times 2 + 0.8 \times 2$$

$$\text{Avg. CPI} = 1.55 \text{ cycle}$$

$$\text{Cyc. time} = \frac{1}{40 \text{ MHz}}$$

$$\text{Avg. Instruction ET} = 1.55 \times \frac{1}{40} \times 10^{-6} \text{ sec} \rightarrow 0.03875 \times 10^{-6}$$

$$1 \text{ Instruction ET} = 0.03875 \times 10^{-6} \text{ sec}$$

$$\text{To 1 sec no of instructions} = \frac{1}{0.03875} \times 10^6 \text{ Inst/sec}$$

$$= 25.8 \text{ MIPs}$$

$$\text{Total program ET} = 25.8 \times 10^6 \times 10^{-6} \text{ sec}$$

$$= 0.03875 \times 10^3 \text{ sec}$$

$$= 3.875 \times 10^{-2} \text{ sec}$$

$$= 3.875 \text{ msec}$$

Q7. Which of the following is not a form of main memory?  
Ans. Instructions Opcode.

**Q8.** In a 10-bit Computer instruction format, the size of the address field is 3 bits. The Computer uses Expanding OP code technique and has 4 two-address instructions and 16 one-address instructions. The number of zero-address instructions it can support is

**Ans.**



Primitive

Demux

Further derived

**Primitive:** Total number of operations =  $2^4 = 16$  operations.

$$\text{Given } 2AT = 4$$

# Free Opcode after allocating  $2AT = 16 - 4 = 12$

$$\begin{aligned} \text{Total # Operations in 1AF} &= \text{Free opcodes} \times 2 \\ &= 12 \times 2^{7-4} \Rightarrow 12 \times 2^3 \\ &\Rightarrow 96 \end{aligned}$$

$$\text{Given } 1AF = 16$$

# Free Opndc after allocating  $1AF = 96 - 16 = 80$

$$\begin{aligned} \text{Total # Operations in OAF} &= 80 \times 2^{10-7} = 80 \times 2^3 \\ &\Rightarrow 640 \end{aligned}$$

**Q9.**

|                       |       |       |       |       |
|-----------------------|-------|-------|-------|-------|
| Design D <sub>1</sub> | 500ns | 450ns | 600ns | 500ns |
| Design D <sub>2</sub> | 600ns | 820ns | 750ns | 650ns |

What is the cycle time latency of one instruction and throughput in the pipeline design D<sub>1</sub> and D<sub>2</sub> if in both design D<sub>1</sub> and D<sub>2</sub> having buffer register between each stage have a delay of 20 nsec? Buffer delay = 20ns.

**Ans.**

Design D<sub>1</sub>

$$tp_{D_1} = \max(\text{Stage + buffer delay})$$

$$= \max(500, 20)$$

$$tp_{D_1} = 820$$

Design D<sub>2</sub>

$$tp_{D_2} = \max(\text{Stage + buffer delay})$$

$$= (820 + 20)$$

$$tp_{D_2} = 840$$

Latency of one instruction

$$ETD_1 = 4 \times 820 = 3280 \text{ nsec}$$

$$\text{Throughput} = \frac{1}{820 \text{ nsec}}$$

Latency of one instruction

$$ETD_2 = 4 \times 840 = 3360 \text{ nsec}$$

$$\text{Throughput} = \frac{1}{840 \text{ nsec}}$$

$$\rightarrow D_1 = (820 \text{ nsec}, 3280 \text{ ns}, 1/820)$$

$$\rightarrow D_2 = (840 \text{ nsec}, 3360 \text{ ns}, 1/840)$$

Pipeline D1 performing better than D2.

## Pipeline Stages

① 2-Stage Pipeline



② 4-Stage Pipeline

I<sub>1</sub>: IF ID EX WB

I<sub>2</sub>: IF ID EX WB

I<sub>3</sub>: IF ID EX WB

I<sub>4</sub>:

IF : Instruction Fetch

ID : Instruction Decode

EX : Execute

WB : Write back

③ 6-Stage Pipeline  $\Rightarrow$  Fr DI CO FO EI WO

Additional Stages:

① Fetch Instructions:

Read the next expected instructions from buffer

③ Calculate Operande (CO)

\* Calculate the effective address of each source operand.

\* This may not involve displacement register indirect or other forms of address calculations

② Decode Instructions:

\* Determine the opcode and the operand specifiers

④ Fetch Operande (FO)

\* Fetch each operand from memory

- \* Operands in register need not be fetched

### ⑤ Executed Instruction (EI)

- \* Perform the indicated operation and store the result, if any, in the specified destination operand locations.

### ⑥ Write Operand (Wo)

- \* Store the result in memory.

Note: Generally we use Classical pipeline:



RISC Pipeline: 5 Stage

In RISC pipeline 5 Stage:

1. Instruction Fetch (IF)
2. Instruction Decode (ID)
3. Execute (Ex)
4. Memory Access (MA)
5. Write Back (WB)

1. Instruction Fetch: In this stage memory instruction is fetched from memory.

2. Instruction decode: In this stage 2 operations are performed:

(i) Decode the instruction

(ii) Operand loading (fetching) from the register file.

This stage also contains Comparator Circuit to evaluate the branch condition.

3. Execute: In this stage data processing (All Operations) are performed.

4. Memory Access : In this stage Operand (Data) will be accessed from memory (Load or Store).
5. Write Back : In this stage register write (Operand Storing into reg-file) Operation performed.

### Pipeline Hazards

Occur when the Pipeline, or some portion of the pipeline must stall because conditions do not permit continued execution.

There are 3 types of hazards:

1. Resource
2. Data
3. Control.



Also referred to as a "Pipeline bubble".

- Hazard is a situation that cause the Pipeline to idle or stall.
1. Structural hazard (Resource Conflict)
  2. Data hazard (Data/operand)
  3. Control hazard (Branch instruction)

→ In general, there are three major difficulties that cause the instruction pipeline to deviate from its normal operations.

- \* Resource conflicts caused by access to memory by two segments at the same time. Most of these conflicts can be resolved by using separate instructions and data memory.
- \* Data dependency conflicts arise when an instruction depends on the result of a previous instruction, but the result is not yet available.
- \* Branch difficulties arise from branch and other instructions that change the value of PC.

## Hazards/ Dependencies in the Pipeline:

- \* Dependency is a major problem in the Pipeline, Causes extra cycle.
- \* Cycle in the Pipeline without new input is called as extra cycle. Also named as "Stall".
- \* When stall is present in the Pipeline then  $CPI \neq 1$

eg: 5 Stage Pipeline:

Normal Execution.

|                | CC <sub>1</sub> | CC <sub>2</sub> | CC <sub>3</sub> | CC <sub>4</sub> | CC <sub>5</sub> | CC <sub>6</sub> | CC <sub>7</sub> |
|----------------|-----------------|-----------------|-----------------|-----------------|-----------------|-----------------|-----------------|
| I <sub>1</sub> | TF              | ID              | Ex              | MA              | WB              |                 |                 |
| I <sub>2</sub> |                 | IF              | ID              | Ex              | MA              | WB              |                 |

I<sub>1</sub>: ADD R<sub>1</sub>, R<sub>2</sub>, R<sub>3</sub>; R<sub>1</sub> ← R<sub>2</sub> + R<sub>3</sub>

I<sub>2</sub>: MUL R<sub>4</sub>, R<sub>1</sub>, R<sub>5</sub>; R<sub>4</sub> ← R<sub>1</sub> \* R<sub>5</sub>.

Instructions I<sub>2</sub> data dependent on I<sub>1</sub>

|                | CC <sub>1</sub> | CC <sub>2</sub> | CC <sub>3</sub> | CC <sub>4</sub> | CC <sub>5</sub> | CC <sub>6</sub> | CC <sub>7</sub> |
|----------------|-----------------|-----------------|-----------------|-----------------|-----------------|-----------------|-----------------|
| I <sub>1</sub> | TF              | ID              | Ex              | MA              | WB              |                 |                 |
| I <sub>2</sub> | IF              | TD              | "               | "               |                 |                 |                 |

Bubble/Extra Cycle.

Stall

## Structural Dependency/Hazards

- \* Structural dependency is created in the pipeline between two or more phase require the same resource at the same time and they are not able to run simultaneously.
- \* Structural dependency is created in the pipeline due to resource conflict.
- \* Resource maybe registers, functional unit, ALU, memory etc.

| eg:1. | CC1 | CC2 | CC3 | CC4 | CC5 | CC6 | CC7 | CC8 | CC9 | CC10 | CC11 | CC12 |
|-------|-----|-----|-----|-----|-----|-----|-----|-----|-----|------|------|------|
| J1:   | MEM | ID  | EX  | MEM | WB  |     |     |     |     |      |      |      |
| J2:   |     | MEM | ID  | EX  | MEM | WB  |     |     |     |      |      |      |
| J3:   |     |     | MEM | ID  | EX  | MEM | WB  |     |     |      |      |      |
| J4:   |     |     |     | MEM | MEM | MEM | WB  |     |     |      |      |      |
|       |     |     |     | "   | "   | "   | "   | "   | "   | "    | "    |      |
|       |     |     |     | ①   | ②   | ③   |     |     |     |      |      |      |

3 extra cycles / bubble / Stalls.

Here J1 is accessing the memory to store the data and  
J4 is accessing the memory to fetch the instruction.

| eg:2. | CC1 | CC2 | CC3 | CC4 | CC5 | CC6 | CC7 |
|-------|-----|-----|-----|-----|-----|-----|-----|
|       | MEM | ID  | EX  | MEM | WB  |     |     |
|       |     | MEM | ID  | EX  | MEM | WB  |     |
|       |     |     | MEM | ID  | EX  | MEM | WB  |
|       |     |     |     | MEM |     |     |     |

→ Resource Conflict

- \* In clock cycle (CC4) both J1 and J4 attempting accessing the same resource 'Memory'.
- \* This situation is called resource conflict so (J1 & J4) not go for execution.
- \* So keep J4 into the waiting until the resource become available.
- \* This waiting creates stall/bubble/extra cycle in pipeline.

### (Instructions Cache)

Instructions/Code memory [Im/Em]

Memory

→ Data Memory [Dm].

### (Data Cache)

- \* To minimize the stalls due to structural dependency one hardware technique mechanism is used called "Renaming".

- \* Renaming stage then divide the memory into independent module to store the instructions and data separately called Instruction/Code Memory [Cm/Dm] and data memory (Dm) respectively.

|                  | CC1 | CC2 | CC3 | CC4 | CC5 | CC6 | CC7 | CC8 |
|------------------|-----|-----|-----|-----|-----|-----|-----|-----|
| I <sub>1</sub> : | Cm  | To  | Ex  | Dm  | WB  |     |     |     |
| I <sub>2</sub> : |     | Cm  | To  | Ex  | Dm  | WB  |     |     |
| I <sub>3</sub> : |     |     | Cm  | To  | Ex  | Dm  | WB  |     |
| I <sub>4</sub> : |     |     |     | Cm  | To  | Ex  | Dm  | WB  |

No Stalls.

### Data Dependency:

$$\begin{aligned} I_1: & \text{ADD } r_1, r_2, r_3; \quad r_1 \leftarrow r_2 + r_3 \\ I_2: & \text{MUL } r_4, r_1, r_5; \quad r_4 \leftarrow r_1 * r_5 \end{aligned}$$

Instruction I<sub>2</sub> is data dependent on instruction I<sub>1</sub> because I<sub>2</sub> is using the result of instruction I<sub>1</sub>, as operand.

Data dependency is created in the pipeline between (I<sub>j</sub> or I<sub>j+1</sub>) instructions using the result of previous instructions.

1. Data Dependency/Flow dependency  $\rightarrow$  RAW hazards
2. Anti dependency  $\rightarrow$  WAR hazards
3. Output/Write dependency  $\rightarrow$  WAW hazards

Data hazard: A data hazard is a situation in which the pipeline is stalled because the data to be operated on are delayed for some reason. Or any condition that cause the pipeline to stall is called as hazard.

Data hazards: 1. RAW (Read after write)

2. WAR (Write after read)

3. WAW (Write after write)

### Important Points:

- \* Every dependency may create stall.
- \* ANTI and OUTPUT dependency never create stall.
- \* Data dependency may or may not create stall/bubble/extracycle.

Data Dependency

#### ① True Data Dependency

(Adjacent instructions dep. which create stall)

#### ② Data dependency

(which may or may not create stall.)

Q1. Consider the following Program segment, Executed on a RISC pipeline  
Find how many data dependency?

I<sub>1</sub> : Add r0, r1, r2

I<sub>2</sub> : Sub r3, r2, r0, r2

Data dependency

I<sub>3</sub> : Mult r4, r0, r5

R<sub>2</sub> - I<sub>1</sub>

I<sub>4</sub> : Div r0, r6, r7

I<sub>3</sub> - I<sub>1</sub>

I<sub>5</sub> : Mult r8, r0, r9

I<sub>5</sub> - I<sub>4</sub>

Q2. How to check that dependency creates hazards or not?

Ans Execution of instruction in a normal pipeline manner if there is a stall/bubble/extracycle occur then data dependency creates stall.

Q4. I<sub>1</sub> : Add r0 r1 r2; r0 ← r1 + r2

I<sub>2</sub> : Mult r3 r0 r4; r3 ← r0 \* r4.

Now check this dependency (I<sub>2</sub>-I<sub>1</sub>) creates hazards or not?

Ans RISC : 5 Stage IF, ID, EX, MA, WB

|                | cc1 | cc2 | cc3 | cc4      | cc5 | cc6 | cc7 | cc8 | cc9 |
|----------------|-----|-----|-----|----------|-----|-----|-----|-----|-----|
| I <sub>1</sub> | IF  | ID  | Ex  | MA       | WB  |     |     |     |     |
| I <sub>2</sub> |     | IF  | Id  | Jo       | Jo  | Id  | Ex  | MA  | WB  |
|                |     |     |     | 3 stages |     |     |     |     |     |

Correct value.

At C<sub>0</sub> J<sub>2</sub> read wrong value of r0

At C<sub>4</sub> J<sub>2</sub> read wrong value of r0

At C<sub>5</sub> J<sub>2</sub> read wrong value of r0

Because r0 available at end of C<sub>5</sub>

So 3 strobe/stall/extra cycle/bubble



At clock cycle 5 the value of r0 getting [write back into register]

### Split Phase



#### I Case

1<sup>st</sup> half : Reg Read

2<sup>nd</sup> half : reg write



#### II Case

1<sup>st</sup> half : register write

2<sup>nd</sup> half : register read



#### Case I is Previous Question:



Old (wrong) value So 3 strobe is previous Question.

#### Case II



so now 2 strobe

## II Case of previous Questions (Q4)

RISC : 5 Stage

|    | CC1 | CC2 | CC3 | CC4 | CC5 | CC6 | CC7 | CC8 |
|----|-----|-----|-----|-----|-----|-----|-----|-----|
| T1 | JF  | IO  | Ex  | MA  | NB  |     |     |     |
| T2 | JF  | IO  | IO  | IO  | Ex  | MA  | NB  |     |
|    |     |     |     |     |     |     |     |     |

↓ 2 stalls

At CC3 T2 read old value of  $r_0$

At CC4 T2 read old (wrong) value of  $r_0$

At CC5 (end part) of cycle T2 read correct value of  $r_0$ .

So 2 stalls.

## Data Dependency (Formal Definition)

\* Data dependency will be occurred when the instruction 'j' try to read the data before 'i' writes it.

eg: T1: add  $r_0, r_1, r_2 : r_0 \leftarrow r_1 + r_2$

T2: mul  $r_2, r_0, r_2 : r_3 \leftarrow r_0 * r_2$

\* When the above instructions are executed in a non-pipelined system then data dependency condition does not occur because "T2" executing after "T1" so, T2 reads the register ( $r_0$ ) data T1 writes it (Read-after-Write)

\* When the above instructions are executed in a pipeline system then data dependency condition will be occurred because (T2) will be executing along with (T1) so (T2) reads the ( $r_0$ ) data before T1 writes it. Therefore T2 incorrectly get the old value from (r0) (data loss).

|    | CC1 | CC2 | CC3 | CC4 | CC5 | CC6 | CC7 | CC8 |
|----|-----|-----|-----|-----|-----|-----|-----|-----|
| T1 | JF  | IO  | Ex  | MA  | NB  |     |     |     |
| T2 | JF  | IO  | ID  |     |     |     |     |     |
|    |     |     |     |     |     |     |     |     |

So:  $r_0, r_2$

D:  $r_3$

$r_0 = \text{Value} - \dots$  Read-before write

$r_2 = \text{Value}$

To handle above problems, some logic will be maintained in the "I<sub>2</sub>" Stage of pipeline to stop the occurring of a dependent data. Structure of the logic is as follows : "Tomasiello ALU".

| S/No           | Function | Destination    | Independent Source 1 | Independent Source 2 | Dependent Source 1               | Dependent Source 2 | Status | Reg. file access.                                                    |
|----------------|----------|----------------|----------------------|----------------------|----------------------------------|--------------------|--------|----------------------------------------------------------------------|
| I <sub>1</sub> | Add      | r <sub>0</sub> | r <sub>1</sub>       | r <sub>2</sub>       | -                                | -                  | 0      | r <sub>1</sub> = value<br>r <sub>2</sub> = value<br>Ex-stage stalled |
| I <sub>2</sub> | MUL      | r <sub>3</sub> | -                    | r <sub>2</sub>       | r <sub>0</sub> (I <sub>1</sub> ) | -                  | 0      | r <sub>2</sub> = value<br>Ex-stage not stalled.                      |

"0" execution yet not complete

"1" execution complete

By using the above logic, we need to stop the access of dependent data until the data becomes available.

This waiting creates extra cycle/bubble/stall in the pipeline.

### Solutions of Data Dependency (Raw Hazard)

→ Hardware Solutions

→ Software Solutions (Compiler based)

    → No operation instruction inserted.

### Hardware Solutions of Data Dependency.

1. **Hardware interlocks:** The most straightforward method is to insert hardware interlocks. An interlock is a circuit that detects instructions whose source operands are destinations of instructions further up in the pipeline.

2. **Operand forwarding:** Another technique called operand forwarding uses special hardware to detect a conflict and then avoid

it by routing the data through special paths between pipeline segments. For eg: instead of transferring one result into a destination register, the hardware register checks the destination operand, and if it is needed as a source in the next instruction, it passes the result directly into the new input, bypassing the register file.

\* To minimize the data dependency stall in the pipeline hardware technique is used i.e. Open forwarding also named as By-passing or Short circuit.

\* This mechanism states that use the buffer between the stages to hold the immediate result so that dependent instructions will be accessing the new data from the buffer before update register file.

$$\text{eg: } T_1 : r_0 \leftarrow r_1 + r_2$$

$$T_2 : r_3 \leftarrow r_0 + r_4$$

|                             | CC <sub>1</sub> | CC <sub>2</sub> | CC <sub>3</sub> | CC <sub>4</sub> | CC <sub>5</sub> | CC <sub>6</sub> | CC <sub>7</sub> |
|-----------------------------|-----------------|-----------------|-----------------|-----------------|-----------------|-----------------|-----------------|
| without Operand Forwarding: | T <sub>1</sub>  | IF              | To              | Ex              | MA              | WB              |                 |
|                             | T <sub>2</sub>  | IF              | IF              | ID              | To              | To              | Ex              |

3 Stalls

Buffer

|                          | CC <sub>1</sub> | CC <sub>2</sub> | CC <sub>3</sub> | CC <sub>4</sub> | CC <sub>5</sub> | CC <sub>6</sub> | CC <sub>7</sub> |
|--------------------------|-----------------|-----------------|-----------------|-----------------|-----------------|-----------------|-----------------|
| With Operand Forwarding: | T <sub>1</sub>  | IF              | To              | Ex              | MA              | WB              |                 |
|                          |                 |                 |                 | IF              | ID              | Ex              | MA              |

If here  
wrong (old)  
Value of r<sub>0</sub>  
Then in CC<sub>4</sub>

correct value  
r<sub>0</sub> using

1.

## Data/Raw Dependency (Read After Write)

- \* An instruction modifies a register or memory location.
- \* Succeeding instructions reads data in memory or register location.

eg:  $I_1: \text{Add } r_0, r_1, r_2$

$$I_1: r_0 \leftarrow r_1 + r_2$$

$I_2: \text{Sub } r_3, r_0, r_2$

$$I_2: r_3 \leftarrow r_0 - r_2$$

2.

## Anti Dependency/WAR Dependency (Write After Read)

- \* An instruction reads a register or memory location.
- \* Succeeding instruction writes to the location.

eg:  $I_1: \text{Mul } r_3, r_1, r_2$

$$I_1: r_3 \leftarrow r_1 * r_2$$

$I_2: \text{Add } r_1, r_4, r_5$

$$I_2: r_1 \leftarrow r_4 + r_5$$

3.

## Output Dependency/WAW Dependency (Write After Write)

- \* Two instructions both write to the same location.

eg:  $I_1: \text{Mul } r_3, r_1, r_2$

$$I_1: r_3 \leftarrow r_1 * r_2$$

$I_2: \text{Add } r_3, r_4, r_5$

$$I_2: r_3 \leftarrow r_4 + r_5$$

### Types of Data Hazard:

1.

#### Read after Write, or true dependency

- \* An instruction modifies a register or memory location.
- \* Succeeding instruction reads data in memory or register location.
- \* Hazard occurs if the read takes place before write operation is complete.

2.

#### Write After Read, or Antidependency

- \* An instruction reads a register or memory location.
- \* Succeeding instruction writes to the location.
- \* Hazard occurs if the write operation completes before the read operation takes place.

3.

#### Write After Write or Output dependency

- \* Two instructions both write to the same location.

\* Hazards occur if the write operations take place in the reverse order of the intended sequence.

Q1. Consider the following Program Segment, Executed on a Risc Pipeline:  
 (i) How many Raw dependency? (ii) How many Raw hazard?

Ans Data Raw Dependency:

I<sub>1</sub>: T<sub>1</sub>: add r0, r1, r2. I<sub>2</sub>: Sub r3, r4, r5.

I<sub>3</sub>-I<sub>4</sub>: 4 Raw dependency I<sub>2</sub>: mul r6, r7, r8.

I<sub>4</sub>-I<sub>5</sub>: 3 Raw hazards I<sub>4</sub>: xor r9, r10, r11.

I<sub>5</sub>-I<sub>1</sub>: 3 Raw hazards I<sub>5</sub>: mul r9, r10, r11.

|                | Cc <sub>1</sub> | Cc <sub>2</sub> | Cc <sub>3</sub> | Cc <sub>4</sub> | Cc <sub>5</sub> | Cc <sub>6</sub> | Cc <sub>7</sub> | Cc <sub>8</sub> | Cc <sub>9</sub> |
|----------------|-----------------|-----------------|-----------------|-----------------|-----------------|-----------------|-----------------|-----------------|-----------------|
| I <sub>1</sub> | T <sub>E</sub>  | I <sub>D</sub>  | Ex              | MA              | WB              |                 |                 |                 |                 |
| I <sub>2</sub> |                 | T <sub>F</sub>  | I <sub>D</sub>  | Ex              | MA              | WB              |                 |                 |                 |
| I <sub>3</sub> |                 |                 | T <sub>F</sub>  | I <sub>D</sub>  | Ex              | MA              | WB              |                 |                 |
| I <sub>4</sub> |                 |                 |                 | T <sub>F</sub>  | I <sub>D</sub>  | Ex              | MA              | WB              |                 |
| I <sub>5</sub> |                 |                 |                 |                 | T <sub>F</sub>  | I <sub>D</sub>  | Ex              | MA              | WB              |

After Cc<sub>5</sub> → r<sub>0</sub> output available.

Q2. Consider the following Program Segment, Executed on a Risc Pipeline:

Check Raw dependency Create Raw hazard?

Ans I<sub>1</sub>: Add r0, r1, r2

Raw Dependency

I<sub>2</sub>: Sub r3, r4, r5

I<sub>1</sub>-I<sub>3</sub>

Raw hazard: I<sub>3</sub>-I<sub>1</sub>

I<sub>3</sub>: Mul r6, r7, r8

I<sub>1</sub>-I<sub>5</sub>

I<sub>4</sub>: Add r9, r10, r11

..

I<sub>5</sub>: Add r12, r13, r14

..

|                | Cc <sub>1</sub> | Cc <sub>2</sub> | Cc <sub>3</sub> | Cc <sub>4</sub> | Cc <sub>5</sub> | Cc <sub>6</sub> | Cc <sub>7</sub> | Cc <sub>8</sub> | Cc <sub>9</sub> |
|----------------|-----------------|-----------------|-----------------|-----------------|-----------------|-----------------|-----------------|-----------------|-----------------|
| I <sub>1</sub> | T <sub>E</sub>  | I <sub>D</sub>  | Ex              | MA              | WB              |                 |                 |                 |                 |
| I <sub>2</sub> |                 | T <sub>F</sub>  | I <sub>D</sub>  | Ex              | MA              | WB              |                 |                 |                 |
| I <sub>3</sub> |                 |                 | T <sub>F</sub>  | I <sub>D</sub>  | Ex              | MA              | WB              |                 |                 |
| I <sub>4</sub> |                 |                 |                 | T <sub>F</sub>  | I <sub>D</sub>  | Ex              | MA              | WB              |                 |
| I <sub>5</sub> |                 |                 |                 |                 | T <sub>F</sub>  | I <sub>D</sub>  | Ex              | MA              | WB              |

WB available after Cc<sub>5</sub>

Raw hazard: I<sub>3</sub>-I<sub>1</sub>

Q3. How to check Raw Dependencies Create Raw hazards or not?

Anc Pipeline executing them check if that result is required earlier or not by next instructions.

Q4 Consider a pipelined processor with the following four stages:

IF, ID, EX, WB.

The IF, ID and WB stage take one clock cycle each to complete the operations. The number of clock cycles for the EX stage depends on the instructions. The ADD and SUB instruction need 1 clock cycle and the MUL instruction needs 3 clock cycles in the EX stage. Operand forwarding is used in the pipelined processor. What is the number of clock cycles taken to complete the following sequence of instructions?

Add R<sub>2</sub>, R<sub>1</sub>, R<sub>0</sub> R<sub>2</sub>  $\leftarrow$  R<sub>1</sub> + R<sub>0</sub>

Mul R<sub>4</sub>, R<sub>3</sub>, R<sub>4</sub> R<sub>4</sub>  $\leftarrow$  R<sub>3</sub> \* R<sub>2</sub>

Sub R<sub>6</sub>, R<sub>5</sub>, R<sub>4</sub> R<sub>6</sub>  $\leftarrow$  R<sub>5</sub> - R<sub>4</sub>

Anc Without Operand Forwarding:

|                | C <sub>1</sub> | C <sub>2</sub> | C <sub>3</sub> | C <sub>4</sub> | C <sub>5</sub> | C <sub>6</sub> | C <sub>7</sub> | C <sub>8</sub> | C <sub>9</sub> | C <sub>10</sub> | C <sub>11</sub> | C <sub>12</sub> |
|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|-----------------|-----------------|-----------------|
| I <sub>1</sub> | IF             | ID             | EX             | WB             |                |                |                |                |                |                 |                 |                 |
| I <sub>2</sub> |                | IF             | ""             | ""             | ID             | EX             | EX             | EX             | WB             |                 |                 |                 |
| I <sub>3</sub> |                |                | IF             | "              | "              | "              | "              | "              | "              | ID              | EX              | WB              |

$\Rightarrow$  12 clock cycles without Operand Forwarding

| WB |                |                |                | I <sub>1</sub> |                |                |                |                | I <sub>2</sub> |                |  | I <sub>3</sub> |
|----|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|--|----------------|
| Ex |                |                | I <sub>1</sub> |                |                | I <sub>2</sub> | I <sub>2</sub> | R <sub>2</sub> |                |                |  |                |
| ID |                | I <sub>1</sub> | I <sub>2</sub> | I <sub>2</sub> | I <sub>2</sub> |                |                |                |                | I <sub>3</sub> |  |                |
| IF | I <sub>1</sub> | I <sub>2</sub> | I <sub>3</sub> |                |                |                |                |                |                |                |  |                |

1 2 3 4 5 6 7 8 9 10 11 12

Without Operand Forwarding  $\Rightarrow$  12 clock cycles.

With Operand Forwarding:



8 Clock cycles using Operand Forwarding.



Anc : 8 Clock Cycles

Q5. Consider the sequence of machine instructions given below:

Mul R<sub>5</sub>, R<sub>0</sub>, R<sub>1</sub>

Div R<sub>6</sub>, R<sub>2</sub>, R<sub>3</sub>

Add R<sub>7</sub>, R<sub>5</sub>, R<sub>6</sub>

Sub R<sub>8</sub>, R<sub>7</sub>, R<sub>4</sub>.

In the above sequence, R<sub>0</sub> to R<sub>8</sub> are general purpose registers. To the instructions shown, the first register shows the result of the operation performed on the second and third registers. This sequence of instructions is to be executed in a pipelined instruction processor with the following 4 stages:

(1) Instruction fetch and decode

(2) Operand fetch

(3) Perform Operation

(4) Write back the result

(a) The P0 stage takes 1 clock cycle for Add or Sub instructions.

3 clock cycles for Mul instructions and 5 clock cycles for Div instructions. The pipelined processor uses Operand forwarding from the P0 stage to the OF stage. The number of clock cycles taken for the execution of the above sequence of instructions is 13 clock cycles.

Ans:

| WB |       |       |       |       | $T_1$ | $T_1$ | $T_1$ | $T_2$ | $T_2$ | $T_2$ | $T_2$ | $T_3$ | $T_3$ | $T_4$ |
|----|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|
| PO |       |       | $T_1$ |       | $T_1$ | $T_1$ |       | $T_2$ | $T_2$ | $T_2$ | $T_2$ | $T_3$ | $T_3$ | $T_4$ |
| OF |       | $T_1$ | $T_2$ |       |       |       |       |       |       |       |       | $T_3$ | $T_4$ |       |
| IF | $I_1$ | $I_2$ | $I_3$ | $I_4$ |       |       |       |       |       |       |       |       |       |       |

1 2 3 4 5 6 7 8 9 10 11 12 13.

 $\therefore 13 \text{ clock cycles}$ 

Another Approach:

$$n=4 \quad (T_1, T_2, T_3, T_4)$$

$$K=4 \quad (4 \text{ stages})$$

$$IF \quad 1$$

$$OF \quad 1$$

$$PO \quad 1$$

$$WB \quad 1$$

$$CPI = 1$$

Extra

$$MUL \quad 3$$

$$DIV \quad 5$$

6 stalls

$$\begin{aligned} ET_{\text{pipe}} &= [K + (n-1)] T_p \\ &= [4 + (4-1)] \text{ cycle} \\ &= 7 \text{ cycle.} \end{aligned}$$

$$\begin{aligned} ET &= ET_{\text{pipe}} + \text{stalls} \\ &= 7 + 6 \\ &= 13 \text{ cycles (Ans)} \end{aligned}$$

Q6. The instruction pipeline of a Risc processor has the following stages: IF, ID, OF, PO and WB. The IF, ID, OF and WB stage take 1 clock cycle for every instruction. Consider a sequence of 100 instructions. In the PO stage, 40 instructions take 3 clock cycles each, 35 instructions take 2 clock cycles each, and the remaining 25 instructions take 1 clock cycle each. Assume that there are no data hazard and no control hazard.

The number of clock cycles required for completion of execution of the sequence of instructions is 219.

Ans:

$$K=5, \quad n=100$$

Extra (stalls)

|        |        |                   |   |                                |
|--------|--------|-------------------|---|--------------------------------|
| IF : 1 | WB : 1 | PO : 40 Inst. : 3 | 2 | Total stalls = $40 \times 2 +$ |
| ID : 1 |        | 35 Inst. : 2      | 1 | $35 \times 1$                  |
| OF : 1 |        | 25 Inst. : 1      | 0 | = 115 stalls                   |
| PO : 1 |        |                   |   |                                |

$$\begin{aligned} E_{\text{pipe}} &= [1 + (m)] \text{ cycles} \\ &= [5 + (100 - 1)] \text{ cycles} \\ &= 104 \text{ cycles} \end{aligned}$$

$$\begin{aligned} \text{Total ET} &= E_{\text{pipe}} + E_{\text{mem}} \\ &= 104 + 15 \\ &\Rightarrow 119 \end{aligned}$$

Q7. The performance of the pipelined processor suffer if:

- A. The pipeline stages have different delays.
- B. Consecutive instructions are dependent on each other.
- C. The pipeline stages share hardware resources.
- D. All of the above.

Note: Anti and Output Dependency does not create any stalls because in Anti and output we are getting / the products or for reading / updating the correct value.

The solution of Anti and Output dependency is "Register Renaming".

Register Renaming: Use some temporary storage register to store the result of the out-of-order instruction's output.

Q8. When and how Anti dependency and Output dependency create problem?

- Ans
- (i) If the previous instruction taking more time (very high time) compare to next instruction. (Unnecessary delay)
  - (ii) When there is true data dependency between two instruction then to minimise the stall, we try to reschedule the instruction (out-of-order instruction execution).
  - (iii) In Super Scalar system, multiple instructions inserted in the stages of so next instruction if read/write the value that was used in previous instruction, then it creates a problem.

\* CPU always execute a sequence of program from top to bottom in a sequence called as in-order sequence execution.

In this execution sequence if any instruction is dependent then the remaining instructions are also sharing the same cycle count.

\* They are independent:

e.g.  $I_1$ : Add  $r_0, r_1, r_2$

$I_2$ : Sub  $r_3, r_0, r_1$

$I_3$ : Mul  $r_4, r_5, r_6$

$I_4$ : Div  $r_7, r_8, r_9$



In the above execution sequence ' $I_2$ ' is data dependent on  $I_1$ , so  $I_2$  will be waiting until the ' $I_1$ ' execution is completed when the hardware does not support operand forwarding.

This waiting creates stall in the pipeline. This stall also shared by  $I_3$  and  $I_4$  instruction even they are independent.

To utilize this point instructions scheduling concept states that insert the independent instructions first.

It causes out of order execution (Re-order execution)

Out of order execution sequence is  $I_1 - I_3 - I_4 - I_2$

Out of order execution sequence creates two more dependencies in the pipeline ...

- (i) Anti data dependency
- (ii) Output data dependency.

To handle the above dependency problem hardware technique is used i.e "register renaming".

This technique states that, use the re-order buffer to store the out-of-order buffer instruction output later update the register file with a re-order buffer content after the completion of a dependent instruction therefore data is safe.

### Control Dependency:

Control dependency is created in the pipeline due to:

- \* Branch instructions
- \* Program Control

- \* Function Call
- \* Transfer of Control Instructions

Conditional Toc

Unconditional Toc

Assume 5 Stage Pipeline and  $I_2$  is the branch instruction. Assume the target address of the branch instruction is available at the end of the last stage.

1000  $I_1$  Unconditional Toc

| $I_2$ is a<br>branch instr | 1001 $I_2$ JMP 2000       |       | 1     | 2     | 3     | 4     | 5     | 6 |
|----------------------------|---------------------------|-------|-------|-------|-------|-------|-------|---|
|                            | 1002 $I_3$ branch to 2000 | $I_1$ | $S_1$ | $S_2$ | $S_3$ | $S_4$ | $S_5$ |   |
|                            | 1003 $I_4$                | $I_2$ | $S_1$ | $S_2$ | $S_3$ | $S_4$ | $S_5$ |   |
|                            | 1004 $I_5$ = unwanted     | $I_3$ | $S_1$ | $S_2$ | $S_3$ | $S_4$ |       |   |
|                            | 1005 $I_6$ unwanted       | $I_4$ | $S_1$ | $S_2$ | $S_3$ |       |       |   |
|                            | 1006 $I_7$                | $I_5$ | $S_1$ | $S_2$ |       |       |       |   |
| 2000 $BI_1$                |                           |       | $I_6$ | $I_7$ |       |       |       |   |
| 2001 $BI_2$                |                           |       |       |       |       |       |       |   |

$I_3, I_4, I_5,$  and  $I_6$  is unwanted instruction

$BI_1$  is the wanted instruction.

| Target address           |       |       |       |       |       |       |        |   |
|--------------------------|-------|-------|-------|-------|-------|-------|--------|---|
| 1 = wanted, 0 = unwanted |       |       |       |       |       |       |        |   |
| $S_5$                    |       |       |       | $I_1$ | $I_2$ |       |        |   |
| $S_4$                    |       |       |       | $I_1$ | $I_2$ | $I_3$ |        |   |
| $S_3$                    |       |       |       | $I_3$ | $I_2$ | $I_1$ | $BI_1$ |   |
| $S_2$                    |       |       |       | $I_2$ | $I_3$ | $I_4$ | $BI_1$ |   |
| $S_1$                    | $I_1$ | $I_2$ | $I_3$ | $I_4$ | $I_5$ | $I_6$ | $BI_1$ |   |
|                          | 1     | 2     | 3     | 4     | 5     | 6     | 7      | 8 |

4 Stage

Assume the target address of the branch address of branch instruction is available at end of 4<sup>th</sup> stage.

| $S_5$ |       |       |       | $I_1$ | $I_2$ |       |        | $BI_1$ |
|-------|-------|-------|-------|-------|-------|-------|--------|--------|
| $S_4$ |       |       |       | $I_1$ | $I_2$ |       |        | $BI_1$ |
| $S_3$ |       |       |       | $I_1$ | $I_2$ | $I_3$ |        | $BI_1$ |
| $S_2$ |       |       |       | $I_1$ | $I_2$ | $I_3$ | $BI_1$ |        |
| $S_1$ | $I_1$ | $I_2$ | $I_3$ | $I_4$ | $I_5$ | $I_6$ | $BI_1$ |        |

3 Stage

Assume the target address of the branch instruction is available at the end of 3rd stage.



If the target address of branch instruction is available at  $k^{th}$  stage:

$$\text{# stalls from branch} = k - 1$$

instruction

Ideal CPT = 1

But due to control hazard / Control dependency.

$$\text{Average Instructions} = 1 + \frac{\text{# instructions which create stalls}}{\text{ET}} \times \frac{\text{# stalls from branch}}{\text{Frequency of branch instr.}}$$

$$\text{Average Instructions} = 1 + \left( \frac{\text{Branch freq.}}{\text{ET}} \times \frac{\text{# stalls due to branch}}{\text{Frequency of branch instr.}} \right)$$

### Control Dependency

- \* Control dependency is created in the pipeline due to execution of branch (Toc instruction) so program execution will be changed
- \* In this the next sequential instruction become wanted or unwanted, it depends on the output previous instruction decoding.
- \* If the previous instruction decode the instruction as simple instruction (Arithmetic, Data transfer etc) then the next instruction becomes wanted instruction.

- If the previous instruction decode the instruction as branch instruction, Unconditional Tar, Conditional Tar with True then next sequential instruction become unwanted.

1000 I<sub>1</sub>  
 I<sub>1</sub> is → 1001 I<sub>2</sub> Jmp 2000  
 a branch 1002 I<sub>3</sub>  
 instruction 1003 I<sub>4</sub>  
 1004 I<sub>5</sub>  
 1005 I<sub>6</sub>  
 1006 I<sub>7</sub>

When I<sub>1</sub> execute then I<sub>2</sub> is wanted instruction.

But when I<sub>2</sub> execute (decode) then I<sub>3</sub>, I<sub>4</sub>... is a wanted instruction because I<sub>2</sub> decode is unconditional Tar (Jmp 2000) so BI<sub>1</sub> is the new wanted instruction.

- The number of stalls Created due to branch operation is called Branch Penalty
- This branch penalty value depends on the outcome of the target address.

Note: In the Risc pipeline branch penalty value is '1' because target address is available at the end of 2<sup>nd</sup> stage.

$$\text{Branch penalty} = k-1 = 2-1 = \underline{\underline{1}}$$

\* Generally target address is available in decoding Stage/phase but conditions Checking done in Executing Phase/Stage.

\* If target address available at 'k' stage

$$\text{Branch penalty} = k-1$$

$$\boxed{\text{Branch penalty} = \begin{cases} \text{At allbar Stage} \\ \text{Target address available} \end{cases} - 1}$$

→ Avg CPI = 1, But due to Control hazard

$$CPI = 1 + \left( \frac{\text{Branch instr} \times \# \text{ stalls due}}{\text{freq. to branch}} \right)$$

Q1. Suppose we have 100 instructions. Out of 100 instructions 30 are branch type, which creates 3 stalls cycles then Aug CPI?

Ans

100 Instructions

$$ET = 100 \times 1 = 100$$

Total ET = no. of branch \* # stalls due to  
stalls instructions

branch

$$\Rightarrow 30 \times 3 = 90$$

$$\text{For total 100 instr} / = 100 + 90$$

$$= 190$$

$$\text{Aug CPI} = \frac{190}{100} = 1.9 \quad (\text{Ans})$$

$$\rightarrow \text{Average instructions } ET = 1 + \left( \frac{\text{branch instructions} \times \# \text{ stalls due to}}{\text{frequency of branches}} \right)$$

$$= 1 + (0.30 \times 3)$$

$$= 1 + 0.9$$

$$\Rightarrow 1.9 \quad (\text{Ans})$$



$$\# \text{ stalls / Instructions} = 0.7 \times 0 + 0.30 \times 3 = 0.9$$

$$\text{Aug Post-MEME} = [1 + \# \text{ stalls / Instructions}]$$

$$\Rightarrow 1.9 \quad (\text{Ans})$$

(Overlapped manner)

- Stalling: Design a pipeline in such a manner if it detects the branch instructions, then we have to stall (stop putting the next instruction into the pipeline) until branch instructions evaluating (executing) not completed

Solution of

Control Dependency

Hardware Solution  $\rightarrow$  Branch prediction

Software Solution  $\rightarrow$  Delayed Branch

Branch Prediction: In this we predict that the branch is taken or branch is not taken and if prediction is correct then accordingly instruction will execute.

(i) Branch is always taken: In this we assume every time branch is taken, so in this next instruction is target address instruction for execution.

(ii) Branch is never taken: In this we assume every time branch is not taken, so in this next instruction is the next sequential instruction for execution.

Note: If prediction goes wrong then we have to suffer from Stalls.

### Dynamic Prediction

1. Bit implementation
2. Branch target / Branch loop / Branch Predictions

Branch loop buffer: It is a high speed buffer maintained in the instructions fetch stage of pipeline.



When the next time this type of instruction comes then if matches with the address directly getting the target address.

Delayed Branch: It is a compile technique (Software Solution) in which trying to rearrange the instructions if possible.

If it is not possible to rearrange, then insert 'Nop' instruction after the branch instruction.

Nop Instruction: This is the instruction that fetch and execute but not update any register or not affect our desired output.



→ 'K' Stage Target Address

Delay slot:  $k-1$  delay slot.

Q1. Delayed branching can help in the handling of control hazards.

The following code is to run on a pipelined processor using one branch delay slot:

T<sub>1</sub> : ADD R<sub>2</sub> ← R<sub>7</sub> + R<sub>8</sub>

T<sub>2</sub> : SUB R<sub>4</sub> ← R<sub>5</sub> - R<sub>6</sub>

T<sub>3</sub> : ADD R<sub>1</sub> ← R<sub>2</sub> + R<sub>3</sub>

T<sub>4</sub> : STORE MEMORY [R<sub>4</sub>] ← R<sub>1</sub>

Branch to label if /R<sub>1</sub> == 0

Which of the above instructions T<sub>1</sub>, T<sub>2</sub>, T<sub>3</sub> or T<sub>4</sub> can legitimately occupy the delay slot without any other program modification?

Ans T<sub>4</sub> - Can insert in delay slot because no changes in program.

## INSTRUCTION PIPELINING (Continued)

### Control Dependency

$$\text{Avg Instruction ET} = \text{CPI} \times \text{Cycle Time}$$

\* Generally CPI = 1.

But due to Control Hazard, CPI ≠ 1.

$$\text{CPI} = (1 + \# \text{ stalls}/\text{Instruction})$$

$$\# \text{ stalls per instruction} = \text{Branch instruction} \times \text{Branch penalty frequency}$$

$$\text{Average Instruction ET} = (1 + \# \text{ stalls}/\text{instruction}) \times \text{cycle time}$$

Speed up factor with stalls:

$$S = \frac{\text{Avg Instruction ET}_{\text{non-pipe}}}{\text{Avg Instruction ET}_{\text{pipe}}}$$

Ideal CPI of pipeline is always '1' due to a dependency problem  
Some Stalls are created in the Pipeline.

$$S = \frac{\text{CPI}_{\text{non-pipe}} \times \text{Cycle time}_{\text{non-pipe}}}{(1 + \text{no. of stalls}/\text{Instruction}) (\text{Cycle time}_{\text{pipe}})}$$

When the Pipeline Stages are perfectly balanced then 1 task execution time in the non-pipeline is also equal to number of stages in the Pipeline i.e...  $t_n = k \cdot t_p$  Under this Condition :

$$\Rightarrow S = \frac{k \cdot t_p}{(1 + \text{no. of stalls}/\text{instruction}) (\text{Cycle time}_{\text{pipe}})}$$

$$\Rightarrow S = \frac{k}{(1 + \text{number of stalls}/\text{Instruction})}$$

Q1. A CPU has five stages pipeline and runs at 1GHz frequency. Instruction fetch happens in the first stage of pipeline. A conditional branch instruction computes the target address and evaluates the condition in the third stage of the pipeline. The processor stops fetching new instructions following a conditional branch until the branch outcome is known. A program executes  $10^9$  instructions out of which 20% are conditional branches. If each instruction takes one cycle to complete on average, then total execution time of the program

Ans Frequency = 1GHz  $\rightarrow$  Cycle time =  $\frac{1}{1\text{GHz}} \text{ sec} = 1\text{nsec}$

5 Stage Pipeline

Total =  $10^9$  instructions.

Target address = 3rd stage

Branch penalty = 3-1  $\Rightarrow$  2 stall cycles

Avg Inst. =  $(1+0.4) \times 1$  stall  $\Rightarrow$  1.4 nsec

$$ET = 1.4 \text{ nsec}$$



$$\# \text{stalls/instruction} = 0.8 \times 0 + 0.2 \times 2$$

$$\text{Program ET} = 1.4 \times 10^{-9} \times 10^9 = 1.4 \text{ sec}$$

Another approach:



$$\text{Avg CPI} = 0.8 + 0.2 \times 3 = 1.4 \text{ cycle}$$

$$\text{Avg Instructions ET} = 1.4 \times 1 \text{ nsec}$$

$$\text{Program ET} = 1.4 \times 10^{-9} \times 10^9 = 1.4 \text{ sec}$$

(Ans)

| M         | T      | W | T | F | S | S |
|-----------|--------|---|---|---|---|---|
| Page No.: | YOGIYA |   |   |   |   |   |
| Date:     |        |   |   |   |   |   |

Q2 Consider a 6-stage pipeline, where all stages are perfectly balanced. Assume that there is no cycle-time overhead of pipelining. When an application is executing on this 6-stage pipeline, the speedup achieved with respect to non-pipelined execution is 25%. Of the instructions incur 2 Pipeline stalls. Cycle time is \_\_\_\_\_.

Ans.

$$K = 6 \text{ stage}$$

$$- t_n = 6 \times c$$

$$\text{S. } \frac{\text{ET}_{\text{nonpipe}}}{\text{ET}_{\text{pipe}}} = \frac{t_n}{K} = \frac{t_n}{6}$$

$$= \frac{6 \times c}{6}$$

$$(1 + \# \text{stalls}/\text{inst}^+) \times \text{cycle time}$$

$$= \frac{6 \times c}{(1 + 0.5) \times t_n}$$

$$= \frac{6}{1.5} \Rightarrow \underline{4} \quad (\text{Ans})$$



Q3. Consider a 5-stage instruction pipeline, where all stages are perfectly balanced and have a cycle time of 20ns. Target address of branch instruction is available at end of the execution (last stage). There are 30% instruction.

(i) What is average instruction execution time (ignore the fact some are conditional)?

Ans.

$$\# \text{stages} = 5$$

$$\text{Cycle time} = 20 \text{ ns}$$

Target address available at end of 5<sup>th</sup> stage.

$$\text{Branch penalty} = 5 - 1 = 4 \text{ p.}$$

$$\text{Avg Instruction ET}_{\text{pipe}} = (1 + 1.2) \times 20$$

$$= 2.2 \times 20 \\ = \underline{44 \text{ ns}}$$

(Ans)



(ii) What is Speed up factor?

Ans.

$$\text{S. } \frac{\text{ET}_{\text{NP}}}{\text{ET}_{\text{P}}} \rightarrow \frac{57.20 \text{ ns}}{20 \text{ ns}}$$

$$(1 + \# \text{stalls}/\text{inst}) \times \text{cycle}$$

$$= \frac{5 \times 20}{(1+1/2) \times 20} \rightarrow \frac{5}{2.2} \therefore S = 2.27 \text{ (Ans)}$$

- (iii) Among the branch instructions, 30.1. are conditional and 70.1. of them, does not satisfy the condition then what is avg instruction ET?

Ans



Q4. Consider a non pipelined Processor/Operating at 2.5 GHz. It takes 5 clock cycles to complete an instruction. You are going to make a 5-Stage Pipeline out of this processor. Overheads associated with Pipelining force you to operate the Pipelined Processor at 2GHz. In a given program, assume that 30.0. are memory instructions, 60.1. are ALU instructions and rest are branch instructions. 5% of the memory instructions cause stalls of 50 clock cycles each due to Cache miss. and 50% of the branch instructions cause stalls of 2 cycles each. Assume that there are no stalls associated with the execution of ALU instructions. For this program, the Speedup achieved by the pipelined processor

Cycle time of the non-pipelined processor (round off to two decimal places) is \_\_\_\_\_

$$\text{Cycle time} = \frac{1}{24\text{MHz}} = 0.5\text{ nsec}$$

# Stalls/instruction =

$$= 30 \times 0.05 \times 50$$

$$+ 60 \times 0$$

$$+ 10 \times 50 \times 2$$

$$\Rightarrow 0.85$$

$$\text{Avg Post. ET} = (1 + 0.85) \times 0.5 \\ = 1.85 \times 0.5 \\ = \underline{\underline{0.925 \text{ nsec}}}$$



Nonpipeline: 5 clock cycles

2.5 GHz

$$\text{Cycle time} = \frac{1}{2.5} \text{ sec} = \underline{\underline{0.4 \text{ nsec}}}$$

$$\text{ET}_{\text{nonpipe}} = 5 \times 0.4 \text{ nsec} \\ = \underline{\underline{2 \text{ nsec}}}$$

$$S = \frac{\text{ET}_{\text{NP}}}{\text{ET}_{\text{P}}} = \frac{2}{0.925} \therefore \underline{\underline{S = 2.16 \text{ (Ans)}}}$$