

## Pipeline

Definition - It is a technique of decomposing a sequential process into suboperation with each subprocess being executed in a dedicated segment that operates simultaneously or concurrently with all other segments.

→ In pipeline execution technique multiple instructions are overlapped in execution only if the resources, that are used in the same clock cycle or different stages are different then only overlapping is possible.

→ Advantage of pipeline is, it make the performance of CPU high. This is done by increasing throughput of the system by using Pipeline. Throughput means no. of task completed per unit time so that the program can run faster even if no single instruction runs faster.

### Principle of Pipeline →

Pipeline is based on the principle that multiple no. of independent task are operating simultaneously or concurrently.

Speed of pipeline depends on no. of stages or segments present in pipeline.

Speed up = No. of pipeline stages or segments

### General structure & consideration of pipeline →

→ Any operation is divided into suboperation & each suboperation is performed in a segment ( $S_i$ ) which consist of combinational circuits.

→ The segments ( $S_i$ ) are separated by registers ( $R_i$ ) where each registers store intermediate result of segments.

→ Information flows between adjacent segments under the control of common clock applied to all registers simultaneously.

→ A task is defined as total operations performed going through all segments.

## Diagram of 4-Segment pipeline



→ The results moves in the segments one step down to the next segment. When no more data or inputs are available then the clock pulse go on continue till the last O/P (output) emerge out of the pipeline.

→ Behaviour of pipeline is explained by using space-time diagram where the x-axis contain clock cycle & y-axis contain segments.

→ Example -

Show the Space-time diagram to process 4 task in a 3-segment pipeline.

Clock Cycle → 1 2 3 4 5 6

Segments     $S_1 \downarrow T_1 \quad T_2 \quad T_3 \quad T_4$

$S_2 \quad T_1 \quad T_2 \quad T_3 \quad T_4$

$S_3 \quad \quad T_1 \quad T_2 \quad T_3 \quad T_4$

$S_1, S_2 \text{ & } S_3$  - Segments

$T_1, T_2, T_3 \text{ & } T_4$  - Tasks

Pipeline Speed up Calculation →

Let's consider 'n' is no. of task to be performed.

For conventional machine or non-pipelined machine -

$t_n = \text{Time to complete execution of single task}$ .

$T_1 = \text{Total time to complete execution of } n \text{ no. of task}$

$$\text{So, } T_1 = n \times t_n$$

For pipelined machine -

$t_p = \text{Time to complete each suboperation}$ .

$T_K = \text{Total time to complete execution of } n \text{ no. of task}$

## Pipeline Speed up calculation -

Let's consider 'n' no.of task to be processed.

Speed up = Total time required to process 'n' no.of task through non-pipelined system on conventional system

Total time required to process 'n' no.of task through pipeline system.

For non-pipelined system -

Let  $t_n$  = time to complete execution of a single instruction.

$T_1$  = time to complete execution of 'n' no.of task on instructions.

$$T_1 = n \times t_n$$

For Pipeline System -

Let  $t_p$  = time to complete each sub-operation in segment or stage.

$T_K$  = Total time to complete 'n' no.of task on instructions.

$$T_K = \{K + (n-1)\} t_p$$

Speed up of a pipeline processing over an equivalent non-pipeline processing is defined by the ratio given above. i.e:

$$S_K \text{ (Speed up)} = \frac{T_1}{T_K}$$

$$S_K = \frac{n \times t_n}{\{K + (n-1)\} t_p}$$

If the time taken to complete 'n' no.of task in non-pipeline system is same as the time required to complete 'n' no.of task in pipeline system then the speed up can be expressed as,

$$S_K = \frac{n \times t_n}{\{K + (n-1)\} t_p}$$

$$\Rightarrow S_K = \frac{n \times K + n \times t_p}{\{K + (n-1)\} t_p} \quad [\because t_n = K + t_p]$$

$$\Rightarrow S_K = \frac{nK}{K + (n-1)}$$

If no. of task to be processed becomes much larger than  $K + (n-1) = n$ . So under this condition the speed up equation reduced to,

$$S_K = \frac{n+t_n}{\{K+(n-1)\}+t_p}$$

$$\Rightarrow S_K = \frac{n+t_n}{n+t_p} \quad [\because n \ggg, K+(n-1)=n]$$

$$\Rightarrow \boxed{S_K = \frac{t_n}{t_p}}$$

( $S_K = \frac{t_n}{t_p}$  is the actual speed up of pipeline)

If no. of task to be processed is more & also the time required to process 'n' no. of task in non-pipeline system is same as that of pipeline system then the speed up can be expressed as,

$$S_K = \frac{n+t_n}{\{K+(n-1)\}+t_p}$$

$$\Rightarrow S_K = \frac{n \times K + t_p}{n+t_p} \quad [\because t_n = K+t_p \& n \ggg, so K+(n-1)=n]$$

$$\Rightarrow \boxed{S_K = K}$$

This shows that- the maximum speed up that a pipeline can provide is  $K$ . (It is hypothetical)

From the speed up formula it is clear that- the total no. of clock cycles required to execute or process 'n' no. of task through pipeline is,

$$\text{No. of clock cycle} = K + (n-1)$$

e.g- Process  $A_i * B_i + C_i$  through pipeline. Where  $i \rightarrow 1 \text{ to } 4$



| Segment →   | $S_1$        | $S_2$              | $S_3$               |
|-------------|--------------|--------------------|---------------------|
| CLOCK cycle | $(R_1, R_2)$ | $(R_3, R_4)$       | $(R_5)$             |
| $t_1$       | $A_1, B_1$   |                    |                     |
| $t_2$       | $A_2, B_2$   | $(A_1 * B_1), C_1$ |                     |
| $t_3$       | $A_3, B_3$   | $(A_2 * B_2), C_2$ | $(A_1 * B_1) + C_1$ |
| $t_4$       | $A_4, B_4$   | $(A_3 * B_3), C_3$ | $(A_2 * B_2) + C_2$ |
| $t_5$       |              | $(A_4 * B_4), C_4$ | $(A_3 * B_3) + C_3$ |
| $t_6$       |              |                    | $(A_4 * B_4) + C_4$ |

Total no. of clock cycle required =  $K + (n-1)$

$$= 3 + (4-1)$$

$$= 6 \text{ no. of}$$

Q. In a 4-stage pipeline the time required to process a sub-operation within segment is 20 ns and no. of task is very large. The time required to process task in a non-pipeline system is 60 ns then calculate speed up.

Ans -  $t_p = 20 \text{ ns}$

$$t_n = 60 \text{ ns}$$

$n \ggg$

$$S_K = \frac{t_n}{t_p} = \frac{60}{20} = 3$$

Q. In a 6-segment pipeline if the time required to complete execution of 'n' no. of task is same as that of non-pipeline system and  $n \ggg$ , calculate speed up(max).

Ans -  $t_n = K t_p \text{ & } n \ggg$

So, Speed up ( $S_K$ ) =  $K = 6$

The maximum speed up = 6

Q. In a 4-segment pipeline time required to complete the sub-operation in segment or stage is 20 ns, no. of task to be processed is 100. (Assume that both pipeline & non-pipeline system requires same time to process total no. of task). calculate speed up.

Ans -  $t_p = 20 \text{ ns}$ .

$$n = 100$$

$$K = 4$$

$$\text{Speed up } (S_K) = \frac{n K t_p}{\{S_K + (n-1)\} t_p} = \frac{100 \times 4 \times 20}{\{4 + (100-1)\} 20} = \frac{8000}{2060} = 3.88$$

$(t_n = K t_p)$

## Types of pipeline →

1. Arithmetic pipeline
2. Instruction pipeline
3. RISC pipeline.

### 1. Arithmetic Pipeline -

- Arithmetic pipeline divides an arithmetic operation into sub-operations for execution in pipeline segments.
- It is found in very high speed computers. It is basically used in solving floating-pt scientific operations.
- e.g.  $(A_i * B_i) + (C_i * D_i)$  where  $i \rightarrow 1 \text{ to } 4$



Segments →

| Clock Cycle ↓ | $S_1$       |             |             |             | $S_2$               |                     | $S_3$                             |  |
|---------------|-------------|-------------|-------------|-------------|---------------------|---------------------|-----------------------------------|--|
|               | R1<br>$A_i$ | R2<br>$B_i$ | R3<br>$C_i$ | R4<br>$D_i$ | R5<br>$(A_i * B_i)$ | R6<br>$(C_i * D_i)$ | R7<br>$(A_i * B_i) + (C_i * D_i)$ |  |
| $t_1$         | $A_1$       | $B_1$       | $C_1$       | $D_1$       |                     |                     |                                   |  |
| $t_2$         | $A_2$       | $B_2$       | $C_2$       | $D_2$       | $(A_1 * B_1)$       | $(C_1 * D_1)$       |                                   |  |
| $t_3$         | $A_3$       | $B_3$       | $C_3$       | $D_3$       | $(A_2 * B_2)$       | $(C_2 * D_2)$       | $(A_1 * B_1) + (C_1 * D_1)$       |  |
| $t_4$         | $A_4$       | $B_4$       | $C_4$       | $D_4$       | $(A_3 * B_3)$       | $(C_3 * D_3)$       | $(A_2 * B_2) + (C_2 * D_2)$       |  |
| $t_5$         |             |             |             |             |                     |                     | $(A_3 * B_3) + (C_3 * D_3)$       |  |
| $t_6$         |             |             |             |             |                     |                     | $(A_4 * B_4) + (C_4 * D_4)$       |  |

$$\text{No. of clock cycle} = K + (n-1)$$

$$= 3 + (4-1)$$

$$= 6$$

e.g. floating-pt addition & subtraction.

Let the two inputs to the floating-pt adder are,

$$X = A \times 2^a$$

$$Y = B \times 2^b.$$

where  $A \& B$  = Mantissa

$a \& b$  = exponents

The sub-operations for floating-pt addition & subtractions are,

1. Compare the exponents.
2. Align the mantissa of smaller exponent.
3. ADD or SUB the mantissa.
4. Normalize the results.

Comparison is done by subtracting the exponents. Then the mantissa of smaller exponents is shifted to the right to align the mantissa. After aligning the ADD operation is to be done & after that the result is normalized to avoid overflow or underflow.

The comparator, shifter, Adder or subtractor, incrementor & decrementor in arithmetic pipeline are implemented using combinational circuit.

$$X = 0.9504 \times 10^3$$

$$Y = 0.8200 \times 10^2$$

Now, after aligning

$$X = 0.9504 \times 10^3$$

$$Y = 0.0820 \times 10^3$$

After Addition result is,  $Z = x + y = 1.0324 \times 10^3$

After normalizing the result is  $Z = 0.10324 \times 10^4$

Q. calculate the value of tp for the above e.g. if the time delay of 5 segments are given as,

$t_1 = 60 \text{ ns}$ ,  $t_2 = 70 \text{ ns}$ ,  $t_3 = 100 \text{ ns}$ ,  $t_4 = 80 \text{ ns}$  &  $t_5 = 50 \text{ ns}$  and the interface register has a delay of  $t_6 = 10 \text{ ns}$ . For the same calculate  $t_n$  (non-pipeline floating-pt delay) & speed up.

Ans - For non-pipeline,

$$\begin{aligned} t_n &= t_1 + t_2 + t_3 + t_4 + t_5 + t_6 \\ &= 60 + 70 + 100 + 80 + 50 + 10 \\ &= 370 \text{ ns} \end{aligned}$$

For Pipeline,  $t_p$  = maximum time delay among all segments  
+ Register or intermediate register delay.

$$= 100 + 10$$

$$= 110 \text{ ns}$$

$$\text{Speed up} = \frac{t_m}{t_p} = \frac{370}{110} = 3.36$$

Q. Determine no. of clock cycles that a pipeline take to process 200 task in a 6-segment pipeline.

$$\text{Ans: } K = 6, n = 200$$

$$\begin{aligned} \text{No. of clock cycles required} &= K + (n-1) = 6 + (200-1) \\ &= 6 + 199 = 205 \end{aligned}$$

Q. A non-pipeline system takes 50 ns to process a task. The same task can be processed in a 6-segment pipeline with a clock cycle of 10 ns. Determine the speed up calculation for 100 task. What is the maximum speed up that can be achieved?

$$\text{Ans: } t_m = 50 \text{ ns}; t_p = 10 \text{ ns}; K = 6; n = 100$$

$$\begin{aligned} \text{Speed up (S}_K\text{)} &= \frac{n t_m}{\{K + (n-1)\} t_p} = \frac{100 \times 50}{\{6 + (100-1)\} \times 10} = \frac{5000}{1050} \\ &= 4.76 \end{aligned}$$

$$\text{Maximum Speed up in real or actual} = \frac{t_m}{t_p} = \frac{50}{10} = 5$$

Q. Calculate the time required to process 50 task in 6-segment pipeline, where segment time delay is 20 ns.

$$\text{Ans: } K = 6, n = 50, t_p = 20 \text{ ns}$$

$T_1$  = time required to process 'n' no. of task in Pipeline system.

$$\begin{aligned} T_1 &= \{K + (n-1)\} \times t_p \\ &= \{6 + (50-1)\} \times 20 \\ &= 1100 \text{ ns} \end{aligned}$$

## 2. Instruction Pipeline -

- Instruction pipeline is very complex because instruction execution is extremely complex & involves several operations which are executed successively. This implies a large amount of hardware works at a given time.
- In instruction pipeline technique multiple no. of instructions execution are overlapped. This is solved without using any additional hardware but only by letting different parts of the hardware work for different instructions at the same time.
- The instruction pipeline organization of a processor is similar to an assembly line i.e; the work to be done in an instruction is broken down into smaller steps or pieces, each of which takes a fraction of time needed to complete the entire instruction.

Each of the stages or steps is a pipe stage or pipe segments. Pipe stages or segments are connected to form a pipe.



- The time required for moving an instruction from one stage to the next is called a machine cycle or one clock cycle. The execution of an instruction takes several clock cycles or machine cycles as it passes through pipeline.

→ Different stages of instruction pipeline are,

- (a) Fetch instruction (FI) - During this stage read the next expected instruction into a buffer.
- (b) Decode instruction (DI) - During this stage decode the opcode or operation code & operand specifier.
- (c) Calculate operand (CO) - During this stage the address calculation or effective address of source operand is calculated.
- (d) Fetch Operand (FO) - During this stage each operand or data from memory is fetched. If the operand is present in processor register then it does not require FO stages.
- (e) Execution instruction (EI) - During this stage the specified operation is performed & write or store the result in any processor register if any specified.

(b) Write operand (W0) - During this stage the result is written into memory.

Different stages of 6-segment pipeline are,

FI, DI, CO, FO, EI, W0

Different stages of 4-segment pipeline are,

FI

DI

CO

FO

EI

W0

FI — Fetch next instruction from memory.

DA — Decode instruction & calculate effective address of operand.

FO — Fetch operand by using effective address.

EX — Execute instruction & write the result in memory.

Different stages of 2-segment instruction pipeline are,

FI - Fetch instruction.

EI - Execute instruction.

→ The completion time will be longer for fetch stage. So, execution will take long time for reading, storing & also for performing some operation directly on operand stored in memory. Thus fetch time is longer & other stages may have to wait for sometime before it can empty its buffer. This is to be done or waiting time is added to other stages than memory related stages because of making time delay of each stages equal.

i.e:  $\lambda_{\text{Delay of FI stage}}^{\text{time}} = \lambda_{\text{Delay of DI stage}}^{\text{time}} = \text{time delay of CO stage} = \text{time delay of FO stage} = \text{time delay of EI stage} = \text{time delay of W0 stage}.$

$$t_p = \text{Maximum} (\text{FI, DI, CO, FO, EI, W0}) + t_r$$

where  $t_r$  = register or intermediate register delay.

→ Segments that require memory access are,  
FI, FO, W0.

→ Not all instructions goes through all the 6-stages of pipeline. e.g - LOAD instruction donot need W0 stages. But to simplify instruction pipeline it is

assumed that each instruction requires all the 6-segments or 6-stages.

Q. Process 4 instruction in 2-segment pipeline & calculate no.of clock cycle.

| CLOCK cycle   | 1  | 2  | 3  | 4  | 5  |
|---------------|----|----|----|----|----|
| instruction i | FI | EI |    |    |    |
| i+1           |    | FI | EI |    |    |
| i+2           |    |    | FI | EI |    |
| i+3           |    |    |    | FI | EI |

$$\begin{aligned}\text{Total no.of clock cycle} &= k + (n-1) \\ &= 2 + (-1) \\ &= 2 + 3 \\ &= 5\end{aligned}$$

instruction → i    i+1    on    i+2    i+3

|             |                |          |
|-------------|----------------|----------|
| CLOCK cycle | t <sub>1</sub> | FI       |
|             | t <sub>2</sub> | EI    FI |
|             | t <sub>3</sub> | EI    FI |
|             | t <sub>4</sub> | EI    FI |
|             | t <sub>5</sub> | EI       |

| CLOCK cycle | t <sub>1</sub> | t <sub>2</sub> | t <sub>3</sub> | t <sub>4</sub> | t <sub>5</sub> |
|-------------|----------------|----------------|----------------|----------------|----------------|
| Segments    | FI             | i              | i+1            | i+2            | i+3            |
|             | EI             | i              | i+1            | i+2            | i+3            |

Q. Process 4 instructions in 4-segment instruction pipeline.

| CLOCK cycle   | 1  | 2  | 3  | 4  | 5  | 6  | 7  |
|---------------|----|----|----|----|----|----|----|
| instruction i | FI | DA | FO | EX |    |    |    |
| i+1           |    | FI | DA | FO | EX |    |    |
| i+2           |    |    | FI | DA | FO | EX |    |
| i+3           |    |    |    | FI | DA | FO | EX |

$$\begin{aligned}\text{No.of clock cycle} &= k + (n-1) \\ &= 4 + (4-1) \\ &= 7\end{aligned}$$

Q. Process 4 instructions in 6-Segment pipeline.

| CLOCK cycle   | 1  | 2  | 3  | 4  | 5  | 6  | 7  | 8  | 9  |
|---------------|----|----|----|----|----|----|----|----|----|
| Instruction i | F1 | D1 | C0 | F0 | E1 | W0 |    |    |    |
| i+1           |    | F1 | D1 | C0 | F0 | E1 | W0 |    |    |
| i+2           |    |    | F1 | D1 | C0 | F0 | E1 | W0 |    |
| i+3           |    |    |    | F1 | D1 | C0 | F0 | E1 | W0 |

$$\text{Total clock cycle} = k + (n-1) = 6 + (4-1) = 9$$

→ Factors that limit the performance enhancement in instruction pipeline are,

- Some waiting time in various stages.
- Conditional branch instruction.

1) On conditional branch until the instruction is executed there is no way of knowing which instruction will come next. Hence in this case the pipeline simply loads the next instruction in sequence & proceeds. At the end of execution stage (EI), it is cleared that whether branch is taken or not.

→ If branch is taken then at the end of execution or EI stage, the pipeline must be cleared or flushed all those instructions that are fetched after a branch instruction. This results in Performance penalty which occurs because we could not anticipate the branch is taken or not correctly.

→ If branch is not taken then normal processing of pipeline goes on.

2) After DI stage it is cleared that whether there is a conditional branch or unconditional branch. If there is unconditional branch then the branch instruction address is retrieved in F0 stage & before that if the pipeline loads any instruction & proceeded them then they are cleared or flushed that results in Performance penalty. But after DI stage if it is cleared that it is a conditional branch then step 1 is carried out.

3) In case of conditional branch if the anticipation is not correct then there will be more performance penalty as compared to performance penalty of executing unconditional branch instruction because in case of unconditional one it is always clear that instruction executed next is the instruction from target address.

4) Flowchart for 4-Segment instruction pipeline-



5) Flowchart for 6-Segment instruction pipeline -



6) e.g. of execution of conditional branch in instruction pipeline. (6-Segment)

Assume that instruction 3 is a conditional branch & after executing instruction 3, control of execution shifted to instruction 15 or after executing instruction 3 next instruction 15 is executed. Hence until instruction 3 is executed there is no way of knowing whether instruction 4 will come next or instruction 15 will come next for execution.

Initially Pipeline loads instruction in sequence & proceeds

Assume that branch is taken for this example but the branch is taken is determined after E1 stage of instruction 3 at clock cycle 7. So, at clock cycle 8 it is clear that branch is taken for I<sub>3</sub> (instruction 3). Therefore at clock cycle 8 flush or clear the pipeline by removing I<sub>4</sub> to I<sub>7</sub> from pipeline.

At clock cycle 8, instruction 15 is entered into the pipeline into the FI stage.

No instruction get completed during clock cycle 9 to 12. Hence performance penalty is 4.

| Clock cycle →                                     | 1               | 2  | 3  | 4  | 5  | 6  | 7  | 8  | 9  | 10 | 11 | 12 | 13 | 14 |
|---------------------------------------------------|-----------------|----|----|----|----|----|----|----|----|----|----|----|----|----|
| Instruction I <sub>1</sub>                        | FI              | DI | CO | FO | EI | WO |    |    |    |    |    |    |    |    |
| I <sub>2</sub>                                    |                 | FI | DI | CO | FO | EI | WO |    |    |    |    |    |    |    |
| I <sub>3</sub>                                    |                 |    | FI | DI | CO | FO | EI | WO |    |    |    |    |    |    |
| I <sub>4</sub> to I <sub>7</sub>                  | I <sub>4</sub>  |    |    | FI | DI | CO | FO |    |    |    |    |    |    |    |
| instruction get clear                             | I <sub>5</sub>  |    |    |    | FI | DI | CO |    |    |    |    |    |    |    |
| on flushed                                        | I <sub>6</sub>  |    |    |    |    | FI | DI |    |    |    |    |    |    |    |
| from pipeline                                     | I <sub>7</sub>  |    |    |    |    |    | FI |    |    |    |    |    |    |    |
|                                                   | I <sub>15</sub> |    |    |    |    |    |    | FI | DI | CO | FO | EI | WO |    |
|                                                   | I <sub>16</sub> |    |    |    |    |    |    |    | FI | DI | CO | FO | EI | WO |
| Instruction execution begins from I <sub>15</sub> |                 |    |    |    |    |    |    |    |    |    |    |    |    |    |

Performance penalty or branch penalty = 4

7) Description of 4-Segment instruction pipeline to handle conditional branch.

Assume that instruction 3 is a branch instruction & branch is taken here. Before branch instruction, all the instructions like I<sub>1</sub>, I<sub>2</sub> are executed normally but the instructions following I<sub>3</sub> are affected.

If branch is taken then after I<sub>3</sub> the instruction from target address are executed.

But if branch is not taken then after I<sub>3</sub>, instruction I<sub>4</sub> get executed next.



Performance penalty or branch penalty = 3

In clock cycle 4, instruction 1 is in EX segment; instruction 2 is in FO stage; instruction 3 is in DA stage & instruction 4 is in FI segment.

As soon as instruction 3 (branch instruction) is in decoded stage (DA stage), the other instructions are halted from FI to DA stage in clock cycle 4 until the branch instruction is executed in clock cycle 6.

Because branch is taken so after I<sub>3</sub>, I<sub>10</sub> get executed & I<sub>4</sub> is flushed from pipeline

No instruction is completed from clock cycle 7 to 9, so performance penalty or branch penalty is 3.

8) Description of unconditional branch instruction execution in instruction pipeline →

Consider that instruction no. 3 or I<sub>3</sub> is an unconditional branch instruction in a program & when instruction no. 3 or I<sub>3</sub> get executed then the control of execution moves to or jumps to instruction no. 15 or I<sub>15</sub> unconditionally.



Performance Penalty = 3

→ After F0 Stage or F0 segment of unconditional branch instruction the next instruction that is going to be executed is known. And for the e.g. given, it is instruction 15 or I<sub>15</sub> that is executed next after unconditional branch instruction I<sub>3</sub>.

→ Instruction next to I<sub>3</sub> that is I<sub>4</sub> get fetched in normal process of pipeline. This is because before DI stage of I<sub>3</sub> instruction it is not known that it is unconditional branch. But after DI stage of I<sub>3</sub>, it is confirmed that I<sub>3</sub> is a unconditional branch & after I<sub>3</sub> the next instruction that get executed is instruction 15 or I<sub>15</sub> from target address given in I<sub>3</sub> instruction. Therefore I<sub>4</sub> is removed or flushed from instruction pipeline.

→ Performance penalty or branch penalty for a unconditional branch instruction in a 6-segment instruction pipeline = 3.

### Pipeline Hazards -

Pipeline hazards are the situations that prevent the next instruction in the instruction stream executing during its designated clock cycle. The instruction is said to be stalled. When an instruction is stalled, all instruction following or later the stalled one are also stalled in the pipeline. But the instructions present before the stalled one are not affected or continue in normal.

#### Types of Pipeline Hazard -

1. Structural Hazard
2. Data Hazard
3. Control Hazard.

#### 1. Structural Hazard -

Structural hazard occurs if the hardware resources required by the simultaneous overlapped instruction's execution are same.

i.e.: Resources like memory or any functional unit are requested by more than one instruction at the same time or at the same clock pulse.

e.g-

| CLOCK Cycle                | 1  | 2  | 3  | 4  | 5  | 6  | 7  | 8  | 9  | 10 | 11 |
|----------------------------|----|----|----|----|----|----|----|----|----|----|----|
| Instruction I <sub>1</sub> | F1 | DI | CO | FO | EI | WO |    |    |    |    |    |
| I <sub>2</sub>             |    | F1 | DI | CO | FO | EI | WO |    |    |    |    |
| I <sub>3</sub>             |    |    | F1 | DI | CO | —  | FO | EI | WO |    |    |
| I <sub>4</sub>             |    |    |    | F1 | DI | —  | CO | FO | EI | WO |    |
| I <sub>5</sub>             |    |    |    |    | F1 | —  | DI | CO | FO | EI | WO |

In the above e.g. I<sub>1</sub> & I<sub>3</sub> instructions access from same memory location. i.e: I<sub>1</sub> instruction wants to write result on memory with the same address where I<sub>3</sub> instruction wants to read the operand or data and the write operation of I<sub>1</sub> (WO) & read operation of I<sub>3</sub> (FO) performed in same clock cycle, CLK 6 for same memory location. Hence there exist resource dependency between I<sub>1</sub> & I<sub>3</sub>.

Therefore I<sub>3</sub> instruction has a stall at CLK 6 to avoid resource dependency. This implies instructions following I<sub>3</sub>, which are I<sub>4</sub> & I<sub>5</sub> also get stalled during the same clock cycle 6. Hence performance penalty for this example is 1.

e.g-

| CLOCK Cycle                        | 1 | 2  | 3  | 4  | 5  | 6  | 8  | 9  | 10 |    |    |
|------------------------------------|---|----|----|----|----|----|----|----|----|----|----|
| ADD R <sub>1</sub> ,R <sub>2</sub> |   | F1 | DI | CO | FO | EI | WO |    |    |    |    |
| SUB R <sub>1</sub> ,R <sub>3</sub> |   |    | F1 | DI | CO | —  | —  | FO | EI | WO |    |
| INR R <sub>5</sub>                 |   |    |    | F1 | DI | —  | —  | CO | FO | EI | WO |

ADD R<sub>1</sub>,R<sub>2</sub>; R<sub>1</sub> ← R<sub>1</sub> + R<sub>2</sub>

SUB R<sub>1</sub>,R<sub>3</sub>; R<sub>1</sub> ← R<sub>1</sub> - R<sub>3</sub>

INR R<sub>5</sub>; R<sub>5</sub> ← R<sub>5</sub> + 1

In this e.g. for SUB operation the operand or data required are R<sub>1</sub>, R<sub>3</sub>. R<sub>3</sub> is available but R<sub>1</sub> is not available for EI operation of SUB because R<sub>1</sub> = R<sub>1</sub> + R<sub>2</sub> which means that after R<sub>1</sub> write operation it is

available by ADD operation. Therefore EI operation of SUB get stalled till WO operation of ADD is not complete  
So, Performance penalty = 2.

Remedy -

The remedy of structural hazard is,

→ By using duplicate resources.

→ By providing separate data & instruction cache.

## 2. Data Hazard -

It occurs due to data dependency. It occurs when the execution of an instruction depends on the results of a previous instruction.

e.g - ADD R<sub>1</sub>, R<sub>2</sub>, R<sub>3</sub> ; R<sub>1</sub> ← R<sub>2</sub> + R<sub>3</sub>  
SUB R<sub>4</sub>, R<sub>1</sub>, R<sub>5</sub> ; R<sub>4</sub> ← R<sub>1</sub> - R<sub>5</sub>  
MOV R<sub>6</sub>, R<sub>7</sub>

| Clock Cycle →                                        | 1  | 2  | 3  | 4  | 5  | 6  | 7 | 8 | 9  | 10 |    |    |
|------------------------------------------------------|----|----|----|----|----|----|---|---|----|----|----|----|
| ADD R <sub>1</sub> , R <sub>2</sub> , R <sub>3</sub> | F1 | DI | CO | FO | EI | WO |   |   |    |    |    |    |
| SUB R <sub>4</sub> , R <sub>1</sub> , R <sub>5</sub> |    |    |    | F1 | DI | CO | — | — | FO | EI | WO |    |
| MOV R <sub>6</sub> , R <sub>7</sub>                  |    |    |    |    | F1 | DI | — | — | CO | FO | EI | WO |

Performance penalty = 2

In the above e.g. the SUB instruction can not move to F0 stage because the SUB instruction depends on data of R<sub>1</sub> which get available after WO stage. Hence there are two stalls until the ADD instruction has written the result into R<sub>1</sub>. Therefore, the performance penalty for this e.g = 2.

Remedy →

The remedy for data hazards can be,

→ Hardware technique

(a) Interlock

(b) Forwarding

(a) Interlock - The hardware detects data dependency & delays scheduling of the dependent instruction by stalling enough clock cycle.

(b) Forwarding - It also known as bypassing or short circuiting. In forwarding method if the hardware

detects the value needed for the current operation is the one produced by the previous operation but which has not yet been written back, it selects the forwarded result as ALU O/p instead of the value read from register or memory.

e.g. of forwarding,

CIR cycle → 1 2 3 4 5 6 7 8

MUL R<sub>2</sub>, R<sub>3</sub> F1 DI CO FO EI W0

ADD R<sub>1</sub>, R<sub>2</sub> F1 DI CO - FO EI W0

Hence after EI stage of MUL instruction the result is available by forwarding. So performance penalty is reduced to 1 stall.

→ Software Technique -

Software technique is a way of rescheduling the instruction in a program so that stall can be avoided.

e.g. of software technique,

|                                                                                                                                                                                                                                                          |                                                                                                                                                                                                           |
|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| LOAD R <sub>b</sub> , b<br>LOAD R <sub>c</sub> , c<br>Data } ADD R <sub>a</sub> , R <sub>b</sub> , R <sub>c</sub> } Data<br>Dependent } STORE R <sub>5</sub> , R <sub>a</sub> } Dependent<br>-cy      LOAD R <sub>e</sub> , e<br>LOAD R <sub>f</sub> , f | LOAD R <sub>b</sub> , b<br>LOAD R <sub>c</sub> , c<br>LOAD R <sub>e</sub> , e<br>ADD R <sub>a</sub> , R <sub>b</sub> , R <sub>c</sub><br>LOAD R <sub>f</sub> , f<br>STORE R <sub>5</sub> , R <sub>a</sub> |
|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|

(Unscheduled code)

(Scheduled code)

### 3. Control Hazard -

control hazard occurs because of branch instructions, mostly for conditional branch instructions. Hence due to branch instructions there are performance penalties. To reduce performance penalties or to avoid the chance of performance penalties there are several remedies.

Remedy →

a. Prefetch Target Instruction.

b. Branch Target Buffer.

c. Loop Buffer.

d. Branch prediction.

e. Delayed Branch.

## 1. Multiple Stream →

→ In multiple stream replicates the initial portion of pipeline & allow pipeline to fetch both instruction stream that is branch taken & branch not taken. And both are saved until branch is executed. Then select the right instruction stream & discard the wrong or not required instruction stream.

→ A problem associated with multiple stream that is, by using multiple pipelines there arise contention delays for access to the register & memory.

## 2. Prefetch Branch Target →

→ In this technique when a conditional branch is recognized then the target of branch is prefetched, in addition to the instruction following the branch.

→ Here if the branch is taken then the target is already prefetched.

## 3. Loop Buffer →

→ It is a small high speed memory maintained by the instruction fetch stage of the pipeline & containing the 'n' most recently fetched instruction in sequence.

→ If the branch is to be taken then the hardware first checks whether the branch target is within the buffer or not. If it is present then the next instruction fetched from buffer.

→ Therefore loop buffer is like a cache memory which acts as a storage of entire loop that allows to execute a loop by accessing 'n' most recently fetched instructions of loop from loop buffer without accessing memory.

## 4. Delayed Branch →

→ In this technique pipeline performance improved by automatically rearranging instructions of a program. So, the delay due to branch instruction is handled by the compiler. The compiler analyze the instructions of program & rearrange the instructions of program to avoid delay due to branch.

e.g. of delayed branch-

LOAD R<sub>1</sub>, 4000  
INR R<sub>5</sub>  
Branch to X  
ADD R<sub>2</sub>, R<sub>3</sub>  
SUB R<sub>5</sub>, R<sub>4</sub>  
MOV R<sub>6</sub>, R<sub>7</sub>  
MOV R<sub>8</sub>, R<sub>9</sub>

X:

Rearrange of Instructions  
in Program to avoid delayed  
to Branch.

LOAD R<sub>1</sub>, 4000  
INR R<sub>5</sub>  
Branch to X  
MOV R<sub>6</sub>, R<sub>7</sub>  
MOV R<sub>8</sub>, R<sub>9</sub>  
# NOP  
ADD R<sub>2</sub>, R<sub>3</sub>  
SUB R<sub>5</sub>, R<sub>4</sub>

X:

## 5. Branch prediction →

→ In branch prediction various techniques are used to predict whether a branch will be taken or not. These techniques are,

- (a) Predict never taken
- (b) Predict always taken.
- (c) Predict by opcode.
- (d) Taken or not taken switch.
- (e) Branch history table.

Static branch strategies

Dynamic branch strategies.

→ Static branch strategies means they do not depend on execution history of conditional branch instruction. And dynamic branch strategies means they depends on execution history of conditional branch instruction.

### a) Predict never taken -

Assume that branch will not be taken & continue to fetch instructions in sequence from next instruction.

### b) Predict always taken -

Assume that branch is always taken & fetch the instructions from target address.

### c) Predict by opcode -

Here the decision made by the processor is based on opcode of branch instruction. The processor assumes that branch will be taken for certain branch opcodes not for others.

#### d) Taken or Not taken Switch →

In this technique one or more bits are associated with each conditional branch instruction that reflects the recent history of the instruction (branch). These bits are referred to as a taken/not taken switch that directs the processor to make a particular decision for the next time the branch instruction is encountered.



(Branch prediction state diagram)

The branch prediction state diagram starts at the upper left corner. This algorithm or diagram requires two consecutive wrong prediction to change the prediction decision.