

## Pipelining-

- ① Pipelining is a H/w mechanism.
- ② Pipelining uses the decomposition technique.
- ③ The function of the pipeline is one pipe output is connected as input to the another pipe

(4) Definition of the pipeline is "Accepting the new i/p's at one end **before** previously accepted i/p, appears as an o/p at the other end."

(5) Definition states that new i/p's are inserted into the pipeline before completion of the old inputs. So that new i/p's are executed along with the old i/p's.

therefore, the new i/p's utilize the CLK cycles of the old inputs.

(6) Pipelining allows the overlapping execution.

(7) In the non-pipelining process, new i/p's are inserted only after completion of the old inputs. Therefore, it allows the non-overlapping execution.

(8) Overlapping execution seq. in the pipeline is represented using space-time diagram.

i.e.



- ⑨ Successful characteristic of the pipeline is every new cycle, new i/p must be inserted into the pipeline. Therefore,

$$CPI = 1$$

### Designing of Pipeline:

- ① Pipe has two ends i.e. input end & output end.
- ② Between the i/p & o/p ends, multiple pipes are interconnected to satisfy the objective of the Pipelining. Each pipe is called as the stage or segment.
- ③ B/w the stages, interface registers are used to hold the intermediate result i.e. also called as Latch/Buffer / pipeline register.
- ④ All the stages along with the interface registers are controlled by the common clock.



- ⑤ Consider 4 segment pipeline ( $S_1, S_2, S_3, S_4$ ) used to execute the 4 tasks ( $T_1, T_2, T_3, T_4$ ). Show the execution sequence in the pipeline.



### Performance Evaluation of Pipeline Processor -

- ① Consider K-segment pipeline with clock cycle time  $t_p$  used to execute  $n$  tasks.
- ② The very first task in the pipeline is executed as non-overlapping order. So, it takes  $K$ -cycles to complete the operation.
- ③ The remaining  $(n-1)$  tasks will emerge from the pipe at 1 task/cycle. Therefore,  $(n-1)$  cycles are required to complete  $(n-1)$  tasks.
- ④ So, the total time required to complete the task is  $t_p(n+K-1)$ .

n tasks using the K-segment pipeline is -

$$ET_{\text{pipe}} = (K+n-1) \text{ cycles}$$

$$ET_{\text{pipe}} = (K+n-1) * tp$$

(5) Consider non-pipeline processor used to execute the n tasks in which each task require  $t_n$  time to complete the execution.

(6) So, the total time required to complete n tasks in non-pipeline is

$$ET_{\text{non-pipe}} = n * t_n$$

- ~~requires shifting to memory access~~

(7) The performance gain of the pipeline processor over the non-pipeline is

$$\text{Speed Up (S)} = \frac{\text{Performance pipe}}{\text{Performance non-pipe}}$$

$$= \frac{1}{ET_{\text{pipe}}}$$

$$= \frac{1}{ET_{\text{non-pipe}}}$$

$$= \frac{ET_{\text{non-pipe}}}{ET_{\text{pipe}}}$$

$$S = \frac{n * t_n}{(K+n-1) * tp}$$

⑧ When the  $n$ -value is increases the  $(K+n-1)$  is approach to the value of  $n$ .

⑨ Under this condition,

$$S = \frac{n \times t_n}{n \times t_p}$$

$$S = \frac{t_n}{t_p}$$

10) When all ins<sup>n</sup> are taking same no. of CLK cycles then one task execution time is also equal to the no. of stages in the pipeline. i.e.  $t_n = K$  cycles.

$$t_n = K \times t_p$$

$$S = \frac{K \times t_p}{t_p}$$

$$S = K \text{ (Pipeline depth)}$$

11) When the system is operating with 100% efficiency then max. SpeedUp = pipeline depth

$$\frac{100\% n}{? n} \xrightarrow{\cancel{S_{max}}} S$$

$$n_{\text{pipe}} = \frac{S}{S_{\text{max}}} = \frac{S}{K}$$

(12) Throughput of the pipeline is no. of tasks completed per total time required to complete the task.

$$TP_{\text{pipe}} = \frac{\# \text{ tasks completed}}{\text{total time req. to complete task}}$$

$$TP_{\text{pipe}} = \frac{n}{(k+n-1) * tp}$$

### Types of Pipeline -

#### ① Linear Pipeline -

This pipe is used to process only one functionality. In this pipeline, only the feedforward connections are present



#### ② Non-Linear Pipeline -

This pipeline is used to process the multiple functions. In this pipeline feedforward & feedbackward connections are maintained.



feedbackward

$$\frac{Z_1}{Z} = \frac{Z_2}{Z} = \sin \theta$$

### ③ Synchronous Pipeline -

In this pipeline, the initiations are controlled based on the CLK. Linear pipelines are the synchronous pipelines.

### ④ Asynchronous Pipeline -

In this pipeline, operations are initiated based on the handshaking signals. It means that before initiating the operation, there is a need of taking the feedback.

### Hypothetical Pipelines -

#### ① Uniform Delay Pipeline -

In this pipeline, all the stages are taking the same amount of time to complete the operation.



a)  $t_p = \text{stage delay}$

b) If buffer delay is included

$$t_p = \text{stage delay} + \text{Buffer delay}$$

c) If any skew time / setup time overheads are included

$$t_p = t_p + \text{overhead}$$

### ③ Non-Uniform Delay Pipeline

In this pipeline different stage delays are maintained.



a)  $t_p = \text{MAX}(\text{stage delay})$

b) If buffer delay is included.

$$t_p = \text{MAX}(\text{stage delay}) + \text{Buffer delay}$$

c) If non-uniform buffer delay is included

$$t_p = \text{MAX}(\text{stage delay} + \text{Buffer delay})$$

Ques-1 A  $\text{ins}^n$  pipeline has the speedup factor. So what operating with 80% efficiency. What could be the no of stages in the pipeline.

$$n = \frac{s}{k}$$

$$0.8 = \frac{10}{K} \Rightarrow K = \frac{10}{0.8} = 12.5 \approx 13$$

Q2) Consider 2 pipelines A & B where pipeline A having 8 stages of uniform delay of 2ns, pipeline B having 5 stages with respect to stage delays of 2, 4, 6, 3 & 1. How much time is saved when 100 tasks are pipelined using A instead of B.

$$\begin{aligned} ET_A &= (K+n-1) * tp \\ &= (8+100-1) * 2 \\ &= 214 \end{aligned}$$

$$\begin{aligned} ET_B &= \text{MAX}(2, 4, 6, 3, 1) = 6 \\ &= (5+100-1) * 6 \\ &= 624 \end{aligned}$$

$$\begin{aligned} \text{Save time} &= ET_B - ET_A \\ &= 624 - 214 \\ &= 410 \text{ sec Ans.} \end{aligned}$$

3) Consider 4 segment pipelines with the respective stage delays 20, 30, 40, 10 ns. What is the approx speed up when very large no. of instructions are pipelined.

$$S = \frac{tn}{tp}$$

$$tp = \text{MAX}(20, 30, 40, 10) = 40$$

$tn$  = One task execution time

$$\begin{aligned} tn &= S_1 + S_2 + S_3 + S_4 \\ &= 20 + 30 + 40 + 10 \\ &= 100 \end{aligned}$$

$$S = \frac{100}{40} = 2.5$$

$$n = \frac{S}{k} = \frac{2.5}{4} = 62.5\%. \text{ Ans.}$$

4) Consider 4 stage inst pipeline where different inst are spending different amount of time at different stages i.e. shown below. What is the speed up?

|                | S <sub>1</sub> | S <sub>2</sub> | S <sub>3</sub> | S <sub>4</sub> |     |
|----------------|----------------|----------------|----------------|----------------|-----|
| I <sub>1</sub> | 2              | 1              | 2              | 1              | = 6 |
| I <sub>2</sub> | 1              | 3              | 1              | 2              | = 7 |
| I <sub>3</sub> | 2              | 2              | 2              | 1              | = 7 |
| I <sub>4</sub> | 2              | 2              | 1              | 2              | = 7 |

$$t_p = 2$$



$$ET_{\text{pipe}} = 13 \text{ cycle}$$

$$ET_{\text{non-pipe}} = T_1 + T_2 + T_3 + T_4 \\ = 27$$

$$S = \frac{ET_{\text{non-pipe}}}{ET_{\text{pipe}}}$$

$$= \frac{27}{13}$$

$$S = 2.07$$

$$\eta = \frac{S}{K}$$

$$= \frac{2.07}{4} = 0.517 = \underline{\underline{51.7\% \text{ Ans.}}}$$

## Lecture - 9

Ques-1 Consider 4 stage pipeline where diff. ins<sup>n</sup> are pending different amount of time at diff. stages shown below:-

|                | S <sub>1</sub> | S <sub>2</sub> | S <sub>3</sub> | S <sub>4</sub> |
|----------------|----------------|----------------|----------------|----------------|
| I <sub>1</sub> | 1              | 3              | 1              | 1              |
| I <sub>2</sub> | 2              | 1              | 2              | 1              |
| I <sub>3</sub> | 1              | 1              | 1              | 2              |
| I <sub>4</sub> | 1              | 2              | 1              | 1              |

The following loop is executed in the pipeline  
 $\text{for } (i=1; j \leq 1000; j++)$

I<sub>1</sub>; I<sub>2</sub>

I<sub>2</sub>;

I<sub>3</sub>;

I<sub>4</sub>;

The o/p of "I<sub>2</sub>" for (i=2) will be available after 16 cycle.



## RISC Processor

- ① To analyse the implementation issue of pipelining, let us consider RISC pipeline as a reference model.
- ② RISC processor supports 5 stage ins<sup>n</sup> pipeline used to execute the ins<sup>n</sup>.
- ③ RISC processor ins<sup>n</sup> set contain 3 category of the op<sup>n</sup>:
  - ④ D
  - i) Data Transfer Op<sup>n</sup> -

i.) In the RISC machine, LOAD & STORE ins<sup>n</sup> are used to transfer data b/w CPU & memory. In these ins<sup>n</sup>, mem. address is calculated using indexed A.M.

Syntax -

i.) LOAD r0, 3(r1)

$$r_0 \leftarrow m[3 + [r_1]]$$

2) Store  $3(r_1), r_0$

$$m[3+r_1] \leftarrow r_0$$

2) Data Manipulation op<sup>n</sup>

1) In the RISC CPU, ALU op<sup>n</sup> are performed only on the register data.

Syntax -

ADD r0, r1, r2

$$r_0 \leftarrow r_1 + r_2$$

3) Transfer of Ctrl OP<sup>n</sup>

1) Unconditional TOC

Syntax - JMP 2000

$$PC \leftarrow \begin{cases} \text{seq. Addr} \\ 2000 \end{cases}$$

2) Conditional TOC

1) In this op<sup>n</sup>, condition is evaluated based on current ins<sup>n</sup> only.

Syntax - DJNZ r0, 2000

Dec r0

r0 compare NZ

$$\begin{cases} \text{True} & (PC \leftarrow \text{seq. Addr}) \\ \text{False} & (PC \leftarrow \text{seq. Addr}) \end{cases}$$

To execute the above ins<sup>n</sup>, RISC processor supports 5 stage ins<sup>n</sup> pipeline -

### 1.) Stage I (Ins<sup>n</sup> Fetch)

- ① In this stage, CPU reads the ins<sup>n</sup> from memory simultaneously, PC is incremented to next seq. Ins<sup>n</sup> Addr.

### Stage II (Ins<sup>n</sup> Decode)

- ① In this stage, two different op's are performed i.e. decode the Ins<sup>n</sup> & access the register file.
- ② This stage is also contain the comparators logic to evaluate the branch cond<sup>n</sup>.

### Stage III (Execute)

In this stage only the ALU op<sup>n</sup> are performed

### Stage IV (Memory Access)

In this stage, only the memory read & memory write op<sup>n</sup>s are performed.

### Stage V (Write Back)

This stage op<sup>n</sup> is divided into two parts

In the first part, register write op<sup>n</sup> is performed.

In the 2<sup>nd</sup> part, register read op<sup>n</sup> is performed.



DNZ 30,2000  
 $PC \leftarrow PC + S.S.$

Cond. Toc

TA : 2000

30 = value

Dec 30

30 compare NZ



V.V. Imp

## Dependencies In the Pipeline -

- ① 3 kinds of dependencies are possible in the pipeline named as
  - 1) Structural Dependency
  - 2) Data Dependency
  - 3) Control Dependency
- ② Dependency creates the extra cycles in the pipeline.
- ③ Cycle without new input insertion is called as Stall Cycle.
- ④ When the stalls present in the pipeline, then  $CPI \neq 1$ . Therefore, 100% Efficiency never possible.

### 1.) Structural Dependency -

- ① This dependency is created in the pipeline due to the Resource Conflict. Resource

may be a memory or register or functional unit.

| <u>Diagram I</u> | CC1 | CC2 | CC3 | CC4  | CC5 |
|------------------|-----|-----|-----|------|-----|
| I <sub>1</sub>   | Mem | ID  | ALU | Mem. |     |
| I <sub>2</sub>   |     | Mem | ID  | ALU  |     |
| I <sub>3</sub>   |     |     | Mem | ID   |     |
| I <sub>4</sub>   |     |     |     | Mem  |     |

② In this above cycle diagram, Inst I<sub>1</sub> & I<sub>4</sub> both are trying to access the same resource (memory) in the same cycle (CC4). This situation is called as Resource Conflict.

Resource Conflict in the pipeline is named as Structural Dependency.

③ Conflict is a unsuccessful opn, so to make it successful Keep I<sub>4</sub> into waiting until the resource becomes available. This waiting creates stalls in pipeline.

|                | CC1 | CC2 | CC3 | CC4 | CC5 | CC6 | CC7 |
|----------------|-----|-----|-----|-----|-----|-----|-----|
| I <sub>1</sub> | Mem | ID  | ALU | Mem | WB  |     |     |
| I <sub>2</sub> |     | Mem | ID  | ALU | Mem | WB  |     |
| I <sub>3</sub> |     |     | Mem | ID  | ALU | Mem | WB  |
| I <sub>4</sub> |     |     |     |     |     |     |     |

stalls.

- ④ To minimize the no. of stalls in pipeline due to the structural dependency, one of H/W mech. is used called as Renaming.
- ⑤ In the Cycle diagram-1, I<sub>1</sub> is accessing the memory in 4th stage to read/write the data whereas I<sub>2</sub> is accessing the memory in the 1st stage of pipelining to read the insn in same cycle of CCY.
- ⑥ When the insn & data both present in same memory. then the above situation creates the memory conflict.
- ⑦ Renaming mech. states that Divide the mem. into independent sub-modules to store insns & data separately called as code memory (CM) & data memory (DM) respectively.
- ⑧ Refer the code memory in the stage 1 & data memory in the stage 4 of pipeline. Then, accessing the code memory & data memory in same cycle doesn't creates the conflict.

|                | CC1 | CC2 | CC3 | CC4 | CC5 | CC6 |  |
|----------------|-----|-----|-----|-----|-----|-----|--|
| I <sub>1</sub> | CM  | ID  | ALU | DM  | WB  |     |  |
| I <sub>2</sub> |     | CM  | ID  | ALU | DM  | WB  |  |
| I <sub>3</sub> |     |     | CM  | ID  | ALU | DM  |  |
| I <sub>4</sub> |     |     |     | CM  | ID  | ALU |  |
| I <sub>5</sub> |     |     |     |     | CM  | ID  |  |

## Lecture-10

### 1) Data Dependency -

- (1) Consider the program segment in which the ins<sup>n</sup> j follows ins<sup>n</sup> i.
- (2) Data dependency is existed when the ins<sup>n</sup> j tries to read the data before ins<sup>n</sup> i writes it.

Note: ~~old for new update requires : if~~

- (1) Ins<sup>n</sup> j , 1 of the input may be the output of the ins<sup>n</sup> i.

Eg:- I<sub>1</sub>: Add R<sub>10</sub>, R<sub>11</sub>, R<sub>12</sub>

I<sub>2</sub>: Mul R<sub>13</sub>, R<sub>10</sub>, R<sub>14</sub>.

- (2) Acc. to the pipeline concept, I<sub>2</sub> is inserted into the pipe before completion of the I<sub>1</sub>. Therefore, I<sub>2</sub> reads the register R<sub>10</sub> before i.e. updated by I<sub>1</sub>. So, old value is undergoes processing instead of the new value.

- (3) To avoid this problem, keep I<sub>2</sub> into the waiting state until completion of the I<sub>1</sub>. This waiting creates the stalls in the pipeline.

- (4) To identify the data dependency condition some kind of the database will be maintained at the decoding stage.

⑤ The database contain the following field name

| S.No.          | Functional Unit | Destn           | Independent Source 1 | Ind. Source 2   | Dep. S1             | Dep. S2 |
|----------------|-----------------|-----------------|----------------------|-----------------|---------------------|---------|
| I <sub>1</sub> | +               | g <sub>10</sub> | g <sub>11</sub>      | g <sub>12</sub> | -                   | -       |
| I <sub>2</sub> | *               | g <sub>13</sub> | -                    | g <sub>14</sub> | g <sub>10</sub> (I) | -       |

I<sub>1</sub>: Execution stage will be allocated

I<sub>2</sub>: Execution stage will not be allocated

|                | CC <sub>1</sub>                                  | CC <sub>2</sub> | CC <sub>3</sub> | CC <sub>4</sub> | CC <sub>5</sub> | CC <sub>6</sub> | CC <sub>7</sub> |
|----------------|--------------------------------------------------|-----------------|-----------------|-----------------|-----------------|-----------------|-----------------|
| I <sub>1</sub> | IF                                               | ID              | EX              | MA              |                 |                 |                 |
|                | +                                                |                 |                 |                 |                 |                 |                 |
|                | S: g <sub>11</sub> , g <sub>12</sub><br>(I) (I)  |                 |                 |                 |                 |                 |                 |
|                | D: g <sub>10</sub>                               |                 |                 |                 |                 |                 |                 |
|                | g <sub>11</sub> = value                          |                 |                 |                 |                 |                 |                 |
|                | g <sub>12</sub> = value                          |                 |                 |                 |                 |                 |                 |
| I <sub>2</sub> | IF                                               | ID              |                 |                 |                 | EX              |                 |
|                |                                                  | *               |                 |                 |                 |                 |                 |
|                | S: g <sub>10</sub> , g <sub>14</sub><br>(D), (I) |                 |                 |                 |                 |                 |                 |
|                | D: g <sub>13</sub>                               |                 |                 |                 |                 |                 |                 |
|                | g <sub>14</sub> = value                          |                 |                 |                 |                 |                 |                 |
| I <sub>3</sub> | IF                                               |                 |                 |                 |                 |                 |                 |
| I <sub>4</sub> |                                                  |                 |                 |                 | IF              |                 |                 |
|                |                                                  |                 |                 | Stage           |                 |                 |                 |

⑥ To minimize the no. of stalls in the pipeline due to the data dependency, 1 hardware technique used called as operand forwarding. It is also called as By Passing / Short circuiting.

⑦ Operand forwarding means accessing the new data from the interface register before updating the register file.

⑧ Consider the program segment:-

I<sub>1</sub>: Add R<sub>0</sub>, R<sub>1</sub>, R<sub>2</sub> } True data → Possible b/w  
I<sub>2</sub>: SUB R<sub>3</sub>, R<sub>0</sub>, R<sub>4</sub> } dependency Adjacent ins<sup>n</sup>  
I<sub>3</sub>: Mul R<sub>5</sub>, R<sub>0</sub>, R<sub>6</sub>  
I<sub>4</sub>: Div R<sub>7</sub>, R<sub>0</sub>, R<sub>8</sub>



⑨ In the above sequence diagram, the dependent ins are accessing the data from the interface registers whereas I<sub>4</sub> is accessing the data from register file i.e. clock cycle (CC<sub>5</sub>) is

divided into 2 halves.

⑩ In the 1st half of CC<sub>5</sub>, ins<sub>n</sub>(I<sub>1</sub>) updates the Reg. no & ins<sub>n</sub>(I<sub>4</sub>) decodes the operation.

⑪ In the 2nd half of CC<sub>5</sub>, I<sub>4</sub> is accessing the register file, at that time, register R<sub>6</sub> already contain the new value.

### 3) Control Dependency -

1) When the branch operations are executed, the program control will be transferred from one location to another location.

2) When the unconditional & conditional TOC op are executed in the pipeline some extra cycles are created.

### Unconditional TOC -

① Consider the program segment

1000: I<sub>1</sub>

1001: I<sub>2</sub>

1002: I<sub>3</sub> (JMP 2000)

1003: I<sub>4</sub>

⋮

2000: BI<sub>1</sub>

2001: BI<sub>2</sub>

⋮

↓ falling path

↓ taken Path.

- ② The Program execution is -  
 $I_1 - I_2 - I_3 - BI_1 - BI_2$ .

|                  | CC <sub>1</sub> | CC <sub>2</sub>    | CC <sub>3</sub> | CC <sub>4</sub>         | CC <sub>5</sub> | CC <sub>6</sub> |
|------------------|-----------------|--------------------|-----------------|-------------------------|-----------------|-----------------|
| $t = 1000$       |                 |                    |                 |                         |                 |                 |
| $I_1 : IC$       | IF              | LD                 | EX              | MA                      |                 |                 |
|                  | PC = 1001       | +                  |                 |                         |                 |                 |
| $I_2$            |                 | IF                 | ID              | EX                      |                 |                 |
|                  |                 | PC = 1002          |                 |                         |                 |                 |
| $I_3$            |                 |                    | IF              | FD                      |                 |                 |
|                  |                 |                    | PC = 1003       | uncond <sup>n</sup> TOC |                 |                 |
| $\downarrow$     | $I_4$           |                    |                 | PC = 1004               | 2000            |                 |
| unwanted         | $BI_1$          | get spot to new FA |                 | IF                      | IF              |                 |
| $\downarrow$     |                 | addition & repeat  |                 | PC = 1004               | 9               | PC = 2001       |
| flush/<br>freeze | $BI_2$          |                    |                 | start                   |                 | IF<br>PC = 2002 |

- ③ Inst fetch stage is overlapping with the decoding stage. That means before decode the current ins<sup>n</sup>, the next sequential ins<sup>n</sup> is already inserted into the pipeline.

- ④ When the decoding stage decodes the ins<sup>n</sup> as data transfer / data manipulation then the next sequential ins<sup>n</sup> is always the wanted ins<sup>n</sup>.

- ⑤ When the decoding stage decodes the ins<sup>n</sup> as unconditional TOC, then the next sequential ins<sup>n</sup> becomes the unwanted ins<sup>n</sup>.

- (6) To maintain the correct functionality, there is a need of remove the unwanted ins<sup>n</sup> from the pipeline.
- (7) The process of remove the unwanted ins<sup>n</sup> from the pipeline is called as flush/freeze.
- (8) The flush operation creates the stalls in the pipeline.
- (9) The no. of stalls cycle created due to the branch operations is called as the branch penalty.

Branch = At what stage, the target Addr. is available

Note:-

- (1) The RISC processor branch penalty is always 1 becoz. the target address is available at the 2<sup>nd</sup> stage.
- (2) The # stall cycles created during the execution of the program due to the branch operations is

$$\text{No. of stall cycles/Branch} = \text{Branch freq.} * \text{Branch penl}$$

Conditional TUC -

- (1) Consider the program segment:-

1000: I<sub>1</sub>

1001: I<sub>2</sub> (JNZ 910, 2000)

1002: I<sub>3</sub>

1003: I<sub>4</sub>

2000: BI<sub>1</sub>

2001: BI<sub>2</sub>

Falling Path

Taken Path

## (2) Execution Sequence with True Condition :-

I<sub>1</sub> - I<sub>2</sub> - BI<sub>1</sub> - BI<sub>2</sub>

## (3) Execution Sequence with False Condition :-

I<sub>1</sub> - I<sub>2</sub> - I<sub>3</sub> - I<sub>4</sub>

| -t=1000         | CC <sub>1</sub><br>PC:  | CC <sub>2</sub>                                                                    | CC <sub>3</sub> | CC <sub>4</sub> | CC <sub>5</sub> | CC <sub>6</sub> |
|-----------------|-------------------------|------------------------------------------------------------------------------------|-----------------|-----------------|-----------------|-----------------|
| I <sub>1</sub>  | IF<br>PC: 1001          | ID                                                                                 | EX              | MA              | WB              |                 |
| I <sub>2</sub>  |                         | IF<br>PC: 1002<br>Cond: NZC<br>TA: 2000<br>NZ cmp 20(NZ)<br>True: PC: 1003<br>2000 | ID              | EX              | MA              |                 |
| I <sub>3</sub>  | IF<br>PC: 1003<br>stall | ID                                                                                 | EX              |                 |                 |                 |
| BI <sub>1</sub> |                         | IF<br>PC: 2001                                                                     | ID              |                 |                 |                 |
| BI <sub>2</sub> |                         |                                                                                    | IF<br>PC: 2002  |                 |                 |                 |

Unwanted flush.

- (4) When the decoding stage decodes the  $inst^n$  as conditional TOC with True Condition then the next sequential  $inst^n$  becomes the unwanted  $inst^n$ . Therefore, flush out the unwanted  $inst^n$  from the pipe. It creates the stall.
- (5) When the decoding stage decodes the current  $inst^n$  as conditional TOC with False condition the next sequential  $inst^n$  becomes the wanted  $inst^n$  so no flush  $\rightarrow$  no stall.
- (6) To minimize the # stalls in the pipeline due to control dependency, branch target buffer / branch prediction buffer or loop buffer is used at the  $inst^n$  fetch stage.
- (7) To predict the target Address.

### Branch Target Buffer-

- ① It is a high speed buffer maintained at the  $inst^n$  fetch stage.
- ② This buffer contains the following structure.

| PC-Value | Expected PC-Value |
|----------|-------------------|
| 1003     | 1000              |

Eg:- 1000: I<sub>1</sub>  
 1001: I<sub>2</sub>  
 1002: I<sub>3</sub> (DJNZ R<sub>0</sub>, 1000) R<sub>0</sub>=4.  
 1003: I<sub>4</sub>.

### Execution Sequence:-

PC: 1000

1001

1002: R<sub>0</sub>=3 (NZ) True.

stall [0000] 1000  
 ↴ 1001

Insert the entry 1002: R<sub>0</sub>=2 (NZ) True.

into target buffer [0000] 1000

& fetch the ins<sup>n</sup> from Expected-PC 1001 1002: R<sub>0</sub>=1 (NZ) True.

[0000] 1000

1001

1002: R<sub>0</sub>=0 (NZ) F

[0000] [0000]

1003

delete the entry from the target address buffer.

& fetch the ins<sup>n</sup> from the corresponding PC-value.

- Q) This concept is suitable for the loop execution, when there is a stall w/rt True cond<sup>n</sup>,

then the unwanted insn address is maintained in the PC-value field & target address will be maintained in the expected PC value field.

- (4) In the next fetch onwards, the current PC value is compared with the PC entry field in the target buffer.
- (5) When it is matching, the CPU fetch the insn based on the corresponding expected PC-value otherwise CPU fetch the insn based on the current value.
- (6) When there is a stall in the pipeline w.r.t False condition, then delete the target address from the expected PC value field & fetch the next insn based on the corresponding PC-value field.

## Delayed Branch

(1) Delayed Branch is a compiler technique.

(2) In this technique, compiler rearranges the ins<sup>n</sup> if possible to rearrange (or) substitutes the NOP ins<sup>n</sup> after the branch ins<sup>n</sup> if not possible to rearrange to preserve the execution path in the pipeline.

User Program - Compiler Re-arrangement Program

I<sub>1</sub>: Add \$0, \$11, \$12

I<sub>1</sub>

I<sub>2</sub>: Add \$13, \$14, \$15

I<sub>3</sub>

I<sub>3</sub>: JMP BI<sub>1</sub>

I<sub>2</sub>

I<sub>4</sub>: MUL \$16, \$17, \$18

I<sub>4</sub>

Re-arrangement

movement

BI<sub>1</sub>: Add \$19, \$110, \$111

BI<sub>1</sub>

BI<sub>2</sub>: Sub \$12, \$13, \$114

BI<sub>2</sub>

| PC: I <sub>1</sub> | CC <sub>1</sub> | CC <sub>2</sub> | CC <sub>3</sub>                     | CC <sub>4</sub>     | CC <sub>5</sub> | CC <sub>6</sub> |
|--------------------|-----------------|-----------------|-------------------------------------|---------------------|-----------------|-----------------|
|                    | IF              | ID              | EX                                  | MA                  | WB              |                 |
| PC: I <sub>3</sub> |                 |                 |                                     |                     |                 |                 |
| I <sub>3</sub>     | IF              | IF              | IF                                  | EX                  | MA              |                 |
|                    |                 |                 |                                     |                     |                 |                 |
|                    |                 |                 |                                     |                     |                 |                 |
| I <sub>2</sub>     |                 |                 | IF                                  | ID                  | EX              |                 |
|                    |                 |                 | PC: I <sub>4</sub> /BI <sub>1</sub> |                     |                 |                 |
| BI <sub>1</sub>    |                 |                 |                                     | IF                  | ID              |                 |
| BI <sub>2</sub>    |                 |                 |                                     | PC: BI <sub>2</sub> |                 | IF              |

③ In the above program segment, executing the  $I_2$  before  $I_3$  & after  $I_3$ , the functionality is remain same. But when the  $I_2$  is executing before  $I_3$  then stalls are created in the pipeline. When it is executing after  $I_3$  then no stalls present in the pipeline.

### NOP Substitution

① Compiler substitute the NOP insn after branch if not possible to rearrange to preserve executable path in pipeline.

② NOP substitution is a logical expansion of the unwanted insn.

User Program      Compiler  
falling

$I_1$ : Add  $r_0, r_1, r_2$

$I_2$ : Add  $r_3, r_4, r_5$

$I_3$ : JMP  $B_1$

$I_4$ : MUL  $r_6, r_7, r_8$

$B_1$ : Add  $r_9, r_{10}, r_{11}$

$B_2$ : Sub  $r_{12}, r_{13}, r_{14}$

'NOP' substitution

$I_1$

$I_2$

$I_3$  (JMP  $B_1$ )

NOP

Taken.

$I_4$

$B_1$

$B_2$

| $PC = I_1$ | CC1        | CC2          | CC3 | CC4 | CC5 | CC6 |
|------------|------------|--------------|-----|-----|-----|-----|
| $I_1$      | IF         | ID           | EX  | MA  | WB  |     |
|            | $PC: I_2$  |              |     |     |     |     |
| $I_2$      | IF         | ID           | EX  | MA  | WB  |     |
|            | $PC: I_3$  |              |     |     |     |     |
| $I_3$      | IF         | ID           | EX  | MA  |     |     |
|            | $PC: NOP$  | Uncard. TOC  |     |     |     |     |
|            |            | $PC: I_4$ BI |     |     |     |     |
| $NOP$      | IF         | ID           | EX  |     |     |     |
|            | $PC: I_4$  |              |     |     |     |     |
| $BI_1$     | IF         | ID           |     |     |     |     |
|            | $PC: BI_2$ |              |     |     |     |     |
| $BI_2$     | IF         |              |     |     |     |     |
|            | $PC: BI_3$ |              |     |     |     |     |

## "Ins" Scheduling -

① Processor executes the program always in a sequential order called as InOrder Execution.

② In the inorder execution sequence, if any ins<sup>n</sup> is data dependent then the remaining ins<sup>n</sup> are also showing the stall cycles which are created by the dependent ins<sup>n</sup>.

Eg:- Inorder Exec<sup>n</sup> Seq -

I<sub>1</sub>: Add  $r_9, r_{11}, r_{12}$

I<sub>2</sub>: Sub  $r_3, r_{10}, r_{14}$

I<sub>3</sub>: MUL  $r_4, r_{15}, r_{16}$

I<sub>4</sub>: DIV  $r_{13}, r_{17}, r_{18}$ .

Inorder Exec<sup>n</sup> Seq -

I<sub>1</sub> - I<sub>2</sub> - I<sub>3</sub> - I<sub>4</sub>.

wait  
stalls.

③ In above prog seq,  $I_2$  is data dependent on  $I_1$  so it undergoes waiting until completion of  $I_1$ . Due to this  $I_3$  &  $I_4$  both are also sharing the stall cycles of  $I_2$  even if they are independent.

④ To handle the above problem, there is a need of schedule the ins<sup>n</sup>. means that execute the independent ins<sup>n</sup> first. Scheduling causes the out-of-order (Reorder) execution.

⑤ Out of Order exec sequence is

$$I_1 - I_3 - I_4 - I_2$$

⑥ Out-of-Order exec creates two more dependencies in the pipeline.

- 1.) Anti-dependency
- 2.) Output-dependency

⑦ Anti-dependency is occurred when the ins<sup>n</sup> 'j' tries to write the data before ins<sup>n</sup> 'i' reads it.

Ex-  $I_3$  is executed before  $I_2$ ,

$I_3$  modifies the seg.  $314$  before  $I_2$  reads it. Therefore, at the time of executing  $I_2$  wrong value will be injected (data-loss).

- (8) Output dependency is occurred when the insn 'j' tries to write the data before insn 'i' writes it.
- Eg:-  $I_4$  is executing before  $I_2$ . So,  $I_4$  modifies the reg.  $r_{13}$  before  $I_2$  writes it. Therefore, the dest. destination ( $r_{13}$ ) is finally containing the old value (data loss).

- (9) To handle the above dependencies, one of the H/W mechanism is used called as register-renaming.
- (10) Reg-renaming states that use some temporary storage (reorder buffer) to store the out-of-order insn results. After completion of the dependent insn exec<sup>n</sup>, the register file is updated with the reorder buffer contents.

$I_1 \rightarrow r_{10}$

$I_3 \rightarrow$  re-order

$I_4 \rightarrow$  Reorder Buffer

$I_2 \rightarrow r_{13}$



Reorder Buffer  $\rightarrow$  Reg. File.

## Hazards :-

- (1) Hazard is a delay. Delay is created in the pipeline due to the dependencies.
- (2) 3 Kinds of Dependencies creates the 3 kinds of Hazards in pipeline :-
  - 1) Structural Hazard.
  - 2) Data Hazard.
  - 3) Control Hazard.
- (3) Data Hazards are classified into 3 kinds based on the order of read & write op.
  - (1) RAW (Read After Write) Hazard
  - (2) WAR (Write After Read)
  - (3) WAW (Write After Write)
- (4) RAW Hazard is created when the ins "j" tries to read the data before ins "i" writes it (TRUE Data Dependency).
- (5) WAR Hazard is created when the ins "j" tries to write the data before ins "i" reads it (Anti-dependency).
- (6) WAW Hazard is created when the ins "j" tries to write the data before ins "i" writes it (output dependency).

$Ins^n$  'j'

$Ins^n$  'i'

- ①  $In\text{-Reg} \rightarrow OUT\text{-Reg}$  (TRUE-DATA)
- ②  $Out\text{-Reg} \rightarrow In\text{-Reg}$  (ANTI)
- ③  $Out\text{-Reg} \rightarrow Out\text{-Reg}$  (OUTPUT)

## Performance Evaluation of Pipeline with Stall-

$$S = \frac{\text{Avg } Ins^n ET_{\text{non-pipe}}}{\text{Avg } Ins^n ET_{\text{pipe}}}$$

$$S = \frac{CPI_{\text{non-pipe}} * \text{cycle time}_{\text{non-pipe}}}{CPI_{\text{pipe}} * \text{cycle time}_{\text{pipe}}}$$

- ① The ideal CPI of pipeline is always almost 1, but due to dependency stalls are created in the pipeline. Therefore,

$$S = \frac{CPI_{\text{non-pipe}} * \text{cycle time}_{\text{non-pipe}}}{(1 + \# \text{stalls}/Ins^n) * \text{cycle time}_{\text{pipe}}}$$

- ② When all pipeline stages are perfectly balanced, then pipeline & non-pipeline both of their cycle times are equal.

$$S = \frac{CPI_{\text{non-pipe}}}{(1 + \# \text{stalls}/Ins^n)}$$

- ③ When all the ins<sup>n</sup> are taking same no. of cycles, then one ins<sup>n</sup> exec<sup>n</sup> time is also equal to

no. of stages in the pipeline. Therefore,

$$S = \frac{\text{Pipeline depth}}{(1 + \# \text{stalls}/T_{ns}^n)}$$

- (4) When the processor is operating with 100% efficiency then no stalls present. Therefore

$$S = \text{Pipeline depth}$$