

- To detect the data-dependency condition some logic should be implemented in ID stage of the pipeline.
- Structure of a logic is -

### Tomasulo Algorithm :-

- This algorithm is used to detect data-dependency in a pipeline. It work on ID stage of each Inst<sup>n</sup>.

Eg:-  $I_1: r_0 \leftarrow r_1 + r_2$

$I_2: r_3 \leftarrow r_0 * r_2$

#### (i) ID-Stage

##### (a) Inst<sup>n</sup> decode

| S.no  | Func <sup>n</sup> | Dest <sup>n</sup> | Indep-<br>endent<br>src1 | Indep-<br>endent<br>src2 | Depen-<br>dent<br>src1 | Depend-<br>ent<br>src2 | Status |
|-------|-------------------|-------------------|--------------------------|--------------------------|------------------------|------------------------|--------|
| $I_1$ | ADD               | $r_0$             | $r_1$                    | $r_2$                    | —                      | —                      | 0      |
| $I_2$ | MUL               | $r_3$             | —                        | $r_2$                    | $r_0$                  | —                      | 0      |

##### (b) Access reg-file -

|       |                                                                |
|-------|----------------------------------------------------------------|
| $I_1$ | $r_1 \leftarrow \text{value}$<br>$r_2 \leftarrow \text{value}$ |
| $I_2$ | $r_0 \leftarrow X$<br>$r_2 \leftarrow \text{value}$            |

check ( $\text{Dest}(I_1) = \text{Src}(I_2)$ )

(ii)

Ex - stage

|                |                                                                                                                      |
|----------------|----------------------------------------------------------------------------------------------------------------------|
| I <sub>1</sub> | EX - stage Allocated                                                                                                 |
| I <sub>2</sub> | EX - Stage is not Allocated<br>(I <sub>2</sub> will be waiting on ID-stage<br>until I <sub>1</sub> status becomes L) |

- By using this logic, we can stop the accessing of a unwanted data in the pipeline execution. Therefore, dependent Inst<sup>n</sup> will wait on ID- stage until the data become available in register.



- To minimize the data-hazard in pipeline, H/W technique is used i.e. operand forwarding also called as bypassing or short circuiting.
- Operand Forwarding techniques states that use the buffers b/w the stages to hold the intermediate result so that dependent inst<sup>n</sup> will access the new value from the buffer before updating the register file. In this process last-stage op<sup>n</sup> is modified to perform the i.e. WRITE-READ op<sup>n</sup> in the same cycle means updated data immediately available for READ access in the same cycle. Therefore buffer is not required at the end of last stage.
- consider the following program code, execute in the pipeline using operand forwarding technique -

Code :-

```

I1: ADD r0, r1, r2
I2: SUB r3, r0, r2
I3: MUL r4, r3, r0
I4: DIV r5, r4, r0
  
```

**NOTE :-** Adjacent Data-Dependency is called as TRUE - Data - Dependency

i.e.

$$\begin{bmatrix} I_2 - I_1 \\ I_3 - I_2 \\ I_4 - I_3 \end{bmatrix}$$

NOTE - Non-Adjacent Data dependency analysis is data-dependency only. But it is eliminated through True-data-dependency

i.e.

$$\begin{bmatrix} I_3 - I_1 & (r_0) \\ I_4 - I_1 & (r_0) \end{bmatrix}$$



$$t_p = \max \left( \begin{array}{c} \text{stage} \\ \text{- Delay} \end{array} + \begin{array}{c} \text{Buffer} \\ \text{Delay} \end{array} \right)$$

↓                          ↓

$4ns$                        $1ns$

Q: Consider following Program , execute on a RISC pipeline. How many cycles are required to complete the program. Assume that all the inst<sup>n</sup> are spending 1 cycle on all the stages but LOAD inst<sup>n</sup> takes 3-cycles on 4<sup>th</sup> stage of a Pipeline.

I<sub>1</sub> : LOAD r<sub>0</sub>, 3(r<sub>1</sub>)

I<sub>2</sub> : ADD r<sub>2</sub>, r<sub>0</sub>, r<sub>1</sub>

I<sub>3</sub> : LOAD r<sub>3</sub>, 4(r<sub>4</sub>)

I<sub>4</sub> : SUB r<sub>5</sub>, r<sub>3</sub>, r<sub>4</sub>

|                | IF | ID | EX | MA | WB |      |
|----------------|----|----|----|----|----|------|
| I <sub>1</sub> | 1  | 1  | 1  | 3  | 1  | LOAD |
| I <sub>2</sub> | 1  | 1  | 1  | 1  | 1  | ADD  |
| I <sub>3</sub> | 1  | 1  | 1  | 3  | 1  | LOAD |
| I <sub>4</sub> | 1  | 1  | 1  | 1  | 1  | SUB  |

Now, generate reservation table -

|       | $cc_1$ | $cc_2$ | $cc_3$ | $cc_4$ | $cc_5$ | $cc_6$ | $cc_7$ | $cc_8$ | $cc_9$ | $cc_{10}$ | $cc_{11}$ | $cc_{12}$ | $cc_{13}$ | $cc_{14}$ | $cc_{15}$ |
|-------|--------|--------|--------|--------|--------|--------|--------|--------|--------|-----------|-----------|-----------|-----------|-----------|-----------|
| $I_1$ | IF     | ID     | EX     | MA        | MA        | MA        | MA        | MA        | WB        |
| $I_2$ | IF     | ID     | ID     | MA        | MA        | MA        | MA        | MA        | WB        |
| $I_3$ | IF     | ID     | IF     | EX     | MA     | MA     | MA     | MA     | MA     | MA        | MA        | MA        | MA        | MA        | WB        |
| $I_4$ | IF     | ID     | IF     | ID     | EX     | MA     | MA     | MA     | MA     | MA        | MA        | MA        | MA        | MA        | WB        |

\* Since due LOAD op<sup>n</sup> data (dependent) will be present at 'MA' stage.

Q. Consider the following program executed on 4-stages (IF, ID, EX, WB) pipeline. Assume that all inst<sup>n</sup> are spending 1-cycle on all the stages but 'MUL' Inst<sup>n</sup> takes 4-cycles on EX stage and 'DIV' Inst<sup>n</sup> takes 5-cycles at EX stage. Operand Forwarding mechanism is used as an optimization technique in the pipeline to handle the data dependency.

- (a) How many cycles are required to complete prog.
- (b) How many cycles are saved using optimization over non-optimization.

I<sub>1</sub>: MUL r<sub>0</sub>, r<sub>1</sub>, r<sub>2</sub>

I<sub>2</sub>: DIV r<sub>3</sub>, r<sub>1</sub>, r<sub>2</sub>

I<sub>3</sub>: ADD r<sub>4</sub>, r<sub>2</sub>, r<sub>1</sub>

I<sub>4</sub>: SUB r<sub>3</sub>, r<sub>1</sub>, r<sub>2</sub>

|                | IF | ID | EX | WB |     |
|----------------|----|----|----|----|-----|
| I <sub>1</sub> | 1  | 1  | 4  | 3  | MUL |
| I <sub>2</sub> | 1  | 1  | 5  | 2  | DIV |
| I <sub>3</sub> | 1  | 1  | 1  | 1  | ADD |
| I <sub>4</sub> | 1  | 1  | 1  | 1  | SUB |

With optimization ( Operand - Forwarding )

|       | $cc_1$ | $cc_2$ | $cc_3$ | $cc_4$ | $cc_5$ | $cc_6$ | $cc_7$ | $cc_8$ | $cc_9$ | $cc_{10}$ | $cc_{11}$ | $cc_{12}$ | $cc_{13}$ | $cc_{14}$ | $cc_{15}$ |
|-------|--------|--------|--------|--------|--------|--------|--------|--------|--------|-----------|-----------|-----------|-----------|-----------|-----------|
| $T_1$ | IF     | ID     | EX     | EX     | EX     | EX     | EX     | WB     |        |           |           |           |           |           |           |
| $T_2$ | IF     | ID     | ID     | ID     | ID     | ID     | EX     | EX     | EX     | EX        | EX        | EX        | EX        | WB        |           |
| $T_3$ | IF        | IF        | IF        | IF        | IF        |           |
| $T_4$ |        |        |        |        |        |        |        |        |        |           |           |           |           |           |           |

(Total cycle = 14)

without optimization ( no - operand Forwarding )

|       | $cc_1$ | $cc_2$ | $cc_3$ | $cc_4$ | $cc_5$ | $cc_6$ | $cc_7$ | $cc_8$ | $cc_9$ | $cc_{10}$ | $cc_{11}$ | $cc_{12}$ | $cc_{13}$ | $cc_{14}$ | $cc_{15}$ | $cc_{16}$ | $cc_{17}$ | $cc_{18}$ | $cc_{19}$ |
|-------|--------|--------|--------|--------|--------|--------|--------|--------|--------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
| $T_1$ | IF     | ID     | EX        | EX        | EX        | EX        | EX        | EX        | EX        | EX        | WB        |           |
| $T_2$ | IF     | ID        | ID        | ID        | ID        | ID        | ID        | ID        | ID        | WB        |           |
| $T_3$ |        |        | IF        | IF        | IF        | IF        | IF        | IF        | IF        | IF        |           |           |
| $T_4$ |        |        |        |        |        |        |        |        |        |           |           |           |           |           |           |           |           |           |           |

( Total cycles = 18 )

$$\begin{aligned} \text{Cycle - saved} &= 18 - 14 \\ &= \underline{\underline{4 - \text{cycles}}} \end{aligned}$$

### (iii) Control Dependency :-

- It occurred in the pipeline when the transfer of control & inst<sup>n</sup> are executed in the pipeline.

Code:-

|                                   |   |                      |
|-----------------------------------|---|----------------------|
| 1000 : I <sub>1</sub>             | ↓ | Falling/through path |
| 1001 : I <sub>2</sub>             |   |                      |
| 1002 : I <sub>3</sub> ( JMP 2000) | ↓ |                      |
| 1003 : I <sub>4</sub>             |   |                      |
| :                                 | : |                      |
| 2000 : BI <sub>1</sub>            | ↓ | Taken path           |
| 2001 : BI <sub>2</sub>            |   |                      |
| :                                 | : |                      |

Expected o/p sequence -

(I<sub>1</sub> → I<sub>2</sub> → I<sub>3</sub> → BI<sub>1</sub> → BI<sub>2</sub>...)

'IF' mprog is -

|                                              |
|----------------------------------------------|
| T <sub>1</sub> : PC → MAR                    |
| T <sub>2</sub> : M[MAR] → MBR<br>PC + I → PC |
| T <sub>3</sub> : MBR → IR                    |

| PC:1000         | cc <sub>1</sub> | cc <sub>2</sub> | cc <sub>3</sub> | cc <sub>4</sub> | cc <sub>5</sub> | cc <sub>6</sub> |
|-----------------|-----------------|-----------------|-----------------|-----------------|-----------------|-----------------|
| I <sub>1</sub>  | IP<br>PC:1001   | ID              | EX              | MA              | WB              |                 |
| I <sub>2</sub>  |                 | IF<br>PC:1002   | ID              | EX              | MA              | WB              |
| I <sub>3</sub>  |                 |                 | IF<br>PC:1003   | ID<br>PC:2000   | EX              | MA              |
| I <sub>4</sub>  |                 |                 |                 | IF<br>PC:1004   | ID              | EX              |
| BI <sub>1</sub> |                 |                 |                 |                 | IF<br>PC:2001   | ID              |
| BI <sub>2</sub> |                 |                 |                 |                 |                 | IF<br>PC:2002   |

IF of I<sub>4</sub> make  
PC ← 1004 but  
during ID of I<sub>3</sub>  
PC ← 2000 (JMP inst<sup>n</sup>)



- In the above exec. sequence, unwanted Inst<sup>n</sup> is executed in the pipeline. So program output is missing (unwanted behaviour). This situation is known as control-dependency.
- To handle such problems, FLUSH op<sup>n</sup> or FREEZE op<sup>n</sup> is used. This op<sup>n</sup> states that, insert the NOP inst<sup>n</sup> after the JMP inst<sup>n</sup> in the program to suspend the unwanted inst<sup>n</sup> fetch.

Code:-

1000:  $I_1$   
 1001:  $I_2$   
 1002:  $I_3$  (JMP 2000)  
 1003:  ~~$I_4$~~  (NOP)  
 1004:  $I_5$   
 1005:  $I_6$   
 ; ;  
 2000 :  $BI_1$   
 2001 :  $BI_2$   
 ;

|          | cc <sub>1</sub> | cc <sub>2</sub> | cc <sub>3</sub> | cc <sub>4</sub> | cc <sub>5</sub> | cc <sub>6</sub> |
|----------|-----------------|-----------------|-----------------|-----------------|-----------------|-----------------|
| PC: 1000 |                 |                 |                 |                 |                 |                 |
| $I_1$    | IF<br>PC: 1001  | ID              | EX              | MA              | WB              |                 |
| $I_2$    |                 | IF<br>PC: 1002  | ID              | EX              | MA              | WB              |
| $I_3$    |                 |                 | IF<br>PC: 1003  | ID<br>PC: 2000  | EX              | MA              |
| NOP      |                 |                 |                 | IF<br>PC: 1004  | ID              | EX              |
| PC: 2000 |                 |                 |                 |                 | IF<br>PC: 2001  | ID              |
| $BI_1$   |                 |                 |                 |                 |                 | IF<br>PC: 2002  |
| $BI_2$   |                 |                 |                 |                 |                 |                 |

Actual - O/P - seq :  $I_1 \rightarrow I_2 \rightarrow I_3 \rightarrow \boxed{NOP} \rightarrow BI_1 \rightarrow BI_2$

unwanted Inst<sup>n</sup>  
+ stall

- Number of stalls created in the pipeline due to a branch instn is called as branch penalty.
- It depends on the availability of target address in the pipeline i.e branch penalty is equal to & at what stage target Address available - 1.

(branch penalty  $\rightarrow$  stage in which target Address available - 1 )

**NOTE:-** In RISC pipeline branch penalty is 1 as target address is available in 2<sup>nd</sup> (ID) stage.

**NOTE:-** For a hypothetical K-stage pipeline , branch penalty is  $K-1$ .

- if stage number of TA :- given

$$(\text{Branch - penalty} = \text{stage number} - 1)$$

- if stage name of TA :- given

$$(\text{Branch - penalty} = \text{corresponding stage number} - 1)$$

- until the Inst<sup>n</sup> is completed (all Inst<sup>n</sup> are proceed through all stages) -

$$\text{Branch penalty} = \text{Last stage in pipeline} - 1$$

**NOTE:-** Total no. of stalls created from the branches during the prog exec<sup>n</sup> is -

$$\left( \frac{\text{Branch frequency}}{\text{Branch Inst}^n / \text{prog}} * \frac{\text{Branch penalty}}{\text{stalls} / \text{branch inst}^n} \right)$$

**NOTE :-** To minimize the control hazard (branch penalty), h/w technique is used i.e. branch prediction buffer also called as branch target buffer or loop buffer.

- Branch - prediction buffer is a high-speed buffer present in IF(stage) used to hold the predicted target address. when target address present in first stage then there is not stall in pipeline.
- **NOTE :-** to minimize the control hazard, S/W mechanism is used i.e. delayed branch.

Delayed Branch :- It is a compiler technique, so compiler will rearrange the code if possible, otherwise substitute the NOP inst<sup>n</sup> after JMP inst<sup>n</sup> to preserve the execution path.

### User-code

```

1000: I1
1001: I2
1002: I3 (JMP 2000) ] SWAP
1003: I4
|
2000: BI1
2001: BI2

```

### COMPILER - M/C code

#### (a) Rearrangement

Addr: I<sub>1</sub>

```

[ I3 ( JMP BI1)
  I2
  I4
  :
  BI1
  BI2
]

```

|                    | CC <sub>1</sub>    | CC <sub>2</sub>     | CC <sub>3</sub>     | CC <sub>4</sub>     | CC <sub>5</sub> |
|--------------------|--------------------|---------------------|---------------------|---------------------|-----------------|
| PC: I <sub>1</sub> | IF                 | ID                  | EX                  | MA                  | WB              |
| I <sub>1</sub>     | PC: I <sub>3</sub> |                     |                     |                     |                 |
| I <sub>3</sub>     | IF                 | ID                  | EX                  | MA                  |                 |
|                    | PC: I <sub>2</sub> | PC: BI <sub>1</sub> |                     |                     |                 |
| I <sub>2</sub>     |                    | IF                  | ID                  | EX                  |                 |
|                    |                    | PC: I <sub>4</sub>  |                     |                     |                 |
| BI <sub>1</sub>    |                    |                     | IF                  | ID                  |                 |
| BI <sub>2</sub>    |                    |                     | PC: BI <sub>2</sub> |                     |                 |
|                    |                    |                     |                     | IF                  |                 |
|                    |                    |                     |                     | PC: BI <sub>3</sub> |                 |

O/P-sequence = I<sub>1</sub> - I<sub>3</sub> - I<sub>2</sub> - BI<sub>1</sub> - BI<sub>2</sub>

- o/p sequence is different but program result will be same , and there is no STALL .

**NOTE :-** Inst<sup>n</sup> rearrangement swapping may be possible in unconditional branch op<sup>n</sup> but no rearrangement can be taken place in case of <sup>conditional</sup> branch Inst<sup>n</sup>.

## (b) NOP - substitution -

MIC generated -

Addr : I<sub>1</sub>

$I_2$        $I_3$  (JMP ~~2~~ BI<sub>1</sub>)

NOP

Iy

6

2

81

$\text{BI}_2$

O/p sequence -

$$I_1 - I_2 - I_3 = NOP - BI_1 - BI_2$$

$$-BI_2$$

## Instruction Scheduling :-

- CPU always execute the program in a sequence from I<sub>1</sub> to I<sub>n</sub> called as in-order execution.
  - In the in-order execution sequence, if any inst<sup>n</sup> is dependent then remaining inst<sup>n</sup> are also sharing the STALL cycle even they are independent.

$I_1 : ADD \quad r_0, r_1, r_2$   
 $I_2 : SUB \quad r_3, r_0, r_4$   
 $I_3 : MUL \quad r_4, r_5, r_6$   
 $I_4 : DIV \quad r_3, r_7, r_8$

In-order execution:-

$I_1 - I_2 - I_3 - I_4$   
stalls

- In the above execution sequence,  $I_2$  dependent on  $I_1$  so  $I_2$  waiting until the  $I_1$  execution is completed. Waiting creates stall. These stalls are shared by  $I_3$  and  $I_4$  inst<sup>n</sup> even they are independent.
- To handle the above problem inst<sup>n</sup> scheduling concept is used. This concept states that execute the independent inst<sup>n</sup> first called out of order (re-order) in inst<sup>n</sup> execution.

$(I_1 - I_3 - I_4 - I_2)$

- Re-order execution creates two more dependencies in the pipeline.

- (i) Anti-data dependency
- (ii) Output data-dependency

Inst<sup>n</sup>                          Inst<sup>n</sup>

READ - before - WRITE

WRITE - before - READ

WRITE - before - WRITE



three types of hazard

- Anti data dependency will be occurred when Inst<sup>n</sup> J WRITE the data before Inst I reads it. eg - I<sub>3</sub> exec before I<sub>2</sub>, so I<sub>3</sub> updates the register r<sub>4</sub> before I<sub>2</sub> reads it ∴ I<sub>2</sub> incorrectly access new value from reg (r<sub>4</sub>) .
- O/p data -dependency will be occurred when the Inst<sup>n</sup> J tries to WRITE the data before Inst<sup>n</sup> I WRITE it. eg - I<sub>4</sub> exec before I<sub>2</sub>, so I<sub>4</sub> updates the register (r<sub>3</sub>) before I<sub>2</sub> WRITES it ∴ destination (r<sub>3</sub>) updates with a old value (Data Loss)

**NOTE:-** To minimize the anti & o/p data -dependency & stalls h/w mechanism is used i.e register renaming.

- Register renaming states use re-order buffers to store the out-of-order inst<sup>n</sup> o/p. Later update the reg - file with a re-order buffer contents after completion of dependent Inst<sup>n</sup> execution.

$$I_1 \rightarrow r_0$$



$$I_2 \rightarrow r_3 \text{ (STATUS=1)}$$



## Hazard :-

- Hazard is a delay.
- Delay is present in the pipeline due to a dependency problem.
- ~~Data~~



{ it is created when  
the inst<sup>n</sup> J tries to  
READ the data before  
inst<sup>n</sup> I WRITES it.  
(RBW).

{ it is created  
when inst<sup>n</sup> J tries  
to WRITES the data  
before inst<sup>n</sup> I READS  
it (WRB).

{ it is created  
when inst<sup>n</sup> J  
tries to WRITE  
data before inst<sup>n</sup>  
I WRITE it  
(WBW)

**NOTE:-** Read- After -Read (RAR) is not a hazard as  
 Read- before- Read (RBR) is not any dependency problem.



| Inst <sup>n</sup> J | Inst <sup>n</sup> I | Type of Dependency | Type of Hazards   |
|---------------------|---------------------|--------------------|-------------------|
| I/P reg             | O/P reg             | True Data          | RAW (in-order)    |
| O/P reg             | I/P reg             | Anti- Data         | WAR (out-order)   |
| O/P reg             | O/P reg             | O/P Data           | WAW (out-order)   |
| I/P reg             | I/P reg             | —                  | —                 |
| ALU                 | ALU                 | Structural         | Structural Hazard |
| JMP                 |                     | control            | control Hazard    |

## Performance Analysis with Stalls :-

$$S = \frac{\text{Avg. Inst}^n \text{ ET}_{\text{non-pipe}}}{\text{Avg. Inst}^n \text{ ET}_{\text{pipe}}}$$

$$S = \frac{\text{CPI}_{\text{non-pipe}} * \text{CT}_{\text{non-pipe}}}{\text{CPI}_{\text{pipe}} * \text{CT}_{\text{pipe}}}$$

- Ideal CPI of pipeline is always 1 but due to dependency problem stalls are present in the pipeline so

~~Stalls~~

$$\left[ S = \frac{\text{CPI}_{\text{non-pipe}} * \text{CT}_{\text{non-pipe}}}{\left( 1 + \frac{\text{stalls}}{\text{Inst}^n} \right) \text{CT}_{\text{pipe}}} \right]$$

- when the pipelines stages are perfectly balanced, then 1-task ET is also equal to no. of stages in the pipeline (i.e  $t_n = k \cdot t_p$ ). under this condition

$$S = \frac{k \cancel{t_p}}{\left( 1 + \frac{\text{stalls}}{\text{Inst}} \right) \cancel{t_p}}$$

$$\left[ S = \frac{K}{1 + \text{stalls/Instn}} \right]$$

- when the system is operated with 100% efficiency then no stalls present. So,

$$S = \frac{K}{1+0} = K$$

Q. Consider 5-stage "instn" pipeline which allows all the "instn" except branch "instn". Processor stops the fetching of a following "instn" after the branch "instn" until the branch "instn" is completed. Target Address is available in the pipeline when "instn" completes its execution. Pipeline stages are operated with 3ns clk. Program contains 30% branch "instn" among them 40% are conditional in - which 50% doesn't satisfy the condition when the condition is false then the following "instn" are overlapped. Calculate the av. "instn" ET and speed-up factor of pipeline.



$$\text{stalls} = K-1 = 5-1 = 4$$



$$\text{stalls / Inst}^n = (0.7 * 0) + (0.3 \times 0.6 \times 4) + (0.3 \times 0.4 \times 0.5 \times 4) + (0.3 \times 0.4 \times 0.5 \times 0)$$

$$= \underline{\underline{0.96}}$$

(a) Avg. inst<sup>n</sup> ET<sub>pipe</sub> =  $(1 + \text{stall}/\text{Inst}^n) \cdot T_{\text{pipe}}$

$$= (1 + 0.96) (3 \text{ns})$$

$$= \underline{\underline{5.88 \text{ns}}}$$

(b)

Performance gain  $s = \frac{k}{1 + \text{stall}/\text{Inst}^n}$

$$= \frac{5}{1 + 0.96}$$

$$(s = 2.55)$$

- Q. Consider 8-stage pipeline operated with 4ns CLK, which allows all the inst<sup>n</sup> except the memory reference inst<sup>n</sup>. MR inst<sup>n</sup> penalty is 6-cycles. Program contains 60% MR inst<sup>n</sup> what is avg. inst<sup>n</sup> execution time.



$$CT_{\text{pipe}} = 4 \text{ ns}$$

Avg. stalls /  $Inst^n$  =  $(0.6 \times 6) + (0.4 \times 0)$   
 $= \underline{\underline{3.6}}$

$$\begin{aligned} \text{Avg. } Inst^n \text{ ET}_{\text{pipe}} &= \left(1 + \frac{\text{stalls}}{Inst^n}\right) CT_{\text{pipe}} \\ &= (1 + 3.6)(4 \text{ ns}) \\ &= \underline{\underline{18.4 \text{ ns}}} \end{aligned}$$

Q. Consider 6-stage pipeline with 2ns cblk uses, delay branch concept to optimize the control dependency stalls. In the delayed branch re-arrangement proc. is used to avoid the stalls, if not possible to re-arrangement then 4 NOP inst<sup>n</sup> are substituted after the branch inst<sup>n</sup> in the program. Program contain 40% branch inst<sup>n</sup> among them 60% are optimized, what is the avg. inst<sup>n</sup> ET.

6-stage  
pipeline with 2ns  
clock



$$\text{stalls/instr}^n = (0.4 \times 0.4 \times 4)$$

$$= \underline{\underline{0.64}}$$

$$\begin{aligned}
 \text{avg. instr}^n \text{ ET}_{\text{pipe}} &= (1 + \text{stalls/instr}^n) \text{ t_p} \\
 &= (1 + 0.64)(2 \text{ ns}) \\
 &= \boxed{(3.28 \text{ ns})}
 \end{aligned}$$

Q. Consider 5-stage pipeline without branch-prediction, use to execute the prog which contain 20 inst<sup>n</sup> ( $I_1 - I_{20}$ ).

Pipeline stage delays are (2ns, 4ns, 8ns, 3ns, 1ns).

In the pipeline all inst<sup>n</sup> are proceed through all stages. If inst<sup>n</sup> is a uncond JMP inst<sup>n</sup>, it transfer the control to  $I_{18}$  inst<sup>n</sup> during its execution. What is the prog execution time.

Program

$$(K = 5, t_p = 8\text{ns})$$

$I_1$

$I_2$

$I_3$

$I_4 : \text{JMP } I_{18}$

$I_5$

:

$I_{18}$

$I_{19}$

$I_{20}$

$$ET_{\text{pipe}} = (K + n-1) t_p$$

$n = 20$ , but due to  
JMP inst<sup>n</sup> all 20 inst<sup>n</sup>  
are not executed.



so, effective no. of input ( $n$ ) = 11

$$\begin{aligned} \text{so, } ET_{\text{pipe}} &= (K + n - 1) t_p \\ &= (5 + 11 - 1)(8 \text{ ns}) \\ &= 120 \text{ ns} \end{aligned}$$

Q. Consider the following code, executed on the pipeline CPU.

$$I_1: r_0 \leftarrow r_1 + r_2$$

$$I_2: r_1 \leftarrow r_0 - r_2$$

$$I_3: r_0 \leftarrow r_1 * r_2$$

$$I_4: r_2 \leftarrow r_0 * r_1$$

$$I_5: r_1 \leftarrow r_2 - r_0$$

$$I_6: M[X] \leftarrow r_1$$

How many RAW, WAR & WAW Hazards possible?

| RAW [In-order]<br>(IN-OUT) | WAR [out-order]<br>(OUT-IN) | WAW [out-order]<br>(OUT-OUT) |
|----------------------------|-----------------------------|------------------------------|
| $I_2 - I_1 (r_0)$          | $I_2 - I_1 (r_1)$           | $I_3 - I_1 (r_0)$            |
| $I_3 - I_2 (r_1)$          | $I_3 - I_2 (r_0)$           | $I_5 - I_2 (r_1)$            |
| $I_4 - I_3 (r_0)$          | $I_4 - I_3 (r_2)$           |                              |
| $I_5 - I_4 (r_2)$          | $I_4 - I_2 (r_2)$           |                              |
| $I_6 - I_5 (r_1)$          | $I_4 - I_1 (r_2)$           |                              |
|                            | $I_5 - I_4 (r_1)$           |                              |
|                            | $I_5 - I_3 (r_1)$           |                              |
|                            | $I_5 - I_1 (r_1)$           |                              |

## Non-linear Pipeline :-

- This pipeline contains forward and backward connection, so reservation table is used to process the inputs.
- Let us consider the following reservation table to process the inputs -



path : S<sub>1</sub> — S<sub>2</sub> — S<sub>3</sub> — S<sub>2</sub> — S<sub>1</sub>

- Multiple checkmarks in the row, shows repeated use of same stage in different cycles.
- contiguous checkmarks in the row indicates extended use of same stage.
- Multiple checkmarks in the column indicates parallel use of <sup>diff</sup> stages in the same cycle.

## Latency analysis :-

- Latency means clk cycle difference b/w to successive initiation in the pipeline.
- It depends on the reservation table.
- In the non-linear pipeline, some of the latencies will causes collisions called as forbidden latency (non-permissible latency). And some of the latency doesn't causes collision called as non-forbidden latency (permissible latency)
- To detect the forbidden latency we need to take the clock cycle difference b/w any two checkmarks in the row. i.e

$$\text{row 1 (S}_1\text{)} : 5 - 1 = 4 \quad (C_4 = 1)$$

$$\text{row 2 (S}_2\text{)} : 4 - 2 = 2 \quad (C_2 = 1)$$

$$\text{row 3 (S}_3\text{)} : \underline{\quad}$$

- Collision vector is derived based on the forbidden latencies.

i.e  $[c_n \dots c_2 c_1]$

$c$   $\begin{cases} 0 & ; \text{permissible} \\ 1 & ; \text{non-permissible} \end{cases}$

length of collision vector = no. of cycles needed to execute one instn.

eg:-  $c_5 \ c_4 \ c_3 \ c_2 \ c_1$   
 $[0 \ 1 \ 0 \ 1 \ 0]$

- Latency sequence is derived based on collision vector w.r.t permissible latencies. i.e

(i) Latency seq w.r.t 'c<sub>1</sub>' :-

|                | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
|----------------|---|---|---|---|---|---|---|---|---|----|----|----|----|
| s <sub>1</sub> | 1 | 2 |   |   | 1 | 2 | 3 | 4 |   |    | 3  | 4  | 5  |
| s <sub>2</sub> |   | 1 | 2 | 1 | 2 |   |   | 3 | 4 | 3  | 4  |    |    |
| s <sub>3</sub> |   |   | 1 | 2 |   |   |   |   | 3 | 4  |    |    |    |

Latency cycle

(1,5), (1,5), (1,5) ...

$$(2-1) = 1$$

$$5(7-2) = 5$$

$$(\text{Avg - latency}) = \frac{1+5}{2} = 3$$

$$(8-7) = 1$$

$$(13-8) = 5$$

b.

(ii) Latency seq w.r.t 'c<sub>2</sub>' :-

|                | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
|----------------|---|---|---|---|---|---|---|---|---|----|----|----|----|
| s <sub>1</sub> | 1 |   |   | 2 | 1 |   | 3 | 2 |   | 4  | 3  |    |    |
| s <sub>2</sub> |   | 1 |   | 1 | 2 |   | 2 | 3 |   | 3  | 4  |    | 4  |
| s <sub>3</sub> |   |   | 1 |   |   | 2 |   |   | 3 |    |    | 4  |    |

### Latency cycle

3, 3, 3 ...

$$(4-1) = 3$$

$$(7-4) = 3$$

$$(10-7) = 3$$

(Avg-latency = 3)

(iii)

### Latency seq wrt 'Cs' :-

|       | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 |
|-------|---|---|---|---|---|---|---|---|---|----|----|----|----|----|----|----|
| $s_1$ | 1 |   |   |   | 1 | 2 | 3 |   |   | 2  | 3  | 4  | 5  |    |    | 4  |
| $s_2$ |   | 1 | 1 |   |   | 2 | 3 | 2 | 3 |    |    | 4  | 5  | 4  | 5  |    |
| $s_3$ |   |   | 1 |   |   |   |   | 2 | 3 |    |    |    | 4  | 5  |    |    |

### Latency -cycle

$$(6-1) = 5$$

(5,1), (5,1), (5,1), ...

$$(7-6) = 1$$

$$(12-7) = 5$$

$$(13-12) = 1$$

(Avg-latency =  $\frac{5+1}{2} = 3$ )

## Memory Organization :-

$$\text{Performance} \propto \frac{1}{\text{Access time}}$$

$$\text{Access-time} \propto \text{Execution time}$$

$$\text{Total Time} = \frac{\text{Hit/miss}}{\text{Latency}} + \frac{\text{Access time}}{\text{Depends on Bandwidth}} + \frac{\text{Transfer time}}{\text{Depends on Bandwidth}}$$

↓  
(Negligible)

## Types of memory organization :-

(i) Simultaneous Access mem org<sup>n</sup>.

(ii) Hierarchical Access mem org<sup>n</sup>.

### (i) Simultaneous Access memory org<sup>n</sup>:

- In this org<sup>n</sup>, CPU directly connected to all the levels of a memory, so that CPU will be accessing the data from any level of the memory in a sequence i.e
  - when there is miss in level 1 mem, then CPU directly access the data from level 2 mem without copying it into level 1 mem

- when there is a miss in level 2 <sup>mem</sup> then CPU access data directly from level 3 mem without copying it into level 2 and level 1 memory and so on.



- Hit-Ratio ( $H$ ) =  $\frac{\text{Hits}}{\text{Total Access}}$

Hit-Ratio ( $H$ ) + miss-ratio = 100%.

-  $T_1 < T_2 < T_3 < \dots < T_n$

Time required to access 1 word Data from the mem is called as avg. memory access Time

$$T_{\text{avg}} = H_1 T_1 + (1-H_1) H_2 T_2 + (1-H_1)(1-H_2) H_3 T_3 + \dots - (1-H_1)(1-H_2) \dots (1-H_{n-1}) H_n T_n$$

1 word  $\rightarrow T_{avg}$

so in 1 sec =  $\frac{1}{T_{avg}}$  words/sec

$$\eta_{mem} = \frac{1}{T_{avg}} \text{ words/sec}$$

### (ii) Hierarchical Access memory orgn:-

- In this organization, CPU always connected to level 1 memory so that CPU always accessing the data only from the level 1 memory.
- When there is a 'miss' in level 1 memory then the respective data will be copied from higher levels to level 1 memory. Later CPU will access the data only from the level 1 memory.



$$T_{avg} = \underbrace{H_1 T_1}_{+} + \underbrace{(1-H_1) H_2 (T_2 + T_1)}_{+ \dots +} + \underbrace{(1-H_1)(1-H_2) H_3 (T_3 + T_2 + T_1)}_{+ \dots +} \\ + \dots + \underbrace{(1-H_1)(1-H_2) \dots (1-H_{n-1}) H_n (T_n + T_{n-1} + \dots + T_1)}$$

Analysis:-

I<sub>1</sub>: Data (Assume that data present in L2 Mem)

I<sub>2</sub>:

I<sub>3</sub>: Data

I<sub>4</sub>:

I<sub>5</sub>: Data

I<sub>6</sub>:

I<sub>7</sub>: Data

Re-use the I<sub>1</sub> Data

Assume L<sub>1</sub> → T<sub>1</sub> (2ns)

L<sub>2</sub> → T<sub>2</sub> (10ns)

Simultaneous

I<sub>1</sub>: M<sub>L<sub>1</sub></sub>, T<sub>2</sub>

miss in  
L<sub>1</sub>

I<sub>3</sub>: M<sub>L<sub>1</sub></sub>, T<sub>2</sub>

I<sub>5</sub>: M<sub>L<sub>1</sub></sub>, T<sub>2</sub>

I<sub>7</sub>: M<sub>L<sub>1</sub></sub>, T<sub>2</sub>

Hierarchical

I<sub>1</sub>: M<sub>L<sub>1</sub></sub>, T<sub>2</sub> + T<sub>1</sub>

I<sub>3</sub>: T<sub>1</sub>

I<sub>5</sub>: T<sub>1</sub>

I<sub>7</sub>: T<sub>1</sub>

Temporal  
locality

access same  
data word

$$\text{Time} = 4T_2$$

$$= (40\text{ns})$$

$$\text{Time} = T_2 + 4T_1$$

$$= (18\text{ns})$$

$I_1$ : Data ( $w_1$ )

$I_2$ : " ( $w_2$ )

$I_3$ : " ( $w_3$ )

$I_4$ : " ( $w_4$ )

Assume all the words  
are present in  $L_2$ , Mem

### Simultaneous

$I_1$ :  $M_{L_1}, T_2$

$I_2$ :  $M_{L_1}, T_2$

$I_3$ :  $M_{L_1}, T_2$

$I_4$ :  $M_{L_1}, T_2$

### Hierarchical

$I_1$ :  $M_{L_1}, T_2 + T_1$

$I_2$ :  $M_{L_1}, T_2 + T_1$

$I_3$ :  $M_{L_1}, T_2 + T_1$

$I_4$ :  $M_{L_1}, T_2 + T_1$

---

$$\text{Time} = 4T_2$$

$$= (40 \text{ ns})$$

$$\text{Time} = 4(T_2 + T_1)$$

$$= (48 \text{ ns})$$

So, ~~adjust~~ adjustment is done, data is transfer  
in the form of blocks.

Assume block size =  $16W$  so,

in one data - transfer  $16$  words will  
be move to  $L_1$

$$I_1 : M_{L_1}, T_2 + T_1$$

$$I_2 : T_1$$

$$I_3 : T_1$$

$$I_4 : T_1$$



Spatial  
Locality



adjacent  
data words

Accessing

$$\text{Time} = T_2 + 4T_1$$

$$= (18 \text{ ns})$$

**NOTE :-** In the hierarchical access memory org", access time is low because of the principal 'locality of reference'.

- "Locality of reference" means accessing the higher levels of a memory data from the lower-level memory ( $L_1$ ). It is of two types:
  - (i) Temporal locality
  - (ii) Spatial locality
- To satisfy the above locality of reference, we need to organise the data in the memory in form of blocks. Block contain multiple words and its size is decided by designer.

**NOTE:-** when the question contain hierarchy word or cache memory then use hierarchical org<sup>n</sup> otherwise use simultaneous org<sup>n</sup>.

- Q.** In a two-level memory org<sup>n</sup>, Level 1 memory is 8 times faster than level 2 memory. And its access time is 30ns less than avg memory access time. Let L<sub>1</sub> memory access time is 40ns. What is the Hit-ratio?

$$(H_2 = 1)$$

$$T_{avg} = H_1 T_1 + (1-H_1) H_2 T_2$$

$$S = \frac{T_2}{T_1} = 8$$

$$T_{avg} = H_1 T_1 + 8(1-H_1) T_2$$

$$(T_2 = 8T_1)$$

$$(T_1 = 40\text{ns})$$

$$70\text{ns} = H_1(40) + 8(1-H_1)(40)$$

$$\text{so, } T_2 = 320\text{ns}$$

and

$$70 = 40H_1 + 320 - 8 \cdot 320H_1$$

$$T_{avg} = 40 + 320H_1$$

$$70 = -280H_1 + 320$$

$$280H_1 = 250$$

$$\left( H_1 = \frac{25}{28} \right)$$

Q. Three-level memory org' has the following specification :

| Level | Access time per word | Block-size in word | Hit - ratio |
|-------|----------------------|--------------------|-------------|
| 1     | 40 ns                | —                  | 0.6         |
| 2     | 100 ns               | 2                  | 0.9         |
| 3     | 200 ns               | 4                  | 1           |

If the referred data is not in L1 memory then transfer it from L2 to L1 and if not in L2 mem then transfer it from  $L3 \rightarrow L2 \rightarrow L1$ . What is  $T_{avg}$ ?

$$T_1 = 40 \times 1 = 40 \text{ ns}$$

$$T_2 = 100 \times 2 = 200 \text{ ns}$$

$$T_3 = 200 \times 4 = 800 \text{ ns}$$

so,

$$\begin{aligned}
 T_{avg} &= H_1 T_1 + (1-H_1) H_2 (T_2 + T_1) + (1-H_1)(1-H_2)(T_2 + T_1 \\
 &\quad + T_3) \\
 &= 0.6(40) + (0.4)(0.9)(240) + (0.4)(0.1)(1040) \\
 &= 24 + 86.4 + 41.6 \\
 &= \underline{\underline{152 \text{ ns}}}
 \end{aligned}$$