

## Performance Analysis :-

- Performance is an indirect measurement depends on the execution time.

$$\left( \text{Performance} \propto \frac{1}{\text{Execution time}} \right)$$

- Execution time means the time required to complete the program.

$$\begin{aligned}\text{Exec. time (CPU time)} &= \text{no. of seconds / program} \\ &= \underset{\substack{\downarrow \\ \text{Inst}^n \text{ count} \\ (\text{IC})}}{\text{Inst}^n/\text{prog}} \times \underset{\substack{\downarrow \\ \text{CPI}}}{\text{cycles/Inst}^n} \times \underset{\substack{\downarrow \\ \text{CT}}}{\text{second/cycle}}\end{aligned}$$

$$(\text{CPU time} = \text{IC} * \text{CPI} * \text{cycle time})$$

In the program , different Inst<sup>n</sup> are taking different no. of cycles to complete program execution . So,

$$(\text{Program execution time} = \sum (\text{IC}_i * \text{CPI}_i) \text{ CT})$$

where i is type of inst<sup>n</sup>

- Speed Up (s) factor is used to measure the performance gain. i.e

$$s = \frac{\text{Performance } n}{\text{Performance } y}$$

$$s = \frac{\frac{1}{ET_n}}{\frac{1}{ET_y}}$$

$$(s = \frac{ET_y}{ET_n} = n)$$

\* this means system x will runs n times faster than system y

Eg:-

sys  
X                    sys  
Y

[Task]  
 $ET_n = 5\text{ns}$

[Task]  
 $ET_y = 10\text{ns}$

so,  $s = \frac{ET_y}{ET_n} = \frac{10}{5}$

$$\therefore (s = 2)$$

Thus sys n is 2 times faster than sys y

So, relation b/w system  $y^n$  &  $y$  is -

$$S = \frac{y}{x} \quad \text{since } S = 2$$

$$\text{so, } \frac{y}{x} = 2 \Rightarrow \boxed{y = 2x}$$

- Q. Consider 2ns clock cycle processor used to execute the following code :

| Inst <sup>n</sup> | Inst <sup>n</sup> count<br>(IC) | CPI |
|-------------------|---------------------------------|-----|
| LOAD              | 40                              | 8   |
| STORE             | 60                              | 6   |
| ALU               | 20                              | 3   |
| BRANCH            | 40                              | 2   |

what is the performance of a system in terms of MIPS (Million Inst<sup>n</sup> per second).

$$ET \text{ of prog} = \sum_i (IC_i * CPI_i) CT$$

$$\begin{aligned}
 \text{ET of prog} &= (40(8) + 60(6) + 20(3) + 40(2)) 2 \\
 &= (320 + 360 + 60 + 80) 2 \\
 &= 820 \times 2 \text{ ns}
 \end{aligned}$$

$$\begin{aligned}
 \text{Avg. Inst^n ET} &= \frac{\text{ET of prog}}{\text{Total no. of Inst^n in prog}} \\
 &= \frac{820}{\sum \text{IC}} = \frac{820 \times 2 \text{ ns}}{160} \\
 &= 10.25 \text{ ns}
 \end{aligned}$$

$$1 \text{ Inst}^n \longrightarrow 10.25 \text{ ns}$$

$$\frac{10^9}{10.25} \text{ Inst}^n \leftarrow 1 \text{ ns}$$

Thus (performance = 97.5 MIPS)

Q. Consider a 2.5 ns clock cycle processor which consumes 9 cycles for data transfer inst<sup>n</sup>, 6 cycles for ALU inst<sup>n</sup> and 3 cycles for Branch inst<sup>n</sup>.

Relative frequencies of these inst<sup>n</sup> are 40%, 40% & 20% respectively. System is enhanced to make avg. CPI = 1, in this process CT is increased upto 30%. What is the performance gain.

$$ET_{old} = \sum (IC_i * CPI_i) CT$$

$$= [(0.4 * 9) + (0.4 * 5) + (0.2 * 3)] 2.5 \text{ ns}$$

$$= \underline{\underline{16.5 \text{ ns}}}$$

$$ET_{new} = \sum (IC_i * CPI_i) CT$$

$$\{CPI = 1\} = [0.4 + 0.4 + 0.2] [2.5 + (30\% \text{ of } 2.5)]$$

$$= \underline{\underline{3.25 \text{ ns}}}$$

$$S = \frac{ET_{old}}{ET_{new}} = \frac{16.5 \text{ ns}}{3.25 \text{ ns}}$$

$$(S = 5.07)$$

so new system runs 5.07 times faster than old system

## Amthal's Law -

- It states that improve the performance within cost & price limit while making the common case fast-  
(Improve the performance by enhance part of the system rather than the replacement of the system)
- It focused on the performance gain after enhance a part of the system. i.e

$$S_{\text{overall}} = \frac{\text{Performance of sys with enhance (new)}}{\text{Performance of sys without enhance (old)}} \quad \text{--- (i)}$$

$$S_{\text{overall}} = \frac{ET_{\text{old}}}{ET_{\text{new}}}$$

- In the enhancement only the part of a system is modified therefore <sup>new</sup> system contain two portions -



$$ET_{\text{new}} = ET \text{ of unenhanced portion} + ET \text{ of an enhanced portion}$$

Eg:-



\* So, always frequently used units are enhanced

Two factors required to calculate  $ET_{new}$  :-

↳ Fraction enhanced ( $F$ )

↳ Speed Up enhanced

Fraction enhanced" implies how much part of the system is enhanced or modified. i.e

$F$  is enhanced portion

$1-F$  is unenhanced portion

Speed Up enhanced indicates performance gain of an enhanced portion.

$$S = \frac{\text{Performance of a new } F}{\text{Performance of old } F}$$

$$S = \frac{\text{ET of old } F}{\text{ET of new } F}$$

$$\text{ET of new } F = \frac{\text{ET of old } F}{S} - \quad (ii)$$

Substitute eq. (ii) in eq (i) :-

$$(\text{ET}_{\text{new}} = \text{ET of old } (1-F) + \frac{\text{ET of old } (F)}{S})$$

so,

$$S_{\text{overall}} = \left[ \frac{ET_{\text{old}}}{ET_{\text{old}}(1-F) + \frac{ET_{\text{of old}}(F)}{s}} \right]$$

- To cover the complete system let us consider the relative value i.e

$$\text{System} = 100\% \quad (\text{complete})$$

$$S_{\text{overall}} = \frac{100\%}{(100\% - F) + F/s}$$

$$= \frac{1}{(1-F) + F/s}$$

$$= \left[ \left( (1-F) + F/s \right)^{-1} \right]$$

when the system contains multiple enhancement then performance gain is calculated as -

$$S_{\text{overall}} = \left[ (1 - \sum f_i) + \sum \frac{f_i}{s} \right]$$

$i$  is no. of enhancements

Q. Consider a hypothetical system used in scientific application area. Application program refers the integer and floating point units during its execution. Floating point unit is enhanced then it runs 2 times faster but only 10% of integer instr<sup>n</sup> are floating point. What is the performance gain.

$$F = 10\% \quad \text{and} \quad S = 2$$

so, apply ambdal's law -

$$S_{\text{overall}} = \frac{100}{(100-F) + F/S} = \frac{100}{95} = \underline{\underline{1.05}}$$

NOTE :- In the above sys enhancement, rare component is modified so there is no considerable performance gain. correct enhancement is when we enhance frequently used components.

$$F = 90\% \quad \text{and} \quad S = 2 \quad (\text{Integer unit})$$

$$S_{\text{overall}} = \frac{100}{(100-90) + \frac{90}{2}} = \frac{100}{55} = \underline{\underline{1.81}}$$

Q. Consider cache memory which is 9-times faster than main memory and it is referenced by the CPU 80% of time. What is the performance gain of a system using the cache memory.

$$S = 9 \quad \text{and} \quad F = 80\%$$

so, apply Ambdal's law -

$$\begin{aligned} S_{\text{overall}} &= \frac{100}{(100-F) + F/S} = \frac{100}{(100-80) + \frac{80}{9}} \\ &= \frac{900}{180 + 80} = \frac{900}{260} = \underline{\underline{3.46}} \end{aligned}$$

Q. Consider 2ns clock-cycle processor which consumes 4 cycles per data transfer inst<sup>n</sup>, 8 cycles ALU inst<sup>n</sup> and 2 cycles for branch inst<sup>n</sup>. Relative frequencies of these inst<sup>n</sup> are 30%, 60% and 10% respectively. System is enhanced to make the ALU inst<sup>n</sup> CPI=1. What is the performance gain.

$$ET_{old} = \sum (IC_i * CPI_i) CT$$

$$\text{of inst}^n = (0.3(4) + 0.6(8) + 0.1(2)) 2\text{ns}$$
$$= 12.4\text{ns}$$

$$ET_{new} = \sum (IC_i * CPI_i) CT$$

$$\text{of inst}^n = (0.3(4) + 0.6(1) + 0.1(2)) 2\text{ns}$$
$$= 4\text{ns}$$

$$(S = \frac{12.4}{4} = 3.1)$$

Q. 3 enhancements are proposed for a new architecture with respective speedup's ( $s_1 = 30, s_2 = 20, s_3 = 15$ ).

Enhancement 1 and 2 each usable of 25% of time and enhancement 3 is usable for 45% of time. What is overall performance gain %?

$$F_1 = 25 ; S_1 = 30$$

$$F_2 = 25 ; S_2 = 20$$

$$F_3 = 45 ; S_3 = 15$$

$$S_{\text{overall}} = \frac{100}{(100 - \sum f_i) + \sum \left(\frac{f_i}{s_i}\right)}$$

$$= \frac{100}{(100 - 95) + \sum \left(\frac{25}{30} + \frac{25}{20} + \frac{45}{15}\right)}$$

$$S_{\text{overall}} = \frac{100}{5 + 0.83 + 1.25 + 3} = \underline{\underline{9.917}}$$

## High-performance CPU design:-

- High-performance CPU exhibits the concurrency.
- Concurrency means two or more <sup>Inst<sup>n</sup></sup> execution at a time.
- According to Flynn's classification computer architecture is of 4 kinds named as:

- (i) SISD (<sup>Inst<sup>n</sup></sup> Single ~~Input~~ Stream & single Data Stream)
- (ii) SIMD (<sup>Inst<sup>n</sup></sup> Single ~~Input~~ Stream & multiple data stream)
- (iii) MISD (<sup>Inst<sup>n</sup></sup> Multiple ~~Input~~ Stream & single data stream)
- (iv) MIMD (<sup>Inst<sup>n</sup></sup> Multiple ~~Input~~ Stream & Multiple data stream)

shortcuts used :-

CU : Control unit

CPU

PU : Processing unit

MU : Memory unit

IS : <sup>Inst<sup>n</sup></sup> Input Stream

DS : Data stream

(i) SISD :-



Implemented as an uni-processor system

eg:- 8085 MP                          - No concurrency  
      8086 MP

(ii) SIMD :-



Implemented as an Array processor system (vector processor).

e.g.: - Storar processor

PEPE processor

- Data-level concurrency

### (iii) MISD :-



This architecture contains multiprocessors but only one is in use at a time as Data stream is single, only one CPU can communicate with MU.

So, this Architecture is not yet Implemented

(iv) MIMD :-



Implemented as a multi-processor system

Eg:- Cray Processor - Inst<sup>n</sup> level concurrency  
Cyber Processor

SISD without pipeline (non-pipeline) :-



Execution seq of inst<sup>n</sup>'s :-

|    |                |                |                |                |                |                |                |                |   |    |    |    |    |    |    |    |
|----|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|---|----|----|----|----|----|----|----|
| WB |                | I <sub>1</sub> |                | I <sub>2</sub> |                | I <sub>3</sub> |                | I <sub>4</sub> |   |    |    |    |    |    |    |    |
| EX |                | I <sub>1</sub> |                | I <sub>2</sub> |                | I <sub>3</sub> |                | I <sub>4</sub> |   |    |    |    |    |    |    |    |
| ID |                | I <sub>1</sub> |                | I <sub>2</sub> |                | I <sub>3</sub> |                | I <sub>4</sub> |   |    |    |    |    |    |    |    |
| IF | I <sub>1</sub> |                | I <sub>2</sub> |                | I <sub>3</sub> |                | I <sub>4</sub> |                |   |    |    |    |    |    |    |    |
|    | 1              | 2              | 3              | 4              | 5              | 6              | 7              | 8              | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 |

| <u>Inst<sup>n</sup></u> | <u>Required cycles</u> | <u>Counted cycles</u>    |
|-------------------------|------------------------|--------------------------|
| I <sub>1</sub>          | 4                      | 4                        |
| I <sub>2</sub>          | 4                      | 4                        |
| I <sub>3</sub>          | 4                      | 4                        |
| I <sub>4</sub>          | 4                      | 4                        |
|                         |                        | Total cycles = <u>16</u> |

SISD with pipeline :-

Execution stages : IF , ID , EX , WB



Execution sequence of 4 Inst<sup>n</sup> ( $I_1, I_2, I_3, I_4$ ) -

| WB |       |       |       | $I_1$ | $I_2$ | $I_3$ | $I_4$ |  |
|----|-------|-------|-------|-------|-------|-------|-------|--|
| EX |       |       | $I_1$ | $I_2$ | $I_3$ | $I_4$ |       |  |
| ID |       | $I_1$ | $I_2$ | $I_3$ | $I_4$ |       |       |  |
| IF | $I_1$ | $I_2$ | $I_3$ | $I_4$ |       |       |       |  |
|    | 1     | 2     | 3     | 4     | 5     | 6     | 7     |  |

| Inst <sup>n</sup> | Required cycles | counted cycles          |
|-------------------|-----------------|-------------------------|
| $I_1$             | 4               | 4                       |
| $I_2$             | 4               | 1                       |
| $I_3$             | 4               | 1                       |
| $I_4$             | 4               | 1                       |
|                   |                 | Total cycles = <u>7</u> |

Generalization { K - stage pipeline  
n - tasks } (Total cycle =  $k + n - 1$ )

| Task  | Req. - cycles | counted - cycles |
|-------|---------------|------------------|
| $T_1$ | K             | K                |
| $T_2$ | K             | 1                |
| :     | :             | :                |
| $T_n$ | K             | 1                |

## Pipelining :-

- Function - Pipelining uses decomposition ~~process~~ technique i.e divide the problem statement into independent subproblems, assign them to a independent hardware. Later connect the hardware in a pipeline sequence i.e



- Definition - Accepting the new I/p at one end before the previously accepted I/p is appears as an O/p at the other end.
- Definition states that, insert new I/p into a pipeline before completion of a old I/p.
- Pipeline allows overlapping execution (new I/p overlapping with old I/p)
- Overlapping execution sequence in the pipeline is described using space-time diagram.

## Space - Time Diagram



- Successful characteristic of a pipeline is, at every new cycle new I/p must be inserted into a pipeline  
 $\therefore CPI = 1$ .

NOTE:- In a non-pipeline sys new I/p is inserted after completion of old I/p called as non-overlapping execution.

## Design :-

- Pipeline contain two ends (I/p and O/p end) b/w these ends multiple pipes are interconnected to satisfy the functionality of a pipeline. Each piping in the

pipeline is known as stages.

- Interface registers are used b/w stages to hold the intermediate result. It is also called as buffer, or latch or pipeline register.
- All the stages in the pipeline along with the buffer is connected to a common clock. so clock adjustment is very important in the pipeline design.

(a) in uniform - delay pipeline -

$$(\text{cycle time } (t_p) = \text{stage delay})$$



(b) in non-uniform - delay pipeline -

$$(\text{cycle time } (t_p) = \text{max-stage Delay})$$



(c) if buffer-delay present in the pipeline -

$$(t_p = \max (\text{stage delay} + \text{buffer delay}))$$

(d) If skew time / setup time overhead is present then,

$$(t_p = t_p + \text{overhead})$$

extra-delay  
at run-time



CASE-1: 2ns      2ns      2ns      2ns  
 $t_p = 2\text{ns}$  (uniform delay)

CASE-2: 2ns      4ns      8ns      3ns

$t_p = 8\text{ns}$  (non-uniform delay)

CASE-3

Buffer delay = 1ns

 $t_p = 3\text{ ns}$  uniform delay with buffer delay $t_p = 9\text{ ns}$  non uniform delay with "CASE-4

Skew time overhead is 0.2ns

 $t_p = 3.2\text{ ns}$  uniform delay + Buffer delay + Skew time $t_p = 9.2\text{ ns}$  non uniform + " + "Performance Analysis:-

- Consider 'k-segment' pipeline with cycle time of ' $t_p$ ' used to execute 'n' tasks.
- In the pipeline, very first task is executed in a non-overlapping order. so it takes k-cycles. The remaining  $n-1$  task are executed in overlapping order i.e emerging out at the rate of 1-task / cycle so  $n-1$  task takes  $n-1$  cycles.

Thus total time to complete n-tasks by k-stage pipeline ,

$$(ET_{\text{pipe}} = (k+n-1) \text{ cycles})$$

- consider non-pipeline procedure used to execute 'n' tasks in which each task takes  $t_n$  time to complete ~~per task~~ therefore total time to execute n tasks so,

$$(ET_{\text{non-pipe}} = n * t_n)$$

- Performance gain of a pipeline is -

$$S = \frac{ET_{\text{non-pipe}}}{ET_{\text{pipe}}}$$

$$\left( S = \frac{n \cdot t_n}{(k + (n-1)) \cdot t_p} \right) ; n \text{ is limited}$$

when no. of task (n) increases then  $n \gg k$

so,  $k+n-1 \rightarrow n$ , under this condition

$$S = \frac{n \cdot t_n}{n \cdot t_p}$$

$$\left( S = \frac{t_n}{t_p} \right) ; n \text{ is infinity}$$

$t_n$  = time of one-task ET in non-pipe

$t_p$  = cycle-time

- when the pipeline stages are perfectly balanced (uniform-delay) then 1-task execution time in the non-pipeline is also equal to no. of stages in the pipeline

i.e.

$$\begin{aligned} t_n &= K \cdot \text{cycles} \\ t_n &= K \cdot t_p \end{aligned}$$

Under this condition

$$S = \frac{k t_p}{t_p}$$

$$S = K \quad \text{i.e. } \eta = 100\%$$

K is pipeline depth.

- when the system is operating with  $\eta = 100\%$  then maximum speed-up is possible i.e. equal to K (pipeline-depth)

$$\eta_{\text{pipe}} = \frac{S}{S_{\max}} = \frac{S}{K}$$

$$\text{Throughput pipe} = \frac{\text{tasks processed}}{\text{Total time to process the task}}$$

$$= \frac{n}{(k+n-1)tp}$$

**NOTE:-** When the pipeline stages are perfectly balanced (uniform delay) then one task execution time in the pipe line is also equal to 1 task execution time in the non-pipeline.



(a) 1 - Task ET in pipeline

$$k = 4$$

$$tp = 2\text{ns}$$

$$n = 1$$

$$\begin{aligned} ET_{\text{pipe}} &= (k + n - 1)tp \\ &= 4 + 1 - 1 = 4tp \\ &= \underline{8\text{ns}} \end{aligned}$$

(b) 1-task ET in non-pipeline ( $t_n$ )

$$t_n = S_1 + S_2 + S_3 + S_4$$
$$= \underline{8 \text{ ns}}$$

$$S = \frac{t_n}{t_p} \quad \text{when } n \rightarrow \infty$$

$$\eta = S/K = K/K = 1$$

$$S = \frac{8}{2} = 4 \text{ i.e. } K$$

so, for  $n \rightarrow \infty$  pipeline achieves 100% efficiency.

**NOTE:-** When pipeline stages are not balanced (non-uniform delay). Then 1-task execution time in the pipeline is greater than 1-task ET in non-pipeline.

(a) 1-task ET in pipeline -



$$K = 4$$

$$t_p = 8 \text{ ns}$$

$$n = 1$$

$$ET_{\text{pipe}} = (K + n - 1) t_p$$
$$= (4) t_p$$
$$= 32 \text{ ns}$$

(b) 1-task ET in non-pipe ( $t_n$ )

$$t_n = s_1 + s_2 + s_3 + s_4$$

$$= 20 \text{ ns}$$

$$S = \frac{t_n}{t_p} ; n \rightarrow \infty \quad S = \frac{20}{8} \quad S = 2.5 \neq K$$

$$\eta = \frac{S}{K} = \frac{2.5}{4} = 62.5\%$$

NOTE :- For a non-uniform pipe  $\eta$  can't be 100%.

### Types of Pipe-line :-

- (i) **Linear-pipe-line** - This pipeline contains only forward connection. So latency is always 1. Latency means clock-cycle difference b/w two successive initiations in the pipeline.

Linear-pipeline are synchronous pipeline.



(ii) Non-Linear pipeline :- This pipeline contains forward and backward connection so reservation table is required to prove the F/P. Therefore Latency depends on the reservation table.

They are asynchronous pipeline.



Reservation - Table

|       | 1 | 2 | 3 | 4 | 5 | 6 |
|-------|---|---|---|---|---|---|
| $S_1$ | X |   |   |   |   |   |
| $S_2$ |   | X |   | X |   |   |
| $S_3$ |   |   | X |   | X |   |
| $S_4$ |   |   |   |   |   | X |

$$(S_1 \rightarrow S_2 \rightarrow S_3 \rightarrow S_2 \rightarrow S_3 \rightarrow S_4)$$

Q. Consider "first" pipeline with a speedup of 8 operating with 72% efficiency. what could be the no. of stages in the pipeline.

$$\gamma = \frac{S}{K} ; \quad S = 8 \quad \text{and} \quad \eta = 0.72$$

$$\text{so, } K = \frac{S}{\eta} = \frac{8}{0.72} = 11.11$$

$$(K \approx 12)$$

Q. Consider two pipelines A and B where A is having 8-stages of uniform delay of 2ns . B having 6-stages with delays of ( 2ns, 6ns, 8ns, 4ns, 3ns, 1ns)

- How much time is saved to process 100 task using A instead of B.

$$\begin{aligned}
 n &= 100 \\
 ET_B &= n \cdot T_n \\
 &= n(2+6+8+4+3+1) \\
 &= 100(24) \\
 &= 2400 \text{ ns}
 \end{aligned}$$

$$\begin{aligned}
 ET_A &= (k+n-1) T_p \\
 &= (8+100-1) 2ns \\
 &= 107(2) ns \\
 &= \underline{214 ns}
 \end{aligned}$$

$$ET_B = (k+n-1) T_p$$

$$\begin{aligned}
 &= (6+100-1) 8 \\
 &= 105(8) ns \\
 &= \underline{840 ns}
 \end{aligned}$$

$$\text{Time saved} = ET_B - ET_A$$

$$= 840 - 214$$

$$= \underline{(626 ns)}$$

Q. Consider 4-stage pipeline with a respective delays of (20ns, 40ns, 30ns, 10ns). What is performance gain and efficiency.

$$n \rightarrow \infty, k=4$$

$$t_p = 40ns, t_n = 100ns$$

$$\text{Performance gain } S = \frac{t_n}{t_p} = \frac{100}{40}$$

$$S = \underline{\underline{2.5}}$$

$$\eta \text{ (efficiency)} = \frac{S}{K} = \frac{2.5}{4} = \underline{\underline{62.5\%}}$$

Q: Consider 4-stage pipeline with a respective delays of (20ns, 10ns, 30ns, 40ns). Interface register used b/w stages having delay of 2ns. What is performance gain of pipeline when a large no. of task are executed in pipeline).

$$\eta \rightarrow \infty, t_p = 40 \text{ ns}, t_{IR} = 2 \text{ ns}, t_n = 10 \text{ ns}$$



$$t_p = \max(\text{stage delay} + \text{Buffer delay})$$

$$= \max(22, 12, 32, 40)$$

$\downarrow$   
no. Interface register after last stage

$$(t_m = 20 + 10 + 30 + 40)$$

Interface register  
are not present in  
non-pipeline.

$$S = \frac{t_m}{t_p} = \frac{100}{40} = \underline{2.5}$$

Q. Consider 3.2 ns clock cycle processor which consumes 6-cycles per data-transfer inst<sup>n</sup>. 8-cycles per ALU inst<sup>n</sup> and 4-cycles for branch inst<sup>n</sup>. Relative frequencies of these inst<sup>n</sup> (50%, 25%, 25%). System is enhanced with a pipeline. In this implementation 0.6 ns overhead is present. What is the performance gain.

$$\begin{aligned} ET_{\text{non-pipe}} &= \sum (IC_i * CPI_i) * CT \\ &= [0.5(6) + 0.25(8) + 0.25(4)] * 3.2 \text{ ns} \\ &= \underline{19.2 \text{ ns}} \end{aligned}$$

$$\begin{aligned} ET_{\text{pipe}} &= \sum (IC_i * CPI_i) * CT \\ &= (0.5 + 0.25 + 0.25)(3.2 + 0.6) \\ &= \underline{3.8 \text{ ns}} \end{aligned}$$

$$\text{Performance gain} = \frac{ET_{\text{non-pipe}}}{ET_{\text{pipe}}} = \frac{19.2}{3.8}$$

$$[S = 5.05]$$

- Q. Consider 4-stage pipeline where different inst<sup>n</sup> are spending different cycles, at different stages.

|       | $S_1$ | $S_2$ | $S_3$ | $S_4$ |
|-------|-------|-------|-------|-------|
| $I_1$ | 1     | 3     | 2     | 1     |
| $I_2$ | 1     | 1     | 3     | 2     |
| $I_3$ | 1     | 3     | 2     | 1     |
| $I_4$ | 1     | 1     | 2     | 1     |

- (a) How many cycles required to complete the code execution?
- (b) Speed up factor?
- (c) Following loop is executed in the pipeline -

for ( i=1 ; i<=1000 ; i++ )

{

I<sub>1</sub>;

I<sub>2</sub>;

I<sub>3</sub>;

I<sub>4</sub>;

}

The o/p of I<sub>2</sub> for (i=2) will be available after — cycles.

Since this pipeline doesn't have uniform or non-uniform delay cases so we have to proceed using reservation table -

|       | $i = 1$        |                |                |                |                |                |                |                |                |                |                |                |                |                |
|-------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|
| $S_4$ |                | 1              |                |                |                | I <sub>1</sub> |                | I <sub>2</sub> | I <sub>2</sub> | I <sub>3</sub> |                | I <sub>4</sub> |                |                |
| $S_3$ |                |                |                | I <sub>1</sub> | I <sub>1</sub> | I <sub>2</sub> | I <sub>2</sub> | I <sub>2</sub> | I <sub>3</sub> | I <sub>3</sub> | I <sub>4</sub> | I <sub>4</sub> | I <sub>1</sub> |                |
| $S_2$ |                | I <sub>1</sub> | I <sub>1</sub> | I <sub>1</sub> | I <sub>2</sub> | -              | I <sub>3</sub> | I <sub>3</sub> | I <sub>3</sub> | I <sub>4</sub> | -              | I <sub>1</sub> | I <sub>1</sub> | I <sub>1</sub> |
| $S_1$ | I <sub>1</sub> | I <sub>2</sub> | -              | -              | I <sub>3</sub> | -              | I <sub>4</sub> | -              | -              | I <sub>1</sub> | -              |                |                |                |
|       | 1              | 2              | 3              | 4              | 5              | 6              | 7              | 8              | 9              | 10             | 11             | 12             | 13             | 14             |
|       |                |                |                |                |                |                |                |                |                |                |                |                |                | 15             |

Loop-level parallelism

(a) 14-cycles

$$(b) \quad S = \frac{ET_{\text{non-pipe}}}{ET_{\text{pipe}}} = \frac{(I_1 + I_2 + I_3 + I_4)}{14} = \frac{7+7+7+5}{14}$$

$(S = 1.85)$

(c) 2.1 - cycle

## RISC pipeline :-

- To analyse the implementation issues of a pipeline, let us consider RISC pipeline as a refrence modal.
- RISC CPU supports  $inst^n$  pipeline, used to execute the  $inst^n$ .
- RISC CPU  $inst^n$  set is -
  - (i) Data transfer inst $n$  : In the RISC CPU, LOAD & STORE inst $n$  are used to transfer the data b/w the memory & CPU registers. These  $inst^n$  are designed with index addressing mode to access the memory.

|      |       |          |
|------|-------|----------|
| LOAD | $r_0$ | $3(r_1)$ |
|------|-------|----------|

$$r_0 \leftarrow M[3 + [r_1]]$$

|       |          |       |
|-------|----------|-------|
| STORE | $3(r_1)$ | $r_0$ |
|-------|----------|-------|

$$M[3 + [r_1]] \leftarrow r_0$$

(ii) Data manipulation inst<sup>n</sup>: In the RISC CPU, ALU op<sup>n</sup> are performed only on a register data so it only follow register addressing modes.

|     |       |       |       |
|-----|-------|-------|-------|
| ADD | $r_0$ | $r_1$ | $r_2$ |
|-----|-------|-------|-------|

$$r_0 \leftarrow r_1 + r_2$$

(iii) Transfer of control inst<sup>n</sup>: They contain two types unconditional & conditional Jump inst<sup>n</sup>.

uncond TOC

|     |      |
|-----|------|
| JMP | 2000 |
|-----|------|

PC : ~~seq~~ 2000

cond TOC

|      |       |      |
|------|-------|------|
| DJNZ | $r_0$ | 2000 |
|------|-------|------|

DEC  $r_0$

NZ CMP  $r_0$

True

PC : ~~seq~~

2000

False

PC : ~~seq~~

- To Execute above inst<sup>n</sup>, 5-stages pipeline is used.

- Different stages present in the pipeline are as follows :

Stage - 1 (Inst<sup>n</sup> fetch) :- In this phase CPU phase the inst<sup>n</sup> from the memory.

Stage - 2 (Inst<sup>n</sup> decode) :- In this stage 2 op<sup>n</sup> are performed.

(i) Decode the inst<sup>n</sup>

(ii) Access the reg file to fetch the data.

This stage also contain comparative (CMP) circuit to evaluate the branch condition.

Stage - 3 ( Inst<sup>n</sup> execute) :- In this stage ALU op<sup>n</sup> are performed.

Stage - 4 ( Memory access) :- In this stage CPU refers the memory to access the data.

Stage - 5 ( Write Back) :- In this stage register WRITE op<sup>n</sup> is performed.

- Above stage inst<sup>n</sup> are assigned to an independent H/W to form the 5-stage pipeline.



$$\begin{aligned}
 t_p &= \max (S_1, S_2, S_3, S_4, S_5) \\
 &= \max (4, 2, 1, 4, 1) \\
 &= \underline{\text{4 cycles}}
 \end{aligned}$$



- Data manipulation inst<sup>n</sup> implementation :-

### (i) LOAD inst<sup>n</sup>

LOAD  $r_0, 3(r_1)$   
 $PC \leftarrow PC + S$

$\overline{MEMRD}$   
 $S : \text{mem}(3, r_1)$   
 $D : \text{reg}(r_0)$   
 $r_1$  value

IF

ID

~~PEA~~ EX

MA

WB



## (ii) STORE inst<sup>n</sup>



## - Data manipulation inst<sup>n</sup> implementation :-



## - TOC inst<sup>n</sup> implementation :-





### Dependencies in Pipeline :-

- Dependency is a major problem in the pipeline, causes extra-cycles.
- a cycle in the pipeline without new I/p initiation is called as extra-cycle, Also named as stall or hazard.
- When stall is present in the pipeline then  $CPI \neq 1$ .
- 3 types of dependencies are present in the pipeline names as:
  - structural dependency
  - Data - dependency
  - Control - dependency

(i) **Structural dependency** :- It occurred in the pipeline due to resource conflict. Resource may be a memory or register or any other functional unit (ALU).



In the above exec.  $I_1$  &  $I_4$  both inst<sup>n</sup> are accessing the same resource at same cycle time (i.e  $CC_4$ ). This situation in the pipeline is ~~used~~ known as resource conflict.

- Conflict is an unsuccessful op<sup>n</sup> so we need to keep the inst<sup>n</sup> Iy into waiting state until resource become available. This waiting creates stalls in the pipeline.



(7-cycles - 4 Ifp's = 3 stalls)

- To minimize the structural hazards, *HW* mechanism is used i.e renaming.
- In cycle-diagram I ~~ref~~ I<sub>4</sub> refers the memory in 4th stage to access the data, whereas I<sub>4</sub> refers the memory to access the inst<sup>n</sup> simultaneously.

- When the inst<sup>n</sup> & data both are present in the same memory then the above ~~sit~~ situation creates conflicts.
- Renaming mechanism states that organize memory into two independent modules to store the data and inst<sup>n</sup> separately known as ~~as~~ data memory (DM) and code memory (CM).
- Now I<sub>1</sub> refer DM and I<sub>4</sub> refer CM at CC<sub>4</sub>, and no conflict happens.

|                | CC <sub>1</sub> | CC <sub>2</sub> | CC <sub>3</sub> | CC <sub>4</sub> | CC <sub>5</sub> | CC <sub>6</sub> | CC <sub>7</sub> |
|----------------|-----------------|-----------------|-----------------|-----------------|-----------------|-----------------|-----------------|
| I <sub>1</sub> | CM              | ID              | ALU             | DM              | WB              |                 |                 |
| I <sub>2</sub> |                 | CM              | ID              | ALU             | DM              | WB              |                 |
| I <sub>3</sub> |                 |                 | CM              | ID              | ALU             | DM              | WB              |
| I <sub>4</sub> |                 |                 |                 | CM              | ID              | ALU             | DM              |
| I <sub>5</sub> |                 |                 |                 |                 | CM              | ID              | ALU             |
| I <sub>6</sub> |                 |                 |                 |                 |                 | CM              | ID              |
| I <sub>7</sub> |                 |                 |                 |                 |                 |                 | CM              |

(CYCLE - DIAGRAM - 3)

(7-cycles - 7 IP's = 0 stalls)

## (ii) Data dependencies :-

- Consider the program segment where  $\text{Inst}^n 'J'$  follows  $\text{Inst}^n 'I'$  in a program order.

Prog:-

|    |                 |
|----|-----------------|
| I: | $\text{Inst}^n$ |
| J: | $\text{Inst}^n$ |
| :  | :               |

Data-dependency condition will be occurred when  $\text{Inst}^n 'J'$  tries to read the data before  $\text{Inst}^n 'I'$  write it.

(READ — before — WRITE)

Eg:- ADD  $r_0, r_1, r_2 ; r_0 \leftarrow r_1 + r_2$  ( $r_0$  is Dest)  
MUL  $r_3, r_0, r_2 ; r_3 \leftarrow r_0 * r_2$  ( $r_0$  is source)

NOTE: When the above inst<sup>n</sup> are executed in the non-pipeline system then data dependency condition doesn't occurred because  $I_2$  inst<sup>n</sup> executes after comp execution  $I_1$ . i.e (READ - After - WRITE)

- When the above inst<sup>n</sup> are executed in the pipeline system then data - dependency condition will be occurred i.e. I<sub>2</sub> starts executing along with the I<sub>1</sub>. So I<sub>2</sub> reads the no data register before I<sub>1</sub> writes it. Therefore I<sub>2</sub> incorrectly access the old value from the register r<sub>0</sub> ( data loss).

