

## Introduction :-



ISA - CISC

RISC

## → Classes of Computers :-

1. Personal mobile
2. Laptops Desktop
3. Servers
4. Cluster of servers

## → Flynn's Classification - 4 types of computers

SISD :- Single Instruction Single Datastream

SIMD :- Single Instruction Multiple Datastream (Data level parallelism)

MISD :- Multiple Instruction Single Datastream

MIMD :- Multiple Instruction Multiple Datastreams

No Such  
Machine

1) SISD -



{ I\_i = Instruction }

$$\begin{cases} \text{stored} \\ \text{if } x = a + b \end{cases}$$

One instruction is performed at a time

2)

for (i=0 ; i&lt;10 ; i++)

$$z[i] = x[i] + y[i]$$

eg Vector processor or GPU or SIMD

$$z[0] = x[0] + y[0]$$

:

$$z[10] = x[10] + y[10]$$

{ Executed in single

attempt by SIMD

but SISD will execute this in 10 times

3)



Increasing performance decreases execution time  
 Improve execution time] — increasing performance & decrease execution,  
 or Improve performance time

## → Measuring, Reporting & Summarizing

*(one task, 1 report, 1 result)*  
 Assume program P is executed on Machine A & B.

$$\therefore \frac{\text{Ex. Time}_A}{\text{Ex. Time}_B} = n = \frac{\frac{1}{\text{Performance}_A}}{\frac{1}{\text{Performance}_B}}$$

$$\therefore \text{Performance}_B = n \cdot \text{Performance}_A$$

Type of program

- Benchmark program - P → used to test machines
- Toy programs - can't be used to test
- Kernels -
- Artificial - Biased programs

Set of benchmark programs = Benchmark Suite.

(SPEC CPU 2006)

exclusively test performance of CPU

- 12 Integers
- 17 floating.

Test  
 SPEC WEB :- Web Server

TPC-A | TPC-B | TPC-D (Transaction Processing council)

| Eg → | P <sub>1</sub>                                  | P <sub>2</sub>         | P <sub>3</sub>           | Sum of execution time<br>(S.E.T.) | (sum of weighted<br>exec. time)<br>SWET |
|------|-------------------------------------------------|------------------------|--------------------------|-----------------------------------|-----------------------------------------|
| A    | $10 \times \frac{1}{10} + 8 \times \frac{1}{8}$ |                        | $25 \times \frac{1}{25}$ | = 43                              | 3                                       |
| B    | 12                                              | 9                      | 20                       | = 41                              | 3.125                                   |
| C    | $8 \times \frac{1}{10}$                         | $8 \times \frac{1}{8}$ | $30 \times \frac{1}{25}$ | = 46                              | 3                                       |

Reference machine :- We compare B with A & C with A & then compare B with C.

According to SET,  $(Perf)_B > (Perf)_A > (Perf)_C$

But, with SWET

$$\frac{\text{Exec Time A (reference)}}{\text{Exec Time B}} = \frac{\text{Exec Time C}}{\text{Exec Time B}} = \frac{3}{3.125} = 1.25$$

*Used in SPEC benchmarks*

Geometric mean Comparison

$$A \quad (40 \times 8 \times 25)^{\frac{1}{3}} = 12.6$$

$$B \quad (12 \times 9 \times 20)^{\frac{1}{3}} = 12.93$$

$$C \quad (8 \times 8 \times 30)^{\frac{1}{3}} = 12.43$$

Prog 3 is taking much time  
may be as compare to 2 & 1 due to  
memory or CPU time  
∴ we multiply it by a no.

|        |                 |                 |                    |
|--------|-----------------|-----------------|--------------------|
|        | A               | B               | C                  |
| Prog 3 | $25 \times 0.6$ | $20 \times 0.6$ | $18.30 \times 0.6$ |

The multiplication of 0.6 with prog 3 will have no effect on geometric mean, either it is multiplied by prog 1 or 2

∴ Remove the const. value from geometric mean

According to geometric mean,  $(Perf)_C > (Perf)_B > (Perf)_A$

S.E.T                    S.W.E.T                    G.M

✗                        ✓                            ✓

Require reference m/c                    No reference m/c

There is no optimal method to measure performance of machines.

| Eg-2           | A                                            | B                                  | C                                | $(M/c \text{ c})_{\text{Perf}} > (M/c)_{\text{A}}$ |
|----------------|----------------------------------------------|------------------------------------|----------------------------------|----------------------------------------------------|
| P <sub>1</sub> | 1                                            | 10                                 | 20                               |                                                    |
| P <sub>2</sub> | $\frac{1000}{(1 \times 1000)^{\frac{1}{2}}}$ | $\frac{100}{(1000)^{\frac{1}{2}}}$ | $\frac{20}{(400)^{\frac{1}{2}}}$ |                                                    |

Now we want to execute 100 times P<sub>1</sub> program for 1 P<sub>2</sub> program

| A          | B    | C    |
|------------|------|------|
| total time | 1100 | 1100 |

↑ Failure of GM approach

### → Quantitative Principles of Computer design :-

Principle 1 - Take advantages of parallelism

- Instructions
- Instruction level Parallelism
- Hardware

2. Locality of Reference (program reads to update data & with which they leave  
 newly accessed memory) - Temporal LOR (will recently)  
 likely to be accessed in near future - Spatial LOR → same address are near each other &  
 tends to be referenced close together in time

3. Focus on common Case



M/c A      10 multiplier + 1 adder      total time 10 + 9 = 9

M/c B      1 multiplier + 10 adder      10 + 9 = 19

$$\frac{(Ex)_A}{(Ex)_B} = \frac{9}{19} = 4.$$

Speedup = Perf. of entire task using the enhancement

Perf. of entire task without using enhancement

classmate

Date \_\_\_\_\_

Page \_\_\_\_\_

→ Amdahl's law :- performance improvement to be gained from using some faster mode of execution is limited by the fraction of the time.

Case 1 : 100 units 

Case 2 : 55 units time 

$$\frac{\text{Ex Time old}}{\text{Ex Time new}} = \frac{100}{55} = \frac{(\text{Perf})_{\text{new}}}{(\text{Perf})_{\text{old}}} = 1.818$$

$$50 + 5 = 55$$

Performance enhancement = 1.818



(SE) Speedup Enhanced :- Using boosters to speedup (How we are enhancing)

$$(\text{Ex Time})_{\text{new}} = (\text{Ex Time})_{\text{old}} \times \left[ (1 - F_E) + \frac{F_E}{S_E} \right]$$

$$\text{eg } (\text{Ex Time})_{\text{new}} = 100 \left[ (1 - 0.5) + \frac{0.5}{10} \right] = 55$$

$$\rightarrow \text{Speed up } \left( \frac{\text{Ex Time}_{\text{old}}}{\text{Ex Time}_{\text{new}}} \right) = \frac{1}{(1 - F_E) + \frac{F_E}{S_E}}$$

Speedup Enhanced =  $\frac{\text{Time of original code}}{\text{Time of enhanced mode}}$

$$\frac{1}{1 - 0.5 + \frac{0.5}{10}} = 55$$

Fraction of the computation time in the original machine that can be enhanced to take advantage of enhancement.

Fraction enhanced  $\leq 1$

Case 1 :-

$$\text{Ex } \frac{1}{1} \quad \Rightarrow 90\% \text{ Addition \& } 10\% \text{ Multiplication}$$

$$10 \text{ multipliers \& 1 adder} = 91 \text{ units of time}$$

$$\text{Case 2 :- 1 multiplier \& 10 adder} = 49 \text{ units of time}$$

(Using principle of focus on common case)

Speedup (multiplication)

In enhancement of speedup, the time spent has to be removed  
Time (enhanced) = Time (original) - time spent to take back time ( $-F_E$ )

Suppose that we want to enhance the processor used for multiplication  
The new processor is 10 times faster in computation as multiplication  
than the original processor. Assuming that the original processor is busy for 10% of the time. & is waiting for 20% of the time.  
What is the overall speedup gained by incorporating enhancement.



$$\text{Speedup} = \frac{1}{(1 - F_E) + \frac{F_E}{S_e}} = \frac{1}{1 - 0.4 + \frac{0.4}{10}} = \frac{100}{64} = 1.562$$

Suppose you want to achieve a speedup of 90 with 200 processors  
what fraction of the original computation can be sequential.

$$\text{Speedup} = \frac{1}{1 - F_E + \frac{F_E}{S_e}}$$

$$90 = \frac{1}{1 - F_E + \frac{F_E}{200}}$$

$$1 - F_E + \frac{F_E}{200} = \frac{1}{90} \Rightarrow F_E = 0.9938$$

$$F_E = 99.38\%$$

$$\text{Speedup (addition)} = \frac{1}{1 - 0.9 + \frac{0.9}{10}}$$

$$\text{Speed up (multiplication)} = \frac{1}{1 - 0.9 + \frac{0.1}{10}}$$

→ Limitation of Speed up :-

To enhance a system, the entire system has to be enhanced.

~~Max speedup~~ Ex. Time cannot be less than  $(1 - F_E)$

Ques Suppose that we want to enhance the processor used for web operation. The new processor is 10 times faster on computation in web operations than the original processor. Assuming that the original processor is busy with computation  $\frac{40\%}{60\%}$  of the time & is waiting for I/O  $\frac{60\%}{40\%}$  of time what is the overall speedup gained by incorporating enhancement.



$$\begin{aligned} \text{Sol'n} \quad \text{Speedup} &= \frac{1}{(1 - 0.4) + \frac{0.4}{10}} = \frac{1}{1 - 0.4 + 0.04} = \frac{1}{1.56} = 1.562 \\ F_E &= 0.4 \\ S_E &= 10 \end{aligned}$$

Ques Suppose you want to achieve a speedup of 90 with 200 processors what fraction of the original computation can be sequential.

$$\begin{aligned} \text{Sol'n} \quad \text{Speedup} &= \frac{1}{1 - F_E + \frac{F_E}{S_E}} \end{aligned}$$

$$90 = \frac{1}{1 - F_E + \frac{F_E}{200}}$$

$$1 - F_E + \frac{F_E}{200} = \frac{1}{90} \Rightarrow F_E = 0.9938$$

$$F_E = 99.38\%$$

Ques

100 instructions

1. Additions - 25%  $\rightarrow$  10 boosters2. Multiplication - 30%  $\rightarrow$  203. Division - 15%  $\rightarrow$  5

4. Others

$$S_A = \frac{1}{(1-0.25) + \frac{0.25}{10}} = \frac{1}{0.75 + 0.025} = \frac{1}{0.775} = 1.290$$

$$S_B = \frac{1}{(1-0.3) + \frac{0.3}{20}} = 1.398$$

$$S_C = \frac{1}{(1-0.15) + \frac{0.15}{5}} = \frac{1}{0.85 + 0.03} = 1.236$$

total speedup = 2.702

M-2

$$\frac{1}{0.3 + \frac{0.25}{10} + \frac{0.30}{20} + \frac{0.15}{5}} = 2.702 = \text{speedup.}$$

= 100  
64

= 1.562

Ques If 85% of operations in a parallel prog. must be performed sequentially.  
What is the max<sup>n</sup> speedup achievable.

$$F_E = 0.75$$

Sol<sup>y</sup>

$$\text{Speedup} = \frac{1}{1-0.75 + \frac{0.75}{S_E}} = 4$$

Sequentially - which is not enhanced.

→ Speedup is unitless

(Speedup v/s percentage)

|           | Old M/c                                          | New M/c                    |
|-----------|--------------------------------------------------|----------------------------|
| prog A    | 100                                              | 70                         |
| Speedup = | $\frac{\text{Ex Time Old}}{\text{Ex. Time New}}$ | $= \frac{100}{70} = 1.428$ |

$$\frac{\text{perf}_{\text{new}} - \text{perf}_{\text{old}}}{\text{Perf}_{\text{old}}} = \text{percentage change}$$

$$= \frac{\frac{1}{\text{Ex. Time}_{\text{new}}} - \frac{1}{\text{Ex. Time}_{\text{old}}}}{\frac{1}{\text{Ex. Time}_{\text{old}}}} = \frac{\frac{1}{70} - \frac{1}{100}}{\frac{1}{100}} = \frac{30}{0.428} = 70$$

$$\% \text{age} = 42.8\%$$

### → Processor Performance Equation

$$T \propto \frac{1}{F} \rightarrow \text{clock cycle time} = \frac{1}{\text{Clock cycle speed}}$$

CPU time :- time require to execute a program

$$= \text{clock cycles for a program} \times \text{clock cycle time}$$

$\downarrow$   
Ic x CPI

$$\text{eg CPU time} = 100 \text{ cycle} \times 1 \text{ ns} = 100 \text{ ns}$$

prog A  
 $\downarrow$

$n$  instructions  $\rightarrow$  Ic (instruction Count)  
(Instruction path length)

$$\text{CPI} = \text{Clock cycles per instruction} = \frac{\text{Clock cycles for a program}}{\text{Ic}}$$

∴ [CPU time = Ic x CPI x clock cycle time]

$$\text{Unit} = \text{CPU time}$$

classmate  
Date \_\_\_\_\_  
Page \_\_\_\_\_

Clock cycle time - Hardware technology & organisation  
 CPS - Organisation & Instruction Set Architecture  
 SC - Compiler technology & SSA -

classmate  
Date \_\_\_\_\_  
Page \_\_\_\_\_

|                    | M/C A                       | M/C B                 |
|--------------------|-----------------------------|-----------------------|
| Cycle time         | 500                         | 104                   |
| Cycles/instruction | 100                         | 100                   |
| S.                 | $CPS = \frac{500}{100} = 5$ | $\frac{104}{100} = 1$ |

Ques 0-428

SC : Depend on the user, less no. of instructions leads to good program (CPU time f)  
 CPS : Same Hardware but difference in architecture



$$\text{Max}^{\text{u}} \text{ speedup achievable} = \frac{1}{(1-0.33) + 0.33} = \frac{1}{0.66} = \frac{100}{66}$$

If all part is enhanced then

$$\text{Max}^{\text{u}} \text{ speedup achievable} = \frac{1}{\frac{33}{66} + \frac{33}{66} + \frac{33}{66}} = \infty$$

To improve the CPU time, you have to enhance all the factors -

→ det for any Program ①



$$\text{CPU time} = \sum_{i=1}^n I_{C_i} \times CPI_i \times \text{Clock cycle time}$$

$$\text{Overall CPI} = \frac{\sum_{i=1}^n I_{C_i} \times CPI_i}{I_C} = \sum_{i=1}^n \left[ \frac{I_{C_i}}{I_C} \right] \times CPI_i$$

$\left[ \frac{IC_i}{IC} \right]$  gives %age of i instruction

→ If we are comparing 2 machines by their CPU time, then  
 speedup =  $\frac{\text{CPU time old}}{\text{CPU time New}}$

Ques Suppose you have 2 implementations of the same IS architecture.  
 M/c A has a clock cycle time of 50ns & CPI of 4.0 for some program X. M/c B → CPI = 2.5, clock cycle time = 65ns for the same program. Which machine is faster & by how much.

Soln

$$\text{CPU time}_A = IC \times 50 \times 4 = 200 \text{ ns}$$

$$\text{CPU time}_B = IC \times 65 \times 2.5 = 162.5 \text{ ns}$$

M/c B is faster than M/c A

$$\frac{\text{CPU time}_A}{\text{CPU time}_B} = \frac{200}{162.5} = 1.230$$

Hence M/c A is 1.23 times slower than M/c B

Ques

Consider a M/c A for which following performance measures were recorded when executing a set of benchmark programs. Assuming the execution of benchmark programs what is the overall CPI of m/c A.

| Instruction type | % of occurrence | CPI |
|------------------|-----------------|-----|
| ALU              | 35              | 1.0 |
| L&S              | 20              | 3.0 |
| Branch           | 40              | 4.0 |
| Other            | $\frac{5}{100}$ | 5.0 |

$$\begin{aligned}CPI &= \frac{35}{100} \times 1 + \frac{20}{100} \times 3 + \frac{40}{100} \times 4 + \frac{5}{100} \times 5 \\&= \frac{35+60+160+25}{100} = \frac{280}{100} = 2.8\end{aligned}$$

→ CPU Time :-  $T_{ex} \times CPS \times CPI$

$IPC = \text{Instruction per clock}$

$$\text{Performance} = \text{Clock Speed} \times \frac{IPC}{IC}$$

$$CCT \propto \frac{1}{\text{Clock Speed (GHz)}}$$

Ques My New laptop has an IPC i.e. 20% worse than my old Laptop. It has a clock speed i.e. 30% higher than the old one. I am running the same binaries on both machine. What is the speedup of new laptop

$$\begin{aligned}\underline{\text{Soln}} \quad \text{Speedup} &= \frac{T_{ex} \text{Time Old}}{T_{ex} \text{Time New}} = \frac{(Perf)_{New}}{(Perf)_{Old}} \\&= \frac{1.3 \times \text{Clock Speed}_{Old} \times 0.8 \times IPC_{Old}}{\text{Clock Speed}_{Old} \times IPC_{Old}} / IC = 1.04\end{aligned}$$

→

IPC

CPS

Equal No. of clock cycles

i.e. all program finishes in equal no. of clock cycles.

When all work. Equal no. of instruction is considered.

$$\text{Arithmetic Mean of IPCs} = \frac{1}{\text{HM of CPIs}}$$

$$\text{AM of CPIs} = \frac{1}{\text{HM of IPCs}}$$

$$\text{GM of IPCs} = \sqrt{\text{GM of CPIs}}$$

→ My New laptop has a clock speed = 30% higher than old one. & running the same binaries on both PC. their IPCs are listed below.

|         | P <sub>1</sub> | P <sub>2</sub> | P <sub>3</sub> |                             |
|---------|----------------|----------------|----------------|-----------------------------|
| Old IPC | 1.2            | 1.6            | 2.0            | find speedup of new laptop. |
| New IPC | 1.6            | 1.6            | 1.6            |                             |

Soln  $(\text{Perf})_{\text{New}} = \frac{\text{clock speed}_{\text{old}} \times \text{IPC}_{\text{new/PC}}}{\text{clock speed}_{\text{old}} \times \text{IPC}_{\text{old/PC}}}$

Using AM's -

$$\text{AM of IPC}_{\text{new}} = 1.6$$

$$\text{AM of IPC}_{\text{old}} = 1.6$$

$$\therefore \text{Speedup} = 1.3$$

Using GM.

$$\text{GM of IPC}_{\text{new}} = 1.6$$

$$\text{IPC}_{\text{old}} = 1.57$$

$$\text{Speedup} = 1.32$$

Ques

Soln

## # Power vs Energy

$$\text{power} = \text{Dynamic Power (D)} + \text{Leakage Power (L)}$$

$$\text{Dynamic Power} \propto \text{Activity} \times \text{capacitance} \times V^2 \times f$$

↓                      ↓              ↓  
 Activity      Voltage      freq

$$DP \propto V^2 f$$

Activity = No. of transistors switches

When machine is in ideal state, Activity = 0  $\Rightarrow DP = 0$

$$\text{Leakage Power} \propto f(C_L, f, V) = I_{Lc} \times V$$

↓              ↓  
 leakage current      voltage

When system is doing nothing but consuming power = leakage power

$$\text{Energy} = \text{power} \times \text{time}$$

Ques Processor A consumes 1.2 times the power of processor B. But finishes the task in 30% less time. Which processor is better.

Soln

|       | A    | B |
|-------|------|---|
| Power | 1.2W | W |
| time  | 0.7t | t |

$$\text{Energy}_A = (1.2 \times 0.7) Wt = 0.84 Wt = 0.84 \text{ Energy}_B$$

Processor A less energy than processor B

$$1440 = 100$$

at expectation  
so we assume the freq.

Power  $\rightarrow$   $P = V^2 f$

$$\text{DP} \propto V^2 f$$
$$P \propto \frac{V^2}{R} \times C \times VTC$$
$$(100\%)$$

No of transistors

DFS (Dynamic Frequency Scale)  
(- frequency on change)



$$P_{new} = \Delta f L = 62.5$$

as freq = half, time  $\rightarrow$  double

i. New Energy  $= 62.5 \times 200 = 12500$

Old Energy  $\rightarrow 100 \times 100 = 10000$

$$E_{new} = 1.25 E_{old}$$

Here power is reduced but time increases so Overall Energy  $\uparrow$

DVFS :- Dynamic Voltage Frequency Scale



$$\text{New power} = 77.5$$

$$\text{New time} \rightarrow 110 \text{ ns}$$

$$\begin{aligned}\text{New Energy} &= 77.5 \times 110 = 8525 \\ \text{Old} &= 10000\end{aligned}$$

$$\text{Energy}_{\text{New}} = 0.85 \text{ Energy}_{\text{old}}$$

Ques For a processor running at 100% utilization at 100 watts. 20% of power is attributed to leakage. What is the total power dissipation when the processor is running at 50% utilization?

Soln  $\Delta P \propto \text{Activity} \times \text{capacitance} \times V^2 \times f$

$$A \rightarrow 100\% \text{ utilization}$$

100 watts

$$B \rightarrow 50\% \text{ utilization}$$

20% leakage

(Activity-utilization)

$$P = D + L$$

$$\rightarrow 80 + 20$$

$\hookrightarrow$  80 watts for 100% utilization

$$\text{for } B, P = D + L$$

$$= 80 \times \frac{50}{100} + 20$$

$$= 60 \text{ watts}$$

Ques If processor A consumes 1.4 times the power of processor B but finishes the task in 20% less time. Which processor would you pick

- If you were constraint by power requirements — processor B
- If you were trying to minimize energy — processor B
- If you were trying to minimize response time — processor A

A

B

Soln

$$\text{Power} \quad 1.4W$$

w

$$\text{Time} \quad 0.8t$$

t

$$\begin{aligned}\text{Energy} &= 1.4 \times 0.8 Wt \\ &= 1.12 Wt\end{aligned}$$

Wt

Ques Processor A at 1 GHz consumes 80 watts of dynamic power & 20 watts of leakage power. It completes a program in 20 sec.

(2100) a) what is the energy consumption if I scale freq. down by 20%

(1425) b) what is the energy " " " freq. & voltage down by 20%

Soln

$$\text{Initial freq} = 1 \text{ GHz}$$

$$\Delta \text{Power} = 80 \text{ Watts}$$

$$I \text{ Power} = 20 \text{ Watts}$$

a) freq  $\downarrow$  by 20%

$$\text{freq} = 0.8 \text{ GHz}$$

$$I \text{ Power} = 20 \text{ Watts}$$

$$\Delta \text{Power} = 0.8 \times 80$$

$$\text{Power} = \Delta + L$$

$$= 64 + 20 = 84$$

$$\text{Energy} = P \times t$$

$$= 84 \times \frac{20}{0.8} = 2100$$

b) freq & voltage  $\downarrow$  by 20%

$$\Delta \text{Power} =$$

## → Instruction set Architecture

CISC :- Complex Instruction Set Computer

Machine Instruction



RISC :- Reduced Instruction Set Computer

- Smaller, Faster

## CISC

- i) Small No. of instructions but wider



Only 40  
instructions  
required.

## RISC



100 instructions  
to execute.

- Instruction size - small

- 2) Length of instruction - variable

- Constant length

- 3) There is no separate load, store  
instructions.

- Easily fit in pipeline structure

1. How data is stored in a processor

↓  
Internal storage (Registers)

Based on internal storage, 4 type of ISA

i) Stack

ii) An Accumulator - ALU

iii) Register - memory architecture

iv) Register - Register Architecture

Ex

$$C = A + B \quad (A, B, C \text{ are stored in memory})$$

| Stack  | An accumulator | R-M                                     | R-R                                                  |
|--------|----------------|-----------------------------------------|------------------------------------------------------|
| Push A | Load A         | Load R <sub>1</sub> , A                 | Load R <sub>1</sub> , A                              |
| Push B | Add B          | Add R <sub>3</sub> , R <sub>1</sub> , B | Load R <sub>2</sub> , B                              |
| ADD    | Store C        | Store R <sub>3</sub> , C                | Add R <sub>3</sub> , R <sub>1</sub> , R <sub>2</sub> |
| Pop C  |                |                                         | Store R <sub>3</sub> , C                             |



## Memory Addressing

1 byte = 8 bit

Half word = 16 bits

word = 32 bits

double word = 64 bits



Convert 32 bit word in  
Hexadecimal & use 8 bit  
to store in every location



## Addressing Modes

Eg  $\text{Add } R_4, \#3 \rightarrow \text{Immediate}$

$$\text{Reg}[R_4] \leftarrow \text{Reg}[R_4] + 3$$

Add  $R_4, R_3 \rightarrow \text{Register}$

$$R_4 \leftarrow R_4 + R_3$$

Add  $R_4, (R_1)$  — Register Indirect  
 Add  $R_4, (1000)_{(R_2)}$  — Direct  
PC-Relative addressing mode :-  
 Add  $R_4, 100(R_1)$   
 displacement  
 $(R_4 \leftarrow R_4 + \text{New}[R_1, 100])$

### # Types of Instruction :-

1. Arithmetic & logical Instruction  
- Add, OR, AND, SUB
2. Data Transfer  
Load, store
3. System
4. Floating point operations
5. Decimal Operation
6. Strings
7. Graphics
8. Branch

- |                             |       |
|-----------------------------|-------|
| 1. Load - 22%               | } 96% |
| 2. Conditional Branch - 20% |       |
| 3. Compare - 16%            |       |
| 4. Add                      |       |
| 5. AND                      |       |
| 6. SUB                      |       |
| 7. MOVE                     |       |
| 8. CALL                     |       |

$I_1 \rightarrow PC \rightarrow CPU$

1 word = 32 bits = 4 bytes

$I_2 \quad PC = PC + 4$

$I_3 \quad PC = PC + 4$

⋮

⋮

⋮

⋮

$I_n$

if (cond.) then

Statement 1

else

Statement 2

$I_1 \rightarrow PC$

$I_2$

$PC = PC + 4$

X

can't execute

any statement  
until CPU

specify which inst.  
will execute

$DADD = 24 \text{ bit operation}$   
 $ADD = 32 \text{ bit operation}$

Pipelining :- Implementation technique in which there exist overlapping of instructions

### 1. Instruction Fetch (IF)

1. Instruction Fetch (IF)
2. Instruction Decode (ID)
3. Execution (EX)
4. Memory (Mem)
5. Write Back (WB)



eg) LOAD R<sub>4</sub>, B(R<sub>1</sub>)  
 = 5 clock cycle

$$\left. \begin{array}{l} R_4 \leftarrow \text{Mem}[B + \text{Reg}(R_1)] \\ \uparrow \text{offset} \end{array} \right\}$$



eg) STORE B(R<sub>1</sub>), R<sub>4</sub> = Only 4 clock cycle



(WB not required, Store in work at Mem)

eg BNZ R<sub>3</sub>, L

(Branch Not Equal)



- I<sub>1</sub> If (x == 0) then
- I<sub>2</sub>      x = x + 10

2 clock cycle to detect either it is branch or not

- I<sub>3</sub>      Else
- I<sub>4</sub>      x = x - 10

Let there are  $N$  instructions to be executed by pipeline



Instruction level parallelism - SLP

$$CPI_{UN} = \frac{SN}{N}, CPI_P = \frac{N+4}{N}$$

$$(N \approx \infty) \quad CPI_{UN} = 5$$

$$CPS_P = 1$$

[CPI can never be less than 1]

$$CPI_{ideal} = 1 \text{ (Whatever be instruction)}$$

The objective of SLP is to make  $CPI = 1$  i.e. for every clock, I have to execute one instruction

Irrespective of architecture, every instruction has to pass 5 clock cycle.



Latches | Buffers | Pipeline Registers → Used for temporal storage

Due to pipeline registers, instruction may take some extra clock cycle

$$CPI_s = 1 \text{ but due to overhead } CPI_s = 1 + ?$$

Ques An unpipelined processor with one clock cycle time = 1ns & it uses 4 clock cycles for ALU operations & 4 clock cycle for branch operations & 5 clock cycle for memory operations. The relative frequencies of this operation are 40%, 20% & 40% resp. Suppose due to clock ~~gate~~ skew & setup the pipeline processor adds 0.2ns overhead to the clock. What is the speedup of the pipeline processor.

Soln  $\text{for } ZC = 100 \text{ (assumed)}$

unpipelined

$$\text{CCT} = 1\text{ns}$$

$$\text{ALU} = 40\%$$

$$\text{Branch} = 20\%$$

$$\text{Memory} = 40\%$$

pipelined

$$\text{CCT} = 1.2\text{ns}$$

Not given for pipeline

E. we assume  $\text{CPI}_{\text{p}} = 1$

$$\frac{\text{CPU}_{\text{load}}}{\text{CPU}_{\text{new}}} = \frac{ZC \times \text{CPI}_{\text{UP}} \times \text{CCT}_{\text{UP}}}{ZC \times \text{CPI}_{\text{p}} \times \text{CCT}_{\text{p}}}$$

$$\begin{aligned} \frac{\text{Overall}}{\text{CPI}} / \frac{(\text{CPI})_{\text{UP}}}{\text{CPI}} &= \sum_{i=1}^n \frac{F_i}{ZC} \times \text{CPI}_i \\ &= \frac{40}{100} \times 4 + \frac{20}{100} \times 4 + \frac{40}{100} \times 5 \\ \Rightarrow \left( \frac{160 + 80 + 200}{100} \right) &= 4.4 \end{aligned}$$

$$\text{Speedup} = \frac{4.4 \times 1}{1.2 \times 1} = \frac{11}{3} = 3.67$$

# speedup from pipelining =  $\frac{ZC_{\text{UP}} \times \text{CPI}_{\text{UP}} \times \text{CCT}_{\text{UP}}}{ZC_{\text{p}} \times \text{CPI}_{\text{p}} \times \text{CCT}_{\text{p}}}$



Unpipelined



$$\therefore \text{Speedup} = \frac{\text{CPI}_{\text{UP}}}{\text{CPI}_{\text{p}}}$$

for N instructions,  $CPI_{UP} = \frac{SNL}{N} = 5$ ,  $CPI_p (\text{ideal}) = 1$

$$\therefore \text{speedup} = \frac{5}{1} = 5$$

If pipeline depth = 6, then speedup = 6

$$\boxed{\text{speedup} = \frac{\text{pipeline depth}}{1}}$$

If  $CPI_p \neq CPI_{\text{ideal}}$

→ Hazards of Overhead :-



Types of Hazard -

1. Structural Hazard
2. Data Hazard
3. Control / Branch Hazards

I, IF ID EX MEM WB

I<sub>2</sub>, IF LD EX MEM WB

Due to Some problems / Hazards

I<sub>1</sub>, IF ID EX MEM WB

I<sub>2</sub>, 0 0 0 0 → Bubbles (No o/p at ecc)

IF ID EX MEM WB

Note • There is only one register which can hold data / cache.

## 1. Structural Hazard

|                | 1cc | 2cc | 3cc | 4cc | 5cc | 6cc | 7cc | 8cc |
|----------------|-----|-----|-----|-----|-----|-----|-----|-----|
| I <sub>1</sub> | IF  | ID  | EX  | MEM | WB  |     |     |     |
| I <sub>2</sub> |     | IF  | ID  | EX  | MEM | WB  |     |     |
| I <sub>3</sub> |     |     | IF  | ID  | EX  | MEM | WB  |     |
| I <sub>4</sub> |     |     |     | IF  | ID  | EX  | MEM | WB  |

The register which fetches the inst., store the data to same register is used in IF & MEM.

So when 4 instruction, the same register can hold data as well as store data of I<sub>1</sub> inst. So I<sub>4</sub> instruction start at 7<sup>th</sup> cc.  
This is structural hazard.

Structural hazard arises due to use of same device for more than 1 purpose.

$$\text{speedup} = \frac{\text{Pipeline depth}}{1 + (\text{stalls/bubbles})}$$

Solution of Structural Hazards :- Increment in resources.

## 2. Data Hazards :-

I<sub>1</sub> ADD R<sub>4</sub>, R<sub>1</sub>, R<sub>2</sub>

I<sub>2</sub> ADD R<sub>5</sub>, R<sub>4</sub>, R<sub>6</sub>

(I<sub>1</sub>) Add  
R<sub>1</sub>, R<sub>2</sub>      R<sub>1</sub>+R<sub>2</sub>  
IF ID → EX      MEM      WB

so reading R<sub>4</sub>, R<sub>6</sub> not

allowed  $\Rightarrow$  bubble

(I<sub>2</sub>)  $\xrightarrow{\text{IF}} \text{ID}$       (B)      (B)

decoder will detect

Reading R<sub>4</sub>, R<sub>6</sub>

(old value of R<sub>4</sub>)  
not updated

WB dependency

at 5cc, data is available

Due to data hazard, I<sub>2</sub> finishes at 8cc

If there is bubble in pipeline, all will stop, no forward action.

Q) If  $I_3, I_4$  are independent just then at 9 & 10cc, these instruction finishes

Solution :- By passing or data forwarding



$R_1 = 10$   
 $R_2 = 10$   
 $R_4(\text{old}) = 90$   
 $R_6 = 40$



point A & point B are same so.



so old value is replaced by new value

∴ No bubble in Pipeline  
→ Data forwarding

2. Delayed Instruction - but not very useful

Data forwarding is not applicable for load operations.

3. Control Hazards :-

$I_1 ; \text{ if } (\text{Condition})$

then / else

$$I_2 \\ R_4 = R_4 + 7$$

$$I_3 \\ R_5 = R_4 + B$$

$I_3$

$IF$

$I_2 \rightarrow$

$IF \quad ID$

$I_1 \rightarrow$

$IF \quad ID \quad EX \quad MEM \quad WB$



eg. condition :  $x == 0$

false

$I_1$  = Branched Inst. $I_2$  = $I_3$  = Branched Target Inst.No jump / Untaken :-  $PC = PC + 4$  $PC + 4$  $+4$ 

!

Apart from +4, then it is jump ie if  $PC = PC + 8$  - Jump / takenSo we have to remove  $I_2$  from pipeline as this there is jump

No op at 6 & 7 $\alpha$ . Outcome of  $I_1$  come at 8 $\alpha$  & fetching of  $I_3$  takes place at 4 $\alpha$  while fetching of ~~I2~~ takes place at 2 $\alpha$  only.

→ If we add a small ALU at decoding, then we can compute small operations at decode stage & fetching starts from - 3 $\alpha$ .

3 ALU in pipeline :-

- 1. At fetching (for  $PC = PC + 4$ )
- 2. At decoding (for small computations)
- 3. Execution (for Computations)

Q How pipelining is implemented?

→ Techniques available for control hazards-

1. Flushing
2. Predicted Untaken
3. Predicted - Taken
4. Delayed branches

## 1. Flushing :-

Br:  $I_1 \rightarrow IF\ ID$

$I_2$                   IF  $\rightarrow$  flushed down from pipeline

$I_3$                   IF  $\dots\dots\dots$  TCC

## 2. Predicted - Untaken :- We assume that there is no jump. All instructions are executed line wise



both methods  
waste  
TCC

## 3. Predicted - taken :- We assume that there will always be a jump.

Br:  $I_1 \rightarrow IF\ SD\ EX\ MEM$

$I_3$                   IF  $\rightarrow$  switching to  $I_2$

$I_2$ , IF

If there is no jump, switch to  $I_2$  & TCC is wasted.

## 4. Delayed branches :-

choice :-

Br:  $I_1 \rightarrow SF\ SD$

$I_n$                   IF      ID      EX      MEM      WB

depending upon  $\rightarrow I_2 | I_3$                   IF      ID      EX      MEM      WB  
branching or Not

$I_x$  = those instructions which are independent of branch instruction

$I_x$

Branch ( $C_1$ )



# Pipeline Speedup = Pipeline Depth

1 + Overhead / bubbles / stall due to Br.



Branch Frequency

Ques For a deeper pipeline (more than 5 stages), it takes atleast 3cc or 3 stages before the target address is known. & an additional 1cc for evaluating the branch condition. find the effective addition to CPI ideal arising from branches assuming the following frequencies

I 1. Unconditional branch - 4%

II 2. Conditional branch Untaken - 6%

III 3. " " taken - 10%

Schemes - 1. flushing

I II III

2 3 3

2. predictive taken

3 2

3. Predictive Untaken

0 3

} Penalties

Soln

CPI ideal = 1.1

effective addition ? 1 + ?

for flushing, effective addition -  $\frac{2 \times 4 + 3 \times 6 + 3 \times 10}{100} = 0.56$

for P.T, effective addition =  $\frac{2 \times 4 + 3 \times 6 + 2 \times 10}{100} = 0.46$

for P.U.T, effective addition =  $\frac{2 \times 4 + 3 \times 10}{100} = 0.38$

Prediction Techniques :-

→ Static Branch Prediction technique

- 1 bit predictor
- 2 bit predictor
- Correlating prediction
- Branch Target Buffer (BTB)
- Return Address
- MISI
- Value prediction ( $M = n + 10$ )



Before running program you are scheduling instruction that are to be executed.

I suppose prog. is analyzed for 10 minutes &  $I_1, I_2, I_3$  are executed then only these inst. will be executed again & again.





We have to find before ID that  $I_1$  will be taken or not.



If correct then it will execute else value in BHT for that address is changed & then next time  $I_2$  will be executed.

BHT      (Branch History Table)

|       |      |   |
|-------|------|---|
| $I_1$ | 0000 | 1 |
|       |      | 0 |
|       |      | 1 |
|       |      | 0 |
|       |      | 1 |
|       |      | 1 |
|       |      | 0 |
|       | 0111 | 0 |
|       |      | 0 |
|       |      | 1 |
|       |      | 1 |

← If prediction is wrong then its get changed.

### Demerits



To avoid this, we can increase the buffer size. There is no information on what buffer size should be. To overcome this 1-bit branch prediction tech. is used.

while (1)

  for (i=0; i<10; i++)

    true

  for (j=0; j<20; j++)

    false

Predict Untaken &

|        | i =            | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|--------|----------------|---|---|---|---|---|---|---|---|---|---|----|
| NT - 0 | Actual outcome | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0  |
| T - 1  | Predict - NT   | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0  |

$\frac{1}{11}$  is correct in this prediction technique

while (1)

{

    for ( $i=0$ ;  $i<10$ ;  $i++$ )

        (if true)

            {

                (BRI)

| i =            | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|----------------|---|---|---|---|---|---|---|---|---|---|----|
| actual outcome | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0  |

        } false

        {

            D / NT

                (BRI)

                for ( $j=0$ ;  $j<20$ ;  $j++$ ) P·NT - 0 0 0 0 0 0 1 0 0 0 0 0

                {

                    (BRj)

                    1-bit

                    0 1 1 1 1 1 1 1 1 1 1 1

$\frac{1}{11}, \frac{2}{22}, \frac{2}{32}$

$\downarrow$

$-2\frac{1}{11}, 2\frac{1}{21} = 2\frac{8}{32}$

→ 4 bit addressing  $\therefore$  16 address in buffer

|      |    |     |        |      |
|------|----|-----|--------|------|
| 0000 | 00 | BRI | 10 bit | 0111 |
| :    | 01 |     |        |      |
| :    | 10 |     |        |      |
| 11   |    |     |        |      |
| 00   |    |     |        |      |
| 01   |    |     |        |      |
| 10   |    |     |        |      |
| 11   |    |     |        |      |

00 = Strongly not taken

01 = Weakly not taken

10 = Weakly taken

11 = Strongly taken

$\rightarrow$  2 bit prediction :-



~~buffer~~  $\text{PO} \rightarrow \text{Pi} \rightarrow \text{Pi} \cdot \cdots \cdots \text{Pi} \rightarrow \text{Pi}$

again 10 → 11

$\uparrow$  1<sup>st</sup> loop.  $\uparrow$  2<sup>nd</sup> loop

in 1 bit,  $9|_1 \rightarrow 19|_1$

$$1^{\text{st}} \text{ loop} - 28/32 = 87/32$$

$$2^{\text{nd}} \text{ loop} \rightarrow 30/3_1 \rightarrow 93\%$$

starting with 00,  $00 \rightarrow 01 \rightarrow 11 \rightarrow \dots \rightarrow 10$

Starting pattern can be anything but ending pattern for is always so.



7 bits buffer



MSB = 1 → taken  
MSB = 0 → Not taken

BRI      if (aa == 2)  
NT

1 bit & 2 bit predictions  
fails in this example.

BRI      if (bb == 2)  
NT



BRI      if (cc == 2)  
NT

$\frac{1}{2}$   
1  
10  
11 correct

→ Correlated predictor (min)

BRI

BR 2

BR 2



If real outcome in BR 1 is 1 then prediction  
is wrong  $\therefore$  flips 0  $\rightarrow$  1

Prediction is 1

= 87%

= 93%

BRI : if (Q == 0)      NT  
              then | else

0111



real  $\Rightarrow$  outcome = 0  
1  $\rightarrow$  0

BR 2 : if (P  $\times$  Q == 0)      then | else

1110



real outcome = 0

→ (m, n)

$$m = \begin{cases} \text{last outcome of } m \text{ branches} \\ 2^m \text{ Branch History Table} \end{cases}$$

$$n = \begin{cases} \text{every slot is occupied by } n \text{ bit} \\ \text{binary value} \end{cases}$$

BR0

BR1

BR2

BR3

B

eg (2,1)

$2 \rightarrow \{$  outcome of BR1 & BR2 wrt BR3  
 $2^2 = 4$  buffer BHT

if  $m=1$ , outcome of BR2 wrt BR3.

$1 \rightarrow \{0,1\}$

eg (2,2)

$2 = \{$  outcome of BR1 & BR2 wrt BR3  
 $2^2 = 4$  BHT

$2 \rightarrow \{00,01,10,11\}$



consider only last branch outcome

i 1 2 3 4 5 6 7 8 9 10

Real  $\rightarrow$  BR1 NT NT NT NT NT BT BT BT BT BT

BR2 NT NT NT NT T T T T T T

this way be wrong.

$1/100 = \text{wrong.}$

(Not taken : then pred)  
 when  $BR_1 = \text{Not taken}$ , then  $BR_2$  must be not taken according to corredating predictor.

Which buffer g have to index ?

→ BR1 consider outcome of BRx

let BRx = taken

∴ BR1 index to 1 buffer

let us take 0 is prediction for BR1 (last taken)

If BR1 = taken ∴ flip 0 → 1  
(real outcome)

for BR2, index to 1 buffer

prediction for BR2 = NT but real outcome = taken ∴ flip 0 → 1

For next iteration, BR2 is last branch for BR1

BRx = NT

|      |         |    |                           |
|------|---------|----|---------------------------|
| → eg | 5 times | B1 | $B_1 \in T, NT, T, NT, T$ |
|      |         | B2 | $B_2 \in T, T, T, NT, NT$ |

| Predictor (1,1) | iteration | BR1        |          | Actual outcome | Value in buffer | Prediction | Actual outcome |
|-----------------|-----------|------------|----------|----------------|-----------------|------------|----------------|
|                 |           | prediction | (buffer) |                |                 |            |                |
| 1               | 1         | 0/0        | NT       | T              | 0/0             | NT         | T              |
| 2               | 2         | 1/0        | NT       | NT             | 0/1             | NT         | T              |
| 3               | 3         | 1/0        | NT       | T              | 1/1             | T          | T              |
| 4               | 4         | 1/1        | T        | NT             | 1/1             | T          | T              |
| 5               | 5         | 1/0        | NT       | T              | 1/1             | T          | NT             |
| exit            |           | 1/1        |          |                | 1/0             |            |                |



Working of Correlating predictor (1,1)

Representation of 1 bit predictor  
 $(0,1)$  correlating predictor

CLASSMATE

Date \_\_\_\_\_  
 Page \_\_\_\_\_

$$BR_X = NT, BR_Y = NT$$

| predictor (2,1)<br>BR1 |                  | Actual    |        | BR2              |            |
|------------------------|------------------|-----------|--------|------------------|------------|
| iter                   | predictor buffer | predictor | Actual | Predictor buffer | Prediction |
| 1                      | 0/0/0/0          | NT        | T      | 0/0/0/0          | 0 NT       |
| 2                      | 1/0/0/0          | NT        | NT     | 0/1/0/0          | NT         |
| 3                      | 1/0/0/0          | NT        | T      | 0/1/1/0          | NT         |
| 4                      | 1/1/0/0          | NT        | NT     | 0/1/1/1          | T          |
| 5                      | 1/1/0/0          | T         | T      | 0/1/1/1          | T          |
| exit                   | 1/1/0/0          |           |        | 0/1/1/0          | NT         |

Q-1 How many bits are in the  $(0,2)$  branch predictor with  $4K$  entries?

Q-2 How many entries are in a  $(2,2)$  predictor with the same no. of bits. (Ans in ques 1)

→ Entries = slots

$$4K \text{ entries} = 4 \times 2^{10} \text{ slots} = 2^{12} \text{ entries}$$



$(0,2) = 2\text{bit predictor}$

$(2,1) = \text{Consult last 2 branches}$

- 4 buffers

- each slot of buffer is 0/1

Physically there is only one buffer



Q10  $(0,2) = \text{1 buffer}$

- 4k entries =  $2^{12}$  entries = 12 bits

- each entry = 2 bits  $\{00,01,10,11\}$



2. we subtract 4 bits  $= 2^2 \cdot 2^1 = 8K$

$2^3 = 8K$  entries

total 19 bits

$2^3 \times 2^1 \times 2^1$

$(2^3)$

for  $(2,2)$  predictor



Q #

The no. of bits in an  $(m,n)$  predictor =  $2^m \cdot n \cdot (\text{number of predictor entries})$

e.g. for  $(4,2)$  predictor

selected by the branch address

→ 12bit size buffer = 4K entries

→ Local predictor :-

5 times → for() → TTTTNT  
↓  
11110

zones, 0, 1000  
11110 11110 11110 11110 11110 → No pattern of with S ones  
4 Ones



of I<sub>i</sub> is stored in a buffer.

if correct then don't change \*

let us assume, NT as float prediction

11110 11110 11110 11110 11110 11110 11110 11110  
00000 00000 00000 00000 00000 00000 00000 00000

11110 → 11101 → 11011 → 10111 → 01111 → 11110 → 11101 → 11011



for n iterations, 95% - 100% predictions are correct.

→ If P<sub>i</sub> is influenced by itself → go for local predictor & if it is influenced by other branches, go for global

## Tournament Predictor



both predictors wrong

again selecting  $G$

$0X$

predictor 2

$\therefore 0X$

$1\checkmark$   
(Meta  $\rightarrow 01$ )  
( $G \neq T$ ,  $L = T$ )

(Meta  $= 10$ )

det last Selection  $\Rightarrow G$  (assume)

Meta predictor - 00

It means, last true selected by

but both predictions are wrong  
( $G = T$ ,  $L = T$ )



e.g. predictor 1

$0$

$0$

$0$

$1$

$0$

$0$

$01$

predictor 2

$0$

$1$

$0$

$1$

$1$

$0$

$G/L$   
 $G^W$   
 $G_W$   
 $G_W$   
 $L^W$   
 $L_S$   
 $L$

For conditional inst.



### → Branch Target Buffer (BTB)

Only taken branches are in buffer.

$I_1 \rightarrow$  PC + 4 : if (cond.) then  
else  
 $\rightarrow$  PC + 8 Stmt 1 Stmt 2

$I_2 \rightarrow$  PC + 32 goto L;  
NT PC + 36 Stmt 1  
PC + 40 Stmt 2

Ex 1- 1cc 2cc 3cc 4cc 5cc  
 $I_1 \rightarrow$  IF ID EX MEM WB

|         | PC      | Predicted PC |
|---------|---------|--------------|
|         | PC      | PC + 8       |
| Entries | PC + 32 | PC + 32 24   |



If I<sub>1</sub> address is present in table then only we can use BTB table otherwise go with earlier procedure.

eg



IF ID (Actual NT)

find written right one & update the buffer values.



| Inst. in buffer (BTB) | Prediction | Actual outcome | Penalty |
|-----------------------|------------|----------------|---------|
| YES                   | T          | T              | 0       |
| YES                   | T          | NT             | 2       |
| NO                    | NT         | NT             | 0       |
| NO                    | NT         | T              | 2       |

Ques Determine the total branch penalty for BTB assuming the penalty cycles for individual mis prediction. Make the following assumptions about the prediction accuracy & hit rate.

- (i) Prediction accuracy is 90% (for inst. in the buffer)
- (ii) Hit rate in buffer is 90%, (for branches predicted taken)

Soln Hit Rate = 90 instructions are present in buffer  
 ↓  
 (all branches are taken)



$$\frac{90 \times 10 \times 2}{10000} + \frac{10 \times 90 \times 2}{10000}$$

$$\Rightarrow 0.38$$

Ques is Hit Rate = 50% & predictions accuracy for branches in buffer = 50%.  
 viii Those branches Not in BTB, 50% taken.



$$\text{penalty} = \text{for BTB} \cdot \frac{50 \times 50 \times 2 + 50 \times 50 \times 2}{100 \times 100} = 1$$

$$P(\text{in buffer}, T) = \frac{1}{2} \times 2 \\ = (0.25 \times 2)$$

$$P(\text{in buffer, NT}) \times 0 \\ = (0.25 \times 0)$$

$$P(\text{Not in buffer}, T) \times 2 \\ = (0.25 \times 2)$$

$$P(\text{NOT in buffer, NT}) \times 0 \\ = (0.25 \times 0)$$



## 1) latency

## 2) Initiation Interval

|          | L  | II   |
|----------|----|------|
| ALU      | 0  | 1    |
| FP Adder | 3  | 1    |
| FP Mul   | 6  | 1    |
| FP D     | 24 | 25 * |

Latency

(Not dependent)

I<sub>1</sub> → IF ID A<sub>1</sub> A<sub>2</sub> A<sub>3</sub> A<sub>4</sub> MEM WB

I<sub>2</sub> → IF ID A<sub>1</sub> A<sub>2</sub> A<sub>3</sub> A<sub>4</sub> MEM WB

dependent on op of A

(Dependent): I<sub>2</sub> → IF ID B B B, A<sub>1</sub> A<sub>2</sub> A<sub>3</sub> A<sub>4</sub> MEM WB

3 gap

Initiation Interval :-

In case of division, I<sub>2</sub> can't enter in (I-2S) when I<sub>1</sub> is present  
 ∵ I<sub>2</sub> is dependent on I<sub>1</sub>



\* source register is used by 2 instructions.



### → Data Dependence & Hazards

#### Data Dependence

- 1) Data dependence or true data dependence
  - 2) Anti-dependency
  - 3) Output dependence
- True data dependence & Name dependence

#### 1) True Data Dependence :-



#### 2) Anti Dependence :-



No data is transferred from  $\Rightarrow I \rightarrow J$  but there is dependency  
ie if J is executed before I, value of  $F_2$  will be updated which makes  $F_0$  incorrect.

So first I is executed & then J

#### 3) Output Dependence :-



RAR

(this is not a hazard)

|        |       |       |          |
|--------|-------|-------|----------|
| MULT D | $F_0$ | $F_2$ | $F_4$    |
| ADD D  | $F_2$ | $F_4$ | $F_{10}$ |

How to overcome this hazards?

1] RAW :- reading of ~~first~~ 2<sup>nd</sup> inst. before writing of 1<sup>st</sup> inst.

|        |       |       |       |
|--------|-------|-------|-------|
| MULT D | $F_0$ | $F_2$ | $F_4$ |
| ADD D  | $F_6$ | $F_0$ | $F_8$ |

Soln :- fill this by delayed inst.

$$x = a + b$$

$$y = x + z \rightarrow \text{delayed inst.}$$

2] WAR Hazard :-

This method is called Register Renaming



Soln :- 1) Replace  $F_2 \rightarrow F_x$  in eqn 2

So replace all  $F_2 \rightarrow F_x$  in all eqn except 1

2) Replace  $F_x \rightarrow F_y$  in eqn ③ (Replace all  $F_x$ )

3) Replace  $F_8 \rightarrow F_z$  in eqn ④ (Replace all  $F_8$ )

4) Replace  $F_6 \rightarrow F_q$  in eqn ⑤

|        |       |          |          |
|--------|-------|----------|----------|
| MULT D | $F_0$ | $F_2$    | $F_4$    |
| ADD D  | $F_x$ | $F_6$    | $F_8$    |
| SUB D  | $F_y$ | $F_4$    | $F_{10}$ |
| RAW [  | $F_x$ | $F_{12}$ | $F_z$    |
| SUB D  | $F_A$ | $F_2$    | $F_{16}$ |

## Advance pipelining

classmate

Date \_\_\_\_\_  
Page \_\_\_\_\_



- Advance pipelining
- I → S/W Techniques | static scheduling  
compiler Techniques
  - II → H/W Techniques | dynamic scheduling

### 1:- S/W Techniques

- Loop unrolling
- Software pipelining
- VLIW
- predicate instruction

### 2:- Hardware Techniques

- Score boarding
- Tomasula Alg
- Tomasula Alg + loops
- Tomasula Alg + ROB
- Tomasula Alg + ROB + VLIW

### → Compiler Techniques :-

#### - loop unrolling

for( $i=1$ ;  $i<=1000$ ;  $i++$ )  
 $x[i] = x[i] + 10$       ] 1000 times  
 ↪ rolled version      checking

for ( $i=1$ ;  $i<=1000$ ;  $i+=4$ )

850 times  
 ↪ Unrolled       $x[i] = x[i] + 10;$       if 8 times then  $\rightarrow 125$   
 + times       $x[i+1] = x[i+1] + 10;$   
 $x[i+2] = x[i+2] + 10;$   
 $x[i+3] = x[i+3] + 10;$

$\frac{1000}{125} = 8$  times complexity is reduced.

for ( $i=0; i >= 0; i--$ )

$$x[i] = x[i] + s_i$$

Assembly code

Loop : LD F<sub>0</sub>, OLR<sub>i</sub>  
 ADD F<sub>4</sub>, F<sub>0</sub> F<sub>2</sub> → const. value  
 SD F<sub>4</sub>, OLR<sub>i</sub>

DADDI R<sub>1</sub>, R<sub>1</sub> ≠ -8

(branch Not equal)

BNE R<sub>1</sub>, R<sub>2</sub>, Loop



| IPR       | IUR (Inst. Using Result) | Latency |
|-----------|--------------------------|---------|
| (1) FP OP | FP OP                    | 3       |
| (2) FP OP | Store                    | 2       |
| (3) Load  | FP OP                    | 1       |
| (4) Load  | Store                    | 0       |

LD F<sub>0</sub>, OLR<sub>i</sub> → IF ID EX MEM WB → + gap.  
 Add F<sub>4</sub>, F<sub>0</sub> F<sub>2</sub> → IF -----

$x[i] = x[i] + 10$

|                                                    |                                                                                                 |
|----------------------------------------------------|-------------------------------------------------------------------------------------------------|
| $LD F_0, OLR_i$<br>stall                           | $IF(1)$<br>$IF(2)$<br>$IF(3)$<br>$IF(4)$<br>$IF(5)$<br>$IF(6)$<br>$IF(7)$<br>$IF(8)$<br>$IF(9)$ |
| Add F <sub>4</sub> , F <sub>0</sub> R <sub>2</sub> |                                                                                                 |
| SD F <sub>4</sub> , OLR <sub>i</sub>               |                                                                                                 |
| DADDI R <sub>1</sub> , R <sub>1</sub> ≠ -8         |                                                                                                 |
| Stall                                              |                                                                                                 |
| BNE R <sub>1</sub> , R <sub>2</sub> , Loop         |                                                                                                 |

$\rightarrow i = i - 1$

$\rightarrow i >= 0$

- eq
- 1 - LD F<sub>0</sub>, O(R<sub>1</sub>)
  - 2 - DADDUI R<sub>1</sub>, R<sub>1</sub>, ±-8
  - 3 - ADD D F<sub>4</sub> F<sub>0</sub> F<sub>2</sub>
  - 4 - stall
  - 5 - stall
  - 6 - SD F<sub>4</sub>, -8(R<sub>1</sub>)
  - 7 - BNE R<sub>1</sub>, R<sub>2</sub>, loop

avoided  
by unrolling

Loop:

Unrolling -  $x[i] \left[ \begin{array}{l} \text{LD } F_0, O(R_1) \\ \text{ADD } F_4, F_0, F_2 \\ \text{SD } F_4, -8(R_1) \end{array} \right] \xrightarrow{\substack{1 \text{ gap} \\ 2 \text{ gap}}} \Rightarrow \begin{matrix} & & \\ \downarrow & \downarrow & \downarrow \\ \text{LD } F_0, O(R_1) & \text{ADD } F_4, F_0, F_2 & \text{SD } F_4, -8(R_1) \end{matrix}$  6 cycles  
: 24 - 16 - 8 = 0

Replace F<sub>4</sub> → F<sub>A</sub> × [i+1]  
F<sub>0</sub> → F<sub>X</sub>

$\left[ \begin{array}{l} \text{LD } F_0, -BL(R_1) \\ \text{ADD } F_A, F_0, F_2 \\ \text{SD } F_A, -8(R_1) \end{array} \right] = 6 \text{ CC}$

Replace F<sub>4</sub> → F<sub>B</sub> × [i+2]  
F<sub>0</sub> → F<sub>Y</sub>

$\left[ \begin{array}{l} \text{LD } F_0, -16(R_1) \\ \text{ADD } F_B, F_0, F_2 \\ \text{SD } F_B, -16(R_1) \end{array} \right] = 6 \text{ CC}$

Replace F<sub>4</sub> → F<sub>C</sub> × [i+3]  
F<sub>0</sub> → F<sub>Z</sub>

$\left[ \begin{array}{l} \text{LD } F_0, -24(R_1) \\ \text{ADD } F_C, F_0, F_2 \\ \text{SD } F_C, -24(R_1) \end{array} \right] = 6 \text{ CC}$

DADDUI R<sub>1</sub>, R<sub>1</sub>, ±-32 → 1 CC  
BNE R<sub>1</sub>, R<sub>2</sub>, loop

∴ 27 CC for complete  
program.

Gap utilized

|                                          |                                          |
|------------------------------------------|------------------------------------------|
| LD F <sub>0</sub> , O(R <sub>1</sub> )   | LD F <sub>A</sub> , O(R <sub>1</sub> )   |
| LD F <sub>A</sub> , -8(R <sub>1</sub> )  | SD F <sub>A</sub> , -8(R <sub>1</sub> )  |
| LD F <sub>B</sub> , -16(R <sub>1</sub> ) | SD F <sub>B</sub> , +16(R <sub>1</sub> ) |
| LD F <sub>C</sub> , -32(R <sub>1</sub> ) | SD F <sub>C</sub> , +24(R <sub>1</sub> ) |

(to adjust)

DADDUI R<sub>1</sub>, R<sub>1</sub>, ±-32 → 6 CC  
BNE R<sub>1</sub>, R<sub>2</sub>, loop

∴ only 214 CC

Gap b/w load & add = 1, ~~add~~ consumed by 3 cc  
 Add & store = 2 " by 3 cc  
 DADDUI & BNE = 1, consumed by 8 cc

Unrolled 3 times -

## 2 Software pipelining :-



So we execute the inst- in following way.



So transform the instruction in foll. manner



eg Loop LD F0, 0(R1)      Unrolling 3 times

ADD F4, F0, F2

SD F4, 0(R1)

DADDUI R1, R1, ≠ -8

BNE R1, R2, loop

(i) : LD F<sub>0</sub>, 0(R<sub>1</sub>)

ADD F<sub>4</sub>, F<sub>0</sub>, F<sub>2</sub>

SD F<sub>4</sub>, 0(R<sub>1</sub>)

pattern - 3rd inst of i<sub>1</sub>, 2nd of i<sub>2</sub> &  
1st of i<sub>3</sub>  
Protlogue { LD F<sub>0</sub>, 0(R<sub>1</sub>)  
ADD F<sub>4</sub>, F<sub>0</sub>, F<sub>2</sub>  
LD F<sub>4</sub>, -8(R<sub>1</sub>)}

i+1 : LD F<sub>0</sub>, -8(R<sub>1</sub>)

ADD F<sub>4</sub>, F<sub>0</sub>, F<sub>2</sub>

SD F<sub>4</sub>, -8(R<sub>1</sub>)

Loop : SD F<sub>4</sub>, 0(R<sub>1</sub>)

ADD F<sub>4</sub>, F<sub>0</sub>, F<sub>2</sub>

SD F<sub>4</sub>, -8(R<sub>1</sub>)

BADDUI R<sub>1</sub>, R<sub>2</sub> ≠ -8

2  
c b a

i+2 : LD F<sub>0</sub>, -16(R<sub>1</sub>)

ADD F<sub>4</sub>, F<sub>0</sub>, F<sub>2</sub>

SD F<sub>4</sub>, -16(R<sub>1</sub>)

BNE R<sub>1</sub>, R<sub>2</sub>, loop

SD F<sub>4</sub>, -8(R<sub>1</sub>)

ADD F<sub>4</sub>, F<sub>0</sub>, F<sub>2</sub>

SD F<sub>4</sub>, -16(R<sub>1</sub>)

{ Renaming  
is required }

i i+1 i+2 i+3



Software  
pipelining



- Size of code is less in software pipelining than loop unrolling
- Register requirement is very less

To overcome the problem of nested loops, loop unrolling & software pipeline fails so we use trace scheduling.

## Trace Scheduling :-



→ VLIW (Very Long Instruction Word)



Let there are 2 ALU  
 $\left\{ \begin{array}{l} \text{ALU 1} \\ \text{ALU 2} \end{array} \right\}$

$I_1, I_2$  are independent

At 3cc, ALU 1 is busy & ALU 2 is idle

At 4cc, ALU 2 is busy & ALU 1 is idle

} inefficient use of h/w

In VLIW, pairing of inst- is done : both  $I_1$  &  $I_2$  is issued at the same time with 2 fetching state & decoding state.



## Multi issue processor

- Statistically Scheduled Superscalar processor
- VLIW
- Dynamically Scheduled Superscalar processor

## Assembly Code

eg  $x = (a+b) + (c+d)$

b++;

|     |                      |                |                |                |
|-----|----------------------|----------------|----------------|----------------|
| RAW | I <sub>1</sub> : ADD | F <sub>1</sub> | F <sub>2</sub> | F <sub>3</sub> |
|     | MUL                  |                |                |                |
| RAW | I <sub>2</sub> : ADD | F <sub>2</sub> | F <sub>3</sub> | F <sub>4</sub> |
|     | ADDUI                |                |                |                |
|     | I <sub>3</sub> : ADD | F <sub>3</sub> | F <sub>1</sub> | F <sub>2</sub> |
|     | ADDUI                |                |                |                |
|     | I <sub>4</sub> : ADD | F <sub>4</sub> | F <sub>4</sub> | +1             |

VLIW - 3 issue processor

NOP = No operation



Q Suppose we have VLIW that could issue 2 memory references, 2 FP operations & 1 integer operation per branch in every clock cycle. Show an unrolled version of loop. Unroll as many times as necessary to eliminate stalls.

Soln VLIW can issue 5 instructions but with constraints, it can issue only 2 memory operations, 2 FP & 1 integer or branch inst.

|       |                                                          |                                                                         |                                                           |
|-------|----------------------------------------------------------|-------------------------------------------------------------------------|-----------------------------------------------------------|
| Loop: | LD <sub>1</sub> , F <sub>0</sub> , 0(R <sub>1</sub> )    | ADD D <sub>1</sub> , F <sub>4</sub> , F <sub>0</sub> , F <sub>2</sub>   | SD <sub>1</sub> , F <sub>4</sub> , 0(R <sub>1</sub> )     |
|       | LD <sub>2</sub> , F <sub>6</sub> , -B(R <sub>1</sub> )   | ADD D <sub>2</sub> , F <sub>8</sub> , F <sub>6</sub> , F <sub>2</sub>   | SD <sub>2</sub> , F <sub>8</sub> , -B(R <sub>1</sub> )    |
|       | LD <sub>3</sub> , F <sub>10</sub> , -16(R <sub>1</sub> ) | ADD D <sub>3</sub> , F <sub>12</sub> , F <sub>10</sub> , F <sub>2</sub> | SD <sub>3</sub> , F <sub>12</sub> , +16(R <sub>1</sub> )  |
|       | LD <sub>4</sub> , F <sub>14</sub> , -24(R <sub>1</sub> ) | ADD D <sub>4</sub> , F <sub>14</sub> , F <sub>14</sub> , F <sub>2</sub> | SD <sub>4</sub> , F <sub>16</sub> , -24(R <sub>1</sub> )  |
|       | LD <sub>5</sub> , F <sub>18</sub> , -32(R <sub>1</sub> ) | ADD D <sub>5</sub> , F <sub>20</sub> , F <sub>18</sub> , F <sub>2</sub> | B <sub>1</sub> → BNE R <sub>1</sub> , R <sub>2</sub> loop |
|       | LD <sub>6</sub> , F <sub>22</sub> , -40(R <sub>1</sub> ) | ADD D <sub>6</sub> , F <sub>24</sub> , F <sub>22</sub> , F <sub>2</sub> | SD <sub>5</sub>                                           |
|       | LD <sub>7</sub> , F <sub>26</sub> , -48(R <sub>1</sub> ) | ADD D <sub>7</sub> , F <sub>28</sub> , F <sub>26</sub> , F <sub>2</sub> | SD <sub>6</sub>                                           |
|       |                                                          |                                                                         | SD <sub>7</sub>                                           |

|           | $M_1$                                 | $M_2$   | $FP_1$                                                                              | $FP_2$                                      | $S/B$                                                         | $CC$                                   |
|-----------|---------------------------------------|---------|-------------------------------------------------------------------------------------|---------------------------------------------|---------------------------------------------------------------|----------------------------------------|
|           | $L_1$                                 | $L_2$   | $\downarrow ADD_1$<br>dependent<br>on $L_1$                                         | $\downarrow ADD_2$<br>dependent<br>on $L_2$ | $I_1$<br>$\downarrow$<br>Not required (dependent on all inst) |                                        |
| 1st issue | $L_1$                                 | $L_2$   | -                                                                                   | -                                           | -                                                             | 1cc                                    |
|           | $L_3$                                 | $L_4$   | $\overbrace{ADD_1 \quad ADD_2}$                                                     | $I_1$                                       |                                                               |                                        |
|           |                                       |         | can't keep due to latency<br>[one gap is b/w $L_1$ & $ADD_1$<br>& $L_2$ & $ADD_2$ ] |                                             |                                                               |                                        |
| 2nd issue | $L_3$                                 | $L_4$   | -                                                                                   | -                                           | -                                                             | 2cc                                    |
| 3rd issue | $L_5$                                 | $L_6$   | $ADD_1$                                                                             | $ADD_2$                                     | $\overline{I_1}$                                              | 3cc                                    |
| 4th issue | $L_7$                                 | (-)     | $ADD_3$                                                                             | $ADD_4$                                     | -                                                             | 4cc                                    |
| 5th issue | $\cancel{S_1}$                        | -       | $ADD_5$                                                                             | $ADD_6$                                     | -                                                             |                                        |
|           | latency b/w<br>$ADD_1$ & $S_1$ is 2cc |         |                                                                                     |                                             |                                                               |                                        |
| 6th issue | $S_0_1$                               | $S_0_2$ | $ADD_7$                                                                             | -                                           | -                                                             |                                        |
| 7th issue | $S_0_3$                               | $S_0_4$ | -                                                                                   | -                                           | $\rightarrow I_1$                                             |                                        |
| 8th issue | $S_0_5$                               | $S_0_6$ | -                                                                                   | -                                           | $B_1$                                                         | Latency<br>b/w $I_1$ & $B_1$<br>is 1cc |
| 9th issue | $S_0_7$                               | -       | -                                                                                   | -                                           | $B_1$                                                         |                                        |

∴ total 9cc VLIW taken to issue all instructions.

But in case of unrolling, 23cc is required. So most of hardware is ideal but in case of VLIW, all hardware is used efficiently.

## TRACE SCHEDULING

for (i=0; i<3; i++)  
    x[i] = x[i] + 10

## → unrolling

$$x[0] = x[0] + 10$$

$$x(1) + x(1) + 10$$

$$u(2) = u(2) + 10$$

### - Basic Block

Single Basic Block - Execution of statement without any jump except at last statement sequentially



Q How to Convert Multiple Basic Block into Single Basic Block ?



After executing 100 times, note down the <sup>most</sup> frequent path of

classmate  
Date \_\_\_\_\_  
Page \_\_\_\_\_

classmate  
Date \_\_\_\_\_  
Page \_\_\_\_\_

except at last  
statement.

To fill latency :-



single basic block

→ Head

→ Tail

→ Predicate Instruction :-



$$R_7 = f R_1$$

$$R_2 = R_3 + R_4 \quad (\text{Predicate on } R_7)$$

$$R_4 = R_5 + R_6 \quad (\text{Predicate on } R_7)$$

if ( $R_1 == 0$ ) then

$$R_2 = R_3 + R_4 ;$$

else

$$R_4 = R_5 + R_6 ;$$

→ Hardware Techniques :- (Dynamic Scheduling)



All these Hazards require Registers is also known as Register Data Hazards

Memory Data Hazard :-



We can't bypass LD operation over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation

over store operation



Compiler can't find this type of memory hazards.



Q How hardware techniques resolve memory data hazards?



independent → SUB D F<sub>0</sub> F<sub>2</sub> F<sub>4</sub> If F<sub>1</sub> stuck in pipeline, then F<sub>2</sub> can't execute  
 $\therefore$  bypass but F<sub>3</sub> can bypass & execute.

- Register Data Hazards - Register renaming, Instruction Scheduling
  - Memory Data Hazards
  - Control Hazards - both software & hardware techniques are available

Compiler can't  
solve this problem

Only how can  
solve

Horticultural techniques to solve above problems are



IS - Issue Stage  
RD - Read Operands

classmate

Date \_\_\_\_\_

Page \_\_\_\_\_

→ Scoreboarding :-



1. Issue Stage :-

- check for structural Hazard. (functional unit is available or not)
- check for RAW Hazard
- issue the instruction in pipeline

2. Read Operand :-

ADD D F<sub>0</sub> F<sub>2</sub> F<sub>4</sub>

If both operands F<sub>0</sub> & F<sub>2</sub> are available then only execution takes place else RD has to wait (RAW Hazard)

→ Instruction Status Table :-

|         |                                                   | IS     | R              | Ex                            | WB                                                                        |
|---------|---------------------------------------------------|--------|----------------|-------------------------------|---------------------------------------------------------------------------|
| LD      | F <sub>6</sub> , 34(R <sub>2</sub> )              | 1      | 2              | 3                             | 4                                                                         |
| LD      | F <sub>2</sub> , 45(R <sub>3</sub> )              | 5      | 6              | 7                             | 8                                                                         |
| MULT    | F <sub>0</sub> , F <sub>2</sub> , F <sub>4</sub>  | 6      | 9              | + 19                          | 20                                                                        |
| SUBD    | F <sub>8</sub> , F <sub>6</sub> , F <sub>2</sub>  | 7      | 9              | 11                            | 12                                                                        |
| DIVD    | F <sub>10</sub> , F <sub>0</sub> , F <sub>6</sub> | 8      | 11             | 61                            | 62                                                                        |
| ADD     | F <sub>6</sub> , F <sub>8</sub> , F <sub>2</sub>  | 13     | 14             | 15                            | 22                                                                        |
|         | (what type of operation)                          |        | source         | Operands                      | who produce F <sub>j</sub> & F <sub>k</sub>                               |
|         |                                                   |        | F <sub>j</sub> | F <sub>k</sub> Q <sub>j</sub> | operands are available or not R <sub>j</sub> & R <sub>k</sub>             |
| execute | Busy                                              | OP     | dest           | - R <sub>2</sub> - - -        | - - - Yes                                                                 |
|         | (1) Integer                                       | No     | LD             | F <sub>6</sub>                | - - - No Yes                                                              |
|         | (10) MULT 1                                       | No Yes | MULT           | F <sub>0</sub>                | F <sub>0</sub> F <sub>2</sub> LD <sub>2</sub> - No Yes                    |
|         | MULT 2                                            | No     |                |                               |                                                                           |
|         | (2) ADD 1                                         | No Yes | SUBD           | F <sub>8</sub>                | F <sub>6</sub> F <sub>2</sub> LD <sub>1</sub> LD <sub>2</sub> Yes, No Yes |
|         | (40) DIV D1                                       | No Yes | DIVD           | F <sub>10</sub>               | F <sub>0</sub> F <sub>6</sub> MULTD LD <sub>1</sub> No Yes                |
|         |                                                   |        |                |                               |                                                                           |
|         |                                                   |        |                |                               |                                                                           |
|         |                                                   |        |                |                               |                                                                           |

to perform LD & simple Arithmetic op.

W<sub>A</sub>R Hazard

Functional Unit Status Table

Registers



- When job is over, remove the operation so atleast all registers are available.
  - Initially do operation check whether functional unit is available or not.
  - All instructions follow the inorder.
  - At first clock cycle, LD is issued.
  - At second clock cycle, Integer function unit is not present for and LD op.
    - ∴ reading of operands of LD, takes place.
  - After 4th cc, Integer functional unit status  $\rightarrow$  Yes  $\rightarrow$  No
  - All instructions are executed out of order.
- LIMITATION**
- Scoreboarding can't solve WAW & WAR Hazard by register renaming.
  - Forwarding is not present in scoreboard.
  - Able to solve only single basic problem.

To solve the limitations of scoreboard, we have Tomasulo's algorithm

### Tomasulo's Algorithm :-

#### 1) Reservation station (RS)

- using this we can apply register renaming

RS, COB  $\rightarrow$  using these 2, we apply forwarding



In Scoreboarding, instructions pass 4 stages

In Tomásulo's algorithm, instruction passes 3 stages.



2 types of functional units



In Tomásulo's Algorithm, Issue (IS)

- i) If RS is available, then 9 can issue i.e. copy the content of operand registers into the reservation station.

Execution - 1) of operands available then start execution

WB - 1) Write result in CAB

2) Write result in registers

Instruction status:

LD F<sub>6</sub>, 34(R<sub>2</sub>)

LD F<sub>2</sub> 45(R<sub>3</sub>)

MULD F<sub>0</sub> F<sub>2</sub> F<sub>4</sub>

SUBD F<sub>8</sub> F<sub>6</sub> F<sub>2</sub>

DIVD F<sub>10</sub> F<sub>8</sub> F<sub>6</sub>

ADDD F<sub>6</sub> F<sub>2</sub> F<sub>2</sub>

↓

diff than one in DIVD.

In DIVD, F<sub>6</sub> is stored in MIA<sub>1</sub>

So ADDD will look in MIA<sub>1</sub>

not in F<sub>6</sub> anymore so WAR Hazard is completely eliminated.

| IS | EX           | WB |
|----|--------------|----|
| 1  | 3            | 4  |
| 2  | 4            | 5  |
| 3  | (5+10)<br>15 | 16 |
| 4  | 7            | 8  |
| 5  | 56           | 57 |
| 6  | 10           | 11 |

R.S = Reservation Station

## Reservation Station

|            | Busy   | OP  | $V_i$    | $V_K$       | $Q_j$  | $Q_K$ |
|------------|--------|-----|----------|-------------|--------|-------|
| (2) Add 1  | Yes    | SUB | $m(A_1)$ | $m(A_2)$    | Load 2 |       |
| (2) Add 2  | Yes    | ADD |          | $m(A_2)$    | SUB    |       |
| (2) Add 3  |        |     |          |             |        |       |
| (10) MUL 1 | No Yes | MUL | $m(A_1)$ | R1( $F_3$ ) |        |       |
| (40) DIV 1 | No Yes | DIV |          | $m(A_1)$    | MUL    |       |

Once the Algo is over, the RS is completely free.

OP

RS1 +  
RS2 /  
RS3 -Whe  
A

T

free SCC  
free Y4CC  
free ZCC

Q

|         | Busy | A          | $F_6$                                        | $F_2$ | $F_4$ | $F_B$ | $F_{10}$        |
|---------|------|------------|----------------------------------------------|-------|-------|-------|-----------------|
| Load 1  | Yes  | $34 + R_2$ | New location $m(A)$                          |       |       |       | $m(A_1) m(A_2)$ |
| Load 2  | Yes  | $45 + R_3$ | $m(A_1)$ passed using CDB to all other units |       |       |       |                 |
| Locat 3 |      |            |                                              |       |       |       |                 |

In 1st cc, check whether inst can be issued or not.

Above example enables forwarding prevents WAR &amp; WAW Hazards Only

Limitation :- Common Data Bus  $\rightarrow$  All units try to write on the CDB at the same time↳ soln  $\rightarrow$  have multiple CDB but the complexity of architecture increases

Ex-1

$$\begin{aligned} & \text{RCC } F_1 = F_3 + F_4 \quad \text{WAW Hazard} \\ & \text{MCC } F_2 = F_4 / F_2 \\ & \text{GCC } F_3 = F_4 - F_2 \end{aligned}$$

Ex-2  $F_1 = F_3 + F_4$  WAW Hazard

$F_1 = F_4 / F_2$

$F_2 = F_4 - F_3$

Using Romasulo's Algo  $\rightarrow$

|   | IS | EX   | WB | 2 for reading<br>(3, 4 for execution) |
|---|----|------|----|---------------------------------------|
| 1 | 1  | 4    | 5  |                                       |
| 2 | 2  | 43   | 44 |                                       |
| 3 | 3  | 49+2 | 47 |                                       |

Original Table

|       |     |
|-------|-----|
| $F_1$ | 1.0 |
| $F_2$ | 2.0 |
| $F_3$ | 3.0 |
| $F_4$ | 4.0 |

Register Renaming Table

|       |                   |
|-------|-------------------|
| $F_1$ | $R_{S1} / R_{S2}$ |
| $F_2$ |                   |
| $F_3$ |                   |
| $F_4$ |                   |

RS = Reservation Station

classmate

Date \_\_\_\_\_

Page \_\_\_\_\_

|     | OP | OP1       | OP2 |                        |
|-----|----|-----------|-----|------------------------|
| RS1 | +  | 3.0       | 4.0 | = [A B B] → 4cc        |
| RS2 | /  | 4.0       | 2.0 | = [B C C] → 43cc (2.0) |
| RS3 | -  | RS2 → 2.0 | 2.0 | = [A B B] → 4bCC (0.0) |

When writing is done, Register renaming table & RS are made free.

All the things stored earlier are removed -

|      | IS                | EX      | WB |                           | RRT   | free                    |
|------|-------------------|---------|----|---------------------------|-------|-------------------------|
| Ex-2 | $F_1 = F_3 + F_4$ | 1 4 5   |    | $F_0$ <del>44LL</del> 2.0 | $F_0$ | <del>RS2</del> 4cc      |
|      | $F_1 = F_4 / F_2$ | 2 43 44 |    | $F_1$ 1.0 → 1.00          | $F_1$ | <del>RS1</del> 5cc free |
|      | $F_2 = F_4 - F_3$ | 3 6 7   |    | $F_2$ 2.0 → 2cc (1.0)     | $F_2$ | <del>RS3</del> 2cc free |
|      |                   |         |    | $F_3$ 3.0                 | $F_3$ |                         |
|      |                   |         |    | $F_4$ 4.0                 | $F_4$ |                         |

|                     | OP  | OP1 | OP2 |      |
|---------------------|-----|-----|-----|------|
| free <del>5cc</del> | RS1 | +   | 3.0 | 4cc  |
| free <del>4cc</del> | RS2 | /   | 4.0 | 43cc |
| free <del>7cc</del> | RS3 | -   | 4.0 | 6cc  |

|   | IS                | EX      | WB |                  | RRT   | free               |
|---|-------------------|---------|----|------------------|-------|--------------------|
| Q | $F_2 = F_4 + F_1$ | 1 4 5   |    | $F_1$ 1.0 → 1.00 | $F_1$ | <del>RS2</del> 5cc |
|   | $F_1 = F_2 / F_3$ | 2 45 46 |    | $F_2$ 3.0 → 3.00 | $F_2$ | <del>RS1</del> 5cc |
|   | $F_4 = F_1 - F_2$ | 3 48 49 |    | $F_3$ 3.0 → 3.00 | $F_3$ | <del>RS3</del> 2cc |
|   | $F_1 = F_2 + F_3$ | 4 7 8   |    | $F_4$ 4.0 → 4.00 | $F_4$ |                    |

|       | OP         | OP1        | OP2 |                            |
|-------|------------|------------|-----|----------------------------|
| $F_1$ | RS3 → free |            |     | $A \xrightarrow{5.0} 4cc$  |
| $F_2$ | RS2 → free |            |     | $A \xrightarrow{4cc} 3.34$ |
| $F_3$ |            | RS1 → free |     | $B \xrightarrow{4cc} 1.67$ |
| $F_4$ | RS2 → free |            |     | $A \xrightarrow{2.0} 7cc$  |

|       |     |      |     |
|-------|-----|------|-----|
| $F_1$ | 1.0 | 8.0  | 8cc |
| $F_2$ | 2.0 | 5.0  | 5cc |
| $F_3$ | 3.0 |      |     |
| $F_4$ | 4.0 | 3.34 | 4cc |

→ Drawback of Tomasulo's -  
 Precise Exception - Division with zero detected  
 40th cc



If  $F_6 = 0 \rightarrow$  leads to exception (division by 0)

exceptional handler increases  $0 \rightarrow 0.00\cdots 1$

② Store value only after previous init. is solved.

→ To overcome exception and 'get out of order execution'

↳ Tomasulo's + RDB

|    |                |       | IS                                           | EX | WB |
|----|----------------|-------|----------------------------------------------|----|----|
| 2  | I <sub>1</sub> | SUBD  | F <sub>1</sub> F <sub>3</sub> F <sub>2</sub> | 1  | 4  |
| 10 | I <sub>2</sub> | MULTD | F <sub>4</sub> F <sub>2</sub> F <sub>3</sub> | 2  | 13 |
| 40 | I <sub>3</sub> | DIVD  | F <sub>0</sub> F <sub>3</sub> F <sub>1</sub> | 3  |    |
| 2  | I <sub>4</sub> | ADD D | F <sub>3</sub> F <sub>1</sub> F <sub>2</sub> | 4  | 16 |

RRT

|                |      |                |     |          |
|----------------|------|----------------|-----|----------|
| F <sub>0</sub> | 0.0  | F <sub>0</sub> | RS3 |          |
| F <sub>1</sub> | -1.0 | F <sub>1</sub> | RS1 | 5cc free |
| F <sub>2</sub> | 2.0  | F <sub>2</sub> |     |          |
| F <sub>3</sub> | -2.0 | F <sub>3</sub> | RS4 |          |
| F <sub>4</sub> | 4.0  | F <sub>4</sub> | RS2 |          |

|     |       |         |            |                |
|-----|-------|---------|------------|----------------|
| RS1 | SUBD  | 2.0     | 2.0        | A → 4cc (0.0)  |
| RS2 | MULTD | 2.0     | -2.0       | M → 13cc (4.0) |
| RS3 | DIVD  | 2.0     | -RS1 (0.0) | D              |
| RS4 | ADD D | RS2 4.0 | 2.0        | A → 16cc (4.0) |

System call to the exception handler which increases the value of  $f_1 \rightarrow 0$ .  
 $\rightarrow 0.001$

Hence the system call return to the inst.

$I_3 \text{ DIVD } F_0 F_3 F_1$

$RS_3 \text{ DIVD } 6.0 0.001$

Wrong value of  $F_3$  used which gives wrong result.

Re-Order Buffer:-

Clobber

Tomasulo's Algorithm -



Hence if exception occur in  $I_1, I_2$  restarted from the start

SUBD       $F_1 F_3 F_2$   
 MULTD      $F_4 F_2 F_3$   
 DIVD      $F_0 F_3 F_1$   
 ADDD      $F_3 F_4 F_2$

| IS | EX     | W. | C  |
|----|--------|----|----|
| 1  | ✓ 4    | 5  | 6  |
| 2  | ✓ 13   | 14 | 15 |
| 3  | W ✓    |    |    |
| 4  | W ✓ 16 | 17 | W  |

all inst. flushed out of table  
 & restarted

| ROB entry |     |   |      |             |      |
|-----------|-----|---|------|-------------|------|
| A         | RS1 | - | 2.0  | 2.0         | ROB0 |
| M         | RS2 | X | 2.0  | 2.0         | ROBL |
| D         | RS3 | D | 2.0  | ROBD<br>0.0 | ROB2 |
| A         | RS4 | A | ROBI | 2.0         | ROB3 |

4.0

tag

$4 \rightarrow (0.0) + ROB0$   
 $13 \rightarrow (4.0) + ROB1$   
 (start at 4.0)  
 $16 \rightarrow (6.0) + ROB3$

|       |     |
|-------|-----|
| $F_0$ | 0.0 |
| $F_1$ | 1.0 |
| $F_2$ | 2.0 |
| $F_3$ | 3.0 |
| $F_4$ | 4.0 |

|       |                                                 |
|-------|-------------------------------------------------|
| $F_0$ | ROB2                                            |
| $F_1$ | ROBO $\rightarrow 0.0 \text{ (in } 6\text{cc)}$ |
| $F_2$ | ROB3                                            |
| $F_3$ | ROB1 $\rightarrow 4.0 \text{ (in } 1\text{cc)}$ |
| $F_4$ |                                                 |

|      | Type | dest. | Value            | Status |     |                  |
|------|------|-------|------------------|--------|-----|------------------|
| ROBO | S    | $F_1$ | - / 0.0          | 0      | (H) | Head H           |
| ROB1 | M    | $F_4$ | - / 4.0 (in 4cc) | 0      | (H) | (Circular queue) |
| ROB2 | D    | $F_0$ | -                | 0      | (H) |                  |
| ROB3 | A    | $F_3$ | - / 6.0 (in 7cc) | 0      |     |                  |
| ROB4 |      |       |                  |        |     |                  |

→ Tomásito's Algo for Branches (Hardware Based Speculation)



| IS | EX   | W  | C  |
|----|------|----|----|
| 1  | ✓ 12 | 13 | 14 |
| 2  | 13   |    | 15 |
| 3  | ✓ 6  | 7  |    |
| 4  | 7    | 8  |    |
| 15 |      |    |    |

| T    | P | V              | S         | Header |
|------|---|----------------|-----------|--------|
| R0B0 | M | F <sub>2</sub> | $f_{0=0}$ | 0      |
| R0B1 | B | F <sub>2</sub> | -         | 0      |
| R0B2 | A | F <sub>4</sub> | $f_{4=0}$ | 0      |
| R0B3 | S | F <sub>5</sub> | $f_{2=0}$ | 0      |

16

|     | OP    | OP1 | OP2 | ROBE |
|-----|-------|-----|-----|------|
| RS1 | MULTD | 0·0 | 1·0 | R0B0 |
| RS2 | ADDD  | 3·0 | 1·0 | R0B2 |
| RS3 | SUBD  | 3·0 | 1·0 | R0B3 |

|                |     |      |
|----------------|-----|------|
| F <sub>0</sub> | 0·0 |      |
| F <sub>1</sub> | 1·0 |      |
| F <sub>2</sub> | 2·0 | R0B0 |
| F <sub>3</sub> | 3·0 |      |
| F <sub>4</sub> | 4·0 | R0B2 |
| F <sub>5</sub> | 5·0 | R0B3 |

The next instruction will be known after execution of branch inst. So ADD & SUB inst are called speculating inst because these inst are issued before knowing the outcome.

The real outcome of branch is known at 13<sup>th</sup> clock cycle but real outcome shows taken. Even though the ADD & SUB inst. are executed but their values are not updated so flush down these inst & change value in buffer from 0 → 1

Suppose that the prediction unit & scal is NT so the values are modified

Prediction = NT so execute inst 1

### Tomasulo's Algo for Branch & Exceptions



MULT D

BNEQ

ADD D

SUB D

I, (WD)

DIV D

→ This list is speculative and also generates exception (divide by 0)

label: ADD D

SUB D

If branch outcome says NT then only the exception is handled otherwise the instructions are flushed down when the real outcome comes.

→ Inst Speculative → Not Handled

Not Speculative → Exception Handled.

→ Memory Data Hazard :-

LD R<sub>1</sub>, O[R<sub>2</sub>] → Disp Add Mode Mem[0 + (R<sub>2</sub>)]  
 SD R<sub>2</sub>, O[R<sub>3</sub>]  
 LD R<sub>4</sub>, O[R<sub>4</sub>] → Mem[0 + (R<sub>3</sub>)]  
 SD R<sub>5</sub>, O[R<sub>0</sub>]  
 LD R<sub>5</sub>, O[R<sub>1</sub>]



DIVD R<sub>2</sub> R<sub>4</sub> R<sub>5</sub> — takes 4 clock cycle so store operation can't be executed.

But LD is allowed to execute



RAW Hazard b/w memory locations so DIVA operation give wrong answer.

## → LSQ (Load Store Queue)



- ① LD R<sub>1</sub>, O(R<sub>1</sub>)
- ② SD R<sub>2</sub>, O(R<sub>2</sub>)
- ③ LD R<sub>4</sub>, O(R<sub>4</sub>)
- ④ SD R<sub>5</sub>, O(R<sub>5</sub>)
- ⑤ LD R<sub>6</sub>, O(R<sub>1</sub>)

|       |       |
|-------|-------|
| 1-15  | to 24 |
| 16-31 |       |
| 32-47 |       |
| 48-63 |       |
| 64-79 |       |

ROB  
T ROB D

|      |   |    |
|------|---|----|
| ROBO | L | F0 |
| ROB1 | A | F1 |
| ROB2 | S | F2 |
| ROB3 | L | F3 |
| ROB4 | M | F4 |
| ROB5 | S | F5 |
| ROB6 | L | F6 |
| ROB7 | S | F7 |

F0  
F1  
F2  
F3  
F4  
F5  
F6

Non  
S

Non Speculative Load Technique

If effective add. of <sup>result</sup> SB & LD is same so no need to go for memory calc.  
simply forward the value. This is called Store to Load Forwarding.

Before knowing the add. of store, we can allow the load address & thus inst is known as speculative. If the add. is same then flush down the value,  
& otherwise do the calculation.

## → Tomasulo's Algorithm with Reorder Buffer &amp; Load Store Queue

Eq 1024

|                                                       | IS | EX | W  | C  | → add. is calculated as value is also obtained . |
|-------------------------------------------------------|----|----|----|----|--------------------------------------------------|
| LD F0, O(R <sub>1</sub> )                             | 1  | 4  | 5  | 6  |                                                  |
| ADD F <sub>1</sub> , F <sub>0</sub> , F <sub>2</sub>  | 2  | 7  | 8  | 9  |                                                  |
| SD L <sub>1</sub> , O(R <sub>2</sub> )                | 3  | 10 | 11 | 12 | (A) RS1                                          |
| LD (F <sub>3</sub> ), O(R <sub>2</sub> )              | 4  | 13 | 14 | 15 | (M) RS2                                          |
| MULT F <sub>4</sub> , F <sub>3</sub> , F <sub>2</sub> | 5  | 16 | 17 | 18 | (A) RS3                                          |
| SD F <sub>4</sub> , O(R <sub>3</sub> )                | 6  | 19 | 20 | 21 |                                                  |
| LB F <sub>5</sub> , O(R <sub>1</sub> )                | 7  | 22 | 23 | 24 |                                                  |
| SUB F <sub>6</sub> , F <sub>5</sub> , F <sub>2</sub>  | 8  |    |    |    |                                                  |

OP OP1 OP2 ROB

|         |            |          |      |
|---------|------------|----------|------|
| ADD     | 10<br>ROB5 | 2-0<br>E | ROB1 |
| MULT    | 50<br>ROB5 | 2-0      | ROB1 |
| ADD SUB | ROB6       | 2-0      | ROB7 |

Eq 2

| ROB  |   | LSQ            |       |   |   |
|------|---|----------------|-------|---|---|
| T    | R | L/S            | E/A   | V | C |
| ROB0 | L | F <sub>0</sub> | +10 0 | H |   |
| ROB1 | A | F <sub>1</sub> | - 0   |   |   |
| ROB2 | S | F <sub>2</sub> | - 0   |   |   |
| ROB3 | L | F <sub>3</sub> | - 0   |   |   |
| ROB4 | M | F <sub>4</sub> | - 0   |   |   |
| ROB5 | S | F <sub>5</sub> | - 0   |   |   |
| ROB6 | L | F <sub>5</sub> | - 0   |   |   |
| ROB7 | S | F <sub>6</sub> | - 0   |   |   |

|                |     |      |
|----------------|-----|------|
| F <sub>0</sub> | 0.0 | ROB0 |
| F <sub>1</sub> | 1.0 | ROB1 |
| F <sub>2</sub> | 2.0 |      |
| F <sub>3</sub> | 3.0 | ROB3 |
| F <sub>4</sub> | 4.0 | ROB4 |
| F <sub>5</sub> | 5.0 | ROB6 |
| F <sub>6</sub> | 6.0 | ROB7 |

Let us take that the store L is not yet calculated the effective add.

By load 2 is done with effective add.

Assume that the effective add of LD 2 & SD 1 is same.

Non Speculation technique - LD wait until SD completes its calculation of EA

Speculative technique - Before knowing the add of store, we can fetch the value in load operation. After calculation of store, if the value of EA is same that of load then flush down the value of in LR & all inst. following LD inst is flushed down.

→ Tomasulo's Algorithm + ROB + Multi-issue processor

Eg 2 issue processor.

SUBD F<sub>0</sub> F<sub>1</sub> F<sub>2</sub>

MULTD F<sub>3</sub> F<sub>0</sub> F<sub>4</sub>

ADDD F<sub>5</sub> F<sub>1</sub> F<sub>4</sub>

ALVD F<sub>2</sub> F<sub>1</sub> F<sub>2</sub>

ADDD F<sub>6</sub> F<sub>5</sub> F<sub>2</sub>

MULTD F<sub>3</sub> F<sub>5</sub> F<sub>1</sub>

IS EX N C

|   |   |   |   |
|---|---|---|---|
| 1 | 4 | 5 | 6 |
|---|---|---|---|

|   |    |    |    |
|---|----|----|----|
| 1 | 15 | 16 | 17 |
|---|----|----|----|

|   |   |   |   |
|---|---|---|---|
| 2 | 5 | 6 | X |
|---|---|---|---|

|   |    |    |    |
|---|----|----|----|
| 2 | 43 | 44 | 45 |
|---|----|----|----|

|   |    |    |    |
|---|----|----|----|
| 3 | 46 | 47 | 48 |
|---|----|----|----|

|   |    |    |    |
|---|----|----|----|
| 3 | 16 | 17 | 18 |
|---|----|----|----|

Check for functional units

Eg  
 SUBD F<sub>0</sub> F<sub>1</sub> F<sub>2</sub>  
 MULT F<sub>3</sub> F<sub>0</sub> F<sub>4</sub>  
 DIVD F<sub>2</sub> F<sub>1</sub> F<sub>2</sub>  
 BNE ↓ F<sub>2</sub>, 0, Label  
 ADDD F<sub>5</sub>, F<sub>1</sub>, F<sub>4</sub> > speculative  
 MULTD F<sub>3</sub>, F<sub>5</sub>, F<sub>1</sub> so lout complete  
 ADDD F<sub>6</sub>, F<sub>5</sub>, F<sub>1</sub>

|   | IS | EX | W  | C |
|---|----|----|----|---|
| 1 | 4  | 5  | 6  |   |
| 1 | 15 | 16 | 17 |   |
| 2 | 43 | 44 | 45 |   |
| 2 | 45 | -  | 46 |   |
| 3 | 6  | 7  | 47 |   |
| 3 | 17 | 18 | 48 |   |
| 4 | 9  | 10 | 49 |   |

If taken then remove  
the values of ADD1  
& MUL1

Label:

### → Simultaneous Multithreading



(16) Issue queues occupying the reservation stations

(achieving ideal work of the time)



Multithreading :-2.1 - 2.550 - 6 cores  
12 threads

It allows multiple threads share functional units of a single processor.

Multithreading

- Superscalar processor without multithreading
  - Coarse grain multithreading
  - Fine-grain multithreading
  - Simultaneous multithreading (SMT)
- } All are Superscalar -  
issuing at multiple inst  
in single processor

Single-core :- Maxm 4 issue (ie only 4 inst. in single clock cycle)

## 1) Superscalar processor :-

-  $I_4$  is dependent on  $I_1$ ,

-  $I_5$  having cache miss  
(so takes 2 extra clock cycles)



- Only 5 slots out of 20 are used remaining are idle



- All the inst. belonging of thread 1 is executed first. then new thread starts

a) Coarse grain multithreading :-

|                |     |                  |                  |                  |                  |
|----------------|-----|------------------|------------------|------------------|------------------|
|                | 1cc | I <sub>1</sub>   | I <sub>2</sub>   | I <sub>3</sub>   |                  |
| T <sub>1</sub> | 2cc | I <sub>4</sub>   |                  |                  |                  |
| T <sub>2</sub> | 3cc | I <sub>1</sub> ' | I <sub>2</sub> ' | I <sub>3</sub> ' | I <sub>4</sub> ' |
|                | 4cc | I <sub>5</sub> ' | I <sub>6</sub> ' |                  |                  |
|                | 5cc | I <sub>5</sub>   |                  |                  |                  |

- I<sub>5</sub> of T<sub>1</sub> is Not available due to Cache Miss.
- Similarly I<sub>4</sub>'
- Switching takes place when there is a long time (i.e. when there is cache miss)

b) Fine grain Multithreading :- Switching takes place at every clock cycle.



|                |     |                   |                   |                   |                   |
|----------------|-----|-------------------|-------------------|-------------------|-------------------|
|                | 1cc | I <sub>1</sub>    | I <sub>2</sub>    | I <sub>3</sub>    |                   |
| T <sub>1</sub> | 2cc | I <sub>1</sub> '  | I <sub>2</sub> '  | I <sub>3</sub> '  | I <sub>4</sub> '  |
| T <sub>1</sub> | 3cc | I <sub>1</sub> '' | I <sub>2</sub> '' | I <sub>3</sub> '' | I <sub>4</sub> '' |
| T <sub>2</sub> | 4cc | I <sub>5</sub>    | I <sub>6</sub>    | I <sub>7</sub>    |                   |
| T <sub>3</sub> | 5cc | I <sub>5</sub> '  | I <sub>6</sub> '  | I <sub>7</sub> '  |                   |

- Irrespective of dependency, threads are used.
- Horizontal slots are completely used but not vertical slots.

c) Simultaneous Multithreading :-

|  |     |                  |                  |                  |                  |
|--|-----|------------------|------------------|------------------|------------------|
|  | 1cc | I <sub>1</sub>   | I <sub>2</sub>   | I <sub>3</sub>   | I <sub>4</sub>   |
|  | 2cc | I <sub>4</sub>   | I <sub>2</sub> ' | I <sub>3</sub> ' | I <sub>4</sub> ' |
|  | 3cc | I <sub>5</sub> ' | I <sub>6</sub> ' | I <sub>5</sub>   |                  |
|  | 4cc |                  |                  |                  |                  |
|  | 5cc |                  |                  |                  |                  |

- Single clock cycles contain Inst. from diff. threads.
- Both horizontal & vertical slots are used.



In simultaneous multithreading, it flushes the inst. present in RS sitting idle & uses inst. from next thread.



Single core SMT has <sup>almost</sup> same performance as multiple cores without Multithreading  
 & ~~although~~ the SMT is less expensive



After a certain no. of IPC,  
 even though the no. of threads  
 increases  $\rightarrow$  SMT remains constant

as only a certain IPC can be  
 handled at a time after  
 using Multithreading

Thread level Parallelism :- TLP  $\rightarrow$  ILP

2 inst. - can't be executed at the same time using one core. One inst. executes & then the next one. To achieve TLP, more than one core should be there. More than one core will help in executing more instructions at a time.

### Single Core v/s Multicore

1CC      2CC      1CC      1CC      1CC  
 IF      ID      EX      M      WB      = 5CC

perform all these 5 functions in single clock cycle

For single core -

$\hookrightarrow$  if frequency  $\uparrow \rightarrow$  voltage  $\uparrow \rightarrow P \uparrow^3$  [  $P \propto V^2 f$  ]  
 cannot increase its capacity  
 as at high power system  $\rightarrow$  breakdown

In multicore, more than one thread will execute at same time

Flynn classification of computers:-

- SISD - Uniprocessor
- SIMD - Vector processor
- MISD - Not available commercially
- MIMD - Multiprocessor

Eg : Multicore (CMP) - Chip multiprocessor

We have cores on  
a chip.

Thread level parallelism :-

MIMD

- Centralised Shared memory multiprocessor
- distributed memory multiprocessor

We are switching to Multicore as in single core even upon using SMT, performance became stagnant to certain IPC even though no. of threads is increased.



Here if each core executes 1 inst then  $CPI = 1/n$

→ Distributed Memory Multiprocessor



All these above form can form a cluster

- Direction Based problem

→ Cache Coherence problem :- Let Shared memory has  $X=1$ . Let  $C_1, C_2, \dots, C_n$  read  $X$  from memory & places it in their cache. When  $C_2$  write the value of cache from  $X=1 \rightarrow 2$ , values only changes in its cache, all other will use the previous value of  $X$ . This problem is known as Cache Coherence problem.

Incoherent Support :- When  $C_2$  modifies the value of  $X$  but Shared memory keeps on sending  $X=1$  to other cores.

Cache

→ Write through Cache

When a core makes a change to  $X$ , it is propagated to all the cores & shared memory.

→ Write Back Cache

When a core makes a change to  $X$ , it is propagated only when  $X$  is removed from the core.

To solve Incoherent Support

- Write - Invalidate with Snoop Bus (DBP)
- Write Update with Snoop Bus

## Cache Coherence Problem :-



When the block is not available in cache = Cache Miss

Let  $C_1$  = Cache Miss

$\therefore V=1, T=A, Data=100$

Let  $C_2$  want to write A  $\therefore$  write miss (A not available in cache)

So read A & change A  $\rightarrow$  200

So the value of A in both the resources differed - this problem is called Cache Coherence problem.

Problem -  
1) Memory  
2) CPU  
3) Bus

At

The lo

→ +H00

RDA

Write  
back  
Cache

- (1) Write update using Snoopy Bus
  - (2) Write Invalidate using Snoopy Bus
  - (3) Write Update using Directory Based
  - (4) Write Invalidate using Directory Based
- } Soln of Cache Coherence problem.

1)  
RD-A  
Write Update

| V | T | Data |
|---|---|------|
| 1 | A | 100  |

Using  
Snoopy  
Bus

| V | T | Data |
|---|---|------|
| 1 | A | 200  |

$WR_A \leftarrow 200$

Shared Bus

SM

When Core 2 writes A, it sends a message regarding the change in value.

While message passing, Core 1 will check the change in the Bus -

(called snoopy Bus & changes its value).

If the Core 1 does not have A in Cache then none of the block will take the value. Only SM will change the value.

Problem- 1) Memory bus handling.

2) If A is modified  $10^3$  times then SM is also modified  $10^3$  times

v 3)

At one time, only one bus is present so if 2 cache simultaneously

changes the value of A then which one is allowed?

The cache which first access the bus, can change the value of block.

→ How to minimize the Use of Bus & Shared Memory ?



Core 2 will keep the latest value of block in its cache only. It will not propagate the value in Bus.

If  $D=1$ , then SM will not allow the change in SM.

If Core 1 requests to read A, Core 2 will supply as Cache speed is greater than memory speed. So  $A=100$  in Core 1. &  $D=0$  for Core 1.

When A block is removed from Cache 2, <sup>Value of A</sup> it is updated ~~is the value in SM~~ <sup>with</sup> ~~to~~ updates in SM.

→ How to minimize the access of Shared Bus ?



How the cache will know that the block is shared or not -  
 → when the cache is writing, it will check whether the block is shared or not.

If Not Shared,  $S = 0$

If Cache 1 wants to read A, Cache 2 will supply the value of A  
 $S = 1$  in both cache

When Cache writes a shared block, it broadcast the value on the bus.

If the block is removed from cache,  $S = 0$  again & no broadcasting takes place.

Eg



A is NOT removed from 1

- 1) How many Bus Access
- 2) How many Memory writers

] - No optimization / D / D & S.

RD -

SOP

1) No optimization

Memory writes =

first cycle

$$\begin{array}{r} 1 \cancel{+} + 999 = 1000 \\ 1 + 1 + 999 = 1001 \end{array}$$

Bus Access =

for reading A from SM

2) Using D bits

Bus Access =  $1 + 1 + 999 + 1 = 1002$

Memory writes = 1

↳ when A block is removed

if A is replaced

first cycle, D Sets D = 1

3) Using D &amp; S bits

Bus Access =  $1 + 999 + 1 = 1001$

Memory writes =  $\cancel{1} + 1$

Eg

CORE 0

RDA | 500 Times  
KIR A

CORE 1

RDA | 500  
WR A |  
D | 1002  
S | 502  
Memory supply data  
in No. optimization  
methodA is NOT  
removed  
from coreBus Access | NO opti  
499 = 1002  
1 + 1 + 1 + 1 + 499  
Memory writes | 1 + 499 + 499 + 1  
= 1000D | 1002  
1 + 500 + 1 + 500  
~~1~~ 0 | 0

2) Write Invalidate Using Snoopy Bus.

RDA

| V | T | D | S | DATA |
|---|---|---|---|------|
| 1 | A | 0 | 0 | 100  |

Shared Bus

| V | T | D | S | DATA |
|---|---|---|---|------|
|   |   |   |   |      |

KIRA → 300

INV(A)

SM

When core 2 want to write A, it will send a invalidate message to bus. Since  $D=1$  so no change in SM but in core 1 ~~V=A~~ V becomes 0 which means u can't access A because someone is writing A.

Core 0

$V=0, T=A, D=0, S=1$

Core 1

$V=1, T=A, D=1, S=1, Data = 300$

In case of sharing

If core 1 wants to update 1000 times, for that it can write 1000 times broadcasting the write message.

After updation, it sends message & changes  $V=0 \rightarrow V=1$



Since No one is accessing ~~A~~ so



|               | Replace A        | Replace A          | No Optimization    | D                  | D & S              |
|---------------|------------------|--------------------|--------------------|--------------------|--------------------|
| Bus Access    | $1+1+999 = 1001$ | $1+1+999+1 = 1002$ | $1+1+999+1 = 1002$ | $1+1+999+1 = 1002$ | $1+1+999+1 = 1002$ |
| Memory writes | $1+999 = 1000$   | 1                  | 1                  | 1                  | 1                  |

- Replace A in No optimization not require the bus as it is updating the value time to time.
- Reading not require the bus access in No optimization & D bcz updatation is done ~~at~~ <sup>at</sup> time to time.
- When the writing process takes place, it doesn't know that the Bus Block is shared or not so it broadcasts the message ~~everytime~~ first time.  
As reading process happen, ~~it~~<sup>it's</sup> knows the shared bit = 1

|              | Write Update<br>(D&S) | Write Invalidate<br>(D&S) |
|--------------|-----------------------|---------------------------|
| Bus Access   | 1002                  | $1 + 1 + 999 + 999 + 1$   |
| Memory write | 1                     | $1 + 1$                   |

↑ to get right data from Core D

| Eg-2           | Core 0              | Core 1              | (Without Replacing A) |
|----------------|---------------------|---------------------|-----------------------|
|                | RDA<br>WRA } 500    |                     |                       |
|                |                     | RDA<br>WRA } 500    |                       |
| Write Update → | No opti             | D                   | D&S                   |
| Bus Access     | $1 + 500 + 1 + 500$ | $1 + 500 + 1 + 500$ | $1 + 1 + 500$         |
| Memory writes  | $500 + 1 + 500$     | 0                   | 0                     |

With the help of read operation, Core 1 knows that there is no sharing but the 2<sup>nd</sup> read ~~not~~ gives the information of sharing of block.

|               | WU  | WI              |                                                          |
|---------------|-----|-----------------|----------------------------------------------------------|
| Bus Access    | 502 | $1 + 1 + 1 = 3$ | $\rightarrow$ 1 = Reading Core 0<br>1 = D Reading Core 1 |
| Memory writes | 0   | 0               | 1 = D0 Sending invalid message to Core 0                 |

→ MSI (Modified Shared Invalid)

MOSI  
MEST  
MOESI

| V | T | D | S | Data |
|---|---|---|---|------|
| 0 |   |   |   |      |



Invalid ( $v = 0$ )

block is present

but can't access

(someone is writing)

block not present

Modified = Only one valid block is present remaining all invalid

Want to  
WR A<sub>100</sub>

So send msg  
to bus

(J A<sub>10</sub>)  
Modified

| V | T | D | S | Data |
|---|---|---|---|------|
| 0 | A | 0 | 0 |      |

RD A  
(0 A 0)  
Shared

(0 A 0 0)  
Local write  
(Put WR on bus)

Local Read  
(Put RD request)  
Modified  
(J A<sub>10</sub>)

Shared  
(0 A 0 1)



Modified → Invalid (first update the value in memory)

Shared → Modified (Send Invalid msg to invalid all the shared blocks)

Eg Block X in main memory

|     | $C_1$ | $C_2$    | state of X<br>in $C_1$ | state of X<br>in $C_2$ |
|-----|-------|----------|------------------------|------------------------|
| RDX | -     | Shared   | Invalid                | Invalid                |
| -   | RDX   | Shared   | Shared                 | Shared                 |
| WRX |       | Modified | Invalid                | Invalid                |

Eg

|     | $C_1$ | $C_2$ | state of X in $C_1$ | state of X in $C_2$ |
|-----|-------|-------|---------------------|---------------------|
| RDX | -     |       | Shared              | Invalid             |
| -   | WRX   |       | Invalid             | Modified            |
| WRX | -     |       | Modified            | Invalid             |

→ How to minimise the Write Back operations?

$C_1$ : Block A in M

(MOSI)

$C_2$ : Read(A)  $\Rightarrow C_1:S | C_2:S$

$C_3$ : Read(A)  $\Rightarrow C_1:S | C_2:S | C_3:S$  (who provide data to  $C_3$ )

memory will provide the data when more than 1 cache have same value of A

To avoid problem  $\uparrow$   $C_1$  change its state from M  $\rightarrow$  O (O=owner)

and  $C_1$  will provide the block to  $C_2$

when the block is replaced from block, it will replace the value in memory.

→ MESI :- E = Exclusive



When the block read by core from the memory,

No copy of block is present in any of the core

So no need to send message to any core

∴ Exclusive block.

Common data =  $\emptyset$



→ MOEST :- Sharing a block, one block become owner & in case of exclusive, one core is exclusive.

| MOEST | MESI | MSI |
|-------|------|-----|
|       | MOSI |     |

Eg → MOEST - 3 cores 1 Block X is in memory

|           | $C_0$    | $C_1$         | $C_2$  |
|-----------|----------|---------------|--------|
| RDX       |          |               |        |
| Exclusive | Invalid  | Invalid       |        |
| Shared    | RDX      | Invalid       |        |
| Shared    | Shared   | RDX           |        |
| Invalid   | WRX      | Shared        |        |
| Invalid   | Modified | Invalid       |        |
| Invalid   | Owner    | RDX<br>Shared |        |
| RDX       | Shared   | Owner         | Shared |

| Eg | How many Memory Reads Bus Request | Memory will give data to $C_2$ |                 | MOEST           |
|----|-----------------------------------|--------------------------------|-----------------|-----------------|
|    |                                   | MESI                           | MOSI            |                 |
|    |                                   | $1+1+1+1 = 4$                  | $1 \oplus$      | $1$             |
|    |                                   | $1+1+1+1+1 = 5$                | $1+1+1+1+1 = 6$ | $1+1+1+1+1 = 5$ |

$C_1 \oplus RDA$

$C_1 \oplus WRA$

$C_2 \oplus RDA$

$C_2 \oplus WRA$

$C_3 \oplus RDA$

$C_3 \oplus WRA$

$C_2 \oplus RDA$

1. Acc. to MESI & MOEST,  $C_1 \rightarrow$  Exclusive but

Acc to MOSI,  $C_1 \rightarrow$  Shared so send a message to bus

2.  $C_1 \oplus$  Modified

3.  $C_2 : S$ ,  $C_1$  (Acc to 2<sup>nd</sup> & 3<sup>rd</sup>) = 0,  $C_1^{(1)} = S$

4.  $C_2 \oplus$  Modified,  $C_3 \oplus$  Shared Invalid

5.  $C_3 : S$ ,  $C_2 \nearrow S$  (1<sup>st</sup> protocol) → data from memory  
 $\searrow O$  (2<sup>nd</sup> & 3<sup>rd</sup>)

6.  $C_1 : S$ ,  $C_3 : S$ ,  $C_2 \nearrow S$   
 $\searrow O$

7.  $C_2 = \text{owner}$  ⇒ No Bus Access.

Reply

forward

## → Directory Based Cache Coherence Techniques

In case of snoopy bus →



- Not Scalable, can't go beyond 4-5 cores

Every read & write message pass through the bus. And though there is no relevant means of broadcasting in case of large no. of cores.

But in case of directory-based protocol

For each block X, one directory is created



If we have n caches then size of presence bit = 0 → n-1

block present in core 1 only 1 0 0 ...

0 0 0 ... = Not present in any cache

Let us assume that there are 8 cache



First directory check that all 0 <sup>from</sup> directory fetch memory & send data to cache 1 <sup>0</sup> 0 0 0 0 0 0 0 → 110000000

& C<sub>1</sub>: Invalid → Exclusive

so Dirty bit = 1 that means Cache 1 can write without any msg

After some time, C<sub>2</sub> wants to WRX <sup>0</sup> Send a message to directory to write X

directory check that the block is available in Cache 1 so it will forward the req. to Cache 1. So Cache 1 changes E → I & reply (msg) &

Send response msg to Cache 2

$$100000000 \rightarrow 101000000$$

Now that cache 2 wants block Y so request from Directory Y.

If instead of writing, reading req is made by Cache 2

$$\therefore C_1 : E \rightarrow S \text{ & Directory } \rightarrow 110000000 \rightarrow 111000000$$

Eg  
 Cache 0 Cache 1 Cache 2 Cache 3



Directory X :- 

|   |   |   |   |   |
|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 |
|---|---|---|---|---|

      Directory Y :- 

|   |   |   |   |   |
|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 |
|---|---|---|---|---|

Directory Z :- 

|   |   |   |   |   |
|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 |
|---|---|---|---|---|

~~SDM~~  $\xrightarrow{\text{Dir X :- 00000} \rightarrow 11000}$  (2 msg - request & response) for (cache 1)  $\therefore E$   
 $C_1 : RDX$

$C_2 : RDY$   $\xrightarrow{\text{Dir Y :- 00000} \rightarrow 10100}$  (2 msg :-  $C_2 : E$ )

$C_3 : RDY$   $\xrightarrow{\text{Dir Y :- 10100} \rightarrow \cancel{10100}}$  (4 msgs &  $C_2 : S, C_3 : S$ )

$C_4 : WRX$   $\xrightarrow{\text{Dir X :- 11000} \rightarrow \cancel{11010}}$  (4 msgs)  $C_1 : I, C_2 : M$

$C_5 : WRY$   $\xrightarrow{\text{Dir Y :- 10110} \rightarrow 10001}$  (6 msgs)

Block Y is present in 2 blocks  $\therefore$  send a msg to each cache

$\therefore C_2 : I, C_3 : I, C_4 : M$

$C_3 : RDX$   $C_4 : O \xrightarrow{C_3 : S}$   $DIRX : 10100 \rightarrow 10101$

- Eg  
 1) How many req. to directory      3) How many reply msg.  
 2) How many forward msg.      4) How many response msg.

|             | Request   | Forward   | Reply     | Response    |
|-------------|-----------|-----------|-----------|-------------|
| $C_2 : RDA$ | $1+1+1+1$ | $1+2+2+3$ | $1+2+2+3$ | $1+1+1+1+1$ |
| $C_3 : RDA$ |           |           |           |             |
| $C_4 : WRX$ | $+1$      | $-4$      | $+3$      | $-4$        |
|             | $-5$      |           |           | $= A5$      |

SelD $00000 \rightarrow H \oplus 11000$  $11000 \rightarrow 11100$ 

when cache want to ~~write~~<sup>read</sup> a shared  
data  $\Rightarrow$  No forwarding

 $11100 \rightarrow 11110$  $11110 \rightarrow \cancel{11111} 11111$  $11111 \rightarrow 11000$