

## High Performance Computing

CAO → Sequential Computer → "Instr" goes through instr cycle & then only next instr is fetched.

Disadv: Lot of h/w units in idle state while others working

Performance Enhancements → Pipelined Execution.  
Parallel Execution



Improvement → Employ multiple resources (people here).



There are sufficient people so that when a brick reaches, the person is free (consider if 3<sup>rd</sup> person is not present.  
At t=6, person 2 not free) → Pipelined execution.

[Delivery Rate]  
1 brick every  
4 unit of time

This is analogous to pipelined processor.

We divide same H/W into independent stages & one ~~person~~ stage passes the result to other stage on completion of work.  
If no work for stage 1, idle.

Initially, some stages are idle, but later, all stages work.

for one person, person need to be strong  
for many persons, each person can be weaker.]

Each brick still takes 10 units to reach other side.

So work time for each i/p is same but i/p rate is.

Latency → Amount of work done for one i/p  
→ remains same.

parallel res available at starting point



If 1 Sequential costs 100

Pipelining doesn't increase cost much 120  
(only latches to separate stages)

Parallelism → costlier.

Pipelined → gives equivalent results but has a demerit

## 20 unit Sequential Processor

100 MHz



clock freq  $\Rightarrow$



500 MHz

10 stages

$\sim 5$  times

$\sim 10$  times

clock freq  $> 1000$  MHz

Made of CMOS transistors

→ No power when in stable state  
only during transition.

More transitions  $\rightarrow$  more switching  $\rightarrow$  Power Diss  $\uparrow$

↓  
Ultimately heat will burn the  
CMOS transistor

Diode:  $\rightarrow$  clock freq  $\uparrow \rightarrow$  heat  $\uparrow$

transistors:  $\rightarrow$  Pipelining overheads  $\rightarrow$  Latch Delay decides min. delay.

At a pt. Latch Delay becomes significant  
to stage Delay. After that no more  
enhancements.

## Bottlenecks

Date \_\_\_\_\_  
Page \_\_\_\_\_

Fetch → for Memory  
Not registers

S/W aspect →

Dependency

H/W aspect →

Hazard



IF

ID

AG

DF

Ex

Ex

WB



$$y_1 \leftarrow y_2 + M[10 + y_3]$$

$$y_4 \leftarrow y_1 - y_5$$

$y_1$  is written to only at stage 5 (write back)

$$\text{add } y_1, y_2, 10 (y_3)$$

$$\text{sub } y_4, y_1, y_5$$



Execution of I2 requires  $y_1$ 's value at 25. Logically correct value only available at 30. ∴ idle from 25 to 30.

Pipeline Stall / stall Cycle / Hazard

Reason of Hazard  $\rightarrow$  Dependency (if common)  
if  $r_7$  instead of  $r_1$ ,  
no stall



structural  
↳ same resource required.



control  
↓  
due to branch instructions

conditional ↙  
Action depends on cond'n (flag)  
unconditional ↘  
Jump call return interrupt



Anti Dependency } True  
 output Dependency } False  
 These is no data flow  
 from one instruction  
 to another; only a  
 name is involved.

True → data flows from one instruction to another  
 I<sub>10</sub>=add Y<sub>1</sub>, Y<sub>2</sub>, Y<sub>3</sub>  
 I<sub>15</sub>=sub Y<sub>4</sub>, Y<sub>5</sub>, Y<sub>1</sub>

If value reqd is not available → Pipeline Hazard / Pipeline Stall  
 if dependencies resulting in stall → Hazard.

WAW [ add Y<sub>1</sub>, Y<sub>2</sub>, Y<sub>3</sub> ]  
 write after [ sub Y<sub>1</sub>, Y<sub>4</sub>, Y<sub>5</sub> ]  
 write

Dependency is there but  
 no data flow  
 ↳ eg of output dependency.

### Anti Dependency

WAR [ add Y<sub>1</sub>, Y<sub>2</sub>, Y<sub>3</sub> ]  
 write after [ sub Y<sub>3</sub>, Y<sub>4</sub>, Y<sub>5</sub> ]  
 Read



Dependencies shown  
picturelly in a dep.  
graph.

### Dependency Graph



Name dependency come in picture when out of order execution is allowed  
To increase efficiency.

Technique to handle these:

### Register Renaming

Normal Pipeline processor → No false dep. → But not max. efficiency.

↓  
Super Scalar Processor (Multiple Pipelined Processor)



4 Pipelines can fetch 4 instr's at a time.  
 ↳ Instr<sup>n</sup> fetch  
 in H/W, Pipelining H/W fetches 4 bytes at a time.

WAW add  $y_1, y_2, y_3$

sub  $y_{11}, y_4, y_5$

→ renamed.

All occ. after this of  $y_1$   
 replaced by  $y_{11}$

→ No dependency.

Can be implemented by H/W or S/W  
 ↳ Compiler (Superscalar)  
 ↳ VLIW

Pentium → SS Processors

486 → Deep Pipelined

(Very Large Inst<sup>n</sup> Word)



→ can't be used for true dependencies as there is actual data flow.

In True Dep  $\rightarrow$  Inst $^{n^1}$  Reordering.



Place independent inst $n^1$ 's b/w <sup>the</sup> dependent inst $n^1$ 's.



If data is generated by ALU & reqd by ALU, directly provide a bypass path instead of involving Register file.

## Control



AGAIN: —

5 —  
—  
—  
—  
—

Flag  $\rightarrow$  same  
Condition Code

10 → DCR C

11 → JNZ AGAIN

12 → —  
—  
—

If we know that we have to go to target (S) after EX,  
All work done will be dumped & Pipeline will be  
started from beginning.

will  
study to  
minimize this.

cycles in which we did not get data  $\rightarrow$  Control Hazard cycles.

Unconditional  $\rightarrow$  Low Penalty (after IDs we know)

Conditional  $\rightarrow$  None Penalty (we know after EX of arithmetic matrix)

$\rightarrow$  Branch Elimination

By replacing Branch instr's by  
cond'nal Arithmetic/Logical  
Instr's.

$\rightarrow$  Branch Speedup

$\rightarrow$  Early CC Set  
 $\rightarrow$  Delayed Branch

$\rightarrow$  Branch Prediction

done by compiler  $\rightarrow$  Insert indep. instr's b/w Arith. & Jump instr's.

DB  $\rightarrow$  Insert Indep. instr's after Jump which will be run  
irrespective of Jump (Not  $\rightarrow$  Better for Unconditional  
Jump).

? Try to predict outcome of branch & instr's in predicted  
direction will run without penalty

Branch Prediction

Branch History Bits

B.H.T

Branch History Table

Date \_\_\_\_\_  
Page \_\_\_\_\_ BranchHist

Single bit History  $\rightarrow$  0 → Branch not taken (not go to target)  
1 → .. Taken (will go to target)

→ Taken.



name.



HPC

Date \_\_\_\_\_  
Page \_\_\_\_\_

- lib)m.so shared object → depends on external libraries  
lib)m.a collection of object files archive files. → full fledged executable

## Branch Prediction

ARM

MIPS

SPIM → Allows. to choose b/w Branch, etc. <sup>Delayed</sup>

(Simulator)



C=10

≡

AGAIN :

≡  
≡  
≡  
≡  
≡

DCR C

JNZ AGAIN

Prediction Action

|    |    |
|----|----|
| NT | T  |
| NT | T  |
| T  | T  |
| :  |    |
| :  |    |
| T  | NT |

3 Wrong Predictions

If 1 bit  $\rightarrow$  2 wrong predictions

Each Branch has separate Predictors.

C=10

AGAIN1 :

≡  
≡  
≡

1-bit

AGAIN2 :  
≡  
≡  
≡

NT-T  
T-T  
T-NT

DCR C

Similarly for outer

JNZ AGAIN2

If outer 10 times & inner 10

DCR B

Inner Mis predictions  
 $\rightarrow 20$

JNZ AGAIN1

10  $\rightarrow$  8 times.



BTAC → Branch Target Address Cache

BTIC → B

B·H·T

|                |              |
|----------------|--------------|
| Branch Address | History Bits |
|----------------|--------------|

or if in 1-bit predictor, we can remove history bits & detect it by presence of address.

|                |              |                            |
|----------------|--------------|----------------------------|
| Branch Address | History Bits | Target Instruction Address |
|----------------|--------------|----------------------------|



ADD R1, 10(R2)

$$R1 \leftarrow R1 + M[10 + R2]$$

In case of Branch, DF will be Target Instr<sup>n</sup> Fetch

Target Address is not available before D.F. ∵ It has to wait till that ∴ Some penalty even if goes to target & predicted target.

∴ Target address stored in B·HT  $\Rightarrow$  0 cycle branch

## Branch Prediction

B.H.T



Correlation Branch Predictors

Captures the branches that are data Dependent

(If one branch taken, other is also taken  $\rightarrow$  this type of predictor)

or in dedicated appl" specific instr" processors like Router



## Shared Memory Multiprocessor

Cache Coherence  $\rightarrow$  Maintaining consistency in diff. copies of a memory in caches of diff. processors

Problems  $\rightarrow$

$\rightarrow$  Process Migration

Policies  $\rightarrow$  write update  $\rightarrow$  All others will also update  
write invalidate  $\rightarrow$  All others will invalidate copy



Static N/Ws →

oooo Linear array

oooo Ring

Dynamic N/Ws

Mesh



Tree

congestion problem



FAT Tree → As we go towards  
root, bandwidth of links  
increases.

BUS



Collision Domain.

# High Performance Computing

## Pipeline Models

we have 8 stages which have conn's going to next stage only

### Linear Pipelines



### Non-linear Pipelines

Feedback  
feed connection



Row for each stage of pipelined system

shows stage usage pattern

\* → stage inactive state

## Reservation Table

Cycles → 1 2 3 4 5

| Stages ↓ | S1 | X |   |   |   |
|----------|----|---|---|---|---|
| S2       |    | X |   |   |   |
| S3       |    |   | X |   |   |
| S4       |    |   |   | X |   |
| S5       |    |   |   |   | X |

Linear

↓

only along  
diagonal.

1 2 3 4

| S1 | X |   |   |  | X |
|----|---|---|---|--|---|
| S2 |   | X |   |  |   |
| S3 |   |   | X |  |   |
| S4 |   | X | X |  |   |
| S5 |   |   |   |  |   |

Non-linear

can have multiple  
crosses in a row/cell.

Multiple in row

↓  
multiple stages in  
1 clock cycle.

Multiple in 1 row

Why need for Non-linear?

single stage in multiple cycles.  
(Any NL can be converted to LR)Von Neumann  
Harvard Architecture→ Instr & Data in same memory  
unit

|                      | 1 | 2 |   |   |
|----------------------|---|---|---|---|
| (Fetch) IF / DF / WB | X |   | X | X |
| (Decode) ID          |   | X |   |   |
| (ALU) ALU AG / EX    |   |   | X | X |

using 1 H/W unit is utilised & another in ID. AG in ALU. DF again  
in fetching unit EX in ALU & WB again in fetching unit

$\therefore$  whole H/W divided into 3 ~~parts~~ parts.

for converting to L, for each op<sup>n</sup>, diff. H/W & unit stage.

By using more TSS.

In Harvard Arch  $\rightarrow$  Introduced 2 Memory Blocks instead of,

Linear

Demerits  $\rightarrow$  More Loss.

Merits  $\rightarrow$  Throughput  $\uparrow$

Many-to-1

$\Rightarrow$  each cycle can be employed to 1 task

Now - Both.



|                |                |                |                |                |                |                |
|----------------|----------------|----------------|----------------|----------------|----------------|----------------|
| 1              | 2              | 3              | 4              | 5              | 6              | 7              |
| T <sub>1</sub> | T <sub>2</sub> |                | T <sub>1</sub> | T <sub>2</sub> | T <sub>1</sub> | T <sub>2</sub> |
| T <sub>1</sub> | T <sub>2</sub> |                |                |                |                |                |
|                |                | T <sub>1</sub> | T <sub>2</sub> | T <sub>1</sub> | T <sub>2</sub> |                |

$\rightarrow$  No conflict for 2 tasks  
but in 3 tasks

|                |                |                |                   |                |                |                |
|----------------|----------------|----------------|-------------------|----------------|----------------|----------------|
| 1              | 2              | 3              | 4                 | 5              | 6              | 7              |
| T <sub>1</sub> | T <sub>2</sub> | T <sub>3</sub> | T <sub>1</sub>    | T <sub>2</sub> | T <sub>1</sub> | T <sub>2</sub> |
| T <sub>1</sub> | T <sub>2</sub> | T <sub>3</sub> |                   |                |                |                |
|                | T <sub>1</sub> | T <sub>2</sub> | (T <sub>1</sub> ) | T <sub>2</sub> |                |                |

If we start 2 tasks at 1 & 2,  
we can start 3<sup>rd</sup> only at 8<sup>th</sup> cycle  
 $\therefore$  no op in many cycles

Avg. No of ops / unit time  $\uparrow$

$\therefore$  Linear Pipelining reqd.

# Pipeline Models

classmate

Date \_\_\_\_\_

Page \_\_\_\_\_

Asynchronous

Synchronous



Synchronous  
All stages separated by latch given same clock



Max. Stage Delay

Neglecting (latch delay,  
 $(\max_{i=1}^N T_i)$ )

→ (delay of slowest stage)

considering it,

$$\text{Clock Cycle Time } = \left( \max_{i=1}^N T_i \right) + d$$

$$= T_m + d$$

Pipeline Frequency  $\rightarrow \frac{1}{T}$   
clock

clock doesn't arrive at diff places at  
same time

classmate

Date \_\_\_\_\_

Page \_\_\_\_\_

## Clock Skewing

Active Clock Edges do not arrive at various places in the hardware at same moment.



Inertial Delay  
Transport Delay  
due to  $c_e$  which  
take time to go  
from one place  
to another.

## Transmission delay

Parasitic capacitance

→ gets increases as we go far away.



## clock Net/Net

clock provided  
at many places  
& tapping provided  
at multiple  
places)  
& then taken  
from this net.



Speedup =

$N$  tasks

$K$  stage

Uniform Stage delay  $\tau$



$$\begin{array}{ll}
 5T & 5T + 0T \\
 2 \times 5T & 5T + 1T \\
 3 \times 5T & 5T + 2T \\
 \vdots & \vdots \\
 N \times 5T & 5T + (N-1)T
 \end{array}$$

$$\frac{\text{Time Required in a Sequential System}}{\text{Time Required in a Pipelined system}} = \frac{N \times 5T}{5T + (N-1)T}$$

$$\frac{N \times kT}{kT + (N-1)T} = \frac{NK}{k + (N-1)}$$

$$\lim_{N \rightarrow \infty} S_k = \lim_{N \rightarrow \infty} \frac{NK}{k + (N-1)} = \lim_{N \rightarrow \infty} \frac{NK}{\left[ \frac{k}{N} + \frac{N-1}{N} \right]} = \frac{k}{1}$$

→ Uniform stage Delay is not possible

## Latency Analysis, Collision Free Scheduling

|       |   |   |   |   |   |   |
|-------|---|---|---|---|---|---|
| $S_1$ | X |   |   | X |   |   |
| $S_2$ |   | X |   |   |   | X |
| $S_3$ |   |   | X |   |   |   |
| $S_4$ |   |   |   | X | X |   |
| $S_5$ |   | X |   |   | X |   |

collision  
if at same time,  
two tasks  
require same  
stage.

Latency:

the difference in two successive task initiations.

Forbidden

may lead to collision

Permissible

will not suffer from collisions

1 → to represent latencies

MSB of CV will always be 1

Shows max. forbidden latency

Foot → count dist b/w all & eff crosses in each row of Reservation table.

Binary Vector named Collision Vector

Forbidden latencies are shown by 1 and permissible by 0

Most Significant Bit of a Collision Vector will always be 1

Forbidden →  $S_1 \rightarrow 3$

$S_2 \rightarrow 5$

$S_3 \rightarrow -$

$S_4 \rightarrow 1$

$S_5 \rightarrow 3$

}

1, 3, 5

As latency 1 is  
forbidden (bcz of sq)  
P.  
B+

Collision Vector

$$\begin{array}{c} 1 \ 0 \ 1 \ 0 \ 1 \\ \hline 5 \ 4 \ 3 \ 2 \ 1 \end{array}$$

10101

If we start  $T_2$   
at  $T_2$ , 2<sup>nd</sup> cycle  
will be 1<sup>st</sup> cycle  
for  $T_2$

To get state now,  
right shift(logical)

If shift given, we  
cannot initiate  
task, otherwise  
we can

Logical shift right  
the collision vector of the state

Bitwise OR with original  
collision vector

00101010101

10101  
10101

00000101010

10101  
10101

10101

2

Collision  
Diagram

Simple Cycle  $\rightarrow$

start with anode, visit others &  
come at start without visiting anode  
multiple times

Simple Cycle  $\rightarrow$  2  
 $\downarrow$   
 $\rightarrow$  4

Greedy Cycle

involved latencies

$\rightarrow$  in which all cycles are  
minimal permissible latencies  
for all nodes in the cycle.

We can make pipeline structure from Table but not table from Pipeline Structure

|   | 1                | 2 | 3 | 4 | 5 | 6 |
|---|------------------|---|---|---|---|---|
| 5 | $\leftarrow S_1$ | X |   |   |   | X |
| 3 | $\leftarrow S_2$ |   | X | X |   |   |
| - | $\leftarrow S_3$ |   |   | X |   |   |
| - | $\leftarrow S_4$ |   |   | X |   |   |
| - | $\leftarrow S_5$ |   | X |   |   |   |



$01010(0)$   
 $10100$   
 $\underline{11110}$

$01111(0)$   
 $10100$   
 $\underline{11111}$

$00101(00)$   
 $10100$   
 $\underline{10101}$

$00101(01)$   
 $10100$   
 $\underline{10101}$

$00001(010)$   
 $10100$   
 $\underline{10101}$

shortest path  
 $000011111$   
 5 4 3 2 1  
 1 : cont takes



Simple Cycle

1, 1, 6  
 2  
 4

otherwise or  
 with original  
 vectors

Edge 6 will  
 be from  
 all nodes  
 to original  
 write only if  
 no cycles (??)  
 from nodes

Greedy  
 ↓  
 all nodes of  
 cycle are  
 from initial  
 permissible  
 edge.

Avg. Latency  $\rightarrow 1, 1, 6 : \frac{1+1+6}{3} = \frac{8}{3} = 2.67$

|   |   |   |
|---|---|---|
| 2 | : | 2 |
| 4 | : | 4 |

Greedy cycle : [1, 1, 6  
 2]

Premises

Dt.

Pg.

Bounds on MAL

Cannot be lower  
than max. no of  
crosses in any  
row of table  
 $\Leftrightarrow \geq 2$

We got 2

We got optimal  
MAL

M.A.L

Minimal Avg. Latency = 2

min. of avg. latency of all Greedy cycles.

Optimal MAL

Programming Assignment

C/C++/Java/Python/R

Input Text file will have Reservation Table

|   |   |   |   |   |   |   |
|---|---|---|---|---|---|---|
| x | , | , | , | , | , | x |
| , | x | , | , | x | , |   |
| , | , | x | , | , | , |   |
| , | , | , | x | , | , |   |
| , | , | , | , | x | , |   |

Text file name will be provided on command line  
Compute Collision Vectors, all simple cycles, Greedy cycle  
and MAL

Find whether the MAL is optimal or not.

HPCForbidden  $\rightarrow 1, 2, 4$ Permissible  $\rightarrow 3$ Collision Vector  $\rightarrow 1011$ 

|                | 1 | 2 | 3 | 4 | 5 |
|----------------|---|---|---|---|---|
| S <sub>1</sub> | X |   |   |   | X |
| S <sub>2</sub> |   | X |   | X |   |
| S <sub>3</sub> |   |   | X | X |   |

(1011) 3

$$\begin{array}{r} 0001011 \\ 1011 \\ \hline 1011 \end{array}$$

Simple cycle = 3

MAL = 3

Greedy Cycle = 3

optimal MAL = max. number of cross marks in any row  
of the reservation table.

### Bounds of MAL

- MAL is upper bounded by total number of 1's in the initial collision vector plus 1
- MAL is lower bounded by the max. no. of cross marks in any row of the reservation table.
- MAL is lower than or equal to the average latency of any greedy cycle in the Collision State Diagram.

2 Average Latency of any greedy cycle  
is upper bounded by the no. of 1's in initial  
Collision Vector plus 1.

For above eg  $\rightarrow$  Upper bound  $\rightarrow 4$   
Lower bound  $\rightarrow 2$ .

$\therefore$  This table is not optimal

if we don't get optimal MAL, we may insert non-compute delay to approach towards optimal value.



|    | 1 | 2 | 3 | 4 | 5 |
|----|---|---|---|---|---|
| S1 | X |   |   |   | X |
| S2 |   | X |   | X |   |
| S3 |   |   | X | X |   |

DTN (Deadlock Transfer Node)

Work done in 4 to be transferred next stage  
 $\therefore S_2 \rightarrow S_1$   
 Delay &  $S_3 \rightarrow S_1$   
 delay.

|    | 1 | 2 | 3 | 4       | 5   | 6 |
|----|---|---|---|---------|-----|---|
| S1 | X |   |   |         | (X) | X |
| S2 |   | X |   | X       |     |   |
| S3 |   |   | X | (X) → X |     |   |
| D1 |   |   |   |         | X   |   |

increase To ~~the~~ latency, separate  $S_3$  with latency.



Forbidden - 2, 5

Permissible  $\rightarrow 1, 3, 4$

5 + 32,  
 10010



Simple Cycles  $\rightarrow$

1, 3, 3

3

4, 3

4

Greedy  $\rightarrow$  1, 3, 3

$$\text{MAL} \rightarrow \frac{1+3+3}{3} = \frac{7}{3} = 2.33$$

Although task is residing for longer duration but throughput is increased.

|    | 1 | 2 | 3 | 4   | 5   | 6   | 7 |  |
|----|---|---|---|-----|-----|-----|---|--|
| S1 | X |   |   |     | (X) | (X) | X |  |
| S2 |   | X |   |     |     |     |   |  |
| S3 |   |   | X | (X) | X   |     |   |  |
| D1 |   |   |   | X   |     |     |   |  |
| D2 |   |   |   |     |     | X   |   |  |



## Instruction Pipeline Design



ARM  
 MIPS  
 Hennessy & Patterson



Instn?

Register file → 2 Instruction Port

You can do 2 read op's at the same time (same reg no diff ref.)

ld r1, 10(r2)

r2  
src110  
src2dst  
r1

JMP



All  
Instructions  
are

of uniform size equal to  
32-bits

MEM

Data Memory

MUX

Default increment → 4. Sometimes (in PC relative) increment by other value & (specified in instn\*)



## H.P.C.

### Prefetch Buffers



fetched by Mem. Unit & later

instr's are directed to Seq- Buffers

Whenever Target address generated

for Branch instr<sup>n</sup> then that instr<sup>n</sup> corresponds to that address is stored in Target Buffer. When exact direction is known, if Branch goes to target, instr<sup>n</sup> in Target Buffer will be provided to Pipeline & then they will swap their functionality.

### Branch Instruction

- When Branch instruction is executed and Branch goes to Target → Target Buffers start working as Sequential Buffers and vice-versa.

multiple functional Units → later

## Internal Data Forwarding



→ How dep b/w ALU type instr's.

add  $y_1, y_2, y_3$

sub  $y_4, y_5, y_1$

~~~~~

load  $y_1, 10(y_2)$

→ b/w load & ALU

add  $y_4, y_5, y_1$

~~~~~  
add  $y_4, y_5, y_1$   
store  $y_4, 10(y_7)$

→ b/w ALU &  
store

## Store - Load Forwarding



CISCMOV R2, R1

available.

RISCadd  $(r_1), r_2, r_3$ y  
regd.(sub  $r_4, r_5, r_1$ )

data not available  $\therefore$  need  
to wait for 3 cycles  
 $\therefore$  3 cycles waste  $\rightarrow$  no op.

value is generated in Ex of  $1^{st}$  & regd. in Ex of  $2^{nd}$   $\dots$  instead  
of waiting & then reading,  
provide a direct path from o/p of ALU to the i/p of ALU

& activate the bypass path if o/p of one instr $^{th}$  is same as  
source of ~~other~~ instr $^{th}$  of next.

2 instr $^n$ s that are ALU type & dest $^n$  of  $1^{st}$  = Source of  $2^{nd}$ .  
 $\therefore$  logically wrong value (old) is not used at all.

If ALU followed by ALU, we can fully remove stall cycles

But if  $\rightarrow$ Load  $r_1, 10(r_2)$ add  $r_4, r_5, r_1$ 

|    |    |    |     |    |
|----|----|----|-----|----|
| IF | ID | EX | MEM | WB |
|----|----|----|-----|----|

|    |    |    |      |        |
|----|----|----|------|--------|
| IF | ID | EX | MEM  | WB     |
| IF | ID | X  | (EX) | MEM WB |

Here, reqd. cycle before being generated : scheduled after 1 cycle

ALU op → Tapping → to MUX joined to ip of ALU

Data Forwarding → Purely HW soln.

Compiler → Instr<sup>n</sup> Reordering

## Hazard

### Bernstein Conditions

False-Anti

Inst<sup>n</sup> Process  
Statement Task.

Inst<sup>n</sup> followed by Inst<sup>n</sup>  
Process :

If { Input of  $I_1 \wedge$  output of  $I_2 = \phi$        $I_1 \left\{ P_1 \left\{ S_1 \right\} T_1 \right.$   
                 Output of  $I_1 \wedge$  <sup>Input</sup> <sub>Output</sub> <sup>TRUE</sup> of  $I_2 = \phi$        $I_2 \left\{ P_2 \left\{ S_2 \right\} T_2 \right.$   
                 Output of  $I_1 \wedge$  output of  $I_2 = \phi$

then  $I_1$  is said to be parallel to  $I_2$

(if both ip - no dependency)

False + Anti WAR

TRUE                    RAW

false output            WAW

$\rightarrow I_1 : \text{add } x_1, x_2, x_3$

$I_2$

$I_3$

$\rightarrow I_4 : \text{load } x_3, 10(x_5)$

$I_5$

If normal pipeline, no effect of false dependency (Anti-Above)  
If ~~instr<sup>n</sup> reordering~~  $\rightarrow$  Problem

$\hookrightarrow$  To take max adv. of parallelism.

If 1 instr<sup>n</sup> waiting of some other thing to happen, then rest of independent instr<sup>n</sup>'s can run.

To increase utilized Parallelism.

## Dynamic Instruction scheduling

Scoreboard

Tomasulo's Algorithm

Scoreboard

Computer CDC - 6600

Tomasulo's

IBM 360/91

Scoreboard appeared earlier in market & Tomasulo came 3 yrs later

### Scoreboard

To achieve the IPC =

IPC Instruction per cycle

CPI Cycles per instruction

$$(CPI = \frac{1}{IPC})$$

we don't get  $IPC = 1$  because of Dependencies causing Hazards

Stall cycles

Dependencies



Hazards

Ideal case  $\text{Avg. CPI} = 1$

~~CPI  $\neq 1$~~

$$\text{Avg. CPI} = \boxed{1 + \frac{\text{Avg. Stall cycles}}{}}$$

## A Modern Superscalar Processor



Dynamic Inst<sup>n</sup>  
Scheduling

Scoreboard 1963 ← CDC 6600  
Tomasulo 1966 → IBM 360/91

Scoreboard → To achieve IPC=1  
However, they also used stalling  
to handle dependencies

Tomasulo → Implemented Inst<sup>n</sup> Reserving, Register, Loop Unrolling  
Renaming

In normal Pipeline  $\rightarrow$  Only 1 EX unit to handle all type of operations.

Heterogeneous functional units.

Load/Store by FUs.

As complexity of  $f^n \uparrow$ , we try to have more pipeline units (stages)

$\therefore$  Integer  $\rightarrow$  1, FP  $\rightarrow$  2, MM  $\rightarrow$  3

- (1) To improve time of an operation, either parallelism (multiple execution units) or divide into stages.
- (2) H/w  $\rightarrow$  Disadv Cost.

### Pipeline Adv -> Cost

Simpler (No H/W reqd to monitor allocation of res.)

Data Mem  $\rightarrow$  L-Cache  $\rightarrow$  Fetch Unit  $\rightarrow$  Stores in  
Fetch Decode Buffer (Acts as latch)

↓  
Decode Unit will decode & store in DI Buffer.

Issue  $\rightarrow$  will look at type of opn & direct them to relevant Functional Units via Reservation Stations (buffers)

Can go directly to FUs if no "instx" waiting in Reservation Station (without buffer  $\rightarrow$  implementation dependent).

↓  
Reorder Buffer  $\rightarrow$  Complete.

These FUs can take diff. amounts of time

We can allow out of order Execution but not ~~out of order completion~~

- out of order exec but in order completion

Complete  $\Rightarrow$  Restores the program exec<sup>n</sup> order.

Last stage  $\rightarrow$  Mainly to handle memory inst<sup>n</sup> cycle.

If written into buffer  $\rightarrow$  assumed work is over

Other H/W will ~~not~~ keep waiting to memory.

(cant read & write to mem at same time)

## Dynamic Instruction Scheduling with Scoreboard

- Scoreboard is a centralized hardware that tries to maintain an execution ~~rate~~ rate of one instruction per cycle by executing an instruction as soon as its operands are available in registers and no hazard conditions prevent it.
- We have IF, ID<sub>1</sub>, ID<sub>2</sub>, EX, WB stages
- Every instruction goes through Scoreboard where a record of data dependencies is constructed.
- A system with Scoreboard is assumed to have several Functional Units (FUs) with their status information reported to the Scoreboard.
- If the score board determines that ~~one~~ an instruction can not execute immediately, it executes another waiting instruction and keeps monitoring FU's status to decide when the instruction can proceed to execute

Issue (ID<sub>1</sub>)  $\rightarrow$  An instruction can be issued if

- Required FU is available (structural Hazard)
- No WAW Hazard (will not allow if dest<sup>n</sup>s. same)

ID2 (Read operand) :- a FU can read the source operand when all of them are available

(If some earlier op<sup>n</sup> has marked the reg to be write-backenable.) / If s byte inst<sup>n</sup> + if r<sub>1</sub> - av. but not r<sub>2</sub>, w<sub>1</sub> not continue

A register is not available if some earliest active instruction has marked it for writing.  
(No RAW hazard)

EX (Execute) → When all source operands are available, FU starts execution

Write Result (WB) - FU can write the result to a reg. on execution completion if there is no WAR hazard.

(Later inst<sup>n</sup>'s are not allowed to write if WAR hazard.  
as old instr<sup>n</sup> may require old value of a reg.)

Hennessy  
Patterson

|                |     |                                   |                                 |
|----------------|-----|-----------------------------------|---------------------------------|
| I <sub>1</sub> | LD  | F <sub>6</sub> ,                  | 34 (R <sub>2</sub> )            |
| I <sub>2</sub> | LD  | F <sub>2</sub> ,                  | 45 (R <sub>3</sub> )            |
| I <sub>3</sub> | MUL | F <sub>0</sub> , F <sub>2</sub> , | F <sub>4</sub>                  |
| I <sub>4</sub> | SUB | F <sub>8</sub> , F <sub>6</sub> , | F <sub>2</sub>                  |
| I <sub>5</sub> | DIV | F <sub>10</sub> ,                 | F <sub>0</sub> , F <sub>6</sub> |
| I <sub>6</sub> | ADD | F <sub>6</sub> , f <sub>8</sub> , | F <sub>2</sub>                  |

### Dependency Graph



|                                                 |                                          |
|-------------------------------------------------|------------------------------------------|
| Instructions<br>earlier<br>in Program<br>order. | Instruction<br>Later in Program<br>order |
|-------------------------------------------------|------------------------------------------|



Scoreboard can  
be constructed with  
this graph.

|         | No. of FU | Ex.<br>cycles |
|---------|-----------|---------------|
| Integer | 1         | 1             |
| MUL     | 2         | 10            |
| FP ADD  | 1         | 2             |
| FP DIV  | 1         | 40            |

## Instruction Status

|     | J   | K  | Issue | Read operand | Complete | Ex Result | Write Result |
|-----|-----|----|-------|--------------|----------|-----------|--------------|
| LD  | F6  | 34 | R2    | 1            | 2        | 3         | 4            |
| LD  | F2  | 45 | R3    | 5            | 6        | 7         | 8            |
| MUL | F0  | F2 | F4    | 6            | 9        | 19        | 20           |
| SUB | F8  | F6 | F2    | 7            | 9        | 11        | 12           |
| DIV | F10 | F0 | F6    | 8            | 21       | 61        | 62           |
| ADD | F6  | F8 | F2    | 13           | 14       | 16        | 22           |

## FU Status

|              | BUSY | dest | src1           | src2           | FU for J       | FU for K       | FJ?            | FK?            |
|--------------|------|------|----------------|----------------|----------------|----------------|----------------|----------------|
| Integer Load | Yes  | OP   | F <sub>i</sub> | F <sub>j</sub> | Q <sub>j</sub> | Q <sub>k</sub> | R <sub>j</sub> | R <sub>k</sub> |
| MUL1         | Yes  | LD   | F6F2           | (34)45 R2 R3   | —              | —              | Yes            | Yes            |
| MUL2         | Yes  | MUL  | F0             | F2             | F4             | Integer4       | —              | No Yes         |
| ADD/SUB      | Yes  | SUB  | F8             | F6             | F2             | —              | Integer        | Yes            |
| DIV          | Yes  | DIV  | F10            | F0             | F6             | MUL1           | —              | No Yes         |

## Register Status

| FU →    | F0 | F2 | F4 | F6 | F8 | F10 | F12 |
|---------|----|----|----|----|----|-----|-----|
| Integer | —  | —  | —  | —  | —  | —   | —   |

MUL1 Integer

MUL1

ADD/SUB DIV

ADD/SUB DIV

Right now only "inst" :- no WAR Hazard ∴ can issue.

Issue is always done in Program Order.  
Rest of stages may be out of order

We cannot issue any other inst "I3" even if resource available as Program order need to be same.

4 → Integer load marked free

In Tomasulo's algo  $\rightarrow$  Enhancement.

We will read from FU (Reservation station) instead of Registers



Data Forwarding  
through  
Common Data  
Bus

through will all FUs  
will write their result  
to registers.

Data Bus  $\rightarrow$  Data &  
source of Data (whose  
writing) is present.

\* Each chip,

will store which FU has to write. When Data bus has  
source as that FU,

if to writing in Register file, data is read ~~by FU~~  
from bus.  $\therefore$  1 cycle is saved,

## Tomasulo's Algorithm

→ Control and Buffers distributed with FUs

vs. centralized in Scoreboard

\* FU buffers are called Reservation Stations

These Reservation Stations are also referred to

as renaming registers which are different

from architectural registers specified by ISA

- ISA registers in instructions are mapped by either values (if available) or pointers (renamed) to reservation stations that will supply value later on

This process is called Register Renaming

→ It allows hardware based loop unrolling

→ Instruction results go (forwarded) from RSs to RSs, not through registers, over common data bus which broadcasts result to all waiting RSs

→ Load and store are treated as FU with RSs as well