

## HPC

physical problem

↓  
mathematic model

↓  
numerical soln

↓  
simulate

] HPC

performance metrics → power consumption, FLOPS

- ① Theoretical peak performance ( $R_{\text{theory}}$ )
- ② Real performance ( $R_{\text{real}}$ )
  - ↳ flops for specific operations.

sustained performance : for a particular period of time, total flops ( $R_{\text{sustained}}$ )

$$R_{\text{sustained}} < 10\% R_{\text{theory}}$$

$$R_{\text{sustained}} \ll R_{\text{real}} \ll R_{\text{theory}}$$

CPU not given data  $\Rightarrow$  CPU is idle

$\rightarrow$  performance ↓

CPU  $\rightarrow$  ALU  
 $\downarrow$  CU

Dual core  
 $\rightarrow$  2ALU  
 $\rightarrow$  1CU

Quad core  
 $\rightarrow$  4ALU  
 $\rightarrow$  1CU

fast CPU requires fast memory

Computer time  $\rightarrow$  CPU cycle  
 $\downarrow$  min time to execute one instruction  
Time to execute the programme

$\rightarrow$  Time for 1 cycle  $\times$  no of instructions

Computer time measured in CPU cycle

Time to execute given programme      min time to execute 1 instruction.

$$T = n_c \times t_c \Rightarrow n_i \times \frac{n_c}{n_i} \times t_c \Rightarrow n_i \times CPI \times t_c$$

$n_c \rightarrow$  no of CPU cycles  
 $t_c \rightarrow$  time of the cycle

$n_i$  = no of instructions

CPI  $\rightarrow$  no of clocks required for 1 instr

Reduce

1) cycle time ( $t$ )

increasing frequency

2) Reduce  $n_i$  (# of instr<sup>ns</sup>)

e.g.

$$c = a + b \rightarrow \text{inst 1}$$

$\downarrow$

load a

load b

Add a and b

store c



better

compile, Interpret

Assembler

③ reduce CPI

clock per instructions

parallelism & pipelining

Reduce CPI  $\rightarrow$  interleaved execution, overlapping execution, overlapping execution of # of instrn.

parallelism  $\rightarrow$  in single processor computer

① pipelining

②  $\uparrow$  no of cores

③ functional units

$\rightarrow$  overlap operation of diff units of computer.

→ Increase speed of ALU by exploiting data temporal parallelism.

How to make computers fast.

Reduce CPI : Key is llism.

multiple CPU : → 2CPUs → llism in true sense  
parallelism.

Simultaneous execution → llism.

Interleaved execution → ~~one task at a time~~ but CPU is never idle.

FLYN classification of 11 ARCH's.

If you are able to read / write 2 data items @ time → called llism.

→ read / generation : etc.

collection of instructions → instruction stream.

data stream → { CPU → memory : write  
memory → CPU : read

parallelism can happen on both data & instruction stream.

← data stream. →



SISD : → Scalar processors → pipelining only solution to achieve 11xm here

→ Deterministic execution → tells what instructions are being executed when

SIMD → single instruction multiple data,

MISD → less significant.

MIMD → achieve 11xm in true sense.

SISD bottleneckSISD bottleneck

→ low level of I/Oism : due to data & control dependency

solutions →

- ① Pipelining
- ② Super scalar → multiple execution units
- ③ super pipeline scalar → fetches smaller chunks  
(complexity increase).

Pipelining →

2 instruction per cycle → goal

→ overlapped / interleaved execution → pipelining.

→ 1 instruction can be divided into 4 phases of stages

- i) Instruction fetch → memory to register
- ii) Read/decodes → memory → register.
- iii) execution → ALU
- iv) write back → reg → memory

for 7 instructions →  $3 \times 4 = 12$  cycles.  
 without pipelining →  $3 \times 7 = 21$  cycles (7 instructions)



after 7 → I7 over total time → 7.  
 ↳ after pipelining

measure the performance of pipelines →

- ① speedup
  - ② efficiency justification
- no of stages in an instruction.
- depth of pipeline → no of ops that can be performed in 1 cycle.

$$CPI = \frac{n + \text{depth} - 1}{n} \Rightarrow \frac{\text{total no of CPU cycles}}{\text{total no of instructions.}}$$

efficiency →  $\frac{\text{no of boxes utilised}}{\text{Total no of pipelines.}}$

speed up →  $\frac{\text{Time to execute without pipelining}}{\text{Time with pipelining}}$

→ ~~Eff.~~ ~~Eff.~~ ~~Eff.~~

→ in general -

$n \times \text{efficiency} \rightarrow \text{speed up.}$

$\frac{\text{no of stages}}{\text{depth}}$

○ efficiency < 1

speedup < m.

$$\text{speedup} = \frac{D}{CPI} \rightarrow \text{depth.}$$

$$T_{\text{nop}} \rightarrow \text{Time} \Rightarrow 1 \times \text{depth}/m + (n-1) \times 1$$

without pipeline

m → depth.

, n → no of instructions

$$T_{\text{wp}} \quad \text{Time with pipeline} \Rightarrow nxm.$$

$$\text{now speedup} \rightarrow \frac{T_{\text{nop}}}{T_{\text{wp}}} \Rightarrow \frac{nxm}{m+n-1}$$

## Inhibitors of pipelining

- ① control dependencies
  - ② data dependencies
  - ③ structural dependencies
- due to branch statements. → branch penalties happen.  
 when next operation is jump statements.  
 Then jump should run first we have to wait for the upper instruction & not start new  
 ↳ use NOP operation  
 ↳ done by smart compilers  
 ↳ not hardware support
- pipelining + stalls → Hardware support
- ↓  
 different from NOP.
- rearrange instructions → software  
 → rescheduling instructions  
 → Branch prediction → hardware  
 → Data dependency →
- when an instruction depends on the data coming from other instructions.



RAW → read after write [flow] true data dependency

WAW → write after read [anti-data dependency]

WAW → write after write [output data dependency]

### examples

RAW: read after write

$$I_i: R_1 \leftarrow R_1 + R_3$$

$$I_j: R_4 \leftarrow R_1 + R_5 .$$

WAR:

$$I_i: R_1 \leftarrow R_2 + R_3 \quad R_2 \text{ value vs to}$$

$I_j$ : load m,  $R_2$  read here  
its value

$I_j$  is update earlier

WAW: write after write [output data dependency]

$$I_i: R_1 \leftarrow R_2 * R_3$$

$$I_j: R_1 \leftarrow R_5 + R_6$$

\* → slower than addition

execution  $I_j$  will more than  $I_i$   
so  $I_j$  will change first (execute first).

sequence not followed.

solutions →

lock → draback → stall  
stall → no op.

Integrate registers → hold intermediate outputs of p.  
↳ called operand forwarding

structural dependence →

resource conflict in the pipeline  
eg → Bus  
when more than one instruction tries to fetch the same resource in same cycle then structural dependence.



same time S<sub>1</sub> & S<sub>2</sub> can ask for cache conflict

|    |   |    |   |    |
|----|---|----|---|----|
| I1 | 1 | 12 | 3 | 12 |
| I2 | 2 | 2  | 2 | 3  |
| I3 | 1 | 2  | 1 | 2  |
|    |   |    |   |    |

|          |  |
|----------|--|
| Page No. |  |
| Date     |  |

sol" →

- ① resource duplication (instead we have multiple copies of resource)

here divide cache into 2 parts

① instruction cache

② data cache

problem 2

instruction 3 at  $S_2$  & 1 at  $C_2$  both accessing  
at time  $T_4$

solution →

4 stage to 5 stage pipeline

microprocessor → MIPS RX1000  
(MIPS R2000/R3000)

stages →

- 1) IF → first fetching using  
~~I-cache~~  
Instruction cache

- 2) RD : operand loading (reading) from register file 'RF' will be decoding the fetched instruction.. whatever operands you require are loaded into register file .
- 3) execute → data processing using ALU & register file . as needed
- 4) memory access → operand accessing (load / store) using data - cache
- 5) write back → operand storing (written back in register file).

types of bus →  
now,

$$CPI = 1$$

what is throughput → no of inst executed  
2 sec

$CPI = 1 \rightarrow$  means in 1 clock cycle, 1 inst is execute

if CC = 1 ns  $\rightarrow$  1 ns , 1 unit is execute

$$10^{-6} \text{ sec} = 1 \text{ inst}$$

$$1 \text{ sec} \rightarrow \frac{1}{10^{-6}} \text{ inst executions}$$

## Types of pipeline

① Uniform delay pipeline

② non-uniform delay pipeline

every stage will take same time to complete an operation

normal cycle time ( $T_p$ ) = stage delay

cycle time ( $T_p$ )  $\rightarrow$  stage delay + buffer delay  
after stall adding

if we have  $\rightarrow$  stages let's say

use 3sec 2sec.

all will be given 4 sec  $\rightarrow$  now

&

add stalls.

non-uniform delay  $\rightarrow$  diff stage

Cycle Time ( $T_p$ ) = max( stage delay )

for  
uniformity

buffer

$T_p = \max(\text{stage} + \text{Buffer delay})$

not necessary that after buffer all are same.

SIMD

instruction

same → data different

deterministic data → we know the output

synchronous  $s_1 \rightarrow s_2 \rightarrow s_3$   
behavior ↴

↳ data source

in SIMD should know at what time  
which data is getting from where  
eg at  $T_1 \rightarrow$  data from  $s_1$ .

eg → addition is a subproblem and  
multiple data →  $(1, 2), (3, 4)$  are  
coming  
so instruction → single  
data → multiple.

MISD

MISD bottleneck

- low level of 11.1nm.
- high sync required synchronization.

inst 1



inst 2



inst 3



bottleneck.

- bottleneck in memory access, microcode, resource contention, instruction decoding.

- CISC → complex instruction set computers
- RISC Instruction set, variable length instruction, memory to memory instructions.

MIMD

high 11nm

super computers, grid computers

- no sync necessarily required but can be both.
- no data is shared → loosely coupled.
- if sharing then sync.



GPU → graphical processing units.  
specific type of test is assigned to small components.

### Kinds of parallel alarm

→ temporal / test → all tests should be similar with no delay  
→ no transfer time



let the no of jobs =  $n$   
let time to do a job =  $p$

let each job be divided into  $k$  tasks &  
let each task be done by a different individual

let the time for each task =  $\frac{p}{k}$ .

time to complete →  $np$ .  
n jobs with no pipelining.

with a pipeline organisation of  $K$  individuals

$$\Rightarrow P + (n-1) \frac{P}{K}$$

$$\Rightarrow \frac{KP + (n-1)P}{K}$$

$$\Rightarrow P \frac{(K+n-1)}{K}$$

speed up due to pipelining  $\rightarrow \frac{NP}{\frac{P}{K}(K+n-1)}$

$$\Rightarrow \frac{K}{\frac{1}{n}(K+n-1)}$$

$$\Rightarrow \frac{K}{(1 + \frac{K-1}{n})}$$

problems to temporal alignment

$\rightarrow$  Transfer time  $\rightarrow$  synchronisation

we assumed this negligible.

- ② fault tolerance (all processes are equal)  
one dies fault can't be solved/fixed.
- ③ scalability not possible
- ④ inter task comp communication
- ⑤ bubbles in pipeline (one stops all other stops)

### Data 11 am.

suppose we have 1800 answer sheets.  
read → principle → divides.

teacher one → 0 - 250

" 2 → 250 - 500

" 3 → 500 - 750

" 4 → 750 - 1000

→ so all teachers are assigned tasks at one time together & again assigned together → (no overhead, inter process communication)

disadvantage →

→ single point of failure.

→ task should be ~~mutually~~  
 → cost more (need takes more salary  
 before any 4 teachers).

### Mixed Job



[data Job]



eg  $\rightarrow$  NTC-SX  
 OFF. CRAY

Data Job  $\rightarrow$  with dynamic assignment

instead giving 1-250 at same time  
 give next if previous job is done by the process or. (1 given at a time)

Quasi - dynamic  $\rightarrow$  now one time in  
 dynamic giving 4-5 at a time.

specialist data II ism →

specialist person → who knows who can do what?

1. cause Grained specialist temporal parallel processing

in ways  
out ways of the processor.

### Detecting II ism

→ can this run on a II arch

if p. Pism.

dependent → cannot

independent → can run.

Bernstein's condition we define →

$I_i^r \rightarrow$  input set → from have to fetch  
read from memory

$O_i^w \rightarrow$  represent write set → write into  
memory

process →

read set

write set

$P_1$  and  $P_2$  will be independent when

$I_1 \cap D_2 = \emptyset$  only when they are independent.

$I_1 \cap D_2 = \emptyset \rightarrow$  flow independence

$I_2 \cap D_1 = \emptyset \rightarrow$  anti independence

$D_1 \cap D_2 = \emptyset \rightarrow$  output independence

eg →

$$P_1: C = D \times E$$

$$P_2: M = G + C$$

$$P_3: A \Rightarrow B + C$$

$$P_4: C = L + M$$

$$P_5: F = G \circ E$$

dependence graph  $\rightarrow$



both use +  
so resource  
dependency

there are 3 addition units ( $+, +_2, +_3$ ).

Solid arrows  $\rightarrow$  data dependence.

Dotted arrows  $\rightarrow$  resource dependence.



sequential execution (no pipelining)



II execution →

at a time, we take 2 processes

$P_1 \& P_2$

$I_1 \cap D_2 = \emptyset \checkmark$

$I_2 \cap O_1 = \emptyset \times$

$D_1 \cap O_2 = \emptyset \checkmark$

for pairs (all) when all 3 conditions are true then they can run parallelly.

$P_1 \& P_5$

$P_2 \& P_3$

$P_2 \& P_5$

$P_3 \& P_5$

$P_4 \& P_5$

these will come II.

$P_2, P_3, P_5 \rightarrow$  can be executed II.

Q

## 11 execution



## Processor types



locality of reference → Basic principle on which cache works.

→ spatial / temporal locality

→ cache

→ pipelining

→ Study all these terms.

multicore processors and multicore computers

(PRAM) parallel random access machine →



all  $n$  processors has equal access to the shared memory.



- ① shared memory model.
- ② message passing model.
- ③ data parallel model.

shared memory

uniifrom

- ① UMA  $\rightarrow$  uniform memory access
- ② NUMA  $\rightarrow$  non-uniform memory access
- ③ cache only

$P_1 \rightarrow [SM1]$

$P_2 \rightarrow [SM2]$

$P_3 \rightarrow [SM3]$

$\rightarrow$  each processor has its own memory and but other ~~processor~~ can access the other's memory equal.

NUMA  $\rightarrow$

every P will have local RAM

$[P_1] \rightarrow LR$  (local Ram)

$[P_2] \rightarrow LR$

$[P_3] \rightarrow LR$

eg  $\rightarrow$  super computer

here  $P_1$  has more right on LR  
 (all do not have same right)  
 processors are loosely coupled.

Cache only (C0NA) →

only cache shared b/w processors.  
RAM not shared.

~~parallel programming paradigms~~ →

slide 3 →

$$S \rightarrow \frac{\text{speed up (s)}}{\text{performance (non-pipeline)}} = \frac{\text{Performance (pipeline)}}{\text{Performance (non-pipeline)}}$$

(Response &  $\frac{1}{\text{Time}}$ )

$$S \rightarrow \frac{\text{avg execution time (non-pipeline)}}{\text{avg. " " (with pipeline)}}$$

•  $S \Rightarrow \frac{\text{CPI (non-pipeline)}}{\text{CPI (pipeline)}} \times \frac{\text{cycle time (non-pipeline)}}{\text{cycle time (with pipeline)}}$

due to stalls →

$$S \Rightarrow \frac{\text{CPI non pipeline}}{(1 + \text{no of stalls per instruction})}$$

because ideally 11.1cm CPI ⇒ 1