

## 228.Summary ranges

12 June 2023 09:58 PM

# 1146.Snapshot array

12 June 2023 09:59 PM

Maximum diff. b/w node & ancestor :



$$\text{MAX} \rightarrow |8 - 1|$$

+ node,

we'll find (min, max) from both its  $L_T$  &  $R_T$ .

Then compute V:

$$V = \left\{ \begin{array}{l} \min - \text{curr} \\ \max - \text{curr} \end{array} \right\} \text{Max.}$$

&

iterate till we find Max. (V)



841

4 September 2024 04:57 AM



$$\begin{matrix} N \\ \downarrow \\ (0, 0) \end{matrix}$$

$$-2 : L, -1 : R$$

$$[1, 4] \in \mathbb{R}$$

$$\boxed{x^2 + y^2}$$



$$(\sqrt{1^2 + 8^2})^2 \rightarrow \boxed{65}$$





0 1 2 3  
( $[1, 3]$ ,  $[3, 0, 1]$ ,  $[2]$ ,  $[0]$ )

$$ct\gamma = \cancel{\alpha} \cancel{\beta}$$

$$n = 4$$

|     |   |   |   |   |
|-----|---|---|---|---|
| vis | T | T | F | F |
|     | 0 | 1 | 2 | 3 |

### Segment tree

0 1 2 3 4 5  
1, 3, 5, 7, 9, 11  
1 4 9 16 25 36

$n = 6$



OR



→ leaf nodes are arr. elements

$$\begin{aligned} L_c &= 2i + 1 \\ R_c &= 2i + 2 \end{aligned} \quad \left\{ \text{for Root@ } i \text{ index} \right.$$

nodes  
 $2n - 1$





$u$   $v$   $wt$   
mgozi's i information[u]

2 0 1

2 1 1

-1 2

information[com\_mys]

3217

7 September 2024 12:27 AM



$$k = 4, n = 3$$

Max-heap ( $k$ )

$[1 \ 2 \ 3 \ 4 \ 5]$

diff.  
w.r.t.  $n$

2 1 0 1 2

$\{1, 2\}$

$\{5, 2\}$

$\{2, 1\}$

$\{3, 0\}$

$\{4, 1\}$

arr.  $\rightarrow 1, 2, 3, 4, 5$ ,  $n = 3$

$K = 4$

MIN HEAP

1 2 3 4 5

~~diff.~~ 2 1 0 1 2

{2, 1}

{2, 5}

$K = 4$

res : 3, 2, 4, 1

0 1 2 3 4

arr.  $\rightarrow [1, 2, 3, 4], 5$

$K = 4$

~~diff.~~ 2 1 0 1 2

$n \approx 3$

↑ ↑ ↑  
L R R



directed

$$n = \infty \rightarrow \text{ref } m$$

~~else~~

~~set (n, m% n)~~

$$\text{gcd}(18, 6)$$

$$(6, 0) \quad \boxed{6}$$

$$(10, 3)$$

$$(3, 1)$$

(1,0)

( ) ( )

answer :  $[0, 1, -1]$

$$(6, 10)$$

(10, 6)

$$(6, 4)$$

$$(4, 2)$$

(2, 0)

1

$$\underline{n = 3}$$



$$\text{nodes} \quad 0 \quad 1 \quad 2$$

Moore's law

# Transistors / chip will double roughly 1-2 yr.

Dennard Scaling → (power decipated)

→ Power density remains const.

→ Performance per watt ↑ exponentially.

SPEC → std. Performance Eval." Cooperative

↑ in Performance:

gen'. IN-ORDER ← n-issue IN-ORDER ← n-issue OOO

pipeline

pipeline

pipeline

Out-of-order Pipeline:

Processor with 5-stages:



Pipeline eliminates idleness in processor

{ except for time to fill & drain it. }

|                         | <u>ins. load</u> | <u>exec.</u> | <u>commit</u> |
|-------------------------|------------------|--------------|---------------|
| In-order pipeline :     | in-order         | in-order     | in-order      |
| Out-of-order pipeline : | 000              | 000          | in-order      |

→ Pipelines use -ve edge triggered latches / buffer b/w the 2 stages, to store the imm. results & synchronize the clock.

#### \* Hazards :

- i) Structural hazards → same resource / h/w access
- ii) Data hazards → data dependency ( RAW, WAR, WAW )
- iii) Control hazards → Branch statements

#### Data interlocks :

identifies dependencies on ins. & inserts NOP ins. / bubbles whenever it finds suitable.  
→ stalls the pipeline, Performance ↓

[Stalls]

→ Optimization for RAW dependency:

Forwarding : from MA to EX stage, resulting in NO stall,  
{WB to reg. file, will happen in background}

forwarding unit uses MUX to determine, which i/p needs to be taken for ALU exec.

#### Types of forwarding paths :



\* forwarding cannot help in case of LOAD-use hazard ( RAW dependency )  
↓  
requires 1 stall cycle.

#### \* dis-adv. of in-order pipelines :

- load-use hazard : causes 1 stall each
- Usually we have multicycle MA stage

- load-use hazard : causes 1 stall each
- usually we have multicycle MA stage
- Convey effect (Head of line) HOL blocking
- Taken branches have 2 stalls each
- IPC < 1

\* Performance eq.

$$\frac{\# \text{Programs}}{\# \text{sec.}} = \frac{\text{IPC} \times \text{freq.}}{\# \text{ins}}, \text{ for 1 program}$$

or

$$\text{Performance} \propto \frac{\text{IPC} * \text{freq.}}{\# \text{ins}}$$

$$\text{IPC} = \frac{1}{\text{CPI}}$$

→ #ins depends on compiler.

→ freq. depends on transistor technology & arch. (#Pipeline stages)

(clock Period)

$$t_{clk} = \frac{t}{k} + \Delta, \text{ where, } k\text{-stage pipeline,}$$

with  $\Delta$  as buffer delay.

$$f_{max} = \frac{1}{t + \Delta} \Rightarrow f_{max.} \propto k, \text{ give } \frac{t}{k} \gg \Delta$$

→ IPC again depends on arch. & compilers. { lower stalls, IPC ↑ }  
 i.e., forwarding,  
 rearranging code, keeping in mind data-dependency.

\* Methods to ↑ IPC :

- forwarding
- having more not taken branches
- faster ins? & data memories

\* ↑ freq. is LIMITED by

- i) Power,  $P \propto f^3$
- ii) Change in temp,  $\Delta T \propto P$
- iii) ↑ in pipeline stages  $\Rightarrow$  More hazards, more forwarding paths, design complexity
- iv) Effect of latch delay

\*  $CPI_{actual}$  =  $CPI_{ideal}$  + (stall\_rate \* stall\_penalty)

\* Super-scalar Processor: helps ↑ IPC

→ can execute multiple ins<sup>n</sup> per cycle.  
 { by issuing more ins<sup>n</sup> per cycle }  
 i.e.

Intel Pentium : had 2 in-order pipelines (U & V)  
 ins<sup>n</sup> : i<sub>j</sub> i<sub>j+1</sub>

dis-adv:

- dependency b/w ins<sup>n</sup>
- $O(n^2)$  forwarding paths for n-issue processor.
- complicated logic for detecting hazards, dependencies, & forwarding.



\* Out-of-order Pipelines:

- don't follow program order for exec<sup>n</sup>
- { it is as per data dependence order }
- extracts more //ism. from code segment.
- { ins<sup>n</sup> lvl parallelism }

- Steps:
- i) Creates a pool of ins<sup>n</sup> { fetches: Program order }
  - ii) finds mutually independent ins<sup>n</sup>, having operands ready
  - iii) Executes them ooo

→ larger the POOL  $\Rightarrow$  more ILP can be exploited,  
 ins<sup>n</sup>. window given ins<sup>n</sup> are on correct path  
 (typical size: 16-128)

Branch Prediction:

Consider,

Answe,

$$\# \text{ins} \rightarrow n$$

$$\# \text{branches} \rightarrow n/5$$

$P(E)$  predicting any given branch INCORRECTLY  $\rightarrow p$

Then,

$P(E)$  predicting ALL branches CORRECTLY  $\rightarrow (1-p)^{n/5}$

$\Rightarrow P(E)$  of making atleast a single mistake  $\rightarrow [1 - (1-p)^{n/5}]$   
in a pool of 'n' ins".

$\rightarrow$  We require Branch Prediction to be highly accurate.

i.e. it does have a very high substantial penalty.  
 $\Downarrow$

even with 97.5% acc. in BP  $\Rightarrow$  we have 40% chances of  
having incorrect ins". in window.

$\rightarrow$  Acc. of BP limits the size of ins" window.

\* Types of data-dependencies:

RAW  $\rightarrow$  true dependency { causes problem in-order pipeline }

(rp-dependency) WAW  
(anti-dependency) WAR } false dependency, caused due to finite no. of reg.

{ in 000 pipeline } Sol.: Register renaming

(done by HCo, internally)

Register renaming:

In the assembly eq<sup>w</sup> code,  
we map an architectural register to a physical reg.,  
in-order to eliminate false dependencies.

>Create new avatars of reg. whenever we have a W op.  
i.e,



(write once, read many times)

$\rightarrow$  Only RAW dependencies exists in the code, (if any)  
after renaming

→ Only RAW dependencies exists in the code, (if any)  
after renaming.

↓  
**higher ILP**

### Precise exceptions :

→ we need precise behaviour of ooo pipeline,  
to handle stop & resume effectively (CSW.), in case of an exception/  
interrupt

Consider, intr. after  $k^{\text{th}}$  ins?

Then,

ins? till  $k$  should has executed completely

↓ written their results to MEM./Reg. file.

↓,  $(k+1)^{\text{th}}$  ↓ later, should not appear  
to have started their EX at all.

### \* IF :

To explore ILP (in ooo), we need to ↑ fetch BW.

↙ But it incurs other overhead, → we only know PC value  
→ no time to look at ins? (prior decode in IF stage)

such as for [B.P] → It becomes difficult to predict correct  
set of ins?, in a multi-way branch.

{nested branch structure }

for an optimal compiler, if we have a fetch BW = 4, we'll  
expect it to place Branch statements as far as possible.

### Branch Predictor:

- Concerns : i) Predict if an ins? is BR or not.  
ii) if BR, predict its dir. (forward / backward)  
iii) predict its target (EA, if taken)



① given an ins<sup>n</sup>. with some PC value,  
 → it remains as BR or non-BR, throughout program runtime

↓  
 + PC we remember if its a BR & its type:  
 {  
 → Unconditional BR TAKEN B  
 → Conditional BR (need to Predict) BEQ, BNE  
 (depending on result of a prev. CMP ins<sup>n</sup>, some flags are SET)  
 → fn. call TAKEN call  
 → return TAKEN iret  
 → Non-BR

using a str. in H/w Ins<sup>n</sup>. Status table (IST)

provides better RANDOMNESS in case of Spatial & temporal locality

uses LSB 10-bits of PC, to mark the ins<sup>n</sup> as to which category it belongs.

dis-adv.: Destructive interference.

many addresses maps to same entry

keep updating the entries in IST.

i.e., 1 PC being BR as lsb 10-bits, another being non-BR with same IST entry

BRANCH ALIASING

(due to small-sized IST)

Sol<sup>n</sup>: To distinguish, augment each entry of IST with the TAG value.

(MSB 32-n bits)



→ If for a PC, we don't have entry in IST,

we'll have to predict if its a BR or not.

→ LST works because of temporal locality exhibited by codes (i.e., LOOPS)

## ② Prediction :

→ Branches in loops have roughly similar behaviour most of the time.  
Predominantly NOT TAKEN.

### I] Simple BIMODAL Predictor: (1-bit Predictor)

+ PC in LST, it remembers the last outcome of conditional BR.

↳ 1, uses it for current Prediction.

Same augmented LST table,  
along with 1-bit for BR as TAKEN / NOT TAKEN.  
for last recorded outcome of BR.

### II] Saturating counter (2-bits predictor)

→ instead of having 1-bit history, uses 2-bits for hysteresis.

Algo. for update:

→ if a BR is TAKEN, ↑ saturating ctr.  
else, ↓

for Prediction: 00, 01 → TAKEN  
10, 11 → NOT TAKEN



→ works well when BR is biased, towards 1 dir.

{either mostly TAKEN or NT}

### \* Global history Register (GHR):

Consider, a shift-register, that records history of last n-branches encountered by processor & regardless of PC value {

1 bit for each BR ( $0 \rightarrow \text{NT}$ ,  $1 \rightarrow \text{Taken}$ )

Hy,

we have 2-bit shift-register (GMR)

↳ uses the history of last 2 BR ins?

### III] GAp Predictor :

→ Use k-bit GMR along with Pattern History Table (PHT) to decide the outcome of a BR ins?

i.e.,



GMR : captures the behaviour of related branches.

i.e., {within a fn. scope}

### IV] PAp Predictor :

→ uses multiple GMRs vs n-bits of PC,

→ Most Accurate.

to decide outcome of the BR.

### V] Tournament Predictor :

very biased BR → bimodal predictor with saturating ctr., works well.

Alternating pattern → PAp predictor, works well

for a general piece of code,  
we'll use 2 Predictors, I choose 1 of them  
{based on behaviours}

→ To identify the target of BR ins<sup>n</sup>,

we augment the IST table  $\Rightarrow$  Branch Target Buffer (BTB)  
i.e,

it now hold EA aswell.

Entry  $\rightarrow$  BR type | Tag-bits | Target or EA

→ for CALL / RETURN BR ins<sup>n</sup>,

we use Return Addr. Stack (RAS)

Summary of BR Pred.:



\* Decode stage: (last part of IF unit)

Process of decoding:

- Expand imm. val. to 32/64 bit values.
- Extract all fields (IDs of regs.)
- Compute BR EA (if TAKEN)  $\xrightarrow{\text{we usually have PC-relative branches}}$
- Add all implicit sources (i.e., ret. ins<sup>n</sup>)
- Create Ins<sup>n</sup> pkt. (includes Control-bits)  $\xrightarrow{\text{return addr. [RA] reg.}}$

CISC ins<sup>n</sup>

→ do not have fixed len.

i.e., in x86: len. varies from 1 to 15 Bytes.

↓ SEQUENTIAL  
hard to fetch multiple ins<sup>n</sup> at once, ( $\because$  linear scanning reqd.)

hard for OOO pipeline to process them.

hard for OOO pipeline to process them.  
 CISC processors internally convert it to RISCy



→ To identify START/END of + CISC ins.,  
 + Byte is augmented with some additional info.  
 (done by L2 cache typically)



#### \* ins. compression in i-cache:

- using shorter width ins. (with smaller ISA)
- replacing frequently executed ins. with codewords (using dictionary)

stores decoded ins.

Issue, Execute, Commit Stages in OOO pipeline:

Reg. renaming with Phy. reg. file: (done by H/w)

|                  | #architectural reg. (visible to s/w) |                       |
|------------------|--------------------------------------|-----------------------|
| (intel, AMD) x86 | 8                                    | → used by compiler    |
| x86-64           | 16                                   | in ASSEMBLY CODE gen? |
| ARM              | 16                                   |                       |
| MIPS             | 32                                   |                       |

MIPS: MicroProcessor w/o Interlocked Pipeline Stages.  
 { INORDER pipeline }

more the #Phy. reg. → ↑ ILP can be achieved.  
 (store interim values)

\* Renaming uses 3 h/w structures:

i) Reg. Alias Table (RAT)

translates arch. reg. ID to phy. reg. ID.

ii) Dependency Check logic (DCL)

take care of dependencies b/w multiple ins'. renamed in same cycle,  
 { using MUX }

iii) Free list/queue (circular Q)

Maintain a list of unmapped Phy. registers.

→ RAT also store avl. bit,

{ whether Phy. reg. value ready or not }

↳ if not, ins' has to WART in ins' window.

Rename table entry : Phy. reg. ID | avlbl. bit

Dispatch:

(or issue) till rename stage, ins' proceeds INORDER

Sends the ins' to ins' window



→ structure of a  $ins^n$  in  $fns^n$  window:



When both the operands are available, the  $ins^n$  is ready  
to EXECUTED

either from Reg. file  
/ Bypass n/w.



Once an  $ins^n$  has all its operands READY,  
it asserts its attached REQUEST line

The Select unit chooses among these ready  $ins^n$ .  
(GRANTS it)

→ Speculative Broadcasting:

Consider 2  $ins^n$  with RAW dependency.  
( $i^{th}, i+1^{th}$ )



Allows back-to-back EX<sup>n</sup> of dependent  $ins^n$

→ In-case of a 2 cycle LOAD-USE hazard:

where LOAD requires 2 cycles in EX phase,  
we BROADCAST TAG 1 cycle later.

i.e., STALL of 1 cycle persists in ooo

i.e., for diff. ins<sup>n</sup>,  
we BROADCAST TAG at diff. pt. of time.

→ During speculative BROADCAST, (Rename table)  
we also send a signal to RAT table,  
to update the avlbl. bit of processed reg. [Destn. reg]  
{ in-order to help ins<sup>n</sup> that arrive after Broadcast, to not wait indefinitely }

\* Problem:

→ ins<sup>n</sup> being written to the LW will miss the broadcast.  
→ These ins<sup>n</sup> would also read avlbl. bit as 0, from RAT,  
leading to indefinite WAITING

Sol<sup>n</sup>: i) Split 1<sup>st</sup> half of clk cycle for DISPATCH.  
ii) use 2<sup>nd</sup> half for BROADCAST. } Not feasible

ii) Double BROADCAST

(w next cycle update)

→ log the BROADCASTED tags in a temporary str.  
→ record all ins<sup>n</sup> that are being dispatched in a separate str.,  
called **DISPATCH buffer**.

In the next cycle,

→ match the tag entries in the dispatch buffer,  
→ if matched, update corresponding LW entry.  
→ Clear entries of last cycle from buffer.

\* Reg. Read & Execute:

LSQ : Load-Store Queue

↳ Enforces mem. dependencies.

Consider,

ld r<sub>1</sub>, 4[r<sub>3</sub>]  
st r<sub>2</sub>, 10[r<sub>5</sub>] → if same MA.  
↓  
Problem

They're then sent to load-store unit:

- loads can execute immediately.
- stores update the processor's state (for Precise Exceptions)
  - { MUST be performed in Program Order }
  - ↪ can wait.

STORE Queue ≈ data-cache str.

- ↪ LOADS cannot be sent to mem. sys., until all the STORES before it are RESOLVED,
- I do not write to the same addr

\* LSD:

| <u>Op. type</u> | <u>Search dir.</u>             | <u>Condition : Action</u>                                                                                                                                                                                      |
|-----------------|--------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| LOAD            | search all stores<br>before it | <ul style="list-style-type: none"> <li>→ found store to same addr : terminate &amp; forward values</li> <li>→ found store with unresolved addr. : terminate</li> <li>→ Both not met : goto d-cache.</li> </ul> |
| STORE           | search all LD/ST<br>after it.  | <ul style="list-style-type: none"> <li>→ found store with same addr. : terminate</li> <li>→ found store with unresolved addr. : terminate</li> <li>→ LOAD from same addr : forward value / addr.</li> </ul>    |

\* Ins? commit:

Reorder Buffer: (inorder Q str.)

- Contains 1 entry + ins<sup>n</sup>. fetched. (in program-order)
- After decoding the ins<sup>n</sup>, we enter it to ROB.
- if no free entry → ooo pipeline STALLS.

Entry is cleared when an ins<sup>n</sup>. COMMITs / Retires. → also, add reclaimed phy. reg. back to free list. (if any)

+ cycle, we inspect for top  $\boxed{W}$  entries, if READY to be commit.  
in ROB

↪ Stop if any entry found to be NOT READY.

{ in general, fetch width = Commit width }

### \* Unexpected events :

- interrupts
- exceptions
- Branch mispredictions

↓

Some ins? in pipeline are on wrong path, i.e., they shouldn't be committed.

→ Mark ins? that has had an exception / suffered Misprediction in ROB.

In case of interrupt, mark topmost entry in ROB,

→ wait till marked ins? retires

→ initiate recovery seq. → flush ROB, LSQ, IW  
↓ all pipeline reg..

+

Restore / Rollback arch. reg., state, PC

### \* VLIW Processor :

(Very long ins? word)

→ exploits ILP.

→ It relies on compile time analysis,  
to identify ↓ bundle together ins? that can be executed  
concurrently.

→ compiler is responsible for static scheduling of ins? in VLIW processors.

→ compiler finds out which operations can be executed in // in the program  
↓, groups together these op? into single ins? known as ISSUE packet  
or Bundle, which is a very large ins? word.

→ compiler ensures that an op? is NOT issued before its operands are ready.

→ They deploy multiple independent FU.  
i.e.

Bundle :



(a VLIW ins?)

## VLIW Processors

- Parallelism is performed STATICALLY at compile time, by Compiler.
- Data dependency is checked by compiler.
- less H/w.
- complex compiler.

## Superscalar Processors

- dynamically at RUNTIME , by H/w
- data dependency is checked & resolved by H/w
- more H/w
- complex H/w