

Instead of BTB, we use stack.  
This is called Return address Predictor.  
By using stack.

## Advance ILP

## Simple Pipelining.

ICC ICC ICC ICC ICC  
IF ID EX MEM WB

$$\underline{E} = A \cos(\omega_0 t) F_0 F_1 F_2 = 500$$

ADROFO F1 F2 = 5CC

Ex.  $\text{AODAO} \rightarrow \text{FO F1 F2} \rightarrow \text{4CC}$

MULT.  $\circ$   $F_0$   $F_1$   $F_2 \rightarrow$  occ

$\text{at} \nu \cdot \omega \text{ for } f_1, f_2 \rightarrow 40\text{cc}$

## Pipelining



multicyclic pelining  
precipitation stage.

Add: 4cc

Mult: 10cc

DIV: 40cc

Page No.:  
\_\_\_\_\_

Ex: Div:  $F_0 F_1 F_2 = 44\text{ cc}$   
 $A \times A \times F_3 F_4 F_5 = 9\text{ cc}$   
Mult:  $F_6 F_7 F_8 = 16\text{ cc}$

Q. Add: 4cc, Mult  $\rightarrow$  7cc, DIV: 25cc, ALU = 1cc



Latency  $\rightarrow$  of two instruction of same type first one is producing the result and second one is consuming the result; the no. of clock cycle waiting by the second instruction is latency.

The gap in terms of CC ~~are in terms~~ b/w the two instruction are of same type.



ADD :  $\infty$

F0 F1 F2 F3

F0 F1 F2 F3



IF      ID      Execution      A<sub>1</sub>

Latency of 3 CC.

for multiplication  $\rightarrow$  6 CC }  
 for division  $\rightarrow$  24 CC }

### Initiation Interval:

when both are independent like  
 in addition (1 CC), multiplication (1 CC)  
 But in division (25 CC)

$\hookrightarrow$  As it is complex.

Hazards

$\rightarrow$  Structural

$\rightarrow$  Data

$\rightarrow$  Control

9 10 11 12

A<sub>3</sub> A<sub>4</sub> MEM WB

MULT.  $\infty$

ADD.  $\infty$

IF I O M<sub>1</sub> M<sub>2</sub> M<sub>3</sub> M<sub>4</sub> M<sub>5</sub> M<sub>6</sub> M<sub>7</sub> M<sub>8</sub>  
B

I<sub>II</sub> O A<sub>1</sub> A<sub>2</sub> A<sub>3</sub> M<sub>W</sub>

801<sup>h</sup> is Extra buffer/Registers

Data → R A<sub>W</sub> (Read after write)

→ W A<sub>R</sub>

→ W A<sub>W</sub>

44CC.

DIV.  $\infty$  F<sub>0</sub> F<sub>1</sub> F<sub>2</sub>

A<sub>000.00</sub> F<sub>3</sub> F<sub>4</sub>

9 CC

DIV. D F<sub>0</sub> F<sub>1</sub> F<sub>2</sub>

ADD.  $\infty$  F<sub>1</sub> F<sub>3</sub> F<sub>4</sub>

ADD.  $\infty$  F<sub>1</sub> F<sub>3</sub> F<sub>4</sub>

DIV. D

ADD.  $\infty$

F<sub>0</sub> F<sub>1</sub> F<sub>2</sub>

F<sub>0</sub> F<sub>3</sub> F<sub>4</sub>

!

!

ADD.  $\infty$

F<sub>x</sub> F<sub>0</sub> F<sub>5</sub>

# Techniques to Optimize and Cache

- Software techniques / compiler techniques
- H/w (Hardware techniques)

## Compiler techniques

- Register renaming
- loop unrolling
- software pipelining
- trace scheduling
- predicate instructions
- Multiple Issue processor

{ Data Hazards.

{ Control Hazards

## H/w Techniques / Dynamic Scheduling

- dynamic score boarding
- Tomasulo's Algo
- Tomasulo's Algo + ROB
- Tomasulo's Algo + ROB + LSG
- ↓  
(Load Store) Queue



## Register Renaming.



we have a pool of registers {A, B, C, ..., Z}  
By using this, RAW, WAR, WAW can  
be removed.



LOOP UnRolling.

for( $i = 999;$   $i > 0;$   $i--$ )

1000 times.

$$x[i] = x[i] + s;$$

for( $i = 1000;$   $i > 0;$   $i = i - 4$ )

$$\{ x[i] = x[i] + s;$$

$$x[i+1] = x[i] + s$$

250 times.

$$x[i+2] = x[i+2] + s;$$

$$\{ x[i+3] = x[i+3] + s;$$

Assembly Programming

loop:

LD F0, 0(R1)

ADD.0 F4  $\rightarrow$  F0 F2  
RAW ↓

Store F4, 0(R1)

DADDUI

R1 R1 +- 8  
↓ RAW

B NEZ R1 R2 loop.

double  
Addition  
Unsigned  
immediate

Unrolling 4 times:

Page No.:

① LD F<sub>0</sub> O(R<sub>1</sub>)

ADD.0 F<sub>4</sub> F<sub>0</sub> - F<sub>2</sub>

SD F<sub>4</sub> O(R<sub>1</sub>)

② LD F<sub>0</sub> - 8(R<sub>1</sub>)

ADD.0 F<sub>4</sub> F<sub>0</sub>, F<sub>2</sub>

SD F<sub>4</sub> - 8(R<sub>1</sub>)

③ LD F<sub>0</sub> - 16(R<sub>1</sub>)

ADD.0 F<sub>4</sub> F<sub>0</sub>, F<sub>2</sub>

SD F<sub>4</sub> - 16(R<sub>1</sub>)

LD F<sub>0</sub> - 24(R<sub>1</sub>)

④ ADD.0 F<sub>4</sub> F<sub>0</sub>, F<sub>2</sub>

SD F<sub>4</sub> - 24(R<sub>1</sub>)

DADDUF R<sub>1</sub> R<sub>1</sub> ≠ -32

BNE 2 R<sub>1</sub> R<sub>2</sub> loop.

The latency b/w LD and ADD is 1cc  
and b/w ADD and SD is 2cc.

LD F0, O(R1)  
 ADD F4, F0, F2  
 SD F4 O(R1)  
 LD F6 -8(R1)  
 ADD F12 F6 F2  
 SD F12 -8(R1)  
 LD F8 -16(R1)  
 ADD F14 F8 F2  
 SD F14 -16(R1)  
 LD F10 -24(R1)  
 ADD F16 F10, F2  
 SD F16, -24(R1)

~~ADD R1, R1 ≠ -32~~  
 BNE 2 R1, R2 loop

a LD F0, O(R1)  
 b LD F6, -8(R1)  
 c LD F8, -16(R1)  
 a LD F10, -24(R1)  
 b ADD F4, F0, F2  
 c ADD F12, F6, F2  
 a ADD F14, F8, F2  
 b ADD F16, F10, F2  
 c SD F4 O(R1)  
 b SD F12 -8(R1)  
 c SD F14 -16(R1)  
~~ADD R1, R1 ≠ -32~~  
 b SD F16 -8(R1)  
 c BNE 2 R1, R2 loop

# Drawbacks of loop unrolling!

Page No.:

## - Software Pipelining.



loop:

|       |      |              |
|-------|------|--------------|
| LD    | FO   | O(R1)        |
| ADD.0 | F4   | FO F2        |
| LD    | FO   | O(R1)        |
| SD    | F4   | O(R1)        |
| ADD.0 | F4   | FO F2        |
| LD    | FO,  | O(R1)        |
| DA    | 0001 | (R1, R2 ≠ ?) |
| BNEZ  |      | R1, R2 loop  |
| SD    | F4   | O(R1)        |
| ADD.0 | F4   | 00, F2       |
| SD    | F4   | O(R1)        |

## loop unrolling



## Software Pipelining

## Traces Scheduling.





$w = 0$ ;  
 $x = x + y$ ;  
 $y = 0$ ;  
 $y = x$   
 $x++$   
 $w = x + z$ ;  
 $if (l < \infty)$

Tail computation

Recovery  
code

if 50% of is true and false then this technique  
is inefficient in tail computation

Predicate instructions

$gf(R_1 == 0)$

$R_2 = R_2 + R_4 \Rightarrow$

$R_7 = !R_1 \text{ predicated on } R_7$

$R_8 = R_2$

$R_2 = R_2 + R_4 \Rightarrow$

$R_6 = R_3 + R_5$

$R_6 = R_3 + R_5$

$R_4 = R_2 + R_3$

$R_4 = R_8 + R_3$

Predicated on  $R_4$

Predicated on  $R_1$

H/W Techniques.

|                    |    |       |                     |
|--------------------|----|-------|---------------------|
| MULT $\infty$      | F0 | F1    | F2                  |
| ADD $\cdot \infty$ | F3 | F0    | F4                  |
| SD                 | F4 | O(R1) | $= 0 + 1024 = 1024$ |
| LD                 | F3 | (OR2) | $= 0 + 1024 = 1024$ |

Memory read after write Hazard

RAW

Memory WAR

LD F4 O(R1)

SD FO OR(R2)

Memory WAW

SD F4 (OR1)

SD FO O(R2)

#

Div $\infty$  F0

F1 F2

ADD $\infty$  F1

F3 F4



Exception is detected

When any exception is happened to O<sub>1,2</sub> after that no instruction should be completed due to error



Hyperthreading (Simultaneous Multithreading)

11/03/2018

### H/W Techniques:

#### 1) Score Boarding

- out of order execution



FU  
functional unit

Simplest IF ID, EX, MEM, WB  
pipelining



### Issue stage:

- check for functional units. → If unavailable do stall
- check for **WAW** Hazards. → If it is then simply put the stall.

### Read operands:

→ Both the operands are available.

Execute: ~~IFP~~ perform execution using functional units.

Write results → writing into the destination register.

check **WAR** Hazard → If available then stall.

Scoreboarding → 3 tables.

### Instruction status unit

| IS | Rd | Ex | WR |
|----|----|----|----|
|    |    |    |    |

### Functional unit status

| BUSY | OP.                 | F <sub>i</sub> | F <sub>j</sub> | F <sub>k</sub> | R <sub>j</sub> | R <sub>k</sub> |
|------|---------------------|----------------|----------------|----------------|----------------|----------------|
|      | Div<br>Multi<br>Sub |                |                |                |                |                |

### Register status

F<sub>j</sub>, F<sub>k</sub> → inputs  
 F<sub>i</sub> → destination register  
 R<sub>j</sub> → Who will give f<sub>j</sub>?  
 R<sub>k</sub> → Who will give f<sub>k</sub>?  
 R<sub>j</sub>, R<sub>k</sub> → Whether operand available

# Instruction Status Unit

|        |        |        | I S | Rd | E X | W R | Page No.: |
|--------|--------|--------|-----|----|-----|-----|-----------|
| LD     | F6     | 34(R2) |     |    |     |     | 9         |
| LD     | F2     | 45(R3) | 1   | 2  | 3   | 4   |           |
| MULT D | F0 F2  | F4     | 5   | 6  | 7   | 8   |           |
| SUB D  | F8 F6  | F2     | 6   | 9  | 19  | 20  |           |
| DIV D  | F10 F0 | F6     | 7   | 9  | 11  | 12  |           |
| ADD D  | F6 F8  | F2     | 8   | 21 | 61  | 62  |           |
|        |        |        | 13  | 14 | 16  | 22  |           |

functional unit.

|        | BUSY | OP  | F1  | F2 | F3 | F4 | QJ | QK | RJ | RK |
|--------|------|-----|-----|----|----|----|----|----|----|----|
| INT    | NO   | yes | LDI |    |    |    |    |    |    |    |
| MULT I | NO   |     |     | F6 | 34 | R2 | —  | —  | —  | —  |
| ADD    | NO   |     |     |    |    |    |    |    |    |    |
| DIV D  | NO   |     |     |    |    |    |    |    |    |    |

## Register Unit



Clock cycle  $\Rightarrow$  0

Adder = 2CC

Div D = 40CC

MULT = 10CC

INT = 1CC

## Drawbacks of scoreboarding

- There is no forwarding.
- It is going to solve any hazards.
- Only put the stalls.

## Tomasulo's Algo.

g+ solves WAR & WAW Hazard.



WAW

Div  $\rightarrow$  40CC  
Add  $\rightarrow$  2CC

Page No.:

~~$F_2 = F_3 + F_4$~~

## Register table

|       |     |          |
|-------|-----|----------|
| $F_1$ | 1.0 | $R_S$    |
| $F_2$ | 2.0 |          |
| $F_3$ | 3.0 |          |
| $F_4$ | 4.0 | $R_{S2}$ |

$$F_4 = F_3 + F_1$$

$$F_1 = F_4 - F_2$$

$$F_1 = F_3 + F_4$$

|                   | OP | OP <sub>1</sub> | OP <sub>2</sub> |
|-------------------|----|-----------------|-----------------|
| + RS <sub>1</sub> | +  | 4.0             | 2.0             |
| + RS <sub>2</sub> | +  | 3.0             | RS <sub>1</sub> |
| / RS <sub>3</sub> |    | $F_3$<br>3.0    | $F_4$<br>4.0    |

RRT/RAT

|       |        |
|-------|--------|
| $F_1$ | $RS_3$ |
| $F_2$ |        |
| $F_3$ |        |
| $F_4$ | $RS_2$ |

|                   | IS | D | E  | WR |
|-------------------|----|---|----|----|
| $F_1 = F_3 / F_4$ | 1  | 2 | 42 | 43 |
| $F_1 = F_4 + F_2$ | 2  | 3 | 5  | 6  |
| $F_4 = F_3 + F_1$ | 3  | 7 | 9  | 10 |

# WAR problem

Page No.:

$$\begin{aligned} F_4 &\equiv F_1 - F_2 \\ F_1 &\equiv F_3 + F_4 \\ F_1 &\equiv F_4/F_2 \end{aligned}$$

$$F_3 = F_4 + F_3$$

$$F_4 = F_2 + F_3$$

$$F_1 = F_4/F_2$$

$$\begin{array}{lll} OP & OP_1 & OP_2 \\ \hline RS_1 + & F_2 & F_3 \\ RS_2 + & RSI & 3.0 \\ / RS_3 - & F_4 & F_2 \\ \hline RS_1 & 4.0 & 2.0 \end{array}$$

$$\begin{array}{llll} I & S & E & W/R \\ \hline 1 & 2 & 42 & 43 \\ 2 & 3 & 5 & 6 \\ 3 & 7 & 9 & 10 \end{array}$$

Reg table:

$$\begin{array}{ll} F_1 & 1.0/RS_3 \\ F_2 & 2.0 \\ F_3 & 3.0 \\ F_4 & 4.0/RS_1 \end{array}$$

RRT

$$\begin{array}{ll} F_1 & RS_3 \\ F_2 & \\ F_3 & \\ F_4 & \end{array}$$

~~Prob~~

- (I)
- (II)
- (III)
- (IV)

$$F_1 = F_2 + F_3$$

$$F_4 = F_1 - F_2$$

$$F_1 = F_2 / F_3$$

$$F_2 = F_4 + F_1$$

~~$F_1 = 1.0$~~

~~$F_2 = 2.0$~~

~~$F_3 = 3.0$~~

~~$F_4 = 4.0$~~

Add  
div.

w/

$$\begin{cases} S + RS_1 \\ + RS_2 \\ \hline RS_3 \end{cases}$$

$$d.0$$

$$OP_1$$

$$OP_2$$

$$RS_1 / 5.0$$

$$2.0$$

$$3.0$$

$$3.0$$

Reg.

$$F_1 = 1.0 / RS$$

$$F_2 = 2.0$$

$$F_3 = 3.0$$

$$F_4 = 4.0 / R$$

$$\begin{cases} F_1 = F_2 + F_3 \\ F_4 = F_1 - F_2 \\ F_1 = F_2 / F_3 \\ F_2 = F_4 + F_1 \end{cases}$$

$$IS$$

$$ID$$

$$Ex$$

$$WR$$

~~7~~

~~6~~

~~7~~

~~7~~

~~11~~

~~12~~

~~9~~

~~11~~

~~12~~

~~13~~

~~18~~

~~19~~

~~RS~~~~F1~~~~F2~~~~F3~~~~F4~~

~~1~~

~~2~~

~~6~~

~~7~~

~~2~~

~~8~~

~~12~~

~~13~~

~~3~~

~~4~~

~~29~~

~~30~~

~~8~~

~~31~~

~~35~~

~~36~~



$$F_1 = F_2 + F_3 \quad 1 \quad 2 \quad 6 \quad 7$$

$$F_4 = F_1 - F_2 \quad 2 \quad 8 \quad 12 \quad 13$$

$$F_1 = F_2 / F_x \quad 20' \quad 24$$

$$F_2 = F_4 + F_0 \quad 8 \quad 14 \quad 18 \quad 19$$

10.8

10.9

10

11

| F S               | D | EX | WR | C  | F <sub>0</sub> | 1.0                        |
|-------------------|---|----|----|----|----------------|----------------------------|
| $F_1 = F_2 + F_3$ | 1 | 2  | 6  | 7  | $F_1$          | $1.0 / ROB_1 /$<br>$ROB_2$ |
| $F_4 = F_1 - F_2$ | 2 | 8  | 12 | 13 | $F_2$          | $2.0$                      |
| $F_1 = F_2 / F_3$ | 3 | 4  | 29 | 30 | $F_3$          | $0.0$                      |
| $F_2 = F_4 + F_0$ | 9 | 10 | 14 | 15 | $F_4$          | $3.0$                      |
|                   |   |    | 32 |    |                |                            |

| S                 | T                 | (Rob entry)                      | T                | X D                               | X                | C/RC |
|-------------------|-------------------|----------------------------------|------------------|-----------------------------------|------------------|------|
| OP <sub>1</sub>   | OP <sub>2</sub>   | OP <sub>1</sub> , P <sub>2</sub> | ROB <sub>0</sub> | + F <sub>1</sub>                  | - 2.0            | H    |
| + RS <sub>1</sub> | + RS <sub>2</sub> | 2.0                              | 0.0              | ROB <sub>0</sub> ROB <sub>1</sub> | - F <sub>4</sub> |      |
| - RS <sub>2</sub> | -                 | ROB <sub>0</sub>                 | 2.0              | ROB <sub>1</sub> ROB <sub>2</sub> | / F <sub>1</sub> | -    |
| / RS <sub>3</sub> | -                 | 2.0                              | 0.0              | ROB <sub>2</sub> ROB <sub>3</sub> |                  |      |

R RT

|                |                    |
|----------------|--------------------|
| F <sub>0</sub> | R                  |
| F <sub>1</sub> | ROB <sub>0</sub> / |
| F <sub>2</sub> | ROB <sub>2</sub>   |
| F <sub>3</sub> |                    |
| F <sub>4</sub> | ROB <sub>1</sub>   |

$$F_1 = F_3 + F_4$$

$$F_1 = F_4/F_2$$

$$F_4 = F_3 - F_2$$

$$F_1 = 1.0$$

$$F_2 = 0.7$$

$$F_3 = 2.0$$

$$F_4 = 3.0$$

+RS  
1

OP

OP<sub>1</sub>

OP<sub>2</sub>

ROB  
Ruty.

+RS  
2

2.0

3.0

ROBO

+RS  
3

1

| FS                | D | EX | WR | C  |
|-------------------|---|----|----|----|
| $F_1 = F_3 + F_4$ | 1 | 2  | 6  | 7  |
| $F_1 = F_4/F_2$   | 2 | 3  | 28 | 29 |
| $F_4 = F_3 - F_2$ | 3 | 4  | 8  | 9  |

Tomasulo's Alg + ROB

$\Rightarrow$  Branch hazards

$\Rightarrow$  Speculation.

① MULT D F<sub>2</sub> F<sub>0</sub> F<sub>1</sub>

BNES Q F<sub>2</sub>, 0, Label

SGADD D F<sub>4</sub> F<sub>3</sub> F<sub>1</sub>

Label

ADD F<sub>3</sub> F<sub>4</sub> F<sub>2</sub>

MULT  $\rightarrow$  IDCC

IS D Ex WR C

|   |   |    |    |    |
|---|---|----|----|----|
| 1 | 2 | 12 | 13 | 14 |
|---|---|----|----|----|

2

3

4

| RS <sub>1</sub> | RS <sub>2</sub> | RS <sub>3</sub> | OP | OP <sub>1</sub> | OP <sub>2</sub> | ROB even.        | F <sub>0</sub> | O.D      |
|-----------------|-----------------|-----------------|----|-----------------|-----------------|------------------|----------------|----------|
|                 |                 |                 | *  | 0.0             | 1.0             | ROBO             | F <sub>1</sub> | 1.0      |
|                 | +               |                 |    | 3.0             | 1.0             | ROB <sub>2</sub> | F <sub>2</sub> | 2.0 ROBO |
|                 |                 |                 |    |                 |                 |                  | F <sub>3</sub> | 3.0      |
|                 |                 |                 |    |                 |                 |                  | F <sub>4</sub> | 4.0      |

RRT

|                  | T D V C/NT               |
|------------------|--------------------------|
| ROBO             | * F <sub>2</sub> - H     |
| ROB <sub>1</sub> | BNES Q F <sub>2</sub> -- |
| ROB <sub>2</sub> | + F <sub>4</sub> --      |
| ROB <sub>3</sub> |                          |

|                |      |
|----------------|------|
| F <sub>0</sub> |      |
| F <sub>1</sub> |      |
| F <sub>2</sub> | ROBO |
| F <sub>3</sub> |      |
| F <sub>4</sub> |      |

IS D EX WR C

LD R<sub>1</sub> O(R<sub>1</sub>) 1 2 3  
 SD R<sub>2</sub> O(R<sub>3</sub>) 2 3 4  
 LD R<sub>4</sub> O(R<sub>4</sub>) 3 4 5  
 SD R<sub>5</sub> O(R<sub>0</sub>)  
 LD R<sub>5</sub> O(R<sub>1</sub>)

| T | D              | V | C/NC |
|---|----------------|---|------|
| L | R <sub>1</sub> | — | H    |
| S | R <sub>2</sub> | — |      |

| LS | EA  | V | C/NC | After Calculation<br>of 1st Instruction<br>is 104, |
|----|-----|---|------|----------------------------------------------------|
| L  | 104 | — | H    |                                                    |
| S  | 204 | — |      |                                                    |

Store to load forwarding

## ~~Multi-issue Processor~~

VLIW

very long instruction word.



|       | IF | ID | EX | M | WB |
|-------|----|----|----|---|----|
| $I_1$ | 1  |    |    |   |    |
| $I_2$ | 1  |    |    |   |    |
| $I_3$ |    |    |    |   |    |
| $I_4$ |    |    |    |   |    |

A <sup>5</sup> VLIW can issue two load/store operations and 4 floating point operations and one decremental branch instruction in one clock cycle.

|                  |                   |     |                 |                 |                                     |
|------------------|-------------------|-----|-----------------|-----------------|-------------------------------------|
| LD <sub>1</sub>  | SD <sub>1</sub>   |     |                 |                 |                                     |
| LD <sub>2</sub>  | SD <sub>2</sub>   | 1CC | LD <sub>1</sub> | LD <sub>2</sub> | ---                                 |
| ---              | ---               | 2CC | LD <sub>3</sub> | LD <sub>4</sub> | ---                                 |
| LD <sub>7</sub>  | SD <sub>7</sub>   | 3CC | LD <sub>5</sub> | LD <sub>6</sub> | ADD, ADD <sub>2</sub> -             |
| ADD <sub>1</sub> | DADD <sub>1</sub> | 4CC | LD <sub>7</sub> | -               | ADD <sub>3</sub> ADD <sub>4</sub> - |
| ADD <sub>2</sub> | BNE <sub>2</sub>  | 5CC | -               | -               | ADD <sub>5</sub> ADD <sub>6</sub> - |
| ---              | ---               | 6CC | SD <sub>1</sub> | SD <sub>2</sub> | A007 - -                            |
| LD <sub>7</sub>  | SD <sub>7</sub>   | 7CC | SD <sub>3</sub> | SD <sub>4</sub> | - ADD <sub>0001</sub> -             |
| ---              | ---               | 8CC | SD <sub>5</sub> | SD <sub>6</sub> | - - ADD <sub>000</sub> -            |
| A007             |                   | 9CC | SD <sub>7</sub> | -               | BNE/2                               |
|                  |                   | INC |                 |                 |                                     |

## Simultaneous Multithreading.

4 Issue processor (4 instructions in 1cc)

Super scalar

fine grained

SMT (Simultaneous multithreading)

① Super Scalar



1cc      I<sub>1</sub>      I<sub>2</sub>

2cc      I<sub>3</sub>      I<sub>4</sub>

3cc

4cc      I<sub>5</sub>      I<sub>6</sub>

when there is a

big delay then move  
from one processor  
to another for  
fetching instruction.

② fine grained

1cc      I<sub>1</sub>      I<sub>2</sub>

2cc      J<sub>1</sub>      J<sub>2</sub>

3cc      K<sub>1</sub>      K<sub>2</sub>

4cc      L<sub>1</sub>      L<sub>2</sub>      L<sub>3</sub>

SMT



If there is  
a gap then  
jump to  
and run  
processor

Data level Parallelism.

Array Processors

No. of Processors  
connected in parallel  
manner.Single instruction stream multiple data  
streamSIMD computer organization:- (vector architecture)

$$C_i = A_i + B_i * C_i$$

 $i = 0 \text{ to } n-1$ 

$$A = (A_0, A_1, \dots, A_{n-1})$$

$$B = (B_0, B_1, \dots, B_{n-1})$$

$$C = (C_0, C_1, \dots, C_{n-1})$$

  $\Rightarrow$  Total time = 400 mins.



$$4 + 99 = 103 \approx 100$$

 This is Pipelining.

$$C_i = A_i + B_i * C_i$$

SIMD Computer  $C = \langle N; F, I, M \rangle$

$N \rightarrow$  No. of processing elements in the system

$F \rightarrow$  set of data routing function

$I \rightarrow$  the set of machine instruction

$M \rightarrow$  the set of masking schemes.



Components in a  
processing element

## Illustration of Data Routing and Masking schemes

$$A = A_0, A_1, \dots, A_{n-1}$$

We need to compute  $S(k) = \sum_{i=0}^k A_i$  for  $k=0, 1, 2, \dots, n-1$

$$S(0) = A_0$$

$$S(1) = A_0 + A_1$$

$$S(2) = A_0 + A_1 + A_2$$

$$S(n-1) = A_0 + A_1 + \dots + A_{n-1}$$

Time  $\rightarrow O(n)$  with the help of recursion.

$$S(0) = A_0$$

$$S(k) = S(k-1) + A_k \text{ for } k=0, 1, 2, \dots, n-1$$



④ Page No.:  
PE0 PE1 PE2 PE4

Time complexity =  $O(\log_2 n)$

PE1 PE3 PE5



$$R_i(j) = j + 2^i \quad 0 \leq i \leq \log(n)-1,$$

$$0 \leq j \leq n-1$$

for  $n=8$ :

$$0 \leq i \leq 2, 0 \leq j \leq 7$$

for  $i=0$

$$R_0(j) = j+1 \rightarrow 0, 1, 2$$

$$R_0(0) = 1 \quad PE0 \rightarrow PE1 \quad \text{3 steps of Routing.}$$

$$R_0(1) = 2 \quad PE1 \rightarrow PE2$$

for  $i=1$   
 ~~$R_1(j) = j+2$~~

$$R_1(0) = 2 \quad PE0 \rightarrow PE2$$

$$R_1(1) = 3 \quad PE1 \rightarrow PE3$$

for i = 2

$$R_2(j) = j + 4$$

$$R_2(0) = 4 \quad P(E_0) \rightarrow PE_4$$

Masking scheme. Step 1 0, 1, 1, 1, 1, 1, \_

0 → When not perform addition.

1 → when addition ~~not~~ performs.

Step 2.

Step 3 → 0 0 0 0 1 1 1 1



## Type of Interconnection network.

(6)

|           |  |  |
|-----------|--|--|
| Page No.: |  |  |
|-----------|--|--|

## Static vs Dynamic networks.



Linear



Ring



cube



tree

If vertices of cube is replaced by ~~3 cut~~ 3 circles  
then cube <sup>connected</sup> circled.

two classes of dynamic network:-

- single - stage
- multi stage.

IS  $\rightarrow$  Input selector



$1 \times D$  deMultiplexer

OS  $\rightarrow$  output selector.

$D \times 1$  multiplexer



not allowed  
and illegal  
and not recommended



## Evaluation criteria of ICN

- diameter.
- Bisection width
- no. of edges per Node
- Maximum edge length.

If we increase the no. of connections, diameter will decrease.



if changes  
to bidirectional  
 $\left(\frac{n-1}{2}\right)$

- Bisection width





4x4 2D - wrap around mesh

# of no. of connections is more to reduce diameter, it will create hazards in the form of heat due to interference, overlapping of networks.

Dynamic networks. → Three classes:-

- Blocking
- Rearrangeable
- Non-blocking

Simultaneous connected b/w source and Receiver.

# Cube interconnection Network.

Cube routing function

$$c_i(a_{n-2} a_{n-1} \dots a_{i+1} a_i a_{i-1} \dots a_2 a_1 a_0)$$



A 3 cube of 8 nodes

$$c_i(a_{n-2} a_{n-1} \dots a_{i+1} a_i a_{i-1} \dots a_2 a_1 a_0)$$

$$= a_{n-2} a_{n-1} \dots a_{i+1} a_i a_{i-1} \dots a_2 a_1 a_0$$

$$c_0(000) = 001$$

$$c_0(001) = 000$$

$$c_0(101) = 100$$

$$0 \leftrightarrow 1 \quad 2 \leftrightarrow 3 \quad 4 \leftrightarrow 5 \quad 6 \leftrightarrow 7 \quad c_0$$

$$C_1(000) = 010$$

: diagonal connectivity



$$C_2(000) = 100$$

$$C_2(111) = 011$$

vertical  
connectivity.



Horizontal  
connectivity



## 5D Hyper Cube



Shuffle function

$$S(a_{n-1} a_{n-2} \dots a_1 a_0) = a_{n-2} \dots a_0 a_{n-1}$$

$$S(000) = 000$$

$$S(001) = 010$$

Inverse.

$$a_{n-1} \dots a_1 a_0 = a_{n-1} \dots a_1 a_0$$

Exchange routing function

(15)

Page No.:

$$E(A) = C \circ A$$

0 0 0      0 0 1      0 1 0      0 1 1      1 0 0      1 0 1      1 1 0      1 1 1

shuffle:



Exchange:



8 node - shuffle exchange network.

Task: Prove that the diameter of an  $n$ -processor ( $n^p$ ) shuffle exchange network is  $O(\log n)$ .

Butterfly function

$$\beta_k (b_{n-1} \ b_{n-2} \ \dots \ b_{k+1} \ b_k \ b_{k-1} \ \dots \ b_2 \ b_1 \ b_0)$$

$$= b_{n-1} \ b_{n-2} \ \dots \ b_{k+1} \ b_0 \ b_{k-1} \ \dots \ b_2 \ b_1 \ b_0$$

$k \rightarrow 1 \text{ to } n-1$

$[0, n-1]$

$0 < k \leq n-1$

~~ex-2~~

-1 (single bit right rotation)

0

→ 0 0 0

→ 0 0 1

→ 0 1 0

→ 0 1 1

→ 1 0 0

→ 1 0 1

→ 1 1 0

→ 1 1 1



3 stage  $8 \times 8$  Omega switching network



Reposition Omega network. (changing position

$S_{32} \leftrightarrow S_{22}$

- # The cube network uses two function switch boxes whereas the omega network uses four function switch boxes.
- # The data flow directions in the two networks are opposite to each other. In other words, the roles of input-output lines are exchanged in the two network.

✓ Prove that diameter of n-processor shuffle exchange network is  $O(\log n)$

(19)

|           |  |  |
|-----------|--|--|
| Page No.: |  |  |
|           |  |  |

S

E

0 0 0 —————→ 0 0 0 —————

0 0 1                    0 0 1

~~0 0 0~~

0 1 0

0 1 1

1 0 0

1 0 1

1 1 0

1 1 1

0 0 0

0 0 1

0 1 0

0 1 1

1 0 0

1 0 1

1 1 0

1 1 1

[ for  $i = 1$  to  $\log_2 n$  ]

S  
E

let us consider a matrix of size  $m \times m$ . To prove the Inverse of Transpose matrix can be obtained by ~~m-perfect shuffle~~ or  $m$ -shuffle.



$$A^{-1} = ad$$

(2) Page No.:

$$\begin{bmatrix} \phi & 0 \\ 0 & \phi \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$$

$[2 \times 2]$

$$\begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix}$$

$$m=2 \quad 2^2 \times 2^2 = 16$$

|         |         |         |         |
|---------|---------|---------|---------|
| 0 0 0 0 | 0 0 0 1 | 0 0 1 0 | 0 0 1 1 |
| 0 1 0 0 | 0 1 0 1 | 0 1 1 0 | 0 1 1 1 |
| 1 0 0 0 | 1 0 0 1 | 1 0 1 0 | 1 0 1 1 |
| 1 1 0 0 | 1 1 1 0 | 1 1 1 0 | 1 1 1 1 |

$\rightarrow$   
~~1000~~

row col

1 0 0 1

1 1  
✓

0 0 1 1

1 1  
✓

0 1 1 0

Other row

$$2^m \times 2^m = 2^{2m}$$

bits = 2m

L row | J th Column  
 $a_{2n-1} a_{2n-2} \dots a_m | a_{m-1} a_{m-2} \dots a_2 a_1 a_0$

↓  
 m shuffle.

$a_{m-1} a_{m-2} \dots a_2 a_1 a_0$  |  $a_{2m-1} a_{2m-2} \dots a_m$   
 Jth column | Lth row.

|     |      |      |      |      |
|-----|------|------|------|------|
| 00  | 0100 | 1000 | 0000 | 0000 |
| 100 | 0110 | 1010 | 0010 | 0010 |
| 001 | 0101 | 1001 | 0001 | 0001 |
| 000 | 0111 | 1011 | 0011 | 0011 |

Ans

Ans

Ans

Ans

Ans

01/04/2018.

Page No.:  
\_\_\_\_\_  
\_\_\_\_\_  
\_\_\_\_\_  
\_\_\_\_\_

## Iliac Routing function

i → address of Processor

$$R_{\pm 1}(i) = (i \pm 1) \bmod N$$

$$R_{\pm r}(i) = (i \pm r) \bmod n$$

where  $r = \sqrt{N}$ ,  $0 \leq i \leq N-1$



$$R+1(i) = (i+1) \bmod 16$$

$$r = 4$$

$$0 \leq i \leq 15$$

$$\text{for } 0 \Rightarrow 1 \bmod 16 = 1$$

$$\text{for } 1 \Rightarrow 2 \bmod 16 = 2$$

$$\text{for } 2 \Rightarrow 3 \bmod 16 = 3$$

$$\text{for } 3 \Rightarrow 4 \bmod 16 = 4$$

$$R-1(i) = (i-1) \bmod 16$$

for 0  $\Rightarrow -1 \bmod 16 = 15$

for 1  $\Rightarrow 0 \bmod 16 = 0$

$$R+4(i) = (i+4) \bmod 16$$

for 0  $\Rightarrow 4 \bmod 16 = 4$

$$R-4(i) = (i-4) \bmod 16$$

for 0  $\Rightarrow -4 \bmod 16$





# original Illiac Routing function is implemented for 64 PE (processing element).

R+1

R-1

R+4

R-4

This is not suitable for fabrication



Permutation:

(a, b, c) , (d, e)

a → b, b → c, c → a

d → e & e → d

R+1(i) = (0, 1, 2, ..., N-1)

R-1(i) = (N-1, N-2, ..., 2, 1, 0)

R+ $\gamma$ (i) = (0, 4, 8, 12) (1, 5, 9, 13)

(2, 6, 10, 14)

(3, 7, 11, 15)

=  $\sum_{i=0}^{\gamma-1} \pi(i, i+\gamma, i+2\gamma, \dots, i+N-\gamma)$

$\gamma-1$

R- $\gamma$  =  $\pi_{i=0}^{\gamma-1}$

$$l = a + (n-1)d$$

$$= i + (\gamma-1)\gamma$$

$$= i + \gamma^2 - \gamma$$

$$= i + N - \gamma$$

$\gamma, 2\gamma, 3\gamma$   
 $\gamma + (n-1)\gamma$        $\gamma + \gamma(\gamma-1)$

# Barrel Shifter.

$$B+j(i) = (i + 2^j) \pmod{N}$$

$$B-j(i) = (i - 2^j) \pmod{N}$$

$$0 \leq i \leq N-1, 0 \leq j \leq n-1 \text{ and}$$

$$n = \log_2 N$$

$$B+0 = R+1, B-0 = R-1, B+n/2 = R+r$$

$$B-n/2 = R-s$$

$$B \leq (\log_2 N)/2$$

CIS.

~~$$B \pm j(i) = (i \pm 2^j) \pmod{N}$$~~

$$N=8$$

$$, 0 \leq i \leq 7, 0 \leq j \leq 2$$

$$B+0(i) = (i+1) \pmod{8}$$



$$B-0(i) = (i-1) \pmod{8}$$

$$B+1(i) = (i+2) \pmod{8}$$

$$B+2(i) = (i+4) \pmod{8}$$

$$B-2(i) = (i-4) \pmod{8}$$



Redraw for  $N = 16$  in the form of a chordal ring.

$$B_{\pm j}(i) = (i \pm 2^j) \bmod N$$

$$N = 16$$



$$0 \leq i \leq 15, \quad 0 \leq j \leq 3$$

$$B+0(i) = (i + 2^j) \bmod 16$$

$$B-0(i) = (i - 1) \bmod 16$$

$$B \pm 0$$

$$B \pm 1$$

$$B \pm 2$$

$$B \pm 3$$

~~Prove that the set of illiac routing functions is a subset of Barrel Shifter.~~

$$R \pm r(i) = (i \pm r) \bmod N$$

$$r = \sqrt{N}, 0 \leq i \leq N-1$$

Barrel:

$$B \pm j(i) = (i \pm 2^j) \bmod N$$

$$0 \leq i \leq N-1,$$

$$0 \leq j \leq n-1$$

$$n = \log_2 N$$

$$\text{Let } N=4$$

$$\text{or } r=2, 0 \leq i \leq 3$$

$$R \pm 2(i) = (i \pm 2) \bmod 4 \quad B \pm 1(i) = (i \pm 2^j) \bmod N$$

$$\text{Let } N=4$$

$$0 \leq i \leq 3$$

$$0 \leq j \leq 1$$

$$R \pm 1(i) = (i \pm 1) \bmod N$$

$$R \pm r(i) = (i \pm r) \bmod N, r = \sqrt{N}$$

$$B \pm j(i) = (i \pm 2^j) \bmod N$$

$$B \pm 0(i) = (i \pm 1) \bmod N = R \pm 1(i)$$

$$B \pm 1(i) = i \pm 2$$

$$n = \log_2 N$$

$$N = 2^n$$

$$\sqrt{N} = 2^{n/2}$$

$$B \pm n/2 (i) = (i \pm 2^{n/2}) \bmod N$$

$$= (i \pm \sqrt{N}) \bmod N$$

$$= (i \pm r) \bmod N$$

$$= R + r(i)$$

~~Task~~

# Express the Barrel Shift Routing function in the form of Permutation.

~~Q3~~

Q. Draw the Barrel Shifter with 256 PEs by showing ~~at least~~ all the connections ~~of atleast~~ ~~from~~ at least from one PE.

$$B \pm J(i) = (i \pm 2^J) \bmod N$$

$$0 \leq i \leq N-1, 0 \leq J \leq n-1$$

$$0 \leq i \leq 2^{55}, 0 \leq J \leq 7$$

$$n = \log_2 N$$

$$= \log_2 \frac{256}{8}$$

$$B \pm 0(i) = (i \pm 1) \bmod 256$$

$$= \log_2 2^2$$

$$\text{for } i=0 \Rightarrow (\pm 1) \bmod 256$$

$$= 8$$

$$\Rightarrow 1 \bmod 256, -1 \bmod 256$$

$$1, 255$$

$$i=1' B \pm 1 \rightarrow$$





/disprove

Prove that the multi stage Omega network can't perform shift permutation.



2 Stage 2x2 Omega N/w



$$00 \rightarrow 00$$

$$01 \rightarrow 10$$

$$10 \rightarrow 01$$

$$11 \rightarrow 11$$

NOT Permitted.



Systolic System

performance is poor.

By choosing collection of processor, performance can be increased.



Graeffe's Root squaring.

$$A_j = 0 \text{ for } j < 0$$

$$\& j > n$$

$$C_j = A_j^2 - 2A_{j-1}A_{j+1} + 2A_{j-2}A_{j+2} - 2A_{j-3}A_{j+3}$$

$$\text{for } j=0$$

$$C_0 = A_0^2$$

$$C_1 = A_1^2 - 2A_0A_2$$

$$C_2 = A_2^2 - 2A_1A_3 + 2A_0A_4$$

$$C_3 = A_3^2 - 2A_2A_4 + 2A_1A_5 - [2A_0A_6]$$

$$C_4 = A_4^2 - 2A_3A_5$$

$$C_5 = A_5^2$$



left shift

right shift



n=5



n=6



for odd value  
blunt edge

for even value  
sharp edge

palindrome  $\rightarrow$   $w w^r$

form

$w^r$  is reverse of  $w$ .

110011 is a palindrome,

110011

$\rightarrow$  D P P

11001

D D D

1100

D D D

110

D D D

11

D D P

H

H P

P

another form:

$w \subset w^r$

10001

Page No.:

Buffered reader : org:

[SOC]

System on chip.

# Difference b/w CPU & CPU.