

## Advanced Computer Architecture (DA7020)

### History of Electronic Computer

Designed by Univ. of Pennsylvania, U.S.A.  
for US military.

(around)  
1926

2 People developed,

Vacuum tube technology used.

1<sup>st</sup> elec. comp. was named ENIAC  
(Electronic Numerical Integrator and Computer)  
designed by Eckhart + Mauchley.

Transistor invented 1947 worked as  
switch replaced vacuum tube.

Size small, less consumption.

Ball

~~AT&T~~ AT&T Lab

- William Shockley
  - Walter Brattain
  - John Bardeen
- } Nobel prize for Physics in 1956.

IC introduced 1958 Jack Kilby in Texas Instrument invented Integrated Circuit Chip

→ Small Scale Integration SSI

MST

LSI

VLSI (millions of  
transistors  
on a chip)

1960's → Super Computer - Mainframe

1970's → Mini Computer → Businesses &  
Researches used

1980's → Micro Processor - Desktop Computer

→ Laptop Computer

1990's → Internet and World Wide Web,  
Web Services.

2000's → PalmTop computer, PDA,  
cell phones, smartphones,  
~~etc~~ Medical Diagnostic Devices,  
Electronic Devices



### VAX 11/780 Growth in Unit Processor Performance

Gordon Moore gave  
1965 Moore's law

due to  
25% → technology

50% → innovation

20% → multi processor

Dual Core  
Quad Core  
processors

### 3 Main Components of Computer



Speed gap exists b/w memory & Processor chip  
 I/O devices much slower.

### Computer Architecture

→ Design Set  
 → Instruction Architecture (ISA).

### Design Principles Followed:

- 1) Common Case Fast. } Focus on more freq.  
used part
- 2) Regularity in Design. } Theme maintained
- 3) Simple Design.
- 4) Exploit parallelism. } Multicore processors  
multiple processors

Date \_\_\_\_\_

( Pipelining → Overlapping of instructions. )

5) Exploit Locality of Reference ? Looping of instr'n

Speed of Development has reduced because :

- All parallelism is exhausted.
- With increase in no. of transistors power consumption has increased.

### Trends in Power in Integrated Circuit

CMOS

$$\text{Dynamic Power} = \frac{1}{2} CV^2 f$$

C → Capacitive load

V → Voltage

f → Frequency of the clock

f → clock speed → if we use slower clock power consumption reduces.

C → if no. of ~~so~~ transistors are reduced power consumption reduces.

Page No. \_\_\_\_\_

Date \_\_\_ / \_\_\_ / \_\_\_

(Saathi)

Date \_\_\_ / \_\_\_ / \_\_\_

(Saathi)

$V \rightarrow$  if we reduce voltage power consumption reduces.

$$\begin{aligned} \text{Static power} &= \cancel{I_{\text{leakage}}} \times \text{Voltage} \\ &= \cancel{\text{Current}} \times \frac{\text{Voltage}}{\cancel{\text{leakage}}} \end{aligned}$$

When transistors are not used they still leak some current.

Static power = 25% of Total Power Consumption.

$$\frac{1}{2} \times C \times (V - 0.15)^2$$

$$\begin{aligned} &\cancel{\frac{1}{2} \times C \times V^2} \times 0.85 \times 0.85 \times f \\ &\cancel{\frac{1}{2} \times C \times V^2} \times 0.85^3 \end{aligned}$$

$$\frac{P_{\text{old}}}{P_{\text{new}}} = \frac{1}{(0.85)^3}$$

$$= 1.628$$

$$\frac{1}{2} ((0.85)^2 + (0.85)) = 60\% \text{ of original power}$$

- (Q) Some microprocessors today are designed to have adjustable voltage so that 15% reduction in voltage results in 15% reduction in frequency. What would be the impact on dynamic power.

$$\frac{1}{2} \times C \times V^2 \times f = \frac{1}{2} \times (0.15)^2 \times 0.15 \times f$$

2.8.18

Date / /

Saathis

Date / /

### A Summary of 3 mainstream computing classes

Throughput  $\rightarrow$  work done per unit time

| Features                        | Desktop                                                                   | Server                                               | Embedded                                             | Architecture                                                                                                                                                                                      |
|---------------------------------|---------------------------------------------------------------------------|------------------------------------------------------|------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| → Price of System               | \$1500 - \$5000                                                           | \$5000 - \$50000                                     | \$10 - \$100,000                                     |                                                                                                                                                                                                   |
| → Price of Microprocessor       | \$450 - \$500                                                             | \$200 - \$10000                                      | \$0.01 - \$100                                       |                                                                                                                                                                                                   |
| Handle                          |                                                                           |                                                      |                                                      |                                                                                                                                                                                                   |
| → Critical Point vs Performance | Price, power consumption, availability, application, specific performance | Processor, memory, storage, scalability, performance | Processor, memory, storage, scalability, performance | Instruction Set Architecture (ISA) Design<br>CISC (Complex Instruction Set Computer)<br>RISC (Reduced Instruction Set Computer)<br>Embedded systems<br>ARM processor for embedded systems<br>RISC |
| System Design Issues            |                                                                           |                                                      |                                                      |                                                                                                                                                                                                   |
| Leisure                         |                                                                           |                                                      |                                                      |                                                                                                                                                                                                   |
| Marketing and                   |                                                                           |                                                      |                                                      |                                                                                                                                                                                                   |

Saad

Date \_\_\_\_\_

## Measuring and Comparing Performance of Computers.

Performance  $\propto$  Execution Time

$$\begin{aligned} \text{Execution time} &= IC \times f \text{CPI}_{\text{avg}} \times T \\ &= IC \times CPI_{\text{avg}} \end{aligned}$$

IC  $\rightarrow$  instruction count

T  $\rightarrow$  Clock Period

f  $\rightarrow$  Clock Frequency  $= \frac{1}{T}$

CPI  $\rightarrow$  Cycles per instruction



We use CPI<sub>avg</sub> ( $\frac{\text{CPI}}{\text{Time}}$ )  $\leftrightarrow$

because speed varies from Memory to CPU to I/O.

M.T.P.S (Million Instructions Per Second)

$$\text{Execution Time (secs.)} \rightarrow \frac{IC}{10^6}$$

$$1 \text{ sec} \rightarrow \frac{IC}{10^6 \times \text{Ex. Time}}$$

$$\begin{aligned} &= \frac{IC}{10^6 \times IC \times CPI_{\text{avg}}} \\ &= \frac{1}{CPI_{\text{avg}} \times 10^6} \quad (\text{M.T.P.S Rating}) \end{aligned}$$

e.g;

Program = 100,000 instructions

| Instruction Mix:        | %   | C.P.I |
|-------------------------|-----|-------|
| 1) Arithmetic           | 60% | 1     |
| 2) Data Transfer        | 20% | 4     |
| 3) Control instructions | 20% | 2     |

Date \_\_\_ / \_\_\_ / \_\_\_

Saathi

Date \_\_\_ / \_\_\_ / \_\_\_

Find Execution time of the program if the computer clock has a frequency of 400 MHz.

A)

$$\text{CPI} \\ \frac{60}{100} \times 100,000 = 60,000 \rightarrow 60,000$$

$$\frac{20}{100} \times 100,000 = 20,000 \rightarrow 80,000 \\ 20,000 \rightarrow 40,000$$

~~$$60,000 + 20,000 + 20,000 = 100,000$$~~

$$\frac{60,000 + 20,000 + 20,000}{100,000} = 1.8$$

$$\text{CPI}_{avg} = 1.8$$

$$\text{Execution time} = \frac{100,000 \times 1.8}{400 \times 10^6}$$

$$= \frac{1.8}{4000} \\ = 450 \mu\text{sec}$$

Page No. 

$$\begin{aligned} \text{MIPS rating} &= \frac{400}{1.8 \times 10^6} \\ &= \frac{400}{1.8 \times 10^5} \\ &= 222.22 \times 10^{-6} \\ &= 222.22 \text{ MIPS} \end{aligned}$$

ScaleMemory & Storage

| Decimal<br>Term | Decimal<br>value | Binary<br>value |
|-----------------|------------------|-----------------|
| KB              | $10^3$ B         | $2^{10}$        |
| MB              | $10^6$           | $2^{20}$        |
| GB              | $10^9$           | $2^{30}$        |
| TB              | $10^{12}$        | $2^{40}$        |
| PB              | $10^{15}$        | $2^{50}$        |
| EB              | $10^{18}$        | $2^{60}$        |
| ZB              | $10^{21}$        | $2^{70}$        |
| YB              | $10^{24}$        | $2^{80}$        |

Page No.

Time

| Term                                                                                             | Decimal value | Binary value |
|--------------------------------------------------------------------------------------------------|---------------|--------------|
| Second                                                                                           | 1             | 1            |
| millisecond                                                                                      | $10^{-3}$     | $2^{-10}$    |
| microseconds                                                                                     | $10^{-6}$     | $2^{-20}$    |
| nanoseconds                                                                                      | $10^{-9}$     | $2^{-30}$    |
| pico-seconds                                                                                     | $10^{-12}$    | $2^{-40}$    |
| atto-seconds                                                                                     | $10^{-15}$    | $2^{-50}$    |
| <del>ET<sub>A</sub> = C<sub>A</sub> * T<sub>A</sub></del>                                        |               |              |
| <del>ET<sub>B</sub> = C<sub>B</sub> * T<sub>B</sub></del>                                        |               |              |
| <del><math>ET_A = 500 \times 10^{-12}</math></del>                                               |               |              |
| <del><math>ET_B = 600 \times 10^{-12}</math></del>                                               |               |              |
| Computer B is faster 1.2 times                                                                   |               |              |
| <del>MFPS<sub>A</sub> = <math>2 \times 10^6 \times 250 \times 10^{-12}</math></del>              |               |              |
| <del>= <math>10^6</math> MFPS</del>                                                              |               |              |
| <del>MFPS<sub>B</sub> = <math>1.2 \times 10^6 \times 500 \times 10^{-12} = 600</math> MFPS</del> |               |              |

1) Suppose we have two implementations of the same computer A has a clock cycle time of 250 ps and CPI of 2.0 for same program.

And

Computer B has a clock cycle time of 500 ps and a CPI of 1.2 for same program.

Which computer is faster and by how much?

$$\text{MFPS}_A = 2$$

$$\text{CPI}_B = 1.2$$

$$T_A = 250 \times 10^{-12}$$

$$T_B = 500 \times 10^{-12}$$

$$\text{MFPS}_A = C_A \times T_A = 2 \times 250 \times 10^{-12} = 500 \times 10^{-12}$$

$$\text{MFPS}_B = C_B \times T_B = 1.2 \times 500 \times 10^{-12}$$

Scanned by CamScanner

2)

A given application written in Java source in 15 secs on a desktop processor. A new java compiler is released that requires only 0.6 as many instructions as the old compiler. Unfortunately it increases the CPI by H.

How fast we expect the application to run using this new compiler?

3)

A compiler designer is trying to decide between three code sequences for a particular computer. The hardware designer have supplied the following facts:

| CPI | A | B | C |
|-----|---|---|---|
| 1   | 2 | 3 |   |

For a particular high level language statement, the compiler writer is considering three code sequences, requiring following instruction counts:

$$\begin{aligned}
 E_{\text{time old}} &= 0.6 \times IC \times ICPI \times T \\
 &= 0.6 \times 1.1 \times 15 \\
 &= 9.9 \text{ sec}
 \end{aligned}$$

5.1sec factor

$$\begin{aligned}
 E_{\text{time old}} &= 15 \\
 E_{\text{time new}} &= 9.9
 \end{aligned}$$

Which code sequence executes the most instructions? — ②

Which will be faster?

What is the CPI of each sequence?

$$CPI \text{ for A} = \frac{(2 \times 1) + (1 \times 2) + (2 \times 3)}{5}$$

$$= \frac{2 + 2 + 6}{5} = \frac{10}{5} = 2$$

$$\begin{aligned} CPI \text{ for B} &= (4 \times 1) + (1 \times 2) + (1 \times 3) / 6 \\ &= 4 + 2 + 3 = \frac{9}{6} = 1.5 \end{aligned}$$

Assume that we design alternatives to decrease the CPI of F P S Q R to 2.00. decrease the average CPI of all FP operations to 2.5. Compute these two design alternatives using processor performance equations.

$$(2.5 \times 4) + (2 \times 20) + (1.5 \times 15) = \text{original CPI}$$

$$40 + 97.00 = 137.00$$

$$\text{Original CPI} = 2.5 \times 4 + 0.75 \times 1.33 = 2$$

$$\text{Design 1 CPI}_{avg} = 2 - 0.02(20 - 2)$$

$$= 1.64$$

$$\text{Design 2 CPI}_{avg} = 0.25 \times 2 + 0.75 \times 1.33 = 1.625$$

- 4) Suppose we have made the following measurements.

~~Design 2~~ Design 2 is better  
(Common case fast)

Frequency of FP operations = 25%

Average CPI of FP operations = 9

Average CPI of other operations = 1.33

Fragility of F P S Q R = 2%

CPI of F P S Q R = 20

$$P_{\text{active}} = I_{\text{through}} \times V$$

20% / year

~~Summarizing Performance~~

Computer-A Computer-B

| Program-1 (sec) | 1    | 10  |
|-----------------|------|-----|
| Program-2 (sec) | 1000 | 100 |

$$\text{Arithmetic mean} = \frac{1+1000}{2} = 500.5 \text{ sec}$$

500.5 sec

5.5 sec

### Trend in Power in Integrated Circuits

$$P_{\text{dynamic}} = \frac{1}{2} CV^2$$

C = Capacitive Load

Computer-A Computer-B Weights

| Program-1 (sec) | 1    | 10  |
|-----------------|------|-----|
| Program-2 (sec) | 1000 | 100 |

100%

10%

80%

5V → 1V

voltage scaling

Weighted arithmetic mean

Computers are now having power problem therefore multiprocessor is introduced.

$$\text{Comp. (A) (WAM)} = 1 \times 0.2 + 1000 \times 0.8 = 800.2$$

$$\text{Comp. (B) (WAM)} = 10 \times 0.2 + 100 \times 0.6 = 82$$

Comp-A      Comp-B      Weightage

### Benchmark Suite

Page - 1 (sec)      1      10      98%      1

Page - 2 (sec)      1000      100      2%

1

2252

0.6

1) String Processing

2)

String Processing

0.6

3) Block Sorting

2.390

0.376

0.376

4) GNUMU Compiler

7.94

2.66

2.66

5) Benchmark (AL)

2.21

2.21

2.21

6) Search tree

Sequence.

Sequence.

Sequence.

### System Performance Evaluation Corporation (SPEC)

- 1) Video Compression
- 2) XML Parsing

Benchmark suite to be run by all producers issued by this corporation. This is to check performance of computers. Include CPU, I/O, etc.

$$\text{SPEC-Ratio}_i = \frac{T_{ref,i}}{T_{comp,i}}$$

} for each process in Benchmark Suite

Date / /

Saathi

If there are  $n$  SPEC-Ratio for a computer

$$\text{Geometric mean} = \sqrt[n]{\prod_{j=1}^n \text{SPEC-Ratio}_j}$$

$$\sum_i T_i = \text{sum} \\ \pi = \text{product} \\ \bar{x} = \sqrt[n]{\text{Ex-Time Ratio}_1 \times \text{Ex-Time Ratio}_2 \times \dots \times \text{Ex-Time Ratio}_n}$$

= SPEC Rating

~~Geometric Mean is used instead of Arithmetic Mean because GM is unbiased to reference comp.~~

e.g;

There are two machines A and B and a reference machine.

There are two Tests T<sub>1</sub> and T<sub>2</sub> and we obtain the following scores for the machines.

| Machines          | Test T <sub>1</sub> | Test T <sub>2</sub> |
|-------------------|---------------------|---------------------|
| Machine A         | 10 sec              | 100 sec             |
| Machine B         | 1 sec               | 500 sec             |
| Reference Machine | 1 sec               | 100 sec             |

Page No. [ ]

Date / /

Saathi

|           | Test T <sub>1</sub>  | Test T <sub>2</sub>     |
|-----------|----------------------|-------------------------|
| Machine A | $\frac{1}{10} = 0.1$ | $\frac{1}{100} = 0.01$  |
| Machine B | $\frac{1}{1} = 1$    | $\frac{1}{500} = 0.002$ |

$$\text{SPEC Rating}_{JA} = \sqrt[3]{0.1 \times 1} = 0.3162$$

$$\text{SPEC Rating}_{JB} = \sqrt[3]{1 \times 0.002} = 0.0472$$

B is better

Performance Improvement By Multicore System

$$\text{Ex-Time multicore} = T$$

$$\text{Ex-Time multicore} = \frac{T}{N} \quad (N = \text{No. of Processors})$$

Page No. [ ]

## Parallel Part

Amdahl's Law: The amount of speedup is limited by the fraction of the program that can't be parallelized.

When  $N$  is very high ( $N = \infty$ )

$$\frac{1}{(1-\frac{f}{N})} + \frac{f}{N} = 1 - \frac{f}{N}$$

Original Execution time by Uniprocessor =  $T_{old}$

New Execution time by Multiprocessors

$$= (1 - \frac{f}{N}) T_{old} + \frac{f}{N} T_{old}$$

$N = \text{No. of Processors}$



$$\text{Speed Up} = \frac{T_{old}}{T_{new}} = \frac{T_{old}}{(1-f)T_{old} + fT_{old}}$$

→ Number of Processors

$$\text{Speed Up} = \frac{1}{(1-f) + \frac{f}{N}}$$

Date \_\_\_ / \_\_\_ / \_\_\_

Saathi

Date \_\_\_ / \_\_\_ / \_\_\_

Saathi



i)  $f = 50\%$   $\frac{\text{Max Speed up}}{\text{speed up}} = \frac{1}{1-0.5} = 2 \text{ times}$

ii)  $f = 75\%$   $\frac{\text{Max Speed up}}{\text{speed up}} = \frac{1}{1-0.75} = 4 \text{ times}$

iii)  $f = 90\%$   $\frac{\text{Max Speed up}}{\text{speed up}} = \frac{1}{1-0.9} = 10 \text{ times}$

iv)  $f = 95\%$   $\frac{\text{Max Speed up}}{\text{speed up}} = \frac{1}{1-0.95} = 20 \text{ times}$

Andehl's Law

e.g;

Consider an enhancement which runs 20 times faster but which is usable only 15% of the time. What is the overall speedup?

20 times  $\rightarrow f = 20 \times 95\%$ .

$$15\% \rightarrow 20x$$

$$T = \frac{1}{f} = \frac{1}{0.15}$$

15%

$$\text{Speed up} = \frac{(1 - 0.15) + \frac{0.15}{20}}{20} = 1.166$$

$$S = \frac{1}{1 + \frac{q}{N}}$$

$$N + q = N$$

$$1 + \frac{q}{N} = 1$$

$$\frac{q}{N} = 0$$

$$N = \infty$$

e.g;  
Suppose a program spends 80% of its time in a square root routine. How much must you speed up square root to make program 5 times faster?

Date \_\_\_ / \_\_\_ / \_\_\_

Saathi

### Classification of Multiprocessor Architecture.

Flynn's Taxonomy 1968 based on number of instruction stream and data stream.

- 1) Single Instruction, Single Data (SISD)
- 2) Single Instruction, Multiple Data (SIMD)
- 3) Multiple Instruction, Single Data (MISD)
- 4) Multiple Instruction, Multiple Data (MIMD)

### SISD architecture



Uniprocessor

Date \_\_\_ / \_\_\_ / \_\_\_

Saathi

### SIMD architecture (Vector Processor),

[Instruction Pool]



### MISD architecture



$$\text{e.g. } z = \sin(x) + \cos(x) + \tan(x)$$

(Very rarely used)

MTMD Architecture

Only rarely or sometimes access shared memory

MPT (Message Passing Interface) Architecture

SMP (Shared Memory Processing)

UMA (Uniform Memory Access Architecture)

~~UMA (Uniform Memory Access Architecture)~~

Drawback → for large system congestion  
in the much

16-8-18

Date / /

UNNPLI

MIPS 32 bit ISA (Instruction Set Architecture)

+ S1 → how software &amp; hardware interact



RISC also called load/store architecture  
around 200 different instruction

all are 32 bits long

Number of Registers = 32

A few data access modes

Design principle → Simple is faster.

MIPS 32 bit design principle

- 1) Simplicity favors regularity
- 2) Smaller is faster
- 3) Make common case fast
- 4) Good design demands good compromise  
(3 different instruction format)

CISC

Intel 8086 uses this

Very large no. of instructions.



RF = register file

$$\text{e.g. } f = (g + h) - (i + j) \quad \text{HLL}$$

Registers

Compiler assigns registers for each variable.

$$\text{add } \$t_0, \$t_1, \$t_2 \quad // \$t_0 = \$t_1 + \$t_2$$

$= i + j$

$$\text{add } \$t_0, \$t_1, \$t_2 \quad // \$t_0 = \$t_1 + \$t_2$$

$= g + h$

$$\text{sub } \$t_0, \$t_1, \$t_2 \quad // \$t_0 = \$t_1 - \$t_2$$

$f = (g + h) - (i + j)$

Load → send data from memory to processor  
Store → send data from processor to memory.

Date — / — /

Date — / — /

Sam

## MIPS Registers and Usage Convention

Name Number Class

\$2024

\$0

\$0 = 0

setvalzero

\$at

\$1

Assembly Temporary

\$v0-\$v1

\$2,\$3

values of function

results

int add (int a, int b)

\$20,\$23

short → shift amount

result = a + b;

RT → source register

3

RD → destination register

file stored

int var

When 3 registers the "R format".

\$30-\$a3 \$4-\$7

Arguments

\$t0-\$t2

\$8-\$15

Temporaries

\$s0-\$s7

\$16-\$23

Saved temporaries

\$t8-\$t9

\$24-\$25

Temporaries

\$K0-\$K1

\$26-\$27

Reserved for OS Kernel

\$gp

\$28

Gotto Global Pointer

\$sp

\$29

Stack pointer

\$fp

\$30

Frame pointer

\$ra

\$31

return address

| opcode | 5 bits | 6 bits | 5 bits | 5 bits  |
|--------|--------|--------|--------|---------|
| opcode | AS     | RT     | RD     | short 0 |

opcode

function

RT

source register

RD

destination register

When 3 registers the "R format".

MIPS instruction Format.



For R format shades will have 6 zeroes  
address offset

For R format shades will have 6 zeroes

Date \_\_\_\_\_

(Saathi)

opcode \$t3 \$t4 \$t5 \$0 function

|        |       |       |       |       |               |
|--------|-------|-------|-------|-------|---------------|
| 000000 | 01011 | 01100 | 01101 | 00000 | ! 00000       |
| \$11   | \$12  | \$13  | \$14  | add   | $(\$13)_{10}$ |

add  $\rightarrow 32_{10}$  ~~\_\_\_\_\_~~

Sub \$t0, \$t6, \$t5

$\downarrow$   
 $(\$2)_{10}$

|        |       |       |       |                |          |
|--------|-------|-------|-------|----------------|----------|
| opcode | \$t6  | \$t5  | \$t0  | <del>\$0</del> | sub      |
| 000000 | 01110 | 01101 | 00000 | 00000          | ! 100010 |

Page No. \_\_\_\_\_

Date  $t_0$ ,  $t_1$ ,  $t_2$ ,  $t_3$

(Saathi)

$$g = h + A[i], \text{ index } H[i]$$

across the memory

$$\$50 = \text{Addr}(A[0])$$

Array  
size 8

already known  
to compiler

|   |   |   |   |   |      |
|---|---|---|---|---|------|
| 0 | 0 | 1 | 2 | 3 | A[0] |
| 4 | 4 | 5 | 6 | 7 | A[1] |

lsl  $\rightarrow$  shift left logic

lw  $\rightarrow$  load word

bytes 12

8

7

6

5

4

3

2

1

0

Memory is Byte  
addressed by  
byte address

3LL \$t6, \$t2, \$2 //  $\$t2 = i * 4$

$\rightarrow$  shift left 2 times

add \$t0, \$50, \$t2

shift left logic

$$\begin{aligned} & // \$t0 = \$50 + \$t2 \\ & = \text{Addr}(A[0] \\ & + i * 4) \end{aligned}$$

lw \$t3, 0(\$t0)

add \$t0, \$t3, \$t1 //  $\$t0 = \$t3 + \$t1$

$$g = M(\text{addr}(A[0] + i * 4))$$

+ 1

Page No. \_\_\_\_\_

Tut)

MIT PS 32 bit Assembly Code

e.g.:  
for (i=0; i< 100, i++)

a[i] = b[i] + 10; }

\$s0      \$s1



Loop:

addi \$t1, \$t0, \$100 // \$t1 = 0 + 100  
lw \$t2, \$t0, 2 // \$t2 = \$t0 \* 4  
= i \* 4  
add \$t3, \$s1, \$t2 // \$t3 = \$s1 + \$t2  
= add(\$B[0]) + i \* 4  
= add(\$B[i])

~~lw \$t4, 0(\$t3)~~

$\rightarrow [t_4 = B[i]]$

addi \$t5, \$t4, 10 //  $t_5 = B[i] + 10$

add \$t6, \$s0, \$t5 //  $t_6 = \text{addr}(B[i])$

sw \$t5, 0(\$t6) //  $a[i] = b[i] + 10$

Page No.

addi \$t2, \$t2, 1 //  $t_2 = t_2 + 1$

bne \$t2, \$t1, LOOP

END:

e.g.:  $i = j$

while ( $i != j$ )

if ( $i > j$ )

{  $i = i - j;$  }

else

{  $j = j - i;$  }

}

bne \$t1, \$t2

~~SLT \$t3, \$t1, \$t2~~

~~LOOP~~

LOOP: bne \$t0, \$t1, END

slt \$t2, \$t0, \$t1, //  $t_2 = 1$  if  $t_0 < t_1$

bne \$t2, \$t0, ELSE

sub \$t0, \$t0, \$t1, //  $t_0 = t_0 - t_1$

j LOOP

~~ELSE: sub \$t1, \$t1, \$t0, //  $t_1 = t_1 - t_0 = j - i$~~

~~j LOOP~~

Page No.

END

F1 - A17 \* CC1

## MTPS 32 Architecture

| Type | R                | opcode(6) | rs(6) | rt(5) | rd(5) | shift function | T       | opcode(6) | rs(5) | Imm(5) | shift(5) | Func(6)     |
|------|------------------|-----------|-------|-------|-------|----------------|---------|-----------|-------|--------|----------|-------------|
| add  | \$rt, \$rt, \$rt | 000000    | 01010 | 01100 | 01110 | 00000          | 1000010 | 000000    | 01001 | 01010  | 01000    | 00000100000 |
| sub  | \$rt, \$rt, \$rt | 000000    | 01010 | 01100 | 01110 | 00000          | 1000010 | 000000    | 01001 | 01010  | 01000    | 10000       |

Rs = source register

Rt = target register

rd = destination register

shift = shift amount

Load &amp; store instruction

Mnemonic      Format      Encoding

|    |   |                      |
|----|---|----------------------|
| lw | I | 32 bit offset ← rt → |
| lw | I | 32 bit offset ← rt → |
| lw | I | 32 bit offset ← rt → |
| lw | I | 32 bit offset ← rt → |
| lw | I | 32 bit offset ← rt → |

- i) jump ✓ (only opnd & address)

## MIPS 3.2 Architecture

Arithmatic, logic, instructions

Jump & Branch

| Mnemonic | Format | Encoding                         | Mnemonic | Format | Encoding                              |
|----------|--------|----------------------------------|----------|--------|---------------------------------------|
| add      | R      | 0 <sub>10</sub> 0100000000000000 | J        | T      | 0 <sub>10</sub>                       |
| addr     | R      | 0 <sub>10</sub> 0100000000000000 |          |        | ← instruction address                 |
| sub      | R      | 0 <sub>10</sub> 0100000000000000 | JAL      | T      | 3 <sub>10</sub> ← instruction address |
| and      | R      | 0 <sub>10</sub> 0100000000000000 |          |        |                                       |
| or       | R      | 0 <sub>10</sub> 0100000000000000 | beq      | I      | 4 <sub>10</sub> 0000000000000000      |
| xor      | R      | 0 <sub>10</sub> 0100000000000000 |          |        | ← offset →                            |
| addi     | I      | 0 <sub>10</sub> 0100000000000000 | lbeq     | I      | 4 <sub>10</sub> 0000000000000000      |
|          |        | from <del>base</del> offset →    |          |        | ← offset →                            |
|          |        | e.g.: 101010                     |          |        |                                       |
|          |        | 101001                           |          |        |                                       |
|          |        | 101000                           |          |        |                                       |

and

I

12<sub>10</sub> ~~base~~ offset →

Shift instruction

Mnemonic Format Encoding

|     |   |                                  |
|-----|---|----------------------------------|
| SLL | R | 0 <sub>10</sub> 0000000000000000 |
| SRL | R | 0 <sub>10</sub> 0000000000000000 |
| SRA | R | 0 <sub>10</sub> 0000000000000000 |

```

c9;
if (i == j)
    f = g + h;
}
else
{
    f = g - h;
}

```

- 1) Instruction Fetch
- 2) Instruction Decode, add R2, R3 //  $R1 = R2 + R3$
- 3) Instruction Execute  $\Rightarrow R2 + R3$
- 4) NOP (No operation)
- 5) Write Back  $\Rightarrow R1 = R2 + R3$



- 1) Instruction Fetch
- 2)  $T \cdot D \Rightarrow R3$
- 3)  $F \rightarrow 1032 + R3$  (Address calculation)
- 4) Memory Access
- 5) Write Back  $R1 = \text{value}[1032(R3)]$

SW RL, 32 + R3)

$$\text{Throughput} = \frac{1}{IPC} = \frac{1}{\frac{4}{8}} = \frac{1}{2}$$

$$\text{Non-pipeline}(IPC) = \frac{1}{5}$$

- 3) Ex  $\Rightarrow$  32 + R3 (Address)
- 4) Mem  $\Rightarrow$  Value [adder(32 + R3)]  $\leftarrow R_1$
- 5) NOR

Pipeline is better.

$$\text{Speed up} = \frac{T_{\text{Non Pipeline}}}{T_{\text{Pipeline}}} = \frac{20}{6} = 2.5$$

Maximum speed up is 5 times.  
 $\uparrow$  stages  $\uparrow$  speed up

add R1, R2, R3 F D E M W  
 sub R4, R5, R6 F D E M W  
 and R7, R8, R9 F D E M W  
 XOR R11, R12, R13 F D E M W

Pipeline Hazards or when maximum benefit from pipeline won't be obtained.

Structural Hazard  $\rightarrow$  2 processes required

(same port)

3 cycles for 4 instructions

2) Data Dependency  $\rightarrow$

CR1 = 2  
 3) Control Hazard  $\rightarrow$  Branch,

Date / /

Sarthi

Date / /

Sarthi

## MIPS 5-stage Pipeline



## Data Dependency

- $S_1 : C = D \times E$   
 $S_2 : M = G + C$  } not || or .. can't work same time  
 $S_3 : A = B + C$  } parallel  
 $S_4 : C = L + M$ .  
 $S_5 : F = G \div E$

## Data Flow Diagram

Read & write both require it  
In half cycle write in another  
half read takes place

Hardest architecture followed

By modifying hardware structural hazard is removed.



## Types of Data Dependency

- 1) Read after Write (RAW) dependency.
- 2) Write after Write (WAW) dependency (output dependency)
- 3) ( $S_4, S_2$ ) Write after Read ( $S_3, S_4, S_2$  Antidependency).

$I \rightarrow \text{Input}$   
 $O \rightarrow \text{Output}$

$$C = D \times E$$

## Dependency Graph



## Bernstein's Condition for Parallelism

$$\begin{array}{c|c} I_1, I_2 & I_1 \cap O_2 = \emptyset \\ O_1, O_2 & I_2 \cap O_1 = \emptyset \\ & O_1 \cap O_2 = \emptyset \end{array}$$



- S1: Load  $R_1, A // R_1 = A$   
 S2: add  $R_2, R_1 // R_2 = R_1 + R_2$   
 S3: Mem  $R_1, R_3 // R_1 = R_3$   
 S4: store  $B, R_1 // B = R_1$

29.8.19

Date — / — / —

*Saathie*

Program

$I_1$ : Sub 21, Open  $\rightarrow$  F D E M W  
 $I_2$ : Sub 24, 21, 26  $\rightarrow$  F - - D  
 $I_3$ : and 26, 21, 27  
 $I_4$ : on 28, 21, 29

1 cycle

Program

$I_1$ : add 21, 22, 23  $\rightarrow$  F D E M W  
 $I_2$ : sub 24, 21, 23  $\rightarrow$  F - D E M W  
 $I_3$ : and 26, 21, 27  $\rightarrow$  F D E M W  
 $I_4$ : on 28, 21, 29  $\rightarrow$  F D E M W  
 $I_5$ : XOR 21, 24, 21, 29  $\rightarrow$  F D E M W

$I_1$   $\rightarrow$   $I_2$   
 $I_2$   $\rightarrow$   $I_3$   
 $I_3$   $\rightarrow$   $I_4$   
 $I_4$   $\rightarrow$   $I_5$

Forwarding



$$CPI = \frac{9}{4} = 2.25$$

$$T_{PC} = \frac{4}{9} = 0.44$$

$F \rightarrow E M W$   
 $E \rightarrow D E M W$   
 $F \rightarrow F R M W$   
 $F \rightarrow R M W$   
 $F \rightarrow D E M W$

Stalling is done to cope with hazard

Forwarding is mechanism so that it can be used when available & produce stalls. (Good Use Hazard)

Date — / — / —

*Saathie*

Date \_\_\_\_\_

FDEMW

FDEMw

FDEMw

FDEMw

①

②

③

④

⑤

⑥

⑦

⑧

$$\text{Cycles} = 9$$

$$\text{Inst.} = 5$$

$$CPI = \frac{9}{5}$$

$$IPC = \frac{5}{9}$$

Program

Forwarding is done / can be done from M only

- 1) SW \$t1, 0(\$s0) FDEMw
- 2) LW \$t2, 0(\$s1) FDEMw
- 3) add \$t3, \$t1, \$t2 F - D EMW
- 4) SW \$t3, 0(\$s2) FDEMw
- 5) LW \$t4, 0(\$s3) FDEMw
- 6) LW \$t5, 0(\$s4) FDEMw
- 7) sub \$t6, \$t4, \$t5 F - D EMW
- 8) SW \$t6, 0(\$s5) FDEMw

$$CPI = \frac{1}{8}$$

$$IPC = \frac{8}{14}$$

Date \_\_\_ / \_\_\_ / \_\_\_

(Saathi)

Code reordering is done to cope up with load-use hazard.

Reordered code

|                         |                      |
|-------------------------|----------------------|
| lw \$t_1, 0(\$s_0)      | F D E M W            |
| lw \$t_2, 0(\$s_1)      | F D E M W            |
| lw \$t_4, 0(\$s_2)      | F D E M W            |
| lw \$t_5, 0(\$s_4)      | F D E M W            |
| add \$t_3, \$t_1, \$t_2 | <del>F D E M W</del> |
| sw \$t_3, 0(\$s_2)      | F D E M W            |
| sub \$t_6, \$t_4, \$t_5 | <del>F D E M W</del> |
| sw \$t_8, 0(\$s_5)      | F D E M W            |

$$CPT = \frac{12}{8}$$

$$IPC = \frac{8}{12}$$

Date \_\_\_ / \_\_\_ / \_\_\_

(Saathi)

Hazard control techniques:

- 1) Forwarding  
(Load-Use Hazard)
- 2) Code reordering

(Stall in branch - (Branch Hazard))

- 3) Branch Prediction
  - Branch Taken
  - Branch Not Taken

+ Delay slot

Branch Hazard solution

5) Loop unrolling

Branch Hazard

- 1) Branch prediction
- 2) Delay slot
- 3) Loop unrolling

## Program

comparision  $\leq$

```
for(i=1000; i>0; i=j-1)
{
    if(x[i] <= x[j])
        j = i;
}
```

1) bcy \$r1, \$r3, 36 F D E M W  
 2) and \$r2, \$r3, \$r5 - F D E M W  
 3) or \$r6, \$r1, \$r7  
 4) add \$r8, \$r1, \$r9  
 5) xor \$r10, \$r1, \$r11

~~1) add \$r1, \$r2, \$r3, 1000~~

add \$t1, \$0, \$1000  
 add \$t2, \$t0, 2  
 add \$t3, \$t1, \$t2  
 dcl \$t4, 0(\$t2)



No dependency  $\therefore$  bcy do not write 6 before

$g_1 > g_3 \therefore 5$

dependent  
between

loop: see \$t2, \$t0, 2 //  $t_2 = t_0 * 4$

add \$t3, \$t0, \$t2, //  $t_3 = Add(x[0], t_0 * 4$

$t_0 * 4$

i.e.,  
 delay slot  
 (load hazard)  
 add

\$t4, 0(\$t3) //  $t_4 = R[x[1]]$   
 add \$t5, \$t4, \$t1 //  $t_5 = t_1 + 5$

control  
dependency

sw \$t5, 0(\$t5) //  $x[i] = x[i] + 5$

subi \$t0, \$t0, 1 //  $t_0 = t_0 - 1$

Date / /

Saathi

bne \$t0,\$0, Loop  
ENJ

L Branch always stall for one.

Cycles

$$5 + (12 \times 1000) + 1000 \\ = 5 + 12000 + 1000 \\ = 13005$$

Date / /

1000  
A.

Saathi

loop unrolling  
 $\{$   
 $x[i] = x[i] + s_0 ;$   
 $x[i-1] = x[i-1] + s_0 ;$   
 $x[i-2] = x[i-2] + s_0 ;$   
 $x[i-3] = x[i-3] + s_0 ;$   
 $\}$

add \$t1, \$0, \$0 //  $t1 = 0$   
 SW \$t2, \$t0, 2 //  $t2 = i+4$   
 add \$t2, \$0, \$t2 //  $t3 = add(x[i])$   
 LW \$t4, 0(\$t2) //  $t4 = x[i]$   
 LW \$t5, -4(\$t2) //  $t5 = x[i-1]$   
 LW \$t6, -8(\$t2) //  $t6 = x[i-2]$   
 LW \$t7, -12(\$t2) //  $t7 = x[i-3]$   
 add LW \$t8, \$t1, \$t1 //  $t8 = x[i]+5$   
 add LW \$t9, \$t5, \$t1 //  $t9 = x[i-1]+5$   
 add t9 \$S2, \$t6, \$t1 //  $s2 = x[i-2]+5$   
 add \$t8, 0(\$t2) //  $x[i] = x[i]+5$   
 SW \$t8  
 add \$S3, \$t7, \$t1 //  $s3 = x[i-3]+5$   
 SW \$tB, 0(\$t2) //  $x[i] = x[i]+5$   
 SW \$t9, -4(\$t2)  
 SW \$S2, -8(\$t2)  
 SW \$S3, -12(\$t2)  
 Sub \$t0, \$t0, -4 //  $i = i - 4$   
 bne \$t0, \$0, Loop

Page No. [ ]

Page No. [ ]

Date / /

*Saathit*

Date / /

*Saathit*

No of cycles

$$\begin{array}{r}
 250 \\
 \times 250 \\
 \hline
 25000 \\
 + 250 \\
 \hline
 5005
 \end{array}$$

Superscalar Architecture

Superscalar Processor

Pipeline is overlapping it is not composite pipeline.

CPI: 7

IPC: 6 (Compiler based)

Multiple registers ports, multiple fetches, decoder etc.

Very Large Instruction Word (VLIW) Processor

| I <sub>1</sub> | I <sub>2</sub>    |
|----------------|-------------------|
| ALU or Branch  | LW/SW instruction |

Date / /

Saathi

4-9-18

Date / /

Saathi

| Smt 1                 | Smt 2                                |
|-----------------------|--------------------------------------|
| 1) add \$t1,\$0,\$0   |                                      |
| 2) Sll \$t2,\$t0,2    |                                      |
| 3) add \$t3,\$50,\$t2 | lw \$t4,0(\$t3) <sup>depending</sup> |
| 4) bne \$t0,\$50,-4   | lw \$t4,0(\$t3)                      |
| 5) add \$t5,\$t1,     | lw \$t5,-4(\$t3)                     |
| 6) add \$t6,\$t4,\$s1 | lw \$t6,-8(\$t3)                     |
| 7) add \$t7,\$t5,\$s1 | lw \$t7,-12(\$t3)                    |
| 8) add \$s2,\$t6,\$s1 | sw \$t8,0(\$t3)                      |
| 9) add \$s3,\$t7,\$s1 | sw \$t9,-4(\$t3)                     |
| 10) bne \$t0,\$0,100  | sw \$s2,-8(\$t3)                     |
| 11)                   | sw \$s3,-12(\$t3)                    |

Cycles

$$= 5 + (10 \times 250)$$

$$= 2505$$



feedback path

- 1) Streamline path
- 2) Feedback path
- 3) Feedforward path
- 4) Multiple outputs from any stages of pipeline

Linear Pipeline



Reservation Table for function X

|                | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|----------------|---|---|---|---|---|---|---|---|---|
| S <sub>1</sub> | X | X |   |   |   | X |   | X |   |
| S <sub>2</sub> |   | X | X | X |   |   |   |   |   |
| S <sub>3</sub> |   |   | X | X | X |   |   | X |   |

Date \_\_\_\_\_

Saathi

Date \_\_\_\_\_

Saathi

Reservation Table for function Y

|                | 1 | 2 | 3 | 4 | 5 | 6 | 7 |  |
|----------------|---|---|---|---|---|---|---|--|
| S <sub>1</sub> | Y |   |   |   | Y |   |   |  |
| S <sub>2</sub> | X |   |   |   |   |   |   |  |
| S <sub>3</sub> |   | Y |   | Y |   | Y |   |  |

|                | 1 | 2 | 3 | 4 | 5 | 6 | 7 |  |
|----------------|---|---|---|---|---|---|---|--|
| S <sub>1</sub> | Y |   |   |   | Y |   |   |  |
| S <sub>2</sub> | X |   |   |   |   |   |   |  |
| S <sub>3</sub> |   | Y |   | Y |   | Y |   |  |

Stage 1: 5      7      2  
 $(6-1) \quad (8-1) \quad (6-6)$

Stage 2: 2  
 $(4-2)$

Stage 3: 2      4  
 $(5-3) \quad (7-3)$

→ collision

not allowed

2 ins can't be assigned at same time.

### Forbidden latency for function X

Stage 1: 5 is forbidden latency.

5, 7 are forbidden latency.

6, 8

all forbidden latency



## Collision Vector

C<sub>1</sub> C<sub>2</sub> C<sub>3</sub> C<sub>4</sub> C<sub>5</sub> C<sub>6</sub> C<sub>7</sub>

1 0 1 1 0 1 0

State diagram

1 0 1 1 0 1 0

Right shift

1 1 3 4 5 6 7 8 9  
S<sub>1</sub> X X X X S<sub>1</sub>  
S<sub>2</sub> X X X X S<sub>2</sub>  
S<sub>3</sub> X X X X S<sub>3</sub>

1 0 1 1 0 1 0

Right shift

1 2 3 4 5 6 7 8  
S<sub>1</sub> X X X X S<sub>1</sub>  
S<sub>2</sub> X X X X S<sub>2</sub>  
S<sub>3</sub> X X X X S<sub>3</sub>

Saathie

0 will comment, if 0

comes out then to

Collision

you can see a new

construction

1 0 1 1 0 1 0

0 1 0 1 0 1 ← on left right shift

OR

1 1 1 1 1 1

all

Date / /

Saath

Date / /

Saath



Greedy Cycle  $\rightarrow$  no other option  
 $(3,8)$ ,  $(6,8)$ ,  $(1,8)$ ,  $(3,3)$

$(6,3)$     $(3,6)$     $(6,6)$

|       | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|-------|---|---|---|---|---|---|---|---|---|
| $S_1$ | X |   |   | O |   | X |   | X | O |
| $S_2$ |   | X |   | X | O |   | O |   |   |
| $S_3$ |   |   | X |   | X | O | X | O | O |

Simple Cycles:  $(3,8)$   $(6,8)$   $(1,8)$   $(8)$

$(3,3)$   $(6,3)$   $(6,6)$

Cycles:

- $(3,8) \rightarrow 1$
- $(6,8) \rightarrow 14$
- $(1,8) \rightarrow 9$
- $(3,3) \rightarrow 6$
- $(6,3) \rightarrow 9$
- $(6,6) \rightarrow 12$
- $(3,6) \rightarrow 9$

Minimum average latency (MAL)

$$= 3+3 = 3$$

After every 3 cycle a new instruction can be issued with no conflict.

### Nonlinear Pipeline Scheduling

a) What is throughput of this pipeline?

Reservation Table

|                | 1 | 2 | 3 | 4 | 5 | 6 |
|----------------|---|---|---|---|---|---|
| S <sub>1</sub> | X | 0 | 1 | . | X |   |
| S <sub>2</sub> | X | 0 | 1 | X | 0 | □ |
| S <sub>3</sub> |   | X | 0 | 1 |   |   |
| S <sub>4</sub> |   |   | X | 0 | 1 |   |
| S <sub>5</sub> | X | 0 | □ | X | 0 | 0 |

$$S_1: 6 - 1$$

$$= 5$$

$$S_2: 5 - 2$$

$$= 3$$

S<sub>3</sub>:

S<sub>4</sub>:

$$S_5: 6 - 2$$

$$= 4$$

Forbidden = {3, 4, 5}  
Latency

C<sub>5</sub> C<sub>4</sub> C<sub>3</sub> C<sub>2</sub> C<sub>1</sub>

1 1 1 0 0  $\rightarrow$  Collision vector

Page No.



Simple cycles: (2, 6), (1, 6), (6), (1, 1, 6)

Greedy cycle: (2, 6), (1, 6), (6), (1, 1, 6)

Minimum average latency:

$$(2, 6) \rightarrow 8/2 = 4 \quad (1, 1, 6) \rightarrow 8/3 = 3$$

$$(1, 6) \rightarrow 7/2 = 4$$

$$(6) \rightarrow 6/1 = 6$$

Page No.

MAL = 3

{ 5 3 5 }

Throughput = No. of cycles per instruction

Per cycle = 1 instruction is executed

$$= \frac{1}{3}$$



$$\begin{matrix} C_5 & C_4 & C_3 & C_2 & C_1 \\ 1 & 0 & 1 & 1 & 0 \end{matrix}$$

$$a) \quad \begin{array}{|c|c|c|c|c|} \hline & 1 & 2 & 3 & 4 & 5 & 6 \\ \hline S_1 & X & & & X & & X \\ \hline S_2 & & X & & X & & \\ \hline S_3 & X & & X & & & \\ \hline S_4 & & & X & & & \\ \hline S_5 & X & & & X & & \\ \hline \end{array}$$

$$\begin{matrix} 10111 \\ 00001 \\ 10111 \end{matrix}$$

$$S_1 : 3 \quad 5 \quad 2$$

$$(4-1) \quad (6-1) \quad (6-4)$$

Average 1

Simple cycles: {6}, {1,6}, {4,4}, {4,6}

Complex cycles: {6}, {1,6}, {4,4}, {4,6}

$$6 \rightarrow 6 \quad \{4,6\} \rightarrow 14 \approx 5$$

$$1,6,3 \xrightarrow[2]{ } 7 \approx 4$$

Averagel

$$4,4 \xrightarrow[2]{ } 8 \approx 4$$

throughput

Throughput = 4

$$S_3 : 2 \quad S_4 : 2$$

$$(3-1) \quad (6-4)$$

throughput

$$4,6,3 \xrightarrow[2]{ } 10 \approx 5$$

55. Name

Date \_\_\_ / \_\_\_ / \_\_\_

*Saathi*

|    | 1              | 2 | 3 | 4 | 5 | 6 |
|----|----------------|---|---|---|---|---|
| Q) | S <sub>1</sub> | Y |   |   | Y |   |
|    | S <sub>2</sub> |   |   | Y |   |   |
|    | S <sub>3</sub> | Y |   | Y | Y |   |

$$S_1: (5-1)=4$$

S<sub>2</sub>: None

$$S_3: (4-2)=2$$

$$(6-4)=2$$

$$(6-2)=4$$

§ 2, 4, 3

C<sub>0</sub> C<sub>1</sub> C<sub>2</sub> C<sub>3</sub> C<sub>4</sub> C<sub>5</sub>

0 1 0 1 0



Simple cycles : {5} {1,5} {3,3} {3,3,5}

{3,5}, {1,3,5} ~~{3,3,3}~~

Garbage cycles : ~~all simple cycles here~~

Date \_\_\_ / \_\_\_ / \_\_\_

*Saathi*

A.L:

$$5 \rightarrow 5$$

$$\{1,5\} \rightarrow \frac{6}{2} = 3$$

$$\{3,3\} \rightarrow \frac{6}{2} = 3$$

$$\{3,5\} \rightarrow \frac{8}{2} = 4$$

$$\{3,3,5\} \rightarrow \frac{11}{3} = 4$$

$$\text{Throughput} = \frac{1}{3}$$

Page No. \_\_\_\_\_

Page No. \_\_\_\_\_