

06/09

# A brief history

## 1. Pre history (1400 - 1930s)

Pascal, Leibniz - calculators

↳ binary system (1705)

Jacquard - weaving machines (end of 18<sup>th</sup> c.)

Johann Helerich von Müller (1486)

Charles Babbage → Analytical Engine (1837)

↳ concept: arithmetic + memory + I/O + IFs  
unit devices conditionals

Ada, countess of Lovelace → first programmer (1840s)

Vannevar Bush (1931) - mechanical integrators

(analog computers)

## 2. 1<sup>st</sup> generation: electro-mechanical (1930s - 1960s)

ASCC - Howard Aiken 1937 - 1944

ENIAC - John Mauchly + John Presper Eckert (1943-1945)

first computer SP lawsuit: vs John Atanasoff

EDVAC - Mauchly + Eckert (1948 - 1950)

basis of Von Neumann Architecture



## 3. 2<sup>nd</sup> generation: transistors (1965 - 1975)

DEC PDP-1 (1959)

Cray's CDC 6600 - first supercomputer

Sintel 4004 (1971)

Apple (1976)

first monitor (1951)

first mouse (1968)

IBM personal computer - released in 1980s  
open standards to third-party

Microprocessors (1960s - today)

4. IV generation: multi-computing  
ARPANET (1965-1969) - Leonard Kleinrock  
1972 - ARPANET public + Email

08/09

## Assembly

Logic circuits [3, 4]

Data representation [5, 6]

Instruction Set Architectures [7, 8]

Software [2]

- assembly

- C

- C++ / Java

13/09

## Logic Circuits §

$$x \cdot x = x^2 = x$$

$$x \wedge x = x$$

$$f(x) = a_0 + a_1 x + a_2 x^2 + \dots$$

$$= a_0 + a_{m_2} x$$

$$x + x = 2x = x$$

$$x \vee x = x$$

$$f(x) = a \cdot x + b \cdot (1 - x)$$

$$f(x) = f(1) \cdot x + f(0) \cdot \bar{x}$$

$$1 - x = \bar{x}$$

$$\neg x$$

$$f(x, y) = f(1, 1) \cdot \boxed{x} \cdot \boxed{y} + f(1, 0) \cdot \boxed{x} \cdot \boxed{\bar{y}} +$$

minterms

$$f(0, 1) \cdot \boxed{\bar{x}} \cdot \boxed{y} + f(0, 0) \cdot \boxed{\bar{x}} \cdot \boxed{\bar{y}}$$

XOR

| x | y | $x \otimes y$ |
|---|---|---------------|
| 0 | 0 | 0             |
| 1 | 0 | 1             |
| 1 | 1 | 0             |

$$f(x, y) = f(1, 1) \cdot x \cdot y + f(0, 1) \cdot \bar{x} \cdot y + f(1, 0) \cdot x \cdot \bar{y} + f(0, 0) \cdot \bar{x} \cdot \bar{y}$$

|   |   |   |
|---|---|---|
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |

$$x \otimes y = f(0, 1) \cdot \bar{x} \cdot y + f(1, 0) \cdot x \cdot \bar{y}$$

$$x \wedge y = f(1, 1) \cdot x \cdot y$$

$$f(x, y, z) = (\bar{x} \cdot \bar{z}) \cdot y + \bar{z} \cdot \bar{x}$$

$$= (x + z) \cdot y + \bar{z} \cdot \bar{x} = xy + z \cdot y + \bar{z} \cdot \bar{x}$$

$$\bar{x} \cdot \bar{y} = \bar{x} + \bar{y}$$

$$\bar{x} + y = \bar{x} \cdot \bar{y}$$

$$x \cdot \bar{y} = x$$

| x | y | z | f |
|---|---|---|---|
| 0 | 0 | 0 | 1 |
| 0 | 0 | 1 | 0 |
| 0 | 1 | 0 | 1 |
| 0 | 1 | 1 | 1 |
| 1 | 0 | 0 | 0 |
| 1 | 0 | 1 | 0 |
| 1 | 1 | 0 | 1 |
| 1 | 1 | 1 | 1 |



$$x \quad y \quad f \quad f = x \cdot y + \bar{x} \cdot y + x \cdot \bar{y}$$

$$0 \quad 0 \quad 0$$

$$0 \quad 1 \quad 1$$

$$1 \quad 0 \quad 1$$

$$1 \quad 1 \quad 0$$

$$f = \bar{x} \cdot \bar{y} \cdot \bar{z} + \bar{x} \cdot \bar{y} \cdot z + \bar{x} \cdot y \cdot \bar{z} + x \cdot y \cdot z$$

$$= (\bar{x} \cdot \bar{y})(\bar{z} + z) + y \cdot z(\bar{x} + x)$$

$$= \bar{x} \cdot \bar{y} + y \cdot z$$

| $\bar{z} \cdot \bar{y}$ | 00 | 01 | 11 | 10 |
|-------------------------|----|----|----|----|
| 0                       | 1  | 0  | 0  | 0  |
| 1                       | 1  | 1  | 1  | 0  |

| $\bar{z} \cdot \bar{y}$ | 00 | 01 | 11 | 10 |
|-------------------------|----|----|----|----|
| 00                      | 1  | 1  | 0  | 1  |
| 01                      | 1  | 1  | 0  | 1  |
| 11                      | 0  | 0  | 0  | 0  |
| 10                      | 1  | 1  | 1  | 0  |

$$f = \bar{x} \cdot \bar{z} + \bar{y} \cdot z + y \cdot z$$

| $\bar{z} \cdot \bar{y}$ | 00 | 01 | 11 | 10 |
|-------------------------|----|----|----|----|
| 0                       | 1  | 1  | 0  | 0  |
| 1                       | 1  | 1  | 1  | 1  |

15/09

## Logic Circuits §

20/09

## Data Representation §

$\mathbb{Z}^+ \rightarrow$  nonnegative integers  
 $\Sigma = \{0, 1, 2, 3, 4, 5, 6, 7, 8, 9\}$

$2557_{10} \rightarrow$  binary (radix-2)  
2 5 5 7  
0010 0101 0101 0111

$-52_{10}$  8 bit 2's comp       $52_{10} \rightarrow 00110100$   
 $-64_{10} \rightarrow 1000 0000$       ↓  
 $\underline{-64_{10}}$        $\begin{cases} 1000 \\ 1100 \end{cases}$        $-52_{10} \rightarrow 11001100$   
 $12_{10}$   
 $64_{10}$

$4998_{10}$  in BCD

$4998 \div 2 = 2499$       0      1  
 $2499 \div 2 = 1249$       1      2  
 $1249 \div 2 = 624$       1      4  
 $624 \div 2 = 312$       0      8  
 $312 \div 2 = 156$       0      16  
 $156 \div 2 = 78$       0      32  
 $78 \div 2 = 39$       0      64  
 $39 \div 2 = 19$       1      128  
 $19 \div 2 = 9$       1      256  
 $9 \div 2 = 4$       1      512  
 $4 \div 2 = 2$       0      1024  
 $2 \div 2 = 1$       0      2048  
 $1 \div 2 = 0$       1      4096

22/09

## Data Representation

 $S \& M \rightarrow$  overflow not detected

$$\begin{array}{r}
 + 0111 \quad (+7) \leftarrow \text{bigger magnitude} \\
 + 1011 \quad (-3) \\
 \hline
 0100 \quad (+4)
 \end{array}$$

 $1C \rightarrow$  overflow not detected

$$\begin{array}{r}
 + 0111 \quad (+7) \\
 + 1100 \quad (-3) \rightarrow 0011_2 = 3_{10} \\
 \hline
 \begin{matrix} \text{carry out} \\ 1 \end{matrix} \quad 0011 \quad 1+ \\
 0100 \quad (+4)
 \end{array}$$

 $2C \rightarrow$  overflow detected iff MSB of both is the same,

$$\begin{array}{r}
 + 0111 \quad (+7) \quad \text{the sum should have the} \\
 + 1101 \quad (-3) \rightarrow 0011_2 = 3_{10} \quad \text{same MSB (0 \rightarrow + | 1 \rightarrow -)} \\
 \hline
 \begin{matrix} \text{carry out} \\ 1 \end{matrix} \quad 0100
 \end{array}$$

integer values  $105949 \div 105989$  $\hookrightarrow$  Excess -105949

4224 for use in LCD Clock

 $\hookrightarrow$  BCD

Fixed point fractions

$$x \cdot 2^{-x} \Rightarrow 8.45_{10}$$

$$0 \frac{1}{2} \downarrow$$

$$1 \quad 0.5 \quad 1000.11_2$$

$$2 \quad 0.25$$

$$3 \quad 0.125$$

 $\dots \dots$ 

$$e=0, \text{ value}=0 | e \neq 2^{15} \text{ value}=\pm\infty, m=0$$

Floating point numbers

| sign | exponent                                                                         | mantissa    |
|------|----------------------------------------------------------------------------------|-------------|
| s    | e<br>$30 \xleftarrow{8 \text{ bits}} 23 \dots 22 \xleftarrow{23 \text{ bits}} 0$ | m<br>$1. m$ |

$$\text{value} = (-1)^s \times 2^{e-127} \times m$$

$$m' = 1.m$$

$$1 \leq e \leq 254 \quad \text{excess } -127 \quad [-126; 127]$$

$$\begin{aligned} \text{value} &= 8.75 \\ \text{fixed: } &1000.11 \\ &1.00011 \times 2^3 \end{aligned}$$

$$\begin{array}{c|c|c|c} s & e & & m \\ \hline 0 & \underbrace{10000010}_{+130-127=3} & & \underbrace{000110000\dots}_{m' = 1.00011} \end{array}$$

$$\begin{array}{c|c|c|c} s & e & & m \\ \hline 1 & 10000000 & 0101000\dots & = -2.625_{10} \\ \hline \downarrow & 128-127=1 \Rightarrow 2^1 & 1.0101 & \Rightarrow 10.101_2 = 2_{10} + 0.5_{10} + 0.125_{10} = 2.625_{10} \end{array}$$

| Sign | Exponent   | Mantissa                                                |
|------|------------|---------------------------------------------------------|
| 0 +  | 10000010 3 | 001001100000000000000000 $\xrightarrow{+} 1001.00111$   |
| 0 +  | 10000011 4 | 001010001100000000000000 $\xrightarrow{+} 10010.100011$ |
| 0    | 10000011   | 1011110001 $\xrightarrow{+} 1.1011.110001$              |

|   | s | exponent | mantissa          |                          |
|---|---|----------|-------------------|--------------------------|
| + | 0 | 10000011 | 001101101000...   | 111111<br>010011.01101 + |
|   | 0 | 10000100 | 001011111000...   | 100101.1111              |
|   | 0 | 10000100 | 11001010110000... | 111001.01011             |

22/09

### Instruction Set Architecture 3

stack based  $\rightarrow$  VISC

register based  $\rightarrow$  RISC

VISC  $\rightarrow$  very long instruction word

RISC  $\rightarrow$  minimal instruction set computer

## Memory Layout

bits  $\rightarrow$  flip-flop

words  $\rightarrow$  parallel access register

memory chip (SoC)

byte addressable

$KB = KiB$  (binary kilobyte) = 1024 bytes

$$4 \text{ GB} = 2^{10} \cdot 2^{10} \cdot 2^{10} \cdot 2^2 \rightarrow 32 \text{ bytes} \rightarrow 35 \text{ bits} \rightarrow 30 \text{ words of 32 bits}$$

big endian memory: ...|a|b|c|...  $\rightarrow$  can be processed as  
little endian memory: ...|c|b|a|...  $\rightarrow$  waits for last one

32 registers  $\rightarrow$  5 bit reg operand

3-reg instruction  $\rightarrow$  15 bit  $\Rightarrow 32 - 15 = 17$  bit OP

reg+dir in 16 MB memory  
 $\hookrightarrow 2^9$   $\downarrow 2^{24}$

$\hookrightarrow 32 - 2^9 = 3$  bit OP  $\Rightarrow 2^3 = 8$  instructions

6<sup>4</sup> reg + reg + imm + ? 16 bit const  
 $\hookrightarrow 2^6$   $\downarrow 2^6$   $\downarrow 2^5$   $\downarrow 2^5$   $\downarrow 16 = 32$  ✓

## Structure Set Architecture

register transfer notation

[R1] - register

M(A) - memory (direct)

TOS - top of stack

29/09

Accumulator = Memory + TOS (Processor)

Stack = TOS + Stack (Processor)

Register-Memory = Memory + registers (Processor)

Register-Register = registers (Processor)

## 1. ACCU-based ISA

single-register

simple design  $\rightarrow$  easy to build and program

hard access to data (bottleneck memory)

example: 16-bit words, byte-addressable

S & M integers

instruction  $\rightarrow$  4 OP, 12 operand  $\Rightarrow$  4 KiB

JS: Data copy (load, store), Arithmetic & Logic, control

LOAD 3 take M(3)

SUB 3 sub M(3) M(3)=0

STORE 3 store in M(3)

## 2. Register-based ISA

general purpose registers

- faster than memory

- fewer address bits

bottleneck memory

EA = effective address

index operand:  $EA = [Reg] + I_{dp}$

Scale s = 1, 2, 4 or 8

Displacement disp = 8, 16, 32 or 64 bit

### 3. Multiple-register Architecture

CISC style: 15 bytes at most

Legacy prefixes (1-4 bytes, optional)

Opcode with prefixes (1-4 bytes)

ModR/M (1 byte, if required)

SIB (1 byte, id required)

Displacement (1/2/4/8 bytes, id required)

Immediate (1/2/4/8 bytes, id required)

RISC (Reduced Instruction Set)

Pro-CISC

easier to program

reduced code size

complex hardware

+ legacy

Con-CISC

complex encoding

may be slow

large energy consumption

KISS (Keep it simple and stupid)

trade-off: CISC vs. RISC

HW vs. SW

11/10

### Basic Processing Unit (CPU)

#### RISC

1. Fetch instr from mem
2. Decode instr and read reg
3. Execute ALU op
4. Memory access
5. Writeback to reg



#### CISC

MAR - Memory Address Register (sends to memory, return)

MDR - Memory Data Register (send and receive from mem)

Instruction address generator (PC)

2. Memory
3. MAR / MDR
4. IR (to Control circuitry)



MFC - memory function complete



| Enable | In | Out |
|--------|----|-----|
| 0      | X  | Z   |
| 1      | 0  | 1   |
| 1      | 1  | 0   |



ADD R2, (Z1)

Even

1. R<sub>1</sub> out, X<sub>IN</sub>
2. R<sub>1</sub> out, MAR<sub>IN</sub>, READ, WMFC (WMFC?)
3. MDR<sub>out</sub>, ADD, Z<sub>IN</sub>
4. ~~R<sub>1</sub> out, MAR<sub>IN</sub>~~
5. Z<sub>out</sub>, MDR<sub>IN</sub>, WRITE, WMFC

Wait for Memory  
to Complete

1. R<sub>1</sub> out, MAR<sub>IN</sub>, READ
2. R<sub>2</sub> out, X<sub>IN</sub>, WMFC
3. MDR<sub>out</sub>, ADD, Z<sub>IN</sub>
4. Z<sub>out</sub>, MDR<sub>IN</sub>, WRITE, WMFC

13/10

## Tutorial

Exam Question 9 - Fixed w/ Floating Point

5 -----. ----- = 11 bits Fixed point

Largest positive - 1.0<sub>110</sub>  
 smallest positive - 0.101<sub>0.101</sub>  
 31.5 32 - 0.5 11111.1 → 5 bit mantissa  
 $2^{\frac{7}{-2}} \cdot 2^{\frac{-3}{-4}}$  mantissa  
 $2^{\frac{7}{-2}} + 6$  exp 1 sign bit  
 $2^{\frac{7}{-2}} + 4$  4 bit exponent  
10 bits Floating point

Exam Question 11 - 16 fixing ISA

$$16 \text{ MiB} = 2^4 \text{ MiB} = 2^4 \times 2^{20} \text{ B} = 2^{24} \text{ B} \rightarrow 24 \text{ bits addressing}$$

solution - 16 B memory = 2<sup>4</sup> B → 4-bit dir16 instructions = 2<sup>4</sup> → 4-bit OP16 registers = 2<sup>4</sup> → 4-bit reg

mov → ----- 4-bit ----- 4-bit source ----- 4-bit destination  
 OP

Question 8 - Bees

$$65536 = 2^{16} \rightarrow \text{overflow of 16 bits}$$

Question 6 - de Morgan / NOR

bits for microcontrolling

2 (reg. add. + 1) → reg.

ex. 16 reg. → 4 reg. add. →  $2(4+1) = 10$

ALU → # of op.

ex. 104 → 4 bits

P - only JN-1 bit

2 - \$N and one - 2 bits

MA2 - only JN-1 bit

MDR - \$N and one - 2 bits

18/10

## Input/Output



synchronous bus



dual-bus architecture



synchronous bus



bus driver delay  $1.5\text{ns}$   
max bus length  $15\text{cm}$

$$\text{speed of light} \\ c \text{ in copper}$$

$$\frac{c}{2 \times 10^8 \frac{\text{m}}{\text{s}}} = 1.5 \times 10^{-11} \text{ s} = 1.5 \text{ ns}$$

address decoder delay  $6\text{ns}$   
time to fetch data  $3-15\text{ns}$   
setup time (control circuitry)  $\leq 2\text{ns}$



unconditional I/O

passive signaling (polling)

active signaling (interrupt)





Memory-mapped I/O means I/O devices and memory use the same address space.

20/40

## DMA & Memory Organization

DMA - direct memory access

1 instruction: MDV<sub>B</sub> (=COPY)



DMA - bus master when transferring data



$$\text{speed up} = \frac{\text{time (CPU)}}{\text{time (DMA)}} = \frac{6 \text{ c/w}}{2 \text{ c/w}} = 3 \rightarrow \text{otherwise } 2 \text{ cycle per word}$$

$$= \frac{6 \text{ c/w}}{3 \text{ c/w}} = 2 \rightarrow \text{bus occupied every 3 cycles}$$

$\Rightarrow 3 \text{ cycle per word for DMA}$

1024-6-bit memory -  $128 \times 8$



$32 \times 16$   
5 address pins  
16 data pins  
2 R/W + CS  
2 power + gnd  
25 pins

$$64 = 16 \times ? [32]$$

One word = 32 bits

$$MDR = 32^5$$

$$MAR = \log_2 32 + 1 = 6$$

? x 32 memory organization  
k bits in bits



20/10

## Tutorial

 $512 \times 32$ 4 row address  
lines

$$2048 \text{ bytes} = 2^{11} \text{ bytes}$$

5 column address  
line

$$\frac{2^4}{2^1 \text{ data}} = 2^3 \text{ address pin} \rightarrow 9 - 4 = 5 \text{ row column}$$

$$2^5 = 32 \text{ data pins}$$

double word

25/10

## Caching





### L1 cache

$$\text{cache hit ratio: } 0 \leq h \leq 1 \quad \text{avg.} = hc + (1-h)(cm)$$

$$\text{access time ratio: } c$$

$$\text{cache miss ratio: } 1-h \quad \text{speedup} = \frac{\text{time old}}{\text{time new}}$$

$$\text{access time main memory: } m$$

$$h = 0.85 \quad s = \frac{m}{c + (1-h)m} = \frac{400}{40 + 0.15 \cdot 400} = 4$$

$$m = 400 \text{ ns}$$

$$c = 40 \text{ ns}$$

$$m = 400 \text{ ns}$$

$$h = 0.925 \quad s = \frac{m'}{c + (1-h)m} = \frac{800}{40 + 0.045 \cdot 400} = 8$$

$$c = 40 \text{ ns}$$

$$m' = 800 \text{ ns}$$

### Cache Mapping

cache - smaller than memory (density)

main mem  $\rightarrow$  library

cache  $\rightarrow$  books at home (2 books at a time)

block  $\rightarrow$  book

byte  $\rightarrow$  line



Cash Replacement

$x$ -way associative  $\rightarrow$   $x$ -bit age counter per block

HIT

|   |   |    |   |   |
|---|---|----|---|---|
| 0 | 1 | +1 | 1 | 0 |
| 0 | 0 | +4 | 0 | 1 |
| 1 | 0 | =0 | 0 | 0 |
| 1 | 1 | +1 | 1 | 1 |

age before      after

increased      - Reset hit counter  
 increased      of hit block  
 w.r.t top      - Increment lower  
 unchanged      counters

MISS

|   |   |    |   |   |
|---|---|----|---|---|
| 0 | 1 | +1 | 1 | 0 |
| 0 | 0 | +1 | 0 | 1 |
| 1 | 0 | +1 | 1 | 1 |
| 1 | 1 | =0 | 0 | 0 |

age before      after

full:  
 increased      - Replace block with  
 increased      highest counter, reset  
 increased      - Increment all  
 w.r.t top      others  
 not full: - empty spot with 0 count  
 - Increment all others



Harvard Architecture:  
 instructions - spatial + temporal  
 data - temporal + spatial



27/10

## Pipelining

- 1 washer  $\rightarrow$  30 min  
1 drier  $\rightarrow$  50 min  
1 folding mat  $\rightarrow$  40 min  
4 people

$$30 \quad 50 \quad 40 = 120 \text{ (one load)}$$

$$30 \quad 50 \quad 40$$

$$30 \quad 50 \quad 40$$

$$30 \quad 50 \quad 40$$

$$\underline{30 + 4 \cdot 50 + 40 = 30 \quad 50 \quad 50 \quad 50 \quad 50 \quad 40 = 270}$$

$$\text{wash} = 30 \text{ min}$$

$$30 \quad 40 \quad 20 = 90 \text{ (one load)}$$

$$\text{dry} = 40 \text{ min}$$

$$30 \quad 40 \quad 20$$

$$\text{fold} = 20 \text{ min}$$

$$30 \quad 40 \quad 20$$

$$\underline{30 + 4 \cdot 40 + 20 = 210 = 30 \quad 40 \quad 40 \quad 40 \quad 40 \quad 20}$$

## Performance

### Latency ( $L$ )

- time for one instruction to finish
- lower is better

### Throughput ( $T$ )

- the # instructions per time unit
- higher is better
- time per instruction decreases (on average)

Increase clock frequency  $\rightarrow$  latency boost

problem: power wall

Multiple threads & cores  $\rightarrow$  throughput boost

problem: parallel programming

| Fetch          | Fetch 1   | Fetch 2   | Fetch 3   | Fetch 4 |
|----------------|-----------|-----------|-----------|---------|
| CPU            |           |           |           |         |
| Execute time → | Execute 1 | Execute 2 | Execute 3 |         |

Fetch instruction

Decode instruction and fetch operands

Execute instruction

Write result



stalls:

Cache miss [fetch, decode]

Dependency between instructions [decode]

Branching [exec]

Long operation [exec]



ADD R1, R2      R2 += R1  
 MUL R3, R4      R4 \*= R3  
 ADD R2, R3      R3 \*= R2

MOU R1, R2      R2 = R1  
 ADD \$100, R2      R2 += 100

ADD R1, R2      R2 += R1  
 ADD R2, R3      R3 += R2



BG2 LOOP  
 MOV R3, SUM



ADD R6, RF, R8  
 BG2 R3 LOOP  
 MOV R8, SUM

→ BG2 R3 LOOP  
 ADD R6, RF, R8  
 MOV R8, SUM

### Static prediction

backwards branches likely to be taken (loop)  
 forward branches mostly skipped (rule vs. exception)  
 sign bit (or compiler directed)

### Dynamic prediction



ADD R2, R3  
DIV R3, R1

I<sup>1</sup>  
I<sup>2</sup>  
I<sup>3</sup>  
I<sup>4</sup>

|                |                |                |                |     |
|----------------|----------------|----------------|----------------|-----|
| F <sup>1</sup> | D <sup>1</sup> | E <sup>1</sup> | W <sup>1</sup> | ADD |
| F <sup>2</sup> | D <sup>2</sup> | E <sup>2</sup> | W <sup>2</sup> | DIV |
| F <sup>3</sup> | D <sup>3</sup> | E <sup>3</sup> | W <sup>3</sup> |     |
| F <sup>4</sup> | D <sup>4</sup> | E <sup>4</sup> | W <sup>4</sup> |     |

Execution time of a program:  $T$

Instruction count:  $N$

#cycles per instruction, CPS:  $S(y_1, y_2)$

Clock rate:  $R(y_1, 1 \text{ GHz})$

without pipelining:  $T = (N \times S) / R$

with n-stage pipeline:  $T' = T / n$

but  $S$  increases due to stalls

$$h_i = 0.95 \rightarrow \text{miss} = 0.05$$

$$h_d = 0.9 \rightarrow \text{miss} = 0.1$$

cache miss  $\rightarrow 32$  cycles

20% memory access = 0.2

$$0.35 \times 32$$

$$m_i = 0.05$$

$$m_d = 0.1$$

$$f_{\text{cache}} = 32$$

$$f_{\text{data}} = 0.2$$

$$h_b = 0.8 \rightarrow \text{miss} = 0.2$$

1 additional cycle

20% Branching  $\rightarrow 0.2$

$$0.2 \times 0.2 \times 1$$

$$m_{\text{branch}} = 0.2$$

$$f_{\text{branch}} = 1$$

$$d_{\text{branch}} = 0.2$$

## Tutorial

$$h_{\text{branch}} = 0.90 \quad m_{\text{branch}} = 0.10$$

$$p_{\text{branch}} = 1 \text{ cycle}$$

$$f_{\text{branch}} = 0.20$$

$$\begin{aligned} \text{Cache misses} &= (m_{\text{instr.}} + m_{\text{data}} \times m_{\text{data}}) \times f_{\text{cache}} \\ &= 1.28 \end{aligned}$$

Branching

$$\begin{aligned} &= p_{\text{branch}} \times m_{\text{branch}} \times p_{\text{branch}} \\ &= 0.2 \times 0.1 \times 1 = 0.02 \end{aligned}$$

01/11

## Superscalar Execution

J<sub>1</sub> Fadd

|    |    |    |    |      |
|----|----|----|----|------|
| F1 | D1 | E1 | W1 |      |
| F1 | D2 | E2 | W1 | → W1 |

J<sub>3</sub> FsubJ<sub>4</sub> Sub

|    |    |    |    |      |
|----|----|----|----|------|
| F1 | D1 | E3 | W1 | W1   |
| F1 | D2 | E2 | W1 | → W1 |

### Amdahl's Law

Parallelization can:

- increase throughput
- latency remains same
- reduce program runtime T

not everything can be parallelized

→ data dependencies

→ sequential algorithms

$f_p \rightarrow$  fraction of code that can be parallelized

$1 - f_p = f_s \rightarrow$  sequential execution

$T_s \rightarrow$  time to run program sequentially

$$T_p = T_s \times (f_s + \frac{f_p}{P})$$

$P \rightarrow$  number of processors

$$S = \frac{1}{f_s + \frac{f_p}{P}}$$

$$T_s = 10s$$

$$f_p = 80\% \quad f_s = 20\%$$

$$P = 4$$

$$T_p = 10 \left( 20\% + \frac{80\%}{4} \right) = 10.40\% = 4$$

$$S = \frac{T_s}{T_p} = \frac{10}{4} = 2.5$$

1 core with 1 thread

$$f_p = 20\% \quad f_s = 80\%$$

$$P = 8$$

$$S_{max} = \frac{1}{f_s} = \frac{1}{80\%} = 1.25$$

$$S = \frac{T_s}{T_p} = \frac{T_s}{T_s(1 + \frac{f_p}{P})} = \frac{1}{80\% + \frac{20\%}{8}} = \frac{1}{82.5\%} = \frac{100}{82.5} = \frac{20}{16.5}$$

40

20

33

$$s' = \frac{10}{q} = \frac{1}{\frac{fs + fp}{p}} = \frac{1}{\frac{0.8 + 0.2}{p}} = \frac{p}{1.0} = p$$

$$p = ?$$

$$0.8 + \frac{0.2}{2} = 0.9$$

A certain program can be parallelized for 20%  
 $fp = 20\% = 0.2$      $fs = 80\% = 0.8$

What is the speedup if we run the program on 8 processors?  $p = 8$      $s = \frac{1}{\frac{fs + fp}{p}} = \frac{1}{\frac{0.8 + 0.2}{8}} = \frac{40}{33}$

What is the maximum speedup that can be obtained?  $p = \infty$      $s_{max} = \frac{1}{fs} = 1.25$

How many processors are required to get a speedup of  $\frac{10}{9}$ ?  $p = ?$      $s' = \frac{1}{\frac{0.8 + 0.2}{p}} = \frac{10}{9}$      $0.9 = 0.8 + \frac{0.2}{p}$      $p = 2$

$$fp = 90\% \quad fs = 10\%$$

$$p = 3 \quad s = \frac{1}{\frac{fs + fp}{p}} = \frac{1}{\frac{10\% + 90\%}{3}} = \frac{3}{10} = 0.3$$

$$s_{max} = \frac{1}{fs} = \frac{1}{fs} = 10 \quad p = ? \quad s = 5 = \frac{1}{\frac{10\% + 90\%}{p}} \quad 0.2 = 0.1 + \frac{0.9}{p}$$

## Flynn's Taxonomy

Single Instruction, Single Data (SISD)  
 → conventional system

Single Instruction, Multiple Data (SIMD)  
 → one instruction on multiple data streams

Multiple Instruction, Multiple Data (MIMD)  
 → multiple instruction streams on multiple data streams

Multiple Instruction, Single Data (MISD)





## Virtual Memory

64-bit processor  
256 GiB memory  
16 TiB address space  
8 KiB pages  
byte-aligned  
4 control bits  
page table size = ?



## DPU

X, Bus → ALU inputs

memory access  $\geq 1$  cycle

R2 → register addressing

(R1) → indirect register addressing

$$a) R1 \leftarrow R2 + (R1)$$

1. R1out, MARin, READ
2. R2out, Yin, WMFC
3. MDRout, Xout, ALUadd, Zin
4. Zout, R1in, END

b) least amount of time

WMFC happens after last input reading

32 address lines  
 3 devices connected  
 4 ports  
 2 ignored address lines



Block 0 hit  $\text{I}_{\text{y}}$   
 Block 3 hit  $\text{II}_{\text{y}}$   
 Block 5 hit  $\text{I}_{\text{y}}$   
 cache miss (full)  $\text{IV}_{\text{y}}$

instructions

| block | before | I <sub>y</sub> | II <sub>y</sub> | IV <sub>y</sub> | after                        |
|-------|--------|----------------|-----------------|-----------------|------------------------------|
| 0     | 001    | 010            | 000             | 001             | 010 $\rightarrow$ hit        |
| 1     | 010    | 011            | 011             | 100             | 101                          |
| 2     | 101    | 101            | 101             | 110             | 111                          |
| 3     | 110    | 110            | 110             | 000             | 001 $\rightarrow$ hit        |
| 7     | 100    | 100            | 100             | 101             | 110                          |
| 5     | 011    | 000            | 001             | 010             | 011 $\rightarrow$ hit        |
| 6     | 000    | 001            | 010             | 011             | 100                          |
| 7     | 111    | 111            | 111             | 111             | 000 $\rightarrow$ cache miss |

