

# Lecture 9 - Basic Processing Unit (BPU)

## Component structure

### CISC

→ Components connected many-to-many  
→ flexible



### RISC

→ Components connected in sequence  
→ pipelined by design

## CPU (RISC):

1. Fetch instruction from memory
2. Decode instruction and read registers
3. Execute ALU operation
4. Memory access (if needed)
5. Writeback to register (if needed)



## CPU (CISC):

### 1. Instruction fetch

- 1) Instruction address gen
- 2) MAR + MDR
- 3) IR

### 2. Operand fetch

- 1) IR → instruction + operands
- 2) values in X, MUX, or directly

### 3. Instruction execution

- 1) ALU output in Z

### 4. Write back

- 1) output in register file



## Quiz Q.1.

1. inst. fetch → IR, MAR, MDR, PC
2. writing data to memory  
→ address moved to MAR
3. writing data to memory  
RISC → RM

# Important Concepts

## Interacting with main memory



## Register Gating



## CPU bus



## Tri-state gates



## Register File



- Gates  $R_{in}$  and  $R_{out}$
- $Address_{in}$  and  $Address_{out}$
- $2^x$  registers  $\rightarrow \log_2 2^x = x - 6$  bit addressing
- $\Rightarrow 2(x+1)$  -bit register file addressing



$ADD R_2, R_1 \leftrightarrow R_1 + R_2$

Fetch

1. PCout, MARin, READ, WMFC  $\rightarrow$  wait memory to fully complete
2. MDRout, IRin  $\quad$  (always after READ and WRITE)

Execute

3. R2out, X1in
4. R1out, ALUadd, Zin  $\quad$  (red arrow)
5. Zout, R1in  $\quad$  (red arrow)

Quiz Q.2 ADD R2, (R1)

4. R1out, MARin, READ, WMFC
5. MDRout, ALUadd, 2in
6. Zout, MDRin, WRITE, WMFC

# Lecture 10 - Basic Processing Unit (BPU)

**STORE R2, (R1)**

1. PC<sub>out</sub>, MAR<sub>in</sub>, READ, WMFC
2. MDR<sub>out</sub>, IR<sub>in</sub>
3. R1<sub>out</sub>, MAR<sub>in</sub>
4. R2<sub>out</sub>, MDR<sub>in</sub>, WRITE, WMFC

| →1 | PC <sub>out</sub> | PC <sub>in</sub> | R1 <sub>in</sub> | R1 <sub>out</sub> | MAR <sub>in</sub> | MDR <sub>out</sub> | IR <sub>in</sub> | ... | WMFC |
|----|-------------------|------------------|------------------|-------------------|-------------------|--------------------|------------------|-----|------|
| 2  | 1                 | 0                | 0                | 0                 | 1                 | 0                  | 0                | ... | 1    |
| 3  | 0                 | 0                | 0                | 0                 | 0                 | 0                  | 1                | 1   | 0    |
| 4  | 0                 | 0                | 0                | 1                 | 1                 | 1                  | 0                | 0   | 0    |
| 5  | ...               |                  |                  |                   |                   |                    |                  |     |      |
| 6  | ...               |                  |                  |                   |                   |                    |                  |     |      |
| 7  | ...               |                  |                  |                   |                   |                    |                  |     |      |

**control signals:**

- instruction code
- control step counter
- flags & external signals

## Quiz 10.1

32 general purpose registers

42 two-operand functions in ALU (X reg for input, 2 for output)

MDR and MAR registers

32 →  $\log_2 32 = 5$ -bit register addressing

2 (5+1) = 12-bit register file addressing

42 < 64 → 6-bit ALU operation addressing

X reg and MAR (only in) → 1 bit

2 reg and MDR (in and out) → 2 bits

$$12 + 6 + 1 + 2 + 1 + 2 = 24 \text{ bits}$$



ADD R1, R2 → fetch=2 steps } 5 steps  
(no memory) → execute=3 steps

SUB #16(R3), R0 → fetch=2 steps }  
(1 memory) → execute=4 steps } 7 steps  
displacement=1 step

## Method 1 - Hardwired Control

$RF_{write} = TS_5$ . (ALU + Load + Call)  
 → fast  
 → inflexible

## Quiz 10.2

vertical : 8 reg., MAR+MDR, 4 ALU  
 8 bits =  $\downarrow$  +  $\downarrow$  +  $\downarrow$   
 microprogrammed flexible hardwired  
 complex design  
 complex hardware higher performance

## Method 2 - Microprogrammed Control

|   | μ-program memory  |                  |                  |                   | control word=μ-instruction |                    |                  |     |
|---|-------------------|------------------|------------------|-------------------|----------------------------|--------------------|------------------|-----|
|   | PC <sub>out</sub> | PC <sub>in</sub> | R1 <sub>in</sub> | R1 <sub>out</sub> | MAR <sub>in</sub>          | MDR <sub>out</sub> | IR <sub>in</sub> | ... |
| 1 | 1                 | 0                | 0                | 0                 | 1                          | 0                  | 0                | 1   |
| 2 | 0                 | 0                | 0                | 0                 | 0                          | 1                  | 1                | 0   |
| 3 | 0                 | 0                | 0                | 1                 | 1                          | 0                  | 0                | 0   |
| 4 | ...               |                  |                  |                   |                            |                    |                  |     |
| 5 | ...               |                  |                  |                   |                            |                    |                  |     |
| 6 | ...               |                  |                  |                   |                            |                    |                  |     |
| 7 | ...               |                  |                  |                   |                            |                    |                  |     |

- Increased flexibility (branching)

Instruction

|     |    |    |
|-----|----|----|
| MOV | R1 | A  |
| ADD | R1 | R2 |



|                   |                  |                  |     |     |
|-------------------|------------------|------------------|-----|-----|
| PC <sub>out</sub> | PC <sub>in</sub> | IR <sub>in</sub> | ... | END |
| PC <sub>out</sub> | PC <sub>in</sub> | IR <sub>in</sub> | ... | END |
| PC <sub>out</sub> | PC <sub>in</sub> | IR <sub>in</sub> | ... | END |

microinstruction

compared to hardwired:  
 Branching vs sequencing

compared to full CPU:  
 → no ALU, registers, ...  
 → response to signals

### Micro-instruction layout

#### 1. No/little encoding: horizontal organization

- large words
- less decoding
- faster



#### 2. Much encoding: vertical organization

- small words
- more decoding
- slower



#### 3. Mixed

| micro-instruction | Horizontal organization |                 |                  |                  | vertical organization   |    |    |    |
|-------------------|-------------------------|-----------------|------------------|------------------|-------------------------|----|----|----|
|                   | PC <sub>out</sub>       | IR <sub>m</sub> | MAR <sub>m</sub> | MDR <sub>m</sub> | REG <sub>m</sub> , addr | F1 | F2 | F3 |
| 1                 | 1                       | 0               | 1                | 0                | 00                      |    |    |    |
| 2                 | 0                       | 1               | 0                | 1                | 00                      |    |    |    |
| 3                 | 0                       | 0               | 0                | 0                | 10                      |    |    |    |
| ...               |                         |                 |                  |                  |                         |    |    |    |

- Field 1 (4 bits): Register address<sub>in</sub>  
 Field 2 (4 bits): Register address<sub>out</sub>  
 Field 3 (2 bits): RF<sub>in</sub>/out  
 Field 4 (4 bits): Function ALU  
 Field 5 (2 bit): Read/Write/Nop  
 Field 6 (1 bit) : Carry-in ALU  
 Field 7 (1 bit) : WMFC  
 Field 8 (1 bit) : End

### Microroutine example

#### JN PC+OFFSET

| Address | Micro-instruction                                                | Fetch Instruction |
|---------|------------------------------------------------------------------|-------------------|
| 0       | PC <sub>out</sub> , MAR <sub>in</sub> , Read, WMFC               |                   |
| 1       | MDR <sub>out</sub> , IR <sub>in</sub>                            |                   |
| 2       | Branch to starting address routine (here, 103)                   | Decode            |
| 103     | PC <sub>out</sub> , X <sub>in</sub> , if N=0 then goto address 0 | ← Test N bit      |
| 104     | IR_Offset_out, MUX_X, F <sub>alu</sub> = "ADD", Z <sub>in</sub>  | ← Add offset      |
| 105     | Z <sub>out</sub> , PC <sub>in</sub> , End                        | ← Update PC       |

### Microroutine example

#### ADD R1, R2

| Address | Micro-instruction                                                                                 | Fetch Instruction |
|---------|---------------------------------------------------------------------------------------------------|-------------------|
| 0       | PC <sub>out</sub> , MAR <sub>in</sub> , Read, WMFC                                                |                   |
| 1       | MDR <sub>out</sub> , IR <sub>in</sub>                                                             |                   |
| 2       | Branch to starting address routine (here, 42)                                                     | Decode            |
| 42      | RF <sub>read</sub> , RF <sub>addr_out</sub> = 1, X <sub>in</sub>                                  | Exec              |
| 43      | RF <sub>read</sub> , RF <sub>addr_out</sub> = 2, MUX_X, F <sub>alu</sub> = "ADD", Z <sub>in</sub> |                   |
| 44      | Z <sub>out</sub> , RF <sub>write</sub> , RF <sub>addr_in</sub> = 2, End                           |                   |

- flexible

- slower

- more complex to build in hardware

# Self-Study 6



Figure 1: Basic Processing Unit with missing components.

1. (5 mins) Figure 1 shows the components of a CISC-style basic processing unit. Name the numbered components and describe what kind of information is stored in them.

| # | name                  | description                                     |
|---|-----------------------|-------------------------------------------------|
| 1 | instruction generator | generates the address of the next instruction   |
| 2 | MAR                   | Memory Address Register - stores memory address |
| 3 | MDR                   | Memory Data Register - stores data from memory  |
| 4 | Z                     | stores the result of ALU operations             |

2. (5 mins) Write down the 5 stages of a RISC-based BPU and explain what is happening during each of them:

1. Fetch - get instruction from memory
2. Decode - evaluate the OP code and the encoded register addressing
3. Execute - perform arithmetic and logic operations in the ALU
4. Memory - access information from memory if needed
5. Writeback - communicate back to registers if needed

$$16 \text{ registers} = 2^4 \Rightarrow 2(4+1) = 10 \text{ bits}$$

$$\text{ALU with } 10^4 \text{ instructions} \rightarrow 4 \text{ bits}$$

$$X \text{ as input (only in)} \rightarrow 1 \text{ bit}$$

$$2 \text{ as output (in and out)} \rightarrow 2 \text{ bits}$$

20 bits

encoding

register file with  
32 registers  
MAR + MDR

vertical

$$2(5+1) = 12$$

3

horizontal

$$2(32+1) = 66$$

3

20 ALU operations

5

20

$\Rightarrow X$  register

1

1

2 register

2

2

MUX

1

1

24 bits

93 bits

Instruction Fetching:

1. PCout, MARin, READ, WMFC
2. MDRout, IRin

ADD R2, (R4), R3  $\Rightarrow$  R3=R2+(R4)

1. R2out, Xin
2. R4out, MARin, READ, WMFC
3. MDRout, ALUadd, Zin
4. Zout, R3in

DIVIDE (R1), (R2), R3

1. R1out, MARin, READ, WMFC
2. MDRout, Xin
3. R2out, MARin, READ, WMFC
4. MDRout, ALUdivide, Zin
5. Zout, R3in

BRANCH PC + (R3)

1. R3out, MARin, READ, WMFC
2. MDRout, Xin
3. PCout, Xout, ALUadd, Zin
4. Zout, PCin

# Lecture 11 - I/O

Large variety of devices  
→ size  
→ speed  
→ distance



## Buses and Interfaces

Single-bus architecture  
Pro: simple, flexible, low cost  
Con: not scalable, slow



Dual-bus architecture  
Pro: fast, flexible, quite cheap  
Con: not scalable, complex



## Typical motherboard

Northbridge → high performance

Southbridge → flexibility

Chipsets → AMD AUS (2022 -)

→ Intel Coffee Lake (2017 -)

## Bus Interface



address decoder - detect if data is for data registers - to store/buffer in/out data  
status and control registers  
→ to certify status of device  
→ to control transfer

Synchronous bus  
ideal case  
→ perfect synchronization across all devices  
→ instantaneous transitions



## Propagation delays



## Multi-cycle bus



bus driver delay 1.5ns  
max bus length 15 cm  
address decoder delay 6 ns  
time to fetch data 3 - 15 ns  
setup time (control circuitry)  $\leq 2$  ns

$$t_{\text{max}} = \frac{l}{v} = \frac{15 \text{ cm}}{\frac{2}{3} c} = \frac{1.5 \times 10^{-2} \text{ m}}{2 \times 10^8 \frac{\text{m}}{\text{s}}} = 7.5 \times 10^{-11} \text{ s} = 75 \text{ ns}$$

## Asynchronous bus input operation

- Explicit handshaking
- timing must account for signal propagation/skew



# Program-controlled I/O



assigning address range:

|        | 2 | 1 | 0 | SIN | SOUT | IOSTATUS |
|--------|---|---|---|-----|------|----------|
| 0x4000 |   |   |   |     |      |          |
| 0x4004 |   |   |   |     |      | DATAIN   |
| 0x4008 |   |   |   |     |      | DATAOUT  |

- Unconditional I/O - no synchronization with I/O device
- Passive signaling (Polling) - synchronization between CPU and I/O device by programmed interrogation by CPU
- Active signaling (Interrupts) - synchronization between CPU and I/O device by active interrupt of device

## Interrupts



### ISR - Interrupt Service Routine

1. I/O device alerts CPU by hardware signal
2. CPU stops program execution
3. Interrupts are disabled

4. Device is informed of acceptance and clears IRQ
5. ISR is invoked to handle device's request
6. Interrupts are enabled
7. Execution of program resumes



shared IRQ line

CPU polls status registers

GRANT signal from CPU

Device sends ID on data bus

CPU calls ISR from interrupt vector [ID]

Rule out simultaneous interrupts by prioritizing devices  
fixed order



# Lecture 12 - DMA & Memory Organization

## Non-programmed I/O

- A separate active entity executes the I/O tasks.
- User instructs the device about what to copy
- Device takes care of data transport

## Direct Memory Access (DMA)

1 instruction: MOVE (= COPY)

## Special I/O processors

- more versatile
- run their own program



## DMA controller

- independent entity
- specialized in data transfer
- block based

## Bus priority

- cycle stealing
- burst mode



## CPU ↔ DMA interactions



## DMA controller

- Access as a normal I/O device
- signals when DONE and/or by interrupt
- Acts as a bus master when transferring data
- sets addresses and controls lines

MOV (R1), R4  
MOV R4, (R2)  
SUB \$1, R3  
ADD \$4, R1  
ADD \$4, R2

## 32-bit RISC machine

8 registers

blockmove (src, dst, n)

R1 = src | R2 = dst | R3 = n

## DMA without caching

→ every instruction needs to be fetched from memory

→ bus is mostly occupied

## DMA with cache

→ bus is mostly free



## Memory Organization



$A_0 - A_3 : 4 \text{ bits}$

$W_0 - W_{15} : 16 = 2^4$

$b_0 - b_7 : 8$

$\Rightarrow 16 \times 8 \text{ memory}$



DRAM  
Dynamic Random Access Memory

SRAM  
Static Random Access Memory



$Row Address (5bit)$

5-bit decoder

W0

W31

32 by 32 memory array

$7-bit address lines$

$\log_2 128 = 7$

$128 \times 8$

Column Address (2 bit)

1024x8

Row Address (5bit)

W0

W31

32 by 32 memory array

10-bit address lines

$\log_2 1024 = 10$

$1024 \times 1$

Column Address (5 bit)

32-to-1 MUX / DEMUX

In/Out

R/W

CS

128x8 organization

• 7 address pins

• 8 data pins

• 2 R/W + CS

• 2 power + ground

----- +

19 pins

1024x1 organization

• 10 address pins

• 1 data pin

• 2 R/W + CS

• 2 power + ground

----- +

15 pins

$k \times n$  organization

$\log_2 k$  address pins

$n$  data pins

+ 4 constant pins

$$512 = 16 \times 32 \quad 32 \text{ bit word}$$

$$MDR = 32 \quad MAZ = \log_2 32 + 1 = 6$$

Quiz 12.2 32x16

5 address pins

16 data pins

4 constant pins

25 pins

Memory address lines vs. byte addressability  
 ? x 32 memory organization

Memory reads/writes 32-bit word lines at once  
 Programs address memory by byte



## Self-Study 7



$$\log_2 2048 = 11 \text{ address lines}$$

$$C = 7 \text{ lines}$$

$$\Rightarrow B = 4 \text{ lines}$$

$$2B = 16 \text{ b} \Rightarrow A = 16 \text{ data lines}$$

$$\Rightarrow 2048 \times 16$$

(a) 128-bit words

$$\log_2 (2^{12} \times 2^{-7}) = 21$$

(b) 64-bit words

$$\log_2 (2^{12} \times 2^{-6}) = 22$$

(c) 32-bit words

$$\log_2 (2^{12} \times 2^{-5}) = 23$$

(d) 16-bit words

$$\log_2 (2^{12} \times 2^{-4}) = 24$$

$$32 \text{ MiB} = 2^{27} \text{ B} = 2^{24} \text{ b}$$

32 MiB word  
 addressable memory  
 chip  
 address lines per word size

# Lecture 13 – Caching



## READ operation:

- not in cache (MISS)
  - copy block into cache
  - read out of cache
- in cache (HIT)
  - read out of cache

## WRITE operation:

- not in cache (MISS)
  - write in main memory
- in cache (HIT)
  - write in cache
  - and either:
    - write in main memory
    - set modified (dirty) bit, and write later

main  $\xrightarrow{\text{read}}$  read-through  
memory  $\xrightarrow{\text{write}}$  store-through

spatial – close by data (array)  
locality

temporal – used before data  
L1 cache

cache hit ratio:  $h$

access time cache:  $c$

cache miss ratio:  $1-h$

access time main memory:  $m$

average access time:  $c + (1-h)m$

parameters depend on:

→ application, input (data-driven)

→ cache policy

$$\text{speedup} = \frac{m}{c + (1-h)m}$$

## Quiz 13.1

$$m = 400\text{ ns}$$

$$c = 40\text{ ns}$$

$$h = 60\% \quad 1-h = 40\%$$

$$\begin{aligned} \text{speedup} &= \frac{m}{c + (1-h)m} \\ m' &= 480 \\ h' &= 95\% \\ s &= \frac{480}{40 + 24} = \frac{480}{64} = 7.5 \\ &= \frac{480}{40 + 160} = \frac{480}{200} = 2 \end{aligned}$$



Transparent to CPU (and programmer)  
Storage to hold data  
and administration (in/out, clean/dirty)

Block based:

- amortize admin overhead
- efficient lookup (hit or miss?)
- efficient memory access

Access multiple memory words at a time

- Access word  $w$  in memory block  $B$  in the cache?
- Is block  $B$  in the cache?
- If not (MISS)
  - get block  $B$  from memory
  - calculate where to place in cache
  - is position free?
  - if not, evict block at  $B$ 's place from the cache
  - write  $B$  to its cache location
- Access  $v$  in the cache

cache - smaller than memory

→ multiple memory blocks map to the same cache block

mapping can be:

- static (direct mapped)
- dynamic (fully associative)
- mixed ( $x$ -way set associative)

hit ratio determined by:

- mapping strategy
- locality of reference

main memory → library  
cache → bookcase at home  
block → book  
byte → line

Direct  
Mapped  
Cache



(books)  
easy to look up blocks  
in the cache (bookcase)  
not flexible - high thrashing  
replacing blocks in cache ↴

## (Fully) Associative Cache



difficult to look up blocks (books) in the cache (bookcase)  
check labels (in parallel)  
very flexible → high hit ratio

**Direct Mapped**  
16 bit memory address  
16 bytes in a block (4b)  
128 blocks in cache (7b)



**Fully Associative**  
16 bit memory address  
16 bytes in a block (4b)  
4K blocks in memory (12b)



**Set Associative Cache**  
trade-off: performance

vs. flexibility  
blocks into sets  
- direct mapped  
parallel search in set  
- associative





Quiz 13.2

main memory: 3 GB  
 Cache: 512 blocks of 64 byte  
 Cache: 64-way set-associative  
 byte addressable (8)  
 $3\text{GB} \times 4\text{GB} = 2^{32}\text{B} \rightarrow 32\text{B}$

|     |          |           |
|-----|----------|-----------|
| tag | set (3b) | byte (6b) |
|-----|----------|-----------|

$512/8 = 64$  blocks  $\times 64$  bytes  
 64-way tag =  $32 - 3 - 6 = 23$  b

## Cache Replacement Algorithms

Least recently used (LRU)

- keep "time stamps" of accesses in counters

$2^x$ -way associative  $\rightarrow$

$x$ -bit age counter per block

- Basic algorithm:

- hit:

$\rightarrow$  Reset hit counter

$\rightarrow$  Increment lower counters

- miss, set full:

$\rightarrow$  Replace block with highest counter, reset this counter

$\rightarrow$  Increment other counters

- miss, set not full:

$\rightarrow$  Place in empty spot

$\rightarrow$  Increment other counters



# Lecture 14 - Pipelining

## Harvard Architecture



Exploit difference in locality  
 - instructions → spatial + temporal  
 - data → temporal + spatial

Separate data and instruction paths

### Quiz 14.1

1 laundry room, washer + drier + folding mat

Alice, Bob, Claire, Derek  $\times 1$  load each

1 load = wash (30) + dry (40) + fold (20)

30 40 20

$$30 + 40 + 20 = 90$$

30 40 20

$$30 + 40 + 20 = 90$$

30 40 20

30 40 20

Performance

Latency (L)

→ time for 1 instr.

→ lower is better

Throughput (T)

→ the # instructions per time <sup>unit</sup>

→ higher is better

→ time per instr. ↓ (on avg.)

## Improvements

clock frequency → latency boost (power wall)

threads & cores → throughput boost (parallel programming)

pipelined execution → throughput boost (limited effect)

dedicated circuitry → latency boost (one-trick pony)

Time →



Time →



## Instruction stages

1. Fetch instruction
2. Decode instruction and fetch operands

- 3 Execute operation

- 4 Write result

simplified 4 stage model



Insert buffers to allow pipeline stages to act independently

|    |    |    |    |    |
|----|----|----|----|----|
| I1 | F1 | D1 | E1 | W1 |
| I2 | F2 | D2 | E2 |    |
| I3 | F3 | D3 |    |    |
| I4 | F4 |    |    |    |

B1 ( $F \rightarrow D$ ):

→ instruction I3

B2 ( $D \rightarrow E$ ):

→ the source operand of I2

→ the operation

→ the specification of the destination operand



B3 ( $E \rightarrow W$ ):

→ the result of the execution of I1

→ the specification of the destination operand



Harvard organization  
multiple cycles per stage  
extra stages (cache)  
even more cycles per stage  
complications: pipeline hiccups

## Pipeline stalls



## Hazards (causes):

- cache miss [fetch, decode]
- dependency between instr. [decode] output → input
- branching [exec]
- long operation [exec] division

## Data Dependency

MUL R2, R3, R4 /\* R4 destination \*/  
ADD R5, R4, R6 /\* R4 source \*/



```

ADD R1, R2 /* destination */
MUL R3, R4
ADD R2, R3 /* source */

MOV R1, R2 /* destination */
ADD $100, R2 /* source! */

MOV 4(R3), R4 /* destination */
ADD 8(R4), R5 /* "source" */

ADD R1, R2
ADD R2, R3
  
```



## Data forwarding

Add fast path from ALU output to input

## Branching

BGZ LOOP  
MOV R3, SUM



ADD R6, R7, R8  
BGZ R3 LOOP  
MOV R8, SUM

- Penalty = 1 cycle
  - Use idle slot
- BGZ R3 LOOP  
ADD R6, R7, R8  
MOV R8, SUM

branch delay slot:  
always executed  
before the branch effects

### Unconditional branching:

- target known after decoding
- branch penalty reduces to 1 cycle

### Static prediction:

- backwards branches likely to be taken (loops)
- forward branches mostly skipped (rule vs exception)
- sign bit (or compiler directed)

### Dynamic prediction:

- maintain info per branching instr. (branch target buffer)
- do as before



## Long operations

ADD R2, R3  
DIV R3, R1



Execution time of a program

Instruction count

# cycles per instruction, CPI

Clock rate

T

N

S (u)

R (1GHz)

$$\text{without pipelining: } T = \frac{NS}{R}$$

$$\text{with } n\text{-stage pipeline: } T' = \frac{T}{n}$$

S  $\uparrow$   $\rightarrow$  stalls

## Self-Study 8

32 bit words

16 MiB word addressable memory

256 KiB direct mapped cache

128 bytes block size

$$\underline{\text{Clock size}} = \frac{128B}{32b} = 32 \frac{\text{words}}{\text{Clock}}$$

$$\underline{\text{cache size}} = \frac{256KiB}{128B} = 2048 \frac{\text{blocks}}{\text{cache}}$$

Clock size

128B

32B word  $\Rightarrow \log_2 32 = 5$ -bit addressing

2048 blocks  $\Rightarrow \log_2 2048 = 11$ -bit addressing

### Quiz 14.2

$$h_i = 0.95$$

$$h_{\text{branch}} = 0.90$$

$$h_d = 0.90$$

$$p_{\text{branch}} = 1$$

$$p_c = 16 \text{ cycles}$$

$$f_{\text{branch}} = 0.20$$

$$f_d = 0.30$$

cache misses

$$= (m_i + f_d \times m_d) p_c$$

$$= (0.05 + 0.30 \times 0.10) 16$$

$$= 1.28$$

branching

$$= f_b \times m_b \times p_b$$

$$= 0.20 \times 0.10 \times 1$$

$$= 0.02$$

$$P_{\text{total}} = 1.28 + 0.02 = 1.3 \text{ cycles}$$

1 Add R1, R2, R3

2 Move (R4), R5

3 Move #10, R6

4 Add R3, R4, R7

Pmemory =  
2 cycles

8-way set associative

512 blocks of 128 bytes

8 GiB memory

byte addressable

128 byte-addressable blocks  $\Rightarrow \log_2 128 = 7$ -bit addressing

512 = 64 sets  $\Rightarrow \log_2 64 = 6$ -bit addressing

$$\underline{\frac{8}{8}} \quad 8 \text{ GiB} = 2^{33} \text{ B} \rightarrow 33 \text{ bits}$$

total

|   |    |    |    |    |   |    |    |    |   |
|---|----|----|----|----|---|----|----|----|---|
| 1 | F1 | D1 | E1 | W1 | 1 | 5  | 6  | 7  | q |
| 2 |    | F2 |    |    | 2 | D2 | E2 | W2 |   |
| 3 |    |    | F3 |    | 3 | D3 | E3 | W3 | q |
| 4 |    |    |    | F4 | 4 | D4 | E4 | W4 |   |

|   |    |    |    |    |   |    |    |    |   |
|---|----|----|----|----|---|----|----|----|---|
| 1 | F1 | D1 | E1 | W1 | 1 | 5  | 6  | 7  | q |
| 2 |    | F2 |    |    | 2 | D2 | E2 | W2 |   |
| 3 |    |    | F3 |    | 3 | D3 | E3 | W3 |   |
| 4 |    |    |    | F4 | 4 | D4 | E4 | W4 |   |



$$33 - 6 - 4 = 20 \text{ bits}$$

tag

# Lecture 15 - Parallelization and Virtual Memory

## Superscalar Execution



|                |      |    |    |    |    |    |
|----------------|------|----|----|----|----|----|
| I <sub>1</sub> | Fadd | F1 | D1 | E1 | W1 |    |
| I <sub>2</sub> | Add  | F1 | D2 | E2 | W1 |    |
| I <sub>3</sub> | Fsub | F1 |    | D1 | E3 | W1 |
| I <sub>4</sub> | Sub  | F1 |    | D2 | E2 | W1 |

in-order issuing of instructions  
out-of-order completion

|                |      |    |    |    |    |    |
|----------------|------|----|----|----|----|----|
| I <sub>1</sub> | Fadd | F1 | D1 | E1 | W1 |    |
| I <sub>2</sub> | Add  | F1 | D2 | E2 | W1 |    |
| I <sub>3</sub> | Fsub | F1 |    | D1 | E3 | W1 |
| I <sub>4</sub> | Sub  | F1 |    | D2 | E2 | W1 |

in-order issuing of instructions  
program-order completion

## Amdahl's Law

### Parallelization:

- throughput ↑
- latency =
- program runtime T ↓

### Cons:

- Parallel programming is difficult
- Not all code can be parallelized:
  - data dependencies
  - sequential algorithms

$$T_p = T_s \left( f_s + \frac{f_p}{p} \right)$$

$$\text{speedup} = \frac{1}{f_s + \frac{f_p}{p}}$$

f<sub>p</sub> - fraction of code that can be parallelized

1 - f<sub>p</sub> = f<sub>s</sub> - sequentially executed code

T<sub>s</sub> - time to execute program sequentially

T<sub>p</sub> - time to execute program using p processors

### Quiz 15.1

$$a) s = \frac{1}{10\% + 90\%} = \frac{1}{40\%} = 2.5 \rightarrow 0.2 = 10\% + \frac{90\%}{p}$$

$$f_p = 90\% \quad f_s = 10\% \quad b) p = \infty \quad s_{\max} = \frac{1}{10\%} = 10 \Rightarrow p = 9$$

$$a) p = 3, s = ? \quad b) s_{\max} = ? \quad c) s = 5, p = ? \quad c) 5 = \frac{1}{10\% + \frac{90\%}{p}} = \frac{1}{20\%}$$

## Flynn's Taxonomy



**SISD** → Single Instruction, Single Data  
 → conventional system  
**SIMD** → Single Instruction, Multiple Data  
 → one instruction on multiple data streams  
**MIMD** → Multiple Instruction, Multiple Data  
 → multiple instruction streams on multiple data streams  
**MISD** → Multiple Instruction, Single Data

## SIMD

CPU: A lot of cache and control logic + reasonable compute logic  
 GPU: Small cache and control logic + a lot of compute logic



## Shared-memory Multiprocessors

Uniform Memory Access (UMA) architecture  
Any processor can access (transparently) directly any memory



## Non-Uniform Memory Access (NUMA)

Any processor can access directly any memory, but not at the same speed.  
 realization in hardware or in software (distributed shared memory)



## Distributed memory architecture

but not shared      not transparent  
 Any processor can access any memory, but  
 sometimes through another processor (messages)



## Interconnection networks

Difficulty in building systems with many processors - interconnections  
 Parameters:

- Diameter - maximal distance between any two processors
- Degree - maximal number of connections per processor
- Total number of connections = Cost
- Bisection width - largest number of simultaneous messages



## Hypercubes

Non-uniform delay - NUMA architecture  
 $n \times 2^{n-1}$  connections

Connected processors differ by 1 bit

Routing:

- scan bits from right to left
- if different, send to neighbour with same bit different
- repeat until end





**Caching**  
Blocks  
Cache miss  
Tag field  
Byte/word field  
Mapping function

**Virtual Memory**  
Pages  
Page fault  
Page number  
Page offset  
Page table

Divide address space in chunks  
Store more chunks on disk  
When chunk is needed, move to main memory



## Memory Management Unit

- hardware support
- mapping of pages in memory
- access control

## Operating System

- sets up MMU and paging tables
- handles misses



## Demand paging

- From virtual address
- through page table
- to physical address



## Logical address

- covers complete address space

## Page table

- stored in main memory (OS)
- translates virtual page number to a physical page frame





CPU contains special register that stores the starting location of the page table (<sup>one per</sup> process). Combining page frame and offset gives the complete physical memory address.

Large address space  $\rightarrow$  large page table  
Page table does not fit in cache

### Translation Lookaside Buffer (TLB)

- Special cache for small part of the page table
- Look up virtual page number to find page frame



If part of page table not found in TLB  $\rightarrow$  regular cache miss

If page frame found in page table, but marked as invalid  
 $\rightarrow$  page fault

If page frame not mapped in page tables  $\rightarrow$  segfault

MMU issues a trap  
(interrupt caused by execution)

- CPU
  - $\rightarrow$  aborts the instruction
  - $\rightarrow$  switches to kernel mode
- OS: exceptional handler
  - $\rightarrow$  switches for a free page frame
  - $\rightarrow$  loads the requested page from disk into memory
  - $\rightarrow$  updates the page table
  - $\rightarrow$  restarts the aborted instruction

### Quiz 15.2

$$\begin{aligned}
 & 42\text{-bit PC} & 4\text{KiB offset} = 2^{12} \text{ b} \rightarrow 12\text{b} \\
 & \text{word size - 16b} & 32\text{ GiB} = 2^{35} \text{ B} \rightarrow 35\text{b} \\
 & \text{word addr, RAM - 8 GiB} & 35 - 12 = 23\text{ b} \xrightarrow{\text{virtual page number}} \\
 & \text{OS memory - 32 GiB} & 8\text{ GiB} = 2^{33} \text{ B} \rightarrow 33\text{b} \\
 & \text{pages - 4 KiB} & 32\text{ KiB} \xrightarrow{\text{?}} 32\text{b} = 4\text{b} \\
 & \text{control bits - 6b} & 2^{7b} = 6\text{b} + \frac{21\text{b}}{32} \xrightarrow{\text{?}}
 \end{aligned}$$

Prevent programs from accessing

- OS internals (page tables, I/O devices)
- each others memory

DS loads correct page table for each process

- OS maps only what is allowed

- Other parts of the memory are unreachable



Separate 'kernel' mode for OS processes

- Give access to special instructions:
- access 'page-table base register', ...

# Self-Study 9

Amdahl's Law

$$\textcircled{1} \quad T_S = 32s \quad s = \frac{T_S}{T_P} > 2 \quad \frac{1}{f_S + \frac{f_P}{P}} \geq 2 \quad f_S + \frac{f_P}{P} \leq \frac{1}{2}$$

$$T_P \leq 16s \quad f_P = 87\% \quad f_S = 13\% \quad P \geq \frac{0.87}{0.37} \sim 2.4 \quad \frac{0.87}{P} \leq 0.37 \quad 0.13 + \frac{0.87}{P} \leq 0.5$$

$$P = ? \quad \Rightarrow \text{at least } 3$$

$$\textcircled{2} \quad T_S = 42s \quad T_P = T_S \left( f_S + \frac{f_P}{P} \right) \quad \frac{2}{3} = 1 - \frac{2.51}{2.52} f_P \quad f_P = \frac{84}{251}$$

$$P = 252 \quad f_P = ? \quad \frac{2.51}{2.52} = 1 - f_P + \frac{f_P}{2.52} \quad f_P = -\frac{1}{2} \cdot -\frac{2.52}{2.51}$$

$$T_P = 28s \quad f_S = 1 - f_P \quad f_P \approx 0.33$$

Virtual Memory

$$\textcircled{1} \quad \text{page} = 2^4 \text{ KiB} = 3 \times 2^{13} \text{ B}$$

$$42 \times 2^{20} \text{ entries} = 2^4 \times 2^{21}$$

$$\text{total addressable virtual memory} = \frac{3 \times 2^{13} \times 2^4 \times 2^{21} \text{ B}}{\sim 1 \text{ TB/TiB}} = 63 \times 2^{24} \text{ B} = 63 \times 16 \text{ GiB} = 1008 \text{ GiB}$$

