

## Cap 3 Computer System Performance

### 3.1. Introduction

#### Computer Systems



### 3.2 Parameters

$$\bullet \text{Performance}_x = \frac{1}{\text{Exec. time}_x}$$

$$\frac{\text{Performance A}}{\text{Performance B}} = \frac{\text{Execution time B}}{\text{Execution time A}}$$

example Computer X → 10 μs

Computer Y → 15 μs

$$\Rightarrow \frac{\text{Performance}_X}{\text{Perf.}_Y} = \frac{\text{Exe. time}_Y}{\text{Exe. time}_X} = \frac{15}{10} = 1,5$$

• CPU time translatate

Performance A > Performance B

Performance B > Performance C

⇒ Performance A > Performance C

program  
Workload

Benchmark file  
L (lab mai variate) ← comp, simulator, ...

kernels, toy programs, synthetic benchmark X  
real-world programs

**SPEC**

standard.

← NNW, spec. dg - max. referinta  
performance, evaluation, corporation

1989, --, 2017\*

m - number of programs in the benchmark suite

Geometric mean =  
(media geometrică)

performance relative  
față de maxima referință  
CPU time

Reference i

CPU time  $x_i$

$\prod_{i=1}^m$  Sample i

CPU time  $x_i$

$$\frac{\text{Performance } X}{\text{Performance } Y} = \sqrt[m]{\prod_{i=1}^m \frac{\text{CPU time Reference } i}{\text{CPU time } x_i}} = \sqrt[m]{\prod_{i=1}^m \frac{\text{CPU time } y_i}{\text{CPU time } x_i}}$$

maxima referință  
diferență

$$\sqrt[m]{\prod_{i=1}^m \frac{\text{CPU time } y_i}{\text{CPU time } x_i}} =$$

$$= \sqrt[m]{\prod_{i=1}^m \frac{\text{Performance } x_i}{\text{Performance } y_i}}$$

$$\text{Standard deviation } \sigma = \sqrt{\sum_{i=1}^m (\text{Sample } i - \bar{x})^2}$$

medie geometrică

### 3.3 Quantitative principles in computer sys. design

1) Use parallelism when possible

Intuition - level parallelism fetch, execute, memory





Boolean level : CLA

MPS<sub>c</sub>C  
multi processor Systems - on-chip

② Locality principle

~~spatial~~  
spatial  
temporal

lernkurve, iteration<sup>n</sup>

plot,



power - law

- The processor spends 90% of its time executing 10% of the instructions.

instr  
data

### 3) Amdahl's law - make the common case faster

am final beni:



| Second part | Hours<br>2 <sup>nd</sup> part | Speedup<br>2 <sup>nd</sup> part | Hours<br>tot | Speed up<br>tot |
|-------------|-------------------------------|---------------------------------|--------------|-----------------|
| Foot        | 50 hrs                        | 1                               | 70           | 1               |
| Bike        | 20                            | 2,5                             | 40           | 1,75            |
| Hyundai     | 4                             | 12,5                            | 24           | 2,92            |
| Ferrari     | 1,6                           | 31,25                           | 21,6         | 5,24            |
| Rocket car  | 1                             | 50                              | 21           | 3,33            |

Overall speedup =  $\frac{\text{Exec. time without enhancement}}{\text{Exec. time with enhancement}}$

$$= \frac{\text{Exec. time without enhancement}}{\text{Exec. without enh. } [1 - \text{Fraction enhanced}] + \frac{\text{Fraction enhanced}}{\text{Speedup}}}$$

$$\Rightarrow \text{Overall speedup} = \frac{\text{Exec. time without enhancement}}{(1 - \text{Fraction enhanced}) + \frac{\text{Fraction enhanced}}{\text{Speedup}}}$$

$$\text{bytes fraction enhanced} = \frac{\frac{1}{200} \text{ (drum pattern optimiza)}}{\frac{1}{280} \text{ (drum total)}} = \frac{10}{14} = \frac{5}{7}$$

lim speedup -> ∞

$$(\text{Overall speedup}) = \frac{1}{1 - \text{Fraction enhanced}}$$

make the common case faster

### 3.4 The computer performance equation

CPU time - doesn't lie

CPU time = Clock cycles for a program  $\times$  clock cycles time

Clock frequency (rate) =  $\frac{1}{\text{clock cycle time}}$

$\times 10^3 \rightarrow 2^{10} K_B$

$10^6 \rightarrow 2^{20} M_B$

$10^{12} \rightarrow 2^{30} G_B$

$10^{12} \rightarrow 2^{40} T_B$

$10^{15} \rightarrow 2^{50} P_B$

$10^{18} \rightarrow 2^{60} E_B$

$10^{21} \rightarrow 2^{70} Z_B$

$10^{24} \rightarrow 2^{80} Y_B$

$$CPU \text{ time} = \overbrace{C_{FP} \times C_C t}^{\substack{\text{mode, separate per instruction}}} + \underbrace{\text{Instruction count} \times \text{clock cycles per instruction}}_{\substack{\text{time}}} \times \text{clock cycles}$$

technic, silicon

$$\text{arhitectura} = \underbrace{ic \times CPI}_{\substack{\text{instruction de tip } i}} \times \text{clock cycles time}$$

$$CPU \text{ time} = \frac{\text{instructions}}{\text{Program}} \times \frac{\text{clock cycles}}{\text{Instruction}} \times \frac{\text{Time}}{\text{clock cycles}} = \frac{\text{Time}}{\text{Program}}$$

$$PPI = \sum_{i=1}^m y_{ci} \times CPI_i$$

*y<sub>ci</sub> - instruction count for type i*

$$CPI = \frac{y_c}{\sum_{i=1}^m y_{ci}}$$

*CPI<sub>i</sub> - clockcycles per instruction of type i*

$$CPI = \sum_{i=1}^m \left( \frac{y_{ci}}{y_c} \right) \times CPI_i$$

example

$$FP_{instr} = 25 \text{ h.}, CPI_{FP} = 1,0 \text{ c.c.} \xrightarrow{\text{clockcycles}}, CPI_{others} = 1,33 \text{ c.c.},$$

$$FP_{SQRT} = 2 \text{ h.}, CPI_{FP_{SQRT}} = 2,0 \text{ c.c.}$$

Variante: A.  $CPI_{FRSGRT} = 2 \text{ c.c.}; B. CPI_{FP} = 2,5 \text{ c.c.}$

$$CPI_{diagonal} = 0,25 * 4 + 0,45 * 1,33 \approx 2 \text{ c.c.}$$

$$CPI_A = 1 \text{ c.c.} - (20-2) \cdot 0,02 = 1,64 \text{ c.c.}$$

$$\sqrt{CPI_B} = 0,25 * 2,5 + 0,75 * 1,33 \approx 1,62 \text{ c.c.}$$

B e mai bun (make the common case faster)

Example 2

- Maxima A, CPV A, instrucții comparative mănuște săt conditional

if  $a=b$  then

$$\equiv ①$$

← limbaj de asamblare

else

$$\equiv ②$$

branch  
not  
equal

CMP r1, r2  
BNE address 1

$$\equiv ①$$

branch  
mecanizational

(B) address 2  
address 1  $\equiv ①$   
address 2  $\equiv ②$

$$\equiv ②$$

- Maxima B, are CPVs, deoarece

in loc de

CMP r1, r2 , avem BNE r1, r2 address 1  
BNE address 1

CPV<sub>A</sub> = 20%, cond branch instruction

CPI branch = 2 c.c.

CPI others = 1 c.c.

clock cycle time<sub>B</sub> = 1,25 × clock cycle time<sub>A</sub>

Which CPI is better?

$$CPU_{time A} = iC_A \times CPI_A \times \text{clock cycle time } A$$

$$CPU_{time B} = iC_A \times 0,8 \times 1,25 \times \text{Clock Cycle time}_A$$

~~?~~  $iC_B = 0,8 \times 1,25$

$$CPI_B = 0,25 \times 2 + 0,45 \cdot 1 = 1,25$$

20% --- 8e)

--- 10%

$$\frac{x}{x} = \frac{0,100}{0,08} = 125\%$$

$$CPU_{time B} = iC_A \times 1,25 \times \text{clock cycle time } A$$

$CPU_{time A} < CPU_{time B} \Rightarrow \text{Performance } A > \text{Performance } B$

### Example 3

| instruction type | Frequency | CPI |
|------------------|-----------|-----|
| ALU              | 43%       | 1   |
| LOAD             | 21%       | 2   |
| STORE            | 12%       | 2   |
| BRANCH           | 24%       | 1   |

Old, 25% ALU 1 operand from memory

$$\text{ex: } C = A + B$$



New  
ALU register memory

$$ADD r3, r2, [var-a]$$

(2)

New tag -mem ALU instr.  $\rightarrow$  CPI<sub>ALU</sub> = 2.0 c.c.

CPI branches = 3 c.R.

Ce se schimbă? <sup>new</sup> CPI, ie

, doar clock cycle time nu e afectat

$$CPV_{time\ old} = \bar{C}_0 \times CPI_0 \times \text{clock cycle time}_0 = \bar{C}_0 \times 1,57 \times \frac{\text{clock}}{\text{cycle old}}$$

$$CPI_0 = 0,43 \times 1 + 0,57 \times 2 = 1,57 \text{ c.e.}$$

$$CPV_{time\ new} = \bar{C}_m \times CPI_m \times \text{clock cycle time}_0$$

$$\bar{C}_m = \bar{C}_0 (1 - 0,43 \times 0,25) = \bar{C}_0 \cdot 0,8925$$

$$CPI_m = \frac{(0,43 \times 0,25 \times 2) + (0,75 \times 0,43 \times 1) + (0,21 - 0,43 \cdot 0,05) \times 2 + 0,12 \times 2}{1 - 0,43 \times 0,25}$$

$$CPI_m = \frac{1,908}{0,8925} 1,6$$

$$CPV_{time\ new} = \bar{C}_0 \times \frac{1,6}{1,908} \times \text{clock cycle time}_0$$

$$CPV_{time\ old} < CPV_{time\ new} \Rightarrow$$

Performance old > Performance new

### 3.5 Other metrics

MIPS

(million instruct per second)

MFLOPS

(million floating point operations per second)

$$\text{MIPS} = \frac{IC}{\text{CPU time} \times 10^6} = \frac{IC}{Tc \times CPI \times \text{Clock cycle time} \times 10^6} = \frac{\text{Clock frequency}}{CPI \times 10^6}$$

Example 4 ← are ref ex 3)

compiler new - reduction of 50% ALU

clock cycle time = ~~20 ns~~ ms

$$\text{CPU time old} = IC_0 \times 1,54 \times \frac{1}{20 \text{ ns}} =$$

$$\text{MIPS old} = \frac{50 \cdot 10^6 \text{ ns}}{1,54 \cdot 10^6} = 31,85 \text{ ms}$$

$$\text{Clock frequency} = \frac{1}{20 \cdot 10^{-9} \text{ s}} = \frac{10^9}{20} \cdot 10^6 = 50 \text{ MHz}$$

$$\text{CPU time new} = IC_m \times (CPI_m \times 20 \text{ ms}) =$$

$$IC_m = IC_0 (1 - 0,43 \times 0,5)$$

$$CPI_m = \frac{0,43 \cdot 0,5 + 0,54 \cdot 2}{1 - 0,43 \cdot 0,5}$$

$$\Rightarrow \text{CPU time new} = IC_0 (1 - 0,215) \times \frac{0,215 + 1,14}{1 + 0,215} \times 20 \text{ ms}$$

$$= 27,1 \text{ ms} \times IC_0$$

temp modis  
instructions

31,4

$IC_0 \times 304 \text{ ms}$

CPU time new < CPU time old

Performance new > Performance old

$$\text{MIPS}_{\text{new}} = \frac{50 \times 10^6 \text{ NA}}{1,725 \times 10^6} = 28,96$$

performance cycle, MIPS scale

$$\text{Relative MIPS}_x = \frac{\text{Exec time reference}}{\text{Exec time } x} \times \text{MIPS}_{\text{reference}}$$

stone → prima referenza

1 MIPS machine VAX 11/780

$$\text{MFLOPS} = \frac{\text{IC FP}}{\text{CPU time} \cdot 10^6}$$

normalized IC<sub>FP</sub> dinner loops

| Instruct type          | Normalized IC |
|------------------------|---------------|
| +,-,*,CMP              | 1             |
| dir, dest              | 4             |
| Gx <sub>p</sub> , dest | 8             |

## Capitolo 4 Memory hierarchy

### 4.1 Introduction

locality principle  $1B = 2^3 \text{ bits}$





Resumām



Processor manages instructions & data

4. Memory hierarchy  
 ascendere  $\rightarrow$  evidentie undertaken instructioni  
 4.2 Cache mapping

4.2.1 Direct mapping

$M_2$   $2^m$  blocks

$M_1$   $2^m$  blocks

$$m < n$$

$M_2(0), M_2(1), \dots, M_2(j) \dots, M_2(2^m-1)$

$M_1(0), M_1(1), \dots, M_1(i) \dots, M_1(2^m-1)$

$$i = j \bmod 2^m$$

exemplu 1

$$m_2 = 3$$

$$m = 5$$



$$\text{acces } i = j \bmod 8$$

~~Tag~~



$\therefore M_1(0) \leftarrow 0, 8, 16, 24$

$M_1(1) \leftarrow 1, 9, 17$   
 $\downarrow \text{modulo}$

$$M_1(4) = 12$$

$$12 : 8 = 1 \text{ r } 4$$

(0,1) 00  
 index

Address Word  
 $\angle m-m \rangle$

$\langle m \rangle$

$\langle p \rangle$

$\langle g \rangle$

|     |       |              |             |
|-----|-------|--------------|-------------|
| Tag | index | Block offset | Word offset |
|-----|-------|--------------|-------------|

## 4 Memory hierarchy



### Example 2

ample 2 Sist. memoria / memoria principală  $MM = M_2 = 2^4$  cuvinte

- 1 byte = 1 word  
(1 word = 1 byte)
  - cache date size = 1 k i block
    - mini program - 3 instructions, ③ instructions

| 8k     | Code |
|--------|------|
| 8192   | 230  |
| 4102   | 1715 |
| 12 292 | 170  |

 MM
  (posta)

Block... 1)



2

$$M_2 = 2^{24} \text{ words}$$

= 1 byte

data size = 1Ki block

Code

230

2715

170





### example 3

Indirectly 1 FastMath

$$1 \text{ word} = 4 \text{ bytes} \Rightarrow 2^2$$

$$2^4 2^2 3^3$$

$$1 \text{ block} = 16 \text{ words}$$

bits/state

capacity  
main memory  $SMM = 4 GiB = 2^2 \cdot 2^{30} B = 2^{32} B$

capacity  
Cache  $= 2^8$  blocks



Vrem mișcări cat mai mici

#### 4.2.2 Set associative mapping

$$M_1: M_1(0) \dots M_1(i') \dots M_1(2^m)$$

$$M_2: M_2(0) \dots M_2(j) \dots M_2(2^{m-1})$$

Set associativity is  $2^k$

$$i' = j \bmod 2^{m-1}$$

#### Exemplu 5

$$m=3$$

$$n=5$$

$$\Delta=1$$

Set associativity is on  $k=2^{i=1}$  ways

$$k=2$$

$$m-\Delta = 3-1 = 2$$

01100  
Tag index

$$12 \bmod 2^{3-1} = 12 \bmod 4 = 0$$

- nu se mai dă afara în tag, conținutul



Exemplul 6

|     | ✓   | Tag <13> | Data <32> |   |      |          |
|-----|-----|----------|-----------|---|------|----------|
| 0   | 101 | 4        | 230       | - | -    | whatever |
| 1   | 01  | 2        | -         | - | 2715 | -        |
| :   |     |          |           |   |      |          |
| 0   |     |          |           |   |      |          |
| 1   |     |          |           |   |      |          |
| 511 | 0   |          |           |   |      |          |

Bank 0

|     | ✓  | Tag | Data |   |   |   |
|-----|----|-----|------|---|---|---|
| 0   | 0  | 0   |      |   |   |   |
| 1   | 01 | 6   | 140  | - | - | - |
| :   |    |     |      |   |   |   |
| 0   |    |     |      |   |   |   |
| 1   |    |     |      |   |   |   |
| 511 | 0  |     |      |   |   |   |

Bank 1



Example 6

|     |   | V | Tag | Data |      |   |
|-----|---|---|-----|------|------|---|
| 0   | 1 | 4 | 230 | -    | -    | - |
| 1   | 1 | 2 | -   | -    | 2315 | - |
| ⋮   |   |   |     |      |      |   |
| 0   |   |   |     |      |      |   |
| ⋮   |   |   |     |      |      |   |
| 511 | 0 |   |     |      |      |   |

Bank 0

Example

20>

Tag

|     |   | V | Tag | Data |   |   |
|-----|---|---|-----|------|---|---|
| 0   | 1 | 0 | 6   | 170  | - | - |
| 1   | 1 | 1 | -   | -    | - | - |
| ⋮   |   |   |     |      |   |   |
| 0   |   |   |     |      |   |   |
| ⋮   |   |   |     |      |   |   |
| 511 | 0 |   |     |      |   |   |

Bank 1

Exemplu 7

VAX 11/780



4.2.3 Full associative mapping



week 0      Example 7      VAX 11/780



fel problema ca la examen

### 4.2.2 Set-associative mapping

$$\begin{aligned} 4 \text{GiB} &= 2^2 \cdot 2^{30} \text{B} = 2^{32} \text{B} \\ \text{NM size} &\rightarrow 4 \text{GiB} \\ \text{Byte addressable} \\ 1 \text{Word} &= 8 \text{B} \quad 8 = 2^3 \\ 4\text{-way set associative} \\ 1 \text{Block} &= 16 \text{Words} = 2^4 \text{Words} \end{aligned}$$

a) formatul adresii

b) dimensiune totala cache

c) logic diagrame cache  
(de exemplu evit)

~~128 KiB~~ Data Cache

$$\text{Cache size} = 128 \text{KiB} = 2^4 \cdot 2^{10} \text{B} = 2^{14} \text{B} = \frac{2^{14}}{2^4} = 2^{10} \text{bytes}$$

1 block = ?



$$\frac{2^{10} \text{ (cache bytes)}}{2^2} = 2^8$$

(4 Way SA)

categorii bloc

b)  $\left[ (1+17) \text{ bits} + 2^7 \text{B} \right] \cdot 2^10 = \underbrace{18 \cdot 2^{10}}_{\text{bytes}} + 128 \text{KiB}$  apoi în  $\approx$   $130 \text{KiB}$

$$\approx 2 \text{B} \times 2^{10} + 128 \text{KiB} \approx 130 \text{KiB}$$







### 4.2.3 Full associative mapping

$$k = 2^s$$

$$f \sim s=0 \Rightarrow \delta M$$

$$\text{for } s=m$$

$2^m$  blocks in  $M_1$

#### Exemple 1

$$m = 3$$

$$m = 5$$



## Example 2

$$M1 \rightarrow 2^3 B = 16 MiB$$

$$1\text{ word} = 1 B$$

$$1\text{ block} = 4\text{ words}$$

1 Ki block <sup>cache</sup> data size  
implant  $\rightarrow$  Address      Code

8192            230

4102            2715

12292            170

|    |      | <22> |      | <32> |  |
|----|------|------|------|------|--|
|    |      | V    | Tag  | Data |  |
| 0  | T    | 2048 | 230  |      |  |
| X1 | 1025 |      | 2715 |      |  |
| Q1 | 3073 |      | 170  |      |  |
|    |      |      |      |      |  |
|    |      |      |      |      |  |
| m4 | O    |      |      |      |  |



## 4.3 Replacement techniques

a) FIFO

b) Random (for  $k$  small)

$$G(x) = x^4 + x + 1$$

Galois  
Galois

$$\begin{array}{r} 32 \\ \times 11 \\ \hline 21 \end{array}$$

Linear Feed Back Shift Register (LFSR)



| RA <sub>0</sub> | RA <sub>1</sub> | RA <sub>2</sub> | RA <sub>3</sub> |
|-----------------|-----------------|-----------------|-----------------|
| 1               | 0               | 0               | 0               |
| 0               | 0               | 0               | 1               |

| $R_{D0}$ | $R_{D1}$ | $R_{D2}$ | $R_{D3}$ | $k$ |
|----------|----------|----------|----------|-----|
| 1        | 0        | 0        | 0        | 1   |
| 0        | 0        | 1        | 0        | 2   |
| 0        | 0        | 1        | 0        | 4   |
| 0        | 0        | 0        | 1        | 8   |
| 1        | 1        | 0        | 0        | 3   |
| 0        | 1        | 1        | 0        | 6   |
| 0        | 0        | 1        | 1        | 12  |
| 1        | 1        | 0        | 1        | 11  |
| 1        | 0        | 1        | 0        | 5   |
| 0        | 1        | 0        | 1        | 10  |
| 1        | 1        | 1        | 0        | 7   |
| 0        | 1        | 1        | 1        | 14  |
| 1        | 1        | 1        | 1        | 15  |
| 1        | 0        | 1        | 1        | 13  |
| 1        | 0        | 0        | 1        | 9   |
| 1        | 0        | 0        | 0        | 1   |

c) least recently used      la hit varăte mănușă rămâne la fel  
 (miss)  
 când accesă în block văzută la = const  
 (restul increment)

### Age registers

← modifică legătură

dacă cei valori se  
implementează

4 way SA

0, 1, 2, 3

14, 7, 6, 12, 10,

6, 7, 14, 4, 3, 15

|    |   | B0 | B1  | B2  | B3 | B0* | B1* | B2* | B3* |
|----|---|----|-----|-----|----|-----|-----|-----|-----|
| 14 | M | 0  | 0   | 0   | 0  | 0   | 0   | 0   | 0   |
| 14 | M | 14 | 0   | 0   | 0  | (C) | 0   | 0   | 0   |
| 7  | M | 14 | 7   | 0   | 0  | 1   | (C) | 0   | 0   |
| 6  | M | 14 | 7   | 6   | 0  | 2   | 1   | (C) | 0   |
| 12 | M | 14 | 7   | 6   | 12 | 3   | 2   | 1   | (C) |
| 10 | M | 10 | 7   | 6   | 12 | 0   | 3   | 2   | 1   |
| 6  | H | 10 | 7   | (6) | 12 | 1   | 3   | (2) | 1   |
| 4  | H | 10 | (4) | 6   | 12 | 2   | 0   | 1   | (3) |

|    | 00 | B1  | B2 | B3   | B0* | B1* | B2* | B3* | 01 |
|----|----|-----|----|------|-----|-----|-----|-----|----|
| YH | 10 | 7   | 6  | 12   | 2   | 0   | 1   | 1   | ③  |
| MM | 10 | 7   | 6  | (15) | 3   | 1   | 2   | 0   |    |
| YH | 10 | (7) | 6  | 14   | 3   | 0   | 2   | 1   |    |
| MM | ③  | 7   | 6  | 15   | 0   | 1   | 3   | 2   |    |
| YH | 3  | 7   | 6  | (15) | 1   | 2   | 3   | 0   | 0  |

#### 4.5 Write policies

90% of all access are reads

For writes techniques

a) Write Through (advised)

b) Write Back

policy giving the most  
possible write-back flexibility

Cache Coherency

(mut mut area 2 check  
(miss time))





- b) Write Back update M2 upon replacement



- legătură înveli, deci nu face efectiv cu măngâie ✓

In case of a Write Miss

- I) Write allocate (nu mai aduce back in cache)
- II) Write No Allocate

## 4.5 Memory hierarchy performance

$\overline{AMAT}$  = average memory access time

$$\overline{AMAT} = \text{tac} \cdot \text{HitRate} + (\text{tac} + \text{MissPenalty}) \text{MissRate} =$$

↑  
time access

(timp)

$$(\text{MissPenalty} + \text{MissRate} = 1)$$

at miss time      at miss time

$$= \text{tac} + \text{MissPenalty} \cdot \text{Miss Rate}$$

(1)

## 4.5 Memory hierarchy performance

$$\text{Average memory access time} = \text{AMAT} = \frac{\text{tac}}{1 \text{ clk cycle}} + \dots$$

$\downarrow$   
1 clk cycle

$\downarrow$   
1 clk cycle

$$\text{AMAT} = \text{tac} + \text{Miss Rate} \times \text{Miss Penalty (time)}$$

$\text{CPU}_{\text{time}}$  for a ~~perfect~~ cache

$$= \text{IC} \times \text{CPI} \times \text{clk cycle time}$$

(real)

$$= \text{IC} \times (\text{CPI} + \text{Memory stalls}) \text{ clk cycles time}$$

$$\text{Memory stalls} = \text{Memory accesses per instruction} \times \text{Miss rate} \times \frac{\text{Miss penalty}}{\text{(miss per instruction)}} \times \frac{\text{Miss penalty}}{\text{(C-C)}}$$

$$\text{Memory stalls} = \text{Memory accesses per instruction} \times (\text{Percentage read} \times \text{Read Miss penalty} + \text{Percentage write} \times \text{Write Miss penalty})$$

$$\star \text{ Read miss penalty} + \text{ Write miss penalty} \times \text{Write miss Rate}$$

### Exemplu 1

$$\cdot \text{CPI ideal} = 8,5 \text{ C.C.}$$

$$\text{AMAT?}$$

$$\cdot \text{Memory access per instruction} = 3$$

$$\text{CPU}_{\text{time}}?$$

$$\cdot \text{Miss Rate} = 11\%$$

degrade performance cache

$$\cdot \text{Miss Penalty} = 6 \text{ C.C.}$$

$$\text{tac} = 1 \text{ C.C.} = 5 \text{ ms}$$

ANAT: 5 ms  
perfect

$$ANAT_{real} = 5 \text{ ms} + 0,11 \times 6 \text{ R.C.} \times 5 \text{ ms}$$
$$= 8,3 \text{ ms}$$

$$CPU_{time,ideal} = IC \times \underbrace{8,5}_{\text{average time per instr}} \times 5 \text{ ms}$$

$$CPU_{time,real} = IC \times CPI_{real} \times 5 \text{ ms}$$

$$CPI_{real} = 8,5 + 3 \times 0,11 \times 6 \approx 10,5 \text{ CC}$$

$$\frac{CPI_{real}}{CPI_{ideal}} = \frac{10,5}{8,5} \text{ Report degradat. performance}$$

### Example 2

64 KiB 2 way SR

cctime = 20 ms  $\rightarrow$  clock frequency

Mem access per ~~inst~~ inst = 3 Miss degradation with 35%.

CPI ideal = 1,5 CC Miss penalty = 200 ns

DM Cache?

Miss rate  $\Delta_M = \frac{\text{misses}}{\text{clock freq}} \cdot 100\% = 3,9\%$ .

Miss rate  $SR = 3\%$

$$AMAT_{1W} = 20 \text{ ms} + 0,039 \times 200 = 98 \text{ ms} \quad (27,8 \text{ ms})$$

$\downarrow$  frequency  $\Rightarrow$   $\uparrow$  time

$$clk \text{ cycle time } T_W = 20 \text{ ms}$$

$$clk \text{ cycle time } T_W = 20 \text{ ms} + 1,085 = 21,4 \text{ ms}$$

$$AMAT_{2W} = 21,4 \text{ ms} + 0,03 \times 200 \text{ ms} = 27,7 \text{ ms}$$

is better

$$CPU \text{ time } t_{1W} = IC \times (CPI_{ideal} \times clk \text{ cycle time} + Mem$$

$$\text{accesses/instruction} \times \underbrace{\text{Miss Rate} \times \text{Miss penalty} \times clk \text{ cycle time}}_{\text{miss penalty time}})$$

$$CPU \text{ time } t_{1W} = IC \times (1,5 \times 20 \text{ ms} + 1,3 \times 0,039 \times 200 \text{ ms})$$

$$= IC \times 40,14 \text{ ms}$$

*is better*

$$CPU \text{ time } t_{2W} = IC \times (1,5 + 21,4 \text{ ms} + 1,3 \times 0,03 \times 200 \text{ ms})$$

$$= IC \times 40,35 \text{ ms}$$

## 4.6 The 3 sources of cache misses

(the 3 C)

1) ~~bad address, cache go~~, dear miss

balanta

- 2) compulsory (off-start misses)  $\rightarrow$  bigger block sizes
- 3) conflict - many ~~blocks~~ mapped at ~~same~~ the same index  $\rightarrow$  higher SP
- capacity - cache goes miss - invalid cache more

### Example 3

In cache

- 4 way SA
- ③ 8 sets, 4 words / block
- Addressable unit word
- MM size 2<sup>16</sup> words
- cache is initially empty
- tac = 30 ns = 1 clk cycle
- Miss penalty = 480 ns

LRU Replacement



- a) Memory address format
- b) a. i) Total cache capacity if 1 word = 32 bits

b) Hit ratio, AMAT for the workload advise meepit

Address

M2

Block

| 0   | 1 | 2    | 3   | 0  |
|-----|---|------|-----|----|
| 4   | 5 | 6    | 7   | 1  |
| 8   | — | —    | 11  | 2  |
| 12  | — | —    | 15  | 3  |
| 16  | — | —    | —   | 4  |
| 20  | — | —    | —   | 5  |
| 24  | — | —    | —   | 6  |
| 28  | — | —    | —   | 7  |
| 32  | — | —    | —   | 8  |
| 36  | — | —    | —   | 9  |
| 40  | — | —    | —   | 10 |
| 44  | — | —    | —   | 11 |
| 48  | — | —    | —   | 12 |
| ~   | — | —    | —   | —  |
| 120 | — | —    | —   | 30 |
| 124 | — | —    | —   | 31 |
| 138 | — | —    | —   | 32 |
| 152 | — | —    | —   | 33 |
| ~   | — | —    | —   | 34 |
| 152 | — | —    | —   | 35 |
| 154 | — | —    | —   | 36 |
| 160 | — | —    | —   | 37 |
| 164 | 2 | 16-4 | 2-1 | 16 |
| 164 | 2 | 16-4 | 2-1 | 14 |

-5  
 26  
 27  
 28  
 29  
 STOP **30** START

31

32

33

34

35

36

37

38

39

40

41

42

43

44

45

46

47

48

49

50

51

52

53

54

55

56

57

58

59

60

61

62

63

64

65

66

67

68

69

70

71

72

73

74

75

76

77

78

79

80

81

82

83

84

85

86

87

88

89

90

91

92

93

94

95

96

97

98

99

100

101

102

103

104

105

106

107

108

109

110

111

112

113

114

115

116

117

118

119

120

121

122

123

124

125

126

127

128

129

130

131

132

133

134

135

136

137

138

139

140

141

142

143

144

145

146

147

148

149

150

151

152

153

154

155

156

157

158

159

160

161

162

163

164

165

166

167

168

169

170

171

172

173

174

175

176

177

178

179

180

181

182

183

184

185

186

187

188

189

190

191

192

193

194

195

196

197

198

199

200

201

202

203

204

205

206

207

208

209

210

211

212

213

214

215

216

217

218

219

220

221

222

223

224

225

226

227

228

229

230

231

232

233

234

235

236

237

238

239

240

241

242

243

244

245

246

247

248

249

250

251

252

253

254

255

256

257

258

259

260

261

262

263

264

265

266

267

268

269

270

271

272

273

274

275

276

277

278

279

280

281

282

283

284

285

286

287

288

289

290

291

292

293

294

295

296

297

298

299

300

301

302

303

304

305

306

307

308

309

310

311

312

313

314

315

316

317

318

1 miss, 1 hit, 1 miss, 3 misses  
 Tag valid

$$\left(\frac{1M}{1H}\right) \times 32 \times \left(\frac{1M}{2H}\right) \times 2 = 36M \approx 213 \text{ bytes} \quad (6)$$

$$\Rightarrow 249 \text{ entries}$$

|   | Tag | Valid Tag      | V Tag               | V Tag                 |
|---|-----|----------------|---------------------|-----------------------|
| 0 | 51  | 0 32 33 34 35  | 0 1 2 3 96 97 98 99 | 0 0 1 128 129 130 131 |
| 1 | 1   | 36 37 38 39    | 1 0 2 68 69 70 71   | 1 0 4 132 133 134 135 |
| 2 | 1   | 50 51 52 53    | 2 0 2 82 83 84 85   | 2 0 4 136 137 138 139 |
| 3 | 1   | 44 45 46 47    | 3 0 2 76 77 78 79   | 3 0 4 140 141 142 143 |
| 4 | 1   | 48 49 50 51    | 4 0 2 80 81 82 83   | 4 0 4 144 145 146 147 |
| 5 | 1   | 52 53 54 55    | 5 0 2 84 85 86 87   | 5 0 4 148 149 150 151 |
| 6 | 0   | 24 25 26 27    | 6 0 2 88 89 90 91   | 6 0 4 152 153 154 155 |
| 7 | 0   | 28 29 30 31    | 7 0 2 60 61 62 63   | 7 0 4 158 159 160 161 |
| 8 | 4   | 00 154 155 156 | 8 0 2 28 29 30 31   | 8 0 4 164 165 166 167 |
|   |     | B1 spp         | B1                  | B1                    |

a)

|     | <6>   | <8>   | <2>       |
|-----|-------|-------|-----------|
| Tag | index | Block | $\sim 2B$ |

$(\text{Cache capacity})_{\text{sa}} = 2^2 \times 2^3 \times (\underbrace{1 \text{ bit}_{\text{valid}} + 6 \text{ bits}_{\text{tag}} + 2 \text{ bits}_{\text{age}}}_{\text{V tag}} + 2^2 \times 2^2 B) = 2^9 B + 2^5 B = 544 B$ 
 $\frac{25}{2^3} \text{ bytes}$

## 4.5 Memory hierarchy performance

| N | T <sub>1L</sub> | 1G0 | 1G1 | 1G2 | 1G3 |
|---|-----------------|-----|-----|-----|-----|
| 1 | 154             | 102 | 205 | 75  | 425 |
| 2 | 1               | 36  | 37  | 38  | 39  |
| 3 | 1               | 10  | 41  | 42  | 43  |
| 4 | 1               | 43  | 45  | 46  | 47  |
| 5 | 1               | 48  | 49  | 50  | 51  |
| 6 | 1               | 52  | 53  | 54  | 55  |
| 7 | 04              | 58  | 57  | 58  | 59  |
| 8 | 0               | 26  | 27  | 28  | 29  |
| 9 | 4               | 156 | 152 | 152 | 159 |

| N | T <sub>1L</sub> | 1G0 | 1G1 | 1G2 | 1G3 |
|---|-----------------|-----|-----|-----|-----|
| 1 | 1               | 2   | 63  | 65  | 66  |
| 2 | 1               | 2   | 22  | 23  | 24  |
| 3 | 1               | 2   | 26  | 27  | 28  |
| 4 | 1               | 2   | 80  | 81  | 82  |
| 5 | 1               | 2   | 84  | 85  | 86  |
| 6 | 1               | 2   | 88  | 89  | 90  |
| 7 | 1               | 10  | 60  | 61  | 62  |
| 8 | 1               | 22  | 23  | 24  | 25  |
| 9 | 31              |     |     |     |     |

| N | T <sub>1L</sub> | 1G0 | 1G1 | 1G2 | 1G3 |
|---|-----------------|-----|-----|-----|-----|
| 1 | 1               | 3   | 96  | 97  | 98  |
| 2 | 1               | 5   | 100 | 101 | 102 |
| 3 | 1               | 3   | 104 | 105 | 106 |
| 4 | 1               | 3   | 103 | 104 | 105 |
| 5 | 1               | 3   | 112 | 113 | 114 |
| 6 | 1               | 3   | 115 | 116 | 117 |
| 7 | 1               | 2   | 120 | 121 | 122 |
| 8 | 1               | 3   | 123 | 124 | 125 |
| 9 | 2               | 92  | 93  | 94  | 95  |

| N | T <sub>1L</sub> | 1G0 | 1G1 | 1G2 | 1G3 |
|---|-----------------|-----|-----|-----|-----|
| 1 | 4               | 122 | 124 | 126 | 127 |
| 2 | 4               | 132 | 133 | 134 | 135 |
| 3 | 4               | 136 | 137 | 138 | 139 |
| 4 | 4               | 140 | 141 | 142 | 143 |
| 5 | 4               | 146 | 145 | 146 | 147 |
| 6 | 4               | 148 | 144 | 150 | 151 |
| 7 | 4               | 152 | 153 | 154 | 155 |
| 8 | 3               | 124 | 125 | 126 | 127 |

B9

B2

B3

$$\begin{pmatrix} 1M \\ 1H \end{pmatrix} \begin{pmatrix} 1M \\ 3H \end{pmatrix} \times 32 \begin{pmatrix} 1M \\ 112H \end{pmatrix} \begin{pmatrix} 1M \\ 2H \end{pmatrix} \times 2$$

$$\frac{\text{36 misses}}{\text{213 hits}} \rightarrow \frac{\text{213}}{\text{249 accesses}}$$

$$\frac{101}{112}$$

$$\frac{112}{213}$$

$$\text{Hit Ratio} = \frac{213}{249} = 85.5\%$$

$$\text{Miss Rat} = 14.5\%$$

$$\text{AMAT} = 30\text{us} + 0.145 \times 30\text{us} \times$$



99.6 us

$$\text{Hit ratio} = \frac{213}{249} = 85,5\%$$

$$\text{Miss Ratio} = 14,5\%$$

$$\text{Latency} \quad \text{Miss penalty}^{(\text{date read})} = 480 \text{ ms} = \left[ \frac{480 \text{ ms}}{30 \text{ ms}} \right] = 16 \text{ cc}$$

$$AT = 30 \text{ ms} + 0,145 \times 30 \text{ ms} \times 16 \text{ cc}$$

$$= 99,6 \text{ ms}$$

Summary 2 problems

4.6

Miss Rate = 14,5%.

$$AMAT = 20 \text{ ms}$$

$$CPU_{time} = iC \times (CPI_{ideal} + \underbrace{\frac{\text{Mem acc}}{\text{instr}}}_{\text{Miss/instruction}} \times \text{Miss Rate} \times \text{Miss Penalty}) \quad \begin{matrix} \text{achimbr} \\ \text{clk} \\ \text{cycle} \\ \text{time} \end{matrix}$$

Example 1

Cache A - 1M

$$\text{instruction miss rate} \quad \text{data miss rate}$$

$$iMR = 2\%$$

$$DMR = 1,4\%$$

Cache B - 4 way SA       $iMR = 1,6\%$        $DMR = 1,4\%$ 

- load store

20% of all instructions load store

Clock cycle time B degraded by 10).

$$CPI_{ideal} = 3 \text{ C.C}$$

$$\text{Miss Penalty} = 20^0 \text{ ms}$$

$$\text{Clk cycle time A} = 20 \text{ ms}$$

$$\text{Clk cycle time} = 1,1 \text{ Clk cycle time A}$$

$$CPU_{time A} = iC \times (CPI_{ideal} + \underbrace{\text{Misses/instr}}_A \times \text{Miss penalty A}) \times \text{Clk cycle time A}$$

$$= iC \times 20 \text{ ms}$$

$$\text{Clk cycle time B} = 22 \text{ ms}$$

$$\text{Miss Penalty B} = \lceil \frac{200}{22} \rceil = 10 \text{ C.C}$$

$$CPU_{time A} = iC \times 20 \text{ ms} (3 + 10 \times 0,0234) = iC \times 64,68 \text{ ms}$$

$$\text{Misses/instr}_A = iMR_A \times 1 + 0,2 \times DMR = 0,02 + 0,2 \times 0,014 = 0,0234$$

$$CPU_{time B} = iC \times 22 \text{ ms} \times (3 + 0,0188 \times 10) = 40,136 \text{ ms} \cdot iC$$

$$\text{Misses/instr}_B = iMR_B + 0,2 \times DMR_B = 0,016 + 0,2 \times 0,014$$

## 4. # Bus usage as performance metric



### Example 2

$BUS_{bandwidth} = 10^9 \text{ words/second}$  (Viteza, presión)

Hit rate = 90%  $\Rightarrow$  Miss rate = 10%.

1 block = 4 words

BUS width = 2 words (take one transactional per block)

Process records references to the code  
30% of references are writes  $10^8 \text{ words/sec}$   
(words/words)

At any moment 35% of all code blocks are dirty

writes allocate policy for miss

~~at least~~ 2 occupied  
~~bus~~ in queue a) WB , b) WT  
miss rate

$$a) 10^8 \text{ words/sec} \times 10\% \times (93 \times \text{Write Miss Penalty} + 97 \times \text{Read Miss Penalty})$$

$$\begin{aligned} \text{Read miss penalty} &= 935 \times \frac{\text{dirty}}{2 \text{ words}} \times \frac{4 \text{ words}}{2 \text{ words}} + 1 \times \frac{\text{dirty}}{2 \text{ words}} \times \frac{4 \text{ words}}{2 \text{ words}} = 2,7 \text{ BUS reads} \\ &\text{Write miss penalty} \quad \underbrace{\text{UPDATE}}_{\text{BUS WRITE}} \quad \underbrace{\text{ALLOCATE}}_{\text{BUS READS}} \end{aligned}$$

WB - CPU serie bar in cache

a)

$$\text{BUS used}_{WB} = 10^8 \text{ words/sec} \times 0,27 = 2,7 \cdot 10^7 \text{ word/sec}$$

$$\% \text{ BUS Used}_{WB} = \frac{2,7 \cdot 10^7}{10^9} \% = 2,7 \%$$

WT - CPU serie bar in cache & in main memory

b) BUS used<sub>WT</sub> =  $10^8 \text{ words/sec} \times 0,1 \times (0,3 \times \text{Write Miss Penalty}_{WT} + 0,7 \times \text{Read Miss Penalty}_{WT}) + 10^8 \text{ word/sec} \times 0,9 \times 0,3 \times \text{Write Hit Penalty}_{WT}$

$$= 10^8 \text{ word/sec} \times 0,9 \times 0,3 \times \cancel{\text{write hit penalty}}$$

Write Miss Penalty<sub>WT</sub> =  $\frac{\overbrace{4 \text{ words}}^{\text{ALLOCATE}}}{\underbrace{2 \text{ words}}_{2 \text{ Bus reads}}} + 1 \text{ Bus write} = 3 \text{ Bus Accesses}$

Read Miss Penalty<sub>WT</sub> =  $\frac{\overbrace{4 \text{ words}}^{4 \text{ words}}}{\underbrace{2 \text{ words}}_{2 \text{ Bus reads}}} = 2 \text{ Bus Accesses}$

Write Hit Penalty = 1 Bus Write = 1 BUS access

$$\text{BUS used}_{WT} = 10^8 \text{ WDS} \times 0,1 / (0,3 \times 3 + 0,7 \times 2) + 10^8 \times 0,9 \times 0,3 \times 1 = \\ = 5 \cdot 10^7 \text{ words/sec}$$

$$\% \text{ BUS Used}_{WT} = \frac{5 \cdot 10^7}{10^9 \text{ WDS}} = 5 \%$$

## 4.8 Cache performance optimization - multiple levels



### Example 3

2 level cache system

$$CPI \text{ ideal} = 10 \text{ c/c}$$
$$CCT = \frac{1}{4 \cdot 10^9 \text{ Hz}} = 0,25 \text{ ns}$$

$$CLX \text{ rate} = 4 \text{ GHz}$$

MM access time = 100 ns (including all miss handling)

Miss Rate per instruction = 2% for level 1 (primary cache)

Miss Rate  $\times$  Mem acc per instr

- 2nd level cache  $T_{AC} = 5 \text{ ms}$  for each hit or miss

Miss Rate per instruction = 0.5%

$$CPU \text{ time} = IC \times (CPI_{\text{L1}} \times \text{Miss rate}_{\text{L1}} \times \text{Miss penalty}_{\text{L1}} + \text{Miss rate}_{\text{L2}} \times \text{Miss penalty}_{\text{L2}})$$

Miss Rate per instruction  $\times$  Miss Penalty MM  $\times$  clock cycle time

$$\text{Miss penalty}_{MM} = \left\lceil \frac{100 \text{ ms}}{0,25 \text{ ms}} \right\rceil = 400 \text{ cc}$$

$$\text{Miss penalty}_{L2} = \left\lceil \frac{5 \text{ ms}}{0,25} \right\rceil = 20 \text{ cc}$$

$$CPI_{\text{L1 level}} = 10 + 0,02 \times 400 = 18 \text{ cc}$$

$$CPI_{\text{L2 level}} = 10 + 0,02 \times 20 \text{ cc} + 0,005 * 400 \text{ cc} = 12,4 \text{ cc}$$

$$\frac{\text{Performance}_{\text{L2 level}}}{\text{Performance}_{\text{L1 level}}} = \frac{18}{12,4} = 1,45$$

## Capitolo 5. Memoria virtuale

- Secondary storage
- Virtual machines, processes
  - according to the locality principle
    - each with its own memory space
- else
  - Paged <sup>egale</sup>
  - Segmented
    - ingle (segmenti di pagine)



& dictated by the huge miss penalties

Millions of cc

MM is 100 000 time faster than the Secondary storage

- large pages ARM VP page size  $4\text{KiB} \rightarrow 64\text{KiB}$

Servers  $32\text{KiB} - 64\text{KiB}$   
Embedded  $\rightarrow 1\text{KiB}$

- full associative mapping
- write back



## 5. Virtual machines

### 5.1 introduction

proces active - depinde de CPU

program - în protejează spațiul de memorie

locality principle

Cache <sup>memor</sup> VM <sup>(virtual memory)</sup>

block

pages / (Segments)

miss

page # fault



ARM<sub>y</sub>

granule

MMU exception

example ARM V8

- address word = 64 bits (16 bits unused)

- byte addresses

$$\frac{64}{16}$$

$\rightarrow 4 \text{ KiB pages} = 2^{12} \text{ B}$

- ~~4 KiB pages~~  $= 2^{12}$   $\rightarrow$  page table entries  $\rightarrow 64$  bits



$$\text{Virtual memory size} = 2^{48} = 256 \text{ TiB}$$

$$\text{Physical memory size} = 2^{40} = 1 \text{ TiB}$$

## Design choices

PROBLEMA Huge miss penalty → millions of clock cycles  
(page fault noise)

Lösun

1) Big page sizes → a amortiza

↳ 4 KiB → 1 KiB

32 KiB, 64 KiB

embedded

Desktop, Services

2) Replacement policy in software

3) Full associativity implizit implementiere LRU für page tables

4) Write Back with Write Allocate

5.2) Virtual

Cap 5-2) Virtual memory mapping

valid, memory  
and forced

↳ secondary storage

Register for Page Tables

Page Table Re





## Page Table size

$$2^{36} \text{ page table entries} \times 2^3 B = 2^{39} B = 95 TiB$$

↑  
tribute na facei cera

### 5-2-1) Page table size reduction

① limit register



② Heap stack





③ page Hashing → immeid page table

- ④ Multiple-level tables }  
 ⑤ paged page table }

ARM V8



o tabla de pagini, intra pe o pagina

$$\text{Level 2 Page Table Size} = 2^9 \text{ entries} \times 2^3 B = 2^{12} B = 4 \text{ KiB}$$

### 5.2.2 Translation Lookaside Buffer (TLB)

Page table mechanism → Performance overhead  
 ⇒ Page Table Caching

$$\text{Access mem} = \frac{\text{Accesses}}{2}$$

level memory

47 ..... 12, 11 ... 0

Page table size reduction

③ Hashing → Thread page table

④ Multi-level tables } ARM V8

⑤ Page table

PD - Page Descriptor



$$\text{Level}_2 \text{ Page Table Size} = 2^9 \text{ entries} \times 2^3 B = 2^{12} B = 4 KB$$



$$i_2 = 2^9 \text{ entries} \times 2^3 B = 2^{12} B = 4 KB$$



### example 2

Ymfinsky's Fast Math

Page - 4 KiB

4 GiB Physical and Virtual Mem Size

TLB access time = 1 clk cycle

Miss Penalty TLB = 10<sub>5</sub> - 10<sub>6</sub> of clock cycles

TLB size  $\rightarrow$  16 - 256 entries

1 TLB entry  $\rightarrow$  1 - 2 pag table entries

TLB Mapping - FA



## 5.2.2 Translation lookaside Buffer (TLB)

Page table → Performance overhead

⇒ Page Table Caching



### Lookaside Buffer (TLB)

more overhead



Page - 4 KiB  
4 GiB Physical and  
TLB acc. time = 1 c  
Miss Penalty TLB = 1  
TLB Size → 16 - 25  
1 TLB entry → 1 - 2  
TLB Mapping - F.E