

# Computer Architecture

## Chapter 5: Memory Hierarchy



# Contents

- Memory hierarchy
- Cache organization
  - Direct mapped
  - Fully associative
  - $n$ -way associative
- Virtual memory (self-learning)



# Motivation example



# Principle of Locality

- Programs access a **small proportion** of their address space at any time
- **Temporal locality**
  - Items accessed recently are **likely** to be accessed again soon
  - e.g., instructions in a loop, induction variables
- **Spatial locality**
  - Items near those accessed recently are **likely** to be accessed soon
  - E.g., sequential instruction access, array data

# Taking Advantage of Locality

- Memory hierarchy
- Store everything on disk
- Copy recently accessed (and nearby) items from disk to smaller DRAM memory
  - Main memory
- Copy more recently accessed (and nearby) items from DRAM to smaller SRAM memory
  - Cache memory attached to CPU

# Memory Hierarchy Levels

- **Block** (aka **line**): unit of copying
  - May be multiple words
- If accessed data is present in upper level
  - **Hit**: access satisfied by upper level
    - Hit ratio: hits/accesses
- If accessed data is absent
  - **Miss**: block copied from lower level
    - Time taken: miss penalty
    - Miss ratio: misses/accesses  
 $= 1 - \text{hit ratio}$
  - Then accessed data supplied from upper level



# Memory Technology

- Static RAM (SRAM)
  - 0.5ns – 2.5ns, \$500 – \$1000 per GiB
- Dynamic RAM (DRAM)
  - 50ns – 70ns, \$3 – \$6 per GB
- Flash Memory
  - 5μs – 50μs, \$0.06 - \$0.12 per GiB
- Magnetic disk
  - 5ms – 20ms, \$0.01 – \$0.02 per GiB
- Ideal memory
  - Access time of SRAM
  - Capacity and cost/GB of disk

# Cache Memory

- Cache memory
  - The level of the memory hierarchy **closest** to the CPU
- Given accesses  $X_1, X_2, \dots, X_{n-1}, X_n$



a. Before the reference to  $X_n$



b. After the reference to  $X_n$

How do we know if the data is present?  
Where do we look?

# Direct Mapped Cache

- Location determined by address
- Direct mapped: **only one choice**
  - (Block address) modulo (#Blocks in cache)



#Blocks is a power of 2  
Use low-order address bits

# Tags and Valid Bits

- How do we know which **particular block** is stored in a cache location?
  - Store block address as well as the data
  - Actually, only need the high-order bits
  - Called the **tag**
- What if there is no data in a location?
  - Valid bit: 1 = present, 0 = not present
  - Initially 0

# Cache Example

- 8-blocks, 1 word/block, direct mapped
- Initial state

| Index | V | Tag | Data |
|-------|---|-----|------|
| 000   | N |     |      |
| 001   | N |     |      |
| 010   | N |     |      |
| 011   | N |     |      |
| 100   | N |     |      |
| 101   | N |     |      |
| 110   | N |     |      |
| 111   | N |     |      |



# Cache Example

| Word addr | Binary addr | Hit/miss | Cache block |
|-----------|-------------|----------|-------------|
| 22        | 10 110      | Miss     | 110         |

| Index | V | Tag | Data       |
|-------|---|-----|------------|
| 000   | N |     |            |
| 001   | N |     |            |
| 010   | N |     |            |
| 011   | N |     |            |
| 100   | N |     |            |
| 101   | N |     |            |
| 110   | Y | 10  | Mem[10110] |
| 111   | N |     |            |

# Cache Example

| Word addr | Binary addr | Hit/miss | Cache block |
|-----------|-------------|----------|-------------|
| 26        | 11 010      | Miss     | 010         |

| Index | V | Tag | Data       |
|-------|---|-----|------------|
| 000   | N |     |            |
| 001   | N |     |            |
| 010   | Y | 11  | Mem[11010] |
| 011   | N |     |            |
| 100   | N |     |            |
| 101   | N |     |            |
| 110   | Y | 10  | Mem[10110] |
| 111   | N |     |            |

# Cache Example

| Word addr | Binary addr | Hit/miss | Cache block |
|-----------|-------------|----------|-------------|
| 22        | 10 110      | Hit      | 110         |
| 26        | 11 010      | Hit      | 010         |

| Index | V | Tag | Data       |
|-------|---|-----|------------|
| 000   | N |     |            |
| 001   | N |     |            |
| 010   | Y | 11  | Mem[11010] |
| 011   | N |     |            |
| 100   | N |     |            |
| 101   | N |     |            |
| 110   | Y | 10  | Mem[10110] |
| 111   | N |     |            |

# Cache Example

| Word addr | Binary addr | Hit/miss | Cache block |
|-----------|-------------|----------|-------------|
| 16        | 10 000      | Miss     | 000         |
| 3         | 00 011      | Miss     | 011         |
| 16        | 10 000      | Hit      | 000         |

| Index | V | Tag | Data       |
|-------|---|-----|------------|
| 000   | Y | 10  | Mem[10000] |
| 001   | N |     |            |
| 010   | Y | 11  | Mem[11010] |
| 011   | Y | 00  | Mem[00011] |
| 100   | N |     |            |
| 101   | N |     |            |
| 110   | Y | 10  | Mem[10110] |
| 111   | N |     |            |

# Cache Example

| Word addr | Binary addr | Hit/miss | Cache block |
|-----------|-------------|----------|-------------|
| 18        | 10 010      | Miss     | 010         |

| Index | V | Tag | Data       |
|-------|---|-----|------------|
| 000   | Y | 10  | Mem[10000] |
| 001   | N |     |            |
| 010   | Y | 10  | Mem[10010] |
| 011   | Y | 00  | Mem[00011] |
| 100   | N |     |            |
| 101   | N |     |            |
| 110   | Y | 10  | Mem[10110] |
| 111   | N |     |            |

# Address Subdivision



# Example: Larger Block Size

- 64 blocks, 16 bytes/block
  - To what bytenuumber does address 1200 map?
- Block address =  $\lfloor \frac{1200}{16} \rfloor = 75$
- Block number = 75 modulo 64 = 11



# Block Size Considerations

- Larger blocks should reduce miss rate
  - Due to spatial locality
- But in a fixed-sized cache
  - Larger blocks  $\Rightarrow$  fewer of them
    - More competition  $\Rightarrow$  increased miss rate
  - Larger blocks  $\Rightarrow$  pollution
- Larger miss penalty
  - Can override benefit of reduced miss rate
  - Early restart and critical-word-first can help

# Cache Misses

- On cache hit, CPU proceeds normally
- On cache miss
  - Stall the CPU pipeline
  - Fetch block from next level of hierarchy
  - Instruction cache miss
    - Restart instruction fetch
  - Data cache miss
    - Complete data access



# Write-Through

- On data-write hit, could just update the block in cache
  - But then cache and memory would be inconsistent
- Write through: also update memory
- But makes writes take longer
  - e.g., if base CPI = 1, 10% of instructions are stores, write to memory takes 100 cycles
    - Effective CPI =  $1 + 0.1 \times 100 = 11$
- Solution: write buffer
  - Holds data waiting to be written to memory
  - CPU continues immediately
    - Only stalls on write if write buffer is already full

# Write-Back

- Alternative: On data-write hit, just update the block in cache
  - Keep track of whether each block is dirty
- When a dirty block is replaced
  - Write it back to memory
  - Can use a write buffer to allow replacing block to be read first



# Write Allocation

- What should happen on a write miss?
- Alternatives for write-through
  - Allocate on miss: fetch the block
  - Write around: don't fetch the block
    - Since programs often write a whole block before reading it (e.g., initialization)
- For write-back
  - Usually fetch the block



# Example: Intrinsity FastMATH

- Embedded MIPS processor
  - 12-stage pipeline
  - Instruction and data access on each cycle
- Split cache: separate I-cache and D-cache
  - Each 16KB: 256 blocks × 16 words/block
  - D-cache: write-through or write-back
- SPEC2000 miss rates
  - I-cache: 0.4%
  - D-cache: 11.4%
  - Weighted average: 3.2%

# Example: Intrinsity FastMATH



# Main Memory Supporting Caches

- Use DRAMs for main memory
  - Fixed width (e.g., 1 word)
  - Connected by fixed-width clocked bus
    - Bus clock is typically slower than CPU clock
- Example cache block read
  - 1 bus cycle for address transfer
  - 15 bus cycles per DRAM access
  - 1 bus cycle per data transfer
- For 4-word block, 1-word-wide DRAM
  - Miss penalty =  $1 + 4 \times 15 + 4 \times 1 = 65$  bus cycles
  - Bandwidth = 16 bytes / 65 cycles = 0.25 B/cycle

# Increasing Memory Bandwidth



a. One-word-wide  
memory organization



b. Wider memory organization



c. Interleaved memory organization

## 4-word wide memory

$$\text{Miss penalty} = 1 + 15 + 1 = 17 \text{ bus cycles}$$

$$\text{Bandwidth} = 16 \text{ bytes} / 17 \text{ cycles} = 0.94 \text{ B/cycle}$$

## 4-bank interleaved memory

$$\text{Miss penalty} = 1 + 15 + 4 \times 1 = 20 \text{ bus cycles}$$

$$\text{Bandwidth} = 16 \text{ bytes} / 20 \text{ cycles} = 0.8 \text{ B/cycle}$$

# Measuring Cache Performance

- Components of CPU time
  - Program execution cycles
    - Includes cache hit time
  - Memory stall cycles
    - Mainly from cache misses
- With simplifying assumptions:

$$\begin{aligned}\text{Memory stall cycles} &= \frac{\text{Memory accesses}}{\text{Program}} \times \text{Miss rate} \times \text{Miss penalty} \\ &= \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Misses}}{\text{Instruction}} \times \text{Miss penalty}\end{aligned}$$



# Cache Performance Example

- Given
  - I-cache miss rate = 2%
  - D-cache miss rate = 4%
  - Miss penalty = 100 cycles
  - Base CPI (ideal cache) = 2
  - Load & stores are 36% of instructions
- Miss cycles per instruction
  - I-cache:  $0.02 \times 100 = 2$
  - D-cache:  $0.36 \times 0.04 \times 100 = 1.44$
- Actual CPI =  $2 + 2 + 1.44 = 5.44$ 
  - Ideal CPU is  $5.44/2 = 2.72$  times faster

# Average Access Time

- Hit time is also important for performance
- Average memory access time (AMAT)
  - $\text{AMAT} = \text{Hit time} + \text{Miss rate} \times \text{Miss penalty}$
- Example
  - CPU with 1ns clock, hit time = 1 cycle, miss penalty = 20 cycles, I-cache miss rate = 5%
  - $\text{AMAT} = 1 + 0.05 \times 20 = 2\text{ns}$ 
    - 2 cycles per instruction

# Performance Summary

- When CPU performance increased
  - Miss penalty becomes more significant
- Decreasing base CPI
  - Greater proportion of time spent on memory stalls
- Increasing clock rate
  - Memory stalls account for more CPU cycles
- Can't neglect cache behavior when evaluating system performance

# Associative Caches

- **Fully associative**
  - Allow a given block to go in any cache entry
  - Requires all entries to be searched at once
  - Comparator per entry (expensive)
- **$n$ -way set associative**
  - Each set contains  $n$  entries
  - Block number determines which set
    - (Block address) modulo (#Sets in cache)
  - Search all entries in a given set at once
  - $n$  comparators (less expensive)

# Associative Cache Example



# Spectrum of Associativity

- For a cache with 8 entries

**One-way set associative  
(direct mapped)**

| Block | Tag | Data |
|-------|-----|------|
| 0     |     |      |
| 1     |     |      |
| 2     |     |      |
| 3     |     |      |
| 4     |     |      |
| 5     |     |      |
| 6     |     |      |
| 7     |     |      |

**Two-way set associative**

| Set | Tag | Data | Tag | Data |
|-----|-----|------|-----|------|
| 0   |     |      |     |      |
| 1   |     |      |     |      |
| 2   |     |      |     |      |
| 3   |     |      |     |      |

**Four-way set associative**

| Set | Tag | Data | Tag | Data | Tag | Data | Tag | Data |
|-----|-----|------|-----|------|-----|------|-----|------|
| 0   |     |      |     |      |     |      |     |      |
| 1   |     |      |     |      |     |      |     |      |

**Eight-way set associative (fully associative)**

| Tag | Data |
|-----|------|-----|------|-----|------|-----|------|-----|------|-----|------|-----|------|-----|------|
|     |      |     |      |     |      |     |      |     |      |     |      |     |      |     |      |

# Associativity Example

- Compare 4-block caches
  - Direct mapped, 2-way set associative, fully associative
  - Block access sequence: 0, 8, 0, 6, 8
- Direct mapped

| Block address | Cache index | Hit/miss | Cache content after access |   |        |   |
|---------------|-------------|----------|----------------------------|---|--------|---|
|               |             |          | 0                          | 1 | 2      | 3 |
| 0             | 0           | miss     | Mem[0]                     |   |        |   |
| 8             | 0           | miss     | Mem[8]                     |   |        |   |
| 0             | 0           | miss     | Mem[0]                     |   |        |   |
| 6             | 2           | miss     | Mem[0]                     |   | Mem[6] |   |
| 8             | 0           | miss     | Mem[8]                     |   | Mem[6] |   |

# Associativity Example

- 2-way set associative

| Block address | Cache index | Hit/miss | Cache content after access |        |
|---------------|-------------|----------|----------------------------|--------|
|               |             |          | Set 0                      | Set 1  |
| 0             | 0           | miss     | Mem[0]                     |        |
| 8             | 0           | miss     | Mem[0]                     | Mem[8] |
| 0             | 0           | hit      | Mem[0]                     | Mem[8] |
| 6             | 0           | miss     | Mem[0]                     | Mem[6] |
| 8             | 0           | miss     | Mem[8]                     | Mem[6] |

- Fully associative

| Block address | Hit/miss | Cache content after access |        |        |
|---------------|----------|----------------------------|--------|--------|
| 0             | miss     | Mem[0]                     |        |        |
| 8             | miss     | Mem[0]                     | Mem[8] |        |
| 0             | hit      | Mem[0]                     | Mem[8] |        |
| 6             | miss     | Mem[0]                     | Mem[8] | Mem[6] |
| 8             | hit      | Mem[0]                     | Mem[8] | Mem[6] |

# How Much Associativity

- Increased associativity decreases miss rate
  - But with diminishing returns
- Simulation of a system with 64KB D-cache, 16-word blocks, SPEC2000
  - 1-way: 10.3%
  - 2-way: 8.6%
  - 4-way: 8.3%
  - 8-way: 8.1%

# Set Associative Cache Organization



# Replacement Policy

- **Direct mapped:** no choice
- **Set associative**
  - Prefer non-valid entry, if there is one
  - Otherwise, choose among entries in the set
- **Least-recently used (LRU)**
  - Choose the one unused for the longest time
    - Simple for 2-way, manageable for 4-way, too hard beyond that
- **Random**
  - Gives approximately the same performance as LRU for high associativity



# Least Recently Used Algorithm

- Need to keep track of what was used when
- Keep “age bits” for each cache line
- Update all the “age bits” of all cache lines when a cache line is used
- → **Pseudo-LRU**: use one bit per cache line/block

# Multilevel Caches

- Primary cache attached to CPU
  - Small, but fast
- Level-2 cache services misses from primary cache
  - Larger, slower, but still faster than main memory
- Main memory services L-2 cache misses
- Some high-end systems include L-3 cache



# Multilevel Cache Example

- Given
  - CPU base CPI = 1, clock rate = 4GHz
  - Miss rate/instruction = 2%
  - Main memory access time = 100ns
- With just primary cache
  - Miss penalty =  $100\text{ns}/0.25\text{ns} = 400$  cycles
  - Effective CPI =  $1 + 0.02 \times 400 = 9$

# Example (cont.)

- Now add L-2 cache
  - Access time = 5ns
  - Global miss rate to main memory = 0.5%
- Primary miss with L-2 hit
  - Penalty =  $5\text{ns}/0.25\text{ns} = 20 \text{ cycles}$
- Primary miss with L-2 miss
  - Extra penalty = 400 cycles
- CPI =  $1 + 0.02 \times 20 + 0.005 \times 400 = 3.4$
- Performance ratio =  $9/3.4 = 2.6$



# Multilevel Cache Considerations

- Primary cache
  - Focus on minimal hit time
- L-2 cache
  - Focus on low miss rate to avoid main memory access
  - Hit time has less overall impact
- Results
  - L-1 cache usually smaller than a single cache
  - L-1 block size smaller than L-2 block size



# Interactions with Advanced CPUs

- Out-of-order CPUs can execute instructions during cache miss
  - Pending store stays in load/store unit
  - Dependent instructions wait in reservation stations
    - Independent instructions continue
- Effect of miss depends on program data flow
  - Much harder to analyse
  - Use system simulation



# Interactions with Software

- Misses depend on memory access patterns
  - Algorithm behavior
  - Compiler optimization for memory access



# Software Optimization via Blocking

- Goal: maximize accesses to data before it is replaced
- Consider inner loops of DGEMM:
  1. `for (int i = 0; i < n; ++i)`
  2. `for (int j = 0; j < n; ++j){`
  3. `double cij = C[i+j*n];`
  4. `for( int k = 0; k < n; k++ )`
  5. `cij += A[i+k*n] * B[k+j*n];`
  6. `C[i+j*n] = cij;`
  7. `}`



# DGEMM Access Pattern

- C, A, and B arrays



# Cache Accessed in DGEMM

- Cache miss depends on N
- For example:
  - N = 32, each element 8 bytes
  - Three matrix 24KB (32x32x8x3)
  - Core i7 Sandy Bridge: 32KB
- What would happen when N is increased?
- What would happen if matrices is divided into N-by-N Blocks



# Cache Blocked DGEMM

```
1. #define BLOCKSIZE 32
2. void do_block (int n, int si, int sj, int sk, double *A, double *B, double *C){
3.     for (int i = si; i < si+BLOCKSIZE; ++i)
4.         for (int j = sj; j < sj+BLOCKSIZE; ++j){
5.             double cij = C[i+j*n];/* cij = C[i][j] */
6.             for( int k = sk; k < sk+BLOCKSIZE; k++ )
7.                 cij += A[i+k*n] * B[k+j*n];/* cij+=A[i][k]*B[k][j] */
8.             C[i+j*n] = cij; /* C[i][j] = cij */
9.         }
10.    }

12.void dgemm (int n, double* A, double* B, double* C){
13.    for ( int sj = 0; sj < n; sj += BLOCKSIZE )
14.        for ( int si = 0; si < n; si += BLOCKSIZE )
15.            for ( int sk = 0; sk < n; sk += BLOCKSIZE )
16.                do_block(n, si, sj, sk, A, B, C);
17.}
```



# Blocked DGEMM Access Pattern



# Virtual Memory

- Use main memory as a “cache” for secondary (disk) storage
  - Managed jointly by CPU hardware and the operating system (OS)
- Programs share main memory
  - Each gets a private virtual address space holding its frequently used code and data
  - Protected from other programs
- CPU and OS translate virtual addresses to physical addresses
  - VM “block” is called a page
  - VM translation “miss” is called a page fault

# Address Translation

- Fixed-size pages (e.g., 4K)



# Page Fault Penalty

- On page fault, the page must be fetched from disk
  - Takes millions of clock cycles
  - Handled by OS code
- Try to minimize page fault rate
  - Fully associative placement
  - Smart replacement algorithms



# Page Tables

- Stores placement information
  - Array of page table entries, indexed by virtual page number
  - Page table register in CPU points to page table in physical memory
- If page is present in memory
  - PTE stores the physical page number
  - Plus other status bits (referenced, dirty, ...)
- If page is not present
  - PTE can refer to location in swap space on disk

# Translation Using a Page Table



# Mapping Pages to Storage



# Replacement and Writes

- To reduce page fault rate, prefer least-recently used (LRU) replacement
  - Reference bit (aka use bit) in PTE set to 1 on access to page
  - Periodically cleared to 0 by OS
  - A page with reference bit = 0 has not been used recently
- Disk writes take millions of cycles
  - Block at once, not individual locations
  - Write through is impractical
  - Use write-back
  - Dirty bit in PTE set when page is written

# Fast Translation Using a TLB

- Address translation would appear to require extra memory references
  - One to access the PTE
  - Then the actual memory access
- But access to page tables has good locality
  - So use a fast cache of PTEs within the CPU
  - Called a Translation Look-aside Buffer (TLB)
  - Typical: 16–512 PTEs, 0.5–1 cycle for hit, 10–100 cycles for miss, 0.01%–1% miss rate
  - Misses could be handled by hardware or software

# Fast Translation Using a TLB



# TLB and Cache Interaction

- If cache tag uses physical address
  - Need to translate before cache lookup
- Alternative: use virtual address tag
  - Complications due to aliasing
    - Different virtual addresses for shared physical address



# TLB Misses

- If page is in memory
  - Load the PTE from memory and retry
  - Could be handled in hardware
    - Can get complex for more complicated page table structures
  - Or in software
    - Raise a special exception, with optimized handler
- If page is not in memory (page fault)
  - OS handles fetching the page and updating the page table
  - Then restart the faulting instruction



# TLB Miss Handler

- TLB miss indicates
  - Page present, but PTE not in TLB
  - Page not preset
- Must recognize TLB miss before destination register overwritten
  - Raise exception
- Handler copies PTE from memory to TLB
  - Then restarts instruction
  - If page not present, page fault will occur



# Page Fault Handler

- Use faulting virtual address to find PTE
- Locate page on disk
- Choose page to replace
  - If dirty, write to disk first
- Read page into memory and update page table
- Make process runnable again
  - Restart from faulting instruction



# Memory Protection

- Different tasks can share parts of their virtual address spaces
  - But need to protect against errant access
  - Requires OS assistance
- Hardware support for OS protection
  - Privileged supervisor mode (aka kernel mode)
  - Privileged instructions
  - Page tables and other state information only accessible in supervisor mode
  - System call exception (e.g., syscall in MIPS)



# The Memory Hierarchy

The BIG Picture

- Common principles apply at all levels of the memory hierarchy
  - Based on notions of caching
- At each level in the hierarchy
  - Block placement
  - Finding a block
  - Replacement on a miss
  - Write policy



# Exercise

- Given the TLB (fully associative) and the Page table (4KB pages) with LRU replacement
- If pages must be brought from disk, increment the next largest page number
- Show the final state of the TLB and Page table if virtual address requests are as follow:
  - 4669, 2227, 13916, 34587, 48870, 12608, 49225
  - 12948, 49419, 46814, 13975, 40004, 12707, 52236
- The same question but 16KB pages instead of 4KB

| V | Physical or Disk |
|---|------------------|
| 1 | 5                |
| 0 | Disk             |
| 0 | Disk             |
| 1 | 6                |
| 1 | 9                |
| 1 | 11               |
| 0 | Disk             |
| 1 | 4                |
| 0 | Disk             |
| 0 | Disk             |
| 1 | 3                |
| 1 | 12               |

| V | Tag | Physical |
|---|-----|----------|
| 1 | 11  | 12       |
| 1 | 7   | 4        |
| 1 | 3   | 6        |
| 0 | 4   | 9        |

Hint: Analyse the virtual address to extract virtual page number

# Solution

- Virtual address (decimal): 4669, 2227, 13916, 34587, 48870, 12608, 49225
- Binary address: 4KB pages => 12-bit page offset
  - 4669 = 1\_0010\_0011\_1101, VPN = 1
  - 2227 = 0\_1000\_1011\_0011, VPN = 0
  - 13916 = 11\_0110\_0101\_1100, VPN = 3
  - 34587 = 1000\_0111\_0001\_1011, VPN = 8
  - 48870 = 1011\_1110\_1110\_0110, VPN = 11
  - 12608 = 11\_0001\_0100\_0000, VPN = 3
  - 49225 = 1100\_0000\_0100\_1001, VPN = 12



| Address | Virtual Page | TLB H/M                  | TLB               |     |               |
|---------|--------------|--------------------------|-------------------|-----|---------------|
|         |              |                          | Valid             | Tag | Physical Page |
| 4669    | 1            | TLB miss<br>PT hit<br>PF | 1                 | 11  | 12            |
|         |              |                          | 1                 | 7   | 4             |
|         |              |                          | 1                 | 3   | 6             |
|         |              |                          | 1 (last access 0) | 1   | 13            |
| 2227    | 0            | TLB miss<br>PT hit       | 1 (last access 1) | 0   | 5             |
|         |              |                          | 1                 | 7   | 4             |
|         |              |                          | 1                 | 3   | 6             |
|         |              |                          | 1 (last access 0) | 1   | 13            |
| 13916   | 3            | TLB hit                  | 1 (last access 1) | 0   | 5             |
|         |              |                          | 1                 | 7   | 4             |
|         |              |                          | 1 (last access 2) | 3   | 6             |
|         |              |                          | 1 (last access 0) | 1   | 13            |
| 34587   | 8            | TLB miss<br>PT hit<br>PF | 1 (last access 1) | 0   | 5             |
|         |              |                          | 1 (last access 3) | 8   | 14            |
|         |              |                          | 1 (last access 2) | 3   | 6             |
|         |              |                          | 1 (last access 0) | 1   | 13            |
| 48870   | 11           | TLB miss<br>PT hit       | 1 (last access 1) | 0   | 5             |
|         |              |                          | 1 (last access 3) | 8   | 14            |
|         |              |                          | 1 (last access 2) | 3   | 6             |
|         |              |                          | 1 (last access 4) | 11  | 12            |
| 12608   | 3            | TLB hit                  | 1 (last access 1) | 0   | 5             |
|         |              |                          | 1 (last access 3) | 8   | 14            |
|         |              |                          | 1 (last access 5) | 3   | 6             |
|         |              |                          | 1 (last access 4) | 11  | 12            |
| 49225   | 12           | TLB miss<br>PT miss      | 1 (last access 6) | 12  | 15            |
|         |              |                          | 1 (last access 3) | 8   | 14            |
|         |              |                          | 1 (last access 5) | 3   | 6             |
|         |              |                          | 1 (last access 4) | 11  | 12            |

| Address | Virtual Page | TLB H/M                  | TLB               |     |               |
|---------|--------------|--------------------------|-------------------|-----|---------------|
|         |              |                          | Valid             | Tag | Physical Page |
| 4669    | 0            | TLB miss<br>PT hit       | 1                 | 11  | 12            |
|         |              |                          | 1                 | 7   | 4             |
|         |              |                          | 1                 | 3   | 6             |
|         |              |                          | 1 (last access 0) | 0   | 5             |
| 2227    | 0            | TLB hit                  | 1                 | 11  | 12            |
|         |              |                          | 1                 | 7   | 4             |
|         |              |                          | 1                 | 3   | 6             |
|         |              |                          | 1 (last access 1) | 0   | 5             |
| 13916   | 0            | TLB hit                  | 1                 | 11  | 12            |
|         |              |                          | 1                 | 7   | 4             |
|         |              |                          | 1                 | 3   | 6             |
|         |              |                          | 1 (last access 2) | 0   | 5             |
| 34587   | 2            | TLB miss<br>PT hit<br>PF | 1 (last access 3) | 2   | 13            |
|         |              |                          | 1                 | 7   | 4             |
|         |              |                          | 1                 | 3   | 6             |
|         |              |                          | 1 (last access 2) | 0   | 5             |
| 48870   | 2            | TLB hit                  | 1 (last access 4) | 2   | 13            |
|         |              |                          | 1                 | 7   | 4             |
|         |              |                          | 1                 | 3   | 6             |
|         |              |                          | 1 (last access 2) | 0   | 5             |
| 12608   | 0            | TLB hit                  | 1 (last access 4) | 2   | 13            |
|         |              |                          | 1                 | 7   | 4             |
|         |              |                          | 1                 | 3   | 6             |
|         |              |                          | 1 (last access 5) | 0   | 5             |
| 49225   | 3            | TLB hit                  | 1 (last access 4) | 2   | 13            |
|         |              |                          | 1                 | 7   | 4             |
|         |              |                          | 1 (last access 6) | 3   | 6             |
|         |              |                          | 1 (last access 5) | 0   | 5             |

# Block Placement

- Determined by associativity
  - Direct mapped (1-way associative)
    - One choice for placement
  - n-way set associative
    - n choices within a set
  - Fully associative
    - Any location
- Higher associativity reduces miss rate
  - Increases complexity, cost, and access time



# Finding a Block

- Hardware caches
  - Reduce comparisons to reduce cost
- Virtual memory
  - Full table lookup makes full associativity feasible
  - Benefit in reduced miss rate

| Associativity         | Location method                               | Tag comparisons |
|-----------------------|-----------------------------------------------|-----------------|
| Direct mapped         | Index                                         | 1               |
| n-way set associative | Set index, then search entries within the set | n               |
| Fully associative     | Search all entries                            | #entries        |
|                       | Full lookup table                             | 0               |



# Replacement

- Choice of entry to replace on a miss
  - Least recently used (LRU)
    - Complex and costly hardware for high associativity
  - Pseudo-LRU
    - Close to LRU, less costly hardware
  - Random
    - Close to LRU, easier to implement
- Virtual memory
  - LRU approximation with hardware support

# Write Policy

- Write-through
  - Update both upper and lower levels
  - Simplifies replacement, but may require write buffer
- Write-back
  - Update upper level only
  - Update lower level when block is replaced
  - Need to keep more state
- Virtual memory
  - Only write-back is feasible, given disk write latency



# Sources of Misses

- Compulsory misses (aka cold start misses)
  - First access to a block
- Capacity misses
  - Due to finite cache size
  - A replaced block is later accessed again
- Conflict misses (aka collision misses)
  - In a non-fully associative cache
  - Due to competition for entries in a set
  - Would not occur in a fully associative cache of the same total size



# Concluding Remarks

- Fast memories are small, large memories are slow
  - We really want fast, large memories ☹
  - Caching gives this illusion ☺
- Principle of locality
  - Programs use a small part of their memory space frequently
- Memory hierarchy
  - L1 cache  $\leftrightarrow$  L2 cache  $\leftrightarrow \dots \leftrightarrow$  DRAM memory  
 $\leftrightarrow$  disk
- Memory system design is critical for multiprocessors

