

# Memory Systems

Debiprasanna Sahoo

Assistant Professor

Department of Computer Science and Engineering  
School of Electrical and Computer Sciences  
Indian Institute of Technology Bhubaneswar



# Content

## Book

Computer Organization and Design:  
The Hardware/Software Interface-  
RISC-V Edition, 5th Edition, 2017

Chapter-5

David A. Patterson and John L.  
Henessey

## Reference Books

Computer Organization and Design:  
The Hardware/Software Interface-  
MIPS Edition, 5th Edition, 2017

Chapter-5

Computer Architecture: A  
Quantitative Approach

Appendix-B

David A. Patterson and John L.  
Henessey

\*Image from the book and manual unless specified

# Von-Neumann Architecture (Recap)

Components of a Computer:

- Processor which does the computation with the help of Arithmetic and Logic Unit (ALU) and Control Unit (CU).
- Memory that stores data on which computation can happen.
- Input and Output Devices that supplies or uses data item.



Geeks for Geeks

# Different Types of Memory Technologies

| Property                   | SRAMs       | DRAMs                            | Flash                                 | Disk              |
|----------------------------|-------------|----------------------------------|---------------------------------------|-------------------|
| Access Time                | Fast        | Medium                           | Slower than DRAM,<br>faster than Disk | Slow              |
| Power                      | High        | Medium                           | Medium                                | High              |
| Non-Volatile               | No          | No                               | Yes                                   | Yes               |
| Cost                       | Really high | Medium                           | Lower than DRAM costly<br>than disk   | Low               |
| Density                    | Low         | High                             | High                                  | High              |
| Typical Size<br>(Capacity) | 2-40 MB     | 128MB-256/512GB                  | 128 GB to > 1 TB                      | 512 GB to > 1 TB  |
| Usage                      | Cache       | Main Memory (Now Caches<br>also) | Secondary Storage                     | Secondary Storage |

# Disk Memory

**Set of Platters form a disk**

**Track:** One of thousands of concentric circles that make up the surface of a magnetic disk.

**Sector:** One of the segments that make up a track on a magnetic disk; a sector is the smallest amount of information that is read or written on a disk.

**Cylinder:** All the tracks under the heads at a given point on all surfaces.



[https://www.cs.uic.edu/~jbell/CourseNotes/OperatingSystems/10\\_MassStorage.html](https://www.cs.uic.edu/~jbell/CourseNotes/OperatingSystems/10_MassStorage.html)

# Latency of Disk Drives

**Seek:** The process of positioning a read/write head over the proper track on a disk.

Typical seek time = 3-4 ms

**Rotational Delay/Latency:** The time required for the desired sector of a disk to rotate under the read/write head; usually assumed to be half the rotation time.

Disk Rotation Speed =  $v$  RPM

Rotational Latency =  $1/2v$  RPM =  $30/v$  seconds

For 7200 RPM,

Rotational Latency =  $30/7200$  sec

$$= 1/240 \text{ sec} = 4.17 \text{ ms}$$

**Transfer Latency:** Time to transfer the data from device to the cores (actually to main memory).

Typical transfer rates = Max 90 MBps

Disks are connected to a hub called South Bridge controlled by interfaces like ATA, SATA, USB, SCSI and is advocated by DMA controller.

# Memory Hierarchy

**Memory Hierarchy:** A structure that uses multiple levels of memories; as the distance from the processor increases, the size of the memories and the access time both increase.



# Steps of Memory Access

- CPU generates load/store operations for addresses to read/write data from the memory operation stage on a **data bus**.
- CPU generates an address for instruction in the fetch stage on a data bus.
- The request are transferred over on the **control bus** and the corresponding address over the **address bus**.
- For loads and instruction access, the data bus is empty while sending the request. However, once the location is read, the bus gets the valid data.
- The main memory performs a read operation corresponding to a load and a write operation corresponding to a store.



# Scaling of memory technology



- DRAM technology used to design main memory is not scaling when compared to the cores
- Gap in speed between the core and the main memory is continuously growing
- **Solution:** Search for new technology to design main memory?

# Cache Memory (\$ Memory)

- ❑ **Cache memory** is a small and fast memory used to store more recently accessed data; it exploits **data locality**.
- ❑ It is typically designed using **SRAMs** and it is used between the CPU and the main memory.
- ❑ Addresses in a cache is a subset of the addresses in the main memory. Hence, some requested addresses can be found in the cache and some may not.



# Principle of Locality

□ **Spatial Locality:** The locality principle stating that if a data location is referenced, data locations with nearby addresses will tend to be referenced soon.

□ **Temporal Locality:** The locality principle stating that if a data location is referenced then it will tend to be referenced again soon.

```
for(i = 0; i < 100; i++) {  
    C[i] = A[i] + B[i];  
}
```

What about Matrix Multiplication?

The diagram shows two 2x2 matrices, A and B, being multiplied by a 2x2 matrix G. The result is a 2x2 matrix C. The elements of matrix A are labeled A, B, C, D and matrix B is labeled E, F. The result matrix C has elements A x G + B x H, C x G + D x H, E x G + F x H. Blue arrows show the mapping from A to G and B to H. Orange arrows show the mapping from E to G and F to H.

$$\begin{bmatrix} A & B \\ C & D \end{bmatrix} \times \begin{bmatrix} G & H \\ G & H \end{bmatrix} = \begin{bmatrix} A \times G + B \times H & C \times G + D \times H \\ E \times G + F \times H & E \times G + F \times H \end{bmatrix}$$

What about Pointer Chasing?

What about Graphs, Graphics, and many more?

# Some Terminologies

- ❑ **Word:** Unit of transfer of data between the processor and memory hierarchy. Typically, 4B
- ❑ **Cache Block/Line:** The minimum unit of information that can be either present or not present in a cache. Typically, 64 B.
- ❑ **Cache Hit:** When request is serviced by the cache because the line is already present
- ❑ **Cache Miss:** Request is serviced by main memory because the line is absent in the cache
- ❑ **Hit Rate:** Fraction of memory requests serviced by the cache = No. of cache hits / Total No. of Accesses
- ❑ **Miss Rate:** 1 - Hit Rate
- ❑ **Hit Time:** Time required (typically in cycles) to access the cache including the time needed to determine whether the access is a hit or a miss.
- ❑ **Miss Penalty:** The time required to fetch a block into a level of the memory hierarchy from the lower level, transmit it from one level to the other, insert it in the level that experienced the miss, and then pass the block to the requestor.
- ❑ **Parallelism:** Number of requests being processed in parallel/pipelined inside the cache



# Performance of a Memory System

**Latency:** The amount of time in cycles the memory hierarchy takes to service a single request. Latency = End Time - Start Time = ET - ST

Average memory access time (AMAT) = Hit Time + Miss Rate \* Miss Penalty

**Bandwidth:** How much of data per cycle is delivered to the CPU

**Bandwidth** = Number of Request Serviced \* Cache Line Size / Time to service all the requests =  $N * S / ET_f - ST_i$

**Goal: Reduce Latency and Increase Bandwidth**

# Four Memory Hierarchy Questions

**Block Identification:** How to figure out that block is present in the cache?

**Block Placement:** Where can a block be placed in the cache?

**Block Replacement:** Which block to replace/evict/victimize on a miss?

**Write Strategy:** What happens on a write?

# Block Identification

- ❑ A block can be present inside the cache (the state shall be Valid or V) or not present inside the cache (the state shall be Invalid or I).
- ❑ **Valid bit:** Determines if a given address is present or not.
- ❑ CPU can make read or write requests
- ❑ On hit, CPU is given a response (CPUResp) for reads (CPURd).
- ❑ On miss, memory gets a requests – MemRd for CPURd and MemWr for CPUWr.
- ❑ Assume: On write hit data is updated and also sent to memory.
- ❑ What if another requests for same block arrives in the cache before response from memory for the previous one come back?



- ❑ **State Information:** Valid bit, Miss Pending bit, etc.

- ❑ Even an address is present in the cache, how do we know that its is the address that we need.
- ❑ Store part of the address inside the cache alongside data called **Tags**.
- ❑ Tag part of the incoming address is matched against the **stored tags**.

# Basic Cache Memory: Design



## MISS SCENARIO

- ❑ CPU sends a read/write request corresponding to load or store operation.
- ❑ The address associated with the address is divided into block address and block offset.
- ❑ **Block Address (assume tag)** identifies a block inside the cache and hence, refers to an entry in each tag, data, and state arrays.
- ❑ **Block Offset** refers to the word which has been accessed inside the block.
- ❑ The **comparator** checks if the stored tag matches incoming tag.
- ❑ The **and gate** ensures to provide a hit/miss after checking the state and tag comparison.
- ❑ In this case, all are invalid blocks, so we will see miss for any access.

# Basic Cache Memory: Design (Cont..)



## MISS SCENARIO

- ❑ We determine a place where the block can be placed when the response comes back.
- ❑ The state is marked as pending data or miss pending (MP) and the tag is written.
- ❑ **Miss Status Holding Registers (MSHR):** MSHR holds the address and status of the memory request upon miss in the cache.
- ❑ It serves the following purposes:
  - Stops another requests to same address to follow into the next level.
  - Quick lookup of the cache when the response comes back; we already have the way  $\rightarrow$  no tag comparison necessary.
  - Helps in verifying if the memory system is working correctly by implementing liveness property

# Basic Cache Memory: Design (Cont..)



## FILL SCENARIO

- If any address, A was accessed which contains D in main memory and D is transferred over **Data Bus**.
- Upon servicing a miss for that address, we save the tag of A (T) and data D (placed in Memory Data Register) in the tag and data arrays, respectively.
- We also mark the block as valid inside the cache. We also forward the data back to the CPU.

## HIT SCENARIO

- If any address, A was accessed which contains D in the cache.
- The comparator and AND gate specifies hit.
- The **multiplexor** reads the requested word and delivers it to the CPU.

# Basic Cache Memory: Implementation

## Inputs:

- CPURd and CPUWr for load and store instructions.
- They can be invalidates from replacement/bus.
- Memory response (MemResp) for read misses.

## Outputs:

- MemRd and MemWr for memory reads and writes upon read miss and writes.
- CPUResp for reads

**State Table**

| CS | Input           | NS | Output               |
|----|-----------------|----|----------------------|
| I  | CPURd           | MP | MemRd                |
| I  | CPUWr           | I  | MemWr                |
| I  | Invalidate      | I  | -                    |
| MP | CPURd/<br>CPUWr | MP | Action is<br>to wait |
| MP | MemResp         | V  | CPUResp              |
| V  | CPURd           | V  | CPUResp              |
| V  | CPUWr           | V  | MemWr                |
| V  | Invalidate      | I  | -                    |



○ **States:** Status of all the words part of a block inside the cache.

- **Invalid (I):** Block is not present inside the cache.
- **Valid (V):** Block is present inside the cache and all the words in the block have the same content as main memory.
- **MissPending (MP):** Tags are present but data has not been received from the memory.

# Basic Cache Memory: Implementation

- For each B in Block[]
    - If B.Tag = CPUReq.Tag & B.State = V  
Then CPUResp = B.Data
    - If B.Tag = CPUReq.Tag & B.State = MP  
Then CPUReq.wait
  - EvictedBlock = evict(Block[])
    - EvictedBlock.State = MP
    - EvictedBlock.Tag = CPUReq.Tag
  - MemReq = CPUReq
- 
- For each B in Block[]
    - If B.Tag = MemResp.Tag & B.State = MP  
Then B.State = V  
B.Data = MemResp.Data  
CPUResp.Data = B.Data  
Notify All Waiting To Restart



# Block Placement

- **Index:** The location at which a block is stored in the cache. A cache with n-blocks required  $\log n$  bits to index.
- **Block Offset:** The part of the address which determines exactly which data item/word is accessed.
- Organization of the Cache:
  - Direct-Mapped Cache
  - Associative Cache
  - Set Associative Cache



# Direct-mapped Cache: Example

A=36 = 00100100

|             | State | Tag | Data |
|-------------|-------|-----|------|
| T = 0 = 00  | I     | X   | X    |
| I = 2 = 10  | I     | X   | X    |
| BO=4 = 0100 | I     | X   | X    |
|             | I     | X   | X    |



Miss

- ❑ Word Size = W  
Example: W = 4B
- ❑ Block Size = B  
Example: B = 16B
- ❑ #Words/Block = B/W  
Example: #Words/Block = 4
- ❑ #Bits/Word = log W  
Example =  $\log 4 = 2$  bits  
Address grows after 2 bits  
PC = PC + 4
- ❑ #Bits/Block = log B  
Example:  $\log 16 = 4$  bits

Block Address = BA = A/B  
Example:  $36/16 = 2$   
Index = I = BA%N  
Example:  $2\%4 = 2$   
Tag = T = BA/N  
Example:  $2/4 = 0$   
Block Offset = BO = A %B  
Example:  $36 \% 16 = 4$

- ❑ Cache Size = C  
Example: C = 64B
- ❑ #Blocks in Cache = N = C/B  
Example =  $64B/16B = 4$
- ❑ #Word in Cache = C/W  
Example =  $64B/4B = 16$
- ❑ #indexes = #Blocks = N  
Example = 4
- ❑ #Bits for Index = log #Blocks  
Example = 2

Block-0: 10, 20, 30, 40

Block-1: 11, 07, 104, 120

Block-2: 49, 72, 63, 52

Block-3: 25, 36, 71, 203

Block-4: 302, 340, 2, 91

Block-5: 53, 47, 470, 21

Block-6: 33, 43, 44, 92

Block-7: 74, 10, 42, 20

Block-8: 63, 14, 1076, 93

Block-9: 91, 71, 77, 512

Block-10: 1, 9, 0, 32

Block-11: 0, 0, 0, 0

Block-12: 1, 1, 1, 1

Block-13: 97, 64, 48, 87

Block-14: 12, 10, 0, 1024

Block-15: 15, 17, 43, 76

- ❑ Main Memory Size = M  
Example: M =  $64W = 64 * 4 B$
- ❑ #Words/Memory in Example = 64
- ❑ #Bits in Physical Address =  $\log \#Words/Memory + \log \#wordsize$   
Example:  $\log 64 + \log 4 = 8$
- ❑ #Blocks/Memory = M/B  
Example =  $256B/16B = 16$

# Direct-mapped Cache: Example

A=36 = 00100100

**State**

**Tag**

**Data**

T = 0 = 00

I = 2 = 10

BO=4 = 0100

I

I

MP

I

X

X

00

X

X

X

X

X



- Word Size = W  
Example: W = 4B
- Block Size = B  
Example: B = 16B
- #Words/Block = B/W  
Example: #Words/Block = 4
- #Bits/Word = log W  
Example = log 4 = 2 bits  
Address grows after 2 bits  
PC = PC + 4
- #Bits/Block = log B  
Example: log 16 = 4 bits

Block Address = BA = A/B

Example: 36/16 = 2

Index = I = BA%N

Example: 2%4 = 2

Tag = T = BA/N

Example: 2/4 = 0

Block Offset = BO = A %B

Example: 36 % 16 = 4

Cache Size = C

Example: C = 64B

#Blocks in Cache = N = C/B

Example = 64B/16B = 4

#Word in Cache = C/W

Example = 64B/4B = 16

#indexes = #Blocks = N

Example = 4

#Bits for Index = log #Blocks

Example = 2

Block-0: 10, 20, 30, 40

Block-1: 11, 07, 104, 120

Block-2: 19, 72, 63, 52

Block-3: 25, 36, 71, 203

Block-4: 302, 340, 2, 91

Block-5: 53, 47, 470, 21

Block-6: 33, 43, 44, 92

Block-7: 74, 10, 42, 20

Block-8: 63, 14, 1076, 93

Block-9: 91, 71, 77, 512

Block-10: 1, 9, 0, 32

Block-11: 0, 0, 0, 0

Block-12: 1, 1, 1, 1

Block-13: 97, 64, 48, 87

Block-14: 12, 10, 0, 1024

Block-15: 15, 17, 43, 76

Main Memory Size = M

Example: M = 64W = 64 \* 4 B

#Words/Memory in Example = 64

#Bits in Physical Address = log

#Words/Memory + log #wordsize

Example: log 64 + log 4 = 8

#Blocks/Memory = M/B

Example = 256B/16B = 16

# Direct-mapped Cache: Example

A=36 = 00100100

**State**

**Tag**

**Data**

|             |   |    |                |
|-------------|---|----|----------------|
| T = 0 = 00  | I | X  | X              |
| I = 2 = 10  | I | X  | X              |
| BO=4 = 0100 | V | 00 | 19, 72, 63, 52 |
|             | I | X  | X              |



- ❑ Word Size = W  
Example: W = 4B
- ❑ Block Size = B  
Example: B = 16B
- ❑ #Words/Block = B/W  
Example: #Words/Block = 4
- ❑ #Bits/Word = log W  
Example =  $\log 4 = 2$  bits  
Address grows after 2 bits  
PC = PC + 4
- ❑ #Bits/Block = log B  
Example:  $\log 16 = 4$  bits

Block Address = BA = A/B

Example:  $36/16 = 2$

Index = I = BA%N

Example:  $2\%4 = 2$

Tag = T = BA/N

Example:  $2/4 = 0$

Block Offset = BO = A %B

Example:  $36 \% 16 = 4$

❑ Cache Size = C

Example: C = 64B

❑ #Blocks in Cache = N = C/B

Example =  $64B/16B = 4$

❑ #Word in Cache = C/W

Example =  $64B/4B = 16$

❑ #indexes = #Blocks = N

Example = 4

❑ #Bits for Index = log #Blocks

Example = 2

Block-0: 10, 20, 30, 40

Block-1: 11, 07, 104, 120

Block-2: 49, 72, 63, 52

Block-3: 25, 36, 71, 203

Block-4: 302, 340, 2, 91

Block-5: 53, 47, 470, 21

Block-6: 33, 43, 44, 92

Block-7: 74, 10, 42, 20

Block-8: 63, 14, 1076, 93

Block-9: 91, 71, 77, 512

Block-10: 1, 9, 0, 32

Block-11: 0, 0, 0, 0

Block-12: 1, 1, 1, 1

Block-13: 97, 64, 48, 87

Block-14: 12, 10, 0, 1024

Block-15: 15, 17, 43, 76

- ❑ Main Memory Size = M  
Example: M =  $64W = 64 * 4 B$
- ❑ #Words/Memory in Example = 64
- ❑ #Bits in Physical Address =  $\log \#Words/Memory + \log \#wordsize$   
Example:  $\log 64 + \log 4 = 8$
- ❑ #Blocks/Memory = M/B  
Example =  $256B/16B = 16$

# Direct-mapped Cache: Example

A=128 = 10000000

State

Tag

Data

T = 2 = 10

I = 0 = 00

BO=0 = 0000

I

X

X

I

X

X

V

00

19, 72, 63, 52

I

X

X

Miss

Block Address = BA = A/B

Example:  $128/16 = 8$

Index = I = BA%N

Example:  $8 \% 4 = 0$

Tag = T = BA/N

Example:  $8 / 4 = 2$

Block Offset = BO = A %B

Example:  $128 \% 16 = 0$

- Word Size = W  
Example: W = 4B
- Block Size = B  
Example: B = 16B
- #Words/Block = B/W  
Example: #Words/Block = 4
- #Bits/Word = log W  
Example =  $\log 4 = 2$  bits  
Address grows after 2 bits  
PC = PC + 4
- #Bits/Block = log B  
Example:  $\log 16 = 4$  bits

- Cache Size = C  
Example: C = 64B
- #Blocks in Cache = N = C/B  
Example =  $64B/16B = 4$
- #Word in Cache = C/W  
Example =  $64B/4B = 16$
- #indexes = #Blocks = N  
Example = 4
- #Bits for Index = log #Blocks  
Example = 2

Block-0: 10, 20, 30, 40

Block-1: 11, 07, 104, 120

Block-2: 19, 72, 63, 52

Block-3: 25, 36, 71, 203

Block-4: 302, 340, 2, 91

Block-5: 53, 47, 470, 21

Block-6: 33, 43, 44, 92

Block-7: 74, 10, 42, 20

Block-8: 63, 14, 1076, 93

Block-9: 91, 71, 77, 512

Block-10: 1, 9, 0, 32

Block-11: 0, 0, 0, 0

Block-12: 1, 1, 1, 1

Block-13: 97, 64, 48, 87

Block-14: 12, 10, 0, 1024

Block-15: 15, 17, 43, 76

- Main Memory Size = M  
Example: M =  $64W = 64 * 4 B$
- #Words/Memory in Example = 64
- #Bits in Physical Address =  $\log \#Words/Memory + \log \#wordsize$   
Example:  $\log 64 + \log 4 = 8$
- #Blocks/Memory = M/B  
Example =  $256B/16B = 16$

# Direct-mapped Cache: Example

$A=128 = 10000000$

State

Tag

Data

$T = 2 = 10$

$I = 0 = 00$

$BO=0 = 0000$

MP

I

V

I

10

X

00

X

X

X

19, 72, 63, 52

X

Reserve

Block Address = BA = A/B

Example:  $128/16 = 8$

Index = I = BA%N

Example:  $8 \% 4 = 0$

Tag = T = BA/N

Example:  $8 / 4 = 2$

Block Offset = BO = A %B

Example:  $128 \% 16 = 0$

- Word Size = W  
Example:  $W = 4B$
- Block Size = B  
Example:  $B = 16B$
- #Words/Block = B/W  
Example:  $\#Words/Block = 4$
- #Bits/Word = log W  
Example =  $\log 4 = 2$  bits  
Address grows after 2 bits  
 $PC = PC + 4$
- #Bits/Block = log B  
Example:  $\log 16 = 4$  bits

- Cache Size = C  
Example:  $C = 64B$
- #Blocks in Cache = N = C/B  
Example =  $64B/16B = 4$
- #Word in Cache = C/W  
Example =  $64B/4B = 16$
- #indexes = #Blocks = N  
Example = 4
- #Bits for Index = log #Blocks  
Example = 2

Block-0: 10, 20, 30, 40

Block-1: 11, 07, 104, 120

Block-2: 19, 72, 63, 52

Block-3: 25, 36, 71, 203

Block-4: 302, 340, 2, 91

Block-5: 53, 47, 470, 21

Block-6: 33, 43, 44, 92

Block-7: 74, 10, 42, 20

Block-8: 63, 14, 1076, 93

Block-9: 91, 71, 77, 512

Block-10: 1, 9, 0, 32

Block-11: 0, 0, 0, 0

Block-12: 1, 1, 1, 1

Block-13: 97, 64, 48, 87

Block-14: 12, 10, 0, 1024

Block-15: 15, 17, 43, 76

- Main Memory Size = M  
Example:  $M = 64W = 64 * 4 B$
- #Words/Memory in Example = 64
- #Bits in Physical Address =  $\log \#Words/Memory + \log \#wordsize$   
Example:  $\log 64 + \log 4 = 8$
- #Blocks/Memory = M/B  
Example =  $256B/16B = 16$

# Direct-mapped Cache: Example

A=128 = 10000000

| State       | Tag | Data                |
|-------------|-----|---------------------|
| T = 2 = 10  | V   | 10 63, 14, 1076, 93 |
| I = 0 = 00  | I   | X X                 |
| BO=0 = 0000 | V   | 00 19, 72, 63, 52   |
|             | I   | X X                 |



- ❑ Word Size = W  
Example: W = 4B
- ❑ Block Size = B  
Example: B = 16B
- ❑ #Words/Block = B/W  
Example: #Words/Block = 4
- ❑ #Bits/Word = log W  
Example =  $\log 4 = 2$  bits  
Address grows after 2 bits  
PC = PC + 4
- ❑ #Bits/Block = log B  
Example:  $\log 16 = 4$  bits

Block Address = BA = A/B  
Example:  $128/16 = 8$   
Index = I = BA%N  
Example:  $8\%4 = 0$   
Tag = T = BA/N  
Example:  $8/4 = 2$   
Block Offset = BO = A %B  
Example:  $128 \% 16 = 0$

- ❑ Cache Size = C  
Example: C = 64B
- ❑ #Blocks in Cache = N = C/B  
Example =  $64B/16B = 4$
- ❑ #Word in Cache = C/W  
Example =  $64B/4B = 16$
- ❑ #indexes = #Blocks = N  
Example = 4
- ❑ #Bits for Index = log #Blocks  
Example = 2

Block-0: 10, 20, 30, 40

Block-1: 11, 07, 104, 120

Block-2: 19, 72, 63, 52

Block-3: 25, 36, 71, 203

Block-4: 302, 340, 2, 91

Block-5: 53, 47, 470, 21

Block-6: 33, 43, 44, 92

Block-7: 74, 10, 42, 20

Block-8: 63, 14, 1076, 93

Block-9: 91, 71, 77, 512

Block-10: 1, 9, 0, 32

Block-11: 0, 0, 0, 0

Block-12: 1, 1, 1, 1

Block-13: 97, 64, 48, 87

Block-14: 12, 10, 0, 1024

Block-15: 15, 17, 43, 76

- ❑ Main Memory Size = M  
Example: M =  $64W = 64 * 4$  B
- ❑ #Words/Memory in Example = 64
- ❑ #Bits in Physical Address =  $\log$  #Words/Memory +  $\log$  #wordsize  
Example:  $\log 64 + \log 4 = 8$
- ❑ #Blocks/Memory = M/B  
Example =  $256B/16B = 16$

# Direct-mapped Cache: Example

A=128 = 10000000

| State       | Tag | Data                |
|-------------|-----|---------------------|
| T = 2 = 10  | V   | 10 63, 14, 1076, 93 |
| I = 0 = 00  | I   | X X                 |
| BO=0 = 0000 | V   | 00 19, 72, 63, 52   |
|             | I   | X X                 |



- ❑ Word Size = W  
Example: W = 4B
- ❑ Block Size = B  
Example: B = 16B
- ❑ #Words/Block = B/W  
Example: #Words/Block = 4
- ❑ #Bits/Word = log W  
Example = log 4 = 2 bits  
Address grows after 2 bits  
PC = PC + 4
- ❑ #Bits/Block = log B  
Example: log 16 = 4 bits

Block Address = BA = A/B  
Example: 128/16 = 8  
Index = I = BA%N  
Example: 8%4 = 0  
Tag = T = BA/N  
Example: 8/4 = 2  
Block Offset = BO = A %B  
Example: 128 % 16 = 0

- ❑ Cache Size = C  
Example: C = 64B
- ❑ #Blocks in Cache = N = C/B  
Example = 64B/16B = 4
- ❑ #Word in Cache = C/W  
Example = 64B/4B = 16
- ❑ #indexes = #Blocks = N  
Example = 4
- ❑ #Bits for Index = log #Blocks  
Example = 2

Block-0: 10, 20, 30, 40

Block-1: 11, 07, 104, 120

Block-2: 19, 72, 63, 52

Block-3: 25, 36, 71, 203

Block-4: 302, 340, 2, 91

Block-5: 53, 47, 470, 21

Block-6: 33, 43, 44, 92

Block-7: 74, 10, 42, 20

Block-8: 63, 14, 1076, 93

Block-9: 91, 71, 77, 512

Block-10: 1, 9, 0, 32

Block-11: 0, 0, 0, 0

Block-12: 1, 1, 1, 1

Block-13: 97, 64, 48, 87

Block-14: 12, 10, 0, 1024

Block-15: 15, 17, 43, 76

- ❑ Main Memory Size = M  
Example: M = 64W = 64 \* 4 B
- ❑ #Words/Memory in Example = 64
- ❑ #Bits in Physical Address = log #Words/Memory + log #wordsize  
Example: log 64 + log 4 = 8
- ❑ #Blocks/Memory = M/B  
Example = 256B/16B = 16

# Direct-mapped Cache: Example

$A=192 = 11000000$

| State       | Tag | Data                |
|-------------|-----|---------------------|
| T = 3 = 11  | V   | 10 63, 14, 1076, 93 |
| I = 0 = 00  | I   | X X                 |
| BO=0 = 0000 | V   | 00 19, 72, 63, 52   |
|             | I   | X X                 |

Miss.  
Why?

- ❑ Word Size = W  
Example: W = 4B
- ❑ Block Size = B  
Example: B = 16B
- ❑ #Words/Block = B/W  
Example: #Words/Block = 4
- ❑ #Bits/Word = log W  
Example =  $\log 4 = 2$  bits  
Address grows after 2 bits  
PC = PC + 4
- ❑ #Bits/Block = log B  
Example:  $\log 16 = 4$  bits

Block Address = BA = A/B  
 Example:  $192/16 = 12$   
 Index = I = BA%N  
 Example:  $12\%4 = 0$   
 Tag = T = BA/N  
 Example:  $12/4 = 3$   
 Block Offset = BO = A %B  
 Example:  $192 \% 16 = 0$

- ❑ Cache Size = C  
Example: C = 64B
- ❑ #Blocks in Cache = N = C/B  
Example =  $64B/16B = 4$
- ❑ #Word in Cache = C/W  
Example =  $64B/4B = 16$
- ❑ #indexes = #Blocks = N  
Example = 4
- ❑ #Bits for Index = log #Blocks  
Example = 2

Block-0: 10, 20, 30, 40

Block-1: 11, 07, 104, 120

Block-2: 19, 72, 63, 52

Block-3: 25, 36, 71, 203

Block-4: 302, 340, 2, 91

Block-5: 53, 47, 470, 21

Block-6: 33, 43, 44, 92

Block-7: 74, 10, 42, 20

Block-8: 63, 14, 1076, 93

Block-9: 91, 71, 77, 512

Block-10: 1, 9, 0, 32

Block-11: 0, 0, 0, 0

Block-12: 1, 1, 1, 1

Block-13: 97, 64, 48, 87

Block-14: 12, 10, 0, 1024

Block-15: 15, 17, 43, 76

- ❑ Main Memory Size = M  
Example: M =  $64W = 64 * 4$  B
- ❑ #Words/Memory in Example = 64
- ❑ #Bits in Physical Address =  $\log$  #Words/Memory +  $\log$  #wordsize  
Example:  $\log 64 + \log 4 = 8$
- ❑ #Blocks/Memory = M/B  
Example =  $256B/16B = 16$

# Direct-mapped Cache: Example

$A = 192 = 11000000$

State

Tag

Data

We don't care about this data because I is not a valid state

$T = 3 = 11$

$I = 0 = 00$

$BO = 0 = 0000$

Replacement

- Word Size = W  
Example:  $W = 4B$
- Block Size = B  
Example:  $B = 16B$
- #Words/Block = B/W  
Example:  $\#Words/Block = 4$
- #Bits/Word =  $\log W$   
Example =  $\log 4 = 2$  bits  
Address grows after 2 bits  
 $PC = PC + 4$
- #Bits/Block =  $\log B$   
Example:  $\log 16 = 4$  bits

Block Address = BA = A/B

Example:  $192/16 = 12$

Index = I = BA%N

Example:  $12\%4 = 0$

Tag = T = BA/N

Example:  $12/4 = 3$

Block Offset = BO = A %B

Example:  $192 \% 16 = 0$

Cache Size = C

Example:  $C = 64B$

#Blocks in Cache = N = C/B

Example =  $64B/16B = 4$

#Word in Cache = C/W

Example =  $64B/4B = 16$

#indexes = #Blocks = N

Example = 4

#Bits for Index = log #Blocks

Example = 2

Block-0: 10, 20, 30, 40

Block-1: 11, 07, 104, 120

Block-2: 19, 72, 63, 52

Block-3: 25, 36, 71, 203

Block-4: 302, 340, 2, 91

Block-5: 53, 47, 470, 21

Block-6: 33, 43, 44, 92

Block-7: 74, 10, 42, 20

Block-8: 63, 14, 1076, 93

Block-9: 91, 71, 77, 512

Block-10: 1, 9, 0, 32

Block-11: 0, 0, 0, 0

Block-12: 1, 1, 1, 1

Block-13: 97, 64, 48, 87

Block-14: 12, 10, 0, 1024

Block-15: 15, 17, 43, 76

Main Memory Size = M

Example:  $M = 64W = 64 * 4 B$

#Words/Memory in Example = 64

#Bits in Physical Address =  $\log \#Words/Memory + \log \#wordsize$   
Example:  $\log 64 + \log 4 = 8$

#Blocks/Memory =  $M/B$

Example =  $256B/16B = 16$

# Direct-mapped Cache: Example

A=192 = 11000000

State

Tag

Data

We don't care about this data because MP is not a valid state

|             |    |    |                  |
|-------------|----|----|------------------|
| T = 3 = 11  | MP | 11 | 63, 14, 1076, 93 |
| I = 0 = 00  | I  | X  | X                |
| BO=0 = 0000 | V  | 00 | 19, 72, 63, 52   |
|             | I  | X  | X                |

Reserve

- Word Size = W  
Example: W = 4B
- Block Size = B  
Example: B = 16B
- #Words/Block = B/W  
Example: #Words/Block = 4
- #Bits/Word = log W  
Example =  $\log 4 = 2$  bits  
Address grows after 2 bits  
PC = PC + 4
- #Bits/Block = log B  
Example:  $\log 16 = 4$  bits

Block Address = BA = A/B

Example:  $192/16 = 12$

Index = I = BA%N

Example:  $12\%4 = 0$

Tag = T = BA/N

Example:  $12/4 = 3$

Block Offset = BO = A %B

Example:  $192 \% 16 = 0$

Cache Size = C

Example: C = 64B

#Blocks in Cache = N = C/B

Example =  $64B/16B = 4$

#Word in Cache = C/W

Example =  $64B/4B = 16$

#indexes = #Blocks = N

Example = 4

#Bits for Index = log #Blocks

Example = 2

Block-0: 10, 20, 30, 40

Block-1: 11, 07, 104, 120

Block-2: 19, 72, 63, 52

Block-3: 25, 36, 71, 203

Block-4: 302, 340, 2, 91

Block-5: 53, 47, 470, 21

Block-6: 33, 43, 44, 92

Block-7: 74, 10, 42, 20

Block-8: 63, 14, 1076, 93

Block-9: 91, 71, 77, 512

Block-10: 1, 9, 0, 32

Block-11: 0, 0, 0, 0

Block-12: 1, 1, 1, 1

Block-13: 97, 64, 48, 87

Block-14: 12, 10, 0, 1024

Block-15: 15, 17, 43, 76

#Words/Memory in Example = 64

#Bits in Physical Address =  $\log \#Words/Memory + \log \#wordsize$   
Example:  $\log 64 + \log 4 = 8$

Main Memory Size = M  
Example: M =  $64W = 64 * 4 B$

#Blocks/Memory =  $M/B$   
Example =  $256B/16B = 16$

# Direct-mapped Cache: Example

$A = 192 = 11000000$

| State       | Tag | Data           |
|-------------|-----|----------------|
| T = 3 = 11  | V   | 1, 1, 1, 1     |
| I = 0 = 00  | I   | X              |
| BO=0 = 0000 | V   | 19, 72, 63, 52 |
|             | I   | X              |



- Word Size = W  
Example: W = 4B
- Block Size = B  
Example: B = 16B
- #Words/Block = B/W  
Example: #Words/Block = 4
- #Bits/Word = log W  
Example =  $\log 4 = 2$  bits  
Address grows after 2 bits  
PC = PC + 4
- #Bits/Block = log B  
Example:  $\log 16 = 4$  bits

Block Address = BA = A/B  
Example:  $192/16 = 12$   
Index = I = BA%N  
Example:  $12\%4 = 0$   
Tag = T = BA/N  
Example:  $12/4 = 3$   
Block Offset = BO = A %B  
Example:  $192 \% 16 = 0$

- Cache Size = C  
Example: C = 64B
- #Blocks in Cache = N = C/B  
Example =  $64B/16B = 4$
- #Word in Cache = C/W  
Example =  $64B/4B = 16$
- #indexes = #Blocks = N  
Example = 4
- #Bits for Index = log #Blocks  
Example = 2

Block-0: 10, 20, 30, 40

Block-1: 11, 07, 104, 120

Block-2: 19, 72, 63, 52

Block-3: 25, 36, 71, 203

Block-4: 302, 340, 2, 91

Block-5: 53, 47, 470, 21

Block-6: 33, 43, 44, 92

Block-7: 74, 10, 42, 20

Block-8: 63, 14, 1076, 93

Block-9: 91, 71, 77, 512

Block-10: 1, 9, 0, 32

Block-11: 0, 0, 0, 0

Block-12: 1, 1, 1, 1

Block-13: 97, 64, 48, 87

Block-14: 12, 10, 0, 1024

Block-15: 15, 17, 43, 76

- #Words/Memory in Example = 64
- #Bits in Physical Address =  $\log \#Words/Memory + \log \#wordsize$   
Example:  $\log 64 + \log 4 = 8$
- Main Memory Size = M  
Example:  $M = 64W = 64 * 4 B$
- #Blocks/Memory =  $M/B$   
Example =  $256B/16B = 16$

# Direct-mapped Cache: Example

A=132 = 10000100

| State       | Tag | Data           |
|-------------|-----|----------------|
| T = 2 = 10  | V   | 1, 1, 1, 1     |
| I = 0 = 00  | I   | X              |
| BO=0 = 0100 | V   | 19, 72, 63, 52 |
|             | I   | X              |



- Word Size = W  
Example: W = 4B
- Block Size = B  
Example: B = 16B
- #Words/Block = B/W  
Example: #Words/Block = 4
- #Bits/Word = log W  
Example =  $\log 4 = 2$  bits  
Address grows after 2 bits  
PC = PC + 4
- #Bits/Block = log B  
Example:  $\log 16 = 4$  bits

Block Address = BA = A/B  
Example:  $132/16 = 8$   
Index = I = BA%N  
Example:  $8\%4 = 0$   
Tag = T = BA/N  
Example:  $8/4 = 2$   
Block Offset = BO = A %B  
Example:  $132 \% 16 = 4$

- Cache Size = C  
Example: C = 64B
- #Blocks in Cache = N = C/B  
Example =  $64B/16B = 4$
- #Word in Cache = C/W  
Example =  $64B/4B = 16$
- #indexes = #Blocks = N  
Example = 4
- #Bits for Index = log #Blocks  
Example = 2

Block-0: 10, 20, 30, 40

Block-1: 11, 07, 104, 120

Block-2: 19, 72, 63, 52

Block-3: 25, 36, 71, 203

Block-4: 302, 340, 2, 91

Block-5: 53, 47, 470, 21

Block-6: 33, 43, 44, 92

Block-7: 74, 10, 42, 20

Block-8: 63, 14, 1076, 93

Block-9: 91, 71, 77, 512

Block-10: 1, 9, 0, 32

Block-11: 0, 0, 0, 0

Block-12: 1, 1, 1, 1

Block-13: 97, 64, 48, 87

Block-14: 12, 10, 0, 1024

Block-15: 15, 17, 43, 76

- #Words/Memory in Example = 64
- #Bits in Physical Address =  $\log \#Words/Memory + \log \#wordsize$   
Example:  $\log 64 + \log 4 = 8$
- Main Memory Size = M  
Example: M =  $64W = 64 * 4 B$
- #Blocks/Memory =  $M/B$   
Example =  $256B/16B = 16$

# Direct-mapped Cache: Design

- Every physical address maps to a specific location in the cache
- **Indexing Function:** The operation used to determine the index of a cache for a given address. Typically, it is **Index = (physical block address) mod (Total #Blocks)**
- **Benefit:** Simple in design, Lower hit time
- **Demerits:** More conflicts at the same index
  
- Block Size = B, Block Bits =  $\log B$ , Block Offset = ?
- Cache Size = C
- #Blocks = N =  $C/B$ . Index Bits =  $\log (C/B)$ , Index = ?
- Tag Bits = A -  $\log B$  -  $\log N$ , Tag = ?
- State Bits = S
- Tag Array Size =  $N * (S + A - \log B - \log N)$



# Direct-mapped Cache: Implementation

- $B = \text{Block}[CPUREq.\text{Index}]$
- If  $B.\text{Tag} = CPUREq.\text{Tag}$  &  $B.\text{State} = V$ 
  - Then  $\text{CPUResp} = B.\text{Data}$
- If  $B.\text{Tag} = CPUREq.\text{Tag}$  &  $B.\text{State} = MP$ 
  - Then  $\text{CPUREq.wait}$
  - $B = \text{evict}(B)$
  - $B.\text{State} = MP$
  - $B.\text{Tag} = CPUREq.\text{Tag}$
  - $\text{MemReq} = CPUREq$

- $B = \text{Block}[CPUREq.\text{Index}]$
- If  $B.\text{Tag} = MemResp.\text{Tag}$  &  $B.\text{State} = MP$ 
  - Then  $B.\text{State} = V$ 
    - $B.\text{Data} = MemResp.\text{Data}$
    - $\text{CPUResp.Data} = B.\text{Data}$
    - Notify All Waiting To Restart



# Direct-mapped Cache: Example

- **Problem:** A direct mapped cache of size 32KB can be addressed using a 40 bit address. Block size of the cache is 64B. Find the total number of bits required for indexing, block offset and tag. Also find the tag array size.
  - Block Size = 64 B => Block Offset = 6 Bits
  - Cache Size = 32 KB => No. of Blocks =  $32\text{KB}/64\text{B} = 512$
  - Indexing bits =  $\log 512 = 9$  Bits
  - Tag bits =  $40 - 9 - 6 = 25$
  - Tag array size =  $((25 + 1) * 512)$  bits =  $26 * 64$  B



# Associative Cache: Example

A=36 = 00100100

T= 2 = 0010

I = 0

BO=4 = 0100

State

Tag

Data

I

I

I

I

X

X

X

X

X

X

X

X

X

X

Miss

- Main Memory Size = M
- #Words/Memory in Example = 64  
Example:  $M = 64W = 64 * 4 = 256B$
- #Bits in PA = log #Words/Memory + log #wordsize  
Example:  $\log 64 + \log 4 = 8$
- #Blocks/Memory = M/B  
Example =  $256B / 16B = 16$

Block Address = BA = A/B  
Example:  $36 / 16 = 2$   
Index = I = 0  
Tag = T = BA  
Example: BA = 2  
Block Offset = BO = A % B  
Example:  $36 \% 16 = 4$

- Cache Size = C  
Example: C = 64B
- #Blocks in Cache = N = C/B  
Example =  $64B / 16B = 4$
- #Word in Cache = C/W  
Example =  $64B / 4B = 16$
- #indexes = 1
- #Bits for Index = 0
- #Bits for Tag = #Bits in PA - #Blockbits  
Example:  $8 - 4 = 4$

- Word Size = W  
Example: W = 4B
- Block Size = B  
Example: B = 16B
- #Words/Block = B/W  
Example: #Words/Block = 4
- #Bits/Word = log W  
Example =  $\log 4 = 2$  bits  
Address grows after 2 bits  
PC = PC + 4
- #Bits/Block = log B  
Example:  $\log 16 = 4$  bits

Block-0: 10, 20, 30, 40

Block-1: 11, 07, 104, 120

Block-2: 19, 72, 63, 52

Block-3: 25, 36, 71, 203

Block-4: 302, 340, 2, 91

Block-5: 53, 47, 470, 21

Block-6: 33, 43, 44, 92

Block-7: 74, 10, 42, 20

Block-8: 63, 14, 1076, 93

Block-9: 91, 71, 77, 512

Block-10: 1, 9, 0, 32

Block-11: 0, 0, 0, 0

Block-12: 1, 1, 1, 1

Block-13: 97, 64, 48, 87

Block-14: 12, 10, 0, 1024

Block-15: 15, 17, 43, 76

# Associative Cache: Example

A=36 = 00100100

T= 2 = 0010

I = 0 = 0

BO=4 = 0100

Reserve

State

Tag

Data

MP

I

I

I

0010

X

X

X

X

X

X

X

- Main Memory Size = M
- #Words/Memory in Example = 64  
Example:  $M = 64W = 64 * 4 = 256B$
- #Bits in PA = log #Words/Memory + log #wordsize  
Example:  $\log 64 + \log 4 = 8$
- #Blocks/Memory = M/B  
Example =  $256B / 16B = 16$

- Cache Size = C  
Example:  $C = 64B$
- #Blocks in Cache = N = C/B  
Example =  $64B / 16B = 4$
- #Word in Cache = C/W  
Example =  $64B / 4B = 16$
- #indexes = 1
- #Bits for Index = 0
- #Bits for Tag = #Bits in PA - #Blockbits  
Example:  $8 - 4 = 4$

Block Address = BA = A/B  
Example:  $36 / 16 = 2$   
Index = I = 0  
Tag = T = BA  
Example: BA = 2  
Block Offset = BO = A % B  
Example:  $36 \% 16 = 4$

- Word Size = W  
Example:  $W = 4B$
- Block Size = B  
Example:  $B = 16B$
- #Words/Block = B/W  
Example:  $\#Words/Block = 4$
- #Bits/Word = log W  
Example =  $\log 4 = 2$  bits  
Address grows after 2 bits  
 $PC = PC + 4$
- #Bits/Block = log B  
Example:  $\log 16 = 4$  bits

Block-0: 10, 20, 30, 40

Block-1: 11, 07, 104, 120

Block-2: 19, 72, 63, 52

Block-3: 25, 36, 71, 203

Block-4: 302, 340, 2, 91

Block-5: 53, 47, 470, 21

Block-6: 33, 43, 44, 92

Block-7: 74, 10, 42, 20

Block-8: 63, 14, 1076, 93

Block-9: 91, 71, 77, 512

Block-10: 1, 9, 0, 32

Block-11: 0, 0, 0, 0

Block-12: 1, 1, 1, 1

Block-13: 97, 64, 48, 87

Block-14: 12, 10, 0, 1024

Block-15: 15, 17, 43, 76

# Associative Cache: Example

A=36 = 00100100

T= 2 = 0010

I = 0 = 0

BO=4 = 0100

Fill

State

Tag

Data

V I I I 0010 X X X

19, 72, 63, 52 X X X

- Main Memory Size = M
- #Words/Memory in Example = 64  
Example:  $M = 64W = 64 * 4 = 256B$
- #Bits in PA = log #Words/Memory + log #wordsize  
Example:  $\log 64 + \log 4 = 8$
- #Blocks/Memory = M/B  
Example =  $256B / 16B = 16$

- Cache Size = C  
Example:  $C = 64B$
- #Blocks in Cache = N = C/B  
Example =  $64B / 16B = 4$
- #Word in Cache = C/W  
Example =  $64B / 4B = 16$
- #indexes = 1
- #Bits for Index = 0
- #Bits for Tag = #Bits in PA - #Blockbits  
Example:  $8 - 4 = 4$

Block Address = BA = A/B  
Example:  $36 / 16 = 2$   
Index = I = 0  
Tag = T = BA  
Example: BA = 2  
Block Offset = BO = A % B  
Example:  $36 \% 16 = 4$

- Word Size = W  
Example:  $W = 4B$
- Block Size = B  
Example:  $B = 16B$
- #Words/Block = B/W  
Example: #Words/Block = 4
- #Bits/Word = log W  
Example =  $\log 4 = 2$  bits  
Address grows after 2 bits  
 $PC = PC + 4$
- #Bits/Block = log B  
Example:  $\log 16 = 4$  bits

Block-0: 10, 20, 30, 40

Block-1: 11, 07, 104, 120

Block-2: 19, 72, 63, 52

Block-3: 25, 36, 71, 203

Block-4: 302, 340, 2, 91

Block-5: 53, 47, 470, 21

Block-6: 33, 43, 44, 92

Block-7: 74, 10, 42, 20

Block-8: 63, 14, 1076, 93

Block-9: 91, 71, 77, 512

Block-10: 1, 9, 0, 32

Block-11: 0, 0, 0, 0

Block-12: 1, 1, 1, 1

Block-13: 97, 64, 48, 87

Block-14: 12, 10, 0, 1024

Block-15: 15, 17, 43, 76

# Associative Cache: Example

A=128 = 10000000

T= 8 = 1000

I = 0 = 0

BO=0 = 0000

Miss

State

Tag

Data

V I I I 0010 X X X

X X X

- Main Memory Size = M
- #Words/Memory in Example = 64  
Example:  $M = 64W = 64 * 4 = 256B$
- #Bits in PA = log #Words/Memory + log #wordsize  
Example:  $\log 64 + \log 4 = 8$
- #Blocks/Memory = M/B  
Example =  $256B / 16B = 16$

- Cache Size = C  
Example:  $C = 64B$
- #Blocks in Cache = N = C/B  
Example =  $64B / 16B = 4$
- #Word in Cache = C/W  
Example =  $64B / 4B = 16$
- #indexes = 1
- #Bits for Index = 0
- #Bits for Tag = #Bits in PA - #Blockbits  
Example:  $8 - 4 = 4$

Block Address = BA = A/B  
Example:  $128 / 16 = 8$   
Index = I = 0  
Tag = T = BA  
Example: BA = 8  
Block Offset = BO = A % B  
Example:  $128 \% 16 = 0$

- Word Size = W  
Example:  $W = 4B$
- Block Size = B  
Example:  $B = 16B$
- #Words/Block = B/W  
Example: #Words/Block = 4
- #Bits/Word = log W  
Example =  $\log 4 = 2$  bits  
Address grows after 2 bits  
 $PC = PC + 4$
- #Bits/Block = log B  
Example:  $\log 16 = 4$  bits

Block-0: 10, 20, 30, 40

Block-1: 11, 07, 104, 120

Block-2: 19, 72, 63, 52

Block-3: 25, 36, 71, 203

Block-4: 302, 340, 2, 91

Block-5: 53, 47, 470, 21

Block-6: 33, 43, 44, 92

Block-7: 74, 10, 42, 20

Block-8: 63, 14, 1076, 93

Block-9: 91, 71, 77, 512

Block-10: 1, 9, 0, 32

Block-11: 0, 0, 0, 0

Block-12: 1, 1, 1, 1

Block-13: 97, 64, 48, 87

Block-14: 12, 10, 0, 1024

Block-15: 15, 17, 43, 76

# Associative Cache: Example

A=128 = 10000000

T= 8 = 1000

I = 0 = 0

BO=0 = 0000

Reserve

State

Tag

Data

V MP I I 0010 1000 X X

19, 72, 63, 52 X X X

- Main Memory Size = M
- #Words/Memory in Example = 64  
Example:  $M = 64W = 64 * 4 = 256B$
- #Bits in PA = log #Words/Memory + log #wordsize  
Example:  $\log 64 + \log 4 = 8$
- #Blocks/Memory = M/B  
Example =  $256B / 16B = 16$

- Cache Size = C  
Example:  $C = 64B$
- #Blocks in Cache = N = C/B  
Example =  $64B / 16B = 4$
- #Word in Cache = C/W  
Example =  $64B / 4B = 16$
- #indexes = 1
- #Bits for Index = 0
- #Bits for Tag = #Bits in PA - #Blockbits  
Example:  $8 - 4 = 4$

Block Address = BA = A/B  
Example:  $128 / 16 = 8$   
Index = I = 0  
Tag = T = BA  
Example: BA = 8  
Block Offset = BO = A % B  
Example:  $128 \% 16 = 0$

- Word Size = W  
Example:  $W = 4B$
- Block Size = B  
Example:  $B = 16B$
- #Words/Block = B/W  
Example: #Words/Block = 4
- #Bits/Word = log W  
Example =  $\log 4 = 2$  bits  
Address grows after 2 bits  
 $PC = PC + 4$
- #Bits/Block = log B  
Example:  $\log 16 = 4$  bits

Block-0: 10, 20, 30, 40

Block-1: 11, 07, 104, 120

Block-2: 19, 72, 63, 52

Block-3: 25, 36, 71, 203

Block-4: 302, 340, 2, 91

Block-5: 53, 47, 470, 21

Block-6: 33, 43, 44, 92

Block-7: 74, 10, 42, 20

Block-8: 63, 14, 1076, 93

Block-9: 91, 71, 77, 512

Block-10: 1, 9, 0, 32

Block-11: 0, 0, 0, 0

Block-12: 1, 1, 1, 1

Block-13: 97, 64, 48, 87

Block-14: 12, 10, 0, 1024

Block-15: 15, 17, 43, 76

# Associative Cache: Example

A=128 = 10000000

T= 8 = 1000

I = 0 = 0

BO=0 = 0000

Fill

State

Tag

Data

V V I I 0010

1000 X X

19, 72, 63, 52

63, 14, 1076, 93

X

X

- Main Memory Size = M
- #Words/Memory in Example = 64  
Example:  $M = 64W = 64 * 4 = 256B$
- #Bits in PA = log #Words/Memory + log #wordsize  
Example:  $\log 64 + \log 4 = 8$
- #Blocks/Memory = M/B  
Example =  $256B / 16B = 16$

- Cache Size = C  
Example:  $C = 64B$
- #Blocks in Cache = N = C/B  
Example =  $64B / 16B = 4$
- #Word in Cache = C/W  
Example =  $64B / 4B = 16$
- #indexes = 1
- #Bits for Index = 0
- #Bits for Tag = #Bits in PA - #Blockbits  
Example:  $8 - 4 = 4$

Block Address = BA = A/B  
Example:  $128 / 16 = 8$   
Index = I = 0  
Tag = T = BA  
Example: BA = 8  
Block Offset = BO = A % B  
Example:  $128 \% 16 = 0$

- Word Size = W  
Example:  $W = 4B$
- Block Size = B  
Example:  $B = 16B$
- #Words/Block = B/W  
Example: #Words/Block = 4
- #Bits/Word = log W  
Example =  $\log 4 = 2$  bits  
Address grows after 2 bits  
 $PC = PC + 4$
- #Bits/Block = log B  
Example:  $\log 16 = 4$  bits

Block-0: 10, 20, 30, 40

Block-1: 11, 07, 104, 120

Block-2: 19, 72, 63, 52

Block-3: 25, 36, 71, 203

Block-4: 302, 340, 2, 91

Block-5: 53, 47, 470, 21

Block-6: 33, 43, 44, 92

Block-7: 74, 10, 42, 20

Block-8: 63, 14, 1076, 93

Block-9: 91, 71, 77, 512

Block-10: 1, 9, 0, 32

Block-11: 0, 0, 0, 0

Block-12: 1, 1, 1, 1

Block-13: 97, 64, 48, 87

Block-14: 12, 10, 0, 1024

Block-15: 15, 17, 43, 76

# Associative Cache: Example

A=192 = 11000000

T= 12 = 1100

I = 0 = 0

BO=0 = 0000

Miss

State

Tag

Data

V V I I 0010 1000 X X

19, 72, 63, 52

63, 14, 1076, 93

X

X

- Main Memory Size = M
- #Words/Memory in Example = 64  
Example:  $M = 64W = 64 * 4 = 256B$
- #Bits in PA = log #Words/Memory + log #wordsize  
Example:  $\log 64 + \log 4 = 8$
- #Blocks/Memory = M/B  
Example =  $256B / 16B = 16$

- Cache Size = C  
Example:  $C = 64B$
- #Blocks in Cache = N = C/B  
Example =  $64B / 16B = 4$
- #Word in Cache = C/W  
Example =  $64B / 4B = 16$
- #indexes = 1
- #Bits for Index = 0
- #Bits for Tag = #Bits in PA - #Blockbits  
Example:  $8 - 4 = 4$

Block Address = BA = A/B  
Example:  $192 / 16 = 12$   
Index = I = 0  
Tag = T = BA  
Example: BA = 12  
Block Offset = BO = A % B  
Example:  $192 \% 16 = 0$

- Word Size = W  
Example:  $W = 4B$
- Block Size = B  
Example:  $B = 16B$
- #Words/Block = B/W  
Example: #Words/Block = 4
- #Bits/Word = log W  
Example =  $\log 4 = 2$  bits  
Address grows after 2 bits  
 $PC = PC + 4$
- #Bits/Block = log B  
Example:  $\log 16 = 4$  bits

Block-0: 10, 20, 30, 40

Block-1: 11, 07, 104, 120

Block-2: 19, 72, 63, 52

Block-3: 25, 36, 71, 203

Block-4: 302, 340, 2, 91

Block-5: 53, 47, 470, 21

Block-6: 33, 43, 44, 92

Block-7: 74, 10, 42, 20

Block-8: 63, 14, 1076, 93

Block-9: 91, 71, 77, 512

Block-10: 1, 9, 0, 32

Block-11: 0, 0, 0, 0

Block-12: 1, 1, 1, 1

Block-13: 97, 64, 48, 87

Block-14: 12, 10, 0, 1024

Block-15: 15, 17, 43, 76

# Associative Cache: Example

A=192 = 11000000

T= 12 = 1100

I = 0 = 0

BO=0 = 0000

Reserve

State

Tag

Data

V V MP I 0010 1000 1100 X 19, 72, 63, 52

63, 14, 1076, 93 X X

- Main Memory Size = M
- #Words/Memory in Example = 64  
Example:  $M = 64W = 64 * 4 = 256B$
- #Bits in PA = log #Words/Memory + log #wordsize  
Example:  $\log 64 + \log 4 = 8$
- #Blocks/Memory = M/B  
Example =  $256B / 16B = 16$

Block Address = BA = A/B  
Example:  $192 / 16 = 12$   
Index = I = 0  
Tag = T = BA  
Example: BA = 12  
Block Offset = BO = A % B  
Example:  $192 \% 16 = 0$

- Cache Size = C  
Example: C = 64B
- #Blocks in Cache = N = C/B  
Example =  $64B / 16B = 4$
- #Word in Cache = C/W  
Example =  $64B / 4B = 16$
- #indexes = 1
- #Bits for Index = 0
- #Bits for Tag = #Bits in PA - #Blockbits  
Example:  $8 - 4 = 4$

- Word Size = W  
Example: W = 4B
- Block Size = B  
Example: B = 16B
- #Words/Block = B/W  
Example: #Words/Block = 4
- #Bits/Word = log W  
Example =  $\log 4 = 2$  bits  
Address grows after 2 bits  
PC = PC + 4
- #Bits/Block = log B  
Example:  $\log 16 = 4$  bits

- Block-0: 10, 20, 30, 40
- Block-1: 11, 07, 104, 120
- Block-2: 19, 72, 63, 52
- Block-3: 25, 36, 71, 203
- Block-4: 302, 340, 2, 91
- Block-5: 53, 47, 470, 21
- Block-6: 33, 43, 44, 92
- Block-7: 74, 10, 42, 20
- Block-8: 63, 14, 1076, 93
- Block-9: 91, 71, 77, 512
- Block-10: 1, 9, 0, 32
- Block-11: 0, 0, 0, 0
- Block-12: 1, 1, 1, 1
- Block-13: 97, 64, 48, 87
- Block-14: 12, 10, 0, 1024
- Block-15: 15, 17, 43, 76

# Associative Cache: Example

A=192 = 11000000

T= 12 = 1100

I = 0 = 0

BO=0 = 0000

Fill

State

Tag

Data

V V V I 0010 1000 1100 X 19, 72, 63, 52 63, 14, 1076, 93 1, 1, 1, 1 X

- Main Memory Size = M
- #Words/Memory in Example = 64  
Example:  $M = 64W = 64 * 4 = 256B$
- #Bits in PA = log #Words/Memory + log #wordsize  
Example:  $\log 64 + \log 4 = 8$
- #Blocks/Memory = M/B  
Example =  $256B / 16B = 16$

- Cache Size = C  
Example:  $C = 64B$
- #Blocks in Cache = N = C/B  
Example =  $64B / 16B = 4$
- #Word in Cache = C/W  
Example =  $64B / 4B = 16$
- #indexes = 1
- #Bits for Index = 0
- #Bits for Tag = #Bits in PA - #Blockbits  
Example:  $8 - 4 = 4$

Block Address = BA = A/B  
Example:  $192 / 16 = 12$   
Index = I = 0  
Tag = T = BA  
Example: BA = 12  
Block Offset = BO = A % B  
Example:  $192 \% 16 = 0$

- Word Size = W  
Example:  $W = 4B$
- Block Size = B  
Example:  $B = 16B$
- #Words/Block = B/W  
Example: #Words/Block = 4
- #Bits/Word = log W  
Example =  $\log 4 = 2$  bits  
Address grows after 2 bits  
 $PC = PC + 4$
- #Bits/Block = log B  
Example:  $\log 16 = 4$  bits

Block-0: 10, 20, 30, 40

Block-1: 11, 07, 104, 120

Block-2: 19, 72, 63, 52

Block-3: 25, 36, 71, 203

Block-4: 302, 340, 2, 91

Block-5: 53, 47, 470, 21

Block-6: 33, 43, 44, 92

Block-7: 74, 10, 42, 20

Block-8: 63, 14, 1076, 93

Block-9: 91, 71, 77, 512

Block-10: 1, 9, 0, 32

Block-11: 0, 0, 0, 0

Block-12: 1, 1, 1, 1

Block-13: 97, 64, 48, 87

Block-14: 12, 10, 0, 1024

Block-15: 15, 17, 43, 76

# Associative Cache: Example

A=132 = 10000100

T= 8 = 1000

I = 0 = 0

BO=4 = 0100

Hit

State

Tag

Data

V V V I 0010 1000 1100 X 19, 72, 63, 52

63, 14, 1076, 93

1, 1, 1, 1

X

- Main Memory Size = M
- #Words/Memory in Example = 64  
Example:  $M = 64W = 64 * 4 = 256B$
- #Bits in PA = log #Words/Memory + log #wordsize  
Example:  $\log 64 + \log 4 = 8$
- #Blocks/Memory = M/B  
Example =  $256B / 16B = 16$

- Cache Size = C  
Example:  $C = 64B$
- #Blocks in Cache = N = C/B  
Example =  $64B / 16B = 4$
- #Word in Cache = C/W  
Example =  $64B / 4B = 16$
- #indexes = 1
- #Bits for Index = 0
- #Bits for Tag = #Bits in PA - #Blockbits  
Example:  $8 - 4 = 4$

Block Address = BA = A/B  
Example:  $132 / 16 = 8$   
Index = I = 0  
Tag = T = BA  
Example: BA = 8  
BO = 4 = 0100

- Word Size = W  
Example:  $W = 4B$
- Block Size = B  
Example:  $B = 16B$
- #Words/Block = B/W  
Example: #Words/Block = 4
- #Bits/Word = log W  
Example =  $\log 4 = 2$  bits  
Address grows after 2 bits  
 $PC = PC + 4$
- #Bits/Block = log B  
Example:  $\log 16 = 4$  bits

Block-0: 10, 20, 30, 40

Block-1: 11, 07, 104, 120

Block-2: 19, 72, 63, 52

Block-3: 25, 36, 71, 203

Block-4: 302, 340, 2, 91

Block-5: 53, 47, 470, 21

Block-6: 33, 43, 44, 92

Block-7: 74, 10, 42, 20

Block-8: 63, 14, 1076, 93

Block-9: 91, 71, 77, 512

Block-10: 1, 9, 0, 32

Block-11: 0, 0, 0, 0

Block-12: 1, 1, 1, 1

Block-13: 97, 64, 48, 87

Block-14: 12, 10, 0, 1024

Block-15: 15, 17, 43, 76

# Associative Cache: Design

- A physical address can map to any location in the cache
- **Index:** No indexing necessary
- **Parallel Search:** All tags are searched in parallel. Requires comparator for all the block.
- **Replacement Policy:** A block allocation policy is necessary
- **Benefit:** Can place blocks anywhere and hence, minimum conflict. (High hit-rate)
- **Demerits:** Large multiplexor is required that hit time, comparators consume a lot of power.

- Block Size =  $B$ , Block Bits =  $\log B$ , Block Offset = ?
- Cache Size =  $C$
- #Blocks =  $N = C/B$ . Index Bits = 0, Index = 0
- Tag Bits =  $A - \log B$ , Tag = ?
- State Bits =  $S$
- Tag Array Size =  $N * (S + A - \log B)$



# Associative Cache: Implementation

**for each way  $B$  in the Cache**

- If  $B.Tag = \text{CPUReq.Tag} \& B.State = V$   
Then  $\text{CPUResp} = B.Data // \text{Return}$
- If  $B.Tag = \text{CPUReq.Tag} \& B.State = MP$   
Then  $\text{CPUReq.wait} // \text{Return}$
- $B = \text{evict}(Cache)$
- $B.State = MP$
- $B.Tag = \text{CPUReq.Tag}$
- $\text{MemReq} = \text{CPUReq}$

**for each way  $B$  in the Cache**

- If  $B.Tag = \text{MemResp.Tag} \& B.State = MP$   
Then  $B.State = V$   
 $B.Data = \text{MemResp.Data}$   
 $\text{CPUResp.Data} = B.Data$   
Notify All Waiting To Restart



# Associative Cache

- **Problem:** An associative cache of size 32KB can be addressed using a 40 bit address. Block size of the cache is 64B. Find the total number of bits required for indexing, block offset and tag. Also find the tag array size.
  - Block Size = 64 B => Block Offset = 6 Bits
  - Cache Size = 32 KB => No. of Blocks =  $32\text{KB}/64\text{B} = 512$
  - Indexing bits = 0
  - Tag bits =  $40 - 6 = 34$
  - Tag array size =  $(34 + 1) * 512 \text{ bits} = 35 * 64 \text{ B}$



# Set Associative Cache: Example

A=36 = 00100100

T= 1 = 001

I = 0 = 0

BO=4 = 0100

State

Tag

Data



Miss

Block Address = BA = A/B

Example:  $36/16 = 2$

Index = I = BA%S

Example:  $2\%2 = 0$

Tag = T = BA/S

Example:  $2/2 = 1$

Block Offset = BO = A %B

Example:  $36 \% 16 = 4$

- Word Size = WS  
Example: WS = 4B
- Block Size = B  
Example: B = 16B
- #Words/Block = B/W  
Example: #Words/Block = 4
- #Bits/Word = log W  
Example =  $\log 4 = 2$  bits  
Address grows after 2 bits  
PC = PC + 4
- #Bits/Block = log B  
Example:  $\log 16 = 4$  bits

- Cache Size = C  
Example: C = 64B
- #Way = Number blocks to be searched in parallel  
Example: W = 2
- #Blocks in Cache = N = C/B  
Example =  $64B/16B = 4$
- #Word in Cache = C/WS  
Example =  $64B/4B = 16$
- #indexes = #Sets = S = N/W  
Example = 2
- #Bits for Index = log #Blocks  
Example = 1

Block-0: 10, 20, 30, 40

Block-1: 11, 07, 104, 120

Block-2: 19, 72, 63, 52

Block-3: 25, 36, 71, 203

Block-4: 302, 340, 2, 91

Block-5: 53, 47, 470, 21

Block-6: 33, 43, 44, 92

Block-7: 74, 10, 42, 20

Block-8: 63, 14, 1076, 93

Block-9: 91, 71, 77, 512

Block-10: 1, 9, 0, 32

Block-11: 0, 0, 0, 0

Block-12: 1, 1, 1, 1

Block-13: 97, 64, 48, 87

Block-14: 12, 10, 0, 1024

Block-15: 15, 17, 43, 76

- #Words/Memory in Example = 64
- #Bits in Physical Address =  $\log \#Words/Memory + \log \#wordsize$   
Example:  $\log 64 + \log 4 = 8$
- Main Memory Size = M  
Example: M =  $64W = 64 * 4 B$
- #Blocks/Memory = M/B  
Example =  $256B/16B = 16$

# Set Associative Cache: Example

A=36 = 00100100

T= 1 = 001

I = 0 = 0

BO=4 = 0100

State

Tag

Data



Reserve

Block Address = BA = A/B

Example:  $36/16 = 2$

Index = I = BA%S

Example:  $2\%2 = 0$

Tag = T = BA/S

Example:  $2/2 = 1$

Block Offset = BO = A %B

Example:  $36 \% 16 = 4$

- Word Size = WS  
Example: WS = 4B
- Block Size = B  
Example: B = 16B
- #Words/Block = B/W  
Example: #Words/Block = 4
- #Bits/Word = log W  
Example =  $\log 4 = 2$  bits  
Address grows after 2 bits  
PC = PC + 4
- #Bits/Block = log B  
Example:  $\log 16 = 4$  bits

- Cache Size = C  
Example: C = 64B
- #Way = Number blocks to be searched in parallel  
Example: W = 2
- #Blocks in Cache = N = C/B  
Example =  $64B/16B = 4$
- #Word in Cache = C/WS  
Example =  $64B/4B = 16$
- #indexes = #Sets = S = N/W  
Example = 2
- #Bits for Index = log #Blocks  
Example = 1

Block-0: 10, 20, 30, 40

Block-1: 11, 07, 104, 120

Block-2: 19, 72, 63, 52

Block-3: 25, 36, 71, 203

Block-4: 302, 340, 2, 91

Block-5: 53, 47, 470, 21

Block-6: 33, 43, 44, 92

Block-7: 74, 10, 42, 20

Block-8: 63, 14, 1076, 93

Block-9: 91, 71, 77, 512

Block-10: 1, 9, 0, 32

Block-11: 0, 0, 0, 0

Block-12: 1, 1, 1, 1

Block-13: 97, 64, 48, 87

Block-14: 12, 10, 0, 1024

Block-15: 15, 17, 43, 76

- #Words/Memory in Example = 64
- #Bits in Physical Address =  $\log \#Words/Memory + \log \#wordsize$   
Example:  $\log 64 + \log 4 = 8$
- Main Memory Size = M  
Example: M =  $64W = 64 * 4$  B
- #Blocks/Memory = M/B  
Example =  $256B/16B = 16$

# Set Associative Cache: Example

A=36 = 00100100

T= 1 = 001

I = 0 = 0

BO=4= 0100

State

Tag

Data

|   |   |     |   |                |   |
|---|---|-----|---|----------------|---|
| V | I | 001 | X | 19, 72, 63, 52 | X |
| I | I | X   | X | X              | X |



Block Address = BA = A/B

Example:  $36/16 = 2$

Index = I = BA%S

Example:  $2\%2 = 0$

Tag = T = BA/S

Example:  $2/2 = 1$

Block Offset = BO = A %B

Example:  $36 \% 16 = 4$

- Word Size = WS  
Example: WS = 4B
- Block Size = B  
Example: B = 16B
- #Words/Block = B/W  
Example: #Words/Block = 4
- #Bits/Word = log W  
Example =  $\log 4 = 2$  bits  
Address grows after 2 bits  
PC = PC + 4
- #Bits/Block = log B  
Example:  $\log 16 = 4$  bits

- Cache Size = C  
Example: C = 64B
- #Way = Number blocks to be searched in parallel  
Example: W = 2
- #Blocks in Cache = N = C/B  
Example =  $64B/16B = 4$
- #Word in Cache = C/WS  
Example =  $64B/4B = 16$
- #indexes = #Sets = S = N/W  
Example = 2
- #Bits for Index = log #Blocks  
Example = 1

Block-0: 10, 20, 30, 40

Block-1: 11, 07, 104, 120

Block-2: 19, 72, 63, 52

Block-3: 25, 36, 71, 203

Block-4: 302, 340, 2, 91

Block-5: 53, 47, 470, 21

Block-6: 33, 43, 44, 92

Block-7: 74, 10, 42, 20

Block-8: 63, 14, 1076, 93

Block-9: 91, 71, 77, 512

Block-10: 1, 9, 0, 32

Block-11: 0, 0, 0, 0

Block-12: 1, 1, 1, 1

Block-13: 97, 64, 48, 87

Block-14: 12, 10, 0, 1024

Block-15: 15, 17, 43, 76

- #Words/Memory in Example = 64
- #Bits in Physical Address =  $\log \#Words/Memory + \log \#wordsize$   
Example:  $\log 64 + \log 4 = 8$
- Main Memory Size = M  
Example: M =  $64W = 64 * 4 B$
- #Blocks/Memory = M/B  
Example =  $256B/16B = 16$

# Set Associative Cache: Example

A=128 = 10000000

T= 1 = 100

I = 0 = 0

BO=0 = 0000

State

Tag

Data

|   |   |     |   |                |   |
|---|---|-----|---|----------------|---|
| V | I | 001 | X | 19, 72, 63, 52 | X |
| I | I | X   | X | X              | X |



Block Address = BA = A/B

Example:  $128/16 = 8$

Index = I = BA%S

Example:  $8\%2 = 0$

Tag = T = BA/S

Example:  $8/2 = 4$

Block Offset = BO = A %B

Example:  $128 \% 16 = 0$

- Word Size = WS  
Example: WS = 4B
- Block Size = B  
Example: B = 16B
- #Words/Block = B/W  
Example: #Words/Block = 4
- #Bits/Word = log W  
Example =  $\log 4 = 2$  bits  
Address grows after 2 bits  
PC = PC + 4
- #Bits/Block = log B  
Example:  $\log 16 = 4$  bits

- Cache Size = C  
Example: C = 64B
- #Way = Number blocks to be searched in parallel  
Example: W = 2
- #Blocks in Cache = N = C/B  
Example =  $64B/16B = 4$
- #Word in Cache = C/WS  
Example =  $64B/4B = 16$
- #indexes = #Sets = S = N/W  
Example = 2
- #Bits for Index = log #Blocks  
Example = 1

Block-0: 10, 20, 30, 40

Block-1: 11, 07, 104, 120

Block-2: 19, 72, 63, 52

Block-3: 25, 36, 71, 203

Block-4: 302, 340, 2, 91

Block-5: 53, 47, 470, 21

Block-6: 33, 43, 44, 92

Block-7: 74, 10, 42, 20

Block-8: 63, 14, 1076, 93

Block-9: 91, 71, 77, 512

Block-10: 1, 9, 0, 32

Block-11: 0, 0, 0, 0

Block-12: 1, 1, 1, 1

Block-13: 97, 64, 48, 87

Block-14: 12, 10, 0, 1024

Block-15: 15, 17, 43, 76

- #Words/Memory in Example = 64
- #Bits in Physical Address =  $\log \#Words/Memory + \log \#wordsize$   
Example:  $\log 64 + \log 4 = 8$
- Main Memory Size = M  
Example: M =  $64W = 64 * 4$  B
- #Blocks/Memory = M/B  
Example =  $256B/16B = 16$

# Set Associative Cache: Example

A=128 = 10000000

T= 1 = 100

I = 0 = 0

BO=0 = 0000

State

Tag

Data

|   |    |     |     |                |   |
|---|----|-----|-----|----------------|---|
| V | MP | 001 | 100 | 19, 72, 63, 52 | X |
| I | I  | X   | X   | X              | X |

Miss

Block Address = BA = A/B

Example:  $128/16 = 8$

Index = I = BA%S

Example:  $8\%2 = 0$

Tag = T = BA/S

Example:  $8/2 = 4$

Block Offset = BO = A %B

Example:  $128 \% 16 = 0$

- Word Size = WS  
Example: WS = 4B
- Block Size = B  
Example: B = 16B
- #Words/Block = B/W  
Example: #Words/Block = 4
- #Bits/Word = log W  
Example =  $\log 4 = 2$  bits  
Address grows after 2 bits  
PC = PC + 4
- #Bits/Block = log B  
Example:  $\log 16 = 4$  bits

- Cache Size = C  
Example: C = 64B
- #Way = Number blocks to be searched in parallel  
Example: W = 2
- #Blocks in Cache = N = C/B  
Example =  $64B/16B = 4$
- #Word in Cache = C/WS  
Example =  $64B/4B = 16$
- #indexes = #Sets = S = N/W  
Example = 2
- #Bits for Index = log #Blocks  
Example = 1

Block-0: 10, 20, 30, 40

Block-1: 11, 07, 104, 120

Block-2: 19, 72, 63, 52

Block-3: 25, 36, 71, 203

Block-4: 302, 340, 2, 91

Block-5: 53, 47, 470, 21

Block-6: 33, 43, 44, 92

Block-7: 74, 10, 42, 20

Block-8: 63, 14, 1076, 93

Block-9: 91, 71, 77, 512

Block-10: 1, 9, 0, 32

Block-11: 0, 0, 0, 0

Block-12: 1, 1, 1, 1

Block-13: 97, 64, 48, 87

Block-14: 12, 10, 0, 1024

Block-15: 15, 17, 43, 76

- #Words/Memory in Example = 64
- #Bits in Physical Address =  $\log \#Words/Memory + \log \#wordsize$   
Example:  $\log 64 + \log 4 = 8$
- Main Memory Size = M  
Example: M =  $64W = 64 * 4$  B
- #Blocks/Memory = M/B  
Example =  $256B/16B = 16$

# Set Associative Cache: Example

A=128 = 10000000

T= 1 = 100

I = 0 = 0

BO=0 = 0000

**State**

**Tag**

**Data**

|   |    |     |     |                |                  |
|---|----|-----|-----|----------------|------------------|
| V | MP | 001 | 100 | 19, 72, 63, 52 | 63, 14, 1076, 93 |
| I | I  | X   | X   | X              | X                |



Block Address = BA = A/B

Example:  $128/16 = 8$

Index = I = BA%S

Example:  $8\%2 = 0$

Tag = T = BA/S

Example:  $8/2 = 4$

Block Offset = BO = A %B

Example:  $128 \% 16 = 0$

- Word Size = WS  
Example: WS = 4B
- Block Size = B  
Example: B = 16B
- #Words/Block = B/W  
Example: #Words/Block = 4
- #Bits/Word = log W  
Example =  $\log 4 = 2$  bits  
Address grows after 2 bits  
PC = PC + 4
- #Bits/Block = log B  
Example:  $\log 16 = 4$  bits

- Cache Size = C  
Example: C = 64B
- #Way = Number blocks to be searched in parallel  
Example: W = 2
- #Blocks in Cache = N = C/B  
Example =  $64B/16B = 4$
- #Word in Cache = C/WS  
Example =  $64B/4B = 16$
- #indexes = #Sets = S = N/W  
Example = 2
- #Bits for Index = log #Blocks  
Example = 1

Block-0: 10, 20, 30, 40

Block-1: 11, 07, 104, 120

Block-2: 19, 72, 63, 52

Block-3: 25, 36, 71, 203

Block-4: 302, 340, 2, 91

Block-5: 53, 47, 470, 21

Block-6: 33, 43, 44, 92

Block-7: 74, 10, 42, 20

Block-8: 63, 14, 1076, 93

Block-9: 91, 71, 77, 512

Block-10: 1, 9, 0, 32

Block-11: 0, 0, 0, 0

Block-12: 1, 1, 1, 1

Block-13: 97, 64, 48, 87

Block-14: 12, 10, 0, 1024

Block-15: 15, 17, 43, 76

- #Words/Memory in Example = 64
- #Bits in Physical Address =  $\log$  #Words/Memory + log #wordsize  
Example:  $\log 64 + \log 4 = 8$
- Main Memory Size = M  
Example:  $M = 64W = 64 * 4 B$
- #Blocks/Memory = M/B  
Example =  $256B/16B = 16$

# Set Associative Cache: Example

A=192 = 11000000

T= 1 = 110

I = 0 = 0

BO=0 = 0000

**State**

**Tag**

**Data**

|   |   |     |     |                |                  |
|---|---|-----|-----|----------------|------------------|
| V | V | 001 | 100 | 19, 72, 63, 52 | 63, 14, 1076, 93 |
| I | I | X   | X   | X              | X                |

**Miss**

Block Address = BA = A/B

Example:  $192/16 = 12$

Index = I = BA%S

Example:  $12\%2 = 0$

Tag = T = BA/S

Example:  $12/2 = 6$

Block Offset = BO = A %B

Example:  $192 \% 16 = 0$

- Word Size = WS  
Example: WS = 4B
- Block Size = B  
Example: B = 16B
- #Words/Block = B/W  
Example: #Words/Block = 4
- #Bits/Word = log W  
Example =  $\log 4 = 2$  bits  
Address grows after 2 bits  
PC = PC + 4
- #Bits/Block = log B  
Example:  $\log 16 = 4$  bits

- Cache Size = C  
Example: C = 64B
- #Way = Number blocks to be searched in parallel  
Example: W = 2
- #Blocks in Cache = N = C/B  
Example =  $64B/16B = 4$
- #Word in Cache = C/WS  
Example =  $64B/4B = 16$
- #indexes = #Sets = S = N/W  
Example = 2
- #Bits for Index = log #Blocks  
Example = 1

Block-0: 10, 20, 30, 40

Block-1: 11, 07, 104, 120

Block-2: 19, 72, 63, 52

Block-3: 25, 36, 71, 203

Block-4: 302, 340, 2, 91

Block-5: 53, 47, 470, 21

Block-6: 33, 43, 44, 92

Block-7: 74, 10, 42, 20

Block-8: 63, 14, 1076, 93

Block-9: 91, 71, 77, 512

Block-10: 1, 9, 0, 32

Block-11: 0, 0, 0, 0

Block-12: 1, 1, 1, 1

Block-13: 97, 64, 48, 87

Block-14: 12, 10, 0, 1024

Block-15: 15, 17, 43, 76

- #Words/Memory in Example = 64
- #Bits in Physical Address =  $\log \#Words/Memory + \log \#wordsize$   
Example:  $\log 64 + \log 4 = 8$
- Main Memory Size = M  
Example:  $M = 64W = 64 * 4 B$
- #Blocks/Memory =  $M/B$   
Example =  $256B/16B = 16$

# Set Associative Cache: Example

A=192 = 11000000

T= 1 = 110

I = 0 = 0

BO=0 = 0000

**State**

**Tag**

**Data**



We don't care about this data because I is not a valid state

Replace

Block Address = BA = A/B

Example: 192/16 = 12

Index = I = BA%S

Example: 12%2 = 0

Tag = T = BA/S

Example: 12/2 = 6

Block Offset = BO = A %B

Example: 192 % 16 = 0

- ❑ Word Size = WS  
Example: WS = 4B
- ❑ Block Size = B  
Example: B = 16B
- ❑ #Words/Block = B/W  
Example: #Words/Block = 4
- ❑ #Bits/Word = log W  
Example =  $\log 4 = 2$  bits  
Address grows after 2 bits  
PC = PC + 4
- ❑ #Bits/Block = log B  
Example:  $\log 16 = 4$  bits

- ❑ Cache Size = C  
Example: C = 64B
- ❑ #Way = Number blocks to be searched in parallel  
Example: W = 2
- ❑ #Blocks in Cache = N = C/B  
Example =  $64B/16B = 4$
- ❑ #Word in Cache = C/WS  
Example =  $64B/4B = 16$
- ❑ #indexes = #Sets = S = N/W  
Example = 2
- ❑ #Bits for Index = log #Blocks  
Example = 1

Block-0: 10, 20, 30, 40

Block-1: 11, 07, 104, 120

Block-2: 19, 72, 63, 52

Block-3: 25, 36, 71, 203

Block-4: 302, 340, 2, 91

Block-5: 53, 47, 470, 21

Block-6: 33, 43, 44, 92

Block-7: 74, 10, 42, 20

Block-8: 63, 14, 1076, 93

Block-9: 91, 71, 77, 512

Block-10: 1, 9, 0, 32

Block-11: 0, 0, 0, 0

Block-12: 1, 1, 1, 1

Block-13: 97, 64, 48, 87

Block-14: 12, 10, 0, 1024

Block-15: 15, 17, 43, 76

- ❑ #Words/Memory in Example = 64
- ❑ #Bits in Physical Address =  $\log \#Words/Memory + \log \#wordsize$   
Example:  $\log 64 + \log 4 = 8$
- ❑ Main Memory Size = M  
Example: M =  $64W = 64 * 4 B$
- ❑ #Blocks/Memory = M/B  
Example =  $256B/16B = 16$

# Set Associative Cache: Example

A=192 = 11000000

T= 1 = 110

I = 0 = 0

BO=0 = 0000

**State**

**Tag**

**Data**



We don't care about this data because MP is not a valid state

Reserve

Block Address = BA = A/B

Example:  $192/16 = 12$

Index = I = BA%S

Example:  $12\%2 = 0$

Tag = T = BA/S

Example:  $12/2 = 6$

Block Offset = BO = A %B

Example:  $192 \% 16 = 0$

- Word Size = WS  
Example: WS = 4B
- Block Size = B  
Example: B = 16B
- #Words/Block = B/W  
Example: #Words/Block = 4
- #Bits/Word = log W  
Example =  $\log 4 = 2$  bits  
Address grows after 2 bits  
PC = PC + 4
- #Bits/Block = log B  
Example:  $\log 16 = 4$  bits

- Cache Size = C  
Example: C = 64B
- #Way = Number blocks to be searched in parallel  
Example: W = 2
- #Blocks in Cache = N = C/B  
Example =  $64B/16B = 4$
- #Word in Cache = C/WS  
Example =  $64B/4B = 16$
- #Indexes = #Sets = S = N/W  
Example = 2
- #Bits for Index = log #Blocks  
Example = 1

Block-0: 10, 20, 30, 40

Block-1: 11, 07, 104, 120

Block-2: 19, 72, 63, 52

Block-3: 25, 36, 71, 203

Block-4: 302, 340, 2, 91

Block-5: 53, 47, 470, 21

Block-6: 33, 43, 44, 92

Block-7: 74, 10, 42, 20

Block-8: 63, 14, 1076, 93

Block-9: 91, 71, 77, 512

Block-10: 1, 9, 0, 32

Block-11: 0, 0, 0, 0

Block-12: 1, 1, 1, 1

Block-13: 97, 64, 48, 87

Block-14: 12, 10, 0, 1024

Block-15: 15, 17, 43, 76

- #Words/Memory in Example = 64
- #Bits in Physical Address =  $\log \#Words/Memory + \log \#wordsize$   
Example:  $\log 64 + \log 4 = 8$
- Main Memory Size = M  
Example:  $M = 64W = 64 * 4 B$
- #Blocks/Memory =  $M/B$   
Example =  $256B/16B = 16$

# Set Associative Cache: Example

A=192 = 11000000

T= 1 = 110

I = 0 = 0

BO=0 = 0000

**State**

**Tag**

**Data**

|   |   |     |     |         |                  |
|---|---|-----|-----|---------|------------------|
| V | V | 110 | 100 | 1,1,1,1 | 63, 14, 1076, 93 |
| I | I | X   | X   | X       | X                |



Block Address = BA = A/B

Example:  $192/16 = 12$

Index = I = BA%S

Example:  $12\%2 = 0$

Tag = T = BA/S

Example:  $12/2 = 6$

Block Offset = BO = A %B

Example:  $192 \% 16 = 0$

- ❑ Word Size = WS  
Example: WS = 4B
- ❑ Block Size = B  
Example: B = 16B
- ❑ #Words/Block = B/W  
Example: #Words/Block = 4
- ❑ #Bits/Word = log W  
Example =  $\log 4 = 2$  bits  
Address grows after 2 bits  
PC = PC + 4
- ❑ #Bits/Block = log B  
Example:  $\log 16 = 4$  bits

- ❑ Cache Size = C  
Example: C = 64B
- ❑ #Way = Number blocks to be searched in parallel  
Example: W = 2
- ❑ #Blocks in Cache = N = C/B  
Example =  $64B/16B = 4$
- ❑ #Word in Cache = C/WS  
Example =  $64B/4B = 16$
- ❑ #indexes = #Sets = S = N/W  
Example = 2
- ❑ #Bits for Index = log #Blocks  
Example = 1

Block-0: 10, 20, 30, 40

Block-1: 11, 07, 104, 120

Block-2: 19, 72, 63, 52

Block-3: 25, 36, 71, 203

Block-4: 302, 340, 2, 91

Block-5: 53, 47, 470, 21

Block-6: 33, 43, 44, 92

Block-7: 74, 10, 42, 20

Block-8: 63, 14, 1076, 93

Block-9: 91, 71, 77, 512

Block-10: 1, 9, 0, 32

Block-11: 0, 0, 0, 0

Block-12: 1, 1, 1, 1

Block-13: 97, 64, 48, 87

Block-14: 12, 10, 0, 1024

Block-15: 15, 17, 43, 76

- ❑ #Words/Memory in Example = 64
- ❑ #Bits in Physical Address =  $\log \#Words/Memory + \log \#wordsize$   
Example:  $\log 64 + \log 4 = 8$
- ❑ Main Memory Size = M  
Example: M =  $64W = 64 * 4 B$
- ❑ #Blocks/Memory = M/B  
Example =  $256B/16B = 16$

# Set Associative Cache: Design

- A physical address can map to any subset of location in the cache
- **Set:** Numbers of Indices
- **Way:** W-Blocks in a set are searched in parallel. Requires comparator for all the block.
- **Replacement Policy:** A block allocation policy works inside a set
- **Benefit:**
  - Can place blocks anywhere inside a set and hence, relatively low conflict. (High hit-rate).
  - Multiplexor size decreases and hence, the hit time.
  - Less comparators and hence, consume a less power.
  
- Block Size = B, Block Bits =  $\log B$ , Block Offset = ?
- Cache Size = C, Associativity = #Way = W
- #Blocks = N = C/B, #Sets = N/W
- Index Bits =  $\log(N/W)$ , Index = ?
- Tag Bits = A -  $\log(N/W)$ , Tag = ?
- State Bits = S
- Tag Array Size =  $N * (S + A - \log N/B)$



# Set Associative Cache: Implementation

- $S = \text{Block}/\text{CPUReq.Index}$
- for each way  $B$  in the Set  $S$** 
  - If  $B.\text{Tag} = \text{CPUReq.Tag} \& B.\text{State} = V$   
Then  $\text{CPUResp} = B.\text{Data} // \text{Return}$
  - If  $B.\text{Tag} = \text{CPUReq.Tag} \& B.\text{State} = MP$   
Then  $\text{CPUReq.wait} // \text{Return}$
- $B = evict(S)$
- $B.\text{State} = MP$
- $B.\text{Tag} = \text{CPUReq.Tag}$
- $\text{MemReq} = \text{CPUReq}$

- $S = \text{Block}/\text{CPUReq.Index}$
- for each way  $B$  in the Set  $S$** 
  - If  $B.\text{Tag} = \text{MemResp.Tag} \& B.\text{State} = MP$   
Then  $B.\text{State} = V$   
 $B.\text{Data} = \text{MemResp.Data}$   
 $\text{CPUResp.Data} = B.\text{Data}$   
Notify All Waiting To Restart



# Set-Associative Cache

- **Problem:** A 4-way set-associative cache of size 32KB can be addressed using a 40 bit address. Block size of the cache is 64B. Find the total number of bits required for indexing, block offset and tag. Also find the tag array size.
  - Block Size = 64 B => Block Offset = 6 Bits
  - Cache Size = 32 KB => No. of Blocks =  $32\text{KB}/64\text{B} = 512$
  - Ways = 4
  - No. of Sets =  $512/4 = 128$
  - Indexing or Set bits =  $\log 128 = 7$
  - Tag bits =  $40 - 7 - 6 = 27$
  - Tag array size =  $(27 + 1) * 512 \text{ bits} = 28 * 64 \text{ B}$



# Block Replacement: Concept and Design

- **Replacement Policy:** A block allocation policy that works inside a set
- An existing block is replaced when a new block is allocated inside a set.
- Predictions are used to replace the block. Why? -> We don't know future.



| B | I | T | D  |
|---|---|---|----|
| A |   |   |    |
| 0 | 0 | 0 | x  |
| 1 | 1 | 0 | x  |
| 2 | 0 | 1 | x  |
| 3 | 1 | 1 | 45 |
| 4 | 0 | 2 | x  |
| 5 | 1 | 2 | x  |
| 6 | 0 | 3 | x  |
| 7 | 1 | 3 | 10 |
| 8 | 0 | 4 | x  |
| 9 | 1 | 4 | 17 |

# Block Replacement: Concept and Implementation

## FIFO Replacement

- EvictionAlgorithm:
  - $S = \text{Set}[\text{CPUReq.Index}]$
  - $\text{ReplacedBlock} = \text{Front}(S)$
- Advantages:
  - Has certainty
- Disadvantages:
  - What about the blocks which are re-used?

Study! Pseudo LRU and Re-reference Interval Prediction!

## Random Replacement

- EvictionAlgorithm:
  - $S = \text{Set}[\text{CPUReq.Index}]$
  - $\text{ReplacedBlock} = \text{Random}(S)$
- Advantages:
  - Easy to Implement
- Disadvantages:
  - Causes performance penalty
  - No certainty

## LRU Replacement

- EvictionAlgorithm:
  - $S = \text{Set}[\text{CPUReq.Index}]$
  - $\text{ReplacedBlock} = \maxTime(\text{Accessed}[S])$
- BlockAccessReplacementAlgorithm:
  - $\text{Accessed}[B] = \text{SystemClock}$
- Advantages:
  - Reuse is taken care-of
- Disadvantages:
  - Complex implementation in Hardware

# Replacement Policy: Implementation

```

•  $S = \text{Block}[\text{CPUReq.Index}]$ 
• for each way  $B$  in the Set  $S$ 
  • If  $B.\text{Tag} = \text{CPUReq.Tag}$  &  $B.\text{State} = V$ 
    Then  $\text{CPUResp} = B.\text{Data}$  // Return
  • If  $B.\text{Tag} = \text{CPUReq.Tag}$  &  $B.\text{State} = MP$ 
    Then  $\text{CPUReq.wait}$  // Return
  •  $B = \text{EvictionAlgorithm}(S)$ 
  •  $B.\text{State} = MP$ 
  •  $B.\text{Tag} = \text{CPUReq.Tag}$ 
  •  $\text{MemReq} = \text{CPUReq}$ 
  •  $\text{BlockAccessReplacement}()$ 

```

```

•  $S = \text{Block}[\text{CPUReq.Index}]$ 
• for each way  $B$  in the Set  $S$ 
  • If  $B.\text{Tag} = \text{MemResp.Tag}$  &  $B.\text{State} = MP$ 
    Then  $B.\text{State} = V$ 
     $B.\text{Data} = \text{MemResp.Data}$ 
     $\text{CPUResp.Data} = B.\text{Data}$ 
     $\text{BlockAccessReplacement}()$ 
  • Notify All Waiting To Restart

```



# Write Strategy

Writes are important because the order in which they are serviced maintain coherence of the data



# Write-through Cache: Overview and Design

- Simplest way of maintaining memory coherence with the main memory.
- Data is updated in the cache and is also updated to the main memory immediately.
- Implementation and hardware complexity is low.
- Performance bottleneck as the memory starts seeing writes to the same address.
- Write-through caches are used only with Write-no allocate.



| B<br>A | I | T | D  |
|--------|---|---|----|
| 0      | 0 | 0 | x  |
| 1      | 1 | 0 | x  |
| 2      | 0 | 1 | x  |
| 3      | 1 | 1 | 45 |
| 4      | 0 | 2 | x  |
| 5      | 1 | 2 | x  |
| 6      | 0 | 3 | x  |
| 7      | 1 | 3 | 50 |
| 8      | 0 | 4 | x  |
| 9      | 1 | 4 | 17 |

# Write-through: Mealy Machine

- ❑ **Inputs:**
  - ❑ CPURd and CPUWr for load and store instructions.
  - ❑ They can be invalidates from replacement/bus.
  - ❑ Memory response for read misses
  
- ❑ **Outputs:**
  - ❑ MemRd and MemWr for memory reads and writes upon read miss and writes.
  - ❑ CPUResp for reads

**State Table**

| CS | Input           | NS | Output               |
|----|-----------------|----|----------------------|
| I  | CPURd           | MP | MemRd                |
| I  | CPUWr           | I  | MemWr                |
| I  | Invalidate      | I  | -                    |
| MP | CPURd/<br>CPUWr | MP | Action is<br>to wait |
| MP | MemResp         | V  | CPUResp              |
| V  | CPURd           | V  | CPUResp              |
| V  | CPUWr           | V  | MemWr                |
| V  | Invalidate      | I  | -                    |



- **States:** Status of all the words part of a block inside the cache.
  - **Invalid (I):** Block is not present inside the cache.
  - **Valid (V):** Block is present inside the cache and all the words in the block have the same content as main memory.
  - **MissPending (MP):** Tags are present but data has not been received from the memory.

# Write-through: Implementation

- $S = \text{Block}[\text{CPUReq.Index}]$
- for each way  $B$  in the Set  $S$ 
  - If  $B.\text{Tag} = \text{CPUReq.Tag}$  &  $B.\text{State} = V$   
Then If  $\text{CPUReq.Type} = READ$   
 $\text{CPUResp} = B.\text{Data}$  // Return
  - Else  $B.\text{Data} = \text{CPUReq.Data}$
  - If  $B.\text{Tag} = \text{CPUReq.Tag}$  &  $B.\text{State} = MP$   
Then  $\text{CPUReq.wait} // \text{Return}$
- If  $\text{CPUReq.Type} = READ$ 
  - Then  $B = \text{EvictionAlgorithm}(S)$
  - $B.\text{State} = MP$
  - $B.\text{Tag} = \text{CPUReq.Tag}$
  - $\text{BlockAccessReplacement}()$
- $\text{MemReq} = \text{CPUReq}$



# Write-back Cache: Overview and Design

- ❑ Lazy updation of main memory
- ❑ Data is updated in the cache and is only updated to the main memory after eviction of block.
- ❑ Implementation and hardware complexity is high.
- ❑ Performance bottleneck is very low.
- ❑ Write-back caches are used with write-no allocate and write-allocate.



| B | I | T | D  |
|---|---|---|----|
| A | 0 | 0 | x  |
|   | 1 | 1 | 0  |
|   | 2 | 0 | 1  |
|   | 3 | 1 | 1  |
|   | 4 | 0 | 2  |
|   | 5 | 1 | 2  |
|   | 6 | 0 | 3  |
|   | 7 | 1 | 3  |
|   | 8 | 0 | 4  |
|   | 9 | 1 | 17 |

# Write-no Allocate Cache: Overview and Design

- Does not allocate the cache line when there is a cache miss for writes.
- Data is updated in the next level directly.
- Advantage: Very simple to design, For one touch write operation, this design is performance friendly.
- Issue: Does not exploit spatial locality for writes



| B | I | T | D  |
|---|---|---|----|
| A | 0 | 0 | x  |
| 0 | 1 | 0 | x  |
| 1 | 0 | 1 | x  |
| 2 | 1 | 1 | 45 |
| 3 | 0 | 2 | x  |
| 4 | 1 | 2 | x  |
| 5 | 0 | 3 | x  |
| 6 | 1 | 3 | 50 |
| 7 | 0 | 4 | x  |
| 8 | 1 | 4 | 17 |
| 9 |   |   |    |

# Write-back No-Allocate: Mealy Machine

**State Table**

| CS | Input           | NS | Output               |
|----|-----------------|----|----------------------|
| I  | CPURd           | MP | MemRd                |
| I  | CPUWr           | I  | MemWr                |
| I  | Invalidate      | I  | -                    |
| MP | CPURd/<br>CPUWr | MP | Action is<br>to wait |
| MP | MemResp         | V  | CPUResp              |

| CS | Input      | NS | Output    |
|----|------------|----|-----------|
| V  | CPURd      | V  | CPUResp   |
| V  | CPUWr      | M  | -         |
| V  | Invalidate | I  | -         |
| M  | CPURd      | M  | CPUResp   |
| M  | CPUWr      | M  | -         |
| M  | Invalidate | I  | Writeback |



- **Modified (M):** The data is different from that of the main memory and hence, need to be written back on eviction.

# Write-back No-Allocate: Access Algorithm

```

• S = Block[CPUReq.Index]
• for each way B in the Set S
  • If B.Tag = CPUReq.Tag & B.State = {V,M} & CPUReq.read
  • Then CPUResp = B.Data // Return
  • If B.Tag = CPUReq.Tag & B.State = {V,M} & CPUReq.write
  • Then B.Data = Merge(B.Data, CPUReq.Data)
  • B.State = M // Return
  • If B.Tag = CPUReq.Tag & B.State = MP
  • Then CPUReq.wait // Return
  • If CPUReq.read
  • Then B = EvictionAlgorithm(S)
    B.State = MP
    B.Tag = CPUReq.Tag
    BlockAccessReplacement()
  • MemReq = CPUReq

```



# Write-back No Allocate: Fill Algorithm



# Write-allocate Cache: Overview and Design

- Allocates a cache line when there is a cache miss for writes
- Upon a write request that is a miss, generate a read request corresponding to this address to read the data from memory.
- Issue: For one touch write operation, this design is not performance friendly.
- Advantages: Exploit spatial locality for writes



| B | I | T | D  |
|---|---|---|----|
| A | 0 | 0 | x  |
|   | 1 | 1 | x  |
| 2 | 0 | 1 | x  |
| 3 | 1 | 1 | 45 |
| 4 | 0 | 2 | x  |
| 5 | 1 | 2 | x  |
| 6 | 0 | 3 | x  |
| 7 | 1 | 3 | 10 |
| 8 | 0 | 4 | x  |
| 9 | 1 | 4 | 17 |

# Write-back Allocate: Mealy Machine

**State Table**

| CS | Input              | NS | Output                    |
|----|--------------------|----|---------------------------|
| I  | CPURd              | MP | MemRd                     |
| I  | CPUWr              | MP | MemRd                     |
| I  | Invalidate         | I  | -                         |
| MP | CPURd/<br>CPUWr    | MP | Action is<br>to wait      |
| MP | MemResp<br>& wasRd | V  | CacheFill<br>&<br>CPUResp |
| MP | MemResp<br>& wasWr | M  | CacheFill                 |

| CS | Input      | NS | Output    |
|----|------------|----|-----------|
| V  | CPURd      | V  | CPUResp   |
| V  | CPUWr      | M  | -         |
| V  | Invalidate | I  | -         |
| M  | CPURd      | M  | CPUResp   |
| M  | CPUWr      | M  | -         |
| M  | Invalidate | I  | Writeback |



- **Modified (M):** The data is different from that of the main memory and hence, need to be written back on eviction.

# Write-back Allocate: Access Algorithm

```

•  $S = \text{Block}[\text{CPUReq.Index}]$ 
• for each way  $B$  in the Set  $S$ 
  • If  $B.\text{Tag} = \text{CPUReq.Tag}$  &  $B.\text{State} = \{\text{V}, \text{M}\} \& \text{CPUReq.read}$ 
  • Then  $\text{CPUResp} = B.\text{Data} // \text{Return}$ 
  • If  $B.\text{Tag} = \text{CPUReq.Tag}$  &  $B.\text{State} = \{\text{V}, \text{M}\} \& \text{CPUReq.write}$ 
  • Then  $B.\text{Data} = \text{Merge}(B.\text{Data}, \text{CPUReq.Data})$ 
  •  $B.\text{State} = \text{M} // \text{Return}$ 
  • If  $B.\text{Tag} = \text{CPUReq.Tag}$  &  $B.\text{State} = \text{MP}$ 
  • Then  $\text{CPUReq.wait} // \text{Return}$ 
  •  $B = \text{EvictionAlgorithm}(S)$ 
  •  $B.\text{State} = \text{MP}$ 
  •  $B.\text{Tag} = \text{CPUReq.Tag}$ 
  •  $\text{MemReq} = \text{CPUReq as Read}$ 
  •  $\text{BlockAccessReplacement}()$ 

```



# Write-back Allocate: Fill Algorithm



# Problem

- **Problem:** A 8-way set-associative cache of size 16KB can be addressed using a 40 bit address. Block size of the cache is 64B. Find the total number of bits required for indexing, block offset and tag. Also find the tag array size. Determine the tag array size if the cache is designed as write-through and write-back protocol.
  - Block Size = 64 B => Block Offset = 6 Bits
  - Cache Size = 16 KB => No. of Blocks =  $16\text{KB}/64\text{B} = 256$
  - Ways = 8
  - No. of Sets =  $256/8 = 32$
  - Indexing or Set bits =  $\log 32 = 5$
  - Tag bits =  $40 - 6 - 5 = 29$
  - Tag array size (write-through) =  $(29 + 1) * 256 \text{ bits} = 30 * 32 \text{ B}$
  - Tag array size (write-back) =  $(29 + 2) * 256 \text{ bits} = 31 * 32 \text{ B}$



# Causes of Cache Miss

- **Compulsory Miss:** First reference to an address
- **Capacity Miss:**
  - Occurs when working set does not fit inside the cache
- **Conflict/Collision Miss:**
  - It is not necessary that cache is full for a conflict miss occurs.
  - Conflict due to block placement in the same index.
  - They don't occur if the cache is fully-associative with LRU replacement policy.
  - Conflict misses are maximum with direct-mapped cache
- **Coherence Miss:** Occurs in multiprocessor/external processor system (Will study later)

Q: A direct mapped cache of size 4 has the following accesses: 0,1,2,3,4,1,2,3,0,4,0. Categorize the misses

Ans:

0- Compulsory Miss  
1- Compulsory Miss  
2- Compulsory Miss  
3- Compulsory Miss

4- Compulsory Miss  
1- Hit  
2- Hit  
3- Hit

0- Capacity Miss  
4- Capacity Miss  
0- Conflict Miss

# Large Cache Size and Higher Associativity

- Large Cache means a large subset of main memory close to the processor
- Large size of cache reduces conflict and capacity misses
- Performance is directly proportional to hit rate; it saturates after some particular configuration.
- Higher associativity reduces conflict misses
- Performance saturates when working set of an application fits inside the cache



# Large Block Size

- Large block size reduces miss rate
- Compulsory misses are the targets
- Exploits spatial locality
- Increases miss penalty
- Could increase conflict/capacity miss for same cache size



# Multilevel Cache

- Targets miss penalty
- Large caches have high access time. Why?
- Local Miss Rate
- Global Miss Rate

$$\text{Average memory access time} = \text{Hit time}_{L1} + \text{Miss rate}_{L1} \times \text{Miss penalty}_{L1}$$

$$\text{Miss penalty}_{L1} = \text{Hit time}_{L2} + \text{Miss rate}_{L2} \times \text{Miss penalty}_{L2}$$

$$\begin{aligned} \text{Average memory access time} &= \text{Hit time}_{L1} + \text{Miss rate}_{L1} \\ &\quad \times (\text{Hit time}_{L2} + \text{Miss rate}_{L2} \times \text{Miss penalty}_{L2}) \end{aligned}$$

$$\text{Miss rate}_{L1} \times \text{Miss rate}_{L2}$$

$$\begin{aligned} \text{Average memory stalls per instruction} &= \text{Misses per instruction}_{L1} \times \text{Hit time}_{L2} \\ &\quad + \text{Misses per instruction}_{L2} \times \text{Miss penalty}_{L2} \end{aligned}$$



Three Level Cache Organization

# Small and Simple First Level Cache

- Cache Access Time on Hit = Indexing time + Tag read/compare time + Multiplexor latency + Data access time + Request queuing delay etc.
- Less number of sets will have low index time
- Large tag size will have more power consumption
- More number of way will have high multiplexor latency
- Large cache size will have long data access latency
- L1 cache is close to the core and hence, must have very low hit latency



# Way Prediction

- Target: Power Optimization
- Predict a way that would be hit and access that way.
- On miss, access all the other ways.
- Can use a simple history predictor like n-bit per set



\*Way-predicting set-associative cache for high performance and low energy consumption  
ISLPED 1999

# Index Hashing

- Target: Miss Rate
- Attempts to reduce conflict misses due to hot spot
- What is hot-spot?
- Typically, higher-order bits of physical address (tags) are xor-ed with the index bits to get a hashed index
- Large caches may not require index hashing. Why?
- Direct mapped caches require index hashing. Why?



Images from paper by Gert-Jan van den Braak et al.,  
“Configurable XOR Hash Functions for Banked Scratchpad Memories in GPUs”,  
IEEE Transaction in Computers, Vol. 65, Issue. 7, July 2016