

## 1 2 Early/Simple Machines

**Computer Architecture:** design of the abstraction layers to implement info processing apps

**Instruction Set Architecture:** contract between software & hardware, giving programmer visible state

## 3 Microcoding

**Microcoding:** technique to manage control unit

**Single-Bus Datapath for Microcoded RISC-V**



**μPC jump = next | spin | fetch | dispatch | fine | fine**

### Horizontal

- fewer microcode steps

- sparser encoding  $\rightarrow$  more bits

### Vertical Microcode

- single datapath op per micro

- more compact  $\rightarrow$  less bits

## 4 5 Pipelining

**Iron Law:**  $\frac{\text{time}}{\text{program}} = \frac{\text{instr}}{\text{program}} * \frac{\text{cycles}}{\text{instr}} + \frac{\text{time}}{\text{cycle}}$



**Structural Hazards:** needs a resource being used by another  
 $\hookrightarrow$  Fix: adding more hardware

**Data Hazard:** depends on data from prev instruction

- RAW: Read after Write, data-dependence

- WAR: Write after Read, anti-dependence

- WAW: Write after Write, output dependence

$\hookrightarrow$  Fix: Interlock: wait by holding next instr in issue stage

$\hookrightarrow$  Fix: Bypass: resolve hazard earlier by bypassing when available

$\hookrightarrow$  Fix: Speculate: guess on value, correct if wrong

**Control Hazard:** depends on branch/exception for next instr addrs

- Branch Delay Slots: next instr after branch/jump is always executed before control flow change

$\hookrightarrow$  Complicates microarchitectures, removed in 1990s

$\hookrightarrow$  Fix: Branch prediction: start executing next instr based on predictions

**CPI:** measure cycles from first instr start to last instr finish

Full bypassing may be too expensive to implement

**Exception:** unusual internal event caused by program during execution

**Interrupt:** external event outside of running program

**Traps:** forced transfer of control to supervisor caused by exception or interrupt

**Asynchronous Interrupts:** I/O device requests attention by asserting one of prioritized interrupt request lines

**Trap Handler:** saves EPC before enabling interrupts to allow nested interrupts

**Synchronous Trap:** caused by exception on particular instr, must be restarted

Hold exceptions in pipeline until commit pt, earlier overrides later exception

- Pipeline hazards avoided through software techniques: scheduling, loop unrolling

- Delay writebacks so all operations have same latency to write

## 6 7 Memory

**CPU**  $\rightarrow$  **Small, Fast Memory (RF, SRAM)**

**Big, Slow Memory (DRAM)**  $\rightarrow$  **Asynchronous Interruptions**

**Latency:** time taken for a single operation start to finish, memory access  $\gg$  processor cycle time

**Bandwidth:** rate at which operations can be performed

**Occupancy:** time during which unit blocked on a operation

**In-order Super-scalar Pipeline:** fetch multiple instr per cycle

**Temporal Locality:** if referenced recently, likely to be referenced again soon

**Spatial Locality:** if referenced nearby, other close locations likely

### Caches

#### Direct Mapped

Each memory addr associated w/ one possible line within cache. N lines in each set. N compares  $\Rightarrow$  N ways



#### Fully Associative

Line can go in any row, 1 set



#### N-way Associative



#### Replacement Policy

1) Random: fast

2) Least Recently Used (LRU): complex, must be updated every access

3) First-In, First-Out (FIFO): highly associative caches

4) Not Most Recently Used (NMRU): FIFO w/ exception for most recently used line

#### Average Memory Access Time (AMAT)

hit time + miss rate \* miss penalty  
 to improve performance, reduce any of the three

#### Cache Misses:

- 1) Compulsory: first reference to a line, misses even w/ infinite cache
- 2) Capacity: cache too small, miss would occur w/ perfect replacement policy
- 3) Conflict: miss occurs b/c of collisions w/ line placement, would not occur w/ full associativity

Larger cache size: + capacity, conflict misses - hit time

Higher associativity: + b conflict misses, - increases hit time

Larger Line Size:  $\downarrow$  compulsory misses  $\rightarrow$  conflict, miss penalty  
miss tag overhead

### Write Policy

- ① Write Through: write both cache & memory
- ② Write Back: write cache only, memory written when entry evicted
- ③ nowrite-allocate: only write main memory on miss
- ④ Write allocate: fetch into cache on miss



$\hookrightarrow$  Write Buffer: hold updated value of location needed by read miss

$\hookrightarrow$  Sub-blocks (sector cache): valid bit added to units smaller than full line, read sub block on a miss

$\hookrightarrow$  Multi-level Cache: increasing sizes of cache at each level  
local miss rate: miss in cache / access to cache  
Global miss rate: miss in cache / CPU memory accesses  
Misses per instruction (MPI): misses in cache / # of instr

Inclusive multilevel cache: inner cache only lines also in outer

### ⑧ Prefetching

Prefetching: speculate on future instructions and data accesses and fetch into cache

Usefulness: should produce hits

Timeliness: not late and not early

- ① Prefetch on miss: prefetch b+1 on b miss
- ② On Block Lookahead (OBL) Scheme: prefetch b+1 for b access
- ③ Striped Prefetch: follow sequence b+N, b+2N, ...

### ⑨ Address Translation

Base-and-Bound: base register and bound as boundaries

- External Fragmentation: non contiguous memory usage

Faged Memory System: program generated address split

into page num and offset,

can store large contiguous

virtual memory space using non-contiguous physical memory pages

+ process memory grows and shrinks dynamically

- Internal Fragmentation: not all bytes on page used

Hierarchical Page Tables: multiple levels, allocate when need

Translation Lookaside Buffers (TLB): cache translations

typically fully associative, random or FIFO

### ⑩ Virtual Memory

Protection & Privacy: several users w/ private address space & one or more shared address spaces

Demand Paging: can run programs larger than primary memory

Page faults handled by

OS since it takes a while

Virtual Index/Virtual Tag (VIVT) cache before TLB, translate on miss

- Prevent aliasing in Cache by direct mapped



### Exam

#### Microcode

When M[x]:

| PseudoCode            | IdIR | Reg Sel | Reg Wr | en Reg | Id A | Id B | ALUOp   | en ALU | Id MA | Mem Mem | en Mem | en Imm | Im Sel | en Br | Next State |
|-----------------------|------|---------|--------|--------|------|------|---------|--------|-------|---------|--------|--------|--------|-------|------------|
| MA < R[rs1]           | 0    | rs1     | 0      | 1      | *    | *    | *       | 0      | 1     | 0       | 0      | *      | 0      | N     | *          |
| A < Mem               | 0    | *       | 0      | 0      | 1    | *    | *       | 0      | 0     | 0       | 1      | *      | 0      | S     |            |
| if (A < B) goto DONE  | 0    | rd      | 0      | 1      | 0    | 0    | SLTU    | 0      | 1     | 0       | 0      | *      | 0      | N     | DONE       |
| MA < R[rd]            |      |         |        |        |      |      |         |        |       |         |        |        |        |       |            |
| Mem <= A              | *    | *       | 0      | 0      | 0    | *    | COPY_A  | 1      | 0     | 1       | 0      | *      | 0      | S     |            |
| A, B < R[rs1]         | 0    | rs1     | 0      | 1      | 1    | 1    | *       | 0      | *     | 0       | 0      | *      | 0      | N     |            |
| R[rd] < A             | 0    | rd      | 1      | 0      | *    | 0    | COPY_A  | 1      | *     | 0       | 0      | *      | 0      | N     |            |
| A < A-1               | 0    | *       | 0      | 0      | 1    | *    | DEC_A-1 | 1      | *     | 0       | 0      | *      | 0      | N     |            |
| if (A==0) goto FETCH0 | 0    | rs2     | 1      | 0      | *    | 0    | COPY_A  | 1      | *     | 0       | 0      | *      | 0      | EZ    | FETCH0     |
| R[rs2] < A            |      |         |        |        |      |      |         |        |       |         |        |        |        |       |            |

#### Square

- always end on FETCH0

#### Memory

$$\# \text{virtual pages} = \frac{2^{\text{virtual address bits}}}{\text{page bytes}}$$

- max physical memory = PPN bits \* page

Page table entry is indexed by VPN

- Stores PPN and has memory address

- Physical memory found by ADN + offset

In Fully Bypassed CPU:

ADD  $\Rightarrow$  Store : store waits for updated accumulator

#### Digits

| Decimal | Hex | Binary |
|---------|-----|--------|
| 00      | 0   | 0000   |
| 01      | 1   | 0001   |
| 02      | 2   | 0010   |
| 03      | 3   | 0011   |
| 04      | 4   | 0100   |
| 05      | 5   | 0101   |
| 06      | 6   | 0110   |
| 07      | 7   | 0111   |
| 08      | 8   | 1000   |
| 09      | 9   | 1001   |
| 10      | A   | 1010   |
| 11      | B   | 1011   |
| 12      | C   | 1100   |
| 13      | D   | 1101   |
| 14      | E   | 1110   |
| 15      | F   | 1111   |

| 8              | 7              | 6              | 5              | 4              | 3              | 2              | 1              |
|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|
| 2 <sup>7</sup> | 2 <sup>6</sup> | 2 <sup>5</sup> | 2 <sup>4</sup> | 2 <sup>3</sup> | 2 <sup>2</sup> | 2 <sup>1</sup> | 2 <sup>0</sup> |
| 128            | 64             | 32             | 16             | 8              | 4              | 2              | 1              |

| halve line size (assoc + #sets constant) $\Rightarrow$ 1/2 cap              | compulsory                                                                                 | conflict                                                       | capacity                                                                                             | hit time                                           | miss rate                                                                                   | miss penalty                                      |
|-----------------------------------------------------------------------------|--------------------------------------------------------------------------------------------|----------------------------------------------------------------|------------------------------------------------------------------------------------------------------|----------------------------------------------------|---------------------------------------------------------------------------------------------|---------------------------------------------------|
| $\uparrow$ : shorter lines = fewer adj elems brought in w first access      | $\uparrow$ : program accesses more lines in total $\rightarrow$ more misses                | $\uparrow$ : capacity halved                                   | $\downarrow$ : smaller cache better than small increased tag check                                   | $\uparrow$ : smaller lines brought in faster       |                                                                                             |                                                   |
| $\rightarrow$ : assoc no impact on comp miss                                | $\uparrow$ : lower assoc = conflict misses $\uparrow$ and fewer places to put same element | $\rightarrow$ : capacity does not change                       | $\downarrow$ : #sets $\uparrow$ $\rightarrow$ tags smaller; fewer tags checked, fewer ways muxed out | $\uparrow$ : conflict misses                       | $\rightarrow$ : dominated by outer mem hierarchy                                            |                                                   |
| add good prefetching                                                        | $\downarrow$ : good prefetcher brings data before we need it                               | $\downarrow$ : prefetcher brings evicted lines back into \$    | $\downarrow$ : prefetch lines "just in time" $\rightarrow$ cap miss                                  | $\downarrow$ : no effect; prefetch not on hit path | $\downarrow$ : reduces miss rate                                                            | $\rightarrow$ : BUT may increase due to pollution |
| combine IS, DS $\Rightarrow$ L1 w combined cap (assoc + line size constant) | —                                                                                          | may $\uparrow$ : conflicts bt data + inst lines are introduced | $\uparrow$ : dual-port \$ slower than single port OR can freq stall                                  | $\downarrow$ : greater cap                         | $\uparrow$ : \$ more flexible OR may $\uparrow$ edge cases $\uparrow$ bt inst + data access | $\rightarrow$ : dominated by outer mem hierarchy  |

|                                                                           | IPC                                                                                      | CPI                                                                                     | seconds/cycle                                                                               | execution time                                                                              |
|---------------------------------------------------------------------------|------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------|
| add a branch delay slot                                                   | $\uparrow$ : must insert NOPs when the branch delay slot cannot be usefully filled       | $\downarrow$ : some control hazards eliminated, also NOPs execute quickly               | $\rightarrow$ : will not meaningfully change pipeline; or decrease: no branch kill          | $\rightarrow$ : depends on how often delay slot can be filled w/ useful work                |
| add a complex inst                                                        | $\downarrow$ : inst can condense a sequence of insts                                     | $\uparrow$ : can mean more stages, or more control logic                                | $\uparrow$ : more control logic + interlocks $\rightarrow$ incr crit path (also, no effect) | $\rightarrow$ : only if new inst will be taken advantage of                                 |
| reduce #regs in ISA                                                       | $\uparrow$ : values frequently spilled to stack; need to swap                            | $\uparrow$ : loads followed by dep. insts will cause more stalls                        | $\downarrow$ : fewer registers = shorter reg file access time                               | $\rightarrow$ : if program uses few regs                                                    |
| improve mem access speed                                                  | —                                                                                        | $\downarrow$ : less stall time for memory                                               | $\downarrow$ : if mem is on crit path or mem is 1 cycle                                     | $\downarrow$                                                                                |
| add 16-bit vers of common RISC-V insts                                    | —                                                                                        | $\downarrow$ : code size shrunk $\rightarrow$ fewer \$ misses + less waiting fetch time | $\uparrow$ : decode complexity                                                              | $\rightarrow$ : main adv=smaller code size; main disadvantage=complex decode                |
| for CISC, arch impl: ucode engine $\Rightarrow$ RISC pipeline + a decoder | —                                                                                        | $\downarrow$ : ucode engine needs mult clock cycles/inst; risc ~ 1                      | $\rightarrow$ : work done in 1 pipeline stage = 1 ucode cycle                               | $\downarrow$ : ucode decoding in CPI from pipeline is huge                                  |
| using wider ucode in a ucode machine                                      | $\rightarrow$ : ucode not visible @ ISA level                                            | $\downarrow$ : fewer microinst needed to implement ISA inst                             | $\uparrow$ : sparse encoding = larger ROM = slower access                                   | $\downarrow$ : wide ucode parallelism better than cycle time impact                         |
| pipelining ucode engine in a ucode machine                                | —                                                                                        | $\downarrow$ : pipeline is better than bus-based                                        | $\rightarrow$ : single bus is fast, but so is deep pipeline                                 | $\downarrow$ : inst throughput increases                                                    |
| add an L2 cache between L1 cache and DRAM                                 | $\rightarrow$ : L2 \$ not visible at ISA                                                 | $\downarrow$ : L2 \$ improves AMAT for L1 miss                                          | $\uparrow$ : larger SRAM = long c2q                                                         | $\downarrow$ : reduces L1 miss latency                                                      |
| add virtual memory                                                        | $\uparrow$ : page faults need extra inst for OS handler; SW TLB need refills on TLB miss | $\uparrow$ : inst fetch + mem ops use extra cycles for PT walks, but ok w/ TLB          | $\rightarrow$ : TLB accessed in    w/ VINT \$; should not impact crit path                  | $\uparrow$ : impact when working w/ apps (size > TLB reach) and not all pgs in physical mem |

## 11 Complex Pipelines

**Dispatch:** instr decoded and enters ROB, always in-order  
**Issue Stage:** holds instr waiting to issue for execution  
**#Registers limit instructions in pipeline**  
**Register renaming:** in decode, add to Reorder Buffer (ROB)  
**Completion:** kept in ROB until commit for precise exceptions  
 - Limited by RAW, WAR



## 12 13 Out-of-Order Execution

**In-order Issue:** stalls on RAW, NAR/WAW, cannot issue unless all preceding instr issued to execution units

**Out-of-order Issue:** dispatched in order to reservation stations, following instr can be issued out of order  
 - data & tag in reservation stations vs. tags only in reservations, data in Unified Physical Register file  
 ↳ if "P" set, use value in register file

**Reorder Buffer:** active instr, decoded but not committed  
**Unified Physical Register:** single physical register file during decode, tags in ROB



## 14 Very Long Instruction Word (VLIW)

**VLIW:** multiple operations packed into one instr parallelism within an instr, avoid data hazards

**Trace Scheduling:** trace through most frequent branch using whole trace, and use profiling feedback  
 ↳ Object code compatibility & obj code size  
 ↳ Scheduling variable latency mem op

**Static Scheduling vs Dynamic Scheduling** instructions

## 15 Branch Prediction

**Dynamic Branch Prediction:** based on past behavior, temporal or spatial correlation

**Branch History Table (BHT):** change prediction after consecutive mistakes

**Branch Target Buffer (BTB):** keep branch pt, target PC in the BTB, only taken branches jumps in BTB  
 ↳ can redirect fetches earlier

**Jump Register:** switch statements, jump to address, func, ra  
**In-order BrPred:** branch resolves before later instr complete  
**Out-of-order BrPred:** branch resolved after later instr complete  
**Speculative Store Buffer:** store speculative memory accesses

## 16 Multithreading

**Thread Level Parallelism (TLP):** many workloads can use TLP, hard to get more optimization from ILP  
 - Multiprogramming: run independent sequential jobs  
 - Multithreaded: run one job faster using parallel threads  
 - each thread requires PC, GPR, system state

### Thread Scheduling Policies

- 1) Fixed Interleave: N threads 1 instr every N cycles
- 2) Software Controlled Interleave: S pipeline slots for N threads
- 3) Hardware Controlled Thread Scheduling: track ready threads

**Coarse-grain:** can run more cycles

**Simultaneous Multithreading (SMT):** instr from multiple threads to enter execution on same clock cycle

### Instruction Choosing Policy:

Fetch from thread w/ least instr in flight



## 17 Vectors

**Vectors:** one instr encodes N operations - independent, some functional unit, access contiguous block of memory, scalable Data Level Parallelism (DLP)

### Vector Startup:

functional unit latency: time through pipeline  
 dead time/recovery time: time before another vector instr can start down pipeline

**Vector Chaining:** allows bypassing, dependent instr as soon as first result appears

**Vector Stripping:** break loops into pieces that fit in registers

**Vector Mask:** for conditional execution in vectors

**Packed SIMD:** limited instr set, limited vector register length



## 18 GPUs

GPU computational performance for general purpose computing

- consists of multiple multi-thread SIMD cores in-order

- CPU sends grid to GPU which distributes thread blocks among cores

**Single Instruction, Multiple Thread (SIMT)**: individual scalar instruction streams for each CUDA thread grouped together 32 CUDA threads into warp



Use a mask vector for taken-not taken branches

## Exam

### ROB

Enter ROB, Issue, WB, Commit : sequential, one at a time

Issue + WB at same time

Commit must be in order

ROB sequential

Pay attention to ROB entries

### Branching

| A       | j=0 | j=1 | j=2 | j=3 |
|---------|-----|-----|-----|-----|
| Counter | 00  | 01  | 10  | 11  |
| Predict | 0   | 0   | 1   | 1   |
| Actual  | 1   | 1   | 1   | 1   |

### Software Pipelining

| label | ALU      | FPU     | FPU     | MEM    |
|-------|----------|---------|---------|--------|
|       | addi x1  |         |         | fld f2 |
|       |          |         |         |        |
|       | addi x1  | fadd f0 | fmul f3 | fld f2 |
|       |          |         |         |        |
| loop: | addi x1  | fadd f0 | fmul f3 | fld f2 |
|       | bne loop | fadd f1 |         |        |
|       |          | fadd f0 | fmul f3 |        |
|       |          | fadd f1 |         |        |
|       |          |         |         |        |
|       |          | fadd f1 |         |        |

## Multithreading

### Fixed Round Robin Scheduling

- ① Find longest latency between instr

- ② Label 1, N+1, 2N+1

- ③ last - first  $\geq$  latency

### Data Dependent Scheduling

Steady state: max # instr thread can execute

- Switching off

- ① Find Steady State

- ② Find latency & subtract

- ③ (Steady State) $\times N \geq$  latency - (instr between longest b/w)

- ④ N+1

# Final Cheatsheet

## 19 RISC-V Vectors

**Vector Type Register (vtype):** 32 vector data registers of size VLEN  
 vsetvli rd, rs1, e8 ← vtype parameters  
 resulting machine → requested application  
 vector length setting → vector length (AL)  
 $VLMAX = LMUL * VLEN / SEW$

```
# void *memcpy(void* dest, const void* src, size_t n)
# a0=dest, a1=src, a2=n
#
memcpy:
    mv a3, a0 # Copy destination
loop:
    vsetvli t0, a2, e8,m8,ta,ma  # Vectors of 8b
    vle8.v v0, (a1)   # Load bytes
    add a1, a1, t0   # Bump pointer
    sub a2, a2, t0   # Decrement count
    vse8.v v0, (a3)  # Store bytes
    add a3, a3, t0   # Bump pointer
    bnez a2, loop   # Any more?
    ret             # Return
```

### Vector Unit-Stride Loads/Stores

```
# vd destination, rs1 base address, vm is mask encoding
vle8.v vd, (rs1), vm # 8-bit unit-stride load
vle16.v vd, (rs1), vm # 16-bit unit-stride load
vle32.v vd, (rs1), vm # 32-bit unit-stride load
vle64.v vd, (rs1), vm # 64-bit unit-stride load
```

**Vector Length Multiplier (LMUL):** longer vector registers, vector register groups

**Masking:** all operations can be under a mask

## 21 Cache Coherence

### Shared-Memory Multiprocessor:

- Multiple private caches for performance
- illusion of single shared memory

**Coherence:** what values can read return, r/w to single memory location

**Consistency:** when write visible to reads, r/w to multiple memory

**Snoopy Cache:** cache watch other memory ten

**Write Miss:** address invalidated in other caches before write

**Read Miss:** if dirty copy is found, write-back is performed before memory is read

### Cache State-Transition Diagram



**Inclusion:** entries in L1 must be in L2

**False Sharing:** cache-coherence done at the line level not word level

### Coherence Miss:

↳ True sharing miss: communication of data through cache coherence mechanism

↳ False sharing miss: line invalidated because some word in the line is written into

**Snooping:** must probe every other cache

↳ use multiple interleaved buses w/ interleaved tag banks, point-to-point network

**Directories:** every memory line has associated directory info

- track which

caches have copies of data, probe those

**Cache States:**

- 1) C-invalid: accessed data not resident in cache
- 2) C-shared: data in cache, possibly other sites, data in memory valid
- 3) C-modified: exclusive to cache, has been modified, memory not up to date
- 4) C-transient: transient sent, not received

Write miss, to read shared line



## 22 Memory Consistency Models

### Synchronization:

**Producer-Consumer:** consumer must wait until producer produced data

**Mutual Exclusion:** only one process uses resource at given time

**Memory Consistency Model:** sequential ISA sees it in order, what values can be returned by load coherence: ensure memory system in parallel comp as if caches were not there

**Sequentially Consistency:** result of any execution is same as if operations were in sequential order, order-preserving interleaving of memory ref - requiring SC made machines slower

**Store Buffer**: allows stores to buffered while waiting for access to shared memory

**TSO (Total Store Order)**: strongest memory model in common use

- allows local buffering of stores by processor

**Strong Memory Consistency Models**: more guarantees on ordering of loads & stores across hardware threads, easier ISA-level programming model, more hardware

**Weak, multi-copy-atomic memory models**: processor sees writes by another processor in same order

**Weak, non-multi-copy-atomic memory models**: processor can see another writes in diff orders

**Relaxed Memory Models**: each program can have diff memory models

## Exam

VIPT: min associativity for no virtual address aliasing in VIPT

$$(\text{num ways}) * (\text{num sets}) * (\text{line size}) \leq (\text{page size})$$

$$2^{15} \leq 2^{12}, 8 \text{ associativity needed}$$

## Coherence protocol

| initial state | other cached | ops       | actions by this cache | final state | this cache | other caches | mem |
|---------------|--------------|-----------|-----------------------|-------------|------------|--------------|-----|
| Invalid       | no           | none      | none                  | I           |            |              | yes |
|               |              | CPU read  | CR                    | CE          | yes        |              | yes |
|               |              | CPU write | CRI                   | OE          | yes        |              |     |
|               |              | replace   | none                  |             | impossible |              |     |
|               |              | CR        | none                  | I           | yes        | yes          |     |
|               |              | CRI       | none                  | I           | yes        |              |     |
|               |              | CI        | none                  |             | impossible |              |     |
|               |              | WR        | none                  |             | impossible |              |     |
| Invalid       | yes          | CWI       | none                  | I           |            |              | yes |
|               |              | none      |                       | I           | yes        | yes          |     |
|               |              | CPU read  |                       | CS          | yes        | yes          |     |
|               |              | CPU write |                       | OE          | yes        |              |     |
|               |              | replace   | same as above         |             | impossible |              |     |
|               |              | CR        |                       | I           | yes        | yes          |     |
|               |              | CRI       |                       | I           | yes        |              |     |
|               |              | CI        |                       | I           | yes        |              |     |
|               |              | WR        |                       | I           | yes        | yes          |     |
|               |              | CWI       |                       | I           | yes        | yes          |     |

Table P5.7-1

## Cache coherency

| ID | Event         | Message Shared | Cache1 State | Cache1 State | Cache2 State | Main Memory Up-to-Date? |
|----|---------------|----------------|--------------|--------------|--------------|-------------------------|
| 0  | CPU0: read A  | 0:CR           | CE           | I            | I            | Yes                     |
| 1  | CPU2: write B | 2:CRI          | I            | I            | OE           | No                      |
| 2  | CPU1: read B  | 1:CR; 2:CCI    | I            | CS           | OS           | No                      |
| 3  | CPU1: write B | 1:CI           | I            | OE           | I            | No                      |
| 4  | CPU2: write A | 2:CRI; 1:CCI   | I            | I            | OE           | No                      |
| 5  | CPU0: write B | 0:CRI; 2:CCI   | OE           | I            | I            | No                      |

**Mutual Exclusion**: to avoid a deadlock, let process give up reservation while waiting

## Dekker's Algorithm

Process 1

```
...
c1=1;
turn = 1;
L: if c2=1 & turn=1
    then go to L
< critical section>
c1=0;
```

Process 2

```
...
c2=1;
turn = 2;
L: if c1=1 & turn=2
    then go to L
< critical section>
c2=0;
```

Regular loads and stores in SC model sufficient to implement mutual exclusion

**Atomic Memory Operations (AMO)**: two ordering bits acquire and release

Non-blocking synchronization allows critical sections to execute w/o taking locks

**Compare-and-Swap**: complex instr, double Compare-and-Swap to access two words

**Load-Reserved / Store-Conditional**: load in cache in F/M stage

**RISC-V Atomic Instr**: guaranteed forward progress for simple operations