

Consider a shared mem arch

Shared memory should offer a consistent view to all cores

Core 0      Core 1      ...



the L1 will be too big & hence too slow.



Core 0 can keep on writing to n other values in future in private L1,  
Core 1 keeps on reading from its own private L1 as  $n=0$

## INCOHERENT

Problem of Shared mem in Inst.

### What is Coherence

Requirements →

- 1) Any read by core  $C_i$  from mem add  $x$  must return the most recent write  $w$  to add  $x$  by core  $C_j$ , if no other core has written to add  $x$  it between  $w$  and  $n$ .
- 2) If core  $C_1$  writes to  $X$  &  $C_2$  reads after sufficient time.  
X there are no other writers to  $X$  in between by any core, then  $C_2$ 's read returns the value from  $C_1$ 's write.
- 3) Writes to the same location have to be serialized, i.e., any two writes to  $X$  must occur in the same order on all the cores.  
Every core sees the same order of write values for a variable.

### How to Achieve Coherence in presence of Caches

1. No cache → bad performance

2. Common L1 → " "

3. Private write-through cache → then the shared memory is Incoherent coherent, but once a read is done, it is not updated following writes by others.

4. Need to force Cache to see writes by other caches.

→ Broadcast writes to update other caches.

→ Write update coherence.

→ Writers prevent hits to other cores.

→ Write invalidate coherence

Ans one of these can ensure property ①

property ① → Write broadcast on shared bus, the order in which they appear on bus is the unique order seen by all cores (shared bus can be bottleneck)

→ Each block is assigned an ordering point → (directories)

Different ordering points can be used for diff blocks.

For each block, all accesses are ordered by the same entity.

Diff blocks have diff entity → no contention.

(Directory band coherence)

Diff blocks have diff entry  $\rightarrow$  no contention.  
(Directory based coherence)

$2 \times 2 \rightarrow 4$  possible coherence schemes.

### Write Update Snooping coherent



What if both processors try write to same location exactly at same time  
 $\hookrightarrow$  bus arbitration logic will decide the order.

### Optimization for write updates

In W update scheme, every write to a location by every core, needs updating in memory, which is slow.  
 Mem  $\rightarrow$  can't handle so much write traffic.



The mem not yet updated  
 If cache 1 has a new read for block A  
 (after replacing previous instance)

Cache 1 makes a read req.

Cache 2 snoops req from Bus, the

dirty bit tells it that it only has the updated value & memory does not have updated value.

Cache 2 will give the value.

If now cache 2 update A = 20, then cache 1 snoops the value by cache 2's broadcast.

Cache 2 still has dirty bit set as mem still not updated. Only if cache 2 has the block replaced, then memory will get updated.

So, as long as one cache has dirty bit set, it ... value

then memory will get updated.  
 So, as long as one cache has dirty bit set, it has most recent value.  
 Now if cache 1 writes  $A=30$ , then it is snooped by cache 2 & the dirty bit of cache 1 gets set with that of cache 2 getting unset.

This many memory update can be avoided by designating a cache dynamically as having the recent value.  
 Only when the cache with 'dirty' int (i.e., the last writer) gets block replaced, then Mem write will happen.

We also read from memory only when no cache has dirty bit set.

### Write Update Opt for bus writes

Memory is saved by bus is still bottleneck.



Doing a read of any cache sees that it already has the same block, it pulls the shared line up.

When A is written by cache 2, cache 1 snoops the value, updates its shared bit & also pulls up the shared line.

Then cache 2 also will set  $S=1$  knowing that the block is shared.

Now if cache 2 updates A, it has  $S=1$ , so it needs to broadcast.

Say cache 1 reads  $B=2$  from mem,  $V=1, D=0, S=0$ ;

Now if it updates  $B=5$ , it checks  $S=0$ , hence no broadcast required, but now  $D=1$  as no more consistent with memory.

### Write Invalidate Snooping coherence

Consider last scheme setting  
 Let Cach 1 read A,  $V=1, T=A, S=0=0 \leftarrow$  change valid bit in cache 1  
 $\dots \rightarrow$  write  $A=1$ , will be snooped by cache 1  $= 0$ .

Now A read in cache 1 will have a miss, if sends a request to the bus & it is provided by cache 2.

Now, in cache 1  $\rightarrow V=1, T=A, S=1, D=0$   
 cache 2  $\rightarrow V=1, T=A, S=1, D=1$

So in this type protocol,  
 if somebody writes then there is a (low latency) miss b> others in not read.

So in this type protocol,  
if somebody writes then there is a (low latency) miss by  
others in next read.

Advantage → (many unnecessary writes avoided).  
Ex., say core 2 writes A, then core 1 no more has  
valid A but  $s=1$ .  
Now, core 2 can make many updates ( $s=0, D=1$ )  
on its local copy w/o any broadcast.  
Only when core 1 reads, it will have a miss & then get  
last update by core 2.  
So when data is periodically processed by diff cores &  
any core makes many consecutive updates, this  
is really a good idea.

| Application Activity                                 | Update                                   | Invalidate                                |
|------------------------------------------------------|------------------------------------------|-------------------------------------------|
| Burst of writes to one addr                          | Each wr in broadcast                     | First wr invalidates others<br>as hits ✓  |
| WR diff words in same block                          | Update sent for each word.               | n ✓                                       |
| One core producer (WR)<br>another core consumer (RD) | Producer sends update<br>Consumer hits ✓ | Producer invalidates<br>Consumer Misses ✗ |

Modern CPU → invalidation technique preferred.

Primary reason → OS migrates threads,  
data remains in source core.



### MSI Coherence (invalidation based)

A block can be in 3 possible cache states.



### Cache - Cache Transfer

- C1 (c1) has block B in M state
- C2 has RD request on BUS
- C1 provides data →

Solution 1      c1 cancels C2 req, C1 WB to mem ↪

- C<sub>2</sub> has RD request on bus
- C<sub>1</sub> provides data →

### Solution 1

C<sub>1</sub> cancels C<sub>2</sub> req, C<sub>1</sub> WB to mem  
C<sub>2</sub> entries & gets data from mem  
2 times mem latency

### Solution 2

C<sub>1</sub> notifies mem & responds directly to C<sub>2</sub>  
(intervention BUS signal) Mem gets data from C<sub>1</sub> in parallel.

HW complex but popular

Memory has to get the fresh data from C<sub>1</sub> since after transfer both C<sub>1</sub> & C<sub>2</sub> have the data ('脏' state). Both thinks the block is not dirty. So in future there is no body clearly responsible to WB the data.

### Avoiding Memory write in Cache—Cache transfer

MSI: We are writing to memory everytime we do a cache-to-cache transfer.

C<sub>1</sub> has a block B in M state

C<sub>2</sub> wants to read B, C<sub>1</sub> has it → C<sub>1</sub>: S, C<sub>2</sub>: S

Next C<sub>2</sub> writes to B, C<sub>1</sub>: I, C<sub>2</sub>: M

Next C<sub>1</sub> Read, C<sub>2</sub> responds with data → C<sub>1</sub>: S, C<sub>2</sub>: S

Next C<sub>3</sub> Read ← Mem provides data as C<sub>1</sub>/C<sub>2</sub> both in S state with D=0

Next C<sub>4</sub> read ← Mem, C<sub>1</sub>: S, C<sub>2</sub>: S, C<sub>3</sub>: S

both have data but none provides

< no clear rule like who will provide >

Many unnecessary Mem reads (costly)

We should have had a non-modified (non-M)

version of the same block in just one of the caches & we could have made it responsible for distributing data to other cache & finally WB to Mem.

### Introduce 'O' (OWN) state

### MOSI protocol

like MSI except

M ⇒ instead a RD ⇒ O (not S)

... , , ,

like MSI except

$M \Rightarrow$  no op a RD  $\Rightarrow 0$  (not S)  
no Mem access. (data given directly)

0 is like S except

$0 \Rightarrow$  no op a RD  $\Rightarrow$  give data  
WB to mem if block replaced

M: exclusive RD/WR access, Dirty

S: shared read access, Clean (mem has up-to-date copy)

O: Shared read access, Dirty (only one cache  
in O state)

### MOSI inefficiency

Thread-private data (data accessible to only one thread)  
e.g. stack of each thread

(or 4 separate  
program  
threads  
running in  
4 cores)



### The E (Exclusive) State

M: Exclusive Access (RD/WR), Dirty

S: Shared Access (RD), Clean

O: Shared Access (RD), Dirty

E: Exclusive Access (RD/WR), Clean (and not update mem)

|      | MSI                                   | MOSI                                  | MESI                     | MOESI                    |
|------|---------------------------------------|---------------------------------------|--------------------------|--------------------------|
| RD A | $I \rightarrow S$<br>miss             | $I \rightarrow S$<br>$M$              | $I \rightarrow E$<br>$M$ | $I \rightarrow E$<br>$M$ |
| WRA  | $S \rightarrow M$<br>Invalidation not | $S \rightarrow M$<br>Invalidation not | $E \rightarrow M$        | $E \rightarrow M$        |

2 bus accesses

2 bus access

↓ invalidation req (off chip bus  
access not required)

detect during read  
that we are the only one  
having the block