

# Parallel & Distributed Computing

## Lab/Lecture Handout: Cache Coherence in SMP (UMA)

Topics: Coherence Problem & Solutions, Ping-Pong (True & False Sharing), Profiling with perf

### 1. Learning Outcomes

After completing this handout, students should be able to:

- Explain the cache coherence problem in SMP/UMA systems and why it occurs.
- Describe coherence solutions: snooping vs directory, write-invalidate vs write-update, MESI-family protocols.
- Identify and explain cache-line *ping-pong* due to *true sharing* and *false sharing*.
- Implement simple pthread programs that trigger ping-pong behavior.
- Use Linux `perf` to measure performance impact and interpret key metrics.

### 2. Prerequisites

- Linux machine (Ubuntu/Debian recommended) with `gcc`, `pthread`, and `perf`.
- Basic knowledge of caches (L1/L2/L3) and threads.
- Two or more CPU cores available.

### 3. Background: SMP (UMA) and the Coherence Problem

#### 3.1. SMP (UMA)

In a classic Shared-Memory Multiprocessor (SMP) with Uniform Memory Access (UMA), all cores share the same physical memory, and each core typically has private caches (L1/L2) and sometimes a shared last-level cache (LLC).

#### 3.2. What is the cache coherence problem?

When multiple cores cache the same memory location (more precisely, the same **cache line**), each core may hold a copy. If one core writes to that line, other cores may still hold stale copies. Without coherence, subsequent reads could observe old values.

#### 3.3. Why it happens (the real villain: cache lines)

Coherence granularity is almost always a **cache line** (often 64 bytes). This means:

- Two different variables inside the same line can still interfere.
- Independent writes can trigger invalidations *even if variables are logically unrelated*.

### 4. Solutions to Cache Coherence

#### 4.1. Policy-level solutions

- **Write-invalidate (most common):** A writer invalidates other cached copies, then writes with exclusive ownership.
- **Write-update (less common):** A writer broadcasts updated data to other caches (can cause high traffic).

## 4.2. Mechanism-level solutions

- **Snooping-based coherence:** Caches monitor a shared interconnect (bus/ring). Requests are broadcast; caches respond/invalidate accordingly. Good for small/medium core counts.
- **Directory-based coherence:** A directory tracks which caches share each line. Invalidation is targeted (not broadcast). Scales better for many cores/NUMA.

## 4.3. Protocol-level solutions (MESI family)

A coherence protocol defines valid transitions for cache-line states. The classic MESI states:

- **M (Modified):** Only this cache has the newest data; memory is stale.
- **E (Exclusive):** Only this cache has it; memory matches.
- **S (Shared):** Multiple caches may have it; memory matches.
- **I (Invalid):** Not valid in this cache.

**Key idea:** Before writing, a core must obtain an exclusive/modified state, forcing other cached copies to become invalid (or otherwise ensuring they do not remain stale).

## 5. The Ping-Pong Effect (Coherence Traffic)

### 5.1. True sharing ping-pong

Two threads repeatedly write the **same variable**. The cache line must migrate between cores, causing frequent invalidations and ownership transfers.

### 5.2. False sharing ping-pong

Two threads write **different variables**, but those variables happen to be in the **same cache line**. Coherence still operates at cache-line granularity, so the line bounces anyway.

## 6. Lab Part A: Program Examples (pthreads)

### 6.1. Build settings (recommended)

Use symbols + frame pointers for better profiling:

```
gcc -O2 -g -fno-omit-frame-pointer -pthread -std=c11 file.c -o prog
```

### 6.2. A0: Very simple programs (3 variants)

This section provides minimal pthread codes for three cases:

1. True sharing (same variable)
2. False sharing (different variables, same cache line)
3. Fixed false sharing (padding, different cache lines)

#### 6.2.1. Variant 1 (Very Simple): True sharing (same shared variable)

**Idea:** Both threads increment the **same counter**. This forces the cache line containing that counter to bounce between cores.

Listing 1: Very simple true sharing: both threads update the same counter

```
1 #include <pthread.h>
2 #include <stdio.h>
3 #include <stdatomic.h>
```

```

4
5 #ifndef ITERS
6 #define ITERS 2000000000UL
7 #endif
8
9 static atomic_ulong shared_counter = 0;
10
11 void* worker(void* arg) {
12     (void)arg;
13     for (unsigned long i = 0; i < ITERS; i++) {
14         atomic_fetch_add_explicit(&shared_counter, 1, memory_order_relaxed)
15             ;
16     }
17     return NULL;
18 }
19
20 int main(void) {
21     pthread_t t0, t1;
22
23     pthread_create(&t0, NULL, worker, NULL);
24     pthread_create(&t1, NULL, worker, NULL);
25
26     pthread_join(t0, NULL);
27     pthread_join(t1, NULL);
28
29     printf("shared_counter = %lu (expected %lu)\n",
30           (unsigned long)atomic_load(&shared_counter),
31           2UL * ITERS);
32     return 0;
33 }
```

### 6.2.2. Variant 2 (Very Simple): False sharing (adjacent counters)

**Idea:** Each thread updates its own counter, but the counters are adjacent in memory and can share one cache line.

Listing 2: Very simple false sharing: two different counters but adjacent in memory

```

1 #include <pthread.h>
2 #include <stdio.h>
3 #include <stdatomic.h>
4
5 #ifndef ITERS
6 #define ITERS 2000000000UL
7 #endif
8
9 typedef struct {
10     atomic_ulong a;    // thread 0 updates
11     atomic_ulong b;    // thread 1 updates
12 } counters_t;
13
14 static counters_t c = {0, 0};
15
16 void* t0(void* arg) {
17     (void)arg;
18     for (unsigned long i = 0; i < ITERS; i++) {
19         atomic_fetch_add_explicit(&c.a, 1, memory_order_relaxed);
20     }
21 }
```

```

21     return NULL;
22 }
23
24 void* t1(void* arg) {
25     (void)arg;
26     for (unsigned long i = 0; i < ITERS; i++) {
27         atomic_fetch_add_explicit(&c.b, 1, memory_order_relaxed);
28     }
29     return NULL;
30 }
31
32 int main(void) {
33     pthread_t p0, p1;
34
35     pthread_create(&p0, NULL, t0, NULL);
36     pthread_create(&p1, NULL, t1, NULL);
37
38     pthread_join(p0, NULL);
39     pthread_join(p1, NULL);
40
41     printf("a = %lu, b = %lu\n",
42           (unsigned long)atomic_load(&c.a),
43           (unsigned long)atomic_load(&c.b));
44     return 0;
45 }
```

### 6.2.3. Variant 3 (Very Simple): Fixed false sharing (padding)

**Idea:** Insert padding so **a** and **b** are on different cache lines (typical 64 bytes).

Listing 3: Very simple fixed false sharing: padding separates cache lines

```

1 #include <pthread.h>
2 #include <stdio.h>
3 #include <stdatomic.h>
4
5 #ifndef ITERS
6 #define ITERS 2000000000UL
7 #endif
8
9 typedef struct {
10     atomic_ulong a;
11     char pad[64 - sizeof	atomic_ulong];
12     atomic_ulong b;
13 } counters_t;
14
15 static counters_t c = {0, {0}, 0};
16
17 void* t0(void* arg) {
18     (void)arg;
19     for (unsigned long i = 0; i < ITERS; i++) {
20         atomic_fetch_add_explicit(&c.a, 1, memory_order_relaxed);
21     }
22     return NULL;
23 }
24
25 void* t1(void* arg) {
26     (void)arg;
```

```

27   for (unsigned long i = 0; i < ITERS; i++) {
28     atomic_fetch_add_explicit(&c.b, 1, memory_order_relaxed);
29   }
30   return NULL;
31 }
32
33 int main(void) {
34   pthread_t p0, p1;
35
36   pthread_create(&p0, NULL, t0, NULL);
37   pthread_create(&p1, NULL, t1, NULL);
38
39   pthread_join(p0, NULL);
40   pthread_join(p1, NULL);
41
42   printf("a = %lu, b = %lu\n",
43         (unsigned long)atomic_load(&c.a),
44         (unsigned long)atomic_load(&c.b));
45   return 0;
46 }
```

### 6.3. A1: True-sharing ping-pong (strict alternation)

**What it demonstrates:** Two threads increment the same shared counter, alternating turns. This strongly forces cache-line ownership transfers.

Listing 4: True-sharing ping-pong (turn-based alternation)

```

1 #define _GNU_SOURCE
2 #include <pthread.h>
3 #include <stdio.h>
4 #include <stdint.h>
5 #include <time.h>
6 #include <stdatomic.h>
7
8 #ifndef ITERS
9 #define ITERS 500000000UL
10 #endif
11
12 static atomic_int turn = 0;           // 0 -> thread0, 1 -> thread1
13 static atomic_uint shared = 0;        // same variable updated by both
14   threads
15
16 static inline uint64_t ns_now(void) {
17   struct timespec ts;
18   clock_gettime(CLOCK_MONOTONIC, &ts);
19   return (uint64_t)ts.tv_sec*1000000000ULL + (uint64_t)ts.tv_nsec;
20 }
21
22 void* t0(void* arg) {
23   (void)arg;
24   for (unsigned long i=0; i<ITERS; i++) {
25     while (atomic_load_explicit(&turn, memory_order_acquire) != 0) { }
26     atomic_fetch_add_explicit(&shared, 1u, memory_order_relaxed);
27     atomic_store_explicit(&turn, 1, memory_order_release);
28   }
29   return NULL;
}
```

```

30
31 void* t1(void* arg) {
32     (void)arg;
33     for (unsigned long i=0; i<ITERS; i++) {
34         while (atomic_load_explicit(&turn, memory_order_acquire) != 1) { }
35         atomic_fetch_add_explicit(&shared, 1u, memory_order_relaxed);
36         atomic_store_explicit(&turn, 0, memory_order_release);
37     }
38     return NULL;
39 }
40
41 int main(void) {
42     pthread_t a,b;
43     uint64_t st = ns_now();
44     pthread_create(&a,NULL,t0,NULL);
45     pthread_create(&b,NULL,t1,NULL);
46     pthread_join(a,NULL);
47     pthread_join(b,NULL);
48     uint64_t en = ns_now();
49     printf("shared=%u (expected %lu)\n", atomic_load(&shared), 2UL*ITERS)
50     ;
51     printf("time=%.3f s\n", (en-st)/1e9);
52     return 0;
53 }
```

#### 6.4. A2: False sharing ping-pong (unpadded)

**What it demonstrates:** Each thread updates its own counter, but both counters are adjacent and likely share one cache line.

Listing 5: False sharing (unpadded) – likely same cache line

```

1 #define _GNU_SOURCE
2 #include <pthread.h>
3 #include <stdio.h>
4 #include <stdint.h>
5 #include <time.h>
6 #include <stdatomic.h>
7
8 #ifndef ITERS
9 #define ITERS 500000000UL
10 #endif
11
12 typedef struct {
13     atomic_uint x; // thread0 updates
14     atomic_uint y; // thread1 updates
15 } counters_t;
16
17 static counters_t c = {0,0};
18
19 static inline uint64_t ns_now(void) {
20     struct timespec ts;
21     clock_gettime(CLOCK_MONOTONIC, &ts);
22     return (uint64_t)ts.tv_sec*1000000000ULL + (uint64_t)ts.tv_nsec;
23 }
24
25 void* t0(void* arg){
26     (void)arg;
```

```

27     for(unsigned long i=0;i<ITERS;i++)
28         atomic_fetch_add_explicit(&c.x, 1u, memory_order_relaxed);
29     return NULL;
30 }
31
32 void* t1(void* arg){
33     (void)arg;
34     for(unsigned long i=0;i<ITERS;i++)
35         atomic_fetch_add_explicit(&c.y, 1u, memory_order_relaxed);
36     return NULL;
37 }
38
39 int main(void){
40     pthread_t a,b;
41     uint64_t st = ns_now();
42     pthread_create(&a,NULL,t0,NULL);
43     pthread_create(&b,NULL,t1,NULL);
44     pthread_join(a,NULL);
45     pthread_join(b,NULL);
46     uint64_t en = ns_now();
47     printf("x=%u y=%u\n", atomic_load(&c.x), atomic_load(&c.y));
48     printf("time=%.3f s\n", (en-st)/1e9);
49     return 0;
50 }
```

## 6.5. A3: False sharing fix (padding/alignment)

**Goal:** Place y on a different cache line from x. (Typical cache line size is 64 bytes.)

Listing 6: False sharing fixed via padding (typical 64-byte line)

```

1 typedef struct {
2     atomic_uint x;
3     char pad[64 - sizeof	atomic_uint];
4     atomic_uint y;
5 } counters_t;
6
7 static counters_t c = {0, {0}, 0};
```

## 6.6. A4: Pin threads to different cores (CPU affinity)

**Why:** Pinning reduces scheduler noise and makes ping-pong more repeatable.

Listing 7: False sharing + CPU affinity (pinning)

```

1 #define _GNU_SOURCE
2 #include <pthread.h>
3 #include <stdio.h>
4 #include <stdint.h>
5 #include <time.h>
6 #include <stdatomic.h>
7 #include <sched.h>
8 #include <unistd.h>
9 #include <string.h>
10
11 #ifndef ITERS
12 #define ITERS 50000000UL
13 #endif
```

```

14
15     typedef struct { atomic_uint x; atomic_uint y; } counters_t;
16     static counters_t c = {0,0};
17
18     static inline uint64_t ns_now(void){
19         struct timespec ts; clock_gettime(CLOCK_MONOTONIC, &ts);
20         return (uint64_t)ts.tv_sec*1000000000ULL + (uint64_t)ts.tv_nsec;
21     }
22
23     static void pin_to_cpu(int cpu_id){
24         cpu_set_t set; CPU_ZERO(&set); CPU_SET(cpu_id, &set);
25         int rc = pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
26         if(rc!=0) fprintf(stderr,"Warning: affinity cpu=%d failed: %s\n",
27                           cpu_id, strerror(rc));
28     }
29
30     typedef struct { int cpu_id; int which; } arg_t;
31
32     void* worker(void* arg){
33         arg_t* a=(arg_t*)arg;
34         pin_to_cpu(a->cpu_id);
35         if(a->which==0){
36             for(unsigned long i=0;i<ITERS;i++)
37                 atomic_fetch_add_explicit(&c.x,1u,memory_order_relaxed);
38         }else{
39             for(unsigned long i=0;i<ITERS;i++)
40                 atomic_fetch_add_explicit(&c.y,1u,memory_order_relaxed);
41         }
42         return NULL;
43     }
44
45     int main(void){
46         int ncpu=(int)sysconf(_SC_NPROCESSORS_ONLN);
47         if(ncpu<2){ fprintf(stderr,"Need >= 2 CPUs.\n"); return 1; }
48
49         pthread_t t0,t1;
50         arg_t a0={.cpu_id=0,.which=0};
51         arg_t a1={.cpu_id=1,.which=1};
52
53         uint64_t st=ns_now();
54         pthread_create(&t0,NULL,worker,&a0);
55         pthread_create(&t1,NULL,worker,&a1);
56         pthread_join(t0,NULL);
57         pthread_join(t1,NULL);
58         uint64_t en=ns_now();
59
60         printf("Pinned CPU0/CPU1: x=%u y=%u time=%.3f s\n",
61               atomic_load(&c.x), atomic_load(&c.y), (en-st)/1e9);
62         return 0;
63     }

```

## 7. Lab Part B: Profiling and Evaluation with Linux perf

### 7.1. B1: Install perf tools

Ubuntu/Debian:

```
sudo apt update
sudo apt install -y linux-tools-common linux-tools-generic
# On some systems, you may need the exact kernel tools package:
# sudo apt install -y linux-tools-$(uname -r)
```

## 7.2. B2: If perf needs permissions

For a lab machine (temporary change):

```
sudo sysctl -w kernel.perf_event_paranoid=1
sudo sysctl -w kernel.kptr_restrict=0
```

**Note:** These are security-sensitive settings; revert after the lab if needed.

## 7.3. B3: Baseline timing comparison

Run each variant 3 times and record average runtime:

```
./pp_unpadded
./pp_unpadded
./pp_unpadded

./pp_padded
./pp_padded
./pp_padded
```

## 7.4. B4: Quick performance counters with perf stat

**Goal:** Quantify overhead differences between unpadded vs padded.

```
sudo perf stat ./pp_unpadded
sudo perf stat ./pp_padded
```

### Useful metrics to watch

Interpretation depends on CPU, but typically track:

- **cycles, instructions, IPC** (instructions per cycle)
- **cache-references, cache-misses** (overall caching behavior)
- **context-switches, cpu-migrations** (noise; should reduce with pinning)
- **task-clock** (runtime)

**Expected observation:** The unpadded false-sharing version often shows worse runtime and poorer efficiency due to cache-line bouncing.

## 7.5. B5: Intel-focused deep dive: perf c2c (Cache-to-Cache)

On many Intel systems, `perf c2c` can highlight cache-line contention (ownership transfers, HTM-like behavior).

```
sudo perf c2c record -- ./pp_unpadded
sudo perf c2c report
```

Repeat for padded:

```
sudo perf c2c record -- ./pp_padded
sudo perf c2c report
```

### What to look for in the report:

- A small number of cache lines dominating “cache-to-cache” transfers.
- Reduced contention hotspots after padding.
- If symbols are available, it may point to the data object or code path.

### 7.6. B6: AMD variant notes (perf support varies)

On AMD, deep cache-to-cache analysis may depend on IBS support and kernel/CPU generation. If `perf c2c` is limited, use:

- `perf stat` for quantitative comparison (still very effective for teaching).
- `perf record + perf report` to understand where CPU time goes (less direct for false sharing).
- Optional (outside this handout): AMD uProf’s cache analysis features can be used for false-sharing detection in a GUI/CLI workflow.

### 7.7. B7: Reduce measurement noise (recommended)

- Use thread pinning (affinity) and keep the system idle.
- Close heavy background apps.
- Repeat runs and take median/average.
- Ensure the program is compiled with `-O2` for realistic behavior.

## 8. Evaluation: How to Decide “Padding Helped”

### 8.1. Required evidence

Students should provide:

1. Runtime comparison (unpadded vs padded), with at least 3 runs each.
2. `perf stat` summary for both variants.
3. (Intel systems) `perf c2c report` screenshots or captured output showing reduced hot cache-line contention.

### 8.2. Suggested table format

| Variant                  | Time (s) | cycles | cache-misses | cpu-migrations |
|--------------------------|----------|--------|--------------|----------------|
| Unpadded (false sharing) |          |        |              |                |
| Padded (fixed)           |          |        |              |                |

## 9. Lab Tasks (What students must do)

1. Compile and run the **false sharing unpadded** program; record time.
2. Modify it to **padded** version; record time.
3. Enable **CPU affinity**; compare stability and runtime.
4. Run `perf stat` on both; capture metrics.
5. (Intel) Run `perf c2c record/report` on both; identify the contended cache line(s).
6. Write a short conclusion: *Why does padding help? Why is coherence per cache line?*

## 10. Common Pitfalls (and fixes)

- **No speedup seen:** Possibly both threads ran on the same core. Use affinity.

- **High variability:** Background processes or CPU frequency scaling. Repeat runs; pin threads; keep system idle.
- **perf permission errors:** Adjust `perf_event_paranoid` temporarily (lab admin policy permitting).
- **Confusing conclusion:** Emphasize: *false sharing occurs even if variables are different, because they share a cache line.*

## 11. Mini-Quiz (in-class check)

1. Why can two independent counters still slow each other down on an SMP?
2. In write-invalidate, what happens to other cached copies when a core writes?
3. What changes when we add padding between two counters?

**End of Handout**