

## Lectures

### Introduction

#### L0 and L1

##### Program Parallelization

**Decomposition:** Decompose a sequential algorithm into tasks (programmer)

- Granularity of tasks are important
- Tasks have dependencies (data or control) between each other which defines the execution order

**Scheduling:** Assign tasks to processes (programmer / compiler)

**Mapping** - Map processes to cores (OS)

**Von Neumann Computation Model** instruction and data are stored in memory, and processors computes.

**Memory Wall** disparity between memory speed and processor speed ( $\leq 1 \text{ ns}$  VS  $\geq 100 \text{ ns}$ )

**Processing unit** refers to a core that can execute a kernel thread

**Interconnect** busses between different components in the machine

**Node** Machine in a distributed system

##### Why Parallel

##### Primary Reasons

- 1 OVercome limits of serial computing
- 2 Solve larger problems
- 3 Save (wall-clock) time

##### Other Reasons

- Take advantage of non-local resources
- Cost/energy saving - use multiple cheaper computing resources
- Overcome memory constraints

##### Computational Model Attributes

- **Operation mechanism** Primitive units of computation or basic actions of the computer on a specific Architecture
- **Data Mechanism** How we access and store data in address space
- **Control Mechanism** How primitive units of computation are scheduled
- **Communication Mechanism** Modes and patterns of exchanging information between parallel tasks (e.g message passing, shared memory)
- **Synchronization Mechanism** ensures to ensure needed information arrives at the right time

##### Dependencies and Coordination

- Dependencies among tasks impose constraints on scheduling
- Memory organizations: Shared-memory (threads), distributed-memory (processes)
- Coordination (synchronization) imposes additional overheads

### Two algorithms



- Core 0 is active throughout the execution
- Some cores are idle
- This is a lot better than having all cores idle while the master core is executing

##### Parallel Performance

- Execution time Vs Throughput
- Parallel execution time = computation time + parallelization overheads
- Overheads: Distribution of work(tasks) to processes, information exchange, synchronisation, idle time, etc

## Background on Parallelism

### L2: Processes and Threads

#### Process

- Identified by PID
- Program counter, global data (open files, network connections), stack or heap, current values of the registers (GPRs and Special)
- These information are abstracted in the PCB, and each process can be viewed as having exclusive access to its address space
- Explicit communication is needed
- **Disadvantage**
  1. High overhead of system calls
  2. Potential re-allocation of data-structures
  3. Communication goes through OS (system calls) and context switch is costly



#### Multi tasking

- Overhead: Context switching (PCB change) is needed and states of suspended process must be saved
- Time slicing: Pseudo-parallelism
- Child processes can use parent's data
- **Inter-process communication (IPC)**
- Shared memory: need to protect access with locks

- Message passing: Blocking, unblocking, Synchronous, unsynchronous

#### Exceptions

- Executing a **machine level instruction** can cause exception
- For example: Overflow, Underflow, Division by Zero, Illegal memory address, Mis-aligned memory access

#### Synchronous

- Occur due to program execution
- Have to execute an **exception handler**

#### Asynchronous

- Occur **independently** of program execution
- Have to execute an **interrupt handler**

#### Threads

- A process may have multiple independent control flows called threads
- Each thread has its own stack and registers (PC, SP, registers), but share the same address space
- Shared memory model and Shared memory architecture
- Faster thread generation- no copy of address space
- Different process can be assigned to run on different cores of a multicore processor

#### User threads

- Managed by library
- Context switch is fast, OS not involved
- **Disadvantage**

1. OS cannot map different threads of the same process to different resources  $\Rightarrow$  No parallelism
2. OS cannot switch to another thread if one thread blocks

#### Kernel threads

- OS is aware of the threads and can manage accordingly
- Efficient in a multicore system
- Potential synchronisation issues

#### Many to one mapping

- All user-level threads mapped to one process.
- Efficiency depends on threading library

#### One to one mapping

- Each user-level thread is mapped to one kernel thread
- OS schedules

#### Many to many mapping

- Many user-level threads mapped to many kernel threads
- Library threads has overheads, and kernel threads has overheads
- At different points in time, different user threads are mapped to different kernel threads
- Number of threads must be suitable to the degree of parallelism and the resources available

#### Locks

- Spinlock: busy wait
- Blocking: mutex
- Using more locks increases the number of context switches
- DO NOT wait in the critical section

#### Semaphores

- Essentially shared global variables
- Can be potentially accessed anywhere in program
- No connection between semaphore and the data being protected

#### Barrier

- All threads must reach the barrier before any thread can proceed

#### Deadlock

- Deadlock exists among a set of processes if every process is waiting for an event that can be caused only by another process in the set
- **iff these condns are met**

1. Mutual exclusion-at least one resource is not shareable
  2. Hold and wait - at least one process holding a resource and waiting for another
  3. No preemption - critical section cannot be aborted externally
  4. Circular wait
- **Dealing with deadlock**
- Ignore it, prevent it, avoid it by controlling resource allocation, detection and recovery by breaking cycles

#### Starvation

- Side effect of the scheduling algorithm. Lower priority processes might starve

#### Livelock

- Active acquire release but no useful work done

#### Producer-Consumer Problem

- Specifications:
  - Producers put in a shared bounded buffer if not full, consumers read from it if not empty
- Solution:



- Concurrent read, exclusive write. Categorical starvation of writer is possible

- | Writers                                                                      | Readers                                                                                                                                                                                                                                                                                                                                            |
|------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| <pre>roomEmpty.wait() #critical section for writers roomEmpty.signal()</pre> | <pre>mutex.wait() # critical section for readers mutex.signal()</pre>                                                                                                                                                                                                                                                                              |
|                                                                              | <pre>readers += 1 if readers == K:     roomEmpty.signal() # first in locks</pre>                                                                                                                                                                                                                                                                   |
|                                                                              | <pre>readers -= 1 if readers == 0:     roomEmpty.signal() # last out unlocks</pre>                                                                                                                                                                                                                                                                 |
|                                                                              | <pre>Light switch: Abstracts out the shared lock for the reader counter = 0 mutex = Semaphore(1) lock(semaphore):     mutex.wait()     counter += 1     if counter == 1:         semaphore.wait()     mutex.signal() unlock(semaphore):     mutex.wait()     counter -= 1     if counter == 0:         semaphore.signal()     mutex.signal()</pre> |
|                                                                              | <pre>Starvation free solution (block out readers):</pre>                                                                                                                                                                                                                                                                                           |

## Writers

```
turnstile.wait()
roomEmpty.wait()
# critical section for writers
turnstile.signal()
roomEmpty.signal()
```

## Readers

```
turnstile.wait()
turnstile.signal()
readSwitch.lock(roomEmpty)
# critical section for readers
turnstile.signal()
readSwitch.unlock(roomEmpty)
```

- Prioritise Writer:

```
readSwitch = lightswitch();
writeSwitch = lightswitch();
noReaders = semaphore(1);
noWriters = semaphore(1);

reader() {
/* Waiting for writers to be done */
wait(noReaders);
/* Writers cannot enter */
    readswitch.lock(noWriters);
signal(noReaders);
# critical section
readSwitch.unlock(noWriters);
}
```

```
writer() {
/* Immediately acquires no readers so
writers have priority */
writeSwitch.lock(noReaders);
wait(noWriters);
# critical section
signal(noWriters);
writeSwitch.unlock(noReaders);
}
```

- This is implemented in C++ as a **shared\_lock** and **unique\_lock**
- GO has something similar: **readLock** and **writeLock**

## Barrier

- All threads must stop at a common point before proceeding, can be reusable (barrier) or single use (latch)
- `std::barrier`, `std::latch` in C++
- E.g. `std::barrier arrivalPoint(size)` ... `arrivalPoint.arrive_and_wait()`
- `sync.WaitGroup` in GO is a latch, we can use 2 of them to make a barrier
- C++ implementation
  - The naive version fails because context switch can happen right before counter == N, which causes multiple threads (that were context switched out after counter++) to signal the switch (another way to fail is to have 1 thread lap everyone else between the first barrier unlocks and second barrier unlocks)
  - The solution is to add a second turnstile (initialised as 1) to guard the `turnstile1.signal`, such that only one thread can signal it
  - But using mutex to increment `turnstile1` one-by-one is slow
  - So we use a counting semaphore instead so we can raise the barrier by 1 thread!

## Dining Philosophers

- Specifications: N philosophers, N chopsticks
- Deadlock: All pick up left simultaneously
- Livelock: Put down left if right cannot be acquired
- Slap a mutex: Becomes sequential
- Scoped Lock(left, right): Acquire multiple mutexes in a deadlock free manner (deadlock avoidance), but as we have seen in CS3223, deadlock avoidance can lead to livelock
- GO's Mutex Free Solution: Use odd-even ring communication, odd numbered philosophers pick up left first, even numbered philosophers pick up right first
- This is the same as the right hander argument
- Tanenbaum's solution:

```
#define N 5
#define LEFT ((i+N-1)%N)
#define RIGHT ((i+1)%N)
#define THINKING 0
#define HUNGRY 1
#define EATING 2
int state[N];
Semaphore mutex = 1;
Semaphore s[N];

void philosopher(int i){
    while (TRUE){
        Think();
        takeChpStcks(i);
        Eat();
        putChpStcks(i);
    }
}

void safeToEat(i){
    if ((state[i] == HUNGRY) &&
        (state[LEFT] != EATING) &&
        (state[RIGHT] != EATING)) {
        state[i] = EATING;
        signal(s[i]);
    }
}

void takeChpStcks(i){
    wait(mutex);
    state[i] = HUNGRY;
    safeToEat(i);
    signal(mutex);
    wait(s[i]);
}
```

```
82 // use a preloaded turnstile to let threads through faster
83 struct barrier {
84     std::atomic<int> expected;
85     std::atomic<int> count;
86     std::atomic<int> mut;
87     std::atomic<semaphore> turnstile1;
88     std::atomic<semaphore> turnstile2;
89     barrier(int expected) {
90         expected(expected).count(0).mut(0);
91         turnstile1.lock();
92         turnstile2.lock();
93     }
94     void arrive_and_wait() {
95         if (expected.load() == count.load()) {
96             if (count.load() == expected) {
97                 if (turnstile1.acquire() && turnstile2.acquire())
98                     turnstile1.release();
99                 else
100                     turnstile2.release();
101             }
102         }
103     }
104     void release() {
105         if (turnstile1.acquire() && turnstile2.acquire())
106             turnstile1.release();
107         else
108             turnstile2.release();
109     }
110     void close() {
111         if (count.load() == 0) {
112             turnstile1.acquire();
113             turnstile2.acquire();
114             turnstile1.release();
115             turnstile2.release();
116         }
117     }
118     void acquire() {
119         if (turnstile1.acquire() && turnstile2.acquire())
120             turnstile1.release();
121     }
122 }
```

Lines 102 and 115: counting semaphore can be increased by expected to allow threads to pass

```
void putChpStcks(i) {
    wait(mutex);
    state[i] = THINKING;
    safeToEat(LEFT);
    safeToEat(RIGHT);
    signal(mutex);
}
```

- Limited seats: Use a semaphore(N-1) to limit the number of philosophers that can eat at the same time

```
type DiningTable3 struct {
    numPhilosophers int
    numChopsticks int
    footman chan struct{}}
func (*DiningTable3).Init(numPhilosophers int) {
    t := new(DiningTable3)
    t.numPhilosophers = numPhilosophers
    t.numChopsticks = numChopsticks
    t.footman = make(chan struct{}, 1)
    for i := 0; i < numPhilosophers; i++ {
        chopstick := make(chan ChopStick, 1)
        chopstick <- (chan ChopStick)(i)
        t.chopstickChs = append(t.chopstickChs, chopstick)
    }
    t.chopstickChs = append(t.chopstickChs, t.footman)
    sync.Synchronize()
}
```

## Barber Shop

- Barbershop consists of a waiting room with n chairs and the barber chair
- If there are no customers to be served, the barber goes to sleep
- If the barber is busy, but waiting room is available, customer seats on one of the chairs
- If barber is sleeping, customer wakes him up
- If all chairs are occupied, customer leaves

|                                                                                                                                                                                                                                                                                                                    |                                                                                                                                                                                                                  |
|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Customer Pseudo-code                                                                                                                                                                                                                                                                                               | Barber Pseudo-code                                                                                                                                                                                               |
| 1 wait(mutex);<br>2 if (customers == n) {<br>3     signal(mutex);<br>4     exit();<br>5 }<br>6 customers += 1;<br>7 signal(mutex);<br>8 signal(customer);<br>9 wait(barber);<br>10 getHairCut();<br>11 signal(customerDone);<br>12 wait(barberDone);<br>13 wait(mutex);<br>14 customers -= 1;<br>15 signal(mutex); | 21<br>22<br>23<br>24<br>25<br>26<br>27 while (TRUE) {<br>28     wait(customer);<br>29     signal(barber);<br>30     cutHair();<br>31     signal(customerDone);<br>32     signal(barberDone);<br>33 }<br>34<br>35 |

Line 31-32 are important since we need customer and barber need to agree that the haircut is done

- GO implementation:

```
void customer(void *handle, void *getHairCut()) {
    std::scoped_lock lock(mutex);
    if (customers == maxCustomers + 1) {
        balk();
        return;
    }
    customers++;
    customer_sem.release();
    cutHair();
    done_customer_sem.release();
    done_barber_sem.acquire();
}

template <size_t maxCustomers>
struct BarberHandle {
    size_t customers;
    std::atomic<mutex> mutex;
    std::condition_variable customer_sem;
    std::condition_variable barbers_sem;
    std::condition_variable done_customer_sem;
    std::condition_variable done_barber_sem;
};

void barber(void *handle) {
    while (true) {
        customer_sem.acquire();
        barbers_sem.release();
        cutHair();
        done_customer_sem.acquire();
        done_barber_sem.release();
    }
}
```

counting semaphores are not FIFO either; customers will not be served in FIFO order

## Architecture

### L3: Processor and memory organization

#### Single Processor Parallelism

- Bit level - we work with word (multiple bits), data parallelism
- Instruction level (from same thread)
- 1. Pipelining - parallelism across time

- Multiple instructions to occupy different stages in the same clock cycle - assuming no control or data dependencies

#### Disadvantages

- Independence
- Bubbles - idle stages
- Data and control flow hazard
- Wrong speculation of if-else branches can lead to wasted cycles
- Synchronisation - need to preserve read-after-write
- no more benefit to improving ILP now

#### 2. Superscalar - parallelism across space



- Duplicate pipelines and allow multiple instructions to pass through the same stage
- Scheduling tough - which ones to execute together?
- E.g. Multiple ALUs
  - Static - compiler decides
  - Dynamic - hardware decides

#### Thread level

- Motivated by the limitation of ILP
- SMT: Duplicate hardware context (PC, registers etc)
- By convention, SMT is limited to 2 threads to reduce overhead and memory contention
- Logical cores: hyperthreads

#### Processor level parallelism

- Add more cores to processors to enable **multiple execution flows**

- Each core can be hyperthreaded
- Shared Memory
- Distributed Memory



#### Flynn's Taxonomy

- Describes parallel architecture based on instruction stream (execution flow - PC) and data stream

#### Single Instruction Single Data

- Single stream of instructions with each working on a single data
- Not to be confused with SIMD from parallel patterns

#### Single Instruction Multiple Data

- Single stream of instructions with each working on multiple data

- Exploit data parallelism (vector processor)
- Same instruction broadcasted to all ALUs
- AVX: intrinsic functions operate on vectors of 4 64 bit values

### Multiple Instruction Single Data

- Multiple instructions operating with a single data

### Multiple Instructions Multiple Data

- Each PU fetches its own instructions
- Each PU operates its own data
- 



### Hierarchical designs



- Each core can have a separate L1 cache and shares the L2 cache
- All cores share common external memory



- Multiple packets being processed in a pipelined fashion
- Cores connected linearly, shares the same cache, memory
- Useful if the same computation has to be applied to a long sequence of data elements

### Network-based design



- Cores and their local memory and memories are connected via an interconnection network

### Why cache

- Cache provides high bandwidth data transfer to CPU and reduce latency in data access
- Memory latency: Amount of time for a memory request from a processor to be serviced
- Bandwidth: Rate at which the memory system can provide data to a processor

- A stall happens when the next instruction depends on previous instructions
- Bandwidth and latency affects stalls, since instructions (sw, lw) needs to wait for the memory system to become available

### Performant parallel programs

- Try not to overload the memory system with too many requests
- Share data across threads (inter-thread cooperation)
- Reuse data fetched previously (temporal locality)
- Favor additional arithmetic over load / store

### Parallel Computers



### Cache coherence

- Multiple copies of data exist on different caches
- Local updates should not be seen by other processes
- Maintained by additional instructions
- Instructions that mess up cache coherence hence presents severe overheads

### Memory consistency

- Memory consistency depends on the PL and architecture
- A seq consistent architecture makes a PL with seq const memory model run faster since fewer instructions are needed to ensure memory consistency

### Distributed Memory



- Each node is an independent unit with processor and memory
- Memory in each node is private
- Nodes communicate through a network

### Shared memory



### Intel Core i7 (quad core) (interconnect is a ring)

- Parallel programmes share memory through controller / provider
- Cache coherence and memory consistency is ensured



### Uniform Memory Access

- Latency of accessing main memory is the same for processors
- Suitable for small number of processors. Contention over memory can be high for large number of processes



### Non-uniform Memory Access

- Physically distributed memory of all processing elements are combined to form a global shared memory
- Local memory access has lower latency
- Reduce contention since each processor tends to access local memory
- Adding more processes does not increase contention as much as UMA
- Data consistency is easier too

### Cache Coherent NUMA (CCNUMA)

- Each node has cache to reduce contention

Example: modern multi-socket configuration



green faster than red





### Cache only Memory Architecture (COMA)

- Each memory block works as cache memory. This means that no fixed space stores data permanently and cache block with data can be moved around dynamically.
- Data migrates dynamically to keep data as close as possible to the processors
- Cache coherence is harder since data may not just be copied, they can also be shifted around.

## L7: Cache coherence and memory consistency

### Cache properties

- Larger cache reduces cache miss but increases access time
- Block size (cache line): data is transferred between main memory and cache in blocks of fixed size
- Larger block size – greater spatial locality
- Smaller block size – shorter replacement

### Case Study: Matrix Multiplication

- Size of matrix: A 256KB cache can only store a matrix of floats of size  $(178 \times 178) * 8 \text{ Bytes}$  (float size)

### Write Policy

- Write through
  - Write access is immediately transferred to memory
  - Advantage: always get the newest value of a block
  - Disadvantage: slow down due to many memory access (use a buffer!)
- Write-back
 

|            |     |                 |
|------------|-----|-----------------|
| Line state | Tag | Data (64 bytes) |
|------------|-----|-----------------|

  - Write is only performed in the cache, write to main memory is only performed when the cache is replaced (dirty bit)
  - Advantage: fewer write operations
  - Disadvantage: memory may contain invalid entries

### Cache coherence

- Problem: Multiple copies of the data exists on different cache lines, stale data may exist

### Coherence

- Each processing unit should have a consistent view of the memory through its local cache
- All processing units should agree on the order of read writes to the same memory space
- Property 1: Program Order Property
  - Programme should observe the effects of writes in the order of the programme
- Property 2: Write propagation
  - Writes become visible to other processing units eventually
- Property 3: Write serialization
  - Given:
    1. write  $v_1$  to x
    2. write  $v_2$  to x
  - programme should never read x as  $v_2$  and then as  $v_1$
  - All writes to a location are seen in the same order by all execution units, eventually

### Tracking cache line sharing status

#### • Snopping based

- No centralised directory
- Each cache keeps track of the sharing status
- Cache monitors and snoop on the bus to keep the cache line updated
- Used in architectures with a bus
- Write Propagation: All the processing units on the bus can observe changes made by every other bus
- Write serialization: Bus transactions are visible to the processing units in the same order
- Granularity: cache block

#### • Directory based

- Sharing status is kept in a central directory
- Commonly used in a NUMA architecture

#### • Implications

- Increased in overhead: increased memory latency, reduced cache hit rate
- Cache ping-pong: the effect where a cache line is transferred between multiple cores as a result of true / false sharing
- False sharing: different threads have data that is not shared in the program, but this data gets mapped to the same cache line
- False sharing makes cache ping pong difficult to detect, since the code ensures that memory are not shared but they happened to be mapped to the same cache line



### Memory Consistency Models

- Coherence ensures that processing units agree on the order of writes on the SAME memory location, and that all writes to shared memory will eventually propagate
- Consistency ensures that processing units agree on the order of writes on DIFFERENT memory locations
- Under the consistency rules, the instructions can be reordered to hide latencies
- **4 types of memory operations orderings**
  - must commit – the results are visible
  - $W \rightarrow R$ : write to X must commit before the subsequent read of Y
  - $R \rightarrow W$ : read of X must commit before the subsequent write of Y
  - $R \rightarrow R$ : read of X must commit before the subsequent

read of Y

- $W \rightarrow W$ : write to X must commit before the subsequent write of Y

### Sequential Consistency

- Every processing unit issues their memory operations in programme order
- Global results of all memory operation on every memory address appear in the same sequential order to every processing unit
- All 4 memory operation orderings are observed
- Poor performance
- Examples:

| processor | $P_1$                  | $P_2$                  | $P_3$                  |
|-----------|------------------------|------------------------|------------------------|
| program   | (1) $x_1 = 1$ ;        | (3) $x_2 = 1$ ;        | (5) $x_3 = 1$ ;        |
|           | (2) print $x_2, x_3$ ; | (4) print $x_1, x_3$ ; | (6) print $x_1, x_2$ ; |

#### Once 1 core sees an interleaving, the same interleaving will be observed by other cores

- Possible interleavings
  1. (1)-(3)-(5)-(2)-(4)-(6)
  2. (1)-(2)-(3)-(4)-(5)-(6)
- Impossible output: 011001
- To produce 0110, we need something like (1)-(3)-(2)-(4). But after this it is not possible to produce another 0 since the last read statements happens after all the write statements

### Relaxed memory consistency

- Relax if data dependencies allow
- **Data dependency: if two ops access the SAME memory location**
  - $R \rightarrow W$
  - $W \rightarrow W$
  - $W \rightarrow R$

### Relaxed Consistency: Write-to-read (WR)

- Allows a read on processing unit P to be reordered wrt the previous write operations on different memory locations
- Data dependencies must be observed, but it is only wrt the same memory location
- Data dependencies cannot be chained
- Different models depends on the timing of return
- **Total Store Ordering**
  - Processing units can **move its own reads** in front of its own writes
  - **Write Atomicity is observed**: Reads by other processing units cannot return new values of address A until the write to A is observed by all PUs
- **Processor Consistency**
  - **Write atomicity is not observed**: write can be read by some processing units before they are read by other processing units
  - **Write serialization is observed**: writes to the same memory location are seen in the same order by all processing units

SC:  
W  $\rightarrow$  R  
R  $\rightarrow$  R  
R  $\rightarrow$  W  
W  $\rightarrow$  W

TSO and PC:  
W  $\rightarrow$  R  
R  $\rightarrow$  R  
R  $\rightarrow$  W  
W  $\rightarrow$  W

PSO:  
W  $\rightarrow$  R  
R  $\rightarrow$  R  
R  $\rightarrow$  W  
W  $\rightarrow$  W

### Relaxed Consistency: Write-to-Write (WW)

- Writes can bypass earlier writes (to different locations) in write buffer
- Allows write misses to overwrite to hide latency
- Can only reorder within the same processing unit
- **Partial Store Order**
  - Relax  $W \rightarrow R$  similar to TSO
  - Relax  $W \rightarrow W$
- **Example 1**



- Only PSO can observe A=0, B=1 since only it reorders WW
- **Example 2**



| Property                                                                                 | Sequential Consistency (SC) | Relaxed Consistency: Total Store Ordering (TSO) | Relaxed Consistency: Processor Consistency (PC) | Relaxed Consistency: Partial Store Ordering (PSO) |
|------------------------------------------------------------------------------------------|-----------------------------|-------------------------------------------------|-------------------------------------------------|---------------------------------------------------|
| Respects data dependencies within the same core (e.g., don't touch: x = 5, read x)       |                             |                                                 |                                                 | Yes                                               |
| Preserves R $\rightarrow$ R and R $\rightarrow$ W order                                  |                             |                                                 |                                                 | Yes                                               |
| Preserves W $\rightarrow$ R                                                              | Yes                         | No                                              | No                                              | No                                                |
| Preserves W $\rightarrow$ W                                                              | Yes                         | Yes                                             | Yes                                             | No                                                |
| All processors must be able to see same value before a read completes? (Write Atomicity) | Yes                         | Yes                                             | No                                              | Yes                                               |

## L11: Interconnection networks

Torus



**3 x 3 PEs in a Mesh**

- Wraps left to right, top to bottom
- All PEs have four links

**Shear Sorting Algo**

**Shear Sort Algorithm:** Basic Idea



Arrange PEs into a 2D mesh



Sort into a "snake" width and length

Keep doing row sort and column sort till it is sorted in snake order!

for N numbers,  $\log_2 N + 1$  phases

**Topology**



Direct Interconnection

- Static, point-to-point
- Endpoints are of the same type (core, memory)

**Indirect Interconnection**

- Dynamic
- Interconnect formed by switches

**Topology Metric**

- Diameter:** maximum distance between any pair of nodes
  - Small diameter ensures small distances for message transmission
- Degree:** number of direct neighbors
  - Small node degree reduces the node hardware overhead
- Bisection width:** minimum number of edges that must be removed for the network to have 2 equal parts
  - Capacity of a network when transmitting message simultaneously
- Connectivity**
  - Node connectivity:** minimum number of nodes that fail to disconnect the network
  - Determines robustness of network
  - Edge connectivity:** minimum number of edges that fail to disconnect the network
  - Determine the number of independent paths between any pair of nodes

**Cube Connected Cycles**



Hypercube,  $k = 3$

- Replace a node with a cycle of  $k$ -nodes
- Total nodes:  $2 * 2^k$  nodes
- X-node index in hypercube
- Y-position in the cycle
- Node  $(X, Y)$  is connected to:
  - Cycle buddies:
  - $(X, (Y+1) \bmod k)$
  - $(X, (Y-1) \bmod k)$
  - Link from the corresponding dimension  $y$  in the hypercube
  - $(X \oplus 2^y, Y)$

**Indirect Interconnect**

- To reduce hardware costs by sharing switches and links
- Switches share links between nodes and can be configured dynamically
- Cost: number of switches and links
- Concurrent connection (bandwidth) **Bus Network**



- Only 1 pair can communicate at a time

- Bus used to coordinate
- Used for small number of processes

**Crossbar network**

A  $n \times m$  crossbar network has  $n$  inputs and  $m$  outputs



- Switch: straight or direction changing
- Hardware costly since there's  $n \times m$  switches
- Hard to scale since cost increases exponentially
- For small number of processes

**Multistage Interconnection Network (MIN)**



- Several intermediate switches connecting wires between neighbouring switches
- Goal: obtain a small distance for arbitrary small input and output devices

**Omega Network**

Has an edge from node  $(\alpha, i)$  to two nodes  $(\beta, i+1)$  where
 

- $\beta = \alpha$  by a cyclic left shift
- $\beta = \alpha$  by a cyclic left shift + inversion of the LSBit



- One unique path for every input to output
- $n \times n$  network has  $\log n$  stages
- $n / 2$  switches per stage
- Switch position:  $(\alpha, i)$  where  $\alpha$  is the position of a switch within a stage and  $i$  is the stage number

**Omega Network Vs Crossbar**

- 16 processors and 16 memory nodes
- Cross bar:  $16 \times 16 = 256$
- Omega:  $n=16$ , using  $2 \times 2$  switches,  $n/2 \times \log n = 32$  switches

**Butterfly network**



Node  $(a, i)$  connects to

- $(a, i+1)$ , i.e. straight edge
- $(a', i+1)$ ,  $a$  and  $a'$  differ in the  $(i+1)$ th bit from the left, i.e. cross edge

**Baseline network**



Node  $(a, i)$  to two nodes  $(b, i+1)$  where

- $b = \text{cyclic right shift of last } (k-i) \text{ bits of } a$
- $b = \text{inversion of the LSBit of } a + \text{cyclic right shift of last } (k-i) \text{ bits}$

## Parallel Computation Models

### L4: Shared-memory programming models

**Parallelism**

- Average number of units of work that can be performed in parallel per unit time.
- E.g. MIPS, MFLOPS
- Limitation: Program dependencies - data, control
- Runtime delays - memory contention, communication overheads, thread overhead, synchronisation
- We cannot reorder them however we like
- Work = Task + dependencies (limitations)

**Data parallelism**

- If iterations are **independent**, they can be executed in arbitrary order on multiple cores
- Partition data among processing units, each doing similar work
- Commonly expressed as a loop, if the iterations are independent and can be executed in arbitrary order
- E.g. SIMD computers
- OpenMP - matrix multiplication**

```

// parallelize result = a * b
// each thread works on one iteration of
// the outer-most loop
// vars (a, b ,result) are shared
#pragma omp parallel for num_thread(8)
    shared(a, b, result) private(i, j ,k)
    ...
    
```

**Same as**

```

for (i=0; i < size; i++)
    for (j=0; j < size; j++)
        for (k=0; k < size , k++)
            result[element][i][j] +=
                a.element[i][k] *
                b.element[k][j]
    
```

## • Single Program Multiple Data (SPMD)

- Same programme may behave differently based on the data
  - Good if
  - E.g. Scalar product of  $x \cdot y$  on p processing units
- ```

local_size = size/p;
local_lower = me * local_size;
local_upper = (me+1) * local_size - 1;
local_sum = 0.0;

for (i=local_lower; i<=local_upper; i++)
    local_sum += x[i] * y[i];

Reduce(&local_sum, &global_sum, 0, SUM);

```
- Same program executed by p processing units  
"me" is the processing units index (0 to p-1)

## Task parallelism

- Partition the tasks among the processing units
- independent program tasks/ parts can be executed in parallel
- Granularity: statement, loop, function
- More complexed than data parallelism → needs to schedule, map, take care of dependencies ...

## • Decomposition

- The room for parallelism in a task depends on how the task is decomposed



## Task dependence graph

- DAG: node=tasks, value=expected execution time, edge=control dependency
- Bad for one process to take disproportionately more data → idle time
- Critical path length: maximum slowest completion time
- Degree of concurrency=total work/critical path length



Critical Path = (Task 4 → 6 → 7)  
Critical Path Length = 27  
Degree of concurrency = 63 / 27 = 2.33

Critical Path = (Task 1 → 5 → 6 → 7)  
Critical Path Length = 34  
Degree of concurrency = 64 / 34 = 1.88

## Parallelism



## Coordination: Shared memory

- Protect access to shared address space, mutex.
- Needs hardware support to implement efficiency. NUMA makes it easier but it is still costly to scale due to contention (any processor can load/ store to any address)
- Can be done without a shared memory system (NUMA, UMA)
- Any type of coordination can be used in any hardware via software

## Coordination: Data-parallel

- SIMD, vector processors
- Traditional: Map a function onto a large collection of data
- Side effect free execution
- Modern: Data-parallel languages do not enforce this structure
- SPMD model used in CUDA, OpenCL, ISPC instead

## Coordination: Message passing

- Tasks operate within their own private address space and communicate by explicitly sending / receiving messages
- E.g. MPI, GO
- Hardware does not implement system wide loads and stores, can connect commodity systems together to form large parallel machines
- Many many computers, not a very big one
- Compatible with distributed memory systems

## Coordination and hardware

- Shared memory: UMA, NUMA. Copies of messages are sent / received from library buffers
- Message passing: distributed systems, clusters, supercomputers
- Any abstraction can be implemented with any hardware but it will be more costly
- Shared address space on incompatible hardware
  - Write: Send message to all cores to invalidate value
  - Read: page fault handler issues appropriate network requests

## Summary of Coordination Models

- Shared address space: very little structure
  - All threads can read and write to all shared variables
  - Drawback: not all reads and writes have the same cost (and that cost is not apparent in program text)
- Data-parallel: very rigid computation structure
  - Programs perform the same function on different data elements in a collection
- Message passing: highly structured communication
  - All communication occurs in the form of messages

## Foster's Design Methodology



## Foster's Design methodology

1. Partitioning
  - Divide computation and data into independent pieces to discover maximum parallelism
  - Two approaches:
    - 1.1. Domain decomposition: divide data into smaller, equal pieces. Associate computation with data.
    - 1.2. E.g. 24 tasks with 3 grids each → 6 tasks with 12 grids each
    - 1.3. Functional decomposition: Divide computation into piece. Associate data with computation.
    - 1.4. E.g. Climate model → Atmospheric model, hydrology model ...
  - Rule of thumb:
    - 10x more primitive tasks than cores in target computer
    - Minimize redundant computations and redundant data storage
    - Primitive data should be of roughly the same size
    - Number of tasks an increasing function of problem size
2. Communication (coordination)
  - Dependencies between tasks necessitates communication
  - Overlap computation and communication such that when some tasks are communicating, others are computing (improve utilisation)
3. Agglomeration
  - Combine tasks into larger tasks s.t. **tasks ≥ cores**
  - Goals:
    - Improve performance by reducing cost of task creation and communication
    - Maintain scalability of program
    - Simplify programming
    - E.g. Granular: One task per grid
      - 8\*8=64 tasks
      - 64 \* 4 (neighbors) \* 2(send/ receive)=512 data transfers
    - Coarse: 16 grid per task
      - 2\*2=4 tasks
      - 4\*4\*2=32 data transfers
      - larger messages
    - Rule of thumb:
      - Increases locality of parallel programmes (more neighbors read)
      - Number of tasks increases with problem size
      - Number of tasks suitable for likely target increases ( $10n\text{umCores}$ )
      - Trade-off between agglomeration and code modification should be resonable (man hour)
4. Mapping
  - Assignment of tasks to execution units
  - Goals:
    - Maximise processor utilisation: place tasks of different cores
    - Minimise inter-process communication: Place tasks that communicate often on the same core to increase locality
    - Can be performed by user (distributed memory systems) or OS (centralised multiprocessor)
    - Rule of thumb:
      - Finding optimal mapping is NP hard in general (set cover)
      - Consider designs based on one task per core and multiple tasks per core
      - Evaluate static and dynamic task allocation

- Does not allow overlap of computation and communication - **Sequential**

Unoptimized sum N numbers distributed among N (= 8) tasks need O(N) time



- Rule of thumb:
  - Communication operation balanced among tasks
  - Each task communicates with only a small group of neighbors
  - Tasks can communicate in parallel
  - Overlap computation with communication

## 3. Agglomeration

- Combine tasks into larger tasks s.t. **tasks ≥ cores**

## • Goals:

- Improve performance by reducing cost of task creation and communication
- Maintain scalability of program
- Simplify programming
- E.g. Granular: One task per grid
  - 8\*8=64 tasks
  - 64 \* 4 (neighbors) \* 2(send/ receive)=512 data transfers
- Coarse: 16 grid per task
  - 2\*2=4 tasks
  - 4\*4\*2=32 data transfers
  - larger messages

- Rule of thumb:
  - Increases locality of parallel programmes (more neighbors read)
  - Number of tasks increases with problem size
  - Number of tasks suitable for likely target increases ( $10n\text{umCores}$ )
  - Trade-off between agglomeration and code modification should be resonable (man hour)



## 4. Mapping

- Assignment of tasks to execution units
- Goals:
  - Maximise processor utilisation: place tasks of different cores
  - Minimise inter-process communication: Place tasks that communicate often on the same core to increase locality
  - Can be performed by user (distributed memory systems) or OS (centralised multiprocessor)
  - Rule of thumb:
    - Finding optimal mapping is NP hard in general (set cover)
    - Consider designs based on one task per core and multiple tasks per core
    - Evaluate static and dynamic task allocation

- Dynamic: allocator should not be performance bottleneck
- Static: task:core  $\geq 10:1$



### Automatic Parallelization

- Compilers perform decomposition and scheduling
- Drawbacks:
  - Dependence analysis is difficult for pointer-based computation or indirect addressing
  - Execution time of function calls or loops with unknown bounds is difficult to predict at compile time

### Functional programming languages

- Describe the computations of a program as the evaluation of mathematical functions without side effects
- Advantage: New language constructs are not necessary to handle a parallel execution
- Challenge: Extract the parallelism at the right level of recursion

### Parallel Programming Patterns

- Patterns are not mutually exclusive, use the best match

### Fork Join

- Children run in parallel but are independent
- Children execute the same or different program
- Children join the parent at different points
- Good for loop parallelism (independent for loops)
- **Implementation:** Processes, threads etc

```
P1 = Fork {
    P3 = Fork { return Model = "civic" }
    P4 = Fork { return Year = "2001" }
    Join P3, P4
    Return P1 AND P4
}

P2 = Fork {
    P5 = Fork { return Color = "green" }
    P6 = Fork { return Color = "white" }
    Join P5, P6
    Return P2 OR P6
}

Join P1, P2
Return P1 AND P2
```

### Parbegin - Parenend

- most relaxed, code is structured into sequential segments and parallel segments
- Programmer specifies a sequence of statements to be executed in parallel
- A set of threads is created and the statement of the construct are assigned to these threads
- All the forks are done at the same time and all the joins are done at the same time

- Statements after parbegin and parenend are only executed after all threads joins (barrier)
- **Implementation:** OpenMP or compiler directives
- E.g Matrix multiplication using openMD

### SIMD (not the Architecture)

- Single instructions are executed synchronously by different threads on different data
- Similar to parbegin-parenend but all threads execute the same instruction at the same time (synchronous)
- Parallel but synchronous
- **Implementation:** AVX / SSE instruction on intel processor

| mulps xmm1, xmm0 |      |      |     |   |  |  |  |
|------------------|------|------|-----|---|--|--|--|
| 127              | 95   | 63   | 31  | 0 |  |  |  |
| 4.0              | 3.0  | 2.0  | 1.0 | * |  |  |  |
| *                | *    | *    | *   | * |  |  |  |
| 5.0              | 5.0  | 5.0  | 5.0 |   |  |  |  |
| ---              | ---  | ---  | --- |   |  |  |  |
| 20.0             | 15.0 | 10.0 | 5.0 |   |  |  |  |

xmm registers are 128 bits long  
SSE instruction treats the xmm registers as 4 individual 32-bit floating point value

### SPMD

- Same program executed on different cores but operate on different data
- Different threads might execute on different instructions of the same program due to control flow (ifs) and speed of cores
- Similar to parbegin-parenend but there is no implicit synchronization (lack of barrier)
- E.g. programs on GPGPU

### Master-Worker

- Single program controls the execution of the program
- Master executes main function, assigns work to worker threads
- Initialisation, output and Coordination is done by master
- Worker waits for instruction
- **Benefit:** Good for simple and homogeneous worker threads and a master thread to organize them

```
int main(int argc, char ** argv)
{
    int nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    size = 2048;
    // One master (rank = 0) and nprocs-1 workers
    if (myid == 0) {
        master();
    } else {
        worker();
    }
    MPI_Finalize();
    return 0;
}
```

```
void master()
{
    matrix a, b, result;

    // Allocate memory for matrices
    allocate_matrix(&a);
    allocate_matrix(&b);
    allocate_matrix(&result);

    // Initialize matrix elements
    init_matrix(a);
    init_matrix(b);

    // Distribute data to workers
    master_distribute(a, b);

    // Gather results from workers
    master_receive_result(result);

    // Print the result matrix
    print_matrix(result);
}
```

```
void worker()
{
    int rows_per_worker = size / workers;
    float row_a_buffer[rows_per_worker][size];
    matrix b;
    float result[rows_per_worker][size];

    // Receives data
    worker_receive_data(&b, row_a_buffer);

    // Performs computations
    worker_compute(b, row_a_buffer, result);

    // Sends the results to master
    worker_send_result(result);
}
```

### Task Pool

- Common data structure for threads to retrieve tasks
- Number of threads is fixed
- Threads are statically created by main
- Work is not pre-allocated. Instead worker retrieves new tasks from pool
- Thread can generate new tasks to put in pool and coordination is not done by master (difference from master-worker)
- May run into producer consumer issues when accessing the pool
- Execution is completed when the pool is empty AND each thread has terminated the processing of its last task
- **Benefits:**
  1. Adaptive can generate tasks dynamically, good for irregular applications
  2. Overhead for thread creation is independent from execution
- **Disadvantages**
  1. For fine grained tasks, the overhead of retrieving and insertion becomes significant

```
class ThreadPoolExample {
    public static void main(String[] args) {
        ExecutorService executor =
            Executors.newFixedThreadPool(5);

        for (int i = 0; i < 10; i++) {
            Runnable Task = new Task(.....);
            executor.execute( Task );
        }
        .....
    }
}
```

5 threads

10 tasks added to the pool.

### Producer Consumer



- Producer produces data which are used as input by consumer threads
- Synchronisation is needed to ensure correct coordination between producer and consumer threads

```
void produce() {
    synchronized (buffer) {
        while (buffer is full)
            buffer.wait();
        Store an item to buffer;
        if (buffer was empty)
            buffer.notify();
    }
}
```

```
void consume() {
    synchronized (buffer) {
        while (buffer is empty)
            buffer.wait();
        Retrieve an item from buffer;
        if (buffer was full)
            buffer.notify();
    }
}
```

### Pipelining

- Data is partitioned into a stream that flows through pipeline stages synchronously
- Each stage (threads) can be processed in parallel (functional parallel stream)



## L6: Data parallel models (GPGPU)

### Shader GPU

- Hard to transfer data between GPU and CPU
- No scatter: threads cannot write to arbitrary or multiple mem locations
- No communication between fragments
- Coarse thread synchronisation
- Example of data parallelism: fast processors for performing the same computation on large collection of data

### FLOPs performance on GPGPU

- Best performance with single precision FLOPs
- 2 processors need to work to perform double precision

### GPU Architecture



- Multiple Streaming Multiprocessors (SMs) - Memory, cache, connecting interface (PCI)
- SM consists of multiple compute cores
  - Memories(register, L1 cache, shared memory)
  - Logic for thread and instruction management

### CUDA programming model

- Compute Unified Device Architecture
- Simple extension to standard C
- Mature software stack (high-level to low level)
- User launches batches to threads on the GPU Fully general load / store memory model (CRCW)
- Scales with non-NVIDIA GPUs too

- Transparently scales to hundreds of cores and thousands of parallel threads
- Programmer focus on parallel algorithms
- Enable heterogeneous systems (CPU + GPU)

## CUDA layers



## CUDA kernels and threads

- Device=GPU
- Host=CPU
- Kernel=function that runs on the device
- Parallel portions execute on device as kernels, and multiple are allowed in CUDA hardware
- CUDA threads are extremely light weight with minimal creation overhead and instant context switches
- The key is to divide work to thousands of threads

## Arrays of parallel threads

- A CUDA kernel is executed by an array of threads
- All threads run the same code (SPMD)
- Each thread has an ID that is used to compute memory addresses and make control decisions

## Thread cooperation

- Threads in the array need not be completely independent
- Shares results to save computation
- Share memory accesses which reduces bandwidth
- **Scalable Cooperation**
  - Divide monolithic thread away into multiple blocks
  - In a block: shared memory, atomic operations and barrier synchronisation
  - Threads in different blocks cannot cooperate
- Enables programs to transparently scale to any number of processes

## Thread Execution Mapping to Architecture

- SIMD execution model
- Multiprocessors, creates, manages, schedules and execute threads in SIMD Warps (32)
- Threads in a warp starts at the same program address
- Threads have individual programme counter and state
- A block is always split into warps in the same way
- Having divergent control flow will cause the programme to stall

## CUDA Memory Model



- Kernels are launched in grids
- A block executes on one SM (streaming multiprocessor)
- A block cannot be migrated, but several blocks can reside in one SM
- Register file and shared memory are partitioned among all resident thread blocks

## Cuda memory space

- Data must be explicitly transferred from CPU to device
- Shared memory is the cache, and is therefore not cached
- Global, local memory are cached and needs to be warmed up
- Constant memory is useful for uniformly-accessed read-only data
- Spatial data is useful for coherent random-access read-only data (cached too)

## Coalesced access

### CUDA Memory Model



- Simultaneous access to global memory by threads in a warp is coalesced to transactions of 32 bytes
- Reduce disk I/Os

## Shared Memory

- Higher bandwidth and low latency than local or global
- Divided into equally-sized banks
- Addresses from different banks can be accessed simultaneously
- Bank conflict: two threads access two different addresses in the same memory bank – has to be serialised
- Bank broadcast: (threads accessing the same address in a bank) one reading thread broadcast the result to the conflicting threads so they all get the info

## Strided access

## wasted bandwidth



- Threads within a warp access memory with a stride size of x
- This increases the number of bank conflicts by x times!
- Half of the elements in the transactions are not used and represent wasted bandwidth

## Optimisation in Cuda - goals

1. Maximum memory band width by coalescing memory access
2. Maximise parallel execution by maximising data parallelism and increase hardware utilisation (SIMD!)
3. Maximum instruction throughput by avoiding different execution paths within the same warp

## Memory Optimizations

- Minimize data transfer between host and device
- Ensure global memory are coalesced whenever possible
- Minimize global memory accesses by using shared memory
- Minimize bank conflicts in shared memory accesses (e.g. adding padding words between every 32 wordss)

## Data transfer between host and device

- Peak bandwidth between device and GPU is higher than between host and device
- Hence data transfer between host and device should be minimized
- E.g. running kernel on GPU without any performance benefits over CPU
- Batch small transfers into one larger transfer
- Use paged-locked or pinned memory transfer (not cached) – eliminates a step in memory transfer
- Page pinned: locked in RAM, cannot be moved to Disk. Both CPU and GPU can access them. Overuse can cause performance issues as it cannot be swapped out of RAM.

## Concurrent data transfers and executions



- Overlap asynchronous transfers with computation
- `cudaMemcpyAsync()` instead of `cudaMemcpy()`, and do CPU computation while data transfers

- Use different streams to achieve concurrent copy and execute

## Execution Configuration

- Improve occupancy
  - Number of warps should = number of processors
  - So every processor has 1 warp to execute
  - High occupancy hides memory latency and when a block synchronises
- Threads per block should be multiples of warp size
  - If one warp blocks, the other can execute. Better coalesced access
  - Use smaller thread blocks to reduce chances of bank conflict
- Limitation on block size
  - Limited by registers and shared resource
  - The kernel prevents launch if the block allocates more thread resources than available
  - This ensures that at least one block can execute
- Multiple contexts
  - If multiple CUDA apps access the same GPU concurrently, there are likely multiple contexts

## Maximise instruction output

- Use single precision floats where possible
- Replace integer division and modulo operations with bitwise operations
- Use signed loop counters

## Control Flow

- Reduce divergent warps caused by control flow instructions
- Reduce the number of instructions where possible

## L9: Parallel Programming Models - II

### Data distribution for 1D array



### Block wise distribution

- Preserves locality of data
- Difficult to apply if hunk size is unknown

### Cyclic data distribution

- Chunks are of more equal size

### Data distribution for 2D arrays

- Combination of blockwise and cyclic distribution in one or both dimensions
- One-dimension distribution: By column dimension



- One-dimension distribution: Block-cyclic



- Form blocks of size  $b$ , then perform cyclic round robin allocation
- Closer to even distribution
- Set block size to task size to improve locality
- Some PUs might have more data

## 2D Checkerboard



- Blockwise: elements split among both dimensions
- Cyclic: based on processor mesh. The communication delay between p1 to p4 will be longer
- Block-cyclic: elements split into  $b_1 \times b_2$  size blocks then cyclic assignment to processors

## Example: Heat Transfer Simulation

- $N \times N$  metal plate with  $p$  processors
- $P$  less than  $N$
- Idea 1: Include border



- Idea 2: Checkerboard, row-wise



- Minimizes communication: only talks to the 4 neighbours
- Connection is only 1 step away

- Idea 3: Row-wise Block Cyclic



- Low granularity, poor locality: task size is huge, and each block may not fit into the cache line
- Checker board is better as task size can be defined to fit into cache line

## Information Exchange

### Shared Address - Shared memory

- Assumes global memory access
- Need synchronisation

### Distributed Space - Communication Operation

- Assumes disjoint memory model

## Message Exchange Protocol

|               | Blocking Operations                                                                                                             | Non-Blocking Operations                                                                                                                                                 |
|---------------|---------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Buffered      | Sending process returns after data has been copied into communication buffer<br>(blocks until data is sent to buffer)           | Sending process returns after initiating the transfer to buffer. This operation might not be completed on return.<br>(does not block, no guarantees on the correctness) |
| Non-buffered  | Sending process blocks until matching receive operation has been encountered.<br>(blocks until it is safe to use the sent data) | Programmer must explicitly ensure completion of the operation by polling.                                                                                               |
| Blocking Send | Send operation blocks till input buffer is safe to be reused                                                                    |                                                                                                                                                                         |

- Non-buffered Blocking Send

- Operation blocks until the matching receive is performed
- Considerable idling timing due to mismatch between send and receive
- Possible to deadlock
- Can happen with buffered blocking send once the buffer is full

### Buffered Blocking Send

- Reduces idling overhead but increases copying overhead
- Sender returns after data is copied into buffer
- Deadlock!



### Non-blocking send

- Returns before it is semantically safe
- Allows programmers to overlap operations and hide communication overhead



## Send and Receive MPI

|              | Synchronous                                                                    | Asynchronous                                                                                              |
|--------------|--------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------|
| Blocking     | <code>MPI_Ssend</code><br>( <code>MPI_Mrecv</code> )<br><code>MPI_Rsend</code> | May be buffered:<br><code>MPI_Send</code><br><code>MPI_Recv</code><br>Buffered:<br><code>MPI_Bsend</code> |
| Non-blocking | <code>MPI_Issend</code><br>( <code>MPI_Imrecv</code> )                         | <code>MPI_Isend</code><br><code>MPI_Irecv</code>                                                          |



## Logical Ring

|   | Phase | Process 0                      | Process 1                      | Process 2                      | Process 3                      |
|---|-------|--------------------------------|--------------------------------|--------------------------------|--------------------------------|
| 1 |       | <code>MPI_Send()</code> to 1   | <code>MPI_Recv()</code> from 0 | <code>MPI_Send()</code> to 3   | <code>MPI_Recv()</code> from 2 |
| 2 |       | <code>MPI_Recv()</code> from 3 | <code>MPI_Send()</code> to 2   | <code>MPI_Recv()</code> from 1 | <code>MPI_Send()</code> to 0   |

- Trace the logic from 0 to 3

- There will be no deadlocks since there are implicit barriers due to the blocking calls

## Process Group

- Process group: ordered set of processes, each with unique rank
- A process can be in multiple groups

## Communicators

- Communication domain for a group of processes

### Intra communicators

- Support the execution of arbitrary collective communication on a single group
- E.g. `MPI_COMM_WORLD`

### Inter communicators

- Supports point to point communication operations between process groups

## Process Group Ops



- Send completes after matching receive and source data sent
- Receive completes after matching send and data transfer completes

### Async

| Functionality             | MPI Call                            |
|---------------------------|-------------------------------------|
| Obtain a new group        | <code>MPI_Comm_group</code>         |
| Get size of a group       | <code>MPI_Group_size</code>         |
| Rank a process in a group | <code>MPI_Group_rank</code>         |
| Group union               | <code>MPI_Group_union</code>        |
| Group intersection        | <code>MPI_Group_intersection</code> |
| Group difference          | <code>MPI_Group_difference</code>   |
| Group inclusion           | <code>MPI_Group_incl</code>         |
| Group exclusion           | <code>MPI_Group_excl</code>         |
| Group compare             | <code>MPI_Group_compare</code>      |
| Delete group              | <code>MPI_Group_free</code>         |

## Communicator Ops

| Measure            | Definition                                                                                | Unit                    |
|--------------------|-------------------------------------------------------------------------------------------|-------------------------|
| Bandwidth          | Maximum rate at which data can be sent                                                    | bits (bytes) per second |
| Byte transfer time | Time to transmit a single byte                                                            | Seconds/byte            |
| Time of flight     | Time the first bit arrived at the receiver (channel propagation delay)                    | second                  |
| Transmission time  | Time to transmit a message                                                                | second                  |
| Transport latency  | Total time to transfer a message = transmission time + time of flight                     | second                  |
| Sender overhead    | Time of computing the checksum, appending the header, and executing the routing algorithm | second                  |
| Receiver overhead  | Time of checksum comparison and generation of an acknowledgment                           | second                  |
| Throughput         | Effective bandwidth                                                                       | bits (bytes) per second |

## Virtual Top

| Functionality                        | MPI Call                     |
|--------------------------------------|------------------------------|
| Create a Cartesian topology          | <code>MPI_Cart_create</code> |
| Get info on Cartesian topology       | <code>MPI_Cart_get</code>    |
| Get number of dimension              | <code>MPI_Cartdim_get</code> |
| Comm rank → Cartesian coords         | <code>MPI_Cart_coords</code> |
| Cartesian coords → comm rank         | <code>MPI_Cart_rank</code>   |
| Access neighbors in Cartesian coords | <code>MPI_Cart_shift</code>  |

## Virtual Topology Operations

| Functionality                        | MPI Call                     |
|--------------------------------------|------------------------------|
| Create a Cartesian topology          | <code>MPI_Cart_create</code> |
| Get info on Cartesian topology       | <code>MPI_Cart_get</code>    |
| Get number of dimension              | <code>MPI_Cartdim_get</code> |
| Comm rank → Cartesian coords         | <code>MPI_Cart_coords</code> |
| Cartesian coords → comm rank         | <code>MPI_Cart_rank</code>   |
| Access neighbors in Cartesian coords | <code>MPI_Cart_shift</code>  |

## Scatter, gather



## Single Broadcast



- Each processor sends the same data bloc to every other processor
- Data blocks are collected in rank order

## Single Accumulate



- Reduction operation is applied by element to the data blocks
- Result is collected at root

## Multi Accumulate



- Each processor provides for every other processor a potentially different data block
- Data block for the same receiver are combined with a given reduction operation

## Total exchange

- Each process executes a scatter operation (sends different message to each processor)
- Stepwise to multi-broadcast, that sends the same message to all processors

## Stepwise and Duality



- Duality: The same spanning tree can be used for both operations



## Performance and Scalability of Parallel Programs

### L5: Performance of parallel systems

#### Two Views

- Response Time (user): duration of a program is reduced (start - end time)
- Throughput (computer manager): more work to be done in the same time (jobs per second)

#### Performance Factors

- Programming Model: how well the programmer can code (like good language, API etc)
- Computational Mode: How well the given program runs in the given architecture
- Architectural Model: interconnection network, memory organization, execution mode, sync or async processing

#### Response time in sequential programs

- Wall-clock time
- Comprise of
  - User CPU time: time CPU spends executing program
    - Know that read and write cycles take different time
  - System CPU time: time CPU spends on system instructions. Depends on OS.
  - Waiting time: IO waiting time and execution of other programs due to time sharing. Depends on the load of the system.

#### User CPU time

- $Time_{user}(A) = N_{cycle} * Time_{cycle}$
- $N_{cycle}$ 
  - Depends on translation of program statements by the compiler into instructions
  - For a program with n instructions:

- $N_{cycle} = \sum_{i=1}^n CPI_i * n_i(A)$
- $n_i(A)$ : number of times instruction i is executed in program A
- Depends on architecture of the computer system and compiler
- $CPI_i$ : average number of cycles per instruction i
- Refinement with memory access**
- $Time_{cycle}$ : Execution time for each instruction,  $\frac{1}{clockrate}$
- $Time_{user}(A) = (N_{instr}(A) + N_{mm.cycle}(A)) * Time_{cycle}$
- $N_{mm.cycle} = N_{read.cycle} + N_{write.cycle}$
- $N_{read.write.cycle} = N_{read.op} * R_{miss} * N_{miss.cycles}$
- Miss rates**
  - Two-level Cache example:

$$T_{read.access}(A) = T_{read.hit}^{L1} + R_{read.miss}^{L1} \times T_{read.miss}^{L1}$$

$$T_{read.miss}^{L1}(A) = T_{read.hit}^{L2} + R_{read.miss}^{L2} \times T_{read.miss}^{L2}$$

Global miss rate:  $R_{read.miss}^{L1}(A) \times R_{read.miss}^{L2}(A)$

## Memory access flow

### Read access (load) workflow



## Throughput

- MIPS  $\frac{N_{instr}}{Time_{user} * 10^6}$  OR  $\frac{clock.freq}{CPI * 10^6}$
- Only considers the number of instructions
- Can be easily manipulated by making the instructions smaller so it takes more to run the same programme
- MFLOPS**  $\frac{N_{flops}}{Time_{user} * 10^6}$
- Does not differentiate between the different types of floating pt ops

## Misc

- Higher clock freq != shorter execution time, since we do not capture CPI

## Speed up



- Cost:**  $C_p(n) = p * T_p(n)$ , where  $C_p$  measures the total work performed by all processors
- A parallel programme is cost optimal if it executes the same total number of operations as the fastest sequential program
- Speed up:**  $S_p(n) = \frac{T_{best\_seq}(n)}{T_p(n)}$
- Theoretically  $S_p \leq p$
- Practically, sublinear can occur when the parallel working task fits within the cache but the seq one cannot

- **Efficiency**

$$E_p(n) = \frac{T_*(n)}{C_p(n)} = \frac{S_p(n)}{p} = \frac{T_*(n)}{p \times T_p(n)}$$

speed up / number of cores

- $T_*(n)$  refers to the best sequential

- In the ideal case  $T_p = p$ , and  $E(p) = 1$

### Scalability

- How size of problem and size of parallel computer interact
- Problem size small: Parallelism overheads dominates benefits
- Problem size large: working set cannot fit on machine, cannot start

### Amdahl's Law

- Speedup of the parallel execution is limited by the sequential fraction  $f = \frac{t_{sequential}}{t_{total}}$
- Manufacturers are discouraged from making large parallel computers
- Effort diverted to reducing sequential section

- Sequential execution time:



- Parallel execution time:



$$S_p(n) = \frac{T_*(n)}{f \times T_*(n) + \frac{1-f}{p} T_*(n)} = \frac{1}{f + \frac{1-f}{p}} \leq \frac{1}{f}$$

### Rebuttal to Amdahl's

- $f$  is not always constant
- In a good parallel programme,  $\lim_{n \rightarrow \infty} (n) = 0$
- Hence  $S_p = p$

### Gustafson's law



- In some programmes,  $f$  decreases when the problem size increases (the parallel parts increases more)
- Then  $S_p = p$  given a large enough problem size

### Performance Measure of Communication

| Measure            | Definition                                                                                | Unit                    |
|--------------------|-------------------------------------------------------------------------------------------|-------------------------|
| Bandwidth          | Maximum rate at which data can be sent                                                    | bits (bytes) per second |
| Byte transfer time | Time to transmit a single byte                                                            | Seconds/byte            |
| Time of flight     | Time the first bit arrived at the receiver (channel propagation delay)                    | second                  |
| Transmission time  | Time to transmit a message                                                                | second                  |
| Transport latency  | Total time to transfer a message = transmission time + time of flight                     | second                  |
| Sender overhead    | Time of computing the checksum, appending the header, and executing the routing algorithm | second                  |
| Receiver overhead  | Time of checksum comparison and generation of an acknowledgement                          | second                  |
| Throughput         | Effective bandwidth                                                                       | bits (bytes) per second |

### Latency of sending a m sized msg



$$T(m) = O_{send} + T_{delay} + m/B + O_{recv} = T_{overhead} + m/B = T_{overhead} + t_B * m$$

where  $B$  is network bandwidth,

$T_{delay}$  = time first bit to arrive at receiver  
no checksum error and network contention and congestion,  
 $T_{overhead} (= O_{send} + T_{delay} + O_{recv})$  is independent of the message size;  
 $t_B (= 1/B)$  is the byte transfer time

### Finding Possible Bottlenecks

- Instruction-rate limit: Add more non-memory instructions and check if execution time increases linearly with math operations count
- Memory bottleneck: remove most non-memory operations, did the time change proportionately?
- Locality of data access: change all arrays to access A[0]
- Sync overhead: remove all atomic operations or locks (might change control flow, so may not work)

### L8: performance instrumentation

#### Terminology

- Latency: Time waiting to be serviced
- Response time: Time for an operation to complete
- Throughput: Rate of operation/data performed
- Utilization: Proportion of time where the resource is busy
- Saturation: Degree to which a resource has queued work that it cannot service
- Bottleneck: Service that limits system performance

#### Resource analysis - system administrator

- Focus on system resource: CPU, memory, disks, network interface, buses, interconnects

#### Workload analysis - app dev

- Examines the workload and how the system is responding
- Requests, latencies, completion & error

#### Anti-methodologies

- No deliberate methodology
- **Street light**
  - Look for obvious issues that can be found online or by chance
- **Drunk man**
  - Tune things at random until problem goes away

- Tune the wrong software

### Problem statement method

- Ask questions about the problem to better understand it

### USE Method

- Utilization: Busy time
- Saturation: Queue length or time
- Errors: easy to interpret
- Start with questions and then find the tools

### Monitoring

- Records performance statistics overtime to find patterns
- Good for: capacity planning, quantifying growth, showing peak usage

### Performance analysis in 60 seconds

#### Performance Analysis in 60 Sec (Linux)

- > 

|                         |                             |
|-------------------------|-----------------------------|
| ■ Load averages         | —————> 1. uptime            |
| ■ Kernel errors         | —————> 2. dmesg   tail      |
| ■ Overall stats by time | —————> 3. vmstat 1          |
| ■ CPU balance           | —————> 4. mpstat -P ALL 1   |
| ■ Process usage         | —————> 5. pidstat 1         |
| ■ Disk I/O              | —————> 6. iostat -xz 1      |
| ■ Memory usage          | —————> 7. free -m           |
| ■ Network I/O           | —————> 8. sar -n DEV 1      |
| ■ TCP stats             | —————> 9. sar -n TCP,ETCP 1 |
| ■ Check overview        | —————> 10. top              |

### Tools method

- List available performance tools
- List the useful metrics they provide
- List ways to interpret metrics

### New Trends

### L12: Energy efficient computing

# Tutorials

## Tutorial 1

### Hybrid memory (Shared + Distributed)

- Hybrid: Each core has its own memory (cache) that needs to communicate to the shared memory to stay updated
  - Not hybrid: Everyone just sees one shared memory
- xs-4114 vs i7-7700

| Processor Type   | Cores | Threads |
|------------------|-------|---------|
| Xeon Silver (xs) | 10    | 20      |
| Core i7 (i7)     | 4     | 8       |

## Tutorial 2

### Task Dependence Graph

```

A parbegin
do
parbegin
B parallel
C
parent
F
end
parallel
do
parbegin
D parallel
E
parent
G
end
parent
H
  
```



- parallel: X and Y are executed in parallel
- parent: all tasks must complete before moving on

### Average CPI

- Not all instructions are created equal! Having fewer instruction != faster programme if translation of instructions takes more clock cycles

### Calculating MIPS

- Calculate time taken for the programme

$$\frac{\text{instructions} \times \text{cycles}}{\text{clockrate}}$$

Ghz =  $10^9$  Hz

- Find the total number of million instructions

$$\frac{\text{totalinst.}}{10^6}$$

- divide 2 by 1

### Amdahl's vs Gustafson's

- Amdahl's: If problem is fixed sized **OR** there's a constant sequential fraction with increasing problem size, then the speedup is limited by the sequential fraction
- Gustafson's: If the problem size can be varied **AND** the sequential fraction does not scale much with problem size, then we can solve larger problems with more speed up

### Flynn's Taxonomy

- MIMD is a superset to SIMD!

### Programming Models

- Distributed memory requires explicitly communication between processes

### Master-worker

- Good for relatively simple and homogeneous worker threads and a master thread to organize them
- Similar to the idea of SIMD

### Task Pool

- Good for heterogeneous tasks or those that finish at different times
- Good when you are unsure of finish timing

### Fork-join / Parbegin parent

- Fork join can be complicated to write (synchronisation, waiting etc)

### Pipelining

- Better for task parallelism. Bad for data parallelism where there is only one task
- Best when each stage takes similar time / are even in workload
- If parbegin-parent, use OpenMP!

### Producer Consumer

- Good when there is a pair of producer / consumer
- Need to have something that is produced and consumed!
- If producers/consumers have uneven workload, then we need more producer/consumer to offset the difference

## Tutorial 3

### Relaxed Orders

- Data Dependencies (while loops) DOES NOT COUNT**
- TSO: Relax W → R on the same processor
- PC: Relax W → R on the same processor, processes see writes at different times
- PSO: Relax WR on the same processor, relax WW on the same processor if no data dependency

### Drawing Dependencies



- To determine the final value in a SC execution, draw a line from the last write to the read

### Number of Warps and Blocks

- # of registers per kernel = # of registers per thread
- # of blocks = # of threads on device / # of registers per block
- # of warps = # of threads per block / warp size

### CUDA mat mul

```

managed float *a;
managed float *b;
managed float *c;

// One thread per output row over some N blocks
__global__ void multiplyKernel() {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    c[index] = 0;
    for (int j = 0; j < ARRAYSIZE; ++j)
        c[index] += a[index * ARRAYSIZE + j] * b[j];
}

managed float *a;
managed float *b;
managed float *c;

// One thread per output row over some N blocks
__global__ void multiplyKernel() {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    c[index] = 0;
    for (int j = 0; j < ARRAYSIZE; ++j)
        c[index] += a[index * ARRAYSIZE + j] * b[j];
}

  
```

• 1 thread process 1 element in the output array

- Use managed to let CUDA figure out when to copy the matrices

## Tutorial 5

### Dining Philosophers Vs Logical Ring

- Without the odd even soln, a deadlock can occur since send does not guarantee that the system buffer is large enough
- Logical ring takes num message \* 2 number of communications to complete
- Dining Philosopher can take many more steps!
- Logical ring looks like a pipeline but also comes with synchronisation issues since it's not just unidirectional

### Writing logical ring using non-blocking send recv

```

int rank, p, size = 8;
int left, right;
char send_buffer1[8], recv_buffer1[8];
char send_buffer2[8], recv_buffer2[8];
gethostname(send_buffer1, size) //repeat
for buffer2
  
```

```

MPI_Comm_rank(MPI_COMM_WORLD, &rank)
MPI_Comm_size(MPI_COMM_WORLD, &p)
left = (rank - 1 + p) % p ;
right = (rank + 1) % p ;
MPI_Request reqs[4]; MPI_Status stats[4];
MPI_Isend (send_buffer1, size, MPI_CHAR,
           left, TAG_LEFT , MPI_COMM_WORLD,
           &reqs[0]);
MPI_IRecv (recv_buffer1, size, MPI_CHAR,
           right, TAG_LEFT , MPI_COMM_WORLD,
           &reqs[1]);
MPI_Isend (send_buffer2, size, MPI_CHAR,
           right, TAG_RIGHT , MPI_COMM_WORLD,
           &reqs[2]);
MPI_IRecv (recv_buffer2, size, MPI_CHAR,
           left, TAG_RIGHT , MPI_COMM_WORLD,
           &reqs[3]);
MPI_Waitall(4, reqs, stats);
...
  
```

### Wimpy and Brawny cores

- Single threaded performance: Brawny
- Multi threaded performance: Wimpy
- Always good to use a mix of both!

## Tutorial 6

### Omega network - 8 \* 8



### XOR Tag Routing



## Labs

### Lab 2: OpenMP

#### Work Sharing Constructs

- Get thread id: `omp_get_thread_num()`
- Get num threads: `omp_get_num_threads()`
- To share work across multiple threads, we need to use work-sharing constructs

```

#pragma omp parallel
{
    #pragma omp for schedule (static, chunksize)
    for (i = 0; i < n; i++)
        x[i] = y[i];
}
  
```

- Static assigns work in a circular manner if `iter_size` is larger than the number of threads
- Dynamic will assign based on the given chunk size or iter size but in a random order
  - Pros: Appropriate when the iteration require different computational costs and the iterations are not balanced
  - Cons: Scheduling has higher overhead than static scheduling because it distributes iterations during run time

#### Nesting Work-Sharing constructs

<https://610yilingliu.github.io/penMP/>

```

#pragma omp parallel
{
    #pragma omp for
    for (i = 0; i < n; i++)
        #pragma omp for
        for (j = 0; j < m; j++)
            // can't nest the omp for directives
}

```

- Nesting of work sharing constructs is not allowed

```

#pragma omp parallel
{
    #pragma omp sections
    {
        #pragma omp section
        {
            work1();
        }
        #pragma omp section
        {
            work2();
        }
    }
}

```

## Sections }

- Each section will be assigned to any available thread, one at a time
- single section (pragma omp single) will be executed by only one thread, decided at run time
- master section (pragma omp master) will be executed by only the master thread

## Lab 3: CUDA

### Basics

- Complex cores like CPU has low latency
- Many simple cores like GPU has high throughput
- one SM runs one thread block and executes multiple warps of threads in parallel

### Synchronisation Constructs

- pragma omp barrier:** Synchronizes all threads
- pragma omp master:** Only master thread executes
- pragma omp critical:** Can only be executed by 1 thread

```

#pragma omp parallel shared(x)
{
    #pragma omp critical
    x = x + 1;
}

/* end of parallel region */

```

- at a time
- pragma omp atomic:** Only one thread can execute the atomic statement at a time

### Volta (CC7) Vs Pascal(CC6)

- Independent thread scheduling (ITS)
- Different statements in the divergent branches can interleave



- Pascal and earlier: entire divergent region must converge first



### CUDA Memory Types

| Type     | Scope   | Access type | Speed    | CUDA declaration syntax            | Explicit sync |
|----------|---------|-------------|----------|------------------------------------|---------------|
| Register | thread  | RW          | fastest  | Compiler decides (from local vars) | no            |
| Local    | thread  | RW          | depends* | float x;                           | no            |
| Shared   | block   | RW          | fast     | __shared__ float x;                | yes           |
| Global   | program | RW          | slow     | __device__ float x;                | yes           |
| Constant | program | R           | slow     | __constant__ float x;              | yes           |
| Texture  | program | R           | slow     | __texture__ float x;               | yes           |

- program scope = both host and device
- Prefers register local shared global

### Global memory

```

• cudaError_t cudaMalloc(void * *devPtr, size_tsize)
• Visible to all blocks
// Malloc host memory
start = (int*)malloc(num_elements * sizeof(int));
// Malloc device memory
cudaMalloc((void**)device_mem, num_elements * sizeof(int));

printf("Incrementor input:\n");
for (i = 0; i < num_elements; i++) {
    start[i] = rand() % 100;
    printf("start[%d] = %d\n", i, start[i]);
}

/** 
 * Copy values from start to our CUDA memory
 */
rc = cudaMemcpy(device_mem, start, num_elements * sizeof(int), cudaMemcpyHostToDevice);

if (rc != cudaSuccess)
{
    printf("Could not copy to device. Reason: %s\n", cudaGetErrorString(rc));
}

incrementor<<<1, num_elements>>>(device_mem);
check_cuda_errors();

// Retrieve data from global memory
rc = cudaMemcpy(start, device_mem, num_elements * sizeof(int), cudaMemcpyDeviceToHost);

```

### Shared memory

- `__shared__`
- Only resides in device, hence faster
- Only visible to those in the same thread block

### Unified memory

- Defines a common memory addressing space, allowing both CPU and GPU to access it as if it is in their memory space
- `cudaMallocManaged` and `__managed__`
- Page-locked memory (locked in the RAM)! GPU can access directly without CPU intervention.

```

// "Malloc" device memory
cudaMallocManaged((void**)&device_mem, num_elements * sizeof(int));

printf("Incrementor input:\n");
for (i = 0; i < num_elements; i++) {
    device_mem[i] = rand() % 100;
    printf("start[%d] = %d\n", i, device_mem[i]);
}

incrementor<<<1, num_elements>>>(device_mem);
check_cuda_errors();

```

### Synchronisation in CUDA

- CUDA provides synchronising primitives
- `atomicAdd(&counter, 1);`

### Barrier in CUDA

- `_syncthreads()` synchronises threads in the same block until all of them have reached this point
- Threads from other blocks are not synchronised
- `volatile` keyword: hints to the compiler to not optimise load and store operations to prevent stale version of the var from being read
- Volatile variables may be modified asynchronously by other threads

### Cuda Malloc

- `cudaMalloc(void * *pointer, size_t nbytes)` is called in host. Since host cannot touch the shared memory, the memory is allocated to global
- `cudaMemset(void * pointer, int value, size_t count);`
- `cudaFree(void * pointer)`

### CUDA Example Codes

#### Adding two arrays

##### CUDA C Program

A CUDA kernel

```

_global_
void addMatrixG( float *a, float *b, float *c, int N )
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    int index = i + j * N;
    if ( i < N && j < N )
        c[index] = a[index] + b[index];
}

```

Device code

```

void main()
{
    .....
    dim3 dimBlk( 16, 16 );
    dim3 dimGrd( N/dimBlk.x, N/dimBlk.y );
    addMatrixG<<<dimGrd, dimBlk>>>( a, b, c, N );
}

```

Host code

```

// Allocate vectors in device memory
size_t size = N * sizeof(float);
float *d_A;
cudaMalloc( (void**)&d_A, size );
float *d_B;
cudaMalloc( (void**)&d_B, size );
float *d_C;
cudaMalloc( (void**)&d_C, size );

```

```

// Copy vectors from host memory to device memory
// h_A and h_B are input vectors stored in host memory
cudaMemcpy( d_A, h_A, size, cudaMemcpyHostToDevice );
cudaMemcpy( d_B, h_B, size, cudaMemcpyHostToDevice );

```

```

// Invoke kernel
int threadsPerBlock = 256;
int blocksPerGrid = (N + threadsPerBlock - 1) /
                    threadsPerBlock;
VecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C);

```

```

// Copy result from device memory to host memory
// h_C contains the result in host memory
cudaMemcpy( h_C, d_C, size, cudaMemcpyDeviceToHost );

// Free device memory
cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);

```

### Matrix multiplication

```

// Matrices are stored in row-major order:
// M(row, col) = *(M.elements + row * M.width + col)
typedef struct {
    int width;
    int height;
    float elements;
} Matrix;

// Thread block size
#define BLOCK_SIZE 16

// Forward declaration of the matrix multiplication kernel
__global__ void MatMulKernel(const Matrix, const Matrix, Matrix);

```

```

// Matrix multiplication - Host code
// Matrix dimensions are assumed to be multiples of BLOCK_SIZE
void MatMul(const Matrix A, const Matrix B, Matrix C)
{

```

```

    // Load A and B to device memory
    Matrix d_A;
    d_A.width = A.width; d_A.height = A.height;
    size_t size = A.width * A.height * sizeof(float);
    cudaMalloc((void**)&d_A.elements, size);
    cudaMemcpy(d_A.elements, A.elements, size, cudaMemcpyHostToDevice);

```

```

    // Load C in device memory
    Matrix d_C;
    d_C.width = B.width; d_C.height = B.height;
    size_t size = B.width * B.height * sizeof(float);
    cudaMalloc((void**)&d_C.elements, size);
    cudaMemcpy(d_C.elements, B.elements, size, cudaMemcpyHostToDevice);

```

```

    // Invoke kernel
    dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
    dim3 dimGrid(B.width / dimBlock.x, B.height / dimBlock.y);
    MatMulKernel<<<dimGrid, dimBlock>>>(d_A, d_B, d_C);

    // Read C from device memory
    cudaMemcpy(C.elements, d_C.elements, size, cudaMemcpyDeviceToHost);

    // Free device memory
    cudaFree(d_A.elements); cudaFree(d_B.elements); cudaFree(d_C.elements);
}

```

```

// Matrix multiplication kernel called by MatMul()
__global__ void MatMulKernel(Matrix A, Matrix B, Matrix C)
{
    // Each thread computes one element of C
    // by accumulating results into Cvalue
    float Cvalue = 0;

    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    for (int e = 0; e < A.width; ++e)
        Cvalue += A.elements[row * A.width + e] *
                  B.elements[e * B.width + col];

    C.elements[row * C.width + col] = Cvalue;
}

```

## Lab 4 and 5: MPI

### MPI Blocking Communication

```

int MPI_Send ( const void * buf , int count , MPI_Datatype datatype ,
               int dest , int tag , MPI_Comm comm )

```

```

int MPI_Recv ( void * buf , int count , MPI_Datatype datatype ,
               int source , int tag , MPI_Comm comm ,
               MPI_Status * status )

```

Table 1: Arguments for Blocking Communication Routines

|                    |                                                                                                                                    |
|--------------------|------------------------------------------------------------------------------------------------------------------------------------|
| buf                | Pointer to the memory buffer that holds the contents of the message to be sent or received                                         |
| count              | The number of items that will be sent.                                                                                             |
| datatype           | Specifies the primitive data type of the individual item sent in the message, and can be one of the following:                     |
| MPI_CHAR           |                                                                                                                                    |
| MPI_SHORT          |                                                                                                                                    |
| MPI_INT            |                                                                                                                                    |
| MPI_LONG           |                                                                                                                                    |
| MPI_UNSIGNED_CHAR  |                                                                                                                                    |
| MPI_UNSIGNED_SHORT |                                                                                                                                    |
| MPI_UNSIGNED_LONG  |                                                                                                                                    |
| MPI_UNSIGNED       |                                                                                                                                    |
| MPI_FLOAT          |                                                                                                                                    |
| MPI_DOUBLE         |                                                                                                                                    |
| MPI_LONGDOUBLE     |                                                                                                                                    |
| MPI_BYTE           |                                                                                                                                    |
| MPI_PACKED         |                                                                                                                                    |
| dest/source        | Specifies the rank of the source / destination process in that MPI communicator                                                    |
| tag                | An integer that allows the receiving process to distinguish a message from a sequence of messages originating from the same sender |
| comm               | The MPI communicator                                                                                                               |
| status             | Pointer to an MPI_Status structure that allows us to check if the receive has been successful.                                     |

## Non Blocking Communication

```
int MPI_Isend ( const void * buf , int count , MPI_Datatype datatype , int dest , int tag , MPI_Comm comm , MPI_Request * request )

int MPI_Irecv ( void * buf , int count , MPI_Datatype datatype , int source , int tag , MPI_Comm comm , MPI_Request * request )

int MPI_Test ( MPI_Request * request , int * flag , MPI_Status * status )

int MPI_Wait ( MPI_Request * request , MPI_Status * status )
```

- other variations of test and wait: MPI\_Test, MPI\_Testall, MPI\_Testany, MPI\_Testsome MPI.Wait, MPI.Waitany, MPI\_Waitsome

- MPI does not guarantee fairness, starvation can still happen

## Collective Communication

### Barrier

```
int MPI_Barrier ( MPI_Comm comm )

# Example
if (rank == master_node_rank) {
    // Master node
    for (int i = 0; i < num_workers * ITERATIONS; i++) {
        MPI_Barrier(MPI_COMM_WORLD);
        MPI_Recv(&number, 1, MPI_INT,
                MPI_ANY_SOURCE, MPI_ANY_TAG,
                MPI_COMM_WORLD,
                MPI_STATUS_IGNORE);
        double print_delay = MPI_Wtime() -
            loop_start_time_s;
        printf("Master_node_received_number_%
d_at_time_%.5f_sec\n",
               number, print_delay);
    }
} else {
    // Workers
    for (int i = 0; i < num_workers * ITERATIONS; i++) {
```

```
// Only 1 process will be able to send
MPI_Barrier(MPI_COMM_WORLD);
if (rank == (i % num_workers)) {
    // Only this particular worker
    // should send in this iteration
    MPI_Send(&rank, 1, MPI_INT,
            master_node_rank, 0,
            MPI_COMM_WORLD);
}
// Random sleep to vary the workers
useconds_t sleepTime =
    (useconds_t)((rand() % 5) + 1) *
    100;
usleep(sleepTime);
```

### Data movement Operations

```
/* MPI_Bcast - broadcasts ( sends ) a
   message from the process with rank
   root to all other processes in the group */
int MPI_Bcast ( void * buffer , int count ,
                MPI_Datatype datatype ,
                int root , MPI_Comm comm )

-----
```

/\* MPI\_Scatter - sends data from one process to all processes in a communicator \*/

```
int MPI_Scatter ( const void * sendbuf ,
                  int sendcount ,
                  MPI_Datatype sendtype , void * recvbuf ,
                  int recvcount , MPI_Datatype recvtype ,
                  int root , MPI_Comm comm )

-----
```

/\* MPI\_Gather - gathers data from a group of processes into one root process \*/

```
int MPI_Gather ( const void * sendbuf , int
                 sendcount ,
                 MPI_Datatype sendtype , void * recvbuf ,
                 int recvcount , MPI_Datatype recvtype ,
                 int root , MPI_Comm comm )

-----
```

/\* MPI\_Allgather - gathers data from a group of processes into every process of that group \*/

```
int MPI_Allgather ( const void * sendbuf ,
                    int sendcount ,
                    MPI_Datatype sendtype , void * recvbuf ,
                    int recvcount ,
                    MPI_Datatype recvtype , MPI_Comm comm )

-----
```

/\* MPI\_Alltoall - each process in a group performs a scatter operation , sending a distinct message to all the processes in the group in order by their rank \*/

```
int MPI_Alltoall ( const void * sendbuf ,
                   int sendcount ,
```

```
MPI_Datatype sendtype , void * recvbuf ,
int recvcount ,
MPI_Datatype recvtype , MPI_Comm comm )
```

### Collective Computation

```
-----
```

/\* MPI\_Reduce - reduces values on all processes within a group ; the reduction operation must be one of the following :

```
MPI_MAX maximum | MPI_MIN minimum | MPI_SUM
sum | MPI_PROD product |
MPI_BAND logical AND | MPI_BAND bit - wise
AND | MPI_LOR logical OR |
MPI_BOR bit - wise OR | MPI_LXOR logical
XOR | MPI_BXOR bit - wise XOR |
MPI_MAXLOC max value and location |
MPI_MINLOC min value and location
*/
```

int MPI\_Reduce ( const void \* sendbuf ,
 void \* recvbuf , int count ,
 MPI\_Datatype datatype , MPI\_Op op ,
 int root , MPI\_Comm comm )

```
-----
```

/\* MPI\_Allreduce - applies a reduction operation and places the result in all processes in the communicator ( this is equivalent to an MPI\_Reduce followed by an MPI\_Bcast ) \*/

```
int MPI_Allreduce ( const void * sendbuf ,
                     void * recvbuf , int count ,
                     MPI_Datatype datatype , MPI_Op op ,
                     MPI_Comm comm )
```

### #Example

```
source = 0;
sendcount = 1;
recvcount = 1;

// Generate a random value for each process
 srand(rank);
int localval = rand() % 10;
printf("Rank.%d_generated_value.%d\n",
       rank, localval);
```

```
int sum = 0;
MPI_Allreduce(&localval , &sum, 1, MPI_INT,
              MPI_SUM, MPI_COMM_WORLD);
MPI_Finalize();
```

```
-----
```

/\* MPI\_Reduce\_scatter - first performs an element - wise reduction on a vector across all processes in the group , then splits the result vector into disjoint segments to distribute across the processes ( this is equivalent to an MPI\_Reduce followed by an MPI\_scatter ) \*/

```
int MPI_Reduce_scatter ( const void * sendbuf ,
                        void * recvbuf ,
```

```
const int recvcounts [] , MPI_Datatype
datatype ,
MPI_Op op , MPI_Comm comm )
```

### Managing Communicators

- A communicator contains a set of process MPI GROUP with an associated context

```
-----
```

/\* MPI\_Comm\_group - returns the group associated with a communicator \*/

```
int MPI_Comm_group ( MPI_Comm comm ,
                     MPI_Group * group )
```

```
-----
```

/\* MPI\_Group\_incl - produces a group by reordering an existing group and taking only listed members \*/

```
int MPI_Group_incl ( MPI_Group group , int
                     n , const int ranks [] ,
                     MPI_Group * newgroup )
```

```
-----
```

/\* MPI\_Comm\_create - creates a new communicator with a group of processes \*/

```
int MPI_Comm_create ( MPI_Comm comm ,
                      MPI_Group group , MPI_Comm * newcomm )
```

```
-----
```

/\* MPI\_Group\_rank - returns the rank of the calling process in the given group \*/

```
int MPI_Group_rank ( MPI_Group group , int
                     * rank )
```

```
-----
```

/\* MPI\_Comm\_rank - returns the rank of the calling process in the given communicator \*/

```
int MPI_Comm_rank ( MPI_Comm comm , int
                     * rank )
```

### #Example

```
ranks1[4]={0,1,2,3}, ranks2[4]={4,5,6,7};
// Variables to hold the group of orig_group and new_group
// What is held in new_group is not necessarily the same as what is
MPI_Group orig_group , new_group;
MPI_Comm new_comm;
```

```
MPI_Init(&argc,&argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &numparts);
```

sendbuf = rank;

```
-----
```

/\* Extract the original group handle \*/

```
MPI_Comm_group(MPI_COMM_WORLD, &orig_group);
```

/\* Divide tasks into two distinct groups based upon rank \*/

```
if (rank < NPROCS/2) {
    MPI_Group_incl(orig_group , NPROCS/2,
                  ranks1 , &new_group);
```

```

} else {
    MPI_Group_incl(orig_group, NPROCS/2,
                  ranks2, &new_group);
}

/* Create new new communicator and then
   perform collective communications */
MPI_Comm_create(MPI_COMM_WORLD, new_group,
                &new_comm);
MPI_Allreduce(&sendbuf, &recvbuf, 1,
              MPI_INT, MPI_SUM, new_comm);

MPI_Group_rank(new_group, &new_rank);
printf("rank=%d,newrank=%d,recvbuf=%d\n", rank, new_rank, recvbuf);

MPI_Finalize();

Inter group communication

int MPI_Intercomm_create ( MPI_Comm
                           local_comm , int local_leader ,
                           MPI_Comm peer_comm , int remote_leader ,
                           int tag ,
                           MPI_Comm * newintercomm )

• All inter group constructors are grouping and requires the
local and remote groups to be disjoint to avoid deadlock
• The two groups communicate through Ring leaders
• MPI Barrier is a one to all operation. The one calling
process in a sub group waits for all other processes to
enter the barrier before continuing
Cartesian Virtual Topology

-----
```

/\* MPI\_Cart\_create - makes a new  
communicator to which Cartesian

```

topology information has been attached */
int MPI_Cart_create ( MPI_Comm comm_old ,
                      int ndims , const int dims [] ,
                      const int periods [] , int reorder ,
                      MPI_Comm * comm_cart )
-----
/* MPI_Cart_coords - determines process
   coordinates in the Cartesian
   topology , given its rank in the group */
int MPI_Cart_coords ( MPI_Comm comm , int
                      rank , int maxdims , int coords [])
-----
/* MPI_Cart_shift - returns the shifted
   source and destination ranks ,
   given a shift direction and amount */
int MPI_Cart_shift ( MPI_Comm comm , int
                     direction , int disp ,
                     int * rank_source , int * rank_dest )

#Example
#define SIZE 16

#define UP 0
#define DOWN 1
#define LEFT 2
#define RIGHT 3

int main(int argc, char *argv[])
{
    int numtasks, rank, source, dest, outbuf,
        i, tag=1,
        inbuf[4]={
            MPI_PROC_NULL,
            MPI_PROC_NULL,

```

```

            MPI_PROC_NULL,MPI_PROC_NULL,
        },
        // Row * Columns
        nbrs[4], dims[2]={4,4},
        periods[2]={0,0}, reorder=0, coords[2];

MPI_Request reqs[8];
MPI_Status stats[8];
MPI_Comm cartcomm;

MPI_Init(&argc,&argv);
MPI_Comm_size(MPI_COMM_WORLD, &numtasks);

if (numtasks == SIZE) {

    MPI_Cart_create(MPI_COMM_WORLD, 2,
                   dims, periods, reorder, &cartcomm);
    MPI_Comm_rank(cartcomm, &rank);
    MPI_Cart_coords(cartcomm, rank, 2,
                    coords);
    MPI_Cart_shift(cartcomm, 0, 1,
                   &nbrs[UP], &nbrs[DOWN]);
    MPI_Cart_shift(cartcomm, 1, 1,
                   &nbrs[LEFT], &nbrs[RIGHT]);

    outbuf = rank;

    // for all 4 directions, send its rank
    // to its neighbours
    // and update the rank information
    for (i=0; i<4; i++) {
        dest = nbrs[i];
        source = nbrs[i];
        MPI_Isend(&outbuf, 1, MPI_INT,
                  dest, tag, MPI_COMM_WORLD,

```

&reqs[i]);  
MPI\_Irecv(&inbuf[i], 1, MPI\_INT,  
source, tag, MPI\_COMM\_WORLD,  
&reqs[i+4]);

MPI\_Waitall(8, reqs, stats);

for (i = 0; i < SIZE; i++) {  
 MPI\_Barrier(MPI\_COMM\_WORLD);  
 if (rank == i) {  
 printf("rank=%d,coords=%d,%d,%d,%d,%d,%d,%d,%d\n",  
 rank, coords[0], coords[1], nbrs[UP],  
 nbrs[RIGHT]);  
 printf("rank=%d,inbuf(u,d,l,r)= %d,%d,%d,%d\n",  
 rank, inbuf[UP], inbuf[DOWN], inbuf[LEFT], inbuf[RIGHT]);  
 }  
 // Prevent overflowing the previous  
 // barrier  
 MPI\_Barrier(MPI\_COMM\_WORLD);  
}  
}  
} else {  
 printf("Must specify %d processes..\nTerminating.\n",SIZE);  
}

MPI\_Finalize();  
}

## Examples



27

| Network $G$ with $n$ nodes                                 | Degree $g(G)$ | Diameter $\delta(G)$                    | Edge-connectivity $ec(G)$ | Bisection Bandwidth $B(G)$ |
|------------------------------------------------------------|---------------|-----------------------------------------|---------------------------|----------------------------|
| Complete graph                                             | $n - 1$       | 1                                       | $n - 1$                   | $((\frac{n}{2}))^2$        |
| Linear array                                               | 2             | $n - 1$                                 | 1                         | 1                          |
| Ring                                                       | 2             | $\lceil \frac{n}{2} \rceil$             | 2                         | 2                          |
| $d$ -dimensional mesh ( $n = r^d$ )                        | $2d$          | $d(\sqrt[d]{n} - 1)$                    | $d$                       | $n^{\frac{d-1}{d}}$        |
| $d$ -dimensional torus ( $n = r^d$ )                       | $2d$          | $d \lceil \frac{\sqrt[d]{n}}{2} \rceil$ | $2d$                      | $2n^{\frac{d-1}{d}}$       |
| $k$ -dimensional hypercube ( $n = 2^k$ )                   | $\log n$      | $\log n$                                | $\log n$                  | $\frac{n}{2}$              |
| $k$ -dimensional CCC-network ( $n = k2^k$ for $k \geq 3$ ) | 3             | $2k - 1 + \lceil \frac{k}{2} \rceil$    | 3                         | $\frac{n}{2k}$             |
| Complete binary tree ( $n = 2^k - 1$ )                     | 3             | $2 \log \frac{n+1}{2}$                  | 1                         | 1                          |
| $k$ -ary $d$ -cube ( $n = k^d$ )                           | $2d$          | $d \lceil \frac{k}{2} \rceil$           | $2d$                      | $2k^{d-1}$                 |

Table 1: Network Characteristics