

# CSE 6230 – Midterm Review

Ramakrishnan Kannan, Shruti Shivakumar

# Basic Terminologies - II

- **Parallel Computing**- Solving a task by simultaneous use of multiple processors in a unified architecture.
- **High Performance Computing**- Solving large problems via supercomputers + fast networks + massive storage.
- **Embarrassingly Parallel** - Solving many similar, but independent, tasks. E.g., parameter sweeps.
- **Multi-core/Many-core Processors** - Almost all processors today. Multiple compute cores on a single chip. They share memory, operating system and network.
- **Cluster Computing** - Combination of commodity units (e.g. multi-core processors) to build parallel system.
- **Pipelining (streaming)** - Breaking a task into steps performed by different units, much like an assembly line.



Source: Marat Dukhan <[mdukan3@gatech.edu](mailto:mdukan3@gatech.edu)>

# Some definitions

- For a given problem A, let
  - $\text{SerTime}(n)$  = Time of best serial program to solve A for input of size n.
  - $\text{ParTime}(n,p)$  = Time of the parallel program+architecture to solve A for input of size n, using p processors.
- Note that  $\text{SerTime}(n) \leq \text{ParTime}(n,1)$ .
- $\text{Speedup}(n,p) : \text{SerTime}(n) / \text{ParTime}(n,p)$   $0 < \text{Speedup} \leq p$
- $\text{Work}(n,p) : p \cdot \text{ParTime}(n,p) \leftarrow \text{cost}$  Serial Work  $\leq$  Parallel Work  $< \infty$
- $\text{Efficiency}(n,p) : \text{SerTime}(n) / [p \cdot \text{ParTime}(n,p)]$   $0 < \text{Efficiency} \leq 1$

# Speedup



Very rare. Some reasons for speedup  $> p$  (efficiency  $> 1$ )

- Parallel computer has  $p$  times as much RAM so higher fraction of program memory in RAM instead of disk. An important reason for using parallel computers
- In developing parallel program a better approach was discovered, older serial program was not best possible.
  - A useful side-effect of parallelization

# Amdahl's Law

- Amdahl [1967]: Let  $f$  be fraction of time spent on operations that are performed serially. Then for

- $\text{ParTime}(p) \geq \text{SerTime}(p) \cdot \left[ f + \frac{1-f}{p} \right]$

- $\text{Speedup}(p) \leq \frac{1}{f + \frac{1-f}{p}}$

- Which implies

$$\text{Speedup} \leq 1/f$$



# Scaling

- Utilize large computers by increasing  $n$  as  $p$  increases
- Fix the amount of data per processor: ***weak scaling***
  - Efficiency can remain high if communication does not increase excessively
  - Warning: efficiency improves, but parallel time will increase if  $\text{SerTime}(n)$  superlinear ( $\omega(n)$ ).
- Amdahl considered ***strong scaling*** :  $n$  is fixed
- Linear speedup is difficult
  - Nothing scales to arbitrarily many processors.
- However, for most users, the important question is:
  - Have I achieved acceptable performance on my software/hardware system for a suitable range of data and system sizes?



# ARCHITECTURAL TAXONOMIES

- These classifications provide ways to think about problems and their solution.
- The classifications were originally in terms of hardware, but there are natural software analogues.
- Many systems blend approaches, and do not exactly correspond to the classifications.

$$\left\{ \begin{array}{c} S \\ M \end{array} \right\} I \left\{ \begin{array}{c} S \\ M \end{array} \right\} D$$

**SI : Single Instruction:** All processors execute the same instruction.

**MI : Multiple Instruction:** Different processor may be executing different instructions.

**SD : Single Data:** All processors are operating on the same data.

**MD: Multiple Data:** Different processors may be operating on different data.

# Shared Memory – A node or a computer

- Global memory space, accessible by all processors
- Processors may have local copies (in cache) of some global memory, consistency of copies usually maintained by hardware (cache coherency)
- Advantages:
  - Global address space is user-friendly Data sharing between tasks is fast
- Disadvantages:
  - Shared memory - to - CPU path may be a bottleneck (is bandwidth of the network sufficient?)
  - Often: Non-Uniform Memory Access (NUMA)
    - ⇒ access time varies, depends on physical distance
    - Programmer responsible for correct synchronization
- Programming Models
  - OpenMP, Cilk



# Distributed Memory



- If processor A needs data in processor B, then B must send a message to A containing the data. Thus DM systems also known as message passing systems.
- Programming Models - MPI
- Advantages:
  - Memory is scalable with number of processors
  - Each processor has rapid access to its own memory
  - Cost effective: can use commodity parts

- Disadvantages:
  - Programmer is responsible for many of the details of the communication, easy to make mistakes.
  - May be difficult to distribute the data structures

# HPC Architectures with Accelerators

Shared Memory **Nodes** may be heterogeneous (**CPU cores and GPUs**)



*Shared Memory* within a node with CPUs and GPUs  
plus *Distributed Memory* concept: Non-local data can  
be sent across the network to other CPUs



# Domain and Functional Decomposition

- Domain decomposition: Partition a (perhaps conceptual) space. Different processors do similar work on different pieces (quilting bee, teaching assistants for discussion sections, etc.)
- Functional decomposition: Different processors work on different types of tasks (workers on an assembly line, sub-contractors on a project, etc.)
  - Functional decomposition rarely scales to many processors, so we'll concentrate on domain decomposition.

# Matrix Decompositions

- Suppose work at each position only depends on value there and nearby ones, equivalent work at each position.
- Dependencies force communication along boundary

Minimizes  
Bandwidth

|    |    |    |    |
|----|----|----|----|
| 0  | 1  | 2  | 3  |
| 4  | 5  | 6  | 7  |
| 8  | 9  | 10 | 11 |
| 12 | 13 | 14 | 15 |

Minimizes  
Latency



Minimize boundary (2D) Minimize # neighbors (1D)

# Local vs Global Arrays



# MPI Ranks -- Linear vs 2D Grid

-

MPI ranks 0 ... 15

|    |    |    |    |
|----|----|----|----|
| 0  | 1  | 2  | 3  |
| 4  | 5  | 6  | 7  |
| 8  | 9  | 10 | 11 |
| 12 | 13 | 14 | 15 |

logical rows and  
columns 0 ... 3

For  $\text{MPI\_COMM\_RANK} = i$  and  $\text{MPI\_COMM\_SIZE} = p$

$$\text{my\_row} = \lfloor i/\sqrt{p} \rfloor \quad \text{and} \quad \text{my\_col} = i - \text{my\_row} * \sqrt{p}$$

to send to logical

send to  $\text{MPI\_COMM\_RANK}$

|       |                                             |                |
|-------|---------------------------------------------|----------------|
| Right | ( $\text{my\_row}$ , $\text{my\_col} + 1$ ) | $i + 1$        |
| Left  | ( $\text{my\_row}$ , $\text{my\_col} - 1$ ) | $i - 1$        |
| Up    | ( $\text{my\_row} - 1$ , $\text{my\_col}$ ) | $i - \sqrt{p}$ |
| Down  | ( $\text{my\_row} + 1$ , $\text{my\_col}$ ) | $i + \sqrt{p}$ |

# OpenMP

# Three building blocks

- Compiler Directives
  - `private(list), shared(list)`
  - `firstprivate(list), lastprivate(list)`
  - `reduction(operator:list)`
  - `schedule(method[,chunk_size])`
  - `nowait`
  - `if(scalar_expression)`
  - `num_thread(num)`
  - `threadprivate(list), copyin(list)`
  - Ordered
- Runtime Libraries/APIs
  - `omp_set/get_num_threads, omp_get_thread_num, omp_{set,get}_dynamic,`  
`omp_in_parallel, omp_get_wtime`
- Environment variables
  - `OMP_NUM_THREADS, OMP_SCHEDULE, OMP_STACKSIZE, OMP_DYNAMIC, OMP_NESTED,`  
`OMP_WAIT_POLICY`

# Fork-Join Model

- OpenMP programs begin as a single process: the master thread.
- The master thread executes sequentially until the first parallel region construct is encountered.
- FORK: the master thread then creates a team of parallel threads.
  - The statements in the program that are enclosed by the parallel region construct are then executed in parallel among the various team threads.
- JOIN: When the team threads complete the statements in the parallel region construct, they synchronize and terminate, leaving only the master thread.
- The number of parallel regions and the threads that comprise them are arbitrary.



# Workshare vs Sections vs Single



DO / for shares iterations of a loop across the team.  
Represents a type of “data parallelism”.



SECTIONS breaks work into separate, discrete sections. Each section is executed by a thread. Can be used to implement a type of “functional parallelism”.



SINGLE serializes a section of code

# Synchronization : Critical and Atomic

- Critical: Only one thread at a time can enter a critical region
- Atomic: Only one thread at a time can update a memory location

```
#include
main()
{
    int x;
    x = 0;
#pragma omp parallel shared(x)
    {
        #pragma omp critical
        x = x + 1;
    } /* end of parallel section */
}
```

```
#include
main()
{
    int x;
    x = 0;
#pragma omp parallel shared(x)
    {
        #pragma omp atomic
        x = x + 1;
    } /* end of parallel section */
}
```

# Reduction

- The reduction clause allows accumulative operations on the value of variables.
- Syntax: reduction (operator:variable list)
- A private copy of each variable which appears in reduction is created as if the private clause is specified.
- Operators
  - Arithmetic
  - Bitwise
  - Logical

```
int main()
{
    int i, n;
    n = 10000;
    float a[n], b[n];
    double result, sequential_result;
    /* Some initializations */
    result = 0.0;
    for (i = 0; i < n; i++)
    {
        a[i] = i * 1.0;
        b[i] = i * 2.0;
    }
#pragma omp parallel for default(shared) private(i) \
    schedule(static) reduction(+ : result)
    for (i = 0; i < n; i++)
        result = result + (a[i] * b[i]);
    printf("Final result= %f\n", result);
    return 0;
}
```

MPI

# Different MPI Functionalities

- Point-2-Point communications
- Collective Communications
- MPI-IO
- Tools Interface
- One sided message passing
- Derived Datatypes

# Point to Point communication

- MPI provides four variants on send with blocking and nonblocking versions of each.
- Blocking means the call will not complete until the local data is safe to modify
- Nonblocking means the call returns “immediately”
  - Nonblocking data movement calls in MPI are MPI\_I{command}, e.g. MPI\_Irecv() or MPI\_Ialltoallv() (capital “Eye”)
  - Nonblocking calls require a mechanism to tell when they are done – MPI\_Wait\*, MPI\_Test\*
  - Data may or may not actually move before a call to MPI\_Wait\*/MPI\_Test\*
  - It is not safe to reuse buffers until the Wait/Test says the operation is locally done.
  - Nonblocking calls (can) allow for compute and communication to overlap

# Collective Communication

- One process wants to communicate with multiple process
  - One Process to all other process



# MPI\_All\*

## MPI\_Allgather

Gathers data from all tasks and then distributes to all tasks in communicator

```
sendcnt = 1;  
recvcnt = 1;  
MPI_Allgather(sendbuf, sendcnt, MPI_INT  
              recvbuf, recvcnt, MPI_INT  
              MPI_COMM_WORLD);
```



## MPI\_Allreduce

Perform reduction and store result across all tasks in communicator

```
count = 1;  
MPI_Allreduce(sendbuf, recvbuf, count, MPI_INT,  
              MPI_SUM, MPI_COMM_WORLD);
```



## MPI\_Reduce\_scatter

Perform reduction on vector elements and distribute segments of result vector across all tasks in communicator

```
recvcnt = 1;  
MPI_Reduce_scatter(sendbuf, recvbuf, recvcount,  
                   MPI_INT, MPI_SUM, MPI_COMM_WORLD);
```



## MPI\_Alltoall

Scatter data from all tasks to all tasks in communicator

```
sendcnt = 1;  
recvcnt = 1;  
MPI_Alltoall(sendbuf, sendcnt, MPI_INT  
             recvbuf, recvcnt, MPI_INT  
             MPI_COMM_WORLD);
```



# MPI Communicator Split



```
// Get the rank and size in the original
// communicator
int world_rank, world_size;
MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
MPI_Comm_size(MPI_COMM_WORLD, &world_size);

int color = world_rank / 4; // Determine color
                           // based on row

// Split the communicator based on the color and
// use the original rank for ordering
MPI_Comm row_comm;
MPI_Comm_split(MPI_COMM_WORLD, color, world_rank,
               &row_comm);

int row_rank, row_size;
MPI_Comm_rank(row_comm, &row_rank);
MPI_Comm_size(row_comm, &row_size);

printf("WORLD RANK/SIZE: %d/%d --- ROW RANK/SIZE:
       %d/%d\n",
       world_rank, world_size, row_rank, row_size);

MPI_Comm_free(&row_comm);

MPI_Finalize();
```

# Introduction to CUDA

Ramakrishnan Kannan

Shruti Shivakumar

Motivated out of OLCF training seminar

# CPU vs GPU



## CPU

- can execute instructions, divided into stages, at the rate of up to one instruction per clock cycle (IPC) when there are no dependencies.
- Scalar Pipelined Execution
- designed to process serial instructions efficiently.
- find instruction-level parallelism and execute multiple out-of-order instructions per clock cycle.

## GPU

- SIMD == SIMD + Multithreading
- Is optimized for aggregate throughput across all cores, deemphasizing individual thread latency and performance.
- Efficiently processes vector data
- Dedicates more silicon space to compute and less to cache and control.

# CPU Vs GPU Demo

## Scalar Pipelined Execution



### CPU Advantages

- Out-of-order superscalar execution
- Sophisticated control to extract tremendous instruction level-parallelism
- Accurate branch prediction
- Automatic parallelism on sequential code
- Large number of supported instructions
- Lower latency when compared to offload acceleration
- Sequential code execution results in ease-of-development

### GPU Advantages:

- Massively parallel, up to thousands of small and efficient SIMD cores/EUs
- Efficient execution of data-parallel code
- High dynamic random-access memory (DRAM) bandwidth

# CUDA Programming Model

- Heterogenous Programming
  - program separated into serial regions (run on CPU) & parallel regions (run on GPU)
- Data Parallelism - Parallel regions consist of many calculations that can be executed independently



*At its core are three key abstractions – a hierarchy of thread groups, shared memories, and barrier synchronization – that are simply exposed to the programmer as a minimal set of language extensions (to C programming language)*

# CUDA Program Outline

```
int main(){  
    // Allocate memory for array on host  
    // Allocate memory for array on device  
    // Fill array on host  
    // Copy data from host array to device array  
    // Do something on device (e.g. vector addition)  
    // Copy data from device array to host array  
    // Check data for correctness  
    // Free Host Memory  
    // Free Device Memory  
}
```



# CUDA Kernel

- When kernel is launched, a grid of threads are generated
- SIMD Code – Same code is executed by all threads
- Serial – CPU
  - for (int i=0; i<N; i++) {  
    C[i] = A[i] + B[i];  
}
- GPU – Parallel Code
  - $C[i] = A[i] + B[i];$



# 1D Indexing in CUDA programming

```
__global__ void vector_addition(int *a, int *b, int *c)
{
    (4)          (0-3)      (0-3)
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i<N) c[i] = a[i] + b[i];
}

thr_per_blk = 4;
blk_in_grid = ceil( float(N) / thr_per_blk );
vec_add<<< blk_in_grid, thr_per_blk >>>(d_a, d_b, d_c);
```



# CUDA Threads to 2D Array

|                  |                  |                  |                  |                  |                  |                  |                  |                  |                  |  |  |
|------------------|------------------|------------------|------------------|------------------|------------------|------------------|------------------|------------------|------------------|--|--|
| A <sub>0,0</sub> | A <sub>0,1</sub> | A <sub>0,2</sub> | A <sub>0,3</sub> | A <sub>0,4</sub> | A <sub>0,5</sub> | A <sub>0,6</sub> | A <sub>0,7</sub> | A <sub>0,8</sub> | A <sub>0,9</sub> |  |  |
| A <sub>1,0</sub> | A <sub>1,1</sub> | A <sub>1,2</sub> | A <sub>1,3</sub> | A <sub>1,4</sub> | A <sub>1,5</sub> | A <sub>1,6</sub> | A <sub>1,7</sub> | A <sub>1,8</sub> | A <sub>1,9</sub> |  |  |
| A <sub>2,0</sub> | A <sub>2,1</sub> | A <sub>2,2</sub> | A <sub>2,3</sub> | A <sub>2,4</sub> | A <sub>2,5</sub> | A <sub>2,6</sub> | A <sub>2,7</sub> | A <sub>2,8</sub> | A <sub>2,9</sub> |  |  |
| A <sub>3,0</sub> | A <sub>3,1</sub> | A <sub>3,2</sub> | A <sub>3,3</sub> | A <sub>3,4</sub> | A <sub>3,5</sub> | A <sub>3,6</sub> | A <sub>3,7</sub> | A <sub>3,8</sub> | A <sub>3,9</sub> |  |  |
| A <sub>4,0</sub> | A <sub>4,1</sub> | A <sub>4,2</sub> | A <sub>4,3</sub> | A <sub>4,4</sub> | A <sub>4,5</sub> | A <sub>4,6</sub> | A <sub>4,7</sub> | A <sub>4,8</sub> | A <sub>4,9</sub> |  |  |
| A <sub>5,0</sub> | A <sub>5,1</sub> | A <sub>5,2</sub> | A <sub>5,3</sub> | A <sub>5,4</sub> | A <sub>5,5</sub> | A <sub>5,6</sub> | A <sub>5,7</sub> | A <sub>5,8</sub> | A <sub>5,9</sub> |  |  |
| A <sub>6,0</sub> | A <sub>6,1</sub> | A <sub>6,2</sub> | A <sub>6,3</sub> | A <sub>6,4</sub> | A <sub>6,5</sub> | A <sub>6,6</sub> | A <sub>6,7</sub> | A <sub>6,8</sub> | A <sub>6,9</sub> |  |  |

Let M = 7 rows, N = 10 columns and a 4x4 blocks of threads... To cover all elements in the array, we need 3 blocks in x-dim and 2 blocks in y-dim.

```
dim3 threads_per_block( 4, 4, 1 );
dim3 blocks_in_grid( ceil( float(N) / threads_per_block.x ), ceil( float(M) / threads_per_block.y ), 1 );
mat_add<<< blocks_in_grid,
threads_per_block >>>(d_a, d_b,
d_c);
```



```
__global__ void add_matrices(int *a, int *b, int *c)
{
    int column = blockDim.x * blockIdx.x + threadIdx.x; // [0-11]
    int row = blockDim.y * blockIdx.y + threadIdx.y; // [0-7]
    if (row < M && column < N) // There is no row 7 and col 11.
    {
        int thread_id = row * N + column; // [0-69]
        c[thread_id] = a[thread_id] + b[thread_id];
    }
}
```

Row = 4  
Column = 5  
Thread\_id = 4 \* 10 + 5 = 45

# Parallel IO

# Lustre

- Distributed file system
- Hierarchical management
- Concurrency from multiple OSTs
- Meta data management with multiple MDTs



# Lustre Components

- Lustre consists of four major components
  - MetaData Server (MDS)
  - Object Storage Servers (OSSs)
  - Object Storage Targets (OSTs)
  - Clients
- MDS: track meta data (eg., name, location)
- OST: back-end storage for file object data
- Performance: Striping, alignment, placement



# HPC I/O Patterns

Non-parallel I/O



Parallel Multi-file I/O



Parallel Single-file I/O



- Serial performance
- Scaling issue
- Memory limit

- Metadata issue
- Management issue

- User-friendly
- Sync/lock overhead

# What is MPI I/O?

- I/O interface specification for parallel MPI applications
- Parallel I/O part for MPI
- MPI IO provides
  - Parallel I/O operations
  - Enable to use efficiently parallel file systems
  - Independent/collective I/Os
- Low-level interface
- At the application level, users may want to use of a more abstract library



POSIX I/O

```
fd = open("foo.txt", O_WRONLY | O_CREAT, 0644);  
write(fd, buf, len);  
close(fd)
```



MPI I/O

# I/O and Storage Stack

- We encourage the use of self-describing, binary, portable I/O formats
- We encourage users to push the envelope of the I/O Middleware (ADIOS, HDF5, etc.)
- Abstractions should not “force” implementations
  - Use data in streams or files
  - Write data according to the matching of the I/O(??) from the application(s) and the storage layers

Application

DRAM, local, remote

Self Describing  
Parallel I/O

Lower Level I/O

SSD/NVRAM

Parallel File System

Cold Storage

**ADIOS , HDF5, pnetcdf**

maps variables to data output in a file and/or stream

**MPI-IO, POSIX**

Places the data to the storage system, often optimizing the data from the I/O path to the storage system

**Burst Buffer optimizations**

**GPFS, Lustre, ...**

maintains logical space and provides efficient access to data

**TAPE, HPSS, ...**

# Performance Modeling

# Roofline Model



## Goals Of Roofline Model

- A graphical aid that provides : **realistic expectations of performance and productivity**
- Show inherent **hardware limitations** for a given kernel
- Show potential benefit and priority of optimizations
- Focused on: rates and efficiencies (**GF/s, % of peak**)

## Three Principle Components

- Computation: measured in GF/sec, GTEPS
- Communication: GB/sec
- Locality: Cache Size, Data Reuse

# Roofline Curve



$$\begin{aligned} P &= \frac{W}{T} \\ &= \frac{1}{\tau_{fl}} \min \left( \frac{\mathbf{I}}{B_\tau}, 1 \right) \\ &= P_\infty \min \left( \frac{\mathbf{I}}{B_\tau}, 1 \right) \\ &= \min (\beta \mathbf{I}, P_\infty) \end{aligned}$$

$\beta \equiv \frac{1}{\tau_{mem}}$  Data Transfer Bandwidth



❖ **True Arithmetic Intensity (AI) ~ Total Flops / Total DRAM Bytes**

- constant with respect to problem size for many problems of interest
- ultimately limited by compulsory traffic
- diminished by conflict or capacity misses.

```
for (int i=0; i<n; i++) {  
    A[i] = B[i] + C[i]
```

$$W = n$$
$$Q = 24n$$

$$I = \frac{1}{24}$$

```
for (int i=0; i<n; i++) {  
    for (int j=0; j<n; j++) {  
        C[i] += A[i][j]*B[j]
```

$$W = 2n^2$$
$$Q = 8n^2 + 16n$$

$$I = \frac{1}{4}$$

```
for (int k=0; k<n; k++) {  
    for (int i=0; i<n; i++) {  
        for (int j=0; j<n; j++) {  
            C[i][j] += A[i][k]*B[k][j]
```

$$W = 2n^3$$
$$Q = 24n^2$$

$$I = \frac{n}{2}$$

```
for (int i=0; i<n; i++) {  
    A[i] = B[i] + C[i]
```

$$I = \frac{1}{24} \quad P = \min\left(\frac{900}{24}, 3500\right) = 37.4 \text{GF/sec}$$

```
for (int i=0; i<n; i++) {  
    for (int j=0; j<n; j++) {  
        C[i] += A[i][j]*B[j]
```

$$I = \frac{1}{4} \quad P = \min\left(\frac{900}{4}, 7000\right) = 225 \text{GF/sec}$$

```
for (int k=0; k<n; k++) {  
    for (int i=0; i<n; i++) {  
        for (int j=0; j<n; j++) {  
            C[i][j] += A[i][k]*B[k][j]
```

$$I = \frac{n}{2} \quad P = \min(450n, 7000) = 7000 \text{GF/sec}$$

# Distributed Complexity Analysis

# A simple model for point-to-point messages

The time to send or receive a message of  $s$  words is  $\alpha + s \cdot \beta$

- ▶  $\alpha$  – **latency/synchronization cost per message**
- ▶  $\beta$  – **bandwidth cost per word**
- ▶ *each processor can send and/or receive one message at a time*

Consider the cost of a broadcast of  $s$  words

- ▶ *using a **binary tree** of height*

$$h_r = 2(\log_2(p + 1) - 1) \approx 2 \log_2(p)$$

$$T = h_r \cdot (\alpha + s \cdot \beta)$$

- ▶ *using a **binomial tree** of height*

$$h_m = \log_2(p + 1) \approx \log_2(p)$$

$$T = h_m \cdot (\alpha + s \cdot \beta)$$



# Collective communication in BSP

- ▶ When  $h = p$ , most collective communication routines involving  $s$  words of data per processor can be done with BSP cost  $O(\alpha + s \cdot \beta)$ 
  - ▶ Scatter: root sends each  $s/p$ -sized message to its target (root incurs  $s \cdot \beta$  send bandwidth)
  - ▶ Reduce-Scatter: each processor sends  $s/p$  summands to every other processor (every processor incurs  $s \cdot \beta$  send and receive bandwidth)
  - ▶ Gather: send each message of size  $s/p$  to root (root incurs  $s \cdot \beta$  receive bandwidth)
  - ▶ Allgather: each processor sends  $s/p$  words portion to every other processor (every processor incurs  $s \cdot \beta$  send and receive bandwidth)
  - ▶ Broadcast done by Scatter then Allgather
  - ▶ Reduce done by Reduce-Scatter then Gather
  - ▶ Allreduce done by Reduce-Scatter then Allgather
  - ▶ All-to-all can be done by sending messages directly in one round
  - ▶ For  $h < p$ ,  $O(\log_h p)$  supersteps required, but bandwidth cost same for all except all-to-all (higher by  $O(\log_h p)$  via  $h$ -ary butterfly protocols)

# Profiling and Matrix Multiplication

# Ramping up the Flops – matrix multiplication

- Matrix addition:  $C = AxB$ 
  - If  $A$  is  $n \times k$  and  $B$  is  $k \times m$ , then:
    - $C$  is  $n \times m$ , each position in  $C$  requires the sum of  $k$  multiplications (the dot product of each column of  $B$ , row of  $A$ )

# Ramping up the Flops – matrix multiplication



| N (N x N matrix) | AI (Flops/Byte) | Performance (GFLOPs) | Roofline Peak (GFLOPs) |
|------------------|-----------------|----------------------|------------------------|
| 1024             | 21.06           | 644.63               | 23900                  |
| 2048             | 4.06            | 631.72               | 6496                   |
| 4096             | 3.78            | 618.59               | 6048                   |

- For a matrix multiply, this doesn't look very good – even at larger sizes, achieving <3% of device capability
- Vendors will likely provide libraries for things they want to be highly optimized. Let's try one - **rocBLAS**

# Ramping up the Flops – matrix multiplication



| N (N x N matrix) | AI (Flops/Byte) | Performance (GFLOPs) | Theoretical Peak (GFLOPs) |
|------------------|-----------------|----------------------|---------------------------|
| 1024             | 42.00           | 8822.32              | 23900                     |
| 2048             | 47.44           | 11017.20             | 23900                     |
| 4096             | 35.77           | 12337.90             | 23900                     |

- This looks much better
- >10x floating-point performance
- >50% device peak at largest size

# Parallel Frameworks -- Kokkos

# The Kokkos EcoSystem



## How does this compare to OpenMP?

Serial

```
for (int64_t i = 0; i < N; ++i) {  
    /* loop body */  
}
```

OpenMP

```
#pragma omp parallel for  
for (int64_t i = 0; i < N; ++i) {  
    /* loop body */  
}
```

Kokkos

```
parallel_for(N, [=] (const int64_t i) {  
    /* loop body */  
});
```

### Important concept

Simple Kokkos usage is no more conceptually difficult than OpenMP, the annotations just go in different places.

## OpenMP

```
double totalIntegral = 0;
#pragma omp parallel for reduction(+:totalIntegral)
for (int64_t i = 0; i < numberOfIntervals; ++i) {
    totalIntegral += function(...);
}
```

## Kokkos

```
double totalIntegral = 0;
parallel_reduce(numberOfIntervals,
               [=] (const int64_t i, double & valueToUpdate) {
    valueToUpdate += function(...);
},
totalIntegral);
```

- ▶ The operator takes **two arguments**: a work index and a value to update.
- ▶ The second argument is a **thread-private value** that is managed by Kokkos; it is not the final reduced value.

# Execution Space

a homogeneous set of cores and an execution mechanism  
(i.e., “place to run code”)



Execution spaces: Serial, Threads, OpenMP, Cuda, HIP, ...

**Memory space:**  
explicitly-manageable memory resource  
(i.e., “place to put data”)

