

SPP z.B. Performance Modeling, Neural Networks, Parallelism Discovery, Scheduling in HPC & cloud, Scalable Algorithms

(like `!a`)  
conditions evaluated to `<0 => false`  
`>=const => true`

struct name {  
 type attribute;  
 ...  
};

Arrays:  
type name[length];  
length = sizeof(name)/sizeof(type)

Pointer  
type \*name;  
pointer++ berücksichtigt sizeof(type of pointer)

workaround for strings

char \* with \0 at the end

I/O

number written characters  
Read printf(const char\* format, ...)  
scanf(const char\* format, ...)

%c char  
%i int  
%s char\*

oder  
#pragma once  
#ifndef KEY  
#define KEY  
... code ...  
#endif

to prevent mehrfaches Reinlesen

#ifndef KEY  
#define KEY

shallow copies of everything  
arrays decay to pointers  
explicit casting like `int` (can cast pointers arbitrarily)

main function

int main(int argc, char \*\* argv) { ... }

break; cont;  
switch (number)  
case 0: case 1:  
 break;  
default:

[const] int \* [const] pointer  
can't change reference value

type aliases    `typedef existing_type alias;`

(++) - Zero-overhead principle    don't pay for what not used  
better than it done reasonably by hand

also flushes  
std::cout << val << std::endl;  
typecopy std::cin >> copy;

references

type & name = other; always initialized

other modifiers: `[nodiscard]` - warning if value not used  
begin/end

throws and catch everything

constexpr - can be evaluated during compile

requires std::is\_arithmetic\_v<T>

(noexcept)

templates (previously inefficiently done by void\*)

explicit, friend

template<typename T, int n,...> no floats allowed  
return type func(Param) { ... }    local explicit Typewichungen  
calling: func<int,3>(...);    compile produces func for needed types

std::size\_t store size of any possible object

[] allows multiple param

int operator=(type& other) { ~return \*this }

can overload operators.

using value-type classes - connect data and functionality, resource management

class name {

constructor: name(param) : attribute(param), ... { code... }

public: Bereiche

copy constructor: default shallow

private: starts with

~name deconstructor

unified constructor

auto-matches greedily, copy type  
const auto, const auto

standard containers std::

string

Iterating Containers

using vul

vector - managed memory block  
contains assert and const add

queue, stack,  
deque - contiguous memory  
const access, push/pop

for (auto it = cont.begin(); it != cont.end(); ++it) { ... }

BRUNNEN

list: linear access

const merge sort

array<-Ty,-si>

fixed size

map<-Ko,-Val> uses operator []  
logarithmic storage in red-black tree  
unordered-map hash-map  
amortized constant

lambda closure

[Identifiers,...] (args) { rumpf }

14

# Parallel Architectures

## Flynn's classification 1966

#data streams \ #instruction streams

|  | Single | Multiple |
|--|--------|----------|
|--|--------|----------|

Single

SISD  
classical uniprocessor

MISD  
no commercial

Frequency  $\rightarrow$  Voltage  $\Rightarrow$  Heat  
Power wall  
Memory wall  
ILP wall

Instruction  
Level  
Parallelism

Multiple

SIMD

Date parallelism  
z.B. extensions for  
multimedia,  
vector processing  
 $\downarrow$  GPU

MIMD

Thread-level parallelism  
+ flexibility  
popular execution model:  
Single Program Multiple Data SPMD  
potentially different control flow

Frequency  $\rightarrow$  Voltage  $\Rightarrow$  Heat

Functional parallelism

Data

Control > dependency  
Data <

SPMD

Single instruction multiple threads SIMD GPU

manages threads in groups of parallel threads (warps)  
instruction unit broadcasts to all  
threads may do nothing instead

## Memory Architecture

shared memory = single shared address space

UMA - Uniform memory access  
VCpus share same memory  
implementation + simple  
Bus switch

NUMA - Non-uniform memory access  
Memory has affinity to a CPU  
faster with local memory

+ scalable  
+ scalable

## Distributed memory

(Interconnecting)

### Network architecture

physical links of nodes (processors, memory)

protocol

### Topology, Routing technique

#### Bandwidth, Latency

Maximum rate  $\downarrow$  sending overhead + time of flight  
Aggregated total  $\downarrow$  bandwidth + receiving overhead  
Effective to application

### Shared medium

decentralized  
- only 1 message at time  
- broadcasts - listen to all  
not scalable - receiving expected if next  
+ cheap



### Switched

concurrent messages of different node pairs

+ scalable

Brick Wall  
Instruction Level Parallelism

## Lichtenberg

Processor - shared memory for cores  
Compute Node - typically 2 processors  
shared memory fast connection  
Server

Cluster - nodes as battalion by scheduler  
no shared memory

## Lichtenberg II

Big-Mem Accelerator nodes  
NVIDIA Tesla

infiniti and non-blocking fat tree  
Ethernet

## Heterogeneity, Customisation

CPU, GPU - data parallelism

FPGA - problem specific

Neuromorphic - asynchronous CMOS circuits

Quantum

Write invalidate  $\leftrightarrow$  update/broadcast  
+ common - bandwidth

Coherence misses true sharing miss invalidated writing same way  
false diff. mod prevented by smaller blocks

directory-based - für jeden Block 1 location for status  
centralized (UMA)  $\Rightarrow$  distributed (NUMA), + scalability  
- fast hardware & query at V memory reference

z.B. NUMA split address: Node | Line | Offset request forwarded  
directory: Line, cache level, where Node to owner, change  
optimization: only 1ropic search by directory Bitmask  
Bitmask: entries owner  
LinkedList: invalidates previous  
status of Line (dirty/clean) - traffic-useless with only 1 copy

snooping serialized

V caches on broadcast medium

copies have sharing status on to locations

$\hookrightarrow$  no centralized slot

Validates snoop for sharing status

Write-through caches

data can be retrieved from main memory

Write-back caches: (dirty)

data from processor with only copy

- longer than + bandwidth  
main memory scalable

z.B. MESI protocol

states: read from memory if able

Modified: no other memory caches all heard

Exclusive: copies invalidate modified?

Shared: to write What if cache full and block

Invalidate

## Indirect/dynamic interconnection

### Centralized switched networks

General links, intermediate switches  
for shared and distributed memory

Crossbar switch

+ Non blocking

- scale  $N^2$  switches

### N-dimensional mesh

direct link to neighbors

+ scaled + nearest neighbor

$\Downarrow$  borders

Torus mapped  $\Rightarrow$  no borders

direct / static /

typically 3-6 dimensions

$\Rightarrow$  Distributed switched networks

VSwitches direct link to End-node

switches 1:1 nodes

[processor / memory]

fully connected  $\leftrightarrow$  binRing

+ performance + cost

Dragonfly

### Multi-stage Interconnection network MIN

z.B. Omega Network

- blocking

Perfect shuffle permutation at each stage

- blocking

log<sub>2</sub> N stages  $\frac{N}{k} \cdot \log_2 N$  switches

$k \times k$  switches

to in. of top half

1 2 3 4

1 2 3 4

1 2 3 4

1 2 3 4

1 2 3 4

### Fat tree

leaves: end nodes

vertices: switches

+ popular for cluster interconnections

+ often composed of multiple smaller

same total

bandwidth

3/14

## Order of updates

- { between threads (through network) <sup>requests</sup>
- compiler instruction reordering

coherence  $\leftrightarrow$  consistency

single memory

ordering access of different memory locations

When to update?

Observed order?

## Sequential

result = if processors execute sequential and for individual processor order of program  
+ simple programming paradigm  
- potential performance degradation

## Relaxed consistency models

only synchronize important - hard to debug

program synchronized = values to shared data ordered by sync. operations

data races - not ordered access to single location

hardware supplied instructions

atomic, detect problems, can be bottleneck

## Time sharing / multitasking

processor switches process

state: process



## multiprogramming

## light weight process

### Thread

shares: code, data, files  
own: Registers, Stack



hides latency  
boosts performance

## Communication



explicit, often copies  
protected address space  
- profound design  
- hard writing debugger  
+ scalable



+ space, switch efficiently  
+ convenient  
- hard debugging race conditions

## On-chip multithreading

CPU runs other threads if stalled z.B. Memory

fine  $\leftrightarrow$  coarse grained

round robin  
- switch overhead

only switch upon stall  
- Lücken

up to 4 threads per core

2 B., i7 Hyperthreading

2 threads or processes per core

looks like 2 CPU with shared memory + concurrency

- conflicts

# Programming Models - abstraction computer allowing expression parallel algorithms/data structures

Implementation  
languages + extensions  
API  
compiler directives

Objectives  
code  
Productivity  
Portability  
optimization  
comparable performance

Nodes  
+ scale  
- re-design

**MPI** - Message Passing Interface z.B. compute clusters, distributed memory  
ranks as addresses  
MPI - Send buffer, count, datatype, RECEIVER, ... );  
Recv SENDER

Cores  
+ incremental  
parallelization  
- scale  
- debug

**OpenMP** - Multithreading  
multiple threads via shared variables  
sync. through barriers, lock-style methods,  
atomic operations



# pragma omp parallel for

for (int i=0; i<n; i++)

$$z[i] = a * x[i] + b * y[i];$$

}

**CUDA** - GPU computing (with NVIDIA extension for data-parallel workloads)



- global - parallel

void func(...){

int i = blockIdx.x \* blockDim.x + threadIdx.x;

if(i < n) ...

}

needs not full block

int nblocks = (n+255)/256; // 256 threads per block

func\_parallel(<nblocks, 256>>(...));

Performance = Resources to solution

response, execution time

Throughput

Energy

Peak Performance - guaranteed limit

Actual - limited by

critical resource

Speedup  $S = \frac{\text{time before enhancement}}{\text{time after}}$

Amdahl's law

Fraction enhanced = code Teil der besser wurde

Enhanced = speedup for it

$$S = \frac{1}{(1-f_{\text{enhanced}})} + \frac{f_{\text{enhanced}}}{f_{\text{enhanced}}}$$

for parallelism

$$S = f_{\text{seq}} + \frac{1-f_{\text{seq}}}{P} < \frac{1}{f_{\text{seq}}}$$

cost of parallelism  
communication demand  $E(p) = \frac{S(p)}{P}$

**BRUNNEN** if < 1 super-linear speedup  
z.B. through aggregated cache

Law of Gustafson - considers problem size

Assumption

sequential part const runtime  $t_f$

parallel part  $t_r(n,p)$

$$S(n,p) = \frac{t_f + t_r(n,p)}{t_f + t_r(n,p)}$$

if perfectly parallelized  $\lim_{n \rightarrow \infty} S(n,p) = p$

$t_r(n,p) = \frac{t_r(n)}{p}$  denke ich

Scalability

Weak: larger problems with more resources solvable  
harder problem size per processor constant

Strong same problems faster with more resources  
harder limited by Amdahl's law and  $E(p)$

Asymptotic

$$\text{z.B. dot product } t(N,p) = \Theta(N/p + \lg(p)) \quad S = \frac{T_1}{T_p} = \frac{\Theta(N)}{\Theta(N/p + \lg(p))} = \Theta\left(\frac{N}{\frac{N}{p} + \lg(p)}\right)$$

$$E(p) = \frac{T_1}{P \cdot T_p} = \Theta\left(\frac{N}{N + P \lg(p)}\right)$$

Little's formula

$$C = R \cdot L$$

L Latency  
desired throughput items  
# items concurrently in system

Concurrency required to hide latency



Data dependence - fine accesses multiple at least one writing 21 n-access 1-location 21 write race condition - outcome depends on ordering

### Detection

automatic (if possible)

z.B. PLUTO

Semi-automatic

if indizes linear expression  
solve system

manual

loop-carried - between iteration

Classification Action 1 sequentially before 2

|       |                                                                                               |                                                  |       |
|-------|-----------------------------------------------------------------------------------------------|--------------------------------------------------|-------|
|       | A1                                                                                            | Read                                             | Write |
| Read  | RAR<br>no dependence                                                                          | RAW<br>flow dependence<br>⇒ no parallel possible |       |
| Write | WAR<br>anti-dependence<br>only reuse<br>⇒ late firstprivate<br>↑ read only copy<br>- overhead | WAW<br>output dependence<br>⇒ lastprivate        |       |

Remove techniques

reduction clause

induction-variable elimination

if value only dependent on loop index

replace with expression z.B. Fibonacci

loop skewing

index für teile verschieden like Matrix transpose

Non removable

alternative algorithm?

parallelize different loop

Fissioning

split in parallelizable and non-parallelizable part

z.B. scalar Expansion if non-parallelizable part is only scalar

Parallel overhead ⇒ conditional parallelization  
(create distribute think clause e.g. if (n > MATRIX\_COUNT))

loop level parallelisation scheduling - distribution of iterations in chunks goal: finish even

static - distribution at loop-entry

- without chunk size

Threads 1 chunk nearly equal size

with chunk size - assigned in round robin fashion

assign 1 thread to first (round robin fashion)

Dynamic - threads request new chunks

Dynamic

default chunk size = 1

Guided

implementation specific start size

then exponential decrease

min. = chunk size

Runtime - via environment var/API call good for testing

Auto (no chunk size) - impl. decides

- complex + generic S P M D-style parallelism - not work distribution construct / divided execution

→ work replication

omp\_get\_thread\_num

0 for master ~ (team size - 1) ) Domain decomposition  
some same assignment based on thread num

omp\_get\_num\_threads() = team size

#pragma omp

structured block

only exec by

1 thread

single [clauses] E work distr. construct?

{(first-) private

copyprivate

no wait

private clause only construct

get it in region?

pass as arguments

Thread private

#pragma omp threadprivate (var, ...)

→ after var definition

Once initialized

persistent across parallel regions

⇒ same thread number ⇒ same var

no data-sharing clauses for them

but

copyin(var) master → workers

@ entering parallel construct

bei forc.:

copyin(var) executing → other

@ after single construct

bei single c. copyprivate(var) executing → other

@ after single construct

BRUNNEN

Work distr. constructs incl single must be reached by #threads exec in same order

have implicit barrier except nowait clauses

single entry/exit

no nesting

no explicit barrier

Orphaned - not in parallel construct

reached from serial behaves like 1 master thread V.

parallel like inside

- but different sharing semantics

# Synchronization - implicit communication for/through shared memory

does not order. - Mutual exclusion ↔ Event - can be used to order

| Result:                                             | Race condition ↔ Data race                                                       |
|-----------------------------------------------------|----------------------------------------------------------------------------------|
|                                                     | non-deterministic undefined behavior                                             |
| unenforced relative timing of concurrent operations | 2 unordered conflicting actions min 1 not atomic                                 |
| acceptable z.B. round-robin                         | probably error but 2 in threads scheduling first writes 1 likely works still bad |

often solved by private



Mutual Exclusion z.B. for access to shared data

critical construct

#pragma omp critical [name) [hint [hint-exp]]]

⇒ mutual exclusion of same name  
wait until no other thread inside

no branches in/out

no fairness guaranteed

Forward progress guaranteed

legible thread always gets access

⚠ nesting in regions same order  
to prevent deadlock

↳ lock impl.  
(un-)contented

(non-)speculative

+ One sync  
+ for single

construct multiple update

+ several locations

atomic construct

#pragma omp atomic [clauses]

expression-statement/structured block  
allowed depend on clause "not covered"

Only numeric is合法  
vars + may use hardware instructions  
→ performance

types changed  
atomic

don't mix  
for conflicting  
use

lock routines <sup>through</sup> by runtime library

[name] [-nest]

omp\_set\_lock(<reference to lock>);

blocks bis Lock acquisition

omp\_test\_lock();

returns true if lock acquired else false

omp\_unset\_lock();

releases it

[-nest]

omp\_lock\_t myLock;

omp\_init\_lock(&myLock);

...

[-nest]

omp\_destroy\_lock(&myLock);

+ flexibility

no block structure

dynamic lock

compute while waiting with test

nestable locks (z.B. for recursion)

same thread can get lock multiple times

release

Event synchronization

barrier construct

#pragma omp barrier

all threads in parallel region

must reach barrier before passing

⚠ not inside work-sharing construct

identical to implicit barrier

taskwait, taskgroup → siehe Tasking  
Ordered, Flush → not covered

Tasking - idea: separate problem decomposition, concurrency



- + over-decomposed  
⇒ load balance
- task overhead & too small  
(creation, sync pool)

explicit ↔ implicit task makes standard task more compact

L created by task constraint

L generated for k threads on begin parallel construct

#pragma omp task [clauses]

structured block

only works in parallel region

undefined order

barrier inside explicit task → deadlock



#pragma omp taskwait

waits for k direct child tasks

for grandchildren → use taskwait on children too

untied  
↳ data sharing attributes default created inside  
↳ relative to implicit task  
if(condition) (only 1 per task)  
final(condition) → if true → task becomes final  
ignore task constructs inside ind. itself  
depend(in/out) ibout : var-list  
↳ read-written by task  
inherit from creation context → first private  
→ private  
optimization, z.B. for recursion, getting smaller and smaller  
(Hubbildung, graph)

#pragma omp taskgroup

structured block

3

at end waits for all tasks  
created inside and their descendants

Task scheduling

↳ thread may switch task resume

implied ↔ explicit scheduling points

↳ #pragma omp taskyield

after task generation

last instruction task region

implicit explicit barriers

taskwait, group constructs

tieel ↔ untied task

↳ default Untied clause

some task → on task gen

resumes → any task in team

suspended task

## Parallel Algorithms - faster, larger

$T^*(n)$  = time of best known seq. algorithm  
energy efficient

$$\text{Speedup } S_p(n) = \frac{T^*(n)}{T_p(n)}$$

desired p  
 $\gg p \Rightarrow$  superlinear

$T_p(n)$  = time with p processors

$$\text{Parallel Efficiency } E_p(n) = \frac{T_1(n)}{p + T_p(n)}$$

S1 without superlinear  
L may with aggregate cache

$T_p(n)$  = p processors  
limited by seq part  
grain size

$$T^*(n) \leq T_1(n)$$

$$T_p(n) \geq T_\infty(n) \Rightarrow E_p(n) \leq \frac{T_1(n)}{p + T_\infty(n)}$$

$\Rightarrow$  drops quickly after

$$p > \frac{T_1(n)}{T_\infty(n)}$$

$T_1(n)$  with Landau Symbols  
number of operations  $\Rightarrow$  needs model

sequential

RAM - Random access machine

+ simple language



Parallel models

simplicity  $\leftrightarrow$  accuracy  
general, parallel  $\rightarrow$  theoretical  
realistic

DAG - Directed acyclic graph

+ architecture independent  
+ data dependence

- complex  
- communication, allocation missing

# = input size  
Inputs  $\Rightarrow$  nodes without incoming  
Operations  $\Rightarrow$  inner node = 1 time unit  
Outputs  $\Rightarrow$  without outgoing arcs  
No branches  
 $\Rightarrow$  e.g. loop unrolling

Algorithm  $\Rightarrow$  family of DAGs.  $\{G_n\}$

Scheduling p processes, shared data

node  $i \sim (j_i, t_i)$  are consistent  
processor  $i$  at time  $j_i, t_i$  at a time

$\{(j_i, t_i) | i \in N\}$  = schedule  
set of nodes

execution time =  $\max_{i \in N} (t_i)$

$T_p(n) = \min_{\text{schedule}} \{\max_{i \in N} (t_i)\} \geq \text{depth of DAG}$

Shared-memory model - MIMD

C-concurrent  
E-exclusive  
R-Read  
W-Write

for communication  $\rightarrow$  shared memory  
has local memory  $\rightarrow P_1, \dots, P_n$   
processor ID

begin  
global.read  
write  
Compute  
end

Tree  $\Rightarrow$  for  $h=1$  to  $\log_2 n$  do  
if  $(i \leq n/2^h)$  then  
begin ... end

variants

TEREW  
no simult.

var  
- synchronous  $\leftrightarrow$  asynchronous  
common clock separate clock

CREW  
only read

such. PRAM

sync points

CRCW

BRUNNEN

parallel

programmer responsible

common  
same value

arbitrary

Priority-min index succeeds

## Network model

-difficult

-topology dependent

asymmetric graph of processors

no shared memory

synchronous  $\leftrightarrow$  asynchronous

sync via messages

message-passing model

different

$send(X, j)$  with routing  
blockierend  $\rightarrow receive(X, j)$

Topology e.g. Linear Array, Ring, mesh, torus, hypercube, fully connected

evaluation param

diameter - max dist between pair

maximum degree of a node

Edge or Node connectivity - min removed to disconnect part

Bisection width - # nodes removed to cut

network in roughly in half

$$\#_p \text{ tradeoff} \quad T_{\text{comp}} = \text{computation time}$$

$$T_{\text{comm}} = \text{communication time} \quad \text{comm}(n) = \alpha + \frac{n}{\text{Bandwidth}}$$

latency

Bandwidth

## Work-depth model

DAG, but workers with greedy scheduling

$T_p$  = time to run with p workers

$T_1$  = work (seq)

$T_\infty$  = depth / critical path / span

no superlinear speedup

more processors don't slow down

depth  $\sim$  scalability

Brent's theorem

$$p \leq \text{processors for max concurrency} \Rightarrow \frac{T_1}{p} \leq T_p \leq \frac{(T_1 + T_\infty)}{p} + T_\infty$$

perfectly  $\leftrightarrow$  imperfectly

$\downarrow$  Parallelizable

$$\Rightarrow \left( \frac{T_1}{p} + T_\infty \right) \leq S_p \leq \min(p, \frac{T_1}{T_\infty})$$

estimate if  $T_\infty \ll T_1 \Rightarrow T_p \approx \frac{T_1}{p} + T_\infty$   
↓ decent parallelism

+ usually tighter than Amdahl's law  
+ lower bounds, notion of imperfect parallelization

## Over-decomposition

Greedy scheduling if piece  $\frac{T_1}{T_\infty} : S_p = \frac{T_1}{T_p} \approx p$

$$\text{Parallel slack} = \frac{S_p}{p} = \frac{T_1}{p + T_\infty}$$

= assumes  $\infty$  memory bandwidth  
greedy scheduling (no locks or task pinning)

## nur erreichbare Modelle

### BSP - Bulk-synchronous parallel model

Processor-/Memory-modules

Network

Synchronizer (barrier)

1 Superstep = compute, send, receive

### LogP model - cluster abstraction

distributed memory

parameters

- ↳ L = latency communication medium
- ↳ O = overhead messages
- ↳ g = required gap between messages
- ↳ P = # processing units

# GPU - Graphical Processing Unit

- 1980s VGA - video graphics array controllers
- Late 1990s + 3D accelerator functions
  - Triangle setup, rasterization
  - Texture mapping, shading
- 2000 GPU became our processor
  - Programmable
  - Precise floating point
  - Scalar
- Support for general-purpose programming languages (C, C++)

applications  
graphics, deep learning, ...



## Logical graphics pipeline

- programmable - physically in Unified processor array  
- fixed functions



## GPU programming interface

- Vulkan
  - for both Graphics: Open GL, Direct 3D
  - low overhead
  - cross platform
- general: CUDA, HIP extensions of C++
- purpose: SYCL, OpenCL (framework)
  - programs for heterogeneous platforms
- OpenACC, OpenMP directive based

Heterogeneous system  
1 CPU ≥ 1 GPU  
Heterogeneous cluster  
each processor has ≥ 1 GPU

## CUDA - Compute Unified Device Architecture & with NVIDLif extensions

slides 9 page 22E  
CUDA software stack, compute capability



Virtualizes multiprocessors ↔ Problem Decomposition  
⇒ portability

| CUA variable type qualifiers | Scope  | Memory   | Lifetime    |
|------------------------------|--------|----------|-------------|
| automatic                    | Thread | Register | Kernel      |
| scalar vars                  | "      | Local    | "           |
| array vars                   | Block  | Shared   | "           |
| - device -                   | Grid   | Global   | Application |
| -- constant --               | "      | constant | "           |

combines SIMD performance - Multithreading productivity

## SMT in SM

partitioned after linearized threadIdx  
last warp padded - results nullified

SM manages warps - groups of 32 parallel threads  
↳ do same/nothing  
scheduled like threads in normal processor  
many for latency hiding  
waiting kept in register files  
↳ fast

different

→ data dependent  
conditional branching  
in warp  
↳ diff Iterations in forloop

→ serializes

- performance

## Compilation workflow (NVCC - compiler)

need  
same  
word length  
(32bit)

### Host Code

replacing CUDA syntax → runtime function calls

compiled (by another tool)

### Device Code compiled to:

cuobj object (binary code)

- architecture specific

compatibility between minor revisions

PTX - Parallel Thread Execution (scalar instructions)

compiled just-in-time by device driver (and cached)

- load time

+ aufwärtskompatibel (benefit from optimizations)

← stable target Instruction Set Architecture  
for many compilers (neben Direct 3D vector instructions)

## CUDA programming interface (C++ extension)

implicit initialization for runtime @ first runtime call

↳ CUDA context & devices on system

↳ (just-in-time compilation device code)

cudaError\_t **cudaMalloc** (void \*\*devPointer, size\_t size)  
↳ returns pointer to location      sizeof(type) · number  
                                          in bytes

cudaError\_t **cudaMemcpy** (void \*dst, const void \*src, size\_t count, enum cudaMemcpyKind kind)

don't overlap

### Error handling

const char\* **cudaGetErrorString** (cudaError\_t error)

↳ cudaMemcpy Host To Host

" Host To Device

" Device To Host

" Device To Device

cudaError\_t **cudaFree** (void \*devPointer)

### Extension to function declaration

before return type

kombinierbar  
-- device -- ) nur von entsprechenden  
default → -- host -- Aufrufbar  
-- global --

### Built-in vector types of basic integer, floating types

type 1 fields  
" 2 x,y  
" 3         
" 4       

constructor:

**make-type**  $\frac{1}{2}$  (type x, ...)

↳ nicht angegebene 1

**dim3** - integer vector for dimensions

### Kernel call

name **<< gridDim, blockDim >>** ();

dim3 als var oder initialized mit {...} übergeben

### Thread organization

vars for device code → global coordinates

dim3 **gridDim**

uint3 **blockIdx**

dim3 **blockDim**

uint3 **threadIdx**

int **warpSize**

x = blockIdx \* blockDim.x + threadIdx.x

y = ...

z = ...

concurrently to each other

Streams of GPU instructions

Execute in issue-order

1 Default Stream

Sync with CPU

asynchronous/non-blocking

kernel launch

asynchronous cuda calls (allow overlapping)

synchronous/blocking kernel, memcpy

memcopies (except ones within same device,  $\leq 64kB$ )  
H2D

cudaDeviceSynchronize()

blocks until completion of kinda calls

Sync threads in block

Barrier

- syncThreads();

max threads pro block

more multiple blocks  $\lceil \frac{N}{\text{threads per block}} \rceil$

Rundbehandlung,  
Randthreads  
tun nichts.

may be useful to collaborate

load data in shared, syncThreads, then work

e.g. Matrix multiplication parts required multiple in block

change order, phases

plan multiple stages  
to reduce memory

= Locality

Reduction algorithm / Tournament Reduction - "divide and conquer"

```
int t = threadIdx.x;
for(unsigned int stride = blockDim.x >> 1; stride > 0; stride >> 1) {
```

- syncThreads();

if (t < stride)

array[t] = op(array[t], array[t+stride]);

}



Further performance aspects

threadblocks, warps

coalescing memory access

prevent bank conflicts

prefetching data

balance per thread resources

unroll loops  $\rightarrow$  reduce cost of loop control

typically compiler thing

thread granularity

Advanced CUDA features

NVIDIA Nsight

performance view

GPU Direct

copy data between GPUs  
even remote

Dynamic parallelism

kernel launching kernel  
recursive, data dependent parallelism

Unified memory (CPU, GPU)

runtime handles it

cooperative groups of threads to sync,  
bind block, tile, coalesced, multi-grid, grid

CUDA graphs of GPU operation  
launched as one from CPU

Tensor cores

specialized hardware execution units  
e.g. matrix, convolution operations

## Parallel Patterns

frame  
context  
forces (goals & constraints)  
solution

## pattern language

to solve problem only choose one ~~best~~ pattern after another

⇒ pattern web

### 1. Finding concurrency

#### Decomposition patterns

##### Task functional

break stream of instructions

⚠ not to fine granular or singular  
group based on memory  
cutoff point

##### Data

sieve task

Forces:

Efficiency

Simplicity ↔ Flexibility

↑

#### Dependence analysis

graph

group tasks

sharing constraints

order tasks

temporal/data dependencies

Data sharing

read only, accumulate  
n read twice, effective load

Design Evaluation

good enough?

target platform?

Forces?

⇒ concurrent tasks with local duty  
grouping, dependencies

### 2. Algorithm structure

16.1.24

Rest nicht relevant