

# AMD HARDWARE AND SOFTWARE

Suyash Tandon, Justin Chang, Julio Maia, Noel Chalmers,  
Paul T. Bauman, Nicholas Curtis, Nicholas Malaya,  
Alessandro Fanfarillo, Jose Noudohouenou, Chip Freitag,  
Damon McDougall, Noah Wolfe, Jakub Kurzak, Samuel  
Antao, George Markomanolis, Bob Robey

ADVANCED MICRO DEVICES, INC.



slides on LUMI in /project/project\_465001098/Slides/AMD/

hands-on exercises: [https://hackmd.io/@gmarkoma/lumi\\_finland](https://hackmd.io/@gmarkoma/lumi_finland)

hands-on source code: /project/project\_465001098/Exercises/AMD/HPCTrainingExamples/

# AMD HARDWARE FOR HPC AND AI

## CDNA ARCHITECTURE

# AMD GPUS



Radeon™ Graphics Cards

RDNA architecture

E.g.:

- RX 6000 Series
- RX 7000 Series



AMD Instinct™ Accelerators

CDNA architecture

E.g.:

- MI100
- MI200
- MI300

# AMD IN HPC



## Frontier@ORNL

- currently the largest machine in the world
- the first computer to cross 1 exaFLOPS
- AMD EPYC CPUs
- AMD Instinct GPUs



## LUMI@csc

- currently the largest machine in Europe
- 5<sup>th</sup> fastest in the world
- AMD EPYC CPUs
- AMD Instinct GPUs

# AMD INSTINCT™ MI200



## AMD INSTINCT™ MI250X

ONE OF THE WORLD'S MOST  
ADVANCED DATA CENTER ACCELERATOR

58B

Transistors in 6nm

220

Compute Units

880

2nd Gen Matrix Cores

128

GB HBM2E @ 3.2 TB/s

<https://www.amd.com/system/files/documents/amd-cdna2-white-paper.pdf>

# AMD INSTINCT™ MI200



## 2ND GENERATION CDNA ARCHITECTURE TAILORED-BUILT FOR HPC & AI

TSMC 6NM  
TECHNOLOGY

UP TO 110 CU PER  
GRAPHICS CORE DIE

4 MATRIX CORES PER  
COMPUTE UNIT

MATRIX CORES  
ENHANCED FOR HPC

8 INFINITY FABRIC  
LINKS PER DIE

SPECIAL FP32 OPS FOR  
DOUBLE THROUGHPUT

# MULTI-CHIP DESIGN

TWO GPU DIES IN PACKAGE TO MAXIMIZE COMPUTE & DATA THROUGHPUT

INFINITY FABRIC FOR CROSS-DIE CONNECTIVITY

4 LINKS RUNNING AT 25GBPS

400GB/S OF BI-DIRECTIONAL BANDWIDTH



# 2<sup>nd</sup> GENERATION MATRIX CORES

## OPTIMIZED COMPUTE UNITS FOR SCIENTIFIC COMPUTING



DOUBLE PRECISION (FP64)  
MATRIX CORE THROUGHPUT  
REPRESENTATION

### MI100 MATRIX CORES

OPS/CLOCK/COMPUTE UNIT

No FP64 Matrix Core

256 FP32

1024 FP16

512 BF16

512 INT8

### MI250X MATRIX CORES

OPS/CLOCK/COMPUTE UNIT

256 FP64

256 FP32

1024 FP16

1024 BF16

1024 INT8

# AMD INSTINCT™ MI200



INFINITY FABRIC

# AMD MI250X specifications

- Two graphic compute dies (GCDs)
- 64GB of HBM2e memory per GCD (total 128GB)
- 26.5 TFLOPS peak performance per GCD
- 1.6 TB/s peak memory bandwidth per GCD
- 110 CU per GCD, total 220 CU per GPU
- The interconnection is attached to the GPU (not on the CPU)
- Both GCDs are interconnected with 200 GB/s per direction
- 128 single precision FMA operations per cycle
- AMD CDNA 2 Matrix Core supports double-precision data
- Memory coherency

AMD CDNA™ 2 white paper: <https://www.amd.com/system/files/documents/amd-cdna2-white-paper.pdf>

# NEW IN AMD INSTINCT MI250X PACKED FP32

FP64 PATH USED TO EXECUTE  
TWO COMPONENT VECTOR  
INSTRUCTIONS ON FP32

DOUBLES FP32 THROUGHPUT  
PER CLOCK PER COMPUTE UNIT

pk\_FMA, pk\_ADD, pk\_MUL, pk\_MOV  
operations



<https://www.amd.com/en/technologies/infinity-hub/mini-hacc>

# MI200 COMPUTE UNIT



## each SIMD unit

- has 16 SIMD lanes
- operates on vectors (waves) of size 64
- handles up to 10 waves simultaneously

# AMD SOFTWARE FOR HPC AND AI

## ROCM PLATFORM

# AMD ROCm™ Open Software Platform For GPU Compute



# Open Software Platform For GPU Compute



- Unlocked GPU Power To Accelerate Computational Tasks
- Optimized for HPC and Deep Learning Workloads at Scale
- Open Source Enabling Innovation, Differentiation, and Collaboration



# AMD ROCm 5.0

## DEMOCRATIZING EXASCALE FOR ALL

### EXPANDING SUPPORT & ACCESS

- Support for Radeon Pro W6800 Workstation GPUs
- Remote access through the AMD Accelerator Cloud

### OPTIMIZING PERFORMANCE

- MI200 Optimizations: FP64 Matrix ops, Improved Cache
- Improved launch latency and kernel performance

### ENABLING DEVELOPER SUCCESS

- HPC Apps & ML Frameworks on AMD InfinityHub
- Streamlined and improved tools increasing productivity

# LIBRARIES

## **rocBLAS / hipBLAS**

- basic operations on dense matrices

<https://github.com/ROCMSoftwarePlatform/rocBLAS>

<https://github.com/ROCMSoftwarePlatform/hipBLAS>

## **rocSOLVER**

- dense linear algebra solvers

<https://github.com/ROCMSoftwarePlatform/rocSOLVER>

## **rocSPARSE / hipSPARSE**

- basic operations on sparse matrices

<https://github.com/ROCMSoftwarePlatform/rocSPARSE>

<https://github.com/ROCMSoftwarePlatform/hipSPARSE>

## **rocALUTION**

- sparse linear algebra solvers

<https://github.com/ROCMSoftwarePlatform/rocALUTION>

## **rocFFT / hipFFT**

- Fast Fourier transforms

<https://github.com/ROCMSoftwarePlatform/rocFFT>

<https://github.com/ROCMSoftwarePlatform/hipFFT>

## **rocRAND / hipRAND**

- random number generation

<https://github.com/ROCMSoftwarePlatform/rocRAND>

<https://github.com/ROCMSoftwarePlatform/hipRAND>

## **rocPRIM / hipCUB / rocThrust**

- scan, sort, reduction, etc.

<https://github.com/ROCMSoftwarePlatform/rocPRIM>

<https://github.com/ROCMSoftwarePlatform/hipCUB>

<https://github.com/ROCMSoftwarePlatform/rocThrust>

# ALSO OPEN SOURCE

## **the compiler**

- <https://github.com/ROCMSoftwarePlatform/llvm-project>

## **the runtime**

- <https://github.com/RadeonOpenCompute/ROCR-Runtime>

## **the debugger**

- <https://github.com/ROCM-Developer-Tools/ROCgdb>

## **the profiler**

- <https://github.com/ROCM-Developer-Tools/rocprofiler>

## **the HPL benchmark**

- <https://github.com/ROCMSoftwarePlatform/rocHPL>

## **the HPCG benchmark**

- <https://github.com/ROCMSoftwarePlatform/rocHPCG>

etc.

# AMD SOFTWARE FOR HPC AND AI

## HIP PROGRAMMING

# GPU ACCELERATION

## HOST AND DEVICE

### the host is the CPU

- host code runs here
- usual C++ syntax and features
- entry point is the “main” function
- use the HIP API to
  - create device buffers
  - moved data between host and device
  - launch device code



### the device is the GPU

- device code runs here
- C/C++ syntax and features
- device code is launched as “kernels”
- instructions from the host are sent to streams



# FUNCTION QUALIFIERS

## HOST AND DEVICE

### **\_\_global\_\_**

- “kernels”
- execute the GPU
- can be called from the CPU

### **\_\_device\_\_**

- execute the GPU
- can be called from device code (kernels or a **\_\_device\_\_** functions)

### **\_\_host\_\_ \_\_device\_\_**

- executes on the CPU when called from CPU code
- executes on the GPU when called from GPU code

# HIP KERNEL LANGUAGE

## GPU CODE

in 2D



in 2D

- each colored box is a block
- each block has an index - blockIdx.[xyz]
- each small square is a thread
- each thread has a 2D index - threadIdx.[xyz]
- grid dimensions in - blockDim.[xyz]

# HIP KERNEL LANGUAGE

## GPU CODE

- all local variables and arrays are thread-private
- threads can exchange data through shared memory (LDS)
- declare using the `__shared__` keyword
- use `__syncthreads()` to synchronize

in 2D



# HIP KERNEL LANGUAGE

## GPU CODE

### saxpy loop

- two 1D arrays
- the  $y[i] += a*x[i]$  operation
- mapped to 1D grid of threads/blocks
- each thread takes on index

```
1  #include <cuda.h>
2
3  __constant__ float a = 2.0f;
4
5  __global__
6  void saxpy(int n, float const* x, float* y)
7  {
8      int i = blockDim.x*blockIdx.x + threadIdx.x;
9      if (i < n)
10         y[i] += a*x[i];
11 }
```

# HIP API

## MEMORY MANAGEMENT

- GPU operates on GPU memory
- need to allocate GPU memory
- need to copy data between the CPU memory and the GPU memory

```
hipError_t hipMalloc (void **ptr, size_t size)
```

```
hipError_t hipFree (void *ptr)
```

Free memory allocated by the hcc hip memory allocation API. This API performs an implicit [hipDeviceSynchronize\(\)](#) call. If pointer is NULL, the hip runtime is initialized and hipSuccess is returned. [More...](#)

```
hipError_t hipMemcpy (void *dst, const void *src, size_t sizeBytes, hipMemcpyKind kind)
```

Copy data from src to dst. [More...](#)

# HIP API

## ERROR HANDLING

- check last error
- get error name
- get error string

```
hipError_t hipGetLastError (void)
```

Return last error returned by any HIP runtime API call and resets the stored error code to **hipSuccess**.  
[More...](#)

```
hipError_t hipPeekAtLastError (void)
```

Return last error returned by any HIP runtime API call. [More...](#)

```
const char * hipGetErrorName (hipError_t hip_error)
```

Return hip error as text string form. [More...](#)

```
const char * hipGetErrorString (hipError_t hipError)
```

Return handy text string message to explain the error which occurred. [More...](#)

# HIP API

## DEVICE MANAGEMENT

- check number of devices
- switch devices
- synchronize devices

```
hipError_t hipDeviceSynchronize (void)
```

Waits on all active streams on current device. [More...](#)

```
hipError_t hipDeviceReset (void)
```

The state of current device is discarded and updated to a fresh state. [More...](#)

```
hipError_t hipSetDevice (int deviceId)
```

Set default device to be used for subsequent hip API calls from this thread. [More...](#)

```
hipError_t hipGetDevice (int *deviceId)
```

Return the default device id for the calling host thread. [More...](#)

```
hipError_t hipGetDeviceCount (int *count)
```

Return number of compute-capable devices. [More...](#)

# HIP API

## STREAM MANAGEMENT

- create stream
- destroy stream
- synchronize stream
- etc.
- etc.
- etc.

```
hipError_t hipStreamCreate (hipStream_t *stream)
```

Create an asynchronous stream. [More...](#)

```
hipError_t hipStreamDestroy (hipStream_t stream)
```

Destroys the specified stream. [More...](#)

```
hipError_t hipStreamSynchronize (hipStream_t stream)
```

Wait for all commands in stream to complete. [More...](#)

# STREAMS

- Suppose we have 4 small kernels to execute:

```
hipLaunchKernelGGL(myKernel1, dim3(1), dim3(256), 0, 0, 256, d_a1);  
hipLaunchKernelGGL(myKernel2, dim3(1), dim3(256), 0, 0, 256, d_a2);  
hipLaunchKernelGGL(myKernel3, dim3(1), dim3(256), 0, 0, 256, d_a3);  
hipLaunchKernelGGL(myKernel4, dim3(1), dim3(256), 0, 0, 256, d_a4);
```

- Even though these kernels use only one block each, they'll execute in serial on the NULL stream:



# STREAMS

- With streams we can effectively share the GPU's compute resources:

```
hipLaunchKernelGGL(myKernel1, dim3(1), dim3(256), 0, stream1, 256, d_a1);
hipLaunchKernelGGL(myKernel2, dim3(1), dim3(256), 0, stream2, 256, d_a2);
hipLaunchKernelGGL(myKernel3, dim3(1), dim3(256), 0, stream3, 256, d_a3);
hipLaunchKernelGGL(myKernel4, dim3(1), dim3(256), 0, stream4, 256, d_a4);
```

|             |           |
|-------------|-----------|
| NULL Stream |           |
| Stream1     | myKernel1 |
| Stream2     | myKernel2 |
| Stream3     | myKernel3 |
| Stream4     | myKernel4 |

Note 1: Kernels must modify different parts of memory to avoid data races.

Note 2: With large kernels, overlapping computations may not help performance.

# STREAMS

- There is another use for streams besides concurrent kernels:
  - Overlapping kernels with data movement.
- AMD GPUs have separate engines for:
  - Host->Device memcpys
  - Device->Host memcpys
  - Compute kernels.
- These three different operations can overlap without dividing the GPU's resources.
  - The overlapping operations should be in separate, non-NULL, streams.
  - The host memory should be **pinned**.

# STREAMS

Suppose we have 3 kernels which require moving data to and from the device:

```
hipMemcpy(d_a1, h_a1, Nbytes, hipMemcpyHostToDevice));
```

```
hipMemcpy(d_a2, h_a2, Nbytes, hipMemcpyHostToDevice));
```

```
hipMemcpy(d_a3, h_a3, Nbytes, hipMemcpyHostToDevice));
```

```
hipLaunchKernelGGL(myKernel1, blocks, threads, 0, 0, N, d_a1);
```

```
hipLaunchKernelGGL(myKernel2, blocks, threads, 0, 0, N, d_a2);
```

```
hipLaunchKernelGGL(myKernel3, blocks, threads, 0, 0, N, d_a3);
```

```
hipMemcpy(h_a1, d_a1, Nbytes, hipMemcpyDeviceToHost);
```

```
hipMemcpy(h_a2, d_a2, Nbytes, hipMemcpyDeviceToHost);
```

```
hipMemcpy(h_a3, d_a3, Nbytes, hipMemcpyDeviceToHost);
```



# STREAMS

Changing to asynchronous memcpys and using streams:

```
hipMemcpyAsync(d_a1, h_a1, Nbytes, hipMemcpyHostToDevice, stream1);
hipMemcpyAsync(d_a2, h_a2, Nbytes, hipMemcpyHostToDevice, stream2);
hipMemcpyAsync(d_a3, h_a3, Nbytes, hipMemcpyHostToDevice, stream3);
```

```
hipLaunchKernelGGL(myKernel1, blocks, threads, 0, stream1, N, d_a1);
hipLaunchKernelGGL(myKernel2, blocks, threads, 0, stream2, N, d_a2);
hipLaunchKernelGGL(myKernel3, blocks, threads, 0, stream3, N, d_a3);
```

```
hipMemcpyAsync(h_a1, d_a1, Nbytes, hipMemcpyDeviceToHost, stream1);
hipMemcpyAsync(h_a2, d_a2, Nbytes, hipMemcpyDeviceToHost, stream2);
hipMemcpyAsync(h_a3, d_a3, Nbytes, hipMemcpyDeviceToHost, stream3);
```

| NULL Stream |       |           |           |           |       |  |
|-------------|-------|-----------|-----------|-----------|-------|--|
| Stream1     | HToD1 | myKernel1 | DTоХ1     |           |       |  |
| Stream2     |       | HToD2     | myKernel2 | DTоХ2     |       |  |
| Stream3     |       |           | HToD3     | myKernel3 | DTоХ3 |  |

# AMD LINGO

## CUDA lingo

block



## AMD lingo

work group

thread



work item

warp



wavefront

```
1 #include <cuda.h>
2
3 __constant__ float a = 2.0f;
4
5 __global__
6 void saxpy(int n, float const* x, float* y)
7 {
8     int i = blockDim.x*blockIdx.x + threadIdx.x;
9     if (i < n)
10        y[i] += a*x[i];
11 }
```

## SIMPLE SAXPY KERNEL

- vector addition kernel in CUDA
- each thread takes one array index
- and performs one multiply-and-add operation

# ADDING THE CPU CODE

```
1 #include <cuda.h>
2
3 __constant__ float a = 2.0f;
4
5 __global__
6 void saxpy(int n, float const* x, float* y)
7 {
8     int i = blockDim.x*blockIdx.x + threadIdx.x;
9     if (i < n)
10        y[i] += a*x[i];
11 }
12
13 int main()
14 {
15     int n = 256;
16     std::size_t size = sizeof(float)*n;
17
18     float* d_x;
19     float* d_y;
20     cudaMalloc(&d_x, size); ← allocate arrays in device memory
21     cudaMalloc(&d_y, size);
22
23     int num_blocks = 2; ← set up the grid
24     int num_threads = 128; ←
25     saxpy<<<num_blocks, num_threads>>>(n, d_x, d_y); ← launch the kernel
26     cudaDeviceSynchronize();
27 }
28 }
```

# ADDING HOST↔DEVICE COPIES

```
1 #include <cuda.h>
2
3 __constant__ float a = 2.0f;
4
5 __global__
6 void saxpy(int n, float const* x, float* y)
7 {
8     int i = blockDim.x*blockIdx.x + threadIdx.x;
9     if (i < n)
10        y[i] += a*x[i];
11 }
12
13 int main()
14 {
15     int n = 256;
16     std::size_t size = sizeof(float)*n;
17
18     float* h_x = (float*)malloc(size);
19     float* h_y = (float*)malloc(size); ← allocate arrays in host memory
20
21     float* d_x;
22     float* d_y;
23     cudaMalloc(&d_x, size);
24     cudaMalloc(&d_y, size);
25
26     cudaMemcpy(d_x, h_x, size, cudaMemcpyHostToDevice);
27     cudaMemcpy(d_y, h_y, size, cudaMemcpyHostToDevice); ← copy content to device memory
28
29     int num_blocks = 2;
30     int num_threads = 128;
31     saxpy<<<num_blocks, num_threads>>>(n, d_x, d_y);
32
33     cudaMemcpy(h_y, d_y, size, cudaMemcpyDeviceToHost);
34     cudaDeviceSynchronize(); ← copy results back to host memory
35
36 }
```

# ADDING MEMORY CLEANUP

```
1 #include <cuda.h>
2
3 __constant__ float a = 2.0f;
4
5 __global__
6 void saxpy(int n, float const* x, float* y)
7 {
8     int i = blockDim.x*blockIdx.x + threadIdx.x;
9     if (i < n)
10         y[i] += a*x[i];
11 }
12
13 int main()
14 {
15     int n = 256;
16     std::size_t size = sizeof(float)*n;
17
18     float* h_x = (float*)malloc(size);
19     float* h_y = (float*)malloc(size);
20
21     float* d_x;
22     float* d_y;
23     cudaMalloc(&d_x, size);
24     cudaMalloc(&d_y, size);
25
26     cudaMemcpy(d_x, h_x, size, cudaMemcpyHostToDevice);
27     cudaMemcpy(d_y, h_y, size, cudaMemcpyHostToDevice);
28
29     int num_blocks = 2;
30     int num_threads = 128;
31     saxpy<<<num_blocks, num_threads>>>(n, d_x, d_y);
32
33     cudaMemcpy(h_y, d_y, size, cudaMemcpyDeviceToHost);
34     cudaDeviceSynchronize();
35
36     cudaFree(d_x); ← free arrays in device memory
37     cudaFree(d_y);
38
39     free(h_x); ← free arrays in host memory
40     free(h_y);
41 }
42 }
```

free arrays in device memory

free arrays in host memory

# ADDING ERROR CHECKS

```

1 #include <cuda.h>
2 #include <cassert>
3
4 __constant__ float a = 2.0f;
5
6 __global__
7 void saxpy(int n, float const* x, float* y)
8 {
9     int i = blockDim.x*blockIdx.x + threadIdx.x;
10    if (i < n)
11        y[i] += a*x[i];
12 }
13
14 #define CHECK(call) assert(call == cudaSuccess) ← simple error checking macro
15
16 int main()
17 {
18     int n = 256;
19     std::size_t size = sizeof(float)*n;
20
21     float* h_x = (float*)malloc(size);
22     float* h_y = (float*)malloc(size);
23     assert(h_x != nullptr);
24     assert(h_y != nullptr);
25
26     float* d_x;
27     float* d_y;
28     CHECK(cudaMalloc(&d_x, size));
29     CHECK(cudaMalloc(&d_y, size));
30
31     CHECK(cudaMemcpy(d_x, h_x, size, cudaMemcpyHostToDevice));
32     CHECK(cudaMemcpy(d_y, h_y, size, cudaMemcpyHostToDevice));
33
34     int num_blocks = 2;
35     int num_threads = 128;
36     saxpy<<<num_blocks, num_threads>>>(n, d_x, d_y);
37
38     CHECK(cudaMemcpy(h_y, d_y, size, cudaMemcpyDeviceToHost));
39     CHECK(cudaDeviceSynchronize());
40
41     CHECK(cudaFree(d_x));
42     CHECK(cudaFree(d_y));
43
44     free(h_x);
45     free(h_y);
46 }
47

```

## simple CUDA code

```
1 #include <cuda.h>
2 #include <cassert>
3
4 __constant__ float a = 2.0f;
5
6 __global__
7 void saxpy(int n, float const* x, float* y)
8 {
9     int i = blockDim.x*blockIdx.x + threadIdx.x;
10    if (i < n)
11        y[i] += a*x[i];
12 }
13
14 #define CHECK(call) assert(call == cudaSuccess)
15
16 int main()
17 {
18     int n = 256;
19     std::size_t size = sizeof(float)*n;
20
21     float* h_x = (float*)malloc(size);
22     float* h_y = (float*)malloc(size);
23     assert(h_x != nullptr);
24     assert(h_y != nullptr);
25
26     float* d_x;
27     float* d_y;
28     CHECK(cudaMalloc(&d_x, size));
29     CHECK(cudaMalloc(&d_y, size));
30
31     CHECK(cudaMemcpy(d_x, h_x, size, cudaMemcpyHostToDevice));
32     CHECK(cudaMemcpy(d_y, h_y, size, cudaMemcpyHostToDevice));
33
34     int num_blocks = 2;
35     int num_threads = 128;
36     saxpy<<<num_blocks, num_threads>>>(n, d_x, d_y);
37
38     CHECK(cudaMemcpy(h_y, d_y, size, cudaMemcpyDeviceToHost));
39     CHECK(cudaDeviceSynchronize());
40
41     CHECK(cudaFree(d_x));
42     CHECK(cudaFree(d_y));
43
44     free(h_x);
45     free(h_y);
46 }
47
```

## simple CUDA code

```

1  #include <cuda.h>
2  #include <cassert>
3
4  __constant__ float a = 2.0f;
5
6  __global__
7  void saxpy(int n, float const* x, float* y)
8  {
9      int i = blockDim.x*blockIdx.x + threadIdx.x;
10     if (i < n)
11         y[i] += a*x[i];
12 }
13
14 #define CHECK(call) assert(call == cudaSuccess)
15
16 int main()
17 {
18     int n = 256;
19     std::size_t size = sizeof(float)*n;
20
21     float* h_x = (float*)malloc(size);
22     float* h_y = (float*)malloc(size);
23     assert(h_x != nullptr);
24     assert(h_y != nullptr);
25
26     float* d_x;
27     float* d_y;
28     CHECK(cudaMalloc(&d_x, size));
29     CHECK(cudaMalloc(&d_y, size));
30
31     CHECK(cudaMemcpy(d_x, h_x, size, cudaMemcpyHostToDevice));
32     CHECK(cudaMemcpy(d_y, h_y, size, cudaMemcpyHostToDevice));
33
34     int num_blocks = 2;
35     int num_threads = 128;
36     saxpy<<<num_blocks, num_threads>>>(n, d_x, d_y);
37
38     CHECK(cudaMemcpy(h_y, d_y, size, cudaMemcpyDeviceToHost));
39     CHECK(cudaDeviceSynchronize());
40
41     CHECK(cudaFree(d_x));
42     CHECK(cudaFree(d_y));
43
44     free(h_x);
45     free(h_y);
46 }
47

```

## same code in HIP

```

1  #include <hip/hip_runtime.h>
2  #include <cassert>
3
4  __constant__ float a = 2.0f;
5
6  __global__
7  void saxpy(int n, float const* x, float* y)
8  {
9      int i = blockDim.x*blockIdx.x + threadIdx.x;
10     if (i < n)
11         y[i] += a*x[i];
12 }
13
14 #define CHECK(call) assert(call == hipSuccess)
15
16 int main()
17 {
18     int n = 256;
19     std::size_t size = sizeof(float)*n;
20
21     float* h_x = (float*)malloc(size);
22     float* h_y = (float*)malloc(size);
23     assert(h_x != nullptr);
24     assert(h_y != nullptr);
25
26     float* d_x;
27     float* d_y;
28     CHECK(hipMalloc(&d_x, size));
29     CHECK(hipMalloc(&d_y, size));
30
31     CHECK(hipMemcpy(d_x, h_x, size, hipMemcpyHostToDevice));
32     CHECK(hipMemcpy(d_y, h_y, size, hipMemcpyHostToDevice));
33
34     int num_blocks = 2;
35     int num_threads = 128;
36     saxpy<<<num_blocks, num_threads>>>(n, d_x, d_y);
37
38     CHECK(hipMemcpy(h_y, d_y, size, hipMemcpyDeviceToHost));
39     CHECK(hipDeviceSynchronize());
40
41     CHECK(hipFree(d_x));
42     CHECK(hipFree(d_y));
43
44     free(h_x);
45     free(h_y);
46 }
47

```

spot the differences

## simple CUDA code

```

1  #include <cuda.h>
2  #include <cassert>
3
4  __constant__ float a = 2.0f;
5
6  __global__
7  void saxpy(int n, float const* x, float* y)
8  {
9      int i = blockDim.x*blockIdx.x + threadIdx.x;
10     if (i < n)
11         y[i] += a*x[i];
12 }
13
14 #define CHECK(call) assert(call == cudaSuccess)
15
16 int main()
17 {
18     int n = 256;
19     std::size_t size = sizeof(float)*n;
20
21     float* h_x = (float*)malloc(size);
22     float* h_y = (float*)malloc(size);
23     assert(h_x != nullptr);
24     assert(h_y != nullptr);
25
26     float* d_x;
27     float* d_y;
28     CHECK(cudaMalloc(&d_x, size));
29     CHECK(cudaMalloc(&d_y, size));
30
31     CHECK(cudaMemcpy(d_x, h_x, size, cudaMemcpyHostToDevice));
32     CHECK(cudaMemcpy(d_y, h_y, size, cudaMemcpyHostToDevice));
33
34     int num_blocks = 2;
35     int num_threads = 128;
36     saxpy<<<num_blocks, num_threads>>>(n, d_x, d_y);
37
38     CHECK(cudaMemcpy(h_y, d_y, size, cudaMemcpyDeviceToHost));
39     CHECK(cudaDeviceSynchronize());
40
41     CHECK(cudaFree(d_x));
42     CHECK(cudaFree(d_y));
43
44     free(h_x);
45     free(h_y);
46 }
47

```

## same code in HIP

```

1  #include <hip/hip_runtime.h>
2  #include <cassert>
3
4  __constant__ float a = 2.0f;
5
6  __global__
7  void saxpy(int n, float const* x, float* y)
8  {
9      int i = blockDim.x*blockIdx.x + threadIdx.x;
10     if (i < n)
11         y[i] += a*x[i];
12 }
13
14 #define CHECK(call) assert(call == hipSuccess)
15
16 int main()
17 {
18     int n = 256;
19     std::size_t size = sizeof(float)*n;
20
21     float* h_x = (float*)malloc(size);
22     float* h_y = (float*)malloc(size);
23     assert(h_x != nullptr);
24     assert(h_y != nullptr);
25
26     float* d_x;
27     float* d_y;
28     CHECK(hipMalloc(&d_x, size));
29     CHECK(hipMalloc(&d_y, size));
30
31     CHECK(hipMemcpy(d_x, h_x, size, hipMemcpyHostToDevice));
32     CHECK(hipMemcpy(d_y, h_y, size, hipMemcpyHostToDevice));
33
34     int num_blocks = 2;
35     int num_threads = 128;
36     saxpy<<<num_blocks, num_threads>>>(n, d_x, d_y);
37
38     CHECK(hipMemcpy(h_y, d_y, size, hipMemcpyDeviceToHost));
39     CHECK(hipDeviceSynchronize());
40
41     CHECK(hipFree(d_x));
42     CHECK(hipFree(d_y));
43
44     free(h_x);
45     free(h_y);
46 }
47

```

# HIPIFY TOOLS

## **hipify-clang**

- compiler (clang) based translator
- handles very complex constructs
- prints an error if not able to translate
- supports clang options
- requires CUDA

<https://github.com/ROCM-Developer-Tools/HIPIFY>

## **hipify-perl**

- Perl® script
- relies on regular expressions
- may struggle with complex constructs
- does not require CUDA

```

1 #include <cuda.h>
2 #include <cassert>
3
4 __constant__ float a = 2.0f;
5
6 __global__
7 void saxpy(int n, float const* x, float* y)
8 {
9     int i = blockDim.x*blockIdx.x + threadIdx.x;
10    if (i < n)
11        y[i] += a*x[i];
12 }
13
14 #define CHECK(call) assert(call == cudaSuccess)
15
16 int main()
17 {
18     int n = 256;
19     std::size_t size = sizeof(float)*n;
20
21     float* h_x = (float*)malloc(size);
22     float* h_y = (float*)malloc(size);
23     assert(h_x != nullptr);
24     assert(h_y != nullptr);
25
26     float* d_x;
27     float* d_y;
28     CHECK(cudaMalloc(&d_x, size));
29     CHECK(cudaMalloc(&d_y, size));
30
31     CHECK(cudaMemcpy(d_x, h_x, size, cudaMemcpyHostToDevice));
32     CHECK(cudaMemcpy(d_y, h_y, size, cudaMemcpyHostToDevice));
33
34     int num_blocks = 2;
35     int num_threads = 128;
36     saxpy<<<num_blocks, num_threads>>>(n, d_x, d_y);
37
38     CHECK(cudaMemcpy(h_y, d_y, size, cudaMemcpyDeviceToHost));
39     CHECK(cudaDeviceSynchronize());
40
41     CHECK(cudaFree(d_x));
42     CHECK(cudaFree(d_y));
43
44     free(h_x);
45     free(h_y);
46 }
47

```

```

saxpy$ perl /opt/rocm/bin/hipify-perl -examin saxpy.cu

[HIPIFY] info: file 'saxpy.cu' statistics:
CONVERTED refs count: 13
TOTAL lines of code: 46
WARNINGS: 0
[HIPIFY] info: CONVERTED refs by names:
cuda.h => hip/hip_runtime.h: 1
cudaDeviceSynchronize => hipDeviceSynchronize: 1
cudaFree => hipFree: 2
cudaMalloc => hipMalloc: 2
cudaMemcpy => hipMemcpy: 3
cudaMemcpyDeviceToHost => hipMemcpyDeviceToHost: 1
cudaMemcpyHostToDevice => hipMemcpyHostToDevice: 2
cudaSuccess => hipSuccess: 1
saxpy$ █

```

## hipify-perl

### hipify-perl -examin

- for initial assessment
- no replacements done
- prints basic statistics and the number of replacements

```

1 #include <cuda.h>
2 #include <cassert>
3
4 __constant__ float a = 2.0f;
5
6 __global__
7 void saxpy(int n, float const* x, float* y)
8 {
9     int i = blockDim.x*blockIdx.x + threadIdx.x;
10    if (i < n)
11        y[i] += a*x[i];
12 }
13
14 #define CHECK(call) assert(call == cudaSuccess)
15
16 int main()
17 {
18     int n = 256;
19     std::size_t size = sizeof(float)*n;
20
21     float* h_x = (float*)malloc(size);
22     float* h_y = (float*)malloc(size);
23     assert(h_x != nullptr);
24     assert(h_y != nullptr);
25
26     float* d_x;
27     float* d_y;
28     CHECK(cudaMalloc(&d_x, size));
29     CHECK(cudaMalloc(&d_y, size));
30
31     CHECK(cudaMemcpy(d_x, h_x, size, cudaMemcpyHostToDevice));
32     CHECK(cudaMemcpy(d_y, h_y, size, cudaMemcpyHostToDevice));
33
34     int num_blocks = 2;
35     int num_threads = 128;
36     saxpy<<<num_blocks, num_threads>>>(n, d_x, d_y);
37
38     CHECK(cudaMemcpy(h_y, d_y, size, cudaMemcpyDeviceToHost));
39     CHECK(cudaDeviceSynchronize());
40
41     CHECK(cudaFree(d_x));
42     CHECK(cudaFree(d_y));
43
44     free(h_x);
45     free(h_y);
46 }
47

```

```

saxpy$ perl /opt/rocm/bin/hipify-perl saxpy.cu
#include "hip/hip_runtime.h"
#include <hip/hip_runtime.h>
#include <cassert>

__constant__ float a = 2.0f;

__global__
void saxpy(int n, float const* x, float* y)
{
    int i = blockDim.x*blockIdx.x + threadIdx.x;
    if (i < n)
        y[i] += a*x[i];
}

#define CHECK(call) assert(call == hipSuccess)

int main()
{
    int n = 256;
    std::size_t size = sizeof(float)*n;

    float* h_x = (float*)malloc(size);
    float* h_y = (float*)malloc(size);
    assert(h_x != nullptr);
    assert(h_y != nullptr);

    float* d_x;
    float* d_y;
    CHECK(hipMalloc(&d_x, size));
    CHECK(hipMalloc(&d_y, size));

    CHECK(hipMemcpy(d_x, h_x, size, hipMemcpyHostToDevice));
    CHECK(hipMemcpy(d_y, h_y, size, hipMemcpyHostToDevice));

    int num_blocks = 2;
    int num_threads = 128;
    saxpy<<<num_blocks, num_threads>>>(n, d_x, d_y);

    CHECK(hipMemcpy(h_y, d_y, size, hipMemcpyDeviceToHost));
    CHECK(hipDeviceSynchronize());

    CHECK(hipFree(d_x));
    CHECK(hipFree(d_y));

    free(h_x);
    free(h_y);
}
saxpy$ █

```

# hipify-perl

translating a file  
to standard  
output

but can also

- translate in place
- preserve orig copy
- recursively do folders

```

1 #include <hip/hip_runtime.h>
2 #include <cassert>
3 #include "cuda2hip.h" ←
4
5 __constant__ float a = 2.0f;
6
7 __global__
8 void saxpy(int n, float const* x, float* y)
9 {
10     int i = blockDim.x*blockIdx.x + threadIdx.x;
11     if (i < n)
12         y[i] += a*x[i];
13 }
14
15 #define CHECK(call) assert(call == cudaSuccess)
16
17 int main()
18 {
19     int n = 256;
20     std::size_t size = sizeof(float)*n;
21
22     float* h_x = (float*)malloc(size);
23     float* h_y = (float*)malloc(size);
24     assert(h_x != nullptr);
25     assert(h_y != nullptr);
26
27     float* d_x;
28     float* d_y;
29     CHECK(cudaMalloc(&d_x, size));
30     CHECK(cudaMalloc(&d_y, size));
31
32     CHECK(cudaMemcpy(d_x, h_x, size, cudaMemcpyHostToDevice));
33     CHECK(cudaMemcpy(d_y, h_y, size, cudaMemcpyHostToDevice));
34
35     int num_blocks = 2;
36     int num_threads = 128;
37     saxpy<<<num_blocks, num_threads>>>(n, d_x, d_y);
38
39     CHECK(cudaMemcpy(h_y, d_y, size, cudaMemcpyDeviceToHost));
40     CHECK(cudaDeviceSynchronize());
41
42     CHECK(cudaFree(d_x));
43     CHECK(cudaFree(d_y));
44
45     free(h_x);
46     free(h_y);
47 }
48

```

```

1 #define cudaSuccess          hipSuccess
2 #define cudaMalloc           hipMalloc
3 #define cudaMemcpy           hipMemcpy
4 #define cudaMemcpyHostToDevice hipMemcpyHostToDevice
5 #define cudaMemcpyDeviceToHost hipMemcpyDeviceToHost
6 #define cudaDeviceSynchronize hipDeviceSynchronize
7 #define cudaFree             hipFree
8

```

## alternatively

- create a file with renaming macros
- include conditionally, depending on target

# OPTIMIZATION TECHNIQUES

## basic

- thread divergence / SIMDization
- reuse in shared memory & bank conflicts
- coalescing of global memory accesses
- resource partitioning / occupancy / spills
- L1, L2 cache blocking
- ...

## advanced

- atomics
- warp primitives
- CPU-GPU coherence
- inter-stream synchronization
- ...

# DIFFERENCES FROM CUDA

- warpSize
  - 64 on AMD
  - 32 on NVIDIA
- dynamic parallelism not supported
- exercise caution:
  - atomics
  - managed memory
  - warp-level primitives
  - inter-process communication

# AMD RESOURCES

## DOCUMENTATION AND TRAINING

# AMD ROCM DEVELOPER HUB

## Engage with ROCm Experts

[Participate in ROCm Webinar Series](#)

[Post questions, view FAQ's in Community Forum](#)

## Increase Understanding

[Purchase ROCm Text Book](#)

[View the latest news in the Blogs](#)

## Get Started Using ROCm

[ROCM Documentation on GitHub](#)

[Download the Latest Version of ROCm](#)

<https://www.amd.com/en/developer/rocm-hub.html>

The screenshot shows the AMD ROCm Developer Hub homepage. At the top, there is a navigation bar with links: Intro to ROCm, Start Using ROCm, Community, Infinity Hub, and Blogs. Below the navigation bar, there is a main content area with a large image of a brain and various data visualizations. A central text block states: "AMD ROCm™ is an open software platform for accelerated compute offering a code once, use everywhere approach. Access all ROCm developer resources here — from documentation, to training webinars, to the latest blogs, and more." Below this text, there are three sections: "Get to Know ROCm™" featuring "Training Webinars" (with a photo of a person writing in a notebook), "ROCM Textbook" (with a photo of a glowing lightbulb over an open book), and "ROCM Docs" (with a photo of a person typing on a laptop). Further down, there is a section titled "Start Using ROCm" with links for "User Guide", "Download Containers", and "Get Started". At the bottom, there is a section titled "Ask the ROCm Community" with a photo of people working at a computer and a "Learn More" button.

# NEW ROCM DOCS

## Comprehensive Coverage

[Compilers and Frameworks](#)

[Math libraries, communication libraries](#)

[Management tools, validation tools](#)

...

## Howto Guides

[Installation](#)

[Tuning](#)

[Debugging](#)

...

<https://rocm.docs.amd.com/>

### ROCM Documentation

#### Home

What is ROCm?

#### Deploy ROCm

Linux Quick Start

Linux Overview

Docker

#### Release Info

Release Notes

Changelog

GPU Support and OS Compatibility  
(Linux)

Known Issues

Compatibility

Licensing Terms

#### APIs and Reference

All Reference Material

HIP

Math Libraries

C++ Primitive Libraries

Communication Libraries

AI Libraries

Computer Vision

OpenMP

Compilers and Tools

Management Tools

Validation Tools

#### Understand ROCm

All Explanation Material

Compiler Disambiguation



## AMD ROCm™ Documentation

Applies to Linux 2023-05-25 3 min read time

[What is ROCm?](#)

[Deploy ROCm](#)

[Release Info](#)

### APIs and Reference

- [Compilers and Development Tools](#)
- [HIP](#)
- [OpenMP](#)
- [Math Libraries](#)
- [C++ Primitives Libraries](#)
- [Communication Libraries](#)
- [AI Libraries](#)
- [Computer Vision](#)
- [Management Tools](#)
- [Validation Tools](#)

### Understand ROCm

- [Compiler Disambiguation](#)
- [Using CMake](#)
- [Linux Folder Structure Reorganization](#)
- [GPU Isolation Techniques](#)
- [GPU Architecture](#)

### How to Guides

- [System Tuning for Various Architectures](#)
- [GPU Aware MPI](#)
- [Setting up for Deep Learning with ROCm](#)
  - [Magma Installation](#)
  - [PyTorch Installation](#)
  - [TensorFlow Installation](#)
- [System Level Debugging](#)

### Tutorials & Examples

- [Examples](#)
- [ML, DL, and AI](#)
  - [Inception V3 with PyTorch](#)
  - [Inference Optimization with MIGraphX](#)

Next >

[What is ROCm?](#)

# HIP TEXTBOOK

## Comprehensive Coverage

- HIP Language
- AMD GPU Internals
- Performance Analysis
- Debugging
- Programming Patterns
- ROCm Libraries
- Porting to HIP
- Multi-GPU Programming
- Third Party Tools
- CDNA Assembly
- ML with ROCm



<https://www.barnesandnoble.com/w/accelerated-computing-with-hip-yifan-sun/1142866934>

# AMD INFINITY HUB

## AMD Instinct™ MI200 SUPPORT

29 key applications & frameworks on Infinity Hub & a catalogue supporting over 90 applications, frameworks & tools

## Accelerating Instinct™ adoption

Over 17000 application pulls. 10000+ since last year

## PERFORMANCE RESULTS

Published Performance Results for Select Apps / Benchmarks

<https://www.amd.com/en/technologies/infinity-hub>

## AMD Infinity Hub

### Categories

- AI & Machine Learning
- Benchmark
- Deep Learning
- Earth Science
- HPC
- Life Science
- Material Science
- Molecular Dynamics
- Oil and Gas
- Physics

### Containers

- Yes
- No





### Amber

Amber is a suite of biomolecular simulation programs. It began in the late 1970's, and is maintained by an...

[MORE INFO](#)



### BabelStream

BabelStream is a synthetic GPU benchmark based on the original STREAM benchmark for CPUs. The...

[MORE INFO](#)

[PULL TAG](#)



### CP2K

CP2K is a quantum chemistry and solid state physics software package that can perform atomistic...



### GROMACS

GROMACS is a versatile package to perform molecular dynamics, i.e. simulate the Newtonian equations of...

# SOFTWARE CATALOG

STRONG MOMENTUM AND INCREASING LIST OF SUPPORTED APPLICATION, LIBRARIES & FRAMEWORKS

| Life Science                                             | Physics                                                                         | Chemistry                                             | CFD                                                                                              | Earth Science                                                                                      |
|----------------------------------------------------------|---------------------------------------------------------------------------------|-------------------------------------------------------|--------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------|
| AMBER<br>GROMACS<br>NAMD<br>LAMMPS<br>Hoomd-Blue<br>VASP | MILC<br>GRID<br>QUANTUM ESPRESSO<br>N-Body<br>CHROMA<br>PIConGPU<br>QuickSilver | CP2K<br>QUADA<br>NWCHEM<br>TERACHEM<br>QMCPACK        | OpenFOAM®<br>AMR-WIND<br>NEKBONE<br>LAGHOS<br>NEKO<br>NEKRS<br>PeleC                             | EXAGO<br>DEVITO<br>OCCA<br>SPECFEM3D-GLOBE<br>SPECFEM3D-CARTESIAN<br>ACECAST (WRF)<br>MPAS<br>ICON |
| Benchmarks                                               | Libraries                                                                       | ML Frameworks                                         | ISV Applications                                                                                 | + MANY MORE                                                                                        |
| HPL<br>HPCG<br>AMG<br>ML - TORCHBENCH<br>ML - SUPERBENCH | AMR-EX<br>Ginkko<br>HYPRE<br>TRILINOS                                           | PYTORCH<br>TENSORFLOW<br>JAX<br>ONNX<br>OPENAI TRITON | ANSYS MECHANICAL<br>CADENCE CHARLES<br>ANSYS FLUENT*<br>SIEMENS® STAR-CCM+*<br>SIEMENS® CALIBRE* |                                                                                                    |

\* Porting/optimization in progress

# AMD LAB NOTES

## Introductory Topics

[ROCM installation](#)

[Basics of HIP programming](#)

...

## Advanced Topics

[Matrix Cores](#)

[Register pressure](#)

[GPU-aware MPI](#)

...

<https://gpuopen.com/learn/amd-lab-notes/>  
<https://github.com/AMD/amd-lab-notes>



The screenshot shows the AMD GPUOpen website with a red header bar. The header includes the AMD GPUOpen logo, a search bar, and social media links. The main navigation menu has options for HOME, SOFTWARE, and DOCS.

The page content is titled "AMD matrix cores". It features a sub-header "AMD lab notes" with a red background, followed by a list of topics including "Finite difference method - Laplacian part 1" through "Part 4", "AMD matrix cores" (which is currently selected), and other items like "MFMA compiler intrinsic syntax" and "Vector (FMA) Unit Performance". Below this is a search bar labeled "Search this manual ...".

On the right side, there is a sidebar with the title "AMD matrix cores" and a detailed description of Matrix Core technology. It mentions its use in HPC applications, its hardware acceleration, and how it can achieve speedups. Below this is another sidebar titled "Vector (FMA) Unit Performance for MI100 and MI250X", which contains a table comparing performance across different data formats.

# DISCLAIMERS

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions, and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. Any computer system has risks of security vulnerabilities that cannot be completely prevented or mitigated. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.

THIS INFORMATION IS PROVIDED ‘AS IS.’ AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS, OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY RELIANCE, DIRECT, INDIRECT, SPECIAL, OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

Third-party content is licensed to you directly by the third party that owns the content and is not licensed to you by AMD. ALL LINKED THIRD-PARTY CONTENT IS PROVIDED “AS IS” WITHOUT A WARRANTY OF ANY KIND. USE OF SUCH THIRD-PARTY CONTENT IS DONE AT YOUR SOLE DISCRETION AND UNDER NO CIRCUMSTANCES WILL AMD BE LIABLE TO YOU FOR ANY THIRD-PARTY CONTENT. YOU ASSUME ALL RISK AND ARE SOLELY RESPONSIBLE FOR ANY DAMAGES THAT MAY ARISE FROM YOUR USE OF THIRD-PARTY CONTENT.

© 2023 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo, AMD CDNA, AMD ROCm, AMD Instinct, and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Other names are for informational purposes only and may be trademarks of their respective owners.

# ATTRIBUTIONS

Docker and the Docker logo are trademarks or registered trademarks of Docker, Inc.

Git and the Git logo are either registered trademarks or trademarks of Software Freedom Conservancy, Inc., corporate home of the Git Project, in the United States and/or other countries.

Intel is a trademark of Intel Corporation or its subsidiaries.

Kubernetes is a registered trademark of The Linux Foundation.

NAMD was developed by the Theoretical Biophysics Group in the Beckman Institute for Advanced Science and Technology at the University of Illinois at Urbana-Champaign. <http://www.ks.uiuc.edu/Research/namd/>

OpenCL is a trademark of Apple Inc. used by permission by Khronos Group, Inc.

OpenFOAM is a registered trademark of OpenCFD Limited, producer and distributor of the OpenFOAM software via [www.openfoam.com](http://www.openfoam.com).

The OpenMP name and the OpenMP logo are registered trademarks of the OpenMP Architecture Review Board.

Perl is a trademark of Perl Foundation.

Siemens is a registered trademark of Siemens Product Lifecycle Management Software Inc., or its subsidiaries or affiliates, in the United States and in other countries.

AMD