



Exploring  
HARPy2

Cabrera  
Chamberlain

Introduction

Preliminaries  
FPGAs  
HARPy2

Path to  
Portability and  
Performance

Basic Kernel  
Design Choices  
Wavefront  
Parallelism  
Hardware Design  
Space  
SVM

Results

HW Design  
Space Search  
Comparison  
SVM  
Performance

Conclusion

# Exploring Portability and Performance of OpenCL FPGA Kernels on HARPy2

**Anthony M. Cabrera, Roger D. Chamberlain**

Washington University in St. Louis

{*acabrera, roger*}@wustl.edu

IWOCL '19

May 14, 2019



# Motivation

## Moore's Law is "Dying"

Exploring HARPv2

Cabrera  
Chamberlain

Introduction

Preliminaries  
FPGAs  
HARPv2

Path to  
Portability and  
Performance

Basic Kernel  
Design Choices  
Wavefront  
Parallelism  
Hardware Design  
Space  
SVM

Results

HW Design  
Space Search  
Comparison  
SVM  
Performance

Conclusion





# Motivation

## Heterogeneous Systems

Exploring  
HARPv2

Cabrera  
Chamberlain

Introduction

Preliminaries  
FPGAs  
HARPv2

Path to  
Portability and  
Performance

Basic Kernel  
Design Choices  
Wavefront  
Parallelism  
Hardware Design  
Space  
SVM

Results

HW Design  
Space Search  
Comparison  
SVM  
Performance

Conclusion



Source:  
Kharya, Forbes 2018



Source:  
Forrest, TechRepublic 2017



# Motivation

## How about FPGAs?

Exploring  
HARPy2

Cabrera  
Chamberlain

Introduction

Preliminaries  
FPGAs  
HARPy2

Path to  
Portability and  
Performance

Basic Kernel  
Design Choices  
Wavefront  
Parallelism  
Hardware Design  
Space  
SVM

Results

HW Design  
Space Search  
Comparison  
SVM  
Performance

Conclusion

Bloomberg

Deals

# Intel's \$16.7 Billion Altera Deal Is Fueled by Data Centers

Source:  
King, Bloomberg 2015



Source:  
Microsoft



# Motivation

## OpenCL to the Rescue!

Exploring  
HARPv2

Cabrera  
Chamberlain

### Introduction

Preliminaries  
FPGAs  
HARPv2

Path to  
Portability and  
Performance

Basic Kernel  
Design Choices  
Wavefront  
Parallelism  
Hardware Design  
Space  
SVM

### Results

HW Design  
Space Search  
Comparison  
SVM  
Performance

### Conclusion





# Motivation

Intel's Hardware Accelerator Research Program (HARP)

Exploring  
HARPy2

Cabrera  
Chamberlain

Introduction

Preliminaries  
FPGAs  
HARPy2

Path to  
Portability and  
Performance

Basic Kernel  
Design Choices  
Wavefront  
Parallelism  
Hardware Design  
Space  
SVM

Results

HW Design  
Space Search  
Comparison  
SVM  
Performance

Conclusion



Source:  
Hemsoth, The Next Platform 2016



# We address the following questions:

Exploring  
HARPy2

Cabrera  
Chamberlain

Introduction

Preliminaries  
FPGAs  
HARPy2

Path to  
Portability and  
Performance

Basic Kernel  
Design Choices  
Wavefront  
Parallelism  
Hardware Design  
Space  
SVM

Results  
HW Design  
Space Search  
Comparison  
SVM  
Performance

Conclusion

- How performant and portable are OpenCL FPGA kernels on the HARPy2 platform?
- What are the hardware knobs we can turn to get the best performance?
- What is the impact of the FPGA sharing the same memory as the CPU on the HARPy system?



# Outline

Exploring  
HARPy2

Cabrera  
Chamberlain

Introduction

Preliminaries  
FPGAs  
HARPy2

Path to  
Portability and  
Performance

Basic Kernel  
Design Choices  
Wavefront  
Parallelism  
Hardware Design  
Space  
SVM

Results

HW Design  
Space Search  
Comparison  
SVM  
Performance

Conclusion

## 1 Introduction

## 2 Preliminaries

- FPGAs
- HARPy2

## 3 Path to Portability and Performance

- Basic Kernel
- Design Choices
- Wavefront Parallelism
- Hardware Design Space
- SVM

## 4 Results

- HW Design Space Search
- Comparison
- SVM Performance

## 5 Conclusion



# What's an FPGA, anyway?

Exploring  
HARPy2

Cabrera  
Chamberlain

Introduction

Preliminaries  
FPGAs  
HARPy2

Path to  
Portability and  
Performance

Basic Kernel  
Design Choices  
Wavefront  
Parallelism  
Hardware Design  
Space  
SVM

Results

HW Design  
Space Search  
Comparison  
SVM  
Performance

Conclusion

Source:  
Intel FPGA SDK for OpenCL Pro Edition Best Practices Guide





# What's an FPGA, anyway?

Exploring  
HARPy2

Cabrera  
Chamberlain

Introduction

Preliminaries  
FPGAs  
HARPy2

Path to  
Portability and  
Performance

Basic Kernel  
Design Choices  
Wavefront  
Parallelism  
Hardware Design  
Space  
SVM

Results

HW Design  
Space Search  
Comparison  
SVM  
Performance

Conclusion

Source:

Intel FPGA SDK for OpenCL Pro Edition Best Practices Guide





# What's an FPGA, anyway?

Exploring  
HARPy2

Cabrera  
Chamberlain

Introduction

Preliminaries  
FPGAs  
HARPy2

Path to  
Portability and  
Performance

Basic Kernel  
Design Choices  
Wavefront  
Parallelism  
Hardware Design  
Space  
SVM

Results

HW Design  
Space Search  
Comparison  
SVM  
Performance

Conclusion



Source:

Intel FPGA SDK for OpenCL Pro Edition Best Practices Guide



# What's an FPGA, anyway?

Exploring  
HARPy2

Cabrera  
Chamberlain

Introduction

Preliminaries  
FPGAs  
HARPy2

Path to  
Portability and  
Performance

Basic Kernel  
Design Choices  
Wavefront  
Parallelism  
Hardware Design  
Space  
SVM

Results

HW Design  
Space Search  
Comparison  
SVM  
Performance

Conclusion





# What's an FPGA, anyway?

Exploring  
HARPy2

Cabrera  
Chamberlain

Introduction

Preliminaries  
FPGAs  
HARPy2

Path to  
Portability and  
Performance

Basic Kernel  
Design Choices  
Wavefront  
Parallelism  
Hardware Design  
Space  
SVM

Results

HW Design  
Space Search  
Comparison  
SVM  
Performance

Conclusion



Source:

Intel FPGA SDK for OpenCL Pro Edition Best Practices Guide



# Discrete FPGA Card

Exploring  
HARPv2

Cabrera  
Chamberlain

Introduction

Preliminaries

FPGAs  
HARPv2

Path to  
Portability and  
Performance

Basic Kernel  
Design Choices

Wavefront  
Parallelism

Hardware Design  
Space  
SVM

Results

HW Design  
Space Search

Comparison  
SVM

Performance

Conclusion





# Intel HARPv2

Exploring  
HARPv2

Cabrera  
Chamberlain

Introduction

Preliminaries  
FPGAs  
HARPv2

Path to  
Portability and  
Performance

Basic Kernel  
Design Choices  
Wavefront  
Parallelism  
Hardware Design  
Space  
SVM

Results  
HW Design  
Space Search  
Comparison  
SVM  
Performance

Conclusion





# Intel HARPv2 (top) vs. Discrete FPGA Card (bottom)

Exploring  
HARPv2

Cabrera  
Chamberlain

Introduction

Preliminaries  
FPGAs  
HARPv2

Path to  
Portability and  
Performance

Basic Kernel  
Design Choices  
Wavefront  
Parallelism  
Hardware Design  
Space  
SVM

Results  
HW Design  
Space Search  
Comparison  
SVM  
Performance

Conclusion





## Exploring HARPv2

Cabrera  
Chamberlain

### Introduction

Preliminaries  
FPGAs  
HARPv2

### Path to Portability and Performance

Basic Kernel  
Design Choices  
Wavefront Parallelism  
Hardware Design Space  
SVM

### Results

HW Design Space Search  
Comparison  
SVM  
Performance

### Conclusion

## 1 Introduction

## 2 Preliminaries

- FPGAs
- HARPv2

## 3 Path to Portability and Performance

- Basic Kernel
- Design Choices
- Wavefront Parallelism
- Hardware Design Space
- SVM

## 4 Results

- HW Design Space Search
- Comparison
- SVM Performance

## 5 Conclusion



# Application Flavor

## Dynamic Programming

Exploring  
HARPy2

Cabrera  
Chamberlain

Introduction

Preliminaries  
FPGAs  
HARPy2

Path to  
Portability and  
Performance

Basic Kernel  
Design Choices  
Wavefront  
Parallelism  
Hardware Design  
Space  
SVM

Results  
HW Design  
Space Search  
Comparison  
SVM  
Performance

Conclusion

out\_mat

|      |      |      |      |
|------|------|------|------|
| Blue | Blue | Blue | Blue |
| Blue | Grey | Grey | Grey |
| Blue | Grey | Grey | Grey |
| Blue | Grey | Grey | Grey |

```
1  __kernel void nw(__global int* ref_mat,
2                      __global int* out_mat,
3                      int num_rows,
4                      int num_cols,
5                      int penalty)
6  {
7      for (int i = 1; i < num_rows; ++i)
8      {
9          for (int j = 1; j < num_cols; ++j)
10         {
11             out_mat[i][j] =
12                 max( out_mat[i-1][j] - penalty,
13                     out_mat[i-1][j-1] + ref_mat[i][j],
14                     out_mat[i][j-1] - penalty );
15         }
16     }
17 }
18 }
```



# Application Flavor

i = 1, j = 1

Exploring  
HARPv2

Cabrera  
Chamberlain

Introduction

Preliminaries

FPGAs  
HARPv2

Path to  
Portability and  
Performance

Basic Kernel  
Design Choices  
Wavefront  
Parallelism  
Hardware Design  
Space  
SVM

Results

HW Design  
Space Search  
Comparison  
SVM  
Performance

Conclusion

out\_mat



```
1 __kernel void nw(__global int* ref_mat,
2                         __global int* out_mat,
3                                 int num_rows,
4                                 int num_cols,
5                                 int penalty)
6 {
7     for (int i = 1; i < num_rows; ++i)
8     {
9         for (int j = 1; j < num_cols; ++j)
10        {
11            out_mat[i][j] =
12                max( out_mat[i-1][j] - penalty,
13                    out_mat[i-1][j-1] + ref_mat[i][j],
14                    out_mat[i][j-1] - penalty );
15        }
16    }
17
18 }
```



# Application Flavor

i = 1, j = 2

Exploring  
HARPv2

Cabrera  
Chamberlain

Introduction

Preliminaries

FPGAs  
HARPv2

Path to  
Portability and  
Performance

Basic Kernel  
Design Choices  
Wavefront  
Parallelism  
Hardware Design  
Space  
SVM

Results

HW Design  
Space Search  
Comparison  
SVM  
Performance

Conclusion

out\_mat



```
1 __kernel void nw(__global int* ref_mat,
2                      __global int* out_mat,
3                           int num_rows,
4                           int num_cols,
5                           int penalty)
6 {
7     for (int i = 1; i < num_rows; ++i)
8     {
9         for (int j = 1; j < num_cols; ++j)
10        {
11            out_mat[i][j] =
12                max( out_mat[i-1][j]      - penalty,
13                    out_mat[i-1][j-1]    + ref_mat[i][j],
14                    out_mat[i][j-1]      - penalty );
15        }
16    }
17
18 }
```



# Application Flavor

i = 1, j = 3

Exploring  
HARPv2

Cabrera  
Chamberlain

Introduction

Preliminaries  
FPGAs  
HARPv2

Path to  
Portability and  
Performance

Basic Kernel  
Design Choices  
Wavefront  
Parallelism  
Hardware Design  
Space  
SVM

Results  
HW Design  
Space Search  
Comparison  
SVM  
Performance

Conclusion

out\_mat



```
1 __kernel void nw(__global int* ref_mat,
2                      __global int* out_mat,
3                           int num_rows,
4                           int num_cols,
5                           int penalty)
6 {
7     for (int i = 1; i < num_rows; ++i)
8     {
9         for (int j = 1; j < num_cols; ++j)
10        {
11            out_mat[i][j] =
12                max( out_mat[i-1][j]      - penalty,
13                    out_mat[i-1][j-1]    + ref_mat[i][j],
14                    out_mat[i][j-1]      - penalty );
15        }
16    }
17
18 }
```



# Application Flavor

$i = 2, j = 1$

Exploring  
HARPv2

Cabrera  
Chamberlain

Introduction

Preliminaries  
FPGAs  
HARPv2

Path to  
Portability and  
Performance

Basic Kernel  
Design Choices  
Wavefront  
Parallelism  
Hardware Design  
Space  
SVM

Results

HW Design  
Space Search  
Comparison  
SVM  
Performance

Conclusion

out\_mat



```
1 __kernel void nw(__global int* ref_mat,
2                         __global int* out_mat,
3                                 int num_rows,
4                                 int num_cols,
5                                 int penalty)
6 {
7     for (int i = 1; i < num_rows; ++i)
8     {
9         for (int j = 1; j < num_cols; ++j)
10        {
11            out_mat[i][j] =
12                max( out_mat[i-1][j] - penalty,
13                    out_mat[i-1][j-1] + ref_mat[i][j],
14                    out_mat[i][j-1] - penalty );
15        }
16    }
17
18 }
```



# Application Flavor

i = 2, j = 2

Exploring  
HARPv2

Cabrera  
Chamberlain

Introduction

Preliminaries  
FPGAs  
HARPv2

Path to  
Portability and  
Performance

Basic Kernel  
Design Choices  
Wavefront  
Parallelism  
Hardware Design  
Space  
SVM

Results

HW Design  
Space Search  
Comparison  
SVM  
Performance

Conclusion

out\_mat



```
1 __kernel void nw(__global int* ref_mat,
2                         __global int* out_mat,
3                                     int num_rows,
4                                     int num_cols,
5                                     int penalty)
6 {
7     for (int i = 1; i < num_rows; ++i)
8     {
9         for (int j = 1; j < num_cols; ++j)
10        {
11            out_mat[i][j] =
12                max( out_mat[i-1][j] - penalty,
13                    out_mat[i-1][j-1] + ref_mat[i][j],
14                    out_mat[i][j-1] - penalty );
15        }
16    }
17
18 }
```



# Application Flavor

$i = 2, j = 3$

Exploring  
HARPv2

Cabrera  
Chamberlain

Introduction

Preliminaries

FPGAs  
HARPv2

Path to  
Portability and  
Performance

Basic Kernel  
Design Choices  
Wavefront  
Parallelism  
Hardware Design  
Space  
SVM

Results

HW Design  
Space Search  
Comparison  
SVM  
Performance

Conclusion

out\_mat



```
1  __kernel void nw(__global int* ref_mat,
2                      __global int* out_mat,
3                           int num_rows,
4                           int num_cols,
5                           int penalty)
6  {
7      for (int i = 1; i < num_rows; ++i)
8      {
9          for (int j = 1; j < num_cols; ++j)
10         {
11             out_mat[i][j] =
12                 max( out_mat[i-1][j] - penalty,
13                     out_mat[i-1][j-1] + ref_mat[i][j],
14                     out_mat[i][j-1] - penalty );
15         }
16     }
17 }
18 }
```



# Application Flavor

i = 3, j = 1

Exploring  
HARPv2

Cabrera  
Chamberlain

Introduction

Preliminaries  
FPGAs  
HARPv2

Path to  
Portability and  
Performance

Basic Kernel  
Design Choices  
Wavefront  
Parallelism  
Hardware Design  
Space  
SVM

Results  
HW Design  
Space Search  
Comparison  
SVM  
Performance

Conclusion

out\_mat



```
1 __kernel void nw(__global int* ref_mat,
2                           __global int* out_mat,
3                           int num_rows,
4                           int num_cols,
5                           int penalty)
6 {
7     for (int i = 1; i < num_rows; ++i)
8     {
9         for (int j = 1; j < num_cols; ++j)
10        {
11            out_mat[i][j] =
12                max( out_mat[i-1][j] - penalty,
13                    out_mat[i-1][j-1] + ref_mat[i][j],
14                    out_mat[i][j-1] - penalty );
15        }
16    }
17
18 }
```



# Application Flavor

i = 3, j = 2

Exploring  
HARPv2

Cabrera  
Chamberlain

Introduction

Preliminaries  
FPGAs  
HARPv2

Path to  
Portability and  
Performance

Basic Kernel  
Design Choices  
Wavefront  
Parallelism  
Hardware Design  
Space  
SVM

Results  
HW Design  
Space Search  
Comparison  
SVM  
Performance

Conclusion

out\_mat



```
1 __kernel void nw(__global int* ref_mat,
2                           __global int* out_mat,
3                           int num_rows,
4                           int num_cols,
5                           int penalty)
6 {
7     for (int i = 1; i < num_rows; ++i)
8     {
9         for (int j = 1; j < num_cols; ++j)
10        {
11            out_mat[i][j] =
12                max( out_mat[i-1][j] - penalty,
13                    out_mat[i-1][j-1] + ref_mat[i][j],
14                    out_mat[i][j-1] - penalty );
15        }
16    }
17
18 }
```



# Application Flavor

i = 3, j = 3

Exploring  
HARPv2

Cabrera  
Chamberlain

Introduction

Preliminaries  
FPGAs  
HARPv2

Path to  
Portability and  
Performance

Basic Kernel  
Design Choices  
Wavefront  
Parallelism  
Hardware Design  
Space  
SVM

Results

HW Design  
Space Search  
Comparison  
SVM  
Performance

Conclusion

out\_mat



```
1 __kernel void nw(__global int* ref_mat,
2                      __global int* out_mat,
3                           int num_rows,
4                           int num_cols,
5                           int penalty)
6 {
7     for (int i = 1; i < num_rows; ++i)
8     {
9         for (int j = 1; j < num_cols; ++j)
10        {
11            out_mat[i][j] =
12                max( out_mat[i-1][j]      - penalty,
13                     out_mat[i-1][j-1] + ref_mat[i][j] ,
14                     out_mat[i][j-1]      - penalty );
15        }
16    }
17
18 }
```



# Design Choices

## for authoring OpenCL FPGA kernels

Exploring  
HARPv2

Cabrera  
Chamberlain

Introduction

Preliminaries  
FPGAs  
HARPv2

Path to  
Portability and  
Performance

Basic Kernel  
Design Choices  
Wavefront  
Parallelism  
Hardware Design  
Space  
SVM

Results

HW Design  
Space Search  
Comparison  
SVM  
Performance

Conclusion

- Width vs Depth

Multiple Work Item



Single Work Item





# Design Choices

## for authoring OpenCL FPGA kernels

Exploring  
HARPy2

Cabrera  
Chamberlain

Introduction

Preliminaries  
FPGAs  
HARPy2

Path to  
Portability and  
Performance

Basic Kernel  
Design Choices  
Wavefront  
Parallelism  
Hardware Design  
Space  
SVM

Results

HW Design  
Space Search  
Comparison  
SVM  
Performance

Conclusion

### Compiler Directives

- `reqd_work_group_size(X, Y, Z)`
- `num_simd_work_items(NUM)`
- `#pragma ivdep`  
(ignore vector dependences)
- `#pragma unroll`



# Design Choices

## for authoring OpenCL FPGA kernels

Exploring  
HARPv2

Cabrera  
Chamberlain

Introduction

Preliminaries  
FPGAs  
HARPv2

Path to  
Portability and  
Performance

Basic Kernel  
Design Choices  
Wavefront  
Parallelism  
Hardware Design  
Space  
SVM

Results

HW Design  
Space Search  
Comparison  
SVM  
Performance

Conclusion

- Expressing performant FPGA constructs in High Level Language





# Wavefront Parallelism

## Multiple Work Item {Inter, Intra}-Work Group

Exploring  
HARPv2

Cabrera  
Chamberlain

Introduction

Preliminaries  
FPGAs  
HARPv2

Path to  
Portability and  
Performance

Basic Kernel  
Design Choices

Wavefront  
Parallelism

Hardware Design  
Space  
SVM

Results

HW Design  
Space Search  
Comparison  
SVM  
Performance

Conclusion





# Wavefront Parallelism

## Multiple Work Item {Inter, Intra}-Work Group

Exploring  
HARPv2

Cabrera  
Chamberlain

Introduction

Preliminaries  
FPGAs  
HARPv2

Path to  
Portability and  
Performance

Basic Kernel  
Design Choices

Wavefront  
Parallelism

Hardware Design  
Space  
SVM

Results

HW Design  
Space Search  
Comparison  
SVM  
Performance

Conclusion





# Wavefront Parallelism

## Multiple Work Item {Inter, Intra}-Work Group

Exploring  
HARPy2

Cabrera  
Chamberlain

Introduction

Preliminaries  
FPGAs  
HARPy2

Path to  
Portability and  
Performance

Basic Kernel  
Design Choices

Wavefront  
Parallelism

Hardware Design  
Space  
SVM

Results

HW Design  
Space Search  
Comparison  
SVM  
Performance

Conclusion





# Wavefront Parallelism

## Multiple Work Item {Inter, Intra}-Work Group

Exploring  
HARPv2

Cabrera  
Chamberlain

Introduction

Preliminaries  
FPGAs  
HARPv2

Path to  
Portability and  
Performance

Basic Kernel  
Design Choices

Wavefront  
Parallelism

Hardware Design  
Space  
SVM

Results

HW Design  
Space Search  
Comparison  
SVM  
Performance

Conclusion





# Wavefront Parallelism

Multiple Work Item {Inter, Intra}-Work Group

Exploring  
HARPv2

Cabrera  
Chamberlain

Introduction

Preliminaries  
FPGAs  
HARPv2

Path to  
Portability and  
Performance

Basic Kernel  
Design Choices

Wavefront  
Parallelism

Hardware Design  
Space  
SVM

Results

HW Design  
Space Search  
Comparison  
SVM  
Performance

Conclusion





# Wavefront Parallelism

Multiple Work Item {Inter, Intra}-Work Group

Exploring  
HARPv2

Cabrera  
Chamberlain

Introduction

Preliminaries  
FPGAs  
HARPv2

Path to  
Portability and  
Performance

Basic Kernel  
Design Choices

Wavefront  
Parallelism

Hardware Design  
Space  
SVM

Results

HW Design  
Space Search  
Comparison  
SVM  
Performance

Conclusion





# Wavefront Parallelism

Multiple Work Item {Inter, Intra}-Work Group

Exploring  
HARPv2

Cabrera  
Chamberlain

Introduction

Preliminaries  
FPGAs  
HARPv2

Path to  
Portability and  
Performance

Basic Kernel  
Design Choices

Wavefront  
Parallelism

Hardware Design  
Space  
SVM

Results

HW Design  
Space Search  
Comparison  
SVM  
Performance

Conclusion





# Wavefront Parallelism

Multiple Work Item {Inter, Intra}-Work Group

Exploring  
HARPv2

Cabrera  
Chamberlain

Introduction

Preliminaries  
FPGAs  
HARPv2

Path to  
Portability and  
Performance

Basic Kernel  
Design Choices

Wavefront  
Parallelism

Hardware Design  
Space  
SVM

Results

HW Design  
Space Search  
Comparison  
SVM  
Performance

Conclusion





# Wavefront Parallelism

Multiple Work Item {Inter, Intra}-Work Group

Exploring  
HARPv2

Cabrera  
Chamberlain

Introduction

Preliminaries  
FPGAs  
HARPv2

Path to  
Portability and  
Performance

Basic Kernel  
Design Choices

Wavefront  
Parallelism

Hardware Design  
Space  
SVM

Results

HW Design  
Space Search  
Comparison  
SVM  
Performance

Conclusion





# Wavefront Parallelism

Multiple Work Item {Inter, Intra}-Work Group

Exploring  
HARPV2

Cabrera  
Chamberlain

Introduction

Preliminaries  
FPGAs  
HARPV2

Path to  
Portability and  
Performance

Basic Kernel  
Design Choices

Wavefront  
Parallelism

Hardware Design  
Space  
SVM

Results

HW Design  
Space Search  
Comparison  
SVM  
Performance

Conclusion





# Wavefront Parallelism

## Multiple Work Item {Inter, Intra}-Work Group

Exploring  
HARPy2

Cabrera  
Chamberlain

Introduction

Preliminaries  
FPGAs  
HARPy2

Path to  
Portability and  
Performance

Basic Kernel  
Design Choices

Wavefront  
Parallelism

Hardware Design  
Space  
SVM

Results

HW Design  
Space Search  
Comparison  
SVM  
Performance

Conclusion





# Wavefront Parallelism

Multiple Work Item {Inter, Intra}-Work Group

Exploring  
HARPv2

Cabrera  
Chamberlain

Introduction

Preliminaries  
FPGAs  
HARPv2

Path to  
Portability and  
Performance

Basic Kernel  
Design Choices

Wavefront  
Parallelism

Hardware Design  
Space  
SVM

Results

HW Design  
Space Search  
Comparison  
SVM  
Performance

Conclusion





# Wavefront Parallelism

Multiple Work Item {Inter, Intra}-Work Group

Exploring  
HARPv2

Cabrera  
Chamberlain

Introduction

Preliminaries  
FPGAs  
HARPv2

Path to  
Portability and  
Performance

Basic Kernel  
Design Choices

Wavefront  
Parallelism

Hardware Design  
Space  
SVM

Results

HW Design  
Space Search  
Comparison  
SVM  
Performance

Conclusion





# Wavefront Parallelism

Multiple Work Item {Inter, Intra}-Work Group

Exploring  
HARPv2

Cabrera  
Chamberlain

Introduction

Preliminaries  
FPGAs  
HARPv2

Path to  
Portability and  
Performance

Basic Kernel  
Design Choices

Wavefront  
Parallelism

Hardware Design  
Space  
SVM

Results

HW Design  
Space Search  
Comparison  
SVM  
Performance

Conclusion





# Wavefront Parallelism

## Multiple Work Item {Inter, Intra}-Work Group

Exploring  
HARPv2

Cabrera  
Chamberlain

Introduction

Preliminaries  
FPGAs  
HARPv2

Path to  
Portability and  
Performance

Basic Kernel  
Design Choices

Wavefront  
Parallelism

Hardware Design  
Space  
SVM

Results

HW Design  
Space Search  
Comparison  
SVM  
Performance

Conclusion





# Wavefront Parallelism

Multiple Work Item {Inter, Intra}-Work Group

Exploring  
HARPv2

Cabrera  
Chamberlain

Introduction

Preliminaries  
FPGAs  
HARPv2

Path to  
Portability and  
Performance

Basic Kernel  
Design Choices

Wavefront  
Parallelism

Hardware Design  
Space  
SVM

Results

HW Design  
Space Search  
Comparison  
SVM  
Performance

Conclusion





# Wavefront Parallelism

Multiple Work Item {Inter, Intra}-Work Group

Exploring  
HARPv2

Cabrera  
Chamberlain

Introduction

Preliminaries  
FPGAs  
HARPv2

Path to  
Portability and  
Performance

Basic Kernel  
Design Choices

Wavefront  
Parallelism

Hardware Design  
Space  
SVM

Results

HW Design  
Space Search  
Comparison  
SVM  
Performance

Conclusion





# Wavefront Parallelism

## Single Work Item Blocked, Chunked

Exploring  
HARPv2

Cabrera  
Chamberlain

Introduction

Preliminaries  
FPGAs  
HARPv2

Path to  
Portability and  
Performance

Basic Kernel  
Design Choices  
Wavefront  
Parallelism

Hardware Design  
Space  
SVM

Results

HW Design  
Space Search  
Comparison  
SVM  
Performance

Conclusion





# Wavefront Parallelism

## Single Work Item Blocked, Chunked

Exploring  
HARPv2

Cabrera  
Chamberlain

Introduction

Preliminaries  
FPGAs  
HARPv2

Path to  
Portability and  
Performance

Basic Kernel  
Design Choices  
Wavefront  
Parallelism

Hardware Design  
Space  
SVM

Results

HW Design  
Space Search  
Comparison  
SVM  
Performance

Conclusion





# Wavefront Parallelism

## Single Work Item Blocked, Chunked

Exploring  
HARPv2

Cabrera  
Chamberlain

Introduction

Preliminaries  
FPGAs  
HARPv2

Path to  
Portability and  
Performance

Basic Kernel  
Design Choices  
Wavefront  
Parallelism

Hardware Design  
Space  
SVM

Results

HW Design  
Space Search  
Comparison  
SVM  
Performance

Conclusion





# Wavefront Parallelism

## Single Work Item Blocked, Chunked

Exploring  
HARPy2

Cabrera  
Chamberlain

Introduction

Preliminaries  
FPGAs  
HARPy2

Path to  
Portability and  
Performance

Basic Kernel  
Design Choices

Wavefront  
Parallelism

Hardware Design  
Space  
SVM

Results

HW Design  
Space Search  
Comparison  
SVM  
Performance

Conclusion





# Wavefront Parallelism

## Single Work Item Blocked, Chunked

Exploring  
HARPy2

Cabrera  
Chamberlain

Introduction

Preliminaries  
FPGAs  
HARPy2

Path to  
Portability and  
Performance

Basic Kernel  
Design Choices

Wavefront  
Parallelism

Hardware Design  
Space  
SVM

Results

HW Design  
Space Search  
Comparison  
SVM  
Performance

Conclusion





# Wavefront Parallelism

## Single Work Item Blocked, Chunked

Exploring  
HARPy2

Cabrera  
Chamberlain

Introduction

Preliminaries  
FPGAs  
HARPy2

Path to  
Portability and  
Performance

Basic Kernel  
Design Choices

Wavefront  
Parallelism

Hardware Design  
Space  
SVM

Results

HW Design  
Space Search  
Comparison  
SVM  
Performance

Conclusion





# Wavefront Parallelism

## Single Work Item Blocked, Chunked

Exploring  
HARPv2

Cabrera  
Chamberlain

Introduction

Preliminaries  
FPGAs  
HARPv2

Path to  
Portability and  
Performance

Basic Kernel  
Design Choices

Wavefront  
Parallelism

Hardware Design  
Space  
SVM

Results

HW Design  
Space Search  
Comparison  
SVM  
Performance

Conclusion





# Wavefront Parallelism

## Single Work Item Blocked, Chunked

Exploring  
HARPy2

Cabrera  
Chamberlain

Introduction

Preliminaries  
FPGAs  
HARPy2

Path to  
Portability and  
Performance

Basic Kernel  
Design Choices

Wavefront  
Parallelism  
Hardware Design  
Space  
SVM

Results

HW Design  
Space Search  
Comparison  
SVM  
Performance

Conclusion





# Wavefront Parallelism

## Single Work Item Blocked, Chunked

Exploring  
HARPv2

Cabrera  
Chamberlain

Introduction

Preliminaries  
FPGAs  
HARPv2

Path to  
Portability and  
Performance

Basic Kernel  
Design Choices

Wavefront  
Parallelism  
Hardware Design  
Space  
SVM

Results  
HW Design  
Space Search  
Comparison  
SVM  
Performance

Conclusion





# Wavefront Parallelism

## Single Work Item Blocked, Chunked

Exploring  
HARPy2

Cabrera  
Chamberlain

Introduction

Preliminaries  
FPGAs  
HARPy2

Path to  
Portability and  
Performance

Basic Kernel  
Design Choices

Wavefront  
Parallelism  
Hardware Design  
Space  
SVM

Results  
HW Design  
Space Search  
Comparison  
SVM  
Performance

Conclusion





# Wavefront Parallelism

## Single Work Item Blocked, Chunked

Exploring  
HARPy2

Cabrera  
Chamberlain

Introduction

Preliminaries  
FPGAs  
HARPy2

Path to  
Portability and  
Performance

Basic Kernel  
Design Choices

Wavefront  
Parallelism  
Hardware Design  
Space  
SVM

Results  
HW Design  
Space Search  
Comparison  
SVM  
Performance

Conclusion





# Wavefront Parallelism

## Single Work Item Blocked, Chunked

Exploring  
HARPV2

Cabrera  
Chamberlain

Introduction

Preliminaries  
FPGAs  
HARPV2

Path to  
Portability and  
Performance

Basic Kernel  
Design Choices

Wavefront  
Parallelism  
Hardware Design  
Space  
SVM

Results  
HW Design  
Space Search  
Comparison  
SVM  
Performance

Conclusion





# Wavefront Parallelism

## Single Work Item Blocked, Chunked

Exploring  
HARPy2

Cabrera  
Chamberlain

Introduction

Preliminaries  
FPGAs  
HARPy2

Path to  
Portability and  
Performance

Basic Kernel  
Design Choices

Wavefront  
Parallelism

Hardware Design  
Space  
SVM

Results

HW Design  
Space Search

Comparison  
SVM  
Performance

Conclusion





# Wavefront Parallelism

## Single Work Item Blocked, Chunked

Exploring  
HARPV2

Cabrera  
Chamberlain

Introduction

Preliminaries  
FPGAs  
HARPV2

Path to  
Portability and  
Performance

Basic Kernel  
Design Choices

Wavefront  
Parallelism

Hardware Design  
Space  
SVM

Results

HW Design  
Space Search

Comparison  
SVM

Performance

Conclusion





# Wavefront Parallelism

## Single Work Item Blocked, Chunked

Exploring  
HARPV2

Cabrera  
Chamberlain

Introduction

Preliminaries  
FPGAs  
HARPV2

Path to  
Portability and  
Performance

Basic Kernel  
Design Choices

Wavefront  
Parallelism

Hardware Design  
Space  
SVM

Results

HW Design  
Space Search  
Comparison  
SVM  
Performance

Conclusion





# Wavefront Parallelism

## Single Work Item Blocked, Chunked

Exploring  
HARPV2

Cabrera  
Chamberlain

Introduction

Preliminaries  
FPGAs  
HARPV2

Path to  
Portability and  
Performance

Basic Kernel  
Design Choices

Wavefront  
Parallelism

Hardware Design  
Space  
SVM

Results

HW Design  
Space Search

Comparison  
SVM

Performance

Conclusion





# Wavefront Parallelism

## Single Work Item Blocked, Chunked

Exploring  
HARPy2

Cabrera  
Chamberlain

Introduction

Preliminaries  
FPGAs  
HARPy2

Path to  
Portability and  
Performance

Basic Kernel  
Design Choices

Wavefront  
Parallelism

Hardware Design  
Space  
SVM

Results

HW Design  
Space Search  
Comparison  
SVM  
Performance

Conclusion





# Wavefront Parallelism

## Single Work Item Blocked, Chunked

Exploring  
HARPV2

Cabrera  
Chamberlain

Introduction

Preliminaries  
FPGAs  
HARPV2

Path to  
Portability and  
Performance

Basic Kernel  
Design Choices

Wavefront  
Parallelism

Hardware Design  
Space  
SVM

Results

HW Design  
Space Search

Comparison  
SVM

Performance

Conclusion





# Wavefront Parallelism

## Single Work Item Blocked, Chunked

Exploring  
HARPV2

Cabrera  
Chamberlain

Introduction

Preliminaries  
FPGAs  
HARPV2

Path to  
Portability and  
Performance

Basic Kernel  
Design Choices

Wavefront  
Parallelism

Hardware Design  
Space  
SVM

Results

HW Design  
Space Search

Comparison  
SVM  
Performance

Conclusion





# Wavefront Parallelism

## Single Work Item Blocked, Chunked

Exploring  
HARPV2

Cabrera  
Chamberlain

Introduction

Preliminaries  
FPGAs  
HARPV2

Path to  
Portability and  
Performance

Basic Kernel  
Design Choices

Wavefront  
Parallelism

Hardware Design  
Space  
SVM

Results

HW Design  
Space Search

Comparison  
SVM

Performance

Conclusion





# Hardware Design Space

Example:  $\text{BSIZE} = \{ 4, 8 \}$ ,  $\text{PAR} = \{ 2, 4 \}$

Exploring  
HARPv2

Cabrera  
Chamberlain

Introduction

Preliminaries  
FPGAs  
HARPv2

Path to  
Portability and  
Performance

Basic Kernel  
Design Choices  
Wavefront  
Parallelism

Hardware Design  
Space  
SVM

Results

HW Design  
Space Search  
Comparison  
SVM  
Performance

Conclusion





# Hardware Design Space

Example:  $\text{BSIZE} = \{ 4, 8 \}$ ,  $\text{PAR} = \{ 2, 4 \}$

Exploring  
HARPv2

Cabrera  
Chamberlain

Introduction

Preliminaries  
FPGAs  
HARPv2

Path to  
Portability and  
Performance

Basic Kernel  
Design Choices  
Wavefront  
Parallelism

Hardware Design  
Space  
SVM

Results

HW Design  
Space Search  
Comparison  
SVM  
Performance

Conclusion





# Shared Virtual Memory (SVM)

Exploring  
HARPv2

Cabrera  
Chamberlain

Introduction

Preliminaries  
FPGAs  
HARPv2

Path to  
Portability and  
Performance

Basic Kernel  
Design Choices  
Wavefront  
Parallelism  
Hardware Design  
Space  
SVM

Results

HW Design  
Space Search  
Comparison  
SVM  
Performance

Conclusion

Socket





## Exploring HARPv2

Cabrera  
Chamberlain

### Introduction

Preliminaries  
FPGAs  
HARPv2

### Path to Portability and Performance

Basic Kernel  
Design Choices  
Wavefront Parallelism  
Hardware Design Space  
SVM

### Results

HW Design Space Search  
Comparison  
SVM  
Performance

### Conclusion

## 1 Introduction

## 2 Preliminaries

- FPGAs
- HARPv2

## 3 Path to Portability and Performance

- Basic Kernel
- Design Choices
- Wavefront Parallelism
- Hardware Design Space
- SVM

## 4 Results

- HW Design Space Search
- Comparison
- SVM Performance

## 5 Conclusion



# Design Space Search

Exploring  
HARPv2

Cabrera  
Chamberlain

Introduction

Preliminaries

FPGAs  
HARPv2

Path to  
Portability and  
Performance

Basic Kernel  
Design Choices

Wavefront  
Parallelism

Hardware Design  
Space

SVM

Results

HW Design  
Space Search

Comparison  
SVM  
Performance

Conclusion





# Design Space Search

Exploring  
HARPv2

Cabrera  
Chamberlain

Introduction

Preliminaries

FPGAs  
HARPv2

Path to  
Portability and  
Performance

Basic Kernel  
Design Choices

Wavefront  
Parallelism

Hardware Design  
Space  
SVM

Results

HW Design  
Space Search

Comparison  
SVM  
Performance

Conclusion





# Design Space Search

Exploring  
HARPv2

Cabrera  
Chamberlain

Introduction

Preliminaries  
FPGAs  
HARPv2

Path to  
Portability and  
Performance

Basic Kernel  
Design Choices  
Wavefront  
Parallelism  
Hardware Design  
Space  
SVM

Results

HW Design  
Space Search  
Comparison  
SVM  
Performance

Conclusion



It took 14 days to build all kernel configurations!



# Comparison Results

Exploring  
HARPv2

Cabrera  
Chamberlain

Introduction

Preliminaries  
FPGAs  
HARPv2

Path to  
Portability and  
Performance

Basic Kernel  
Design Choices  
Wavefront  
Parallelism  
Hardware Design  
Space  
SVM

Results

HW Design  
Space Search  
Comparison  
SVM  
Performance

Conclusion

- SVP = Stratix V, PCIe
- vD = Dummy
- HARP = Arria 10, HARP

| V | Kernel Type | FPGA | $f_{max}$ (MHz) | Logic | Speedup |
|---|-------------|------|-----------------|-------|---------|
|---|-------------|------|-----------------|-------|---------|

Zohouri et al., 2018

Our Work



# Comparison Results

Exploring  
HARPv2

Cabrera  
Chamberlain

Introduction

Preliminaries

FPGAs  
HARPv2

Path to  
Portability and  
Performance

Basic Kernel  
Design Choices

Wavefront  
Parallelism

Hardware Design  
Space  
SVM

Results

HW Design  
Space Search

Comparison  
SVM  
Performance

Conclusion

- SVP = Stratix V, PCIe
- vD = Dummy
- HARP = Arria 10, HARP

| V  | Kernel Type | FPGA | $f_{max}$ (MHz) | Logic | Speedup |
|----|-------------|------|-----------------|-------|---------|
| v0 | MWI         | SVP  | 267.52          | 27%   | 1.00    |
|    |             | HARP | 211.77          | 25%   | 0.74    |
| v1 | SWI         | SVP  | 304.50          | 20%   | 0.05    |
|    |             | HARP | 256.6           | 26%   | 0.01    |
| v2 | MWI         | SVP  | 164.20          | 38%   | 2.48    |
|    |             | HARP | 162.865         | 50%   | 3.90    |
| v3 | SWI         | SVP  | 191.97          | 19%   | 3.55    |
|    |             | HARP | 178.12          | 25%   | 3.24    |
| v5 | SWI         | SVP  | 218.15          | 53%   | 38.22   |
| vD | N/A         | HARP | 350.26          | 23%   | N/A     |

Zohouri et al., 2018

Our Work



# Comparison Results

Exploring  
HARPv2

Cabrera  
Chamberlain

Introduction

Preliminaries

FPGAs  
HARPv2

Path to  
Portability and  
Performance

Basic Kernel  
Design Choices

Wavefront  
Parallelism

Hardware Design  
Space  
SVM

Results

HW Design  
Space Search

Comparison  
SVM  
Performance

Conclusion

- SVP = Stratix V, PCIe
- HARP = Arria 10, HARP

- vD = Dummy

| V  | Kernel Type | FPGA | $f_{max}$ (MHz) | Logic | Speedup |
|----|-------------|------|-----------------|-------|---------|
| v0 | MWI         | SVP  | 267.52          | 27%   | 1.00    |
|    |             | HARP | 211.77          | 25%   | 0.74    |
| v1 | SWI         | SVP  | 304.50          | 20%   | 0.05    |
|    |             | HARP | 256.6           | 26%   | 0.01    |
| v2 | MWI         | SVP  | 164.20          | 38%   | 2.48    |
|    |             | HARP | 162.865         | 50%   | 3.90    |
| v3 | SWI         | SVP  | 191.97          | 19%   | 3.55    |
|    |             | HARP | 178.12          | 25%   | 3.24    |
| v5 | SWI         | SVP  | 218.15          | 53%   | 38.22   |
| vD | N/A         | HARP | 350.26          | 23%   | N/A     |

Zohouri et al., 2018

Our Work



# Comparison Results

Exploring  
HARPv2

Cabrera  
Chamberlain

Introduction

Preliminaries

FPGAs  
HARPv2

Path to  
Portability and  
Performance

Basic Kernel  
Design Choices

Wavefront  
Parallelism

Hardware Design  
Space  
SVM

Results

HW Design  
Space Search

Comparison  
SVM  
Performance

Conclusion

- SVP = Stratix V, PCIe
- HARP = Arria 10, HARP

- vD = Dummy

| V  | Kernel Type | FPGA | $f_{max}$ (MHz) | Logic | Speedup     |
|----|-------------|------|-----------------|-------|-------------|
| v0 | MWI         | SVP  | 267.52          | 27%   | 1.00        |
|    |             | HARP | <b>211.77</b>   | 25%   | <b>0.74</b> |
| v1 | SWI         | SVP  | 304.50          | 20%   | 0.05        |
|    |             | HARP | <b>256.6</b>    | 26%   | <b>0.01</b> |
| v2 | MWI         | SVP  | 164.20          | 38%   | 2.48        |
|    |             | HARP | <b>162.865</b>  | 50%   | 3.90        |
| v3 | SWI         | SVP  | 191.97          | 19%   | 3.55        |
|    |             | HARP | <b>178.12</b>   | 25%   | <b>3.24</b> |
| v5 | SWI         | SVP  | 218.15          | 53%   | 38.22       |
| vD | N/A         | HARP | 350.26          | 23%   | N/A         |

Zohouri et al., 2018

Our Work



# Comparison Results

Exploring  
HARPv2

Cabrera  
Chamberlain

Introduction

Preliminaries

FPGAs  
HARPv2

Path to  
Portability and  
Performance

Basic Kernel  
Design Choices

Wavefront  
Parallelism

Hardware Design  
Space  
SVM

Results

HW Design  
Space Search

Comparison  
SVM  
Performance

Conclusion

- SVP = Stratix V, PCIe
- vD = Dummy
- HARP = Arria 10, HARP

| V  | Kernel Type | FPGA | $f_{max}$ (MHz) | Logic | Speedup |
|----|-------------|------|-----------------|-------|---------|
| v0 | MWI         | SVP  | 267.52          | 27%   | 1.00    |
|    |             | HARP | 211.77          | 25%   | 0.74    |
| v1 | SWI         | SVP  | 304.50          | 20%   | 0.05    |
|    |             | HARP | 256.6           | 26%   | 0.01    |
| v2 | MWI         | SVP  | 164.20          | 38%   | 2.48    |
|    |             | HARP | 162.865         | 50%   | 3.90    |
| v3 | SWI         | SVP  | 191.97          | 19%   | 3.55    |
|    |             | HARP | 178.12          | 25%   | 3.24    |
| v5 | SWI         | SVP  | 218.15          | 53%   | 38.22   |
| vD | N/A         | HARP | 350.26          | 23%   | N/A     |

Zohouri et al., 2018

Our Work



# Comparison Results

Exploring  
HARPv2

Cabrera  
Chamberlain

Introduction

Preliminaries

FPGAs  
HARPv2

Path to  
Portability and  
Performance

Basic Kernel  
Design Choices

Wavefront  
Parallelism

Hardware Design  
Space  
SVM

Results

HW Design  
Space Search

Comparison  
SVM  
Performance

Conclusion

- SVP = Stratix V, PCIe
- HARP = Arria 10, HARP

- vD = Dummy

| V  | Kernel Type | FPGA | $f_{max}$ (MHz) | Logic | Speedup |
|----|-------------|------|-----------------|-------|---------|
| v0 | MWI         | SVP  | 267.52          | 27%   | 1.00    |
|    |             | HARP | 211.77          | 25%   | 0.74    |
| v1 | SWI         | SVP  | 304.50          | 20%   | 0.05    |
|    |             | HARP | 256.6           | 26%   | 0.01    |
| v2 | MWI         | SVP  | 164.20          | 38%   | 2.48    |
|    |             | HARP | 162.865         | 50%   | 3.90    |
| v3 | SWI         | SVP  | 191.97          | 19%   | 3.55    |
|    |             | HARP | 178.12          | 25%   | 3.24    |
| v5 | SWI         | SVP  | 218.15          | 53%   | 38.22   |
|    |             | HARP | 186.81          | 40%   | 34.27   |
| vD | N/A         | HARP | 350.26          | 23%   | N/A     |

Zohouri et al., 2018

Our Work



# SVM Results

Exploring  
HARPy2

Cabrera  
Chamberlain

Introduction

Preliminaries  
FPGAs  
HARPy2

Path to  
Portability and  
Performance

Basic Kernel  
Design Choices  
Wavefront  
Parallelism  
Hardware Design  
Space  
SVM

Results

HW Design  
Space Search  
Comparison  
SVM  
Performance

Conclusion





# Conclusion

Exploring  
HARPv2

Cabrera  
Chamberlain

Introduction

Preliminaries

FPGAs  
HARPv2

Path to  
Portability and  
Performance

Basic Kernel  
Design Choices  
Wavefront  
Parallelism  
Hardware Design  
Space  
SVM

Results

HW Design  
Space Search  
Comparison  
SVM  
Performance

Conclusion

- Design space search necessary to find most performant kernel
- OpenCL design practices for PCIe Card FPGAs hold for HARPv2
- Intel HARPv2 FPGA-CPU interface requires a lot of FPGA resources
- SVM implementation alleviates data movement problem
  - For snapshot of artifacts:  
<https://openscholarship.wustl.edu/data/17/>
  - For most recent updates:  
[https://github.com/cabreraam/iwocl2019\\_artifacts](https://github.com/cabreraam/iwocl2019_artifacts)