

# Systolic-CNN: An OpenCL-defined Scalable Run-time-flexible FPGA Accelerator Architecture for Accelerating Convolutional Neural Network Inference in Cloud/Edge Computing

Akshay Dua  
Arizona State University  
adua5@asu.edu

Yixing Li  
Arizona State University  
yixingli@asu.edu

Fengbo Ren  
Arizona State University  
renfengbo@asu.edu

## ABSTRACT

This paper presents Systolic-CNN, an OpenCL-defined scalable, run-time-flexible FPGA accelerator architecture, optimized for accelerating the inference of various convolutional neural networks (CNNs) in multi-tenancy cloud/edge computing. The existing OpenCL-defined FPGA accelerators for CNN inference are insufficient due to limited flexibility for supporting multiple CNN models at run time and poor scalability resulting in underutilized FPGA resources and limited computational parallelism. Systolic-CNN adopts a highly pipelined and paralleled 1-D systolic array architecture, which efficiently exploits both spatial and temporal parallelism for accelerating CNN inference on FPGAs. Systolic-CNN is highly scalable and parameterized, and can be easily adapted by users to achieve up to 100% utilization of the coarse-grained computation resources (i.e., DSP blocks) for a given FPGA. Systolic-CNN is also run-time-flexible in the context of multi-tenancy cloud/edge computing: it can be time-shared to accelerate a variety of CNN models at run time without recompiling the FPGA kernel hardware or reprogramming the FPGA. The experiment results based on Intel Arria 10 GX and Stratix 10 GX FPGA boards show that the optimized single-precision implementation of Systolic-CNN can achieve an average inference latency of 7ms/2ms, 84ms/33ms, 202ms/73ms, 1615ms/873ms, and 900ms/498ms per image (Arria 10/Stratix 10) for accelerating AlexNet, ResNet-50, ResNet-152, RetinaNet, and Light-weight RetinaNet, respectively. Code is available at <https://github.com/PSCLab-ASU/Systolic-CNN>.

## CCS CONCEPTS

- Hardware → Hardware accelerators
- Computer systems organization → Neural networks

## KEYWORDS

FPGA, neural networks, OpenCL, accelerator

## 1 INTRODUCTION

FPGAs offer superior hardware flexibility and energy efficiency that have attracted many researchers and developers to use FPGAs for accelerating convolutional neural network (CNN) inference for computer vision tasks [23–25]. The conventional development flow of FPGAs relies on designing FPGA hardware at the register-transfer level (RTL). Although it allows fine control of resource utilization for precise performance improvement [22], the large effort needed in design and verification makes architecture design space exploration time-consuming. High-level synthesis (HLS) tools, such as the Intel FPGA SDK for OpenCL, allow function modeling at a much higher level, thus enabling a faster design and verification cycle. The HLS tools also provide a rich set of synthesis attributes and directives that facilitate efficient architecture design space exploration [3].

Many recent works explore accelerating CNNs on FPGAs using C/OpenCL, showing promising acceleration performance [1, 18, 23–25, 27]. Nevertheless, these works suffer from two major limitations that make them insufficient for realizing acceleration-as-a-service for multi-tenancy cloud/edge computing: 1) the lack of flexibility for supporting multiple CNN models at run time; and 2) the poor scalability resulting in underutilized FPGA resources and limited computational parallelism.

In this paper, we present Systolic-CNN, an OpenCL-defined scalable, run-time-flexible FPGA accelerator architecture for accelerating CNN inference in multi-tenancy cloud/edge computing. Systolic-CNN adopts a highly pipelined and paralleled 1-D systolic array architecture, which efficiently exploits both spatial and temporal parallelism for accelerating CNN inference on FPGAs. Systolic-CNN is highly scalable and has three key architectural parameters, based on which a user can optimally scale the accelerator architecture to fully utilize the off-chip memory bandwidth and the available DSP block resources of a given FPGA board. In addition, single-precision Systolic-CNN is run-time-flexible in the context of multi-tenancy cloud/edge computing: it can be time-shared to accelerate a variety of CNN models (including error-sensitive applications [21, 26]) at run time without recompiling the FPGA kernel hardware or reprogramming the FPGA. The experiment results based on Intel Arria 10 GX and Stratix 10 GX FPGA boards show that the optimized single-precision implementation of Systolic-CNN can achieve an average inference latency of 7ms/2ms, 84ms/33ms, 202ms/73ms, 1615ms/873ms, and 900ms/498ms per image (Arria 10/Stratix 10) for accelerating AlexNet, ResNet-50, ResNet-152, RetinaNet, and Light-weight RetinaNet, respectively. The peak computational throughput is measured at 80–210 GFLOPS and 242–700 GFLOPS for accelerating different single-precision CNN models on the Arria 10 and Stratix 10 FPGA boards, respectively. The source code is available at <https://github.com/PSCLab-ASU/Systolic-CNN>.

## 2 BACKGROUND

### 2.1 CNN Models

CNNs are a subcategory of deep neural network models. CNNs can extract and learn spatial information automatically to process image classification and object detection tasks [7, 14, 15]. CNNs typically consist of convolutional layers, fully-connected layers, pooling layers and non-linearity layers [13]. Among them, convolutional layers account for most of the computation. A standard convolutional layer is shown in Fig. 1 [6]. The convolutional kernels slide over the input feature maps and compute the inner products for the output feature maps. Both spatial and temporal parallelism can be exploited to perform high-throughput convolutions.

In this paper, we evaluate the performance of the proposed accelerator architecture on five different CNN models, namely AlexNet [8], ResNet-50 [7], ResNet-152 [7], RetinaNet [15], and lightweight (LW) RetinaNet [14]. The first three CNN models are used for classification tasks, while the other two are used for object detection tasks.

### 2.2 Related Works on OpenCL-defined FPGA Accelerators for CNN Inference

There have been many works [1, 18, 23–25, 27] on accelerating CNNs on FPGAs using C/OpenCL published in recent years. However, these works suffer from one major limitation that makes them insufficient for handling the dynamic workloads in a multi-tenancy cloud/edge computing environment. Most of the existing works are either designed to exclusively accelerate a specific CNN model [18] or require recompilation of the FPGA kernel and reprogramming of the FPGA device when changing the CNN model for acceleration [23–25]. For example, the work in [18] shows a high inference throughput but is restricted to accelerating a YOLO CNN model only [20]. PipeCNN [24], although designed for accelerating a variety of CNN models, requires updating the line buffer size and recompiling the FPGA kernel code for each CNN model, because its folded computation of the inner product along the channel dimension varies across CNN models. Given that FPGA kernel compilation can take a long time, this is a deal-breaker for providing acceleration-as-a-service in multi-tenancy cloud/edge computing. In contrast, the architecture design of Systolic-CNN is completely invariant to CNN models. Specifically, the memory access pattern of input feature maps (IFMs) and the local buffer sizes in Systolic-CNN only depend on user-defined architecture parameters regardless of the CNN model mapped. Thus, Systolic-CNN is run-time-flexible and can be time-shared to accelerate a variety of CNN models at run time without recompiling the FPGA kernel hardware or reprogramming the FPGA.



Figure 1: A convolutional layer.

In addition, many of the existing works have poor scalability [23–25], as evidenced by the reported under-utilization of the coarse-grained computing resources available on an FPGA (< 90% DSP block utilization). For example, although PipeCNN [24] exploits two levels of spatial parallelism, we notice in our experiment that scaling up the computation parallelism (e.g., *vec\_size*) in PipeCNN [24] can create difficulty for the placement-and-routing stage due to the large fan-out required at local memory buffer interfaces. The works in [25, 27] adopt a 2-D systolic array architecture [10] to improve the design scalability and operating frequency but still fail to fully utilize the available DSP block resources, as the optimal mapping of a 2-D systolic array highly depends on the physical layout of an FPGA, which can change across different FPGA boards. In contrast, Systolic-CNN adopts a highly pipelined and paralleled 1-D systolic array architecture with shift-register-based IFM buffers, which efficiently exploits both spatial and temporal parallelism as well as the data reuse of IFMs to improve inference throughput with reduced off-chip memory access and well-bounded fan-out. As a result, Systolic-CNN is highly scalable and parameterized, and can be easily adapted by users to achieve up to 100% utilization of the coarse-grained computation resources (i.e., DSP blocks) for a given FPGA.

## 3 ARCHITECTURE DESIGN

### 3.1 System Architecture

Fig. 2 shows the high-level system architecture of Systolic-CNN. The convolution engine (CONV) in Systolic-CNN adopts a highly pipelined and paralleled 1-D systolic processing element (PE) array architecture [9] for performing high-throughput convolutions with both spatial and temporal parallelism. The IFMs are read from the off-chip memory and cached in an on-chip shift-register-based IFM buffer for reducing off-chip memory access and maximizing the data reuse of IFMs for the convolution computation both within the same and across different output feature maps (OFMs). The weights are also read from the off-chip memory and cached inside the PEs to be reused for the convolution computation within the same OFMs. The system architecture also implements other commonly used layers in CNN models, namely normalization layer (LRN), element-wise sum layer (ELTWISE), rectified linear unit layer (ReLU), and pooling layer (POOL). The LRN, POOL, ELTWISE, and ReLU computations are optional during the kernel execution, depending on the CNN model structure. The final output results are written back to the off-chip memory for either the next round of computation or return to the host kernel process.

Figure 2: The system architecture of Systolic-CNN.

To allow Intel FPGA SDK for OpenCL to better resolve the data-dependency and create the deep processing pipeline properly, the FPGA kernels of Systolic-CNN are all implemented as single-threaded kernels. Specifically, the shift-register-based IFM buffer is implemented as the MemRD kernel, each PE is implemented as an auto-run kernel to minimize the host-induced latency during CNN inference, the LRN and POOL modules are each implemented as a separate kernel, and the ELTWISE and ReLU modules are combined and implemented as the MemWrite kernel. Given that convolution is the bottleneck of computation in CNNs, the PE (convolution) kernels are designed to utilize most of the coarse-grained computation resources on an FPGA, while the other computation kernels are designed to utilize the minimum resources needed for making sure they are not the computational throughput bottleneck. In addition, the PE (convolution) kernels are optimized with a minimum initiation interval of 1 cycle. Systolic-CNN can support any customized residual neural networks with skipped connections.

### 3.2 Architectural Parameters

We parameterize the system architecture of Systolic-CNN with three architectural parameters, namely *pe\_num*, *vec\_fac*, and *reuse\_fac*.

From a system architecture perspective, *pe\_num* defines the number of PEs in the 1-D systolic array that performs temporally paralleled convolution in a deep pipeline. Each PE performs the convolution computation of a different OFM by sharing the same IFM data in a shifted fashion. Thus, *pe\_num* also defines the parallelism of OFM generation. *reuse\_fac* defines the parallelism of the inner product (IP) units inside each PE as well as how many times the same IFM data is reused by each PE for the convolution computation within the same OFM. Increasing *reuse\_fac* improves the computational throughput without changing the amount of off-chip memory access needed for reading the IFMs, thus relaxing the off-chip memory bandwidth requirement and improving the off-chip memory bandwidth efficiency. *vec\_fac* defines the SIMD width of the partial IP computation between the weight vector and the IFM vector across *vec\_fac* different channels inside each IP unit in each PE. Thus, *vec\_fac* and *reuse\_fac* also define the parallelism of IFM computation along the channel and the row dimension of the IFMs, respectively. In addition, the size of the shift-register-based IFM buffer is defined by *reuse\_fac* $\times$ *vec\_fac*. These three parameters allow users to efficiently perform architecture design space exploration to maximize the resource utilization of a given FPGA board subject to the available off-chip memory bandwidth. An example of the design space exploration is discussed in Section 4.2.

To illustrate the impact of the architectural parameters on acceleration performance, we provide the pseudo code of a standard convolutional layer in Fig. 3, and a convolutional layer optimized with the three architectural parameters implemented in Fig. 4. From an algorithmic perspective, *pe\_num*, *vec\_fac*, and *reuse\_fac* can be interpreted as the unrolling factors of the for loops along the depth of the OFM (*op\_dim*), the depth (channel dimension) of the IFM (*ic\_dim*), and the row dimension of the IFM (*row\_dim*), respectively. It should be noted that the system architecture of Systolic-CNN only depends on the three architectural parameters, which are completely invariant to CNN models. Such invariance is the key to enabling the run-time flexibility needed for handling the dynamic workload in a multi-tenancy cloud/edge computing environment.

```
for i = 0 to op_dim
  for j = 0 to ic_dim
    for y = 0 to col_dim
      for x = 0 to row_dim
        for m1 = 0 to K
          for m2 = 0 to K
            output[i][y][x] += ip[j][y+m1][x+m2] * w[i][j][m1][m2]
```

Figure 3: The pseudo code for a standard convolutional layer
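The relationship between the loop nests of Fig. 3 and Fig. 4 can be checked numerically. The following is a minimal pure-Python sketch, not the OpenCL kernel code: it assumes stride 1, no padding, and dimensions that divide evenly by the three parameters, and the helper names (`zeros`, `conv_standard`, `conv_tiled`) are introduced here for illustration only. Integer data is used so that the two results can be compared exactly.

```python
import random

def zeros(d0, d1, d2):
    """Allocate a d0 x d1 x d2 nested list filled with zeros."""
    return [[[0] * d2 for _ in range(d1)] for _ in range(d0)]

def conv_standard(ip, w, K):
    """Loop nest of Fig. 3 (assumed: stride 1, no padding)."""
    op_dim, ic_dim = len(w), len(w[0])
    col_dim = len(ip[0]) - K + 1
    row_dim = len(ip[0][0]) - K + 1
    out = zeros(op_dim, col_dim, row_dim)
    for i in range(op_dim):
        for j in range(ic_dim):
            for y in range(col_dim):
                for x in range(row_dim):
                    for m1 in range(K):
                        for m2 in range(K):
                            out[i][y][x] += ip[j][y + m1][x + m2] * w[i][j][m1][m2]
    return out

def conv_tiled(ip, w, K, pe_num, vec_fac, reuse_fac):
    """Loop nest of Fig. 4; the three innermost loops model the unrolled hardware."""
    op_dim, ic_dim = len(w), len(w[0])
    col_dim = len(ip[0]) - K + 1
    row_dim = len(ip[0][0]) - K + 1
    # Assumption: every tiled dimension divides evenly by its unrolling factor.
    assert op_dim % pe_num == 0 and ic_dim % vec_fac == 0 and row_dim % reuse_fac == 0
    out = zeros(op_dim, col_dim, row_dim)
    for i in range(op_dim // pe_num):
        for j in range(ic_dim // vec_fac):
            for y in range(col_dim):
                for x in range(row_dim // reuse_fac):
                    for m1 in range(K):
                        for m2 in range(K):
                            for p in range(pe_num):              # one iteration per PE
                                for v in range(vec_fac):         # SIMD lanes over channels
                                    for r in range(reuse_fac):   # IP units within a PE
                                        out[i * pe_num + p][y][x * reuse_fac + r] += (
                                            ip[j * vec_fac + v][y + m1][x * reuse_fac + r + m2]
                                            * w[i * pe_num + p][j * vec_fac + v][m1][m2]
                                        )
    return out

random.seed(0)
K, ic_dim, op_dim, height, width = 3, 4, 4, 6, 6
ip = [[[random.randint(-3, 3) for _ in range(width)] for _ in range(height)]
      for _ in range(ic_dim)]
w = [[[[random.randint(-3, 3) for _ in range(K)] for _ in range(K)]
      for _ in range(ic_dim)] for _ in range(op_dim)]

reference = conv_standard(ip, w, K)
tiled = conv_tiled(ip, w, K, pe_num=2, vec_fac=2, reuse_fac=2)
print(reference == tiled)
```

The tiled version only reorders the accumulation, so the outputs match for any parameter choice that satisfies the divisibility assumption; this is why the parameters can be chosen from the FPGA resources rather than from the CNN model.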

### 3.3 Data loading scheme

Fig. 5 shows the data loading scheme of the input feature map (IFM). In one clock cycle, $1 \times 1 \times vec\_fac$ IFM values (highlighted in dark blue) are loaded into the shift-register-based IFM buffer. Then, the loading window slides along the row dimension *reuse\_fac* + *c* - 1 times and along the column dimension *c* times, where *c* represents the size of a convolution kernel. Thus, the buffered IFM values can be reused *reuse\_fac* times when computing with the *c* $\times$ *c* convolution kernel. After the buffered IFM values have completed all the computations with different weights, the loading window slides along the channel dimension to repeat the operations stated above.
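The row-direction reuse can be illustrated with a simplified 1-D model. This sketch is an interpretation of the scheme above rather than the actual buffer implementation: it assumes a single channel and a single kernel row, and it models the buffer as a plain window of *reuse\_fac* + *c* - 1 values instead of a hardware shift register. The function name `conv_row_with_window` is hypothetical.

```python
def conv_row_with_window(row, weights, reuse_fac):
    """1-D convolution of one IFM row in which reuse_fac neighbouring outputs
    share a single buffered window of reuse_fac + c - 1 input values."""
    c = len(weights)
    n_out = len(row) - c + 1
    # Assumption: the number of outputs divides evenly by reuse_fac.
    assert n_out % reuse_fac == 0
    out = [0] * n_out
    loads = 0  # counts values fetched into the window
    for tile in range(n_out // reuse_fac):
        base = tile * reuse_fac
        window = row[base: base + reuse_fac + c - 1]  # loaded once per tile
        loads += len(window)
        for r in range(reuse_fac):                    # one output per IP unit
            out[base + r] = sum(window[r + m] * weights[m] for m in range(c))
    return out, loads

row = list(range(10))
weights = [1, 2, 3]
out_shared, loads_shared = conv_row_with_window(row, weights, reuse_fac=4)
out_single, loads_single = conv_row_with_window(row, weights, reuse_fac=1)
print(out_shared == out_single, loads_shared, loads_single)
```

In this toy model the results are identical, while the shared window fetches 12 values instead of 24, which mirrors the intent of reusing buffered IFM data across the IP units.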

### 3.4 PE Design

Fig. 6 illustrates the architecture of the *n*<sup>th</sup> PE in the convolution engine of Systolic-CNN. Each PE contains multiple IP units (defined by *reuse\_fac*), each of which computes the 3D inner product across different sliding windows of the convolution computation within the same OFM. Different IP units share the same set of weights and take in the same IFM vector sequence in a shifted fashion to reuse the IFM data by a factor of $reuse\_fac$. Each IP unit contains multiple multipliers and a pipelined adder tree for computing partial IPs with a SIMD width defined by $vec\_fac$, as well as an accumulator for computing the IP of an arbitrary dimension in a folded, pipelined fashion to eliminate the need for data movement for partial IP summation. To facilitate the IFM data movement throughout the 1-D systolic array of PEs, each PE also shifts the input IFM data directly to the subsequent PE with a one-cycle latency.

```
for i = 0 to op_dim/pe_num
  for j = 0 to ic_dim/vec_fac
    for y = 0 to col_dim
      for x = 0 to row_dim/reuse_fac
        for m1 = 0 to K
          for m2 = 0 to K
            #pragma unroll
            for p = 0 to pe_num
              #pragma unroll
              for v = 0 to vec_fac
                #pragma unroll
                for r = 0 to reuse_fac
                  output[i*pe_num+p][y][x*reuse_fac+r] +=
                    ip[j*vec_fac+v][y+m1][x*reuse_fac+r+m2]
                    * w[i*pe_num+p][j*vec_fac+v][m1][m2]
```

Figure 4: The pseudo code for a convolutional layer with the three architectural parameters implemented.

**Figure 5: Data loading scheme of the input feature map.**

It should be noted that when performing the computation in fully connected layers, the weight sharing across different IP units in the PE becomes inefficient and causes low utilization of the computation resources. To address this problem, Systolic-CNN supports a batch processing mode for fully connected layer computation. By processing multiple input images in a batch mode, the same weights in the fully connected layer can be again shared across different IP units for performing the computation of different images. The batch size must be  $\leq reuse\_fac$ . When the batch size =  $reuse\_fac$ , the computation resources in each PE can be fully utilized for accelerating the computation in fully connected layers.
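The effect of the batch size on fully connected layers can be sketched with a one-line model. It assumes, as a simplification of the text above, that without batching only one IP unit per PE performs useful work in a fully connected layer, and that each additional image in the batch occupies one more IP unit; `fc_ip_unit_utilization` is a hypothetical helper.

```python
def fc_ip_unit_utilization(batch_size, reuse_fac):
    """Fraction of IP units per PE doing useful work in a fully connected layer,
    under the assumption that each batched image occupies one IP unit."""
    if not 1 <= batch_size <= reuse_fac:
        raise ValueError("batch size must satisfy 1 <= batch_size <= reuse_fac")
    return batch_size / reuse_fac

for batch in (1, 2, 4):
    print(batch, fc_ip_unit_utilization(batch, reuse_fac=4))
```

Under this assumption, a batch size equal to *reuse\_fac* brings the utilization to 1.0, which is the fully utilized case described above.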

The convolution computation performed in each PE exploits two levels of spatial parallelism: the parallelism of IP units defined by $reuse\_fac$ and the parallelism of partial IP computation defined by $vec\_fac$. Given that the 1-D systolic PE array also exploits a temporal/pipelined parallelism of $pe\_num$, the overall parallelism of convolution computation employed in Systolic-CNN is $vec\_fac \times reuse\_fac \times pe\_num$.
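As a worked example of this product, the sketch below computes the number of multiply-accumulate (MAC) operations issued per cycle and an idealized throughput ceiling. The parameter values are the ones reported in Section 4.2 and the clock frequency is the one listed in Table 1; counting two floating-point operations per MAC and assuming every lane is busy in every cycle are assumptions of this sketch, so the result is an upper bound rather than a measured figure.

```python
def macs_per_cycle(pe_num, vec_fac, reuse_fac):
    """Overall convolution parallelism: vec_fac x reuse_fac x pe_num."""
    return pe_num * vec_fac * reuse_fac

def ideal_peak_gflops(macs, f_clk_mhz, flops_per_mac=2):
    """Idealized ceiling, assuming every MAC lane is busy in every cycle."""
    return macs * flops_per_mac * f_clk_mhz * 1e6 / 1e9

arria_macs = macs_per_cycle(pe_num=16, vec_fac=16, reuse_fac=4)
stratix_macs = macs_per_cycle(pe_num=16, vec_fac=32, reuse_fac=6)
arria_ceiling = ideal_peak_gflops(arria_macs, f_clk_mhz=202)

print(arria_macs, stratix_macs, round(arria_ceiling, 1))
```

The measured throughput reported in the paper is lower than this ceiling, which is expected since the model ignores memory stalls, pipeline fill, and layers that do not map perfectly onto the array.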

Based upon the understanding of the PE architecture, one should note that while increasing any of the three architectural parameters promises to improve the computation parallelism and the computational throughput proportionally, their impact on the required off-chip memory bandwidth differs. Increasing $vec\_fac$ increases the amount of IFM data accessed in each clock cycle, and thus has a large impact on the required off-chip memory bandwidth. Increasing $pe\_num$ increases the amount of weight data access required only at the beginning of each convolution computation, or in each clock cycle in the case of fully connected layer computation. Thus, $pe\_num$ has a large impact on the required off-chip memory bandwidth during the computation of fully connected layers. In contrast, increasing $reuse\_fac$ only changes the IFM data access pattern without affecting the amount of IFM data accessed in each clock cycle, and thus has no impact on the required off-chip memory bandwidth.

The advantages of the 1-D systolic PE array architecture include 1) limiting the fan-out at the local IFM buffer interface; 2) assuring short and local interconnects used in the FPGA implementation; 3) reducing the amount of off-chip and on-chip memory access needed by reusing and moving IFM data through shift registers. These benefits are the key to improving the scalability of Systolic-CNN, the system operating frequency and the off-chip memory bandwidth efficiency, which are all essential to the system-level performance of CNN acceleration on an FPGA computing device. Compared with 2-D systolic array-based CNN accelerator architectures [25, 27], the 1-D systolic PE array architecture of Systolic-CNN has much more simplified memory control, data organization, and local buffering schemes for handling IFM and weight data.

### 3.5 Design for Scalability

The proposed 1-D systolic PE array architecture of Systolic-CNN resolves the routing congestion problem caused by the large fan-out at the local memory buffer interface that exists in the prior work [24], which exploits spatial parallelism only using NDRange kernels. However, as the design scales up, we observe that the off-chip memory controller, i.e., the load-store unit (LSU), automatically synthesized by Intel FPGA SDK for OpenCL starts to show a large fan-out, which becomes the new bottleneck of routing congestion that prevents the design from further scaling up.

The high fan-out issue exists when the value of either $vec\_fac$ or $pe\_num$ is high. When $vec\_fac$ or $pe\_num$ is high, a large fan-out is observed at the LSU interface for loading the IFMs or the weights in parallel, respectively. Knowing that $vec\_fac$ has a much bigger impact on the required off-chip memory bandwidth than $pe\_num$ during convolution computation, one should limit the value of $vec\_fac$ to avoid a memory-bounded design regardless. Thus, $pe\_num$ and the parallel loading of weights are more likely to be the problem here. To resolve the high fan-out issue of the LSU, we propose to generate multiple LSUs to transfer the weights from the off-chip memory to local buffers in a sequential manner instead. In the case of $pe\_num = 16$, we observe that the proposed solution not only resolves the routing congestion problem but also improves the system operating frequency by 10%. This is the key to allowing users to further scale up the design to efficiently utilize the DSP blocks on an FPGA.

Figure 6: The architecture of the $n^{\text{th}}$ PE.

### 3.6 Host Kernel Design

While the FPGA kernels of Systolic-CNN are invariant to CNN models, a host kernel must be customized for deploying different CNN models onto the Systolic-CNN implementation on an OpenCL-supported FPGA computing device. The host kernel should invoke the corresponding computation kernel in Systolic-CNN just once for mapping each layer of a CNN model, depending on the CNN model structure. The CNN model parameters (filter sizes, stride, padding information, etc.) are sent from the host kernel program to the FPGA kernels at run time to control the operations of each of the invoked FPGA kernels. The run-time flexibility of Systolic-CNN allows edge users to deploy a wide range of CNN models for acceleration without changing or recompiling the FPGA kernel code or reprogramming the FPGAs. This is the key to enabling acceleration-as-a-service for CNN inference in multi-tenancy cloud/edge computing.
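The host-side control flow described above can be sketched as a per-layer dispatch loop. The code below is a mock written for illustration and does not use the OpenCL host API or reproduce the released host code: `LayerParams`, `MockAccelerator`, and `run_model` are hypothetical names, and the layer lists are abbreviated, illustrative descriptions of the first layers of two models.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class LayerParams:
    """Run-time parameters the host sends for one layer (hypothetical structure)."""
    name: str
    kernel_size: int
    stride: int
    padding: int
    relu: bool = False
    lrn: bool = False
    pool: bool = False
    eltwise: bool = False

class MockAccelerator:
    """Stand-in for the programmed FPGA; it only records what it is asked to run."""
    def __init__(self) -> None:
        self.program_count = 1  # the kernel binary is loaded once, up front
        self.log: List[Tuple[str, int, int, int]] = []

    def run_layer(self, p: LayerParams) -> None:
        # A real host would set kernel arguments and enqueue the kernels here.
        self.log.append((p.name, p.kernel_size, p.stride, p.padding))

def run_model(acc: MockAccelerator, layers: List[LayerParams]) -> None:
    """Invoke the accelerator once per layer, passing only run-time parameters."""
    for p in layers:
        acc.run_layer(p)

# Abbreviated, illustrative layer lists (not complete model definitions).
alexnet_head = [
    LayerParams("alexnet.conv1", 11, 4, 0, relu=True, lrn=True, pool=True),
    LayerParams("alexnet.conv2", 5, 1, 2, relu=True, lrn=True, pool=True),
]
resnet_head = [LayerParams("resnet.conv1", 7, 2, 3, relu=True, pool=True)]

acc = MockAccelerator()
run_model(acc, alexnet_head)
run_model(acc, resnet_head)  # a different model reuses the same programmed device
print(acc.program_count, [entry[0] for entry in acc.log])
```

The point of the sketch is that switching models changes only the list of parameters passed by the host, while the count of device programming steps stays at one.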

## 4 EXPERIMENTS

### 4.1 Experimental Setup

We use two different settings to conduct experiments for edge and cloud computing scenarios. The experiments reflecting the edge computing use cases are conducted on an Intel Arria 10 GX FPGA Development board that is equipped with an Intel 10AX115S2F45I1SG FPGA and 2GB DDR4 SDRAM with a maximum memory bandwidth of 19.2 GB/s. We use Intel FPGA SDK for OpenCL version Pro 18.0 for kernel compilation and deployment. The experiments reflecting the cloud computing use cases are conducted on a BittWare 520N FPGA accelerator card that is equipped with a Stratix 10 GX2800 FPGA and 32GB DDR4 SDRAM with a maximum data rate of 2400 MT/s. We use Intel FPGA SDK for OpenCL version Pro 19.4 for kernel compilation and deployment.

Systolic-CNN adopts the single-precision floating-point data format for the sake of run-time flexibility: it maintains a sufficiently large dynamic range for supporting different CNN models (including error-sensitive applications, such as industrial robots or medical-related applications [21, 26]) at run time.

### 4.2 Design Space Exploration

A key design target of Systolic-CNN is to efficiently utilize the available DSP resources on an FPGA to maximize the computation parallelism and computational throughput for CNN inference subject to the available off-chip memory bandwidth. In this section, we use AlexNet as an example to demonstrate the design space exploration of the three architectural parameters with respect to the Intel Arria 10 GX FPGA Development board. Given their different impact on the off-chip memory bandwidth requirement, the values of the three architectural parameters shall be determined in the order of 1) $vec\_fac$, 2) $pe\_num$, and 3) $reuse\_fac$.

Figure 7: Runtime (ms) of the FC6 and FC7 layers in AlexNet at different $pe\_num$ ($vec\_fac=16$, $reuse\_fac=1$) on Arria 10 FPGA board.

Figure 8: Inference latency (ms/img) of AlexNet and DSP block utilization (%) at different $reuse\_fac$ ($vec\_fac=16$, $pe\_num=16$) on Arria 10 FPGA board.

**4.2.1 $vec\_fac$.** $vec\_fac$ determines the parallelism of IFM data access from the off-chip memory to the shift-register-based IFM buffer per clock cycle, and thus has a large impact on the off-chip memory bandwidth. As a result, the value of $vec\_fac$ should depend on the per-cycle burst width of data access allowed by the off-chip memory and the bit width of the IFM data. Specifically, the optimal value of $vec\_fac$ can be calculated as $vec\_fac = burstWidth/bitWidth$. With $vec\_fac$ set by the equation above, there is no memory stalling when the off-chip memory access of IFM data happens every clock cycle (the convolution kernels operate with a minimum initiation interval of 1 cycle), which guarantees a high off-chip memory bandwidth efficiency. Since the burst width of data access allowed by the off-chip memory on the Intel Arria 10 GX FPGA Development board is 512 bits and the bit width of IFM data is 32 bits based on the single-precision floating-point data format, we set the value of $vec\_fac$ to 16 in our experiments.
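The $vec\_fac$ rule can be written as a short calculation. The sketch below applies $vec\_fac = burstWidth/bitWidth$ to the Arria 10 numbers given in the text and, as an additional rough check, estimates the IFM read bandwidth when one $vec\_fac$-wide vector is fetched per kernel clock cycle. The bandwidth model and the use of the 202 MHz clock from Table 1 are assumptions of this sketch, and the function names are hypothetical.

```python
def optimal_vec_fac(burst_width_bits, bit_width_bits):
    """vec_fac = burstWidth / bitWidth, assuming the widths divide evenly."""
    if burst_width_bits % bit_width_bits != 0:
        raise ValueError("burst width must be a multiple of the data bit width")
    return burst_width_bits // bit_width_bits

def ifm_read_bandwidth_gb_s(vec_fac, bit_width_bits, f_clk_mhz):
    """Rough IFM read bandwidth if vec_fac values are fetched every clock cycle."""
    bytes_per_cycle = vec_fac * bit_width_bits / 8
    return bytes_per_cycle * f_clk_mhz * 1e6 / 1e9

vec_fac = optimal_vec_fac(burst_width_bits=512, bit_width_bits=32)
ifm_bw = ifm_read_bandwidth_gb_s(vec_fac, bit_width_bits=32, f_clk_mhz=202)
print(vec_fac, round(ifm_bw, 3))
```

Under these assumptions the estimated IFM traffic stays below the 19.2 GB/s maximum memory bandwidth of the Arria 10 board, which is consistent with choosing $vec\_fac$ from the burst width alone.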

**4.2.2 $pe\_num$.** $pe\_num$ determines the parallelism of weight data access from the off-chip memory per clock cycle for fully connected layer computation, and thus has a large impact on the off-chip memory bandwidth during the computation of fully connected layers only. To determine the optimal value of $pe\_num$, we measure the runtime of the two most memory-intensive layers in AlexNet, FC6 and FC7 (FC stands for a fully connected layer), at different $pe\_num$. As shown in Fig. 7, $pe\_num$ is swept from 2 to 20 with a step size of 2, while $vec\_fac$ is fixed to 16 and $reuse\_fac$ is set to 1. The runtime of the FC6 and FC7 layers reaches the minimum at the $pe\_num$ value of 16. The increase in runtime beyond the $pe\_num$ value of 16 indicates that those cases are already memory-bounded, i.e., limited by the available off-chip memory bandwidth. Therefore, the optimal value of $pe\_num$ is determined to be 16 in our experiments.

**4.2.3 $reuse\_fac$.** $reuse\_fac$ determines the parallelism of IP units inside each PE for reusing the IFM data as well as the size of the shift-register-based IFM buffer. Since $reuse\_fac$ has no impact on the off-chip memory bandwidth requirement, the scaling of $reuse\_fac$ is not limited by the off-chip memory characteristics but only depends on the available DSP resources on an FPGA. Fig. 8 shows the inference latency of running the entire AlexNet [8] on the ImageNet dataset [5] and the DSP block utilization at different $reuse\_fac$, which is swept from 1 to 4, while $vec\_fac$ and $pe\_num$ are both fixed to 16. The DSP utilization increases and the runtime decreases, both in a linear fashion, as $reuse\_fac$ increases. In addition, the DSP utilization of 100% and the minimum runtime are achieved at the $reuse\_fac$ value of 4. The results shown in Fig. 8 illustrate the scalability of Systolic-CNN. By optimizing the three architectural parameters of Systolic-CNN following these guidelines, one can efficiently utilize the available FPGA resources to maximize the computational throughput of CNN inference subject to the available off-chip memory bandwidth.

Through the design space exploration, the optimal values of $pe\_num$, $reuse\_fac$, and $vec\_fac$ with respect to the Intel Arria 10 GX FPGA Development board are found to be 16, 4, and 16, respectively. We use the same methodology to explore the three architectural parameters for the Stratix 10 GX FPGA board. The optimal values of $pe\_num$, $reuse\_fac$, and $vec\_fac$ are found to be 16, 6, and 32, respectively.

### 4.3 Experiment Results

We measure the inference latency of the optimized Systolic-CNN accelerators on an Intel Arria 10 GX FPGA Development board and a Stratix 10 GX2800 FPGA board for running five different CNN models: AlexNet [8], ResNet-50 [7], ResNet-152 [7], RetinaNet [15], and LW-RetinaNet [14]. As Systolic-CNN is run-time-flexible, only the host kernel is updated for deploying different CNN models on a single board, without recompiling or redeploying the FPGA kernel.

The purpose of the comparison with the state of the art is not to show any performance benefit of Systolic-CNN. Rather, it is to show the run-time flexibility and scalability advantages at comparable performance, given the differences in data format, numerical precision, and computational methods used in different designs.

Table 1 shows the comparison with four prior works [1, 23–25] on OpenCL-defined FPGA accelerators for running AlexNet based on the ImageNet dataset [5] with an input size of 227×227×3. As the source code of [1, 23, 25] is unavailable, the numbers of these three works used for comparison are quoted from the original papers. As the source code of PipeCNN [24] is available, we implement PipeCNN with an 8-bit fixed-point data format on the same Intel Arria 10 GX FPGA Development board with the maximum computation parallelism that can be mapped by the tool and measure the inference latency based on this implementation for a fair comparison. As shown in Table 1, the PipeCNN [24] implementation can only achieve a limited DSP block utilization of 25% ($vec\_size=16$, $CU\_NUM=16$). In our experiment, we observe that the tool fails to map the design with higher parallelism, mainly because of the large fan-out issue at the local memory buffer interfaces that causes routing congestion.

**Table 1: Comparison with Prior OpenCL-based FPGA Accelerators for AlexNet.**

| Work                 | [1]             | [25]            | [24]            | [23]           | This work                  |
|----------------------|-----------------|-----------------|-----------------|----------------|----------------------------|
| FPGA                 | Arria 10 GT1150 | Arria 10 GT1150 | Arria 10 GX1150 | Stratix-V GSD8 | Arria 10 GX1150            |
| CNN Model            | AlexNet         | AlexNet         | AlexNet         | AlexNet        | AlexNet                    |
| Data Format          | 16-bit float    | 32-bit float    | 8-bit fixed     | 8/16-bit fixed | 32-bit float               |
| Logic Utilization    | 246K (58%)      | 350K (82%)      | 105K (25%)      | N/A            | 250K (59%)                 |
| Memory Utilization   | 2487 (92%)      | 2360 (86%)      | 641 (24%)       | N/A            | 2472 (91%)                 |
| DSP Utilization      | 1476 (97%)      | 1290 (85%)      | 377 (25%)       | N/A            | 1518 (100%)                |
| Inference Latency    | 1ms             | 4ms             | 22ms            | 20ms           | 10ms/7ms (non-batch/batch) |
| $f_{CLK}$            | 303MHz          | 239MHz          | 250MHz          | 150MHz         | 202MHz                     |
| Recompilation Time   | N/A             | N/A             | 3 hr            | N/A            | 0 hr                       |
| Winograd             | Yes             | Yes             | No              | No             | No                         |
| Run-time Flexibility | No              | No              | No              | No             | Yes                        |

Our Systolic-CNN accelerator outperforms the prior work in [24] and [23] by 6.1x and 5.5x, respectively, in terms of inference latency. It should be noted that the Stratix-V FPGA used in [23], although running at a lower system frequency, has more logic, memory, and DSP block resources than the Arria 10 FPGA that we use. The prior work in [25] shows a 2.5x better inference latency than our Systolic-CNN accelerator. This is because the Winograd transformation [11] is adopted in [25], which promises to reduce the computational complexity of a convolution layer by a factor of 4x [12] to further accelerate CNN inference. The work in [1] is evaluated only on AlexNet, so its performance on other models is unknown. As a rough estimation, by introducing the 16-bit floating-point format, the Winograd transformation [12], and the batch processing mode as [1] does, we can improve latency by 2x (estimated), 4x (estimated), and 1.3x (actual), respectively. The total improvement (around 10x) can fill the current gap between [1] and ours. While the prior work in [25] fails to fully utilize the available DSP block resources on the FPGA, Systolic-CNN shows better scalability and can achieve up to 100% utilization of the DSP block resources to take full advantage of the FPGA device capability. In addition, while the OpenCL kernels of all the prior works are model-specific, Systolic-CNN is invariant to CNN models and has the run-time flexibility needed for handling the dynamic workload of accelerating different CNN models in multi-tenancy cloud/edge computing without recompiling or redeploying the FPGA kernel.

The Systolic-CNN results in Table 1 are measured with the batch processing mode turned both on and off (batch size = 1). The batch processing mode of Systolic-CNN can efficiently reduce the average latency of fully-connected layer computation. Since AlexNet has intensive computation in the fully-connected layers, one can enable the batch processing mode in Systolic-CNN (batch size = $reuse\_fac = 4$ ) to improve the inference latency of the fully-connected layers by 4x, which in turn improves the average inference latency of the entire AlexNet by 1.3x. The comparison results in Table 1 also reflect the overhead of enabling run-time flexibility.
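The relationship between the 4x fully-connected-layer speedup and the 1.3x end-to-end speedup can be checked with Amdahl's law. The sketch below is only an illustration, not part of the Systolic-CNN code base: the function names are hypothetical, and it assumes that batching affects the fully-connected layers only and leaves all other layers unchanged.

```python
def fraction_from_speedups(part_speedup, overall_speedup):
    """Solve Amdahl's law for the fraction f of baseline latency spent in
    the accelerated part, given:

        overall_speedup = 1 / ((1 - f) + f / part_speedup)
    """
    if part_speedup <= 1.0 or overall_speedup < 1.0:
        raise ValueError("speedups must satisfy part > 1 and overall >= 1")
    return (1.0 - 1.0 / overall_speedup) / (1.0 - 1.0 / part_speedup)


def overall_speedup(fraction, part_speedup):
    """Amdahl's law in the forward direction."""
    return 1.0 / ((1.0 - fraction) + fraction / part_speedup)


# Figures quoted in the text: 4x on the fully-connected layers, 1.3x overall.
fc_fraction = fraction_from_speedups(part_speedup=4.0, overall_speedup=1.3)
print(f"Implied fully-connected share of baseline latency: {fc_fraction:.1%}")
```

Under these assumptions, the two reported speedups imply that the fully-connected layers account for roughly 31% of the AlexNet latency when batching is off.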

We also compare the inference performance of running ResNet-50 and ResNet-152 [7] on ImageNet ( $224 \times 224 \times 3$ ) [5] classification tasks on Systolic-CNN with that of prior FPGA-based accelerators, as shown in Table 2. Here, we mainly compare with two prior works [2, 17] that achieve 100% DSP resource utilization. [17] is an RTL-level fine-grained accelerator design with its design variables quantitatively investigated, while [2] focuses more on the off-chip feature map traffic and uses a high-level synthesis (HLS) design flow. There are also other works that design OpenCL-based FPGA accelerators for ResNet models, such as [4]. Since [4] provides neither latency information nor open-source code, we do not include it in the comparison. For Systolic-CNN, we use the same kernel as the one used for AlexNet to run the ResNet-50 and ResNet-152 models with no need for recompilation. In terms of data format and accuracy, the 32-bit floating-point Systolic-CNN has no accuracy degradation, while the other two works [2, 17], which use a 16-bit fixed-point data format, can incur an accuracy drop of up to 2%. As CNNs grow deeper, they increasingly target error-sensitive applications, and Systolic-CNN is the more suitable choice for supporting such applications in a multi-tenancy cloud/edge computing environment. [17] performs 6x better than our design in terms of inference latency, which reflects the performance gap between the two data formats. As converting from 32-bit floating-point to 32-bit fixed-point can introduce a 2.5x speedup [19] and moving from 32-bit to 16-bit fixed-point can offer another 2x speedup, the resulting 5x speedup in total can almost fill the latency gap between [17] and ours. At the same time, Systolic-CNN requires no recompilation and incurs zero accuracy degradation. [2] also shows a 6x better inference latency than our design. Besides the data format difference between [2] and our work, [2] has about 2x the DSP block resources on its FPGA board. Considering both the data format and the on-board DSP resource projection, Systolic-CNN performs better than [2] in terms of both latency and accuracy.
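The projection argument above can be reproduced from the numbers in Table 2. The following is a back-of-the-envelope sketch rather than a performance model: the variable names are illustrative, and it assumes that latency scales linearly with the DSP block count and ignores the clock frequency difference between the boards.

```python
# Latencies in ms, taken from Table 2.
ours_ms = {"ResNet-50": 84.0, "ResNet-152": 202.0}
prior_ms = {
    ("[17]", "ResNet-50"): 13.0,
    ("[17]", "ResNet-152"): 32.0,
    ("[2]", "ResNet-152"): 35.0,
}

# Measured latency gap: how many times faster each prior work is.
measured_gap = {
    key: ours_ms[key[1]] / latency for key, latency in prior_ms.items()
}

# Data-format speedups quoted in the text.
FLOAT32_TO_FIXED32 = 2.5  # from [19]
FIXED32_TO_FIXED16 = 2.0
format_speedup = FLOAT32_TO_FIXED32 * FIXED32_TO_FIXED16

# DSP block counts from Table 2 (Virtex-7 485T vs. Arria 10 GX1150).
# Assumption: throughput scales linearly with the number of DSP blocks.
dsp_ratio = 2800 / 1518
projected_gain_vs_ref2 = format_speedup * dsp_ratio

for (work, model), gap in measured_gap.items():
    print(f"{work} on {model}: {gap:.2f}x faster than this work")
print(f"Data-format projection: {format_speedup:.1f}x")
print(f"Data-format and DSP projection vs. [2]: {projected_gain_vs_ref2:.2f}x")
```

With these inputs, the measured gaps are about 6.5x and 6.3x against [17] and about 5.8x against [2], while the combined data-format and DSP projection for [2] is about 9.2x, which exceeds the measured gap.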

Table 3 summarizes the performance of the Systolic-CNN accelerator evaluated on five different CNN models: AlexNet, ResNet-50, ResNet-152, RetinaNet, and light-weight RetinaNet (LW-RetinaNet), on the Intel Arria 10 and Stratix 10 FPGAs. The evaluation on the same FPGA is done without any recompilation. The inference latency of RetinaNet/LW-RetinaNet is measured on the COCO dataset [16] with an input size of $800 \times 800 \times 3$ for the object detection task. The DSP block utilization of both implementations is over 90%, which validates the efficiency of the proposed architecture parameter exploration. In addition, we can see a consistent latency improvement of roughly 2x-3x for the same model when moving from the Arria 10 board to the Stratix 10 board, reflecting the scalability of the proposed Systolic-CNN.

**Table 2: ResNet Inference Comparison of FPGA-based Accelerators with 100% DSP Resource Utilization.**

| Work                  | [17]             | [17]             | [2]              | This work        | This work        |
|-----------------------|------------------|------------------|------------------|------------------|------------------|
| CNN Model             | ResNet-50        | ResNet-152       | ResNet-152       | ResNet-50        | ResNet-152       |
| Data Format           | 16-bit fixed     | 16-bit fixed     | 16-bit fixed     | 32-bit float     | 32-bit float     |
| FPGA                  | Arria 10 GX1150  | Arria 10 GX1150  | Virtex-7 485T    | Arria 10 GX1150  | Arria 10 GX1150  |
| Logic Utilization     | 221K/427K (52%)  | 235K/427K (55%)  | 372K/433K (86%)  | 250K/427K (59%)  | 250K/427K (59%)  |
| Memory Utilization    | 1931/2713 (71%)  | 2365/2713 (87%)  | 2039/2060 (99%)  | 2472/2713 (91%)  | 2472/2713 (91%)  |
| DSP Utilization       | 1518/1518 (100%) | 1518/1518 (100%) | 2800/2800 (100%) | 1518/1518 (100%) | 1518/1518 (100%) |
| $f_{CLK}$             | 200 MHz          | 200 MHz          | 150 MHz          | 202 MHz          | 202 MHz          |
| Inference Latency     | 13 ms            | 32 ms            | 35 ms            | 84 ms            | 202 ms           |
| Accuracy Degradation  | <2%              | <2%              | <1%              | 0%               | 0%               |
| Implementation Method | Verilog          | Verilog          | C/C++ HLS        | OpenCL           | OpenCL           |
| Winograd              | No               | No               | No               | No               | No               |
| Recompilation         | Yes              | Yes              | Yes              | No               | No               |

**Table 3: Inference Performance of Running Different Models on Systolic-CNN Accelerators.**

| FPGA Board        | Arria 10 GX1150  |           |            |           |              | Stratix 10 GX2800 |           |            |           |              |
|-------------------|------------------|-----------|------------|-----------|--------------|-------------------|-----------|------------|-----------|--------------|
| Logic Utilization | 250K/427K (59%)  |           |            |           |              | 562K/933K (60%)   |           |            |           |              |
| Mem. Utilization  | 2472/2713 (91%)  |           |            |           |              | 9611/11721 (82%)  |           |            |           |              |
| DSP Utilization   | 1518/1518 (100%) |           |            |           |              | 5240/5760 (91%)   |           |            |           |              |
| $f_{CLK}$         | 200 MHz          |           |            |           |              | 172 MHz           |           |            |           |              |
| CNN Model         | AlexNet          | ResNet-50 | ResNet-152 | RetinaNet | LW-RetinaNet | AlexNet           | ResNet-50 | ResNet-152 | RetinaNet | LW-RetinaNet |
| GFLOPs            | 1.4              | 8         | 22         | 312       | 178          | 1.4               | 8         | 22         | 312       | 178          |
| Latency (ms)      | 7                | 84        | 202        | 1615      | 900          | 2                 | 33        | 73         | 873       | 498          |

In summary, when mapped with the single-precision floating-point data format, our Systolic-CNN accelerator can achieve an average inference latency of 7ms/2ms, 84ms/33ms, 202ms/73ms, 1615ms/873ms, and 900ms/498ms per image for running AlexNet, ResNet-50, ResNet-152, RetinaNet, and Light-weight RetinaNet on the Arria/Stratix 10 FPGA boards, respectively. The peak computational throughput is measured at 80-210 GFLOPS and 242-700 GFLOPS for accelerating different single-precision CNN models on the Arria 10 and Stratix 10 FPGA boards, respectively. Since the current Systolic-CNN architecture is compatible with Winograd-based convolutions, we would like to explore adding support for Winograd-based CNN models to further improve its inference latency as future work.
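As a sanity check, the throughput figures can be approximated from Table 3 by dividing each model's operation count (the GFLOPs row) by its latency. The sketch below uses the rounded table values and illustrative names, so its results land near, but not exactly on, the reported ranges, which were presumably derived from unrounded measurements.

```python
# Per-image operation counts (GFLOP) and latencies (ms) from Table 3.
OPS_GFLOP = {
    "AlexNet": 1.4, "ResNet-50": 8, "ResNet-152": 22,
    "RetinaNet": 312, "LW-RetinaNet": 178,
}
LATENCY_MS = {
    "Arria 10": {"AlexNet": 7, "ResNet-50": 84, "ResNet-152": 202,
                 "RetinaNet": 1615, "LW-RetinaNet": 900},
    "Stratix 10": {"AlexNet": 2, "ResNet-50": 33, "ResNet-152": 73,
                   "RetinaNet": 873, "LW-RetinaNet": 498},
}


def throughput_gflops(ops_gflop, latency_ms):
    """Effective throughput in GFLOPS: operations divided by latency in seconds."""
    return ops_gflop / (latency_ms / 1000.0)


throughput = {
    board: {m: throughput_gflops(OPS_GFLOP[m], ms) for m, ms in lat.items()}
    for board, lat in LATENCY_MS.items()
}

# Latency improvement when moving the same model from Arria 10 to Stratix 10.
speedup = {
    m: LATENCY_MS["Arria 10"][m] / LATENCY_MS["Stratix 10"][m]
    for m in OPS_GFLOP
}

for board, per_model in throughput.items():
    lo, hi = min(per_model.values()), max(per_model.values())
    print(f"{board}: {lo:.0f}-{hi:.0f} GFLOPS")
for model, ratio in speedup.items():
    print(f"{model}: {ratio:.2f}x faster on Stratix 10")
```

From the rounded values, the Stratix 10 range comes out at about 242-700 GFLOPS and the Arria 10 range at about 95-200 GFLOPS, with per-model board-to-board latency ratios between roughly 1.8x and 3.5x.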

## 5 CONCLUSION

In this paper, we present Systolic-CNN, an OpenCL-defined scalable, run-time-flexible FPGA accelerator architecture for accelerating CNN inference in cloud/edge computing. Systolic-CNN adopts a highly pipelined and paralleled 1-D systolic array architecture, which efficiently explores both spatial and temporal parallelism for accelerating CNN inference on FPGAs. Systolic-CNN is highly scalable and parameterized with three architectural parameters. By optimizing the architectural parameters, one can efficiently utilize the available DSP block resources to maximize the computational throughput of CNN inference, subject to the available off-chip memory bandwidth of a given FPGA computing device. The experiment results based on Intel Arria/Stratix 10 GX FPGA Development boards for accelerating five different CNN models validate the scalability and run-time flexibility of Systolic-CNN, which make it suitable for providing acceleration-as-a-service in cloud/edge computing. The invariance of the Systolic-CNN architecture design to CNN models is the key to avoiding the long FPGA compilation time when handling the dynamic workload of accelerating different CNN models in a multi-tenancy cloud/edge computing environment.

## REFERENCES

- [1] Utku Aydonat, Shane O’Connell, Davor Capalija, Andrew C Ling, and Gordon R Chiu. 2017. An OpenCL™ deep learning accelerator on Arria 10. In *Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays*. ACM, 55–64.
- [2] Arash Azizimazreah and Lizhong Chen. 2019. Shortcut mining: Exploiting cross-layer shortcut reuse in DNN accelerators. In *2019 IEEE International Symposium on High Performance Computer Architecture (HPCA)*. IEEE, 94–105.
- [3] Donald G Bailey. 2015. The advantages and limitations of high level synthesis for FPGA based image processing. In *Proceedings of the 9th International Conference on Distributed Smart Cameras*. ACM, 134–139.
- [4] Philip Colangelo, Nasibeh Nasiri, Eriko Nurvitadhi, Asit Mishra, Martin Margala, and Kevin Nealis. 2018. Exploration of low numeric precision deep learning inference using Intel® FPGAs. In *2018 IEEE 26th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)*. IEEE, 73–80.

- [5] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In *2009 IEEE conference on computer vision and pattern recognition*. IEEE, 248–255.
- [6] Kaiyuan Guo, Lingzhi Sui, Jiantao Qiu, Song Yao, Song Han, Yu Wang, and Huazhong Yang. 2016. Angel-Eye: A complete design flow for mapping CNN onto customized hardware. In *2016 IEEE Computer Society Annual Symposium on VLSI (ISVLSI)*. IEEE, 24–29.
- [7] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In *Proceedings of the IEEE conference on computer vision and pattern recognition*. 770–778.
- [8] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. ImageNet classification with deep convolutional neural networks. In *Advances in neural information processing systems*. 1097–1105.
- [9] Hsiang-Tsung Kung. 1982. Why systolic architectures? *IEEE Computer* 15, 1 (1982), 37–46.
- [10] Sun Yuan Kung. 1988. *VLSI array processors*. Prentice Hall, Englewood Cliffs, NJ.
- [11] Andrew Lavin and Scott Gray. 2016. Fast algorithms for convolutional neural networks. In *Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition*. 4013–4021.
- [12] Andrew Lavin and Scott Gray. 2016. Fast algorithms for convolutional neural networks. In *Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition*. 4013–4021.
- [13] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. Deep learning. *Nature* 521, 7553 (2015), 436–444.
- [14] Yixing Li and Fengbo Ren. 2019. Light-Weight RetinaNet for Object Detection. *arXiv preprint arXiv:1905.10011* (2019).
- [15] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. 2017. Focal loss for dense object detection. In *Proceedings of the IEEE international conference on computer vision*. 2980–2988.
- [16] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. 2014. Microsoft COCO: Common objects in context. In *European conference on computer vision*. Springer, 740–755.
- [17] Yufei Ma, Yu Cao, Sarma Vrudhula, and Jae-sun Seo. 2018. Optimizing the convolution operation to accelerate deep neural networks on FPGA. *IEEE Transactions on Very Large Scale Integration (VLSI) Systems* 26, 7 (2018), 1354–1367.
- [18] Duy Thanh Nguyen, Tuan Nghia Nguyen, Hyun Kim, and Hyuk-Jae Lee. 2019. A High-Throughput and Power-Efficient FPGA Implementation of YOLO CNN for Object Detection. *IEEE Transactions on Very Large Scale Integration (VLSI) Systems* (2019).
- [19] Michael Parker. 2014. Understanding peak floating-point performance claims. *Technical White Paper WP-01220-1.0* (2014).
- [20] Joseph Redmon and Ali Farhadi. 2017. YOLO9000: better, faster, stronger. In *Proceedings of the IEEE conference on computer vision and pattern recognition*. 7263–7271.
- [21] Benjamin A Rizkin, Karina Popovich, and Ryan L Hartman. 2019. Artificial Neural Network control of thermoelectrically-cooled microfluidics using computer vision based on IR thermography. *Computers & Chemical Engineering* 121 (2019), 584–593.
- [22] Roman A Solovyev, Alexandre A Kalinin, Alexander G Kustov, Dmitry V Telpukhov, and Vladimir S Ruhlov. 2018. FPGA implementation of convolutional neural networks with fixed-point calculations. *arXiv preprint arXiv:1808.09945* (2018).
- [23] Naveen Suda, Vikas Chandra, Ganesh Dasika, Abinash Mohanty, Yufei Ma, Sarma Vrudhula, Jae-sun Seo, and Yu Cao. 2016. Throughput-optimized OpenCL-based FPGA accelerator for large-scale convolutional neural networks. In *Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays*. ACM, 16–25.
- [24] Dong Wang, Ke Xu, and Diankun Jiang. 2017. PipeCNN: An OpenCL-based open-source FPGA accelerator for convolution neural networks. In *2017 International Conference on Field Programmable Technology (ICFPT)*. IEEE, 279–282.
- [25] Xuechao Wei, Cody Hao Yu, Peng Zhang, Youxiang Chen, Yuxin Wang, Han Hu, Yun Liang, and Jason Cong. 2017. Automated systolic array architecture synthesis for high throughput CNN inference on FPGAs. In *Proceedings of the 54th Annual Design Automation Conference 2017*. ACM, 29.
- [26] Weibin Yan. 2019. Study on the Application of Computer Vision Technology in Defect Detection. (2019).
- [27] Jiaxi Zhang, Wentai Zhang, Guojie Luo, Xuechao Wei, Yun Liang, and Jason Cong. 2019. Frequency Improvement of Systolic Array-Based CNNs on FPGAs. In *2019 IEEE International Symposium on Circuits and Systems (ISCAS)*. IEEE, 1–4.