

## Journal Pre-proof

A Fully Pipelined FPGA Accelerator for Scale Invariant Feature Transform Keypoint Descriptor Matching

Luka Daoud, Muhammad Kamran Latif, H.S. Jacinto, Nader Rafla

PII: S0141-9331(19)30080-8  
DOI: <https://doi.org/10.1016/j.micpro.2019.102919>  
Reference: MICPRO 102919



To appear in: *Microprocessors and Microsystems*

Received date: 8 February 2019  
Revised date: 16 September 2019  
Accepted date: 16 October 2019

Please cite this article as: Luka Daoud, Muhammad Kamran Latif, H.S. Jacinto, Nader Rafla, A Fully Pipelined FPGA Accelerator for Scale Invariant Feature Transform Keypoint Descriptor Matching, *Microprocessors and Microsystems* (2019), doi: <https://doi.org/10.1016/j.micpro.2019.102919>

This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition of a cover page and metadata, and formatting for readability, but it is not yet the definitive version of record. This version will undergo additional copyediting, typesetting and review before it is published in its final form, but we are providing this version to give early visibility of the article. Please note that, during the production process, errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

© 2019 Published by Elsevier B.V.

# A Fully Pipelined FPGA Accelerator for Scale Invariant Feature Transform Keypoint Descriptor Matching

Luka Daoud, Muhammad Kamran Latif, H S. Jacinto, Nader Rafla

*Electrical and Computer Engineering  
Boise State University  
Boise, ID 83725, USA*

## Abstract

The scale invariant feature transform (SIFT) algorithm is considered a classical feature extraction algorithm within the field of computer vision. The SIFT keypoint descriptor matching is a computationally intensive process due to the amount of data consumed. In this paper, we designed a fully pipelined hardware accelerator architecture for the SIFT keypoint descriptor matching. It was implemented and tested on a field programmable gate array (FPGA). The proposed hardware architecture is able to properly handle the memory bandwidth necessary for a fully-pipelined implementation and hit the roofline performance model achieving the potential maximum throughput. The fully pipelined matching architecture was designed based on cosine angle distance approach. It was optimized for 16-bit fixed-point operations and implemented on hardware using Xilinx Zynq-based FPGA development board. Our proposed architecture showed a noticeable reduction of area resources compared with its counterparts in the literature maintaining high throughput by alleviating the memory bandwidth restrictions. The results showed reduction in device-resources up to 91% in LUTs and 79% of BRAMs. Our hardware implementation is  $15.7\times$  faster than the comparable software approach.

*Keywords:* Scale Invariant Feature Transform, SIFT, Matching algorithm, FPGA, Pipeline, Acceleration, High Level Synthesis, HLS.

## <sup>1</sup> 1. Introduction

<sup>2</sup> Object recognition using feature-based algorithms are generally computationally intensive.  
<sup>3</sup> The scale-invariant feature transform (SIFT) algorithm proposed in 1999 by David  
<sup>4</sup> Lowe [1], is a classical and well-known algorithm within the field of computer vision. SIFT  
<sup>5</sup> algorithm is a feature-based algorithm that can be applied in object recognition. The best  
<sup>6</sup> candidate match for a SIFT keypoint is found by identifying its nearest-neighbor in the

---

*Email addresses:* LukaDaoud@u.boisestate.edu (Luka Daoud), MuhammadLatif@u.boisestate.edu (Muhammad Kamran Latif), SheltonJacinto@u.boisestate.edu (H S. Jacinto), nrafla@boisestate.edu (Nader Rafla)

7 keypoint database. The matching process often involves operating on data-at-rest but more  
 8 recently real-time applications using feature-based object recognition have gained popularity.  
 9 Feature extraction based object recognition is an approach commonly applied in several  
 10 varying applications such as medical imaging [2], satellite imaging [3], facial recognition [4],  
 11 and the landing of unmanned aerial vehicles (UAVs) [5].

12 Various steps in the extracting SIFT descriptors often require the use of complex software  
 13 routines that require intensive computations [1]. However, in a running scenario of keypoint  
 14 extraction, the extraction only occurs once per test image. The limitations of keypoint  
 15 descriptor matching thus requires that matching must be performed every time a test image  
 16 is compared with a possible match in the database. Each time the database needs to be  
 17 accessed, the overall matching time for the test image increases as the overall size of the  
 18 database grows.

19 The SIFT descriptor matching is based on the nearest-neighbor algorithm [1] where, for  
 20 a single test keypoint descriptor match, the Euclidean distances [6] of the test descriptor are  
 21 calculated between each descriptor in the descriptor database. The calculated distances are  
 22 then sorted such that the minimum and second minimum distances are found. A positive  
 23 match between the test descriptor and the descriptor database is found if the Euclidean  
 24 distance ratio is above a pre-set threshold, suggested by David Lowe in [1].

Since a SIFT keypoint descriptor is an array of 128 elements, calculated based on all pixels of an image around the centered keypoint in a  $16 \times 16$  sliding window. The generated descriptor by this method can be defined mathematically as:

$$d_k^\alpha = \{f_{k,1}^\alpha, f_{k,2}^\alpha, \dots, f_{k,128}^\alpha\} .$$

The Euclidean distance between two descriptors,  $d_k^\alpha$  and  $d_m^\beta$ , is thus calculated:

$$\sum_{i=1}^{128} \frac{(f_{k,i}^\alpha - f_{m,i}^\beta)^2}{(f_{k,i}^\alpha + f_{m,i}^\beta)}.$$

25 In the process of matching a descriptor,  $d_k^\alpha$ , with a database, the Euclidean distances of  $d_k^\alpha$   
 26 in relation to the database's descriptors is calculated. The process of calculating Euclidean  
 27 distances is computationally intensive however, resource consumption can effectively be re-  
 28 duced by changing the calculation of Euclidean distance. Instead of using a conservative  
 29 approach of calculating the Euclidean distance as mentioned, a cosine angle distances can  
 30 be calculated between the descriptors [7]. Since SIFT descriptors are normalized during  
 31 keypoint extraction, calculating the angular distances by taking the arc-cosine of the dot-  
 32 products of normalized descriptors prove to be a close approximation for Euclidean distances  
 33 [7]. Utilizing a method of angular distance will significantly reduce the hardware resource  
 34 consumption.

35 If an image of  $m$  descriptors is represented by a matrix of size  $m \times 128$ , there is a recur-  
 36 rent redundancy of memory access for descriptors of an image and the descriptor database.  
 37 Memory access times for descriptors further vary based on locality thus, in a software ap-  
 38 proach, memory access time becomes variant that may impact on both timing and resource  
 39 overheads for the SIFT descriptors matching.

<sup>40</sup> In this paper, our proposed SIFT descriptors matching architecture is designed and  
<sup>41</sup> implemented with the purpose of accelerating the matching process, handling the memory  
<sup>42</sup> bandwidth limitation, and reducing the area resources. The contributions of this paper are  
<sup>43</sup> summarized as following:

- <sup>44</sup> • A hardware implementation of the SIFT keypoint descriptor matching based on cosine  
<sup>45</sup> angle distance on FPGA including:
  - <sup>46</sup> – A fully pipelined architecture.
  - <sup>47</sup> – Minimal resource utilization.
  - <sup>48</sup> – High throughput hardware accelerator.
- <sup>49</sup> • Resulting analysis of memory bandwidth usage and its effect on the overall computa-  
<sup>50</sup> tional performance.

<sup>51</sup> The rest of this paper is organized as follows: Section 2 summarizes the literature review  
<sup>52</sup> and the related work of SIFT descriptors matching on accelerating platforms. Section 3 pro-  
<sup>53</sup> vides background and related definitions along with the software approach of the matching  
<sup>54</sup> algorithm based on calculating cosine angle distances. Section 4 studies the computation  
<sup>55</sup> and memory bandwidth optimization. Section 5 presents our proposed matching architec-  
<sup>56</sup> ture on FPGA. Section 6 evaluates our proposed matching architecture and provides the  
<sup>57</sup> experimental results. Finally, Section 7 concludes the paper.

## <sup>58</sup> 2. Related Work

<sup>59</sup> This section particularly focuses on different approaches of calculating nearest-neighbor  
<sup>60</sup> distances for descriptors matching algorithms on FPGA-based accelerators. It also provides  
<sup>61</sup> a brief overview of the matching implementations on other platforms.

<sup>62</sup> There have been several hardware-based implementations of descriptors matching on  
<sup>63</sup> FPGA [8, 9, 10, 11]. Most recently, Vourvoulakis et al. [8] proposed an FPGA-based archi-  
<sup>64</sup> tecture for SIFT descriptors matching based on the calculation of the distances between the  
<sup>65</sup> descriptors in the database. The similarity between the descriptors was determined based on  
<sup>66</sup> the minimum value of SAD (Sum of Absolute Distances) calculators. Their implementa-  
<sup>67</sup> tion was based on comparing the currently extracted descriptor with 128 previously detected  
<sup>68</sup> ones to find a potential match. The authors proposed a moving window of 16 descriptors to  
<sup>69</sup> fit the entire matching architecture on an FPGA. In their implementation, a total 8 clock-  
<sup>70</sup> cycles were required to calculate 128 SAD values to report a potential match using a single  
<sup>71</sup> matching core that required significant memory resources.

<sup>72</sup> Lentaris et al. [9] implemented a pipelined architecture for SIFT descriptor matching  
<sup>73</sup> using the Euclidean norm for computing distances between descriptors. In their implemen-  
<sup>74</sup> tation, a finite state machine fetches all the descriptors from the test image  $d_{k,i}^{\alpha}$  and the  
<sup>75</sup> descriptors from the database  $d_{k,i}^{\beta}$  stored in memory one by one. The descriptor pair is  
<sup>76</sup> passed to a chi-square distance state, where the similarity of the two descriptors was eval-  
<sup>77</sup> uated by calculating the distance between them. The distance calculating state consists of

78 128 chi-square ( $\chi^2$ ) calculators and each calculator performs  $(d_{k,i}^\alpha - d_{k,i}^\beta)^2 / (d_{k,i}^\alpha + d_{k,i}^\beta)$  calculation where  $i$  is the  $i^{th}$  element of the 128-dimensional vector. Each multiplier and divider in chi-square state is 16-bit and produces a 32-bit result. The output from  $\chi^2$  calculators is summed using linear systolic array and the result is passed to matching state to keep tracking of the two best matches. At the end of the database, the distance ratio of these two matches is compared with a fixed threshold to accept or reject the best match. Their used technique of the matching algorithm by calculating the Euclidean distance necessitated more resources than our approach as explained in Section 6.

86 Wang et al. [10] proposed an embedded System-on-Chip for features detection and  
 87 matching. Their system extracts binary robust independent elementary features (BRIEF)  
 88 [12] descriptors from the detected SIFT ones. Unlike SIFT descriptors that has 128 elements,  
 89 the BRIEF descriptor is a vector of 64 elements. The BRIEF matching detection was  
 90 performed by calculating the distances between two BRIEF descriptors. A successful match  
 91 is reported if the calculated distance is smaller than a minimum threshold [10].

92 Kapela et al. [11] presented a hardware-software platform in which fast retina keypoint  
 93 (FREAK) [13] descriptors were extracted in software and matched by calculating the Ham-  
 94 ming distance which was implemented on Xilinx Zynq-7000 FPGA. Their proposed matching  
 95 core included multiple Hamming distance calculator circuits that are running in parallel to  
 96 calculate the distance between the descriptors. The overall performance of their system  
 97 depends on the number of the Hamming distance cores. Additionally, the number of LUTs  
 98 and registers increases proportionally with the number of Hamming calculators.

99 Condello et al. [14] presented an OpenCL-based feature matching algorithm that made  
 100 use of the capabilities of GPUs to speedup the matching process for speeded-up robust fea-  
 101 ture (SURF) descriptors. The matching algorithm uses Euclidean distances to calculate the  
 102 nearest neighbors for a test descriptors with the others in the database. They implemented  
 103 their matching core on NVIDIA's GTX275, which has a theoretical peak of 2760 GFlops.  
 104 However, the latency of the global memory access affected on the computation power of  
 105 the GPU where it limited the memory reuse during the distance computation step of the  
 106 matching process.

107 Fassold et al. [15] used NVIDIA's Tesla K20 GPU for the SIFT descriptor matching  
 108 by calculating the nearest-neighbors between descriptors using Euclidean distance. Their  
 109 implementation of the matching architecture on the GPU achieved 13 milliseconds for a set  
 110 of 2,800 descriptors.

111 The matching algorithm for most of the implementations is based on calculating the  
 112 nearest-neighbor distances between the current feature and the features in the database. To  
 113 the best of our knowledge, this paper is the first attempt for hardware implementation of  
 114 SIFT matching algorithm on FPGA, where the matching technique is based on calculating  
 115 the nearest-neighbor distances using cosine angle distance rather than using the traditional  
 116 descriptor distance calculations. The following part of this paper moves on to describe in  
 117 greater detail the SIFT matching algorithm based on cosine angle distance technique.

<sup>118</sup> **3. Matching Algorithm based on Cosine Angle Distance**

<sup>119</sup> *3.1. Nomenclature and Definitions*

An image is a 2-D array of pixels that carry information and keypoint descriptors are highly distinctive features in an image. A SIFT descriptor is a vector of 128 elements that describe a scale-invariant local image region. It can be given as  $d_k^\alpha$ , where  $k$  and  $\alpha$  are the  $k^{th}$  descriptor in an image  $\tilde{\alpha}$ .

$$d_k^\alpha = \{f_{k,1}^\alpha, f_{k,2}^\alpha, \dots, f_{k,128}^\alpha\},$$

where  $f_{k,i}^\alpha$  is the  $i^{th}$  element of the  $k^{th}$  descriptor of image  $\tilde{\alpha}$  and ( $0 \leq f_{k,i}^\alpha \leq 1$ ). So, descriptors of an image  $\tilde{\alpha}$  that has a  $m$  set of descriptors is described as  $d^\alpha$ :

$$d^\alpha = \begin{bmatrix} d_1^\alpha \\ d_2^\alpha \\ \vdots \\ d_m^\alpha \end{bmatrix} = \begin{bmatrix} f_{1,1}^\alpha & f_{1,2}^\alpha & f_{1,3}^\alpha & \dots & f_{1,128}^\alpha \\ f_{2,1}^\alpha & f_{2,2}^\alpha & f_{2,3}^\alpha & \dots & f_{2,128}^\alpha \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ f_{m,1}^\alpha & f_{m,2}^\alpha & f_{m,3}^\alpha & \dots & f_{m,128}^\alpha \end{bmatrix}.$$

The dot-product operation of two descriptors,  $d_l^\alpha$  and  $d_s^\beta$ , is denoted as  $dp_{l,s}^{\alpha,\beta}$ , calculated in Equation 1.

$$dp_{l,s}^{\alpha,\beta} = d_l^\alpha \odot d_s^\beta = \sum_{i=1}^{128} f_{l,i}^\alpha \cdot f_{s,i}^\beta \quad (1)$$

Thus,  $dp_k^{\alpha,\beta}$  is a dot-product of the  $k^{th}$  descriptor of image  $\tilde{\alpha}$ , with each descriptor of image  $\tilde{\beta}$ , defined as

$$dp_k^{\alpha,\beta} = \begin{bmatrix} dp_{k,1}^{\alpha,\beta} \\ dp_{k,2}^{\alpha,\beta} \\ \vdots \\ dp_{k,n}^{\alpha,\beta} \end{bmatrix} = d_k^\alpha \odot \begin{bmatrix} d_1^\beta \\ d_2^\beta \\ \vdots \\ d_n^\beta \end{bmatrix} = \begin{bmatrix} d_k^\alpha \odot d_1^\beta \\ d_k^\alpha \odot d_2^\beta \\ \vdots \\ d_k^\alpha \odot d_n^\beta \end{bmatrix} = \begin{bmatrix} \sum_{i=1}^{128} f_{k,i}^\alpha \cdot f_{1,i}^\beta \\ \sum_{i=1}^{128} f_{k,i}^\alpha \cdot f_{2,i}^\beta \\ \vdots \\ \sum_{i=1}^{128} f_{k,i}^\alpha \cdot f_{n,i}^\beta \end{bmatrix}.$$

Therefore, the dot-product of all descriptors of image  $\tilde{\alpha}$  and image  $\tilde{\beta}$  can be denoted as  $dp^{\alpha,\beta}$ , defined by

$$dp^{\alpha,\beta} = \begin{bmatrix} dp_1^{\alpha,\beta} \\ dp_2^{\alpha,\beta} \\ \vdots \\ dp_m^{\alpha,\beta} \end{bmatrix} = \begin{bmatrix} dp_{1,1}^{\alpha,\beta} & dp_{2,1}^{\alpha,\beta} & dp_{3,1}^{\alpha,\beta} & \dots & dp_{m,1}^{\alpha,\beta} \\ dp_{1,2}^{\alpha,\beta} & dp_{2,2}^{\alpha,\beta} & dp_{3,2}^{\alpha,\beta} & \dots & dp_{m,2}^{\alpha,\beta} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ dp_{1,n}^{\alpha,\beta} & dp_{2,n}^{\alpha,\beta} & dp_{3,n}^{\alpha,\beta} & \dots & dp_{m,n}^{\alpha,\beta} \end{bmatrix}^T.$$

In the SIFT matching algorithm the cosine inverse (arc-cosine), denoted by  $ci$ , of each dot-product operation is calculated. Similarly,  $ci^{\alpha,\beta}$  is the arc-cosine of  $dp^{\alpha,\beta}$ , defined as

$$ci^{\alpha,\beta} = \begin{bmatrix} ci_1^{\alpha,\beta} \\ ci_2^{\alpha,\beta} \\ \vdots \\ ci_m^{\alpha,\beta} \end{bmatrix} = \begin{bmatrix} ci_{1,1}^{\alpha,\beta} & ci_{2,1}^{\alpha,\beta} & ci_{3,1}^{\alpha,\beta} & \dots & ci_{m,1}^{\alpha,\beta} \\ ci_{1,2}^{\alpha,\beta} & ci_{2,2}^{\alpha,\beta} & ci_{3,2}^{\alpha,\beta} & \dots & ci_{m,2}^{\alpha,\beta} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ ci_{1,n}^{\alpha,\beta} & ci_{2,n}^{\alpha,\beta} & ci_{3,n}^{\alpha,\beta} & \dots & ci_{m,n}^{\alpha,\beta} \end{bmatrix}^T.$$

120 *3.2. Software Approach of the SIFT Matching Algorithm*

121 The SIFT matching algorithm iterates through several steps to check if a match of a  
 122 single descriptor of image  $\tilde{\alpha}$  corresponds with another descriptor in image  $\tilde{\beta}$ . The efficient  
 123 design of a matching algorithm depends largely on the platform in which implementation is  
 124 to occur. In this section a software approach is detailed with a description of the resulting  
 125 implementation of the SIFT matching algorithm based on our proposed angular distance  
 126 measure between descriptors.

---

**Algorithm 1** Software approach for SIFT descriptor matching.

---

**Input:**  $k$ th descriptor of image  $\alpha$  with size  $1 \times 128$   
 Database descriptors of image  $\beta$  with size  $m \times 128$

**Output:** Matched result for  $k$ th descriptor with the database descriptors

```

1: for  $j = 1$  to  $m$  do
2:   for  $k = 1$  to  $128$  do
3:      $p[i][j] += A[i][k] \cdot B[j][k];$ 
4:   end for
5:    $[sort\_vals, index] = sort(arccos(p[i][j]));$ 
6:   if  $sort\_vals(1) < (threshold * sort\_vals(2))$  then
7:     return Match Found
8:   else
9:     return No Match Found
10:  end if
11: end for
```

---

Algorithm 1 provides the computational software flow of the SIFT matching algorithm for the  $k^{th}$  descriptor,  $d_k^\alpha$ , of image  $\tilde{\alpha}$  with the database descriptors of image  $\tilde{\beta}$ ,  $d^\beta$ , where image  $\tilde{\beta}$  has  $n$  descriptors, represented as

$$d^\beta = \{d_1^\beta, d_2^\beta, \dots, d_n^\beta\}.$$

The first step of the SIFT matching algorithm is to calculate the dot-product of the descriptor,  $d_k^\alpha$ , with each descriptor in the database according to Equation 1. The result of the dot-product operation is a vector,  $dp_k^{\alpha,\beta}$ , of  $n$  elements, shown as

$$dp_k^{\alpha,\beta} = [dp_{k,1}^{\alpha,\beta}, dp_{k,2}^{\alpha,\beta}, \dots, dp_{k,n}^{\alpha,\beta}].$$

The following step is to take the arc-cosine of each element in  $dp_k^{\alpha,\beta}$ , saving the result in memory or cache, presented mathematically as

$$ci_k^{\alpha,\beta} = [ci_{k,1}^{\alpha,\beta}, ci_{k,2}^{\alpha,\beta}, \dots, ci_{k,n}^{\alpha,\beta}].$$

127 The resulting output array,  $ci_k^{\alpha,\beta}$ , is sorted in ascending order where the first and second  
128 minimums are calculated.

129 David Lowe defined a threshold criteria [1], typically 0.6, to determine matching success.  
130 Matching success is determined by the match between the  $k^{th}$  descriptor,  $d_k^\alpha$ , of image  $\tilde{\alpha}$   
131 with the database descriptors,  $d^\beta$ , of image  $\tilde{\beta}$ , according to Equation 2.

$$\begin{cases} \text{minimum} < (0.6 \times \text{second\_minimum}) & \text{Match} \\ \text{otherwise} & \text{No Match} \end{cases} \quad (2)$$

132 The calculations listed are repeated for each descriptor in image  $\tilde{\alpha}$  to determine the  
133 matching features in image  $\tilde{\beta}$ . From Algorithm 1, the SIFT matching algorithm requires  
134 an equally large number of calculations and memory resources; quickly showing large time  
135 dependency due to both calculation and memory access latency.

#### 136 4. Proposed Optimization of Memory Bandwidth

137 In this section, we study the impact of the memory bandwidth on the overall performance  
138 of the matching process and explore an optimization scheme to fully utilize the computation  
139 core and the memory bandwidth.

##### 140 4.1. Roofline Performance Model

141 Image descriptors are streamed to the SIFT descriptor matching algorithm subsystem  
142 via an attached memory to the computing core. The total memory bandwidth plays a vital  
143 role in achieving maximum performance for a given system. In order for the matching core to  
144 start processing, one descriptor for each image,  $\tilde{\alpha}$  and  $\tilde{\beta}$ , should be ready at the input ports  
145 of the matching core. We assume that the  $k^{th}$  descriptor of image  $\tilde{\alpha}$  is always ready at the  
146 input port of the computational core. Since each descriptor is composed of 128 elements<sup>1</sup>,  
147 each data transfer between memory and the computational core is 256 bytes.

148 In order to study the effect of the memory bandwidth in the overall performance of the  
149 system, let's assume that only one computational core exists in the system, that is pipelined  
150 and works at 100 MHz. To execute one operation, a full descriptor (256 bytes) should be  
151 ready at the input port of the computational core. Hence, the memory bandwidth take  
152 part in the system throughput. For example, if the memory bandwidth reaches 32 bytes  
153 per clock-cycle (3.2 GB/s), the computational core will wait for 8 clock-cycles to completely  
154 receives a single descriptor to start the process. This will achieve 12.5 Mega operation/second  
155 (M op/s). When the memory bandwidth increases to 6.4 GB/s, similarly, the performance

---

<sup>1</sup>Each element would be nominally composed of a 16-bit fixed point for the angular distance method.

156 increases to 24 M op/s. As long as the memory bandwidth increases, the performance  
 157 increases. However, the maximum attainable throughput stops at its maximum peak when  
 158 the memory bandwidth reaches 256 bytes per clock-cycle (25.6 GB/s) at the input port of  
 159 the computation core. As the memory bandwidth increases above this limit, more data is  
 160 present at the input port of the computation core but only 256 bytes are processed at a time.  
 161 Figure 1 shows the effect of the memory bandwidth on the system performance. The speed  
 162 of the computational core increases with increasing the memory bandwidth till it reaches  
 163 the boundary of the peak performance.



Figure 1: Performance and memory bandwidth effect.

164 In this paper, we implemented the matching core on Zedboard [16]. The platform includes  
 165 two DDR3 memory components. The multi-protocol DDR controller is configured for 32-bit  
 166 wide accesses to a 512 MB address space. For 32-bit data width access of the DDR memory,  
 167 64 bits (8 bytes) are accessed in one clock-cycle. This limits the performance of the matching  
 168 core to 100/32 M op/s for 100 MHz running clock, where the core waits for 32 clock-cycles  
 169 to receive a complete descriptor to start its operation. However, by optimizing the memory  
 170 access, we can achieve the peak performance of the platform, 100 M op/s, as illustrated in  
 171 section 4.2.

#### 172 4.2. Memory Access Optimization

173 Due to the maximum memory bandwidth limitations presented by the hardware plat-  
 174 form, our goal is to increase throughput by executing one dot-product operation every  
 175 clock-cycle. To assure one dot-product operation can be computed every clock-cycle, a new  
 176 descriptor must be valid every clock-cycle. In order to alleviate the memory bandwidth  
 177 bottleneck, an internal memory (cache) is used for storing 32 descriptors. Since the time to  
 178 calculate 32 dot-product operations is 32 clock-cycles (one operation per clock-cycle), within  
 179 that time period, one complete descriptor can be fetched from external memory. The newly

180 fetched descriptor will execute a dot-product operation with each descriptor in the internal  
 181 cache (32 descriptors stored in internal cache). The result of the fetching optimization op-  
 182 erations allows calculating 32 dot-product operations in 32 clock-cycles. While executing  
 183 the 32 dot-products of the current descriptor, a new descriptor is received and the process  
 184 is repeated until the entirety of descriptors of image  $\tilde{\beta}$  is completed.

185 Therefore, in order to alleviate the memory bandwidth restriction, the descriptors of  
 186 image  $\tilde{\alpha}$  are divided into blocks of 32 descriptors each. Each block is passed to an internal  
 187 cache and the dot-product operation is executed with one block and the entirety of descrip-  
 188 tors of image  $\tilde{\beta}$ . Architecturally, two first-in first-out (FIFO) buffers are used to store the  
 189 descriptors from external memory as a linear cache, with a total fetch time of 1024 clock-  
 190 cycles per block<sup>2</sup>. The result from the block latency is that the highest throughput can be  
 191 achieved when image  $\tilde{\beta}$  has more than 32 descriptors. To further alleviate computation time  
 192 and memory requirements for the SIFT descriptors matching, the architecture must be fully  
 193 compatible with the platform. In our approach, the matching architecture is implemented  
 194 such that full utilization of the core is achieved.

## 195 5. Proposed Matching Algorithm Architecture on Hardware

196 FPGA is generally utilized for accelerating computational processes by increasing con-  
 197 current operations. It further increases the overall throughput of the system by pipelining  
 198 and overlapping the instructions. The goal of this work is to accelerate the SIFT descriptors  
 199 matching on FPGA and efficiently handling memory bandwidth limitations which is often  
 200 seen in software implementation, as explained in Section 4.

201 In our proposed architecture, the descriptors of image  $\tilde{\alpha}$  and image  $\tilde{\beta}$  are streamed from  
 202 external memory into an internal FIFO. Each descriptor is composed of 128 elements and  
 203 its location,  $(x, y)$ , in the image. Although each element should be represented as a double-  
 204 precision floating-point to increase the accuracy, such floating-point adder circuits [17] is  
 205 more complicated and consumes more resources. Therefore, for further optimization, each  
 206 element of the descriptor is represented as a 16-bit fixed-point value and a total of 32-bits  
 207 for its location, leading to an individual descriptor size of 2080 bits.

208 The proposed SIFT matching architecture [18] consists of four main sub-cores and two  
 209 internal caches to alleviate memory bottlenecks; all of which are fully pipelined and imple-  
 210 mented onto FPGA:

- 211     • Dot\_Product.
- 212     • Cosine\_Inverse.
- 213     • Minimum Search (MIN\_FIND).
- 214     • Match\_Check.

---

<sup>2</sup>1024 clock-cycles comes from the previous 32 clock-cycles to fetch a single descriptor by the number of descriptors per block,  $32 \times 32 = 1024$ .

- 215     • Descriptor Cache (DES\_MEM).  
 216     • Minimum(s) Cache (MIN\_MEM).

217     For making use of high-level synthesis design [19], Xilinx System Generator <sup>®</sup> is used for  
 218     designing and implementing the *Dot\_Prod* and the *Cosine\_Inverse* blocks as IP cores.

219     Figure 2 describes a block diagram of our proposed SIFT descriptor matching accelerator  
 220     core. In this architecture, the descriptors are streamed from the memory to the matching  
 221     core. The internal caches (buffers) are used for keeping the descriptors to alleviate the  
 222     memory bottleneck and fully utilize the matching core. This will allow a complete descrip-  
 223     tor available at every clock-cycle, where the external memory bandwidth is optimized for  
 224     increasing the throughput as explained in Section 4.2.



Figure 2: Matching core architecture for the proposed SIFT descriptors matching detailing the sub-modules and caches necessary for operation as controlled by the control unit.  $Z^{-n}$  is a shift register with n pipeline stages.

225     Since the descriptor size is 260 bytes (including the coordinates) and the memory band-  
 226     width is 8 bytes per clock, it takes 33 clock-cycles to fetch a complete descriptor. In order  
 227     to match the fetching time, 33 descriptors of image  $\tilde{\alpha}$  were saved in a separate descriptors  
 228     cache (DES\_MEM), where each descriptor of image  $\tilde{\beta}$  falls through in 33 clock-cycles and  
 229     saved into Register.

230     The outputs of the DES\_MEM and the Register are split into descriptors that are handed  
 231     to the Dot\_Product core and their coordinates that are passed through shift register with a  
 232     corresponding number of the pipeline stages. The output of the Dot\_Product is passed to a  
 233     Cosine\_Inverse core of 52 pipeline stages approximating the resulted output to 16-bits.

234     To calculate the minimum and second minimum on the fly, the previous minimum and  
 235     second minimum are retrieved from the minimum(s) cache (MIN\_MEM)<sup>3</sup> and passed to the  
 236     minimum-search core (MIN\_FIND) along with the current calculated cosine-inverse.

<sup>3</sup>Initially, the MIN\_MEM is empty; the minimum and second minimum values are delivered and temporarily considered as the maximum values ( $0 \times FFFF$ ).

237 For each new descriptor block of image  $\tilde{\alpha}$ , MIN\_MEM should be flushed. Therefore, a  
 238 multiplexer is used to pass a constant value of ( $0 \times FFFF$ ) when a new block of descriptor  
 239 is delivered at DES\_MEM. Otherwise, it passes the current value(s) in MIN\_MEM. This is  
 240 controlled by the Control Unit, seen in Figure 2.

241 The Control Unit present in the proposed SIFT matching architecture allows several  
 242 operations to run concurrently by using a scheduling method; increasing overall throughput  
 243 of the system. An example of scheduled concurrent operation is when the execution of a dot-  
 244 product operation occurs on the final descriptor of image  $\tilde{\beta}$ , the DES\_MEM is simultaneously  
 245 filled with the subsequent descriptor of image  $\tilde{\alpha}$  such that the following descriptor of image  
 246  $\tilde{\beta}$  will already have a new reference descriptor block. The Control Unit is aware of the total  
 247 number of descriptors and the processing time of each core, which make it able to handle the  
 248 control signals to receive new descriptors, enable the internal cores, and control the internal  
 249 caches.

### 250 5.1. Dot\_Product Core

251 The SIFT matching algorithm begins by calculating the product of each element of the  
 252  $k^{th}$  descriptor of image  $\tilde{\alpha}$  with the corresponding element of image  $\tilde{\beta}$ . Since the descriptor  
 253 is comprised of 128 elements, 128 multiplications are required to calculate them in parallel.  
 254 The output from each prior multiplication is sequentially added to obtain the resulting dot-  
 255 product by the Dot\_Product core which composed of 128 multiplication cores and seven  
 256 levels<sup>4</sup> of tree adders. The multipliers necessary for the Dot\_Product core were implemented  
 257 into the FPGA's digital signal processing (DSP) slices with 3 pipeline stages. The seven  
 258 levels of adders necessary for sequential addition are equivalent to 7 pipeline stages. Figure  
 259 3 shows the internal architecture of the Dot\_Product core with a total of 10 pipeline stages.

### 260 5.2. Cosine\_Inverse Core

261 In order to implement the cosine-inverse in hardware, a coordinate rotation digital com-  
 262 puter (CORDIC) core provided by Xilinx using System Generator for DSP [20] is used. The  
 263 Cosine\_Inverse core is shown in Figure 4. It is composed of two CORDIC cores: one to  
 264 calculate the square root of an input and another to find the polar coordinates of a feature,  
 265 labeled in Figure 4 as Square\_Root and Polar\_Sys, respectively. Internal to the Square\_Root  
 266 and Polar\_Sys core are 37 and 11 pipeline stages, respectively. The calculation of  $1 - x^2$   
 267 as an input to the Square\_Root core is completed with 4 pipeline stages, with a total of 52  
 268 pipeline stages for the Cosine\_Inverse core alone.

#### 269 5.2.1. Polar\_Sys Core

270 The Polar\_Sys core within the Cosine\_Inverse core has two inputs,  $u$ , and  $v$  of a Cartesian  
 271 system, and two outputs, magnitude,  $\rho$ , and angle,  $\theta$ . The relation between these Cartesian  
 272 system,  $(u, v)$  and the polar system,  $(\rho, \theta)$  is simply described for a right triangle with

---

<sup>4</sup>Seven levels of adders are required since  $\log_2 128 = 7$ , meaning for each tree we can compute a segment of the 128 elements.



Figure 3: Dot product core.

Figure 4: Internal depiction of the "Cosine\_Inverse" core, including square root calculation and the polar coordinate translation module.  $Z^{-n}$  is a shift register with n pipeline stages.

hypotenuse  $\rho$  and sides  $u$  and  $v$  as  $\rho = \sqrt{u^2 + v^2}$  and  $\theta = \tan^{-1}(v/u)$ . To obtain the arc-cosine, only the calculation of  $\theta$  is necessary, thus only computing  $\theta = \cos^{-1}x$ , where  $x$  is an input into the Cosine\_Inverse needs calculation. The two inputs of the Polar\_Sys core,  $(u, v)$  are thus obtained as  $u = x$  and  $v = \sqrt{1 - x^2}$ .

### 5.3. Minimum Search (MIN\_FIND) Core

After calculating the arc-cosine for each descriptor in the database, two minimum values are determined, serving to highlight the database descriptors that are potential candidates for similarity within the image's descriptor under consideration. In software approach to descriptor matching, the output values are stored into a memory then a sorting algorithm is applied to find the minimum and second minimum values. In hardware approach to

283 descriptor matching, a typical sorting algorithm is resource-inefficient due to memory needs  
 284 thus, in our design, the MIN\_FIND core is designed to find both the minimum and second  
 285 minimum values on the fly. This is done by retrieving the previous minimum and second  
 286 minimum values from (MIN\_MEM) and compared with the current calculated cosine-inverse.  
 287 The pseudo-code representing the hardware operation of the MIN\_FIND core is shown in  
 288 Algorithm 2.

---

**Algorithm 2** Comparison scheme for calculating minimum and second minimum values to highlight database descriptors as potential candidates for similarity with the image descriptor.

---

**Input:** Output of the cosine inverse (curr\_val), recent minimum value from the memory (prev\_min), and recent second minimum value (prev\_sec\_min).

**Output:** Updated minimum value (min) and second minimum value (sec\_min).

```

1: if curr_val < prev_min then
2:   min = curr_val;
3:   sec_min = prev_min;
4: else if curr_val < prev_sec_min then
5:   min = prev_min;
6:   sec_min = curr_val;
7: else
8:   min = prev_min;
9:   sec_min = prev_sec_min;
10: end if
11: return min, sec_min

```

---

289 *5.4. Match\_Check Core*

290 The final step of the SIFT matching algorithm is to check if an actual match occurs  
 291 by passing the minimum and second minimum values to the Match\_Check core. The  
 292 Match\_Check core then applies Equation 2 to check matching between the descriptor of  
 293 image  $\tilde{\alpha}$  with another descriptor within image  $\tilde{\beta}$ . The hardware design of the Match\_Check  
 294 core consists of 3 pipeline stages *without* the need for any multiplier. The multipli-  
 295 cation procedure of the second minimum with 0.6 is hidden in an addition process since  
 296  $(0.6)_d = (0.10011)_b$  can be used as a constant value. Therefore,  $minimum \times (100000)_b$  is  
 297 compared with  $second\ minimum \times (10011)_b$  to check matching between the two descriptors,  
 298 previously mentioned in Algorithm 1. Figure 5 shows a three pipelined stages of adder  
 299 circuits to implement Equation 2, where the multiplication of a number with the constant  
 300 value  $(10011)_b$  is done by adding the number to itself after it is shifted to the left one time  
 301 and the result is added to the same number after it is shifted to the left four times. The  
 302 shifting process was done by appending the right side of the number with extra zero-bits.



Figure 5: Matching Check core.

## 303 6. Experimental Results and Evaluation

304 In order to evaluate the proposed hardware architecture of the SIFT descriptors  
 305 matching, we used a Xilinx® Zynq-7000-based Zedboard. The Zedboard has a programmable  
 306 logic and two ARM Cortex-A9 co-processors. The SIFT hardware matching algorithm core  
 307 was implemented into the Zedboard’s programmable logic while a single ARM Cortex-A9  
 308 processor was only used for simulation within the Xilinx® Software Development Kit.

### 309 6.1. Experimental Setup

310 In order for the SIFT matching core to start processing, descriptors  $d^\alpha$  and  $d^\beta$  of both  
 311 images  $\tilde{\alpha}$  and  $\tilde{\beta}$  should be ready at the input ports of the matching core. The descriptors for  
 312 both images were initially stored onto a SD card used within the Zedboard, whose contents  
 313 were then loaded into the external DRAM by the on-board processing system (PS). To check  
 314 the matches between two images, the PS initializes and facilitates direct memory access  
 315 (DMA) transfer of the provided descriptors from the DRAM to the descriptor buffers.

316 The PS then initiates the matching process with the matching core over advanced exten-  
 317 sible interface (AXI) whereupon the matching core is provided with the number of descriptor  
 318 blocks, number of descriptors per block, and a start signal. The Xilinx® EDA design Suite  
 319 was used to synthesize and implement the design, including the matching core, DMA, and  
 320 FIFO buffers. The Xilinx® Software Development Kit was used to read descriptors for both  
 321 images from SD card into memory, and pass them to the fabric buffers/descriptor(s) caches.  
 322 The fabric clock of the Zedboard has a range of 100 kHz to 250 MHz, however the AXI DMA  
 323 has a maximum frequency of 150 MHz or 120 MHz for AXI4 and AXI4-Lite, respectively

[21]. Due to the frequency limitations of the Zedboard's systems, all experiments were run at a nominal frequency of 100 MHz.

### 6.2. Experimental Results

The SIFT matching algorithm was fully synthesized and implemented onto the Zedboard's fabric with a 135 MHz maximum frequency with a normal clock-speed of 100 MHz<sup>5</sup>. The implemented SIFT matching core has only one computational element of the “Dot\_Product”, “Cosine\_Inverse”, “MIN\_FIND”, and “Match\_Check” cores. The matching core includes a ”Control\_Unit”, additional internal memories for Descriptors and Minimum(s) Caches, and other registers for pipelining and synchronizing the data flow of the algorithm. Table 1 summarizes the overall resource utilization for individual components, with the “Others” category collecting registers and multiplexers used for synchronizing descriptors and control signals.

Table 1: Our proposed SIFT matching algorithm architecture utilization report for Zedboard FPGA implementation.

| Core                                    | LUT  | FF   | DSP | BRAM |
|-----------------------------------------|------|------|-----|------|
| <b>Dot-Product &amp; Cosine Inverse</b> | 3382 | 3557 | 132 | 0    |
| <b>Minimum Search</b>                   | 82   | 66   | 0   | 0    |
| <b>Matching Check</b>                   | 42   | 283  | 0   | 0    |
| <b>Control Unit</b>                     | 92   | 2132 | 0   | 0    |
| <b>Descriptor Cache</b>                 | 0    | 0    | 0   | 29   |
| <b>Minimum(s) Cache</b>                 | 0    | 0    | 0   | 1    |
| <b>Others</b>                           | 112  | 327  | 0   | 0    |
| Total                                   | 3710 | 6365 | 132 | 30   |

In order to evaluate our SIFT matching core based on cosine angle distance approach, several experiments were conducted. The experiments were run on four different images, *image\_1*, *image\_2*, *image\_3* and *image\_4* where each image has 579, 538, 882, and 1021 descriptors, respectively. We chose *image\_4* as the database image which the other three images were checked against for potential matches. To check the correctness of the matching points, *image\_4* was used for testing the self-matching ability of the SIFT matching core as developed. Figure 6 shows the matching points between the selected images with the database.

For comparison of the descriptor matching time using a traditional software approach, the SIFT matching algorithm was executed on a 64-bit Intel® Core 2 Duo CPU running at 3.16 GHz using MATLAB® 2017a, following the original design described by Section 3.2. By using our proposed hardware SIFT matching core, it took 6.08, 6.75, 9.11, and 10.46 milliseconds for images with 579, 638, 882, and 1021 descriptors, respectively, whereas the software approach took 71.6, 77.4, 105.6 and 163.9 milliseconds, respectively, for the

<sup>5</sup>Used in this context due to limitation of the DMA core and AXI4-Lite provided by the Vivado toolset.



Figure 6: Matching points for selected images with different number of descriptors

same set of images. The differences in time taken for both the software and our proposed hardware approach can be summarized that the software approach takes a quadratic increase in computational time with an increasing number of descriptors, compared with a linear increase for our proposed algorithm.

Our SIFT matching core accelerated the computational time of the selected images by  $(11.5 \sim 15.7) \times$  for  $(579 \sim 1021)$  descriptors. The double-floating point operations within the SIFT matching algorithm in software were approximated to 16-bit fixed-point operations in hardware with 98% of matched-descriptors detected with decreasing error for increasing fixed-point resolution. For further analysis, we compared our proposed hardware architecture of the SIFT matching algorithm with a similar approach accelerated on a graphics processing unit (GPU) in [15]. The Authors in [15] used both NVIDIA's Tesla K20 GPU and their local Intel Xeon® 2.7 Ghz Quad-Core CPU with a set of 2,800 descriptors for both the test image and database set. The authors noted that their matching algorithm takes 13 milliseconds on GPU and 80 milliseconds on CPU. Our hardware approach takes 10.46 milliseconds for 1021 descriptors using only one computation core, compared with the 2,496 cores present in the K20. It is noteworthy that by increasing the computational cores with fixed memory bandwidth, the throughput of our system is not affected due to memory bandwidth limitation. In this presented work, we achieved the maximum performance by

368 hitting the roofline performance model, i.e. achieving high throughput with the maximum  
 369 allowed memory bandwidth and one computational core.

370 In addition, it is important to compare our proposed cosine angle distance approach for  
 371 descriptor matching with other implementations in the literature. There are several matching  
 372 techniques based on different approaches such as calculating the Chi-square distances  
 373 [9], Sum of Absolute Differences (SAD) [8], and calculating Hamming distances [11].

374 The implementation in [9] used 64 cores to calculate distance metrics in parallel, re-  
 375quiring many multiplications and divisions. The comparison of resources to our proposed  
 376 implementation shows a reduction of 91% of LUTs, 79% of BRAM. Apart from hardware  
 377 resource utilization, our proposed SIFT matching core has the capability to check match-  
 378 points within 9.28 milliseconds while [9] takes 10 milliseconds for 900 descriptors, due to  
 379 pipelining and maximization of memory bandwidth within the computation core.

380 The implementation in [8] used 16 SAD (Sum of Absolute Differences) calculators to cal-  
 381 culate the absolute difference between descriptors. The 16 SAD values are passed through 4  
 382 levels of comparators to obtain the minimum SAD value as a potential candidate among 16  
 383 descriptors. They used 2 separate RAMs to store the intermediate descriptors. The compari-  
 384 son of resources compared to our proposed implementation results in a reduction of about  
 385 92% of LUTs, 98% of BRAM and 75% DSP area.

386 The implementation in [11] used a variable number of Hamming distance calculators in  
 387 the matching core. For implementation using 2 cores of Hamming distance a considerable  
 388 saving in the FPGA resources were obtained. Compared to 2 cores to calculate Hamming  
 389 distance, our implementation of the matching core using the cosine angle distance approach  
 390 saved 37% of LUTs and 89% of BRAM. As the number of Hamming distance calculation cores  
 391 increases, the number of resources utilization increases. Table 2 shows the utilized resources  
 392 of our proposed SIFT matching core using cosine angle distance versus other matching  
 393 methods such as Chi-square distance [9], SAD calculators [8] and Hamming distance [11].

Table 2: Utilized resources of our architecture core vs other implementation in the literature.

| Parameter             | [9]              | [8]                    | [11]       | Proposed         |
|-----------------------|------------------|------------------------|------------|------------------|
| FPGA used             | Virtex6          | Cyclone IV             | Zynq-7000  | Zynq-7000        |
| Image Size            | $512 \times 384$ | $640 \times 480$       |            | $512 \times 384$ |
| Type of descriptors   | SIFT             | SIFT                   | FREAK      | SIFT             |
| # of descriptors      | 900              | –                      | –          | 882              |
| LUTs (% saved)        | 42662 (91%)      | 51068 (92%)            | 5967 (37%) | 3710             |
| DSP (% saved)         | 104 (-27%)       | 528 (75%)              | –          | 132              |
| BRAM (% saved)        | 142 (79%)        | 213 (85%) <sup>6</sup> | 294 (75%)  | 30               |
| Clock Frequency (MHz) | 172              | 100                    | 100        | 100              |

---

<sup>6</sup>The authors reported 1697 kbits of memory, which is equivalent to 213 BRAM in the best case of memory utilization.

394      Compared to these recent works [9, 8, 11] of implementation of matching cores on  
 395 hardware, our proposed architecture consumes significantly fewer resources with acceptable  
 396 matching accuracy ( 98%).

## 397    7. Conclusions

398      In this paper, a fully pipelined accelerator for a keypoint descriptor matching scheme  
 399 for the SIFT object recognition algorithm was designed and implemented on FPGA where  
 400 the matching core was constructed of four main computational sub-modules and two local  
 401 caches. Utilizing a close construction and 16-bit fixed-point calculations helped alleviate  
 402 memory bandwidth restrictions in order to achieve maximum throughput. An experimental  
 403 system was designed on a Xilinx® Zedboard where the matching core was implemented on  
 404 the programmable fabric and the Zynq processing system initialized the matching process.  
 405 Our proposed SIFT matching architecture consumes fewer resources and accelerates the  
 406 matching process where 9.11, 6.75, and 6.08 milliseconds elapsed for calculating the matching  
 407 points of 882, 638, and 579 descriptors, respectively, with an image of 1021 descriptors. Our  
 408 proposed SIFT matching hardware implementation additionally utilized 91% fewer LUTs  
 409 and 79% fewer BRAM when comparing with the state of the art hardware matching core.  
 410 Future work includes the extension of the hardware architecture into a fully pipelined vision  
 411 system with increased number of computation cores.

## 412    Declaration of Competing Interest

413      All authors have participated in (a) conception and design, or analysis and interpretation  
 414 of the data; (b) drafting the article or revising it critically for important intellectual content;  
 415 and (c) approval of the final version.

416      This manuscript has not been submitted to, nor is under review at, another journal or  
 417 other publishing venue.

418      The authors have no affiliation with any organization with a direct or indirect financial  
 419 interest in the subject matter discussed in the manuscript

## 420    References

- 421      [1] D. G. Lowe, Object Recognition from Local Scale-Invariant Features, in: Proceedings of the Seventh  
         IEEE International Conference on Computer Vision, Vol. 2, 1999, pp. 1150–1157.
- 422      [2] H. B.-S. Lee D-H, Lee D-W, Possibility Study of Scale Invariant Feature Transform (SIFT) Algorithm  
         Application to Spine Magnetic Resonance Imaging, PLoS ONE 11 (4).
- 423      [3] Y. Jiang, Y. Xu, Y. Liu, Performance evaluation of feature detection and matching in stereo visual  
         odometry, Neurocomputing 120 (2013) 380 – 390, image Feature Detection and Description.
- 424      [4] J. Kriaj, V. truc, N. Paveic, Adaptation of SIFT Features for Face Recognition under Varying Illumi-  
         nation, in: The 33rd International Convention MIPRO, 2010, pp. 691–694.
- 425      [5] A. Cesetti, E. Frontoni, A. Mancini, A. Ascani, P. Zingaretti, S. Longhi, A Visual Global Positioning  
         System for Unmanned Aerial Vehicles Used in Photogrammetric Applications, Journal of Intelligent &  
         Robotic Systems 61 (1) (2011) 157–168.
- 426      [6] B. Kolman, D. R. Hill, Elementary Linear Algebra, Pearson Education, 2004.

- 433 [7] G. Qian, S. Sural, Y. Gu, S. Pramanik, Similarity Between Euclidean and Cosine Angle Distance for  
 434 Nearest Neighbor Queries, in: Proceedings of the 2004 ACM Symposium on Applied Computing, SAC  
 435 '04, ACM, New York, NY, USA, 2004, pp. 1232–1237.
- 436 [8] J. Vourvoulakis, J. Kalomiros, J. Lygouras, Fpga-based architecture of a real-time sift matcher and  
 437 ransac algorithm for robotic vision applications, *Multimedia Tools and Applications* 77 (8) (2018)  
 438 9393–9415.
- 439 [9] G. Lentaris, I. Stamoulias, D. Soudris, M. Lourakis, HW/SW Codesign and FPGA Acceleration of  
 440 Visual Odometry Algorithms for Rover Navigation on Mars, *IEEE Transactions on Circuits and Systems*  
 441 for Video Technology
- 442 [10] J. Wang, S. Zhong, L. Yan, Z. Cao, An Embedded System-on-Chip Architecture for Real-time Visual  
 443 Detection and Matching, *IEEE Transactions on Circuits and Systems for Video Technology* 24 (3)  
 444 (2014) 525–538.
- 445 [11] R. Kapela, K. Gugala, P. Sniatala, A. Swietlicka, K. Kolanowski, Embedded platform for local image  
 446 descriptor based object detection, *Applied Mathematics and Computation* 267 (2015) 419 – 426, the  
 447 Fourth European Seminar on Computing (ESCO 2014).
- 448 [12] M. Calonder, V. Lepetit, M. Ozysal, T. Trzcinski, C. Strecha, P. Fua, BRIEF: Computing a Local  
 449 Binary Descriptor Very Fast, *IEEE Transactions on Pattern Analysis and Machine Intelligence* 34 (7)  
 450 (2012) 1281–1298.
- 451 [13] A. Alahi, R. Ortiz, P. Vandergheynst, FREAK: Fast Retina Keypoint, in: 2012 IEEE Conference on  
 452 Computer Vision and Pattern Recognition, 2012, pp. 510–517.
- 453 [14] G. Condello, P. Pasteris, D. Pau, M. Sami, An OpenCL-based feature matcher, *Signal Processing: Image Communication* 28 (4) (2013) 345 – 350, special Issue: VS&AR.
- 454 [15] H. Fassold, H. Stiegler, J. Rosner, M. Thaler, W. Bailer, A GPU-accelerated two stage visual matching  
 455 pipeline for image and video retrieval, in: Content-Based Multimedia Indexing (CBMI), 2015 13th  
 456 International Workshop on, IEEE, 2015, pp. 1–5.
- 457 [16] Xilinx Inc., ZC702 Evaluation Board for the Zynq-7000 XC7Z020 User Guide (September, 2015).
- 458 [17] L. Daoud, D. Zydek, H. Selvaraj, A Survey on Design and Implementation of Floating Point Adder in  
 459 FPGA, in: Progress in Systems Engineering, Springer, 2015, pp. 885–892.
- 460 [18] L. Daoud, M. K. Latif, N. Rafila, SIFT Keypoint Descriptor Matching Algorithm: A Fully Pipelined  
 461 Accelerator on FPGA, in: Proceedings of the 2018 ACM/SIGDA International Symposium on Field-  
 462 Programmable Gate Arrays, ACM, 2018, pp. 294–294.
- 463 [19] L. Daoud, D. Zydek, and H. Selvaraj, A Survey of High Level Synthesis Languages, Tools, and Compilers  
 464 for Reconfigurable High Performance Computing, in: Advances in Systems Science, Springer, 2014, pp.  
 465 483–492, , DOI: 10.1007/978-3-319-01857-7\_47.
- 466 [20] Xilinx Inc., Vivado Design Suite Reference Guide: Model-Based DSP Design Using System Generator  
 467 (May, 2019).
- 468 [21] Xilinx Inc., AXI DMA v7.1: LogiCORE IP Product Guide (October, 2017).

<sup>470</sup> **Luka Daoud** received the B.S. degree in Electrical Engineering from Fayoum University,  
<sup>471</sup> Egypt in 2007, and M.S. degree in Electronics and Communications Engineering from  
<sup>472</sup> Egypt-Japan University of Science and Technology (E-JUST), Alexandria, Egypt in 2012.  
<sup>473</sup> Luka is currently a Ph.D. candidate in Electrical and Computer Engineering at Boise State  
<sup>474</sup> University, Boise, Idaho, USA. His main research focuses on hardware security, Network-on-  
<sup>475</sup> Chip (NoC), high performance computing, and High Level Synthesis (HLS) design.

<sup>476</sup> **Muhammad Kamran Latif** received the B.Sc. degree in electrical engineering from  
<sup>477</sup> the University of Engineering and Technology, Lahore, Pakistan, in 2013 and his M.Sc.  
<sup>478</sup> in electrical engineering from Boise State University (BSU), Idaho, USA in 2017. He is  
<sup>479</sup> currently pursuing the Ph.D. degree from Boise State University, Idaho, USA. His current  
<sup>480</sup> research interests include applications of power electronics to power systems, applications  
<sup>481</sup> of computer engineering in smart grids and cyber-physical security of power distribution  
<sup>482</sup> systems.

<sup>483</sup> **HS. Jacinto** received his B.S. degree in electrical and computer engineering from Boise  
<sup>484</sup> State University, Boise, Idaho, USA, in 2017, and is currently a Ph.D. candidate in electrical  
<sup>485</sup> and computer engineering from Boise State University, Boise, Idaho, USA. From 2015 to  
<sup>486</sup> 2017 he worked with Idaho National Labs in conjunction with the Advanced Energy Lab  
<sup>487</sup> conducting research on self-powered wireless sensor networks and their security. He has  
<sup>488</sup> moved over to the Air Force Research Lab Quantum Information Science group in 2018  
<sup>489</sup> under a fellowship to work on quantum information processing systems, integrated quantum  
<sup>490</sup> photonics, and quantum control. His main research focuses on quantum network hardware  
<sup>491</sup> cybersecurity, quantum informatics, and adaptive hardware anti-tamper and encryption  
<sup>492</sup> technologies for use in the field of hardware security to create a secure platform for an  
<sup>493</sup> upcoming quantum era.

<sup>494</sup> **Nader I. Rafla**, PhD., P.E., received his M.S.E.E. and Ph.D. in Electrical Engineering  
<sup>495</sup> from Case Western Reserve University, Cleveland, Ohio in 1984 and 1991, respectively. His  
<sup>496</sup> doctoral research concentrated on object recognition and localization from multi sensor data:  
<sup>497</sup> range image, force-torque, and touch. From 1991 to 1996 he was an Associate Professor at  
<sup>498</sup> the Department of Manufacturing Engineering at Central State University, Wilberforce,  
<sup>499</sup> Ohio. In January 1997, he joined the newly developed electrical and computer engineering  
<sup>500</sup> department at Boise State University where he is currently Chair and Associate Professor,  
<sup>501</sup> heading the development of the B.S. and M.S. programs. Dr. Rafla's areas of expertise are:  
<sup>502</sup> Systems on a Programmable Chip (SoPC): reconfigurable computing and neuromorphic ar-  
<sup>503</sup> chitectures; evolvable hardware and genetic algorithms; cryptography and cybersecurity;  
<sup>504</sup> Implementation and design of secure hardware architectures for the next-generation embed-  
<sup>505</sup> ded systems and their applications that have immediate impact on real-world problems and  
<sup>506</sup> related technical challenges such as data fusion, hardware security, and hardware accelera-  
<sup>507</sup> tion.

**Daoud:**



**Latif:**

509



Journal Pre-proof

Jacinto:

510



**Rafla:**

511

