k-Nearest Neighbor algorithm

Overview:

This project is aimed at using SDAccel to implement the k-Nearest Neighbor algorithm onto a Xilinx FPGA. The nearest neighbor algorithm is used to find the k nearest neighbors of a specified point among a set of unstructured data points. It is widely used in a diverse range of domains and applications such as pattern recognition, machine learning, computer vision and coding theory to name a few. For a set S of n reference data points in a d-dimensional space and a query point q, the k-nearest neighbor algorithm finds the k closest points in S to the point q. This is illustrated in Figure 1 for k=3 and n=12. The red triangle represents the query point while the blue diamonds represent the points in the reference data set in Figure 1.

![alt text](./figures/knn.png) Figure 1: Pictorial demonstration of KNN search algorithm

The algorithm includes the following main steps:

Compute n distances between the query point q and the n reference points of the set S. The distances in our case are the squared Euclidean distances which for a given set of two 2-dimensional points (x₁,y₁) and (x₂,y₂) are as follows:

_d = (x₁ - x₂)² + (y₁ - y₂)²_ 2. Sort the _n_ distances while preserving their original indices in _S_. 3. Return the _k_ points from _S_ corresponding to the lowest distances of the sorted distance array.

Further details can be found in the following paper. If you find this work useful for your research, please consider citing:

@ARTICLE{7859319, 
    author={F. B. Muslim and L. Ma and M. Roozmeh and L. Lavagno}, 
    journal={IEEE Access}, 
    title={Efficient FPGA Implementation of OpenCL High-Performance Computing Applications via High-Level Synthesis}, 
    year={2017}, 
    volume={5}, 
    pages={2747-2762}, 
    doi={10.1109/ACCESS.2017.2671881}, 
    }

Getting Started:

The repository contains three directories, namely "data", "knn_cpu" and "knn_fpga". The "data" directory contains a file with a sample set of reference data points. The "knn_cpu" and "knn_fpga" directories contain two different implementations of the KNN algorithm, in which the nearest neighbor identification is done on the CPU and FPGA respectively. Each implementation contains the OpenCL code of the algorithm along with the required host code and the Tcl script used to run the example in SDAccel.

###File Tree

KNN  
│   README.md  
│
└── data  
│   └─  filelist.txt
│
└── knn_cpu
│   │   main_cpu.cpp 
│   │   nn_cpu.cl     
│   │   params.h
│   └─  solution_cpu.tcl
│
└── knn_fpga
    │   main_fpga.cpp
    │   nn_fpga.cl
    │   params.h
    └─  solution_fpga.tcl

File/Dir name	Information
knn_cpu	Contains the "single kernel implementation" of the algorithm. The nearest neighbors identification in this implementation is done on the host.
knn_fpga	Contains the "two kernel implementation" of the algorithm, which uses a memory architecture optimization provided by SDAccel that implements the global memories used to communicate between the kernels as streaming buffers on the FPGA Block RAMs. The nearest neighbor identification in this case is done on the FPGA.
filelist.txt	Contains the points of a reference data set (294912 points).
params.h	Some constant parameters (see below).
main_cpu.cpp	The host code for the implementation with nearest neighbor identification done on the host.
nn_cpu.cl	The kernel which calculates all the distances and writes them to the host for nearest neighbors identification.
solution_cpu.tcl	Script for the "single kernel implementation" in SDAccel.
main_fpga.cpp	The host code for the "two kernel implementation" of the algorithm, where the distances are calculated by multiple work groups in one kernel and then the neighbors are computed by a single work-group (assuming that k is small) in another kernel.
nn_fpga.cl	The OpenCL code using global memory buffers to stream data between two kernels. One of the kernels calculates the distance between the query point and all the points in the reference data set. The second kernel identifies the k nearest neighbors using a single work-group.
solution_fpga.tcl	Script for the "two kernel implementation" in SDAccel.

Parameters

The number of nearest neighbors to be returned, the required work group size and the maximum size of an input file line are defined in "params.h"

In the "two kernel" implementation, the size of the inter-kernel on-chip global memory buffer is also specified. Note that SDAccel implements it using a streaming buffer, hence its specific size does not actually matter.

How to run an example

The code has been tested with Xilinx SDAccel 2015.1.

The script files for each test case can be used as follows:

sdaccel solution_cpu.tcl
sdaccel solution_fpga.tcl

The k nearest neighbors, the given query point, and the total number of points are printed on standard output.

Sample Output

The sample output after a successful run would look like:

![alt text](./figures/samplerun.png) Figure 2: Sample Output after a successful run

Performance Metrics

We considered the following performance metrics:

Initiation Interval: Indicating the number of clock cycles between two consecutive iterations of a loop
Latency: Indicating the number of clock cycles required to complete the execution of the entire kernel

Both metrics have a profound effect on the execution time and energy consumption in case of an FPGA implementation. The various optimization options applied cause a significant reduction in the overall latency for the different implementations. Both the cases use "reqd_work_group_size" attribute necessary to allow for performance optimization while generating custom logic for a kernel.
The latency after employing work-item pipeline optimization for the "knn_cpu" case goes from 7433 clock cycles down to 294 clock cycles. For the "knn_fpga" case, that uses the on-chip block RAMs to implement global memory buffers used for inter-kernel communication, the best case latency reduces to 292 clock cycles from 5901 clock cycles in the un-optimized case. This improvement in the overall latency shows up in the performance comparison tables given below.

Performance Comparison

The FPGA vs GPU performance comparison for the "knn_cpu" and "knn_fpga" implementations in terms of execution time, power and energy consumption is described here. Note that in the "knn_cpu" case, the nearest neighbors identification time on the host (the CPU) has also been added to the distance calculation time on FPGA to get the total execution time. This "sorting and neighbors identification time" for "knn_cpu" case is also reported below. The execution time on FPGA implementation in each case has been calculated by using extrapolation to take into account the total number of work-groups. The devices used for comparison are the following:

NVIDIA GeForce GTX 960 with 1024 cores
NVIDIA Quadro K4200 with 1344 cores
Xilinx Virtex 7 xc7vx690tffg1157-2

knn_cpu

Platform	Total Time	Sort Time (CPU)	Power(W)	Energy(mJ)
Virtex 7	5.24ms	4.0ms	0.422	0.523
GTX 960	3.04ms	3.0ms	120	4.4
K4200	3.05ms	3.0ms	108	5.6

Area Utilization	Values
BRAMs	0
DSPs	12 (0.33%)
FFs	3109 (0.36%)
LUTs	2006 (0.46%)

knn_fpga

Platform	Time	Power(W)	Energy(J)
Virtex 7	1.23ms	3.136	0.0039
GTX 960	0.93s	120	111.6
K4200	3.11s	108	335.88

Area Utilization	Values
BRAMs	512 (34.83%)
DSPs	12 (0.33%)
FFs	23892 (2.78%)
LUTs	11838 (2.76%)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

data

data

figures

figures

knn_cpu

knn_cpu

knn_fpga

knn_fpga

README.md

README.md

license

license

Repository files navigation

k-Nearest Neighbor algorithm

Overview:

Getting Started:

Parameters

How to run an example

Sample Output

Performance Metrics

Performance Comparison

More Information

About

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 77 Commits
data		data
figures		figures
knn_cpu		knn_cpu
knn_fpga		knn_fpga
README.md		README.md
license		license

License

HLSpolito/KNN

Folders and files

Latest commit

History

Repository files navigation

k-Nearest Neighbor algorithm

Overview:

Getting Started:

Parameters

How to run an example

Sample Output

Performance Metrics

Performance Comparison

More Information

About

Topics

Resources

License

Stars

Watchers

Forks

Languages