



# Speedup Spark Applications using FPGA Accelerators on the cloud

Christoforos Kachris  
ICCS-NTUA

#EUres8

# Performance in Spark

- 91% of Spark users care about performance

FEATURES USERS CONSIDER IMPORTANT



# Accelerators versus Processors



[Source: National Instruments: Smart Grid Ready Instrumentation”

Christoforos Kachris, [www.inaccel.com](http://www.inaccel.com), 2017, #EUres8

# Flexibility versus Performance



CPU: High flexibility, lower efficiency



Accelerators: High throughput, higher efficiency

[Source: Amazon AWS f1]

GPUs and FPGAs can provide massive parallelism and higher efficiency than CPUs for certain categories of applications

# Hardware Acceleration

FPGA handles compute-intensive, deeply pipelined, hardware-accelerated operations



[Source: Amazon AWS f1]



CPU handles the rest

# Process flow using Accelerators



Christoforos Kachris, [www.inaccel.com](http://www.inaccel.com), 2017, #EUres8

# Amazon Resources



# FPGAs in Amazon AWS



# A use case on Logistic regression

LR is used for building predictive models for many complex pattern-matching and classification problems.

It can be applied widely in such diverse areas as

- bioinformatics,
- finance and
- data analytics.

One of the most popular Machine Learning techniques.



# Hardware accelerator for LR

Given a new data point  $x$ , the model makes predictions by applying the logistic function  $h(z) = \frac{1}{1+e^{-z}}$ , where  $z = w^T x$ .



# Accelerated Mllib

- Extension of Spark libraries



# Spark – FPGA interface

- Spark worker
- Python API
- C API

Support for:



Christoforos Kachris, [www.inaccel.com](http://www.inaccel.com), 2017, #EUres8

# ML acceleration library

- Instantiate a LR model



- Train the model



- (Test the model)

```
from pyspark import SparkContext
from pyspark.mllib.regression import LabeledPoint
# from pyspark.mllib.classification import LogisticRegressionWithLBFGS
from pyspark.mllib_accel.classification import LogisticRegression

def parsePoint(line):
    """
    Parse a line of text into an MLlib LabeledPoint object.
    """

    data = [float(s) for s in line.split(',')]
    return LabeledPoint(data[0], data[1:])

if __name__ == "__main__":
    sc = SparkContext(appName = "Python Logistic Regression")

    trainRDD = sc.textFile(train_file, numPartitions).map(parsePoint)

    #LR = LogisticRegressionWithLBFGS.train(trainRDD, iterations, numClasses)
    LR = LogisticRegression(numClasses, numFeatures)
    LR.train(trainRDD, alpha, iterations)

    sc.stop()
```

Christoforos Kachris, www.inaccel.com, 2017, #EUres8

# Evaluation in cluster

- Evaluation on a Handwritten Digit Recognition Problem
  - Take an image of a handwritten single digit, and determine what that digit is.
- 10 classes, 786 features
- 40k lines train file
- Given a new data point, models will be run, and the class with largest probability will be chosen as the predicted class.

2 3 9 7 8 3 4 / 0 9  
7 9 1 3 7 6 2 2 1 8  
5 7 0 7 3 2 8 0 0 8  
6 7 8 6 6 6 3 3 5 3  
4 7 3 5 0 4 4 9 5 7



# Evaluation

- Evaluation on a Handwritten Digit Recognition Problem
  - Take an image of a handwritten single digit, and determine what that digit is.
- Features of evaluated platforms

| Features         | Server  | PYNQ-Z1      |
|------------------|---------|--------------|
| Vendor           | Intel   | Xilinx       |
| Device           | E5-2658 | Zynq XC7Z020 |
| Cores(threads)   | 12(24)  | 2            |
| Processor        | E5-2658 | A9           |
| Architecture     | 64-bit  | 32-bit       |
| INstruction Set  | CISC    | RISC         |
| Process          | 22nm    | 28nm         |
| Clock Frequency  | 2.2GHz  | 667MHz       |
| Level 1 cache    | 380kB   | 32kB         |
| Level 2 cache    | 3072kB  | 512kB        |
| Level 3 cache    | 30MB    | -            |
| TDP              | 105W    | 3.5          |
| Operating System | Ubuntu  | Ubuntu       |

Main features of the evaluated Xeon and Zynq Platforms

# Spark implementation

- Run Batch Gradient Descent (BGD) in parallel.
  - Averaging the subgradients over different partitions is performed using one standard spark map-reduce in each iteration.

```
if weights is None:  
    weights = zeros((numClasses, 1 + numFeatures))  
  
for t in range(0, iterations):  
    gradients = trainRDD.map(lambda data: gradients_kernel(data, weights).reduce(lambda a, b: add(a, b))  
  
    weights = subtract(weights, (alpha / numSamples) * gradients)
```



# Cluster on FPGA nodes





# Spark running on FPGA-based cluster

The screenshot shows the Spark Master UI running on a Firefox browser. The title bar reads "Spark Master at spark://192.168.1.203:7077 - Mozilla Firefox (on vinemaster)". The main content area displays the following information:

**Spark Logo** 2.1.1-SNAPSHOT **Spark Master at spark://192.168.1.203:7077**

URL: spark://192.168.1.203:7077  
REST URL: spark://192.168.1.203:6066 (cluster mode)  
Alive Workers: 4  
Cores in use: 4 Total, 0 Used  
Memory in use: 2020.0 MB Total, 0.0 B Used  
Applications: 0 Running, 0 Completed  
Drivers: 0 Running, 0 Completed  
Status: ALIVE

**Workers**

| Worker Id                                 | Address             | State | Cores      | Memory                |
|-------------------------------------------|---------------------|-------|------------|-----------------------|
| worker_20170629001609-192.168.1.231-54796 | 192.168.1.231:54796 | ALIVE | 1 (0 Used) | 505.0 MB (0.0 B Used) |
| worker_20170629001613-192.168.1.232-39866 | 192.168.1.232:39866 | ALIVE | 1 (0 Used) | 505.0 MB (0.0 B Used) |
| worker_20170629001613-192.168.1.233-47368 | 192.168.1.233:47368 | ALIVE | 1 (0 Used) | 505.0 MB (0.0 B Used) |
| worker_20170629001615-192.168.1.234-41389 | 192.168.1.234:41389 | ALIVE | 1 (0 Used) | 505.0 MB (0.0 B Used) |

**Running Applications**

| Application ID | Name | Cores | Memory per Node | Submitted Time | User | State | Duration |
|----------------|------|-------|-----------------|----------------|------|-------|----------|
|                |      |       |                 |                |      |       |          |

**Completed Applications**

| Application ID | Name | Cores | Memory per Node | Submitted Time | User | State | Duration |
|----------------|------|-------|-----------------|----------------|------|-------|----------|
|                |      |       |                 |                |      |       |          |

# Speedup

- Xeon vs  
Xeon +  
small  
FPGA  
cluster



# Comparison

SW (Intel Xeon)



SW+HW  
(Accel on FPGA SoCs)



# Spark Accelerators

- Develop hardware components as IP cores for widely used applications
  - Spark
    - Logistic regression
    - Recommendation (ALS)
    - K-means
    - Linear regression
    - PageRank
    - Graph computing



Source: C. Kachris et al.,  
Spark acceleration using  
FPGAs, best paper award,  
IEEE MOCAST 2017

# Comparison with Amazon AWS EC

- c4 (36 cores)
- m4 (16 cores)
- f1 with our Accelerator



# MLlib library to be covered

MLlib contains many algorithms and utilities.

- **Classification:** logistic regression, naive Bayes,...
- **Regression:** generalized linear regression, survival regression,...
- **Recommendation:** alternating least squares (ALS)
- **Clustering:** K-means, Gaussian mixtures (GMMs),...
- **Topic modeling:** latent Dirichlet allocation (LDA)
- Frequent itemsets, association rules, and sequential pattern mining

# VINEYARD framework

- Utilize FPGA resources seamlessly

More info :  
[vineyardh2020.eu](http://vineyardh2020.eu)



29



# PARTNERS



(Coordinator)  
[www.microlab.ntua.gr](http://www.microlab.ntua.gr)



[www.maxeler.com](http://www.maxeler.com)



[www.bull.com](http://www.bull.com)



[www.qub.ac.uk](http://www.qub.ac.uk)



[www.ics.forth.gr](http://www.ics.forth.gr)



[www.hartree.stfc.ac.uk/hartree](http://www.hartree.stfc.ac.uk/hartree)



[www.neurasmus.com](http://www.neurasmus.com)



[www.neurocom.eu](http://www.neurocom.eu)



[www.helex.gr](http://www.helex.gr)



[www.leancale.com](http://www.leancale.com)



[www.loba.pt](http://www.loba.pt)

# Overview - InAccel



**InAccel** provides high performance accelerators for your application based on novel hardware reconfigurable engines as IP blocks.

The hardware accelerators can be used to improve the performance of your applications both in terms of throughput and execution time (latency).



## Spark integration

The hardware accelerators of **InAccel** are compatible with the Apache Spark framework and allow faster execution of Spark applications based on MLlib (machine learning) and GraphX libraries.

# Overview - InAccel



## Amazon compatibility

InAccel develops hardware accelerators that are compatible with the Amazon AWS F1 platform. Use the hardware accelerators as an IP to speedup your application, seamlessly, without the purchase of any hardware.



## No need for code changes

InAccel provides all the required APIs for the seamless integration of the accelerators without any modifications on your original code.

Execution time of Spark logistic regression



# Thank you

Questions?

[kachris@microlab.ntua.gr](mailto:kachris@microlab.ntua.gr)

