Microsoft MASC, an Apache Spark connector for Apache Accumulo

This code provides connectivity between Apache Accumulo and Apache Spark.

Main Goals

Provide a native Spark interface to connect to Accumulo
Minimize data transfer between Spark and Accumulo
Enable the use of machine learning with Accumulo as the datastore

Examples

# Read from Accumulo
df = (spark
      .read
      .format("com.microsoft.accumulo")
      .options(**options)  # define Accumulo properties
      .schema(schema)  # define schema for data retrieval
      .load())  # returns the DataFrame; without it, df is only a DataFrameReader

# Write to Accumulo
(df
 .write
 .format("com.microsoft.accumulo")
 .options(**options)
 .save())

See the PySpark notebook for a more detailed example.
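
The read and write calls above can be wrapped in two small helpers. This is a minimal sketch, not part of the connector: the function names `read_accumulo` and `write_accumulo` are made up for illustration, and the contents of `options` are left to the caller because the connector's option keys are not listed here. It relies on the standard PySpark behavior that `DataFrameReader.schema` accepts either a `StructType` or a DDL string.

```python
# Data source name taken from the examples in this README.
ACCUMULO_FORMAT = "com.microsoft.accumulo"


def read_accumulo(spark, options, schema):
    """Return a DataFrame for the Accumulo table described by `options`.

    `schema` may be a StructType or a DDL string such as
    "id STRING, score DOUBLE" (hypothetical column names).
    """
    return (spark.read
            .format(ACCUMULO_FORMAT)
            .options(**options)
            .schema(schema)
            .load())  # load() turns the DataFrameReader into a DataFrame


def write_accumulo(df, options):
    """Write `df` to the Accumulo table described by `options`."""
    (df.write
       .format(ACCUMULO_FORMAT)
       .options(**options)
       .save())
```

With an existing `spark` session and an `options` dictionary, `df = read_accumulo(spark, options, "id STRING, score DOUBLE")` followed by `write_accumulo(df, options)` mirrors the two snippets above.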

See the Scala benchmark notebook for details on our evaluation.

Capabilities

Native Spark Datasource V2 API
Row serialization using Avro
Filter pushdown (server-side)
Expressive filter language using JUEL
ML Inference pushdown (server-side) using MLeap
Support for Spark ML pipelines
Minimal Java runtime

Installation

The connector is composed of two components:

The Datasource component provides the interface used on the Spark side
The Iterator component provides server-side functionality on the Accumulo side

The components can be built and tested with Maven (version 3.3.9 or higher) using Java version 8.

mvn clean install

Alternatively, the JARs are published to the Maven Central Repository:

Datasource
Iterator

The following steps are needed to deploy the connector:

Deploy the iterator JAR to the Accumulo lib folder on all nodes and restart the cluster

# use locally built shaded jar in connector/iterator/target folder
#  or
# use maven to download iterator from central repository
mvn dependency:get -Dartifact=com.microsoft.masc:microsoft-accumulo-spark-iterator:[VERSION]

Add the Datasource JAR to Spark

# use locally built shaded jar in connector/datasource/target folder
#  or
# pull in package from maven central repository
com.microsoft.masc:microsoft-accumulo-spark-datasource:[VERSION]
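
One common way to pull a package from Maven Central into Spark is the `--packages` option of `spark-submit` (equivalently, the `spark.jars.packages` configuration property). The sketch below only assembles that command line from the coordinate shown above. The helper names are illustrative, and both the version and the application file are supplied by the caller since neither is specified here.

```python
# Group and artifact IDs taken from the coordinate in this README.
DATASOURCE_GROUP = "com.microsoft.masc"
DATASOURCE_ARTIFACT = "microsoft-accumulo-spark-datasource"


def datasource_coordinate(version):
    """Build the group:artifact:version coordinate for the datasource JAR."""
    if not version or ":" in version:
        raise ValueError("version must be a non-empty Maven version string")
    return f"{DATASOURCE_GROUP}:{DATASOURCE_ARTIFACT}:{version}"


def spark_submit_command(version, application):
    """Return the argument list for a spark-submit call that resolves
    the datasource from Maven Central via --packages."""
    return ["spark-submit",
            "--packages", datasource_coordinate(version),
            application]
```

The resulting list can be passed to `subprocess.run`, or joined with spaces to obtain the equivalent shell command.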

Spark Runtime Java Version

While the iterator JAR can run on Accumulo tablet servers using JDK versions >= 1.8, the Spark Datasource component is only compatible with JDK version 1.8 (not higher) due to Spark's Java support.
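
To guard against starting Spark on an unsupported JVM, the version string can be checked up front. The following is a small sketch with illustrative function names. It assumes the usual Java convention that releases up to 8 report themselves as `1.<major>` (for example `1.8.0_292`), while later releases report `<major>` first (for example `11.0.2`).

```python
import re


def java_major_version(version_string):
    """Extract the major Java version from a `java.version`-style string."""
    match = re.match(r"(\d+)(?:\.(\d+))?", version_string.strip())
    if match is None:
        raise ValueError(f"unrecognized Java version: {version_string!r}")
    first, second = match.group(1), match.group(2)
    # Legacy scheme: "1.8.0_292" means Java 8.
    if first == "1" and second is not None:
        return int(second)
    # Modern scheme: "11.0.2" means Java 11.
    return int(first)


def spark_datasource_compatible(version_string):
    """True only for Java 8, the version the Spark Datasource requires."""
    return java_major_version(version_string) == 8
```

The input is the value of the JVM's `java.version` system property, which is also the quoted string printed by `java -version`.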
