This repository has been archived by the owner on Nov 16, 2023. It is now read-only.

Microsoft MASC, an Apache Spark connector for Apache Accumulo

This code provides connectivity between Apache Accumulo and Apache Spark.

Main Goals

  • Provide a native Spark interface to connect to Accumulo
  • Minimize data transfer between Spark and Accumulo
  • Enable use of Machine Learning with Accumulo as the datastore

Examples

# Read from Accumulo
df = (spark
      .read
      .format("com.microsoft.accumulo")
      .options(**options)  # define Accumulo properties
      .schema(schema)  # define schema for data retrieval
      .load())  # without load(), df would be a DataFrameReader, not a DataFrame

# Write to Accumulo
(df
 .write
 .format("com.microsoft.accumulo")
 .options(**options)
 .save())

See Pyspark notebook for a more detailed example.

See Scala benchmark notebook for details on our evaluation.

Capabilities

  • Native Spark Datasource V2 API
  • Row serialization using Avro
  • Filter pushdown (server-side)
  • Expressive filter language using JUEL
  • ML Inference pushdown (server-side) using MLeap
  • Support Spark ML pipelines
  • Minimal Java-runtime

Installation

The connector is composed of two components:

  • The Datasource component provides the interface used on the Spark side
  • The Iterator component provides server-side functionality on the Accumulo side

The components can be built and tested with Maven (version 3.3.9 or higher) using Java version 8.

mvn clean install

Alternatively, the JARs are published to the Maven Central Repository.

The following steps are needed to deploy the connector:

  1. Deploy the iterator JAR to the Accumulo lib folders on all nodes and restart the cluster
# use the locally built shaded JAR in the connector/iterator/target folder
#  or
# use Maven to download the iterator from the Central Repository
mvn dependency:get -Dartifact=com.microsoft.masc:microsoft-accumulo-spark-iterator:[VERSION]
  2. Add the Datasource JAR in Spark
# use the locally built shaded JAR in the connector/datasource/target folder
#  or
# pull in the package from the Maven Central Repository
com.microsoft.masc:microsoft-accumulo-spark-datasource:[VERSION]
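One common way to pull a package from Maven Central at launch time is the `--packages` flag of `spark-submit`. The sketch below only assembles such a command line from the datasource coordinate given above. The helper names, the application script name, and the version string passed in are illustrative assumptions, not part of the connector; substitute a release that is actually published.

```python
# Sketch only: builds a spark-submit command line that pulls the datasource
# package from Maven Central. Helper names and the script name are hypothetical.

GROUP_ID = "com.microsoft.masc"
DATASOURCE_ARTIFACT = "microsoft-accumulo-spark-datasource"


def datasource_coordinate(version):
    """Return the Maven coordinate group:artifact:version for the datasource JAR."""
    # Refuse an empty value or the literal [VERSION] placeholder from the README.
    if not version or version.startswith("["):
        raise ValueError("replace the [VERSION] placeholder with a published release")
    return "{}:{}:{}".format(GROUP_ID, DATASOURCE_ARTIFACT, version)


def spark_submit_command(version, app):
    """Return the argument list for launching `app` with the datasource package."""
    return ["spark-submit", "--packages", datasource_coordinate(version), app]


if __name__ == "__main__":
    import sys
    # Usage: python this_script.py <published-version> <your_job.py>
    print(" ".join(spark_submit_command(sys.argv[1], sys.argv[2])))
```

Keeping the coordinate in one helper means the same string can also be supplied through the `spark.jars.packages` configuration property instead of the command line.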

Spark Runtime Java Version

While the iterator JAR can run on Accumulo tablet servers using JDK versions >= 1.8, the Spark Datasource component is only compatible with JDK version 1.8 (not higher) due to Spark's Java support.
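A quick pre-flight check for this constraint could look like the following sketch. It parses the text printed by `java -version` and reports whether the major version is 8. The function names are hypothetical, and the parsing assumes the usual `version "..."` line that common JDK distributions print.

```python
import re
import subprocess


def java_major_version(version_output):
    """Extract the major Java version from `java -version` output."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        raise ValueError("no Java version string found")
    first = int(match.group(1))
    second = match.group(2)
    # Legacy scheme: "1.8.0_292" means Java 8. Newer scheme: "11.0.2" means Java 11.
    if first == 1 and second is not None:
        return int(second)
    return first


def spark_side_supported(version_output):
    """True only for Java 8, the version the Datasource component requires."""
    return java_major_version(version_output) == 8


def local_java_version_output():
    """Run `java -version`; the JVM writes this banner to stderr, not stdout."""
    result = subprocess.run(["java", "-version"], capture_output=True, text=True)
    return result.stderr


if __name__ == "__main__":
    print(spark_side_supported(local_java_version_output()))
```

Run it on the Spark driver and executor hosts; the tablet servers only need a JDK of version 1.8 or higher.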