BigDatalog on Spark 1.6.X

BigDatalog is a Datalog system for Big Data Analytics first presented at SIGMOD 2016. See the paper Big Data Analytics with Datalog Queries on Spark for details.

BigDatalog is implemented as a module (datalog) in Spark that requires a few changes to the core and sql modules. Building, configuring and running examples follows the normal Spark approach which you can read about under [here] (http://spark.apache.org/docs/1.6.1/).

Building BigDatalog

Building and running BigDatalog follows the same procedures as Spark itself (see "Building Spark") with one exception. Before building for the first time, run the following to install the front-end compiler into the maven's local repository:

$ build/mvn install:install-file -Dfile=datalog/lib/DeALS-0.6.jar -DgroupId=DeALS -DartifactId=DeALS -Dversion=0.6 -Dpackaging=jar

Once you have a successful build, you can verify BigDatalog by running its test cases (see "Building Spark").

Example Programs

BigDatalog comes with several sample programs in the examples module in the org.apache.spark.examples.datalog package. These are run the same as other Spark example programs (i.e., via spark-submit).

Writing BigDatalog Programs

You will want to examine the example BigDatalog programs and the test cases to see how to use the BigDatalogAPI (BigDatalogContext) and how write BigDatalog programs. For Datalog language help, see the DeAL tutorial.

Configuration

Spark Configuration options

The following are the BigDatalog configuration options:

Property Name	Default	Meaning
spark.datalog.storage.level	MEMORY_ONLY	Default StorageLevel for recursive predicate RDD caching.
spark.datalog.jointype	broadcast	Default join type. "broadcast" (or no setting at all) - the plan generator will attempt to insert BroadcastHints into the plan to produce a BroadcastJoin. "shuffle" - the plan generator will attempt to insert CacheHints to cache the build side of a ShuffleHashJoin. "sortmerge" - the plan generator will not attempt any hints and produce a SortMergeJoin. With "broadcast" or "shuffle", if no hints are given, SortMergeJoin is produced.
spark.datalog.recursion.version	3	1 = Multi Job PSN, 2 = Multi Job PSN w/ SetRDD, 3 = Single Job PSN w/ SetRDD
spark.datalog.recursion.memorycheckpoint	true	Each iteration of recursion, cache the RDDs in memory and clear lineage. Avoids a stack-overflow from long lineages and greatly reduces closurecleaning time but you better have enough memory. Use false if the program+dataset requires few iterations.
spark.datalog.recursion.iterateinfixedpointresulttask	false	Decomposable predicates will not require shuffling during recursion. This flag allows the FixedPointResultTask to iterate rather than perform a single iteration.
spark.datalog.aggregaterecursion.version	3	1 = Multi Job PSN, 2 = Multi Job PSN w/ SetRDD, 3 = Single Job PSN w/ SetRDD
spark.datalog.shuffledistinct.enabled	false	Enables a "map-side distinct" before a shuffle to reduce the amount of data produced during a join in a recursion.
spark.datalog.uniondistinct.enabled	true	Deduplicate union operations. Datalog uses set-semantics!

*PSN = Parallel Semi-naive Evaluation

Configuring BigDatalog Programs

To configure the number of partitions for a recursive predicate, set spark.sql.shuffle.partitions (by default 200). For programs without shuffling in recursion (decomposable programs) setting spark.sql.shuffle.partitions = [# of total CPU cores in the cluster] is usually a good choice. For programs with shuffling, the value to use for spark.sql.shuffle.partitions can vary depending on the program + workload combination, but 1, 2, or 4 X [# of total CPU cores in the cluster] are good values to try.

Many BigDatalog programs will perform better given more memory. Make sure to choose a 'good' setting for spark.executor.memory and consider increasing spark.memory.fraction and spark.memory.storageFraction, especially for programs that require little-to-no shuffling.

Name		Name	Last commit message	Last commit date
Latest commit History 14,459 Commits
R		R
assembly		assembly
bagel		bagel
bin		bin
build		build
conf		conf
core		core
data/mllib		data/mllib
datalog		datalog
dev		dev
docker-integration-tests		docker-integration-tests
docker		docker
docs		docs
ec2		ec2
examples		examples
external		external
extras		extras
graphx		graphx
launcher		launcher
licenses		licenses
mllib		mllib
network		network
project		project
python		python
repl		repl
sbin		sbin
sbt		sbt
sql		sql
streaming		streaming
tags		tags
tools		tools
unsafe		unsafe
yarn		yarn
.gitattributes		.gitattributes
.gitignore		.gitignore
.rat-excludes		.rat-excludes
CHANGES.txt		CHANGES.txt
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
NOTICE		NOTICE
README.md		README.md
build.bat		build.bat
make-distribution.sh		make-distribution.sh
pom.xml		pom.xml
pylintrc		pylintrc
scalastyle-config.xml		scalastyle-config.xml
tox.ini		tox.ini

License

mrayva/Bigdatalog

Folders and files

Latest commit

History

Repository files navigation

BigDatalog on Spark 1.6.X

Building BigDatalog

Example Programs

Writing BigDatalog Programs

Configuration

Configuring BigDatalog Programs

About

Resources

License

Security policy

Stars

Watchers

Forks

Languages