SANSA-Rdf-Partition-Spark

SANSA RDF Graph Partitioning is a library that performs RDF graph partitioning algorithms on top of the SANSA-Stack framework (see http://sansa-stack.net). RDF files are transformed into an org.apache.spark.graphx.Graph, and the algorithms are then executed on Apache Spark.
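
As a rough illustration of that transformation, an N-Triples file can be turned into a GraphX graph along the following lines. This is a minimal sketch, not the library's actual loader: the whitespace-based parsing is simplified and NTripleGraphLoader is a hypothetical name.

```scala
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.sql.SparkSession

// Hypothetical helper, not part of the library: builds a GraphX graph
// from an N-Triples file using simplified whitespace parsing.
object NTripleGraphLoader {
  def load(spark: SparkSession, path: String): Graph[String, String] = {
    val triples = spark.sparkContext.textFile(path)
      .map(_.trim)
      .filter(line => line.nonEmpty && !line.startsWith("#"))
      .map { line =>
        val Array(s, p, o) = line.stripSuffix(".").trim.split("\\s+", 3)
        (s, p, o)
      }

    // Give every distinct subject and object a numeric vertex id.
    val vertices = triples.flatMap(t => Seq(t._1, t._3)).distinct().zipWithUniqueId()

    // Replace subject and object labels by their vertex ids to form edges.
    val edges = triples
      .map(t => (t._1, (t._2, t._3)))
      .join(vertices)
      .map { case (_, ((p, o), sid)) => (o, (sid, p)) }
      .join(vertices)
      .map { case (_, ((sid, p), oid)) => Edge(sid, oid, p) }

    Graph(vertices.map(_.swap), edges)
  }
}
```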

SANSA-Rdf-Partition-Spark currently supports the algorithms described in the following papers:

  • Semantic Hash Partitioning: Kisung Lee and Ling Liu. Scaling queries over big RDF graphs with semantic hash partitioning. Proceedings of the VLDB Endowment, 6(14): 1894-1905, 2013. (Completed)
  • Path Partitioning: Buwen Wu, Yongluan Zhou, Pingpeng Yuan, Ling Liu, and Hai Jin. Scalable SPARQL querying using path partitioning. In Data Engineering (ICDE), 2015 IEEE 31st International Conference on, pages 795-806. IEEE, 2015. (Under construction)

Example#1: Semantic Hash Partitioning

We load an N-Triple file and apply the Semantic Hash Partitioning algorithm. Depending on the type of triple group used, it has three sub-algorithms: subject hash partitioning, object hash partitioning, and subject-object hash partitioning.
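
The idea behind the subject variant can be sketched as follows: each triple is keyed by its subject, so a hash partitioner co-locates whole subject triple groups. This is a hedged sketch, not the library's API; subjectHashPartition is an illustrative name.

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

// Illustrative sketch: key each edge (triple) by its subject so that a
// HashPartitioner co-locates whole subject triple groups. Keying by
// e.dstId instead gives object hash partitioning; emitting each edge
// under both keys gives the subject-object variant.
def subjectHashPartition[VD, ED](graph: Graph[VD, ED],
                                 numPartitions: Int): RDD[(VertexId, Edge[ED])] =
  graph.edges
    .map(e => (e.srcId, e))
    .partitionBy(new HashPartitioner(numPartitions))
```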

The example graph has 465 vertices and 499 edges in total and is partitioned into 4 partitions. First, we apply subject hash partitioning using subject triple groups. Each baseline partition is then expanded hop by hop: a k-hop expansion adds the triples reachable within k hops of the partition's vertices, duplicating border triples across partitions. The number of edges in each partition after each expansion is shown in the following table.

| PartitionId | Edges after 1-hop | Edges after 2-hop | Edges after 3-hop |
|---|---|---|---|
| 0 | 71 | 331 | 339 |
| 1 | 212 | 344 | 344 |
| 2 | 174 | 323 | 331 |
| 3 | 42 | 301 | 316 |
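
The 2-hop and 3-hop columns result from repeating an expansion step like the one sketched below. This is a simplified, in-memory sketch that assumes forward-direction expansion; expandOneHop is an illustrative name.

```scala
// Illustrative sketch of one forward expansion step: add every triple
// whose subject appears as an object inside the partition. Applying it
// k-1 times to the 1-hop baseline yields the k-hop figures above.
def expandOneHop(partition: Set[(Long, String, Long)], // (subject, predicate, object)
                 all: Set[(Long, String, Long)]): Set[(Long, String, Long)] = {
  val frontier = partition.map(_._3) // objects reachable in the partition
  partition ++ all.filter(t => frontier.contains(t._1))
}
```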

Next, we apply object hash partitioning using object triple groups:

| PartitionId | Edges after 1-hop | Edges after 2-hop | Edges after 3-hop |
|---|---|---|---|
| 0 | 73 | 309 | 319 |
| 1 | 261 | 307 | 307 |
| 2 | 109 | 317 | 330 |
| 3 | 56 | 297 | 312 |

Finally, we apply subject-object hash partitioning using subject-object (so) triple groups:

| PartitionId | Edges after 1-hop | Edges after 2-hop | Edges after 3-hop |
|---|---|---|---|
| 0 | 71 | 496 | 499 |
| 1 | 212 | 499 | 499 |
| 2 | 174 | 499 | 499 |
| 3 | 42 | 493 | 499 |

Semantic hash partitioning is well suited to star queries. However, a partition built from subject triple groups can only answer star queries whose core vertex has only outgoing edges, and a partition built from object triple groups has the mirror-image restriction. Subject-object triple groups remove this limitation, but the tables above show that after a 2-hop expansion the graph in each partition is already almost as large as the original graph.

Example#2: Path Partitioning

We load an N-Triple file as the example RDF graph and apply the path partitioning algorithm. The example graph is shown below:

Path partitioning extracts end-to-end paths and generates a path group for each start vertex (a vertex with only outgoing edges and no incoming edges). Path groups are then merged according to the number of paths passing through each vertex (a sketch of the extraction step follows the remark below). In the example the start vertices are {1, 3, 8, 9}; for 2 partitions, the paths from {1, 8} are merged into one partition and the paths from {3, 9} into the other, giving the two partitions below:

For comparison, the same graph partitioned by semantic hash partitioning into 2 partitions using subject triple groups:

Remark: the path partitioning algorithm currently cannot handle graphs whose paths contain directed cycles.
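
To make the path-extraction step concrete, here is a hedged, in-memory sketch; endToEndPaths is an illustrative name, the recursion assumes an acyclic graph (matching the remark above), and the library performs this step on a distributed GraphX graph instead.

```scala
// Illustrative sketch: enumerate all end-to-end paths from each start
// vertex (no incoming edges) down to a sink (no outgoing edges) and
// return one path group per start vertex. Assumes the graph is acyclic.
def endToEndPaths(edges: Seq[(Long, Long)]): Map[Long, Seq[List[Long]]] = {
  val out    = edges.groupBy(_._1).map { case (src, es) => src -> es.map(_._2) }
  val starts = edges.map(_._1).toSet.diff(edges.map(_._2).toSet)

  def walk(v: Long): Seq[List[Long]] = out.get(v) match {
    case None        => Seq(List(v)) // sink: the path ends here
    case Some(nexts) => nexts.flatMap(n => walk(n).map(v :: _))
  }

  starts.toSeq.map(s => s -> walk(s)).toMap
}
```

On the example graph this would return four path groups, keyed by the start vertices {1, 3, 8, 9}, which the merge step then combines into the requested number of partitions.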

Example#3: Partitioning Metrics

We load an N-Triple file and apply the path partitioning (PP), subject hash partitioning (SHP), object hash partitioning (OHP), and subject-object hash partitioning (SOHP) algorithms, and compare them using the following partitioning metrics:

| Metric | PP | SHP | OHP | SOHP |
|---|---|---|---|---|
| Duplication | 1.67 | 1.67 | 1.87 | 2.73 |
| Balance | 1.44 | 1.6 | 1.71 | 1.27 |
| Efficiency | 0.6 | 0.6 | 0.4 | 0 |
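
This README does not spell out how these metrics are computed, so the following is only a hedged sketch under assumed definitions: Duplication as the ratio of edges stored across all partitions to edges in the original graph, and Balance as the size of the largest partition relative to the average partition size. Efficiency is omitted because no definition is given here.

```scala
// Hedged sketch under the assumed definitions stated above; the
// repository's actual metric computations may differ.
def duplication(partitionSizes: Seq[Long], originalEdges: Long): Double =
  partitionSizes.sum.toDouble / originalEdges // e.g. duplication(Seq(600, 550), 1000) == 1.15

def balance(partitionSizes: Seq[Long]): Double =
  partitionSizes.max.toDouble / (partitionSizes.sum.toDouble / partitionSizes.size)
```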