SANSA-Rdf-Partition-Spark

SANSA RDF Graph Partitioning is a library that performs RDF graph partitioning algorithms on top of the SANSA-Stack framework (see http://sansa-stack.net). RDF files are transformed into an org.apache.spark.graphx.Graph, and the algorithms are then executed on Apache Spark.
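
As a rough illustration of that transformation, an N-Triples file can be turned into a GraphX graph along the following lines. This is a minimal sketch, not the library's actual loader: the whitespace-based parsing is simplified and NTripleGraphLoader is a hypothetical name.

```scala
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.sql.SparkSession

// Hypothetical helper, not part of the library: builds a GraphX graph
// from an N-Triples file using simplified whitespace parsing.
object NTripleGraphLoader {
  def load(spark: SparkSession, path: String): Graph[String, String] = {
    val triples = spark.sparkContext.textFile(path)
      .map(_.trim)
      .filter(line => line.nonEmpty && !line.startsWith("#"))
      .map { line =>
        val Array(s, p, o) = line.stripSuffix(".").trim.split("\\s+", 3)
        (s, p, o)
      }

    // Give every distinct subject and object a numeric vertex id.
    val vertices = triples.flatMap(t => Seq(t._1, t._3)).distinct().zipWithUniqueId()

    // Replace subject and object labels by their vertex ids to form edges.
    val edges = triples
      .map(t => (t._1, (t._2, t._3)))
      .join(vertices)
      .map { case (_, ((p, o), sid)) => (o, (sid, p)) }
      .join(vertices)
      .map { case (_, ((sid, p), oid)) => Edge(sid, oid, p) }

    Graph(vertices.map(_.swap), edges)
  }
}
```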

SANSA-Rdf-Partition-Spark currently supports the algorithms described in the following papers:

  • Semantic Hash Partitioning: Kisung Lee and Ling Liu. Scaling queries over big RDF graphs with semantic hash partitioning. Proceedings of the VLDB Endowment, 6(14): 1894-1905, 2013. (Completed)
  • Path Partitioning: Buwen Wu, Yongluan Zhou, Pingpeng Yuan, Ling Liu, and Hai Jin. Scalable SPARQL querying using path partitioning. In Data Engineering (ICDE), 2015 IEEE 31st International Conference on, pages 795-806. IEEE, 2015. (Under construction)

Example#1: Semantic Hash Partitioning

We load an N-Triple file and apply the Semantic Hash Partitioning algorithm. Depending on the type of triple group used, it has three sub-algorithms: subject hash partitioning, object hash partitioning, and subject-object hash partitioning.
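
The idea behind the subject variant can be sketched as follows: each triple is keyed by its subject, so a hash partitioner co-locates whole subject triple groups. This is a hedged sketch, not the library's API; subjectHashPartition is an illustrative name.

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

// Illustrative sketch: key each edge (triple) by its subject so that a
// HashPartitioner co-locates whole subject triple groups. Keying by
// e.dstId instead gives object hash partitioning; emitting each edge
// under both keys gives the subject-object variant.
def subjectHashPartition[VD, ED](graph: Graph[VD, ED],
                                 numPartitions: Int): RDD[(VertexId, Edge[ED])] =
  graph.edges
    .map(e => (e.srcId, e))
    .partitionBy(new HashPartitioner(numPartitions))
```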

The example graph has 465 vertices and 499 edges in total and is partitioned into 4 partitions. First, we apply subject hash partitioning using subject triple groups. Each baseline partition is then expanded hop by hop: a k-hop expansion adds the triples reachable within k hops of the partition's vertices, duplicating border triples across partitions. The number of edges in each partition after each expansion is shown in the following table.

| PartitionId | Edges after 1-hop | Edges after 2-hop | Edges after 3-hop |
|---|---|---|---|
| 0 | 71 | 331 | 339 |
| 1 | 212 | 344 | 344 |
| 2 | 174 | 323 | 331 |
| 3 | 42 | 301 | 316 |
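
The 2-hop and 3-hop columns result from repeating an expansion step like the one sketched below. This is a simplified, in-memory sketch that assumes forward-direction expansion; expandOneHop is an illustrative name.

```scala
// Illustrative sketch of one forward expansion step: add every triple
// whose subject appears as an object inside the partition. Applying it
// k-1 times to the 1-hop baseline yields the k-hop figures above.
def expandOneHop(partition: Set[(Long, String, Long)], // (subject, predicate, object)
                 all: Set[(Long, String, Long)]): Set[(Long, String, Long)] = {
  val frontier = partition.map(_._3) // objects reachable in the partition
  partition ++ all.filter(t => frontier.contains(t._1))
}
```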

Next, we apply object hash partitioning using object triple groups:

| PartitionId | Edges after 1-hop | Edges after 2-hop | Edges after 3-hop |
|---|---|---|---|
| 0 | 73 | 309 | 319 |
| 1 | 261 | 307 | 307 |
| 2 | 109 | 317 | 330 |
| 3 | 56 | 297 | 312 |

Finally, we apply subject-object hash partitioning using subject-object (so) triple groups:

| PartitionId | Edges after 1-hop | Edges after 2-hop | Edges after 3-hop |
|---|---|---|---|
| 0 | 71 | 496 | 499 |
| 1 | 212 | 499 | 499 |
| 2 | 174 | 499 | 499 |
| 3 | 42 | 493 | 499 |

Semantic hash partitioning is well suited to star queries. However, a partition built from subject triple groups can only answer star queries whose core vertex has only outgoing edges, and a partition built from object triple groups has the mirror-image restriction. Subject-object triple groups remove this limitation, but the tables above show that after a 2-hop expansion the graph in each partition is already almost as large as the original graph.

Example#2: Path Partitioning

We load an N-Triple file as the example RDF graph and apply the path partitioning algorithm. The example graph is shown below:

Path partitioning extracts end-to-end paths and generates a path group for each start vertex (a vertex with only outgoing edges and no incoming edges). Path groups are then merged according to the number of paths passing through each vertex (a sketch of the extraction step follows the remark below). In the example the start vertices are {1, 3, 8, 9}; for 2 partitions, the paths from {1, 8} are merged into one partition and the paths from {3, 9} into the other, giving the two partitions below:

For comparison, the same graph partitioned by semantic hash partitioning into 2 partitions using subject triple groups:

Remark: the path partitioning algorithm currently cannot handle graphs whose paths contain directed cycles.
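
To make the path-extraction step concrete, here is a hedged, in-memory sketch; endToEndPaths is an illustrative name, the recursion assumes an acyclic graph (matching the remark above), and the library performs this step on a distributed GraphX graph instead.

```scala
// Illustrative sketch: enumerate all end-to-end paths from each start
// vertex (no incoming edges) down to a sink (no outgoing edges) and
// return one path group per start vertex. Assumes the graph is acyclic.
def endToEndPaths(edges: Seq[(Long, Long)]): Map[Long, Seq[List[Long]]] = {
  val out    = edges.groupBy(_._1).map { case (src, es) => src -> es.map(_._2) }
  val starts = edges.map(_._1).toSet.diff(edges.map(_._2).toSet)

  def walk(v: Long): Seq[List[Long]] = out.get(v) match {
    case None        => Seq(List(v)) // sink: the path ends here
    case Some(nexts) => nexts.flatMap(n => walk(n).map(v :: _))
  }

  starts.toSeq.map(s => s -> walk(s)).toMap
}
```

On the example graph this would return four path groups, keyed by the start vertices {1, 3, 8, 9}, which the merge step then combines into the requested number of partitions.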

Example#3: Partitioning Metrics

We load an N-Triple file and apply the path partitioning (PP), subject hash partitioning (SHP), object hash partitioning (OHP), and subject-object hash partitioning (SOHP) algorithms, and compare them using the following partitioning metrics:

| Metric | PP | SHP | OHP | SOHP |
|---|---|---|---|---|
| Duplication | 1.67 | 1.67 | 1.87 | 2.73 |
| Balance | 1.44 | 1.6 | 1.71 | 1.27 |
| Efficiency | 0.6 | 0.6 | 0.4 | 0 |
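
This README does not spell out how these metrics are computed, so the following is only a hedged sketch under assumed definitions: Duplication as the ratio of edges stored across all partitions to edges in the original graph, and Balance as the size of the largest partition relative to the average partition size. Efficiency is omitted because no definition is given here.

```scala
// Hedged sketch under the assumed definitions stated above; the
// repository's actual metric computations may differ.
def duplication(partitionSizes: Seq[Long], originalEdges: Long): Double =
  partitionSizes.sum.toDouble / originalEdges // e.g. duplication(Seq(600, 550), 1000) == 1.15

def balance(partitionSizes: Seq[Long]): Double =
  partitionSizes.max.toDouble / (partitionSizes.sum.toDouble / partitionSizes.size)
```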