SANSA RDF Graph Partitioning is a library that provides RDF graph partitioning algorithms within the SANSA-Stack framework (see http://sansa-stack.net). RDF files are transformed into `org.apache.spark.graphx.Graph` instances, and the algorithms are executed on Apache Spark.
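As a rough illustration of the first step, the sketch below (plain Scala, not the actual SANSA API) shows how a single N-Triples line can be split into a (subject, predicate, object) triple, the raw material for a GraphX `Edge(subjectId, objectId, predicate)`:

```scala
// Conceptual sketch only: a very simplified N-Triples line parser. A real
// parser must also handle literals containing spaces, escapes, comments, etc.
object NTripleParser {
  def parse(line: String): (String, String, String) = {
    // Strip the trailing " ." and split into at most three tokens.
    val tokens = line.trim.stripSuffix(".").trim.split("\\s+", 3)
    (tokens(0), tokens(1), tokens(2))
  }
}

val (s, p, o) = NTripleParser.parse(
  "<http://example.org/a> <http://example.org/knows> <http://example.org/b> .")
```

Each parsed triple then becomes one edge of the graph, with the predicate carried as the edge attribute.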
SANSA-Rdf-Partition-Spark currently supports the algorithms described in the following papers:
- Semantic Hash Partitioning: Kisung Lee and Ling Liu. Scaling queries over big RDF graphs with semantic hash partitioning. Proceedings of the VLDB Endowment, 6(14):1894–1905, 2013. (Completed)
- Path Partitioning: Buwen Wu, Yongluan Zhou, Pingpeng Yuan, Ling Liu, and Hai Jin. Scalable SPARQL querying using path partitioning. In Data Engineering (ICDE), 2015 IEEE 31st International Conference on, pages 795–806. IEEE, 2015. (Under construction)
Example#1: Semantic Hash Partitioning
We load an N-Triple file as the example graph and apply the semantic hash partitioning algorithm. Depending on the type of triple group used, it has three sub-algorithms: subject hash partitioning, object hash partitioning, and subject-object hash partitioning.
The example graph has 465 vertices and 499 edges in total and is partitioned into 4 partitions. First, we apply subject hash partitioning using subject triple groups. The number of edges in each partition after k-hop expansion is shown in the following table.
PartitionId | Edges after 1-hop | Edges after 2-hop | Edges after 3-hop |
---|---|---|---|
0 | 71 | 331 | 339 |
1 | 212 | 344 | 344 |
2 | 174 | 323 | 331 |
3 | 42 | 301 | 316 |
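The core idea of subject hash partitioning can be sketched in a few lines of plain Scala (a simplified stand-in for the Spark-based implementation; all names here are hypothetical): each triple is assigned to the partition determined by hashing its subject, so all outgoing edges of a vertex stay in the same partition.

```scala
// Simplified sketch of subject hash partitioning, using in-memory collections
// instead of Spark RDDs.
case class Triple(s: String, p: String, o: String)

def subjectHashPartition(triples: Seq[Triple], numPartitions: Int): Map[Int, Seq[Triple]] =
  // math.abs around the remainder keeps the partition id non-negative.
  triples.groupBy(t => math.abs(t.s.hashCode % numPartitions))

val triples = Seq(
  Triple("a", "knows", "b"),
  Triple("a", "likes", "c"),
  Triple("b", "knows", "c"))

val parts = subjectHashPartition(triples, 2)
// Both triples with subject "a" land in the same partition.
```

Object hash partitioning works analogously by hashing the object instead of the subject.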
Then we apply object hash partitioning using object triple groups:
PartitionId | Edges after 1-hop | Edges after 2-hop | Edges after 3-hop |
---|---|---|---|
0 | 73 | 309 | 319 |
1 | 261 | 307 | 307 |
2 | 109 | 317 | 330 |
3 | 56 | 297 | 312 |
Finally, we apply subject-object hash partitioning using subject-object (so) triple groups:
PartitionId | Edges after 1-hop | Edges after 2-hop | Edges after 3-hop |
---|---|---|---|
0 | 71 | 496 | 499 |
1 | 212 | 499 | 499 |
2 | 174 | 499 | 499 |
3 | 42 | 493 | 499 |
Semantic hash partitioning is well suited for star queries. However, a partition built from subject triple groups can only guarantee locality for star queries whose core vertex has only outgoing edges, and the same limitation (with incoming edges) holds for partitions built from object triple groups. Partitioning by so triple groups avoids this problem, but as the example above shows, after 2-hop expansion the graph in each partition is already close in size to the original graph.
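The k-hop expansion behind the tables above can be sketched as follows (plain Scala with assumed semantics in the spirit of Lee and Liu, not the actual SANSA code): each hop pulls into a partition every triple whose subject is already reachable from it, which is what drives the duplication growth visible in the tables.

```scala
// Sketch of k-hop (forward) expansion of a partition's triple set.
case class Triple(s: String, p: String, o: String)

def expandKHop(partition: Set[Triple], all: Set[Triple], k: Int): Set[Triple] = {
  var current = partition
  for (_ <- 1 to k) {
    val frontier = current.map(_.o)                     // vertices reached so far
    current = current ++ all.filter(t => frontier(t.s)) // pull in their outgoing edges
  }
  current
}

val all = Set(Triple("a", "p", "b"), Triple("b", "p", "c"), Triple("c", "p", "d"))
val p0  = Set(Triple("a", "p", "b"))
// 1 hop additionally pulls in b->c; 2 hops also pull in c->d.
```

Because expanded triples are replicated rather than moved, the per-partition edge counts grow toward the size of the whole graph as k increases.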
Example#2: Path Partitioning
We load an N-Triple file as the example RDF graph and apply the path partitioning algorithm. The example graph is shown below:
Path partitioning extracts end-to-end paths and generates a path group for each start vertex (a vertex with only outgoing edges and no incoming edges). Path groups are then merged according to the number of paths passing through each vertex. In the example, the start vertices are {1, 3, 8, 9}; to obtain 2 partitions, the paths from {1, 8} are merged and the paths from {3, 9} are merged, giving the following two partitions:
For comparison, the same graph partitioned by semantic hash partitioning into 2 partitions using subject triple groups:
Remark: The path partitioning algorithm currently cannot handle graphs whose paths contain directed cycles.
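Under the description above, the path extraction step can be sketched as follows (plain Scala on an in-memory edge set, a hypothetical stand-in for the Spark implementation; the input must be acyclic, matching the remark):

```scala
// Enumerate all end-to-end paths per start vertex. A start vertex is one with
// outgoing but no incoming edges; a path ends at a vertex with no outgoing edges.
def endToEndPaths(edges: Set[(Int, Int)]): Map[Int, List[List[Int]]] = {
  val adj    = edges.groupBy(_._1).map { case (v, es) => v -> es.map(_._2).toList }
  val starts = edges.map(_._1) -- edges.map(_._2)

  def walk(v: Int): List[List[Int]] = adj.get(v) match {
    case None | Some(Nil) => List(List(v))                        // leaf: path ends here
    case Some(nexts)      => nexts.flatMap(n => walk(n).map(v :: _))
  }

  starts.map(s => s -> walk(s)).toMap
}

// Tiny illustrative graph (not the 465-vertex example from above):
val edges = Set((1, 2), (2, 4), (3, 2), (1, 5))
val paths = endToEndPaths(edges)
// start vertices {1, 3}; paths 1->2->4, 1->5, and 3->2->4
```

Merging then assigns whole path groups to partitions, so vertex 2 here would be replicated into both groups that pass through it.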
Example#3: Evaluating Partitioning Algorithms
We load an N-Triple file and apply the path partitioning (PP), subject hash partitioning (SHP), object hash partitioning (OHP), and subject-object hash partitioning (SOHP) algorithms, then evaluate them with the following partitioning metrics:
Metric | PP | SHP | OHP | SOHP |
---|---|---|---|---|
Duplication | 1.67 | 1.67 | 1.87 | 2.73 |
Balance | 1.44 | 1.6 | 1.71 | 1.27 |
Efficiency | 0.6 | 0.6 | 0.4 | 0 |
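Two of these metrics can be computed as below, under assumed definitions inferred from context rather than taken from the SANSA code: duplication as the total number of (possibly replicated) edges stored across partitions divided by the edge count of the original graph, and balance as the largest partition's edge count divided by the mean partition size. The efficiency metric is omitted because its definition is not given here.

```scala
// Assumed metric definitions: duplication = stored edges / original edges,
// balance = max partition size / mean partition size.
def duplication(partitionSizes: Seq[Int], originalEdges: Int): Double =
  partitionSizes.sum.toDouble / originalEdges

def balance(partitionSizes: Seq[Int]): Double =
  partitionSizes.max.toDouble / (partitionSizes.sum.toDouble / partitionSizes.size)

// Hypothetical example: 3 partitions holding 100, 60 and 40 edges
// of a 120-edge graph.
val dup = duplication(Seq(100, 60, 40), 120) // 200/120 ≈ 1.67 copies per edge
val bal = balance(Seq(100, 60, 40))
```

Under these definitions, a duplication of 1.0 and a balance of 1.0 would describe a perfect partitioning with no replicated edges and equally sized partitions.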