
Shortest Path Search in parallel using SciSpark


Introduction

The algorithm below describes a method for finding the shortest path in a graph that may comprise many smaller subgraphs. The intuition behind this approach is to find the source nodes (i.e., nodes with no incoming edges) of these subgraphs in parallel and then find the corresponding subgraphs belonging to those source nodes.

Once the subgraphs are found, we can run Dijkstra's shortest-path algorithm on each of them in parallel.
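As a rough illustration of this first step, here is a minimal Spark sketch that finds source nodes (zero in-degree) from an edge list in parallel. The edge data and object name are hypothetical stand-ins, not SciSpark's actual implementation:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object FindSourceNodes {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("FindSourceNodes").setMaster("local[*]"))

    // Hypothetical edge list: (source vertex ID, destination vertex ID).
    val edges = sc.parallelize(Seq((1L, 2L), (2L, 3L), (4L, 5L)))

    // Every vertex that appears anywhere in the edge list.
    val allNodes = edges.flatMap { case (s, d) => Seq(s, d) }.distinct()

    // Vertices with at least one incoming edge.
    val hasIncoming = edges.map(_._2).distinct()

    // Source nodes have zero incoming edges; subtract runs in parallel.
    val sources = allNodes.subtract(hasIncoming)

    sources.collect().foreach(println) // prints 1 and 4
    sc.stop()
  }
}
```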

Algorithm

Approach

The algorithm comprises three main steps:

  1. Partition: Vertices are partitioned on their identifiers such that IDs within a certain range are kept in the same partition. The intuition behind this partitioning is that consecutive IDs may represent vertices that are close to each other and therefore likely to share an edge.
  2. Elimination: There are two ways to eliminate vertices/edges from a partition. First, if the partition encompasses an entire subgraph, that subgraph's edges are written out to a file. Second, an optional threshold-based elimination can discard subgraphs that meet a given criterion (for example, subgraphs whose maximum depth is below a threshold).
  3. Merging: After the elimination step, the current partition may still contain nodes with incoming or outgoing edges from other partitions. To accommodate these, we merge two consecutive partitions into one and repeat the above steps.

The above three steps are performed iteratively until only a single partition remains.
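A minimal sequential sketch of the three steps above follows. The Edge type, the isComplete test, and the emit sink are illustrative assumptions, not SciSpark's actual API; for brevity, completeness is tested per partition here, whereas the algorithm tests each subgraph within a partition:

```scala
object PartitionEliminateMergeSketch {
  type Edge = (Long, Long) // (source vertex ID, destination vertex ID)

  // Step 1 (Partition): bucket edges by source-vertex ID range so that
  // nearby, and hence likely connected, vertices share a partition.
  def partitionByIdRange(edges: Seq[Edge], numPartitions: Int, maxId: Long): Vector[Seq[Edge]] = {
    val width = math.max(1L, (maxId + 1) / numPartitions)
    val buckets = edges.groupBy { case (src, _) =>
      math.min(numPartitions - 1, (src / width).toInt)
    }
    Vector.tabulate(numPartitions)(i => buckets.getOrElse(i, Seq.empty))
  }

  // Steps 2 and 3 (Eliminate, Merge), iterated until one partition is left.
  def run(partitions: Vector[Seq[Edge]],
          isComplete: Seq[Edge] => Boolean, // hypothetical containment test
          emit: Seq[Edge] => Unit): Seq[Edge] = {
    var parts = partitions
    while (parts.size > 1) {
      // Step 2 (Eliminate): edges of fully contained subgraphs are
      // written out (emit) and dropped from further processing.
      parts = parts.map(p => if (p.nonEmpty && isComplete(p)) { emit(p); Seq.empty } else p)
      // Step 3 (Merge): fuse consecutive partition pairs and repeat.
      parts = parts.grouped(2).map(_.flatten.toSeq).toVector
    }
    parts.headOption.getOrElse(Seq.empty)
  }
}
```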

Assumptions

  • Assumption 1: The input is a sparse disconnected directed graph.

  • Assumption 2: There are no cycles in the directed graph.

  • Assumption 3: The first two assumptions imply that we will always be able to find a node that has no incoming edges (a source node).

Limitations

The partitioning depends on the vertex numbers/IDs, and we assume that consecutive (or nearly consecutive) vertex IDs are likely to share an edge. If the source and destination vertices of an edge fall in partitions far apart (e.g., partition #1 and partition #n-1), that edge is never eliminated and is reprocessed again and again on each merge.

Implementation

The Mesoscale Convective Complexes (GTG) use case: Grab 'em, Tag 'em, Graph 'em (GTG) is a feature detection, evolution, and characterization algorithm. It was built for the weather/climate application of identifying large-scale mesoscale convective systems (MCSs) and mesoscale convective complexes (MCCs), a type of MCS, in highly resolved temporal and geospatial remote-sensed datasets, and of characterizing those features using various data sources. MCSs are complexes of thunderstorms that become organized on a scale larger than the individual thunderstorms but smaller than extratropical cyclones, and they normally persist for up to 24 hours.

The general workflow of the algorithm is to read brightness-temperature data from files on a local or remote filesystem and place the data into an array. Cloud elements, defined as regions of a pre-defined minimum size that are cooler than a given temperature, are identified in each time frame. These cloud elements are the graph nodes. Spatial overlap between cloud elements in sequential time frames determines the graph edges. The graph is then traversed to determine which connected nodes between frames represent the various MCSs.

Identifying these MCSs in the SciSpark environment is challenging because of the sequential constraint while parallelizing the job, as well as the need to traverse a large sparse graph with multiple subgraphs, each of which can start at any time frame other than the first and terminate at any time frame other than the last.
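To make the graph construction concrete, here is an illustrative Scala sketch of deriving edges from spatial overlap between consecutive frames. The CloudElement class and its fields are hypothetical simplifications of GTG's actual data structures:

```scala
object OverlapGraphSketch {
  case class CloudElement(frame: Int, id: Long, cells: Set[(Int, Int)])

  // An edge exists when an element in frame t shares at least one grid
  // cell with an element in frame t + 1.
  def overlapEdges(elements: Seq[CloudElement]): Seq[(Long, Long)] = {
    val byFrame = elements.groupBy(_.frame)
    byFrame.keys.toSeq.sorted.flatMap { t =>
      for {
        a <- byFrame.getOrElse(t, Seq.empty)
        b <- byFrame.getOrElse(t + 1, Seq.empty)
        if a.cells.intersect(b.cells).nonEmpty
      } yield (a.id, b.id)
    }
  }
}
```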

How to use

  1. git clone https://github.com/SciSpark/SciSpark.git
  2. Make sure the mainClass in Compile setting in your build.sbt file points to MainDistGraphMCC (see the build.sbt sketch after this list)
  3. Run sbt clean assembly
  4. Run the jar in your target directory with the following args:
    1. Master URL - the Spark master URL. NB: this may be the scheduler URL if you are using Mesos or YARN
    2. Number of partitions
    3. Input file path on HDFS - for the format of the file, have a look here. The file consists of an edge list. An edge is represented by a tuple (source node, destination node). Each node contains two points (x, y).
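For step 2, here is a build.sbt sketch. The fully qualified class name is an assumption based on SciSpark's org.dia package layout; verify it against the package declared in MainDistGraphMCC.scala.

```scala
// build.sbt: point the Compile and assembly main classes at the MCC driver.
// The package path below is an assumption; check the source tree.
mainClass in (Compile, run) := Some("org.dia.algorithms.mcc.MainDistGraphMCC")
mainClass in assembly := Some("org.dia.algorithms.mcc.MainDistGraphMCC")
```

A hypothetical invocation for step 4 (the jar path, master URL, partition count, and input path below are placeholders, not the actual build output):

```
spark-submit --class org.dia.algorithms.mcc.MainDistGraphMCC \
  target/scala-2.10/SciSpark.jar \
  spark://master:7077 8 hdfs:///path/to/edgelist.txt
```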