This Maven Java project implements three common measures for link prediction in graphs: Common Neighbors, Jaccard Coefficient, and Adamic-Adar. The project leverages the power of Apache Spark to efficiently process large graphs in a distributed environment.

panastasiadis/link-prediction-with-spark

Link Prediction with Spark: Implementing Common Measures for Link Prediction in Graphs using Apache Spark and Maven

This is a Maven project.

This guide assumes that Apache Spark and a Java JDK are already installed on your Linux machine.

To avoid Java heap space errors, the spark-submit commands below include the option --driver-memory 4g, which gives the Spark driver 4g of memory instead of the default.

Compile

To compile one of the project's apps, open a terminal in the project's root directory and run cd ./adamic-adar/, cd ./common-neighbors/ or cd ./jaccard-coefficient/ to enter the desired app's directory. Then run:

./mvnw install

To delete the build output (the target directory) produced by the previous command, run:

./mvnw clean

Execute

After compiling an app, you can execute it by running the matching command below from inside the app's directory.

NOTE: Replace <input-file-directory> with the absolute path of your input file, as in the example at the end of this guide. You can use the files inside the given input directory.

NOTE: Replace <number-of-the-top-results-to-be-displayed> with a number. For example, 500 displays the top 500 results.

For Common Neighbors:

spark-submit --class org.spark.CommonNeighbors --driver-memory 4g \
./target/commonneighbors-0.1.jar <input-file-directory> <number-of-the-top-results-to-be-displayed>

For Jaccard Coefficient:

spark-submit --class org.spark.JaccardCoefficient --driver-memory 4g \
./target/jaccardcoefficient-0.1.jar <input-file-directory> <number-of-the-top-results-to-be-displayed> 

For Adamic/Adar:

spark-submit --class org.spark.AdamicAdar --driver-memory 4g \
./target/adamicadar-0.1.jar <input-file-directory> <number-of-the-top-results-to-be-displayed> 

Input

Each app takes as input a text file that contains an undirected graph.

  • Lines starting with '#' are considered comments.
  • Each edge of the graph is described on its own line by two integers (the nodes), separated by a space.
  • Nodes must be described only as integers.
  • The graph must be undirected, so if, for example, "1 2" is an edge of the graph, "2 1" must also be included.
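The rules above can be checked before submitting a job. The following is a minimal sketch in plain Java, not code from this project: the class name EdgeListCheck and its methods are hypothetical, and skipping blank lines is an assumption that the rules above do not state. It parses edge lines, ignores '#' comments, and reports every edge whose reverse is missing.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of the project: validates the input format described above.
public class EdgeListCheck {

    /** Parses edge lines into "u v" keys, skipping '#' comments and (by assumption) blank lines. */
    static Set<String> parseEdges(List<String> lines) {
        Set<String> edges = new LinkedHashSet<>();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue;
            }
            String[] parts = line.split("\\s+");
            if (parts.length != 2) {
                throw new IllegalArgumentException("Expected two node ids: " + raw);
            }
            // parseInt throws NumberFormatException if a node is not an integer.
            int u = Integer.parseInt(parts[0]);
            int v = Integer.parseInt(parts[1]);
            edges.add(u + " " + v);
        }
        return edges;
    }

    /** Returns the reverse edges ("v u") that are absent, i.e. what must be added for symmetry. */
    static List<String> missingReverseEdges(Set<String> edges) {
        List<String> missing = new ArrayList<>();
        for (String edge : edges) {
            String[] p = edge.split(" ");
            String reverse = p[1] + " " + p[0];
            if (!edges.contains(reverse)) {
                missing.add(reverse);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("# comment", "1 2", "2 1", "1 3");
        // "1 3" has no matching "3 1", so that reverse edge is reported.
        System.out.println(missingReverseEdges(parseEdges(lines)));
    }
}
```

An empty result means every edge appears in both directions, as the last rule requires.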

Below is an example of a graph file that follows the structure described above.

# comment1
# comment2
2 1 
1 3 
4 2 
3 4 
5 3 
4 5 
1 2 
3 1 
2 4 
4 3 
3 5 
5 4 
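To make the three measures concrete, here is a small in-memory sketch in plain Java that scores one pair of nodes from the example graph above. It uses the textbook definitions (number of shared neighbors, shared neighbors divided by the union of neighbors, and the sum of 1/log(degree) over shared neighbors). It is not the project's Spark code: the class LinkMeasuresSketch and its methods are hypothetical, and details such as the natural logarithm and which node pairs the apps actually score are assumptions.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration, not part of the project: textbook link-prediction scores.
public class LinkMeasuresSketch {

    /** Builds an undirected adjacency map; each edge only needs to be listed once here. */
    static Map<Integer, Set<Integer>> adjacency(int[][] edges) {
        Map<Integer, Set<Integer>> adj = new HashMap<>();
        for (int[] e : edges) {
            adj.computeIfAbsent(e[0], k -> new HashSet<>()).add(e[1]);
            adj.computeIfAbsent(e[1], k -> new HashSet<>()).add(e[0]);
        }
        return adj;
    }

    /** Neighbors shared by u and v. */
    static Set<Integer> common(Map<Integer, Set<Integer>> adj, int u, int v) {
        Set<Integer> shared = new HashSet<>(adj.get(u));
        shared.retainAll(adj.get(v));
        return shared;
    }

    /** Common Neighbors: size of the intersection of the two neighborhoods. */
    static int commonNeighbors(Map<Integer, Set<Integer>> adj, int u, int v) {
        return common(adj, u, v).size();
    }

    /** Jaccard Coefficient: intersection size divided by union size. */
    static double jaccard(Map<Integer, Set<Integer>> adj, int u, int v) {
        Set<Integer> union = new HashSet<>(adj.get(u));
        union.addAll(adj.get(v));
        return union.isEmpty() ? 0.0 : (double) common(adj, u, v).size() / union.size();
    }

    /** Adamic-Adar: sum of 1/ln(degree) over shared neighbors (natural log assumed). */
    static double adamicAdar(Map<Integer, Set<Integer>> adj, int u, int v) {
        double score = 0.0;
        for (int z : common(adj, u, v)) {
            // A shared neighbor is adjacent to both u and v, so its degree is at least 2
            // and the logarithm is strictly positive.
            score += 1.0 / Math.log(adj.get(z).size());
        }
        return score;
    }

    public static void main(String[] args) {
        // The example graph above, with each undirected edge listed once.
        int[][] edges = {{1, 2}, {1, 3}, {2, 4}, {3, 4}, {3, 5}, {4, 5}};
        Map<Integer, Set<Integer>> adj = adjacency(edges);
        System.out.println("Common Neighbors(1,4) = " + commonNeighbors(adj, 1, 4));
        System.out.println("Jaccard(1,4)          = " + jaccard(adj, 1, 4));
        System.out.println("Adamic-Adar(1,4)      = " + adamicAdar(adj, 1, 4));
    }
}
```

In the example graph, nodes 1 and 4 are not connected but share the neighbors 2 and 3, so their Common Neighbors score is 2 and their Jaccard Coefficient is 2/3 (the union of their neighborhoods is {2, 3, 5}).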

Ready-to-run Examples

You can test the apps by using the existing graph files from the input directory.

Inside that directory, there are two graph files from snap.stanford.edu/data and a file called small_input that contains the example described above.

Example

To test the apps with the ca-AstroPh dataset and display the top 5000 results, assuming you are in the project's root directory and have not compiled anything yet, run the commands below. Each block starts from the project's root directory.

For Common Neighbors:

cd common-neighbors
./mvnw install
spark-submit --class org.spark.CommonNeighbors --driver-memory 4g ./target/commonneighbors-0.1.jar "$(realpath ../input/ca-AstroPh.txt)" 5000

For Jaccard Coefficient:

cd jaccard-coefficient
./mvnw install
spark-submit --class org.spark.JaccardCoefficient --driver-memory 4g ./target/jaccardcoefficient-0.1.jar "$(realpath ../input/ca-AstroPh.txt)" 5000

For Adamic/Adar:

cd adamic-adar
./mvnw install
spark-submit --class org.spark.AdamicAdar --driver-memory 4g ./target/adamicadar-0.1.jar "$(realpath ../input/ca-AstroPh.txt)" 5000

Enjoy the results!
