vbarzokas/apache-spark-link-prediction

Link Prediction in Citation Networks

A set of methods and model evaluation metrics for predicting links in an academic citation network using Apache Spark and Scala.

Description

In this experimental study we develop methods and evaluate models for predicting links in an academic citation network, taking two different aspects into consideration:

  1. Having insight into the existing network and some of its links, and trying to restore a portion of it that has been deliberately removed.
  2. Having no information about the existing network, and relying only on the information in the scientific papers to predict the structure of the whole network.
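
The first aspect depends on hiding part of the known citation links and then trying to recover them. The snippet below is a minimal sketch of such a hold-out split in plain Scala, not the code used in this repository. The `HoldOutEdges` object, the `(Int, Int)` edge representation, and the `hiddenFraction` and `seed` parameters are all assumptions made for illustration.

```scala
import scala.util.Random

object HoldOutEdges {
  // A citation edge as (citing paper ID, cited paper ID).
  // This representation is an assumption, not the repository's format.
  type Edge = (Int, Int)

  /** Randomly hides a fraction of the edges and returns (visible, hidden). */
  def split(edges: Seq[Edge], hiddenFraction: Double, seed: Long): (Seq[Edge], Seq[Edge]) = {
    require(hiddenFraction >= 0.0 && hiddenFraction <= 1.0, "hiddenFraction must be in [0, 1]")
    // Shuffle with a fixed seed so the split is reproducible.
    val shuffled = new Random(seed).shuffle(edges)
    val hiddenCount = math.round(edges.size * hiddenFraction).toInt
    val (hidden, visible) = shuffled.splitAt(hiddenCount)
    (visible, hidden)
  }
}
```

A model would then be trained using only the visible edges and scored on how well it recovers the hidden ones.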

For the first aspect we used supervised binary classification, specifically Logistic Regression, which performed very well, with an F1 score close to 86% on the testing set. For the second aspect we relied mainly on the Jaccard similarity, computed via MinHash LSH, between the papers' abstracts, which had been vectorized using TF-IDF.
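
To make the two measures mentioned above concrete, here is a minimal plain-Scala sketch of the exact Jaccard similarity and of the F1 score. It is illustrative only: MinHash LSH estimates Jaccard similarity rather than computing it exactly, and the `LinkMetrics` object, its method names, and the simple tokenizer are assumptions rather than code from this repository.

```scala
object LinkMetrics {
  /** Splits text into a set of lowercase word tokens (a deliberately naive tokenizer). */
  def tokens(text: String): Set[String] =
    text.toLowerCase.split("\\W+").filter(_.nonEmpty).toSet

  /** Exact Jaccard similarity |A ∩ B| / |A ∪ B|; defined here as 0.0 when both sets are empty. */
  def jaccard[T](a: Set[T], b: Set[T]): Double = {
    val unionSize = (a | b).size
    if (unionSize == 0) 0.0 else (a & b).size.toDouble / unionSize
  }

  /** F1 = 2TP / (2TP + FP + FN), the harmonic mean of precision and recall. */
  def f1(truePositives: Long, falsePositives: Long, falseNegatives: Long): Double = {
    val denominator = 2 * truePositives + falsePositives + falseNegatives
    if (denominator == 0) 0.0 else 2.0 * truePositives / denominator
  }
}
```

For instance, `LinkMetrics.jaccard(LinkMetrics.tokens("quantum field theory"), LinkMetrics.tokens("string field theory"))` compares two token sets that share two of four distinct words, giving 0.5.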

For more detailed information check the draft paper.

Prerequisites

Dataset

Our dataset contains 27,770 academic papers, each associated with the following information:

1. unique ID
2. publication year (between 1993 and 2003)
3. title
4. authors
5. name of journal
6. abstract

The dataset is located under src/main/resources.
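
One possible way to model a record with the six fields listed above is a Scala case class. This is only a sketch: the field names and types (for example an `Int` ID, or authors as a `Seq[String]`) are assumptions, since the on-disk format is not described here.

```scala
// Hypothetical in-memory model of one paper; names and types are assumed.
final case class Paper(
  id: Int,               // unique ID
  year: Int,             // publication year
  title: String,
  authors: Seq[String],
  journal: String,       // name of journal
  abstractText: String   // "abstract" is a reserved word in Scala
)

object Paper {
  /** Checks the year against the range stated for this dataset (1993 to 2003). */
  def hasValidYear(paper: Paper): Boolean =
    paper.year >= 1993 && paper.year <= 2003
}
```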
