Semisupervised classification methods (SSC) with Spark-ML, study and implementation

Dguipla/TFM-SemiSup

Master's thesis for the MSc in Business Intelligence and Big Data in Cyber-Secure Environments, offered by the universities of Burgos, León and Valladolid

Author: David Guinart Platero

Supervisors: Dr. Álvar Arnaiz González and Dr. Juan José Rodríguez Diez

  • Status: Completed
  • Semi-supervised algorithms implemented: Self-Training and Co-Training
  • Design tools: Databricks and Spark (Scala)
  • Visualization tools: Power BI and Python (matplotlib, seaborn, and Plotly)


Abstract

This master's thesis studies classification algorithms within the framework of inductive semi-supervised learning. The algorithms work on top of base classifiers (decision tree, naive Bayes, ...) and are implemented with Spark ML.

The main goal of this study is to design, implement, verify and build a library on top of Spark ML that is easy to reuse in any Spark environment.
In addition, the project runs experiments on several datasets to compare empirically the results of supervised algorithms (base classifiers trained only on the labeled samples) with those of semi-supervised algorithms (trained on both labeled and unlabeled samples). To analyze the outcomes in more detail and to handle the large amount of result data, the project includes a dashboard.

Finally, the project presents its conclusions and proposes new lines of research that build on the Spark library developed in this work.
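To give a feel for how Self-Training wraps a base classifier, here is a minimal sketch in Scala with Spark ML. It is not the code of this library: the object name `SelfTrainingSketch`, the function `selfTrain`, the confidence `threshold`, the `maxIter` cap and the column names `features` and `label` are all assumptions made for illustration. The idea follows the usual description of the method [1][4]: fit the base classifier on the labeled samples, predict the unlabeled ones, move the most confident predictions into the training set as pseudo-labels, and repeat.

```scala
import org.apache.spark.ml.classification.{DecisionTreeClassificationModel, DecisionTreeClassifier}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, udf}

// Hypothetical helper, not part of the library described in this README.
object SelfTrainingSketch {

  // Highest class probability in the "probability" vector that
  // Spark ML probabilistic classifiers add to their output.
  private val maxProb = udf((v: Vector) => v.toArray.max)

  def selfTrain(
      labeled: DataFrame,       // assumed columns: features (Vector), label (Double)
      unlabeled: DataFrame,     // assumed column: features (Vector)
      threshold: Double = 0.9,  // assumed confidence cut-off for pseudo-labels
      maxIter: Int = 10         // assumed cap on the number of rounds
  ): DecisionTreeClassificationModel = {
    val base = new DecisionTreeClassifier()
      .setFeaturesCol("features")
      .setLabelCol("label")

    var train = labeled.select("features", "label")
    var pool = unlabeled.select("features")
    var model = base.fit(train)
    var iter = 0
    var done = false

    while (iter < maxIter && !done) {
      // Score the remaining unlabeled samples with the current model.
      val scored = model.transform(pool)
        .withColumn("confidence", maxProb(col("probability")))
        .cache()
      val confident = scored.filter(col("confidence") >= threshold)

      if (confident.isEmpty) {
        done = true // nothing confident enough is left to add
      } else {
        // Add the confident predictions as pseudo-labeled samples and refit.
        train = train.union(
          confident.select(col("features"), col("prediction").as("label")))
        pool = scored.filter(col("confidence") < threshold).select("features")
        model = base.fit(train)
      }
      iter += 1
    }
    model
  }
}
```

Calling `SelfTrainingSketch.selfTrain(labeledDf, unlabeledDf)` returns a fitted model that can be applied with `transform` like any other Spark ML model. The decision tree is only a placeholder here; any Spark ML classifier that produces a `probability` column (for example `NaiveBayes`) could take its place.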

References (Self-Training and Co-Training algorithms):

[1] Yarowsky, David (1995). Unsupervised word sense disambiguation rivaling supervised methods. Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics, 189-196.

[2] Iosifidis, Vasileios and Ntoutsi, Eirini (2017). Large scale sentiment learning with limited labels. Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1823-1832.

[3] Blum, Avrim and Mitchell, Tom (1998). Combining labeled and unlabeled data with co-training. Proceedings of the Eleventh Annual Conference on Computational Learning Theory, 92-100.

[4] Triguero, Isaac and Garcia, Salvador and Herrera, Francisco (2015). Self-labeled techniques for semi-supervised learning: taxonomy, software and empirical study. Knowledge and Information Systems (Springer), volume 42, number 2, 245-284.