Semisupervised classification methods (SSC) with Spark-ML, study and implementation

Master's thesis, MSc in Business Intelligence and Big Data in Cyber-Secure Environments, universities of Burgos, León and Valladolid

Author: David Guinart Platero

Supervisors: Dr. Álvar Arnaiz González and Dr. Juan José Rodríguez Diez

Status: Completed
Semi-supervised algorithms implemented: Self-Training and Co-Training
Design tools: Databricks, Spark (Scala)
Visualization tools: Power BI and Python (Matplotlib, Seaborn and Plotly)

Abstract

This master's thesis studies classification algorithms within the framework of inductive semi-supervised learning. The algorithms wrap base classifiers (decision tree, Naive Bayes, ...) and are implemented with Spark ML.
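To illustrate the self-training idea studied here [1], the following is a minimal sketch in plain Python. It is not the Spark ML (Scala) library from this repository: the nearest-centroid base classifier, the function names, the confidence measure and the 0.8 threshold are all assumptions chosen to keep the example small and dependency-free.

```python
from math import dist, exp


def fit_centroids(points, labels):
    """Toy base classifier (an assumption): one mean point per class."""
    sums, counts = {}, {}
    for p, y in zip(points, labels):
        acc = sums.setdefault(y, [0.0] * len(p))
        for i, v in enumerate(p):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: tuple(v / counts[y] for v in acc) for y, acc in sums.items()}


def predict_with_confidence(centroids, p):
    """Return (label, confidence); confidence is a softmax over negative distances."""
    weights = {y: exp(-dist(p, c)) for y, c in centroids.items()}
    label = max(weights, key=weights.get)
    return label, weights[label] / sum(weights.values())


def self_training(labeled, unlabeled, threshold=0.8, max_iter=10):
    """Iteratively move confidently predicted unlabeled points into the training set.

    labeled:   list of (point, label) pairs
    unlabeled: list of points
    Returns the final model and the points that were never labeled.
    """
    points = [p for p, _ in labeled]
    labels = [y for _, y in labeled]
    pool = list(unlabeled)
    for _ in range(max_iter):
        model = fit_centroids(points, labels)
        confident, remaining = [], []
        for p in pool:
            y, conf = predict_with_confidence(model, p)
            if conf >= threshold:
                confident.append((p, y))
            else:
                remaining.append(p)
        if not confident:
            break  # nothing new to learn from; stop early
        for p, y in confident:
            points.append(p)
            labels.append(y)
        pool = remaining
        if not pool:
            break
    # Refit once more so the model reflects every pseudo-labeled point.
    return fit_centroids(points, labels), pool


if __name__ == "__main__":
    labeled = [((0.0, 0.0), "a"), ((4.0, 4.0), "b")]
    unlabeled = [(0.5, 0.2), (3.8, 4.1), (1.0, 1.0), (2.0, 2.0)]
    model, leftover = self_training(labeled, unlabeled)
    print(model)
    print(leftover)
```

In this toy run the point halfway between the two classes never reaches the confidence threshold, so it stays unlabeled. A supervised baseline, by contrast, would call `fit_centroids` on the two labeled points only.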

The main goal of this study is to design, implement and verify a library built on Spark ML that is easy to reuse in any Spark environment.
In addition, the project runs experiments on several datasets to compare empirically the results of the supervised algorithms (the base classifiers, trained only on the labeled samples) with those of the semi-supervised algorithms, which are trained on both labeled and unlabeled samples. A dashboard was created to analyze the outcomes in more detail and to manage the large amount of data produced.

Finally, the thesis presents its conclusions and proposes new lines of research that build on the Spark library developed in this project.

References (Self-Training and Co-Training algorithms):

[1] Yarowsky, David (1995). Unsupervised word sense disambiguation rivaling supervised methods. 33rd Annual Meeting of the Association for Computational Linguistics, 189-196.

[2] Iosifidis, Vasileios and Ntoutsi, Eirini (2017). Large scale sentiment learning with limited labels. Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1823-1832.

[3] Blum, Avrim and Mitchell, Tom (1998). Combining labeled and unlabeled data with co-training. Proceedings of the Eleventh Annual Conference on Computational Learning Theory, 92-100.

[4] Triguero, Isaac and García, Salvador and Herrera, Francisco (2015). Self-labeled techniques for semi-supervised learning: taxonomy, software and empirical study. Knowledge and Information Systems (Springer), volume 42, number 2, 245-284.
