
Machine Learning Algorithms using Spark

The purpose of these packages (solutions expressed in Java and Spark) is to show how to implement basic machine learning algorithms (K-Means, Naive Bayes, Logistic Regression, Linear Regression, ...) using Spark and Spark's MLlib library. Spark's MLlib offers a suite of machine learning algorithms, including:

  • Naive Bayes
  • Logistic Regression
  • K-Means
  • Linear Regression

Machine Learning vs. Traditional Programming

Machine Learning in Pictures

K-Means Clustering Algorithm

K-Means is a clustering algorithm that partitions a dataset into K clusters (where K > 1). Here we look at how to implement K-Means clustering in Spark to cluster the featurized Wikipedia dataset. K-Means is one of the simplest unsupervised learning algorithms for solving the well-known clustering problem.
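To make the assign-and-update loop behind K-Means concrete, here is a minimal plain-Java sketch. It is not the Spark or MLlib code from this package: the class name KMeansSketch, the fixed iteration count, and the toy points are assumptions chosen only for illustration.

```java
final class KMeansSketch {
    // Squared Euclidean distance between two points of equal dimension.
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    // Index of the centroid closest to the given point.
    static int nearest(double[] point, double[][] centroids) {
        int best = 0;
        for (int k = 1; k < centroids.length; k++) {
            if (squaredDistance(point, centroids[k]) < squaredDistance(point, centroids[best])) {
                best = k;
            }
        }
        return best;
    }

    // Runs a fixed number of assign/update iterations and returns the final centroids.
    static double[][] cluster(double[][] points, double[][] initial, int iterations) {
        int k = initial.length;
        int dim = initial[0].length;
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) {
            centroids[c] = initial[c].clone();
        }
        for (int it = 0; it < iterations; it++) {
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            // Assignment step: each point joins its nearest centroid.
            for (double[] p : points) {
                int c = nearest(p, centroids);
                counts[c]++;
                for (int d = 0; d < dim; d++) {
                    sums[c][d] += p[d];
                }
            }
            // Update step: each centroid moves to the mean of its points.
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) {
                    continue; // an empty cluster keeps its previous centroid
                }
                for (int d = 0; d < dim; d++) {
                    centroids[c][d] = sums[c][d] / counts[c];
                }
            }
        }
        return centroids;
    }

    public static void main(String[] args) {
        // Hypothetical toy data: two well-separated groups in 2D.
        double[][] points = {{1, 1}, {1, 2}, {9, 9}, {10, 9}};
        double[][] initial = {{0, 0}, {10, 10}};
        for (double[] c : cluster(points, initial, 5)) {
            System.out.println(c[0] + ", " + c[1]);
        }
    }
}
```

In a Spark implementation the assignment step would typically be a map over the distributed points and the update step a reduce by cluster index; the sketch keeps both in a single-machine loop for readability.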

For details on K-Means clustering, you should read

K-Means using Spark's MLlib

  • org.dataalgorithms.machinelearning.kmeans.Featurization

    This is a standalone Spark program to featurize the WikiStats

  • org.dataalgorithms.machinelearning.kmeans.WikipediaKMeansUsingUtilVector

    This solution implements the K-Means algorithm using the org.apache.spark.util.Vector class

  • org.dataalgorithms.machinelearning.kmeans.WikipediaKMeansUsingMLlibVector

    This solution implements the K-Means algorithm using the org.apache.spark.mllib.linalg.Vector interface

Logistic Regression Algorithm

Use simple logistic regression when you have one nominal variable and one measurement variable, and you want to know whether variation in the measurement variable causes variation in the nominal variable.
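As a rough illustration of simple logistic regression with one measurement variable x and one nominal (0/1) variable y, the plain-Java sketch below fits an intercept and a slope by batch gradient descent on the log loss. It does not use Spark's MLlib; the class name LogisticSketch, the learning rate, the iteration count, and the toy data are all assumptions made for this example.

```java
final class LogisticSketch {
    // Logistic (sigmoid) function: maps any real number into (0, 1).
    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    // Fits P(y = 1 | x) = sigmoid(b0 + b1 * x) by batch gradient descent.
    // Returns {b0, b1}.
    static double[] fit(double[] x, int[] y, double learningRate, int iterations) {
        double b0 = 0.0;
        double b1 = 0.0;
        for (int it = 0; it < iterations; it++) {
            double g0 = 0.0;
            double g1 = 0.0;
            for (int i = 0; i < x.length; i++) {
                double error = sigmoid(b0 + b1 * x[i]) - y[i];
                g0 += error;        // gradient of the log loss w.r.t. b0
                g1 += error * x[i]; // gradient of the log loss w.r.t. b1
            }
            b0 -= learningRate * g0 / x.length;
            b1 -= learningRate * g1 / x.length;
        }
        return new double[]{b0, b1};
    }

    // Predicted probability that the nominal variable equals 1 at measurement x.
    static double predict(double[] model, double x) {
        return sigmoid(model[0] + model[1] * x);
    }

    public static void main(String[] args) {
        // Hypothetical data: small measurements map to 0, large ones to 1.
        double[] x = {1, 2, 3, 4};
        int[] y = {0, 0, 1, 1};
        double[] model = fit(x, y, 0.1, 10000);
        System.out.println("P(y=1 | x=1) = " + predict(model, 1));
        System.out.println("P(y=1 | x=4) = " + predict(model, 4));
    }
}
```

A predicted probability above 0.5 is read as class 1 and below 0.5 as class 0. The applications listed below follow the same train-then-predict split, with one class building the model and another applying it to new data.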

For details on Logistic Regression, you should read

Logistic Regression Applications using Spark's MLlib

Breast Cancer Detection

These Spark programs detect breast cancer using a logistic regression model:

  • org.dataalgorithms.machinelearning.logistic.BreastCancerDetectionBuildModel

The class BreastCancerDetectionBuildModel builds the model from the given training data

  • org.dataalgorithms.machinelearning.logistic.BreastCancerDetection

This is the driver class, which uses the built model to classify new queried data

Detect Spam and Non-Spam Emails

This solution classifies emails as spam or non-spam:

  • org.dataalgorithms.machinelearning.logistic.EmailSpamDetectionBuildModel

The class EmailSpamDetectionBuildModel builds the model from the given training data

  • org.dataalgorithms.machinelearning.logistic.EmailSpamDetection

This is the driver class, which uses the built model to classify new queried data

Naive Bayes Algorithm

"The Naive Bayes algorithm is an intuitive method that uses the probabilities of each attribute belonging to each class to make a prediction. It is the supervised learning approach you would come up with if you wanted to model a predictive modeling problem probabilistically. Naive Bayes simplifies the calculation of probabilities by assuming that the probability of each attribute belonging to a given class value is independent of all other attributes. This is a strong assumption but results in a fast and effective method." (source: http://machinelearningmastery.com/naive-bayes-classifier-scratch-python/)

The Naive Bayes classifier is built on Bayes' theorem for conditional probabilities:

P(a|b) = (P(b|a) P(a)) / P(b)
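The formula above can be worked through numerically. The plain-Java sketch below computes P(a|b) for a two-class problem (a versus not-a) and multiplies the per-attribute likelihoods, which is where the "naive" independence assumption enters. The class name NaiveBayesSketch and the probabilities in main are hypothetical values chosen for illustration, not results from any dataset.

```java
final class NaiveBayesSketch {
    // Posterior P(a | b1..bn) for a two-class problem (a vs. not-a),
    // assuming the attributes b1..bn are independent given the class.
    static double posterior(double priorA, double[] likelihoodGivenA, double[] likelihoodGivenNotA) {
        double scoreA = priorA;          // becomes P(a) * product of P(bi | a)
        double scoreNotA = 1.0 - priorA; // becomes P(not a) * product of P(bi | not a)
        for (double p : likelihoodGivenA) {
            scoreA *= p;
        }
        for (double p : likelihoodGivenNotA) {
            scoreNotA *= p;
        }
        // The denominator P(b) is the sum of the two class scores.
        return scoreA / (scoreA + scoreNotA);
    }

    public static void main(String[] args) {
        // Hypothetical numbers: P(spam) = 0.3, P(word | spam) = 0.8, P(word | not spam) = 0.1.
        double p = posterior(0.3, new double[]{0.8}, new double[]{0.1});
        System.out.printf("P(spam | word) = %.4f%n", p);
    }
}
```

With these numbers, P(b) = 0.8 * 0.3 + 0.1 * 0.7 = 0.31, so the posterior is 0.24 / 0.31, roughly 0.774.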

For details on Naive Bayes, you should read

Naive Bayes Applications using Spark's MLlib

The following Spark classes may be used to implement Naive Bayes:

  • org.apache.spark.mllib.classification.NaiveBayes: trains a Naive Bayes model given an RDD of (label, features) pairs.

  • org.apache.spark.mllib.classification.NaiveBayesModel: the model for Naive Bayes classifiers.

Linear Regression

Regression analysis is the art and science of fitting straight lines to patterns of data. Linear regression is a simple approach to supervised learning. It assumes that the dependence of Y on X1, X2, ..., Xn is linear. In reality, true regression functions are never linear.
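For the single-predictor case, fitting a straight line by ordinary least squares has a closed form: the slope is the covariance of X and Y divided by the variance of X, and the intercept makes the line pass through the means. The plain-Java sketch below shows that calculation. It is not the MLlib implementation; the class name LinearRegressionSketch and the toy data are assumptions for illustration.

```java
final class LinearRegressionSketch {
    // Ordinary least squares fit of y = intercept + slope * x.
    // Returns {intercept, slope}.
    static double[] fit(double[] x, double[] y) {
        int n = x.length;
        double meanX = 0.0;
        double meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;

        double sxy = 0.0; // sum of (x - meanX) * (y - meanY)
        double sxx = 0.0; // sum of (x - meanX)^2
        for (int i = 0; i < n; i++) {
            sxy += (x[i] - meanX) * (y[i] - meanY);
            sxx += (x[i] - meanX) * (x[i] - meanX);
        }
        double slope = sxy / sxx;
        double intercept = meanY - slope * meanX;
        return new double[]{intercept, slope};
    }

    public static void main(String[] args) {
        // Hypothetical data lying exactly on the line y = 2x + 1.
        double[] x = {1, 2, 3, 4};
        double[] y = {3, 5, 7, 9};
        double[] model = fit(x, y);
        System.out.println("intercept = " + model[0] + ", slope = " + model[1]);
    }
}
```

With several predictors X1, X2, ..., Xn the same idea generalizes to solving for a coefficient vector, which is usually done iteratively or with matrix methods rather than with this two-pass formula.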

For details on Linear Regression, you should read

Questions/Comments

Thank you!

best regards,
Mahmoud Parsian

Data Algorithms Book