MachineLearning

Some fundamental machine learning and data analysis techniques are revisited here through practical projects.
Almost every project was developed in a Jupyter notebook. The notebooks have also been exported as clean PDF files.

The Machine Learning Pipeline used for every project (a code sketch follows the list)

  1. Question and required data
  2. Acquire the data
  3. Data preprocessing
  4. Prepare the data for the machine learning model
  5. Train the model
  6. Make predictions on the test data
  7. Evaluate the model
  8. If the performance is not satisfactory, adjust the model
  9. Interpret the model and report results visually and numerically
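A minimal sketch of these steps in scikit-learn, using a toy dataset; the dataset and model choices here are illustrative and not taken from the projects below:

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# 1-2. Question and data: can the cultivar of a wine be predicted from its features?
X, y = load_wine(return_X_y=True)

# 3-4. Preprocess and prepare: split into train/test sets, then scale the features
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# 5-6. Train the model and make predictions on the test data
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)

# 7-9. Evaluate; if unsatisfactory, adjust the model, then report the results
print(accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
```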

Table of Contents

  1. Logistic Regression: Nudging customers toward paid products using data produced by an app
  2. Random Forests: Wine quality predictor
  3. SVM: Disease predictor
  4. kMeans Clustering: Image compression
  5. Neural Nets: Autism Spectrum Disorder predictor
  6. Deep Neural Nets: Bank customer exit predictor




Logistic Regression

Nudging customers toward paid products using data produced by apps.
Companies often offer free premium products or services in an attempt to transition their customers to a premium membership. This case study examines the services offered by a mobile app whose customers get a 24-hour window of free premium membership. Our goal is to determine which users are less likely to subscribe to the paid membership, so that they can be targeted with further marketing.
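A minimal sketch of the modeling step, assuming a hypothetical CSV file and a hypothetical target column ('enrolled', marking users who subscribed after the trial); the project's actual schema may differ:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

df = pd.read_csv('app_usage.csv')   # hypothetical file of per-user app usage features
X = df.drop(columns=['enrolled'])   # hypothetical target column
y = df['enrolled']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
scaler = StandardScaler().fit(X_train)

clf = LogisticRegression().fit(scaler.transform(X_train), y_train)

# Probability of *not* subscribing; high values mark users to target with marketing
p_no_sub = clf.predict_proba(scaler.transform(X_test))[:, 0]
likely_non_subscribers = X_test[p_no_sub > 0.5]
```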

Random Forests

Our goal is to predict the quality of wines given a set of features (acidity, density, pH, etc.). The original paper uses SVM, neural networks, and multiple regression. Here a random forest model is investigated, which achieves 97.31% accuracy.
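A minimal sketch, assuming the UCI red wine quality CSV (semicolon-separated, with a 'quality' column); the number of trees is illustrative:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

df = pd.read_csv('winequality-red.csv', sep=';')   # assumed file layout
X, y = df.drop(columns=['quality']), df['quality']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# An ensemble of 500 decision trees, each trained on a bootstrap sample
rf = RandomForestClassifier(n_estimators=500, random_state=42).fit(X_train, y_train)
print(accuracy_score(y_test, rf.predict(X_test)))
```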

SVM

We explore a scikit-learn dataset of malignant and benign breast cancers. Each sample consists of a 30-dimensional feature vector and a class label describing whether the cancer is malignant or benign. Our goal is to train a machine learning algorithm to classify unseen samples.

SVM achieves the best result with 98.25% accuracy, followed by logistic regression (lasso) with 97.37% and random forests with 95.61%. Our SVM even outperforms the SVM model in the referenced paper, probably due to scikit-learn's state-of-the-art implementations, but also because hyperparameter tuning led to a different kernel choice than the one used in the paper.
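A minimal sketch of the SVM baseline on scikit-learn's built-in dataset; the hyperparameter grid below is an assumption, not the project's exact search space:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)   # 30 features, binary target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# Tune the kernel and regularization strength via cross-validation
grid = GridSearchCV(SVC(), {'kernel': ['linear', 'rbf', 'poly'], 'C': [0.1, 1, 10]}, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.score(X_test, y_test))
```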

kMeans Clustering

KMeans belongs to the category of prototype-based clustering algorithms, where each cluster is represented by a prototype, usually a centroid. It is an unsupervised machine learning algorithm whose purpose is to uncover latent structure in the data. More information on the algorithm can be found here.

KMeans clustering and image compression

The idea is to find n clusters (32, for instance) in the image and reduce the 256^3 possible color combinations by creating a new image in which each original color is replaced by the color of the closest cluster centroid. This is straightforward to apply, since an image can be viewed as a NumPy array whose length equals the image height and whose elements are arrays with length equal to the image width; each entry of those width arrays holds the RGB values of a single pixel. The immediate downside of this approach is that the compression comes at the cost of reduced image quality.
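A minimal sketch of this idea with scikit-learn and Pillow; the file names and the number of clusters are assumptions:

```python
import numpy as np
from PIL import Image
from sklearn.cluster import KMeans

img = np.asarray(Image.open('input.jpg'))        # shape: (height, width, 3)
pixels = img.reshape(-1, 3).astype(np.float64)   # one row of RGB values per pixel

# Find 32 representative colors (the cluster centroids)
kmeans = KMeans(n_clusters=32, random_state=0).fit(pixels)

# Replace every pixel with the color of its closest centroid
compressed = kmeans.cluster_centers_[kmeans.labels_].reshape(img.shape).astype(np.uint8)
Image.fromarray(compressed).save('compressed.jpg')
```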

Neural Nets

The task is to build a neural network model that can predict autism spectrum disorder. The model achieves a 99% accuracy rate on the test set.
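A minimal Keras sketch of such a binary classifier; the architecture and the placeholder data are assumptions, not the project's exact model:

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# Placeholder data standing in for the preprocessed ASD screening features
rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 20)).astype('float32')
y_train = rng.integers(0, 2, size=500).astype('float32')

model = Sequential([
    Dense(16, activation='relu', input_shape=(X_train.shape[1],)),
    Dense(8, activation='relu'),
    Dense(1, activation='sigmoid'),   # probability of an ASD diagnosis
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(X_train, y_train, epochs=50, batch_size=32, validation_split=0.1)
```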

Deep Neural Nets

The task is to predict whether a bank customer will leave the bank, given a set of features. The deep learning model was built with Keras, and hyperparameter optimization was performed with talos.
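A minimal sketch of a Keras model wired into a talos scan; the parameter grid, the model shape, and the placeholder data are assumptions, not the project's actual setup:

```python
import numpy as np
import talos
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

# Placeholder data standing in for the preprocessed bank customer features
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10)).astype('float32')
y = rng.integers(0, 2, size=(1000, 1)).astype('float32')   # 1 = customer left

params = {'units': [16, 32], 'dropout': [0.0, 0.2], 'epochs': [20]}

def churn_model(x_train, y_train, x_val, y_val, params):
    model = Sequential([
        Dense(params['units'], activation='relu', input_shape=(x_train.shape[1],)),
        Dropout(params['dropout']),
        Dense(1, activation='sigmoid'),
    ])
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    history = model.fit(x_train, y_train, validation_data=(x_val, y_val),
                        epochs=params['epochs'], verbose=0)
    return history, model   # talos expects (history, model) back

scan = talos.Scan(x=X, y=y, params=params, model=churn_model, experiment_name='churn')
```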