Project Summary

This repository compares the performance of multiple machine learning (ML) algorithms when the data distribution is highly imbalanced, with one response category overwhelmingly dominant. The dataset is randomly divided into a training set and a test set. I then fit each model on the training set, apply it to the test set, and record the misclassification error.
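The split-fit-evaluate workflow described above can be sketched in base R as follows. This is only an illustration on synthetic data, not the code from the .Rmd files: the 80/20 split ratio, the 0.5 cutoff, the simulated predictor `x`, and all variable names are assumptions made for the example.

```r
# Minimal sketch: random train/test split and test-set misclassification error.
# The data are simulated so that the positive class is rare (an assumption).
set.seed(1)
n <- 1000
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-2.5 + 1.2 * x))  # mostly 0s, few 1s
dat <- data.frame(x = x, y = y)

# Randomly assign 80% of the rows to the training set (ratio is an assumption).
train_idx <- sample(seq_len(n), size = floor(0.8 * n))
train <- dat[train_idx, ]
test <- dat[-train_idx, ]

# Fit a logistic regression on the training set only.
fit <- glm(y ~ x, data = train, family = binomial)

# Predict on the held-out test set and classify with a 0.5 cutoff.
prob <- predict(fit, newdata = test, type = "response")
pred <- as.integer(prob > 0.5)

# Misclassification error: share of test rows with a wrong predicted label.
misclass_error <- mean(pred != test$y)

# Error of a trivial classifier that always predicts the majority class (0).
baseline_error <- mean(test$y == 1)
```

Comparing `misclass_error` with `baseline_error` is worthwhile on imbalanced data, because a model that always predicts the majority class can already achieve a low error rate.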

Furthermore, I use ROC curves and AUC to compare the models and conclude that KNN, a non-parametric method, outperforms the others when the distribution is highly imbalanced.
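For readers unfamiliar with these metrics, here is a minimal base R sketch of how ROC points and AUC can be computed from predicted scores and true 0/1 labels. The function names `roc_points` and `auc_rank` are hypothetical and the toy scores are made up; the repository's own ROC code may rely on a package instead. The AUC here uses the rank-based (Mann-Whitney) formula.

```r
# ROC points: sort by descending score and accumulate true/false positive rates.
# Tied scores are treated as separate thresholds, which is a simplification.
roc_points <- function(scores, labels) {
  ord <- order(scores, decreasing = TRUE)
  labels <- labels[ord]
  data.frame(
    fpr = c(0, cumsum(labels == 0) / sum(labels == 0)),
    tpr = c(0, cumsum(labels == 1) / sum(labels == 1))
  )
}

# AUC via ranks: the probability that a randomly chosen positive
# receives a higher score than a randomly chosen negative.
auc_rank <- function(scores, labels) {
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  r <- rank(scores)  # ties get average ranks
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy example (made-up scores, not results from this project).
toy_scores <- c(0.1, 0.4, 0.35, 0.8)
toy_labels <- c(0, 0, 1, 1)
toy_roc <- roc_points(toy_scores, toy_labels)
toy_auc <- auc_rank(toy_scores, toy_labels)  # 0.75
```

Unlike the raw misclassification error, AUC does not depend on a single classification cutoff, which is one reason it is commonly used for rare-event problems.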

For the complete analysis, please refer to my Medium post, "A Pain in the Neck: Predict A Rare Event using 5 Machine Learning Methods": https://towardsdatascience.com/classifying-rare-events-using-five-machine-learning-techniques-fab464573233.

Installing

This project is written in R. Before running the code, install the following packages: readr, knitr, dplyr, plyr, class, reshape2, tree, randomForest, car, and e1071.
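One possible way to install only the packages that are not yet present is sketched below. The helper `missing_packages` is a hypothetical name introduced for this example, and the `interactive()` guard is an assumption so that nothing is downloaded when the script runs non-interactively.

```r
# Packages listed in this README.
required <- c("readr", "knitr", "dplyr", "plyr", "class",
              "reshape2", "tree", "randomForest", "car", "e1071")

# Hypothetical helper: return the packages that are not installed yet.
missing_packages <- function(pkgs, installed = rownames(installed.packages())) {
  setdiff(pkgs, installed)
}

to_install <- missing_packages(required)

# Install from CRAN only in an interactive session (guard is an assumption).
if (length(to_install) > 0 && interactive()) {
  install.packages(to_install)
}
```

If both plyr and dplyr are attached in the same session, loading plyr first avoids plyr masking dplyr functions such as `summarise` and `mutate`.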

What is the data?

This dataset was collected by a Portuguese banking institution during its direct marketing campaigns (phone calls). The goal is to predict whether a client will subscribe to a term deposit. The data source is available at https://archive.ics.uci.edu/ml/datasets/bank+marketing.
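To see how imbalanced the response is, the class shares can be tabulated as in the sketch below. It assumes the file follows the UCI layout (semicolon-separated, with a response column named `y` holding "yes"/"no"); the copy in this repository may differ. The helper `response_distribution` and the toy rows are hypothetical and are not real records.

```r
# Hypothetical helper: counts and shares of each response class.
response_distribution <- function(df, response = "y") {
  counts <- table(df[[response]])
  data.frame(
    class = names(counts),
    n = as.integer(counts),
    share = as.numeric(counts) / sum(counts)
  )
}

# Toy data in the assumed layout (made-up rows, 1 "yes" out of 10).
toy_text <- paste(
  c("age;job;y",
    "30;admin.;no", "41;technician;no", "35;services;no",
    "52;retired;yes", "29;student;no", "44;management;no",
    "38;admin.;no", "47;blue-collar;no", "33;technician;no",
    "56;housemaid;no"),
  collapse = "\n"
)
toy <- read.csv(text = toy_text, sep = ";")
toy_dist <- response_distribution(toy)

# The same call on the real file, if it is in the working directory.
# Separator and column name are assumptions based on the UCI version.
path <- "bank-additional-full.csv"
if (file.exists(path)) {
  bank <- read.csv(path, sep = ";")
  print(response_distribution(bank))
}
```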

About the Author

Leihua Ye is a Ph.D. researcher at the University of California, Santa Barbara. He has received extensive training in Causal Inference, Research Design, Machine Learning, and Big Data.

He received his B.A. and M.A. from the University of Nottingham.

Contact

Email: yeleihua@gmail.com

LinkedIn: www.linkedin.com/in/leihuaye

Tech Blog: https://leihua-ye.medium.com

Folders and files

01-Packages, libraries, and data cleaning.Rmd
02-Data distribution of the response variable.Rmd
03-Generate training and test sets.Rmd
04-Logistic regression.Rmd
05-Decision Tree.Rmd
06-KNN.Rmd
07-Random Forests.Rmd
08-SVM.Rmd
09-ROC and AUC.Rmd
10-Discussion and Future Research.Rmd
All Codes.Rmd
README.md
bank-additional-full.csv


LeihuaYe/Machine-Learning-Rare-Event-Classification
