Comparison of Probabilistic Classifiers

IBM HR Analytics Employee Attrition & Performance

@ Jiayu Qi, Nov 5, 2018

Abstract

Classification is a data mining technique used to predict the group label of data points in a given dataset. For binary classification, techniques such as k-nearest neighbor, support vector machine, and decision tree provide non-probabilistic results such as yes or no. Naive Bayes, on the other hand, applies Bayes' theorem under the assumption of class-conditional independence and provides a probability for each class. This paper focuses on comparing the probabilities produced by different classification techniques: by converting non-probabilistic classifiers to probabilistic classifiers, we are able to evaluate each classifier on sensitivity, specificity, accuracy, AUC, and threshold. Specifically, we conduct our research on the IBM employee attrition dataset, which is a binary classification problem. Because the class distribution is unbalanced, we apply and compare different preprocessing methods, such as oversampling, undersampling, normalization, feature extraction, and feature selection, across Naive Bayes, KNN, and SVM. After preprocessing, all three classifiers showed improved prediction performance compared with the unpreprocessed data. The results indicate that the support vector machine combined with oversampling and normalization achieves the best classification performance. Application of these models has the potential to help reduce employee attrition.
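As a rough illustration of the evaluation criteria named above, the following sketch computes sensitivity, specificity, and accuracy at a chosen threshold, plus AUC, from predicted probabilities. The function names and the toy labels and scores are assumptions made for this example; they are not taken from the project's code or data.

```python
def threshold_metrics(y_true, scores, threshold=0.5):
    """Sensitivity, specificity and accuracy when scores >= threshold predict class 1."""
    tp = fp = tn = fn = 0
    for y, s in zip(y_true, scores):
        predicted = 1 if s >= threshold else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return {
        "sensitivity": tp / (tp + fn) if (tp + fn) else 0.0,
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,
        "accuracy": (tp + tn) / len(y_true),
    }


def auc(y_true, scores):
    """AUC as the chance that a random positive scores above a random negative.

    Ties count as half a win. Assumes both classes are present.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = attrition, 0 = no attrition.
y_true = [1, 0, 1, 0, 0, 1]
scores = [0.9, 0.4, 0.6, 0.7, 0.2, 0.3]

metrics = threshold_metrics(y_true, scores, threshold=0.5)
area = auc(y_true, scores)
```

Sensitivity, specificity, and accuracy depend on the threshold, while AUC summarizes ranking quality across all thresholds, which is why a probabilistic output is needed to report both.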

Introduction

In machine learning, a probabilistic classifier is a classifier that, given an observation of an input, predicts a probability distribution over a set of classes rather than only outputting the most likely class for the observation. However, there are non-probabilistic classifiers whose uncertainty cannot be “quantified”. In this research project, we are interested in converting non-probabilistic classifiers such as k-nearest neighbor (KNN) and support vector machine (SVM) to probabilistic ones and comparing them with the probabilistic classifier Naive Bayes, in order to find the probabilistic classifier that best quantifies the uncertainty of a case being assigned to a given class. Moreover, we are interested in exploring the effect of preprocessing steps on the overall performance of the classifiers. The dataset we are using is IBM HR Analytics Employee Attrition & Performance. Attrition in human resources refers to the gradual loss of employees over time. The goal is to find the best probabilistic classifier to predict the attrition of valuable employees.
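To make the idea of turning a non-probabilistic classifier into a probabilistic one concrete, here is a minimal sketch for KNN. It assumes one common approach, in which the predicted probability of the positive class is the fraction of the k nearest neighbors carrying that label. The text above does not state which conversion the project uses, so the function name, the Euclidean distance, and the toy points are all assumptions for illustration.

```python
import math


def knn_positive_probability(train_X, train_y, query, k=3):
    """Estimate P(class = 1 | query) as the share of positive labels
    among the k training points closest to the query (Euclidean distance)."""
    by_distance = sorted(
        (math.dist(x, query), y) for x, y in zip(train_X, train_y)
    )
    nearest = by_distance[:k]
    return sum(label for _, label in nearest) / k


# Hypothetical two-feature toy data: label 1 = attrition, 0 = no attrition.
train_X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = [0, 0, 0, 1, 1, 1]

p_near_positives = knn_positive_probability(train_X, train_y, (4, 4), k=3)
p_near_negatives = knn_positive_probability(train_X, train_y, (1, 1), k=3)
p_mixed = knn_positive_probability(train_X, train_y, (4, 4), k=5)
```

With a hard vote, the query at (4, 4) is labeled positive for both k=3 and k=5. The vote fraction additionally shows that the k=5 prediction is less certain, which is the kind of uncertainty a probabilistic classifier is meant to expose.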