cerenkasap/prediction_of_employee_promotion
Prediction of Employee Promotion - Binary Classification 🏆: Project Overview

Created a model that classifies whether an employee will be promoted, reaching 89.49% test accuracy.

Used two datasets with the pandas library in Python: PromosSample.csv as historical data (54,808 examples) and test.csv as current data (23,490 examples).

Applied Decision Tree Classifier, Logistic Regression, Support Vector Classifier, Random Forest Classifier, Bernoulli Naive Bayes, and KNeighborsClassifier, and optimized with GridSearchCV to find the best model.

Code Used

Python version: Python 3.7.11

Packages: pandas, seaborn, matplotlib, numpy, scikit-learn, pickle, and imbalanced-learn (SMOTE)

Resources Used

Machine Learning Yearning by Andrew Ng

Calculating the missing value ratio

Binary classification project with similar dataset

Binary classification project with similar dataset

Data Collection

Used historical dataset with 13 columns:

| Column name | Variable type |
| --- | --- |
| employee_id | Numerical |
| department | Categorical |
| region | Categorical |
| education | Categorical |
| gender | Categorical |
| recruitment_channel | Categorical |
| no_of_trainings | Numerical |
| age | Numerical |
| previous_year_rating | Numerical |
| length_of_service | Numerical |
| awards_won? | Categorical |
| avg_training_score | Numerical |
| is_promoted | Categorical |

Data Cleaning

After pulling the data, I cleaned both datasets (historical and current) to reduce noise. The following changes were made:

  • Removed duplicates, if any, based on the "employee_id" column,
  • Checked null values and their ratio; none of the variables were removed since their ratios are quite small,

ratio_of_missing_values

  • Filled missing values in the historical dataset: 'previous_year_rating' with the mean based on 'awards_won?', 'education' and 'recruitment_channel' with the most frequent value based on 'department', and 'gender' with the mode based on 'awards_won?',
  • Filled missing values in the current dataset: 'previous_year_rating' with the mean based on 'awards_won?' from the historical dataset, and 'education' with the most frequent value based on 'department' from the historical dataset,
  • Replaced Bachelors with Bachelor's in the 'education' column for consistency,

Before:

Before

After:

After

  • Replaced FEMALE, Female, and female with f, and MALE, Male, and male with m in the 'gender' column for consistency,

Before:

Before

After:

After
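The cleaning steps above can be sketched in pandas; this is a minimal illustration on a tiny hand-made frame (the real project reads the CSVs, and the column names follow the dataset's schema):

```python
import pandas as pd

# Tiny stand-in for the historical dataset; the real data comes from PromosSample.csv.
df = pd.DataFrame({
    "employee_id": [1, 2, 2, 3, 4],
    "department": ["Sales", "HR", "HR", "Sales", "HR"],
    "education": ["Bachelors", "Master's & above", "Master's & above", None, "Bachelor's"],
    "gender": ["FEMALE", "Male", "Male", "female", "m"],
})

# Remove duplicates based on 'employee_id'
df = df.drop_duplicates(subset="employee_id").reset_index(drop=True)

# Check the missing-value ratio per column
missing_ratio = df.isna().mean()

# Fill missing 'education' with the most frequent value within each 'department'
df["education"] = df.groupby("department")["education"].transform(
    lambda s: s.fillna(s.mode().iloc[0])
)

# Consistent spellings: Bachelors -> Bachelor's; gender collapsed to 'f' / 'm'
df["education"] = df["education"].replace("Bachelors", "Bachelor's")
df["gender"] = df["gender"].str.lower().str[0]
```

The same `groupby(...).transform(...)` pattern covers the mean- and mode-based fills described above, just with a different aggregation per column.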

Exploratory Data Analysis

Visualized the cleaned data to see the trends.

  • Created a donut chart for the is_promoted data. The data is imbalanced and needs to be balanced. Donut_Chart

  • Created pie charts and stacked bar chart for categorical variables: department variable: Pie_Chart Percentage

recruitment_channel variable: Pie_Chart Percentage

education variable: Pie_Chart Percentage

gender variable: Pie_Chart Percentage

  • Created bar graphs and a stacked bar chart for numerical variables:

age variable: Bar_chart Dist

previous_year_rating variable:

Bar_chart Dist

Feature Engineering

Categorical variables were encoded, numerical ones were normalized, and the 'employee_id' variable was removed from both datasets.

The data was balanced by applying SMOTE and visualized with a donut chart:

Donut_Chart

Model Building

Data were split into train (80%) and test (20%) sets.

I used six models (Decision Tree Classifier, Logistic Regression, Support Vector Classifier, Random Forest Classifier, Bernoulli Naive Bayes, and KNeighborsClassifier) to predict promotions and evaluated them using accuracy.
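The comparison can be sketched as an 80/20 split followed by 5-fold cross-validation accuracy for each candidate; synthetic data stands in for the prepared features:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import BernoulliNB
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=400, random_state=0)  # stand-in data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Support Vector Classifier": SVC(),
    "Random Forest": RandomForestClassifier(random_state=42),
    "Bernoulli Naive Bayes": BernoulliNB(),
    "K-Neighbors": KNeighborsClassifier(),
}

# Mean 5-fold cross-validation accuracy on the training split for each model
cv_scores = {
    name: cross_val_score(model, X_train, y_train, cv=5, scoring="accuracy").mean()
    for name, model in models.items()
}
```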

Model Performance Evaluation


The Random Forest Classifier performed better than any other model in this project, but after tuning its parameters the accuracy dropped to 78%, so I used the Decision Tree model instead, as it had the second-highest score.

| Model | Cross-Validation Accuracy Score |
| --- | --- |
| Decision Tree | 0.9272768163134083 |
| Logistic Regression | 0.7748373591096378 |
| Support Vector Classifier | 0.844780678076086 |
| Random Forest Classifier | 0.9496770892385463 |
| Naive Bayes | 0.6916930889263052 |
| K-Neighbors | 0.8362169216009582 |

Hyperparameter Tuning

Using GridSearchCV, I found the optimal hyperparameters and achieved a best cross-validation accuracy of 88.43%.
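A minimal GridSearchCV sketch for tuning the Decision Tree; the grid values here are illustrative, not the project's actual grid:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)  # stand-in data

param_grid = {
    "max_depth": [3, 5, 10, None],
    "min_samples_split": [2, 5, 10],
    "criterion": ["gini", "entropy"],
}
search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X, y)
best_model = search.best_estimator_  # tree refit with the best parameter combination
```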

Best Model

Applied the Decision Tree model with the optimal hyperparameters and got an 89.49% test accuracy score.

Save model

The model is pickled and saved to disk as model_file.p.
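Saving and reloading with pickle looks like this; model_file.p matches the filename above, and the tiny training data is purely for illustration:

```python
import pickle
from sklearn.tree import DecisionTreeClassifier

# Toy fit so there is a model to serialize
model = DecisionTreeClassifier().fit([[0], [1]], [0, 1])

with open("model_file.p", "wb") as f:
    pickle.dump(model, f)

# Later: load the model back from disk for predictions
with open("model_file.p", "rb") as f:
    loaded = pickle.load(f)
```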

Feature Importances

The 'previous_year_rating', 'avg_training_score', and 'length_of_service' features have the most impact on whether an employee gets promoted.
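Importances come straight from the fitted tree's feature_importances_ attribute; a sketch on synthetic data, borrowing three of the dataset's column names:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # the first feature matters most here

feature_names = ["previous_year_rating", "avg_training_score", "length_of_service"]
tree = DecisionTreeClassifier(max_depth=4, random_state=42).fit(X, y)

# Importances sum to 1; sort to rank features by impact
ranked = sorted(
    zip(feature_names, tree.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
```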

Predictions for current data

Applying the model to the current dataset, we can expect 13,902 current employees to be promoted and 9,588 not to be; the donut chart shows the distribution.

Donut_Chart

Confusion Matrix

The Confusion Matrix below shows that our model needs to be improved to predict promotions better.

alt text

We estimate the bias as 8.05% and the variance as 2.46% (10.51 − 8.05). The classifier fits the training set poorly, with 8.05% error, but its error on the test set is only slightly higher than the training error.

The classifier therefore has high bias, but low variance.

We can say that the algorithm is underfitting.

| Data | Accuracy Score (%) | Error (%) |
| --- | --- | --- |
| Training | 91.95 | 8.05 |
| Test | 89.49 | 10.51 |

Bias vs. Variance tradeoff

Adding input features might help reduce the model's bias.

Notes

Error analysis should be performed to understand the underlying causes of the errors (misclassifications).

Thanks for reading :)
