cerenkasap/prediction_of_employee_promotion
Prediction of Employee Promotion - Binary Classification 🏆: Project Overview

Created a model that classifies whether an employee will be promoted, reaching 89.49% test accuracy.

Used two datasets with the pandas library in Python: PromosSample.csv as historical data (54,808 examples) and test.csv as current data (23,490 examples).

Applied Decision Tree Classifier, Logistic Regression, Support Vector Classifier, Random Forest Classifier, Bernoulli Naive Bayes, and KNeighborsClassifier, and optimized with GridSearchCV to find the best model.

Code Used

Python version: Python 3.7.11

Packages: pandas, seaborn, matplotlib, numpy, scikit-learn, pickle, and imbalanced-learn (SMOTE)

Resources Used

Machine Learning Yearning by Andrew Ng

Calculating the missing value ratio

Binary classification project with similar dataset

Binary classification project with similar dataset

Data Collection

Used historical dataset with 13 columns:

| Column name | Variable type |
| --- | --- |
| employee_id | Numerical |
| department | Categorical |
| region | Categorical |
| education | Categorical |
| gender | Categorical |
| recruitment_channel | Categorical |
| no_of_trainings | Numerical |
| age | Numerical |
| previous_year_rating | Numerical |
| length_of_service | Numerical |
| awards_won? | Categorical |
| avg_training_score | Numerical |
| is_promoted | Categorical |

Data Cleaning

After pulling the data, I cleaned both datasets (historical and current) to reduce noise. The following changes were made:

  • Removed duplicates, if any, based on the "employee_id" column,
  • Checked null values and their ratio; none of the variables were removed since their ratios are quite small,

ratio_of_missing_values

  • Filled missing values in the historical dataset: 'previous_year_rating' with the mean based on 'awards_won?', 'education' and 'recruitment_channel' with the most frequent value based on 'department', and 'gender' with the mode based on 'awards_won?',
  • Filled missing values in the current dataset: 'previous_year_rating' with the mean based on 'awards_won?' from the historical dataset, and 'education' with the most frequent value based on 'department' from the historical dataset,
  • Replaced Bachelors with Bachelor's in the 'education' column for consistency,

Before:

Before

After:

After

  • Replaced FEMALE, Female, and female with f, and MALE, Male, and male with m in the 'gender' column for consistency,

Before:

Before

After:

After
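The cleaning steps above can be sketched in pandas; this is a minimal illustration on a tiny hand-made frame (the real project reads the CSVs, and the column names follow the dataset's schema):

```python
import pandas as pd

# Tiny stand-in for the historical dataset; the real data comes from PromosSample.csv.
df = pd.DataFrame({
    "employee_id": [1, 2, 2, 3, 4],
    "department": ["Sales", "HR", "HR", "Sales", "HR"],
    "education": ["Bachelors", "Master's & above", "Master's & above", None, "Bachelor's"],
    "gender": ["FEMALE", "Male", "Male", "female", "m"],
})

# Remove duplicates based on 'employee_id'
df = df.drop_duplicates(subset="employee_id").reset_index(drop=True)

# Check the missing-value ratio per column
missing_ratio = df.isna().mean()

# Fill missing 'education' with the most frequent value within each 'department'
df["education"] = df.groupby("department")["education"].transform(
    lambda s: s.fillna(s.mode().iloc[0])
)

# Consistent spellings: Bachelors -> Bachelor's; gender collapsed to 'f' / 'm'
df["education"] = df["education"].replace("Bachelors", "Bachelor's")
df["gender"] = df["gender"].str.lower().str[0]
```

The same `groupby(...).transform(...)` pattern covers the mean- and mode-based fills described above, just with a different aggregation per column.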

Exploratory Data Analysis

Visualized the cleaned data to see the trends.

  • Created a donut chart for the is_promoted data. The data is imbalanced and needs to be balanced. Donut_Chart

  • Created pie charts and stacked bar chart for categorical variables: department variable: Pie_Chart Percentage

recruitment_channel variable: Pie_Chart Percentage

education variable: Pie_Chart Percentage

gender variable: Pie_Chart Percentage

  • Created bar graphs and a stacked bar chart for numerical variables:

age variable: Bar_chart Dist

previous_year_rating variable:

Bar_chart Dist

Feature Engineering

Categorical variables were encoded, numerical ones were normalized, and the 'employee_id' variable was removed from both datasets.

The data was balanced by applying SMOTE and visualized with a donut chart:

Donut_Chart

Model Building

Data were split into train (80%) and test (20%) sets.

I used six models (Decision Tree Classifier, Logistic Regression, Support Vector Classifier, Random Forest Classifier, Bernoulli Naive Bayes, and KNeighborsClassifier) to predict promotions and evaluated them using accuracy.
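The comparison can be sketched as an 80/20 split followed by 5-fold cross-validation accuracy for each candidate; synthetic data stands in for the prepared features:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import BernoulliNB
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=400, random_state=0)  # stand-in data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Support Vector Classifier": SVC(),
    "Random Forest": RandomForestClassifier(random_state=42),
    "Bernoulli Naive Bayes": BernoulliNB(),
    "K-Neighbors": KNeighborsClassifier(),
}

# Mean 5-fold cross-validation accuracy on the training split for each model
cv_scores = {
    name: cross_val_score(model, X_train, y_train, cv=5, scoring="accuracy").mean()
    for name, model in models.items()
}
```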

Model Performance Evaluation


The Random Forest Classifier performed better than any other model in this project, but after tuning its parameters the accuracy dropped to 78%, so I used the Decision Tree model instead, as it had the second-highest score.

| Model | Cross-Validation Accuracy Score |
| --- | --- |
| Decision Tree | 0.9272768163134083 |
| Logistic Regression | 0.7748373591096378 |
| Support Vector Classifier | 0.844780678076086 |
| Random Forest Classifier | 0.9496770892385463 |
| Naive Bayes | 0.6916930889263052 |
| K-Neighbors | 0.8362169216009582 |

Hyperparameter Tuning

Using GridSearchCV, I found the optimal hyperparameters and achieved a best cross-validation accuracy of 88.43%.
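A minimal GridSearchCV sketch for tuning the Decision Tree; the grid values here are illustrative, not the project's actual grid:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)  # stand-in data

param_grid = {
    "max_depth": [3, 5, 10, None],
    "min_samples_split": [2, 5, 10],
    "criterion": ["gini", "entropy"],
}
search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X, y)
best_model = search.best_estimator_  # tree refit with the best parameter combination
```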

Best Model

Applied the Decision Tree model with the optimal hyperparameters and got an 89.49% test accuracy score.

Save model

The model is pickled and saved to disk as model_file.p.
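Saving and reloading with pickle looks like this; model_file.p matches the filename above, and the tiny training data is purely for illustration:

```python
import pickle
from sklearn.tree import DecisionTreeClassifier

# Toy fit so there is a model to serialize
model = DecisionTreeClassifier().fit([[0], [1]], [0, 1])

with open("model_file.p", "wb") as f:
    pickle.dump(model, f)

# Later: load the model back from disk for predictions
with open("model_file.p", "rb") as f:
    loaded = pickle.load(f)
```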

Feature Importances

The 'previous_year_rating', 'avg_training_score', and 'length_of_service' features have the most impact on whether an employee gets promoted.
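Importances come straight from the fitted tree's feature_importances_ attribute; a sketch on synthetic data, borrowing three of the dataset's column names:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # the first feature matters most here

feature_names = ["previous_year_rating", "avg_training_score", "length_of_service"]
tree = DecisionTreeClassifier(max_depth=4, random_state=42).fit(X, y)

# Importances sum to 1; sort to rank features by impact
ranked = sorted(
    zip(feature_names, tree.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
```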

Predictions for current data

Applying the model to the current dataset, we can expect 13,902 current employees to be promoted and 9,588 not to be; the donut chart shows the distribution.

Donut_Chart

Confusion Matrix

The Confusion Matrix below shows that our model needs to be improved to predict promotions better.

alt text

We estimate the bias as 8.05% and the variance as 2.46% (10.51 − 8.05). The classifier fits the training set poorly, with 8.05% error, but its error on the test set is only slightly higher than the training error.

The classifier therefore has high bias, but low variance.

We can say that the algorithm is underfitting.

| Data | Accuracy Score (%) | Error (%) |
| --- | --- | --- |
| Training | 91.95 | 8.05 |
| Test | 89.49 | 10.51 |

Bias vs. Variance tradeoff

Adding input features might help reduce the model's bias.

Notes

Error analysis should be performed to understand the underlying causes of the errors (misclassifications).

Thanks for reading :)
