In this project I try to predict whether an employee will churn by setting up an extensive machine learning pipeline.
Determine what factors predict an employee leaving his/her job. For this project you will be evaluating a dataset composed of human resources data. Your objective is to build a model (or models) that predict whether or not an employee is likely to leave his/her job based on characteristics in the dataset.
The dataset contains around 15,000 rows with 9 independent features and 1 dependent target variable, which indicates whether an employee will leave or not.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14999 entries, 0 to 14998
Data columns (total 10 columns):
satisfaction ----- 14999 non-null float64
evaluation ----- 14999 non-null float64
number_of_projects----- 14999 non-null int64
average_montly_hours ----- 14999 non-null int64
time_spend_company ---- 14999 non-null int64
work_accident ----- 14999 non-null int64
churn ----- 14999 non-null int64
promotion ------ 14999 non-null int64
department ------ 14999 non-null object
salary ------ 14999 non-null object
dtypes: float64(2), int64(6), object(2)
memory usage: 1.1+ MB
The target distribution is highly imbalanced, which is common for this type of problem.
Barplot showing the count of different categories
As can easily be observed, this is a binary classification task.
We have a set of different classification metrics to choose from, but the most suitable one is f1_score, as it is the harmonic mean of recall and precision.
We could also have used the ROC_AUC score since the problem is imbalanced, but I found f1_score more suitable given the problem description.
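To make the metric choice concrete, here is a minimal sketch of how f1_score combines precision and recall. The helper name `f1_from_precision_recall` is made up for illustration; the first pair of inputs is the precision and recall reported in the results of this write-up, and the second pair is an invented example.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the results of this write-up.
reported_f1 = f1_from_precision_recall(0.9775, 0.9984)
print(f"reported model: {reported_f1:.4f}")

# Invented example: perfect recall cannot hide poor precision.
lopsided_f1 = f1_from_precision_recall(0.10, 1.00)
print(f"lopsided model: {lopsided_f1:.4f}")
```

Because the harmonic mean is dominated by the smaller of the two values, a model that simply predicts "leaves" for everyone gets a low score even though its recall is perfect. The first call comes out at roughly 98.8%, in line with the f1_score reported for the tuned model.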
We created features in the following ways:
1) Transformation:--
2) Aggregation:-- using mean, min, max, sum and std we created a whole lot of features.
3) Some of the features were created as a result of Exploratory Data Analysis.
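The aggregation step can be sketched roughly as follows. This is only an illustration under assumptions: grouping by `department` and aggregating `average_montly_hours` is a guess at one plausible combination, the helper `add_group_aggregates` is hypothetical, and the sample rows are made up. The standard library is used here to keep the sketch dependency-free; a pandas `groupby` would be the more usual tool.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Tiny made-up sample; column names follow the dataset, values are illustrative.
rows = [
    {"department": "sales", "average_montly_hours": 150},
    {"department": "sales", "average_montly_hours": 250},
    {"department": "IT", "average_montly_hours": 200},
]


def add_group_aggregates(rows, group_key, value_key):
    """Attach per-group mean/min/max/sum/std of value_key to every row."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row[value_key])

    # Population std is used so that single-member groups get 0 instead of an error.
    stats = {
        group: {
            "mean": mean(values),
            "min": min(values),
            "max": max(values),
            "sum": sum(values),
            "std": pstdev(values),
        }
        for group, values in groups.items()
    }

    enriched = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows stay untouched
        for name, value in stats[row[group_key]].items():
            new_row[f"{value_key}_{group_key}_{name}"] = value
        enriched.append(new_row)
    return enriched


features = add_group_aggregates(rows, "department", "average_montly_hours")
```

Each row keeps its original columns and gains five new ones, so repeating this over several group/value pairs quickly produces a large number of features.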
A good model to start with is a random forest, as it supports both classification and regression tasks.
Description of the baseline model
We started with some standard models such as Logistic Regression and KNN.
This graph describes the performance of the models:--
As can be seen, ExtraTrees performs well, along with MLPClassifier.
We then built a hard voting classifier using ExtraTrees, RandomForest and MLPClassifier.
It performed fairly well.
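For readers unfamiliar with hard voting, the idea is a simple majority vote over the predicted labels of the individual models. The sketch below shows only that voting rule with made-up predictions; the function name `hard_vote` is hypothetical, and in practice scikit-learn's `VotingClassifier(voting="hard")` does this on top of fitted estimators.

```python
from collections import Counter


def hard_vote(predictions_per_model):
    """Majority vote across models for each sample.

    predictions_per_model: list of equal-length label lists, one per model.
    Ties go to the label that appears first in model order for that sample.
    """
    voted = []
    for sample_votes in zip(*predictions_per_model):
        label, _count = Counter(sample_votes).most_common(1)[0]
        voted.append(label)
    return voted


# Made-up predictions for four employees (1 = leaves, 0 = stays).
extra_trees_pred = [1, 0, 0, 1]
random_forest_pred = [1, 0, 1, 0]
mlp_pred = [0, 0, 1, 1]

ensemble_pred = hard_vote([extra_trees_pred, random_forest_pred, mlp_pred])
print(ensemble_pred)
```

With three models and two classes there is never a tie, which is one reason an odd number of voters is convenient for binary problems.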
We then moved on to a more advanced model, an untuned xgboost, and the features selected by xgboost are as follows:--
We then ran cross-validation with xgboost and compared the mean error rate on the train and test sets:--
By now we had a considerably bigger dataset than we started with.
At this stage we have 202 features created by the feature engineering step.
We then removed all features having a correlation of more than 95% among themselves and then used recursive feature elimination to select only the best performing features.
This reduced our dataset to 114 features.
We used PCA, ICA and UMAP to represent the dataset in three-dimensional space, and we used these features in the modelling.
for umap:--
for pca:--
for ica:--
There are different techniques for selecting the best parameters for the modelling algorithm, but they are not as efficient as Bayesian optimization.
Hyperparameter search space:--
import numpy as np
from hyperopt import hp

# Search space for Bayesian optimization with hyperopt.
space = {
    # Each boosting type carries its own nested sub-parameters.
    'boosting_type': hp.choice('boosting_type',
                               [{'boosting_type': 'gbdt',
                                 'subsample': hp.uniform('gbdt_subsample', 0.5, 1),
                                 'subsample_freq': hp.quniform('gbdt_subsample_freq', 1, 10, 1)},
                                {'boosting_type': 'dart',
                                 'subsample': hp.uniform('dart_subsample', 0.5, 1),
                                 'subsample_freq': hp.quniform('dart_subsample_freq', 1, 10, 1),
                                 'drop_rate': hp.uniform('dart_drop_rate', 0.1, 0.5)},
                                {'boosting_type': 'goss',
                                 'subsample': 1.0,
                                 'subsample_freq': 0}]),
    'limit_max_depth': hp.choice('limit_max_depth', [True, False]),
    'max_depth': hp.choice('max_depth', np.arange(1, 14, dtype=int)),
    'num_leaves': hp.quniform('num_leaves', 3, 50, 1),
    # Learning rate is sampled on a log scale between 0.025 and 0.25.
    'learning_rate': hp.loguniform('learning_rate',
                                   np.log(0.025),
                                   np.log(0.25)),
    'subsample_for_bin': hp.quniform('subsample_for_bin', 2000, 100000, 2000),
    'min_child_samples': hp.quniform('min_child_samples', 5, 80, 5),
    'reg_alpha': hp.uniform('reg_alpha', 0.0, 1.0),
    'reg_lambda': hp.uniform('reg_lambda', 0.0, 1.0),
    'colsample_bytree': hp.uniform('colsample_by_tree', 0.5, 1.0),
    # Note: 'binary:logistic' is the XGBoost objective name; LightGBM, whose
    # parameter names the rest of this space follows, calls it 'binary'.
    'objective': 'binary:logistic'
}
Best Parameters
{'boosting_type': 1,
'colsample_by_tree': 0.9705787113662703,
'dart_drop_rate': 0.10481709074419068,
'dart_subsample': 0.8807619552902802,
'dart_subsample_freq': 10.0,
'learning_rate': 0.2488211061090212,
'limit_max_depth': 1,
'max_depth': 11,
'min_child_samples': 10.0,
'num_leaves': 18.0,
'reg_alpha': 0.1236107627491404,
'reg_lambda': 0.5912175686153809,
'subsample_for_bin': 30000.0}
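A note on reading the output above: for parameters defined with `hp.choice`, hyperopt's `fmin` reports the index of the chosen option rather than its value, and `hp.quniform` values come back as floats. The sketch below decodes the relevant entries by hand, with the option lists copied from the search space; `hyperopt.space_eval(space, best)` performs the same mapping when the `space` object is at hand. The variable names are illustrative, and how the objective function used `limit_max_depth` is not shown in this write-up.

```python
# Relevant entries copied from the "Best Parameters" output.
best = {
    "boosting_type": 1,
    "limit_max_depth": 1,
    "max_depth": 11,
    "num_leaves": 18.0,
    "min_child_samples": 10.0,
    "subsample_for_bin": 30000.0,
    "dart_subsample_freq": 10.0,
}

# Option lists in the same order as in the search space definition.
boosting_types = ["gbdt", "dart", "goss"]
limit_max_depth_options = [True, False]
max_depth_options = list(range(1, 14))  # same values as np.arange(1, 14)

decoded = {
    # hp.choice results are indices into the option lists.
    "boosting_type": boosting_types[best["boosting_type"]],
    "limit_max_depth": limit_max_depth_options[best["limit_max_depth"]],
    "max_depth": max_depth_options[best["max_depth"]],
    # hp.quniform results are floats and need casting for integer parameters.
    "num_leaves": int(best["num_leaves"]),
    "min_child_samples": int(best["min_child_samples"]),
    "subsample_for_bin": int(best["subsample_for_bin"]),
    "subsample_freq": int(best["dart_subsample_freq"]),
}

print(decoded)
```

Read this way, the optimizer picked the dart boosting type, and the `max_depth` entry of 11 stands for a depth of 12, not 11.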
Using the parameters obtained by the optimization, we reached an f1_score of 98.79%.
This confusion matrix better describes it.
Other Scores:--
roc_auc_score: --- 99.57%
precision: -- 97.75%
recall: -- 99.84%
accuracy_score: -- 99.34%
We haven't tried deep learning or advanced stacking of the models.
These could further improve the performance of the model.