passenger_satisfaction_stacking_anova

by Lennart Wallentin, lennartwallentin@gmail.com

In 'passenger_satisfaction_stacking_anova_lennart_wallentin.ipynb' I classify whether a person flying with an unnamed North American airline is satisfied or neutral/dissatisfied with their flight. The classification itself is not the end goal of this project, however. The main objectives are:

Is the AUC score of a Stacked Generalization (Stacking) model higher than the AUC scores of two standalone machine learning models (Logistic regression and XGBoost)?
Compare the three models with one-way analysis of variance (ANOVA) tests to see whether they differ from each other.
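
As a rough illustration of the second objective, the sketch below computes the one-way ANOVA F statistic for three groups of AUC scores using only the standard library. The function name `one_way_anova_f` and the per-fold AUC values are hypothetical and made up for this example; they are not results from the notebook, and the notebook itself may set up the test differently.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F statistic, df_between, df_within) for a list of samples."""
    k = len(groups)                           # number of groups (models)
    n_total = sum(len(g) for g in groups)     # total number of observations
    grand_mean = sum(sum(g) for g in groups) / n_total

    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical per-fold AUC scores, invented for illustration only
auc_scores = {
    "logistic_regression": [0.91, 0.92, 0.90, 0.91, 0.92],
    "xgboost": [0.96, 0.97, 0.96, 0.97, 0.96],
    "stacking": [0.97, 0.97, 0.96, 0.98, 0.97],
}

f_stat, df_b, df_w = one_way_anova_f(list(auc_scores.values()))
print(f"F({df_b}, {df_w}) = {f_stat:.2f}")
```

A large F statistic suggests that at least one model's mean AUC differs from the others. Turning F into a p-value requires the F distribution, which `scipy.stats.f_oneway` handles directly.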

In addition, my aim with this project is to show that I have good knowledge of statistics, machine learning and business. I demonstrate that by explaining and using:

Statistics - More advanced statistical topics such as chi-square tests and Cramér's V, which I construct to test for association between categorical features; GLM, the logistic function and odds ratios in conjunction with logistic regression; and, as already mentioned, one-way analysis of variance (ANOVA) tests. Basic statistics such as boxplots, the Pearson correlation coefficient, histograms and probabilities are also part of this project.
Machine learning - Stacking with base and meta learners, as well as Logistic regression and XGBoost models, and how they fit the data with regard to bias and variance. I cross-validate and do feature selection on all three models, and tune hyperparameters with Bayesian optimization on the standalone XGBoost model and on the XGBoost base and meta learners in the stacking model.
A range of evaluation metrics is displayed for each model. Since AUC is my main evaluation metric for this project, I construct graphs for each model that show how different cut-off thresholds for the predicted probabilities affect the True Positive Rate/Sensitivity and the False Positive Rate.
Business knowledge - For example, I noticed that some of the survey questions were wrongly labeled as 0 instead of NA and fixed that. For outliers in flight distance and departure delay, I use business sense. Working in an organization, I would of course talk to stakeholders to get more information about any open questions regarding the data.
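
To make the Cramér's V test for association mentioned above more concrete, here is a minimal sketch built on the chi-square statistic, using only the standard library. The function name `cramers_v` and the toy category values are assumptions for illustration and are not taken from the notebook or the dataset. The sketch applies no continuity or bias correction.

```python
from collections import Counter
from math import sqrt


def cramers_v(xs, ys):
    """Cramér's V for two equally long sequences of category labels."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)

    row_totals = Counter(xs)          # marginal counts of the first feature
    col_totals = Counter(ys)          # marginal counts of the second feature
    observed = Counter(zip(xs, ys))   # joint counts (contingency table cells)

    # Pearson chi-square statistic: sum of (observed - expected)^2 / expected
    chi2 = 0.0
    for r, r_total in row_totals.items():
        for c, c_total in col_totals.items():
            expected = r_total * c_total / n
            chi2 += (observed[(r, c)] - expected) ** 2 / expected

    min_dim = min(len(row_totals), len(col_totals)) - 1
    if min_dim == 0:
        return 0.0  # a constant feature carries no association
    return sqrt(chi2 / (n * min_dim))


# Toy labels, invented for illustration only
travel_class = ["eco", "eco", "business", "business"]
satisfaction = ["dissatisfied", "dissatisfied", "satisfied", "satisfied"]
print(cramers_v(travel_class, satisfaction))
```

Values close to 0 indicate little association and values close to 1 indicate strong association. For real data, `scipy.stats.chi2_contingency` can supply the chi-square statistic; note that it applies Yates' continuity correction to 2x2 tables by default, so its result can differ from this uncorrected sketch.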

The xlsx file 'bayesian_optimization_iterations.xlsx' contains all the Bayesian optimization iterations.
The dataset used for this project is in the csv file 'data_passenger_satisfaction_stacking_anova_lennart_wallentin.csv'.
