LightGBM no-show predictor + AWS deployment (EC2 + Elastic Beanstalk)


Predicting no-shows in medical appointments


Predicting whether a patient will show up for their scheduled medical appointment is a critical task for healthcare providers as it can help optimize resource allocation and improve overall patient care. This machine learning project focuses on addressing the issue of patient no-shows in medical appointments. By harnessing the power of data and machine learning, we aim to develop a predictive model that can assist healthcare facilities in identifying patients at higher risk of no-shows.

Starting with a Kaggle dataset of medical appointment no-shows by Joni Hoppen and Aquarela Analytics, we evaluate the performances of five different classification algorithms (Logistic Regression, Decision Trees, Random Forests, XGBoost, and LightGBM) and settle on an LGBMClassifier model as our final model. Using the trained model we make predictions on whether a future appointment would lead to a no-show or not. Finally, we containerise this application, deploy it as an Elastic Beanstalk application on AWS, and provide an API to access it with.

Title Image

Credit: Austrian Medical Association (ÖÄK)

The application, called no-show-predictor, is located at: no-show-predictor-env.eba-hpbyckm2.eu-north-1.elasticbeanstalk.com. You can use this API to start making predictions immediately.

This project uses a Kaggle dataset of over 100,000 medical appointments characterised by 14 associated variables, including temporal details, patient information and the ultimate outcome (and the target variable of our classification task) of the appointment -- whether the patient showed up for the appointment or not. The dataset was created by Joni Hoppen and Aquarela Analytics, and can be downloaded from:

https://www.kaggle.com/datasets/joniarroba/noshowappointments/data

Further details of the datasets used can be found here.
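To give a flavour of the temporal variables: the gap between `ScheduledDay` and `AppointmentDay` is the kind of feature the notebook derives during feature engineering. A hypothetical sketch (the column names follow the dataset; the `lead_days` feature name is mine):

```python
import pandas as pd

# Tiny stand-in frame with the dataset's two temporal columns
df = pd.DataFrame({
    "ScheduledDay": ["2016-04-27 13:30:56+0000"],
    "AppointmentDay": ["2016-06-07 00:00:00+0000"],
})
df["ScheduledDay"] = pd.to_datetime(df["ScheduledDay"])
df["AppointmentDay"] = pd.to_datetime(df["AppointmentDay"])

# Whole days between booking and the appointment itself
df["lead_days"] = (df["AppointmentDay"] - df["ScheduledDay"]).dt.days
```

Long lead times are a plausible driver of no-shows, which is why the temporal columns matter beyond their raw timestamps.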

The project requires the following dependencies to be installed:

Conda
Docker
AWSEBCLI

To run this project locally, follow these steps:

1. Cloning the repository:

git clone https://github.com/abhirup-ghosh/medical-appointment-no-shows.git

2. Setting up the environment:

The easiest way to set up the environment is to use Anaconda. I used the standard Machine Learning Zoomcamp conda environment ml-zoomcamp, which you can create, activate, and install the relevant libraries in, using the following commands in your terminal:

conda create -n ml-zoomcamp python=3.9
conda activate ml-zoomcamp
conda install numpy pandas scikit-learn seaborn jupyter xgboost pipenv flask gunicorn lightgbm

Alternatively, I have also provided a conda environment.yml file that can be directly used to create the environment:

conda env create -f opt/environment.yml

In case you are working in a Python virtual environment, I provide a list of dependencies that can be pip installed using:

pip install -r opt/optional_requirements.txt

3. Running notebooks/notebook.ipynb

This notebook outlines the entire investigation and consists of the following steps [🚨 Skip this step if you want to directly use the final configuration for training and/or the final model for predictions]:

  • Data loading
  • Data cleaning and preparation
  • Exploratory data analysis
  • Feature Engineering
  • Feature importance
  • Setting up a validation framework
  • Model evaluation [and hyper-parameter tuning]
  • Saving the best model and encoders [in the models directory]
  • Preparation of the test data
  • Making predictions using the saved model
  • Testing Flask framework
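A standard validation framework of the kind the notebook sets up can be sketched as a 60/20/20 split (the exact split proportions here are an assumption, not taken from the notebook):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in for the prepared feature matrix and target
X = np.arange(100).reshape(-1, 1)
y = np.arange(100) % 2

# Hold out 20% as the final test set, then carve a validation set out of
# the remainder (0.25 of 80% = 20% of the full data)
X_full_train, X_test, y_full_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(
    X_full_train, y_full_train, test_size=0.25, random_state=1)
```

Models are tuned against the validation set, and the test set is touched only once for the final scores.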

4. Training model

We implement the training of our best model (LGBMClassifier) in the scripts/train.py file, which can be run using:

cd scripts
python train.py

The output of this script, which includes the model and the encoder/scaler transforms, can be found in: models/LGBMClassifier_tranformers_final.bin. It has an accuracy of 0.807 and an ROC AUC of 0.797. This is the model we use to make predictions in the next steps.

5. Making predictions

We have written a Flask app that serves the model on port 9696; it can be run using:

cd scripts
python predict.py

or gunicorn as:

cd scripts
gunicorn --bind 0.0.0.0:9696 predict:app
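Such a serving script has roughly this shape. This is a sketch, not the repo's predict.py: the route name is an assumption, and a trivial scoring rule stands in for the pickled LGBMClassifier pipeline so the example is self-contained:

```python
from flask import Flask, request, jsonify

app = Flask("no-show")

def score(appointment):
    # Stand-in for loading the pickled model + transforms and calling
    # predict_proba; a hard-coded rule keeps this sketch runnable
    return 0.29 if appointment.get("SMS_received") else 0.35

@app.route("/predict", methods=["POST"])
def predict():
    appointment = request.get_json()
    prob = score(appointment)
    return jsonify({"no_show": bool(prob >= 0.5), "no_show_probability": prob})

# `python predict.py` would call app.run(host="0.0.0.0", port=9696);
# `gunicorn --bind 0.0.0.0:9696 predict:app` serves the same `app` object.
```

Either entry point exposes the same `app`, which is why the gunicorn invocation above references `predict:app`.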

We can use this to make an example prediction on the appointment:

test_appointment = {
                    'PatientId': 377511518121127.0,
                    'AppointmentID': 5629448,
                    'Gender': 'F',
                    'ScheduledDay': '2016-04-27 13:30:56+0000',
                    'AppointmentDay': '2016-06-07 00:00:00+0000',
                    'Age': 54,
                    'Neighbourhood': 'MARIA ORTIZ',
                    'Scholarship': False,
                    'Hipertension': False,
                    'Diabetes': False,
                    'Alcoholism': False,
                    'Handcap': 0,
                    'SMS_received': True
                    }

using the command:

cd scripts
python predict-test.py
# {'no_show': False, 'no_show_probability': 0.2880257379453167}

This gives us a no_show class [True or False] as well as a no-show probability.
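A client like predict-test.py presumably amounts to a single requests call of this shape (a sketch: the host and route are assumptions; the payload is the appointment above):

```python
import requests

# Local Flask/gunicorn server; swap in the Elastic Beanstalk host once deployed
URL = "http://localhost:9696/predict"

test_appointment = {
    "PatientId": 377511518121127.0,
    "AppointmentID": 5629448,
    "Gender": "F",
    "ScheduledDay": "2016-04-27 13:30:56+0000",
    "AppointmentDay": "2016-06-07 00:00:00+0000",
    "Age": 54,
    "Neighbourhood": "MARIA ORTIZ",
    "Scholarship": False,
    "Hipertension": False,
    "Diabetes": False,
    "Alcoholism": False,
    "Handcap": 0,
    "SMS_received": True,
}

def predict(appointment, url=URL):
    """POST the appointment JSON to the service and return its verdict."""
    response = requests.post(url, json=appointment, timeout=10)
    response.raise_for_status()
    return response.json()

# predict(test_appointment)  # requires a running server from Step 5
```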

🚨 Always remember to conda activate ml-zoomcamp whenever opening a new terminal/tab.

6. Containerizing the model

Build the image no-show-prediction from the Dockerfile (make sure that the Docker daemon is running) using:

docker build -t no-show-prediction .

We can access the docker container via the terminal using:

docker run -it --rm --entrypoint=bash no-show-prediction

Once the image is built, we need to map the container port (9696) to the localhost port (9696) using:

docker run -it --rm -p 9696:9696 no-show-prediction

We can now make a request in exactly the same way as Step 5:

cd scripts
python predict-test.py
# {'no_show': False, 'no_show_probability': 0.2880257379453167}

7. Deploying an AWS Elastic Beanstalk application

We provide detailed documentation on how to launch the code as an Elastic Beanstalk application in docs/setting-up-ec2-eb.md. It involves the following steps:

  • Creating an AWS account
  • Renting and configuring an EC2 instance
  • Setting up the application environment using conda, pipenv and docker
  • Creating the elastic beanstalk application
  • Launching the application

Details of application:

⚠️ I have now deactivated this instance because it has reached the monthly usage limit allowed by AWS's Free Tier plan.

Application name: no-show-predictor
Host: no-show-predictor-env.eba-hpbyckm2.eu-north-1.elasticbeanstalk.com
API: ./scripts/predict-test-aws.py

We evaluated the performances of five different models. Their accuracies and ROC AUCs are listed in the table below:

| Model | Accuracy | ROC AUC |
| --- | --- | --- |
| LogisticRegression | 0.792 | 0.686 |
| DecisionTreeClassifier | 0.793 | 0.725 |
| RandomForestClassifier | 0.793 | 0.682 |
| XGBClassifier | 0.796 | 0.749 |
| LGBMClassifier ✅ | 0.796 | 0.752 |

Our final model, LGBMClassifier, achieved an accuracy of 0.807 and an ROC AUC of 0.797.
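The comparison loop behind the table has roughly this shape (illustrative only: synthetic data and two of the five models, so the numbers will not match the table):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary-classification stand-in for the prepared features
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.7, size=500) > 0).astype(int)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Fit each candidate and score it on the held-out validation set
for model in [LogisticRegression(max_iter=1000),
              DecisionTreeClassifier(max_depth=5, random_state=0)]:
    model.fit(X_train, y_train)
    acc = accuracy_score(y_val, model.predict(X_val))
    auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
    print(type(model).__name__, round(acc, 3), round(auc, 3))
```

Accuracy alone is close across all five models; the ROC AUC column is what separates the boosted models from the rest.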

./medical-appointment-no-shows
├── scripts
│   ├── train.py
│   ├── predict.py
│   ├── predict-test.py
│   ├── predict-test-aws.py
│   ├── constants.py
│   └── __pycache__
├── permissions
│   ├── aws-explorer_credentials.csv
│   └── aws-explorer_accessKeys.csv
├── opt
│   ├── optional_requirement.txt
│   └── environment.yml
├── notebooks
│   └── notebook.ipynb
├── models
│   ├── XGBClassifier_tranformers_final.bin
│   ├── XGBClassifier_final.bin
│   ├── XGBClassifier.bin
│   ├── RandomForestClassifier.bin
│   ├── LogisticRegression.bin
│   ├── LGBMClassifier_tranformers_final.bin
│   ├── LGBMClassifier.bin
│   └── DecisionTreeClassifier.bin
├── jupyter.pem
├── docs
│   └── setting-up-ec2-eb.md
├── data
│   ├── no-show-patients.jpg
│   ├── README.md
│   └── KaggleV2-May-2016.csv
├── README.md
├── Pipfile.lock
├── Pipfile
├── LICENSE
└── Dockerfile

9 directories, 28 files

Abhirup Ghosh, abhirup.ghosh.184098@gmail.com

This project is licensed under the MIT License.

We welcome contributions from the community and feedback from healthcare professionals and data scientists. Together, we can refine our model and enhance its utility in real-world healthcare settings. Feel free to explore the project, contribute, or reach out with any questions or suggestions. Together, we can work towards a healthcare system that is more efficient, patient-centered, and cost-effective.

Keywords

#Classification #XGBoost #LightGBM #Conda #Pipenv #Flask #Gunicorn #Docker #AWS #ElasticBeanstalk #EC2 #API