LightGBM no-show predictor + AWS deployment (EC2 + Elastic Beanstalk)


Predicting no-shows in medical appointments


Predicting whether a patient will show up for their scheduled medical appointment is a critical task for healthcare providers as it can help optimize resource allocation and improve overall patient care. This machine learning project focuses on addressing the issue of patient no-shows in medical appointments. By harnessing the power of data and machine learning, we aim to develop a predictive model that can assist healthcare facilities in identifying patients at higher risk of no-shows.

Starting with a Kaggle dataset of medical appointment no-shows by Joni Hoppen and Aquarela Analytics, we evaluate the performances of five different classification algorithms (Logistic Regression, Decision Trees, Random Forests, XGBoost, and LightGBM) and settle on an LGBMClassifier model as our final model. Using the trained model we make predictions on whether a future appointment would lead to a no-show or not. Finally, we containerise this application, deploy it as an Elastic Beanstalk application on AWS, and provide an API to access it with.

Title Image

Credit: Austrian Medical Association (ÖÄK)

The application, called no-show-predictor, is located at: no-show-predictor-env.eba-hpbyckm2.eu-north-1.elasticbeanstalk.com. You can use this API to start making predictions immediately.

This project uses a Kaggle dataset of over 100,000 medical appointments characterised by 14 associated variables, including temporal details, patient information and the ultimate outcome (and the target variable of our classification task) of the appointment -- whether the patient showed up for the appointment or not. The dataset was created by Joni Hoppen and Aquarela Analytics, and can be downloaded from:

https://www.kaggle.com/datasets/joniarroba/noshowappointments/data

Further details of the datasets used can be found here.
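To give a flavour of the temporal variables: the gap between `ScheduledDay` and `AppointmentDay` is the kind of feature the notebook derives during feature engineering. A hypothetical sketch (the column names follow the dataset; the `lead_days` feature name is mine):

```python
import pandas as pd

# Tiny stand-in frame with the dataset's two temporal columns
df = pd.DataFrame({
    "ScheduledDay": ["2016-04-27 13:30:56+0000"],
    "AppointmentDay": ["2016-06-07 00:00:00+0000"],
})
df["ScheduledDay"] = pd.to_datetime(df["ScheduledDay"])
df["AppointmentDay"] = pd.to_datetime(df["AppointmentDay"])

# Whole days between booking and the appointment itself
df["lead_days"] = (df["AppointmentDay"] - df["ScheduledDay"]).dt.days
```

Long lead times are a plausible driver of no-shows, which is why the temporal columns matter beyond their raw timestamps.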

The project requires the following dependencies to be installed:

Conda
Docker
AWSEBCLI

To run this project locally, follow these steps:

1. Cloning the repository:

git clone https://github.com/abhirup-ghosh/medical-appointment-no-shows.git

2. Setting up the environment:

The easiest way to set up the environment is to use Anaconda. I used the standard Machine Learning Zoomcamp conda environment ml-zoomcamp, which you can create, activate, and install the relevant libraries in, using the following commands in your terminal:

conda create -n ml-zoomcamp python=3.9
conda activate ml-zoomcamp
conda install numpy pandas scikit-learn seaborn jupyter xgboost pipenv flask gunicorn lightgbm

Alternatively, I have also provided a conda environment.yml file that can be directly used to create the environment:

conda env create -f opt/environment.yml

In case you are working in a Python virtual environment, I provide a list of dependencies that can be pip installed using:

pip install -r opt/optional_requirements.txt

3. Running notebooks/notebook.ipynb

This notebook outlines the entire investigation and consists of the following steps [🚨 Skip this step if you want to directly use the final configuration for training and/or the final model for predictions]:

  • Data loading
  • Data cleaning and preparation
  • Exploratory data analysis
  • Feature Engineering
  • Feature importance
  • Setting up a validation framework
  • Model evaluation [and hyper-parameter tuning]
  • Saving the best model and encoders [in the models directory]
  • Preparation of the test data
  • Making predictions using the saved model
  • Testing Flask framework
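A standard validation framework of the kind the notebook sets up can be sketched as a 60/20/20 split (the exact split proportions here are an assumption, not taken from the notebook):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in for the prepared feature matrix and target
X = np.arange(100).reshape(-1, 1)
y = np.arange(100) % 2

# Hold out 20% as the final test set, then carve a validation set out of
# the remainder (0.25 of 80% = 20% of the full data)
X_full_train, X_test, y_full_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(
    X_full_train, y_full_train, test_size=0.25, random_state=1)
```

Models are tuned against the validation set, and the test set is touched only once for the final scores.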

4. Training model

We implement the training of our best model (LGBMClassifier) in the scripts/train.py file, which can be run using:

cd scripts
python train.py

The output of this script, which includes the model and the encoder/scaler transforms, can be found in: models/LGBMClassifier_tranformers_final.bin. It has an accuracy of 0.807 and an ROC AUC of 0.797. This is the model we use to make predictions in the next steps.

5. Making predictions

We have written a Flask app that serves the model on port 9696; it can be run using:

cd scripts
python predict.py

or gunicorn as:

cd scripts
gunicorn --bind 0.0.0.0:9696 predict:app
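Such a serving script has roughly this shape. This is a sketch, not the repo's predict.py: the route name is an assumption, and a trivial scoring rule stands in for the pickled LGBMClassifier pipeline so the example is self-contained:

```python
from flask import Flask, request, jsonify

app = Flask("no-show")

def score(appointment):
    # Stand-in for loading the pickled model + transforms and calling
    # predict_proba; a hard-coded rule keeps this sketch runnable
    return 0.29 if appointment.get("SMS_received") else 0.35

@app.route("/predict", methods=["POST"])
def predict():
    appointment = request.get_json()
    prob = score(appointment)
    return jsonify({"no_show": bool(prob >= 0.5), "no_show_probability": prob})

# `python predict.py` would call app.run(host="0.0.0.0", port=9696);
# `gunicorn --bind 0.0.0.0:9696 predict:app` serves the same `app` object.
```

Either entry point exposes the same `app`, which is why the gunicorn invocation above references `predict:app`.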

We can use this to make an example prediction on the appointment:

test_appointment = {
                    'PatientId': 377511518121127.0,
                    'AppointmentID': 5629448,
                    'Gender': 'F',
                    'ScheduledDay': '2016-04-27 13:30:56+0000',
                    'AppointmentDay': '2016-06-07 00:00:00+0000',
                    'Age': 54,
                    'Neighbourhood': 'MARIA ORTIZ',
                    'Scholarship': False,
                    'Hipertension': False,
                    'Diabetes': False,
                    'Alcoholism': False,
                    'Handcap': 0,
                    'SMS_received': True
                    }

using the command:

cd scripts
python predict-test.py
# {'no_show': False, 'no_show_probability': 0.2880257379453167}

This gives us a no_show class [True or False] as well as a no-show probability.
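A client like predict-test.py presumably amounts to a single requests call of this shape (a sketch: the host and route are assumptions; the payload is the appointment above):

```python
import requests

# Local Flask/gunicorn server; swap in the Elastic Beanstalk host once deployed
URL = "http://localhost:9696/predict"

test_appointment = {
    "PatientId": 377511518121127.0,
    "AppointmentID": 5629448,
    "Gender": "F",
    "ScheduledDay": "2016-04-27 13:30:56+0000",
    "AppointmentDay": "2016-06-07 00:00:00+0000",
    "Age": 54,
    "Neighbourhood": "MARIA ORTIZ",
    "Scholarship": False,
    "Hipertension": False,
    "Diabetes": False,
    "Alcoholism": False,
    "Handcap": 0,
    "SMS_received": True,
}

def predict(appointment, url=URL):
    """POST the appointment JSON to the service and return its verdict."""
    response = requests.post(url, json=appointment, timeout=10)
    response.raise_for_status()
    return response.json()

# predict(test_appointment)  # requires a running server from Step 5
```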

🚨 Always remember to conda activate ml-zoomcamp whenever opening a new terminal/tab.

6. Containerizing the model

Build the image no-show-prediction from the Dockerfile (make sure that the Docker daemon is running) using:

docker build -t no-show-prediction .

We can access the docker container via the terminal using:

docker run -it --rm --entrypoint=bash no-show-prediction

Once the image is built, we need to map the container port (9696) to the localhost port (9696) using:

docker run -it --rm -p 9696:9696 no-show-prediction

We can now make a request in exactly the same way as Step 5:

cd scripts
python predict-test.py
# {'no_show': False, 'no_show_probability': 0.2880257379453167}

7. Deploying an AWS Elastic Beanstalk application

We provide detailed documentation on how to launch the code as an Elastic Beanstalk application in docs/setting-up-ec2-eb.md. It involves the following steps:

  • Creating an AWS account
  • Renting and configuring an EC2 instance
  • Setting up the application environment using conda, pipenv and docker
  • Creating the elastic beanstalk application
  • Launching the application

Details of application:

⚠️ I have now deactivated this instance because it has reached the monthly usage limit allowed by AWS's Free Tier plan.

Application name: no-show-predictor
Host: no-show-predictor-env.eba-hpbyckm2.eu-north-1.elasticbeanstalk.com
API: ./scripts/predict-test-aws.py

We evaluated the performances of five different models. Their accuracies and ROC AUCs are listed in the table below:

| Model | Accuracy | ROC AUC |
| --- | --- | --- |
| LogisticRegression | 0.792 | 0.686 |
| DecisionTreeClassifier | 0.793 | 0.725 |
| RandomForestClassifier | 0.793 | 0.682 |
| XGBClassifier | 0.796 | 0.749 |
| LGBMClassifier ✅ | 0.796 | 0.752 |

Our final model, LGBMClassifier, achieved an accuracy of 0.807 and an ROC AUC of 0.797.
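The comparison loop behind the table has roughly this shape (illustrative only: synthetic data and two of the five models, so the numbers will not match the table):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary-classification stand-in for the prepared features
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.7, size=500) > 0).astype(int)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Fit each candidate and score it on the held-out validation set
for model in [LogisticRegression(max_iter=1000),
              DecisionTreeClassifier(max_depth=5, random_state=0)]:
    model.fit(X_train, y_train)
    acc = accuracy_score(y_val, model.predict(X_val))
    auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
    print(type(model).__name__, round(acc, 3), round(auc, 3))
```

Accuracy alone is close across all five models; the ROC AUC column is what separates the boosted models from the rest.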

./medical-appointment-no-shows
├── scripts
│   ├── train.py
│   ├── predict.py
│   ├── predict-test.py
│   ├── predict-test-aws.py
│   ├── constants.py
│   └── __pycache__
├── permissions
│   ├── aws-explorer_credentials.csv
│   └── aws-explorer_accessKeys.csv
├── opt
│   ├── optional_requirement.txt
│   └── environment.yml
├── notebooks
│   └── notebook.ipynb
├── models
│   ├── XGBClassifier_tranformers_final.bin
│   ├── XGBClassifier_final.bin
│   ├── XGBClassifier.bin
│   ├── RandomForestClassifier.bin
│   ├── LogisticRegression.bin
│   ├── LGBMClassifier_tranformers_final.bin
│   ├── LGBMClassifier.bin
│   └── DecisionTreeClassifier.bin
├── jupyter.pem
├── docs
│   └── setting-up-ec2-eb.md
├── data
│   ├── no-show-patients.jpg
│   ├── README.md
│   └── KaggleV2-May-2016.csv
├── README.md
├── Pipfile.lock
├── Pipfile
├── LICENSE
└── Dockerfile

9 directories, 28 files

Abhirup Ghosh, abhirup.ghosh.184098@gmail.com

This project is licensed under the MIT License.

We welcome contributions from the community and feedback from healthcare professionals and data scientists. Together, we can refine our model and enhance its utility in real-world healthcare settings. Feel free to explore the project, contribute, or reach out with any questions or suggestions. Together, we can work towards a healthcare system that is more efficient, patient-centered, and cost-effective.

Keywords

#Classification #XGBoost #LightGBM #Conda #Pipenv #Flask #Gunicorn #Docker #AWS #ElasticBeanstalk #EC2 #API