Patient Selection for Diabetes Drug Testing

This repository contains a completed capstone project for the Udacity "Applying AI to EHR Data" course, part of the AI for Healthcare Nanodegree program. It has been reviewed by Udacity instructors and met project specifications.

Introduction

Project Scenario

A hypothetical healthcare company is preparing for Phase III clinical trial testing of its novel diabetes drug. The drug must be administered, and the patient monitored, over 5-7 days in the hospital. Target patients are those likely to be hospitalized for this duration anyway, so that drug administration and patient monitoring add no significant cost. The goal of this project is to use Electronic Health Record (EHR) data to build a regression model that predicts a patient's hospitalization time, and then use this model to select patients for the study.

A Deep Neural Network Model for Predicting Hospitalization Duration

A Deep Neural Network regression model was built to predict the number of days each patient in this dataset is hospitalized. These predictions are then converted to a binary prediction of whether to include or exclude that patient from the clinical trial.
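As a minimal sketch of the conversion step, the snippet below thresholds predicted lengths of stay into include/exclude labels. The 5-day cutoff is an assumption taken from the lower bound of the 5-7 day window in the project scenario, and the name `to_binary_selection` is hypothetical; the notebook's own implementation may differ.

```python
from typing import Iterable, List

# Assumed cutoff: the lower bound of the 5-7 day window in the project scenario.
MIN_STAY_DAYS = 5.0


def to_binary_selection(predicted_days: Iterable[float],
                        threshold: float = MIN_STAY_DAYS) -> List[int]:
    """Return 1 (include in trial) when the predicted stay meets the threshold, else 0."""
    return [1 if days >= threshold else 0 for days in predicted_days]


# Hypothetical predicted stays, in days.
predicted = [2.3, 5.0, 6.8, 4.9]
selection = to_binary_selection(predicted)
print(selection)  # [0, 1, 1, 0]
```

Raising or lowering the threshold is one way to trade precision against recall in the resulting patient selection.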

This project uses EHR data by transforming line-level data into an appropriate representation at the encounter level (one row per patient visit), and then applying filtering, preprocessing, and feature engineering to key medical code sets. The TensorFlow Feature Column API was used to prepare features, and TensorFlow Probability Layers were used to create the regression model.
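The line-level to encounter-level transformation can be sketched as a group-by on the encounter identifier. The example below uses only the standard library so it runs without dependencies; in the notebook this kind of step would more typically be done with pandas. The column names (`encounter_id`, `patient_nbr`, `time_in_hospital`, `ndc_code`), the sample rows, and the function name are assumptions for illustration, not taken from the project code.

```python
from typing import Dict, List

# Hypothetical line-level rows: one row per medication code within an encounter.
line_level_rows = [
    {"encounter_id": 101, "patient_nbr": 1, "time_in_hospital": 3, "ndc_code": "A"},
    {"encounter_id": 101, "patient_nbr": 1, "time_in_hospital": 3, "ndc_code": "B"},
    {"encounter_id": 102, "patient_nbr": 2, "time_in_hospital": 6, "ndc_code": "A"},
]


def aggregate_to_encounter_level(rows: List[dict],
                                 code_field: str = "ndc_code") -> List[dict]:
    """Collapse line-level rows into one record per encounter, collecting codes into a list."""
    encounters: Dict[int, dict] = {}
    for row in rows:
        key = row["encounter_id"]
        if key not in encounters:
            # Copy the encounter-constant fields once and start an empty code list.
            record = {k: v for k, v in row.items() if k != code_field}
            record[code_field + "_list"] = []
            encounters[key] = record
        encounters[key][code_field + "_list"].append(row[code_field])
    # Dicts preserve insertion order, so encounters come back in first-seen order.
    return list(encounters.values())


encounter_rows = aggregate_to_encounter_level(line_level_rows)
for record in encounter_rows:
    print(record)
```

After aggregation each encounter appears exactly once, with its medication codes gathered into a single list-valued field.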

The completed regression model achieved a binary prediction accuracy of 0.77, precision of 0.71, recall of 0.61, and F1-score of 0.66. It can be further optimized for precision, recall, or F1-score, subject to the trade-off between precision and recall.
For the full discussion, please read the "Model Evaluation Metrics" section of src/student_project_EY_completed.ipynb.
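The reported F1-score follows from the reported precision and recall, since F1 is their harmonic mean. The short check below illustrates this; the function name `f1_score` is defined here for illustration and is not the scikit-learn function of the same name.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values reported above for the completed model.
reported_precision = 0.71
reported_recall = 0.61
print(round(f1_score(reported_precision, reported_recall), 2))  # 0.66
```

Because the harmonic mean is pulled toward the smaller of the two values, improving the lower metric (recall, here) raises F1 more than an equal gain in precision.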

To understand model biases across key demographic groups, model predictions were analyzed with the UChicago Aequitas toolkit.

Dataset

Udacity provided a synthetic dataset (denormalized at the line level, with augmentation) built from the UC Irvine Diabetes readmission dataset.
The dataset can be found in src/data/final_project_dataset.csv.

References
Original UCI Dataset (https://archive.ics.uci.edu/ml/datasets/Diabetes+130-US+hospitals+for+years+1999-2008)

Getting Started

Set up your Anaconda environment.
Clone the GitHub repo https://github.com/ElliotY-ML/Predict_Diabetic_Patient_Hospitalization_Duration.git to your local machine.
Open src/student_project_EY_completed.ipynb with Jupyter Notebook to explore EDA, feature transformations, feature columns, model training, inference, and bias analysis.

Dependencies

Using Anaconda consists of the following:

1. Install miniconda on your computer by selecting the latest Python version for your operating system. If you already have conda or miniconda installed, you should be able to skip this step and move on to step 2.
2. Create and activate a new conda environment.

1. Installation

Download the latest version of miniconda that matches your system.

|        | Linux | Mac | Windows |
|--------|-------|-----|---------|
| 64-bit | 64-bit (bash installer) | 64-bit (bash installer) | 64-bit (exe installer) |
| 32-bit | 32-bit (bash installer) | | 32-bit (exe installer) |

Install miniconda on your machine. Detailed instructions:

Linux: https://docs.conda.io/en/latest/miniconda.html#linux-installers
Mac: https://docs.conda.io/en/latest/miniconda.html#macosx-installers
Windows: https://docs.conda.io/en/latest/miniconda.html#windows-installers

2. Create and Activate the Environment

For Windows users, the following commands need to be executed from the Anaconda prompt rather than a regular Windows terminal window. For Mac, a normal terminal window will work.

Git and version control

These instructions also assume you have git installed for working with GitHub from a terminal window. If you do not, you can install it first with the command:

conda install git

Create local environment

Clone the repository, and navigate to the downloaded folder. This may take a minute or two to clone due to the included image data.

git clone https://github.com/ElliotY-ML/Predict_Diabetic_Patient_Hospitalization_Duration.git
cd Predict_Diabetic_Patient_Hospitalization_Duration

Create (and activate) a new environment, named udacity-ehr-env with Python 3.8. If prompted to proceed with the install (Proceed [y]/n) type y.
- Linux or Mac:
```
conda create -n udacity-ehr-env python=3.8
conda activate udacity-ehr-env
```
- Windows:
```
conda create --name udacity-ehr-env python=3.8
conda activate udacity-ehr-env
```
At this point your command line should look something like: (udacity-ehr-env) <User>:USER_DIR <user>$. The (udacity-ehr-env) indicates that your environment has been activated, and you can proceed with further package installations.
Install a few required pip packages, which are specified in the requirements text file. Be sure to run the command from the project root directory since the requirements.txt file is there.

pip install -r requirements.txt

Repository Instructions

The original Udacity project instructions can be read in Udacity_Project_Instructions.md.

Project Overview

Project Instructions & Prerequisites
Learning Objectives
Data Preparation and Exploratory Data Analysis
Create Categorical Features with TensorFlow Feature Columns
Create Continuous/Numerical Features with TensorFlow Feature Columns
Build Deep Learning Regression Model with Sequential API and TensorFlow Probability Layers
Evaluating Potential Model Biases with Aequitas Toolkit

Begin by opening src/student_project_EY_completed.ipynb with Jupyter Notebook.

Inputs:

Udacity Dataset: src/data/final_project_dataset.csv
Admission Type ID: src/data_schema_references/IDs_mapping.csv
NDC Codes to Drugs Lookup Table: src/data_schema_references/ndc_lookup_table.csv
Dataset Schema: src/data_schema_references/project_data_schema.csv
NDC Codes to Drugs Lookup Table (copy): src/medication_lookup_tables/final_ndc_lookup_table

Output:

Trained deep neural network regression model with TensorFlow Probability Layers, in the notebook
Predictions output in /out/pred_test_df3.csv

Data preparation begins in section 3 of the notebook. The project dataset is imported into a pandas DataFrame. The medical code reference tables in src/data_schema_references, which translate medical and medication codes into descriptions, are also imported into DataFrames.
Perform an exploratory data analysis to understand the data and demographics.
Transform the dataset from the line level to an aggregated encounter level. In other words, individual medical code entries are aggregated by patient visit.
Select categorical and numerical features to use for the model.
Split the dataset 60%/20%/20% into train/validation/test sets and ensure that the demographics of each split reflect the overall dataset. The patient_dataset_splitter function was completed in the student_utils.py module.
Use the TensorFlow Feature Columns API to create categorical features and embedding columns for each feature. The create_tf_categorical_feature_cols function was completed in the student_utils.py module.
Use the TensorFlow Feature Columns API to create numeric features. The create_tf_numeric_feature function was completed in the student_utils.py module.
Build and train a deep learning regression model with Keras Dense layers and TensorFlow Probability Layers.
Convert the regression output to a classification for patient selection.
Use the scikit-learn classification_report and confusion_matrix functions to compare the patient selection performance of the trained model against actual hospitalization durations.
Evaluate potential model biases with the Aequitas toolkit. Visualizations show whether the model is biased across gender and race demographics.
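The 60%/20%/20% split step above can be sketched as follows. This is a simplified illustration, not the repository's `patient_dataset_splitter`: it assumes the split is made on unique patient identifiers so that no patient appears in more than one partition, and the name `split_by_patient`, the `patient_nbr` key, and the sample rows are all hypothetical.

```python
import random
from typing import Dict, List, Sequence, Tuple


def split_by_patient(rows: Sequence[dict],
                     patient_key: str = "patient_nbr",
                     fractions: Tuple[float, float, float] = (0.6, 0.2, 0.2),
                     seed: int = 42) -> Dict[str, List[dict]]:
    """Assign every row of a given patient to exactly one of train/validation/test."""
    # Shuffle the unique patient IDs reproducibly, then cut the list by fraction.
    patient_ids = sorted({row[patient_key] for row in rows})
    random.Random(seed).shuffle(patient_ids)

    n_train = round(fractions[0] * len(patient_ids))
    n_val = round(fractions[1] * len(patient_ids))
    train_ids = set(patient_ids[:n_train])
    val_ids = set(patient_ids[n_train:n_train + n_val])

    # Patients not in train or validation fall through to the test partition.
    splits: Dict[str, List[dict]] = {"train": [], "validation": [], "test": []}
    for row in rows:
        if row[patient_key] in train_ids:
            splits["train"].append(row)
        elif row[patient_key] in val_ids:
            splits["validation"].append(row)
        else:
            splits["test"].append(row)
    return splits


# Hypothetical data: 10 patients with 2 encounters each.
rows = [{"patient_nbr": p, "encounter_id": 10 * p + e}
        for p in range(10) for e in range(2)]
splits = split_by_patient(rows)
print({name: len(part) for name, part in splits.items()})
# {'train': 12, 'validation': 4, 'test': 4}
```

Splitting on patient IDs rather than on rows avoids leaking one patient's encounters across partitions. Checking that demographics in each partition reflect the overall dataset would be a separate step not shown here.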

License

This project is licensed under the MIT License. See LICENSE.md for details.
