By Mohammed Almanassra, Bowen Chen, Erima Goyal, Oscar Parrilla
This project builds a machine learning pipeline that trains a recurrent neural network on the MIMIC dataset (MIMIC-III version 1.4, hosted on PhysioNet).
The project is built with Anaconda Python 3.8.8. All dependencies are listed in the environment.yml file. To recreate the environment, run
conda env create -f environment.yml
The dataset can be downloaded using the shell script extract_data.sh. To get started,
- replace <CHANGE TO YOUR USER NAME> with your MIMIC user name in the line
  wget -r -N -c -np --user <CHANGE TO YOUR USER NAME> --ask-password https://physionet.org/files/mimiciii/1.4/
- run
  sh extract_data.sh
  in your terminal
The data will then be downloaded and extracted to data/unzipped_files
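The placeholder substitution described above can also be scripted. The following is a minimal sketch, not part of the repository: fill_username is a hypothetical helper, "jdoe" is a made-up user name, and the sketch assumes the placeholder appears in the script exactly as written in the instructions.

```python
# Placeholder text exactly as it appears in the instructions above.
PLACEHOLDER = "<CHANGE TO YOUR USER NAME>"


def fill_username(script_text: str, username: str) -> str:
    """Return script_text with the placeholder replaced by username.

    Hypothetical helper for illustration; editing extract_data.sh by hand
    works just as well.
    """
    if not username.strip():
        raise ValueError("username must not be empty")
    if PLACEHOLDER not in script_text:
        raise ValueError("placeholder not found; the script may already be edited")
    return script_text.replace(PLACEHOLDER, username)


# The wget line quoted in the instructions.
line = (
    "wget -r -N -c -np --user <CHANGE TO YOUR USER NAME> "
    "--ask-password https://physionet.org/files/mimiciii/1.4/"
)
print(fill_username(line, "jdoe"))
```

To apply it to the real script, read extract_data.sh into a string, pass it through the helper, and write the result back before running the script.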
To train the model, run the following in your terminal:
- cd src
- python main.py
The whole pipeline completes in less than 5 minutes.
The model performed best with a batch size of 1 and training for only 2 epochs. The best model achieved 54% recall and 46% precision. The confusion matrix is shown below.
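For readers less familiar with the two metrics, here is a minimal sketch of how precision and recall are derived from confusion-matrix counts. The helper name precision_recall is hypothetical, and the counts in the usage line are illustrative only; they are not this project's results.

```python
from typing import Tuple


def precision_recall(tp: int, fp: int, fn: int) -> Tuple[float, float]:
    """Compute (precision, recall) from binary confusion-matrix counts.

    tp: true positives, fp: false positives, fn: false negatives.
    Returns 0.0 for a metric whose denominator is zero.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Illustrative counts only (NOT taken from the project's confusion matrix).
p, r = precision_recall(tp=6, fp=4, fn=2)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.60 recall=0.75
```

Recall is the share of actual positives the model finds, and precision is the share of predicted positives that are correct, so the two reported numbers describe different kinds of error.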
The folder structure is the following:
- main.py - main script that calls the ETL pipeline, model training and model evaluation steps
- train_model.py - script that calls the data loaders and training steps using the model defined in model_definition/variable_rnn.py
- evaluate_model.py - script that evaluates the model on the test set and plots the metrics
- etl.py - script that builds and loads the raw data set into PyTorch data loaders by calling data_transformation/make_dataset.py
- data_transform - data transformation scripts
- model_definition - variable RNN definition
- utils - utility functions
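The order in which main.py is described as calling the three stages (ETL, training, evaluation) can be sketched as below. This is only an illustration of the control flow under stated assumptions: run_pipeline and the three stand-in stage functions are hypothetical, and the real scripts' function names, arguments and return values may differ.

```python
def run_pipeline(etl, train, evaluate):
    """Run the three stages in order and return the evaluation result."""
    loaders = etl()                   # build the data loaders
    model = train(loaders)            # fit the model on the training data
    return evaluate(model, loaders)   # score the model on the test data


# Stand-in stages so the sketch runs without the dataset or PyTorch.
def fake_etl():
    return {"train": [1, 2, 3], "test": [4]}


def fake_train(loaders):
    return {"seen": len(loaders["train"])}


def fake_evaluate(model, loaders):
    return {"trained_on": model["seen"], "tested_on": len(loaders["test"])}


print(run_pipeline(fake_etl, fake_train, fake_evaluate))
```

Passing the stages in as callables keeps each step replaceable, which is one way to test the flow without downloading the data.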