BrahimFakri/Patient-Health-Data-Analysis-And-Feature-Extraction-For-Machine-Learning

1. Introduction

This is an open-source Python repository based on the HAIM GitHub package study. We replicate the embedding generation from the HAIM multimodal dataset, which contains data from 4 modalities (tabular, time-series, text and images) and 11 unique sources. Note: the notes data are not publicly available, so embeddings were not generated for that type of data. Also, the vision probabilities calculation was not used. To access those embeddings, please check the csv file generated by the study. Below is an overview of the different types of data sources used and the transformation type used to generate the embeddings:

(figure: overview of the data sources and the transformations used to generate the embeddings)

The datasets used to replicate the embeddings generation are publicly available on PhysioNet (https://physionet.org/content/haim-multimodal/1.0.1/).

Follow the instructions below to download and copy them.

2. Instructions on how to use the repository

The datasets used to replicate the embeddings generation are publicly available on PhysioNet.

Download the datasets and unzip them.

Copy the unzipped folders to the csvs folder.

Install the requirements under Python 3.9.13 as follows:

$ pip install -r requirements.txt

3. Steps of our work

In this repository, we intend to gradually provide five Jupyter notebooks. Each of the first four will cover one data modality, and the last one will cover all modalities.

In order to generate embeddings, we based our code on subject_id. The user can also opt for stay_id-based embedding generation. However, this can generate multiple rows for the same patient in the time-series analysis: data related to the time of events will be spread over multiple rows, and machine learning algorithms might produce erroneous predictions.
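The difference between the two aggregation levels can be sketched with toy data (the column names below are illustrative, not the repository's actual schema):

```python
import pandas as pd

# Toy time-series events: patient 1 has two ICU stays, patient 2 has one.
events = pd.DataFrame({
    'subject_id': [1, 1, 1, 2],
    'stay_id':    [10, 10, 11, 20],
    'heart_rate': [80, 90, 85, 70],
})

# Aggregating per subject_id: one feature row per patient.
per_subject = events.groupby('subject_id')['heart_rate'].mean().reset_index()

# Aggregating per stay_id: patient 1 appears twice, once per stay.
per_stay = events.groupby(['subject_id', 'stay_id'])['heart_rate'].mean().reset_index()

print(len(per_subject))  # 2 rows, one per patient
print(len(per_stay))     # 3 rows, patient 1 spread over two stays
```

With stay_id aggregation the same patient contributes several rows, which is the duplication issue described above.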

4. Datasources description

For more details about the different tables and column names, please refer to the MIMIC video tutorials at: MIMIC Tutorial

Below is an overview of the different MIMIC modules and how they relate to a patient's movements through the hospital:

(figure: MIMIC modules mapped to patient movements through the hospital)

(Johnson, A., Bulgarelli, L., Pollard, T., Horng, S., Celi, L. A., & Mark, R. (2021). MIMIC-IV (version 1.0). PhysioNet. https://doi.org/10.13026/s6n6-xd98)

Relationship between different MIMIC core and ICU tables

For an overview of the links between tables, check the Table diagrams folder.

Some important data facts

Below is a table summarizing important information about the main tables used in the embeddings generation.

To produce these numbers, we used the notebook general dataset exploration.ipynb in the notebooks folder.

(table: summary statistics for the main tables used in the embeddings generation)

The summary table shows, for example, that we have 382278 unique subject_id values in the patients table (created from the csv file patients.csv). However, the icustays table contains only 53150 unique subject_id values, meaning that not all patients in the database have ICU stays.

Also, we notice that not all patients have chest radiology images: only 65379 unique subject_id values appear in the mimic_cxr_chexpert table.

So, in order to find the patients who have both ICU stays and chest radiology images, we ran the notebook icu_cxr_patients.ipynb and found that the number of patients with both a chest radiology image and an ICU stay is 20245.
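The intersection logic behind that count can be sketched as follows; the helper function and the toy DataFrames below are illustrative stand-ins for the notebook's actual code, which operates on the full icustays and mimic_cxr_chexpert tables:

```python
import pandas as pd

def patients_with_both(icustays: pd.DataFrame, cxr: pd.DataFrame) -> set:
    """Return the subject_ids present in both the ICU-stays table
    and the chest-x-ray table."""
    return set(icustays['subject_id']) & set(cxr['subject_id'])

# Toy data standing in for csvs/icustays.csv and csvs/mimic_cxr_chexpert.csv
icustays = pd.DataFrame({'subject_id': [1, 2, 3]})
cxr = pd.DataFrame({'subject_id': [2, 3, 4]})

print(len(patients_with_both(icustays, cxr)))  # 2 patients appear in both tables
```

Run on the full dataset, this intersection yields the 20245 patients mentioned above.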

5. Demo

We recommend that the user start by running the notebook general tutorial notebook.ipynb to become familiar with the different tables and data in the MIMIC database. At the end of that notebook, the user will have generated a sample of 10 patients that will be used for the remainder of the work.
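Drawing such a small cohort can be sketched as below; the stand-in DataFrame, column selection, and random seed are assumptions, not the notebook's exact code:

```python
import pandas as pd

# Stand-in for the patients table loaded from patients.csv
patients = pd.DataFrame({'subject_id': range(100)})

# Draw a reproducible sample of 10 subject_ids for the tutorial cohort.
cohort = patients['subject_id'].sample(n=10, random_state=0).tolist()
print(len(cohort))  # 10 sampled subject_ids
```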

The second step is to generate features from demographic and time series data. In order to do so, the user should use the notebook Demographics_TimeSeries_features_Tutorial.ipynb.

Example of generating chart events features:

import os
os.chdir('../..')

import pandas as pd

from src.data import constants
from src.utils import extraction_classes

# Extract chart-event features for every patient in the cohort,
# then stack them into a single dataframe.
chart_fusion = []
for patient in constants.cohort:
    chart_fusion.append(extraction_classes.Event_extraction(patient).extract_chart_events(patient))
chart_fusion = pd.concat(chart_fusion, axis=0)

At the end of that notebook, the user will have generated a csv file fusion_ts_dem_dataframe.csv that contains features from demographics, chart events, lab events and procedure events. That csv file will be used with the vision features file to create the final features csv file.

The third step is to generate features from image data. In order to do so, the user should use the notebook Extract_vision_features_Tutorial.ipynb. At the end of the notebook, the csv file fusion_vision.csv will be generated.

Then we can use the function Generate_Final_Features to concatenate all types of features to generate the embeddings file:

# Import the function from the module:
from src.utils.Generate_Final_Features import Generate_Final_Features

# Call the general extraction function once and keep the result:
final_features = Generate_Final_Features()

# Export results to csv file:
final_features.to_csv('csvs/Final_Features.csv', index=False)