Initial commit

clinicalml · Sep 25, 2020 · 92e69e1 · 92e69e1
1 parent 3741d1a
commit 92e69e1
Show file tree

Hide file tree

Showing 38 changed files with 3,183 additions and 4 deletions.
diff --git a/README.md b/README.md
@@ -1,7 +1,110 @@
-# README
+# Code for replication of "A decision algorithm to promote outpatient antimicrobial stewardship for uncomplicated urinary tract infection"
 
-This repository will contain code for "A decision algorithm to promote outpatient antimicrobial stewardship for uncomplicated urinary tract infection", as well as a link to the de-identified dataset release through Physionet.
+This code is meant to be used in conjunction with the [AMR-UTI dataset](http://www.clinicalml.org/data/amr-dataset) for replication of the main analyses in the paper. 
 
-The code can be run against the de-identified dataset to replicate main analyses from the paper.
+## Preface: Important Note on Replication
 
-Please see our website www.clinicalml.org/data/amr-dataset for more information on the dataset release.
+The uncomplicated specimens included in this dataset release (in both the train and test sets) are identical to those used for our analyses.
+
+Nonetheless, there are minor differences that will arise when replicating our analyses with the released dataset.  There are two broad reasons for this:
+1. Slight differences in features, primarily due to de-identification efforts.
+2. The absence of patient-level identifiers, which were used in our original analysis to construct all train/validate splits.
+
+Regarding (1), the differences are as follows:
+* Any binary feature with fewer than 20 positive observations was dropped from the dataset.
+* All colonization pressure features were rounded to the nearest 0.01
+* Age was censored, so that any patient with an age > 89 has their age set to 90.
+* Basic laboratory values (WBC, lymphocytes, and neutrophil counts) are excluded from the dataset release, due to inconsistencies in reporting of laboratory values. These features did not have a noticeable impact on our results.
+
+Regarding (2), our analysis used patient identifiers to ensure that there were no patients with specimens in both the train and validate sets.  Because there are no patient identifiers included in this dataset release, the splits performed by our replication code are done without knowledge of patient identifiers.  As a result, they are necessarily different from the ones we used in our work.
+
+For this reason, we provide utilities to run our scripts both in an end-to-end fashion (replicating the approach taken in the paper, but with different train/validate splits and therefore different selection of hyperparameters), as well as directly using the hyperparameters and thresholds chosen by our original analysis.
+
+# Replication Instructions
+
+## Setup 
+
+First, you need to load in the data from Physionet.  See the [project website](http://www.clinicalml.org/data/amr-dataset) for more information on how to access this data. Place the Physionet data in a folder **outside** this repository to avoid accidentally uploading any files to Github.
+
+Second, you need to **edit the paths in `setup/paths.sh`** 
+* `DATA_PATH` in `setup/paths.sh` should reflect the absolute path to the files from Physionet. The files should be all be accessible at `${DATA_PATH}/<file_name>.csv`.
+* `REPO_PATH` in `setup/paths.sh` should reflect the absolute path to this directory.  That is, this file should be located at `${REPO_PATH}/README.md`.
+
+Third, you need to set up your `python`, `R`, and bash environment.
+* Run `bash setup/setup_env.sh` to create a python environment `amr` that will contain the necessary packages. 
+* Run `conda activate amr` to activate the environment.
+* Run `Rscript setup/install_r_packages.R` to install relevant `R` packages into the environment (for plotting purposes), which can take some time. 
+* Run `source setup/paths.sh` (note: use `source`, not `bash` here) to populate the necessary bash variables that define paths.
+* *Going forward (e.g., in subsequent terminal sessions), you will need to run `conda activate amr` and `source setup/paths.sh` before running the experiment scripts*
+
+Finally, you need to run `python setup/load_data.py` to split the data release into train/test splits that the remaining code expects.  This will create additional `.csv` files in `${DATA_PATH}`
+
+## Running the experiment scripts
+
+We present two options for replication:
+1. Using our original hyperparameters and thresholds: Run `bash run_all_rep.sh`
+2. Running the analysis end-to-end: Run `bash run_all.sh`
+
+Note that Option 1 is much faster, as it skips all hyperparameter tuning, while Option 2 will take approximatly 2-3 hours to run, driven primarily by threshold selection (see details below).
+
+After either of these scripts have been run, see the "Replicating Figures and Tables" section below for instructions on running the analysis notebook + plotting code.  If you have run with Option 1, then you will want to ensure that the `USE_REP_HP` flag in the notebook is set to `True`, and vice versa if you are using Option 2.
+
+## (Optional) Manually Running the Scripts
+
+Alternatively, you can manually run the relevant experiment scripts that are called by `run_all.sh` and `run_all_rep.sh` respectively.  These details are given below.
+
+Before manually running the experiments in this section, run `cd ${REPO_PATH}/experiments` to go to the right directory.  All the below assumes you are in that directory, and there will be errors if you try to run from a different directory.
+
+When you run these scripts for the first time, they will create (if it does not already exist) a folder `${REPO_PATH}/experiments/experiment_results/<exp_type>/<exp_name>/` that contains `results` to store the artifacts, and `logs` where you can watch progress by examining the log files.  In this context, `<exp_type>` might be `train_outcome_models` and `<exp_name>` might be `train_outcome_models_validation`.  
+
+**NOTE**: The scripts (and paths in `setup/paths.sh`) assume the experiment names that are already given in these scripts, so **do not change them**
+
+### Option 1: Running the analysis with the original hyperparameters
+
+This will skip the validation scripts, because those are used to choose hyperparameters, which in this analysis is not necessary.
+
+To run these analyses, you can either of the following equivalent options:
+* Run `bash run_all_rep.sh` from the base directory `${REPO_PATH}`
+* Move into the `${REPO_PATH}/experiments` directory and run the following script
+```
+bash scripts/eval_test_scripts/train_outcome_models_eval_test_replication.sh
+```
+This script will train models with the original hyperparameters on the entire training set, and then evaluate on the test set.  The original thresholds are then manually applied in the notebook that replicates tables and figures (see next section).
+
+### Option 2: Running our analysis end-to-end
+
+This will re-run the entire analysis, including the automatic selection of thresholds and hyperparameters using train/validate splits.  As noted above, this will result in slightly different choices than in our published analysis, in part because the splits will differ.
+
+To run these analyses, you can either of the following equivalent options:
+* Run `bash run_all.sh` from the base directory `${REPO_PATH}`
+* Move into the `${REPO_PATH}/experiments` directory and run the following scripts (in order)
+
+First, run the following to train outcome models using the train/validation set, using the hyperparameter grid defined in `../models/hyperparameter_grids.py`. This also generates train/validation set predictions using the chosen hyperparameters.  **NOTE**: In our experiments we observed that logistic regression and random forests performed comparably, and by default the script will only investigate hyperparameters for logistic regression models.  To change this, you will need to change the code in `experiment_train_outcome_model.py:117, 127` where we have commented out random forests.
+```
+bash scripts/validation_scripts/train_outcome_models_validation.sh
+``` 
+
+Second, run the following to choose thresholds based on the train/validation set.  Note that this script will take a significant amount of time (about 2 hours) to run, becuase it investigates a large set of thresholding combinations.
+```
+bash scripts/validation_scripts/thresholding_validation.sh
+```
+
+Third, run the following to re-train outcome models using the chosen hyperparameters, and then evaluate these models on the test set.
+```
+bash scripts/eval_test_scripts/train_outcome_models_eval_test.sh
+```
+
+Finally, run the following script to perform the final thresholding experiment on the test set.
+```
+bash scripts/eval_test_scripts/thresholding_test.sh
+```
+
+## Replicating tables and figures
+
+Within this repository, `notebooks/` contains a jupyter notebook that can be used to replicate figures from the paper and examine results.
+
+To replicate figures, you will need to do the following:
+* First, navigate to the folder `notebooks/` and open the Jupyter notebook `figures_and_tables.ipynb`
+* Set the flag `USE_REP_HP` in this notebook based on whether or not you wish to compute results using the original hyperparameters, or the results of the end-to-end analysis applied to the dataset release.  Note that either of these options assumes you have already run the relevant code above.
+* In either case, run the entire notebook end-to-end: The notebook generates the data used for plotting, as well as main tables in the paper.
+* To generate figures (AFTER running the notebook code), run the script `plot_all.sh` which will sequentially call the various `R` scripts to generate the plots.
diff --git a/__init__.py b/__init__.py
diff --git a/analysis_utils/best_case_baseline_utils.py b/analysis_utils/best_case_baseline_utils.py
@@ -0,0 +1,127 @@
+import pandas as pd
+import os
+import sys
+
+sys.path.append('../../')
+from models.indirect.policy_learning_thresholding import get_iat_broad
+from utils.evaluation_utils import calculate_ci
+
+def get_best_case_idsa_baseline(resist_df,
+                               switch_props,
+                               option='doc',
+                               avoid_nit_eids=None):
+
+    '''
+      Implementation of "best-case" modification of IDSA practice guidelines.
+    '''
+
+    assert option in ['total_rand', 'doc']
+
+    all_iat_broad_stats = []
+    resist_df = resist_df.copy()
+
+    for p in switch_props:
+
+        resist_df['idsa'] = resist_df.apply(lambda x: 'CIP' if x['example_id']
+                                           in set(avoid_nit_eids) else 'NIT', axis=1)
+        if option == 'total_rand':
+            cip_cohort = resist_df[resist_df['idsa'] == 'CIP']
+            prop_cip, iat_cip = len(cip_cohort) / len(resist_df), cip_cohort['CIP'].mean()
+
+            nit_cohort = resist_df[resist_df['idsa'] == 'NIT']
+            iat_nit_cohort, iat_cip_switch = nit_cohort['NIT'].mean(), nit_cohort['CIP'].mean()
+
+            iat = prop_cip * iat_cip + (1-prop_cip) * (p*iat_cip_switch + (1-p)*iat_nit_cohort)
+            broad = prop_cip + p*(1-prop_cip)
+
+        elif option == 'doc':
+            fixed_cohort = resist_df[~((resist_df['idsa'] == 'NIT') & 
+                                       resist_df['prescription'].isin(['CIP', 'LVX']))]
+            prop_fixed = len(fixed_cohort) / len(resist_df)
+            iat_fixed, broad_fixed = get_iat_broad(fixed_cohort, col_name='idsa')
+            switch_cohort = resist_df[(resist_df['idsa'] == 'NIT') & 
+                                       resist_df['prescription'].isin(['CIP', 'LVX'])]
+
+            iat_idsa_switch = switch_cohort['NIT'].mean()
+            iat_doc_switch, _ = get_iat_broad(switch_cohort, col_name='prescription')
+
+            iat = prop_fixed*iat_fixed + (1-prop_fixed)*(p*iat_doc_switch + (1-p)*iat_idsa_switch)
+            broad = prop_fixed*broad_fixed + p*(1-prop_fixed)
+
+        all_iat_broad_stats.append([iat, broad])
+
+    return pd.DataFrame(all_iat_broad_stats, columns=['iat', 'broad'])
+
+
+def get_best_case_stats(policy_df, p=.18):
+    fixed_cohort = policy_df[~((policy_df['idsa'] == 'NIT') & 
+                               policy_df['doc'].isin(['CIP', 'LVX']))]
+    prop_fixed = len(fixed_cohort) / len(policy_df)
+    iat_fixed, broad_fixed = get_iat_broad(fixed_cohort, col_name='idsa')
+
+    switch_cohort = policy_df[(policy_df['idsa'] == 'NIT') & 
+                               policy_df['doc'].isin(['CIP', 'LVX'])]
+
+    # Antibiotic distribution stats
+
+    prop_CIP = switch_cohort['doc'].value_counts(normalize=True).get('CIP')
+    prop_LVX = 1-prop_CIP
+
+    prop_CIP_all = prop_fixed*broad_fixed + p*(1-prop_fixed)*prop_CIP
+    prop_LVX_all = p*(1-prop_fixed)*prop_LVX
+
+    prop_CIP_ci, prop_LVX_ci = calculate_ci(prop_CIP_all, len(policy_df)), calculate_ci(prop_LVX_all, len(policy_df))
+
+
+    broad = prop_fixed*broad_fixed + p*(1-prop_fixed)
+    narrow = 1 - broad
+    narrow_ci, broad_ci =  calculate_ci(narrow, len(policy_df)), calculate_ci(broad, len(policy_df))
+
+
+    # IAT stats
+
+    iat_idsa_switch = switch_cohort['NIT'].mean()
+    iat_doc_switch, _ = get_iat_broad(switch_cohort, col_name='doc')
+
+    iat = prop_fixed*iat_fixed + (1-prop_fixed)*(p*iat_doc_switch + (1-p)*iat_idsa_switch)
+    iat_all_ci = calculate_ci(iat, len(policy_df))
+
+    all_NIT_size = (len(fixed_cohort[fixed_cohort['idsa'] == 'NIT'])) + (1-p)*len(switch_cohort)    
+    iat_fixed_NIT = fixed_cohort[fixed_cohort['idsa'] == 'NIT']['NIT'].mean()
+    iat_NIT = len(fixed_cohort[fixed_cohort['idsa'] == 'NIT']) / all_NIT_size  * iat_fixed + (1-p)*len(switch_cohort) / all_NIT_size * iat_idsa_switch
+    iat_NIT_ci = calculate_ci(iat_NIT, all_NIT_size)
+
+    iat_fixed_broad = fixed_cohort[fixed_cohort['idsa'] == 'CIP']['CIP'].mean()
+    all_broad_size = len(fixed_cohort[fixed_cohort['idsa'] == 'CIP']) + p*len(switch_cohort)
+    iat_broad = len(fixed_cohort[fixed_cohort['idsa'] == 'CIP']) / all_broad_size * iat_fixed_broad + p*len(switch_cohort)/all_broad_size * iat_doc_switch
+    iat_broad_ci = calculate_ci(iat_broad, all_broad_size)
+
+    all_CIP_size = len(fixed_cohort[fixed_cohort['idsa'] == 'CIP']) + p*len(switch_cohort)*prop_CIP
+    iat_doc_CIP = switch_cohort[switch_cohort['doc'] == 'CIP']['CIP'].mean()
+    iat_CIP = len(fixed_cohort[fixed_cohort['idsa'] == 'CIP']) / all_CIP_size * iat_fixed_broad + p*prop_CIP*len(switch_cohort)/all_CIP_size * iat_doc_CIP
+    iat_CIP_ci = calculate_ci(iat_CIP, all_CIP_size)
+
+    iat_doc_LVX = switch_cohort[switch_cohort['doc'] == 'LVX']['LVX'].mean()
+
+    #print(len(switch_cohort[switch_cohort['doc'] == 'LVX']))
+    #print(switch_cohort[switch_cohort['doc'] == 'LVX']['LVX'].sum())
+    iat_LVX = iat_doc_LVX
+    iat_LVX_ci = calculate_ci(iat_LVX, len(switch_cohort[switch_cohort['doc'] == 'LVX'])*p)
+
+    abx_prop_stats = {
+        'all': (broad, broad_ci),
+        'narrow': (narrow, narrow_ci),
+        'NIT': (narrow, narrow_ci),
+        'CIP': (prop_CIP_all, prop_CIP_ci),
+        'LVX': (prop_LVX_all, prop_LVX_ci)
+    }
+
+    iat_stats = {
+        'all': (iat, iat_all_ci),
+        'NIT': (iat_NIT, iat_NIT_ci),
+        'broad': (iat_broad, iat_broad_ci),
+        'CIP': (iat_CIP, iat_CIP_ci),
+        'LVX': (iat_LVX, iat_LVX_ci),
+    }
+
+    return abx_prop_stats, iat_stats
diff --git a/analysis_utils/model_analysis_utils.py b/analysis_utils/model_analysis_utils.py
@@ -0,0 +1,127 @@
+from sklearn.calibration import calibration_curve
+from sklearn.metrics import roc_curve
+
+import pandas as pd
+import numpy as np
+
+ABX_NAME_MAP =  {
+    'NIT': 'Nitrofurantoin',
+    'SXT': 'TMP-SMX',
+    'CIP': 'Ciprofloxacin',
+    'LVX': 'Levofloxacin'
+}
+
+# Create calibration curve data
+
+def create_calibration_data_df(preds_resist_df, abx_list=['NIT', 'SXT','CIP', 'LVX']):
+
+    full_calibration_df = pd.DataFrame()
+
+    for abx in abx_list:
+        counts = list(bin_counts(preds_resist_df[f'predicted_prob_{abx}'].values, 10)[:-1])
+        mean_probs, true_probs = get_calibration_curve(preds_resist_df, abx)
+
+        counts = counts[:len(mean_probs)]
+        calibration_df = pd.DataFrame(list(zip(mean_probs, true_probs, counts)),
+                                      columns=['predicted', 'prop_positive', 'count'])
+
+        calibration_df['drug'] = ABX_NAME_MAP[abx]
+        calibration_df['bin'] = np.linspace(0.5, 0.5 + len(calibration_df) - 1, len(calibration_df))
+
+        full_calibration_df = pd.concat([full_calibration_df, calibration_df], axis=0)
+
+    return full_calibration_df
+
+
+def get_calibration_curve(preds_df, abx):
+
+    prob_true, prob_pred = calibration_curve(preds_df[abx].values, 
+                                             preds_df[f'predicted_prob_{abx}'].values,
+                                             n_bins=10)
+
+    return list(prob_pred), list(prob_true)
+
+def bin_counts(y_prob, n_bins):
+    bins = np.linspace(0., 1., n_bins + 1)
+    binids = np.digitize(y_prob, bins) - 1
+    return np.bincount(binids, minlength=len(bins))
+
+
+# Create FPR/FNR curve and ROC curve data
+
+def create_fpr_fnr_data(preds_resist_df, abx_list=['NIT', 'SXT','CIP', 'LVX']):
+    all_fprs_fnrs_df = pd.DataFrame()
+
+    # Initialize first/last values for FPR / FNR data
+    base_fpr_df_1 = pd.DataFrame([[1, 0]], columns=['Threshold', 'value'])
+    base_fpr_df_2 = pd.DataFrame([[0, 100]], columns=['Threshold', 'value'])
+
+    base_fnr_df_1 = pd.DataFrame([[1, 100]], columns=['Threshold', 'value'])
+    base_fnr_df_2 = pd.DataFrame([[0, 0]], columns=['Threshold', 'value'])
+
+    for abx in abx_list:
+        fprs, fnrs, _, thresh = get_roc_curve_data(preds_resist_df, abx=abx)
+
+        fpr_df = pd.DataFrame([thresh, 100*np.array(fprs)], index=['Threshold', 'value']).transpose()
+        fpr_df = pd.concat([base_fpr_df_1, fpr_df, base_fpr_df_2], axis=0)
+        fnr_df = pd.DataFrame([thresh, 100*np.array(fnrs)], index=['Threshold', 'value']).transpose()
+        fnr_df = pd.concat([base_fnr_df_1, fnr_df, base_fnr_df_2], axis=0)
+
+        fpr_df['drug'] = ABX_NAME_MAP[abx]
+        fpr_df['set'] = 'FPR'
+
+        all_fprs_fnrs_df = pd.concat([all_fprs_fnrs_df, fpr_df])
+
+        fnr_df['drug'] = ABX_NAME_MAP[abx]
+        fnr_df['set'] = 'FNR'
+
+        all_fprs_fnrs_df = pd.concat([all_fprs_fnrs_df, fnr_df])
+
+    return all_fprs_fnrs_df
+
+
+def create_roc_curve_data(preds_resist_df, preds_resist_prev_history_df, abx_list=['NIT', 'SXT','CIP', 'LVX']):
+
+    data_full_cohort, data_prev_history_cohort = {}, {}
+
+    for abx in abx_list:
+        fprs, _, tprs, _ = get_roc_curve_data(preds_resist_df, abx=abx)
+        data_full_cohort[abx] = (fprs, tprs)
+
+        fprs, _, tprs, _ = get_roc_curve_data(preds_resist_prev_history_df, abx=abx)
+        data_prev_history_cohort[abx] = (fprs, tprs)
+
+    all_roc_curve_data_df = pd.DataFrame()
+
+    for abx, fpr_tprs in data_full_cohort.items():
+        fpr_tprs_df = pd.DataFrame([
+            fpr_tprs[0],
+            fpr_tprs[1]
+        ], index=['fpr', 'tpr']).transpose()
+        fpr_tprs_df['drug'] = ABX_NAME_MAP[abx]
+        fpr_tprs_df['set'] = 'full_cohort'
+
+        all_roc_curve_data_df = pd.concat([all_roc_curve_data_df, fpr_tprs_df])
+
+    for abx, fpr_tprs in data_prev_history_cohort.items():
+        fpr_tprs_df = pd.DataFrame([
+            fpr_tprs[0],
+            fpr_tprs[1]
+        ], index=['fpr', 'tpr']).transpose()
+        fpr_tprs_df['drug'] = ABX_NAME_MAP[abx]
+        fpr_tprs_df['set'] = 'prior_resist_cohort'
+
+        all_roc_curve_data_df = pd.concat([all_roc_curve_data_df, fpr_tprs_df])
+
+    return all_roc_curve_data_df
+
+
+def get_roc_curve_data(preds_resist_df, abx):
+    fprs, tprs, thresh = roc_curve(preds_resist_df[abx].values,
+                                  preds_resist_df[f'predicted_prob_{abx}'].values)
+    fnrs = [1 - tpr for tpr in tprs]
+
+    return fprs, fnrs, tprs, thresh
+
+
+