Skip to content

Commit

Permalink
Initial commit
Browse files Browse the repository at this point in the history
  • Loading branch information
moberst committed Sep 25, 2020
1 parent 3741d1a commit 92e69e1
Show file tree
Hide file tree
Showing 38 changed files with 3,183 additions and 4 deletions.
111 changes: 107 additions & 4 deletions README.md
@@ -1,7 +1,110 @@
# README
# Code for replication of "A decision algorithm to promote outpatient antimicrobial stewardship for uncomplicated urinary tract infection"

This repository will contain code for "A decision algorithm to promote outpatient antimicrobial stewardship for uncomplicated urinary tract infection", as well as a link to the de-identified dataset release through Physionet.
This code is meant to be used in conjunction with the [AMR-UTI dataset](http://www.clinicalml.org/data/amr-dataset) for replication of the main analyses in the paper.

The code can be run against the de-identified dataset to replicate main analyses from the paper.
## Preface: Important Note on Replication

Please see our website www.clinicalml.org/data/amr-dataset for more information on the dataset release.
The uncomplicated specimens included in this dataset release (in both the train and test sets) are identical to those used for our analyses.

Nonetheless, there are minor differences that will arise when replicating our analyses with the released dataset. There are two broad reasons for this:
1. Slight differences in features, primarily due to de-identification efforts.
2. The absence of patient-level identifiers, which were used in our original analysis to construct all train/validate splits.

Regarding (1), the differences are as follows:
* Any binary feature with fewer than 20 positive observations was dropped from the dataset.
* All colonization pressure features were rounded to the nearest 0.01
* Age was censored, so that any patient with an age > 89 has their age set to 90.
* Basic laboratory values (WBC, lymphocytes, and neutrophil counts) are excluded from the dataset release, due to inconsistencies in reporting of laboratory values. These features did not have a noticeable impact on our results.

Regarding (2), our analysis used patient identifiers to ensure that there were no patients with specimens in both the train and validate sets. Because there are no patient identifiers included in this dataset release, the splits performed by our replication code are done without knowledge of patient identifiers. As a result, they are necessarily different from the ones we used in our work.

For this reason, we provide utilities to run our scripts both in an end-to-end fashion (replicating the approach taken in the paper, but with different train/validate splits and therefore different selection of hyperparameters), as well as directly using the hyperparameters and thresholds chosen by our original analysis.

# Replication Instructions

## Setup

First, you need to load in the data from Physionet. See the [project website](http://www.clinicalml.org/data/amr-dataset) for more information on how to access this data. Place the Physionet data in a folder **outside** this repository to avoid accidentally uploading any files to Github.

Second, you need to **edit the paths in `setup/paths.sh`**
* `DATA_PATH` in `setup/paths.sh` should reflect the absolute path to the files from Physionet. The files should be all be accessible at `${DATA_PATH}/<file_name>.csv`.
* `REPO_PATH` in `setup/paths.sh` should reflect the absolute path to this directory. That is, this file should be located at `${REPO_PATH}/README.md`.

Third, you need to set up your `python`, `R`, and bash environment.
* Run `bash setup/setup_env.sh` to create a python environment `amr` that will contain the necessary packages.
* Run `conda activate amr` to activate the environment.
* Run `Rscript setup/install_r_packages.R` to install relevant `R` packages into the environment (for plotting purposes), which can take some time.
* Run `source setup/paths.sh` (note: use `source`, not `bash` here) to populate the necessary bash variables that define paths.
* *Going forward (e.g., in subsequent terminal sessions), you will need to run `conda activate amr` and `source setup/paths.sh` before running the experiment scripts*

Finally, you need to run `python setup/load_data.py` to split the data release into train/test splits that the remaining code expects. This will create additional `.csv` files in `${DATA_PATH}`

## Running the experiment scripts

We present two options for replication:
1. Using our original hyperparameters and thresholds: Run `bash run_all_rep.sh`
2. Running the analysis end-to-end: Run `bash run_all.sh`

Note that Option 1 is much faster, as it skips all hyperparameter tuning, while Option 2 will take approximatly 2-3 hours to run, driven primarily by threshold selection (see details below).

After either of these scripts have been run, see the "Replicating Figures and Tables" section below for instructions on running the analysis notebook + plotting code. If you have run with Option 1, then you will want to ensure that the `USE_REP_HP` flag in the notebook is set to `True`, and vice versa if you are using Option 2.

## (Optional) Manually Running the Scripts

Alternatively, you can manually run the relevant experiment scripts that are called by `run_all.sh` and `run_all_rep.sh` respectively. These details are given below.

Before manually running the experiments in this section, run `cd ${REPO_PATH}/experiments` to go to the right directory. All the below assumes you are in that directory, and there will be errors if you try to run from a different directory.

When you run these scripts for the first time, they will create (if it does not already exist) a folder `${REPO_PATH}/experiments/experiment_results/<exp_type>/<exp_name>/` that contains `results` to store the artifacts, and `logs` where you can watch progress by examining the log files. In this context, `<exp_type>` might be `train_outcome_models` and `<exp_name>` might be `train_outcome_models_validation`.

**NOTE**: The scripts (and paths in `setup/paths.sh`) assume the experiment names that are already given in these scripts, so **do not change them**

### Option 1: Running the analysis with the original hyperparameters

This will skip the validation scripts, because those are used to choose hyperparameters, which in this analysis is not necessary.

To run these analyses, you can either of the following equivalent options:
* Run `bash run_all_rep.sh` from the base directory `${REPO_PATH}`
* Move into the `${REPO_PATH}/experiments` directory and run the following script
```
bash scripts/eval_test_scripts/train_outcome_models_eval_test_replication.sh
```
This script will train models with the original hyperparameters on the entire training set, and then evaluate on the test set. The original thresholds are then manually applied in the notebook that replicates tables and figures (see next section).

### Option 2: Running our analysis end-to-end

This will re-run the entire analysis, including the automatic selection of thresholds and hyperparameters using train/validate splits. As noted above, this will result in slightly different choices than in our published analysis, in part because the splits will differ.

To run these analyses, you can either of the following equivalent options:
* Run `bash run_all.sh` from the base directory `${REPO_PATH}`
* Move into the `${REPO_PATH}/experiments` directory and run the following scripts (in order)

First, run the following to train outcome models using the train/validation set, using the hyperparameter grid defined in `../models/hyperparameter_grids.py`. This also generates train/validation set predictions using the chosen hyperparameters. **NOTE**: In our experiments we observed that logistic regression and random forests performed comparably, and by default the script will only investigate hyperparameters for logistic regression models. To change this, you will need to change the code in `experiment_train_outcome_model.py:117, 127` where we have commented out random forests.
```
bash scripts/validation_scripts/train_outcome_models_validation.sh
```

Second, run the following to choose thresholds based on the train/validation set. Note that this script will take a significant amount of time (about 2 hours) to run, becuase it investigates a large set of thresholding combinations.
```
bash scripts/validation_scripts/thresholding_validation.sh
```

Third, run the following to re-train outcome models using the chosen hyperparameters, and then evaluate these models on the test set.
```
bash scripts/eval_test_scripts/train_outcome_models_eval_test.sh
```

Finally, run the following script to perform the final thresholding experiment on the test set.
```
bash scripts/eval_test_scripts/thresholding_test.sh
```

## Replicating tables and figures

Within this repository, `notebooks/` contains a jupyter notebook that can be used to replicate figures from the paper and examine results.

To replicate figures, you will need to do the following:
* First, navigate to the folder `notebooks/` and open the Jupyter notebook `figures_and_tables.ipynb`
* Set the flag `USE_REP_HP` in this notebook based on whether or not you wish to compute results using the original hyperparameters, or the results of the end-to-end analysis applied to the dataset release. Note that either of these options assumes you have already run the relevant code above.
* In either case, run the entire notebook end-to-end: The notebook generates the data used for plotting, as well as main tables in the paper.
* To generate figures (AFTER running the notebook code), run the script `plot_all.sh` which will sequentially call the various `R` scripts to generate the plots.
Empty file added __init__.py
Empty file.
127 changes: 127 additions & 0 deletions analysis_utils/best_case_baseline_utils.py
@@ -0,0 +1,127 @@
import pandas as pd
import os
import sys

sys.path.append('../../')
from models.indirect.policy_learning_thresholding import get_iat_broad
from utils.evaluation_utils import calculate_ci

def get_best_case_idsa_baseline(resist_df,
switch_props,
option='doc',
avoid_nit_eids=None):

'''
Implementation of "best-case" modification of IDSA practice guidelines.
'''

assert option in ['total_rand', 'doc']

all_iat_broad_stats = []
resist_df = resist_df.copy()

for p in switch_props:

resist_df['idsa'] = resist_df.apply(lambda x: 'CIP' if x['example_id']
in set(avoid_nit_eids) else 'NIT', axis=1)
if option == 'total_rand':
cip_cohort = resist_df[resist_df['idsa'] == 'CIP']
prop_cip, iat_cip = len(cip_cohort) / len(resist_df), cip_cohort['CIP'].mean()

nit_cohort = resist_df[resist_df['idsa'] == 'NIT']
iat_nit_cohort, iat_cip_switch = nit_cohort['NIT'].mean(), nit_cohort['CIP'].mean()

iat = prop_cip * iat_cip + (1-prop_cip) * (p*iat_cip_switch + (1-p)*iat_nit_cohort)
broad = prop_cip + p*(1-prop_cip)

elif option == 'doc':
fixed_cohort = resist_df[~((resist_df['idsa'] == 'NIT') &
resist_df['prescription'].isin(['CIP', 'LVX']))]
prop_fixed = len(fixed_cohort) / len(resist_df)
iat_fixed, broad_fixed = get_iat_broad(fixed_cohort, col_name='idsa')
switch_cohort = resist_df[(resist_df['idsa'] == 'NIT') &
resist_df['prescription'].isin(['CIP', 'LVX'])]

iat_idsa_switch = switch_cohort['NIT'].mean()
iat_doc_switch, _ = get_iat_broad(switch_cohort, col_name='prescription')

iat = prop_fixed*iat_fixed + (1-prop_fixed)*(p*iat_doc_switch + (1-p)*iat_idsa_switch)
broad = prop_fixed*broad_fixed + p*(1-prop_fixed)

all_iat_broad_stats.append([iat, broad])

return pd.DataFrame(all_iat_broad_stats, columns=['iat', 'broad'])


def get_best_case_stats(policy_df, p=.18):
fixed_cohort = policy_df[~((policy_df['idsa'] == 'NIT') &
policy_df['doc'].isin(['CIP', 'LVX']))]
prop_fixed = len(fixed_cohort) / len(policy_df)
iat_fixed, broad_fixed = get_iat_broad(fixed_cohort, col_name='idsa')

switch_cohort = policy_df[(policy_df['idsa'] == 'NIT') &
policy_df['doc'].isin(['CIP', 'LVX'])]

# Antibiotic distribution stats

prop_CIP = switch_cohort['doc'].value_counts(normalize=True).get('CIP')
prop_LVX = 1-prop_CIP

prop_CIP_all = prop_fixed*broad_fixed + p*(1-prop_fixed)*prop_CIP
prop_LVX_all = p*(1-prop_fixed)*prop_LVX

prop_CIP_ci, prop_LVX_ci = calculate_ci(prop_CIP_all, len(policy_df)), calculate_ci(prop_LVX_all, len(policy_df))


broad = prop_fixed*broad_fixed + p*(1-prop_fixed)
narrow = 1 - broad
narrow_ci, broad_ci = calculate_ci(narrow, len(policy_df)), calculate_ci(broad, len(policy_df))


# IAT stats

iat_idsa_switch = switch_cohort['NIT'].mean()
iat_doc_switch, _ = get_iat_broad(switch_cohort, col_name='doc')

iat = prop_fixed*iat_fixed + (1-prop_fixed)*(p*iat_doc_switch + (1-p)*iat_idsa_switch)
iat_all_ci = calculate_ci(iat, len(policy_df))

all_NIT_size = (len(fixed_cohort[fixed_cohort['idsa'] == 'NIT'])) + (1-p)*len(switch_cohort)
iat_fixed_NIT = fixed_cohort[fixed_cohort['idsa'] == 'NIT']['NIT'].mean()
iat_NIT = len(fixed_cohort[fixed_cohort['idsa'] == 'NIT']) / all_NIT_size * iat_fixed + (1-p)*len(switch_cohort) / all_NIT_size * iat_idsa_switch
iat_NIT_ci = calculate_ci(iat_NIT, all_NIT_size)

iat_fixed_broad = fixed_cohort[fixed_cohort['idsa'] == 'CIP']['CIP'].mean()
all_broad_size = len(fixed_cohort[fixed_cohort['idsa'] == 'CIP']) + p*len(switch_cohort)
iat_broad = len(fixed_cohort[fixed_cohort['idsa'] == 'CIP']) / all_broad_size * iat_fixed_broad + p*len(switch_cohort)/all_broad_size * iat_doc_switch
iat_broad_ci = calculate_ci(iat_broad, all_broad_size)

all_CIP_size = len(fixed_cohort[fixed_cohort['idsa'] == 'CIP']) + p*len(switch_cohort)*prop_CIP
iat_doc_CIP = switch_cohort[switch_cohort['doc'] == 'CIP']['CIP'].mean()
iat_CIP = len(fixed_cohort[fixed_cohort['idsa'] == 'CIP']) / all_CIP_size * iat_fixed_broad + p*prop_CIP*len(switch_cohort)/all_CIP_size * iat_doc_CIP
iat_CIP_ci = calculate_ci(iat_CIP, all_CIP_size)

iat_doc_LVX = switch_cohort[switch_cohort['doc'] == 'LVX']['LVX'].mean()

#print(len(switch_cohort[switch_cohort['doc'] == 'LVX']))
#print(switch_cohort[switch_cohort['doc'] == 'LVX']['LVX'].sum())
iat_LVX = iat_doc_LVX
iat_LVX_ci = calculate_ci(iat_LVX, len(switch_cohort[switch_cohort['doc'] == 'LVX'])*p)

abx_prop_stats = {
'all': (broad, broad_ci),
'narrow': (narrow, narrow_ci),
'NIT': (narrow, narrow_ci),
'CIP': (prop_CIP_all, prop_CIP_ci),
'LVX': (prop_LVX_all, prop_LVX_ci)
}

iat_stats = {
'all': (iat, iat_all_ci),
'NIT': (iat_NIT, iat_NIT_ci),
'broad': (iat_broad, iat_broad_ci),
'CIP': (iat_CIP, iat_CIP_ci),
'LVX': (iat_LVX, iat_LVX_ci),
}

return abx_prop_stats, iat_stats
127 changes: 127 additions & 0 deletions analysis_utils/model_analysis_utils.py
@@ -0,0 +1,127 @@
from sklearn.calibration import calibration_curve
from sklearn.metrics import roc_curve

import pandas as pd
import numpy as np

ABX_NAME_MAP = {
'NIT': 'Nitrofurantoin',
'SXT': 'TMP-SMX',
'CIP': 'Ciprofloxacin',
'LVX': 'Levofloxacin'
}

# Create calibration curve data

def create_calibration_data_df(preds_resist_df, abx_list=['NIT', 'SXT','CIP', 'LVX']):

full_calibration_df = pd.DataFrame()

for abx in abx_list:
counts = list(bin_counts(preds_resist_df[f'predicted_prob_{abx}'].values, 10)[:-1])
mean_probs, true_probs = get_calibration_curve(preds_resist_df, abx)

counts = counts[:len(mean_probs)]
calibration_df = pd.DataFrame(list(zip(mean_probs, true_probs, counts)),
columns=['predicted', 'prop_positive', 'count'])

calibration_df['drug'] = ABX_NAME_MAP[abx]
calibration_df['bin'] = np.linspace(0.5, 0.5 + len(calibration_df) - 1, len(calibration_df))

full_calibration_df = pd.concat([full_calibration_df, calibration_df], axis=0)

return full_calibration_df


def get_calibration_curve(preds_df, abx):

prob_true, prob_pred = calibration_curve(preds_df[abx].values,
preds_df[f'predicted_prob_{abx}'].values,
n_bins=10)

return list(prob_pred), list(prob_true)

def bin_counts(y_prob, n_bins):
bins = np.linspace(0., 1., n_bins + 1)
binids = np.digitize(y_prob, bins) - 1
return np.bincount(binids, minlength=len(bins))


# Create FPR/FNR curve and ROC curve data

def create_fpr_fnr_data(preds_resist_df, abx_list=['NIT', 'SXT','CIP', 'LVX']):
all_fprs_fnrs_df = pd.DataFrame()

# Initialize first/last values for FPR / FNR data
base_fpr_df_1 = pd.DataFrame([[1, 0]], columns=['Threshold', 'value'])
base_fpr_df_2 = pd.DataFrame([[0, 100]], columns=['Threshold', 'value'])

base_fnr_df_1 = pd.DataFrame([[1, 100]], columns=['Threshold', 'value'])
base_fnr_df_2 = pd.DataFrame([[0, 0]], columns=['Threshold', 'value'])

for abx in abx_list:
fprs, fnrs, _, thresh = get_roc_curve_data(preds_resist_df, abx=abx)

fpr_df = pd.DataFrame([thresh, 100*np.array(fprs)], index=['Threshold', 'value']).transpose()
fpr_df = pd.concat([base_fpr_df_1, fpr_df, base_fpr_df_2], axis=0)
fnr_df = pd.DataFrame([thresh, 100*np.array(fnrs)], index=['Threshold', 'value']).transpose()
fnr_df = pd.concat([base_fnr_df_1, fnr_df, base_fnr_df_2], axis=0)

fpr_df['drug'] = ABX_NAME_MAP[abx]
fpr_df['set'] = 'FPR'

all_fprs_fnrs_df = pd.concat([all_fprs_fnrs_df, fpr_df])

fnr_df['drug'] = ABX_NAME_MAP[abx]
fnr_df['set'] = 'FNR'

all_fprs_fnrs_df = pd.concat([all_fprs_fnrs_df, fnr_df])

return all_fprs_fnrs_df


def create_roc_curve_data(preds_resist_df, preds_resist_prev_history_df, abx_list=['NIT', 'SXT','CIP', 'LVX']):

data_full_cohort, data_prev_history_cohort = {}, {}

for abx in abx_list:
fprs, _, tprs, _ = get_roc_curve_data(preds_resist_df, abx=abx)
data_full_cohort[abx] = (fprs, tprs)

fprs, _, tprs, _ = get_roc_curve_data(preds_resist_prev_history_df, abx=abx)
data_prev_history_cohort[abx] = (fprs, tprs)

all_roc_curve_data_df = pd.DataFrame()

for abx, fpr_tprs in data_full_cohort.items():
fpr_tprs_df = pd.DataFrame([
fpr_tprs[0],
fpr_tprs[1]
], index=['fpr', 'tpr']).transpose()
fpr_tprs_df['drug'] = ABX_NAME_MAP[abx]
fpr_tprs_df['set'] = 'full_cohort'

all_roc_curve_data_df = pd.concat([all_roc_curve_data_df, fpr_tprs_df])

for abx, fpr_tprs in data_prev_history_cohort.items():
fpr_tprs_df = pd.DataFrame([
fpr_tprs[0],
fpr_tprs[1]
], index=['fpr', 'tpr']).transpose()
fpr_tprs_df['drug'] = ABX_NAME_MAP[abx]
fpr_tprs_df['set'] = 'prior_resist_cohort'

all_roc_curve_data_df = pd.concat([all_roc_curve_data_df, fpr_tprs_df])

return all_roc_curve_data_df


def get_roc_curve_data(preds_resist_df, abx):
fprs, tprs, thresh = roc_curve(preds_resist_df[abx].values,
preds_resist_df[f'predicted_prob_{abx}'].values)
fnrs = [1 - tpr for tpr in tprs]

return fprs, fnrs, tprs, thresh



0 comments on commit 92e69e1

Please sign in to comment.