


Evaluation of Transformer-based models


Code to accompany the paper:
"Empirical evaluation of language modeling to ascertain cancer outcomes from clinical text reports"


Table of Contents

  1. About The Project
  2. Getting Started
  3. Usage
  4. References
  5. License
  6. Contact
  7. Acknowledgements

About The Project


Language modeling has become a central tool for modern natural language processing across multiple domains. Here, we evaluated its utility for extracting cancer outcomes data from clinical text reports. This outcomes extraction task is a key rate-limiting step for asking observational cancer research questions intended to promote precision cancer care using large linked clinical and molecular datasets. Traditional medical record annotation is a slow, manual process, and scaling it up is critically important for fast, accurate clinical decision making.

We have previously demonstrated that simple convolutional neural networks (CNNs), trained on a labeled dataset of imaging reports for over 1,000 patients with non-small cell lung cancer, can yield models able to accurately capture key clinical outcomes from each report, including cancer progression/worsening and response/improvement. In the current analysis, we evaluated whether pre-trained Transformer models, with or without domain adaptation using imaging reports from our institution, can improve performance or reduce the volume of training data necessary to yield well-performing models for this document classification task. We performed extensive analyses of multiple variants of pre-trained Transformer models, considering major modeling factors such as 1) training sample size, 2) classification architecture, 3) language-model fine-tuning, 4) classification task, 5) length of text considered, and 6) number of parameters of the Transformer models. We report the performance of these models under each of these considerations.
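
For orientation only, the sketch below shows the general shape of this kind of Transformer document classifier using the Hugging Face transformers library. It is not the repository's code: the checkpoint name, label count, and truncation length are illustrative assumptions.

    # Minimal sketch, NOT the repository's code: a pre-trained Transformer used as a
    # report-level classifier, with the encoder either frozen or fine-tuned.
    # Checkpoint name, label count, and max length are illustrative assumptions.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    checkpoint = "prajjwal1/bert-tiny"  # example "tiny" BERT; larger variants work the same way
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

    # "Frozen" setting: keep the pre-trained encoder fixed and train only the
    # classification head; the "tuned" setting leaves all parameters trainable.
    for param in model.base_model.parameters():
        param.requires_grad = False

    report = "Interval increase in size of the dominant right upper lobe mass."
    inputs = tokenizer(report, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits  # scores for, e.g., progression vs. no progression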

Getting Started

To get a local copy up and running, follow these simple steps:

Prerequisites

  • Python 3.7; check environment.yml for the list of required packages

Installation

  1. Clone the repo

    git clone https://github.com/marakeby/clinicalNLP2.git
  2. Create the conda environment. Note that not all packages are needed to generate the paper figures; some are needed only for training.

    conda env create --name cnlp_env --file=environment.yml
  3. Depending on your use case, you may need to download one or more of the following:

    a. Log files (needed to regenerate the paper figures). Extract the files under the _cnlp_results directory. If you would like to store them somewhere else, set the TEST_RESULTS_PATH variable in config_path.py accordingly.

    b. Plot files (a copy of the paper images). Extract the files under the _cnlp_plots directory. If you would like to store them somewhere else, set the PLOTS_PATH variable in config_path.py accordingly (see the config_path.py sketch below).
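
For reference, a minimal sketch of the two path variables mentioned above is shown here; the actual config_path.py in this repository may define additional paths, and the directory names are just the defaults described in these steps.

    # Hypothetical sketch of the path variables referenced above; the actual
    # config_path.py may define more. Edit these if you extract the downloads
    # to non-default locations.
    from os.path import dirname, join, realpath

    BASE_PATH = dirname(realpath(__file__))
    TEST_RESULTS_PATH = join(BASE_PATH, '_cnlp_results')  # extracted log files
    PLOTS_PATH = join(BASE_PATH, '_cnlp_plots')           # extracted plot files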

Usage

  1. Activate the created conda environment

    source activate cnlp_env
  2. Add the repository directory to PYTHONPATH, e.g.

    export PYTHONPATH=~/clinicalNLP2:$PYTHONPATH
  3. To generate all paper figures, run

    cd ./paper_analysis
    python generate_figures.py
  4. To generate an individual paper figure, run the corresponding script under the 'paper_analysis_revision2' directory, e.g.

    cd ./paper_analysis_revision2
    python figure_4_samples_sizes.py
  5. To re-train a model from scratch, run

    cd ./train
    python run_testing.py

    This will run the experiment bert_classifier/progression_one_split_BERT_sizes_tiny_frozen_tuned, which trains a model to predict progression for cancer patients using a fine-tuned tiny BERT model under different training set sizes. The results of the experiment will be stored under _logs in a directory with the same name as the experiment. To run another experiment, uncomment the corresponding line in run_testing.py, as illustrated in the sketch below.
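
    The comment/uncomment pattern might look roughly like the hypothetical sketch below; only the default experiment name is taken from this README, and the commented line is a placeholder for one of the other experiments defined in the repository.

    # Hypothetical illustration of experiment selection in run_testing.py;
    # only the first name comes from this README, the commented line is a
    # placeholder for another experiment defined in the repository.
    experiment = 'bert_classifier/progression_one_split_BERT_sizes_tiny_frozen_tuned'
    # experiment = 'bert_classifier/<another_experiment_name>'  # uncomment to run a different one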



Note that the underlying EHR text reports used to train and evaluate NLP models for these analyses constitute protected health information for DFCI patients and therefore cannot be made publicly available. Researchers with DFCI appointments and Institutional Review Board (IRB) approval can access the data on request. For external researchers, access would require collaboration with the authors and eligibility for a DFCI appointment, per DFCI policies.

License

Distributed under the GPL-2.0 License. See LICENSE for more information.

Contact

Haitham - @HMarakeby

Project Link: https://github.com/marakeby/clinicalNLP2

References

  • Elmarakeby, H., et al. "Empirical evaluation of language modeling to ascertain cancer outcomes from clinical text reports."
  • Kehl, K. L., Elmarakeby, H., Nishino, M., Van Allen, E. M., Lepisto, E. M., Hassett, M. J., ... & Schrag, D. (2019). Assessment of deep natural language processing in ascertaining oncologic outcomes from radiology reports. JAMA oncology, 5(10), 1421-1429.
  • Kehl, K. L., Xu, W., Gusev, A., Bakouny, Z., Choueiri, T. K., Riaz, I. B., ... & Schrag, D. (2021). Artificial intelligence-aided clinical annotation of a large multi-cancer genomic dataset. Nature communications, 12(1), 1-9.
  • Kehl, K. L., Xu, W., Lepisto, E., Elmarakeby, H., Hassett, M. J., Van Allen, E. M., ... & Schrag, D. (2020). Natural language processing to ascertain cancer outcomes from medical oncologist notes. JCO Clinical Cancer Informatics, 4, 680-690.

Acknowledgements

  • National Cancer Institute (NCI)
  • Doris Duke Charitable Foundation
  • Department of Defense (DoD)
  • Mark Foundation Emerging Leader Award
  • PCF-Movember Challenge Award
