


Evaluation of Transformer-based models


Code to accompany the paper:
"Empirical evaluation of language modeling to ascertain cancer outcomes from clinical text reports"


Table of Contents

  1. About The Project
  2. Getting Started
  3. Usage
  4. References
  5. License
  6. Contact
  7. Acknowledgements

About The Project


Language modeling has become a central tool for modern natural language processing across multiple domains. Here, we evaluated its utility for extracting cancer outcomes data from clinical text reports. This outcomes extraction task is a key rate-limiting step for asking observational cancer research questions intended to promote precision cancer care using large linked clinical and molecular datasets. Traditional medical record annotation is a slow, manual process, and scaling it up is critically important for fast, accurate clinical decision making.

We have previously demonstrated that simple convolutional neural networks (CNNs), trained on a labeled dataset of imaging reports for over 1,000 patients with non-small cell lung cancer, can yield models able to accurately capture key clinical outcomes from each report, including cancer progression/worsening and response/improvement. In the current analysis, we evaluated whether pre-trained Transformer models, with or without domain adaptation using imaging reports from our institution, can improve performance or reduce the volume of training data necessary to yield well-performing models for this document classification task. We performed extensive analyses of multiple variants of pre-trained Transformer models, considering major modeling factors such as 1) training sample size, 2) classification architecture, 3) language-model fine-tuning, 4) classification task, 5) length of text considered, and 6) number of parameters of the Transformer models. We report the performance of these models under each of these considerations.
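
For orientation only, the sketch below shows the general shape of this kind of Transformer document classifier using the Hugging Face transformers library. It is not the repository's code: the checkpoint name, label count, and truncation length are illustrative assumptions.

    # Minimal sketch, NOT the repository's code: a pre-trained Transformer used as a
    # report-level classifier, with the encoder either frozen or fine-tuned.
    # Checkpoint name, label count, and max length are illustrative assumptions.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    checkpoint = "prajjwal1/bert-tiny"  # example "tiny" BERT; larger variants work the same way
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

    # "Frozen" setting: keep the pre-trained encoder fixed and train only the
    # classification head; the "tuned" setting leaves all parameters trainable.
    for param in model.base_model.parameters():
        param.requires_grad = False

    report = "Interval increase in size of the dominant right upper lobe mass."
    inputs = tokenizer(report, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits  # scores for, e.g., progression vs. no progression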

Getting Started

To get a local copy up and running, follow these simple steps:

Prerequisites

  • Python 3.7; check environment.yml for the list of required packages

Installation

  1. Clone the repo

    git clone https://github.com/marakeby/clinicalNLP2.git
  2. Create the conda environment. Note that not all packages are needed to generate the paper figures; some are needed only for training.

    conda env create --name cnlp_env --file=environment.yml
  3. Depending on your use case, you may need to download one or more of the following:

    a. Log files (needed to regenerate the paper figures). Extract the files under the _cnlp_results directory. If you would like to store them somewhere else, set the TEST_RESULTS_PATH variable in config_path.py accordingly.

    b. Plot files (a copy of the paper images). Extract the files under the _cnlp_plots directory. If you would like to store them somewhere else, set the PLOTS_PATH variable in config_path.py accordingly (see the config_path.py sketch below).
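
For reference, a minimal sketch of the two path variables mentioned above is shown here; the actual config_path.py in this repository may define additional paths, and the directory names are just the defaults described in these steps.

    # Hypothetical sketch of the path variables referenced above; the actual
    # config_path.py may define more. Edit these if you extract the downloads
    # to non-default locations.
    from os.path import dirname, join, realpath

    BASE_PATH = dirname(realpath(__file__))
    TEST_RESULTS_PATH = join(BASE_PATH, '_cnlp_results')  # extracted log files
    PLOTS_PATH = join(BASE_PATH, '_cnlp_plots')           # extracted plot files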

Usage

  1. Activate the created conda environment

    source activate cnlp_env
  2. Add the repository directory to PYTHONPATH, e.g.

    export PYTHONPATH=~/clinicalNLP2:$PYTHONPATH
  3. To generate all paper figures, run

    cd ./paper_analysis
    python generate_figures.py
  4. To generate an individual paper figure, run the corresponding script under the 'paper_analysis_revision2' directory, e.g.

    cd ./paper_analysis_revision2
    python figure_4_samples_sizes.py
  5. To re-train a model from scratch, run

    cd ./train
    python run_testing.py

    This will run the experiment bert_classifier/progression_one_split_BERT_sizes_tiny_frozen_tuned, which trains a model to predict progression for cancer patients using a fine-tuned tiny BERT model under different training set sizes. The results of the experiment will be stored under _logs in a directory with the same name as the experiment. To run another experiment, uncomment the corresponding line in run_testing.py, as illustrated in the sketch below.
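
    The comment/uncomment pattern might look roughly like the hypothetical sketch below; only the default experiment name is taken from this README, and the commented line is a placeholder for one of the other experiments defined in the repository.

    # Hypothetical illustration of experiment selection in run_testing.py;
    # only the first name comes from this README, the commented line is a
    # placeholder for another experiment defined in the repository.
    experiment = 'bert_classifier/progression_one_split_BERT_sizes_tiny_frozen_tuned'
    # experiment = 'bert_classifier/<another_experiment_name>'  # uncomment to run a different one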



Note that the underlying EHR text reports used to train and evaluate NLP models for these analyses constitute protected health information for DFCI patients and therefore cannot be made publicly available. Researchers with DFCI appointments and Institutional Review Board (IRB) approval can access the data on request. For external researchers, access would require collaboration with the authors and eligibility for a DFCI appointment, per DFCI policies.

License

Distributed under the GPL-2.0 License. See LICENSE for more information.

Contact

Haitham - @HMarakeby

Project Link: https://github.com/marakeby/clinicalNLP2

References

  • Elmarakeby, H., et al. "Empirical evaluation of language modeling to ascertain cancer outcomes from clinical text reports."
  • Kehl, K. L., Elmarakeby, H., Nishino, M., Van Allen, E. M., Lepisto, E. M., Hassett, M. J., ... & Schrag, D. (2019). Assessment of deep natural language processing in ascertaining oncologic outcomes from radiology reports. JAMA oncology, 5(10), 1421-1429.
  • Kehl, K. L., Xu, W., Gusev, A., Bakouny, Z., Choueiri, T. K., Riaz, I. B., ... & Schrag, D. (2021). Artificial intelligence-aided clinical annotation of a large multi-cancer genomic dataset. Nature communications, 12(1), 1-9.
  • Kehl, K. L., Xu, W., Lepisto, E., Elmarakeby, H., Hassett, M. J., Van Allen, E. M., ... & Schrag, D. (2020). Natural language processing to ascertain cancer outcomes from medical oncologist notes. JCO Clinical Cancer Informatics, 4, 680-690.

Acknowledgements

  • National Cancer Institute (NCI)
  • Doris Duke Charitable Foundation
  • Department of Defense (DoD)
  • Mark Foundation Emerging Leader Award
  • PCF-Movember Challenge Award
