ayushnoori/amanuensis

Winner at TreeHacks 2023

Amanuensis was awarded the BIG IDEA: Patient Safety Technology Prize at TreeHacks 2023, sponsored by the Pittsburgh Regional Health Initiative and the Patient Safety Technology Challenge. Learn more at www.patientsafetytech.com. We're grateful to the sponsors and judges for their consideration and recognition, and to the TreeHacks organizing team at Stanford University.

Prize Description

We're in search of bold new thinking. This is an invitation to solve the problem of medical error, which harms millions of U.S. patients, kills approximately 250,000 people, and costs billions of dollars every year. We're calling on TreeHacks teams to envision the best technology-enabled patient safety solution with the potential to avert patient harm and save lives; the top team will be awarded $2,000. Your hack must align with one of the following five leading patient safety challenges facing health care across the continuum of care: medication errors, procedural/surgical errors, errors during routine patient care (e.g., pressure ulcers, blood clots, falls), infections, and diagnostic safety. Learn more about the problem and get access to resources to help your hack here.

Summary

AI-enabled physician assistant for automated clinical summarization and question generation. Empowering physicians to achieve accurate diagnoses and effective treatments. Project for TreeHacks 2023 at Stanford University.

Problem Statement 💡

The modern electronic health record (EHR) encompasses a treasure trove of information across patient demographics, medical history, clinical data, and other health system interactions (Jensen et al.). Although the EHR represents a valuable resource to track clinical care and retrospectively evaluate clinical decision-making, the data deluge of the EHR often obfuscates key pieces of information necessary for the physician to make an accurate diagnosis and devise an effective treatment plan (Noori and Magdamo et al.). Physicians may struggle to rapidly synthesize the lengthy medical histories of their patients; in the absence of data-driven strategies to extract relevant insights from the EHR, they are often forced to rely on intuition alone to generate patient questions. Further, the EHR search interface is rarely optimized for the physician search workflow, and manual search can be both time-consuming and error-prone.

The volume and complexity of the EHR can lead to missed opportunities for physicians to gather critical information pertinent to patient health, leading to medical errors or poor health outcomes. It is imperative to design tools and services to reduce the burden of manual EHR search on physicians and help them elicit the most relevant information from their patients.

About Amanuensis 📝

Amanuensis is an AI-enabled physician assistant for automated clinical summarization and question generation. By arming physicians with relevant insights collected from the EHR as well as with patient responses to NLP-generated questions, we empower physicians to achieve more accurate diagnoses and effective treatment plans. The Amanuensis pipeline is as follows:

  1. Clinical Summarization: Through our web application, physicians can access the medical records of each of their patients, where they are first presented with a clinical summary: a concise, high-level overview of the patient's medical history, including key information such as diagnoses, medications, and allergies. This summary is automatically generated by Amanuensis using Generative Pre-Trained Transformer 3 (GPT-3), an autoregressive language model with a 2,048-token context window and 175 billion parameters. The physician can review the clinical summary to confirm that it is accurate and relevant to the patient's health.

  2. Question Generation: Next, Amanuensis uses GPT-3 to automatically generate a list of questions that the physician can ask their patient to elicit more information and surface relevant details in the EHR that the physician may not have considered. The NLP-generated questions are automatically sent to the patient prior to their appointment (e.g., once the appointment is scheduled); the physician can then review the patient's responses and use them to inform clinical decision-making during the subsequent encounter. Importantly, we have tested Amanuensis on a large cohort of high-quality simulated EHRs generated by Synthea™.

By guiding doctors to elicit the most relevant information from their patients, Amanuensis can help physicians improve patient outcomes and reduce the incidence of all five types of medical errors: medication errors, patient care complications, procedure/surgery complications, infections, and diagnostic/treatment errors.

Building Process 🏗

To both construct and validate Amanuensis, we used the Synthea™ library to generate synthetic patients and associated EHRs (Walonoski et al.). Synthea™ is an open-source software package that simulates the lifespans of synthetic patients using realistic models of disease progression and corresponding standards of care. These models rely on a diverse set of real-world data sources, including United States Census Bureau demographics, Centers for Disease Control and Prevention (CDC) prevalence and incidence rates, and National Institutes of Health (NIH) reports. The Synthea™ package was developed by an international research collaboration involving the MITRE Corporation and the HIKER Group, and is in turn based on the Publicly Available Data Approach to the Realistic Synthetic EHR framework (Dube and Gallagher). We customized the Synthea™ synthetic data generation workflow to produce the following 18 data tables (see also the Synthea™ data dictionary):

| Table | Description |
| --- | --- |
| Allergies | Patient allergy data. |
| CarePlans | Patient care plan data, including goals. |
| Claims | Patient claim data. |
| ClaimsTransactions | Transactions per line item per claim. |
| Conditions | Patient conditions or diagnoses. |
| Devices | Patient-affixed permanent and semi-permanent devices. |
| Encounters | Patient encounter data. |
| ImagingStudies | Patient imaging metadata. |
| Immunizations | Patient immunization data. |
| Medications | Patient medication data. |
| Observations | Patient observations, including vital signs and lab reports. |
| Organizations | Provider organizations, including hospitals. |
| Patients | Patient demographic data. |
| PayerTransitions | Payer transition data (i.e., changes in health insurance). |
| Payers | Payer organization data. |
| Procedures | Patient procedure data, including surgeries. |
| Providers | Clinicians who provide patient care. |
| Supplies | Supplies used in the provision of care. |

To simulate an EHR system, we pre-processed all synthetic data (see code/construct_database.Rmd) and standardized all fields. Next, we constructed a PostgreSQL database and linked the relevant tables with hand-constructed primary and foreign keys. In total, our database contains 199,717 records from 20 patients across 262 different fields. However, the data generation pipeline scales to tens of thousands of patients, a capacity we have tested.
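For illustration, the hand-constructed keys can be sketched in SQL along these lines. The table and column names below follow the Synthea™ CSV data dictionary, but the exact statements used in construct_database.Rmd may differ:

```sql
-- Hypothetical sketch: in the Synthea CSV data dictionary, patients.id is the
-- patient UUID, and encounters/conditions reference it via a "patient" column.
ALTER TABLE patients   ADD PRIMARY KEY (id);
ALTER TABLE encounters ADD PRIMARY KEY (id);
ALTER TABLE encounters ADD FOREIGN KEY (patient)   REFERENCES patients (id);
ALTER TABLE conditions ADD FOREIGN KEY (patient)   REFERENCES patients (id);
ALTER TABLE conditions ADD FOREIGN KEY (encounter) REFERENCES encounters (id);
```

The same pattern extends to the remaining tables (Medications, Observations, Procedures, and so on), each of which references a patient and, where applicable, an encounter.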

Finally, we coupled the PostgreSQL database with the RedwoodJS full stack web development framework to build a web application that allows:

  1. Physicians: Physicians can access the clinical summaries and questions generated by Amanuensis for each of their patients.
  2. Patients: Patients can access the questions generated by Amanuensis and respond to them via a web form.

To generate both clinical summaries and questions for each patient, we used the OpenAI GPT-3 API. In both cases, GPT-3 was prompted with a subset of the EHR record for a given patient inserted into a prompt template for GPT-readability. Other key features of our web application include:

  1. Authentication: Users can log in with their email addresses; physicians are automatically redirected to their dashboard upon login, while patients are redirected to a page where they can respond to the questions generated by Amanuensis.
  2. EHR Access: Physicians can also access the full synthetic EHR for each patient as well as view autogenerated graphs and data visualizations, which they can use to review the accuracy of the clinical summaries and questions generated by Amanuensis.
  3. Patient Response Collection: Prior to an appointment, Amanuensis automatically collects the patient's responses to the NLP-generated questions and sends them to the physician. During the appointment, these responses inform the physician's clinical decision-making.
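The prompt-templating step described above can be sketched as follows. The field names and template wording here are illustrative assumptions, not the exact prompts Amanuensis uses:

```python
# Hypothetical sketch of inserting a subset of an EHR record into a prompt
# template for GPT-3. Field names and wording are illustrative only.
def build_summary_prompt(record: dict) -> str:
    """Format a patient's key EHR fields into a summarization prompt."""
    return (
        "You are a clinical assistant. Summarize this patient's history "
        "concisely for their physician.\n"
        f"Conditions: {'; '.join(record['conditions'])}\n"
        f"Medications: {'; '.join(record['medications'])}\n"
        f"Allergies: {'; '.join(record['allergies'])}\n"
        "Summary:"
    )

record = {
    "conditions": ["Viral sinusitis", "Hypertension"],
    "medications": ["Lisinopril 10 MG Oral Tablet"],
    "allergies": ["Penicillin V"],
}
prompt = build_summary_prompt(record)
# The resulting string is sent to the OpenAI completions endpoint; the same
# pattern, with a different template, yields the patient questions.
```

A second template with an instruction like "list questions the physician should ask" drives the question-generation step in the same way.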

Future Directions 🚀

In the future, we hope to integrate Amanuensis into existing EHR systems (e.g., Epic, Cerner, etc.), providing physicians with a seamless, AI-powered assistant to help them make more informed clinical decisions. We also plan to enrich our NLP pipeline with real patient data rather than synthetic EHR records. In concert with gold-standard annotations generated by physicians, we intend to fine-tune our question generation and clinical summarization models on real-world data to improve the sophistication and fidelity of the generated text and enable more robust clinical reasoning capabilities.

Base Dependencies 📦

First, create a new Anaconda environment. For example:

conda create --name amanuensis python=3.10

Then, install:

  • R, tested with version == 4.2.1.
  • Python, tested with version == 3.10.9.

Synthetic Patient Data 🩺

Generate synthetic patient data with Synthea™.

Synthea™ requires Java 11 or newer. First, install the Java Development Kit (JDK).

Next, clone the Synthea™ repo, then build and run the test suite:

git clone https://github.com/synthetichealth/synthea.git
cd synthea
./gradlew build check test

In the synthea directory, modify ./src/main/resources/synthea.properties, setting exporter.csv.export = true and generate.only_alive_patients = true. Output will then be generated in ./src/output/csv.
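In synthea.properties, the two settings should then read:

```properties
exporter.csv.export = true
generate.only_alive_patients = true
```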

Again in the synthea directory, use the following command to generate the desired number of patients. The parameters are as follows:

  • -p: number of patients.
  • -s: random seed.
  • -a: patient age range.

./run_synthea -p 20 -s 42 -a 0-100

As specified in the Synthea™ wiki, the CSV exporter will generate files according to the CSV file data dictionary, which is specified here. Copy the generated files from synthea/src/output/csv to amanuensis/patient_data.

PostgreSQL Database 💻

Next, construct the PostgreSQL database using code/construct_database.Rmd. Run the following command in Terminal to install PostgreSQL (the version should match the service commands below).

brew install postgresql@14

Check the version of PostgreSQL as follows.

psql --version

To start PostgreSQL, run the following command.

brew services start postgresql@14

To stop PostgreSQL, run the following command.

brew services stop postgresql@14

Open the psql interactive terminal, which is designed to work with the PostgreSQL database.

psql postgres

Create a new database called amanuensis. List all users and databases.

CREATE DATABASE amanuensis;
\du
\l

Next, run the code in code/construct_database.Rmd to write the synthetic patient data (originally in CSV format) to the newly created PostgreSQL database. Note that the keys must be specified according to the CSV file data dictionary.
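For readers who prefer Python, here is a minimal sketch of the same idea: deriving a table definition from a Synthea CSV header, with every column simplified to TEXT. construct_database.Rmd itself handles column typing and keys more carefully:

```python
# Hypothetical Python sketch of the CSV-to-PostgreSQL step performed in R by
# code/construct_database.Rmd. All columns are simplified to TEXT here.
import csv
import io

def create_table_sql(table: str, csv_text: str) -> str:
    """Build a CREATE TABLE statement from a Synthea CSV header row."""
    header = next(csv.reader(io.StringIO(csv_text)))
    cols = ", ".join(f'"{col.lower()}" TEXT' for col in header)
    return f"CREATE TABLE {table} ({cols});"

# Miniature stand-in for a Synthea patients.csv file.
sample = "Id,BIRTHDATE,FIRST,LAST\nabc-123,1970-01-01,Jane,Doe\n"
sql = create_table_sql("patients", sample)
print(sql)

# The data itself can then be bulk-loaded in psql with, e.g.:
#   \copy patients FROM 'patients.csv' WITH (FORMAT csv, HEADER true)
```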

After constructing the PostgreSQL database, in the psql terminal, connect to the amanuensis database. List all tables in the database.

\c amanuensis
\d

To remove all tables from the database, run the following SQL commands.

DROP SCHEMA public CASCADE;
CREATE SCHEMA public;
GRANT ALL ON SCHEMA public TO public;

The database can be dumped for transfer by running the following in the Terminal.

pg_dump --dbname amanuensis > ./patient_data/db_dump/db_dump.sql

BioGPT Text Generation 🧬

For text generation, we can also use BioGPT (Luo et al., 2022), a generative pre-trained transformer for biomedical text generation and mining. The dependencies mirror those of microsoft/BioGPT at commit f186d88. These include:

  • PyTorch version == 1.13.1.
  • transformers, which provides APIs and tools to easily download and train state-of-the-art pretrained models.

Non-Hugging Face Usage 🤗

Alternatively, BioGPT can be used without Hugging Face 🤗 by installing the dependencies below. See the BioGPT installation instructions for more information.

  • fairseq, tested with version == 0.12.0. Install as follows:

git clone https://github.com/pytorch/fairseq
cd fairseq
git checkout v0.12.0
pip install .
python setup.py build_ext --inplace
cd ..

  • Moses. Install as follows:

git clone https://github.com/moses-smt/mosesdecoder.git
export MOSES=${PWD}/mosesdecoder

  • fastBPE. Install as follows:

git clone https://github.com/glample/fastBPE.git
export FASTBPE=${PWD}/fastBPE
cd fastBPE
g++ -std=c++11 -pthread -O3 fastBPE/main.cc -IfastBPE -o fast

  • sacremoses. Install as follows:

pip install sacremoses

  • scikit-learn (sklearn). Install as follows:

pip install scikit-learn

Remember to set the environment variables MOSES and FASTBPE to the paths of Moses and fastBPE, respectively, as they will be required later. Per the conda documentation, environment variables inside a conda environment can be viewed and set as follows:

conda env config vars list
conda env config vars set MOSES=${PWD}/mosesdecoder
conda env config vars set FASTBPE=${PWD}/fastBPE
conda activate amanuensis

Then, run the following.

import torch
from fairseq.models.transformer_lm import TransformerLanguageModel

# Load the pre-trained BioGPT checkpoint with the Moses tokenizer
# and fastBPE codes.
m = TransformerLanguageModel.from_pretrained(
        "checkpoints/Pre-trained-BioGPT",
        "checkpoint.pt",
        "data",
        tokenizer='moses',
        bpe='fastbpe',
        bpe_codes="data/bpecodes",
        min_len=100,
        max_len_b=1024)

# m.cuda()  # uncomment to run inference on a GPU

src_tokens = m.encode("The patient presents with a history of fever and abdominal cramps for the last 24 hours.")
generate = m.generate([src_tokens], beam=5)[0]  # beam search, beam width 5
output = m.decode(generate[0]["tokens"])
print(output)

Development Team 🧑‍💻

This project was completed during the TreeHacks 2023 hackathon at Stanford University.

References 📚

  1. Noori, A. et al. Development and Evaluation of a Natural Language Processing Annotation Tool to Facilitate Phenotyping of Cognitive Status in Electronic Health Records: Diagnostic Study. Journal of Medical Internet Research 24, e40384 (2022).

  2. Jensen, P. B., Jensen, L. J. & Brunak, S. Mining electronic health records: towards better research applications and clinical care. Nat Rev Genet 13, 395–405 (2012).

  3. Walonoski, J. et al. Synthea: An approach, method, and software mechanism for generating synthetic patients and the synthetic electronic health care record. Journal of the American Medical Informatics Association 25, 230–238 (2018).

  4. Dube, K. & Gallagher, T. Approach and Method for Generating Realistic Synthetic Electronic Healthcare Records for Secondary Use. in Foundations of Health Information Engineering and Systems (eds. Gibbons, J. & MacCaull, W.) 69–86 (Springer, 2014). doi:10.1007/978-3-642-53956-5_6.
