BioDEX: Large-Scale Biomedical Adverse Drug Event Extraction for Real-World Pharmacovigilance.

This is the official repository for the BioDEX paper.

BioDEX is a raw resource for drug safety monitoring that bundles full-text and abstract-only PubMed papers with drug safety reports. These reports contain structured information about the Adverse Drug Events (ADEs) described in the papers, and are produced by medical experts in real-world settings.

BioDEX contains 19k full-text papers, 65k abstracts, and over 256k associated drug-safety reports.

Our data and models are available on Hugging Face. If you're interested in full drug safety reports, use BioDEX-ICSR. If you only want to extract reactions (as in In-Context Learning for Extreme Multi-Label Classification), use BioDEX-Reactions.

Overview of this repository

This repository is structured as follows:

demo.ipynb contains some quick demonstrations of the data.
analysis/ contains the data and notebooks to reproduce all plots in the paper.
src/ contains all code to represent the data objects and calculate the metrics.
data_creation/ contains the code to create the Report-Extraction dataset starting from the raw resource. Code to create the raw resource from scratch will be released soon.
tasks/icsr_extraction/ contains the code to train and evaluate models for the Report-Extraction task.

Installation

Create the conda environment and install the code:

conda create -n biodex python=3.9
conda activate biodex
pip install -r requirements.txt
pip install .

Demos

You can find the code for these demos in demo.ipynb or in the sections below.

Load the raw resource

import datasets

# load the raw dataset
dataset = datasets.load_dataset("BioDEX/raw_dataset")['train']

print(len(dataset)) # 65,648

# investigate an example
article = dataset[1]['article']
report = dataset[1]['reports'][0]

print(article['title'])    # Case Report: Perioperative Kounis Syndrome in an Adolescent With Congenital Glaucoma.
print(article['abstract']) # A 12-year-old male patient suffering from congenital glaucoma developed bradycardia, ...
print(article['fulltext']) # ...
print(article['fulltext_license']) # CC BY

print(report['patient']['patientsex']) # 1
print(report['patient']['drug'][0]['activesubstance']['activesubstancename']) # ATROPINE SULFATE
print(report['patient']['drug'][0]['drugadministrationroute']) # 040
print(report['patient']['drug'][1]['activesubstance']['activesubstancename']) # MIDAZOLAM
print(report['patient']['drug'][1]['drugindication']) # Anaesthesia
print(report['patient']['reaction'][0]['reactionmeddrapt'])  # Kounis syndrome
print(report['patient']['reaction'][1]['reactionmeddrapt'])  # Hypersensitivity
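
Several report fields hold codes rather than labels (for example, patientsex 1 and drugadministrationroute 040 above). The sketch below decodes them with a small excerpt of the value tables from the report schema at the end of this README. The helper name decode_report and the toy report dict are illustrative assumptions and are not part of the repository.

```python
# Excerpts of the value tables listed in the report schema of this README.
PATIENT_SEX = {"0": "Unknown", "1": "Male", "2": "Female"}
ADMIN_ROUTE = {
    "040": "Intravenous bolus",
    "041": "Intravenous drip",
    "042": "Intravenous (not otherwise specified)",
    "048": "Oral",
}

def decode_report(report):
    """Return a human-readable summary of a raw report dict (hypothetical helper)."""
    patient = report["patient"]
    drugs = []
    for drug in patient["drug"]:
        route_code = drug.get("drugadministrationroute")
        substance = drug.get("activesubstance") or {}
        drugs.append({
            "name": substance.get("activesubstancename"),
            # Fall back to the raw code when it is not in the excerpt above.
            "route": ADMIN_ROUTE.get(route_code, route_code),
        })
    return {
        "sex": PATIENT_SEX.get(str(patient.get("patientsex")), "Unknown"),
        "drugs": drugs,
        "reactions": [r["reactionmeddrapt"] for r in patient["reaction"]],
    }

# Toy report mirroring the values printed above (not loaded from the dataset).
report = {
    "patient": {
        "patientsex": "1",
        "drug": [
            {"activesubstance": {"activesubstancename": "ATROPINE SULFATE"},
             "drugadministrationroute": "040"},
            {"activesubstance": {"activesubstancename": "MIDAZOLAM"},
             "drugindication": "Anaesthesia"},
        ],
        "reaction": [
            {"reactionmeddrapt": "Kounis syndrome"},
            {"reactionmeddrapt": "Hypersensitivity"},
        ],
    }
}

summary = decode_report(report)
print(summary["sex"])                 # Male
print(summary["drugs"][0]["route"])   # Intravenous bolus
```

With the real dataset, dataset[1]['reports'][0] could be passed in place of the toy dict, assuming the fields are present for that report.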

Optional: use our code to parse the raw resource into Python objects for easy manipulation

import datasets
from src.utils import get_matches

# load the raw dataset
dataset = datasets.load_dataset("BioDEX/raw_dataset")['train']
dataset = get_matches(dataset)

print(len(dataset)) # 65,648

# investigate an example
article = dataset[1].article
report = dataset[1].reports[0]

print(article.title)    # Case Report: Perioperative Kounis Syndrome in an Adolescent With Congenital Glaucoma.
print(article.abstract) # A 12-year-old male patient suffering from congenital glaucoma developed bradycardia, ...
print(article.fulltext) # ...
print(article.fulltext_license) # CC BY

print(report.patient.patientsex) # 1
print(report.patient.drug[0].activesubstance.activesubstancename) # ATROPINE SULFATE
print(report.patient.drug[0].drugadministrationroute) # 040
print(report.patient.drug[1].activesubstance.activesubstancename) # MIDAZOLAM
print(report.patient.drug[1].drugindication) # Anaesthesia
print(report.patient.reaction[0].reactionmeddrapt)  # Kounis syndrome
print(report.patient.reaction[1].reactionmeddrapt)  # Hypersensitivity

Load the Report-Extraction dataset

import datasets

# load the report-extraction dataset
dataset = datasets.load_dataset("BioDEX/BioDEX-ICSR")

print(len(dataset['train']))        # 9,624
print(len(dataset['validation']))   # 2,407
print(len(dataset['test']))         # 3,628

example = dataset['train'][0]

print(example['fulltext_processed'][:1000], '...') # TITLE: # SARS-CoV-2-related ARDS in a maintenance hemodialysis patient ...
print(example['target']) # serious: 1 patientsex: 1 drugs: ACETAMINOPHEN, ASPIRIN ...

Use our fine-tuned Report-Extraction model

from transformers import AutoTokenizer, T5ForConditionalGeneration
import datasets

# load the report-extraction dataset
dataset = datasets.load_dataset("BioDEX/BioDEX-ICSR")

# load the model
model_path = "BioDEX/flan-t5-large-report-extraction"
model = T5ForConditionalGeneration.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# get an input and encode it (named input_text to avoid shadowing the built-in input)
input_text = dataset['validation'][1]['fulltext_processed']
input_encoded = tokenizer(input_text, max_length=2048, truncation=True, padding="max_length", return_tensors='pt')

# forward pass
output_encoded = model.generate(**input_encoded, max_length=256)

output = tokenizer.batch_decode(output_encoded, skip_special_tokens=True)
output = output[0]

print(output) # serious: 1 patientsex: 2 drugs: AMLODIPINE BESYLATE, LISINOPRIL reactions: Intentional overdose, Metabolic acidosis, Shock
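
The model emits the report as a flat string of key: value pairs. A minimal sketch of turning that string back into a structured object follows. It assumes the four field names seen in the example output (serious, patientsex, drugs, reactions); parse_target is a hypothetical helper, not the parser shipped in src/.

```python
import re

FIELDS = ("serious", "patientsex", "drugs", "reactions")

def parse_target(text):
    """Split a flat 'key: value' report string into a dict (illustrative parser)."""
    pattern = r"\b(" + "|".join(FIELDS) + r"):\s*"
    # With a capture group, re.split returns ['', key1, value1, key2, value2, ...].
    parts = re.split(pattern, text)
    parsed = {}
    for key, value in zip(parts[1::2], parts[2::2]):
        value = value.strip()
        if key in ("drugs", "reactions"):
            # List-valued fields are comma-separated.
            parsed[key] = [v.strip() for v in value.split(",") if v.strip()]
        else:
            parsed[key] = value
    return parsed

output = ("serious: 1 patientsex: 2 drugs: AMLODIPINE BESYLATE, LISINOPRIL "
          "reactions: Intentional overdose, Metabolic acidosis, Shock")
parsed = parse_target(output)
print(parsed["drugs"])      # ['AMLODIPINE BESYLATE', 'LISINOPRIL']
print(parsed["reactions"])  # ['Intentional overdose', 'Metabolic acidosis', 'Shock']
```

This simple split would misparse a drug or reaction name that itself contains one of the field names followed by a colon, so treat it as a starting point only.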

Train and evaluate Report-Extraction models

All code for this task is located in tasks/icsr_extraction/. Make sure to activate the biodex environment!

Fine-tune a new Report-Extraction model

cd tasks/icsr_extraction

python run_encdec_for_icsr_extraction.py \
    --overwrite_cache False \
    --seed 42 \
    --dataset_name BioDEX/BioDEX-ICSR \
    --text_column fulltext_processed \
    --summary_column target \
    --model_name_or_path google/flan-t5-large \
    --output_dir ../../checkpoints/flan-t5-large-report-extraction \
    --max_source_length 2048 \
    --max_target_length 256 \
    --do_train True \
    --do_eval True \
    --lr_scheduler_type linear \
    --warmup_ratio 0.0 \
    --learning_rate 0.0001 \
    --optim adafactor \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --eval_accumulation_steps 16 \
    --num_train_epochs 5 \
    --bf16 True \
    --evaluation_strategy epoch \
    --logging_strategy steps \
    --save_strategy epoch \
    --logging_steps 100 \
    --save_total_limit 1 \
    --report_to wandb \
    --load_best_model_at_end True \
    --metric_for_best_model loss \
    --greater_is_better False \
    --predict_with_generate True \
    --generation_max_length 256 \
    --num_beams 1 \
    --repetition_penalty 1.0

Thus far, we only consider fine-tuning encoder-decoder models in the paper. Training a decoder-only model is still a work in progress, but we've supplied some code at ./tasks/icsr_extraction/run_decoder_for_icsr_extraction.py.
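
As a sanity check on the fine-tuning command above, the following arithmetic works out the effective batch size and the approximate number of optimizer steps from its flags and the training-split size reported earlier (9,624). The device count is an assumption (a single GPU); the README does not state the hardware used.

```python
# Values taken from the fine-tuning command above.
per_device_train_batch_size = 1
gradient_accumulation_steps = 4
num_train_epochs = 5
num_train_examples = 9_624  # size of the BioDEX-ICSR train split

# Assumption: a single device. Multiply accordingly for multi-GPU runs.
num_devices = 1

effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)
updates_per_epoch = num_train_examples // effective_batch_size
total_update_steps = updates_per_epoch * num_train_epochs

print(effective_batch_size)  # 4
print(total_update_steps)    # 12030
```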

Reproduce our fine-tune evaluation run

This run evaluates our fine-tuned model hosted on Hugging Face.

cd tasks/icsr_extraction

python run_encdec_for_icsr_extraction.py \
    --overwrite_cache False \
    --seed 42 \
    --dataset_name BioDEX/BioDEX-ICSR \
    --text_column fulltext_processed \
    --summary_column target \
    --model_name_or_path BioDEX/flan-t5-large-report-extraction \
    --output_dir ../../checkpoints/flan-t5-large-report-extraction \
    --max_source_length 2048 \
    --max_target_length 256 \
    --do_train False \
    --do_eval True \
    --lr_scheduler_type linear \
    --warmup_ratio 0.0 \
    --learning_rate 0.0001 \
    --optim adafactor \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --eval_accumulation_steps 16 \
    --num_train_epochs 5 \
    --bf16 True \
    --evaluation_strategy epoch \
    --logging_strategy steps \
    --save_strategy epoch \
    --logging_steps 100 \
    --save_total_limit 1 \
    --report_to wandb \
    --load_best_model_at_end True \
    --metric_for_best_model loss \
    --greater_is_better False \
    --predict_with_generate True \
    --generation_max_length 256 \
    --num_beams 1 \
    --repetition_penalty 1.0

Add --do_predict True to get the results on the test set.

Reproduce our few-shot in-context learning results

We use the DSP framework to perform in-context learning experiments.

At the time of writing, DSP does not support a truncation strategy. This is vital for our task given the long inputs. To fix this and reproduce our results, you need to replace the predict.py file of your local dsp package (path/to/local/dsp/primitives/predict.py) with the adapted version located at tasks/icsr_extraction/dsp_predict_path.py.
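
The adapted file is not reproduced here. As a rough illustration of what a truncation strategy has to do, the sketch below shortens only the final article so that the demonstrations plus the article fit a token budget. Whitespace-separated words stand in for real tokens, and build_prompt, truncate_to_budget, and their parameters are hypothetical names, not the DSP API. Reserving room for the generated output inside the budget is also an assumption.

```python
def truncate_to_budget(text, budget):
    """Keep at most `budget` whitespace-separated tokens from the start of `text`."""
    tokens = text.split()
    return " ".join(tokens[:max(budget, 0)])

def build_prompt(demos, article, max_prompt_length, reserve_for_output):
    """Concatenate the demos and a truncated article within the token budget."""
    demo_text = "\n\n".join(demos)
    used = len(demo_text.split())
    # Whatever is left after the demos and the reserved output goes to the article.
    budget = max_prompt_length - reserve_for_output - used
    return demo_text + "\n\n" + truncate_to_budget(article, budget)

# Toy inputs: two short demonstrations and one overly long article.
demos = ["Text: short case report. Report: serious: 1"] * 2
article = "word " * 10_000

prompt = build_prompt(demos, article, max_prompt_length=4096, reserve_for_output=128)
print(len(prompt.split()))  # 3968
```

A real implementation would count tokens with the model's tokenizer and might also truncate the demonstrations, which this sketch leaves untouched.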

Run text-davinci-003:

cd tasks/icsr_extraction

python run_gpt3_for_icsr_extraction.py \
    --max_dev_samples 100 \
    --max_tokens 128 \
    --max_prompt_length 4096 \
    --n_demos 7 \
    --output_dir ../../checkpoints/ \
    --model_name text-davinci-003 \
    --fulltext True

Run gpt-4:

cd tasks/icsr_extraction

python run_gpt3_for_icsr_extraction.py \
    --max_dev_samples 100 \
    --max_tokens 128 \
    --max_prompt_length 4096 \
    --n_demos 7 \
    --output_dir ../../checkpoints/ \
    --model_name gpt-4 \
    --chat_model True \
    --fulltext True

Add --validation_split test to get the results on the test set.

Limitations

See section 9 of the BioDEX paper for limitations and ethical considerations.

Contact

Open an issue on this GitHub page or email karel[dot]doosterlinck[at]ugent[dot].be and preferably include "[BioDEX]" in the subject.

Data License

BioDEX bundles the following resources:

Medline: This produces all article fields except fulltext and fulltext_license.
FAERS: This produces all report fields and is covered under a CC0 license, as stated on their website.
PubMed Central Open Access Subset: This produces the fulltext and fulltext_license fields for the article. The PubMed Central Open Access Subset covers papers that are licensed under Creative Commons or similar liberal terms. BioDEX features full-text papers from the commercial (CC0, CC BY, CC BY-SA, CC BY-ND) and non-commercial (CC BY-NC, CC BY-NC-SA, CC BY-NC-ND) sets. The license is given per applicable BioDEX example in the fulltext_license field of the article.

Medline was provided by courtesy of the U.S. National Library of Medicine (NLM). This does not imply the NLM has endorsed BioDEX. The data distributed in BioDEX does not reflect the most current/accurate data available from NLM.

Create a smaller, commercially licensed BioDEX dataset

Filter the raw resource to only include fulltext papers with a commercial license:

import datasets

# load the raw dataset
dataset = datasets.load_dataset("BioDEX/raw_dataset")['train']
print(len(dataset)) # 65,648

# remove all fulltext papers with no commercial license
commercial_licenses = {'CC0', 'CC BY', 'CC BY-SA', 'CC BY-ND'}

def remove_noncom_paper(example):
    # remove the fulltext if no commercial license, keep all the other data of the example
    if example['article']['fulltext_license'] not in commercial_licenses:
        example['article']['fulltext'] = None
    return example

dataset_commercial = dataset.map(remove_noncom_paper)
print(len(dataset_commercial)) # 65,648 (no examples were dropped, only some fulltext fields were removed)

If you want to train a report-extraction model on this commercial dataset, repeat the steps outlined in data_creation/icsr_extraction/icsr_extraction.ipynb with this new dataset_commercial to create a new report-extraction dataset.
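
To see how much full text survives the filter, it can help to tally the fulltext_license values first. The sketch below does this on a toy list of examples shaped like the raw resource; license_summary is a hypothetical helper, and the toy licenses are made up for illustration rather than taken from the dataset.

```python
from collections import Counter

COMMERCIAL_LICENSES = {"CC0", "CC BY", "CC BY-SA", "CC BY-ND"}

def license_summary(examples):
    """Count examples per license and how many carry a commercial license."""
    counts = Counter(ex["article"]["fulltext_license"] for ex in examples)
    commercial = sum(n for lic, n in counts.items() if lic in COMMERCIAL_LICENSES)
    return counts, commercial

# Toy examples; None marks an abstract-only paper without full text.
toy_examples = [
    {"article": {"fulltext_license": "CC BY"}},
    {"article": {"fulltext_license": "CC BY-NC"}},
    {"article": {"fulltext_license": None}},
    {"article": {"fulltext_license": "CC BY"}},
]

counts, commercial = license_summary(toy_examples)
print(counts["CC BY"])  # 2
print(commercial)       # 2
```

Passing the loaded raw dataset instead of toy_examples should work the same way, though iterating over every example will be slow because each one carries its full text.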

Citation

@misc{doosterlinck2023biodex,
      title={BioDEX: Large-Scale Biomedical Adverse Drug Event Extraction for Real-World Pharmacovigilance}, 
      author={Karel D'Oosterlinck and François Remy and Johannes Deleu and Thomas Demeester and Chris Develder and Klim Zaporojets and Aneiss Ghodsi and Simon Ellershaw and Jack Collins and Christopher Potts},
      year={2023},
      eprint={2305.13395},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

BioDEX data schema

Article fields

Adapted from pubmed-parser.

fields	description
title	Title of the article
pmid	PubMed ID
issue	The Issue of the journal
pages	Pages of the article in the journal publication
abstract	Abstract of the article
fulltext	The full text associated with the article from the PubMed Central Open Access Subset, if available
fulltext_license	The license associated with the full text paper from the PubMed Central Open Access Subset, if available
journal	Journal of the given paper
authors	Authors, each separated by ';'
affiliations	The affiliations of the authors
pubdate	Publication date. Defaults to year information only.
doi	DOI
medline_ta	Abbreviation of the journal name
nlm_unique_id	NLM unique identification
issn_linking	ISSN linkage, typically used to link with the Web of Science dataset
country	Country extracted from journal information field
mesh_terms	List of MeSH terms with corresponding MeSH ID, each separated by ';' e.g. 'D000161:Acoustic Stimulation; D000328:Adult; ...'
publication_types	List of publication types, each separated by ';' e.g. 'D016428:Journal Article'
chemical_list	List of chemical terms, each separated by ';'
keywords	List of keywords, each separated by ';'
reference	String of PMIDs, each separated by ';', or list of references made to the article
delete	Boolean, 'False' means paper got updated so you might have two
pmc	PubMed Central ID
other_id	Other IDs found, each separated by ';'
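
Several article fields (mesh_terms, publication_types, chemical_list, keywords) are stored as ';'-separated strings. A small sketch of splitting mesh_terms into (ID, term) pairs, based on the example format shown in the table; parse_mesh_terms is a hypothetical helper, not part of the repository:

```python
def parse_mesh_terms(mesh_terms):
    """Split 'ID:Term; ID:Term' into a list of (ID, term) tuples."""
    pairs = []
    for item in mesh_terms.split(";"):
        item = item.strip()
        if not item:
            continue  # skip empty fragments, e.g. from a trailing ';'
        mesh_id, _, term = item.partition(":")
        pairs.append((mesh_id.strip(), term.strip()))
    return pairs

print(parse_mesh_terms("D000161:Acoustic Stimulation; D000328:Adult"))
# [('D000161', 'Acoustic Stimulation'), ('D000328', 'Adult')]
```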

Report fields

Adapted from OpenFDA.

fields	description	values
authoritynumb	Populated with the Regulatory Authority’s case report number, when available.	Undefined
companynumb	Identifier for the company providing the report. This is self-assigned.	Undefined
duplicate	This value is `1` if earlier versions of this report were submitted to FDA. openFDA only shows the most recent version.	Undefined
fulfillexpeditecriteria	Identifies expedited reports (those that were processed within 15 days).	1: True, 2: False
occurcountry	The name of the country where the event occurred.	name: Country codes, link: http://data.okfn.org/data/core/country-list
patient.drug.items.actiondrug	Actions taken with the drug.	1: Drug withdrawn, 2: Dose reduced, 3: Dose increased, 4: Dose not changed, 5: Unknown, 6: Not applicable
patient.drug.items.activesubstance.activesubstancename	Product active ingredient, which may be different than other drug identifiers (when provided).	Undefined
patient.drug.items.drugadditional	Dechallenge outcome information—whether the event abated after product use stopped or the dose was reduced. Only present when this was attempted and the data was provided.	1: Yes, 2: No, 3: Does not apply
patient.drug.items.drugadministrationroute	The drug’s route of administration.	001: Auricular (otic), 002: Buccal, 003: Cutaneous, 004: Dental, 005: Endocervical, 006: Endosinusial, 007: Endotracheal, 008: Epidural, 009: Extra-amniotic, 010: Hemodialysis, 011: Intra corpus cavernosum, 012: Intra-amniotic, 013: Intra-arterial, 014: Intra-articular, 015: Intra-uterine, 016: Intracardiac, 017: Intracavernous, 018: Intracerebral, 019: Intracervical, 020: Intracisternal, 021: Intracorneal, 022: Intracoronary, 023: Intradermal, 024: Intradiscal (intraspinal), 025: Intrahepatic, 026: Intralesional, 027: Intralymphatic, 028: Intramedullar (bone marrow), 029: Intrameningeal, 030: Intramuscular, 031: Intraocular, 032: Intrapericardial, 033: Intraperitoneal, 034: Intrapleural, 035: Intrasynovial, 036: Intratumor, 037: Intrathecal, 038: Intrathoracic, 039: Intratracheal, 040: Intravenous bolus, 041: Intravenous drip, 042: Intravenous (not otherwise specified), 043: Intravesical, 044: Iontophoresis, 045: Nasal, 046: Occlusive dressing technique, 047: Ophthalmic, 048: Oral, 049: Oropharyngeal, 050: Other, 051: Parenteral, 052: Periarticular, 053: Perineural, 054: Rectal, 055: Respiratory (inhalation), 056: Retrobulbar, 057: Subconjunctival, 058: Subcutaneous, 059: Subdermal, 060: Sublingual, 061: Topical, 062: Transdermal, 063: Transmammary, 064: Transplacental, 065: Unknown, 066: Urethral, 067: Vaginal
patient.drug.items.drugauthorizationnumb	Drug authorization or application number (NDA or ANDA), if provided.	Undefined
patient.drug.items.drugbatchnumb	Drug product lot number, if provided.	Undefined
patient.drug.items.drugcharacterization	Reported role of the drug in the adverse event report. These values are not validated by FDA.	1: Suspect (the drug was considered by the reporter to be the cause), 2: Concomitant (the drug was reported as being taken along with the suspect drug), 3: Interacting (the drug was considered by the reporter to have interacted with the suspect drug)
patient.drug.items.drugcumulativedosagenumb	The cumulative dose taken until the first reaction was experienced, if provided.	Undefined
patient.drug.items.drugcumulativedosageunit	The unit for `drugcumulativedosagenumb`.	001: kg (kilograms), 002: g (grams), 003: mg (milligrams), 004: µg (micrograms)
patient.drug.items.drugdosageform	The drug’s dosage form. There is no standard, but values may include terms like `tablet` or `solution for injection`.	Undefined
patient.drug.items.drugdosagetext	Additional detail about the dosage taken. Frequently unknown, but occasionally including information like a brief textual description of the schedule of administration.	Undefined
patient.drug.items.drugenddate	Date the patient stopped taking the drug.	Undefined
patient.drug.items.drugenddateformat	Encoding format of the field `drugenddate`. Always set to `102` (YYYYMMDD).	Undefined
patient.drug.items.drugindication	Indication for the drug’s use.	Undefined
patient.drug.items.drugintervaldosagedefinition	The unit for the interval in the field `drugintervaldosageunitnumb`.	801: Year, 802: Month, 803: Week, 804: Day, 805: Hour, 806: Minute, 807: Trimester, 810: Cyclical, 811: Trimester, 812: As necessary, 813: Total
patient.drug.items.drugintervaldosageunitnumb	Number of units in the field `drugintervaldosagedefinition`.	Undefined
patient.drug.items.drugrecurreadministration	Whether the reaction occurred after readministration of the drug.	1: Yes, 2: No, 3: Unknown
patient.drug.items.drugrecurrence.drugrecuraction	Populated with the Reaction/Event information if/when `drugrecurreadministration` equals `1`.	Undefined
patient.drug.items.drugrecurrence.drugrecuractionmeddraversion	The version of MedDRA from which the term in `drugrecuraction` is drawn.	Undefined
patient.drug.items.drugseparatedosagenumb	The number of separate doses that were administered.	Undefined
patient.drug.items.drugstartdate	Date the patient began taking the drug.	Undefined
patient.drug.items.drugstartdateformat	Encoding format of the field `drugstartdate`. Always set to `102` (YYYYMMDD).	Undefined
patient.drug.items.drugstructuredosagenumb	The number portion of a dosage; when combined with `drugstructuredosageunit` the complete dosage information is represented. For example, 300 in `300 mg`.	Undefined
patient.drug.items.drugstructuredosageunit	The unit for the field `drugstructuredosagenumb`. For example, mg in `300 mg`.	001: kg (kilograms), 002: g (grams), 003: mg (milligrams), 004: µg (micrograms)
patient.drug.items.drugtreatmentduration	The interval of the field `drugtreatmentdurationunit` for which the patient was taking the drug.	Undefined
patient.drug.items.drugtreatmentdurationunit	None	801: Year, 802: Month, 803: Week, 804: Day, 805: Hour, 806: Minute
patient.drug.items.medicinalproduct	Drug name. This may be the valid trade name of the product (such as `ADVIL` or `ALEVE`) or the generic name (such as `IBUPROFEN`). This field is not systematically normalized. It may contain misspellings or idiosyncratic descriptions of drugs, such as combination products such as those used for birth control.	Undefined
patient.patientagegroup	Populated with Patient Age Group code.	1: Neonate, 2: Infant, 3: Child, 4: Adolescent, 5: Adult, 6: Elderly
patient.patientdeath.patientdeathdate	If the patient died, the date that the patient died.	Undefined
patient.patientdeath.patientdeathdateformat	Encoding format of the field `patientdeathdate`. Always set to `102` (YYYYMMDD).	Undefined
patient.patientonsetage	Age of the patient when the event first occurred.	Undefined
patient.patientonsetageunit	The unit for the interval in the field `patientonsetage`.	800: Decade, 801: Year, 802: Month, 803: Week, 804: Day, 805: Hour
patient.patientsex	The sex of the patient.	0: Unknown, 1: Male, 2: Female
patient.patientweight	The patient weight, in kg (kilograms).	Undefined
patient.reaction.items.reactionmeddrapt	Patient reaction, as a MedDRA term. Note that these terms are encoded in British English. For instance, diarrhea is spelled `diarrhoea`. MedDRA is a standardized medical terminology.	name: MedDRA, link: http://www.fda.gov/ForIndustry/DataStandards/StructuredProductLabeling/ucm162038.htm
patient.reaction.items.reactionmeddraversionpt	The version of MedDRA from which the term in `reactionmeddrapt` is drawn.	Undefined
patient.reaction.items.reactionoutcome	Outcome of the reaction in `reactionmeddrapt` at the time of last observation.	1: Recovered/resolved, 2: Recovering/resolving, 3: Not recovered/not resolved, 4: Recovered/resolved with sequelae (consequent health issues), 5: Fatal, 6: Unknown
patient.summary.narrativeincludeclinical	Populated with Case Event Date, when available; does `NOT` include Case Narrative.	Undefined
primarysource.literaturereference	Populated with the Literature Reference information, when available.	Undefined
primarysource.qualification	Category of individual who submitted the report.	1: Physician, 2: Pharmacist, 3: Other health professional, 4: Lawyer, 5: Consumer or non-health professional
primarysource.reportercountry	Country from which the report was submitted.	Undefined
primarysourcecountry	Country of the reporter of the event.	name: Country codes, link: http://data.okfn.org/data/core/country-list
receiptdate	Date that the _most recent_ information in the report was received by FDA.	Undefined
receiptdateformat	Encoding format of the `receiptdate` field. Always set to 102 (YYYYMMDD).	Undefined
receivedate	Date that the report was _first_ received by FDA. If this report has multiple versions, this will be the date the first version was received by FDA.	Undefined
receivedateformat	Encoding format of the `receivedate` field. Always set to 102 (YYYYMMDD).	Undefined
receiver.receiverorganization	Name of the organization receiving the report. Because FDA received the report, the value is always `FDA`.	Undefined
receiver.receivertype	The type of organization receiving the report. The value, `6`, is only specified if it is `other`; otherwise it is left blank.	6: Other
reportduplicate.duplicatenumb	The case identifier for the duplicate.	Undefined
reportduplicate.duplicatesource	The name of the organization providing the duplicate.	Undefined
reporttype	Code indicating the circumstances under which the report was generated.	1: Spontaneous, 2: Report from study, 3: Other, 4: Not available to sender (unknown)
safetyreportid	The 8-digit Safety Report ID number, also known as the case report number or case ID. The first 7 digits (before the hyphen) identify an individual report and the last digit (after the hyphen) is a checksum. This field can be used to identify or find a specific adverse event report.	Undefined
safetyreportversion	The version number of the `safetyreportid`. Multiple versions of the same report may exist; it is generally best to only count the latest report and disregard others. openFDA will only return the latest version of a report.	Undefined
sender.senderorganization	Name of the organization sending the report. Because FDA is providing these reports to you, the value is always `FDA-Public Use`.	Undefined
sender.sendertype	The type of organization sending the report. Because FDA is providing these reports to you, the value is always `2`.	2: Regulatory authority
serious	Seriousness of the adverse event.	1: The adverse event resulted in death, a life threatening condition, hospitalization, disability, congenital anomaly, or other serious condition, 2: The adverse event did not result in any of the above
seriousnesscongenitalanomali	This value is `1` if the adverse event resulted in a congenital anomaly, and absent otherwise.	Undefined
seriousnessdeath	This value is `1` if the adverse event resulted in death, and absent otherwise.	Undefined
seriousnessdisabling	This value is `1` if the adverse event resulted in disability, and absent otherwise.	Undefined
seriousnesshospitalization	This value is `1` if the adverse event resulted in a hospitalization, and absent otherwise.	Undefined
seriousnesslifethreatening	This value is `1` if the adverse event resulted in a life threatening condition, and absent otherwise.	Undefined
seriousnessother	This value is `1` if the adverse event resulted in some other serious condition, and absent otherwise.	Undefined
transmissiondate	Date that the record was created. This may be earlier than the date the record was received by the FDA.	Undefined
transmissiondateformat	Encoding format of the `transmissiondate` field. Always set to 102 (YYYYMMDD).	Undefined
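
The field names in the table use dotted paths, with items standing for the elements of a list (for example patient.drug.items.drugindication). The sketch below walks a nested report and lists its fields in that notation, which can help when looking up a field in the table. schema_paths is a hypothetical helper, and the toy report is a hand-written excerpt rather than a record loaded from the dataset.

```python
def schema_paths(value, prefix=""):
    """Yield dotted schema paths, writing `items` for list elements."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from schema_paths(child, f"{prefix}.{key}" if prefix else key)
    elif isinstance(value, list):
        for child in value:
            yield from schema_paths(child, f"{prefix}.items")
    else:
        # Leaf value: the accumulated prefix is the full field path.
        yield prefix

# Toy report with a few of the fields described above.
toy_report = {
    "serious": "1",
    "patient": {
        "patientsex": "1",
        "drug": [
            {"drugindication": "Anaesthesia",
             "activesubstance": {"activesubstancename": "MIDAZOLAM"}},
        ],
        "reaction": [{"reactionmeddrapt": "Kounis syndrome"}],
    },
}

paths = sorted(set(schema_paths(toy_report)))
for path in paths:
    print(path)
```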
