
MedGENIE

To Generate or to Retrieve?
On the Effectiveness of Artificial Contexts for Medical Open-Domain Question Answering

Official source code of MedGENIE, the first generate-then-read framework for multiple-choice question answering in medicine. The method generates relevant background information with domain-specific models before answering each question, outperforming traditional retrieval-based approaches. Tested on MedQA-USMLE, MedMCQA, and MMLU within a 24GB VRAM budget, MedGENIE sets new benchmarks, showing that generated contexts can substantially improve accuracy in medical question answering.

Figure: MedGENIE architecture.

For more information, refer to our paper: To Generate or to Retrieve? On the Effectiveness of Artificial Contexts for Medical Open-Domain Question Answering (arXiv:2403.01924).

📌 Table of Contents

  • Models
  • Datasets
  • Quickstart
  • Generate Context
  • Reader
  • Main accuracy results
  • RAGAS Evaluation
  • Citation

🖇 Models

| Model | Params | Role | Checkpoint |
| --- | --- | --- | --- |
| MedGENIE-fid-flan-t5-base-medqa | 250M | 👁️ FID-Reader | |
| MedGENIE-fid-flan-t5-base-medmcqa | 250M | 👁️ FID-Reader | |
| LLaMA-2-chat | 7B | 👁️ ICL-Reader | |
| Zephyr-β | 7B | 👁️ ICL-Reader | |
| PMC-LLaMA (AWQ) | 13B | 📝 Context Generator | |

🖇 Datasets

| Dataset | N. options | Original | MedGENIE format |
| --- | --- | --- | --- |
| MedQA | 4 | | |
| MedQA | 5 | | |
| MedMCQA | 4 | | |
| MMLU medical* | 4 | | |

* For the MMLU medical dataset, the chosen subjects are: high_school_biology, college_biology, college_medicine, professional_medicine, medical_genetics, virology, clinical_knowledge, nutrition, anatomy.
From: https://huggingface.co/datasets/lukaemon/mmlu


🚀 Quickstart

Begin by cloning the repository:

git clone https://github.com/disi-unibo-nlp/medgenie.git
cd medgenie

Next, set up a Docker container to install the necessary dependencies as follows:

docker build -t medgenie .

Run the container using docker run:

docker run -v /path_to/medgenie:/medgenie --rm --gpus device=$CUDA_VISIBLE_DEVICES -it medgenie bash

📝 Generate Context

A brief explanation of how to generate contexts with generate_contexts.py:

cd context_generation
python3 generate_contexts.py \
    --model_name disi-unibo-nlp/pmc-llama-13b-awq \
    --batch_size 8 \
    --temperature 0.9 \
    --frequency_penalty 1.95 \
    --top_p 1.0 \
    --max_tokens 512 \
    --use_beam_search False \
    --dataset_name medqa \
    --train_set \
    --test_set \
    --data_path_train train.jsonl \
    --data_path_test test.jsonl \
    --n 2 \
    --no_options

  • Model parameters configuration: --model_name, --batch_size, --temperature, --frequency_penalty, --top_p, --max_tokens, --use_beam_search
  • Dataset information: --dataset_name, --train_set, --test_set, --data_path_train, --data_path_test
  • Number of contexts per question: --n
  • --no_options: do NOT include the answer options in the question prompt (by default, the options are included)

To obtain multi-view artificial contexts, we can first generate a set of contexts conditioned on both the question and its options (option-focused), and then a set of contexts conditioned only on the question (option-free, with --no_options), as sketched below.
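
For instance, a minimal two-pass sketch reusing only the flags shown above (sampling parameters are omitted for brevity; output locations follow the script's defaults):

cd context_generation

# Pass 1: option-focused contexts (options are included by default)
python3 generate_contexts.py \
    --model_name disi-unibo-nlp/pmc-llama-13b-awq \
    --dataset_name medqa \
    --test_set \
    --data_path_test test.jsonl \
    --n 2

# Pass 2: option-free contexts
python3 generate_contexts.py \
    --model_name disi-unibo-nlp/pmc-llama-13b-awq \
    --dataset_name medqa \
    --test_set \
    --data_path_test test.jsonl \
    --n 2 \
    --no_options

The two resulting context files can then be merged into a single reader input with utils/preprocess.py, passing the option-focused file to --contexts_w_ops and the option-free file to --contexts_no_ops (see the Reader section below).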

👁 Reader

Each reader is equipped with custom background passages, allowing it to tackle medical questions effectively even without prior knowledge.

⚙ Input data format

After context generation, it is necessary to concatenate and convert all contexts into a single input file for the readers.
For the conversion, use preprocess.py as follows:

cd utils
python3 preprocess.py \
    --dataset_name medqa \
    --test_set \
    --data_path_test path_to_test_set \
    --contexts_w_ops path_to_generated_contexts_w_ops \
    --contexts_no_ops path_to_generated_contexts_no_ops \
    --n_context number_of_total_contexts

Entry example:

{
        "id": 0,
        "question": "A junior orthopaedic surgery... Which of the following is the correct next action for the resident to take?\nA. Disclose the error to the patient and put it in the operative report\nB. Tell the attending that he cannot fail to disclose this mistake\nC. Report the physician to the ethics committee\nD. Refuse to dictate the operative report",
        "target": "B",
        "answers": [
            "B"
        ],
        "ctxs": [
            {
                "text": "Inadvertent Cutting of Tendon is a complication, ..."
            },
            {
                "text": "A resident is obligated to be..."
            },
            {
                "text": "This is an example of error in the operative note, ..."
            },
            {
                "text": "Residentserves as the interface between..."
            },
            {
                "text": "As a matter of ethical practice, ..."
            }
        ]
    }

1. Fusion-In-Decoder (FID)

For the supervised regime, we train a lightweight FID reader (Izacard and Grave, 2021).

Train

The first step in utilizing FID as a reader is to train the model:

cd fid_reader
python3 train.py \
    --dataset_name "medqa" \
    --n_options 5 \
    --model_size base \
    --per_gpu_batch_size 1 \
    --accumulation_steps 4 \
    --total_steps number_of_total_steps \
    --name my_test \
    --n_context 5 \
    --text_maxlength 1024

  • Contexts information: --n_context, --text_maxlength

Test

Then, it is possible to evaluate the trained model:

cd fid_reader
python3 test.py \
    --model_path checkpoint/my_test/checkpoint/best_dev \
    --dataset_name "medqa" \
    --n_options 4 \
    --per_gpu_batch_size 1 \
    --n_context 5

2. In-Context-Learning (ICL)

This strategy consists of feeding an LLM reader with few-shot open-domain question answering demonstrations and the test query preceded by its artificial context.

cd icl_reader
python3 benchmark.py \
    --model_name HuggingFaceH4/zephyr-7b-beta \
    --dataset_name medqa \
    --test_set \
    --n_options 4 \
    --batch_size 8 \
    --max_context_window 4000 \
    --no_contexts \
    --templates_dir path_to_templates

  • --no_contexts: run without the generated contexts (by default, contexts are used).
  • --templates_dir: the human_crafted templates are used by default; pass this flag to use different templates.

Main accuracy results

| Model | Ground (Source) | MedQA | MedMCQA | MMLU | AVG (↓) |
| --- | --- | --- | --- | --- | --- |
| LLaMA-3-Instruct (8B) [1-shot] | Ø | 60.6 | 55.7 | 69.8 | 62.0 |
| Phi-3-mini-128k-instruct (3.8B) [1-shot] | Ø | 55.1 | 53.5 | 70.3 | 59.6 |
| MEDITRON (7B) | Ø | 52.0 | 59.2 | 55.6 | 55.6 |
| PMC-LLaMA (7B) | Ø | 49.2 | 51.4 | 59.7 | 53.4 |
| LLaMA-2 (7B) | Ø | 49.6 | 54.4 | 56.3 | 53.4 |
| Zephyr-β* (7B) | Ø | 49.3 | 43.4 | 60.7 | 51.1 |
| Mistral-Instruct* (7B) | Ø | 41.1 | 40.2 | 55.8 | 45.7 |
| LLaMA-2-chat* (7B) | Ø | 36.9 | 35.0 | 49.3 | 40.4 |
| Codex* (175B) | Ø | 52.5 | 50.9 | - | - |
| --- | --- | --- | --- | --- | --- |
| MedGENIE-Phi-3-mini-128k-instruct (3.8B) [1-shot] | G (PMC-LLaMA) | 64.7 | 54.1 | 70.8 | 63.2 |
| MedGENIE-LLaMA-3-Instruct (8B) [1-shot] | G (PMC-LLaMA) | 63.1 | 56.2 | 68.9 | 62.7 |
| MedGENIE-Zephyr-β* (7B) | G (PMC-LLaMA) | 59.7 | 51.0 | 66.1 | 58.9 |
| MedGENIE-FID-Flan-T5 (250M) | G (PMC-LLaMA) | 53.1 | 52.1 | 59.9 | 55.0 |
| Zephyr-β* (7B) | R (MedWiki) | 50.5 | 47.0 | 66.9 | 54.8 |
| VOD (220M) | R (MedWiki) | 45.8 | 58.3 | 56.8 | 53.6 |
| MedGENIE-LLaMA-2-chat* (7B) | G (PMC-LLaMA) | 52.6 | 44.8 | 58.8 | 52.1 |
| Mistral-Instruct* (7B) | R (MedWiki) | 45.1 | 44.3 | 58.5 | 49.3 |
| LLaMA-2-chat* (7B) | R (MedWiki) | 37.2 | 37.2 | 52.0 | 42.1 |

*zero/few-shot inference

RAGAS Evaluation

Metrics

  • Context recall: to calculate context recall, each sentence in the ground truth (GT) answer is examined to determine whether it can be attributed to the retrieved/generated context. Ideally, all sentences in the ground truth answer should be identifiable within the retrieved/generated context for optimal context recall. Values range between 0 and 1, with higher values indicating better performance. Check out the Ragas Documentation for more on this metric.
$$Context\ Recall = \frac{|\text{GT sentences that can be attributed to context}|}{|\text{Number of sentences in GT}|}$$
  • Context precision measures whether all the ground-truth (GT) relevant items present in the contexts are ranked high. Ideally, all relevant chunks should appear at the top ranks. This metric is computed from the question, the ground truth, and the contexts, with values ranging between 0 and 1, where higher scores indicate better precision (a worked numerical example is given after this list). Check out the Ragas Documentation for more on this metric.
$$\text{Context Precision@K} = \frac{\sum_{k=1}^{K} \left(\text{Precision@}k \times v_k\right)}{\text{Total number of relevant items in the top } K \text{ results}}$$ $$\text{Precision@}k = \frac{\text{true positives@}k}{\text{true positives@}k + \text{false positives@}k}$$ where $v_k \in \{0, 1\}$ is the relevance indicator at rank $k$.
  • Faithfulness evaluates the factual consistency of the generated answer with the provided context. It calculates a score based on how well the claims in the answer align with the given context. To determine faithfulness, the claims in the answer are compared against the context to see if they can be inferred accurately. The score is scaled to a range of (0,1), where higher values indicate better faithfulness. Check out the Ragas Documentation for more on this metric.
$$Faithfulness = \frac{|\text{Number of claims in the generated answer that can be inferred from given context}|}{|\text{Total number of claims in the generated answer}|}$$
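
As a purely illustrative example of Context Precision@K (the numbers are invented for clarity and are not results from the paper): suppose K = 3 context chunks are scored and chunks 1 and 3 are judged relevant, i.e. $v_1 = 1$, $v_2 = 0$, $v_3 = 1$. Then

$$\text{Precision@}1 = \frac{1}{1} = 1.00, \qquad \text{Precision@}3 = \frac{2}{3} \approx 0.67$$

$$\text{Context Precision@}3 = \frac{(1.00 \times 1) + (\text{Precision@}2 \times 0) + (0.67 \times 1)}{2} \approx 0.83$$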

Results

The table displays scores from 150 random questions in MedQA, where both retrieved (R) and generated (G) contexts led the LLM to answer correctly (positive). Higher scores reflect better context quality and correctness.

Additionally, context recall and precision are evaluated using 50 random questions where both retrieved (R) and generated (G) contexts led the LLM to answer incorrectly (negative). Higher scores in these cases might suggest correctness in the generated or retrieved contexts but also highlight the LLM's difficulty in finding relevant information within the given context.

| Metric | Dataset | N. Samples | Answer | G | R |
| --- | --- | --- | --- | --- | --- |
| Context Precision | MedQA | 150 | positive | 87.9 (check evals) | 48.6 (check evals) |
| Context Recall | MedQA | 150 | positive | 93.4 (check evals) | 76.2 (check evals) |
| Faithfulness | MedQA | 150 | positive | 59.7 (check evals) | 23.8 (check evals) |
| --- | --- | --- | --- | --- | --- |
| Context Precision | MedQA | 50 | negative | 55.3 (check evals) | 29.5 (check evals) |
| Context Recall | MedQA | 50 | negative | 59.2 (check evals) | 32.0 (check evals) |

All scores were computed using gpt-4-turbo-2024-04-09 as the evaluator.

Estimated cost: ~$55

📚 Citation

If you find this research useful, or if you utilize the code and models presented, please cite:

@misc{frisoni2024generate,
      title={To Generate or to Retrieve? On the Effectiveness of Artificial Contexts for Medical Open-Domain Question Answering}, 
      author={Giacomo Frisoni and Alessio Cocchieri and Alex Presepi and Gianluca Moro and Zaiqiao Meng},
      year={2024},
      eprint={2403.01924},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
