Self-supervised context-aware Covid-19 document exploration through atlas grounding

This repository is the official implementation of Self-supervised context-aware Covid-19 document exploration through atlas grounding, authored by Dusan Grujicic^*, Gorjan Radevski^*, Tinne Tuytelaars and Matthew Blaschko, and published at the NLP COVID-19 Workshop at ACL 2020.

See our Cord-19 Explorer and our Cord-19 Visualizer tools.

^* Equal contribution

Requirements

If you are using Poetry, navigate to the project root directory and run poetry install. Otherwise, install all dependencies from the provided requirements.txt file by running pip install -r requirements.txt. If you only want to download the trained models or dataset splits, you just need gdown. It is already present once the project dependencies are installed; otherwise, run pip install gdown to install it.

Fetching the data

The data used in this research consists of the training, validation and test splits, together with a 3D human model.

Downloading the dataset splits

The training, validation and test splits obtained from the original dataset can be downloaded with gdown using the code snippet below.

gdown "https://drive.google.com/uc?id=1kLvbRVzyR-66lrfzLfeFd3k9-l_S_Cl4" -O data/cord_dataset_train.json
gdown "https://drive.google.com/open?id=1mnlcI5HwgY9RaCqPyWmpEeftnqIxAUQQ" -O data/cord_dataset_val.json
gdown "https://drive.google.com/uc?id=18VSbspzB2VjxDdLaVSNyFB-GZAvEopGE" -O data/cord_dataset_test.json

Downloading the 3D human model

Instructions for obtaining the human atlas can be found on the Voxel-Man website. The obtained model contains images of the male head (head.zip) and torso (innerorgans.zip). The unzipped innerorgans/ directory contains a text file with organs and their segmentation labels, and three directories: CT/, labels/ and rgb/.

The innerorgans/labels/ directory contains slices of the human atlas in the form of .tiff images, where the grayscale level represents the segmentation label of each organ. It is used for training and evaluating the model, and should be moved to the data/ directory of the project before running the scripts.
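To illustrate how a stack of label slices of this kind can be turned into per-organ voxel lists, here is a minimal sketch. It is not the repository's generate_voxel_dict.py; the function name `voxels_by_label`, the `background` value of 0 and the (x, y, z) axis order are assumptions made for the example, and decoding the .tiff files into integer gray levels is left to an image library of your choice.

```python
from collections import defaultdict


def voxels_by_label(volume, background=0):
    """Group (x, y, z) voxel coordinates by segmentation label.

    volume: a list of slices, each slice a list of rows, each row a list
    of integer gray levels (one gray level per segmentation label).
    background: gray level treated as "no organ" (an assumption).
    """
    result = defaultdict(list)
    for z, slice_ in enumerate(volume):
        for y, row in enumerate(slice_):
            for x, label in enumerate(row):
                if label != background:
                    result[label].append((x, y, z))
    return dict(result)


if __name__ == "__main__":
    # Two tiny 2x2 slices with hypothetical labels 1 and 2.
    toy_volume = [
        [[0, 1], [2, 1]],
        [[0, 0], [2, 0]],
    ]
    for label, voxels in voxels_by_label(toy_volume).items():
        print(label, voxels)
```

A pure-Python loop like this is only practical for toy inputs; on the full atlas an array library would be the more realistic choice.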

Generating required json files

The four required json files, organ2ind.json, ind2organ.json, organ2label.json and organ2alias.json, which contain the dictionaries related to the organs in the human atlas, can be downloaded and extracted by running:

gdown "https://drive.google.com/uc?id=18qxmrOovy1_Cd4ceUNLPKTQUHf3RRs1r" -O data/data_organs_cord.zip
unzip -qq data/data_organs_cord.zip
rm data/data_organs_cord.zip

Details of the steps (removals, mergers of organ segmentation labels and renamings) that resulted in such json files can be found here. An additional three json files need to be generated after obtaining the human atlas and moving the labels/ directory with images to the data/ directory of the project. This can be done by running the following script:

python src/generate_voxel_dict.py --organs_dir_path "data/data_organs_cord"\
                                  --voxelman_images_path "data/labels"

This script should generate three additional json files, organ2voxels.json, organ2voxels_eroded.json and organ2summary.json, and place them in the data/data_organs_cord/ directory.
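Before training, it can be handy to confirm that all seven json files mentioned in this section are in place. The following is a small sketch of such a check; the helper name `missing_organ_files` is hypothetical and not part of the repository, and only the file names and the default directory are taken from the text above.

```python
from pathlib import Path

# File names as listed in this README: four downloaded, three generated.
EXPECTED_FILES = [
    "organ2ind.json",
    "ind2organ.json",
    "organ2label.json",
    "organ2alias.json",
    "organ2voxels.json",
    "organ2voxels_eroded.json",
    "organ2summary.json",
]


def missing_organ_files(organs_dir="data/data_organs_cord"):
    """Return the expected json files that are not present in organs_dir."""
    root = Path(organs_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_organ_files()
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("All organ json files are present.")
```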

Training

To train a new model on the training data split, from the root project directory run:

python src/train_mapping_reg.py --batch_size 128\
                                --save_model_path "models/cord_basebert_grounding.pt"\
                                --save_intermediate_model_path "models/intermediate_cord_basebert_grounding.pt"\
                                --train_json_path "data/cord_dataset_train.json"\
                                --val_json_path "data/cord_dataset_val.json"\
                                --epochs 20\
                                --bert_name "bert-base-uncased"\
                                --loss_type "all_voxels"\
                                --organs_dir_path "data/data_organs_cord"\
                                --learning_rate 2e-5

The script will train a model for 20 epochs, and will save the model that reports the lowest distance to the nearest voxel on the validation set at models/cord_basebert_grounding.pt. Furthermore, keeping the other arguments as they are while changing --bert_name to bert-base-uncased, emilyalsentzer/Bio_ClinicalBERTpytorch, allenai/scibert_scivocab_uncased or emilyalsentzer/Bio_ClinicalBERT will reproduce the BertBase, BioBert, SciBert and ClinicalBert models from the paper, respectively. To train the model we use for the Cord-19 Explorer tool, change --bert_name to google/bert_uncased_L-4_H-512_A-8, --learning_rate to 5e-5 and --epochs to 50.
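When switching between training configurations, the command line can also be assembled programmatically. The sketch below builds the argument list for the Cord-19 Explorer configuration described above; the helper `build_train_command` and the `BASE_ARGS` dictionary are hypothetical conveniences for this example, while the flag names and values are copied from this README.

```python
import shlex

# Default arguments, copied from the training command in this README.
BASE_ARGS = {
    "batch_size": "128",
    "save_model_path": "models/cord_basebert_grounding.pt",
    "save_intermediate_model_path": "models/intermediate_cord_basebert_grounding.pt",
    "train_json_path": "data/cord_dataset_train.json",
    "val_json_path": "data/cord_dataset_val.json",
    "epochs": "20",
    "bert_name": "bert-base-uncased",
    "loss_type": "all_voxels",
    "organs_dir_path": "data/data_organs_cord",
    "learning_rate": "2e-5",
}


def build_train_command(**overrides):
    """Return the training command as a list of arguments.

    Only flags already present in BASE_ARGS may be overridden, so a
    mistyped flag name fails early instead of reaching the script.
    """
    unknown = set(overrides) - set(BASE_ARGS)
    if unknown:
        raise KeyError(f"Unknown arguments: {sorted(unknown)}")
    args = {**BASE_ARGS, **{k: str(v) for k, v in overrides.items()}}
    command = ["python", "src/train_mapping_reg.py"]
    for name, value in args.items():
        command += [f"--{name}", value]
    return command


# Configuration described above for the Cord-19 Explorer model.
explorer_cmd = build_train_command(
    bert_name="google/bert_uncased_L-4_H-512_A-8",
    learning_rate="5e-5",
    epochs=50,
)
print(shlex.join(explorer_cmd))
```

The resulting list can be passed to subprocess.run from the project root. Note that the save paths are left at their defaults here, so overriding save_model_path and save_intermediate_model_path in the same way avoids overwriting an earlier checkpoint.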

Evaluation

To perform inference on the test data split, from the root project directory run:

python src/inference_mapping_reg.py --batch_size 128\
                                    --checkpoint_path "models/cord_basebert_grounding.pt"\
                                    --test_json_path "data/cord_dataset_test.json"\
                                    --bert_name "bert-base-uncased"\
                                    --organs_dir_path "data/data_organs_cord"

The script will perform inference with the trained model saved at models/cord_basebert_grounding.pt, and report:

Distance to the nearest voxel of the nearest correct organ (NVD).
Distance to the nearest correct organ voxel calculated only on the samples for which the projection is outside the organ volume (NVD-O).
Rate at which the sentences are grounded within the volume of the correct organ, which we denote as Inside Organ Ratio (IOR).

Both NVD and NVD-O are reported in centimeters.
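To make the three metrics concrete, here is a minimal sketch of how they could be computed from predicted 3D points and organ voxel lists. It is an illustration under stated assumptions rather than the code in src/inference_mapping_reg.py: the function names, the `voxel_size_cm` parameter, the rule that a point counts as inside an organ when its rounded coordinates coincide with one of the organ's voxels, and the choice to count inside points as distance 0 in NVD are all assumptions.

```python
import math


def is_inside(point, voxels):
    """Assumption: a point is inside if its rounded coordinates hit a voxel."""
    return tuple(round(c) for c in point) in {tuple(v) for v in voxels}


def nearest_voxel_distance(point, voxels):
    """Euclidean distance from the point to the closest voxel."""
    return min(math.dist(point, v) for v in voxels)


def grounding_metrics(predictions, targets, organ2voxels, voxel_size_cm=1.0):
    """Compute NVD, NVD-O and IOR over a set of samples.

    predictions: list of predicted (x, y, z) points, one per sentence.
    targets: list of lists with the correct organ names for each sentence.
    organ2voxels: dict mapping an organ name to its list of voxels.
    voxel_size_cm: hypothetical scale converting voxel units to centimeters.
    """
    distances, outside_distances, inside_hits = [], [], 0
    for point, organs in zip(predictions, targets):
        # Pooling the voxels of all correct organs yields the nearest
        # voxel of the nearest correct organ.
        voxels = [v for organ in organs for v in organ2voxels[organ]]
        if is_inside(point, voxels):
            inside_hits += 1
            distances.append(0.0)
        else:
            d = nearest_voxel_distance(point, voxels) * voxel_size_cm
            distances.append(d)
            outside_distances.append(d)
    n = len(predictions)
    return {
        "NVD": sum(distances) / n,
        "NVD-O": (sum(outside_distances) / len(outside_distances)
                  if outside_distances else 0.0),
        "IOR": inside_hits / n,
    }


if __name__ == "__main__":
    # Toy data with hypothetical organs and coordinates.
    toy_organs = {"lung": [(0, 0, 0), (1, 0, 0)], "heart": [(5, 5, 5)]}
    toy_predictions = [(0.2, 0.1, 0.0), (5, 5, 8)]
    toy_targets = [["lung"], ["heart"]]
    print(grounding_metrics(toy_predictions, toy_targets, toy_organs))
```

In this toy example the first point lands inside the lung and the second is three voxels away from the heart, so IOR is 0.5, NVD-O is 3.0 and NVD is 1.5 in voxel units.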

Pre-trained models

All models used to report the results in the paper can be downloaded with gdown using the code snippet below.

gdown "https://drive.google.com/uc?id=17_2g3kWndZI64WpGSR4EZEIK2qBzLrtI" -O models/cord_basebert_grounding.pt
gdown "https://drive.google.com/uc?id=17nUZ0Iym6q7U83kO9QowdmCzvQlp7Cce" -O models/cord_biobert_grounding.pt
gdown "https://drive.google.com/uc?id=1_WxTKu7qJ0sF5oLqniYnTMUVIFcJ1pPJ" -O models/cord_scibert_grounding.pt
gdown "https://drive.google.com/uc?id=144TyLhPmPnZNH88hP4WHLzAC4So7OvFU" -O models/cord_clinicalbert_grounding.pt
gdown "https://drive.google.com/uc?id=11OHi9wETRPAHUTIH4p6BqZY3gH6NJtve" -O models/cord_smallbert_grounding.pt

Reference

If you found this code useful, or use some of our resources for your work, we would appreciate it if you cite our paper.

@inproceedings{grujicic-radevski-covid-20,
    title={ Self-supervised context-aware Covid-19 document exploration through atlas grounding },
    author={Dusan Grujicic and Gorjan Radevski and Tinne Tuytelaars and Matthew Blaschko},
    year={2020},
    booktitle={Proceedings of the 1st Workshop on {NLP} for {COVID-19} at {ACL 2020}},
    month = jul,
    volume = 1,
    address = {Online},
    publisher = {Association for Computational Linguistics},
    abstract = {In this paper, we aim to develop a self-supervised grounding of Covid-related medical text based on the actual spatial relationships between the referred anatomical concepts. More specifically, we learn to project sentences into a physical space defined by a three-dimensional anatomical atlas, allowing for a visual approach to navigating Covid-related literature. We design a straightforward and empirically effective training objective to reduce the curated data dependency issue. We use BERT as the main building block of our model and perform a quantitative analysis that demonstrates that the model learns a context-aware mapping. We illustrate two potential use-cases for our approach, one in interactive, 3D data exploration, and the other in document retrieval. To accelerate research in this direction, we make public all trained models, codebase and the developed tools, which can be accessed at https://github.com/gorjanradevski/macchina/.},
}

License

Everything is licensed under the MIT License.
