MedCLIP

MedCLIP is a medical image captioning deep learning model based on the OpenAI CLIP architecture.

Usage

Run main.ipynb on a Colab instance. Pretrained weights are provided, so you don't need to retrain the model.

Introduction

At its core, CLIP is an elegant way of mapping images and text into a shared embedding space.

Through encodings and transformations, CLIP learns relationships between natural language and images. The underlying model can either caption an image by selecting from a set of known captions, or retrieve an image that matches a given caption. With appropriate encoders, the CLIP model can be tuned for domain-specific applications. Our hope with MedCLIP is to help radiologists formulate diagnoses.

Explanation

CLIP works by encoding an image and a related caption into tensors. The model then optimises the last layer of each (transfer-learned) encoder so that the image and text encodings of matching pairs are as similar as possible. (1. Contrastive Pretraining)
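
A minimal sketch of this dual-encoder setup, assuming PyTorch; the module and parameter names are illustrative, not the exact MedCLIP code:

```python
import torch
import torch.nn as nn

class DualEncoder(nn.Module):
    """Illustrative CLIP-style dual encoder: only the projection heads are trained."""
    def __init__(self, image_backbone, text_backbone, img_dim, txt_dim, embed_dim=512):
        super().__init__()
        self.image_backbone = image_backbone   # e.g. a pretrained CNN, kept frozen
        self.text_backbone = text_backbone     # e.g. a pretrained language model, kept frozen
        self.image_proj = nn.Linear(img_dim, embed_dim)   # trainable "last layer"
        self.text_proj = nn.Linear(txt_dim, embed_dim)

    def forward(self, images, texts):
        with torch.no_grad():                  # transfer learning: backbones stay fixed
            img_feat = self.image_backbone(images)
            txt_feat = self.text_backbone(texts)
        # Project both modalities into one shared, L2-normalised embedding space.
        img_emb = nn.functional.normalize(self.image_proj(img_feat), dim=-1)
        txt_emb = nn.functional.normalize(self.text_proj(txt_feat), dim=-1)
        return img_emb, txt_emb
```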


After the model is successfully trained, we can query it with new information (2. Zero Shot); a retrieval sketch in code follows the steps below.

  1. Take an input.

  2. Encode it with the custom-trained encoders.

  3. Find a match (image or text) from the known dataset:

    1. Go through each entry of the dataset.

    2. Check its similarity with the current input.

    3. Output the pairs with the highest similarity.

  4. [Optionally] Measure the similarity between the real caption and the guessed one.
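
A sketch of step 3, assuming the dataset's captions have already been encoded into a tensor; the function and variable names are illustrative:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def retrieve_best_captions(image_emb, caption_embs, captions, top_k=3):
    """Match one encoded query image against every caption in the known dataset."""
    # Cosine similarity between the query (1, dim) and each candidate caption (N, dim).
    sims = F.normalize(image_emb, dim=-1) @ F.normalize(caption_embs, dim=-1).T  # (1, N)
    scores, indices = sims.squeeze(0).topk(top_k)
    # Return the captions whose encodings are most similar to the image.
    return [(captions[i], scores[j].item()) for j, i in enumerate(indices.tolist())]
```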

Loss

In a perfect model, an image and its caption have identical encoded representations. This similarity can be measured by taking the softmax over the dot products of the encoded inputs; a perfect encoding yields the identity matrix.

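A sketch of such a loss, assuming normalised embeddings and a PyTorch implementation; the names and the temperature value are illustrative, not the repo's exact code:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """CLIP-style symmetric loss: the batch similarity matrix should approach the identity."""
    logits = img_emb @ txt_emb.T / temperature                     # (batch, batch) dot products
    targets = torch.arange(logits.size(0), device=logits.device)   # diagonal = matching pairs
    loss_images = F.cross_entropy(logits, targets)    # softmax over captions for each image
    loss_texts = F.cross_entropy(logits.T, targets)   # softmax over images for each caption
    return (loss_images + loss_texts) / 2
```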

Tools Used

The model was trained on a curated MedPix dataset that focuses on Magnetic Resonance, Computed Tomography and X-Ray scans. ClinicalBERT was used to encode the text and ResNet50 to encode the images.
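
A sketch of how these two encoders can be instantiated; the Hugging Face checkpoint name and the feature handling here are assumptions, not necessarily what the notebook does:

```python
import torch
import torchvision.models as models
from transformers import AutoTokenizer, AutoModel

# Image encoder: ResNet50 pretrained on ImageNet, classification head removed.
resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
image_encoder = torch.nn.Sequential(*list(resnet.children())[:-1],
                                    torch.nn.Flatten())          # (N, 2048) pooled features

# Text encoder: a ClinicalBERT checkpoint (assumed name) producing 768-dim [CLS] features.
tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
text_encoder = AutoModel.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")

caption = "MRI of the brain showing a left frontal lobe lesion."
tokens = tokenizer(caption, return_tensors="pt", truncation=True)
text_features = text_encoder(**tokens).last_hidden_state[:, 0]   # (1, 768) [CLS] embedding
```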

Similarity between captions was measured using ROUGE, BLEU, METEOR and CIDEr.
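
As an example, ROUGE between a predicted and a reference caption can be computed with the rouge-score package (one possible tooling choice; the example captions are made up):

```python
from rouge_score import rouge_scorer

reference = "axial ct of the chest showing a right upper lobe nodule"
prediction = "ct scan of the chest with a nodule in the right upper lobe"

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, prediction)
print(scores["rougeL"].fmeasure)  # F-measure of the longest-common-subsequence overlap
```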

Future Work

  • Add new datasets; the more data the model has, the better the captioning performance (a larger space from which to choose a caption/image).

Some relevant datasets:

  • IU Chest X-Ray

  • ChestX-Ray 14

  • PEIR gross

  • BCIDR

  • CheXpert

  • MIMIC-CXR

  • PadChest

  • ICLEF caption

  • Generate new captions instead of just looking them up. This would vastly improve accuracy.

Members and Acknowledgements


We implemented a medical image captioning deep learning model using the CLIP architecture, ResNet50 and ClinicalBERT, and obtained a 61% ROUGE similarity score on the MedPix dataset.