Semi/Self-Supervised Learning on a Pediatric Pneumonia Dataset

TeamSemiSuperCV/semi-super

About

Fully supervised approaches need large, densely annotated datasets, so only hospitals that can afford to collect and annotate such datasets can use them to aid their physicians. The goal of this project is to use self-supervised and semi-supervised learning to significantly reduce the need for fully labelled data. In this repo, you will find the project source code, the training notebooks, and the final TensorFlow 2 saved model used to develop the web application for detecting pediatric pneumonia from chest X-rays.

The semi/self-supervised learning framework used in the project comprises three stages:

  1. Self-supervised pretraining
  2. Supervised fine-tuning with active learning
  3. Knowledge distillation using unlabeled data

Refer to the Google Research team's paper (SimCLRv2 - Big Self-Supervised Models are Strong Semi-Supervised Learners) for more details on the framework.
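
To make Stage 3 more concrete, here is a minimal sketch of a distillation loss in the style described by the SimCLRv2 paper: the student is trained to match the teacher's temperature-scaled class probabilities on unlabeled images. This is an illustration in plain Python, not the code used in this repo; the function names, the two-class logits, and the temperature value are all assumptions made for the example.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=1.0):
    """Cross-entropy between the teacher's soft targets and the student's
    predictions for a single unlabeled example (no ground-truth label used)."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(p_teacher, p_student))

# Hypothetical logits for the classes [normal, pneumonia].
teacher = [0.2, 2.1]
agreeing_student = [0.1, 1.9]
disagreeing_student = [1.5, -0.3]

print(distillation_loss(teacher, agreeing_student))
print(distillation_loss(teacher, disagreeing_student))
```

The loss is lower for the student that agrees with the teacher, which is what pushes the student toward the teacher's predictions during distillation.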

The training notebooks for Stages 1, 2, and 3 can be found in the notebooks folder. Notebooks for Selective Labeling (active learning) using the Entropy or Augmentations policies can be found in the Active_Learn folder. We also evaluated another semi-supervised learning approach called FixMatch. Benchmarks for fully supervised learning can be found in the FSL_Benchmarks folder. The code for data preprocessing can be found in Data_Preparation.
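
As a rough illustration of what an entropy-based selection policy generally looks like, the sketch below ranks unlabeled images by the entropy of the model's predicted class probabilities and picks the most uncertain ones for labeling. It is an assumption about the general technique, not a copy of the notebooks in Active_Learn; the function names, image IDs, and probabilities are hypothetical.

```python
import math

def predictive_entropy(probs):
    """Shannon entropy of a class-probability vector (higher = more uncertain)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def select_for_labeling(predictions, budget):
    """Return the `budget` image IDs whose predictions have the highest entropy.

    predictions: dict mapping image ID -> list of class probabilities.
    """
    ranked = sorted(
        predictions,
        key=lambda image_id: predictive_entropy(predictions[image_id]),
        reverse=True,
    )
    return ranked[:budget]

# Hypothetical model outputs for the classes [normal, pneumonia].
pool = {
    "xray_001": [0.98, 0.02],
    "xray_002": [0.55, 0.45],
    "xray_003": [0.70, 0.30],
    "xray_004": [0.50, 0.50],
}

print(select_for_labeling(pool, budget=2))  # ['xray_004', 'xray_002']
```

Images the model is already confident about (such as `xray_001` here) are left unlabeled, so the labeling budget goes to the cases that are most likely to be informative.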

Results

Stage 1 - Contrastive Accuracy

| Labels | Stage 1 (Self-Supervised) |
|---|---|
| No labels used | 99.99% |

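Contrastive accuracy measures how often the model correctly matches two augmented views of the same image among the other images in a batch. It indicates how invariant the model's representations are to image augmentations.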
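
The following is a simplified sketch of how such a metric can be computed, assuming each image has two augmented views and that a prediction counts as correct when an anchor's most similar candidate is the other view of the same image. The SimCLR code base computes this over a larger set of candidates within a training batch; the function names and toy embeddings below are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def contrastive_accuracy(view_a, view_b):
    """Fraction of anchors in view_a whose nearest embedding in view_b
    is the other augmented view of the same image (same index)."""
    correct = 0
    for i, anchor in enumerate(view_a):
        sims = [cosine(anchor, candidate) for candidate in view_b]
        if sims.index(max(sims)) == i:
            correct += 1
    return correct / len(view_a)

# Toy embeddings: row i of each list comes from the same source image.
view_a = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
view_b = [[0.9, 0.1], [0.1, 0.9], [-0.8, 0.2]]

print(contrastive_accuracy(view_a, view_b))  # 1.0
```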

Stage 2 and 3 - Test Accuracy Comparison

| Labels | FSL (Benchmark) | Stage 2 (Finetuning) | Stage 3 (Distillation) |
|---|---|---|---|
| 1% | 85.2% | 94.5% | 96.3% |
| 2% | 85.1% | 96.8% | 97.6% |
| 5% | 86.0% | 97.1% | 98.1% |
| 100% | 98.9% | N/A | N/A |

Despite using only a small fraction of the labels, our Stage 2 and Stage 3 models achieved test accuracies comparable to that of a fully supervised (FSL) model trained on 100% of the labels. Refer to the Project Report and the Final Presentation for a more detailed discussion of the findings.

ML Workflow

Web App Demo

Installation

You can run the app locally if you have Docker installed. First, clone this repo:

git clone https://github.com/TeamSemiSuperCV/semi-super

Navigate to the webapp directory of the repo:

cd semi-super/webapp

Build the container image using the docker build command (this will take a few minutes):

docker build -t semi-super .

Start the container using the docker run command, specifying the name of the image we just created:

docker run -dp 8080:8080 semi-super

After a few seconds, open your web browser to http://localhost:8080. You should see the app.

Acknowledgements

We took the SimCLR framework code from Google Research and heavily modified it for the purposes of this project. We enhanced the knowledge distillation feature along with several other changes to make it perform better with our dataset. With these changes and improvements, knowledge distillation can be performed on the Google Cloud TPU infrastructure, which reduces training time significantly.

Other Resources